U.S. patent application number 12/789409 was filed with the patent office on 2010-12-16 for methods and systems to identify operational reaction pathways.
This patent application is currently assigned to THE REGENTS OF THE UNIVERSITY OF CALIFORNIA. Invention is credited to Markus W. Covert, Markus Herrgard, Bernard O. Palsson.
Application Number | 20100317007 12/789409 |
Document ID | / |
Family ID | 32108009 |
Filed Date | 2010-12-16 |
United States Patent
Application |
20100317007 |
Kind Code |
A1 |
Palsson; Bernard O. ; et
al. |
December 16, 2010 |
Methods and Systems to Identify Operational Reaction Pathways
Abstract
The present invention provides a method for identifying an
operational reaction pathway of a biosystem. The method includes
(a) providing a set of systemic reaction pathways through a
reaction network representing said biosystem; (b) providing a set
of phenomenological reaction pathways of said biosystem, and (c)
comparing said set of systemic reaction pathways with said set of
phenomenological reaction pathways, wherein a pathway common to
said sets is an operational reaction pathway of said biosystem.
Also described is a method of refining a biosystem reaction
network; a method of reconciling biosystem data sets; a method of
determining the effect of a genetic polymorphism on whole cell
function; and a method of diagnosing a genetic
polymorphism-mediated pathology.
Inventors: |
Palsson; Bernard O.; (La
Jolla, CA) ; Covert; Markus W.; (San Diego, CA)
; Herrgard; Markus; (La Jolla, CA) |
Correspondence
Address: |
DLA PIPER LLP (US)
4365 EXECUTIVE DRIVE, SUITE 1100
SAN DIEGO
CA
92121-2133
US
|
Assignee: |
THE REGENTS OF THE UNIVERSITY OF
CALIFORNIA
|
Family ID: |
32108009 |
Appl. No.: |
12/789409 |
Filed: |
May 27, 2010 |
Related U.S. Patent Documents
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
|
|
10367248 |
Feb 14, 2003 |
7734420 |
|
|
12789409 |
|
|
|
|
60419023 |
Oct 15, 2002 |
|
|
|
Current U.S.
Class: |
435/6.1 |
Current CPC
Class: |
G16B 5/00 20190201; G16B
25/00 20190201; G01N 33/5005 20130101; G16B 40/00 20190201 |
Class at
Publication: |
435/6 |
International
Class: |
C12Q 1/68 20060101
C12Q001/68 |
Claims
1. A method of diagnosing a genetic polymorphism-mediated
pathology, comprising: (a) applying a biochemical or physiological
condition stressing a physiological state of a reaction network
representing a biosystem with a genetic polymorphism-mediated
pathology, said applied biochemical or physiological condition
correlating with said genetic polymorphism-mediated pathology; and
(b) measuring one or more biochemical or physiological indicators
of said pathology within said reaction network, wherein a change in
said one or more biochemical or physiological indicators in said
stressed state compared to an unstressed physiological state
indicates the presence of a genetic polymorphism corresponding to
said pathology.
2. The method of claim 1, wherein said biochemical or physiological
condition is selected from a change in flux load, pH, reactants,
and products.
3. The method of claim 1, wherein said biochemical or physiological
condition comprises an oxidative load or an energy load.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a divisional application of U.S.
application Ser. No. 10/367,248 filed Feb. 14, 2003, now issued as
U.S. Pat. No. 7,734,420; which claims the benefit under 35 USC
.sctn.119(e) to U.S. application Ser. No. 60/419,023 filed Oct. 15,
2002, now expired. The disclosure of each of the prior applications
is considered part of and is incorporated by reference in the
disclosure of this application.
BACKGROUND OF THE INVENTION
[0002] 1. Field of the Invention
[0003] This invention relates generally to the construction of in
silico model organisms and, more specifically, methods and systems
specifying operational reaction pathways and for the generation of
optimal in silico models of actual organisms.
[0004] 1. Background Information
[0005] Therapeutic agents, including drugs and gene-based agents,
are being rapidly developed by the pharmaceutical industry with the
goal of preventing or treating human disease. Dietary supplements,
including herbal products, vitamins and amino acids, are also being
developed and marketed by the nutraceutical industry. Additionally,
efforts for faster and more effective methods for biological
fermentation and other bioprocessing of food stuffs and industrial
compounds has been under development. Faster and more efficient
production of crops and other agricultural products is also yet
another area of intense development in the food industry.
[0006] Because of the complexity of biochemical reaction networks
in and between cells of an organism, even relatively minor
perturbations caused by a therapeutic agent, change in a dietary
component or environmental or growth conditions, can affect
hundreds of biochemical reactions. Such changes or perturbations
can lead to both desirable and undesirable effects in any
therapeutic, industrial or agricultural process involving living
cells. It would therefore be beneficial if a particular process
could predict the effects on a living system such as a cell or
organism of such perturbations.
[0007] However, current approaches to therapeutic, industrial and
agricultural development for compounds and processes used therein
do not take into account the effect of perturbations on cellular
behavior at the level of accuracy needed for efficient and
economical production of products. In order to design effective
methods of manipulating cellular activities for the optimization of
such processes or to achieve the optimal intended effect of an
applied a compound, it would be helpful to understand cellular
behavior from an integrated perspective.
[0008] However, cellular behaviors involve the simultaneous
function and integration of many interrelated genes, gene products
and chemical reactions. Because of this interconnectivity, it is
difficult to predict a priori the effect of a change in a single
gene or gene product, or the effect of a drug or an environmental
factor, on cellular behavior. The ability to accurately predict
cellular behavior under different conditions would be extremely
valuable in many areas of medicine and industry. For example, if it
were possible to predict which gene products are suitable drug
targets, it would considerably shorten the time it takes to develop
an effective antibiotic or anti-tumor agent. Likewise, if it were
possible to predict the optimal fermentation conditions and genetic
make-up of a microorganism for production of a particular
industrially important product, it would allow for rapid and
cost-effective improvements in the performance of these
microorganisms.
[0009] Thus, there exists a need for models and modeling methods
that can be used to accurately simulate and effectively analyze the
behavior of cells and organisms under a variety of conditions. The
present invention satisfies this need and provides related
advantages as well.
SUMMARY OF THE INVENTION
[0010] The invention provides a method of identifying an
operational reaction pathway of a biosystem. The method consists
of: (a) providing a set of systemic reaction pathways through a
reaction network representing said biosystem; (b) providing a set
of phenomenological reaction pathways of said biosystem, and (c)
comparing said set of systemic reaction pathways with said set of
phenomenological reaction pathways, wherein a pathway common to
said sets is an perational reaction pathway of said biosystem.
[0011] Also provided is a method of refining a biosystem reaction
network. The method consists of: (a) providing a mathematical
representation of a biosystem; (b) determining differences between
observed behavior of a biosystem and in silico behavior of said
mathematical representation of said biosystem under similar
conditions; (c) modifying a structure of said mathematical
representation of said biosystem; (d) determining differences
between said observed behavior of said biosystem and in silico
behavior of said modified mathematical representation of said
biosystem under similar conditions, and (e) repeating steps (d) and
(e) until behavioral differences are minimized, wherein
satisfaction of a predetermined accuracy criteria indicates an
improvement in said biosystem reaction network.
[0012] Further provided is a method of reconciling biosystem data
sets. The method consists of: (a) providing a first reaction
network reconstructed from legacy data comprising a plurality of
hierarchical reaction categories; (b) providing a second reaction
network obtained from empirical data, and (c) determining a
consistency measure between said hierarchical reaction categories
in said first reaction network and elements in said second reaction
network, wherein a high degree of said consistency measure for said
hierarchical reaction categories indicates the validity of said
first reaction network or a subcomponent thereof.
[0013] A method of determining the effect of a genetic polymorphism
on whole cell function is also provided. The method consists of:
(a) generating a reaction network representing a biosystem with a
genetic polymorphism-mediated pathology; (b) applying a biochemical
or physiological condition stressing a physiological state of said
reaction network, and (c) determining a sensitivity to said applied
biochemical or physiological condition in said stressed
physiological state compared to a reaction network representing a
normal biosystem, wherein said sensitivity is indicative of a
phenotypic consequence of said genetic polymorphism-mediated
pathology.
[0014] The invention additionally provides a method of diagnosing a
genetic polymorphism-mediated pathology. The method consists of:
(a) applying a biochemical or physiological condition stressing a
physiological state of a reaction network representing a biosystem
with a genetic polymorphism-mediated pathology, said applied
biochemical or physiological condition correlating with said
genetic polymorphism-mediated pathology, and (b) measuring one or
more biochemical or physiological indicators of said pathology
within said reaction network, wherein a change in said one or more
biochemical or physiological indicators in said stressed state
compared to an unstressed physiological state indicates the
presence of a genetic polymorphism corresponding to said
pathology.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] FIG. 1 shows a schematic diagram for steps involved in
determining operational pathways of a biochemical reaction
network.
[0016] FIG. 2a shows a schematic representation of systemic
reaction pathways as one branch of a regulatory tree with the
regulated genes shown on the horizontal axis.
[0017] FIG. 2b shows a process by which mathematical
representations of biosystems can be improved in an iterative
fashion using algorithmic approaches and targeted
experimentation.
[0018] FIG. 3 shows a phase plane for succinate for an in
silico-generated metabolic flux profile of core metabolism in E.
coli was prepared.
[0019] FIG. 4 shows phase I of a phase plane for a flux
distribution matrix generated with the E. coli core metabolism
using the oxygen and succinate input values show next to the
figure.
[0020] FIG. 5 shows an Singular Value Decomposition (SVD) analysis
on the flux matrix shown in FIG. 4.
[0021] FIG. 6 shows the contribution level of each condition, or
point shown in phase I of the FIG. 4 phase plane, for various modes
obtained from SVD.
[0022] FIG. 7 shows the contribution level of each condition, or
point shown in phase I of the FIG. 4 phase plane, for various modes
obtained from SVD.
[0023] FIG. 8 shows the reduced set of extreme pathways for
succinate that is presented in Table 2.
[0024] FIG. 9 shows a schematic diagram of flux balance analysis
(FBA) and convex analysis to identify extreme and operational
pathways of the invention.
[0025] FIG. 10 shows decomposed flux vectors using the modes
obtained from SVD of P for the extreme pathways of the red blood
cell (RBC) metabolic network.
[0026] FIG. 11 shows a histogram of the first five modes of the SVD
analysis shown in FIG. 10 under maximum (Max), moderate (Mid) and
nominal state (no load) oxidative and energy loads.
[0027] FIG. 12 shows a schematic diagram for building large-scale
in silico models of complex biological processes.
[0028] FIG. 13 shows the localization of single nucleotide
polymorphism clusters found in clinically diagnosed
glucose-6-phosphate dehydrogenase (G6PD) patients.
[0029] FIG. 14 shows the toleration of oxidative load between
chronic and non chronic hemolytic anemia states having G6PD
SNPs.
[0030] FIG. 15 shows the characterization and toleration of energy
loads for glycolytic states harboring different pyruvate kinase
(PK) SNP variants.
[0031] FIG. 16 shows the reconciliation of legacy and empirical
data sets for regulatory networks of yeast and E. coli.
[0032] FIG. 17 shows a schematic diagram of an algorithm for
reconciliation of data sets and iterative improvement of a
mathematical or in silico model.
[0033] FIG. 18 shows a skeleton network of core metabolism and
regulation, together with a table containing relevant chemical
reactions and regulatory rules which govern the transcriptional
regulation.
[0034] FIG. 19 shows calculation of the expression of regulated
genes in an actual organism and model system resulting from phase I
of an iterative process of the invention.
[0035] FIG. 20 shows computed flux distributions using flux balance
analysis (FBA) for the aerobic growth without regulation using an
in silico model of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0036] The invention provides methods and systems for determining
the interaction, integration and coordination of a set of
components of a biosystem. The invention can thus be used to
rapidly and systematically specify a reconstructed biochemical
reaction network at the genome-scale and to relate the activity of
the components and their interaction to a specific phenotype or
physiological state. Understanding which components are operational
under particular conditions allows for improved methods of
engineering desirable functions into living cells, fixing
malfunctioning circuits, and controlling endogenous circuits by the
proper manipulation of the cells' environment. Furthermore, a rapid
method for characterizing a biochemical network allows for the
characterization of a virtually uncharacterized biosystem with a
minimum of experimental effort.
[0037] The invention provides a method for determining the
operational pathways of a biochemical reaction network. The
invention method is practiced by (a) providing a biochemical
reaction network, comprised of reactions which can be regulated;
(b) providing a set of experimental data which represent various
physiological or pathological states of the biosystem under given
conditions; (c) determining a set of systemic pathways which define
the biosystem in whole or in part; (d) determining a set of
phenomenological reaction pathways which describe the experimental
states of the biosystem ; and (e) determining the operational
pathways common to both the systemic and phenomenological pathways
sets both at whole-genome and biosystem subcomponent scale (FIG.
1).
[0038] As used herein, the term "reaction" is intended to mean a
chemical conversion that consumes a substrate or forms a product. A
conversion included in the term can occur due to the activity of
one or more enzymes that are genetically encoded by an organism, or
can occur spontaneously in a cell or organism. A conversion
included in the term can be, for example, a conversion of a
substrate to a product, such as one due to nucleophilic or
electrophilic addition, nucleophilic or electrophilic substitution,
elimination, reduction or oxidation. A conversion included in the
term can also be a change in location, such as a change that occurs
when a reactant is transported across a membrane or from one
compartment to another. The substrate and product of a reaction can
be differentiated according to location in a particular
compartment, even though they are chemically the same. Thus, a
reaction that transports a chemically unchanged reactant from a
first compartment to a second compartment has as its substrate the
reactant in the first compartment and as its product the reactant
in the second compartment. The term "reaction" also includes a
conversion that changes a macromolecule from a first conformation,
or substrate conformation, to a second conformation, or product
conformation. Such conformational changes can be due, for example,
to transduction of energy due to binding a ligand such as a hormone
or receptor, or from a physical stimulus such as absorption of
light. It will be understood that when used in reference to an in
silico biochemical reaction network, a "reaction" is intended to be
a representation of a conversion as described above.
[0039] As used herein, the term "reactant" is intended to mean a
chemical that is a substrate or a product of a reaction. The term
can include substrates or products of reactions catalyzed by one or
more enzymes encoded by an organism's genome, reactions occurring
in an organism that are catalyzed by one or more non-genetically
encoded catalysts, or reactions that occur spontaneously in a cell
or organism. Metabolites are understood to be reactants within the
meaning of the term. It will be understood that when used in the
context of an in silico model or data structure, a reactant is
understood to be a representation of chemical that is a substrate
or product of a reaction.
[0040] As used herein the term "substrate" is intended to mean a
reactant that can be converted to one or more products by a
reaction. The term can include, for example, a reactant that is to
be chemically changed due to nucleophilic or electrophilic
addition, nucleophilic or electrophilic substitution, elimination,
reduction or oxidation or that is to change location such as by
being transported across a membrane or to a different compartment.
The term can include a macromolecule that changes conformation due
to transduction of energy.
[0041] As used herein, the term "product" is intended to mean a
reactant that results from a reaction with one or more substrates.
The term can include, for example, a reactant that has been
chemically changed due to nucleophilic or electrophilic addition,
nucleophilic or electrophilic substitution, elimination, reduction
or oxidation or that has changed location such as by being
transported across a membrane or to a different compartment. The
term can include a macromolecule that changes conformation due to
transduction of energy.
[0042] As used herein, the term "regulatory reaction" is intended
to mean a chemical conversion or interaction that alters the
activity of a catalyst. A chemical conversion or interaction can
directly alter the activity of a catalyst such as occurs when a
catalyst is post-translationally modified or can indirectly alter
the activity of a catalyst such as occurs when a chemical
conversion or binding event leads to altered expression of the
catalyst. Thus, transcriptional or translational regulatory
pathways can indirectly alter a catalyst or an associated reaction.
Similarly, indirect regulatory reactions can include reactions that
occur due to downstream components or participants in a regulatory
reaction network. When used in reference to a data structure or in
silico model, the term is intended to mean a first reaction that is
related to a second reaction by a function that alters the flux
through the second reaction by changing the value of a constraint
on the second reaction.
[0043] A regulatory reaction can further include information about
inhibitory or inducing effects of an active or inactive regulator
on transcription of a gene. For example, a regulatory reaction may
have one or more regulators associated with it which effect
transcription of a gene.
[0044] A regulatory reaction can further include information about
the interaction of regulators which influence gene expression. For
example a regulatory reaction may have a combination of two or more
regulators associated with it which are dependent upon each other
to effect transcription of a gene.
[0045] A regulatory reaction can further include information in the
form of Boolean logic statements which indicates the interaction
and dependency of regulators for transcription of a particular
gene. For example, a particular gene may have a Boolean logic
assigned to it which describes the necessary regulators and
regulatory interactions required for expression of that gene.
[0046] As used herein, the term "regulator" refers to a substance
which regulates transcription, post-transcriptional modification or
activity of one or more genes, proteins, mRNA transcripts. Such a
regulator may be a regulatory protein, small molecule and the
like.
[0047] As used herein, the term "regulatory event" is intended to
mean a modifier of the flux through a reaction that is independent
of the amount of reactants available to the reaction. A
modification included in the term can be a change in the presence,
absence, or amount of an enzyme that catalyzes a reaction. A
modifier included in the term can be a regulatory reaction such as
a signal transduction reaction or an environmental condition such
as a change in pH, temperature, redox potential or time. It will be
understood that when used in reference to an in silico model or
data structure a regulatory event is intended to be a
representation of a modifier of the flux through a reaction that is
independent of the amount of reactants available to the
reaction.
[0048] As used herein, the term "reaction network" refers to a
representation of the functional interrelationships between a
collection of reactions and reaction components. Reaction
components included in a reaction network can be any component
involved in a reaction, such as a substrate, product, enzyme,
cofactor, activator, inhibitor, transporter, and the like.
Functional interrelationships include, for example, those between a
substrate and its product; those between a substrate or product and
the enzyme that catalyzes the conversion from substrate to product;
those between an enzyme and its cofactor, activator or inhibitor;
those between a receptor and a ligand or other pairs of
macromolecules that physically interact; those between a
macromolecule and its transporter; those between proteins involved
in transcriptional regulation and their DNA-binding sites in
regulatory regions regulating specific target genes; and the
like.
[0049] A reaction network can further include information regarding
the stoichiometry of reactions within the network. For example, a
reaction component can have a stoichiometric coefficient assigned
to it that reflects the quantitative relationship between that
component and other components involved in the reaction.
[0050] A reaction network can further include information regarding
the reversibility of reactions within the network. A reaction can
be described as occurring in either a reversible or irreversible
direction. Reversible reactions can either be represented as one
reaction that operates in both the forward and reverse direction or
be decomposed into two irreversible reactions, one corresponding to
the forward reaction and the other corresponding to the backward
reaction.
[0051] A reaction network can include both intra-system reactions
and exchange reactions. Intra-system reactions are the chemically
and electrically balanced interconversions of chemical species and
transport processes, which serve to replenish or drain the relative
amounts of certain reactants. Exchange reactions are those which
constitute sources and sinks, allowing the passage of reactants
into and out of a compartment or across a hypothetical system
boundary. These reactions represent the demands placed on the
biological system. As a matter of convention the exchange reactions
are further classified into demand exchange and input/output
exchange reactions. Input/output exchange reactions are used to
allow components to enter or exit the system. A demand exchange
reaction is used to represent components that are required to be
produced by the cell for the purposes of creating a new cell, such
as amino acids, nucleotides, phospholipids, and other biomass
constituents, or metabolites that are to be produced for
alternative purposes.
[0052] A reaction network can further include both metabolic and
regulatory reactions. Metabolic reactions can be represented by
stoichiometry and reversibility while regulatory reactions can be
represented by Boolean logic statements which both depend on and
effect the presence or absence, activity or inactivity of metabolic
or regulatory proteins.
[0053] A reaction network can be represented in any convenient
manner. For example, a reaction network can be represented as a
reaction map with interrelationships between reactants indicated by
arrows. For mathematical manipulation according to the methods of
the invention, a reaction network can conveniently be represented
as a set of linear algebraic equations or presented as a
stoichiometric matrix. A stoichiometric matrix, S, can be provided,
which is an m x n matrix where m corresponds to the number of
reactants and n corresponds to the number of reactions in the
network. Stoichiometric matrices and methods for their preparation
and use are described, for example, in Schilling et al., Proc.
Natl. Acad. Sci. USA 95:4193-4198 (1998). As a further example, a
reaction network can conveniently be represented as a set of linear
algebraic equations and Boolean logic equations. The Boolean logic
equations may be evaluated and lead to the removal or addition of
certain reactions from the stoichiometric matrix, due to the
inhibitory or inducing effect of regulatory events. Such a
representation is described, for example, in Covert M W, Schilling
C H, Palsson B. J Theor Biol. 213:73-88 (2001).
[0054] The invention methods can be practiced with reaction
networks of either low or high complexity, such as networks that
include substantially all of the reactions that naturally occur for
a particular biosystem. Thus, a reaction network can include, for
example, at least about 10, 50, 100, 150, 250, 400, 500, 750, 1000,
2500, 5000 or more reactions, which can represent, for example, at
least about 5%, 10%, 20%, 30%, 50%, 60%, 75%, 90%, 95% or 98% of
the total number of naturally occurring reactions for a particular
biosystem.
[0055] A reaction network represents reactions that participate in
one or more biosystems. As used herein, the term "biosystem" refers
to an entire organism or cell therefrom, or to a "biological
process" that occurs in, to or by the organism or cell. Thus, a
reaction network can represent reactions that occur at the whole
organismal, whole cell or subcellular level. Additionally, the
reaction network may represent interactions between different
organisms or cells.
[0056] The term "organism" refers both to naturally occurring
organisms and to non-naturally occurring organisms, such as
genetically modified organisms. An organism can be a virus, a
unicellular organism, or a multicellular organism, and can be
either a eukaryote or a prokaryote. Further, an organism can be an
animal, plant, protist, fungus or bacteria. Exemplary organisms
include pathogens, and organisms that produce or can be made to
produce commercially important products, such as therapeutics,
enzymes, nutraceuticals and other macromolecules. Examples of
organisms include Arabidopsis thaliana, Bacillus subtilis, Bos
taurus, Caenorhabditis elegans, Chlamydomonas reihardtii, Danio
rerio, Dictyostelium discoideum, Drosophila melanogaster,
Escherichia coli, hepatitis C virus, Haemophilus influenzae,
Helicobacter pylori, Homo sapiens, Mus musculus, Mycoplasma
pneumoniae, Oryza sativa, Plasmodium falciparum, Pnemocystis
carinii, Rattus norvegicus, Saccharomyces cerevisiae,
Schizosaccharomyces pombe, Takifugu rubripes, Xenopus laevis, Zea
mays, and the like.
[0057] A "biological process" of an organism or cell refers to a
physiological function that requires a series of integrated
reactions. A biological process can be, for example, cellular
metabolism; cell motility; signal transduction (including
transduction of signals initiated by hormones, growth factors,
hypoxia, cell-substrate interactions and cell-cell interactions);
cell cycle control; transcription; translation; degradation;
sorting; repair; differentiation; development; apoptosis; and the
like. Biological process are described, for example, in Stryer, L.,
Biochemistry, W.H. Freeman and Company, New York, 4th Edition
(1995); Alberts et al., Molecular Biology of The Cell, Garland
Publishing, Inc., New York, 2nd Edition (1989); Kuby, Immunology,
3rd Edition, W.H. Freeman & Co., New York (1997); and Kornberg
and Baker, DNA Replication, W.H. Freeman and Company, New York, 2nd
Edition (1992).
[0058] In one embodiment, the biosystem includes the biological
process of cellular metabolism, and the reaction network
representing the biosystem, referred to as a "metabolic reaction
network," includes cellular metabolic reactions. A basic review of
cellular metabolism can be found, for example, in Stryer, L.,
Biochemistry, W.H. Freeman and Company, New York, 4th Edition
(1995). Cellular metabolism can be usefully divided into central
and peripheral metabolic reactions. Central metabolism includes
reactions that belong to glycolysis, pentose phosphate pathway
(PPP), tricarboxylic acid (TCA) cycle and respiration. Peripheral
metabolism, which includes all metabolic reactions that are not
part of central metabolism, includes reactions involved in the
biosynthesis of an amino acid, degradation of an amino acid,
biosynthesis of a purine, biosynthesis of a pyrimidine,
biosynthesis of a lipid, metabolism of a fatty acid, biosynthesis
of a cofactor, metabolism of a cell wall component, transport of a
metabolite or metabolism of a carbon source, nitrogen source,
phosphate source, oxygen source, sulfur source, hydrogen source or
the like.
[0059] In another embodiment, the biosystem includes the biological
process of transcriptional regulation, and the reaction network
representing the biosystem, referred to as a "transcriptional
regulatory reaction network," includes cellular transcriptional
regulatory reactions. A basic review of cellular transcriptional
regulation can be found, for example, in Alberts et al., Molecular
Biology of The Cell, Garland Publishing, Inc., New York, 2nd
Edition (1989). Transcriptional regulatory events may be grouped by
the types of genes regulated, for example those genes associated
with metabolism, cell cycle, flagellar biosynthesis and the
like.
[0060] In another embodiment, the biosystem includes the biological
processes of cellular metabolism and transcriptional regulation and
the reaction network representing the biosystem includes both
metabolic and transcriptional regulatory reactions.
[0061] A reaction network that includes substantially all of the
reactions of a whole organism or cell, or substantially all of the
reactions of a particular biological process of an organism or
cell, is referred to as a "genome-scale" reaction network.
Genome-scale reaction networks representing the metabolism of
various organisms have been described, including E. coli (PCT
publication WO 00/46405); H. pylori (Schilling et al., J.
Bacteriol. 184:4582-4593 (2002)); and H. influenzae Edwards J. S.
and Palsson B. O. J. Biol. Chem. 274:17410-6 (2001)).
[0062] For other biosystems, genome-scale reaction networks can be
prepared by methods known in the art. Generally, these methods
involve first generating a comprehensive list of reactions that are
capable of occurring in the organism, cell or biosystem, and
determining their interconnectivity. The list can include reactions
determined from an analysis of the annotated genome of the
organism, supplemented as required from scientific literature and
from experimental data. Also included can be transport reactions,
biomass composition demands, growth associated energy requirements,
and the like.
[0063] The genome sequences of a large number of animals, plants,
protists, fungi, bacteria and viruses have been completed or are in
progress (see, for example, genome entries in The Institute for
Genome Research (TIGR) database (www.tigr.org/tdb/) and in the NCBI
Entrez Genome database
(www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome)). Other World
Wide Web-based sources of annotated genome sequence information and
reconstructed network information include EcoCyc, Metabolic
pathways database (MPW), Kyoto Encyclopedia of Genes and Genomes
(KEGG), What is There (WIT) and Biology Workbench.
[0064] For organisms whose genomes have not yet been sequenced, a
variety of methods for obtaining the genomic sequence are known in
the art. In most large-scale genome sequencing methods, every step
from isolating DNA, cloning or amplifying DNA, preparing sequencing
reactions, and separating and detecting labeled fragments to obtain
sequence, is automated (Meldrum, Genome Res. 10:1081-1092 (2000)).
Most methods use a combination of sequencing methods, such as a
combination of random shotgun sequencing with a directed finishing
phase. Other methods use a whole-genome shotgun approach, in which
random fragments of total genomic DNA are subcloned directly, and
high-throughput sequencing is used to provide redundant coverage of
the genome. Another approach is to sequence each end of every BAC
in a genome library, and match a finished sequence to a BAC end
sequence to select the next clone (Venter et al., Science
280:1540-1542 (1998); Waterston et al, Science 282:53-54
(1998)).
[0065] For a newly sequenced genome, the open reading frames (ORFs)
or coding regions may be distinguished from the rest of the DNA
sequence by variety of methods. Determining the location of an ORF
in a DNA sequence, its strand, and nucleotide composition may be
conducted by searching for gene signals (e.g., promoters, binding
sites, start and stop codon, etc.) or by analzying gene content
(e.g., codon preference, positional base frequency, etc.), or a
combination of both methods. Algorithms and computational tools are
available to determine the ORFs of an entire DNA sequence using
these methods available through institutes such as the University
of Wisconsin Genetics Computer Group and National Center for
Biotechnology Information. Furthermore, other computational
algorithms have been developed by which bacterial or eukaryotic
genes may be identified by algorithmic methods such as hidden
Markov models, which routinely find more than 99% of protein-coding
regions and RNA genes (Pevzner, "Computational molecular biology:
an algorithmic approach," in Computational Molecular Biology.
Cambridge, Mass.:MIT Press, xviii, p. 314 (2000); Baldi et al.,
"Bioinformatics: the machine learning approach," in Adaptive
Computation and Machine Learning. Cambridge, Mass,: MIT Press
xviii, p. 351 (1998); Fraser et al., Nature 406:799-803
(2000)).
[0066] In order to assign function to the coding regions, newly
identified ORFs are searched against databases containing genes and
protein sequences of known function for sequence similarity.
Several algorithms such as the BLAST and FASTA family of programs
have been developed and are available publically by which the
similarity of a functionally unknown ORF may be determined against
functionally annotated genes. A major portion of unidentified genes
in a newly sequence organism can be assigned functionally with this
procedure.
[0067] If the putative function of a gene is not established by
gene or protein sequence similarity, other techniques such as gene
clustering by function or location may be used to assess the role
of a gene in the network. Gene products that participate in the
same overall function can constitute a pathway in the cell.
"Missing links" in a pathway constructed from an initial sequence
annotation suggests the existence of genes that have not yet been
identified. Searching the sequence against other organisms provides
clues about the possible nucleotide sequence of the missing genes,
which in turn facilitates targeting functionality of the unassigned
coding regions. Algorithms have been developed that perform this
procedure in various genome databases such as KEGG and WIT. In
addition, genes of the neighboring location may be clustered into
operons that are regulated and function in a coordinated fashion
when the DNA sequence is compared to that of other organisms. From
the annotated genetic information, together with biochemical and
physiological information, the interrelatedness of reactions and
reaction components is determined and the reaction network is
completed.
[0068] In addition to defining the ORFs or coding regions of the
genome, regulatory regions can be defined by variety of methods.
Regulatory regions contain binding sites for transcriptional
regulators and components of the transcriptional machinery. These
sites determine the specificity of transcriptional regulation as
the ability of transcriptional regulators to regulate the gene
controlled by the regulatory region. The methods to identify
regulatory regions and sites include comparing non-coding regions
of closely related genomes to identify highly conserved segments of
the genome that may correspond to regulatory regions. Groups of
non-coding regions of a genome can also be searched for commonly
occurring sequence fragments to identify specific binding site
patterns in the genome. These groups can be defined for example by
similarity in biological function of the genes controlled by the
regulatory regions. In addition existing definitions of binding
site patterns for specific transcriptional regulators stored in
specific databases such as Saccharomyces Promoter Database (Zhu and
Zhang, Bioinformatics 15:607-611 (1999)) or TRANSFAC (Wingender et
al., Nucl. Acids Res. 29:281-283 (2001)) can be used to search the
genome for new binding sites for a regulator. Identifying
regulatory sites for specific transcription regulators allows
establishing potential target genes regulated by these regulators
and thus suggesting new regulatory reactions to be added to the
regulatory network.
[0069] As used herein, the term "reaction pathway" refers to a
route through a reaction network through which reaction components,
regulatory information or signaling molecules can potentially flow.
It will be appreciated that the actual amount and/or rate of
substrate to product conversion through a reaction pathway (also
known as "flux") is a function of the physiological state of the
biosystem under consideration, and that reaction pathways
(including operational, extreme and phenomenological reaction
pathways as described below) are generally specified in connection
with the physiological state of the biosystem. The term
"physiological state" is intended to refer to any specified
internal and external parameters that affect, or are likely to
affect, flux through a biosystem. Parameters that can affect flux
include, for example, the actual or intended inputs to the
biosystem (such as the carbon, nitrogen, phosphorus, sulfur or
hydrogen source; the presence or amount of oxygen, nutrients,
hormones, growth factors, inhibitors and the like); the actual or
intended outputs of the biological system (such as biomass
components, secreted products and the like) and environmental
variables (such as temperature, pH and the like). Other parameters
that can affect flux include, for example, the state of
differentiation or transformation of the cell; cell age; its
contact with a substrate or with neighboring cells; the addition or
deletion of expressed genes; and the like.
[0070] As used herein, term "systemic reaction pathway" refers to a
reaction pathway identified by an automated method applied to a
suitable representation of a reaction network. The method may
involve mathematical or algorithmic operations to identify the
reaction pathways, and it may include user definable parameters
that influence the identification of reaction pathways. The
systemic reaction pathways need not to be unique and they may only
apply to a subset of the reaction network.
[0071] Methods of identifying systemic reaction pathways using
convex analysis have been described in the art. Such methods
include, for example, stoichiometric network analysis (SNA)
(Clarke, Cell Biophys. 12:237-253 (1988); elementary mode analysis
(Schuster et al., Trends Biotech. 17:53-60 (1999); and extreme
pathway analysis (Schilling et al., J. Theor. Biol. 203:229-248
(2000); Schilling et al., Biotechnol. Bioeng. 71:286-306 (2001)).
The distinctions between these types of analysis are described in
Schilling et al. supra (2000).
[0072] In one embodiment, the systemic reaction pathway is an
extreme pathway. The term "extreme pathway" refers to a
systemically independent pathway that spans a convex,
high-dimensional space that circumscribes all potential steady
state flux distributions achievable by a defined reaction
network.
[0073] It will be understood that the steps needed to "provide" a
set of systemic reaction pathways for use in the invention methods
will depend on the amount and type of information already available
regarding the biosystem and reaction network. For certain
biosystems and physiological states, sets of extreme reaction
pathways have been described in the art. For example, extreme
pathways for a human red blood cell metabolic network are described
in Wiback et al., Biophys. J. 83:808-818 (2002). Extreme pathways
for a H. influenzae metabolic network are described in Schilling et
al., J. Theor. Biol. 203:249-283 (2000) and Papin et al., J. Theor.
Biol. 215:67-82 (2002). Extreme pathways for a H. pylori metabolic
network are described in Price et al., Genome Res. 12:760-769
(2002).
[0074] Extreme reaction pathways can also be determined de novo,
using methods known in the art (Schilling et al. supra (2000);
Schilling et al. supra (2001)). Appropriate stoichiometric and
thermodynamic constraints can be imposed on the intrasystem and
exchange reactions in the reaction network under steady-state
conditions. Constraints can also imposed on the input and output of
reactants to and from the biosystem. Optionally, regulatory
constraints can also be imposed (Covert et al., J. Theor. Biol.
213:73-88 (2001); Covert et al., J. Biol. Chem. 277:28058-28064
(2002)). This results in a system of linear equalities and
inequalities that can be solved using convex analysis. The solution
space corresponds geometrically to a convex polyhedral cone in
high-dimensional space emanating from the origin, which is referred
to as the steady state "flux cone." Within this flux cone lie all
of the possible steady-state solutions, and hence all the allowable
flux distributions of the biosystem. The extreme pathways
correspond to vectors that define the edges of the flux cone.
[0075] In another embodiment, the systemic reaction pathway is one
branch of a regulatory tree. The regulated genes of a biosystem may
be depicted as shown in FIG. 2a with the regulated genes shown on
the horizontal axis. In a Boolean representation, each protein and
each gene may be considered "on" or "off" (active or inactive,
respectively). The combination of the activity state of all genes
and proteins in a biosystem may be considered a "systemic
regulatory pathway" or a "systemic signaling pathway".
[0076] In another embodiment, the systemic reaction pathway is a
set of regulators and regulatory reactions influencing the activity
of a regulated gene or the set of genes regulated by a regulator or
a group of regulators. These sets may be identified by analyzing
the connectivity of a regulatory network represented as a graph and
identifying nodes in the network connected to a particular node
(regulator or regulated gene). The smallest possible set of such
kind is one involving one regulatory reaction between a regulator
and a target gene.
[0077] As used herein, the term "phenomenological reaction pathway"
refers to a reaction pathway defined through analyzing experimental
data to describe the state of the biosystem in whole or part. The
data types that can be used to define phenomenological reaction
pathways include but are not limited to transcriptomic, proteomic,
metabolomic, fluxomic, protein-protein interaction, and DNA-binding
site occupancy data. The data analysis methods used to define the
phenomenological pathways from the experimental data include but
are not limited to systems identification, statistical,
algorithmic, or signal processing techniques.
[0078] Phenomenological information about the reactions and
reactants of a biosystem can be determined by methods known in the
art, and can be either qualitative or quantitative. For example,
phenomenological information can be obtained by determining
transcription of genes, expression or interactions of proteins,
production of metabolites or other reactants, or use of reactions
in the biosystem. By analogy to the term "genome," such
information, when obtained at the scale of substantially the whole
organism or cell, is called, respectively, the "transcriptome,"
"proteome," "metabolome" and "fluxome."
[0079] Methods of determining gene expression at the transcriptome
scale (also known as "transcriptomics") are known in the art and
include, for example, DNA microarray methods, which allow the
simultaneous analysis of all transcripts simultaneously (Shena et
al., Science 270:467-470 (1995); DeRisi et al., Science 278:680-686
(1997)) and serial analysis of gene expression (SAGE) methods
(Velculescu et al., Trends Genet. 16:423-425 (2000)); Methods of
determining protein expression (also known as "proteomics") are
also known in the art. Expression proteomic methods generally
involve separation of proteins, such as by two-dimensional gel
electrophoresis, followed by protein imaging using radiolabels,
dyes or stains. Separated proteins are then identified using
methods such as peptide mass fingerprinting by mass spectrometry
and peptide-sequence tag, analysis by nano-electrospray (Blackstock
et al., Trends Biotechnol. 17:121-127 (1999)).
[0080] Method for determining interactions between biological
molecules in the cell at a large scale are also known in the art.
Protein-protein interaction information, which allows inferences as
to a protein's function, can be obtained, for example, using
large-scale two-hybrid analysis to identify pairwise protein
interactions (Fromont-Racine et al., Nat. Genet. 16:277-282 (1997).
Indirect protein-DNA interaction information can be obtained using
chromatin immunoprecipitation chip (ChIP-ChIP) methods, which
allows the genome-scale identification of genomic binding sites of
DNA-binding proteins and genomic targets of transcription factors
(Iyer et al., Nature 409:533-538 (2001)).
[0081] Methods of determining the complement of metabolites in a
cell (also known as "metabolomics") are also known in the art and
include, for example, nuclear magnetic resonance (NMR) spectroscopy
such as 13C-NMR; mass spectroscopy such as gas
chromatography/time-of-flight mass spectroscopy (GC/TOFMS); and
liquid chromatography (Fiehn, Plant Mol. Biol. 48:155-171 (2002);
Phelps et al., Curr. Opin. Biotech. 13:20-24 (2002)).
[0082] Likewise, methods of measuring the fluxes through reaction
pathways (also known as "fluxomics") are known in the art, such as
metabolic flux ratio analysis (METAFoR) (Sauer et al., J.
Bacteriol. 181:6679-6688 (1999)). METAFoR quantifies the relative
abundance of intact carbon bonds in biomass constituents that
originate from uniformly isotopically labeled precursor molecules,
which reflects the metabolic pathways used.
[0083] By repeatedly varying the physiological state of the
biosystem, either experimentally or in silico, a series of
phenomenological measurements at different states can be obtained
or predicted. These data can be organized in vectorial form and
represented in matrix or tabular formats. For example, a set of
gene array expression data can be organized as a matrix where each
row is a gene, each column is an experiment, and each value is an
expression level or ratio. As another example, a set of fluxome
data can be organized as a matrix where each row is a reaction,
each column is an experiment and each value is a flux level or
ratio. As a further example, a set of phenotypic data can be
organized as a matrix where each row is an experiment, each column
is an environmental component (such as nutrients, waste products,
or biomass) and each value is a rate of uptake, secretion, or
growth.
[0084] The phenomenological information can be analyzed by various
methods known in the art, such as methods of system identification,
statistical data analysis, combinatorial algorithms, or signal
processing to determine a set of phenomenological reaction
pathways.
[0085] Methods of system identification are known in the art and
include, for example, various types of clustering analysis methods
(reviewed in Sherlock et al., Curr. Opin. Immunol. 12:201-205
(2000)). Clustering methods can be applied to experimental data in
matrix or tabular formats to extract groups of genes that are
co-expressed. These groups that can either be disjoint or
overlapping can be used as definitions of phenomenological
pathways. Alternatively, a data vector within each cluster can be
chosen to be a representative phenomenological pathway for that
cluster--this vector could for example be the mean value of the
data points within the cluster also known as the centroid of the
cluster.
[0086] Clustering analysis methods include, for example,
hierarchical clustering analysis (Eisen et al., Proc. Natl. Acad.
Sci. USA 95:14863-14868 (1998); Wen et al., Proc. Natl. Acad. Sci.
USA 95:334-339 (1998)), whereby single reactant profiles are
successively joined to form nodes, which are then joined further.
The process continues until all individual profiles and nodes have
been joined to form a single hierarchical tree. Clustering analysis
methods also include divisive clustering analysis (Alon et al.,
Proc. Natl. Acad. Sci. USA 96:6745-6750 (1999)), in which two
vectors are initialized randomly, and each reactant is assigned to
one of the two vectors using a probability function. The vectors
are iteratively recalculated to form the centroids of the two
clusters, and each cluster is successively split in the same manner
until each cluster consists of a single profile. Clustering
analysis methods also include methods in which the data is
partitioned into reasonably homogeneous groups. Clustering methods
that incorporate partitioning include, for example, self-organizing
maps (Kohenen, "Self Organizing Maps," Berlin: Springer (1995);
Tamayo et al., Proc. Natl. Acad. Sci. USA 96:2907-2912 (1999)) and
k-means clustering (Everitt, "Cluster Analysis 122," London:
Heinemann (1974).
[0087] Another method of system identification is principal
component analysis of the data, which is closely related to the
singular value decomposition (SVD) of the data matrix (Holter et
al., Proc. Natl. Acad. Sci. USA 97:8409-9414 (2000); Alter et al.,
Proc. Natl. Acad. Sci. USA 97:10101-10106 (2000); Holter et al.,
Proc. Natl. Acad. Sci. USA 98:1693-1698 (2001)). Principal
component analysis is a statistical technique for determining the
key variables in a multidimensional data set that explain the
differences in the observations, and can be used to simplify the
analysis and visualization of multidimensional data sets. SVD is a
linear transformation of data, such as gene expression data, from
genes x arrays space to reduced diagonalized "eigengenes" x
"eigenarrays" space, where the eigengenes (or eigenarrays) are
unique orthonormal superpositions of the genes (or arrays). After
normalization and sorting of the data, the individual genes and
arrays become grouped according to similar regulation and function,
or similar physiological state, respectively. Principal component
and SVD analysis output a set of vectors in the data space (e.g., n
dimensional where n is the number of genes) ordered by how much of
the variability in the data each vector each principal component or
mode captures. These vectors can each be interpreted as
phenomenological pathways describing the major modes of usage of
the gene/protein complement of the organism under specific
conditions that the experiments analyzed represent.
[0088] Software for various types of large-scale data analysis,
including hierarchical clustering, self-organizing maps, K-means
clustering and principal component analysis, is known in the art or
can be developed for a particular application. Exemplary analysis
software includes "XCluster" (see
genome-www.stanford.edu/.about.sherlock/cluster.html on the World
Wide Web), "Cluster" software (see rana.lbl.gov/EisenSoftware.htm
on the World Wide Web) and "Genesis" software (see
genome.tugraz.at/Software/Genesis/Description.html on the World
Wide Web).
[0089] The skilled person can determine which method, or which
combination of methods, is suitable to analyze phenomenological
information to determine a set of phenomenological reaction
pathways.
[0090] As used herein, the term "operational reaction pathway"
refers to a systemic reaction pathway of a biosystem that is
feasible taking into account the reactants present in, or fluxes
through, the biosystem. Operational reaction pathways thus
constitute a subset of systemic reaction pathways that are likely
to actually exhibit flux in the biosystem. The subset of systemic
pathways that are consistent with phenomenological information
about the biosystem are determined to identify operational reaction
pathways consistent with the reactants present or reaction fluxes
through the biosystem.
[0091] Once a set of systemic reaction pathways and a set of
phenomenological reaction pathways have been provided, the two sets
are compared, and common pathways identified. As described above,
the two sets of pathways can be represented in vectorial form, or
in the form of groups of genes participating in the pathways, or in
other convenient ways. There are a number of mathematical methods
known in the art by which two vectors or two groupings can be
compared.
[0092] For example, the two sets of vectors can be compared using a
number of measures for pairwise similarity between vectors
including: (1) Euclidean distance, which corresponds to the squared
distance between two points in space, or in this case two vectors,
taking into account both the direction and the magnitude of the
vectors (Hubbard J. H. and Hubbard B. B. Vector Calculus, Linear
Algebra, and Differential Forms, Prentice-Hall (1999)); (2) Pearson
correlation coefficient, which measures the angle between two
vectors whose length is normalized to one, and is thus independent
of the length of the vectors (Larsen R. J. and Marx M. L. An
Introduction to Mathematical Statistics and Applications, Prentice
Hall, New Jersey (1986)); or (3) Jackknife correlation coefficient,
which is similar to Pearson correlation coefficient, but is
corrected for the effect of single outliers components of the
vectors to provide a more robust distance measure (Heyer et al.,
Genome Res. 9:1106-1115 (1999)). Other methods for comparing
vectors are known in the art.
[0093] Similarly, methods for comparing groupings of genes based on
systemic and phenomenological definitions include: (1) the Rand
index, which measures the overlap between two different groupings
of the same set of genes (Yeung K. Y et al. Bioinformatics 17:177
(2001)); and (2) correspondence analysis, which provides a
two-dimensional graphical representation of the agreement between
two groupings such that the systemic and phenomenological pathways
that are most similar to each other are shown to be located closest
to each other (Johnson R. A. and Wichern D. W. Applied Multivariate
Statistical Analysis, 5th Ed., Prentice Hall, New Jersey
(2002)).
[0094] The skilled person can determine which method, or which
combination of methods, is suitable for comparing systemic reaction
pathways and phenomenological reaction pathways to identify
operational reaction pathways.
[0095] The invention also provides a method determining the effect
of a genetic polymorphism on whole cell function. The method
consists of: (a) generating a reaction network representing a
biosystem with a genetic polymorphism-mediated pathology; (b)
applying a biochemical or physiological condition stressing a
physiological state of the reaction network, and (c) determining a
sensitivity to the applied biochemical or physiological condition
in the stressed physiological state compared to a reaction network
representing a normal biosystem, wherein the sensitivity is
indicative of a phenotypic consequence of the genetic
polymorphism-mediated pathology. The biochemical or physiological
condition can be, for example, a change in flux load, pH,
reactants, or products as well as system or subsystem changes such
as those in oxidative or energy load.
[0096] Briefly, the above methods for analyzing physiological
states of a biosystem, comparing them to systemic reaction pathways
and determining one or more operational reaction pathways can
similarly be employed to determine the effect of genetic
polymorphisms on a biosystem or subcomponent thereof. For example,
phenomenological information used for comparison with systemic
reactions can be obtained from either actual or simulated genetic
mutations of enzymes or other polypeptides. Changes in activity of
the enzyme or polypeptide due to the mutation can be obtained from
sources describing the defect or estimated based on available
information or predictive computations using a variety of methods
well known in the art. The activities that can be assessed include,
for example, catalytic function of an enzyme or binding activity of
a polypeptide such as a transcription regulator.
[0097] In silico models constituting a reaction network of a
genetic polymorphism can be constructed as described previously and
the effect of the polymorphism can be assessed in context of the
biosystem as a whole. Conditions that the reaction network are
subjected to can be varied and the effect of single or multiple,
combined polymorphisms can be determined on whole biosystem
function or as the polymorphism relates to subsystems thereof. For
example, systemic pathways or operational pathways can be
calculated in the presence or absence of the genetic polymorphism.
Comparision of systemic pathways, operational pathways or a
phenotypic manifestation between the two reaction networks can be
performed to determine the differences, if any, between the native
reaction network and the polymorphic counterpart. Such differences
can include, for example, creation of a new systemic or operational
pathway, omission of such a pathway and changes in the rate or
magnitude of such a pathway. The result of such changes between the
normal and polymorphic states also will reveal the consequential
impact on biochemical or physiological function or on phenotypic
expression of the genetic polymorphism.
[0098] Conditions that can be varied include, for example, any
biochemical or physiological component of the system. Such
conditions can be either external to the biosystem including, for
example, external environmental growth conditions such as
temperature, pH, carbon source and other input/output reactions
which allow components to enter or exit the biosystem.
Alternatively, such biochemical or physiological conditions can be
internal to the biosystem. Specific examples of internal conditions
include, for example, exchange reactions indicative of sources and
sinks allowing passage of reactants across a system or subsystem
boundary, intra-system reactions that replenish or drain reactants,
and demand reactions which represent categories of components
produced by the cell. Biochemical or physiological conditions
internal to the biosystem also can include changes in pH,
utilization of carbon sources, availability of metabolites,
cofactors, substrates and products. Other changed internal
conditions can include, for example, alterations in system loads
such as oxidative or energy load on its corresponding subsystem.
Various other biochemical or physiological conditions well known to
those skilled in the art can similarly be varied in the methods of
the invention to obtain comparative reaction network simulations
for determining the effect of a genetic polymorphism on biosystem
function.
[0099] Altering or changing a condition for each biosystem will
generally be sufficient for a comparison between a native and a
counterpart polymorhic biosystem. However, the effect can be
enhanced when the biochemical or physiological condition is applied
to the native and polymorphic biosystem at a magnitude sufficient
to stress the biosystem or a correlative subsystem thereof. For
example, where the activity of a polymorphic enzyme is altered only
slightly compared to its native counterpart, the difference in
activity may not substantially affect cellular function within an
activity range tested. In part, an insignificant impact on cellular
function can be due to the production of sufficient product to
perform normal cellular activity regardless of an activity
deficiency. However, where the activity of the polymorphic enzyme
is tested under stressed conditions, it can be unable to fulfill
the added cellular demand due to the additional work required of
the system. Accordingly, under stressed conditions, a comparison of
the native reaction network functioning and that of the polymorphic
reaction network will more readily reveal those activity effects of
the polymorphic enzyme due to failure of product production under
excess requirements.
[0100] The term "stress" or "stressing" as used in reference to
applying a biochemical or physiological condition is intended to
mean placing a biosystem, reaction network or subsystem thereof
under a state of strain or influence of extra effort. The stress
can be mild or intense so long as it applies demands, loads or
effort on the components extra to that under the normal or nominal
state of the biosystem, reaction network or subsystem thereof.
Therefore, stressing a system state is intended to include imposing
a condition that causes the system to exert additional effort
toward achieving a goal. Specific examples of applying a
biochemical or physiological condition to a biosystem that stresses
a physiological state is described further below in Example
III.
[0101] Genetic polymorphisms can constitute, for example, single
nucleotide polymorphisms (SNPs) and well as multiple changes within
a encoding gene resulting in a polymorphic region within the gene
or its polypeptide coding region. Polymorphisms in gene or coding
region structure can alter the expression levels of the harboring
nucleic acid, activity of the encoded polypeptide or both.
Polymorphisms well known to those skilled in the art of genetics
and genomics include, for example, allelic variants of genes, SNPs
and polymorphic regions of a referenced nucleic acid. Specific
examples, of genetic polymorphisms include those variations in
coding sequence described in Example III for glucose-6-phosphate
dehydrogenase (G6PD) and pyruvate kinase (PK). Numerous other
genetic polymorphisms and their associated diseases are similarly
well known to those skilled in the art.
[0102] Given the teachings and guidance provided herein, the
methods of the invention for determining the effect of a genetic
polymorphism on cellular function can be used with any known or
subsequently determined genetic polymorphism. Similarly, the
linkage between the genetic defect and the pathology mediated also
can be previously known or subsequently determined. Moreover, and
as described further below, it can be used to diagnose previously
undetermined genetic polymorphisms that alter an activity of an
enzyme or polypeptide. However, by determining the effect of the
defect in the context of a whole biosystem, a more accurate
phenotype and assessment of the functional abilities of the
biosystem can be obtained. Accurate determination of phenotypic and
functional attributes of such complicated systems can be
advantageously applied for a more meaningful treatment of the
genetic polymorphism-mediated disease.
[0103] Sensitivities of the polymorphic enzyme to the stressed
condition can be more or less pronounced depending on which
polymorphisim is incorporated into the reaction system, the degree
of polypeptide activity change due to the polymorphism and the
level of stress that is exerted on the system. Those skilled in the
art will know or can determine, given the teachings and guidance
provided herein, what sensitivities are indicative of a particular
polymorphic enzyme or other polypeptide. For example,
glucoso-6-phosphate dehydrogenase (G6PD) functions in the oxidative
branch of the pentose pathway and is sensitive to changes in
maximum velocity (V.sub.max) and cofactor binding affinity
(K.sub.i-NADPH). Enzymes with alterations in these activities
result in changed in oxidative requirements which can be used as
indicators of the metabolic state for G6PD's having altered
activity. For example, one sensitive indicator of the metabolic
state of the biosystem is the NADPH/NADP ratio. This ratio can be
measured under stressed conditions and compared between the
polymorphic reaction network with that of the normal network to
determine the phenotypic and functional changes on the biosystem.
As described further below in Example III, polymorphic enzymes
having alterations in these G6PD activities can be distinguished in
the methods of the invention as those mediating non-chronic and
chronic hemolytic anemia.
[0104] Similarly, pyruvate kinase (PK) functions in glycolysis and
is sensitive to changes in V.sub.max and the affinity for
substrates such as phosphoenolpyruvate (K.sub.PEP). Alterations in
these activities result in changes in ATP concentration, and 2,3
DPG concentration. Sensitive indicators of V.sub.max and K.sub.PEP
can include, for example, the concentration of ATP when the
biosystem is under maximum energy loads or stress compared to
normal conditions. As with G6PD, polymorphic PK enzymes having
alterations in these activities show that anemic patients have a
diminished ability to deviate from the normal homeostatic
state.
[0105] For determining the effect on function, a reaction network
specifying the activity of the polymorphic enzyme is constructed
and the system is stressed as described above. Sensitivity to the
stressed condition compared to that of the normal or native
reaction network can then be determined using a variety of
indicators. Those described above for G6PD and PK are exemplary
indicators for enzyme activity. Those skilled in the art will
understand, given the teachings and guidance provided herein that
other indicators of biochemical or physiological activity of the
particular enzyme or polypeptide being assessed can be used in the
methods of the invention. For example, essentially any measure of
substrate, product, cofactor, or other metabolite can be used as an
indicator of polypeptide activity. Such indicators can be assessed
directly or indirectly such as by measuring the products of
downstream reactions and the like. Moreover, ratios of such
indicators or of general indicators of a particular biochemical or
physiological state can similarly be used. For example, ATP, and
energy cofactors such as NADPH and NADP are general indicators of
the oxidative state and energy charge, respectively, of a
biosystem.
[0106] Changes in activity under stressed conditions of such
biochemical or physiological indicators will identify the change in
function of the biosystem due to the altered activity as well as
show the phenotypic consequences of the polymorphic enzyme. For
example, the inability of a biosystem to respond to excess
oxidative or energy requirements can show, for example, that the
polymorphic enzyme is unable to adequately produce components
within its assigned subsystem to handle the increased work
requirements caused by the stress. A functional biosystem change
can correspond to, for example, altered demands and products that
are produced as well as changes in flux or pathways which
compensate the deficient enzyme activity. A phenotypic outcome can
be, for example, inhibition of biosystem proliferation, decrease in
biosystem mass or even biosystem lysis and death.
[0107] The methods of the invention also can be used for diagnosis
of a genetic polymorphism-mediated pathology. The methods described
above can be used to generate a biosystem reaction network
representing activities of suspected genetic polymorphism. The
biosystem reaction network can be stressed as described above and
the reaction network containing the suspected polymorphic enzyme
activity compared to that of a normal reaction network. A change in
function or phenotype of the suspected polymorphic network compared
to the normal will indicate that the genetic alteration is linked
to the enzyme deficiency. Those skilled in the art will understand
that a plurality of suspected enzyme defects can be identified and
linked to a particular disease given the teachings and guidance
provided herein. For example, those skilled in the art can use
activity measurements from a suspected patient in the creation of a
plurality of reaction networks. Comparison of the function or
phenotype of the networks harboring suspect activities with normal
networks will identify the differences in function or phenotype and
whether any of such identified differences are sufficient to result
in a pathological condition.
[0108] Therefore, the invention provides a method of diagnosing a
genetic polymorphism-mediated pathology. The method consists of:
(a) applying a biochemical or physiological condition stressing a
physiological state of a reaction network representing a biosystem
with a genetic polymorphism-mediated pathology, the applied
biochemical or physiological condition correlating with the genetic
polymorphism-mediated pathology, and (b) measuring one or more
biochemical or physiological indicators of the pathology within the
reaction network, wherein a change in the one or more biochemical
or physiological indicators in the stressed state compared to an
unstressed physiological state indicates the presence of a genetic
polymorphism corresponding to the pathology.
[0109] The invention further provides a method of reconciling
biosystem data sets. The method consists of: (a) providing a first
regulatory network reconstructed from legacy data comprising a
plurality of hierarchical regulatory events; (b) providing a second
regulatory network obtained from empirical data, and (c)
determining a consistency measure between the hierarchical
regulatory events in the first regulatory network and elements in
the second regulatory network, wherein a high degree of the
consistency measure for the hierarchical regulatory events
indicates the validity of the first regulatory network or a
subcomponent thereof.
[0110] The method of the invention for reconciling data sets is
useful for determining the accuracy of a biosystem model as well as
for identifying new components, linkages, networks and subnetwork
of a biosystem model. The model can be based on scientifically
proven data, mathematical interpretations as well as on pure
computational analysis or even theoretical prediction. Regardless
of the source of a biosystem model, the method for reconciling data
sets compares the model or a data set representation thereof to
another source of data to identify the consistency between one
model or data set and that of the comparison model or data set. The
degree of consistency between the two models or data sets thereof
will show how accurate the initial model is to its corresponding
natural biosystem.
[0111] Data sets representing whole biosystems can be reconciled
using the methods of the invention as well as any substructure
thereof. Substructures can consist of subnetworks or modules of the
biosystem reaction network. While the exact boundaries of
subnetworks and boundaries can vary depending on the assessment
criteria used, one feature is that such substructures can be
evaluated, analyzed or identified as a unit in itself. Criteria for
boundary determination can include, for example, functional
attributes, structural attributes and graphical or mathematical
separateness, for example. Specific examples of subnetworks or
modules of a biosystem have been described above and below and are
further shown in FIG. 16 and its associated Example IV. Other
examples are well known to those skilled in the art and can be
employed in the methods of the invention given the teachings
provided herein.
[0112] Data sets applicable for comparison can include a broad
range of different types and sizes. For example, the data sets can
contain a large and complex number of diverse data elements or
components of the reaction network. Alternatively, the data sets
can be small and relatively simple such as when comparing
subnetworks or modules of the reaction network. Those skilled in
the art will appreciate that the more inclusive each data set for
comparison is with respect to its system components, the more
accurate and reliable will be the consistency measure. However,
those skilled in the art will know, or can determine, a reliable
means to compensate for inherent differences based on the character
of one or both of the initial data sets. Therefore, the method of
the invention can be used for reconciling data sets where the pair
of data sets for comparison can be either large or small, or
diverse or simple, as well as for comparison where the data sets
within the pair are either large or small, or diverse or simple
with respect to each other.
[0113] As used herein, the term "legacy" or "legacy data" is
intended to refer to known information or data such as that
obtainable from literature, other reports, computational data,
databases or a combination thereof. The information can be obtained
from the public domain or previously known by the user's own
investigations. The term therefore is intended to include secondary
data that has received the benefit of scientific evaluation and
considerations toward the system to which it pertains, the
scientific authenticity or the theory which it promotes. Legacy
data in essentially any obtainable form can be used in the methods
of the invention and can include, for example, literary, graphical,
electronic, mathematical or computational forms as well as
functional equivalents and transformations thereof. Given the
teachings and guidance provided herein, those skilled in the art
will known how to use a particular format either directly or
following transformation into a useful format for representing a
reaction network of the invention. A variety of such useful formats
have been described above and below and others are well known to
those skilled in the art.
[0114] As used herein, the term "empirical" or "empirical data"
refers to data based on primary factual information, observation,
or direct sense experience. Empirical data is therefore intended to
refer to raw data or primary data that has not received the benefit
of scientific evaluation and considerations toward the system to
which it pertains, the scientific authenticity or the theory which
it promotes. The term is intended to include, for example, data,
data sets or equivalent transformational forms thereof
corresponding to gene expression data, protein activity data and
the like. It can include, for example, large, high throughput
datasets such as that obtainable by genomic, proteomic,
transcriptomic, metabolic and fluxomic data acquisition as well as
small data sets obtainable by a variety of research methods well
known to those skilled in the art. Other forms of primary data well
known to those skilled in the art can similarly be employed in the
methods of the invention.
[0115] Useful attributes of reconciling data sets include, for
example, both validation of known reaction network and subnetwork
models as well as the identification or discovery of new
subnetworks or modules thereof. Validation of an existing model is
useful in itself because it authenticates previous scientific
theories as well as subsequent discoveries based on the original
model. Similarly, invalidation of a network model can be useful,
for example, because it informs the user that components, links or
scientific premises may be omitted from the network model as a
whole. Moreover, reconciliation of data sets can identify
subnetworks or modules of the biosystem reaction network model by
showing differential validation of a particular subsystem or of
several subsystems within the whole. For example, discovery of new
subnetworks or identification of valid subnetworks within the whole
can occur when some, but not all, modules within the biosystem
network are reconciled. Identifications are particularly striking
where the subnetwork or module thereof constitute relatively
independent entities within the biosystem reaction network or are
relatively decoupled from the body of the biosystem network.
Finally, information gained from reconciliation of data sets and
validation of whole networks, subnetworks or modules thereof can be
used to refine the network or subnetworks by altering the model
determining whether the altered model reconciles with the
comparative data set.
[0116] Validation and discovery methods of the invention are
applicable to essentially any form or format of the reaction
network. For example, data sets can be reconciled where a reaction
network is represented by an in silico model, a mathematical
representation thereof, a statistical representation, computational
representation, graphical representation or any of a variety of
other formats well known to those skilled in the art.
[0117] Reconciliation of data sets allows for the validation of
essentially any causal relationships within the compared biosystem
networks. For example, the method for reconciliation of data sets
can be employed on data sets specifying all types of reaction
networks described herein. Therefore, the method is applicable to
reaction networks corresponding to a metabolic reaction network, a
regulatory reaction network, a transcriptional reaction network or
a genome-scale reaction network, or any combination thereof. To
perform the method of reconciliation, a first reaction network can
be provided that is reconstructed from legacy data. As described
previously, the legacy data can be obtained from a secondary source
that has assembled primary data into a working model of the
biosystem network components. The first reaction network is
compared with a second reaction network obtained from empirical
data. The empirical data can consist of, for example, any primary
data representing an activity or other attribute of the components
within the biosystem.
[0118] A comparison of data sets can be accomplished by, for
example, any method known to those skilled in the art that provides
a measure of consistency between the network representation and the
empirical data. In one embodiment a consistency measure is
determined between the empirical data and the legacy data, or the
legacy-derived network model by, for example, grouping the network
components into hierarchical organization of reaction categories.
The reaction categories are useful for determining consistency
measurements between the data sets to be reconciled. The reaction
categories can include, for example, reactants and products,
reaction fluxes, metabolic reactions, regulatory reactions and
regulatory events. Moreover, the reaction categories can be
arbitrary, or based on, for example, functional criteria,
statistical criteria, or molecular associations so long as the
categories provide an acceptable framework for obtaining a
consistency measure between the legacy-derived network and the
empirical data set.
[0119] Exemplary reaction categories for the specific embodiment of
a regulatory reaction network are described further below in
Example IV. Briefly, elements of a regulatory network can be
separated into, for example, three categories based on functional
interactions. These categories include, for example, pair-wise
regulatory interactions, target-regulator units and regulons. Given
the teachings and guidance provided herein, categories other than
these for regulatory networks as well as categories for other types
of reaction networks can be identified or generated by those
skilled in the art. For example, other types of categories can
include anabolic or catabolic reactions or cell signaling
functions. The particular type of category will depend on the type
of reaction network to be reconciled and the measure of consistency
selected to be used in the method of the invention.
[0120] Consistency of the data sets to be reconciled can be
determined by a variety of methods well known to those skilled in
the art. Such methods can be employed to generate a value for each
of category or element within a network that can be analyzed for
significance. For example, in the above exemplary reaction
categories, consistency measurements for pair-wise interactions can
be obtained, for example, by Pearson correlation coefficients
whereas consistency measurements for target-regulator units can be
determined by, for example, multiple correlation coefficients.
Further, consistency measurements for regulons can be determined
by, for example, the average within regulon correlation. Other
methods well known in the art also can be employed and include, for
example, mutual information-based measures (Cover T M & Thomas
J A. Elements of Information Theory, Wiley (1991)), or nonlinear
regression methods (Hastie T, Thibshirani R & Friedman J. The
Elements of Statistical Learning, Springer (2001)). The mutual
information measures require discretization of the original data,
but allow incorporating nonlinear dependencies that are not
accounted for by Pearson or multiple correlation coefficients.
Similarly non-linear correlation measures can be used as
consistency metrics, but their added flexibility compared to linear
correlation may result in overestimating the consistency between
empirical data and a proposed network structure. The statistical
significance of particular values of a consistency measure can be
determined to assess whether the legacy data and empirical data
constitute a good fit. A high degree of consistency measure, such
as those that are statistically significant, indicate that the two
networks, subnetworks or subcomonents reconcile. Further, those
data sets that reconcile either as to the whole network or a
subnetwork thereof indicate a validation of the legacy model
whereas those that are inconsistent indicate a divergence between
the legacy-derived model and the empirical data.
[0121] The invention further provides a method of refining a
biosystem reaction network. The method consists of: (a) providing a
mathematical representation of a biosystem; (b) determining
differences between observed behavior of a biosystem and in silico
behavior of the mathematical representation of the biosystem under
similar conditions; (c) modifying a structure of the mathematical
representation of the biosystem; (d) determining differences
between the observed behavior of the biosystem and in silico
behavior of the modified mathematical representation of the
biosystem under similar conditions, and (e) repeating steps (d) and
(e) until behavioral differences are minimized, wherein
satisfaction of a predetermined accuracy criteria indicates an
improvement in the biosystem reaction network.
[0122] The method can further include the steps of: (f) determining
a behavior of the biosystem under different conditions, and (g)
repeating steps (b) through (e) of the method for refining a
biosystem reaction network under the different conditions. The
method for refining a biosystem reaction network can additionally
include repeating steps (f) and (g) until the minimized behavioral
differences are exhausted, wherein the improved biosystem reaction
network representing an optimal biosystem reaction network.
[0123] The methods of the invention can also be applied in a
general process by which mathematical representations of biosystems
can be improved in an iterative fashion using algorithmic
approaches and targeted experimentation. Many biological systems
are incompletely characterized and additional experimentation can
be required to reconstruct a reaction network of these systems. For
such a process to converge quickly on an optimal model, an
iterative experimentation can be systematized. FIG. 2b exemplifies
such a procedure, which is further described in Example V.
[0124] The model building process can begin with a statement of
model scope and accuracy. Alternatively, the model building process
can proceed in the absence of such a predetermined assessment of
scope or accuracy but terminated once a desired scope or accuracy
is ultimately obtained.
[0125] The purpose for building the model leads to specification of
expected accuracy and the scope of capabilities that the model is
to have. The scope of a model can range from, for example,
describing a single pathway to a genome-scale description of a wild
type strain of an organism. An even broader scope would be to
include sequence variations and thus insist that a model describes
all the variants of the wild type strain.
[0126] The accuracy can be based on, for example, qualitative or
quantitative criteria. A useful model can be qualitative and be
able to make statements that predict, for example, that the growth
rate of an organism is reduced when a particular gene product is
inhibited under a particular growth condition. A quantitative model
can insist, within measurement error, on predicting the percent
reduction in growth rate of inhibition of all the gene products
under one or more growth conditions. The extent of the iterative
model-building process is therefore dictated and predetermined by
the user who can specify a required scope and accuracy of the model
to be generated.
[0127] A reconstructed biochemical reaction network can be
envisioned as a model of an experimental system. In this regard, it
is a duplicate of an actual organism that is capable of flexible
manipulation and study under any conditions that is desirable to
subject the actual organism to. One advantage of a reconstructed
biosystem reaction network, or an in silico version thereof, is
that it is capable of generating an immense amount of information
that characterizes the function and phenotype of the organism. The
accuracy of the in silico model can also be determined by, for
example, using the methods described above for reconciliation and
determining the consistency of the reconstructed network with that
of empirical data obtained from the actual organism. The
availability of both an actual organism and a reconstructed model
of the organism that is readily manipulable can be used
synergistically to harness the power of in silico models for
reliable and accurate predictions of organism behavior and
function.
[0128] An approach to reconstructing an in silico model of a
biosystem is through iterative refinement of a biochemical reaction
network. The refinement of a model can be accomplished by assessing
a particular function of the actual organism and incorporating into
the model new information gained from that particular study.
Because the model is an duplicate of the organism, deviations in
performance from the model compared to the actual organism when
performed under similar conditions will yield data indicating that
additions, omissions or revisions to the in silico that can account
for the deviations. By successive iterations of studies duplicating
conditions that the actual and in silico organisms are subjected
to, altering the model structure to correct and be consistent with
the empirical data obtained from the actual organism and repeating
the condition or subjecting the pair to different conditions, the
accuracy of the model to predict function and phenotype of the
actual organism will successively increase.
[0129] Briefly, studies can be performed with the actual organism
under defined conditions prescribed by an experiment design
algorithm. Similarly, the in silico model that describes the actual
organism can be used to simulate the behavior of the actual
organism under the same conditions. Based on the available data at
any given time, if the model fails to meet the desired scope or
accuracy requirements, further studies can be performed to improve
the model. These studies can be designed using, for example, a
systematic procedure to step-wise or incrementally probe network
function. One approach to probe network function can be, for
example, to incrementally move from a robust or validated subsystem
of the network to less validated parts. Another approach can be,
for example, to target different types functions or different types
of methods for probing function. Particular examples of such
targeted methods of study include, for example, genomic knock-outs,
expression profiling, protein-protein interactions, and the like.
Therefore, the content and capabilities of the in silico model are
subject to iterative updates.
[0130] The decision on what experiments to perform can be
determined, for example, based on the nature of the deviation and
the requirements in an accuracy specification. Deviations can
include a gene expression array that is not predicted correctly by
the model, a set of calculated flux values which does not match the
experimentally-determined fluxome under given conditions, or a set
of phenotypes, for example, growth, secretion and/or uptake rates,
which shows discrepancy from model predictions. Experiments which
could be performed to resolve such discrepancies include
perturbation analysis wherein one or more genes thought to be
responsible for the discrepancy are knocked out, upon which the
resulting organism is characterized using transcriptomics,
fluxornics and the like, or environmental analysis wherein one or
more component of the extracellular environment thought to
contribute to model deviations is removed and the system is
re-characterized.
[0131] Algorithms can be devised that design such experiments
automatically. An algorithm which can be used in the case of gene
expression can be, for example (1) determine the gene(s) which
exhibit a discrepancy from the predictions of the model, (2) use
the regulatory network model to identify the regulatory protein(s)
which control the gene(s) in step (1), (3) knockout one or more
genes in the organism which encode one or more regulatory proteins
(4) perform the same transcriptome experiment under the same
environmental conditions but with the new knockout strain. A second
such algorithm which could be used in the case of a high-throughput
phenotype study with a reconstructed metabolic network could be (1)
determine the phenotype(s) which exhibit discrepancy (e.g., growth
rates do not correlate), (2) systematically add all biochemical
reactions, one or more at a time, until the model prediction
matches the observed phenotype(s), (3) identify gene locus/loci
with significant sequence similarity to identified enzymes which
catalyze the reaction(s) in step (2), (4) clone and characterize
the gene in step (3) to verify whether it can catalyze the
predicted reaction(s). The inputs into an algorithm are several,
including the present model, the data that it has been tested
against, the magnitude and nature of deviations, and so forth. The
output from the algorithm can be component experiments of whole
organism experiments.
[0132] An algorithm can identify, for example, missing components
in the model and request that specific biochemical, protein-DNA
binding, protein-protein interaction, or enzyme kinetic activity
experiments be performed. As described above, the missing
components in the two above examples would be regulatory
interactions and identified enzymes. If these studies reveal
missing components of the model appropriate model updates are
performed.
[0133] An algorithm can be facilitated by, for example, the
inclusion of additional data from whole cell behavior. It may
request that growth, transcription profiling, metabolic profiling,
DNA-transcription factor binding state, or proteomic experiments be
performed under one or more environmental conditions in order to
obtain sufficient information to allow model updating.
[0134] Given a set of inputs such as gene deletions or
environmental inputs, the response of the biochemical reaction
network can be examined both actually and computationally. The
actual system will yield an observed response characterized through
phenomenological pathways of the system, while the model of the
actual system will predict a response characterized by the systemic
pathways of the system. The observed and computed responses can be
compared to identify operational pathways as described previously.
The difference in the measured and computed cellular functions
under the defined conditions where the experiment is performed can
be characterized, for example, as an "error". This difference
corresponds to those systemic pathways that are not operational.
The error can then be used to update the model.
[0135] Model update also can be accomplished by, for example, using
an algorithm for updating parameters in the model so that the model
error is minimized. As identified in Example VI, an algorithm for
characterization of a regulatory network can be, for example, (1)
obtain the activity of each protein as predicted by the model, (2)
for each protein, generate a rule based on the activity of the
given protein which results in the correct expression value for
T5a, (3) recalculate the overall expression array for the regulated
genes, (4) evaluate the difference between the criterion for model
accuracy by determining the new model error, and (5) choose the
model(s) with the lowest error as the new model for future
iterations. Following optimal model updates are implemented, the
remaining "error" between corrected model predictions and actual
responses can be used to design new studies to further probe the
system. The process can be repeated, for example, one or more times
to further update the model based on these new studies and until a
desired scope or accuracy is obtained.
[0136] Model updates that can minimize error on a round of the
iterative reconstruction process can be non-unique or very similar
to each other in generating optimal model updates. To preserve the
availability of such data and increase the efficiency of subsequent
rounds, alternative model updates can be stored, for example, so
that they capable of being retrieved and available for subsequent
use on further rounds of iterative model building. Additionally, a
collection of experimental outcomes can be stored as a historic
record of the behavioral data or phenotypic data that has been
obtained on a particular organism. Model updates and design
algorithms can be optionally capable of querying this database
during execution. Various other records and system data can be
alternatively stored for later efficient utilization in one or more
steps of the iterative process. Such computational approaches are
well known in the art and can be routinely implemented given the
teachings and guidance provided herein.
[0137] Further, combinations and permutations of the various
methods of the invention can be combined in any desired fashion to
facilitate the model building process or to augment a purpose or
implementation of the method. Additionally, single or other
"off-line" studies can be performed and the information generated
used in any of the methods of the invention to facilitate, augment
or optimize results or implementation. For example, in addition to
studies designed for the iterative process, in some cases specific
pair-wise interactions among molecules can be probed in separate
off-line studies to further characterize individual molecular
components.
[0138] Advantageous properties of the iterative model-building
procedure include convergence of system components into an
operative and optimal representation of the actual organism and
efficiency of constructing such a model. Efficiency in convergence
is important since it will minimize the number of studies that need
to be performed.
[0139] The following examples are intended to illustrate but not
limit the present invention.
Example I
Decomposing a Set of Phenomenological Flux Distributions for the E.
coli Core Metabolic Network in Order to Identify Operational
Extreme Pathways
[0140] This example shows how a set of phenomenological pathways
(flux distributions) can be decomposed into dominant modes these
modes can be compared with a set of systemic pathways (extreme
pathways) to identify operational reaction pathways of a metabolic
reaction network (E. coli core metabolism).
[0141] An in silico-generated metabolic flux profile of core
metabolism in E. coli was prepared. The reactions were taken from
table 6.3 of Schilling, "On Systems Biology and the Pathway
Analysis of Metabolic Networks," Department of Bioengineering,
University of California, San Diego: La Jolla. p. 198-241 (2000),
with the exception that reaction pntAB was not included, and
instead of T3P2 in reaction tktA2, T3P1 was used. The reaction list
is tabulated in Table 1.
[0142] The flux profile, which is the input matrix for Singular
Value Decomposition (SVD) analysis, consists of 57 fluxes (rows)
and 7 conditions in each phase (columns). The phase plane for
succinate for this system is presented in FIG. 3; generation of
Phase Planes is described in (Edwards J S, Ramakrishna R, Palsson B
O. Characterizing the metabolic phenotype: a phenotype phase plane
analysis. Biotechnol Bioeng. 2002 Jan. 5;77(1):27-36). The points
on FIG. 3 were chosen to define the upper limit of oxygen and
succinate available to the system. Each point, therefore,
represents a different condition (or column of the flux matrix) in
constructing the flux profile.
[0143] SVD analysis was performed on each phase (each of the 7
conditions) separately. The decomposition of the flux matrix, A,
results in three distinct matrices U (the left singular matrix), E
(singular value matrix), and V (right singular matrix):
A=U.di-elect cons.ENT
[0144] For phase I of the phase plane, the flux distribution matrix
was generated with the E. coli core metabolism using the oxygen and
succinate input values that are tabulated next to FIG. 4. The
points lie on phase I as shown.
[0145] SVD analysis on the flux matrix revealed that there is only
one dominant mode in phase I as demonstrated by the singular value
fractions shown in FIG. 5. Therefore, there is a common expression
that dominates nearly all of the system's behavior in this
phenotypic phase, which can be called a phase invariant singular
value.
[0146] The contribution level of each condition (i.e., each point
shown in Phase I of the phase plane) is shown in FIGS. 6 and 7 for
various modes obtained from SVD. The weight that each mode has on
the overall contribution of a pathway is seen by how far the curve
of that mode is from the zero contribution level (horizontal zero
level). Also, for each mode, the expression level increases with
the condition number which shows how fluxes increase in the pathway
represented by that mode. These representations provide information
regarding where on the phase plane the point lies relative to other
points (i.e., at a higher or lower growth rate). Thus, not only is
information provided about the dominant modes, but also additional
information is provided on biomass production rate. The slope of
the first dominant mode ("first mode") should correspond to the
slope of growth rate. The first mode captures nearly 100% of the
overall contribution.
[0147] To compare the results from SVD and with the results from
pathway analysis, extreme pathways of the core E. coli system were
calculated, using succinate as the sole carbon source. The reduced
set of extreme pathways for succinate is presented in Table 2
(adopted from Schilling, supra (2000), Table 6.6) and shown in FIG.
8.
[0148] For the Phase I analysis described above, to compare the
extreme pathways with the 1st mode, the genes were arranged in the
same order and fluxes were normalized by succinate uptake rate. The
angles between the 1st mode and each of the 12 extreme pathways
were calculated and sorted in descending order. Also, the number of
different fluxes (i.e., fluxes that are zero in one case and
non-zero in the other case or have opposite signs) and the net flux
difference between the first mode and each pathway were calculated
and sorted in the same fashion. Table 3 provides the results of
this analysis.
[0149] This analysis shows that the first mode in phase I is
exactly equivalent to the line of optimality (i.e., P.sub.--33). It
also shows that following this pathway, the first mode is the
closest to pathways 32, 30, and so on. Therefore, column angle not
only shows what pathways best describe flux distribution in phase I
in the order of similarity, but it also shows how similar they are
amongst themselves.
[0150] The analysis was repeated for Phases II and III, and for all
phases together. When all phases were analyzed by SVD together,
again a single dominant mode was identified (FIG. 14), with
relatively low entropy (4.80E-3). The angle between this mode and
each of the 12 extreme pathways was calculated. Table 4 provides
the results of this analysis. By this analysis, the dominant mode
was closest to extreme pathways 33 and 32 shown in Table 2.
Example II
Identifying Human Red Blood Cell Extreme Pathways Corresponding to
Physiologically Relevant Flux Distributions
[0151] This example shows how a set of phenomenological pathways
(flux distributions) generated by a kinetic model can be compared
with the modal decomposition of a set of systemic pathways (extreme
pathways) to identify dominant regulatory modes of a metabolic
reaction network (human red blood cell metabolism).
[0152] The extreme pathways of the red blood cell (RBC) metabolic
network have been computed (Wiback, S. J. & Palsson, B. O.
Biophysical Journal 83, 808-818 (2002)). Here, SVD analysis was
applied to the extreme pathway matrix, P, formed by these pathways.
A full kinetic model of the entire metabolic network of the RBC has
been developed (Jamshidi, N., Edwards, J. S., Fahland, T., Church,
G. M., Palsson, B. O. Bioinformatics 17, 286-7 (2001); Joshi, A.
& Palsson, B. O. Journal of Theoretical Biology 141, 515-28
(1991)), and was used to generate flux vectors (v) for
physiologically relevant states. These flux vectors were decomposed
using the modes obtained from SVD of P.
[0153] The rank of the V.sub.max-scaled RBC extreme pathway matrix,
P, was 23. The first mode represents 47% of the variance (FIG.
10F). Combined, the first five modes capture 86% of the variance of
the solution space, while the first nine modes capture 95% of its
variance.
[0154] The first five modes of P are shown on the metabolic maps in
FIG. 10(A-E). The first mode shows low flux values though the
adenosine reactions, higher fluxes through the glycolytic
reactions, with an exit through the R/L shunt, and the highest flux
levels through the pentose phosphate pathway. This map describes
the principal variance of the steady-state solution space. The
subsequent modes describe the next directions of greatest variance
in the steady-state solution space (FIG. 10). Movement along a mode
in the positive direction corresponds to increasing the fluxes
shown in red and decreasing those shown in green. Since the modes
are required to be orthogonal, they specifically describe the
directions of variance in the cone that are independent from each
other. The subsequent modes can be interpreted biochemically as
follows:
[0155] The second mode describes the flux split between glycolysis
and the pentose phosphate pathway. If the contribution of this mode
is added to the first mode it would lead to decreased flux through
the pentose phosphate pathway and reduced production of NADPH. The
increased glycolytic flux exits through the Rapoport-Leubering
(R/L) shunt leading to decreased ATP production since ATP is used
in upper glycolysis and not recovered in lower glycolysis. The
production of NADH increases.
[0156] The third mode describes the glycolytic pathway down to
pyruvate with production of ATP and NADH. It also describes lowered
dissipation of ATP as a consequence of AMP dissipation by AMPase.
This mode has a significant ATP production.
[0157] The fourth mode describes the flux split between lower
glycolysis and the R/L shunt. It thus naturally interacts
biochemically with the second mode. The fourth mode further
describes an increase in ATP dissipation via the AMPase-AK cycle
leading to little net production of ATP, and interacts with mode
three.
[0158] The fifth mode is actually one of the extreme pathways. It
describes importing pyruvate and converting it to lactate, thus
dissipating one NADH. It thus will be important in balancing NADH
redox metabolism.
[0159] As shown below the first five modes account for most of the
RBC's physiological states.
[0160] The nominal state (no additional metabolic load) of the red
blood cell metabolic network was calculated using a full kinetic
model and is shown on the RBC metabolic map (FIG. 10G). This
nominal physiologic steady state of the RBC was decomposed into 23
modes (FIG. 10H). The relative error remaining in the reconstructed
solution after the addition of each mode to the reconstruction of
the nominal steady state fell sharply (FIG. 10H). After the
contribution of the first five modes, the reconstructed nominal
state had a relative error of 0.013 (RE(5)=0.013).
[0161] An inspection of the first five modes (FIG. 10A-E)
demonstrates how they reconstruct the physiologic steady state
solution. Relative to the first mode (FIG. 10A), adding the second
mode (FIG. 10B) increases the flux through the first half of
glycolysis, decreases the flux through the pentose phosphate
reaction, and decreases NADPH production, all of which moves the
reconstructed solution significantly towards the physiologic steady
state (FIG. 10G). Adding the third mode (FIG. 10C) increases the
flux through all of glycolysis, particularly through lower
glycolysis. The addition of the fourth mode (FIG. 10D)
appropriately decreases the amount of 23DPG that is produced and
instead sends that flux through lower glycolysis. Finally, the
addition of the fifth mode increases the flux from pyruvate to
lactate, which leads essentially to the steady state solution where
lactate is the primary output of glycolysis. Thus, the significant
features of the physiologic steady state are captured within the
first five modes. A regulatory structure that can move the solution
along these five independent directions in the solution space will
be able to generate the desired physiological state.
[0162] Steady-state flux distributions for two load levels of
NADPH, ATP, and NADH were calculated using the RBC kinetic model.
These pairs of load levels each represented the maximum load the in
silico RBC could withstand, as well as one value chosen within the
tolerated load range. NADPH loads simulate physiologic states
corresponding to the red blood cell's response to oxidative free
radicals. The maximum NADPH load is 2.5 mM/hr. The ATP loads
simulate conditions of increased energy loads, such as in
hyperosmotic media. The maximum ATP load is 0.37 mM/hr. Two NADH
loads, important for methemoglobin reduction in the RBC, were also
applied. These six computed flux vectors thus represent extreme
physiological states of the RBC, and help designate the region of
physiologically meaningful states within the steady-state solution
space.
[0163] The modal composition of each of the six "stressed" steady
state flux solutions gives significant weighting to the first five
modes (FIG. 10H). In addition, some "fine tuning" appears in modes
7 to 11. All of the other modes are essentially insignificant in
reconstructing these solutions to the RBC kinetic model.
[0164] The application of metabolic loads changed the weighting of
the first five modes to reconstruct the appropriate metabolic flux
distribution (FIGS. 10H,I). Increases in the NADPH load resulted in
a substantial increase of the weighting on the first mode,
increasing the flux through the pentose phosphate reactions and
thus elevating the production of NADPH. The weightings on the
second, third, fourth, and fifth modes decrease with the
application of higher NADPH loads largely because as NADPH
production is maximized the flux distribution approaches that of
the first mode. The reduction in the weighting of the second mode,
however, is the most dramatic. The application of increasing ATP
loads resulted in little change in the values of the weightings on
all of the first five modes. The application of ATP load is handled
in the RBC by a decrease in an ATP-consuming futile cycle, with the
ATP generated instead being used instead to satisfy the load
imposed upon the cell. Thus, the usage of an ATP-dissipating futile
cycle in the unstressed state of the RBC acts to dampen the effects
of changing ATP loads, allowing the RBC to respond to changing ATP
loads with little change in the overall flux distribution in the
cell. Related experimental findings have demonstrated that the
concentration of ATP in the RBC does not change much as
environmental conditions change within specified limits, as a
result of this buffer, but then changes dramatically when the ATP
load is pushed beyond those limits. The application of the NADH
loads resulted in a significant decrease of all the mode weightings
because the length of the flux vector decreases. The weighting on
the fifth mode decreased most dramatically since it consumes NADH
when utilized in the positive direction and thus had needed to be
scaled down.
[0165] After the inclusion of the first five modes, the relative
error (RE(5)) of all the reconstructed solutions ranged from 0.005
to 0.018. In all six cases, the first five modes reconstructed at
least 98% of the steady state solutions. Thus, the physiologically
relevant portion of the steady-state solution space appears to be
only 5 dimensional, and therefore there are effectively only five
degrees of freedom to the problem of regulating red cell
metabolism.
[0166] Decomposition of the extreme pathway vectors into the modes
shows that the most important mode, in the reconstruction, is often
not one of the first five modes (FIG. 10J). Thus, many portions of
the allowable solution space, as defined by the extreme pathways,
are poorly characterized by the first five modes, which effectively
reconstruct each solution to the full RBC kinetic model. Thus, many
of the extreme pathways are physiologically irrelevant and they can
be identified using SVD of P, if the approximate location of
physiologically meaningful solutions is known.
[0167] Study of regulation of metabolism has historically focused
on the identification and characterization of individual regulatory
events. Now that we can reconstruct full metabolic reaction
networks we can address the need for regulation from a
network-based perspective. This study has focused on interpreting
regulation from a network-based perspective using singular value
decomposition of the extreme pathway matrix for human red blood
cell metabolism. Two main results were obtained. First, the
dominant modes obtained by SVD interpret RBC metabolic physiology
well. Second, the first five modes effectively characterize all the
relevant physiological states of the red cell.
[0168] RBC metabolic physiology is well interpreted by the dominant
modes obtained from SVD. Using the calculated modes, seven
physiologically relevant solutions to the full RBC kinetic model
were reconstructed. The RE(5) for these solutions was within 0.017.
Thus, the first five modes can be used to essentially completely
recapture each of the physiologically relevant kinetic solutions.
However, most of the extreme pathways could not be reconstructed to
such a high degree by the first five modes. Thus, the first five
modes represented the space relevant to solutions to the full
kinetic model better than they did to the space as a whole, even
though they were calculated to optimize their description of the
entire space. This fact suggests that developing constraints-based
methods that take into account kinetics and metabolomics will
result in defining a solution space that is much smaller than the
space circumscribed by the extreme pathways.
[0169] The results obtained herein were based on the topology of
the metabolic network and knowledge of some V.sub.max values. The
next step to bridge the gap between the network-based results and
the study of individual regulatory events is to find the best ways
to pair candidate regulatory molecules and the systemic regulatory
needs. In control theory this is known as the `loop-pairing`
problem (Seborg, D. E., Edgar, T. F. and Mellichamp, D. A. Process
Dynamics and Control (Wiley, N.Y., 1989)). As a part of its
solution we may have to relax the need for strict orthonormality of
the modes and look for oblique modal bases that are more in line
with the underlying biochemistry of the network.
[0170] Taken together, this study presents a network-based approach
to studying regulatory networks and defines the degrees of freedom
of the regulatory problem. This method calculates the modalities
needed to enable the metabolic network to navigate its solution
space and thus could be used to infer candidate regulatory loops of
metabolic systems for which the regulation is largely unknown.
Further, based upon their contribution to the steady-state solution
space, these regulatory loops can potentially be ordered in terms
of their importance to the reconstruction of the space.
Network-based approaches to studying regulation, such as the one
offered herein, complement component-based studies and provide a
potential framework to better understand the interaction of
regulatory components needed to achieve the regulatory demands of
the cell.
Example III
In Silico Assessment of the Phenotypic Consequences of Red Blood
Cell Single Nucleotide Polymorphisms
[0171] The following example illustrates the application of the
described methods to analysis of phenomenological pathways defined
through pathological data.
[0172] The Human Genome Project (HGP) is now essentially complete.
One result of the HGP is the definition of single nucleotide
polymorphisms (SNPs) and their effects on the development of human
disease. Although the number of SNPs in the human genome is
expected to be a few million, it is estimated that only 100,000 to
200,000 will effectively define a unique human genotype. A subset
of these SNPs are believed to be "informative" with respect to
human disease (Syvanen, A., 2001. Accessing genetic variation:
Genotyping single nucleotide polymorphisms. Nat Rev Genet 2:
930-942). Many of these SNPs will fall into coding regions while
others will be found in regulatory regions. The human
genotype-phenotype relationship is very complex and it will be hard
to determine the causal relationship between sequence variation and
physiological function. One way to deal with this intricate
relationship is to build large-scale in silico models of complex
biological processes (FIG. 12). Defects or alterations in the
properties of a single component in complex biological processes
can be put into context of the rest by using an in silico model. In
this work, recent data on SNPs in key red blood cell enzymes (FIG.
12a) and corresponding alterations in their kinetic properties
(FIG. 12b) were used in an in silico red blood cell model (FIG.
12c) to calculate the overall effect of SNPs on whole cell function
(FIG. 12d).
[0173] The study of variations in the kinetic properties of red
blood cell enzymes is not merely an academic study of the quality
of a mathematical model, but has real utility in the clinical
diagnosis and treatment of enzymopathies and can provide a link to
the underlying sequence variation (FIG. 12). Here, an in silico
model is used to study SNPs in two of the most frequent red blood
cell enzymopathies: glucose-6-phosphate dehydrogenase (G6PD) and
pyruvate kinase (PK).
[0174] For both enzyme deficiencies, clinical data was obtained
from the published literature to determine measured values for the
various kinetic parameters (V.sub.max's, Km's, Ki's) associated
with each clinically diagnosed variant. These numerical values were
then used in the in silico model (Jamshidi, N., Edwards, J. S.,
Fahland, T., Church, G. M., Palsson, B. O. Bioinformatics 17, 286-7
(2001)) and sensitivities to various oxidative and energy loads
(above normal, baseline values) were simulated. The results are
interpreted with respect to the genetic basis of the enzymopathy in
an attempt to establish a direct link between the genotype and
phenotype (FIG. 12).
[0175] Glucose-6-phosphate dehydrogenase (G6PD) catalyzes the first
step in the oxidative branch of the pentose pathway (FIG. 12c) and
is thus of critical importance in maintaining the red blood cell's
resistance to oxidative stresses. G6PD is the most common
erythrocyte enzymopathy, affecting approximately 400 million people
worldwide.
[0176] G6PD from normal patients and patients with hemolytic anemia
have been characterized on the molecular level. A total of 61 G6PD
class I variants have been described at the molecular level. Of the
61 class I chronic variants, 55 are the result of SNPs involving
amino acid changes, 5 result from frame deletions and one results
from a splicing defect (Fiorelli, G., F. M. d. Montemuros and M. D.
Cappellini, Bailliere's Clinical Haematology 13: 35-55 (2000)).
[0177] Clinically diagnosed SNPs cluster around important, active
regions of G6PD enzyme including the dimer interface and substrate
binding sites (FIG. 13a). Numerical values of G6PD kinetic
parameters were varied in silico to determine the sensitivity of
red blood cell metabolic functions to these changes in enzyme
function. The most sensitive parameters were found to be V.sub.max
and Ki-NADPH. The NADPH/NADP ratio proved to be the most
informative indicator of metabolic status as it was the most
sensitive to changes in these two parameters and it gives an
indication as to the oxidative state of the cell (Kirkman, H. N.,
G. D. Gaetani, E. H. Clemons and C. Mareni, Journal of Clinical
Investigation 55: 875-8 (1975)). For each documented variant there
appears to be no direct correlation between V.sub.max and Ki-NADPH
(FIG. 13b). Clinically, G6PD deficiencies are broken down into two
main categories: chronic and non-chronic hemolytic anemia. Chronic
cases show clinical symptoms and are very sensitive to the
environment. Non-chronic cases appear normal under homeostatic
conditions but can experience problems when subjected to large
oxidative stresses (Jacobasch, G., and S. M. Rapoport, in Molecular
Aspects of Medicine (1995)). For this study, kinetic data for 12
chronic and 8 non-chronic cases from Yoshida and 19 chronic cases
from Fiorelli were used (Fiorelli, G., F. M. d. Montemuros and M.
D. Cappellini, Bailliere's Clinical Haematology 13: 35-55 (2000);
Yoshida, A., pp. 493-502 in Glucose-6-Phosphate Dehydrogenase,
Academic Press 1995).
[0178] Under normal conditions (i.e., oxidative load, V.sub.ox=0)
there are differences between the chronic and non-chronic groups
with the chronic group having a somewhat lower homeostatic steady
state NADPH/NADP ratio than the non-chronic group. When subjected
to an oxidative load (V.sub.ox>0), noticeable differences
between the two groups (chronic and non-chronic) appear (FIG. 14).
The NADPH/NADP ratio at the maximum tolerated oxidative load
(V.sub.ox=max value) correlates with this ratio in the un-stressed
situation (V.sub.ox=0). The group of chronic hemolytic anemia
patients are clearly separated from the normal and non-chronic
group. A number of the chronic cases can only withstand a very
modest oxidative load. Of the variant cases studied, a handful have
been characterized at the molecular (amino acid) level (Table 5).
Of the cases considered, most of the single base changes in the
chronic (class I) variants occur at or near the dimer interface
(exons 10,11 and 6,7) or near the NADP binding site, leading to an
impaired ability to respond to systemic oxidative challenges.
[0179] Pyruvate kinase (PK) is a key glycolytic regulatory enzyme.
There have only been about 400 documented variants since PK's first
description in 1961 (Jacobasch, G., and S. M. Rapoport, in
Molecular Aspects of Medicine (1996); Tanaka, K. R., and C. R.
Zerez, Seminars in Hematology 27: 165-185 (1990); Zanella, A., and
P. Bianchi, Balliere's Clinical Hematology 13: 57-81 (2000)). PK
accounts for 90% of the enzyme deficiencies found in red blood cell
glycolysis. It is autosomal recessive where clinical manifestations
appear only in compound heterozygotes (2 mutant alleles). There are
four isozymes: L, R, M1, and M2, with the R type being exclusive to
the red blood cells. PK is encoded by the PK-LR gene on chromosome
1q21. The kinetics of the enzyme have been extensively studied
(Otto, M., R. Heinrich, B. Kuhn and G. Jacobasch, European Journal
of Biochemistry 49: 169-178 (1974)). PK activity is regulated by
F6P, ATP, Mg, and MgATP. Anemic heterzygotes have 5-40% of normal
PK activity.
[0180] A summary of the PK variants is presented in Table 6. The
Sassari variant only has a SNP (cDNA nt 514) transversion of a G to
a C resulting in a change of Glu to Gln at aa 172 which is in
between the .beta.1 and .beta.2 in the B domain. Here a basic
(negatively charged amino acid) is replaced by a polar uncharge
amino acid. Parma has 2 SNPs, one at aa 331 or 332 and another at
aa 486 or 487, neither of whose amino acid changes have been
elucidated yet. Soresina and Milano share the amino acid change Arg
to Trp at aa 486 (positively charged to non-polar). Brescia has a
deletion of Lys at aa 348 and another change at aa 486 or 487 that
has not been defined yet. Mantova has an exchange at amino acid 390
Asp to Asn (negatively charged to polar uncharged). (Bianchi, P.,
and A. Zanella, 2000 Hematologically important mutations: red cell
pyruvate kinase. Blood Cells, Molecules, and Diseases 15: 47-53;
Zanella, A., and P. Bianchi, Balliere's Clinical Hematology 13:
57-81 (2000)).
[0181] Unlike for G6PD, the characterized PK SNPs are scattered
throughout the protein coding region and do not appear to cluster
near the corresponding active site of the enzyme. The documented
kinetic values for the main kinetic parameters V.sub.max and KPEP
are shown (FIG. 15a). Similar to the G6PD variants, there is not a
clear correlation between changes in the numerical V.sub.max and
KPEP amongst the PK variants (FIG. 15b). Although changes in KADP
are also documented for each variant and accounted for in the
simulations, increases or decreases in its value did not
significantly affect the red blood cell's steady state metabolite
concentrations or its ability to withstand energy loads (data not
shown). Changes in KPEP and V.sub.max influence the concentration
of ATP and 2,3DPG most significantly. When increased energy loads
(Y.sub.e>0) are applied in silico, differences between the
variants are observed. The ratio between the ATP concentration at
maximum tolerated load (ve=max value) and the ATP concentration in
the unchallenged state (V.sub.e=0) varies approximately linearly
with the maximum tolerated load when all the variants are evaluated
(FIG. 15c). Thus the variants that tolerated the lowest maximum
load have a [ATP]max/[ATP]no load ratio close to unity indicating
their sharply diminished ability to deviate from the nominal
homeostatic state. Interestingly, the computed energy charge
(EC=(ATP+1/2ADP)/(ATP+ADP+AMP)) (Atkinson, D. E., 1977).
[0182] Cellular energy metabolism and its regulation. (Academic
Press, New York) stays relatively constant (FIG. 15d). This result
indicates that red blood cell metabolism strives to maintain its EC
within the tolerated load range, thus allowing for an energetically
consistent metabolic function.
[0183] Sequence variations in coding regions for metabolic enzymes
can lead to altered kinetic properties. The kinetic properties of
enzymes are described by many parameters and a single SNP can alter
one or many of these parameters. For the variants of G6PD and PK
considered here, there appears to be no clear relationship between
their kinetic parameters as a function of sequence variation. Thus
consequences of sequence variations on the function of a gene
product must be fully evaluated to get a comprehensive assessment
of the altered biochemical function.
[0184] The consequences of many simultaneously altered enzyme
properties must in turn be evaluated in terms of the function of
the enzyme in the context of the reaction network in which it
participates. The assessment of sequence variation on biochemical
and kinetic properties of enzymes may seem difficult and this
challenge is currently being addressed (Yamada, K., Z. Chen, R.
Rozen and R.G. Matthews, Proc Natl Acad Sci USA 98: 14853-14858
(2001)), but the assessment of sequence variation on entire network
function is even more complicated. This highly complex and
intricate relationship between sequence variation and network
function can be studied through the use of a computer model. Here
we have shown that a large number of variants in red blood cell
G6PD and PK can be systematically analyzed using an in silico model
of the red blood cell. Correlation between sequence variation and
predicted overall cell behavior is established, and in the case of
G6PD, it in turn correlates with the severity of the clinical
conditions.
Example IV
Consistency Between Known Regulatory Network Structures and
Transcriptomics Data
[0185] The following example illustrates the use of the described
methods to validate and expand known regulatory network structures
by reconciling these structures with large-scale gene expression
data sets.
[0186] The availability of large genome-scale expression data sets
has initiated the development of methods that use these data sets
to infer large-scale regulatory networks (D'Haeseleer, P., Liang,
S. and Somogyi, R Bioinformatics 16:707-26 (2000); de Jong, H. J.
Comput. Biol. 9:67-103 (2002); Yeung, M. K., Tegner, J. &
Collins, J. J. Proc. Natl. Acad. Sci. USA 99:6163-8 (2002)).
Alternatively, such regulatory network structures can be
reconstructed based on annotated genome information, well-curated
databases, and primary research literature (Guelzim, N., Bottani,
S., Bourgine, P. and Kepes, F. Nat. Genet. 31, 60-3. (2002);
Shen-Orr, S. S., Milo, R., Mangan, S. & Alon, U. Nat. Genet.
31, 64-8 (2002)). Here we examine how consistent existing
large-scale gene expression data sets are with known genome-wide
regulatory network structures in Echerichia coli and Saccharomyces
cerevisiae. We find that approximately 10% of the known pair-wise
regulatory interactions between transcription factors and their
target genes are consistent with gene expression data in both
organisms. We show that accounting for combinatorial effects due to
multiple transcription factors acting on the same gene can improve
the agreement between gene expression data and regulatory network
structures. We also find that regulatory network elements involving
repressors are typically less consistent with the data than ones
involving activators. Taken together these results allow us to
define regulatory network modules with high degree of consistency
between the network structure and gene expression data. The results
suggest that targeted gene expression profiling data can be used to
refine and expand particular subcomponents of known regulatory
networks that are sufficiently decoupled from the rest of the
network.
[0187] The known genome-scale transcriptional regulatory network
structures for yeast (Guelzim, N., Bottani, S., Bourgine, P. &
Kepes, F. Nat. Genet. 31, 60-3. (2002)) and E. coli (Shen-Orr, S.
S., Milo, R., Mangan, S. and Alon, U. Nat. Genet. 31, 64-8 (2002))
were obtained and pre-processed to remove autoregulation. These
structures were represented as graphs with directed regulatory
interaction edges between a regulator node (typically a
transcription factor) and a target gene node, with the mode of
regulation (activation, repression, or both) indicated for each
interaction. The yeast network has 108 regulatory genes regulating
414 target genes through 931 regulatory interactions, whereas the
E. coli network has 123 regulatory genes regulating 721 target
genes through 1367 regulatory interactions. We used data from a
total of 641 diverse gene expression profiling experiments
organized into five separate data sets for yeast and 108
experiments organized into three separate data sets for E.
coli.
[0188] There were three basic types of regulatory network elements
analyzed in this study: 1) pair-wise regulatory interactions, 2)
target-regulator units, and 3) regulons. A target-regulator unit
(TRU) is defined as a single target gene together with all of its
transcriptional regulators. A regulon is defined as the set of all
target genes for a single transcriptional regulator. For each
instance of the individual network elements present in the network,
we computed a consistency measure between a particular gene
expression data set and the network element structure. The
particular measures we used were Pearson correlation coefficients
for pairwise interactions, multiple coefficients of determination
for TRUs, and average within regulon correlation for regulons. The
statistical significance of a particular value of a consistency
measure was determined by a randomization procedure.
[0189] The simplest elements in the regulatory network are
pair-wise regulator-target interactions. Overall only a relatively
small fraction (less than 10% at P<0.01) of pairwise
interactions are in agreement with the gene expression data given
the criteria stated above. In particular, virtually none of the
repressor-target interactions are supported by any of the gene
expression data sets examined. Most repressors actually have
positive correlation with the expression of their target genes--not
negative as would be expected for a repressor. These results for
repressing pair-wise interactions highlights the problems
associated with detecting transcripts expressed at a low level as a
result of a transcriptional repressor bound to the promoter of the
target gene.
[0190] Analysis of pair-wise correlations could overestimate
correlations between transcription factor and target gene
expression levels in the presence of transcriptional feed-forward
loops. In such cases two or more transcription factors act on the
same gene, but some of them (primary regulators) also regulate
another (secondary) regulator directly. Feed-forward loops can lead
to an indirect effect by which the secondary regulator-target
correlation is solely due the influence of the primary regulators.
In the framework used here, this effect can be accounted for by
replacing standard correlation coefficients with partial
correlation coefficients for secondary regulator-target
interactions. Although there is a significant number of
feed-forward loops in both networks (240 in yeast, 206 in E. coli),
the overall effect of accounting for feed-forward loops is small
(0-3 percentage points).
[0191] Target-regulator units represent more complex combinatorial
effects than feed-forward loops. The percentage of TRUs consistent
with gene expression data is higher than the percentage of
consistent pair-wise interactions for E. coli at all confidence
levels. This result indicates that combinatorial effects between
transcription factors play a significant role in many cases.
Conversely for TRUs in yeast, we do not observe a significant
change in the percentage of units in agreement with expression data
compared to the calculations that considered only pairwise
interactions.
[0192] TRUs can be categorized by the number of regulators that act
on the target gene. In yeast, the TRUs with four regulators are in
general best supported by the gene expression data. These
four-regulator TRUs include genes participating in diverse cellular
functions including nitrogen utilization, oxygen regulation, and
stress response. Hence the high degree of consistency observed for
four-regulator TRUs does not appear to be solely due to a
particular subcomponent of the network, but is a more general
feature of the network structure. In E. coli, no clear dependence
between the number of regulators and the fraction of consistent
TRUs can be detected.
[0193] In order to investigate the agreement between regulatory
network structures and gene expression data from a different
perspective that does not assume correlation between the expression
levels of transcription factors and their target genes, we studied
the coherence of gene expression within known regulons. A large
fraction of regulons (over 40%) have coherent gene expression in
both yeast and E. coli even for the most stringent confidence level
(P<0.001) in at least one data set. This result indicates that a
clustering-like approach to analyzing gene expression data can
indeed be expected to be successful in detecting truly co-regulated
genes. The most interesting feature of this calculation is the
relatively low level of regulon coherence for regulons regulated by
transcriptional repressors in yeast. In contrast, E. coli regulons
controlled by repressors tend to be more coherent than those
controlled by activators.
[0194] All the results described above for both yeast and E. coli
can be displayed on a map of the regulatory network (FIG. 16). This
data display allows identifying subcomponents of the networks that
have high degree of agreement with the gene expression data sets
analyzed. For example in yeast the nitrogen utilization (I in FIG.
16a) and oxygen response (O) systems have many highly consistent
elements, but the elements in the carbon utilization (C) network
are generally not consistent with the gene expression data.
Similarly in E. coli components such as the flagellar biosynthesis
network (F in FIG. 16b) are highly consistent, but the carbon
utilization (C) network again does not have many consistent network
elements.
[0195] Some of the variability in consistency between regulatory
network structures and gene expression data appears to be due to
the types of data sets utilized in this work. For example, the DNA
repair system in E. coli was specifically activated in one of the
gene expression data sets and the response to nitrogen depletion
was studied in one of the yeast data set. However, there are also
general network structural features that appear to influence
consistency. The most prominent feature is the tendency of
relatively isolated subcomponents of the network such as flagellar
biosynthesis in E. coli or nitrogen utilization in yeast to be
consistent with gene expression data whereas highly interconnected
components such as carbon utilization regulation are typically
inconsistent. However, not every isolated sub-network is consistent
indicating that the network reconstruction may be incomplete and
these subnetworks may in fact be more strongly connected to other
parts of the network than is currently known.
[0196] Taken together, the results shown here indicate that
combining information on known regulatory network structures with
gene expression data is a productive way to validate and expand
regulatory networks structures. It is important to note that,
because the overall level of consistency was generally found to be
low, genome-scale reconstruction of regulatory networks based on
gene expression data alone does not appear to be feasible, even if
large quantities of data is available as is the case for yeast. The
results show that different features of the network structure
influence consistency. In particular, we observe that network
elements involving repressors (pair-wise interactions, regulons)
are typically less consistent than those involving activators
indicating that reconstruction of these types of network components
would pose a challenge. Further, in yeast TRUs with four regulators
are generally more consistent than other types of TRUs indicating
that in such cases the known network structure appears to be
sufficiently complete whereas for the TRUs with fewer regulators
there may be regulators missing. The discovery of highly consistent
network subcomponents indicates that a gene expression data based
reconstruction of regulatory networks can be a powerful strategy
for particular subcomponents that are sufficiently isolated and for
which sufficient quantities of relevant data is available. Future
availability of other high-throughput data types such as
genome-wide DNA-binding site occupancy data (Ren, B. et al. Science
290:2306-9. (2000)) will further improve the prospects of such
reconstruction as additional data types can be used to resolve
inconsistencies. The full utilization of all high-throughput data
types, however, will require the combination prior biological
knowledge extracted from databases and literature with the
statistical analysis of the large-scale data sets. Thus, full
reconstruction of regulatory networks will rely on a combination of
`bottom-up` and `top-down` approaches with targeted prospective
experimentation to successively resolve inconsistencies between the
two. Ultimately, all such data types are expected to be reconciled
in the context of genome-scale in silico models of regulatory
networks that can be used to analyze, interpreted and ultimately
predict their function.
Example V
Iterative Refinement of a Regulatory Network Model
[0197] The purpose of this example is to illustrate how the methods
described can be used for regulatory network identification,
improvement and the identification of regulatory states in
regulatory or combined regulatory/metabolic models.
[0198] The "bottom-up" approach to genome-scale transcriptional
regulatory network model reconstruction is initiated by
incorporation of knowledge into a computational model to analyze,
interpret and predict phenotype. The process begins with first pass
reconstruction of metabolic and transcriptional regulatory networks
for the organism of interest. Reconstruction of such genome-scale
models has been described elsewhere in detail (Covert M. W.,
Schilling C. H., Famili I., Edwards J. S., Goryanin II, Selkov E.,
Palsson B. O. Trends Biochem Sci 26:179-86 (2001); Covert M. W.,
Schilling C. H., Palsson B., J Theor Biol 213:73-88 (2001)) and
leads to the representation of metabolic behavior as a linear
programming problem, with a matrix describing all known metabolic
reactions, and certain measured parameters (e.g., maximum uptake
rates, biomass composition) defined as constraints on the metabolic
system. Transcriptional regulatory behavior is represented as a set
of regulatory rules written as Boolean logic statements. These
rules are dependent on environmental and internal conditions and
determine the expression and/or repression of various metabolic
genes in the metabolic network.
[0199] The regulatory and metabolic models are integrated as the
outcomes of the logic statements impose time-dependent constraints
on the metabolic linear programming problem. The outcome of the
linear programming problem is then used to recalculate
environmental conditions (Varma A., Palsson B. O., Appl Environ
Microbiol. 60:3724-31 (1995); Covert M. W., Schilling C. H.,
Palsson B. J Theor Biol. 213:73-88 (2001)), and the Boolean logic
equations are reevaluated.
[0200] The Boolean logic rules are derived from the primary
literature to represent the conditions required for expression of a
particular gene or set of genes. Experimental studies are examined
to obtain a set of potential transcription factors for all known
promoters of expression of a particular target gene. The presence
of multiple promoters from which transcription may occur indicates
an OR relationship, and the presence of two interacting
transcription factors which effect one promoter indicates an AND
relationship. For example, if gene A has two promoters, one of
which is activated by transcription factor X and the other which is
repressed by the integrated product of transcription factors Y and
Z, then a rule may be derived which states that A is transcribed IF
(X) OR NOT (Y AND Z).
[0201] Such a model is in process of being built for E. coli. For
this organism, a genome-scale metabolic network model had already
been reconstructed (Edwards J. S., Palsson B. O., Proc Natl Acad
Sci USA. 97:5528-33 (2000)). The regulatory network model was first
implemented for core metabolic processes. The first combined
metabolic/regulatory model accounts for 149 genes, the products of
which include 16 regulatory proteins and 73 enzymes. These enzymes
catalyze 113 reactions, 45 of which are controlled by
transcriptional regulation. The combined metabolic/regulatory model
can predict the ability of mutant E. coli strains to grow on
defined media, as well as time courses of cell growth, substrate
uptake, metabolic by-product secretion and qualitative gene
expression under various conditions, as indicated by comparison to
experimental data under a variety of environmental conditions. The
in silico model may also be used to interpret dynamic behaviors
observed in cell cultures (Covert M. W., Palsson B. O., J Biol Chem
277:28058-64 (2002)).
[0202] When integrated as mentioned above, the regulatory/metabolic
models represent a first-pass reconstruction and may be used for
the generation of testable hypotheses (see FIG. 16). First, a
phenotypic or behavioral shift of interest must be specified for a
particular organism (e.g., glucose-lactose diauxie in E. coli), as
well as important regulatory genes. The regulatory/metabolic model
may then be used to simulate behavior of the wild type strain over
the course of the shift, as well as behavior of knockout and/or
mutant strains of the relevant regulatory genes. These simulations
represent hypotheses about the growth behavior, substrate uptake,
by-product secretion, and gene expression over the course of the
shift for each strain.
[0203] Strains of the organism are then obtained and/or constructed
to build a full complement of the wild type as well as all
corresponding knockout strains. Each strain is then cultured to
monitor experimentally the shift in question. Rates of growth,
uptake and secretion as well as gene expression are monitored over
the course of the shift using practices that are well known in the
art (Ideker T., Thorsson V., Ranish J. A., Christmas R., Buhler J.,
Eng J. K., Bumgarner R., Goodlett D. R., Aebersold R., Hood L.
Science 294:929-34 (2001)).
[0204] Once the necessary experimental data has been obtained, the
experimental outcomes are compared rigorously to the
computationally-generated data. This comparison will lead to (1)
validation of certain regulatory relationships described by the
model; (2) the identification of regulatory relationships included
in the model but for which the experimental results were
contradictory; and (3) the identification of regulatory
relationships which were not previously known which must be
incorporated into the model. Both (2) and (3) represent areas where
the model may be improved.
[0205] Many genes are regulated by more than one transcription
factor in certain organisms. Such genes correspond to complex
Boolean logic rules, which must obtained by further
experimentation. Specifically, for genes which are shown by the
process above to be regulated by more than one transcription
factor, the multiple knockout strains may be constructed, in which
to determine complex interactions. If two transcription factors are
required to affect the regulation of a gene, they have an AND
relationship; if only one factor is required they have an OR
relationship.
[0206] The method is applied to the study of anaerobiosis in E.
coli (FIG. 16). A large-scale model of metabolism and
transcriptional regulation was generated for E. coli previously
(Covert M W, Palsson B O, J Bial Chem 277:28058-64 (2002)). This
model will be built up to the genome-scale (currently in progress)
and used to generate predictions about growth, uptake and secretion
rates as well as gene expression of E. coli under conditions of
aerobic and anaerobic growth in glucose minimal media. Six
strains--the appy, soxS, oxyR, fnr and arcA knockout strains as
well as the wild type--will be grown in batch culture as described
above, with growth, uptake and secretion monitored continually. A
sample will be taken at mid-log phase from which the mRNA will be
extracted and analyzed using Affymetrix Gene Chip technology. From
this data, the model will be evaluated both in terms of regulation
(e.g., its ability to predict gene induction/repression) and
metabolism (e.g., its ability to predict growth behavior of the
wild type and mutant strains). This information will then be used
to iteratively improve the model in terms of anaerobiosis
prediction.
[0207] All journal article, reference and patent citations provided
above, in parentheses or otherwise, whether previously stated or
not, are incorporated herein by reference in their entirety.
[0208] Although the invention has been described with reference to
the examples provided above, it should be understood that various
modifications can be made without departing from the spirit of the
invention.
Example VI
Iterative Refinement of a Regulatory Network Model Via a Systematic
Model Improvement Algorithm
[0209] The purpose of this example is to illustrate the importance
of the systematic approach described above and depicted in FIG. 2b
to converge quickly on the best model of a biological process.
Although a hypothetical regulatory network is used here as an
example, this process is equally applicable to metabolic networks,
signaling pathways, protein interaction networks and any other
biological processes.
[0210] A skeleton network of core metabolism was formulated earlier
(Covert M W, Schilling C H, Palsson B. J Theor Biol. 213:73-88
(2001)). It includes 20 reactions, 7 of which are governed by
regulatory logic. This network is a highly simplified
representation of core metabolic processes (e.g., glycolysis, the
pentose phosphate pathway, TCA cycle, fermentation pathways, amino
acid biosynthesis and cell growth), along with corresponding
regulation (e.g., catabolite repression, aerobic/anaerobic
regulation, amino acid biosynthesis regulation and carbon storage
regulation). A schematic of this skeleton network is shown in FIG.
18, together with a table containing all of the relevant chemical
reactions and regulatory rules which govern the transcriptional
regulation. In terms of FIG. 2b, this network will be considered
the actual experimental system which is to be characterized.
[0211] To the right of the experimental system in FIG. 18 is the
model of the experimental system. The model is fairly complete,
with one exception: the regulation of R5a in the model has not been
correctly characterized, with no regulatory rule given (i.e., the
reaction is expressed under all conditions).
[0212] A statement of scope and accuracy is determined for the
model; namely, that the model will model the entire transcriptional
regulatory component of the system qualitatively, using Boolean
logic, where a "1" indicates that the gene corresponding to a given
reaction has been expressed and a "0" indicates that the gene has
been down-regulated. The experiments of interest are growth of the
system on metabolite Carbon2 under aerobic and anaerobic
conditions. For this example, the criterion for the desired
accuracy of the model is that the model error, calculated as the
sum of the squared difference between the observed and predicted
expression of all regulated genes in the system, is equal to
zero.
[0213] In Phase I of the process, an experiment is run with Carbon2
and Oxygen available to the system. The expression of the regulated
genes in the experimental and model system are calculated and shown
in FIG. 19. The model error is equal to zero in this case,
indicating that the experimental data and the model predictions
agree completely in this case.
[0214] Next, an experiment is run with Carbon2, but not Oxygen,
available to the system. In this case, there is a discrepancy
between the observed and calculated expression of T5a, resulting in
an error of one. Because the model error is greater than allowed by
the stated criterion, a procedure is implemented to alter the
composition of the mathematical model in such a way that the model
error is minimized under the given experimental conditions. The
procedure used in this case is developed with the following
assumption: the regulation of T5a depends on only one of the known
regulatory proteins (RPc1, RPb, RPh, and RPO2) in the system. The
procedure is therefore as follows: (1) Obtain the activity of each
protein as predicted by the model, (2) for each protein, generate a
rule based on the activity of the given protein which results in
the correct expression value for T5a, (3) recalculate the overall
expression array for the regulated genes, (4) evaluate the
difference between the criterion for model accuracy by determining
the new model error, and (5) choose the model(s) with the lowest
error as the new model for future iterations.
[0215] The activity of the regulatory proteins under the given
conditions are: RPc1=0, RPb=0, RPh=1, RPO2=1. For T5a to have a
value of zero, the rules which could be implemented are therefore:
T5a=IF (RPc1), T5a=IF (RPb), T5a=IF NOT (RPh), and T5a=IF NOT
(RPO2). The error of the model is calculated with each new rule;
and the new models all have an error of zero, as shown in FIG. 19
(Phase III). As a result, one of the models (with new rule T5a=IF
(RPc1), for example) is picked arbitrarily and the other equivalent
solutions are stored.
[0216] The new model may then be reevaluated with data in the
Phenotypic database. For this example, data from the experiment
where Carbon2 and Oxygen were available to the system is compared
to the predictions of the new model. The new model has an error
with respect to these conditions (shown in Phase IV of FIG. 19); as
the other alternative solutions are considered, only the model with
new rule T5a=IF NOT (RPO2) fits the data with zero error. This
model is kept for future iterations.
[0217] The process suggests a new experiment to further
characterize the regulatory network: specifically, creating a RPO2
knockout strain of the system and testing the ability of the
knockout strain to grow where Carbon2 is available but Oxygen is
not. As shown in FIG. 19, the model predictions and experimental
data are also in agreement for this experiment.
[0218] The model has therefore been used to drive an experimental
process where new data has been generated to improve model
predictions and better characterize the experimental system itself,
as well to suggest a new round of experiments which can be
performed to gain further knowledge and insight.
Example VII
Decomposing Steady State Flux Distributions into Extreme Pathways
Using the Alpha-Cone Method
[0219] This example shows how an arbitrary steady state
phenomenological flux distribution can be decomposed in a
principled fashion into systemic pathways (here extreme pathways)
to identify operational pathways in a biosystem. The alpha-cone
decomposition method allows identifying the range of systemic
pathway weightings for a given flux distribution as well as
defining the minimal set of systemic pathways required to describe
a phenomenological pathway. This minimal set of systemic pathways
together with the range of possible weightings of these pathways
defines the operational pathways of the biosystem.
[0220] The sample metabolic network used for this analysis has been
published previously (Covert M. W., Schilling C. H., Palsson B. J
Theor Biol 213:73-88 (2001)). The network consists of 20 reactions
and 16 internal metabolites. The example network was designed to
mirror some of the core metabolic processes such as glycolysis, the
citric acid cycle, and respiration. The extreme pathways of this
network were calculated previously (Covert M W & Palsson B O. J
Theor Biol 216 (2003)). The network has 80 Type I extreme pathways
that are included in this analysis. Each extreme pathway, pi, was
scaled to its maximum possible flux based on the maximum value of
the uptake reactions (V.sub.max). A matrix P is then formed using
p.sub.i (i=1 . . . n, where n is the number of extreme pathways for
the system) as its columns.
[0221] To mimic phenomenological flux distributions produced by
experimental measurements the steady state flux distributions for
this network were calculated using the well-established technique
of flux balance analysis (FBA). For the purposes of this study,
unique steady state flux distributions were calculated for various
environmental conditions.
[0222] For a given phenomenological flux distribution the
decomposition weightings on the extreme pathways (denoted by
.alpha.) are not usually unique. The rank of the P matrix
determines the number of consistent equations and is usually
smaller than the number of extreme pathways, resulting in extra
degrees of freedom. This results in an "alpha space" of allowable
extreme pathway weightings. In order to elucidate the range of
possible alpha values that could contribute to the steady state
solution, the alpha-spectrum was developed based on the equation
P..alpha.=v where P is a matrix of extreme pathway vectors (extreme
pathways are the columns, reactions are the rows), .alpha. is a
vector of alpha weightings on the pathways and v an arbitrary
steady state flux distribution that is to be decomposed. For each
individual extreme pathway defined for the network, the alpha
weighting for that pathway was both maximized and minimized using
linar programming while leaving all other extreme pathway alpha
weightings free. This resulted in an allowable alpha range for each
extreme pathway. The results were then plotted on a 2-dimensional
graph with the extreme pathways on the x-axis and the range of
alpha weightings on the y-axis. Since the pathways are normalized
to V.sub.max, the alpha weightings correspond to a percentage usage
of each extreme pathway. Some extreme pathways are not used while
others can have a range of alpha weightings.
[0223] In addition to defining the alpha-spectrum, mixed integer
linear programming (MILP) (Williams, H P Model building in
mathematical programming. Chichester; New York, Wiley (1990)) was
used to find the minimum number of extreme pathways that were
needed to describe a given phenomenological flux distribution in
cases where multiple pathway combinations exist. The usage of a
specific extreme pathway was represented by a Boolean variable
(.beta..sub.j which was assumed to have a value of 1 when the
corresponding pathway is used and zero when the pathway is not
used. The sum of all Boolean variables representing pathway usage
was minimized to obtain the alpha weightings corresponding to the
case where the least number of pathways was used. The corresponding
optimization problem can be formally described as:
Min i = 1 n p .beta. i ##EQU00001## P .alpha. = v ##EQU00001.2## 0
.ltoreq. .alpha. i .ltoreq. .beta. i ##EQU00001.3##
where .beta. is the vector of the Boolean variables corresponding
to the pathway usage and a is the vector of the pathway weightings.
The solution is a set of alpha weightings such that the minimum
number of extreme pathways are used to obtain the decomposition of
the desired phenomenological flux distribution.
[0224] The methods described above were applied to the case of
aerobic growth with no regulation included. This case was
essentially unrestricted as all possible substrates (Carbon 1,
Carbon 2, F, H, and Oxygen) were provided to the network. The
resulting flux distribution computed using FBA can be seen in FIG.
20A. The calculated alpha-spectrum shows that of the 80 Type I
pathways, only 13 could be used in reconstructing the aerobic flux
distribution (FIG. 20B). Pathway 52 can range from 0 to 1 (0 to
100% of its maximum possible usage). Pathway 36 must be used as
indicated by the non-zero minimum alpha value. The remaining 11
pathways vary from 0 to various sub-maximum values. An MILP
analysis was done to determine the minimum number of pathways
needed to produce the aerobic steady state flux distribution. When
the MILP was solved without additional constraints, P36 was used to
its maximum capacity (100%) with sub-maximal contributions from
pathways 48, 38, 66, and 8. Interestingly, when the network was
forced to maximally use the pathway with the greatest alpha range
(P52), pathway 36 was also used, albeit sub-maximally, along with
pathways 12, 32, and 60. Note that with the exception of P36, which
has a non-zero minimum possible weighting and thus has to be used
in all possible solutions, there are no pathways in common between
the two sets of MILP solutions (FIG. 20C).
[0225] While the alpha-cone method was demonstrated above for a
flux distribution obtained through an FBA calculation, it is be
possible to use experimentally determined metabolic flux data in
the analysis as well. Even given partial or fragmented flux data,
it will be possible to determine the candidate alpha-spectrum and
hence obtain the operational pathways active in a cell in a given
external condition.
TABLE-US-00001 TABLE 1 Reaction/Gene Enzyme Name Reaction Membrane
Transport Phosphotransferase system pts GLCxt + PEP.fwdarw. G6P +
PYR Succinate transport SUCC trx SUCCxt SUCC Acetate transport AC
trx ACxt AC Ethanol transport ETH trx ETHxt ETH Oxygen transport O2
trx O2xt O2 Carbon dioxide transport CO2 trx CO2xt CO2 Phosphate
transport Pi trx PIxt PI Glycolysis Phosphoglucose isomerase pgi
G6P F6P Phosphofructokinase pfkA F6P + ATP.fwdarw. FDP + ADP
Fructose-1,6-bisphosphatase fbp FDP .fwdarw. F6P + PI
Fructose-1,6-bisphosphatate aldolase fba FDP T3P1 + T3P2
Triosphosphate Isomerase tpiA T3P2 T3P1 Glyceraldehyde-3-phosphate
dehydrogenase gapA T3P1 + PI + NAD NADH + 13PDG Phosphoglycerate
kinase pgk 13PDG + ADP 3PG + ATP Phosphoglycerate mutase 1 gpmA 3PG
2PG Enolase eno 2PG PEP Pyruvate Kinase II pykA PEP + ADP.fwdarw.
PYR + ATP Phosphoenolpyruvate synthase ppsA PYR + ATP.fwdarw. PEP +
AMP + PI Pyruvate dehydrogenase aceE PYR + COA + NAD.fwdarw. NADH +
CO2 + ACCOA Pentose Phosphate Shunt Glucose
6-phosphate-1-dehydrogenase zwf G6P + NADP D6PGL + NADPH
6-Phosphogluconolactonase pgl D6PGL.fwdarw. D6PGC
6-Phosphogluconate dehydrogenase gnd D6PGC + NADP NADPH + CO2 +
RL5P Ribose-5-phosphate isomerase A rpiA RL5P R5P Ribulose
phosphate 3-epimerase rpe RL5P X5P Transketolase I tldA1 R5P + X5P
T3P1 + S7P Transaldolase B talA T3P1 + S7P E4P + F6P Transketolase
II tktA2 X5P + E4P F6P + T3P1 TCA cycle Citrate synthase gltA ACCOA
+ OA.fwdarw. COA + CIT Aconitase A acnA CIT ICIT Isocitrate
dehydrogenase icdA ICIT + NADP CO2 + NADPH + AKG 2-Ketoglutarate
dehyrogenase sucA AKG + NAD + COA.fwdarw. CO2 + NADH + SUCCOA
Succinyl-CoA synthetase sucC SUCCOA + ADP + PI ATP + COA + SUCC
Succinate dehydrogenase sdhA1 SUCC + FAD.fwdarw. FADH + FUM
Fumurate reductase frdA FUM + FADH.fwdarw. SUCC + FAD Fumarase A
fumA FUM MAL Malate dehydrogenase mdh MAL + NAD NAPH + OA
Dissimttation of Pyruvate Acetaldehyde dehydrogenase adhE ACCOA + 2
NADH.fwdarw. ETH + 2 NAD + COA Phosphotransacetylase pta ACCOA + PI
ACTP + COA Acetate kinase A ackA ACTP + ADP ATP + AC Anapleurotic
Reactions Phosphoenolpyruvate carboxykinase pckA OA + ATP.fwdarw.
PEP + CO2 + ADP Phosphoenolpyruvate carboxylase ppc PEP +
CO2.fwdarw. OA + PI Energy/Redox Metabolism NADH dehydrogenase I
nuoA NADH + Q.fwdarw. NAD + QH2 + 2 HEXT Cytochrome oxidase bo3
cyoA QH2 + 1/2 O2.fwdarw. Q + 2 HEXT Pyridine nucleotide
transhydrogenase pntA NADPH + NAD.fwdarw. NADP + NADH Succinate
dehydrogenase complex sdhA2 FADH + Q.fwdarw. FAD + QH2 F0F1-ATPase
atpABCDEFGHI ADP + PI + 3 HEXT.fwdarw. ATP Adenylate kinase adk ATP
+ AMP 2 ADP ATP drain ATP dr ATP .fwdarw. ADP + PI Growth Flux
Growth flux GRO 41.3 ATP + 3.5 NAD + 18.2 NADPH + 0.2 G6P + 0.1 F6P
+ 0.9 R5P + 0.4 E4P + 0.1 T3P1 + 1.5 3PG + 0.5 PEP + 2.8 PYR + 3.7
ACCOA + 1.8 OA + 1.1 AKG.fwdarw. 41.3 ADP + 41.3 PI + 3.5 NADH +
18.2 NADP + 3.7 COA + 1.0 BIOMASS Exchange Fluxes Glucose external
GLCxt GLCxt.fwdarw. Succinate external SUCCxt SUCCxt.fwdarw.
Ethanol external ETHxt ETHxt.fwdarw. Acetate external ACxt
ACxt.fwdarw. Biomass drain BIOMASS BIOMASS.fwdarw. Phosphate
external PIxt PIxt.fwdarw. Carbon dioxide external CO2xt
CO2xt.fwdarw. Oxygen external O2xt O2xt.fwdarw.
TABLE-US-00002 TABLE 2 Exchange Fluxes Pathway SUCCxt/ ETHxt/ ACxt/
GRO/ PIxt/ CO2xt/ O2xt/ Number SUCCxt SUCCxt SUCCxt SUCCxt SUCCxt
SUCCxt SUCCxt Net Pathway Reaction Balance 33 -1.000 0 0 0.051
-0.188 1.825 -1.267 SUCCxt + 0.188 PIxt + 1.267 O2xt ---> 0.051
GRO + 1.825 CO2xt 30 -1.000 0 0 0.034 -0.125 2.553 -2.014 SUCCxt +
0.125 PIxt + 2.014 O2xt ---> 0.034 GRO + 2.553 CO2xt 32 -1.000 0
0 0.033 -0.121 2.600 -2.062 SUCCxt + 0.121 PIxt + 2.062 O2xt
---> 0.033 GRO + 2.6 CO2xt 34 -1.000 0 0 0.049 -0.182 1.895
-1.338 SUCCxt + 0.182 PIxt + 1.338 O2xt ---> 0.049 GRO + 1.895
CO2xt 22 -1.000 0 0 0.032 -0.117 2.644 -2.108 SUCCxt + 0.117 PIxt +
2.108 O2xt ---> 0.032 GRO + 2.644 CO2xt 14 -1.000 0 0 0.031
-0.114 2.679 -2.144 SUCCxt + 0.114 PIxt + 2.144 O2xt ---> 0.031
GRO + 2.679 CO2xt 18 -1.000 0.549 0 0.025 -0.092 1.837 -0.759
SUCCxt + 0.092 PIxt + 0.759 O2xt ---> 0.025 GRO + 1.837 CO2xt +
0.549 ETHxt 31 -1.000 0 0.158 0.047 -0.172 1.696 -1.142 SUCCxt +
0.172 PIxt + 1.142 O2xt ---> 0.047 GRO + 1.696 CO2xt + 0.158
ACxt 10 -1.000 0 0 0 0 4.000 -3.500 SUCCxt + 3.5 O2xt ---> 4.0
CO2xt 5 -1.000 0 1.000 0 0 2.000 -1.500 SUCCxt + 1.5 O2xt --->
2.0 CO2xt + 1.0 ACxt 2 -1.000 1.000 0 0 0 2.000 -0.500 SUCCxt + 0.5
O2xt ---> 2.0 CO2xt + 1.0 ETHxt 26 -1.000 0 0 0 0 4.000 -3.500
SUCCxt + 3.5 O2xt ---> 4.0 CO2xt
TABLE-US-00003 TABLE 3 Angles Path- Diff Path- Net Path- (degree)
way # Fluxes (%) way # Diff (%) way # 4.62E-05 P_33 0 P_33 5.84E-05
P_33 4.9 P_32 3.5 P_34 11.3 P_32 11.1 P_30 5.3 P_22 22.4 P_30 22.6
P_31 5.3 P_30 36.8 P_31 25.2 P_34 8.8 P_14 67.1 P_2 26.1 P_22 8.8
P_31 67.8 P_34 27.0 P_14 10.5 P_32 70.7 P_5 27.7 P_18 14.0 P_18
72.2 P_22 38.8 P_10 22.8 P_26 76.3 P_14 40.1 P_5 38.6 P_10 79.6
P_18 40.4 P_26 52.6 P_2 177.0 P_10 41.5 P_2 52.6 P_5 233.1 P_26
TABLE-US-00004 TABLE 4 Angles Path- Diff Path- Net Path- (degree)
way # Fluxes (%) way # Diff (%) way # 2.3 P_32 3.5 P_32 5.2 P_32
2.6 P_33 7.0 P_33 6.4 P_33 10.9 P_30 10.5 P_34 25.1 P_30 21.9 P_31
12.3 P_22 35.8 P_31 25.9 P_34 12.3 P_30 66.9 P_2 26.9 P_22 15.8
P_14 71.4 P_5 27.8 P_14 15.8 P_31 75.3 P_34 28.4 P_18 21.1 P_18
79.9 P_22 39.5 P_10 29.8 P_26 84.1 P_14 39.5 P_5 45.6 P_10 87.5
P_18 40.5 P_26 45.6 P_5 187.6 P_10 41.0 P_2 59.6 P_2 241.7 P_26
TABLE-US-00005 TABLE 5 Nucleotide Amino Acid Variant Type Change
Change Result B+ normal none none A+ non-chronic 376 A.fwdarw.G
Asn.fwdarw.Asp polar to acidic A- non-chronic 376 A.fwdarw.G
Asn.fwdarw.Asp polar to acidic 202* G.fwdarw.A Val.fwdarw.Met
nonpolar to nonpolar Mediterranean non-chronic 563 C.fwdarw.T
Ser.fwdarw.Phe polar to nonpolar Tsukui chronic 561-563 del 188/189
del Minnesota chronic 637 G.fwdarw.T 213 Val .fwdarw. Leu nonpolar
to nonpolar Asahikawa chronic 695 G.fwdarw.A 232 Cys.fwdarw.Tyr
slightly polar to nonpolar Durham chronic 713 A.fwdarw.G 238
Lys.fwdarw.Arg basic to basic Wayne chronic 769 C.fwdarw.G 257
Arg.fwdarw.Gly basic to nonpolar Loma Linda chronic 1089 C.fwdarw.A
363 Asn.fwdarw.Lys polar to basic Tomah chronic 1153 T.fwdarw.C 385
Cys.fwdarw.Arg slightly polar to basic Iowa chronic 1156 A.fwdarw.G
386 Lys.fwdarw.Glu basic to acidic Walter Reed chronic 1156
A.fwdarw.G 386 Lys.fwdarw.Glu basic to acidic Iowa City chronic
1156 A.fwdarw.G 386 Lys.fwdarw.Glu basic to acidic Springfield
chronic 1156 A.fwdarw.G 386 Lys.fwdarw.Glu basic to acidic
Guadalajara chronic 1159 C.fwdarw.T 387 Arg.fwdarw.Cys basic to
slightly polar Iwate chronic 1160 G.fwdarw.A 387 Arg.fwdarw.His
basic to acidic/basic Niigata chronic 1160 G.fwdarw.A 387
Arg.fwdarw.His basic to acidic/basic Yamaguchi chronic 1160
G.fwdarw.A 387 Arg.fwdarw.His basic to acidic/basic Portici chronic
1178 G.fwdarw.A 393 Arg.fwdarw.His basic to acidic/basic Alhambra
chronic 1180 G.fwdarw.C 394 Val.fwdarw.Leu nonpolar to nonpolar
Tokyo chronic 1246 G.fwdarw.A 416 Glu.fwdarw.Lys acidic to basic
Fukushima chronic 1246 G.fwdarw.A 416 Glu.fwdarw.Lys acidic to
basic Atlanta chronic 1284 C.fwdarw.A 428 Tyr.fwdarw.End Pawnee
chronic 1316 G.fwdarw.C 439 Arg.fwdarw.Pro basic to nonpolar
Morioka chronic 1339 G.fwdarw.A 447 Gly.fwdarw.Arg nonpolar to
basic
TABLE-US-00006 TABLE 6 Nucleotide Amino Acid Variant Change Change
Result Sassari 514 G.fwdarw.C Glu.fwdarw.Gln acidic to slightly
polar Parma not characterized -- -- 1456 C.fwdarw.T Arg.fwdarw.Trp
basic to nonpolar Soresina 1456 C.fwdarw.T Arg.fwdarw.Trp basic to
nonpolar 1552 C.fwdarw.A Arg.fwdarw.Ser basic to slightly polar
Milano 1456 C.fwdarw.T Arg.fwdarw.Trp basic to nonpolar Brescia
1042-1044 del Lys deleted basic deleted 1456 C.fwdarw.T
Arg.fwdarw.Trp basic to nonpolar Manatova 1168 G.fwdarw.A
Asp.fwdarw.Asn acidic to slightly polar
* * * * *