U.S. patent application number 14/124826 was published by the patent office on 2014-06-19 for systems and methods for network-based biological assessment.
This patent application is currently assigned to PHILLIP MORRIS PRODUCTS S.A. The applicants listed for this patent are Julia Hoeng, Florian Martin, Manuel Claude Peitsch, and Alain Sewer. Invention is credited to Julia Hoeng, Florian Martin, Manuel Claude Peitsch, and Alain Sewer.
Publication Number | 20140172398 |
Application Number | 14/124826 |
Family ID | 47295520 |
Publication Date | 2014-06-19 |
United States Patent Application | 20140172398 |
Kind Code | A1 |
Hoeng; Julia; et al. | June 19, 2014 |
SYSTEMS AND METHODS FOR NETWORK-BASED BIOLOGICAL ASSESSMENT
Abstract
Systems and methods are directed to computerized methods and one
or more computer processors for quantifying the perturbation of a
biological system in response to an agent. A set of treatment data
corresponding to a response of a biological system to an agent and
a set of control data are received. A computational causal network
model represents the biological system and includes nodes
representing biological entities, edges representing relationships
between the biological entities, and direction values representing
the expected direction of change between the control data and the
treatment data. Activity measures are calculated and represent a
difference between the treatment data and the control data, and
weight values are calculated for the nodes. A score for the
computational model is generated representative of the perturbation
of the biological system to the agent and is based on the direction
values, the weight values and the activity measures.
Inventors: | Hoeng; Julia; (Corcelles, CH); Martin; Florian; (Peseux, CH); Peitsch; Manuel Claude; (Peseux, CH); Sewer; Alain; (Orbe, CH) |
Applicant: |
Name | City | State | Country | Type |
Hoeng; Julia | Corcelles | | CH | |
Martin; Florian | Peseux | | CH | |
Peitsch; Manuel Claude | Peseux | | CH | |
Sewer; Alain | Orbe | | CH | |
Assignee: | PHILLIP MORRIS PRODUCTS S.A. | Neuchatel | CH |
Family ID: | 47295520 |
Appl. No.: | 14/124826 |
Filed: | June 11, 2012 |
PCT Filed: | June 11, 2012 |
PCT NO: | PCT/EP2012/061035 |
371 Date: | March 3, 2014 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
61495824 | Jun 10, 2011 | |
61525700 | Aug 19, 2011 | |
Current U.S. Class: | 703/11 |
Current CPC Class: | G06N 5/02 20130101; G16B 5/00 20190201; G16H 50/50 20180101; G06N 7/06 20130101; G16B 40/00 20190201; G06N 7/005 20130101 |
Class at Publication: | 703/11 |
International Class: | G06F 19/12 20060101 G06F019/12 |
Foreign Application Data
Date | Code | Application Number |
Dec 22, 2011 | EP | 11195417.8 |
Claims
1. A computerized method for quantifying the perturbation of a
biological system in response to an agent, comprising receiving, at
a first processor, a set of treatment data corresponding to a
response of a biological system to an agent, wherein the biological
system includes or comprises a plurality of biological entities,
each biological entity interacting with at least one other of the
biological entities; receiving, at a second processor, a set of
control data corresponding to the biological system not exposed to
the agent; providing, at a third processor, a computational causal
network model that represents the biological system and includes or
comprises: nodes representing the biological entities, edges
representing relationships between the biological entities, and
direction values, for the nodes, representing the expected
direction of change between the control data and the treatment
data; calculating, with a fourth processor, activity measures, for
the nodes, representing a difference between the treatment data and
the control data; calculating, with a fifth processor, weight
values for the nodes, wherein at least one weight value is
different from at least one other weight value; and generating,
with a sixth processor, a score for the computational model
representative of the perturbation of the biological system to the
agent, wherein the score is based on the direction values, the
weight values and the activity measures.
2. The computerized method of claim 1, wherein the biological
system is represented by at least one mechanism hypothesis.
3. The computerized method of claim 1, wherein the biological
system is represented by a plurality of computational causal
network models or at least one computational causal network model
comprising a plurality of mechanism hypotheses.
4. The computerized method of claim 1, further comprising
normalizing the score based on the number of measurable nodes in
the respective computational model.
5. The computerized method of claim 1, wherein the weight values
represent a confidence in at least one of the set of treatment data
and control data.
6. The computerized method of claim 1, wherein the weight values
include or comprise local false non-discovery rates.
7. The computerized method of claim 1, further comprising
calculating, with a seventh processor, an approximate distribution
of the activity measures of nodes over a model or a mechanism
hypotheses in a model; calculating, with an eighth processor, an
expected value of activity measures with respect to the approximate
distribution; and generating, with a ninth processor, a score for
each computational model representative of the perturbation of the
subset of the biological system to the agent, wherein the score is
based on expected value.
8. The computerized method of claim 7, wherein the approximate
distribution is based on the activity measures.
9. The computerized method of claim 7, wherein calculating an
expected value comprises performing a rectangular
approximation.
10. The computerized method of claim 1, further comprising
calculating, with a tenth processor, a positive activation metric
and a negative activation metric based on the activity measures,
the positive and negative activation metrics representative of
consistency and inconsistency, respectively, between the activity
measures and the direction values with respect to the model; and
generating, with an eleventh processor, a score for each
computational model representative of the perturbation of the
subset of the biological system to the agent, wherein the score is
based on the positive and negative activation scores.
11. The computerized method of claim 1, wherein the positive
activation metric, negative activation metric or both are based on
local false non-discovery rates.
12. The computerized method of claim 1, wherein the activity
measure is a fold-change value, and the fold-change value for each
node includes or comprises a logarithm of the difference between
the treatment data and the control data for the biological entity
represented by the respective node.
13. The computerized method of claim 1, wherein the subset of the
biological system includes or comprises at least one of cell
proliferation mechanism, cellular stress mechanism, cell
inflammation mechanism, and DNA repair mechanism.
14. The computerized method of claim 1, wherein the agent includes
or comprises at least one of aerosol generated by heating tobacco,
aerosol generated by combusting tobacco, tobacco smoke or cigarette
smoke.
15. The computerized method of claim 1, wherein the agent includes
or comprises a heterogeneous substance, including a molecule or an
entity that is not present in or derived from the biological
system.
16. The computerized method of claim 1, wherein the agent includes
or comprises toxins, therapeutic compounds, stimulants, relaxants,
natural products, manufactured products, and food substances.
17. The computerized method of claim 1, wherein the set of
treatment data includes or comprises a plurality of sets of
treatment data such that each measurable node includes or comprises
a plurality of fold-change values defined by a first probability
distribution and a plurality of weight values defined by a second
probability distribution.
18. The computerized method of claim 1, wherein the set of
treatment data includes or comprises a plurality of sets of
treatment data such that each measurable node includes or comprises
a plurality of fold-change values and the corresponding weight
values.
19. The computerized method of claim 1, wherein the step of
generating the score comprises a linear or a non-linear combination
of the activity measures, the weight values, and the direction
values; and a normalization of the combination by a scale
factor.
20. The computerized method of claim 19, wherein the combination is
an arithmetic combination, and the scale factor is the square root
of the number of biological entities for which measured data are
received.
21. The computerized method of claim 1, wherein the score is
generated by a geometric perturbation index scoring technique, a
probabilistic perturbation index scoring technique, or an expected
perturbation index scoring technique.
22. The computerized method of claim 1, further comprising
determining a confidence interval for the score based on a
parametric or non-parametric computational bootstrapping
technique.
23. A computer system for quantifying the perturbation of a
biological system in response to an agent, the system comprising at
least one processor configured or adapted to: receive a set of
treatment data corresponding to a response of a biological system
to an agent, wherein the biological system includes or comprises a
plurality of biological entities, each biological entity
interacting with at least one other of the biological entities;
receive a set of control data corresponding to the biological
system not exposed to the agent; provide a computational causal
network model that represents the biological system and includes or
comprises: nodes representing the biological entities, edges
representing relationships between the biological entities, and
direction values, for the nodes, representing the expected
direction of change between the control data and the treatment
data; calculate activity measures, for the nodes, representing a
difference between the treatment data and the control data;
calculate weight values for the nodes, wherein at least one weight
value is different from at least one other weight value; and
generate a score for the computational model representative of the
perturbation of the biological system to the agent, wherein the
score is based on the direction values, the weight values and the
activity measures.
24. (canceled)
25. (canceled)
Description
BACKGROUND
[0001] The human body is constantly perturbed by exposure to
potentially harmful agents that can pose severe health risks in the
long-term. Exposure to these agents can compromise the normal
functioning of biological mechanisms internal to the human body. To
understand and quantify the effect that these perturbations have on
the human body, researchers study the mechanism by which biological
systems respond to exposure to agents. Some groups have extensively
utilized in vivo animal testing methods. However, animal testing
methods are not always sufficient because there is doubt as to
their reliability and relevance. Numerous differences exist in the
physiology of different animals. Therefore, different species may
respond differently to exposure to an agent. Accordingly, there is
doubt as to whether responses obtained from animal testing may be
extrapolated to human biology. Other methods include assessing risk
through clinical studies of human volunteers. But these risk
assessments are performed a posteriori and, because diseases may
take decades to manifest, these assessments may not be sufficient
to elucidate mechanisms that link harmful substances to disease.
Yet other methods include in vitro experiments. Although in vitro
cell and tissue-based methods have received general acceptance as
full or partial replacement methods for their animal-based
counterparts, these methods have limited value. Because in vitro
methods are focused on specific aspects of cell and tissue
mechanisms, they do not always take into account the complex
interactions that occur in the overall biological system.
[0002] In the last decade, high-throughput measurements of nucleic
acid, protein and metabolite levels, in conjunction with traditional
dose-dependent efficacy and toxicity assays, have emerged as a
means for elucidating mechanisms of action of many biological
processes. Researchers have attempted to combine information from
these disparate measurements with knowledge about biological
pathways from the literature to assemble meaningful biological
models. To this end, researchers have begun using mathematical and
computational techniques that can mine large quantities of data,
such as clustering and statistical methods, to identify possible
biological mechanisms of action.
[0003] Previous work has also explored the importance of uncovering
a characteristic signature of gene expression changes that results
from one or more perturbations to a biological process, and the
subsequent scoring of that signature's presence in additional data
sets as a measure of that process's specific activity amplitude.
Most work in this regard has involved identifying and scoring
signatures that are correlated with a disease phenotype. These
phenotype-derived signatures provide significant classification
power, but lack a mechanistic or causal relationship between a
single specific perturbation and the signature. Consequently, these
signatures may represent multiple distinct unknown perturbations
that, by often unknown mechanism(s), lead to, or result from, the
same disease phenotype.
[0004] One challenge lies in understanding how the activities of
various individual biological entities in a biological system
enable the activation or suppression of different biological
mechanisms. Because an individual entity, such as a gene, may be
involved in multiple biological processes (e.g., inflammation and
cell proliferation), measurement of the gene's activity is not
sufficient to identify the underlying biological process that
triggers the activity. None of the current techniques has been
applied to identify the underlying mechanisms responsible for the
activity of biological entities on a micro-scale, or to provide a
quantitative assessment of the activation of different biological
mechanisms in which these entities play a role, in response to
potentially harmful agents and experimental conditions.
Accordingly, there is a need for improved systems and methods for
analyzing system-wide biological data in view of biological
mechanisms, and quantifying changes in the biological system as the
system responds to an agent or a change in the environment.
SUMMARY
[0005] In one aspect, the systems and methods described herein are
directed to computerized methods and one or more computer
processors for quantifying the perturbation of a biological system
in response to an agent.
The computerized method comprises, in one aspect, receiving, at a
first processor, a set of treatment data corresponding to a
response of a biological system to an agent, wherein the biological
system includes or comprises a plurality of biological entities,
each biological entity interacting with at least one other of the
biological entities; receiving, at a second processor, a set of
control data corresponding to the biological system not exposed to
the agent; providing, at a third processor, a computational causal
network model that represents the biological system and includes or
comprises: nodes representing the biological entities, edges
representing relationships between the biological entities, and
direction values, for the nodes, representing the expected
direction of change between the control data and the treatment
data; calculating, with a fourth processor, activity measures, for
the nodes, representing a difference between the treatment data and
the control data; calculating, with a fifth processor, weight
values for the nodes, wherein at least one weight value is
different from at least one other weight value; and generating,
with a sixth processor, a score for the computational model
representative of the perturbation of the biological system to the
agent, wherein the score is based on the direction values, the
weight values and the activity measures. The biological system may
be represented by at least one mechanism hypothesis. The biological
system may be represented by a plurality of computational causal
network models or at least one computational causal network model
comprising a plurality of mechanism hypotheses. The method may
further comprise normalizing the score based on the number of
measurable nodes in the respective computational model. The weight
values may represent a confidence in at least one of the set of
treatment data and control data. The weight values may include or
comprise local false non-discovery rates. The method may further
comprise calculating, with a seventh processor, an approximate
distribution of the activity measures of nodes over a model or a
mechanism hypothesis in a model; calculating, with an eighth
processor, an expected value of activity measures with respect to
the approximate distribution; and generating, with a ninth
processor, a score for each computational model representative of
the perturbation of the subset of the biological system to the
agent, wherein the score is based on expected value. The
approximate distribution may be based on the activity measures. In
certain implementations, calculating an expected value may comprise
performing a rectangular approximation. The method may further
comprise calculating, with a tenth processor, a positive activation
metric and a negative activation metric based on the activity
measures, the positive and negative activation metrics
representative of consistency and inconsistency, respectively,
between the activity measures and the direction values with respect
to the model; and generating, with an eleventh processor, a score
for each computational model representative of the perturbation of
the subset of the biological system to the agent, wherein the score
is based on the positive and negative activation scores. The
positive activation metric, negative activation metric or both may
be based on local false non-discovery rates. The activity measure
may be a fold-change value, and the fold-change value for each node
includes or comprises a logarithm of the difference between the
treatment data and the control data for the biological entity
represented by the respective node. The subset of the biological
system may include or comprise at least one of cell proliferation
mechanism, cellular stress mechanism, cell inflammation mechanism,
and DNA repair mechanism. The agent may include or comprise at
least one of aerosol generated by heating tobacco, aerosol
generated by combusting tobacco, tobacco smoke or cigarette smoke.
The agent may include or comprise a heterogeneous substance,
including a molecule or an entity that is not present in or derived
from the biological system. The agent may include or comprise
toxins, therapeutic compounds, stimulants, relaxants, natural
products, manufactured products, and food substances. The set of
treatment data may include or comprise a plurality of sets of
treatment data such that each measurable node includes or comprises
a plurality of fold-change values defined by a first probability
distribution and a plurality of weight values defined by a second
probability distribution. The set of treatment data may include or
comprise a plurality of sets of treatment data such that each
measurable node includes or comprises a plurality of fold-change
values and the corresponding weight values. The step of generating
the score may comprise a linear or a non-linear combination of the
activity measures, the weight values, and the direction values; and
a normalization of the combination by a scale factor. The
combination may be an arithmetic combination, and the scale factor
is the square root of the number of biological entities for which
measured data are received. The score may be generated by a
geometric perturbation index scoring technique, a probabilistic
perturbation index scoring technique, or an expected perturbation
index scoring technique. The method may further comprise
determining a confidence interval for the score based on a
parametric or non-parametric computational bootstrapping technique.
In another aspect, a computer system for
quantifying the perturbation of a biological system in response to
an agent is also described. The system comprises at least one
processor configured or adapted to: receive a set of treatment data
corresponding to a response of a biological system to an agent,
wherein the biological system includes or comprises a plurality of
biological entities, each biological entity interacting with at
least one other of the biological entities; receive a set of
control data corresponding to the biological system not exposed to
the agent; provide a computational causal network model that
represents the biological system and includes or comprises: nodes
representing the biological entities, edges representing
relationships between the biological entities, and direction
values, for the nodes, representing the expected direction of
change between the control data and the treatment data; calculate
activity measures, for the nodes, representing a difference between
the treatment data and the control data; calculate weight values
for the nodes, wherein at least one weight value is different from
at least one other weight value; and generate a score for the
computational model representative of the perturbation of the
biological system to the agent, wherein the score is based on the
direction values, the weight values and the activity measures. The
biological system may be represented by at least one mechanism
hypothesis. The biological system may be represented by a plurality
of computational causal network models or at least one
computational causal network model comprising a plurality of
mechanism hypotheses. The computer system may further comprise
normalizing the score based on the number of scorable nodes in the
respective computational model. The weight values may represent a
confidence in at least one of the set of treatment data and control
data. The weight values may include or comprise local false
non-discovery rates. In certain implementations, the computer
system further comprises calculating an approximate distribution of
the activity measures of nodes over a model or a mechanism
hypothesis in a model; calculating, with an eighth processor, an
expected value of activity measures with respect to the approximate
distribution; and generating a score for each computational model
representative of the perturbation of the subset of the biological
system to the agent, wherein the score is based on expected value.
The approximate distribution may be based on the activity measures.
In certain implementations of the computer system, calculating an
expected value may comprise performing a
rectangular approximation. The system may further comprise
calculating a positive activation metric and a negative activation
metric based on the activity measures, the positive and negative
activation metrics representative of consistency and inconsistency,
respectively, between the activity measures and the direction
values with respect to the model; and generating a score for each
computational model representative of the perturbation of the
subset of the biological system to the agent, wherein the score is
based on the positive and negative activation scores. The positive
activation metric, negative activation metric or both may be based
on local false non-discovery rates. The activity measure may be a
fold-change value, and the fold-change value for each node may
include or comprise a logarithm of the difference between the
treatment data and the control data for the biological entity
represented by the respective node. The subset of the biological
system may include or comprise at least one of cell proliferation
mechanism, cellular stress mechanism, cell inflammation mechanism,
and DNA repair mechanism. The agent may include or comprise at
least one of aerosol generated by heating tobacco, aerosol
generated by combusting tobacco, tobacco smoke or cigarette smoke.
The agent may include or comprise a heterogeneous substance,
including a molecule or an entity that is not present in or derived
from the biological system. The agent may include or comprise
toxins, therapeutic compounds, stimulants, relaxants, natural
products, manufactured products, and food substances. The set of
treatment data may include or comprise a plurality of sets of
treatment data such that each measurable node includes or comprises
a plurality of fold-change values defined by a first probability
distribution and a plurality of weight values defined by a second
probability distribution. The set of treatment data may include or
comprise a plurality of sets of treatment data such that each
measurable node includes or comprises a plurality of fold-change
values and the corresponding weight values. The step of generating
the score may comprise a linear or a non-linear combination of the
activity measures, the weight values, and the direction values; and
a normalization of the combination by a scale factor. The
combination may be an arithmetic combination, and the scale factor
is the square root of the number of biological entities for which
measured data are received. The score may be generated by a
geometric perturbation index scoring technique, a probabilistic
perturbation index scoring technique, or an expected perturbation
index scoring technique. The system may further comprise
determining a confidence interval for the score based on a
parametric or non-parametric computational bootstrapping technique.
In certain aspects, the computerized method may comprise receiving,
at a first processor, a set of treatment data corresponding to a
response of a biological system to an agent, wherein the biological
system includes a plurality of biological entities, each biological
entity interacting with at least one other of the biological
entities, and receiving, at a second processor, a set of control
data corresponding to the biological system not exposed to the
agent. The computerized method may comprise providing, at a third
processor, a computational causal network model that represents the
biological system. The computational model may include or comprise
nodes representing the biological entities, edges representing
relationships between the biological entities, and direction
values, for the nodes, representing the expected direction of
change between the control data and the treatment data. The
computerized method may further comprise calculating, with a fourth
processor, activity measures, for the nodes, representing a
difference between the treatment data and the control data, and
calculating, with a fifth processor, weight values for the nodes,
wherein at least one weight value is different from at least one
other weight value. The computerized method may also comprise
generating, with a sixth processor, a score for the computational
model representative of the perturbation of the biological system
to the agent, wherein the score is based on the direction values,
the weight values and the activity measures. In certain
implementations, the computerized method further comprises
normalizing the score based on the number of nodes in the
respective computational model. In certain implementations, each of
the first through sixth processors is included or comprised within
a single processor or single computing device. In other
implementations, one or more of the first through sixth processors
are distributed across a plurality of processors or computing
devices.
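The computational causal network model described above can be pictured as a small data structure. The following is a minimal sketch only, assuming a hypothetical Python class named CausalNetworkModel, hypothetical node names (HYP, GENE_A, and so on), and direction values encoded as +1 for an expected increase and -1 for an expected decrease; none of these names or encodings are taken from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class CausalNetworkModel:
    """Hypothetical container for a computational causal network model."""

    # Node name -> direction value: +1 if the entity is expected to increase
    # between control and treatment, -1 if it is expected to decrease.
    directions: Dict[str, int]
    # (source, target) pairs representing relationships between entities.
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def measurable_nodes(self, measured: Set[str]) -> List[str]:
        """Return the model nodes for which measured data were received."""
        return [node for node in self.directions if node in measured]


# Illustrative model: one hypothesized cause (HYP) linked to three measured
# entities. All names are made up for this sketch.
model = CausalNetworkModel(
    directions={"GENE_A": +1, "GENE_B": -1, "GENE_C": +1},
    edges=[("HYP", "GENE_A"), ("HYP", "GENE_B"), ("HYP", "GENE_C")],
)

# Only nodes present in both the model and the data set are measurable.
measurable = model.measurable_nodes({"GENE_A", "GENE_C", "GENE_X"})
```

Keeping the direction values in a mapping keyed by node makes it straightforward to count the measurable nodes, which is the quantity the summary uses when normalizing a score.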
[0006] In certain implementations, the computational causal network
model includes or comprises a set of causal relationships that
exist between a node representing a potential cause and nodes
representing the measured quantities. In such implementations, the
activity measures may include a fold-change. The fold-change may be
a number describing how much a node measurement changes going from
an initial value to a final value between control data and
treatment data. The fold-change number may represent the logarithm
of the fold-change of the activity of the biological entity between
control condition and treatment condition. The activity measure for
each node may include or comprise a logarithm of the difference
between the treatment data and the control data for the biological
entity represented by the respective node. In such implementations,
the weight value may represent a weight to be given to the
fold-change value of the nodes. The weight value may represent the
known biological significance of the measured node with regard to a
feature or outcome of interest (e.g., a known carcinogen in cancer
studies). The weight value may represent a confidence in at least
one of the set of perturbation data and control data. More
particularly, the weight values may include or comprise local false
non-discovery rates. In such an implementation, the computerized
method may generate the score for the computational model by
multiplying the activity measure with the weight value and the
direction value and summing over the nodes. In certain
implementations, the computerized method includes or comprises
generating, with a processor, a confidence interval for each of the
generated scores. Generating the confidence interval may comprise
approximating a distribution of a generated score.
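The scoring step in this paragraph (multiply each node's activity measure by its weight value and direction value, then sum over the nodes) can be sketched as follows. This is an illustration under stated assumptions, not the applicants' implementation: the function names are hypothetical, the fold-change is taken as a base-2 logarithm of the treatment-to-control ratio, the weights are supplied as plain numbers, and the optional square-root normalization follows the scale factor mentioned in the summary.

```python
import math


def log2_fold_change(treatment, control):
    """Log ratio of treatment to control (base 2 is an assumption)."""
    return math.log2(treatment / control)


def perturbation_score(directions, weights, fold_changes, normalize=False):
    """Sum direction * weight * fold-change over the nodes with data.

    If normalize is True, divide by the square root of the number of
    nodes for which measured data were received.
    """
    nodes = [n for n in directions if n in weights and n in fold_changes]
    total = sum(directions[n] * weights[n] * fold_changes[n] for n in nodes)
    if normalize and nodes:
        total /= math.sqrt(len(nodes))
    return total


# Illustrative inputs only; the values are made up for this sketch.
directions = {"A": +1, "B": -1, "C": +1}
weights = {"A": 1.0, "B": 0.5, "C": 0.25}
fold_changes = {
    "A": log2_fold_change(8.0, 2.0),  # 2.0, consistent with direction +1
    "B": log2_fold_change(1.0, 4.0),  # -2.0, consistent with direction -1
    "C": log2_fold_change(4.0, 4.0),  # 0.0, no change
}

raw_score = perturbation_score(directions, weights, fold_changes)
scaled_score = perturbation_score(directions, weights, fold_changes,
                                  normalize=True)
```

With these inputs the raw score is 1(1.0)(2.0) + (-1)(0.5)(-2.0) + 1(0.25)(0.0) = 3.0. A node whose measured change agrees with its direction value contributes positively, and the weight scales how much that node counts.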
[0007] In another aspect, the systems and methods described herein
are directed to computerized methods for quantifying the
perturbation of a biological system in response to an agent. The
computerized method may comprise receiving, at a first processor, a
set of treatment data corresponding to a response of a biological
system to an agent, wherein the biological system includes or
comprises a plurality of biological entities, each biological
entity interacting with at least one other of the biological
entities, and receiving, at a second processor, a set of control
data corresponding to the biological system not exposed to the
agent. The computerized method may comprise providing, at a third
processor, a computational causal network model that represents the
biological system. The computational model may include or comprise
nodes representing the biological entities, edges representing
relationships between the biological entities, and direction
values, for the nodes, representing the expected direction of
change between the control data and the treatment data. The
computerized method may further comprise calculating, with a fourth
processor, activity measures, for the nodes, representing a
difference between the treatment data and the control data, and
calculating, with a fifth processor, an approximate distribution of
the activity measures over the node. The computerized method may
also include or comprise calculating, with a sixth processor, an
expected value of the approximate distribution. The computerized
method may also comprise generating, with a seventh processor, a
score for each computational model representative of the
perturbation of the subset of the biological system to the agent,
wherein the score is based on the expected value. In certain
implementations, each of the first through seventh processors is
included or comprised within a single processor or single computing
device. In other implementations, one or more of the first through
seventh processors are distributed across a plurality of processors
or computing devices.
[0008] In certain implementations, the computational causal network
model includes or comprises a set of causal relationships that
exist between a node representing a potential cause and nodes
representing the measured quantities. In such implementations, the
activity measures may include or comprise a fold-change. The
fold-change may be a number describing how much a node measurement
changes going from an initial value to a final value between
control data and treatment data. The fold-change number may
represent the logarithm of the fold-change of the activity of the
biological entity between control condition and treatment
condition. The computerized method may include or comprise
generating, with a processor, a range for the fold-change density,
which may represent an approximation of the set of values that the
fold-change values can take in the biological system under the
treatment conditions. The processor may generate an approximate
fold-change density, which may include or comprise an approximate
probability distribution of fold-change values. In such
implementations, the computerized method further includes or
comprises calculating the approximate expected value of the
approximate fold-change density. The computerized method may
generate the score for the computational model based on the
calculated expected value.
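By way of illustration only, the logarithm of the fold-change for a
single node might be computed as in the following minimal sketch. The
function name, the use of a base-2 logarithm, the use of the mean over
replicates and the sample values are assumptions made for this
example and are not prescribed by the disclosure.

```python
import math

def log2_fold_change(control, treatment):
    """Return log2(mean(treatment) / mean(control)) for one node."""
    mean_control = sum(control) / len(control)
    mean_treatment = sum(treatment) / len(treatment)
    if mean_control <= 0 or mean_treatment <= 0:
        raise ValueError("activity values must be positive to take a logarithm")
    return math.log2(mean_treatment / mean_control)

# Hypothetical replicate measurements for a single node.
control = [100.0, 100.0, 100.0]
treatment = [400.0, 400.0, 400.0]
beta = log2_fold_change(control, treatment)  # four-fold increase
```

A positive value indicates an increase from the control condition to
the treatment condition, and a negative value indicates a decrease.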
[0009] In certain implementations, the approximate distributions
may be based, generally, on the activity measures. Additionally and
optionally, the expected value may comprise a rectangular
approximation. In certain implementations, the computerized method
includes or comprises generating, with a processor, a confidence
interval for each of the generated scores. Generating the
confidence interval may comprise performing a parametric
bootstrapping technique.
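The rectangular approximation and the parametric bootstrapping
technique may be sketched as follows. This is an illustrative example
under stated assumptions: the midpoint rule, the normal resampling
model, the percentile interval, and all function and parameter names
are hypothetical choices and are not taken from the disclosure.

```python
import random

def expected_value_rectangular(density, lower, upper, n_bins=1000):
    """Approximate E[X] for an unnormalized density on [lower, upper]
    using midpoint rectangles of equal width."""
    width = (upper - lower) / n_bins
    mass = 0.0      # approximates the integral of density(x)
    moment = 0.0    # approximates the integral of x * density(x)
    for i in range(n_bins):
        x = lower + (i + 0.5) * width
        area = density(x) * width
        mass += area
        moment += x * area
    if mass <= 0:
        raise ValueError("density has no mass on the given range")
    return moment / mass

def bootstrap_interval(betas, std_errors, score_fn, n_draws=1000,
                       alpha=0.05, seed=0):
    """Parametric bootstrap: redraw each fold-change from a normal
    distribution (an assumption of this sketch) and rescore."""
    rng = random.Random(seed)
    scores = sorted(
        score_fn([rng.gauss(b, s) for b, s in zip(betas, std_errors)])
        for _ in range(n_draws)
    )
    lower_idx = int((alpha / 2) * n_draws)
    upper_idx = min(n_draws - 1, int((1 - alpha / 2) * n_draws))
    return scores[lower_idx], scores[upper_idx]

# Hypothetical usage with a flat density and a mean-based score.
mean_score = lambda values: sum(values) / len(values)
center = expected_value_rectangular(lambda x: 1.0, 0.0, 2.0)
low, high = bootstrap_interval([1.0, 2.0, 3.0], [0.1, 0.1, 0.1], mean_score)
```

The scoring function is passed in as an argument so that the same
resampling routine may be reused with any score.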
[0010] In yet another aspect, the systems and methods described
herein are directed to computerized methods for quantifying the
perturbation of a biological system in response to an agent. The
computerized method may comprise receiving, at a first processor, a
set of treatment data corresponding to a response of a biological
system to an agent, wherein the biological system includes or
comprises a plurality of biological entities, each biological
entity interacting with at least one other of the biological
entities, and receiving, at a second processor, a set of control
data corresponding to the biological system not exposed to the
agent. The computerized method may comprise providing, at a third
processor, a computational causal network model that represents the
biological system. The computational model may include or comprise
nodes representing the biological entities, edges representing
relationships between the biological entities, and direction
values, for the nodes, representing the expected direction of
change between the control data and the treatment data. The
computerized method may further comprise calculating, with a fourth
processor, activity measures, for the nodes, representing a
difference between the treatment data and the control data, and
calculating, with a fifth processor, a positive activation score
and a negative activation score based on the activity measures, the
positive and negative activation scores representative of
consistency and inconsistency, respectively, between the activity
measures and the direction values. The computerized method may also
comprise generating, with a sixth processor, a score for each
computational model representative of the perturbation of the
subset of the biological system to the agent, wherein the score is
based on the positive and negative activation scores. In certain
implementations, each of the first through sixth processors is
included or comprised within a single processor or single computing
device. In other implementations, one or more of the first through
sixth processors are distributed across a plurality of processors
or computing devices.
[0011] In certain implementations, the computational causal network
model includes or comprises a set of causal relationships that
exist between a node representing a potential cause and nodes
representing the measured quantities. In such implementations, the
activity measures may include or comprise a fold-change. The
fold-change may be a number describing how much a node measurement
changes going from an initial value to a final value between
control data and treatment data. The fold-change number may
represent the logarithm of the fold-change of the activity of the
biological entity between control condition and treatment
condition. The computerized method may include or comprise
generating, with a processor, a range for the fold-change density,
which may represent an approximation of the set of values that the
fold-change values can take in the biological system under the
treatment conditions. The computerized method may comprise
calculating, with a processor, a positive activation score and a
negative activation score based on the fold-change values and the
direction values. The positive and
negative activation scores may indicate whether the observed
activation/inhibition of biological entities is consistent or
inconsistent with the expected directions of change. In one
example, the positive activation score is a probability that the
direction values are consistent with the activity measures. The
negative activation score may be a probability that the direction
values are inconsistent with the activity measures. The
computerized method may further include or comprise generating a
score for the computational model by combining the positive and
negative activation scores. In certain implementations, the score
is based on local false non-discovery rates.
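One hypothetical way to obtain such probabilities is sketched below.
The disclosure does not give a formula at this point, so the normal
model for each fold-change estimate, the averaging over nodes, and
the combination by subtraction are all assumptions of this example.
The local false non-discovery rate approach mentioned above is not
shown.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def activation_scores(betas, std_errors, directions):
    """Return (positive, negative) scores averaged over the nodes.

    Assumption of this sketch: each log fold-change estimate is normal
    with the given standard error, so the probability that the true
    change has the expected sign is Phi(direction * beta / std_error).
    """
    consistent = [normal_cdf(d * b / s)
                  for b, s, d in zip(betas, std_errors, directions)]
    positive = sum(consistent) / len(consistent)
    negative = 1.0 - positive
    return positive, negative

# Hypothetical nodes: both move in the expected direction.
positive, negative = activation_scores(
    betas=[2.0, -1.5], std_errors=[0.5, 0.5], directions=[+1, -1])
combined = positive - negative  # one possible way to combine the two
```

Under these assumptions, a combined value near one suggests that the
observed changes agree with the expected directions, and a value near
minus one suggests that they oppose them.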
[0012] In certain implementations, the subset of the biological
system includes or comprises at least one of cell proliferation
mechanism, cellular stress mechanism, cell inflammation mechanism,
and DNA repair mechanism. The agent may include or comprise at
least one of aerosol generated by heating tobacco, aerosol
generated by combusting tobacco, tobacco smoke or cigarette smoke.
The agent may include cadmium, mercury, chromium, nicotine,
tobacco-specific nitrosamines and their metabolites
(4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK),
N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT),
N-nitrosoanabasine (NAB), and
4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL)). In certain
implementations, the agent includes or comprises a product used for
nicotine replacement therapy. The agent may include or comprise a
heterogeneous substance, including a molecule or an entity that is
not present in or derived from the biological system. The agent may
also include or comprise toxins, therapeutic compounds, stimulants,
relaxants, natural products, manufactured products, and food
substances. In certain implementations, the set of treatment data
includes or comprises a plurality of sets of treatment data
corresponding to certain nodes of a biological network model,
wherein each such node corresponds to a plurality of fold-change
values defined by a first probability distribution and a plurality
of weight values defined by a second probability distribution.
[0013] In yet another aspect, the systems and methods described
herein are directed to computerized methods and one or more
computer processors for quantifying the perturbation of a
biological system in response to an agent. The computerized method
may comprise receiving, at a first processor, a set of treatment
data corresponding to a response of a biological system to an
agent, wherein the biological system includes or comprises a
plurality of biological entities, each biological entity
interacting with at least one other of the biological entities, and
receiving, at a second processor, a set of control data
corresponding to the biological system not exposed to the agent.
The computerized method may comprise providing, at a third
processor, a computational causal network model that represents the
biological system. The computational model may include or comprise
nodes representing the biological entities, edges representing
relationships between the biological entities, and direction
values, for the nodes, representing the expected direction of
change between the control data and the treatment data. The
computerized method may further comprise calculating, with a fourth
processor, activity measures, for the nodes, representing a
difference between the treatment data and the control data. The
computerized method may also comprise generating, with a fifth
processor, a score for the computational model representative of
the perturbation of the biological system to the agent, wherein the
score is based on the direction values and the activity measures.
In certain implementations, the computerized method further
comprises normalizing the score based on the number of nodes in the
respective computational model. The computerized method may also
comprise generating, with a sixth processor, a confidence interval
for each of the generated scores. Generating the confidence interval
may comprise approximating a distribution of the generated scores,
and a t-statistic may be derived from the variance of the
approximated distribution of generated scores. In certain
implementations, each
of the first through sixth processors is included or comprised
within a single processor or single computing device. In other
implementations, one or more of the first through sixth processors
are distributed across a plurality of processors or computing
devices.
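A minimal sketch of a score based on the direction values and the
activity measures, normalized by the number of nodes, is given below.
The Node structure, the signed-sum form of the score, the
independence of the nodes when computing the variance, and the
caller-supplied critical value t_critical (for instance, read from a
t-table) are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    direction: int    # +1 or -1: expected direction of change
    activity: float   # e.g. log fold-change, treatment vs. control
    std_error: float  # uncertainty of the activity measure

def normalized_score(nodes):
    """Sum of direction * activity, divided by the number of nodes."""
    if not nodes:
        raise ValueError("the model has no nodes")
    return sum(n.direction * n.activity for n in nodes) / len(nodes)

def confidence_interval(nodes, t_critical):
    """Score +/- t_critical * standard error, assuming independent nodes."""
    score = normalized_score(nodes)
    variance = sum(n.std_error ** 2 for n in nodes) / len(nodes) ** 2
    half_width = t_critical * math.sqrt(variance)
    return score - half_width, score + half_width

# Hypothetical three-node model (illustrative values only).
nodes = [
    Node("geneA", +1, 1.0, 0.3),
    Node("geneB", -1, -2.0, 0.3),
    Node("geneC", +1, 3.0, 0.3),
]
score = normalized_score(nodes)              # (1 + 2 + 3) / 3
low, high = confidence_interval(nodes, 2.0)  # t_critical chosen by caller
```

Dividing by the number of nodes allows scores from models of
different sizes to be compared on a common scale.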
[0014] The computerized methods described herein may be implemented
in a computerized system having one or more computing devices, each
including one or more processors. Generally, the computerized
systems described herein may comprise one or more engines, which
include or comprise a processing device or devices, such as a
computer, microprocessor, logic device or other device or processor
that is configured with hardware, firmware, and software to carry
out one or more of the computerized methods described herein. In
certain implementations, the computerized system includes or
comprises a systems response profile engine, a network modeling
engine, and a network scoring engine. The engines may be
interconnected from time to time, and further connected from time
to time to one or more databases, including a perturbations
database, a measurables database, an experimental data database and
a literature database. The computerized system described herein may
include or comprise a distributed computerized system having one or
more processors and engines that communicate through a network
interface. Such an implementation may be appropriate for distributed
computing over multiple communication systems. In a further aspect,
there is described a computer program product comprising a program
code adapted to perform the method described herein. In a further
aspect, there is described a computer or computer recordable medium
or device comprising the computer program product.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] Further features of the disclosure, its nature and various
advantages, will be apparent upon consideration of the following
detailed description, taken in conjunction with the accompanying
drawings, in which like reference characters refer to like parts
throughout, and in which:
[0016] FIG. 1 is a block diagram of an exemplary computerized
system for quantifying the response of a biological network to a
perturbation.
[0017] FIG. 2 is a flow diagram of an exemplary process for
quantifying the response of a biological network to a perturbation
by calculating a network perturbation amplitude (NPA) score.
[0018] FIG. 3 is a graphical representation of data underlying a
systems response profile comprising data for two agents, two
parameters, and N biological entities.
[0019] FIG. 4 is an illustration of a computational model of a
biological network having several biological entities and their
relationships.
[0020] FIG. 5 is a flow diagram of an exemplary process for
generating a geometric perturbation index (GPI) score.
[0021] FIG. 6 is a flow diagram of an exemplary process for
generating a probabilistic perturbation index (PPI) score.
[0022] FIG. 7 is a flow diagram of an exemplary process for
generating an expected perturbation index (EPI) score.
[0023] FIG. 8 is a flow diagram of an exemplary process for
generating a confidence interval for a geometric perturbation index
(GPI) score.
[0024] FIG. 9 illustrates a biological network model analyzed with
the systems and methods disclosed herein.
[0025] FIGS. 10-14 illustrate network perturbation amplitude (NPA)
scoring results for the network-based biological mechanisms.
[0026] FIG. 15 is a block diagram of an exemplary distributed
computerized system for quantifying the impact of biological
perturbations; and
[0027] FIG. 16 is a block diagram of an exemplary computing device
which may be used to implement any of the components in any of the
computerized systems described herein.
DETAILED DESCRIPTION
[0028] The words "including" or "comprising" do not exclude other
elements or steps, and the indefinite article "a" or "an" does not
exclude a plurality. Described herein are computational systems and
methods that assess quantitatively the magnitude of changes within
a biological system when it is perturbed by an agent. Certain
implementations include or comprise methods for computing a
numerical value that expresses the magnitude of changes within a
portion of a biological system. The computation uses as input a
set of data obtained from a set of controlled experiments in which
the biological system is perturbed by an agent. The data is then
applied to a network model of a feature of the biological system.
The network model is used as a substrate for simulation and
analysis, and is representative of the biological mechanisms and
pathways that enable a feature of interest in the biological
system. The feature or some of its mechanisms and pathways may
contribute to the pathology of diseases and adverse health effects
of the biological system. Prior knowledge of the biological system
represented in a database is used to construct the network model
which is populated by data on the status of numerous biological
entities under various conditions including under normal conditions
and under perturbation by an agent. The network model used is
dynamic in that it represents changes in status of various
biological entities in response to a perturbation and can yield
quantitative and objective assessments of the impact of an agent on
the biological system. Computer systems for operating these
computational methods are also provided.
[0029] The numerical values generated by computerized methods of
the invention can be used to determine the magnitude of desirable
or adverse biological effects caused by manufactured products (for
safety assessment or comparisons), therapeutic compounds including
nutrition supplements (for determination of efficacy or health
benefits), and environmentally active substances (for prediction of
risks of long term exposure and the relationship to adverse effect
and onset of disease), among others.
[0030] In one aspect, the systems and methods described herein
provide a computed numerical value representative of the magnitude
of change in a perturbed biological system based on a network model
of a perturbed biological mechanism. The numerical value referred
to herein as a network perturbation amplitude (NPA) score can be
used to summarily represent the status changes of various entities
in a defined biological mechanism. The numerical values obtained
for different agents or different types of perturbations can be
used to compare relatively the impact of the different agents or
perturbations on a biological mechanism which enables or manifests
itself as a feature of a biological system. Thus, NPA scores may be
used to measure the responses of a biological mechanism to
different perturbations. The term "score" is used herein generally
to refer to a value or set of values which provide a quantitative
measure of the magnitude of changes in a biological system. Such a
score is computed by using any of various mathematical and
computational algorithms known in the art and according to the
methods disclosed herein, employing one or more datasets obtained
from a sample or a subject.
The NPA scores may assist researchers and clinicians in improving
diagnosis, experimental design, therapeutic decision, and risk
assessment. For example, the NPA scores may be used to screen a set
of candidate biological mechanisms in a toxicology analysis to
identify those most likely to be affected by exposure to a
potentially harmful agent. By providing a measure of network
response to a perturbation, these NPA scores may allow correlation
of molecular events (as measured by experimental data) with
phenotypes or biological outcomes that occur at the cell, tissue,
organ or organism level. A clinician may use NPA values to compare
the biological mechanisms affected by an agent to a patient's
physiological condition to determine what health risks or benefits
the patient is most likely to experience when exposed to the agent
(e.g., a patient who is immuno-compromised may be especially
vulnerable to agents that cause a strong immuno-suppressive
response).
[0031] FIG. 1 is a block diagram of a computerized system 100 for
quantifying the response of a network model to a perturbation. In
particular, system 100 includes or comprises a systems response
profile engine 110, a network modeling engine 112, and a network
scoring engine 114. The engines 110, 112, and 114 are
interconnected from time to time, and further connected from time
to time to one or more databases, including a perturbations
database 102, a measurables database 104, an experimental data
database 106 and a literature database 108. As used herein, an
engine includes or comprises a processing device or devices, such
as a computer, microprocessor, logic device or other device or
devices as described with reference to FIG. 16, that is configured
with hardware, firmware, and software to carry out one or more
computational operations.
[0032] FIG. 2 is a flow diagram of a process 200 for quantifying
the response of a biological network to a perturbation by
calculating a network perturbation amplitude (NPA) score, according
to one implementation. The steps of the process 200 will be
described as being carried out by various components of the system
100 of FIG. 1, but any of these steps may be performed by any
suitable hardware or software components, local or remote, and may
be arranged in any appropriate order or performed in parallel. At
step 210, the systems response profile (SRP) engine 110 receives
biological data from a variety of different sources, and the data
itself may be of a variety of different types. The data comprises
data from experiments in which a biological system is perturbed, as
well as control data. At step 212, the SRP engine 110 generates
systems response profiles (SRPs) which are representations of the
degree to which one or more entities within a biological system
change in response to the presentation of an agent to the
biological system. At step 214, the network modeling engine 112
provides one or more databases that contain(s) a plurality of
network models, one of which is selected as being relevant to the
agent or a feature of interest. The selection can be made on the
basis of prior knowledge of the mechanisms underlying the
biological functions of the system. In certain implementations, the
network modeling engine 112 may extract causal relationships
between entities within the system using the systems response
profiles, networks in the database, and networks previously
described in the literature, thereby generating, refining or
extending a network model. At step 216, the network scoring engine
114 generates NPA scores for each perturbation using the network
identified at step 214 by the network modeling engine 112 and the
SRPs generated at step 212 by the SRP engine 110. An NPA score
quantifies a biological response to a perturbation or treatment
(represented by the SRPs) in the context of the underlying
relationships between the biological entities (represented by the
network). The following description is divided into subsections for
clarity of disclosure, and not by way of limitation.
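Steps 210 through 216 of the process 200 may be pictured with the
following simplified sketch. The function names, the dictionary-based
data layout, the difference-of-means profile and the placeholder
scoring rule are hypothetical and are used only to show how the
outputs of one step feed the next; they do not reproduce the NPA
computation itself.

```python
from statistics import mean

def generate_srp(treatment, control):
    """Step 212: per-entity difference between treatment and control means."""
    return {entity: mean(treatment[entity]) - mean(control[entity])
            for entity in treatment if entity in control}

def select_network(models, feature):
    """Step 214: pick the stored network model tagged with the feature."""
    return models[feature]

def score_network(network, srp):
    """Step 216: placeholder score, the mean of direction * activity."""
    terms = [direction * srp[node]
             for node, direction in network.items() if node in srp]
    if not terms:
        raise ValueError("no network node has a measurement in the SRP")
    return sum(terms) / len(terms)

# Step 210: hypothetical received data (illustrative values only).
control = {"geneA": [1.0, 1.0], "geneB": [4.0, 4.0]}
treatment = {"geneA": [3.0, 3.0], "geneB": [2.0, 2.0]}
models = {"cell_proliferation": {"geneA": +1, "geneB": -1}}

srp = generate_srp(treatment, control)
network = select_network(models, "cell_proliferation")
score = score_network(network, srp)
```

In this sketch each network model is reduced to a mapping from node
name to expected direction of change; the edges between nodes are
omitted for brevity.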
A. Biological System
[0033] A biological system in the context of the present invention
is an organism or a part of an organism, including functional
parts, the organism being referred to herein as a subject. The
subject is generally a mammal, including a human. The subject can
be an individual human being in a human population. The term
"mammal" as used herein includes or comprises but is not limited to
a human, non-human primate, mouse, rat, dog, cat, cow, sheep,
horse, and pig. Mammals other than humans can be advantageously
used as subjects to provide a model of a human disease. The
non-human subject can be unmodified, a transgenic
animal, a genetically modified animal, or an animal carrying one or
more genetic mutation(s), or silenced gene(s). A subject can be
male or female. Depending on the objective of the operation, a
subject can be one that has been exposed to an agent of interest. A
subject can be one that has been exposed to an agent over an
extended period of time, optionally including time prior to the
study. A subject can be one that had been exposed to an agent for a
period of time but is no longer in contact with the agent. A
subject can be one that has been diagnosed or identified as having
a disease. A subject can be one who has already undergone, or is
undergoing treatment of a disease or adverse health condition. A
subject can also be one who exhibits one or more symptoms or risk
factors for a specific health condition or disease. A subject can
be one that is predisposed to but is asymptomatic for a disease. In
certain implementations, the disease or health condition in
question is associated with exposure to an agent or use of an agent
over an extended period of time. According to some implementations,
the system 100 (FIG. 1) contains or generates computerized models
of one or more biological systems and mechanisms of its functions
(collectively, "biological networks" or "network models") that are
relevant to a type of perturbation or an outcome of interest.
[0034] Depending on the context of the operation, the biological
system can be defined at different levels as it relates to the
function of an individual organism in a population, an organism
generally, an organ, a tissue, a cell type, an organelle, a
cellular component, or a specific individual's cell(s). Each
biological system comprises one or more biological mechanisms or
pathways, the operation of which manifests as functional features of
the system. Animal systems that reproduce defined features of a
human health condition and that are suitable for exposure to an
agent of interest are preferred biological systems. Cellular and
organotypical systems that reflect the cell types and tissue
involved in a disease etiology or pathology are also preferred
biological systems. Priority could be given to primary cells or
organ cultures that recapitulate as much as possible the human
biology in vivo. It is also important to match the human cell
culture in vitro with the most equivalent culture derived from the
animal models in vivo. This enables creation of a translational
continuum from animal model to human biology in vivo using the
matched systems in vitro as reference systems. Accordingly, the
biological system contemplated for use with the systems and methods
described herein can be defined by, without limitation, functional
features (biological functions, physiological functions, or
cellular functions), organelle, cell type, tissue type, organ,
development stage, or a combination of the foregoing. Examples of
biological systems include or comprise, but are not limited to, the
pulmonary, integument, skeletal, muscular, nervous (central and
peripheral), endocrine, cardiovascular, immune, circulatory,
respiratory, urinary, renal, gastrointestinal, colorectal, hepatic
and reproductive systems. Other examples of biological systems
include or comprise, but are not limited to, the various cellular
functions in epithelial cells, nerve cells, blood cells, connective
tissue cells, smooth muscle cells, skeletal muscle cells, fat
cells, ovum cells, sperm cells, stem cells, lung cells, brain
cells, cardiac cells, laryngeal cells, pharyngeal cells, esophageal
cells, stomach cells, kidney cells, liver cells, breast cells,
prostate cells, pancreatic cells, islet cells, testes cells,
bladder cells, cervical cells, uterus cells, colon cells, and
rectum cells. Some of the cells may be cells of cell lines,
cultured in vitro or maintained in vitro indefinitely under
appropriate culture conditions. Examples of cellular functions
include or comprise, but are not limited to, cell proliferation
(e.g., cell division), degeneration, regeneration, senescence,
control of cellular activity by the nucleus, cell-to-cell
signaling, cell differentiation, cell de-differentiation,
secretion, migration, phagocytosis, repair, apoptosis, and
developmental programming. Examples of cellular components that can
be considered as biological systems include or comprise, but are
not limited to, the cytoplasm, cytoskeleton, membrane, ribosomes,
mitochondria, nucleus, endoplasmic reticulum (ER), Golgi apparatus,
lysosomes, DNA, RNA, proteins, peptides, and antibodies.
B. Perturbation
[0035] A perturbation in a biological system can be caused by one
or more agents over a period of time through exposure or contact
with one or more parts of the biological system. An agent can be a
single substance or a mixture of substances, including a mixture in
which not all constituents are identified or characterized. The
chemical and physical properties of an agent or its constituents
may not be fully characterized. An agent can be defined by its
structure, its constituents, or a source that under certain
conditions produces the agent. An example of an agent is a
heterogeneous substance, that is a molecule or an entity that is
not present in or derived from the biological system, and any
intermediates or metabolites produced therefrom after contacting
the biological system. An agent can be a carbohydrate, protein,
lipid, nucleic acid, alkaloid, vitamin, metal, heavy metal,
mineral, oxygen, ion, enzyme, hormone, neurotransmitter, inorganic
chemical compound, organic chemical compound, environmental agent,
microorganism, particle, environmental condition, environmental
force, or physical force. Non-limiting examples of agents include
or comprise nutrients, metabolic wastes,
poisons, narcotics, toxins, therapeutic compounds, stimulants,
relaxants, natural products, manufactured products, food
substances, pathogens (prion, virus, bacteria, fungi, protozoa),
particles or entities whose dimensions are in or below the
micrometer range, by-products of the foregoing and mixtures of the
foregoing. Non-limiting examples of a physical agent include or
comprise radiation, electromagnetic waves (including sunlight),
increase or decrease in temperature, shear force, fluid pressure,
electrical discharge(s) or a sequence thereof, or trauma.
[0036] An agent may not perturb a biological system unless it is
present at a threshold concentration or is in contact with the
biological system for a period of time, or a combination of both.
Exposure or contact of an agent resulting in a perturbation may be
quantified in terms of dosage. Thus, a perturbation can result from
a long-term exposure to an agent. The period of exposure can be
expressed by units of time, by frequency of exposure, or by the
percentage of time within the actual or estimated life span of the
subject. A perturbation can also be caused by withholding an agent
(as described above) from or limiting supply of an agent to one or
more parts of a biological system. For example, a perturbation can
be caused by a decreased supply of or a lack of nutrients, water,
carbohydrates, proteins, lipids, alkaloids, vitamins, minerals,
oxygen, ions, an enzyme, a hormone, a neurotransmitter, an
antibody, a cytokine, light, or by restricting movement of certain
parts of an organism, or by constraining or requiring exercise.
[0037] An agent may cause different perturbations depending on
which part(s) of the biological system is exposed and the exposure
conditions. Non-limiting examples of an agent may include or
comprise aerosol generated by heating tobacco, aerosol generated by
combusting tobacco, tobacco smoke or cigarette smoke, and any of
the gaseous constituents or particulate constituents thereof.
Further non-limiting examples of an agent include or comprise
cadmium, mercury, chromium, nicotine, tobacco-specific nitrosamines
and their metabolites
(4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK),
N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT),
N-nitrosoanabasine (NAB),
4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL)), and any
product used for nicotine replacement therapy. An exposure regimen
for an agent or complex stimulus should reflect the range and
circumstances of exposure in everyday settings. A set of standard
exposure regimens can be designed to be applied systematically to
equally well-defined experimental systems. Each assay could be
designed to collect time and dose-dependent data to capture both
early and late events and ensure a representative dose range is
covered. However, it will be understood by one of ordinary skill in
the art that the systems and methods described herein may be
adapted and modified as is appropriate for the application being
addressed and that the systems and methods described herein may be
employed in other suitable applications, and that such other
additions and modifications will not depart from the scope
thereof.
[0038] In various implementations, high-throughput system-wide
measurements for gene expression, protein expression or turnover,
microRNA expression or turnover, post-translational modifications,
protein modifications, translocations, antibody production,
metabolite profiles, or a combination of two or more of the
foregoing are generated under various conditions including the
respective controls. Functional outcome measurements are desirable
in the methods described herein as they can generally serve as
anchors for the assessment and represent clear steps in a disease
etiology.
[0039] A "sample" as used herein refers to any biological sample
that is isolated from a subject or an experimental system (e.g.,
cell, tissue, organ, or whole animal). A sample can include or
comprise, without limitation, a single cell or multiple cells,
cellular fraction, tissue biopsy, resected tissue, tissue extract,
tissue, tissue culture extract, tissue culture medium, exhaled
gases, whole blood, platelets, serum, plasma, erythrocytes,
leucocytes, lymphocytes, neutrophils, macrophages, B cells or a
subset thereof, T cells or a subset thereof, a subset of
hematopoietic cells, endothelial cells, synovial fluid, lymphatic
fluid, ascites fluid, interstitial fluid, bone marrow,
cerebrospinal fluid, pleural effusions, tumor infiltrates, saliva,
mucus, sputum, semen, sweat, urine, or any other bodily fluids.
Samples can be obtained from a subject by means including but not
limited to venipuncture, excretion, biopsy, needle aspirate,
lavage, scraping, surgical resection, or other means known in the
art.
[0040] During operation, for a given biological mechanism, an
outcome, a perturbation, or a combination of the foregoing, the
system 100 can generate a network perturbation amplitude (NPA)
value, which is a quantitative measure of changes in the status of
biological entities in a network in response to a treatment
condition.
[0041] The system 100 (FIG. 1) comprises one or more computerized
network model(s) that are relevant to the health condition,
disease, or biological outcome, of interest. One or more of these
network models are based on prior biological knowledge and can be
uploaded from an external source and curated within the system 100.
The models can also be generated de novo within the system 100
based on measurements. Measurable elements are causally integrated
into biological network models through the use of prior knowledge.
Described below are the types of data that represent changes in a
biological system of interest that can be used to generate or
refine a network model, or that represent a response to a
perturbation.
[0042] Referring to FIG. 2, at step 210, the systems response
profile (SRP) engine 110 receives biological data. The SRP engine
110 may receive this data from a variety of different sources, and
the data itself may be of a variety of different types. The
biological data used by the SRP engine 110 may be drawn from the
literature, databases (including data from preclinical, clinical
and post-clinical trials of pharmaceutical products or medical
devices), genome databases (genomic sequences and expression data,
e.g., Gene Expression Omnibus by National Center for Biotechnology
Information or ArrayExpress by European Bioinformatics Institute
(Parkinson et al. 2010, Nucl. Acids Res., doi: 10.1093/nar/gkq1040.
Pubmed ID 21071405)), commercially available databases (e.g., Gene
Logic, Gaithersburg, Md., USA) or experimental work. The data may
include or comprise raw data from one or more different sources,
such as in vitro, ex vivo or in vivo experiments using one or more
species that are specifically designed for studying the effect of
particular treatment conditions or exposure to particular agents.
In vitro experimental systems may include or comprise tissue
cultures or organotypical cultures (three-dimensional cultures)
that represent key aspects of human disease. In such
implementations, the agent dosage and exposure regimens for these
experiments may substantially reflect the range and circumstances
of exposures that may be anticipated for humans during normal use
or activity conditions, or during special use or activity
conditions. Experimental parameters and test conditions may be
selected as desired to reflect the nature of the agent and the
exposure conditions, molecules and pathways of the biological
system in question, cell types and tissues involved, the outcome of
interest, and aspects of disease etiology. Particular
animal-model-derived molecules, cells or tissues may be matched
with particular human molecule, cell or tissue cultures to improve
translatability of animal-based findings.
[0043] The data received by SRP engine 110, many of which are
generated by high-throughput experimental techniques, include or
comprise but are not limited to that relating to nucleic acid
(e.g., absolute or relative quantities of specific DNA or RNA
species, changes in DNA sequence, RNA sequence, changes in tertiary
structure, or methylation pattern as determined by sequencing,
hybridization--particularly to nucleic acids on microarray,
quantitative polymerase chain reaction, or other techniques known
in the art), protein/peptide (e.g., absolute or relative quantities
of protein, specific fragments of a protein, peptides, changes in
secondary or tertiary structure, or posttranslational modifications
as determined by methods known in the art) and functional
activities (e.g., enzymatic activities, proteolytic activities,
transcriptional regulatory activities, transport activities,
binding affinities to certain binding partners) under certain
conditions, among others. Modifications including posttranslational
modifications of protein or peptide can include or comprise, but
are not limited to, methylation, acetylation, farnesylation,
biotinylation, stearoylation, formylation, myristoylation,
palmitoylation, geranylgeranylation, pegylation, phosphorylation,
sulphation, glycosylation, sugar modification, lipidation, lipid
modification, ubiquitination, sumolation, disulphide bonding,
cysteinylation, oxidation, glutathionylation, carboxylation,
glucuronidation, and deamidation. In addition, a protein can be
modified posttranslationally by a series of reactions such as
Amadori reactions, Schiff base reactions, and Maillard reactions
resulting in glycated protein products.
[0044] The data may also include or comprise measured functional
outcomes, such as but not limited to those at a cellular level
including cell proliferation, developmental fate, and cell death,
and those at a physiological level including lung capacity, blood
pressure, and exercise proficiency. The data may also include or
comprise a measure of
disease activity or severity, such as but not limited to tumor
metastasis, tumor remission, loss of a function, and life
expectancy at a certain stage of disease. Disease activity can be
measured by a clinical assessment, the result of which is a value,
or a set of values that can be obtained from evaluation of a sample
(or population of samples) from a subject or subjects under defined
conditions. A clinical assessment can also be based on the
responses provided by a subject to an interview or a
questionnaire.
[0045] This data may have been generated expressly for use in
determining a systems response profile, or may have been produced
in previous experiments or published in the literature. Generally,
the data includes or comprises information relating to a molecule,
biological structure, physiological condition, genetic trait, or
phenotype. In some implementations, the data includes or comprises
a description of the condition, location, amount, activity, or
substructure of a molecule, biological structure, physiological
condition, genetic trait, or phenotype. As will be described later,
in a clinical setting, the data may include or comprise raw or
processed data obtained from assays performed on samples obtained
from human subjects or observations on the human subjects, exposed
to an agent.
[0046] At step 212, the systems response profile (SRP) engine 110
generates systems response profiles (SRPs) based on the biological
data received at step 210. This step may include or comprise one or
more of background correction, normalization, fold-change
calculation, significance determination and identification of a
differential response (e.g., differentially expressed genes). SRPs
are representations that express the degree to which one or more
measured entities within a biological system (e.g., a molecule, a
nucleic acid, a peptide, a protein, a cell, etc.) are individually
changed in response to a perturbation applied to the biological
system (e.g., an exposure to an agent). In one example, to generate
an SRP, the SRP engine 110 collects a set of measurements for a
given set of parameters (e.g., treatment or perturbation
conditions) applied to a given experimental system (a
"system-treatment" pair). FIG. 3 illustrates two SRPs: SRP 302 that
includes or comprises biological activity data for N different
biological entities undergoing a first treatment 306 with varying
parameters (e.g., dose and time of exposure to a first treatment
agent), and an analogous SRP 304 that includes or comprises
biological activity data for the N different biological entities
undergoing a second treatment 308. The data included or comprised
in an SRP may be raw experimental data, processed experimental data
(e.g., filtered to remove outliers, marked with confidence
estimates, averaged over a number of trials), data generated by a
computational biological model, or data taken from the scientific
literature. An SRP may represent data in any number of ways, such
as an absolute value, an absolute change, a fold-change, a
logarithmic change, a function, and a table. The SRP engine 110
passes the SRPs to the network modeling engine 112.
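By way of non-limiting illustration, the fold-change calculation mentioned for step 212 can be sketched as follows. The function name `log2_fold_changes`, the dictionary layout, and the use of a simple mean over replicates are assumptions made only for this example; the disclosure does not prescribe a particular implementation, and background correction, normalization and significance testing are omitted here.

```python
import math
from statistics import mean


def log2_fold_changes(treatment, control):
    """Return {entity: log2(mean treatment / mean control)}.

    Both arguments map an entity name to a list of replicate
    measurements (hypothetical layout). Entities missing from the
    control data, or with non-positive means, are skipped because
    the log ratio is undefined for them.
    """
    profile = {}
    for entity, treated_values in treatment.items():
        if entity not in control:
            continue
        treated_mean = mean(treated_values)
        control_mean = mean(control[entity])
        if treated_mean <= 0 or control_mean <= 0:
            continue
        profile[entity] = math.log2(treated_mean / control_mean)
    return profile


# Toy system-treatment pair with two measured entities.
srp = log2_fold_changes(
    {"GENE_A": [8.0, 8.0], "GENE_B": [1.0, 3.0]},
    {"GENE_A": [2.0, 2.0], "GENE_B": [4.0, 4.0]},
)
```

In this toy input, GENE_A has a positive log fold-change (increase under treatment) and GENE_B a negative one (decrease under treatment).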
[0047] While the SRPs derived in the previous step represent the
experimental data from which the magnitude of network perturbation
will be determined, it is the biological network models that are
the substrate for computation and analysis. This analysis requires
development of a detailed network model of the mechanisms and
pathways relevant to a feature of the biological system. Such a
framework provides a layer of mechanistic understanding beyond
examination of gene lists that have been used in more classical
gene expression analysis. A network model of a biological system is
a mathematical construct that is representative of a dynamic
biological system and that is built by assembling quantitative
information about various basic properties of the biological
system.
[0048] Construction of such a network is an iterative process.
Delineation of boundaries of the network is guided by literature
investigation of mechanisms and pathways relevant to the process of
interest (e.g., cell proliferation in the lung). Causal
relationships describing these pathways are extracted from prior
knowledge to nucleate a network. The literature-based network can
be verified using high-throughput data sets that contain the
relevant phenotypic endpoints. SRP engine 110 can be used to
analyze the data sets, the results of which can be used to confirm,
refine, or generate network models.
C. Networks
[0049] Returning to FIG. 2, at step 214, the network modeling
engine 112 uses the systems response profiles from the SRP engine
110 with a network model based on the mechanism(s) or pathway(s)
underlying a feature of a biological system of interest. In certain
aspects, the network modeling engine 112 is used to identify
networks already generated based on SRPs. The network modeling
engine 112 may include or comprise components for receiving updates
and changes to models. The network modeling engine 112 may also
iterate the process of network generation, incorporating new data
and generating additional or refined network models. The network
modeling engine 112 may also facilitate the merging of one or more
datasets or the merging of one or more networks. The set of
networks drawn from a database may be manually supplemented by
additional nodes, edges, or entirely new networks (e.g., by mining
the text of literature for description of additional genes directly
regulated by a particular biological entity). These networks
contain features that may enable process scoring. Network topology
is maintained; networks of causal relationships can be traced from
any point in the network to a measurable entity. Further, the
models are dynamic and the assumptions used to build them can be
modified or restated and enable adaptability to different tissue
contexts and species. This allows for iterative testing and
improvement as new knowledge becomes available. The network
modeling engine 112 may remove nodes or edges that have low
confidence or which are the subject of conflicting experimental
results in the scientific literature. The network modeling engine
112 may also include or comprise additional nodes or edges that may
be inferred using supervised or unsupervised learning methods
(e.g., metric learning, matrix completion, pattern
recognition).
[0050] In certain aspects, a biological system is modeled as a
mathematical graph consisting of vertices (or nodes) and edges that
connect the nodes. For example, FIG. 4 illustrates a simple network
400 with 9 nodes (including nodes 402 and 404) and edges (406 and
408). The nodes can represent biological entities or processes
within a biological system, such as, but not limited to, compounds,
DNA, RNA, genes, proteins, peptides, antibodies, cells, tissues,
organs and cellular or molecular processes. The biological entities
are not necessarily limited to those biological entities for which
treatment or control data are received or available. Thus, the
nodes representing the biological entities can include or comprise
the plurality of biological entities and may include or comprise
one or more further biological entities. At least some of the nodes
are scorable and the score may represent the activity level of the
node(s). Many of the nodes represent biological entities of which
the activity levels can be measured. However, in some
implementations, it is not necessary for the computerized method to
receive data for all such measurable nodes. Thus, the nodes are
scorable and/or measurable. In certain implementations, most of the
nodes are measurable. A measurable node may contain or comprise
measured data. The edges can represent relationships between the
nodes. The edges in the graph can represent various relations
between the nodes. For example, edges may represent a "binds to"
relation, an "is expressed in" relation, an "are co-regulated based
on expression profiling" relation, an "inhibits" relation, a
"co-occur in a manuscript" relation, or "share structural element"
relation. Generally, these types of relationships describe a
relationship between a pair of nodes. The nodes in the graph can
also represent relationships between nodes. Thus, it is possible to
represent relationships between relationships, or relationships
between a relationship and another type of biological entity
represented in the graph. For example, a relationship between two
nodes that represent chemicals may represent a reaction. This
reaction may be a node in a relationship between the reaction and a
chemical that inhibits the reaction.
[0051] A graph may be undirected, meaning that there is no
distinction between the two vertices associated with each edge.
Alternatively, the edges of a graph may be directed from one vertex
to another. For example, in a biological context, transcriptional
regulatory networks and metabolic networks may be modeled as a
directed graph. In a graph model of a transcriptional regulatory
network, nodes would represent genes with edges denoting the
transcriptional relationships between them. As another example,
protein-protein interaction networks describe direct physical
interactions between the proteins in an organism's proteome and
there is often no direction associated with the interactions in
such networks. Thus, these networks may be modeled as undirected
graphs. Certain networks may have both directed and undirected
edges. The entities and relationships (i.e., the nodes and edges)
that make up a graph may be stored as a web of interrelated nodes
in a database in system 100.
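By way of non-limiting illustration, a graph holding both directed and undirected edges of the kinds described above could be represented as in the following sketch. The class names `Edge` and `Network`, the field names, and the example node names are hypothetical and are not taken from system 100.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str          # e.g. "binds to", "inhibits", "increases"
    directed: bool = True  # False for symmetric relations


@dataclass
class Network:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, source, target, relation, directed=True):
        """Add an edge and register both endpoints as nodes."""
        self.nodes.update((source, target))
        self.edges.append(Edge(source, target, relation, directed))

    def neighbors(self, node):
        """Nodes reachable from `node` in one step, honoring direction."""
        reachable = set()
        for edge in self.edges:
            if edge.source == node:
                reachable.add(edge.target)
            elif not edge.directed and edge.target == node:
                reachable.add(edge.source)
        return reachable


net = Network()
# Directed edge, as in a transcriptional regulatory network.
net.add_edge("TF1", "GENE_A", "increases")
# Undirected edge, as in a protein-protein interaction network.
net.add_edge("PROT_X", "PROT_Y", "binds to", directed=False)
```

A directed edge can only be followed from source to target, whereas an undirected edge can be followed from either endpoint, which mirrors the distinction drawn between regulatory and protein-protein interaction networks.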
[0052] The knowledge represented within the database may be of
various different types, drawn from various different sources. For
example, certain data may represent a genomic database, including
information on genes, and relations between them. In such an
example, a node may represent an oncogene, while another node
connected to the oncogene node may represent a gene that inhibits
the oncogene. The data may represent proteins, and relations
between them, diseases and their interrelations, and various
disease states. There are many different types of data that can be
combined in a graphical representation. The computational models
may represent a web of relations between nodes representing
knowledge in, e.g., a DNA dataset, an RNA dataset, a protein
dataset, an antibody dataset, a cell dataset, a tissue dataset, an
organ dataset, a medical dataset, an epidemiology dataset, a
chemistry dataset, a toxicology dataset, a patient dataset, and a
population dataset. As used herein, a dataset is a collection of
numerical values resulting from evaluation of a sample (or a group
of samples) under defined conditions. Datasets can be obtained, for
example, by experimentally measuring quantifiable entities of the
sample; or alternatively, from a service provider such as a
laboratory, a clinical research organization, or from a public or
proprietary database. Datasets may contain data and biological
entities represented by nodes, and the nodes in each of the
datasets may be related to other nodes in the same dataset, or in
other datasets. Moreover, the network modeling engine 112 may
generate computational models that represent genetic information,
in, e.g., DNA, RNA, protein or antibody dataset, to medical
information, in medical dataset, to information on individual
patients in patient dataset, and on entire populations, in
epidemiology dataset. In addition to the various datasets described
above, there may be many other datasets, or types of biological
information that may be included or comprised when generating a
computational model. For example, a database could further include or
comprise medical record data, structure/activity relationship data,
information on infectious pathology, information on clinical
trials, exposure pattern data, data relating to the history of use
of a product, and any other type of life science-related
information.
[0053] The network modeling engine 112 may generate one or more
network models representing, for example, the regulatory
interaction between genes, interaction between proteins or complex
bio-chemical interactions within a cell or tissue. The networks
generated by the network modeling engine 112 may include or
comprise static and dynamic models. The network modeling engine 112
may employ any applicable mathematical schemes to represent the
system, such as hyper-graphs and weighted bipartite graphs, in
which two types of nodes are used to represent reactions and
compounds. The network modeling engine 112 may also use other
inference techniques to generate network models, such as an
analysis based on over-representation of functionally-related genes
within the differentially expressed genes, Bayesian network
analysis, a graphical Gaussian model technique or a gene relevance
network technique, to identify a relevant biological network based
on a set of experimental data (e.g., gene expression, metabolite
concentrations, cell response, etc.). The biological system may be
represented by a plurality of network models, including
computational causal network models.
[0054] As described above, the network model is based on mechanisms
and pathways that underlie the functional features of a biological
system. The network modeling engine 112 may generate or contain a
model representative of an outcome regarding a feature of the
biological system that is relevant to the study of the long-term
health risks or health benefits of agents. Accordingly, the network
modeling engine 112 may generate or contain a network model for
various mechanisms of cellular function, particularly those that
relate or contribute to a feature of interest in the biological
system, including but not limited to cellular proliferation,
cellular stress, cellular regeneration, apoptosis, DNA
damage/repair or inflammatory response. In other embodiments, the
network modeling engine 112 may contain or generate computational
models that are relevant to acute systemic toxicity,
carcinogenicity, dermal penetration, cardiovascular disease,
pulmonary disease, ecotoxicity, eye irritation/corrosion,
genotoxicity, immunotoxicity, neurotoxicity, pharmacokinetics, drug
metabolism, organ toxicity, reproductive and developmental
toxicity, skin irritation/corrosion or skin sensitization.
Generally, the network modeling engine 112 may contain or generate
computational models for status of nucleic acids (DNA, RNA, SNP,
siRNA, miRNA, RNAi), proteins, peptides, antibodies, cells,
tissues, organs, and any other biological entity, and their
respective interactions. In one example, computational network
models can be used to represent the status of the immune system and
the functioning of various types of white blood cells during an
immune response or an inflammatory reaction. In other examples,
computational network models could be used to represent the
performance of the cardiovascular system and the functioning and
metabolism of endothelial cells.
[0055] In some implementations of the present invention, the
network is drawn from a database of causal biological knowledge.
This database may be generated by performing experimental studies
of different biological mechanisms to extract relationships between
mechanisms (e.g., activation or inhibition relationships), some of
which may be causal relationships, and may be combined with a
commercially-available database such as the Genstruct Technology
Platform or the Selventa Knowledgebase, curated by Selventa Inc. of
Cambridge, Mass., USA. Using a database of causal biological
knowledge, the network modeling engine 112 may identify a network
that links the perturbations 102 and the measurables 104. In
certain implementations, the network modeling engine 112 extracts
causal relationships between biological entities using the systems
response profiles from the SRP engine 110 and networks previously
generated in the literature. The database may be further processed
to remove logical inconsistencies and generate new biological
knowledge by applying homologous reasoning between different sets
of biological entities, among other processing steps.
[0056] In certain implementations, the network model extracted from
the database is based on reverse causal reasoning (RCR), an
automated reasoning technique that processes networks of causal
relationships to formulate mechanism hypotheses, and then evaluates
those mechanism hypotheses against datasets of differential
measurements. Each mechanism hypothesis links a biological entity
to measurable quantities that it can influence. At least one
mechanism hypothesis may be formulated--such as a plurality of
mechanism hypotheses. For example, measurable quantities can
include or comprise an increase or decrease in concentration,
number or relative abundance of a biological entity, activation or
inhibition of a biological entity, or changes in the structure,
function or logical of a biological entity, among others. RCR uses
a directed network of experimentally-observed causal interactions
between biological entities as a substrate for computation. The
directed network may be expressed in Biological Expression
Language.TM. (BEL.TM.), a syntax for recording the
inter-relationships between biological entities. The RCR
computation specifies certain constraints for network model
generation, such as but not limited to path length (the maximum
number of edges connecting an upstream node and downstream nodes),
and possible causal paths that connect the upstream node to
downstream nodes. The output of RCR is a set of mechanism
hypotheses that represent upstream controllers of the differences
in experimental measurements, ranked by statistics that evaluate
relevance and accuracy. The mechanism hypotheses output can be
assembled into causal chains and larger networks to interpret the
dataset at a higher level of interconnected mechanisms and
pathways.
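By way of non-limiting illustration, the path-length constraint described above can be sketched as a breadth-first search that stops after a maximum number of edges. This is not an implementation of RCR itself, which also evaluates hypotheses statistically against measured data; the function name `downstream_within` and the adjacency-list layout are assumptions made for the example.

```python
from collections import deque


def downstream_within(adjacency, upstream, max_edges):
    """Return {node: fewest edges from `upstream`} for every node
    reachable along directed edges in at most `max_edges` steps.

    `adjacency` maps each node to a list of its direct targets
    (hypothetical layout).
    """
    distance = {upstream: 0}
    queue = deque([upstream])
    while queue:
        node = queue.popleft()
        if distance[node] == max_edges:
            continue  # do not expand beyond the path-length limit
        for target in adjacency.get(node, []):
            if target not in distance:
                distance[target] = distance[node] + 1
                queue.append(target)
    del distance[upstream]  # report downstream nodes only
    return distance


adjacency = {"U": ["A"], "A": ["B"], "B": ["C"]}
within_two = downstream_within(adjacency, "U", 2)
```

With a limit of two edges, node C in the toy chain U, A, B, C lies beyond the limit and is excluded from the candidate downstream nodes.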
[0057] One type of mechanism hypothesis comprises a set of causal
relationships that exist between a node representing a potential
cause (the upstream node or controller) and nodes representing the
measured quantities (the downstream nodes). The mechanism
hypothesis can be used to make predictions, such as if the
abundance of an entity represented by an upstream node increases,
the downstream nodes linked by causal increase relationships would
be inferred to increase, and the downstream nodes linked by
causal decrease relationships would be inferred to decrease.
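By way of non-limiting illustration, the prediction rule just described amounts to multiplying the sign of the upstream change by the sign of each causal relationship. In the sketch below, the encoding of "causal increase" as +1 and "causal decrease" as -1, and the function name `predict_downstream`, are assumptions made for the example.

```python
def predict_downstream(upstream_change, causal_signs):
    """Predict the direction of change of each downstream node.

    upstream_change: +1 if the upstream entity increases, -1 if it
        decreases.
    causal_signs: {downstream node: +1 for a causal increase
        relationship, -1 for a causal decrease relationship}.
    """
    return {node: upstream_change * sign
            for node, sign in causal_signs.items()}


# Hypothetical mechanism hypothesis with two downstream nodes.
hypothesis = {"GENE_A": +1, "GENE_B": -1}
predicted = predict_downstream(+1, hypothesis)
```

When the upstream entity increases, GENE_A is predicted to increase and GENE_B to decrease; a decrease of the upstream entity reverses both predictions.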
[0058] A mechanism hypothesis represents the relationships between
a set of measured data, for example, gene expression data, and a
biological entity that is a known controller of those genes.
Additionally, these relationships include or comprise the sign
(positive or negative) of influence between the upstream entity and
the differential expression of the downstream genes. The downstream
genes of a hypothesis are drawn from a database of
literature-curated causal biological knowledge. The causal
relationships of a mechanism hypothesis that link the upstream
entity to downstream genes, in the form of a computable causal
network model, are the substrate for the calculation of network
changes by the NPA scoring methods. The biological system may be
represented by at least one mechanism hypothesis--such as a
plurality of mechanism hypotheses. The at least one computational
causal network model may comprise a plurality of mechanism
hypotheses.
[0059] A scorable complex causal network model of biological
entities can be transformed into a single causal network model by
collecting the individual mechanism hypotheses representing
entities in the model and regrouping the connections of all the
downstream genes to a single upstream process representing the
whole complex causal network model; this in essence is a flattening
of the underlying graph structure. In this fashion, the activity
changes of the biological entities described by the network model
can be assessed via combination of its individual mechanism
hypotheses, such that the underlying gene expression measurements
contribute to the network as a whole.
[0060] To generate a scorable network for use in the methods of the
invention, a reference node is first selected from a starting,
typically complex, causal network model. The reference node can be
any entity in the network whose level or activity is positively
related to the activity of the network as a whole (as opposed to,
for example, an inhibitor whose activity may be negatively related
to the network activity). Next, the causal relationship between
each node in the model and the reference node is determined. This
can be done by first requiring that the model be "causally
consistent". The signs of regulation of downstream measurable
entities (in this example, gene expressions) for each node in the
model are adjusted based on the relationship between that model
node and the reference node. For example, the signs of the
downstream gene expressions for a model node that has a positive
causal relationship with the reference node (i.e., that node is
expected to be positively regulated when the reference node
increases) are maintained. On the other hand, the signs of the
downstream gene expressions for a model node with a negative causal
relationship with the reference node (i.e., that node is expected
to be negatively regulated when the reference node increases) are
inverted. All the downstream gene expressions and their signs are
then assembled into a single mechanism hypothesis, and downstream
gene expressions with contradictory signs (from multiple model
nodes) are omitted from the mechanism hypothesis.
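By way of non-limiting illustration, the assembly procedure described above can be sketched as follows: downstream signs are kept or inverted according to each model node's relationship to the reference node, and downstream gene expressions that end up with contradictory signs are omitted. The function name `flatten_model`, the +1/-1 encoding, and the example node names are hypothetical.

```python
def flatten_model(node_signs, downstream):
    """Assemble one mechanism hypothesis from a causal network model.

    node_signs: {model node: +1 or -1}, the sign of the node's causal
        relationship with the reference node.
    downstream: {model node: {gene: +1 or -1}}, the sign of regulation
        of each downstream gene expression by that node.
    Returns {gene: sign relative to the reference node}.
    """
    merged = {}
    contradictory = set()
    for node, genes in downstream.items():
        relation = node_signs[node]
        for gene, sign in genes.items():
            adjusted = relation * sign  # kept if +1 relation, inverted if -1
            if gene in merged and merged[gene] != adjusted:
                contradictory.add(gene)
            merged.setdefault(gene, adjusted)
    return {gene: sign for gene, sign in merged.items()
            if gene not in contradictory}


node_signs = {"REF": +1, "INHIB": -1}
downstream = {
    "REF": {"G1": +1, "G2": +1},
    "INHIB": {"G2": +1, "G3": +1},
}
flat = flatten_model(node_signs, downstream)
```

In this toy model, G2 is increased by both the reference node and a node negatively related to it, so its adjusted signs conflict and it is dropped, while G3 is retained with an inverted sign.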
[0061] For a network model to be causally consistent, for an
increase in any node in the model, it should be possible to
unambiguously map a sign of "positive regulation" or "negative
regulation" on every other node in the model by following the
causal relationships that connect the nodes. Biological
interpretation can be used to resolve ambiguities to construct
causally consistent models by considering what process is being
scored by the mechanism hypothesis, and in what sign each node is
effectively related to the reference node. For example, the node
where a negative feedback connects back to the model has a
particular relationship with the process being scored, and although
the negative feedback may regulate this node, it should not change
this relationship. Thus, the connection between the negative
feedback loop and this node can be removed from the model to obtain
causal consistency in a manner that is congruent with known facts.
Variations on the approach described above are discussed in U.S.
Patent Application Publication Nos. 2007/0225956 and 2009/0099784,
which are incorporated by reference herein in their entirety. An
exemplary causal network model is described in Westra J W, Schlage
W K, Frushour B P, Gebel S, Catlett N L, Han W, Eddy S F,
Hengstermann A, Matthews A L, Mathis C, et al: Construction of a
Computable Cell Proliferation Network Focused on Non-Diseased Lung
Cells. BMC Syst Biol 2011, 5:105, which is incorporated by
reference herein in its entirety.
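By way of non-limiting illustration, one simple way to test for the causal consistency described above is to propagate signs outward from a chosen node along signed, directed edges and to report any node that would receive both signs. The sketch below checks a single starting node only; the function name `propagate_signs` and the edge-tuple layout are assumptions made for the example.

```python
from collections import deque


def propagate_signs(edges, start):
    """Propagate regulation signs from `start` along causal edges.

    edges: list of (source, target, sign) with sign +1 or -1.
    Returns (signs, consistent), where `signs` maps each reached node
    to the first sign assigned to it and `consistent` is False if any
    path implies the opposite sign for an already-signed node.
    """
    adjacency = {}
    for source, target, sign in edges:
        adjacency.setdefault(source, []).append((target, sign))

    signs = {start: +1}  # an increase in the starting node
    queue = deque([start])
    consistent = True
    while queue:
        node = queue.popleft()
        for target, sign in adjacency.get(node, []):
            implied = signs[node] * sign
            if target not in signs:
                signs[target] = implied
                queue.append(target)
            elif signs[target] != implied:
                consistent = False  # ambiguous sign for this node
    return signs, consistent


with_feedback = [("A", "B", +1), ("B", "C", +1), ("C", "A", -1)]
without_feedback = [("A", "B", +1), ("B", "C", +1)]
_, ok_with = propagate_signs(with_feedback, "A")
signs_without, ok_without = propagate_signs(without_feedback, "A")
```

The negative feedback edge from C back to A makes the toy model inconsistent; removing that edge, as the paragraph above suggests, restores an unambiguous sign for every node.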
[0062] In certain implementations, the system 100 may contain or
generate a computerized model for the mechanism of cell
proliferation when the cells have been exposed to cigarette smoke.
In such an example, the system 100 may also contain or generate one
or more network models representative of the various health
conditions relevant to cigarette smoke exposure, including but not
limited to cancer, pulmonary diseases and cardiovascular diseases.
In certain aspects, these network models are based on at least one
of the perturbations applied (e.g., exposure to an agent), the
responses under various conditions, the measurable quantities of
interest, the outcome being studied (e.g., cell proliferation,
cellular stress, inflammation, DNA repair), experimental data,
clinical data, epidemiological data, and literature.
[0063] As an illustrative example, the network modeling engine 112
may be configured for generating a network model of cellular
stress. The network modeling engine 112 may receive networks
describing relevant mechanisms involved in the stress response
known from literature databases. The network modeling engine 112
may select one or more networks based on the biological mechanisms
known to operate in response to stresses in pulmonary and
cardiovascular contexts. In certain implementations, the network
modeling engine 112 identifies one or more functional units within
a biological system and builds a larger network model by combining
smaller networks based on their functionality. In particular, for a
cellular stress model, the network modeling engine 112 may consider
functional units relating to responses to oxidative, genotoxic,
hypoxic, osmotic, xenobiotic, and shear stresses. Therefore, the
network components for a cellular stress model may include or
comprise xenobiotic metabolism response, genotoxic stress,
endothelial shear stress, hypoxic response, osmotic stress and
oxidative stress. The network modeling engine 112 may also receive
content from computational analysis of publicly available
transcriptomic data from stress relevant experiments performed in a
particular group of cells.
[0064] When generating a network model of a biological mechanism,
the network modeling engine 112 may include or comprise one or more
rules. Such rules may include or comprise rules for selecting
network content, types of nodes, and the like. The network modeling
engine 112 may select one or more data sets from experimental data
database 106, including a combination of in vitro and in vivo
experimental results. The network modeling engine 112 may utilize
the experimental data to verify nodes and edges identified in the
literature. In the example of modeling cellular stress, the network
modeling engine 112 may select data sets for experiments based on
how well the experiment represented physiologically-relevant stress
in non-diseased lung or cardiovascular tissue. The selection of
data sets may be based on the availability of phenotypic stress
endpoint data, the statistical rigor of the gene expression
profiling experiments, and the relevance of the experimental
context to normal non-diseased lung or cardiovascular biology, for
example.
[0065] After identifying a collection of relevant networks, the
network modeling engine 112 may further process and refine those
networks. For example, in some implementations, multiple biological
entities and their connections may be grouped and represented by a
new node or nodes (e.g., using clustering or other techniques).
[0066] The network modeling engine 112 may further include or
comprise descriptive information regarding the nodes and edges in
the identified networks. A node may be described by its associated
biological entity, an indication of whether or not the associated
biological entity is a measurable quantity, or any other descriptor
of the biological entity. Some of the nodes are scorable and the
score may represent the activity level of the node(s). Many of the
nodes represent biological entities of which the activity levels
can be measured. However, in some implementations, it is not
necessary for the computerized method to receive data for all such
measurable nodes. Thus, the nodes are scorable and/or measurable.
In certain implementations, most of the nodes are measurable. A
measurable node may contain or comprise measured data. An edge may
be described by the type of relationship it represents (e.g., a
causal relationship such as an up-regulation or a down-regulation,
a correlation, a conditional dependence or independence), the
strength of that relationship, or a statistical confidence in that
relationship, for example. In some implementations, for each
treatment, each node that represents a measurable entity is
associated with an expected direction of activity change (i.e., an
increase or decrease) in response to the treatment. For example,
when a bronchial epithelial cell is exposed to an agent such as
tumor necrosis factor (TNF), the activity of a particular gene may
increase. This increase may arise because of a direct regulatory
relationship known from the literature (and represented in one of
the networks identified by network modeling engine 112) or by
tracing a number of regulation relationships (e.g., autocrine
signaling) through edges of one or more of the networks identified
by network modeling engine 112. In some cases, the network modeling
engine 112 may identify an expected direction of change, in
response to a particular perturbation, for each of the measurable
entities. When different pathways in the network indicate
contradictory expected directions of change for a particular
entity, the two pathways may be examined in more detail to
determine the net direction of change, or measurements of that
particular entity may be discarded. In certain embodiments,
direction values, for the nodes, may represent the expected
direction of change between the control data and the treatment
data. In certain embodiments, direction values, for the nodes, may
represent the expected change in value between the control data and
the treatment data. In certain embodiments, direction values, for
the nodes, may represent the expected increase or decrease in value
of the control data and the treatment data. Suitably, the change is
representative of the change after treatment.
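By way of non-limiting illustration, the derivation of an expected direction of change from one or more pathways can be sketched in Python. This is a minimal sketch under stated assumptions, not the method of the network modeling engine 112: each pathway is assumed to be given as a list of edge signs (+1 for up-regulation, -1 for down-regulation), the direction implied by a pathway is assumed to be the product of its edge signs, and an entity with contradictory pathways is simply discarded. The name `expected_direction` is hypothetical.

```python
from math import prod


def expected_direction(paths):
    """Return +1, -1, or None for one measurable entity.

    `paths` is a list of pathways, each a list of edge signs
    (+1 = up-regulation, -1 = down-regulation). Assumption: the
    direction implied by a pathway is the product of its edge signs.
    """
    directions = {prod(path) for path in paths}
    if len(directions) == 1:
        return directions.pop()
    # Contradictory pathways (or no pathway at all): discard the entity.
    return None
```

A returned value of None corresponds to the option, described above, of discarding measurements of an entity whose pathways disagree; examining the pathways in more detail is outside the scope of this sketch.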
D. Network Perturbation Amplitude
[0067] The computational methods and systems provided herein
translate SRPs into NPA scores. Experimental measurements that are
identified as downstream effects of a perturbation within a network
model are aggregated into a network-specific response score.
Accordingly, at step 216, the network scoring engine 114 generates
NPA scores for each perturbation using the networks identified at
step 214 by the network modeling engine 112 and the SRPs generated
at step 212 by the SRP engine 110. NPA scoring applies one or more
defined algorithm(s) to an experimental dataset consisting of a
series of treatment versus control comparisons, where the
experimental data is filtered to represent a particular scope of
biology (for example, a particular set of gene expression
relationships) in the context of a defined biological network
model. An NPA score quantifies a biological response to a treatment
(represented by the SRPs) in the context of the underlying
relationships between the biological entities (represented by the
identified networks). The network scoring engine 114 includes or
comprises hardware and software components for generating NPA
scores for each of the networks contained in or identified by the
network modeling engine 112.
[0068] The network scoring engine 114 may be configured to
implement any of a number of scoring techniques. Such techniques
include those that generate scalar-valued scores. Such techniques
also include those that generate vector-valued scores.
Vector-valued scores are indicative of the magnitude and
topological distribution of the response of the network to the
perturbation.
[0069] One described scoring technique is a strength scoring
technique. A strength score is a scalar-valued score that is a mean
of the activity observations for different entities represented in
the SRP. The strength of a network response is calculated in
accordance with:
\[ \mathrm{strength} = \frac{\sum_i d_i \beta_i}{N} \qquad (1) \]
where d.sub.i represents the expected direction of activity change
for the entity associated with node i, .beta..sub.i represents the
log of the fold-change (i.e. the number describing how much a
quantity changes going from initial to final value) of activity
between the treatment and control conditions, and N is the number
of nodes with associated measured biological entities. A positive
strength score indicates that the SRP is matched to the expected
activity change derived from the identified networks, while a
negative strength score indicates that the SRP is unmatched to the
expected activity change.
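As a non-limiting illustration, the strength score of Eq. 1 can be sketched in Python as follows. The function and parameter names are hypothetical, and the inputs are assumed to be equal-length sequences holding the expected directions (+1 or -1) and the log fold-changes of the N measured nodes.

```python
def strength_score(directions, log_fold_changes):
    """Eq. 1: mean of d_i * beta_i over the N measured nodes."""
    n = len(log_fold_changes)
    if n == 0 or len(directions) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(d * b for d, b in zip(directions, log_fold_changes)) / n
```

For example, `strength_score([1, -1, 1], [1.0, -2.0, 0.0])` is positive because both non-zero observations move in their expected directions.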
[0070] The score may be generated by a geometric perturbation index
scoring technique, a probabilistic perturbation index scoring
technique, or an expected perturbation index scoring technique. One
scoring technique is the Geometric Perturbation Index (GPI) scoring
technique. FIG. 5 is a flow diagram 500 of a GPI scoring technique
that may be implemented by the network scoring engine 114. At step
502, the network scoring engine assembles a fold-change vector
.beta.. A fold-change is a number describing how much a measurable
changes going from an initial value to a final value under
different conditions, such as between the perturbation and control
conditions. This fold-change vector has N components, corresponding
to the number of nodes in the network with associated measured
biological entities. In some implementations, the ith component of
the fold-change vector, .beta..sub.i, represents the logarithm
(e.g., base 2) of the fold-change of the activity of the ith
measured biological entity between the perturbation and control
conditions (i.e. the log of the factor by which the activity of the
entity changes between the two conditions). As a result, a value of
zero for .beta..sub.i indicates that no change in activity was
observed between the perturbation and control conditions. The
logarithm operation need not be included, or may be replaced by any
other linear or non-linear function. For example, in some
implementations, .beta..sub.i represents the fold-change in
activity between the perturbation and control conditions without a
logarithm operation; in such implementations, a value of one for
.beta..sub.i
indicates that no change in activity was observed between the
perturbation and control conditions. It will be understood that
fold-changes are simply one possible approach of quantifying an
activity for use with the network scoring techniques described
herein, and any other convention for expressing changes in
measurables may be used. In certain embodiments, the step of
generating the score may comprise a linear or a non-linear
combination of the activity measures, the weight values, and the
direction values; and a normalization of the combination by a scale
factor. The combination may be an arithmetic combination, and the
scale factor may be the square root of the number of biological
entities for which measured data are received. In certain
embodiments, the scores are not scalar-value scores.
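As a non-limiting illustration of step 502, a base-2 log fold-change vector can be assembled in Python as sketched below. The sketch assumes that each measured entity has one strictly positive activity value per condition; the function name is hypothetical.

```python
import math


def log2_fold_changes(treatment, control):
    """beta_i = log2(treatment_i / control_i); zero means no change."""
    if len(treatment) != len(control):
        raise ValueError("treatment and control must have equal length")
    return [math.log2(t / c) for t, c in zip(treatment, control)]
```

An entity whose activity rises from 2.0 to 8.0 yields a component of 2, an unchanged entity yields 0, and an entity whose activity falls from 4.0 to 1.0 yields -2.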
[0071] At step 504, the network scoring engine 114 generates a
weight vector r. The weight vector r also has N components, one for
each of the components of the fold-change vector .beta.. Each of
the components r.sub.i of the weight vector r represents a weight
to be given to the ith observed fold-change .beta..sub.i. In some
implementations, the weight represents the known biological
significance of the ith measured entity with regard to a feature or
an outcome of interest (e.g., a known carcinogen in cancer
studies). In some implementations, the weight represents the
confidence of the activity measurement of the biological entity
associated with the node. By weighting the log-fold-changes by
confidence estimates, fold-changes .beta..sub.i for which
confidence is low contribute less to the GPI score. Improved
laboratory conditions, increased number of biological replicates,
better repeatability, smaller variance, and stronger signals may
all contribute to a higher confidence in a particular
.beta..sub.i.
[0072] One value that may be advantageously used for weighting is
the local false non-discovery rate fndr.sub.i (i.e., the
probability that a fold-change value .beta..sub.i represents a
departure from the underlying null hypothesis of a zero
fold-change, in some cases, conditionally on the observed p-value)
as described by Strimmer et al. in "A general modular framework for
gene set enrichment analysis," BMC Bioinformatics 10:47, 2009 and
by Strimmer in "A unified approach to false discovery rate
estimation," BMC Bioinformatics 9:303, 2008, each of which is
incorporated by reference herein in its entirety. In some
implementations, fndr.sub.i is calculated in accordance with
\[ \mathrm{fndr}_i(\beta_1, \ldots, \beta_N) = 1 - 2\, v_i(\beta_1, \ldots, \beta_N) \int_{|\beta_i| / S_i(\beta_1, \ldots, \beta_N)}^{\infty} t_{df}(x)\, dx, \qquad (2) \]
where fdr.sub.i=1-fndr.sub.i is the local false discovery rate (i.e., the
probability that a fold-change value .beta..sub.i does not
represent a departure from the underlying null hypothesis of a zero
fold-change), v.sub.i is the Benjamini-Hochberg adjustment factor
described by Benjamini et al. in "Controlling the false discovery
rate: a practical and powerful approach to multiple testing,"
Journal of the Royal Statistical Society, Series B 57:289, 1995,
which is incorporated by reference herein in its entirety, p is the
probability of obtaining a fold-change at least as extreme as the
fold-change .beta..sub.i, that was actually observed (assuming that
the null hypothesis of a zero fold-change is true), and t.sub.df is
a t-distribution with df degrees of freedom. Note that p is a
function of .beta..sub.i and the standard deviation S.sub.i, which
is in turn based on all of the .beta..sub.i. In an alternative
implementation, no adjustment for multiple testing is made;
accordingly, v.sub.i(.beta..sub.1, . . . , .beta..sub.N) is equal
to 1 and the weight component r.sub.i=1-p(.beta..sub.i,
S.sub.i(.beta..sub.1, . . . , .beta..sub.N)).
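As a non-limiting illustration, the weight of Eq. 2 can be sketched in Python using only the standard library. Several points are assumptions of this sketch rather than features of the described system: the adjustment factor v is supplied by the caller (v = 1 corresponds to the alternative implementation without adjustment for multiple testing), the tail integral is evaluated at the absolute value of the standardized fold-change, the integral is approximated with Simpson's rule, and negative results are clipped to zero. All function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(a, df, steps=2000):
    """Integral of t_pdf from a >= 0 to infinity.

    Computed as 0.5 minus a composite Simpson approximation of the
    integral from 0 to a; `steps` must be even.
    """
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 - total * h / 3


def fndr_weight(beta, s, df, v=1.0):
    """Weight 1 - 2 * v * (upper tail at |beta| / s), following Eq. 2."""
    weight = 1.0 - 2.0 * v * t_upper_tail(abs(beta) / s, df)
    return max(0.0, weight)  # assumption: clip so the weight is never negative
```

With v = 1 the weight is zero for a zero fold-change and approaches one as the standardized fold-change grows, so low-confidence observations contribute less to a weighted score.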
[0073] At step 506, the network scoring engine 114 uses the weight
vector r to scale the fold-change vector .beta.. The result is a
scaled fold-change vector in which each component .beta..sub.i is
multiplied by its associated weight component r.sub.i. One way to
achieve such a scaling computationally is to create an N.times.N
diagonal matrix with the weight components r.sub.i on the diagonal,
and multiply that matrix by the N.times.1 vector .beta., as shown
in Eq. 3:
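\[ \begin{bmatrix} r_1 \beta_1 \\ r_2 \beta_2 \\ \vdots \\ r_N \beta_N \end{bmatrix} = \underbrace{\begin{bmatrix} r_1 & 0 & \cdots & 0 \\ 0 & r_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & r_N \end{bmatrix}}_{\mathrm{diag}(r)} \underbrace{\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_N \end{bmatrix}}_{\beta} \qquad (3) \]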
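As a non-limiting illustration, the scaling of step 506 can be sketched in Python by building the diagonal matrix explicitly and multiplying it by the fold-change vector. The function name is hypothetical, and plain lists are used in place of any particular linear algebra library.

```python
def scale_by_weights(weights, fold_changes):
    """Multiply diag(weights) by the column vector fold_changes (Eq. 3)."""
    n = len(fold_changes)
    if len(weights) != n:
        raise ValueError("weights and fold_changes must have equal length")
    # N x N matrix with the weights on the diagonal and zeros elsewhere.
    diag = [[weights[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return [sum(diag[i][j] * fold_changes[j] for j in range(n)) for i in range(n)]
```

Because all off-diagonal entries are zero, the same result is obtained from the element-wise products `[r * b for r, b in zip(weights, fold_changes)]`, which avoids storing the N.times.N matrix.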
[0074] At step 508, the network scoring engine 114 identifies the
expected directions of change for each component in the fold-change
vector .beta.. The network scoring engine 114 may do so by querying
the network modeling engine 112 to retrieve the expected directions
of change from the causal biological network models. The network
scoring engine 114 can then assemble these expected directions of
change into an N-component vector d, where the ith component of the
vector d, d.sub.i, represents the expected direction of change
(e.g., +1 for increased activity and -1 for decreased activity) for
the ith measured biological entity.
[0075] At step 510, the network scoring engine 114 combines the
components of the scaled fold-change vector (generated at step 506)
with the expected directions of change for each component
(identified at step 508). In some implementations, this combination
is an arithmetic combination, wherein each of the scaled
fold-changes r.sub.i.beta..sub.i are multiplied by its
corresponding expected direction of change d.sub.i and the result
summed over all N biological entities. Mathematically, this
implementation of step 510 can be represented by
\[ \sum_i d_i r_i \beta_i. \qquad (4) \]
In other implementations, the vectors d, r and .beta. may be
combined in any linear or non-linear manner.
[0076] At step 512, the network scoring engine 114 normalizes the
combination of step 510. In some implementations, the normalization
consists of dividing by a pre-determined scale factor. One such
scale factor is the square root of N, the number of biological
entities. In this implementation, the GPI score can be represented
by
\[ \mathrm{GPI} = \frac{\sum_i d_i r_i \beta_i}{\sqrt{N}}. \qquad (5) \]
Other scale factors, which may or may not be pre-determined, may
also be used. In certain embodiments, a causal network model (e.g.,
a mechanism hypothesis) can be seen as a unit sign vector s=(1, 1,
-1, 1, . . . , -1)/sqrt(N) in the N-dimensional downstream measurable
space (where each dimension represents a downstream measurable,
here gene expression, of the causal network model). The observed
effect of perturbation on the downstream gene expressions is also a
vector in this space. So geometrically, the amplitude of the
perturbation in the causal network model can be quantified by
projecting the differential log.sub.2 expression vector onto the
hypothesis unit vector. However, the downstream measurements of a
causal network model come from a generic model. To deal explicitly
with the specificity of data supporting an NPA score, each
downstream measurable is assigned a belief of activation, which is
set to be the local false non-discovery rate
(fndr.sub.i=1-fdr.sub.i). This is equivalent to weighting the
dimensions of the downstream gene expression space according to the
belief of each differential expression, and therefore to using a
weighted scalar product to define the geometry of the gene
expression space: <s|.beta.>.sub.W=s.sup.Tdiag(fndr).beta.. Hence,
GPI=<s|.beta.>.sub.W=(.SIGMA.d.sub.ifndr.sub.i.beta..sub.i)/sqrt(N),
where d.sub.i=.+-.1 is the sign of the ith component of s. By
weighting the differential log.sub.2 expression with the false
non-discovery rate,
individual differential expression values for which there is little
confidence are moved closer to zero (no change), while values for
which there is stronger confidence are minimally decreased. A
positive GPI score indicates an upregulation of the process
described by the mechanism hypotheses, a zero GPI score indicates
that the process is unchanged along the direction s of the
mechanism hypotheses, and a negative GPI score indicates that the
process is down-regulated.
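As a non-limiting illustration, steps 506 through 512 can be combined into a single Python function that evaluates Eq. 5. The function and parameter names are hypothetical; the weights are assumed to have been computed beforehand (for example, as false non-discovery rates).

```python
import math


def gpi_score(directions, weights, log_fold_changes):
    """Eq. 5: sum of d_i * r_i * beta_i, divided by sqrt(N)."""
    n = len(log_fold_changes)
    if n == 0 or not (len(directions) == len(weights) == n):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(d * r * b for d, r, b in zip(directions, weights, log_fold_changes))
    return total / math.sqrt(n)
```

An observation with weight zero contributes nothing to the score, and an observation that moves against its expected direction lowers the score.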
[0077] FIG. 6 is a flow diagram 600 of a Probabilistic Perturbation
Index (PPI) scoring technique that may be implemented by the
network scoring engine 114. As discussed above with respect to SRP
engine 110 (FIG. 1) and step 212 of process 200 (FIG. 2), each SRP
represents the activity (or change in activity) of a measured
biological entity under a treatment condition. Each SRP, then, is
associated with a number of measured activities, one for each
measured biological entity. The PPI is a quantification of the
probability that the biological mechanisms represented by the
networks of interest are activated given the observed SRPs.
[0078] At step 602, the network scoring engine 114 assembles a
fold-change vector .beta.. This fold-change vector, representing
the observed fold-changes in the activity of the N measured
biological entities, may be assembled as described above with
reference to step 502 of the Geometric Perturbation Index (GPI)
scoring technique illustrated in FIG. 5. At step 604, the network
scoring engine 114 generates a range for the fold-change density.
The range for the fold-change density represents an approximation
of the set of values that the fold-change values can take in the
biological system under the treatment conditions, and may be
approximated by the range [-W,W], where W is the theoretical
expected largest absolute value of a log 2 fold-change. By choosing
W this way, all observed fold-changes will fall in the range
[-W,W]. For example, the maximum expected signal of a gene chip
(e.g., 16 in log 2 scale) may be used as the value W.
[0079] At step 606, the network scoring engine 114 identifies the
expected directions of change for each component in the fold-change
vector .beta.. This step may be performed as described above with
reference to step 508 of the GPI scoring technique illustrated in
FIG. 5, resulting in a set of expected directions of change d.sub.i
that correspond to the observed fold-changes .beta..sub.i.
[0080] At step 608, the network scoring engine 114 generates a
positive activation metric. In some implementations, a positive
activation metric represents the degree to which the SRPs indicate
that the observed activation/inhibition of biological entities is
consistent with the expected directions of change represented by
the d.sub.i. Consistent behavior is referred to as "positive
activation" herein. One positive activation metric that may be used
is the probability that a network or networks is positively
activated. Such a probability, referred to as PPI.sup.+, may be
calculated in accordance with the following expression:
\[ \mathrm{PPI}^{+} = \Pr(\mathrm{PositivelyActivated}) = \frac{1}{W} \int_{0}^{W} \Pr(\mathrm{PositivelyActivated} \mid \phi)\, d\phi, \qquad (6) \]
in which
\[ \Pr(\mathrm{PositivelyActivated} \mid \phi) = \frac{1}{N} \sum_{0 < d_i \beta_i < \phi} \mathrm{fndr}_i \qquad (7) \]
where fndr.sub.i is the false non-discovery rate discussed above
with reference to Eq. 2. In some implementations, the network
scoring engine 114 is configured to numerically integrate the
expression of Eq. 6 using a set of bins representing the values of
.phi. between 0 and W. One set of bins that may be used are the
bins [d.sub.(i-1).beta..sub.(i-1),d.sub.(i) .beta..sub.(i)], where
the (.cndot.) subscripts represent the values taken in order from
smallest fold-change to largest fold-change and with the convention
that d.sub.(0).beta..sub.(0)=0. In such implementations, the
network scoring engine 114 calculates an approximation to the
positive activation metric PPI.sup.+ according to:
\[ \mathrm{PPI}^{+} \approx \frac{1}{WN} \sum_{0 < d_i \beta_i} \mathrm{fndr}_i\, d_i \beta_i. \qquad (8) \]
[0081] At step 610, the network scoring engine 114 generates a
negative activation metric. In some implementations, a negative
activation metric represents the degree to which the SRPs indicate
that the observed activation/inhibition of biological entities is
inconsistent with the expected directions of change represented by
the d.sub.i. Inconsistent behavior is referred to as "negative
activation" herein. One negative activation metric that may be used
is the probability that a network or networks is negatively
activated. Such a probability, referred to as PPI.sup.-, may be
calculated in accordance with the following expression:
\[ \mathrm{PPI}^{-} = \Pr(\mathrm{NegativelyActivated}) = \frac{1}{W} \int_{-W}^{0} \Pr(\mathrm{NegativelyActivated} \mid \phi)\, d\phi, \qquad (9) \]
in which
\[ \Pr(\mathrm{NegativelyActivated} \mid \phi) = \frac{1}{N} \sum_{\phi < d_i \beta_i < 0} \mathrm{fndr}_i \qquad (10) \]
where fndr.sub.i is the false non-discovery rate discussed above
with reference to Eqs. 2 and 7. As discussed above with reference
to positive activation metrics, in some implementations, the
network scoring engine 114 is configured to numerically integrate
the expression of Eq. 9 using a set of bins representing the values
of .phi. between -W and 0. One set of bins that may be used are the
bins [d.sub.(i-1).beta..sub.(i-1),d.sub.(i) .beta..sub.(i)], where
the (.cndot.) subscripts represent the values taken in order from
smallest fold-change to largest fold-change and with the convention
that d.sub.(0).beta..sub.(0)=0. In such implementations, the
network scoring engine 114 calculates an approximation to the
negative activation metric PPI.sup.- according to:
\[ \mathrm{PPI}^{-} \approx \frac{1}{WN} \sum_{d_i \beta_i < 0} \mathrm{fndr}_i\, |d_i \beta_i|. \qquad (11) \]
[0082] At step 612, the network scoring engine combines the
positive activation metric (generated at step 608) and the negative
activation metric (generated at step 610) to generate a composite
metric, referred to as the Probabilistic Perturbation Index or PPI.
The combination of step 612 can be any linear or non-linear
combination. In some implementations, the PPI is a weighted linear
combination of the positive activation metric and the negative
activation metric. For example, the network scoring engine 114 may
be configured to generate a PPI in accordance with:
\[ \mathrm{PPI} = \frac{1}{2} \left( \mathrm{PPI}^{+} + \mathrm{PPI}^{-} \right), \qquad (12) \]
where PPI.sup.+ and PPI.sup.- are the positive and negative
activation metrics described above. The PPI generated according to
Eq. 12 is related to the GPI calculated according to Eq. 5 in the
following manner:
\[ \mathrm{GPI} = W \sqrt{N} \left( \mathrm{PPI}^{+} - \mathrm{PPI}^{-} \right). \qquad (13) \]
Additionally, the network scoring engine 114 may be configured to
compute the PPI of Eq. 12 by calculating the L1 norm of the vector
whose ith component is defined according to:
\[ \left[ \frac{1}{2WN}\, \mathrm{fndr}_i\, d_i \beta_i \right]. \qquad (14) \]
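As a non-limiting illustration, the approximations of Eqs. 8, 11 and 12 can be sketched in Python as follows. The sketch assumes that the negative activation metric is accumulated as a non-negative magnitude, which is what the relation of Eq. 13 and the L1-norm form of Eq. 14 require. The function and parameter names are hypothetical.

```python
def ppi_scores(directions, fndr, log_fold_changes, w):
    """Return (PPI+, PPI-, PPI) from the approximations of Eqs. 8, 11, 12.

    `w` is the bound W on the absolute log fold-change.
    Assumption: PPI- is accumulated as a non-negative magnitude.
    """
    n = len(log_fold_changes)
    signed = [d * b for d, b in zip(directions, log_fold_changes)]
    # Observations consistent with the expected direction (Eq. 8).
    ppi_pos = sum(f * x for f, x in zip(fndr, signed) if x > 0) / (w * n)
    # Observations inconsistent with the expected direction (Eq. 11).
    ppi_neg = sum(f * -x for f, x in zip(fndr, signed) if x < 0) / (w * n)
    return ppi_pos, ppi_neg, 0.5 * (ppi_pos + ppi_neg)
```

With four entities, W = 16, directions (1, -1, 1, 1), weights (1, 0.5, 0, 1) and log fold-changes (2, -2, 5, -1), the difference of the two metrics multiplied by W and the square root of N reproduces the GPI value obtained from Eq. 5 for the same inputs.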
[0083] FIG. 7 is a flow diagram 700 of an Expected Perturbation
Index (EPI) scoring technique that may be implemented by the
network scoring engine 114. As discussed above with respect to SRP
engine 110 (FIG. 1) and step 212 of process 200 (FIG. 2), each SRP
represents the activity (or change in activity) of a measured
biological entity under a treatment condition. Each SRP, then, is
associated with a number of measured activities, one for each
measured biological entity. The EPI is a quantification of the
average activity change over all biological entities represented by
the SRP. Generally the measured activities represented in an SRP
may be random draws from a distribution of measured activities,
with the EPI representing the expected value of that distribution.
If each of the fold-changes .beta..sub.i is drawn from a
distribution p(.cndot.), then the expected value of that
distribution is
\[ \mathrm{EPI} = \int \phi\, p(\phi)\, d\phi. \qquad (15) \]
Since the true theoretical distribution p(.cndot.) is not readily
known, the network scoring engine 114 may be configured to execute
the steps described below to approximate the EPI value based on the
observed activities and other information drawn from the system
100.
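As a non-limiting illustration of the expected value in Eq. 15, the following Python sketch evaluates the integral for a piecewise-constant density using a rectangular approximation. This is a generic numerical illustration, not the specific EPI approximation of the steps described herein: the bin edges, the density values and the function name are all hypothetical inputs chosen for the sketch, and the density is normalized inside the function.

```python
def expected_value(bin_edges, density):
    """Rectangle-rule evaluation of Eq. 15 for a piecewise-constant density.

    `density[k]` is the (possibly unnormalized) density on the interval
    [bin_edges[k], bin_edges[k + 1]].
    """
    pairs = list(zip(bin_edges, bin_edges[1:]))
    if len(density) != len(pairs):
        raise ValueError("need exactly one density value per bin")
    widths = [hi - lo for lo, hi in pairs]
    mids = [(lo + hi) / 2 for lo, hi in pairs]
    mass = sum(p * w for p, w in zip(density, widths))
    return sum(m * p * w for m, p, w in zip(mids, density, widths)) / mass
```

For a density that is uniform on [0, 2], the function returns 1; shifting more mass toward larger fold-changes increases the returned value.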
[0084] At step 702, the network scoring engine 114 assembles a
fold-change vector .beta.. This fold-change vector, representing
the observed fold-changes in the activity of the N measured
biological entities, may be assembled as described above with
reference to step 502 of the Geometric Perturbation Index (GPI)
scoring technique illustrated in FIG. 5 or step 602 of the
Probabilistic Perturbation Index (PPI) scoring technique
illustrated in FIG. 6. At step 704, the network scoring engine 114
generates a range for the fold-change density. The network scoring
engine 114 may generate the range for the fold-change density as
described above with reference to step 604 of the PPI scoring
technique illustrated in FIG. 6.
[0085] At step 706, the network scoring engine 114 identifies the
expected directions of change for each component in the fold-change
vector .beta.. This step may be performed as described above with
reference to step 508 of the GPI scoring technique illustrated in
FIG. 5, resulting in a set of expected directions of change d.sub.i
that correspond to the observed fold-changes .beta..sub.i.
[0086] At step 708, the network scoring engine 114 generates an
approximate fold-change density. If each of the fold-changes
.beta..sub.i is drawn from a distribution p(.cndot.), then the
distribution p(.cndot.) can be approximately represented by:
\[ \hat{p}(\phi) \propto \begin{cases} \dfrac{1}{N} \sum_{i \mid d_i \beta_i > \phi} \dfrac{\beta_i}{W}, & \phi > 0 \\ \dfrac{1}{N} \sum_{i \mid d_i \beta_i < \phi} \dfrac{\beta_i}{W}, & \phi < 0 \end{cases} \qquad (16) \]
[0087] At step 710, the network scoring engine 114 generates the
approximate expected value of the approximate fold-change density,
resulting in an EPI score. In some implementations, the network
scoring engine 114 applies a computational interpolation technique
(e.g., linear or non-linear interpolation techniques) to generate
an approximate continuous distribution from the distribution of Eq.
16, then calculates the expected value of that distribution using
the formula of Eq. 15. In other implementations, the network
scoring engine 114 is configured to use the discrete distribution
of Eq. 16 as a rectangular approximation to the continuous
distribution, and calculate the EPI in accordance with:
\[ \mathrm{EPI} \approx \frac{1}{WN} \left[ \sum_{i \mid d_i \beta_i > 0} (d\beta)_{(i)} \left( \sum_{j=1}^{n^{+}} (d\beta)_{(j)} \right) \left( (d\beta)_{(i)} - (d\beta)_{(i-1)} \right) - \sum_{i \mid d_i \beta_i < 0} -(d\beta)_{(i)} \left( \sum_{j=1}^{n^{-}} -(d\beta)_{(j)} \right) \left( -(d\beta)_{(i)} - \left( -(d\beta)_{(i-1)} \right) \right) \right] \qquad (17) \]
In Eq. 17, the (.cndot.) subscripts represent the values taken in
order from smallest fold-change to largest fold-change, n.sup.+ is
the number of entities whose activity was expected to increase in
response to the treatment (d.sub.i.beta..sub.i>=0) (per step
706) and n.sup.- is the number of entities whose activity was expected
to decrease in response to the treatment
(d.sub.i.beta..sub.i<=0) (per step 706). In the EPI score, high
value fold-changes are taken into account more often than lower
ones, providing a measure of activity with high specificity.
[0088] The network scoring engine 114 may also be configured to
determine confidence intervals around the network scores. These
confidence intervals may be used by clinicians and researchers to
evaluate the experimental results reflected in the network scores
and may be used by other components of the system 100 in further
data processing steps (e.g., by the aggregation engine 110). One
useful method for determining confidence intervals is to evaluate
the null hypothesis of the network score being zero (or other
appropriate null value representing no difference in activity
between treatment and control conditions) for a given Type-I (false
positive) error risk .alpha. (e.g., .alpha.=0.05). In some
implementations, the network scoring engine 114 uses a
computational bootstrapping technique, such as a parametric or
non-parametric bootstrapping technique, to approximate the
distributions of the computed metrics. Many such bootstrapping
techniques are known in the art. When few assumptions about the
underlying distribution can be made, a non-parametric technique may
be advantageously employed. When an underlying distribution is
assumed, parametric techniques may be advantageously employed. In
the examples discussed below, the .beta..sub.i are assumed to arise
from a normal distribution under the null hypothesis, with mean
zero and sample variance S.sub.i.sup.2 based on df degrees of
freedom. The network scoring engine may generate these quantities,
as well as t-statistics and moderated t-statistics representative
of the .beta..sub.i, by using a statistical estimation and test
procedure, such as the t-statistics and moderated t-statistics
generated by the linear model approach of the "limma" R package,
commonly used in the analysis of differential gene expression and
described by Smyth in "Linear models and empirical Bayes methods
for assessing differential expression in microarray experiments,"
Statistical Applications in Genetics and Molecular Biology, 3:3,
2004, incorporated in its entirety by reference herein. For
example, to determine confidence intervals for EPI scores (as
discussed above with reference to FIG. 7), the network scoring
engine 114 may be configured to implement a parametric
bootstrapping technique to approximate the distribution of the
.beta..sub.i, assuming that the .beta..sub.i arise from an
underlying normal distribution. In implementations in which the
assumptions for the application of percentile bootstrapping appear
to be violated, which may include or comprise EPI, the network
scoring engine 114 may additionally apply the bias-corrected
percentile method described by Efron in "The jackknife, the
bootstrap, and other resampling plans," SIAM, 1982 and Diciccio et
al. in "A review of bootstrap confidence intervals," Journal of the
Royal Statistical Society, 50:338, 1988, each of which is
incorporated by reference in its entirety herein.
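As a non-limiting illustration, a parametric percentile bootstrap for an arbitrary network score can be sketched in Python as follows. The sketch assumes that each fold-change is resampled independently from a normal distribution centered on its observed value with standard deviation S.sub.i; it does not include the bias-corrected percentile method. The function name, the parameter names and the default number of draws are hypothetical.

```python
import random


def bootstrap_interval(score_fn, betas, std_devs, alpha=0.05, draws=2000, seed=0):
    """Percentile interval for score_fn(betas) by parametric bootstrap.

    Assumption: each beta_i is resampled from a normal distribution
    centered on its observed value with standard deviation S_i.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = sorted(
        score_fn([rng.gauss(b, s) for b, s in zip(betas, std_devs)])
        for _ in range(draws)
    )
    lower = scores[int((alpha / 2) * draws)]
    upper = scores[int((1 - alpha / 2) * draws) - 1]
    return lower, upper
```

The `score_fn` argument can be any function of the fold-change vector, for example a mean such as `lambda b: sum(b) / len(b)`. If the resulting interval excludes zero, the null hypothesis of a zero score would be rejected at the chosen error risk.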
[0089] In some implementations, the network scoring engine 114 may
employ an analytical approach to determine the confidence
intervals, instead of or in combination with a bootstrapping
technique. The particular techniques implemented by the network
scoring engine 114 to analytically determine confidence intervals
will depend on the particular network scoring technique used and
the assumptions on the underlying statistical distributions for the
.beta..sub.i.
[0090] For example, when the network scoring engine 114 is
configured to calculate strength scores (in accordance with Eq. 1),
the network scoring engine 114 treats the strength score as a
random variable consisting of a weighted sum of independent,
approximately normal random variables. As a result, the strength
score is itself approximately normally distributed, with zero mean
under the null hypothesis and a variance that is calculated in
accordance with
\[ S_{\mathrm{strength}}^{2} = \frac{1}{N^{2}} \sum_i S_i^{2}. \qquad (18) \]
The network scoring engine 114 can use the variance
S.sub.strength.sup.2 to derive a t-statistic in accordance with
\[ t = \frac{\mathrm{strength}}{S_{\mathrm{strength}}}, \qquad (19) \]
whose degrees of freedom df is estimated with the
Welch-Satterthwaite equation, described by Satterthwaite in "An
approximate distribution of estimates of variance components,"
Biometrics, 2:110, 1946 and by Welch in "The generalization of
student's problems when several different population variances are
involved," Biometrika, 34:28, 1947, each of which is incorporated
in its entirety by reference herein. Using these quantities, the
network scoring engine 114 may generate a (1-.alpha.)-confidence
interval for the strength score in accordance with
\[ \mathrm{strength} \pm t_{df}^{\alpha/2}\, S_{\mathrm{strength}}. \qquad (20) \]
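As a non-limiting illustration, the analytical interval of Eqs. 18 through 20 can be sketched in Python as follows. The sketch assumes that the critical value of the t-distribution is supplied by the caller (for example, from a statistical table), because the Python standard library does not provide t-distribution quantiles. The Welch-Satterthwaite helper uses the standard form of that approximation for a sum of independent variance estimates. All function and parameter names are hypothetical.

```python
import math


def strength_interval(directions, betas, variances, t_crit):
    """Return (strength, t_statistic, (lower, upper)) per Eqs. 1 and 18-20.

    `variances` holds the S_i^2 values; `t_crit` is the critical value of
    the t-distribution for the chosen alpha and degrees of freedom.
    """
    n = len(betas)
    strength = sum(d * b for d, b in zip(directions, betas)) / n
    s_strength = math.sqrt(sum(variances) / n**2)  # Eq. 18
    t_stat = strength / s_strength                 # Eq. 19
    half_width = t_crit * s_strength               # Eq. 20
    return strength, t_stat, (strength - half_width, strength + half_width)


def welch_satterthwaite_df(variances, dfs):
    """Effective degrees of freedom for a sum of independent variance
    estimates S_i^2, where estimate i has dfs[i] degrees of freedom."""
    numerator = sum(variances) ** 2
    denominator = sum(v**2 / df for v, df in zip(variances, dfs))
    return numerator / denominator
```

The common factor of 1/N.sup.2 in Eq. 18 cancels in the Welch-Satterthwaite ratio, so the helper can be called directly with the unscaled variances.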
[0091] As another example, when the network scoring engine 114 is
configured to calculate GPI scores (as discussed above with
reference to FIG. 5), the network scoring engine 114 may also be
configured to calculate a confidence interval for the GPI score in
accordance with the steps of the flow diagram 800 of FIG. 8. At
step 802, the network scoring engine 114 performs a first-order
Taylor expansion of the GPI score as represented by Eq. 5, as a
function of the .beta..sub.i, in accordance with
\[ \mathrm{GPI}(\beta_1, \ldots, \beta_N) = \mathrm{GPI}(\hat{\beta}_1, \ldots, \hat{\beta}_N) + \sum_i \left. \frac{\partial\, \mathrm{GPI}}{\partial \beta_i} \right|_{\hat{\beta}_i} (\beta_i - \hat{\beta}_i) + O(N^{2}) \qquad (21) \]
wherein .beta..sub.i hat is the measured fold-change value. The
first-order Taylor approximation of the GPI score retains the first
two terms and drops the O(N.sup.2) terms.
[0092] At step 804, the network scoring engine 114 assesses whether
the coefficients of the .beta..sub.i terms in the GPI calculation
are functions of the .beta..sub.i. These coefficients include or
comprise the expected direction terms d.sub.i and the weights
r.sub.i. When these coefficients do not depend on the values of
.beta..sub.i, the first-order term in Eq. 21 becomes a constant
value with respect to .beta..sub.i and the network scoring engine
114 proceeds to step 808. However, when the coefficients do depend
on the values of .beta..sub.i, the network scoring engine 114
proceeds to step 806 to approximate the first-order term in Eq. 21.
In particular, when the weight vector r is a function of the
.beta..sub.i and the expected direction terms d.sub.i are not a
function of the .beta..sub.i, the first order term may be
represented as:
\[ \frac{\partial\, \mathrm{GPI}}{\partial \beta_i} = \frac{1}{\sqrt{N}} \left( d_i r_i + d_i \beta_i \frac{\partial r_i}{\partial \beta_i} \right). \qquad (22) \]
In particular, when the weight vector r is a vector of false
non-discovery rate values, fndr.sub.i, as discussed above with
reference to Eq. 2 and step 504 of FIG. 5, the network scoring
engine 114 may use the following expression for the derivative term
of Eq. 22:
\[ \frac{\partial}{\partial \beta_i}\, \mathrm{fndr}_i(\beta_1, \ldots, \beta_N) = -2\, \underbrace{\frac{\partial v_i(\beta_1, \ldots, \beta_N)}{\partial \beta_i}}_{\mathrm{term1}}\, \underbrace{\int_{|\beta_i| / S_i}^{\infty} t_{df}(x)\, dx}_{\mathrm{term2}} - 2\, v_i(\beta_1, \ldots, \beta_N)\, \frac{\partial}{\partial \beta_i} \left( \int_{|\beta_i| / S_i}^{\infty} t_{df}(x)\, dx \right). \qquad (23) \]
[0093] The derivative labeled "term1" in Eq. 23 represents the
derivative of the Benjamini-Hochberg adjustment factor and the
integral labeled "term2" represents the p-value for the fold-change
of the ith biological entity. Because the Benjamini-Hochberg terms
are most relevant when p-values are low, the network scoring engine
114 may be configured to approximate the product of term1 and term2
as zero at step 806. As a result, the network scoring engine 114
may apply the fundamental theorem of calculus and use the following
approximation of the derivative term of Eq. 23:
$$\frac{\partial}{\partial \beta_i}\mathrm{fndr}_i(\beta_1,\ldots,\beta_N) \approx 2\,\mathrm{sgn}(\beta_i)\,\frac{v_i(\beta_1,\ldots,\beta_N)}{S_i}\,t_{df}\!\left(\frac{\beta_i}{S_i}\right). \qquad (24)$$
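The approximation of Eq. 24 can be evaluated numerically. The following is a minimal Python sketch, not the implementation of the network scoring engine 114: the function names `t_pdf` and `fndr_derivative` are hypothetical, the Benjamini-Hochberg adjustment factor v.sub.i is assumed to be supplied by the caller and is treated as a constant, and t.sub.df is taken to be the density of Student's t distribution with df degrees of freedom.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def fndr_derivative(beta, s, v, df):
    """Approximate d(fndr_i)/d(beta_i) following Eq. 24.

    beta: fold-change of the ith entity
    s:    standard error S_i of that fold-change (must be > 0)
    v:    adjustment factor v_i (assumed constant, supplied by caller)
    df:   degrees of freedom of the t statistic
    """
    sign = (beta > 0) - (beta < 0)  # sgn(beta): -1, 0 or +1
    return 2.0 * sign * v / s * t_pdf(beta / s, df)
```

Because the t density is symmetric, the result is an odd function of the fold-change: it changes sign with .beta..sub.i and is zero at .beta..sub.i=0.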
[0094] Including the approximation of Eq. 24 in the expression of
Eq. 21 yields the following approximation of the GPI score:
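$$\mathrm{GPI}(\beta_1,\ldots,\beta_N) \approx \mathrm{GPI}(\hat{\beta}_1,\ldots,\hat{\beta}_N) + \frac{1}{N}\sum_i (\beta_i - \hat{\beta}_i)\left(d_i\,\mathrm{fndr}_i + d_i \hat{\beta}_i \left[2\,\mathrm{sgn}(\hat{\beta}_i)\,\frac{v_i(\hat{\beta}_1,\ldots,\hat{\beta}_N)}{\hat{S}_i}\,t_{df}\!\left(\frac{\hat{\beta}_i}{\hat{S}_i}\right)\right]\right) \qquad (25)$$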
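The per-node coefficients that multiply the fold-change deviations in Eqs. 22 and 25 can be collected into a gradient vector. The sketch below is illustrative only: `gpi_gradient` is a hypothetical name, and the derivative of each weight with respect to its fold-change (`dr_dbeta`) is assumed to be computed elsewhere, for example by the approximation of Eq. 24 when the weights are false non-discovery rates.

```python
def gpi_gradient(d, r, beta, dr_dbeta):
    """Per-node partial derivatives of the GPI score, following Eq. 22.

    d:        expected direction values d_i (typically +1 or -1)
    r:        weights r_i (e.g. fndr_i)
    beta:     fold-change values at which the gradient is evaluated
    dr_dbeta: derivative of each weight with respect to its fold-change;
              all zeros when the weights do not depend on the fold-changes
    """
    n = len(beta)
    if n == 0 or not (len(d) == len(r) == len(dr_dbeta) == n):
        raise ValueError("all inputs must be non-empty and of equal length")
    return [(di * ri + di * bi * dri) / n
            for di, ri, bi, dri in zip(d, r, beta, dr_dbeta)]
```

When the weights are constant, `dr_dbeta` is all zeros and each entry reduces to d.sub.ir.sub.i/N, which corresponds to the branch in which step 806 is skipped.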
[0095] At step 808, the network scoring engine 114 determines the
approximate variance of the GPI score using the approximation of
the GPI score generated in the preceding steps. If the GPI score
has been approximated as an affine function of the random variables
.beta..sub.i (as in Eq. 21), the variance of the approximation will
be the weighted sum of the variances of the .beta..sub.i as given
by:
$$S_{\mathrm{GPI}}^2 = \sum_i \left(\frac{\partial\,\mathrm{GPI}}{\partial \beta_i}\right)^2 S_i^2, \qquad (26)$$
where S.sub.i.sup.2 is the variance of the ith fold-change
.beta..sub.i. Thus, the variance of the approximation of Eq. 25 may
be written as:
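$$S_{\mathrm{GPI}}^2 \approx \frac{1}{N^2}\sum_i \left(\mathrm{fndr}_i + \beta_i \left[2\,\mathrm{sgn}(\beta_i)\,\frac{v_i(\beta_1,\ldots,\beta_N)}{\hat{S}_i}\,t_{df}\!\left(\frac{\beta_i}{\hat{S}_i}\right)\right]\right)^2 S_i^2, \qquad (27)$$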
where the d.sub.i terms drop away when d.sub.i=+/-1 because
d.sub.i.sup.2=1.
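The variance propagation of Eq. 26 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the engine's actual code: `gpi_variance` is a hypothetical name, the gradient is assumed to already include the 1/N factor of Eq. 22, and, as in Eq. 26, no covariance terms between fold-changes are included.

```python
def gpi_variance(gradient, fold_change_variances):
    """Approximate variance of the GPI score, following Eq. 26.

    gradient:              partial derivatives dGPI/dbeta_i (1/N included)
    fold_change_variances: variances S_i^2 of the fold-changes beta_i
    """
    if len(gradient) != len(fold_change_variances):
        raise ValueError("inputs must have equal length")
    # Weighted sum of the fold-change variances; the weights are the
    # squared partial derivatives, so the sign of d_i has no effect.
    return sum(g * g * s2 for g, s2 in zip(gradient, fold_change_variances))
```

For example, a gradient of [0.25, -0.125] with fold-change variances [4.0, 16.0] gives 0.0625*4.0 + 0.015625*16.0 = 0.5.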
[0096] At step 810, the network scoring engine 114 evaluates the
variance of the GPI score (e.g., as represented by Eq. 27) at the
observed fold-change values. At step 812, the network scoring
engine 114 generates a confidence interval for the GPI score in
accordance with
$$\mathrm{GPI} \pm t_{df}^{\alpha/2}\, S_{\mathrm{GPI}}, \qquad (28)$$
where S.sub.GPI is calculated as described above with reference to
Eqs. 26 and 27. Eq. 28 may be adapted as desired to determine
variance of a PPI score at the observed fold-change values.
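The interval of Eq. 28 follows directly once the variance is available. In the minimal sketch below, `gpi_confidence_interval` is a hypothetical name, and the critical value of the t distribution is assumed to be supplied by the caller (for example, from a statistical table or library) rather than computed here.

```python
import math


def gpi_confidence_interval(gpi, variance, t_crit):
    """Confidence interval GPI +/- t_crit * S_GPI, following Eq. 28.

    gpi:      GPI score at the observed fold-change values
    variance: approximate variance S_GPI^2 of the score
    t_crit:   upper alpha/2 critical value of the t distribution with
              df degrees of freedom (assumed to be supplied by the caller)
    """
    if variance < 0:
        raise ValueError("variance must be non-negative")
    half_width = t_crit * math.sqrt(variance)
    return (gpi - half_width, gpi + half_width)
```

For a 95% interval with 10 degrees of freedom, the critical value is approximately 2.228.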
[0097] The network scoring engine 114 may generate vector-valued
scores in addition to or instead of the scalar-valued scores
described above. One vector-valued score is the vector of
fold-changes or absolute changes in activity for each of the
measured nodes.
[0098] In certain implementations, for each perturbation (e.g.,
exposure to a known or unknown agent), the network scoring engine
114 may generate multiple NPA scores. For example, the network
scoring engine 114 may generate an NPA score for a particular
network, a particular dose of the agent, and a particular exposure
time.
E. Experimental Results
[0099] The process 200 for quantifying the response of a biological
network to a perturbation by calculating a network perturbation
amplitude (NPA) score has been used to analyze tumor necrosis
factor (TNF)-treated normal human bronchial epithelial (NHBE) cells
using several causal network models. Activation of the stress- and
immune-response transcription factor NF-kB (nuclear factor
kappa-light-chain enhancer of activated B cells) has been
well-defined as a major mediator of tumor necrosis factor-alpha
(TNF.alpha.)-induced signaling in a variety of systems. Normal
human bronchial epithelial (NHBE) cells were treated with four
different doses of TNF.alpha. (0.1, 1, 10 and 100 ng/mL) and total
RNA was collected for microarray measurement at four different
times after treatment (30 minutes, 2 hours, 4 hours and 24 hours).
All treatments were compared to time-matched mock-treated controls
to obtain 16 contrasts (4 doses.times.4 time points). Normal human
bronchial epithelial cells (Lonza Walkersville, Inc.) were cultured
in standard growth medium (Clonetics medium, Lonza Walkersville,
Inc.). Cells were either treated with TNF.alpha. (Sigma) or a
vehicle control (HBSS), and then harvested after the desired
perturbation time periods. Cells were immediately put on ice and
split into three technical replicates from which total RNA was
extracted using the RNeasy Microkit (Qiagen). The processed RNA
samples were then hybridized to Affymetrix U133 Plus 2.0
microarrays. Cell viability and cell counts were controlled for all
conditions after 24 hours with the CellTiter-Glo® assay (Promega).
NF-kB nuclear
translocation was measured using Cellomics NF-kB Activation HCS
Reagent Kit (Thermo Scientific). Data processing and NPA methods
were implemented in the R statistical environment. Raw RNA
expression data was analyzed using the affy and limma packages of
the Bioconductor suite of microarray analysis tools available in
the R statistical environment (Gentleman R: Bioinformatics and
computational biology solutions using R and Bioconductor. New York:
Springer Science+Business Media; 2005; Gentleman R C, Carey V J,
Bates D M, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge
Y, Gentry J, et al: Bioconductor: open software development for
computational biology and bioinformatics. Genome Biol 2004, 5:R80).
Robust Multi-array Average (RMA) background correction and quantile
normalization were used to generate probe set expression values
(Irizarry et al., Exploration, normalization, and summaries of high
density oligonucleotide array probe level data. Biostatistics 2003,
4:249-264). An overall linear model was fit to the data for all
groups of replicates, and specific contrasts of interest
(comparisons of "treated" and "control" conditions) were evaluated
to generate raw p-values for each probe set on the expression
array. Raw p-values were subsequently corrected for multiple
testing effects using the Benjamini-Hochberg false discovery rate
(FDR) procedure.
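The multiple-testing correction described above was performed in the R environment. As an illustration only, the standard Benjamini-Hochberg step-up adjustment can be sketched in Python as follows; the function name `benjamini_hochberg` is hypothetical, and this is not the authors' code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw p-value is scaled by n/rank, and the running minimum keeps the adjusted values non-decreasing in the raw p-values and capped at 1.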
[0100] Probe sets were matched to RNA Abundance nodes in the
Selventa Knowledgebase using the HG-U133_Plus_2.na30 probe
set mappings and the following criteria. First, only "at" or "s_at"
probe sets were considered. Second, probe sets that mapped to
multiple genes were discarded. Third, when multiple probe sets
mapped to the same gene, preference was given to "at" probe sets
over "s_at" probe sets. Finally, when there still remained multiple
probe sets mapped to the same gene, the probe set with the lowest
geometric mean FDR-corrected p-value across all contrasts of
interest was selected. A linear model was then re-fit for all
groups of replicates to only those probe sets that mapped to RNA
Abundance nodes in the knowledgebase, and FDR-corrected p-values
were recomputed. The Selventa Knowledgebase is a repository
containing over 1.5 million nodes (biological concepts and
entities) and over 7.5 million edges (assertions about causal and
non-causal relationships between nodes). The assertions in the
Selventa Knowledgebase are derived from peer-reviewed scientific
literature as well as other public and proprietary databases.
Specifically, each assertion describes an individual experimental
observation from an experiment performed in a human, mouse, or rat
species context, either in vitro or in vivo. Assertions also
capture information about the referring source (e.g. the PubMed ID
(PMID) for journal articles listed in MEDLINE), as well as key
contextual information including the species (human, mouse, or rat)
and the tissue or cell line from which the experimental observation
was derived. An example causal assertion is the increased
transcriptional activity of NFkB (nuclear factor
kappa-light-chain-enhancer of activated B cells) causes an increase
in the mRNA expression of CXCL1 (Chemokine (C-X-C motif) ligand 1)
[HeLa cell line; Human; PMID 16414985]. The knowledgebase contains
causal relationships derived from healthy tissues and disease areas
such as inflammation, metabolic diseases, cardiovascular injury,
liver injury, and cancer.
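The four probe set selection criteria described at the start of this paragraph can be sketched as a small filter-and-rank routine. The sketch is hypothetical in several respects: the function names and the (probe_id, genes, fdr_p_values) tuple layout are illustrative, and it assumes that probe set identifiers end in "_at" or "_s_at" in the usual Affymetrix style, with any other suffix (for example "_x_at") excluded.

```python
import math
import re
from collections import defaultdict

# Matches a trailing "_at" or "_<letter>_at" suffix (assumed ID convention).
_SUFFIX = re.compile(r"_(?:([a-z])_)?at$")


def probe_kind(probe_id):
    """Return "at", "s_at", or None for any other probe set type."""
    match = _SUFFIX.search(probe_id)
    if match is None:
        return None
    if match.group(1) is None:
        return "at"
    return "s_at" if match.group(1) == "s" else None


def geometric_mean(values):
    # Floor at a tiny positive number so log() is defined for p == 0.
    return math.exp(sum(math.log(max(v, 1e-300)) for v in values) / len(values))


def select_probe_sets(probes):
    """Pick one probe set per gene.

    probes: iterable of (probe_id, genes, fdr_p_values) tuples, where
            fdr_p_values holds one FDR-corrected p-value per contrast.
    Returns a dict mapping gene -> selected probe_id.
    """
    kind_rank = {"at": 0, "s_at": 1}
    by_gene = defaultdict(list)
    for probe_id, genes, fdr in probes:
        kind = probe_kind(probe_id)
        # Criteria 1 and 2: keep only "at"/"s_at" probe sets mapping to one gene.
        if kind is None or len(genes) != 1:
            continue
        by_gene[genes[0]].append((kind_rank[kind], geometric_mean(fdr), probe_id))
    # Criterion 3 ("at" preferred), then criterion 4 (lowest geometric mean).
    return {gene: min(candidates)[2] for gene, candidates in by_gene.items()}
```

Sorting on the tuple (suffix rank, geometric mean p-value) applies the "at" preference first and uses the geometric mean only to break remaining ties.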
[0101] The GPI, EPI and PPI scoring methods were first investigated
using a causal network model created to be a specific measure of
NF-kB activation, the NF-kB-direct model. This model is composed of
155 genes (curated from 247 distinct references, some genes being
supported by more than one reference) known to be directly
regulated by NF-kB (genes whose expression is controlled in an
NF-kB-dependent manner and whose promoter sequences are directly
bound by NF-kB). The scoring methods showed the same pattern of
response to TNF.alpha., demonstrating a dose-dependent response at
all times and a time-dependent response that generally saturated at
later times (see FIG. 10(a)). The EPI method was qualitatively
different from GPI in that EPI scores continued to increase from 2
hours to 4 hours to 24 hours, while the GPI score plateaued from 4
hours to 24 hours. Also, the EPI method produced near-zero scores
for 0.1 ng/mL TNF.alpha.. In general, the EPI method appeared to
reduce to 0 (or near 0) those scores that trended relatively low
under the other methods. For the EPI method, scores at the lowest
dose were found not to be specific to the NF-.kappa.B-direct
network model at all but the 2 hour time point.
[0102] Next, NF-.kappa.B-direct model scores were compared to
NF-.kappa.B nuclear translocation. Upon activation, NF-.kappa.B is
transported into the nucleus where it acts to regulate the
expression of many genes. A series of feedback loops then lead to
the subsequent translocation of NF-.kappa.B back to the cytoplasm,
and this oscillatory cycle continues several times. Because
NF-.kappa.B oscillations occur with slightly different periods in
different cells in the population, the first oscillation may be the
most reliable population-measure of NF-.kappa.B activation.
Although the time of the first oscillation depends on dose, 30
minutes after TNF.alpha. treatment may be a realistic time to
measure NF-.kappa.B nuclear translocation for the doses used. All
three scoring methods produced a monotonic, and in some cases
nearly linear, relationship between score and nuclear
translocation, with Pearson correlation coefficients between 0.85
and 0.98 for the GPI and EPI scoring methods (FIG. 11). FIG. 11
illustrates NF-.kappa.B-direct NPA scores at 30 minutes, plotted
against NF-.kappa.B nuclear translocation at 30 minutes. Error bars
in NF-.kappa.B nuclear translocation represent the standard
deviation of the mean nuclear translocation for three different
fields of view of the same population of cells. Interestingly, this
dose-dependent relationship was preserved at different times after
TNF.alpha. treatment (FIG. 13). These findings demonstrate that the
causal network model-based NPA scores can quantify NF-.kappa.B
transcriptional activity.
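The Pearson correlation coefficient used to compare scores with nuclear translocation can be computed as in the following sketch. The function name `pearson` is hypothetical, and no values from the experiments are reproduced here; the inputs would be one score and one translocation measurement per dose.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

A coefficient near 1 indicates a nearly linear increasing relationship. A monotonic but non-linear relationship can yield a somewhat lower value.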
[0103] The effects of the extent and composition of a causal
network model on the NPA scoring methods of the invention were also
investigated. First, the effect of hand-selecting a set of
measurements that are known to be modulated by NF-.kappa.B
specifically in a TNF.alpha.-dependent manner was assessed. A
submodel was constructed from a set of 20 genes that were
previously measured via reverse transcriptase-polymerase chain
reaction (RT-PCR) to assess NF-.kappa.B activity in response to
TNF.alpha. treatment in 3T3 mouse fibroblast cells (omitting 2
genes that have no direct human ortholog). These genes were
measured as perturbed by TNF.alpha. in 3T3 cells upon dosing with
TNF.alpha. (10 different concentrations spanning 100 ng/mL to 0.005
ng/mL) over a 12 hour time course. This submodel produced a very
similar pattern of activation to the NF-.kappa.B-direct model (FIG.
14), suggesting that inclusion of genes whose TNF.alpha.-dependent
expression has not been directly verified does not have a
detrimental effect on the quality of the score. FIG. 14 shows
transcriptomic data from TNF.alpha.-treated NHBE cells scored using
GPI and EPI for (a) the NF-.kappa.B-direct model and (b) a submodel
composed of 20 NF-.kappa.B-regulated genes reported to be
TNF.alpha.-responsive in mouse 3T3 fibroblast cells (NFKBIA, CASP4,
CCL5, TNFAIP3, CCL2, ZFP36, RIPK2, TNFSF10, NFKBIE, IL6, CCL20,
ICAM1, TNFRSF1A, TNFRSF1B, SQSTM1, NRG1, SOD1, IL1RL1, HIF1A,
ERBB2) (Tay et al., Single-cell NF-kappaB dynamics reveal digital
activation and analogue information processing. Nature 2010,
466:267-271).
[0104] Next, the effects of using causal network models derived
from upstream biological entities that are less proximal to the
measurement were investigated. To do so, two additional models were
constructed: the IKK/NF-.kappa.B signaling model, which is composed
of 992 genes (curated from 414 different references) that are known
to be modulated by perturbation of proteins in a causal network
model of signaling from the I.kappa.B kinase (IKK) proteins to
NF-.kappa.B activation (FIG. 9); and the TNF model, which is
composed of 1741 genes (curated from 589 different references) that
are known to be modulated by treatment of cells with TNF.alpha..
Whereas the NF-.kappa.B-direct model is composed entirely of genes
whose expressions were directly controlled by a single
transcription factor (NF-.kappa.B), each of these two models
contains genes whose direct transcriptional controller is not
necessarily known. The expression of these genes may be controlled
by transcription factors not involved in constructing the model.
For example, genes in the IKK/NF-.kappa.B signaling model are known
to be modulated by perturbation of proteins in the IKK/NF-.kappa.B
signaling causal network model, but some of these genes could be
regulated as secondary effects caused by altered expression of a
smaller subset of genes that are directly modulated by NF-.kappa.B.
Also, TNF.alpha. is a ligand and therefore does not directly
mediate transcription of any genes. Treatment of cells with
TNF.alpha. results in activation of a myriad of transcription
factors, any of which may directly or indirectly (for example,
through autocrine signaling) alter the expression of each gene in
the TNF model.
FIG. 9 illustrates the full causal network model (top), along with
a schematic of the basic model architecture (middle). CHUK, IKBKB,
and IKBKG act as inhibitors of NFKBIA, NFKBIB, and NFKBIE, which
are in turn inhibitors of NFKB1, NFKB2, and RELA. The nodes used in
the model are listed under each section. The nodes in bold
represent nodes that have downstream gene expression measurables in
the knowledgebase, and the number of measurables is given in the
square brackets (because the same downstream measurable may be
found under multiple nodes, these 1227 downstream measurables
correspond to 992 unique measurables). The notations used are as
follows: "CHUK P@S"
represents CHUK phosphorylated at serine (where the residue is
given if known), "CHUK P@ST" represents CHUK phosphorylated at
serine or threonine (the exact residue is unknown), "kaof(CHUK)"
represents the kinase activity of CHUK, "CHUK:IKBKB" represents the
complex of CHUK and IKBKB proteins, "IkappaB kinase complex Hs"
represents an aggregate of the various kinases (CHUK, IKBKB, and
IKBKG) in Homo sapiens (Hs), "degradationof(NFKBIA)" represents the
process of NFKBIA degradation, and "taof(NFKB1)" represents the
transcriptional activity of NFKB1. The IKK/NF-.kappa.B signaling
model and TNF model give insight into the behaviors of mechanism
hypotheses at different levels of proximity to the measurements.
The IKK/NF-.kappa.B signaling model is primarily composed of genes
that are regulated (either directly or indirectly) by NF-.kappa.B
(FIG. 9), and it produced a pattern of response that is very
similar to the NF-.kappa.B-direct model (FIG. 10(b)). This similar
pattern of response suggests that there is not a large difference
between the population-level behavior of genes that are known to be
directly regulated by a transcription factor and the behavior of
genes for which direct regulation has not been established. The time-
and dose-dependent response that was seen for the
NF-.kappa.B-direct model appears somewhat less robustly in the TNF
model (FIG. 10(c)), for example at the 30 minute time point, but
again the methods produced very similar responses. Thus, although
the general pattern of response was well-preserved among the
models, minor but noticeable differences in response can be
observed in models that are less proximal to the entities of which
measurements were made.
[0105] To assess the ability of the causal network models to
respond specifically to relevant TNF.alpha. signaling
perturbations, another model was constructed for a key cell-cycle
component, the transcription factor E2F1, with the assumption that
E2F1 is a less direct effector of TNF.alpha. signaling compared to
NF-.kappa.B. The E2F1-direct model is composed of 80 genes (curated
from 54 different references) known to be directly regulated by
E2F1 (expression controlled by E2F1 and promoter sequence bound by
E2F1). To provide a comparison of NPA results for biology not
directly related to NF-.kappa.B signaling, the NPA responses of the
four models introduced above (NF-.kappa.B-direct, IKK/NF-.kappa.B
signaling, TNF, and E2F1-direct) were assessed in response to
inhibition of cell cycle progression via a CDK inhibitor.
Specifically, a publicly available microarray data set was used for
treatment of HCT116 colon cancer cells with three different
concentrations of the CDK inhibitor R547 (GSE15395)
(Berkofsky-Fessler, et al: Preclinical biomarkers for a
cyclin-dependent kinase inhibitor translate to candidate
pharmacodynamic biomarkers in phase I patients. Mol Cancer Ther
2009, 8:2517-2525) (FIG. 12). All three NPA methods demonstrated
dose- and time-dependent decreases in the E2F1-direct model score
at the 4 hour, 6 hour, and 24 hour time points. The TNF model
showed a pattern of response similar to that of the E2F1-direct
model. In contrast, the NF-.kappa.B-direct and IKK/NF-.kappa.B
signaling
model scores did not display this same dose- and time-dependent
pattern, indicating that these focused models potentially contain
few cell cycle regulated genes.
F. Hardware
[0106] FIG. 15 is a block diagram of a distributed computerized
system 1500 for quantifying the impact of biological perturbations.
The components of the system 1500 are the same as those in the
system 100 of FIG. 1, but the arrangement of the system 1500 is
such that each component communicates through a network interface
1510. Such an implementation may be appropriate for distributed
computing over multiple communication systems, including wireless
communication systems that may share access to a common network
resource, such as in "cloud computing" paradigms.
[0107] FIG. 16 is a block diagram of a computing device, such as
any of the components of system 100 of FIG. 1 or system 1500 of
FIG. 15, for performing processes described with reference to FIGS.
1-10. Each of the components of system 100, including the SRP
engine 110, the network modeling engine 112, the network scoring
engine 114, the aggregation engine 116 and one or more of the
databases including the outcomes database, the perturbations
database, and the literature database may be implemented on one or
more computing devices 1600. In certain aspects, a plurality of the
above components and databases may be included or comprised within
one computing device 1600. In certain implementations, a component
and a database may be implemented across several computing devices
1600.
[0108] The computing device 1600 comprises at least one
communications interface unit 1608, an input/output controller 1610,
system memory, and one or more data storage devices. The system
memory includes or comprises at least one random access memory (RAM
1602) and at least one read-only memory (ROM 1604). All of these
elements are in communication with a central processing unit (CPU
1606) to facilitate the operation of the computing device 1600. The
computing device 1600 may be configured in many different ways. For
example, the computing device 1600 may be a conventional standalone
computer or alternatively, the functions of computing device 1600
may be distributed across multiple computer systems and
architectures. The computing device 1600 may be configured to
perform some or all of modeling, scoring and aggregating
operations. In FIG. 16, the computing device 1600 is linked, via a
network or local network, to other servers or systems.
[0109] The computing device 1600 may be configured in a distributed
architecture, wherein databases and processors are housed in
separate units or locations. Some such units perform primary
processing functions and contain at a minimum a general controller
or a processor and a system memory. In such an aspect, each of
these units is attached via the communications interface unit 1608
to a communications hub or port (not shown) that serves as a
primary communication link with other servers, client or user
computers and other related devices. The communications hub or port
may have minimal processing capability itself, serving primarily as
a communications router. A variety of communications protocols may
be part of the system, including, but not limited to: Ethernet,
SAP, SAS™, ATP, BLUETOOTH™, GSM and TCP/IP.
[0110] The CPU 1606 comprises a processor, such as one or more
conventional microprocessors and one or more supplementary
co-processors such as math co-processors for offloading workload
from the CPU 1606. The CPU 1606 is in communication with the
communications interface unit 1608 and the input/output controller
1610, through which the CPU 1606 communicates with other devices
such as other servers, user terminals, or devices. The
communications interface unit 1608 and the input/output controller
1610 may include or comprise multiple communication channels for
simultaneous communication with, for example, other processors,
servers or client terminals. Devices in communication with each
other need not be continually transmitting to each other. On the
contrary, such devices need only transmit to each other as
necessary, may actually refrain from exchanging data most of the
time, and may require several steps to be performed to establish a
communication link between the devices.
[0111] The CPU 1606 is also in communication with the data storage
device. The data storage device may comprise an appropriate
combination of magnetic, optical or semiconductor memory, and may
include or comprise, for example, RAM 1602, ROM 1604, a flash drive,
an optical disc such as a compact disc, or a hard disk or drive. The
CPU 1606 and the data storage device each may be, for example,
located entirely within a single computer or other computing
device; or connected to each other by a communication medium, such
as a USB port, serial port cable, a coaxial cable, an Ethernet type
cable, a telephone line, a radio frequency transceiver or other
similar wireless or wired medium or combination of the foregoing.
For example, the CPU 1606 may be connected to the data storage
device via the communications interface unit 1608. The CPU 1606 may
be configured to perform one or more particular processing
functions.
[0112] The data storage device may store, for example, (i) an
operating system 1612 for the computing device 1600; (ii) one or
more applications 1614 (e.g., computer program code or a computer
program product) adapted to direct the CPU 1606 in accordance with
the systems and methods described here, and particularly in
accordance with the processes described in detail with regard to
the CPU 1606; or (iii) database(s) 1616 adapted to store
information that may be required by the program. In some aspects,
the database(s) include or comprise
a database storing experimental data, and published literature
models.
[0113] The operating system 1612 and applications 1614 may be
stored, for example, in a compressed, an uncompiled and an
encrypted format, and may include or comprise computer program
code. The instructions of the program may be read into a main
memory of the processor from a computer-readable medium other than
the data storage device, such as from the ROM 1604 or from the RAM
1602. While execution of sequences of instructions in the program
causes the CPU 1606 to perform the process steps described herein,
hard-wired circuitry may be used in place of, or in combination
with, software instructions for implementation of the processes of
the present invention. Thus, the systems and methods described are
not limited to any specific combination of hardware and
software.
[0114] Suitable computer program code may be provided for
performing one or more functions in relation to modeling, scoring
and aggregating as described herein. The program also may include
or comprise program elements such as an operating system 1612, a
database management system and "device drivers" that allow the
processor to interface with computer peripheral devices (e.g., a
video display, a keyboard, a computer mouse, etc.) via the
input/output controller 1610.
[0115] The term "computer-readable medium" as used herein refers to
any non-transitory medium that provides or participates in
providing instructions to the processor of the computing device
1600 (or any other processor of a device described herein) for
execution. Such a medium may take many forms, including but not
limited to, non-volatile media and volatile media. Non-volatile
media include or comprise, for example, optical, magnetic, or
opto-magnetic disks, or integrated circuit memory, such as flash
memory. Volatile media include or comprise dynamic random access
memory (DRAM), which typically constitutes the main memory. Common
forms of computer-readable media include or comprise, for example,
a floppy disk, a flexible disk, hard disk, magnetic tape, any other
magnetic medium, a CD-ROM, DVD, any other optical medium, punch
cards, paper tape, any other physical medium with patterns of
holes, a RAM, a PROM, an EPROM or EEPROM (electronically erasable
programmable read-only memory), a FLASH-EEPROM, any other memory
chip or cartridge, or any other non-transitory medium from which a
computer can read.
[0116] Various forms of computer readable media may be involved in
carrying one or more sequences of one or more instructions to the
CPU 1606 (or any other processor of a device described herein) for
execution. For example, the instructions may initially be borne on
a magnetic disk of a remote computer (not shown). The remote
computer can load the instructions into its dynamic memory and send
the instructions over an Ethernet connection, cable line, or even
telephone line using a modem. A communications device local to a
computing device 1600 (e.g., a server) can receive the data on the
respective communications line and place the data on a system bus
for the processor. The system bus carries the data to main memory,
from which the processor retrieves and executes the instructions.
The instructions received by main memory may optionally be stored
in memory either before or after execution by the processor. In
addition, instructions may be received via a communication port as
electrical, electromagnetic or optical signals, which are exemplary
forms of wireless communications or data streams that carry various
types of information. Further aspects and embodiments are set forth
in the following passages:
1. A computerized method for quantifying the perturbation of a
biological system in response to an agent, comprising: receiving, at
a first processor, a set of treatment data corresponding to a
response of a biological system to an agent, wherein the biological
system includes or comprises a plurality of biological entities,
each biological entity interacting with at least one other of the
biological entities; receiving, at a second processor, a set of
control data corresponding to the biological system not exposed to
the agent; providing, at a third processor, a computational causal
network model that represents the biological system and includes or
comprises: nodes representing the biological entities, edges
representing relationships between the biological entities, and
direction values, for the nodes, representing the expected
direction of change between the control data and the treatment
data; calculating, with a fourth processor, activity measures, for
the nodes, representing a difference between the treatment data and
the control data; calculating, with a fifth processor, weight
values for the nodes, wherein at least one weight value is
different from at least one other weight value; and generating,
with a sixth processor, a score for the computational model
representative of the perturbation of the biological system to the
agent, wherein the score is based on the direction values, the
weight values and the activity measures.

2. The computerized method of passage 1, further comprising
normalizing the score based on the number of nodes in the
respective computational model.

3. The computerized method of any of the above passages, wherein
the weight values represent a confidence in at least one of the set
of treatment data and control data.

4. The computerized method of any of the above passages, wherein
the weight values include local false non-discovery rates.

5. The computerized method of passage 1, further comprising
calculating, with a seventh processor, an approximate distribution
of the activity measures over the nodes; calculating, with an
eighth processor, an expected value of the approximate
distribution; and generating, with a ninth processor, a score for
each computational model representative of the perturbation of the
subset of the biological system to the agent, wherein the score is
based on the expected value.

6. The computerized method of passage 5, wherein the approximate
distribution is based on the activity measures.

7. The computerized method of any of passages 5-6, wherein
calculating an expected value comprises performing a rectangular
approximation.

8. The computerized method of passage 1, further comprising
calculating, with a tenth processor, a positive activation score
and a negative activation score based on the activity measures, the
positive and negative activation scores representative of
consistency and inconsistency, respectively, between the activity
measures and the direction values; and generating, with an eleventh
processor, a score for each computational model representative of
the perturbation of the subset of the biological system to the
agent, wherein the score is based on the positive and negative
activation scores.

9. The computerized method of passage 8, wherein the score is based
on local false non-discovery rates.

10. The computerized method of any of passages 8-9, wherein the
activity measure is a fold-change value, and the fold-change value
for each node includes a logarithm of the difference between the
treatment data and the control data for the biological entity
represented by the respective node.

11. The computerized method of any of the above passages, wherein
the subset of the biological system includes at least one of cell
proliferation mechanism, cellular stress mechanism, cell
inflammation mechanism, and DNA repair mechanism.

12. The computerized method of any of the above passages, wherein
the agent includes at least one of aerosol generated by heating
tobacco, aerosol generated by combusting tobacco, tobacco smoke or
cigarette smoke.

13. The computerized method of any of the above passages, wherein
the agent includes a heterogeneous substance, including a molecule
or an entity that is not present in or derived from the biological
system.

14. The computerized method of any of the above passages, wherein
the agent includes toxins, therapeutic compounds, stimulants,
relaxants, natural products, manufactured products, and food
substances.

15. The computerized method of any of the above passages, wherein
the set of treatment data includes a plurality of sets of treatment
data such that each node includes a plurality of fold-change values
defined by a first probability distribution and a plurality of
weight values defined by a second probability distribution.
[0117] While implementations of the invention have been
particularly shown and described with reference to specific
examples, it should be understood by those skilled in the art that
various changes in form and detail may be made therein without
departing from the spirit and scope of the invention as defined by
the appended claims. The scope of the invention is thus indicated
by the appended claims and all changes which come within the
meaning and range of equivalency of the claims are therefore
intended to be embraced.
* * * * *