U.S. patent application number 10/563223 was filed with the patent office on 2006-08-10 for method computer program with program code elements and computer program product for analysing s regulatory genetic network of a cell.
Invention is credited to Mathaus Dejori, Martin Stetter.
Application Number | 20060177827 10/563223 |
Document ID | / |
Family ID | 33559880 |
Filed Date | 2006-08-10 |
United States Patent
Application |
20060177827 |
Kind Code |
A1 |
Dejori; Mathaus ; et
al. |
August 10, 2006 |
Method computer program with program code elements and computer
program product for analysing s regulatory genetic network of a
cell
Abstract
A regulatory genetic network of a cell is analyzed using a causal
network after predefining a gene expression rate for a selected
gene of the regulatory genetic network. The causal network is used
to generate a resultant gene expression pattern relating to the
genetic network for the predefined gene expression rate. The
generated resultant gene expression pattern is subsequently
compared with a predefined gene expression pattern of the
regulatory genetic network.
Inventors: |
Dejori; Mathaus; (Munich,
DE) ; Stetter; Martin; (Munich, DE) |
Correspondence
Address: |
STAAS & HALSEY LLP
SUITE 700
1201 NEW YORK AVENUE, N.W.
WASHINGTON
DC
20005
US
|
Family ID: |
33559880 |
Appl. No.: |
10/563223 |
Filed: |
June 28, 2004 |
PCT Filed: |
June 28, 2004 |
PCT NO: |
PCT/EP04/51266 |
371 Date: |
January 4, 2006 |
Current U.S.
Class: |
435/6.12 ;
435/6.13; 702/20; 706/13 |
Current CPC
Class: |
G16B 5/00 20190201; G16B
25/00 20190201 |
Class at
Publication: |
435/006 ;
702/020; 706/013 |
International
Class: |
C12Q 1/68 20060101
C12Q001/68; G06N 3/12 20060101 G06N003/12; G06F 19/00 20060101
G06F019/00 |
Foreign Application Data
Date |
Code |
Application Number |
Jul 4, 2003 |
DE |
103 30 280.8 |
Claims
1-24. (canceled)
25. A method for analysis of a regulatory genetic network of a cell
using a causal network describing a regulatory genetic network of
cells such that nodes of the causal network represent genes of the
regulatory genetic network and connectors of the causal networks
represent regulatory interactions between the genes of the
regulatory genetic network, said method comprising: providing a
predetermined gene expression rate for a selected gene of the
regulatory genetic network; generating a resulting gene expression
pattern for the regulatory genetic network using the causal network
for the predetermined gene expression rate; and comparing the
resulting gene expression pattern with a predetermined gene
expression pattern of the regulatory genetic network.
26. A method in accordance with claim 25, further comprising
selecting the selected gene by dependency analysis using the causal
network.
27. A method in accordance with claim 26, wherein the predetermined
gene expression rate of the selected gene reflects an assumption of
a gene defect.
28. A method in accordance with claim 27, wherein the causal
network is a Bayesian network.
29. A method in accordance with claim 28, wherein the causal
network is a directed acyclic graph type.
30. A method in accordance with claim 29, wherein at least one of
the resulting gene expression pattern and the predetermined gene
expression pattern represents discrete gene states.
31. A method in accordance with claim 30, wherein the discrete gene
states include an overexpressed gene state, a normally expressed
gene state and an underexpressed gene state.
32. A method in accordance with claim 31, wherein said comparing of
the resulting gene expression pattern to the predetermined gene
expression pattern uses at least one of a static method and a
statistical code as a measure of distance.
33. A method in accordance with claim 32, further comprising
training the causal network using training gene expression patterns
to adapt the nodes and the connectors of the causal network.
34. A method in accordance with claim 33, further comprising
determining at least one of the predetermined gene expression
pattern and the training gene expression patterns using a DNA
microarray technique.
35. A method in accordance with claim 34, wherein at least one of
the predetermined gene expression pattern and the training gene
expression patterns are for a diseased cell.
36. A method in accordance with claim 35, wherein the diseased cell
is an oncocell.
37. A method in accordance with claim 36, wherein the diseased cell
features an Acute Lymphoblastic Leukemia oncogene.
38. A method in accordance with claim 25, further comprising
repeating said determining, said generating and said comparing to
determine a plurality of predetermined gene expression rates for
selected genes of the regulatory genetic network and to generate
and compare the resulting gene expression pattern for each of the
predetermined gene expression rates with a corresponding
predetermined gene expression pattern.
39. A method in accordance with claim 38, wherein said repeating of
the generation the resulting gene expression patterns is performed
iteratively.
40. A method in accordance with claim 39, further comprising
identifying a dominant gene based on said comparing repeatedly
performed.
41. A method in accordance with claim 39, further comprising
identifying at least one of a degenerated gene, a mutated gene, a
diseased gene, an oncogene, and a tumor-suppressor gene based on
said comparing repeatedly performed.
42. A method in accordance with claim 39, further comprising
identifying a tumor cell based on said comparing repeatedly
performed.
43. A method in accordance with claim 39, further comprising
detecting cancer based on said comparing repeatedly performed.
44. A method in accordance with claim 39, further comprising
analyzing a cause of an abnormal gene expression pattern/gene
expression rate based on said comparing repeatedly performed.
45. A method in accordance with claim 39, further comprising
simulating an effect of a medicament based on said comparing
repeatedly performed.
46. A method in accordance with claim 39, further comprising
analyzing an effect of a medicament based on said comparing
repeatedly performed.
47. At least one computer-readable medium storing a program which
when executed on a computer causes the computer to perform a method
comprising: providing a predetermined gene expression rate for a
selected gene of the regulatory genetic network; generating a
resulting gene expression pattern for the regulatory genetic
network using the causal network for the predetermined gene
expression rate; and comparing the generated resulting gene
expression pattern with a predetermined gene expression pattern of
the regulatory genetic network.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is based on and hereby claims priority to
German Patent Application No. 10330280.8 filed on Jul. 4, 2003, the
contents of which are hereby incorporated by reference.
BACKGROUND OF THE INVENTION
[0002] 1. Field of the Invention
[0003] The invention relates to an analysis of a regulatory genetic
network of a cell using a statistical method.
[0004] 2. Description of the Related Art
[0005] Fundamentals of a regulatory genetic network of a cell are
known from Stetter et al., Large-Scale Computational Modeling of
Genetic Regulatory Networks, Kluwer Academic Publisher,
Netherlands, 2003. In this document such a regulatory genetic
network should be taken to mean in particular regulatory
interactions between genes of a cell.
[0006] A genome, i.e. the human genetic substance, is estimated to
comprise 20,000 to 40,000 genes, of which a biologically specified
number in each case--depending on a specialization of a cell--are
present in the cell in the form of a DNA or a part of a DNA.
[0007] A not necessarily contiguous section of this DNA containing
the genetic code for a protein or also for a group of proteins or
for creating a protein or a group of proteins is designated as a
gene here. Overall the genes contain a genetic code for around a
million proteins.
[0008] An interplay or the interactions between the genes as well
as with the proteins represents the most important part of a
machinery (regulatory genetic network) which underlies the
development of a human body from a fertilized egg cell as well as
all bodily functions.
[0009] It is also known from Stetter that so-called gene expression
rates which form a gene expression pattern supply a description or
representation of a regulatory genetic network or of a current
status of the regulatory genetic network.
[0010] In simple terms or expressed more clearly the gene
expression pattern of a cell thus represents a state of the
regulatory genetic network of this cell.
[0011] It is further known that by using high-throughput gene
expression measurements (microarray data) these gene expression
rates can be measured. The microarray data in its turn describes
snapshots of the gene expression pattern.
[0012] Many illnesses and malfunctions of the body are attributable
to disturbances in the regulatory genetic network which are
reflected by greatly changed gene expression behavior (gene
expression rates) or a changed gene expression pattern of a
cell.
[0013] An understanding of the regulatory genetic network thus
represents an important step towards characterizing genetic
mechanisms and, consequently, towards identifying what are known
as dominant or malfunction-initiating genes underlying the
illnesses or malfunctions.
[0014] In cancer research for example suppressing genes can play a
key role in the identification of growths and tumors, and the
knowledge of new potential oncogenes and their interactions with
other genes can contribute to discovering the basic principles (of
cancers) which determine how normal cells change into malignant
cancer cells.
[0015] Furthermore a quantitative understanding of the regulatory
genetic network of a cell is necessary for developing improved
medicaments and therapies for fighting genetic diseases.
[0016] Thus a number of medicaments act as agonists or antagonists
of specific target proteins, i.e. they strengthen or weaken the
function of a protein with corresponding effect on the regulatory
genetic network with the aim of bringing this back into a normal
function mode.
[0017] A description of a regulatory genetic network of a cell
using a statistical method, namely a causal network, is known from
DE 10159262.0.
[0018] A causal network in the form of a Bayesian network is known
from Jensen, An Introduction to Bayesian Networks, UCL Press,
London, 1996.
Bayesian Networks
[0019] A Bayesian network B is a specific type of representation of
a joint multivariate probability density function (WDF, from the
German "Wahrscheinlichkeitsdichtefunktion") of a set of variables X
by a graphical model which consists of two parts.
[0020] The first component is a directed acyclic graph (DAG) G, in
which each node i=1, . . . , n corresponds to a random variable
X.sub.i.
[0021] The connectors between the nodes represent statistical
dependencies and can be interpreted as causal relationships between
them. The second component of the Bayesian network is the set of
conditional WDFs P(X.sub.i|Pa.sub.i,.theta.,G), which are
parameterized by a vector .theta..
[0022] These conditional WDFs specify the type of dependency of
each variable X.sub.i on the set of its parents Pa.sub.i. Thus the
joint WDF can be broken down into the product form

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_i, \theta, G)$$
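As an illustration of the product form, the following minimal Python sketch evaluates the joint probability of a toy network as the product of its conditional probabilities. The three-gene chain A -> B -> C, the binary states and all probability values are assumptions made up for the example; they are not taken from the application.

```python
from itertools import product

# Hypothetical three-gene chain A -> B -> C; state 0 = normal, 1 = overexpressed.
PARENTS = {"A": (), "B": ("A",), "C": ("B",)}

# CPT[node][parent_values] = (P(state 0), P(state 1)); made-up numbers.
CPT = {
    "A": {(): (0.7, 0.3)},
    "B": {(0,): (0.9, 0.1), (1,): (0.2, 0.8)},
    "C": {(0,): (0.6, 0.4), (1,): (0.1, 0.9)},
}


def joint_probability(assignment):
    """Return the product over all nodes of P(X_i | Pa_i)."""
    p = 1.0
    for node, parents in PARENTS.items():
        pa_values = tuple(assignment[pa] for pa in parents)
        p *= CPT[node][pa_values][assignment[node]]
    return p


# Summing the product form over every joint state must give 1.
total = sum(
    joint_probability(dict(zip(PARENTS, values)))
    for values in product((0, 1), repeat=len(PARENTS))
)
```

Because every conditional table is normalized, `total` equals 1 up to rounding, which is a quick sanity check on any hand-written set of tables.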
[0023] The DAG of a Bayesian network uniquely describes the
conditional dependency and independency relationships between a set
of variables, but by contrast a given statistical structure of the
WDF does not result in any unique DAG.
[0024] Instead it can be shown that two DAGs describe one and the
same WDF, if and only if they feature the same set of connectors
and the same set of "colliders", with a collider being a
constellation in which at least two directed connectors lead to the
same node.
SUMMARY OF THE INVENTION
[0025] An object of the invention is to specify a method which
allows an analysis of a regulatory genetic network of a cell, for
example represented by at least one gene expression pattern of the
cell.
[0026] A further object of the invention is to specify a method
which enables a defective gene to be identified, for example a
cancer or tumor gene, in the regulatory genetic network of a
cell.
[0027] Further the invention is designed to allow a simulation
and/or an analysis of an effect of a medicament on the regulatory
genetic network of a cell.
[0028] In the basic method for analysis of a regulatory genetic
network of a cell a causal network is used, [0029] the causal
network describing the regulatory genetic network of the cell such
that nodes of the causal network represent genes of the regulatory
genetic network and connectors of the causal network represent
regulatory interactions between the genes of the regulatory genetic
network
[0030] In the analysis method a gene expression rate is now
specified for a selected gene of the regulatory genetic network.
Using the causal network a resulting gene expression pattern is
generated for the predetermined gene expression rate for the
regulatory genetic network. The resulting gene expression pattern
generated is subsequently compared with a predetermined gene
expression pattern of the regulatory genetic network.
[0031] A probabilistic semantic of a causal network, such as of a
Bayesian network, is very well suited to analysis of gene
expression rates, given for example in the form of microarray data,
since it is adapted to the stochastic nature both of biological
processes and also to experiments susceptible to noise.
[0032] Furthermore, viewed in illustrative terms, an effect of an
expression state of specific genes on a global gene expression
pattern (inverse modeling) is estimated, in that a resulting gene
expression pattern is analyzed.
[0033] The developments described below relate to both the method
and to the configuration.
[0034] The invention and the developments described below can be
implemented both in software and also in hardware, for example by
using a specific electrical circuit.
[0035] With a further development the selected gene is selected
using the causal network by a dependency analysis.
[0036] The gene expression rate of the selected gene can also be
predetermined such that the predetermined gene expression rate of
the selected gene reflects an assumption of a gene defect.
[0037] A Bayesian network can be used as the causal network.
[0038] The causal network can also be of the DAG (Directed Acyclic
Graph) type.
[0039] Furthermore the generated resulting and/or the predetermined
gene expression pattern can represent discrete gene states, which
can be an overexpressed, a normal or an underexpressed gene
state.
[0040] In a further development the generated resulting gene
expression pattern can be compared with the predetermined gene
expression pattern using a statistical method and/or a statistical
code, especially a measure of distance.
[0041] There can also be provision for the causal network to be
trained using gene expression patterns, with the nodes and the
connectors of the causal network being adapted.
[0042] Furthermore it is expedient for the gene expression
patterns, especially the predetermined gene expression pattern
and/or the gene expression patterns for training, to be determined
using a DNA microarray technique.
[0043] In one embodiment the predetermined gene expression pattern
and/or the gene expression pattern for training is a gene
expression pattern of a genetic regulatory network of a diseased
cell.
[0044] Here for example the diseased cell can be a cancer cell,
especially an oncocell with ALL (Acute Lymphoblastic Leukemia).
[0045] Furthermore the diseased cell can feature an oncogene,
especially an ALL oncogene.
[0046] Also, for each of a plurality of selected genes of the
regulatory genetic network a gene expression rate can be
predetermined, a plurality of resulting gene expression patterns
generated and/or a plurality of comparisons undertaken.
[0047] In a further development the generation of the plurality of
resulting gene expression patterns is performed iteratively.
[0048] Furthermore the inventive procedure or development is
particularly suitable for identifying a dominant gene and/or a
degenerated/mutated/diseased gene/oncogene/tumor-suppressor
gene.
[0049] It is also suitable for identifying a tumor cell, for
example in connection with cancer detection.
[0050] Further the inventive method is especially suited to
analyzing the causes of an abnormal gene expression pattern/gene
expression rate.
[0051] It can also be used for a simulation and/or analysis of the
effects of a medicament.
BRIEF DESCRIPTION OF THE DRAWINGS
[0052] These and other objects and advantages of the present
invention will become more apparent and more readily appreciated
from the following description of an exemplary embodiment of the
invention, taken in conjunction with the accompanying drawings of
which:
[0053] FIG. 1 is a flowchart of a procedure for investigating
genetically-related causes of illness through Bayesian inverse
modelling using a cancer as an example;
[0054] FIG. 2 is a procedural listing for an algorithm for creating
a data set of N samples in accordance with an exemplary
embodiment;
[0055] FIG. 3 is a procedural listing for a procedure for creating
data sets, which reflect an effect of different observations in
accordance with an exemplary embodiment;
[0056] FIGS. 4a and 4b are graphs which show that data obtained by
sampling show subtype characteristic expression patterns as also in
an original data set;
[0057] FIG. 5 is a graph which shows the probability of each
subtype under the condition that a given gene is overexpressed,
for all 271 genes;
[0058] FIG. 6 is a graph structure of a causal network, which
represents a regulatory genetic network.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0059] Reference will now be made in detail to the preferred
embodiments of the present invention, examples of which are
illustrated in the accompanying drawings, wherein like reference
numerals refer to like elements throughout.
Exemplary Embodiment Investigation of Genetically-Related Causes of
Diseases Using Bayesian Inverse Modelling Using a Cancer as an
Example (Espec. FIG. 1)
Overview of the Bayesian Inverse Modelling (BIM) Procedure
[0060] In many areas of empirical research the desire is to reach
conclusions from the observation of trial results about the
underlying principle and its causes--the relationship between
"cause" and "effect".
[0061] For example in cancer research the underlying principle is
studied which causes a normal cell to transform into a malignant,
rapidly growing cancer cell.
[0062] The effect of the various types of cancer is known, e.g. the
general appearance of a cancer cell compared to a normal cell,
measured with the aid of microarray chips.
[0063] By contrast the cause of its origination is largely
unknown.
[0064] On the basis of the understanding that cancer is a genetic
illness and that it is attributable to a deviation in the behavior
of cells, the research is concentrating on discovering the genetic
principles which are responsible for the development of the
cancer.
[0065] An important task in this environment is to identify genes
which can play a role in tumor genesis, such as for example growth
and tumor-suppressing genes.
[0066] A procedure is described below with which it is possible to
identify genes which are a potential cause of tumor genesis.
[0067] One element of the procedure is a statistical method, in
this case a Bayesian network (see Jensen, above and subsequent
associated embodiments for more details), which is learnt (see DE
10159262.0) from a microarray data set as described in Stetter (see
"Structural learning" below) (cf. FIG. 1).
[0068] In this case it is assumed that the set of the measured gene
expression vectors X belongs to a population with a
high-dimensional multivariate probability density function which
is modelled with the aid of a Bayesian network with adaptive
network structure.
[0069] The relationships between the variables, namely the
conditional dependences and independences, are represented by a
Directed Acyclic Graph (DAG) G.
[0070] The probabilistic semantic of the Bayesian network is very
well suited to the analysis of microarray data since it is adapted
to the stochastic nature both of the biological processes and also
of the experiments susceptible to noise.
[0071] In the procedure described below the learnt Bayesian network
is used as a generative model for sampling artificial microarray
data sets which reflect the learnt conditional probability
distributions (cf. FIG. 1, step 110-130).
[0072] Furthermore the effect of the expression state of specific
genes on the global gene expression pattern (inverse modelling) is
estimated, in that a resulting data set is analyzed (cf. FIG. 1,
step 110-130).
[0073] In the procedure described below each gene is also assigned
the probability with which it is the cause of particular cell
states.
[0074] To this end these data sets are compared with data obtained
from microarray investigations of various known cell states (cf.
FIG. 1, step 130).
[0075] Seen in general terms, the procedure does not concentrate
explicitly on the structures of the network, but rather on the
probability distribution which is derived from the learnt Bayesian
network.
[0076] Finally the procedure is applied to microarray data of
different subtypes of pediatric acute lymphoblastic leukemia (ALL)
of Yeoh et al., "Classification, Subtype Discovery, and Prediction
of Outcome in Pediatric Acute Lymphoblastic Leukemia by Gene
Expression Profile", Cancer Cell, 2002, pp. 133-143.
[0077] The comparison of the artificial data with expression
patterns of specific cancer subtypes enables a measure of
probability of the illness-causing behavior of each gene (cf. FIG.
1, step 130) to be obtained.
[0078] Results of the applied procedure show that, in connection
with Bayesian Inverse Modelling (BIM) this allows the effect of
pathogenetically modified expression levels on the global gene
expression pattern to be predicted, in which case already known
oncogenes as well as potential new ones are found.
Bayesian Networks
[0079] The basic principles of Bayesian networks as described in
Jensen have already been described above.
[0080] In the case of the modelling of a regulatory genetic network
by a Bayesian network genes or their corresponding proteins are
symbolized by nodes.
[0081] Regulation mechanisms are described by connectors between
two nodes, which can be interpreted in a causal manner.
[0082] The quality of the regulation is encoded in the conditional
probability distribution of the gene involved for given regulators
of the same.
Structural Learning
[0083] The process of structural learning can be described as
follows:
[0084] Let D={d.sup.1, d.sup.2, . . . , d.sup.N} be a data set of N
independent observations, with each data point being an
n-dimensional vector with components d.sup.l=(d.sup.l.sub.1,
d.sup.l.sub.2, . . . , d.sup.l.sub.n). For a given D the structure
G of the Bayesian network is to be found which best corresponds to
D, i.e. which maximizes the Bayes score

$$S(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$
[0085] with P(D|G) being the marginal likelihood, P(G) the a priori
probability of the structure and P(D) the evidence.
[0086] Since both the a priori probability and the evidence are
unknown, the problem is reduced to determining the structure with
the best marginal likelihood for the data (Heckerman et al.,
"Learning Bayesian networks: The combination of knowledge and
statistical data", Machine Learning, vol. 20, 1995, pp. 197-243).
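For complete discrete data, P(D|G) has a closed form under a Dirichlet prior on the parameters. The sketch below assumes the special case in which all Dirichlet hyperparameters equal 1 (often called the K2 metric); the prior actually used in the embodiment is not stated in this text, and the function name, data layout and toy data are illustrative assumptions.

```python
from collections import Counter
from math import lgamma


def log_marginal_likelihood(data, parents, n_states):
    """log P(D | G) for complete discrete data, uniform Dirichlet prior (alpha = 1).

    data     -- list of dicts mapping gene name -> state index (0 .. n_states-1)
    parents  -- dict mapping gene name -> tuple of parent gene names (the DAG G)
    """
    score = 0.0
    for node, pa in parents.items():
        # N_ijk: how often `node` is in state k while its parents are in config j.
        counts = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        # N_ij: how often parent configuration j occurs at all.
        pa_totals = Counter()
        for (pa_values, _state), c in counts.items():
            pa_totals[pa_values] += c
        for pa_values, n_ij in pa_totals.items():
            score += lgamma(n_states) - lgamma(n_states + n_ij)
            for state in range(n_states):
                score += lgamma(1 + counts[(pa_values, state)])
    return score


# Toy data: two hypothetical genes that are always in the same state.
DATA = [{"A": 0, "B": 0}] * 4 + [{"A": 1, "B": 1}] * 4

no_edge = log_marginal_likelihood(DATA, {"A": (), "B": ()}, n_states=2)
with_edge = log_marginal_likelihood(DATA, {"A": (), "B": ("A",)}, n_states=2)
```

On this toy data the structure with the connector A -> B receives the higher score, which is the behavior a structure search relies on when it compares candidate graphs.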
[0087] If the data set D consists of N microarray experiments, e.g.
of cell samples of different patients, each data vector
{d.sup.l.sub.1, d.sup.l.sub.2, . . . , d.sup.l.sub.n} represents
the expression profile of n genes in a microarray experiment.
[0088] A Bayesian network learnt from such data encodes the
probability distribution of n genes, which were obtained from these
N microarray experiments.
Bayesian Inverse Modelling (BIM)
Generative Model
[0089] A learnt (see notes above about "structural learning")
Bayesian network B represents a density estimation function which
reflects the probability distribution of the data set D, on the
basis of which it was learnt, with the aid of the set of
conditional WDFs.
[0090] This means that it can be used as a generative model for
creating a data set D.sub.B which reflects the density distribution
obtained from D.
[0091] FIG. 2 shows an algorithm 200 for creating a data set of N
samples from B.
[0092] The first step 210 of the algorithm 200 consists of
arranging all variables such that the parents (parent nodes)
Pa.sub.i are instantiated before X.sub.i.
[0093] Subsequently the variables corresponding to the arrangement
are selected and instantiated with a value 220.
[0094] The value of each variable is selected with the probability
P(state|Pa.sub.i). This step is repeated 230, until N samples are
created.
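The three steps of the algorithm (ordering, instantiation, repetition) correspond to what is commonly called ancestral or forward sampling. A minimal Python sketch follows; the toy network, its probability values and the function names are assumptions for illustration, and the ordering routine assumes that the graph really is acyclic.

```python
import random

# Hypothetical chain A -> B -> C, deliberately listed out of order.
PARENTS = {"C": ("B",), "B": ("A",), "A": ()}
# CPT[node][parent_values] = probabilities of state 0 and state 1 (made up).
CPT = {
    "A": {(): (0.7, 0.3)},
    "B": {(0,): (0.9, 0.1), (1,): (0.2, 0.8)},
    "C": {(0,): (0.6, 0.4), (1,): (0.1, 0.9)},
}


def topological_order(parents):
    """Step 210: arrange variables so that parents come before children."""
    order, placed = [], set()
    while len(order) < len(parents):  # terminates only for an acyclic graph
        for node, pa in parents.items():
            if node not in placed and all(p in placed for p in pa):
                order.append(node)
                placed.add(node)
    return order


def sample_dataset(n_samples, rng):
    """Steps 220 and 230: instantiate each variable in order, repeat N times."""
    order = topological_order(PARENTS)
    data = []
    for _ in range(n_samples):
        sample = {}
        for node in order:
            pa_values = tuple(sample[p] for p in PARENTS[node])
            probs = CPT[node][pa_values]  # P(state | Pa_i)
            sample[node] = rng.choices(range(len(probs)), weights=probs)[0]
        data.append(sample)
    return data


data_set = sample_dataset(300, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the generated data set reproducible, which is convenient when different evidence settings are to be compared later.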
Probabilistic Inference
[0095] A significant problem in Bayesian networks is evidence
propagation, meaning the determination of the a posteriori
distribution P(X.sub.q|E) of a query variable X.sub.q when a
certain evidence E has been observed in the Bayesian network.
[0096] From the definition of conditional probability, the a
posteriori probability is

$$P(X_q \mid E) = \frac{P(X_q, E)}{P(E)} = \frac{\sum_{X \setminus (X_q, X_E)} P(X)}{\sum_{X \setminus X_E} P(X)}$$

with X.sub.E designating the set of the observed variables.
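The two sums in the expression for P(X.sub.q|E) can be evaluated literally by enumerating all joint states, which is feasible only for very small networks. The Python sketch below does this for a made-up three-gene chain; the network, its numbers and the function names are assumptions for illustration.

```python
from itertools import product

# Hypothetical chain A -> B -> C with binary states; made-up probabilities.
NODES = ("A", "B", "C")
PARENTS = {"A": (), "B": ("A",), "C": ("B",)}
CPT = {
    "A": {(): (0.7, 0.3)},
    "B": {(0,): (0.9, 0.1), (1,): (0.2, 0.8)},
    "C": {(0,): (0.6, 0.4), (1,): (0.1, 0.9)},
}


def joint(assignment):
    """P(X) as the product of the conditional probabilities."""
    p = 1.0
    for node in NODES:
        pa_values = tuple(assignment[pa] for pa in PARENTS[node])
        p *= CPT[node][pa_values][assignment[node]]
    return p


def posterior(query, evidence):
    """P(query | evidence) by summing the joint over all unobserved variables."""
    numerator = {0: 0.0, 1: 0.0}  # accumulates P(X_q = s, E) for each state s
    for values in product((0, 1), repeat=len(NODES)):
        assignment = dict(zip(NODES, values))
        if all(assignment[k] == v for k, v in evidence.items()):
            numerator[assignment[query]] += joint(assignment)
    p_evidence = sum(numerator.values())  # P(E), the denominator
    return {state: p / p_evidence for state, p in numerator.items()}


# Probability that gene A is overexpressed given that C is observed overexpressed.
post_a = posterior("A", {"C": 1})
```

The loop visits every joint state, so the cost grows exponentially with the number of genes; this is the time complexity that exact inference algorithms are designed to reduce.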
[0097] To keep the computation time manageable, the different
methods of exact inference use the general principle of dynamic
programming.
[0098] As part of this exemplary embodiment a simple inference
algorithm, "bucket elimination", as described in Dechter, R.,
"Bucket Elimination: A unifying framework for probabilistic
inference", Uncertainty in Artificial Intelligence, UAI '96, pp.
211-219, is used.
[0099] The basic idea of this inference algorithm consists of
eliminating variables one after the other by summation, in
accordance with an order of elimination p.
[0100] In this way P(X.sub.q|E) can be calculated efficiently
within a reasonable time.
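The idea of eliminating variables one at a time by summation can be sketched in a few lines of Python. This is a simplified variable-elimination routine in the spirit of bucket elimination, not a reproduction of Dechter's algorithm; the factor representation, the function names and the toy numbers are assumptions for illustration.

```python
from itertools import product


def multiply(f, g, card):
    """Pointwise product of two factors; a factor is (variables, table)."""
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = tuple(dict.fromkeys(f_vars + g_vars))  # ordered union
    out_tab = {}
    for values in product(*(range(card[v]) for v in out_vars)):
        row = dict(zip(out_vars, values))
        out_tab[values] = (f_tab[tuple(row[v] for v in f_vars)]
                           * g_tab[tuple(row[v] for v in g_vars)])
    return out_vars, out_tab


def sum_out(f, var):
    """Eliminate `var` from a factor by summing over its states."""
    f_vars, f_tab = f
    idx = f_vars.index(var)
    out_tab = {}
    for values, p in f_tab.items():
        key = values[:idx] + values[idx + 1:]
        out_tab[key] = out_tab.get(key, 0.0) + p
    return f_vars[:idx] + f_vars[idx + 1:], out_tab


def eliminate(factors, order, card):
    """Sum out the variables in `order`, multiplying only the factors involved."""
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        if not bucket:
            continue
        rest = [f for f in factors if var not in f[0]]
        combined = bucket[0]
        for f in bucket[1:]:
            combined = multiply(combined, f, card)
        factors = rest + [sum_out(combined, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, card)
    return result


# Toy chain A -> B -> C with C observed in state 1 (made-up numbers).
CARD = {"A": 2, "B": 2}
f_a = (("A",), {(0,): 0.7, (1,): 0.3})
f_b = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
f_c_observed = (("B",), {(0,): 0.4, (1,): 0.9})  # P(C=1 | B), evidence applied

variables, table = eliminate([f_a, f_b, f_c_observed], order=["B"], card=CARD)
norm = sum(table.values())
posterior_a = {values[0]: p / norm for values, p in table.items()}
```

Only the factors that mention the variable being eliminated are multiplied together, so intermediate tables stay small when the elimination order is chosen well.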
Interventional Modelling by Setting the Evidence
[0101] With the interventional modelling approach the effect of
specific observations on the behavior of the Bayesian network is
estimated using a combination of probabilistic inference and data
sampling.
[0102] In accordance with FIG. 3 the Bayesian network can be viewed
as a kind of black box 300, with the input being given by a set of
observations E 310 and the corresponding list of observed variables
X.sub.E 320.
[0103] The output, which is given by the data set D.sub.B|E 330 is
created using the method previously explained in association with
FIG. 2.
[0104] In addition the empirical evidence is to be taken into
account.
[0105] Consequently each state of X.sub.i is selected with
probability P(state|Pa.sub.i,E), which is calculated by
probabilistic inference.
[0106] With the procedure described in accordance with FIG. 3
different data sets can now be created which reflect the effect of
the different observations.
[0107] If, as described below, biological effects are analyzed,
this means that through this method of operation in accordance with
FIG. 3 artificial microarray data can be created which reflects the
probability distribution of a certain data set if specific
observations are given.
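A data set D.sub.B|E that respects a given evidence can be sketched for a very small network as follows. Note the substitution: instead of computing each conditional probability with an inference algorithm during sampling, this sketch enumerates all joint states consistent with the evidence and draws complete samples in proportion to their joint probability. The toy network, its numbers and the function names are assumptions for illustration.

```python
import random
from itertools import product

# Hypothetical chain A -> B -> C with binary states; made-up probabilities.
NODES = ("A", "B", "C")
PARENTS = {"A": (), "B": ("A",), "C": ("B",)}
CPT = {
    "A": {(): (0.7, 0.3)},
    "B": {(0,): (0.9, 0.1), (1,): (0.2, 0.8)},
    "C": {(0,): (0.6, 0.4), (1,): (0.1, 0.9)},
}


def joint(assignment):
    """P(X) as the product of the conditional probabilities."""
    p = 1.0
    for node in NODES:
        pa_values = tuple(assignment[pa] for pa in PARENTS[node])
        p *= CPT[node][pa_values][assignment[node]]
    return p


def sample_given_evidence(evidence, n_samples, rng):
    """Draw samples from P(X | E) by brute-force enumeration (tiny networks only)."""
    states, weights = [], []
    for values in product((0, 1), repeat=len(NODES)):
        assignment = dict(zip(NODES, values))
        if all(assignment[k] == v for k, v in evidence.items()):
            states.append(assignment)
            weights.append(joint(assignment))  # unnormalized posterior weight
    drawn = rng.choices(states, weights=weights, k=n_samples)
    return [dict(s) for s in drawn]  # copy so samples do not share one dict


# Artificial data set with gene C fixed to "overexpressed".
d_b_given_e = sample_given_evidence({"C": 1}, 2000, random.Random(0))
```

For networks with hundreds of genes the enumeration is infeasible, which is why a dedicated inference step is needed in practice.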
[0108] If the artificially created data from a known origin is
compared for example with a cancer-specific set of measurement
data, those genes can be determined which, when they are fixed at a
certain expression level, will influence the model so that these
two microarray data sets, the artificial and the known, exhibit the
same characteristics.
Statistical Comparison of Data Sets
[0109] In order to estimate the quality of the influence of the
evidence E on the behavior of the Bayesian network B, the created
data set D.sub.B|E is compared with a set of data sets D of known
states S.
[0110] It is assumed that D describes the effect of different types
of cancer. In accordance with the embodiment the behavior of
evidence E relating to a specific type of cancer S can now be
described.
[0111] By using a measure of distance d, the change a of the
correlation between D.sub.B|E and D.sub.S as a result of E can be
estimated:

$$a(E) = \frac{d(D_{B|E}, D_S)}{d(D_B, D_S)}$$

with the distance between the two data sets having been
standardized with the aid of the distance between D.sub.B, which
was taken from B without evidence, and D.sub.S.
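The ratio a(E) can be computed once a distance between data sets has been chosen. The text here does not fix a particular distance, so the sketch below assumes, purely for illustration, the Euclidean distance between per-gene mean expression profiles; the function names and the toy numbers are likewise hypothetical.

```python
from math import dist


def mean_profile(dataset):
    """Per-gene mean over all samples; dataset is a list of equal-length rows."""
    n = len(dataset)
    return [sum(column) / n for column in zip(*dataset)]


def dataset_distance(d1, d2):
    """Assumed distance measure: Euclidean distance between mean profiles."""
    return dist(mean_profile(d1), mean_profile(d2))


def evidence_effect(d_b_given_e, d_b, d_s):
    """a(E) = d(D_B|E, D_S) / d(D_B, D_S)."""
    return dataset_distance(d_b_given_e, d_s) / dataset_distance(d_b, d_s)


# Toy data sets with two genes each (made-up values).
D_S = [[2, 2], [2, 2]]          # known state S
D_B = [[0, 2], [0, 2]]          # sampled without evidence
D_B_GIVEN_E = [[1, 2], [1, 2]]  # sampled with evidence E

a = evidence_effect(D_B_GIVEN_E, D_B, D_S)
```

With this distance, a value of a(E) below 1 means that setting the evidence moved the sampled data closer to the known state S, and a value above 1 means it moved the data away.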
[0112] As a result, in accordance with the embodiment, the
influence of an observed evidence, e.g. the expression state of a
specific gene, on a behavior of the model characteristic for
cancer is measurable.
[0113] Secondly the probability can be calculated of B creating a
data set D.sub.B|E which is equal to D.sub.S for a given E.
[0114] For this purpose an estimate is made of how many samples
d.sup.l of D.sub.B|E lie closest to D.sub.S, in that the distance
between each sample and each data set in D is calculated.
[0115] The a posteriori probability P(S|E) of the occurrence of the
cancer type S for given evidence E is thus obtained:

$$P(S \mid E) = \frac{N_{ES}}{N} \qquad (5)$$

with N.sub.ES being the number of samples of D.sub.B|E which are
statistically closest to the data set D.sub.S, and with N being
the total number of samples of D.sub.B|E.
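Counting the samples that lie closest to each known data set gives the estimate of P(S|E) directly. The sketch below assumes that the distance between a sample and a data set is the Euclidean distance to that data set's mean profile; this choice, the function names and the toy values are illustrative assumptions, and only the subtype labels are borrowed from the text.

```python
from collections import Counter
from math import dist


def mean_profile(dataset):
    """Per-gene mean over all samples of one reference data set."""
    n = len(dataset)
    return [sum(column) / n for column in zip(*dataset)]


def subtype_posterior(samples, reference_sets):
    """Estimate P(S | E) = N_ES / N for every known state S."""
    centroids = {name: mean_profile(rows) for name, rows in reference_sets.items()}
    nearest = Counter(
        min(centroids, key=lambda name: dist(sample, centroids[name]))
        for sample in samples
    )
    return {name: nearest[name] / len(samples) for name in centroids}


# Toy reference data for two subtypes and four artificial samples (made up).
REFERENCE = {
    "T-ALL": [[0, 0], [0, 2]],
    "E2A-PBX1": [[2, 2], [2, 0]],
}
SAMPLES = [[0, 1], [0, 0], [2, 2], [0, 2]]

p_s_given_e = subtype_posterior(SAMPLES, REFERENCE)
```

Three of the four toy samples are nearest to the first reference set, so the estimate assigns it a probability of 0.75 and the other subtype 0.25.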
[0116] As already pointed out above, empirical research deals with
the relationship between cause and effect, in that it draws
conclusions about the underlying cause from experimental
observation.
[0117] With the Bayesian Inverse Modelling approach in accordance
with the exemplary embodiment an underlying cause is estimated by
first creating an effect which stems from a known observation.
[0118] After this inverse step this effect is compared with effects
which are well-defined but for which the cause is unknown.
[0119] The potential cause of the best-match effect is then given
by the observation which gives rise to the created effect.
The ALL Microarray Data Set of Yeoh et al.
[0120] The data which is used for the analysis in accordance with
the exemplary embodiment consists of 327 samples of various
subtypes of pediatric acute lymphoblastic leukemia (ALL).
[0121] The data set was assembled by Yeoh and his colleagues at the
St. Jude Children's Research Hospital.
[0122] ALL is a heterogeneous illness which includes different
subtypes, including both T-cell type leukemia and B-cell type
leukemia, which differ as regards their reaction to a medical
treatment.
[0123] Apart from T-ALL, of which the cause is not clearly known,
each B-cell subtype can be traced back to a specific genetic
modification, e.g. to genetic translocations t(9;22) [BCR-ABL],
t(1;19) [E2A-PBX1], t(12;21) [TEL-AML1], t(4;11) [MLL] or to a
hyperdiploid karyotype [>50 chromosomes].
[0124] No wonder then that the gene expression patterns of the
different subtypes differ very markedly from one another.
[0125] Furthermore microarray data exhibits one more clear
expression profile which points to the existence of a further ALL
subtype in addition to the 6 known.
[0126] It should be pointed out that Yeoh et al. arrive at a robust
classification of the subtypes using a support vector machine with
a set of 271 discriminating genes.
Results
Learnt Structure
[0127] For analysis in accordance with the exemplary embodiment the
reduced data set of 271 genes and 327 samples of different ALL
subtypes, as described above with respect to the work by Yeoh et
al., is used.
[0128] To perform the learning process of a multivariate model the
values in the data set have been discretized into the three discrete
values "under-expressed", "expressed normally" and "over-expressed".
[0129] The learnt structure shows scale-free characteristics, a
feature which is typical of biological networks, such as metabolic
networks or signaling networks.
[0130] Such networks are characterized by a power-law distribution of
the degree of a node, which is defined as the number of connections
to other nodes.
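By way of illustration only, the degree of each node and the empirical
degree distribution P(k) can be computed from a list of connections as
sketched below. For a scale-free network P(k) is expected to fall off
approximately as a power law, P(k) ~ k^(-gamma). The function names and
the edge-list representation are assumptions made for this sketch and
are not taken from the exemplary embodiment.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def node_degrees(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count, for every node, the number of connections to other nodes."""
    degree: Counter = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


def degree_histogram(degrees: Dict[str, int]) -> Dict[int, float]:
    """Return P(k): the fraction of nodes that have degree k."""
    counts = Counter(degrees.values())
    n_nodes = len(degrees)
    return {k: c / n_nodes for k, c in sorted(counts.items())}


# Hypothetical toy network with a single strongly connected node "A".
toy_edges = [("A", "B"), ("A", "C"), ("A", "D")]
toy_degrees = node_degrees(toy_edges)
toy_histogram = degree_histogram(toy_degrees)
```

Plotting the histogram on doubly logarithmic axes would be one simple
way of checking whether it is roughly linear, as a power law suggests.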
[0131] Strongly connected nodes have a strong influence on the
dynamics and robustness of scale-free networks, and many of these
strongly connected genes in our model are actually known to play a
role in oncogenesis or in the critical processes associated with
the development of cancer, e.g. DNA repair.
[0132] First a data set of 300 samples is now created from the
model in order to estimate the statistics which are defined by the
set of the conditional probabilities.
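Taking samples from a Bayesian network with discrete nodes can be
sketched as follows. This is a minimal illustration which assumes a
hypothetical two-gene network with invented probability tables; the
names PARENTS, CPT and draw_samples are not part of the exemplary
embodiment. Each node is sampled after its parents, using the
conditional probabilities selected by the parent states.

```python
import random
from typing import Dict, List, Tuple

# 0 = under-expressed, 1 = expressed normally, 2 = over-expressed
STATES = (0, 1, 2)

# Hypothetical two-gene network, listed in topological order (parents first).
PARENTS: Dict[str, Tuple[str, ...]] = {"G1": (), "G2": ("G1",)}

# Hypothetical conditional probability tables:
# tuple of parent states -> probabilities of the three states of the node.
CPT: Dict[str, Dict[Tuple[int, ...], List[float]]] = {
    "G1": {(): [0.2, 0.6, 0.2]},
    "G2": {
        (0,): [0.7, 0.2, 0.1],
        (1,): [0.1, 0.8, 0.1],
        (2,): [0.1, 0.2, 0.7],
    },
}


def draw_samples(n_samples: int, rng: random.Random) -> List[Dict[str, int]]:
    """Draw samples by visiting each node after its parents."""
    samples = []
    for _ in range(n_samples):
        sample: Dict[str, int] = {}
        for node, parents in PARENTS.items():
            parent_states = tuple(sample[p] for p in parents)
            weights = CPT[node][parent_states]
            sample[node] = rng.choices(STATES, weights=weights)[0]
        samples.append(sample)
    return samples


# 300 samples, matching the size of the generated data set in the text.
data = draw_samples(300, random.Random(0))
```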
[0133] FIGS. 4a and 4b show that data obtained by taking samples
(FIG. 4b) shows subtype characteristic expression patterns, as is
also the case in the original data set (FIG. 4a).
[0134] The patterns of a number of subtypes such as E2A-PBX1 or
T-ALL, are reproduced very well whereas others are generated less
well, e.g. the pattern of the subtype MLL, or are missed completely
such as for example BCR-ABL.
Modelling of Leukaemia Subtypes by Intervention
[0135] The learnt Bayesian network is the starting point of the
exemplary embodiment: inverse modelling is used to find those genes
which, when fixed at a specific expression level, influence the model
such that the generated artificial microarray data set exhibits
specific characteristics.
[0136] As described above, the probability P(C|E) of the creation of
a specific cancer subtype C is estimated for a given observation E,
in this case the expression state of a specific gene, i.e.
P(C|Gen_i=state).
[0137] By contrast with Yeoh, not only is the presence of a specific
cancer subtype predicted, but also the genetic mechanisms which lead
to its creation.
[0138] A high probability indicates that the fixed gene is a
potential cause for the subtype-specific expression behavior of the
gene in question, which in its turn can be the underlying cause of
a specific cancerous appearance.
[0139] 7 reference data sets are used for the comparison, with each
of these having been obtained in conjunction with a specific ALL
subtype.
[0140] FIG. 4a shows that the original microarray data set is
clearly subdivided into 7 clusters (accumulations of points) with
different sample extents.
[0141] Each of these clusters represents the expression pattern of
271 genes if a specific subtype of leukaemia is given, and has been
used to measure the influence of a piece of evidence on the
occurrence of these different ALL subtypes.
[0142] In a first step each gene is fixed in turn at each of its
expression values, with each of these conditions being used to
generate a data set of 300 samples (FIG. 4b).
[0143] Subsequently all this data is compared with the 7 reference
data sets, as explained previously.
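The comparison step can be illustrated by the following sketch. It
assumes that the samples have already been generated with one gene
fixed, and it estimates the probability of each subtype as the
fraction of generated samples which lie closest to the reference
profile of that subtype. The comparison measure actually used is the
one explained previously; the Hamming distance to a single
representative profile per subtype, like all names in the code, is
merely an assumed stand-in for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of genes whose discrete expression state differs."""
    return sum(x != y for x, y in zip(a, b))


def subtype_probabilities(
    generated: Sequence[Sequence[int]],
    references: Dict[str, Sequence[int]],
) -> Dict[str, float]:
    """Fraction of generated samples closest to each reference profile."""
    if not generated:
        raise ValueError("at least one generated sample is required")
    votes: Counter = Counter()
    for sample in generated:
        closest = min(references, key=lambda name: hamming(sample, references[name]))
        votes[closest] += 1
    return {name: votes[name] / len(generated) for name in references}


# Hypothetical three-gene profiles (0 = under, 1 = normal, 2 = over).
reference_profiles = {"E2A-PBX1": (2, 2, 0), "T-ALL": (0, 0, 2)}
generated_samples = [(2, 2, 0), (2, 1, 0), (0, 0, 2), (2, 2, 1)]
estimate = subtype_probabilities(generated_samples, reference_profiles)
```

Repeating this for every gene and every fixed expression value would
yield one probability per gene, state and subtype.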
[0144] In FIG. 5 the probability of each subtype, under the
condition that a gene is overexpressed, is shown on a graph for 271
genes.
[0145] FIG. 5 shows that a small number of genes exist which are
very likely to trigger a specific ALL subtype if they are strongly
active.
[0146] To verify these results the molecular function of specific
genes and their role in biological processes, especially as regards
pathogenesis, is examined in more detail below.
Biological Insights
[0147] These are obtained by examining in greater detail the genes
which are very probably the cause of a specific subtype as well as
significant structure patterns in the learnt network, i.e. dominant
genes and their environment.
[0148] The learnt Bayesian network (model) results from the
microarray data set of different leukaemia subtypes and reflects
transcriptional relationships between genes which occur in these
malignant cancer cells.
[0149] Thus genes which trigger a specific subtype are either
potential oncogenes or are regulated by such genes.
[0150] The first gene to be analyzed in more detail is the gene
PBX1.
[0151] If it is overexpressed the learnt Bayesian network creates,
with a probability of 0.96, a data set which is characteristic of the
subtype E2A-PBX1 of the ALL of B-cell type (see FIG. 5).
[0152] This suggests the obvious assumption that a causal
relationship exists between the "overexpression" of this gene and the
occurrence of the ALL subtype E2A-PBX1.
[0153] And in actual fact PBX1 is known as a proto-oncogene which
causes normal blood cells to mutate into malignant ALL cancer
cells.
[0154] As a result of the chromosome translocation t(1;19) PBX1
merges with the gene E2A and is transformed into a potent oncogene
which causes the leukemia subtype E2A-PBX1.
[0155] Since the graph structure of the model (FIG. 6) can further
be interpreted in a causal manner it provides information about the
interaction between potential oncogenes and other genes which in
its turn can be interpreted as an oncogene regulation.
[0156] If the structure of the network (FIG. 6) is considered, PBX1
represents a dominant gene in that it influences many other genes
but is only regulated by one or a few other genes.
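The notion of a dominant gene can be made concrete with a small
sketch which compares, for each gene, the number of genes it
influences (outgoing connections) with the number of genes regulating
it (incoming connections). The thresholds min_out and max_in and the
toy connections are assumptions chosen for illustration; the original
text gives no numerical criterion.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def dominant_genes(
    edges: Iterable[Tuple[str, str]], min_out: int, max_in: int
) -> List[str]:
    """Genes that influence many genes but have few regulators.

    Each edge is a (regulator, target) pair of the directed graph.
    """
    out_degree: Counter = Counter()
    in_degree: Counter = Counter()
    for regulator, target in edges:
        out_degree[regulator] += 1
        in_degree[target] += 1
    genes = set(out_degree) | set(in_degree)
    return sorted(
        g for g in genes
        if out_degree.get(g, 0) >= min_out and in_degree.get(g, 0) <= max_in
    )


# Hypothetical connections; "R", "A", "B" and "C" are placeholder genes.
toy_edges = [("PBX1", "A"), ("PBX1", "B"), ("PBX1", "C"), ("R", "PBX1"), ("A", "B")]
hubs = dominant_genes(toy_edges, min_out=3, max_in=1)
```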
[0157] In addition, as a result of the conditional probability
distribution, the model identifies PBX1 as a transcription
activator.
[0158] This can also be explained by known biological facts, since
PBX1 activates genes which are normally not expressed or are
expressed at a low level.
[0159] Patients with a hyperdiploidy of >50 chromosomes have
clones of 51-68 chromosomes. Although high hyperdiploid clones are
seldom identical, they tend to exhibit a pattern of the chromosome
increase with additional copies of the chromosomes 4, 6, 10, 14, 18
and 21.
[0160] Trisomy and Polysomy 21 are non-random anomalies which are
frequently observed with ALL. Their occurrence, even if it is not
specific, as well as the increased occurrence of acute leukaemia in
subjects with constitutional Trisomy 21, make it reasonable to assume
that chromosome 21 has a particular role to play in leukemogenesis.
[0161] Another disease, Down's Syndrome, is caused by Trisomy 21
and shows an increased occurrence of leukemia such as ALL.
[0162] As a result the method described makes it possible in this
case, in accordance with the exemplary embodiment, to identify
genes which to a large extent indicate the hyperdiploid ALL
subtype and which are also known to play a significant role in the
occurrence of Down's Syndrome.
[0163] The gene SOD1 is located on chromosome 21 and produces an
enzyme which converts superoxide free radicals into hydrogen
peroxide. Its increased expression in Trisomy 21, which is also to
be observed for the microarray samples of patients with
hyperdiploid karyotype, can give rise to the brain damage which is
to be seen with Down's Syndrome.
[0164] The frequency of the occurrence of the hyperdiploid ALL also
increases in the case in which the gene PSMD10 is
overexpressed.
[0165] PSMD10 is a regulatory subunit of the 26S proteasome, for
which it has been shown that it operates as a natural mechanism
for the breakdown of protein by regulating the protein metabolism
in eukaryotic cells.
[0166] This is of significance for cancers in humans since the cell
cycle, the growth of the tumor and the survival are determined by a
great variety of intracellular proteins which are regulated by the
ubiquitin-dependent proteasome breakdown path which is influenced
by PSMD10.
[0167] In more recent scientific work it has been verified that
this breakdown path is often the object of a deregulation
associated with cancer and can be subject to such processes as
oncogene transformation, tumor progression, bypassing of the immune
system and resistance to medicaments.
Abstract of the Exemplary Embodiment
[0168] The exemplary embodiment described presents a new method by
which it is possible to identify genes which are a potential cause
of tumorigenesis, by analyzing the relationships between microarray
data of leukemia subtypes and a data set, which is the result of
taking samples from a learnt Bayesian network.
[0169] This method of operation is based on the modelling of a
regulatory genetic network through a Bayesian network, with genes or
their corresponding proteins being symbolized by the nodes of the
Bayesian network.
[0170] Regulation mechanisms are described by connectors between
two nodes, which can be interpreted in a causal manner.
[0171] The quality of the regulation is encoded in the conditional
probability distribution of the gene involved for given regulators
of the same.
[0172] The understanding of the regulatory genetic network
represents an important step along the road to characterizing the
genetic mechanisms underlying complex diseases.
[0173] In cancer research, where the identification of genes which
suppress growths and tumors plays a key role, the knowledge of new
potential oncogenes and their interactions with other molecules is
an important contribution to discovering the basic principles which
determine why normal cells mutate into malignant cancer cells.
[0174] With the procedure described in accordance with the
exemplary embodiment, especially with Bayesian Inverse Modelling,
it is possible to discover genes with such an oncogene
characteristic simply through a statistical analysis of gene
expression patterns, which have been measured with the aid of DNA
microarrays.
[0175] The underlying theoretical probability model which has been
used is a Bayesian network, which encodes the multivariate
probability distribution of a set of variables by a set of
conditional probability distributions.
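In the usual notation for Bayesian networks, which is assumed here and
does not appear in the original text, this encoding can be written as
follows, where X_1, ..., X_n are the variables (for example the
discretized expression levels of the genes) and Pa(X_i) denotes the
parents of node X_i in the graph:

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

A node without parents contributes its unconditional distribution
P(X_i) to the product.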
[0176] The statistical dependencies are encoded in a graph
structure. In the learning method Bayesian statistics are used to
determine the network structure and the corresponding model
parameters which best describe the probability distribution
contained in the data.
[0177] The invention has been described in detail with particular
reference to preferred embodiments thereof and examples, but it
will be understood that variations and modifications can be
effected within the spirit and scope of the invention covered by
the claims which may include the phrase "at least one of A, B and
C" as an alternative expression that means one or more of A, B and
C may be used, contrary to the holding in Superguide v. DIRECTV, 69
USPQ2d 1865 (Fed. Cir. 2004).
* * * * *