System, method and apparatus for causal implication analysis in biological networks Chandra; Dundee Navin ; et al. [Selventa, Inc.]

Patent Application Summary

U.S. patent application number 14/088472 was filed with the patent office on 2013-11-25 and published on 2014-09-25 for system, method and apparatus for causal implication analysis in biological networks. This patent application is currently assigned to Selventa, Inc. The applicant listed for this patent is Selventa, Inc. Invention is credited to Dundee Navin Chandra, David Kightley, Dexter Pratt, Suresh Toby Segaran, Justin Sun.

Publication Number	20140288910
Application Number	14/088472
Family ID	34652352
Filed Date	2013-11-25

United States Patent Application	20140288910
Kind Code	A1
Chandra; Dundee Navin ; et al.	September 25, 2014

System, method and apparatus for causal implication analysis in biological networks

Abstract

Described are methods, systems and apparatus for hypothesizing a biological relationship in a biological system. A database of biological assertions is provided consisting of biological elements, relationships among the biological elements, and relationship descriptors characterizing the properties of the elements and relationships. A biological element may be selected from the database and a logical simulation may be performed within the biological database, from the selected biological element, through relationship descriptors, along a path defined by potentially causative biological elements to discern a biological element hypothetically responsible for the change in the selected biological element. The logical simulation may be either a backward logical simulation, performed upstream through the relationship descriptors to discern a hypothetical responsible biological element, or a forward logical simulation, performed downstream through the relationship descriptors to discern the extent to which the perturbation generates the observed change in the selected biological element.

Inventors:

Chandra; Dundee Navin; (Framingham, MA) ; Segaran; Suresh Toby; (Somerville, MA) ; Kightley; David; (York, ME) ; Sun; Justin; (Norwood, MA) ; Pratt; Dexter; (Reading, MA)

Applicant:

Name	City	State	Country	Type
Selventa, Inc.	Cambridge	MA	US

Assignee:

Selventa, Inc.
Cambridge
MA

Family ID:

34652352

Appl. No.:

14/088472

Filed:

November 25, 2013

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
10992973	Nov 19, 2004	8594941
14088472

Current U.S. Class:	703/11
Current CPC Class:	G16B 40/00 20190201; G16B 5/00 20190201
Class at Publication:	703/11
International Class:	G06F 19/12 20060101 G06F019/12

Claims

1-35. (canceled)

36. A software implemented method for hypothesizing a biological relationship in a biological system, the method comprising: (a) providing a database of biological assertions comprising a multiplicity of nodes representative of biological elements, relationship descriptors describing relationships between nodes, and characterizing properties of said nodes and relationships; (b) selecting at least a pair of target nodes in the database; and (c) performing a logical simulation within said database between said target nodes through said relationship descriptors along a path defined by at least one potentially causative node or at least one potential effector node to discern one or a group of pathways hypothetically linking said target nodes.

37. The method of claim 36 comprising the additional steps of: (a) simulating perturbation of one of said pair of nodes; and (b) performing a forward logical simulation within said database through said relationship descriptors from said virtually perturbed node downstream along a path defined by potentially affected nodes to discern the extent to which said simulated perturbation generates a predicted effect on the other of said pair of nodes.

38. The method of claim 36 wherein the database comprises noisy data, erroneous data, or omits nodes representing structures, processes, or networks present in the biological system.

39. The method of claim 38 further comprising applying a probability algorithm to plural hypothetical pathways to assess which pathway has the highest probability of representing real biology.

40. The method of claim 36 comprising the additional step of conducting an experiment on a specimen of said biological system to determine the existence or operability of said hypothetical pathway.

41. The method of claim 36, wherein said relationship descriptors comprise descriptors of the condition, location, source, amount, or substructure of a molecule, biological structure, physiological condition, trait, phenotype, biological process, clinical data, medical data, or disease data and chemistry.

42. The method of claim 36, wherein one or more relationship descriptors correspond to an epistemological relationship between a pair of nodes.

43. The method of claim 36, wherein one or more of the relationship descriptors comprise a case frame.

44-48. (canceled)

Description

RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. provisional application No. 60/525,543, entitled "System, Method and Apparatus for Causal Implication Analysis in Biological Networks," filed Nov. 26, 2003, the disclosure of which is incorporated by reference herein.

TECHNICAL FIELD

[0002] The invention relates to methods, systems and apparatus for analyzing causal implications in biological networks, and more particularly, to methods, systems and apparatus for hypothesizing a biological relationship in a biological system, for simulating a perturbation within a biological system, and for hypothesizing a relationship between two biological elements by performing a logical simulation within a database of biological knowledge.

BACKGROUND

[0003] The amount of biological information generated in today's world is increasing dramatically. It is estimated that the amount of information now doubles every four to five years. Because of the large amount of information that must be processed and analyzed, traditional methods of discerning and understanding the meaning of information, especially in the life science-related areas, are breaking down. Statistical techniques, while useful, do not provide a biologically motivated explanation of how things work. The present invention takes a causative (rather than correlative) approach to understanding biological effects.

[0004] To form an effective understanding of a biological system, a life science researcher must synthesize information from many sources. Understanding biological systems is made more difficult by the interdisciplinary nature of the life sciences. Forming an understanding of a biological system may require in-depth knowledge of genetics, cell biology, biochemistry, medicine, and many other fields. Understanding a system may require that information of many different types be combined. Life science information may include material on basic chemistry, proteins, cells, tissues, and effects on organisms or populations, all of which may be interrelated. These interrelations may be complex, poorly understood, or hidden.

[0005] There are ongoing attempts to produce electronic models of biological systems. These involve compilation and organization of enormous amounts of data, and construction of a system that can operate on the data to simulate the behavior of a biological system. Because of the complexity of biology, and the sheer volume of data, the construction of such a system can take hundreds of man-years and multiple tens of millions of dollars. Furthermore, those seeking new insights and new knowledge in the life sciences are presented with the ever more difficult task of connecting the right data from mountains of information gleaned from vastly different sources. Companies willing to invest such resources so far have been unsuccessful in compiling models of real utility which aid researchers significantly in advancing biological knowledge. Thus, to the extent current systems of generating and recording life science data have been developed to permit knowledge processing and analysis, they are clearly far from optimal, and significant new efficiencies are needed.

[0006] More specifically, what is needed in the art is a way to assemble vast amounts of diverse life science-related knowledge, and to discern from it insightful and meaningful new biological relationships, pathways, causes and effects, and other insights with efficiency and ease.

SUMMARY OF THE INVENTION

[0007] In accordance with the invention, it has been realized that a key to providing useful and manageable biological knowledge bases that are capable of effectively modeling biological systems is to provide means for rapidly and efficiently analyzing relationships between biological elements. A biological knowledge base containing assertions regarding the biological elements and the many possible relationships between the elements can be analyzed to facilitate understanding and revelation of hidden interactions and relationships in biological systems, i.e., to produce new biological knowledge. This in turn permits the generation of new hypotheses concerning biological pathways based on the new biological knowledge, and permits the user to design and conduct biological experiments using biomolecules, cells, animal models, or a clinical trial to validate or refute a hypothesis.

[0008] The invention thus provides a novel method, apparatus, and tool set which can be applied to a global knowledge base. The tools and methods enable efficient execution of discovery projects in the life sciences-related fields. The invention permits one to address any biological topic, no matter how obscure or esoteric, provided there are at least some assertions in a global knowledge base relevant to the topic. Assertions represent facts relating existing objects in a system, or a fact about one object in the system and some literal value, or any combination thereof.

[0009] The invention provides methods of hypothesizing a biological relationship in a biological system using a database of biological assertions, or means, such as a user interface, for accessing such a knowledge base. The knowledge base includes a multiplicity of nodes representative of biological elements and relationship descriptors describing relationships among the nodes and characterizing properties of the nodes and relationships. A preferred knowledge base is disclosed in co-pending, co-owned U.S. patent application Ser. No. 10/644,582, the disclosure of which is incorporated by reference herein.

[0010] The invention provides methods for discovering new biological knowledge. The methods include providing a database of biological assertions comprising a multiplicity of nodes representative of biological elements and relationship descriptors describing relationships between nodes and characterizing properties of the nodes and relationships.

[0011] The methods further include selecting a node in the database for analysis. In some embodiments, the selected node is known from experimental observation to correspond to a biological element which increases in number or concentration, decreases in number or concentration, appears within, or disappears from a real biological system when it is perturbed.

[0012] The effect of perturbing a selected node, in another embodiment, is not known to correspond to a biological element, but may be investigated using the system by representing the specific perturbation and then reasoning about the effects based on the biological knowledge represented in the system. The selected node is perturbed by specifying an increase in concentration or number, stimulation of activity, an effective decrease in concentration or number, inhibition of activity, or the appearance or disappearance of the selected node. In another embodiment, a pair or multiplicity of nodes may be selected from the database and perturbed.

[0013] The invention provides methods for performing logical simulation within a biological knowledge base. Logical simulation includes backward logical simulations, which proceed from a selected node upstream through a path of relationship descriptors to discern a node which is hypothetically responsible for the experimentally observed changes in the biological system. In short, this computation answers the question "What could have caused the observed change?" Logical simulation also includes forward logical simulations, which travel from the target node downstream through a path of relationship descriptors to discern the extent to which a perturbation to the target node causes experimentally observed changes in the biological system.
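
By way of non-limiting illustration only, backward and forward logical simulation might be sketched as below. The sketch assumes, purely for this example, that each relationship descriptor is reduced to a signed causal assertion (cause, sign, effect) and that directions of change multiply along a path; the function name logical_simulation, the sign convention, the step limit, and all node names are hypothetical and are not taken from the disclosure.

```python
# Signed causal assertions (cause, sign, effect): +1 means the cause increases
# the effect, -1 means it decreases it. All names are hypothetical.
EDGES = [
    ("Drug", -1, "KinaseA"),
    ("KinaseA", +1, "TF"),
    ("TF", +1, "GeneC_mRNA"),
    ("Repressor", -1, "GeneC_mRNA"),
]


def logical_simulation(edges, start, direction, upstream, max_steps=3):
    """Propagate a change (+1 up, -1 down) from `start` for up to `max_steps` links.

    upstream=True walks effect -> cause (backward simulation);
    upstream=False walks cause -> effect (forward simulation).
    Returns a set of (node, direction) pairs, excluding the starting state.
    """
    origin = (start, direction)
    reached = {origin}
    frontier = {origin}
    for _ in range(max_steps):
        next_frontier = set()
        for node, node_direction in frontier:
            for cause, sign, effect in edges:
                here, there = (effect, cause) if upstream else (cause, effect)
                if here != node:
                    continue
                # The neighbor's direction is the current direction times the sign.
                state = (there, node_direction * sign)
                if state not in reached:
                    reached.add(state)
                    next_frontier.add(state)
        frontier = next_frontier
    return reached - {origin}


# Backward: what could have caused an observed decrease in GeneC_mRNA?
causes = logical_simulation(EDGES, "GeneC_mRNA", -1, upstream=True)

# Forward: does an increase in Drug predict the observed decrease?
effects = logical_simulation(EDGES, "Drug", +1, upstream=False)
explains_observation = ("GeneC_mRNA", -1) in effects
```

In this toy network the backward pass proposes an increase in Drug as one candidate cause, and the forward pass from that candidate reproduces the observed decrease, which is the kind of agreement a hypothesis would be checked for.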

[0014] The invention provides methods for performing a logical simulation on a hypothetical perturbation. One or more nodes may be selected and specified as perturbed, regardless of whether they are observed in an actual experiment. Logical simulation on the hypothetical perturbation includes backward logical simulations, which travel directly or indirectly from the target node upstream through a path of relationship descriptors to discern a hypothesis or node potentially explanatory of a cause of the specified change in the biological system. Forward logical simulation on the hypothetical perturbation identifies the nodes and relationship descriptors which would potentially be perturbed due to the hypothetical perturbation. The logical simulation also includes forward logical simulations, which travel directly or indirectly from the target node downstream through a path of relationship descriptors to discern a hypothesis or node potentially explanatory of an effect of the specified change on the biological system.

[0015] The invention provides methods for performing a logical simulation between at least two groups of selected nodes. The logical simulation travels through a path of relationship descriptors containing at least one potentially causative node or at least one potential effector node to discern a pathway hypothetically linking the target nodes. The set of these paths, derived in either manner, comprises the set of all possible explanations for perturbations of the target nodes which could hypothetically be caused due to perturbations of the source nodes.

[0016] In various embodiments, the invention includes method steps, applications, and devices wherein the database contains noisy data, erroneous data, or omits nodes representing structures, processes, or networks present in the real biological system.

[0017] In various embodiments, the invention includes method steps, applications, and devices wherein a probability algorithm is applied to plural hypotheses to assess which hypothetical relationship has the highest probability of representing real biology.

[0018] In various embodiments, the invention includes method steps, applications, and devices wherein the biological system is a mammalian biological system. The method further comprises determining the identity of a second hypothetically responsible node upstream of the hypothetically responsible node. Perturbation of the biological entity represented by the second node is predicted to induce a predetermined change in the biological system upon inhibition or stimulation.

[0019] In various embodiments, the invention includes method steps, applications, and devices for conducting an experiment on a biological specimen to determine if the hypothetical changes predicted by logical simulation correspond to the biologically observed change. The invention includes performing an experiment on a biological specimen to attempt to induce the observed change.

[0020] In various embodiments, the invention includes method steps, applications, and devices wherein the biological system being analyzed is a mammalian biological system that is perturbed from stasis. The perturbation is induced by a disease, toxicity, environmental exposure, abnormality, morbidity, aging, or another stimulus.

[0021] In various embodiments, the invention includes method steps, applications, and devices for determining the state of a group of biological entities represented by nodes in the system which are reproducibly associated with the biological state of the mammal and which can act as a marker set characteristic of the biological state.

[0022] In various embodiments, nodes represent enzymes, cofactors, enzyme substrates, enzyme inhibitors, DNAs, RNAs, transcription regulators, DNA activators, DNA repressors, signaling molecules, transmembrane molecules, transport molecules, sequestering molecules, regulatory molecules, hormones, cytokines, chemokines, antibodies, structural molecules, metabolites, vitamins, toxins, nutrients, minerals, agonists, antagonists, ligands, receptors, or combinations thereof. In other embodiments, nodes represent protons, gas molecules, organic molecules, amino acids, peptides, protein domains, proteins, glycoproteins, nucleotides, oligonucleotides, polysaccharides, lipids, glycolipids, or combinations thereof. In further embodiments, nodes comprise cells, tissues, or organs, or drug candidate molecules.

[0023] In various embodiments, biological information represented by nodes and relationship descriptors may include experimental data, knowledge from the literature, patient data, clinical trial data, compliance data, chemical data, medical data, or hypothesized data. In other embodiments, biological information represented may include facts about a molecule, biological structure, physiological condition, trait, phenotype, or biological process.

[0024] In various embodiments, relationship descriptors represent the condition, location, source, amount, or substructure of a molecule. Relationship descriptors also may be used to represent biological structure, physiological condition, trait, phenotype, biological process, clinical data, medical data, or disease data and chemistry. A particular relationship descriptor also corresponds to an epistemological relationship between a pair of nodes. Relationship descriptors are represented as case frames.

[0025] In various embodiments, the database contains at least 1,000 nodes, 5,000 nodes, 10,000 nodes, 50,000 nodes, or 100,000 nodes.

[0026] In various embodiments, the new biological knowledge produced by the method includes predictions of physiological behavior in humans, for example, from analysis of experiments conducted on animals, such as drug efficacy and/or toxicity, or the discovery of biomarkers indicative of the prognosis, diagnosis, drug susceptibility, drug toxicity, severity, or stage of disease.

[0027] The invention provides computing devices for analyzing a biological knowledge base and for discovering new biological knowledge. The computing devices include means for accessing an electronic database of biological assertions comprising a multiplicity of nodes representative of biological elements, relationship descriptors representing relationships between nodes and characterizing the nodes and relationships, and a user interface for specifying biological elements or perturbations which will be analyzed by the device. The devices also include a computer application to perform a logical simulation of a perturbation to a selected biological element or relationship, to analyze the source or effects of the perturbation, and to assess the probability that the simulation generated pathway represents real biology. The invention also provides articles of manufacture having a computer-readable program carrier with computer-readable instructions embodied thereon for performing the methods and systems described above.

[0028] The foregoing and other features and advantages of the present invention, as well as the invention itself, will be more fully understood from the description, drawings, and claims which follow.

BRIEF DESCRIPTION OF THE DRAWINGS

[0029] In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments of the invention are described with reference to the following drawings, in which:

[0030] FIG. 1 is an exemplary causal tree showing inference paths for upstream causes starting with a change in mRNA levels for a particular gene in accordance with an illustrative embodiment of the invention.

[0031] FIG. 2 shows a knowledge assembly graph in accordance with an illustrative embodiment of the invention.

[0032] FIG. 3 shows the merger of two pathways in accordance with an illustrative embodiment of the invention.

[0033] FIG. 4 shows a knowledge graph in accordance with an illustrative embodiment of the invention.

[0034] FIG. 5 shows a knowledge graph in accordance with an illustrative embodiment of the invention.

[0035] FIGS. 6-11 show the iterative steps of generation of a causal tree in accordance with an illustrative embodiment of the invention.

[0036] FIG. 12A shows an explanation diagram in accordance with an illustrative embodiment of the invention.

[0037] FIG. 12B shows a detail of the explanation diagram in FIG. 12A in accordance with an embodiment of the invention.

[0038] FIG. 13 is a diagram showing propagation of predicted changes in a forward simulation being compared with observed expression changes in accordance with an illustrative embodiment of the invention.

[0039] FIG. 14 is a diagram generated by a backward simulation from nine expression data points, followed by pruning of the graph to show only the chains of reasoning which support the primary hypotheses, in accordance with an illustrative embodiment of the invention.

[0040] FIG. 15 shows an illustrative example of a visualization technique in accordance with the present invention that is based on a forward simulation that compares predicted outcomes with actual laboratory data.

[0041] FIG. 16 shows an example of an algorithm for use in validating a biological model by comparing predicted to actual results in accordance with the invention.

DESCRIPTION

[0042] To implement the present invention, a global knowledge base, or central database, is structured to comprise a multiplicity of nodes and relationship descriptors. Nodes represent elements of biological systems, both physical and functional, and include such things, for example, as specific organs, tissues, cells, organelles, cell compartments, membranes, proteins, DNAs, RNAs, small molecules, drugs, and metabolites. The relationship descriptors are data entries representing interrelations between nodes or associating additional information with nodes. Relationship descriptors connecting nodes may be thought of as "verbs" specifying the nature of a relationship between the represented biological entities. These may also be referred to as "case frames". Relationship descriptors may be used to represent additional information about the biological entity represented by a node, including but not limited to, recording the species or organ where a specific protein is found, identifying the journal where some datum was reported, notation of tertiary structural information about a specific protein, notation that some protein is elevated in patients with hypertension, etc. There are significantly more relationship descriptors in the knowledge base than there are nodes. Each node may have a plurality of relationship descriptors defining multiple attributes of that node.

[0043] Nodes may represent, by way of non-limiting examples, biological molecules including proteins, small molecules, ions, genes, ESTs, RNA, DNA, transcription factors, metabolites, ligands, trans-membrane proteins, transport molecules, sequestering molecules, regulatory molecules, hormones, cytokines, chemokines, histones, antibodies, structural molecules, metabolites, vitamins, toxins, nutrients, minerals, agonists, antagonists, ligands, or receptors. The nodes may represent drug substances, drug candidate compounds, antisense molecules, RNA, RNAi, shRNA, dsRNA, or chemogenomic or chemoproteomic probes. Viewed from a chemistry perspective, the nodes may represent protons, gas molecules, small organic molecules, amino acids, peptides, protein domains, proteins, glycoproteins, nucleotides, oligonucleotides, polysaccharides, lipids or glycolipids. Proceeding to higher order models, the nodes may represent protein complexes, protein-nucleotide complexes such as ribosomes, cell compartments, organelles, or membranes. From a structural perspective, they may represent various nanostructures such as filaments, intracellular lipid bilayers, cell membranes, lipid rafts, cell adhesion molecules, tissue barriers and semipermeable membranes, collagen structures, mineralized structures, or connective tissues. At still higher orders, the nodes represent cells, tissues, organs or other anatomical structures. For example, a model of the immune system might include nodes representing immunoglobulins, cytokines, various leucocytes, bone marrow, thymus, lymph nodes, and spleen. In simulating clinical trials the nodes may represent, for example, individuals, their clinical prognosis or presenting symptoms, drugs, drug dosage levels, and clinical end points. In simulating epidemiology, the nodes may represent, for example, individuals, their symptoms, physiological or health characteristics, their exposure to environmental factors, substances they ingest, and disease diagnoses. 
Nodes may also represent ions, physiological processes, diseases, disease processes, translocations, reactions, molecular complexes, cellular components, cells, anatomical parts, tissues, cell lines, and protein domains.

Relationship Descriptors

[0044] Relationship descriptors represent biological relationships between biological entities represented by nodes and contain each fact within a knowledge base. Relationship descriptors represent facts relating existing objects in a system, or a fact about one object in the system and some literal value, or any combination thereof. In various embodiments, relationship descriptors may represent knowledge such as RNA, proteomic, metabolite, or clinical knowledge from sources such as scientific publications, patient data, clinical trial data, compliance data, chemical data, medical data, hypothesized data, or data from biological databases.

[0045] Relationship descriptors may represent biological relationships between biological entities represented by nodes and include, but are not limited to, non-covalent binding, adherence, covalent modification, multi-molecular interactions (complexes), cleavage of a covalent bond, conversion, transport, change in state, catalysis, activation, stimulation, agonism, antagonism, up regulation, repression, inhibition, down regulation, expression, post-transcriptional modification, post-translational modification, internalization, degradation, control, regulation, chemo-attraction, phosphorylation, acetylation, dephosphorylation, deacetylation, transportation, and transformation.

[0046] One aspect of a relationship descriptor is its attribution. Each relationship descriptor may have a multiplicity of attributions, characterizing multiple properties of the node or relationship. An attribution represents the source of the relationship, such as a scientific article, an abstract (e.g., Medline or PubMed), a book chapter, conference proceedings, a personal communication, or an internal memorandum. Another attribution of a relationship descriptor is its biological context. Relationship descriptors associated with a specific biological context may be selected. Biological context refers to, for example, species, tissue, body part, cell line, tumor, disease, sample, virus, organism, developmental stage, or any combination of the above. A further attribute of a relationship descriptor is its trust score, a measure of the level of confidence that the relationship descriptor reflects truly representative, real biology and is reproducible. Relationship descriptors can also be selected on the basis of a trust score. A minimum threshold is set and any relationships meeting or exceeding the threshold are selected. Additionally, seemingly identical relationship descriptors, containing the same nodes and relationship, may have different attributes, such as source, biological context, or certainty value, distinguishing the seemingly identical relationship descriptors.
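
By way of non-limiting illustration only, selection of relationship descriptors by trust score and biological context might be sketched as below. The Descriptor record, its field names, the 0-to-1 trust scale, and the sample entries (including the source labels) are assumptions invented for this example and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    source: str       # upstream node
    relation: str     # the "verb", e.g. "phosphorylates"
    target: str       # downstream node
    attribution: str  # where the assertion came from (placeholder labels here)
    context: str      # biological context, e.g. a species
    trust: float      # hypothetical trust score between 0 and 1


def select_descriptors(descriptors, min_trust=0.0, context=None):
    """Keep descriptors meeting or exceeding min_trust, optionally in one context."""
    return [
        d for d in descriptors
        if d.trust >= min_trust and (context is None or d.context == context)
    ]


# Two seemingly identical assertions distinguished only by their attributes.
KB = [
    Descriptor("KinaseA", "phosphorylates", "ProteinB", "example article 1", "human", 0.9),
    Descriptor("KinaseA", "phosphorylates", "ProteinB", "example memo 1", "mouse", 0.4),
    Descriptor("ProteinB", "inhibits", "GeneC", "example article 2", "human", 0.6),
]

trusted_human = select_descriptors(KB, min_trust=0.6, context="human")
```

The threshold comparison uses "greater than or equal", matching the statement that relationships meeting or exceeding the threshold are selected.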

[0047] Subsets of a knowledge base can also be made using specifications that define a complex pattern of relationships between nodes. All the sets of nodes and relationship descriptors which meet the criteria of the pattern embody the subset. In one embodiment, a search algorithm can filter the knowledge base to generate a list of biological entities that satisfy the stated pattern. For example, a structure search can be used to generate the subset of all reactions that have a product which is phosphorylated and whose catalyst is a molecular complex. This search will find all phosphorylation reactions that are catalyzed by a molecular complex, while avoiding phosphorylation reactions that are catalyzed by a single protein.
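
By way of non-limiting illustration only, the structure search in the example above might be sketched as below. The dictionary layout for reactions, the node-type table, and every identifier are hypothetical choices made for this example; the disclosure does not specify a storage format.

```python
# Hypothetical node typing and annotation tables.
NODE_TYPE = {
    "ComplexX": "molecular_complex",
    "KinaseA": "protein",
    "ProteinB-P": "protein",
    "ProteinC-P": "protein",
    "ProteinD": "protein",
}
PHOSPHORYLATED = {"ProteinB-P", "ProteinC-P"}

# Each reaction node is linked to one catalyst and one or more products.
REACTIONS = [
    {"id": "r1", "catalyst": "ComplexX", "products": ["ProteinB-P"]},
    {"id": "r2", "catalyst": "KinaseA", "products": ["ProteinC-P"]},
    {"id": "r3", "catalyst": "ComplexX", "products": ["ProteinD"]},
]


def matches_pattern(reaction):
    """True when a product is phosphorylated AND the catalyst is a molecular complex."""
    has_phospho_product = any(p in PHOSPHORYLATED for p in reaction["products"])
    catalyst_is_complex = NODE_TYPE.get(reaction["catalyst"]) == "molecular_complex"
    return has_phospho_product and catalyst_is_complex


subset = [r["id"] for r in REACTIONS if matches_pattern(r)]
```

Here r2 is excluded because its catalyst is a single protein, and r3 is excluded because its product is not phosphorylated.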

[0048] A preferred form of relationship descriptors for use in the invention are case frames extracted from the representation structure which permit instantiation and generalization of the models to a variety of different life science systems or other systems. Case frames are described in detail in co-pending, co-owned U.S. patent application Ser. No. 10/644,582, the disclosure of which is incorporated by reference herein. Relationship descriptors may comprise quantitative functions such as differential equations representing possible quantitative relationships between pairs of nodes which may be used to refine the network further. Relationship descriptors may also comprise qualitative features that cannot be measured or described easily in an analytical or quantitative manner, or that cannot be described otherwise because of insufficient knowledge of the system in general or of the feature itself.

[0049] A knowledge base represents a hypothesis explaining the operation of systems, i.e., capable of producing, upon simulation, predicted data that matches the actual data that serves as the fitness criteria. The hypothesis can be tested with further experiments conducted, combined with other models or networks, refined, verified, reproduced, modified, perfected, corrected, or expanded with new nodes and new relationships based on manual or computer aided analysis of new data, and used productively as a biological knowledge base. Models of portions of a physiological pathway, or sub-networks in a cell compartment, cell, organism, population, or ecology may be combined into a consolidated model by connecting one or more nodes in one model to one or more nodes in another.

Pathfinding

[0050] Pathfinding algorithms include radial, shortest path, and all paths pathfinding. Radial pathfinding is useful to discover how one biological entity is functionally or structurally connected to other biological entities. For example, if a given cell contains a mutant form of P53, one may want to discover its effect on molecules upstream or downstream from the mutant gene product. An algorithm for discovering this information can start from a particular node and find all nodes that are connected to the node within a predetermined number of steps. If directionality is important (e.g., as in reactions), the algorithm can be instructed to follow links only in the direction indicated by the pathfinding criteria. Radial pathfinding can be applied in several steps. For example, a two-step radial pathfinding search will involve starting from a node, finding its immediate connected nodes, and then finding the immediate connected nodes of those nodes. This process can be applied to as many steps as needed. This analysis may be used to determine and predict the expected effects of perturbing a given node. This analysis may be displayed to the user to elucidate how a change might propagate through the knowledge base, and thereby to discover its real effect on a biological system. An example of a causal tree as applied to RNA changes is shown in FIG. 1.
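
By way of non-limiting illustration only, a radial pathfinding search of a predetermined number of steps might be sketched as a breadth-first traversal, as below. The function name radial_search, the edge-list representation, the directed flag, and the node names other than P53 are assumptions made for this example.

```python
from collections import deque


def radial_search(edges, start, steps, directed=True):
    """Return {node: number_of_steps} for nodes within `steps` links of `start`."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        if not directed:
            # Ignore directionality by also following each link backwards.
            adjacency.setdefault(dst, set()).add(src)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == steps:
            continue  # step limit reached; do not expand further
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical links around P53; A, B, C and X are placeholder nodes.
EDGES = [("P53", "A"), ("A", "B"), ("B", "C"), ("X", "P53")]

downstream_two_step = radial_search(EDGES, "P53", steps=2)
either_direction = radial_search(EDGES, "P53", steps=2, directed=False)
```

With directionality enforced, the two-step search reaches A and B only; with directionality ignored, the upstream node X is also found.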

[0051] FIG. 2 shows an example of the progression of a two-step radial pathfinding search starting from a specified start node 300. In the first step of the search, connected nodes 310 are found. In the second step of the search, connected nodes 320 are found. The result of this radial pathfinding search is the combination of all nodes and assertions as shown in FIG. 2. A pathfinding search optionally can be configured to follow only specific descriptors, to ignore certain nodes that may be ubiquitous or uninformative, or to stop finding new nodes when certain nodes are encountered.
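
By way of illustration only, the radial search of paragraphs [0050] and [0051] may be sketched in Python as a breadth-first expansion limited to a fixed number of steps. The triple-based assertion layout, the function name radial_search, and the parameters allowed_descriptors, ignore_nodes, and stop_nodes are assumptions introduced for this sketch and are not taken from the disclosed system.

```python
from collections import defaultdict

def radial_search(assertions, start, steps, directed=False,
                  allowed_descriptors=None, ignore_nodes=(), stop_nodes=()):
    """Return (nodes, assertions) found within `steps` of `start`.

    `assertions` is a list of (source, descriptor, target) triples; this
    layout is an assumption made for illustration.
    """
    neighbors = defaultdict(list)
    for src, desc, dst in assertions:
        if allowed_descriptors is not None and desc not in allowed_descriptors:
            continue  # follow only the requested descriptors
        neighbors[src].append((dst, (src, desc, dst)))
        if not directed:
            neighbors[dst].append((src, (src, desc, dst)))

    found_nodes = {start}
    found_assertions = set()
    frontier = {start}
    for _ in range(steps):
        next_frontier = set()
        for node in frontier:
            if node in stop_nodes and node != start:
                continue  # keep the stop node but do not expand past it
            for other, assertion in neighbors[node]:
                if other in ignore_nodes:
                    continue  # skip ubiquitous or uninformative nodes
                found_assertions.add(assertion)
                if other not in found_nodes:
                    found_nodes.add(other)
                    next_frontier.add(other)
        frontier = next_frontier
    return found_nodes, found_assertions
```

With directed=True the sketch follows assertions from source to target only, which corresponds to a downstream search; an upstream search would reverse each triple before calling the function.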

[0052] In large biological networks, there usually are multiple paths between any two entities. In a given analysis, it may be useful to determine the shortest path between two nodes, or to find all paths between two nodes. An algorithm for determining the shortest path in a network starts by performing a breadth-first radial pathfinding from each of the two nodes between which the shortest path is sought. Once a common node is found, the path is published as the shortest path between the nodes. To find all pathways, the algorithm can continue to pathfind radially from each node, identifying additional common nodes. In order to determine the pathways among several nodes, the algorithm discussed above can be run until all pathways between each pair of nodes are found. In this technique, one starts a radial pathfinding search from each one of the start nodes. Then, the paths being followed are recorded in every radial search. The union of all paths from the start nodes to the target nodes is the result of this algorithm. Because the number of pathways and nodes explored by this approach tends to grow exponentially with the number of steps, the algorithm may be limited to follow a pre-designated number of steps. For example, a three-step search will only generate all pathways that exist between the given origin nodes by doing a three-step radial search out from each node. The results of this pathway algorithm can be displayed, for example as a sorted list of pathways starting from the shortest or longest, or as a merged graph.
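
The following Python sketch illustrates, under stated assumptions, the two searches of paragraph [0052]. The function shortest_path expands breadth-first from both end nodes, one full level at a time, and joins the two half-paths at the first common node. The function all_paths enumerates every simple path up to a step limit; it uses a depth-first enumeration rather than paired radial searches, a substitution made here for brevity. The undirected edge-pair layout and all function names are hypothetical.

```python
from collections import defaultdict

def _adjacency(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def shortest_path(edges, a, b):
    """Shortest undirected path between a and b, or None if unconnected."""
    if a == b:
        return [a]
    adj = _adjacency(edges)
    # The parent maps double as the visited set of each side of the search.
    parents = {a: {a: None}, b: {b: None}}
    frontiers = {a: [a], b: [b]}
    side, other = a, b
    while frontiers[a] and frontiers[b]:
        next_frontier = []
        for node in frontiers[side]:
            for nxt in sorted(adj[node]):
                if nxt in parents[side]:
                    continue
                parents[side][nxt] = node
                if nxt in parents[other]:  # common node found
                    return _join(parents[a], parents[b], nxt)
                next_frontier.append(nxt)
        frontiers[side] = next_frontier
        side, other = other, side  # alternate between the two radial searches
    return None

def _join(parents_a, parents_b, meet):
    """Stitch the two half-paths together at the common node."""
    path, node = [], meet
    while node is not None:
        path.append(node)
        node = parents_a[node]
    path.reverse()
    node = parents_b[meet]
    while node is not None:
        path.append(node)
        node = parents_b[node]
    return path

def all_paths(edges, a, b, max_steps):
    """All simple paths from a to b with at most max_steps links, shortest first."""
    adj = _adjacency(edges)
    results = []

    def extend(path):
        node = path[-1]
        if node == b:
            results.append(list(path))
            return
        if len(path) - 1 == max_steps:
            return
        for nxt in sorted(adj[node]):
            if nxt not in path:
                path.append(nxt)
                extend(path)
                path.pop()

    extend([a])
    return sorted(results, key=len)
```

Limiting max_steps bounds the enumeration in the manner of the pre-designated step limit described above.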

[0053] A merged graph is generated by merging together all of the pathways traversed up to a specific length in the case of a radial search or by merging the set of pathways that link any of the source nodes to any of the target nodes. This is accomplished by merging two pathways at a time, until only a single graph containing all nodes and assertions emerges. An example of merging two pathways involves taking all common nodes and assertions and merging them into a combined pathway as shown in FIG. 3. In this diagram, since nodes A, B, and D are shared between pathway 410 and pathway 420, these nodes are represented only once in the combined pathway 430. Node C occurs only in pathway 410 and node E occurs only in pathway 420, and they are also represented in the combined pathway 430. FIG. 4 shows the result of merging all pathways into a single graph based on a radial pathway search between a start node "FXR" (in the upper left-hand corner of the diagram) and a target node "LDL" (in the lower right-hand corner of the diagram). This type of analysis permits study of the implications of observed changes in gene expression studies or changes in concentrations of proteins and metabolites. The analysis is used to show how the changed entities relate to one another so one can discern the dependent changes and find changes that are central to the experiment at hand.
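
As a minimal sketch of the pairwise merging described in paragraph [0053], each pathway may be treated as a set of nodes and a set of assertions, so that merging two pathways reduces to a set union in which shared elements appear only once. The triple layout, the function names, and the two example pathways in the usage note are hypothetical; the topology of the pathways in FIG. 3 is not assumed.

```python
from functools import reduce

def pathway_to_graph(pathway):
    """Convert a list of (source, descriptor, target) triples to (nodes, assertions)."""
    nodes = set()
    for src, _descriptor, dst in pathway:
        nodes.update((src, dst))
    return nodes, set(pathway)

def merge_two(graph_a, graph_b):
    """Merge two graphs; shared nodes and assertions are kept only once."""
    nodes_a, assertions_a = graph_a
    nodes_b, assertions_b = graph_b
    return nodes_a | nodes_b, assertions_a | assertions_b

def merge_pathways(pathways):
    """Merge pathways two at a time until a single graph remains."""
    graphs = [pathway_to_graph(p) for p in pathways]
    return reduce(merge_two, graphs, (set(), set()))
```

For example, merging a hypothetical pathway A, B, C, D with a hypothetical pathway A, B, E, D yields a graph with five nodes and five assertions, because the shared assertion from A to B is stored once.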

[0054] The matrix method is another way of studying the changes in a knowledge graph. Given a list of nodes of interest (e.g., statistically significant, highly modulated RNA in an experiment) the nodes are placed in a matrix with each node placed as an entry in a column and a row. The shortest path is then generated for every pair of nodes (redundant pairings are ignored). All the generated pathways are then merged as explained above. The matrix method can also be applied by not only finding one path for each cell in the matrix, but by generating multiple pathways. This can be done in several ways: (1) generating all pathways for each pair; (2) generating the top "n" pathways starting with the shortest or longest; and (3) generating all the top "n" pathways that are no more than some pre-determined number of steps long. The matrix method also is useful in determining how a set of biological entities are related to one another. FIG. 5 shows the result of a matrix method analysis among three nodes, "Acox1", "LDL" and "FXR" after merging all of the shortest paths between each pair of nodes.
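
The basic form of the matrix method, one shortest path per pair of nodes followed by a merge, may be sketched as follows. This is an illustration under assumptions: the network is treated as undirected, the function names bfs_path and matrix_method are hypothetical, and itertools.combinations is used so that redundant pairings are never generated.

```python
from collections import defaultdict, deque
from itertools import combinations

def bfs_path(adj, a, b):
    """Shortest path from a to b by breadth-first search, or None."""
    parent = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adj[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def matrix_method(edges, nodes_of_interest):
    """Merge the shortest paths between every unordered pair of nodes of interest."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    merged_nodes, merged_edges = set(), set()
    # combinations() yields each pair once, i.e. one triangle of the matrix.
    for a, b in combinations(sorted(nodes_of_interest), 2):
        path = bfs_path(adj, a, b)
        if path is None:
            continue
        merged_nodes.update(path)
        merged_edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return merged_nodes, merged_edges
```

The variants that generate several pathways per pair could replace bfs_path with a bounded enumeration of paths.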

Logical Simulation

[0055] Logical simulation may also be utilized in accordance with the invention. Logical simulation refers to a class of operations conducted on a knowledge base wherein observed or hypothetical changes are applied to one or more nodes in the knowledge base and the implications of those changes are propagated through the network based on the causal relationships expressed as assertions in the knowledge base.

[0056] A logical simulation can either be forward, where the effects of changes are inferred and are propagated downstream from the initial points of change, or it can be backward where the possible causes are inferred and are propagated upstream from the initial points of change. In either case, one result of a logical simulation is a new, derived network, comprised of the nodes and assertions that were involved in the propagation of cause or effect. This derived network embodies a hypothesis about the system being studied.

[0057] Logical simulation includes backward logical simulations, which proceed from a selected node by traversing relationship descriptors which express causal relationships between biological elements. The simulation is "backwards" when relationship descriptors are traversed such that the simulation moves from a selected node to nodes which, if perturbed, could cause perturbation in the selected node. As the backward simulation progresses it identifies the set of nodes and relationships which, if perturbed, may hypothetically be responsible for the experimentally observed changes in the biological system. In short, this computation answers the question "What could have caused the observed change?"

[0058] Logical simulation also includes forward logical simulations, which also proceed from a selected node by traversing relationship descriptors which express causal relationships between biological elements. The simulation is "forwards" when relationship descriptors are traversed such that the simulation moves from a selected node to nodes which could be perturbed if the selected node is perturbed. As the forward simulation progresses it identifies the set of nodes and relationship descriptors which may hypothetically be perturbed by the perturbation of the selected node. In short, this computation answers the question, "What are the possible effects of this change?" The sets of nodes and relationship descriptors derived by these two methods comprise connected graphs which may be also described as sets of "causal paths," chains of causal relationships connecting nodes. Each unique causal path identified by either forward logical simulation or backward logical simulation may be considered a hypothesis, a hypothesis that perturbations in the first node in the path may cause perturbations in the last node in the path via perturbations in the intervening nodes.
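
By way of a hedged illustration, the backward and forward simulations of paragraphs [0057] and [0058] may be sketched as one traversal that differs only in the direction in which causal assertions are followed. Each returned causal path is stored in cause-to-effect order, so that it reads as the hypothesis that a perturbation of its first node may perturb its last node. The (cause, effect) pair layout and the name causal_paths are assumptions of this sketch.

```python
from collections import defaultdict

def causal_paths(causal_assertions, selected, direction, max_steps):
    """Enumerate causal paths reachable from `selected` within max_steps.

    direction="backward" moves from a node to its possible causes;
    direction="forward" moves from a node to its possible effects.
    """
    if direction not in ("backward", "forward"):
        raise ValueError("direction must be 'backward' or 'forward'")
    step = defaultdict(list)
    for cause, effect in causal_assertions:
        if direction == "backward":
            step[effect].append(cause)
        else:
            step[cause].append(effect)

    paths = []

    def walk(path):
        if len(path) - 1 == max_steps:
            return
        for nxt in step[path[-1]]:
            if nxt in path:
                continue  # do not cycle through feedback loops
            extended = path + [nxt]
            # Store every path in cause -> effect order.
            paths.append(extended[::-1] if direction == "backward" else extended)
            walk(extended)

    walk([selected])
    return paths
```

With the generic assertions kinase to TF, TF to mRNA, and mRNA to protein, a two-step backward simulation from mRNA returns the paths TF, mRNA and kinase, TF, mRNA, while a forward simulation from TF returns TF, mRNA and TF, mRNA, protein.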

[0059] The invention provides methods for performing a logical simulation on a hypothetical perturbation. One or more nodes may be selected and specified as perturbed, regardless of whether they are observed in an actual experiment. Backward logical simulation on the hypothetical perturbation identifies the nodes and relationship descriptors which, if perturbed, would potentially explain the hypothetical perturbation. Forward logical simulation on the hypothetical perturbation identifies the nodes and relationship descriptors which would potentially be perturbed due to the hypothetical perturbation.

[0060] The invention provides methods for performing a logical simulation between at least two groups of selected nodes. One group is designated "source nodes" and the other may be designated "target nodes." Backward simulation may be performed starting from the target nodes, finding all causal paths which connect from the source nodes to the target nodes. Alternatively, forward simulation may be performed starting from the source nodes, finding all causal paths which connect from the source nodes to the target nodes. The set of these paths, derived in either manner, comprises the set of all possible explanations for perturbations of the target nodes which could hypothetically be caused due to perturbations of the source nodes.

[0061] Referring again to FIG. 1, for example, in the case of a backward simulation based on observed changes in RNA expression levels, FIG. 1 shows paths of inference to find upstream causes starting with an observed change in mRNA levels for a particular gene. One specific chain of causation could be as follows: a phosphorylation of a transcription factor by a kinase such that the kinase changes the activity of the transcription factor can in turn induce changes in the expression of genes controlled by that transcription factor. This diagram provides a "pseudo code" description of the inferences that are then performed to find possible causes of each of the observed RNA changes. The types of assertions to be explored are not limited to those in this diagram. Any assertion in the knowledge base that represents a causal biological linkage may be included in this type of analysis. In turn, each of the possible causes may then be explored to find their respective possible causes. The process may be repeated for as many steps as desired, annotating nodes in the knowledge base or assembly according to their possible role in the causation of the observed changes.

[0062] The resulting derived network embodies a hypothesis about the possible causes of the observed data. Moreover, depending on the methods of propagation of causality, it may further be considered a hypothesis about the most implicated and most consistent possible causes of the observed data, i.e., a set of possible causes ranked by objective criteria. This technique is not limited to RNA expression data, but rather may work with any set of changes that can be expressed in the representation system, including but not limited to proteomic data, metabolomic data, post-translational modification data, or even reaction rate data.

[0063] Logical simulation using the invention may also be applied to analyze the possible pathways between a single source node and a single target node; furthermore, the invention may be applied to multiple source nodes and/or multiple target nodes in a single logical simulation. A logical simulation may be applied to specified class(es) of nodes or of relationship descriptors. The logical simulation may also specify the exclusion of specified node(s), relationship descriptor(s), or classes of nodes or descriptors. For example, the search may specify that the logical simulation only traverse relationship descriptors relating to genes. Alternatively, the search may be limited to relationship descriptors relating to proteins, the expression of proteins, or the translation of an mRNA to a protein.

[0064] The invention, in another aspect, includes a search that further limits the available nodes or relationship descriptors for a particular logical simulation. For example, according to one embodiment, the search may exclude specific nodes, requiring the logical simulation to connect the source and target node in a path that does not include the specified nodes. Alternatively, according to another embodiment, the search may require specific nodes, requiring the logical simulation to connect the source and target node in a path that contains the specified nodes. The search may also take the form of a negative test, requesting the logical simulation to find a pathway (or the absence of a pathway) beginning at a source node that does not connect to the target node in a specified number of steps. The search may also request the shortest path between the source and target nodes. Further, the logical simulation may be limited to a single direction, only allowing traversal of relationship descriptors in a single specified direction, either upstream or downstream, from the selected starting node.
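
The search constraints of paragraph [0064] may be illustrated with the following sketch, which enumerates downstream causal paths from a source node to a target node while excluding some nodes and requiring others. The function names constrained_paths and is_unconnected, the (cause, effect) pair layout, and the parameter names are hypothetical choices made for this example.

```python
from collections import defaultdict

def constrained_paths(causal_assertions, source, target, max_steps,
                      exclude=(), require=()):
    """Causal paths from source to target, shortest first.

    Paths never pass through a node in `exclude` and must contain
    every node in `require`.
    """
    downstream = defaultdict(list)
    for cause, effect in causal_assertions:
        downstream[cause].append(effect)
    exclude, require = set(exclude), set(require)
    results = []

    def walk(path):
        node = path[-1]
        if node == target:
            if require <= set(path):
                results.append(path)
            return
        if len(path) - 1 == max_steps:
            return
        for nxt in downstream[node]:
            if nxt in exclude or nxt in path:
                continue
            walk(path + [nxt])

    walk([source])
    return sorted(results, key=len)

def is_unconnected(causal_assertions, source, target, max_steps):
    """Negative test: True when no causal path reaches target within max_steps."""
    return not constrained_paths(causal_assertions, source, target, max_steps)
```

Because the results are sorted by length, the first element, when present, is a shortest path between the source and target nodes. Traversal here is downstream only; an upstream-only search would index the assertions by effect instead of by cause.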

[0065] The goal of the present invention is to find a cause or source of changes induced by a perturbation to a biological system. Perturbations are induced, for example, by a disease, toxicity, environmental exposure, abnormality, morbidity, aging, or other stimulus. The invention is based on the premise that perturbations to a biological system can be analyzed for the effects they may cause downstream in a biological system or for the cause of the perturbation upstream within the biological system. Based on existing or future knowledge about biological elements and the relationships between elements, perturbations to nodes or relationships are traversed within the knowledge base to discern their causes and effects. Changes in the amount, character, or quality of a biological element such as a molecule of RNA, a protein, a metabolite or another known element, can be evaluated to determine their possible cause. For example, if the level of a particular molecule of RNA is observed to increase in a particular diseased tissue versus the same healthy tissue, the RNA molecule can be evaluated to discern factors that have the potential to control the level of the RNA molecule.

[0066] The process of traversing backwards through relationships in the knowledge base can be applied iteratively from each subsequent change to yield a tree-like structure of interactions and possible causes. Metrics are applied to the results to find areas of commonality among the observations. These common areas are highlighted as areas of control for the network and can be further evaluated by biological experimentation to determine if the hypothesized cause or effect is observed in an actual biological system.
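
One simple commonality metric, offered only as an assumption since the text does not specify the metrics used, is the number of observed changes that each upstream node could explain. The sketch below walks backward from every observed node for a fixed number of steps and ranks candidate causes by that count. The name rank_common_causes and the (cause, effect) pair layout are hypothetical.

```python
from collections import defaultdict

def rank_common_causes(causal_assertions, observed, max_steps):
    """Rank upstream nodes by how many observed nodes they could explain."""
    upstream = defaultdict(set)
    for cause, effect in causal_assertions:
        upstream[effect].add(cause)

    explained = defaultdict(set)  # candidate cause -> observations it reaches
    for obs in observed:
        frontier, seen = {obs}, {obs}
        for _ in range(max_steps):
            # One backward step: collect the causes of the current frontier.
            frontier = {c for node in frontier for c in upstream[node]} - seen
            seen |= frontier
        for cause in seen - {obs}:
            explained[cause].add(obs)

    # Most widely shared causes first; ties are broken alphabetically.
    return sorted(explained.items(), key=lambda item: (-len(item[1]), item[0]))
```

Nodes at the top of the ranking correspond to the areas of commonality that could then be evaluated further by experiment.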

[0067] The relationships traversed during an evaluation of a biological perturbation represent facts relating existing objects in a system, or a fact about one object in the system and some literal value, or any combination thereof. In various embodiments, relationship descriptors may represent knowledge such as the effect of an increase or decrease in the level of RNA, protein, or a metabolite. The level of an RNA molecule, for example, may increase or decrease because the transcription factor that controls the RNA is either up or down, either activated or deactivated, or is not being degraded by some other molecule. Alternatively, the RNA level may change because it is either being degraded more or degraded less, or because it is being transported in or out of the system at a different rate.

[0068] The level of a protein, for example, may increase or decrease because the RNA that codes for the protein is up or down, its promoter is up or down, the protein is being degraded either faster or slower, it is being transported differently, it is complexing with something else that is either up or down, it is unable to complex with what it usually complexes, or it is being or not being phosphorylated or acylated as usual.

[0069] The level of a metabolite, for example, may increase or decrease because the biochemical reaction that makes the metabolite could be altered, the enzyme could be upregulated or downregulated or activated or deactivated, the substrates could be up or down, the environmental conditions for the reaction may be up or down, or the transport and/or secretion of the metabolite may be up or down. The foregoing examples are meant to be illustrative and are not a complete recitation of the possible relationships described in the global knowledge base.

[0070] To analyze changes in a biological network induced by a perturbation, relationship descriptors (describing relationships such as those discussed above) are traversed to generate a list of possible causes. The relationship descriptors for each identified cause are then traversed and the process repeated a specified number of steps until a web of relationships is developed. For example, if the level of an RNA molecule is increased in a biological system, one possible cause is that its transcription factor is increased. If the transcription factor is a protein, the relationship descriptors regarding proteins can be traversed to analyze possible biological elements or relationships causing the protein level to increase. The relationships can additionally be traversed for a desired number of subsequent steps to generate a causal tree.

[0071] In one embodiment of the invention, a biological relationship in a biological system is hypothesized using a software implemented method. A knowledge base, such as a database of biological knowledge, is provided comprising a multiplicity of nodes representative of biological elements and relationship descriptors, which represent relationships between specific nodes or properties of the nodes. A relationship descriptor may describe, for example, the condition, source, amount, or substructure of a molecule. It may also describe, for example, an aspect of a biological structure, physiological condition, trait, phenotype, biological process, clinical data, medical data, or disease data and chemistry. A relationship descriptor may correspond to an epistemological relationship between a pair of nodes. A relationship descriptor may also comprise a case frame, as described above.

[0072] According to the invention, a target node is selected from the knowledge base for investigation. In one embodiment, the target node is known from experimental observation to correspond to a biological element and the biological element is known to increase in number or concentration, decrease in number or concentration, appear within, or disappear from a real biological system when that system is subjected to a perturbation.

[0073] Starting at the selected target node, a logical simulation is performed backward within the knowledge base from the target node, through a path of relationship descriptors describing potentially causative nodes to discern a source node hypothetically responsible for the experimentally observed change in the selected target node.

[0074] In one embodiment of the invention, the hypothetically responsible source node is then selected for simulated perturbation. Starting at the selected hypothetically responsible source node, a logical simulation is performed forward within the knowledge base from the hypothetically responsible source node, through a path of relationship descriptors describing potentially affected nodes to discern the extent to which the simulated perturbation of the hypothetically responsible source node generates the experimentally observed change in the target node.
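
The forward check of paragraph [0074] may be sketched, under simplifying assumptions, as the propagation of a signed change from the hypothesized source node. Each assertion is taken to be a (cause, sign, effect) triple in which sign is +1 for a causal increase and -1 for a causal decrease. A node reached with conflicting signs in the same step is marked 0 (ambiguous) and is not propagated further. This conflict rule, the triple layout, and the name forward_predict are assumptions of this sketch rather than features of the disclosed method.

```python
from collections import defaultdict

def forward_predict(signed_assertions, source, change, max_steps):
    """Propagate a perturbation downstream; returns node -> +1, -1, or 0."""
    downstream = defaultdict(list)
    for cause, sign, effect in signed_assertions:
        downstream[cause].append((sign, effect))

    predicted = {source: change}
    frontier = [source]
    for _ in range(max_steps):
        proposals = defaultdict(set)
        for node in frontier:
            if predicted[node] == 0:
                continue  # ambiguous nodes are not propagated
            for sign, effect in downstream[node]:
                if effect not in predicted:  # first prediction for a node wins
                    proposals[effect].add(predicted[node] * sign)
        for effect, signs in proposals.items():
            # A single agreed sign is kept; conflicting signs become 0.
            predicted[effect] = signs.pop() if len(signs) == 1 else 0
        frontier = list(proposals)
    return predicted

def explains(predicted, target, observed_change):
    """True when the simulated perturbation reproduces the observed change."""
    return predicted.get(target) == observed_change
```

A source node identified by backward simulation can then be tested by calling forward_predict from it and checking explains for the target node.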

[0075] In another embodiment of the invention, a perturbation to a selected target node is specified. The perturbation may be known or may not be known from experimental observation to correspond to a biological element or to correspond to a perturbation of that biological element. The perturbation comprises an effective increase in concentration or number, stimulation of activity, an effective decrease in concentration or number, inhibition of activity, or the appearance or disappearance of the target node.

[0076] According to this embodiment, a logical simulation is performed within the knowledge base from the selected target node, through the relationship descriptors, upstream along a path defined by nodes which affect the state of the target node directly or indirectly to discern a hypothesis potentially explanatory of a cause of the specified change in the target node within the system. In another embodiment of this invention, the logical simulation is performed within the knowledge base from the selected target node, through the relationship descriptors, upstream along a path defined by nodes which affect the state of the target node directly or indirectly to discern a node hypothetically responsible for the specified change in the target node.

[0077] Alternatively, according to this embodiment, a logical simulation is performed within the knowledge base from the selected target node, through the relationship descriptors, downstream along a path defined by nodes affected by the target node directly or indirectly to discern a hypothesis potentially explanatory of an effect of the specified change in the target node within the system. In another embodiment of this invention, the logical simulation is performed within the knowledge base from the selected target node, through the relationship descriptors, downstream along a path defined by nodes affected by the target node directly or indirectly to discern a node hypothetically affected by the specified change in the target node.

[0078] In one embodiment of the invention, the hypothetically responsible source node is then selected for simulated perturbation. Starting at the selected hypothetically responsible source node, a logical simulation is performed backward within the knowledge base from the hypothetical source node, through a path of relationship descriptors describing potentially affecting nodes to discern the extent to which the simulated perturbation causes the specified change in the target node.

[0079] In another embodiment of the invention, the target node is then selected for simulated perturbation. Starting at the target node, a logical simulation is performed forward within the knowledge base from the target node, through a path of relationship descriptors describing potentially affected nodes to discern the extent to which the simulated perturbation generates the hypothesized effects of the specified change in the target node.

[0080] According to one embodiment of the invention, at least a pair of target nodes are selected from the knowledge base. A logical simulation is performed within the knowledge base between the target nodes, through relationship descriptors along a path defined by at least one potentially causative node or at least one potential effector node to discern one or a group of pathways hypothetically linking the selected target nodes. A logical simulation may be performed within the database upstream, along a path defined by nodes affecting the state of the target node directly or indirectly to discern hypotheses starting at the target node, each of which is potentially explanatory of a cause of the specified change in the target node within the system or hypothetically responsible for the specified change in the target node. A logical simulation may also be performed within the database downstream, along a path defined by nodes affected by the target node directly or indirectly to discern hypotheses potentially explanatory of an effect of a specified change in the target node within the system or hypothetically affected by the specified change in the target node. The result of the logical simulation may be the determination of a group of pathways hypothetically linking the two or more target nodes by chains of causal mechanism.

[0081] In another embodiment, one node of the pair of nodes is then selected for perturbation. Starting at the virtually perturbed node, a logical simulation is then performed forward within the knowledge base, through a path of relationship descriptors describing potentially affected nodes to discern the extent to which the simulated perturbation generates a predicted effect on the other node of the pair of nodes.

[0082] In various embodiments, the invention includes method steps, applications, and devices wherein the database contains noisy data, erroneous data, or omits nodes representing structures, processes, or networks present in the real biological system. A knowledge base may be augmented by insertion of new nodes and relationship descriptors derived from the knowledge base or may be filtered by excluding subsets of data based on other biological criteria. The granularity of the system may be increased or decreased as suits the analysis at hand (which is critical to the ability to make valid extrapolations between species or generalizations within a species as data sets differ in their granularity). A knowledge base may be made more compact and relevant by summarizing detailed knowledge into more conclusory assertions better suited for examination by data analysis algorithms, or better suited for use with generic analysis tools, such as cluster analysis tools.

[0083] A knowledge base may be updated periodically as knowledge advances, and the respective evolving knowledge base can be saved to show the progression of knowledge in the area. A knowledge base may be augmented in various ways, including having a curator add new data from a structured or unstructured database or add data derived from literature. A knowledge base also may be incorporated back into a global repository so that new assertions may be used as raw material for creation of a different assembly.

[0084] In various embodiments, the invention includes method steps, applications, and devices wherein a probability algorithm is applied to plural hypotheses to assess which hypothetical relationship has the highest probability of representing real biology. In one embodiment, the probability algorithm evaluates a hypothesis and assigns that hypothesis a score based on the number of predicted, number of observed, and number of contrary outcomes. The algorithm calculates a concurrence of events (the number of correct/incorrect outcomes compared to chance) and a measure of richness (statistical significance compared to random) for each measurement. The probability is the product of the concurrence and richness scores. The probability scores for two hypotheses are compared to determine which hypothetical relationship has the highest probability of representing real biology. The pathway outcomes for two related factors may also be compared to determine which nodes and relationship pathways are unique to each factor and where their paths overlap. Additionally, multiple pathway outcomes may be ranked in order of the probability that they represent real biology based on a parameter of consistency or explanatory power. Examples of parameters include the probability, concurrence and richness scores.
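
The text does not give formulas for the concurrence and richness scores, so the following sketch substitutes two standard statistics as stand-ins and should not be read as the disclosed algorithm. Concurrence is approximated by one minus a one-sided binomial tail on correct versus contrary outcomes against a chance rate of 0.5. Richness is approximated by one minus a hypergeometric tail on how many of the hypothesis's predicted nodes are among the changed measurements. Both are oriented so that larger values are better, and the score is their product. All function and parameter names are hypothetical.

```python
from math import comb

def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

def hypergeom_tail(hits, draws, marked, population):
    """P(X >= hits) when `draws` items are drawn from `population`,
    of which `marked` are marked."""
    upper = min(draws, marked)
    total = sum(comb(marked, k) * comb(population - marked, draws - k)
                for k in range(hits, upper + 1))
    return total / comb(population, draws)

def score_hypothesis(n_predicted, n_correct, n_contrary, n_changed, n_measured):
    """ASSUMED scoring: product of stand-in concurrence and richness scores.

    n_predicted: nodes the hypothesis predicts to change
    n_correct / n_contrary: predictions that match / contradict the data
    n_changed: measured nodes that changed; n_measured: all measured nodes
    """
    concurrence = 1.0 - binomial_tail(n_correct, n_correct + n_contrary)
    n_observed = n_correct + n_contrary
    richness = 1.0 - hypergeom_tail(n_observed, n_predicted, n_changed, n_measured)
    return concurrence * richness
```

Two hypotheses can then be compared by their scores; for instance, with identical measurement counts a hypothesis with eight correct and one contrary outcome scores higher than one with four correct and four contrary outcomes.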

[0085] In various embodiments, the invention includes method steps, applications, and devices wherein the biological system is a mammalian biological system. The method further comprises determining the identity of a second hypothetically responsible node upstream of the hypothetically responsible node. The second node is predicted to induce a predetermined change in the biological system upon inhibition or stimulation.

[0086] In various embodiments, the invention includes method steps, applications, and devices for conducting an experiment on a biological specimen to determine if the hypothetical relationship predicted by the logical simulation corresponds to the biologically observed change. The invention includes performing an experiment on a biological specimen to attempt to induce the observed change.

[0087] In various embodiments, the invention includes method steps, applications, and devices wherein the biological system being analyzed is a mammalian biological system that is perturbed from stasis. The perturbation is induced by a disease, toxicity, environmental exposure, abnormality, morbidity, aging, or another stimulus.

[0088] In various embodiments, the invention includes method steps, applications, and devices for determining the state of a group of nodes reproducibly associated with the biological state of the mammal which can act as a marker set characteristic of the biological state.

[0089] In various embodiments, the new biological knowledge produced by the method includes predictions of physiological behavior in humans, for example, from analysis of experiments conducted on animals, such as drug efficacy and/or toxicity, or the discovery of biomarkers indicative of the prognosis, diagnosis, drug susceptibility, drug toxicity, severity, or stage of disease.

[0090] The invention provides computing devices for analyzing a biological knowledge base and for discovering new biological knowledge. The computing devices include means for accessing an electronic database of biological assertions comprising a multiplicity of nodes representative of biological elements, and relationship descriptors describing relationships between nodes and properties of nodes, and a user interface for specifying biological elements or perturbations which will be analyzed by the device. The devices also include a computer application to perform a logical simulation of a perturbation to a selected biological element or relationship, to analyze the source or effects of the perturbation, and to assess the probability that the simulation generated pathway represents real biology. The invention also provides articles of manufacture having a computer-readable program carrier with computer-readable instructions embodied thereon for performing the methods and systems described above.

Example 1

[0091] An example of an embodiment of the invention, examining liver changes in mice fed with polyunsaturated fatty acid (PUFA) rich foods, is described below and shown in FIGS. 6 through 11. This example application was conducted using publicly available results from Berger et al., Dietary effects of arachidonate-rich fungal oil and fish oil on murine hepatic and hippocampal gene expression, Metabolic and Genomic Regulation, Nestle Research Center, Switzerland. The research looked at mouse liver changes on a diet rich in polyunsaturated fatty acid in fungal and fish oils. We used differential gene expression data in our model of dyslipidemia and had the system work backwards from the data to find the most likely causes of observed changes. FIGS. 6 through 11 are illustrations of how the system walked backwards from a few selected genes (6 in this case). At each stage, the system walks an extra step backwards to find possible causes. As shown in FIGS. 12A and 12B, the area with the best score was PPAR.

[0092] FIG. 12A shows an explanation diagram of how PPAR is connected to a phenotype of interest. This diagram was generated after doing the back simulation in which PPAR was identified as the cause or area of control. Links were then extracted from the knowledge base that corresponded to the path that the back simulator took to get to PPAR. FIG. 12B, which is a zoomed in version of FIG. 12A, shows the user the exact links (an explanation) of how the system concluded PPAR was the cause or area of control. This was done by keeping track of the nodes and links that were traversed by the backwards simulator as it worked its way through the network and then displaying the nodes and links that led to a specific consensus area. While the result of this example was obtained algorithmically by the present invention, Berger et al. also reached a similar conclusion in their research.

[0093] FIG. 13 is a manually composed diagram which shows propagation of predicted changes 1210 in a forward simulation being compared with observed expression changes 1220. This diagram illustrates the propagation of predicted protein changes 1210 based on an increase in the amount of a compound 1230 through a known pathway. In this diagram, spheres 1240 represent proteins. Pairs of adjacent spheres 1250 indicate complexes of proteins. Thin arrows with T-shaped heads 1260 indicate inhibitions or causal decreases. Thin arrows with pointed heads 1270 indicate an activation or causal increase. Gene expression relationships are indicated by the arrows 1280. The diagram is intended to clarify the way in which changes predicted by a hypothesis may be compared with observed data.

[0094] FIG. 14 is a diagram generated by backward simulation from nine observed expression data points 1320, followed by pruning of the graph to show only the connections 1330 which support the primary hypotheses. Each node 1310 in this figure represents either a gene, protein, or compound. Nine of these nodes 1320 represent changes in expression of genes in response to dietary polyunsaturated fatty acids. The rest of the diagram is generated by exploring the knowledge base or assembly to find possible nodes 1310, which if changed, could explain one or more of the observed nine changes 1320 and then removing nodes 1310 and connections 1330 such that only the best explanations are shown.

[0095] One example of a method comprised of techniques herein above would be as follows: (1) load a set of expression fold-change data to the assembly; (2) run a backward logical simulation based on the fold-change data; (3) examine the resulting derived network and choose the most implicated nodes--the ones which are the highest ranking possible causes of the observed data; (4) for that set of nodes, return to the original assembly and run a pathfinding algorithm to find the derived network which is the minimal graph connecting the nodes; and (5) output the resulting derived network as a graph. Methods such as this example can be embodied as functions in the programming framework and can be named and re-used.
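Steps (3) through (5) of the method above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the edge list, the scores standing in for the output of step (2), and all function names are hypothetical, and the "minimal graph" of step (4) is approximated by the union of pairwise shortest paths rather than computed exactly.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy assembly: directed causal edges (cause, effect).
EDGES = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E"), ("X", "E")]

def undirected_neighbors(edges):
    """Adjacency map that ignores edge direction, used only for path finding."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    return adj

def shortest_path(adj, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(adj.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

def top_causes(scores, k):
    """Step 3: choose the k nodes with the largest absolute score."""
    return sorted(scores, key=lambda n: (-abs(scores[n]), n))[:k]

def connecting_subgraph(edges, nodes):
    """Step 4 (approximation): union of shortest paths between chosen nodes."""
    adj = undirected_neighbors(edges)
    keep = set()
    for a, b in combinations(sorted(nodes), 2):
        path = shortest_path(adj, a, b)
        if path:
            keep.update(frozenset(pair) for pair in zip(path, path[1:]))
    return [(s, d) for s, d in edges if frozenset((s, d)) in keep]

def implicated_network(edges, scores, k=2):
    """Steps 3 to 5 composed into one named, reusable method."""
    return connecting_subgraph(edges, top_causes(scores, k))

# Assumed stand-in for the node scores produced by the backward simulation.
scores = {"A": 3, "X": 2, "B": 1, "D": 1}
print(implicated_network(EDGES, scores))
```

Wrapping the steps in a single named function mirrors the statement that such methods can be named and re-used within the programming framework.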

[0096] FIG. 15 illustrates a visualization technique comprising an aspect of the present invention that is based on a forward simulation that compares predicted outcomes with actual laboratory data. This diagram shows the direct downstream effects of a perturbation. The right-most column shows the expected outcome of a perturbation in the system. Each predicted value is compared to the actual values to determine how closely the predictions explain the lab data. A correlation can be calculated between the predicted outcome and the actual effect of each treatment. In FIG. 15, the cells marked with horizontal lines show a significant increase, the cells marked with vertical lines show a significant decrease, the darkened cells show no change, and the undarkened cells are insignificant. Perturbations may include, but are not limited to, the increase or decrease in concentration of a transcription factor, a small molecule, or a biochemical catalyst.
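The comparison of predicted and actual values may be illustrated with a short Python sketch. The encoding (+1 for increase, -1 for decrease, 0 for no change), the treatment names, and the choice of a Pearson correlation are assumptions made for illustration; the specification does not prescribe a particular correlation measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # undefined when either column never varies
    return cov / sqrt(var_x * var_y)

# Hypothetical data: +1 = increase, -1 = decrease, 0 = no change.
predicted = [1, 1, -1, 0, 1]
observed = {
    "treatment_a": [1, 1, -1, 0, 1],   # agrees with every prediction
    "treatment_b": [1, -1, -1, 0, 0],  # agrees only in part
}

correlations = {name: pearson(predicted, column) for name, column in observed.items()}
for name, value in correlations.items():
    print(name, value)
```

A treatment whose observed column matches the predicted column yields a correlation near 1, while partial agreement yields a lower value.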

[0097] Applications of the invention include, but are not limited to, mechanisms of action (observations of tissue with and without drug can help elucidate the area in which the drug is working), mechanism of resistance (observations on the differences between responders and non-responders to a drug can lead to consensus areas that are the root cause(s) of resistance to treatment), mechanism of disease (observations of diseased versus healthy tissue(s) or patient(s) can lead to mechanisms of disease), and pathway identification (the method can be used to show which pathways are changing in an experiment and can help explain the observations).

[0098] Logical simulations, in an alternative embodiment, are performed within an assembly. Assemblies refer to sub-knowledge bases and derived knowledge bases. These specialty knowledge bases can be constructed from a global knowledge base by extracting a potentially relevant subset of life science-related data satisfying criteria specified by a user as a starting point, and reassembling a specially focused knowledge base having the structure disclosed herein. Assemblies are described in detail in co-pending, co-owned U.S. patent application Ser. No. 10/794,407, the disclosure of which is incorporated by reference herein.

[0099] Assemblies may be used to implement logical simulations, to evaluate data sets not present in a global repository at the time of the original assembly construction (e.g., to retest a hypothesis based on new experimental data), to hypothesize pathways and discern complex and subtle cause and effect relationships within a biological system, and to discern disease etiology, understand toxic biochemical mechanisms, and predict toxic response.

[0100] Logical simulations, in another example, are performed on data generated by an epistemic engine. Epistemic engines are described in detail in co-pending, co-owned U.S. patent application Ser. No. 10/717,224, the disclosure of which is incorporated by reference herein. Epistemic engines are programmed computers that accept biological data from real or thought experiments probing a biological system, and use them to produce a network model of protein interactions, gene interactions and gene-protein interactions consistent with the data and prior knowledge about the system, and thereby deconstruct biological reality and propose testable explanations (models) of the operation of natural systems. The engines identify new interrelationships among biological structures, for example, among biomolecules constituting the substance of life. These new relationships alone or collectively explain system behavior. For example, they can explain the observed effect of system perturbation, identify factors maintaining homeostasis, explain the operation and side effects of drugs, rationalize epidemiological and clinical data, expose reasons for species success, reveal embryological processes, and discern the mechanisms of disease. The programs reveal patterns in complex data sets too subtle for detection with the unaided human mind. The output of the epistemic engine permits one to better understand the system under study, to propose hypotheses, to integrate the system under study with other systems, to build more complex and lucid models, and to propose new experiments to test the validity of hypotheses.

[0101] In some embodiments, a knowledge base may take the form of one or more database tables, each having columns and rows. It should be understood that a knowledge base or assembly in the form of a database is only one way in which information may be represented in a computer. Information could instead be represented as a vector, a multi-dimensional array, a linked data structure, or many other suitable data structures or representations.
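By way of illustration only, the same assertions may be held as table rows or as a linked data structure, as the following Python sketch suggests. The row layout (source, relationship, target), the entity names, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a database table: (source, relationship, target).
ROWS = [
    ("compound_x", "increases", "protein_a"),
    ("protein_a", "decreases", "gene_b_expression"),
    ("protein_a", "increases", "gene_c_expression"),
]

def to_linked(rows):
    """Re-express table rows as a linked structure keyed by source node."""
    graph = defaultdict(list)
    for source, relationship, target in rows:
        graph[source].append((relationship, target))
    return dict(graph)

def to_rows(graph):
    """Flatten the linked structure back into table rows."""
    return [(s, r, t) for s, links in graph.items() for r, t in links]

linked = to_linked(ROWS)
print(linked["protein_a"])
```

Either form carries the same information; the linked form simply makes it cheaper to follow relationships outward from a given node.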

Graphical Output Techniques

[0102] A knowledge base, pathway, or group of pathways can be displayed visually as a graph of nodes connected by connections representing biological relationships between and among nodes. These graphs can be inspected by a scientist to understand the biological system and to facilitate the discovery of new biological knowledge about life sciences-related systems. Using these tools to discern biologically relevant insights into how a system behaves can be extremely valuable in drug research and development, and for developing a variety of therapies. Visualization techniques can also be used to display knowledge and associated data to enhance user understanding and recognition of relationships among entities that may emerge as patterns and clusters.

Apparatus

[0103] The functionality of the systems and methods disclosed herein may be implemented as software on a general purpose computer. In some embodiments, a computer program may be written in any one of a number of high-level languages, such as FORTRAN, PASCAL, C, C++, LISP, JAVA, or BASIC. Further, a computer program may be written in a script, macro, or functionality embedded in commercially available software, such as EXCEL or VISUAL BASIC. Additionally, software could be implemented in an assembly language directed to a microprocessor resident on a computer. For example, software could be implemented in Intel 80x86 assembly language if it were configured to run on an IBM PC or PC clone. Software may be embedded on an article of manufacture including, but not limited to, a storage medium or computer-readable medium such as a floppy disk, a hard disk, an optical disk, a magnetic tape, a PROM, an EPROM, or CD-ROM.

Example

Validation Algorithm for Biological Models

[0104] An example of an algorithm for use in validating a biological model by comparing predicted to actual results is described below and in the pseudo code in FIG. 16. This algorithm assumes that there exists a knowledge base representing a biological system with data from gene expression experiments mapped onto the knowledge base.

[0105] The predicted results can be determined in two stages. First, a backward simulation as described herein is run on a knowledge base to determine potential causes of the gene expression changes. The backward simulation produces a list of genes and a score for each. The score for each node is based on the "votes" it received during the backward simulation. At the beginning of the backward simulation, nodes representing genes which are significantly upregulated are assigned positive votes, while those which are significantly downregulated are assigned negative votes. During the simulation, votes are copied from node to node according to a set of rules which follow the causal relationships expressed in the knowledge base. At the end of the simulation, the score for each node is computed as a set of three numbers: the sum of positive votes, the sum of negative votes, and an overall score, which is the sum of the positive and negative votes. At this point, the set of nodes representing potential causes ("the causes") may be used for the next step and may be selected based on each node's score, or the set of potential causes may be determined manually. In the second stage, the votes for all nodes are set to zero and a forward simulation as described herein is run on the selected set of causes. The votes are handled in the same way, except that they are propagated from causes to potential effects. At the end of the forward simulation, nodes which represent the expression of genes are reviewed. Those with a positive overall score are the ones which the forward simulation predicts to be up-regulated and those with a negative overall score are the ones which are predicted to be down-regulated. The results of the forward simulation represent the overall predicted results.
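The two stages described above may be sketched in Python as follows. This is a simplified illustration, not the algorithm of FIG. 16: the signed edge list and node names are hypothetical, propagation is limited to a single hop, and the propagation rule (a copied vote equals the source vote multiplied by the sign of the causal link) is an assumption, since the rule set itself is not reproduced here.

```python
# Hypothetical signed causal links: (cause, effect, sign); +1 increases, -1 decreases.
EDGES = [
    ("cause_a", "gene_1", +1),
    ("cause_a", "gene_2", +1),
    ("cause_a", "gene_3", -1),
    ("cause_b", "gene_3", +1),
]

def propagate(edges, seeds, backward):
    """Copy each seed vote one hop along the causal links.

    Assumed rule: the copied vote is the seed vote times the link sign.
    """
    votes = {node: [vote] for node, vote in seeds.items()}
    for cause, effect, sign in edges:
        src, dst = (effect, cause) if backward else (cause, effect)
        if src in seeds:
            votes.setdefault(dst, []).append(seeds[src] * sign)
    return votes

def score(vote_list):
    """Return (sum of positive votes, sum of negative votes, overall score)."""
    positive = sum(v for v in vote_list if v > 0)
    negative = sum(v for v in vote_list if v < 0)
    return positive, negative, positive + negative

# Stage 1: seed the observed changes, walk backward, score candidate causes.
observed = {"gene_1": +1, "gene_2": +1, "gene_3": -1}
backward_scores = {n: score(v) for n, v in propagate(EDGES, observed, True).items()}
candidates = [n for n in backward_scores if n not in observed]
best = max(candidates, key=lambda n: abs(backward_scores[n][2]))

# Stage 2: start again from zero votes, seed the chosen cause, walk forward.
direction = 1 if backward_scores[best][2] > 0 else -1
forward_scores = {n: score(v) for n, v in propagate(EDGES, {best: direction}, False).items()}
predicted_up = sorted(n for n, s in forward_scores.items() if n != best and s[2] > 0)
predicted_down = sorted(n for n, s in forward_scores.items() if n != best and s[2] < 0)

print(best, backward_scores[best])
print(predicted_up, predicted_down)
```

Selecting the single highest-scoring node is only one possible selection policy; as noted above, the set of causes may also be chosen manually.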

[0106] The actual results are classified into two categories based on the gene expression data. One list contains up-regulated genes and another list contains down-regulated genes. The genes included in these lists can be generated by various statistical methods, taking into account the absolute magnitude of the change (e.g., signal level), the relative magnitude of the change (e.g., fold values), statistical significance, etc. Alternatively, the genes may be selected manually.

[0107] After the predicted and actual results have been generated, overall results are tabulated for each gene according to the following three cases. In the first case, a gene is predicted to be up-regulated. If the gene is in the actual list of up-regulated genes, the "correct prediction counter" is incremented. Otherwise, if the gene is in the actual list of down-regulated genes, the "opposite prediction counter" is incremented. If the gene is not in either list of actual gene expression changes, then the "predicted but not observed counter" is incremented. In the second case, a gene is predicted to be down-regulated. If the gene is in the actual list of up-regulated genes, the "opposite prediction counter" is incremented. Otherwise, if the gene is in the actual list of down-regulated genes, the "correct prediction counter" is incremented. If the gene is not in either list of actual gene expression changes, then the "predicted but not observed counter" is incremented. In the third case, there is no prediction for the gene and the "no net change counter" is incremented.

[0108] For every gene that is in either the actual up-regulated or the actual down-regulated gene list, but does not have any prediction, the "observed not predicted counter" is incremented. The five "counters" are then output: (1) "correct prediction counter", (2) "opposite prediction counter", (3) "predicted but not observed counter", (4) "observed not predicted counter", and (5) "no net change counter". These counters may be visualized, for example, in a histogram format or a pie chart format. Such visualizations provide an intuitive means for a scientist to initially assess the degree to which the generated hypothesis matches the observed data.
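The tabulation of the five counters may be sketched in Python as follows. This is a literal reading of the cases described above, not a reproduction of the pseudo code of FIG. 16; the function name, the counter keys, and the sample data are hypothetical, and a prediction is represented, by assumption, as an overall forward-simulation score whose value of zero means "no prediction".

```python
def tabulate(predictions, actual_up, actual_down):
    """Count prediction outcomes.

    predictions maps gene -> overall score (assumed: 0 means no prediction);
    actual_up and actual_down are sets of observed genes.
    """
    counts = {
        "correct": 0,
        "opposite": 0,
        "predicted_not_observed": 0,
        "observed_not_predicted": 0,
        "no_net_change": 0,
    }
    for gene, overall in predictions.items():
        if overall == 0:
            counts["no_net_change"] += 1      # third case: no prediction
            continue
        # First case (predicted up) and second case (predicted down).
        same, other = (actual_up, actual_down) if overall > 0 else (actual_down, actual_up)
        if gene in same:
            counts["correct"] += 1
        elif gene in other:
            counts["opposite"] += 1
        else:
            counts["predicted_not_observed"] += 1
    # Observed genes for which the simulation made no prediction.
    for gene in actual_up | actual_down:
        if predictions.get(gene, 0) == 0:
            counts["observed_not_predicted"] += 1
    return counts

# Hypothetical sample data.
predictions = {"g1": 2, "g2": -1, "g3": 1, "g4": 0, "g5": -3}
counters = tabulate(predictions, actual_up={"g1", "g2", "g6"}, actual_down={"g5"})
print(counters)
```

Under this literal reading, an observed gene that was reviewed but received a zero overall score increments both the "no net change counter" and the "observed not predicted counter"; an implementation may wish to treat that overlap differently.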

[0109] While the invention has been particularly shown and described with reference to specific embodiments and illustrative examples, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.
