U.S. patent application number 13/417190 was filed with the patent office on 2012-03-09 and published on 2012-09-13 for biological data networks and methods therefor.
This patent application is currently assigned to ANNAI SYSTEMS, INC. Invention is credited to Patrick Nikita Allen, Lawrence Ganeshalingam.
Application Number | 20120233202 13/417190 |
Family ID | 46795538 |
Filed Date | 2012-03-09 |
United States Patent Application | 20120233202 |
Kind Code | A1 |
Ganeshalingam; Lawrence; et al. | September 13, 2012 |
BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR
Abstract
A method for facilitating processing of a request in a system
including a plurality of biological data units stored at a
plurality of network-accessible locations is disclosed herein. The
method includes receiving, at a first node of the biological data
network, the request from a client device. The method further
includes performing a first processing operation with respect to at
least one of the biological data units based upon the request. The
method also includes determining, based upon results of the first
processing operation, that the processing of the request is
incomplete and selecting, based upon the results of the first
processing operation, a second node of the biological data network
to perform a second processing operation. The method additionally
includes sending, from the first node, the results of the first
processing operation to the second node over a network.
Inventors: | Ganeshalingam; Lawrence (Los Gatos, CA); Allen; Patrick Nikita (Scotts Valley, CA) |
Assignee: | ANNAI SYSTEMS, INC., Los Gatos, CA |
Family ID: | 46795538 |
Appl. No.: | 13/417190 |
Filed: | March 9, 2012 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
61451086 | Mar 9, 2011 | |
61539942 | Sep 27, 2011 | |
61539931 | Sep 27, 2011 | |
Current U.S. Class: | 707/769; 707/E17.014; 709/217 |
Current CPC Class: | G06F 9/52 20130101; G16B 30/00 20190201; H04L 45/00 20130101; G06F 2209/484 20130101 |
Class at Publication: | 707/769; 709/217; 707/E17.014 |
International Class: | G06F 15/16 20060101 G06F015/16; G06F 17/30 20060101 G06F017/30 |
Claims
1. A method for facilitating processing of a request in a
biological data network comprised of a plurality of biological data
units stored at a plurality of network-accessible locations, the
method comprising: receiving, through a network interface of a node
of the biological data network, the request from a client device;
performing a first processing operation with respect to at least
one of the biological data units based upon the request;
determining, based upon results of the first processing operation,
that the processing of the request is complete; and sending,
through the network interface, a response to the client device.
2. The method of claim 1 wherein each of the biological data units
includes a representation of biological sequence data and at least
one biologically-relevant header associated with the biological
sequence data.
3. A method for facilitating processing of a request in a system
including a plurality of biological data units stored at a
plurality of network-accessible locations, the method comprising:
receiving, at a first node of the biological data network, the
request from a client device; performing a first processing
operation with respect to at least one of the biological data units
based upon the request; determining, based upon results of the
first processing operation, that the processing of the request is
incomplete; selecting, based upon the results of the first
processing operation, a second node of the biological data network
to perform a second processing operation; and sending, from the
first node, the results of the first processing operation to the
second node over a network.
4. A method, comprising: receiving, through a network interface of
a network node, a segment of a genome sequence of an organism;
comparing the segment of the genome sequence to a reference
sequence; identifying sequence variants between the genome sequence
and the reference sequence; and receiving, from another network
node, information relating to the sequence variants.
5. The method of claim 4 further including requesting, from the
other network node, the information relating to the sequence
variants.
6. A method for facilitating processing a disease-related query,
the method comprising: receiving, through a network interface of a
first network node, a query relating to a specified disease and a
genomic sequence associated with the query; identifying, relative
to a control sequence, any variant alleles within the genomic
sequence; sending, through the network interface, information
identifying the variant alleles to a second network node; and
receiving, through the network interface, information relating to
the set of variant alleles.
7. The method of claim 6 further including sending a response to
the disease-related query based upon the information relating to
the set of variant alleles.
8. A method for facilitating processing a disease-related query
within a biological data network, the method comprising: receiving,
at a first network node, a query relating to a specified disease
and a genomic sequence associated with the query; identifying,
relative to a control sequence, any variant alleles within the
genomic sequence; sending information identifying the variant
alleles over a network to a second network node; receiving, at the
first network node, pharmacological response data associated with
those of the variant alleles included within genes associated with
the specified disease; and sending a response to the query based
upon the pharmacological response data.
9. A method for facilitating processing a disease-related query,
the method comprising: receiving, through a network interface of a
network node, information identifying variant alleles within a
genomic sequence associated with a query relating to a specified
disease; providing the information to a processing module;
performing, using the processing module, a statistical correlation
analysis in order to identify those of the variant alleles included
within genes associated with the specified disease; providing
results of the statistical correlation to the network interface;
and sending the results of the statistical correlation to another
network node for further processing.
10. A method for facilitating the processing of biological data
within a network including a plurality of nodes, the method
comprising: receiving, at a first node of the plurality of nodes, a
request to process the biological data wherein the first node is
configured for DNA-specific layer processing; performing a first
processing operation with respect to at least a DNA-specific layer
of the biological data based upon the request; and sending, to a
second node of the plurality of nodes, results of the first
processing operation wherein the second node is configured for
processing of an RNA-specific layer of the results.
11. The method of claim 10 further including selecting, based upon
the results of the first processing operation, the second node to
perform the processing of the RNA-specific layer of the result.
12. A network node, comprising: a network interface configured to
receive a request from a client device; a processing module in
communication with the network interface, the processing module
performing a first processing operation with respect to at least
one of the biological data units based upon the request and
determining, based upon results of the first processing operation,
that the processing of the request is complete; and a transmit
controller configured to control sending, through the network
interface, a response to the client device.
13. The network node of claim 12 wherein each of the biological
data units includes a representation of biological sequence data
and at least one biologically-relevant header associated with the
biological sequence data.
14. A network node, comprising: a network interface configured to
receive a request from a client device; a processing module in
communication with the network interface, the processing module
being configured with instructions to: perform a first processing
operation with respect to at least one of the biological data units
based upon the request, determine, based upon results of the first
processing operation, that the processing of the request is
incomplete, select, based upon the results of the first processing
operation, a second node of the biological data network to perform
a second processing operation; and a transmit controller configured
to control sending results of the first processing operation to
another network node.
15. A network node, comprising: a network interface configured to
receive a segment of a genome sequence of an organism; and a
processing module communicatively coupled to the network interface,
the processing module being configured to compare the segment of
the genome sequence to a reference sequence and identify sequence
variants between the genome sequence and the reference sequence;
wherein the network interface is further configured to receive,
from another network node, information relating to the sequence
variants.
16. The network node of claim 15 further including a transmit
controller configured to control sending, to the another network
node, a request for the information relating to the sequence
variants.
17. A network node, comprising: a network interface configured to
receive a query relating to a specified disease and a genomic
sequence associated with the query; a processing module
communicatively coupled to the network interface, the processing
module being configured to identify, relative to a control sequence,
any variant alleles within the genomic sequence; and a transmit
controller configured to control sending, through the network
interface, information identifying the variant alleles to a second
network node; wherein the network interface is further configured
to receive information relating to the set of variant alleles.
18. The network node of claim 17 wherein the transmit controller is
further configured to control sending a response to the query based
upon the information relating to the set of variant alleles.
19. A network node, comprising: a network interface configured to
receive a query relating to a specified disease and a genomic
sequence associated with the query; a processing module
communicatively coupled to the network interface, the processing
module being configured to identify, relative to a control
sequence, any variant alleles within the genomic sequence; and a
transmit controller configured to control sending information
identifying the variant alleles to a second network node; wherein
the network interface is further configured to receive
pharmacological response data associated with those of the variant
alleles included within genes associated with the specified disease
and wherein the transmit controller is further configured to send a
response to the query based upon the pharmacological response
data.
20. A network node, comprising: a network interface configured to
receive information identifying variant alleles within a genomic
sequence associated with a query relating to a specified disease; a
processing module configured to perform a statistical correlation
analysis in order to identify those of the variant alleles included
within genes associated with the specified disease; and a transmit
controller configured to send results of the statistical
correlation to another network node for further processing.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims the benefit of priority under
35 U.S.C. § 119(e) of U.S. Provisional Patent Application Ser.
No. 61/451,086, entitled BIOLOGICAL DATA NETWORK, filed on Mar. 9,
2011, of U.S. Provisional Patent Application Ser. No. 61/539,942,
entitled SYSTEM AND METHOD FOR SECURE, HIGHSPEED TRANSFER OF VERY
LARGE FILES, filed Sep. 27, 2011, and of U.S. Provisional Patent
Application Ser. No. 61/539,931, entitled SYSTEM AND METHOD FOR
FACILITATING NETWORK-BASED TRANSACTIONS INVOLVING SEQUENCE DATA,
filed Sep. 27, 2011, the content of each of which is hereby
incorporated by reference herein in its entirety for all purposes.
This application is related to United States Utility patent
application Ser. No. 12/837,452, entitled METHODS AND SYSTEMS FOR
PROCESSING GENOMIC DATA, filed on Jul. 15, 2010, which claims
priority to U.S. Provisional Patent Application Ser. No.
61/358,854, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS
DATA, filed on Jun. 25, 2010, and to United States Utility patent
application Ser. No. 12/828,234, entitled METHODS AND SYSTEMS FOR
PROCESSING GENOMIC DATA, filed on Jun. 30, 2010, which claims
priority to U.S. Provisional Patent Application Ser. No.
61/358,854, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS
DATA, filed on Jun. 25, 2010, the content of each of which is
hereby incorporated by reference herein in its entirety for all
purposes. This application is also related to U.S. Utility patent
application Ser. No. 13/223,077, entitled METHODS AND SYSTEMS FOR
PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed
on Aug. 31, 2011, and to U.S. Utility patent application Ser. No.
13/223,084, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC
SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, and
to U.S. Utility patent application Ser. No. 13/223,088, entitled
METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND
RELATED INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility
patent application Ser. No. 13/223,092, entitled METHODS AND
SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED
INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility patent
application Ser. No. 13/223,097, entitled METHODS AND SYSTEMS FOR
PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed
on Aug. 31, 2011, the content of each of which is hereby
incorporated by reference herein in its entirety for all purposes.
This application is also related to U.S. Utility patent application
Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS
THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent
application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND
METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent
application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND
METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent
application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND
METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent
application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND
METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent
application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND
METHODS THEREFOR, filed on Mar. 9, 2012, the disclosures of which
are hereby incorporated by reference for all purposes.
FIELD
[0002] This application is generally directed to processing and
networking polymeric sequence information, including biopolymeric
sequence information such as DNA sequence information.
BACKGROUND
[0003] Deoxyribonucleic acid ("DNA") sequencing is the process of
determining the ordering of nucleotide bases (adenine (A), guanine
(G), cytosine (C) and thymine (T)) in molecular DNA. Knowledge of
DNA sequences is invaluable in basic biological research as well as
in numerous applied fields such as, but not limited to, medicine,
health, agriculture, livestock, population genetics, social
networking, biotechnology, forensic science, security, and other
areas of biology and life sciences.
[0004] Sequencing has been done since the 1970s, when academic
researchers began using laborious methods based on two-dimensional
chromatography. Due to the initial difficulties in sequencing in
the early 1970s, the cost and speed could be measured in scientist
years per nucleotide base as researchers set out to sequence the
first restriction endonuclease site containing just a handful of
bases. Thirty years later, the entire 3.2 billion bases of the
human genome have been sequenced, with a first complete draft of
the human genome done at a cost of about three billion dollars.
Since then sequencing costs have rapidly decreased.
[0005] Today, the cost of sequencing the human genome is on the
order of $5000 and is expected to hit the $1000 mark later this
year with the results available in hours, much like a routine blood
test. As the cost of sequencing the human genome continues to
plummet, the number of individuals having their DNA sequenced for
medical, as well as other purposes, will likely increase
significantly. Currently, the nucleotide base sequence data
collected from DNA sequencing operations are stored in multiple
different formats in a number of different databases.
[0006] Such databases also contain annotations and other attribute
information related to the DNA sequence data including, for
example, information concerning single nucleotide polymorphisms
(SNPs), gene expression, copy number variations and methylation
sequence. Moreover, transcriptomic and proteomic data are also
present in multiple formats in multiple databases. This renders it
impractical to exchange and process the sources of genome sequence
data and related information collected in various locations,
thereby hampering the potential for scientific discoveries and
advancements.
SUMMARY
[0007] In one aspect the disclosure relates to a method for
facilitating processing of a request in a biological data network
comprised of a plurality of biological data units stored at a
plurality of network-accessible locations. The method includes
receiving, through a network interface of a node of the biological
data network, the request from a client device. The method further
includes performing a first processing operation with respect to at
least one of the biological data units based upon the request. The
method additionally includes determining, based upon results of the
first processing operation, that the processing of the request is
complete and sending, through the network interface, a response to
the client device.
[0008] In another aspect the disclosure pertains to a method for
facilitating processing of a request in a system including a
plurality of biological data units stored at a plurality of
network-accessible locations. The method includes receiving, at a
first node of the biological data network, the request from a
client device. The method further includes performing a first
processing operation with respect to at least one of the biological
data units based upon the request. The method also includes
determining, based upon results of the first processing operation,
that the processing of the request is incomplete and selecting,
based upon the results of the first processing operation, a second
node of the biological data network to perform a second processing
operation. The method additionally includes sending, from the first
node, the results of the first processing operation to the second
node over a network.
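The result-based forwarding described in this aspect can be sketched
as follows. This is a minimal illustration only and is not taken from
the disclosure: the names Result, Node, next_capability, select_units
and summarize are hypothetical, the header field "gene" is an assumed
example, and a direct method call stands in for sending results over
a network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Result:
    data: dict                             # results of a processing operation
    complete: bool                         # True when the request is fully answered
    next_capability: Optional[str] = None  # hypothetical hint used for node selection


@dataclass
class Node:
    name: str
    operation: Callable[[dict], Result]
    peers: Dict[str, "Node"] = field(default_factory=dict)  # capability -> node

    def handle(self, request: dict, trail: Optional[List[str]] = None) -> dict:
        trail = (trail or []) + [self.name]
        result = self.operation(request)  # this node's processing operation
        if result.complete:
            return {"response": result.data, "trail": trail}
        # Processing is incomplete: select the next node based on the results.
        second = self.peers.get(result.next_capability)
        if second is None:
            raise LookupError(f"no peer offers {result.next_capability!r}")
        # A direct call stands in for sending the results over a network.
        return second.handle(result.data, trail)


def select_units(request: dict) -> Result:
    # Keep only the data units whose (assumed) header names the requested gene.
    hits = [u for u in request["units"] if u["gene"] == request["gene"]]
    return Result({"units": hits}, complete=False, next_capability="summarize")


def summarize(request: dict) -> Result:
    # Second processing operation: count the units received from the first node.
    return Result({"count": len(request["units"])}, complete=True)


node_b = Node("node-b", summarize)
node_a = Node("node-a", select_units, peers={"summarize": node_b})
outcome = node_a.handle(
    {"gene": "BRCA1", "units": [{"gene": "BRCA1"}, {"gene": "TP53"}]}
)
```

In this sketch the first node decides where to send its results from
the results themselves, which mirrors the selecting step of the
method; how a real node would discover and address its peers is left
open.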
[0009] In yet a further aspect the disclosure is directed to a
method for facilitating processing of a disease-related query. The
method includes receiving, through a network interface of a first
network node, a query relating to a specified disease and a genomic
sequence associated with the query. The method further includes
identifying, relative to a control sequence, any variant alleles
within the genomic sequence. The method additionally includes
sending, through the network interface, information identifying the
variant alleles to a second network node and receiving, through the
network interface, information relating to the set of variant
alleles.
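As a rough illustration of the identifying step, variant positions
might be located as shown below. The sketch assumes that the genomic
sequence and the control sequence are already aligned and of equal
length, which the disclosure does not state. The names Variant and
identify_variants are hypothetical, and practical variant
identification typically also handles insertions and deletions, which
are omitted here.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    position: int   # 0-based offset into the aligned sequences
    control: str    # base found in the control sequence
    observed: str   # base found in the query's genomic sequence


def identify_variants(genomic: str, control: str) -> List[Variant]:
    """Report positions where the genomic sequence differs from the control.

    Assumes both sequences are pre-aligned and of equal length; insertions
    and deletions are out of scope for this sketch.
    """
    if len(genomic) != len(control):
        raise ValueError("sequences must be pre-aligned and of equal length")
    return [
        Variant(i, c, g)
        for i, (g, c) in enumerate(zip(genomic.upper(), control.upper()))
        if g != c
    ]


variants = identify_variants("ACGTTA", "ACCTTG")
```

The resulting list is one possible form of the "information
identifying the variant alleles" that the first network node sends to
the second network node.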
[0010] The disclosure is further directed to a method for
facilitating processing a disease-related query within a biological
data network. The method includes receiving, at a first network
node, a query relating to a specified disease and a genomic
sequence associated with the query and identifying, relative to a
control sequence, any variant alleles within the genomic sequence.
The method further includes sending information identifying the
variant alleles over a network to a second network node. The method
additionally includes receiving, at the first network node,
pharmacological response data associated with those of the variant
alleles included within genes associated with the specified disease
and sending a response to the query based upon the pharmacological
response data.
[0011] In yet another aspect the disclosure pertains to a method
for facilitating processing of a disease-related query. The method
includes receiving, through a network interface of a network node,
information identifying variant alleles within a genomic sequence
associated with a query relating to a specified disease. The method
further includes providing the information to a processing module
and performing, using the processing module, a statistical
correlation analysis in order to identify those of the variant
alleles included within genes associated with the specified
disease. The method additionally includes providing results of the
statistical correlation to the network interface and sending the
results of the statistical correlation to another network node for
further processing.
[0012] In yet another aspect the disclosure relates to a method for
facilitating the processing of biological data within a network
including a plurality of nodes. The method includes receiving, at a
first node of the plurality of nodes, a request to process the
biological data wherein the first node is configured for
DNA-specific layer processing. The method further includes
performing a first processing operation with respect to at least a
DNA-specific layer of the biological data based upon the request.
In addition, the method includes sending, to a second node of the
plurality of nodes, results of the first processing operation
wherein the second node is configured for processing of an
RNA-specific layer of the results.
[0013] The disclosure also describes network nodes specially
configured to carry out the above-described methods. These network
nodes may include network interfaces, processing modules and
transmit/receive controllers particularly configured and arranged
to implement the operations corresponding to such methods.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] Various objects and advantages and a more complete
understanding of the disclosure are apparent and more readily
appreciated by reference to the following Detailed Description and
to the appended claims when taken in conjunction with the
accompanying Drawings wherein:
[0015] FIG. 1 provides a representation of a biological data
unit comprised of a payload containing DNA sequence data and a
header containing information having biological relevance to the
DNA sequence data within the payload.
[0016] FIG. 2 illustratively represents a biological data model
which includes a plurality of interrelated layers.
[0017] FIG. 3 depicts a biological data unit having a header and a
payload containing an instruction-based representation of segmented
DNA sequence data.
[0018] FIG. 4 is a logical flow diagram of a process for
segmentation of biological sequence data and combining the segments
with metadata attributes to form biological data units encapsulated
with headers.
[0019] FIG. 5 depicts a biological data network comprised of
representations of biological data linked and interrelated by an
overlay network containing a plurality of network nodes.
[0020] FIG. 6 illustrates an exemplary protocol stack implemented
at a network node together with corresponding layers of the OSI
network model.
[0021] FIG. 7 shows a high-level view of various data types that
may be processed by a group of network nodes in response to a
query/request received from a client terminal.
[0022] FIG. 8 provides a block diagrammatic representation of the
architecture of an exemplary network node.
[0023] FIG. 9A illustratively represents a process effected by a
network node to implement a sequence variants processing
procedure.
[0024] FIG. 9B is a flowchart of an exemplary variants processing
procedure.
[0025] FIG. 10 illustratively represents the processing occurring
at a network node configured to perform a specialized processing
function.
[0026] FIG. 11 provides a representation of an exemplary processing
platform capable of being configured to implement a network
node.
[0027] FIG. 12 illustrates one manner in which data may be
processed, managed and stored at an individual network node in an
exemplary clinical environment.
[0028] FIGS. 13-18 illustratively represent the manner in which
information within the layered data structure is utilized at an
individual network processing node.
[0029] FIG. 19 illustrates the cooperative performance of an
exemplary result-based network processing using multiple network
nodes.
[0030] FIG. 20 illustrates an exemplary process flow corresponding
to the result-based network processing illustrated by FIG. 19.
[0031] FIG. 21 depicts a biological data network comprised of a
plurality of network nodes.
[0032] FIG. 22 is a flow chart representative of a set of exemplary
processing operations performed by a biological data network in
response to a user query or request.
[0033] FIG. 23 illustratively represents a separation of localized
and network-based processing functions within a portion of a
biological data network.
[0034] FIG. 24 provides an illustration of various functional
interactions between network-based and localized applications.
[0035] FIG. 25 depicts a biological data network which includes a
collaborative simulation network.
[0036] FIG. 26 is a flowchart representative of a manner in which
information relating to various different layers of
biologically-relevant data organized consistently with a biological
data model may be processed at different network nodes.
[0037] FIG. 27 is a flowchart representative of an exemplary manner
in which network nodes of a biological data network may cooperate
to process a client request.
[0038] FIG. 28 is a flowchart representative of an exemplary
sequence of operations involved in the identification and
processing of sequence variants at a network node.
[0039] FIG. 29 is a flowchart representative of an exemplary
sequence of operations carried out by network nodes of a biological
data network in connection with processing of a disease-related
query.
[0040] FIG. 30 is a flowchart representative of an exemplary
sequence of operations involved in providing pharmacological
response data in response to a user query concerning a specified
disease.
[0041] FIG. 31 illustratively represents communication of DNA
sequence data or other biological sequence information between a
pair of devices supporting a biological data network.
[0042] FIG. 32 illustratively represents one manner in which
multiple devices may support various operations within a biological
data network.
[0043] FIG. 33 illustrates a biological data network configured to
utilize techniques such as, for example, multiprotocol label
switching ("MPLS") to facilitate the distribution of DNA sequence
data and related information between client devices.
[0044] FIG. 34 illustrates a process for assigning
biologically-relevant and network-related headers to segments of
DNA sequence data stored within network-attached storage or
received from a sequencing machine.
[0045] FIG. 35 illustratively represents a system and approach for
using networking protocols otherwise employed for streaming media
to facilitate the dissemination of DNA sequence data.
[0046] FIG. 36 is a block diagram of a high-speed sequence data
analysis system.
DETAILED DESCRIPTION
Introduction
[0047] This disclosure relates generally to an innovative new
biological data network and related methods capable of efficiently
handling the massive quantities of DNA sequence data and related
information expected to be produced as sequencing costs continue to
decrease. The disclosed network and approaches permit such sequence
data and related medical or other information to be efficiently
stored in data containers provided at either a central location or
distributed throughout a network, and facilitate the efficient
network-based searching, transfer, processing, management and
analysis of the stored information in a manner designed to meet the
demands of specific applications.
[0048] The disclosed approaches permit such sequence data and any
related medical, biological, referential or other information, be
it computed, human-entered/directed or a combination thereof, to be
efficiently transmitted and/or shared or otherwise conveyed from a
centralized location or either partly or wholly distributed
throughout the biological data network. These approaches also
facilitate data formats and encodings used in the efficient
processing, management and analysis of various "omics" (i.e.,
proto/onco/pharma) information. The innovative new biological data
network or, equivalently, network, is configured to operate with
respect to biological data units stored at various network
locations.
[0049] Each biological data unit will generally be comprised of one
or more headers associated with or relating to a payload containing
a representation of segmented DNA sequence data or other
non-sequential data of interest. The term header in this context
refers to one or more pieces of information that have relevance to
the payload, without regard to how or where such information is
physically stored or represented within the network. As is
discussed below, it will be appreciated that certain operations
performed by the nodes or elements of the biological data network
may be effected with respect to the entirety of the biological data
units undergoing processing; that is, with respect to
representations of both the segmented sequence data and headers of
such biological data units.
[0050] However, the elements of the biological data network may
perform other operations by, for example, comparing or correlating
only the headers of the biological data units being processed. In
this way network bandwidth may be conserved by obviating the need
for network transport of segmented biological sequence data, or
some representation thereof, in connection with various processing
operations involving biological data units nominally stored at different
network locations.
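A header-only comparison of this kind might look like the following
sketch, in which only header fields from two storage locations are
examined and no sequence payload is transported. The function name
correlate_headers, the unit identifiers and the header field "gene"
are assumptions made for illustration and are not defined in the
disclosure.

```python
from typing import Dict, List, Tuple

Headers = Dict[str, str]


def correlate_headers(
    local: Dict[str, Headers],
    remote: Dict[str, Headers],
    key: str,
) -> List[Tuple[str, str]]:
    """Pair unit ids from two locations whose headers agree on `key`.

    Only header fields are examined; payloads never leave their location.
    """
    # Index the remote headers by the value they hold for `key`.
    index: Dict[str, List[str]] = {}
    for unit_id, headers in remote.items():
        if key in headers:
            index.setdefault(headers[key], []).append(unit_id)

    # Look up each local unit's value for `key` in the remote index.
    pairs: List[Tuple[str, str]] = []
    for unit_id, headers in local.items():
        for other in index.get(headers.get(key), []):
            pairs.append((unit_id, other))
    return pairs


local_headers = {"u1": {"gene": "BRCA1"}, "u2": {"gene": "TP53"}}
remote_headers = {"r1": {"gene": "TP53"}, "r2": {"gene": "EGFR"}}
matches = correlate_headers(local_headers, remote_headers, "gene")
```

Because the headers are typically far smaller than the sequence
payloads they describe, exchanging only the headers is one way the
bandwidth saving described above could be realized.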
[0051] The biological data network may be comprised of a plurality
of network nodes configured with processing and analytical
capabilities, which are individually or collectively capable of
responding to machine or user queries or requests for information.
As is discussed below, the functionality of the new biological data
network may be integrated into the current architectural framework
of the Open Systems Interconnection (OSI) seven-layer model and the
Transmission Control Protocol and Internet Protocol (TCP/IP) model
for network and computing communications. This will allow service
providers to configure existing network infrastructure to
accommodate biological sequence data to deliver optimized quality
of service for medical and health professionals practicing
genomics-based personalized medicine. Alternatively or in addition,
the new biological data network may be realized as an
Internet-based overlay network capable of providing biological,
medical and health-related intelligence to applications supported
by the network.
[0052] The new biological data network facilitates overcoming the
daunting challenges associated with analysis of various pertinent
omics data types together with, and in the context of, all
relevant, available prior knowledge. In this regard the new
biological data network may facilitate development of an integrated
ecosystem in which distributed databases are accessible on a
network and in which the data stored therein is configured to be
linked. This new biological data network may enable, for
example, forming, securing, linking, searching, filtering, sorting,
aggregating and connecting an individual's genome data with a
layered data model of existing knowledge in order to facilitate
extraction of new and meaningful information.
Overview of Biological Data Units and Headers
[0053] As disclosed herein, the innovative new biological data
network is configured to operate with respect to biological data
units stored at various network locations. Biological data units
can be considered as a set of information that is known or can be
predicted to be associated with certain segments of genome
sequences. Biological data units will generally be comprised of one
or more headers associated with or relating to a payload containing
a representation of segmented DNA sequence data or other
non-sequential data of interest.
[0054] The biological data units may be generated by dividing
source DNA sequences into segments and associating one or more
headers (also referred to herein as "BI headers" or annotations or
attributes) with one or more segments of genome sequence data. The
various component parts of the header information contained in
biological data units (e.g., XML metadata files) can be stored in
distributed storage containers that are accessible on a network.
Furthermore, the different segments of a whole genome sequence data
contained in the payload of biological data units may be stored in
multiple BAM files at various different locations on a network.
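The segmentation step described above can be illustrated with a
minimal Python sketch. It is not the disclosed implementation: the
class name BiologicalDataUnit, the function segment_sequence, the
fixed segment length and the header field names are all assumptions
chosen for illustration, and a plain string stands in for the BAM
payload.

```python
from dataclasses import dataclass, field


@dataclass
class BiologicalDataUnit:
    """One segment of sequence data plus its associated BI headers."""
    headers: dict = field(default_factory=dict)  # annotation name -> value
    payload: str = ""                            # segmented sequence data


def segment_sequence(sequence, chromosome, segment_length):
    """Divide a source sequence into fixed-length segments (an assumed
    scheme) and attach a positional header to each segment."""
    units = []
    for start in range(0, len(sequence), segment_length):
        segment = sequence[start:start + segment_length]
        headers = {
            "chromosome": chromosome,
            "start": start + 1,           # 1-based, inclusive
            "end": start + len(segment),  # 1-based, inclusive
        }
        units.append(BiologicalDataUnit(headers=headers, payload=segment))
    return units


units = segment_sequence("ACGTACGTAC", chromosome="1", segment_length=4)
for unit in units:
    print(unit.headers, unit.payload)
```

In practice the segment boundaries would more likely follow
biological features (genes, exons, introns) than a fixed length, and
the headers and payloads could be held in separate containers.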
[0055] Each BI header can be considered a specific piece of
information or set of information that may be associated with or
have biological relevance to one or more specific segments of DNA
sequence data within the payload of the biological data unit. It
should be appreciated that any information that is relevant to the
segmented sequence data payload of a biological data unit can be
placed in the one or more headers of the data unit or, as is
discussed below, within headers of other biological data units. It
should also be clearly understood that the information contained in
any biological data unit can be highly distributed and network
linked in such a manner that allows filtration and dynamic
recombination of any permutation of associated attributes and
sequence segments.
[0056] The headers may be arranged in any order, whether dependent
upon or independent of the payload data. However, in one embodiment
the headers are each respectively associated with at least one
layer of a biological data model of existing knowledge that is
representative of the biological sequence data which, for example,
may be stored as BAM files within the payloads of the distributed
biological data units with which such headers or XML metadata
attributes are associated.
[0057] Although the present disclosure provides specific examples
of the use of BI headers in the context of a layered data model, it
should be understood that BI headers may be realized in essentially
any form capable of embedding information within, or associating
such information with, all or part of any biological or other
polymeric sequence or plurality thereof. For example, one or more
BI headers could be associated with any permutation of segments of
DNA sequence or other such polymeric sequence or within any
combination thereof, in any analog or digital format.
[0058] The BI headers could also be placed within a representation
of associated polymeric sequence data, or could be otherwise
associated with any electronic file or other electronic structure
representative of molecular information. In other words, the one or
more metadata attributes that are stored in multiple storage
containers on a network may compose headers that are specifically
associated with at least one segment of sequence contained in a
file transfer session.
[0059] In the case in which data is embedded within DNA or other
biological sequence information, the BI headers or tags including
the data may be placed in front of, behind or in any arbitrary
position within any particular segmented sequence data or multiple
segmented data sequences. In other words, in one particular
embodiment of the invention, information that is associated
directly or indirectly may be stored within the base calls of reads
that are contained in BAM files or any other sequence file format
or internal memory structures, for example. This approach would
involve a method for integrating at least one specific attribute
of information that is associated with a genome sequence between
and/or among the base calls contained within reads of sequence data
files.
[0060] In addition, the data may be embedded in a contiguous or
dispersed manner among and within the base calls of the segmented
sequence data. When this highly structured and layered approach is
applied to the storage configuration of this sequence data and
associated information it will advantageously facilitate the
computationally efficient, effective and rapid analysis of, for
example, the massive quantities of genome sequence data being
generated by next-generation, high-throughput DNA sequencing
machines.
[0061] In particular, distributed biological data units containing
segmented DNA sequence data and associated attributes may be
stored, sorted, filtered and operated on for various scope and
depth of analysis based upon the said associated information which
is contained within the headers. This obviates the need to
manipulate, transfer and otherwise breach the security of the
segmented DNA sequence data in order to process and analyze such
data.
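A minimal Python sketch of header-based filtering and sorting
follows. It assumes a hypothetical header index in which each record
carries annotations and a reference to the stored payload; the field
names, identifiers and payload references are illustrative only. The
sequence data itself is never read.

```python
# Header-only index: each record carries annotations and a pointer to the
# payload, but never the sequence data itself. All values are hypothetical.
HEADER_INDEX = [
    {"unit_id": "u1", "gene": "GENE_C", "pathway": "pathway_x",
     "payload_ref": "node-a/seg-0001"},
    {"unit_id": "u2", "gene": "GENE_A", "pathway": "pathway_y",
     "payload_ref": "node-b/seg-0417"},
    {"unit_id": "u3", "gene": "GENE_B", "pathway": "pathway_x",
     "payload_ref": "node-c/seg-0032"},
]


def filter_units(index, **criteria):
    """Return the records whose headers match every given criterion."""
    return [record for record in index
            if all(record.get(key) == value
                   for key, value in criteria.items())]


def sort_units(records, key):
    """Order records by a single header field."""
    return sorted(records, key=lambda record: record[key])


matches = sort_units(filter_units(HEADER_INDEX, pathway="pathway_x"), "gene")
print([record["unit_id"] for record in matches])
```

Only the payload references of the matching records would need to be
followed, and only if a later processing step actually requires the
sequence data.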
[0062] One embodiment of the layered data model of the existing
body of relevant knowledge includes not only biologically-relevant
data but also other metadata which are
associated with the nucleic acid sequence files. Such
MetaIntelligence.TM. metadata may include, for example, facts,
information, knowledge and prediction derived from biological,
clinical, pharmacological, environmental, medical or other
health-related data, including but not limited to other biological
sequence data such as methylation sequence data as well as
information on differential expression, alternative splicing, copy
number variation and other related information.
[0063] The DNA sequence information included within the biological
data units described herein may be obtained from a variety of
sources. For example, DNA sequence information may be obtained
"directly" from DNA sequencing apparatus, as well as from sequence
data files that are stored in private and publicly accessible
genome data repositories. Additionally, it may be computationally
derived and/or manually gathered or inferred. In the case of the
database of Genotypes and Phenotypes at the National Center for
Biotechnology Information at the National Library of Medicine, the
DNA sequence entries may be stored as BAM, SRF, fastq as well as in
the FASTA format, which includes annotated information concerning
the sequence data files. In one embodiment certain of the
information contained within the one or more headers of each
biological data unit would be obtained from publicly accessible
databases containing genome data sequences.
[0064] Turning now to FIG. 1, a representation is provided of a
biological data unit comprised of a payload containing DNA sequence
data and a header containing information having biological
relevance to the DNA sequence data within the payload. Furthermore,
it should be appreciated that information contained in a particular
header may also point to or be associated with sequence data that is
stored in at least one data container as the payload portion of
biological data units.
[0065] In addition, it should be understood that the header
information and sequence payload that is contained within
biological data units relate directly to attributes in XML metadata
files and BAM sequence files, respectively. Any key value can
associate with one or more sequence files or segments of sequence
within such files. In one particular aspect of the disclosed
approach, the key value may be information of or pertaining to a
drug or its effect and the sequence may be a segment of sequence
contained in a GeneTorrent.TM. Object file transfer session.
[0066] The header information may associate with or relate to for
example a microRNA sequence or the regulatory region of a gene or
interaction with another gene product from at least one molecular
pathway. Since the example that is presented as FIG. 1 shows that
the payload contains DNA sequence data, the biological data unit of
FIG. 1 may also be referred to herein as a DNA protocol data unit
(DPDU). The DPDU can be considered as distributed biological data
units that are encapsulated with information for transfer, control
and other data that is relevant to the protocol.
[0067] In one embodiment, the exemplary biological data unit that
is depicted in FIG. 1 would be associated with the DPDUs that are
encapsulated and involved in a computer-implemented method for
processing data units. For example, RNA sequence data, which may be
derived from RNA-seq or deduced from the DNA sequence data, could be
included within RNA protocol data units (RPDUs), each comprised of a
plurality of RNA-specific headers and a payload comprised of the RNA
sequence data. The
header information contained in distributed components of RPDUs may
include but not be limited to information on differential
expression, splicing, processing and other posttranscriptional
modifications of RNA.
[0068] Similarly, a protein protocol data unit (PPDU) may be comprised
of peptide-specific headers and a payload containing a representation
of amino acid sequence data. The biological sequence data that is
contained in the payload of PPDUs may be from mass
spectrometry protein sequencing data or deduced from the DNA
sequence data of the DPDU of FIG. 1. Furthermore, the header
information may be information such as the protein's concentration
in body fluids or the extent of protein activity which could also
be associated with the DPDU(s) of the representative gene.
A Network-Based Layered Biological Data Model
[0069] Referring now to FIG. 2, representation of genome sequence
data using distributed biological data units having header
information corresponding to the different layers of the biological
data model 200 is expected to facilitate efficient processing of
such sequence data. For example, in cases in which it is desired to
query one or more data containers containing large numbers of
biological data units, the multi-layered representation of FIG. 2
enables queries to be configured in such a manner to be analyzed
using only the information within the XML metadata files that
contain portions of the distributed data units and without the need
to directly examine the segmented sequence data contained within
the payload of such data units.
[0070] As a consequence, data from different smart repositories can
be processed in real time, and access to various types of data
allows for more sophisticated analysis of biological, medical,
clinical and other related datasets. This is believed to represent
a significant advance relative to conventional database-centric
processing techniques, which typically rely upon evaluation of the
entirety of the sequence information stored within a database.
[0071] It should be appreciated that the multi-layered,
multi-dimensional data architecture represented by FIG. 2 provides
but one example of the many different architectures capable of
being implemented using biological data units containing headers.
It should also be understood that the data layers are exemplary and
not intended to limit the scope or extent of the invention. As
shown in FIG. 2, the biological data model 200 includes a DNA layer
210, an RNA layer 220, a protein layer 230, a systems biology layer
240, an application layer 250, a top level field-specific layer
260, a medical data layer 270, a molecular pathways layer 280 and a
management layer 290. In various embodiments the information
associated with each of these layers may be included within the
header and/or payload of biological data units that are configured
in a way that is consistent with the data model 200.
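One way to make the association between headers and layers concrete
is sketched below in Python. The enumeration members mirror the
layers named above, and their values simply reuse the FIG. 2
reference numerals as labels; the header record layout and the
function headers_in_layer are assumptions made for illustration.

```python
from enum import Enum


class Layer(Enum):
    """Layers of the data model; values reuse the FIG. 2 numerals."""
    DNA = 210
    RNA = 220
    PROTEIN = 230
    SYSTEMS_BIOLOGY = 240
    APPLICATION = 250
    FIELD_SPECIFIC = 260
    MEDICAL = 270
    MOLECULAR_PATHWAYS = 280
    MANAGEMENT = 290


def headers_in_layer(headers, layer):
    """Select the headers of a data unit that belong to one layer."""
    return [header for header in headers if header["layer"] is layer]


# Hypothetical headers of a single biological data unit.
unit_headers = [
    {"layer": Layer.DNA, "name": "chromosome", "value": "1"},
    {"layer": Layer.RNA, "name": "splice_junctions", "value": 4},
    {"layer": Layer.DNA, "name": "exon_count", "value": 5},
]
dna_headers = headers_in_layer(unit_headers, Layer.DNA)
print([header["name"] for header in dna_headers])
```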
[0072] The DNA layer 210 will generally contain information, data
and knowledge associated with DNA found in public and private
databases, as well as information published or generally accepted
by the scientific community as being credible. For example and
without limitation, the information included within the DNA layer
210 may comprise: 1) the nucleotide sequence of a DNA segment, 2)
chromosome number, positions and location, 3) nucleotide start and
end positions of a particular segment of sequence, 4) name of the
gene if and when the segment encodes a known gene, 5) annotations for
the enhancer and promoter region, 6) identification of open reading
frames that are present within the segment of genome sequence, 7)
transcription start site and start codon used for translation, 8)
annotations for the identification of introns and exons, 9) known,
unknown and predicted mutations, 10) the various types of
mutations, 11) phenotypic effects, and 12) any metadata or annotation
or knowledge or possible predictions on any sequence of DNA found
in any other database.
[0073] The RNA layer 220 is positioned adjacent to and is
intimately associated with the DNA layer 210. The information
included within this pair of layers is highly interrelated. The RNA
layer 220 contains information that is related to or pertaining to
RNA sequence, modification, function and structure. In certain
embodiments this layer may contain information relating to various
types of RNA including, for example, mRNA, tRNA, rRNA, miRNA,
siRNA, and other non-coding RNAs. The layer 220 may also include
information concerning snRNA involved with splicing and guiding RNA
in telomerase.
[0074] Examples of specific information which may be included
within the RNA layer 220 include, without limitation: 1) the
primary base sequence of the pre-mRNA and mature mRNA sequences, 2)
information on the sequences and locations of known and predicted
ribosome binding sites, 3) initiation site for protein synthesis or
translation start codon, 4) processing and molecular modification
of mRNA, 5) positions and sequence of splice junctions, 6) known and
predictable alternative splicing data, 7) polyA tail data, 8)
microRNA binding data, 9) RNA expression data from microarray and
polysome analysis, and 10) essentially any other data concerning
RNA contained within any other database.
[0075] In the exemplary representation of FIG. 2, the protein layer
230 resides directly on top of the RNA layer 220. In this
configuration, information flows from the RNA layer 220 to the
protein layer 230 and can associate with information from the DNA
layer 210 through the RNA layer 220. This means, for example, that
data from the prior knowledge information contained in the protein
layer 230 can be processed and analyzed along with existing
knowledge from the DNA layer. The following types of information
may, for example and without limitation, be included within this
layer: 1) amino acid sequence of a protein, 2) any available
existing information on the post-translational modifications of a
protein encoded by the segmented genome sequence, 3) any
information on the activity of a protein or related polypeptides,
4) information on the crystal structure, 5) NMR data, 6)
well-established mass spectrometry data that is relevant to the
segmented sequence, 7) any information on protein-protein
interactions, 8) any protein-nucleic acid interactions, 9) any
pathway involvement information, 10) other data, related
information, annotation and attribute information concerning any
protein, polypeptide or nascent peptide published or stored within
any other accessible genome data repository.
[0076] The biological systems layer 240 may include information
relating to, for example and without limitation, transcriptomics,
genomics, epigenomics, proteomics, metabolomics and other
biological-system-related data. As the field of bioinformatics
advances further, this layer may be scaled to accommodate other
systems-level information, e.g., interactomics, immunomics,
chromosomomics, and the like. The biological systems layer
240 is preferably situated between the protein layer 230 and the
application layer 250. The application layer 250 serves to
facilitate user-definable interaction with the prior knowledge that
is included within lower layers of the data model 200. The
application layer 250 may use application-specific filtering of
attributes to deliver query, analysis and processing results in
real time.
[0077] The top-level expert application layer 260 uses data from
microarray gene expression analysis, mass spectrometry proteomics
data, copy-number variation data, single nucleotide polymorphisms
and/or other data related to disease conditions, phenotypic
expression, behavior, pharmacogenetics, epigenetic markers to run
applications relating to processing, transport, analysis,
compression, retrieval, storage and any other such operation
capable of being applied to biological sequence data. In the
embodiment of the data model of existing knowledge that is
represented in FIG. 2, the layer 260 resides on top of the cubical
data model 200 along with the suite of application layer software
programs and related information in section 250, and is adjacent
the medical data layer 270.
[0078] The medical information layer that is presented in section
270 may contain, without limitation, clinical data, personal health
history and record data, medication data, lab test result data,
image data (mammograms, x-ray, MRI, CAT scan, ultrasound, etc.),
any other relevant, related, correlated or associated data. In this
case, accepted discoveries, knowledge, calculations or predictions
that are strongly linked with the clinical measurements and
information may be configured in a way that is consistent with the
ability to interrogate this prior knowledge base with metadata
attributes.
[0079] The molecular pathways layer 280 will generally include
information concerning pathways and molecular systems as well as
the proteins, nucleic acids and metabolites that participate in the
biological cycle. This layer of the layered molecular model may
include specific information on the differential expression of
certain genes at the level of organs, tissues, cell types, systems
and pathways as they are related to the pertinent data found in
headers of the biological data units that are involved in the
response to a query. In another aspect of the invention the
information represented in the pathway layer 280 may involve the
measure of specific molecular activities of the proteins that are
participants in a particular pathway.
[0080] The metadata attribute information that resides within the
layer 280 of the layered data model of existing knowledge may be
focused on, for example and without limitation, protein-protein
interactions, protein-nucleic acid interactions, as well as the
various types of interactions that may exist between and among
different molecules of nucleic acids and protein-metabolite
interactions. This type of information could prove to be very
powerful for elucidating key biological pathways, and thus may be
incredibly useful for identifying new and important drug targets.
Furthermore, the information that is comprised in this layer may
also include, for example, sequence data and annotations in pathway
specific databases such as Reactome, IntAct and Rhea at EBI. The
management layer 290 sits atop the z-dimension of layers within the
prior knowledge data model 1600 and serves as the engine that
controls and manages the flow of data across the cubical
structure.
[0081] As may be appreciated with reference to FIG. 2, the
illustrated biological data model is representative of the
associations between and among layers of existing knowledge as well
as the intra and interrelationships that exist among and between
the highly distributed biological data units described above. In
particular, the headers consisting of information pertaining to the
DNA-specific, RNA-specific and peptide specific biological data
units are each associated with at least one of the "layers" of the
biological data model of FIG. 2, i.e., the DNA, RNA and peptide
layers, respectively.
[0082] Alternatively, a given biological data unit which may be
stored in multiple storage containers may comprise a payload
containing a representation of biological sequence data and a
plurality of headers, each of which is associated with one or more
of the layers of the biological data model of FIG. 2. As is
discussed below, although each header may be characterized as being
associated with a certain layer of a data model, each may also
point to or otherwise reference information in the header or
payload of a separate biological data unit, which may be stored in
multiple storage containers and may further be associated with a
different layer of the biological data model.
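The cross-referencing of headers between data units can be sketched
as follows. This is an illustrative assumption rather than a
disclosed format: the "ref" field, the unit identifiers and the
function follow_references are hypothetical, and each unit is reduced
to a flat dictionary of header values.

```python
# Hypothetical header store keyed by unit identifier. A "ref" entry names
# another unit whose headers hold related information in a different layer.
UNITS = {
    "dpdu-1": {"layer": "DNA", "gene": "GENE_A", "ref": "rpdu-1"},
    "rpdu-1": {"layer": "RNA", "splice_variants": 2, "ref": "ppdu-1"},
    "ppdu-1": {"layer": "protein", "activity": "unknown", "ref": None},
}


def follow_references(units, start_id):
    """Return the chain of unit ids reachable from start_id via "ref".

    Already-visited ids are skipped so that a circular reference cannot
    cause an endless loop.
    """
    chain, seen = [], set()
    current = start_id
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = units[current].get("ref")
    return chain


chain = follow_references(UNITS, "dpdu-1")
print([(unit_id, UNITS[unit_id]["layer"]) for unit_id in chain])
```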
[0083] Headers may be associated with any form of intelligence or
information capable of being represented as headers, tags or other
parametric information which relates to the biological sequence
data within the payload of a biological data unit. Alternatively or
additionally, headers may point to relevant or unique (or
arbitrarily assigned for the processing purpose) information that
is associated with the biological sequence data within the
payload.
[0084] A header may be associated with any information which is
either known or predicted based upon scientific evidence, and may
also serve as a placeholder for information which is currently
unknown but which later may be discovered or otherwise becomes
known. For example, such information may include any type of
information related to the source biological sequence data
including, for example, analytical or statistical information,
testing-based data such as gene expression data from microarray
analysis, theories or facts based on research and studies (either
clinical or laboratory), or information at the community or
population level based study or any such related observation from
the wild or nature.
[0085] In one embodiment relevant information concerning a certain
segment of DNA sequence or biological sequence data may be
considered metadata and could, for example, include clinical,
pharmacological, phenotypic or environmental data capable of being
embedded and stored in more than one storage container but with
very close association with the sequence data as part of the
payload or included within a look-up table.
[0086] One distinct advantage to storing metadata and sequence
files in a manner that allows for effective and robust tracking and
linking of the data is that it enables DNA and other biological
sequences that make up large data files to be more efficiently
processed and managed. The type of information that may be embedded
or associated with segments of DNA sequences or any other
biological, chemical or synthetic polymeric sequence can be
represented in the form of packet headers, but any other format or
method capable of representing this information in association with
one or more segments of biological sequence data within a data unit
is within the scope of the teachings presented herein.
[0087] The systems described herein are believed to be capable of
facilitating real-time processing of biological sequence data and
other related data such as, for example and without limitation,
gene expression data, deletion analysis from comparative genomic
hybridization, quantitative polymerase chain reaction, quantitative
trait loci data, CpG island methylation analysis, alternative
splice variants, microRNA analysis, SNP and copy number variation
data as well as mass spectrometry data on related protein sequence
and structure. Such real-time processing capability may enable a
variety of applications including, for example, medical
applications.
[0088] The types of medical applications that could be facilitated
by this approach may include an automated computer-implemented
algorithm that allows the storing, filtering, sorting and tracking
of an individual's whole genome sequence in segments as they relate
to all the attributes and annotations in association with a
biological data model of existing knowledge to extract meaningful
and relevant results to specific queries. The processing and
analysis of this data will unveil a new class of rich information
that can be utilized in accordance with the layered data model of
prior knowledge.
[0089] BI headers may be used for the embedding of biologically
relevant information, in full or in part, in combination with any
polymeric sequence or part or combination thereof, and may be
placed at either end of such polymeric sequence or in association
within any combination of such polymeric sequences. In addition,
embedded information can be considered to be information that is
clustered and linked in such a way that relevant information that
is related to sequence data files are linked to allow for
precipitation of meaningful new insight. Furthermore, the various
components of the metadata information and sequence segments can be
accessible from multiple storage containers on a network.
[0090] BI headers may be configured to be in any format and may be
associated with one or more segments of polymeric sequence data.
Furthermore, in certain cases the components of biological data
units may be stored in a centralized container and in such case the
BI Headers may be positioned in front of or behind (tail) the
polymeric sequence data, or at any set of arbitrary locations
within the representation of the segmented sequence data. Moreover,
the BI headers may comprise contiguous strings of information or
may be themselves segmented and the constituent segments placed
(randomly or in accordance with a known pattern) among and between
the segments of sequence data which is comprised within one or more
biological data units.
[0091] The use of BI headers in representing genome sequence data
in a structured format advantageously provides an enhanced
capability for classifying and filtering the sequence data based
upon any of several stored existing knowledge fields that are
related to the said sequence segment. This approach allows for the
sequence data to be sorted based on the abstracted descriptive
information which is contained within the BI headers relating to
the segmented sequence data of a specific biological data unit.
[0092] For example, the segmented genome sequence data represented
by a plurality of biological data units could be processed such
that a particular gene that is normally known to be located at a
certain position on chromosome 1 could be sorted along with other
genes or gene products from the same or a different chromosome if
the corresponding genes or gene products are associated with a
particular molecular pathway, drug treatment, health condition,
diagnosis, disease or phenotype. Alternatively, it should be known
that certain chromosomal rearrangements could generate a similar
result when a portion of one chromosome is transferred through
translocation and becomes part of another.
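The regrouping described in the preceding paragraph can be sketched
in Python as below. The gene names, positions and pathway labels are
placeholders, not real annotations, and the function group_by is an
assumed helper; the point is only that the same headers support
grouping by chromosome or by pathway without touching the payloads.

```python
from collections import defaultdict

# Placeholder header records; none of these values are real annotations.
GENE_HEADERS = [
    {"gene": "GENE_A", "chromosome": "1", "position": 1200,
     "pathway": "pathway_x"},
    {"gene": "GENE_B", "chromosome": "7", "position": 560,
     "pathway": "pathway_x"},
    {"gene": "GENE_C", "chromosome": "1", "position": 300,
     "pathway": "pathway_y"},
]


def group_by(headers, field_name):
    """Group gene names by the value of one header field."""
    groups = defaultdict(list)
    for header in headers:
        groups[header[field_name]].append(header["gene"])
    return dict(groups)


by_chromosome = group_by(GENE_HEADERS, "chromosome")
by_pathway = group_by(GENE_HEADERS, "pathway")
print(by_chromosome)
print(by_pathway)
```

Here GENE_A and GENE_B sit on different chromosomes yet fall into the
same group once the grouping key is the pathway header.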
[0093] In the general case not all of the segments of DNA sequence
data within the set of biological data units resulting from
segmentation of an individual genome will directly associate with
every field of the applicable BI header attributes. For example, a
certain biological data unit may contain a segment of DNA sequence
lacking an open reading frame, in which case the exon count field
of the DNA-specific BI header would not be applicable. In any case,
the particular header information type along with other header
information types are maintained as place holders for future
scaling of the depth and scope of intelligence that is contained
within the XML metadata files. This permits biological information
relating to the segmented DNA sequence data of a certain biological
data unit which is not yet known to be easily added to the
appropriate layer of the biological data model once the information
becomes known and, in certain cases, scientifically validated.
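A minimal sketch of the placeholder idea follows, assuming a
hypothetical fixed set of DNA-specific header fields. Fields that are
unknown or not applicable are kept with the value None so that they
can be filled in later without changing the structure of the header.

```python
# Assumed field set for a DNA-specific BI header (illustrative only).
DNA_HEADER_FIELDS = ("gene_name", "exon_count", "phenotypic_effects")


def new_dna_header(**known):
    """Create a header in which every field exists; unknown or
    not-applicable fields are kept as None placeholders."""
    unexpected = set(known) - set(DNA_HEADER_FIELDS)
    if unexpected:
        raise KeyError(f"unknown header fields: {sorted(unexpected)}")
    header = dict.fromkeys(DNA_HEADER_FIELDS)  # every value starts as None
    header.update(known)
    return header


# A segment lacking an open reading frame: exon_count stays a placeholder.
header = new_dna_header(gene_name="GENE_A")
print(header)

# Later, once the information becomes known, the placeholder is filled.
header["phenotypic_effects"] = "to be validated"
print(header)
```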
[0094] In certain exemplary embodiments disclosed herein, the
biological or other polymeric sequence data contained within the
payload of a biological data unit is represented in a two-bit
binary format. However, it should be appreciated that other
representations are within the scope of the teachings herein. For
example, the instruction set architecture described in co-pending
application Ser. No. 12/828,234 (the "'234 application") may be
employed in certain embodiments described herein to more
efficiently represent and process the segmented genome sequence
data within the payload of biological data units. Accordingly, in
order to facilitate comprehension of these certain embodiments, a
description is provided below of certain aspects of the instruction
set architecture described in the '234 application.
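For orientation, a two-bit binary representation of DNA sequence data
can be sketched in Python as follows. The particular bit assignment
(A=00, C=01, G=10, T=11) and the packing of four bases per byte are
assumptions made for this sketch; the sketch does not reproduce the
instruction set architecture of the '234 application.

```python
# Assumed two-bit code for each base (illustrative, not from the source).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(sequence):
    """Pack a DNA string into bytes, four bases per byte.

    Returns the packed bytes and the base count, which is needed to
    discard padding when the length is not a multiple of four.
    """
    packed = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    return bytes(packed), len(sequence)


def unpack(packed, length):
    """Recover the DNA string from packed bytes and the base count."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed, length = pack("ACGTA")
print(packed.hex(), length)
print(unpack(packed, length))
```

With this scheme five bases occupy two bytes, compared with five
bytes for one ASCII character per base.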
Representation of Polymeric Sequence Data Using Biological Data
Units
[0095] One aspect of the present disclosure describes an innovative
methodology for biological sequence manipulation well-suited to
address the difficulties that are related to the processing and
comparative sequence analysis of large quantities of DNA sequence
data. The disclosed methodologies enable segmented representations
of such sequence data to be efficiently stored (either locally or
in a distributed fashion), searched, moved, processed, managed and
analyzed in an optimal manner in light of the demands of specific
applications.
[0096] The disclosed method involves breaking whole genome DNA
sequence entries into deliberate segments and packetizing the
fragments in association with header information to form biological
data units. In one embodiment much of the header information may be
obtained from private or public databases containing information
pertaining to involved molecular pathways, from drug databases, or
from published research data that can be found in well-established
databases such as, for example, dbGaP and EMBL. The DNA sequence
entries within many public databases may be stored in a BAM file
format, which accommodates the inclusion of annotated information
concerning the sequence. For example, an entry for a DNA sequence
recorded in the BAM file format could include annotated information
identifying the name of the organism from which the DNA was
isolated and the gene or genes contained in the specific sequence
entry.
[0097] Alternatively, the sequence file may contain the base
sequence information while the ancillary metadata information could
be contained in XML files as specific attributes that are
associated with a particular segment of the sequence. The
associated information that is contained in these files may relate
to prior knowledge that is configured in a biological model that
is consistent with a layered data model.
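The separation of sequence files from XML metadata attributes can be
sketched with Python's standard xml.etree.ElementTree module. The XML
layout shown here (element names, attribute names, identifiers and
the file name sample.bam) is entirely hypothetical; the disclosure
does not specify a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata layout linking attributes to a sequence segment.
METADATA_XML = """
<metadata>
  <segment id="seg-0001" file="sample.bam" chromosome="1"
           start="1" end="1000">
    <attribute layer="DNA" name="gene_name">GENE_A</attribute>
    <attribute layer="RNA" name="splice_variants">2</attribute>
  </segment>
</metadata>
"""


def attributes_by_segment(xml_text):
    """Map each segment id to its location and its named attributes."""
    root = ET.fromstring(xml_text.strip())
    segments = {}
    for segment in root.iter("segment"):
        segments[segment.get("id")] = {
            "file": segment.get("file"),
            "range": (int(segment.get("start")), int(segment.get("end"))),
            "attributes": {attr.get("name"): attr.text
                           for attr in segment.iter("attribute")},
        }
    return segments


segments = attributes_by_segment(METADATA_XML)
print(segments["seg-0001"])
```

The sequence file named in the metadata is never opened; only the
coordinates needed to locate the segment are recorded.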
[0098] In addition, information identifying the chromosome from which
the particular DNA sequence segment was obtained, and the
starting and ending base positions of the sequence, would also
typically be available. Furthermore, other public and private
databases include information relating to, for example, the
location of human CpG islands and their methylation sequence, as
well as the genes with which such islands are associated (see,
e.g., http://data.microarrays.ca/cpg/index.htm).
[0099] For each identifiable gene there will be an essential need
for a normal control state of the particular gene. Database entries
that contain genes that are identified as being associated with a
RefSeqGene, which pertains to a project within NCBI's Reference
Sequence (RefSeq) project, provide another potential source of
header information. The RefSeqGene project, which is a part of the
Locus Reference Genomic (LRG) project, defines the DNA sequences of
genes that are well-characterized by leaders in the scientific
community, for use as reference standards. In particular,
sequences labeled with the keyword RefSeqGene serve as a stable
foundation for reporting mutations, for establishing conventions
for numbering exons and introns, and for defining the coordinates
of other biologically significant variation. DNA sequence entries
that associate directly with the RefSeqGene will be well-supported,
exist in nature, and, to the extent possible,
represent a prevalent, `normal` allele.
[0100] It should be appreciated that there may be different schemas
for segmenting and packetizing sequence entries in order to
associate the highly relevant attribute information with specific
sequence segments. For example, in the case in which it is suitable
to segment sequence entries into packets containing genes or,
alternatively, into introns and exons, relevant data is available
for placement into the header information relating to the metadata
attributes of the biological data units containing such sequence
segments.
Biological Data Units Including Headers
[0101] Referring again to FIG. 1, the header 110 is seen to include
a number of fields containing information of biological relevance
to the DNA sequence data within the payload 120 of the biological
data unit 100. The information that is contained within the header
may be stored in multiple containers on a biological data network.
See, e.g., FIG. 5.
[0102] In one approach, biological data units are created at least
in part by specifically linking information from XML metadata files
with particular segments of BAM file sequence data. In this case,
a biological data unit can be considered a unit of information
having a certain relationship that can be stored at, or streamed
from and to, multiple nodes on a network. The information that is
contained within the BI header may be distributed and is able to
link with sequence segments specifically. The protocols used for
the transmission of these precisely related clusters of information
in biological data units are integrated with a computer-implemented
program that defines and classifies the links between and among the
header information and the segment of sequence payload.
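The linking of XML metadata to sequence segments described above can be sketched as follows. This is only an illustration under stated assumptions: the XML element and attribute names, the segment identifiers and the in-memory dictionary standing in for BAM file content are all hypothetical and are not taken from any particular repository format.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata layout; element and attribute names are assumptions.
METADATA_XML = """
<segments>
  <segment id="seg-1" chromosome="17" start="100" end="103">
    <attribute name="gene_id">BRCA1</attribute>
    <attribute name="organism">Homo sapiens</attribute>
  </segment>
  <segment id="seg-2" chromosome="13" start="200" end="205">
    <attribute name="gene_id">BRCA2</attribute>
  </segment>
</segments>
"""

# Stand-in for sequence segments that would be read from a BAM file.
SEQUENCE_SEGMENTS = {"seg-1": "ACGT", "seg-2": "GGCATA"}


def build_data_units(xml_text, segments):
    """Link each XML metadata entry to its sequence segment by segment id."""
    units = []
    for node in ET.fromstring(xml_text).findall("segment"):
        header = dict(node.attrib)              # positional attributes
        for attr in node.findall("attribute"):  # named metadata attributes
            header[attr.get("name")] = attr.text
        seg_id = header.pop("id")
        units.append({"header": header, "payload": segments[seg_id]})
    return units


units = build_data_units(METADATA_XML, SEQUENCE_SEGMENTS)
```

Each resulting unit pairs a header dictionary with its payload, so the header can be inspected without touching the sequence data itself.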
[0103] It should be appreciated that FIG. 1 provides only one
specific exemplary representation of the type of biologically
relevant information which may be included within a header of
distributed biological data units. Accordingly, including other
types of relevant attributes and information within a header or the
equivalent, regardless of how the data is represented or
configured, is believed to be within the scope of the present
disclosure.
[0104] In addition, although the following generally describes
information as being contained or included within various sections
of the header 110, it should be understood that in various
embodiments such headers may be distributed and may contain pointers,
tags or links to other structures or memory locations storing the
associated header information.
[0105] Similarly, the payload 120 may contain a representation of
the segmented DNA sequence data of interest, or may include one or
more pointers or links to other structures or locations containing
a representation of such sequence data. In this case, the various
segments of a particular whole genome sequence may be stored in a
distributed manner in multiple containers that are accessible on a
network.
[0106] A first section 101 of the header 110 provides information
concerning CpG methylation sequence data that pertains to the
various positions of the DNA sequence segment within the payload
120 of the biological data unit 100. In other words, the
information that is contained in the ancillary files that are
associated with the sequence points to section 101. Identification
of these CpG islands and the methylation sequence will likely play
an important role in understanding regulation of the associated
genes and any involvement with disease.
[0107] The header 110 also includes, in section 102, a chromosome
banding pattern property containing information concerning any
chromosomal rearrangement that is observed, known, as yet unknown,
and/or predicted to be involved with at least one segment of genome
sequence data linked to this attribute. These types of cytogenetic
abnormalities are often
associated with severe phenotypic effects. This information may be
configured to be in any other format to represent the genomic
effects of chromosomal rearrangements which are known to be common
in cancer tumor genomics.
[0108] Header sections 103 and 104 provide information identifying
the beginning and ending positions for the exons that are contained
in the DNA sequence segment included within the payload 120. In the
case of whole exome sequencing this information represents exons
throughout the whole genome that are expressed in genes. Since exon
selection has tissue and cell type specificity, these positions may
be different in the various cell types resulting from a splice
variant or alternative splicing. Along with this DNA coding
information for individual exons, header section 105 may represent
information in a metadata file of a count of the number of exons
contained in the DNA sequence segment included within the payload
120. This type of information is known to be relevant in disorders
involving exon skipping and exon duplication.
[0109] Attribute information linking one or more DNA sequence
segments within the payload 120 to an associated disease is
contained within section 106. Information pertaining to certain
known molecular pathways or systems that may have molecular
interactions with other genes or gene products would also be
described within this section of the BI header. Alternatively,
since variations of a given gene could be involved in one or more
diseases, such information would also generally be contained
within header section 106.
[0110] To the extent the DNA sequence segment in the payload 120
contains a part of a gene, a gene or plurality of genes, then the
header section 107 provides all of the pertinent information that
relates specifically to the applicable known gene name or gene ID.
Header section 108 may represent the type of information that
specifies the tissue or cell type which may be relevant to the
extent and level of expression of the various exons that may be
encoded in the said gene or segment of genome that is described in
section 105.
[0111] The metadata attribute located in the header section 109
will provide information concerning all possible open reading
frames present within the segment of genome sequence data that is
contained within the payload 120. This type of attribute will be
crucial for characterizing disease associated variants which are
contained within what appears to be open reading frames that
express no proteins or peptides that are detectable with today's
methods.
[0112] Header sections 110 and 111 represent the metadata
annotations that specify the start and end positions of the DNA
sequence segment that is linked to a specific segment of a BAM
file, represented by the payload 120. These positions may be
considered arbitrary since the positions in the sequence could be
defined relative to more than one reference sequence.
[0113] Section 112 indicates if the segmented DNA sequence data
within the payload 120 is chromosomal, microbial or mitochondrial.
Furthermore, section 113 provides information concerning the genus
and species of the origin of the DNA sequence segment represented
within the payload 120. It should be appreciated that sections 112
and 113 will provide the information that describes all the DNA
sequence data that is associated with an individual including but
not limited to microbes attached on the outside and found on the
inside of said individual as well as genome sequence data from
plants and other higher animals found in the digestive tract.
[0114] All of the metadata annotations and attributes that are
within the header 110 will generally contain prior knowledge
information that is relevant to the DNA sequence and that is
functionally utilized while the data is being sorted, filtered and
processed. This packetized structure of the DNA
sequence data that is represented in bits and encapsulated with
headers and other relevant information advantageously facilitates
processing by existing network elements operative in accordance
with layered or stacked protocol architectures.
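By way of illustration only, the header sections described above might be modeled as a simple record type. The field names below are assumptions chosen for readability, and the comments indicate the section of FIG. 1 that each field is intended to mirror. The exon count is derived from the exon boundaries here as a design simplification, whereas the description treats it as a separately stored attribute.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Header:
    """Illustrative header; field names are hypothetical."""
    cpg_methylation: List[int] = field(default_factory=list)          # section 101
    chromosome_band: Optional[str] = None                             # section 102
    exon_bounds: List[Tuple[int, int]] = field(default_factory=list)  # sections 103/104
    disease: Optional[str] = None                                     # section 106
    gene_id: Optional[str] = None                                     # section 107
    tissue_type: Optional[str] = None                                 # section 108
    start_pos: Optional[int] = None   # start position annotation
    end_pos: Optional[int] = None     # end position annotation
    dna_type: str = "chromosomal"                                     # section 112
    organism: Optional[str] = None                                    # section 113

    @property
    def exon_count(self) -> int:
        """Exon count (section 105), derived here from the exon boundaries."""
        return len(self.exon_bounds)


@dataclass
class BiologicalDataUnit:
    header: Header
    payload: str  # segmented DNA sequence, or a reference to it


unit = BiologicalDataUnit(
    header=Header(exon_bounds=[(10, 50), (80, 120)], gene_id="BRCA1",
                  organism="Homo sapiens", start_pos=1, end_pos=150),
    payload="ACGT",
)
```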
[0115] For example, The Cancer Genome Atlas consortium has elected
to implement biological data units comprised of headers consisting
of information contained in XML metadata files and payloads
comprised of genome sequence data contained in the BAM files. In
this exemplary implementation a first specific type of information
may reference the tissue type or cell type of the sequence files
(section 108 of FIG. 1). Similarly, a second specific type of
information may reference a disease type (section 106 of FIG.
1).
[0116] Attention is now directed to FIG. 3, which depicts a
biological data unit 300 having a header 310 and a payload 320
containing an instruction-based representation of segmented DNA
sequence data. The type of information illustrated in the header 310
is exemplary. Moreover, this information may be stored in one or
more storage containers that are accessible on a network. The
instruction-based representation is discussed above and in the
copending '234 application. Although the content and
representations of the payloads 120 and 320 differ, the same type
of information is included within the headers 110 and 310 of the
biological data units 100 and 300, respectively.
[0117] The distributed packetizing of segmented DNA sequence data
files and the embedding of biologically and clinically relevant
information in biological data units will enable development of a
networked processing architecture within which such data may be
organized and configured in a layered format. Based on preliminary
results, the architecture is expected to be particularly suited for
effecting rapid analysis of large amounts of data of this type.
[0118] In one approach, the header contained within such
biological data units is used to qualify or characterize the
fragmented or otherwise segmented genome sequence data included
within the payloads of such data units. In so doing, biological
data units containing segmented DNA sequence data or other sequence
data may now be sorted, filtered and operated upon based on the
associated attribute information contained within the ancillary
metadata files of the highly distributed data units.
[0119] For example, a data repository containing biological data
units incorporating segmented DNA sequence data and related
attribute information similar to that associated with the header
110 of FIG. 1 may be quickly and efficiently sorted in accordance
with parameters defined by an application. This has recently been
demonstrated by a system that reduces to practice the concepts of
the present disclosure, namely the repository now known as the
Cancer Genome Hub (CGHub) operated by the University of
California. In other words, the same segments of
genome sequence may be sorted and analyzed in several different
ways by using the header information associated with, or otherwise
directly or indirectly linked to, the payload representation of the
sequence segments.
[0120] It is expected that it would be beneficial to arrange
and represent all of the genomic sequence information from an
individual, e.g., from bacteria, animals, plants to humans, in
accordance with the layered data architecture illustrated in FIG.
2. For example, consider the case in which a segment of a genome
sequence data file of interest is included as the payload of a
biological data unit stored in a data container which includes
biological data units associated with DNA sequence data of other
organisms.
[0121] Consider further that if, for example, the DNA sequence data
of interest is a particular variant of a human gene associated with
breast cancer, such as BRCA1, then such data could be extracted
from the container by filtering the contents of the data container
for metadata attributes associated specifically with the segment of
DNA sequence data from the organism Homo sapiens. The data units
containing the specific BRCA1 variant along with all other DNA data
packets containing human DNA sequence data may be easily extracted.
However, sorting human DNA sequence data from the DNA sequence data
of other organisms may not provide sufficient filtering in view of
the technical requirements of certain applications. Accordingly,
additional processing and comparative analysis may be performed in
which specific data units comprising certain segments of sequence
data from human chromosome 17 would be filtered from the data
container.
[0122] Biological data units having payloads containing DNA
sequence segments from chromosome 17 may provide a reasonable level
of filtering. However, in order to efficiently analyze the gene
most notably associated with breast cancer, further processing,
sorting and filtering will be necessary. This may be achieved using
several methods including but not limited to filtering on the
specific start and end positions within the chromosome (S pos and E
pos) or the gene ID (GID) or by disease (e.g., breast cancer). If the
biological data units that are being sorted contain sequence
segment data associated with an alternately-spliced variant of
BRCA1, then this information may be contained in the header
information representing the total exon count (see, e.g., header
section 105 of FIG. 1), as well as in the header sections
containing exon start and exon end information (see, e.g.,
header sections 103 and 104). Furthermore, additional information
concerning tissue or cell type may need to be provided in order to
perform the most intricate level of sorting and filtering of the
biological data units associated with a specific BRCA1 variant.
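The successive filtering steps described in the preceding paragraphs may be sketched as follows. This is a minimal illustration that assumes each biological data unit is a dictionary with a "header" mapping. The field names (organism, chromosome, gene_id, s_pos, e_pos, exon_count, tissue_type) and all of the values, including positions and exon counts, are placeholders rather than real genomic coordinates.

```python
def filter_units(units, **criteria):
    """Keep units whose header matches every given field value."""
    return [u for u in units
            if all(u["header"].get(k) == v for k, v in criteria.items())]


def within(units, start, end):
    """Keep units whose segment lies inside [start, end] on its chromosome."""
    return [u for u in units
            if start <= u["header"]["s_pos"] and u["header"]["e_pos"] <= end]


def make_unit(organism, chromosome, gene_id, s_pos, e_pos, exon_count, tissue):
    """Helper building a unit with placeholder values."""
    return {"header": {"organism": organism, "chromosome": chromosome,
                       "gene_id": gene_id, "s_pos": s_pos, "e_pos": e_pos,
                       "exon_count": exon_count, "tissue_type": tissue},
            "payload": "ACGT"}


container = [
    make_unit("Homo sapiens", "17", "BRCA1", 1000, 2000, 24, "breast"),
    make_unit("Homo sapiens", "17", "BRCA1", 1000, 2000, 22, "ovary"),
    make_unit("Homo sapiens", "17", "TP53", 5000, 6000, 11, "breast"),
    make_unit("Homo sapiens", "13", "BRCA2", 1000, 2000, 27, "breast"),
    make_unit("Escherichia coli", None, "lacZ", 10, 90, 1, None),
]

# Each step narrows the result of the previous one using header fields only.
human = filter_units(container, organism="Homo sapiens")
chr17 = filter_units(human, chromosome="17")
brca1 = filter_units(chr17, gene_id="BRCA1")
breast_variant = filter_units(brca1, tissue_type="breast")
by_position = within(chr17, 900, 2500)
```

Note that the payloads are never inspected; every step operates on the header attributes alone.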
[0123] The packetized structural configuration of the disclosed
distributed biological data units further enables functional
integration of layered data models such as that depicted in FIG.
2. In particular, each metadata attribute of a header that forms at
least a part of, or is linked to, a particular biological data unit
may be associated with one or more specific layers of the
model. One advantage of using a layered data model is that data
from the various layers may interrelate during processing of the
header information included within the set of biological data units
being operated on or otherwise analyzed. For example, in the
exemplary case described above, information from the RNA layer of
the model relating to the splicing of introns from pre-mRNA was
used to identify BRCA splice variants, thereby facilitating
correct determination of exon start and end positions.
[0124] The use of header information which is consistent with a
layered data architecture also advantageously enables substantial
changes to be made to the information associated with one layer of
the model without necessitating that corresponding modifications be
made to other layers of the model. For example, sequence variants
may be observed at splice donor and splice acceptor sites which may
change the splicing pattern and mRNA size, protein structure and
function, and these changes may yet be accommodated and mapped to
the DNA layer without requiring that corresponding changes be made
to the DNA layer of the existing knowledge data model.
[0125] Attention is now directed to FIG. 4, which provides a
logical flow diagram of a process 400 for segmentation of
biological sequence data and combining the segments with metadata
attributes to form biological data units encapsulated with headers.
The process 400 provides one example of a way in which source DNA
sequence data may be fragmented to generate biological data units
containing DNA sequence segments and associated header information
in accordance with a layered data model such as the biological data
model 200.
[0126] In one embodiment the process 400 utilizes sequence feature
information of the type annotated in well-established nucleotide
databases 410 such as, for example, NCBI, EMBL and DDBJ for
sorting, configuring and operating on the sequence data. By mapping
the biological information within these databases into various
layers of header information, a layered data model of existing
knowledge can be constructed.
[0127] Referring to FIG. 4, human genomic DNA data is shown to be
accessible from different storage elements 410. In this regard, the
DNA sequence data can be stored in segments as sequences of
individual chromosomes or partial chromosomes or as individual
genes, and may comprise all or part of a genome. In addition, the
DNA sequence data could be generated from a sequencing machine and
the results made accessible to a network of computers. Further,
genomic sequence data might be represented in any file format and
produced using any approach including, for example, as a partial
dipolar charge and phosphorescence sequence profile indicative of
the sequence data.
[0128] In a stage 420, the sequence data obtained from storage
elements 410 is mapped and aligned with the reference genomic
sequence data. The DNA sequence is associated with a set of
relevant molecular features using, for example, biological data 414
deemed valid by the scientific community. This data 414 is mapped
to specific regions of a sequence entry. In addition, clinical and
pharmacological data 416 demonstrated to be associated with any
coding or non-coding regions of a sequence entry is also
mapped.
[0129] In one embodiment layer-1 biological data units 444.sub.1
include a payload comprised of segmented DNA sequence data and a
DNA layer header. Similarly, layer-2 biological data units
444.sub.2 may include a payload comprised of segmented DNA sequence
data, a DNA layer header and an RNA layer header. A layer-N
biological data unit 444.sub.N may include a payload comprised of
segmented DNA sequence data, a DNA layer header, an RNA layer
header, and other headers associated with higher layers of the
relevant data model.
[0130] Alternatively, in one embodiment layer-1 biological data
units 444.sub.1 may include a payload comprised of segmented DNA
sequence data and a DNA layer header, layer-2 biological data units
444.sub.2 may be comprised of a segmented RNA sequence data and an
RNA layer header, and so on. In one embodiment a base unit may be
prepended to or otherwise associated with each biological data unit
in order to identify the specific headers included within the data
unit and/or the number thereof.
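One possible reading of the cumulative layering described in paragraph [0129], together with the base unit of paragraph [0130], is sketched below. The layer names, their ordering and the dictionary layout are assumptions made for illustration.

```python
LAYERS = ["DNA", "RNA", "protein"]  # assumed ordering, lowest layer first


def make_layer_unit(payload, headers_by_layer, n):
    """Build a layer-n unit carrying the headers of layers 1..n."""
    names = LAYERS[:n]
    headers = [(name, headers_by_layer[name]) for name in names]
    # The base unit identifies which headers are present and how many.
    base_unit = {"header_count": len(headers), "header_layers": names}
    return {"base": base_unit, "headers": headers, "payload": payload}


headers_by_layer = {
    "DNA": {"gene_id": "BRCA1"},
    "RNA": {"splice_variant": "hypothetical-variant-1"},
    "protein": {"tissue_type": "breast"},
}

layer1 = make_layer_unit("ACGT", headers_by_layer, 1)
layer2 = make_layer_unit("ACGT", headers_by_layer, 2)
```

A consumer can read the base unit first and decide whether the unit carries the headers it needs before parsing any of them.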
[0131] In one embodiment headers 424 may include physical,
chemical, or biological knowledge or findings, or any related
molecular data that has been peer reviewed, published and accepted
as valid. Headers 424 may also include clinical, pharmacological
and environmental data, as well as data from gene expression and
methylation.
[0132] In certain embodiments headers 424 may further include
information relating to gene and gene product interaction with
other components of a pathway or related pathways. The information
within headers 424 may also be obtained from, for example,
microarray studies, copy number variation data, SNP data, complete
genome hybridization, PCR and other related techniques, data types
and studies.
[0133] The prior scientific knowledge and information associated
with a specific sequence and included within a header 424 may be of
several different types including, for example, molecular
biological, clinical, medical and pharmacological information. In
this regard such molecular and biological information could be
separated and layered based on data from, for example, genomics,
exomics, epigenomics, transcriptomics, proteomics, and metabolomics
in order to yield data.
[0134] The data may also include DNA mutation data, splicing and
alternative splicing data, as well as data relating to
posttranscriptional control (including microRNA and other
non-coding silencing RNA and other nuclease degradation pathways).
Mass spectrometric data on protein structure and function, mutant
protein products with reduced or null function, as well as toxic
products could also be utilized as information.
[0135] In addition, pharmacological and clinical data relating to
specific genes or gene regions disposed to exert effects through
interaction with gene products or other components of a pathway
could be considered as a class of header information. Finally,
header information could also include environmental conditions or
effects correlated with certain genes or gene products known or
predicted to be related to a certain phenotypic effect or disease
onset.
[0136] As mentioned above, during stage 440 headers 424 are
associated with segmented DNA sequence data to form biological data
units comprised of a header 424 encapsulating a payload containing
the segmented DNA sequence data. In this process the association of
a header 424 to a payload containing segmented genome sequence data
may be carried out in any of a number of ways. For example, such
association may be effected using a pointer table, tag, graph,
dictionary structure or key-value store, or by embedding header
information directly into the segmented sequence data.
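As a non-limiting sketch of the pointer-table option mentioned above, headers and payloads may be kept in separate stores and linked by identifiers. The three in-memory dictionaries and the function names are hypothetical; in a distributed setting each store could reside at a different network location.

```python
import uuid

payload_store = {}   # payload id -> sequence segment
header_store = {}    # header id -> header fields
pointer_table = {}   # payload id -> list of header ids


def store_segment(sequence):
    """Store a sequence segment and return its identifier."""
    pid = str(uuid.uuid4())
    payload_store[pid] = sequence
    pointer_table[pid] = []
    return pid


def attach_header(pid, header):
    """Store a header and record a pointer from the payload to it."""
    hid = str(uuid.uuid4())
    header_store[hid] = header
    pointer_table[pid].append(hid)
    return hid


def resolve(pid):
    """Assemble a biological data unit by following the pointers."""
    return {"headers": [header_store[h] for h in pointer_table[pid]],
            "payload": payload_store[pid]}


pid = store_segment("ACGTACGT")
attach_header(pid, {"layer": "DNA", "gene_id": "BRCA1"})
attach_header(pid, {"layer": "RNA", "exon_count": 24})
unit = resolve(pid)
```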
[0137] In a stage 460, the biological data units 444 may be
organized into encapsulated data units in accordance with the
requirements of particular applications. For example, in certain
cases it may be desired to create encapsulated biological data
units including only a subset of the headers which would otherwise
be included in the biological data units associated with at least
one particular layer of the biological data model of prior
knowledge. For example, a certain application may require
encapsulated biological data units having headers associated with
only layers 1, 2 and 5 of a data model.
[0138] Another application may require, for example, encapsulated
biological data units having headers associated with only layer 2,
3 and 4 of the data model. Similarly, other applications may
require that the headers of the encapsulated biological data units
be arranged in a particular order, e.g., the header for layer 4,
followed by the header for layer 1, followed by the header for
layer 2.
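The application-specific selection and ordering of headers described above can be illustrated with a short sketch. It assumes that a biological data unit holds its headers in a mapping keyed by layer number; the function name and the header contents are hypothetical.

```python
def encapsulate(unit, layer_order):
    """Return a new unit with only the requested layer headers, in order."""
    missing = [n for n in layer_order if n not in unit["headers"]]
    if missing:
        raise KeyError(f"unit lacks headers for layers {missing}")
    return {"headers": [(n, unit["headers"][n]) for n in layer_order],
            "payload": unit["payload"]}


unit = {
    "headers": {n: {"layer": n} for n in range(1, 6)},  # layers 1 through 5
    "payload": "ACGT",
}

subset = encapsulate(unit, [1, 2, 5])      # headers for layers 1, 2 and 5 only
reordered = encapsulate(unit, [4, 1, 2])   # layer 4 first, then layers 1 and 2
```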
[0139] In a stage 480, the encapsulated biological data units
created in stage 460 are stored in a manner consistent with being
interoperable with one or more multi-layered, multi-dimensional
data containers 464. The content of the headers of the encapsulated
biological data units is chosen to promote optimal interoperability
among and between layers. For example, in one simplified case each
biological data unit included within the data container 464.sub.1
may include at least a DNA layer header, an RNA layer header, and a
protein layer header. It is a feature of the present system that
information within higher-layer headers (e.g., RNA layer headers or
protein layer headers) may be "mapped" to lower-layer headers
and/or sequence information in such a way as to establish a
relationship provenance between information within various
layers.
[0140] Consider an example wherein data concerning a particular
protein product that is expressed in a certain tissue type (i.e.,
protein layer information) may also provide information relating to
splicing (i.e., RNA layer information) or to a SNP at the genomic
level (i.e., DNA layer information) resulting in a premature
termination codon. In other words, protein structure related data
can provide RNA level knowledge on alternative splicing as well as
primary sequence data on amino acid substitutions
revealing SNPs and indels in the DNA sequence.
[0141] In another case, the diagnosis of a certain disease in a
certain patient or, for example, results from a mammogram screen or
prostate-specific antigen results, may provide information that is
directly related to hyper-methylation of certain regions of the DNA
sequence segment included within a DNA layer biological data unit.
These epigenetic markers, along with the methylation profile at CpG
islands associated with certain genes, could provide crucial header
information to relate and correlate with appropriate gene and
disease conditions.
[0142] One advantage of the layered architecture of the data
containers 464 is that modification or updating of the data content
associated with a given layer has minimal or no effect on the
processing of data in the remaining layers. In one embodiment
layers are advantageously designed to be operated on independently
while retaining the capability to integrate, and interoperate with,
data and existing knowledge of other layers. In addition, data can
be organized within each data container 464 in accordance with the
requirements of specific applications.
[0143] All or part of this data may be mapped, via linked
relationships between information within headers or metadata
attributes that are associated with different layers of a data
model, to a disease condition capable of being associated with a
region of segmented DNA sequence data contained within a biological
data unit. This enables biological data units to be grouped and
analyzed based upon the classification schema required by a
particular application.
[0144] In a stage 490, biological data units encapsulated with
headers and stored within the data containers 464 may subsequently be
filtered, sorted or operated upon based on information included
within such headers. The layered structure of biological data units
including encapsulated headers enables querying of the information
included within one or more such headers to be performed and
results returned based upon a set of rules specified by, for
example, the application issuing the query.
Architectural Components of Biological Data Networks
[0145] Attention is now directed to FIG. 5, which depicts a
biological data network 500 comprised of representations of
biological data linked and interrelated by an overlay network 504
containing a plurality of network nodes 510. In one embodiment the
network nodes 510 are in communication via network elements 520
(e.g., routers and switches) of the Internet 530 and thus overlay
such Internet elements. Certain of the network nodes 510' may have
localized access, via a local area network or the like, to
databases 550 containing the representations of biological sequence
data, clinical data, drug response or other information types which
are networked in the manner described herein.
[0146] In one embodiment the network nodes 510' may be configured
to locally process information within a database 550 and make
available all or part of the results of such processing, and
potentially information within the database 550 itself, to other of
the network nodes 510. In addition, the network nodes 510' may also
be designed to perform network processing functions along with the
network nodes 510 in the manner described hereinafter.
[0147] The biological data network 500 may in one aspect be viewed
as comprising a network of data stored within the databases 550 as
well as within storage (not shown) at the network nodes 510. In one
embodiment each biological data sequence or other sequence
information stored within the network 500 may be accorded a unique
identifier such as, for example, an IP address, a universally
unique identifier (UUID), or a tag in order to facilitate the
establishment of such a data network. Moreover, tables may be
maintained at each network node 510 for data tracking purposes
(references herein to network node 510 are generally also intended
to refer to network nodes 510', unless the context of the reference
clearly suggests otherwise). In particular, such tables may be used
to track the sequence information available directly or indirectly
(via other network nodes 510) from other network nodes 510, as well
as the results of processing such sequence information at various
nodes 510. These tables may be updated as biological data units
containing sequence information and/or MetaIntelligence.TM.
headers are transported between nodes for processing. Alternatively
or in addition, overhead messages may be exchanged between network
nodes 510 for the purpose of propagating the information stored
within ones of these tables to the tables maintained by other nodes
510. Such messaging and updating of tables between network nodes
510 generates a type of BioIntelligent.TM. data awareness that
provides a distinct advantage for processing and sharing data on
network 500. Furthermore, the network processing that is carried
out allows seamless access to network-associated processing
functions, shared data as well as support databases that also
contain properties of and information about the data.
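The table maintenance described above might be sketched as follows. The description does not specify a table format or an update rule, so this example borrows a distance-vector style exchange as a stand-in technique. The class, its methods and the use of hop counts are assumptions made for illustration.

```python
import uuid


class Node:
    def __init__(self, name):
        self.name = name
        self.local = set()   # identifiers of sequences stored at this node
        self.table = {}      # sequence id -> (next-hop node name, hop count)

    def add_local(self, seq_id):
        """Register a sequence that is directly available at this node."""
        self.local.add(seq_id)
        self.table[seq_id] = (self.name, 0)

    def advertise(self):
        """Overhead message: what this node can reach, and at what distance."""
        return {seq_id: hops for seq_id, (_, hops) in self.table.items()}

    def receive(self, sender, advert):
        """Merge a neighbor's advertisement, keeping the shortest known path."""
        for seq_id, hops in advert.items():
            known = self.table.get(seq_id)
            if known is None or hops + 1 < known[1]:
                self.table[seq_id] = (sender, hops + 1)


a, b, c = Node("a"), Node("b"), Node("c")
seq_id = str(uuid.uuid4())       # unique identifier accorded to a sequence
a.add_local(seq_id)
b.receive("a", a.advertise())    # b learns the sequence is directly at a
c.receive("b", b.advertise())    # c learns it is reachable indirectly via b
```

After the two exchanges, node c knows that the sequence is available indirectly through node b without having contacted node a.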
Structure and Operation of Biological Data Network Nodes
[0148] During operation of the network 500, requests from a client
terminal 560 are received by a network node 510. Such requests are
interpreted at the network node 510 and appropriate processing is
carried out at such network node 510, and potentially other network
nodes 510, in order to produce the requested results. In this
regard metadata attribute information contained in headers is
linked to all of the data throughout the network 500 that is
designated as or otherwise made network accessible, and such data
may be accessed and processed in response to requests from a
client terminal 560.
In this way intelligent information concerning data stored remote
from a client terminal 560 and its associated network node 510,
and/or such data itself, may be processed in a manner transparent
to such terminal 560 and node 510.
[0149] Although certain of the embodiments disclosed herein
contemplate that various ones of the network nodes 510 may perform
specialized processing functions and operate cooperatively to
produce an overall processing result, in other embodiments certain
nodes may be capable of performing all of the processing functions
necessary to deliver results in response to queries.
[0150] In certain aspects of the invention in which cooperative
operations and processing functions are coordinated at various
distributed network nodes 510, queries can be made that would
facilitate the simulation, study and comprehension of systems in
biology. In this case, header information fields at the DNA, RNA
and protein layers along with query dependent processing function
requirements serve as the activated substrates for generating a
result.
[0151] In general, when a query/request is made, a suite of
protocols is invoked which is based upon the properties of the
request. For example, a request can be made from any client on the
network 500 and the stack of application protocols uses processing
functions at multiple nodes to access the associated data and a
process management function to sort, aggregate, tabulate,
coordinate and combine the partial information from multiple nodes
to return the query result. In this regard, processing at a network
node 510 can be achieved using either of at least two approaches.
In a first approach of cooperative processing functions, data
and/or partial processing results can be moved to the desired
functional node 510 to be processed. Alternatively, the required
processing function can be moved from a network node 510 to the
location of the network accessible data at a database 550, and the
data is processed at the site at which it resides on the network
504.
Furthermore, a combination of the two approaches can be used to
return the query result to end nodes or terminals 560. In addition,
any result from processing that is new network information can be
used to update tables at nodes 510 to enhance network
awareness.
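The two approaches described in this paragraph can be illustrated with a minimal Python sketch. All names in it (DataSite, count_gc, move_data_to_function, move_function_to_data) are hypothetical and do not appear in the disclosure; the sketch simply assumes that a processing function is a callable and that a data site holds sequence strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DataSite:
    """Hypothetical stand-in for network-accessible data at a location 550."""
    name: str
    sequences: Dict[str, str] = field(default_factory=dict)

    def run_locally(self, func: Callable[[str], int]) -> Dict[str, int]:
        # Second approach: the function is shipped to the data and runs in place.
        return {seq_id: func(seq) for seq_id, seq in self.sequences.items()}


def count_gc(sequence: str) -> int:
    """Illustrative processing function: count G and C bases."""
    return sum(1 for base in sequence if base in "GC")


def move_data_to_function(site: DataSite, func: Callable[[str], int]) -> Dict[str, int]:
    # First approach: copy the data to the functional node, then process it there.
    transferred = dict(site.sequences)
    return {seq_id: func(seq) for seq_id, seq in transferred.items()}


def move_function_to_data(site: DataSite, func: Callable[[str], int]) -> Dict[str, int]:
    # Second approach: ask the site holding the data to run the function.
    return site.run_locally(func)
```

Under these assumptions both approaches return the same result for the same data; they differ only in what crosses the network (the data or the function).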
[0152] The network nodes 510 are aware of the types, the content
and location of all network accessible data and its intelligence.
Moreover, the network nodes 510 are aware of the types, locations
and capabilities of processing functions on the network 504. In
this regard each node 510 is regularly updated with the activities
being performed by, and processing results generated by, each other
node 510 of the network 500. In one embodiment, network-based
applications and protocols are aware of the information contained
in the different fields of the BI headers associated with the
biological data units stored within the highly distributed
databases 550 and access such information to the extent necessary
to process queries from terminals 560.
[0153] Turning now to FIG. 6, there is illustrated an exemplary
protocol stack 610 implemented at a network node 510 together with
corresponding layers of the OSI network model 600. As shown, the
protocol stack 610 includes a DNA Network Protocol Stack (DPS.TM.)
over TCP/IP layers. The DPS.TM. is consistent with utilization of
biological data units and supports a-Aware Network Application
capable of processing requests from a client terminal 560 and
delivering results. As is discussed below, a network node 510
configured with the protocol stack 610 is capable of performing
processing, switching and routing functions based upon not only the
information within messages associated with the TCP/IP layers of
the protocol stack 610 but also in accordance with the higher-layer
information within headers and other information associated with
the DPS.TM.. As a consequence, a network node 510 may use this
higher-layer information to prioritize the processing of packets
received by the network node 510. For example, the network node 510
may control quality of service ("QoS") and effect load balancing
based upon this higher-layer information.
[0154] The DPS.TM. is intended to enable existing Internet
infrastructure to efficiently process and transport DNA
sequence-based data. The DPS.TM. protocol stack comprises a DNA
Transport Protocol.TM. (DTP.TM.), DNA Signaling Protocol.TM.
(DSP.TM.), and DNA Control Protocol.TM. (DCP.TM.). In one
embodiment the DTP.TM. protocols enable network elements such as
routers and switches to process, transport, and communicate
biological data such as DNA sequence data and related information
between single or multiple sources of streaming DNA servers
(discussed below). The servers will include or have access to data
containers (e.g., storage devices) including biological data units
and/or unprocessed or partially processed DNA sequence data.
[0155] The functions of the DPS.TM. protocol suite comprise
processing, transporting, controlling, switching and routing
biological data such as DNA sequence information as streaming data
so as to enable such data to be utilized for a variety of
"streaming" applications. In this regard the DPS.TM. protocol stack
will be used for pulling streaming biological data from servers
having access to containers of biological sequence data. Such
streaming applications are capable of continuously "pushing" and
"pulling" biological sequence data and the high level abstracted
information from this data as necessary to support the
functionality of each particular application.
[0156] Various options exist for introducing the DPS.TM. protocol
suite into existing network infrastructure. In one implementation,
for example, the DPS.TM. protocol suite may be distributed
throughout the routers/switches of a given service provider. In
another implementation, the DPS.TM. protocol suite may reside only
in one or more network elements near an edge of the service
provider's network in an overlay network.
[0157] FIG. 7 shows a high-level view of the various data types
that may be processed by a group of network nodes 510 in response
to a query/request received from a client terminal 560. As shown,
transcriptomics data, proteomics data and/or gene expression data,
along with a patient's medical record information, are a small sample
of the types of data that may be stored as biological data units
within databases or data containers accessible to the nodes 510 and
processed by such nodes.
[0158] FIG. 7 illustratively represents a query request message
being sent to a network controlled by an "operating system" of
protocols and programs. Such a network operating system is capable
of processing the request by using biological data units consisting
of the metadata attributes that are associated with distributed
sequence data accessible on the network. The system is able to
locate, aggregate, sort and filter the highly distributed but
linked data units and send a response to the query request.
[0159] In addition, the "data cube" represents one or more
databases of all the prior knowledge that may be associated with
the biological data units that are aggregated based on a query. The
information that is contained in the existing knowledge base (data
cube) will be stored in a manner consistent with the concepts of a
data model disclosed herein.
[0160] Attention is now directed to FIG. 8, which provides a block
diagrammatic representation of the architecture of an exemplary
network node 510. As shown, the network node receives incoming IP
packets containing BioIntelligent.TM. biologically-relevant
headers. Encapsulated within such incoming IP packets will
typically be, for example, information identifying the particular
segments of genome sequence data with which such
biologically-relevant headers are known, calculated or predicted to
be associated. Such information could include, for example,
the particular chromosome and position within the chromosome with
which the gene is associated, protein information associated with
the gene, whether any part of the sequence of the gene corresponds
to a normal or minor allele, or other information pertinent to the
gene including association with any disease or phenotype or drug
metabolism information. In addition, each incoming packet could
also include information uniquely identifying the specific DNA
sequence or other biological sequence information and the network
location at which such sequence is stored.
[0161] For example, such identifying information (which could be in
the form of, for example, an IP address separate from the IP
address of the incoming IP packet) could identify a particular
network-accessible database and a location or position within such
database. In other embodiments both information identifying the
gene associated with the biologically-relevant headers within the
incoming IP packet and information specifying a particular location
at which the sequence information associated with such headers is
stored could be inherent within a unique identifier included within
the incoming IP packet.
[0162] Each incoming IP packet containing biologically-relevant
headers is received via a network interface 810 and provided to an
input packet processor 820. In one embodiment the network interface
is comprised of a physical port in communication with an external
network and further includes, for example, buffers, controllers and
timers configured to facilitate transmission and reception of
packetized sequence data and other information over such network.
The input packet processor 820 removes the IP header information
and parses the higher-layer content included within the packet. A
classification module 830 may then assign the packet to a
particular class based upon this higher-layer content. The
biologically-relevant header information included within the packet
may then be passed to a configurable processing module 850 for
processing in the manner described hereinafter based upon the
determined class and any policies applicable to such class defined
by policy module 840. As is also described hereinafter, the
biologically-relevant header information may then be processed by
the configurable processing module 850 with reference to various sequence
location tables 870 and layered data tables 860 maintained at the
network node 510. The layered data tables 860 are structured
consistently with the biological data model (FIG. 2) used to define
the biologically-relevant headers within each incoming IP
packet.
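The receive path described in this paragraph (input packet processor 820, classification module 830, policy module 840, configurable processing module 850) can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary layout of the raw packet, the classification rule, the policy values and every function name are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ParsedPacket:
    headers: Dict[str, str]  # biologically-relevant header fields
    payload: str             # segmented sequence data; may be empty


def input_packet_processor(raw: dict) -> ParsedPacket:
    """Drop the IP header and keep only the higher-layer content (cf. 820)."""
    return ParsedPacket(headers=dict(raw["bi_headers"]),
                        payload=raw.get("payload", ""))


def classify(packet: ParsedPacket) -> str:
    """Assign a class from higher-layer content (cf. 830); the rule is illustrative."""
    return "clinical" if "disease" in packet.headers else "research"


# Hypothetical per-class policies (cf. 840); a lower number means higher priority.
POLICIES = {"clinical": {"priority": 0}, "research": {"priority": 5}}


def process(packet: ParsedPacket, packet_class: str, policy: dict) -> dict:
    """Placeholder for the configurable processing module (cf. 850)."""
    return {"class": packet_class,
            "priority": policy["priority"],
            "gene": packet.headers.get("gene_id")}


def handle(raw: dict) -> dict:
    """Run one incoming packet through parse, classify, policy and process."""
    packet = input_packet_processor(raw)
    packet_class = classify(packet)
    return process(packet, packet_class, POLICIES[packet_class])
```

The table lookups against the sequence location tables 870 and layered data tables 860 are omitted here for brevity.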
[0163] Based upon the results of the processing performed by the
configurable processing module 850, outgoing biologically-relevant
header information associated with the biological sequence
identified within the input IP packet or other processing results
is provided to a transmit controller module 880 for packetization
within an outgoing IP packet. To the extent the outgoing
biologically-relevant header information requires further
processing by another network node 510 in order to render an
appropriate response to the user request received by the network
500, a load balancing module 882 within the transmit controller
module 880 selects such a network node 510 from among the group of
such nodes capable of performing the required processing. Such
selection may be based upon, for example, the processing loads
associated with each node within the group. Additionally, selection
may be based upon processing results that are passed to the
transmit controller module 880. A QoS module 884 places each
outgoing IP packet in one or more queues in accordance with, for
example, the applicable class accorded the corresponding incoming
IP packet by the classification module 830 and the policy
associated with such class. Each outgoing IP packet will generally
include identifying information similar to that included within
each incoming IP packet. The outgoing IP packets are provided by
the transmit controller module from the applicable queue to the
network interface for transmission to a destination network node
510.
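As a rough illustration of the load balancing module 882 and QoS module 884, the following Python sketch selects the least-loaded capable node and queues outgoing packets by class priority. The load figures, node names and the use of a single priority heap are assumptions made for the example; the disclosure does not prescribe a particular selection rule or queue structure.

```python
import heapq
from itertools import count
from typing import Dict


def select_node(candidate_loads: Dict[str, float]) -> str:
    """Pick the least-loaded node among those able to do the processing.

    Ties are broken alphabetically by node name so the choice is deterministic.
    """
    return min(sorted(candidate_loads), key=lambda node: candidate_loads[node])


class QosQueue:
    """Priority queue of outgoing packets; a lower number is served first."""

    def __init__(self) -> None:
        self._heap = []
        self._order = count()  # preserves arrival order within one priority

    def enqueue(self, packet: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), packet))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

In this sketch the priority would come from the class assigned to the corresponding incoming packet and the policy associated with that class.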
[0164] In one embodiment the headers within each IP packet received
by a network node 510 will be functionally associated with or
contain information having biological relevance to a segment of DNA
sequence data, MetaIntelligence.TM. metadata, or both. It should be
appreciated that the headers may be arranged in any order, whether
dependent upon or independent of any associated payload data.
However, in one embodiment the headers are each respectively
associated with a particular layer of a biological data cube model
representative of the biological sequence data contained within the
payloads of the biological data units with which such headers are
associated. Moreover, it should be understood that any
patient-related data which is not predicated upon genomic sequence
information but is nonetheless pertinent to the processing by the
network 500 of a request may be included within the headers of a
received IP packet.
[0165] It should be further understood that BI headers may be
realized in essentially any form capable of embedding information
within, or associating such information with, all or part of any
biological or other polymeric sequence or plurality thereof. BI
headers may also be placed within a representation of associated
DNA sequence data, or could be otherwise associated with any
electronic file or other electronic structure representative of
molecular information. In particular, biological data units
containing segmented DNA sequence data may be sorted, filtered and
operated upon based on the associated information contained within
the header fields.
[0166] Attention is now directed to FIG. 9A, which illustratively
represents a process effected by a network node 510 to implement a
sequence variants processing procedure. In many instances the first
process performed within the network 500 in response to receipt of
a user query is the execution of a variants calling function at a
network processing node 510. The variants calling function may be
executed at the network node 510 receiving the user query.
Alternatively, the procedure may be executed at a network node 510
specially configured for performing a comparative analysis of the
subject patient whole or partial genome sequence against the
selected reference/control sequence.
[0167] In an initial step of the variants processing procedure, a
determination is made as to whether any differences exist between
the biological data sequence associated with the query and the
reference sequence. To the extent differences are detected, the
nature of the differences and their locations with respect to the
reference sequence are recorded. In this regard the sequence data
associated with the query could comprise a portion of a gene or
plurality of genes, an entire genomic sequence from normal cells,
and/or an entire genomic sequence from diseased cells. The sequence
data for a particular patient could comprise any, or a combination,
of these types of sequence data.
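The initial step described in this paragraph, detecting differences and recording their nature and location relative to the reference, can be sketched as below. The sketch assumes the query segment has already been aligned to the reference at a known offset and contains substitutions only (no insertions or deletions); the function name and the 0-based position convention are choices made for the example.

```python
from typing import List, Tuple


def find_differences(query: str, reference: str,
                     offset: int = 0) -> List[Tuple[int, str, str]]:
    """Return (reference_position, reference_base, query_base) per mismatch.

    Positions are 0-based indices into the reference sequence.
    """
    if offset < 0 or offset + len(query) > len(reference):
        raise ValueError("query does not fit within the reference at this offset")
    differences = []
    for i, query_base in enumerate(query):
        reference_base = reference[offset + i]
        if query_base != reference_base:
            differences.append((offset + i, reference_base, query_base))
    return differences
```

An empty list corresponds to the case in which no differences exist between the query sequence and the reference.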
[0168] In other embodiments a clinically transformed version of a
patient's genomic sequence data, rather than the sequence data
itself, is associated with user requests received by the network
500. Such a clinical transformation may involve, for example,
associating a patient's medical records or health related
information with any or a combination of the patient's genomic
sequence or the patient's transcriptomic, proteomic, metabolomic or
lipidomic information, or any other such related data. For example,
such transformation could involve using certain minor allele
variations in or near certain genes that are associated with
certain phenotypes, symptoms, syndromes, diseases, disorders, etc.
Furthermore, certain knowledge of the linkage disequilibrium that
is associated with the haplotype map genome sequence of the patient
might provide a detailed transformation of this genotyping data
into information on protein concentrations in blood, urine and
other body fluids. Information on functional activity of these
proteins and their metabolic state which might include
posttranslational modifications could be a useful part of improving
the granularity of the patient's genomic-based transformed data.
Accordingly, the present disclosure advantageously provides a
mechanism for networking and sharing genomic-based data without
requiring a corresponding sharing of a patient's genomic sequence
data.
[0169] Again considering the process of FIG. 9A, in a comparison
operation 910 packets of genomic sequence segments 914 are mapped
to corresponding portions of a reference sequence 918. In an
operation 922, statistical corrections are then carried out at the
network node 510 on the basis of the comparison in order to make a
variant call. Variant calls can be checked against a database of
variant alleles since each node has awareness of such data location
on the network. For example, a rare variant in a certain gene
associated with breast cancer might be contained in the TCGA database
with pertinent information on drug response. This information will
include clinical responses to certain drugs that relate
directly to the minor allele. The network can access the TCGA
database and extract the required information for processing on the
network or locally at the client server.
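A minimal sketch of operation 922 and the subsequent database check follows. The disclosure does not specify the statistical correction, so the sketch substitutes a simple read-depth and allele-fraction threshold; the threshold values, the gene name GENE_X and the in-memory KNOWN_VARIANTS dictionary are hypothetical placeholders and do not represent the TCGA database or its interface.

```python
from collections import Counter
from typing import Optional


def call_variant(ref_base: str, observed: str,
                 min_depth: int = 4, min_fraction: float = 0.3) -> Optional[str]:
    """Return the alternate base if it passes both thresholds, else None.

    `observed` holds the bases seen at one reference position, one per read.
    """
    depth = len(observed)
    if depth < min_depth:
        return None  # too few reads to make a call
    alt_counts = Counter(base for base in observed if base != ref_base)
    if not alt_counts:
        return None  # every read agrees with the reference
    alt_base, alt_count = alt_counts.most_common(1)[0]
    return alt_base if alt_count / depth >= min_fraction else None


# Hypothetical variant-allele table keyed by (gene, position, alternate base).
KNOWN_VARIANTS = {
    ("GENE_X", 101, "T"): {"drug_response": "placeholder annotation"},
}


def annotate(gene: str, position: int, alt_base: str) -> Optional[dict]:
    """Look up a called variant in the (hypothetical) variant database."""
    return KNOWN_VARIANTS.get((gene, position, alt_base))
```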
[0170] For simplicity, in the case where SNPs are the only variants,
dbSNP can be used to validate common SNPs. In addition, data on
minor alleles with disease association might be present in other
cancer genome databases that are maintained by public and private
entities such as but not limited to CGP (Cancer Genome Project at
Sanger Institute), TCGA (at NIH's National Cancer Institute), RCGDB
(Roche Cancer Genome Database), and the like.
[0171] Attention is now directed to FIG. 9B, which is a flowchart
of an exemplary variants processing procedure 930 representative of
one manner in which a network node 510 configured for variants
processing may be utilized in connection with processing a
particular user request. In particular, consider the case in which
a structured representation of the DNA sequence data of a breast
cancer patient is received at a network node 510 configured for
variants processing along with a reference sequence (stage 934).
The structured sequence data is then mapped against the reference
in order to produce the specific variant alleles forming the basis
of variants calls made by the node 510 (stage 940). In this example
it is assumed that the request accompanying the sequence data
comprised a request to determine the pharmaceutical drug with the
highest efficacy and with lowest toxic effects in view of the DNA
sequence data of the patient. Once the specific variant alleles of
the patient have been determined, the network node 510 configured
for variants processing may issue a query/request that is processed
by those network nodes 510 having access to public and private
databases containing information relating to pharmacogenomics-based
responses to various drugs (stage 944). The results of such queries
may then be returned to the requesting client terminal 560 (stage
950), and the drug response data for specific variant alleles
included within such results may then be used for analysis of the
patient data (stage 954).
[0172] In the general case, once the processing to be performed at
a given network node 510 has been completed, a decision will be
made to route or switch the processing to another network node 510
based upon the results of such processing (stage 960). The extent
of the processing to be performed by the network 500 with respect
to a particular request will of course be dependent upon the nature
of the request.
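The decision of stage 960 can be sketched as a function of the results produced so far. The capability table, node names, result keys and the rule for deciding that processing is incomplete are all assumptions introduced for illustration; the disclosure leaves these to the nature of the request.

```python
from typing import Dict, List, Optional

# Hypothetical table of which nodes can perform which operation.
CAPABILITIES: Dict[str, List[str]] = {
    "variant_calling": ["node-1"],
    "drug_response_lookup": ["node-2", "node-3"],
}


def next_operation(results: dict) -> Optional[str]:
    """Decide from the results whether further processing is required."""
    if not results.get("variants"):
        return None  # nothing was called, so there is nothing to look up
    if "drug_response" not in results:
        return "drug_response_lookup"
    return None  # the request has been fully processed


def route(results: dict) -> Optional[str]:
    """Return the node that should continue processing, or None if complete."""
    operation = next_operation(results)
    if operation is None:
        return None
    return CAPABILITIES[operation][0]
```

A fuller version would pick among the capable nodes by load, as the load balancing module 882 does, rather than taking the first entry.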
[0173] Turning now to FIG. 10, an illustrative representation is
provided of the processing which occurs at a network node 510
configured to perform a specialized processing function. As may be
appreciated with reference to FIG. 10, a specialized processing
function which is required to be performed is first carried out and
the result of such a processing function is supported by access to
public and private databases with relevant associated data.
[0174] In one embodiment each network node 510 implements a method
which generally involves performing a processing operation
involving ones of a first set of biological data units and a second
set of biological data units. The processing might further involve
a comparison of the called variant with access to established
variants databases.
[0175] In the general case, the biological data unit encapsulated
within the IP packet received by a network node 510 will contain a
first header associated with first information relating to
segmented biological sequence data and a second header associated
with second information relating to the segmented biological
sequence data. The method includes processing of the first
information and the second information in relation to the content
of the payload of the biological data unit. In one embodiment
processing is carried out at each network node 510 with respect to
biological data units including a first header associated with
information relating to a first-layer representation of biological
sequence data and a second header associated with information
relating to a second-layer representation of biological sequence
data wherein a biological, clinical, pharmacological, medical or
other such relationship exists between the first-layer and
second-layer representations. For example, the DNA sequence for a
gene may be related to the cDNA or RNA sequence of that gene or the
protein sequence, structure or function of the gene product. In one
embodiment all of the data contained within a layered
representation of the DNA sequence information (see FIG. 2) would
be available for a subset of patients at each client server.
[0176] As may be appreciated with reference to FIG. 2, a biological
data unit predicated upon the layered data model of FIG. 2 includes
a transformed representation of a biological sequence and a first
header associated with first information relating to such sequence.
Since the headers included within such a biological data unit may
generally correspond to the layers of the layered data structure of
FIG. 2, it should be understood that a processing node 510 that
operates on a given layer of data will typically be able to access
only a certain type of data. For example, in one embodiment "layer
1" headers are associated with the DNA layer and a network node 510
configured for "layer 1" processing would access DNA-related
data.
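One possible in-memory shape for a biological data unit with layered headers, and for a node restricted to a single layer, is sketched below. The class names, the integer layer keys and the header field names are hypothetical; the BRCA1 and chromosome 17 values are used only as exemplary field contents.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BiologicalDataUnit:
    headers: Dict[int, Dict[str, str]]  # layer number -> header fields
    payload: str = ""                   # segmented sequence data, if any


class LayerNode:
    """Node configured to operate on one layer of the data model."""

    def __init__(self, layer: int) -> None:
        self.layer = layer

    def read_headers(self, unit: BiologicalDataUnit) -> Dict[str, str]:
        # Only the header fields of this node's layer are returned.
        return dict(unit.headers.get(self.layer, {}))
```

Here a node built as LayerNode(1) sees only the DNA-layer header fields of a unit, mirroring the "layer 1" example in the paragraph.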
[0177] Attention is now directed to FIG. 11, which provides a
representation of an exemplary processing platform 1100 capable of
being configured to implement a network node 510. The processing
platform 1100 includes one or more processors 1110, along with a
memory space 1170, which may include one or more physical memory
devices, and may include peripherals such as a display 1120, user
input/output devices such as mice, keyboards, etc. (not shown), one or more
media drives 1130, as well as other devices used in conjunction
with computer systems (not shown for purposes of clarity).
[0178] The platform 1100 may further include a CAM memory device
1150, which is configured for very high speed data location by
accessing content in the memory rather than addresses as is done in
traditional memories. In addition, one or more databases 1160 may be
included to store data such as compressed or uncompressed
biological sequences, dictionary information, metadata or other
data or information, such as computer files. Database 1160 may be
implemented in whole or in part in CAM memory 1150 or may be in one
or more separate physical memory devices.
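The difference between address-based and content-based access can be shown with a small software analogy. This is not a model of CAM hardware 1150, which performs the match in parallel in the memory itself; it only illustrates the lookup direction, and the class and method names are hypothetical.

```python
from typing import Dict, List


class SoftwareCam:
    """Toy content-addressable store: search by content, get addresses back."""

    def __init__(self) -> None:
        self._cells: List[str] = []             # address -> content
        self._index: Dict[str, List[int]] = {}  # content -> addresses

    def store(self, content: str) -> int:
        """Write content to the next free address and return that address."""
        address = len(self._cells)
        self._cells.append(content)
        self._index.setdefault(content, []).append(address)
        return address

    def read(self, address: int) -> str:
        """Traditional access: supply an address, receive the content."""
        return self._cells[address]

    def search(self, content: str) -> List[int]:
        """Content access: supply the content, receive every matching address."""
        return list(self._index.get(content, []))
```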
[0179] The platform 1100 may also include one or more network
connections 1140 configured to send or receive biological data,
sequences, instruction sets, or other data or information from
other databases or computer systems. The network connection 1140
may allow users to receive uncompressed or compressed biological
sequences from others as well as send uncompressed or compressed
sequences. Network connection 1140 may include wired or wireless
networks, such as Ethernet networks, T1 networks, 802.11 or 802.15
networks, cellular, LTE or other wireless networks, or other
networking technologies as are known or developed in the art.
[0180] Memory space 1170 may be configured to store data as well as
instructions for execution on processor(s) 1110 to implement the
methods described herein. In particular, memory space 1170 may
include a network processing module 1172 for performing
network-based processing functions as described herein. Memory
space 1170 may further include an operating system (OS) module
1174, a data module 1176 configured to temporarily store sequence
data and/or associated attributes or metadata, and a module 1178 for
storing results of the processing effected by the network
processing module 1172.
[0181] The various modules included within memory space 1170 may be
combined or integrated, in whole or in part, in various
implementations. In some implementations, the functionality shown
in FIG. 11 may be incorporated, in whole or in part, in one or more
special purpose processor chips or other integrated circuit
devices.
[0182] Attention is now directed to FIG. 12, which illustrates one
manner in which data may be processed, managed and stored at an
individual network node 510 in an exemplary clinical environment.
In particular, FIG. 12 depicts one way in which the information
technology systems of a medical provider (e.g., an oncologist)
could interface with network processing at a node 1210 included
within a local area network in communication with the data network
500. In one embodiment the network processing node 1210 may have
similar or identical processing functionality as the nodes 510 of
the network 500 and would be in communication with at least one
such node 510, but could also be locally networked with other
information technology infrastructure in a campus environment not
part of the network 500.
[0183] In one embodiment none of the data which is stored in the
local storage container 1220 is generally accessible to clients 560
of the network 500. Movement of data between storage containers
associated with or accessible to different network nodes 510 may be
governed by the policies established by the one or more clients 560
controlling such containers. For example, depending on the policy
in place at a first network node 510, certain aspects of actual
patient data or a transformed version of such data might be
"pulled" in whole or in part from data containers accessible to a
second network node 510.
Access to Existing Knowledge
[0184] Attention is now directed to FIGS. 13-18, which
illustratively represent the manner in which information within the
layered data structure 200 is utilized at an individual network
processing node 510. In particular, each of FIGS. 13-18 depicts an
exemplary representation of the relationship between information in
the headers 1304 of a biological data unit associated with a query
message and prior knowledge 1308 within storage accessible to the
node 510 that is used in generating a response to the message. It
should be understood that FIGS. 13-18 provide only one example of a
set of three layers of BI header information or metadata
attributes which are directly associated with the various layers of
the knowledge structure.
[0185] As may be appreciated by reference to FIGS. 13-18, the first
field of information present within each BI layer header
specifically relates to a first source of data and/or knowledge
associated with such BI header. For example, the fields within the
"layer 1" header 1310 will relate directly to a first layer of
the structured knowledge data model. In this case the fields within
the layer 1, or "L1", header 1310 can relate to L1 data (i.e.,
DNA-related data in the case of the data model 200). Consequently,
information that is contained in the fields of the layer 2, or
"L2", header relates directly but not strictly to the data
and knowledge presented in the second layer, or RNA
layer.
[0186] Referring now specifically to FIG. 13, "H1" represents a
first field of the information within the L1 set of attributes that
represent header 1310 of a given data packet. In the example of
FIG. 13 the particular attributes within section L1 header 1310
directly correspond to characteristics of the first layer (i.e.,
the DNA layer 210) of the layered model of existing related
knowledge 200.
[0187] It should be noted that FIG. 13 depicts only the different
layers of headers and the various header information fields, and
not any associated payload of segmented sequence data, of a
particular biological data unit. As discussed above, IP packets
based upon a particular biological data unit which is exchanged
between network nodes 510 may or may not include such payload data
(i.e., such IP packets may only include higher level abstracted
attribute information corresponding to the biological data
unit).
[0188] In the embodiment of FIG. 13, the header field H1 within the
L1 header 1310 relates to a particular type of information
pertinent to the DNA layer 210. For example, as indicated by
DNA-layer table 1320 maintained by the individual network
processing node 510, the field H1 within the L1 header 1310 may
point to the base positions for a sequence of genomic data within
the payload of the biological data unit containing headers 1304.
The layered prior knowledge that is being accessed or related or
pointed to by attributes such as H1 is specifically associated with
DNA layer information of data 1308.
[0189] The segmented sequence data within the payload of the
biological data unit identified by the field H1 within the L1
header 1310 may represent a certain region of a genome that may be
positioned in similar but not necessarily identical base positions.
For example, the comparison of this region or section of the genome
that is represented in the payload for a particular gene would be
expected to code for the same genes or at least different isoforms
of the same gene.
[0190] As a result, the effect of L1H1 header field (layer 1,
header field 1) from the stored DNA data would give comparable
results for the various DNA layer annotations that are present in
that data container. Such DNA layer information could include, for
example, gene ID, chromosome, base positions, regulatory regions,
5' and 3' UTR, variant alleles and other DNA-based information
related to the gene. Based on the query message, the individual
network processing node 510 accesses information within the data
cube of prior knowledge 1308 relating to, for example,
chromosome number (for simplicity, not shown) and base positions
identified by the L1H1 header field.
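The lookup described in this paragraph, from the chromosome and base positions carried in the L1H1 header field to DNA-layer annotations in the prior knowledge 1308, can be sketched as an interval overlap query. The table rows, the placeholder gene identifiers and the coordinate values are invented for illustration and do not describe any real gene; the closed-interval convention is also an assumption.

```python
from typing import List

# Hypothetical DNA-layer table (cf. table 1320); coordinates are placeholders.
DNA_LAYER_TABLE = [
    {"gene_id": "GENE_A", "chromosome": "17", "start": 1000, "end": 5000},
    {"gene_id": "GENE_B", "chromosome": "17", "start": 7000, "end": 9000},
    {"gene_id": "GENE_C", "chromosome": "2", "start": 1000, "end": 5000},
]


def lookup_by_position(chromosome: str, start: int, end: int) -> List[str]:
    """Return gene IDs whose span overlaps the closed interval [start, end]."""
    return [
        row["gene_id"]
        for row in DNA_LAYER_TABLE
        if row["chromosome"] == chromosome
        and row["start"] <= end
        and start <= row["end"]
    ]
```

More than one gene ID may be returned for a single segment, which is consistent with a payload that codes for a plurality of genes or a part of a gene.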
[0191] Referring now to FIG. 14, "H2" represents a second attribute
of header information within the L1 header 1310 of the certain data
packet (i.e., the "L1H2" header field). In this case, the L1H2
header field refers to a second field in the DNA layer that points
specifically to the associated gene or gene product related to the
packetized segment of DNA sequence data within the biological data
unit associated with headers 1304. Such sequence data could, for
example, code for one gene, a plurality of genes or a part of a
gene (represented in either the + or - orientation based on the 5'
to 3' direction of the sense strand). As indicated by FIG. 14, the
L1H2 attribute field relates or points to the gene ID section of
the distributed network-accessible data 1308.
[0192] In one embodiment this field should contain at least one
representation for the name of the gene and/or gene product that is
encoded by the DNA sequence in the payload of the biological data
unit associated with headers 1304. In cases where more than one
name is used to identify a gene, gene product or the activity
associated with that gene the most current and widely accepted
names are listed. Any gene ID name that is used to relate
specifically to the sequence represented by the chromosome number
and base positions that are indicated in the first header field of
the layer 1 should be encoded by this particular sequence in this
region of the genome. However, because of gene duplication, copy
number variations, existence of gene families, repeat sequences,
mobile transposable elements and other such related molecular
phenomena certain classes of redundancy will exist. Furthermore,
one gene or the polypeptide product of a gene or the enzymatic
activity of a gene could be associated with more than one disease,
syndrome, disorder, phenotype, etc.
[0193] Turning now to FIG. 15, "H3" represents a third field of
header information within the L1 header 1310 of the certain data
packet (i.e., the "L1H3" header field). In this case, the L1H3
header field relates to any phenotypic expression of the encoded gene
that is associated with a disease or disorder. That is, in the
example of FIG. 15 the L1H3 header field points to disease(s)
known or predicted to be associated with the gene, a mutated or
variant form of the gene, or an expressed gene product.
[0194] For simplicity and clarity, the supportive data in this case
show three different cancer types that are associated with packaged
genome sequence data attached to the exemplary header fields. The
diseases that are known to have association with the segmented
sequence in the payload of this biological data unit in this case
are colon, cervical and breast cancers. The gene or sequence
segment might represent an up-regulated oncogene or proto-oncogene,
a down-regulated tumor suppressor gene or a structural or
functional gene involved in a pathway with other genes associated
with the disease.
[0195] Referring now to FIG. 16, a first field of information
within the L2 header 1610 of the certain data packet is denoted by
"H1". In the example of FIG. 16 the header fields within the L2
header 1610 directly correspond to characteristics of the second
layer (i.e., the RNA layer 220) of the layered data model 200. It
should be appreciated that network access to the data that relates
to the diseases associated with any packetized segment of DNA
sequence data will be through a layer 1 (DNA layer) access. Access
to data associated with other layers, e.g., layer 2 and layer 3,
will require access to information associated with the header
fields of layer 2 or layer 3. That is, the header fields associated
with the L1 header 1310 will generally relate only to data in the
DNA layer 210 of the layered data structure 200, the header fields
within the L2 header 1610 will relate only to data within the RNA
layer 220, and so on. Such RNA-layer data related to a gene of
interest could include, for example, the lengths of the pre-mRNA
and mature mRNA, exon selection, alternate splicing, data on
differential expression of RNA, transcription control and any
RNA-related information.
[0196] As shown in FIG. 16, fields within the L2 header 1610 relate
to the RNA layer 220 of the layered data structure 200. For
example, in the embodiment of FIG. 16 the H1 field may relate to
the transcription start site of the mRNA for the gene identified by
fields of the L1 header 1310. In other words, the transcription
start site information included within the RNA layer 220 would
relate to the chromosomal position of the gene. It should be
understood that all of the information and field data in FIG. 16 is
exemplary, and none of such information actually relates to any
information concerning any particular gene. For instance, where
BRCA1 might be used to indicate a gene and chromosome 17 the
chromosome, all of the information in the related table 1620 is
exemplary. Thus, information within the RNA layer 220 and the DNA
layer 210 are associated and interrelated by layered data structure
200 in a manner that allows independent access to the different
information and/or data types or layers.
[0197] Attention is now directed to FIG. 17, in which "H2"
represents a second field of header information within the L2
header 1610 of the certain data packet (i.e., the "L2H2" header
field). In this case, the L2H2 header field relates to RNA-layer
information pertaining to the length of a transcript. The RNA data
on this particular gene shows a variety of lengths for the
transcript. Entries that harbor an insertion show relatively longer
transcript length; conversely, the shorter length transcripts show
deleted bases in comparison with the normal case.
[0198] Referring now to FIG. 18, the third field ("H3") of header
information within the L2 header 1610 may relate to other
information associated with the RNA layer 220. For example, this
"L2H3" header field may relate to the exon selection of a gene
associated with breast cancer.
[0199] In this example, the variations in the number of exons that
are contained in this gene indicate the existence of different
splice variants that are associated with the transcripts from cells
taken from the breast tumor tissue. The defect in splicing could be
from variants of the gene or some component of the splicing
mechanism.
[0200] In the embodiment of FIG. 18, layer 3 ("L3") headers 1810
may include information associated with a protein layer of the data
model 200. Such protein-layer information may include, for example,
the molecular weight of the protein product of the gene identified
by the L1 header 1310, amino acid count and content, expression
level, activity, posttranslational modifications, structure,
function and other related information.
[0201] Although FIG. 18 does not explicitly depict the relationship
between the fields of the L3 header 1810 and corresponding portions
of the data cubical 1308, such fields are related to the
protein-layer data within cubical 1308 in a manner consistent with
that described above with respect to DNA-layer and RNA-layer
information.
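The layered header arrangement of FIGS. 13-18 can be pictured with a
short sketch. The following Python fragment is illustrative only and
is not the format of the disclosure: the class names, the "H1"/"H2"/
"H3" keys and all values are assumptions chosen to mirror the
description (chromosome and base positions and associated diseases in
the L1 header; transcription start site and transcript length in the
L2 header). As with the figures, the values describe no actual gene.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LayerHeader:
    """Header for one layer of the data model (1 = DNA, 2 = RNA, 3 = protein)."""
    layer: int
    values: Dict[str, object] = field(default_factory=dict)  # keyed "H1", "H2", ...


@dataclass
class BiologicalDataUnit:
    """A segment of sequence data (payload) plus one header per layer."""
    payload: str
    headers: Dict[int, LayerHeader] = field(default_factory=dict)

    def get_field(self, layer: int, name: str):
        """Read one header field of one layer without touching other layers."""
        return self.headers[layer].values[name]


# Hypothetical placeholder values only.
unit = BiologicalDataUnit(
    payload="ACGTACGT",
    headers={
        1: LayerHeader(1, {"H1": ("chr17", 100, 107),
                           "H3": ["colon", "cervical", "breast"]}),
        2: LayerHeader(2, {"H1": 100, "H2": 5000}),
    },
)

diseases = unit.get_field(1, "H3")           # DNA-layer access (L1H3)
transcript_length = unit.get_field(2, "H2")  # RNA-layer access (L2H2)
```

Keying the headers by layer number is one way to reflect the
independent access to each layer described above; other
organizations are equally possible.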
[0202] Attention is now directed to FIG. 19, which illustrates the
performance of an exemplary result-based network processing
operation involving the cooperation of multiple network nodes 510.
As discussed above, messages will generally be regularly exchanged
between network nodes 510 in order to update tables identifying the
biologically-relevant data and other information accessible to each
such node 510 as well as the processing capabilities of each such
node 510. In addition, when certain processing operations are
completed at a network node 510, the results of such processing may
be used to update various tables maintained by the node 510. In one
embodiment such processing results are evaluated to determine the
type, if any, of further processing that is required in view of the
applicable client request. To the extent it is determined at a
current node 510 that further processing is required, tables at
such current node 510 may be consulted in order to identify a
subsequent node 510 capable of performing the required additional
processing. The current node 510 may then forward a set of
partially processed data to the subsequent node 510 for further
processing.
[0203] As a simple example of such result-based processing,
consider a request message that requires processing at multiple
nodes 510 on the network 500. Depending on the query and the
headers that are assigned or associated with the patient-based or
other data related to the query, partially-processed results are
passed to successive nodes 510 as processing is completed at each
such node. In the case in which the initial processing at a current
node 510 requires performing operations with respect to a header
corresponding to the DNA layer 210 of the data model 200,
information pertinent to the layer 210 may be retrieved from memory
or storage accessible to such current node 510. Such
network-accessible memory or storage may include a layered data
model of related prior knowledge containing biologically-relevant
information organized in a manner consistent with the data model
200. To the extent it is determined based upon the results of this
initial processing that access is required to information relevant
to the RNA layer 220 of the data model 200, then such information
may also be retrieved from the network-accessible storage. The
result from this data access function could return a simple
categorical binary response (zero or one).
[0204] Consider the case where access to the second layer is to
determine if there is any alternative splicing associated with the
phenotype or disease. The disease could be one of the many
molecular classifications of breast cancer and the drug target for
treatment could be specific functions of splicing or kinase
function for example. The first of two molecular functions might be
targeted by a first class of drug and the second molecular function
might be the target of a second class of drug. Both of these drugs
would normally be treatments of choice for this patient, whose genome
and medical data were used to make the query. The patient might fall
in a certain category based on age, weight, tumor cell morphology,
tumor size and position as well as other social, environmental and
physical aspects to place the patient and disease in a category.
However, the type of genomic variants that gives rise to the
molecular cascade of events that characterize the onset of the
disease may involve certain molecular targeted activities.
[0205] For example, compare the case where a mutation affects a
transcription factor binding site so as to up-regulate expression of
a gene associated with many cancers with the case of a minor allele
variant that is known to cause alternative splicing resulting in a
protein product associated with the disease onset and progression.
Different
classes of drugs that target certain molecular pathways or
functions or activities will be more suitable for treating certain
diseases and the ability to be able to discriminate between them
would improve treatment selection.
[0206] Again referring to FIG. 19, consider the processing
occurring at a current network node 510 where access to the RNA
layer is a necessary path for the request message to return a
result. In this case associated data is retrieved from
network-accessible storage containing a data cubical organized
consistent with the data model 200 in order to facilitate
comparison or other processing of RNA-layer information. If the
result of such RNA-layer data access and processing at the current
node 510 indicates, for example, a splice variant, then a next node
510 selected to perform the processing steps subsequently
required would be different than the node 510 selected had the
initial processing indicated that there was no alternate splicing
involved. Moreover, in the case where the response indicated
alternate splicing the subset of drug selections for final results
would be different from those listed when the splice variant query
is returned as null.
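The result-dependent selection of a next node described with
reference to FIG. 19 may be sketched as follows. This is a minimal
illustration under stated assumptions: the node names, the table
layout and the exon-count test for a splice variant are hypothetical
and are not taken from the disclosure.

```python
from typing import Dict, List, Optional

# Hypothetical table held at the current node: maps the outcome of the
# RNA-layer processing to a node able to perform the next step.
NEXT_NODE_TABLE: Dict[str, str] = {
    "splice_variant": "node-splicing-targets",
    "no_splice_variant": "node-kinase-targets",
}


def classify_rna_result(exon_counts: List[int], normal_exon_count: int) -> str:
    """Report a splice variant if any transcript deviates from the normal exon count."""
    if any(n != normal_exon_count for n in exon_counts):
        return "splice_variant"
    return "no_splice_variant"


def select_next_node(result: str, table: Dict[str, str]) -> Optional[str]:
    """Look up the node for the next step; None means no entry for this result."""
    return table.get(result)
```

For instance, exon counts of [24, 22, 24] against a normal count of
24 would be classified as a splice variant and routed to the first
of the two hypothetical nodes, whereas [24, 24] would be routed to
the second.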
[0207] In one embodiment a path of a query request may involve
execution of a limited number of preferred processing steps
selected based upon specific characteristics of the query. For
example, the network application may monitor the results of
processing at a particular node 510 and then determine which of a
number of possible successive processing steps is most consistent
with returning the best available results based on the
characteristics of the data accessible to the network 500.
[0208] Turning now to FIG. 20, there is illustrated an exemplary
process flow 2000 corresponding to the result-based network
processing discussed above. As shown, a request message 2010 is
sent by a user application executing on a network client 560. At the
network application 2020, the message 2010 activates a set of
protocols associated with processing the message sent from the user
application. Protocols are compiled and sorted by a protocol
sorter/compiler manager 2030 and a representative stack 2040 which
is consistent with the processing of the user application request
is selected. The suite of protocols that is required to process the
message includes a set of processing functions that are performed
at each network node. Nodal functions are organized and updated
constantly by a node function organizer 2050. In particular, the
organizer 2050 selects and configures a set of network nodes 510 to
effect distributed processing of the set of required network
functions.
[0209] In one embodiment the processing functions executed by the
network nodes 510 are highly distributive; that is, each network
node 510 performs one specialized function and thus functions are
distributed throughout the network. The network application message
management and processing function engine coordinate the widely
distributed network nodes to perform a system function using
MetaIntelligence. As is explained below, a function organizer is
adapted to select a sequence of nodes to effect a set of
distributed functions to be performed in processing a message or
request from a client 560.
[0210] In one embodiment the network 500 may be regarded as operating
as a system in which multiple nodes 510 will be configured to be
capable of performing particular processing functions. That is, the
network nodes 510 would generally be configured such that the
frequency and distribution of the available processing functions
would be selected based upon prior usage. As a consequence, a
relatively large percentage of network nodes 510 could be
configured to implement those functions most often required in
connection with generation of a result in response to a request
message; conversely, a relatively small percentage of network nodes
510 could be configured to implement those functions least often
required in connection with generation of a result in response to a
request. For example, depending on the usage load of processing
functions at certain high-volume network nodes, the updating messages
sent between nodes can be used for load balancing and congestion
control. As a result, network recommendations can be made based on
nodal usage to provide updated node functions to optimize the
network.
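One way to picture the usage-based distribution of functions among
nodes is a proportional allocation. The sketch below is an
assumption-laden illustration rather than the method of the
disclosure: the function names and usage counts are invented, every
function is guaranteed one node, and the remaining nodes are shared
by the largest-remainder method.

```python
from typing import Dict


def allocate_nodes(usage: Dict[str, int], node_count: int) -> Dict[str, int]:
    """Split node_count nodes among functions in proportion to prior usage.

    Each function receives at least one node; the spare nodes are divided
    in proportion to usage, with ties in rounding resolved by the
    largest-remainder method.
    """
    if node_count < len(usage):
        raise ValueError("need at least one node per function")
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("usage counts must sum to a positive number")
    spare = node_count - len(usage)
    quotas = {f: spare * u / total for f, u in usage.items()}
    alloc = {f: 1 + int(q) for f, q in quotas.items()}
    leftover = node_count - sum(alloc.values())
    by_remainder = sorted(usage, key=lambda f: quotas[f] - int(quotas[f]),
                          reverse=True)
    for f in by_remainder[:leftover]:
        alloc[f] += 1
    return alloc


# Hypothetical prior-usage counts for three processing functions.
usage = {"variant_validation": 700, "correlation": 250, "drug_profile": 50}
allocation = allocate_nodes(usage, 20)
```

With these invented counts, the most frequently used function
receives the largest share of the twenty nodes and the least
frequently used function the smallest.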
[0211] Attention is now directed to FIG. 21, which depicts a
biological data network 2100 comprised of a plurality of network
nodes 2110. In one embodiment each of the network nodes 2110 is
substantially similar or identical to each network node 510.
Similarly, the network nodes 2110 form an overlay network and
communicate by way of IP packets delivered through the Internet
(not shown in FIG. 21). A plurality of network-associated devices
2120 are configured to send messages to the network 2100 to receive
updated data and result information in response to such messages.
Each device 2120 may also structure any data provided to the
network 2100 consistent with the layered data structure 200
utilized by the network 2100.
[0212] During operation of the network 2100, a user application
executing on a device 2120 will determine a set of processing
functions which are required for responding to a request message.
This determination will generally require interaction between the
user application and the protocols that are running on the network.
In one embodiment frequent "push and pull" between user application
2340 and the user net software 2350, coupled with frequent updating
of information at the network nodes 510, enables an approximation
of required functionalities to be made based on a combination of
factors. For example, such an approximation could be predicated
upon knowledge of previous query messages, available data, and
available network functions.
[0213] As shown in FIG. 21, a local area network 2140 contains a
plurality of processing devices 2150 and a network-associated
device 2120' in communication with the network 2100. The processing
devices 2150 may be connected in a manner by which access to the
network can be achieved through at least the network-associated
device 2120'. The single network-associated device 2120' will
generally regularly communicate with the plurality of network nodes
2110 and can broadcast messages over the network 2100 sent from any
user in the local area network 2140.
[0214] Attention is now directed to FIG. 22, which is a flow chart
2200 representative of a set of exemplary processing operations
performed by the biological data network 500 in response to a user
query or request. In a stage 2210, at least one subject sequence is
received at a first network node 510 and compared to a reference
sequence. This comparative sequence analysis can be done locally
(i.e., at the first network node 510), or at another network node
510.
[0215] In one embodiment a result of this sequence comparison is a
large file of minor alleles with relation to the reference
sequence. The variants can range from single nucleotide
polymorphisms to larger insertions, deletions, reversions,
translocation, chromosomal rearrangements, mobile elements, and the
like. Initially, all the variant alleles are arranged sequentially
based on position in the reference sequence. In a stage 2220, these
variants are matched or otherwise validated against a database of
known and implicated variant alleles for at least one disease,
phenotype, symptom, biomarker, etc.
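Stages 2210 and 2220 can be illustrated with a deliberately
simplified sketch. It handles single nucleotide differences only and
assumes the subject and reference sequences are already aligned and
of equal length; insertions, deletions and rearrangements would
require a real alignment step. The function names and the
dictionary standing in for the database of known variants are
hypothetical.

```python
from typing import Dict, List, Tuple

Variant = Tuple[int, str, str]  # (1-based position, reference base, subject base)


def find_snps(reference: str, subject: str) -> List[Variant]:
    """List single-base differences, ordered by position in the reference."""
    if len(reference) != len(subject):
        raise ValueError("sequences must be aligned to equal length")
    return [(i + 1, r, s)
            for i, (r, s) in enumerate(zip(reference, subject))
            if r != s]


def validate(variants: List[Variant],
             known: Dict[Variant, List[str]]) -> Dict[Variant, List[str]]:
    """Keep only variants present in the table of known or implicated alleles."""
    return {v: known[v] for v in variants if v in known}


# Toy data; the disease association is a placeholder, not a real finding.
reference = "ACGTACGT"
subject = "ACCTACGA"
known_variants = {(3, "G", "C"): ["breast cancer"]}

variants = find_snps(reference, subject)
validated = validate(variants, known_variants)
```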
[0216] The list of variant alleles that have been validated is
used to isolate genes that are associated with the onset,
progression or prognosis of a disease (stage 2230). In this case,
the locus for a trait can fall within the coding region or
regulatory region or in introns associated with a gene. The gene
profile has disease specificity and along with the information on
the particular variant alleles that are characterized and validated in
the genome of this patient, the gene profile becomes very
personalized. Statistical analytical functions may be performed to
generate a correlation profile between the validated variant
alleles and various phenotypes, symptoms, biomarkers, scans, scores
etc. that are associated with the disease condition (stage
2240).
[0217] Differential gene expression data and clinical results from
various pharmacological drug studies may then be used to generate a
drug efficacy and toxicity profile (stage 2250). Based on the
results of the gene profile, correlation profile and gene
expression profile, a particular molecular classification could be
accorded to a patient so as to enable a health care provider to
develop various clinical profiles. For example, a drug profiling
scheme could be developed for the patient in order to facilitate
selection of more effective treatments (stage 2260). For example,
rather than treating a disease based exclusively on symptoms, drug
selection may be made based on molecular-level clinical profiles
such that drugs targeting a specific molecular activity, mechanism
or pathway could be selected based upon such profiles.
[0218] Turning now to FIG. 23, an illustration is provided of the
separation of localized and network-based processing functions
within a portion of a biological data network 2300. During
operation, a user may make a request through a graphical user
interface ("GUI") 2310 generated by a user application 2316
utilized to access local data 2320 and other application software.
As the local data 2320 is operated upon by the user, a user network
software engine 2326 monitors the activities of the user and
determines if the outcome of the operation may be useful to other
users on the network. The network accessible data at the local
source is converted or normalized to a format consistent with,
for example, the biological data model 200. For example, the
network accessible data may comprise a plurality of biological data
units containing a payload including a segment of biological
sequence data and a set of headers associated with the sequence
segment.
[0219] In this case the sequence data could comprise actual, "raw"
sequence data, or sequence data represented in an instruction
format as described in the above-referenced copending patent
applications. Alternatively, the network accessible data could
include only the header information associated with a collection of
biological data units. In this case the sequence data comprising
the payloads of such sequence data could, for example, remain
stored only within local data 2320. This arrangement advantageously
permits the selective sharing of various characteristics of a
collection of sequence data without permitting access to the
sequence data itself.
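The header-only sharing arrangement may be sketched as follows. The
dictionary layout and the function name are assumptions made for
illustration; the point is only that the network-accessible view can
be derived from the local data units without copying their payloads.

```python
from typing import Dict, List


def publish_view(units: List[Dict], share_payload: bool = False) -> List[Dict]:
    """Return the network-accessible view of locally stored data units.

    By default only a copy of each unit's headers is exposed; the
    sequence payload stays in local storage.
    """
    view = []
    for unit in units:
        shared = {"headers": dict(unit["headers"])}
        if share_payload:
            shared["payload"] = unit["payload"]
        view.append(shared)
    return view


# Hypothetical local data unit with placeholder header values.
local_data = [{"headers": {"L1H1": "chr17:100-107"}, "payload": "ACGTACGT"}]
network_view = publish_view(local_data)
```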
[0220] The network software engine 2326 evaluates the request
message and is able to intelligently distribute the required
processing functions between the local server 2340 and one or more
network nodes 510. For example, to generate a list of variant
alleles relative to a reference sequence, the comparative sequence
analysis yielding a list of variants could be performed on the
local server 2340. The list of variants could then be validated
using one or more network nodes 510 to access relevant databases
and broadcast results for updating network nodes.
[0221] In one embodiment nodes 510 at the edge of the network 500
use applications to communicate and update core network elements.
The information about the data that is accessed at the various
network nodes 510 may be transmitted between nodes 510 as a result
of the functions and the inherent awareness of the network 1810 to
biologically-relevant information.
[0222] It should be understood that a source or user node in one
instance can access multiple network nodes and associated databases
at various destinations. However, in another instance the previous
source can serve as a destination for network processing functions
and biologically relevant information concerning the requested data
( ).
[0223] Certain information that is learned, updated, stored or
otherwise made accessible based on a query might be published or
broadcast on the network 500 based on a previous request for the
specific or related data. For example, when a query relating to a
new drug is processed by the network 500, the result of the query
could be used to update a multi-function super node 2350.
[0224] Turning now to FIG. 24, an illustration is provided of
various functional interactions between network-based and localized
applications. The network-based applications executing on the
network nodes 510 interact in a manner that allows the use of
biologically-relevant information to distribute functional
processing between the local processor and network processors. In
response to a request message received through the graphical user
interface, user application software begins performing some portion
of the network-based and local processing that is required to
return the desired response. The user network interface relates to
the network software in such a manner that allows the network
software to operate based on updated information at network
nodes.
[0225] Attention is now directed to FIG. 25, which depicts a
biological data network 2500 including a collaborative simulation
network 2510. In addition to the collaborative simulation network
2510, the biological data network includes a plurality of network
nodes 2504. The collaborative simulation network 2510 is comprised
of a plurality of processing nodes 2514.
[0226] In one embodiment the network nodes 2504 and processing
nodes 2514 are structured and function in a manner substantially
similar or identical to that described above with respect to the
network nodes 510. In this embodiment the biological data network
2500 is implemented as an overlay network to the Internet (not
shown in FIG. 25), which facilitates packetized communication
between ones of the network nodes 2504 and between ones of the
processing nodes 2514. As discussed below, packetized communication
also occurs between certain processing nodes 2514 and network nodes
2504.
[0227] Each processing node 2514 of the collaborative simulation
network 2510 is capable of performing at least one function
required to process a user request or message. In one embodiment
the applications executed by the collaborative simulation network
2510 are interactive and capable of distributing and coordinating
processing function requirements with available updated information
to return results to a user. In general, results generated at a
given processing node 2514 on the collaborative simulation network
2510 are propagated to, and stored at, the other nodes 2514 of the
network 2510. In addition, this data can also be made available,
through one or more of the processing nodes 2514, to the network
nodes 2504.
[0228] The collaborative simulation network 2510 could be used by,
for example, groups such as consortia, a network of providers, at
least one processing event involved in a genome sequence data
analysis workflow or in connection with performance of a clinical
trial. Users associated with particular processing nodes 2514 may
access the processing functions and data associated with other such
nodes 2514.
[0229] In one embodiment the ability of users of processing nodes
2514 to access the processing capabilities of other nodes 2514
would be controlled in accordance with an access policy. Local data
that is made available on the processing nodes 2514 of the
simulation network 2510 could be published or broadcast to the
network nodes 2504 of the data network 2500 based upon, for
example, the interests of users associated with such nodes
2504.
[0230] Although FIG. 25 depicts only one simulation network 2510
operative within the network 2500, in other embodiments multiple
different simulation networks could be simultaneously functioning
on the data network 2500. In this case the data types and
processing functions utilized in the collaborative effort effected
by each simulation network would generally be specific to each such
network. For example, a particular collaboration facilitated by a
given simulation network could include or involve use of, for
example, image data, biomarkers including proteomic, metabolomic
and transcriptomic markers, and other related data.
BioIntelligence Processing on Biological Data Networks
[0231] Attention is now directed to FIG. 26, in which there is shown a
flowchart 2600 representative of the manner in which information
relating to various different layers of biologically-relevant data
organized consistently with the biological data model 200 may be
processed at different network nodes 510. In a stage 2610, a
request to process data comprised of at least a DNA layer 210 and
an RNA layer 220 is received at a first network node. Data in the
DNA layer is then processed in accordance with the request (stage
2612). At least partial results of the processing of the data in
the DNA layer are then forwarded to a second network node (stage
2616). Data within the partial results is then processed at the
second network node with respect to at least the RNA layer (stage
2620). A third network node is then identified based upon the
results of the processing at the second network node (stage 2622).
The results of the processing at the second network node are then
forwarded to the third network node, which then processes such
results (stage 2626). The results of the processing performed at
the third network node are then sent and subsequently received at
the first network node (stage 2630). A response to the request is
then sent from the first network node to, for example, a client
terminal based upon the results of the processing performed at the
third network node 510 (stage 2632).
[0232] Turning now to FIG. 27, a flowchart 2700 provides an
overview of an exemplary manner in which network nodes 510 of the
biological data network 500 may cooperate to process a client
request. In stage 2710, a request is received from a client device
at a first network node 510. Processing is then performed at the
first network node based upon the request (stage
2712). In stage 2714, it is determined whether processing at the
first network node is complete. If such processing is complete,
then an appropriate response is returned to the client (stage
2718). If not, the results of the processing at the first network
node 510 may be routed or switched to a next network node 510
selected or otherwise scheduled in accordance with the nature of
such processing results (stage 2720). Processing is then performed
at the next network node based upon the request (stage
2722). It is then determined whether processing at the next network
node has been completed (stage 2724). If such processing has been
completed, a response is returned to the client (stage 2718);
otherwise, some or all of the accumulated processing results may
again be routed or switched to a next network node 510 (stage 2720).
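The loop of FIG. 27 can be sketched in a few lines. This is an
illustrative model only: each node is represented by a hypothetical
Python function that returns its updated results together with the
name of the next node (or None when processing is complete), and the
max_hops limit is an added safeguard that does not appear in the
flowchart.

```python
from typing import Callable, Dict, Optional, Tuple

# A node takes (request, accumulated results) and returns
# (updated results, name of the next node or None when complete).
NodeFn = Callable[[dict, dict], Tuple[dict, Optional[str]]]


def handle_request(request: dict, first_node: str,
                   nodes: Dict[str, NodeFn], max_hops: int = 16) -> dict:
    results: dict = {}
    current = first_node
    for _ in range(max_hops):
        # Processing at the current node (stages 2712 / 2722).
        results, next_node = nodes[current](request, results)
        # Completion check (stages 2714 / 2724).
        if next_node is None:
            return results            # response to the client (stage 2718)
        current = next_node           # route to the next node (stage 2720)
    raise RuntimeError("request not completed within max_hops")


# Hypothetical nodes: the next node depends on the DNA-layer result.
def dna_node(request: dict, results: dict):
    variants = [v for v in request["variants"] if v in request["known"]]
    next_node = "rna" if variants else None
    return {**results, "variants": variants}, next_node


def rna_node(request: dict, results: dict):
    return {**results, "rna_checked": True}, None


NODES: Dict[str, NodeFn] = {"dna": dna_node, "rna": rna_node}
```

In this toy arrangement a request with no validated variants is
answered by the first node alone, while a request with at least one
validated variant is forwarded to the second node before a response
is returned.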
[0233] FIG. 28 is a flowchart representative of an exemplary
sequence of operations involved in the identification and
processing of sequence variants at a network node 510. In stage
2810, a genome sequence (e.g., a segment of the entire genome of an
organism) associated with a request issued by a user terminal or
other client device is received at a network node 510. The genome
sequence is then compared with a reference sequence at the network
node (stage 2812). Through this comparison sequence variants
between the genome sequence and the reference sequence are
identified (stage 2816). In a stage 2820, a network location of a
database containing information concerning at least a first of the
sequence variants is determined. Next, at least the first of the
sequence variants is sent from the network node to the database
(stage 2822). Information from the database relating to the first
of the sequence variants is then received at the
network node (stage 2826). A response is then sent from the network
node to the user terminal based upon the information from the
database (stage 2830).
[0234] Turning now to FIG. 29, a flowchart 2900 is provided of an
exemplary sequence of operations carried out by network nodes 510
of the biological data network in connection with processing of a
disease-related query. A query relating to a specified disease and
a genomic sequence associated with the query are received at a
first network node 510 (stage 2910). Any variant
alleles within the genomic sequence are then identified relative to
a control sequence (stage 2912). Next, information relating to the
variant alleles is sent from the first network node to a second
network node (stage 2916). A statistical correlation analysis is
then performed at the second network node 510 in
order to identify a set of the variant alleles included within
genes associated with a specified disease (stage 2920). Information
relating to the set of variant alleles is then received at the
first network node (stage 2926). A response to the
query is sent from the first network node 510 based upon the
information relating to the set of variant alleles (stage
2930).
[0235] Attention is now directed to FIG. 30, which is a flowchart
3000 representative of an exemplary sequence of operations involved
in providing pharmacological response data in response to a user
query concerning a specified disease. In a stage 3010, a query
relating to a specified disease and a genomic sequence associated
with the query are received at a first network node 510. Next, any
variant alleles within the genomic sequence are identified relative
to a control sequence. In a stage 3016, information relating to the
variant alleles is sent from the first network node 510 to a second
network node. A statistical correlation analysis is then performed
at the second network node in order to identify those of the
variant alleles included within genes associated with a specified
disease (stage 3020). At a third network node 510, processing is
performed to associate pharmacological response data with those of
the variant alleles included within genes associated with the
specified disease (stage 3022). Such pharmacological response data is
sent from the third network node 510 and received at the first
network node (stage 3026). A response to the query is then sent
from the first network node to, for example, a client terminal
based upon the pharmacological response data (stage 3030).
Transmission and Reconstitution of Genome Sequence Data
[0236] Attention is now directed to FIG. 31, to which reference
will be made in describing the communication of DNA sequence data
or other biological sequence information between a pair of devices
supporting a biological data network 3100. In one embodiment the
biological data network 3100 comprises representations of
biological data linked and interrelated by an overlay network 3104
containing a plurality of network nodes 3110. In one embodiment the
biological overlay network 3104 incorporates networking
applications and protocols similar to those described with
reference to the biological data network 1800.
[0237] As shown, the biological overlay network 3104 includes a
plurality of network nodes 3110, a source client device 3120 and a
destination client device 3130. In one embodiment both the source
client device 3120 and the destination client device 3130 are
configured to generate IP packets encapsulating biological data
units comprised of one or more biologically-relevant headers and a
payload including a representation of a segment of biological sequence
data and to provide such IP packets to a network node 3110 for
distribution within the network 3100. Likewise, both the source
client device 3120 and the destination client device 3130 are
capable of receiving such IP packets from a network node 3110 and
extracting the biologically relevant headers and payload sequence
data.
[0238] In one embodiment the source client device 3120 stores or
has access to DNA sequence data. Such sequence data may, for
example, be accessed from storage or from a sequencing machine (not
shown) configured to produce "reads" of DNA sequence data. Within
the source client device 3120, the DNA sequence data may be
compared to a reference sequence and represented in an instruction
format in the manner described above. A plurality of biological
data units may then be generated based upon segments of this
sequence data and stored within the source client device 3120.
Each biological data unit will include a suitably-sized segment of
DNA sequence data and a plurality of biologically-relevant headers.
These biological data units may then be encapsulated with TCP/IP
and/or other network protocol headers to facilitate transmission
through the biological data network 3100.
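By way of illustration only, the generation of biological data units
from segments of sequence data may be sketched as follows. The class
name BiologicalDataUnit, the header fields (sample_id, offset, length)
and the length-prefixed JSON layout are assumptions made for this
sketch and are not defined by the present disclosure; the resulting
bytes would then be handed to an ordinary TCP/IP socket for transport.

```python
import json
import struct
from dataclasses import dataclass


@dataclass
class BiologicalDataUnit:
    """Hypothetical container: biologically-relevant headers plus a payload."""
    headers: dict
    payload: str

    def to_bytes(self) -> bytes:
        # Illustrative layout: 4-byte header length, JSON headers, ASCII payload.
        header_bytes = json.dumps(self.headers, sort_keys=True).encode("utf-8")
        return struct.pack("!I", len(header_bytes)) + header_bytes + self.payload.encode("ascii")

    @classmethod
    def from_bytes(cls, data: bytes) -> "BiologicalDataUnit":
        (header_len,) = struct.unpack("!I", data[:4])
        headers = json.loads(data[4:4 + header_len].decode("utf-8"))
        payload = data[4 + header_len:].decode("ascii")
        return cls(headers=headers, payload=payload)


def make_units(sequence: str, segment_size: int, sample_id: str) -> list:
    """Split a sequence into fixed-size segments and attach headers to each."""
    units = []
    for offset in range(0, len(sequence), segment_size):
        segment = sequence[offset:offset + segment_size]
        headers = {"sample_id": sample_id, "offset": offset, "length": len(segment)}
        units.append(BiologicalDataUnit(headers, segment))
    return units


# First 40 bases of example sequence fragment 2 from the sequence listing.
sequence = "".join(["agttgacacc", "tgtccacacg", "ttaaacaggt", "tccataagat"])
units = make_units(sequence, 16, "example-1")
wire = [unit.to_bytes() for unit in units]
restored = [BiologicalDataUnit.from_bytes(blob) for blob in wire]
```

In this sketch the headers travel with each segment, so a receiving
device can interpret a unit without reference to any other unit.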
[0239] The packetized biological data units sent by the source
client device 3120 are routed and switched through the Internet or
other network connecting the network nodes 3110 of the biological
data network 3100 and delivered to the destination client device
3130. In the case in which DNA sequence data comprising an entire
genome is sent by the client device 3120, the destination client
device 3130 may reconstruct such genome from the packetized
biological data units sent by the source client device 3120.
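As a minimal sketch of the reconstruction described above, and
assuming (hypothetically) that each received unit carries the offset
of its segment within the original sequence, the destination client
device could reorder and concatenate the segments as follows. The
function name reconstruct and the (offset, segment) tuple form are
illustrative only.

```python
def reconstruct(units):
    """Rebuild a sequence from (offset, segment) pairs received in any order."""
    parts = []
    expected = 0
    for offset, segment in sorted(units, key=lambda unit: unit[0]):
        if offset != expected:
            # A gap or overlap means a unit was lost or duplicated in transit.
            raise ValueError(f"missing or overlapping data at offset {expected}")
        parts.append(segment)
        expected = offset + len(segment)
    return "".join(parts)


# Units arriving out of order, as may happen in a routed network.
received = [(16, "cacgttaaacaggttc"), (0, "agttgacacctgtcca"), (32, "cataagat")]
genome = reconstruct(received)
```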
Load Balancing
[0240] Attention is now directed to FIG. 32, to which reference
will be made in describing various ways in which multiple devices
supporting a biological data network 3200 may share responsibility
for mapping, assembling, fragmenting, packetizing, transmitting,
re-assembling and otherwise processing DNA sequence data or other
biological sequence information.
[0241] In one embodiment the biological data network 3200 comprises
packetized representations of biological data linked and
interrelated by a biologically-relevant-data-aware overlay network
3204 containing a plurality of network nodes 3210. As is discussed
below, such packetized DNA sequence data may be stored within a
storage element, or may be created by directly accessing data
produced by a high-throughput sequencing machine.
[0242] In the embodiment of FIG. 32, a device 3220 (i.e., "Device
A") is associated with a network-attached storage element 3240. The
information stored can be accessed and mapped by transmitting the
data to any device having access to the BioIntelligent™ data
network 3200. A device 3224 (i.e., "Device B") is attached to a
high-throughput next generation sequencing machine 3244, and data
can stream directly to the device. In this case fragments of
sequences flow into Device B, which may further divide such
fragments in order to generate sequence fragments of optimal length
in view of the desired size of the payloads of packets used for
data transport within the network 3200.
[0243] In one embodiment both Device A and Device B are configured
to generate IP packets encapsulating biological data units
comprised of one or more biologically-relevant headers and a
payload including a representation of a segment of biological sequence
data and to provide such IP packets to a network node 3210 for
distribution within the network 3200. Likewise, both Device A and
Device B are capable of receiving such IP packets from a network
node 3210 and extracting the biologically relevant headers and
payload sequence data.
[0244] Packetized sequence data may be transmitted by direct
networking between Device A and Device B, in which case both Device
A and B have access to the machine-read data and both contain a
stored copy of the reference sequence. As a result, both Device A
and Device B may share the load of assembling the genome, for
example. Using a specific set of dynamically interactive network
applications and protocols, the direct connection between Devices A
and B means that all of the DNA sequencing machine read data that
are accessible to one device can be distributed through a local
network to the second device for load sharing. One or more
reference sequences used for mapping and assembly may also be
shared between Device A and Device B. In one embodiment Device A
and Device B are networked and able to transmit and track specific
reads that have been mapped, along with the site or sites on the
reference sequence that correspond to the packetized machine-read
sequence.
[0245] Referring again to FIG. 32, a network-attached storage
container (NAS) 3240 contains DNA sequence data in the form of raw
machine-read sequences. When read size is short and sequencing has
a high level of redundancy, the consensus of the redundant reads is
stored. The DNA sequence reads in this storage element 3240 could
have been generated from, for example, an image data-sequencing
platform or direct-to-digital sequence device. In any case, the DNA
sequence is packetized with BI header information that can be used
to characterize such sequence in a way that allows it to be mapped
to a specific region of the genome using a separately stored
reference sequence. The sequence information stored within NAS 3240
need not necessarily comprise whole genome sequence data, but
rather could have been generated using a method of sequence
enrichment such as, for example, ChIP-Seq, RNA-Seq, ribosome
profiling, and the like.
[0246] During operation, Device A is capable of accessing data from
the NAS 3240. As the DNA sequence data streams into Device A from
the NAS 3240, the sequence data is processed and BI header
information is attached to the packetized data, thereby yielding
data units that are fully recognizable by the network elements and
devices, including but not limited to hardware, software, firmware,
middleware, etc. In this regard the Device A may be configured to
generate a biologically-relevant header for each segment of
sequence data accessed from the NAS 3240 based upon the position to
which such segment maps in a stored reference sequence being used
for assembly. Once this mapping has been effected for each sequence
segment, an entire assembled sequence (e.g., of an entire genome)
may again be stored in NAS 3240.
[0247] In one embodiment the sequencing machine 3244 comprises any
sequencing platform capable of generating reads of DNA sequence
data. As such reads are being generated, the sequencing
machine 3244 may stream the data directly to Device B. Reads of DNA
sequence data accessed from the sequencing machine 3244, or
sequence segments thereof, are assigned biologically-relevant
headers having one or more fields pertaining at least to the
position or positions on a reference sequence corresponding to the
particular read or sequence segment.
[0248] Alternatively, in order to facilitate sharing the load of
mapping and assembling the reads generated by the sequencing
machine 3244, Device B may forward such sequence data from the
machine 3244 directly to another device, such as Device A, or to
any other device operatively coupled to the network 3200.
Based upon the configuration of Device A and Device B, the reads of
sequence data streamed into Device B can also be read directly by
Device A. Since in this case both Device A and Device B are mapping
the DNA sequence reads from a single sequencing machine 3244, the
reference sequence being used by both Device A and Device B will
generally be the same. In this way Device A and Device B may be
configured to cooperatively share the load of mapping and
assembling the reads of sequence data generated by the machine
3244.
[0249] In one embodiment Device A and Device B would implement a
protocol stack developed specifically to handle such shared-mapping
assembly and to effect load balancing. For example, a user could
configure the devices such that Device A would be responsible for
mapping sequence reads (or segments thereof) to chromosomes 1 to
10, while the sequence reads (or segments thereof) mapping to all
other chromosomes could be assembled by Device B.
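The user-configured division of labor described above may be
sketched as follows. This is an illustrative assumption rather than
the protocol stack itself: it presumes that the chromosome to which
each read maps has already been determined, and the names
DEVICE_A_CHROMOSOMES, assign_device and partition_reads are
hypothetical.

```python
# Chromosomes 1 to 10 are handled by Device A; all others go to Device B.
DEVICE_A_CHROMOSOMES = {str(n) for n in range(1, 11)}


def assign_device(chromosome: str) -> str:
    """Return the device responsible for reads mapping to this chromosome."""
    return "Device A" if chromosome in DEVICE_A_CHROMOSOMES else "Device B"


def partition_reads(reads):
    """Group reads by the device configured to map and assemble them."""
    work = {"Device A": [], "Device B": []}
    for read in reads:
        work[assign_device(read["chromosome"])].append(read)
    return work


reads = [
    {"id": "r1", "chromosome": "1"},
    {"id": "r2", "chromosome": "17"},
    {"id": "r3", "chromosome": "X"},
    {"id": "r4", "chromosome": "10"},
]
work = partition_reads(reads)
```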
[0250] Considering now the processing by Device B (and/or by Device
A) of reads of DNA sequence data produced by the machine 3244, in a
first stage a size of such reads is determined. If
the sizes of the sequence fragments comprising such reads are
determined to be too large for convenient inclusion in biological
data packets, then such sequence fragments are further segmented
into appropriately-sized segments. Subsequent stages in the process
include aligning the incoming sequence fragments against a stored
reference sequence.
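The first stage described above, in which the size of a read is
checked and oversized fragments are further segmented, may be
sketched as follows. The limit MAX_PAYLOAD_BASES and the function
name segment_read are assumptions chosen for illustration; the
disclosure does not specify a particular payload size.

```python
MAX_PAYLOAD_BASES = 20  # hypothetical payload limit, in bases


def segment_read(read: str, max_bases: int = MAX_PAYLOAD_BASES):
    """Return the read unchanged if it fits, otherwise split it into segments."""
    if max_bases <= 0:
        raise ValueError("max_bases must be positive")
    if len(read) <= max_bases:
        return [read]
    return [read[i:i + max_bases] for i in range(0, len(read), max_bases)]


# Example sequence fragment 2 (66 bases) and fragment 1 (17 bases).
fragment_2 = "".join([
    "agttgacacc", "tgtccacacg", "ttaaacaggt", "tccataagat",
    "tgtgccgtta", "aatactcagg", "caatct",
])
segments = segment_read(fragment_2)
short = segment_read("ggaggctagttagtata")
```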
[0251] Once the incoming sequence fragments or segments thereof
have been properly aligned to the stored reference sequence, then
biological data packets including biologically-relevant headers may
be generated. Information pertaining to the alignment site (or
sites) at which such sequence fragments or segments map to the
reference will generally be included in the "Layer 1" header of
each biological data packet. Each such Layer 1 header will also
generally include other information required for the mapping and
assembly of such sequence fragments or segments thereof into whole
genome sequences.
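A minimal sketch of populating such a header follows. It assumes
exact-match alignment, which is a simplification (practical aligners
tolerate mismatches and gaps), and the field names reference_id,
alignment_sites and segment_length are hypothetical rather than a
defined Layer 1 header layout.

```python
def find_alignment_sites(segment: str, reference: str):
    """Return every 0-based offset at which the segment matches exactly."""
    sites = []
    start = reference.find(segment)
    while start != -1:
        sites.append(start)
        start = reference.find(segment, start + 1)
    return sites


def build_layer1_header(segment: str, reference: str, reference_id: str) -> dict:
    """Assemble an illustrative header recording where the segment maps."""
    return {
        "reference_id": reference_id,
        "alignment_sites": find_alignment_sites(segment, reference),
        "segment_length": len(segment),
    }


# Example sequence fragment 2 serves as the stored reference here, and
# example sequence fragment 3 is the incoming segment.
reference = "".join([
    "agttgacacc", "tgtccacacg", "ttaaacaggt", "tccataagat",
    "tgtgccgtta", "aatactcagg", "caatct",
])
header = build_layer1_header("ttaaacaggttccata", reference, "example-ref")
```

A segment that maps to more than one site yields several entries in
alignment_sites, consistent with the "site (or sites)" language above.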
[0252] Referring again to FIG. 32, another network-connected device
3228 (i.e., "Device C") may receive biological data units
encapsulated within IP packets sent through the network 3200 by,
for example, Device A and Device B. In one embodiment the Device C
is substantially similar or identical to Device A and Device B, and
may also share the load of mapping sequence fragments or segments
thereof produced by sequencing machine 3244, or stored within NAS
3240, to a reference sequence. For example, sequence fragments
generated by machine 3244 could be streamed over the network 3200
to Device C, which would map such sequence fragments (or segments
thereof) to a stored reference sequence identical to the reference
sequence utilized by Device A and/or Device B. Because the two
devices are networked with a protocol suite capable of establishing
a robust level of communication, communication can also be
established with a third device (e.g., Device C in FIG. 32) through
the existing transport and control protocols of the existing
Internet.
[0253] Turning now to FIG. 33, a high-level illustration is
provided of a biological data network 3300 configured to utilize
techniques such as, for example, multiprotocol label switching
("MPLS") to facilitate the distribution of DNA sequence data and
related information between client devices 3320. In the embodiment
of FIG. 33, each client device 3320 is configured to generate IP
packets encapsulating biological data units comprised of one or
more biologically relevant headers and a payload including a
representation of a segment of biological sequence data and to provide
such IP packets to a network node 3310 for distribution within the
network 3300. Likewise, each client device 3320 is capable of
receiving such IP packets from a network node 3310 and extracting
the biologically relevant headers and payload sequence data.
[0254] In the embodiment of FIG. 33, MPLS may be utilized in edge
and backbone routers to analyze IP packets and encapsulate DNA
sequence data with appropriate labeling for switching. This gives
service providers the ability to select particular traffic paths
and supports virtual private networks with superior performance.
MPLS is capable of seamlessly addressing the issue of scalability
and the switch routing of DNA sequence data using a modification of
existing protocol suites or newly developed protocol suites. Such
DNA-based multiprotocol label switching provides a convenient
"short cut" to packet routing that may be made compatible with
existing protocols such as, for example, open shortest path first
(OSPF) and resource reservation protocol (RSVP). Packets that will
share the same transmission path will be grouped together in a
label switching protocol.
[0255] As shown, device 3320A ("Device A") is associated with a
network-attached storage element ("NAS") 3340. The information
stored can be accessed and mapped by transmitting the data to any
device supporting the data network 3300. Device A is also attached
to a high-throughput next generation sequencing machine 3344, from
which fragments of sequences are received. Device A may further
divide such fragments in order to generate sequence fragments of
optimal length in view of the desired size of the payloads of
packets used for data transport within the network 3300 to, for
example, device 3320B (i.e., "Device B").
[0256] FIGS. 33 and 34 also illustrate the process of assigning
biologically-relevant and network-related headers to segments of
DNA sequence data stored within NAS 3340 or received from the
sequencing machine 3344. As sequence fragments are received by
Device A from either or both of the NAS 3340 and the sequencing
machine 3344, biologically-relevant headers 3348 are generated and
assigned to such fragments or to segments thereof. This results
in creation of biological data units 3350, each of which includes
the fragment or segment of DNA sequence data 3352 with which one or
more biologically-relevant headers 3348 are associated.
[0257] In one embodiment Device A is configured to determine the
map site on a reference sequence as the biologically-relevant
headers are assigned. Next, a specialized suite of networking
protocol headers 3354 may be used to encapsulate the biological
data units, thus creating network-enabled packets 3360. In one
embodiment MPLS labels may also be assigned to the network-enabled
packets 3360, thereby creating MPLS-labeled packets 3410 and
facilitating more efficient switching through label swapping
techniques.
[0258] As may be appreciated with reference to FIGS. 33 and 34, in
one embodiment multiprotocol label switching is performed
within a biologically-relevant-data-aware overlay network 3304. In
one embodiment, label edge routers (LER) are used on the ingress
side of the network 3304 to label as yet unlabeled IP packets,
while the label switch routers (LSR) are used for swapping in the
backbone of the network 3304. These labels may be used to assign
DNA sequence data packets to a particular class for forwarding.
[0259] As a result, transmission along a predetermined
path--namely, a label switch path ("LSP")--may be determined based
on class, traffic, and quality of service, each of which can be
controlled and maintained by the service provider. That is, based
on the analysis performed at the ingress side of the network 3304,
incoming IP packets encapsulating biological data units are
classified, assigned the appropriate label, encapsulated in an MPLS
header, and forwarded to the next stop in the LSP.
[0260] On the egress side of the network 3304, the labels are
removed by LERs and packets are sent on through the network 3304 to
their destination. Device B may receive the network enabled packets
3360 transmitted over the network 3304 and extract the DNA
sequence data therefrom. In the embodiment of FIG. 33 the Device B
also communicates with Device A in order to determine which
reference sequence (or version thereof) is being used by the Device
A in order to create the representation of DNA sequence data
contained within the network enabled packets 3360. With this
arrangement, sequence mapping can be distributed and the load can
be shared among multiple devices.
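The label handling described in the preceding paragraphs (labeling at
an ingress LER, swapping at an LSR, and removal at an egress LER) may
be sketched as follows. The label values, traffic class names, next
hop names and table contents are hypothetical, and the sketch models
only the label stack, not an actual MPLS header encoding.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    traffic_class: str
    labels: list = field(default_factory=list)  # label stack, top of stack last


# Hypothetical ingress table: traffic class -> initial label.
INGRESS_LABELS = {"clinical": 100, "research": 200}
# Hypothetical LSR table: incoming label -> (outgoing label, next hop).
LSR_TABLE = {100: (110, "lsr-2"), 200: (210, "lsr-3")}


def ingress_ler(packet: Packet) -> Packet:
    """Classify an unlabeled packet and push the label for its class."""
    packet.labels.append(INGRESS_LABELS[packet.traffic_class])
    return packet


def lsr_swap(packet: Packet) -> str:
    """Swap the top label and return the next hop along the LSP."""
    incoming = packet.labels.pop()
    outgoing, next_hop = LSR_TABLE[incoming]
    packet.labels.append(outgoing)
    return next_hop


def egress_ler(packet: Packet) -> Packet:
    """Remove the label so the packet can be forwarded as plain IP."""
    packet.labels.pop()
    return packet


pkt = ingress_ler(Packet(b"acgt", "clinical"))
labels_after_ingress = list(pkt.labels)
next_hop = lsr_swap(pkt)
labels_after_swap = list(pkt.labels)
egress_ler(pkt)
```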
[0261] As an example of the use of MPLS labeling techniques,
consider the case in which a biological data unit includes a
payload comprised of a representation of DNA sequence data and an
associated biologically-relevant header annotated with information
on a particular gene or gene feature correlated with a particular
phenotype and/or disease. In one embodiment an appropriate MPLS
label could be associated with packets including such header
information, which would enable such packets to be accorded a
particular quality of service.
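One way such an association might be expressed is sketched below,
purely as an assumption for illustration. The header key "genes", the
table PRIORITY_GENE_LABELS, the label values and the choice of genes
are all hypothetical; how a given label translates into a quality of
service would be determined by the service provider.

```python
# Hypothetical table: annotated gene -> label carrying a preferred service class.
PRIORITY_GENE_LABELS = {"BRCA1": 300, "TP53": 301}
DEFAULT_LABEL = 900  # hypothetical best-effort label


def select_label(headers: dict) -> int:
    """Pick a label from the gene annotations in a biologically-relevant header."""
    for gene in headers.get("genes", []):
        if gene in PRIORITY_GENE_LABELS:
            return PRIORITY_GENE_LABELS[gene]
    return DEFAULT_LABEL


annotated = select_label({"genes": ["BRCA1"], "phenotype": "example"})
unannotated = select_label({"genes": []})
```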
Streaming of Biological Sequence Data
[0262] Referring now to FIG. 35, in one embodiment of the
biological data network described herein various networking
protocols otherwise employed for streaming media may be utilized to
facilitate the dissemination of DNA sequence data. In a particular
implementation, such networking protocols (e.g., RTP, RTSP, RTCP)
are modified in order to make selected networking devices "DNA
aware". The resulting novel, specialized protocol stacks may be
used to pull, in response to a request from a client application,
streaming DNA sequence data from servers having access to storage
containing sequence data.
[0263] In accordance with one approach, the entire human diploid
genome sequence data for healthy and diseased heart, lung, and
colon tissue from one individual could be transmitted with
streaming packets. The DNA data in this case might stream directly
from high-throughput sequencing machines to a network-enabled
encoder element. The existing appliances would be able to respond
to the data with specific DNA sequence data content awareness.
[0264] As the DNA data are received, the various samples and
specific portions of samples can be decompressed or decoded,
compared, and analyzed without the need for saving any of the data.
During operation, a server streams DNA sequence data that has been
encoded into a predetermined compressed file format, such as the
compressed delta database format disclosed in the above-referenced
copending patent applications. This format stores the DNA data as
individualized encoded segments of the genome. Each biological data
unit containing a segment of DNA is assigned a BioIntelligence (BI)
header field that indicates the bit size of the read or segment or
gene. The server parses the streaming bits of the compressed file
to extract the biological data on the fly. The server sends the DNA
sequence data packets to the client at periodic intervals, while
the client then plays or interprets the individual encapsulated
packets as they arrive from the server.
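The use of a size field to parse a stream on the fly may be sketched
as follows. The framing shown (a 4-byte field giving the payload size
in bits, followed by an ASCII payload) is an assumption made for this
example and is not the BI header format or the compressed delta
database format referenced above.

```python
import io
import struct


def frame(segment: str) -> bytes:
    """Prefix a segment with a 4-byte field giving its size in bits."""
    payload = segment.encode("ascii")
    return struct.pack("!I", len(payload) * 8) + payload


def iter_segments(stream):
    """Yield segments one at a time as their bytes become available."""
    while True:
        size_field = stream.read(4)
        if not size_field:
            return  # clean end of stream
        if len(size_field) < 4:
            raise EOFError("stream ended inside a size field")
        (bit_size,) = struct.unpack("!I", size_field)
        payload = stream.read(bit_size // 8)
        if len(payload) < bit_size // 8:
            raise EOFError("stream ended inside a payload")
        yield payload.decode("ascii")


# Example sequence fragments 1 and 3, framed back to back in one stream.
stream = io.BytesIO(frame("ggaggctagttagtata") + frame("ttaaacaggttccata"))
segments = list(iter_segments(stream))
```

Because iter_segments is a generator, a client can act on each
segment as soon as it is parsed rather than waiting for the stream to
end.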
[0265] Referring to FIG. 35, sequence fragments, i.e., machine
reads 3510, of any length are generated by a sequencing machine.
Such sequence fragments, or segments thereof, are mapped to a
reference sequence 3514 (e.g., the human genome reference sequence
or an idealized reference sequence generated to optimize the
process) by a data encoder 3520. The data are then converted into a
compressed instruction format 3524 that is based on the reference
3514.
[0266] Compression may be carried out with no loss of information,
since the reference sequence 3514 may be stored and accessible to
the data encoder 3520. The DNA sequence data represented in the
compressed instruction format 3524 may then be assigned
biologically-relevant headers, as well as network associated
headers, and the resulting encapsulated sequence information served
3530 over, for example, a DNA-aware overlay network 3534.
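To illustrate why no information is lost when the reference is
available to both encoder and decoder, the following sketch
represents a read as instructions relative to a reference. It is a
simplified assumption, handling substitutions only and reads of the
same length as the reference window; it is not the compressed
instruction format of the copending applications, and the names
encode and decode are hypothetical.

```python
def encode(read: str, reference: str, position: int) -> dict:
    """Represent a read as its map position plus substitutions vs. the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    substitutions = [
        (i, base)
        for i, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    ]
    return {"position": position, "length": len(read), "substitutions": substitutions}


def decode(instructions: dict, reference: str) -> str:
    """Recover the original read by applying the instructions to the reference."""
    start = instructions["position"]
    bases = list(reference[start:start + instructions["length"]])
    for i, base in instructions["substitutions"]:
        bases[i] = base
    return "".join(bases)


# Example sequence fragment 2 as the reference; the read differs from the
# reference at one base (index 8 of the read, g replaced by c).
reference = "".join([
    "agttgacacc", "tgtccacacg", "ttaaacaggt", "tccataagat",
    "tgtgccgtta", "aatactcagg", "caatct",
])
read = "ttaaacag" + "c" + "ttccata"
instructions = encode(read, reference, 20)
recovered = decode(instructions, reference)
```

Only the position, length and differing bases need to be transmitted;
the decoder regenerates every other base from its own copy of the
reference.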
[0267] On the other side of the network 3534, the packets of
compressed data in instruction format arrive at a receiver. In one
embodiment the receiver can then decode 3540 and play the bit
stream as it is being sent. One advantage of this streaming
approach is that the DNA data can be processed and analyzed as
the packets are transmitted, before the entire file is
received.
[0268] After the compressed DNA sequence data 3544 in the
instruction format is decoded 3540, the un-compressed read sequence
can be aligned and mapped to the reference sequence. There is no
loss of information due to the compression and transmission of the
data. In this case, mapping of the machine-read sequences can be
delocalized and assembly of the whole genome sequence 3550 can be
shared among devices.
Distributed Sequence Processing, Analysis and Classification
[0269] Attention is now directed to FIG. 36, which provides a block
diagram of a high-speed sequence data analysis system 3600. The
analysis system 3600 may, for example, be utilized in personalized
medicine applications in which genomic-based diagnosis, treatment
or other services are offered. As is discussed below, the system
3600 operates to organize and represent genomic sequence data in a
structured format in association with information in the manner
described above. The structured data may then be further processed
and delivered to end users 3606 to facilitate analysis, research
and personalized medical applications. For example, the system 3600
may be configured to establish a networked arrangement among
participating medical clinics in a manner enabling the provision of
genomic-based diagnosis, treatment and other services.
[0270] Turning to FIG. 36, genomic data repository 3601 is
representative of genomic sequence data that has been configured in
accordance with standard protocols as well as newly built protocols
for operating on this type of data specifically. Substantially all
genomic sequence data which is currently publicly available is
provided by commonly-used genomics databases such as
dbGaP, CGHub containing data for TCGA (The Cancer Genome Atlas),
EMBL-Bank, DDBJ or other databases containing biological sequence
information. Other sources of information represented by genomic
data repository 3601 may include, for example, various sources of
microarray data, gene expression data, next-generation deep
sequencing data, copy number variation data, and SNP analysis
data.
[0271] In a stage 3602, the DNA sequences from repository 3601, in
their accepted format, are segmented into multiple fragments of data
sequences based upon user or application requirements. As a result,
fragments or data units of DNA sequence information may be
generated arbitrarily. Such fragments may include genes, introns
and/or exons, regions of the genome currently referred to as
"non-coding regions", or any other sequence segment relevant to a
particular application.
[0272] In a stage 3604, a header comprised of data provided by
storage device 3603 is assigned, associated, related or embedded
with each segment of DNA sequence data, thereby forming specialized
aggregates of sequence segments and attributes as biological data
units. This enables the selective processing and analysis of
genomic information in accordance with application requirements.
For example, in the case in which a system user 3606 is an
oncologist, only biological data units containing information from
those genes associated or otherwise correlated with a particular
cancer of interest (whether human, canine or other) are selected
for processing, thereby obviating the need for inefficient
processing of all of the information within data repository 3601.
This selective processing is facilitated by the layered
architecture of the biological data model 1400 and its
implementation using headers, as discussed previously.
[0273] Similarly, if the user 3609 is a virologist, only biological
data units having headers indicative of an association with viral
genomic information, or with human genes or gene fragments relating
to a specific viral infection, would be selected and processed.
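The header-based selection described in the two preceding paragraphs
may be sketched as follows. The header key "associations", the
function name select_units and the sample units (including their
placeholder payloads) are assumptions made for illustration and do
not reflect a defined header schema.

```python
def select_units(units, required_tags):
    """Keep only the units whose headers list at least one required association."""
    required = set(required_tags)
    return [
        unit for unit in units
        if required & set(unit["headers"].get("associations", []))
    ]


# Illustrative biological data units with placeholder payloads.
units = [
    {"headers": {"gene": "BRCA1", "associations": ["breast cancer"]}, "payload": "acgt"},
    {"headers": {"gene": "CCR5", "associations": ["viral infection"]}, "payload": "ttga"},
    {"headers": {"associations": []}, "payload": "ggcc"},
]

oncology_units = select_units(units, {"breast cancer"})
virology_units = select_units(units, {"viral infection"})
```

Only the headers are inspected in this sketch, so the payloads of
units that are not selected never need to be decoded or processed.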
[0274] The data within storage device 3603 may comprise any or all
of the information and knowledge known to be of relevance to a
particular gene. In addition, such data may also include
information related to processing genes which have been fragmented
into segments, and may be incorporated within headers designed to
scale to accommodate future information not yet discovered or known
about the particular gene or gene product or expression of that
gene.
[0275] In stage 3604, the segmented genomic data is encapsulated,
embedded or associated with appropriate headers to form biological
data units. Further, certain fields of such headers may be further
dynamically modified based upon application requirements. This may
occur, for example, when genomic data is further segmented pursuant
to stage 3602, which may essentially result in the generation of
new headers for the associated gene. The segmented genomics data
unit may then be further normalized (stage 3605) consistent with
the layered data structure described herein in view of user
application processing requirements. Storage devices 3606 are
generally configured for storage of normalized segmented sequence
data as biological data units in such a layered structure, thereby
facilitating easy access based upon application requirements.
[0276] In response to requests from user applications, the data
associated with biological data units stored within the devices
3606 may be processed, moved, analyzed or accelerated by one or
more application processing nodes 3607 to provide services such as,
for example, genomic-based diagnoses, visual exploitation of
genomic studies, or research and drug discovery and
development.
[0277] The user or client application desktop unit 3609 provides a
mechanism to run user applications, which generate user request
messages received by application processing nodes 3607 and display
the data or results returned by such nodes 3607. The unit 3609 may
be connected to localized ones of the processing nodes 3607 and
storage elements 3606 through a local area network or the
equivalent, and to remote processing and storage elements through a
wide area network and/or the Internet.
[0278] The word "exemplary" is used herein to mean "serving as an
example, instance, or illustration." Any embodiment described
herein as "exemplary" is not necessarily to be construed as
preferred or advantageous over other embodiments.
[0279] In one or more exemplary embodiments, the functions, methods
and processes described may be implemented in hardware, software,
firmware, or any combination thereof. If implemented in software,
the functions may be stored on or encoded as one or more
instructions or code on a computer-readable medium.
Computer-readable media includes computer storage media. Storage
media may be any available media that can be accessed by a
computer.
[0280] By way of example, and not limitation, such
computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or
other optical disk storage, magnetic disk storage or other magnetic
storage devices, or any other medium that can be used to carry or
store desired program code in the form of instructions or data
structures and that can be accessed by a computer. Disk and disc,
as used herein, include compact disc (CD), laser disc, optical
disc, digital versatile disc (DVD), floppy disk and Blu-ray disc,
where disks usually reproduce data magnetically, while discs
reproduce data optically with lasers. Combinations of the above
should also be included within the scope of computer-readable
media.
[0281] It is understood that the specific order or hierarchy of
steps or stages in the processes and methods disclosed are
examples of approaches. Based upon design preferences, it is
understood that the specific order or hierarchy of steps in the
processes may be rearranged while remaining within the scope of the
present disclosure. The accompanying method claims present elements
of the various steps in a sample order, and are not meant to be
limited to the specific order or hierarchy presented.
[0282] Those of skill in the art would understand that information
and signals may be represented using any of a variety of different
technologies and techniques. For example, data, instructions,
commands, information, signals, bits, symbols, and chips that may
be referenced throughout the above description may be represented
by voltages, currents, electromagnetic waves, magnetic fields or
particles, optical fields or particles, or any combination
thereof.
[0283] Those of skill would further appreciate that the various
illustrative logical blocks, modules, circuits, and algorithm steps
described in connection with the embodiments disclosed herein may
be implemented as electronic hardware, computer software, or
combinations of both.
[0284] To clearly illustrate this interchangeability of hardware
and software, various illustrative components, blocks, modules,
circuits, and steps have been described above generally in terms of
their functionality. Whether such functionality is implemented as
hardware or software depends upon the particular application and
design constraints imposed on the overall system.
[0285] Skilled artisans may implement the described functionality
in varying ways for each particular application, but such
implementation decisions should not be interpreted as causing a
departure from the scope of the present disclosure.
[0286] The various illustrative logical blocks, modules, and
circuits described in connection with the embodiments disclosed
herein may be implemented or performed with a general purpose
processor, a digital signal processor (DSP), an application
specific integrated circuit (ASIC), a field programmable gate array
(FPGA) or other programmable logic device, discrete gate or
transistor logic, discrete hardware components, or any combination
thereof designed to perform the functions described herein. A
general purpose processor may be a microprocessor, but in the
alternative, the processor may be any conventional processor,
controller, microcontroller, or state machine. A processor may also
be implemented as a combination of computing devices, e.g., a
combination of a DSP and a microprocessor, a plurality of
microprocessors, one or more microprocessors in conjunction with a
DSP core, or any other such configuration. Additionally, the scope
of the invention includes hardware not traditionally used in, or
thought of as having use within, general purpose computing, such as
graphics processing units (GPUs).
[0287] The steps or stages of a method, process or algorithm
described in connection with the embodiments disclosed herein may
be embodied directly in hardware, in a software module executed by
a processor, or in a combination of the two. A software module may
reside in RAM memory, flash memory, ROM memory, EPROM memory,
EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or
any other form of storage medium known in the art.
[0288] Certain of the disclosed methods may also be implemented
using a computer-readable medium containing program instructions
which, when executed by one or more processors, cause such
processors to carry out operations corresponding to the disclosed
methods.
[0289] An exemplary storage medium is coupled to the processor such
that the processor can read information from, and write information to,
the storage medium. In the alternative, the storage medium may be
integral to the processor. The processor and the storage medium may
reside in an ASIC. The ASIC may reside in a user terminal. In the
alternative, the processor and the storage medium may reside as
discrete components in a user terminal.
[0290] The previous description of the disclosed embodiments is
provided to enable any person skilled in the art to make or use the
present disclosure. Various modifications to these embodiments will
be readily apparent to those skilled in the art, and the generic
principles defined herein may be applied to other embodiments
without departing from the spirit or scope of the disclosure. Thus,
the present disclosure is not intended to be limited to the
embodiments shown herein but is to be accorded the widest scope
consistent with the principles and novel features disclosed herein.
It is intended that the following claims and their equivalents
define the scope of the disclosure.
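Sequence Listing (CWU 1 1): 3 sequences
SEQ ID NO: 1; Length: 17; Type: DNA; Organism: Artificial Sequence;
Description: Example sequence fragment
ggaggctagt tagtata
SEQ ID NO: 2; Length: 66; Type: DNA; Organism: Artificial Sequence;
Description: Example sequence fragment
agttgacacc tgtccacacg ttaaacaggt tccataagat tgtgccgtta aatactcagg
caatct
SEQ ID NO: 3; Length: 16; Type: DNA; Organism: Artificial Sequence;
Description: Example sequence fragment
ttaaacaggt tccata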
* * * * *
References