U.S. patent application number 14/826595, for systems and methods for genetic analysis, was filed with the patent office on 2015-08-14 and published on 2016-02-18.
The applicant listed for this patent is Good Start Genetics, Inc. Invention is credited to Alexander Frieden, Xavier S. Haurie, and Caleb J. Kennedy.
Application Number | 20160048608 14/826595 |
Document ID | / |
Family ID | 55302346 |
Publication Date | 2016-02-18 |
United States Patent Application | 20160048608 |
Kind Code | A1 |
Frieden; Alexander; et al. | February 18, 2016 |
SYSTEMS AND METHODS FOR GENETIC ANALYSIS
Abstract
The invention relates to using a graph database in genetic
analyses to link mutation data to extrinsic data. Entities such as
mutations, patients, samples, alleles, and clinical information are
individually represented and stored as nodes and relationships
between entities are also individually represented and stored. Each
node and relationship can be stored using a fixed-size record and
nodes can be flexibly invoked to represent any entity without
disrupting the existing data. Systems and methods of the invention
may be used for obtaining data representing a mutation in an
individual and using a node in a graph database to store a
description of the mutation. The node has stored within it a
pointer to an adjacent node that provides information about a
clinical significance of the variant. The graph database can be
queried to provide a report of the clinical significance of the
mutation.
Inventors: | Frieden; Alexander; (Somerville, MA); Kennedy; Caleb J.; (Arlington, MA); Haurie; Xavier S.; (Belmont, MA) |
Applicant: |
Name | City | State | Country | Type |
Good Start Genetics, Inc. | Cambridge | MA | US | |
Family ID: | 55302346 |
Appl. No.: | 14/826595 |
Filed: | August 14, 2015 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
62037861 | Aug 15, 2014 | |
Current U.S. Class: | 707/722 |
Current CPC Class: | G06F 16/9024 20190101; G16B 50/00 20190201; G16B 20/00 20190201; G06F 16/9038 20190101 |
International Class: | G06F 17/30 20060101 G06F017/30; G06F 19/28 20060101 G06F019/28 |
Claims
1. A system for describing genetic information, the system
comprising: at least one computer comprising memory coupled to a
processor, the system having at least a portion of a graph database
stored therein, wherein the system is operable to: obtain data
representing a mutation in a genome of an individual; use a node in
the graph database to store a description of the mutation; store,
in the node, a pointer to an adjacent node that provides
information about a clinical significance of the mutation; and
query the graph database to provide a report of the clinical
significance of the mutation in the genome of the individual.
2. The system of claim 1, wherein the system is operable to obtain
the data representing the mutation by receiving at least one
sequence read file that includes the data.
3. The system of claim 2, further operable to represent, in the
graph database, a biological sample from the individual using a
sample node and connect the sample node via a pointer to a read
file node representing the sequence read file.
4. The system of claim 1, wherein the data representing the
mutation is obtained as part of a file.
5. The system of claim 4, wherein the file has a format selected
from the group consisting of variant call format; sequence
alignment map; binary alignment map; FASTA; and FASTQ.
6. The system of claim 4, operable to represent the file as a file
node in the graph database and store, in the variant node, a
pointer to the file node.
7. The system of claim 6, further operable to represent, in the
graph database, a biological sample from the individual using a
sample node and connect the sample node via a pointer to a read
file node representing the sequence read file.
8. The system of claim 1, wherein the data representing the
mutation comprises a description of the mutation as a variant of a
reference human genome.
9. The system of claim 8, wherein the description of the mutation
is obtained from a VCF record in a VCF file.
10. The system of claim 9, further operable to represent, in the
graph database, a biological sample from the individual using a
sample node and connect the sample node via a pointer to a read
file node representing the sequence read file.
11. The system of claim 1, further operable to: obtain sequencing
data representing a plurality of mutations in the genome of the
individual, the plurality of mutations being represented as variant
calls relative to a human genome reference; use, for each of the
plurality of mutations, a corresponding variant node in the graph
database to store a description of that mutation; and link the
individual to an allele node based on the plurality of
mutations.
12. The system of claim 11, wherein the graph database comprises:
nodes representing people, nodes representing genomic variants
relative to a reference, and nodes representing literature reports
on medical relevance of the genomic variants; and edges defining
relationships between pairs of the nodes.
13. The system of claim 12, further operable to represent, in the
graph database, a biological sample from the individual using a
sample node and connect the sample node via a pointer to a read
file node representing the sequence read file.
14. The system of claim 1, wherein the graph database comprises:
nodes representing people, nodes representing genomic variants
relative to a reference, and nodes representing literature reports
on medical relevance of the genomic variants; and edges defining
relationships between pairs of the nodes.
15. The system of claim 14, further operable to represent, in the
graph database, a biological sample from the individual using a
sample node and connect the sample node via a pointer to a read
file node representing the sequence read file.
16. A method for analyzing mutations, the method comprising:
obtaining data representing a mutation in a genome of an
individual; using a node in a graph database to store a description
of the mutation; storing, in the node, a pointer to an adjacent
node that provides information about a clinical significance of the
mutation; and querying the graph database to provide a report of
the clinical significance of the mutation in the genome of the
individual.
17. The method of claim 16, wherein obtaining the data representing
the mutation comprises obtaining a sample that includes a nucleic
acid from the individual; and sequencing the nucleic acid to obtain
a sequence read file that includes the data.
18. The method of claim 17, further comprising representing the
sample in the graph database using a sample node and connecting the
sample node via a pointer to a read file node representing the
sequence read file and metadata associated with the data.
19. The method of claim 16, wherein the data representing a
mutation is obtained as part of a file.
20. The method of claim 19, wherein the file has a format selected
from the group consisting of variant call format; sequence
alignment map; binary alignment map; FASTA; and FASTQ.
21. The method of claim 19, further comprising representing the
file as a file node in the graph database and storing in the
mutation node a pointer to the file node.
22. The method of claim 16, wherein the data representing a
mutation comprises a description of the mutation as a variant of a
reference human genome.
23. The method of claim 22, wherein the description of the mutation
is provided as a VCF record in a VCF file.
24. The method of claim 16, further comprising: obtaining
sequencing data representing a plurality of mutations in the genome
of the individual, each of the plurality of mutations being
represented as variant calls relative to a human genome reference;
and using, for each of the plurality of mutations, a corresponding
variant node in the graph database to store a description of that
mutation.
25. The method of claim 16, wherein the graph database comprises:
nodes representing people, nodes representing genomic variants
relative to a reference, and nodes representing literature reports
on medical relevance of the genomic variants; and edges defining
relationships between pairs of the nodes.
Description
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to, and the benefit of,
U.S. Provisional Patent Application No. 62/037,861, filed Aug. 15,
2014, the contents of which are incorporated by reference.
TECHNICAL FIELD
[0002] The invention relates to medical genetics.
BACKGROUND
[0003] Before having children, a person may turn to genetic
screening to find out if he or she is a carrier of a genetic
condition. Genetic carrier screening can be done using
next-generation sequencing (NGS), which produces millions of
"base-calls" read from the person's genome. Typically, those base
calls are then compared to a reference genome to determine their
clinical significance. While all 3.2 billion base-pairs of the
human genome are available for use as a reference (e.g., as hg18),
knowing the clinical significance of features in the person's
genome requires turning to medical literature or specialized
databases of mutations. For example, the Online Mendelian
Inheritance in Man (OMIM) database contains information on genetic
disorders in over 12,000 human genes.
[0004] The volumes of data that must be stored, compared, and
understood are a significant obstacle to realizing the full
potential of NGS as a carrier screening tool. Generally, the time
required for analysis and reporting is proportional to the amount
of data in the databases. The structure of those databases requires
exhaustive index table lookups for each comparison. Also, since
database designs must be locked in prior to use, a clinician's use
of the data system is limited to what the database designer foresaw
as the likely qualities of the data. A clinician who discovers a
new phenomenon--such as a novel combination of mutations
associated with an unexpected disease--may be faced with a data
system that does not even provide a means for entering or
describing this information.
SUMMARY
[0005] The invention provides systems and methods for genetic
analysis in which entities such as mutations, patients, samples,
alleles, and clinical information are individually represented and
stored as nodes and in which relationships between entities are
also individually represented and stored. Each node and
relationship can be stored using a fixed-size record and nodes can
be flexibly invoked to represent any novel entity without
disrupting the information already represented in the system. By
forsaking the traditional database schema of indexed tables, the
run time for queries need not be proportional to the amount of data
in the tables. Instead, queries that start with a certain node can
find the relevant related nodes in time proportional only to the
number of nodes in the results that match the query. Moreover,
novel entities and relationships can be inserted into the data
system upon discovery with no disruption to the data or operation
of the system. Thus, novel mutations can be added or related to
disease phenotypes or appropriate literature references as that new
information is discovered and observed. The time required for a
query of--for example--relationships between a patient and
disease-associated alleles in that patient's genome will be
proportional to the number of results that are found for inclusion
in a report for that patient. Where sequencing uncovers novel
mutations or genotype/phenotype associations, those entities and
relationships can be brought into the system and included in the
reporting without requiring any changes or re-design to the
underlying system architecture. In methods and systems of the
invention, NGS results, patient information, and medical
information can be stored in a graph database and analyzed using
graph processing approaches and languages. This provides for very
rapid querying and report generation, independent of the size of
the underlying data store.
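The traversal behavior described above can be illustrated with a minimal sketch, assuming an in-memory model in which each node holds direct references to its neighbors. The `Node` class, the relationship names `HAS_ALLELE` and `ASSOCIATED_WITH`, and the data are hypothetical and chosen only for illustration; they are not taken from the described system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node that holds direct references to its neighbours."""
    label: str                                 # e.g. "Patient", "Allele", "Disease"
    props: dict
    out: list = field(default_factory=list)    # (relationship type, Node) pairs

    def relate(self, rel_type, other):
        self.out.append((rel_type, other))


def neighbours(node, rel_type):
    """Follow only the references stored on `node`; no global index is consulted."""
    return [other for kind, other in node.out if kind == rel_type]


def disease_alleles(patient):
    """Return (allele, disease) pairs reachable from one patient node."""
    hits = []
    for allele in neighbours(patient, "HAS_ALLELE"):
        for disease in neighbours(allele, "ASSOCIATED_WITH"):
            hits.append((allele.props["name"], disease.props["name"]))
    return hits


# Hypothetical data: one patient, two alleles, one disease association.
patient = Node("Patient", {"id": "patient-1"})
allele_a = Node("Allele", {"name": "allele-A"})
allele_b = Node("Allele", {"name": "allele-B"})
disease_x = Node("Disease", {"name": "disease-X"})
patient.relate("HAS_ALLELE", allele_a)
patient.relate("HAS_ALLELE", allele_b)
allele_a.relate("ASSOCIATED_WITH", disease_x)

# Many unrelated nodes exist in the graph, but the query never visits them.
unrelated = [Node("Patient", {"id": f"other-{i}"}) for i in range(1000)]

print(disease_alleles(patient))
```

In this sketch the work done by `disease_alleles` depends only on the relationships attached to the starting patient node, not on how many other patients are stored.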
[0006] Since report generation is rapid and not linked to the
underlying volume of data, and since systems of the invention may
easily accommodate the volumes of data associated with NGS
sequencing and human genome based analyses, systems and methods of
the invention may be employed for NGS-based carrier screening and
provide meaningful results to patients.
[0007] Additionally, the invention includes the insight that the
clinical significance of mutations--or "variants", e.g., as
documented in NGS results such as Variant Call Format (VCF)
files--can be shown by relating the mutation to a particular allele
of a gene and showing where in the literature the variant is
reported as pathogenic or benign while connecting this information
back to a patient and lab sample for reporting purposes. Sequencing
by existing NGS technologies may provide abundant high-quality raw
data in the form of sequence files such as FASTA, FASTQ, Sequence
Alignment Map (SAM), Binary Alignment Map (BAM), or VCF files.
Systems and methods of the invention can be used to extract
relevant data from those files into the described nodes to support
the rapid querying and report generation useful for NGS carrier
screening. For example, systems of the invention may include an
Application Programming Interface (API) that takes as input VCF
files and creates a network of nodes representing patients,
samples, VCF files, VCF records, variants, alleles, and literature
reports with relationships connecting adjacent pairs of those nodes
according to their natural relationships. The system supports a
genomics analysis clinical pipeline even as it changes and can
accommodate the loading in of external data. The system can be
implemented using a graph database and related software. Systems of
the invention support a variety of analyses and use cases. For
example, with NGS-based carrier screening implemented using the
described graph database structure for analysis and reporting, it
becomes easy to query and report such phenomena as allele
frequencies.
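As a rough sketch of an interface of the kind described, the following function reads VCF text and builds nodes for a patient, a sample, the VCF file, each VCF record, and each variant, with relationships between adjacent pairs. The function name `load_vcf`, the relationship names, and the tuple-based node keys are assumptions made for illustration; only the first five VCF columns (CHROM, POS, ID, REF, ALT) are used, and allele and literature nodes are omitted for brevity.

```python
def load_vcf(text, patient_id, sample_id, file_name):
    """Build a small node/edge network from VCF text (illustrative only)."""
    nodes = {}   # node key -> properties
    edges = []   # (source key, relationship type, target key)

    def add_node(key, **props):
        nodes.setdefault(key, props)   # reuse the node if it already exists
        return key

    patient = add_node(("Patient", patient_id))
    sample = add_node(("Sample", sample_id))
    vcf_file = add_node(("VcfFile", file_name))
    edges.append((patient, "PROVIDED", sample))
    edges.append((sample, "SEQUENCED_TO", vcf_file))

    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue   # skip blank lines, meta-information, and the header line
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        record = add_node(("VcfRecord", file_name, line_no), raw=line)
        edges.append((vcf_file, "CONTAINS", record))
        for alt_allele in alt.split(","):   # multi-allelic records list several ALTs
            if alt_allele == ".":
                continue                     # "." means no alternate allele was called
            variant = add_node(("Variant", chrom, int(pos), ref, alt_allele))
            edges.append((record, "CALLS", variant))
    return nodes, edges


# Hypothetical two-record VCF fragment.
example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\n"
    "chr2\t2000\t.\tC\tT,G\t60\tPASS\t.\n"
)
nodes, edges = load_vcf(example, "patient-1", "sample-1", "example.vcf")
print(len(nodes), "nodes,", len(edges), "relationships")
```

Keying variant nodes by chromosome, position, reference, and alternate allele means that the same variant seen in a second file would reuse the existing node rather than create a duplicate.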
[0008] Importantly, systems and methods of the invention support
the curation of variants. Curating variants includes identifying an
individual variant in sequencing results, researching medical
literature for information about the variant, classifying the
variant (e.g., pathogenic, benign, somewhere in between), and
accessioning that information into the database for use in
subsequent reports on patient samples in which that variant is
implicated. Using the nodes and relationships provided by the
invention, variants can be connected to alleles, literature
references, medical information, or combinations thereof. If
changes are subsequently made (e.g., a missense mutation is
re-classified as a nonsense mutation), other features of the system
infrastructure are not disrupted. Thus the active curation of
variants is accommodated and improves the system.
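The curation step can be sketched as follows, assuming a simple dictionary-based graph. The function name `curate`, the node keys, the `REPORTED_IN` relationship, and the citation string are all hypothetical. The point of the sketch is that recording a new classification touches only the variant node and adds one literature node and one relationship, leaving existing relationships as they were.

```python
# Hypothetical starting graph: a patient linked to an allele that includes one variant.
graph = {
    "nodes": {
        "patient:1": {"type": "Patient"},
        "allele:1": {"type": "Allele"},
        "variant:1": {"type": "Variant", "classification": "uncertain"},
    },
    "edges": [
        ("patient:1", "HAS_ALLELE", "allele:1"),
        ("allele:1", "INCLUDES", "variant:1"),
    ],
}


def curate(graph, variant_key, classification, reference_key, citation):
    """Record a curation decision for one variant.

    Only the variant node is updated; one literature node and one
    relationship are added. No existing relationship is rewritten.
    """
    graph["nodes"][variant_key]["classification"] = classification
    graph["nodes"].setdefault(
        reference_key, {"type": "LiteratureReference", "citation": citation}
    )
    edge = (variant_key, "REPORTED_IN", reference_key)
    if edge not in graph["edges"]:
        graph["edges"].append(edge)


edges_before = list(graph["edges"])
curate(graph, "variant:1", "pathogenic", "ref:1", "hypothetical citation")
print(graph["nodes"]["variant:1"]["classification"])
```

A later report that starts from `patient:1` would reach the updated classification through the same relationships that existed before curation.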
[0009] In certain aspects, the invention provides a method for
analyzing mutations. The method includes obtaining data
representing a mutation in a genome of an individual and using a
node in a graph database to store a description of the mutation.
The node has stored within it a pointer to an adjacent node that
provides information about a clinical significance of the variant.
The method includes querying the graph database to provide a report
of the clinical significance of the mutation in the genome of the
individual.
[0010] The data representing the mutation may be obtained by
obtaining a sample that includes a nucleic acid from the individual
and sequencing the nucleic acid to obtain a sequence read file that
includes the data. The sample may be represented in the graph
database using a sample node and the sample node may be connected
via a pointer to a read file node representing the sequence read
file. The graph database may include nodes representing people,
nodes representing genomic variants relative to a reference, and
nodes representing literature reports on medical relevance of the
genomic variants as well as edges defining relationships between
pairs of the nodes.
[0011] In some embodiments, the data representing a mutation is
obtained as part of a file such as a variant call format (VCF) file, a
sequence alignment map (SAM) file, a binary alignment map (BAM)
file, a FASTA file, or a FASTQ file. The file may be represented in
the graph database (e.g., using a file node) and a pointer to the
file node may be stored in the mutation node.
[0012] In certain embodiments, the data representing a mutation
comprises a description of the mutation as a variant of a reference
human genome. The description of the mutation may be provided as a
VCF record in a VCF file. The method may include obtaining
sequencing data that represents a plurality of mutations in the
genome of the individual--each of the plurality of mutations being
represented as variant calls relative to a human genome reference.
For each of the plurality of mutations, a corresponding variant
node in the graph database is used to store a description of that
mutation.
[0013] Aspects of the invention provide a system for describing
genetic information. The system includes at least one computer
comprising memory coupled to a processor. The system has at least a
portion of a graph database stored therein. The system is operable
to obtain data representing a mutation in a genome of an
individual, use a variant node in the graph database to store a
description of the mutation, and store--within the variant node--a
pointer to an adjacent node that provides information about a
clinical significance of the mutation. The system may be used to
query the graph database to provide a report of the clinical
significance of the mutation in the genome of the individual. As
discussed above, the data representing a mutation may be obtained
as part of a file such as a VCF file. The system may represent the
file as a file node in the graph database and store, in the variant
node, a pointer to the file node.
[0014] The data representing the mutation may be provided as a
sequence read file that includes that data. In certain embodiments,
the system is operable to use the graph database to represent a
biological sample from the individual with a sample node and
connect the sample node via a pointer to a read file node
representing the sequence read file.
[0015] The system may be operated to obtain sequencing data
representing a plurality of mutations in the genome of the
individual (e.g., as variant calls relative to a human genome
reference) and use, for each of the plurality of mutations, a
corresponding variant node in the graph database to store a
description of that mutation. The system links the individual to an
allele node based on the plurality of mutations.
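One possible rule for linking an individual to allele nodes is sketched below, assuming each allele is defined by the set of variants that make it up and that an individual is linked when every variant in that set has been called. The names `ALLELE_DEFINITIONS` and `link_alleles` and the example variants are hypothetical, and the sketch ignores zygosity and phasing, which a real pipeline would need to consider.

```python
# Hypothetical allele definitions: allele name -> set of (chrom, pos, ref, alt) variants.
ALLELE_DEFINITIONS = {
    "allele-A": {("chr1", 1000, "A", "G")},
    "allele-B": {("chr1", 1000, "A", "G"), ("chr1", 1010, "C", "T")},
}


def link_alleles(individual_variants, allele_definitions):
    """Return names of alleles whose defining variants were all called."""
    called = set(individual_variants)
    return sorted(
        name
        for name, required in allele_definitions.items()
        if required <= called   # subset test: every required variant is present
    )


one_variant = [("chr1", 1000, "A", "G")]
two_variants = [("chr1", 1000, "A", "G"), ("chr1", 1010, "C", "T")]
print(link_alleles(one_variant, ALLELE_DEFINITIONS))
print(link_alleles(two_variants, ALLELE_DEFINITIONS))
```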
[0016] In a preferred aspect, the invention provides: a system for
describing genetic information, the system comprising: at least one
computer comprising memory coupled to a processor, the system
having at least a portion of a graph database stored therein,
wherein the system is operable to: obtain data representing a
mutation in a genome of an individual; use a node in the graph
database to store a description of the mutation; store, in the
node, a pointer to an adjacent node that provides information about
a clinical significance of the mutation; and query the graph
database to provide a report of the clinical significance of the
mutation in the genome of the individual. Preferably a pointer
identifies a physical location in the memory at which the adjacent
node is stored. Thus each node may be stored at a specific physical
location in the memory. Each such specific physical location is
referenced by a pointer (which itself optionally may be stored
within a node at a physical location that is referenced, in turn,
by another pointer). Preferably, each pointer identifies a physical
location in the memory subsystem at which the adjacent object is
stored. In the preferred embodiments, the pointer or native pointer
can be manipulated as a memory address: it points to a physical
location in the memory, and dereferencing the pointer accesses the
intended data. That is, a pointer is a reference to a
datum stored somewhere in memory; to obtain that datum is to
dereference the pointer. The feature that separates pointers from
other kinds of reference is that a pointer's value is interpreted
as a memory address, at a low-level or hardware level. The speed
and efficiency of the described low-level, or hardware level,
memory referencing allows for incredibly rapid graph traversals,
which means that data content can scale up unbounded but reporting
actionable medical genetic information will not require amounts of
time that scale up with the data content. Use of hardware level
references, or index-free adjacency, uncouples the time
requirements for medical genetics reporting from data content
volume.
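The combination of fixed-size records and direct references can be sketched as below. The record layouts are invented for illustration and are not the layout of any figure in this application: a node record stores the id of its first relationship, and a relationship record stores its start node, end node, and the next relationship of the start node. Because every record has the same size, a record id converts to a byte offset by multiplication, so following a reference needs no index lookup.

```python
import struct

# Assumed layouts (illustrative only); -1 marks "no record".
NODE = struct.Struct("<i")     # first relationship id
REL = struct.Struct("<iii")    # start node id, end node id, next relationship id

nodes = bytearray()   # node store: record i lives at byte offset i * NODE.size
rels = bytearray()    # relationship store: record i lives at i * REL.size


def add_node():
    nodes.extend(NODE.pack(-1))
    return len(nodes) // NODE.size - 1


def add_rel(start, end):
    rel_id = len(rels) // REL.size
    (first,) = NODE.unpack_from(nodes, start * NODE.size)
    rels.extend(REL.pack(start, end, first))            # new record heads the chain
    NODE.pack_into(nodes, start * NODE.size, rel_id)    # node now points at it
    return rel_id


def adjacent(node_id):
    """Walk the relationship chain of one node by offset arithmetic alone."""
    (rel_id,) = NODE.unpack_from(nodes, node_id * NODE.size)
    found = []
    while rel_id != -1:
        _start, end, rel_id = REL.unpack_from(rels, rel_id * REL.size)
        found.append(end)
    return found


variant = add_node()
report_1 = add_node()
report_2 = add_node()
add_rel(variant, report_1)
add_rel(variant, report_2)
print(adjacent(variant))
```

The cost of `adjacent` grows with the number of relationships on the starting node and is unaffected by how many other records the two stores hold.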
[0017] In a first embodiment of the preferred aspect, the system is
operable to obtain the data representing the mutation by receiving
at least one sequence read file that includes the data. Preferably
the system of the first embodiment is further operable to
represent, in the graph database, a biological sample from the
individual using a sample node and connect the sample node via a
pointer to a read file node representing the sequence read
file.
[0018] In a second embodiment of the preferred aspect, the data
representing the mutation is obtained as part of a file. In the
second embodiment, the file may have a format selected from the
group consisting of variant call format; sequence alignment map;
binary alignment map; FASTA; and FASTQ. Preferably in the second
embodiment the system is operable to represent the file as a file
node in the graph database and store, in the variant node, a
pointer to the file node. Optionally, the system is further
operable to represent, in the graph database, a biological sample
from the individual using a sample node and connect the sample node
via a pointer to a read file node representing the sequence read
file.
[0019] In a third embodiment of the preferred aspect, the data
representing the mutation comprises a description of the mutation
as a variant of a reference human genome. In the third embodiment,
the description of the mutation may optionally be obtained from a
VCF record in a VCF file. Additionally, the system of the third
embodiment may be further operable to represent, in the graph
database, a biological sample from the individual using a sample
node and connect the sample node via a pointer to a read file node
representing the sequence read file.
[0020] In a fourth embodiment of the preferred aspect, the system
is further operable to: obtain sequencing data representing a
plurality of mutations in the genome of the individual, the
plurality of mutations being represented as variant calls relative
to a human genome reference; use, for each of the plurality of
mutations, a corresponding variant node in the graph database to
store a description of that mutation; and link the individual to an
allele node based on the plurality of mutations. In the fourth
embodiment, the graph database may include: nodes representing
people, nodes representing genomic variants relative to a
reference, and nodes representing literature reports on medical
relevance of the genomic variants; and edges defining relationships
between pairs of the nodes. The system of the fourth embodiment may
be further operable to represent, in the graph database, a
biological sample from the individual using a sample node and
connect the sample node via a pointer to a read file node
representing the sequence read file.
[0021] In a fifth embodiment of the preferred aspect, the graph
database comprises: nodes representing people, nodes representing
genomic variants relative to a reference, and nodes representing
literature reports on medical relevance of the genomic variants;
and edges defining relationships between pairs of the nodes. In the
fifth embodiment, the system may be further operable to represent,
in the graph database, a biological sample from the individual
using a sample node and connect the sample node via a pointer to a
read file node representing the sequence read file.
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] FIG. 1 illustrates an exemplary NGS workflow for carrier
screening.
[0023] FIG. 2 gives a sample of an exemplary VCF file.
[0024] FIG. 3 diagrams a method for analyzing mutations.
[0025] FIG. 4 gives a flow chart for a VCF file parser.
[0026] FIG. 5 presents a model of data received from parsing a VCF
file.
[0027] FIG. 6 shows an entity relationship diagram (ERD) of the
data modeled by FIG. 5.
[0028] FIG. 7 diagrams a high-level architecture of a system of the
invention.
[0029] FIG. 8 illustrates a structure for nodes and relationships
on disk.
[0030] FIG. 9 illustrates the use of a variant node to store a
description of a mutation.
[0031] FIG. 10 shows an allele node showing that an allele includes
a certain mutation.
[0032] FIG. 11 shows a variant node connected to two different
literature reference nodes.
[0033] FIG. 12 illustrates updating information about a
mutation.
[0034] FIG. 13 presents an example database that may be queried for
allele frequency.
[0035] FIG. 14 diagrams a system for performing methods of the
invention.
DETAILED DESCRIPTION
[0036] The invention relates to using a graph database in genetic
analyses to link mutation data to extrinsic data. Entities such as
mutations, patients, samples, alleles, and clinical information are
individually represented and stored as nodes and relationships
between entities are also individually represented and stored. Each
node and relationship can be stored using a fixed-size record and
nodes can be flexibly invoked to represent any entity without
disrupting the existing data. Systems and methods of the invention
may be used for obtaining data representing a mutation in an
individual and using a variant node in a graph database to store a
description of the mutation. The variant node has stored within it
a pointer to an adjacent node that provides information about a
clinical significance of the variant. The graph database can be
queried to provide a report of the clinical significance of the
mutation. In certain embodiments, systems and methods of the
invention operate within the context of a carrier screening
workflow and provide a querying and reporting tool for carrier
screening.
[0037] FIG. 1 illustrates an exemplary NGS workflow for carrier
screening. The workflow combines automated, optimized molecular
inversion probe target capture 109 with molecular barcoding to
maximize the sample throughput of an NGS machine and employs
assembly and alignment methods that allow accurate identification
of both substitution and insertion/deletion lesions. The workflow
is applicable to, for example, genes in which loss-of-function
mutations cause recessive Mendelian disorders often included as
part of routine carrier screening. A screening or analysis may
begin with obtaining nucleic acid from a sample.
[0038] Nucleic acid in a sample can be any nucleic acid, including
for example, genomic DNA in a tissue sample, cDNA amplified from a
particular target in a laboratory sample, or mixed DNA from
multiple organisms. In some embodiments, the sample includes
homozygous DNA from a haploid or diploid organism. For example, a
sample can include genomic DNA from a patient who is homozygous for
a rare recessive allele. In other embodiments, the sample includes
heterozygous genetic material from a diploid or polyploid organism
with a somatic mutation such that two related nucleic acids are
present in allele frequencies other than 50% or 100%, e.g., 20%, 5%,
1%, 0.1%, or any other allele frequency.
[0039] In one embodiment, nucleic acid template molecules (e.g.,
DNA or RNA) are isolated from a biological sample containing a
variety of other components, such as proteins, lipids, and
non-template nucleic acids. Nucleic acid template molecules can be
obtained from any cellular material, obtained from animal, plant,
bacterium, fungus, or any other cellular organism. Biological
samples for use in the present invention also include viral
particles or preparations. Nucleic acid template molecules can be
obtained directly from an organism or from a biological sample
obtained from an organism, e.g., from blood, urine, cerebrospinal
fluid, seminal fluid, saliva, sputum, stool, and tissue. Any tissue
or body fluid specimen (e.g., a human tissue or bodily fluid
specimen) may be used as a source for nucleic acid to use in the
invention. Nucleic acid template molecules can also be isolated
from cultured cells, such as a primary cell culture or cell line.
The cells or tissues from which template nucleic acids are obtained
can be infected with a virus or other intracellular pathogen. A
sample can also be total RNA extracted from a biological specimen,
a cDNA library, viral, or genomic DNA. A sample may also be
isolated DNA from a non-cellular origin, e.g. amplified/isolated
DNA from the freezer.
[0040] Generally, nucleic acid can be extracted, isolated,
amplified, or analyzed by a variety of techniques such as those
described by Green and Sambrook, Molecular Cloning: A Laboratory
Manual (Fourth Edition), Cold Spring Harbor Laboratory Press,
Woodbury, N.Y. 2,028 pages (2012); or as described in U.S. Pat. No.
7,957,913; U.S. Pat. No. 7,776,616; U.S. Pat. No. 5,234,809; U.S.
Pub. 2010/0285578; and U.S. Pub. 2002/0190663.
[0041] Nucleic acid from a sample may optionally be fragmented or
sheared to a desired length, using a variety of mechanical,
chemical, and/or enzymatic methods. DNA may be randomly sheared via
sonication using, for example, an ultrasonicator sold by Covaris
(Woburn, Mass.), brief exposure to a DNase, or using a mixture of
one or more restriction enzymes, or a transposase or nicking
enzyme. RNA may be fragmented by brief exposure to an RNase, heat
plus magnesium, or by shearing. The RNA may be converted to cDNA.
If fragmentation is employed, the RNA may be converted to cDNA
before or after fragmentation. In one embodiment, nucleic acid is
fragmented by sonication. In another embodiment, nucleic acid is
fragmented by a hydroshear instrument. Generally, individual
nucleic acid template molecules can be from about 2 kb to
about 40 kb. In a particular embodiment, nucleic acids are about 6
kb-10 kb fragments. Nucleic acid molecules may be single-stranded,
double-stranded, or double stranded with single-stranded regions
(for example, stem- and loop-structures).
[0042] A biological sample may be lysed, homogenized, or
fractionated in the presence of a detergent or surfactant as
needed. Suitable detergents may include an ionic detergent (e.g.,
sodium dodecyl sulfate or N-lauroylsarcosine) or a nonionic
detergent (such as the polysorbate 80 sold under the trademark
TWEEN by Uniqema Americas (Paterson, N.J.) or C14H22O(C2H4O)n, known
as TRITON X-100).
[0043] In certain embodiments, genomic DNA samples are input to a
molecular inversion probe capture 109 reaction. Molecular inversion
probes may be designed to capture the coding regions as well as
well-characterized noncoding regions of genes. Such probes may
include 5' and 3' targeting arms (extension and ligation,
respectively) of, for example, about a total of 40 nucleotides and
being designed to flank 130-bp target regions. Each target is
captured 109 by multiple probes that anneal to non-overlapping
genomic intervals. PCR is performed 121 using primers containing
patient-specific barcodes, yielding barcode libraries. Genomic DNA
may be subjected to multiplex target capture using molecular
inversion probes. Captured product may be subjected to PCR to
attach molecular barcodes in a manner that allows sequencing from
either end of the captured region.
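Once barcoded libraries are pooled and sequenced, reads are typically assigned back to patients by barcode. A minimal sketch of that assignment follows, assuming for simplicity that the barcode is a fixed-length prefix of each read and that only exact matches are accepted. The function name `demultiplex`, the barcode sequences, and the patient labels are hypothetical; the actual barcode placement and matching rules of the workflow are not specified here.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_patient, barcode_length):
    """Group reads by patient using an exact-match barcode prefix."""
    by_patient = defaultdict(list)
    unassigned = []
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        patient = barcode_to_patient.get(barcode)
        if patient is None:
            unassigned.append(read)            # unknown barcode: keep the read aside
        else:
            by_patient[patient].append(insert)  # store the read without its barcode
    return dict(by_patient), unassigned


# Hypothetical 4-base barcodes and reads.
barcodes = {"ACGT": "patient-1", "TGCA": "patient-2"}
reads = ["ACGTTTTT", "TGCAGGGG", "NNNNCCCC"]
assigned, unassigned = demultiplex(reads, barcodes, barcode_length=4)
print(assigned)
print(unassigned)
```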
[0044] PCR may be used as described or any other amplification
reaction may be performed. Amplification refers to production of
additional copies of a nucleic acid sequence and is generally
carried out using polymerase chain reaction (PCR) or other
technologies known in the art. The amplification reaction may be
any amplification reaction known in the art that amplifies nucleic
acid molecules such as PCR (e.g., nested PCR, PCR-single strand
conformation polymorphism, ligase chain reaction, strand
displacement amplification and restriction fragments length
polymorphism, transcription based amplification system, rolling
circle amplification, and hyper-branched rolling circle
amplification, quantitative PCR, quantitative fluorescent PCR
(QF-PCR), multiplex fluorescent PCR (MF-PCR), real time PCR
(RTPCR), restriction fragment length polymorphism PCR). See U.S.
Pat. No. 5,242,794; U.S. Pat. No. 5,494,810; U.S. Pat. No.
4,988,617; U.S. Pat. No. 6,582,938; U.S. Pat. No. 4,683,195; and
U.S. Pat. No. 4,683,202, hereby incorporated by reference. Primers
for PCR, sequencing, and other methods can be prepared by cloning,
direct chemical synthesis, and other methods known in the art.
Primers can also be obtained from commercial sources such as
Eurofins MWG Operon (Huntsville, Ala.) or Life Technologies
(Carlsbad, Calif.).
[0045] Amplification adapters may be attached to the fragmented
nucleic acid. Adapters may be commercially obtained, such as from
Integrated DNA Technologies (Coralville, Iowa). In certain
embodiments, the adapter sequences are attached to the template
nucleic acid molecule with an enzyme. The enzyme may be a ligase or
a polymerase. The ligase may be any enzyme capable of ligating an
oligonucleotide (RNA or DNA) to the template nucleic acid molecule.
Suitable ligases include T4 DNA ligase and T4 RNA ligase, available
commercially from New England Biolabs (Ipswich, Mass.). Methods for
using ligases are well known in the art. The polymerase may be any
enzyme capable of adding nucleotides to the 3' and the 5' terminus
of template nucleic acid molecules.
[0046] Embodiments of the invention involve attaching the bar code
sequences to the template nucleic acids e.g., for barcode PCR 121.
In certain embodiments, a bar code is attached to each fragment. In
other embodiments, a plurality of bar codes, e.g., two bar codes,
are attached to each fragment. A bar code sequence generally
includes certain features that make the sequence useful in
sequencing reactions. For example, the bar code sequences are
designed to have minimal or no homo-polymer regions, i.e., 2 or
more of the same base in a row such as AA or CCC, within the bar
code sequence. The bar code sequences are also designed so that
they are at least one edit distance away from the base addition
order when performing base-by-base sequencing, ensuring that the
first and last base do not match the expected bases of the
sequence.
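By way of illustration only, the homo-polymer criterion described above may be checked programmatically. The following is a minimal sketch in Python; the function name has_homopolymer and the candidate sequences are hypothetical and are not taken from the described workflow.

```python
def has_homopolymer(barcode: str, min_run: int = 2) -> bool:
    """Return True if any base occurs min_run or more times in a row."""
    run = 1
    for prev, cur in zip(barcode, barcode[1:]):
        # Extend the current run on a repeated base, otherwise restart it.
        run = run + 1 if cur == prev else 1
        if run >= min_run:
            return True
    return False


# Hypothetical candidate bar codes; keep only those free of homo-polymer runs.
candidates = ["ACGTAC", "AACGTG", "TGCATG", "ACCCGT"]
usable = [b for b in candidates if not has_homopolymer(b)]
```

In this sketch, candidates containing AA or CCC are rejected and the remaining sequences are retained for further screening.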
[0047] The bar code sequences are designed such that each sequence
is correlated to a particular portion of nucleic acid, allowing
sequence reads to be correlated back to the portion from which they
came. Methods of designing sets of bar code sequences are shown for
example in U.S. Pat. No. 6,235,475, the contents of which are
incorporated by reference herein in their entirety. In certain
embodiments, the bar code sequences range from about 5 nucleotides
to about 15 nucleotides. In a particular embodiment, the bar code
sequences range from about 4 nucleotides to about 7 nucleotides.
Since the bar code sequence is sequenced along with the template
nucleic acid, the oligonucleotide length should be of minimal
length so as to permit the longest read from the template nucleic
acid attached. Generally, the bar code sequences are spaced from
the template nucleic acid molecule by at least one base (minimizes
homo-polymeric combinations). In certain embodiments, the bar code
sequences are attached to the template nucleic acid molecule, e.g.,
with an enzyme. The enzyme may be a ligase or a polymerase, as
discussed above. Attaching bar code sequences to nucleic acid
templates is shown in U.S. Pub. 2008/0081330 and U.S. Pub.
2011/0301042, the contents of which are incorporated by reference
herein in their entirety. Methods for designing sets of bar code
sequences and other methods for attaching bar code sequences are
shown in U.S. Pat. Nos. 7,544,473; 7,537,897; 7,393,665; 6,352,828;
6,172,218; 6,172,214; 6,150,516; 6,138,077; 5,863,722; 5,846,719;
5,695,934; and 5,604,097, each incorporated by reference.
[0048] After any processing steps (e.g., obtaining, isolating,
fragmenting, amplification, or barcoding), nucleic acid can be
sequenced 129.
[0049] Sequencing 129 may be by any method known in the art. DNA
sequencing techniques include classic dideoxy sequencing reactions
(Sanger method) using labeled terminators or primers and gel
separation in slab or capillary, sequencing by synthesis using
reversibly terminated labeled nucleotides, pyrosequencing, 454
sequencing, Illumina/Solexa sequencing, allele specific
hybridization to a library of labeled oligonucleotide probes,
sequencing by synthesis using allele specific hybridization to a
library of labeled clones that is followed by ligation, real time
monitoring of the incorporation of labeled nucleotides during a
polymerization step, polony sequencing, and SOLiD sequencing.
Sequencing of separated molecules has more recently been
demonstrated by sequential or single extension reactions using
polymerases or ligases as well as by single or sequential
differential hybridizations with libraries of probes.
[0050] A sequencing technique that can be used includes, for
example, use of sequencing-by-synthesis systems sold under the
trademarks GS JUNIOR, GS FLX+ and 454 SEQUENCING by 454 Life
Sciences, a Roche company (Branford, Conn.), and described by
Margulies, M. et al., Genome sequencing in micro-fabricated
high-density picotiter reactors, Nature, 437:376-380 (2005); U.S.
Pat. No. 5,583,024; U.S. Pat. No. 5,674,713; and U.S. Pat. No.
5,700,673, the contents of which are incorporated by reference
herein in their entirety. 454 sequencing involves two steps. In the
first step of those systems, DNA is sheared into fragments of
approximately 300-800 base pairs, and the fragments are blunt
ended. Oligonucleotide adaptors are then ligated to the ends of the
fragments. The adaptors serve as primers for amplification and
sequencing of the fragments. The fragments can be attached to DNA
capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor
B, which contains 5'-biotin tag. The fragments attached to the
beads are PCR amplified within droplets of an oil-water emulsion.
The result is multiple copies of clonally amplified DNA fragments
on each bead. In the second step, the beads are captured in wells
(pico-liter sized). Pyrosequencing is performed on each DNA
fragment in parallel. Addition of one or more nucleotides generates
a light signal that is recorded by a CCD camera in a sequencing
instrument. The signal strength is proportional to the number of
nucleotides incorporated. Pyrosequencing makes use of pyrophosphate
(PPi) which is released upon nucleotide addition. PPi is converted
to ATP by ATP sulfurylase in the presence of adenosine 5'
phosphosulfate. Luciferase uses ATP to convert luciferin to
oxyluciferin, and this reaction generates light that is detected
and analyzed.
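By way of illustration only, the proportionality between signal strength and the number of nucleotides incorporated can be used to translate a series of flow signals into a base sequence. The sketch below assumes that signals have already been normalized so that one incorporation yields a value near 1.0, and that nucleotides are flowed in a fixed repeating order; the flow order "TACG", the function name call_bases, and the signal values are assumptions made for the example.

```python
def call_bases(flow_order: str, signals: list[float]) -> str:
    """Translate normalized per-flow signal intensities into a base sequence.

    Each flow introduces one nucleotide; the rounded signal is taken as the
    number of times that nucleotide was incorporated during the flow.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        count = max(0, round(signal))  # negative noise is treated as zero
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical normalized signals for six consecutive flows (T, A, C, G, T, A).
sequence = call_bases("TACG", [1.02, 0.05, 2.1, 0.9, 0.0, 1.1])
```

Simple rounding is used here for clarity; it becomes unreliable for long homo-polymer runs, where signals for adjacent run lengths are harder to separate.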
[0051] Another example of a DNA sequencing technique that can be
used is SOLiD technology by Applied Biosystems from Life
Technologies Corporation (Carlsbad, Calif.). In SOLiD sequencing,
genomic DNA is sheared into fragments, and adaptors are attached to
the 5' and 3' ends of the fragments to generate a fragment library.
Alternatively, internal adaptors can be introduced by ligating
adaptors to the 5' and 3' ends of the fragments, circularizing the
fragments, digesting the circularized fragment to generate an
internal adaptor, and attaching adaptors to the 5' and 3' ends of
the resulting fragments to generate a mate-paired library. Next,
clonal bead populations are prepared in microreactors containing
beads, primers, template, and PCR components. Following PCR, the
templates are denatured and beads are enriched to separate the
beads with extended templates. Templates on the selected beads are
subjected to a 3' modification that permits bonding to a glass
slide. The sequence can be determined by sequential hybridization
and ligation of partially random oligonucleotides with a central
determined base (or pair of bases) that is identified by a specific
fluorophore. After a color is recorded, the ligated oligonucleotide
is removed and the process is then repeated.
[0052] Another example of a DNA sequencing technique that can be
used is ion semiconductor sequencing using, for example, a system
sold under the trademark ION TORRENT by Ion Torrent by Life
Technologies (South San Francisco, Calif.). Ion semiconductor
sequencing is described, for example, in Rothberg, et al., An
integrated semiconductor device enabling non-optical genome
sequencing, Nature 475:348-352 (2011); U.S. Pub. 2010/0304982; U.S.
Pub. 2010/0301398; U.S. Pub. 2010/0300895; U.S. Pub. 2010/0300559;
and U.S. Pub. 2009/0026082, the contents of each of which are
incorporated by reference in their entirety.
[0053] Another example of a sequencing 129 technology that can be
used is Illumina sequencing. Illumina sequencing is based on the
amplification of DNA on a solid surface using fold-back PCR and
anchored primers. Genomic DNA is fragmented, and adapters are added
to the 5' and 3' ends of the fragments. DNA fragments that are
attached to the surface of flow cell channels are extended and
bridge amplified. The fragments become double stranded, and the
double stranded molecules are denatured. Multiple cycles of the
solid-phase amplification followed by denaturation can create
several million clusters of approximately 1,000 copies of
single-stranded DNA molecules of the same template in each channel
of the flow cell. Primers, DNA polymerase and four
fluorophore-labeled, reversibly terminating nucleotides are used to
perform sequential sequencing. After nucleotide incorporation, a
laser is used to excite the fluorophores, and an image is captured
and the identity of the first base is recorded. The 3' terminators
and fluorophores from each incorporated base are removed and the
incorporation, detection and identification steps are repeated.
Sequencing according to this technology is described in U.S. Pat.
No. 7,960,120; U.S. Pat. No. 7,835,871; U.S. Pat. No. 7,232,656;
U.S. Pat. No. 7,598,035; U.S. Pat. No. 6,911,345; U.S. Pat. No.
6,833,246; U.S. Pat. No. 6,828,100; U.S. Pat. No. 6,306,597; U.S.
Pat. No. 6,210,891; U.S. Pub. 2011/0009278; U.S. Pub. 2007/0114362;
U.S. Pub. 2006/0292611; and U.S. Pub. 2006/0024681, each of which
is incorporated by reference in its entirety.
[0054] Another example of a sequencing technology that can be used
includes the single molecule, real-time (SMRT) technology of
Pacific Biosciences (Menlo Park, Calif.). In SMRT, each of the four
DNA bases is attached to one of four different fluorescent dyes.
These dyes are phospholinked. A single DNA polymerase is
immobilized with a single molecule of template single stranded DNA
at the bottom of a zero-mode waveguide (ZMW). It takes several
milliseconds to incorporate a nucleotide into a growing strand.
During this time, the fluorescent label is excited and produces a
fluorescent signal, and the fluorescent tag is cleaved off.
Detection of the corresponding fluorescence of the dye indicates
which base was incorporated. The process is repeated.
[0055] Another example of a sequencing technique that can be used
is nanopore sequencing (Soni & Meller, 2007, Progress toward
ultrafast DNA sequence using solid-state nanopores, Clin Chem
53(11):1996-2001). A nanopore is a small hole, of the order of 1
nanometer in diameter. Immersion of a nanopore in a conducting
fluid and application of a potential across it results in a slight
electrical current due to conduction of ions through the nanopore.
The amount of current which flows is sensitive to the size of the
nanopore. As a DNA molecule passes through a nanopore, each
nucleotide on the DNA molecule obstructs the nanopore to a
different degree. Thus, the change in the current passing through
the nanopore as the DNA molecule passes through the nanopore
represents a reading of the DNA sequence.
[0056] Another example of a sequencing technique that can be used
involves using a chemical-sensitive field effect transistor
(chemFET) array to sequence DNA (for example, as described in U.S.
Pub. 2009/0026082). In one example of the technique, DNA molecules
can be placed into reaction chambers, and the template molecules
can be hybridized to a sequencing primer bound to a polymerase.
Incorporation of one or more triphosphates into a new nucleic acid
strand at the 3' end of the sequencing primer can be detected by a
change in current by a chemFET. An array can have multiple chemFET
sensors. In another example, single nucleic acids can be attached
to beads, and the nucleic acids can be amplified on the bead, and
the individual beads can be transferred to individual reaction
chambers on a chemFET array, with each chamber having a chemFET
sensor, and the nucleic acids can be sequenced.
[0057] Another example of a sequencing technique that can be used
involves using an electron microscope as described, for example, by
Moudrianakis, E. N. and Beer M., in Base sequence determination in
nucleic acids with the electron microscope, III. Chemistry and
microscopy of guanine-labeled DNA, PNAS 53:564-71 (1965). In one
example of the technique, individual DNA molecules are labeled
using metallic labels that are distinguishable using an electron
microscope. These molecules are then stretched on a flat surface
and imaged using an electron microscope to measure sequences.
[0058] Sequencing according to embodiments of the invention
generates a plurality of reads. Reads according to the invention
generally include sequences of nucleotide data less than about 5000
bases in length, or less than about 150 bases in length. In certain
embodiments, reads are between about 80 and about 90 bases, e.g.,
about 85 bases in length. In some embodiments, methods of the
invention are applied to very short reads, i.e., less than about 50
or about 30 bases in length. Sequence read data can include the
sequence data as well as meta information. Sequence read data can
be stored in any suitable file format including, for example, VCF
files, FASTA files or FASTQ files, as are known to those of skill
in the art. In some embodiments, PCR product is pooled and
sequenced (e.g., on an Illumina HiSeq 2000). Raw .bcl files are
converted to qseq files using bclConverter (Illumina). FASTQ files
are generated by "de-barcoding" genomic reads using the associated
barcode reads; reads for which barcodes yield no exact match to an
expected barcode, or contain one or more low-quality base calls,
may be discarded. Reads may be stored in any suitable format such
as, for example, FASTA or FASTQ format.
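By way of illustration only, the de-barcoding step described above may be sketched as follows. The sketch assumes that each genomic read arrives paired with its barcode read and the barcode's per-base quality scores, and that the low-quality test is applied to the barcode read; the function name debarcode, the quality threshold of 20, and the data layout are hypothetical.

```python
def debarcode(reads, expected_barcodes, min_quality=20):
    """Group genomic reads by sample using exact barcode matches.

    reads: iterable of (genomic_sequence, barcode_sequence, barcode_qualities)
    expected_barcodes: dict mapping barcode sequence -> sample identifier
    Reads whose barcode has no exact match, or whose barcode contains any
    base call below min_quality, are discarded.
    """
    by_sample = {sample: [] for sample in expected_barcodes.values()}
    discarded = 0
    for sequence, barcode, qualities in reads:
        sample = expected_barcodes.get(barcode)
        if sample is None or min(qualities) < min_quality:
            discarded += 1
            continue
        by_sample[sample].append(sequence)
    return by_sample, discarded
```

The per-sample lists returned by such a function could then be written out as one FASTQ file per patient-specific barcode.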
[0059] FASTA was originally a computer program for searching
sequence databases and the name FASTA has come to also refer to a
standard file format. See Pearson & Lipman, 1988, Improved
tools for biological sequence comparison, PNAS 85:2444-2448. A
sequence in FASTA format begins with a single-line description,
followed by lines of sequence data. The description line is
distinguished from the sequence data by a greater-than (">")
symbol in the first column. The word following the ">" symbol is
the identifier of the sequence, and the rest of the line is the
description (both are optional). There should be no space between
the ">" and the first letter of the identifier. It is
recommended that all lines of text be shorter than 80 characters.
The sequence ends if another line starting with a ">" appears;
this indicates the start of another sequence.
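By way of illustration only, the FASTA layout just described may be read with a short routine such as the following sketch. The function name parse_fasta is hypothetical; the routine simply applies the rules stated above (a ">" description line, an optional identifier and description, and sequence lines continuing until the next ">").

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier, description, chunks = None, "", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif line and identifier is not None:
            chunks.append(line.strip())
    if identifier is not None:
        yield identifier, description, "".join(chunks)
```

Because the routine is a generator, it can be applied directly to an open file handle without loading the whole file into memory.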
[0060] The FASTQ format is a text-based format for storing both a
biological sequence (usually nucleotide sequence) and its
corresponding quality scores. It is similar to the FASTA format but
with quality scores following the sequence data. Both the sequence
letter and quality score are encoded with a single ASCII character
for brevity. The FASTQ format is a de facto standard for storing
the output of high throughput sequencing instruments such as the
Illumina Genome Analyzer. Cock et al., 2009, The Sanger FASTQ file
format for sequences with quality scores, and the Solexa/Illumina
FASTQ variants, Nucleic Acids Res 38(6):1767-1771.
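By way of illustration only, a FASTQ record may be read and its quality characters decoded as in the following sketch. The sketch assumes the Sanger variant of the format, in which each record occupies four unwrapped lines and each quality character encodes a Phred score as its ASCII code minus 33; the function name parse_fastq is hypothetical.

```python
def parse_fastq(lines, offset=33):
    """Yield (identifier, sequence, qualities) from four-line FASTQ records.

    Assumes the Sanger variant: each quality character encodes a Phred score
    as ord(character) - offset, and records are not line-wrapped.
    """
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield header[1:], sequence, [ord(c) - offset for c in quality]
```

Other FASTQ variants use a different offset, which is why the offset is exposed as a parameter rather than fixed.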
[0061] For FASTA and FASTQ files, meta information includes the
description line and not the lines of sequence data. In some
embodiments, for FASTQ files, the meta information includes the
quality scores. For FASTA and FASTQ files, the sequence data begins
after the description line and is typically presented using some
subset of IUPAC ambiguity codes optionally with "-". In a preferred
embodiment, the sequence data will use the A, T, C, G, and N
characters, optionally including "-" or U as-needed (e.g., to
represent gaps or uracil).
[0062] Following sequencing, reads are preferably mapped 135 to a
reference using assembly and alignment techniques known in the art
or developed for use in the workflow. Various strategies for the
alignment and assembly of sequence reads, including the assembly of
sequence reads into contigs, are described in detail in U.S. Pat.
No. 8,209,130, incorporated herein by reference. Strategies may
include (i) assembling reads into contigs and aligning the contigs
to a reference; (ii) aligning individual reads to the reference;
(iii) assembling reads into contigs, aligning the contigs to a
reference, and aligning the individual reads to the contigs; or
(iv) other strategies known in the art or later developed.
Mapping 135, it can be seen, may employ assembly steps, alignment
steps, or both. Assembly can be implemented, for example, by the
program `The Short Sequence Assembly by k-mer search and 3' read
Extension` (SSAKE), from Canada's Michael Smith Genome Sciences
Centre (Vancouver, B.C., CA) (see, e.g., Warren et al., 2007,
Assembling millions of short DNA sequences using SSAKE,
Bioinformatics, 23:500-501). SSAKE cycles through a table of reads
and searches a prefix tree for the longest possible overlap between
any two sequences. SSAKE clusters reads into contigs.
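By way of illustration only, the idea of extending a sequence by its longest overlap with another read may be sketched as follows. This is not the SSAKE implementation: plain string comparison is substituted for the prefix tree, and the function names longest_overlap and extend, the minimum overlap of 3, and the reads are hypothetical.

```python
def longest_overlap(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def extend(seed: str, reads: list[str], min_overlap: int = 3) -> str:
    """Greedily extend `seed` at its 3' end with the best-overlapping read."""
    remaining = list(reads)
    while remaining:
        best = max(remaining, key=lambda r: longest_overlap(seed, r, min_overlap))
        size = longest_overlap(seed, best, min_overlap)
        if size == 0:
            break  # no remaining read overlaps the current contig end
        seed += best[size:]
        remaining.remove(best)
    return seed


# Hypothetical reads that tile a short region.
contig = extend("ACGTAC", ["TACGGA", "GGATTC"])
```

A prefix tree, as used by SSAKE, serves the same purpose as longest_overlap here but avoids comparing the seed against every read in turn.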
[0063] Another read assembly program is Forge Genome Assembler,
written by Darren Platt and Dirk Evers and available through the
SourceForge web site maintained by Geeknet (Fairfax, Va.) (see,
e.g., DiGuistini et al., 2009, De novo sequence assembly of a
filamentous fungus using Sanger, 454 and Illumina sequence data,
Genome Biology, 10:R94). Forge distributes its computational and
memory consumption to multiple nodes, if available, and has
therefore the potential to assemble large sets of reads. Forge was
written in C++ using the parallel MPI library. Forge can handle
mixtures of reads, e.g., Sanger, 454, and Illumina reads.
[0064] Assembly through multiple sequence alignment can be
performed, for example, by the program Clustal Omega, (Sievers et
al., 2011, Fast, scalable generation of high-quality protein
multiple sequence alignments using Clustal Omega, Mol Syst Biol
7:539), ClustalW, or ClustalX (Larkin et al., 2007, Clustal W and
Clustal X version 2.0, Bioinformatics, 23(21):2947-2948) available
from University College Dublin (Dublin, Ireland).
[0065] Another exemplary read assembly program known in the art is
Velvet, available through the web site of the European
Bioinformatics Institute (Hinxton, UK) (Zerbino & Birney,
Velvet: Algorithms for de novo short read assembly using de Bruijn
graphs, Genome Research 18(5):821-829). Velvet implements an
approach based on de Bruijn graphs, uses information from read
pairs, and implements various error correction steps.
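By way of illustration only, the construction of a de Bruijn graph from reads may be sketched as follows. The sketch shows only the basic graph construction, in which each k-mer contributes an edge between its leading and trailing (k-1)-mers; it does not reproduce Velvet's use of read pairs or its error correction, and the function name de_bruijn is hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical read decomposed into 3-mers ACG, CGT, GTC.
graph = de_bruijn(["ACGTC"], 3)
```

An assembler would then walk paths through such a graph to reconstruct contigs.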
[0066] Read assembly can be performed with the programs from the
package SOAP, available through the website of Beijing Genomics
Institute (Beijing, CN) or BGI Americas Corporation (Cambridge,
Mass.). For example, the SOAPdenovo program implements a de Bruijn
graph approach. SOAP3/GPU aligns short reads to a reference
sequence.
[0067] Another read assembly program is ABySS, from Canada's
Michael Smith Genome Sciences Centre (Vancouver, B.C., CA) (Simpson
et al., 2009, ABySS: A parallel assembler for short read sequence
data, Genome Res., 19(6):1117-23). ABySS uses the de Bruijn graph
approach and runs in a parallel environment.
[0068] Read assembly can also be done by Roche's GS De Novo
Assembler, known as gsAssembler or Newbler (NEW assemBLER), which
is designed to assemble reads from the Roche 454 sequencer
(described, e.g., in Kumar & Blaxter, 2010, Comparing de novo
assemblers for 454 transcriptome data, BMC Genomics 11:571 and
Margulies 2005). Newbler accepts 454 Flx Standard reads and 454
Titanium reads as well as single and paired-end reads and
optionally Sanger reads. Newbler is run on Linux, in either 32 bit
or 64 bit versions. Newbler can be accessed via a command-line or a
Java-based GUI interface. Additional discussion of read assembly
may be found in Li et al., 2009, The Sequence alignment/map (SAM)
format and SAMtools, Bioinformatics 25:2078; Lin et al., 2008,
ZOOM! Zillions Of Oligos Mapped, Bioinformatics 24:2431; Li &
Durbin, 2009, Fast and accurate short read alignment with
Burrows-Wheeler Transform, Bioinformatics 25:1754; and Li, 2011,
Improving SNP discovery by base alignment quality, Bioinformatics
27:1157. Assembled sequence reads may preferably be aligned to a
reference.
[0069] Methods for alignment are known in the art and may make use
of a computer program that performs alignment, such as
Burrows-Wheeler Aligner.
[0070] In certain embodiments, reads are aligned to hg18 on a
per-sample basis using Burrows-Wheeler Aligner version 0.5.7 for
short alignments, and genotype calls are made using Genome Analysis
Toolkit. See McKenna et al., 2010, The Genome Analysis Toolkit: a
MapReduce framework for analyzing next-generation DNA sequencing
data, Genome Res 20(9):1297-1303. High-confidence genotype calls
may be defined as having depth ≥50 and strand bias score
≤0. Clinical significance of variant calls is an important
question in carrier screening and will be addressed below. Other
computer programs for assembling reads are known in the art. Such
assembly programs can run on a single general-purpose computer, on
a cluster or network of computers, or on specialized computing
devices dedicated to sequence analysis.
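By way of illustration only, the high-confidence criterion stated above (depth of at least 50 and a strand bias score of at most 0) may be applied to a set of genotype calls as in the following sketch. The function name is_high_confidence and the example calls are hypothetical, and the way depth and strand bias are extracted from genotyping output is not shown.

```python
def is_high_confidence(depth: int, strand_bias: float) -> bool:
    """Apply the depth and strand-bias thresholds for a high-confidence call."""
    return depth >= 50 and strand_bias <= 0


# Hypothetical genotype calls as (identifier, depth, strand bias score).
calls = [("call_a", 120, -3.5), ("call_b", 35, -1.0), ("call_c", 80, 0.7)]
confident = [name for name, depth, bias in calls if is_high_confidence(depth, bias)]
```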
[0071] In some embodiments, de-barcoded fastq files are obtained as
described above and partitioned by capture region (exon) using the
target arm sequence as a unique key. Reads are assembled in
parallel by exon using SSAKE version 3.7 with parameters "-m 30 -o
15". The resulting contiguous sequences (contigs) can be aligned to
hg18 (e.g., using BWA version 0.5.7 for long alignments with
parameter "-r 1"). In some embodiments, short-read alignment is
performed as described above, except that sample contigs (rather
than hg18) are used as the input reference sequence. Software may
be developed in Java to accurately transfer coordinate and variant
data (gaps) from local sample space to global reference space for
every BAM-formatted alignment. Genotyping and base-quality
recalibration may be performed on the coordinate-translated BAM
files using the GATK program.
[0072] In some embodiments, any or all of the steps of the
invention are automated. For example, a Perl script or shell script
can be written to invoke any of the various programs discussed
above (see, e.g., Tisdall, Mastering Perl for Bioinformatics,
O'Reilly & Associates, Inc., Sebastopol, CA 2003; Michael, R.,
Mastering Unix Shell Scripting, Wiley Publishing, Inc.,
Indianapolis, Ind. 2003). Alternatively, methods of the invention
may be embodied wholly or partially in one or more dedicated
programs, for example, each optionally written in a compiled
language such as C++ then compiled and distributed as a binary.
Methods of the invention may be implemented wholly or in part as
modules within, or by invoking functionality within, existing
sequence analysis platforms. In certain embodiments, methods of the
invention include a number of steps that are all invoked
automatically responsive to a single starting cue (e.g., one or a
combination of triggering events sourced from human activity,
another computer program, or a machine). Thus, the invention
provides methods in which any of the steps or any combination of
the steps can occur automatically responsive to a cue.
Automatically generally means without intervening human input,
influence, or interaction (i.e., responsive only to original or
pre-cue human activity).
[0073] Mapping 135 sequence reads to a reference, by whatever
strategy, may produce output such as a text file or an XML file
containing sequence data such as a sequence of the nucleic acid
aligned to a sequence of the reference genome. In certain
embodiments (e.g., see FIG. 1) mapping 135 reads to a reference
produces results stored in SAM or BAM file 179 and such results may
contain coordinates or a string describing one or more mutations in
the subject nucleic acid relative to the reference genome.
Alignment strings known in the art include Simple UnGapped
Alignment Report (SUGAR), Verbose Useful Labeled Gapped Alignment
Report (VULGAR), and Compact Idiosyncratic Gapped Alignment Report
(CIGAR) (Ning, Z., et al., Genome Research 11(10):1725-9 (2001)).
These strings are implemented, for example, in the Exonerate
sequence alignment software from the European Bioinformatics
Institute (Hinxton, UK).
[0074] In some embodiments, a sequence alignment is produced--such
as, for example, a sequence alignment map (SAM) or binary alignment
map (BAM) file--comprising a CIGAR string (the SAM format is
described, e.g., in Li, et al., The Sequence Alignment/Map format
and SAMtools, Bioinformatics, 2009, 25(16):2078-9). In some
embodiments, CIGAR displays or includes gapped alignments
one-per-line. CIGAR is a compressed pairwise alignment format
reported as a CIGAR string. A CIGAR string is useful for
representing long (e.g. genomic) pairwise alignments. A CIGAR
string is used in SAM format to represent alignments of reads to a
reference genome sequence.
[0075] A CIGAR string follows an established motif. Each character
is preceded by a number, giving the base counts of the event.
Characters used can include M, I, D, N, and S (M=match;
I=insertion; D=deletion; N=gap; S=substitution). The CIGAR string
defines the sequence of matches/mismatches and deletions (or gaps).
For example, the CIGAR string 2MD3M2D2M will mean that the
alignment contains 2 matches, 1 deletion (number 1 is omitted in
order to save some space), 3 matches, 2 deletions and 2 matches. In
general, for carrier screening or other assays such as the NGS
workflow depicted in FIG. 1, sequencing results will be used in
genotyping 141.
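By way of illustration only, a CIGAR string of the kind described above may be split into its counted events as in the following sketch. The sketch follows the convention used in the example above, in which an omitted count is read as 1; the SAM specification itself writes every count explicitly. Operation letters are treated as opaque labels, and the function name parse_cigar is hypothetical.

```python
import re


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (count, operation) pairs.

    A missing count is read as 1, matching the convention described above.
    """
    tokens = re.findall(r"(\d*)([A-Za-z])", cigar)
    # Reject input with stray characters that the pattern silently skipped.
    if "".join(n + op for n, op in tokens) != cigar:
        raise ValueError(f"unparseable CIGAR string: {cigar!r}")
    return [(int(n) if n else 1, op) for n, op in tokens]


events = parse_cigar("2MD3M2D2M")
```

For the example string, the result lists 2 matches, 1 deletion, 3 matches, 2 deletions, and 2 matches, in that order.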
[0076] Output from mapping 135 may be stored in a SAM or BAM file
179, in a variant call format (VCF) file 183, or other format. In
an illustrative embodiment, output is stored in a VCF file,
although methods described herein are applicable to other file
formats such as SAM or BAM files, as will be readily apparent to
one of skill in the art.
[0077] FIG. 2 gives a sample of an exemplary VCF file 183. A
typical VCF file 183 will include a header section and a data
section. The header contains an arbitrary number of
meta-information lines, each starting with characters `##`, and a
TAB delimited field definition line starting with a single `#`
character. The field definition line names eight mandatory columns
and the body section contains lines of data populating the columns
defined by the field definition line. The VCF format is described
in Danecek et al., 2011, The variant call format and VCFtools,
Bioinformatics 27(15):2156-2158.
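By way of illustration only, the header and data sections of a VCF file may be separated as in the following sketch, which relies only on the structure described above: meta-information lines beginning with "##", a single tab-delimited field definition line beginning with "#", and tab-delimited data lines. The function name read_vcf and the example record are hypothetical.

```python
def read_vcf(lines):
    """Split VCF text into meta lines, column names, and per-record dicts."""
    meta, columns, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])            # meta-information line
        elif line.startswith("#"):
            columns = line[1:].split("\t")   # field definition line
        elif line:
            # Data line: pair each value with its column name.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# Hypothetical minimal VCF content with the eight mandatory columns.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=60\n",
]
meta, columns, records = read_vcf(example)
```

Each record is returned as a mapping from column name to value, so that a field such as REF or FILTER can be retrieved by name rather than by character position.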
[0078] The data contained in a VCF file 183 as shown for example in
FIG. 2 represents the variants, or mutations, that are found in the
nucleic acid that was obtained from the sample from the patient and
sequenced. In its original sense, mutation refers to a change in
genetic information and has come to refer to the present genotype
that results from a mutation. As is known in the art, mutations
include different types of mutations such as substitutions,
insertions or deletions (INDELs), translocations, inversions,
chromosomal abnormalities, and others. By convention in some
contexts where two or more versions of genetic information or
alleles are known, the one thought to have the predominant
frequency in the population is denoted the wild type and the
other(s) are referred to as mutation(s). In general in some
contexts an absolute allele frequency is not determined (i.e., not
every human on the planet is genotyped) but allele frequency refers
to a calculated probable allele frequency based on sampling and
known statistical methods and often an allele frequency is reported
in terms of a certain population such as humans of a certain
ethnicity. Variant can be taken to be roughly synonymous to
mutation but referring to a genotype being described in comparison
or with reference to a reference genotype or genome. For example, as
used in bioinformatics, variant describes a genotype feature in
comparison to a reference such as the human genome (e.g., hg18 or
hg19, which may be taken as a wild type). An NGS workflow with
genotyping 141 generates data representing one or more mutations in a
genome of an individual that are generally reported as variants, or
"variant calls", in, for example, a VCF file 183.
[0079] With continuing reference to FIG. 2, a VCF file 183 includes
data representing one or more mutations. Those data may be analyzed
by methods of the invention to provide a report of the clinical
significance of the mutations in the genome of the individual.
[0080] FIG. 3 diagrams a method 301 for analyzing mutations
according to the invention. One benefit of a method 301 is an
ability to provide information about the clinical significance of
mutations in a patient's genome from data such as that provided by
sequencing, e.g., in FASTA/FASTQ files, SAM/BAM files, or VCF
files. Methods include obtaining 305 data representing a mutation
in a genome of an individual by, for example, the sampling,
sequencing, and mapping methods described above. A variant node in
a graph database is used 311 to store a description of the
mutation. A pointer is stored 317 in the variant node and the
pointer points to an adjacent node that provides information about
a clinical significance of the variant. Method 301 includes
querying 323 the graph database to obtain information reporting the
clinical significance of the mutation in the genome of the
individual.
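By way of illustration only, the steps of method 301 may be sketched with a small in-memory structure in which a variant node stores a reference to an adjacent node carrying clinical significance. The sketch is a stand-in for a graph database and does not represent the interface of any particular database product; the class name Node, the relationship label HAS_SIGNIFICANCE, and the example values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict
    # Each entry is a (relationship, Node) pair pointing at an adjacent node.
    neighbors: list = field(default_factory=list)

    def link(self, relationship: str, other: "Node") -> None:
        """Store a pointer from this node to an adjacent node."""
        self.neighbors.append((relationship, other))


def clinical_significance(variant: Node) -> list:
    """Query: follow pointers from a variant node to its significance nodes."""
    return [node.properties["significance"]
            for relationship, node in variant.neighbors
            if relationship == "HAS_SIGNIFICANCE"]


# Hypothetical variant and annotation, for illustration only.
variant = Node("Variant", {"description": "chr1:100 A>G"})
annotation = Node("ClinicalAnnotation", {"significance": "pathogenic"})
variant.link("HAS_SIGNIFICANCE", annotation)
report = clinical_significance(variant)
```

Because the pointer is held by the variant node itself, the query reaches the clinical information by direct traversal rather than by searching a separate index.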
[0081] To illustrate operation of the invention, the following
discusses obtaining mutation data in a VCF file, although one of
skill in the art will readily see that the discussion is extensible
to other formats. Using a workflow such as the NGS workflow
illustrated in FIG. 1, a VCF file containing mutation data is
obtained 305. The VCF file may be parsed to isolate its component
pieces of information and to consider each piece of information for
its own significance. There exist programs or application
programming interfaces (APIs) for parsing VCF files 183 or a
program may be written that parses data from the VCF file.
[0082] FIG. 4 gives a flow chart for a VCF parser. The flow chart
shown in FIG. 4 represents the conceptual steps that may go into
parsing a VCF file and extracting component information. Since the
various action blocks and loops are defined according to the format
of the VCF file as standardized (e.g., in Danecek, 2011,
Bioinformatics 27:2156), each character of information that is
extracted is treated for what it is. Thus, using VCF file 183 from
FIG. 2 for reference, the "A" that appears on line 16, character 7
(counting 1 tab as 1 character) is treated as a nucleotide in the
reference and the "A" that appears in line 17, character 17 is
simply part of the word "PASS" in the FILTER column. It is further
recognized that line 16 (and any subsequent line) is a single VCF
record within a VCF file. Each record from the VCF file represents
something found by sequencing the nucleic acid from the sample from
the patient. Each patient, having numerous genes in their genome,
has numerous alleles. Thus where carrier screening is performed for
a patient, the VCF run (e.g., all the VCF files produced by the NGS
sequencing) ultimately documents and shows the various alleles in
the patient's genome that were probed for by the probes used.
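The idea of treating each extracted field "for what it is" can be sketched in Python as follows. This is a minimal illustration, not the parser of FIG. 4: the function name parse_vcf_record and the sample record are assumptions made for the example, and only the eight fixed tab-separated columns are taken from the VCF standard.

```python
from typing import Dict, Optional

def parse_vcf_record(line: str) -> Optional[Dict[str, object]]:
    """Split one VCF data line into typed fields; return None for header lines."""
    if line.startswith("#") or not line.strip():
        return None  # meta-information line, column header, or blank line
    # The first eight tab-separated columns are fixed by the VCF standard.
    chrom, pos, vid, ref, alt, qual, flt, info_text = line.rstrip("\n").split("\t")[:8]
    info: Dict[str, object] = {}
    for item in info_text.split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flag entries carry no value
    return {
        "CHROM": chrom,
        "POS": int(pos),          # 1-based position on the reference
        "ID": vid,
        "REF": ref,               # nucleotide(s) in the reference
        "ALT": alt.split(","),    # one or more alternate alleles
        "QUAL": qual,             # kept as text because it may be "."
        "FILTER": flt,            # e.g. the word PASS, not four nucleotides
        "INFO": info,
    }

# Hypothetical record for illustration; it is not taken from VCF file 183.
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB"
record = parse_vcf_record(example)
print(record["REF"], record["ALT"], record["FILTER"])
```

Because each column is assigned to a named field, the "A" in the ALT column and the "A" inside PASS can never be confused downstream.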
[0083] FIG. 5 presents a model of data received from parsing a VCF.
As just discussed, one run from the sequencing instruments can
produce a plurality of VCF files. Each VCF file typically contains
a plurality of VCF records. Those records ultimately relate back to
the samples from which they were derived, and the samples can each
contain a plurality of alleles. However, this relationship just
described can also be described using an entity relationship
diagram, or ERD.
[0084] FIG. 6 shows an entity relationship diagram (ERD) 601 of the
data modeled by FIG. 5. An insight of the invention is that the ERD
601 satisfies the definition of a graph as used in graph theory
within mathematics and computer science. Graph theory provides a
well-known mathematical tool for representing systems. Graph theory
is the mathematical study of properties of formal mathematical
structures called graphs. In that context, a graph is a finite set
of points, termed vertices or nodes, connected by links termed
edges or arcs. A graph thus generally defines a set of vertices and
a set of pairs of vertices, which are the edges of the graph. There
are several types of graphs in graph theory. The type of a
particular graph largely depends upon the features of its
components, namely the attributes of its vertices and edges. For
example, when the set of pairs includes only distinct elements, the
graph is called a simple graph; when one or more pairs are
connected by multiple edges the graph is called a multi-graph; when
one or more vertices are connected to themselves the graph is
called a pseudo-graph; when the edges are assigned with directions
the graph is called a directed graph or a digraph; and when the
pairs of vertices are unordered the graph is called undirected.
Additional illustrative background on graph theory may be found in
U.S. Pat. No. 8,463,895 to Arora; U.S. Pat. No. 8,462,161 to
Barber; U.S. Pat. No. 7,523,117 to Zhang; U.S. Pat. No. 6,360,235
to Tilt; U.S. Pub. 2013/0222388 to McDonald; and U.S. Pub.
2007/0244675 to Shai, the contents of each of which are
incorporated by reference.
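The graph types named in this paragraph can be told apart mechanically from an edge list. The following Python sketch is illustrative only: the function name classify_undirected is hypothetical, and the choice to report a self-loop ahead of parallel edges when both are present is an assumption of the example.

```python
from collections import Counter
from typing import Iterable, Tuple

Edge = Tuple[str, str]

def classify_undirected(edges: Iterable[Edge]) -> str:
    """Name an undirected graph by the features of its edge set."""
    # Pairs are unordered, so (u, v) and (v, u) denote the same edge.
    normalized = [tuple(sorted(edge)) for edge in edges]
    has_loop = any(u == v for u, v in normalized)
    has_parallel = any(count > 1 for count in Counter(normalized).values())
    if has_loop:
        return "pseudo-graph"   # some vertex is connected to itself
    if has_parallel:
        return "multi-graph"    # some pair is connected by multiple edges
    return "simple graph"       # only distinct pairs, no loops

print(classify_undirected([("a", "b"), ("b", "c")]))
```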
[0085] It can be observed that ERD 601 presents a graph--a
collection of vertices and edges--or, described another way, a
set of nodes and the relationships that connect them. Graphs
represent entities as nodes and the ways in which those entities
relate to the world as relationships. This general-purpose,
expressive structure allows graphs to model all kinds of phenomena
such as NGS sequence files and their relationships to the source
biological samples and genetic concepts like certain alleles. There
are various dominant graph data models such as the property graph,
Resource Description Framework (RDF) triples, and hypergraphs. In
certain embodiments, a graph database used in the invention uses
the property graph model.
[0086] A property graph has characteristics such as containing
nodes and relationships (which are illustrated by ERD 601 in FIG.
6). The nodes contain properties (key-value pairs). Relationships
are named and directed, and have a start and end node; and
relationships can also contain properties. A graph database
management system (henceforth, a graph database) is an online
database management system with Create, Read, Update, and Delete
(CRUD) methods that expose a graph data model. Graph databases
according to the invention may be described or characterized
according to the underlying storage, the processing engine, or
both.
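A property graph as just characterized can be sketched with two small Python classes. This is an in-memory illustration under stated assumptions, not a database: the class names, the relate helper, and the property values (such as the gene name) are hypothetical, while the HAS_VARIANT relationship name follows the example discussed with FIG. 10.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass(eq=False)
class Node:
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)   # key-value pairs
    outgoing: List["Relationship"] = field(default_factory=list, repr=False)
    incoming: List["Relationship"] = field(default_factory=list, repr=False)

@dataclass(eq=False)
class Relationship:
    name: str            # relationships are named ...
    start: Node          # ... and directed, with a start node
    end: Node            # and an end node
    properties: Dict[str, Any] = field(default_factory=dict)

def relate(start: Node, name: str, end: Node, **properties: Any) -> Relationship:
    """Create a relationship and register it on both of its nodes."""
    rel = Relationship(name, start, end, dict(properties))
    start.outgoing.append(rel)
    end.incoming.append(rel)
    return rel

allele = Node("Allele", {"gene": "CFTR"})          # hypothetical values
variant = Node("Variant", {"name": "c.199A>G"})
rel = relate(allele, "HAS_VARIANT", variant, source="VCF")
print(rel.name, rel.end.properties["name"])
```

Registering the relationship on both nodes is what lets it be followed from either end without a separate lookup structure.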
[0087] Regarding the underlying storage, some graph databases use
native graph storage that is optimized and designed for storing and
managing graphs. Some databases serialize the graph data into a
relational database, an object-oriented database, or some other
general-purpose data store and present graph database functionality
on top of that.
[0088] Regarding the processing engine, some graph databases use
index-free adjacency, meaning that connected nodes physically
"point" to each other in the database. More broadly, graph
databases can include any database that, from the user's
perspective, behaves like a graph database (i.e., exposes a graph
data model through CRUD operations). In certain
embodiments, however, the invention provides the significant
performance advantages of index-free adjacency. Native graph
processing may describe graph databases that use index-free
adjacency.
[0089] A benefit of native graph storage is that it is engineered
for performance and scalability. A benefit of non-native graph
storage is that it typically depends on a mature non-graph backend
(such as MySQL) whose production characteristics are well
understood by operations teams. Native graph processing (index-free
adjacency) benefits traversal performance.
[0090] In the graph data model, relationships are included as
entities that themselves are stored as objects. Other database
management systems, by contrast, require connections between
entities to be inferred using contrived properties such as foreign
keys, or out-of-band processing like map-reduce. By assembling the simple
abstractions of nodes and relationships into connected structures,
graph databases provide arbitrarily sophisticated models that map
closely to the problem domain (e.g., FIG. 5). The resulting models
are simpler and at the same time more expressive than those
produced using traditional relational databases and the other NOSQL
stores.
[0091] Any suitable graph database can be used to implement the
systems and methods described herein. Exemplary graph databases may
include Microsoft Infinite Graph, Titan, OrientDB, Neo4j, *dex,
Franz Inc.'s AllegroGraph, and HyperGraphDB. Preferably, systems and
methods of the invention employ a graph compute engine.
[0092] A graph compute engine is a technology that enables global
graph computational algorithms to be run against large datasets.
Graph compute engines are designed to do things like identify
clusters in the data, or answer questions about how entities are
connected, and particularly to trace across a series of linked
ideas (e.g., SNP to allele to genetic condition to a literature
reference providing a clinical significance of the allele
containing the SNP).
[0093] A variety of different types of graph compute engines exist.
Most notably there are in-memory/single machine graph compute
engines like Cassovary, and distributed graph compute engines like
Pegasus or Giraph. A distributed graph compute engine may be
structured as described in Malewicz, et al., 2010, Pregel: a system
for large-scale graph processing, Proceedings ACM SIGMOD Int Conf
Management Data 135-146. Also see Rodriguez and Neubauer, 2010,
Constructions from Dots and Lines, Bulletin Am Soc Inf Sci Tech
36(6):35-41.
[0094] In preferred embodiments, systems and methods of the
invention store mutation descriptions using a graph database and
analyze mutations in graph space.
[0095] To achieve the benefits potentially offered by using a graph
database, a genetic analysis pipeline and methodology according to
the invention uses nodes as well as named and directed
relationships, with both the nodes and relationships serving as
containers for properties. With continuing reference to FIG. 6,
nodes and relationships are illustrated and index-free adjacency is
discussed.
[0096] A database engine that utilizes index-free adjacency is one
in which each node maintains direct references to its adjacent
nodes. Each node thus acts as a micro-index of other nearby nodes,
which is much cheaper than using global indexes. It means that
query times are independent of the total size of the graph, and are
instead simply proportional to the amount of the graph
searched.
[0097] A non-native graph database engine, in contrast, uses
(global) indexes to link nodes together. These indexes add a layer
of indirection to each traversal, thereby incurring greater
computational cost. Proponents of native graph processing argue
that index-free adjacency is crucial for fast, efficient graph
traversals. To understand why native graph processing is so much
more efficient than graphs based on heavy indexing, consider the
following. Depending on the implementation, index lookups could be
O(log n) in algorithmic complexity versus O(1) for looking up
immediate relationships. To traverse a network of m steps, the cost
of the indexed approach, at O(m log n), dwarfs the cost of O(m) for
an implementation that uses index-free adjacency.
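As a worked illustration with assumed numbers (the values of n and m are chosen for the example and do not come from any measured system), take an index over n = 10^6 records, a traversal of m = 5 steps, and a base-2 logarithm as a stand-in for the number of comparisons in one index lookup:

```latex
\[
m \log_2 n \;=\; 5 \times \log_2 10^{6} \;\approx\; 5 \times 19.93 \;\approx\; 100
\qquad \text{versus} \qquad
m \;=\; 5 .
\]
```

Under these assumptions the indexed approach performs on the order of twenty times more elementary steps, and the gap widens as n grows while the index-free cost stays at m.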
[0098] Index-free adjacency provides lower-cost "joins." With
index-free adjacency, bidirectional joins are effectively
pre-computed and stored in the database as relationships. In
contrast, when using indexes to fake connections between records,
there is no actual relationship stored in the database. This
becomes problematic for traversals in the "opposite" direction from
the one for which the index was constructed. Such traversals
require a brute-force search through the index--which is an O(n)
operation--and joins like this are simply too costly to be of any
practical use. Index-free adjacency provides surprising
benefits in the context of reporting clinical significance of the
results of NGS-based carrier screening in that the concepts
involved are of just such a nature as to naturally lend themselves
to representation using the pre-computed bidirectional joins
offered by index free adjacency.
[0099] For at least these reasons, systems and methods of certain
embodiments of the invention use index-free adjacency to ensure
high-performance traversals. FIG. 6 shows how relationships
eliminate the need for index lookups. A graph database can use
relationships, not indexes, for fast traversals.
[0100] In a general-purpose graph database, relationships can be
traversed in either direction (tail to head, or head to tail)
extremely cheaply. Starting from a given VcfRun or a given allele,
a graph processing engine can find the related other one of those
two at a very low computation cost.
[0101] In certain embodiments, systems and methods of the invention
use native graph storage. If index-free adjacency is the key to
high-performance traversals, queries, and writes, then one key
aspect of the design of a graph database is the way in which graphs
are stored. An efficient, native graph storage format supports
extremely rapid traversals for arbitrary graph algorithms--an
important reason for using graphs.
[0102] A graph database such as Neo4j stores graph data in a number
of different store files. Each store file may contain the data for
a specific part of the graph (e.g., nodes, relationships,
properties). The division of storage responsibilities--particularly
the separation of graph structure from property data--facilitates
performant graph traversals, even though it means the user's view
of their graph and the actual records on disk are structurally
dissimilar. FIGS. 7-10 illustrate a node and relationship storage
structure as implemented by a graph database of the invention.
[0103] FIG. 7 diagrams a high-level architecture 701 of systems of
certain embodiments of the invention. From the bottom-up, systems
may operate using files on disk 733. Record files 739 provide a
basic level of storage to support the file system cache 741. The
object cache 747 is kept at a high level for rapid access as
discussed herein. Additionally, the disks 733 can store a
transaction log 725, which is written to by a transaction
management module 721. A graph database such as Neo4j includes or
provides a traversal API 755, core API 705, and a query language
713 such as Cypher.
[0104] FIG. 8 illustrates the structure of nodes 801 and
relationships 809 on disk as may be deployed within a physical
structure of systems of the invention. The node store file stores
node records. Every node created in the user-level graph ends up in
the node store. Preferably, the node store is a fixed-size record
store. While the precise values or traits may be varied as
necessary or best-suited to the invention, in the illustrated
embodiment, each node record 801 is nine bytes in length.
Fixed-size records enable fast lookups for nodes in the store file.
To illustrate, if a node has id 100, then it can be known that its
record begins 900 bytes into the file. Based on this format, the
database can directly compute a record's location, at cost O(1),
rather than performing a search, which would be cost O(log n). It
is noted that fixed-size record stores provide an improvement to a
computer in the sense that information storage efficiently exploits
the physical storage device for very fast retrieval and very fast
look-ups. Thus, genetic queries according to methods and systems of
the invention actually proceed faster at a hardware level than
prior art approaches--the computer itself is sped up by the
implementations described.
[0105] The first byte of a node 801 record is the in-use flag. This
tells the database whether the record is currently being used to
store a node. The next four bytes represent the ID of the first
relationship connected to the node, and the last four bytes
represent the ID of the first property for the node. The node
record is lightweight and contains just pointers to lists of
relationships and properties.
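The nine-byte node record and its offset computation can be sketched with Python's struct module. The field order follows the description above; the big-endian byte order, the in-memory bytearray standing in for the node store file, and the example ID values are assumptions of this sketch.

```python
import struct

# 1-byte in-use flag, 4-byte first-relationship ID, 4-byte first-property ID.
# ">" selects big-endian with no padding (byte order is an assumption here).
NODE_RECORD = struct.Struct(">BII")
assert NODE_RECORD.size == 9

def node_offset(node_id: int) -> int:
    """Byte offset of a node record: one multiplication, no search."""
    return node_id * NODE_RECORD.size

def read_node(store: bytes, node_id: int):
    """Return (in_use, first_relationship_id, first_property_id)."""
    in_use, first_rel, first_prop = NODE_RECORD.unpack_from(store, node_offset(node_id))
    return bool(in_use), first_rel, first_prop

# An in-memory stand-in for the store file, sized for node IDs 0 through 100.
store = bytearray(NODE_RECORD.size * 101)
NODE_RECORD.pack_into(store, node_offset(100), 1, 7, 42)  # hypothetical IDs 7 and 42

print(node_offset(100))       # 900
print(read_node(store, 100))  # (True, 7, 42)
```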
[0106] Correspondingly, relationships are stored in a relationship
store file. Like the node store, the relationship store consists of
fixed-sized records--in this case each relationship record 809 is
33 bytes long. Each relationship record 809 contains the IDs of the
nodes at the start and end of the relationship, a pointer to the
relationship type (which is stored in the relationship type store),
and pointers for the next and previous relationship records for
each of the start and end nodes. These last pointers are part of
what is often called the relationship chain.
[0107] The node and relationship stores are concerned only with the
structure of the graph, not its property data. Both stores use
fixed-sized records so that any individual record's location within
a store file can be rapidly computed given its ID. The significance
can hardly be overstated: the described structure improves the
operation of the hardware itself.
[0108] Using the described structures, given the way that the
various store files are stored on disk, graph processing operations
are low-cost. Each of the node records contains a pointer to that
node's first property and first relationship in a relationship
chain. To read a node's properties, one may follow the singly
linked list structure beginning with the pointer to the first
property. To find a relationship for a node, one may follow that
node's relationship pointer to its first relationship and then
follow the doubly linked list of relationships for that particular
node (that is, either the start node doubly linked list, or the end
node doubly linked list) until the relationship of interest is
found.
[0109] Having found the record for the relationship of interest,
that relationship's properties can be read (if there are any) using
the same singly linked list structure as is used for node
properties, or the node records can be examined for the two nodes
the relationship connects using its start node and end node IDs.
These IDs, multiplied by the node record size, give the immediate
offset of each node in the node store file.
[0110] In some embodiments, systems and methods of the invention
use doubly-linked lists in the relationship store. It is noted that
a relationship record 809 can be thought of as "belonging" to two
nodes--the start node and the end node of the relationship. To
avoid storing two relationship records and to make the relationship
record belong to both the start node and the end node, there are
pointers (aka record IDs) for two doubly linked lists: one is the
list of relationships visible from the start node; the other is the
list of relationships visible from the end node. This provides rapid
iteration through that list in either direction, and efficient
insertion or deletion of relationships.
[0111] Choosing to follow a different relationship involves
iterating through a linked list of relationships until a candidate
matching the correct type or having some matching property value is
found. The found relationship gives a new ID. The new ID is
multiplied by record size as a new pointer and the traversal
continues. With fixed-sized records and pointer-like record IDs,
traversals are implemented simply by chasing pointers around a data
structure, which can be performed at very high speed. To traverse a
particular relationship from one node to another, the database
performs several cheap ID computations (these computations are much
cheaper than searching global indexes, as would be required if
faking a graph in a non-graph native database). First, from a given
node record, the first record in the relationship chain is located
by computing its offset into the relationship store--that is, by
multiplying its ID by the fixed relationship record size (e.g., 33
bytes). This gets to the right record in the relationship store.
Then, from the relationship record, look in the second node field
to find the ID of the second node. Multiply that ID by the node
record size (e.g., nine bytes) to locate the correct node record in
the store.
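The two ID computations just described can be sketched as a single "hop" in Python. The 33-byte relationship layout below (in-use flag followed by eight 4-byte fields) is one arrangement consistent with the description and is an assumption of this sketch, as are the byte order, the NIL sentinel, and the in-memory buffers standing in for the store files.

```python
import struct

NODE = struct.Struct(">BII")  # in-use, first relationship ID, first property ID
# Assumed field order: in-use, start node, end node, type,
# start-prev, start-next, end-prev, end-next, first property.
REL = struct.Struct(">B8I")
assert NODE.size == 9 and REL.size == 33
NIL = 0xFFFFFFFF              # sentinel meaning "no record" (an assumption)

def hop(node_store: bytes, rel_store: bytes, node_id: int) -> int:
    """Follow a node's first relationship to the node at its other end."""
    _, first_rel, _ = NODE.unpack_from(node_store, node_id * NODE.size)
    if first_rel == NIL:
        raise LookupError("node has no relationships")
    # Relationship ID times fixed record size gives the record's offset.
    record = REL.unpack_from(rel_store, first_rel * REL.size)
    start_node, end_node = record[1], record[2]
    # The same record serves both ends, so either direction is one lookup.
    return end_node if start_node == node_id else start_node

# Two nodes (IDs 0 and 1) joined by relationship 0, running from node 0 to node 1.
node_store = bytearray(NODE.size * 2)
NODE.pack_into(node_store, 0 * NODE.size, 1, 0, NIL)
NODE.pack_into(node_store, 1 * NODE.size, 1, 0, NIL)
rel_store = bytearray(REL.size)
REL.pack_into(rel_store, 0, 1, 0, 1, 0, NIL, NIL, NIL, NIL, NIL)

print(hop(node_store, rel_store, 0), hop(node_store, rel_store, 1))
```

No index is consulted at any point; each step is a multiplication followed by a read at the computed offset.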
[0112] In addition to the node and relationship stores, which
contain the graph structure, systems include the property store
files. These store the user's key-value pairs. Properties may be
attached to both nodes and relationships. The property stores,
therefore, are referenced from both node and relationship records.
Records in the property store are physically stored in a file. As
with the node and relationship stores, property records are of a
fixed size. Each property record consists of four property blocks
and the ID of the next property in the property chain. Properties
are held as a singly linked list on disk as compared to the doubly
linked list used in relationship chains. Each property occupies
between one and four property blocks--a property record can,
therefore, hold at most four properties. A property record holds the
property type and a pointer to the property index file, which is
where the property name is stored. For each property's value, the
record contains either a pointer into a dynamic store record or an
inlined value. The dynamic stores allow for storing large property
values. A graph database may optimize storage where it inlines some
properties into the property store file directly. This happens when
property data can be encoded to fit in one or more of a record's
four property blocks. In practice this means that data like variant
calls can be inlined in the property store file directly, rather
than being pushed out to the dynamic stores. This results in
reduced I/O operations and improved throughput, because only a
single file access is required.
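Reading a property chain amounts to walking a singly linked list of records. The Python sketch below is a simplified illustration: it assumes every property fits in a single block and is inlined, it models the property store as a dictionary keyed by hypothetical record IDs, and the names iter_properties and property_store are invented for the example.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Block = Tuple[str, str]                              # (property name, inlined value)
PropertyRecord = Tuple[List[Block], Optional[int]]   # blocks, ID of next record

def iter_properties(store: Dict[int, PropertyRecord],
                    first_id: Optional[int]) -> Iterator[Block]:
    """Yield every property in a chain, starting from the first property ID."""
    record_id = first_id
    while record_id is not None:
        blocks, next_id = store[record_id]
        if len(blocks) > 4:
            raise ValueError("a property record has only four blocks")
        yield from blocks
        record_id = next_id          # singly linked: forward pointer only

# Hypothetical chain of two records for one variant node.
property_store: Dict[int, PropertyRecord] = {
    3: ([("name", "m.593T>C"), ("type", "variant")], 8),
    8: ([("source", "VCF")], None),
}
print(list(iter_properties(property_store, 3)))
```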
[0113] In addition to in-lining certain compatible property values,
a graph database can also reference long values as property names
(e.g., complete journal article titles and citations). In such
cases, property names are indirectly referenced from the property
store through the property index file. The property index allows
all properties with the same name to share a single record, and
thus for repetitive graphs achieves considerable space and I/O
savings.
[0114] To improve the performance characteristics of
mechanical/electronic mass storage devices, many graph databases
use in-memory caching to provide probabilistic low latency access
to the graph. Neo4j uses a two-tiered caching architecture to
provide this functionality.
[0115] The lowest tier in the Neo4j caching stack is the file
system cache 741. The file system cache 741 is a page-affined
cache, meaning the cache divides each store into discrete regions,
and then holds a fixed number of regions per store file. The actual
amount of memory to be used to cache the pages for each store file
can be fine-tuned, though in the absence of input from the user,
Neo4j will use sensible default values based on the capacity of the
underlying hardware. Pages are evicted from the cache based on a
least-frequently-used (LFU) cache policy.
[0116] The file system cache 741 is particularly beneficial when
related parts of the graph are modified at the same time such that
they occupy the same page. This is a common pattern for writes,
where whole sub-graphs (such as a patient's NGS results and
associated carrier screening report) are written to disk in a
single operation, rather than discrete nodes and relationships.
[0117] A graph database may be manipulated through a query
language, which can be either imperative or declarative. One such
language is the Cypher query language. Cypher is a declarative
graph query language for Neo4j that allows for expressive and
efficient querying and updating of the graph store. Cypher contains
a variety of clauses, some of the most common of which include
MATCH and WHERE. These functions are slightly different than in
SQL. MATCH is used for describing the structure of the pattern
searched for, primarily based on relationships, and WHERE is used
to add additional constraints to patterns. Cypher additionally
contains clauses for writing, updating, and deleting data. CREATE
and DELETE are used to create and delete nodes and relationships.
SET is used to set property values and add labels on nodes, and
REMOVE is used to remove them.
[0118] Systems and methods of the invention provide very rapid
transactions, idiomatic queries, and an excellent ability to "scale
up" with very large data sizes. The topic of scale has become more
important as data volumes have grown. Graph databases don't suffer
the same latency problems as traditional relational databases,
where the more data that exists in tables--and in indexes--the
longer the join operations. With a graph database, most queries
follow a pattern whereby an index is used simply to find a starting
node (or nodes). The remainder of the traversal then uses a
combination of pointer chasing and pattern matching to search the
data store. What this means is that, unlike relational databases,
performance does not depend on the total size of the dataset, but
only on the data being queried. This leads to performance times
that are nearly constant (i.e., are related to the size of the
result set), even as the size of the dataset grows. Throughput,
speed, and scalability of graph databases make them suited to
genetic analysis and reporting. Given the input/output-intensive
nature of such sequencing, variant-calling, genotyping, and
clinical reporting, a typical operation reads and writes a set of
related data. In other words, the application performs multiple
operations on a logical sub-graph within the overall dataset. With
a graph database such multiple operations can be rolled up into
larger, more cohesive operations. Further, with a graph-native
store, executing each operation takes less computational effort
than the equivalent relational operation. Graphs scale by doing
less work for the same outcome.
[0119] FIG. 9 illustrates the use of a variant node 901 in a graph
database to store a description of a mutation. The first byte of
the variant node 901 record is set to show that node 901 is in use.
The next four bytes of node 901 represent the ID of the first
relationship connected to the node. Through the ID of that first
relationship, node 901 thus includes a pointer to an adjacent node
(adjacent by definition, since the relationship is identified by
the four bytes in node 901). The last four bytes of node 901
represent the ID of the first property for the node.
[0120] To read the first property for node 901, one may follow the
singly linked list structure to the appropriate property record in
the property store. Property records in the property store are of a
fixed size and each property record consists of four property
blocks and the ID of the next property in the chain. The property
record holds the property type (here, "variant") and a pointer to
the property index file, which is where the property name is
stored. For each property's value, the record either points to a
dynamic store or an inline record. Here, the parser operating via
the logic mapped in FIG. 4 produces a record of a mutation (by
parsing that record from the VCF file) and can store that mutation
in the property index file. Thus the property index file for a
variant node preferably includes a description of a mutation.
[0121] A description of a mutation may be provided according to a
systematic nomenclature. For example, a variant can be described by
a systematic comparison to a specified reference which is assumed
to be unchanging and identified by a unique label such as a name or
accession number. For a given gene, coding region, or open reading
frame, the A of the ATG start codon is denoted nucleotide +1 and
the nucleotide 5' to +1 is -1 (there is no zero). A lowercase g, c,
or m prefix, set off by a period, indicates genomic DNA, cDNA, or
mitochondrial DNA, respectively.
[0122] A systematic name can be used to describe a number of
variant types including, for example, substitutions, deletions,
insertions, and variable copy numbers. A substitution name starts
with a number followed by a "from to" markup. Thus, 199A>G shows
that at position 199 of the reference sequence, A is replaced by a
G. A deletion is shown by "del" after the number. Thus 223delT
shows the deletion of T at nt 223 and 997-999del shows the deletion
of three nucleotides (alternatively, this mutation can be denoted
as 997-999delTTC). In short tandem repeats, the 3' nt is
arbitrarily assigned; e.g. a TG deletion is designated
1997-1998delTG or 1997-1998del (where 1997 is the first T before
C). Insertions are shown by ins after an interval. Thus 200-201insT
denotes that T was inserted between nts 200 and 201. Variable short
repeats appear as 997(GT)N-N'. Here, 997 is the first nucleotide of
the dinucleotide GT, which is repeated N to N' times in the
population.
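The substitution, deletion, and insertion forms described above are regular enough to be recognized with a few patterns. The following Python sketch is an assumption-laden illustration, not a complete nomenclature parser: the function name parse_name is hypothetical, and variable short repeats, intronic names, and bracketed multi-variant names are deliberately left unhandled.

```python
import re
from typing import Dict, Optional

_PATTERNS = [
    ("substitution", re.compile(r"^(?P<start>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$")),
    ("deletion", re.compile(r"^(?P<start>\d+)(?:-(?P<end>\d+))?del(?P<seq>[ACGT]*)$")),
    ("insertion", re.compile(r"^(?P<start>\d+)-(?P<end>\d+)ins(?P<seq>[ACGT]+)$")),
]

def parse_name(name: str) -> Optional[Dict[str, object]]:
    """Parse a simple systematic name; return None if the form is not recognized."""
    body = re.sub(r"^[gcm]\.", "", name)   # drop an optional g./c./m. prefix
    for kind, pattern in _PATTERNS:
        match = pattern.match(body)
        if match:
            parsed: Dict[str, object] = {"kind": kind}
            for key, value in match.groupdict().items():
                if value:                  # skip groups that were absent or empty
                    parsed[key] = int(value) if key in ("start", "end") else value
            return parsed
    return None

for example in ("199A>G", "223delT", "997-999del", "200-201insT"):
    print(example, parse_name(example))
```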
[0123] Variants in introns can use the intron number with a
positive number indicating a distance from the G of the invariant
donor GU or a negative number indicating a distance from an
invariant G of the acceptor site AG. Thus, IVS3+1C>T shows a C
to T substitution at nt+1 of intron 3. In any case, cDNA nucleotide
numbering may be used to show the location of the mutation, for
example, in an intron. Thus, c.1997+1C>T denotes the C to T
substitution at nt+1 after nucleotide 1997 of the cDNA. Similarly,
c.1997-2A>C shows the A to C substitution at nt-2 upstream of
nucleotide 1997 of the cDNA. When the full length genomic sequence
is known, the mutation can also be designated by the nt number of
the reference sequence.
[0124] Relative to a reference, a patient's genome may vary by more
than one mutation, or by a complex mutation that is describable by
more than one character string or systematic name. The invention
further provides systems and methods for describing more than one
variant using a systematic name. For example, two mutations in the
same allele can be listed within brackets as follows: [1997G>T;
2001A>C]. Systematic nomenclature is discussed in den Dunnen
& Antonarakis, 2003, Mutation Nomenclature, Curr Prot Hum Genet
7.13.1-7.13.8 as well as in Antonarakis and the Nomenclature
Working Group, 1998, Recommendations for a nomenclature system for
human gene mutations, Human Mutation 11:1-3. By such means, a
mutation can be described in the property index file of a variant
node.
[0125] While described here with reference to FIG. 9 as a "variant
node", it will be appreciated that node 901 can be instantiated or
used as any type, with the type being stored in the property
store.
[0126] FIG. 10 illustrates a simple example in which an allele node
is used to show that an allele includes a certain mutation by
representing the mutation using a variant node and representing a
relationship between the allele node and the variant node with a
"HAS_VARIANT" type relationship. This illustrates the simplicity of
connecting alleles to variants using relationships. After the
variant is created, literature references can be added to the
variant.
[0127] FIG. 11 shows elements of a graph database in which a
variant has been connected to two nodes, each for a literature
reference. From this setup emerges one of the powerful applications
of a graph database in processing results from NGS sequencing data.
If variant changes are made, those variant changes can be tracked
within systems of the invention without upsetting the
structure of the existing database.
[0128] To illustrate the invention by an example, a patient sample
could be sequenced via NGS technologies and the sequencing results
could include, in a VCF file, a description of a mutation in that
patient's mitochondrial genome. A variant node is used and a
property of that node (e.g., in a property index file) is used to
describe that mutation as m.593T>C. A relationship is created to
show that the mutation is described in a literature reference. The
relationship is a pointer to a LitRef node and the LitRef node
points to a property index file with information about the
literature reference. The property index file contains Zhang et
al., 2011, Is mitochondrial tRNAphe variant m.593T>C a
synergistically pathogenic mutation in Chinese LHON families with
m.11778G>A?, PLoS ONE 6(10):e26511. Based on the synergistic
pathogenesis alluded to by the literature reference, a geneticist
or curator may deem it important to flag instances in which a
patient has both m.593T>C and m.11778G>A in their genome.
This example illustrates the real power of a graph database and
index-free adjacency. A query can be initiated that starts at the
LitRef node just described and traverses to the variant node. That
query can traverse to the sample node for that patient and even to
a node for the patient. That query can then--by its own
terms--traverse from the patient or sample node examining for the
presence of a second variant node representing m.11778G>A. The
query can be programmed to, in the absence of said second variant
node, classify the mutation as benign. The query can be programmed
to, in the presence of said second variant node, classify the
mutation as pathogenic. Intermediate labels or other categories can
also be used. Since the query is traversing across a graph
database, a comprehensive index-based look-up is not required as
would be required in prior art RDBMSs.
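The traversal just described (LitRef node to variant node to sample node to patient node, then back out to look for the second variant) can be sketched in Python over an in-memory adjacency structure. Everything here is a hypothetical stand-in for a graph database: the Graph class, the node tuples, the sample and patient names, and the two-way classification rule as coded are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

NodeKey = Tuple[str, str]   # (node type, name)

class Graph:
    def __init__(self) -> None:
        # Each node keeps direct references to its neighbors in both directions.
        self.neighbors: Dict[NodeKey, Set[NodeKey]] = defaultdict(set)

    def relate(self, a: NodeKey, b: NodeKey) -> None:
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)

def classify_patients(g: Graph, litref: NodeKey, variant: NodeKey,
                      partner: NodeKey) -> Dict[str, str]:
    """Classify `variant` per patient by whether `partner` is also present."""
    result: Dict[str, str] = {}
    if variant not in g.neighbors[litref]:
        return result                       # reference does not describe it
    for sample in [n for n in g.neighbors[variant] if n[0] == "Sample"]:
        for patient in [n for n in g.neighbors[sample] if n[0] == "Patient"]:
            samples = [n for n in g.neighbors[patient] if n[0] == "Sample"]
            has_partner = any(partner in g.neighbors[s] for s in samples)
            result[patient[1]] = "pathogenic" if has_partner else "benign"
    return result

g = Graph()
lit = ("LitRef", "Zhang et al., 2011")
v593, v11778 = ("Variant", "m.593T>C"), ("Variant", "m.11778G>A")
s1, s2 = ("Sample", "S1"), ("Sample", "S2")        # hypothetical samples
p1, p2 = ("Patient", "P1"), ("Patient", "P2")      # hypothetical patients
g.relate(v593, lit)
g.relate(s1, p1)
g.relate(s2, p2)
g.relate(s1, v593)
g.relate(s1, v11778)
g.relate(s2, v593)
print(classify_patients(g, lit, v593, v11778))
```

Each step only inspects the neighbors of the node in hand, so the work done depends on the part of the graph visited rather than on the total number of stored patients or variants.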
[0129] It is important to note that the "graph" of the described
graph databases follows the counter-intuitive path of connecting
things of un-related categories. Although it is not the primary
structure or purpose described herein, one may imagine embodiments
in which a graph has a horizontal structure connecting entities
that are essentially similar in nature so that the database maps a
natural phenomenon. For example, a graph database could represent
protein interactions using the edges (aka pointers or
relationships) to represent interactions between proteins and thus
influxes of data would expand the graph "horizontally". However,
the invention is unlike the protein interaction example in that the
graph expands "vertically" outside of a set of natural phenomena.
Since a sample can have a node, the graph can reach to laboratory
management systems and receive from or provide information to, for
example, sample chain of custody modules. With NGS results from
that sample, the graph can leap vertically to a genetic plane and
represent human mutations that are being discovered. For an NGS
carrier screening application, the graph can reach vertically into
a different category to represent medical literature, and can go on
to be used in patient reports. The power of this novel vertical
structure is shown by the illustration of use of the invention for
reporting carrier screening results.
[0130] FIG. 12 illustrates a graph database in which a variant has
been connected to two nodes, each for a literature reference and in
which updated information about the variant has been introduced in
two changes. For example, node 17451 may represent a specific
mutation such as a SNP (e.g., G at a certain position). Node 17454
could be created when A is observed at that position.
[0131] Systems and methods of the invention support a plurality of
different use cases and applications. For example, if a graph
database is used in support of NGS carrier screening, one
capability that will emerge is support for evaluating and reporting
allele frequency.
[0132] For example, where a practitioner wants to know the
frequency of a certain allele across all included
research-consenting data, the graph database can easily be queried
for that.
[0133] FIG. 13 presents an example database that may be queried for
allele frequency.
[0134] Using--for example, in Cypher--the following (pseudo) code
produces the desired result.
MATCH (a:Allele)<--(sd:SampleData)-->(s:Sample)-->(p:Patient)
RETURN a, count(DISTINCT p)
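The same count can be sketched outside of Cypher. The following Python is a minimal illustration that assumes the pattern of the query above (a SampleData node linked to an Allele and to a Sample, and a Sample linked to a Patient). The identifiers, the toy data, and the function patients_per_allele are hypothetical and are not taken from FIG. 13.

```python
from collections import defaultdict

# Hypothetical relationships mirroring the query pattern:
# (Allele)<--(SampleData)-->(Sample)-->(Patient)
sample_data_to_allele = {"sd1": "alleleA", "sd2": "alleleA",
                         "sd3": "alleleB", "sd4": "alleleA"}
sample_data_to_sample = {"sd1": "s1", "sd2": "s2", "sd3": "s2", "sd4": "s3"}
sample_to_patient = {"s1": "p1", "s2": "p2", "s3": "p1"}


def patients_per_allele(sd_allele, sd_sample, sample_patient):
    """Count distinct patients per allele, analogous to
    RETURN a, count(DISTINCT p)."""
    patients = defaultdict(set)
    for sd, allele in sd_allele.items():
        sample = sd_sample.get(sd)
        patient = sample_patient.get(sample)
        if patient is not None:
            # A set keeps each patient once, even with several samples.
            patients[allele].add(patient)
    return {allele: len(ps) for allele, ps in patients.items()}


counts = patients_per_allele(sample_data_to_allele,
                             sample_data_to_sample,
                             sample_to_patient)

# Dividing by the number of patients gives the fraction of patients
# in whom each allele was observed.
total_patients = len(set(sample_to_patient.values()))
fractions = {allele: n / total_patients for allele, n in counts.items()}
print(counts, fractions)
```

Here patient p1 contributes two samples showing alleleA but is counted once, which is the effect of the DISTINCT keyword in the query. Note that the result is a per-patient carrier fraction; a per-chromosome allele frequency would additionally require zygosity information, which this sketch does not model.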
[0135] Another illustrative use case for application of a graph
database is the curation of variants, as was illustrated by FIGS.
10-12. The curation of variants involves taking variants (i.e.,
genetic mutations) that have been picked up through a sequencing
platform and then looking through the literature for references to
evaluate how common the variant is and whether it is identified as
pathogenic, benign, or somewhere in-between. This can be supported
and modeled by tracking three things: the connection of an allele to
a variant; variants and variant changes; and literature references per
variant. To illustrate, a geneticist may review a patient's
NGS sequencing results and observe the presence of a poly-T
variant. The geneticist may connect this variant to an allele of
the cystic fibrosis transmembrane conductance regulator (CFTR) gene
located on the long arm of chromosome 7 (e.g., as shown in FIG.
10). The geneticist may further observe that this variant is
described by a literature reference and connect the variant object
to two different LitRef objects such as one for each of Rowntree
and Harris, The phenotypic consequences of CFTR mutations, Ann Hum
Gen 67:471-485 (2003) and Kreindler, Cystic fibrosis: exploiting
its genetic basis in the hunt for new therapies, Pharmacol Ther
125(2):219-229 (2010) (e.g., according to the diagram of FIG. 11).
Moreover the geneticist may observe that the mutation (the poly-T)
is a novel poly-T variant in the acceptor splice site of intron 8
of CFTR in cis with R117H (i.e., c.350G>A based on GenBank cDNA
reference sequence NM_000492.3). In this instance, the
geneticist may want to update the graph for this patient by
connecting the poly-T mutation to a variant object for c.350G>A
(e.g., as seen in FIG. 12). To further illustrate, the chain of
updated variants may reveal that the patient has an allele with the
T5 poly-T variant, which evidence suggests plays a role in
pathogenic alternate splicing or exon skipping. Moreover, the
geneticist may further consider the data and determine that,
in fact, the patient's allele includes a T6 form of the poly-T
variant and may update the variant nodes to so reflect. Here, with
the addition of a T6 node, other content need not be modified. The
geneticist may add a LitRef node for Huang, et al., Comparative
analysis of common CFTR polymorphisms poly-T, TG-repeats and M470V
in a healthy Chinese population, World J Gastroenterol
14(12):1925-30 (2008). Thus if the NGS screening gave results
indicating a R117H with T6 variant, methods and systems of the
invention can be used to relate this clinical results data to the
existing infrastructure of medical information on one level and
back to the patient via the sample (through the VCF files and
instrument run) on another level. Since a graph database preferably
with index-free adjacency is used for each node, those connections
can be traversed to provide a report to the patient's attending
physician, where the report shows the patient to be R117H T6 and
gives the relevant literature with information about treatment and
outcomes. Since a graph database is used, the traversals are very
fast, and traversal times do not increase with increasing volumes of
database contents, as query times must in the context of prior art
relational databases.
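The curation workflow described above can be sketched in Python as a chain of variant nodes, each holding direct references to its literature and to the node that updates it. This is a simplified sketch under stated assumptions: the VariantNode class, its attribute names, the report function, and the shortened citation strings are hypothetical, and the chain poly-T, T5, T6 collapses the structure of FIG. 12 for brevity.

```python
class VariantNode:
    """A variant with direct pointers to literature and to its update."""

    def __init__(self, description):
        self.description = description
        self.lit_refs = []       # citations attached to this node
        self.updated_to = None   # node that supersedes this one, if any


def report(variant):
    """Follow update pointers from a variant and collect the history
    of descriptions and all attached citations."""
    history, references = [], []
    node = variant
    while node is not None:
        history.append(node.description)
        references.extend(node.lit_refs)
        node = node.updated_to
    return {"current": history[-1],
            "history": history,
            "references": references}


# Initial curation: a poly-T variant with two literature references,
# later refined to the T5 form in cis with R117H.
poly_t = VariantNode("poly-T")
poly_t.lit_refs = ["Rowntree and Harris (2003)", "Kreindler (2010)"]
t5 = VariantNode("R117H with T5")
poly_t.updated_to = t5
before = report(poly_t)

# Later update: add a T6 node and its reference. Existing nodes keep
# their own content; only one new pointer is set on the T5 node.
t6 = VariantNode("R117H with T6")
t6.lit_refs.append("Huang et al. (2008)")
t5.updated_to = t6
after = report(poly_t)
print(after)
```

In this sketch, adding the T6 node leaves the descriptions and citations of the earlier nodes untouched, and the report for the patient's variant is assembled purely by following pointers from node to node.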
[0136] As one skilled in the art would recognize as necessary or
best-suited for performance of the methods of the invention, a
computer system or machines of the invention include one or more
processors (e.g., a central processing unit (CPU), a graphics
processing unit (GPU), or both), a main memory and a static memory,
which communicate with each other via a bus.
[0137] FIG. 14 diagrams a system 1500 suitable for performing
methods of the invention. As shown in FIG. 14, system 1500 may
include one or more of a server computer 1513, a terminal 1567, a
sequencer 1501, a sequencer computer 1533, a computer 1549, or any
combination thereof. Each such computer device may communicate via
network 1509. Sequencer 1501 may optionally include or be operably
coupled to its own, e.g., dedicated, sequencer computer 1533
(including any input/output mechanisms (I/O), processor, and
memory). Additionally or alternatively, sequencer 1501 may be
operably coupled to a server 1513 or computer 1549 (e.g., laptop,
desktop, or tablet) via network 1509. Computer 1549 includes one or
more processors, memory, and I/O. Where methods of the invention
employ a client/server architecture, any steps of methods of the
invention may be performed using server 1513, which includes one or
more of processor, memory, and I/O, capable of obtaining data,
instructions, etc., or providing results via an interface module or
providing results as a file. Server 1513 may be engaged over
network 1509 through computer 1549 or terminal 1567, or server 1513
may be directly connected to terminal 1567. Terminal 1567 is
preferably a computer device. A computer according to the invention
preferably includes one or more processors coupled to an I/O
mechanism and memory.
[0138] A processor may be provided by one or more processors
including, for example, one or more of a single core or multi-core
processor (e.g., AMD Phenom II X2, Intel Core Duo, AMD Phenom II
X4, Intel Core i5, Intel Core i7 Extreme Edition 980X, or Intel
Xeon E7-2820).
[0139] An I/O mechanism may include a video display unit (e.g., a
liquid crystal display (LCD) or a cathode ray tube (CRT)), an
alphanumeric input device (e.g., a keyboard), a cursor control
device (e.g., a mouse), a disk drive unit, a signal generation
device (e.g., a speaker), an accelerometer, a microphone, a
cellular radio frequency antenna, and a network interface device
(e.g., a network interface card (NIC), Wi-Fi card, cellular modem,
data jack, Ethernet port, modem jack, HDMI port, mini-HDMI port,
USB port), touchscreen (e.g., CRT, LCD, LED, AMOLED, Super AMOLED),
pointing device, trackpad, light (e.g., LED), light/image
projection device, or a combination thereof.
[0140] Memory according to the invention refers to a non-transitory
memory which is provided by one or more tangible devices which
preferably include one or more machine-readable media on which are
stored one or more sets of instructions (e.g., software) embodying
any one or more of the methodologies or functions described herein.
The software may also reside, completely or at least partially,
within the main memory, processor, or both during execution thereof
by a computer within system 1500, the main memory and the processor
also constituting machine-readable media. The software may further
be transmitted or received over a network via the network interface
device.
[0141] While the machine-readable medium can in an exemplary
embodiment be a single medium, the term "machine-readable medium"
should be taken to include a single medium or multiple media (e.g.,
a centralized or distributed database, and/or associated caches and
servers) that store the one or more sets of instructions. The term
"machine-readable medium" shall also be taken to include any medium
that is capable of storing, encoding or carrying a set of
instructions for execution by the machine and that cause the
machine to perform any one or more of the methodologies of the
present invention. Memory may be, for example, one or more of a
hard disk drive, solid state drive (SSD), an optical disc, flash
memory, zip disk, tape drive, "cloud" storage location, or a
combination thereof. In certain embodiments, a device of the
invention includes a tangible, non-transitory computer readable
medium for memory. Exemplary devices for use as memory include
semiconductor memory devices (e.g., EPROM, EEPROM, solid state
drive (SSD), and flash memory devices such as SD, micro SD, SDXC,
SDIO, and SDHC cards); magnetic disks (e.g., internal hard disks or
removable disks); and optical disks (e.g., CD and DVD disks).
[0142] Components of system 1500 may be under the control of a
carrier screening service provider and may be operated to obtain
data representing a mutation in a genome of an individual, use a
variant node in a graph database to store a description of the
mutation (while storing, in the variant node, a pointer to an
adjacent node that provides information about a clinical
significance of the variant), and query the graph database to
provide a report of the clinical significance of the mutation in
the genome of the individual. Functionality of server computer 1513
may be provided by an outside vendor such as Amazon Web Services or
Amazon's EC2. In fact, the carrier screening entity who is
analyzing the mutations from the sample may not and need not have
actual knowledge of the physical location and type of computers
that provide server computer(s) 1513. It is enough that the entity
have access to and the ability to control at least a portion of
each of one or more of server computer 1513. In some embodiments, a
sequencing instrument 1501 is employed (e.g., an Illumina HiSeq
2000), which itself includes a sequencer computer 1533. The sample
from the patient may be received from an outside source (e.g., from
a phlebotomy facility down the hall) or may be sent by courier
(e.g., in an Eppendorf tube). Generally, the service provider will
have access to and use a computer 1549 for coordinating methods of
the invention. It is important to note that any given computer is
optional but typically at least one of the depicted computers
(sequencer computer 1533, local computer 1549, or server computer
1513) will be used to perform steps of the methods of the
invention. In some embodiments, sequencer 1501 is operated by an
outside service provider in support of or on order of the carrier
screening entity. Thus generally the carrier screening professional
has access to or control over components of the system.
INCORPORATION BY REFERENCE
[0143] References and citations to other documents, such as
patents, patent applications, patent publications, journals, books,
papers, web contents, have been made throughout this disclosure.
All such documents are hereby incorporated herein by reference in
their entirety for all purposes.
EQUIVALENTS
[0144] Various modifications of the invention and many further
embodiments thereof, in addition to those shown and described
herein, will become apparent to those skilled in the art from the
full contents of this document, including references to the
scientific and patent literature cited herein. The subject matter
herein contains important information, exemplification and guidance
that can be adapted to the practice of this invention in its
various embodiments and equivalents thereof.
* * * * *