U.S. patent application number 14/990323 was filed with the patent office on 2017-07-13 for systems and methods for adaptive local alignment for graph genomes.
The applicant listed for this patent is Seven Bridges Genomics Inc.. Invention is credited to Kaushik Ghose, Wan-Ping Lee.
Application Number | 20170199960 14/990323 |
Document ID | / |
Family ID | 57890907 |
Filed Date | 2017-07-13 |
United States Patent
Application |
20170199960 |
Kind Code |
A1 |
Ghose; Kaushik ; et
al. |
July 13, 2017 |
SYSTEMS AND METHODS FOR ADAPTIVE LOCAL ALIGNMENT FOR GRAPH
GENOMES
Abstract
Systems and methods for analyzing genomic information can
include obtaining a sequence read including genetic information;
identifying, within a graph representing a reference genome, a
plurality of candidate mapping positions that relate to the genetic
information, the graph comprising nodes representing genetic
sequences and edges connecting pairs of nodes; determining, by
means of a computer system, whether an alignment with the graph
surrounding each of the plurality of candidate mapping positions is
advanced or basic; and performing for each candidate mapping
position, by means of the computer system, a local alignment based
on whether the local alignment is advanced or basic. The advanced
local alignment can include a first-local-alignment algorithm, and
the basic local alignment includes a second-local-alignment
algorithm. Based on the local alignments, the mapped position of
the sequence read can be identified within the genome.
Inventors: |
Ghose; Kaushik; (Malden,
MA) ; Lee; Wan-Ping; (Somerville, MA) |
|
Applicant: |
Name |
City |
State |
Country |
Type |
Seven Bridges Genomics Inc. |
Cambridge |
MA |
US |
|
|
Family ID: |
57890907 |
Appl. No.: |
14/990323 |
Filed: |
January 7, 2016 |
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G16B 30/00 20190201 |
International
Class: |
G06F 19/22 20060101
G06F019/22 |
Claims
1. A method, comprising: obtaining a sequence read including
genetic information; identifying, within a graph representing a
reference genome, a plurality of candidate mapping positions that
relate to the genetic information, the graph comprising nodes
representing genetic sequences and edges connecting pairs of nodes;
determining, by means of a computer system, whether an alignment
with the graph surrounding each of the plurality of candidate
mapping positions is advanced or basic; performing for each
candidate mapping position, by means of the computer system, a
local alignment based on whether the local alignment is advanced or
basic, wherein: the advanced local alignment includes a
first-local-alignment algorithm, and the basic local alignment
includes a second-local-alignment algorithm; and based on the local
alignments, identifying an optimal mapped position of the sequence
read within the reference genome.
2. The method of claim 1, wherein the first-local-alignment
algorithm is different from the second-local-alignment
algorithm.
3. The method of claim 1, further comprising determining whether
the local alignment is advanced or basic based on at least one of:
a length of the graph, a variability of the graph, a total
processing time, a remaining processing time, and a number of
repeating elements.
4. The method of claim 1, further comprising determining whether
the local alignment is advanced or basic based, at least in part,
on a complexity of the graph.
5. The method of claim 4, further comprising determining whether
the local alignment is advanced or basic based, at least in part,
on a complexity of a subset of the graph surrounding each candidate
mapping position.
6. The method of claim 5, wherein the local alignment is advanced
if the complexity of a subset of the graph surrounding a candidate
mapping position is 10 or more nodes.
7. The method of claim 5, wherein the local alignment is basic if
the complexity of a subset of the graph surrounding a candidate
mapping position is 5 or fewer nodes.
8. The method of claim 1, wherein the second-local-alignment
algorithm comprises a pattern matching algorithm.
9. The method of claim 8, wherein the pattern matching algorithm is
selected from the group consisting of: a Boyer-Moore algorithm, a
Horspool algorithm, and a Tarhio-Ukkonen algorithm.
10. The method of claim 1, wherein performing a basic local
alignment comprises: linearizing a subset of the graph surrounding
each candidate mapping position into a plurality of linear
sequences, and performing a basic local alignment of the sequence
read against each of the plurality of linear sequences using the
second-local-alignment algorithm.
11. The method of claim 10, wherein linearizing a subset of the
graph surrounding each candidate mapping position into a plurality
of linear sequences comprises enumerating the number of unique
paths through the subset of the graph, and associating a linear
sequence with each enumerated path.
12. The method of claim 10, wherein linearizing a subset of the
graph surrounding each candidate mapping position into a plurality
of linear sequences comprises performing a depth first search of
the subset of the graph.
13. The method of claim 1, wherein the first-local-alignment
algorithm comprises a graph aware algorithm.
14. The method of claim 13, wherein the graph aware algorithm is a
modified Smith Waterman algorithm.
15. The method of claim 1, further comprising ranking each of the
candidate mapping positions based on a quality of the local
alignment.
16. The method of claim 15, further comprising re-aligning the
sequence read using a third-local-alignment algorithm if the
quality of the highest ranking local alignment is low.
17. A system for determining a subject's genetic information, the
system comprising: a computer system comprising a processor coupled
to memory and operable to: receive identities of a plurality of
nucleotides at known locations on a reference genome; receive
sequence reads from a sample from a subject; and map the sequence
reads to the reference genome, thereby identifying a corresponding
location on the reference genome, the mapping comprising:
identifying, within a graph representing a reference genome, a
plurality of candidate mapping positions that relate to the genetic
information, the graph comprising nodes representing genetic
sequences and edges connecting pairs of nodes; determining, by
means of a computer system, whether an alignment with the graph
surrounding each of the identified plurality of candidate mapping
positions is advanced or basic; performing for each candidate
mapping position, by means of the computer system, a local
alignment based on whether the local alignment is advanced or
basic, wherein: the advanced local alignment includes a
first-local-alignment algorithm, and the basic local alignment
includes a second-local-alignment algorithm; and based on the local
alignments, identify the mapped position of the sequence read
within the reference genome.
18. The system of claim 17, wherein the computer system is further
operable to determine whether the local alignment is advanced or
basic based, at least in part, on a complexity of a subset of the
graph surrounding each candidate mapping position.
19. The system of claim 17, wherein the first-local-alignment
algorithm comprises a graph aware alignment algorithm, and the
second-local-alignment algorithm comprises a linear alignment
algorithm.
20. The system of claim 17, wherein performing a basic local
alignment comprises: linearizing a subset of the graph surrounding
each candidate mapping position into a plurality of linear
sequences, and performing a basic local alignment of the sequence
read against each of the plurality of linear sequences using the
second-local-alignment algorithm.
Description
FIELD OF THE INVENTION
[0001] The invention relates to systems and methods for adaptive
local alignment for graph genomes.
BACKGROUND
[0002] A person's genetic information has the potential to reveal
much about their health and life. A risk of cancer or a genetic
disease may be revealed by the sequences of the person's genes, as
well as the possibility that his or her children could inherit a
genetic disorder. Genetic information can also be used to identify
an unknown organism, such as potentially infectious agents
discovered in samples from public food or water supplies.
Next-generation sequencing (NGS) technologies are available that
can sequence entire genomes quickly. Some approaches to analyzing
sequence reads involve mapping the sequence reads to a reference
genome. Mapping reads to a reference can be done by aligning each
read to a relevant portion of the reference.
SUMMARY
[0003] The described methods and systems provide a method for
automatically analyzing sequence data using a graph, such as a
reference directed acyclic graph (DAG), and accounting for the
various difficulties (e.g., complexities) of sequence alignment and
genomic variation without requiring intermediate input or guidance
from a user. Essentially, the described methods and systems
simplify and optimize the process for sequence alignment between a
sample sequence and a reference genome because the systems and
methods function autonomously from the user (after the sequence
information is received). A processor can, for example, identify
and align the sequence information in a manner that meets a user's
requirements for accuracy and expediency. In some cases, the
sequence information can be directly transmitted to the systems
described herein, which helps to provide a vastly simplified
sequence alignment experience for the user.
[0004] Representing genomes as graphs, such as directed acyclic
graphs (DAGs), is a powerful tool for coping with the complexities
of sequence alignment and genomic variation. In contrast to a
single linear reference sequence, a DAG can incorporate myriad
types of information about genomic variations, such as single
nucleotide polymorphisms (SNPs), small insertions and deletions
(indels), and larger structural variants (SVs). Each of these
variations can be represented by following different branches
through the DAG. Additionally, using DAGs can help to increase the
probability of identifying small variations near large known
structural variants because the presence of the known structural
variants in the graph improves the quality of the alignment. Thus,
known variations can be reliably accounted for and identified by
aligning reads containing the known variations to a sequence path
that includes those variations.
[0005] However, the complex nature of DAGs can result in a
combinatorial explosion when used with conventional linear
alignment tools thereby rendering alignment to a DAG
computationally expensive and cost prohibitive. The methods and
systems described throughout this document can automatically and
efficiently analyze sequence data using the reference DAG even when
doing so would conventionally be undesirable (e.g., computationally
expensive, time consuming, imprecise, and/or cost prohibitive). The
systems and methods can analyze the difficulty of aligning the
sequence data with the reference DAG based on a predicted alignment
difficulty (e.g., due to the DAG complexity, sequence length, total
processing time, remaining processing time, or any combination
thereof) between the reference DAG and the sequence data. By
choosing and employing a selected alignment algorithm based on this
predicted alignment difficulty, the methods and systems can
efficiently analyze sequence data.
[0006] In one aspect of the invention, a method includes
identifying, within a graph representing a reference genome, a
plurality of candidate mapping positions that relate to the genetic
information, the graph including nodes representing genetic
sequences and edges connecting pairs of nodes; determining, by
means of a computer system, whether an alignment with the graph
surrounding each of the plurality of candidate mapping positions is
advanced or basic; and performing for each candidate mapping
position, by means of the computer system, a local alignment based
on whether the local alignment is advanced or basic. The advanced
local alignment includes a first-local-alignment algorithm, and the
basic local alignment includes a second-local-alignment algorithm.
Based on the local alignments, the method further includes
identifying the mapped position of the sequence read within the
reference genome.
[0007] In another aspect of the invention, the system for
determining a subject's genetic information, the system includes a
computer system comprising a processor coupled to memory and
operable to: receive identities of a plurality of nucleotides at
known locations on a reference genome; receive sequence reads from
a sample from a subject; and map the sequence reads to the
reference genome, thereby identifying a corresponding location on
the reference genome. The mapping includes identifying, within a
graph representing a reference genome, a plurality of candidate
mapping positions that relate to the genetic information, the graph
comprising nodes representing genetic sequences and edges
connecting pairs of nodes; determining, by means of a computer
system, whether an alignment with the graph surrounding each of the
plurality of candidate mapping positions is advanced or basic; and
performing for each candidate mapping position, by means of the
computer system, a local alignment based on whether the local
alignment is advanced or basic. The advanced local alignment
includes a first-local-alignment algorithm, and the basic local
alignment includes a second-local-alignment algorithm. Based on the
local alignments, the system identifies the mapped position of the
sequence read within the reference genome.
[0008] Implementations can include one or more of the following
features.
[0009] In some implementations, the method further includes
performing, by means of the computer system, a global alignment for
the sequence read based on whether the global alignment is advanced
or basic. The advanced global alignment includes a
first-global-alignment algorithm, and the basic global alignment
includes alignment includes a second-global-alignment algorithm.
The global alignment provides information indicative of the
plurality of candidate mapping positions.
[0010] In certain implementations, the first-local-alignment
algorithm is different from the second-local-alignment
algorithm.
[0011] In some implementations, the first-global-alignment
algorithm is different from the second-global-alignment
algorithm.
[0012] In certain implementations, at least one of the
first-local-alignment algorithm and the second-local-alignment
algorithm is different from at least one of the
first-global-alignment algorithm and the second-global-alignment
algorithm.
[0013] In some implementations, the method further includes
determining whether the local alignment is advanced or basic based
at least in part on at least one of: the complexity of the graph, a
length of the graph, a total processing time, a remaining
processing time, and a number of repeating elements.
[0014] In certain implementations, a complex graph includes 10 or
more nodes.
[0015] In some implementations, a simple graph includes 5 or fewer
nodes.
[0016] In certain implementations, the first-local-alignment
algorithm includes a pattern matching algorithm.
[0017] In some implementations, the pattern matching algorithm
includes at least one of the following: a Boyer-Moore algorithm, a
Horspool algorithm, and a Tarhio-Ukkonen algorithm.
[0018] In certain implementations, the second-local-alignment
algorithm includes a linear alignment algorithm.
[0019] In some implementations, each node and edge stores a list of
one or more adjacent objects, and wherein identifying the plurality
of candidate mapping positions comprises finding related alignments
between the sequence read and the reference genome.
[0020] In certain implementations, candidate mapping positions
relate to the genetic information if a predetermined alignment
quality is met between the candidate mapping position and the
sequence read.
[0021] In some implementations, the method further includes ranking
each of the candidate mapping positions based on the quality of the
local alignment.
[0022] In certain implementations, each of the plurality of
candidate mapping positions defines a selected path through the
graph.
[0023] In some implementations, the processor is further operable
to: perform a global alignment, by means of the computer system,
for the sequence read based on whether the global alignment is
advanced or basic. The advanced global alignment includes a
first-global-alignment algorithm, and the basic global alignment
includes a second-global-alignment algorithm. The global alignment
provides information indicative of the plurality of candidate
mapping positions.
[0024] In certain implementations, each node and edge stores a list
of one or more adjacent objects, and wherein identifying the
plurality of candidate mapping positions comprises finding related
alignments between the sequence read and the candidate mapping
position.
[0025] In some implementations, the complexity of the graph varies
with a remaining processing time estimate to reduce a total
processing time.
[0026] In certain implementations, the first-local-alignment
algorithm is different from the second-local-alignment
algorithm.
[0027] In some implementations, the first-global-alignment
algorithm is different from the second-global-alignment
algorithm.
[0028] In certain implementations, at least one of the
first-local-alignment algorithm and the second-local-alignment
algorithm is different from at least one of the
first-global-alignment algorithm and the second-global-alignment
algorithm.
[0029] In some implementations, the mapping further includes
determining whether the local alignment is advanced or basic based
at least in part on at least one of: the complexity of the graph, a
length of the graph, a total processing time, a remaining
processing time, and a number of repeating elements.
[0030] In certain implementations, a complex graph includes 10 or
more nodes.
[0031] In some implementations, a simple graph includes 5 or fewer
nodes.
[0032] In certain implementations, the first-local-alignment
algorithm includes a pattern matching algorithm.
[0033] In some implementations, the pattern matching algorithm
includes at least one of the following: a Boyer-Moore algorithm and
a Tarhio-Ukkonen algorithm.
[0034] In certain implementations, the second-local-alignment
algorithm includes a linear alignment algorithm.
[0035] In some implementations, the candidate mapping positions
relate to the genetic information if a predetermined alignment
quality is met between the candidate mapping position and the
sequence read.
[0036] In certain implementations, the mapping further includes
ranking each of the candidate mapping positions based on the
quality of the local alignment.
[0037] In some implementations, each of the plurality of candidate
mapping positions defines a selected path through the graph.
[0038] Implementations can include one or more of the following
advantages.
[0039] Many of the methods described herein can account for the
complexity of a reference graph by adjusting the alignment methods
based on the complexity. Regions of a reference graph with a large
number of nodes will typically result in an exponential number of
calculations, which leads to excessive cost and processing time
during the alignment process. Other sequence alignment methods,
such as the Graph Smith-Waterman algorithm, have been adapted to
accommodate this complexity and to work with DAGs (e.g., as
disclosed in U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613,
both incorporated by reference); however, these methods have a high
initial computational cost and therefore are not amenable for use
with large data sets. By adjusting the alignment methods based on
the complexity of the reference DAG, an optimal alignment method
can be used. For example, if genomic information is being aligned
with a relatively simple reference graph, a linear alignment method
can be used to save resources (e.g., time and computational costs).
In other cases, if the reference graph is relatively complex, other
sequence alignment methods, such as the modified Graph
Smith-Waterman algorithm may be an alignment tool well-suited to
the task despite the high initial computational costs.
[0040] In some implementations, the predicted alignment difficulty
can vary during the alignment process. The alignment techniques
described herein can continually assess the alignment difficulty
and dynamically adapt the alignment tools in response to any
changes. For example, two or more alignment tools can be used
during the same alignment process.
BRIEF DESCRIPTION OF THE DRAWINGS
[0041] FIG. 1 diagrams a method for aligning sequence reads to a
reference genome.
[0042] FIG. 2 diagrams a method for performing a global alignment
of a sequence read to a reference genome.
[0043] FIG. 3 shows the matrices that represent the comparison.
[0044] FIG. 4 illustrates finding alignments between a sequence and
a reference DAG.
[0045] FIG. 5 shows the use of an adjacency list for each vertex
and edge in a DAG.
[0046] FIG. 6 shows an exemplary DAG and sequence read.
[0047] FIG. 7 shows an exemplary DAG representing a single
nucleotide polymorphism (C or D) and an insertion (EG or EFG).
[0048] FIG. 8 shows an exemplary DAG representing a highly variable
region of the genome that includes information regarding various
structural variations, SNPs, and small insertions and
deletions.
[0049] FIG. 9 illustrates the relationship between processing time
and DAG complexity for a pattern matching algorithm (such as
Tarhio-Ukkonen) and a graph aware alignment algorithm (such as a
Graph Smith-Waterman alignment method).
[0050] FIG. 10 illustrates a system for implementing the methods
diagramed in FIGS. 1 and 2.
DETAILED DESCRIPTION
[0051] In general, the methods and systems described herein relate
to determining a subject's genetic information. Information
representing at least a portion of the subject's DNA (e.g., a
sequence read) is globally aligned with a graph, such as a
reference genomic directed acyclic graph (DAG), to identify
candidate paths or mapping positions through the genomic DAG. The
genomic information can then be compared against the candidate
paths with a higher degree of accuracy as compared to the original
global alignment. This two-step alignment analysis allows for the
selection of an appropriate alignment algorithm based on the
alignment difficulty and an acceptable alignment tolerance, thereby
reaping considerable computational savings resources because
resources are not needlessly expended.
[0052] Additionally, the present disclosure recognizes, in certain
embodiments, that the problem of combinatorial explosion associated
with aligning sequence data to genomic DAGs is solved by an
alignment method that analyzes the complexity of a DAG and chooses
an alignment algorithm based on the complexity. For example,
sequence reads can be quickly aligned to portions of a DAG that
have low complexity using a pattern matching algorithm or a linear
alignment algorithm. However, for complex portions of the DAG, the
number of paths that must be analyzed by a pattern matching or
linear alignment algorithm grows exponentially, leading to a
corresponding exponential decrease in performance of the pattern
matching or linear alignment tool. In these cases, a graph-aware
algorithm may be substituted. The graph-aware algorithm may have a
high initial overhead cost, but its decrease in performance scales
linearly as the complexity of the graph increases, making it a
preferred choice for complex regions of a DAG.
[0053] The following describes methods for indexing collections of
data strings represented in a form of DAGs and performing efficient
string search including exact and approximate matching techniques.
Among other applications, these methods can be used to quickly and
efficiently find genomic data sequences (e.g., reads produced by
DNA sequencing machines) in genome graphs representing genome data
variations.
[0054] FIG. 1 diagrams a method 101 for analyzing genomic sequence
information. At step 102 sequence reads from a sample from a
subject are received. Sequence reads can be received from a nucleic
acid sequencing instrument, such as an Illumina.RTM. HiSeq2500, a
Roche 454 GS FLX+, an Ion Torrent PGM.TM., and the like. A
processor coupled to the tangible memory device is used to globally
align 104 the sequence read against the reference genome to
identify 106 one or more candidate mapping positions. These
candidate mapping positions represent acceptable alignments between
the received sequence reads and the paths through the reference DAG
(e.g., a location within the reference genome that could be related
to the sequence read). In some cases, a report can be provided that
identifies a one or more of the candidate mapping positions.
[0055] Next-generation sequencing and alignment can be
computationally expensive. For example, over 900 million paired-end
100 bp sequence reads are needed to obtain 30.times. coverage of
the human genome. As will be described in further detail below,
certain algorithms, such as various pattern matching and linear
alignment algorithms, scan each sequence read over a reference
sequence to identify an exact, optimal, or best-fit match. But this
approach is typically computationally infeasible given the scales
associated with next-generation sequencing. To solve this problem,
a heuristic approach may be used to reduce the total number of
computations required. For example, one approach is to separate an
alignment into "coarse" and "fine" stages. In the coarse stage, a
fast, yet often incomplete or inexact alignment is performed to
identify a number of candidate regions to which a particular
sequence read aligns. This is referred to as a global alignment or
a global search because the read has aligned considering the
entirety of the reference sequence. In the second, fine stage, each
candidate region is subsequently analyzed using a more
computationally expensive, yet exact algorithm. This is referred to
as a local alignment or local search because the alignment is
restricted to the local space identified during the coarse stage.
In certain examples, this may be referred to as a "seed and extend"
strategy; the global alignment yields a number of seed positions,
which are then extended during the local alignment stage. The
candidate positions may then be scored and ranked, yielding the
best aligned position for each sequence read.
[0056] For example, one approach for performing a global alignment
104 is to place many small sections, or k-mers, of the reference
DAG into a computer hash function. The hash function calculates an
index as a function of the k-mer. The index identifies an entry in
a hash table, and the positions of the k-mer within the reference
DAG are stored in that entry, i.e., in the entry indexed by the
hash of the k-mer. A sequence read is analyzed by hashing some or
all of its k-mers and the corresponding entries of the DAG hash
table are accessed to read positions within the DAG where those
sequence read k-mers can be found. Sections of the DAG where a
threshold number of k-mer positions are found are identified as
candidate regions within the represented genomes for alignment or
mapping of the sequence read. By the described implementation, a
sequence read can be mapped to a large number of "good fit"
positions within a reference genome very rapidly.
[0057] A sequence read can be quickly mapped to a reference using a
global alignment algorithm because hash values can be retrieved
rapidly (i.e., without iterating over entries in an array). Since a
DAG can store a great amount of genetic variation, a sequence read
can be mapped to an appropriate portion of one or more certain
genomes from among numerous candidate genomes very rapidly. Since
sequence reads can be mapped to biologically relevant reference
genomes very rapidly, a patient's genetic information can be
discerned, or genetic sequences can be identified quickly even
against a background of all genetic variety represented in the
system. For example, aligning a patient's sequence reads to a DAG
representing variations in cancer genomes (such as those from The
Cancer Genome Atlas) can be used to quickly identify particular
mutations associated with cancer present in the patient.
[0058] Once a plurality of candidate mapping positions has been
identified 106 by a global searching algorithm 104, a subsection of
the graph at each candidate mapping position is identified. In some
examples, the subsection of the graph includes at least the number
of base pairs in the sequence read for searching (e.g., 50, 100,
200 bp). However, longer subsections can be preferable for local
search. Using longer subsections is particularly advantageous when
using paired-end or mate-pair sequence reads, which are pairs of
sequence reads describing the 5' and 3' ends of a single DNA
molecule, often with an inferred insert size distance of 300-600
(paired-end) or 2,000-5,000 (mate pair) base pairs between the 5'
and 3' sequence reads. Accordingly, the portions or subsections of
the graph for each of the candidate mapping positions can be a
variety of sizes ranging from 50 to even 10,000 base pairs.
[0059] The processor can then perform 108 a local alignment for
each of the candidate mapping positions. With the candidate mapping
positions (e.g., "good fit" positions) identified, a processor can
perform 108 a local alignment at each of the candidate mapping
positions to narrow the candidate mapping positions to identify one
or more mapped positions (e.g., "better fit" or "best fit"
positions). A processor can start 110 the local alignment process
and then determine 112 whether an alignment with the reference DAG
surrounding the candidate mapping positions (e.g., the subset of
the reference DAG surrounding a candidate mapping position) is
advanced or basic. In some examples, an alignment is advanced if
the complexity of the subset of the reference DAG exceeds 5 paths.
In other examples, the alignment is considered basic if the
complexity of the subset of the reference DAG is 5 paths or less.
If the alignment is advanced 114, then the processor can use a
first alignment tool implementing a first method 116. If the
alignment is not advanced (i.e., basic), then the processor can use
a second alignment tool implementing a second method 118. The
processor can repeat 120 this process until all candidate mapping
positions are locally aligned. The processor can then rank 122 the
candidate mapping positions to identify the most likely alignment
positions corresponding to the received sequence read before ending
124 the local alignment process. Based on these rankings, the
processor can identify 126 the mapped position within an acceptable
margin of error.
[0060] While the method 101 comprises performing 104 a global
alignment against a reference DAG, in certain embodiments, a global
alignment may be simply performed against a linear reference
sequence. This can be advantageous in certain embodiments where
using a linear reference sequence for the global alignment can help
improve performance. However, the local alignment 108 may still
consider a reference DAG structure. Various embodiments are
considered to be within the scope of the disclosure.
[0061] In certain embodiments, a global alignment may also consider
whether an alignment with the DAG is advanced or basic and select
an appropriate algorithm accordingly. FIG. 2 illustrates an
exemplary global alignment process 201 for identifying candidate
mapping positions for a sequence read. As with the local alignment
process 108, the global alignment process 201 can evaluate the
reference DAG to determine if the global alignment is advanced or
basic. For example, this may be performed in situations wherein the
reference DAG is small enough such that pattern matching and linear
alignment algorithms may be used for the global alignment stage.
For example, a processor can start 202 the local alignment process
and then determine 204 whether an alignment with the reference DAG
and the sequence read is advanced or basic. In some examples, an
alignment is advanced if the complexity of the DAG exceeds 6 paths,
as discussed below. In other examples, the alignment is basic if
the complexity of the DAG is 6 paths or less. If the alignment is
advanced 206, then the processor can use a first alignment tool
implementing a first method 208. If the alignment is not advanced
(i.e., basic), then the processor can use a second alignment tool
implementing a second method 210. The first method 208 and the
second method 210 may be the same, different, or partially overlap.
In some examples, the first method 208 and/or the second method 210
can correspond to at least one of the first method 116 or the
second method 118. Using these alignment tools, the processor can
identify 212 the candidate mapping positions. The processor can
then repeat this process until the global alignment is complete
214. The processor can then end 216 the global alignment process
201.
[0062] If the alignment is basic 118, the second alignment tool may
comprise linear alignment methods, such as optimal alignment or
pattern matching algorithms. For example, the second alignment tool
may comprise an optimal alignment algorithm, such as a
Smith-Waterman algorithm, a Needleman-Wunsch algorithm, and the
like. These algorithms use a dynamic programming approach in order
to find an optimal local alignment between two sequences, but are
computationally expensive. To improve performance, pattern matching
or string matching algorithms, such as a Boyer-Moore algorithm, a
Horspool algorithm, and a Tarhio-Ukkonen algorithm may be used.
These algorithms are fast, but typically do not handle variations
(such as indels and SNPs) as well as an optimal alignment
algorithm. However, pattern matching and string matching algorithms
are particularly well suited for aligning sequence reads against
graph genomes, such as DAGs. As known variations are already
present in the graph, the pattern matching algorithm does not need
to consider the presence of indels or other variations within a
sequence read; it need only map the sequence read to the most
likely branch within the DAG having that variation.
[0063] While pattern matching and linear alignment algorithms can
function as a fast local alignment tool, each of these methods is
easily susceptible to efficiency loss when faced with a high number
of comparisons when, for example, a linearized graph genome
represents a significant number of combinations. Accordingly, if
the alignment is advanced 114, the first alignment tool may
comprise a graph-aware or multi-dimensional algorithm, such as
Graph Smith-Waterman, as explained in further detail below.
[0064] Optimal Alignment Algorithms
[0065] Pairwise alignment generally involves placing one sequence
along part of target, introducing gaps according to an algorithm,
scoring how well the two sequences match, and preferably repeating
for various positions along the reference. The best-scoring match
is deemed to be the alignment and represents an inference of
homology between alignment portions of the sequences. In some
embodiments, scoring an alignment of a pair of nucleic acid
sequences involves setting values for the probabilities of
substitutions and indels. When individual bases are aligned, a
match or mismatch contributes to the alignment score by a
substitution score, which could be, for example, 1 for a match and
-0.33 for a mismatch. An indel deducts from an alignment score by a
gap penalty, which could be, for example, -1. Gap penalties and
substitution probabilities can be based on empirical knowledge or a
priori assumptions about how sequences evolve. Their values affect
the resulting alignment. Particularly, the relationship between the
gap penalties and substitution probabilities influences whether
substitutions or indels will be favored in the resulting
alignment.
[0066] Stated formally, an alignment represents an inferred
relationship between two sequences, x and y. For example, in some
embodiments, an alignment A of sequences x and y maps x and y
respectively to another two strings x' and y' that may contain
spaces such that: (i) |x'|=|y'|; (ii) removing spaces from x' and
y' should get back x and y, respectively; and (iii) for any i,
x'[i] and y'[i] cannot be both spaces.
[0067] A gap is a maximal substring of contiguous spaces in either
x' or y'. An alignment A can include the following three kinds of
regions: (i) matched pair (e.g., x'[i]=y'[i]; (ii) mismatched pair,
(e.g., x'[i]--y'[i] and both are not spaces); or (iii) gap (e.g.,
either x'[i . . . j] or y'[i . . . j] is a gap). In certain
embodiments, only a matched pair has a high positive score a. In
some embodiments, a mismatched pair generally has a negative score
b and a gap of length r also has a negative score g+rs where g,
s<0. For DNA, one common scoring scheme (e.g. used by BLAST)
makes score a=1, score b=-3, g=-5 and s=-2. The score of the
alignment A is the sum of the scores for all matched pairs,
mismatched pairs and gaps. The alignment score of x and y can be
defined as the maximum score among all possible alignments of x and
y.
[0068] Any pair may have a score a defined by a 4.times.4 matrix B
of substitution probabilities. For example, B(i,i)=1 and
0<B(i,j)<1 [for i.noteq.j], is one possible scoring system.
For instance, where a transition is thought to be more biologically
probable than a transversion, matrix B could include B(C,T)=0.7 and
B(A,T)=0.3, or other values desired or determined by methods known
in the art.
[0069] A pairwise alignment, generally, involves--for sequence Q
(query) having m characters and a reference genome T (target) of n
characters--finding and evaluating possible local alignments
between Q and T. For any 1.ltoreq.i.ltoreq.n and
1.ltoreq.j.ltoreq.m, the largest possible alignment score of T[h .
. . i] and Q[k . . . j], where h.ltoreq.i and k.ltoreq.j, is
computed (i.e. the best alignment score of any substring of T
ending at position i and any substring of Q ending at position j).
This can include examining all substrings with cm characters, where
c is a constant depending on a similarity model, and aligning each
substring separately with Q. Each alignment is scored, and the
alignment with the preferred score is accepted as the alignment.
One of skill in the art will appreciate that there are exact and
approximate algorithms for sequence alignment. Exact algorithms
will find the highest scoring alignment, but can be computationally
expensive. Two well-known exact algorithms are Needleman-Wunsch (J
Mol Biol, 48(3):443-453, 1970) and Smith-Waterman (J Mol Biol,
147(1):195-197, 1981; Adv. in Math. 20(3), 367-387, 1976). A
further improvement to Smith-Waterman by Gotoh (J Mol Biol, 162(3),
705-708, 1982) reduces the calculation time from O(m.sup.2n) to
O(mn) where m and n are the sequence sizes being compared and is
more amendable to parallel processing. In the field of
bioinformatics, it is Gotoh's modified algorithm that is often
referred to as the Smith-Waterman algorithm. Smith-Waterman
approaches are being used to align larger sequence sets against
larger reference sequences as parallel computing resources become
more widely and cheaply available. See, e.g., Amazon's cloud
computing resources. All of the journal articles referenced herein
are incorporated by reference in their entireties.
[0070] The original Smith-Waterman (SW) algorithm is a linear
alignment algorithm that aligns linear sequences by rewarding
overlap between bases in the sequences, and penalizing gaps between
the sequences. Smith-Waterman also differs from Needleman-Wunsch,
in that SW does not require the shorter sequence to span the string
of letters describing the longer sequence. That is, SW does not
assume that one sequence is a read of the entirety of the other
sequence. Furthermore, because SW is not obligated to find an
alignment that stretches across the entire length of the strings, a
local alignment can begin and end anywhere within the two
sequences.
[0071] The original SW algorithm is expressed for an n.times.m
matrix H, representing the two strings of length n and m, in terms
of equation (1):
H_k0=H_01=0 (for 0.ltoreq.k.ltoreq.n and 0.ltoreq.l.ltoreq.m)
H_ij=max{H_(i-1,j-1)+s(a_i,b_j),H_(i-1,j)-W_in,H_(i,j-1)-W_del,0}
(1)
[0072] (for 1.ltoreq.i.ltoreq.n and 1.ltoreq.j.ltoreq.m)
[0073] In the equations above, s(ai,bj) represents either a match
bonus (when ai=bj) or a mismatch penalty (when ai.noteq.bj), and
insertions and deletions are given the penalties Win and Wdel,
respectively. In most instances, the resulting matrix has many
elements that are zero. This representation makes it easier to
backtrace from high-to-low, right-to-left in the matrix, thus
identifying the alignment.
[0074] Once the matrix has been fully populated with scores, the SW
algorithm performs a backtrack to determine the alignment. Starting
with the maximum value in the matrix, the algorithm will backtrack
based on which of the three values (Hi-1,j-1, Hi-1,j, or Hi,j-1)
was used to compute the final maximum value for each cell. The
backtracking stops when a zero is reached. The optimal-scoring
alignment may contain greater than the minimum possible number of
insertions and deletions, while containing far fewer than the
maximum possible number of substitutions.
[0075] SW or SW-Gotoh may be implemented using dynamic programming
to perform local sequence alignment of the two strings, S and A, of
sizes m and n, respectively. This dynamic programming employs
tables or matrices to preserve match scores and avoid
re-computation for successive cells. Each element of the string can
be indexed with respect to a letter of the sequence, that is, if S
is the string ATCGAA, S[1]=A.
[0076] Instead of representing the optimum alignment as Hi,j
(above), the optimum alignment can be represented as B[j,k] in
equation (2) below:
B[j,k]=max(p[j,k],i[j,k],d[j,k],0) (for
0<j.ltoreq.m,0<k.ltoreq.n) (2)
The arguments of the maximum function, B[j,k], are outlined in
equations (3)-(5) below, wherein MISMATCH_PEN, MATCH_BONUS,
INSERTION_PEN, DELETION_PEN, and OPENING_PEN are all constants, and
all negative except for MATCH_BONUS (PEN is short for PENALTY). The
match argument, p[j,k], is given by equation (3), below:
p[j,k]=max(p[j-1,k-1],i[j-1,k-1],d[j-1,k-1])+MISMATCH_PEN, if
S[j]A[k]=max(p[j-1,k-1],i[j-1,k-1],d[j-1,k-1])+MATCH_BONUS, if
S[j]=A[k] (3)
[0077] the insertion argument i[j,k], is given by equation (4),
below:
i[j,k]=max(p[j-1,k]+OPENING_PEN,i[j-1,k],d[j-1,k]+OPENING_PEN)+INSERTION-
_PEN (4)
[0078] and the deletion argument d[j,k], is given by equation (5),
below:
d[j,k]=max(p[j,k-1]+OPENING_PEN,i[j,k-1]+OPENING_PEN,d[j,k-1])+DELETION_-
PEN (5)
[0079] For all three arguments, the [0,0] element is set to zero to
assure that the backtrack goes to completion, i.e.,
p[0,0]=i[0,0]=d[0,0]=0.
[0080] The scoring parameters are somewhat arbitrary, and can be
adjusted to achieve the behavior of the computations. One example
of the scoring parameter settings (Huang, Chapter 3: Bio-Sequence
Comparison and Alignment, ser. Curr Top Comp Mol Biol. Cambridge,
Mass.: The MIT Press, 2002) for DNA would be:
[0081] MATCH_BONUS: 10
[0082] MISMATCH_PEN: -20
[0083] INSERTION_PEN: -40
[0084] OPENING_PEN: -10
[0085] DELETION_PEN: -5
[0086] The relationship between the gap penalties (INSERTION_PEN,
OPENING_PEN) above help limit the number of gap openings, i.e.,
favor grouping gaps together, by setting the gap insertion penalty
higher than the gap opening cost. Of course, alternative
relationships between MISMATCH_PEN, MATCH_BONUS, INSERTION_PEN,
OPENING_PEN and DELETION_PEN are possible.
[0087] Smith-Waterman and Needleman-Wunsch are exact alignment
algorithms and identify optimal alignments, including those with
short insertions and deletions. However, it can be as
computationally expensive as naive pattern matching algorithms, and
has a processing time typically on the order of O(mn). As a result,
the use of the Smith-Waterman algorithm is primarily reserved for
local alignment. However, when applied to a multi-dimensional data
structure such as a reference DAG, these algorithms can take an
excessive amount of time to complete. In these cases, more
efficient pattern matching algorithms may be substituted.
[0088] Pattern Matching Algorithms
[0089] Pattern matching algorithms use mathematical techniques to
reduce the total number of pairwise comparisons in order to improve
efficiency. In a short read alignment, a short read P having n
nucleotides is aligned with the left-most portion of a reference
sequence T, creating a window of length n. After a match or
mismatch determination, the window is shifted to the right for
additional comparisons. Pattern matching algorithms increase local
alignment efficiency by shifting the window farther to the right
that a single character, i.e., by skipping a certain number of
comparisons which would result in a mismatch. In particular, some
pattern matching algorithms process a sequence read, or pattern, in
order to create a "shift table" that indicates how many comparisons
can safely be skipped depending on the mapping of the short read at
the current position.
[0090] To create a shift table, many pattern matching algorithms
operate in two phases: a pre-processing phase and a search phase.
The pre-processing phase typically processes the pattern itself to
generate information that can aid the search phase by providing a
table of shift values after each comparison. Improvement in
efficiency of a search can be achieved by altering the order in
which the characters are compared at each attempt, and by choosing
a shift value that permits the skipping of a predefined number of
characters in the text after each attempt. The search phase then
uses this information to ignore certain character comparisons,
reducing the total number of comparisons and thus the overall
execution time.
[0091] The Boyer-Moore algorithm is one of the most efficient. See
Boyer, R. S., Moore, J. S., "A Fast String Searching Algorithm",
Comm. ACM 20(10), 762-772. Boyer-Moore preprocesses a pattern to
create two tables: a Boyer-Moore bad character (bmBc) table and a
Boyer-Moore good suffix (bmGs) table. For each character in the
alphabet (e.g., "A", "C", G'', and "T", representing the four
possible nucleotides), the bad character table stores a shift value
based on the occurrence of the character in the pattern (i.e.,
short read). The good-suffix table stores the matching shift value
for each character in the read. The maximum of the shift value in
either table is considered after each comparison during a search,
which is then used to advance the window. Boyer-Moore does not
allow any mismatches between the read and the reference sequence,
and thus is a perfect-match search algorithm.
[0092] Boyer-Moore serves as the basis for several pattern matching
algorithms that can be used for sequence alignment. One variation
of Boyer-Moore is the Horspool algorithm. See Horspool, R. N.,
"Practical Fast Searching in Strings", Software--Practice &
Experience, 10, 501-506. Horspool simplifies Boyer-Moore by
recognizing that the bad-character shift of the right-most
character of the window can be sufficient to compute the value of
the shift. These shift values are then determined in the
pre-processing stage for all of the characters in the alphabet,
i.e., the possible nucleotides. Horspool is particularly efficient
in practical situations when the length of the pattern is small,
such as for sequence reads generated by next-generation sequencing.
Like Boyer-Moore, Horspool is a perfect-match search algorithm.
[0093] Another variation of Boyer-Moore that allows for a certain
number of mismatches is the Tarhio-Ukkonen algorithm. See Tarhio,
J., Ukkonen, E., "Approximate Boyer-Moore String Matching", SIAM J.
Comput., 22, 243-260. Because it allows for mismatches,
Tarhio-Ukkonen can be used to provide approximate alignments for
next generation sequencing reads to a reference sequence. During a
pre-processing phase, Tarhio-Ukkonen calculates shift values for
every character that take into account a desired allowable number
of mismatches, k. Tarhio-Ukkonen is able to identify matches in
expected time
( kn ( 1 m - k + k c ) ) , ##EQU00001##
where c is the size of the alphabet. The use of approximation in
pattern matching algorithms such as Tarhio-Ukkonen is particularly
useful for next generation sequencing alignments due to inherent
noise present in sequence reads generated by most next generation
sequencing technologies. Furthermore, because the primary goal of
most genomic sequence analyses is to identify unknown variations
(primarily, SNPs) that may be present in a sample, approximate
pattern matching algorithms help to identify novel polymorphisms in
sequence data. In contrast, the Boyer-Moore and Horspool algorithms
would reject such reads if there were any differences between the
read and the reference.
[0094] When used for next generation sequencing alignment, pattern
matching algorithms can be used as a fast local alignment tool. In
particular, pattern matching algorithms typically have performance
that exceeds that of traditional local alignment algorithms for
sequence data, such as Needleman-Wunsch and Smith-Waterman.
However, when applied to a graph genome, combinatorial complexity
can quickly increase the number of comparisons to a point in which
improvements in efficiency are lost. In other words, for a local
alignment to a subset of a reference DAG, the expected time for the
Boyer-Moore family is multiplied by the number of linearized paths.
Additionally, despite variants that are able to address mismatches
and small insertions and deletions, the Boyer-Moore family of
algorithms do not perform particularly well when handling
alignments involving novel variations (that are not already present
in the graph). Exact alignment algorithms, such as Needleman-Wunsch
and Smith-Waterman are able to accurately identify variations;
however, these algorithms run in O(mn) time and similarly suffer
from combinatorial explosion when applied to a DAG.
[0095] One solution to this problem is to use a multi-dimensional
or graph-aware algorithm, such as if the alignment is advanced. One
example of a graph-aware alignment algorithm is a modified
Smith-Waterman algorithm that considers the highest scoring paths
through a DAG ("Graph Smith Waterman"). Like Smith-Waterman, Graph
Smith-Waterman is an exact alignment algorithm and thus is able to
identify optimal local alignments that include insertions and
deletions. However, when applied to a linear reference or a DAG
having low complexity, Graph Smith-Waterman can be slow relative to
the alignment speed of a pattern matching algorithm.
[0096] Graph-Aware Alignment Algorithms
[0097] Certain alignment algorithms have been developed that are
able to provide accurate and relatively fast local alignments when
used with DAGs or other graph data structures. For example, a Graph
Smith-Waterman algorithm as described below calculates a
Smith-Waterman score matrix between the sequence read and each node
in a DAG, such that the maximum scores are propagated across the
matrices for each node. In this way, the linear portions of the DAG
with the highest scores in relation to the sequence read are
serialized and chained together. One need only look back through
the chained Smith-Waterman matrices to identify both the optimal
alignment and path of the DAG.
[0098] In some embodiments, the methods and systems of the
invention use a modified Smith-Waterman operation that involves a
multi-dimensional look-back through the reference DAG (such as the
reference DAG 331 of FIG. 3). Multi-dimensional operations of the
invention provide for a "look-back" type analysis of sequence
information (as in Smith-Waterman), wherein the look back is
conducted through a multi-dimensional space that includes multiple
pathways and multiple nodes. The multi-dimensional algorithm can be
used to align sequence reads against the graph-type reference. That
alignment algorithm identifies the maximum value for Ci,j by
identifying the maximum score with respect to each sequence
contained at a position on the graph. In fact, by looking
"backwards" at the preceding positions, it is possible to identify
the optimum alignment across a plurality of possible paths.
[0099] The modified Smith-Waterman operation described here, aka
the multi-dimensional alignment, provides exceptional speed when
performed in a genomic graph system that employs physical memory
addressing (e.g., through the use of native pointers or index free
adjacency as discussed above). The combination of multi-dimensional
alignment to the graph of reference DAG 331 with the use of spatial
memory addresses (e.g., native pointers or index-free adjacency)
improves what the computer system is capable of, facilitating whole
genomic scale analysis to be performed using the methods described
herein.
[0100] The operation includes aligning a sequence, or string, to a
graph. For the purpose of defining the algorithm, let S be the
string being aligned, and let D be the directed graph to which S is
being aligned. The elements of the string, S, are bracketed with
indices beginning at 1. Thus, if S is the string ATCGAA, S[1]=A,
S[4]=G, etc.
[0101] In certain embodiments, for the graph, each letter of the
sequence of a node will be represented as a separate element, d. In
a preferred embodiment, node or edge objects contain the sequences
and the sequences are stored as the longest-possible string in each
object. A predecessor of d is defined as:
[0102] (i) If d is not the first letter of the sequence of its
object, the letter preceding d in its object is its (only)
predecessor;
[0103] (ii) If d is the first letter of the sequence of its object,
the last letter of the sequence of any object that is a parent of
d's object is a predecessor of d.
[0104] The set of all predecessors is, in turn, represented as
P[d].
[0105] In order to find the "best" alignment, the algorithm seeks
the value of M[j,d], the score of the optimal alignment of the
first j elements of S with the portion of the graph preceding (and
including) d. This step is similar to finding Hi,j in equation 1
above. Specifically, determining M[j,d] involves finding the
maximum of a, i, e, and 0, as defined below:
M[j,d]=max{a,i,e,0} (6)
where
[0106] e=max{M[j, p*]+DELETE_PEN} for p* in P[d]
[0107] i=M[j-1, d]+INSERT_PEN
[0108] a=max{M[j-1, p*]+MATCH_SCORE} for p* in P[d], if S[j]=d;
[0109] max{M[j-1, p*]+MISMATCH_PEN} for p* in P[d], if
S[j].noteq.d
[0110] As described above, e is the highest of the alignments of
the first j characters of S with the portions of the graph up to,
but not including, d, plus an additional DELETE_PEN. Accordingly,
if d is not the first letter of the sequence of the object, then
there is only one predecessor, p, and the alignment score of the
first j characters of S with the graph (up-to-and-including p) is
equivalent to M[j,p]+DELETE_PEN. In the instance where d is the
first letter of the sequence of its object, there can be multiple
possible predecessors, and because the DELETE_PEN is constant,
maximizing [M[j, p*]+DELETE_PEN] is the same as choosing the
predecessor with the highest alignment score with the first j
characters of S.
[0111] In equation (6), i is the alignment of the first j-1
characters of the string S with the graph up-to-and-including d,
plus an INSERT_PEN, which is similar to the definition of the
insertion argument in SW (see equation 1).
[0112] Additionally, a is the highest of the alignments of the
first j characters of S with the portions of the graph up to, but
not including d, plus either a MATCH_SCORE (if the jth character of
S is the same as the character d) or a MISMATCH_PEN (if the jth
character of S is not the same as the character d). As with e, this
means that if d is not the first letter of the sequence of its
object, then there is only one predecessor, i.e., p. That means a
is the alignment score of the first j-1 characters of S with the
graph (up-to-and-including p), i.e., M[j-1,p], with either a
MISMATCH_PEN or MATCH_SCORE added, depending upon whether d and the
jth character of S match. In the instance where d is the first
letter of the sequence of its object, there can be multiple
possible predecessors. In this case, maximizing
{M[j,p*]+MISMATCH_PEN or MATCH_SCORE} is the same as choosing the
predecessor with the highest alignment score with the first j-1
characters of S (i.e., the highest of the candidate M[j-1,p*]
arguments) and adding either a MISMATCH_PEN or a MATCH_SCORE
depending on whether d and the jth character of S match.
[0113] Again, as in the SW algorithm, the penalties, e.g.,
DELETE_PEN, INSERT_PEN, MATCH_SCORE and MISMATCH_PEN, can be
adjusted to encourage alignment with fewer gaps, etc.
[0114] As described in the equations above, the operation finds the
optimal (e.g., maximum) value for a sequence read 103 to the
reference DAG 331 by calculating not only the insertion, deletion,
and match scores for that element, but looking backward (against
the direction of the graph) to any prior nodes on the graph to find
a maximum score.
[0115] FIG. 3 shows matrices 301 that represent the comparison. The
modified Smith-Waterman operation of the invention identifies the
highest score and performs a backtrack to identify the proper
alignment of the sequence. See, e.g., U.S. Pub. 2015/0057946 and
U.S. Pub. 2015/0056613, both incorporated by reference. Systems and
methods of the invention can be used to provide a report that
identifies a modified base at the position within the genome of the
subject. Other information may be found in Kehr et al., 2014,
Genome alignment with graph data structures: a comparison, BMC
Bioinformatics 15:99 and Lee, 2003, Generating consensus sequences
from partial order multiple sequence alignment graphs,
Bioinformatics 19(8):999-1008, both incorporated by reference.
Thus, identifying an alignment for the genomic information from the
sample (e.g., the sequence read 103 of the embodiment of FIG. 4)
may include aligning 129 (e.g., using the global alignment 106 or
the local alignment 108) one or more of the sequence reads 103 to
the reference DAG 331, as shown in FIG. 4. The branches to which
the sequence reads 103 have been aligned are identified, along with
the corresponding regions within the reference genome.
[0116] Graph Smith-Waterman is able to achieve similar
computational speed as Smith-Waterman, yet does not suffer from the
combinatorial explosion associated with linearizing complex regions
of DAGs. In contrast, the performance of Graph Smith-Waterman
scales linearly as the complexity of the DAG increases.
Additionally, like Smith-Waterman, Graph Smith-Waterman is an exact
alignment algorithm and thus is able to identify optimal local
alignments that include insertions and deletions. However, when
applied to a linear reference sequence or a DAG having low
complexity, Graph Smith-Waterman can be slow compared to pattern
matching algorithms.
[0117] FIG. 4 illustrates finding 129 alignments 401 between the
sequence 103 (or a consensus sequence representing an assembly or
one or more sequence reads 103) and the reference DAG 331.
Alignments can be performed directly against the DAG using the
Graph Smith-Waterman algorithm, for example; however, as will be
discussed below, alignments may also be performed against the DAG
using linear alignment or pattern matching algorithms by first
linearizing the DAG.
[0118] FIG. 4 also illustrates an optional variant calling 807 step
to identify a subject genotype 831. Using alignment operations of
the invention, reads can be rapidly mapped to the reference DAG 331
despite their large numbers or short lengths. Numerous benefits
obtain by using a graph as a reference. For example, aligning
against a graph is more accurate than aligning against a linear
reference and then attempting to adjust one's results in light of
other extrinsic information. This is primarily because the latter
approach enforces an unnatural asymmetry between the sequence used
in the initial alignment and other information. Aligning against an
object that potentially represents all the relevant physical
possibilities is much more computationally efficient than
attempting to align against a linear sequence for each physical
possibility (the number of such possibilities will generally be
exponential in the number of junctions).
[0119] Many implementations provide methods that rely on a directed
acyclic data structure with the data in an annotated transcriptome.
In such a data structure, features (e.g., exons and introns) from
the annotated transcriptome are represented as nodes, which are
connected by edges. An isoform from the real-world transcriptome is
thus represented by a corresponding path through the DAG-based data
structure.
[0120] Aspects of the described methods relate to the creation or
use of a DAG that includes features such as introns and exons from
one or more known references. A DAG is understood in the art to
refer to data that can be presented as a graph as well as to a
graph that presents that data. The DAG can be stored as data that
can be read by a computer system for bioinformatic processing or
for presentation as a graph. A DAG can be saved in any suitable
format including, for example, a list of nodes and edges, a matrix
or a table representing a matrix, an array of arrays or similar
variable structure representing a matrix, in a language built with
syntax for graphs, in a general markup language purposed for a
graph, or others. Further, while the DAG is a directed acyclic data
structure, in certain embodiments, a reference graph may lack
direction or not be acyclic. Various embodiments are considered to
be within the scope of the disclosure.
[0121] DAG Representation and Storage
[0122] FIG. 5 shows the use of an adjacency list 502 for each
vertex 505 and edge 509. A computer system 901 (shown in FIG. 10)
creates the reference DAG 531 using an adjacency list 502 for each
vertex and edge, wherein the adjacency list 502 for a vertex 505 or
edge 509 lists the edges or vertices to which that vertex or edge
is adjacent. Each entry in adjacency list 502 is a pointer to the
adjacent vertex or edge.
[0123] Preferably, each pointer identifies a physical location in
the memory subsystem at which the adjacent object is stored. In the
preferred embodiments, the pointer or native pointer is
manipulatable as a memory address in that it points to a physical
location on the memory and permits access to the intended data by
means of pointer dereference. That is, a pointer is a reference to
a datum stored somewhere in memory; to obtain that datum is to
dereference the pointer. The feature that separates pointers from
other kinds of reference is that a pointer's value is interpreted
as a memory address, at a low-level or hardware level. The speed
and efficiency of the described graph genome engine allows a
sequence to be queried against a large-scale genomic reference
graph (e.g., the reference DAG 531) representing millions or
billions of bases, using the computer system 901. Such a graph
representation provides means for fast random access, modification,
and data retrieval.
[0124] In some embodiments, fast random access is supported and
graph object storage are implemented with index-free adjacency in
that every element contains a direct pointer to its adjacent
elements (e.g., as described in U.S. Pub. 2014/0280360 and U.S.
Pub. 2014/0278590, each incorporated by reference), which obviates
the need for index look-ups, allowing traversals (e.g., as done in
the modified SW alignment operation described herein) to be very
rapid. Index-free adjacency is another example of low-level, or
hardware-level, memory referencing for data retrieval (as required
in alignment and as particularly pays off in terms of speed gains
in the modified, multi-dimensional Smith-Waterman alignment
described below). Specifically, index-free adjacency can be
implemented such that the pointers contained within elements are
references to a physical location in memory.
[0125] Since a technological implementation that uses physical
memory addressing such as native pointers can access and use data
in such a lightweight fashion without the requirement of separate
index tables or other intervening lookup steps, the capabilities of
a given computer, e.g., any modern consumer-grade desktop computer,
are extended to allow for full operation of a genomic-scale graph
(i.e., the reference DAG 531 as a graph that represents all loci in
a substantial portion of the reference genome). Thus storing graph
elements (e.g., nodes and edges) using a library of objects with
native pointers or other implementation that provides index-free
adjacency actually improves the ability of the technology to
provide storage, retrieval, and alignment for genomic information
since it uses the physical memory of a computer in a particular
way.
[0126] While no specific format is required for storage of a graph,
FIG. 5 is presented to illustrate a useful format. With reference
back to FIG. 1, it is noted that methods of the invention use the
stored graph (e.g., the reference genome) with sequence reads that
are obtained from a subject. In some embodiments, sequence reads or
obtained as an electronic article, e.g., uploaded, emailed, or FTP
transferred from a lab to the computer system 901. In certain
embodiments, sequence reads are obtained by sequencing.
[0127] In some embodiments, a DAG is stored as a list of nodes and
edges. One such way is to create a text file that includes all
nodes, with an ID assigned to each node, and all edges, each with
the node ID of starting and ending node. Thus, for example, were a
DAG to be created for two sentences, "See Jane run," and "Run, Jane
run,", a case-insensitive text file could be created. Any suitable
format could be used. For example, the text file could include
comma-separated values. Naming this DAG "Jane" for future
reference, in this format, the DAG "Jane" may read as follows: 1
see, 2 run, 3 jane, 4 run, 1-3, 2-3, 3-4. One of skill in the art
will appreciate that this structure is applicable to the introns
and exons represented in FIGS. 7 and 8, and the accompanying
discussion below.
[0128] In certain embodiments, a DAG is stored as a table
representing a matrix (or an array of arrays or similar variable
structure representing a matrix) in which the (i,j) entry in the
N.times.N matrix denotes that node i and node j are connected
(where N is a vector containing the nodes in genomic order). For
the DAG to be acyclic simply requires that all non-zero entries be
above the diagonal (assuming nodes are represented in genomic
order). In a binary case, a 0 entry represents that no edge is
exists from node i to node j, and a 1 entry represents an edge from
i to j. One of skill in the art will appreciate that a matrix
structure allows values other than 0 to 1 to be associated with an
edge. For example, any entry may be a numerical value indicating a
weight, or a number of times used, reflecting some natural quality
of observed data in the world. A matrix can be written to a text
file as a table or a linear series of rows (e.g., row 1 first,
followed by a separator, etc.), thus providing a simple
serialization structure.
[0129] Embodiments of the invention include storing a DAG in a
language built with syntax for graphs. For example, The DOT
Language provided with the graph visualization software packages
known as Graphviz provides a data structure that can be used to
store a DAG with auxiliary information and that can be converted
into graphic file formats using a number of tools available from
the Graphviz web site. Graphviz is open source graph visualization
software. Graph visualization is a way of representing structural
information as diagrams of abstract graphs and networks. It has
applications in networking, bioinformatics, software engineering,
database and web design, machine learning, and in visual interfaces
for other technical domains. The Graphviz layout programs take
descriptions of graphs in a simple text language, and make diagrams
in useful formats, such as images and SVG for web pages; PDF or
Postscript for inclusion in other documents; or display in an
interactive graph browser.
[0130] In related embodiments, a DAG is stored in a general markup
language purposed for a graph, or others. Following the
descriptions of a linear text file, or a comma-separated matrix,
above, one of skill in the art will recognize that a language such
as XML can be used (extended) to create labels (markup) defining
nodes and their headers or IDs, edges, weights, etc. However a DAG
is structured and stored, embodiments of the invention involve
using nodes to represent features such as exons and introns. This
provides a useful tool for analyzing sequence reads and
discovering, identifying, and representing isoforms.
[0131] If nodes represent features or fragments of features, then
isoforms can be represented by paths through those fragments. An
exon's being "skipped" is represented by an edge connecting some
previous exon to some later one. Presented herein are techniques
for constructing a DAG to represent alternative splicing or
isoforms of genes. Alternative splicing is discussed in Lee and
Wang, 2005, Bioinformatics analysis of alternative splicing, Brief
Bioinf 6(1):23-33; Heber, et al., 2002, Splicing graphs and EST
assembly problems, Bioinformatics 18Suppl:s181-188; Leipzig, et
al., 2004, The alternative splicing gallery (ASG): Bridging the gap
between genome and transcriptome, Nucl Ac Res 23(13):3977-2983; and
LeGault and Dewey, 2013, Inference of alternative splicing from
RNA-Seq data with probabilistic splice graphs, Bioinformatics
29(18):2300-2310, the contents of each of which are incorporated by
reference. Additional discussion may be found in Florea, et al.,
2005, Gene and alternative splicing annotation with AIR, Genome
Research 15:54-66; Kim, et al., 2005, ECgene: Genome-based EST
clustering and gene modeling for alternative splicing, Genome
Research 15:566-576; and Xing, et al., 2006, An
expectation-maximization algorithm for probabilistic
reconstructions of full-length isoforms from splice graphs, Nucleic
Acids Research, 34, 3150-3160, the contents of each of which are
incorporated by reference.
[0132] Linearizing a DAG for a Linear Alignment Tool
[0133] As previously noted, a DAG is a multi-dimensional structure
and not immediately amenable to alignment with a linear alignment
tool. To use an optimal alignment algorithm or pattern matching
algorithm, a reference DAG can be linearized into a format
appropriate for the algorithm. In certain embodiments, this can be
performed by enumerating each of the possible paths in the
subsection of a reference DAG. FIG. 6 illustrates an exemplary DAG
601 representing a subsection of the reference DAG for a candidate
mapping position (e.g., a local alignment position). To use a
linear alignment algorithm with such a DAG, a depth-first search
can be performed by starting at the first node and taking the
left-most path. Once the end of the graph is reached, the search
returns to the previous node with an untaken path and attempts to
take the left-most path again. Each path is associated with a
sequence represented by the nodes and edges in the path. The
sequence read can then be compared against each of the paths. For
example, a depth-first search would identify the following paths
and associated sequences from the subsection of the DAG 661 shown
in the embodiment of FIG. 6:
TABLE-US-00001 (SEQ ID NO: 1) 1. ATCGACGGCGTTTGCAT (SEQ ID NO: 2)
2. ATCGACGGCGTTTAGCAT (insertion of A at pos. 13) (SEQ ID NO: 3) 3.
ATCGACGACGTTTGCAT (G->A polymorphism at pos. 8) (SEQ ID NO: 4)
4. ATCGACGACGTTTAGCAT (G->A polymorphism at pos. 8 + insertion
of A at pos. 13)
To perform a local alignment of the sequence read GTTTAG against
this subsection of the reference DAG, these four sequences could be
provided as individual reference sequences to a linear alignment or
pattern matching algorithm. The linear alignment or pattern
matching algorithm would subsequently identify a best-fit alignment
for the sequence read with respect to the four sequences. In this
way, systems and methods according to the disclosure may enumerate
the possible paths for a reference DAG or subsection of a reference
DAG, identify a sequence associated with each possible path, and
provide each sequence to an alignment tool (such as those disclosed
herein) for alignment with a sequence read.
[0134] In another embodiment, the subsection of the reference DAG
could be linearized by walking the graph using a sliding window of
n bases (wherein n is the length of the sequence read) and
performing a comparison of the sequence read to the reference
sequence in the window. (Of course, in certain embodiments, the
size of the window can be increased to accommodate the possibility
of small insertions and deletions in the sequence read.) The walk
can similarly be performed using a depth-first search. For example,
for the sequence read GTTTAG and subsection of the reference DAG
601 in the embodiment of FIG. 6, a local alignment could proceed as
shown in Table 1.
TABLE-US-00002 TABLE 1 Depth-First Search Results using a sliding
window of n bases and performing a comparison of the sequence read
to the reference sequence in the Window. # Path Sequences Read
Comparison Note 1 ATCGAC GTTTAG No Match Start of Graph 2 TCGACG
GTTTAG No Match Includes "G" Polymorphism 3 CGACGG GTTTAG No Match
Includes "G" Polymorphism 4 GACGGC GTTTAG No Match Includes "G"
Polymorphism 5 ACGGCG GTTTAG No Match Includes "G" Polymorphism 6
CGGCGT GTTTAG No Match Includes "G" Polymorphism 7 GGCGTT GTTTAG No
Match Includes "G" Polymorphism 8 GCGTTT GTTTAG No Match Includes
"G" Polymorphism 9 CGTTTG GTTTAG No Match 10 GTTTGC GTTTAG No Match
11 TTTGCA GTTTAG No Match 12 TTGCAT GTTTAG No Match End of Graph 13
CGTTTA GTTTAG No Match Previous Unfollowed Path Includes "A"
Insertion 14 GTTTAG GTTTAG Match Includes "A" Insertion 15 TTTAGC
GTTTAG No Match Includes "A" Insertion 16 TTAGCA GTTTAG No Match
Includes "A" Insertion 17 TAGCAT GTTTAG No Match Includes "A"
Insertion End of Graph 18 CGACGA GTTTAG No Match Previous
Unfollowed Path Includes "A" Polymorphism 19 GACGAC GTTTAG No Match
Includes "A" Polymorphism 20 ACGACG GTTTAG No Match Includes "A"
Polymorphism 21 CGACGT GTTTAG No Match Includes "A" Polymorphism 22
GACGTT GTTTAG No Match Includes "A" Polymorphism 23 ACGTTT GTTTAG
No Match Includes "A" Polymorphism No remaining paths to follow
[0135] Once the walk has completed, the highest scoring matches can
be identified to obtain the locally aligned position for the
sequence read. As shown in Table 1, a match occurs at the 14.sup.th
comparison. While a depth-first search is described for traversing
the graph, various other searching algorithms or techniques may be
used. For example, a breadth-first search could be substituted.
Similarly, one could also employ an algorithm following the
right-most path, or any previously unfollowed path. Various
embodiments are considered to be within the scope of the
disclosure.
[0136] As previously noted, pattern matching algorithms are
particularly well suited for aligning sequence reads to DAGs. As
known variations are already present in the graph, the pattern
matching algorithm does not need to consider the presence of indels
or other variations within a sequence read. Instead, the pattern
matching algorithm simply considers each path available through the
graph, such as in the manner described above. However, as noted
above, while pattern matching algorithms are fast, the performance
of such an algorithm (and other local alignment algorithms) can be
affected based on the complexity of the DAG surrounding the local
alignment position.
[0137] Considering DAG Complexity to Select a Local Alignment
Tool
[0138] The complexity of a DAG at a given region indicates the
amount of variation in the reference genome at that position. The
complexity of a DAG can be determined in a variety of ways. For
example, the complexity of a DAG can be the number of nodes, number
of vertices, number of branches, number of possible paths, and the
like. Complexity may be determined for a subsection or region of
the DAG, or for the entire DAG.
[0139] For example, a DAG 701 shown in FIG. 7 includes information
for a single nucleotide polymorphism (C or D) and an insertion (EG
or EFG). There are a total of seven nodes and four paths through
the DAG. In another example, a DAG 801 shown in FIG. 8 indicates a
highly variable region of the genome that includes information
regarding various structural variations, SNPs, and small insertions
and deletions. In this DAG 801, there are 10 nodes and 21 possible
paths.
[0140] If there are few paths (e.g., 10 or less, 9 or less, 8 or
less, 7 or less, 6 or less, 5 or less, 4 or less, 3 or less 2 or
less) through the DAG, then the DAG complexity is low. In these
cases, a local linear sequence alignment or pattern matching
algorithm may be faster and more efficient than a graph aware
algorithm. In these cases, the additional number of paths that can
be tested by the algorithm does not significantly impact the
advantages in speed gained by using a pattern matching algorithm
(e.g., by using a shift table to avoid unnecessary computations).
Further, the required number of comparisons can be performed in
less processing time than by generating a plurality of matrices for
each node, as required by Graph Smith-Waterman. However, if the
complexity is high (e.g., 10 or more paths, 10-15 paths, 15 or more
paths, 15-20 paths, or 20 or more paths) then the number of paths
that must be tested quickly increases. In these cases, Graph
Smith-Waterman may be more efficient than a linear sequence
alignment or pattern matching algorithm because it does not suffer
from the combinatorial explosion problem associated with
linearizing complex DAGs.
[0141] The relationship between the alignment processing time and
local DAG complexity is shown in FIG. 9 as a graph 900. Graph 900
illustrates the relationship between processing time and DAG
complexity for a pattern matching algorithm (such as
Tarhio-Ukkonen) and a graph aware algorithm (such as Graph
Smith-Waterman). As shown, at a level of complexity C, performance
of Tarhio-Ukkonen and Graph Smith-Waterman is equal and either
algorithm may be used without a deviation in performance. Above C,
Tarhio-Ukkonen becomes more inefficient due to the large number of
paths that must be individually tested, and Graph Smith-Waterman
may be preferred. In contrast, below C, Tarhio-Ukkonen is more
efficient and requires less processing time.
[0142] Even though performance of either algorithm may be
comparable at a level of complexity C, other considerations may
dictate usage of either algorithm depending on the circumstances.
For example, in certain instances, switching local alignment
algorithms at a level of complexity C' may be preferable instead of
C. As discussed elsewhere, Graph Smith-Waterman is an exact
alignment algorithm and thus able to identify gapped alignments
between a sequence read and a reference. Thus, even if an
approximate pattern-matching algorithm would execute in less time,
it may still be preferable to choose Graph Smith-Waterman because
the resulting alignment can be superior. In these cases, the ideal
level of complexity at which the method chooses either algorithm is
based on the acceptable timeliness of performance. As shown in FIG.
9, above C, Graph Smith-Waterman is may be preferable. However,
below C, one may also decide to use Graph Smith-Waterman. The
consideration is the tradeoff between processing time and improved
alignments. Given the scales typically associated with next
generation sequencing (e.g., to cover the human genome at 30.times.
coverage, approximately 600 million paired-end 100 bp sequence
reads are required), a processing method may account for a level of
complexity in which the processing time is reasonable, yet the
alignments are able to yield useful information regarding unknown
variants.
[0143] Price may also be a consideration. For example, cloud
computing services such as Amazon AWS and Microsoft Azure offer
cloud-based computational resources at a cost. When using these
services to align next generation sequence reads against DAGs, one
may decide to switch to Tarhio-Ukkonen only when its performance is
less than Graph Smith-Waterman in order to reduce processing time
and increase cost efficiency.
[0144] In certain embodiments, a sequence read could be realigned
depending on the resulting quality of the local alignment. For
example, if a sequence read has an unknown variation (such as an
insertion or deletion), and the complexity of a DAG at a candidate
mapping position is low and a pattern matching algorithm is used,
the resulting alignment may have low quality. This is because
pattern matching algorithms, while fast, may not gracefully handle
unknown variations. In these cases, an optimal alignment algorithm
(such as Smith Waterman) could be subsequently performed to
accurately identify the variation and improve the local alignment.
Similarly, in high entropy or high variability areas of the genome,
an optimal alignment algorithm could be substituted for a pattern
matching algorithm to improve the quality of the alignment. The
quality of an alignment may be expressed according to the Phred
scale, for example.
[0145] Additionally, in certain embodiments, more than two
alignment algorithms may be selected based on the complexity. For
example, one may consider using a pattern matching algorithm, an
optimal alignment algorithm, and a graph-aware algorithm as
complexity increases. In this way, systems and methods according to
the invention may select from two, three, or more local alignment
algorithms to significantly improve the performance of
next-generation sequencing alignment using graph data
structures.
[0146] Exemplary System
[0147] FIG. 10 illustrates a computer system 901 suitable for
performing methods of the invention. The computer system 901
includes at least one computer 633. Optionally, the computer system
901 may further include one or more of a server computer 909 and a
sequencer 955, which may be coupled to a sequencer computer 951.
Each computer in the computer system 901 includes a processor 935,
936, or 937 coupled to a memory device and at least one
input/output device. Thus the computer system 901 includes at least
one processor 935 coupled to a memory subsystem 975 (e.g., a memory
device or collection of memory devices). Using those mechanical
components, the computer system 901 is operable to obtain a
sequence generated by sequencing nucleic acid from a genome of a
patient. The system uses the processor to transform the sequence
read 103 information into the reference DAG 331.
[0148] Processor refers to any device or system of devices that
performs processing operations. A processor will generally include
a chip, such as a single core or multi-core chip, to provide a
central processing unit (CPU). A processor may be provided by a
chip from Intel or AMD. A processor may be any suitable processor
such as the microprocessor sold under the trademark XEON E7 by
Intel (Santa Clara, Calif.) or the microprocessor sold under the
trademark OPTERON 6200 by AMD (Sunnyvale, Calif.).
[0149] The memory subsystem 975 contains one or any combination of
memory devices. A memory device is a mechanical device that stores
data or instructions in a machine-readable format. Memory may
include one or more sets of instructions (e.g., software) which,
when executed by one or more of the processors of the disclosed
computers can accomplish some or all of the methods or functions
described herein. Preferably, each computer includes a
non-transitory memory device such as a solid state drive, flash
drive, disk drive, hard drive, subscriber identity module (SIM)
card, secure digital card (SD card), micro SD card, or solid-state
drive (SSD), optical and magnetic media, others, or a combination
thereof.
[0150] Using the described components, the computer system 901 is
operable to produce a report and provide the report to a user via
an input/output device. An input/output device is a mechanism or
system for transferring data into or out of a computer. Exemplary
input/output devices include a video display unit (e.g., a liquid
crystal display (LCD) or a cathode ray tube (CRT)), a printer, an
alphanumeric input device (e.g., a keyboard), a cursor control
device (e.g., a mouse), a disk drive unit, a speaker, a
touchscreen, an accelerometer, a microphone, a cellular radio
frequency antenna, and a network interface device, which can be,
for example, a network interface card (NIC), Wi-Fi card, or
cellular modem.
[0151] Preferably the reference DAG is stored in the memory
subsystem using adjacency techniques, which may include pointers to
identify a physical location in the memory subsystem 675 where each
vertex is stored. In a preferred embodiment, the graph is stored in
the memory subsystem 675 using adjacency lists. In some
embodiments, there is an adjacency list for each vertex. For
discussion of implementations see `Chapter 4, Graphs` at pages
515-693 of Sedgewick and Wayne, 2011, Algorithms, 4th Ed., Pearson
Education, Inc., Upper Saddle River N.J., 955 pages, the contents
of which are incorporated by reference and within which pages
524-527 illustrate adjacency lists.
INCORPORATION BY REFERENCE
[0152] References and citations to other documents, such as
patents, patent applications, patent publications, journals, books,
papers, web contents, have been made throughout this disclosure.
All such documents are hereby incorporated herein by reference in
their entirety for all purposes.
EQUIVALENTS
[0153] Various modifications of the invention and many further
embodiments thereof, in addition to those shown and described
herein, will become apparent to those skilled in the art from the
full contents of this document, including references to the
scientific and patent literature cited herein. The subject matter
herein contains important information, exemplification and guidance
that can be adapted to the practice of this invention in its
various embodiments and equivalents thereof.
[0154] Other implementations are within the scope of the following
claims.
Sequence CWU 1
1
4117DNAHomo sapiens 1atcgacggcg tttgcat 17218DNAHomo sapiens
2atcgacggcg tttagcat 18317DNAHomo sapiens 3atcgacgacg tttgcat
17418DNAHomo sapiens 4atcgacgacg tttagcat 18
* * * * *