U.S. patent application number 16/421653, for systems and methods for sequence data alignment quality assessment, was filed with the patent office on 2019-05-24 and published on 2019-11-14.
The applicant listed for this patent is LIFE TECHNOLOGIES CORPORATION. Invention is credited to Fiona Hyland, Sowmi Utiramerur, Zheng Zhang.
Application Number | 16/421653 |
Family ID | 45439304 |
Filed Date | 2019-05-24 |
United States Patent Application | 20190348153 |
Kind Code | A1 |
Zhang; Zheng; et al. | November 14, 2019 |
SYSTEMS AND METHODS FOR SEQUENCE DATA ALIGNMENT QUALITY
ASSESSMENT
Abstract
A computer-implemented method for classifying alignments of
paired nucleic acid sequence reads is disclosed. A plurality of
paired nucleic acid sequence reads is received, wherein each read
is comprised of a first tag and a second tag separated by an insert
region. Potential alignments for the first and second tags of each
read to a reference sequence are determined, wherein each potential
alignment satisfies a minimum threshold mismatch constraint.
Potential paired alignments of the first and second tags of each
read are identified, wherein a distance between the first and
second tags of each potential paired alignment is within an
estimated insert size range. An alignment score is calculated for
each potential paired alignment based on a distance between the
first and second tags and a total number of mismatches for each
tag.
Inventors: Zhang; Zheng (Arcadia, CA); Utiramerur; Sowmi (Pleasanton, CA); Hyland; Fiona (San Mateo, CA)
Applicant:
Name | City | State | Country | Type
LIFE TECHNOLOGIES CORPORATION | Carlsbad | CA | US |
Family ID: 45439304
Appl. No.: 16/421653
Filed: May 24, 2019
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
15001389 | Jan 20, 2016 | |
16421653 | | |
13177267 | Jul 6, 2011 | 9268903 |
15001389 | | |
61361879 | Jul 6, 2010 | |
Current U.S. Class: 1/1
Current CPC Class: G16B 40/00 20190201; G16B 30/00 20190201; G06N 7/005 20130101; G06N 3/126 20130101
International Class: G16B 40/00 20060101 G16B040/00; G16B 30/00 20060101 G16B030/00; G06N 3/12 20060101 G06N003/12; G06N 7/00 20060101 G06N007/00
Claims
1. A computer-implemented method for classifying alignments of
paired nucleic acid sequence reads, comprising: receiving a
plurality of paired nucleic acid sequence reads, wherein each read
is comprised of a first tag and a second tag separated by an insert
region; determining potential alignments for the first and second
tags of each read to a reference sequence, wherein each potential
alignment satisfies a minimum threshold mismatch constraint;
identifying potential paired alignments of the first and second
tags of each read, wherein a distance between the first and second
tags of each potential paired alignment is within an estimated
insert size range; and calculating an alignment score for each
potential paired alignment based on a distance between the first
and second tags and a total number of mismatches for each tag.
2. The computer-implemented method for classifying alignments of
paired nucleic acid sequence reads, as recited in claim 1, wherein
the paired nucleic acid sequence read is a mate-pair read.
3. The computer-implemented method for classifying alignments of
paired nucleic acid sequence reads, as recited in claim 1, wherein
the paired nucleic acid sequence read is a paired-end read.
4. The computer-implemented method for classifying alignments of
paired nucleic acid sequence reads, as recited in claim 1, wherein
the estimated insert size range is a standard deviation of a
distribution of estimated insert region sizes for the plurality of
paired nucleic acid sequence reads.
5. The computer-implemented method for classifying alignments of
paired nucleic acid sequence reads, as recited in claim 1, wherein
the calculated alignment score is a function of read alignment
length.
6. The computer-implemented method for classifying alignments of
paired nucleic acid sequence reads, as recited in claim 1, wherein
the calculated alignment score is a function of a total number of
possible alignments for each read.
7. A system for identifying potential alignments for sequencing
reads, comprising: a nucleic acid sequencer configured to
interrogate a sample and produce a plurality of read sequences from
the sample; and a processor in communication with the sequencer,
the processor configured to, obtain the read sequences from the
sequencer, perform alignments of the read sequences from the
sequencer to a reference sequence, calculate a quality value for
each alignment, and output each alignment with its associated
quality value.
8. The system for identifying potential alignments for sequencing
reads, as recited in claim 7, wherein the quality value is a
function of sequencing read type.
9. The system for identifying potential alignments for sequencing
reads, as recited in claim 8, wherein the sequencing read is a
single fragment read.
10. The system for identifying potential alignments for sequencing
reads, as recited in claim 8, wherein the sequencing read is a
paired read.
11. The system for identifying potential alignments for sequencing
reads, as recited in claim 10, wherein aligned paired reads must
have insert region sizes that fall within an estimated insert size
range for the aligned paired reads.
12. The system for identifying potential alignments for sequencing
reads, as recited in claim 11, wherein the estimated insert size
range is based on a standard deviation value derived from a
distribution of estimated insert region sizes of the aligned paired
reads.
13. A computer-implemented method for determining possible
alignments for sequencing reads, comprising: interrogating a sample
and producing a plurality of read sequences from the sample;
performing alignments of the read sequences from the sequencer;
calculating a quality value for each alignment; and outputting each
alignment with its associated quality value.
14. The computer-implemented method for determining possible
alignments for sequencing reads, as recited in claim 13, wherein
the quality value is a function of sequencing read type.
15. The computer-implemented method for determining possible
alignments for sequencing reads, as recited in claim 14, wherein
the sequencing read is a single fragment read.
16. The computer-implemented method for determining possible
alignments for sequencing reads, as recited in claim 14, wherein
the sequencing read is a paired read.
17. The computer-implemented method for determining possible
alignments for sequencing reads, as recited in claim 13, wherein
the calculated quality value for each alignment is a function of
read alignment length.
18. The computer-implemented method for determining possible
alignments for sequencing reads, as recited in claim 13, wherein
the calculated quality value for each alignment is a function of
number of read mismatches.
19. The computer-implemented method for determining possible
alignments for sequencing reads, as recited in claim 16, wherein
aligned paired reads have insert region sizes that fall within an
estimated insert size range for the aligned paired reads.
20. The computer-implemented method for determining possible
alignments for sequencing reads, as recited in claim 19, wherein
the estimated insert size range is based on a standard deviation
value derived from a distribution of estimated insert region sizes
of the aligned paired reads.
Description
RELATED APPLICATIONS
[0001] This application is a continuation of U.S. application Ser.
No. 15/001,389 filed Jan. 20, 2016, which is a continuation of U.S.
application Ser. No. 13/177,267 filed Jul. 6, 2011, now U.S. Pat.
No. 9,268,903, which claims priority to U.S. application No.
61/361,879 filed Jul. 6, 2010, which disclosures are herein
incorporated by reference in their entirety.
FIELD
[0002] The present disclosure generally relates to the field of
nucleic acid sequencing including systems and methods for mapping
or aligning fragment sequence reads to a reference sequence.
INTRODUCTION
[0003] Upon completion of the Human Genome Project, one focus of
the sequencing industry has shifted to finding higher throughput
and/or lower cost nucleic acid sequencing technologies, sometimes
referred to as "next generation" sequencing (NGS) technologies. In
making sequencing higher throughput and/or less expensive, the goal
is to make the technology more accessible for sequencing. These
goals can be reached through the use of sequencing platforms and
methods that provide sample preparation for larger quantities of
samples of significant complexity, sequencing larger numbers of
complex samples, and/or a high volume of information generation and
analysis in a short period of time. Various methods, such as, for
example, sequencing by synthesis, sequencing by hybridization, and
sequencing by ligation are evolving to meet these challenges.
[0004] Research into fast and efficient nucleic acid (e.g., genome,
exome, etc.) sequence assembly methods is vital to the sequencing
industry, as NGS technologies can provide ultra-high throughput
nucleic acid sequencing. Sequencing systems incorporating NGS
technologies can therefore produce a large number of short sequence
reads in a relatively short amount of time. Sequence assembly methods
must be able to assemble and/or map a large number of reads quickly
and efficiently (i.e., minimize use of computational resources). For
example, the sequencing of a human-size genome can result in tens
or hundreds of millions of reads that need to be assembled before
they can be further analyzed to determine their biological,
diagnostic and/or therapeutic relevance.
[0005] Sequence assembly can generally be divided into two broad
categories: de novo assembly and reference genome mapping assembly.
In de novo assembly, sequence reads are assembled together so that
they form a new and previously unknown sequence. In reference
genome mapping, by contrast, sequence reads are assembled against an
existing backbone sequence (e.g., reference sequence, etc.) to
build a sequence that is similar but not necessarily identical to
the backbone sequence.
[0006] Conventional mapping tools (e.g., MAQ, BFAST, SHRiMP, BWA,
etc.) used to align sequence reads tend to incorrectly estimate
alignment quality compared to phred-scaled quality scores, because
these tools typically do not support quality value determination
that differentiates between read fragment types (e.g., single,
mate-pair, paired-end, etc.).
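For reference, a phred-scaled quality value Q is conventionally related to an error probability P by Q = -10 log10(P), so that Q = 30 corresponds to a 1-in-1000 chance of error. The following is a minimal sketch of that conversion only; the function names and the cap value are illustrative assumptions and are not taken from the present disclosure or from any of the tools named above.

```python
import math


def phred_from_error_prob(p_error: float, cap: float = 100.0) -> float:
    """Convert an error probability to a phred-scaled quality value.

    The cap (an assumption of this sketch) avoids an infinite value
    when the error probability is exactly zero.
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must lie in [0, 1]")
    if p_error == 0.0:
        return cap
    # max() normalizes the -0.0 produced for p_error == 1.0
    return max(0.0, min(cap, -10.0 * math.log10(p_error)))


def error_prob_from_phred(q: float) -> float:
    """Convert a phred-scaled quality value back to an error probability."""
    return 10.0 ** (-q / 10.0)
```

Under this convention, an alignment whose reported quality value is 20 is being asserted to be wrong about 1% of the time, which is the sense in which a mapping tool can over- or under-estimate alignment quality.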
SUMMARY
[0007] Systems, methods, software and computer-usable media for
determining alignment quality of biomolecule-related sequence reads
aligned to a reference sequence are disclosed. Biomolecule-related
sequences can relate to proteins, peptides, nucleic acids, and the
like, and can include structural and functional information such as
secondary or tertiary structures, amino acid or nucleotide
sequences, sequence motifs, binding properties, genetic mutations
and variants, and the like.
[0008] In various embodiments, nucleic acid sequence read data can
be generated using various techniques, platforms or technologies,
including, but not limited to: capillary electrophoresis,
microarrays, ligation-based systems, polymerase-based systems,
hybridization-based systems, direct or indirect nucleotide
identification systems, pyrosequencing, ion- or pH-based detection
systems, electronic signature-based systems, etc.
[0009] In one aspect, a computer-implemented method for classifying
alignments of paired nucleic acid sequence reads is disclosed. A
plurality of paired nucleic acid sequence reads is received,
wherein each read is comprised of a first tag and a second tag
separated by an insert region. Potential alignments for the first
and second tags of each paired nucleic acid sequence read to a
reference sequence are determined, wherein each potential alignment
satisfies a minimum threshold mismatch constraint. Potential paired
alignments of the first and second tags of each read are
identified, wherein a distance between the first and second tags of
each potential paired alignment is within an estimated insert size
range. An alignment score is calculated for each potential paired
alignment based on a distance between the first and second tags and
a total number of mismatches for each tag.
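The steps of this aspect can be sketched as follows. This is an illustrative outline only: the `TagHit` type, the reading of the mismatch constraint as an upper bound on mismatches per tag, and the weights in `toy_score` are all assumptions made for the example, and the scoring formula shown is a stand-in rather than the alignment score of the present teachings.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TagHit:
    """One potential alignment of a tag (hypothetical record type)."""
    position: int    # reference coordinate where the tag aligns
    mismatches: int  # number of mismatches in this tag alignment


def paired_candidates(first_hits, second_hits, insert_min, insert_max,
                      max_mismatches):
    """Return (first, second, distance) triples passing both filters."""
    firsts = [h for h in first_hits if h.mismatches <= max_mismatches]
    seconds = [h for h in second_hits if h.mismatches <= max_mismatches]
    pairs = []
    for f, s in product(firsts, seconds):
        distance = abs(s.position - f.position)
        if insert_min <= distance <= insert_max:
            pairs.append((f, s, distance))
    return pairs


def toy_score(first, second, distance, expected_insert,
              mismatch_penalty=10.0, distance_weight=0.01):
    """Hypothetical penalty score (lower is better).

    Combines total mismatches of both tags with the deviation of the
    tag-to-tag distance from an expected insert size.
    """
    total_mismatches = first.mismatches + second.mismatches
    return (mismatch_penalty * total_mismatches
            + distance_weight * abs(distance - expected_insert))


def rank_pairs(pairs, expected_insert):
    """Score every candidate pair and sort from best to worst."""
    scored = [(toy_score(f, s, d, expected_insert), f, s, d)
              for f, s, d in pairs]
    return sorted(scored, key=lambda item: item[0])
```

For example, with first-tag hits at positions 100 and 5000 and second-tag hits at 1600, 6200 and 9000, an insert size range of 1000 to 2000 keeps only the pairs (100, 1600) and (5000, 6200); all other combinations are discarded before any score is computed.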
[0010] In another aspect, a system for identifying potential
alignments for sequencing reads is disclosed. The system includes a
nucleic acid sequencer and a processor in communication with the
sequencer. The nucleic acid sequencer can be configured to
interrogate a sample and produce a plurality of read sequences from
the sample. The processor can be configured to obtain the read
sequences from the sequencer, perform alignments of the read
sequences from the sequencer to a reference sequence, calculate a
quality value for each alignment and output each alignment with its
associated quality value.
[0011] In still another aspect, a computer-implemented method for
determining possible alignments for sequencing reads is disclosed.
A sample can be interrogated to produce a plurality of read
sequences from the sample. Alignments are performed for the read
sequences from the sequencer. A quality value for each alignment is
determined. Each alignment with its associated quality value is
outputted.
[0012] These and other features are provided herein.
DRAWINGS
[0013] For a more complete understanding of the principles
disclosed herein, and the advantages thereof, reference is now made
to the following descriptions taken in conjunction with the
accompanying drawings, in which:
[0014] FIG. 1 is a block diagram that illustrates a computer
system, in accordance with various embodiments.
[0015] FIG. 2 is a schematic diagram of a system for reconstructing
a nucleic acid sequence, in accordance with various
embodiments.
[0016] FIG. 3 is an exemplary flowchart showing a method for
classifying alignment quality of paired reads, in accordance with
various embodiments.
[0017] FIG. 4 is a depiction of how PQV can be calculated for
gapped alignments, in accordance with various embodiments.
[0018] It is to be understood that the figures are not necessarily
drawn to scale, nor are the objects in the figures necessarily
drawn to scale in relationship to one another. The figures are
depictions that are intended to bring clarity and understanding to
various embodiments of apparatuses, systems, and methods disclosed
herein. Wherever possible, the same reference numbers will be used
throughout the drawings to refer to the same or like parts.
Moreover, it should be appreciated that the drawings are not
intended to limit the scope of the present teachings in any
way.
DESCRIPTION OF VARIOUS EMBODIMENTS
[0019] Embodiments of systems and methods for determining sequence
alignment quality are described herein.
[0020] The section headings used herein are for organizational
purposes only and are not to be construed as limiting the described
subject matter in any way.
[0021] In this detailed description of the various embodiments, for
purposes of explanation, numerous specific details are set forth to
provide a thorough understanding of the embodiments disclosed. One
skilled in the art will appreciate, however, that these various
embodiments may be practiced with or without these specific
details. In other instances, structures and devices are shown in
block diagram form. Furthermore, one skilled in the art can readily
appreciate that the specific sequences in which methods are
presented and performed are illustrative and it is contemplated
that the sequences can be varied and still remain within the spirit
and scope of the various embodiments disclosed herein.
[0022] All literature and similar materials cited in this
application, including but not limited to, patents, patent
applications, articles, books, treatises, and internet web pages
are expressly incorporated by reference in their entirety for any
purpose. Unless defined otherwise, all technical and scientific
terms used herein have the same meaning as is commonly understood
by one of ordinary skill in the art to which the various
embodiments described herein belong. When definitions of terms in
incorporated references appear to differ from the definitions
provided in the present teachings, the definition provided in the
present teachings shall control.
[0023] It will be appreciated that there is an implied "about"
prior to the temperatures, concentrations, times, etc. discussed in
the present teachings, such that slight and insubstantial
deviations are within the scope of the present teachings. In this
application, the use of the singular includes the plural unless
specifically stated otherwise. Also, the use of "comprise",
"comprises", "comprising", "contain", "contains", "containing",
"include", "includes", and "including" is not intended to be
limiting. It is to be understood that both the foregoing general
description and the following detailed description are exemplary
and explanatory only and are not restrictive of the present
teachings.
[0024] Further, unless otherwise required by context, singular
terms shall include pluralities and plural terms shall include the
singular. Generally, nomenclatures utilized in connection with, and
techniques of, cell and tissue culture, molecular biology, and
protein and oligo- or polynucleotide chemistry and hybridization
described herein are those well known and commonly used in the art.
Standard techniques are used, for example, for nucleic acid
purification and preparation, chemical analysis, recombinant
nucleic acid, and oligonucleotide synthesis. Enzymatic reactions
and purification techniques are performed according to
manufacturer's specifications or as commonly accomplished in the
art or as described herein. The techniques and procedures described
herein are generally performed according to conventional methods
well known in the art and as described in various general and more
specific references that are cited and discussed throughout the
instant specification. See, e.g., Sambrook et al., Molecular
Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor
Laboratory Press, Cold Spring Harbor, N.Y. 2000). The nomenclatures
utilized in connection with, and the laboratory procedures and
techniques described herein are those well known and commonly used
in the art.
[0025] As used herein, "a" or "an" means "at least one" or "one or
more."
[0026] A "system" denotes a set of components, real or abstract,
comprising a whole where each component interacts with or is
related to at least one other component within the whole.
[0027] A "biomolecule" is any molecule that is produced by a
biological organism, including large polymeric molecules such as
proteins, polysaccharides, lipids, and nucleic acids as well as
small molecules such as primary metabolites, secondary metabolites,
and other natural products.
[0028] The phrase "next generation sequencing" or NGS refers to
sequencing technologies having increased throughput as compared to
traditional Sanger- and capillary electrophoresis-based approaches,
for example with the ability to generate hundreds of thousands of
relatively small sequence reads at a time. Some examples of next
generation sequencing techniques include, but are not limited to,
sequencing by synthesis, sequencing by ligation, and sequencing by
hybridization. More specifically, the SOLiD Sequencing System of
Life Technologies Corp. provides massively parallel sequencing with
enhanced accuracy. The SOLiD System and associated workflows,
protocols, chemistries, etc. are described in more detail in PCT
Publication No. WO 2006/084132, entitled "Reagents, Methods, and
Libraries for Bead-Based Sequencing," international filing date
Feb. 1, 2006, U.S. patent application Ser. No. 12/873,190, entitled
"Low-Volume Sequencing System and Method of Use," filed on Aug. 31,
2010, and U.S. patent application Ser. No. 12/873,132, entitled
"Fast-Indexing Filter Wheel and Method of Use," filed on Aug. 31,
2010, the entirety of each of these applications being incorporated
herein by reference thereto.
[0029] The phrase "sequencing run" refers to any step or portion of
a sequencing experiment performed to determine some information
relating to at least one biomolecule (e.g., nucleic acid
molecule).
[0030] It is well known that DNA (deoxyribonucleic acid) is a chain
of nucleotides consisting of 4 types of nucleotides: A (adenine), T
(thymine), C (cytosine), and G (guanine), and that RNA (ribonucleic
acid) is composed of 4 types of nucleotides: A, U (uracil), G, and
C. It is also known that certain pairs of nucleotides specifically
bind to one another in a complementary fashion (called
complementary base pairing). That is, adenine (A) pairs with
thymine (T) (in the case of RNA, however, adenine (A) pairs with
uracil (U)), and cytosine (C) pairs with guanine (G). When a first
nucleic acid strand binds to a second nucleic acid strand made up
of nucleotides that are complementary to those in the first strand,
the two strands bind to form a double strand. As used herein,
"nucleic acid sequencing data," "nucleic acid sequencing
information," "nucleic acid sequence," "genomic sequence," "genetic
sequence," "fragment sequence," or "nucleic acid sequencing
read" denotes any information or data that is indicative of the
order of the nucleotide bases (e.g., adenine, guanine, cytosine,
and thymine/uracil) in a molecule (e.g., whole genome, whole
transcriptome, exome, oligonucleotide, polynucleotide, fragment,
etc.) of DNA or RNA. It should be understood that the present
teachings contemplate sequence information obtained using all
available varieties of techniques, platforms or technologies,
including, but not limited to: capillary electrophoresis,
microarrays, ligation-based systems, polymerase-based systems,
hybridization-based systems, direct or indirect nucleotide
identification systems, pyrosequencing, ion- or pH-based detection
systems, electronic signature-based systems, etc.
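The complementary base pairing rules stated above (A with T, or with U in RNA, and C with G) can be expressed as a simple lookup. The sketch below is illustrative only; the function names are assumptions, and ambiguity codes such as N are left unchanged rather than handled specially.

```python
# Translation tables built from the pairing rules A<->T (or U) and C<->G.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def complement(seq: str, rna: bool = False) -> str:
    """Return the base-by-base complement of a sequence."""
    table = RNA_COMPLEMENT if rna else DNA_COMPLEMENT
    return seq.upper().translate(table)


def reverse_complement(seq: str, rna: bool = False) -> str:
    """Return the complementary strand read in its own 5'->3' direction."""
    return complement(seq, rna)[::-1]
```

The reverse complement matters for read mapping because a read can originate from either strand, so a candidate alignment is typically tested against both the reference and its reverse complement.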
[0031] The phrase "ligation cycle" refers to a step in a
sequence-by-ligation process where a probe sequence is ligated to a
primer or another probe sequence.
[0032] The phrase "color call" refers to an observed dye color
resulting from the detection of a probe sequence after a ligation
cycle of a sequencing run.
[0033] The phrase "color space" refers to a nucleic acid sequence
data schema where nucleic acid sequence information is represented
by a set of colors (e.g., color calls, color signals, etc.) each
carrying details about the identity and/or positional sequence of
bases that comprise the nucleic acid sequence. For example, the
nucleic acid sequence "ATCGA" can be represented in color space by
various combinations of colors that are measured as the nucleic
acid sequence is interrogated using optical detection-based (e.g.,
dye-based, etc.) sequencing techniques such as those employed by
the SOLiD System. That is, in various embodiments, the SOLiD System
can employ a schema that represents a nucleic acid fragment
sequence as an initial base followed by a sequence of overlapping
dimers (adjacent pairs of bases). The system can encode each dimer
with one of four colors using a coding scheme that results in a
sequence of color calls that represent a nucleotide sequence.
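The schema described above (an initial base followed by one color per overlapping dimer) can be sketched as follows. The specific dimer-to-color table is not given in this passage; the sketch assumes the commonly described two-base encoding in which each base is assigned a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dimer is the XOR of its two codes. The function names are illustrative.

```python
# Assumed 2-bit base codes; the color of a dimer is the XOR of its codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_color_space(seq: str) -> str:
    """Encode a base-space sequence as initial base + one color per dimer."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_color_space(encoded: str) -> str:
    """Decode initial base + colors back into a base-space sequence."""
    current = BASE_CODE[encoded[0].upper()]
    bases = [CODE_BASE[current]]
    for color in encoded[1:]:
        # Each color, combined with the previous base, fixes the next base.
        current ^= int(color)
        bases.append(CODE_BASE[current])
    return "".join(bases)
```

Under this assumed table, "ATCGA" encodes to the initial base A followed by the colors 3, 2, 3, 2. Because every base participates in two adjacent dimers, a single miscalled color changes all downstream bases on decoding, which is why color-space reads are commonly aligned in color space rather than decoded first.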
[0034] The phrase "base space" refers to a nucleic acid sequence
data schema where nucleic acid sequence information is represented
by the actual nucleotide base composition of the nucleic acid
sequence. For example, the nucleic acid sequence "ATCGA" is
represented in base space by the actual nucleotide base identities
(e.g., A, T or U, C, G) of the nucleic acid sequence.
[0035] A "polynucleotide", "nucleic acid", or "oligonucleotide"
refers to a linear polymer of nucleosides (including
deoxyribonucleosides, ribonucleosides, or analogs thereof) joined
by internucleosidic linkages. Typically, a polynucleotide comprises
at least three nucleosides. Usually oligonucleotides range in size
from a few monomeric units, e.g. 3-4, to several hundreds of
monomeric units. Whenever a polynucleotide such as an
oligonucleotide is represented by a sequence of letters, such as
"ATGCCTG," it will be understood that the nucleotides are in
5'->3' order from left to right and that "A" denotes
deoxyadenosine, "C" denotes deoxycytidine, "G" denotes
deoxyguanosine, and "T" denotes thymidine, unless otherwise noted.
The letters A, C, G, and T may be used to refer to the bases
themselves, to nucleosides, or to nucleotides comprising the bases,
as is standard in the art.
[0036] The techniques of "paired-end," "pairwise," "paired tag," or
"mate pair" sequencing are generally known in the art of molecular
biology (Siegel A. F. et al., Genomics. 2000, 68: 237-246; Roach J.
C. et al., Genomics. 1995, 26: 345-353). These sequencing
techniques can allow the determination of multiple "reads" of
sequence, each from a different place on a single polynucleotide.
Typically, the distance (i.e., insert region) between the two reads
or other information regarding a relationship between the reads is
known. In some situations, these sequencing techniques provide more
information than does sequencing two stretches of nucleic acid
sequences in a random fashion. With the use of appropriate software
tools for the assembly of sequence information (e.g., Mullikin J.
C. et al., Genome Res. 2003, 13: 81-90; Kent, W. J. et al., Genome
Res. 2001, 11: 1541-8) it is possible to make use of the knowledge
that the "paired-end," "pairwise," "paired tag" or "mate pair"
sequences are not completely random, but are known to occur a known
distance apart and/or to have some other relationship, and are
therefore linked or paired in the genome. This information can aid
in the assembly of whole nucleic acid sequences into a consensus
sequence.
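Because the distance between the two reads of a pair is approximately known, an acceptable insert size range can be estimated from the observed distances themselves. The sketch below is one possible illustration, assuming the range is taken as the mean plus or minus a chosen number of standard deviations; the function names and the default multiplier are assumptions, not values from the present teachings.

```python
import statistics


def estimate_insert_range(observed_sizes, num_sd=3.0):
    """Estimate (low, high) insert size bounds as mean +/- num_sd * sd."""
    if len(observed_sizes) < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(observed_sizes)
    sd = statistics.stdev(observed_sizes)  # sample standard deviation
    low = max(0.0, mean - num_sd * sd)     # an insert cannot be negative
    high = mean + num_sd * sd
    return low, high


def within_range(distance, insert_range):
    """Check whether a tag-to-tag distance falls inside the range."""
    low, high = insert_range
    return low <= distance <= high
```

A pair whose two reads map farther apart, or closer together, than this range allows is then either rejected as a paired alignment or flagged for separate handling.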
Computer-Implemented System
[0037] FIG. 1 is a block diagram that illustrates a computer system
100, upon which embodiments of the present teachings may be
implemented. In various embodiments, computer system 100 can
include a bus 102 or other communication mechanism for
communicating information, and a processor 104 coupled with bus 102
for processing information. In various embodiments, computer system
100 can also include a memory 106, which can be a random access
memory (RAM) or other dynamic storage device, coupled to bus 102
for determining base calls, and instructions to be executed by
processor 104. Memory 106 also can be used for storing temporary
variables or other intermediate information during execution of
instructions to be executed by processor 104. In various
embodiments, computer system 100 can further include a read only
memory (ROM) 108 or other static storage device coupled to bus 102
for storing static information and instructions for processor 104.
A storage device 110, such as a magnetic disk or optical disk, can
be provided and coupled to bus 102 for storing information and
instructions.
[0038] In various embodiments, computer system 100 can be coupled
via bus 102 to a display 112, such as a cathode ray tube (CRT) or
liquid crystal display (LCD), for displaying information to a
computer user. An input device 114, including alphanumeric and
other keys, can be coupled to bus 102 for communicating information
and command selections to processor 104. Another type of user input
device is a cursor control 116, such as a mouse, a trackball or
cursor direction keys for communicating direction information and
command selections to processor 104 and for controlling cursor
movement on display 112. This input device typically has two
degrees of freedom in two axes, a first axis (i.e., x) and a second
axis (i.e., y), that allows the device to specify positions in a
plane.
[0039] A computer system 100 can perform the present teachings.
Consistent with certain implementations of the present teachings,
results can be provided by computer system 100 in response to
processor 104 executing one or more sequences of one or more
instructions contained in memory 106. Such instructions can be read
into memory 106 from another computer-readable medium, such as
storage device 110. Execution of the sequences of instructions
contained in memory 106 can cause processor 104 to perform the
processes described herein. Alternatively, hard-wired circuitry can
be used in place of or in combination with software instructions to
implement the present teachings. Thus, implementations of the
present teachings are not limited to any specific combination of
hardware circuitry and software.
[0040] The term "computer-readable medium" as used herein refers to
any medium that participates in providing instructions to processor
104 for execution. Such a medium can take many forms, including but
not limited to, non-volatile media, volatile media, and
transmission media. Examples of non-volatile media can include, but
are not limited to, optical or magnetic disks, such as storage
device 110. Examples of volatile media can include, but are not
limited to, dynamic memory, such as memory 106. Examples of
transmission media can include, but are not limited to, coaxial
cables, copper wire, and fiber optics, including the wires that
comprise bus 102.
[0041] Common forms of computer-readable media include, for
example, a floppy disk, a flexible disk, hard disk, magnetic tape,
or any other magnetic medium, a CD-ROM, any other optical medium,
punch cards, paper tape, any other physical medium with patterns of
holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip
or cartridge, or any other tangible medium from which a computer
can read.
[0042] Various forms of computer readable media can be involved in
carrying one or more sequences of one or more instructions to
processor 104 for execution. For example, the instructions can
initially be carried on the magnetic disk of a remote computer. The
remote computer can load the instructions into its dynamic memory
and send the instructions over a telephone line using a modem. A
modem local to computer system 100 can receive the data on the
telephone line and use an infra-red transmitter to convert the data
to an infra-red signal. An infra-red detector coupled to bus 102
can receive the data carried in the infra-red signal and place the
data on bus 102. Bus 102 can carry the data to memory 106, from
which processor 104 retrieves and executes the instructions. The
instructions received by memory 106 may optionally be stored on
storage device 110 either before or after execution by processor
104.
[0043] In accordance with various embodiments, instructions
configured to be executed by a processor to perform a method are
stored on a computer-readable medium. The computer-readable medium
can be a device that stores digital information. For example, a
computer-readable medium includes a compact disc read-only memory
(CD-ROM) as is known in the art for storing software. The
computer-readable medium is accessed by a processor suitable for
executing instructions configured to be executed.
Nucleic Acid Sequencing Platforms
[0044] Nucleic acid sequence data can be generated using various
techniques, platforms or technologies, including, but not limited
to: capillary electrophoresis, microarrays, ligation-based systems,
polymerase-based systems, hybridization-based systems, direct or
indirect nucleotide identification systems, pyrosequencing, ion- or
pH-based detection systems, electronic signature-based systems,
etc.
[0045] Various embodiments of nucleic acid sequencing platforms
(i.e., nucleic acid sequencer) can include components as displayed
in the block diagram of FIG. 2. According to various embodiments,
sequencing instrument 200 can include a fluidic delivery and
control unit 202, a sample processing unit 204, a signal detection
unit 206, and a data acquisition, analysis and control unit 208.
Various embodiments of instrumentation, reagents, libraries and
methods used for next generation sequencing are described in U.S.
Patent Application Publication No. US20090062129 (application Ser.
No. 11/737308) and U.S. Patent Application Publication No.
US20080003571 (application Ser. No. 11/345,979) to McKernan, et
al., which applications are incorporated herein by reference.
Various embodiments of instrument 200 can provide for automated
sequencing that can be used to gather sequence information from a
plurality of sequences in parallel, i.e., substantially
simultaneously.
[0046] In various embodiments, the fluidic delivery and control
unit 202 can include a reagent delivery system. The reagent delivery
system can include a reagent reservoir for the storage of various
reagents. The reagents can include RNA-based primers,
forward/reverse DNA primers, oligonucleotide mixtures for ligation
sequencing, nucleotide mixtures for sequencing-by-synthesis,
optional ECC oligonucleotide mixtures, buffers, wash reagents,
blocking reagent, stripping reagents, and the like. Additionally,
the reagent delivery system can include a pipetting system or a
continuous flow system which connects the sample processing unit
with the reagent reservoir.
[0047] In various embodiments, the sample processing unit 204 can
include a sample chamber, such as a flow cell, a substrate, a
micro-array, a multi-well tray, or the like. The sample processing
unit 204 can include multiple lanes, multiple channels, multiple
wells, or other means of processing multiple sample sets
substantially simultaneously. Additionally, the sample processing
unit can include multiple sample chambers to enable processing of
multiple runs simultaneously. In particular embodiments, the system
can perform signal detection on one sample chamber while
substantially simultaneously processing another sample chamber.
Additionally, the sample processing unit can include an automation
system for moving or manipulating the sample chamber.
[0048] In various embodiments, the signal detection unit 206 can
include an imaging or detection sensor. For example, the imaging or
detection sensor can include a CCD, a CMOS, an ion sensor, such as
an ion sensitive layer overlying a CMOS, a current detector, or the
like. The signal detection unit 206 can include an excitation
system to cause a probe, such as a fluorescent dye, to emit a
signal. The excitation system can include an illumination source,
such as an arc lamp, a laser, a light emitting diode (LED), or the
like. In particular embodiments, the signal detection unit 206 can
include optics for the transmission of light from an illumination
source to the sample or from the sample to the imaging or detection
sensor. Alternatively, the signal detection unit 206 may not
include an illumination source, such as for example, when a signal
is produced spontaneously as a result of a sequencing reaction. For
example, a signal can be produced by the interaction of a released
moiety, such as a released ion interacting with an ion sensitive
layer, or a pyrophosphate reacting with an enzyme or other catalyst
to produce a chemiluminescent signal. In another example, changes
in an electrical current can be detected as a nucleic acid passes
through a nanopore without the need for an illumination source.
[0049] In various embodiments, data acquisition, analysis and
control unit 208 can monitor various system parameters. The system
parameters can include temperature of various portions of
instrument 200, such as sample processing unit or reagent
reservoirs, volumes of various reagents, the status of various
system subcomponents, such as a manipulator, a stepper motor, a
pump, or the like, or any combination thereof.
[0050] It will be appreciated by one skilled in the art that
various embodiments of instrument 200 can be used to practice a
variety of sequencing methods including ligation-based methods,
sequencing by synthesis, single molecule methods, nanopore
sequencing, and other sequencing techniques. Ligation sequencing
can include single ligation techniques, or chain ligation
techniques where multiple ligations are performed in sequence on a
single primer. Sequencing by synthesis can include the
incorporation of dye labeled nucleotides, chain termination,
ion/proton sequencing, pyrophosphate sequencing, or the like.
Single molecule techniques can include continuous sequencing, where
the identity of the nucleotide type is determined during
incorporation without the need to pause or delay the sequencing
reaction, or staggered sequencing, where the sequencing reaction is
paused to determine the identity of the incorporated nucleotide.
[0051] In various embodiments, the sequencing instrument 200 can
determine the sequence of a nucleic acid, such as a polynucleotide
or an oligonucleotide. The nucleic acid can include DNA or RNA, and
can be single stranded, such as ssDNA and RNA, or double stranded,
such as dsDNA or a RNA/cDNA pair. In various embodiments, the
nucleic acid can include or be derived from a fragment library, a
mate pair library, a ChIP fragment, or the like. In particular
embodiments, the sequencing instrument 200 can obtain the sequence
information from a single nucleic acid molecule or from a group of
substantially identical nucleic acid molecules.
[0052] In various embodiments, sequencing instrument 200 can output
nucleic acid sequencing read data in a variety of different output
data file types/formats, including, but not limited to: *.fasta,
*.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms,
*srs and/or *.qv.
Classifying Alignments of Paired Reads
[0053] FIG. 3 is an exemplary flowchart showing a method for
classifying alignments of paired nucleic acid sequence reads, in
accordance with various embodiments. In various embodiments, the
sequence read alignment classification scores can be a factor in
determining the pairing quality value (PQV).
[0054] As depicted herein, method 300 begins with step 302 where a
plurality of paired nucleic acid sequence reads is received. Each
paired nucleic acid sequence read is comprised of a first tag
(e.g., F3/R3 read) and a second tag (e.g., F3/R3 read) separated by
an insert region. In various embodiments, the paired nucleic acid
sequence reads are mate-pair reads. In various embodiments, the
paired nucleic acid sequence reads are paired-end reads. In various
embodiments, the paired nucleic acid sequence reads are a
combination of mate-pair and paired-end reads.
[0055] In step 304, the potential alignments for the first and
second tags of each paired nucleic acid sequence read to a
reference sequence are determined, wherein all the potential
alignments satisfy a minimum threshold mismatch constraint. That
is, each read tag that is aligned to the reference sequence cannot
exceed a certain number of mismatches (i.e., minimum threshold
mismatch constraint).
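The mismatch constraint of step 304 can be illustrated with a
minimal sketch. It assumes ungapped, base-space alignment by
exhaustive scanning, which is chosen only for clarity and is not
the mapping method of the present teachings. The function names and
the toy reference string are hypothetical.

```python
def count_mismatches(tag, reference, position):
    """Count mismatching bases between a tag and the reference window
    that starts at the given position (ungapped comparison)."""
    window = reference[position:position + len(tag)]
    return sum(1 for a, b in zip(tag, window) if a != b)


def potential_alignments(tag, reference, max_mismatches):
    """Return (position, mismatches) for every ungapped placement of
    the tag whose mismatch count does not exceed max_mismatches.
    Brute-force scan, for illustration only (hypothetical helper)."""
    hits = []
    for position in range(len(reference) - len(tag) + 1):
        mismatches = count_mismatches(tag, reference, position)
        if mismatches <= max_mismatches:
            hits.append((position, mismatches))
    return hits


# Toy data (assumed): a 12-base reference and a 4-base tag.
reference = "ACGTACGTTTGA"
hits = potential_alignments("ACGA", reference, max_mismatches=1)
```

Each retained placement carries its mismatch count, which later
steps can use when scoring paired alignments.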
[0056] In step 306, potential paired alignments of the first and
second tags of each paired nucleic acid sequence read are
identified, wherein a distance between the first and second tags of
each potential paired alignment is within an estimated insert size
range. In various embodiments, the estimated insert size range can
be determined by: 1. mapping all the tags to a reference sequence,
2. determining a distribution of pairing distance for all uniquely
mapped pairs of tags, and 3. calculating a mean and standard
deviation value from the pairing distance distribution data to
estimate a range of insert size (e.g., range values that cover 95%
of the distributed distances of the observed pairs, range values
lying within a certain number of standard deviations of the mean,
etc.).
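The insert size estimate of step 306 can be sketched as follows.
The sketch assumes that the pairing distances of uniquely mapped
tag pairs are already available as a list of numbers, and that the
range is taken as the mean plus or minus a chosen number of
standard deviations (two by default, which covers roughly 95% of
the distances if they are approximately normally distributed). The
function names and the default are assumptions for illustration.

```python
import statistics


def estimate_insert_size_range(pair_distances, num_std=2.0):
    """Return (low, high) bounds computed from the mean and the
    population standard deviation of the observed pairing distances.
    num_std is an assumed, tunable parameter."""
    mean = statistics.mean(pair_distances)
    std = statistics.pstdev(pair_distances)
    low = max(0.0, mean - num_std * std)  # a distance cannot be negative
    high = mean + num_std * std
    return low, high


def is_within_insert_range(pos_first, pos_second, size_range):
    """True if the distance between two tag positions is in range."""
    low, high = size_range
    return low <= abs(pos_second - pos_first) <= high


# Toy pairing distances (assumed) for uniquely mapped tag pairs.
distances = [1400, 1500, 1600, 1500, 1450, 1550]
insert_range = estimate_insert_size_range(distances)
```

A candidate pair of tag alignments would then be kept as a
potential paired alignment only if is_within_insert_range returns
True for its two positions.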
[0057] In step 308, an alignment score is calculated for each
potential paired alignment based on the distance between the first
and second tags and a total number of mismatches for each tag. In
various embodiments, the alignment score calculation is also a
function of read alignment length (i.e., read length of the tags).
In various embodiments, the alignment score calculation is also a
function of the total number of possible alignments for each paired
nucleic acid sequence read.
[0058] In various embodiments, the method 300 can be performed
using color space nucleic acid sequence data. In various
embodiments, the method 300 can be performed using base space
nucleic acid sequence data. It should be understood, however, that
the method 300 disclosed herein can be performed using any schema
or format of nucleic acid sequence information as long as the
schema or format can convey the base identity and position.
Pairing/Mapping Quality Values
[0059] According to various embodiments, the system and methods of
the present teachings may introduce a Bayesian inference based
statistical approach to calculating mapping quality values for
different library types such as single fragment and paired reads
(e.g., mate-pair, paired-end reads, etc.). These approaches can
make use of mate-pair/paired-end read information including insert
size distribution between the read pairs (e.g., pairs of tags),
read orientation, strand ID annotations, gene ID annotations, etc.
Using this approach, non-uniform prior probabilities can be assigned
to different alignment types, such as alignments that correspond to
inversions (e.g., mate-pair reads mapping to opposite strands, etc.)
and gapped alignments (e.g., insertion/deletion within a read), and
can be useful to assess the probability of observing such
mutations in a particular genome.
[0060] In various embodiments, for the case of whole-transcriptome
sequencing, mate-pair/paired-end reads capable of being mapped to
exons from the same gene can be assigned a uniform prior
probability regardless of the genomic distance between the exons.
In various embodiments, mate-pair/paired-end reads that map to
exons from different genes (corresponding to gene fusions) can be
assigned a lower prior probability. In various embodiments, such an
approach can be implemented in sequence analytics tools and
applications such as, for example, SOLiD LIFESCOPE genetic analysis
software (Life Technologies Corporation; Carlsbad, Calif.) and can
be used for mapping and variant detection using sequencing reads
such as those obtained from a NGS sequencing instrument.
[0061] In various embodiments, the accuracy and predictive value of
the mapping/pairing quality score computed using these methods can
be demonstrated using simulated datasets (for example from a
human reference chromosome 0) as well as actual genome datasets
(for example from a HuRef sample generated using a NGS instrument).
Evaluating the resulting mapping quality values and comparing them
to phred-scale values for probability of misalignment demonstrates
that the methods of the present teachings provide more accurate
mapping quality when compared against conventional approaches and
may be better suited to represent phred-scale alignment probability
for a multiplicity of different library types.
[0062] According to various embodiments, the mapping quality
methods described herein demonstrate highly accurate and
comprehensive functionality in terms of computing quality of
different alignment types including gapped alignments and
whole-transcriptomes. In one aspect, the predictive value of a
mapping quality value can improve the efficiency of generating
variant calls and gene fusion calls made using various tools and
sequencing analytics software (such as the SOLiD LIFESCOPE sequence
analysis toolset). Together with the base quality values of
individual bases in a read, mapping quality values can be used to
improve the efficiency of rare-allele detection in cancer genomics
research.
[0063] In various embodiments, methods for determining
Mapping/Pairing quality value (PQV) are provided. The PQV can be
generally associated with a phred-scaled quantitative measure of
the confidence of aligning a read to the correct location in the
reference genome. The PQV may further be represented as the
negative log odds of misaligning a read (-10 log.sub.10[prob of
error]).
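As a brief illustration of the phred scale referred to above, the
following sketch converts between an error probability and a
phred-scaled quality value. The function names are hypothetical.

```python
import math


def phred_quality(error_probability):
    """Phred-scaled quality: -10 * log10(probability of error)."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality):
    """Inverse mapping from a phred-scaled quality to a probability."""
    return 10.0 ** (-quality / 10.0)
```

For example, a misalignment probability of 0.001 corresponds to a
quality value of about 30, and a quality value of 20 corresponds to
a misalignment probability of about 0.01.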
[0064] In various embodiments, the posterior probability of
correctly aligning a read pair to a reference sequence can be
calculated using (for example) the total alignment length of the
mate pair reads, total number of mismatches to reference, complete
mate-pair information such as insert size and gene ID annotations
(in the case of whole transcriptome). The calculated
mapping/pairing quality values can further represent the
probability of aligning sequenced reads to the reference sequence
(e.g., reference genome, etc.).
[0065] According to various embodiments, a method is provided which
can be implemented in a software tool or application which computes
mapping/pairing quality values that better represent phred-scale
quality scores (for example the probability of misaligning the
reads). This method can make use of read pair information to
compute quality values for mate-pair and paired-end library types.
Mapping/pairing quality values computed by the methods of the
present teachings can be accurate and predictive in terms of being
able to improve the accuracy of small variant detection.
Exemplary Methods for Calculating Pairing Quality Values
[0066] In various embodiments, the pairing algorithm of the present
teachings can be configured to report multiple sets of possible
alignments for any given pair of reads (for example F3/R3 tags for
a Mate-pair run and F3/F5-P2 tags for a Paired-end run obtained
using a NGS sequencer). The pairing quality method and algorithm
can implement a Bayesian approach to calculate the quality of a
given alignment for a pair of reads (i.e., pair of tags) and the
alignment with the highest PQV can be selected as the primary
alignment for the pair of reads. In various embodiments, the PQVs
may be used to represent a Phred-Scaled quality score. Such an
approach can be useful for downstream variant detection tools such
as DiBayes, Small-InDels, Large-InDels and CNV.
[0067] In various aspects, the quality of any given alignment for a
pair of reads r.sub.1, r.sub.2 mapped to positions x.sub.1 and
x.sub.2 in the reference sequence can be represented by Equation
1:
Q(r.sub.1, r.sub.2, x.sub.1, x.sub.2)=P(A(r.sub.1, r.sub.2,
x.sub.1, x.sub.2)|r.sub.1, r.sub.2),
[0068] where A(r.sub.1, r.sub.2, x.sub.1, x.sub.2) represents the
event when reads r.sub.1 & r.sub.2 are sequenced from locations
x.sub.1 & x.sub.2 respectively and P(A|r.sub.1, r.sub.2) is the
probability of the event A occurring given the pair of reads
r.sub.1 and r.sub.2.
[0069] Applying a Bayesian-type approach, the posterior probability
P(A|r.sub.1, r.sub.2) may be represented as Equation 2:
$$P(A(r_1, r_2, x_1, x_2) \mid r_1, r_2) = \frac{P(r_1, r_2 \mid A) \times P(A)}{P(r_1, r_2)}$$
[0070] The probability P(r.sub.1, r.sub.2), of observing reads
r.sub.1 and r.sub.2 can then be a function of the complexity of the
genome sequenced. One exemplary probability determination can be
calculated as Equation 3:
$$P(r_1, r_2) = \sum_{i,j \in M} P(r_1, r_2 \mid A(r_1, r_2, i, j)) \times P(A(r_1, r_2, i, j))$$
[0071] where M is the set of possible alignments to the reference
sequence for reads r.sub.1 and r.sub.2. Using this relationship to
represent P(r.sub.1, r.sub.2) in the previous equation one obtains
Equation 4:
$$P(A(r_1, r_2, x_1, x_2) \mid r_1, r_2) = \frac{P(r_1, r_2 \mid A(r_1, r_2, x_1, x_2)) \times P(A(r_1, r_2, x_1, x_2))}{\sum_{i,j \in M} P(r_1, r_2 \mid A(r_1, r_2, i, j)) \times P(A(r_1, r_2, i, j))}$$
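The normalization in Equation 4 can be sketched numerically. The
sketch assumes that each candidate paired alignment in the set M
has already been given a likelihood P(r.sub.1, r.sub.2|A) and a
prior P(A); the function name and the toy values are hypothetical.

```python
def posterior_probabilities(candidates):
    """Given a list of (likelihood, prior) tuples, one per candidate
    paired alignment in M, return the posterior of each candidate:
    likelihood * prior divided by the sum over all candidates."""
    weights = [likelihood * prior for likelihood, prior in candidates]
    total = sum(weights)
    if total == 0.0:
        # No candidate has any support; return zeros rather than divide.
        return [0.0] * len(candidates)
    return [weight / total for weight in weights]


# Toy example (assumed values): two candidates with equal likelihood,
# one with prior 1 and one with prior 1/10,000.
posteriors = posterior_probabilities([(0.02, 1.0), (0.02, 1.0 / 10_000)])
```

In this toy case nearly all of the posterior mass falls on the
candidate with the higher prior, since the likelihoods are equal.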
[0072] The prior probability P(A) of the event A can further be
given by Equation 5:
P(A(r.sub.1, r.sub.2, x.sub.1, x.sub.2))=P(A(r.sub.2,
x.sub.2)|B).times.P(B(r.sub.1, x.sub.1)),
[0073] where B(r.sub.1, x.sub.1) is the event that read r.sub.1 is
sequenced from location x.sub.1 in the genome and P(A|B) is the
conditional probability of finding the event A where read r.sub.2
is sequenced from location x.sub.2, given that read r.sub.1 was
sequenced from location x.sub.1.
[0074] The probability P(B) can be a constant for any given read
r.sub.1, and the conditional probability P(A|B) can follow the
insert-size distribution. As indicated below, the following prior
probabilities can be used in pairing quality calculations. In
various embodiments, P(A(r.sub.1, r.sub.2, i, j)) can be the alignment
score calculated for each potential sequence pair alignment (as
discussed above with respect to FIG. 3).
[0075] P(A|B)=1, for all `AAA` pairs.
[0076] P(A|B)=1/10,000, for all `non-AAA` pairs (including Small
& Large Indels).
[0077] P(A|B)=1/10,000, when one of the reads in the pair cannot be
mapped to the reference sequence.
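The prior probabilities listed above can be encoded as a small
lookup. The sketch assumes that each candidate pair carries a
category label in which 'AAA' marks the pairs given a prior of 1;
the function name, the both_mapped flag, and any label other than
'AAA' are hypothetical placeholders.

```python
PRIOR_AAA = 1.0                # P(A|B) for 'AAA' pairs
PRIOR_NON_AAA = 1.0 / 10_000   # P(A|B) for all other cases listed above


def conditional_prior(category, both_mapped=True):
    """Return P(A|B) for a candidate pair.

    category: pairing category label; only 'AAA' is taken from the
        text, any other value is treated as 'non-AAA'.
    both_mapped: False when one read of the pair cannot be mapped
        to the reference sequence (assumed flag).
    """
    if not both_mapped:
        return PRIOR_NON_AAA
    return PRIOR_AAA if category == "AAA" else PRIOR_NON_AAA
```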
[0078] In various embodiments, where a pair of reads has a unique
set of alignments to a reference sequence, the posterior
probability P(A|r.sub.1, r.sub.2) can result in 1, thereby obscuring
the relative quality of the alignment compared to those of other
read pairs. This can be addressed by calculating a background
probability P.sub.B, which can represent the probability of finding
an alignment to the reference sequence with M+1 mismatches, where M
is the maximum number of allowed mismatches set in the pairing.ini
file, as shown in Equation 6:
P.sub.B=P(r.sub.1|A(r.sub.1, x.sub.1)).times.P(r.sub.2|B, M+1
mismatches), r.sub.1>r.sub.2 (k.sub.1>k.sub.2, if
r.sub.1=r.sub.2)
[0079] For uniquely paired reads, the posterior probability can be
given by Equation 7:
$$P(A(r_1, r_2, x_1, x_2) \mid r_1, r_2) = \frac{P(r_1, r_2 \mid A)}{P(r_1, r_2 \mid A) + P_B}$$
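Equation 7 can be sketched as a one-line computation. The sketch
assumes that the likelihood P(r.sub.1, r.sub.2|A) and the
background probability P.sub.B have already been calculated; the
function name and the toy values are hypothetical.

```python
def unique_pair_posterior(likelihood, background):
    """Posterior for a uniquely paired read: the likelihood divided by
    the sum of the likelihood and the background probability P_B."""
    return likelihood / (likelihood + background)


# Toy values (assumed): the background term keeps the posterior
# below 1 even though the pair has only one candidate alignment.
posterior = unique_pair_posterior(1e-3, 1e-6)
```

Because the background term is strictly positive, uniquely paired
reads with weaker likelihoods receive lower posteriors than those
with stronger likelihoods, rather than all receiving a value of 1.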
[0080] For mapping using a local alignment method, the likelihood
function P(r.sub.1, r.sub.2|A) can be given by Equation 8:
$$P(r_1, r_2 \mid A) = (1 - e)^{(k_1 + k_2) - (m_1 + m_2)} \times e^{(m_1 + m_2)} \times \left(\frac{1}{4}\right)^{(L_1 + L_2) - (m_1 + m_2)}$$
[0081] where,
[0082] L.sub.1 & L.sub.2 are the read lengths for reads r.sub.1 and
r.sub.2 respectively (e.g., F3=50 and R3=50),
[0083] k.sub.1 & k.sub.2 are the alignment lengths
(k.sub.1 ≤ L.sub.1 and k.sub.2 ≤ L.sub.2),
[0084] m.sub.1 & m.sub.2 are the number of mismatches, and
[0085] e is the error rate.
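The likelihood of Equation 8 can be transcribed directly, with the
exponents taken exactly as they appear in the equation. A
log-scale variant is included because the product becomes very
small for typical read lengths; that variant, the function names,
and the example error rate of 0.01 are assumptions added for
illustration.

```python
import math


def pair_likelihood(k1, k2, m1, m2, L1, L2, e):
    """Equation 8: likelihood of a read pair given an alignment.

    k1, k2: alignment lengths; m1, m2: mismatch counts;
    L1, L2: read lengths; e: error rate, with 0 < e < 1.
    """
    mismatches = m1 + m2
    matched = (k1 + k2) - mismatches
    quarter_exponent = (L1 + L2) - mismatches
    return (1.0 - e) ** matched * e ** mismatches * 0.25 ** quarter_exponent


def log10_pair_likelihood(k1, k2, m1, m2, L1, L2, e):
    """Same quantity on a log10 scale (assumed convenience helper)."""
    mismatches = m1 + m2
    matched = (k1 + k2) - mismatches
    quarter_exponent = (L1 + L2) - mismatches
    return (matched * math.log10(1.0 - e)
            + mismatches * math.log10(e)
            + quarter_exponent * math.log10(0.25))


# Toy comparison (assumed values): two 50-base reads, fully aligned,
# with zero mismatches versus one mismatch in each read.
perfect = pair_likelihood(50, 50, 0, 0, 50, 50, e=0.01)
two_mismatches = pair_likelihood(50, 50, 1, 1, 50, 50, e=0.01)
```

With these toy values the alignment with zero mismatches has the
larger likelihood, as expected for a small error rate.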
[0086] Being consistent with a phred-type quality score
(-10*log.sub.10[prob(error)]), the PQV may be computed as the
negative log odds of misaligning the pair of reads, as shown in
Equation 9:
PQV=-10.times.log.sub.10[1-Q(r.sub.1, r.sub.2, x.sub.1,
x.sub.2)]
[0087] The resulting pairing quality values can be normalized by a
maximum value to help ensure that the pairing quality values are
within a desired range [0, 100], as shown in Equation 10:
$$\mathrm{PQV} = \frac{\mathrm{PQV}}{\mathrm{PQV}_{max}} \times 100$$
[0088] PQV.sub.max can reflect an exemplary maximum possible
pairing quality value when the pair of reads map uniquely to the
reference with zero mismatches.
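The phred-type calculation and the normalization to the range
[0, 100] can be sketched together. The sketch assumes that the
posterior Q and the value PQV.sub.max are supplied by the caller.
The floor on 1-Q (to avoid taking the logarithm of zero) and the
clamping of the result are assumptions added for numerical safety,
and the function names are hypothetical.

```python
import math


def pqv_from_posterior(q, floor=1e-10):
    """PQV = -10 * log10(1 - Q). The floor (assumed) prevents log10(0)
    when the posterior Q is exactly 1."""
    return -10.0 * math.log10(max(1.0 - q, floor))


def normalize_pqv(pqv, pqv_max):
    """Scale a PQV by PQV_max to the range [0, 100]; the clamping to
    that range is an assumed safeguard."""
    return max(0.0, min(100.0, pqv / pqv_max * 100.0))
```

For example, a posterior of 0.999 gives a PQV of about 30, which
normalizes to about 50 if PQV.sub.max is taken to be 60.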
Exemplary Methods for Calculating PQVs for Gapped Alignments
[0089] In various embodiments, a pairing method can be devised to
search for gapped alignments (i.e., InDels) when one of the tags
(F3/R3/F5-P2) maps to a reference sequence and another tag does not
map to the reference sequence within a selected insert-size range.
For this exemplary approach, where both an un-gapped and a gapped
alignment are found for a given read, then, due to the low prior
probability of 10.sup.-4 assigned to the gapped alignments, the PQV
for gapped alignments can be approximately zero.
[0090] Thus, as shown in FIG. 4, in calculating the PQV for gapped
alignments, an alternative hypothesis can be tested as the
probability of finding the partial un-gapped alignments. The read
with the gapped alignment can be treated as two partial reads on
either side of an InDel start point where the half with the
greatest length is used as the partial alignment length for an
alternate hypothesis. Such an approach can be used to help ensure
that gapped alignments with an InDel starting point at the middle of
the read, with a significant length of alignment on either side of
the InDel starting point, will be assigned a higher PQV compared to
gapped alignments with an InDel starting point close to either end
of the read, as shown in Equation 11:
$$P(A \mid r)_{InDel} = \frac{P(r \mid A)_{InDel}}{P(r \mid A)_{InDel} + P_{PartialAlignment}}$$
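The two ingredients of Equation 11 can be sketched as follows. The
sketch assumes that the InDel start point is given as an offset
from the start of the read, and that the gapped likelihood and the
partial-alignment probability are computed elsewhere and passed
in. How P.sub.PartialAlignment is derived from the partial
alignment length is not shown here, and the function names are
hypothetical.

```python
def longer_flank_length(read_length, indel_start):
    """Length of the longer of the two partial reads on either side
    of the InDel start point, used as the partial alignment length."""
    return max(indel_start, read_length - indel_start)


def gapped_posterior(likelihood_indel, p_partial_alignment):
    """Equation 11: gapped likelihood divided by the sum of the gapped
    likelihood and the partial un-gapped alignment probability."""
    return likelihood_indel / (likelihood_indel + p_partial_alignment)
```

For a 50-base read, an InDel starting at offset 25 gives a partial
alignment length of 25, whereas an InDel starting at offset 5 gives
a partial alignment length of 45.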
[0091] In various embodiments, for reads with multiple un-gapped
alignments, the alignment with the highest PQV can be selected as
the primary alignment for the read and reported to the *.BAM file.
In cases where there are multiple alignments with the same PQV,
then the primary alignment can be selected at random from among the
alignments with the same PQV.
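The selection rule can be sketched as follows, assuming that each
candidate alignment is represented as a (label, PQV) tuple. The
function name, the tuple layout, and the optional random number
generator argument are hypothetical.

```python
import random


def select_primary(alignments, rng=random):
    """Return the alignment with the highest PQV; if several share the
    highest PQV, choose one of them at random.

    alignments: non-empty list of (label, pqv) tuples (assumed layout).
    rng: object with a choice() method, e.g. random.Random(seed).
    """
    best_pqv = max(pqv for _, pqv in alignments)
    tied = [alignment for alignment in alignments if alignment[1] == best_pqv]
    return rng.choice(tied)


# Toy example (assumed values) with a single best alignment.
primary = select_primary([("a", 10.0), ("b", 40.0), ("c", 25.0)])
```

Passing a seeded random.Random instance as rng makes the
tie-breaking reproducible across runs, which can be convenient
when comparing outputs.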
[0092] While the present teachings are described in conjunction
with various embodiments, it is not intended that the present
teachings be limited to such embodiments. On the contrary, the
present teachings encompass various alternatives, modifications,
and equivalents, as will be appreciated by those of skill in the
art.
[0093] Further, in describing various embodiments, the
specification may have presented a method and/or process as a
particular sequence of steps. However, to the extent that the
method or process does not rely on the particular order of steps
set forth herein, the method or process should not be limited to
the particular sequence of steps described. As one of ordinary
skill in the art would appreciate, other sequences of steps may be
possible. Therefore, the particular order of the steps set forth in
the specification should not be construed as limitations on the
claims. In addition, the claims directed to the method and/or
process should not be limited to the performance of their steps in
the order written, and one skilled in the art can readily
appreciate that the sequences may be varied and still remain within
the spirit and scope of the various embodiments.
[0094] The embodiments described herein can be practiced with
other computer system configurations including hand-held devices,
microprocessor systems, microprocessor-based or programmable
consumer electronics, minicomputers, mainframe computers and the
like. The embodiments can also be practiced in distributed
computing environments where tasks are performed by remote
processing devices that are linked through a network.
[0095] It should also be understood that the embodiments described
herein can employ various computer-implemented operations involving
data stored in computer systems. These operations are those
requiring physical manipulation of physical quantities. Usually,
though not necessarily, these quantities take the form of
electrical or magnetic signals capable of being stored,
transferred, combined, compared, and otherwise manipulated.
Further, the manipulations performed are often referred to in
terms, such as producing, identifying, determining, or
comparing.
[0096] Any of the operations that form part of the embodiments
described herein are useful machine operations. The embodiments
described herein also relate to a device or an apparatus for
performing these operations. The systems and methods described
herein can be specially constructed for the required purposes or
may be a general purpose computer selectively activated or
configured by a computer program stored in the computer. In
particular, various general purpose machines may be used with
computer programs written in accordance with the teachings herein,
or it may be more convenient to construct a more specialized
apparatus to perform the required operations.
[0097] Certain embodiments can also be embodied as computer
readable code on a computer readable medium. The computer readable
medium is any data storage device that can store data, which can
thereafter be read by a computer system. Examples of the computer
readable medium include hard drives, network attached storage
(NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs,
CD-RWs, magnetic tapes, and other optical and non-optical data
storage devices. The computer readable medium can also be
distributed over network-coupled computer systems so that the
computer readable code is stored and executed in a distributed
fashion.
* * * * *