U.S. patent application number 15/887914 was filed with the patent office on 2018-02-02 and published on 2018-06-21 for systems and methods for detection of aneuploidy.
This patent application is currently assigned to Natera, Inc. The applicant listed for this patent is Natera, Inc. Invention is credited to Milena BANJEVIC, Zachary DEMKO, Allison RYAN, Styrmir SIGURJONSSON, Naresh VANKAYALAPATI.
Application Number | 15/887914 |
Publication Number | 20180173846 |
Family ID | 62556361 |
Filed Date | 2018-02-02 |
United States Patent Application | 20180173846 |
Kind Code | A1 |
SIGURJONSSON; Styrmir; et al. | June 21, 2018 |
Systems and Methods for Detection of Aneuploidy
Abstract
Provided herein are improved methods for detecting aneuploidy in
a sample. The methods in certain embodiments are used for the
analysis of circulating DNA in serum samples, such as circulating
fetal DNA or circulating tumor DNA. In certain embodiments,
chromosome or chromosome segments of interest are used to set a
bias model and/or a control value for a z-score determination, in
illustrative examples without the use of a control chromosome.
Inventors: |
SIGURJONSSON; Styrmir (San Jose, CA);
VANKAYALAPATI; Naresh (San Francisco, CA);
RYAN; Allison (Belmont, CA);
DEMKO; Zachary (San Francisco, CA);
BANJEVIC; Milena (Los Altos Hills, CA) |
Applicant: |
Name | City | State | Country | Type |
Natera, Inc. | San Carlos | CA | US | |
Assignee: |
Natera, Inc. (San Carlos, CA) |
Family ID: |
62556361 |
Appl. No.: |
15/887914 |
Filed: |
February 2, 2018 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
14732632 | Jun 5, 2015 | |
15887914 | | |
62079257 | Nov 13, 2014 | |
62032785 | Aug 4, 2014 | |
62008235 | Jun 5, 2014 | |
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G16B 5/00 20190201; G16B
30/00 20190201; G16B 20/00 20190201; G16H 10/40 20180101 |
International
Class: |
G06F 19/22 20060101
G06F019/22; G06F 19/12 20060101 G06F019/12 |
Claims
1. A method for detecting a presence or absence of aneuploidy of a
chromosome or chromosome segment of interest in a test sample,
comprising: obtaining genetic data for the chromosome or chromosome
segment of interest from each sample in a set of samples comprising
the test sample, wherein the genetic data is obtained from a
parallel analysis of the samples, wherein the genetic data of the
test sample are obtained by isolating a mixture of fetal cell-free
genomic DNA and maternal cell-free genomic DNA from the test sample
which is a blood sample of a pregnant woman, and amplifying and
sequencing the mixture of fetal cell-free genomic DNA and maternal
cell-free genomic DNA together; determining whether aneuploidy is
present in the test sample by a first method comprising:
determining a depth of reads or a proportion of reads that map to
the chromosome or chromosome segment of interest; calculating a
z-score for the depth of reads or the proportion of reads that map
to the chromosome or chromosome segment of interest; and
determining whether the test sample is aneuploid at the chromosome
or chromosome segment of interest based on the z-score, thereby
providing a first result; and determining whether aneuploidy is
present in the test sample by a second method comprising: creating
a plurality of ploidy hypotheses wherein each ploidy hypothesis is
associated with a specific copy number for the chromosome or
chromosome segment of interest, determining a ploidy probability
value for each ploidy hypothesis, wherein the ploidy probability
value indicates the likelihood that the test sample has the
specific copy number for the chromosome or chromosome segment of
interest that is associated with the ploidy hypothesis, and
determining which ploidy hypothesis is most likely to be correct by
selecting the ploidy hypothesis with the maximum likelihood,
thereby providing a second result, wherein aneuploidy is detected
by considering the first result and the second result.
2. The method according to claim 1, wherein the genetic data
comprises quantitative allelic data from a plurality of polymorphic
loci in the set of loci, wherein each of the ploidy hypotheses
specifies an expected distribution of quantitative allelic data at
the plurality of polymorphic loci, and wherein the ploidy
probability values are determined by calculating, for each of the
ploidy hypotheses, the fit between the expected genetic data and
the obtained genetic data.
3. The method according to claim 1, wherein the genetic data
comprises quantitative non-allelic data from a plurality of
polymorphic loci in the set of loci, and wherein each of the ploidy
hypotheses specifies an expected mean value of quantitative
non-allelic data at the plurality of polymorphic loci, and wherein
the ploidy probability values are determined by calculating, for
each of the ploidy hypotheses, the fit between the expected genetic
data and the obtained genetic data.
4. The method according to claim 1, wherein the first result is
determined by calculating a likelihood based on the z-score.
5. The method according to claim 4, wherein aneuploidy is detected
by combining the aneuploidy likelihoods from the first method and
the second method using the following formula: Combined
likelihood=R1R2/[R1R2+(1-R1)(1-R2)].
6. The method according to claim 1, wherein the first result is
determined by determining whether the z-score for the test sample
is above a threshold value.
7. The method according to claim 1, wherein the second method
comprises a quantitative allelic method.
8. The method according to claim 7, wherein the quantitative
allelic method is a het rate method.
9. The method according to claim 8, wherein the het rate method is
based on analysis of observed allele ratios at each SNP using a
joint distribution model.
10. The method according to claim 1, wherein the second method
comprises a quantitative non-allelic method.
11. The method according to claim 10, wherein the quantitative
non-allelic method is a QMM method.
12. The method according to claim 11, wherein the QMM method is
based on analysis of the number of sequencing reads at each
SNP.
13. The method according to claim 1, wherein the second method
comprises both a quantitative allelic method and a quantitative
non-allelic method.
14. A method for determining a presence or absence of a fetal
aneuploidy in a fetus for each of a plurality of maternal blood
samples obtained from a plurality of different pregnant women, said
maternal blood samples comprising fetal and maternal cell-free
genomic DNA, said method comprising: determining a number of
enumerated sequence reads corresponding to a chromosome or
chromosome segment of interest for each of the plurality of
samples; determining a reference value of enumerated sequence reads
from a diploid subset of between 1 and 50 samples of the plurality
of samples or between 1-50% of samples of the plurality of samples
having a number of enumerated sequence reads closest to the median
number of enumerated sequence reads for the chromosome or
chromosome segment of interest for the plurality of maternal blood
samples, without determining sequencing reads for a separate
reference chromosome; and comparing the enumerated sequence reads
from each of the other samples of the plurality of samples that are
not diploid samples, to the reference value, thereby determining
the presence or absence of a fetal aneuploidy in the chromosome or
chromosome segment of interest.
15. The method according to claim 14, further comprising before the
determining the number of enumerated sequence reads: obtaining a
fetal and maternal cell-free genomic DNA sample from each of the
plurality of maternal blood samples; generating a library derived
from each fetal and maternal cell-free genomic DNA sample,
performing massively parallel sequencing of polynucleotide
sequences of the library from the chromosome or chromosome segment
of interest; and enumerating sequence reads corresponding to fetal
and maternal polynucleotide sequences selected from the chromosome
or chromosome segment of interest.
16. The method according to claim 14, wherein the reference value
of enumerated sequence reads is determined from a diploid subset of
between 10 and 40 samples closest to the median.
17. The method according to claim 14, wherein the reference value
of enumerated sequence reads is determined from a diploid subset of
between 15 and 40 samples closest to the median.
18. The method according to claim 14, wherein each library of
enriched and indexed fetal and maternal polynucleotide sequences
includes an indexing nucleotide sequence which identifies a
maternal blood sample of the plurality of maternal blood samples
and pooling the libraries generated to produce a pool of enriched
and indexed fetal and maternal non-random polynucleotide
sequences.
19. The method according to claim 14, wherein said plurality of
polynucleotide sequences comprises at least 100 different
non-random polynucleotide sequences, wherein each of said plurality
of non-random polynucleotide sequences is from 10 to 1000
nucleotide bases in length.
20. The method according to claim 14, wherein the method further
comprises selectively enriching a plurality of non-random
polynucleotide sequences of each fetal and maternal cell-free
genomic DNA sample.
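The reference-value step recited in claim 14 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the enumerated read counts are already comparable across samples (for example, scaled to a common total depth), it uses a plain average of the subset as the reference value, and the function names `reference_from_median_subset` and `ratios_to_reference` are hypothetical.

```python
from statistics import mean, median


def reference_from_median_subset(read_counts, subset_size):
    """Average the counts of the `subset_size` samples closest to the batch median.

    Assumption: samples nearest the median are treated as the diploid subset.
    """
    if not 1 <= subset_size <= len(read_counts):
        raise ValueError("subset_size must be between 1 and the number of samples")
    batch_median = median(read_counts)
    # Rank sample indices by distance from the median read count.
    ranked = sorted(range(len(read_counts)),
                    key=lambda i: abs(read_counts[i] - batch_median))
    subset = sorted(ranked[:subset_size])
    reference = mean(read_counts[i] for i in subset)
    return reference, subset


def ratios_to_reference(read_counts, reference, subset):
    """Compare every sample outside the presumed-diploid subset to the reference."""
    return {i: read_counts[i] / reference
            for i in range(len(read_counts)) if i not in subset}


counts = [1000, 1010, 990, 1005, 1100]          # illustrative values only
reference, diploid_subset = reference_from_median_subset(counts, 3)
ratios = ratios_to_reference(counts, reference, diploid_subset)
```

A ratio well above or below 1 for a sample outside the subset would then be examined further; the cutoff for calling aneuploidy is not specified here and would have to be chosen separately.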
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. Utility
application Ser. No. 14/732,632, filed on Jun. 5, 2015. U.S.
Utility application Ser. No. 14/732,632 claims the benefit of and
priority to U.S. Provisional Application Ser. No. 62/008,235, filed
Jun. 5, 2014; U.S. Provisional Application Ser. No. 62/032,785,
filed Aug. 4, 2014; and U.S. Provisional Application Serial No.
62/079,257, filed Nov. 13, 2014. The entireties of all these
applications are each hereby incorporated by reference for the
teachings therein.
FIELD OF THE INVENTION
[0002] The present invention generally relates to molecular biology
methods and systems, and more specifically to methods and systems
for detecting ploidy of a chromosome segment.
BACKGROUND
[0003] Measurement of the number of copies of a chromosome or
chromosome segment in a cell of interest is an important technique
in molecular biology. The technique has wide applicability in
fields such as prenatal diagnosis and the analysis of cancer cells.
Older techniques such as karyotyping are being supplanted by
techniques employing high levels of DNA sequencing. For example,
such techniques can be used to detect copy number variation
(CNV).
[0004] Copy number variation (CNV) has been identified as a major
cause of structural variation in the genome, involving both
duplications and deletions of sequences that typically range in
length from 1,000 base pairs (1 kb) to 20 megabases (Mb). Deletions
and duplications of chromosome segments or entire chromosomes are
associated with a variety of conditions, such as susceptibility or
resistance to disease.
[0005] CNVs are often assigned to one of two main categories, based
on the length of the affected sequence. The first category includes
copy number polymorphisms (CNPs), which are common in the general
population, occurring with an overall frequency of greater than 1%.
CNPs are typically small (most are less than 10 kilobases in
length), and they are often enriched for genes that encode proteins
important in drug detoxification and immunity. A subset of these
CNPs is highly variable with respect to copy number. As a result,
different human chromosomes can have a wide range of copy numbers
(e.g., 2, 3, 4, 5, etc.) for a particular set of genes. CNPs
associated with immune response genes have recently been associated
with susceptibility to complex genetic diseases, including
psoriasis, Crohn's disease, and glomerulonephritis.
[0006] The second class of CNVs includes relatively rare variants
that are much longer than CNPs, ranging in size from hundreds of
thousands of base pairs to over 1 million base pairs in length. In
some cases, these CNVs may have arisen during production of the
sperm or egg that gave rise to a particular individual, or they may
have been passed down for only a few generations within a family.
These large and rare structural variants have been observed
disproportionately in subjects with mental retardation,
developmental delay, schizophrenia, and autism. Their appearance in
such subjects has led to speculation that large and rare CNVs may
be more important in neurocognitive diseases than other forms of
inherited mutations, including single nucleotide substitutions.
[0007] Gene copy number can be altered in cancer cells. For
instance, duplication of Chr1p is common in breast cancer, and the
EGFR copy number can be higher than normal in non-small cell lung
cancer. Cancer is one of the leading causes of death; thus, early
diagnosis and treatment of cancer is important, since it can
improve the patient's outcome (such as by increasing the
probability of remission and the duration of remission). Early
diagnosis can also allow the patient to undergo fewer or less
drastic treatment alternatives. Many of the current treatments that
destroy cancerous cells also affect normal cells, resulting in a
variety of possible side-effects, such as nausea, vomiting, low
blood cell counts, increased risk of infection, hair loss, and
ulcers in mucous membranes. Thus, early detection of cancer is
desirable since it can reduce the amount and/or number of
treatments (such as chemotherapeutic agents or radiation) needed to
eliminate the cancer.
[0008] Copy number variation has also been associated with severe
mental and physical handicaps, and idiopathic learning disability.
Non-invasive prenatal testing (NIPT) using cell-free DNA (cfDNA)
can be used to detect abnormalities, such as fetal trisomies 13,
18, and 21, triploidy, and sex chromosome aneuploidies.
Subchromosomal microdeletions, which can also result in severe
mental and physical handicaps, are more challenging to detect due
to their smaller size. Eight of the microdeletion syndromes have an
aggregate incidence of more than 1 in 1000, making them nearly as
common as fetal autosomal trisomies.
[0009] In addition, a higher copy number of CCL3L1 has been
associated with lower susceptibility to HIV infection, and a low
copy number of FCGR3B (the CD16 cell surface immunoglobulin
receptor) can increase susceptibility to systemic lupus
erythematosus and similar inflammatory autoimmune disorders.
[0010] Thus, improved methods are needed to detect deletions and
duplications of chromosome segments or entire chromosomes.
Preferably, these methods can be used to more accurately diagnose
disease or an increased risk of disease, such as cancer or CNVs in
a gestating fetus.
[0011] In many clinical trials concerning a diagnostic that employs
molecular biology, for example for detecting CNVs, a protocol with
a number of parameters is set, and then the same protocol is
executed with the same parameters for each of the patients in the
trial. In the case of determining the ploidy status of a fetus
gestating in a mother, using sequencing as a method to measure
genetic material, one pertinent parameter is the number of reads.
The number of reads may refer to the number of actual reads, the
number of intended reads, fractional lanes, full lanes, or full
flow cells on a sequencer. In these studies, the number of reads is
typically set at a level that will ensure that all or nearly all of
the samples achieve the desired level of accuracy. Sequencing is
currently an expensive technology, at a cost of roughly $200 per 5
million mappable reads, and while the price is dropping, any method
which allows a sequencing based diagnostic to operate at a similar
level of accuracy but with fewer reads will necessarily save a
considerable amount of money.
[0012] Accordingly, there is a need for new improved techniques for
the determination of aneuploidy in a chromosome or chromosome
segment of interest, especially by employing DNA sequencing in a
more accurate and cost-effective manner by reducing the required
number of reads. This will bring down the cost of such molecular
diagnostics, resulting in better diagnostics that are available to
more people. The improved techniques would, for example, be
particularly valuable in the analysis of cell free DNA derived from
fetal cells or tumor cells to provide improved prenatal and cancer
diagnostics.
SUMMARY
[0013] Provided herein in one embodiment are methods and systems
for determining the copy number, or detecting aneuploidy of a
chromosome or chromosome segment of interest in a cell of interest
that are performed using the chromosome or chromosome segment of
interest to set a bias model, that is to set test parameters, using
samples analyzed in the same parallel analysis, that are identified
as diploid samples with high confidence, for the analysis of
aneuploidy for the same chromosome or chromosome segment of
interest of other sample(s) in the set of on-test samples.
Accordingly, in one example of this embodiment, provided herein is
a method for determining a presence or absence of aneuploidy of a
chromosome or chromosome segment of interest in a test sample, that
includes the following steps:
[0014] a) obtaining genetic data for the chromosome or chromosome
segment of interest from each sample of a set of samples that
includes the test sample and at least one diploid sample, wherein
the genetic data is obtained from a parallel analysis of the set of
samples;
[0015] b) setting a bias model using the genetic data for the
chromosome or chromosome segment of interest in the diploid sample
determined to be disomic for the chromosome or chromosome segment
of interest;
[0016] c) adjusting the genetic data for the chromosome or
chromosome segment of interest for the test sample using the bias
model; and
[0017] d) establishing the presence or absence of aneuploidy for
the chromosome or chromosome segment of interest in the test sample
using the normalized data.
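Steps b) and c) above can be sketched as follows. The text does not define the form of the bias model, so this sketch assumes one simple possibility: a per-locus expected read proportion averaged over the diploid samples, with the test sample divided by it. The input format (per-locus read proportions relative to each sample's total reads) and the names `set_bias_model` and `normalize` are assumptions for illustration.

```python
from statistics import mean


def set_bias_model(diploid_samples):
    """Per-locus bias: mean read proportion at each locus across diploid samples.

    Assumption: each sample is a list of per-locus read proportions, in the
    same locus order, measured in the same parallel analysis.
    """
    n_loci = len(diploid_samples[0])
    if any(len(sample) != n_loci for sample in diploid_samples):
        raise ValueError("all samples must cover the same loci")
    return [mean(sample[locus] for sample in diploid_samples)
            for locus in range(n_loci)]


def normalize(test_sample, bias_model):
    """Divide the observed proportion at each locus by the diploid expectation."""
    return [observed / expected
            for observed, expected in zip(test_sample, bias_model)]


# Illustrative values only: two diploid samples and one test sample, three loci.
diploid = [[0.010, 0.020, 0.030],
           [0.012, 0.018, 0.030]]
bias = set_bias_model(diploid)
test_sample = [0.01155, 0.01995, 0.0315]        # about 5% above the bias model
normalized = normalize(test_sample, bias)
mean_normalized_depth = mean(normalized)
```

Under this assumed model a disomic test sample gives normalized values near 1.0 at every locus, so a consistent shift across loci is the signal passed on to step d).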
[0018] In certain illustrative examples of this embodiment, the at
least one diploid sample is determined to be disomic for the
chromosome or chromosome segment of interest by analyzing the
genetic data from the parallel analysis. In certain illustrative
examples, the diploid sample is determined to be disomic (i.e.
selected as being disomic) for the chromosome or chromosome segment
of interest without using a control chromosome or control
chromosome segment.
[0019] In certain examples of this embodiment of the invention, one
or two maximum likelihood analyses are used to carry out the
method. As disclosed above, the first maximum likelihood method can
be used to identify diploid samples in the set of samples and to
determine a first probability that the other samples in the set of
samples are aneuploid. Accordingly, in certain embodiments, one or
more or all of the chromosome(s) or chromosome segment(s) of
interest are determined to be disomic using a first maximum
likelihood method. The method includes the following steps:
creating, for each sample in the set of samples, a plurality of
first hypotheses wherein each first hypothesis is associated with a
specific copy number for the chromosome or chromosome segment of
interest, determining a first probability value for each first
hypothesis, wherein the first probability value indicates the
likelihood that the sample has the number of copies of the
chromosome or chromosome segment that is associated with the first
hypothesis, wherein the first probability values are derived from
the genetic data associated with the sample, and selecting at least
one diploid sample by selecting those one or more samples that most
closely match a disomic copy number hypothesis for the chromosome
or chromosome segment of interest, with at least a minimum level of
confidence. That is, by selecting those samples that yield the
highest probability of being disomic for the chromosome or
chromosome segment of interest.
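The selection of diploid samples described in this paragraph can be sketched as follows. The sketch assumes that a likelihood for each copy number hypothesis has already been computed per sample by some upstream model; the normalization to probabilities, the `min_confidence` default of 0.99, and the function names are illustrative assumptions, not values taken from the text.

```python
def hypothesis_probabilities(likelihoods):
    """Normalize per-hypothesis likelihoods so that they sum to 1."""
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("likelihoods must have a positive sum")
    return {copy_number: value / total
            for copy_number, value in likelihoods.items()}


def select_diploid_samples(sample_likelihoods, min_confidence=0.99, disomy=2):
    """Keep samples whose disomy probability meets the minimum confidence.

    min_confidence=0.99 is a hypothetical cutoff chosen for this example.
    """
    selected = []
    for sample_id, likelihoods in sample_likelihoods.items():
        probabilities = hypothesis_probabilities(likelihoods)
        if probabilities.get(disomy, 0.0) >= min_confidence:
            selected.append(sample_id)
    return selected


# Illustrative likelihoods keyed by copy number (1, 2, 3) for three samples.
batch = {
    "s1": {1: 0.001, 2: 0.998, 3: 0.001},
    "s2": {1: 0.010, 2: 0.500, 3: 0.490},
    "s3": {1: 1e-6, 2: 2e-3, 3: 1e-6},
}
diploid_ids = select_diploid_samples(batch)
```

Samples that pass the cutoff would serve as the diploid samples for the bias model, and the remaining samples would be analyzed against them.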
[0020] In certain embodiments, the method includes at least two
maximum likelihood analyses, and the presence or absence of
aneuploidy, or the number of copies of the chromosome or chromosome
segment of interest, is determined by creating a plurality or set of
second hypotheses, also called ploidy hypotheses herein, wherein
each second hypothesis is associated with a specific copy number of
the chromosome or chromosome segment of interest in the target
cell. The models are then used to test how well the genetic data
from each patient fits each second hypothesis. The goodness of
fit for each second hypothesis is determined. A second
probability value is calculated for each second hypothesis wherein
the second probability value indicates the likelihood that the
genome of the target cell has the number of chromosomes or
chromosome segments that is specified by the second hypothesis.
Thus by selecting the second hypothesis with the maximum
likelihood, one may determine the copy number for the chromosome or
chromosome segment in the genome of the target cell. Such first and
second hypotheses can be considered in combination to increase the
confidence of the aneuploidy determination.
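The maximum likelihood selection among ploidy hypotheses can be sketched as follows. The goodness-of-fit model is not specified at this point in the text, so the sketch makes two labeled assumptions: the expected normalized depth under copy number c is 1 + f(c - 2)/2, where f is the fraction of DNA from the target cell (for example a fetal fraction), and measurement noise is Gaussian with a known standard deviation `sigma`. All function names are hypothetical.

```python
import math


def expected_depth(copy_number, target_fraction):
    """Assumed mixture model: background DNA is disomic, target DNA has
    `copy_number` copies, so the normalized depth shifts by f * (c - 2) / 2."""
    return 1.0 + target_fraction * (copy_number - 2) / 2.0


def log_likelihood(observed, expected, sigma):
    """Gaussian log-density of the observed depth (assumed noise model)."""
    return (-0.5 * ((observed - expected) / sigma) ** 2
            - math.log(sigma * math.sqrt(2.0 * math.pi)))


def call_copy_number(observed_depth, target_fraction, sigma,
                     copy_numbers=(1, 2, 3)):
    """Return the maximum likelihood copy number and per-hypothesis probabilities."""
    log_l = {c: log_likelihood(observed_depth,
                               expected_depth(c, target_fraction), sigma)
             for c in copy_numbers}
    # Subtract the peak before exponentiating to avoid numerical underflow.
    peak = max(log_l.values())
    weights = {c: math.exp(value - peak) for c, value in log_l.items()}
    total = sum(weights.values())
    probabilities = {c: w / total for c, w in weights.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


# Illustrative values only: 10% target fraction, noise sigma of 0.01.
best_copy_number, ploidy_probabilities = call_copy_number(1.05, 0.10, 0.01)
```

With these illustrative inputs the observed depth sits on the three-copy expectation, so the three-copy hypothesis receives nearly all of the probability mass.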
[0021] In another embodiment, provided herein is a method for
determining a presence or absence of aneuploidy for a first
chromosome or chromosome segment of interest in a test sample from
a test subject, that includes the following steps: obtaining genetic
sequencing data from a parallel analysis of the first chromosome or
chromosome segment of interest from cell free DNA from each sample
in a set of liquid samples comprising the test sample, wherein the
set of liquid samples comprises at least 3 samples and wherein the
genetic sequencing data determines an amount of DNA corresponding
to each locus in a first set of loci present on the first
chromosome or chromosome segment of interest respectively;
selecting a diploid subset of samples from the set of liquid
samples, wherein the diploid subset of samples are samples that are
initially determined to be disomic for the first chromosome or
chromosome segment of interest using an initial bias model, wherein
the subset of samples comprises at least 2 samples; setting a
confirmatory bias model from the genetic data from the first
chromosome or chromosome segment of interest from the diploid
subset of patients; adjusting the genetic data for the test subject
using the confirmatory bias model, to give normalized genetic data
for the test subject; and determining, using the normalized data,
whether genetic data from the test subject is indicative of an
aneuploidy in the first chromosome or chromosome segment of
interest.
[0022] In another embodiment, a method of the invention includes
both a non-allelic z-score based quantitative method and a maximum
likelihood method based on allelic or non-allelic data.
Accordingly, provided herein is a method for detecting a presence
or absence of aneuploidy of a chromosome or chromosome segment of
interest in a test sample, that includes the following steps:
obtaining genetic data for the chromosome or chromosome segment of
interest from each sample in a set of samples comprising the test
sample, wherein the genetic data is obtained from a parallel
analysis of the samples; determining whether aneuploidy is present
in the test sample by a first method comprising:
[0023] a. determining a depth of reads or a proportion of reads
that map to the chromosome or chromosome segment of interest;
[0024] b. calculating a z-score for the depth of reads or the
proportion of reads that map to the chromosome or chromosome
segment of interest; and
[0025] c. determining whether the test sample is aneuploid at the
chromosome or chromosome segment of interest based on the z-score,
thereby providing a first result;
and determining whether aneuploidy is present in the test sample by
a second method comprising:
[0026] d. creating a plurality of ploidy hypotheses wherein each
ploidy hypothesis is associated with a specific copy number for the
chromosome or chromosome segment of interest,
[0027] e. determining a ploidy probability value for each ploidy
hypothesis, wherein the ploidy probability value indicates the
likelihood that the test sample has the specific copy number for
the chromosome or chromosome segment of interest that is associated
with the ploidy hypothesis, and
[0028] f. determining which ploidy hypothesis is most likely to be
correct by selecting the ploidy hypothesis with the maximum
likelihood, thereby providing a second result;
and detecting the aneuploidy by considering the first result and
the second result.
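The z-score step and the combination of the two results can be sketched as follows. The z-score here is the usual standard score of the test sample against reference samples from the same run; the threshold of 3.0 is a hypothetical cutoff, and how a z-score is converted into a likelihood is left open because the text does not specify it. The `combined_likelihood` function follows the formula recited in claim 5, R1R2/[R1R2+(1-R1)(1-R2)], with R1 and R2 supplied by the caller.

```python
from statistics import mean, stdev


def z_score(test_value, reference_values):
    """Standard score of the test sample's read proportion against references."""
    return (test_value - mean(reference_values)) / stdev(reference_values)


def exceeds_threshold(z, threshold=3.0):
    """First result as a yes/no call; threshold=3.0 is a hypothetical cutoff."""
    return z > threshold


def combined_likelihood(r1, r2):
    """Combine two aneuploidy likelihoods as R1*R2 / [R1*R2 + (1-R1)*(1-R2)]."""
    numerator = r1 * r2
    denominator = numerator + (1.0 - r1) * (1.0 - r2)
    if denominator == 0.0:
        raise ValueError("results are fully contradictory (one is 0, the other 1)")
    return numerator / denominator


# Illustrative read proportions for the chromosome of interest.
references = [0.0100, 0.0102, 0.0098, 0.0101, 0.0099]
z = z_score(0.0105, references)
first_call = exceeds_threshold(z)
combined = combined_likelihood(0.9, 0.8)
```

A property of the formula worth noting: when one method is uninformative (a likelihood of 0.5), the combined value equals the other method's likelihood, and two results that agree reinforce each other.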
[0029] In certain illustrative examples of the above embodiments,
the sample is a liquid sample, such as a serum sample. The genetic
data, in these examples, can be derived from circulating DNA, such
as circulating fetal DNA or circulating tumor DNA.
[0030] In certain examples of any of the above embodiments, the
method further includes estimating a fetal fraction for each sample
in the set of samples, wherein the fetal fraction is used in the
selecting the diploid subset of samples and/or the determining
whether the genetic data from the test subject is indicative of an
aneuploidy.
BRIEF DESCRIPTION OF THE DRAWINGS
[0031] The presently disclosed embodiments will be further
explained with reference to the attached drawings, wherein like
structures are referred to by like numerals throughout the several
views. The drawings shown are not necessarily to scale, with
emphasis instead generally being placed upon illustrating the
principles of the presently disclosed embodiments.
[0032] FIG. 1 is a flow chart of a method according to one
embodiment of the invention.
[0033] FIG. 2 shows an example system architecture 200 useful for
performing embodiments of the present invention.
[0034] FIG. 3 illustrates an example computer system for performing
embodiments of the present invention.
[0035] While the above-identified drawings set forth presently
disclosed embodiments, other embodiments are also contemplated, as
noted in the discussion. This disclosure presents illustrative
embodiments by way of representation and not limitation. Numerous
other modifications and embodiments can be devised by those skilled
in the art which fall within the scope and spirit of the principles
of the presently disclosed embodiments.
DETAILED DESCRIPTION
[0036] Some embodiments of the present invention utilize the fact
that, in a typical analysis for aneuploidy, a set of blood samples,
for example a set of blood samples from pregnant mothers in an NIPT
assay, or from cancer patients in an analysis of circulating free
tumor DNA (cfDNA), is run in parallel in the same set of assays,
wherein much and probably most DNA in the samples originates from
diploid cells. Thus, not only can on-test samples be identified
that are diploid with high confidence and used as controls for
analysis of other samples in the set of samples without the need
for running additional control samples, in many embodiments of the
present invention, the chromosome or chromosome segment of interest
in test samples that are initially identified as diploid with high
confidence, are used as controls for subsequent methods that
analyze the rest of the set of samples for aneuploidy at that same
chromosome(s) or chromosome segment(s) of interest. This
substantially reduces the cost of such analysis and substantially
improves the confidence in the analysis since the on-test and
control data comes from the same parallel analysis of the same
chromosome or same chromosome segment of interest.
[0037] Accordingly, provided herein are numerous methods and
systems for determining the copy number, or detecting aneuploidy of
a chromosome or chromosome segment of interest in a cell of
interest that are performed using the chromosome or chromosome
segment of interest to set a bias model, that is to set test
parameters, or to set a threshold cutoff value, using samples
analyzed in the same parallel analysis, that are identified as
diploid samples with high confidence, for the analysis of
aneuploidy for the same chromosome or chromosome segment of
interest of other sample(s) in the set of on-test samples. The
subject methods and systems can employ high throughput DNA
sequencers (capable of sequencing large numbers of DNA templates in
parallel) so as to produce quantitative information about the
amount of various DNA sequences of interest in a set of samples
obtained from a test subject. This quantitative sequence
information can be used to determine the copy number of a
chromosome or chromosome segment of interest in a cell of interest,
e.g., a cell of a developing fetus or a tumor cell.
[0038] The term "depth of read" as used herein refers to the number
of sequencing reads that map to a given locus. The depth of read
may be normalized over the total number of reads. When "depth of
read" refers to a sample, it may mean the average depth of read
over the targeted loci. When "depth of read" refers to a locus, it
may refer to the number of reads measured by the sequencer mapping
to that locus. In general, the greater the depth of read of a
locus, the closer the ratio of alleles at the locus will tend to be
to the ratio of alleles in the original sample of DNA.
[0039] The term "relative fraction" can also be used to express a
similar concept as depth of read. Depth of read can be expressed in
a variety of different ways, including but not limited to the
percentage or proportion. Thus, for example, in a highly parallel DNA
sequencer such as an Illumina HISEQ, which for example would
produce sequences of 1 million clones, the sequencing of one locus
3,000 times, would result in a depth of read of 3,000 reads at that
locus. The proportion of reads at that locus would be 3,000 divided
by 1 million total reads, or 0.3% of the total reads.
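The arithmetic in this example can be written out as a short sketch. The function names and the locus labels are hypothetical; the sample-level depth follows the definition above of an average over the targeted loci.

```python
from statistics import mean


def read_proportion(reads_at_locus, total_reads):
    """Depth of read at one locus expressed as a proportion of all reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return reads_at_locus / total_reads


def sample_depth_of_read(reads_per_locus):
    """Sample-level depth of read: average depth over the targeted loci."""
    return mean(reads_per_locus.values())


# The example from the text: 3,000 reads at one locus out of 1 million reads.
proportion = read_proportion(3_000, 1_000_000)
percentage = 100 * proportion

# Hypothetical loci, for the sample-level average.
average_depth = sample_depth_of_read({"locus_a": 3_000,
                                      "locus_b": 1_000,
                                      "locus_c": 2_000})
```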
[0040] The term "allelic data" as used herein means a quantitative
measurement indicative of the number of copies of a specific allele
of a polymorphic locus. Typically, quantitative measurements will
be obtained for all possible alleles of the polymorphic locus of
interest. In some embodiments, the polymorphic locus is a SNP, and
the SNP is dimorphic, and the allelic data will comprise the
quantity of each of the two alleles observed at that locus. In some
embodiments, the polymorphic locus is a SNP, and the SNP is
trimorphic or tetramorphic, and the allelic data will comprise the
quantity of each of the three or four alleles observed at that
locus. The allelic data may be obtained using a variety of
well-known molecular biology techniques such as DNA sequencing or
real-time PCR. High throughput DNA sequencing, in which the number
of individual reads of a given locus is obtained, can be used to
obtain allelic data. When the allelic data is measured using
high-throughput sequencing, the allelic data will typically
comprise the number of reads of each allele mapping to the locus of
interest.
[0041] The term "non-allelic data" as used herein means a
quantitative measurement indicative of the number of copies of a
specific locus. The locus may be polymorphic or non-polymorphic. If
the locus is non-polymorphic, the non-allelic data will not contain
information about the relative or absolute quantity of the
individual alleles that may be present at that locus. Typically,
quantitative measurements will be obtained for all possible alleles
of the polymorphic locus of interest. The allelic data may be
obtained using a variety of well-known molecular biology techniques
such as DNA sequencing or real-time PCR. High throughput DNA
sequencing, in which the number of individual reads of a given locus
is obtained, can be used to obtain allelic data. Non-allelic data
for a polymorphic locus may be obtained by summing the quantitative
allelic data for each allele at that locus. When the allelic data is
measured using high-throughput sequencing, the non-allelic data
will typically comprise the number of reads mapping to the locus
of interest. The sequencing measurements could indicate the
relative and/or absolute number of each of the alleles present at
the locus, and the non-allelic data would comprise the sum of the
reads, regardless of the allelic identity, mapping to the locus.
Note that it is possible to measure the DNA at a plurality of loci,
for example using high throughput sequencing, to yield allelic
data; it is then possible, by summing the number of reads that
correspond to each allele, at each locus, to produce non-allelic
data. In some embodiments the same set of measurements can be used
to yield both allelic data and non-allelic data. In some
embodiments, the produced allelic data can be used as part of a
method to determine copy number at a chromosome of interest, and
the produced non-allelic data can be used as part of a method to
determine copy number at a chromosome of interest, where the two
methods are statistically orthogonal.
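By way of a non-limiting illustration, the relationship between allelic data and non-allelic data described in this paragraph can be sketched as follows. The sketch assumes that per-allele read counts are already available as a mapping from locus to allele counts; the locus names, the read counts, and the helper name non_allelic_from_allelic are hypothetical and are not taken from the disclosure.

```python
from typing import Dict

# Hypothetical allelic data: reads per allele at each locus (illustrative values).
AllelicData = Dict[str, Dict[str, int]]


def non_allelic_from_allelic(allelic: AllelicData) -> Dict[str, int]:
    """Sum reads over all alleles at each locus, ignoring allele identity."""
    return {locus: sum(counts.values()) for locus, counts in allelic.items()}


allelic_data = {
    "locus_1": {"A": 120, "G": 80},         # dimorphic SNP
    "locus_2": {"C": 50, "T": 45, "G": 5},  # trimorphic SNP
}
non_allelic_data = non_allelic_from_allelic(allelic_data)
```

In this sketch the same set of measurements yields both data types: allelic_data keeps the per-allele counts, while non_allelic_data keeps only the per-locus totals.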
[0042] The term "chromosomal abnormality" as used herein refers to
any deviation in the copy number of a specific chromosome or
chromosome segment from the most common number of copies of that
segment or chromosome. For example, in a human somatic cell, any
deviation from 2 copies could be regarded as a chromosomal
abnormality.
[0043] The term "obtaining genetic data" as used herein refers to
both, unless specifically or implicitly indicated otherwise by
context, (1) acquiring DNA sequence information by laboratory
techniques, e.g. use of an automated high throughput DNA sequencer,
and (2) acquiring information that had been previously obtained by
laboratory techniques, wherein the information is electronically
transmitted, e.g. by computer over the Internet, by electronic
transfer from the sequencing device, etc.
[0044] The term "target cell" as used herein refers to the cell (or
cell type) that contains the chromosomes or chromosome segments
that are to be quantitatively measured as a result of the subject
methods. Examples of target cells include fetal cells and tumor
cells. As the cells of most individuals contain a nearly identical
set of nuclear DNA, the term "target cell" may be used
interchangeably with the term "individual."
[0045] The term "non-target cell" as used herein refers to a cell
(or cell type) that supplies DNA that is analyzed in the process of
performing the subject methods, but is not the cell that contains
the chromosomes or chromosome segments that are required to be
quantitatively measured as a result of the subject methods. In some
embodiments, the "non-target cell" may be closely related to the
"target cell", for example if a prostate tumor cell is the target
cell a noncancerous prostate cell from the same individual may
(although not necessarily) be used as a "non-target cell".
Alternately, in the case where the measurements are made on a
mixture of cfDNA taken from a pregnant woman, the target cell could
be from the placenta of a fetus gestating in the mother, and the
non-target cells could be from the mother of the fetus. Typically,
non-target cells are euploid, though this is not required.
[0046] Methods for measuring chromosome copy number in fetal cells
based on counting the number of DNA sequence-based reads that map to a
given chromosome or chromosome segment are conveniently referred to
as "counting methods", or "quantitative methods" for analyzing
chromosome copy number or chromosome segment copy number. Examples
of such methods can be found, among other places, in published
patent application US 2013/0172211 A1; U.S. Pat. No. 8,008,018; U.S.
Pat. No. 8,467,976 B2; and US published patent application US
2012/0003637 A1. Such methods typically involve creation of a
reference value (cut-off value) for the number of DNA sequence
reads mapping to a specific chromosome, wherein a number of reads
in excess of the value is indicative of a specific genetic
abnormality.
[0047] Confidence refers to the statistical likelihood that the
called SNP, allele, set of alleles, ploidy call, or determined
number of chromosome segment copies correctly represents the real
genetic state of the individual.
[0048] Ploidy Calling, also "Chromosome Copy Number Calling," or
"Copy Number Calling" (CNC), refers to the act of determining the
quantity and chromosomal identity of one or more chromosomes
present in a cell.
[0049] Aneuploidy refers to the state where the wrong number of
chromosomes are present in a cell. In the case of a somatic human
cell it refers to the case where a cell does not contain 22 pairs
of autosomal chromosomes and one pair of sex chromosomes. In the
case of a human gamete, it refers to the case where a cell does not
contain one of each of the 23 chromosomes. In the case of a single
chromosome, it refers to the case where more or fewer than two
homologous but non-identical chromosomes are present, and where
each of the two chromosomes originate from a different parent.
[0050] Ploidy State refers to the quantity and chromosomal identity
of one or more chromosomes in a cell.
[0051] Allelic Data refers to a set of genotypic data concerning a
set of one or more alleles. It may refer to the phased, haplotypic
data. It may refer to SNP identities, and it may refer to the
sequence data of the DNA, including insertions, deletions, repeats
and mutations. It may include the parental origin of each
allele.
[0052] Allelic Distribution refers to the distribution of the set
of alleles observed at a set of loci. An allelic distribution for
one locus is an allele ratio.
[0053] Allelic Distribution Pattern refers to a set of different
allele distributions for different parental contexts. Certain
allelic distribution patterns may be indicative of certain ploidy
states.
[0054] Allelic Bias refers to the degree to which the measured
ratio of alleles at a heterozygous locus is different to the ratio
that was present in the original sample of DNA. The degree of
allelic bias at a particular locus is equal to the observed
allelic ratio at that locus, as measured, divided by the ratio of
alleles in the original DNA sample at that locus. Allelic bias may
be defined to be greater than one, such that if the calculation of
the degree of allelic bias returns a value, x, that is less than 1,
then the degree of allelic bias may be restated as 1/x.
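By way of a non-limiting illustration, the degree of allelic bias defined in this paragraph can be computed as sketched below. The function name allelic_bias and the example ratios are hypothetical; the only logic taken from the definition is the division of the observed ratio by the original ratio and the restatement of a value x less than 1 as 1/x.

```python
def allelic_bias(observed_ratio: float, original_ratio: float) -> float:
    """Observed allele ratio divided by the original ratio, restated to be >= 1."""
    if observed_ratio <= 0 or original_ratio <= 0:
        raise ValueError("allele ratios must be positive")
    x = observed_ratio / original_ratio
    return x if x >= 1 else 1 / x


# Hypothetical heterozygous locus: the original DNA has a 1:1 allele ratio,
# but the measurement reports 0.8 (for example 80 reads versus 100 reads).
bias = allelic_bias(observed_ratio=0.8, original_ratio=1.0)
```

Under these assumed inputs the raw quotient is 0.8, so the degree of allelic bias is restated as 1/0.8.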
[0055] Haplotype refers to a combination of alleles at multiple
loci that are transmitted together on the same chromosome.
Haplotype may refer to as few as two loci or to an entire
chromosome depending on the number of recombination events that
have occurred between a given set of loci. Haplotype can also refer
to a set of single nucleotide polymorphisms (SNPs) on a single
chromatid that are statistically associated.
[0056] Haplotypic Data, also "Phased Data" or "Ordered Genetic
Data," refers to data from a single chromosome in a diploid or
polyploid genome, i.e., either the segregated maternal or paternal
copy of a chromosome in a diploid genome.
[0057] Phasing refers to the act of determining the haplotypic
genetic data of an individual given unordered, diploid (or
polyploid) genetic data. It may refer to the act of determining
which of two genes at an allele, for a set of alleles found on one
chromosome, are associated with each of the two homologous
chromosomes in an individual.
[0058] Phased Data refers to genetic data where the haplotype has
been determined.
[0059] Target Individual refers to the individual whose genetic
data is being determined. In one context, only a limited amount of
DNA is available from the target individual. In one context, the
target individual is a fetus. In some embodiments, there may be
more than one target individual. In some embodiments, each fetus
that originated from a pair of parents may be considered to be a
target individual.
[0060] Child is used interchangeably with the terms embryo,
blastomere, and fetus. Note that in the presently disclosed
embodiments, the concepts described apply equally well to
individuals who are a born child, a fetus, an embryo or a set of
cells therefrom. The use of the term child may simply be meant to
connote that the individual referred to as the child is the genetic
offspring of the parents.
[0061] Parental Context refers to the genetic state of a given SNP,
on each of the two relevant chromosomes for each of the two parents
of the target.
[0062] Primary Genetic Data refers to the analog intensity signals
that are output by a genotyping platform. In the context of SNP
arrays, primary genetic data refers to the intensity signals before
any genotype calling has been done. In the context of sequencing,
primary genetic data refers to the analog measurements, analogous
to the chromatogram, that come off the sequencer before the
identity of any base pair has been determined, and before the
sequence has been mapped to the genome.
[0063] Secondary Genetic Data refers to processed genetic data that
are output by a genotyping platform. In the context of a SNP array,
the secondary genetic data refers to the allele calls made by
software associated with the SNP array reader, wherein the software
has made a call whether a given allele is present or not present in
the sample. In the context of sequencing, the secondary genetic
data refers to the data obtained after the base pair identities of
the sequences have been determined, and possibly also after the
sequences have been mapped to the genome.
[0064] Joint Distribution Model refers to a model that defines the
probability of events defined in terms of multiple random
variables, given a plurality of random variables defined on the
same probability space, where the probabilities of the variables are
linked.
Methods for Determining Aneuploidy by using Data for a Chromosome
of Interest from a Diploid Sample(s) to Set a Bias Model for Other
Samples in a Parallel Analysis
[0065] Provided herein in one embodiment are methods and systems
for determining the copy number of, or detecting aneuploidy of, a
chromosome or chromosome segment of interest in a cell of interest.
These methods use the chromosome or chromosome segment of interest
itself to set a bias model, that is, to set test parameters, from
samples analyzed in the same parallel analysis that are identified
with high confidence as diploid samples. The bias model is then
used in the analysis of aneuploidy for the same chromosome or
chromosome segment of interest in other sample(s) in the set of
on-test samples.
Accordingly, in one example of this embodiment, provided herein is
a method for determining a presence or absence of aneuploidy of a
chromosome or chromosome segment of interest in a test sample, that
includes the following steps:
[0066] a) obtaining genetic data for the chromosome or chromosome
segment of interest from each sample of a set of samples that
includes the test sample and at least one diploid sample, wherein
the genetic data is obtained from a parallel analysis of the set of
samples;
[0067] b) setting a bias model using the genetic data for the
chromosome or chromosome segment of interest in the diploid sample
determined to be disomic for the chromosome or chromosome segment
of interest;
[0068] c) adjusting the genetic data for the chromosome or
chromosome segment of interest for the test sample using the bias
model; and
[0069] d) establishing the presence or absence of aneuploidy for
the chromosome or chromosome segment of interest in the test sample
using the normalized data.
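By way of a non-limiting illustration, steps a) through d) can be sketched with a simple z-score calculation. This is only one possible reading and not the disclosed implementation: it assumes the genetic data for each sample has already been reduced to the fraction of that sample's reads mapping to the chromosome of interest, and it uses the mean and standard deviation of the diploid samples as a very small bias model. The sample identifiers, the fractions, and the cutoff Z_CUTOFF are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def fit_control_value(fractions: Dict[str, float], diploid_ids: List[str]) -> Tuple[float, float]:
    """Step b): mean and standard deviation of the chromosome-of-interest
    read fraction among samples already identified as diploid."""
    reference = [fractions[s] for s in diploid_ids]
    return mean(reference), stdev(reference)  # stdev needs at least 2 samples


def z_score(fraction: float, mu: float, sigma: float) -> float:
    """Step c): adjust a sample's fraction against the diploid reference."""
    return (fraction - mu) / sigma


# Step a): hypothetical fractions of reads mapping to the chromosome of
# interest, all obtained from the same parallel run.
fractions = {"d1": 0.0130, "d2": 0.0131, "d3": 0.0129, "d4": 0.0130, "test": 0.0140}
mu, sigma = fit_control_value(fractions, ["d1", "d2", "d3", "d4"])
z_test = z_score(fractions["test"], mu, sigma)

Z_CUTOFF = 3.0  # assumed threshold, not taken from the disclosure
aneuploidy_suspected = abs(z_test) > Z_CUTOFF  # step d)
```

Note that the reference values come from the chromosome of interest in the diploid samples of the same run, rather than from a separate control chromosome.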
[0070] In the present embodiment of the invention for determining
the presence or absence of aneuploidy, the set of samples comprises
the test sample and a subset of high probability diploid samples
that includes at least one diploid sample. The subset of samples
can include, for example, 1-1,056 samples. In illustrative methods
the set or subset can be made up of 2, 3, 4, 5, 10, 20, 25, 30, 40,
50, 95, 96, 100, 150, 200, 250, 500, 750, 959, 960, 1046, 1050,
1055, 1056, or 1500 samples on the low end of the range, and 3, 4,
5, 10, 20, 25, 30, 40, 50, 95, 96, 100, 150, 200, 250, 500, 750,
959, 960, 1046, 1050, 1051, 1055, 1056, 1150, 1500, 2000, or 2500
samples on the high end of the range. The set is at least 1 sample
more than the subset, and can be 2, 3, 4, 5, 10, 15, 20, 25, 30,
40, 47, 50, 95, 100, 150, 200, 250, 500, 750, or 1000 samples more
in certain embodiments.
[0071] In certain examples of the invention, at least one sample
known to be diploid is used as a control, and run alongside one or
more target samples. For example, 2, 3, 4, 5, 10, 15, 20, 25, 30,
35, 40 or 50 control samples identified in advance of the run to be
diploid can be run alongside on-test samples. Any of the
analytical methods disclosed herein can then be used to determine
the presence or absence of aneuploidy in one or more test
samples.
[0072] In certain illustrative examples of this embodiment, the at
least one diploid sample is determined to be disomic for the
chromosome or chromosome segment of interest by analyzing the
genetic data from the parallel analysis. In certain examples, an
initial or first analytical technique identifies samples that are
disomic for one or more chromosome regions with high confidence.
The identities of these diploid samples, which are disomic at all
of the chromosomes or chromosome segments of interest, are then
used to set a bias model for a different analytical technique, or a
second run of the same analytical technique. Thus, for examples
where a sample is a serum sample, the present invention provides an
advantage in that fewer sequencing reads, and accordingly lower
cost, are associated with performing the method. This is the result
of the fact that for serum samples in methods analyzing circulating
free DNA, especially circulating fetal DNA or circulating tumor
DNA, many if not most of the samples in a parallel run contain DNA
originating from only diploid cells. In illustrative methods of the
present invention, at least some of these samples are identified in
an initial analysis, and their identities are used in the analysis
of the other samples in the set of samples being analyzed.
[0073] Some embodiments of the invention employ the step of
selecting, determining or identifying a subset of patients from a
larger set of patients. The original set of patients is used as the
source of target samples containing DNA from target cells and
non-target samples containing DNA from non-target cells for
analysis. A skilled artisan will understand that numerous methods
are known in the art for obtaining genetic data for a chromosome or
chromosome segment of interest from a set of samples in a parallel
analysis.
[0074] In some embodiments of the invention, the DNA samples
obtained are modified using standard molecular biology techniques
in order to be sequenced on a DNA sequencer. In some embodiments
the technique will involve forming a genetic library containing
priming sites for the DNA sequencing procedure. In some
embodiments, a plurality of loci may be targeted for site specific
amplification. In some embodiments the targeted loci are
polymorphic loci, e.g., single nucleotide polymorphisms. In
embodiments employing the formation of genetic libraries, libraries
may be encoded using a DNA sequence that is specific for the
patient, e.g. barcoding, thereby permitting multiple patients to be
analyzed in a single flow cell (or flow cell equivalent) of a high
throughput DNA sequencer. Although the samples are mixed together
in the DNA sequencer flow cell, the determination of the sequence
of the barcode permits identification of the patient source that
contributed the DNA that had been sequenced.
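By way of a non-limiting illustration, the use of barcodes to assign pooled reads back to their patient source can be sketched as follows. The sketch assumes a fixed-length barcode at the start of each read and exact barcode matching; the barcode length, the barcode sequences, the patient labels, and the function name demultiplex are all hypothetical, and real protocols may place barcodes elsewhere and tolerate sequencing errors.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

BARCODE_LENGTH = 6  # assumed fixed-length barcode at the start of each read


def demultiplex(reads: Iterable[str], barcode_to_patient: Dict[str, str]) -> Dict[str, List[str]]:
    """Group reads by patient using the leading barcode.

    Reads whose barcode is not recognized are collected under "unassigned".
    """
    by_patient: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LENGTH], read[BARCODE_LENGTH:]
        patient = barcode_to_patient.get(barcode, "unassigned")
        by_patient[patient].append(insert)
    return dict(by_patient)


# Hypothetical pooled reads from a single flow cell.
pooled_reads = ["ACGTACTTGACC", "TTGCAAGGCTAA", "ACGTACAAAA", "GGGGGGCCCC"]
barcodes = {"ACGTAC": "patient_1", "TTGCAA": "patient_2"}
reads_by_patient = demultiplex(pooled_reads, barcodes)
```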
[0075] Methods are known in the art for obtaining genetic data from
a sample. Typically this involves amplification of DNA in the
sample, a process which transforms a small amount of genetic
material to a larger amount of genetic material that contains a
similar set of genetic data. This can be done by a wide variety of
methods, including, but not limited to, Polymerase Chain Reaction
(PCR), ligation-mediated PCR, degenerate oligonucleotide primer
PCR, Multiple Displacement Amplification, allele-specific
amplification techniques, Molecular Inversion Probes (MIP), padlock
probes, other circularizing probes, and combinations thereof. Many
variants of the standard protocol can be used, for example
increasing or decreasing the times of certain steps in the
protocol, increasing or decreasing the temperature of certain
steps, increasing or decreasing the amounts of various reagents,
etc. The DNA amplification transforms the initial sample of DNA
into a sample of DNA that is similar in the set of sequences, but
of much greater quantity. In some cases, amplification may not be
required. Provided herein, in the sample preparation section, are
detailed teachings about isolation and amplification of DNA from a
sample.
[0076] The genetic data of the target individual and/or of the
related individual can be transformed from a molecular state to an
electronic state by measuring the appropriate genetic material
using tools and/or techniques taken from a group including, but not
limited to, genotyping microarrays and high throughput sequencing.
Some high throughput sequencing methods and systems include Sanger
DNA sequencing, pyrosequencing, the ILLUMINA SOLEXA platform,
ILLUMINA's GENOME ANALYZER, ILLUMINA's HISEQ or MISEQ, APPLIED
BIOSYSTEM's SOLiD platform, ION TORRENT'S PGM or PROTON platforms,
HELICOS's TRUE SINGLE MOLECULE SEQUENCING platform, HALCYON
MOLECULAR's electron microscope sequencing method, or any other
sequencing method. All of these methods physically transform the
genetic data stored in a sample of DNA into a set of genetic data
that is typically stored in a memory device en route to being
processed.
[0077] Any relevant individual's genetic data can be obtained from
the following: the individual's bulk diploid tissue, one or more
diploid cells from the individual, one or more haploid cells from
the individual, one or more blastomeres from the target individual,
extra-cellular genetic material found on the individual,
extra-cellular genetic material from the individual found in
maternal blood or the blood of a cancer patient, cells from the
individual found in maternal blood, one or more embryos created
from (a) gamete(s) from the related individual, one or more
blastomeres taken from such an embryo, extra-cellular genetic
material found on the related individual, genetic material known to
have originated from the related individual, and combinations
thereof. In illustrative embodiments, methods provided herein are
used to analyze free DNA originating from the genome of a target
sample from a target cell, such as a fetal cell or tumor cell.
[0078] It will be appreciated by those of ordinary skill in the art
that in those embodiments of the invention in which the target DNA
is not enriched for specific loci, the entire genome may be
sequenced, although assembly of the sequence into a complete
genome is not required for use of the subject methods. Allelic data
about specific loci may be readily determined from whole genome
sequencing. Cell free DNA may be conveniently analyzed in
commercially available high throughput DNA sequencers. Such high
throughput DNA sequencers may also be used in embodiments of the
invention employing the targeted amplification of loci of interest,
including polymorphic loci.
[0079] The term "cell free DNA" as used herein refers to DNA that
is available for analysis without requiring the step of lysing
cells. Cell free DNA can be found in blood or other bodily fluids.
Cell free DNA may be obtained from a variety of tissues. Such
tissues may be tissues that are in liquid form such as blood,
lymph, ascites fluid, cerebrospinal fluid, and the like. Cell
free DNA may be from a variety of cellular sources. In some cases
the cell free DNA will be comprised of DNA derived from fetal
cells. The cell free DNA may be a mixture of DNA derived from
target cells and non-target cells. In the case of the analysis of
DNA for fetal aneuploidies, in some embodiments the cell free DNA
may be obtained from the blood of the pregnant woman, wherein the
cell free DNA comprises a mixture of maternally derived cell free
DNA and fetally derived cell free DNA. In other embodiments, the
cell free DNA may be derived from a cancerous tumor cell. The cell
free DNA may comprise a mixture of cell free DNA derived from the
tumor cell and cell free DNA derived from non-tumor cells elsewhere
in the body.
[0080] Genetic data, e.g., DNA sequence data, can be obtained from
a mixture of DNA comprising DNA derived from one or more target
cells and DNA derived from one or more non-target cells. The target
cells and non-target cells differ with respect to one another at
the genomic level, as well as by virtue of other criteria. The term
"derived" is used to indicate that the cells are the ultimate
source of the DNA. Thus, for example, cell-free DNA obtained from
maternal blood of a pregnant woman is derived from cells from the
placenta of the fetus, which are typically genetically identical to
the fetus itself, and from the mother's cells. The method employs a
set of patients.
[0081] The genetic data is obtained from each member of the patient
set typically in a parallel biochemical analysis (i.e. a single
assay run). Each patient in the set of patients is analyzed using
essentially the same method of nucleic acid sequence analysis, e.g., the
same amplification and sequencing reagents analyzed at the same
time on the same run of the same instruments. In some embodiments,
all of the samples in the set of samples are mixed together and
analyzed so that the analysis conditions will be essentially
identical; the analysis of the mixed samples may be termed an
experiment, or a sample run or a parallel analysis. In some
embodiments, the samples may be mixed prior to amplification. In
some embodiments the samples may be mixed after some amplification
steps but before other amplification steps. In some embodiments,
the samples may be mixed after the amplification steps, but before
the sequencing step. Methods for barcoding, known in the art and
further discussed herein, help to facilitate simultaneous analysis
of multiple samples because the identity of a sample can be
determined by a barcode sequence associated with nucleic acids
derived from that sample.
[0082] In some embodiments, especially those involving methods that
provide likelihoods of a ploidy state using hypothesis testing,
genetic information is obtained at a plurality of loci. In some
embodiments, at least some, and possibly all of the loci are
polymorphic. In some embodiments, all of the loci are
non-polymorphic. The same loci are analyzed in both the target and
non-target cells. A number of sequence reads is obtained for each
locus. In some embodiments the number of each allele at a given
locus is quantitated. The quantitative data obtained can be from a
combination of the loci from the target cell and the non-target
cell genomes. Accordingly, in some embodiments, the genetic data
provides an amount of DNA corresponding to each locus in a set of
loci wherein the loci are present on the chromosome or chromosome
segment of interest. In illustrative examples, each chromosome or
chromosome segment of interest can include 10, 15, 20, 25, 30, 40,
50, 100, 250, 500, 1000, 1500, 2000, 2500, 5000, or 10,000 loci on
the low end of the range, and 15, 20, 25, 30, 40, 50, 100, 250,
500, 1000, 1500, 2000, 2500, 5000, 10,000 or 25,000 loci on the
high end of the range.
[0083] The amount of each locus detected by sequencing preparations
of DNA obtained from target and non-target cells can vary from
locus to locus for reasons other than the starting quantity of the
locus in the initial sample material prior to preparation for
sequencing, e.g. prior to an amplification step such as PCR.
Variables such as PCR primer binding efficiency, amplicon length,
GC content, and the like can cause variations in the representation
of individual loci in a preparation for sequencing or during
sequencing. Factors such as these can result in a locus specific
bias causing the overrepresentation or underrepresentation of one
locus relative to another. In addition, bias can result from sample-specific
inconsistencies. For example, due to a pipetting error or other
measurement error during physical processing of the samples, one
sample can have more DNA than another sample in the set of samples.
In illustrative embodiments, these sample-specific biases are taken
into account by sample-specific parameters. Certain sample-specific
parameters, such as alpha, in the QMM section herein, can be
identified based on observing certain properties of data. Another
sample-specific parameter illustrated in the QMM section herein is
the set of factors c.sub.s and Ts, which are constant per sample
and represent, for example, the initial quantity of DNA and the
total number of sequence reads. Together, these can be thought of
as the sample-specific amplification factor.
[0084] Independent of the specific method used to produce the
genetic information, the amount of genetic sequence information
from each locus is dependent upon the relative quantity of the copy
numbers of the loci in the original sample. Loci that are believed
to be on the same chromosomal segment, or in some embodiments the
same chromosome, are presumed to have the same starting amount.
Thus, for example, the multiple loci present on chromosome 21 in
the genome of the target cell (or the genome of the non-target
cell) are presumed to be present in approximately equal amounts in
the genomic DNA. Thus differences in the amount of observed genetic
information between loci on the same chromosome are the result of
locus specific bias. For example if SNP1 and SNP2 are located on
the same chromosome and assumed to have the same copy number on the
same chromosome, and SNP1 is found to have a depth of read of 0.1%
and SNP2 is found to have a depth of read of 0.4%, this may be
explained by a quantifiable locus specific bias favoring the
production of DNA sequence from SNP2 over SNP1. This bias may be
additionally normalized by virtue of considering the distribution
of possible sampling outcomes for the two different SNPs. Thus,
methods of the present invention analyze bias and provide a bias
model, as discussed more fully herein. In illustrative embodiments
of the invention, bias models are created from chromosomes and
chromosome segments of interest, in certain embodiments without the
use of a control chromosome or chromosome segment.
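By way of a non-limiting illustration, the SNP1 and SNP2 example in this paragraph can be worked through numerically. The sketch assumes, as the paragraph does, that both SNPs have the same copy number, so that the difference in depth of read is attributed entirely to locus-specific bias. The second sample's depths and the variable names are hypothetical, and sampling noise is ignored.

```python
# Depth of read (fraction of all reads) from the example: SNP1 0.1%, SNP2 0.4%.
reference_depth = {"SNP1": 0.001, "SNP2": 0.004}

# Relative locus-specific bias of SNP2 over SNP1 under the equal-copy assumption.
relative_bias = reference_depth["SNP2"] / reference_depth["SNP1"]

# Per-locus bias factors, scaled so that their mean is 1.
mean_depth = sum(reference_depth.values()) / len(reference_depth)
bias_factor = {snp: d / mean_depth for snp, d in reference_depth.items()}

# Hypothetical depths for another sample in the same run, corrected by dividing
# each observed depth by the bias factor for that locus.
observed = {"SNP1": 0.0011, "SNP2": 0.0044}
corrected = {snp: observed[snp] / bias_factor[snp] for snp in observed}
```

With these assumed numbers the bias favors SNP2 by a factor of about four, and after correction the two loci in the second sample show approximately equal depth, as expected for loci with the same copy number.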
[0085] Accordingly, the selected high confidence diploid subset of
patients is, in certain examples, used to set a bias model. Small
variations in reaction conditions mean that samples run at
different times experience slightly different conditions, resulting
in different relative rates of enrichment and measurement for
different molecules of DNA. Various parameters, including
reaction-specific parameters, sample-specific parameters, and target
locus-specific parameters, can be set as part of the bias model,
allowing normalization of the differing relative rates of
enrichment and measurement.
[0086] Examples of such biases include amplification bias,
sequencing bias, processing bias, enrichment bias, measurement
bias, and combinations thereof. The nature of such biases may vary
in accordance with the specific amplification technology,
sequencing technology, processing, enrichment technology, and
particular conditions present for a specific reaction, etc.
selected for implementation of the specific embodiment. For
example, the diploid sample subset can be used to calculate a
per-sample constant of normalization that reflects the overall
number of reads in the sample, e.g. the percentage of reads in a
sample. In another embodiment, the diploid sample subset can be
used to calculate a per-locus constant of normalization that
reflects the overall number of reads in the sample, e.g. the
percentage of reads in a sample. In some embodiments, the relative
amount of DNA from each sample that is present in the experiment
can be calculated, and used to normalize other sample data
parameters. In some embodiments, the proportion of DNA mapping to a
chromosome of interest can be calculated, and the proportion of DNA
mapping to the chromosome of interest from the selected subset of
patients can be used to calculate a per-experiment constant of
normalization that reflects the proportion or overall amount of DNA
from the chromosome of interest that is expected for a normal
sample, e.g. the percentage of reads in a sample that map to the
chromosome of interest. In some embodiments, the relative amount of
DNA mapping to each of a plurality of targeted loci can be
calculated in the selected subset of patients, and this can be used
to calculate a per-experiment, per-locus constant of normalization
that reflects the amplification and/or measurement bias for each
locus. In some embodiments, the bias model could be used to create
a noise parameter that aggregates amplification bias and various
possible errors such as transcription error rates, contamination
rates, and/or sequencing error rates. In certain examples, a bias
model, or a portion thereof, such as an allele-specific
amplification bias, that was calculated from data from a prior run
can be used by the method that initially analyzes the genetic data
to identify diploid samples.
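By way of a non-limiting illustration, per-sample and per-locus constants of normalization can be estimated from the diploid sample subset as sketched below. This is a simplified stand-in for the bias models described herein, not the disclosed model: it assumes read counts are available per sample for the same ordered list of targeted loci across the run, uses each sample's total read count as the per-sample constant, and uses the mean share of reads at each locus among the diploid samples as the per-locus constant. All names and counts are hypothetical.

```python
from typing import Dict, List

Counts = Dict[str, List[int]]  # sample id -> reads per locus (same locus order)


def fit_bias_model(counts: Counts, diploid_ids: List[str]) -> List[float]:
    """Per-locus constant: mean share of a diploid sample's reads at each locus."""
    n_loci = len(counts[diploid_ids[0]])
    shares = []
    for sample in diploid_ids:
        total = sum(counts[sample])  # per-sample constant (overall number of reads)
        shares.append([c / total for c in counts[sample]])
    return [sum(row[j] for row in shares) / len(shares) for j in range(n_loci)]


def normalize(sample_counts: List[int], locus_bias: List[float]) -> List[float]:
    """Divide each locus's share of reads by the expected diploid share."""
    total = sum(sample_counts)
    return [(c / total) / b for c, b in zip(sample_counts, locus_bias)]


# Hypothetical run: two diploid samples sequenced to different depths, one test sample.
run_counts = {
    "d1": [100, 400, 500],
    "d2": [200, 800, 1000],
    "test": [150, 600, 750],
}
locus_bias = fit_bias_model(run_counts, ["d1", "d2"])
normalized_test = normalize(run_counts["test"], locus_bias)
```

In this sketch a normalized value near 1 at a locus means the test sample matches the diploid expectation at that locus, regardless of how deeply each sample was sequenced.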
[0087] In some examples of this embodiment of the invention,
quantitative allelic and non-allelic data are both analyzed so as
to produce an identification of the number of chromosomes or
chromosome segments of interest with a higher level of confidence
than using the allelic data or non-allelic data alone. The data can
be from the same set of loci and in fact the same data, analyzed
separately for different alleles or as a combined sum for all
alleles of a locus or a haplotype.
[0088] In certain illustrative examples of this embodiment,
quantitative allelic information is used to determine the copy
number of the chromosome of interest or the chromosome segment of
interest without relying on a cut-off value. Polymorphic loci,
e.g., from SNPs that are heterozygous between the target cell and
the non-target cell, e.g. a fetus and its mother, can be used to
determine the copy number of a chromosome or chromosome segment
based on quantitative allelic data from the polymorphic loci.
Provided herein is an exemplary allele-based maximum likelihood
method called the "heterozygote method" or "het rate method" of
determining chromosome or chromosome segment copy number.
Polymorphic loci that are heterozygous between the target cell and
the non-target cell, e.g., a fetus and its mother, can be used to
determine the relative amounts of target cell DNA and non-target
cell DNA in the sample for analysis. The quantity of genetic
information from the polymorphic loci is dependent upon the amount
of genetic starting material and the relative amounts of DNA from
the target cells and the non-target cells. The ratio of alleles at
a plurality of polymorphic loci can be determined and tested
against models corresponding to predicted allele ratios for various
chromosome copy number (or chromosome segment copy number)
hypotheses. The effects on predicted data for differing ratios of
target cell DNA and non-target cell DNA are included in such
models. For example, in the case of testing cell free DNA in the
blood of a pregnant woman, the potential different fetal fractions
(ratio of fetal DNA to total DNA; also referred to herein as child
fraction) can be modeled.
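By way of a non-limiting illustration, the dependence of predicted allele ratios on the copy number hypothesis and on the fetal fraction can be sketched for a single SNP. The sketch is idealized: it assumes a disomic mother, treats the fetal fraction as the fraction of genome equivalents contributed by the fetal source, and ignores amplification bias and measurement noise. The function name expected_b_fraction and the chosen genotypes are hypothetical.

```python
def expected_b_fraction(fetal_fraction: float, maternal_b: int,
                        fetal_b: int, fetal_copies: int) -> float:
    """Expected fraction of B-allele reads at one SNP in a maternal/fetal mixture.

    maternal_b:   number of B alleles in the (assumed disomic) mother, 0 to 2
    fetal_b:      number of B alleles in the fetus, 0 to fetal_copies
    fetal_copies: chromosome copies in the fetus under the hypothesis
    """
    f = fetal_fraction
    b_copies = (1 - f) * maternal_b + f * fetal_b
    total_copies = (1 - f) * 2 + f * fetal_copies
    return b_copies / total_copies


# Mother AA, fetus inherits one B from the father, assumed fetal fraction of 10%.
f = 0.10
disomy = expected_b_fraction(f, maternal_b=0, fetal_b=1, fetal_copies=2)          # fetus AB
trisomy_extra_b = expected_b_fraction(f, maternal_b=0, fetal_b=2, fetal_copies=3)  # fetus ABB
trisomy_extra_a = expected_b_fraction(f, maternal_b=0, fetal_b=1, fetal_copies=3)  # fetus AAB
```

Under these assumptions the three hypotheses predict different B-allele fractions, and the separation between them shrinks as the fetal fraction decreases.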
[0089] In certain embodiments, diploid samples are determined
and/or aneuploidy of the chromosome or chromosome segment of
interest is established using one or two algorithms that provide
maximum likelihoods. In these methods the collected data is
typically tested against a plurality of copy number hypotheses. The
copy number hypotheses can be created for the number of copies of a
chromosome or number of copies of a chromosome segment of the
target cell. Each hypothesis is tested against the genetic data
obtained from the loci. The testing of a hypothesis against the
genetic data results in the calculation of a probability value that
the copy number hypothesis is correct (or conversely incorrect). In
some embodiments wherein the genetic data is obtained from cell
free DNA obtained from the blood of a pregnant woman, the
hypothesis can include a condition that the mother is carrying
multiple fetuses, e.g., twins.
[0090] The probability value is used to select a subset of patients
consisting of those patients that are the source of genetic data
that is found to match a specific copy number hypothesis with a
specified level of confidence. In essence, a subset of patients is
selected, wherein the selected subset of patients matches the
selected hypothesis with a high level of confidence, the high level
of confidence being specified for the specific embodiment. In
illustrative examples of this embodiment of the invention, for
example, the hypothesis could be that chromosome 21 has 2 copies,
chromosome 13 has 2 copies, and chromosome 18 has 2 copies. Samples
meeting this hypothesis with high confidence in an NIPT analysis
are considered diploid samples in this embodiment. These diploid
samples are then used to set a bias model. In some embodiments, the
bias model is used by the same analysis technique to reassess the
samples that were not included in the diploid sample subset. In
other embodiments, a second analytical technique is used to analyze
one or more samples in the set of samples that were not identified
in the initial analysis as members of the diploid subset.
[0091] In some embodiments, a set of at least one ploidy state
hypothesis can be created for each of the chromosomes of interest
of the target individual. Each of the ploidy state hypotheses may
refer to one possible ploidy state of the chromosome or chromosome
segment of the target individual. The set of hypotheses may include
some or all of the possible ploidy states that the chromosome of
the target individual may be expected to have. Some of the possible
ploidy states may include nullsomy, monosomy, disomy, uniparental
disomy, euploidy, trisomy, matching trisomy, unmatching trisomy,
maternal trisomy, paternal trisomy, tetrasomy, balanced (2:2)
tetrasomy, unbalanced (3:1) tetrasomy, other aneuploidy, and they
may additionally involve unbalanced translocations, balanced
translocations, Robertsonian translocations, recombinations,
deletions, insertions, crossovers, and combinations thereof.
[0092] In some embodiments, the knowledge of the determined ploidy
state may be used to make a clinical decision. This knowledge,
typically stored as a physical arrangement of matter in a memory
device, may then be transformed into a report. The report may then
be acted upon. For example, the clinical decision may be to
terminate the pregnancy; alternately, the clinical decision may be
to continue the pregnancy. In some embodiments the clinical
decision may involve an intervention designed to decrease the
severity of the phenotypic presentation of a genetic disorder, or a
decision to take relevant steps to prepare for a special needs
child.
[0093] Some of the math in the presently disclosed embodiments
makes hypotheses concerning a limited number of states of
aneuploidy. In some cases, for example, only zero, one or two
chromosomes are expected to originate from each parent. In some
embodiments of the present disclosure, the mathematical derivations
can be expanded to take into account other forms of aneuploidy,
such as quadrosomy, where three chromosomes originate from one
parent, pentasomy, hexasomy etc., without changing the fundamental
concepts of the present disclosure. At the same time, it is
possible to focus on a smaller number of ploidy states, for
example, only trisomy and disomy. Note that ploidy determinations
that indicate a non-whole number of chromosomes may indicate
mosaicism in a sample of genetic material.
[0094] In some embodiments, the genetic abnormality is a type of
aneuploidy, such as Down syndrome (or trisomy 21), Edwards syndrome
(trisomy 18), Patau syndrome (trisomy 13), Turner syndrome (45X0),
Klinefelter's syndrome (a male with 2 X chromosomes), Prader-Willi
syndrome, and DiGeorge syndrome. Congenital disorders, such as
those listed in the prior sentence, are commonly undesirable, and
the knowledge that a fetus is afflicted with one or more phenotypic
abnormalities may provide the basis for a decision to terminate the
pregnancy, to take necessary precautions to prepare for the birth
of a special needs child, or to take some therapeutic approach
meant to lessen the severity of a chromosomal abnormality.
[0095] In certain embodiments of the invention, one or two maximum
likelihood methods are used. As disclosed above, the first maximum
likelihood method can be used to identify diploid samples in the
set of samples and to determine a first probability that the other
samples in the set of samples are aneuploid. Accordingly, in
certain embodiments, one or more or all of the chromosome(s) or
chromosome segment(s) of interest are determined to be disomic
using a first maximum likelihood method. The method includes the
following steps: [0096] creating, for each sample in the set of
samples, a plurality of first hypotheses wherein each first
hypothesis is associated with a specific copy number for the
chromosome or chromosome segment of interest, [0097] determining a
first probability value for each first hypothesis, wherein the
first probability value indicates the likelihood that the sample
has the number of copies of the chromosome or chromosome segment
that is associated with the first hypothesis, wherein the first
probability values are derived from the genetic data associated
with the sample, and [0098] selecting at least one diploid sample
by selecting those one or more samples that most closely match a
disomic copy number hypothesis for the chromosome or chromosome
segment of interest, with at least a minimum level of confidence.
That is, by selecting those samples that yield the highest
probability of being disomic for the chromosome or chromosome
segment of interest.
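The three steps above can be sketched, purely for illustration, as
follows. The per-hypothesis log-likelihoods are assumed to have been
computed elsewhere from the genetic data; the function names, the
dictionary layout, and the 0.99 threshold are hypothetical choices and
are not taken from the disclosure.

```python
import math


def hypothesis_probabilities(log_likelihoods):
    """Turn per-hypothesis log-likelihoods into probabilities.

    A uniform prior over the hypotheses is assumed. The maximum is
    subtracted before exponentiating to avoid numerical underflow.
    """
    top = max(log_likelihoods.values())
    weights = {h: math.exp(ll - top) for h, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def select_diploid_samples(samples, min_confidence=0.99):
    """Return the sample ids whose disomy probability meets the threshold.

    samples -- dict: sample id -> dict of hypothesis -> log-likelihood.
    """
    selected = []
    for sample_id, log_likelihoods in samples.items():
        probabilities = hypothesis_probabilities(log_likelihoods)
        if probabilities["disomy"] >= min_confidence:
            selected.append(sample_id)
    return selected


# Made-up log-likelihoods for two samples.
samples = {
    "s1": {"monosomy": -50.0, "disomy": -10.0, "trisomy": -40.0},
    "s2": {"monosomy": -30.0, "disomy": -12.0, "trisomy": -11.0},
}
diploid_subset = select_diploid_samples(samples)
```

With these made-up inputs only sample s1 is retained, because the data
for s2 do not favor the disomic hypothesis with sufficient confidence.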
[0099] In certain examples using a maximum likelihood allelic
method, the method is performed by analyzing a second chromosome or
chromosome segment of interest and a third chromosome or chromosome
segment of interest in the parallel analysis, wherein the diploid
samples are identified by a method comprising comparing genetic data from
the first, second, and third chromosome or chromosome segments of
interest for each sample of the set of samples.
[0100] In these embodiments wherein the method includes at least
two maximum likelihood analyses, the presence or absence of
aneuploidy, or the number of copies of the chromosome or chromosome
segment of interest, is determined by creating a plurality or set of
2.sup.nd ploidy hypotheses, wherein each 2.sup.nd ploidy hypothesis
is associated with a specific copy number of the chromosome or
chromosome segment of interest in the target cell. The models are
then used to test how well the genetic data from each patient fits
each 2.sup.nd hypothesis. The goodness of fit for each 2.sup.nd
hypothesis is determined. A second probability value is calculated
for each second hypothesis wherein the second probability value
indicates the likelihood that the genome of the target cell has the
number of chromosomes or chromosome segments that is specified by
the second hypothesis. Thus by selecting the 2.sup.nd hypothesis
with the maximum likelihood, one may determine the copy number for
the chromosome or chromosome segment in the genome of the target
cell. Such first and second hypotheses can be considered in
combination to increase the confidence of the aneuploidy
determination, as discussed more fully herein.
[0101] In this embodiment of the invention, the subset of samples
that is selected because they are identified as diploid samples
with high confidence, is used to create a bias model. The bias
model is created using the genetic data for the chromosome or
chromosome segment of interest. In certain illustrative
embodiments, high confidence diploid samples are identified, and/or
the bias model is created without using a control chromosome or
control chromosome segment.
[0102] In some embodiments the 1.sup.st hypotheses are the same as
the 2.sup.nd hypotheses.
[0103] In certain embodiments, determining a first probability
value for each first hypothesis includes the following: [0104] a.
determining an initial probability of each first hypothesis for
each grid point using a uniform hypothesis prior on a 2d grid of
fetal fraction and the second bias model; [0105] b. determining a
parameter distribution for each chromosome or chromosome segment of
interest based on the initial probability; [0106] c. determining a
composite parameter distribution from the parameter distribution
for each chromosome or chromosome segment of interest; [0107] d.
determining a posterior probability of each first hypothesis based
on the composite parameter distribution; and [0108] e. repeating
steps (a)-(e) using the posterior probability as a new initial
probability for each iteration until convergence is reached.
[0109] In certain embodiments, determining a ploidy probability
value for each ploidy hypothesis comprises: [0110] a. determining
an initial probability of each ploidy hypothesis for each grid
point using a uniform hypothesis prior on a 2d grid of fetal
fraction and the first bias model; [0111] b. determining a
parameter distribution for each chromosome or chromosome segment of
interest based on the initial probability; [0112] c. determining a
composite parameter distribution from the parameter distribution
for each chromosome or chromosome segment of interest; [0113] d.
determining a posterior probability of each ploidy hypothesis based
on the composite parameter distribution; and [0114] e. repeating
steps (a)-(e) using the posterior probability as a new initial
probability for each iteration until convergence is reached.
[0115] In these embodiments, the second bias model can be the
noise parameter discussed herein. The above embodiments that
analyze grid points can be used, in certain examples, with a
quantitative allelic method. Further disclosure related to the
above grid point hypothesis testing is found in the het rate
section herein.
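Step (a) of the procedures above, taken in isolation, might be
sketched as follows. This is a minimal illustration only: the
likelihood function is supplied by the caller and is hypothetical, the
second grid axis is treated as a noise level standing in for the bias
model, and steps (b) through (e) are not shown.

```python
from itertools import product


def initial_probabilities(likelihood, hypotheses, fetal_fractions, noise_levels):
    """Per-grid-point hypothesis probabilities under a uniform hypothesis prior.

    likelihood(h, ff, noise) must return a non-negative number: the
    likelihood of the observed data under hypothesis h at that grid point.
    Returns a dict mapping (ff, noise) -> {hypothesis: probability}.
    """
    prior = 1.0 / len(hypotheses)
    grid = {}
    for ff, noise in product(fetal_fractions, noise_levels):
        weights = {h: prior * likelihood(h, ff, noise) for h in hypotheses}
        total = sum(weights.values())
        if total == 0.0:
            # No hypothesis explains the data here; fall back to the prior.
            grid[(ff, noise)] = {h: prior for h in hypotheses}
        else:
            grid[(ff, noise)] = {h: w / total for h, w in weights.items()}
    return grid


# Toy likelihood used only to exercise the function.
def toy_likelihood(hypothesis, ff, noise):
    return 1.0 if hypothesis == "disomy" else ff


grid = initial_probabilities(
    toy_likelihood, ["disomy", "trisomy"], [0.05, 0.10], [0.01, 0.02, 0.03]
)
```

Each grid point thus carries its own normalized set of hypothesis
probabilities, which is the input that the later steps would refine.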
[0116] In some embodiments the genetic data obtained from the
target and non-target cells identifies the alleles of polymorphic
loci and the number of reads of each allele is quantitatively
measured. Each 1.sup.st hypothesis is tested against a model
specifying a specific distribution of quantitative allelic data at
the plurality of polymorphic loci. Probability values are
determined by calculating for each hypothesis the fit between the
expected genetic data and the obtained, i.e. measured, genetic
data. The probabilities can be weighted for the biological
probability that a given genetic event is likely to occur.
[0117] In some embodiments the genetic data obtained from the
target and non-target cells identifies the alleles of polymorphic
loci and the number of reads of each is quantitatively measured
without regard for the identity of the specific alleles. Each first
hypothesis, or in illustrative embodiments second hypothesis is
tested against a model specifying a specific distribution of
quantitative allelic data at the plurality of loci analyzed.
Probability values are determined by calculating for each
hypothesis the fit between the expected genetic data and the
obtained, i.e. measured, genetic data.
[0118] In some embodiments, the genetic data comprises quantitative
genetic data from a plurality of non-polymorphic loci in which the
2.sup.nd hypothesis specifies an expected distribution of
quantitative data at the plurality of non-polymorphic loci and
wherein the 2.sup.nd probability values are determined by
calculating, for each 2.sup.nd hypothesis the goodness of fit
between the expected genetic data and the normalized genetic data.
In these embodiments, a test statistic, as disclosed herein for the
QMM method, or a z-score could be determined.
[0119] In one embodiment of the present disclosure, where the
method is used to determine the ploidy state of a fetus, the method
further includes taking into account the fraction of fetal DNA in
the sample. In one embodiment of the present disclosure, the method
involves calculating the percent of DNA in a sample that is fetal
or placental in origin. In one embodiment of the present
disclosure, the threshold for calling aneuploidy is adaptively
normalized based on the calculated percent fetal DNA. In some
embodiments, the method for estimating the percentage of DNA that
is of fetal origin in a mixture of DNA, comprises obtaining a mixed
sample that contains genetic material from the mother, and genetic
material from the fetus, obtaining a genetic sample from the father
of the fetus, measuring the DNA in the mixed sample, measuring the
DNA in the father sample, and calculating the percentage of DNA
that is of fetal origin in the mixed sample using the DNA
measurements of the mixed sample, and of the father sample.
[0120] In one embodiment of the present disclosure, the fraction of
fetal DNA, or the percentage of fetal DNA in the mixture can be
measured. In some embodiments the fraction can be calculated using
only the genotyping measurements made on the maternal plasma sample
itself, which is a mixture of fetal and maternal DNA. In some
embodiments the fraction may be calculated also using the measured
or otherwise known genotype of the mother and/or the measured or
otherwise known genotype of the father. In some embodiments the
percent fetal DNA may be calculated using the measurements made on
the mixture of maternal and fetal DNA along with the knowledge of
the parental contexts. In one embodiment the fraction of fetal DNA
may be calculated using population frequencies to adjust the model
on the probability on particular allele measurements.
[0121] The accuracy of a ploidy determination is typically
dependent on a number of factors, including the number of reads and
the fraction of fetal DNA in the mixture. The accuracy is typically
higher when the fraction of fetal DNA in the mixture is higher. At
the same time, the accuracy is typically higher if the number of
reads is greater. It is possible to have a situation with two cases
where the ploidy state is determined with comparable accuracies
wherein the first case has a lower fraction of fetal DNA in the
mixture than the second, and more reads were sequenced in the first
case than the second. It is possible to use the estimated fraction
of fetal DNA in the mixture as a guide in determining the number of
reads necessary to achieve a given level of accuracy.
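The trade-off between fetal fraction and number of reads can be
illustrated with a rough counting argument, sketched below. The sketch
assumes simple binomial sampling noise, approximates the trisomy signal
as a relative increase of half the fetal fraction in the share of reads
from the chromosome of interest, and ignores amplification bias and
overdispersion; the function name, the parameters, and the default
target z-score are hypothetical.

```python
def reads_needed(fetal_fraction, chrom_share, z_target=3.0):
    """Approximate total reads for a trisomy to shift the share by z_target SDs.

    fetal_fraction -- estimated fraction of fetal DNA in the mixture.
    chrom_share    -- expected share of reads from the chromosome of
                      interest in a disomic sample.
    """
    if fetal_fraction <= 0.0:
        raise ValueError("fetal_fraction must be positive")
    # Approximate absolute shift in the chromosome's read share under trisomy.
    shift = chrom_share * fetal_fraction / 2.0
    # Binomial variance contributed by a single read.
    variance_per_read = chrom_share * (1.0 - chrom_share)
    return z_target ** 2 * variance_per_read / shift ** 2


low_ff = reads_needed(0.05, 0.013)
high_ff = reads_needed(0.10, 0.013)
```

Under these assumptions, halving the fetal fraction multiplies the
required number of reads by four, which is consistent with the
observation that a lower fetal fraction can be offset by deeper
sequencing.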
[0122] In an embodiment of the present disclosure, a set of samples
can be run where different samples in the set are sequenced to
different read depths, wherein the number of reads run on each of
the samples is chosen to achieve a given level of accuracy given
the calculated fraction of fetal DNA in each mixture. In one
embodiment of the present disclosure, this may entail making a
measurement of the mixed sample to determine the fraction of fetal
DNA in the mixture; this estimation of the fetal fraction may be
done with sequencing, it may be done with TaqMan, it may be done
with another qPCR method, it may be done with SNP arrays, it may be
done with any method that can distinguish different alleles at a
given locus. The need for a fetal fraction estimate may be
eliminated by including hypotheses that cover all or a selected set
of fetal fractions in the set of hypotheses that are considered
when comparing to the actual measured data. After the fraction of
fetal DNA in the mixture has been determined, the number of
sequences to be read for each sample may be determined.
[0123] Accordingly, certain examples of the method for determining
aneuploidy further include estimating a fetal fraction for each
sample in the set of samples, wherein the fetal fraction is used to
select the diploid subset of samples and/or to determine whether
the genetic data from the test subject is indicative of an
aneuploidy. That is, fetal fraction can be used in one or both
methods used in a method for determining aneuploidy wherein a first
method is used to identify diploid samples and a second method uses
those diploid samples to determine whether another sample in a set
of samples is an aneuploid sample.
[0124] FIG. 1 is a non-limiting example of a method 100 of the
invention that includes the use of a first method for identifying a
subset of diploid samples and a second method to increase the
accuracy and/or confidence of detection of aneuploidy in NIPT. The
method starts at block 102, where genetic data is obtained from a
mixture of target DNA and non-target DNA for each sample from a set
of samples, one for each patient in a set of patients, by running a
plurality of samples from pregnant mothers in parallel. That is,
the samples are analyzed together at the same time typically using
the same common reagents and instruments. In this example, the set
of samples are barcoded, mixed, and amplified in the same reaction.
Then, the set of samples are amplified in parallel using the same
or nearly the same conditions. Next, the set of samples are
sequenced on the same sequencing flow cell using the same
conditions.
[0125] In block 104, a method (block 105) is used to make an
initial determination of the copy number of the chromosome of
interest in each of the samples from the set of samples. In one
example, the initial determination is made by a method that
relies on allelic data, for example, the het rate method 108
provided herein. In other examples, the initial determination is
made by a quantitative method 106 that relies on non-allelic
data.
[0126] In block 110, a subset of samples from the plurality of
samples is selected where the likelihood is very high that each of
the chromosomes in the subset of samples are normally represented
(i.e. diploid). In one example, one could choose only those samples
for inclusion into the subset where a non-allelic quantitative
method 106, such as a method that determines a depth of read, is
used and the absolute value of the z-score is less than 0.5, 1,
1.5, or 2, for example, or where the z-score is indicative of
disomy with at least a minimum level of confidence (e.g. 90%, 95%,
99%, 99.5%, 99.6%, 99.7%, 99.8% or 99.9%, for example).
Alternately, one could choose only those samples for inclusion into
the subset where a het rate method 108 is used that calculates a
confidence, and where the calculated confidence that the chromosome
is disomic is greater than 0.9, 0.95, 0.990, 0.9995, 0.9996, 0.997,
0.998, or 0.9999, for example. Alternately, one could simply choose
a fixed size subset by choosing those samples with the highest
confidence or the z-score closest to zero. For example, one could
analyze 96 samples and choose the 24 samples with the highest
confidence of disomy for inclusion into the subset.
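The alternative selection rules described above can be sketched as
follows. The function names, the dictionary inputs, and the default
values are hypothetical and serve only to illustrate a threshold on the
absolute z-score and a fixed-size selection by confidence.

```python
def select_by_z(z_scores, max_abs_z=1.0):
    """Keep the samples whose z-score magnitude is below the cut-off."""
    return [sid for sid, z in z_scores.items() if abs(z) < max_abs_z]


def select_top_k(disomy_confidence, k=24):
    """Keep the k samples with the highest calculated confidence of disomy."""
    ranked = sorted(disomy_confidence, key=disomy_confidence.get, reverse=True)
    return ranked[:k]


# 96 samples with made-up confidences; the 24 most confident are retained.
confidences = {f"s{i}": i / 96 for i in range(96)}
reference_subset = select_top_k(confidences, k=24)
near_zero = select_by_z({"a": 0.2, "b": -1.7, "c": 0.9}, max_abs_z=1.0)
```

The fixed-size rule always yields a reference subset of the same size,
whereas the threshold rule yields a subset whose size depends on the
run.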
[0127] Once the subset of samples has been chosen, it is possible
to use the genetic data measured on the chromosome of interest in
the subset of samples as a reference set of samples for a secondary
analysis of the samples in the plurality of samples that are not in
the subset of samples. For example, the conditions in any given
amplification reaction and sequencing run are slightly different,
resulting in slightly different profiles of amplification rates and
biases for each locus and for each sample. If a plurality of
samples are run using the same conditions, and especially if those
samples are run in one homogenous mixture, then it is reasonable to
believe that the relative amplification and other processing biases
will be minimized. A second analytical method (block 117) can then
use the reference samples to create a model of the biases (block
112), either on a per-SNP basis (i.e. what is the relative
amplification and processing bias for each SNP), a per-region basis
(i.e. what is the relative amplification and processing bias for
each region of DNA), or on a per-chromosome basis (i.e. what is the
relative amplification and other processing bias for each
chromosome). In certain examples not illustrated in FIG. 1, the
first analytical method (block 105) can use a bias model as well,
which can be built using data from the run or which can be built
from previous data.
[0128] In block 114, the genetic data for all patients in the set
of patients from the parallel analysis is then normalized according
to the bias model from block 112, to correct bias errors where
appropriate.
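A per-locus version of the bias model of block 112 and the
normalization of block 114 might be sketched as follows. This is an
illustrative simplification in which the bias of each locus is
estimated as its mean share of reads across the reference diploid
samples, relative to the average locus; the function names and the data
layout are hypothetical.

```python
def per_locus_bias(reference_counts):
    """Estimate a relative bias factor for each locus.

    reference_counts -- list of per-sample read-count lists, all in the
                        same locus order, from the reference diploid subset.
    Returns one factor per locus; the factors average to 1 across loci.
    """
    n_samples = len(reference_counts)
    n_loci = len(reference_counts[0])
    mean_share = [0.0] * n_loci
    for counts in reference_counts:
        total = sum(counts)
        for i, count in enumerate(counts):
            mean_share[i] += count / total / n_samples
    average = sum(mean_share) / n_loci
    return [share / average for share in mean_share]


def normalize(counts, bias):
    """Divide each locus count by its estimated bias factor."""
    return [count / factor for count, factor in zip(counts, bias)]


# Two reference samples in which the middle locus amplifies twice as well.
bias = per_locus_bias([[100, 200, 100], [50, 100, 50]])
corrected = normalize([30, 60, 30], bias)
```

In this toy case the corrected counts are equal across the three loci,
because the test sample shows the same locus-specific bias as the
reference samples.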
[0129] In block 116, once the data is normalized, samples with
unknown ploidy states can then be analyzed a second time to
determine the copy number, by comparing a set of copy number
hypotheses to the normalized genetic data. This may be done using
the same or a different method as the initial determination in
block 104. For example, this second analytical method 117 may be
performed using a quantitative method 118, such as the QMM method
118 as discussed below, a het rate method 120, as also discussed
below, or samples with unknown ploidy could be analyzed by both an
allelic method such as a het rate method 120 and a quantitative
non-allelic method 118 such as QMM. For methods that generate a
maximum likelihood, such as QMM and het rate, a copy number
probability value is then determined in block 122 for each copy
number hypothesis.
[0130] In another example, blood is drawn from 96 pregnant women
who want to know if their fetuses have Down syndrome, or trisomy
21. These 96 samples are then all processed and biochemically
analyzed together, or in parallel. In other examples, the number of
samples run in parallel could be, for example, at least 3, 8, 24,
36, 48, 72, 108, 144, 288, or 396. In certain examples, no more
than 396 samples are analyzed in parallel. The DNA from each of the
samples then has a barcode attached, and then all of the samples are
pooled and amplified. The amplified DNA is then sequenced using a
high throughput sequencer (e.g. block 102). Then a het rate method
(e.g. block 108) is used to analyze each of the samples. The 24
samples with the highest confidence for disomy at chromosome 21 are
selected (i.e. identified) as a subset of samples that could
act as a reference subset (e.g. block 110). Alternately, a
quantitative method could be used to analyze each of the samples to
give a preliminary estimate of the proportion of DNA mapping to
chromosome 21 (e.g. block 106). The 24 samples with the z-score
closest to zero could be selected as a subset of samples that could
act as the reference subset (e.g. block 110).
[0131] Once the reference or control subset is chosen (e.g. block
112), a second analytical method (117) can make the assumption that
these cases are disomic, and then estimate the per-SNP bias, that
is, the experiment-specific amplification and other processing bias
for each locus using these diploid samples. Then, the second method
(117) can use this experiment-specific bias estimate to correct the
bias in the measurements of genetic data (e.g. sequencing reads) of
the chromosome 21 loci, and for other chromosome loci (e.g.
chromosomes 13, 18, X, and Y) as appropriate, for the 72 samples
that are not part of the subset where disomy was assumed for
chromosome 21 (e.g. block 114).
[0132] Once the reference (i.e. control) diploid samples have been
selected (i.e. identified) (110), the data from the 72 samples with
unknown ploidy state can then be analyzed a second time using the
same or a different method (117) to determine whether the fetuses
are afflicted with trisomy 21. The reference diploid subset of
samples are used to set a bias model (112) that is used by a second
method (117) to normalize the genetic data from the samples that
were not selected as members of the high confidence diploid subset.
For example, a quantitative method could be used on the remaining
72 samples, and a z-score could be calculated using the corrected
measured genetic data on chromosome 21 (e.g. block 118).
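As a minimal sketch of the z-score calculation mentioned above, the
share of corrected reads mapping to chromosome 21 in a test sample can
be compared with the mean and standard deviation of the same quantity
in the reference diploid subset. The function name and the numerical
values are hypothetical and are used only for illustration.

```python
from statistics import mean, stdev


def chromosome_z_score(test_share, reference_shares):
    """z-score of a test sample against the reference diploid subset.

    test_share       -- share of corrected reads on the chromosome of
                        interest in the sample being tested.
    reference_shares -- the same quantity for each reference diploid
                        sample (at least two values are required).
    """
    mu = mean(reference_shares)
    sigma = stdev(reference_shares)
    return (test_share - mu) / sigma


reference = [0.0130, 0.0131, 0.0129, 0.0130]
z = chromosome_z_score(0.0136, reference)
```

Because the control value and its spread come from samples processed in
the same run, no separate control chromosome is needed in this sketch.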
[0133] In certain embodiments, the bias correction or normalization
of the genetic data is done as part of the second analysis. As part
of the preliminary estimate of the ploidy state of chromosome 21, a
fetal fraction, in certain examples, is calculated. The proportion
of corrected reads that would be expected in the case of a disomy
(the disomy hypothesis), and the proportion of corrected reads that
would be expected in the case of a trisomy (the trisomy hypothesis)
are calculated for a case with that fetal fraction. Alternately, if
the fetal fraction was not measured previously, a set of disomy and
trisomy hypotheses are generated for different fetal fractions. For
each case, an expected distribution of the proportion of corrected
reads are calculated given expected statistical variation in the
selection and measurement of the various DNA loci. The observed
corrected proportion of reads are compared to the distribution of
the expected proportion of corrected reads, and a likelihood ratio
is calculated for the disomy and trisomy hypotheses, for each of
the 72 samples. The ploidy state associated with the hypothesis
with the highest calculated likelihood, for each of the 72 samples,
in this example, is selected as the correct ploidy state. In
another embodiment, the corrected genetic data for the remaining 72
samples is analyzed using a plurality of orthogonal methods, and
the resulting likelihoods are then combined to give a combined
likelihood which is used to determine the actual ploidy state of
each of the fetuses. In one embodiment, an allelic maximum
likelihood method, such as the het rate method and a quantitative
method, such as the QMM method, are each used to determine the
likelihood of disomy and trisomy in the fetus, and these
likelihoods are combined or considered together in a set of rules
that provide an output of whether a sample exhibits aneuploidy in
any of the chromosome or chromosome segments of interest. It will
be apparent to an ordinary person skilled in the art how any of the
approaches disclosed herein could be used for other types of whole
chromosome abnormalities. Furthermore, it will be apparent to an
ordinary person skilled in the art how any of the approaches
disclosed herein could be used for other types of partial
chromosomal abnormalities, for example, a microdeletion, a
microduplication, or an unbalanced translocation.
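The comparison of the observed corrected proportion of reads with the
disomy and trisomy hypotheses, and the combination of likelihoods from
two methods, can be sketched as follows. The sketch assumes a normal
distribution for the observed proportion with a known standard
deviation, and combines methods by summing log-likelihoods, which
treats them as independent; these modelling choices and all names are
hypothetical.

```python
import math


def log_normal_pdf(x, mu, sd):
    """Log density of a normal distribution with mean mu and SD sd."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mu) ** 2 / (2 * sd ** 2)


def ploidy_log_likelihoods(observed, disomy_share, fetal_fraction, sd):
    """Log-likelihood of the observed read share under disomy and trisomy."""
    # Under trisomy the fetal contribution from the chromosome rises by half.
    trisomy_share = (disomy_share * (1 + fetal_fraction / 2)
                     / (1 + disomy_share * fetal_fraction / 2))
    expected = {"disomy": disomy_share, "trisomy": trisomy_share}
    return {h: log_normal_pdf(observed, mu, sd) for h, mu in expected.items()}


def combine_log_likelihoods(*per_method):
    """Sum per-hypothesis log-likelihoods across methods assumed independent."""
    return {h: sum(m[h] for m in per_method) for h in per_method[0]}


def call_ploidy(log_likelihoods):
    """Select the hypothesis with the highest log-likelihood."""
    return max(log_likelihoods, key=log_likelihoods.get)


quantitative = ploidy_log_likelihoods(0.0133, 0.013, 0.10, 1e-4)
allelic = {"disomy": -5.0, "trisomy": -1.0}  # made-up values for a second method
combined = combine_log_likelihoods(quantitative, allelic)
```

When the fetal fraction has not been measured, the same calculation
could be repeated over a set of candidate fetal fractions, as described
above.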
[0134] In some embodiments the target cells are fetal cells and
non-target cells are from the mother of the fetus. In some
embodiments the invention is directed to non-invasive prenatal
diagnosis, and the target cells may be fetal cells and the
non-target cells may be maternal cells. In some embodiments of the
invention an example of a hypothesis that may be used to select the
subset of samples is the hypothesis that a specific chromosome or
chromosome segment is diploid i.e. present in 2 copies. Examples of
chromosomes for analysis include chromosomes 13, 18, 21, X and Y,
including segments thereof. For example, the subset of samples may
be chosen on the basis of having the highest likelihood that all or
nearly all of the DNA in the sample originated from cells with
precisely two copies of the chromosome of interest. In certain
embodiments, the chromosomes that are analyzed are chromosomes 13,
18, and 21.
[0135] In some embodiments, the chromosome segment(s) that is
analyzed for copy number is at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
or all segments selected from the group consisting of chromosome
22q11.2, chromosome 1p36, chromosome 15q11-q13, chromosome 4p16.3,
chromosome 5p15.2, chromosome 17p13.3, chromosome 22q13.3,
chromosome 2q37, chromosome 3q29, chromosome 9q34, chromosome
17q21.31, and the terminus of a chromosome.
[0136] Note that it has been demonstrated that DNA that originated
from cancer that is living in a host can be found in the blood of
the host. In the same way that genetic diagnoses can be made from
the measurement of mixed DNA found in maternal blood, genetic
diagnoses can equally well be made from the measurement of mixed
DNA found in host blood. The genetic diagnoses may include
aneuploidy states, or gene mutations. Any claim in this patent that
reads on determining the ploidy state or genetic state of a fetus
from the measurements made on maternal blood can equally well read
on determining the ploidy state or genetic state of a cancer from
the measurements on host blood.
[0137] In some embodiments, the method may allow one to determine
the ploidy status of a cancer, the method comprising obtaining a
mixed sample that contains genetic material from the host, and
genetic material from the cancer, measuring the DNA in the mixed
sample, calculating the fraction of DNA that is of cancer origin in
the mixed sample, and determining the ploidy status of the cancer
using the measurements made on the mixed sample and the calculated
fraction. In some embodiments, the method may further comprise
administering a cancer therapeutic based on the determination of
the ploidy state of the cancer. In some embodiments, the method may
further comprise administering a cancer therapeutic based on the
determination of the ploidy state of the cancer, wherein the cancer
therapeutic is taken from the group comprising a pharmaceutical, a
biologic therapeutic, an antibody-based therapy, and combinations
thereof.
[0138] Accordingly, in some embodiments the target cell is a tumor
cell and the non-target cell is a non-tumor cell. In some
embodiments the cell free DNA comprises DNA that has been
released by apoptosis. In some embodiments the target cell is a
malignant tumor cell.
[0139] In certain embodiments, the chromosome or chromosome segment
of interest is known to exhibit CNV in cancer (see for example, Liu
et al. Oncotarget. 2013 November; 4(11): 1868-188, incorporated by
reference in its entirety and Beroukhim et al, Nature. 2010 Feb.
18; 463(7283): 899-905, incorporated by reference in its entirety).
For example, the chromosome or chromosome segment of interest in
certain embodiments, is a chromosome or chromosome segment
comprising at least 1, 2, 3, 4, 5, 10, 15, 20 or more of the
following genes: ERBB2, EGFR, MYC, PIK3CA, IGF1R, FGFR1/2, KRAS,
CDK4, CCND1, MDM2, MET, CDK6 (in certain embodiments, chromosome or
chromosome segments that include these genes are assayed for
amplification), and RB1, PTEN, CDKN2A/B, ARID1A, MAP2K4, NF1,
SMAD4, BRCA1/2, MSH2/6, DCC, CDH1 (in certain embodiments
chromosome or chromosome segments that include these genes are
assayed for deletion). In some embodiments wherein at least some of
the genetic data is derived from circulating tumor cells, the
chromosome segment, or chromosome from which the segment
originates, is 1p, 2p, 2q, 3p, 3q, 4p, 5p, 5q, 6p, 6q, 7p, 7q, 8p,
8q, 9p, 9q, 10p, 10q, 11p, 11q, 12p, 12q, 13q, 14q, 15q, 16p, 16q,
17p, 17q, 18p, 18q, 19p, 19q, 20p, 20q, 21q, 22q (See Beroukhim et
al., Nature. 2010 Feb. 18; 463 Supp FIG. 6).
[0140] The following provides a non-limiting example of a method
for the detection of aneuploidy (i.e. copy number variation "CNV")
in circulating tumor DNA in a blood sample from an individual at
high risk of having cancer, using a method of the invention that
includes the use of a first method for identifying a subset of
samples having a normal copy number for one or more target
chromosome regions, and a second method to increase the accuracy of
detection of CNV at the one or more target chromosome regions.
Accordingly, a blood sample is collected from each of 48 patients
at high-risk for breast cancer, of which for example, eleven
actually have breast cancer. Blood samples are centrifuged, plasma
separated, and the DNA isolated from the plasma. The isolated DNA
is then amplified using a non-specific amplification for six to ten
cycles in one example, or it is amplified for four to sixteen
cycles in another example. The DNA is then amplified using a
targeted PCR protocol that targets a plurality of loci located
across one or more chromosomal regions where amplification or
deletion of the chromosomal regions are indicative of cancer; the
regions may be focal regions or they may be entire arms of
chromosomes, or entire chromosomes. The regions may be commonly
observed to be deleted or amplified in specific cancers, they may
be directly believed to affect oncogenesis, and/or they may be
driver mutations. The targeted amplification or deletion may
simultaneously amplify or delete single nucleotide variants
implicated in oncogenesis, or correlated with the presence of a
tumor. Each region may contain at least 20 loci, at least 50 loci,
at least 100 loci, at least 200 loci, at least 500 loci, or at
least 1,000 loci. The loci may be comprised of polymorphic loci,
for example SNPs. The loci may also be comprised of non-polymorphic
loci. The amplified DNA may be measured using high throughput
sequencing.
[0141] The data from each of the samples is analyzed to determine
which of the samples have a high likelihood of not having any CNVs.
This analysis can involve analysis of allelic data, it can involve
analysis of quantitative data, or it may involve analysis of both
allelic and quantitative data. The determination that certain
samples are most likely to not have any CNVs (i.e. a normal sample)
is based, in this example, on selecting the samples with the lowest
fraction of tumor DNA, selecting the samples with the z-score
closest to zero, selecting the samples where the data fits the
hypothesis corresponding to no CNVs with the highest confidence or
likelihood, selecting the samples known to be normal, selecting the
samples from individuals with the lowest likelihood of having
cancer (e.g. having a low age, being a male when screening for
breast cancer, having no family history, etc.), selecting the
samples with the highest input amount of DNA, selecting the samples
with the highest signal to noise ratio, selecting samples based on
other criteria believed to be correlated to the likelihood of
having cancer, or selecting samples using some combination of
criteria.
[0142] A subset of the 48 samples with a sufficiently low
likelihood of having cancer in this example are selected to act as
a control set of samples. The subset can be a fixed number or
percent of samples, or it can be a variable number that is based on
choosing only those samples that fall below a threshold. For
example, the 25, 20, 15, 10, or 5% of samples with the lowest
likelihood of aneuploidy or lowest absolute value z score can be
selected as the control subset. Alternatively, the 25, 20, 15, 10
or 5 samples with the lowest likelihood of aneuploidy or lowest
absolute value z-score can be selected as the control subset. The
quantitative data from the subset of samples can be combined,
averaged, or combined using a weighted average where the weighting
is based on the likelihood of the sample being normal. The
quantitative data may be used to determine the per-locus bias for
the amplification and/or the sequencing of samples as well as for sample
biases and other biases disclosed herein as part of a bias model,
in the instant batch of 48 samples. The per-locus bias can also
include data from other batches of samples. The per-locus bias can
indicate the relative over- or under-amplification that was
observed for that locus compared to other loci, making the
assumption that the subset of samples do not contain any CNVs, and
that any observed over or under-amplification is due to
amplification and/or sequencing or other bias. The per-locus bias
can take into account the GC content of the amplicon. The loci can
be grouped into groups of loci for the purpose of calculating a
per-locus bias. Once the per-locus bias has been calculated for
each locus in the plurality of loci, the sequencing data for one or
more of the samples that are not in the subset of the samples, and
optionally one or more of the samples that are in the subset of
samples, can be corrected by adjusting the quantitative
measurements for each locus to remove the effect of the bias at
that locus. For example, if SNP 1 was observed, in the subset of
patients, to have a depth of read that is twice as great as the
average, the adjustment can involve replacing the number of reads
corresponding to SNP 1 with a number that is half as great. If
the locus in question is a SNP, the adjustment can involve cutting
the number of reads corresponding to each of the alleles at that
locus in half.
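By way of a non-limiting illustration, the per-locus bias correction described above can be sketched as follows. The function names, the choice to scale each control sample by its own mean depth, and the example depths are assumptions made for illustration only; they are not taken from the disclosure.

```python
from statistics import mean


def per_locus_bias(control_depths):
    """Estimate a relative bias factor per locus from presumed-normal samples.

    control_depths: one list of read depths per control sample, with loci
    in the same order in every sample. A factor of 1.0 means the locus is
    read at the average rate; 2.0 means twice the average.
    """
    # Assumption: scale each sample by its own mean depth so that samples
    # sequenced more deeply do not dominate the estimate.
    normalized = [[d / mean(sample) for d in sample] for sample in control_depths]
    n_loci = len(control_depths[0])
    return [mean([sample[i] for sample in normalized]) for i in range(n_loci)]


def correct_depths(depths, bias):
    """Divide each locus depth by its bias factor."""
    return [d / b for d, b in zip(depths, bias)]


def correct_allele_counts(allele_counts, bias):
    """Apply the same per-locus factor to both allele counts of each SNP."""
    return [(ref / b, alt / b) for (ref, alt), b in zip(allele_counts, bias)]


# Hypothetical control subset in which locus 0 is read at twice the average.
controls = [[200, 50, 50], [400, 100, 100]]
bias = per_locus_bias(controls)                   # [2.0, 0.5, 0.5]
corrected = correct_depths([300, 80, 70], bias)   # [150.0, 160.0, 140.0]
```

In this sketch a locus observed at twice the average depth in the control subset has its reads, and the reads of each of its alleles, cut in half, as in the SNP 1 example above.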
[0143] Once the sequencing data for each of the loci in one or more
samples has been normalized, it is analyzed using at least one
method, and in illustrative embodiments at least two methods for
the purpose of detecting the presence of a CNV at one or more
chromosomal regions. The method can be a quantitative maximum
likelihood method that uses only quantitative non-allelic data, it
can be an allelic maximum likelihood method that only uses allelic
data, including allele ratios or allele distributions, it may be a
method that uses both quantitative non-allelic and allelic data, or
it may be a method that uses other types of data. The likelihood of
a CNV is calculated using such a method. The likelihoods produced
for a plurality of hypotheses by more than one method are combined;
if the methods are not orthogonal, that is, if the likelihoods
generated have some correlation, a correction may be applied when
combining the likelihoods.
[0144] For example, sample A, a mixture of amplified DNA
originating from a mixture of normal and cancerous cells, is
analyzed using a quantitative method: a region of the q arm on
chromosome 22 is found to only have 90% as much DNA mapping to that
region as expected; a focal region corresponding to the HER2 gene
is found to have 150% as much DNA mapping to that region as
expected; and the p-arm of chromosome 5 is found to have 105% as
much DNA mapping to it as expected. A clinician can infer that the
sample has a deletion of a region on the q arm on chromosome 22,
and a duplication of the HER2 gene. The clinician can infer that
since the 22q deletions are common in breast cancer, and that since
cells with a deletion of the 22q region on both chromosomes usually
do not survive, that approximately 20% of the DNA in the sample
came from cells with a 22q deletion on one of the two chromosomes.
The clinician may also infer that if the DNA from the mixed sample
that originated from tumor cells originated from a set of tumor
cells whose HER2 and 22q regions were genetically homogeneous,
then the cells contained a five-fold duplication of the
HER2 region. Of course tumors tend to be heterogeneous, so this may
not be an appropriate assumption.
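The arithmetic behind these inferences can be sketched as follows, assuming a simple two-population mixture in which normal cells carry two copies of a region and all tumor cells carry the same copy number. The function names and the mixture model are illustrative assumptions, not part of the disclosure.

```python
def tumor_fraction_from_depth(relative_depth, tumor_copies, normal_copies=2):
    """Fraction of DNA from tumor cells, given the observed depth relative
    to the expected depth and an assumed tumor copy number.

    Model: relative_depth = 1 + f * (tumor_copies / normal_copies - 1)
    """
    return (relative_depth - 1.0) * normal_copies / (tumor_copies - normal_copies)


def tumor_copies_from_depth(relative_depth, tumor_fraction, normal_copies=2):
    """Tumor copy number implied by the same model, given the tumor fraction."""
    return normal_copies * (1.0 + (relative_depth - 1.0) / tumor_fraction)


# 22q at 90% of expected depth, one of two copies deleted in tumor cells.
f = tumor_fraction_from_depth(0.90, tumor_copies=1)

# HER2 at 150% of expected depth, using the tumor fraction found above.
her2_copies = tumor_copies_from_depth(1.50, tumor_fraction=f)
```

Under these assumptions `f` comes out at about 0.20 and `her2_copies` at about 7, that is, five copies beyond the normal two. That is one way to read the "five-fold duplication" referred to above.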
[0145] In this example, sample A is also analyzed using an allelic
method: the two haplotypes on the same region on the q arm on
chromosome 22 are observed to be present in a ratio of 4:5; the two
haplotypes in a focal region corresponding to the HER2 gene are
found to be present in ratios of 1:2; and the two haplotypes in the
p-arm of chromosome 5 are observed in ratios of 20:21. All other
assayed regions of the genome are found to have no statistically
significant excess of either haplotype. A clinician can infer that
the sample contains DNA from a tumor with a CNV in the 22q region,
the HER2 region, and the 5p arm. Based on the knowledge that 22q
deletions are very common in breast cancer, and/or the quantitative
analysis showing an under-representation of the amount of DNA
mapping to the 22q region of the genome, the clinician can infer
the existence of a tumor with a 22q deletion. Based on the
knowledge that HER2 amplifications are very common in breast
cancer, and/or the quantitative analysis showing an
over-representation of the amount of DNA mapping to the HER2 region
of the genome, the clinician can infer the existence of a tumor
with a HER2 amplification. Based on the inferences, the clinician
may decide to pursue additional diagnostic testing such as a tumor
biopsy. Based on these inferences, the clinician can perform a
mammogram or an ultrasound. Based on these inferences, the
clinician can perform a lumpectomy, a mastectomy, or otherwise
excise the tumor. Based on these inferences, the clinician can
choose a course of radiation therapy, chemotherapy, immunotherapy
or other cancer therapy. It is also possible to run other genetic
assays in parallel or in the same assay, for example, testing for
the presence of one or more SNVs. The clinician can choose the form
of therapy, or combination of therapies, based on the genetic
footprint, that is, the particular combination of CNVs and other
mutations such as SNVs that are observed in the sample, combined
with any other data such as clinical data or phenotypic data. It
should be apparent to a person of ordinary skill in the art how any
of the approaches discussed herein could be used for other types of
cancer.
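As a non-limiting illustration of the allelic analysis of sample A, the haplotype ratios reported for the 22q and HER2 regions can be reproduced from a two-population mixture. The sketch assumes that one fifth of the DNA comes from tumor cells, that normal cells carry one copy of each haplotype, and that the tumor haplotype copy numbers are as stated in the comments. These values and the function name are assumptions chosen for illustration.

```python
from fractions import Fraction


def haplotype_ratio(tumor_fraction, tumor_hap_copies):
    """Expected ratio of the two haplotypes in a mixed sample.

    tumor_fraction: fraction of DNA that comes from tumor cells.
    tumor_hap_copies: (copies of haplotype 1, copies of haplotype 2) in
    tumor cells; normal cells are assumed to carry one copy of each.
    Returns the ratio as a reduced (numerator, denominator) pair.
    """
    f = Fraction(tumor_fraction)
    h1, h2 = tumor_hap_copies
    amount_1 = (1 - f) + f * h1
    amount_2 = (1 - f) + f * h2
    ratio = amount_1 / amount_2
    return ratio.numerator, ratio.denominator


f = Fraction(1, 5)

# 22q: one haplotype deleted in the tumor cells.
ratio_22q = haplotype_ratio(f, (0, 1))    # (4, 5)

# HER2: five extra copies of one haplotype in the tumor cells.
ratio_her2 = haplotype_ratio(f, (1, 6))   # (1, 2)
```

The 4:5 and 1:2 ratios agree with the quantitative results for the same regions, which is why the two analyses can corroborate one another.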
Allelic Joint Distribution Methods
[0146] In certain embodiments, methods of the invention include
determining whether the distribution of observed allele
measurements is indicative of a euploid or an aneuploid sample,
such as a fetus or circulating tumor cell, using a joint
distribution model. The use of a joint distribution model provides
certain advantages over methods that determine heterozygosity rates
by treating polymorphic loci independently in that the resultant
determinations are of significantly higher accuracy. Without being
bound by any particular theory, it is believed that one reason they
are of higher accuracy is that the joint distribution model takes
into account the linkage between SNPs, and likelihood of crossovers
occurring. Another reason it is believed that they are of higher
accuracy is that they can take into account alleles where the total
number of reads is low, and the allele ratio method would produce
disproportionately weighted stochastic noise. The het rate method
provided herein, is an example of an allelic joint distribution
method that can be used to carry out many of the embodiments
provided herein.
[0147] In certain embodiments provided herein, methods of the
invention include determining whether the distribution of observed
allele measurements is indicative of a euploid or an aneuploid
sample using a maximum likelihood technique. The use of a maximum
likelihood technique has certain advantages over methods that use
a single hypothesis rejection technique in that the resultant
determinations will be made with significantly higher accuracy. One
reason is that single hypothesis rejection techniques set cut off
thresholds based on only one measurement distribution rather than
two, meaning that the thresholds are usually not optimal. Another
reason is that the maximum likelihood technique allows the
optimization of the cut off threshold for each individual sample
instead of determining a cut off threshold to be used for all
samples regardless of the particular characteristics of each
individual sample. Another reason is that the use of a maximum
likelihood technique allows the calculation of a confidence for
each ploidy call.
[0148] In certain embodiments provided herein, the method includes
determining whether the distribution of observed allele
measurements is indicative of a euploid or an aneuploid sample
without comparing the distribution of observed allele measurements
on a suspect chromosome to a distribution of observed allele
measurements on a reference chromosome that is expected to be
disomic. This is a significant improvement over methods that
require the use of a reference chromosome to determine whether a
suspect chromosome is euploid or aneuploid. One example of where a
ploidy calling technique that requires a reference chromosome would
make an incorrect call is in the case of a 69XXX (trisomic fetus),
which would be called euploid since there is no reference diploid
chromosome, while the method described herein would be able to
determine that the fetus was trisomic.
[0149] In certain embodiments provided herein, the method involves
using algorithms that analyze the distribution of alleles that have
different parental contexts, and comparing the observed allele
distributions to the expected allele distributions for different
ploidy states for the different parental contexts (different
parental genotypic patterns). Such algorithms are different than
methods that do not utilize allele distribution patterns for
alleles from a plurality of different parental contexts because
they allow the use of significantly more genetic measurement data
from a set of sequence data in the ploidy determination, resulting
in a more accurate determination. In certain embodiments provided
herein, the method includes determining whether the distribution of
observed allele measurements is indicative of a euploid or an
aneuploid fetus using observed allelic distributions measured at
loci where the mother is heterozygous. This allows the use of about
twice as much genetic measurement data from a set of sequence data
in the ploidy determination than methods that do not use observed
allelic distributions, resulting, in some instances, in a more
accurate determination.
[0150] In certain embodiments provided herein, genetic data is
obtained from DNA that is isolated using selective enrichment
techniques that preserve the allele distributions that are present
in the original sample of DNA. In some embodiments the
amplification and/or selective enrichment technique may involve
targeted amplification, hybrid capture, or circularizing probes. In
some embodiments, methods for amplification or selective enrichment
may involve using probes where the hybridizing region on the probe
is separated from the variable region of the polymorphic allele by
a small number of nucleotides. This separation results in lower
amounts of allelic bias. This is an improvement over methods that
involve using probes where the hybridizing region on the probe is
designed to hybridize at the base pair directly adjacent to the
variable region of the polymorphic allele. This is an improvement
over other methods that involve amplification and/or selective
enrichment methods that do not preserve the allele distributions
that are present in the original sample of DNA well. Low allelic
bias is critical for ensuring that the measured genetic data is
representative of the original sample in methods that involve
either calculating allele ratios or allele measurement
distributions. Since prior methods did not focus on polymorphic
regions of the genome, or on the allele distributions, it was not
obvious that techniques that preserved the allele distributions
would result in more accurate ploidy state determinations. Since
prior methods did not focus on using allelic distributions to
determine ploidy state, it was not obvious that a composition where
a plurality of loci were preferentially enriched with low allelic
bias would be particularly valuable for determining a ploidy state
of a fetus.
[0151] The methods described herein are particularly advantageous
when used on samples where a small amount of DNA is available, or
where the percent of circulating DNA is low. This is due to the
correspondingly higher allele dropout rate that occurs when only a
small amount of DNA is available, or the correspondingly higher
allele dropout rate when the percent of fetal or tumor DNA is low.
A high allele dropout rate, meaning that a large percentage of the
alleles were not measured for the target individual, results in
poorly accurate fetal fraction calculations, and poorly accurate
ploidy determinations. Since the method disclosed herein uses a
joint distribution model that takes into account the linkage in
inheritance patterns between SNPs, significantly more accurate
ploidy determinations may be made.
[0152] In embodiments related to NPD, the parental context may
refer to the genetic state of a given SNP, on each of the two
relevant chromosomes for each of the two parents of the target.
Note that in one embodiment, the parental context does not refer to
the allelic state of the target, rather, it refers to the allelic
state of the parents. The parental context for a given SNP may
consist of four base pairs, two paternal and two maternal; they may
be the same or different from one another. It is typically written
as "m₁m₂|f₁f₂," where m₁ and m₂ are
the genetic state of the given SNP on the two maternal chromosomes,
and f₁ and f₂ are the genetic state of the given SNP on
the two paternal chromosomes. In some embodiments, the parental
context may be written as "f₁f₂|m₁m₂." Note
that subscripts "1" and "2" refer to the genotype, at the given
allele, of the first and second chromosome; also note that the
choice of which chromosome is labeled "1" and which is labeled "2"
is arbitrary.
[0153] Note that in this disclosure, A and B are often used to
generically represent base pair identities; A or B could equally
well represent C (cytosine), G (guanine), A (adenine) or T
(thymine). For example, if, at a given allele, the mother's
genotype was T on one chromosome, and G on the homologous
chromosome, and the father's genotype at that allele is G on both
of the homologous chromosomes, one may say that the target
individual's allele has the parental context of AB|BB; it could
also be said that the allele has the parental context of AB|AA.
Note that, in theory, any of the four possible nucleotides could
occur at a given allele, and thus it is possible, for example, for
the mother to have a genotype of AT, and the father to have a
genotype of GC at a given allele. However, empirical data indicate
that in most cases only two of the four possible base pairs are
observed at a given allele. In this disclosure the discussion
assumes that only two possible base pairs will be observed at a
given allele, although the embodiments disclosed herein could be
modified to take into account the cases where this assumption does
not hold.
[0154] A "parental context" may refer to a set or subset of target
SNPs that have the same parental context. For example, if one were
to measure 1000 alleles on a given chromosome on a target
individual, then the context AA|BB could refer to the set of all
alleles in the group of 1,000 alleles where the genotype of the
mother of the target was homozygous, and the genotype of the father
of the target is homozygous, but where the maternal genotype and
the paternal genotype are dissimilar at that locus. If the parental
data is not phased, and thus AB=BA, then there are nine possible
parental contexts: AA|AA, AA|AB, AA|BB, AB|AA, AB|AB, AB|BB, BB|AA,
BB|AB, and BB|BB. If the parental data is phased, and thus
AB≠BA, then there are sixteen different possible parental
contexts: AA|AA, AA|AB, AA|BA, AA|BB, AB|AA, AB|AB, AB|BA, AB|BB,
BA|AA, BA|AB, BA|BA, BA|BB, BB|AA, BB|AB, BB|BA, and BB|BB. Every
SNP allele on a chromosome, excluding some SNPs on the sex
chromosomes, has one of these parental contexts. The set of SNPs
wherein the parental context for one parent is heterozygous may be
referred to as the heterozygous context.
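The nine unphased and sixteen phased parental contexts listed above can be enumerated mechanically. The following is a minimal sketch; the function name is hypothetical, and contexts are written with the maternal genotype to the left of the bar, as in the lists above.

```python
from itertools import product


def parental_contexts(phased):
    """List parental contexts for a biallelic SNP with alleles A and B.

    When the parental data is not phased, AB and BA are the same genotype,
    so each parent has three possible genotypes; when phased, four.
    """
    if phased:
        genotypes = ["".join(pair) for pair in product("AB", repeat=2)]
    else:
        genotypes = ["AA", "AB", "BB"]
    return [f"{mother}|{father}" for mother, father in product(genotypes, repeat=2)]


unphased = parental_contexts(phased=False)   # 9 contexts, AA|AA ... BB|BB
phased = parental_contexts(phased=True)      # 16 contexts, AA|AA ... BB|BB
```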
[0155] When considering which alleles to target, one may consider
the likelihood that some parental contexts are likely to be more
informative than others. For example, AA|BB and the symmetric
context BB|AA are the most informative contexts, because the fetus
is known to carry an allele that is different from the mother. For
reasons of symmetry, both AA|BB and BB|AA contexts may be referred
to as AA|BB. Another set of informative parental contexts are AA|AB
and BB|AB, because in these cases the fetus has a 50% chance of
carrying an allele that the mother does not have. For reasons of
symmetry, both AA|AB and BB|AB contexts may be referred to as
AA|AB. A third set of informative parental contexts are AB|AA and
AB|BB, because in these cases the fetus is carrying a known
paternal allele, and that allele is also present in the maternal
genome. For reasons of symmetry, both AB|AA and AB|BB contexts may
be referred to as AB|AA. A fourth parental context is AB|AB where
the fetus has an unknown allelic state, and whatever the allelic
state, it is one in which the mother has the same alleles. The
fifth parental context is AA|AA, where the mother and father are
homozygous for the same allele.
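The relative informativeness of these contexts can be illustrated by computing, for a disomic fetus, the probability that the fetus carries an allele absent from the mother. The sketch below assumes simple Mendelian inheritance of one allele from each parent with equal probability; the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def fetal_genotype_distribution(context):
    """Genotype probabilities for a disomic fetus, given a parental context
    written as 'mother|father' (for example 'AA|AB')."""
    mother, father = context.split("|")
    # One allele from each parent; each of the four combinations is equally likely.
    counts = Counter("".join(sorted(m + f)) for m, f in product(mother, father))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def prob_non_maternal_allele(context):
    """Probability that the fetus carries an allele the mother does not have."""
    mother, _ = context.split("|")
    dist = fetal_genotype_distribution(context)
    return sum(p for genotype, p in dist.items() if set(genotype) - set(mother))


p_aabb = prob_non_maternal_allele("AA|BB")   # 1
p_aaab = prob_non_maternal_allele("AA|AB")   # 1/2
p_abaa = prob_non_maternal_allele("AB|AA")   # 0
```

The values agree with the ordering given above: AA|BB always yields a non-maternal allele in the fetus, AA|AB does so half of the time, and the contexts in which the mother is heterozygous never do.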
[0156] In some examples of an embodiment of the invention for
detecting a presence or absence of aneuploidy or for measuring the
number of copies of a chromosome or chromosome segment of interest,
quantitative non-allelic genetic information can be used to
determine the copy number of the chromosome or chromosomal segment
of interest in the target cells. For example, a quantitative
non-allelic z-score method can be used to identify at least one
diploid sample in the set of samples that is disomic for the
chromosome or chromosome segment of interest. In such embodiments,
each sample in the set of samples can be analyzed in the following
manner: [0157] determine a proportion of reads that map to the
chromosome or chromosome segment of interest; calculate a z-score
for the proportion of reads that map to the chromosome or
chromosome segment of interest; and [0158] select one or more
samples where the absolute value of the z-score is below a
threshold value as a diploid sample, or where the z-score indicates
disomy with at least a minimum level of confidence (e.g. 90, 95,
96, 97, 98, 99, 99.5, or 99.9%), or select the 20, 15, 10, or 5% of
samples or the 50, 25, 20, 15, 10, 5, 4, 3, 2, or 1 sample(s)
with the lowest absolute value z-score for the set of samples.
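The selection steps above can be sketched as follows. The sketch assumes that the z-score of each sample is computed against the mean and sample standard deviation of the same set of samples, which is only one possible choice of reference distribution; the function names and the example proportions are hypothetical.

```python
from statistics import mean, stdev


def z_scores(proportions):
    """z-score of each sample's read proportion relative to the set."""
    mu = mean(proportions)
    sigma = stdev(proportions)
    return [(p - mu) / sigma for p in proportions]


def select_presumed_disomic(proportions, threshold=None, n_lowest=None):
    """Return indices of samples treated as disomic controls.

    Either keep every sample whose |z| is below `threshold`, or keep the
    `n_lowest` samples with the smallest |z|.
    """
    if threshold is None and n_lowest is None:
        raise ValueError("give either threshold or n_lowest")
    z = z_scores(proportions)
    order = sorted(range(len(z)), key=lambda i: abs(z[i]))
    if threshold is not None:
        return [i for i in order if abs(z[i]) < threshold]
    return order[:n_lowest]


# Hypothetical proportions of reads mapping to the chromosome of interest.
chr_fraction = [0.0130, 0.0131, 0.0129, 0.0130, 0.0138, 0.0130]
controls = select_presumed_disomic(chr_fraction, threshold=1.0)
```

With these example values the sample at index 4 has the largest absolute z-score and is left out of the control subset. Because an aneuploid sample inflates the mean and standard deviation of a small set, a robust estimate could be substituted for either.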
[0159] As another non-limiting example, a quantitative non-allelic
threshold method can be used to identify the presence or absence of
aneuploidy in the test sample. Such a method can be performed in
the following manner for each sample in the set of samples: [0160]
determine a proportion of reads that map to the chromosome or
chromosome segment of interest; calculate a z-score for the
proportion of reads that map to the chromosome or chromosome
segment of interest; and [0161] output whether the data for the
sample yields an absolute value of the z-score above a threshold
value, wherein a z-score with an absolute value above the threshold
is indicative of aneuploidy in the sample, or whether the data for the
sample yields a z-score indicative of aneuploidy with at least a
minimum level of confidence.
[0162] In related examples, non-allelic data can be used to
calculate or determine a sequencing depth of read for one or more
loci, or in some embodiments a depth of read for an entire
chromosome or segment of a chromosome. The depth of read refers to
the number of DNA fragments corresponding to the locus, chromosome
segment or chromosome of interest. The number of DNA fragments may
be measured using a sequencing methodology, and may refer to
amplified or unamplified DNA fragments. This non-allelic depth of
read information can then be compared to a threshold value (i.e., a
cut off value) relating the depth of sequencing reads from a
specific chromosome or specific chromosome segment to a predicted
chromosome copy number or chromosome segment copy number. In
another embodiment, this non-allelic depth of read information can
be used to calculate a z-score with a likelihood that a particular
chromosome or chromosome segment has a particular copy number. For
example, a z-score can be associated with a 70, 75, 80, 90, 95, 96,
97, 98, 99, or 99.9% confidence of a disomic or an aneuploid state
for a chromosome of interest in the test sample.
[0163] In further examples non-allelic quantitative threshold
methods are used to determine the copy number count of a chromosome
or chromosome segment in an individual, for example as part of NPD,
where the target individual is the fetus (i.e. the target cells
come from the placenta), and where the related individual is the
mother (i.e. the non-target cells come from the mother). In this
situation, cfDNA from the maternal plasma may be amplified in a
targeted or untargeted (random) fashion, and sequenced. The copy
number of the chromosome of interest in the target individual may
be inferred by comparing the absolute or relative number of
sequence reads, or sequence tags, mapping to the chromosome of
interest to the number of sequence reads, or sequence tags, mapping
to one or a plurality of reference chromosomes. In certain
illustrative examples, the reference chromosome is the same as the
chromosome of interest for aneuploidy. In other examples, a
reference chromosome or set of reference chromosomes that is
different from the chromosome of interest may be used. In certain
illustrative examples, a subset of samples are determined to be
diploid from an initial analysis of data during a parallel
analysis. For example, all samples that have a z-score with an
absolute value below a threshold such as 3, 2.5, 2, 1.5, 1, or 0.5
can be determined to be diploid. If the
number of sequence reads mapping to the chromosome of interest for
the remaining samples (those that were not determined to be diploid
in the initial analysis) is disproportionately higher than would be
expected given the number of sequence reads mapping to one or a
plurality of reference chromosomes, then a fetal trisomy may be
inferred. If the number of sequence reads mapping to the chromosome
of interest is disproportionately lower than would be expected
given the number of sequence reads mapping to one or a plurality of
reference chromosomes, then a fetal monosomy may be inferred. If
the number of sequence reads mapping to the chromosome of interest
is proportionate to what would be expected given the number of
sequence reads mapping to the reference chromosome, then disomy may
be inferred. There are many ways to determine what number of
sequence reads mapping to the chromosome of interest is
proportionate, or disproportionate, to what would be expected,
given the number of sequence reads mapping to the reference
chromosome, including normalization based on representation in the
genome, and also including GC-bias correction, which is where the
expected number of reads may be normalized based on the fact that
GC-rich regions of the genome may not amplify at an equivalent rate
to non-GC-rich regions of the genome.
[0164] In a related embodiment, a method of the invention includes
both a non-allelic z-score based quantitative method and a maximum
likelihood method based on allelic or non-allelic data.
Accordingly, provided herein is a method for detecting a presence
or absence of aneuploidy of a chromosome or chromosome segment of
interest in a test sample, that includes the following steps:
obtaining genetic data for the chromosome or chromosome segment of
interest from each sample in a set of samples comprising the test
sample, wherein the genetic data is obtained from a parallel
analysis of the samples; [0165] determining whether aneuploidy is
present in the test sample by a first method comprising: [0166] a.
determining a depth of reads or a proportion of reads that map to
the chromosome or chromosome segment of interest; [0167] b.
calculating a z-score for the depth of reads or the proportion of
reads that map to the chromosome or chromosome segment of interest;
and [0168] c. determining whether the test sample is aneuploid at
the chromosome or chromosome segment of interest based on the
z-score, thereby providing a first result; and determining whether
aneuploidy is present in the test sample by a second method
comprising: [0169] d. creating a plurality of ploidy hypotheses
wherein each ploidy hypothesis is associated with a specific copy
number for the chromosome or chromosome segment of interest, [0170]
e. determining a ploidy probability value for each ploidy
hypothesis, wherein the ploidy probability value indicates the
likelihood that the test sample has the specific copy number for
the chromosome or chromosome segment of interest that is associated
with the ploidy hypothesis, and [0171] f. determining which ploidy
hypothesis is most likely to be correct by selecting the ploidy
hypothesis with the maximum likelihood, thereby providing a second
result; and detecting the aneuploidy by considering the first result
and the second result.
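A minimal sketch of steps a through f follows. It makes several assumptions that are not specified above: the read proportion under each ploidy hypothesis is modeled as normally distributed with a common standard deviation, the hypotheses have equal prior probability, the fraction of target DNA is known, and the two results are "considered" by requiring that they agree. All names and numeric values are hypothetical.

```python
import math


def z_score_call(proportion, disomic_mean, disomic_sd, threshold=3.0):
    """First method: z-score against the disomic reference; True if aneuploid."""
    z = (proportion - disomic_mean) / disomic_sd
    return z, abs(z) > threshold


def max_likelihood_call(proportion, disomic_mean, disomic_sd, target_fraction,
                        copy_numbers=(1, 2, 3)):
    """Second method: pick the copy-number hypothesis with the highest likelihood.

    Assumed model: the expected proportion under copy number c is
    disomic_mean * (1 + target_fraction * (c - 2) / 2).
    """
    log_lik = {}
    for c in copy_numbers:
        expected = disomic_mean * (1.0 + target_fraction * (c - 2) / 2.0)
        log_lik[c] = -0.5 * ((proportion - expected) / disomic_sd) ** 2
    best = max(log_lik, key=log_lik.get)
    # Normalize to probabilities (equal priors), shifting by the maximum
    # log-likelihood for numerical stability.
    top = log_lik[best]
    weights = {c: math.exp(v - top) for c, v in log_lik.items()}
    total = sum(weights.values())
    return best, {c: w / total for c, w in weights.items()}


def combined_call(proportion, disomic_mean, disomic_sd, target_fraction):
    """Combine both results; a disagreement is reported as 'no call'."""
    _, z_aneuploid = z_score_call(proportion, disomic_mean, disomic_sd)
    best, _ = max_likelihood_call(proportion, disomic_mean, disomic_sd, target_fraction)
    ml_aneuploid = best != 2
    if z_aneuploid and ml_aneuploid:
        return "aneuploid"
    if not z_aneuploid and not ml_aneuploid:
        return "euploid"
    return "no call"


call = combined_call(0.01366, disomic_mean=0.0130, disomic_sd=0.0001,
                     target_fraction=0.10)
```

Other ways of considering the two results, such as modifying the probability from one method with the probability from the other, could be substituted for the concordance rule used here.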
[0172] The z-score based on a non-allelic quantitative threshold or
cutoff value can be determined in a variety of ways, for example an
average depth of read (normalized for the length of the specific
chromosome) can be obtained from a chromosome or chromosome
segment, i.e., a reference chromosome or chromosome segment, that
is assumed or proven to have a specific copy number with a high
degree of certainty (e.g., chromosome 2 in a developing fetus can
safely be assumed to be diploid). In examples of this embodiment of the
invention, the cutoff value is based on a reference chromosome or
chromosome segment that is the same as a chromosome or chromosome
segment having the copy number that is being measured, and in
certain illustrative examples, without the use of a sample known in
advance of an assay, as being diploid. In embodiments of the
invention where the cutoff value is based on a reference chromosome
or chromosome segment that is the same as the chromosome or
chromosome segment having the copy number that is being measured,
sets of patients (test subjects) can be co-analyzed in a run of a
high throughput DNA sequencer, so as to produce a reference value
(cutoff value). This reference value can be indicative of the
number of copies of a given chromosome or chromosome segment in a
patient. For example, if the amount of total DNA sequence
information obtained from a specific chromosome exceeds the cutoff
value, it may be possible to determine that the target cell
contains a trisomy on a specific chromosome with a high degree of
confidence of a correct determination. This probability of a
specific chromosome copy number or chromosome copy number segment
can be modified using a second probability value, wherein the
second probability value is determined from allelic data.
[0173] When sequencing is used for ploidy calling of a fetus in the
context of non-invasive prenatal diagnosis, there are a number of
ways to analyze the sequence data to determine the ploidy of the
fetus. In one method that is used in some embodiments provided
herein, a non-allelic threshold method is used. In one example of
such a method, the sequence data is used by counting the number of
reads that map to a given chromosome. For example, consider an
example where the goal is to determine the ploidy state of
chromosome 21 on the fetus where the DNA in the sample is comprised
of 10% DNA of fetal origin, and 90% DNA of maternal origin. In this
case, one could identify disomic samples as samples that initially
yield a z-score of below a threshold, and compare reads obtained
for chromosome 21 for a test sample to average reads of chromosome
21 for the diploid samples. If the on-test fetus were euploid, one
would expect the amount of DNA per unit of genome to be about equal
in chromosome 21 from a disomic sample to chromosome 21 in a sample
from the euploid on-test fetus. If the fetus were trisomic at
chromosome 21, on the other hand, then one would expect there to
be slightly more DNA per genetic unit from chromosome 21 from
the on-test sample than for the disomic sample(s). Another method
that could be used to detect aneuploidy is similar to that above,
except that parental contexts could be taken into account.
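The size of the expected difference in the chromosome 21 example can be worked out under a simple mixture assumption, in which a fraction f of the DNA is of fetal origin, the maternal DNA is disomic, and the fetus carries c copies of chromosome 21. The symbols D_test, D_disomic, f and c are introduced here for illustration only.

```latex
\frac{D_{\text{test}}}{D_{\text{disomic}}}
  = \frac{(1-f)\cdot 2 + f\cdot c}{2}
  = 1 + f\,\frac{c-2}{2}

% With f = 0.10:
%   c = 2 (euploid):  1 + 0.10 \cdot 0   = 1.00
%   c = 3 (trisomy):  1 + 0.10 \cdot 0.5 = 1.05
%   c = 1 (monosomy): 1 - 0.10 \cdot 0.5 = 0.95
```

Under this assumption a trisomic fetus at a 10% fetal fraction would be expected to yield only about 5% more chromosome 21 DNA than a disomic sample, which is the sense in which the excess is slight.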
Methods for Determining the Number of Copies of the Chromosome or
Chromosome Segment Employing a Reference Value Derived from a
Subset of Patients
[0174] One embodiment of the invention is a method for determining
the number of copies of a chromosome or chromosome segment of
interest in the genome of a target cell, such as a fetal cell or
tumor cell. Genetic data, e.g., DNA sequence data, can be obtained
from a mixture of DNA comprising DNA derived from one or more
target cells and DNA derived from one or more non-target cells. The
target cells and non-target cells differ with respect to one
another at the genomic level, as by virtue of other criteria. The
term "derived" is used to indicate that the cells are the ultimate
source of the DNA. Thus, for example, cell-free DNA obtained from
the maternal blood of a pregnant woman is derived from fetal cells and the
mother's cells. The method employs a set of patients. The genetic
data is obtained from each member of the patient set. Each patient
in the set of patients is analyzed using essentially the same
method of nucleic acid sequence analysis, e.g., the same amplification
and sequencing reagents. Genetic information is obtained at a
plurality of loci. In some embodiments, at least some, and possibly
all of the loci are polymorphic. In some embodiments, all of the
loci could be non-polymorphic. In some embodiments, the same loci
are analyzed in both the target and non-target cells. In other
embodiments the loci comprise non-polymorphic loci and also
polymorphic loci; in this case, methods that utilize allelic data
can be used with the allelic data measured on the polymorphic loci
as input, and other methods that utilize non-polymorphic data can
be used with the non-polymorphic data measured on non-polymorphic
loci as input, optionally including additional non-polymorphic data
that can be produced by summing the allelic quantities from each of
the alleles at one or more of the polymorphic loci. A number of
sequence reads is obtained for each locus. In some embodiments the
number of each allele at a given locus is quantitated. The
quantitative data obtained can be from a combination of the loci
from the target cell and the non-target cell genomes. A depth of
sequencing reference value is derived from the genetic data
obtained from this set of patients or in some embodiments, the
depth of sequencing reference value is derived from a subset of the
original set of patients. The genetic data derived from the
specific chromosome or chromosome segment of interest from a
selected patient in the set of patients is compared to the
reference value, wherein the comparison indicates the copy number
of the specific chromosome or chromosome segment of interest from
the selected patient.
[0175] In some embodiments, the genetic data is obtained by
sequencing. The sequencing may be performed on a high throughput
parallel DNA sequencer.
[0176] In some embodiments, genetic data is obtained by
simultaneously sequencing a mixture comprising DNA derived from one
or more target cells and DNA derived from one or more non-target cells to
give genetic data at the set of loci from each member of the set of
patients.
[0177] In some embodiments the target cells are fetal cells and
non-target cells are from the mother of the fetus.
[0178] In some embodiments directed to non-invasive prenatal
diagnosis, the target cells may be fetal cells and the non-target
cells may be maternal cells.
[0179] In some embodiments of the invention, an example of a
hypothesis that may be used to select a subset of samples is the
hypothesis that a specific chromosome or chromosome segment is
diploid, i.e., present in 2 copies. Examples of chromosomes for
analysis include chromosomes 13, 18, 21, X and Y, including
segments thereof. For example, the subset of samples may be chosen
on the basis of having the highest likelihood that all or nearly
all of the DNA in the sample originated from cells with precisely
two copies of the chromosome of interest.
[0180] In some embodiments, the chromosome segment that is analyzed
for copy number is selected from the group consisting of chromosome
22q11.2, chromosome 1p36, chromosome 15q11-q13, chromosome 4p16.3,
chromosome 5p15.2, chromosome 17p13.3, chromosome 22q13.3,
chromosome 2q37, chromosome 3q29, chromosome 9q34, chromosome
17q21.31, and the terminus of a chromosome.
[0181] In some embodiments, the set of loci are present on a
selected region of a chromosome. In some embodiments, the method is
performed independently for different chromosomes or chromosome
segments. The only upper limit imposed on the number of patients in
the set of patients is imposed by the DNA sequence generating
capacity of the specific DNA sequencing technology selected
(including the patient multiplexing technology, e.g. barcoding,
compatible with that sequencing technology). In general there will
be at least 10 patients in a patient set. In some embodiments there
will be at least 24 patients in the patient set, in other
embodiments there will be at least 48 patients, and in other
embodiments there will be at least 96 patients.
[0182] In some embodiments the target cell is a tumor cell and the
non-target cell is a non-tumor cell.
[0183] Methods for analyzing genetic data for aneuploidy using a
threshold or cutoff method are known in the art. U.S. Pat. No.
7,888,017, incorporated herein by reference, provides a method for
determining fetal aneuploidy by counting the number of reads that
map to a suspect chromosome and comparing it to the number of reads
that map to a reference chromosome, and using the assumption that
an overabundance of reads on the suspect chromosome corresponds to
a triploidy in the fetus at that chromosome. Teachings provided
therein can be useful in carrying out embodiments of the present
invention that involve a depth of sequencing reads and a reference
value. It will be understood that in this embodiment of the present
invention a significant improvement over such methods is provided,
because in this embodiment of the present invention the depth of
sequencing reference value is derived from a subset of the original
set of samples processed in parallel, using the chromosome or
chromosome segment of interest in samples initially determined to
be diploid in the parallel analysis, for the analysis of other
samples in the parallel analysis of the set of samples. A skilled
artisan with this disclosure will understand how to modify methods
provided in these cited threshold method patents to perform methods
provided herein.
[0184] Methods for Determining the Number of Copies of a Chromosome
or Chromosome Segment in which a Set of Patients that have a
Relative Fraction of DNA from the Chromosome of Interest Close to
the Median of the Relative Fraction of DNA from the Chromosome of
Interest from a Larger Set of Patients
[0185] One embodiment of the invention is a method for determining
the number of copies of a chromosome or chromosome segment of
interest in the genome of a target cell, such as a fetal cell or
tumor cell. Genetic data, e.g., DNA sequence data, can be obtained
from a mixture of DNA comprising DNA derived from one or more
target cells and DNA derived from one or more non-target cells. The
target cells and non-target cells differ with respect to one
another at the genomic level, as by virtue of other criteria. The
term "derived" is used to indicate that the cells are the ultimate
source of the DNA. Thus, for example, cell-free DNA obtained from
the maternal blood of a pregnant woman is derived from fetal cells and the
mother's cells. The method employs a set of patients. The genetic
data is obtained from each member of the patient set. Each patient
in the set of patients is analyzed in parallel using essentially
the same method of nucleic acid sequence analysis, e.g., the same
amplification and sequencing reagents. Genetic information is
obtained. The quantitative data obtained can be from a combination
of the loci from the target cell and the non-target cell genomes.
The genetic data obtained from the combination of the target cell
DNA and the non-target cell DNA is used to obtain genetic data of
the relative fraction of DNA (depth of sequencing read) that
corresponds to the chromosome or chromosome segments of
interest.
[0186] A subset of patients is selected as a control subset, by
choosing those patients where the relative fraction of DNA that
corresponds to the chromosome or chromosome segments of interest in
the obtained genetic data for that patient is closest to the median
of the relative fractions for the set of patients. This median can
be obtained on a per locus basis, or in other embodiments by
grouping loci into subsets of loci, which are generally in close
physical proximity to one another (e.g., a genetic linkage with one
another) on the chromosome or chromosome segment of interest or by
looking at a chromosome or chromosome segment as a whole. A
reference value is determined for the relative fraction of DNA in
the obtained genetic data that corresponds to the chromosome or
chromosome segments of interest from the subset of patients. The
reference value for the relative fraction of DNA that corresponds
to the chromosome or chromosome segment of interest is compared to
the obtained genetic data from a selected patient in the set of
patients, wherein the comparison produces an experimental value
indicative of the presence or absence of a genetic abnormality in
chromosome copy number or chromosome segment copy number in the
target cell.
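The selection of a control subset closest to the median, and the
derivation of a reference value from it, can be sketched as follows.
This is an illustrative sketch only: the function names, the use of
the mean as the summary of the control subset, the use of a simple
ratio as the experimental value, and the sample fractions are all
assumptions rather than details taken from the disclosure.

```python
from statistics import mean, median


def select_control_subset(fractions, subset_size):
    """Return sample IDs whose relative fraction is closest to the median.

    `fractions` maps a sample ID to the fraction of that sample's reads
    mapping to the chromosome or chromosome segment of interest.
    """
    if not 1 <= subset_size <= len(fractions):
        raise ValueError("subset_size must be between 1 and the sample count")
    med = median(fractions.values())
    # Rank by distance from the median; ties are broken by sample ID.
    ranked = sorted(fractions, key=lambda sid: (abs(fractions[sid] - med), sid))
    return ranked[:subset_size]


def reference_value(fractions, control_ids):
    """Mean relative fraction across the control subset (assumed summary)."""
    return mean(fractions[sid] for sid in control_ids)


# Hypothetical relative fractions for five samples analyzed in parallel.
run = {"s1": 0.0130, "s2": 0.0131, "s3": 0.0128, "s4": 0.0137, "s5": 0.0130}
controls = select_control_subset(run, subset_size=3)
ref = reference_value(run, controls)
ratio_s4 = run["s4"] / ref  # experimental value for sample s4
```

In this toy run, sample s4 sits roughly 5% above the reference value
derived from the three samples nearest the median; whether that
indicates an abnormality depends on the diagnostic threshold chosen.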
[0187] In some embodiments, the subset is selected as the 25, 20,
15, 10, 5, or 2% of patients or the 50, 40, 30, 25, 20, 15, 10, 5,
or 2 patients whose genetic data is closest to the mean, or
preferably the median for all samples.
[0188] In some embodiments, the experimental value may exceed a
specific diagnostic threshold value. In some embodiments the
genetic data is obtained by DNA sequencing. In some embodiments the
genetic data from the set of patients is obtained by simultaneously
sequencing a mixture comprising DNA derived from one or more target
cells and DNA derived from one or more non-target cells to give
genetic data at the set of loci from each member of the set of
patients.
[0189] In some embodiments, the genetic data is obtained by
sequencing. The sequencing may be performed on a high throughput
parallel DNA sequencer.
[0190] In some embodiments, genetic data is obtained by
simultaneously sequencing a mixture comprising DNA derived from one
or more target cells and DNA derived from one or more non-target cells to
give genetic data at the set of loci from each member of the set of
patients.
[0191] In some embodiments the target cells are fetal cells and
non-target cells are from the mother of the fetus.
[0192] In some embodiments directed to non-invasive prenatal
diagnosis, the target cells may be fetal cells and the non-target
cells may be maternal cells.
[0193] In some embodiments of the invention, an example of a
hypothesis that may be used to determine the subset of samples is
the hypothesis that a specific chromosome or chromosome segment is
diploid, i.e., present in 2 copies. Examples of chromosomes for
analysis include chromosomes 13, 18, 21, X and Y, including
segments thereof.
[0194] In some embodiments, the chromosome segment that is analyzed
for copy number is selected from the group consisting of chromosome
22q11.2, chromosome 1p36, chromosome 15q11-q13, chromosome 4p16.3,
chromosome 5p15.2, chromosome 17p13.3, chromosome 22q13.3,
chromosome 2q37, chromosome 3q29, chromosome 9q34, chromosome
17q21.31, and the terminus of a chromosome.
[0195] In some embodiments, the set of loci are present on a
selected region of a chromosome. In some embodiments, the method is
performed independently for different chromosomes or chromosome
segments. The only upper limit imposed on the number of patients
in the set of patients is imposed by the DNA sequence generating
capacity of the specific DNA sequencing technology selected
(including the patient multiplexing technology, e.g. barcoding,
compatible with that sequencing technology). In general there will
be at least 10 patients in a patient set. In some embodiments there
will be at least 24 patients in the patient set, in other
embodiments there will be at least 48 patients, and in other
embodiments there will be at least 96 patients.
[0196] In some embodiments the target cell is a tumor cell and the
non-target cell is a non-tumor cell. In some embodiments the first
probability value is derived from the genetic data obtained from
polymorphic loci that comprise alleles present in the target cells
that are not present in the non-target cells. In some embodiments the
cell-free DNA comprises DNA that has been released by apoptosis. In
some embodiments the target cell is a tumor cell; such a tumor cell
may be a malignant tumor cell.
[0197] In some embodiments, provided herein is a method for
determining a presence or absence of a fetal aneuploidy in a fetus
for each of a plurality of maternal blood samples obtained from a
plurality of different pregnant women, said maternal blood samples
comprising fetal and maternal cell-free genomic DNA, that includes
the following steps:
[0198] determining a number of enumerated sequence reads
corresponding to a chromosome or chromosome segment of interest
for each of the plurality of samples;
[0199] determining a reference value of enumerated sequence reads
from a diploid subset of between 1 and 50 samples of the plurality
of samples or between 1-50% of samples of the plurality of samples
having a number of enumerated sequence reads closest to the median
number of enumerated sequence reads for the plurality of maternal
blood samples; and
[0200] comparing the number of enumerated sequence reads from at
least one of, or each of, the other samples of the plurality of
samples that are not diploid samples, to the reference value, wherein
a value above a cutoff is indicative of aneuploidy in the sample,
thereby determining the presence or absence of a fetal aneuploidy
in the chromosome or chromosome segment of interest.
[0201] In certain embodiments the method further comprises before
the determining the number of enumerated sequence reads: [0202] a.
obtaining a fetal and maternal cell-free genomic DNA sample from
each of the plurality of maternal blood samples; [0203] b.
generating a library derived from each fetal and maternal cell-free
genomic DNA sample, [0204] c. performing massively parallel
sequencing of polynucleotide sequences of the library from the
chromosome or chromosome segment of interest; and [0205] d.
enumerating sequence reads corresponding to fetal and maternal
polynucleotide sequences selected from the chromosome or chromosome
segment of interest.
[0206] In certain embodiments, the reference value of enumerated
sequence reads is determined from a diploid subset of between 10
and 40 samples closest to the median.
[0207] In certain embodiments, the reference value of enumerated
sequence reads is determined from a diploid subset of between 15
and 40 samples closest to the median.
[0208] In other embodiments, the diploid subset can be determined
by selecting a diploid subset of between 1 and 50, 2 and 40, or 10
and 40 of the samples or between 1-50%, 2-40%, 5-25%, or 5-10% of
the samples having a number of enumerated sequence reads closest to
the median number of enumerated sequence reads for the plurality of
maternal blood samples.
[0209] In these embodiments, each library of polynucleotide
sequences can include an indexing nucleotide sequence which
identifies a maternal blood sample of the plurality of maternal
blood samples. Such examples typically include pooling the
libraries generated to produce a pool of enriched and indexed fetal
and maternal non-random polynucleotide sequences.
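The role of the indexing sequence in a pooled run can be sketched as
follows. This is a simplified illustration that assumes each mapped
read has already been reduced to an (index sequence, chromosome) pair;
the function name `enumerate_reads`, the index sequences, and the
sample names are hypothetical.

```python
from collections import Counter, defaultdict


def enumerate_reads(reads, index_to_sample):
    """Count reads per chromosome for each sample in a pooled run.

    `reads` is an iterable of (index_sequence, chromosome) pairs, one
    per mapped read. Reads with an unrecognized index are skipped.
    """
    counts = defaultdict(Counter)
    for index_seq, chromosome in reads:
        sample = index_to_sample.get(index_seq)
        if sample is not None:
            counts[sample][chromosome] += 1
    return counts


# Hypothetical index sequences and mapped reads from one pooled run.
index_to_sample = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = [
    ("ACGT", "chr21"),
    ("ACGT", "chr21"),
    ("TGCA", "chr21"),
    ("ACGT", "chr18"),
    ("NNNN", "chr21"),  # unrecognized index, ignored
]
counts = enumerate_reads(reads, index_to_sample)
```

The per-sample, per-chromosome counts produced this way correspond to
the enumerated sequence reads used in the steps above.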
[0210] In certain embodiments, the plurality of non-random
polynucleotide sequences comprises at least 100 different
non-random polynucleotide sequences selected from a first
chromosome tested for being aneuploid (i.e. chromosome of interest)
wherein each of said plurality of non-random polynucleotide
sequences is from 10 to 1000 nucleotide bases in length.
[0211] In certain embodiments, the method further includes
selectively enriching a plurality of non-random polynucleotide
sequences of each fetal and maternal cell-free genomic DNA
sample.
[0212] In methods of the immediately above embodiment, further
background teaching can be found in U.S. Pat. No. 8,318,430, hereby
incorporated by reference in its entirety.
Embodiments that Determine Aneuploidy with Improved Confidence by
Utilizing a Non-Allelic Threshold Method and a Method that
Determines Likelihoods
[0213] In some embodiments of the invention, improved confidence
for an aneuploidy determination can be obtained by determining
aneuploidy of a sample using a quantitative non-allelic threshold
or cutoff method and for the same sample, determining aneuploidy
using a method that determines likelihoods. If the sample is
identified as having aneuploidy in a chromosome or chromosome segment
of interest by a threshold method and the sample is identified as
having aneuploidy with high confidence using a likelihood
determination for a set of hypotheses, then the sample is
identified as a sample having aneuploidy at the chromosome or
chromosome segment of interest for one or more target cells in a
subject that is the source of the sample.
[0214] Accordingly, provided herein is a method for determining a
presence or absence of aneuploidy of a chromosome or chromosome
segment of interest in a test sample, comprising
[0215] a. obtaining genetic data for the chromosome or chromosome
segment of interest from a set of samples comprising the test
sample, wherein the genetic data is obtained from a parallel
analysis of the samples;
[0216] b. determining whether aneuploidy is present in the test
sample by a first method comprising
[0217] i. determining a depth of read or a proportion of reads that
map to the chromosome or chromosome segment of interest;
[0218] ii. calculating a z-score for the depth of reads or the
proportion of reads that map to the chromosome or chromosome
segment of interest; and
[0219] iii. determining whether the z-score for the test sample is
above a threshold value or whether the z-score is indicative of
aneuploidy with a minimum level of confidence;
[0220] c. determining whether aneuploidy is present in the test
sample by a second method comprising
[0221] i. creating a plurality of ploidy hypotheses wherein each
ploidy hypothesis is associated with a specific copy number for the
chromosome or chromosome segment of interest,
[0222] ii. determining a ploidy probability value for each ploidy
hypothesis, wherein the ploidy probability value indicates the
likelihood that the target sample has the number of copies of the
chromosome or chromosome segment of interest that is associated
with the ploidy hypothesis, and
[0223] iii. determining which ploidy hypothesis is most likely to
be correct by selecting the ploidy hypothesis with the maximum
likelihood, wherein aneuploidy is determined for the chromosome or
chromosome segment of interest in the test sample when both a
maximum likelihood ploidy hypothesis is an aneuploidy and a z-score
is above the threshold from step Bii.
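The way the two determinations are combined in the steps above can be
sketched as follows. This is a minimal illustration, not the
disclosed implementation: the function names, the default z-score
threshold of 3.0, the use of the sample standard deviation of a
control set, and the example values are all assumptions, and the
hypothesis likelihoods are taken as given from whichever likelihood
method is used.

```python
from statistics import mean, stdev


def z_score(test_value, control_values):
    """z-score of a test sample's read proportion against control samples."""
    return (test_value - mean(control_values)) / stdev(control_values)


def call_aneuploidy(test_value, control_values, hypothesis_likelihoods,
                    z_threshold=3.0, euploid_copies=2):
    """Call aneuploidy only when both methods agree.

    `hypothesis_likelihoods` maps a copy number to its likelihood.
    The threshold default is an assumed value for illustration.
    """
    z = z_score(test_value, control_values)
    best_copies = max(hypothesis_likelihoods, key=hypothesis_likelihoods.get)
    return z > z_threshold and best_copies != euploid_copies


# Hypothetical read proportions for the chromosome of interest.
controls = [0.0130, 0.0131, 0.0129, 0.0130, 0.0132]
agree = call_aneuploidy(0.0137, controls, {1: 1e-12, 2: 0.02, 3: 0.98})
disagree = call_aneuploidy(0.0137, controls, {2: 0.90, 3: 0.10})
```

Here `agree` is a positive call because the z-score exceeds the
threshold and the maximum likelihood hypothesis is three copies, while
`disagree` is not called because the likelihood method favors two
copies despite the elevated z-score.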
[0224] In the above method, step B is carried out by a non-allelic
threshold method and step C is carried out using a likelihood
determining method. Methods are known in the art for carrying out a
non-allelic threshold analysis, especially for NIPT. For example,
U.S. Pat. Nos. 7,888,017 and 8,318,430, incorporated in their
entirety herein by reference, provide methods for determining fetal
aneuploidy by counting the number of reads that map to a suspect
chromosome and comparing it to the number of reads that map to a
reference chromosome, and using the assumption that an
overabundance of reads on the suspect chromosome corresponds to a
triploidy in the fetus at that chromosome. Teachings provided
therein can be useful in carrying out embodiments of the present
invention that involve a depth of sequencing reads and a reference
value.
[0225] In certain examples of methods of this embodiment, using a
non-allelic threshold value, the non-allelic information can be
used to calculate a sequencing depth of read for one or more loci,
or in some embodiments a depth of read for an entire chromosome or
segment of a chromosome. This non-allelic depth of read information
can then be compared to a threshold value (i.e., a cutoff value)
relating the depth of sequencing reads from a specific chromosome or
specific chromosome segment to a predicted chromosome copy number
or chromosome segment copy number. This cutoff value can be
determined in a variety of ways; for example, an average depth of
read (normalized for the length of the specific chromosome) can be
obtained from a chromosome or chromosome segment, i.e., a reference
chromosome or chromosome segment, that is assumed or proven to have
a specific copy number with a high degree of certainty (e.g., chromosome 2
in a developing fetus can safely be assumed to be diploid). In some
embodiments of the invention, the cutoff value is based on a
reference chromosome or chromosome segment that is different than
the chromosome or chromosome segment having the copy number that is
being measured, wherein the different chromosome is assumed to have
a specific copy number. In some embodiments of the invention, the
cutoff value is based on a reference chromosome or chromosome
segment that is the same as a chromosome or chromosome segment
having the copy number that is being measured, and in certain
illustrative examples, without the use of a sample known in advance
of an assay, as being diploid.
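A length-normalized depth of read comparison against a reference
chromosome can be sketched as follows. This is only an illustration:
the function names, the cutoff ratio, the read counts, and the region
lengths are hypothetical placeholders, not values from the disclosure
or actual chromosome sizes.

```python
def normalized_depth(read_count, region_length_bp):
    """Reads per base pair for a chromosome or chromosome segment."""
    if region_length_bp <= 0:
        raise ValueError("region_length_bp must be positive")
    return read_count / region_length_bp


def exceeds_cutoff(test_reads, test_length_bp, ref_reads, ref_length_bp,
                   cutoff_ratio):
    """True when the test region's depth exceeds the reference depth
    by more than the assumed cutoff ratio."""
    test_depth = normalized_depth(test_reads, test_length_bp)
    ref_depth = normalized_depth(ref_reads, ref_length_bp)
    return test_depth / ref_depth > cutoff_ratio


# Placeholder counts and lengths chosen only to make the arithmetic clear.
elevated = exceeds_cutoff(84_000, 40_000_000, 480_000, 240_000_000,
                          cutoff_ratio=1.03)
typical = exceeds_cutoff(80_000, 40_000_000, 480_000, 240_000_000,
                         cutoff_ratio=1.03)
```

In the first call the test region has about 5% more reads per base
pair than the reference and exceeds the assumed cutoff; in the second
the depths are equal and the cutoff is not exceeded.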
[0226] In embodiments of the invention where the cutoff value is
based on a reference chromosome or chromosome segment that is the
same as the chromosome or chromosome segment having the copy number
that is being measured, sets of patients (test subjects) can be
co-analyzed in a run of a high throughput DNA sequencer, so as to
produce a reference value (cutoff value). This reference value can
be indicative of the number of copies of a given chromosome or
chromosome segment in a patient. For example, if the amount of
total DNA sequence information obtained from a specific chromosome
exceeds the cutoff value, it may be possible to determine that the
target cell contains a trisomy on a specific chromosome with a high
degree of confidence of a correct determination. In these examples,
the same data, or a subset thereof, that is used for the
non-allelic threshold method, can be used for a non-allelic or
allelic likelihood method. Thus, efficiencies are gained by using
the same data or a subset thereof in a parallel experiment with the
same set of samples using both the non-allelic threshold analysis
and the likelihood determining method.
[0227] For the likelihood method in certain examples of methods of
this embodiment, the genetic data includes quantitative allelic
data from a plurality of polymorphic loci in the set of loci,
wherein each of the ploidy hypotheses specifies an expected
distribution of quantitative allelic data at a plurality of
polymorphic loci, and wherein the ploidy probability values are
determined by calculating, for each of the ploidy hypotheses, the
fit between the expected genetic data and the obtained genetic
data. In certain examples of methods of this embodiment, the
genetic data includes quantitative non-allelic data from a
plurality of polymorphic loci in the set of loci, and wherein each
of the ploidy hypotheses specifies an expected mean value of
quantitative non-allelic data at the plurality of polymorphic loci,
and wherein the ploidy probability values are determined by
calculating, for each of the ploidy hypotheses, the fit between the
expected genetic data and the obtained genetic data. Provided
throughout this application are methods that provide likelihoods.
This includes both allelic and non-allelic methods. For example, a
het-rate method provided herein or a QMM method can be used.
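The idea of scoring each ploidy hypothesis by the fit between an
expected mean and the observed non-allelic data can be sketched as
follows. This sketch is not the het-rate or QMM method; it assumes a
simple Gaussian noise model with a known standard deviation, a flat
prior over hypotheses, and a known fraction of target DNA, and all
names and numbers are hypothetical.

```python
import math


def gaussian_log_likelihood(observed, expected_mean, sigma):
    """Log-likelihood of observed depths under an assumed normal model."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2)
        - (x - expected_mean) ** 2 / (2.0 * sigma ** 2)
        for x in observed
    )


def ploidy_probabilities(observed, baseline_depth, target_fraction, sigma,
                         copy_numbers=(1, 2, 3)):
    """Probability of each copy-number hypothesis under a flat prior."""
    log_l = {}
    for copies in copy_numbers:
        # Expected mean depth for a mixture of disomic non-target DNA and
        # target DNA with `copies` copies, relative to a disomic baseline.
        expected = baseline_depth * (1.0 + target_fraction * (copies - 2) / 2.0)
        log_l[copies] = gaussian_log_likelihood(observed, expected, sigma)
    # Normalize in log space to avoid numerical underflow.
    top = max(log_l.values())
    weights = {c: math.exp(v - top) for c, v in log_l.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical per-locus depths with a 10% target fraction.
depths = [1052, 1048, 1050, 1055, 1046]
probs = ploidy_probabilities(depths, baseline_depth=1000,
                             target_fraction=0.10, sigma=20)
best = max(probs, key=probs.get)
```

With these made-up depths, the three-copy hypothesis has the maximum
likelihood because its expected mean depth of about 1050 fits the
observations most closely.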
Non-Invasive Prenatal Diagnosis (NIPD)
[0228] Non-invasive prenatal diagnosis is an important technique
that can be used to determine the genetic state of a fetus from
genetic material that is obtained in a non-invasive manner, for
example from a blood draw on the pregnant mother. The blood could
be separated and the plasma isolated, and size selection can
optionally be used to isolate the DNA of the appropriate length.
This isolated DNA can then be measured by a number of means, such
as by hybridizing to a genotyping array and measuring the
fluorescence, or by sequencing on a high throughput sequencer.
[0229] In illustrative examples the methods and systems provided
herein are used for NIPD, also referred to herein as non-invasive
prenatal testing (NIPT). The process of non-invasive prenatal
diagnosis in certain embodiments involves a number of steps. Some
of the steps can include: (1) obtaining the genetic material from
the fetus; (2) optionally enriching the genetic material of the
fetus, ex vivo; (3) amplifying the genetic material, ex vivo; (4)
optionally preferentially enriching specific loci in the genetic
material, ex vivo; (5) genotyping the genetic material, ex vivo;
and (6) analyzing the genotypic data, on a computer, and ex vivo.
Methods to reduce to practice these and other relevant steps are
disclosed herein. At least some of the method steps are not
directly applied on the body. In an embodiment, the present
disclosure relates to methods of treatment and diagnosis applied to
tissue and other biological materials isolated and separated from
the body. At least some of the method steps are executed on a
computer.
[0230] Some embodiments of the present disclosure allow a clinician
to determine the genetic state of a fetus that is gestating in a
mother in a non-invasive manner such that the health of the baby is
not put at risk by the collection of the genetic material of the
fetus, and that the mother is not required to undergo an invasive
procedure. Moreover, in certain aspects, the present disclosure
allows the fetal genetic state to be determined with high accuracy,
significantly greater accuracy than, for example, the non-invasive
maternal serum analyte based screens, such as the triple test, that
are in wide use in prenatal care.
[0231] The accuracy of the methods disclosed herein is a result of
an informatics approach to analysis of the genotype data, as
described herein. Modern technological advances have resulted in
the ability to measure large amounts of genetic information from a
genetic sample using such methods as high throughput sequencing and
genotyping arrays. The methods disclosed herein allow a clinician
to take greater advantage of the large amounts of data available,
and make a more accurate diagnosis of the fetal genetic state. The
details of a number of embodiments are given below. Different
embodiments may involve different combinations of the
aforementioned steps. Various combinations of the different
embodiments of the different steps may be used interchangeably.
[0232] In one embodiment, a blood sample is taken from a pregnant
mother, and the free floating DNA in the plasma of the mother's
blood, which contains a mixture of both DNA of maternal origin, and
DNA of fetal origin, is used to determine the ploidy status of the
fetus. In one embodiment of the present disclosure, a key step of
the method involves preferential enrichment of those DNA sequences
in a mixture of DNA that correspond to polymorphic alleles in a way
that the allele ratios and/or allele distributions remain mostly
consistent upon enrichment. In one embodiment of the present
disclosure, the method involves sequencing a mixture of DNA that
contains both DNA of maternal origin, and DNA of fetal origin. In
one embodiment of the present disclosure, a key step of the method
involves using measured allele distributions to determine the
ploidy state of a fetus that is gestating in a mother.
Screening Maternal Blood Containing Free Floating Fetal DNA
[0233] The methods described herein may be used to help determine
the genotype of a child, fetus, or other target individual where
the genetic material of the target is found in the presence of a
quantity of other genetic material. In this disclosure, the
discussion focuses on determining the genetic state of a fetus
where the fetal DNA is found in maternal blood, but this example is
not meant to limit the possible contexts that this method may be
applied to. In addition, the method may be applicable in cases
where the amount of target DNA is in any proportion with the
non-target DNA; for example, the target DNA could make up anywhere
between 0.000001 and 99.999999% of the DNA present. In addition,
the non-target DNA does not necessarily need to be from one
individual, or even from a related individual, as long as genetic
data from non-target individual(s) is known. In one embodiment of
the present disclosure, the method can be used to determine
genotypic data of a fetus from maternal blood that contains fetal
DNA. It may also be used in a case where there are multiple fetuses
in the uterus of a pregnant woman, or where other contaminating DNA
may be present in the sample, for example from other already born
siblings.
[0234] This technique may make use of the phenomenon of fetal blood
cells gaining access to maternal circulation through the placental
villi. Ordinarily, only a very small number of fetal cells enter
the maternal circulation in this fashion (not enough to produce a
positive Kleihauer-Betke test for fetal-maternal hemorrhage). The
fetal cells can be sorted out and analyzed by a variety of
techniques to look for particular DNA sequences, but without the
risks that invasive procedures inherently have.
This technique may also make use of the phenomenon of free floating
fetal DNA gaining access to maternal circulation by DNA release
following apoptosis of placental tissue where the placental tissue
in question contains DNA of the same genotype as the fetus. The
free floating DNA found in maternal plasma has been shown to
contain fetal DNA in proportions as high as 30-40% fetal DNA.
[0235] In one embodiment of the present disclosure, blood may be
drawn from a pregnant woman. Research has shown that maternal blood
may contain a small amount of free floating DNA from the fetus, in
addition to free floating DNA of maternal origin. In addition,
there also may be nucleated fetal blood cells containing DNA of
fetal origin, in addition to many blood cells of maternal origin,
which typically do not contain nuclear DNA. There are many methods
known in the art to isolate fetal DNA, or create fractions enriched
in fetal DNA. For example, chromatography has been shown to create
certain fractions that are enriched in fetal DNA.
[0236] Once the sample of maternal blood, plasma, or other fluid,
drawn in a relatively non-invasive manner, and that contains an
amount of fetal DNA, either cellular or free floating, either
enriched in its proportion to the maternal DNA, or in its original
ratio, is in hand, one may genotype the DNA found in said sample.
The method described herein can be used to determine genotypic data
of the fetus. For example, it can be used to determine the ploidy
state at one or more chromosomes, it can be used to determine the
identity of one or a set of SNPs, including insertions, deletions,
and translocations. It can be used to determine one or more
haplotypes, including the parent of origin of one or more genotypic
features.
[0237] Note that this method will work with any nucleic acids that
can be used for any genotyping and/or sequencing methods, such as
the ILLUMINA INFINIUM ARRAY platform, AFFYMETRIX GENECHIP, ILLUMINA
GENOME ANALYZER, HiSEQ or MiSEQ, LIFE TECHNOLOGIES' SOLiD SYSTEM, or
Ion Torrent Personal Genome Machine or Proton. This includes
extracted free-floating DNA from plasma or amplifications (e.g.
whole genome amplification, PCR) of the same; genomic DNA from
other cell types (e.g. human lymphocytes from whole blood) or
amplifications of the same. For preparation of the DNA, any
extraction or purification method that generates genomic DNA
suitable for one of these platforms will work as well. In one
embodiment, storage of the samples may be done in a way that will
minimize degradation (e.g. at -20 C or lower).
Methods for Determining the Number of Copies of a Chromosome or
Chromosome Segment of Interest by Combining Allelic and Non-Allelic
Genetic Data
[0238] Other embodiments of the invention include methods for
determining the number of copies of a chromosome or chromosome
segment of interest in the genome of a target cell, such as a fetal
cell or tumor cell. Genetic data, e.g., DNA sequence data, can be
obtained from a mixture of DNA comprising DNA derived from one or
more target cells and DNA derived from one or more non-target
cells. The method can employ a single patient or a set of patients.
The genetic data is obtained from a patient. Genetic information is
obtained at a plurality of loci. At least some, and possibly all, of
the loci are polymorphic. The same loci are analyzed in both the
target and non-target cells. A number of sequence reads is obtained
for each locus. The number of sequence reads at each allele at a
given locus is quantitated. The quantitative data obtained can be
from a combination of the loci from the target cell and the
non-target cell genomes. The collected data is then tested against
a plurality of copy number hypotheses, i.e., the copy number of the
chromosome or chromosome segment of interest. A first probability
value is calculated for each hypothesis i.e., the probability that
the hypothesis is either true or false given the measured genetic
data. Thus the likelihood that the genome of the target cell has
the number of copies of the chromosome or chromosome segment of
interest specified by the hypothesis is determined. This first
probability value is obtained using the allelic data. A second
probability value is calculated for each hypothesis i.e., the
probability that the hypothesis is either true or false given the
measured genetic data. Thus the likelihood that the genome of the
target cell has the number of copies of the chromosome or
chromosome segment of interest specified by the hypothesis is
determined. This second probability value is obtained using the
non-allelic data. For each hypothesis, the first probability value
and the second probability value can be combined, e.g., through
multiplication, to give a combined probability indicating the
likelihood that the genome of the target cell has the number of
copies of the chromosome or chromosome segment that is associated
with the hypothesis. The number of copies of the chromosome or
chromosome segment of interest in the genome of the target cell can
be determined by selecting the number of copies of the chromosome
or chromosome segment that is associated with the hypothesis with
the greatest combined probability; that is, the hypothesis with the
greatest combined probability is used to make the determination of
the chromosome or chromosome segment copy number in the sample of
interest. In some embodiments wherein the genetic data is
obtained from cell free DNA obtained from the blood of a pregnant
woman, the hypothesis can include a condition that the mother is
carrying multiple fetuses, e.g., twins.
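By way of non-limiting illustration, the combination step described
above can be sketched in Python as follows. The sketch multiplies a
probability value derived from allelic data by a probability value
derived from non-allelic data for each hypothesis, renormalizes, and
selects the hypothesis with the greatest combined probability. The
function name, the dictionary layout, and the numeric values are
hypothetical and are not taken from this disclosure; multiplication
also assumes that the two data types can be treated as independent.

```python
def combine_and_select(allelic, non_allelic):
    """Combine two per-hypothesis probability tables by multiplication.

    Both arguments map a copy number hypothesis (here an integer copy
    number) to a probability value. Names and layout are illustrative.
    """
    if set(allelic) != set(non_allelic):
        raise ValueError("both tables must cover the same hypotheses")
    combined = {h: allelic[h] * non_allelic[h] for h in allelic}
    total = sum(combined.values())
    if total <= 0:
        raise ValueError("combined probabilities sum to zero")
    # Renormalize so that the combined values sum to one.
    combined = {h: p / total for h, p in combined.items()}
    # Select the hypothesis with the greatest combined probability.
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical probability values for copy numbers 1, 2, and 3.
first = {1: 0.05, 2: 0.60, 3: 0.35}   # from allelic data
second = {1: 0.10, 2: 0.30, 3: 0.60}  # from non-allelic data
best_copy_number, combined = combine_and_select(first, second)
```

In this hypothetical input, neither data type alone is decisive, and
the selected copy number depends on the product of the two values.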
[0239] Accordingly, in some embodiments, genetic data is obtained
by simultaneously sequencing a mixture comprising DNA derived from
one or more target cells and derived from one or more non-target
cells to give genetic data at the set of loci from each member of
the set of patients. In some embodiments the target cells are fetal
cells and non-target cells are from the mother of the fetus. That
is, in some embodiments directed to non-invasive prenatal
diagnosis, the target cells may be fetal cells and the non-target
cells may be maternal cells. In some embodiments of the invention,
an example of a hypothesis that may be used to select the subset of
patients may be the hypothesis that a specific chromosome or
chromosome segment is diploid, i.e., present in 2 copies. Examples of
chromosomes for analysis include chromosomes 13, 18, 21, X and Y,
including segments thereof. In some embodiments, the chromosome
segment that is analyzed for copy number is selected from the group
consisting of chromosome 22q11.2, chromosome 1p36, chromosome
15q11-q13, chromosome 4p16.3, chromosome 5p15.2, chromosome
17p13.3, chromosome 22q13.3, chromosome 2q37, chromosome 3q29,
chromosome 9q34, chromosome 17q21.31, and the terminus of a
chromosome.
[0240] In some embodiments, the set of loci are present on a
selected region of a chromosome. In some embodiments, the method is
performed independently for different chromosomes or chromosome
segments. The only upper limit on the number of patients in the set
of patients is imposed by the DNA sequence generating capacity of
the specific DNA sequencing technology selected (including the
patient multiplexing technology, e.g. barcoding, compatible with
that sequencing technology). In illustrative embodiments there will
be at least 10 patients in a patient set. In some embodiments there
will be at least 24 patients in the patient set; in other
embodiments there will be at least 48 patients in the patient set;
and in other embodiments there will be at least 96 patients in the
patient set.
Methods of Determining the Number of Copies of a Chromosome or
Chromosome Segment Employing Hypotheses that are Tested Using a
Combination of the Allelic and Non-Allelic Data
[0241] Embodiments include methods for determining the number of
copies of a chromosome or chromosome segment of interest in the
genome of a target cell in which genetic data is obtained from DNA
derived from target cells and DNA derived from non-target cells,
wherein the genetic data comprises (i) quantitative allelic data
from a plurality of polymorphic loci and (ii) quantitative
non-allelic data from a plurality of polymorphic and/or
non-polymorphic loci. The method includes the step of creating a
plurality of hypotheses wherein each hypothesis is associated with
a specific copy number for the chromosome or chromosome segment in
the genome of the target cell. A probability value is calculated
for each hypothesis, wherein the probability value indicates the
likelihood that the genome of the target cell has the number of
copies of the chromosome or chromosome segment that is associated
with the hypothesis, and wherein the first probability value is
derived from the allelic data and the non-allelic data obtained
from at least one first locus. For example, the hypothesis may be
tested using a model that incorporates both allelic data and
non-allelic data, thereby obtaining a probability value. Each
calculated probability value can be combined to give a combined
probability indicating the likelihood that the genome of the target
cell has the number of copies of the chromosome or chromosome
segment that is associated with the hypothesis. The number of
copies of the chromosome or chromosome segment of interest in the
genome of the target cell is determined by selecting the number of
copies of the chromosome or chromosome segment that is associated
with the hypothesis with the greatest probability. In some
embodiments wherein the genetic data is obtained from cell free DNA
obtained from the blood of a pregnant woman, the hypothesis can
include a condition that the mother is carrying multiple fetuses,
e.g., twins.
[0242] In some embodiments the probability value for each
hypothesis is obtained from allelic and non-allelic data obtained
from a single locus. In some embodiments the allelic data is tested
on a model based on a distribution of possible allelic ratios
associated with each hypothesis. In some embodiments the
probability values for each hypothesis are separately determined
for genetic data from at least 1000 polymorphic loci. In some
embodiments the step of calculating a probability value for each
hypothesis comprises the steps of (1) modeling, for each
hypothesis, the expected genetic data from the DNA derived from the
target cell based on the obtained genetic data comprising DNA
derived from non-target cells, (2) comparing, for each hypothesis,
the modeled genetic data from the DNA derived from the target cell
and the obtained genetic data from DNA derived from the target
cell, and (3) calculating a probability value, for each hypothesis,
based on the difference between the modeled genetic data from the
DNA derived from the target cell and the obtained genetic data from
DNA derived from the target cell. In some embodiments the
non-target cells originate from a parent of an individual from
which the target cell originated, and the modeling of the expected
genetic data further comprises determining the expected genetic
data of the target cell using the rules of Mendelian inheritance and
adjusting the expected genetic data of the target cell to correct
for biases in the system as disclosed herein. Examples of such
system biases include amplification bias, sequencing bias,
processing bias, enrichment bias, and combinations thereof. The
nature of such biases may vary in accordance with the specific
amplification technology, sequencing technology, processing,
enrichment technology, etc. selected for implementation of the
specific embodiment. In some embodiments the target cell is from a
fetus, and wherein the expected genetic data comprises genetic data
from the parent of the fetus and genetic data from the fetus. In
some embodiments the modeling of the genetic data comprises the
steps of predicting, for each locus, an expected distribution of
allelic measurements at that locus, and predicting, for each locus,
an expected relative quantity of DNA (depth of read) at that locus.
In some embodiments the prediction of an expected distribution of
allelic measurements can take into account the linkage and
cross-overs between different loci on the genome. In some
embodiments, the expected distribution is a binomial
distribution.
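By way of non-limiting illustration, scoring allelic measurements at
a single locus against a binomial expected distribution can be
sketched in Python as follows. The sketch assumes that each
hypothesis implies an expected fraction of reads carrying one allele
at the locus; the function names, the hypothesis labels, and the
fractions shown are hypothetical and are not taken from this
disclosure.

```python
import math


def binomial_log_pmf(k, n, p):
    """Log probability of k successes in n trials with success rate p."""
    if not 0 <= k <= n:
        return float("-inf")
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    log_coeff = (math.lgamma(n + 1) - math.lgamma(k + 1)
                 - math.lgamma(n - k + 1))
    return log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p)


def locus_log_likelihoods(ref_reads, total_reads, expected_fraction):
    """Score observed allele counts at one locus under each hypothesis.

    expected_fraction maps a hypothesis label to the expected fraction
    of reads carrying the reference allele (hypothetical values that
    the caller supplies).
    """
    return {
        h: binomial_log_pmf(ref_reads, total_reads, p)
        for h, p in expected_fraction.items()
    }


# Hypothetical expected fractions for two hypotheses at a single locus.
scores = locus_log_likelihoods(
    ref_reads=520,
    total_reads=1000,
    expected_fraction={"disomy": 0.50, "trisomy": 0.525},
)
```

Working in log space keeps the values finite at high depth of read,
and log-likelihoods from loci that are treated as independent can be
summed.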
Different Implementations of the Presently Disclosed
Embodiments
[0243] FIG. 2 shows an example system architecture 200 useful for
performing embodiments of the present invention. System
architecture 200 includes an analysis platform 208 connected to one
or more laboratory information systems ("LISs") 204. Analysis
platform 208 may alternatively or additionally be connected
directly to LIS 206. As shown in FIG. 2, analysis platform 208 may
be connected to LIS 206 over a network 202. Network 202 may include
one or more networks of one or more network types, including any
combination of LAN, WAN, the Internet, etc. Network 202 may
encompass connections between any or all components in system
architecture 200. In an embodiment, analysis platform 208 analyzes
genetic data provided by LIS 206 in a software-as-a-service model,
where LIS 206 is a third-party LIS, while analysis platform 208
analyzes genetic data provided by LIS 204 in a full-service or
in-house model, where LIS 204 and analysis platform 208 are
controlled by the same party. In an embodiment where analysis
platform 208 is providing information over network 202, analysis
platform 208 may be a server.
[0244] In an example embodiment, laboratory information system 206
includes one or more public or private institutions that collect,
manage, and/or store genetic data. A person having skill in the
relevant art(s) would understand that methods and standards for
securing genetic data are known and can be implemented using
various information security techniques and policies, e.g.,
username/password, Transport Layer Security (TLS), Secure Sockets
Layer (SSL), and/or other cryptographic protocols providing
communication security.
[0245] In an example embodiment, system architecture 200 operates
as a service-oriented architecture and uses a client-server model
that would be understood by one of skill in the relevant art(s) to
enable various forms of interaction and communication between LIS
206 and analysis platform 208. System architecture 200 may be
distributed over various types of networks 202 and/or may operate
as cloud computing architecture. Cloud computing architecture may
include any type of distributed network architecture. By way of
example and not of limitation, cloud computing architecture is
useful for providing software as a service (SaaS), infrastructure
as a service (IaaS), platform as a service (PaaS), network as a
service (NaaS), data as a service (DaaS), database as a service
(DBaaS), backend as a service (BaaS), test environment as a service
(TEaaS), API as a service (APIaaS), integration platform as a
service (IPaaS) etc.
[0246] In an example embodiment, LISs 204 and 206 each include a
computer, device, interface, etc. or any sub-system thereof. In an
embodiment, LISs 204 and 206 are high-throughput DNA sequencers
that conduct genetic analysis and provide such genetic data to
analysis platform 208. In an embodiment, the high-throughput DNA
sequencers contain PCR amplifiers. LISs 204 and 206 may include an
operating system (OS), applications installed to perform various
functions such as, for example, access to and/or navigation of data
made accessible locally, in memory, and/or over network 202. In an
embodiment, LIS 204 accesses analysis platform 208 through an
application programming interface ("API"). LIS 204 may also include
one or more native applications that may operate independently of
an API.
[0247] In an example embodiment, analysis platform 208 includes one
or more of an input processor 212, a hypothesis manager 214, a
modeler 216, a bias correction unit 218, a machine learning unit
220, and an output processor 222. Input processor 212 receives and
processes inputs from LISs 204 and/or 206. Processing may include
but is not limited to operations such as parsing, transcoding,
translating, adapting, or otherwise handling any input received
from LISs 204 and/or 206. Inputs may be received via one or more
streams, feeds, databases, or other sources of data, such as may be
made accessible by LISs 204 and 206.
[0248] In an example embodiment, hypothesis manager 214 is
configured to receive the inputs passed from input processor 212 in
a form ready to be processed in accordance with hypotheses for
genetic analysis that are represented as models and/or algorithms.
Such models and/or algorithms may be stored in hypothesis database
224. In an embodiment, hypothesis database 224 stores such
information in table format. Data from hypothesis database 224 may
be used by modeler 216 to generate probabilities, for example,
using the methods disclosed herein such as, for example, the
non-allelic quantitative method or the allelic het rate method, and
the like. Data used to derive and populate such strategy models
and/or algorithms are available to hypothesis manager 214 via, for
example, genetic data source 210 via LIS 204 or 206. Genetic data
source 210 may include, for example, assays of samples to be
analyzed by LIS 204 or 206. Hypothesis manager 214 may be
configured to formulate hypotheses based on, for example, the
variables required to populate its models and/or algorithms.
Alternatively, hypotheses may be provided from a user and stored in
hypothesis database 224. Models and/or algorithms, once populated,
may be used by modeler 216 to compare one or more hypotheses to
observed genetic data as described above. Modeler 216 may also
develop bias models as described in various embodiments above. Bias
errors, such as amplification errors and the like, may be corrected
by bias correction unit 218 through performance of the bias
correction mechanisms described herein.
[0249] Hypothesis manager 214 may select a particular value, range
of values, or estimate based on a most-likely hypothesis as an
output as described above. Modeler 216 may operate in accordance
with models and/or algorithms trained by machine learning unit 220.
For example, machine learning unit 220 may develop such models
and/or algorithms by applying a classification algorithm, such as a
Bayes classification algorithm, as described above to genetic data
to identify diploid samples to be used as a reference set. Modeler
216 can then use the identified reference set to estimate, for
example, copy numbers for original or bias-corrected (adjusted or
normalized) genetic data. Modeler 216 can compare expected data
(based on each hypothesis) with observed data to generate a
probability value for each hypothesis as compared to the observed
data for a target sample.
[0250] Once hypothesis manager 214 receives probability values for
each hypothesis for a given target, hypothesis manager 214 can
select a most-likely hypothesis as an output result. Such output
may be returned to the particular LIS 204 or 206 requesting the
information by output processor 222. Such information can then be
transmitted for individual patient samples to their respective
representatives.
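By way of non-limiting illustration, the flow of data among the
input processor, modeler, hypothesis manager, and output processor
described above can be sketched in Python as follows. The record
format, the function signatures, and the placeholder scoring model
are hypothetical and are used only to show the order of the steps;
they are not the models or interfaces of this disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Result:
    sample_id: str
    hypothesis: str
    probability: float


def input_processor(raw: str) -> Tuple[str, List[float]]:
    """Parse a hypothetical 'sample_id: v1, v2, ...' record from an LIS."""
    sample_id, values = raw.split(":", 1)
    return sample_id.strip(), [float(v) for v in values.split(",")]


def modeler(values: List[float],
            models: Dict[str, Callable[[List[float]], float]]
            ) -> Dict[str, float]:
    """Score each hypothesis model and normalize the scores to sum to one."""
    scores = {name: model(values) for name, model in models.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


def hypothesis_manager(probabilities: Dict[str, float]) -> str:
    """Select the most-likely hypothesis."""
    return max(probabilities, key=probabilities.get)


def output_processor(sample_id: str,
                     probabilities: Dict[str, float]) -> Result:
    """Package the selected hypothesis for return to the requesting LIS."""
    best = hypothesis_manager(probabilities)
    return Result(sample_id, best, probabilities[best])


def toy_model(center: float) -> Callable[[List[float]], float]:
    # Placeholder scoring: larger when the mean input is closer to center.
    return lambda values: 1.0 / (1.0 + abs(sum(values) / len(values) - center))


models = {"disomy": toy_model(1.0), "trisomy": toy_model(1.5)}
sample_id, values = input_processor("S1: 1.0, 1.1, 0.9")
result = output_processor(sample_id, modeler(values, models))
```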
[0251] Various aspects of the disclosure can be implemented on a
computing device by software, firmware, hardware, or a combination
thereof. FIG. 3 illustrates an example computer system 300 in which
the contemplated embodiments, or portions thereof, can be
implemented as computer-readable code. Various embodiments are
described in terms of this example computer system 300. For
example, analysis platform 208 and databases 210 and 224 described
above may be implemented in system 300. In addition or
alternatively, the various methods described herein, such as method
100 and the additional algorithms used therein, may be executed by
a computer processing system such as system 300.
[0252] Processing tasks in the embodiment of FIG. 3 are carried out
by one or more processors 302. However, it should be noted that
various types of processing technology may be used here, including
programmable logic arrays (PLAs), application-specific integrated
circuits (ASICs), multi-core processors, multiple processors, or
distributed processors. Additional specialized processing resources
such as graphics, multimedia, or mathematical processing
capabilities may also be used to aid in certain processing tasks.
These processing resources may be hardware, software, or an
appropriate combination thereof. For example, one or more of
processors 302 may be a graphics-processing unit (GPU). In an
embodiment, a GPU is a processor that is a specialized electronic
circuit designed to rapidly process mathematically intensive
applications on electronic devices. The GPU may have a highly
parallel structure that is efficient for parallel processing of
large blocks of data, such as mathematically intensive data.
Alternatively or in addition, one or more of processors 302 may be
a special parallel processor without the graphics optimization,
such parallel processors performing the mathematically intensive
functions described herein. One or more of processors 302 may
include a processing accelerator (e.g., DSP or other
special-purpose processor).
[0253] Computer system 300 also includes a main memory 330, and may
also include a secondary memory 340. Main memory 330 may be a
volatile memory or non-volatile memory, and may be divided into
channels.
Secondary memory 340 may include, for example, non-volatile memory
such as a hard disk drive 350, a removable storage drive 360,
and/or a memory stick. Removable storage drive 360 may comprise a
floppy disk drive, a magnetic tape drive, an optical disk drive, a
flash memory, or the like. The removable storage drive 360 reads
from and/or writes to a removable storage unit 370 in a well-known
manner. Removable storage unit 370 may comprise a floppy disk,
magnetic tape, optical disk, etc. which is read by and written to
by removable storage drive 360. As will be appreciated by persons
skilled in the relevant art(s), removable storage unit 370 includes
a computer usable storage medium having stored therein computer
software and/or data.
[0254] In alternative implementations, secondary memory 340 may
include other similar means for allowing computer programs or other
instructions to be loaded into computer system 300. Such means may
include, for example, a removable storage unit 370 and an interface
(not shown). Examples of such means may include a program cartridge
and cartridge interface (such as that found in video game devices),
a removable memory chip (such as an EPROM, or PROM) and associated
socket, and other removable storage units 370 and interfaces which
allow software and data to be transferred from the removable
storage unit 370 to computer system 300.
[0255] Computer system 300 may also include a memory controller
375. Memory controller 375 controls data access to main memory 330
and secondary memory 340. In some embodiments, memory controller
375 may be external to processor 310, as shown in FIG. 3. In other
embodiments, memory controller 375 may also be directly part of
processor 310. For example, many AMD.TM. and Intel.TM. processors
use integrated memory controllers that are part of the same chip as
processor 310 (not shown in FIG. 3).
[0256] Computer system 300 may also include a communications and
network interface 380. Communication and network interface 380
allows software and data to be transferred between computer system
300 and external devices. Communications and network interface 380
may include a modem, a communications port, a PCMCIA slot and card,
or the like. Software and data transferred via communications and
network interface 380 are in the form of signals which may be
electronic, electromagnetic, optical, or other signals capable of
being received by communication and network interface 380. These
signals are provided to communication and network interface 380 via
a communication path 385. Communication path 385 carries signals
and may be implemented using wire or cable, fiber optics, a phone
line, a cellular phone link, an RF link or other communications
channels.
[0257] The communication and network interface 380 allows the
computer system 300 to communicate over communication networks or
mediums such as LANs, WANs, the Internet, etc. The communication and
network interface 380 may interface with remote sites or networks
via wired or wireless connections.
[0258] In this document, the terms "computer program medium,"
"computer-usable medium" and "non-transitory medium" are used to
generally refer to tangible (i.e. non-signal) media such as
removable storage unit 370, removable storage drive 360, and a hard
disk installed in hard disk drive 350. Signals carried over
communication path 385 can also embody the logic described herein.
Computer program medium and computer usable medium can also refer
to memories, such as main memory 330 and secondary memory 340,
which can be memory semiconductors (e.g. DRAMs, etc.). These
computer program products are means for providing software to
computer system 300.
[0259] Computer programs (also called computer control logic) are
stored in main memory 330 and/or secondary memory 340. Computer
programs may also be received via communication and network
interface 380. Such computer programs, when executed, enable
computer system 300 to implement embodiments as discussed herein.
In particular, the computer programs, when executed, enable
processor 310 to implement the disclosed processes. Accordingly,
such computer programs represent controllers of the computer system
300. Where the embodiments are implemented using software, the
software may be stored in a computer program product and loaded
into computer system 300 using removable storage drive 360,
interfaces, hard drive 350 or communication and network interface
380, for example.
[0260] The computer system 300 may also include
input/output/display devices 390, such as keyboards, monitors,
pointing devices, touchscreens, etc.
[0261] It should be noted that the simulation, synthesis and/or
manufacture of various embodiments may be accomplished, in part,
through the use of computer readable code, including general
programming languages (such as C or C++), hardware description
languages (HDL) such as, for example, Verilog HDL, VHDL, Altera HDL
(AHDL), or other available programming tools. This computer
readable code can be disposed in any known computer-usable medium
including a semiconductor, magnetic disk, optical disk (such as
CD-ROM, DVD-ROM). As such, the code can be transmitted over
communication networks including the Internet.
[0262] The presently disclosed embodiments can be implemented
advantageously in one or more computer programs that are executable
and/or interpretable on system 300. Each computer program can be
implemented in a high-level procedural or object-oriented
programming language, or in assembly or machine language if
desired; and in any case, the language can be a compiled or
interpreted language. A computer program may be deployed in any
form, including as a stand-alone program, or as a module,
component, subroutine, or other unit suitable for use in a
computing environment. A computer program may be deployed to be
executed or interpreted on one computer or on multiple computers at
one site, (that is, system 300 may be distributed locally) or
distributed across multiple sites and interconnected by a
communication network (that is, system 300 may be distributed
across a network). The embodiments are also directed to computer
program products comprising software stored on any computer-usable
medium. Such software, when executed in one or more data processing
devices, causes a data processing device(s) to operate as described
herein. Embodiments employ any computer-usable or -readable medium,
and any computer-usable or -readable storage medium known now or in
the future. Examples of computer-usable or computer-readable
mediums include, but are not limited to, primary storage devices
(e.g., any type of random access memory), secondary storage devices
(e.g., hard drives, floppy disks, CD ROMS, ZIP disks, tapes,
magnetic storage devices, optical storage devices, MEMS,
nano-technological storage devices, etc.), and communication
mediums (e.g., wired and wireless communications networks, local
area networks, wide area networks, intranets, etc.).
Computer-usable or computer-readable mediums can include any form
of transitory (which include signals) or non-transitory media
(which exclude signals). Non-transitory media comprise, by way of
non-limiting example, the aforementioned physical storage devices
(e.g., primary and secondary storage devices).
[0263] Any of the methods described herein may include the output
of data in a physical format, such as on a computer screen, or on a
paper printout. In explanations of any embodiments elsewhere in
this document, it should be understood that the described methods
may be combined with the output of the actionable data in a format
that can be acted upon by a physician. In addition, the described
methods may be combined with the actual execution of a clinical
decision that results in a clinical treatment, or the execution of
a clinical decision to take no action. Some of the embodiments
described in the document for determining genetic data pertaining
to a target individual may be combined with the decision to select
one or more embryos for transfer in the context of IVF, optionally
combined with the process of transferring the embryo to the womb of
the prospective mother. Some of the embodiments described in the
document for determining genetic data pertaining to a target
individual may be combined with the notification of a potential
chromosomal abnormality, or lack thereof, to a medical
professional, optionally combined with the decision to abort, or to
not abort, a fetus in the context of prenatal diagnosis. Some of
the embodiments described herein may be combined with the output of
the actionable data, and the execution of a clinical decision that
results in a clinical treatment, or the execution of a clinical
decision to take no action.
Hypotheses
[0264] A hypothesis can refer to a possible genetic state. It can
refer to a possible ploidy state. It can refer to a possible
allelic state. A set of hypotheses refers to a set of possible
genetic states. In some embodiments, a set of hypotheses are
designed such that one hypothesis from the set will correspond to
the actual genetic state of any given individual. In some
embodiments, a set of hypotheses are designed such that every
possible genetic state can be described by at least one hypothesis
from the set. In some embodiments of the present disclosure, one
aspect of the method is to determine which hypothesis corresponds
to the actual genetic state of the individual in question.
[0265] A "copy number hypothesis," also called a "ploidy
hypothesis," or a "ploidy state hypothesis," may refer to a
hypothesis concerning a possible ploidy state for a given
chromosome, or chromosome segment, in the target individual. It may
also refer to the ploidy state at more than one of the chromosomes
in the individual. A set of copy number hypotheses may refer to a
set of hypotheses where each hypothesis corresponds to a different
possible ploidy state in an individual. A set of hypotheses in
certain examples is a set of possible ploidy states, a set of
possible parental haplotype contributions, a set of possible fetal
DNA percentages in the mixed sample, or combinations thereof.
[0266] A normal individual contains one of each chromosome from
each parent. However, due to errors in meiosis and mitosis, it is
possible for an individual to have 0, 1, 2, or more of a given
chromosome from each parent. In practice, it is rare to see more
than two of a given chromosome from a parent. Certain embodiments
of the invention, especially those involving NIPT, consider the
possible hypotheses where 0, 1, or 2 copies of a given chromosome
come from a parent. In some embodiments, for a given chromosome,
there are nine possible hypotheses: the three possible hypotheses
concerning 0, 1, or 2 chromosomes of maternal origin, multiplied by
the three possible hypotheses concerning 0, 1, or 2 chromosomes of
paternal origin. Let (m,f) refer to the hypothesis where m is the
number of a given chromosome inherited from the mother, and f is
the number of a given chromosome inherited from the father.
Therefore, the nine hypotheses are (0,0), (0,1), (0,2), (1,0),
(1,1), (1,2), (2,0), (2,1), and (2,2). These may also be written as
H.sub.00, H.sub.01, H.sub.02, H.sub.10, H.sub.11, H.sub.12,
H.sub.20, H.sub.21, and H.sub.22. The different hypotheses
correspond to
different ploidy states. For example, (1,1) refers to a normal
disomic chromosome; (2,1) refers to a maternal trisomy, and (0,1)
refers to a paternal monosomy. Especially in the context of NIPT,
two of these hypotheses are not feasible: (0,0) and (2,2). In some
embodiments, the case where two chromosomes are inherited from one
parent and one chromosome is inherited from the other parent may be
further differentiated into two cases: one where the two
chromosomes are identical (matched copy error), and one where the
two chromosomes are homologous but not identical (unmatched copy
error). In these embodiments, there are sixteen possible
hypotheses. It should be understood that it is possible to use
other sets of hypotheses, and a different number of hypotheses.
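By way of non-limiting illustration, the nine (m,f) hypotheses
described above can be enumerated in Python as follows. The variable
names and the plain-text labels (e.g., "H11") are chosen for this
sketch only.

```python
from itertools import product

# (m, f): number of copies of a given chromosome inherited from the
# mother (m) and from the father (f), each 0, 1, or 2.
hypotheses = list(product(range(3), repeat=2))
labels = [f"H{m}{f}" for m, f in hypotheses]

# The text singles out (0,0) and (2,2) as not feasible in the NIPT context.
infeasible = {(0, 0), (2, 2)}
feasible = [h for h in hypotheses if h not in infeasible]


def total_copies(hypothesis):
    """Total copy number implied by an (m, f) hypothesis."""
    m, f = hypothesis
    return m + f
```

For example, total_copies((1, 1)) is 2 for a normal disomic
chromosome, and total_copies((2, 1)) is 3 for a maternal trisomy.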
[0267] Ploidy hypotheses are created during exemplary methods of
the invention that use methods, algorithms, techniques, or
subroutines that provide likelihoods. For example, in certain
illustrative examples of embodiments for determining the presence
or absence of aneuploidy, a set of ploidy hypotheses is created for
each sample in the set of samples, wherein each hypothesis is
associated with a specific copy number for the chromosome or
chromosome segment of interest in a genome of a sample. For
example, in embodiments that use quantitative non-allelic data,
such as the QMM disclosed herein, the hypothesis can provide
estimates of sample parameters, such as the variability in the
starting quantity of DNA in a sample due to pipetting variability
or errors or other measurement errors, which can be used to
normalize the measurements (i.e. measured genetic data) at some or
all of the positions on some or all of the chromosomes or
chromosome segments of interest in that sample, and then a test
statistic can be computed as the variance-weighted mean of these
normalized measurements. Thus, in certain embodiments, the
hypothesis provides a variance-weighted mean test statistic for a
given ploidy condition. The expectation and variance of the test
statistic are calculated under each of the chromosome copy number
hypotheses to form Gaussian models for the maximum likelihood
estimate. For example, a set of hypotheses in an NIPT analysis that
uses a non-allelic quantitative method can provide a
variance-weighted mean test statistic for a disomy or a trisomy at
one or more of chromosomes 13, 18, and 21. In exemplary embodiments
of the present invention where the chromosome or chromosome segment
of interest can be used to set sample parameters, the hypothesis
can be a joint hypothesis on the copy numbers of some or all of the
chromosomes, for example chromosome 13, 18, and 21. This is further
discussed below with regards to a quantitative method that does not
use non-target reference chromosomes.
[0268] In some embodiments of the present disclosure, the ploidy
hypothesis may refer to a hypothesis concerning which chromosomes
from other related individuals correspond to a chromosome found in
the target individual's genome. Some embodiments utilize the fact
that related individuals can be expected to share haplotype blocks,
and using measured genetic data from related individuals, along
with a knowledge of which haplotype blocks match between the target
individual and the related individual, it is possible to infer the
correct genetic data for a target individual with higher confidence
than using the target individual's genetic measurements alone. As
such, in some embodiments, the ploidy hypothesis may concern not
only the number of chromosomes, but also which chromosomes in
related individuals are identical, or nearly identical, with one or
more chromosomes in the target individual.
[0269] An allelic hypothesis, or an "allelic state hypothesis" may
refer to a hypothesis concerning a possible allelic state of a set
of alleles. In some embodiments, the technique, algorithm, or
method used utilizes the fact that, as described above, related
individuals may share haplotype blocks, which may help the
reconstruction of genetic data that was not perfectly measured. An
allelic hypothesis can also refer to a hypothesis concerning which
chromosomes, or chromosome segments, if any, from a related
individual correspond genetically to a given chromosome from an
individual. The theory of meiosis tells us that each chromosome in
an individual is inherited from one of the two parents, and this is
a nearly identical copy of a parental chromosome. Therefore, if the
haplotypes of the parents are known, that is, the phased genotype
of the parents, then the genotype of the child may be inferred as
well. (The term child, here, is meant to include any individual
formed from two gametes, one from the mother and one from the
father.) In one embodiment of the present disclosure, the allelic
hypothesis describes a possible allelic state, at a set of alleles,
including the haplotypes, at a chromosome or chromosome segment of
interest, as well as which chromosomes from related individuals may
match the chromosome(s) which contain the set of alleles.
[0270] Once the set of hypotheses has been defined, the algorithms
operate on the input genetic data and output a determined
statistical probability for each of the hypotheses under
consideration. For example, in an embodiment of the invention the
method determines a probability value by comparing the genetic data
to an expected result for each hypothesis, wherein the probability
value indicates the likelihood that a sample has a certain number
of copies of the chromosome or chromosome segment that is
associated with the hypothesis.
[0271] The probabilities of the various hypotheses can be
determined by mathematically calculating, for each of the various
hypotheses, the value that the probability equals, as stated by one
or more of the expert techniques, algorithms, and/or methods
described elsewhere in this disclosure, using the relevant genetic
data as input.
[0272] Once the probabilities of the different hypotheses are
estimated, as determined by a plurality of techniques, they may be
combined. This may entail, for each hypothesis, multiplying the
probabilities as determined by each technique. The product of the
probabilities of the hypotheses may be normalized. Note that one
ploidy hypothesis refers to one possible ploidy state for a
chromosome.
[0273] The process of "combining probabilities," also called
"combining hypotheses," or combining the results of expert
techniques, is a concept that should be familiar to one skilled in
the art of linear algebra. In exemplary methods of the present
invention, two methods are utilized for determining the presence or
absence of aneuploidy or for determining the number of copies of a
chromosome that each provide a probability. In certain illustrative
embodiments, the confidence of the determination is increased by
combining the confidences that are selected for each method. For
example, a confidence for a first method that performs a
quantitative allelic analysis, can be combined with a confidence
from a second method that performs a quantitative non-allelic
analysis.
[0274] In cases where the likelihoods are determined by a first
method in a way that is orthogonal, or unrelated, to the way in
which a likelihood is determined for a second method, combining the
likelihoods is straightforward and can be done by multiplication
and normalization, or by using a formula such as:
R.sub.comb=R.sub.1R.sub.2/[R.sub.1R.sub.2+(1-R.sub.1)(1-R.sub.2)]
[0275] Where R.sub.comb is the combined likelihood, and R.sub.1 and
R.sub.2 are the individual likelihoods. In cases where the first
and the second methods are not orthogonal, that is, where there is
a correlation between the two methods, the likelihoods may still be
combined, though the mathematics may be more complex.
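As a non-limiting sketch, the formula above for R.sub.comb may be computed as follows. The function name combine_independent is hypothetical, and the sketch assumes that the two methods are orthogonal, so that the formula applies as written.

```python
def combine_independent(r1: float, r2: float) -> float:
    """Combine two likelihoods for the same hypothesis.

    Assumes the two methods are orthogonal (independent), which is the
    case covered by R_comb = R1*R2 / [R1*R2 + (1-R1)*(1-R2)].
    """
    numerator = r1 * r2
    denominator = numerator + (1.0 - r1) * (1.0 - r2)
    if denominator == 0.0:
        # One method reports certainty for, the other certainty against.
        raise ValueError("the two likelihoods contradict each other")
    return numerator / denominator
```

For instance, when the first method is uninformative (R.sub.1=0.5), the combined value equals R.sub.2, and two agreeing values of 0.9 combine to a value above 0.9.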
[0276] In some embodiments, the 1.sup.st probability and the
2.sup.nd probability are weighted differently prior to the step of
combining the probabilities. In some embodiments the 1.sup.st
probability and the 2.sup.nd probability are considered independent
events for the purposes of the step of combining the two
probability values. In some embodiments the 1.sup.st probability
and the 2.sup.nd probability are considered dependent events for
the purposes of the step of combining the two probability values.
In some embodiments, the method further comprises obtaining a third
probability value, wherein the third probability value indicates
the likelihood that the genome of the target has the number of
copies of the chromosome or chromosome segment associated with a
specific hypothesis, and wherein the third probability value is derived
from information from a non-genetic clinical assay. Many
non-genetic clinical assays have a known probabilistic correlation
with a specific chromosome copy number or chromosome segment copy
number. For each hypothesis, the combined first and second
probability values may be combined with the third probability value
to give a combined probability value indicating the likelihood that
the genome of the target cell has the number of copies of the
chromosome or chromosome segment of interest, wherein that number
is associated with the specific hypothesis. An example of such a
non-genetic clinical assay is a nuchal translucency
measurement. In some embodiments the non-genetic clinical assay is
selected from the group consisting of measurements of: beta-human
chorionic gonadotropin, pregnancy associated plasma protein A,
estriol, inhibin-A, and alpha-fetoprotein.
[0277] Not to be limited by theory, the following disclosure
further teaches how to combine probabilities. One possible way to
combine probabilities is as follows: When an expert technique is
used to evaluate a set of hypotheses given a set of genetic data,
the output of the method is a set of probabilities that are
associated, in a one-to-one fashion, with each hypothesis in the
set of hypotheses. When a set of probabilities that were determined
by a first expert technique, each of which is associated with one
of the hypotheses in the set, are combined with a set of
probabilities that were determined by a second expert technique,
each of which is associated with one of the same set of hypotheses, then
the two sets of probabilities are multiplied. This means that, for
each hypothesis in the set, the two probabilities that are
associated with that hypothesis, as determined by the two expert
methods, are multiplied together, and the corresponding product is
the output probability. This process may be expanded to any number
of expert techniques. If only one expert technique is used, then
the output probabilities are the same as the input probabilities.
If more than two expert techniques are used, then the relevant
probabilities may be multiplied at the same time. The products may
be normalized so that the probabilities of the hypotheses in the
set of hypotheses sum to 100%.
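The multiply-and-normalize procedure described above may be sketched as follows. This is an illustrative example only; the function name and the use of dictionaries keyed by hypothesis label are assumptions made for the example, and every technique is assumed to report a probability for the same set of hypotheses.

```python
from math import prod


def combine_hypothesis_probabilities(techniques):
    """Multiply per-hypothesis probabilities across techniques, then normalize.

    techniques: list of dicts, each mapping hypothesis label -> probability
    as reported by one expert technique (hypothetical input layout).
    """
    labels = list(techniques[0].keys())
    products = {h: prod(t[h] for t in techniques) for h in labels}
    total = sum(products.values())
    if total == 0.0:
        raise ValueError("every hypothesis has zero combined probability")
    # Normalize so the combined probabilities sum to 1 (i.e. 100%).
    return {h: p / total for h, p in products.items()}
```

With a single technique the output equals the (normalized) input, and any number of techniques may be passed in the list.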
[0278] In some embodiments, if the combined probabilities for a
given hypothesis are greater than the combined probabilities for
any of the other hypotheses, then it may be considered that that
hypothesis is determined to be the most likely. In some
embodiments, a hypothesis may be determined to be the most likely,
and the ploidy state, or other genetic state, may be called if the
normalized probability is greater than a threshold. In one
embodiment, this means that the number and identity of the
chromosomes that are associated with that hypothesis may be called
as the ploidy state. In one embodiment, this means that the
identity of the alleles that are associated with that hypothesis
are called as the allelic state. In some embodiments, the threshold
is between about 50% and about 80%. In some embodiments the
threshold is between about 80% and about 90%. In some embodiments
the threshold is between about 90% and about 95%. In some
embodiments the threshold is between about 95% and about 99%. In
some embodiments the threshold is between about 99% and about
99.9%. In some embodiments the threshold is above 99.9%. In other
embodiments, a set of rules is used for a final risk call for a
sample, wherein a combined probability threshold is set, but
different scenarios can be considered and could override the
results of the probability threshold, or be used to enhance the
calling ability of the combined probability. For example, if there
is a wide disparity in probabilities for a given ploidy hypothesis,
further analysis can be performed for example, to determine whether
there was an error in one of the methods.
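A threshold-based call of the kind described above may be sketched as follows. The function name call_ploidy and the default threshold of 0.99 are illustrative assumptions; the sketch returns no call when the most likely hypothesis does not exceed the threshold, and it does not implement any override rules.

```python
def call_ploidy(combined, threshold=0.99):
    """Return the most likely hypothesis if it exceeds the threshold.

    combined: dict mapping hypothesis label -> normalized probability.
    Returns None (no call) when the best hypothesis is not above threshold.
    """
    best = max(combined, key=combined.get)
    if combined[best] > threshold:
        return best
    return None
```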
[0279] Some embodiments of the invention employ the step of
producing a subset of patients from a larger set of patients. The
original set of patients is used as the source of target cells and
non-target cells for analysis. In some embodiments of the
invention, the DNA samples obtained from the patients are modified
using standard molecular biology techniques in order to be
sequenced on the DNA sequencer. In some embodiments the technique
will involve forming a genetic library containing priming sites for
the DNA sequencing procedure. In some embodiments, a plurality of
loci may be targeted for site specific amplification. In some
embodiments the targeted loci are polymorphic loci, e.g., single
nucleotide polymorphisms. In embodiments involving the formation of
genetic libraries, libraries may be encoded using a DNA sequence
that is specific for the patient, e.g. barcoding, thereby
permitting multiple patients to be analyzed in a single flow cell
(or flow cell equivalent) of a high throughput DNA sequencer.
Although the samples are mixed together in the DNA sequencer flow
cell, the determination of the sequence of the barcode permits
identification of the patient source that contributed the DNA that
had been sequenced.
[0280] It will be appreciated by those of ordinary skill in the art
that in those embodiments of the invention in which the target DNA
is not enriched for specific loci, the entire genome may be
sequenced, although assembly of the sequence into a complete genome
is not required for use of the subject methods. Information about
specific loci may be readily determined from whole genome
sequencing.
[0281] In one embodiment of the present disclosure, a confidence
may be calculated on the accuracy of the determination of the
ploidy state of the fetus. In one embodiment, the confidence of the
hypothesis of greatest likelihood (H.sub.major) may be calculated
as (1-H.sub.major)/.SIGMA.(all H). It is possible to determine the
confidence of a hypothesis if the distributions of all of the
hypotheses are known. It is possible to determine the distribution
of all of the hypotheses if the parental genotype information is
known. It is possible to calculate a confidence of the ploidy
determination if the knowledge of the expected distribution of data
for the euploid fetus and the expected distribution of data for the
aneuploid fetus are known. It is possible to calculate these
expected distributions if the parental genotype data are known. In
one embodiment one may use the knowledge of the distribution of a
test statistic around a normal hypothesis and around an abnormal
hypothesis to determine both the reliability of the call as well as
refine the threshold to make a more reliable call. This is
particularly useful when the amount and/or percent of fetal DNA in
the mixture is low. It will help to avoid the situation where a
fetus that is actually aneuploid is found to be euploid because a
test statistic, such as the Z statistic, does not exceed a
threshold that was optimized for the case where there is a higher
percent fetal DNA.
An Example of a Quantitative Non-Allelic Maximum Likelihood Method
("QMM")
[0282] An example of a quantitative method that may be used to
determine the number of copies of a chromosome of interest in a
target individual is provided here. Note that this example involves
normalization of the target chromosome data using a reference
chromosome that is the same as the target chromosome (i.e.
chromosome of interest), but found in other samples processed in a
similar or identical manner. The instant method is described in the
context of non-invasive prenatal aneuploidy testing, where the
target individual is a fetus, and the DNA that is sequenced
comprises fetal DNA, and in some cases, maternal DNA, for example
as found in the maternal plasma. Non-invasive prenatal aneuploidy
testing attempts to determine the chromosome copy number of a fetus
based on the free-floating fetal DNA in maternal plasma. In the
quantitative method, chromosome copy number classification is based
on the number of sequence reads which map to each chromosome.
Neither parental genotype nor allelic information is used, except
possibly to estimate the fetal fraction in the plasma. In this
targeted sequencing approach, the number of sequence reads at each
targeted SNP (single nucleotide polymorphism) is informative, in
contrast to untargeted sequencing approaches that tend to use a
sliding window average depth of read, or similar averaged approach.
Based on the estimated fetal fraction, a maximum likelihood
estimate is calculated based on the set of copy number hypotheses
including monosomy, disomy, and trisomy. In this example,
chromosome segmental errors are not considered, meaning that all
positions on the same chromosome are assumed to have the same copy
number. It should be clear to one of ordinary skill in the art how
to apply this method to chromosome segment copy number variants.
One may also incorporate non-uniform fragmentation of the fetal or
maternal genome; this is not done here.
[0283] Modeling an individual SNP: A fundamental assumption in this
method is that the number of sequence reads generated at a genome
position depends primarily on the number of genome copies of that
position going into the sequencing process. The targeted sequencing
approach is based on multiplexed PCR, which means that the number
of genome copies going into sequencing is determined both by the
chromosome copy number in the original sample, and the details of
the PCR amplification process. Thus, this method requires
simplified models of both multiplex PCR and high throughput
sequencing.
[0284] One may assume that in the original sample, the amount of
genome copies is the same at all positions, except due to
variations in chromosome copy number. However, in the PCR process,
each targeted position is amplified with a different efficiency.
For each of k PCR cycles, a position i is amplified by a factor
a.sub.i. The number of observed reads at the position is x.sub.i.
This model can be written as in equation 1, where the sample factor
c.sub.s is constant per sample, and represents a sample parameter,
for example the initial quantity of DNA and the total number of
sequence reads. It can be thought of as the sample-specific
amplification factor. The chromosome copy number n.sub.i is the
ploidy state or copy number of the chromosome where position i is
located.
x.sub.i=c.sub.sn.sub.ia.sub.i.sup.k (1)
[0285] However, slight variations in experimental conditions mean
that the amplification efficiencies of the various PCR targets are
not perfectly constant. This is represented by a multiplicative
noise term .epsilon..sub.i for the amplification efficiency of each target.
The model is thus extended to equation 2.
x.sub.i=c.sub.sn.sub.i(a.sub.i.epsilon..sub.i).sup.k (2)
[0286] Due to the multiplicative nature of the model, it is
advantageous to work in log space, and then consider the
expectation and the variance of log x.sub.i. One may assume that
the expectation of the log noise is zero. This is not quite the
same as assuming zero-mean noise, but it makes the math feasible,
shown in equation 3.
E log x.sub.i=log c.sub.s+log n.sub.i+k log a.sub.i
V log x.sub.i=k.sup.2V log .epsilon..sub.i (3)
[0287] Sample normalization can be achieved by considering reads
measured from positions located on chromosomes which are known,
assumed, or hypothesized to have copy number equal to two. There
are other methods of sample normalization such as using other
reference chromosomes, for example chromosomes 1 and 2, which are
known to be disomic. Let D be the set of positions i which are
located on chromosomes assumed to be disomic. The sample normalizer
T.sub.s is defined as the average log count over positions i in D,
detailed in equation 4. This can be measured directly from each
sample, and so will be considered a known quantity for further
calculations.
$$T_s = E_{i \in D} \log x_i = \log c_s + \log 2 + k\,E_{i \in D} \log a_i \qquad (4)$$
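The sample normalizer of equation 4 is measured directly as an average log read count, which may be sketched as follows. The function name and argument layout are assumptions made for the example, and the sketch assumes every position in D has a nonzero read count (positions with zero reads would need separate handling).

```python
import math


def sample_normalizer(read_counts, disomic_positions):
    """Average log read count over the positions assumed to be disomic.

    read_counts: sequence of observed reads x_i, indexed by position i.
    disomic_positions: the set D of position indices (hypothetical layout).
    Assumes all counts in D are positive, since log(0) is undefined.
    """
    logs = [math.log(read_counts[i]) for i in disomic_positions]
    return sum(logs) / len(logs)
```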
[0288] Constructing a model from training data: A model for the
efficiency of individual SNPs can be constructed from a set of
training data with known chromosome copy number and fetal fraction.
In the ideal case, plasma is collected from (euploid) women who are
not pregnant, and so the fetal fraction is zero and there are no
aneuploidies. In this case, all samples contribute data for the
model of all targets. In the more difficult case, pregnancy plasma
with known chromosome copy number is used, and aneuploid samples are
excluded from the data set. Thus, the model is still constructed
from data where all chromosomes have the same copy number relative
to disomy.
[0289] Let y.sub.i be the logspace normalized depth of read at
position i. One may define .beta..sub.i as the average, over the set
of samples, of y.sub.i (equation 5). The term .beta..sub.i is the logspace
amplification model for position i which measures how its
amplification efficiency compares to the average amplification
efficiency for positions on disomy chromosomes.
$$y_i = \log x_i - T_s = k \log a_i + k \log \epsilon_i - k\,E_{i \in D} \log a_i \qquad (5)$$
$$\beta_i = E_s\, y_i = k \log a_i - k\,E_{i \in D} \log a_i$$
[0290] Similarly, .sigma..sub.i is defined as the standard
deviation across samples of y.sub.i. Combined, the set of
.beta..sub.i and the set of .sigma..sub.i, form the amplification
model and the variance model for the set of SNPs i.
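One way the amplification model and variance model might be estimated from euploid training samples is sketched below. The function name and data layout are hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since the text does not specify the estimator.

```python
import math
from statistics import mean, stdev


def fit_amplification_model(training_counts, disomic_positions):
    """Estimate beta_i (mean) and sigma_i (std dev) of y_i across samples.

    training_counts: list of per-sample lists of positive read counts,
    all samples assumed euploid (hypothetical layout).
    """
    normalized = []
    for counts in training_counts:
        logs = [math.log(x) for x in counts]
        t_s = mean([logs[i] for i in disomic_positions])  # sample normalizer
        normalized.append([value - t_s for value in logs])  # y_i per sample
    n_positions = len(training_counts[0])
    beta = [mean([row[i] for row in normalized]) for i in range(n_positions)]
    # Sample standard deviation (n-1 denominator): an assumption.
    sigma = [stdev([row[i] for row in normalized]) for i in range(n_positions)]
    return beta, sigma
```

At least two training samples are needed for the standard deviation to be defined.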
[0291] There are a number of subtleties associated with the model
calculation. Most importantly, it should be noted that the
model does not remain constant for a fixed set of targets subjected
to a fixed protocol.
[0292] Although the models will be quite similar, attempts to use a
fixed model across multiple sequencing runs have suffered from
biases which are large enough to affect results at low fetal
fraction, and may be eliminated by training separately for separate
experiments. As a result, in some embodiments, it is important to
ensure that each sequencing run contains a sufficient number of
samples for modeling.
[0293] Even within an experiment, there are typically a number of
samples which do not fit the model. These are often but not always
explained by locus dropout, which is discussed in more detail in a
later section. Outlier samples are not well predicted by quality
control metrics such as contamination level, spike ratio (a measure
of DNA starting quantity), fetal fraction, or overall depth of
read. A sample is tested for goodness of fit by calculating the
residual z.sub.i on each SNP with respect to the amplification and noise
models.
z.sub.i=(log x.sub.i-T.sub.s-.beta..sub.i)/.sigma..sub.i (6)
[0294] Under the further assumption that log .epsilon..sub.i
is not just zero-mean, but Gaussian, then z.sub.i
should be distributed according to the standard normal. The set of
disomy-chromosome residuals Z={z.sub.i:i .di-elect cons.D} is
analyzed as an approximate metric for model fit. Regardless of
fetal fraction or chromosome copy number, Z should be distributed
according to the standard normal. A Kolmogorov-Smirnov (KS) test is
used to measure goodness of fit of the residuals. The modeling
process is implemented in an iterative fashion, where each
iteration includes a recalculation of the model, followed by a KS
test for the model fit of each sample. Outlier samples are removed
from the training set at each iteration until the membership
converges to a constant set.
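The residuals of equation 6 and the goodness-of-fit comparison to the standard normal may be sketched as follows. This example computes only the Kolmogorov-Smirnov distance, not a p-value; the function names are hypothetical, and in practice a library routine such as SciPy's scipy.stats.kstest could be used to obtain a p-value.

```python
import math


def residuals(counts, t_s, beta, sigma, positions):
    """z_i = (log x_i - T_s - beta_i) / sigma_i for the given positions."""
    return [(math.log(counts[i]) - t_s - beta[i]) / sigma[i] for i in positions]


def standard_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ks_statistic(values):
    """One-sample KS distance between the empirical CDF and the standard normal."""
    ordered = sorted(values)
    n = len(ordered)
    distance = 0.0
    for rank, z in enumerate(ordered, start=1):
        cdf = standard_normal_cdf(z)
        # Compare the model CDF to the empirical CDF just after and just before z.
        distance = max(distance, rank / n - cdf, cdf - (rank - 1) / n)
    return distance
```

A large distance suggests that the sample does not fit the model; the cutoff used to remove a sample from the training set is a design choice not specified here.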
[0295] Forming a test statistic and modeling SNP correlation: A
test statistic for chromosome copy number classification can be
formed by averaging the normalized measurements at all positions on
a chromosome. A variance-weighted mean is selected in order to
minimize the variance of the test statistic. Consider the
normalized measurement y.sub.i, defined above. For a position on a
chromosome with unknown copy number n.sub.i, y.sub.i has the properties
described in equation 7.
$$E\,y_i = \log \frac{n_i}{2} + \beta_i, \qquad V\,y_i = \sigma_i^2 \qquad (7)$$
[0296] Let S be the set of positions on the current chromosome. The
chromosome test statistic t is defined as the variance-weighted
mean of y.sub.i, averaged across SNPs i in S.
$$t = \frac{\sum_{i \in S} y_i / \sigma_i^2}{\sum_{i \in S} 1 / \sigma_i^2} \qquad (8)$$
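The variance-weighted mean of equation 8 may be computed as in the following sketch, in which each normalized measurement is weighted by the inverse of its variance. The function name is hypothetical.

```python
def variance_weighted_mean(y, sigma):
    """Weighted mean of y_i with weights 1/sigma_i^2 (equation 8).

    y: normalized measurements for the positions in S.
    sigma: per-position standard deviations, in the same order as y.
    """
    weights = [1.0 / (s * s) for s in sigma]
    return sum(w * value for w, value in zip(weights, y)) / sum(weights)
```

Positions with small variance dominate the statistic, which is what minimizes the variance of t under uncorrelated noise.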
[0297] The expectation of t will be calculated under each of the
chromosome copy number hypotheses to form Gaussian models for the
maximum likelihood estimate. The variance of the model for each
hypothesis does not follow uniquely from the assumptions made
previously, which have not considered correlation between
measurements. The simplest assumption of uncorrelated measurements
was discarded because the observed variances on t were much higher
than that model would suggest. Without suggesting any physical
explanation for correlation, a single-parameter correlation model
is proposed in which the covariance of y.sub.i with y.sub.j is
.rho..sigma..sub.i.sigma..sub.j, corresponding to a constant
correlation factor between all positions i and j on the same
chromosome. This model uses a single parameter to represent the
additional variance beyond what would be implied by the
uncorrelated model. The variance of t using the constant
correlation model is shown in equation 9 which follows directly
from the formula for the variance of a sum of normally distributed
variables with known correlation. (The assumption of Gaussian noise is
continued throughout.)
$$V\,t = \Big(\sum_i \frac{1}{\sigma_i^2}\Big)^{-2} \Big(\rho \sum_i \sum_j \frac{1}{\sigma_i \sigma_j} + (1-\rho) \sum_i \frac{1}{\sigma_i^2}\Big) \qquad (9)$$
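The variance of equation 9 may be evaluated as in the following sketch. The function name is hypothetical. The sketch relies on the observation that the double sum over all i and j of 1/(.sigma..sub.i.sigma..sub.j) factorizes as the square of the sum of 1/.sigma..sub.i, which avoids an explicit double loop.

```python
def variance_of_t(sigma, rho):
    """Variance of the variance-weighted mean under constant correlation rho."""
    inverse = [1.0 / s for s in sigma]
    sum_inverse = sum(inverse)                      # sum_i 1/sigma_i
    sum_inverse_sq = sum(v * v for v in inverse)    # sum_i 1/sigma_i^2
    # sum_i sum_j 1/(sigma_i sigma_j) equals (sum_i 1/sigma_i)^2.
    numerator = rho * sum_inverse ** 2 + (1.0 - rho) * sum_inverse_sq
    return numerator / sum_inverse_sq ** 2
```

With rho equal to zero the expression reduces to the familiar uncorrelated result, the reciprocal of the sum of 1/.sigma..sub.i.sup.2.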
[0298] A maximum likelihood estimate of .rho. for each chromosome
is calculated from the same modeling data following the estimation
of {.beta..sub.i} and {.sigma..sub.i}.
[0299] Chromosome copy number classification consists of the
following steps which make use of the modeling developed in the
sections above.
[0300] 1. Confirm model fit. Using the disomy chromosomes (one and
two) a set of residuals is calculated with respect to the provided
model, and a KS test is used to compare them to the standard normal
distribution. If the resulting p-value is too low, the sample is
considered not to fit the model, and cannot be classified.
[0301] 2. Copy number hypothesis generation. Using the supplied
fetal fraction, the plasma copy number is calculated corresponding
to each fetal copy number hypothesis. For fetal copy number
hypotheses {h.sub.1, h.sub.2, h.sub.3}={1, 2, 3}, the plasma copy
number hypotheses are calculated using the fetal fraction according
to equation 10. The plasma copy number is a mixture of the fetal
copy number, which depends on the hypothesis, and the maternal copy
number, which is two.
n.sub.i=fh.sub.i+2(1-f) (10)
[0302] 3. Hypothesis modeling. An expected value for the test
statistic is calculated for the value of n.sub.i corresponding to
the ploidy hypotheses. This is done according to equation 7 and the
definition of the test statistic. The variance model for the test
statistic does not depend on the hypothesis.
[0303] 4. Calculate likelihoods. The value of the test statistic is
observed for the current chromosome. The data likelihood of each
hypothesis is the likelihood of the test statistic under each of
the corresponding normal distributions. The maximum likelihood
estimate can then be reported, or normalized using priors.
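Steps 2 through 4 above may be sketched as follows, by way of example only. The function names are hypothetical, the variance of the test statistic is taken as a supplied input, and equal priors are assumed unless others are given. A practical implementation would likely work with log likelihoods to avoid numerical underflow.

```python
import math


def plasma_copy_number(fetal_copies, fetal_fraction):
    """Equation 10: mixture of fetal copy number and maternal disomy."""
    return fetal_fraction * fetal_copies + 2.0 * (1.0 - fetal_fraction)


def expected_t(n, beta, sigma):
    """Expectation of t from equation 7: log(n/2) plus weighted mean of beta."""
    weights = [1.0 / (s * s) for s in sigma]
    weighted_beta = sum(w * b for w, b in zip(weights, beta)) / sum(weights)
    return math.log(n / 2.0) + weighted_beta


def gaussian_pdf(x, mean, variance):
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def classify_chromosome(t, beta, sigma, var_t, fetal_fraction, priors=None):
    """Return (most likely fetal copy number, normalized probabilities)."""
    hypotheses = (1, 2, 3)  # monosomy, disomy, trisomy
    if priors is None:
        priors = {h: 1.0 / len(hypotheses) for h in hypotheses}
    scores = {}
    for h in hypotheses:
        n = plasma_copy_number(h, fetal_fraction)
        scores[h] = gaussian_pdf(t, expected_t(n, beta, sigma), var_t) * priors[h]
    total = sum(scores.values())
    probabilities = {h: s / total for h, s in scores.items()}
    return max(probabilities, key=probabilities.get), probabilities
```

Because the expected values under the three hypotheses differ only by log(n/2), the separation between hypotheses shrinks as the fetal fraction decreases.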
[0304] Copy number classification without non-target reference
chromosomes (also referred to as a "QMM" method)
[0305] As mentioned above, it is possible to identify copy number
without using reference chromosomes or chromosome segments that are
different than the target chromosome or chromosome segment, such
that none of the chromosomes or chromosome segments can be assumed
to have known copy numbers. This requires an alternate way of
estimating the sample normalizer T.sub.s and the linear shift
parameter .alpha..sub.s, which are conditioned on the chromosome
number hypotheses. Unlike the approach that uses copy number
hypotheses for each individual chromosome, this hypothesis space
contains joint hypotheses of all the training chromosomes.
[0306] In an embodiment, in order to connect the joint hypothesis
to the individual hypothesis, the following technique may be used.
For a training chromosome k.di-elect cons.{13, 18, 21}, let
p(D|h.sub.k), h.sub.k.di-elect cons.{1, 2, 3} be the pdf of the
data conditioned on the individual copy number hypothesis of that
chromosome. So, for example, for chromosome 13 it would be:
$$p(D \mid h_{13}) = \sum_{h_{18}} \sum_{h_{21}} p(D_{13} \mid h_{18}, h_{21}, h_{13})\, p(D_{18} \mid h_{18}, h_{21}, h_{13})\, p(D_{21} \mid h_{18}, h_{21}, h_{13})\, P(h_{18})\, P(h_{21})$$
[0307] Assuming equal priors for the hypothesis probabilities,
i.e., P(h.sub.k=1)=P(h.sub.k=2)=P(h.sub.k=3)=1/3, the above pdf is
computed. To compute p(D.sub.13|h.sub.18, h.sub.21, h.sub.13), the
T.sub.s and .alpha..sub.s estimates corresponding to the hypothesis
(h.sub.18, h.sub.21, h.sub.13) are used, and a variance weighted
mean test statistic is computed. Similarly, the respective pdfs of
the other training chromosomes, p(D|h.sub.18), p(D|h.sub.21) are
computed. Since equal priors are assumed, the posterior
probabilities are also computed:
$$P(h_k \mid D) = \frac{p(D \mid h_k)}{\sum_{h_j \in \{1, 2, 3\}} p(D \mid h_j)}, \quad \forall k \in \{13, 18, 21\}.$$
This represents a normalizing step which provides confidences for
each of the training chromosomes.
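The marginalization over the joint hypotheses and the subsequent normalization may be sketched as follows. The function name and the representation of a joint hypothesis as a dictionary are assumptions made for the example; joint_likelihood stands in for the product of the per-chromosome pdfs evaluated under a joint hypothesis, and equal priors are assumed as in the text.

```python
from itertools import product


def marginal_posteriors(joint_likelihood, chromosomes=(13, 18, 21),
                        copy_numbers=(1, 2, 3)):
    """Posterior over each chromosome's copy number, summing out the others.

    joint_likelihood: callable taking {chromosome: copy number} and
    returning p(D | joint hypothesis) (hypothetical interface).
    """
    prior = 1.0 / len(copy_numbers)
    other_prior = prior ** (len(chromosomes) - 1)  # priors of the other chromosomes
    marginals = {k: {h: 0.0 for h in copy_numbers} for k in chromosomes}
    for combo in product(copy_numbers, repeat=len(chromosomes)):
        joint = dict(zip(chromosomes, combo))
        likelihood = joint_likelihood(joint)
        for k in chromosomes:
            marginals[k][joint[k]] += likelihood * other_prior
    posteriors = {}
    for k in chromosomes:
        total = sum(marginals[k].values())
        posteriors[k] = {h: v / total for h, v in marginals[k].items()}
    return posteriors
```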
[0308] Next, confidences of the rest of the chromosomes are
computed. For this, an estimate of the joint hypothesis of the
training chromosomes is obtained:
$$(h_{13}, h_{18}, h_{21}) = \arg\max_{h_{13}, h_{18}, h_{21}} p(D \mid h_{13}, h_{18}, h_{21})$$
The T.sub.s and .alpha..sub.s estimates corresponding to this
hypothesis (h.sub.13, h.sub.18, h.sub.21) can then be used to
compute the variance weighted mean test statistic for each of the
test chromosomes.
[0309] In this method, a constant correlation coefficient model can
be used to model the inter-SNP correlations of a particular
chromosome. For example, for a particular chromosome k, the
covariance of y.sub.i and y.sub.j is
.rho..sub.k.sigma..sub.i.sigma..sub.j, as discussed above. If
chromosome k has N.sub.k loci, a covariance matrix is given by:
$$C(\rho_k) = (1-\rho_k)\,\mathrm{diag}(\sigma_k^2) + \rho_k\,\sigma_k \sigma_k^{T}$$
This represents a matrix with the .sigma..sub.i.sup.2s on the main
diagonal and the off-diagonal elements are
.rho..sub.k.sigma..sub.i.sigma..sub.j. This can also be used to
determine the maximum likelihood estimates for each of T.sub.s and
.alpha..sub.s.
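The constant-correlation covariance matrix described above may be constructed as in the following minimal sketch, which uses plain nested lists. The function name is hypothetical.

```python
def constant_correlation_covariance(sigma, rho):
    """Covariance matrix with sigma_i^2 on the diagonal and
    rho * sigma_i * sigma_j off the diagonal."""
    n = len(sigma)
    return [
        [sigma[i] ** 2 if i == j else rho * sigma[i] * sigma[j] for j in range(n)]
        for i in range(n)
    ]
```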
[0310] An example of a quantitative allelic maximum likelihood
method ("het rate")
[0311] Provided herein are methods for determining the ploidy state
using an allelic maximum likelihood method. The method will be
illustrated in the context of NIPT, but a skilled artisan will
appreciate that it can be utilized in detection of circulating free
tumor cells. In addition to the discussion below, detailed examples
of how to implement a het rate method can be found, among other
places, in published US patent application US 2012/0270212 A1 and
published US patent application US 2011/0288780 A1, both of which
are herein incorporated in their entirety by reference. However,
the het rate methods disclosed in these sources utilize data from
separate reference chromosomes.
[0312] In the NIPT example, the ploidy state of a fetus is determined
given sequence data that was measured on free floating DNA isolated from
maternal blood, wherein the free floating DNA contains some DNA of
maternal origin, and some DNA of fetal/placental origin. In this
example the ploidy state of the fetus is determined using an
allelic maximum likelihood method and a calculated fraction of
fetal DNA in the mixture that has been analyzed. This section also
describes an embodiment in which the fraction of fetal DNA or the
percentage of fetal DNA in the mixture can be measured. In some
embodiments the fraction can be calculated using only the
genotyping measurements made on the maternal blood sample itself,
which is a mixture of fetal and maternal DNA. In some embodiments
the fraction may be calculated also using the measured or otherwise
known genotype of the mother and/or the measured or otherwise known
genotype of the father.
[0313] For a particular chromosome, suppose there are N SNPs, for
which: Parent genotypes from ILLUMINA data, assumed to be correct:
mother m=(m.sub.1, . . . ,m.sub.N), father f=(f.sub.1, . . .
,f.sub.N), where m.sub.i, f.sub.i .di-elect cons. {AA, AB, BB}.
[0314] Set of NR sequence measurements S=(s.sub.1, . . .
,s.sub.NR).
[0315] Deriving most likely copy number from data
[0316] For each copy number hypothesis H considered, derive data
log likelihood LIK(H) on a whole chromosome and choose the best
hypothesis maximizing LIK, i.e.
$$H^{*} = \arg\max_{H} \mathrm{LIK}(H \mid D) = \arg\max_{H} \mathrm{LIK}(D \mid H)\,P(H),$$
where P(H) is a prior probability of the hypothesis, from prior
knowledge or estimate.
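The selection of the best hypothesis may be sketched as follows. This sketch assumes that the data likelihood for each hypothesis is supplied as a log likelihood, in which case weighting the likelihood by the prior P(H) corresponds to adding log P(H) before taking the maximum. The function name and dictionary inputs are hypothetical.

```python
import math


def best_hypothesis(log_likelihoods, priors):
    """Pick the hypothesis maximizing log-likelihood plus log-prior.

    log_likelihoods: dict hypothesis label -> log p(D | H).
    priors: dict hypothesis label -> P(H), each assumed positive.
    """
    scores = {h: ll + math.log(priors[h]) for h, ll in log_likelihoods.items()}
    return max(scores, key=scores.get)
```

A strong prior can outweigh a modest likelihood advantage, which is why the choice of P(H) matters when the hypotheses are poorly separated.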
[0317] Copy number hypotheses considered are:
[0318] Monosomy:
[0319] maternal H10 (one copy from mother)
[0320] paternal H01 (one copy from father)
[0321] Disomy: H11 (one copy each from mother and father)
[0322] Simple trisomy, no crossovers considered:
[0323] Maternal: H21_matched (two identical copies from mother, one
copy from father), H21_unmatched (BOTH copies from mother, one copy
from father)
[0324] Paternal: H12_matched (one copy from mother, two identical
copies from father), H12_unmatched (one copy from mother, both
copies from father)
[0325] Composite trisomy, allowing for crossovers (using a joint
distribution model):
[0326] maternal H21 (two copies from mother, one from father),
[0327] paternal H12 (one copy from mother, two copies from
father)
[0328] If there were no crossovers, each trisomy, whether the
origin was mitosis, meiosis I, or meiosis II, would be one of the
matched or unmatched trisomies. Due to crossovers, true trisomy is
a combination of the two. First, a method to derive hypothesis
likelihoods for simple hypotheses is described. Then a method to
derive hypothesis likelihoods for composite hypotheses is
described, combining individual SNP likelihood with crossovers.
Initially, it is assumed that the true child fraction and other
parameters such as beta noise parameter (N) and possible error
rates are known. A method for deriving child fraction cf from data
is also discussed below.
LIK(D|H) for Simple Hypotheses
[0329] For simple hypotheses H, LIK(D|H), the log likelihood of
data given hypothesis H on a whole chromosome, is calculated as the
sum of log likelihoods of individual SNPs, i.e.
$$\mathrm{LIK}(D \mid H) = \sum_{i} \mathrm{LIK}(D \mid H, cf, i)$$
[0330] This hypothesis does not assume any linkage between SNPs,
and therefore does not utilize a joint distribution model.
Log Likelihood Per SNP
[0331] On a particular SNP i, define m.sub.i=true mother genotype,
f.sub.i=true father genotype, and cf=known or derived child
fraction. Let x.sub.i=P(A|i,S) be the probability of having an A on
SNP i, given the sequence measurements S. Assuming child hypothesis
H, the log likelihood of observed data D on SNP i is defined as
[0332] P(D|m, f, c, H, cf, i)=P(SM|m, i)P(M|m, i)P(SF|f, i)P(F|f,
i)P(S|m, c, H, cf, i), which results in:
$$\mathrm{LIK}(i, H) = \log \mathrm{lik}(x_i \mid m_i, f_i, H, cf) = \sum_{c} p(c \mid m_i, f_i, H) \cdot \log \mathrm{lik}(x_i \mid m_i, c, cf),$$
where p(c|m, f, H) is the probability of getting true child
genotype=c, given parents m, f, and assuming hypothesis H, which
can be easily calculated. For example, for H11, H21_matched and
H21_unmatched, p(c|m,f,H) is given below.
TABLE-US-00001 p(c|m, f, H)
            H11                H21_matched               H21_unmatched
m    f      AA    AB    BB     AAA   AAB   ABB   BBB     AAA   AAB   ABB   BBB
AA   AA     1     0     0      1     0     0     0       1     0     0     0
AB   AA     0.5   0.5   0      0.5   0     0.5   0       0     1     0     0
BB   AA     0     1     0      0     0     1     0       0     0     1     0
AA   AB     0.5   0.5   0      0.5   0.5   0     0       0.5   0.5   0     0
AB   AB     0.25  0.5   0.25   0.25  0.25  0.25  0.25    0     0.5   0.5   0
BB   AB     0     0.5   0.5    0     0     0.5   0.5     0     0     0.5   0.5
AA   BB     0     1     0      0     1     0     0       0     1     0     0
AB   BB     0     0.5   0.5    0     0.5   0     0.5     0     0     1     0
BB   BB     0     0     1      0     0     0     1       0     0     0     1
[0333] P(D|m,f,c,H,i,cf) is the probability of the given data D on SNP
i, given true mother genotype m, true father genotype f, true child
genotype c, hypothesis H, and child fraction cf. It can be broken
down into probability of mother, father, and child data as
follows:
P(D|m, f, c, H, cf, i)=P(SM|m,i)P(M|m, i)P(SF|f, i)P(F|f, i)P(S|m,
c, H, cf, i).
lik(x.sub.i|m, c, cf) is the likelihood of getting derived
probability x.sub.i on SNP i, assuming true mother m, true child c,
defined as pdfx(x.sub.i) of the distribution that x.sub.i should be
following if hypothesis H were true. In particular
lik(x.sub.i|m,c,cf)=pdfx(x.sub.i)
[0334] In a simple case where D.sub.i of the NR sequences in S line up to
SNP i, X~(1/D.sub.i)Bin(p,D.sub.i), where
p=p(A|m,c,cf)=probability of getting an A, for this mother/child
mixture, calculated as:
$$\mathrm{Hetrate}_A = p(A \mid m, c, cf) = \frac{\#A(m) \cdot (1 - cf_{correct}) + \#A(c) \cdot cf_{correct}}{n_m \cdot (1 - cf_{correct}) + n_c \cdot cf_{correct}}$$
where #A(g)=number of A's in genotype g, n.sub.m=2 is somy of
mother and n.sub.c is somy of the child, (1 for monosomy, 2 for
disomy, 3 for trisomy). The initial cf may be determined using, for
example, an allele ratio plot.
[0335] cf.sub.correct is corrected fraction of the child in the
mixture:
$$cf_{correct} = \frac{cf \cdot n_c}{n_m \cdot (1 - cf) + n_c \cdot cf}$$
[0336] If the child is disomic, cf.sub.correct=cf, but for a trisomy the
fraction of the child in the mix for this chromosome is actually a
bit higher:
$$cf_{correct} = \frac{cf \cdot 3}{2 + cf}.$$
[0337] In a more complex case where there is not exact alignment, X
is a combination of binomials integrated over possible D.sub.i
reads per SNP.
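As an illustration only, the corrected child fraction and the resulting probability of an A read in the simple (exact alignment) case can be computed as in the following Python sketch. The function names and the encoding of genotypes as strings such as "AAB" are assumptions made for this example and do not appear in the disclosure; the arithmetic follows the two formulas given above as written.

```python
def corrected_child_fraction(cf, n_c, n_m=2):
    """Corrected fraction of child DNA on a chromosome where the child has n_c copies."""
    return cf * n_c / (n_m * (1 - cf) + n_c * cf)


def prob_a(mother, child, cf):
    """p(A | m, c, cf) for genotype strings such as 'AB' (mother) and 'AAB' (child).

    The somy of each individual is taken from the length of its genotype string
    (an encoding chosen for this sketch).
    """
    n_m, n_c = len(mother), len(child)
    cf_corr = corrected_child_fraction(cf, n_c, n_m)
    numerator = mother.count("A") * (1 - cf_corr) + child.count("A") * cf_corr
    denominator = n_m * (1 - cf_corr) + n_c * cf_corr
    return numerator / denominator
```

For a disomic child the correction leaves cf unchanged, while for a trisomic child at cf=0.1 the corrected fraction is 0.3/2.1, slightly above 0.14.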
Using A Joint Distribution Model: LIK(H) for a Composite
Hypothesis
[0338] In real life, trisomy is usually not purely matched or
unmatched, due to crossovers, so in this section results for
composite hypotheses H21 (maternal trisomy) and H12(paternal
trisomy) are derived, which combine matched and unmatched trisomy,
accounting for possible crossovers.
[0339] In the case of trisomy, if there were no crossovers, trisomy
would be simply matched or unmatched trisomy. Matched trisomy is
where child inherits two copies of the identical chromosome segment
from one parent. Unmatched trisomy is where child inherits one copy
of each homologous chromosome segment from the parent. Due to
crossovers, some segments of a chromosome may have matched trisomy,
and other parts may have unmatched trisomy. Described in this
section is how to build a joint distribution model for the
heterozygosity rates for a set of alleles.
[0340] Suppose that on SNP i, LIK(i, Hm) is the fit for matched
hypothesis H, and LIK(i, Hu) is the fit for UNmatched hypothesis H,
and pc(i)=probability of crossover between SNPs i-1,i. One may then
calculate the full likelihood as:
$$\mathrm{LIK}(H) = \sum_{S,E} \mathrm{LIK}(S, E, 1{:}N)$$
[0341] where LIK(S, E, 1:N) is the likelihood starting with
hypothesis S, ending in hypothesis E, for SNPs 1:N. S=hypothesis of
the first SNP, E=hypothesis of the last SNP, S,E ∈
{Hm, Hu}. Recursively one may calculate:
$$\mathrm{LIK}(S, E, 1{:}i) = \mathrm{LIK}(i, E) + \log\Big(\exp\big(\mathrm{LIK}(S, E, 1{:}i-1)\big)\,(1 - pc(i)) + \exp\big(\mathrm{LIK}(S, \sim E, 1{:}i-1)\big)\,pc(i)\Big)$$
where ~E is the other hypothesis (not E). In particular, one
may calculate the likelihood of 1:i SNPs, based on likelihood of
1:(i-1) SNPs with either the same hypothesis and no crossover or
the opposite hypothesis and a crossover, times the likelihood of the
SNP i.
[0342] For SNP i=1:
$$\mathrm{LIK}(S, E, 1{:}1) = \begin{cases} \mathrm{LIK}(1, S) & \text{if } S = E \\ 0 & \text{if } S \neq E \end{cases}$$
Then calculate:
$$\mathrm{LIK}(S, E, 1{:}2) = \mathrm{LIK}(2, E) + \log\Big(\exp\big(\mathrm{LIK}(S, E, 1{:}1)\big)\,(1 - pc(2)) + \exp\big(\mathrm{LIK}(S, \sim E, 1{:}1)\big)\,pc(2)\Big)$$
and so on until i=N.
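The recursion above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the per-SNP log likelihoods for the matched and unmatched hypotheses are assumed to be already computed, the S≠E case at the first SNP is read as a zero probability (log likelihood of minus infinity), and the four (S, E) terms are combined on the probability scale with a log-sum-exp. All function and argument names are hypothetical.

```python
import math


def safe_log(x):
    """log(x) that returns -inf for x == 0 instead of raising."""
    return math.log(x) if x > 0 else -math.inf


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def composite_log_likelihood(lik_matched, lik_unmatched, pc):
    """Combine matched/unmatched per-SNP log likelihoods allowing crossovers.

    lik_matched[i], lik_unmatched[i]: log likelihood of SNP i under Hm and Hu.
    pc[i]: crossover probability between SNP i-1 and SNP i (pc[0] is unused).
    """
    per_snp = {"m": lik_matched, "u": lik_unmatched}
    other = {"m": "u", "u": "m"}
    terms = []
    for start in ("m", "u"):
        # state[e]: log likelihood of the SNPs so far, starting in `start`, ending in e.
        # Assumption: a start/end mismatch at the first SNP has probability zero.
        state = {e: (per_snp[start][0] if e == start else -math.inf) for e in ("m", "u")}
        for i in range(1, len(lik_matched)):
            state = {
                e: per_snp[e][i] + logsumexp([
                    state[e] + safe_log(1 - pc[i]),    # same hypothesis, no crossover
                    state[other[e]] + safe_log(pc[i])  # other hypothesis, crossover
                ])
                for e in ("m", "u")
            }
        terms.extend(state.values())
    return logsumexp(terms)
```

With all crossover probabilities set to zero, the result reduces to the combination of the purely matched and purely unmatched likelihoods.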
Deriving Child Fraction
[0343] The above formulas assume a known child fraction, which is
not always the case. In one embodiment, it is possible to find the
most likely child fraction by maximizing the likelihood for disomy
on selected chromosomes.
[0344] In particular, suppose that LIK(chr, H11, cf) is the log
likelihood as described above, for the disomy hypothesis and for
child fraction cf on chromosome chr, for selected chromosomes in
Cset (usually 1:16). Then the full likelihood is:
$$\mathrm{LIK}(cf) = \sum_{chr \in Cset} \mathrm{LIK}(chr, H11, cf), \quad \text{and} \quad cf^{*} = \arg\max_{cf} \mathrm{LIK}(cf).$$
[0345] It is possible to use any set of chromosomes. It is also
possible to derive child fraction without paternal data, as
follows.
Deriving Copy Number Without Paternal Data
[0346] Recall the formula of the simple hypothesis log likelihood
on SNP i:
$$\mathrm{LIK}(i, H) = \log \mathrm{lik}(x_i \mid m_i, f_i, H, cf) = \sum_{c} p(c \mid m_i, f_i, H) \cdot \log \mathrm{lik}(x_i \mid m_i, c, H, cf)$$
[0347] Determining the probability of the true child given parents
p(c|m.sub.i, f.sub.i, H) requires the knowledge of father genotype.
If the father genotype is unknown, but pAi, the population
frequency of A allele on this SNP, is known, it is possible to
approximate the above likelihood with
$$\mathrm{LIK}(i, H) = \log \mathrm{lik}(x_i \mid m_i, f_i, H, cf) = \sum_{c} p(c \mid m_i, H) \cdot \log \mathrm{lik}(x_i \mid m_i, c, H, cf)$$
where
$$p(c \mid m_i, H) = \sum_{f} p(c \mid m_i, f, H) \cdot p(f \mid pA_i)$$
where p(f|pA.sub.i) is the probability of particular father
genotype, given the frequency of A on SNP i.
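The marginalization over the unknown father genotype can be sketched as follows for the disomy hypothesis H11. The conditional probabilities are copied from the H11 columns of the table of p(c|m, f, H) above, and the father genotype probabilities are the population-frequency expressions listed in the paragraph that follows. The names used here are hypothetical and chosen only for this illustration.

```python
# p(c | m, f, H11) for child genotypes (AA, AB, BB), keyed by (mother, father),
# copied from the H11 columns of the table in the text.
P_CHILD_H11 = {
    ("AA", "AA"): (1, 0, 0), ("AB", "AA"): (0.5, 0.5, 0), ("BB", "AA"): (0, 1, 0),
    ("AA", "AB"): (0.5, 0.5, 0), ("AB", "AB"): (0.25, 0.5, 0.25), ("BB", "AB"): (0, 0.5, 0.5),
    ("AA", "BB"): (0, 1, 0), ("AB", "BB"): (0, 0.5, 0.5), ("BB", "BB"): (0, 0, 1),
}


def father_prior(p_a):
    """p(f | pA): father genotype probabilities from the population frequency of A."""
    return {"AA": p_a ** 2, "AB": 2 * p_a * (1 - p_a), "BB": (1 - p_a) ** 2}


def child_given_mother(mother, p_a):
    """p(c | m, H11) = sum over f of p(c | m, f, H11) * p(f | pA)."""
    totals = [0.0, 0.0, 0.0]
    for father, p_f in father_prior(p_a).items():
        for k, p_c in enumerate(P_CHILD_H11[(mother, father)]):
            totals[k] += p_c * p_f
    return dict(zip(("AA", "AB", "BB"), totals))
```

For a mother of genotype AA and pA=0.3, the child is AA with probability 0.3 and AB with probability 0.7, as expected when the paternal allele is A with probability 0.3.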
[0348] In particular:
p(AA|pA.sub.i)=(pA.sub.i).sup.2,
p(AB|pA.sub.i)=2(pA.sub.i)*(1-pA.sub.i),
p(BB|pA.sub.i)=(1-pA.sub.i).sup.2
Training Method without using a Control Chromosome or Chromosome Segment
[0349] Suppose we have 3 data segments D.sub.1, D.sub.2 and
D.sub.3. Suppose that P(H) is the current prior on segment D.sub.1.
Suppose that p is a parameter with distribution P(p) (e.g., child
fraction cf or noise parameter np). Then the probability that a certain
hypothesis H (with prior P(H)) is true equals:
$$P(H \mid D_1, D_2, D_3) = \frac{1}{P(D_1, D_2, D_3)} \sum_{p} P(D_1, D_2, D_3, H, p)$$
which results in
$$P(H \mid D_1, D_2, D_3) = \frac{P(D_2, D_3)}{P(D_1, D_2, D_3)} \sum_{p} P(D_1 \mid H, p)\,P(H)\,P(p \mid D_2, D_3)$$
or, to approximate,
$$P(H \mid D_1, D_2, D_3) \sim \sum_{p} P(D_1 \mid H, p)\,P(H)\,P(p \mid D_2, D_3)$$
where the term P(D.sub.1|H, p) can be re-written as
$$P(D_1 \mid H, p) = P(D_1)\,\frac{P(H \mid D_1, p)}{P(H)}\,\frac{P(p \mid D_1)}{P(p)}.$$
Thus,
$$P(H \mid D_1, D_2, D_3) \sim \sum_{p} P(H \mid D_1, p)\,\frac{P(p \mid D_1)}{P(p)}\,P(p \mid D_2, D_3),$$
where the term P(p|D.sub.2, D.sub.3) is a parameter distribution
obtained from "training" on segments D.sub.2 and D.sub.3.
P(p|D.sub.1)/P(p) depends on what the actual hypothesis for segment
1 is, and may be dropped if unknown. The approximation loses some
information, but it can be more stable and intuitive, since each
piece is on a probability scale, and fits call per grid point,
scaled by grid point probability.
[0350] Significant processing advantages can be obtained if a
control chromosome or chromosome segment is not required, as the
tests can be run on only the chromosome(s) or chromosome segment(s)
of interest. In an embodiment, the chromosomes or chromosome
segments of interest themselves provide a baseline that can then be
used to evaluate the accuracy of the given hypotheses. For example,
by using the formula
$$P(p \mid D_1, D_2, D_3) = \frac{P(p \mid D_1)}{P(p)}\,\frac{P(p \mid D_2)}{P(p)}\,\frac{P(p \mid D_3)}{P(p)}\,P(p),$$
the above probability equation can also be written as:
$$P(H \mid D_1, D_2, D_3) \sim \sum_{p} P(H \mid D_1, p)\,\frac{P(p \mid D_1)}{P(p)}\,P(p \mid D_2, D_3) = \sum_{p} P(H \mid D_1, p)\,P(p \mid D_1, D_2, D_3)$$
[0351] In this equation, the probability P(H|D.sub.1, p) is
obtained per grid point, and is then scaled by the best parameter
distribution estimate given P(p, D.sub.1, D.sub.2, D.sub.3). Once
the grid points are fixed, P(H|D.sub.1, p) does not change.
However, when no fixed hypothesis exists (i.e., no control
chromosome or chromosome segment is used) for P(p, D.sub.1,
D.sub.2, D.sub.3), the final answer for P(H|D.sub.1, D.sub.2,
D.sub.3) can vary greatly depending on the prior put on each
segment hypothesis.
[0352] In other words, since the parameter distribution given all
the data is a composite of parameter distributions for each
segment,
$$P(p \mid D_i) \sim \sum_{G} P(D_i \mid p, G)\,P(G)\,P(p)$$
where P(G) is the hypothesis prior used on this segment for
purposes of parameter estimation.
[0353] To account for the lack of a control, a uniform hypothesis
prior f.sub.prior(H) for hypothesis H is obtained. For example,
this may be obtained by estimating child fraction using an allele
ratio plot as discussed above. Then, for each grid point p,
calculate a probability of the hypothesis ("per-grid call"):
P(H|D.sub.1,p)~P(D.sub.1|H, p)P(H)
where P(H) is the hypothesis prior used for segment calling. In an
embodiment, this is done only once to provide an idea of the calls
for the entire grid space.
[0354] For the first pass, f.sub.prior(H) is set to be P(H). The
parameter distribution for each segment is then obtained using:
$$P(p \mid D_i) \sim \sum_{H} P(D_i \mid p, H)\,f_{prior}(H)\,P(p)$$
The composite parameter distribution is then obtained:
$$P(p \mid D_1, D_2, D_3) = \frac{P(p \mid D_1)}{P(p)}\,\frac{P(p \mid D_2)}{P(p)}\,\frac{P(p \mid D_3)}{P(p)}\,P(p)$$
The (posterior) probability of each hypothesis is then obtained by
combining the parameter scaling with the per-grid call:
$$P(H \mid D_1, D_2, D_3) = \sum_{p} P(H \mid D_1, p)\,P(p \mid D_1, D_2, D_3).$$
[0355] This provides a new estimate of the distribution of the
hypothesis for each segment. f.sub.prior(H) can be updated with the
newly derived P(H|D.sub.1, D.sub.2, D.sub.3), and the process
(starting with calculating the probability of the hypothesis for
each grid point p) is repeated until convergence.
[0356] Convergence is reached when the total likelihood no longer
changes to any appreciable extent. In an embodiment, this can be
treated as an annealing problem, with the function to be optimized
being the likelihood of the data P(H|D.sub.1, D.sub.2, D.sub.3)
maximized by the best derived posterior P(H) and P(p)
distributions. That is, the function to maximize is:
$$L(D) = P(D_1, D_2, D_3) \sim \sum_{H} \sum_{p} P(D \mid H, p)\,P(H)\,P(p).$$
[0357] The hypotheses with final probabilities (i.e., calls), child
fraction, and noise parameters can then be output.
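The iteration described above can be sketched in Python on a discrete parameter grid. This is an illustration under several assumptions that are not stated in the disclosure: the likelihoods P(D.sub.s|H, p) are supplied as a precomputed nested list, every segment is called in the same way as segment D.sub.1, and convergence is detected from the change in the hypothesis posteriors rather than from the total likelihood. The function and argument names are hypothetical.

```python
def normalize(values):
    """Scale non-negative numbers so they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def train_without_control(lik, p_prior, h_prior, max_iter=200, tol=1e-10):
    """lik[s][h][g] holds P(D_s | H=h, p=grid point g) for segment s.

    Returns (posteriors, composite): posteriors[s][h] approximates
    P(H=h | all data) for segment s; composite[g] approximates P(p=g | all data).
    """
    segments = range(len(lik))
    hyps = range(len(h_prior))
    grid = range(len(p_prior))

    # Per-grid call, computed once: P(H | D_s, p) ~ P(D_s | H, p) P(H).
    call = [[normalize([lik[s][h][g] * h_prior[h] for h in hyps]) for g in grid]
            for s in segments]

    f_prior = [list(h_prior) for _ in segments]  # first pass: f_prior(H) = P(H)
    composite = list(p_prior)
    for _ in range(max_iter):
        # Per-segment parameter distribution:
        # P(p | D_s) ~ sum_H P(D_s | p, H) f_prior(H) P(p).
        seg_post = [normalize([sum(lik[s][h][g] * f_prior[s][h] for h in hyps) * p_prior[g]
                               for g in grid]) for s in segments]
        # Composite: P(p | all data) ~ P(p) * prod_s P(p | D_s) / P(p).
        raw = []
        for g in grid:
            value = p_prior[g]
            for s in segments:
                value *= seg_post[s][g] / p_prior[g]
            raw.append(value)
        composite = normalize(raw)
        # Posterior per segment: sum_p P(H | D_s, p) P(p | all data).
        updated = [[sum(call[s][g][h] * composite[g] for g in grid) for h in hyps]
                   for s in segments]
        change = max(abs(a - b) for s in segments
                     for a, b in zip(updated[s], f_prior[s]))
        f_prior = updated
        if change < tol:  # assumption: stop when the posteriors stop moving
            break
    return f_prior, composite
```

The returned posteriors play the role of the final calls, and the composite distribution summarizes the parameter (for example child fraction) supported by all segments together.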
[0358] In certain embodiments of the present disclosure, a method
of the invention for determining aneuploidy can include a
quantitative allelic method, technique, or algorithm that can be
used to determine the relative ratios of two or more different
haplotypes that contain the same set of loci in a sample of DNA.
The different haplotypes could represent two different homologous
chromosomes from one individual, three different homologous
chromosomes from a trisomic individual, three different homologous
haplotypes from a mother and a fetus where one of the haplotypes is
shared between the mother and the fetus, three or four haplotypes
from a mother and fetus where one or two of the haplotypes are
shared between the mother and the fetus, or other combinations. If
one or more of the haplotypes are known, or the diploid genotypes
of one or more of the individuals are known, then a set of alleles
that are polymorphic between the haplotypes can be chosen, and
average allele ratios can be determined based on the set of alleles
that uniquely originate from each of the haplotypes.
[0359] Direct sequencing of such a sample, however, is extremely
inefficient as it results in many sequences for regions that are
not polymorphic between the different haplotypes in the sample and
therefore reveal no information about the proportion of the two
haplotypes. Described herein is a method that specifically targets
and enriches segments of DNA in the sample that are more likely to
be polymorphic in the genome to increase the yield of allelic
information obtained by sequencing. Note that for the allele ratios
measured in an enriched sample to be truly representative of the
actual haplotype ratios it is critical that there is little or no
preferential enrichment of one allele as compared to the other
allele at a given locus in the targeted segments. Current methods
known in the art to target polymorphic alleles are designed to
ensure that at least some of any alleles present are detected.
However, these methods were not designed for the purpose of
measuring the allele ratio of polymorphic alleles present in the
original mixture. It is non-obvious that any particular method of
target enrichment would be able to produce an enriched sample
wherein the proportion of various alleles in the enriched sample is
about the same as the ratios of the alleles in the original
unamplified sample. While enrichment methods may be designed, in
theory, to accomplish such an aim, an ordinary person skilled in
the art is aware that there is a great deal of stochastic or
deterministic bias in current methods. One embodiment of the method
described herein allows a plurality of alleles found in a mixture
of DNA that correspond to a given locus in the genome to be
amplified, or preferentially enriched in a way that the degree of
enrichment of each of the alleles is nearly the same. Another way
to say this is that the method allows the relative quantity of the
alleles present in the mixture as a whole to be increased, while
the ratio between the alleles that correspond to each locus remains
essentially the same as they were in the original mixture of DNA.
For the purposes of this disclosure, for the ratio to remain
essentially the same, it is meant that the ratio of the alleles in
the original mixture divided by the ratio of the alleles in the
resulting mixture is between 0.5 and 1.5, between 0.8 and 1.2,
between 0.9 and 1.1, between 0.95 and 1.05, between 0.98 and 1.02,
between 0.99 and 1.01, between 0.995 and 1.005, between 0.998 and
1.002, between 0.999 and 1.001, or between 0.9999 and 1.0001.
Allele Distributions
[0360] In certain embodiments, the goal of the method is to detect
fetal copy number based on a maternal blood sample which contains
some free-floating fetal DNA. In some embodiments, the fraction of
fetal DNA compared to the mother's DNA is unknown. The combination
of a targeting method, such as LIPS, followed by sequencing results
in a platform response that consists of the count of observed
sequences associated with each allele at each SNP. The set of
possible alleles, either A/T or C/G, is known at each SNP. Without
loss of generality, the first allele will be labeled A and the
second allele will be labeled B. Thus, the measurement at each SNP
consists of the number of A sequences (N.sub.A) and the number of B
sequences (N.sub.B). These will be transformed for the purpose of future
calculations into the total sequence count (n) and the ratio of A
alleles to total (r). The sequence count for a single SNP will be
referred to as the depth of read. The fundamental principle which
allows copy number identification from this data is that the ratio
of A and B sequences will reflect the ratio of A and B alleles
present in the DNA being measured.
n=N.sub.A+N.sub.B
r=N.sub.A/(N.sub.A+N.sub.B)
[0361] Measurements will be initially aggregated over SNPs from the
same parent context based on unordered parent genotypes. Each
context is defined by the mother genotype and the father genotype,
for a total of 9 contexts. For example, all SNPs where the mother's
genotype is AA and the father's genotype is BB are members of the
AA|BB context. The A allele is defined as present at ratio r.sub.m
in the mother genotype and ratio r.sub.f in the father genotype. For
example, the allele A is present at ratio r.sub.m=1 where the
mother is AA and ratio r.sub.f=0.5 where the father is AB. Thus, each
context defines values for r.sub.m and r.sub.f. Although the child
genotypes cannot always be predicted from the parent genotypes, the
allele ratio averaged over a large number of SNPs can be predicted
based on the assumption that a parent AB genotype will contribute A
and B at equal rates.
[0362] Consider a copy number hypothesis for the child of the form
(n.sub.m,n.sub.f) where n.sub.m is the number of mother copies and
n.sub.f is the number of father copies of the chromosome. The expected
allele ratio r.sub.c in the child (averaged over SNPs in a
particular parent context) depends on the allele ratios of the
parent contexts and the parent copy numbers.
$$r_c = \frac{n_m r_m + n_f r_f}{n_m + n_f} \qquad (1)$$
[0363] In a mixture of maternal and fetal blood, allele copies will
be contributed from both the mother directly and from the child.
Assume that the mixture contains a fraction .delta. of child DNA.
Then in the mixture, the ratio r of the A allele in a given context
is a linear combination of the mother ratio r.sub.m and the child
ratio r.sub.c, which can be reduced to a linear combination of the
mother ratio and father ratio using equation 1.
$$r = (1 - \delta)\,r_m + \delta\,r_c = \left(1 - \delta\,\frac{n_f}{n_m + n_f}\right) r_m + \delta\,\frac{n_f}{n_m + n_f}\,r_f \qquad (2)$$
Equation 2 predicts the expected ratio of A alleles for SNPs in a
given context as a function of the copy number hypothesis
(n.sub.m,n.sub.f). Note that the allele ratio on individual SNPs is
not predicted by this equation because these depend on random
assignment where at least one parent is heterozygous. Therefore,
the set of sequences from all SNPs in a particular context will be
combined. Assuming that the context contains m SNPs, and recalling
that n sequences will be produced from each SNP, the data from that
context consists of N=mn sequences. Each of the N sequences is
considered an independent random trial where the theoretical rate
of A sequences is the allele ratio r. The measured rate of A
sequences $\hat{r}$ is therefore approximately Gaussian
distributed, for large N, with mean r and variance $\sigma^2 = r(1-r)/N$.
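Equations 1 and 2 and the variance expression translate directly into code. The sketch below is illustrative; the function names are hypothetical, and the parent ratios r.sub.m and r.sub.f are passed as numbers (1 for AA, 0.5 for AB, 0 for BB).

```python
def child_ratio(n_m, n_f, r_m, r_f):
    """Equation 1: expected A-allele ratio in the child for hypothesis (n_m, n_f)."""
    return (n_m * r_m + n_f * r_f) / (n_m + n_f)


def mixture_ratio(delta, n_m, n_f, r_m, r_f):
    """Equation 2: expected A-allele ratio in the maternal/child mixture."""
    return (1 - delta) * r_m + delta * child_ratio(n_m, n_f, r_m, r_f)


def ratio_variance(r, n_sequences):
    """Variance of the measured ratio when N independent sequences are observed."""
    return r * (1 - r) / n_sequences
```

For the AA|BB context with a child fraction of 0.1, the expected ratio is 0.95 under disomy and 1 - 0.1/3 (about 0.967) under maternal trisomy.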
[0364] Recall that the theoretical allele ratio is a function of
the parent copy numbers (n.sub.m,n.sub.f). Thus, each hypothesis h
results in a predicted allele ratio r.sub.i.sup.h for the SNP in
parent context i. The data likelihood is defined as the probability
of a given hypothesis producing the observed data. Thus, the
likelihood of measurement $\hat{r}_i$ from context i under
hypothesis h is a binomial distribution, which can be approximated
for large N as a Gaussian distribution with the following mean and
standard deviation. The mean is determined by the context and the hypothesis
as described in equation 2.
$$p(\hat{r}_i \mid h) = \mathcal{N}(\hat{r}_i;\, \mu, \sigma), \qquad \mu = r_i^h, \qquad \sigma = \sqrt{\frac{r_i^h\,(1 - r_i^h)}{N_i}}$$
[0365] The measurements on each of the nine contexts are assumed
independent given the parent copy numbers, due to the common
assumption of independent noise on each SNP. Thus, the data from a
particular chromosome consists of the sequence measurements from
contexts i ranging from 1 to 9. The likelihood of the observed
allele ratios $\{\hat{r}_1, \ldots, \hat{r}_9\}$ from the whole chromosome is therefore the product of
the individual context likelihoods:
$$p(\hat{r}_1, \ldots, \hat{r}_9 \mid h) = \prod_{i=1}^{9} p(\hat{r}_i \mid h) = \prod_{i=1}^{9} \mathcal{N}\!\left(\hat{r}_i;\; r_i^h,\; \sqrt{\frac{r_i^h\,(1 - r_i^h)}{N_i}}\right)$$
Parameter Estimation
[0366] Equation 2 predicts the allele ratio as a function of parent
copy number hypothesis, but also includes the fraction of child
DNA. Therefore, the data likelihood for each chromosome is a
function of $\delta$ through its effect on $r_i^h$. This effect is
highlighted through the notation $p(\hat{r}_1, \ldots, \hat{r}_9 \mid h; \delta)$. This parameter cannot
be predicted with high accuracy, and therefore must be estimated
from the data. A number of different approaches may be used for
parameter estimation. One method involves the measurement of
chromosomes for which copy number errors are not viable at the
stage of development where testing will be performed. The other
method measures only chromosomes on which errors are expected to
occur.
Measure Some Chromosomes Known to be Disomy
[0367] In this method, certain chromosomes will be measured which
cannot have copy number errors at the stage of development when
testing is performed. These chromosomes will be referred to as the
training set T. The copy number hypothesis on these chromosomes is
(1,1). Assuming that each chromosome is independent, the data
likelihood of the measurements from all chromosomes t in T is the
product of the individual chromosome likelihoods. The child
fraction .delta. can be selected to maximize the data likelihood
across the chromosomes in T conditioned on the disomy hypothesis.
Let $R_t$ represent the set of measurements $\hat{r}_i$
from all contexts i on chromosome t. Then, the maximum
likelihood estimate $\delta^{*}$ solves the following:
$$\delta^{*} = \arg\max_{\delta} \prod_{t \in T} p(R_t \mid h = (1,1);\, \delta)$$
[0368] This optimization has only one degree of freedom constrained
between zero and one, and therefore can easily be solved using a
variety of numerical methods. The solution .delta.* can then be
substituted into equation 2 in order to calculate the likelihoods
of each hypothesis on each chromosome.
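Because the optimization has a single bounded degree of freedom, a plain grid search is one simple way to solve it. The Python sketch below estimates the child fraction from training chromosomes assumed to be disomic, using the Gaussian context likelihood with the mean from equation 2. The data layout (a list of tuples per chromosome), the grid resolution, the use of log likelihoods in place of the product, and the choice to skip contexts whose predicted variance is zero are all assumptions made for this illustration.

```python
import math

PARENT_RATIO = {"AA": 1.0, "AB": 0.5, "BB": 0.0}  # ratio of allele A in a parent genotype


def expected_ratio(delta, n_m, n_f, r_m, r_f):
    """Equation 2: expected A ratio in the mixture for copy number hypothesis (n_m, n_f)."""
    share = delta * n_f / (n_m + n_f)
    return (1 - share) * r_m + share * r_f


def context_log_likelihood(r_hat, n_seq, mu):
    """Gaussian log likelihood of a measured ratio with variance mu(1-mu)/N."""
    var = mu * (1 - mu) / n_seq
    if var <= 0:
        return 0.0  # assumption: contexts with no predicted variance are skipped
    return -0.5 * math.log(2 * math.pi * var) - (r_hat - mu) ** 2 / (2 * var)


def chromosome_log_likelihood(measurements, delta, n_m=1, n_f=1):
    """measurements: list of (mother_genotype, father_genotype, r_hat, n_sequences)."""
    return sum(
        context_log_likelihood(
            r_hat, n_seq,
            expected_ratio(delta, n_m, n_f, PARENT_RATIO[m], PARENT_RATIO[f]))
        for m, f, r_hat, n_seq in measurements
    )


def estimate_child_fraction(training_chromosomes, grid_size=999):
    """Grid search for delta in (0, 1) maximizing the disomy likelihood on the training set."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return max(grid, key=lambda d: sum(chromosome_log_likelihood(ch, d)
                                       for ch in training_chromosomes))
```

The same chromosome_log_likelihood function can be evaluated with other values of n_m and n_f once the child fraction has been fixed.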
Measure Only Chromosomes Which May Have Copy Number Errors
[0369] If copy number errors are possible on all of the chromosomes
being measured, the accuracy of the ploidy determination increases
greatly if fetal fraction is estimated in parallel with the copy
number hypotheses. Note that the same copy number error present on
all measured chromosomes will be very difficult to detect. For
example, maternal trisomy on all chromosomes at a given child
concentration will result in the same theoretical allele ratios as
disomy on all chromosomes at lower child concentration, because in
both cases the contribution of mother alleles compared to father
alleles increases uniformly across all chromosomes and
contexts.
[0370] A straightforward approach for classification of a limited
set of chromosomes t is to consider the joint chromosome hypothesis
H, which consists of the joint set of hypotheses for all
chromosomes being tested. If the chromosome hypotheses consist of
disomy, maternal trisomy and paternal trisomy, the number of
possible joint hypotheses is 3.sup.T where T is the number of
tested chromosomes. A maximum likelihood estimate .delta.*(H) can
be calculated conditioned on each joint hypothesis. The likelihood
of the joint hypothesis is thus calculated as follows:
$$\delta^{*}(H) = \arg\max_{\delta} \prod_{t=1}^{T} p(R_t \mid H;\, \delta)$$
$$p(\text{all data} \mid H) = \prod_{t=1}^{T} p(R_t \mid H;\, \delta^{*}(H))$$
[0371] The joint hypothesis likelihoods p(all data|H) can be
calculated for each joint hypothesis H, and the maximum likelihood
hypothesis is selected, with its corresponding estimate .delta.*(H)
of the child fraction.
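The exhaustive search over joint hypotheses can be sketched as follows. The per-chromosome log likelihood is passed in as a callable so that the sketch stays independent of any particular measurement model; that callable, the grid of candidate child fractions, and all names are assumptions for this illustration. The loop visits all 3.sup.T joint hypotheses, so it is practical only for a small number of tested chromosomes.

```python
import itertools

# (mother copies, father copies): disomy, maternal trisomy, paternal trisomy
HYPOTHESES = [(1, 1), (2, 1), (1, 2)]


def best_joint_hypothesis(n_chromosomes, chrom_loglik, grid):
    """Return (log likelihood, joint hypothesis, delta*) for the best joint hypothesis.

    chrom_loglik(t, h, delta) must return log p(R_t | h; delta) for chromosome index t.
    """
    best = None
    for joint in itertools.product(HYPOTHESES, repeat=n_chromosomes):
        def total(delta, joint=joint):
            # Log of the product over chromosomes of p(R_t | H; delta).
            return sum(chrom_loglik(t, joint[t], delta) for t in range(n_chromosomes))

        delta_star = max(grid, key=total)  # delta*(H) by grid search
        score = total(delta_star)          # log p(all data | H)
        if best is None or score > best[0]:
            best = (score, joint, delta_star)
    return best
```

A caller would typically supply a chrom_loglik built from the context likelihoods described earlier and a grid such as 0.01, 0.02, and so on.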
Performance Specifications
[0372] The ability to distinguish between parent copy number
hypotheses is determined by models discussed in the previous
section. At the most general level, the difference in expected
allele ratios under the different hypotheses must be large compared
to the standard deviations of the measurements. Consider the
example of distinguishing between disomy and maternal trisomy, or
hypotheses h.sub.1=(1,1) and h.sub.2=(2, 1). Hypothesis 1 predicts
allele ratio r.sup.1 and hypothesis 2 predicts allele ratio
r.sup.2, as a function of the mother allele ratio r.sub.m and
father allele ratio r.sub.f for the context under
consideration.
$$r^{1} = \left(1 - \frac{\delta}{2}\right) r_m + \frac{\delta}{2}\,r_f$$
$$r^{2} = \left(1 - \frac{\delta}{3}\right) r_m + \frac{\delta}{3}\,r_f$$
[0373] The measured allele ratio $\hat{r}$ is predicted
to be Gaussian distributed, either with mean $r^1$ or mean
$r^2$, depending on whether hypothesis 1 or 2 is true. The
standard deviation of the measured allele ratio depends similarly
on the hypothesis, according to equation 3. In a scenario where one
can expect to identify either hypothesis 1 or 2 as truth based on
the measurement $\hat{r}$, the means $r^1$, $r^2$
and standard deviations $\sigma^1$, $\sigma^2$ must satisfy a
relationship such as the following, which guarantees that the means
are far apart compared to the standard deviations. This criterion
represents a 2 percent error rate, meaning a 2 percent chance of
either false negative or false positive.
$$|r^{1} - r^{2}| > 2\sigma^{1} + 2\sigma^{2}$$
Substituting the copy numbers for disomy (1, 1) and maternal
trisomy (2, 1) for hypotheses 1 and 2 results in the following
condition:
$$\left|\frac{\delta}{6}\,(r_f - r_m)\right| > 2\sigma^{1} + 2\sigma^{2}$$
$$\sigma^{1} = \sqrt{\frac{r^{1}(1 - r^{1})}{N}}$$
$$\sigma^{2} = \sqrt{\frac{r^{2}(1 - r^{2})}{N}}$$
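The separation condition can be checked numerically, and solved for the number of sequences N, as in the following sketch. The function names are hypothetical, and the standard deviations are taken as the square root of r(1-r)/N, consistent with the variance given for the measured allele ratio.

```python
import math


def predicted_ratios(delta, r_m, r_f):
    """Allele ratios predicted by disomy (1,1) and maternal trisomy (2,1)."""
    r1 = (1 - delta / 2) * r_m + (delta / 2) * r_f
    r2 = (1 - delta / 3) * r_m + (delta / 3) * r_f
    return r1, r2


def separable(delta, r_m, r_f, n_sequences):
    """True if |r1 - r2| > 2*sigma1 + 2*sigma2 for the given number of sequences."""
    r1, r2 = predicted_ratios(delta, r_m, r_f)
    s1 = math.sqrt(r1 * (1 - r1) / n_sequences)
    s2 = math.sqrt(r2 * (1 - r2) / n_sequences)
    return abs(r1 - r2) > 2 * s1 + 2 * s2


def required_sequences(delta, r_m, r_f):
    """Smallest real-valued N at which the criterion becomes an equality."""
    r1, r2 = predicted_ratios(delta, r_m, r_f)
    gap = abs(r1 - r2)
    if gap == 0:
        return math.inf  # contexts with r_m == r_f carry no information here
    spread = 2 * (math.sqrt(r1 * (1 - r1)) + math.sqrt(r2 * (1 - r2)))
    return (spread / gap) ** 2
```

For example, with a child fraction of 0.1 in the AA|BB context, roughly 2,300 sequences are needed before the two hypotheses satisfy the criterion.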
Overview of an Analysis Method Utilized in Methods Provided
Herein
[0374] In certain examples of embodiments of the present
disclosure, using the parent contexts, and chromosomes known to be
euploid, it is possible to estimate, by a set of simultaneous
equations, the proportion of DNA in the maternal blood from the
mother and the proportion of DNA in the maternal blood from the
fetus. These simultaneous equations are made possible by the
knowledge of the alleles present on the father. In particular,
alleles present on the father and not present on the mother provide
a direct measurement of fetal DNA. One may then look at the
particular chromosomes of interest, such as chromosome 21, and see
whether the measurements on this chromosome under each parental
context are consistent with a particular hypothesis, such as
H.sub.mp where m represents the number of maternal chromosomes and
p represents the number of paternal chromosomes e.g. H.sub.11
representing euploid, H.sub.21 and H.sub.12 representing maternal
and paternal trisomy respectively.
[0375] This method, unlike certain other methods for detecting
chromosome ploidy, does not use a reference chromosome as a basis by
which to compare observed allelic ratios on the chromosome of
interest to make a determination of aneuploidy.
[0376] This disclosure presents methods by which one may determine
the ploidy state of a gestating fetus, at one or more chromosomes,
in a non-invasive manner, using genetic information determined from
fetal DNA found in maternal blood. The fetal DNA may be purified,
partially purified, or not purified; genetic measurements may be
made on DNA that originated from more than one individual.
Informatics type methods can infer genetic information of the
target individual, such as the ploidy state, from the bulk
genotypic measurements at a set of alleles. The set of alleles may
contain various subsets of alleles, wherein one or more subsets may
correspond to alleles that are found on the target individual but
not found on the non-target individuals, and one or more other
subsets may correspond to alleles that are found on the non-target
individual and are not found on the target individual. The method
may involve comparing ratios of measured output intensities
for various subsets of alleles to expected ratios given various
potential ploidy states. The platform response may be determined,
and a correction for the bias of the system may be incorporated
into the method.
Key Assumptions of the Method:
[0377] The expected amount of genetic material in the maternal
blood from the mother is constant across all loci.
[0378] The expected amount of genetic material present in the
maternal blood from the fetus is constant across all loci assuming
the chromosomes are euploid.
[0379] The chromosomes that are non-viable (all excluding
13,18,21,X,Y) are all euploid in the fetus. In one embodiment, only
some of the non-viable chromosomes need be euploid on the
fetus.
General Problem Formulation:
[0380] One may write y.sub.ijk=g.sub.ijk(x.sub.ijk)+v.sub.ijk where
x.sub.ijk is the quantity of DNA on the allele k=1 or 2 (1
represents allele A and 2 represents allele B), j=1 . . . 23
denotes chromosome number and i=1 . . . N denotes the locus number
on the chromosome, g.sub.ijk is the platform response for the particular locus
and allele ijk, and v.sub.ijk is independent noise on the
measurement for that locus and allele. The amount of genetic
material is given by x.sub.ijk=am.sub.ijk+.DELTA.c.sub.ijk where a
is the amplification factor (or net effect of leakage, diffusion,
amplification etc.) of the genetic material present on each of the
maternal chromosomes, m.sub.ijk (either 0,1,2) is the copy number
of the particular allele on the maternal chromosomes, .DELTA. is
the amplification factor of the genetic material present on each of
the child chromosomes, and c.sub.ijk is the copy number (either
0,1,2,3) of the particular allele on the child chromosomes. Note
that for the first simplified explanation, a and .DELTA. are assumed to
be independent of locus and allele i.e. independent of i, j, and k.
This gives:
y.sub.ijk=g.sub.ijk(am.sub.ijk+.DELTA.c.sub.ijk)+v.sub.ijk
Approach Using an Affine Model that is Uniform Across All Loci:
[0381] One may model g with an affine model, and for simplicity
assume that the model is the same for each locus and allele,
although it will be understood after reading this disclosure how to
modify the approach when the affine model is dependent on i,j,k.
Assume the platform response model is
g.sub.ijk(x.sub.ijk)=b+am.sub.ijk+.DELTA.c.sub.ijk
where amplification factors a and .DELTA. have been used without loss of
generality, and a y-axis intercept b has been added which defines
the noise level when there is no genetic material. The goal is to
estimate a and .DELTA.. It is also possible to estimate b
independently, but assume for now that the noise level is roughly
constant across loci, and only use the set of equations based on
parent contexts to estimate a and .DELTA.. The measurement at each
locus is given by
y.sub.ijk=b+am.sub.ijk+.DELTA.c.sub.ijk+v.sub.ijk
[0382] Assuming that the noise v.sub.ijk is i.i.d. for each of the
measurements within a particular parent context, T, one can sum the
signals within that parent context. The parent contexts are
represented in terms of alleles A and B, where the first two
alleles represent the mother and the second two alleles represent
the father: T .di-elect cons. {AA|BB, BB|AA, AB|AB, AA|AA, BB|BB,
AA|AB, AB|AA, AB|BB, BB|AB}. For each context T, there is a set of
loci i,j where the parent DNA conforms to that context, represented
i,j .di-elect cons. T. Hence:
$$y_{T,k} = \frac{1}{N_T} \sum_{i,j \in T} y_{i,j,k} = b + a\,\overline{m_{k,T}} + \Delta\,\overline{c_{k,T}} + v_{k,T}$$
[0383] Where m.sub.k,T, c.sub.k,T, and v.sub.k,T represent the
means of the respective values over all the loci conforming to the
parent context T, or over all i,j .di-elect cons. T. The mean or
expected values
c.sub.k,T, will depend on the ploidy status of the child. The table
below describes the mean or expected values m.sub.k,T, and
c.sub.k,T, for k=1(allele A) or 2(allele B) and all the parent
contexts T. One may calculate the expected values assuming
different hypotheses on the child, namely euploidy and maternal
trisomy. The hypotheses are denoted by the notation H.sub.mf, where
m refers to the number of chromosomes from the mother and f refers
to the number of chromosomes from the father e.g. H.sub.11 is
euploid, H.sub.21 is maternal trisomy. Note that there is symmetry
between some of the states by switching A and B, but all states are
included for clarity:
TABLE-US-00002
Context             AA/BB  BB/AA  AB/AB  AA/AA  BB/BB  AA/AB  AB/AA  AB/BB  BB/AB
m.sub.A,T           2      0      1      2      0      2      1      1      0
m.sub.B,T           0      2      1      0      2      0      1      1      2
c.sub.A,T|H.sub.11  1      1      1      2      0      1.5    1.5    0.5    0.5
c.sub.B,T|H.sub.11  1      1      1      0      2      0.5    0.5    1.5    1.5
c.sub.A,T|H.sub.21  2      1      1.5    3      0      2.5    2      1      0.5
c.sub.B,T|H.sub.21  1      2      1.5    0      3      0.5    1      2      2.5
[0384] It is now possible to write a set of equations describing
all the expected values y.sub.T,k, which can be cast in matrix
form, as follows:
$$Y = B + A_H P + v$$
where
$$Y = [\, y_{AA|BB,1}\;\; y_{BB|AA,1}\;\; y_{AB|AB,1}\;\; y_{AA|AA,1}\;\; y_{BB|BB,1}\;\; y_{AA|AB,1}\;\; y_{AB|AA,1}\;\; y_{AB|BB,1}\;\; y_{BB|AB,1}\;\; y_{AA|BB,2}\;\; y_{BB|AA,2}\;\; y_{AB|AB,2}\;\; y_{AA|AA,2}\;\; y_{BB|BB,2}\;\; y_{AA|AB,2}\;\; y_{AB|AA,2}\;\; y_{AB|BB,2}\;\; y_{BB|AB,2} \,]^T$$
$$P = \begin{bmatrix} a \\ \Delta \end{bmatrix}$$
is the matrix of parameters to estimate
[0385] B=b{right arrow over (1)}, where {right arrow over (1)} is
the 18.times.1 matrix of ones;
[0386] v=[v.sub.A,AA|BB . . . v.sub.B,BB|AB].sup.T is the
18.times.1 matrix of noise terms;
[0387] and A.sub.H is the matrix encapsulating the data in the
table, where the values are different for each hypothesis H on the
ploidy state of the child. Below are examples of the matrix A.sub.H
for the ploidy hypotheses H.sub.11 and H.sub.21:
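$$A_{H_{11}} = \begin{bmatrix}
2.0 & 0 & 1.0 & 2.0 & 0 & 2.0 & 1.0 & 1.0 & 0 & 0 & 2.0 & 1.0 & 0 & 2.0 & 0 & 1.0 & 1.0 & 2.0 \\
1.0 & 1.0 & 1.0 & 2.0 & 0 & 1.5 & 1.5 & 0.5 & 0.5 & 1.0 & 1.0 & 1.0 & 0 & 2.0 & 0.5 & 0.5 & 1.5 & 1.5
\end{bmatrix}^{T}$$
$$A_{H_{21}} = \begin{bmatrix}
2.0 & 0 & 1.0 & 2.0 & 0 & 2.0 & 1.0 & 1.0 & 0 & 0 & 2.0 & 1.0 & 0 & 2.0 & 0 & 1.0 & 1.0 & 2.0 \\
2.0 & 1.0 & 1.5 & 3.0 & 0 & 2.5 & 2.0 & 1.0 & 0.5 & 1.0 & 2.0 & 1.5 & 0 & 3.0 & 0.5 & 1.0 & 2.0 & 2.5
\end{bmatrix}^{T}$$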
[0388] In order to estimate a and .DELTA., or matrix P, aggregate
the data across a set of chromosomes that one may assume are
euploid on the child sample. This could include all chromosomes j=1
. . . 23 except those that are under test, namely j=13, 18, 21, X
and Y. (Note: one could also apply a concordance test for the
results on the individual chromosomes in order to detect mosaic
aneuploidy on the non-viable chromosomes.) In order to clarify
notation, define Y' as Y measured over all the euploid chromosomes,
and Y'' as Y measured over a particular chromosome under test, such
as chromosome 21, which may be aneuploid. Apply the matrix
A.sub.H.sub.11 to the euploid data in order to estimate the
parameters:
$$\hat{P} = \arg\min_{P} \lVert Y' - B - A_{H_{11}} P \rVert_2 = \left( A_{H_{11}}^{T} A_{H_{11}} \right)^{-1} A_{H_{11}}^{T} \tilde{Y}$$
where {tilde over (Y)}=Y'-B, i.e., the measured data with the bias
removed. The least-squares solution above is only the
maximum-likelihood solution if each of the terms in the noise
matrix v has a similar variance. This is not the case, most simply
because the number of loci N'.sub.T used to compute the mean
measurement for each context T is different for each context. As
above, use the N.sub.T' to refer to the number of loci used on the
chromosomes known to be euploid, and use the C' to denote the
covariance matrix for mean measurements on the chromosomes known to
be euploid. There are many approaches to estimating the covariance
C' of the noise matrix v, which one may assume is distributed as
v.about.N(0, C'). Given the covariance matrix, the
maximum-likelihood estimate of P is
$$\hat{P} = \arg\min_{P} \lVert C'^{-1/2} \left( Y' - B - A_{H_{11}} P \right) \rVert_2 = \left( A_{H_{11}}^{T} C'^{-1} A_{H_{11}} \right)^{-1} A_{H_{11}}^{T} C'^{-1} \tilde{Y}$$
[0389] One simple approach to estimating the covariance matrix is
to assume that all the terms of v are independent (i.e. no
off-diagonal terms) and invoke the Central Limit Theorem so that
the variance of each term of v scales as 1/N'.sub.T so that one may
find the 18.times.18 matrix
$$C' = \begin{bmatrix} 1/N'_{AA|BB} & & 0 \\ & \ddots & \\ 0 & & 1/N'_{BB|AB} \end{bmatrix}$$
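The weighted estimate above can be sketched in a few lines. The following is a minimal illustration, not the disclosed implementation: it builds A.sub.H.sub.11 from the table of expected copy numbers and solves the 2.times.2 normal equations directly, using C'=diag(1/N'.sub.T) so that each context mean is simply weighted by its locus count. The function names and the locus counts in the usage note are hypothetical.

```python
# Expected copy numbers per parent context, in the order
# AA|BB, BB|AA, AB|AB, AA|AA, BB|BB, AA|AB, AB|AA, AB|BB, BB|AB.
M_A = [2, 0, 1, 2, 0, 2, 1, 1, 0]
M_B = [0, 2, 1, 0, 2, 0, 1, 1, 2]
C_A_H11 = [1, 1, 1, 2, 0, 1.5, 1.5, 0.5, 0.5]
C_B_H11 = [1, 1, 1, 0, 2, 0.5, 0.5, 1.5, 1.5]


def design_matrix(m_a, m_b, c_a, c_b):
    """Stack allele-A rows on top of allele-B rows: one (m, c) pair per row."""
    return [(float(m), float(c)) for m, c in zip(m_a + m_b, c_a + c_b)]


A_H11 = design_matrix(M_A, M_B, C_A_H11, C_B_H11)


def weighted_least_squares(A, y_tilde, n_loci):
    """Solve (A^T C^-1 A) P = A^T C^-1 y_tilde for P = (a, delta).

    C = diag(1 / N_T), so C^-1 weights each row by its locus count N_T.
    y_tilde is the bias-corrected measurement vector Y' - B.
    """
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (m, c), y, w in zip(A, y_tilde, n_loci):
        s11 += w * m * m
        s12 += w * m * c
        s22 += w * c * c
        t1 += w * m * y
        t2 += w * c * y
    det = s11 * s22 - s12 * s12
    a = (s22 * t1 - s12 * t2) / det
    delta = (s11 * t2 - s12 * t1) / det
    return a, delta
```

For example, with hypothetical per-context locus counts repeated for both alleles and noise-free data generated with a=1.0 and .DELTA.=0.1, the function returns those two values up to rounding error.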
[0390] Once {circumflex over (P)} has been estimated, use these parameters to
determine the most likely hypothesis on the chromosome under study,
such as chromosome 21. In other words, choose the hypothesis:
$$H^{*} = \arg\min_{H} \lVert C''^{-1/2} \left( Y'' - B - A_{H} \hat{P} \right) \rVert_2$$
[0391] Having found H* one may then estimate the degree of
confidence that one may have in the determination of H*. Assume,
for example, that there are two hypotheses under consideration:
H.sub.11 (euploid) and H.sub.21 (maternal trisomy). Assume that
H*=H.sub.11. Compute the distance measures corresponding to each of
the hypotheses:
$$d_{11} = \lVert C''^{-1/2} \left( Y'' - B - A_{H_{11}} \hat{P} \right) \rVert_2$$
$$d_{21} = \lVert C''^{-1/2} \left( Y'' - B - A_{H_{21}} \hat{P} \right) \rVert_2$$
[0392] It can be shown that the squares of these distance measures
are roughly distributed as a Chi-Squared random variable with 18
degrees of freedom. Let .chi..sub.18 represent the corresponding
probability density function for such a variable. One may then find
the ratio of the probabilities P.sub.H of each of the hypotheses
according to:
$$\frac{P_{H_{11}}}{P_{H_{21}}} = \frac{\chi_{18}(d_{11}^{2})}{\chi_{18}(d_{21}^{2})}$$
[0393] One may then compute the probabilities of each hypothesis by
adding the equation P.sub.H.sub.11+P.sub.H.sub.21=1. The confidence
that the chromosome is in fact euploid is given by
P.sub.H.sub.11.
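The distance and confidence calculation can be illustrated as follows. This is a sketch under stated assumptions: C'' is taken to be diag(1/N''.sub.T) as in the simple covariance model, the chi-squared density is evaluated in closed form, and all function names are hypothetical.

```python
import math


def chi2_pdf(x, k=18):
    """Chi-squared probability density with k degrees of freedom, for x > 0."""
    log_pdf = ((k / 2 - 1) * math.log(x) - x / 2
               - (k / 2) * math.log(2) - math.lgamma(k / 2))
    return math.exp(log_pdf)


def whitened_distance_sq(y, b, A_H, p_hat, n_loci):
    """Squared distance ||C''^(-1/2) (Y'' - B - A_H P)||^2, C'' = diag(1/N_T)."""
    a, delta = p_hat
    return sum(n * (yi - b - a * m - delta * c) ** 2
               for yi, (m, c), n in zip(y, A_H, n_loci))


def euploid_confidence(d11_sq, d21_sq, k=18):
    """Combine P11/P21 = chi(d11^2)/chi(d21^2) with P11 + P21 = 1."""
    f11 = chi2_pdf(d11_sq, k)
    f21 = chi2_pdf(d21_sq, k)
    return f11 / (f11 + f21)
```

When the two squared distances are equal the confidence is 0.5; it approaches 1 as the euploid distance stays near the bulk of the chi-squared density while the trisomy distance moves far into the tail.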
Variations on the Method
[0394] (1) One may modify the above approach for different biases b
on each of the channels representing alleles A and B. The bias
matrix B is redefined as follows:
$$B = \begin{bmatrix} b_{A} \vec{1} \\ b_{B} \vec{1} \end{bmatrix}$$
where {right arrow over (1)} is a 9.times.1 matrix of ones. As
discussed above, the parameters b.sub.A and b.sub.B can either be
assumed based on a-priori measurements, or can be included in the
matrix P and actively estimated (i.e. there is sufficient rank in
the equations over all the contexts to do so).
[0395] (2) In the general formulation, where
y.sub.ijk=g.sub.ijk(am.sub.ijk+.DELTA.c.sub.ijk)+v.sub.ijk, one may
directly measure or calibrate the function g.sub.ijk for every
locus and allele, so that the function (which one may assume is
monotonic for the vast majority of genotyping platforms) can be
inverted. One may then use the function inverse to recast the
measurements in terms of the quantity of genetic material so that
the system of equations is linear i.e.
y'.sub.ijk=g.sub.ijk.sup.-1(y.sub.ijk)=am.sub.ijk+.DELTA.c.sub.ijk+v'.sub.ijk.
This approach is particularly good when g.sub.ijk is an
affine function so that the inversion does not produce
amplification or biasing of the noise in v'.sub.ijk.
[0396] (3) The method above may not be optimal from a noise
perspective since the modified noise term
v'.sub.ijk=g.sub.ijk.sup.-1 (v.sub.ijk) may be amplified or biased
by the function inversion. Another approach is to linearize the
measurements around an operating point i.e.
y.sub.ijk=g.sub.ijk(am.sub.ijk+.DELTA.c.sub.ijk)+v.sub.ijk may be
recast as:
y.sub.ijk.apprxeq.g.sub.ijk(am.sub.ijk)+g.sub.ijk'(am.sub.ijk).DELTA.c.sub.ijk+v.sub.ijk.
Since one may expect no more than 30% of the
free-floating DNA in the maternal blood to be from the child,
.DELTA.<<a, and the expansion is a reasonable approximation.
Alternatively, for a platform response such as that of the ILLUMINA
BEAD ARRAY, which is monotonically increasing and for which the
second derivative is always negative, one could improve the
linearization estimate according to
y.sub.ijk.apprxeq.g.sub.ijk(am.sub.ijk)+0.5
(g.sub.ijk'(am.sub.ijk)+g.sub.ijk'(am.sub.ijk+.DELTA.c.sub.ijk))
.DELTA.c.sub.ijk+v.sub.ijk. The resulting set of equations may be
solved iteratively for a and .DELTA. using a method such as
Newton-Raphson optimization.
[0397] (4) Another general approach is to measure the total
amount of DNA on the test chromosome (mother plus fetus) and
compare with the amount of DNA on all other chromosomes, based on
the assumption that amount of DNA should be constant across all
chromosomes. Although this is simpler, one disadvantage is that it
is not known how much is contributed by the child so it is not
possible to estimate confidence bounds meaningfully. However, one
could look at standard deviation across other chromosome signals
that should be euploid to estimate the signal variance and generate
a confidence bound. This method involves including measurements of
maternal DNA which are not on the child DNA so these measurements
contribute nothing to the signal but do contribute directly to
noise. In addition, it is not possible to calibrate out the
amplification biases amongst different chromosomes. To address this
last point, it is possible to find a regression function linking
each chromosome's mean signal level to every other chromosome's mean
signal level, combine the signal from all chromosomes by weighting
based on variance of the regression fit, and look to see whether
the test chromosome of interest is within the acceptable range as
defined by the other chromosomes.
Incorporating Data Dropouts
[0398] Elsewhere in this disclosure it has been assumed that the
probability of getting an A is a direct function of the true mother
genotype, the true child genotype, the fraction of the child in the
mix, and the child copy number. It is also possible that mother or
child alleles can drop out, for example instead of having true
child AB in the mix, there is only A, in which case the chance of
getting a nexus sequence measurement of A is much higher. Assume
that mother dropout rate is MDO, and child dropout rate is CDO. In
some embodiments, the mother dropout rate can be assumed to be
zero, and child dropout rates are relatively low, so the results in
practice are not severely affected by dropouts. Nonetheless, they
have been incorporated into the algorithm here. Elsewhere,
lik(x.sub.i|m.sub.i, c, cf)=pdf.sub.x(x.sub.i) has been defined as
the likelihood of getting x.sub.i probability of A on SNP i, given
sequence measurements S, assuming true mother m.sub.i, true child
c. If there is a dropout in the mother or child, the input data is
NOT true mother(m.sub.i) or child(c), but mother after possible
dropout (m.sub.d) and child after a possible dropout (c.sub.d). One
can then rewrite the above formula as
$$\mathrm{lik}(x_i \mid m_i, c, cf) = \sum_{m_d,\, c_d} p(m_d \mid m_i)\; p(c_d \mid c)\; \mathrm{lik}(x_i \mid m_d, c_d, cf)$$
where p(m.sub.d|m.sub.i) is the probability of new mother
genotype m.sub.d, given true mother genotype m.sub.i, assuming
dropout rate MDO, and p(c.sub.d|c) is the probability of new child
genotype
c.sub.d, given true child genotype c, assuming dropout rate CDO. If
nA.sub.T=number of A alleles in true genotype c, nA.sub.D=number of
A alleles in `drop` genotype c.sub.d, where
nA.sub.T.gtoreq.nA.sub.D, and similarly nB.sub.T=number of B
alleles in true genotype c, nB.sub.D=number of B alleles in `drop`
genotype c.sub.d, where nB.sub.T.gtoreq.nB.sub.D and d=dropout rate,
then
$$p(c_d \mid c) = \binom{nA_T}{nA_D}\, d^{\,nA_T - nA_D}\, (1-d)^{nA_D} \cdot \binom{nB_T}{nB_D}\, d^{\,nB_T - nB_D}\, (1-d)^{nB_D}$$
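The dropout-adjusted likelihood can be sketched as below. This is an illustrative sketch only: genotypes are represented as strings such as 'AB' or 'AAB', the caller supplies the dropout-free likelihood as a function of the post-dropout genotypes, and all function names are hypothetical. Note that the enumeration includes the case in which every allele drops out, which the supplied likelihood function is assumed to handle.

```python
from math import comb


def p_allele_dropout(n_true, n_drop, d):
    """Probability that n_drop of n_true copies survive when each drops with rate d."""
    if not 0 <= n_drop <= n_true:
        return 0.0
    return comb(n_true, n_drop) * d ** (n_true - n_drop) * (1 - d) ** n_drop


def p_genotype_after_dropout(true_gt, drop_gt, d):
    """p(c_d | c): product of the binomial terms for the A and B alleles."""
    return (p_allele_dropout(true_gt.count("A"), drop_gt.count("A"), d)
            * p_allele_dropout(true_gt.count("B"), drop_gt.count("B"), d))


def dropout_genotypes(gt):
    """All genotypes reachable from gt by dropping zero or more alleles."""
    n_a, n_b = gt.count("A"), gt.count("B")
    return ["A" * i + "B" * j for i in range(n_a + 1) for j in range(n_b + 1)]


def likelihood_with_dropout(lik, mother, child, mdo, cdo):
    """Sum lik(m_d, c_d) over post-dropout genotypes, weighted by their probabilities."""
    total = 0.0
    for m_d in dropout_genotypes(mother):
        for c_d in dropout_genotypes(child):
            total += (p_genotype_after_dropout(mother, m_d, mdo)
                      * p_genotype_after_dropout(child, c_d, cdo)
                      * lik(m_d, c_d))
    return total
```

For a true child genotype AB and a dropout rate of 0.1, the probability of observing only A is 0.9 x 0.1 = 0.09, and the probabilities over all post-dropout genotypes sum to 1.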
[0399] For one set of experimental data, the parent genotypes have
been measured, as well as the true child genotype, where the child
has maternal trisomy on chromosomes 14 and 21. Sequencing
measurements have been simulated for varying values of child
fraction, N distinct SNPs, and total number of reads NR. From this
data it is possible to derive the most likely child fraction, and
derive copy number assuming known or derived child fraction.
[0400] In one embodiment, the method disclosed herein can be used
to determine a fetal aneuploidy by determining the number of copies
of maternal and fetal target chromosomes, having target sequences
in a mixture of maternal and fetal genetic material. This method
may entail obtaining maternal tissue containing both maternal and
fetal genetic material; in some embodiments this maternal tissue
may be maternal plasma or a tissue isolated from maternal blood.
This method may also entail obtaining a mixture of maternal and
fetal genetic material from said maternal tissue by processing the
aforementioned maternal tissue. This method may entail distributing
the genetic material obtained into a plurality of reaction samples,
to randomly provide individual reaction samples that contain a
target sequence from a target chromosome and individual reaction
samples that do not contain a target sequence from a target
chromosome, for example, performing high throughput sequencing on
the sample. This method may entail analyzing the target sequences
of genetic material present or absent in said individual reaction
samples to provide a first number of binary results representing
presence or absence of a presumably euploid fetal chromosome in the
reaction samples and a second number of binary results representing
presence or absence of a possibly aneuploid fetal chromosome in the
reaction samples. Either of the number of binary results may be
calculated, for example, by way of an informatics technique that
counts sequence reads that map to a particular chromosome, to a
particular region of a chromosome, to a particular locus or set of
loci. This method may involve normalizing the number of binary
events based on the chromosome length, the length of the region of
the chromosome, or the number of loci in the set. This method may
entail calculating an expected distribution of the number of binary
results for a presumably euploid fetal chromosome in the reaction
samples using the first number. This method may entail calculating
an expected distribution of the number of binary results for a
presumably aneuploid fetal chromosome in the reaction samples using
the first number and an estimated fraction of fetal DNA found in
the mixture, for example, by multiplying the expected read count
distribution of the number of binary results for a presumably
euploid fetal chromosome by (1+n/2) where n is the estimated fetal
fraction. The fetal fraction may be estimated by a plurality of
methods, some of which are described elsewhere in this disclosure.
This method may involve using a maximum likelihood approach to
determine whether the second number corresponds to the possibly
aneuploid fetal chromosome being euploid or being aneuploid. This
method may involve calling the ploidy status of the fetus to be the
ploidy state that corresponds to the hypothesis with the maximum
likelihood of being correct given the measured data.
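The final comparison in the embodiment above can be sketched as follows. The disclosure does not fix the form of the expected distribution; this sketch assumes, purely for illustration, that the normalized read count on the test chromosome is Poisson distributed, with the trisomy expectation obtained by multiplying the euploid expectation by (1+n/2). The function names are hypothetical.

```python
import math


def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing count k with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def call_ploidy(test_count, expected_euploid_count, fetal_fraction):
    """Return the maximum-likelihood hypothesis and the log-likelihoods.

    expected_euploid_count is derived from the presumably euploid
    chromosome(s); the trisomy mean scales it by (1 + fetal_fraction / 2).
    """
    expected = {
        "euploid": expected_euploid_count,
        "trisomy": expected_euploid_count * (1 + fetal_fraction / 2),
    }
    log_lik = {h: poisson_log_pmf(test_count, lam) for h, lam in expected.items()}
    return max(log_lik, key=log_lik.get), log_lik
```

With an expected euploid count of 100,000 reads and a fetal fraction of 10%, an observed count of 105,000 is called trisomy and an observed count of 100,000 is called euploid.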
Simplified Explanation for Allele Ratio Method for Ploidy Calling
in NPD
[0401] In one embodiment the ploidy state of a gestating fetus may
be determined using a method that looks at allele ratios. Some
methods determine fetal ploidy state by comparing numerical
sequencing output DNA counts from a suspect chromosome to a
reference euploid chromosome. In contrast to that concept, the
allele ratio method determines fetal ploidy state by looking at
allele ratios for different parental contexts on one chromosome.
This method has no need to use a reference chromosome. For example,
imagine the following possible ploidy states, and the allele ratios
for various parental contexts:
(note: ratio `r` is defined as follows: 1/r=fraction mother
DNA/fraction fetal DNA)
TABLE-US-00003
Parent    A:B        Child      A:B        Child      A:B        Child
context   Euploidy   genotype   P-U tri*   genotype   P-M tri*   genotype
AA|BB     2 + r:r    AB         2 + r:2r   ABB        2 + r:2r   ABB
BB|AA     r:2 + r    AB         2 + 2r:r   AAB        2 + 2r:r   AAB
AA|AB     1:0        AA         2 + 2r:r   AAB        1:0        AAA
AA|AB     2 + r:r    AB         --         --         2 + 2r:r   AAB
AA|AB     4 + 2r:r   average    --         --         4 + 4r:r   average
*P-U tri = paternal unmatched trisomy; P-M tri = paternal matching trisomy
[0402] Note that this table represents only a subset of the
parental contexts and a subset of the possible ploidy states that
this method is designed to differentiate. In this case, one can
determine the A:B ratios for a plurality of alleles from a set of
parental contexts in a set of sequencing data. One can then state a
number of hypotheses for each ploidy state, and for each value of
r; each hypothesis will have an expected pattern of A:B ratios for
the different parental contexts. One can then determine which
hypothesis best fits the experimental data.
[0403] For example, using the above set of parental contexts, and
the value of r=0.2, one can rewrite the chart as follows: (For
example, one can calculate [# reads of allele A/# reads of allele
B]; thus 2+r:r becomes 2+0.2:0.2.fwdarw.2.2:0.2=11)
TABLE-US-00004
Parent    A/B        Child      A/B        Child      A/B        Child
context   Euploidy   genotype   P-U tri*   genotype   P-M tri*   genotype
AA|BB     11         AB         5.5        ABB        5.5        ABB
BB|AA     0.91       AB         12         AAB        12         AAB
AA|AB     infinite   AA         12         AAB        infinite   AAA
AA|AB     11         AB         --         --         12         AAB
AA|AB     21         average    --         --         44         average
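The conversion from a symbolic A:B entry to a single number can be sketched as follows for r=0.2. This is a small illustration that evaluates only a few of the symbolic forms appearing in the chart; exact rational arithmetic is used to avoid floating-point rounding, and the function names are hypothetical.

```python
from fractions import Fraction


def a_to_b(a_units, b_units):
    """Return A/B as an exact Fraction, or None when no B reads are expected."""
    if b_units == 0:
        return None  # corresponds to an "infinite" entry in the chart
    return Fraction(a_units) / Fraction(b_units)


def example_ratios(r):
    """Evaluate a few symbolic A:B entries for a given ratio r."""
    return {
        "2+r:r": a_to_b(2 + r, r),
        "2+r:2r": a_to_b(2 + r, 2 * r),
        "2+2r:r": a_to_b(2 + 2 * r, r),
        "1:0": a_to_b(1, 0),
    }


ratios = example_ratios(Fraction(1, 5))  # r = 0.2
```

With r=0.2 these evaluate to 11, 5.5 and 12 respectively, and the 1:0 entry has no finite ratio.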
[0404] Now, one can look at the ratios between the A:B ratios for
different parental contexts. In this case, one may expect the
A:B.sub.AA|BB/A:B.sub.AA|AB to be 11/21=0.524 on average for
euploidy; to be 5.5/12=0.458 on average for a paternal unmatched
trisomy, and 5.5/44=0.125 on average for a paternal matching
trisomy. The profile of A:B ratios among different contexts will be
different for different ploidy states, and the profiles should be
distinctive enough that it will be possible to determine the ploidy
state for a chromosome with high accuracy. Note that the calculated
value of r may be determined using a different method, or it can be
determined using a maximum likelihood approach to this method. In
one embodiment, the method requires maternal genotypic
knowledge. In one embodiment the method requires paternal genotypic
knowledge. In one embodiment the method does not require paternal
genotypic knowledge. In an embodiment, the percent fetal fraction
and the ratio of maternal to fetal DNA are essentially equivalent,
and can be used interchangeably after applying the appropriate
linear algebraic transformation. In some embodiments, r=[percent
fetal fraction]/[1-percent fetal fraction].
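The transformation between fetal fraction and r can be written out as a pair of functions. This sketch assumes the fetal fraction is expressed as a number between 0 and 1 rather than as a percentage; the function names are hypothetical.

```python
def ratio_from_fetal_fraction(fetal_fraction):
    """r = fetal fraction / (1 - fetal fraction), for 0 <= fetal_fraction < 1."""
    return fetal_fraction / (1 - fetal_fraction)


def fetal_fraction_from_ratio(r):
    """Inverse transformation: fetal fraction = r / (1 + r)."""
    return r / (1 + r)
```

For example, a value of r=0.2 corresponds to a fetal fraction of 0.2/1.2, or roughly 16.7%.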
SNP Classification Using Phred Scores
[0405] The phred score, q, is defined as follows: P(wrong base
call)=10.sup.(-q/10). Let x=reference ratio of true genotype=number
of reference alleles/number of total alleles. For disomy, x in {0,
0.5, 1} corresponds to {MM, RM, RR}. Let z be the allele observed
in a sequence, z in {R, M}. Here the likelihood of observing z=R is
shown, conditioned on the true ratio of reference alleles in the
genotype, i.e., P(z=R|x):
[0406] P(z=R|x)=P(z=R|gc,x)P(gc)+P(z=R|bc,x)P(bc)
[0407] where gc is the event of a correct call and bc is the event
of a bad call.
[0408] P(gc) and P(bc) are calculated from the phred score.
P(z=R|gc,x)=x and P(z=R|bc,x)=1-x, assuming that probes are
unbiased.
[0409] Result, where b=P(wrong base call):
P(z=R|x)=x(1-b)+(1-x)*b
[0410] Note that the probability of a reference allele measurement
converges to the reference allele ratio as the phred score
improves, as expected.
[0411] Assuming that each sequence is generated independently,
conditioned on the true genotype, the likelihood of a set of
measurements at the same SNP is simply the product of the
individual likelihoods. This method accounts for varying phred
scores. In another embodiment, it is possible to account for
varying confidence in the sequence mapping. Given the set of n
sequences for a single SNP, the combination of likelihoods results
in a polynomial of order n that can be evaluated at the candidate
allele ratios that represent the various hypotheses.
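The per-read likelihood and its product over reads can be sketched as follows. This is a minimal illustration assuming each read is given as an (allele, phred score) pair with allele 'R' or 'M'; log-likelihoods are summed instead of multiplying probabilities, and the function names are hypothetical.

```python
import math


def p_read(allele, x, q):
    """Probability of observing the allele given reference ratio x and phred score q."""
    b = 10 ** (-q / 10)                      # P(wrong base call)
    p_ref = x * (1 - b) + (1 - x) * b        # P(z = R | x)
    p_mut = (1 - x) * (1 - b) + x * b        # P(z = M | x)
    return p_ref if allele == "R" else p_mut


def log_likelihood(reads, x):
    """Sum of per-read log-likelihoods, assuming independent reads."""
    return sum(math.log(p_read(allele, x, q)) for allele, q in reads)


def classify_snp(reads):
    """Pick the disomy genotype whose allele ratio maximizes the likelihood."""
    candidates = {"RR": 1.0, "RM": 0.5, "MM": 0.0}
    return max(candidates, key=lambda g: log_likelihood(reads, candidates[g]))
```

Ten reference reads at phred 30 are classified as RR, while five reference and five mutant reads are classified as RM.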
SNP Classification Using Phred Threshold
[0412] When a large number of sequences are available for a single
SNP, the polynomial likelihood function on the allele ratio becomes
intractable. An alternative is to consider only the base calls
which have high phred score, and then assume that they are
accurate. Each base read is now an i.i.d. Bernoulli trial according
to the true allele ratio, and the likelihood function is
approximately Gaussian. If r is
the ratio of reference reads in the data, the likelihood function
on x (the true reference allele ratio) has mean=r and standard
deviation=sqrt(r*(1-r)/n).
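The thresholded approximation can be sketched as follows. The phred cutoff of 30 is an arbitrary illustrative choice, not a value taken from this disclosure, and the function name is hypothetical.

```python
import math


def gaussian_allele_ratio(reads, min_phred=30):
    """Return (mean, standard deviation, n) of the Gaussian likelihood on x.

    reads is a list of (allele, phred) pairs with allele 'R' or 'M'.
    Reads below min_phred are discarded; the rest are treated as accurate.
    """
    kept = [allele for allele, q in reads if q >= min_phred]
    n = len(kept)
    if n == 0:
        raise ValueError("no reads pass the phred threshold")
    r = kept.count("R") / n
    return r, math.sqrt(r * (1 - r) / n), n
```

With 30 reference reads and 10 mutant reads above the threshold, plus 5 low-quality reads that are discarded, the mean is 0.75 and the standard deviation is sqrt(0.75*0.25/40).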
SNP Bias Correlation Across Samples
[0413] Using the two likelihood functions discussed above
(polynomial, Gaussian) a SNP can be classified as RR, RM, or MM by
considering the allele ratios {1, 0.5, 0}, or a maximum likelihood
estimate of the allele ratio can be calculated. When the same SNP
is classified as RM in two different samples, it is possible to
compare the MLE estimates of the allele ratio to look for
consistent "probe bias."
Using Sequence Length as a Prior to Determine the Origin of DNA
[0414] It has been reported that the length distributions of
maternal and fetal DNA sequences differ, with fetal generally
being shorter. In one embodiment of the present disclosure, it is
possible to use previous knowledge in the form of empirical data,
and construct prior distribution for expected length of both
mother(P(X|maternal)) and fetal DNA (P(X|fetal)). Given new
unidentified DNA sequence of length x, it is possible to assign a
probability that a given sequence of DNA is either maternal or
fetal DNA, based on prior likelihood of x given either maternal or
fetal. In particular, if P(x|maternal)>P(x|fetal), then the DNA
sequence can be classified as maternal, with
P(maternal|x)=P(x|maternal)/[P(x|maternal)+P(x|fetal)], and if
P(x|maternal)<P(x|fetal), then the DNA sequence can be
classified as fetal, with
P(fetal|x)=P(x|fetal)/[P(x|maternal)+P(x|fetal)].
[0415] In one embodiment of the present
disclosure, a distribution of maternal and fetal sequence lengths
can be determined that is specific for that sample by considering
the sequences that can be assigned as maternal or fetal with high
probability, and then that sample specific distribution can be used
as the expected size distribution for that sample.
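The length-based assignment can be sketched as follows. This sketch assumes the two empirical length distributions are supplied as dictionaries mapping fragment length to probability, weights the two origins equally as in the formula above, and leaves ties and unseen lengths unclassified. The function name and the example distributions in the usage note are hypothetical.

```python
def classify_fragment(length, p_len_maternal, p_len_fetal):
    """Return (label, probability) for a fragment of the given length.

    p_len_maternal and p_len_fetal map a length to P(x | maternal) and
    P(x | fetal). Returns (None, None) for ties or lengths seen in neither.
    """
    pm = p_len_maternal.get(length, 0.0)
    pf = p_len_fetal.get(length, 0.0)
    if pm == pf:
        return None, None
    if pm > pf:
        return "maternal", pm / (pm + pf)
    return "fetal", pf / (pm + pf)
```

For instance, if a given length has probability 0.6 under the maternal distribution and 0.3 under the fetal distribution, the fragment is classified as maternal with probability 0.6/0.9.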
[0416] Methods for Determining the Average Copy Number in a Set of
Target Cells
[0417] The methods described above assume that the DNA from the
target cell is from one target cell, or else from target cells
which are essentially genetically identical. There are
circumstances where this assumption may not hold, for example, in
the case of placental mosaicism, where the target is a fetus, and
the DNA from the fetus originates from a plurality of cells where
some of the placental cells are genetically distinct from other
placental cells. For example, in some cases where the fetus is
47,XX+18 or 47,XY+18, the placenta is mosaic--a mixture of 46,XX
and 47,XX+18 or 46,XY and 47,XY+18 respectively.
[0418] Another example involves detection of cancer through copy
number variants, where the target cells are from a tumor, and where
the non-target cells are non-cancerous cells from the host. The
hallmark of cancer is the instability of the genome, and in many if
not all cases, tumors are genetically heterogeneous. Even small
biopsies of tumor tissue show heterogeneity. The ways in which the
genome of the cancerous cells differ from the native host DNA are
considered mutations; some but not necessarily all of these
mutations may drive the oncogenic properties of the cancer. In the
case of a liquid biopsy, i.e. detection of tumor DNA from cell free
DNA (cfDNA) in the blood stream, the cell-free tumor DNA (ctDNA) is
believed to originate from apoptotic or necrotic cancer cells,
which are often heterogeneous, and are representative of some or
all of the cells of the tumor. There are a number of types of
mutations that are seen in cancers, including but not limited to
point mutations, also called single nucleotide variants (SNVs),
copy number variants (CNVs), hypomethylation, hypermethylation,
deletions, and duplications.
[0419] If one considers the normal disomic genome of the host to be
the baseline, then analysis of a mixture of normal and cancer cells
will yield the average difference between the baseline and the DNA
from the cells of origin of the ctDNA in the mixture. For example,
imagine a case where 10% of the DNA in the sample originated from
cells with a deletion over a region of a chromosome that is
targeted by the assay. A quantitative approach should show that the
quantity of reads corresponding to that region would be expected to
be 95% of what would be expected for a normal sample. This is
because one of the two target chromosomal regions in each of the
tumor cells with a deletion of the targeted region is missing,
and thus the total amount of DNA mapping to that region would be
90% (for the normal cells) plus 1/2.times.10% (for the tumor
cells)=95%. Alternately, an allelic approach should show that the
ratio of alleles at heterozygous loci averaged 19:20. Now imagine a
case where 10% of the DNA in the sample originated from cells
with a five-fold focal amplification of a region of a chromosome
that is targeted by the assay. A quantitative approach should show
that the quantity of reads corresponding to that region would be
expected to be 125% of what would be expected for a normal sample.
This is because one of the two target chromosomal regions in each
of the tumor cells with a five-fold focal amplification is copied
an extra five times over the targeted region, and thus the total
amount of DNA mapping to that region would be 90% (for the normal
cells) plus (2+5).times.10%/2 (for the tumor cells)=125%.
Alternately, an allelic approach should show that the ratio of
alleles at heterozygous loci averaged 25:20. Note that when using
an allelic approach alone, a focal amplification of five-fold over
a chromosomal region in a sample with 10% ctDNA may appear the same
as a deletion over the same region in a sample with 40% ctDNA; in
these two cases, the haplotype that is under-represented in the
case of the deletion would appear to be the haplotype without a CNV
in the case with the focal duplication, and the haplotype without a
CNV in the case of the deletion would appear to be the
over-represented haplotype in the case with the focal duplication.
Combining the likelihoods produced by this allelic approach with
likelihoods produced by a quantitative approach would differentiate
between the two possibilities.
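The quantitative (read depth) arithmetic in the two examples above can be reproduced with a one-line mixture formula. This sketch follows the weighting used in the text, in which the stated percentage is applied directly to the tumor copy number relative to a disomic baseline; the function name is hypothetical.

```python
def relative_read_depth(tumor_fraction, tumor_copies, normal_copies=2):
    """Expected read depth over a region, relative to a normal disomic sample."""
    return (1 - tumor_fraction) + tumor_fraction * tumor_copies / normal_copies


# Single-copy deletion in 10% of the sample: 90% + (1/2) x 10%
deletion_depth = relative_read_depth(0.10, 1)

# Five extra copies of one homolog in 10% of the sample: 90% + (2+5) x 10% / 2
amplification_depth = relative_read_depth(0.10, 2 + 5)
```

These evaluate to 0.95 and 1.25, matching the 95% and 125% figures in the text.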
[0420] In certain embodiments, provided herein are kits for
performing any of the methods for detecting aneuploidy provided
herein, that include at least one tube of at least one reagent for
performing such method and a computer readable medium or an access
code to an online computer program, to perform one or more of the
analytical techniques used in the method. For example, a kit in
certain embodiments, includes a tube of oligonucleotides for
amplifying a chromosome region of interest that includes a locus,
and an access code for unlocking online software for making an
initial copy number determination or for making a confirmatory copy
number determination. The kit can further include a tube with one
or more reagents for amplifying the locus. The components of the
kit can be contained in the same physical container (e.g. box) or
they can be arranged together on an Internet page.
Sample Preparation
Exemplary Sample Preparation Methods
[0421] In some embodiments, methods of the invention include
isolating or purifying the DNA and/or RNA. There are a number of
standard procedures known in the art to accomplish such an end. In
some embodiments, the sample may be centrifuged to separate various
layers. In some embodiments, the DNA or RNA may be isolated using
filtration. In some embodiments, the preparation of the DNA or RNA
may involve amplification, separation, purification by
chromatography, liquid-liquid separation, isolation, preferential
enrichment, preferential amplification, targeted amplification, or
any of a number of other techniques either known in the art or
described herein. In some embodiments for the isolation of DNA,
RNase is used to degrade RNA. In some embodiments for the isolation
of RNA, DNase (such as DNase I from Invitrogen, Carlsbad, Calif.,
USA) is used to degrade DNA. In some embodiments, an RNeasy mini
kit (Qiagen), is used to isolate RNA according to the
manufacturer's protocol. In some embodiments, small RNA molecules
are isolated using the mirVana PARIS kit (Ambion, Austin, Tex.,
USA) according to the manufacturer's protocol (Gu et al., J.
Neurochem. 122:641-649, 2012, which is hereby incorporated by
reference in its entirety). The concentration and purity of RNA may
optionally be determined using Nanovue (GE Healthcare, Piscataway,
N.J., USA), and RNA integrity may optionally be measured by use of
the 2100 Bioanalyzer (Agilent Technologies, Santa Clara, Calif.,
USA) (Gu et al., J. Neurochem. 122:641-649, 2012, which is hereby
incorporated by reference in its entirety). In some embodiments,
TRIZOL or RNAlater (Ambion) is used to stabilize RNA during
storage.
[0422] In some embodiments, universal tagged adaptors are added to
make a library from isolated nucleic acids. Prior to ligation,
sample DNA may be blunt ended, and then a single adenosine base is
added to the 3-prime end. Prior to ligation the DNA may be cleaved
using a restriction enzyme or some other cleavage method. During
ligation the 3-prime adenosine of the sample fragments and the
complementary 3-prime thymidine overhang of the adaptor can enhance
ligation efficiency. In some embodiments, adaptor ligation is
performed using the ligation kit found in the AGILENT SURESELECT
kit. In some embodiments, the library is amplified using universal
primers. In an embodiment, the amplified library is fractionated by
size separation or by using products such as AGENCOURT AMPURE beads
or other similar methods. In some embodiments, PCR amplification is
used to amplify target loci. In some embodiments, the amplified DNA
is sequenced (such as sequencing using an ILLUMINA IIGAX or HiSeq
sequencer). In some embodiments, the amplified DNA is sequenced
from each end of the amplified DNA to reduce sequencing errors. If
there is a sequence error in a particular base when sequencing from
one end of the amplified DNA, there is less likely to be a sequence
error in the complementary base when sequencing from the other side
of the amplified DNA (compared to sequencing multiple times from
the same end of the amplified DNA).
[0423] In some embodiments, whole genome amplification (WGA) is used
to amplify a nucleic acid sample. There are a number of methods
available for WGA: ligation-mediated PCR (LM-PCR), degenerate
oligonucleotide primer PCR (DOP-PCR), and multiple displacement
amplification (MDA). In LM-PCR, short DNA sequences called adapters
are ligated to blunt ends of DNA. These adapters contain universal
amplification sequences, which are used to amplify the DNA by PCR.
In DOP-PCR, random primers that also contain universal
amplification sequences are used in a first round of annealing and
PCR. Then, a second round of PCR is used to amplify the sequences
further with the universal primer sequences. MDA uses the phi-29
polymerase, which is a highly processive and non-specific enzyme
that replicates DNA and has been used for single-cell analysis. In
some embodiments, WGA is not performed.
[0424] In some embodiments, selective amplification or enrichment
is used to amplify or enrich target loci. In some embodiments, the
amplification and/or selective enrichment technique may involve PCR
such as ligation mediated PCR, fragment capture by hybridization,
Molecular Inversion Probes, or other circularizing probes. In some
embodiments, real-time quantitative PCR (RT-qPCR), digital PCR,
emulsion PCR, or a single allele base extension reaction followed by
mass spectrometry is used (Hung et al., J Clin Pathol 62:308-313,
2009, which is hereby incorporated by reference in its entirety).
In some embodiments, capture by hybridization with hybrid capture
probes is used to preferentially enrich the DNA. In some
embodiments, methods for amplification or selective enrichment may
involve using probes where, upon correct hybridization to the
target sequence, the 3-prime end or 5-prime end of a nucleotide
probe is separated from the polymorphic site of a polymorphic
allele by a small number of nucleotides. This separation reduces
preferential amplification of one allele, termed allele bias. This
is an improvement over methods that involve using probes where the
3-prime end or 5-prime end of a correctly hybridized probe is
directly adjacent to or very near to the polymorphic site of an
allele. In an embodiment, probes in which the hybridizing region
may contain, or certainly contains, a polymorphic site are excluded.
Polymorphic sites at the site of hybridization can cause unequal
hybridization or inhibit hybridization altogether in some alleles,
resulting in preferential amplification of certain alleles. These
embodiments are improvements over other methods that involve
targeted amplification and/or selective enrichment in that they
better preserve the original allele frequencies of the sample at
each polymorphic locus, whether the sample is a pure genomic sample
from a single individual or a mixture of individuals.
[0425] In some embodiments, PCR (referred to as mini-PCR) is used
to generate very short amplicons (U.S. application Ser. No.
13/683,604, filed Nov. 21, 2012, U.S. Publication No. 2013/0123120,
U.S. application Ser. No. 13/300,235, filed Nov. 18, 2011, U.S.
Publication No 2012/0270212, filed Nov. 18, 2011, and U.S. Ser. No.
61/994,791, filed May 16, 2014, which are each hereby incorporated
by reference in its entirety). cfDNA (such as fetal cfDNA in
maternal serum or necroptically- or apoptotically-released cancer
cfDNA) is highly fragmented. For fetal cfDNA, the fragment sizes
are distributed in approximately a Gaussian fashion with a mean of
160 bp, a standard deviation of 15 bp, a minimum size of about 100
bp, and a maximum size of about 220 bp. The polymorphic site of one
particular target locus may occupy any position from the start to
the end among the various fragments originating from that locus.
Because cfDNA fragments are short, the likelihood that a fragment of
length L comprises both the forward and reverse primer sites is
determined by the ratio of the length of the amplicon to the length
of the fragment. Under
ideal conditions, assays in which the amplicon is 45, 50, 55, 60,
65, or 70 bp will successfully amplify from 72%, 69%, 66%, 63%,
59%, or 56%, respectively, of available template fragment
molecules. In certain embodiments that relate most preferably to
cfDNA from samples of individuals suspected of having cancer, the
cfDNA is amplified using primers that yield a maximum amplicon
length of 85, 80, 75 or 70 bp, and in certain preferred embodiments
75 bp, and that have a melting temperature between 50 and
65 °C, and in certain preferred embodiments, between
54 and 60.5 °C. The amplicon length is the distance between the
5-prime ends of the forward and reverse priming sites. Amplicon
length that is shorter than typically used by those known in the
art may result in more efficient measurements of the desired
polymorphic loci by only requiring short sequence reads. In an
embodiment, a substantial fraction of the amplicons are less than
100 bp, less than 90 bp, less than 80 bp, less than 70 bp, less
than 65 bp, less than 60 bp, less than 55 bp, less than 50 bp, or
less than 45 bp.
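The percentages listed in the preceding paragraph are consistent with a simple geometric model, sketched below in Python for illustration only. The sketch assumes (this is not stated in the text) a single fragment length of 160 bp, the mean given above, and that a fragment of length L amplifies only if it contains the whole amplicon of length A, which gives a fraction of 1 - A/L. The function names are hypothetical.

```python
def amplifiable_fraction(amplicon_bp: int, fragment_bp: int = 160) -> float:
    """Fraction of fragments spanning the whole amplicon: 1 - A/L (assumed model)."""
    if amplicon_bp >= fragment_bp:
        return 0.0  # the fragment is too short to contain both primer sites
    return 1.0 - amplicon_bp / fragment_bp


def percent_half_up(fraction: float) -> int:
    """Convert a fraction to a whole percent, rounding halves up."""
    return int(fraction * 100 + 0.5)


if __name__ == "__main__":
    # Amplicon lengths taken from the paragraph above.
    for amplicon in (45, 50, 55, 60, 65, 70):
        print(amplicon, "bp:", percent_half_up(amplifiable_fraction(amplicon)), "%")
```

Under these assumptions the sketch reproduces the 72%, 69%, 66%, 63%, 59%, and 56% figures; a fuller treatment would average over the fragment size distribution rather than use the mean alone.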
[0426] In some embodiments, amplification is performed using direct
multiplexed PCR, sequential PCR, nested PCR, doubly nested PCR,
one-and-a-half sided nested PCR, fully nested PCR, one sided fully
nested PCR, one-sided nested PCR, hemi-nested PCR,
triply hemi-nested PCR, semi-nested PCR, one sided semi-nested PCR,
reverse semi-nested PCR method, or one-sided PCR, which are
described in U.S. application Ser. No. 13/683,604, filed Nov. 21,
2012, U.S. Publication No. 2013/0123120, U.S. Application Ser. No.
13/300,235, filed Nov. 18, 2011, U.S. Publication No 2012/0270212,
and U.S. Ser. No. 61/994,791, filed May 16, 2014, which are hereby
incorporated by reference in their entirety. If desired, any of
these methods can be used for mini-PCR.
[0427] If desired, the extension step of the PCR amplification may
be limited from a time standpoint to reduce amplification from
fragments longer than 200 nucleotides, 300 nucleotides, 400
nucleotides, 500 nucleotides or 1,000 nucleotides. This may result
in the enrichment of fragmented or shorter DNA (such as fetal DNA
or DNA from cancer cells that have undergone apoptosis or necrosis)
and improvement of test performance.
[0428] In some embodiments, multiplex PCR is used. In some
embodiments, the method of amplifying target loci in a nucleic acid
sample involves (i) contacting the nucleic acid sample with a
library of primers that simultaneously hybridize to at least 100; 200;
500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000;
30,000; 40,000; 50,000; 75,000; or 100,000 different target loci to
produce a reaction mixture; and (ii) subjecting the reaction
mixture to primer extension reaction conditions (such as PCR
conditions) to produce amplified products that include target
amplicons. In some embodiments, at least 50, 60, 70, 80, 90, 95,
96, 97, 98, 99, or 99.5% of the targeted loci are amplified. In
various embodiments, less than 60, 50, 40, 30, 20, 10, 5, 4, 3, 2,
1, 0.5, 0.25, 0.1, or 0.05% of the amplified products are primer
dimers. In some embodiments, the primers are in solution (such as
being dissolved in the liquid phase rather than in a solid phase).
In some embodiments, the primers are in solution and are not
immobilized on a solid support. In some embodiments, the primers
are not part of a microarray. In some embodiments, the primers do
not include molecular inversion probes (MIPs).
[0429] In some embodiments, two or more (such as 3 or 4) target
amplicons (such as amplicons from the miniPCR method disclosed
herein) are ligated together and then the ligated products are
sequenced. Combining multiple amplicons into a single ligation
product increases the efficiency of the subsequent sequencing step.
In some embodiments, the target amplicons are less than 150, 100,
90, 75, or 50 base pairs in length before they are ligated. The
selective enrichment and/or amplification may involve tagging each
individual molecule with different tags, molecular barcodes, tags
for amplification, and/or tags for sequencing. In some embodiments,
the amplified products are analyzed by sequencing (such as by high
throughput sequencing) or by hybridization to an array, such as a
SNP array, the ILLUMINA INFINIUM array, or the AFFYMETRIX gene
chip. In some embodiments, nanopore sequencing is used, such as the
nanopore sequencing technology developed by Genia (see, for
example, the world wide web at geniachip.com/technology, which is
hereby incorporated by reference in its entirety). In some
embodiments, duplex sequencing is used (Schmitt et al., "Detection
of ultra-rare mutations by next-generation sequencing," Proc Natl
Acad Sci USA. 109(36): 14508-14513, 2012, which is hereby
incorporated by reference in its entirety). This approach greatly
reduces errors by independently tagging and sequencing each of the
two strands of a DNA duplex. As the two strands are complementary,
true mutations are found at the same position in both strands. In
contrast, PCR or sequencing errors result in mutations in only one
strand and can thus be discounted as technical error. In some
embodiments, the method entails tagging both strands of duplex DNA
with a random, yet complementary double-stranded nucleotide
sequence, referred to as a Duplex Tag. Double-stranded tag
sequences are incorporated into standard sequencing adapters by
first introducing a single-stranded randomized nucleotide sequence
into one adapter strand and then extending the opposite strand with
a DNA polymerase to yield a complementary, double-stranded tag.
Following ligation of tagged adapters to sheared DNA, the
individually labeled strands are PCR amplified from asymmetric
primer sites on the adapter tails and subjected to paired-end
sequencing. In some embodiments, a sample (such as a DNA or RNA
sample) is divided into multiple fractions, such as different wells
(e.g., wells of a WaferGen SmartChip). Dividing the sample into
different fractions (such as at least 5, 10, 20, 50, 75, 100, 150,
200, or 300 fractions) can increase the sensitivity of the analysis
since the percent of molecules with a mutation is higher in some
of the wells than in the overall sample. In some embodiments, each
fraction has less than 500, 400, 200, 100, 50, 20, 10, 5, 2, or 1
DNA or RNA molecules. In some embodiments, the molecules in each
fraction are sequenced separately. In some embodiments, the same
barcode (such as a random or non-human sequence) is added to all
the molecules in the same fraction (such as by amplification with a
primer containing the barcode or by ligation of a barcode), and
different barcodes are added to molecules in different fractions.
The barcoded molecules can be pooled and sequenced together. In
some embodiments, the molecules are amplified before they are
pooled and sequenced, such as by using nested PCR. In some
embodiments, one forward primer and two reverse primers, or two
forward primers and one reverse primer, are used.
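The sensitivity gain from dividing a sample into fractions can be illustrated with a small calculation. The Python sketch below is a simplified illustration, not part of the disclosed method: it assumes the molecules are divided evenly among the wells, and the function names and example numbers are hypothetical.

```python
from fractions import Fraction


def overall_mutant_fraction(total_molecules: int, mutant_molecules: int) -> Fraction:
    """Mutant fraction in the undivided sample."""
    return Fraction(mutant_molecules, total_molecules)


def min_fraction_in_positive_well(total_molecules: int, wells: int) -> Fraction:
    """Smallest mutant fraction in any well that holds at least one mutant molecule.

    Assumes an even split, so each well holds total_molecules // wells molecules.
    """
    if wells < 1 or total_molecules % wells != 0:
        raise ValueError("this sketch assumes an even split across wells")
    per_well = total_molecules // wells
    return Fraction(1, per_well)


if __name__ == "__main__":
    # Hypothetical example: 1 mutant among 1,000 molecules, divided into 100 wells.
    print(overall_mutant_fraction(1000, 1))          # 1/1000
    print(min_fraction_in_positive_well(1000, 100))  # 1/10
```

In this hypothetical example the mutant molecule makes up 0.1% of the undivided sample but at least 10% of the well that contains it.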
[0430] The use of a method to target certain alleles followed by
sequencing as part of a method for allele calling or ploidy calling
may confer a number of unexpected advantages. Some methods by which
DNA may be targeted, or selectively enriched, include using
circularizing probes, linked inverted probes (LIPs), capture by
hybridization methods such as SURE SELECT, and targeted PCR
amplification strategies.
[0431] Some embodiments of the present disclosure involve the use
of "Linked Inverted Probes" (LIPs), which have been previously
described in the literature. LIPs is a generic term meant to
encompass technologies that involve the creation of a circular
molecule of DNA, where the probes are designed to hybridize to
targeted region of DNA on either side of a targeted allele, such
that addition of appropriate polymerases and/or ligases, and the
appropriate conditions, buffers and other reagents, will complete
the complementary, inverted region of DNA across the targeted
allele to create a circular loop of DNA that captures the
information found in the targeted allele. LIPs may also be called
pre-circularized probes, pre-circularizing probes, or the
circularizing probes. The LIPs probe may be a linear DNA molecule
between 50 and 500 nucleotides in length, and in a preferred
embodiment between 70 and 100 nucleotides in length; in some
embodiments, it may be longer or shorter than described herein.
Other embodiments of the present disclosure involve different
incarnations of the LIPs technology, such as Padlock Probes and
Molecular Inversion Probes (MIPs).
[0432] There are many methods that may be used to measure the
genetic data of the individual and/or the related individuals in
the aforementioned contexts. The different methods comprise a
number of steps, those steps often involving amplification of
genetic material, addition of oligonucleotide probes, ligation of
specified DNA strands, isolation of sets of desired DNA, removal of
unwanted components of a reaction, detection of certain sequences
of DNA by hybridization, and detection of the sequence of one or a
plurality of strands of DNA by DNA sequencing methods. In some
cases, the DNA strands may refer to target genetic material, in
some cases they may refer to primers, in some cases they may refer
to synthesized sequences, or combinations thereof. These steps may
be carried out in a number of different orders. Given the highly
variable nature of molecular biology, it is generally not obvious
which methods, and which combinations of steps, will perform
poorly, well, or best in various situations.
[0433] Note that in theory it is possible to target any number of
loci in the genome, anywhere from one locus to well over one million
loci. If a sample of DNA is subjected to targeting, and then
sequenced, the percentage of the alleles that are read by the
sequencer will be enriched with respect to their natural abundance
in the sample. The degree of enrichment can be anywhere from one
percent (or even less) to ten fold, hundred fold, thousand fold or
even many million fold. In the human genome there are roughly 3
billion base pairs, containing approximately 75 million polymorphic
loci. The more loci that are targeted, the smaller the degree of
enrichment that is possible. The fewer the number of loci that are
targeted, the greater the degree of enrichment that is possible,
and the greater the depth of read that may be achieved at those
loci for a given number of sequence reads.
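The trade-off between the number of targeted loci, the achievable enrichment, and the depth of read can be sketched numerically. The Python example below is a rough illustration under stated assumptions: the genome size is the 3 billion base pairs quoted above, the upper bound on enrichment assumes every read falls on a targeted base, and the function names, the default of one base per locus, and the read counts are hypothetical.

```python
GENOME_BP = 3_000_000_000  # "roughly 3 billion base pairs", as stated in the text


def max_fold_enrichment(n_loci: int, bases_per_locus: int = 1) -> float:
    """Upper bound on enrichment, reached only if every read is on target.

    Targeted bases make up n_loci * bases_per_locus / GENOME_BP of the genome,
    so their share of the reads can rise by at most the inverse of that fraction.
    """
    return GENOME_BP / (n_loci * bases_per_locus)


def mean_depth(total_reads: int, n_loci: int, on_target_fraction: float = 1.0) -> float:
    """Average number of reads per targeted locus for a fixed read budget."""
    return total_reads * on_target_fraction / n_loci


if __name__ == "__main__":
    reads = 10_000_000  # hypothetical sequencing budget
    for loci in (1_000, 100_000):
        print(loci, "loci:", max_fold_enrichment(loci), "fold max,",
              mean_depth(reads, loci), "reads per locus")
```

With the same hypothetical budget, targeting 100 times as many loci lowers both the enrichment ceiling and the mean depth per locus by a factor of 100.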
[0434] In one embodiment of the present disclosure, the targeting
may focus entirely on SNPs. A number of commercial targeting
products are available to enrich exons. Targeting exclusively loci
that include SNPs is particularly advantageous when using a method
for NPD that relies on allele distributions. In one embodiment of
the present disclosure, it is possible to use a targeting method
that focuses on SNPs to enrich a genetic sample in polymorphic
regions of the genome. In one embodiment, it is possible to focus
on a small number of SNPs, for example between 1 and 100 SNPs, or a
larger number, for example, between 100 and 1,000, between 1,000
and 10,000, between 10,000 and 100,000 or more than 100,000 SNPs.
In one embodiment, it is possible to focus on one or a small number
of chromosomes that are correlated with live trisomic births, for
example chromosomes 13, 18, 21, X and Y, or some combination
thereof. In one embodiment, it is possible to enrich the targeted
SNPs by a small factor, for example between 1.01 fold and 100 fold,
or by a larger factor, for example between 100 fold and 1,000,000
fold. In one embodiment of the present disclosure, it is possible
to use a targeting method to create a sample of DNA that is
preferentially enriched in polymorphic regions of the genome. In
one embodiment, it is possible to use the method to create a sample
of DNA that is preferentially enriched in a small number of SNPs,
for example between 1 and 100 SNPs, or a larger number of SNPs, for
example, between 100 and 50,000 SNPs. In one embodiment, it is
possible to use the method to create a DNA sample that is enriched
in SNPs located on one or a small number of chromosomes that are
correlated with live trisomic births, for example chromosomes 13,
18, 21, X and Y, or some combination thereof. In one embodiment, it
is possible to use the method to create a sample of DNA that is
enriched in targeted
SNPs by a small factor, for example between 1.01 fold and 100 fold,
or by a larger factor, for example between 100 fold and 1,000,000
fold. In one embodiment, it is possible to use this method to
create a mixture of DNA with any of these characteristics where the
mixture of DNA contains maternal DNA and also free floating fetal
DNA. In one embodiment, it is possible to use this method to create
a mixture of DNA that has any combination of these factors. For
example, a mixture of DNA that contains maternal DNA and fetal DNA,
and that is preferentially enriched in 200 SNPs, all of which are
located on either chromosome 18 or 21, and which are enriched an
average of 1000 fold. In another example, it is possible to use the
method to create a mixture of DNA that is preferentially enriched
in 50,000 SNPs that are all located on chromosomes 13, 18, 21, X
and Y, and the average enrichment per locus is 200 fold. Any of the
targeting methods described herein can be used to create mixtures
of DNA that are preferentially enriched in certain loci.
[0435] In some embodiments, the method may further comprise
measuring the DNA contained in the mixed fraction using a DNA
sequencer, and the DNA contained in the mixed fraction contains a
disproportionate number of sequences from one or more chromosomes,
wherein the one or more chromosomes are selected from the group
consisting of chromosome 13, chromosome 18, chromosome 21,
chromosome X, chromosome Y and combinations thereof.
[0436] In one embodiment, once a mixture has been preferentially
enriched at the set of target loci, it may be sequenced using any
one of the previous, current, or next generation of sequencing
instruments that sequences a clonal sample (a sample generated from
a single molecule; examples include ILLUMINA GAIIx, ILLUMINA HISEQ
or MiSEQ, LIFE TECHNOLOGIES SOLiD, 5500XL, or Ion Torrent PGM or
Proton). The ratios can be evaluated by sequencing through the
specific alleles within the targeted region. These sequencing reads
can be analyzed and counted according to the allele type and the
ratios of different alleles determined accordingly. For variations
that are one to a few bases in length, detection of the alleles
will be performed by sequencing and it is essential that the
sequencing read span the allele in question in order to evaluate
the allelic composition of that captured molecule. The total number
of captured molecules assayed for the genotype can be increased by
increasing the length of the sequencing read. Full sequencing of
all molecules would guarantee collection of the maximum amount of
data available in the enriched pool. However, sequencing is
currently expensive, and a method that can measure a certain number
of allele ratios using a lower number of sequence reads will have
great value. In addition, there are technical limitations to the
maximum possible length of read as well as accuracy limitations as
read lengths increase. The alleles of greatest utility will be of
one to a few bases in length, but theoretically any allele shorter
than the length of the sequencing read can be used. While allele
variations come in all types, the examples provided herein focus on
SNPs or variants comprised of just a few neighboring base pairs.
Larger variants such as segmental copy number variants can be
detected by aggregations of these smaller variations in many cases
as whole collections of SNPs internal to the segment are duplicated.
Variants larger than a few bases, such as STRs, require special
consideration and some targeting approaches work while others will
not. The evaluation of the allelic ratios is herein determined
[0437] There are multiple targeting approaches that can be used to
specifically isolate and enrich one or a plurality of variant
positions in the genome. Typically, these rely on taking advantage
of invariant sequence flanking the variant sequence. There is prior
art related to targeting in the context of sequencing where the
substrate is maternal plasma (see, e.g., Liao et al., Clin. Chem.;
57(1): pp. 92-101). However, these approaches all use targeting
probes that target exons, and do not focus on targeting polymorphic
regions of the genome. In one embodiment of the present disclosure,
the method involves using targeting probes that focus exclusively
or almost exclusively on polymorphic regions. In one embodiment of
the present disclosure, the method involves using targeting probes
that focus exclusively or almost exclusively on SNPs. When
polymorphic targeted DNA mixtures are sequenced and analyzed using
an algorithm that determines ploidy using allele ratios, this
targeting method is able to provide far more accurate ploidy
determinations for a given number of sequence reads. In some
embodiments of the present disclosure, the targeted polymorphic
regions consist of at least 10% SNPs, at least 20% SNPs, at least
30% SNPs, at least 40% SNPs, at least 50% SNPs, at least 60% SNPs,
at least 70% SNPs, at least 80% SNPs, at least 90% SNPs, at least
95% SNPs, at least 98% SNPs, at least 99% SNPs, at least 99.9%
SNPs, or exclusively SNPs.
[0438] Targeted Sequencing Using PCR Approaches
[0439] In some embodiments, PCR can be used to target specific
locations of the genome. In plasma samples, the original DNA is
highly fragmented (approximately 100-200 bp, with a peak at 150
bp). In PCR, both forward and reverse primers must anneal to the
same fragment to enable amplification. Therefore, if the fragments
are short, the PCR assays must amplify relatively short regions as
well. As with MIPs, if the polymorphic positions are too close to
the polymerase binding site, this could result in biases in the
amplification from
different alleles. Currently, PCR primers that target polymorphic
regions, such as SNPs, are typically designed such that the 3' end
of the primer will hybridize to the base immediately adjacent to
the polymorphic base or bases. In one embodiment of the present
disclosure, the 3' ends of both the forward and reverse PCR primers
are designed to hybridize to bases that are one or a few positions
away from the variant positions (polymorphic regions) of the
targeted allele. The number of bases between the polymorphic region
(SNP or otherwise) and the base to which the 3' end of the primer
is designed to hybridize may be one base, it may be two bases, it
may be three bases, it may be four bases, it may be five bases, it
may be six bases, it may be seven to ten bases, it may be eleven to
fifteen bases, or it may be sixteen to twenty bases. The forward
and reverse primers may be designed to hybridize a different number
of bases away from the polymorphic region.
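The primer geometry described above can be made concrete with a short coordinate calculation. The Python sketch below is only an illustration: it assumes 0-based positions on the forward strand, a single-base polymorphic site, and a fixed primer length of 20 bases, none of which is specified in the text, and all names are hypothetical. The amplicon length follows the definition given earlier, the distance between the 5-prime ends of the forward and reverse priming sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimerPair:
    forward_start: int  # 5-prime end of the forward priming site (inclusive)
    forward_end: int    # 3-prime end of the forward priming site (inclusive)
    reverse_start: int  # 3-prime end of the reverse priming site (inclusive)
    reverse_end: int    # 5-prime end of the reverse priming site (inclusive)

    @property
    def amplicon_length(self) -> int:
        """Number of bases from one 5-prime end to the other, inclusive."""
        return self.reverse_end - self.forward_start + 1


def place_primers(snp_pos: int, forward_gap: int, reverse_gap: int,
                  primer_len: int = 20) -> PrimerPair:
    """Place primers so that each 3-prime end sits `gap` bases away from the SNP."""
    if forward_gap < 1 or reverse_gap < 1:
        raise ValueError("each gap must leave at least one base before the SNP")
    forward_end = snp_pos - forward_gap - 1    # `forward_gap` bases lie in between
    reverse_start = snp_pos + reverse_gap + 1  # `reverse_gap` bases lie in between
    return PrimerPair(forward_end - primer_len + 1, forward_end,
                      reverse_start, reverse_start + primer_len - 1)


if __name__ == "__main__":
    pair = place_primers(snp_pos=100, forward_gap=2, reverse_gap=3)
    print(pair, pair.amplicon_length)
```

The two gaps are independent parameters because, as noted above, the forward and reverse primers may sit a different number of bases away from the polymorphic region.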
[0440] PCR assays can be generated in large numbers; however, the
interactions between different PCR assays make it difficult to
multiplex them beyond about one hundred assays. Various complex
molecular approaches can be used to increase the level of
multiplexing, but it may still be limited to fewer than 1000 assays
per reaction. Samples with large quantities of DNA can be split
among multiple sub-reactions and then recombined before sequencing.
For samples where either the overall sample or some subpopulation
of DNA molecules is limited, splitting the sample would introduce
statistical noise. In one embodiment, a small or limited quantity
of DNA may refer to an amount below 10 pg, between 10 and 100 pg,
between 100 pg and 1 ng, between 1 and 10 ng, or between 10 and 100
ng. Note that while this method is particularly useful on small
amounts of DNA where other methods that involve splitting into
multiple pools can cause significant problems related to introduced
stochastic noise, this method still provides the benefit of
minimizing bias when it is run on samples of any quantity of DNA.
In these situations, a pre-amplification step may be used to
increase the overall sample quantity. However, this
pre-amplification step should not appreciably alter the allelic
ratios.
[0441] In one embodiment, the method can generate hundreds to
thousands of PCR products (can be 10,000 and more), e.g. for
genotyping by sequencing or some other genotyping method, from
limited samples such as single cells or DNA from body fluids.
Currently, performing multiplex PCR reactions of more than 5 to 10
targets presents a major challenge and is often hindered by primer
side products, such as primer dimers, and other artifacts. In next
generation sequencing the vast majority of the sequencing reads
would sequence such artifacts and not the desired target sequences
in a sample. In general, to perform targeted sequencing of multiple
(n) targets of a sample (greater than 10, 50 or 1000's), one can
split the sample into n parallel reactions that amplify one
individual target, which is problematic for samples with a limited
amount of DNA. This has been performed in PCR multiwell plates or
can be done in commercial platforms such as the Fluidigm Access
Array (48 reactions per sample in microfluidic chips) or droplet
PCR by RainDance Technologies (hundreds to a few thousand
targets). Described here is a method to effectively perform many
PCR amplifications that is applicable to cases where only a limited
amount of DNA is available. In one embodiment, the method may be
applied for analysis of single cells, body fluids, biopsies,
environmental and/or forensic samples.
Solution:
[0442] A) Generate and amplify a library with adaptor sequences on
both ends of DNA fragments. Divide into multiple reactions after
library amplification.
[0443] B) Generate (and possibly amplify) a library with adaptor
sequences on both ends of DNA fragments. Perform 1000-plex
amplification of selected targets using one target specific
"Forward" primer per target and one tag specific primer. One can
perform a second amplification from this product using "Reverse"
target specific primers and one (or more) primer specific to a
universal tag that was introduced as part of the target specific
forward primers in the first round.
[0444] C) Perform a 1000-plex preamplification of selected targets
for a limited number of cycles. Divide the product into multiple
aliquots and amplify subpools of targets in individual reactions
(for example, 50 to 500-plex, though this can be used all the way
down to singleplex). Pool products of parallel subpool
reactions.
[0445] D) During these amplifications primers may carry sequencing
compatible tags (partial or full length) such that the products can
easily be sequenced.
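The division of a large target set into subpools in option C can be sketched as a simple partitioning step. The Python example below is an illustration only: the target names are placeholders, the subpool size of 100 is one hypothetical choice within the 50 to 500-plex range given above, and the sketch does not model which primers are compatible with one another.

```python
def split_into_subpools(targets: list[str], subpool_size: int) -> list[list[str]]:
    """Partition targets into consecutive subpools of at most `subpool_size`."""
    if subpool_size < 1:
        raise ValueError("subpool_size must be at least 1")
    return [targets[i:i + subpool_size]
            for i in range(0, len(targets), subpool_size)]


if __name__ == "__main__":
    # Placeholder names standing in for a 1000-plex preamplification panel.
    targets = [f"target_{i}" for i in range(1000)]
    subpools = split_into_subpools(targets, 100)
    print(len(subpools), "subpools of", len(subpools[0]), "targets each")
```

Setting `subpool_size` to 1 corresponds to the singleplex case mentioned in option C.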
[0446] There is significant diagnostic value in accurately
determining the relative proportion of alleles present in a sample.
The interpretation of the result depends on the source of the
material. In some embodiments of the present disclosure, the
allelic ratio information can be used to determine the genetic
state of an individual. In some embodiments of the present
disclosure, this information can be used to determine the genetic
state of a plurality of individuals from one DNA sample, wherein
the DNA sample contains DNA from each of the plurality of
individuals. In one embodiment, the allelic ratio information can
be used to determine copy number of whole chromosomes from
individual cells, or bulk samples. In one embodiment, the allelic
ratio information can be used to determine copy number of parts,
regions, or segments of chromosomes from individual cells, or bulk
samples. In one embodiment, the allelic ratio information can be
used to determine the relative contribution of different cell types
in mosaic samples. In one embodiment, the allelic ratio information
can be used to determine the fraction of fetal DNA in maternal
plasma samples as well as the chromosome copy number of the fetal
chromosomes.
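The allelic ratio referred to above is, at its simplest, a count of sequencing reads by allele at a polymorphic site. The Python sketch below shows that counting step only, assuming the base observed at the site has already been extracted from each read. The function name and the example bases are hypothetical, and the sketch does not attempt the ploidy or fetal fraction determinations themselves.

```python
from collections import Counter
from typing import Iterable


def allele_fraction(observed_bases: Iterable[str], ref_allele: str,
                    alt_allele: str) -> float:
    """Fraction of informative reads carrying the reference allele.

    Reads showing any other base are treated as errors and ignored.
    """
    counts = Counter(observed_bases)
    ref, alt = counts[ref_allele], counts[alt_allele]
    if ref + alt == 0:
        raise ValueError("no reads carry either allele")
    return ref / (ref + alt)


if __name__ == "__main__":
    # Hypothetical bases observed at one SNP across four reads.
    print(allele_fraction(["A", "A", "G", "A"], ref_allele="A", alt_allele="G"))
```

How such a fraction is interpreted depends, as the paragraph notes, on the source of the material.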
Generation of Targeted Sequencing Libraries by PCR of Greater Than
100 Targets
[0447] Described herein is a method for amplifying a region of a
chromosome of interest that includes a locus of interest by first
globally amplifying the plasma DNA of a sample and then dividing the
sample up into multiple multiplexed target enrichment reactions
with multiple target sequences per reaction. In one embodiment, the
method can be used for preferentially enriching a DNA mixture at a
plurality of loci, the method comprising generating and amplifying
a library from a mixture of DNA where the molecules in the library
have adaptor sequences ligated on both ends of the DNA fragments,
dividing the amplified library into multiple reactions, performing
a first round of multiplex amplification of selected targets using
one target specific "forward" primer per target and one or a
plurality of adaptor specific universal "reverse" primers. In one
embodiment, the method may further comprise performing a second
amplification using "reverse" target specific primers and one or a
plurality of primers specific to a universal tag that was
introduced as part of the target specific forward primers in the
first round. In one embodiment, the method may be used for
preferentially enriching a DNA mixture at a plurality of loci, the
method comprising performing a multiplex preamplification of
selected targets for a limited number of cycles, dividing the
product into multiple aliquots and amplifying subpools of targets
in individual reactions, and pooling products of parallel subpool
reactions. In one embodiment, the primers carry partial or full
length sequencing compatible tags.
Workflow:
[0448] 1. Extract plasma DNA
[0449] 2. Prepare fragment library with universal adaptors on both
ends of fragments.
[0450] 3. Amplify library using universal primers specific to the
adaptors.
[0451] 4. Divide the amplified sample "library" into multiple
aliquots. Perform multiplex (e.g. 100-plex, or 1000-plex with one
target specific primer per target and a tag-specific primer)
amplifications on aliquots.
[0452] 5. Pool aliquots of one sample.
[0453] 6. Barcode sample if not already done.
[0454] 7. Mix samples, adjust concentration.
[0455] 8. Perform sequencing.
[0456] The workflow may contain multiple sub-steps that comprise
one of the listed steps (e.g. step 2. Library preparation may
comprise 3 enzymatic steps (blunt ending, dA tailing and adaptor
ligation) and 3 purification steps).
[0457] Steps of the workflow may be combined, divided up or
performed in different order (e.g. bar coding and pooling of
samples).
[0458] It is important to note that the amplification of a library
can be performed in such a way that it is biased to amplify short
fragments more efficiently. In this manner it is possible to
preferentially amplify shorter sequences, e.g. mono-nucleosomal DNA
fragments such as the cell-free fetal DNA (of placental origin) found in
the circulation of pregnant women.
PCR assays:
[0459] Can have the tags for sequencing (usually a truncated form
of 15-25 bases). After multiplexing, PCR multiplexes of a sample
are pooled and then the tags are completed (including bar coding)
by a tag-specific PCR (could also be done by ligation).
[0460] The full sequencing tags can be added in the same reaction
as the multiplexing. In the first cycles targets are amplified with
the target specific primers, subsequently the tag-specific primers
take over to complete the SQ-adaptor sequence.
[0461] The PCR primers carry no tags. After m.p. PCR the sequencing
tags are appended to the amplification products by ligation.
Sequencing results:
[0462] The 12 samples were pooled at equal volumes
[0463] Pool cleaned into 100 ul Elution buffer
[0464] Pool diluted to 30 nM (was 75 nM)
[0465] Sent for sequencing
[0466] QC by qPCR
preparation of 15 cy replicates (Orange: 8 Replicates with Barcodes
5 to 12)
[0467] 15 cycles STA
[0468] (RED STA protocol: 95 C × 10 min; 95 C × 15 s, 65 C × 1 min,
60 C × 4 min, 65 C × 30 s, 72 C × 30 s; 72 C × 2 min)
[0469] Used the 50 nM primers reactions
[0470] Performed a first ExoSAP straight from product → failed to
remove all primers (Bioanalyzer): just leave this step out in the
future.
[0471] Dilute 1/10 (adding 90 ul H2O)
[0472] 2 ul in 14 ul ExoSAP reaction → dilute to 50 ul = 1/25
dilution in this step = total 1/250
[0473] Append SQ tags (longer, full F-SQ and R-m.p. adaptor without
barcodes):
[0474] 1 ul DNA in 10 ul PCR: F-SQ × R-SQ-m.p.; concentrations:
200 nM?
[0475] 15 cycles: 95 C × 10 min; 95 C × 15 s, 60 C × 30 s,
65 C × 15 s, 72 C × 30 s; 72 C × 2 min
[0476] Add 90 ul H2O, use 1 ul for next step, primer carry over will
be 1/100 of conc in this reaction
[0477] Barcoding PCR (p.9 quick book):
[0478] 1 ul DNA in 10 ul PCR: F-SQ × R-SQ-BC1 to 12-lib.;
concentrations: 1 uM
[0479] 15 cycles: 95 C × 10 min; 95 C × 15 s, 60 C × 15 s,
72 C × 30 s; 72 C × 2 min
[0480] Add 40 ul H2O
[0481] → check 1 ul on Bioanalyzer DNA1000 chip → pool samples →
clean up → Bioanalyzer, adjust conc → sequencing
prep of 30 cy replicate (Yellow: 1 Replicates with Barcode 4 into
Sequencing)
[0482] 30 cycles STA
[0483] (Yellow STA protocol: 95 C × 10 min; 95 C × 15 s,
65 C × 1 min, 60 C × 4 min, 65 C × 30 s, 72 C × 30 s; 72 C × 2 min)
[0484] Used the 50 nM primers reactions
[0485] Performed a first ExoSAP straight from product → failed to
remove all primers (Bioanalyzer): just leave this step out in the
future.
[0486] Dilute 1/10 (adding 90 ul H2O)
[0487] Dilute 1/100 → 1/25 dilution = total 1/25,000
[0488] Probably did not perform ExoSAP clean up, small uncertainty
from notes
[0489] Append SQ tags (longer, full F-SQ and R-m.p. adaptor without
barcodes):
[0490] 1 ul DNA in 10 ul PCR: F-SQ × R-SQ-m.p.; concentrations:
200 nM?
[0491] 15 cycles: 95 C × 10 min; 95 C × 15 s, 60 C × 30 s,
65 C × 15 s, 72 C × 30 s; 72 C × 2 min
[0492] Add 90 ul H2O, use 1 ul for next step, primer carry over will
be 1/100 of conc in this reaction
[0493] Barcoding PCR (p.9 quick book):
[0494] 1 ul DNA in 10 ul PCR: F-SQ × R-SQ-BC1 to 12-lib.;
concentrations: 1 uM
[0495] 15 cycles: 95 C × 10 min; 95 C × 15 s, 60 C × 15 s,
72 C × 30 s; 72 C × 2 min
[0496] Add 40 ul H2O
[0497] → check 1 ul on Bioanalyzer DNA1000 chip → pool samples →
clean up → Bioanalyzer, adjust conc → sequencing
Prep of 1000-plex reactions (Blue: 1000-Plex; from Amplified SQ
Libraries (p.32 Lab Book BZ1))
[0498] BC2=ASQ8=pregnancy plasma 2666 or 2687; BC3=ASQ4=apo sup
16777
[0499] 15 cycles STA
[0500] (RED STA protocol: 95 C × 10 min; 95 C × 15 s, 65 C × 1 min,
60 C × 4 min, 65 C × 30 s, 72 C × 30 s; 72 C × 2 min)
[0501] 50 nM target specific tagged R-primers and 200 nM
F-SQ-primer
[0502] Performed a first ExoSAP straight from product → failed to
remove all primers (Bioanalyzer): just leave this step out in the
future.
[0503] Dilute 1/5 (adding 40 ul H2O)
[0504] 2 ul in 14 ul ExoSAP reaction → dilute to 100 ul = 1/50
dilution in this step = total 1/250
[0505] Append SQ tags (longer, full F-SQ and R-m.p. adaptor without
barcodes):
[0506] 1 ul DNA in 10 ul PCR: F-SQ × R-SQ-m.p.; concentrations:
200 nM?
[0507] 15 cycles: 95 C × 10 min; 95 C × 15 s, 60 C × 30 s,
65 C × 15 s, 72 C × 30 s; 72 C × 2 min
[0508] Add 90 ul H2O, use 1 ul for next step, primer carry over will
be 1/100 of conc in this reaction
[0509] Barcoding PCR (p.9 quick book):
[0510] 1 ul DNA in 10 ul PCR: F-SQ × R-SQ-BC1 to 12-lib.;
concentrations: 1 uM
[0511] 15 cycles: 95 C × 10 min; 95 C × 15 s, 60 C × 15 s,
72 C × 30 s; 72 C × 2 min
[0512] Add 40 ul H2O
[0513] → check 1 ul on Bioanalyzer DNA1000 chip → pool samples →
clean up → Bioanalyzer, adjust conc → sequencing
[0514] By making use of targeting approaches in sequencing the
mixed sample, it may be possible to achieve a certain level of
accuracy with fewer sequence reads. The accuracy may refer to
sensitivity, it may refer to specificity, or it may refer to some
combination thereof. The desired level of accuracy may be between
90% and 95%; it may be between 95% and 98%; it may be between 98%
and 99%; it may be between 99% and 99.5%; it may be between 99.5%
and 99.9%; it may be between 99.9% and 99.99%; it may be between
99.99% and 99.999%, it may be between 99.999% and 100%. Levels of
accuracy above 95% may be referred to as high accuracy.
[0515] There are a number of published methods in the prior art
that demonstrate how one may determine the ploidy state of a fetus
from a mixed sample of maternal and fetal DNA, for example: G. J.
W. Liao et al. Clinical Chemistry 2011; 57(1) pp. 92-101. These
methods target thousands of locations along each chromosome. The
number of locations along a chromosome that may be targeted while
still resulting in a high accuracy ploidy determination on a fetus,
for a given number of sequence reads, from a mixed sample of DNA is
unexpectedly low. In one embodiment of the present disclosure, an
accurate ploidy determination may be made by using targeted
sequencing, using any method of targeting, for example qPCR,
capture by hybridization, or circularizing probes, wherein the
number of loci along a chromosome that need to be targeted may be
between 1,000 and 500 loci; it may be between 500 and 300 loci; it
may be between 300 and 200 loci; it may be between 200 and 150
loci; it may be between 150 and 100 loci; it may be between 100 and
50 loci; it may be between 50 and 20 loci; it may be between 20 and
10 loci. Optimally, it may be between 100 and 500 loci. The high
level of accuracy may be achieved by targeting a small number of
loci and executing an unexpectedly small number of sequence reads.
The number of reads may be between 5 million and 2 million reads;
the number of reads may be between 2 million and 1 million; the
number of reads may be between 1 million and 500,000; the number of
reads may be between 500,000 and 200,000; the number of reads may
be between 200,000 and 100,000; the number of reads may be between
100,000 and 50,000; the number of reads may be between 50,000 and
20,000; the number of reads may be between 20,000 and 10,000; the
number of reads may be below 10,000.
[0516] In some embodiments, there is a composition comprising a
mixture of DNA of fetal origin, and DNA of maternal origin, wherein
the percent of sequences that uniquely map to chromosome 13 is
greater than 4%, greater than 5%, greater than 6%, greater than 7%,
greater than 8%, greater than 9%, greater than 10%, greater than
12%, greater than 15%, greater than 20%, greater than 25%, or
greater than 30%. In some embodiments of the present disclosure,
there is a composition comprising a mixture of DNA of fetal origin,
and DNA of maternal origin, wherein the percent of sequences that
uniquely map to chromosome 18 is greater than 3%, greater than 4%,
greater than 5%, greater than 6%, greater than 7%, greater than 8%,
greater than 9%, greater than 10%, greater than 12%, greater than
15%, greater than 20%, greater than 25%, or greater than 30%. In
some embodiments of the present disclosure, there is a composition
comprising a mixture of DNA of fetal origin, and DNA of maternal
origin, wherein the percent of sequences that uniquely map to
chromosome 21 is greater than 2%, greater than 3%, greater than 4%,
greater than 5%, greater than 6%, greater than 7%, greater than 8%,
greater than 9%, greater than 10%, greater than 12%, greater than
15%, greater than 20%, greater than 25%, or greater than 30%. In
some embodiments of the present disclosure, there is a composition
comprising a mixture of DNA of fetal origin, and DNA of maternal
origin, wherein the percent of sequences that uniquely map to
chromosome X is greater than 6%, greater than 7%, greater than 8%,
greater than 9%, greater than 10%, greater than 12%, greater than
15%, greater than 20%, greater than 25%, or greater than 30%. In
some embodiments of the present disclosure, there is a composition
comprising a mixture of DNA of fetal origin, and DNA of maternal
origin, wherein the percent of sequences that uniquely map to
chromosome Y is greater than 1%, greater than 2%, greater than 3%,
greater than 4%, greater than 5%, greater than 6%, greater than 7%,
greater than 8%, greater than 9%, greater than 10%, greater than
12%, greater than 15%, greater than 20%, greater than 25%, or
greater than 30%.
[0517] In some embodiments, there is a composition comprising a
mixture of DNA of fetal origin, and DNA of maternal origin, wherein
the percent of sequences that uniquely map to a chromosome, that
contains at least one single nucleotide polymorphism is greater
than 0.2%, greater than 0.3%, greater than 0.4%, greater than 0.5%,
greater than 0.6%, greater than 0.7%, greater than 0.8%, greater
than 0.9%, greater than 1%, greater than 1.2%, greater than 1.4%,
greater than 1.6%, greater than 1.8%, greater than 2%, greater than
2.5%, greater than 3%, greater than 4%, greater than 5%, greater
than 6%, greater than 7%, greater than 8%, greater than 9%, greater
than 10%, greater than 12%, greater than 15%, or greater than 20%,
and where the chromosome is taken from the group 13, 18, 21, X, or
Y. In some embodiments of the present disclosure, there is a
composition comprising a mixture of DNA of fetal origin, and DNA of
maternal origin, wherein the percent of sequences that uniquely map
to a chromosome and that contain at least one single nucleotide
polymorphism from a set of single nucleotide polymorphisms is
greater than 0.15%, greater than 0.2%, greater than 0.3%, greater
than 0.4%, greater than 0.5%, greater than 0.6%, greater than 0.7%,
greater than 0.8%, greater than 0.9%, greater than 1%, greater than
1.2%, greater than 1.4%, greater than 1.6%, greater than 1.8%,
greater than 2%, greater than 2.5%, greater than 3%, greater than
4%, greater than 5%, greater than 6%, greater than 7%, greater than
8%, greater than 9%, greater than 10%, greater than 12%, greater
than 15%, or greater than 20%, where the chromosome is taken from
the set of chromosome 13, 18, 21, X and Y, and where the number of
single nucleotide polymorphisms in the set of single nucleotide
polymorphisms is between 1 and 10, between 10 and 20, between 20
and 50, between 50 and 100, between 100 and 200, between 200 and
500, between 500 and 1,000, between 1,000 and 2,000, between 2,000
and 5,000, between 5,000 and 10,000, between 10,000 and 20,000,
between 20,000 and 50,000, and between 50,000 and 100,000.
[0518] In theory, each cycle in the amplification doubles the
amount of DNA present; in reality, the degree of amplification per
cycle is slightly lower than two. In theory, amplification,
including targeted amplification, will result in bias free
amplification of a DNA mixture. In practice, when DNA is amplified,
the degree of allelic bias typically increases with the number of
amplification steps. In some embodiments, the methods described
herein involve amplifying DNA with a low level of allelic bias.
Since the allelic bias compounds, one can determine the per cycle
allelic bias by calculating the nth root of the overall bias, where
n is the base 2 logarithm of the degree of enrichment. In some
embodiments, there is a composition comprising a second mixture of
DNA, where the second mixture of DNA has been preferentially
enriched at a plurality of polymorphic loci from a first mixture of
DNA where the degree of enrichment is at least 10, at least 100, at
least 1,000, at least 10,000, at least 100,000 or at least
1,000,000, and where the ratio of the alleles in the second mixture
of DNA at each locus differs from the ratio of the alleles at that
locus in the first mixture of DNA by a factor that is, on average,
less than 1,000%, 500%, 200%, 100%, 50%, 20%, 10%, 5%, 2%, 1%,
0.5%, 0.2%, 0.1%, 0.05%, 0.02%, or 0.01%. In some embodiments,
there is a composition comprising a second mixture of DNA, where
the second mixture of DNA has been preferentially enriched at a
plurality of polymorphic loci from a first mixture of DNA where the
per cycle allelic bias for the plurality of polymorphic loci is, on
average, less than 10%, 5%, 2%, 1%, 0.5%, 0.2%, 0.1%, 0.05%, or
0.02%. In some embodiments, the plurality of polymorphic loci
comprises at least 10 loci, at least 20 loci, at least 50 loci, at
least 100 loci, at least 200 loci, at least 500 loci, at least
1,000 loci, at least 2,000 loci, at least 5,000 loci, at least
10,000 loci, at least 20,000 loci, or at least 50,000 loci.
EXPERIMENTAL SECTION
[0519] The presently disclosed embodiments are described in the
following Examples, which are set forth to aid in the understanding
of the disclosure, and should not be construed to limit in any way
the scope of the disclosure as defined in the claims which follow
thereafter. The following examples are put forth so as to provide
those of ordinary skill in the art with a complete disclosure and
description of how to use the described embodiments, and are not
intended to limit the scope of the disclosure nor are they intended
to represent that the experiments below are all or the only
experiments performed. Efforts have been made to ensure accuracy
with respect to numbers used (e.g., amounts, temperature, etc.) but
some experimental errors and deviations should be accounted for.
Unless indicated otherwise, parts are parts by volume, and
temperature is in degrees Centigrade. It should be understood that
variations in the methods as disclosed may be made without changing
the fundamental aspects that the experiments are meant to
illustrate.
[0520] EXAMPLE 1
[0521] This example provides a protocol that was used to validate
the performance of a test method for determining the presence or
absence of aneuploidy according to the present invention. The test
method includes a first, allelic analysis method that uses a joint
distribution model to identify high confidence diploid samples,
utilizing only data from chromosomes of interest without control
chromosomes. The identity of these diploid samples is then passed
to a second, non-allelic analysis method that produces a likelihood
of a ploidy state. Aneuploidy probabilities for each test
chromosome for each sample were analyzed for each method, and a set
of rules was used to determine whether to call a given sample a
high risk sample, that is, a sample with a high probability of
aneuploidy. The set of
rules included at least one rule that combines the aneuploidy
confidences from the first method and the second method for a given
chromosome of interest for a given sample. The test method
eliminates the additional expense and variability introduced by the
use of a separate control chromosome. The validation protocol was
used to validate test method accuracy with measurements of test
sensitivity and specificity on clinical samples (Arm 1) and to
validate test method precision with measurements of test
reproducibility of clinical sample results and quality control (QC)
pass rate (Arm 2).
Background
[0522] The test method estimates the fetal copy number of
chromosomes 13, 18, 21, X, and Y from a maternal blood sample. The
test method utilizes cell free DNA (cfDNA), a mixture of maternal
and fetal DNA isolated from the plasma of pregnant women. The cfDNA
is first made into a library by ligation of adapters followed by
amplification to increase the available total DNA. 13,392 distinct
genetic loci are amplified by targeted multiplex PCR, each
containing a single nucleotide polymorphism (SNP). The SNP
amplicons are then sequenced using next generation sequencing
technology to determine the frequency of the SNP alleles at each
locus. In parallel, genomic DNA is extracted from the maternal
blood cells and, optionally, paternal cheek cells. These genomic
DNAs are amplified and sequenced in a similar manner to plasma DNA
libraries. The resultant SNP allele ratios from the plasma sample
and parental samples are analyzed to create a maximum-likelihood
estimate of the fetal chromosome copy number for each targeted
chromosome.
[0523] After sequencing, the sequence data first goes through a QC
process which determines whether the samples have been successfully
prepared and are eligible to be run through the Panorama copy
number algorithms. If a sample fails the QC process, it is
typically re-prepared or resequenced. In general, all samples are
expected to eventually pass the various QC thresholds. In contrast,
the algorithm data review thresholds are the criteria used to
determine a chromosome copy number result. These algorithm data
review thresholds were applied only to data that had already passed
through the QC process.
[0524] In Arm 1 of the validation protocol, ≥750 clinical
samples (≥300 high risk and ≥450 low risk) were
tested to validate that the test sensitivity and specificity meet
the product requirements as described in PRD-00104 Requirements
Document-NIPT Panorama Rev 04.
[0525] In Arm 2 of the validation protocol, 192 samples were split
into three daughter replicates that were tested with three lots of
selected reagents, three sets of selected instrumentation, three
operators, and on three separate days to validate laboratory
reproducibility.
Reagents.
[0526] Table 1 provides a list of reagents that were used for the
execution of the validation protocol. DNA sequencing was carried
out on a HiSeq Model 2500 (Illumina, San Diego, Calif.).
Thermocycling was performed using a GeneAmp PCR System 9700 (Model
N8050001) (Life Technologies, Carlsbad Calif.)
TABLE-US-00005 TABLE 1 List of Required Reagents
Reagent | Manufacturer | Manufacturer Part Number
4X Qiagen Multiplex PCR Master Mix Lots 1-3 | Qiagen | 1076436
5M TMAC lots 1-3 | Sigma | 639202
cfDNA Multiplex PCR Reagents: Lots 1-3 | Natera | 111100
cfDNA OneStar | Natera | 1121144
cfDNA OneStar | Natera | 1121100
gDNA Multiplex PCR Reagents | Natera | 121144
gDNA STAR 1 | Natera | 1221144
gDNA STAR 2 | Natera | 1222144
Molecular biology grade, DI water | Life Technologies | 10977-023
F-BC (Barcoding) Primer | IDT | n/a
R-SQ_NB4 Barcode Plates | IDT | n/a
QIAquick PCR Purification Kit | Qiagen | 28106
3M Sodium Acetate Solution | Life Technologies | AM9740
Quant-iT dsDNA Broad-Range Assay Kit (1000) | Life Technologies | Q33130
TruSeq Rapid SR Cluster Kit | Illumina | TG-402-4000 or GD-402-4001
10 nM Barcoded PhiX (NB2: 271 PhiX) | IDT | n/a
PhiX kit 10 nM stock for cbot (10 UL) | Illumina | FC-110-3001
TruSeq Rapid SBS Kit (50 cycle) | Illumina | TG-402-4002 or FC-402-4002
2N Sodium Hydroxide | Fisher Scientific | SS264-1
1N Sodium Hydroxide | Fisher Scientific | SS266-1
[0527] Statistical Approach/Sample Size
[0528] Justifications for the sampling strategy and statistical
techniques used for each arm of the validation protocol are
provided below.
[0529] Arm 1 consisted of ≥300 samples known to be from
women carrying a fetus with Trisomy 13, 18, or 21, Monosomy X, or
triploidy. This positive sample cohort consists of all available
samples for which copy number truth has been confirmed. The
positive set was selected to produce the best possible measurement
of test sensitivity.
[0530] ≥450 samples known to be from women carrying a
euploid fetus were selected for Arm 1. The desired specificity of
0.998 corresponds to one error in 500. The sample set was selected
to achieve maximal resolution on the specificity measurement, while
maintaining compatibility with the requirements related to
automation and plate layout, and practical feasibility given the
high cost of running samples. Although the specificity calculation
was to be performed using a child fraction estimate adjustment
(described in the analysis section), the distribution of child
fraction estimates in the samples was not known a priori and
therefore could not be used to set the sample size.
Arm 2 consisted of three replicates of a test unit of 192 samples.
This number of samples is driven by the automation protocol, which
requires at least two plates of 96 samples each.
[0531] All plasma-derived samples used in the validation protocol
entered the protocol workflow in the form of an amplified purified
cfDNA library produced from the extracted DNA of maternal
plasma.
[0532] Parental samples from two sources were used in the protocol:
Maternal gDNA extracted from centrifuged maternal blood samples
from which plasma has been removed; and paternal gDNA prepared from
a buccal sample.
Arm 1: Sensitivity and Specificity
[0533] In the sensitivity and specificity arm, the accuracy of the
test method was determined by comparing test results of the samples
used in the validation protocol to their known fetal chromosome
copy number.
[0534] QC failure in a plasma sample due to contamination or low
NOR required that the plasma sample be rerun through the protocol.
However, due to the limited volume of plasma library available for
some samples, it was not possible to rerun some samples. In those
cases, samples were excluded from all Arm 1 analyses. Failed mother
samples were rerun at most 2 times. Failed father samples were not
rerun. Due to the high maximum capacity of the laboratory
automation workflow, all plasma DNA library samples in Arm 1 were
processed in a single batch.
Arm 2: Laboratory Reproducibility
[0535] In the laboratory reproducibility arm, the reproducibility
of the test method was assessed using multiple reagent lots, sets
of equipment, test operators, and days. For non-critical reagents
and instruments, single lots were used because they are outside the
scope of this reproducibility testing. Specific reagents,
instruments, operators, and execution dates were used for each run
of 192 samples.
[0536] 192 samples were tested for each of the three runs in the
reproducibility test. Samples were isolated and extracted prior to
the execution of the validation protocol. For each sample, four
tubes of plasma (~3-5 mL each) were extracted in two pairs of
tubes. Each of the two extractions per sample was prepared into
purified plasma DNA library, and the two libraries were then pooled
into a single well for each case. The pooling generated
approximately 70-75 µL of library material for each case. Each
pooled library sample was distributed into 3 replicate sample
plates (22 µL each) for use in the validation protocol.
[0537] Replicate number 1 of the 192 samples tested in Arm 2 was
included in Arm 1 and underwent high depth of read reflex and rerun
as necessary to generate results for Arm 1 analysis. Arm 2
replicate numbers 2 and 3 did not undergo high depth of read reflex
or rerun. For all three replicates in Arm 2, only low depth of read
analysis was performed.
[0538] Only samples from replicate 1, 2, and 3 with sufficient
child fraction estimate (≥6% for low risk calls and ≥10% for high
risk calls) to be called at low depth of read were analyzed in the
Arm 2 reproducibility experiment.
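By way of illustration only, the child fraction estimate thresholds stated above for low depth of read calls can be expressed as a simple eligibility check. The thresholds (6% and 10%) come from the text; the function name, the constant names, and the representation of a call as a string are hypothetical.

```python
LOW_RISK_MIN_CFE = 0.06   # >= 6% child fraction estimate for low risk calls
HIGH_RISK_MIN_CFE = 0.10  # >= 10% child fraction estimate for high risk calls

def callable_at_low_depth(cfe, call):
    """Return True if a call of the given type may be made at low depth of read.

    cfe: child fraction estimate as a fraction (0.08 means 8%).
    call: "low risk" or "high risk" (hypothetical labels).
    """
    thresholds = {"low risk": LOW_RISK_MIN_CFE, "high risk": HIGH_RISK_MIN_CFE}
    if call not in thresholds:
        raise ValueError(f"unknown call type: {call!r}")
    return cfe >= thresholds[call]

# A sample at 8% child fraction supports a low risk call but not a high risk call.
print(callable_at_low_depth(0.08, "low risk"), callable_at_low_depth(0.08, "high risk"))
```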
[0539] While each plasma sample was tested three times for
reproducibility, the corresponding parent samples for each plasma
sample trio were not amplified and sequenced as replicates. Mother
samples were rerun as necessary to generate a passing QC result.
Father samples were not rerun. The resultant parent sample data was
used in the analysis of all three plasma sample replicates.
Data Analysis
[0540] Structural changes were made to the test method algorithms
to reflect the removal of all targeted loci on chromosomes 1 and 2.
The resultant SNP allele ratios from the plasma sample and parental
samples were analyzed to create a maximum-likelihood estimate of
the fetal chromosome copy number for each targeted chromosome. The
maximum-likelihood estimate was based on two different algorithms
as disclosed elsewhere herein, the het rate method and the
quantitative modeling method (QMM). The het rate algorithm is based
on analysis of the observed allele ratios (fraction of reference
allele) at each SNP using a joint distribution model. The QMM
algorithm is based on non-allelic analysis of the number of
sequencing reads at each SNP in a method that produces a maximum
likelihood among various ploidy hypotheses.
[0541] Data from both Arms were processed through the test
method.
Analysis of Arm 1
[0542] Analysis was performed using the father sample when
available.
[0543] Samples with unrecoverable QC failures were not included in
syndrome analysis and did not count toward the syndrome denominator
for rate calculations, including no-call rate.
[0544] The criteria for aneuploidy detection were verified by
observation.
[0545] The criteria for detection of at least one male and one
female sample were verified by observation. The number of incorrect
gender calls was computed and compared to the acceptance
criteria.
[0546] Each sample was evaluated for each syndrome for a result
from {high risk, low risk, risk unchanged}.
[0547] Each syndrome in the set (Trisomy 13, Trisomy 18, Trisomy
21, Monosomy X) was analyzed independently for sensitivity and
specificity and the results were compared to the acceptance
criteria for UR0070. Sensitivity and specificity were computed for
each syndrome according to the CFE projection method described in
Appendix A below, along with an approximate variance. The CFE
distribution from the article Pergament et al. (2014) (Obstet
Gynecol August: 124; 210-8) was used. The acceptance criteria were
met if the desired sensitivity and specificity fell within the
confidence bounds of the estimates from the data. The confidence
bounds were defined as 3 times the square root of the estimated
approximate variance.
[0548] Results with unchanged risk were not included in the
sensitivity or specificity computations but were reflected in the
computed no-call rate.
[0549] The requirement on no-call rate was evaluated on the subset
of euploid-truth samples passing QC, rather than the complete data
set. The no-call rate was computed using the CFE projection method
provided in Appendix A and the commercial CFE distribution. The
aneuploidy rate in commercial data is less than 2 percent and so
the contribution of aneuploid samples to the commercial no-call
rate is negligible.
Analysis of Arm 2
[0550] Reproducibility of clinical calls was evaluated on eligible
trios. An eligible trio was one that passed QC and produced calls
on more than 1 sample replicate. The acceptance criterion was that
there be not more than one changed call in the set of eligible
trios. A changed call is defined as a change from high risk to low
risk or the opposite. The number of changed calls was identified
and compared to the acceptance criteria.
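By way of illustration only, the reproducibility check described above can be sketched as a count over eligible trios. The sketch assumes that each trio's replicate results are held in a list in which a missing call (QC failure or no call) is represented as None; this data layout, the function name, and the example trio identifiers are hypothetical.

```python
def count_changed_calls(trio_calls):
    """Count eligible trios and trios with a changed call.

    trio_calls: dict mapping a trio ID to a list with one entry per
        replicate; each entry is "high risk", "low risk", or None
        (assumed to mean no call or QC failure for that replicate).
    Returns (number of eligible trios, number of changed calls).
    """
    eligible = 0
    changed = 0
    for calls in trio_calls.values():
        made = [c for c in calls if c in ("high risk", "low risk")]
        # Eligible: calls were produced on more than one replicate.
        if len(made) < 2:
            continue
        eligible += 1
        # Changed: both a high risk and a low risk call appear.
        if len(set(made)) > 1:
            changed += 1
    return eligible, changed

# Hypothetical example data, for illustration only.
example = {
    "trio_A": ["low risk", "low risk", "low risk"],
    "trio_B": ["high risk", None, "high risk"],
    "trio_C": ["low risk", None, None],  # not eligible: only one call
    "trio_D": ["low risk", "high risk", "low risk"],
}
eligible, changed = count_changed_calls(example)
meets_criterion = changed <= 1  # not more than one changed call
print(eligible, changed, meets_criterion)
```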
[0551] Appendix A: Computational details for sensitivity,
specificity and no-call rate projected to a known CFE distribution
(generated by a commercial test)
[0552] Commercial data from Panorama Version 1 (Natera, Calif.)
that used control chromosomes was used to support the analysis by
providing a representative commercial CFE distribution. The metrics
calculated from the study data were used to calculate projected
performance metrics for the commercial product using this
distribution.
[0553] Previous experiments with the test method led to the
following relationship between Panorama Version 1 CFE (f1) and the
test method CFE (f2) for the same blood draw. This relationship
holds in the observed range of CFE, from approximately (f1=0.01) to
(f1=0.35).
$$f_2 = 0.3533 f_1^2 + 0.9136 f_1$$
[0554] The commercial child fraction estimate distribution was
determined using a set of approximately 50,000 commercial test
results (Panorama Version 1, Natera, Calif.), which analyzes
samples using a method that utilized control chromosomes.
[0555] The equation above was used to convert the Panorama Version
1 commercial test CFE distribution into the test method CFE
equivalent. This was regarded as the commercial child fraction
estimate distribution going forward, such that all computations
were done on the test method CFE.
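By way of illustration only, the conversion from the Panorama Version 1 CFE (f1) to the test method CFE (f2) can be sketched using the coefficients given above. The function name is hypothetical, and rejecting inputs outside the approximate observed range of 0.01 to 0.35 is an assumption of this sketch rather than a stated part of the method.

```python
def panorama_v1_to_test_cfe(f1):
    """Map a Panorama Version 1 CFE (f1) to the test method CFE (f2).

    Uses f2 = 0.3533 * f1**2 + 0.9136 * f1, with f1 as a fraction.
    """
    # The relationship is reported only for roughly 0.01 <= f1 <= 0.35;
    # refusing to extrapolate is a design choice of this sketch.
    if not 0.01 <= f1 <= 0.35:
        raise ValueError("f1 is outside the observed range of the relationship")
    return 0.3533 * f1 ** 2 + 0.9136 * f1

# Convert a few hypothetical commercial CFE values.
converted = [panorama_v1_to_test_cfe(f1) for f1 in (0.04, 0.10, 0.20)]
print(converted)
```

Applying the function to every value in the commercial distribution yields the distribution in test method CFE units.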
[0556] The same approach can be used to generate the CFE
distribution from the Pergament et al. (2014) publication.
[0557] A metric such as sensitivity can be projected to the
commercial child fraction estimate distribution as follows:
[0558] Define a set of CFE intervals, i=1 to N.
[0559] Observe the population rate of each interval from the
commercial data distribution, $p_i$.
[0560] Compute the metric of interest, $x_i$, such as sensitivity,
for the subset of the syndrome data that falls within each child
fraction estimate interval.
[0561] The projected value of the metric of interest is a weighted
sum across the CFE intervals.
$$x = \sum_{i=1}^{N} p_i x_i$$
[0562] The variance of the projected value of the metric can be
approximated by a similar method.
$$\mathrm{VAR}[x] = \sum_{i=1}^{N} p_i \, \mathrm{VAR}[x_i]$$
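By way of illustration only, the projection of a metric and its approximate variance onto a reference CFE distribution can be sketched as follows. The weighted sum and the variance approximation follow the expressions given above, with the variance weighted by the interval rates as written there. The function name and the three-interval example values are hypothetical and are not study data.

```python
import math

def project_metric(interval_rates, interval_values, interval_variances):
    """Project a per-interval metric onto a reference CFE distribution.

    interval_rates: population rate of each CFE interval (should sum to 1).
    interval_values: the metric computed within each interval.
    interval_variances: the variance of the metric within each interval.
    Returns (projected value, approximate variance).
    """
    if not (len(interval_rates) == len(interval_values) == len(interval_variances)):
        raise ValueError("all inputs must have one entry per CFE interval")
    if abs(sum(interval_rates) - 1.0) > 1e-9:
        raise ValueError("interval rates must sum to 1")
    # Weighted sum of the metric across the CFE intervals.
    projected = sum(p * x for p, x in zip(interval_rates, interval_values))
    # Approximate variance, weighted by the interval rates as in the text.
    variance = sum(p * v for p, v in zip(interval_rates, interval_variances))
    return projected, variance

# Hypothetical three-interval example (illustrative values only).
rates = [0.2, 0.5, 0.3]
values = [0.90, 0.98, 1.00]
variances = [0.004, 0.001, 0.0005]
projected, variance = project_metric(rates, values, variances)
three_sigma = 3 * math.sqrt(variance)  # half-width of the confidence bounds
print(projected, variance, three_sigma)
```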
Results and Analysis
[0563] Arm 1 Analysis: Detection, Accuracy, Failure Rate
[0564] The analysis for detection, accuracy and failure rate
includes both the set of "arm 1" samples and the first replicate of
the set of "arm 2" samples. Thus the starting count of eligible
samples is the combined count of 587.
[0565] Samples excluded from all Arm 1 analysis due to quality
control failures
[0566] As defined in the test protocol, samples failing quality
control metrics were not included in detection, accuracy or failure
rate performance computations of Arm 1, which was analyzed with
all samples from the Arm 1 cohort and replicate 1 of the Arm 2
cohort. Eight such cases were removed.
[0567] Those failing samples are described below in Table 2.
Collectively, these samples comprise 5 cases of trisomy 21 and 3
euploid cases.
TABLE-US-00006 TABLE 2 Summary of Quality Control Failures for Arm 1 Analysis
sample count | failure reason | affected cases
3 | Contamination | 339370, 339242, 339486
2 | sample handling error | 339415, 338867
1 | unrecoverable mother sample failure | 339617
2 | failed sequencing number of reads | 339397, 339229
[0568] Aneuploidy Detection and Gender Detection
[0569] The acceptance criteria were as follows:
[0570] The test was able to detect at least one sample each of
Trisomy 13, 18, 21, Monosomy X, and triploidy
[0571] The test was able to detect gender for at least one male and
one female sample
[0572] Not more than two incorrect gender calls will occur for
eligible samples. An incorrect gender call is defined as incorrect
reporting of the presence or absence of the Y chromosome.
[0573] Table 3 shows the number and type of calls for each
syndrome. Note that female samples include monosomy X and a large
number (248) of samples do not include gender truth.
TABLE-US-00007 TABLE 3 Arm 1 Analysis Results Summary
Category | Negative | T13 | T18 | T21 | MX | Triploidy | Total | Male | Female
Eligible | 335 | 15 | 37 | 179 | 9 | 4 | 579 | 170 | 162
Algorithm Limitation (No Call) | 11 | 1 | 9 | 13 | 2 | 0 | 36 | 7 | 18
Correct Calls | 324 | 14 | 28 | 163 | 7 | 4 | 540 | 163 | 144
Incorrect Calls | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0
Other Abnormality | 0 | 0 | 0 | 1 | 0 | 0 | 1 | |
[0574] All calls were correct with the exception of 2 trisomy 21
cases called false negative and 1 trisomy 21 case called "other
abnormality". The latter is discussed in more detail below.
[0575] Case 338833 was identified as having trisomy 21 through
karyotype analysis of the CVS biopsy. The case was reported as
"no-call due to suspected abnormality" because in addition to the
trisomy 21, which was detected, there were also abnormal
indications on the X chromosome. This case was not counted as a
negative call in the sensitivity computation because the result was
suspected abnormality including trisomy 21.
[0576] Thus, the test method was able to detect at least one sample
each of Trisomy 13, 18, 21, Monosomy X, and triploidy, was able to
detect gender for at least one male and one female sample, and no
incorrect gender calls occurred. Therefore, the aneuploidy
detection and gender detection acceptance criteria were met by the
test method.
Sensitivity and Specificity
[0577] The acceptance criteria were that the sensitivity and
specificity thresholds listed below fell within or below the
3-sigma bounds estimated from the data. Only samples with algorithm
results were included in the analysis and the raw sensitivity and
specificity values were normalized to meet the fetal fraction
distribution observed in the publication by Pergament et al.
(2014).
[0578] T21: Sensitivity ≥ 99.01% and specificity ≥ 99.89%
[0579] T18: Sensitivity ≥ 96.00% and specificity ≥ 99.98%
[0580] T13: Sensitivity ≥ 90.00% and specificity ≥ 99.91%
[0581] MX: Sensitivity ≥ 90.00% and specificity ≥ 99.91%
[0582] Raw and fetal fraction distribution-adjusted measurements of
sensitivity and specificity with confidence bounds are presented in
Table 4 below.
TABLE-US-00008 TABLE 4 Sensitivity
                             | Trisomy 13 | Trisomy 18 | Trisomy 21 | Monosomy X
Correct Calls                | 14         | 28         | 163        | 7
Incorrect Calls              | 0          | 0          | 2          | 0
Observed Sensitivity         | 100%       | 100%       | 98.8%      | 100%
Observed 95% CI (%)          | 76.8-100   | 87.7-100   | 95.7-99.9  | 59.0-100
Projected Sensitivity        | 100%       | 100%       | 97.1%      | 100%
Projected 3-Sigma Bounds (%) | 78.8-100   | 87.3-100   | 87.7-100   | N/A
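The observed sensitivities and 95% confidence intervals in Table 4 can be approximately reproduced from the call counts. The following is a minimal sketch that assumes the intervals are exact (Clopper-Pearson) binomial intervals; the disclosure does not name the interval method, so this choice, along with the function names `binom_cdf` and `clopper_pearson`, is an assumption made for illustration.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _root(fn, target: float, iters: int = 60) -> float:
    """Bisection for fn(p) == target, where fn is increasing in p on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if fn(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided binomial interval for x successes out of n trials."""
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _root(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2.0)
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _root(
        lambda p: 1.0 - binom_cdf(x, n, p), 1.0 - alpha / 2.0)
    return lower, upper


# Correct and incorrect call counts taken from Table 4.
counts = {"T13": (14, 0), "T18": (28, 0), "T21": (163, 2), "MX": (7, 0)}
for name, (correct, incorrect) in counts.items():
    n = correct + incorrect
    lo, hi = clopper_pearson(correct, n)
    print(f"{name}: sensitivity {100 * correct / n:.1f}%, "
          f"95% CI {100 * lo:.1f}-{100 * hi:.1f}")
```

For syndromes with no incorrect calls, the lower bound reduces to the closed form (alpha/2)^(1/n), which is why the interval widens sharply for monosomy X with only 7 called cases.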
[0583] The observed sensitivity for trisomy 21 was 98.8%. After
fetal fraction adjustment, the estimated sensitivity was 97.1% with
a standard deviation of 3.1% and 3-sigma confidence bounds of
87.7%-100%.
[0584] The observed specificity for trisomy 13, trisomy 18, trisomy
21 and monosomy X was 100%. No adjustment to the fetal fraction
distribution was applied. Monosomy X had too few calls to evaluate
the fetal fraction adjustment in the confidence interval, so
projected bounds were not given.
[0585] Therefore, all syndromes meet the sensitivity and
specificity acceptance criteria that the required minimum values
fall within or below 3 standard deviations of the observed
value.
[0586] Although test sensitivity for sex chromosome trisomies such
as XXX, XXY, and XYY was not specifically addressed by this study,
there were no false positives for these syndromes among any of the
called cases, suggesting that sex chromosome trisomy specificity
was near 100%.
Algorithm Failure Rate
[0587] The acceptance criterion was that the maximum tolerable
algorithm failure rate of 4.20% fell within or above the 3 sigma
bounds estimated from the data. This estimate was based on
projection to the currently observed commercial fetal fraction
distribution and was computed from the set of negative samples with
gestational age ≥ 10 weeks.
[0588] Raw and adjusted measurements of algorithm failure rate with
confidence bounds are presented in Table 5 below.
TABLE-US-00009 TABLE 5 Summary of Algorithm Failure Rate Analysis
                                   | Count
Eligible Negatives any GA          | 335
Eligible Negatives GA ≥ 10 Weeks   | 279
Called Low Risk                    | 273
Not Called (Algorithm Limitation)  | 6
[0589] The observed algorithm failure rate was 2.1% before fetal
fraction distribution adjustment. After adjustment the algorithm
failure rate in a commercial cohort was projected to be 3.67% with
a standard deviation of 2.06% and 3-sigma bounds of 0%-9.86%.
Therefore, the acceptance criteria for algorithm failure rate were
met.
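The arithmetic behind this criterion can be checked with a short sketch. It assumes that the 3-sigma bounds are the projected rate plus or minus three standard deviations, clipped to the range 0%-100%, and that "within or above" means the tolerable rate is not below the lower bound; the function name `three_sigma_bounds` is hypothetical.

```python
def three_sigma_bounds(mean_pct: float, sd_pct: float) -> tuple[float, float]:
    """Mean +/- 3 standard deviations, clipped to a valid percentage range."""
    lower = max(0.0, mean_pct - 3.0 * sd_pct)
    upper = min(100.0, mean_pct + 3.0 * sd_pct)
    return lower, upper


# Counts from Table 5: 6 no-calls among 279 eligible negatives (GA >= 10 weeks).
observed_pct = 100.0 * 6 / 279

# Projected rate and standard deviation as reported in the text.
lower, upper = three_sigma_bounds(3.67, 2.06)

# The rounded inputs give an upper bound of 9.85%; the text reports 9.86%,
# presumably computed from unrounded values.
max_tolerable_pct = 4.20
criterion_met = max_tolerable_pct >= lower

print(f"observed {observed_pct:.2f}%, bounds {lower:.2f}%-{upper:.2f}%, "
      f"criterion met: {criterion_met}")
```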
[0590] Note that fetal fraction adjustment computations for
sensitivity and algorithm failure rate followed a method based on
dividing the range of fetal fractions into bins, and combining
those bins according to their population in commercial data.
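The binning approach described above can be sketched as follows. This is an illustrative reconstruction, not the method actually used in the study: the bin boundaries, the bin counts, the commercial weights, and the variance formula (independent binomial bins) are all assumptions, and the function name `reweighted_rate` is hypothetical.

```python
from math import sqrt


def reweighted_rate(bins: list[tuple[int, int, float]]) -> tuple[float, float]:
    """Project a rate onto a target fetal fraction distribution.

    Each bin is (events, samples, target_weight), where target_weight is the
    share of the target (e.g. commercial) population falling in that bin.
    Returns the weighted rate and an approximate standard deviation, assuming
    each bin is an independent binomial sample.
    """
    total_weight = sum(w for _, _, w in bins)
    rate = 0.0
    variance = 0.0
    for events, samples, weight in bins:
        if samples <= 0:
            raise ValueError("each bin needs at least one sample")
        p = events / samples
        w = weight / total_weight          # normalize weights to sum to 1
        rate += w * p
        variance += w**2 * p * (1.0 - p) / samples
    return rate, sqrt(variance)


# Hypothetical bins for illustration only (not study data):
# (events, samples, share of the target population in this bin).
example_bins = [(2, 10, 0.5), (1, 10, 0.3), (0, 20, 0.2)]
raw_rate = sum(e for e, _, _ in example_bins) / sum(n for _, n, _ in example_bins)
adjusted_rate, adjusted_sd = reweighted_rate(example_bins)
print(f"raw {raw_rate:.3f}, adjusted {adjusted_rate:.3f} +/- {adjusted_sd:.3f}")
```

In this toy example the adjusted rate exceeds the raw rate because the bin with the highest event rate is under-represented in the study sample relative to its assumed share of the target population.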
Arm 2: Reproducibility
[0591] Samples Excluded from all Arm 2 Analyses Due to Quality
Control Failures
[0592] Samples that failed in all 3 replicates were excluded from
the reproducibility analysis. Case 339617 had a failed mother gDNA
sample that did not produce a passing result and, as such, was
excluded leaving 89 eligible samples.
Rate of Samples Passing QC
[0593] The acceptance criterion requires that fewer than 10% of the
samples in each test unit fail QC. This criterion was evaluated on
each test unit independently.
[0594] Out of 89 cases, there were 87 cases where all three
replicates passed QC. Two different cases (339683, 339700) each had
a replicate in test unit one that failed the number of reads QC
threshold. Thus the highest observed QC failure rate was two
samples out of 89, or 2.2%, per test unit. This met the acceptance
criterion.
Reproducibility of Clinical Results
[0595] The acceptance criterion was that, of the sample triplicates
that produced 2 or 3 result calls, not more than one triplicate
produced an inconsistent call. At least 50 of the 192 sample
triplicates had to be eligible for clinical calls in all three test
units for the reproducibility analysis to be performed.
[0596] Of the 89 cases where at least 2 replicates passed QC, 22 were
ineligible for review at low depth of read due to their fetal
fraction. Six cases went through review and were selected (in all
replicates) for resequencing at high depth of read due to suspected
aneuploidy call. Two cases had all replicates identified as
"uninformative DNA pattern" and were not called. Combined, 30 cases
were uncallable in all three replicates, leaving 59 eligible
samples that had at least two calls.
[0597] Two cases were called in one replicate and selected for
resequencing at high depth of read in two replicates. Repeatability
analysis was only performed for samples producing results at low
depth of read and did not include analysis based on reflex to high
depth of read.
[0598] Three cases each had one replicate with a QC failure or
reflex request and the calls were consistent in the remaining two
replicates. 54 cases had calls on all three replicates. There were
no cases with inconsistent calls. Therefore, the acceptance
criteria were satisfied.
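The reproducibility rule and the case tallies reported above can be expressed as a short sketch. The function names, the label strings, and the example triplicates are hypothetical; only the counts in the tally section are taken from the text.

```python
from typing import Optional, Sequence


def classify_triplicate(calls: Sequence[Optional[str]]) -> str:
    """Classify one sample triplicate.

    Each entry is a result call, or None when the replicate produced no call
    (QC failure, reflex to high depth of read, or other no-call).
    """
    made = [c for c in calls if c is not None]
    if len(made) < 2:
        return "ineligible"      # fewer than 2 result calls
    return "consistent" if len(set(made)) == 1 else "inconsistent"


def criterion_met(triplicates, max_inconsistent: int = 1) -> bool:
    """Not more than one eligible triplicate may be inconsistent."""
    labels = [classify_triplicate(t) for t in triplicates]
    return labels.count("inconsistent") <= max_inconsistent


# Hypothetical triplicates for illustration only (not study data).
examples = [
    ("low risk", "low risk", "low risk"),
    ("low risk", None, "low risk"),
    (None, None, "high risk"),
]
print([classify_triplicate(t) for t in examples], criterion_met(examples))

# Tallies as reported in the text.
cases = 89
uncallable = 22 + 6 + 2              # low fetal fraction, reflex, uninformative
eligible = cases - uncallable        # cases reported as eligible
breakdown = 2 + 3 + 54               # case groups described in the text
print(uncallable, eligible, breakdown)
```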
Conclusions
[0599] All acceptance criteria were met by the test method. In
other words, the test method was able to effectively determine the
presence or absence of aneuploidy of chromosomes of interest in
test samples even when scrutinized against commercial test
performance.
[0600] All patents, patent applications, and published references
cited herein are hereby incorporated by reference in their
entirety. While the methods of the present disclosure have been
described in connection with the specific embodiments thereof, it
will be understood that they are capable of further modification.
Furthermore, this application is intended to cover any variations,
uses, or adaptations of the methods of the present disclosure,
including such departures from the present disclosure as come
within known or customary practice in the art to which the methods
of the present disclosure pertain, and as fall within the scope of
the appended claims.
* * * * *