U.S. patent application number 15/887746 was filed with the patent office on 2018-02-02 and published on 2018-06-21 for system and method for cleaning noisy genetic data and determining chromosome copy number.
This patent application is currently assigned to Natera, Inc. The applicant listed for this patent is Natera, Inc. Invention is credited to Milena BANJEVIC, Zachary DEMKO, David JOHNSON, Dusan KIJACIC, Dimitri PETROV, Matthew RABINOWITZ, Joshua SWEETKIND-SINGER, Jing XU.
Application Number | 20180171409 15/887746 |
Document ID | / |
Family ID | 59019055 |
Filed Date | 2018-02-02 |
United States Patent
Application |
20180171409 |
Kind Code |
A1 |
RABINOWITZ; Matthew ; et
al. |
June 21, 2018 |
SYSTEM AND METHOD FOR CLEANING NOISY GENETIC DATA AND DETERMINING
CHROMOSOME COPY NUMBER
Abstract
Disclosed herein is a system and method for increasing the
fidelity of measured genetic data, for making allele calls, and for
determining the state of aneuploidy, in one or a small set of
cells, or from fragmentary DNA, where a limited quantity of genetic
data is available. Poorly or incorrectly measured base pairs,
missing alleles and missing regions are reconstructed using
expected similarities between the target genome and the genome of
genetically related individuals. In accordance with one embodiment,
incomplete genetic data from an embryonic cell are reconstructed at
a plurality of loci using the more complete genetic data from a
larger sample of diploid cells from one or both parents, with or
without haploid genetic data from one or both parents. In another
embodiment, the chromosome copy number can be determined from the
measured genetic data, with or without genetic information from one
or both parents.
Inventors: |
RABINOWITZ; Matthew; (San Francisco, CA)
BANJEVIC; Milena; (Los Altos Hills, CA)
DEMKO; Zachary; (San Francisco, CA)
JOHNSON; David; (San Francisco, CA)
KIJACIC; Dusan; (Los Altos Hills, CA)
PETROV; Dimitri; (Stanford, CA)
SWEETKIND-SINGER; Joshua; (San Jose, CA)
XU; Jing; (Jersey City, NJ) |
|
Applicant: |
Name | City | State | Country | Type |
Natera, Inc. | San Carlos | CA | US | |
Assignee: |
Natera, Inc.
San Carlos
CA
|
Family ID: |
59019055 |
Appl. No.: |
15/887746 |
Filed: |
February 2, 2018 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
15446778 | Mar 1, 2017 | |
15887746 | | |
13949212 | Jul 23, 2013 | |
15446778 | | |
12076348 | Mar 17, 2008 | 8515679 |
13949212 | | |
11496982 | Jul 31, 2006 | |
12076348 | | |
11603406 | Nov 22, 2006 | 8532930 |
11496982 | | |
11634550 | Dec 6, 2006 | |
11603406 | | |
11496982 | Jul 31, 2006 | |
11603406 | | |
60918292 | Mar 16, 2007 | |
60926198 | Apr 25, 2007 | |
60932456 | May 31, 2007 | |
60934440 | Jun 13, 2007 | |
61003101 | Nov 13, 2007 | |
61008637 | Dec 21, 2007 | |
60742305 | Dec 6, 2005 | |
60754396 | Dec 29, 2005 | |
60774976 | Feb 21, 2006 | |
60789506 | Apr 4, 2006 | |
60817741 | Jun 30, 2006 | |
60846610 | Sep 22, 2006 | |
60846610 | Sep 22, 2006 | |
60703415 | Jul 29, 2005 | |
60742305 | Dec 6, 2005 | |
60754396 | Dec 29, 2005 | |
60774976 | Feb 21, 2006 | |
60789506 | Apr 4, 2006 | |
60817741 | Jun 30, 2006 | |
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G06N 7/005 20130101;
C12Q 1/686 20130101; C12Q 1/6876 20130101; G16B 20/00 20190201;
C12Q 1/6883 20130101; C12Q 2535/122 20130101; C12Q 2525/155
20130101; C12Q 2537/143 20130101; C12Q 2537/16 20130101; C12Q
2535/122 20130101; C12Q 2600/156 20130101; C12Q 1/6855 20130101;
C12Q 1/6851 20130101; C12Q 2600/158 20130101; C12Q 1/6855 20130101;
C12Q 1/686 20130101; G16B 40/00 20190201; C12Q 1/6848 20130101 |
International
Class: |
C12Q 1/6883 20060101
C12Q001/6883; C12Q 1/6851 20060101 C12Q001/6851; G06N 7/00 20060101
G06N007/00; G06F 19/18 20060101 G06F019/18; G06F 19/24 20060101
G06F019/24; C12Q 1/6848 20060101 C12Q001/6848 |
Government Interests
GOVERNMENT LICENSING RIGHTS
[0002] This invention was made with government support under Grant
No. R44HD054958-02A2 awarded by the National Institutes of Health.
The government has certain rights in the invention.
Claims
1. A method for measuring the amounts of fetal chromosome segments
on one or more chromosomes of interest and one or more reference
chromosomes in a maternal blood sample, the method comprising:
obtaining fetal and maternal chromosome segments from cell-free DNA
of the maternal blood sample comprising chromosome segments from
the one or more chromosomes of interest and chromosome segments
from the one or more reference chromosomes; hybridizing the fetal
and maternal chromosome segments with a plurality of probes
comprising one or more padlock probes and obtaining one or more
circularized probes by ligation, wherein the circularized probes
comprise sequences of the chromosome segments; removing
non-circularized probes by exonucleolysis; amplifying the
circularized probes; and measuring the amounts of chromosome
segments, wherein the measuring is performed irrespective of allele
value.
2. The method of claim 1, wherein the measuring comprises
measurement of alleles having 100% penetrance.
3. The method of claim 1, wherein the measuring comprises
quantitative measurements without making allele calls.
4. The method of claim 1, wherein the measuring comprises labeling
the amplification products with a detectable tag.
5. The method of claim 1, wherein the measuring comprises
hybridizing the amplification products with a fluorescent tag.
6. The method of claim 1, wherein the measuring comprises measuring
the amounts of chromosome segments from the one or more chromosomes
of interest and the amounts of chromosome segments from the one or
more reference chromosomes.
7. The method of claim 1, wherein the method further comprises
comparing the measured amounts of chromosome segments for the one
or more chromosomes of interest from the maternal blood sample with
the measured amounts of chromosome segments for the one or more
reference chromosomes.
8. The method of claim 1, wherein the method further comprises
comparing the measured amounts of chromosome segments for the one
or more chromosomes of interest from the maternal blood sample with
measured amounts of chromosome segments for the one or more
chromosomes of interest from maternal blood samples that are
disomic for the one or more chromosomes of interest.
9. The method of claim 1, wherein the measured amounts of
chromosome segments from each chromosome are combined into a single
measurement for each chromosome.
10. The method of claim 1, wherein the chromosome segments from the
one or more chromosomes of interest map to chromosome 13, 18,
and/or 21.
11. A method of detecting aneuploidy of one or more chromosomes of
interest in a fetus, the method comprising: obtaining fetal and
maternal chromosome segments from cell-free DNA of a maternal blood
sample comprising chromosome segments from the one or more
chromosomes of interest and chromosome segments from the one or
more reference chromosomes; hybridizing the fetal and maternal
chromosome segments with a plurality of probes comprising one or
more padlock probes and obtaining one or more circularized probes
by ligation, wherein the circularized probes comprise sequences of
the chromosome segments; removing non-circularized probes by
exonucleolysis; amplifying the circularized probes; measuring the
amounts of chromosome segments from the one or more chromosomes of
interest and the amounts of chromosome segments from the one or
more reference chromosomes, wherein the measuring is performed
irrespective of allele value; and detecting aneuploidy of the one
or more chromosomes of interest using the measured amounts of
chromosome segments from the one or more chromosomes of interest
and the measured amounts of chromosome segments from the one or
more reference chromosomes.
12. The method of claim 11, wherein the measuring comprises
measurement of alleles having 100% penetrance.
13. The method of claim 11, wherein the measuring comprises
quantitative measurements without making allele calls.
14. The method of claim 11, wherein the measuring comprises
labeling the amplification products with a detectable tag.
15. The method of claim 11, wherein the measuring comprises
hybridizing the amplification products with a fluorescent tag.
16. The method of claim 11, wherein the method further comprises
comparing the measured amounts of chromosome segments for the one
or more chromosomes of interest from the maternal blood sample with
measured amounts of chromosome segments for the one or more
chromosomes of interest from maternal blood samples that are
disomic for the one or more chromosomes of interest.
17. The method of claim 11, wherein the measured amounts of
chromosome segments from each chromosome are combined into a single
measurement for each chromosome.
18. The method of claim 11, wherein the chromosome segments from
the one or more chromosomes of interest map to chromosome 13, 18,
and/or 21.
19. The method of claim 11, wherein the method comprises detecting
aneuploidy at chromosome 13, 18, and/or 21.
20. The method of claim 11, wherein a statistical difference is
detected by computing a standard deviation and/or a mean value for
the measurement for each of the one or more chromosomes of interest
and for each of the one or more reference chromosomes, and
comparing the statistical difference to a threshold determined from
normal samples.
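As an illustration only, the comparison recited in claims 7, 8 and 20 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed method itself: the function names, the use of a ratio of mean amounts, the z-score form of the statistical difference, and the threshold value of 3.0 are all hypothetical choices made for the example, and the numbers are invented.

```python
from statistics import mean, stdev

def chromosome_ratio(interest_amounts, reference_amounts):
    """Combine per-segment amounts into one ratio of interest vs. reference.

    Assumption of this sketch: segment amounts are averaged per chromosome
    and the two averages are divided.
    """
    return mean(interest_amounts) / mean(reference_amounts)

def statistical_difference(test_ratio, normal_ratios):
    """Distance of the test ratio from normal samples, in standard deviations."""
    mu = mean(normal_ratios)
    sigma = stdev(normal_ratios)
    return (test_ratio - mu) / sigma

def exceeds_threshold(test_ratio, normal_ratios, z_threshold=3.0):
    """True if the test ratio differs from normal samples beyond the threshold.

    The threshold of 3.0 is a placeholder; in practice it would be
    determined from normal (disomic) samples.
    """
    return abs(statistical_difference(test_ratio, normal_ratios)) > z_threshold

# Hypothetical ratios observed in samples known to be disomic.
normal_ratios = [1.00, 1.01, 0.99, 1.02, 0.98]

# Hypothetical segment amounts for a test sample.
elevated = chromosome_ratio([110, 111, 109], [100, 101, 99])
typical = chromosome_ratio([101, 102, 100], [100, 101, 99])

print(exceeds_threshold(elevated, normal_ratios))
print(exceeds_threshold(typical, normal_ratios))
```

Because the measurement is a total amount per chromosome, the sketch never inspects allele values, which matches the "irrespective of allele value" limitation of claims 1 and 11.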
Description
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. application Ser.
No. 15/446,778, filed Mar. 1, 2017. U.S. application Ser. No.
15/446,778 is a continuation of U.S. application Ser. No.
13/949,212, filed Jul. 23, 2013. U.S. application Ser. No.
13/949,212 is a continuation of U.S. application Ser. No.
12/076,348, now U.S. Pat. No. 8,515,679, filed Mar. 17, 2008. U.S.
application Ser. No. 12/076,348, now U.S. Pat. No. 8,515,679, is a
continuation-in-part of U.S. application Ser. No. 11/496,982, now
abandoned, filed Jul. 31, 2006; a continuation-in-part of U.S.
application Ser. No. 11/603,406, now U.S. Pat. No. 8,532,930, filed
Nov. 22, 2006; and a continuation-in-part of U.S. application Ser.
No. 11/634,550, filed Dec. 6, 2006, now abandoned; and claims the
benefit of U.S. Provisional Application No. 60/918,292, filed Mar.
16, 2007; U.S. Provisional Application No. 60/926,198, filed Apr.
25, 2007; U.S. Provisional Application No. 60/932,456, filed May
31, 2007; U.S. Provisional Application No. 60/934,440, filed Jun.
13, 2007; U.S. Provisional Application No. 61/003,101, filed Nov.
13, 2007; and U.S. Provisional Application No. 61/008,637, filed
Dec. 21, 2007. U.S. application Ser. No. 11/634,550, now abandoned,
claims the benefit of U.S. Provisional Application No. 60/742,305,
filed Dec. 6, 2005; U.S. Provisional Application No. 60/754,396,
filed Dec. 29, 2005; U.S. Provisional Application No. 60/774,976,
filed Feb. 21, 2006; U.S. Provisional Application No. 60/789,506,
filed Apr. 4, 2006; U.S. Provisional Application No. 60/817,741,
filed Jun. 30, 2006; and U.S. Provisional Application No.
60/846,610, filed Sep. 22, 2006. U.S. application Ser. No.
11/603,406, now U.S. Pat. No. 8,532,930, is a continuation-in-part
of U.S. application Ser. No. 11/496,982, now abandoned, filed Jul.
31, 2006; and also claims the benefit of U.S. Provisional
Application No. 60/846,610, filed Sep. 22, 2006. U.S. application
Ser. No. 11/496,982, now abandoned, claims the benefit of U.S.
Provisional Application No. 60/703,415, filed Jul. 29, 2005; U.S.
Provisional Application No. 60/742,305, filed Dec. 6, 2005; U.S.
Provisional Application No. 60/754,396, filed Dec. 29, 2005; U.S.
Provisional Application No. 60/774,976, filed Feb. 21, 2006; U.S.
Provisional Application No. 60/789,506, filed Apr. 4, 2006; and
U.S. Provisional Application No. 60/817,741, filed Jun. 30, 2006.
Each of these applications cited above is hereby incorporated by
reference in its entirety.
BACKGROUND OF THE INVENTION
Field of the Invention
[0003] The invention relates generally to the field of acquiring,
manipulating and using genetic data for medically predictive
purposes, and specifically to a system in which imperfectly
measured genetic data of a target individual are made more accurate
by using known genetic data of genetically related individuals,
thereby allowing more effective identification of genetic
variations, specifically aneuploidy and disease linked genes, that
could result in various phenotypic outcomes.
Description of the Related Art
[0004] In 2006, across the globe, roughly 800,000 in vitro
fertilization (IVF) cycles were run. Of the roughly 150,000 cycles
run in the US, about 10,000 involved pre-implantation genetic
diagnosis (PGD). Current PGD techniques are unregulated, expensive
and highly unreliable: error rates for screening disease-linked
loci or aneuploidy are on the order of 10%, each screening test
costs roughly $5,000, and a couple is forced to choose between
testing for aneuploidy, which afflicts roughly 50% of IVF embryos, and
screening for disease-linked loci on the single cell. There is a
great need for an affordable technology that can reliably determine
genetic data from a single cell in order to screen in parallel for
aneuploidy, monogenic diseases such as Cystic Fibrosis, and
susceptibility to complex disease phenotypes for which the multiple
genetic markers are known through whole-genome association
studies.
[0005] Most PGD today focuses on high-level chromosomal
abnormalities such as aneuploidy and balanced translocations with
the primary outcomes being successful implantation and a take-home
baby. The other main focus of PGD is for genetic disease screening,
with the primary outcome being a healthy baby not afflicted with a
genetically heritable disease for which one or both parents are
carriers. In both cases, the likelihood of the desired outcome is
enhanced by excluding genetically suboptimal embryos from transfer
and implantation in the mother.
[0006] The process of PGD during IVF currently involves extracting
a single cell from the roughly eight cells of an early-stage embryo
for analysis. Isolation of single cells from human embryos, while
highly technical, is now routine in IVF clinics. Both polar bodies
and blastomeres have been isolated with success. The most common
technique is to remove single blastomeres from day 3 embryos (6 or
8 cell stage). Embryos are transferred to a special cell culture
medium (standard culture medium lacking calcium and magnesium), and
a hole is introduced into the zona pellucida using an acidic
solution, laser, or mechanical techniques. The technician then uses
a biopsy pipette to remove a single blastomere with a visible
nucleus. Features of the DNA of the single (or occasionally
multiple) blastomere are measured using a variety of techniques.
Since only a single copy of the DNA is available from one cell,
direct measurements of the DNA are highly error-prone, or noisy.
There is a great need for a technique that can correct, or make
more accurate, these noisy genetic measurements.
[0007] Normal humans have two sets of 23 chromosomes in every
diploid cell, with one copy coming from each parent. Aneuploidy,
the state of a cell with extra or missing chromosome(s), and
uniparental disomy, the state of a cell with two of a given
chromosome both of which originate from one parent, are believed to
be responsible for a large percentage of failed implantations and
miscarriages, and some genetic diseases. When only certain cells in
an individual are aneuploid, the individual is said to exhibit
mosaicism. Detection of chromosomal abnormalities can identify
individuals or embryos with conditions such as Down syndrome,
Klinefelter's syndrome, and Turner syndrome, among others, in
addition to increasing the chances of a successful pregnancy.
Testing for chromosomal abnormalities is especially important as
the age of a potential mother increases: between the ages of 35 and
40 it is estimated that between 40% and 50% of the embryos are
abnormal, and above the age of 40, more than half of the embryos
are likely to be abnormal. The main cause of aneuploidy is
nondisjunction during meiosis. Maternal nondisjunction constitutes
88% of all nondisjunction, of which 65% occurs in meiosis I and 23%
in meiosis II. Common types of human aneuploidy include trisomy
from meiosis I nondisjunction, monosomy, and uniparental disomy. In
a particular type of trisomy that arises in meiosis II
nondisjunction, or M2 trisomy, an extra chromosome is identical to
one of the two normal chromosomes. M2 trisomy is particularly
difficult to detect. There is a great need for a better method that
can detect many or all types of aneuploidy at most or all of
the chromosomes efficiently and with high accuracy.
[0008] Karyotyping, the traditional method used for the prediction
of aneuploidy and mosaicism, is giving way to other more
high-throughput, more cost effective methods such as Flow Cytometry
(FC) and fluorescent in situ hybridization (FISH). Currently, the
vast majority of prenatal diagnoses use FISH, which can determine
large chromosomal aberrations, and PCR/electrophoresis, which
can determine a handful of SNPs or other allele calls. One
advantage of FISH is that it is less expensive than karyotyping,
but the technique is complex and expensive enough that generally a
small selection of chromosomes is tested (usually chromosomes 13,
18, 21, X, Y; also sometimes 8, 9, 15, 16, 17, 22); in addition,
FISH has a low level of specificity. Roughly seventy-five percent
of PGD today measures high-level chromosomal abnormalities such as
aneuploidy using FISH with error rates on the order of 10-15%.
There is a great demand for an aneuploidy screening method that has
a higher throughput, lower cost, and greater accuracy.
[0009] The number of known disease associated genetic alleles is
currently at 389 according to OMIM and steadily climbing.
Consequently, it is becoming increasingly relevant to analyze
multiple positions on the embryonic DNA, or loci, that are
associated with particular phenotypes. A clear advantage of
pre-implantation genetic diagnosis over prenatal diagnosis is that
it avoids some of the ethical issues regarding possible choices of
action once undesirable phenotypes have been detected. A need
exists for a method for more extensive genotyping of embryos at the
pre-implantation stage.
[0010] There are a number of advanced technologies that enable the
diagnosis of genetic aberrations at one or a few loci at the
single-cell level. These include interphase chromosome conversion,
comparative genomic hybridization, fluorescent PCR, mini-sequencing
and whole genome amplification. The reliability of the data
generated by all of these techniques relies on the quality of the
DNA preparation. Better methods for the preparation of single-cell
DNA for amplification and PGD are therefore needed and are under
study. All genotyping techniques, when used on single cells, small
numbers of cells, or fragments of DNA, suffer from integrity
issues, most notably allele drop out (ADO). This is exacerbated in
the context of in-vitro fertilization since the efficiency of the
hybridization reaction is low, and the technique must operate
quickly in order to genotype the embryo within the time period of
maximal embryo viability. There exists a great need for a method
that alleviates the problem of a high ADO rate when measuring
genetic data from one or a small number of cells, especially when
time constraints exist.
[0011] Listed here is a set of prior art which is related to the
field of the current invention. None of this prior art contains or
in any way refers to the novel elements of the current invention.
In U.S. Pat. No. 6,489,135 Parrott et al. provide methods for
determining various biological characteristics of in vitro
fertilized embryos, including overall embryo health,
implantability, and increased likelihood of developing successfully
to term by analyzing media specimens of in vitro fertilization
cultures for levels of bioactive lipids in order to determine these
characteristics. In US Patent Application 20040033596 Threadgill et
al. describe a method for preparing homozygous cellular libraries
useful for in vitro phenotyping and gene mapping involving
site-specific mitotic recombination in a plurality of isolated
parent cells. In U.S. Pat. No. 5,635,366 Cooke et al. provide a
method for predicting the outcome of IVF by determining the level
of 11β-hydroxysteroid dehydrogenase (11β-HSD) in a
biological sample from a female patient. In U.S. Pat. No. 7,058,517
Denton et al. describe a method wherein an individual's haplotypes
are compared to a known database of haplotypes in the general
population to predict clinical response to a treatment. In U.S.
Pat. No. 7,035,739 Schadt et al. describe a method
wherein a genetic marker map is constructed and the individual
genes and traits are analyzed to give gene-trait locus data,
which are then clustered as a way to identify genetically
interacting pathways, which are validated using multivariate
analysis. In US Patent Application US 2004/0137470 A1, Dhallan et
al. describe using primers especially selected so as to improve the
amplification rate, and detection of, a large number of pertinent
disease related loci, and a method of more efficiently quantitating
the absence, presence and/or amount of each of those genes. In
World Patent Application WO 03/031646, Findlay et al. describe a
method to use an improved selection of genetic markers such that
amplification of the limited amount of genetic material will give
more uniformly amplified material, and it can be genotyped with
higher fidelity.
[0012] Current methods of prenatal diagnosis can alert physicians
and parents to abnormalities in growing fetuses. Without prenatal
diagnosis, one in 50 babies is born with serious physical or mental
handicap, and as many as one in 30 will have some form of
congenital malformation. Unfortunately, standard methods require
invasive testing and carry a roughly 1 percent risk of miscarriage.
These methods include amniocentesis, chorion villus biopsy and
fetal blood sampling. Of these, amniocentesis is the most common
procedure; in 2003, it was performed in approximately 3% of all
pregnancies, though its frequency of use has been decreasing over
the past decade and a half. A major drawback of prenatal diagnosis
is that given the limited courses of action once an abnormality has
been detected, it is only valuable and ethical to test for very
serious defects. As a result, prenatal diagnosis is typically only
attempted in cases of high-risk pregnancies, where the elevated
chance of a defect combined with the seriousness of the potential
abnormality outweighs the risks. A need exists for a method of
prenatal diagnosis that mitigates these risks.
[0013] It has recently been discovered that cell-free fetal DNA and
intact fetal cells can enter maternal blood circulation.
Consequently, analysis of these cells can allow early Non-Invasive
Prenatal Genetic Diagnosis (NIPGD). A key challenge in using NIPGD
is the task of identifying and extracting fetal cells or nucleic
acids from the mother's blood. The fetal cell concentration in
maternal blood depends on the stage of pregnancy and the condition
of the fetus, but estimates range from one to forty fetal cells in
every milliliter of maternal blood, or less than one fetal cell per
100,000 maternal nucleated cells. Current techniques are able to
isolate small quantities of fetal cells from the mother's blood,
although it is very difficult to enrich the fetal cells to purity
in any quantity. The most effective technique in this context
involves the use of monoclonal antibodies, but other techniques
used to isolate fetal cells include density centrifugation,
selective lysis of adult erythrocytes, and FACS. Fetal DNA
isolation has been demonstrated using PCR amplification using
primers with fetal-specific DNA sequences. Since only tens of
molecules of each embryonic SNP are available through these
techniques, the genotyping of the fetal tissue with high fidelity
is not currently possible.
[0014] Much research has been done towards the use of
pre-implantation genetic diagnosis (PGD) as an alternative to
classical prenatal diagnosis of inherited disease. Most PGD today
focuses on high-level chromosomal abnormalities such as aneuploidy
and balanced translocations with the primary outcomes being
successful implantation and a take-home baby. A need exists for a
method for more extensive genotyping of embryos at the
pre-implantation stage. The number of known disease associated
genetic alleles is currently at 389 according to OMIM and steadily
climbing. Consequently, it is becoming increasingly relevant to
analyze multiple embryonic SNPs that are associated with disease
phenotypes. A clear advantage of pre-implantation genetic diagnosis
over prenatal diagnosis is that it avoids some of the ethical
issues regarding possible choices of action once undesirable
phenotypes have been detected.
[0015] Many techniques exist for isolating single cells. The FACS
machine has a variety of applications; one important application is
to discriminate between cells based on size, shape and overall DNA
content. The FACS machine can be set to sort single cells into any
desired container. Many different groups have used single cell DNA
analysis for a number of applications, including prenatal genetic
diagnosis, recombination studies, and analysis of chromosomal
imbalances. Single-sperm genotyping has been used previously for
forensic analysis of sperm samples (to decrease problems arising
from mixed samples) and for single-cell recombination studies.
[0016] Isolation of single cells from human embryos, while highly
technical, is now routine in in vitro fertilization clinics. To
date, the vast majority of prenatal diagnoses have used fluorescent
in situ hybridization (FISH), which can determine large chromosomal
aberrations (such as Down syndrome, or trisomy 21) and
PCR/electrophoresis, which can determine a handful of SNPs or other
allele calls. Both polar bodies and blastomeres have been isolated
with success. It is critical to isolate single blastomeres without
compromising embryonic integrity. The most common technique is to
remove single blastomeres from day 3 embryos (6 or 8 cell stage).
Embryos are transferred to a special cell culture medium (standard
culture medium lacking calcium and magnesium), and a hole is
introduced into the zona pellucida using an acidic solution, laser,
or mechanical drilling. The technician then uses a biopsy pipette
to remove a single visible nucleus. Clinical studies have
demonstrated that this process does not decrease implantation
success, since at this stage embryonic cells are
undifferentiated.
[0017] There are three major methods available for whole genome
amplification (WGA): ligation-mediated PCR (LM-PCR), degenerate
oligonucleotide primer PCR (DOP-PCR), and multiple displacement
amplification (MDA). In LM-PCR, short DNA sequences called adapters
are ligated to blunt ends of DNA. These adapters contain universal
amplification sequences, which are used to amplify the DNA by PCR.
In DOP-PCR, random primers that also contain universal
amplification sequences are used in a first round of annealing and
PCR. Then, a second round of PCR is used to amplify the sequences
further with the universal primer sequences. Finally, MDA uses the
phi-29 polymerase, which is a highly processive and non-specific
enzyme that replicates DNA and has been used for single-cell
analysis. Of the three methods, DOP-PCR reliably produces large
quantities of DNA from small quantities of DNA, including single
copies of chromosomes. On the other hand, MDA is the fastest
method, producing hundred-fold amplification of DNA in a few hours.
The major limitations to amplifying material from a single cell
are (1) the necessity of using extremely dilute DNA concentrations or
an extremely small volume of reaction mixture, and (2) the difficulty of
reliably dissociating DNA from proteins across the whole genome.
Regardless, single-cell whole genome amplification has been used
successfully for a variety of applications for a number of
years.
[0018] There are numerous difficulties in using DNA amplification
in these contexts. Amplification of single-cell DNA (or DNA from a
small number of cells, or from smaller amounts of DNA) by PCR can
fail completely, as reported in 5-10% of the cases. This is often
due to contamination of the DNA, the loss of the cell, its DNA, or
accessibility of the DNA during the PCR reaction. Other sources of
error that may arise in measuring the embryonic DNA by
amplification and microarray analysis include transcription errors
introduced by the DNA polymerase where a particular nucleotide is
incorrectly copied during PCR, and microarray reading errors due to
imperfect hybridization on the array. The biggest problem, however,
remains allele drop-out (ADO), defined as the failure to amplify one
of the two alleles in a heterozygous cell. ADO can affect up to
more than 40% of amplifications and has already caused PGD
misdiagnoses. ADO becomes a health issue especially in the case of
a dominant disease, where the failure to amplify can lead to
implantation of an affected embryo. The need for more than one set
of primers for each marker (in heterozygotes) complicates the PCR
process. Therefore, more reliable PCR assays are being developed
based on understanding the ADO origin. Reaction conditions for
single-cell amplifications are under study. The amplicon size, the
amount of DNA degradation, freezing and thawing, and the PCR
program and conditions can each influence the rate of ADO.
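The effect of allele drop-out described above can be illustrated with a small simulation. This is a hedged sketch, not a model taken from this application: it assumes that, with probability `ado_rate`, exactly one of the two alleles (chosen at random) fails to amplify and the cell is then read out as an apparent homozygote. The function names and the 40% rate used in the usage lines are illustrative assumptions.

```python
import random

def amplify(genotype, ado_rate, rng):
    """Simulate amplifying a two-allele genotype from a single cell.

    With probability `ado_rate`, one allele (chosen at random) fails to
    amplify, so only the surviving allele is observed.
    """
    first, second = genotype
    if rng.random() < ado_rate:
        surviving = rng.choice((first, second))
        return (surviving, surviving)  # looks homozygous to the assay
    return (first, second)

def miscall_fraction(genotype, ado_rate, trials, seed=0):
    """Fraction of simulated measurements that differ from the true genotype."""
    rng = random.Random(seed)
    wrong = sum(
        amplify(genotype, ado_rate, rng) != genotype for _ in range(trials)
    )
    return wrong / trials

# A heterozygous cell is miscalled whenever drop-out occurs...
print(miscall_fraction(("A", "a"), ado_rate=0.4, trials=20000))
# ...whereas a homozygous cell still reads correctly after losing one copy.
print(miscall_fraction(("A", "A"), ado_rate=0.4, trials=20000))
```

Under this simplified model, a heterozygous carrier of a dominant disease allele appears unaffected in roughly half of the drop-out events, which is why ADO is singled out as a health issue for dominant diseases.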
[0019] All those techniques, however, depend on the minute DNA
amount available for amplification in the single cell. This process
is often accompanied by contamination. Proper sterile conditions
and microsatellite sizing can exclude the chance of contaminant DNA,
since a microsatellite analysis that detects only parental alleles
excludes contamination. Studies to reliably transfer molecular
diagnostic protocols to the single-cell level have been recently
pursued using first-round multiplex PCR of microsatellite markers,
followed by real-time PCR and microsatellite sizing to exclude
chance contamination. Multiplex PCR allows for the amplification of
multiple fragments in a single reaction, a crucial requirement in
the single-cell DNA analysis. Although conventional PCR was the
first method used in PGD, fluorescence in situ hybridization (FISH)
is now common. It is a delicate visual assay that allows the
detection of nucleic acid within undisturbed cellular and tissue
architecture. It relies firstly on the fixation of the cells to be
analyzed. Consequently, optimization of the fixation and storage
condition of the sample is needed, especially for single-cell
suspensions.
[0020] Advanced technologies that enable the diagnosis of a number
of diseases at the single-cell level include interphase chromosome
conversion, comparative genomic hybridization (CGH), fluorescent
PCR, and whole genome amplification. The reliability of the data
generated by all of these techniques relies on the quality of the DNA
preparation. PGD is also costly; consequently, there is a need for
less expensive approaches, such as mini-sequencing. Unlike most
mutation-detection techniques, mini-sequencing permits analysis of
very small DNA fragments with low ADO rate. Better methods for the
preparation of single-cell DNA for amplification and PGD are
therefore needed and are under study. The more novel microarrays
and comparative genomic hybridization techniques still ultimately
rely on the quality of the DNA under analysis.
[0021] Several techniques are in development to measure multiple
SNPs on the DNA of a small number of cells, a single cell (for
example, a blastomere), a small number of chromosomes, or from
fragments of DNA. There are techniques that use Polymerase Chain
Reaction (PCR), followed by microarray genotyping analysis. Some
PCR-based techniques include whole genome amplification (WGA)
techniques such as multiple displacement amplification (MDA), and
Molecular Inversion Probes (MIPs) that perform genotyping using
multiple tagged oligonucleotides that may then be amplified using
PCR with a single pair of primers. An example of a non-PCR-based
technique is fluorescence in situ hybridization (FISH). It is
apparent that the techniques will be severely error-prone due to
the limited amount of genetic material, which will exacerbate the
impact of effects such as allele drop-outs, imperfect
hybridization, and contamination.
[0022] Many techniques exist which provide genotyping data. Taqman
is a unique genotyping technology produced and distributed by
Applied Biosystems. Taqman uses polymerase chain reaction (PCR) to
amplify sequences of interest. During PCR cycling, an allele
specific minor groove binder (MGB) probe hybridizes to amplified
sequences. Strand synthesis by the polymerase enzymes releases
reporter dyes linked to the MGB probes, and then the Taqman optical
readers detect the dyes. In this manner, Taqman achieves
quantitative allelic discrimination. Compared with array based
genotyping technologies, Taqman is quite expensive per reaction
($0.40/reaction), and throughput is relatively low (384 genotypes
per run). While only 1 ng of DNA per reaction is necessary,
thousands of genotypes by Taqman require microgram quantities of
DNA, so Taqman does not necessarily use less DNA than microarrays.
However, with respect to the IVF genotyping workflow, Taqman is the
most readily applicable technology. This is due to the high
reliability of the assays and, most importantly, the speed and ease
of the assay (~3 hours per run and minimal molecular
biological steps). Also, unlike many array technologies (such as
Affymetrix 500K arrays), Taqman is highly customizable, which is
important for the IVF market. Further, Taqman is highly
quantitative, so aneuploidies could be detected with this
technology alone.
[0023] Illumina has recently emerged as a leader in high-throughput
genotyping. Unlike Affymetrix, Illumina genotyping arrays do not
rely exclusively on hybridization. Instead, Illumina technology
uses an allele-specific DNA extension step, which is much more
sensitive and specific than hybridization alone, for the original
sequence detection. Then, all of these alleles are amplified in
multiplex by PCR, and these products are then hybridized to bead
arrays. The beads on these arrays contain unique "address" tags,
not native sequence, so this hybridization is highly specific and
sensitive. Alleles are then called by quantitative scanning of the
bead arrays. The Illumina Golden Gate assay system genotypes up to
1536 loci concurrently, so the throughput is better than Taqman but
not as high as Affymetrix 500K arrays. The cost of Illumina
genotypes is lower than Taqman, but higher than Affymetrix arrays.
Also, the Illumina platform takes as long to complete as the
Affymetrix 500K arrays (up to 72 hours), which is problematic for IVF
genotyping. However, Illumina has a much better call rate, and the
assay is quantitative, so aneuploidies are detectable with this
technology. Illumina technology is much more flexible in choice of
SNPs than Affymetrix 500K arrays.
[0024] One of the highest throughput techniques, which allows for
the measurement of up to 250,000 SNPs at a time, is the Affymetrix
GeneChip 500K genotyping array. This technique also uses PCR,
followed by hybridization of the amplified DNA sequences to DNA
probes, chemically synthesized at different locations on a quartz
surface, and detection of the hybridized sequences. The disadvantages
of these arrays are their low flexibility and lower sensitivity. There are
modified approaches that can increase selectivity, such as the
"perfect match" and "mismatch probe" approaches, but these do so at
the cost of the number of SNP calls per array.
[0025] Pyrosequencing, or sequencing by synthesis, can also be used
for genotyping and SNP analysis. The main advantages to
pyrosequencing include an extremely fast turnaround and unambiguous
SNP calls; however, the assay is not currently conducive to
high-throughput parallel analysis. PCR followed by gel
electrophoresis is an exceedingly simple technique that has met the
most success in preimplantation diagnosis. In this technique,
researchers use nested PCR to amplify short sequences of interest.
Then, they run these DNA samples on a special gel to visualize the
PCR products. Different bases have different molecular weights, so
one can determine base content based on how fast the product runs
in the gel. This technique is low-throughput and requires
subjective analyses by scientists using current technologies, but
has the advantage of speed (1-2 hours of PCR, 1 hour of gel
electrophoresis). For this reason, it has been used previously for
prenatal genotyping for a myriad of diseases, including:
thalassaemia, neurofibromatosis type 2, leukocyte adhesion
deficiency type I, Hallopeau-Siemens disease, sickle-cell anemia,
retinoblastoma, Pelizaeus-Merzbacher disease, Duchenne muscular
dystrophy, and Currarino syndrome.
[0026] Another promising technique that has been developed for
genotyping small quantities of genetic material with very high
fidelity is Molecular Inversion Probes (MIPs), such as Affymetrix's
GenFlex Arrays. This technique has the capability to measure
multiple SNPs in parallel: more than 10,000 SNPs measured in
parallel have been verified. For small quantities of genetic
material, call rates for this technique have been established at
roughly 95%, and accuracy of the calls made has been established to
be above 99%. So far, the technique has been implemented for
quantities of genomic data as small as 150 molecules for a given
SNP. However, the technique has not been verified for genomic data
from a single cell, or a single strand of DNA, as would be required
for pre-implantation genetic diagnosis.
[0027] The MIP technique makes use of padlock probes which are
linear oligonucleotides whose two ends can be joined by ligation
when they hybridize to immediately adjacent target sequences of
DNA. After the probes have hybridized to the genomic DNA, a
gap-fill enzyme is added to the assay which can add one of the four
nucleotides to the gap. If the added nucleotide (A,C,T,G) is
complementary to the SNP under measurement, then it will hybridize
to the DNA, and join the ends of the padlock probe by ligation. The
circular products, or closed padlock probes, are then
differentiated from linear probes by exonucleolysis. The
exonuclease, by breaking down the linear probes and leaving the
circular probes, will change the relative concentrations of the
closed vs. the unclosed probes by a factor of 1000 or more. The
probes that remain are then opened at a cleavage site by another
enzyme, removed from the DNA, and amplified by PCR. Each probe is
tagged with a different tag sequence consisting of 20 base tags
(16,000 have been generated), and can be detected, for example, by
the Affymetrix GenFlex Tag Array. The presence of the tagged probe
from a reaction in which a particular gap-fill nucleotide was added
indicates the presence of the complementary nucleotide at the
relevant SNP.
[0028] The molecular biological advantages of MIPs include: (1)
multiplexed genotyping in a single reaction, (2) the genotype
"call" occurs by gap fill and ligation, not hybridization, and (3)
hybridization to an array of universal tags decreases false
positives inherent to most array hybridizations. In traditional
500K, TaqMan and other genotyping arrays, the entire genomic sample
is hybridized to the array, which contains a variety of perfect
match and mismatch probes, and an algorithm calls likely genotypes
based on the intensities of the mismatch and perfect match probes.
Hybridization, however, is inherently noisy, because of the
complexities of the DNA sample and the huge number of probes on the
arrays. MIPs, on the other hand, use multiplex probes (i.e., not
on an array) that are longer and therefore more specific, and then
use a robust ligation step to circularize the probe. Background is
exceedingly low in this assay (due to specificity), though allele
dropout may be high (due to poorly performing probes).
[0029] When this technique is used on genomic data from a single
cell (or small numbers of cells), it will, like PCR-based approaches,
suffer from integrity issues. For example, the inability of the
padlock probe to hybridize to the genomic DNA will cause allele
dropouts. This will be exacerbated in the context of in-vitro
fertilization since the efficiency of the hybridization reaction is
low, and it needs to proceed relatively quickly in order to
genotype the embryo in a limited time period. Note that the
duration of the hybridization reaction can be reduced well below
vendor-recommended levels, and micro-fluidic techniques may also be
used to accelerate
the hybridization reaction. These approaches to reducing the time
for the hybridization reaction will result in reduced data
quality.
[0030] Once the genetic data has been measured, the next step is to
use the data for predictive purposes. Much research has been done
in predictive genomics, which tries to understand the precise
functions of proteins, RNA and DNA so that phenotypic predictions
can be made based on genotype. Canonical techniques focus on the
function of Single-Nucleotide Polymorphisms (SNPs), but more
advanced methods are being brought to bear on multi-factorial
phenotypic features. These methods include techniques, such as
linear regression and nonlinear neural networks, which attempt to
determine a mathematical relationship between a set of genetic and
phenotypic predictors and a set of measured outcomes. There is also
a set of regression analysis techniques, such as Ridge regression,
log regression and stepwise selection, that are designed to
accommodate sparse data sets where there are many potential
predictors relative to the number of outcomes, as is typical of
genetic data, and which apply additional constraints on the
regression parameters so that a meaningful set of parameters can be
resolved even when the data is underdetermined. Other techniques
apply principal component analysis to extract information from
underdetermined data sets. Other techniques, such as decision trees
and contingency tables, use strategies for subdividing subjects
based on their independent variables in order to place subjects in
categories or bins for which the phenotypic outcomes are similar. A
recent technique, termed logical regression, describes a method to
search for different logical interrelationships between categorical
independent variables in order to model a variable that depends on
interactions between multiple independent variables related to
genetic data. Regardless of the method used, the quality of the
prediction is naturally highly dependent on the quality of the
genetic data used to make the prediction.
[0031] Normal humans have two sets of 23 chromosomes in every
diploid cell, with one set coming from each parent. Aneuploidy, in
which a cell has extra or missing chromosomes, and uniparental
disomy, in which a cell has two copies of a given chromosome that
both originate from one parent, are believed to be responsible for a
large percentage of
failed implantations, miscarriages, and genetic diseases. When only
certain cells in an individual are aneuploid, the individual is
said to exhibit mosaicism. Detection of chromosomal abnormalities
can identify individuals or embryos with conditions such as Down
syndrome, Klinefelter's syndrome, and Turner syndrome, among others,
in addition to increasing the chances of a successful pregnancy.
Testing for chromosomal abnormalities is especially important as
mothers age: between the ages of 35 and 40 it is estimated that
between 40% and 50% of the embryos are abnormal, and above the age
of 40, more than half of the embryos are abnormal.
[0032] Karyotyping, the traditional method used for the prediction
of aneuploidies and mosaicism, is giving way to other
higher-throughput, more cost-effective methods. One approach that has
attracted much attention recently is flow cytometry (FC) and
fluorescence in situ hybridization (FISH), which can be used to
detect aneuploidy in any phase of the cell cycle. One advantage of
this method is that it is less expensive than karyotyping, but the
cost is significant enough that generally a small selection of
chromosomes are tested (usually chromosomes 13, 18, 21, X, Y; also
sometimes 8, 9, 15, 16, 17, 22); in addition, FISH has a low level
of specificity. Using FISH to analyze 15 cells, one can detect
mosaicism of 19% with 95% confidence. The reliability of the test
becomes much lower as the level of mosaicism gets lower, and as the
number of cells to analyze decreases. The test is estimated to have
a false negative rate as high as 15% when a single cell is
analyzed. There is a great demand for a method that has a higher
throughput, lower cost, and greater accuracy.
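The 15-cell figure quoted above is consistent with a simple sampling model. As a rough sketch only, assume that each analyzed cell is independently abnormal with probability equal to the mosaicism level, and that an abnormal cell is always recognized when it is analyzed; both are illustrative assumptions, not statements taken from the studies referred to. The function names are hypothetical. Under those assumptions the chance of observing at least one abnormal cell among n cells is 1 - (1 - p)^n:

```python
def detection_probability(mosaicism_level: float, cells_analyzed: int) -> float:
    """Probability that at least one analyzed cell is abnormal.

    Assumes independent sampling of cells and perfect per-cell detection
    (illustrative assumptions only).
    """
    return 1.0 - (1.0 - mosaicism_level) ** cells_analyzed


def cells_needed(mosaicism_level: float, confidence: float) -> int:
    """Smallest number of cells that reaches the requested confidence."""
    n = 1
    while detection_probability(mosaicism_level, n) < confidence:
        n += 1
    return n


# 19% mosaicism, 15 cells: roughly 0.958, i.e. about 95% confidence.
p_detect = detection_probability(0.19, 15)
```

Under the same assumptions, a single analyzed cell would reveal 19% mosaicism only 19% of the time, which illustrates why reliability falls as the number of cells decreases.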
[0033] Listed here is a set of prior art which is related to the
field of the current invention. None of this prior art contains or
in any way refers to the novel elements of the current invention.
In U.S. Pat. No. 6,720,140, Hartley et al. describe a
recombinational cloning method for moving or exchanging segments of
DNA molecules using engineered recombination sites and
recombination proteins. In U.S. Pat. No. 6,489,135 Parrott et al.
provide methods for determining various biological characteristics
of in vitro fertilized embryos, including overall embryo health,
implantability, and increased likelihood of developing successfully
to term by analyzing media specimens of in vitro fertilization
cultures for levels of bioactive lipids in order to determine these
characteristics. In US Patent Application 20040033596 Threadgill et
al. describe a method for preparing homozygous cellular libraries
useful for in vitro phenotyping and gene mapping involving
site-specific mitotic recombination in a plurality of isolated
parent cells. In U.S. Pat. No. 5,994,148 Stewart et al. describe a
method of determining the probability of an in vitro fertilization
(IVF) being successful by measuring Relaxin directly in the serum
or indirectly by culturing granulosa lutein cells extracted from
the patient as part of an IVF/ET procedure. In U.S. Pat. No.
5,635,366 Cooke et al. provide a method for predicting the outcome
of IVF by determining the level of 11β-hydroxysteroid
dehydrogenase (11β-HSD) in a biological sample from a female
patient. In U.S. Pat. No. 7,058,616 Larder et al. describe a method
for using a neural network to predict the resistance of a disease
to a therapeutic agent. In U.S. Pat. No. 6,958,211 Vingerhoets et
al. describe a method wherein the integrase genotype of a given HIV
strain is simply compared to a known database of HIV integrase
genotype with associated phenotypes to find a matching genotype. In
U.S. Pat. No. 7,058,517 Denton et al. describe a method wherein an
individual's haplotypes are compared to a known database of
haplotypes in the general population to predict clinical response
to a treatment. In U.S. Pat. No. 7,035,739 Schadt et al. describe a
method wherein a genetic marker map is constructed and
the individual genes and traits are analyzed to give gene-trait
locus data, which are then clustered as a way to identify
genetically interacting pathways, which are validated using
multivariate analysis. In U.S. Pat. No. 6,025,128 Veltri et al.
describe a method involving the use of a neural network utilizing a
collection of biomarkers as parameters to evaluate risk of prostate
cancer recurrence.
[0034] The cost of DNA sequencing is dropping rapidly, and in the
near future individual genomic sequencing for personal benefit will
become more common. Knowledge of personal genetic data will allow
for extensive phenotypic predictions to be made for the individual.
In order to make accurate phenotypic predictions, high-quality
genetic data is critical, whatever the context. In the case of
prenatal or pre-implantation genetic diagnoses, a complicating
factor is the relative paucity of genetic material available. Given
the inherently noisy nature of the measured genetic data in cases
where limited genetic material is used for genotyping, there is a
great need for a method which can increase the fidelity of, or
clean, the primary data.
SUMMARY OF THE INVENTION
[0035] The system disclosed enables the cleaning of incomplete or
noisy genetic data using secondary genetic data as a source of
information, and also the determination of chromosome copy number
using said genetic data. While the disclosure focuses on genetic
data from human subjects, and more specifically on as-yet not
implanted embryos or developing fetuses, as well as related
individuals, it should be noted that the methods disclosed apply to
the genetic data of a range of organisms, in a range of contexts.
The techniques described for cleaning genetic data are most
relevant in the context of pre-implantation diagnosis during
in-vitro fertilization, prenatal diagnosis in conjunction with
amniocentesis, chorionic villus biopsy, fetal tissue sampling, and
non-invasive prenatal diagnosis, where a small quantity of fetal
genetic material is isolated from maternal blood. The use of this
method may facilitate diagnoses focusing on inheritable diseases,
chromosome copy number predictions, increased likelihoods of
defects or abnormalities, as well as making predictions of
susceptibility to various disease- and non-disease phenotypes for
individuals to enhance clinical and lifestyle decisions. The
invention addresses the shortcomings of prior art that are
discussed above.
[0036] In one aspect of the invention, methods make use of
knowledge of the genetic data of the mother and the father such as
diploid tissue samples, sperm from the father, haploid samples from
the mother or other embryos derived from the mother's and father's
gametes, together with the knowledge of the mechanism of meiosis
and the imperfect measurement of the embryonic DNA, in order to
reconstruct, in silico, the embryonic DNA at the location of key
loci with a high degree of confidence. In one aspect of the
invention, genetic data derived from other related individuals,
such as other embryos, brothers and sisters, grandparents or other
relatives can also be used to increase the fidelity of the
reconstructed embryonic DNA. It is important to note that the
parental and other secondary genetic data allows the reconstruction
not only of SNPs that were measured poorly, but also of insertions,
deletions, and of SNPs or whole regions of DNA that were not
measured at all.
[0037] In one aspect of the invention, the fetal or embryonic
genomic data which has been reconstructed, with or without the use
of genetic data from related individuals, can be used to detect if
the cell is aneuploid, that is, where fewer or more than two copies
of a particular chromosome are present in a cell. The reconstructed
data can also be used to detect uniparental disomy, a condition in
which two of a given chromosome are present, both of which
originate from one parent. This is done by creating a set of
hypotheses about the potential states of the DNA, and testing to
see which hypothesis has the highest probability of being true
given the measured data. Note that the use of high throughput
genotyping data for screening for aneuploidy enables a single
blastomere from each embryo to be used both to measure multiple
disease-linked loci as well as to screen for aneuploidy.
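The hypothesis-testing step described above can be sketched as choosing the hypothesis with the largest posterior probability. The following is a minimal illustration and not the disclosed implementation; the hypothesis names, prior values, and likelihood values are hypothetical placeholders chosen only to make the example run.

```python
def posterior_over_hypotheses(priors, likelihoods):
    """Bayes' rule: P(H | data) is proportional to P(data | H) * P(H)."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("every hypothesis has zero probability for this data")
    return {h: value / total for h, value in unnormalized.items()}


def most_probable_hypothesis(priors, likelihoods):
    """Return the best-supported hypothesis and its posterior probability."""
    posterior = posterior_over_hypotheses(priors, likelihoods)
    best = max(posterior, key=posterior.get)
    return best, posterior[best]


# Hypothetical numbers, for illustration only.
priors = {"monosomy": 0.1, "disomy": 0.8, "trisomy": 0.1}
likelihoods = {"monosomy": 0.001, "disomy": 0.02, "trisomy": 0.3}
best, confidence = most_probable_hypothesis(priors, likelihoods)
```

The posterior probability of the selected hypothesis can serve as a confidence for the call, which is one reason to normalize rather than compare the raw products.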
[0038] In another aspect of the invention, the direct measurements
of the amount of genetic material, amplified or unamplified,
present at a plurality of loci, can be used to detect monosomy,
uniparental disomy, trisomy and other aneuploidy states. The idea
behind this method is that measuring the amount of genetic material
at multiple loci will give a statistically significant result.
[0039] In another aspect of the invention, the measurements, direct
or indirect, of a particular subset of SNPs, namely those loci
where the parents are both homozygous but with different allele
values, can be used to detect chromosomal abnormalities by
looking at the ratios of maternally versus paternally miscalled
homozygous loci on the embryo. The idea behind this method is that
those loci where each parent is homozygous, but for different
alleles, by definition result in heterozygous loci on the embryo.
Allele dropouts at those loci are random, and a shift in the ratio
of loci miscalled as homozygous can only be due to incorrect
chromosome number.
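One way to quantify such a shift is sketched below. It counts, at loci where the parents are homozygous for different alleles, how many embryo calls were miscalled as the maternal homozygote versus the paternal homozygote, and then applies an exact two-sided binomial test against an expected 50/50 split. The data layout, the function names, and the choice of a binomial test are assumptions made for illustration; the disclosure does not prescribe this particular test.

```python
from math import comb


def count_homozygous_miscalls(mother_alleles, father_alleles, embryo_calls):
    """Count embryo calls miscalled as maternal or paternal homozygotes.

    mother_alleles / father_alleles: one allele per locus ('A' or 'C'),
    valid because each parent is homozygous at the loci of interest.
    embryo_calls: two-letter genotype calls such as 'AC', or None if missing.
    """
    maternal = paternal = 0
    for m, f, call in zip(mother_alleles, father_alleles, embryo_calls):
        if m == f or call is None:
            continue  # keep only loci where the parents differ
        if call == m + m:
            maternal += 1
        elif call == f + f:
            paternal += 1
    return maternal, paternal


def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes that are
    no more likely than the observed count k out of n."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))
```

For example, if 10 loci were miscalled as homozygous and 9 of them matched the mother, `two_sided_binomial_p(9, 10)` gives about 0.021, suggesting a skew that random dropout alone would rarely produce.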
[0040] It will be recognized by a person of ordinary skill in the
art, given the benefit of this disclosure, that various aspects and
embodiments of this disclosure may be implemented in combination or
separately.
BRIEF DESCRIPTION OF THE DRAWINGS
[0041] FIG. 1: determining probability of false negatives and false
positives for different hypotheses.
[0042] FIG. 2: the results from a mixed female sample, all loci
hetero.
[0043] FIG. 3: the results from a mixed male sample, all loci
hetero.
[0044] FIG. 4: Ct measurements for male sample differenced from Ct
measurements for female sample.
[0045] FIG. 5: the results from a mixed female sample; Taqman
single dye.
[0046] FIG. 6: the results from a mixed male sample; Taqman single
dye.
[0047] FIG. 7: the distribution of repeated measurements for mixed
male sample.
[0048] FIG. 8: the results from a mixed female sample; qPCR
measures.
[0049] FIG. 9: the results from a mixed male sample; qPCR
measures.
[0050] FIG. 10: Ct measurements for male sample differenced from Ct
measurements for female sample.
[0051] FIG. 11: detecting aneuploidy with a third dissimilar
chromosome.
[0052] FIGS. 12A and 12B: an illustration of two amplification
distributions with constant allele dropout rate.
[0053] FIG. 13: a graph of the Gaussian probability density
function of alpha.
[0054] FIG. 14: matched filter and performance for 19 loci measured
with Taqman on 60 pg of DNA.
[0055] FIG. 15: matched filter and performance for 13 loci measured
with Taqman on 6 pg of DNA.
[0056] FIG. 16: matched filter and performance for 20 loci measured
with qPCR on 60 pg of DNA.
[0057] FIG. 17: matched filter and performance for 20 loci measured
with qPCR on 6 pg of DNA.
[0058] FIG. 18A: matched filter and performance for 20 loci using
MDA and Taqman on 16 single cells.
[0059] FIG. 18B: matched filter and performance for 11 loci on
Chromosome 7 and 13 loci on Chromosome X using MDA and Taqman on 15
single cells.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Conceptual Overview of the System
[0060] The goal of the disclosed system is to provide highly
accurate genomic data for the purpose of genetic diagnoses. In
cases where the genetic data of an individual contains a
significant amount of noise, or errors, the disclosed system makes
use of the expected similarities between the genetic data of the
target individual and the genetic data of related individuals, to
clean the noise in the target genome. This is done by determining
which segments of chromosomes of related individuals were involved
in gamete formation and, when necessary, where crossovers may have
occurred during meiosis, and therefore which segments of the
genomes of related individuals are expected to be nearly identical
to sections of the target genome. In certain situations this method
can be used to clean noisy base pair measurements on the target
individual, but it also can be used to infer the identity of
individual base pairs or whole regions of DNA that were not
measured. It can also be used to determine the number of copies of
a given chromosome segment in the target individual. In addition, a
confidence may be computed for each call made. A highly simplified
explanation is presented first, making unrealistic assumptions in
order to illustrate the concept of the invention. A detailed
statistical approach that can be applied to the technology of today
is presented afterward.
[0061] In one aspect of the invention, the target individual is an
embryo, and the purpose of applying the disclosed method to the
genetic data of the embryo is to allow a doctor or other agent to
make an informed choice of which embryo(s) should be implanted
during IVF. In another aspect of the invention, the target
individual is a fetus, and the purpose of applying the disclosed
method to genetic data of the fetus is to allow a doctor or other
agent to make an informed choice about possible clinical decisions
or other actions to be taken with respect to the fetus.
Definitions
[0062] SNP (Single Nucleotide Polymorphism): a single nucleotide
that may differ between the genomes of two members of the same
species. In our usage of the term, we do not set any limit on the
frequency with which each variant occurs.
[0063] To call a SNP: to make a decision about the true state of a
particular base pair, taking into account the direct and indirect
evidence.
[0064] Locus: a particular region of interest on the DNA of an
individual, which may refer to a SNP, the site of a possible
insertion or deletion, or the site of some other relevant genetic
variation. Disease-linked SNPs may also refer to disease-linked
loci.
[0065] To call an allele: to determine the state of a particular
locus of DNA. This may involve calling a SNP, or determining
whether or not an insertion or deletion is present at that locus,
or determining the number of insertions that may be present at that
locus, or determining whether some other genetic variant is present
at that locus.
[0066] Correct allele call: An allele call that correctly reflects
the true state of the actual genetic material of an individual.
[0067] To clean genetic data: to take imperfect genetic data and
correct some or all of the errors or fill in missing data at one or
more loci. In the context of this disclosure, this involves using
genetic data of related individuals and the method described
herein.
[0068] To increase the fidelity of allele calls: to clean genetic
data.
[0069] Imperfect genetic data: genetic data with any of the
following: allele dropouts, uncertain base pair measurements,
incorrect base pair measurements, missing base pair measurements,
uncertain measurements of insertions or deletions, uncertain
measurements of chromosome segment copy numbers, spurious signals,
missing measurements, other errors, or combinations thereof.
[0070] Noisy genetic data: imperfect genetic data, also called
incomplete genetic data.
[0071] Uncleaned genetic data: genetic data as measured, that is,
where no method has been used to correct for the presence of noise
or errors in the raw genetic data; also called crude genetic
data.
[0072] Confidence: the statistical likelihood that the called SNP,
allele, set of alleles, or determined number of chromosome segment
copies correctly represents the real genetic state of the
individual.
[0073] Parental Support (PS): a name sometimes used for any of
the methods disclosed herein, where the genetic information of
related individuals is used to determine the genetic state of
target individuals. In some cases, it refers specifically to the
allele calling method, sometimes to the method used for cleaning
genetic data, sometimes to the method to determine the number of
copies of a segment of a chromosome, and sometimes to some or all
of these methods used in combination.
[0074] Copy Number Calling (CNC): the name given to the method
described in this disclosure used to determine the number of
chromosome segments in a cell.
[0075] Qualitative CNC (also qCNC): the name given to the method in
this disclosure used to determine chromosome copy number in a cell
that makes use of qualitative measured genetic data of the target
individual and of related individuals.
[0076] Multigenic: affected by multiple genes, or alleles.
[0077] Direct relation: mother, father, son, or daughter.
[0078] Chromosomal Region: a segment of a chromosome, or a full
chromosome.
[0079] Segment of a Chromosome: a section of a chromosome that can
range in size from one base pair to the entire chromosome.
[0080] Section: a section of a chromosome. Section and segment can
be used interchangeably.
[0081] Chromosome: may refer to either a full chromosome, or also a
segment or section of a chromosome.
[0082] Copies: the number of copies of a chromosome segment may
refer to identical copies, or it may refer to non-identical copies
of a chromosome segment wherein the different copies of the
chromosome segment contain a substantially similar set of loci, and
where one or more of the alleles are different. Note that in some
cases of aneuploidy, such as the M2 copy error, it is possible to
have some copies of the given chromosome segment that are identical
as well as some copies of the same chromosome segment that are not
identical.
[0083] Haplotypic Data: also called `phased data` or `ordered
genetic data;` data from a single chromosome in a diploid or
polyploid genome, i.e., either the segregated maternal or paternal
copy of a chromosome in a diploid genome.
[0084] Unordered Genetic Data: pooled data derived from
measurements on two or more chromosomes in a diploid or polyploid
genome, i.e., both the maternal and paternal copies of a chromosome
in a diploid genome.
[0085] Genetic data `in`, `of`, `at` or `on` an individual: These
phrases all refer to the data describing aspects of the genome of
an individual. It may refer to one or a set of loci, partial or
entire sequences, partial or entire chromosomes, or the entire
genome.
[0086] Hypothesis: a set of possible copy numbers of a given set of
chromosomes, or a set of possible genotypes at a given set of loci.
The set of possibilities may contain one or more elements.
[0087] Target Individual: the individual whose genetic data is
being determined. Typically, only a limited amount of DNA is
available from the target individual. In one context, the target
individual is an embryo or a fetus.
[0088] Related Individual: any individual who is genetically
related, and thus shares haplotype blocks, with the target
individual.
[0089] Platform response: a mathematical characterization of the
input/output characteristics of a genetic measurement platform,
such as TAQMAN or INFINIUM. The input to the channel is the true
underlying genotypes of the genetic loci being measured. The
channel output could be allele calls (qualitative) or raw numerical
measurements (quantitative), depending on the context. For example,
in the case in which the platform's raw numeric output is reduced
to qualitative genotype calls, the platform response consists of an
error transition matrix that describes the conditional probability
of seeing a particular output genotype call given a particular true
genotype input. In the case in which the platform's output is left
as raw numeric measurements, the platform response is a conditional
probability density function that describes the probability of the
numeric outputs given a particular true genotype input.
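For the qualitative case, the platform response can be pictured as a small lookup table. The sketch below uses a hypothetical error transition matrix over the genotypes AA, AC, and CC; the numeric values are invented for illustration and do not describe any real platform. It also shows how such a matrix could be combined with a prior to obtain a probability for each true genotype given an observed call.

```python
GENOTYPES = ("AA", "AC", "CC")

# Hypothetical error transition matrix: TRANSITION[true][call] is
# P(output call | true genotype). Each row sums to 1.
TRANSITION = {
    "AA": {"AA": 0.95, "AC": 0.04, "CC": 0.01},
    "AC": {"AA": 0.15, "AC": 0.70, "CC": 0.15},
    "CC": {"AA": 0.01, "AC": 0.04, "CC": 0.95},
}


def call_probability(true_genotype, observed_call):
    """Conditional probability of seeing a call given the true genotype."""
    return TRANSITION[true_genotype][observed_call]


def genotype_posterior(observed_call, prior):
    """P(true genotype | observed call) via Bayes' rule for a given prior."""
    unnormalized = {
        g: prior[g] * call_probability(g, observed_call) for g in GENOTYPES
    }
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}


uniform_prior = {g: 1 / 3 for g in GENOTYPES}
posterior_given_AA = genotype_posterior("AA", uniform_prior)
```

In this made-up matrix the heterozygous row places substantial weight on homozygous calls, which is how allele dropout would appear in a platform response of this form.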
[0090] Copy number hypothesis: a hypothesis about how many copies
of a particular chromosome segment are in the embryo. In a
preferred embodiment, this hypothesis consists of a set of
sub-hypotheses about how many copies of this chromosome segment
were contributed by each related individual to the target
individual.
Technical Description of the System
A Allele Calling: Preferred Method
[0091] Assume here that the goal is to estimate the genetic data of an
embryo as accurately as possible, and that the estimate is derived
from measurements taken from the embryo, father, and mother across
the same set of n SNPs. Note that where this description refers to
SNPs, it may also refer to a locus where any genetic variation,
such as a point mutation, insertion or deletion may be present.
This allele calling method is part of the Parental Support (PS)
system. One way to increase the fidelity of allele calls in the
genetic data of a target individual for the purposes of making
clinically actionable predictions is described here. It should be
obvious to one skilled in the art how to modify the method for use
in contexts where the target individual is not an embryo, where
genetic data from only one parent is available, where neither, one
or both of the parental haplotypes are known, or where genetic data
from other related individuals is known and can be
incorporated.
[0092] For the purposes of this discussion, only consider SNPs that
admit two allele values; without loss of generality it is possible
to assume that the allele values on all SNPs belong to the alphabet
A={A,C}. It is also assumed that the errors on the measurements of
each of the SNPs are independent. This assumption is reasonable
when the SNPs being measured are from sufficiently distant genic
regions. Note that one could incorporate information about
haplotype blocks or other techniques to model correlation between
measurement errors on SNPs without changing the fundamental
concepts of this invention.
[0093] Let e=(e.sub.1,e.sub.2) be the true, unknown, ordered SNP
information on the embryo, e.sub.1,e.sub.2 .di-elect cons. A.sup.n. Define e.sub.1
to be the genetic haploid information inherited from the father and
e.sub.2 to be the genetic haploid information inherited from the
mother. Also use e.sub.i=(e.sub.1i,e.sub.2i) to denote the ordered
pair of alleles at the i-th position of e. In similar fashion, let
f=(f.sub.1,f.sub.2) and m=(m.sub.1,m.sub.2) be the true, unknown,
ordered SNP information on the father and mother respectively. In
addition, let g.sub.1 be the true, unknown, haploid information on
a single sperm from the father. (One can think of the letter g as
standing for gamete. There is no g.sub.2. The subscript is used to
remind the reader that the information is haploid, in the same way
that f.sub.1 and f.sub.2 are haploid.) It is also convenient to
define r=(f,m), so that there is a symbol to represent the complete
set of diploid parent information from which e inherits, and also
write
r.sub.i=(f.sub.i,m.sub.i)=((f.sub.1i,f.sub.2i),(m.sub.1i,m.sub.2i))
to denote the complete set of ordered information on father and
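mother at the i-th SNP. Finally, let {circumflex over (e)}=({circumflex over (e)}.sub.1, {circumflex over (e)}.sub.2) be the
estimate of e that is sought, {circumflex over (e)}.sub.1, {circumflex over (e)}.sub.2 .di-elect cons. A.sup.n.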
[0094] A crossover map is an n-tuple .theta. .di-elect cons.
{1,2}.sup.n that specifies how a haploid pair such as
(f.sub.1,f.sub.2) recombines to form a gamete such as e.sub.1.
Treating .theta. as a function whose output is a haploid sequence,
define
.theta.(f).sub.i=.theta.(f.sub.1,f.sub.2).sub.i=f.sub..theta..sub.i.sub.,i;
that is, the i-th allele of the gamete is copied from f.sub.1 when
.theta..sub.i=1 and from f.sub.2 when .theta..sub.i=2.
To make this idea more concrete, let f.sub.1=ACAAACCC, let
f.sub.2=CAACCACA, and let .theta.=11111222. Then
.theta.(f.sub.1,f.sub.2)=ACAAAACA. In this example, the crossover
map .theta. implicitly indicates that a crossover occurred between
SNPs i=5 and i=6.
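The action of a crossover map on a haploid pair can be illustrated with a short sketch. The function name apply_crossover_map and the use of character strings for haplotypes and maps are assumptions made for illustration only.

```python
def apply_crossover_map(theta, hap1, hap2):
    """Return the gamete whose i-th allele is copied from hap1 if theta[i] == '1', else from hap2."""
    if not (len(theta) == len(hap1) == len(hap2)):
        raise ValueError("theta and both haplotypes must have the same length")
    source = {"1": hap1, "2": hap2}
    return "".join(source[t][i] for i, t in enumerate(theta))


# The example from the text: f1 = ACAAACCC, f2 = CAACCACA, theta = 11111222
gamete = apply_crossover_map("11111222", "ACAAACCC", "CAACCACA")
print(gamete)  # ACAAAACA
```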
[0095] Formally, let .theta. be the true, unknown crossover map
that determines e.sub.1 from f, let .PHI. be the true, unknown
crossover map that determines e.sub.2 from m, and let .psi. be the
true, unknown crossover map that determines g.sub.1 from f. That is,
e.sub.1=.theta.(f), e.sub.2=.PHI.(m), g.sub.1=.psi.(f). It is also
convenient to define X=(.theta.,.PHI.,.psi.) so that there is a
symbol to represent the complete set of crossover information
associated with the problem. For simplicity's sake, write e=X(r) as
shorthand for e=(.theta.(f),.PHI.(m)); also write e.sub.i=X(r.sub.i)
as shorthand for e.sub.i=X(r).sub.i.
[0096] In reality, when chromosomes combine, at most a few
crossovers occur, making most of the 2.sup.n theoretically possible
crossover maps distinctly improbable. In practice, these very low
probability crossover maps will be treated as though they had
probability zero, considering only crossover maps belonging to a
comparatively small set .OMEGA.. For example, if .OMEGA. is defined
to be the set of crossover maps that derive from at most one
crossover, then |.OMEGA.|=2n.
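The count |.OMEGA.|=2n can be checked by enumeration. The following is a minimal sketch; the function name single_crossover_maps and the string representation of maps are illustrative assumptions.

```python
def single_crossover_maps(n):
    """All crossover maps over n SNPs that derive from at most one crossover."""
    maps = set()
    for start, other in (("1", "2"), ("2", "1")):
        # k SNPs are copied from the starting haplotype; k == n means no crossover.
        for k in range(1, n + 1):
            maps.add(start * k + other * (n - k))
    return sorted(maps)


omega = single_crossover_maps(8)
print(len(omega))  # 16, i.e. 2n for n = 8
```

Two of the 2n maps correspond to no crossover at all, and the remaining 2(n-1) correspond to one crossover at each of the n-1 gaps between adjacent SNPs, starting from either haplotype.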
[0097] It is convenient to have an alphabet that can be used to
describe unordered diploid measurements. To that end, let
B={A,B,C,X}. Here A and C represent their respective homozygous
locus states and B represents a heterozygous but unordered locus
state. Note: this section is the only section of the document that
uses the symbol B to stand for a heterozygous but unordered locus
state. Most other sections of the document use the symbols A and B
to stand for the two different allele values that can occur at a
locus. X represents an unmeasured locus, i.e., a locus drop-out. To
make this idea more concrete, let f.sub.1=ACAAACCC, and let
f.sub.2=CAACCACA. Then a noiseless unordered diploid measurement of
f would yield {tilde over (f)}=BBABBBCB.
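A noiseless unordered diploid measurement can be sketched as follows. The function name unordered_diploid is an assumption; the symbol X (locus drop-out) never appears in the output because no noise is modeled here.

```python
def unordered_diploid(hap1, hap2):
    """Collapse an ordered haploid pair into unordered calls: A or C if homozygous, B if heterozygous."""
    calls = []
    for a, b in zip(hap1, hap2):
        calls.append(a if a == b else "B")
    return "".join(calls)


print(unordered_diploid("ACAAACCC", "CAACCACA"))  # BBABBBCB
```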
[0098] In the problem at hand, it is only possible to take
unordered diploid measurements of e, f, and m, although there may
be ordered haploid measurements on g.sub.1. This results in noisy
measured sequences that are denoted {tilde over (e)} .di-elect cons. B.sup.n,
{tilde over (f)} .di-elect cons. B.sup.n, {tilde over (m)} .di-elect cons. B.sup.n, and {tilde over
(g)}.sub.1 .di-elect cons. A.sup.n respectively. It will be convenient to define
{tilde over (r)}=({tilde over (f)}, {tilde over (m)}) so that there
is a symbol that represents the noisy measurements on the parent
data. It will also be convenient to define {tilde over (D)}=({tilde
over (r)}, {tilde over (e)}, {tilde over (g)}.sub.1) so that there
is a symbol to represent the complete set of noisy measurements
associated with the problem, and to write {tilde over
(D)}.sub.i=({tilde over (r)}.sub.i, {tilde over (e)}.sub.i, {tilde
over (g)}.sub.1i)=({tilde over (f)}.sub.i, {tilde over (m)}.sub.i,
{tilde over (e)}.sub.i, {tilde over (g)}.sub.1i) to denote the
complete set of measurements on the i-th SNP. (Please note that,
while f.sub.i is an ordered pair such as (A,C), {tilde over (f)}.sub.i
is a single letter such as B.)
[0099] Because the diploid measurements are unordered, nothing in
the data can distinguish the state (f.sub.1, f.sub.2) from
(f.sub.2, f.sub.1) or the state (m.sub.1, m.sub.2) from (m.sub.2,
m.sub.1). These indistinguishable symmetric states give rise to
multiple optimal solutions of the estimation problem. To eliminate
the symmetries, and without loss of generality, assign
.theta..sub.1=.PHI..sub.1=1.
[0100] In summary, then, the problem is defined by a true but
unknown underlying set of information {r, e, g.sub.1, X}, with
e=X(r). Only noisy measurements {tilde over (D)}=({tilde over (r)},
{tilde over (e)}, {tilde over (g)}.sub.1) are available. The goal
is to come up with an estimate of e, based on {tilde over (D)}.
[0101] Note that this method implicitly assumes euploidy on the
embryo. It should be obvious to one skilled in the art how this
method could be used in conjunction with the aneuploidy calling
methods described elsewhere in this patent. For example, the
aneuploidy calling method could be first employed to ensure that
the embryo is indeed euploid and only then would the allele calling
method be employed, or the aneuploidy calling method could be used
to determine how many chromosome copies were derived from each
parent and only then would the allele calling method be employed.
It should also be obvious to one skilled in the art how this method
could be modified in the case of a sex chromosome where there is
only one copy of a chromosome present.
Solution Via Maximum a Posteriori Estimation
[0102] In one embodiment of the invention, it is possible, for each
of the n SNP positions, to use a maximum a posteriori (MAP)
estimation to determine the most probable ordered allele pair at
that position. The derivation that follows uses a common shorthand
notation for probability expressions. For example,
P(e'.sub.i,{tilde over (D)}|X') is written to denote the
probability that random variable e.sub.i takes on value e'.sub.i
and the random variable {tilde over (D)} takes on its observed
value, conditional on the event that the random variable X takes on
the value X'. Using MAP estimation, then, the i-th component of {circumflex over (e)},
denoted {circumflex over (e)}.sub.i=({circumflex over (e)}.sub.1i, {circumflex over (e)}.sub.2i), is given by

$$
\begin{aligned}
\hat{e}_i &= \arg\max_{e'_i} P(e'_i \mid \tilde{D}) = \arg\max_{e'_i} P(e'_i, \tilde{D}) = \arg\max_{e'_i} \sum_{X' \in \Omega^3} P(X')\, P(e'_i, \tilde{D} \mid X') \\
&\overset{(a)}{=} \arg\max_{e'_i} \sum_{X' \in \Omega^3:\ \theta'_1 = \phi'_1 = 1} P(X')\, P(e'_i, \tilde{D}_i \mid X') \prod_{j \neq i} P(\tilde{D}_j \mid X') \\
&\overset{(b)}{=} \arg\max_{e'_i} \sum_{X' \in \Omega^3:\ \theta'_1 = \phi'_1 = 1} P(X') \sum_{r'_i \in A^4} P(r'_i)\, P(e'_i, \tilde{D}_i \mid X', r'_i) \prod_{j \neq i} \sum_{r'_j \in A^4} P(r'_j)\, P(\tilde{D}_j \mid X', r'_j) \\
&\overset{(c)}{=} \arg\max_{e'_i} \sum_{X' \in \Omega^3:\ \theta'_1 = \phi'_1 = 1} P(X') \sum_{r'_i \in A^4} P(r'_i)\, P(e'_i \mid X', r'_i)\, P(\tilde{D}_i \mid X', r'_i) \prod_{j \neq i} \sum_{r'_j \in A^4} P(r'_j)\, P(\tilde{D}_j \mid X', r'_j) \\
&\overset{(*)}{=} \arg\max_{e'_i} \sum_{X' \in \Omega^3:\ \theta'_1 = \phi'_1 = 1} P(X') \prod_{j} \sum_{r'_j \in A^4} \mathbf{1}\big(i \neq j \ \text{or}\ X'(r'_j) = e'_i\big)\, P(r'_j)\, P(\tilde{D}_j \mid X', r'_j)
\end{aligned}
$$
[0103] In the preceding set of equations, (a) holds because the
assumption of SNP independence means that all of the random
variables associated with SNP i are conditionally independent of
all of the random variables associated with SNP j, given X; (b)
holds because r is independent of X; (c) holds because e.sub.i and
{tilde over (D)}.sub.i are conditionally independent given r.sub.i
and X (in particular, e.sub.i=X(r.sub.i)); and (*) holds, again,
because e.sub.i=X(r.sub.i), which means that
P(e'.sub.i|X',r'.sub.i) evaluates to either one or zero and hence
effectively filters r'.sub.i to just those values that are
consistent with e'.sub.i and X'.
[0104] The final expression (*) above contains three probability
expressions: P(X'), P(r'.sub.j), and P({tilde over (D)}.sub.j|X',r'.sub.j).
The computation of each of these quantities is
discussed in the following three sections.
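As a hedged illustration of expression (*), the following brute-force sketch evaluates the sum for one SNP on a very small problem. It is a simplification rather than the disclosed implementation: the sperm measurement and its crossover map .psi. are omitted, .OMEGA. is restricted to maps with at most one crossover, and the measurement model p_meas (a single error rate eps spread evenly over the wrong symbols) is a hypothetical stand-in for the empirically determined distributions. The callables p_map and p_allele, which supply P(.theta.) and allele frequencies, are likewise assumed inputs.

```python
from itertools import product

ALLELES = "AC"


def one_crossover_maps(n):
    """Crossover maps (tuples of 1/2) with at most one crossover and first entry fixed to 1."""
    return [(1,) * k + (2,) * (n - k) for k in range(1, n + 1)]


def unordered(pair):
    """Unordered diploid state: 'A' or 'C' if homozygous, 'B' if heterozygous."""
    return pair[0] if pair[0] == pair[1] else "B"


def p_meas(observed, true_pair, eps):
    """Hypothetical error model over {A, B, C, X}: correct call with probability 1 - eps."""
    return 1.0 - eps if observed == unordered(true_pair) else eps / 3.0


def map_allele_call(i, f_obs, m_obs, e_obs, p_map, p_allele, eps=0.1):
    """Return the MAP ordered embryo pair at SNP i and the score of every candidate."""
    n = len(e_obs)
    maps = one_crossover_maps(n)
    parents = list(product(ALLELES, repeat=4))  # candidate r_j = (f1, f2, m1, m2)
    scores = {}
    for e_cand in product(ALLELES, repeat=2):
        total = 0.0
        for theta, phi in product(maps, repeat=2):
            term = p_map(theta) * p_map(phi)
            for j in range(n):
                inner = 0.0
                for f1, f2, m1, m2 in parents:
                    e_j = ((f1, f2)[theta[j] - 1], (m1, m2)[phi[j] - 1])
                    if j == i and e_j != e_cand:
                        continue  # indicator 1(i != j or X'(r'_j) = e'_i)
                    prior = p_allele(f1) * p_allele(f2) * p_allele(m1) * p_allele(m2)
                    inner += (prior
                              * p_meas(f_obs[j], (f1, f2), eps)
                              * p_meas(m_obs[j], (m1, m2), eps)
                              * p_meas(e_obs[j], e_j, eps))
                term *= inner
            total += term
        scores[e_cand] = total
    return max(scores, key=scores.get), scores


# Father measured AAAA, mother CCCC, embryo dropped out (X) at the second SNP.
best, _ = map_allele_call(1, "AAAA", "CCCC", "BXBB",
                          p_map=lambda x: 1.0, p_allele=lambda a: 0.5)
print(best)  # ('A', 'C')
```

Even though the embryo measurement at the second SNP is a drop-out, the parental measurements are enough to recover the ordered pair in this example. The cost of this direct evaluation grows with the square of the number of crossover maps, which motivates the factored form derived later.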
Crossover Map Probabilities
[0105] Recent research has enabled the modeling of the probability
of recombination between any two SNP loci. Observations from sperm
studies and patterns of genetic variation show that recombination
rates vary extensively on kilobase scales and that much
recombination occurs in recombination hotspots. The NCBI data about
recombination rates on the Human Genome is publicly available
through the UCSC Genome Annotation Database.
[0106] One may use the data set from the Hapmap Project or the
Perlegen Human Haplotype Project. The latter is higher density; the
former is higher quality. These rates can be estimated using
various techniques known to those skilled in the art, such as the
reversible-jump Markov Chain Monte Carlo (MCMC) method that is
available in the package LDHat.
[0107] In one embodiment of the invention, it is possible to
calculate the probability of any crossover map given the
probability of crossover between any two SNPs. For example,
P(.theta.=11111222) is one half the probability that a crossover
occurred between SNPs five and six. The reason it is only half the
probability is that a particular crossover pattern has two
crossover maps associated with it: one for each gamete. In this
case, the other crossover map is .theta.=22222111.
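A minimal sketch of this calculation follows. It assumes, beyond what the text states, that crossovers in different inter-SNP intervals occur independently (no interference); the function name and the list rec_probs of per-interval crossover probabilities are illustrative.

```python
def crossover_map_probability(theta, rec_probs):
    """Probability of crossover map theta given per-interval crossover probabilities.

    theta: sequence of '1'/'2' of length n.
    rec_probs[k]: probability of a crossover between SNP k+1 and SNP k+2 (length n - 1).
    """
    if len(rec_probs) != len(theta) - 1:
        raise ValueError("need one crossover probability per gap between adjacent SNPs")
    prob = 0.5  # either parental haplotype is equally likely to start the gamete
    for k, p in enumerate(rec_probs):
        switched = theta[k] != theta[k + 1]
        prob *= p if switched else (1.0 - p)
    return prob


# Crossover possible only between SNPs five and six, with probability 0.2
print(crossover_map_probability("11111222", [0, 0, 0, 0, 0.2, 0, 0]))
```

When crossovers elsewhere are negligible, the result is one half the crossover probability between SNPs five and six, matching the example in the text; the complementary map 22222111 receives the same probability.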
[0108] Recall that X=(.theta.,.PHI.,.psi.), where
e.sub.1=.theta.(f), e.sub.2=.PHI.(m), g.sub.1=.psi.(f). Obviously
.theta., .PHI., and .psi. result from independent physical events,
so P(X)=P(.theta.)P(.PHI.)P(.psi.). Further assume that
P.sub..theta.(·)=P.sub..PHI.(·)=P.sub..psi.(·),
where the actual distribution P.sub..theta.(·) is determined
in the obvious way from the Hapmap data.
Allele Probabilities
[0109] It is possible to determine
P(r.sub.i)=P(f.sub.i)P(m.sub.i)=P(f.sub.1i)P(f.sub.2i)P(m.sub.1i)P(m.sub.2i)
using population frequency information from databases such as
dbSNP. Also, as mentioned previously, choose SNPs for which the
assumption of intra-haploid independence is a reasonable one. That
is, assume that

$$ P(r) = \prod_i P(r_i) $$
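The per-SNP prior on the ordered parental alleles can be sketched as below. The function name parent_genotype_prior and the argument freq_a (the population frequency of allele A at the SNP, as might be taken from a database such as dbSNP) are assumptions for illustration.

```python
from itertools import product


def parent_genotype_prior(r_i, freq_a):
    """P(r_i) for r_i = (f1, f2, m1, m2), treating the four parental alleles as independent draws."""
    prob = 1.0
    for allele in r_i:
        prob *= freq_a if allele == "A" else 1.0 - freq_a
    return prob


# The 16 candidate 4-tuples at one SNP form a proper distribution.
total = sum(parent_genotype_prior(r, 0.3) for r in product("AC", repeat=4))
print(round(total, 12))  # 1.0
```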
Measurement Errors
[0110] Conditional on whether a locus is heterozygous or
homozygous, measurement errors may be modeled as independent and
identically distributed across all similarly typed loci. Thus:
$$
\begin{aligned}
P(\tilde{D} \mid X, r) &= \prod_i P(\tilde{D}_i \mid X, r_i) = \prod_i P(\tilde{f}_i, \tilde{m}_i, \tilde{e}_i, \tilde{g}_{1i} \mid X, f_i, m_i) \\
&= \prod_i P(\tilde{f}_i \mid f_i)\, P(\tilde{m}_i \mid m_i)\, P(\tilde{e}_i \mid \theta(f_i), \phi(m_i))\, P(\tilde{g}_{1i} \mid \psi(f_i))
\end{aligned}
$$
where each of the four conditional probability distributions in the
final expression is determined empirically, and where the
additional assumption is made that the first two distributions are
identical. For example, for unordered diploid measurements on a
blastomere, empirical values p.sub.d=0.5 and p.sub.a=0.02 are
obtained, which lead to the conditional probability distribution
for P({tilde over (e)}.sub.i|e.sub.i) shown in Table 1.
[0111] Note that the conditional probability distributions
mentioned above, P({tilde over (f)}.sub.i|f.sub.i), P({tilde over
(m)}.sub.i|m.sub.i), P({tilde over (e)}.sub.i|e.sub.i), can vary
widely from experiment to experiment, depending on various factors
in the lab such as variations in the quality of genetic samples, or
variations in the efficiency of whole genome amplification, or
small variations in protocols used. Therefore, in a preferred
embodiment, these conditional probability distributions are
estimated on a per-experiment basis. We focus in later sections of
this disclosure on estimating P({tilde over (e)}.sub.i|e.sub.i),
but it will be clear to one skilled in the art after reading this
disclosure how similar techniques can be applied to estimating
P({tilde over (f)}.sub.i|f.sub.i) and P({tilde over
(m)}.sub.i|m.sub.i). The distributions can each be modeled as
belonging to a parametric family of distributions whose particular
parameter values vary from experiment to experiment. As one example
among many, it is possible to implicitly model the conditional
probability distribution P({tilde over (e)}.sub.i|e.sub.i) as being
parameterized by an allele dropout parameter p.sub.d and an allele
dropin parameter p.sub.a. The values of these parameters might vary
widely from experiment to experiment, and it is possible to use
standard techniques such as maximum likelihood estimation, MAP
estimation, or Bayesian inference, whose application is illustrated
at various places in this document, to estimate the values that
these parameters take on in any individual experiment. Regardless
of the precise method one uses, the key is to find the set of
parameter values that maximizes the joint probability of the
parameters and the data, by considering all possible tuples of
parameter values within a region of interest in the parameter
space. As described elsewhere in the document, this approach can be
implemented when one knows the chromosome copy number of the target
genome, or when one doesn't know the copy number call but is
exploring different hypotheses. In the latter case, one searches
for the combination of parameters and hypotheses that best matches
the data, as is described elsewhere in this
disclosure.
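The search over tuples of parameter values can be sketched as a simple grid search. The likelihood used here is a toy model assumed purely for illustration, not the distribution of Table 1: at loci known to be heterozygous each allele is taken to drop out independently with probability p_d, and at loci known to be homozygous a spurious second allele is taken to appear with probability p_a. With a flat prior on the parameters, maximizing the joint probability of parameters and data reduces to maximizing this likelihood.

```python
import math
from itertools import product


def log_likelihood(p_d, p_a, het_counts, hom_counts):
    """Toy log-likelihood of genotype calls at loci whose true genotype is known."""
    both, single, none = het_counts   # heterozygous loci: both, one, or no allele observed
    clean, dropin = hom_counts        # homozygous loci: clean call or spurious second allele
    return (both * 2 * math.log(1 - p_d)
            + single * math.log(2 * p_d * (1 - p_d))
            + none * 2 * math.log(p_d)
            + clean * math.log(1 - p_a)
            + dropin * math.log(p_a))


def grid_search(het_counts, hom_counts, pd_grid, pa_grid):
    """Return the (p_d, p_a) tuple on the grid with the highest log-likelihood."""
    return max(product(pd_grid, pa_grid),
               key=lambda pair: log_likelihood(pair[0], pair[1], het_counts, hom_counts))


pd_grid = [k / 100 for k in range(1, 100)]
pa_grid = [k / 100 for k in range(1, 21)]
print(grid_search((25, 50, 25), (98, 2), pd_grid, pa_grid))
```

The counts in the usage example are synthetic and were chosen so that the maximizer coincides with the illustrative values p.sub.d=0.5 and p.sub.a=0.02 mentioned above.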
[0112] Note that one can also determine the conditional probability
distributions as a function of particular parameters derived from
the measurements, such as the magnitude of quantitative genotyping
measurements, in order to increase accuracy of the method. This
would not change the fundamental concept of the invention.
[0113] It is also possible to use non-parametric methods to
estimate the above conditional probability distributions on a
per-experiment basis. Nearest neighbor methods, smoothing kernels,
and similar non-parametric methods familiar to those skilled in the
art are some possibilities. Although this disclosure focuses on
parametric estimation methods, use of non-parametric methods to
estimate these conditional probability distributions would not
change the fundamental concept of the invention. The usual caveats
apply: parametric methods may suffer from model bias, but have
lower variance. Non-parametric methods tend to be unbiased, but
will have higher variance.
[0114] Note that it should be obvious to one skilled in the art,
after reading this disclosure, how one could use quantitative
information instead of explicit allele calls, in order to apply the
PS method to making reliable allele calls, and this would not
change the essential concepts of the disclosure.
B Factoring the Allele Calling Equation
[0115] In a preferred embodiment of the invention, the algorithm
for allele calling can be structured so that it can be executed in
a more computationally efficient fashion. In this section the
equations are re-derived for allele-calling via the MAP method,
this time reformulating the equations so that they reflect such a
computationally efficient method of calculating the result.
Notation
[0116] X*,Y*,Z* .di-elect cons. {A,C}.sup.n.times.2 are the true ordered values on the
mother, father, and embryo respectively.
[0117] H* .di-elect cons. {A,C}.sup.n.times.h are true values on h sperm samples.
[0118] B* .di-elect cons. {A,C}.sup.n.times.b.times.2 are true ordered values on b
blastomeres.
[0119] D={x,y,z,B,H} is the set of unordered measurement data on
father, mother, embryo, b blastomeres and h sperm samples.
D.sub.i={x.sub.i,y.sub.i,z.sub.i,H.sub.i,B.sub.i} is the data set
restricted to the i-th SNP.
[0120] r .di-elect cons. {A,C}.sup.4 represents a candidate 4-tuple of ordered
values on both the mother and father at a particular locus.
[0121] {circumflex over (Z)}.sub.i .di-elect cons. {A,C}.sup.2 is the estimated
ordered embryo value at SNP i.
[0122] Q=(2+2b+h) is the effective number of haploid chromosomes
being measured, excluding the parents. Any hypothesis about the
parental origin of all measured data (excluding the parents
themselves) requires that Q crossover maps be specified.
[0123] .chi. .di-elect cons. {1,2}.sup.n.times.Q is a crossover map matrix, representing
a hypothesis about the parental origin of all measured data,
excluding the parents. Note that there are 2.sup.nQ different
crossover matrices. .chi..sub.i is the matrix
restricted to the i-th row. Note that there are 2.sup.Q vector
values that the i-th row can take on, from the set .chi..sub.i .di-elect cons.
{1,2}.sup.Q.
[0124] f(x; y, z) is a function of (x, y, z) that is being treated
as a function of just x. The values behind the semi-colon are
constants in the context in which the function is being
evaluated.
PS Equation Factorization
[0125]

$$
\begin{aligned}
\hat{Z}_i &= \arg\max_{Z_i} P(Z_i, D) = \arg\max_{Z_i} \sum_{\chi} P(\chi)\, P(Z_i, D \mid \chi) \\
&= \arg\max_{Z_i} \sum_{\chi} P(\chi_1)\, P(\chi_2 \mid \chi_1) \cdots P(\chi_n \mid \chi_{n-1}) \Big( \sum_{r \in \{A,C\}^4} P(r)\, P(Z_i, D_i \mid \chi_i, r) \Big) \prod_{j \neq i} \Big( \sum_{r \in \{A,C\}^4} P(r)\, P(D_j \mid \chi_j, r) \Big) \\
&= \arg\max_{Z_i} \sum_{\chi} P(\chi_1)\, P(\chi_2 \mid \chi_1) \cdots P(\chi_n \mid \chi_{n-1})\, f_1(\chi_i; Z_i, D_i) \prod_{j \neq i} f_2(\chi_j; D_j) \\
&= \arg\max_{Z_i} \sum_{\chi_1 \in \{1,2\}^Q} \sum_{\chi_2 \in \{1,2\}^Q} \cdots \sum_{\chi_n \in \{1,2\}^Q} P(\chi_1)\, P(\chi_2 \mid \chi_1) \cdots P(\chi_n \mid \chi_{n-1})\, f_1(\chi_i; Z_i, D_i) \prod_{j \neq i} f_2(\chi_j; D_j) \\
&= \arg\max_{Z_i} \sum_{\chi_1 \in \{1,2\}^Q} P(\chi_1)\, f_2(\chi_1; D_1) \times \sum_{\chi_2 \in \{1,2\}^Q} P(\chi_2 \mid \chi_1)\, f_2(\chi_2; D_2) \times \cdots \\
&\qquad \times \sum_{\chi_i \in \{1,2\}^Q} P(\chi_i \mid \chi_{i-1})\, f_1(\chi_i; Z_i, D_i) \times \cdots \times \sum_{\chi_n \in \{1,2\}^Q} P(\chi_n \mid \chi_{n-1})\, f_2(\chi_n; D_n)
\end{aligned}
$$
[0126] The number of different crossover matrices .chi. is
2.sup.nQ. Thus, a brute-force application of the first line above
is O(n2.sup.nQ). By exploiting structure via the factorization of
P(.chi.) and P(Z.sub.i,D|.chi.), and invoking the previous result,
the final line gives an expression that can be computed in
O(n2.sup.2Q).
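The saving can be illustrated on a generic chain-structured sum. In the sketch below, each row of the crossover matrix is treated as one of S=2.sup.Q states, factors[j][s] plays the role of f.sub.2(.chi..sub.j; D.sub.j) (or of f.sub.1 at j=i), and trans[p][s] plays the role of P(.chi..sub.j|.chi..sub.j-1). Using the same transition table at every position is a simplifying assumption, and all names are illustrative.

```python
from itertools import product


def brute_force_sum(prior, trans, factors):
    """Sum the chain product over every state sequence: S**n terms."""
    n, num_states = len(factors), len(prior)
    total = 0.0
    for seq in product(range(num_states), repeat=n):
        term = prior[seq[0]] * factors[0][seq[0]]
        for j in range(1, n):
            term *= trans[seq[j - 1]][seq[j]] * factors[j][seq[j]]
        total += term
    return total


def factored_sum(prior, trans, factors):
    """Same sum via a forward recursion: on the order of n * S**2 operations."""
    num_states = len(prior)
    msg = [prior[s] * factors[0][s] for s in range(num_states)]
    for j in range(1, len(factors)):
        msg = [factors[j][s] * sum(msg[p] * trans[p][s] for p in range(num_states))
               for s in range(num_states)]
    return sum(msg)


prior = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
factors = [[0.5, 0.7], [0.1, 0.9], [0.3, 0.3], [0.8, 0.2]]
print(brute_force_sum(prior, trans, factors), factored_sum(prior, trans, factors))
```

Both functions return the same value up to rounding; only the second scales linearly in the number of SNPs.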
C Quantitative Detection of Aneuploidy
[0127] In one embodiment of the invention, aneuploidy can be
detected using the quantitative data output from the PS method
discussed in this patent. Disclosed herein are multiple methods
that make use of the same concept; these methods are termed Copy
Number Calling (CNC). The statement of the problem is to determine
the copy number of each of 23 chromosome-types in a single cell.
The cell is first pre-amplified using a technique such as whole
genome amplification using the MDA method. Then the resulting
genetic material is selectively amplified with a technique such as
PCR at a set of n chosen SNPs at each of m=23 chromosome types.
[0128] This yields a data set [t.sub.ij], i=1 . . . n, j=1 . . . m
of regularized ct (ct, or CT, is the point during the cycle time of
the amplification at which dye measurement exceeds a given
threshold) values obtained at SNP i, chromosome j. A regularized ct
value implies that, for a given (i,j), the pair of raw ct values on
channels FAM and VIC (these are arbitrary channel names denoting
different dyes) obtained at that locus are combined to yield a ct
value that accurately reflects the ct value that would have been
obtained had the locus been homozygous. Thus, rather than having
two ct values per locus, there is just one regularized ct value per
locus.
[0129] The goal is to determine the set {n.sub.j} of copy numbers
on each chromosome. If the cell is euploid, then n.sub.j=2 for all
j; one exception is the case of the male X chromosome. If
n.sub.j.noteq.2 for at least one j, then the cell is aneuploid;
excepting the case of male X.
Biochemical Model
[0130] The relationship between ct values and chromosomal copy
number is modeled as follows:

$$ \alpha_{ij}\, n_j\, Q\, 2^{\beta_{ij} t_{ij}} = Q_T $$

In this expression,
n.sub.j is the copy number of chromosome j. Q is an abstract
quantity representing a baseline amount of pre-amplified genetic
material from which the actual amount of pre-amplified genetic
material at SNP i, chromosome j can be calculated as
.alpha..sub.ijn.sub.jQ. .alpha..sub.ij is a preferential
amplification factor that specifies how much more SNP i on
chromosome j will be pre-amplified via MDA than SNP 1 on chromosome
1. By definition, the preferential amplification factors are
relative to .alpha..sub.11, which is taken to be 1.
[0131] .beta..sub.ij is the doubling rate for SNP i chromosome j
under PCR. t.sub.ij is the ct value. Q.sub.T is the amount of
genetic material at which the ct value is determined. T is a
symbol, not an index, and merely stands for threshold.
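As a hedged illustration of the model, the ct value implied by a given copy number can be obtained by solving the model equation for t.sub.ij. The function name expected_ct and the default parameter values are hypothetical.

```python
import math


def expected_ct(copy_number, q, alpha=1.0, beta=1.0, q_threshold=1.0e9):
    """Solve alpha * n * Q * 2**(beta * t) = Q_T for the noise-free ct value t."""
    return math.log2(q_threshold / (alpha * copy_number * q)) / beta


t_disomic = expected_ct(2, q=1000.0, beta=0.9)
t_monosomic = expected_ct(1, q=1000.0, beta=0.9)
print(t_monosomic - t_disomic)  # 1 / beta cycles later for the monosomic case
```

Under this model, halving the copy number delays the threshold crossing by 1/.beta. cycles, which is the signal that the copy number calling methods exploit.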
[0132] It is important to realize that .alpha..sub.ij,
.beta..sub.ij, and Q.sub.T are constants of the model that do not
change from experiment to experiment. By contrast, n.sub.j and Q
are variables that change from experiment to experiment. Q is the
amount of material there would be at SNP 1 of chromosome 1, if
chromosome 1 were monosomic.
[0133] The original equation above does not contain a noise term.
This can be included by rewriting it as follows:
$$ \beta_{ij} t_{ij} = \log\frac{Q_T}{\alpha_{ij}} - \log n_j - \log Q + Z_{ij} \qquad (*) $$

[0134] The above equation indicates that the ct value is corrupted
by additive Gaussian noise Z.sub.ij. Let the variance of this noise
term be .sigma..sub.ij.sup.2. Here and below, log denotes the base-2
logarithm, consistent with the doubling term in the model.
Maximum Likelihood (ML) Estimation of Copy Number
[0135] In one embodiment of the method, the maximum likelihood
estimation is used, with respect to the model described above, to
determine n.sub.j. The parameter Q makes this difficult unless
another constraint is added:
$$ \frac{1}{m} \sum_j \log n_j = 1 $$
[0136] This indicates that the average copy number is 2, or,
equivalently, that the average log copy number is 1. With this
additional constraint one can now solve the following ML
problem:
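$$
\begin{aligned}
\hat{Q}, \hat{n}_j &= \arg\max_{Q, n_j} \prod_{ij} f_Z\!\left( \log n_j + \log Q - \left( \log\frac{Q_T}{\alpha_{ij}} - \beta_{ij} t_{ij} \right) \right) \quad \text{s.t.} \quad \frac{1}{m} \sum_j \log n_j = 1 \\
&= \arg\min_{Q, n_j} \sum_{ij} \frac{1}{\sigma_{ij}^2} \left( \log n_j + \log Q - \left( \log\frac{Q_T}{\alpha_{ij}} - \beta_{ij} t_{ij} \right) \right)^2 \quad \text{s.t.} \quad \frac{1}{m} \sum_j \log n_j = 1
\end{aligned}
$$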
The last line above is linear in the variables log n.sub.j and log
Q, and is a simple weighted least squares problem with an equality
constraint. The solution can be obtained in closed form by forming
the Lagrangian
$$ L(\log n_j, \log Q) = \sum_{ij} \frac{1}{\sigma_{ij}^2} \left( \log n_j + \log Q - \left( \log\frac{Q_T}{\alpha_{ij}} - \beta_{ij} t_{ij} \right) \right)^2 + \lambda \sum_j \log n_j $$

and taking partial derivatives.
Solution when Noise Variance is Constant
[0137] To avoid unnecessarily complicating the exposition, set
.sigma..sub.ij.sup.2=1. This assumption will remain unless
explicitly stated otherwise. (In the general case in which each
.sigma..sub.ij.sup.2 is different, the solutions will be weighted
averages instead of simple averages, or weighted least squares
solutions instead of simple least squares solutions.) In that case,
the above linear system has the solution:
$$
\begin{aligned}
\log Q_j &\triangleq \frac{1}{n} \sum_i \left( \log\frac{Q_T}{\alpha_{ij}} - \beta_{ij} t_{ij} \right) \\
\log Q &= \frac{1}{m} \sum_j \log Q_j - 1 \\
\log n_j &= \log Q_j - \log Q = \log\frac{Q_j}{Q}
\end{aligned}
$$
The first equation can be interpreted as a log estimate of the
quantity of chromosome j. The second equation can be interpreted as
saying that the average of the Q.sub.j is the average of a diploid
quantity; subtracting one from its log gives the desired monosome
quantity. The third equation can be interpreted as saying that the
copy number is just the ratio Q.sub.j/Q.
Note that n.sub.j is a `double difference`, since it is a
difference of Q-values, each of which is itself a difference of
values.
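The three equations of the constant-variance solution translate directly into code. The following is a minimal sketch; the function name estimate_copy_numbers and the nested-list layout (row i for SNP, column j for chromosome) are assumptions, and logarithms are taken base 2 as in the model.

```python
import math


def estimate_copy_numbers(t, alpha, beta, q_threshold):
    """Estimate the copy number of each chromosome from regularized ct values.

    t, alpha, beta: n-by-m nested lists indexed [i][j] (SNP i, chromosome j).
    """
    n, m = len(t), len(t[0])
    # log2 Q_j: average over SNPs of the per-locus log-quantity estimate for chromosome j
    log_qj = [
        sum(math.log2(q_threshold / alpha[i][j]) - beta[i][j] * t[i][j] for i in range(n)) / n
        for j in range(m)
    ]
    # log2 Q: the average log quantity is treated as diploid; subtracting 1 gives the monosome quantity
    log_q = sum(log_qj) / m - 1.0
    # n_j = Q_j / Q
    return [2.0 ** (lq - log_q) for lq in log_qj]
```

Because of the constraint that the average log copy number equals 1, the estimates are exact on noise-free data only when the true copy numbers satisfy that constraint; a single aneuploid chromosome among many euploid ones shifts the baseline slightly.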
Simple Solution
[0138] The above equations also reveal the solution under simpler
modeling assumptions: for example, when making the assumption
.alpha..sub.ij=1 for all i and j and/or when making the assumption
that .beta..sub.ij=.beta. for all i and j. In the simplest case,
when both .alpha..sub.ij=1 and .beta..sub.ij=.beta., the solution
reduces to

$$ \log n_j = 1 + \beta \left( \frac{1}{mn} \sum_{ij} t_{ij} - \frac{1}{n} \sum_i t_{ij} \right) \qquad (**) $$
The Double Differencing Method
[0139] In one embodiment of the invention, it is possible to detect
monosomy using double differencing. It should be obvious to one
skilled in the art how to modify this method for detecting other
aneuploidy states. Let {t.sub.ij} be the regularized ct values
obtained from MDA pre-amplification followed by PCR on the genetic
sample. As always, t.sub.ij is the ct value on the i-th SNP of the
j-th chromosome. Denote by t.sub.j the vector of ct values
associated with the j-th chromosome. Make the following
definitions:
$$ \bar{t} \triangleq \frac{1}{mn} \sum_{ij} t_{ij}, \qquad \tilde{t}_j \triangleq t_j - \bar{t}\,\mathbf{1} $$

Classify chromosome j as monosomic if and only if f.sup.T{tilde
over (t)}.sub.j is higher than a certain threshold value, where f
is a vector that represents a monosomy signature. f is the matched
filter, whose construction is described next.
[0140] The matched filter f is constructed as a double difference
of values obtained from two controlled experiments. Begin with
known quantities of euploid male genetic material and euploid female
genetic material. Assume there are large quantities of this
material, and pre-amplification can be omitted. On both the male
and female material, use PCR to sequence n SNPs on both the X
chromosome (chromosome 23), and chromosome 7. Let {t.sub.ij.sup.X},
i=1 . . . n, j .di-elect cons. {7, 23} denote the measurements on the female, and
let {t.sub.ij.sup.Y} similarly denote the measurements on the male.
Given this, it is possible to construct the matched filter f from
the resulting data as follows:
$$
\begin{aligned}
\bar{t}_7^X &\triangleq \frac{1}{n} \sum_i t_{i,7}^X \\
\bar{t}_7^Y &\triangleq \frac{1}{n} \sum_i t_{i,7}^Y \\
\Delta^X &\triangleq t_{23}^X - \bar{t}_7^X\,\mathbf{1} \\
\Delta^Y &\triangleq t_{23}^Y - \bar{t}_7^Y\,\mathbf{1} \\
f &\triangleq \Delta^Y - \Delta^X
\end{aligned}
$$

In the above equations, the chromosome 7 means {overscore (t)}.sub.7.sup.X and
{overscore (t)}.sub.7.sup.Y are scalars, while .DELTA..sup.X and
.DELTA..sup.Y are vectors. Note
that the superscripts X and Y are just symbolic labels, not
indices, denoting female and male respectively. Do not confuse
the superscript X with measurements on the X chromosome. The X
chromosome measurements are the ones with subscript 23.
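The construction of the matched filter and its use as a monosomy score can be sketched as follows. The function names and the flat-list representation of per-SNP ct values are illustrative assumptions; the decision threshold is left to the caller because the text does not specify one.

```python
def matched_filter(t23_female, t7_female, t23_male, t7_male):
    """Build f = Delta^Y - Delta^X from control ct values on chromosomes 23 (X) and 7."""
    mean7_female = sum(t7_female) / len(t7_female)
    mean7_male = sum(t7_male) / len(t7_male)
    delta_x = [t - mean7_female for t in t23_female]  # female control
    delta_y = [t - mean7_male for t in t23_male]      # male control
    return [dy - dx for dy, dx in zip(delta_y, delta_x)]


def monosomy_score(f, t_chrom, t_bar):
    """Inner product of the filter with the mean-centered ct vector of one chromosome."""
    return sum(fi * (ti - t_bar) for fi, ti in zip(f, t_chrom))
```

A chromosome would then be called monosomic when monosomy_score exceeds a threshold chosen from control data.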
[0141] The next step is to take noise into account and to see what
remnants of noise survive in the construction of the matched filter
f as well as in the construction of {tilde over (t)}.sub.j. In this
section, consider the simplest possible modeling assumption: that
.beta..sub.ij=.beta. for all i and j, and that .alpha..sub.ij=1
for all i and j. Under these assumptions, from (*) above:

$$ \beta t_{ij} = \log Q_T - \log n_j - \log Q + Z_{ij} $$

which can be rewritten as:

$$ t_{ij} = \frac{1}{\beta} \log Q_T - \frac{1}{\beta} \log n_j - \frac{1}{\beta} \log Q + Z_{ij} $$

Here the noise term has absorbed the factor 1/.beta. and is still written Z.sub.ij.
In that case, the i-th component of the matched filter f is given
by:
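$$
\begin{aligned}
f_i &\triangleq \Delta_i^Y - \Delta_i^X = \{ t_{i,23}^Y - \bar{t}_7^Y \} - \{ t_{i,23}^X - \bar{t}_7^X \} \\
&= \Big\{ \Big( \frac{1}{\beta} \log Q_T - \frac{1}{\beta} \log n_{23}^Y - \frac{1}{\beta} \log Q^Y + Z_{i,23}^Y \Big) - \frac{1}{n} \sum_i \Big( \frac{1}{\beta} \log Q_T - \frac{1}{\beta} \log n_7^Y - \frac{1}{\beta} \log Q^Y + Z_{i,7}^Y \Big) \Big\} \\
&\quad - \Big\{ \Big( \frac{1}{\beta} \log Q_T - \frac{1}{\beta} \log n_{23}^X - \frac{1}{\beta} \log Q^X + Z_{i,23}^X \Big) - \frac{1}{n} \sum_i \Big( \frac{1}{\beta} \log Q_T - \frac{1}{\beta} \log n_7^X - \frac{1}{\beta} \log Q^X + Z_{i,7}^X \Big) \Big\} \\
&= \Big\{ \Big( \frac{1}{\beta} + Z_{i,23}^Y \Big) - \frac{1}{n} \sum_i Z_{i,7}^Y \Big\} - \Big\{ Z_{i,23}^X - \frac{1}{n} \sum_i Z_{i,7}^X \Big\}
\end{aligned}
$$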
Note that the above equations take advantage of the fact that all
the copy number variables are known, for example, n.sub.23.sup.Y=1
and that n.sub.23.sup.X=2.
[0142] Given that all the noise terms are zero mean, the ideal
matched filter is (1/.beta.)1, where 1 denotes the all-ones vector. Further, since scaling the filter
vector doesn't really change things, the vector 1 can be used as
the matched filter. This is equivalent to simply taking the average
of the components of {tilde over (t)}.sub.j. In other words, the
matched filter paradigm is not necessary if the underlying
biochemistry follows the simple model. In addition, one may omit
the noise terms above, which can only serve to lower the accuracy
of the method. Accordingly, this gives:
$$
\begin{aligned}
\tilde{t}_{ij} &\triangleq t_{ij} - \bar{t} = \Big\{ \frac{1}{\beta} \log Q_T - \frac{1}{\beta} \log n_j - \frac{1}{\beta} \log Q + Z_{ij} \Big\} - \frac{1}{mn} \sum_{i,j} \Big\{ \frac{1}{\beta} \log Q_T - \frac{1}{\beta} \log n_j - \frac{1}{\beta} \log Q + Z_{ij} \Big\} \\
&= \frac{1}{\beta} (1 - \log n_j) + Z_{ij} - \frac{1}{mn} \sum_{i,j} Z_{ij}
\end{aligned}
$$

In the above, it is assumed that

$$ \frac{1}{mn} \sum_{i,j} \log n_j = 1, $$

that is, that the average copy number is 2. Each element of the
vector is an independent measurement of the log copy number (scaled
by 1/.beta.), and then corrupted by noise. The noise term Z.sub.ij
cannot be gotten rid of: it is inherent in the measurement. The
second noise term probably cannot be gotten rid of either, since
subtracting out {overscore (t)} is necessary to remove the nuisance term
(1/.beta.)log Q.
Again, note that, given the observation that each element of {tilde
over (t)}.sub.j is an independent measurement of
(1/.beta.)(1-log n.sub.j),
it is clear that a UMVU (uniformly minimum variance unbiased)
estimate of
(1/.beta.)(1-log n.sub.j)
is just the average of the elements of {tilde over (t)}.sub.j. (In
the case in which each .sigma..sub.ij.sup.2 is different, it will
be a weighted average.) Thus, performing a little bit of algebra,
the UMVU estimator for log n.sub.j is given by:

$$
\frac{1}{n} \sum_i \tilde{t}_{ij} \approx \frac{1}{\beta} (1 - \log n_j)
\quad \Rightarrow \quad
\log n_j \approx 1 - \beta\, \frac{1}{n} \sum_i \tilde{t}_{ij} = 1 - \beta \left( \frac{1}{n} \sum_i t_{ij} - \frac{1}{mn} \sum_{i,j} t_{ij} \right)
$$
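Under the simple model, the estimator above amounts to mean-centering the ct values and averaging within each chromosome. A minimal sketch follows; the function name and the nested-list layout t[i][j] are illustrative assumptions.

```python
def log2_copy_number(t, beta):
    """Estimate log2 n_j for each chromosome j from regularized ct values t[i][j]."""
    n, m = len(t), len(t[0])
    grand_mean = sum(sum(row) for row in t) / (m * n)
    estimates = []
    for j in range(m):
        chrom_mean = sum(t[i][j] for i in range(n)) / n
        # 1 - beta * (average of the mean-centered ct values on chromosome j)
        estimates.append(1.0 - beta * (chrom_mean - grand_mean))
    return estimates
```

An estimate near 0 suggests monosomy, near 1 disomy, and near log2(3), about 1.58, trisomy, subject to the assumption that the average log copy number across chromosomes is 1.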
Analysis Under the Complicated Model
[0143] Now repeat the preceding analysis with respect to a
biochemical model in which each .beta..sub.ij and .alpha..sub.ij is
different. Again, take noise into account to see what remnants
of noise survive in the construction of the matched filter f as
well as in the construction of {tilde over (t)}.sub.j. Under the
complicated model, from (*) above:
$$ \beta_{ij} t_{ij} = \log\frac{Q_T}{\alpha_{ij}} - \log n_j - \log Q + Z_{ij} $$

which can be rewritten as:

$$ t_{ij} = \frac{1}{\beta_{ij}} \log\frac{Q_T}{\alpha_{ij}} - \frac{1}{\beta_{ij}} \log n_j - \frac{1}{\beta_{ij}} \log Q + Z_{ij} \qquad (***) $$
The i-th component of the matched filter f is given by:
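$$
\begin{aligned}
f_i &= \Delta_i^{Y} - \Delta_i^{X}
 = \left\{ t^{Y}_{i,23} - \bar{t}^{\,Y}_{7} \right\} - \left\{ t^{X}_{i,23} - \bar{t}^{\,X}_{7} \right\} \\
&= \left\{ \left( \frac{1}{\beta_{i,23}}\log\frac{Q_T}{\alpha_{i,23}} - \frac{1}{\beta_{i,23}}\log n^{Y}_{23} - \frac{1}{\beta_{i,23}}\log Q^{Y} + Z^{Y}_{i,23} \right)
 - \frac{1}{n}\sum_{i}\left( \frac{1}{\beta_{i,7}}\log\frac{Q_T}{\alpha_{i,7}} - \frac{1}{\beta_{i,7}}\log n^{Y}_{7} - \frac{1}{\beta_{i,7}}\log Q^{Y} + Z^{Y}_{i,7} \right) \right\} \\
&\quad - \left\{ \left( \frac{1}{\beta_{i,23}}\log\frac{Q_T}{\alpha_{i,23}} - \frac{1}{\beta_{i,23}}\log n^{X}_{23} - \frac{1}{\beta_{i,23}}\log Q^{X} + Z^{X}_{i,23} \right)
 - \frac{1}{n}\sum_{i}\left( \frac{1}{\beta_{i,7}}\log\frac{Q_T}{\alpha_{i,7}} - \frac{1}{\beta_{i,7}}\log n^{X}_{7} - \frac{1}{\beta_{i,7}}\log Q^{X} + Z^{X}_{i,7} \right) \right\} \\
&= \frac{1}{\beta_{i,23}} - \left( \frac{1}{\beta_{i,23}} - \frac{1}{n}\sum_{i}\frac{1}{\beta_{i,7}} \right)\log\frac{Q^{Y}}{Q^{X}}
 + \left\{ Z^{Y}_{i,23} - Z^{X}_{i,23} + \frac{1}{n}\sum_{i} Z^{X}_{i,7} - \frac{1}{n}\sum_{i} Z^{Y}_{i,7} \right\}
\end{aligned}
$$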
Under the complicated model, this gives:
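$$
\tilde{t}_{ij} = t_{ij} - \bar{t}
= \left\{ \frac{1}{\beta_{ij}}\log\frac{Q_T}{\alpha_{ij}} - \frac{1}{\beta_{ij}}\log n_j - \frac{1}{\beta_{ij}}\log Q + Z_{ij} \right\}
- \frac{1}{mn}\sum_{i,j}\left\{ \frac{1}{\beta_{ij}}\log\frac{Q_T}{\alpha_{ij}} - \frac{1}{\beta_{ij}}\log n_j - \frac{1}{\beta_{ij}}\log Q + Z_{ij} \right\}
$$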
An Alternate Way to Regularize CT Values
[0144] In another embodiment of the method, one can average the CT
values rather than transforming to exponential scale and then
taking logs, as this distorts the noise so that it is no longer
zero mean. First, start with known Q and solve for betas. Then do
multiple experiments with known n_j to solve for alphas. Since
aneuploidy is a whole set of hypotheses, it is convenient to use ML
to determine the most likely n_j and Q values, and then use this as
a basis for calculating the most likely aneuploid state, e.g., by
taking the n_j value that is most off from 1 and pushing it to its
nearest aneuploid neighbor.
Estimation of the Error Rates in the Embryonic Measurements.
[0145] In one embodiment of the invention, it is possible to
determine the conditional probabilities of particular embryonic
measurements given specific underlying true states in embryonic
DNA. In certain contexts, the given data consists of (i) the data
about the parental SNP states, measured with a high degree of
accuracy, and (ii) measurements on all of the SNPs in a specific
blastomere, measured poorly.
[0146] Use the following notation: U is any specific homozygote,
Ū is the other homozygote at that SNP, and H is the heterozygote. The
goal is to determine the probabilities (p.sub.ij) shown in Table 2.
For instance p.sub.11 is the probability of the embryonic DNA being
U and the readout being U as well. There are three conditions that
these probabilities have to satisfy:
p.sub.11+p.sub.12+p.sub.13+p.sub.14=1 (1)
p.sub.21+p.sub.22+p.sub.23+p.sub.24=1 (2)
p.sub.21=p.sub.23 (3)
The first two are obvious, and the third is the statement of
symmetry of heterozygote dropouts (H should give the same dropout
rate on average to either U or Ū).
[0147] There are 4 possible types of matings: U.times.U, U.times.Ū,
U.times.H, H.times.H. Split all of the SNPs into these 4 categories
depending on the specific mating type. Table 3 shows the matings,
expected embryonic states, and then probabilities of specific
readings (p.sub.ij). Note that the first two rows of this table are
the same as the two rows of the Table 2 and the notation (p.sub.ij)
remains the same as in Table 2.
[0148] Probabilities p.sub.3i and p.sub.4i can be written out in
terms of p.sub.1i and p.sub.2i.
p.sub.31=1/2[p.sub.11+p.sub.21] (4)
p.sub.32=1/2[p.sub.12+p.sub.22] (5)
p.sub.33=1/2[p.sub.13+p.sub.23] (6)
p.sub.34=1/2[p.sub.14+p.sub.24] (7)
p.sub.41=1/4[p.sub.11+2p.sub.21+p.sub.13] (8)
p.sub.42=1/2[p.sub.12+p.sub.22] (9)
p.sub.43=1/4[p.sub.11+2p.sub.23+p.sub.13] (10)
p.sub.44=1/2[p.sub.14+p.sub.24] (11)
These can be thought of as a set of 8 linear constraints to add to
the constraints (1), (2), and (3) listed above. If a vector
P=[p.sub.11, p.sub.12, p.sub.13, p.sub.14, p.sub.21 . . . ,
p.sub.44].sup.T (16.times.1 dimension) is defined, then the matrix
A (11.times.16) and a vector C can be defined such that the
constraints can be represented as:
AP=C (12)
where C=[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0].sup.T (11.times.1 dimension).
Specifically, A is shown in Table 4, where empty cells have
zeroes.
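As an illustration of how constraints (1) through (11) can be assembled into matrix form, the following Python sketch builds one possible A and C directly from the equations above. Table 4 is not reproduced here, so the row order and sign convention are assumptions, as are the function names; C is taken to have one entry per constraint row. The helper `expand_rows` derives rows 3 and 4 of the probability table from rows 1 and 2.

```python
from fractions import Fraction


def build_constraints():
    """Return (A, C) for constraints (1)-(11); P is ordered p11..p14, p21..p44."""
    def idx(i, j):
        return 4 * (i - 1) + (j - 1)

    rows, rhs = [], []

    def add(terms, value):
        row = [Fraction(0)] * 16
        for (i, j), coeff in terms.items():
            row[idx(i, j)] += Fraction(coeff)
        rows.append(row)
        rhs.append(Fraction(value))

    half, quarter = Fraction(1, 2), Fraction(1, 4)
    add({(1, 1): 1, (1, 2): 1, (1, 3): 1, (1, 4): 1}, 1)    # (1)
    add({(2, 1): 1, (2, 2): 1, (2, 3): 1, (2, 4): 1}, 1)    # (2)
    add({(2, 1): 1, (2, 3): -1}, 0)                         # (3)
    for j in range(1, 5):                                   # (4)-(7)
        add({(3, j): 1, (1, j): -half, (2, j): -half}, 0)
    add({(4, 1): 1, (1, 1): -quarter, (2, 1): -half, (1, 3): -quarter}, 0)  # (8)
    add({(4, 2): 1, (1, 2): -half, (2, 2): -half}, 0)                       # (9)
    add({(4, 3): 1, (1, 1): -quarter, (2, 3): -half, (1, 3): -quarter}, 0)  # (10)
    add({(4, 4): 1, (1, 4): -half, (2, 4): -half}, 0)                       # (11)
    return rows, rhs


def expand_rows(row1, row2):
    """Derive rows 3 and 4 from rows 1 and 2 using equations (4)-(11)."""
    r1 = [Fraction(x) for x in row1]
    r2 = [Fraction(x) for x in row2]
    r3 = [(a + b) / 2 for a, b in zip(r1, r2)]
    r4 = [
        (r1[0] + 2 * r2[0] + r1[2]) / 4,
        (r1[1] + r2[1]) / 2,
        (r1[0] + 2 * r2[2] + r1[2]) / 4,
        (r1[3] + r2[3]) / 2,
    ]
    return r1 + r2 + r3 + r4
```

Exact rational arithmetic is used so that AP=C can be checked with equality rather than a tolerance.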
[0149] The problem can now be framed as that of finding P that
would maximize the likelihood of the observations and that is
subject to a set of linear constraints (AP=C). The observations
come in the same 16 types as p.sub.ij. These are shown in Table 5.
The likelihood of making a set of these 16 n.sub.ij observations is
defined by a multinomial distribution with the probabilities
p.sub.ij and is proportional to:
$$L(P, n_{ij}) \propto \prod_{ij} p_{ij}^{\,n_{ij}} \qquad (13)$$
Note that the full likelihood function contains multinomial
coefficients that are not written out given that these coefficients
do not depend on P and thus do not change the values within P at
which L is maximized. The problem is then to find:
$$
\arg\max_{P}\left[ L(P, n_{ij}) \right] = \arg\max_{P}\left[ \ln L(P, n_{ij}) \right]
= \arg\max_{P}\left( \sum_{ij} n_{ij}\ln p_{ij} \right) \qquad (14)
$$
subject to the constraints AP=C. Note that in (14) taking the ln of
L makes the problem more tractable (a sum instead of a product).
This is standard: because ln is monotonically increasing, the value
of x that maximizes f(x) also maximizes ln(f(x)).
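The objective in (14) is straightforward to evaluate numerically. The sketch below is an assumption-laden illustration rather than the disclosed implementation: it represents the p.sub.ij and n.sub.ij as 4.times.4 nested lists and only computes the objective; a constrained optimizer that enforces AP=C would be layered on top and is not shown.

```python
import math


def log_likelihood(p, n):
    """Return sum_ij n_ij * ln(p_ij), the log of (13) up to an additive constant.

    p and n are nested lists of probabilities and observation counts.
    Cells with n_ij == 0 contribute nothing; a zero probability paired with a
    positive count yields -inf.
    """
    total = 0.0
    for p_row, n_row in zip(p, n):
        for p_ij, n_ij in zip(p_row, n_row):
            if n_ij == 0:
                continue
            if p_ij <= 0.0:
                return float("-inf")
            total += n_ij * math.log(p_ij)
    return total
```

Comparing the value returned for two candidate probability tables indicates which one explains the observed counts better.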
D MAP Detection of Aneuploidy without Parents
[0150] In one embodiment of the invention, the PS method can be
applied to determine the number of copies of a given chromosome
segment in a target without using parental genetic information. In
this section, a maximum a-posteriori (MAP) method is described that
enables the classification of genetic allele information as
aneuploid or euploid. The method does not require parental data,
though when parental data are available the classification power is
enhanced. The method does not require regularization of channel
values. One way to determine the number of copies of a chromosome
segment in the genome of a target individual by incorporating the
genetic data of the target individual and related individual(s)
into a hypothesis, and calculating the most likely hypothesis is
described here. In this description, the method will be applied to
ct values from TAQMAN measurements; it should be obvious to one
skilled in the art how to apply this method to any kind of
measurement from any platform. The description will focus on the
case in which there are measurements on just chromosomes X and 7;
again, it should be obvious to one skilled in the art how to apply
the method to any number of chromosomes and sections of
chromosomes.
Setup of the Problem
[0151] The given measurements are from triploid blastomeres, on
chromosomes X and 7, and the goal is to successfully make
aneuploidy calls on these. The only "truth" known about these
blastomeres is that there must be three copies of chromosome 7. The
number of copies of chromosome X is not known.
[0152] The strategy here is to use MAP estimation to classify the
copy number N.sub.7 of chromosome 7 from among the choices {1,2,3} given
the measurements D. Formally that looks like this:
$$\hat{n}_7 = \arg\max_{n_7 \in \{1,2,3\}} P(n_7, D)$$
Unfortunately, it is not possible to calculate this probability,
because the probability depends on the unknown quantity Q. If the
distribution f on Q were known, then it would be possible to solve
the following:
$$\hat{n}_7 = \arg\max_{n_7 \in \{1,2,3\}} \int f(Q)\, P(n_7, D \mid Q)\, dQ$$
In practice, a continuous distribution on Q is not known. However,
identifying Q to within a power of two is sufficient, and in
practice a probability mass function (pmf) on Q that is uniform on
say {2.sup.1, 2.sup.2 . . . 2.sup.40} can be used. In the
development that follows, the integral sign will be used as though
a probability distribution function (pdf) on Q were known, even
though in practice a uniform pmf on a handful of exponential values
of Q will be substituted.
[0153] This discussion will use the following notation and
definitions:
N.sub.7 is the copy number of chromosome seven. It is a random
variable. n.sub.7 denotes a potential value for N.sub.7.
N.sub.X is the copy number of chromosome X. n.sub.X denotes a
potential value for N.sub.X.
N.sub.j is the copy number of chromosome-j, where for the purposes
here j ∈ {7,X}. n.sub.j denotes a potential value for N.sub.j.
D is the set of all measurements. In one case, these are TAQMAN
measurements on chromosomes X and 7, so this gives
D={D.sub.7,D.sub.X}, where D.sub.j={t.sub.ij.sup.A,t.sub.ij.sup.C}
is the set of TAQMAN measurements on chromosome j.
t.sub.ij.sup.A is the ct value on channel-A of locus i of
chromosome-j. Similarly, t.sub.ij.sup.C is the ct value on channel-C
of locus i of chromosome-j. (A is just a logical name and denotes
the major allele value at the locus, while C denotes the minor
allele value at the locus.)
Q represents a unit-amount of genetic material such that, if the
copy number of chromosome-j is n.sub.j, then the total amount of
genetic material at any locus of chromosome-j is n.sub.jQ. For
example, under trisomy, if a locus were AAC, then the amount of
A-material at this locus would be 2Q, the amount of C-material at
this locus would be Q, and the total combined amount of genetic
material at this locus would be 3Q.
(n.sup.A,n.sup.C) denotes an unordered allele pattern at a locus
when the copy number for the associated chromosome is n. n.sup.A is
the number of times allele A appears on the locus and n.sup.C is
the number of times allele C appears on the locus. Each can take on
values in 0, . . . , n, and it must be the case that
n.sup.A+n.sup.C=n. For example, under trisomy, the set of allele
patterns is {(0,3), (1,2), (2,1), (3,0)}. The allele pattern (2,1)
for example corresponds to a locus value of A.sup.2C, i.e., that
two chromosomes have allele value A and the third has an allele
value of C at the locus. Under disomy, the set of allele patterns
is {(0,2), (1,1), (2,0)}. Under monosomy, the set of allele
patterns is {(0,1), (1,0)}.
Q.sub.T is the (known) threshold value from the fundamental TAQMAN
equation Q.sub.0 2.sup..beta.t=Q.sub.T.
.beta. is the (known) doubling-rate from the fundamental TAQMAN
equation Q.sub.0 2.sup..beta.t=Q.sub.T.
.perp. (pronounced "bottom") is the ct value that is interpreted as
meaning "no signal".
f.sub.Z(x) is the standard normal Gaussian pdf evaluated at x.
.sigma. is the (known) standard deviation of the noise on TAQMAN ct
values.
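The sets of allele patterns listed above follow a simple rule, which the following Python sketch enumerates. The function name is illustrative only.

```python
def allele_patterns(n):
    """Return all unordered allele patterns (n_A, n_C) with n_A + n_C == n."""
    return [(n_a, n - n_a) for n_a in range(n + 1)]
```

For copy number n there are n+1 patterns, which is the factor that appears when the sum over patterns is evaluated at each locus.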
MAP Solution
[0154] In the solution below, the following assumptions have been
made:
[0155] N.sub.7 and N.sub.X are independent.
[0156] Allele values on neighboring loci are independent.
[0157] The goal is to classify the copy number of a designated
chromosome. In this case, the description will focus on chromosome
7. The MAP solution is given by:
$$
\begin{aligned}
\hat{n}_7 &= \arg\max_{n_7 \in \{1,2,3\}} \int f(Q)\, P(n_7, D \mid Q)\, dQ \\
&= \arg\max_{n_7 \in \{1,2,3\}} \int f(Q) \sum_{n_X \in \{1,2,3\}} P(n_7, n_X, D \mid Q)\, dQ \\
&= \arg\max_{n_7 \in \{1,2,3\}} \int f(Q) \sum_{n_X \in \{1,2,3\}} P(n_7)\, P(n_X)\, P(D_7 \mid Q, n_7)\, P(D_X \mid Q, n_X)\, dQ \\
&= \arg\max_{n_7 \in \{1,2,3\}} \int f(Q) \Big( P(n_7)\, P(D_7 \mid Q, n_7) \Big) \Big( \sum_{n_X \in \{1,2,3\}} P(n_X)\, P(D_X \mid Q, n_X) \Big)\, dQ \\
&= \arg\max_{n_7 \in \{1,2,3\}} \int f(Q) \Big( P(n_7) \prod_i P(t^A_{i,7}, t^C_{i,7} \mid Q, n_7) \Big) \Big( \sum_{n_X \in \{1,2,3\}} P(n_X) \prod_i P(t^A_{i,X}, t^C_{i,X} \mid Q, n_X) \Big)\, dQ \\
&= \arg\max_{n_7 \in \{1,2,3\}} \int f(Q) \Big( P(n_7) \prod_i \sum_{n^A+n^C=n_7} P(n^A, n^C \mid n_7, i)\, P(t^A_{i,7} \mid Q, n^A)\, P(t^C_{i,7} \mid Q, n^C) \Big) \\
&\qquad \times \Big( \sum_{n_X \in \{1,2,3\}} P(n_X) \prod_i \sum_{n^A+n^C=n_X} P(n^A, n^C \mid n_X, i)\, P(t^A_{i,X} \mid Q, n^A)\, P(t^C_{i,X} \mid Q, n^C) \Big)\, dQ \qquad (*)
\end{aligned}
$$
Allele Distribution Model
[0158] Equation (*) depends on being able to calculate values for
P(n.sup.A,n.sup.C|n.sub.7,i) and P(n.sup.A,n.sup.C|n.sub.X,i).
These values may be calculated by assuming that the allele pattern
(n.sup.A,n.sup.C) is drawn i.i.d (independent and identically
distributed) according to the allele frequencies for its letters at
locus i. An example should suffice to illustrate this. Calculate
P((2,1)|n.sub.7=3) under the assumption that the allele frequency
for A is 60%, and the minor allele frequency for C is 40%. (As an
aside, note that P((2,1)|n.sub.7=2)=0, since in this case the pair
must sum to 2.) This probability is given by
$$P\big((2,1) \mid n_7 = 3\big) = \binom{3}{2}(.60)^2(.40)$$
The general equation is
$$P(n^A, n^C \mid n_j, i) = \binom{n_j}{n^A}\,(1 - p_{ij})^{n^A}\,(p_{ij})^{n^C}$$
where p.sub.ij is the minor allele frequency at locus i of
chromosome j.
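The general equation can be checked numerically against the worked example. The following Python sketch is illustrative; the function and parameter names are not taken from the disclosure.

```python
from math import comb


def allele_pattern_probability(n_a, n_c, n, minor_freq):
    """P((n_a, n_c) | copy number n) with alleles drawn i.i.d. at one locus.

    minor_freq is the minor allele (C) frequency p_ij at the locus.
    Patterns that do not sum to n have probability zero.
    """
    if n_a < 0 or n_c < 0 or n_a + n_c != n:
        return 0.0
    return comb(n, n_a) * (1.0 - minor_freq) ** n_a * minor_freq ** n_c
```

For the example above, with a minor allele frequency of 40%, the pattern (2,1) under trisomy evaluates to 3 x 0.36 x 0.40 = 0.432.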
Error Model
[0159] Equation (*) depends on being able to calculate values for
P(t.sup.A|Q,n.sup.A) and P(t.sup.C|Q,n.sup.C). For this an error
model is needed. One may use the following error model:
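$$
P(t^A \mid Q, n^A) =
\begin{cases}
p_d & t^A = \perp \text{ and } n^A > 0 \\
(1 - p_d)\, f_Z\!\left( \dfrac{1}{\sigma}\left( t^A - \dfrac{1}{\beta}\log\dfrac{Q_T}{n^A Q} \right) \right) & t^A \neq \perp \text{ and } n^A > 0 \\
1 - p_a & t^A = \perp \text{ and } n^A = 0 \\
2\, p_a\, f_Z\!\left( \dfrac{1}{\sigma}\left( t^A - \perp \right) \right) & t^A \neq \perp \text{ and } n^A = 0
\end{cases}
$$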
[0160] Each of the four cases mentioned above is described here. In
the first case, no signal is received, even though there was
A-material on the locus. That is a dropout, and its probability is
therefore p.sub.d. In the second case, a signal is received, as
expected since there was A-material on the locus. The probability
of this is the probability that a dropout does not occur,
multiplied by the pdf for the distribution on the ct value when
there is no dropout. (Note that, to be rigorous, one should divide
through by that portion of the probability mass on the Gaussian
curve that lies below .perp., but this is practically one, and will
be ignored here.) In the third case, no signal was received and
there was no signal to receive. This is the probability that no
drop-in occurred, 1-p.sub.a. In the final case, a signal is
received even though there was no A-material on the locus. This is
the probability of a drop-in multiplied by the pdf for the
distribution on the ct value when there is a drop-in. Note that the
`2` at the beginning of the equation occurs because the Gaussian
distribution in the case of a drop-in is modeled as being centered
at .perp.. Thus, only half of the probability mass lies below
.perp. in the case of a drop-in, and when the equation is
normalized by dividing through by one-half, it is equivalent to
multiplying by 2. The error model for P(t.sup.C|Q,n.sup.C) by
symmetry is the same as for P(t.sup.A|Q,n.sup.A) above. It should be
obvious to one skilled in the art how different error models can be
applied to a range of different genotyping platforms, for example
the ILLUMINA INFINIUM genotyping platform.
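The four cases of the error model map directly onto a small function. The Python sketch below is a minimal illustration under stated assumptions: logarithms are taken base 2 (consistent with the TAQMAN equation Q.sub.0 2.sup..beta.t=Q.sub.T), a missing signal is represented by `None`, and the numeric ct value at which a drop-in signal is centered is passed separately as `bottom`. All names are hypothetical.

```python
import math

NO_SIGNAL = None  # stands in for the "bottom" reading, i.e. no signal


def standard_normal_pdf(x):
    """Standard normal density f_Z evaluated at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def ct_likelihood(t, n_allele, q, q_t, beta, sigma, p_d, p_a, bottom):
    """P(t | Q, n_allele) under the four-case error model.

    t is a ct value, or NO_SIGNAL when nothing was detected.
    p_d is the dropout probability and p_a the drop-in probability.
    """
    if n_allele > 0:
        if t is NO_SIGNAL:
            return p_d                                   # case 1: dropout
        expected = math.log2(q_t / (n_allele * q)) / beta
        return (1.0 - p_d) * standard_normal_pdf((t - expected) / sigma)  # case 2
    if t is NO_SIGNAL:
        return 1.0 - p_a                                 # case 3: nothing to see
    return 2.0 * p_a * standard_normal_pdf((t - bottom) / sigma)          # case 4: drop-in
```

The same function serves for channel C by passing n.sup.C and t.sup.C, reflecting the symmetry noted above.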
Computational Considerations
[0161] In one embodiment of the invention, the MAP estimation
mathematics can be carried out by brute-force as specified in the
final MAP equation, except for the integration over Q. Since
doubling Q only results in a difference in ct value of 1/.beta.,
the equations are sensitive to Q only on the log scale. Therefore
to do the integration it should be sufficient to try a handful of
Q-values at different powers of two and to assume a uniform
distribution on these values. For example, one could start at
Q=Q.sub.T2.sup.-20.beta., which is the quantity of material that
would result in a ct value of 20, and then halve it in succession
twenty times, yielding a final Q value that would result in a ct
value of 40.
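The grid of candidate Q values described above can be generated as follows. This Python sketch is illustrative: the function names and default arguments are assumptions, and base-2 logarithms are used. Each halving raises the expected ct value by 1/.beta., so the final ct value equals 40 exactly when .beta.=1.

```python
import math


def q_grid(q_t, beta, start_ct=20, halvings=20):
    """Candidate Q values: start at the Q giving ct = start_ct, then halve repeatedly."""
    q0 = q_t * 2.0 ** (-start_ct * beta)
    return [q0 / 2.0 ** k for k in range(halvings + 1)]


def expected_ct(q, q_t, beta):
    """ct value at which a starting quantity q reaches the threshold q_t."""
    return math.log2(q_t / q) / beta
```

A uniform pmf over the returned list then assigns weight 1/len(grid) to each candidate, replacing the integral over Q with a finite sum.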
[0162] What follows is a re-derivation of a derivation described
elsewhere in this disclosure, with a slightly different emphasis,
for elucidating the programming of the math. Note that the variable
D below is not really a variable. It is always a constant set to
the value of the data set actually in question, so it does not
introduce another array dimension when representing in MATLAB.
However, the variables D.sub.j do introduce an array dimension, due
to the presence of the index j.
$$
\begin{aligned}
\hat{n}_7 &= \arg\max_{n_7 \in \{1,2,3\}} P(n_7, D) \\
P(n_7, D) &= \sum_{Q} P(n_7, Q, D) \\
P(n_7, Q, D) &= P(n_7)\, P(Q)\, P(D_7 \mid Q, n_7)\, P(D_X \mid Q) \\
P(D_j \mid Q) &= \sum_{n_j \in \{1,2,3\}} P(D_j, n_j \mid Q) \\
P(D_j, n_j \mid Q) &= P(n_j)\, P(D_j \mid Q, n_j) \\
P(D_j \mid Q, n_j) &= \prod_{i} P(D_{ij} \mid Q, n_j) \\
P(D_{ij} \mid Q, n_j) &= \sum_{n^A+n^C=n_j} P(D_{ij}, n^A, n^C \mid Q, n_j) \\
P(D_{ij}, n^A, n^C \mid Q, n_j) &= P(n^A, n^C \mid n_j, i)\, P(t^A_{ij} \mid Q, n^A)\, P(t^C_{ij} \mid Q, n^C) \\
P(n^A, n^C \mid n_j, i) &= \binom{n_j}{n^A}\,(1 - p_{ij})^{n^A}\,(p_{ij})^{n^C} \\
P(t^A_{ij} \mid Q, n^A) &=
\begin{cases}
p_d & t^A = \perp \text{ and } n^A > 0 \\
(1 - p_d)\, f_Z\!\left( \dfrac{1}{\sigma}\left( t^A - \dfrac{1}{\beta}\log\dfrac{Q_T}{n^A Q} \right) \right) & t^A \neq \perp \text{ and } n^A > 0 \\
1 - p_a & t^A = \perp \text{ and } n^A = 0 \\
2\, p_a\, f_Z\!\left( \dfrac{1}{\sigma}\left( t^A - \perp \right) \right) & t^A \neq \perp \text{ and } n^A = 0
\end{cases}
\end{aligned}
$$
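The chain of equations above translates almost line for line into code. The Python sketch below is one possible arrangement and not the disclosed MATLAB implementation: it assumes uniform priors on copy number and on the Q grid, base-2 logarithms, `None` for a missing signal, and a hypothetical data layout in which each chromosome maps to a list of (t_A, t_C, minor allele frequency) tuples. Products of many small likelihoods can underflow for large numbers of loci; a practical version would likely work in log space.

```python
import math
from math import comb


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def ct_lik(t, n_allele, q, par):
    """P(t | Q, n_allele): the four-case error model (t is None for no signal)."""
    if n_allele > 0:
        if t is None:
            return par["p_d"]
        mu = math.log2(par["q_t"] / (n_allele * q)) / par["beta"]
        return (1.0 - par["p_d"]) * normal_pdf((t - mu) / par["sigma"])
    if t is None:
        return 1.0 - par["p_a"]
    return 2.0 * par["p_a"] * normal_pdf((t - par["bottom"]) / par["sigma"])


def chrom_lik(loci, n, q, par):
    """P(D_j | Q, n_j): product over loci of the sum over allele patterns."""
    total = 1.0
    for t_a, t_c, maf in loci:
        s = 0.0
        for n_a in range(n + 1):
            n_c = n - n_a
            pattern = comb(n, n_a) * (1.0 - maf) ** n_a * maf ** n_c
            s += pattern * ct_lik(t_a, n_a, q, par) * ct_lik(t_c, n_c, q, par)
        total *= s
    return total


def map_copy_number(data, target, q_values, par, copy_numbers=(1, 2, 3)):
    """argmax_n of sum_Q P(n) P(Q) P(D_target | Q, n) prod_{k != target} P(D_k | Q)."""
    prior_n = 1.0 / len(copy_numbers)   # assumed uniform prior on copy number
    prior_q = 1.0 / len(q_values)       # uniform pmf on the Q grid
    score = {n: 0.0 for n in copy_numbers}
    for q in q_values:
        others = 1.0
        for k, loci in data.items():
            if k != target:
                others *= sum(prior_n * chrom_lik(loci, n, q, par) for n in copy_numbers)
        for n in copy_numbers:
            score[n] += prior_n * prior_q * chrom_lik(data[target], n, q, par) * others
    return max(score, key=score.get)
```

Note that the measurements on the non-target chromosome still matter: they constrain Q, which in turn sharpens the call on the target chromosome.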
E MAP Detection of Aneuploidy with Parental Info
[0163] In one embodiment of the invention, the disclosed method
enables one to make aneuploidy calls on each chromosome of each
blastomere, given multiple blastomeres with measurements at some
loci on all chromosomes, where it is not known how many copies of
each chromosome there are. In this embodiment, MAP estimation
is used to classify the copy number N.sub.j of chromosome j, where j ∈
{1, 2 . . . 22, X, Y}, from among the choices {0, 1, 2, 3} given
the measurements D, which include genotyping information for both
the blastomeres and the parents. To be general, let j ∈ {1, 2 . . .
m} where m is the number of chromosomes of interest; m=24 implies
that all chromosomes are of interest. Formally, this looks
like:
$$\hat{n}_j = \arg\max_{n_j \in \{1,2,3\}} P(n_j, D)$$
[0164] Unfortunately, it is not possible to calculate this
probability, because the probability depends on an unknown random
variable Q that describes the amplification factor of MDA. If the
distribution f on Q were known, then it would be possible to solve
the following:
$$\hat{n}_j = \arg\max_{n_j \in \{1,2,3\}} \int f(Q)\, P(n_j, D \mid Q)\, dQ$$
[0165] In practice, a continuous distribution on Q is not known.
However, identifying Q to within a power of two is sufficient, and
in practice a probability mass function (pmf) on Q that is uniform
on say {2.sup.1, 2.sup.2 . . . , 2.sup.40} can be used. In the
development that follows, the integral sign will be used as though
a probability distribution function (pdf) on Q were known, even
though in practice a uniform pmf on a handful of exponential values
of Q will be substituted.
[0166] This discussion will use the following notation and
definitions:
[0167] N.sub..alpha. is the copy number of autosomal chromosome
.alpha., where .alpha. ∈ {1, 2 . . . 22}. It is a random variable.
n.sub..alpha. denotes a potential value for N.sub..alpha..
[0168] N.sub.X is the copy number of chromosome X. n.sub.X denotes
a potential value for N.sub.X.
[0169] N.sub.j is the copy number of chromosome-j, where for the
purposes here j ∈ {1, 2 . . . m}. n.sub.j denotes a potential value
for N.sub.j.
[0170] m is the number of chromosomes of interest, m=24 when all
chromosomes are of interest.
[0171] H is the set of aneuploidy states, with h ∈ H. For the
purposes of this derivation, let H={paternal monosomy, maternal
monosomy, disomy, t1 paternal trisomy, t2 paternal trisomy, t1
maternal trisomy, t2 maternal trisomy}. Paternal monosomy means the
only existing chromosome came from the father; paternal trisomy
means there is one additional chromosome coming from the father.
Type 1 (t1) paternal trisomy is such that the two paternal
chromosomes are sister chromosomes (exact copies of each other),
except in the case of crossover, when only sections of the two
chromosomes are exact copies. Type 2 (t2) paternal trisomy is such
that the two paternal chromosomes are complementary chromosomes
(independent chromosomes coming from two grandparents). The same
definitions apply to the maternal monosomy and maternal trisomies.
[0172] D is the set of all measurements, including measurements on
the embryo D.sub.E and on the parents D.sub.F, D.sub.M. In the case
where these are TAQMAN measurements on all chromosomes, one can say:
D={D.sub.1, D.sub.2 . . . , D.sub.m}, D.sub.E={D.sub.E,1,
D.sub.E,2 . . . , D.sub.E,m}, where D.sub.k=(D.sub.E,k, D.sub.F,k,
D.sub.M,k), and D.sub.E,j={t.sub.E,ij.sup.A,t.sub.E,ij.sup.C} is the
set of TAQMAN measurements on chromosome j.
[0173] t.sub.E,ij.sup.A is the ct value on channel-A of locus i of
chromosome-j. Similarly, t.sub.E,ij.sup.C is the ct value on
channel-C of locus i of chromosome-j. (A is just a logical name and
denotes the major allele value at the locus, while C denotes the
minor allele value at the locus.)
[0174] Q represents a unit-amount of genetic material after MDA of
a single cell's genomic DNA such that, if the copy number of
chromosome-j is n.sub.j, then the total amount of genetic material
at any locus of chromosome-j is n.sub.jQ. For example, under
trisomy, if a locus were AAC, then the amount of A-material at this
locus is 2Q, the amount of C-material at this locus is Q, and the
total combined amount of genetic material at this locus is 3Q.
[0175] q is the number of numerical steps that will be considered
for the value of Q.
[0176] N is the number of SNPs per chromosome that will be
measured.
[0177] (n.sup.A,n.sup.C) denotes an unordered allele pattern at a
locus when the copy number for the associated chromosome is n.
n.sup.A is the number of times allele A appears on the locus and
n.sup.C is the number of times allele C appears on the locus. Each
can take on values in 0, . . . , n, and it must be the case that
n.sup.A+n.sup.C=n. For example, under trisomy, the set of allele
patterns is {(0,3),(1,2),(2,1),(3,0)}. The allele pattern (2,1) for
example corresponds to a locus value of A.sup.2C, i.e., that two
chromosomes have allele value A and the third has an allele value
of C at the locus. Under disomy, the set of allele patterns is
{(0,2),(1,1),(2,0)}. Under monosomy, the set of allele patterns is
{(0,1),(1,0)}.
[0178] Q.sub.T is the (known) threshold value from the fundamental
TAQMAN equation Q.sub.0 2.sup..beta.t=Q.sub.T.
[0179] .beta. is the (known) doubling-rate from the fundamental
TAQMAN equation Q.sub.0 2.sup..beta.t=Q.sub.T.
[0180] .perp. (pronounced "bottom") is the ct value that is
interpreted as meaning "no signal".
[0181] f.sub.Z(x) is the standard normal Gaussian pdf evaluated at
x.
[0182] .sigma. is the (known) standard deviation of the noise on
TAQMAN ct values.
MAP Solution
[0183] In the solution below, the following assumptions are
made:
[0184] N.sub.js are independent of one another.
[0185] Allele values on neighboring loci are independent.
[0186] The goal is to classify the copy number of a designated
chromosome. For instance, the MAP solution for chromosome j is
given by
$$
\begin{aligned}
\hat{n}_j &= \arg\max_{n_j \in \{1,2,3\}} \int f(Q)\, P(n_j, D \mid Q)\, dQ \\
&= \arg\max_{n_j \in \{1,2,3\}} \int f(Q) \sum_{n_1 \in \{1,2,3\}} \cdots \sum_{n_{j-1} \in \{1,2,3\}} \sum_{n_{j+1} \in \{1,2,3\}} \cdots \sum_{n_m \in \{1,2,3\}} P(n_1, \ldots, n_m, D \mid Q)\, dQ \\
&= \arg\max_{n_j \in \{1,2,3\}} \int f(Q) \sum_{n_1 \in \{1,2,3\}} \cdots \sum_{n_{j-1} \in \{1,2,3\}} \sum_{n_{j+1} \in \{1,2,3\}} \cdots \sum_{n_m \in \{1,2,3\}} \prod_{k=1}^{m} P(n_k)\, P(D_k \mid Q, n_k)\, dQ \\
&= \arg\max_{n_j \in \{1,2,3\}} \int f(Q) \Big( P(n_j)\, P(D_j \mid Q, n_j) \Big) \Big( \prod_{k \neq j} \sum_{n_k \in \{1,2,3\}} P(n_k)\, P(D_k \mid Q, n_k) \Big)\, dQ \\
&= \arg\max_{n_j \in \{1,2,3\}} \int f(Q) \Big( P(n_j) \sum_{h \in H} P(D_j \mid Q, n_j, h)\, P(h \mid n_j) \Big) \Big( \prod_{k \neq j} \sum_{n_k \in \{1,2,3\}} P(n_k) \sum_{h \in H} P(D_k \mid Q, n_k, h)\, P(h \mid n_k) \Big)\, dQ \\
&= \arg\max_{n_j \in \{1,2,3\}} \int f(Q) \Big( P(n_j) \sum_{h \in H} P(h \mid n_j) \prod_i P(t^A_{E,ij}, t^C_{E,ij}, D_{F,ij}, D_{M,ij} \mid Q, n_j, h) \Big) \\
&\qquad \times \Big( \prod_{k \neq j} \sum_{n_k \in \{1,2,3\}} P(n_k) \sum_{h \in H} P(h \mid n_k) \prod_i P(t^A_{E,ik}, t^C_{E,ik}, D_{F,ik}, D_{M,ik} \mid Q, n_k, h) \Big)\, dQ \\
&= \arg\max_{n_j \in \{1,2,3\}} \int f(Q) \Big( P(n_j) \sum_{h \in H} P(h \mid n_j) \prod_i \sum_{\substack{n_F^A+n_F^C=2 \\ n_M^A+n_M^C=2}} P(n_F^A, n_F^C, n_M^A, n_M^C)\, P(t^A_{E,ij}, t^C_{E,ij}, D_{F,ij}, D_{M,ij} \mid Q, n_j, h, n_F^A, n_F^C, n_M^A, n_M^C) \Big) \\
&\qquad \times \Big( \prod_{k \neq j} \sum_{n_k \in \{1,2,3\}} P(n_k) \sum_{h \in H} P(h \mid n_k) \prod_i \sum_{\substack{n_F^A+n_F^C=2 \\ n_M^A+n_M^C=2}} P(n_F^A, n_F^C, n_M^A, n_M^C)\, P(t^A_{E,ik}, t^C_{E,ik}, D_{F,ik}, D_{M,ik} \mid Q, n_k, h, n_F^A, n_F^C, n_M^A, n_M^C) \Big)\, dQ \\
&= \arg\max_{n_j \in \{1,2,3\}} \int f(Q) \Big( P(n_j) \sum_{h \in H} P(h \mid n_j) \prod_i \sum_{\substack{n_F^A+n_F^C=2 \\ n_M^A+n_M^C=2}} P(n_F^A, n_F^C, n_M^A, n_M^C)\, P(t^A_{F,ij} \mid Q', n_F^A)\, P(t^C_{F,ij} \mid Q', n_F^C)\, P(t^A_{M,ij} \mid Q', n_M^A)\, P(t^C_{M,ij} \mid Q', n_M^C) \\
&\qquad\qquad \times \sum_{n^A+n^C=n_j} P(n^A, n^C \mid n_j, h, n_F^A, n_F^C, n_M^A, n_M^C)\, P(t^A_{E,ij} \mid Q, n^A)\, P(t^C_{E,ij} \mid Q, n^C) \Big) \\
&\qquad \times \Big( \prod_{k \neq j} \sum_{n_k \in \{1,2,3\}} P(n_k) \sum_{h \in H} P(h \mid n_k) \prod_i \sum_{\substack{n_F^A+n_F^C=2 \\ n_M^A+n_M^C=2}} P(n_F^A, n_F^C, n_M^A, n_M^C)\, P(t^A_{F,ik} \mid Q', n_F^A)\, P(t^C_{F,ik} \mid Q', n_F^C)\, P(t^A_{M,ik} \mid Q', n_M^A)\, P(t^C_{M,ik} \mid Q', n_M^C) \\
&\qquad\qquad \times \sum_{n^A+n^C=n_k} P(n^A, n^C \mid n_k, h, n_F^A, n_F^C, n_M^A, n_M^C)\, P(t^A_{E,ik} \mid Q, n^A)\, P(t^C_{E,ik} \mid Q, n^C) \Big)\, dQ \qquad (*)
\end{aligned}
$$
Here it is assumed that Q', the value of Q for the parental data, is
known exactly.
Copy Number Prior Probability
[0187] Equation (*) depends on being able to calculate values for
P(n.sub..alpha.) and P(n.sub.X), the distribution of prior
probabilities of chromosome copy number, which is different
depending on whether it is an autosomal chromosome or chromosome X.
If these numbers are readily available for each chromosome, they
may be used as is. If they are not available for all chromosomes,
or are not reliable, some distributions may be assumed. Let the
prior probability be
$$P(n_\alpha = 1) = P(n_\alpha = 2) = P(n_\alpha = 3) = \frac{1}{3}$$
for autosomal chromosomes, and let the probability of the sex
chromosomes being XY or XX be 1/2 each. Then
$$
P(n_X = 0) = \frac{1}{3} \times \frac{1}{4} = \frac{1}{12}, \qquad
P(n_X = 1) = \frac{1}{3} \times \frac{3}{4} + \frac{1}{3} \times \frac{1}{2} + \frac{1}{3} \times \frac{1}{2} \times \frac{1}{4} = \frac{11}{24} = 0.458,
$$
where 3/4 is the probability of the monosomic chromosome being X
(as opposed to Y), 1/2 is the probability of being XX for two
chromosomes and 1/4 is the probability of the third chromosome
being Y.
$$P(n_X = 3) = \frac{1}{3} \times \frac{1}{2} \times \frac{3}{4} = \frac{1}{8} = 0.125,$$
where 1/2 is the probability of being XX for two chromosomes and
3/4 is the probability of the third chromosome being X.
$$P(n_X = 2) = 1 - P(n_X = 0) - P(n_X = 1) - P(n_X = 3) = \frac{4}{12} = 0.333.$$
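The arithmetic for the chromosome X priors can be verified with exact fractions. The following Python sketch simply re-evaluates the expressions above; the variable names are illustrative.

```python
from fractions import Fraction

third = Fraction(1, 3)  # P(monosomy) = P(disomy) = P(trisomy) = 1/3

p_x0 = third * Fraction(1, 4)
p_x1 = (third * Fraction(3, 4)
        + third * Fraction(1, 2)
        + third * Fraction(1, 2) * Fraction(1, 4))
p_x3 = third * Fraction(1, 2) * Fraction(3, 4)
p_x2 = 1 - p_x0 - p_x1 - p_x3

x_prior = {0: p_x0, 1: p_x1, 2: p_x2, 3: p_x3}
```

The four values are 1/12, 11/24, 1/3 and 1/8, which sum to one as required of a prior.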
Aneuploidy State Prior Probability
[0188] Equation (*) depends on being able to calculate values for
P(h|n.sub.j), and these are shown in Table 6. The symbols used in
Table 6 are explained below.

Symbol   Meaning
Ppm      paternal monosomy probability
Pmm      maternal monosomy probability
Ppt      paternal trisomy probability given trisomy
Pmt      maternal trisomy probability given trisomy
pt1      probability of type 1 trisomy for paternal trisomy, or P(type 1|paternal trisomy)
pt2      probability of type 2 trisomy for paternal trisomy, or P(type 2|paternal trisomy)
mt1      probability of type 1 trisomy for maternal trisomy, or P(type 1|maternal trisomy)
mt2      probability of type 2 trisomy for maternal trisomy, or P(type 2|maternal trisomy)
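One way the symbols above could combine into values of P(h|n.sub.j) is sketched below in Python. Table 6 itself is not reproduced here, so this is an assumption built only from the chain rule and the symbol definitions: Ppm and Pmm are treated as conditional on monosomy, and each complementary pair (Ppm/Pmm, Ppt/Pmt, pt1/pt2, mt1/mt2) is assumed to sum to one. The function name is hypothetical.

```python
def aneuploidy_state_prior(n, ppm, ppt, pt1, mt1):
    """Return {h: P(h | n)} for copy number n in {1, 2, 3} (assumed form)."""
    pmm = 1.0 - ppm   # assumed complement of paternal monosomy
    pmt = 1.0 - ppt   # assumed complement of paternal trisomy
    pt2 = 1.0 - pt1   # type 2 given paternal trisomy
    mt2 = 1.0 - mt1   # type 2 given maternal trisomy
    if n == 1:
        return {"paternal monosomy": ppm, "maternal monosomy": pmm}
    if n == 2:
        return {"disomy": 1.0}
    if n == 3:
        return {
            "t1 paternal trisomy": ppt * pt1,
            "t2 paternal trisomy": ppt * pt2,
            "t1 maternal trisomy": pmt * mt1,
            "t2 maternal trisomy": pmt * mt2,
        }
    raise ValueError("copy number must be 1, 2 or 3")
```

Under these assumptions the probabilities for each copy number sum to one, so the result is a valid conditional distribution over H.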
Note that there are many other ways that one skilled in the art,
after reading this disclosure, could assign or estimate appropriate
prior probabilities without changing the essential concept of the
patent.
Allele Distribution Model without Parents
[0189] Equation (*) depends on being able to calculate values for
P(n.sup.A,n.sup.C|n.sub..alpha.,i) and
P(n.sup.A,n.sup.C|n.sub.X,i). These values may be calculated by
assuming that the allele pattern (n.sup.A,n.sup.C) is drawn i.i.d
according to the allele frequencies for its letters at locus i. An
illustrative example is given here. Calculate P((2,1)|n.sub.7=3)
under the assumption that the allele frequency for A is 60%, and
the minor allele frequency for C is 40%. (As an aside, note that
P((2,1)|n.sub.7=2)=0, since in this case the pair must sum to 2.)
This probability is given by
$$P\big((2,1) \mid n_7 = 3\big) = \binom{3}{2}(.60)^2(.40)$$
The general equation is
$$P(n^A, n^C \mid n_j, i) = \binom{n_j}{n^A}\,(1 - p_{ij})^{n^A}\,(p_{ij})^{n^C}$$
where p.sub.ij is the minor allele frequency at locus i of
chromosome j.
Allele Distribution Model Incorporating Parental Genotypes
[0190] Equation (*) depends on being able to calculate values for
P(n.sup.A,n.sup.C|n.sub.j,h,T.sub.F,ij,T.sub.M,ij), which are listed
in Table 7. In a real situation, LDO will be known in either one of
the parents, and the table would need to be augmented. If LDO are
known in both parents, one can use the model described in the
Allele Distribution Model without Parents section.
Population Frequency for Parental Truth
[0191] Equation (*) depends on being able to calculate
P(T.sub.F,ij,T.sub.M,ij). The probabilities of the combinations of
parental genotypes can be calculated based on the population
frequencies. For example, P(AA,AA)=P(A).sup.4, and
P(AC,AC)=P.sub.heteroz.sup.2, where P.sub.heteroz=2P(A)P(C) is the
probability that a diploid sample is heterozygous at a given locus
i.
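The parental genotype probabilities can be tabulated for every pair of genotypes at once. The Python sketch below is illustrative; it assumes the two alleles within each parent are drawn independently from the population frequencies and that the two parents are independent, which is what the two examples above imply. The function names are hypothetical.

```python
from itertools import product


def genotype_probability(genotype, p_a):
    """Probability of genotype 'AA', 'AC' or 'CC' when P(A) = p_a and P(C) = 1 - p_a."""
    p_c = 1.0 - p_a
    return {"AA": p_a ** 2, "AC": 2.0 * p_a * p_c, "CC": p_c ** 2}[genotype]


def parental_pair_probabilities(p_a):
    """P(father genotype, mother genotype), assuming independent parents."""
    genotypes = ("AA", "AC", "CC")
    return {
        (f, m): genotype_probability(f, p_a) * genotype_probability(m, p_a)
        for f, m in product(genotypes, repeat=2)
    }
```

For example, with P(A)=0.6 the entry for (AA, AA) equals 0.6 to the fourth power, and the nine entries sum to one.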
Error Model
[0192] Equation (*) depends on being able to calculate values for
P(t.sup.A|Q,n.sup.A) and P(t.sup.C|Q,n.sup.C). For this an error
model is needed. One may use the following error model:
$$
P(t^A \mid Q, n^A) =
\begin{cases}
p_d & t^A = \perp \text{ and } n^A > 0 \\
(1 - p_d)\, f_Z\!\left( \dfrac{1}{\sigma}\left( t^A - \dfrac{1}{\beta}\log\dfrac{Q_T}{n^A Q} \right) \right) & t^A \neq \perp \text{ and } n^A > 0 \\
1 - p_a & t^A = \perp \text{ and } n^A = 0 \\
2\, p_a\, f_Z\!\left( \dfrac{1}{\sigma}\left( t^A - \perp \right) \right) & t^A \neq \perp \text{ and } n^A = 0
\end{cases}
$$
[0193] This error model is used elsewhere in this disclosure, and
the four cases mentioned above are described there. The
computational considerations of carrying out the MAP estimation
mathematics by brute force are also described in the same
section.
Computational Complexity Estimation
[0194] Rewrite the equation (*) as follows,
$$
\begin{aligned}
\hat{n}_j = \arg\max_{n_j \in \{1,2,3\}} \int f(Q) & \Big( P(n_j) \prod_i \sum_{n^A+n^C=n_j} P(n^A, n^C \mid n_j, i)\, P(t^A_{i,j} \mid Q, n^A)\, P(t^C_{i,j} \mid Q, n^C) \Big) \\
\times & \Big( \prod_{k \neq j} \sum_{n_k \in \{1,2,3\}} P(n_k) \prod_i \sum_{n^A+n^C=n_k} P(n^A, n^C \mid n_k, i)\, P(t^A_{i,k} \mid Q, n^A)\, P(t^C_{i,k} \mid Q, n^C) \Big)\, dQ \qquad (*)
\end{aligned}
$$
Let the computation time for P(n.sup.A,n.sup.C|n.sub.j,i) be
t.sub.x, that for P(t.sub.i,j.sup.A|Q,n.sup.A) or
P(t.sub.i,j.sup.C|Q,n.sup.C) be t.sub.y. Note that
P(n.sup.A,n.sup.C|n.sub.j,i) may be pre-computed, since their
values don't vary from experiment to experiment. For the discussion
here, call a complete 23-chromosome aneuploidy screen an
"experiment". Computation of
$$\prod_i \sum_{n^A+n^C=n_j} P(n^A, n^C \mid n_j, i)\, P(t^A_{i,j} \mid Q, n^A)\, P(t^C_{i,j} \mid Q, n^C)$$
for 23 chromosomes takes
if n.sub.j=1,(2+t.sub.x+2*t.sub.y)*2N*m
if n.sub.j=2,(2+t.sub.x+2*t.sub.y)*3N*m
if n.sub.j=3,(2+t.sub.x+2*t.sub.y)*4N*m
The unit of time here is the time for a multiplication or an
addition. In total, it takes (2+t.sub.x+2*t.sub.y)*9N*m.
[0195] Once these building blocks are computed, the overall
integral may be calculated, which takes time on the order of
(2+t.sub.x+2*t.sub.y)*9N*m*q. In the end, it takes 2*m comparisons
to determine the best estimate for n.sub.j. Therefore, overall the
computational complexity is O(N*m*q).
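As a rough worked example of the operation count above, the following Python sketch tallies the cost in units of one multiplication or addition. The function names are hypothetical, and q is assumed here to be the number of values of Q visited when evaluating the integral; the factors 2, 3 and 4 are the numbers of ways to split n.sub.j=1, 2 or 3 copies into (n.sup.A, n.sup.C).

```python
def building_block_cost(N, m, t_x, t_y):
    """Operation count for the per-chromosome products, in multiply/add units."""
    per_term = 2 + t_x + 2 * t_y                  # cost of one summand
    # n_j = 1, 2, 3 admit 2, 3 and 4 splits (n_A, n_C) with n_A + n_C = n_j.
    splits = [n_j + 1 for n_j in (1, 2, 3)]
    return sum(per_term * k * N * m for k in splits)


def total_cost(N, m, q, t_x, t_y):
    """Adds the integration over q values of Q and the final 2*m comparisons."""
    return building_block_cost(N, m, t_x, t_y) * q + 2 * m
```

Because t.sub.x and t.sub.y are constants, the total grows in proportion to N*m*q, consistent with the stated complexity.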
[0196] What follows is a re-derivation of the original derivation,
with a slight difference in emphasis in order to elucidate the
programming of the math. Note that the variable D below is not
really a variable. It is always a constant set to the value of the
data set actually in question, so it does not introduce another
array dimension when representing in MATLAB. However, the variables
D.sub.j do introduce an array dimension, due to the presence of the
index j.
$$
\hat{n}_j = \arg\max_{n_j \in \{1,2,3\}} P(n_j, D)
$$
$$
P(n_j, D) = \sum_Q P(n_j, Q, D)
$$
$$
P(n_j, Q, D) = P(n_j)\, P(Q)\, P(D_j \mid Q, n_j)\, P(D_{k \neq j} \mid Q)
$$
$$
P(D_j \mid Q) = \sum_{n_j \in \{1,2,3\}} P(D_j, n_j \mid Q)
$$
$$
P(D_j, n_j \mid Q) = P(n_j)\, P(D_j \mid Q, n_j)
$$
$$
P(D_j \mid Q, n_j) = \prod_i P(D_{ij} \mid Q, n_j)
$$
$$
P(D_{ij} \mid Q, n_j) = \sum_{n^A + n^C = n_j} P(D_{ij}, n^A, n^C \mid Q, n_j)
$$
$$
P(D_{ij}, n^A, n^C \mid Q, n_j) = P(n^A, n^C \mid n_j, i)\, P(t_{ij}^A \mid Q, n^A)\, P(t_{ij}^C \mid Q, n^C)
$$
$$
P(n^A, n^C \mid n_j, i) = \binom{n_j}{n^A} (1 - p_{ij})^{n^A} (p_{ij})^{n^C}
$$
$$
P(t_{ij}^A \mid Q, n^A) =
\begin{cases}
p_d, & t^A = \perp \text{ and } n^A > 0 \\
(1 - p_d)\, f_Z\!\left(\dfrac{1}{\sigma}\left(t^A - \dfrac{1}{\beta}\log\dfrac{Q_T}{n^A Q}\right)\right), & t^A \neq \perp \text{ and } n^A > 0 \\
1 - p_a, & t^A = \perp \text{ and } n^A = 0 \\
2 p_a\, f_Z\!\left(\dfrac{1}{\sigma}\left(t^A - \perp\right)\right), & t^A \neq \perp \text{ and } n^A = 0
\end{cases}
$$
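The chain of equalities above maps fairly directly onto code. The following Python sketch is one possible arrangement, not the implementation referred to in this disclosure: all names are hypothetical, the sum over Q is taken over a caller-supplied grid, and the single-channel error model is passed in as a function `lik(t, Q, n)` so that any model of P(t|Q,n) may be substituted.

```python
import math


def split_prior(n_a, n_c, p):
    """P(n_A, n_C | n_j, i) = C(n_j, n_A) (1 - p)^n_A p^n_C, with n_j = n_A + n_C."""
    return math.comb(n_a + n_c, n_a) * (1.0 - p) ** n_a * p ** n_c


def chrom_likelihood(data_j, Q, n_j, lik):
    """P(D_j | Q, n_j): product over loci of the sum over (n_A, n_C) splits.

    data_j is a list of (t_A, t_C, p_ij) tuples, one per locus i.
    """
    total = 1.0
    for t_a, t_c, p in data_j:
        total *= sum(
            split_prior(n_a, n_j - n_a, p)
            * lik(t_a, Q, n_a) * lik(t_c, Q, n_j - n_a)
            for n_a in range(n_j + 1)
        )
    return total


def map_copy_number(data, j, q_grid, q_prior, n_prior, lik):
    """Return (argmax_n P(n_j = n, D), all scores) for chromosome j."""
    scores = {n: 0.0 for n in n_prior}
    for Q, p_q in zip(q_grid, q_prior):
        # P(D_k | Q) = sum_n P(n) P(D_k | Q, n) for every other chromosome k
        others = 1.0
        for k, data_k in enumerate(data):
            if k != j:
                others *= sum(
                    n_prior[n] * chrom_likelihood(data_k, Q, n, lik)
                    for n in n_prior
                )
        for n in n_prior:
            scores[n] += (n_prior[n] * p_q
                          * chrom_likelihood(data[j], Q, n, lik) * others)
    return max(scores, key=scores.get), scores
```

For long chromosomes the products of many small probabilities may underflow; summing logarithms instead would be a natural refinement of this sketch.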
F Qualitative Chromosome Copy Number Calling
[0197] Described here is one way to determine the number of copies
of a chromosome segment in the genome of a target individual: the
genetic data of the target individual and related individual(s) are
incorporated into a hypothesis, and the most likely hypothesis is
calculated. In one embodiment of the invention, the aneuploidy
calling method may be modified to use purely qualitative data.
There are many approaches to solving this problem, and several of
them are presented here. It should be obvious to one skilled in the
art how to use other methods to accomplish the same end, and these
will not change the essence of the disclosure.
Notation for Qualitative CNC
[0198] 1. N is the total number of SNPs on the chromosome.
[0199] 2. n is the chromosome copy number.
[0200] 3. n.sup.M is the number of copies supplied to the embryo by
the mother: 0, 1, or 2.
[0201] 4. n.sup.F is the number of copies supplied to the embryo by
the father: 0, 1, or 2.
[0202] 5. p.sub.d is the dropout rate, and f(p.sub.d) is a prior on
this rate.
[0203] 6. p.sub.a is the dropin rate, and f(p.sub.a) is a prior on
this rate.
[0204] 7. c is the cutoff threshold for no-calls.
[0205] 8. D=(x.sub.k,y.sub.k) is the platform response on channels
X and Y for SNP k.
[0206] 9. $D(c)=\{G(x_k,y_k;c)\}=\{\hat{g}_k^{(c)}\}$ is the set
of genotype calls on the chromosome. Note that the genotype calls
depend on the no-call cutoff threshold c.
[0207] 10. $\hat{g}_k^{(c)}$ is the genotype call on the k-th SNP (as
opposed to the true value): one of AA, AB, BB, or NC (no-call).
[0208] 11. Given a genotype call $\hat{g}$ at SNP k, the variables
$(\hat{g}_X, \hat{g}_Y)$ are indicator variables (1 or 0), indicating
whether the genotype implies that channel X or Y has "lit up".
Formally, $\hat{g}_X=1$ just in case $\hat{g}$ contains the allele A,
and $\hat{g}_Y=1$ just in case $\hat{g}$ contains the allele B.
[0209] 12. M={g.sub.k.sup.M} is the known true sequence of genotype
calls on the mother. g.sup.M refers to the genotype value at some
particular locus.
[0210] 13. F={g.sub.k.sup.F} is the known true sequence of genotype
calls on the father. g.sup.F refers to the genotype value at some
particular locus.
[0211] 14. n.sup.A,n.sup.B are the true number of copies of A and B
on the embryo (implicitly at locus k), respectively. Values must be
in {0,1,2,3,4}.
[0212] 15. c.sub.M.sup.A,c.sub.M.sup.B are the number of A alleles
and B alleles respectively supplied by the mother to the embryo
(implicitly at locus k). The values must be in {0, 1, 2}, and must
not sum to more than 2. Similarly, c.sub.F.sup.A,c.sub.F.sup.B are
the number of A alleles and B alleles respectively supplied by the
father to the embryo (implicitly at locus k). Altogether, these
four values exactly determine the true genotype of the embryo. For
example, if the values were (1,0) and (1,1), then the embryo would
have type AAB.
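As a small illustration of item 15, the following Python sketch combines the parental allele contributions into the true embryo genotype. The function name is hypothetical; each argument is a pair (number of A alleles, number of B alleles) supplied by one parent.

```python
def embryo_genotype(mother, father):
    """Combine (c_A, c_B) allele counts from each parent into a genotype string."""
    for c_a, c_b in (mother, father):
        if c_a < 0 or c_b < 0 or c_a + c_b > 2:
            raise ValueError("each parent supplies between 0 and 2 alleles in total")
    n_a = mother[0] + father[0]   # n^A: true number of A copies in the embryo
    n_b = mother[1] + father[1]   # n^B: true number of B copies in the embryo
    return "A" * n_a + "B" * n_b
```

With the values from the example in the text, `embryo_genotype((1, 0), (1, 1))` gives the trisomic genotype AAB.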
Solution 1: Integrate Over Dropout and Dropin Rates.
[0213] In the embodiment of the invention described here, the
solution applies to just a single chromosome. In reality, there is
loose coupling among all chromosomes to help decide on dropout rate
p.sub.d, but the math is presented here for just a single
chromosome. It should be obvious to one skilled in the art how one
could perform this integral over fewer, more, or different
parameters that vary from one experiment to another. It should also
be obvious to one skilled in the art how to apply this method to
handle multiple chromosomes at a time, while integrating over ADO
and ADI. Further details are given in Solution 3B below.
$$
P(n \mid D(c), M, F) = \sum_{(n^M, n^F) \in n} P(n^M, n^F \mid D(c), M, F)
$$
$$
P(n^M, n^F \mid D(c), M, F) =
\frac{P(n^M)\, P(n^F)\, P(D(c) \mid n^M, n^F, M, F)}
{\sum_{(n^M, n^F)} P(n^M)\, P(n^F)\, P(D(c) \mid n^M, n^F, M, F)}
$$
$$
P(D(c) \mid n^M, n^F, M, F) = \iint f(p_d)\, f(p_a)\,
P(D(c) \mid n^M, n^F, M, F, p_d, p_a)\, dp_d\, dp_a
$$
$$
\begin{aligned}
P(D(c) \mid n^M, n^F, M, F, p_d, p_a)
&= \prod_k P\big(G(x_k, y_k; c) \mid n^M, n^F, g_k^M, g_k^F, p_d, p_a\big) \\
&= \prod_{g^M \in \{AA,AB,BB\}} \; \prod_{g^F \in \{AA,AB,BB\}} \; \prod_{\hat{g} \in \{AA,AB,BB,NC\}} \;
\prod_{\{k \,:\, g_k^M = g^M,\ g_k^F = g^F,\ \hat{g}_k^{(c)} = \hat{g}\}}
P(\hat{g} \mid n^M, n^F, g^M, g^F, p_d, p_a) \\
&= \prod_{g^M \in \{AA,AB,BB\}} \; \prod_{g^F \in \{AA,AB,BB\}} \; \prod_{\hat{g} \in \{AA,AB,BB,NC\}}
P(\hat{g} \mid n^M, n^F, g^M, g^F, p_d, p_a)^{\left|\{k \,:\, g_k^M = g^M,\ g_k^F = g^F,\ \hat{g}_k^{(c)} = \hat{g}\}\right|} \\
&= \exp\!\Bigg( \sum_{g^M \in \{AA,AB,BB\}} \; \sum_{g^F \in \{AA,AB,BB\}} \; \sum_{\hat{g} \in \{AA,AB,BB,NC\}}
\left|\{k \,:\, g_k^M = g^M,\ g_k^F = g^F,\ \hat{g}_k^{(c)} = \hat{g}\}\right|
\times \log P(\hat{g} \mid n^M, n^F, g^M, g^F, p_d, p_a) \Bigg)
\end{aligned}
$$
$$
P(\hat{g} \mid n^M, n^F, g^M, g^F, p_d, p_a) = \sum_{n^A, n^B}
\underbrace{P(n^A, n^B \mid n^M, n^F, g^M, g^F)}_{\text{genetic modeling}}
\;\underbrace{P(\hat{g}_X \mid n^A, p_d, p_a)\, P(\hat{g}_Y \mid n^B, p_d, p_a)}_{\text{platform modeling}}
$$
$$
P(\hat{g}_X \mid n^A, p_d, p_a) =
\hat{g}_X \Big( \big(1 - p_d^{\,n^A}\big) + \mathbf{1}(n^A = 0)\, p_a \Big)
+ (1 - \hat{g}_X) \Big( \mathbf{1}(n^A > 0)\, p_d^{\,n^A} + \mathbf{1}(n^A = 0)\,(1 - p_a) \Big)
$$
The other derivation is the same, except applied to channel Y.
$$
P(n^A, n^B \mid n^M, n^F, g^M, g^F) =
\sum_{c_M^A + c_F^A = n^A} \; \sum_{c_M^B + c_F^B = n^B}
P(c_M^A, c_M^B \mid n^M, g^M)\, P(c_F^A, c_F^B \mid n^F, g^F)
$$
$$
P(c_M^A, c_M^B \mid n^M, g^M) = \mathbf{1}(c_M^A + c_M^B = n^M) \times
\begin{cases}
\mathbf{1}(c_M^B = 0), & g^M = AA \\
\mathbf{1}(c_M^A = 0), & g^M = BB \\
\dfrac{1}{n^M + 1}, & g^M = AB
\end{cases}
$$
The other derivation is the same, except applied to the father.
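The genetic and platform models of Solution 1 can be combined into the probability of a single genotype call. The Python sketch below is illustrative only; the function names are hypothetical, and a no-call is assumed to correspond to neither channel lighting up. The sum over (n.sup.A, n.sup.B) is carried out by enumerating the parental allele splits directly, which is equivalent.

```python
from itertools import product

# (g_X, g_Y): whether channel X (allele A) or Y (allele B) is lit for each call
INDICATORS = {"AA": (1, 0), "AB": (1, 1), "BB": (0, 1), "NC": (0, 0)}


def channel_prob(lit, n, p_d, p_a):
    """P(g_X | n^A, p_d, p_a): probability that a channel is lit (1) or dark (0)."""
    p_lit = (1.0 - p_d ** n) if n > 0 else p_a
    return p_lit if lit else 1.0 - p_lit


def parent_contribution(c_a, c_b, n_parent, g_parent):
    """P(c^A, c^B | n, g) for one parent supplying n_parent copies."""
    if c_a + c_b != n_parent:
        return 0.0
    if g_parent == "AA":
        return 1.0 if c_b == 0 else 0.0
    if g_parent == "BB":
        return 1.0 if c_a == 0 else 0.0
    return 1.0 / (n_parent + 1)   # heterozygous parent: splits equally likely


def call_prob(g_hat, n_m, n_f, g_m, g_f, p_d, p_a):
    """P(g_hat | n^M, n^F, g^M, g^F, p_d, p_a)."""
    lit_x, lit_y = INDICATORS[g_hat]
    total = 0.0
    for c_ma, c_mb, c_fa, c_fb in product(range(3), repeat=4):
        genetic = (parent_contribution(c_ma, c_mb, n_m, g_m)
                   * parent_contribution(c_fa, c_fb, n_f, g_f))
        if genetic:
            n_a, n_b = c_ma + c_fa, c_mb + c_fb
            total += (genetic
                      * channel_prob(lit_x, n_a, p_d, p_a)
                      * channel_prob(lit_y, n_b, p_d, p_a))
    return total
```

For a euploid embryo with an AA mother and a BB father, the call AB requires that neither allele drops out, so its probability under this sketch is (1 - p.sub.d) squared.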
Solution 2: Use ML to Estimate Optimal Cutoff Threshold c
Solution 2, Variation A
[0214]
$$
\hat{c} = \arg\max_{c \in (0, a]} P(D(c) \mid M, F)
$$
$$
P(n) = \sum_{(n^M, n^F) \in n} P(n^M, n^F \mid D(\hat{c}), M, F)
$$
[0215] In this embodiment, one first uses ML estimation to get
the best estimate of the cutoff threshold based on the data, and
then uses this estimate of c to do the standard Bayesian inference as in
solution 1. Note that, as written, the estimate of c would still
involve integrating over all dropout and dropin rates. However,
since it is known that the dropout and dropin parameters tend to
peak sharply in probability when they are "tuned" to their proper
values with respect to c, one may save computation time by doing
the following instead:
Solution 2, Variation B
[0216]
$$
\hat{c}, \hat{p}_d, \hat{p}_a = \arg\max_{c,\, p_d,\, p_a} f(p_d)\, f(p_a)\, P(D(c) \mid M, F, p_d, p_a)
$$
$$
P(n) = \sum_{(n^M, n^F) \in n} P(n^M, n^F \mid D(\hat{c}), M, F, \hat{p}_d, \hat{p}_a)
$$
[0217] In this embodiment, it is not necessary to integrate a
second time over the dropout and dropin parameters. The equation
goes over all possible triples in the first line. In the second
line, it just uses the optimal triple to perform the inference
calculation.
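The two lines of Variation B may be sketched in Python as a grid search followed by a normalization over the (n.sup.M, n.sup.F) hypotheses. This is a hedged sketch: the function names are hypothetical, the candidate values of c, p.sub.d and p.sub.a are assumed to be supplied as finite grids, and the likelihoods are passed in as functions rather than computed here.

```python
from itertools import product


def fit_triple(cutoffs, dropouts, dropins, log_lik,
               log_prior=lambda p_d, p_a: 0.0):
    """First line: argmax over (c, p_d, p_a) of log prior + log likelihood."""
    return max(
        product(cutoffs, dropouts, dropins),
        key=lambda t: log_prior(t[1], t[2]) + log_lik(*t),
    )


def copy_number_posterior(hypotheses, prior, lik_at_optimum):
    """Second line: P(n), evaluated only at the fitted triple.

    hypotheses     : iterable of (n_M, n_F) pairs
    prior          : function (n_M, n_F) -> P(n_M) P(n_F)
    lik_at_optimum : function (n_M, n_F) -> likelihood of D(c_hat) at the
                     fitted dropout and dropin rates
    """
    joint = {h: prior(*h) * lik_at_optimum(*h) for h in hypotheses}
    norm = sum(joint.values())
    posterior = {}
    for (n_m, n_f), weight in joint.items():
        n = n_m + n_f                       # (n_M, n_F) belongs to copy number n
        posterior[n] = posterior.get(n, 0.0) + weight / norm
    return posterior
```

Because the rates are fixed at their fitted values in the second step, no integral over p.sub.d and p.sub.a is evaluated there.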
Solution 3: Combining Data Across Chromosomes
[0218] The data across different chromosomes is conditionally
independent given the cutoff and dropout/dropin parameters, so one
reason to process them together is to get better resolution on the
cutoff and dropout/dropin parameters, assuming that these are
actually constant across all chromosomes (and there is good
scientific reason to believe that they are roughly constant). In
one embodiment of the invention, given this observation, it is
possible to use a simple modification of the methods in solution 2
above. Rather than independently estimating the cutoff and
dropout/dropin parameters on each chromosome, it is possible to
estimate them once using all the chromosomes.
Notation
[0219] Since data from all chromosomes is being combined, use the
subscript j to denote the j-th chromosome. For example, D.sub.j(c)
is the genotype data on chromosome j using c as the no-call
threshold. Similarly, M.sub.j,F.sub.j are the genotype data on the
parents on chromosome j.
Solution 3, Variation A: Use All Data to Estimate Cutoff and
Dropout/Dropin
[0220]
$$
\hat{c}, \hat{p}_d, \hat{p}_a = \arg\max_{c,\, p_d,\, p_a} f(p_d)\, f(p_a) \prod_j P(D_j(c) \mid M_j, F_j, p_d, p_a)
$$
$$
P(n_j) = \sum_{(n^M, n^F) \in n_j} P(n^M, n^F \mid D_j(\hat{c}), M_j, F_j, \hat{p}_d, \hat{p}_a)
$$
Solution 3, Variation B:
[0221] Theoretically, this is the optimal estimate for the copy
number on chromosome j.
$$
\hat{n}_j = \arg\max_n \sum_{(n^M, n^F) \in n} \iint f(p_d)\, f(p_a)\,
P(D_j(\hat{c}) \mid n^M, n^F, M_j, F_j, p_d, p_a)
\prod_{l \neq j} P(D_l(\hat{c}) \mid M_l, F_l, p_d, p_a)\, dp_d\, dp_a
$$
Estimating Dropout/Dropin Rates from Known Samples
[0222] For the sake of thoroughness, a brief discussion of dropout
and dropin rates is given here. Since dropout and dropin rates are
so important for the algorithm, it may be beneficial to analyze
data with a known truth model to find out what the true
dropout/dropin rates are. Note that there is no single true dropout
rate: it is a function of the cutoff threshold. That said, if
highly reliable genomic data exists that can be used as a truth
model, then it is possible to plot the dropout/dropin rates of MDA
experiments as a function of the cutoff-threshold. Here a maximum
likelihood estimation is used.
$$
\hat{c}, \hat{p}_d, \hat{p}_a = \arg\max_{c,\, p_d,\, p_a} \prod_{j,k} P\big(\hat{g}_{jk}^{(c)} \mid g_{jk}, p_d, p_a\big)
$$
In the above equation, $\hat{g}_{jk}^{(c)}$ is the genotype call on SNP
k of chromosome j, using c as the cutoff threshold, while $g_{jk}$
is the true genotype as determined from a genomic sample. The above
equation returns the most likely triple of cutoff, dropout, and
dropin. It should be obvious to one skilled in the art how one can
implement this technique without parent information using prior
probabilities associated with the genotypes of each of the SNPs of
the target cell that will not undermine the validity of the work,
and this will not change the essence of the invention.
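A minimal Python sketch of this maximum likelihood fit is given below, under stated assumptions: the truth sample is diploid, each channel lights up independently according to the dropout and dropin model used in Solution 1, candidate rates lie strictly between 0 and 1, and the calls obtained at each candidate cutoff are supplied by the caller. All names are hypothetical.

```python
import math
from itertools import product

COPIES = {"AA": (2, 0), "AB": (1, 1), "BB": (0, 2)}   # true (n_A, n_B)
LIT = {"AA": (1, 0), "AB": (1, 1), "BB": (0, 1), "NC": (0, 0)}


def channel_prob(lit, n, p_d, p_a):
    """Probability that a channel with n true copies is lit (1) or dark (0)."""
    p_lit = (1.0 - p_d ** n) if n > 0 else p_a
    return p_lit if lit else 1.0 - p_lit


def log_lik(calls, truth, p_d, p_a):
    """Sum over SNPs of log P(call | true genotype, p_d, p_a)."""
    total = 0.0
    for g_hat, g in zip(calls, truth):
        (n_a, n_b), (x, y) = COPIES[g], LIT[g_hat]
        total += math.log(channel_prob(x, n_a, p_d, p_a)
                          * channel_prob(y, n_b, p_d, p_a))
    return total


def fit_rates(calls_by_cutoff, truth, rate_grid):
    """Return the (c, p_d, p_a) triple maximizing the likelihood of the calls.

    calls_by_cutoff maps each candidate cutoff c to the calls made with it.
    """
    candidates = ((c, p_d, p_a)
                  for c in calls_by_cutoff
                  for p_d, p_a in product(rate_grid, repeat=2))
    return max(candidates,
               key=lambda t: log_lik(calls_by_cutoff[t[0]], truth, t[1], t[2]))
```

Repeating the fit for each cutoff separately would give the dropout and dropin rates as a function of the cutoff threshold, as described above.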
G Bayesian Plus Sperm Method
[0223] Another way to determine the number of copies of a
chromosome segment in the genome of a target individual is
described here. In one embodiment of the invention, the genetic
data of a sperm from the father and crossover maps can be used to
enhance the methods described herein. Throughout this description,
it is assumed that there is a chromosome of interest, and all
notation is with respect to that chromosome. It is also assumed
that there is a fixed cutoff threshold for genotyping. Previous
comments about the impact of cutoff threshold choice apply, but
will not be made explicit here. In order to best phase the
embryonic information, one should combine data from all blastomeres
on multiple embryos simultaneously. Here, for ease of explication,
it is assumed that there is just one embryo with no additional
blastomeres. However, the techniques mentioned in various other
sections regarding the use of multiple blastomeres for
allele-calling translate in a straightforward manner here.
Notation
[0224] 1. n is the chromosome copy number.
[0225] 2. n.sup.M is the number of copies supplied to the embryo by
the mother: 0, 1, or 2.
[0226] 3. n.sup.F is the number of copies supplied to the embryo by
the father: 0, 1, or 2.
[0227] 4. p.sub.d is the dropout rate, and f(p.sub.d) is a prior on
this rate.
[0228] 5. p.sub.a is the dropin rate, and f(p.sub.a) is a prior on
this rate.
[0229] 6. $D=\{\hat{g}_k\}$ is the set of genotype measurements on the
chromosome of the embryo. $\hat{g}_k$ is the genotype call on the k-th
SNP (as opposed to the true value): one of AA, AB, BB, or NC
(no-call). Note that the embryo may be aneuploid, in which case the
true genotype at a SNP may be, for example, AAB, or even AAAB, but
the genotype measurements will always be one of the four listed.
(Note: elsewhere in this disclosure `B` has been used to indicate a
heterozygous locus. That is not the sense in which it is being used
here. Here `A` and `B` are used to denote the two possible allele
values that could occur at a given SNP.)
[0230] 7. M={g.sub.k.sup.M} is the known true sequence of genotypes
on the mother. g.sub.k.sup.M is the genotype value at the k-th
SNP.
[0231] 8. F={g.sub.k.sup.F} is the known true sequence of genotypes
on the father. g.sub.k.sup.F is the genotype value at the k-th
SNP.
[0232] 9. $S=\{\hat{g}_k^S\}$ is the set of genotype measurements on
a sperm from the father. $\hat{g}_k^S$ is the genotype call at the
k-th SNP.
[0233] 10. $(m_1,m_2)$ is the true but unknown ordered pair
of phased haplotype information on the mother. $m_{1k}$ is the
allele value at SNP k of the first haploid sequence. $m_{2k}$ is
the allele value at SNP k of the second haploid sequence.
$(m_1,m_2) \in M$ is used to indicate the set of phased pairs
$(m_1,m_2)$ that are consistent with the known genotype M.
Similarly, $(m_1,m_2) \in g_k^M$ is used to indicate the
set of phased pairs that are consistent with the known genotype of
the mother at SNP k.
[0234] 11. $(f_1,f_2)$ is the true but unknown ordered pair
of phased haplotype information on the father. $f_{1k}$ is the
allele value at SNP k of the first haploid sequence. $f_{2k}$ is
the allele value at SNP k of the second haploid sequence.
$(f_1,f_2) \in F$ is used to indicate the set of phased pairs
$(f_1,f_2)$ that are consistent with the known genotype F.
Similarly, $(f_1,f_2) \in g_k^F$ is used to indicate the
set of phased pairs that are consistent with the known genotype of
the father at SNP k.
[0235] 12. s.sub.1 is the true but unknown phased haplotype
information on the measured sperm from the father. s.sub.1k is the
allele value at SNP k of this haploid sequence. It can be
guaranteed that this sperm is euploid by measuring several sperm
and selecting one that is euploid.
[0236] 13. $\chi^M=\{\Phi_1, \ldots, \Phi_{n^M}\}$ is the
multiset of crossover maps that resulted in the maternal contribution
to the embryo on this chromosome. Similarly,
$\chi^F=\{\theta_1, \ldots, \theta_{n^F}\}$ is the multiset
of crossover maps that resulted in the paternal contribution to the
embryo on this chromosome. Here the possibility that the chromosome
may be aneuploid is explicitly modeled. Each parent can contribute
zero, one, or two copies of the chromosome to the embryo. If the
chromosome is an autosome, then euploidy is the case in which each
parent contributes exactly one copy, i.e.,
$\chi^M=\{\Phi_1\}$ and $\chi^F=\{\theta_1\}$. But
euploidy is only one of the 3.times.3=9 possible cases. The
remaining eight are all different kinds of aneuploidy. For example,
in the case of maternal trisomy resulting from an M2 copy error,
one would have $\chi^M=\{\Phi_1,\Phi_1\}$ and
$\chi^F=\{\theta_1\}$. In the case of maternal trisomy
resulting from an M1 copy error, one would have
$\chi^M=\{\Phi_1,\Phi_2\}$ and
$\chi^F=\{\theta_1\}$. $(\chi^M, \chi^F) \in n$ will be
used to indicate the set of sub-hypothesis pairs $(\chi^M, \chi^F)$
that are consistent with the copy number n.
$\chi_k^M$ will be used to denote $\{\Phi_{1,k}, \ldots,
\Phi_{n^M,k}\}$, the multiset of crossover map values
restricted to the k-th SNP, and similarly for $\chi^F$.
$\chi_k^M(m_1,m_2)$ is used to mean the multiset of
allele values $\{\Phi_{1,k}(m_1,m_2), \ldots,
\Phi_{n^M,k}(m_1,m_2)\}=\{m_{\Phi_{1,k},k}, \ldots,
m_{\Phi_{n^M,k},k}\}$. Keep in mind that $\Phi_{l,k} \in \{1,2\}$.
[0237] 14. .psi. is the crossover map that resulted in the measured
sperm from the father. Thus s.sub.1=.psi.(f.sub.1,f.sub.2). Note
that it is not necessary to consider a crossover multiset because
it is assumed that the measured sperm is euploid. .psi..sub.k will
be used to denote the value of this crossover map at the k-th
SNP.
[0238] 15. Keeping in mind the previous two definitions, let
$\{e_1^M, \ldots, e_{n^M}^M\}$ be the multiset of true but
unknown haploid sequences contributed to the embryo by the mother
at this chromosome. Specifically,
$e_l^M=\Phi_l(m_1,m_2)$, where $\Phi_l$ is the
l-th element of the multiset $\chi^M$, and $e_{lk}^M$ is the
allele value at the k-th SNP. Similarly, let $\{e_1^F, \ldots,
e_{n^F}^F\}$ be the multiset of true but unknown haploid
sequences contributed to the embryo by the father at this
chromosome. Then $e_l^F=\theta_l(f_1,f_2)$,
where $\theta_l$ is the l-th element of the multiset
$\chi^F$, and $e_{lk}^F$ is the allele value at the k-th
SNP. Also, $\{e_1^M, \ldots, e_{n^M}^M\}=\chi^M(m_1,m_2)$ and
$\{e_1^F, \ldots, e_{n^F}^F\}=\chi^F(f_1,f_2)$ may be
written.
[0239] 16. $P(\hat{g}_k \mid \chi_k^M(m_1,m_2),
\chi_k^F(f_1,f_2), p_d, p_a)$ denotes the
probability of the genotype measurement on the embryo at SNP k
given a hypothesized true underlying genotype on the embryo and
given hypothesized underlying dropout and dropin rates. Note that
$\chi_k^M(m_1,m_2)$ and
$\chi_k^F(f_1,f_2)$ are both multisets, so are
capable of expressing aneuploid genotypes. For example,
$\chi_k^M(m_1,m_2)=\{A,A\}$ and
$\chi_k^F(f_1,f_2)=\{B\}$ expresses the maternal
trisomic genotype AAB.
[0240] Note that in this method, the measurements on the mother and
father are treated as known truth, while in other places in this
disclosure they are treated simply as measurements. Since the
measurements on the parents are very precise, treating them as
though they are known truth is a reasonable approximation to
reality. They are treated as known truth here in order to
demonstrate how such an assumption is handled, although it should
be clear to one skilled in the art how the more precise method,
used elsewhere in the patent, could equally well be used.
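The crossover-map notation of items 13 through 15 can be made concrete with a short Python sketch. The representation is an assumption made for illustration: a crossover map is a list whose k-th entry is 1 or 2, naming the parental haplotype that supplies the allele at SNP k, and a multiset of maps is a list of such lists. The function names are hypothetical.

```python
def apply_crossover(phi, hap1, hap2):
    """Return phi(h1, h2): at SNP k, take the allele from haplotype phi[k]."""
    haps = {1: hap1, 2: hap2}
    return [haps[source][k] for k, source in enumerate(phi)]


def contributed_alleles(chi, hap1, hap2, k):
    """chi_k(h1, h2): the multiset (as a sorted list) of alleles at SNP k."""
    return sorted(apply_crossover(phi, hap1, hap2)[k] for phi in chi)
```

For example, with maternal haplotypes AABA and BBAB, an M2 copy error corresponds to `chi = [phi, phi]` and always contributes two identical alleles at each SNP, whereas an M1 copy error with two different maps may contribute one A and one B.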
Solution
[0241]
$$
\hat{n} = \arg\max_n P(n, D, M, F, S)
$$
$$
\begin{aligned}
P(n, D, M, F, S)
&= \sum_{(\chi^M, \chi^F) \in n} \sum_{\psi} P(\chi^M, \chi^F, \psi, D, M, F, S) \\
&= \sum_{(\chi^M, \chi^F) \in n} P(\chi^M)\, P(\chi^F) \sum_{\psi} P(\psi)
\int f(p_d) \int f(p_a) \prod_k
P\big(\hat{g}_k, g_k^M, g_k^F, \hat{g}_k^S \mid \chi_k^M, \chi_k^F, \psi_k, p_d, p_a\big)\, dp_d\, dp_a \\
&= \sum_{(\chi^M, \chi^F) \in n} P(\chi^M)\, P(\chi^F) \sum_{\psi} P(\psi)
\int f(p_d) \int f(p_a) \\
&\quad \times \prod_k \sum_{(f_1, f_2) \in g_k^F} P(f_1)\, P(f_2)\,
P\big(\hat{g}_k^S \mid \psi_k(f_1, f_2), p_d, p_a\big)
\sum_{(m_1, m_2) \in g_k^M} P(m_1)\, P(m_2)\,
P\big(\hat{g}_k \mid \chi_k^M(m_1, m_2), \chi_k^F(f_1, f_2), p_d, p_a\big)\, dp_d\, dp_a
\end{aligned}
$$
[0242] A method to calculate each of the probabilities appearing in
the last equation above has been described elsewhere in this
disclosure. Although multiple sperm can be added
in order to increase reliability of the copy number call, in
practice one sperm is typically sufficient. This solution is
computationally tractable for a small number of sperm.
H Simplified Method Using Only Polar Homozygotes
[0243] In another embodiment of the invention, a similar method to
determine the number of copies of a chromosome can be implemented
using a limited subset of SNPs in a simplified approach. The method
is purely qualitative, uses parental data, and focuses exclusively
on a subset of SNPs, the so-called polar homozygotes (described
below). Polar homozygotic denotes the situation in which the mother
and father are both homozygous at a SNP, but the homozygotes are
opposite, or different allele values. Thus, the mother could be AA
and the father BB, or vice versa.
[0244] Since the actual allele values are not important--only their
relationship to each other, i.e. opposites--the mother's alleles
will be referred to as MM, and the father's as FF. In such a
situation, if the embryo is euploid, it must be heterozygous at
that allele. However, due to allele dropouts, a heterozygous SNP in
the embryo may not be called as heterozygous. In fact, given the
high rate of dropout associated with single cell amplification, it
is far more likely to be called as either MM or FF, each with equal
probability.
[0245] In this method, the focus is solely on those loci on a
particular chromosome that are polar homozygotes and for which the
embryo, which is therefore known to be heterozygous, is
nonetheless called homozygous. It is possible to form the statistic
|MM|/(|MM|+|FF|), where |MM| is the number of these SNPs that are
called MM in the embryo and |FF| is the number of these SNPs that
are called FF in the embryo.
[0246] Under the hypothesis of euploidy, |MM|/(|MM|+|FF|) is
Gaussian in nature, with mean 1/2 and variance 1/(4N), where
N=(|MM|+|FF|). Therefore the statistic is completely independent of
the dropout rate, or, indeed, of any other factors. Due to the
symmetry of the construction, the distribution of this statistic
under the hypothesis of euploidy is known.
[0247] Under the hypothesis of trisomy, the statistic will not have
a mean of 1/2. If, for example, the embryo has MMF trisomy, then
the homozygous calls in the embryo will lean toward MM and away
from FF, and vice versa. Note that because only loci where the
parents are homozygous are under consideration, there is no need to
distinguish M1 and M2 copy errors. In all cases, if the mother
contributes 2 chromosomes instead of 1, they will be MM regardless
of the underlying cause, and similarly for the father. The exact
mean under trisomy will depend upon the dropout rate, p, but in no
case will the mean be greater than 1/3, which is the limit of the
mean as p goes to 1. Under monosomy, the mean would be precisely 0,
except for noise induced by allele dropins.
[0248] In this embodiment, it is not necessary to model the
distribution under aneuploidy, but only to reject the null
hypothesis of euploidy, whose distribution is completely known. Any
embryo for which the null hypothesis cannot be rejected at a
predetermined significance level would be deemed normal.
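The test described in this section may be sketched in Python as a two-sided z-test on the statistic |MM|/(|MM|+|FF|), using the Gaussian approximation with mean 1/2 and variance 1/(4N) stated above. The function name and the default significance level of 0.05 are assumptions made for illustration.

```python
import math


def euploidy_test(n_mm, n_ff, alpha=0.05):
    """Two-sided z-test of the null hypothesis of euploidy.

    n_mm, n_ff : counts of polar-homozygote loci called MM and FF in the embryo
    Returns (statistic, z, p_value, reject_null).
    """
    n = n_mm + n_ff
    if n == 0:
        raise ValueError("need at least one homozygous call at a polar locus")
    stat = n_mm / n
    z = (stat - 0.5) / math.sqrt(1.0 / (4.0 * n))   # variance 1/(4N) under H0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal tail
    return stat, z, p_value, p_value < alpha
```

An embryo with 50 MM calls and 50 FF calls gives z = 0 and is not rejected, whereas 80 MM calls against 20 FF calls gives z = 6 and the null hypothesis of euploidy is rejected.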
[0249] In another embodiment of the invention, of the homozygotic
loci, those that result in no-call (NC) on the embryo contain
information, and can be included in the calculations, yielding more
loci for consideration. In another embodiment, those loci that are
not homozygotic, but rather follow the pattern AA|AB, can also be
included in the calculations, yielding more loci for consideration.
It should be obvious to one skilled in the art how to modify the
method to include these additional loci into the calculation.
I Reduction to Practice of the PS Method as Applied to Allele
Calling
[0250] In order to demonstrate a reduction to practice of the PS
method as applied to cleaning the genetic data of a target
individual, and its associated allele-call confidences, extensive
Monte-Carlo simulations were run. The PS method's confidence
numbers match the observed rate of correct calls in simulation. The
details of these simulations are given in separate documents whose
benefits are claimed by this disclosure. In addition, this aspect
of the PS method has been reduced to practice on real triad data (a
mother, a father and a born child). Results are shown below in
Table 8. The TAQMAN assay was used to measure single cell genotype
data consisting of diploid measurements of a large buccal sample
from the father (columns p.sub.1,p.sub.2), diploid measurements of
a buccal sample from the mother (m.sub.1,m.sub.2), haploid
measurements on three isolated sperm from the father
(h.sub.1,h.sub.2,h.sub.3), and diploid measurements of four single
cells from a buccal sample from the born child of the triad. Note
that all diploid data are unordered. All SNPs are from chromosome 7
and within 2 megabases of the CFTR gene, in which a defect causes
cystic fibrosis.
[0251] The goal was to estimate (in E1,E2) the alleles of the
child, by running PS on the measured data from a single child
buccal cell (e1,e2), which served as a proxy for a cell from the
embryo of interest. Since no maternal haplotype sequence was
available, the three additional single cells of the child
sample, (b11,b12), (b21,b22), and (b31,b32), were used in the same way
that additional blastomeres from other embryos are used to infer
maternal haplotype once the paternal haplotype is determined from
sperm. The true allele values (T1,T2) on the child were determined
by taking three buccal samples of several thousand cells,
genotyping them independently, and only choosing SNPs on which the
results were concordant across all three samples. This process
yielded 94 concordant SNPs. Those loci that had a valid genotype
call, according to the ABI 7900 reader, on the child cell that
represented the embryo, were then selected. For each of these 69
SNPs, the disclosed method determined de-noised allele calls on the
embryo (E.sub.1,E.sub.2), as well as the confidence associated with
each genotype call.
[0252] Twenty-nine percent (29%) of the 69 raw allele calls in
uncleaned genetic data from the child cell were incorrect (marked
with a dash "-" in column e1 and e2, Table 8). Columns
(E.sub.1,E.sub.2) show that PS corrected 18 of these (as indicated
by a box in column E1 and E2, but not in column `conf`, Table 8),
while two remained miscalled (2.9% error rate; marked with a dash
"-" in column `conf`, Table 8). Note that the two SNPs that were
miscalled had low confidences of 53.8% and 74.4%. These low
confidences indicate that the calls might be incorrect, due either
to a lack of data or to inconsistent measurements on multiple sperm
or "blastomeres." The confidence in the genotype calls produced is
an integral part of the PS report. Note that this demonstration,
which sought to call the genotype of 69 SNPs on a chromosome, was
more difficult than that encountered in practice, where the
genotype at only one or two loci will typically be of interest,
based on initial screening of parents' data. In some embodiments,
the disclosed method may achieve a higher level of accuracy at loci
of interest by: i) continuing to measure single sperm until
multiple haploid allele calls have been made at the locus of
interest; ii) including additional blastomere measurements; iii)
incorporating maternal haploid data from extruded polar bodies,
which are commonly biopsied in pre-implantation genetic diagnosis
today. It should be obvious to one skilled in the art that there
exist other modifications to the method that can also increase the
level of accuracy, as well as how to implement these, without
changing the essential concept of the disclosure.
J Reduction to Practice of the PS Method as Applied to Calling
Aneuploidy
[0253] To demonstrate the reduction to practice of certain aspects
of the invention disclosed herein, the method was used to call
aneuploidy on several sets of single cells. In this case, only
selected data from the genotyping platform was used: the genotype
information from parents and embryo. A simple genotyping algorithm,
called "pie slice", was used, and it showed itself to be about
99.9% accurate on genomic data. It is less accurate on MDA data,
due to the noise inherent in MDA. It is more accurate when there is
a fairly high "dropout" rate in MDA. It also depends, crucially, on
being able to model the probabilities of various genotyping errors
in terms of parameters known as dropout rate and dropin rate.
[0254] The unknown chromosome copy numbers are inferred because
different copy numbers interact differently with the dropout rate,
dropin rate, and the genotyping algorithm. By creating a
statistical model that specifies how the dropout rate, dropin rate,
chromosome copy numbers, and genotype cutoff-threshold all
interact, it is possible to use standard statistical inference
methods to tease out the unknown chromosome copy numbers.
[0255] The method of aneuploidy detection described here is termed
qualitative CNC, or qCNC for short, and employs the basic
statistical inferencing methods of maximum-likelihood estimation,
maximum-a-posteriori estimation, and Bayesian inference. The
methods are very similar, with slight differences. The methods
described here are similar to those described previously, and are
summarized here for the sake of convenience.
Maximum Likelihood (ML)
[0256] Let X.sub.1, . . . , X.sub.n.about.f(x;.theta.). Here the
X.sub.i are independent, identically distributed random variables,
drawn according to a probability distribution that belongs to a
family of distributions parameterized by the vector .theta.. For
example, the family of distributions might be the family of all
Gaussian distributions, in which case .theta.=(.mu., .sigma.) would
be the mean and standard deviation that determine the specific distribution
in question. The problem is as follows: .theta. is unknown, and the
goal is to get a good estimate of it based solely on the
observations of the data X.sub.1, . . . , X.sub.n. The maximum
likelihood solution is given by
$$
\hat{\theta} = \arg\max_{\theta} \prod_i f(X_i; \theta)
$$
Maximum A Posteriori (MAP) Estimation
[0257] Posit a prior distribution f(.theta.) that determines the
prior probability of actually seeing .theta. as the parameter,
allowing us to write X.sub.1, . . . , X.sub.n.about.f(x|.theta.).
The MAP solution is given by
$$
\hat{\theta} = \arg\max_{\theta} f(\theta) \prod_i f(X_i \mid \theta)
$$
Note that the ML solution is equivalent to the MAP solution with a
uniform (possibly improper) prior.
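The relationship between the ML and MAP solutions can be seen in a small Python sketch. The example is hypothetical and not drawn from this disclosure: the observations are Bernoulli indicators (1 meaning a dropout was observed), the parameter is the dropout probability, and the maximization is a search over a grid of candidate values strictly between 0 and 1.

```python
import math


def log_likelihood(xs, theta):
    """log of prod_i f(X_i; theta) for Bernoulli observations xs."""
    k = sum(xs)
    return k * math.log(theta) + (len(xs) - k) * math.log(1.0 - theta)


def ml_estimate(xs, grid):
    """argmax over the grid of the likelihood alone."""
    return max(grid, key=lambda th: log_likelihood(xs, th))


def map_estimate(xs, grid, log_prior):
    """argmax over the grid of prior times likelihood."""
    return max(grid, key=lambda th: log_prior(th) + log_likelihood(xs, th))
```

With a constant `log_prior`, `map_estimate` returns the same value as `ml_estimate`; an informative prior pulls the estimate toward the values the prior favors.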
Bayesian Inference
[0258] Bayesian inference comes into play when
.theta.=(.theta..sub.1, . . . , .theta..sub.d) is multidimensional
but it is only necessary to estimate a subset (typically one) of
the parameters .theta..sub.j. In this case, if there is a prior on
the parameters, it is possible to integrate out the other
parameters that are not of interest. Without loss of generality,
suppose that .theta..sub.1 is the parameter for which an estimate
is desired. Then the Bayesian solution is given by:
$$
\hat{\theta}_1 = \arg\max_{\theta_1} f(\theta_1) \int f(\theta_2) \cdots f(\theta_d) \prod_i f(X_i \mid \theta)\, d\theta_2 \cdots d\theta_d
$$
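As a hedged illustration of integrating out a parameter that is not of interest, the Python sketch below estimates the mean of a Gaussian while summing over a grid of candidate standard deviations. The Gaussian family, the evenly spaced grids and all names are assumptions chosen for the example; the sum stands in for the integral, and its constant grid spacing does not affect the argmax.

```python
import math


def gauss_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def marginal_estimate(xs, mu_grid, sigma_grid, prior_mu, prior_sigma):
    """argmax over mu of f(mu) * sum_sigma f(sigma) * prod_i f(x_i | mu, sigma)."""
    def score(mu):
        return prior_mu(mu) * sum(
            prior_sigma(s) * math.prod(gauss_pdf(x, mu, s) for x in xs)
            for s in sigma_grid
        )
    return max(mu_grid, key=score)
```

The same pattern applies when the parameters of interest are the copy numbers and the integrated parameters are the dropout and dropin rates.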
Copy Number Classification
[0259] Any one or some combination of the above methods may be used
to determine the copy number count, as well as when making allele
calls such as in the cleaning of embryonic genetic data. In one
embodiment, the data may come from INFINIUM platform measurements
{(x.sub.jk,y.sub.jk)}, where x.sub.jk is the platform response on
channel X to SNP k of chromosome j, and y.sub.jk is the platform
response on channel Y to SNP k of chromosome j. The key to the
usefulness of this method lies in choosing the family of
distributions from which it is postulated that these data are
drawn. In one embodiment, that distribution is parameterized by
many parameters. These parameters are responsible for describing
things such as probe efficiency, platform noise, MDA
characteristics such as dropout, dropin, and overall amplification
mean, and, finally, the genetic parameters: the genotypes of the
parents, the true but unknown genotype of the embryo, and, of
course, the parameters of interest: the chromosome copy numbers
supplied by the mother and father to the embryo.
[0260] In one embodiment, a good deal of information is discarded
before data processing. The advantage of doing this is that it is
possible to model the data that remains in a more robust manner.
Instead of using the raw platform data {(x.sub.jk,y.sub.jk)}, it is
possible to pre-process the data by running the genotyping
algorithm on the data. This results in a set of genotype calls
(g.sub.jk), where each g.sub.jk is one of {NC,AA,AB,BB}. NC stands for "no-call".
Putting these together into the Bayesian inference paradigm above
yields:
$$(\hat{n}_j^{M}, \hat{n}_j^{F}) = \arg\max_{n^{M},\, n^{F}} \iint f(p_d)\, f(p_a) \prod_{k} P(g_{jk} \mid n^{M}, n^{F}, M_j, F_j, p_d, p_a)\, dp_d\, dp_a$$
Explanation of the Notation:
[0261] {circumflex over (n)}.sub.j.sup.M,{circumflex over
(n)}.sub.j.sup.F are the estimated number of chromosome copies
supplied to the embryo by the mother and father respectively. These
should sum to 2 for the autosomes, in the case of euploidy, i.e.,
each parent should supply exactly 1 chromosome.
[0262] p.sub.d and p.sub.a are the dropout and dropin rates for
genotyping, respectively. These reflect some of the modeling
assumptions. It is known that in single-cell amplification, some
SNPs "drop out", which is to say that they are not amplified and,
as a consequence, do not show up when the SNP genotyping is
attempted on the INFINIUM platform. This phenomenon is modeled by
saying that each allele at each SNP "drops out" independently with
probability p.sub.d during the MDA phase. Similarly, the platform
is not a perfect measurement instrument. Due to measurement noise,
the platform sometimes picks up a ghost signal, which can be
modeled as a probability of dropin that acts independently at each
SNP with probability p.sub.a.
[0263] M.sub.j,F.sub.j are the true genotypes on the mother and
father respectively. The true genotypes are not known perfectly,
but because large samples from the parents are genotyped, one may
make the assumption that the truth on the parents is essentially
known.
Probe Modeling
[0264] In one embodiment of the invention, platform response
models, or error models, that vary from one probe to another can be
used without changing the essential nature of the invention. The
amplification efficiency and error rates caused by allele dropouts,
allele dropins, or other factors, may vary between different
probes. In one embodiment, an error transition matrix can be made
that is particular to a given probe. Platform response models, or
error models, can be relevant to a particular probe or can be
parameterized according to the quantitative measurements that are
performed, so that the response model or error model is therefore
specific to that particular probe and measurement.
Genotyping
[0265] Genotyping also requires an algorithm with some built-in
assumptions. Going from a platform response (x,y) to a genotype g
requires significant calculation. It essentially requires that
the positive quadrant of the x/y plane be divided into those
regions where AA, AB, BB, and NC will be called. Furthermore, in
the most general case, it may be useful to have regions where AAA,
AAB, etc., could be called for trisomies.
[0266] In one embodiment, use is made of a particular genotyping
algorithm called the pie-slice algorithm, because it divides the
positive quadrant of the x/y plane into three triangles, or "pie
slices". Those (x,y) points that fall in the pie slice that hugs
the X axis are called AA, those that fall in the slice that hugs
the Y axis are called BB, and those in the middle slice are called
AB. In addition, a small square is superimposed whose lower-left
corner touches the origin. (x,y) points falling in this square are
designated NC, because both x and y components have small values
and hence are unreliable.
[0267] The width of that small square is called the no-call
threshold and it is a parameter of the genotyping algorithm. In
order for the dropin/dropout model to correctly model the error
transition matrix associated with the genotyping algorithm, the
cutoff threshold must be tuned properly. The error transition
matrix indicates for each true-genotype/called-genotype pair, the
probability of seeing the called genotype given the true genotype.
This matrix depends on the dropout rate of the MDA and upon the
no-call threshold set for the genotyping algorithm.
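The pie-slice calling rule described above can be sketched as follows. The slice boundaries (30 and 60 degrees) and the default no-call threshold are assumptions made for illustration; the text does not specify them, and in practice they would be tuned as described. The function name is hypothetical, and non-negative platform responses are assumed.

```python
import math

def pie_slice_call(x, y, no_call_threshold=0.2, aa_max_deg=30.0, bb_min_deg=60.0):
    """Call a genotype from a platform response (x, y) in the positive quadrant.

    Points inside the small square at the origin are no-calls (NC).
    Otherwise the angle from the X axis selects the slice:
    near the X axis -> AA, near the Y axis -> BB, in between -> AB.
    The angular boundaries are assumed values, not taken from the text.
    """
    if x < no_call_threshold and y < no_call_threshold:
        return "NC"
    angle = math.degrees(math.atan2(y, x))
    if angle < aa_max_deg:
        return "AA"
    if angle > bb_min_deg:
        return "BB"
    return "AB"

# Hypothetical responses for four SNPs.
calls = [pie_slice_call(x, y) for x, y in [(1.0, 0.1), (0.1, 1.0), (0.8, 0.8), (0.05, 0.05)]]
```

Raising no_call_threshold converts more weak responses into NC calls, which changes the entries of the error transition matrix accordingly.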
[0268] Note that a wide variety of different allele calling, or
genotyping, algorithms may be used without changing the fundamental
concept of the invention. For example, the no-call region could be
defined by many different shapes besides a square, such as for
example a quarter circle, and the no call thresholds may vary
greatly for different genotyping algorithms.
Results of Aneuploidy Calling Experiments
[0269] Presented here are experiments that demonstrate the
reduction to practice of the method disclosed herein to correctly
call ploidy of single cells. The goal of this demonstration was
twofold: first, to show that the disclosed method correctly calls
the cell's ploidy state with high confidence using samples with
known chromosome copy numbers, both euploid and aneuploid, as
controls, and second to show that the method disclosed herein calls
the cell's ploidy state with high confidence using blastomeres with
unknown chromosome copy numbers.
[0270] In order to increase confidences, the ILLUMINA INFINIUM II
platform, which allows measurement of hundreds of thousands of SNPs,
was used. In order to run this experiment in the context of PGD,
the standard INFINIUM II protocol was reduced from three days to 20
hours. Single cell measurements were compared between the full and
accelerated INFINIUM II protocols, and showed .about.85%
concordance. The accelerated protocol showed an increase in locus
drop-out (LDO) rate from <1% to 5-10%; however, because hundreds
of thousands of SNPs are measured and because PS accommodates
allele dropouts, this increase in LDO rate does not have a
significant negative impact on the results.
[0271] The entire aneuploidy calling method was performed on eight
known-euploid buccal cells isolated from two healthy children from
different families, ten known-trisomic cells isolated from a human
immortalized trisomic cell line, and six blastomeres with an
unknown number of chromosomes isolated from three embryos donated
to research. Half of each set of cells was analyzed by the
accelerated 20-hour protocol, and the other half by the standard
protocol. Note that for the immortalized trisomic cells, no parent
data was available. Consequently, for these cells, a pair of
pseudo-parental genomes was generated by drawing their genotypes
from the conditional distribution induced by observation of a large
tissue sample of the trisomic genotype at each locus.
[0272] Where truth was known, the method correctly called the
ploidy state of each chromosome in each cell with high confidence.
The data are summarized below in three tables. Each table shows the
chromosome number in the first column, and each pair of
color-matched columns represents the analysis of one cell with the
copy number call on the left and the confidence with which the call
is made on the right. Each row corresponds to one particular
chromosome. Note that these tables contain the ploidy information
of the chromosomes in a format that could be used for the report
that is provided to the doctor to help in the determination of
which embryos are to be selected for transfer to the prospective
mother. (Note `1` may result from both monosomy and uniparental
disomy.) Table 9 shows the results for eight known-euploid buccal
cells; all were correctly found to be euploid with high confidences
(>0.99). Table 10 shows the results for ten known-trisomic cells
(trisomic at chromosome 21); all were correctly found to be
trisomic at chromosome 21 and disomic at all other chromosomes with
high confidences (>0.92). Table 11 shows the results for six
blastomeres isolated from three different embryos. While no truth
models exist for donated blastomeres, it is possible to look for
concordance between blastomeres originating from a single embryo;
however, the frequency and characteristics of mosaicism in human
embryos are not currently known, and thus the presence or lack of
concordance between blastomeres from a common embryo is not
necessarily indicative of correct ploidy determination. The first
three blastomeres are from one embryo (e1) and of those, the first
two (e1b1 and e1b3) have the same ploidy state at all chromosomes
except one. The third cell (e1b6) is complex aneuploid. Both
blastomeres from the second embryo were found to be monosomic at
all chromosomes. The blastomere from the third embryo was found to
be complex aneuploid. Note that some confidences are below 90%;
however, if the confidences of all aneuploid hypotheses are
combined, all chromosomes are called either euploid or aneuploid
with confidence exceeding 92.8%.
K Laboratory Techniques
[0273] There are many techniques available allowing the isolation
of cells and DNA fragments for genotyping, as well as for the
subsequent genotyping of the DNA. The system and method described
here can be applied to any of these techniques, specifically those
involving the isolation of fetal cells or DNA fragments from
maternal blood, or blastomeres from embryos in the context of IVF.
It can be equally applied to genomic data in silico, i.e. not
directly measured from genetic material. In one embodiment of the
system, this data can be acquired as described below. This
description of techniques is not meant to be exhaustive, and it
should be clear to one skilled in the arts that there are other
laboratory techniques that can achieve the same ends.
Isolation of Cells
[0274] Adult diploid cells can be obtained from bulk tissue or
blood samples. Adult diploid single cells can be obtained from
whole blood samples using FACS, or fluorescence activated cell
sorting. Adult haploid single sperm cells can also be isolated from
a sperm sample using FACS. Adult haploid single egg cells can be
isolated in the context of egg harvesting during IVF
procedures.
[0275] Isolation of the target single cell blastomeres from human
embryos can be done using techniques common in in vitro
fertilization clinics, such as embryo biopsy. Isolation of target
fetal cells in maternal blood can be accomplished using monoclonal
antibodies, or other techniques such as FACS or density gradient
centrifugation.
[0276] DNA extraction also might entail non-standard methods for
this application. Literature reports comparing various methods for
DNA extraction have found that in some cases novel protocols, such
as the addition of N-lauroylsarcosine, were more efficient and
produced the fewest false positives.
Amplification of Genomic DNA
[0277] Amplification of the genome can be accomplished by multiple
methods including: ligation-mediated PCR (LM-PCR), degenerate
oligonucleotide primer PCR (DOP-PCR), and multiple displacement
amplification (MDA). Of the three methods, DOP-PCR reliably
produces large quantities of DNA from small quantities of DNA,
including single copies of chromosomes; this method may be most
appropriate for genotyping the parental diploid data, where data
fidelity is critical. MDA is the fastest method, producing
hundred-fold amplification of DNA in a few hours; this method may
be most appropriate for genotyping embryonic cells, or in other
situations where time is of the essence.
[0278] Background amplification is a problem for each of these
methods, since each method would potentially amplify contaminating
DNA. Very tiny quantities of contamination can irreversibly poison
the assay and give false data. Therefore, it is critical to use
clean laboratory conditions, wherein pre- and post-amplification
workflows are completely, physically separated. Clean,
contamination free workflows for DNA amplification are now routine
in industrial molecular biology, and simply require careful
attention to detail.
Genotyping Assay and Hybridization
[0279] The genotyping of the amplified DNA can be done by many
methods including MOLECULAR INVERSION PROBES (MIPs) such as
AFFYMETRIX's GENFLEX TAG array, microarrays such as AFFYMETRIX's
500K array or the ILLUMINA BEAD ARRAYS, or SNP genotyping assays
such as APPLIEDBIOSCIENCE's TAQMAN assay. The AFFYMETRIX 500K
array, MIPs/GENFLEX, TAQMAN and ILLUMINA assay all require
microgram quantities of DNA, so genotyping a single cell with
any of these workflows would require some kind of amplification. Each of
these techniques has various tradeoffs in terms of cost, quality of
data, quantitative vs. qualitative data, customizability, time to
complete the assay and the number of measurable SNPs, among others.
An advantage of the 500K and ILLUMINA arrays is the large number
of SNPs on which they can gather data, roughly 250,000, as opposed to
MIPs which can detect on the order of 10,000 SNPs, and the TAQMAN
assay which can detect even fewer. An advantage of the MIPs, TAQMAN
and ILLUMINA assay over the 500K arrays is that they are inherently
customizable, allowing the user to choose SNPs, whereas the 500K
arrays do not permit such customization.
[0280] In the context of pre-implantation diagnosis during IVF, the
inherent time limitations are significant; in this case it may be
advantageous to sacrifice data quality for turn-around time.
Although it has other clear advantages, the standard MIPs assay
protocol is a relatively time-intensive process that typically
takes 2.5 to three days to complete. In MIPs, annealing of probes
to target DNA and post-amplification hybridization are particularly
time-intensive, and any deviation from these times results in
degradation in data quality. Probes anneal overnight (12-16 hours)
to DNA sample. Post-amplification hybridization anneals to the
arrays overnight (12-16 hours). A number of other steps before and
after both annealing and amplification bring the total standard
timeline of the protocol to 2.5 days. Optimization of the MIPs
assay for speed could potentially reduce the process to fewer than
36 hours. Both the 500K arrays and the ILLUMINA assays have a
faster turnaround: approximately 1.5 to two days to generate highly
reliable data in the standard protocol. Both of these methods are
optimizable, and it is estimated that the turn-around time for the
genotyping assay for the 500 k array and/or the ILLUMINA assay
could be reduced to less than 24 hours. Even faster is the TAQMAN
assay which can be run in three hours. For all of these methods,
the reduction in assay time will result in a reduction in data
quality, however that is exactly what the disclosed invention is
designed to address.
[0281] Naturally, in situations where the timing is critical, such
as genotyping a blastomere during IVF, the faster assays have a
clear advantage over the slower assays, whereas in cases that do
not have such time pressure, such as when genotyping the parental
DNA before IVF has been initiated, other factors will predominate
in choosing the appropriate method. For example, another tradeoff
that exists from one technique to another is one of price versus
data quality. It may make sense to use more expensive techniques
that give high quality data for measurements that are more
important, and less expensive techniques that give lower quality
data for measurements where the fidelity is not as critical. Any
techniques which are developed to the point of allowing
sufficiently rapid high-throughput genotyping could be used to
genotype genetic material for use with this method.
Methods for Simultaneous Targeted Locus Amplification and Whole
Genome Amplification.
[0282] During whole genome amplification of small quantities of
genetic material, whether through ligation-mediated PCR (LM-PCR),
multiple displacement amplification (MDA), or other methods,
dropouts of loci occur randomly and unavoidably. It is often
desirable to amplify the whole genome nonspecifically, but to
ensure that a particular locus is amplified with greater certainty.
It is possible to perform simultaneous locus targeting and whole
genome amplification.
[0283] In a preferred embodiment, the basis for this method is to
combine standard targeted polymerase chain reaction (PCR) to
amplify particular loci of interest with any generalized whole
genome amplification method. This may include, but is not limited
to: preamplification of particular loci before generalized
amplification by MDA or LM-PCR, the addition of targeted PCR
primers to universal primers in the generalized PCR step of LM-PCR,
and the addition of targeted PCR primers to degenerate primers in
MDA.
L Techniques for Screening for Aneuploidy Using High and Medium
Throughput Genotyping
[0284] In one embodiment of the system the measured genetic data
can be used to detect the presence of aneuploidies and/or
mosaicism in an individual. Disclosed herein are several methods of
using medium or high-throughput genotyping to detect the number of
chromosomes or DNA segment copy number from amplified or
unamplified DNA from tissue samples. The goal is to estimate the
reliability that can be achieved in detecting certain types of
aneuploidy and levels of mosaicism using different quantitative
and/or qualitative genotyping platforms such as ABI Taqman, MIPS,
or Microarrays from Illumina, Agilent and Affymetrix. In many of
these cases, the genetic material is amplified by PCR before
hybridization to probes on the genotyping array to detect the
presence of particular alleles. How these assays are used for
genotyping is described elsewhere in this disclosure.
[0285] Described below are several methods for screening for
abnormal numbers of DNA segments, whether arising from deletions,
aneuploidies and/or mosaicism. The methods are grouped as follows:
(i) quantitative techniques without making allele calls; (ii)
qualitative techniques that leverage allele calls; (iii)
quantitative techniques that leverage allele calls; (iv) techniques
that use a probability distribution function for the amplification
of genetic data at each locus. All methods involve the measurement
of multiple loci on a given segment of a given chromosome to
determine the number of instances of the given segment in the
genome of the target individual. In addition, the methods involve
creating a set of one or more hypotheses about the number of
instances of the given segment; measuring the amount of genetic
data at multiple loci on the given segment; determining the
relative probability of each of the hypotheses given the
measurements of the target individual's genetic data; and using the
relative probabilities associated with each hypothesis to determine
the number of instances of the given segment. Furthermore, the
methods all involve creating a combined measurement M that is a
computed function of the measurements of the amounts of genetic
data at multiple loci. In all the methods, thresholds are
determined for the selection of each hypothesis H.sub.i based on
the measurement M, and the number of loci to be measured is
estimated, in order to have a particular level of false detections
of each of the hypotheses.
[0286] The probability of each hypothesis given the measurement M
is P(H.sub.i|M)=P(M|H.sub.i)P(H.sub.i)/P(M). Since P(M) is
independent of H.sub.i, we can determine the relative probability
of the hypothesis given M by considering only
P(M|H.sub.i)P(H.sub.i). In what follows, in order to simplify the
analysis and the comparison of different techniques, we assume that
P(H.sub.i) is the same for all {H.sub.i}, so that we can compute
the relative probability of all the P(H.sub.i|M) by considering
only P(M|H.sub.i). Consequently, our determination of thresholds
and the number of loci to be measured is based on having particular
probabilities of selecting false hypotheses under the assumption
that P(H.sub.i) is the same for all {H.sub.i}. It will be clear to
one skilled in the art after reading this disclosure how the
approach would be modified to accommodate the fact that P(H.sub.i)
varies for different hypotheses in the set {H.sub.i}. In some
embodiments, the thresholds are set so that the hypothesis H.sub.i
is selected which maximizes P(H.sub.i|M) over all i. However,
thresholds need not necessarily be set to maximize P(H.sub.i|M),
but rather to achieve a particular ratio of the probability of
false detections between the different hypotheses in the set
{H.sub.i}.
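The computation of the relative probability of each hypothesis can be sketched as below. The likelihood values and the function name are hypothetical; the sketch simply applies Bayes' rule, defaulting to equal priors as assumed in the analysis above.

```python
def posterior(likelihoods, priors=None):
    """Return P(H_i | M) for each hypothesis from P(M | H_i) and P(H_i).

    If priors is None, all hypotheses are taken to be equally likely,
    so the posterior is proportional to the likelihood alone.
    """
    n = len(likelihoods)
    if priors is None:
        priors = [1.0 / n] * n
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    total = sum(joint)  # plays the role of P(M)
    return [j / total for j in joint]

# Hypothetical values of P(M | H_i) for three copy-number hypotheses.
post = posterior([0.02, 0.10, 0.08])
best = max(range(len(post)), key=post.__getitem__)  # index of the most probable hypothesis
```

Passing unequal priors shows how the selected hypothesis can change when P(H.sub.i) differs across the set.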
[0287] It is important to note that the techniques referred to
herein for detecting aneuploidies can be equally well used to detect
uniparental disomy, unbalanced translocations, and for the
sexing of the chromosome (male or female; XY or XX). All of the
concepts concern detecting the identity and number of chromosomes
(or segments of chromosomes) present in a given sample, and thus
are all addressed by the methods described in this document. It
should be obvious to one skilled in the art how to extend any of
the methods described herein to detect any of these
abnormalities.
The Concept of Matched Filtering
[0288] The methods applied here are similar to those applied in
optimal detection of digital signals. It can be shown using the
Schwarz inequality that the optimal approach to maximizing Signal
to Noise Ratio (SNR) in the presence of normally distributed noise
is to build an idealized matching signal, or matched filter,
corresponding to each of the possible noise-free signals, and to
correlate this matched signal with the received noisy signal. This
approach requires that the set of possible signals is known as
well as the statistical distribution mean and Standard Deviation
(SD) of the noise. Herein is described the general approach to
detecting whether chromosomes, or segments of DNA, are present or
absent in a sample. No differentiation will be made between looking
for whole chromosomes or looking for chromosome segments that have
been inserted or deleted. Both will be referred to as DNA segments.
It should be clear after reading this description how the
techniques may be extended to many scenarios of aneuploidy and sex
determination, or detecting insertions and deletions in the
chromosomes of embryos, fetuses or born children. This approach can
be applied to a wide range of quantitative and qualitative
genotyping platforms including Taqman, qPCR, Illumina Arrays,
Affymetrix Arrays, Agilent Arrays, the MIPS kit etc.
Formulation of the General Problem
[0289] Assume that there are probes at SNPs where two allelic
variations occur, x and y. At each locus i, i=1 . . . N, data is
collected corresponding to the amount of genetic material from the
two alleles. In the Taqman assay, these measures would be, for
example, the cycle time, C.sub.t, at which the level of each
allele-specific dye crosses a threshold. It will be clear how this
approach can be extended to different measurements of the amount of
genetic material at each locus or corresponding to each allele at a
locus. Quantitative measurements of the amount of genetic material
may be nonlinear, in which case the change in the measurement of a
particular locus caused by the presence of the segment of interest
will depend on how many other copies of that locus exist in the
sample from other DNA segments. In some cases, a technique may
require linear measurements, such that the change in the
measurement of a particular locus caused by the presence of the
segment of interest will not depend on how many other copies of
that locus exist in the sample from other DNA segments. An approach
is described for how the measurements from the Taqman or qPCR
assays may be linearized, but there are many other techniques for
linearizing nonlinear measurements that may be applied for
different assays.
[0290] The measurements of the amount of genetic material of allele
x at loci 1 . . . N is given by data d.sub.x=[d.sub.x1 . . .
d.sub.xN]. Similarly for allele y, d.sub.y=[d.sub.y1 . . .
d.sub.yN]. Assume that each segment j has alleles a.sub.j=[a.sub.j1
. . . a.sub.jN] where each element a.sub.ji is either x or y.
Describe the measurement data of the amount of genetic material of
allele x as d.sub.x=s.sub.x+.nu..sub.x where s.sub.x is the signal
and .nu..sub.x is a disturbance. The signal
s.sub.x=[f.sub.x(a.sub.11, . . . , a.sub.J1) . . .
f.sub.x(a.sub.1N, . . . , a.sub.JN)] where f.sub.x is the mapping
from the set of alleles to the measurement, and J is the number of
DNA segment copies. The disturbance vector .nu..sub.x is caused by
measurement error and, in the case of nonlinear measurements, the
presence of other genetic material besides the DNA segment of
interest. Assume that measurement errors are normally distributed
and that they are large relative to disturbances caused by
nonlinearity (see section on linearizing measurements) so that
.nu..sub.xi.apprxeq.n.sub.xi where n.sub.xi has variance
.sigma..sub.xi.sup.2 and vector n.sub.x is normally distributed
.about.N(0,R), R=E(n.sub.xn.sub.x.sup.T). Now, assume some filter h
is applied to this data to perform the measurement
m.sub.x=h.sup.Td.sub.x=h.sup.Ts.sub.x+h.sup.T.nu..sub.x. In order
to maximize the ratio of signal to noise
(h.sup.Ts.sub.x/h.sup.Tn.sub.x) it can be shown that h is given by
the matched filter h=.mu.R.sup.-1s.sub.x where .mu. is a scaling
constant. The discussion for allele x can be repeated for allele
y.
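A minimal sketch of the matched filter follows, under the assumption that the noise at different loci is independent so that R is diagonal. The signal and variance values and the function names are hypothetical. The quantity computed by snr is the squared filtered signal divided by the variance of the filtered noise, which is one common way to express the ratio discussed above.

```python
def matched_filter(signal, noise_var, scale=1.0):
    """Return h = scale * R^-1 * s for a diagonal R = diag(noise_var)."""
    return [scale * s / v for s, v in zip(signal, noise_var)]

def snr(h, signal, noise_var):
    """Return (h^T s)^2 / Var(h^T n), assuming independent noise per locus."""
    filtered_signal = sum(hi * si for hi, si in zip(h, signal))
    filtered_noise_var = sum(hi * hi * vi for hi, vi in zip(h, noise_var))
    return filtered_signal ** 2 / filtered_noise_var

signal = [1.0, 2.0, 0.5]       # hypothetical noise-free signal at three loci
noise_var = [0.5, 4.0, 0.1]    # hypothetical noise variances sigma_xi^2

snr_matched = snr(matched_filter(signal, noise_var), signal, noise_var)
snr_uniform = snr([1.0, 1.0, 1.0], signal, noise_var)
```

In this example the matched filter down-weights the noisy second locus and achieves a higher ratio than uniform weighting; the scaling constant has no effect on the ratio.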
Method 1a: Measuring Aneuploidy or Sex by Quantitative Techniques
that do not Make Allele Calls When the Mean and Standard Deviation
for Each Locus is Known
[0291] Assume for this section that the data relates to the amount
of genetic material at a locus irrespective of allele value (e.g.
using qPCR), or the data is only for alleles that have 100%
penetrance in the population, or that data is combined on multiple
alleles at each locus (see section on linearizing measurements) to
measure the amount of genetic material at that locus. Consequently,
in this section one may refer to data d.sub.x and ignore d.sub.y.
Assume also that there are two hypotheses: h.sub.0 that there are
two copies of the DNA segment (these are typically not identical
copies), and h.sub.1 that there is only 1 copy. For each
hypothesis, the data may be described as
d.sub.xi(h.sub.0)=s.sub.xi(h.sub.0)+n.sub.xi and
d.sub.xi(h.sub.1)=s.sub.xi(h.sub.1)+n.sub.xi respectively, where
s.sub.xi(h.sub.0) is the expected measurement of the genetic
material at locus i (the expected signal) when two DNA segments are
present and s.sub.xi(h.sub.1) is the expected data for one segment.
Construct the measurement for each locus by differencing out the
expected signal for hypothesis h.sub.0:
m.sub.xi=d.sub.xi-s.sub.xi(h.sub.0). If h.sub.1 is true, then the
expected value of the measurement is
E(m.sub.xi)=s.sub.xi(h.sub.1)-s.sub.xi(h.sub.0). Using the matched
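filter concept discussed above, set
$$h = \frac{1}{N} R^{-1}\bigl(s_x(h_1) - s_x(h_0)\bigr),$$
where s.sub.x(h.sub.0) and s.sub.x(h.sub.1) are the vectors of
expected signals over loci 1 . . . N. With R diagonal with entries
.sigma..sub.xi.sup.2, applying the filter to the differenced
measurements m.sub.xi gives the measurement
$$m = \frac{1}{N}\sum_{i=1}^{N} \frac{s_{xi}(h_1) - s_{xi}(h_0)}{\sigma_{xi}^{2}}\, m_{xi}.$$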
[0292] If h.sub.1 is true, the expected value of m is
$$E(m \mid h_1) = m_1 = \frac{1}{N}\sum_{i=1}^{N} \frac{\bigl(s_{xi}(h_1) - s_{xi}(h_0)\bigr)^{2}}{\sigma_{xi}^{2}}$$
and the variance of m is
$$\sigma_{m|h_1}^{2} = \frac{1}{N^{2}}\sum_{i=1}^{N} \frac{\bigl(s_{xi}(h_1) - s_{xi}(h_0)\bigr)^{2}}{\sigma_{xi}^{4}}\, \sigma_{xi}^{2} = \frac{1}{N^{2}}\sum_{i=1}^{N} \frac{\bigl(s_{xi}(h_1) - s_{xi}(h_0)\bigr)^{2}}{\sigma_{xi}^{2}}.$$
[0293] If h.sub.0 is true, the expected value of m is
E(m|h.sub.0)=m.sub.0=0 and the variance of m is again
$$\sigma_{m|h_0}^{2} = \frac{1}{N^{2}}\sum_{i=1}^{N} \frac{\bigl(s_{xi}(h_1) - s_{xi}(h_0)\bigr)^{2}}{\sigma_{xi}^{2}}.$$
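The quantities m, m.sub.1 and the variance of m can be computed as in the following sketch, which assumes independent noise at each locus. The expected signals, noise levels, and function names are hypothetical values chosen for illustration, and the decision rule places the threshold half-way between m.sub.0=0 and m.sub.1.

```python
def combined_measurement(d, s0, s1, sigma):
    """Matched-filter statistic m built from data d and expected signals s0, s1."""
    n = len(d)
    return sum((b - a) / sg ** 2 * (x - a) for x, a, b, sg in zip(d, s0, s1, sigma)) / n

def expected_under_h1(s0, s1, sigma):
    """m_1 = E(m | h_1)."""
    n = len(s0)
    return sum((b - a) ** 2 / sg ** 2 for a, b, sg in zip(s0, s1, sigma)) / n

def variance_of_m(s0, s1, sigma):
    """Variance of m, which is the same under h_0 and h_1."""
    n = len(s0)
    return sum((b - a) ** 2 / sg ** 2 for a, b, sg in zip(s0, s1, sigma)) / n ** 2

def classify(d, s0, s1, sigma):
    """Select h_1 if m exceeds the half-way threshold m_1 / 2, else h_0."""
    threshold = expected_under_h1(s0, s1, sigma) / 2
    return "h1" if combined_measurement(d, s0, s1, sigma) > threshold else "h0"

s0 = [30.0, 29.5, 30.2, 30.1]    # hypothetical expected signals with two copies
s1 = [31.5, 31.0, 31.6, 31.4]    # hypothetical expected signals with one copy
sigma = [0.7, 0.5, 0.9, 0.6]     # hypothetical per-locus noise standard deviations

m1 = expected_under_h1(s0, s1, sigma)
var_m = variance_of_m(s0, s1, sigma)
```

Note that the variance equals m.sub.1/N in this formulation, so the separation between the hypotheses, measured in standard deviations, grows with the square root of the number of loci.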
[0294] FIG. 1 illustrates how to determine the probability of false
negatives and false positive detections. Assume that a threshold t
is set half-way between m.sub.1 and m.sub.0 in order to make the
probability of false negatives and false positives equal (this need
not be the case as is described below). The probability of a false
negative is determined by the ratio of
(m.sub.1-t)/.sigma..sub.m|h1=(m.sub.1-m.sub.0)/(2.sigma..sub.m|h1).
"5-Sigma" statistics may be used so that the probability of false
negatives is 1-normcdf(5,0,1)=2.87e-7. In this case, the goal is
for (m.sub.1-m.sub.0)/(2.sigma..sub.m|h0)>5, or
$$10\sqrt{\frac{1}{N^{2}}\sum_{i=1}^{N} \frac{\bigl(s_{xi}(h_1) - s_{xi}(h_0)\bigr)^{2}}{\sigma_{xi}^{2}}} < \frac{1}{N}\sum_{i=1}^{N} \frac{\bigl(s_{xi}(h_1) - s_{xi}(h_0)\bigr)^{2}}{\sigma_{xi}^{2}},$$
or
$$\sqrt{\sum_{i=1}^{N} \frac{\bigl(s_{xi}(h_1) - s_{xi}(h_0)\bigr)^{2}}{\sigma_{xi}^{2}}} > 10.$$
In order to compute the size of N, the Mean Signal to Noise Ratio
can be computed from aggregated data:
$$\mathrm{MSNR} = \frac{1}{N}\sum_{i=1}^{N} \frac{\bigl(s_{xi}(h_1) - s_{xi}(h_0)\bigr)^{2}}{\sigma_{xi}^{2}}.$$
N can then be found from the inequality above:
sqrt(N)sqrt(MSNR)>10 or N>100/MSNR.
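The false-negative probability and the bound on N can be evaluated with the following sketch, which uses the standard normal distribution from the Python standard library in place of normcdf. The function names and the example MSNR values are hypothetical.

```python
import math
from statistics import NormalDist

def false_negative_rate(n_sigma):
    """Tail probability 1 - Phi(n_sigma) of a standard normal distribution."""
    return 1.0 - NormalDist().cdf(n_sigma)

def loci_needed(msnr, n_sigma=5.0):
    """Smallest integer N satisfying sqrt(N) * sqrt(msnr) > 2 * n_sigma.

    For n_sigma = 5 this is the smallest integer N with N > 100 / msnr.
    """
    return math.floor((2 * n_sigma) ** 2 / msnr) + 1

rate = false_negative_rate(5.0)    # approximately 2.87e-7
n_for_unit_msnr = loci_needed(1.0)
n_for_msnr_4 = loci_needed(4.0)
```

Relaxing n_sigma reduces the number of loci required, at the cost of a higher probability of false detections.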
[0295] This approach was applied to data measured with the Taqman
Assay from Applied BioSystems using 48 SNPs on the X chromosome.
The measurement for each locus is the time, C.sub.t that it takes
the dye released in the well corresponding to this locus to exceed
a threshold. Sample 0 consists of roughly 0.3 ng (50 cells) of
total DNA per well of mixed female origin where subjects had two X
chromosomes; sample 1 consisted of roughly 0.3 ng of DNA per well
of mixed male origin where subjects had one X chromosome. FIGS. 2
and 3 show the histograms of measurements for samples 1 and 0. The
distributions for these samples are characterized by m.sub.0=29.97;
SD.sub.0=1.32, m.sub.1=31.44, SD.sub.1=1.592. Since this data is
derived from mixed male and female samples, some of the observed SD
is due to the different allele frequencies at each SNP in the mixed
samples. In addition, some of the observed SD will be due to the
varying efficiency of the different assays at each SNP, and the
differing amount of dye pipetted into each well. FIG. 4 provides a
histogram of the difference in the measurements at each locus for
the male and female sample. The mean difference between the male
and female samples is 1.47 and the SD of the difference is 0.99.
While this SD will still be subject to the different allele
frequencies in the mixed male and female samples, it will no longer
be affected by the different efficiencies of each assay at each locus.
Since the goal is to differentiate two measurements each with a
roughly similar SD, the adjusted SD may be approximated for each
measurement for all loci as 0.99/sqrt(2)=0.70. Two runs were
conducted for every locus in order to estimate .sigma..sub.xi for
the assay at that locus so that a matched filter could be applied.
A lower limit of .sigma..sub.xi was set at 0.2 in order to avoid
statistical anomalies resulting from only two runs to compute
.sigma..sub.xi. Only those loci (numbering 37) for which there were
no allele dropouts over both alleles, over both experiment runs and
over both male and female samples were used in the plots and
calculations. Applying the approach above to this data, it was
found that MSNR=2.26, hence N=2.sup.2 5.sup.2/2.26.sup.2=17 loci.
Although applied here only to the X chromosome, and to
differentiating 1 copy from 2 copies, this experiment indicates the
number of loci necessary to detect M2 copy errors for all
chromosomes, where two exact copies of a chromosome occur in a
trisomy, using Method 3 described below.
[0296] The measurement used for each locus is the cycle number, Ct,
that it takes the dye released in the well corresponding to a
particular allele at the given locus to exceed a threshold that is
automatically set by the ABI 7900HT reader based on the noise of
the no-template control. Sample 0 consisted of roughly 60 pg
(equivalent to genome of 10 cells) of total DNA per well from a
female blood sample (XX); sample 1 consisted of roughly 60 pg of
DNA per well from a male blood sample (X). As expected, the Ct
measurement of female samples is on average lower than that of male
samples.
[0297] There are several approaches to comparing the Taqman Assay
measurements quantitatively between female and male samples. One
approach is illustrated here. To combine information from the FAM
and VIC channel for each locus, C.sub.t values of the two channels
were converted to the copy numbers of their respective alleles,
summed, and then converted back to a composite C.sub.t value for
that locus. The conversion between C.sub.t value and the copy
number was based on the equation N.sub.C=10.sup.(-a*Ct+b) which is
typically used to model the exponential growth of the dye
measurement during real-time PCR. The coefficients a and b were
determined empirically from the C.sub.t values using multiple
measurements on quantities of 6 pg and 60 pg of DNA. We determined
that a.apprxeq.0.298, b.apprxeq.10.493; hence we used the
linearizing formula N.sub.C=10.sup.(-0.298Ct+10.493).
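The channel-combining step described above can be sketched as follows. This is a minimal illustration, assuming the reported coefficients a=0.298 and b=10.493; the function names and the example C.sub.t values are hypothetical and are not taken from the experiment.

```python
import math

# Coefficients reported in the text for N_C = 10^(-a*Ct + b).
A = 0.298
B = 10.493


def ct_to_copies(ct: float) -> float:
    """Convert a cycle-threshold value to an estimated copy number."""
    return 10.0 ** (-A * ct + B)


def copies_to_ct(copies: float) -> float:
    """Invert the linearizing formula: Ct = (b - log10(N_C)) / a."""
    return (B - math.log10(copies)) / A


def composite_ct(ct_fam: float, ct_vic: float) -> float:
    """Sum the copy numbers of the two dye channels and convert back to Ct."""
    return copies_to_ct(ct_to_copies(ct_fam) + ct_to_copies(ct_vic))


# Hypothetical FAM and VIC readings for one locus.
print(composite_ct(30.0, 31.5))
```

Because the copy numbers add, the composite C.sub.t is always lower than either single-channel value; two equal channels lower it by log10(2)/a cycles.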
[0298] FIG. 14, top panel, shows the means and standard deviations
of the differences between the composite C.sub.t values of male and
female samples at each of the 19 loci measured. The mean difference
between the male and female samples is 1.19 and the SD of the
differences is 0.62. Note that this locus-specific SD will not be
affected by the different efficiencies of the assay at each locus.
Three runs were conducted for every locus in order to estimate
.sigma..sub.xi for the assay at that locus so that a matched filter
could be created and applied. Note that a lower limit of standard
deviation at each locus was set at 0.6 in order to avoid
statistical anomalies resulting from the small number of runs at
each locus. Only those loci for which there were no allele dropouts
over at least two experiment runs and over both male and female
samples were used in the plots and calculations. Applying the
approach above to this data, it was found that MSNR=9.7. Hence N=6
loci are required in order to have 99.99% confidence of the test,
assuming those 6 loci generate the same MSNR (Mean Signal to Noise
Ratio) as the 19 loci used in this test. The combined measure m and
its expected standard deviation after applying the matched filter
for male and female samples is shown in FIG. 14, bottom.
[0299] The same approach was applied to samples diluted 10 times
from the above mentioned DNA. Now each well consisted of roughly 6
pg (equivalent amount of a single cell genome) of total DNA of
female and male blood samples. As in the previous case, three
replicates were tested for each locus in order to estimate the mean
and standard deviations of the difference in C.sub.t levels between
male and female samples, and a lower limit of 0.6 was set on the
standard deviation at each locus in order to avoid statistical
anomalies resulting from the small number of runs at each locus.
Only those loci for which there were no allele dropouts over at
least two experiment runs and over both male and female samples
were used in the plots and calculations. This resulted in 13 loci
that were used in creating the matched filter. FIG. 15, with the
same layout as in FIG. 14, shows how the differences between male
and female measurements are combined into a single measure. The
estimated number, N, of SNPs that need to be measured in order to
assure 99.99% confidence of the test is 11. This assumes that these
11 loci have the same MSNR (Mean Signal to Noise Ratio) as the 13
loci tested in this experiment.
[0300] Note that this result of 11 loci is significantly lower than
we expect to employ in practice. The primary reason is that this
experiment performed an allele-specific amplification in each
Taqman well, in which 6 pg of DNA is placed. The expected standard
deviation for each locus is larger when one initially performs a
whole-genome amplification in order to generate a sufficient
quantity of genetic material that can be placed in each well. In a
later section, we describe the experiment that addresses this
issue, using Multiple Displacement Amplification (MDA) whole genome
amplification of single cells.
[0301] A similar approach to that described above was also applied
to data measured with the SYBR qPCR Assay using 20 SNPs on the X
chromosome of female and male blood samples. Again, the measurement
for each locus is the cycle number, C.sub.t, that it takes the dye
released in the well corresponding to this locus to exceed a
threshold. Note that we do not need to combine measurements from
different dyes in this case, since only one dye is used to
represent the total amount of genetic material at a locus,
independent of allele value. Sample 0 and 1 consisted of roughly 60
pg (10 cells) of total DNA per well of female and male samples,
respectively. FIG. 16, top, shows the means and standard deviations
of differences of C.sub.t for male and female samples at each of
the 20 loci. The mean difference between the male and female
samples is 1.03 and the SD of the difference is 0.78. Three runs
were conducted for every locus and only those loci for which there
were no allele dropouts over at least two experiment runs and over
both male and female samples were used. Applying the approach above
to this data, it was found that N=14 loci in order to have 99.99%
confidence of the test, assuming those 14 loci have the same MSNR
as the 20 used in this experiment.
[0302] And again, in order to estimate the number of SNPs needed
for single cell measurements, this technique was applied to samples
that consist of 6 pg of total DNA of female and male origin, see
FIG. 17. The estimated number N, of SNPs that need to be measured
in order to assure 99.99% confidence in differentiating one
chromosome copy from two, is 27, assuming those 27 loci have the
same MSNR as the 20 used in this experiment. As described above,
this result of 27 loci is lower than we expect to employ in
practice, since this experiment uses locus-specific amplification
in each qPCR well. We next describe an experiment that addresses
this issue. The experiments discussed hitherto were designed to
specifically address locus or allele-specific amplification of
small DNA quantities, without complicating the experiment by
introducing issues of cell lysis and whole genome amplification.
The quantitative measurements were done on small quantities of DNA
diluted from a large sample, without using whole genome
pre-amplification of single cells. Despite the goal to simplify the
experiment, separate dilution and pipetting of the two samples
affected the amount of DNA used to compare male with female samples
at each locus. To ameliorate this effect, the diluted
concentrations were measured using spectrometer comparisons to DNA
with known concentrations and were calibrated appropriately.
[0303] We now describe an experiment that employs the protocol that
will be used for real aneuploidy screening, including cell lysis
and whole genome amplification. The level of amplification is in
excess of 10,000 in order to generate sufficient genetic material
to populate roughly 1,000 Taqman wells with 60 pg of DNA from each
cell. In order to estimate the standard deviation of single
genotyping assays with whole-genome pre-amplifications, multiple
experiments were conducted where a single female HeLa cell (XX) was
pre-amplified using Multiple Displacement Amplification (MDA) and
the amounts of DNA at 20 loci on its X chromosome were measured
using quantitative PCR assays. This experiment was repeated for 16
single HeLa Cells in 16 separate MDA pre-amplifications. The
results of the experiment are designed to be conservative, since
the standard deviation between loci from separate amplifications
will be greater than the standard deviation expected between loci
used in the same reaction. In actual implementation, we will
compare loci of chromosomes that were involved in the same MDA
amplification. Furthermore, it is conservative since we assume that
the C.sub.t of a cell with one X chromosome will be one cycle more
than that of the HeLa cell (XX), i.e. we increased the C.sub.t
measurements of a double-X cell by 1 to simulate a single-X cell.
This is a conservative estimate because the difference between
C.sub.t values is typically greater than 1 due to inefficiencies
of the MDA and PCR assays: with a perfectly efficient PCR
reaction, the amount of DNA is doubled in each cycle. However, the
amplification is typically less than a factor of two in each PCR
cycle due to imperfect hybridization and other effects. This
experiment was designed to establish an upper limit on the number
of loci that we will need to measure to screen for aneuploidies that
involve M2 copy errors, where quantitative data is necessary.
Applying the Matched Filter technique to this data, as shown in
FIG. 18A, the minimum number of SNPs needed to achieve 99.99%
confidence is estimated as N=131.
[0304] To really estimate how many SNPs are required for this
approach to differentiate one or two copies of chromosomes in real
aneuploidy screening, it is highly desirable to test the system on
samples with one sample containing twice the amount of genetic
material as in the other. This precise control is not easily
achieved by sample handling in separate wells because the dilution,
pipetting and/or amplification efficiency vary from well to well.
Here an experiment was designed to overcome these issues by using
an internal control, namely by comparing the amount of genetic
material on Chromosome 7 and Chromosome X of a male sample.
Multiple experiments were conducted where a single male MRC-5 cell
(X) was pre-amplified using Multiple Displacement Amplification
(MDA) and the amounts of DNA at 11 loci on its Chromosome 7 and 13
loci on its Chromosome X were measured using ABI Taqman assays. The
difference in numbers of loci on Chromosome 7 and X was chosen
because larger standard deviation of measured amount of genetic
material is expected for Chromosome X. This experiment was repeated
for 15 single MRC-5 Cells in 15 separate MDA pre-amplifications.
After combining readouts from both FAM and VIC channels of all
loci, we used the averaged composite Ct values on Chromosome 7 as
the reference, which corresponds to the "normal" sample referred to
in aneuploidy screening. Composite Ct values on Chromosome X were
differenced by the reference, and if a significant difference voted
by many loci is detected, it then corresponds to the "aneuploidy"
condition. To create a matched filter, the standard deviation of
these differences at each locus was measured using the results of 15
independent single cell experiments. The mean and standard
deviation at each locus are shown in FIG. 18B. The resulting MSNR is
0.631, so the number of SNPs to be measured to achieve 99.99%
confidence is N=88. Note that this is the upper limit because we are still using
single cells pre-amplified separately.
Method 1b: Measuring Aneuploidy or Sex by Quantitative Techniques
that do not Make Allele Calls when the Mean and Std. Deviation is
not Known or is Uniform
[0305] When the characteristics of each locus are not known well,
the simplifying assumption that all the assays at each locus will
behave similarly can be made, namely that E(m.sub.xi) and
.sigma..sub.xi are constant across all loci i, so that it is
possible to refer instead only to E(m.sub.x) and .sigma..sub.x. In
this case, the matched filtering approach m=h.sup.Td.sub.x reduces
to finding the mean of the distribution of d.sub.x. This approach
will be referred to as comparison of means, and it will be used to
estimate the number of loci required for different kinds of
detection using real data.
[0306] As above, consider the scenario when there are two
chromosomes present in the sample (hypothesis h.sub.0) or one
chromosome present (h.sub.1). For h.sub.0, the distribution is
N(.mu..sub.0,.sigma..sub.0.sup.2) and for h.sub.1 the distribution
is N(.mu..sub.1,.sigma..sub.1.sup.2). Measure each of the
distributions using N.sub.0 and N.sub.1 samples respectively, with
measured sample means and SDs m.sub.1, m.sub.0, s.sub.1, and
s.sub.0. The means can be modeled as random variables M.sub.0,
M.sub.1 that are normally distributed as
M.sub.0.about.N(.mu..sub.0, .sigma..sub.0.sup.2/N.sub.0) and
M.sub.1.about.N(.mu..sub.1, .sigma..sub.1.sup.2/N.sub.1). Assume
N.sub.1 and N.sub.0 are large enough (>30) so that one can
assume that M.sub.1.about.N(m.sub.1, s.sub.1.sup.2/N.sub.1) and
M.sub.0.about.N(m.sub.0, s.sub.0.sup.2/N.sub.0). In order to test
whether the distributions are different, the difference of the
means test may be used, where d=m.sub.1-m.sub.0. The variance of
the random variable D is
.sigma..sub.d.sup.2=.sigma..sub.1.sup.2/N.sub.1+.sigma..sub.0.sup.2/N.sub.0,
which may be approximated as
.sigma..sub.d.sup.2=s.sub.1.sup.2/N.sub.1+s.sub.0.sup.2/N.sub.0.
Given h.sub.0, E(d)=0; given h.sub.1, E(d)=.mu..sub.1-.mu..sub.0.
Different techniques for making the call between h.sub.1 and
h.sub.0 will now be discussed.
[0307] Data measured with a different run of the Taqman Assay using
48 SNPs on the X chromosome was used to calibrate performance.
Sample 1 consists of roughly 0.3 ng of DNA per well of mixed male
origin containing one X chromosome; sample 0 consisted of roughly
0.3 ng of DNA per well of mixed female origin containing two X
chromosomes. N.sub.1=42 and N.sub.0=45. FIGS. 5 and 6 show the
histograms for samples 1 and 0. The distributions for these samples
are characterized by m.sub.1=32.259, s.sub.1=1.460,
.sigma..sub.m1=s.sub.1/sqrt(N.sub.1)=0.225; m.sub.0=30.75;
s.sub.0=1.202, .sigma..sub.m0=s.sub.0/sqrt(N.sub.0)=0.179. For
these samples d=1.509 and .sigma..sub.d=0.2879.
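As an illustration of the comparison-of-means calculation, the sketch below recomputes d and .sigma..sub.d from the sample statistics quoted above. The function name is hypothetical; the only inputs are the reported means, standard deviations and sample counts.

```python
import math


def mean_difference(m1, s1, n1, m0, s0, n0):
    """Return d = m1 - m0 and sigma_d = sqrt(s1^2/N1 + s0^2/N0)."""
    d = m1 - m0
    sigma_d = math.sqrt(s1 ** 2 / n1 + s0 ** 2 / n0)
    return d, sigma_d


# Statistics reported for the mixed male (sample 1) and mixed female (sample 0) runs.
d, sigma_d = mean_difference(32.259, 1.460, 42, 30.75, 1.202, 45)
print(f"d = {d:.3f}, sigma_d = {sigma_d:.4f}")
```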
[0308] Since this data is derived from mixed male and female
samples, much of the standard deviation is due to the different
allele frequencies at each SNP in the mixed samples. SD is
estimated by considering the variations in C.sub.t for one SNP at a
time, over multiple runs. This data is shown in FIG. 7. The
histogram is symmetric around 0 since C.sub.t for each SNP is
measured in two runs or experiments and the mean value of Ct for
each SNP is subtracted out. The average std. dev. across 20 SNPs in
the mixed male sample using two runs is s=0.597. This SD will be
conservatively used for both male and female samples, since SD for
the female sample will be smaller than for the male sample. In
addition, note that the measurement from only one dye is being
used, since the mixed samples are assumed to be heterozygous for
all SNPs. The use of both dyes requires the measurements of each
allele at a locus to be combined, which is more complicated (see
section on linearizing measurements). Combining measurements on
both dyes would double signal amplitude and increase noise
amplitude by roughly sqrt(2), resulting in an SNR improvement of
roughly sqrt(2) or 3 dB.
Detection Assuming No Mosaicism and No Reference Sample
[0309] Assume that m.sub.0 is known perfectly from many
experiments, and every experiment runs only one sample to compute
m.sub.1 to compare with m.sub.0. N.sub.1 is the number of assays
and assume that each assay is a different SNP locus. A threshold t
can be set half way between m.sub.0 and m.sub.1 to make the
likelihood of false positives equal the likelihood of false negatives,
and a sample is labeled abnormal if it is above the threshold.
Assume s.sub.0=s.sub.1=s=0.597 and use the 5-sigma approach so that
the probability of false negatives or positives is
1-normcdf(5,0,1)=2.87e-7. The goal is for
5s.sub.1/sqrt(N.sub.1)<(m.sub.1-m.sub.0)/2, hence N.sub.1=100
s.sub.1.sup.2/(m.sub.1-m.sub.0).sup.2=16. Now, an approach where
the probability of a false positive is allowed to be higher than
the probability of a false negative, which is the harmful
scenario, may also be used. If a positive is measured, the
experiment may be rerun. Consequently, it is possible to say that
the probability of a false negative should be equal to the square
of the probability of a false positive. Consider FIG. 1, let
t=threshold, and assume Sigma_0=Sigma_1=s. Thus
(1-normcdf((t-m.sub.0)/s,0,1)).sup.2=1-normcdf((m.sub.1-t)/s,0,1).
Solving this, it can be shown that t=m.sub.0+0.32(m.sub.1-m.sub.0).
Hence the goal is for
5s/sqrt(N.sub.1)<m.sub.1-m.sub.0-0.32(m.sub.1-m.sub.0)=(m.sub.1-m.sub.0)/1.47,
hence
N.sub.1=(5.sup.2)(1.47.sup.2)s.sup.2/(m.sub.1-m.sub.0).sup.2=9.
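The two locus counts in this paragraph follow from the same inequality, z*s/sqrt(N.sub.1) no larger than a chosen fraction of (m.sub.1-m.sub.0). The sketch below is a minimal illustration, assuming s=0.597 and m.sub.1-m.sub.0=1.509 from the preceding paragraphs; the function names and the margin_fraction parameter are hypothetical. The threshold position 0.32 is taken from the text as given and is not re-derived here.

```python
import math


def tail_probability(z):
    """Upper-tail probability of a standard normal, i.e. 1 - normcdf(z, 0, 1)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def loci_required(z, s, delta, margin_fraction):
    """Smallest integer N with z*s/sqrt(N) <= margin_fraction*delta."""
    return math.ceil((z * s / (margin_fraction * delta)) ** 2)


S = 0.597      # per-locus standard deviation assumed in the text
DELTA = 1.509  # measured m1 - m0

print(tail_probability(5))                 # false-call probability at 5 sigma
print(loci_required(5, S, DELTA, 0.5))     # threshold half way between the means
print(loci_required(5, S, DELTA, 1 - 0.32))  # threshold at m0 + 0.32*(m1 - m0)
```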
Detection with Mosaicism without Running a Reference Sample
[0310] Assume the same situation as above, except that the goal is
to detect mosaicism with a probability of 97.7% (i.e. 2-sigma
approach). This is better than the standard approach to
amniocentesis which extracts roughly 20 cells and photographs them.
If one assumes that 1 in 20 cells is aneuploid and this is detected
with 100% reliability, the probability of having at least one of
the group being aneuploid using the standard approach is
1-0.95.sup.20=64%. If 5% of the cells are aneuploid (call this
sample 3) then m.sub.3=0.95m.sub.0+0.05m.sub.1 and
var(m.sub.3)=(0.95s.sub.0.sup.2+0.05s.sub.1.sup.2)/N.sub.1. Thus,
2std(m.sub.3)<(m.sub.3-m.sub.0)/2=>sqrt(0.95s.sub.0.sup.2+0.05s.sub.1.sup.2)/sqrt(N.sub.1)<0.05(m.sub.1-m.sub.0)/4=>
N.sub.1=16(0.95s.sub.0.sup.2+0.05s.sub.1.sup.2)/(0.05.sup.2(m.sub.1-m.sub.0).sup.2)=1001.
Note that using the goal of 1-sigma statistics, which is still
better than can be achieved using the conventional approach (i.e.
detection with 84.1% probability), it can be shown in a similar
manner that N.sub.1=250.
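The mosaicism figures above can be checked with a short sketch. It assumes s.sub.0=s.sub.1=0.597 and m.sub.1-m.sub.0=1.509 as in the preceding paragraphs and a 5% aneuploid fraction; the function names are hypothetical. The second function returns the unrounded bound, which the text quotes as 1001 and 250.

```python
def detection_probability(n_cells, aneuploid_fraction):
    """Chance that at least one of n_cells sampled cells is aneuploid."""
    return 1.0 - (1.0 - aneuploid_fraction) ** n_cells


def mosaic_loci_bound(z, s0, s1, delta, f):
    """Bound on N1 from z*std(m3) < (m3 - m0)/2, where m3 - m0 = f*delta."""
    mixed_var = (1.0 - f) * s0 ** 2 + f * s1 ** 2
    return (2.0 * z) ** 2 * mixed_var / (f * delta) ** 2


S = 0.597
DELTA = 1.509

print(detection_probability(20, 0.05))           # conventional 20-cell approach
print(mosaic_loci_bound(2, S, S, DELTA, 0.05))   # 2-sigma goal
print(mosaic_loci_bound(1, S, S, DELTA, 0.05))   # 1-sigma goal
```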
Detection with No Mosaicism and Using a Reference Sample
[0311] Although this approach may not be necessary, assume that
every experiment runs two samples in order to compare m.sub.1 with
truth sample m.sub.0. Assume that N=N.sub.1=N.sub.0. Compute
d=m.sub.1-m.sub.0 and, assuming .sigma..sub.1=.sigma..sub.0, set a
threshold t=(m.sub.0+m.sub.1)/2 so that the probability of false
positives and false negatives is equal. To make the probability of
false negatives 2.87e-7, it must be the case that
(m.sub.1-m.sub.0)/2>5sqrt(s.sub.1.sup.2/N+s.sub.0.sup.2/N)=>
N=100(s.sub.1.sup.2+s.sub.0.sup.2)/(m.sub.1-m.sub.0).sup.2=32.
Detection with Mosaicism and Running a Reference Sample
[0312] As above, assume the probability of false negatives is 2.3%
(i.e. 2-sigma approach). If 5% of the cells are aneuploid (call
this sample 3) then m.sub.3=0.95m.sub.0+0.05m.sub.1 and
var(m.sub.3)=(0.95s.sub.0.sup.2+0.05s.sub.1.sup.2)/N.sub.1.
d=m.sub.3-m.sub.0 and
.sigma..sub.d.sup.2=(1.95s.sub.0.sup.2+0.05s.sub.1.sup.2)/N. It
must be that
2std(d)<(m.sub.3-m.sub.0)/2=>sqrt(1.95s.sub.0.sup.2+0.05s.sub.1.sup.2)/sqrt(N)<0.05(m.sub.1-m.sub.0)/4=>
N=16(1.95s.sub.0.sup.2+0.05s.sub.1.sup.2)/(0.05.sup.2(m.sub.1-m.sub.0).sup.2)=2002.
Again using 1-sigma approach, it can be shown in a similar manner
that N=500.
[0313] Consider the case if the goal is only to detect 5% mosaicism
with a probability of 64% as is the current state of the art. Then,
the probability of false negative would be 36%. In other words, it
would be necessary to find x such that 1-normcdf(x,0,1)=36%, which
gives x=0.36. Thus
N=4(0.36.sup.2)(1.95s.sub.0.sup.2+0.05s.sub.1.sup.2)/(0.05.sup.2(m.sub.1-m.sub.0).sup.2)=65
for the 2-sigma approach, or N=33 for the 1-sigma approach.
Note that this would result in a very high level of false
positives, which needs to be addressed, since such a level of false
positives is not currently a viable alternative.
[0314] Also note that if N is limited to 384 (i.e. one 384 well
Taqman plate per chromosome), and the goal is to detect mosaicism
with a probability of 97.72%, then it will be possible to detect
mosaicism of 8.1% using the 2-sigma approach. If the goal is instead
to detect mosaicism with a probability of 84.1% (i.e. with a 15.9%
false negative rate), then it will be possible to detect mosaicism of
5.8% using the 1-sigma approach. To detect mosaicism of 19% with a
confidence of 97.72% it would require roughly 70 loci. Thus one
could screen for 5 chromosomes on a single plate.
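The plate-budget reasoning can be sketched by inverting the locus-count relation. The sketch assumes no reference sample, s.sub.0=s.sub.1=0.597, m.sub.1-m.sub.0=1.509 and z=2 (97.72% detection); under those assumptions it reproduces the 8.1% and 70-locus figures. The function names are hypothetical.

```python
import math


def detectable_mosaic_fraction(n_loci, z, s, delta):
    """Smallest aneuploid fraction f satisfying z*s/sqrt(N) <= f*delta/2."""
    return 2.0 * z * s / (delta * math.sqrt(n_loci))


def loci_for_fraction(f, z, s, delta):
    """Smallest integer N that detects mosaic fraction f at z sigma."""
    return math.ceil((2.0 * z * s / (f * delta)) ** 2)


S = 0.597
DELTA = 1.509
WELLS = 384  # one 384-well plate

f_min = detectable_mosaic_fraction(WELLS, 2, S, DELTA)
n_19 = loci_for_fraction(0.19, 2, S, DELTA)
chromosomes_per_plate = WELLS // n_19

print(f"detectable fraction with {WELLS} loci: {f_min:.3f}")
print(f"loci for 19% mosaicism: {n_19}; chromosomes per plate: {chromosomes_per_plate}")
```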
[0315] The summary of each of these different scenarios is provided
in Table 1. Also included in this table are the results generated
from qPCR and the SYBR assays. The methods described above were
used and the simplifying assumption was made that the performance
of the qPCR assay for each locus is the same. FIGS. 8 and 9 show
the histograms for samples 1 and 0, as described above.
N.sub.0=N.sub.1=47. The distributions of the measurements for these
samples are characterized by m.sub.1=27.65, s.sub.1=1.40,
.sigma..sub.m1=s.sub.1/sqrt(N.sub.1)=0.204; m.sub.0=26.64;
s.sub.0=1.146, .sigma..sub.m0=s.sub.0/sqrt(N.sub.0)=0.167. For
these samples d=1.01 and .sigma..sub.d=0.2636. FIG. 10 shows the
difference between C.sub.t for the male and female samples for each
locus, with a standard deviation of the difference over all loci of
0.75. The SD was approximated for each measurement of each locus on
the male or female sample as 0.75/sqrt(2)=0.53.
Method 2: Qualitative Techniques that Use Allele Calls
[0316] In this section, no assumption is made that the assay is
quantitative. Instead, the assumption is that the allele calls are
qualitative, and that there is no meaningful quantitative data
coming from the assays. This approach is suitable for any assay
that makes an allele call. FIG. 11 describes how different haploid
gametes form during meiosis, and will be used to describe the
different kinds of aneuploidy that are relevant for this section.
The best algorithm depends on the type of aneuploidy that is being
detected.
[0317] Consider a situation where aneuploidy is caused by a third
segment that has no section that is a copy of either of the other
two segments. From FIG. 11, the situation would arise, for example,
if p.sub.1 and p.sub.4, or p.sub.2 and p.sub.3, both arose in the
child cell in addition to one segment from the other parent. This
is very common, given the mechanism which causes aneuploidy. One
approach is to start off with a hypothesis h.sub.0 that there are
two segments in the cell and what these two segments are. Assume,
for the purpose of illustration, that h.sub.0 is for p.sub.3 and m.sub.4
from FIG. 11. In a preferred embodiment this hypothesis comes from
algorithms described elsewhere in this document. Hypothesis h.sub.1
is that there is an additional segment that has no sections that
are a copy of the other segments. This would arise, for example, if
p.sub.2 or m.sub.1 was also present. It is possible to identify all
loci that are homozygous in p.sub.3 and m.sub.4. Aneuploidy can be
detected by searching for heterozygous genotype calls at loci that
are expected to be homozygous.
[0318] Assume every locus has two possible alleles, x and y. Let
the probability of alleles x and y in general be p.sub.x and
p.sub.y respectively, and p.sub.x+p.sub.y=1. If h.sub.1 is true,
then for each locus i for which p.sub.3 and m.sub.4 are homozygous,
the probability of a non-homozygous call is p.sub.y or
p.sub.x, depending on whether the locus is homozygous in x or y
respectively. Note: based on knowledge of the parent data, i.e.
p.sub.1, p.sub.2, p.sub.4 and m.sub.1, m.sub.2, m.sub.3, it is
possible to further refine the probabilities for having
non-homozygous alleles x or y at each locus. This will enable more
reliable measurements for each hypothesis with the same number of
SNPs, but complicates notation, so this extension will not be
explicitly dealt with. It should be clear to someone skilled in the
art how to use this information to increase the reliability of the
hypothesis.
[0319] The probability of allele dropouts is p.sub.d. The
probability of finding a heterozygous genotype at locus i is
p.sub.0i given hypothesis h.sub.0 and p.sub.1i given hypothesis
h.sub.1.
Given h.sub.0: p.sub.0i=0
Given h.sub.1: p.sub.1i=p.sub.y(1-p.sub.d) or
p.sub.1i=p.sub.x(1-p.sub.d), depending on whether the locus is
homozygous for x or y, respectively.
[0320] Create a measurement m=(1/N.sub.h).SIGMA..sub.i=1 . . . Nh
I.sub.i, where I.sub.i is an indicator variable, and is 1 if a
heterozygous call is made and 0 otherwise. N.sub.h is the number of
homozygous loci. One can simplify the explanation by assuming that
p.sub.x=p.sub.y and p.sub.0i, p.sub.1i for all loci are the same
two values p.sub.0 and p.sub.1. Given h.sub.0, E(m)=p.sub.0=0 and
.sigma..sup.2.sub.m|h0=p.sub.0(1-p.sub.0)/N.sub.h. Given h.sub.1,
E(m)=p.sub.1 and .sigma..sup.2.sub.m|h1=p.sub.1(1-p.sub.1)/N.sub.h.
Using 5-sigma statistics, and making the probability of false
positives equal the probability of false negatives, it can be shown
that (p.sub.1-p.sub.0)/2>5.sigma..sub.m|h1, hence
N.sub.h=100(p.sub.0(1-p.sub.0)+p.sub.1(1-p.sub.1))/(p.sub.1-p.sub.0).sup.2.
For 2-sigma confidence instead of 5-sigma confidence, it can be
shown that
N.sub.h=(4)(2.sup.2)(p.sub.0(1-p.sub.0)+p.sub.1(1-p.sub.1))/(p.sub.1-p.sub.0).sup.2.
[0321] It is necessary to sample enough loci N that there will be
sufficient available homozygous loci N.sub.h-avail such that the
confidence is at least 97.7% (2-sigma). Characterize
N.sub.h-avail=.SIGMA..sub.i=1 . . . NJ.sub.i where J.sub.i is an
indicator variable of value 1 if the locus is homozygous and 0
otherwise. The probability of the locus being homozygous is
p.sub.x.sup.2+p.sub.y.sup.2. Consequently,
E(N.sub.h-avail)=N(p.sub.x.sup.2+p.sub.y.sup.2) and
.sigma..sub.Nh-avail.sup.2=N(p.sub.x.sup.2+p.sub.y.sup.2)(1-p.sub.x.sup.2-p.sub.y.sup.2).
To guarantee N is large enough with 97.7%
confidence, it must be that
E(N.sub.h-avail)-2.sigma..sub.Nh-avail=N.sub.h where N.sub.h is
found from above.
[0322] For example, if one assumes p.sub.d=0.3,
p.sub.x=p.sub.y=0.5, one can find N.sub.h=186 and N=391 for 5-sigma
confidence. Similarly, it is possible to show that N.sub.h=30 and
N=68 for 2-sigma confidence i.e. 97.7% confidence in false
negatives and false positives.
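The homozygous-locus counts in this example can be reproduced with a short sketch of the N.sub.h formula. It assumes p.sub.0=0 and a single value p.sub.1 for all loci, as in the simplified derivation; the function names are hypothetical. The sketch covers only N.sub.h, not the further step from N.sub.h to the total number of sampled loci N.

```python
import math


def heterozygous_call_probability(p_other_allele, p_dropout):
    """p1: chance of a heterozygous call at a homozygous locus under h1."""
    return p_other_allele * (1.0 - p_dropout)


def homozygous_loci_required(z, p0, p1):
    """N_h = (2z)^2 * (p0(1-p0) + p1(1-p1)) / (p1 - p0)^2, rounded up."""
    spread = p0 * (1.0 - p0) + p1 * (1.0 - p1)
    return math.ceil((2.0 * z) ** 2 * spread / (p1 - p0) ** 2)


p1 = heterozygous_call_probability(0.5, 0.3)  # p_y = 0.5, p_d = 0.3
print(homozygous_loci_required(5, 0.0, p1))   # 5-sigma confidence
print(homozygous_loci_required(2, 0.0, p1))   # 2-sigma confidence
```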
[0323] Note that a similar approach can be applied to looking for
deletions of a segment when h.sub.0 is the hypothesis that two
known chromosome segments are present, and h.sub.1 is the hypothesis
that one of the chromosome segments is missing. For example, it is
possible to look for all of those loci that should be heterozygous
but are homozygous, factoring in the effects of allele dropouts as
has been done above.
[0324] Also note that even though the assay is qualitative, allele
dropout rates may be used to provide a type of quantitative measure
on the number of DNA segments present.
Method 3: Making Use of Known Alleles of Reference Sequences, and
Quantitative Allele Measurements
[0325] Here, it is assumed that the alleles of the normal or
expected set of segments are known. In order to check for three
chromosomes, the first step is to clean the data, assuming two of
each chromosome. In a preferred embodiment of the invention, the
data cleaning in the first step is done using methods described
elsewhere in this document. Then the signal associated with the
expected two segments is subtracted from the measured data. One can
then look for an additional segment in the remaining signal. A
matched filtering approach is used, and the signal characterizing
the additional segment is based on each of the segments that are
believed to be present, as well as their complementary chromosomes.
For example, considering FIG. 11, if the results of PS indicate
that segments p2 and m1 are present, the technique described here
may be used to check for the presence of p2, p3, m1 and m4 on the
additional chromosome. If there is an additional segment present,
it is guaranteed to have more than 50% of the alleles in common
with at least one of these test signals. Note that another
approach, not described in detail here, is to use an algorithm
described elsewhere in the document to clean the data, assuming an
abnormal number of chromosomes, namely 1, 3, 4 and 5 chromosomes,
and then to apply the method discussed here. The details of this
approach should be clear to someone skilled in the art after having
read this document.
[0326] Hypothesis h.sub.0 is that there are two chromosomes with
allele vectors a.sub.1, a.sub.2. Hypothesis h.sub.1 is that there
is a third chromosome with allele vector a.sub.3. Using a method
described in this document to clean the genetic data, or another
technique, it is possible to determine the alleles of the two
segments expected by h.sub.0: a.sub.1=[a.sub.11 . . . a.sub.1N] and
a.sub.2=[a.sub.21 . . . a.sub.2N] where each element a.sub.ji is
either x or y. The expected signal is created for hypothesis
h.sub.0: s.sub.0x=[f.sub.x(a.sub.11, a.sub.21) . . .
f.sub.x(a.sub.1N, a.sub.2N)], s.sub.0y=[f.sub.y(a.sub.11,
a.sub.21) . . . f.sub.y(a.sub.1N, a.sub.2N)] where f.sub.x, f.sub.y
describe the mapping from the set of alleles to the measurements of
each allele. Given h.sub.0, the data may be described as
d.sub.xi=s.sub.0xi+n.sub.xi,
n.sub.xi.about.N(0,.sigma..sub.xi.sup.2);
d.sub.yi=s.sub.0yi+n.sub.yi,
n.sub.yi.about.N(0,.sigma..sub.yi.sup.2). Create a measurement by
differencing the data and the reference signal:
m.sub.xi=d.sub.xi-s.sub.0xi; m.sub.yi=d.sub.yi-s.sub.0yi. The full
measurement vector is m=[m.sub.x.sup.T m.sub.y.sup.T].sup.T.
[0327] Now, create the signal for the segment of interest (the
segment whose presence is suspected and will be sought in the
residual) based on the assumed alleles of this segment:
a.sub.3=[a.sub.31 . . . a.sub.3N]. Describe the signal for the
residual as: s.sub.r=[s.sub.rx.sup.T s.sub.ry.sup.T].sup.T where
s.sub.rx=[f.sub.rx(a.sub.31) . . . f.sub.rx(a.sub.3N)],
s.sub.ry=[f.sub.ry(a.sub.31) . . . f.sub.ry(a.sub.3N)] where
f.sub.rx(a.sub.3i)=.delta..sub.xi if a.sub.3i=x and 0 otherwise,
f.sub.ry(a.sub.3i)=.delta..sub.yi if a.sub.3i=y and 0 otherwise.
This analysis assumes that the measurements have been linearized
(see section below) so that the presence of one copy of allele x at
locus i generates data .delta..sub.xi+n.sub.xi and the presence of
.kappa..sub.x copies of the allele x at locus i generates data
.kappa..sub.x.delta..sub.xi+n.sub.xi. Note however that this
assumption is not necessary for the general approach described
here. Given h.sub.1, if allele a.sub.3i=x then
m.sub.xi=.delta..sub.xi+n.sub.xi, m.sub.yi=n.sub.yi and if
a.sub.3i=y then m.sub.xi=n.sub.xi,
m.sub.yi=.delta..sub.yi+n.sub.yi. Consequently, a matched filter
h=(1/N)R.sup.-1s.sub.r can be created where
R=diag([.sigma..sub.x1.sup.2 . . . .sigma..sub.xN.sup.2
.sigma..sub.y1.sup.2 . . . .sigma..sub.yN.sup.2]). The measurement
is m=h.sup.Td.
h.sub.0: m=(1/N).SIGMA..sub.i=1 . . . N s.sub.rxi n.sub.xi/.sigma..sub.xi.sup.2+s.sub.ryi n.sub.yi/.sigma..sub.yi.sup.2
h.sub.1: m=(1/N).SIGMA..sub.i=1 . . . N s.sub.rxi(.delta..sub.xi+n.sub.xi)/.sigma..sub.xi.sup.2+s.sub.ryi(.delta..sub.yi+n.sub.yi)/.sigma..sub.yi.sup.2
In order to estimate the number of SNPs required, make the
simplifying assumptions that all assays for all alleles and all
loci have similar characteristics, namely that
.delta..sub.xi=.delta..sub.yi=.delta. and
.sigma..sub.xi=.sigma..sub.yi=.sigma. for i=1 . . . N. Then, the
mean and standard deviation may be found as follows:
h.sub.0: E(m)=m.sub.0=0; .sigma..sub.m|h0.sup.2=(1/(N.sup.2.sigma..sup.4))(N/2)(.sigma..sup.2.delta..sup.2+.sigma..sup.2.delta..sup.2)=.delta..sup.2/(N.sigma..sup.2)
h.sub.1: E(m)=m.sub.1=(1/N)(N/(2.sigma..sup.2))(.delta..sup.2+.delta..sup.2)=.delta..sup.2/.sigma..sup.2; .sigma..sub.m|h1.sup.2=(1/(N.sup.2.sigma..sup.4))(N)(.sigma..sup.2.delta..sup.2)=.delta..sup.2/(N.sigma..sup.2)
Now compute a signal-to-noise ratio (SNR) for this test of h.sub.1
versus h.sub.0. The signal is
m.sub.1-m.sub.0=.delta..sup.2/.sigma..sup.2, and the noise variance
of this measurement is
.sigma..sub.m|h0.sup.2+.sigma..sub.m|h1.sup.2=2.delta..sup.2/(N.sigma..sup.2).
Consequently, the SNR for this test is
(.delta..sup.4/.sigma..sup.4)/(2.delta..sup.2/(N.sigma..sup.2))=N.delta..sup.2/(2.sigma..sup.2).
[0328] Compare this SNR to the scenario where the genetic
information is simply summed at each locus without performing a
matched filtering based on the allele calls. Assume that
h=(1/N)1, where 1 is the vector of N ones, and make the
simplifying assumptions as above that
.delta..sub.xi=.delta..sub.yi=.delta. and
.sigma..sub.xi=.sigma..sub.yi=.sigma. for i=1 . . . N. For this
scenario, it is straightforward to show that if m=h.sup.Td:
$$h_0:\quad E(m)=m_0=0;\qquad \sigma_{m|h_0}^{2}=\frac{N\sigma^{2}}{N^{2}}+\frac{N\sigma^{2}}{N^{2}}=\frac{2\sigma^{2}}{N}$$
$$h_1:\quad E(m)=m_1=\frac{1}{N}\left(\frac{N\delta}{2}+\frac{N\delta}{2}\right)=\delta;\qquad \sigma_{m|h_1}^{2}=\frac{1}{N^{2}}\left(N\sigma^{2}+N\sigma^{2}\right)=\frac{2\sigma^{2}}{N}$$
Consequently, the SNR for this test is
N.delta..sup.2/(4.sigma..sup.2). In other words, by using a matched
filter that only sums the allele measurements that are expected for
segment a.sub.3, the number of SNPs required is reduced by a factor
of 2. This ignores the SNR gain achieved by using matched filtering
to account for the different efficiencies of the assays at each
locus.
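By way of illustration only, the comparison between the matched filter and the plain sum can be checked numerically. The following Python sketch evaluates the two closed-form SNR expressions and estimates them by simulation under the same simplifying assumptions (equal .delta. and .sigma. at every locus). The function names, the seed, and the choice that the segment carries allele x at the first half of the loci and allele y at the second half are assumptions made for this example and are not part of the disclosure.

```python
import random
import statistics


def analytic_snr(n_loci, delta, sigma):
    """Closed-form SNRs: (matched filter, plain sum)."""
    matched = n_loci * delta**2 / (2 * sigma**2)
    summed = n_loci * delta**2 / (4 * sigma**2)
    return matched, summed


def simulate_snr(n_loci, delta, sigma, trials=2000, seed=0):
    """Monte Carlo estimate of both SNRs under the simplified model.

    Assumption for illustration: the segment carries allele x at the first
    half of the loci and allele y at the second half.
    """
    rng = random.Random(seed)
    carries_x = [i < n_loci // 2 for i in range(n_loci)]
    results = {"matched": {0: [], 1: []}, "summed": {0: [], 1: []}}
    for hyp in (0, 1):  # 0: segment absent (h0), 1: segment present (h1)
        for _ in range(trials):
            m_matched = 0.0
            m_summed = 0.0
            for is_x in carries_x:
                m_x = rng.gauss(0.0, sigma) + (delta if hyp and is_x else 0.0)
                m_y = rng.gauss(0.0, sigma) + (delta if hyp and not is_x else 0.0)
                # Matched filter: weight only the expected allele by delta/sigma^2.
                expected = m_x if is_x else m_y
                m_matched += delta * expected / sigma**2
                # Plain sum: add both allele channels with equal weight.
                m_summed += m_x + m_y
            results["matched"][hyp].append(m_matched / n_loci)
            results["summed"][hyp].append(m_summed / n_loci)
    snr = {}
    for name, by_hyp in results.items():
        signal = statistics.mean(by_hyp[1]) - statistics.mean(by_hyp[0])
        noise_var = statistics.variance(by_hyp[0]) + statistics.variance(by_hyp[1])
        snr[name] = signal**2 / noise_var
    return snr
```

Under these assumptions the simulated values should approach the closed-form values as the number of trials grows, with the matched filter giving about twice the SNR of the plain sum.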
[0329] Note that if we do not correctly characterize the reference
signals s.sub.xi and s.sub.yi then the SD of the noise or
disturbance on the resulting measurement signals m.sub.xi and
m.sub.yi will be increased. This will be insignificant if
.delta.<<.sigma., but otherwise it will increase the
probability of false detections. Consequently, this technique is
well suited to test the hypothesis where three segments are present
and two segments are assumed to be exact copies of each other. In
this case, s.sub.xi and s.sub.yi will be reliably known using
techniques of data cleaning based on qualitative allele calls
described elsewhere. In one embodiment method 3 is used in
combination with method 2 which uses qualitative genotyping and,
aside from the quantitative measurements from allele dropouts, is
not able to detect the presence of a second exact copy of a
segment.
[0330] We now describe another quantitative technique that makes
use of allele calls. The method involves comparing the relative
amount of signal at each of the four registers for a given allele.
One can imagine that in the idealized case involving a single,
normal cell, where homogeneous amplification occurs (or the
relative amounts of amplification are normalized), four possible
situations can occur: (i) in the case of a heterozygous allele, the
relative intensities of the four registers will be approximately
1:1:0:0, and the absolute intensity of the signal will correspond
to one base pair; (ii) in the case of a homozygous allele, the
relative intensities will be approximately 1:0:0:0, and the
absolute intensity of the signal will correspond to two base pairs;
(iii) in the case of an allele where ADO occurs for one of the
alleles, the relative intensities will be approximately 1:0:0:0,
and the absolute intensity of the signal will correspond to one
base pair; and (iv) in the case of an allele where ADO occurs for
both of the alleles, the relative intensities will be approximately
0:0:0:0, and the absolute intensity of the signal will correspond
to no base pairs.
[0331] In the case of aneuploidies, however, different situations
will be observed. For example, in the case of trisomy where there is
no ADO, one of three situations will occur: (i) in the case of a
triply heterozygous allele, the relative intensities of the four
registers will be approximately 1:1:1:0, and the absolute intensity
of the signal will correspond to one base pair; (ii) in the case
where two of the alleles are homozygous, the relative intensities
will be approximately 2:1:0:0, and the absolute intensity of the
signal will correspond to two and one base pairs, respectively;
(iii) in the case where all alleles are homozygous, the relative
intensities will be approximately 1:0:0:0, and the absolute
intensity of the signal will correspond to three base pairs. If
allele dropout occurs in the case of an allele in a cell with
trisomy, one of the situations expected for a normal cell will be
observed. In the case of monosomy, the relative intensities of the
four registers will be approximately 1:0:0:0, and the absolute
intensity of the signal will correspond to one base pair. This
situation corresponds to the case of a normal cell where ADO of one
of the alleles has occurred, however in the case of the normal
cell, this will only be observed at a small percentage of the
alleles. In the case of uniparental disomy, where two identical
chromosomes are present, the relative intensities of the four
registers will be approximately 1:0:0:0, and the absolute intensity
of the signal will correspond to two base pairs. In the case of UPD
where two different chromosomes from one parent are present, this
method will indicate that the cell is normal, although further
analysis of the data using other methods described in this patent
will uncover this.
[0332] In all of these cases, whether the cells are normal, have
aneuploidies, or have UPD, the data from one SNP will not be adequate
to make a decision about the state of the cell. However, if the
probabilities of each of the above hypotheses are calculated, and
those probabilities are combined for a sufficient number of SNPs on
a given chromosome, one hypothesis will predominate, and it will be
possible to determine the state of the chromosome with high
confidence.
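By way of illustration only, the combination of per-SNP probabilities across a chromosome might be sketched as follows in Python. The sketch takes, for each SNP, the likelihood of the observed register intensities under each hypothesis as an input; how those per-SNP likelihoods are computed is not specified here. Treating SNPs as independent, the uniform default prior, the function name, and the hypothesis labels are all assumptions made for this example.

```python
import math


def combine_snp_likelihoods(per_snp_likelihoods, priors=None):
    """Combine per-SNP likelihoods into a posterior probability per hypothesis.

    per_snp_likelihoods: list of dicts, hypothesis name -> P(data at SNP | hypothesis).
    priors: optional dict of prior probabilities (uniform if omitted).
    SNPs are treated as independent, which is an assumption of this sketch.
    """
    hypotheses = list(per_snp_likelihoods[0])
    if priors is None:
        priors = {h: 1.0 / len(hypotheses) for h in hypotheses}
    # Work in log space so that products over many SNPs do not underflow.
    log_post = {h: math.log(priors[h]) for h in hypotheses}
    for snp in per_snp_likelihoods:
        for h in hypotheses:
            log_post[h] += math.log(max(snp[h], 1e-300))
    # Normalise with the log-sum-exp trick.
    peak = max(log_post.values())
    total = sum(math.exp(v - peak) for v in log_post.values())
    return {h: math.exp(v - peak) / total for h, v in log_post.items()}
```

With a single SNP whose likelihoods are only mildly different, no hypothesis stands out; when the same mild preference is repeated over many SNPs, the posterior for one hypothesis approaches 1, which is the behavior described above.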
Methods for Linearizing Quantitative Measurements
[0333] Many approaches may be taken to linearize measurements of
the amount of genetic material at a specific locus so that data
from different alleles can be easily summed or differenced. We
first discuss a generic approach and then discuss an approach that
is designed for a particular type of assay.
[0334] Assume data d refers to a nonlinear measurement of the
amount of genetic material of allele x at locus i. Create a
training set of data using N measurements, where for each
measurement, it is estimated or known that the amount of genetic
material corresponding to data d.sub.xi is .beta..sub.xi. The
training set .beta..sub.xi, i=1 . . . N, is chosen to span all the
different amounts of genetic material that might be encountered in
practice. Standard regression techniques can be used to train a
function that maps from the nonlinear measurement, d.sub.xi, to the
expectation of the linear measurement, E(.beta..sub.xi). For
example, a linear regression can be used to train a polynomial
function of order P, such that E(.beta..sub.xi)=[1 d.sub.xi
d.sub.xi.sup.2 . . . d.sub.xi.sup.P]c where c is the vector of
coefficients c=[c.sub.0 c.sub.1 . . . c.sub.P].sup.T. To train this
linearizing function, we create a vector of the amount of genetic
material for N measurements .beta..sub.x=[.beta..sub.x1 . . .
.beta..sub.xN].sup.T and a matrix of the measured data raised to
powers 0 . . . P: D=[[1 d.sub.x1 d.sub.x1.sup.2 . . .
d.sub.x1.sup.P].sup.T [1 d.sub.x2 d.sub.x2.sup.2 . . .
d.sub.x2.sup.P].sup.T . . . [1 d.sub.xN d.sub.xN.sup.2 . . .
d.sub.xN.sup.P].sup.T].sup.T. The coefficients can then be found
using a least squares fit
c=(D.sup.TD).sup.-1D.sup.T.beta..sub.x.
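By way of illustration only, the least squares fit c=(D.sup.TD).sup.-1D.sup.T.beta..sub.x can be sketched in Python using only the standard library. The sketch builds the matrix D of powers of the measured data and solves the normal equations by Gauss-Jordan elimination; the function names are hypothetical, and in practice a numerical library would typically be used instead.

```python
def fit_linearizing_polynomial(d, beta, order):
    """Return coefficients c solving (D^T D) c = D^T beta for a polynomial of given order."""
    rows = [[x**p for p in range(order + 1)] for x in d]  # matrix D, one row per measurement
    k = order + 1
    # Augmented matrix of the normal equations [D^T D | D^T beta].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * b for r, b in zip(rows, beta))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def linearize(d_value, coeffs):
    """Map a nonlinear measurement to the expected amount of genetic material."""
    return sum(c * d_value**p for p, c in enumerate(coeffs))
```

Once the coefficients have been trained on the N calibration measurements, `linearize` can be applied to any new measurement from the same assay.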
[0335] Rather than depend on generic functions such as fitted
polynomials, we may also create specialized functions for the
characteristics of a particular assay. We consider, for example,
the Taqman assay or a qPCR assay. The amount of dye for allele x
and some locus i, as a function of time up to the point where it
crosses some threshold, may be described as an exponential curve
with a bias offset:
g.sub.xi(t)=.alpha..sub.xi+.beta..sub.xiexp(.gamma..sub.xit) where
.alpha..sub.xi is the bias offset, .gamma..sub.xi is the
exponential growth rate, and .beta..sub.xi corresponds to the
amount of genetic material. To cast the measurements in terms of
.beta..sub.xi, compute the parameter .alpha..sub.xi by looking at
the asymptotic limit of the curve g.sub.xi(-.infin.), and then
find .beta..sub.xi and .gamma..sub.xi by taking the log of the
curve to obtain
log(g.sub.xi(t)-.alpha..sub.xi)=log(.beta..sub.xi)+.gamma..sub.xit
and performing a standard linear regression. Once we have values
for .alpha..sub.xi and .gamma..sub.xi, another approach is to
compute .beta..sub.xi from the time, t.sub.x, at which the
threshold g.sub.x is exceeded:
.beta..sub.xi=(g.sub.x-.alpha..sub.xi)exp(-.gamma..sub.xit.sub.x).
This will be a noisy measurement of the true amount of genetic
data of a particular allele.
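By way of illustration only, both ways of recovering .beta..sub.xi from the exponential model can be sketched in Python. The sketch assumes that the bias offset has already been estimated from the early-time baseline and that the samples used lie below the saturation threshold; the function names are hypothetical.

```python
import math


def fit_amplification_curve(times, signal, alpha):
    """Estimate (beta, gamma) in g(t) = alpha + beta*exp(gamma*t) by log-linear regression.

    alpha is assumed to have been estimated already from the early-time baseline.
    """
    pts = [(t, math.log(g - alpha)) for t, g in zip(times, signal) if g > alpha]
    n = len(pts)
    t_mean = sum(t for t, _ in pts) / n
    y_mean = sum(y for _, y in pts) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in pts)
             / sum((t - t_mean) ** 2 for t, _ in pts))
    intercept = y_mean - slope * t_mean
    # The intercept is log(beta) and the slope is gamma.
    return math.exp(intercept), slope


def beta_from_threshold(threshold, t_cross, alpha, gamma):
    """Alternative estimate: beta = (g_x - alpha) * exp(-gamma * t_x)."""
    return (threshold - alpha) * math.exp(-gamma * t_cross)
```

The regression uses every sample along the curve, whereas the threshold-crossing estimate uses a single time point and so may be the noisier of the two.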
[0336] Whichever technique is used, we may model the linearized
measurement as .beta..sub.xi=.kappa..sub.x.delta..sub.xi+n.sub.xi
where .kappa..sub.x is the number of copies of allele x,
.delta..sub.xi is a constant for allele x and locus i, and
n.sub.xi.about.N(0, .sigma..sub.x.sup.2) where .sigma..sub.x.sup.2
can be measured empirically.
Method 4: Using a Probability Distribution Function for the
Amplification of Genetic Data at Each Locus
[0337] The method described here is relevant for high throughput
genotype data either generated by a PCR-based approach, for example
using an Affymetrix Genotyping Array, or using the Molecular
Inversion Probe (MIPs) technique, with the Affymetrix GenFlex Tag
Array. In the former case, the genetic material is amplified by PCR
before hybridization to probes on the genotyping array to detect
the presence of particular alleles. In the latter case, padlock
probes are hybridized to the genomic DNA and a gap-fill enzyme is
added which can add one of the four nucleotides. If the added
nucleotide (A, C, T, G) is complementary to the SNP under
measurement, then it will hybridize to the DNA, and join the ends
of the padlock probe by ligation. The closed padlock probes are
then differentiated from linear probes by exonucleolysis. The
probes that remain are then opened at a cleavage site by another
enzyme, amplified by PCR, and detected by the GenFlex Tag Array.
Whichever technique is used, the quantity of material for a
particular SNP will depend on the number of initial chromosomes in
the cell on which that SNP is present. However, due to the random
nature of the amplification and hybridization process, the quantity
of genetic material from a particular SNP will not be directly
proportional to the starting number of chromosomes. Let q.sub.s,A,
q.sub.s,G, q.sub.s,T, q.sub.s,C represent the amplified quantity of
genetic material for a particular SNP s for each of the four
nucleic acids (A, C, T, G) constituting the alleles. Note that
these quantities are typically measured from the intensity of
signals from particular hybridization probes on the array. This
intensity measurement can be used instead of a measurement of
quantity, or can be converted into a quantity estimate using
standard techniques without changing the nature of the invention.
Let q.sub.s be the sum of all the genetic material generated from
all alleles of a particular SNP:
q.sub.s=q.sub.s,A+q.sub.s,G+q.sub.s,T+q.sub.s,C. Let N be the
number of chromosomes in a cell containing the SNP s. N is
typically 2, but may be 0, 1 or 3 or more. For either
high-throughput genotyping method discussed above, and many other
methods, the resulting quantity of genetic material can be
represented as q.sub.s=(A+A.sub..theta.,s)N+.theta..sub.s where A
is the total amplification that is either estimated a-priori or
easily measured empirically, A.sub..theta.,s is the error in the
estimate of A for the SNP s, and .theta..sub.s is additive noise
introduced in the amplification, hybridization and other process
for that SNP. The noise terms A.sub..theta.,s and .theta..sub.s are
typically large enough that q.sub.s will not be a reliable
measurement of N. However, the effects of these noise terms can be
mitigated by measuring multiple SNPs on the chromosome. Let S be
the number of SNPs that are measured on a particular chromosome,
such as chromosome 21. We can then generate the average quantity of
genetic material over all SNPs on a particular chromosome
$$q=\frac{1}{S}\sum_{s=1}^{S}q_s=NA+\frac{1}{S}\sum_{s=1}^{S}\left(A_{\theta,s}N+\theta_s\right)\qquad(15)$$
[0338] Assuming that A.sub..theta.,s and .theta..sub.s are normally
distributed random variables with 0 means and variances
.sigma..sup.2.sub.A.sub..theta.,s and
.sigma..sup.2.sub..theta..sub.s, we can model q=NA+.phi. where
.phi. is a normally distributed random variable with 0 mean and
variance
$$\frac{1}{S}\left(N^{2}\sigma_{A_{\theta,s}}^{2}+\sigma_{\theta}^{2}\right).$$
Consequently, if we measure a sufficient number of SNPs on the
chromosome such that:
S>>(N.sup.2.sigma..sup.2.sub.A.sub..theta.,s+.sigma..sup.2.sub..theta.),
we can accurately estimate N=q/A.
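By way of illustration only, the averaging over S SNPs and the resulting estimate N=q/A can be sketched in Python. The simulation helper draws independent Gaussian values for A.sub..theta.,s and .theta..sub.s; the function names, the seed, and any numerical values passed in are assumptions made for this example.

```python
import random


def estimate_copy_number(q_values, amplification):
    """Estimate N as (mean of q_s over the measured SNPs) / A."""
    return sum(q_values) / len(q_values) / amplification


def variance_of_mean(n_copies, var_a_err, var_theta, n_snps):
    """Variance of the averaged quantity: (N^2 var(A_theta) + var(theta)) / S."""
    return (n_copies**2 * var_a_err + var_theta) / n_snps


def simulate_quantities(n_copies, amplification, sd_a_err, sd_theta, n_snps, seed=0):
    """Draw q_s = (A + A_theta_s) N + theta_s with independent Gaussian noise terms."""
    rng = random.Random(seed)
    return [(amplification + rng.gauss(0.0, sd_a_err)) * n_copies + rng.gauss(0.0, sd_theta)
            for _ in range(n_snps)]
```

A single q.sub.s drawn this way can be far from NA, while the average over a large number of SNPs concentrates around NA, so that rounding the estimate recovers the copy number.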
[0339] The quantity of material for a particular SNP will depend on
the number of initial segments in the cell on which that SNP is
present. However, due to the random nature of the amplification and
hybridization process, the quantity of genetic material from a
particular SNP will not be directly proportional to the starting
number of segments. Let q.sub.s,A, q.sub.s,G, q.sub.s,T, q.sub.s,C
represent the amplified quantity of genetic material for a
particular SNP s for each of the four nucleic acids (A,C,T,G)
constituting the alleles. Note that these quantities may be exactly
zero, depending on the technique used for amplification. Also note
that these quantities are typically measured from the intensity of
signals from particular hybridization probes. This intensity
measurement can be used instead of a measurement of quantity, or
can be converted into a quantity estimate using standard techniques
without changing the nature of the invention. Let q.sub.s be the
sum of all the genetic material generated from all alleles of a
particular SNP: q.sub.s=q.sub.s,A+q.sub.s,G+q.sub.s,T+q.sub.s,C.
Let N be the number of segments in a cell containing the SNP s. N
is typically 2, but may be 0, 1 or 3 or more. For any high or
medium throughput genotyping method discussed, the resulting
quantity of genetic material can be represented as
q.sub.s=(A+A.sub..theta.,s)N+.theta..sub.s where A is the total
amplification that is either estimated a-priori or easily measured
empirically, A.sub..theta.,s is the error in the estimate of A for
the SNP s, and .theta..sub.s is additive noise introduced in the
amplification, hybridization and other process for that SNP. The
noise terms A.sub..theta.,s and .theta..sub.s are typically large
enough that q.sub.s will not be a reliable measurement of N.
However, the effects of these noise terms can be mitigated by
measuring multiple SNPs on the chromosome. Let S be the number of
SNPs that are measured on a particular chromosome, such as
chromosome 21. It is possible to generate the average quantity of
genetic material over all SNPs on a particular chromosome as
follows:
$$q=\frac{1}{S}\sum_{s=1}^{S}q_s=NA+\frac{1}{S}\sum_{s=1}^{S}\left(A_{\theta,s}N+\theta_s\right)\qquad(16)$$
Assuming that A.sub..theta.,s and .theta..sub.s are normally
distributed random variables with 0 means and variances
.sigma..sup.2.sub.A.sub..theta.,s and
.sigma..sup.2.sub..theta..sub.s, one can model q=NA+.phi. where
.phi. is a normally distributed random variable with 0 mean and
variance
$$\frac{1}{S}\left(N^{2}\sigma_{A_{\theta,s}}^{2}+\sigma_{\theta}^{2}\right).$$
Consequently, if a sufficient number of SNPs are measured on the
chromosome such that
S>>(N.sup.2.sigma..sup.2.sub.A.sub..theta.,s+.sigma..sup.2.sub..theta.),
then N=q/A can be accurately estimated.
[0340] In another embodiment, assume that the amplification is
according to a model where the signal level from one SNP is
s=a+.alpha. where (a+.alpha.) has a distribution that looks like
the picture in FIG. 12A, left. The delta function at 0 models the
rates of allele dropouts of roughly 30%, the mean is a, and if
there is no allele dropout, the amplification has uniform
distribution from 0 to a.sub.0. In terms of the mean of this
distribution a.sub.0 is found to be a.sub.0=2.86a. Now model the
probability density function of .alpha. using the picture in FIG.
12B, right. Let s.sub.c be the signal arising from c loci; let n be
the number of segments; let .alpha..sub.i be a random variable
distributed according to FIGS. 12A and 12B that contributes to the
signal from locus i; and let .sigma. be the standard deviation for all
{.alpha..sub.i}. Then s.sub.c=anc+.SIGMA..sub.i=1 . . . nc .alpha..sub.i;
mean(s.sub.c)=anc; std(s.sub.c)=sqrt(nc).sigma.. If .sigma. is
computed according to the distribution in FIG. 12B, right, it is
found that .sigma..sup.2=0.907a.sup.2, so .sigma.=0.95a. We can find
the number of segments from n=s.sub.c/(ac), and for "5-sigma
statistics" we require std(n)<0.1, so
std(s.sub.c)/(ac)=0.1=>0.95a.sqrt(nc)/(ac)=0.1, so
c=0.95.sup.2n/0.1.sup.2=181 for n=2.
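By way of illustration only, the figures in this embodiment can be reproduced with a short Python calculation. The sketch derives a.sub.0 and the variance of the signal from the stated model (a point mass at 0 with the allele dropout rate, otherwise uniform on 0 to a.sub.0, with overall mean a), and then solves for the number of loci. The function names are hypothetical, and n=2 is taken here as an assumed example value for the number of segments.

```python
import math


def dropout_uniform_model(mean_a, dropout_rate=0.3):
    """Return (a0, variance) for a signal that is 0 with probability dropout_rate
    and otherwise uniform on [0, a0], with overall mean mean_a."""
    keep = 1.0 - dropout_rate
    a0 = 2.0 * mean_a / keep                # from mean = keep * a0 / 2
    second_moment = keep * a0**2 / 3.0      # E[s^2] contributed by the uniform part
    return a0, second_moment - mean_a**2


def loci_required(n_segments, mean_a, variance, max_std=0.1):
    """Smallest c with std(n) = sqrt(n * c * variance) / (a * c) <= max_std."""
    return math.ceil(n_segments * variance / (max_std * mean_a) ** 2)
```

For a=1 and a 30% dropout rate this gives a.sub.0 of about 2.86, a standard deviation of about 0.95, and a requirement of 181 loci when n=2, in agreement with the values quoted above to within rounding.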
[0341] Another model to estimate the confidence in the call, and
how many loci or SNPs must be measured to ensure a given degree of
confidence, incorporates the random variable as a multiplier of
amplification instead of as an additive noise source, namely
s=a(1+.alpha.). Taking logs, log(s)=log(a)+log(1+.alpha.). Now,
create a new random variable .gamma.=log(1+.alpha.) and this
variable may be assumed to be normally distributed
.about.N(0,.sigma.). In this model, amplification can range from
very small to very large, depending on .sigma., but never negative.
Therefore .alpha.=e.sup..gamma.-1; and s.sub.c=.SIGMA..sub.i=1 . . . cn a(1+.alpha..sub.i).
For notation, mean(s.sub.c) and expectation value E(s.sub.c) are used
interchangeably:
$$E(s_c)=acn+a\,E\Bigl(\sum_{i=1}^{cn}\alpha_i\Bigr)=acn\,(1+E(\alpha))$$
[0342] To find E(.alpha.) the probability density function (pdf)
must be found for .alpha. which is possible since .alpha. is a
function of .gamma. which has a known Gaussian pdf.
p.sub..alpha.(.alpha.)=p.sub..gamma.(.gamma.)(d.gamma./d.alpha.).
So:
$$p_\gamma(\gamma)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{\gamma^{2}}{2\sigma^{2}}}\quad\text{and}\quad\frac{d\gamma}{d\alpha}=\frac{d}{d\alpha}\bigl(\log(1+\alpha)\bigr)=\frac{1}{1+\alpha}=e^{-\gamma}$$
and:
$$p_\alpha(\alpha)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{\gamma^{2}}{2\sigma^{2}}}\,e^{-\gamma}=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(\log(1+\alpha))^{2}}{2\sigma^{2}}}\,\frac{1}{1+\alpha}$$
[0343] This has the form shown in FIG. 13 for .sigma.=1. Now,
E(.alpha.) can be found by integrating over this pdf,
E(.alpha.)=.intg..sub.-.infin..sup..infin..alpha.p.sub..alpha.(.alpha.)d.alpha.,
which can be done numerically for multiple different
.sigma.. This gives E(s.sub.c) or mean(s.sub.c) as a function of
.sigma.. Now, this pdf can also be used to find var(s.sub.c):
$$\begin{aligned}
\mathrm{var}(s_c)&=E\bigl(s_c-E(s_c)\bigr)^{2}=E\Bigl(\sum_{i=1}^{cn}a(1+\alpha_i)-acn-aE\Bigl(\sum_{i=1}^{cn}\alpha_i\Bigr)\Bigr)^{2}\\
&=E\Bigl(\sum_{i=1}^{cn}a\alpha_i-aE\Bigl(\sum_{i=1}^{cn}\alpha_i\Bigr)\Bigr)^{2}=a^{2}E\Bigl(\sum_{i=1}^{cn}\alpha_i-cnE(\alpha)\Bigr)^{2}\\
&=a^{2}E\Bigl(\Bigl(\sum_{i=1}^{cn}\alpha_i\Bigr)^{2}-2cnE(\alpha)\Bigl(\sum_{i=1}^{cn}\alpha_i\Bigr)+c^{2}n^{2}E(\alpha)^{2}\Bigr)\\
&=a^{2}E\Bigl(cn\,\alpha^{2}+cn(cn-1)\,\alpha_i\alpha_j-2cnE(\alpha)\Bigl(\sum_{i=1}^{cn}\alpha_i\Bigr)+c^{2}n^{2}E(\alpha)^{2}\Bigr)\\
&=a^{2}cn\bigl(E(\alpha^{2})+(cn-1)E(\alpha_i\alpha_j)-2cnE(\alpha)^{2}+cnE(\alpha)^{2}\bigr)\\
&=a^{2}cn\bigl(E(\alpha^{2})+(cn-1)E(\alpha_i\alpha_j)-cnE(\alpha)^{2}\bigr)
\end{aligned}$$
which can also be solved numerically using p.sub..alpha.(.alpha.) for multiple
different .sigma. to get var(s.sub.c) as a function of .sigma.. Then, we
may take a series of measurements from a sample with a known number
of loci c and a known number of segments n and find
std(s.sub.c)/E(s.sub.c) from this data. That will enable us to
compute a value for .sigma.. In order to estimate n,
E(s.sub.c)=nac(1+E(.alpha.)) so
$$\hat{n}=\frac{s_c}{ac\,(1+E(\alpha))}$$
can be measured so that
$$\mathrm{std}(\hat{n})=\frac{\mathrm{std}(s_c)}{ac\,(1+E(\alpha))}$$
[0344] When summing a sufficiently large number of independent
random variables of 0-mean, the distribution approaches a Gaussian
form, and thus s.sub.c (and {circumflex over (n)}) can be treated
as normally distributed and as before we may use 5-sigma
statistics:
$$\mathrm{std}(\hat{n})=\frac{\mathrm{std}(s_c)}{ac\,(1+E(\alpha))}<0.1$$
in order to have an error probability of 2normcdf(-5,0,1)=5.7e-7.
From this, one can solve for the number of loci c.
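By way of illustration only, the numerical steps of the multiplicative model can be sketched in Python. The sketch integrates over .gamma. rather than .alpha. (an equivalent change of variables, since .alpha.=e.sup..gamma.-1) to obtain E(.alpha.) and E(.alpha..sup.2), and then solves std({circumflex over (n)})<0.1 for c. It assumes that the .alpha..sub.i are independent, so that var(s.sub.c)=a.sup.2cn var(.alpha.); that assumption, the function names, and the integration settings are choices made for this example.

```python
import math


def alpha_moments(sigma, steps=20001, span=12.0):
    """Numerically integrate E(alpha) and E(alpha^2) for alpha = exp(gamma) - 1,
    gamma ~ N(0, sigma), using the trapezoid rule over gamma."""
    lo, hi = -span * sigma, span * sigma
    h = (hi - lo) / (steps - 1)
    m1 = m2 = 0.0
    for k in range(steps):
        g = lo + k * h
        w = h * (0.5 if k in (0, steps - 1) else 1.0)  # trapezoid end-point weights
        pdf = math.exp(-g * g / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)
        alpha = math.exp(g) - 1.0
        m1 += w * alpha * pdf
        m2 += w * alpha**2 * pdf
    return m1, m2


def loci_required(n_segments, sigma, max_std=0.1):
    """Smallest c with std(n_hat) <= max_std, assuming independent alpha_i so that
    var(s_c) = a^2 * c * n * var(alpha)."""
    m1, m2 = alpha_moments(sigma)
    var_alpha = m2 - m1**2
    return math.ceil(n_segments * var_alpha / (max_std * (1.0 + m1)) ** 2)


def two_sided_tail(k_sigma):
    """P(|Z| > k_sigma) for a standard normal Z."""
    return math.erfc(k_sigma / math.sqrt(2.0))
```

Because 1+.alpha. is log-normally distributed in this model, the numerical value of E(.alpha.) can be checked against the closed form e.sup..sigma..sup.2/2-1. The function `two_sided_tail(5.0)` gives the two-sided tail probability of a 5-sigma excursion.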
Sexing
[0345] In one embodiment of the system, the genetic data can be
used to determine the sex of the target individual. After the
method disclosed herein is used to determine which segments of
which chromosomes from the parents have contributed to the genetic
material of the target, the sex of the target can be determined by
checking to see which of the sex chromosomes have been inherited
from the father: X indicates a female, and Y indicates a male. It
should be obvious to one skilled in the art how to use this method
to determine the sex of the target.
Validation of the Hypotheses
[0346] In some embodiments of the system, one drawback is that in
order to make a prediction of the correct genetic state with the
highest possible confidence, it is necessary to make hypotheses
about every possible state. However, as the number of possible
genetic states is exceptionally large, and computational time is
limited, it may not be reasonable to test every hypothesis. In
these cases, an alternative approach is to use the concept of
hypothesis validation. This involves estimating limits on certain
values, sets of values, properties or patterns that one might
expect to observe in the measured data if a certain hypothesis, or
class of hypotheses, is true. Then, the measured values can be tested
to see if they fall within those expected limits, and/or certain
expected properties or patterns can be tested for, and if the
expectations are not met, then the algorithm can flag those
measurements for further investigation.
[0347] For example, in a case where the end of one arm of a
chromosome is broken off in the target DNA, the most likely
hypothesis may be calculated to be "normal" (as opposed, for
example to "aneuploid"). This is because the particular hypothesis
that corresponds to the true state of the genetic material, namely
that one end of the chromosome has broken off, has not been tested,
since the likelihood of that state is very low. If the concept of
validation is used, then the algorithm will note that a high number
of values, those that correspond to the alleles that lie on the
broken off section of the chromosome, lie outside the expected
limits of the measurements. A flag will be raised, inviting further
investigation for this case, increasing the likelihood that the
true state of the genetic material is uncovered.
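By way of illustration only, the validation step might be sketched as follows in Python. The sketch counts the measurements that fall outside the limits expected under the called hypothesis and raises a flag when the fraction of such measurements exceeds a tolerance. The function name, the data layout, and the 5% default tolerance are hypothetical choices made for this example.

```python
def validate_hypothesis(measurements, expected_limits, max_outlier_fraction=0.05):
    """Flag a called hypothesis when too many measurements fall outside the
    limits expected under it.

    measurements: dict locus -> measured value.
    expected_limits: dict locus -> (low, high) expected under the called hypothesis.
    max_outlier_fraction: hypothetical tolerance for chance excursions.
    """
    outliers = [locus for locus, value in measurements.items()
                if not expected_limits[locus][0] <= value <= expected_limits[locus][1]]
    fraction = len(outliers) / len(measurements)
    return {"flag": fraction > max_outlier_fraction,
            "outlier_fraction": fraction,
            "outlier_loci": outliers}
```

In the broken chromosome arm example, the loci returned in `outlier_loci` would cluster at one end of the chromosome, which is the kind of pattern that invites further investigation.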
[0348] It should be obvious to one skilled in the art how to modify
the disclosed method to include the validation technique. Note that
one anomaly that is expected to be very difficult to detect using
the disclosed method is balanced translocations.
M Notes
[0349] As noted previously, given the benefit of this disclosure,
there are more embodiments that may implement one or more of the
systems, methods, and features, disclosed herein.
[0350] In all cases concerning the determination of the probability
of a particular qualitative measurement on a target individual
based on parent data, it should be obvious to one skilled in the
art, after reading this disclosure, how to apply a similar method
to determine the probability of a quantitative measurement of the
target individual rather than qualitative. Wherever genetic data of
the target or related individuals is treated qualitatively, it will
be clear to one skilled in the art, after reading this disclosure,
how to apply the techniques disclosed to quantitative data.
[0351] It should be obvious to one skilled in the art that a
plurality of parameters may be changed without changing the essence
of the invention. For example, the genetic data may be obtained
using any high throughput genotyping platform, or it may be
obtained from any genotyping method, or it may be simulated,
inferred or otherwise known. A variety of computational languages
could be used to encode the algorithms described in this
disclosure, and a variety of computational platforms could be used
to execute the calculations. For example, the calculations could be
executed using personal computers, supercomputers, a massively
parallel computing platform, or even non-silicon based
computational platforms such as a sufficiently large number of
people armed with abacuses.
[0352] Some of the math in this disclosure makes hypotheses
concerning a limited number of states of aneuploidy. In some cases,
for example, only monosomy, disomy and trisomy are explicitly
treated by the math. It should be obvious to one skilled in the art
how these mathematical derivations can be expanded to take into
account other forms of aneuploidy, such as nullisomy (no chromosomes
present), quadrosomy, etc., without changing the fundamental
concepts of the invention.
[0353] When this disclosure discusses a chromosome, this may refer
to a segment of a chromosome, and when a segment of a chromosome is
discussed, this may refer to a full chromosome. It is important to
note that the math to handle a segment of a chromosome is the same
as that needed to handle a full chromosome. It should be obvious to
one skilled in the art how to modify the method accordingly.
[0354] It should be obvious to one skilled in the art that a
related individual may refer to any individual who is genetically
related, and thus shares haplotype blocks with the target
individual. Some examples of related individuals include:
biological father, biological mother, son, daughter, brother,
sister, half-brother, half-sister, grandfather, grandmother, uncle,
aunt, nephew, niece, grandson, granddaughter, cousin, clone, the
target individual himself/herself/itself, and other individuals
with known genetic relationship to the target. The term `related
individual` also encompasses any embryo, fetus, sperm, egg,
blastomere, blastocyst, or polar body derived from a related
individual.
[0355] It is important to note that the target individual may refer
to an adult, a juvenile, a fetus, an embryo, a blastocyst, a
blastomere, a cell or set of cells from an individual, or from a
cell line, or any set of genetic material. The target individual
may be alive, dead, frozen, or in stasis.
[0356] It is also important to note that where the target
individual refers to a blastomere that is used to diagnose an
embryo, there may be cases caused by mosaicism where the genome of
the blastomere analyzed does not correspond exactly to the genomes
of all other cells in the embryo.
[0357] It is important to note that it is possible to use the
method disclosed herein in the context of cancer genotyping and/or
karyotyping, where one or more cancer cells is considered the
target individual, and the non-cancerous tissue of the individual
afflicted with cancer is considered to be the related individual.
The non-cancerous tissue of the individual afflicted with cancer
could provide the set of genotype calls of the related
individual that would allow chromosome copy number determination of
the cancerous cell or cells using the methods disclosed herein.
[0358] It is important to note that the method described herein
concerns the cleaning of genetic data, and as all living or once
living creatures contain genetic data, the methods are equally
applicable to any live or dead human, animal, or plant that
inherits or inherited chromosomes from other individuals.
[0359] It is important to note that in many cases, the algorithms
described herein make use of prior probabilities, and/or initial
values. In some cases the choice of these prior probabilities may
have an impact on the efficiency and/or effectiveness of the
algorithm. There are many ways that one skilled in the art, after
reading this disclosure, could assign or estimate appropriate prior
probabilities without changing the essential concept of the
patent.
[0360] It is also important to note that the embryonic genetic data
that can be generated by measuring the amplified DNA from one
blastomere can be used for multiple purposes. For example, it can
be used for detecting aneuploidy, uniparental disomy, sexing the
individual, as well as for making a plurality of phenotypic
predictions based on phenotype-associated alleles. Currently, in
IVF laboratories, due to the techniques used, it is often the case
that one blastomere can only provide enough genetic material to
test for one disorder, such as aneuploidy, or a particular
monogenic disease. Since the method disclosed herein has the common
first step of measuring a large set of SNPs from a blastomere,
regardless of the type of prediction to be made, a physician,
parent, or other agent is not forced to choose a limited number of
disorders for which to screen. Instead, the option exists to screen
for as many genes and/or phenotypes as the state of medical
knowledge will allow. With the disclosed method, one advantage to
identifying particular conditions to screen for prior to genotyping
the blastomere is that if it is decided that certain loci are
especially relevant, then a more appropriate set of SNPs which are
more likely to cosegregate with the locus of interest, can be
selected, thus increasing the confidence of the allele calls of
interest.
[0361] It is also important to note that it is possible to perform
haplotype phasing by molecular haplotyping methods. Because
separation of the genetic material into haplotypes is challenging,
most genotyping methods are only capable of measuring both
haplotypes simultaneously, yielding diploid data. As a result, the
sequence of each haploid genome cannot be deciphered. In the
context of using the disclosed method to determine allele calls
and/or chromosome copy number on a target genome, it is often
helpful to know the maternal haplotype; however, it is not always
simple to measure the maternal haplotype. One way to solve this
problem is to measure haplotypes by sequencing single DNA molecules
or clonal populations of DNA molecules. The basis for this method
is to use any sequencing method to directly determine haplotype
phase by direct sequencing of a single DNA molecule or clonal
population of DNA molecules. This may include, but not be limited
to: cloning amplified DNA fragments from a genome into
recombinant DNA constructs and sequencing by traditional dye-end
terminator methods, isolation and sequencing of single molecules in
colonies, and direct single DNA molecule or clonal DNA population
sequencing using next-generation sequencing methods.
[0362] The systems, methods, and techniques of the present
invention may be used in conjunction with embryo screening or
prenatal testing procedures. The systems, methods, and techniques
of the present invention may be employed in methods of increasing
the probability that the embryos and fetuses obtained by in vitro
fertilization are successfully implanted and carried through the
full gestation period. Further, the systems, methods, and
techniques of the present invention may be employed in methods of
decreasing the probability that the embryos and fetuses obtained by
in vitro fertilization that are implanted and gestated are
specifically at risk for a congenital disorder.
[0363] Thus, according to some embodiments, the present invention
extends to the use of the systems, methods, and techniques of the
invention in conjunction with pre-implantation diagnosis
procedures.
[0364] According to some embodiments, the present invention extends
to the use of the systems, methods, and techniques of the invention
in conjunction with prenatal testing procedures.
[0365] According to some embodiments, the systems, methods, and
techniques of the invention are used in methods to decrease the
probability for the implantation of an embryo specifically at risk
for a congenital disorder by testing at least one cell removed from
early embryos conceived by in vitro fertilization and transferring
to the mother's uterus only those embryos determined not to have
inherited the congenital disorder.
[0366] According to some embodiments, the systems, methods, and
techniques of the invention are used in methods to decrease the
probability for the implantation of an embryo specifically at risk
for a chromosome abnormality by testing at least one cell removed
from early embryos conceived by in vitro fertilization and
transferring to the mother's uterus only those embryos determined
not to have chromosome abnormalities.
[0367] According to some embodiments, the systems, methods, and
techniques of the invention are used in methods to increase the
probability of implanting an embryo obtained by in vitro
fertilization that is at a reduced risk of carrying a congenital
disorder.
[0368] According to some embodiments, the systems, methods, and
techniques of the invention are used in methods to increase the
probability of gestating a fetus.
[0369] According to preferred embodiments, the congenital disorder
is a malformation, neural tube defect, chromosome abnormality,
Down's syndrome (or trisomy 21), Trisomy 18, spina bifida, cleft
palate, Tay-Sachs disease, sickle cell anemia, thalassemia, cystic
fibrosis, Huntington's disease, and/or fragile X syndrome.
Chromosome abnormalities include, but are not limited to, Down
syndrome (extra chromosome 21), Turner Syndrome (45X0) and
Klinefelter's syndrome (a male with 2 X chromosomes).
[0370] According to preferred embodiments, the malformation is a
limb malformation. Limb malformations include, but are not limited
to, amelia, ectrodactyly, phocomelia, polymelia, polydactyly,
syndactyly, polysyndactyly, oligodactyly, brachydactyly,
achondroplasia, congenital aplasia or hypoplasia, amniotic band
syndrome, and cleidocranial dysostosis.
[0371] According to preferred embodiments, the malformation is a
congenital malformation of the heart. Congenital malformations of
the heart include, but are not limited to, patent ductus
arteriosus, atrial septal defect, ventricular septal defect, and
tetralogy of Fallot.
[0372] According to preferred embodiments, the malformation is a
congenital malformation of the nervous system. Congenital
malformations of the nervous system include, but are not limited
to, neural tube defects (e.g., spina bifida, meningocele,
meningomyelocele, encephalocele and anencephaly), Arnold-Chiari
malformation, the Dandy-Walker malformation, hydrocephalus,
microencephaly, megencephaly, lissencephaly, polymicrogyria,
holoprosencephaly, and agenesis of the corpus callosum.
[0373] According to preferred embodiments, the malformation is a
congenital malformation of the gastrointestinal system. Congenital
malformations of the gastrointestinal system include, but are not
limited to, stenosis, atresia, and imperforate anus.
[0374] According to some embodiments, the systems, methods, and
techniques of the invention are used in methods to increase the
probability of implanting an embryo obtained by in vitro
fertilization that is at a reduced risk of carrying a
predisposition for a genetic disease.
[0375] According to preferred embodiments, the genetic disease is
either monogenic or multigenic. Genetic diseases include, but are
not limited to, Bloom Syndrome, Canavan Disease, Cystic fibrosis,
Familial Dysautonomia, Riley-Day syndrome, Fanconi Anemia (Group
C), Gaucher Disease, Glycogen storage disease 1a, Maple syrup urine
disease, Mucolipidosis IV, Niemann-Pick Disease, Tay-Sachs disease,
Beta thalassemia, Sickle cell anemia, Alpha thalassemia, Factor XI
Deficiency, Friedreich's Ataxia, MCAD,
Parkinson disease-juvenile, Connexin26, SMA, Rett syndrome,
Phenylketonuria, Becker Muscular Dystrophy, Duchenne Muscular
Dystrophy, Fragile X syndrome, Hemophilia A, Alzheimer
dementia-early onset, Breast/Ovarian cancer, Colon cancer,
Diabetes/MODY, Huntington disease, Myotonic Muscular Dystrophy,
Parkinson Disease-early onset, Peutz-Jeghers syndrome, Polycystic
Kidney Disease, and Torsion Dystonia.
Combinations of the Aspects of the Invention
[0376] As noted previously, given the benefit of this disclosure,
there are more aspects and embodiments that may implement one or
more of the systems, methods, and features disclosed herein. Below
is a short list of examples illustrating situations in which the
various aspects of the disclosed invention can be combined in a
plurality of ways. It is important to note that this list is not
meant to be comprehensive; many other combinations of the aspects,
methods, features and embodiments of this invention are
possible.
[0377] In one embodiment of the invention, it is possible to
combine several of the aspects of the invention such that one could
perform both allele calling and aneuploidy calling in one
step, and use quantitative values instead of qualitative values for
both parts. It should be obvious to one skilled in the art how to
combine the relevant mathematics without changing the essence of
the invention.
[0378] In a preferred embodiment of the invention, the disclosed
method is employed to determine the genetic state of one or more
embryos for the purpose of embryo selection in the context of IVF.
This may include the harvesting of eggs from the prospective mother
and fertilizing those eggs with sperm from the prospective father
to create one or more embryos. It may involve performing embryo
biopsy to isolate a blastomere from each of the embryos. It may
involve amplifying and genotyping the genetic data from each of the
blastomeres. It may include obtaining, amplifying and genotyping a
sample of diploid genetic material from each of the parents, as
well as one or more individual sperm from the father. It may
involve incorporating the measured diploid and haploid data of both
the mother and the father, along with the measured genetic data of
the embryo of interest into a dataset. It may involve using one or
more of the statistical methods disclosed in this patent to
determine the most likely state of the genetic material in the
embryo given the measured or determined genetic data. It may
involve the determination of the ploidy state of the embryo of
interest. It may involve the determination of the presence of a
plurality of known disease-linked alleles in the genome of the
embryo. It may involve making phenotypic predictions about the
embryo. It may involve generating a report that is sent to the
physician of the couple so that they may make an informed decision
about which embryo(s) to transfer to the prospective mother.
[0379] Another example could be a situation where a 44-year-old
woman undergoing IVF is having trouble conceiving. The couple
arranges to have her eggs harvested and fertilized with sperm from
the man, producing nine viable embryos. A blastomere is harvested
from each embryo, and the genetic data from the blastomeres are
measured using an ILLUMINA INFINIUM BEAD ARRAY. Meanwhile, the
diploid data are measured from tissue taken from both parents also
using the ILLUMINA INFINIUM BEAD ARRAY. Haploid data from the
father's sperm is measured using the same method. The method
disclosed herein is applied to the genetic data of the blastomere
and the diploid maternal genetic data to phase the maternal genetic
data to provide the maternal haplotype. Those data are then
incorporated, along with the father's diploid and haploid data, to
allow a highly accurate determination of the copy number count for
each of the chromosomes in each of the embryos. Eight of the nine
embryos are found to be aneuploid, and one embryo is found to
be euploid. A report is generated that discloses these diagnoses,
and is sent to the doctor. The report has data similar to the data
found in Tables 9, 10 and 11. The doctor, along with the
prospective parents, decides to transfer the euploid embryo which
implants in the mother's uterus.
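The selection step in the example above can be sketched as a filter over per-chromosome calls of the kind reported in Tables 9 to 11, where each chromosome has a called copy number and a confidence. This is a minimal sketch under assumptions: the function names, the data layout, and the 0.99 confidence cutoff are hypothetical and are not specified by this disclosure, and only the autosomes are checked because the expected copy number of the sex chromosomes depends on the sex of the embryo.

```python
AUTOSOMES = [str(i) for i in range(1, 23)]


def is_euploid(calls, min_confidence=0.99):
    """Return True if every autosome is called disomic with enough confidence.

    calls: dict mapping chromosome label -> (copy_number, confidence).
    min_confidence: hypothetical cutoff, not a value taken from the source.
    """
    for chrom in AUTOSOMES:
        copy_number, confidence = calls[chrom]
        if copy_number != 2 or confidence < min_confidence:
            return False
    return True


def euploid_embryos(embryo_calls, min_confidence=0.99):
    """Return the names of embryos whose autosomes all pass is_euploid."""
    return [name for name, calls in embryo_calls.items()
            if is_euploid(calls, min_confidence)]


# Illustrative input only: two made-up embryos, not data from the tables.
normal = {c: (2, 1.0) for c in AUTOSOMES}
trisomic = {**normal, "21": (3, 1.0)}
print(euploid_embryos({"embryo_1": normal, "embryo_2": trisomic}))
```

A low-confidence disomy call is treated here the same way as an aneuploid call; whether such an embryo should instead be re-tested is a clinical decision outside the scope of this sketch.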
[0380] Another example may involve a woman who has been
artificially inseminated with donor sperm and is pregnant. She
wants to minimize the risk that the fetus she is carrying has a
genetic disease. She undergoes amniocentesis and fetal cells are
isolated from the withdrawn sample, and a tissue sample is also
collected from the mother. Since there are no other embryos, her
data are phased using molecular haplotyping methods. The genetic
material from the fetus and from the mother are amplified as
appropriate and genotyped using the ILLUMINA INFINIUM BEAD ARRAY,
and the methods described herein reconstruct the embryonic genotype
as accurately as possible. Phenotypic susceptibilities are
predicted from the reconstructed fetal genetic data and a report is
generated and sent to the mother's physician so that they can
decide what actions may be best.
[0381] Another example could be a situation where a racehorse
breeder wants to increase the likelihood that the foals sired by
his champion racehorse become champions themselves. He arranges for
the desired mare to be impregnated by IVF, and uses genetic data
from the stallion and the mare to clean the genetic data measured
from the viable embryos. The cleaned embryonic genetic data allow
the breeder to select the embryos for implantation that are most
likely to produce a desirable racehorse.
Tables 1-11
[0382] Table 1. Probability distribution of measured allele calls
given the true genotype.
[0383] Table 2. Probabilities of specific allele calls in the
embryo using the U and H notation.
[0384] Table 3. Conditional probabilities of specific allele calls
in the embryo given all possible parental states.
[0385] Table 4. Constraint Matrix (A).
[0386] Table 5. Notation for the counts of observations of all
specific embryonic allelic states given all possible parental
states.
[0387] Table 6. Aneuploidy states (h) and corresponding
P(h|n.sub.j), the conditional probabilities given the copy
numbers.
[0388] Table 7. Probability of aneuploidy hypothesis (H)
conditional on parent genotype.
[0389] Table 8. Results of PS algorithm applied to 69 SNPs on
chromosome 7.
[0390] Table 9. Aneuploidy calls on eight known euploid cells.
[0391] Table 10. Aneuploidy calls on ten known trisomic cells.
[0392] Table 11. Aneuploidy calls for six blastomeres.
TABLE 1 Probability distribution of measured allele calls given the
true genotype. p(dropout) = 0.5, p(gain) = 0.02

         measured
true     AA      AB      BB      XX
AA       0.735   0.015   0.005   0.245
AB       0.250   0.250   0.250   0.250
BB       0.005   0.015   0.735   0.245
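The values in Table 1 can be reproduced from the two stated parameters with a simple per-allele model. The sketch below is an assumption made for illustration, since this section does not spell out the model: each true allele copy drops out independently with probability p(dropout); for a homozygous true genotype the absent allele is spuriously gained with probability p(gain); and gain is ignored for the heterozygous case, which is what the uniform 0.250 row implies. The function name is hypothetical.

```python
def call_distribution(true_genotype, p_dropout=0.5, p_gain=0.02):
    """Distribution over measured calls (AA, AB, BB, XX) for one true genotype.

    Assumed model (not stated in the source): independent dropout of each
    allele copy, plus gain of the absent allele for homozygous genotypes only.
    """
    d, g = p_dropout, p_gain
    if true_genotype in ("AA", "BB"):
        present = true_genotype[0]
        absent = "B" if present == "A" else "A"
        p_all_lost = d * d            # both copies of the true allele drop out
        p_seen = 1.0 - p_all_lost     # at least one copy is measured
        return {
            present * 2: p_seen * (1 - g),
            "AB": p_seen * g,
            absent * 2: p_all_lost * g,
            "XX": p_all_lost * (1 - g),
        }
    # Heterozygous AB: each allele survives independently with probability 1 - d.
    return {
        "AA": (1 - d) * d,
        "AB": (1 - d) * (1 - d),
        "BB": d * (1 - d),
        "XX": d * d,
    }


for truth in ("AA", "AB", "BB"):
    dist = call_distribution(truth)
    print(truth, [round(dist[m], 3) for m in ("AA", "AB", "BB", "XX")])
```

With p(dropout) = 0.5 and p(gain) = 0.02 the three rows round to the tabulated values; other parameter choices give the corresponding table for a different measurement platform under the same assumed model.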
TABLE 2 Probabilities of specific allele calls in the embryo using
the U and H notation.

Embryo         Embryo readouts
truth state    U          Ū          H          empty
U              p.sub.11   p.sub.12   p.sub.13   p.sub.14
H              p.sub.21   p.sub.22   p.sub.23   p.sub.24
TABLE 3 Conditional probabilities of specific allele calls in the
embryo given all possible parental states.

Parental       Expected truth         Embryo readouts types and conditional probabilities
matings        state in the embryo    U          Ū          H          empty
U .times. U    U                      p.sub.11   p.sub.12   p.sub.13   p.sub.14
U .times. Ū    H                      p.sub.21   p.sub.22   p.sub.23   p.sub.24
U .times. H    50% U, 50% H           p.sub.31   p.sub.32   p.sub.33   p.sub.34
H .times. H    25% U, 25% Ū, 50% H    p.sub.41   p.sub.42   p.sub.43   p.sub.44
TABLE 4 Constraint Matrix (A).
1 1 1 1 1 1 1 1 1 -1
-.5 -.5 1 -.5 -.5 1 -.5 -.5 1 -.5 -.5 1 -.25 -.25 -.5 1 -.5 -.5 1
-.25 -.25 -.5 1 -.5 -.5 1
TABLE 5 Notation for the counts of observations of all specific
embryonic allelic states given all possible parental states.

Parental       Expected embryo        Embryo readouts types and observed counts
matings        truth state            U          Ū          H          Empty
U .times. U    U                      n.sub.11   n.sub.12   n.sub.13   n.sub.14
U .times. Ū    H                      n.sub.21   n.sub.22   n.sub.23   n.sub.24
U .times. H    50% U, 50% H           n.sub.31   n.sub.32   n.sub.33   n.sub.34
H .times. H    25% U, 25% Ū, 50% H    n.sub.41   n.sub.42   n.sub.43   n.sub.44
TABLE 6 Aneuploidy states (h) and corresponding P(h|n.sub.j), the
conditional probabilities given the copy numbers.

N   H                     P(h|n)      In General
1   paternal monosomy     0.5         Ppm
1   maternal monosomy     0.5         Pmm
2   Disomy                1           1
3   paternal trisomy t1   0.5 * pt1   ppt * pt1
3   paternal trisomy t2   0.5 * pt2   ppt * pt2
3   maternal trisomy t1   0.5 * pm1   pmt * mt1
3   maternal trisomy t2   0.5 * pm2   pmt * mt2
TABLE 7 Probability of aneuploidy hypothesis (H) conditional on
parent genotype.

embryo allele counts            (mother, father) genotype
copy #  nA  nC  hypothesis H    AA,AA  AA,AC  AA,CC  AC,AA  AC,AC  AC,CC  CC,AA  CC,AC  CC,CC
1       1   0   father only     1      1      1      0.5    0.5    0.5    0      0      0
1       1   0   mother only     1      0.5    0      1      0.5    0      1      0.5    0
1       0   1   father only     0      0      0      0      0.5    0.5    1      1      1
1       0   1   mother only     0      0.5    1      0.5    0.5    1      0      0.5    1
2       2   0   disomy          1      0.5    0      0.5    0.25   0      0      0      0
2       1   1   disomy          0      0.5    1      0.5    0.5    0.5    1      0.5    0
2       0   2   disomy          0      0      0      0      0.25   0.5    0      0.5    1
3       3   0   father t1       1      0.5    0      0      0      0      0      0      0
3       3   0   father t2       1      0.5    0      0.5    0.25   0      0      0      0
3       3   0   mother t1       1      0      0      0.5    0      0      0      0      0
3       3   0   mother t2       1      0.5    0      0.5    0.25   0      0      0      0
3       2   1   father t1       0      0.5    1      1      0.5    0      0      0      0
3       2   1   father t2       0      0.5    1      0      0.25   0.5    0      0      0
3       2   1   mother t1       0      1      0      0.5    0.5    0      1      0      0
3       2   1   mother t2       0      0      0      0.5    0.25   0      1      0.5    0
3       1   2   father t1       0      0      0      0      0.5    1      1      0.5    0
3       1   2   father t2       0      0      0      0.5    0.25   0      1      0.5    0
3       1   2   mother t1       0      0      1      0      0.5    0.5    0      1      0
3       1   2   mother t2       0      0.5    1      0      0.25   0.5    0      0      0
3       0   3   father t1       0      0      0      0      0      0      0      0.5    1
3       0   3   father t2       0      0      0      0      0.25   0.5    0      0.5    1
3       0   3   mother t1       0      0      0      0      0      0.5    0      0      1
3       0   3   mother t2       0      0      0      0      0.25   0.5    0      0.5    1
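The three disomy rows of Table 7 follow from ordinary Mendelian transmission and can be checked with a short enumeration. The sketch below covers only the disomy hypothesis, under the assumption that each parent transmits one of its two alleles with equal probability; the function name is hypothetical, and the monosomy and trisomy rows are not modeled here.

```python
from fractions import Fraction
from itertools import product


def disomy_count_probs(mother, father):
    """P(nA, nC) in a disomic embryo, given parent genotypes such as "AC".

    Assumes each parent passes on one of its two alleles with probability 1/2.
    Returns a dict mapping (nA, nC) -> exact probability.
    """
    probs = {}
    for m, f in product(mother, father):     # four equally likely transmissions
        embryo = m + f
        key = (embryo.count("A"), embryo.count("C"))
        probs[key] = probs.get(key, Fraction(0)) + Fraction(1, 4)
    return probs


for mother, father in product(("AA", "AC", "CC"), repeat=2):
    p = disomy_count_probs(mother, father)
    row = [float(p.get(k, 0)) for k in ((2, 0), (1, 1), (0, 2))]
    print(mother, father, row)
```

Each printed triple gives the entries for the (2, 0), (1, 1) and (0, 2) disomy rows in the column for that parental genotype pair.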
TABLE 9 Aneuploidy calls on eight known euploid cells

Chr #  Cell 1     Cell 2     Cell 3     Cell 4     Cell 5     Cell 6     Cell 7     Cell 8
1      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
2      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
3      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
4      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
5      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
6      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
7      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
8      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
9      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
10     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
11     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
12     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
13     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
14     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
15     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
16     2 1.00000  2 0.99997  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
17     2 1.00000  2 0.99995  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
18     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
19     2 1.00000  2 0.99998  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
20     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
21     2 0.99993  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
22     2 1.00000  2 1.00000  2 0.99040  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 0.99992
X      2 0.99999  2 0.99994  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
TABLE 10 Aneuploidy calls on ten known trisomic cells

Chr #  Cell 1      Cell 2      Cell 3      Cell 4      Cell 5      Cell 6      Cell 7      Cell 8      Cell 9      Cell 10
1       2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
2       2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
3       2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
4       2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
5       2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
6       2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
7       2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 0.92872   2 1.00000
8       2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
9       2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
10      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
11      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
12      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
13      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
14      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
15      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 0.99998   2 1.00000
16      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 0.99999   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
17      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 0.96781   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
18      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
19      2 1.00000   2 1.00000   2 0.99999   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
20      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 0.99997   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
21     -- 1.00000  -- 1.00000  -- 1.00000  -- 1.00000  -- 1.00000  -- 1.00000  -- 1.00000  -- 1.00000  -- 1.00000  -- 1.00000
22      2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000   2 1.00000
23      1 1.00000   1 1.00000   1 1.00000   1 1.00000   1 1.00000   1 1.00000   1 1.00000   1 1.00000   1 1.00000   1 1.00000
TABLE 11 Aneuploidy calls for six blastomeres

Chr #  e1b1       e1b3       e1b6       e2b1       e2b2       e3b2
1      2 1.00000  2 1.00000  1 1.00000  1 1.00000  1 1.00000  3 1.00000
2      2 1.00000  2 1.00000  3 1.00000  1 1.00000  1 1.00000  2 0.99994
3      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 1.00000
4      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 1.00000
5      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 0.99964
6      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 1.00000
7      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  2 0.99866
8      2 1.00000  2 1.00000  3 0.99966  1 1.00000  1 1.00000  3 1.00000
9      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 0.99999
10     2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  1 1.00000
11     2 1.00000  2 1.00000  3 1.00000  1 1.00000  1 1.00000  2 0.99931
12     2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  1 1.00000
13     2 1.00000  2 1.00000  3 0.98902  1 1.00000  1 1.00000  2 0.99969
14     2 1.00000  2 1.00000  2 0.99991  1 1.00000  1 1.00000  3 1.00000
15     2 1.00000  2 1.00000  2 0.99986  1 1.00000  1 1.00000  3 0.99999
16     2 1.00000  3 0.98609  2 0.74890  1 1.00000  1 1.00000  2 0.94126
17     2 1.00000  2 1.00000  2 0.97983  1 1.00000  1 1.00000  2 1.00000
18     2 1.00000  2 1.00000  2 0.98367  1 1.00000  1 1.00000  1 1.00000
19     2 1.00000  2 1.00000  4 0.64546  1 1.00000  1 1.00000  3 1.00000
20     2 1.00000  2 1.00000  3 0.58327  1 1.00000  1 1.00000  2 0.95078
21     2 0.99952  2 1.00000  2 0.97594  1 1.00000  1 1.00000  1 0.99776
22     2 1.00000  2 0.98219  2 0.99217  1 1.00000  1 0.99989  2 1.00000
23     2 1.00000  3 1.00000  3 1.00000  1 1.00000  1 1.00000  3 0.99998
24     1 0.99122  1 0.99778  1 0.99999
(Row 24 has only three entries; the blastomeres they belong to are
not identified.)
* * * * *