U.S. patent application number 14/886520 was filed with the patent office on 2015-10-19 and published on 2016-02-04 for evolutionary models of multiple sequence alignments to predict offspring fitness prior to conception.
The applicant listed for this patent is GenePeeks, Inc.. Invention is credited to Nigel DELANEY, Ari Silver, Lee Silver.
Application Number | 14/886520 |
Publication Number | 20160034635 |
Family ID | 54836376 |
Filed Date | 2015-10-19 |
Publication Date | 2016-02-04 |
United States Patent Application | 20160034635 |
Kind Code | A1 |
DELANEY; Nigel; et al. | February 4, 2016 |
EVOLUTIONARY MODELS OF MULTIPLE SEQUENCE ALIGNMENTS TO PREDICT
OFFSPRING FITNESS PRIOR TO CONCEPTION
Abstract
A system, device and method for receiving multiple aligned
genetic sequences obtained from genetic samples of multiple
organisms of one or more different species. A measure of
evolutionary variation may be computed for one or more alleles at
each of one or more aligned genetic loci. The aligned genetic loci
in the multiple organisms may be derived from one or more common
ancestral genetic loci or may be otherwise related. The measure of
evolutionary variation may be a function of variation in alleles at
corresponding aligned genetic loci in the multiple aligned genetic
sequences. One or more likelihoods may be computed that an allele
mutation at each of the one or more genetic loci in a simulated
virtual progeny will be deleterious based on the measure of
evolutionary variation of alleles at the corresponding aligned
genetic loci for the multiple organisms.
Inventors: | DELANEY; Nigel (San Francisco, CA); Silver; Ari (New York, NY); Silver; Lee (New York, NY) |
Applicant: |
Name | City | State | Country | Type |
GenePeeks, Inc. | New York | NY | US | |
Family ID: | 54836376 |
Appl. No.: | 14/886520 |
Filed: | October 19, 2015 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
14568456 | Dec 12, 2014 | |
14886520 | | |
62013139 | Jun 17, 2014 | |
Current U.S. Class: | 702/19 |
Current CPC Class: | G16B 20/00 20190201; G16B 20/10 20190201; G16B 20/20 20190201; G16B 5/00 20190201; G16B 10/00 20190201; G16B 20/40 20190201 |
International Class: | G06F 19/18 20060101 G06F019/18; G06F 19/12 20060101 G06F019/12 |
Claims
1. A method of predicting deleterious mutations in virtual progeny,
the method comprising: receiving multiple aligned genetic sequences
obtained from genetic samples of multiple organisms of one or more
different species; computing a measure of evolutionary variation of
alleles at each of one or more aligned genetic loci derived from
one or more common ancestral genetic loci in the multiple organisms
as a function of variation in alleles at corresponding aligned
genetic loci in the multiple aligned genetic sequences; and
computing one or more likelihoods that an allele mutation at each
of the one or more genetic loci in a simulated virtual progeny will
be deleterious based on the measure of evolutionary variation of
alleles at the corresponding aligned genetic loci for the multiple
organisms.
2. The method of claim 1, comprising: simulating a mating of two
potential parents by combining at least a portion of their genetic
information to generate a genetic sequence of the virtual progeny;
and assigning the virtual progeny one or more of the likelihoods of
being deleterious associated with one or more alleles in the
genetic sequence.
3. The method of claim 2, comprising: generating a virtual gamete
for each potential parent by at least partially randomly selecting
one of two allele copies in the parent's chromosomes to simulate
recombination at each of a sequence of genetic loci; and combining
the two virtual gametes from the two potential parents to generate
the genetic sequence of the virtual progeny.
4. The method of claim 3, comprising: repeating said step of
generating a virtual gamete for each of a plurality of at least
partially random sequences of alleles to generate a plurality of
different virtual gametes for each potential parent; and repeating
said step of combining the two virtual gametes for each of a
plurality of different combinations of the two virtual gametes to
generate a plurality of genetic sequences of the virtual progeny;
and repeating said step of assigning the virtual progeny one or
more of the likelihoods for each of the plurality of genetic
sequences of the virtual progeny to generate one or more
likelihoods or likelihood distributions that an allele mutation
will be deleterious in the virtual progeny.
5. (canceled)
6. (canceled)
7. The method of claim 1, wherein the multiple organisms are from
multiple different species.
8. The method of claim 1, wherein the multiple organisms are from a
single species.
9. The method of claim 1, comprising computing one or more
functions of variation in alleles at corresponding aligned genetic
loci between a genetic sequence of an individual organism and one
or more reference genetic information data sets.
10. The method of claim 1, comprising comparing the one or more
likelihoods to one or more thresholds or other statistical models
to predict if an allele mutation will be deleterious in the virtual
progeny.
11. The method of claim 1, wherein the likelihood that an allele
mutation will be deleterious in the virtual progeny is relatively
higher for allele mutations at corresponding aligned genetic loci
that have a relatively lower measure of evolutionary variation in
alleles.
12. The method of claim 1, comprising weighting the measure of
evolutionary variation at different genetic loci based on a
distribution of mutation rates at the different genetic loci in the
multiple aligned genetic sequences.
13. The method of claim 1, comprising weighting the measure of
evolutionary variation at different genetic loci to identify
genetic loci in which relatively few mutations have been observed
in evolutionary history.
14. The method of claim 1, wherein a phylogenetic tree is used to
generate the function of variation in alleles that have
proliferated in the multiple organisms over evolutionary history to
predict the likelihood that such variations in alleles would be
deleterious in the virtual progeny.
15. The method of claim 14, wherein the likelihood that an allele
mutation in the virtual progeny would be deleterious is based on a
frequency with which the allele mutation has occurred and persisted
in the multiple organisms over evolutionary history.
16. The method of claim 14, wherein the likelihood that an allele
mutation in the virtual progeny would be deleterious is based on a
proximity in the phylogenetic tree representing an evolutionary
timescale between a reference genetic sequence of the same species
as the virtual progeny and one or more other species in which the
allele mutation has occurred.
17. The method of claim 14, wherein the phylogenetic tree is
defined by a model of probabilities that an allele i will mutate to
an allele j over an interval of evolutionary time.
18. The method of claim 1, wherein the function of variation in
alleles is a score that quantifies the relative amount of sequence
conservation at the aligned genetic loci.
19. The method of claim 1, wherein the function of variation in
alleles is based on a Shannon entropy of alleles at the aligned
genetic loci.
20. The method of claim 1, wherein the function of variation in
alleles is based on an average pairwise difference between
different alleles at the aligned genetic loci.
21. The method of claim 1, wherein the function of variation in
alleles is based on a distance metric between a reference genetic
sequence and a genetic sequence of the virtual progeny.
22. The method of claim 21, comprising: identifying one or more
genetic loci in which the virtual progeny genetic sequence has one
or more allele mutations that differ from one or more alleles at
the one or more genetic loci in the reference genetic sequence; and
assigning a rank to each of the multiple aligned genetic sequences
ordered based on similarity to a reference genetic sequence,
wherein the distance metric is selected from the group consisting
of: the rank of a first ordered sequence with a different allele
than the reference genetic sequence at one or more genetic loci
aligned with the one or more identified genetic loci and the rank
of the first ordered sequence with the same allele mutation at a
corresponding aligned genetic locus as the virtual progeny genetic
sequence.
23. (canceled)
24. The method of claim 1, wherein the function of variation in
alleles measures variations in alleles located in multiple
different aligned genetic loci derived from multiple common
ancestral genetic loci.
25. The method of claim 1, wherein the one or more likelihoods are
computed by training a function to discriminate between mutations
predefined to be deleterious and mutations predefined to be
neutral.
26. The method of claim 1, wherein the one or more likelihoods are
computed by training a function to assess a likelihood of a
mutation reaching a certain frequency in a population.
27. The method of claim 1, wherein the function of allele variation
at one or more genetic loci is based on a ratio ω of a
non-synonymous substitution rate to a synonymous substitution rate,
wherein a non-synonymous substitution is an allele substitution in
a codon that does change an amino acid encoded by the codon and
a synonymous substitution is an allele substitution in the codon
that does not change the amino acid.
28. The method of claim 27, wherein the measure of evolutionary
variation of alleles is defined based on probabilities $t_{ij}$
that an allele i will mutate into an allele j over an interval of
evolutionary time as follows:

$$t_{ij} = \begin{cases} \omega\, q_{ij} & \text{if } i \to j \text{ is non-synonymous} \\ q_{ij} & \text{if } i \to j \text{ is synonymous} \end{cases}$$

where ω is the ratio of non-synonymous to synonymous
substitution rates.
29. A method of predicting deleterious mutations in virtual
progeny, the method comprising: simulating a mating of two
potential parents by combining at least a portion of their genetic
information to generate a genetic sequence of the virtual progeny;
computing one or more likelihoods that an allele mutation at each
of the one or more genetic loci in the genetic sequence of the
virtual progeny will be deleterious based on a measure of
evolutionary variation of alleles at corresponding aligned genetic
loci in a multiple sequence alignment of multiple genetic sequences
of multiple organisms; and assigning the virtual progeny one or
more of the likelihoods of being deleterious associated with one or
more alleles in the genetic sequence.
30. A system for predicting deleterious mutations in virtual
progeny, the system comprising: a memory configured to store
multiple aligned genetic sequences obtained from genetic samples of
multiple organisms of one or more different species; and a
processor configured to use the stored multiple aligned genetic
sequences to: compute a measure of evolutionary variation of
alleles at each of one or more aligned genetic loci derived from
one or more common ancestral genetic loci in the multiple organisms
as a function of variation in alleles at corresponding aligned
genetic loci in the multiple aligned genetic sequences; and compute
one or more likelihoods that an allele mutation at each of the one
or more genetic loci in a simulated virtual progeny will be
deleterious based on the measure of evolutionary variation of
alleles at the corresponding aligned genetic loci for the multiple
organisms.
31. (canceled)
32. The system of claim 30, wherein the processor is configured to:
generate a virtual gamete for each potential parent by at least
partially randomly selecting one of two allele copies in the
parent's chromosomes to simulate recombination at each of a
sequence of genetic loci; combine the two virtual gametes from the
two potential parents to generate the genetic sequence of the
virtual progeny; and assign the virtual progeny one or more of the
likelihoods of being deleterious associated with one or more
alleles in the genetic sequence.
33. The system of claim 30, wherein the processor is configured to
compute the one or more likelihoods that an allele mutation in the
virtual progeny would be deleterious based on a frequency with
which the allele mutation has occurred and persisted in the
multiple organisms over evolutionary history.
34. The system of claim 30, wherein the processor is configured to
compute the one or more likelihoods that an allele mutation in the
virtual progeny would be deleterious based on a proximity in a
phylogenetic tree representing an evolutionary timescale between
the virtual progeny genetic sequence and one or more other genetic
sequences of other organisms in which the allele mutation has
occurred.
Description
REFERENCE TO RELATED APPLICATIONS
[0001] This patent application claims the benefit of U.S. patent
application Ser. No. 14/568,456 filed Dec. 12, 2014, which claims
benefit of U.S. provisional patent application No. 62/013,139,
filed Jun. 17, 2014, both of which are incorporated by reference in
their entirety.
FIELD OF THE INVENTION
[0002] Embodiments of the invention relate to predictions of
evolutionary fitness of genes in a population of organisms. In
particular, some embodiments of the invention relate to the use of
genetic variation, whether within a single species or across
multiple species, to predict the fitness of hypothetical or virtual
offspring associated with a potential mating before that mating
occurs.
BACKGROUND OF THE INVENTION
[0003] Every year thousands of babies are born with genetic
diseases. Often, the parents of these children are both healthy,
but each parent possesses genetic mutations that, when passed in
combination to the child, endow it from the time of conception with
an unmitigated genetic defect. Children with such diseases may
suffer and have diminished lifespans, and their care can entail
large emotional and financial costs, so many prospective parents
attempt to minimize the chance that they pass on genetic elements
that cause disease.
[0004] Carrier testing, in which both parents are genotyped at loci
of their genomes that are known to cause disease, is a technique
widely used to achieve this goal. Such tests rely on a defined set
of alleles known to cause diseases, and then screen for the
presence of these alleles in one or both parents prior to
conception. The alleles screened in such tests typically have been
established to cause disease by examining pedigrees of patients
with the disease, by using cellular or animal models of the effect
of the particular allele, or by alternate means. In all cases, the
correlation between alleles and genetic diseases is determined by
studying one or more individuals that have already been born.
[0005] Although carrier testing is used in a limited number of
cases, even if all possible prospective matings were filtered by
this technique, children suffering from genetic diseases would
still be born. This is because carrier tests inherently screen only
a known subset of all alleles that can cause disease. The
incompleteness of these tests is evidenced by the fact that the
number of alleles associated with disease in public databases such
as ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) and OMIM
(http://www.ncbi.nlm.nih.gov/omim) continues to grow every year,
and in turn so does the number of loci tested by carrier screening.
Similarly, many patients can present with pathologies which appear
to have a genetic basis, but for which no specific underlying
genetic mutation has yet been determined. In many of these cases,
one or more novel pathogenic variants are later discovered by
various means and added to the catalog of known disease-associated
mutations. For example, the genomes of many patients with similar
pathologies can be sequenced and shared mutations found.
Alternatively, mutations that occur in an individual patient's
genome which appear damaging (missense, nonsense, etc.) and are
present in genes known to be associated with a biological process
related to the pathology, may be tested in a cellular or animal
model.
[0006] While the steady growth of the catalog of variants known
to cause disease implies that carrier testing will improve, it
also shows that carrier testing suffers from two fundamental
inadequacies. The first is that a diseased child must be born and
diagnosed in order to find a new disease-associated allele. The
second, and more insidious, is that carrier testing cannot assess
the impact of novel or de novo mutations. If a variant is specific
to an individual or family and has not been previously studied,
carrier testing cannot determine what effect it may have on future
offspring. Additionally, novel variants initially appear only as
one half of a heterozygous genotype. If such an allele is recessive
but damaging when combined with itself or another recessive
mutation, it is very difficult to resolve the effect of the
mutation until, from the perspective of a parent who wants to
avoid passing on disease-causing alleles, it is too late.
SUMMARY OF THE INVENTION
[0007] A system, device and method are described to overcome the
aforementioned issues in the art. Some embodiments may assign one
or more likelihoods that an allele mutation in a simulated virtual
progeny is deleterious based on the evolutionary variation at the
allele loci in real extant species or populations, for example, in
order to filter out prospective pairings of gametes prior to
conception.
[0008] According to some embodiments of the invention, a system,
device and method may use the evolution of genetic variation of
multiple organisms within one species ("single-species" or
"intra-species" model) or across multiple different species
("multi-species" or "inter-species" model) to predict the
likelihood that alleles would be deleterious in hypothetical,
simulated or virtual progeny. Past evolutionary trends in allele
mutations of extant or surviving (currently or once-living)
organisms representative of one or more species or populations may
be analyzed to predict the future fitness of a potential
hypothetical or virtual (never or non-living) progeny simulated for
two potential parents.
[0009] According to some embodiments of the invention, a system,
device and method may receive multiple aligned genetic sequences
obtained from genetic samples of multiple organisms of one or more
different species. Genetic loci from different sequences of
different organisms may be aligned when they are derived from one
or more common ancestral genetic loci, are correlated with the same
trait(s), disease(s) or codon(s), are positioned or sandwiched
between other correlated marker loci, or are otherwise related. A
measure of evolutionary variation may be computed for one or more
alleles at each of one or more aligned genetic loci of the multiple
aligned sequences. The measure of evolutionary variation may be a
function of variation in alleles at corresponding aligned genetic
loci in the multiple aligned genetic sequences. One or more
likelihoods may be computed that an allele, either a new mutation
or one present in the alignment, at each of the one or more genetic
loci in a simulated virtual progeny will be deleterious based on
the measure of evolutionary variation of alleles at the
corresponding aligned genetic loci for the multiple organisms.
[0010] According to some embodiments of the invention, a system,
device and method may generate the virtual (hypothetical, potential
or non-living) progeny by simulating a mating between two (living)
potential parents by combining at least a portion of their genetic
information. Simulating a mating may include combining genetic
information of both of the two potential parents at one or more
genetic loci. In one embodiment, a mating may be simulated by
generating a virtual gamete for each potential parent by at least
partially randomly selecting one of two allele copies in the
parent's two sets of chromosomes to simulate recombination at each
of one or more genetic loci. Two virtual gametes from the two
respective potential parents may be combined to generate a genetic
sequence of a virtual progeny.
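By way of non-limiting illustration, the virtual gamete and virtual progeny steps described above may be sketched as follows. The function names, the three-locus genotypes and the assumption that loci are unlinked (so that each allele copy is chosen independently with equal probability) are hypothetical simplifications introduced for this example and are not taken from the specification.

```python
import random

def make_virtual_gamete(parent_genotype, rng):
    """Pick one of the two allele copies at each locus.

    Assumption: loci are unlinked, so each pick is independent and uniform.
    """
    return [rng.choice(pair) for pair in parent_genotype]

def simulate_virtual_progeny(parent_a, parent_b, rng):
    """Combine one virtual gamete from each parent into a diploid genotype."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be genotyped at the same loci")
    gamete_a = make_virtual_gamete(parent_a, rng)
    gamete_b = make_virtual_gamete(parent_b, rng)
    # Each locus of the progeny holds (allele from parent A, allele from parent B).
    return list(zip(gamete_a, gamete_b))

# Hypothetical genotypes at three loci; each tuple holds the two allele copies.
parent_a = [("A", "G"), ("C", "C"), ("T", "A")]
parent_b = [("A", "A"), ("C", "T"), ("T", "T")]

rng = random.Random(0)  # seeded only to make the sketch repeatable
progeny = [simulate_virtual_progeny(parent_a, parent_b, rng) for _ in range(5)]
```

Repeating the simulated mating, as in the list comprehension above, yields multiple virtual progeny for the same pair of potential parents. A more realistic sketch might model linkage between neighboring loci rather than independent selection.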
[0011] Once the virtual progeny is simulated, alleles or mutations
in the virtual progeny may be assigned one or more of the
likelihoods or scores determined for corresponding alleles or
mutations in aligned loci of the multiple extant organisms. These
likelihoods may indicate the potential or probability that the
virtual progeny's alleles or mutations would be deleterious, for
example, if those alleles or mutations were found in the genome of
a living organism such as a human child.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The subject matter regarded as the invention is particularly
pointed out and distinctly claimed in the concluding portion of the
specification. The invention, however, both as to organization and
method of operation, together with objects, features, and
advantages thereof, may best be understood by reference to the
following detailed description when read with the accompanying
drawings in which:
[0013] FIG. 1 schematically illustrates an example of an alignment
of multiple genetic sequences (SEQ ID Nos.: 1-36, respectively, in
order of appearance) according to an embodiment of the invention
(the abbreviations for the species are as follows: LATCH=Latimeria
chalumnae; XENTR=Xenopus tropicalis; TAEGU=Taeniopygia guttata;
MELGA=Meleagris gallopavo; CHICK=Gallus gallus;
ORNAN=Ornithorhynchus anatinus; LOXAF=Loxodonta africana;
HORSE=Equus caballus; TURTR=Tursiops truncatus; MYOLU=Myotis
lucifugus; AILME=Ailuropoda melanoleuca; OTOGA=Otolemur garnettii;
CALJA=Callithrix jacchus; MACMU=Macaca mulatta; NOMLE=Nomascus
leucogenys; PONAB=Pongo abelii; GORILLA=Gorilla gorilla; CHIMP=Pan
troglodytes; HUMAN=Homo sapiens; TUPBE=Tupaia belangeri;
OCHPR=Ochotona princeps; CAVPO=Cavia porcellus; SPETR=Spermophilus
tridecemlineatus; DIPOR=Dipodomys ordii; MOUSE=Mus musculus;
RAT=Rattus norvegicus; SARHA=Sarcophilus harrisii;
MONDO=Monodelphis domestica; MACEU=Macropus eugenii; DANRE=Danio
rerio; GADMO=Gadus morhua; ORYLA=Oryzias latipes;
GASAC=Gasterosteus aculeatus; ORENI=Oreochromis niloticus;
TETNG=Tetraodon nigroviridis; TAKRU=Takifugu rubripes);
[0014] FIG. 2 schematically illustrates an example of a
phylogenetic tree inferred from the multiple sequence alignment
shown in FIG. 1 according to an embodiment of the invention;
[0015] FIG. 3 schematically illustrates a system for executing one
or more methods according to embodiments of the invention;
[0016] FIG. 4 schematically illustrates an example of simulating a
hypothetical mating of two potential parents for generating a
virtual progeny according to an embodiment of the invention;
and
[0017] FIG. 5 is a flowchart of a method for using the evolution of
multiple organisms to predict deleterious mutations in virtual
progeny according to an embodiment of the invention.
[0018] It will be appreciated that for simplicity and clarity of
illustration, elements shown in the figures have not necessarily
been drawn to scale. For example, the dimensions of some of the
elements may be exaggerated relative to other elements for clarity.
Further, where considered appropriate, reference numerals may be
repeated among the figures to indicate corresponding or analogous
elements.
DETAILED DESCRIPTION OF THE INVENTION
[0019] In the following description, various aspects of the present
invention will be described. For purposes of explanation, specific
configurations and details are set forth in order to provide a
thorough understanding of the present invention. However, it will
also be apparent to one skilled in the art that the present
invention may be practiced without the specific details presented
herein. Furthermore, well known features may be omitted or
simplified in order not to obscure the present invention.
[0020] Unless specifically stated otherwise, as apparent from the
following discussions, it is appreciated that throughout the
specification discussions utilizing terms such as "processing,"
"computing," "calculating," "determining," or the like, refer to
the action and/or processes of a computer or computing system, or
similar electronic computing device, that manipulates and/or
transforms data represented as physical, such as electronic,
quantities within the computing system's registers and/or memories
into other data similarly represented as physical quantities within
the computing system's memories, registers or other such
information storage, transmission or display devices.
[0021] Embodiments of the invention relate to multiple types of
genetic sequences:
[0022] Reference genetic sequences: Genetic
sequences used to generate an evolutionary model, such as, a
phylogenetic tree. Reference genetic sequences may include
standardized genetic sequences from organisms representative of one
or more evolutionarily extant (currently or previously living)
populations or species, such as those released by genome consortia
(e.g., human reference genome, such as, Genome Reference Consortium
Human Build 37 (GRCh37) provided by the Genome Reference
Consortium). Reference genetic sequences may additionally or
alternatively include non-standardized sequences of organisms, such
as, any member of a population or species. A single-species model
may be generated using reference genetic sequences from multiple
organisms of the same single species, e.g., 1,000 chimpanzees or
humans. A multi-species model may be generated using reference
genetic sequences from multiple organisms of multiple different
species, e.g., one model using 1,000 humans, 10 chimpanzees and one
gorilla, or another model using a single different organism from
each different species as shown in FIG. 1. Reference genetic
sequences may be used to analyze the evolution of successful
(positive) or neutral (non-deleterious) allele mutations or
variations across one or more extant species. An evolutionary model
may predict likelihoods that allele mutations or variations would
be deleterious based on their frequency or rarity of occurrence
across the multiple reference genetic sequences. For example,
allele mutations or variations that are relatively rare across
the reference genetic sequences may be considered to have been
negatively selected over evolution (e.g., associated with a
deleterious trait with which an organism cannot survive or
reproduce, or has a relatively lower likelihood of doing so), while
allele mutations or variations that are relatively common across
the reference genetic sequences may be considered to have been
positively or neutrally selected over evolution (e.g., not
associated with a deleterious
trait, but traits for which an organism has a neutral or improved
likelihood of surviving or reproducing).
[0023] Potential parent genetic sequences: Genetic sequences of real (currently or
previously living) potential parents, for example, from which
genetic information is combined to simulate a virtual mating
generating one or more virtual children or progeny, to predict
before they conceive a child, a likelihood that such a child would
have a deleterious trait. The potential parent genetic sequences
may be obtained from genetic samples of two potential parents
seeking to mate, or from a first potential parent seeking a genetic
donor and a second potential parent from a pool of candidate
donors.
[0024] Virtual progeny genetic sequences: Genetic sequences
of simulated (never living) virtual progeny generated by simulating
a mating or combining genetic information from two potential parent
genetic sequences. Each virtual progeny genetic sequence may be a
prediction or simulation of one possible genetic sequence of a
child of the two potential parents, before that child is conceived.
To achieve more robust results, the simulated mating may be
repeated to generate multiple virtual progeny genetic sequences for
each pair of potential parents. The virtual progeny genetic
sequences may be compared to the reference genetic sequences, for
example, to identify evolutionarily rare, and therefore, likely
deleterious traits. In some embodiments, genetic information may be
used interchangeably for potential parent genetic sequences and
reference genetic sequences. In one example, genetic information
from a potential parent or donor may be used instead of, or in
combination with reference consortium genetic sequences, to
generate an evolutionary model or phylogenetic tree. In another
example, reference consortium genetic sequences may be used instead
of, or in combination with potential parent or donor genetic
sequences, to simulate matings or predict likelihoods of
deleterious traits in offspring.
[0025] As used herein, a "genetic sequence" may include genetic
information representing one or more bases, nucleotides or alleles
(sequences of nucleotides defining different forms of a gene) for
any number of sequential or non-sequential genetic loci. For
example, a "genetic sequence" may refer to allele information at a
single genetic locus, or multiple genetic loci, such as, one or
more gene segments or an entire genome. A genetic sequence is a
data structure representing genetic information at one or more loci
of a real or virtual genome. Genetic sequence data structures may
include, for example, one or more vectors, scalar values,
functions, sequences, sets, matrices, tables, lists, arrays, and/or
other data structures, representing one or more bases, nucleotides,
genes, alleles, codons or other genetic material. The data
structures representing each single chromosome sequence may be one
dimensional (e.g., representing a single base or allele per locus)
or multi-dimensional (e.g., representing multiple or all bases
A,T,C,G or alleles at each locus and a probability associated with
the likelihood of each existing in a potential progeny). The same
(or different) data structures may be used for real and virtual
genome sequences, though real genome sequences generally represent
real genetic material (e.g., DNA extracted from a currently or
previously existing genetic sample), while virtual genome sequences
have no real corresponding genetic material (e.g., the sequence may
represent an imaginary non-existing gamete, progeny, etc.).
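As a hedged illustration of the one-dimensional and multi-dimensional data structures described above, the following sketch represents a sequence either as a plain string of bases or as a list of per-locus probability tables. The class name `LocusDistribution`, the helper `collapse` and the example probabilities are hypothetical and are introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

BASES = ("A", "C", "G", "T")

@dataclass
class LocusDistribution:
    """Multi-dimensional entry: a probability for each possible base at one locus."""
    probs: Dict[str, float]

    def __post_init__(self):
        unknown = set(self.probs) - set(BASES)
        if unknown:
            raise ValueError(f"unknown bases: {sorted(unknown)}")
        if abs(sum(self.probs.values()) - 1.0) > 1e-9:
            raise ValueError("base probabilities must sum to 1")

    def most_likely(self) -> str:
        """Return the base with the highest probability at this locus."""
        return max(self.probs, key=self.probs.get)

def collapse(sequence: List[LocusDistribution]) -> str:
    """Reduce a multi-dimensional sequence to a one-dimensional string of bases."""
    return "".join(locus.most_likely() for locus in sequence)

# One-dimensional form: a single base per locus.
one_dimensional = "ACG"

# Multi-dimensional form: several candidate bases per locus, each with the
# probability that it exists in a potential progeny (values are made up).
multi_dimensional = [
    LocusDistribution({"A": 1.0}),
    LocusDistribution({"C": 0.75, "T": 0.25}),
    LocusDistribution({"G": 0.75, "A": 0.25}),
]
```

The same structures could hold real or virtual sequences; only the provenance of the data differs.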
[0026] According to some embodiments of the invention, multiple
reference genetic sequences from multiple extant organisms within
one species ("single-species" model) or multiple different species
("multi-species" model) may be used to generate an evolutionary
model to predict deleterious allele mutations in virtual progeny.
In one example of a multi-species model, multiple different
vertebrate species may be used to predict deleterious allele
mutations in virtual human progeny.
[0027] The reference genetic sequences may be aligned to link or
associate one or more genetic loci in each of the multiple
different sequences. Aligned loci of the different sequences may be
derived from one or more common ancestral genetic loci and/or may
relate to the same features, diseases or traits. A measure of
evolutionary variation of alleles at one or more of the aligned
genetic loci may be computed, for example, as a function of
variation in alleles at corresponding aligned genetic loci in the
multiple aligned reference genetic sequences. Aligned genetic loci
associated with a relatively lower frequency of allele variation
may indicate that the alleles are "functional" or relatively
important to an organism's survival and their mutations may have a
relatively higher likelihood of causing deleterious traits in an
organism, whereas aligned genetic loci associated with a relatively
higher frequency of allele variation in the reference genetic
sequences may indicate that the alleles are less or non-functional
and may be mutated with a relatively lower likelihood of impacting
the survival or formation of deleterious traits in an organism. In
some embodiments, the reference genetic sequences in the model may
be weighted according to the evolutionary proximity of their
population or species to the population or species of the virtual
progeny and potential parents. For example, more weight may be
assigned to reference genetic sequences of populations or species
that are relatively more evolutionarily related (e.g., closer on a
phylogenetic tree or having a relatively smaller Hamming
distance).
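By way of a non-limiting sketch, the weighting described above may be illustrated using Hamming distance as the proximity measure. The function names `hamming` and `proximity_weights`, and the choice of weight 1/(1 + distance), are assumptions made for illustration only and are not specified by the embodiments.

```python
def hamming(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def proximity_weights(target, references):
    """Assign each reference a weight that decreases with Hamming distance.

    The 1 / (1 + distance) form is an illustrative assumption; weights are
    normalized so that they sum to 1.
    """
    raw = [1.0 / (1.0 + hamming(target, ref)) for ref in references]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    weights = proximity_weights("ACGT", ["ACGT", "ACGA", "TTTT"])
    print(weights)
```

Other proximity measures, such as a branch length taken from a phylogenetic tree, may be substituted for the Hamming distance without changing the overall structure.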
[0028] Genetic sequences may be obtained from two potential
parents, such as two individuals that plan on mating, or one
individual seeking a genetic donor and each of a plurality of
candidates from a pool of genetic donors. The potential parents'
genetic sequences may be obtained from genetic samples of
biological material from the potential parents. A mating may be
simulated between two potential parents, for example, by combining
the genetic information from the two potential parents' genetic
sequences to generate one or more genetic sequences of simulated
virtual progeny.
[0029] The virtual progeny genetic sequences may be aligned with
one or more of the reference genetic sequences to identify one or
more alleles that evolved from the same ancestral genetic loci. The
virtual progeny may be assigned one or more of the likelihoods of
exhibiting deleterious traits associated with one or more alleles
or mutations in the virtual progeny genetic sequences based on the
measure of evolutionary variation of alleles at the corresponding
aligned genetic loci in the reference genetic sequences.
Predicting Deleterious Alleles
[0030] Embodiments of the invention overcome the limitations of
relying on specific information derived from human or cellular
studies of the effect of mutation in order to score the propensity
or probability that a particular mutation or allele will cause a
deleterious phenotype, trait or disease.
[0031] An insight recognized according to embodiments of the
invention is that extant genetic variation, that is existing or
surviving genetic variation present amongst homologous or
paralogous reference DNA sequences present in different organisms
or members of a population, represents the outcome of an experiment
that can be informative for predicting whether a given mutation or
allele variation in a prospective parent's genome is likely to be
deleterious to their child.
[0032] This experiment is the process of evolution, which has
governed the replication and diversification of life on Earth.
Today, there are many species, and individuals within a species all
contain copies of genetic material which is derived from common
ancestral versions. As species and individuals reproduce and copy
their DNA, mutations appear that make these descendant copies
distinct from the parental versions. The eventual fate of such new
mutations, whether they will continue to be passed along to
offspring or eventually die out, is determined by a stochastic
process that is influenced by the mutation's effect on the
reproductive fitness of the organism. Mutations that have no
functional effect (neutral mutations) or are beneficial to an
organism (positive mutations) are more likely to eventually
increase in frequency and persist in the population, increasing
diversity or replacing their parental version. In contrast,
mutations which lower the reproductive fitness of an organism
(negative or deleterious mutations) are unlikely to persist and
contribute to future genetic variation.
[0033] Over the course of evolutionary time, a great many mutations
have appeared and persisted, leading to the present diversity
amongst DNA sequences derived from a common ancestor. However, this
diversity is not equally distributed amongst all sequence positions
in a genome. Although mutations are essentially introduced during
the replication process independent of any functional effect they
may have, the evolutionary filtering process is greatly influenced
by such effects. As such, when comparing the genomes of several
species or individuals today, we see that some areas are conserved
(such as having the same coding sequence and/or non-coding
sequence), while others have much more greatly diverged (having
very diverged sequences from each other or relative to the
ancestral copy number).
[0034] Reference is made to FIG. 1, which shows an example of an
alignment of multiple genetic sequences (SEQ ID Nos.: 1-36,
proceeding from top to bottom, respectively) according to an
embodiment of the invention. In the example of FIG. 1, the multiple
aligned reference genetic sequences represent a portion of the DNA
sequence coding for the PEX10 proteins present in organisms from
multiple vertebrate species. Item A in the figure shows a nucleic
acid genetic locus which is completely conserved across all species
in the alignment, as all species have a Guanine (symbolized by the
letter G) at this locus position. Although many mutations that
change the amino acid at this position have undoubtedly been
introduced into this gene over the course of the 500 million years
of vertebrate evolution, the fact that no such mutation persists
today is a strong indication that such mutations are likely to be
deleterious and reduce evolutionary fitness. In contrast, the
position in the gene indicated by item B in FIG. 1 is much more
variable, with different species having at that locus position one
of the following DNA bases: Guanine (G), Adenine (A), Thymine (T),
Cytosine (C). The diversity of DNA (or alternatively the amino
acids encoded by the DNA) at this genetic locus provides an
indication that it is relatively less likely that a mutation at
this position in a parent's genotype will be deleterious, relative
to a mutation at the genetic locus position indicated by item
A.
Assessing the Likelihood of Deleterious Alleles Based Directly on a
Multiple Sequence Alignment
[0035] A multiple sequence alignment of present day reference
genetic sequences may be derived from common ancestral genetic loci
of multiple species (e.g. different vertebrate sequenced genomes)
or multiple individuals within a single species (e.g. a collection
of human sequences). A substantially large sample size of
organisms, populations or species (e.g., tens, hundreds, or more)
may be used for statistically significant likelihoods, for example,
to reduce bias error due to a skewed sample set.
[0036] Embodiments of the invention may compute a measure of
evolutionary variation of alleles f at each of one or more aligned
genetic loci as a function of variation in alleles F at
corresponding aligned genetic loci in the multiple sequence
alignment (MSA). The measure of evolutionary variation of alleles f
may be transformed into a likelihood or score s associated with a
relative propensity that this allele mutation would be damaging if
produced in a child. This likelihood or score s may be derived, for
example, using two functional transformations F and S, to convert
columns of aligned genetic loci of a multiple sequence alignment
(MSA) and a putative mutation or allele in a virtual progeny into a
propensity score or likelihood s relevant to assessing the effect
of that particular allele or mutation on the virtual progeny, for
example, as shown in equation (1):
\[
\text{Multiple Sequence Alignment (MSA)} \rightarrow f = F(\mathrm{MSA}) \rightarrow s = S(f) \tag{1}
\]
[0037] The first functional transformation shown in equation (1),
f=F(MSA), is used to compute a measure of evolutionary variation of
alleles f at each of one or more genetic loci derived from one or
more common ancestral genetic loci in the multiple organisms as a
function of variation in alleles F at corresponding aligned genetic
loci in the multiple aligned genetic sequences. The first
functional transformation may create a raw score that quantifies
the relative amount of sequence conservation at the one or more
genetic loci. There are many possible instantiations of this
function that may be used according to embodiments of the
invention. For example, one such function may input information
from the DNA or amino acid genetic sequences present in the
alignment and output a Shannon entropy of the sequence characters
at each of the one or more genetic loci. Denoting a frequency of a
particular symbol (DNA base or amino acid) at a particular genetic
locus or column position, j, in a multiple sequence alignment as
p_i, i ∈ {A, C, G, T} (for DNA, or the set of amino acid symbols
if considering a protein alignment), the Shannon entropy function
may be calculated, for example, as shown in equation (2):
\[
F(\mathrm{MSA}_j) = -\sum_{i} p_i \log_2 p_i \tag{2}
\]
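As a minimal sketch of this first transformation, the per-column Shannon entropy may be computed as follows, using the usual definition with a leading minus sign so that a fully conserved column scores 0 bits. The representation of the alignment as a list of equal-length strings, and the names `column_entropy` and `msa_entropies`, are assumptions for illustration; gap characters, if present, are simply counted as one more symbol here.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    n = len(column)
    if n == 0:
        raise ValueError("column must contain at least one symbol")
    counts = Counter(column)
    # Sum of -p_i * log2(p_i) over the symbols observed in the column.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def msa_entropies(msa):
    """Entropy of every column of an MSA given as equal-length strings."""
    return [column_entropy(col) for col in zip(*msa)]


if __name__ == "__main__":
    # First column is fully conserved; second column has four distinct bases.
    print(msa_entropies(["GG", "GA", "GT", "GC"]))
```

In this toy alignment the conserved column yields an entropy of 0 bits and the column with four equally frequent bases yields 2 bits, the maximum for a four-letter alphabet.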
[0038] Another example of the first functional transformation shown
in equation (1), f=F (MSA), may take the average pairwise
difference between different symbols (S) in an aligned sequence
column of length N, for example, as in equation (3):
\[
F(\mathrm{MSA}_j) = \binom{N}{2}^{-1} \sum_{i=1}^{N} \sum_{k=i+1}^{N}
\begin{cases} 1 & \text{if } S_i \neq S_k \\ 0 & \text{if } S_i = S_k \end{cases} \tag{3}
\]
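The average pairwise difference may be sketched as below: the number of unordered pairs of symbols in a column that differ, divided by the total number of unordered pairs, N(N-1)/2. The name `mean_pairwise_difference` is a hypothetical helper chosen for illustration.

```python
from itertools import combinations


def mean_pairwise_difference(column):
    """Fraction of unordered symbol pairs in a column that differ."""
    n = len(column)
    if n < 2:
        raise ValueError("at least two sequences are needed to form a pair")
    n_pairs = n * (n - 1) // 2  # N choose 2
    n_different = sum(a != b for a, b in combinations(column, 2))
    return n_different / n_pairs


if __name__ == "__main__":
    for col in ("GGGG", "GGAA", "GATC"):
        print(col, mean_pairwise_difference(col))
```

A fully conserved column gives 0, a column in which every symbol differs gives 1, and intermediate columns fall in between.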
[0039] Other possible functional forms of the first functional
transformation, F(MSA), may calculate a distance metric from a
particular species or sequence in the reference alignment. For
example, the function may rank all the sequences in the alignment
according to their Hamming distance from the reference (e.g.,
human) sequence, and then calculate the rank of the first sequence
with a divergent symbol at the relevant position in the alignment,
or if ranking a particular mutation, the rank of the first sequence
matching that particular mutation. Additional functional forms may
be used, for example, using the Hamming distance itself as the
metric instead of the ordinal rankings of sequences by Hamming
distance.
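The rank-based form described in this paragraph may be sketched as follows. The function name `first_divergent_rank`, the 1-based ranking, and the return of `None` when no sequence diverges are illustrative assumptions; ties in Hamming distance are broken here by input order.

```python
def hamming(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def first_divergent_rank(reference, others, j, mutation=None):
    """Rank (1 = closest to the reference) of the first sequence that either
    differs from the reference at column j or, if `mutation` is given,
    carries that particular symbol at column j. Returns None if none does."""
    ranked = sorted(others, key=lambda seq: hamming(reference, seq))
    for rank, seq in enumerate(ranked, start=1):
        if mutation is None and seq[j] != reference[j]:
            return rank
        if mutation is not None and seq[j] == mutation:
            return rank
    return None


if __name__ == "__main__":
    ref = "ACGT"
    others = ["ACGA", "ACTT", "TCTA"]
    print(first_divergent_rank(ref, others, 0))  # only the most distant sequence differs here
    print(first_divergent_rank(ref, others, 3))  # a close relative already differs here
```

Under this sketch, a larger rank suggests that only distantly related sequences tolerate a change at the position, which is consistent with stronger conservation.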
[0040] Additionally, the function F(MSA) need not return a single
value or be a function of a single column in the multiple sequence
alignment. The function may be a composite function of one or more
of the functions previously described in addition to others (e.g.,
F_1, F_2, F_3, . . . ), and may output a vector of
values (e.g., (s_1, s_2, s_3, . . . )) rather than a
single value. The function may also take as input multiple columns
(j), or even the entire alignment, when calculating the value(s),
f, and may also take as input a particular mutation under
consideration, which may or may not affect the calculation of the
values returned by the function.
[0041] The second functional transformation specified by equation
(1), s=S(f), is a function which converts the measure of
evolutionary variation of alleles into a relative propensity or
likelihood of being damaging. Many instantiations of this
function are possible. For example, a function s=S(f) may
score the value(s) of f according to its ranking in the empirical
distribution of values for all mutations or alleles considered, or
that could be considered, based on a collection of multiple
reference sequence alignments. Other scoring methods may also be
used. For example, a function trained to discriminate alleles in a
database that are known or suspected to be damaging from neutral
alleles may be used to assess the likelihood of damage (e.g., using
any variety of statistical models or derived variants from
experimental findings), or a function trained to assess the
likelihood of a mutation reaching a certain frequency in the
population may be used. In all cases, this functional transformation, in
combination with the first, allows particular genetic allele
mutations to be ranked and assessed for likelihood of survival or
damage if produced in a child.
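One possible instantiation of the second transformation, scoring a value by its position in an empirical distribution, may be sketched as below. The function name `empirical_score` and the convention that the score is the fraction of background values at least as large as f (so that low-variation loci score close to 1) are assumptions for illustration only.

```python
from bisect import bisect_left


def empirical_score(f, background):
    """Fraction of background variation values that are >= f.

    A locus whose variation f is among the lowest in the background
    receives a score near 1, suggesting a higher relative propensity
    for a mutation there to be damaging.
    """
    if not background:
        raise ValueError("background distribution must not be empty")
    ordered = sorted(background)
    n_at_least = len(ordered) - bisect_left(ordered, f)
    return n_at_least / len(ordered)


if __name__ == "__main__":
    background = [0.0, 0.5, 1.0, 2.0]  # e.g., column entropies from many loci
    for f in (0.0, 0.75, 2.0):
        print(f, empirical_score(f, background))
```

When many loci are scored against the same background, the background may be sorted once and reused rather than sorted on each call.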
Creating Functional Forms of the Measure of Evolutionary Variation
F(MSA) Using Phylogenetic History for Assessing Likelihoods of
Deleterious Effects of Alleles
[0042] Embodiments of the invention may model the phylogenetic
history of the multiple reference aligned genetic sequences (MSA)
for the purpose of assessing the likelihood of damaging effects of
an allele mutation in a new organism.
[0043] Because DNA replication is semi-conservative, the
evolutionary history of a DNA molecule may be represented by a
bifurcating tree, known as a "phylogenetic" tree, that represents
the known or inferred historical or evolutionary relationships
between present day extant reference genetic sequences. A large
body of scientific literature has developed over the past 30 years
that studies the problem of inferring this tree from present day
sequences. Typically, such models envision the evolutionary process
between nodes of the tree as being similar to a general time
reversible (GTR) Markov chain. In these models, in an interval of
time, t, there is a certain probability that a base in the sequence
will mutate, or transition to another base (e.g., A.fwdarw.C). Such
models may be described using a transition matrix that describes
the relative probability of transition from one base to another,
for example, as shown in equation (4):
\[
Q = \begin{array}{c|cccc}
 & T & C & A & G \\ \hline
T & \cdot & a\pi_C & b\pi_A & c\pi_G \\
C & a\pi_T & \cdot & d\pi_A & e\pi_G \\
A & b\pi_T & d\pi_C & \cdot & f\pi_G \\
G & c\pi_T & e\pi_C & f\pi_A & \cdot
\end{array} \tag{4}
\]
[0044] In equation (4) above, the elements of the transition matrix
Q define a probability that each base denoted by the row will
transition to each base denoted by the column, for example, in an
infinitely small evolutionary time interval Δt. Note that the
diagonal terms in the transition matrix are not shown, as they are
simply equal to one minus the other elements in the row (the
probabilities of the elements in each row sum to 1). The π_i
terms represent the equilibrium frequency of the nucleotide bases
{i=A, C, G, T}, and the symbols a, b, c, d, e and f are parameters
that further govern the substitution dynamics. This matrix
represents a generalized time-reversible model, in which each rate
below the diagonal equals a reciprocal rate above the diagonal
multiplied by the equilibrium ratio of the two bases. Equation (4)
is only one example of a substitution matrix used for phylogenetic
inference.
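A minimal sketch of constructing the matrix of equation (4) is given below. It follows the common convention of treating Q as a rate matrix whose diagonal is set so that each row sums to zero, and then approximates the transition probabilities over a small interval Δt as I + QΔt, whose rows sum to one as described above. That convention, the function names, and the example parameter values are assumptions for illustration and are not taken from the embodiments.

```python
BASES = "TCAG"  # row and column order used in equation (4)


def gtr_rate_matrix(pi, a, b, c, d, e, f):
    """Build Q as a dict keyed by (from_base, to_base).

    Off-diagonal entries are exchangeability * equilibrium frequency of the
    target base; each diagonal entry is minus the sum of the rest of its row.
    """
    exch = {("T", "C"): a, ("T", "A"): b, ("T", "G"): c,
            ("C", "A"): d, ("C", "G"): e, ("A", "G"): f}
    q = {}
    for i in BASES:
        for j in BASES:
            if i != j:
                rate = exch[(i, j)] if (i, j) in exch else exch[(j, i)]
                q[(i, j)] = rate * pi[j]
        q[(i, i)] = -sum(q[(i, j)] for j in BASES if j != i)
    return q


def short_interval_probabilities(q, dt):
    """First-order approximation P(dt) ~ I + Q*dt, valid only for small dt."""
    return {(i, j): (1.0 if i == j else 0.0) + q[(i, j)] * dt
            for i in BASES for j in BASES}


if __name__ == "__main__":
    pi = {"T": 0.1, "C": 0.2, "A": 0.3, "G": 0.4}  # illustrative frequencies
    q = gtr_rate_matrix(pi, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    p = short_interval_probabilities(q, 0.001)
    print(q[("T", "C")], p[("T", "T")])
```

Time reversibility can be checked numerically: for every pair of bases, π_i times the rate from i to j equals π_j times the rate from j to i.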
[0045] Reference is made to FIG. 2, which schematically illustrates
an example of a phylogenetic tree inferred from a multiple sequence
alignment, for example, as shown in FIG. 1, according to an
embodiment of the invention. The length of the branches shown in
the tree may represent or be a function of an inferred number of
substitutions or allele mutations per genetic locus that have
occurred from its direct ancestral sequence. In the example shown
in FIG. 2, a scale bar for the branch lengths is shown for a 0.2
branch length. The phylogenetic tree may be directly inferred using
the Markov model described above. In inferring the tree, the
likelihood of a tree and its branch lengths (e.g., given in units of
expected substitutions) is maximized by varying the tree or branch
lengths, until a tree with the highest likelihood is found.
Alternatively, a prior probability distribution, as used in
Bayesian statistics, may be specified for all parameters in the
phylogenetic model, and several trees may be sampled according to
their posterior probability distribution.
[0046] After fitting the Markov model that accounts for the
phylogenetic history of the sequence alignment, whether using a
Bayesian approach or maximum likelihood approach, embodiments of
the invention may provide not only an inferred phylogenetic tree,
but also a model for the evolutionary process of the multiple
reference sequences in that tree.
[0047] Embodiments of the invention may use this evolutionary model
to directly assess the likelihood that a mutation is damaging by
examining the probability that a mutation found in a parent's
genome persists into the future. That is, rather than using the
model to infer the past history of evolution, the model is trained
using the outcomes of the past and then used to calculate the
likelihood of an allele persisting into the future. Because this
likelihood is directly related to the probability that the allele
is damaging (less likely mutations being more likely to be
damaging), this phylogenetic approach, which directly accounts for
the past history of the sequence in a parametric model, is a
uniquely valuable functional form for the F(MSA) specified by
equation (1).
[0048] Important extensions to the phylogenetic model are those
which either change the model to account for sequence context
(e.g., information about sequence location or what a sequence
encodes, such as, methylation or homopolymer status) and functional
effect (e.g., synonymous vs. non-synonymous, or affecting or not
affecting expression), or that partition the sequence in some way
to account for varying rates of substitutions, for example, based
on the location of loci in the genome.
[0049] To account for varying rates of substitutions, for example,
instead of the direct application of the substitution rate matrix in
equation (4) to all sequence substitutions, an alternate model may
specify that although the relative rates of different types of
substitutions at all alignment positions are governed by equation
(4), the global rate at each site (that is, the total mutation rate,
denoted μ) may vary across sites or loci in the genome. A multitude
of such models are possible. For example, the rates at different
genetic loci may be drawn from a parametric distribution, such as a
Γ-distribution that is also fit during the modeling
procedure, or the distribution of rates may be derived from several
categories of possible distributions. In one embodiment, this model
may specify two different distribution categories (such as,
conserved or rapidly evolving categories) and then train the model
to identify to which distribution category the observed sequence
belongs, for example, using a hidden Markov procedure. In some
embodiments of the invention, the inferred or posterior probability
that a genetic locus or mutation belongs to a category in the
phylogenetic model may be returned by the F(MSA) function instead
of the likelihood itself.
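The idea of returning a posterior category probability may be illustrated with a deliberately simplified sketch. Instead of a hidden Markov procedure over a phylogenetic likelihood, each site is treated independently, the number of substitutions observed at the site is assumed to be Poisson distributed with mean equal to the category rate times the total tree length, and Bayes' rule gives the posterior probability of the conserved category. The Poisson assumption, the two rate values, the prior, and the function names are all hypothetical and chosen only for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson mean of lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def posterior_conserved(k, tree_length,
                        rate_conserved=0.1, rate_fast=2.0,
                        prior_conserved=0.5):
    """Posterior probability that a site is in the conserved category,
    given k observed substitutions over a tree of the given total length."""
    like_conserved = poisson_pmf(k, rate_conserved * tree_length)
    like_fast = poisson_pmf(k, rate_fast * tree_length)
    numerator = prior_conserved * like_conserved
    return numerator / (numerator + (1.0 - prior_conserved) * like_fast)


if __name__ == "__main__":
    for k in (0, 1, 5):
        print(k, posterior_conserved(k, tree_length=1.0))
```

A hidden Markov formulation would additionally let the category of one site inform its neighbors; that coupling is omitted here for brevity.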
[0050] To account for functional effects, the phylogenetic model
used to form the likelihood for the F(MSA) function may directly
account for the functional consequence of a mutation. For example,
the coding sequence of a protein is determined by triplets of
neighboring DNA nucleotides that form a functional unit referred to
as a "codon." A mutation in a nucleotide within a codon may either
have a functional effect of changing the amino acid sequence
encoded by that codon (in which case it is referred to as
"non-synonymous" since the mutation encodes for a different amino
acid sequence) or may be a substitution with no functional effect
on the amino acid encoded due to the redundancy of the genetic code
(in which case it is referred to as "synonymous" since the mutation
encodes for the same amino acid sequence). The Markov model that
may be used in predicting the likely damaging effect of a mutation
may directly account for such functional effects. For example, the
transition probability of mutating from nucleotide i to j specified
by the ijth matrix Q element q_ij in equation (4), may be
replaced by an instantaneous transition probability t_ij, for
example, defined by equation (5):
\[
t_{ij} = \begin{cases}
\omega\, q_{ij} & \text{if } i \rightarrow j \text{ is non-synonymous} \\
q_{ij} & \text{if } i \rightarrow j \text{ is synonymous}
\end{cases} \tag{5}
\]
[0051] This would allow a new instantaneous transition matrix, T,
to be used in the model, and a new parameter, ω, which is equal to
the ratio of the non-synonymous to synonymous substitution rates, to
be used in predicting the likelihood that an allele is damaging or
that it persists into the future. In practice, the ω parameter may be
constant for an entire multiple sequence alignment, may be assigned
to each codon position in the alignment by assuming they are drawn
from some hierarchical distribution, or may be uniquely assigned to
each codon position. The substitution model specified by equation
(5) may also be altered to account for each combination of the
64×64 possible elements of a transition matrix representing
the rate at which each of the 64 possible codons (e.g., the
4^3=64 different combinations of four nucleotide states (A, T,
C, G) at three nucleotide positions in each codon) transitions to
each of the 64 possible codons. In all instantiations of the
evolutionary model, the functional effect of a sequence change,
whether on the amino acid, regulatory context or other biological
context may be directly accounted for, and used to predict the
likelihood that the allele is damaging.
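The adjustment of equation (5) may be sketched as follows for a single-nucleotide change within a codon. The sketch assumes the standard genetic code, treats a change to or from a stop codon as non-synonymous, and uses hypothetical helper names (`is_synonymous`, `adjusted_rate`); the value of ω in the example is arbitrary.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def is_synonymous(codon, position, new_base):
    """True if replacing codon[position] with new_base keeps the amino acid."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return CODON_TABLE[codon] == CODON_TABLE[mutated]


def adjusted_rate(q_ij, codon, position, new_base, omega):
    """Equation (5): scale the nucleotide rate q_ij by omega only when the
    change alters the encoded amino acid."""
    if is_synonymous(codon, position, new_base):
        return q_ij
    return omega * q_ij


if __name__ == "__main__":
    # GGT -> GGC keeps glycine; ATG -> ATA changes methionine to isoleucine.
    print(adjusted_rate(1.0, "GGT", 2, "C", omega=0.2))
    print(adjusted_rate(1.0, "ATG", 2, "A", omega=0.2))
```

A value of ω below 1 slows non-synonymous changes relative to synonymous ones, which corresponds to selection against amino acid replacements at the site.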
[0052] Reference is made to FIG. 3, which schematically illustrates
a system 300 according to an embodiment of the invention.
[0053] System 300 may include a genetic sequencer 302, a sequence
aligner 304 and/or a sequence analyzer 306. Units 302-306 may be
implemented in one or more computerized devices as hardware or
software units, for example, specifying instructions configured to
be executed by a processor. One or more of units 302-306 may be
implemented as separate devices or combined as an integrated
device.
[0054] Genetic sequencer 302 may input biological samples, such as,
blood, tissue, or saliva, or information derived therefrom, of each
real (living) potential parent and may output the potential
parent's genetic sequence including the individual's genetic
information at one or more genetic loci, for example, a human
genome.
[0055] Sequence analyzer 306 may input two potential parents'
genetic sequences to simulate a mating by combining genetic
information therefrom and output a virtual progeny genetic sequence
of a virtual gamete, for example, as described in reference to FIG.
4.
[0056] Sequence aligner 304 may align one or more loci of the
virtual progeny genetic sequence and a plurality of reference
genetic sequences of extant organisms from one or more species.
[0057] Sequence analyzer 306 may input the multiple sequence
alignment and may compute a measure f of evolutionary variation of
alleles at one or more genetic loci, which may be transformed into
one or more likelihoods or scores s associated with a relative
propensity that these alleles would be damaging if produced in a
child.
[0058] Genetic sequencer 302, sequence aligner 304, and sequence
analyzer 306 may include one or more controller(s) or processor(s)
308, 310, and 312, respectively, configured for executing
operations and one or more memory unit(s) 314, 316, and 318,
respectively, configured for storing data such as genetic
information or sequences and/or instructions (e.g., software)
executable by a processor, for example for carrying out methods as
disclosed herein. Processor(s) 308, 310, and 312 may include, for
example, a central processing unit (CPU), a digital signal
processor (DSP), a microprocessor, a controller, a chip, a
microchip, an integrated circuit (IC), or any other suitable
multi-purpose or specific processor or controller. Processor(s)
308, 310, and 312 may individually or collectively be configured to
carry out embodiments of a method according to the present
invention by for example executing software or code. Memory unit(s)
314, 316, and 318 may include, for example, a random access memory
(RAM), a dynamic RAM (DRAM), a flash memory, a volatile memory, a
non-volatile memory, a cache memory, a buffer, a short term memory
unit, a long term memory unit, or other suitable memory units or
storage units. Genetic sequencer 302, sequence aligner 304, and
sequence analyzer 306 may include one or more input/output devices,
such as output display 320 (e.g., a monitor or screen) for
displaying to users results provided by sequence analyzer 306 and
an input device 322 (e.g., a mouse, keyboard or
touchscreen) for example to control the operations of system 300
and/or provide user input or feedback, such as, selecting one or
more models or phylogenetic trees, selecting one or more genera or
species to use for generating the models, selecting input genetic
sequences, selecting two potential parents or multiple donor
candidates from a pool of potential parents with which to simulate
mating, selecting a number of iterations for simulating a mating
with a different pair of virtual gametes in each iteration from
each pair of potential parents, etc.
[0059] Reference is made to FIG. 4, which schematically illustrates
an example of simulating a hypothetical mating of two (i.e. a first
and a second) potential parents for generating a virtual progeny
according to an embodiment of the invention.
[0060] For each of the two potential parents, a processor (e.g.,
sequence analyzer processor 312 of FIG. 3) may receive a potential
parent's diploid genetic sequence 402, 404. A "diploid" genetic
sequence includes two alleles from the two sets of chromosomes
respectively labeled "A" and "B" at each genetic locus of a diploid
cell of the potential parent, whereas a "haploid" genetic sequence
includes one allele from one chromosome at each genetic locus of a
haploid cell of the potential parent. For each of the two potential
parents' diploid genetic sequences 402 and 404, the processor may
simulate genetic recombination of the two sets of chromosomes A and
B from the parent's diploid genetic sequence 402, 404 (having two
alleles at each genetic locus) to generate a virtual gamete haploid
genetic sequence 406, 408 (having one allele per genetic locus).
The processor may simulate recombination by progressing
locus-by-locus along a "haplopath" through each parent's diploid
genetic sequence 402, 404 and selecting one of the two alleles at
each genetic locus (either the allele in chromosome A or the allele
in chromosome B). The selection of alleles may be at least
partially random and/or at least partially non-random, for example,
based on defined correlations between alleles at different loci
referred to as "linkage disequilibrium". The haploid genetic
sequence may mimic or simulate recombination of the genetic
material in the two chromosomes A and B to form a discrete haploid
genetic sequence of a virtual gamete 406, 408, e.g., a virtual
sperm or virtual egg.
[0061] The two virtual gamete haploid genetic sequences 406 and 408
for the two respective potential parents may be combined to
simulate a mating between the first and second potential parents
resulting in a virtual progeny diploid genetic sequence 410 (a
discrete genome of a child potentially to be conceived).
[0062] Since the selection of alleles is at least partially random,
this mating is just one of the many possible genetic combinations
for the first and second potential parents. This process may be
repeated multiple times (e.g., hundreds or thousands of times),
each time following a different recombination path (e.g., a
different sequence of alleles selected) for one or both of the
potential parents, to generate multiple genetic permutations that
are possible for mating the first and second potential parents. The
virtual progeny diploid genetic sequence 410 may include a single
(e.g., most probable) genetic sequence or a probability
distribution of multiple possible sequences, for example, to
indicate, for many possible matings, the overall likelihood of each
of multiple alleles at each of one or more loci in a virtual or
hypothetical progeny.
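The simulated mating of FIG. 4 may be sketched as below. Each parent's diploid sequence is represented as a list of (allele on chromosome A, allele on chromosome B) pairs, a virtual gamete is formed by walking locus-by-locus and occasionally switching chromosomes, and repeated matings are tallied into a per-locus genotype distribution. The fixed per-locus switching probability is a crude stand-in for recombination and linkage, and all names and default values are assumptions for illustration.

```python
import random
from collections import Counter


def virtual_gamete(diploid, rng, switch_prob=0.01):
    """Haploid sequence obtained by following one path through the diploid
    sequence, switching between chromosomes A and B with switch_prob."""
    strand = rng.randrange(2)  # start on chromosome A (0) or B (1) at random
    gamete = []
    for alleles in diploid:
        if rng.random() < switch_prob:
            strand = 1 - strand
        gamete.append(alleles[strand])
    return gamete


def virtual_progeny(parent1, parent2, rng, switch_prob=0.01):
    """Diploid virtual progeny: one virtual gamete from each parent."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must be genotyped at the same loci")
    g1 = virtual_gamete(parent1, rng, switch_prob)
    g2 = virtual_gamete(parent2, rng, switch_prob)
    return list(zip(g1, g2))


def genotype_distribution(parent1, parent2, n_iter=1000, seed=0,
                          switch_prob=0.01):
    """Per-locus frequency of each unordered genotype over n_iter matings."""
    rng = random.Random(seed)
    counts = [Counter() for _ in parent1]
    for _ in range(n_iter):
        child = virtual_progeny(parent1, parent2, rng, switch_prob)
        for locus, genotype in enumerate(child):
            counts[locus][tuple(sorted(genotype))] += 1
    return [{g: c / n_iter for g, c in ctr.items()} for ctr in counts]


if __name__ == "__main__":
    mother = [("A", "a"), ("B", "B")]
    father = [("a", "a"), ("B", "b")]
    print(genotype_distribution(mother, father, n_iter=1000, seed=1))
```

A seeded random generator is used so that a run can be reproduced; a production simulation would typically use empirically derived recombination rates rather than a single constant.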
[0063] Embodiments of the invention may use methods for simulating
a mating between two potential parents and generating a virtual
progeny genetic sequence described in U.S. Pat. No. 8,805,620,
which is incorporated herein by reference in its entirety. Other
methods may also be used.
[0064] Once the virtual progeny genetic sequence 410 is generated,
it may be aligned with one or more reference genetic sequences of
one or more organisms in a multiple sequence alignment (MSA). Based
on a measure of variation between organisms at the aligned genetic
loci, the virtual progeny may be assigned one or more of the
likelihoods that one or more alleles or mutations in the virtual
genetic sequence 410 would be deleterious, for example, if
replicated in a real living progeny.
[0065] Reference is made to FIG. 5, which is a flowchart of a
method for using the past evolution of multiple organisms to
predict deleterious mutations in future virtual, hypothetical or
simulated (non-living) potential progeny, in accordance with an
embodiment of the invention.
[0066] In operation 500, a processor (e.g., sequence analyzer
processor 312 of FIG. 3) may receive multiple aligned reference
genetic sequences of multiple extant organisms representative of
one or more species or populations (e.g., as shown in FIG. 1). The
reference genetic sequences may be sequenced by a genetic sequencer
(e.g., sequencer 302 of FIG. 3) or pre-stored and retrieved from a
memory or database (e.g., one or more of memory unit(s) 314 and/or
318) and may be aligned by a sequence aligner (e.g., sequence
aligner 304 of FIG. 3) or pre-aligned in the memory or
database.
[0067] In operation 510, the processor may build or obtain a model
representing measures of evolutionary variation of alleles or
nucleotides at one or more aligned genetic loci between the
multiple organisms. The model may be a single-species model (e.g.,
the multiple organisms are from the same single species) or a
multi-species model (e.g., the multiple organisms are from
different multiple species). The model may include, for example, a
phylogenetic tree (e.g., as shown in FIG. 2), or another data
structure.
[0068] In operation 520, the processor may receive genetic
sequences of two potential parents (e.g., potential parent genetic
sequences 402 and 404 of FIG. 4). The sequences may be derived by a
sequencer (e.g., genetic sequencer processor 308 of FIG. 3) from
biological samples, such as, blood, saliva, etc., of the two
potential parents. The two potential parents may include two
individuals interested in mating together, or one individual
interested in conceiving a child and another individual from a
group of genetic donors. The biological samples for different
potential parents may be obtained and sequenced at the same or
different times and may be stored for later analysis.
[0069] In operation 530, the processor may simulate a mating
between the two potential parents by combining their genetic
sequences to generate one or more virtual progeny genetic sequences
(e.g., sequence 410 of FIG. 4). The processor may generate a
virtual gamete (haploid genetic sequence) for each potential parent
by at least partially randomly selecting one of two allele copies
in the parent's two chromosomes (diploid genetic sequence) to
simulate recombination at each of a sequence of genetic loci. A
virtual gamete for each of the two potential parents (e.g., one
virtual sperm and one virtual egg) may be combined to generate the
genetic sequence of the virtual progeny. Multiple virtual gametes
may be generated for each potential parent by repeating the
recombination process, each time selecting a different, at least
partially random, sequence of alleles. Multiple virtual progeny
genetic sequences may be generated for multiple pairs of potential
parents by repeating the step of combining two virtual gametes for
each of a plurality of different combinations of two virtual
gametes. In one embodiment, the independent carrier status of an
individual may be determined by simulating a mating combining the
individual's genetic sequence information with that of a sample,
averaged, or reference genetic sequence of the same species.
[0070] In operation 540, the processor may use the model of
operation 510 defining the evolutionary past variation among
multiple extant organisms of different populations or species to
predict or interpolate a likelihood or probability of evolutionary
health of the virtual progeny simulated in operation 530. The
processor may determine the differences between the virtual progeny
genetic sequence and one or more aligned reference genetic
sequences and may assign each allele (or only different or mutated
alleles) a measure of evolutionary variation that is a function of
variations in alleles at corresponding aligned genetic loci in the
multiple aligned genetic sequences (e.g., loci derived from one or
more common ancestral genetic loci in the multiple organisms). The
processor may compute one or more likelihoods that an allele
mutation at each of the one or more genetic loci in the simulated
virtual progeny will be deleterious based on the measure of
evolutionary variation of alleles at the corresponding aligned
genetic loci for the multiple organisms. The likelihoods may
include one or more likelihoods or likelihood distributions for one
or more alleles, one or more allele mutations, one or more genes,
one or more codons, one or more genetic loci or loci segments, for
one or more virtual progeny of two potential parents (e.g.,
generated by repeatedly simulating a mating using different virtual
gamete(s) in each iteration) and/or for one or more pairs of
potential parents (e.g., generated by repeatedly simulating a
mating step, in each iteration using the genetic information of the
same first one of the two potential parents and a different second
one of the two potential parents iteratively selected from a
plurality of genetic donor candidates). The one or more likelihoods
may be compared to one or more thresholds or other statistical
models to predict whether (or a likelihood or degree to which) an allele
mutation will be deleterious in the virtual progeny. For example,
mutations at genetic loci with relatively constant or fixed alleles
and relatively lower measures of evolutionary variation may be
associated with relatively higher likelihoods of deleterious
traits, whereas mutations at genetic loci with relatively volatile
or changing alleles and relatively higher measures of evolutionary
variation may be associated with relatively lower likelihoods of
resulting in deleterious traits.
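The relationship described above, in which loci with lower evolutionary variation are associated with higher likelihoods of deleterious traits, may be sketched as follows. This is only an illustrative sketch under stated assumptions: normalized Shannon entropy is swapped in as one possible measure of evolutionary variation, and "one minus variation" is used as a hypothetical monotone score. Neither choice is stated in the text, the score is not a calibrated probability, and the function names are hypothetical. The alignment is assumed to be gap-free and of equal length.

```python
import math
from collections import Counter


def column_variation(column):
    """Normalized Shannon entropy of one alignment column.

    Returns 0 for a fixed column and 1 when the four nucleotides are equally
    frequent. Entropy is an assumed stand-in for the measure of evolutionary
    variation; the text does not specify a formula.
    """
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropy / 2.0  # log2(4) = 2 bits for four nucleotides


def deleterious_scores(aligned_sequences, progeny_alleles):
    """Score progeny alleles that are not observed in the aligned organisms.

    Returns {locus_index: score}, where a higher score marks a locus that is
    more conserved across the alignment. The mapping 1 - variation is a
    hypothetical choice for illustration only.
    """
    scores = {}
    for locus, allele in enumerate(progeny_alleles):
        column = [seq[locus] for seq in aligned_sequences]
        if allele in column:
            continue  # allele already seen among the aligned organisms
        scores[locus] = 1.0 - column_variation(column)
    return scores
```

The resulting scores could then be compared to one or more thresholds, as the paragraph above describes.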
[0071] In operation 550, an output device (e.g., output device 320
of FIG. 3) may output or display, e.g., to a user, the one or more
likelihoods or likelihood distributions that a hypothetical child
conceived by the two potential parents having the virtual progeny
genetic sequence generated in operation 530 would have deleterious
traits, or other data generated in operation 540. For example, the
output device may output one or more likelihoods that an allele
mutation at each of the one or more genetic loci in the simulated
virtual progeny will be deleterious based on the measure of
evolutionary variation of alleles at the corresponding aligned
genetic loci for the multiple organisms.
[0072] Other or different operations or orders of operations may be
used and operations may be repeated, e.g., until the likelihoods
converge or asymptotically approach a statistically stable
result.
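One way to repeat operations until the likelihoods approach a statistically stable result may be sketched as below. The stopping rule (comparing the running estimate at successive checkpoints), the parameter names tolerance, window, and max_iterations, and the function name estimate_until_stable are all assumptions made for illustration and are not taken from the text.

```python
import random


def estimate_until_stable(simulate_once, rng, tolerance=0.01,
                          window=1000, max_iterations=100_000):
    """Repeat a simulated mating until the running estimate stops moving.

    simulate_once(rng) must return True (or 1) when the simulated virtual
    progeny is flagged and False (or 0) otherwise. Every `window` iterations
    the running fraction is compared with its value at the previous
    checkpoint; the loop stops once the change is below `tolerance`.
    Returns (estimate, iterations_used).
    """
    flagged = 0
    previous = None
    for i in range(1, max_iterations + 1):
        flagged += simulate_once(rng)
        if i % window == 0:
            estimate = flagged / i
            if previous is not None and abs(estimate - previous) < tolerance:
                return estimate, i
            previous = estimate
    return flagged / max_iterations, max_iterations
```

A checkpoint comparison of this kind is a heuristic; it does not by itself give a confidence interval for the estimate.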
[0073] In accordance with embodiments of the present invention and
as used herein, the following terms are defined with the following
meanings, unless explicitly stated otherwise.
[0074] As used herein, "haploid cell" refers to a cell with a
single set (n) of unpaired chromosomes.
[0075] "Gametes", as used herein, are specialized haploid cells
(e.g., spermatozoa and oocytes) produced through the process of
meiosis and involved in sexual reproduction.
[0076] As used herein, "diploid cell" refers to a cell that has a
homologous pair of each of its autosomal chromosomes, and thus has
two copies (2n) of each autosomal genetic locus.
[0077] The term "chromosome", as used herein, refers to a molecule
of DNA with a sequence of base pairs that corresponds closely to a
defined chromosome reference sequence of the organism in
question.
[0078] The term "gene", as used herein, refers to a DNA sequence in
a chromosome that codes for a product (either RNA or its
translation product, a polypeptide) or otherwise plays a role in
the expression of said product. A gene contains a DNA sequence with
biological function. The biological function may be contained
within the structure of the RNA product or a coding region for a
polypeptide. The coding region includes a plurality of coding
segments ("exons"), intervening non-coding sequences ("introns")
between individual coding segments, and non-coding regions preceding
and following the first and last coding regions, respectively.
[0079] As used herein, "locus" refers to any segment of DNA
sequence defined by chromosomal coordinates in a reference genome
known to the art, irrespective of biological function. A DNA locus
may contain multiple genes or no genes; it may be a single base
pair or millions of base pairs.
[0080] As used herein, an "allele" is one of two or more existing
genetic variants of a specific polymorphic genomic locus.
[0081] As used herein, "genotype" refers to the diploid combination
of alleles at a given genetic locus, or set of related loci, in a
given cell or organism. A homozygous subject carries two copies of
the same allele and a heterozygous subject carries two distinct
alleles. In the simplest case of a locus with two alleles "A" and
"a", three genotypes may be formed: A/A, A/a, and a/a.
[0082] As used herein, "genotyping" refers to any experimental,
computational, or observational protocol for distinguishing an
individual's genotype at one or more well-defined loci.
[0083] As used herein, "linkage disequilibrium" is the non-random
association of alleles at two or more loci within a particular
population. Linkage disequilibrium is measured as a departure from
the null hypothesis of linkage equilibrium, where each allele at
one locus associates randomly with each allele at a second locus in
a population of individual genomes.
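The departure from linkage equilibrium for a pair of loci is commonly expressed as the coefficient D = p(AB) - p(A) p(B), which is zero when the alleles associate randomly. A minimal sketch of that calculation follows, assuming the population is given as a list of two-locus haplotypes; the function name and input layout are hypothetical.

```python
def linkage_disequilibrium(haplotypes, allele_1, allele_2):
    """Return D = p(allele_1, allele_2) - p(allele_1) * p(allele_2).

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) tuples, one
                per sampled haploid genome (an assumed representation).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    p1 = sum(h[0] == allele_1 for h in haplotypes) / n
    p2 = sum(h[1] == allele_2 for h in haplotypes) / n
    p12 = sum(tuple(h) == (allele_1, allele_2) for h in haplotypes) / n
    return p12 - p1 * p2
```

For example, a population made up only of A-B and a-b haplotypes in equal numbers gives D = 0.25 for alleles A and B, while a population with all four haplotypes in equal numbers gives D = 0.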
[0084] As used herein, a "genome" is the total genetic information
carried by an individual organism or cell, represented by the
complete DNA sequences of its chromosomes.
[0085] As used herein, a genetic "trait" is a distinguishing
attribute of an individual, whose expression is fully or partially
influenced by an individual's genetic constitution.
[0086] As used herein, "disease" refers to a trait that is at least
partially heritable and causes a reduction in the quality of life
of an individual person.
[0087] As used herein, a "phenotype" includes alternative traits
which may be discrete or continuous. Phenotypes may include both
traits and diseases.
[0088] As used herein, a "haplopath" is a haploid path laid out
along a defined region of a diploid genome by a single iteration of
a Monte Carlo simulation or a single chain generated through a
Markov process. A haplopath is generated through the application of
formal rules of genetics that describe the reduction of the diploid
genome into haploid genomes through the natural process of meiosis.
It may be formed by starting at one end of a personal chromosome or
genome and walking from locus to locus, choosing a single allele at
each step based on available linkage disequilibrium information,
inter-locus allele association coefficients, and formal rules of
genetics that describe the natural process of gamete production in
a sexually reproducing organism.
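The locus-to-locus walk described above may be sketched as a simple two-state Markov chain, in which the state is the parental chromosome copy currently being followed. The per-interval switching probabilities (switch_probs) are a hypothetical stand-in for the linkage disequilibrium information and inter-locus association coefficients mentioned in the text, and the function name walk_haplopath is likewise an assumption.

```python
import random


def walk_haplopath(diploid, switch_probs, rng):
    """Walk along the loci, choosing one allele at each step.

    diploid:      list of (allele_copy_0, allele_copy_1) tuples, one per locus.
    switch_probs: switch_probs[i] is the assumed probability of switching to
                  the other chromosome copy between locus i and locus i + 1.
    Returns (path, alleles): the copy index followed at each locus and the
    allele selected there.
    """
    if len(switch_probs) != len(diploid) - 1:
        raise ValueError("need one switch probability per adjacent locus pair")
    copy = rng.randrange(2)  # starting chromosome copy is chosen at random
    path = [copy]
    for p in switch_probs:
        if rng.random() < p:
            copy = 1 - copy  # cross over to the other copy
        path.append(copy)
    alleles = [pair[c] for pair, c in zip(diploid, path)]
    return path, alleles
```

With every switching probability set to zero the walk returns one parental chromosome copy unchanged, and with every probability set to 0.5 it reduces to independent selection at each locus.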
[0089] A "virtual gamete" is a data structure representing an
imaginary non-existing gamete, for example, simulated by at least
partially randomly selecting genetic information from both
chromosomes of a single potential parent genetic sequence. A
virtual gamete may represent information selected along a single
haplopath that extends across one or more loci, such as an entire
genome.
[0090] As used herein, a "virtual progeny genetic sequence" is a
data structure representing the genetic information of an imaginary
non-existing virtual progeny. The virtual progeny genetic sequence
is, for example, a discrete genetic combination of two virtual
gametes.
[0091] As used herein, a "variant" is a particular allele at a
locus where at least two alleles have been identified.
[0092] As used herein, a "mutation" has the same meaning as a
"mutant allele", which is a variant that causes a gene to function
abnormally.
[0093] Embodiments of the invention may include an article such as
a computer- or processor-readable non-transitory storage medium,
such as, for example, a memory, a disk drive, or a USB flash memory
device, encoding, including, or storing instructions, e.g.,
computer-executable instructions, which when executed by a
processor or controller, cause the processor or controller to carry
out methods disclosed herein.
[0094] Different embodiments are disclosed herein. Features of
certain embodiments may be combined with features of other
embodiments; thus certain embodiments may be combinations of
features of multiple embodiments.
[0095] The foregoing description of the embodiments of the
invention has been presented for the purposes of illustration and
description. It is not intended to be exhaustive or to limit the
invention to the precise form disclosed. It should be appreciated
by persons skilled in the art that many modifications, variations,
substitutions, changes, and equivalents are possible in light of
the above teaching. It is, therefore, to be understood that the
appended claims are intended to cover all such modifications and
changes as fall within the true spirit of the invention.
SEQUENCE LISTING
Number of SEQ ID NOS: 36
SEQ ID NO | Length | Type | Organism | Sequence
1 | 27 | DNA | Fugu rubripes | aggtggatcc cacaaaacgt caaattc
2 | 27 | DNA | Tetraodon nigroviridis | aggtggaccc ctcgaaatgt cacattc
3 | 27 | DNA | Oreochromis niloticus | aagtggatcc cactaaacgt cagatcc
4 | 27 | DNA | Gasterosteus aculeatus | aggtggaccc cactgagcgc cgcgtcc
5 | 27 | DNA | Oryzias latipes | aggtggaccc ctccaaacgg cagattc
6 | 27 | DNA | Gadus morhua | aacttgatcc cactaaacgc aagattc
7 | 27 | DNA | Danio rerio | aggtggaccc aagcaaaagg cgaatcc
8 | 27 | DNA | Macropus eugenii | aggtggaccc atccaagaga agagtcc
9 | 27 | DNA | Monodelphis domestica | aggttgaccc atccaagaga aaagttc
10 | 27 | DNA | Sarcophilus harrisii | aggtggaccc atccaagagc agggtcc
11 | 27 | DNA | Rattus norvegicus | aagtggaccc atcccggcag cgtgtgc
12 | 27 | DNA | Mus musculus | aagtggaccc atcccagcag cgtgtgc
13 | 27 | DNA | Dipodomys ordii | aggtggatgg gtcccagcgg cgtgtgc
14 | 27 | DNA | Spermophilus tridecemlineatus | aggtggaccc atcccagcgg catgtgc
15 | 27 | DNA | Cavia porcellus | aggtggaccc ctcaaggcag cgcgtgc
16 | 27 | DNA | Ochotona princeps | ccgtggaccc gtcccggcgg caggtgc
17 | 27 | DNA | Tupaia belangeri | aggtggaccc atcccagagg cgcgtgc
18 | 27 | DNA | Homo sapiens | aggtggaccc atcgcggata catgtgc
19 | 27 | DNA | Pan troglodytes | aggtggaccc atcgcggata cacgtgc
20 | 27 | DNA | Gorilla gorilla | aggtggaccc atcgcggaca catgtgc
21 | 27 | DNA | Pongo abelii | aggtggaccc atctcagaca catgtgc
22 | 28 | DNA | Nomascus leucogenys | aggtggaccc catcacggac acatgtgc
23 | 28 | DNA | Macaca mulatta | aggtggaccc cgtcacagac acgtgtgc
24 | 27 | DNA | Callithrix jacchus | aggtggaccc gtcggggagg caggtgc
25 | 27 | DNA | Otolemur garnettii | aggtggaccc ctcccagagg caggtgc
26 | 27 | DNA | Ailuropoda melanoleuca | aggtggaccc atcccggagc agagtgc
27 | 27 | DNA | Myotis lucifugus | aggtggaccc gtcccggagg caggtgc
28 | 27 | DNA | Tursiops truncatus | aggtggaccc gacccagggc cgagtgc
29 | 27 | DNA | Equus caballus | aggtggaccc atcccagagc cgtgtgc
30 | 27 | DNA | Loxodonta africana | aggtggaccc atcccggcgc caggtac
31 | 27 | DNA | Ornithorhynchus anatinus | aggtggaccc gtcggcccgg agagtcc
32 | 27 | DNA | Gallus gallus | aagttgactc caccaagaaa agggtac
33 | 27 | DNA | Meleagris gallopavo | aagttgactc aaccaagaaa agggtgc
34 | 27 | DNA | Taeniopygia guttata | aggttgaccc aagtaagaaa aaggtgc
35 | 27 | DNA | Xenopus tropicalis | aggttgattt atctaaaaga aaagttc
36 | 27 | DNA | Latimeria chalumnae | aggttgatcc atccaatagg agggttc
* * * * *
References