U.S. patent application number 13/471370 was filed with the patent office on 2012-05-14 and published on 2012-12-06 for predicting gene variant pathogenicity.
This patent application is currently assigned to UNIVERSITY OF UTAH. Invention is credited to David K. Crockett, Perry G. Ridge, Andrew Wilson.
Application Number | 20120310539 13/471370 |
Family ID | 47140060 |
Publication Date | 2012-12-06 |
United States Patent Application | 20120310539 |
Kind Code | A1 |
Crockett; David K.; et al. | December 6, 2012 |
PREDICTING GENE VARIANT PATHOGENICITY
Abstract
A computer-implemented, gene-specific prediction tool for
classifying and interpreting gene tests is described. The
prediction tool includes a predictor using a consensus framework.
The predictor employs a weighted metric of existing and
complementary prediction algorithms and calculated reference
intervals of disease outcomes to calculate a consensus score used
in interpreting gene tests.
Inventors: | Crockett; David K. (South Jordan, UT); Ridge; Perry G. (Orem, UT); Wilson; Andrew (Salt Lake City, UT) |
Assignee: | UNIVERSITY OF UTAH, Salt Lake City, UT |
Family ID: | 47140060 |
Appl. No.: | 13/471370 |
Filed: | May 14, 2012 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
61518833 | May 12, 2011 | |
Current U.S. Class: | 702/19 |
Current CPC Class: | G16B 50/00 20190201; G16B 40/00 20190201; G16B 20/00 20190201; G06N 5/046 20130101; C12Q 2500/00 20130101 |
Class at Publication: | 702/19 |
International Class: | G06F 19/00 20110101; G06F019/00 |
Claims
1. A method of interpreting biologic information, comprising:
selecting a plurality of classifiers, each trained using data
relating to a gene variant and to a respective phenotype associated
with the variant, for interpreting a genetic test; obtaining an
interpretation of the genetic test from each selected classifier;
determining a correlation matrix including a numerical correlation
of each of a plurality of pairs of classifiers, the numerical
correlation indicating a correlation between members of the
respective pair; by a processor, performing factor analysis of the
correlation matrix to determine whether the interpretation obtained
from a classifier is statistically independent from interpretations
of remaining classifiers, and, in an event the classifier is not
statistically independent from the interpretations of the remaining
classifiers, obtaining a consensus score for each classifier as a
function of a reduced matrix obtained from the factor analysis of
the correlation matrix; and comparing an overall consensus score,
calculated using the consensus scores determined for each
classifier, against a predetermined range including a threshold;
and when the overall consensus score falls within the range,
reporting a genetic test interpretation associated with the
threshold as an outcome of the genetic test.
2. The method of claim 1, further comprising obtaining the
consensus score for each classifier as a function of a linear sum
of the interpretations obtained from the classifier in an event the
interpretations of all classifiers are statistically independent
from one another.
3. The method of claim 1, further comprising obtaining the
consensus score for each classifier as a function of a scalar
product of the reduced matrix and the interpretation obtained from
the classifier.
4. The method of claim 1, wherein performing factor analysis of the
correlation matrix includes performing at least one of regression
modeling, common factor analysis, or principal component analysis
of the correlation matrix.
5. The method of claim 1, further comprising: accessing a database
including the data relating to a gene variant and its phenotype,
and training each classifier using the data obtained from the
database.
6. The method of claim 1, wherein the data relating to the gene
variant and its phenotypes include Rearranged During Transformation
(RET) proto-oncogene data.
7. The method of claim 1, further comprising determining the
predetermined range by determining a first numerical consensus
score reference interval for a benign phenotype associated with a
gene variant and a second numerical consensus score reference
interval for a pathogenic phenotype associated with a gene
variant.
8. The method of claim 7, further comprising comparing the overall
consensus score against the first and second numerical consensus
score reference intervals, and reporting a benign gene test outcome
in an event the consensus score falls within the first numerical
interval or reporting a pathogenic gene test outcome in an event
the consensus score falls within the second numerical interval.
9. The method of claim 1, further comprising obtaining a numerical
representation of the interpretation obtained from each classifier,
the numerical representation including at least one of mean,
median, standard deviation, minimum, or maximum of numerical values
output from the classifier.
10. The method of claim 9, further comprising determining the
correlation matrix as a function of numerical representations of
the plurality of the classifiers.
11. The method of claim 1, wherein determining the numerical
correlation between each pair of classifiers includes determining
Spearman's rank correlation coefficient for the pair of
classifiers.
12. The method of claim 1, wherein reporting the genetic test
interpretation includes graphically displaying the overall
consensus score on the predetermined range.
13. The method of claim 1, wherein reporting the genetic test
interpretation includes displaying the consensus scores of the
plurality of classifiers on a radial plot.
14. The method of claim 1, further comprising reporting an
uncertain outcome for the genetic test in an event the overall
consensus score falls outside of the predetermined range.
15. The method of claim 14, further comprising graphically
reporting the uncertain outcome, the graphical report including
display of the overall consensus score superimposed on the
predetermined range.
16. The method of claim 14, further comprising graphically
reporting the uncertain outcome, the graphical report including
display of the consensus scores of the plurality of classifiers on
a radial plot.
17. A non-transitory computer-readable medium encoded with a
computer program comprising instructions executable by a processor
for: selecting a plurality of classifiers, each trained using data
relating to a gene variant and to a respective phenotype associated
with the variant, for interpreting the genetic test; obtaining an
interpretation of the genetic test from each selected classifier;
determining a correlation matrix including a numerical correlation
of each of a plurality of pairs of classifiers, the numerical
correlation indicating a correlation between members of the
respective pair; performing factor analysis of the correlation
matrix to determine whether the interpretation obtained from a
classifier is statistically independent from interpretations of
remaining classifiers, and, in an event the classifier is not
statistically independent from the interpretations of the remaining
classifiers, obtaining a consensus score for each classifier as a
function of a reduced matrix obtained from the factor analysis of
the correlation matrix; and comparing an overall consensus score,
calculated using the consensus scores determined for each
classifier, against a predetermined range including a threshold;
and when the overall consensus score falls within the range,
reporting a genetic test interpretation associated with the
threshold as an outcome of the genetic test.
18. The non-transitory computer-readable medium of claim 17,
further comprising obtaining the consensus score for each
classifier as a function of a linear sum of the interpretations
obtained from the classifier in an event the interpretations of all
classifiers are statistically independent from one another.
19. The non-transitory computer-readable medium of claim 17,
further comprising obtaining the consensus score for each
classifier as a function of a scalar product of the reduced matrix
and the interpretation obtained from the classifier.
20. The non-transitory computer-readable medium of claim 17,
wherein performing factor analysis of the correlation matrix
includes performing at least one of regression modeling, common
factor analysis, or principal component analysis of the correlation
matrix.
Description
RELATED APPLICATIONS
[0001] The present application claims the benefit of priority under
35 U.S.C. § 119 from U.S. Provisional Patent Application Ser.
No. 61/518,833 entitled "Decision Support for Uncertain Gene
Variants," filed on May 12, 2011, which is hereby incorporated by
reference in its entirety for all purposes.
BACKGROUND
[0002] Medical genetics involves diagnosis, management, and
determination of risk of hereditary disorders. Understanding the
genotype-phenotype correlation of gene variants in disease is a
major component of medical genetics. In monogenic diseases, gene
mutations are typically curated as either "pathogenic" or "benign."
However, many gene variants (i.e., gene mutations) are classified
as being "unknown" or "uncertain" because they cannot be clearly
associated with a clinical phenotype. Accurate interpretation of
gene testing, including accurate phenotype association of gene
variants, is an important component in customization of healthcare
such that decisions and practices provided to a patient are
tailored to the individual patient.
[0003] In recent years, various efforts, such as the Human Variome
Project, 1000 Genomes, and NCBI Genetic Testing Registry, have
resulted in a growing interest in annotation and clinical
interpretation of gene variants in human diseases. Further, with
rapidly evolving technologies (e.g., single nucleotide
polymorphism (SNP) chip genome-wide association studies and
next-generation sequencing), genomic analysis has become faster and
more cost effective, yielding much larger data sets than previously
available. However, there exists a gap between the rapidly growing
collections of genetic variation (i.e., genetic mutation) and
practical clinical implementation. Further, as genetic information
is incorporated into the electronic medical record, new decision
support approaches are needed to provide clinicians with a
preferred course of treatment. Moreover, for decision support rules
to add value, the clinical relevance of laboratory information
should be well understood.
[0004] Gene variant classification is critical in informing
clinicians of the most appropriate course of treatment. To that
end, medical geneticists typically rely on patient history and
family segregation, literature review and trusted colleagues to
stay informed of the phenotype consequences of a given gene
variant. Although computer-based prediction methods may be employed
to classify gene variants, there still exists a lack of a widely
accepted standard computational predictor of mutation severity for
novel or uncertain gene variants in clinical use. Further, existing
prediction methods, despite being actively used in laboratories, do
not offer sufficient accuracy to predict disease phenotype to the
degree necessary to be clinically applicable.
[0005] In recent years, updated recommendations on reporting
and classification of gene variants, including approaches targeted
at determining the clinical significance of variants of uncertain
significance, have been proposed by the American College of
Medical Genetics (ACMG). Further, in order to improve
interpretation of unclassified genetic variants, definitions and
terminology have also been recommended by the International Agency
for Research on Cancer (IARC).
[0006] Despite these recommendations, terms such as "deleterious,"
"mutation," "pathogenic," or "causative of disease" are still being
used in reporting genetic tests. Further, test results such as
"indeterminate," "unknown," "uncertain," "unclassified," or
"undetermined" render interpretation of the significance of a gene
test result difficult. Further compounding this issue, word
modifiers such as "likely," "suspected," "predicted," "mild,"
"moderate," or "severe" often are used to accompany variant
classification.
[0007] The lack of a quantitative metric or a standardized scale
for evaluation of novel or uncertain gene variants renders test
result interpretation difficult and dependent on the location and
expertise at hand. A second and closely related challenge is the
lack of an objective and standardized framework or context to make
that metric meaningful. The quantitative metric and framework for
evaluation become especially critical for interpretation of novel
and uncertain gene variants where there is the obvious lack of
traditional or existing evidence such as family history, pedigree
trios or sib pairs, confirming literature reports, bench assay
biochemical evidence, or colleague consensus of disease
association.
SUMMARY
[0008] Certain embodiments of the present invention relate to a
gene-specific prediction tool for classifying and interpreting a
gene test. The prediction tool may include a predictor implemented
using a consensus framework. In certain embodiments, the consensus
framework may include a weighted metric of existing and
complementary prediction algorithms and calculated reference
intervals of known disease outcomes.
[0009] In certain embodiments, a plurality of classifiers for
interpreting a genetic test is selected from among a selection of
classifiers. Each classifier may be trained using data relating to
gene variants and their known phenotypes. An interpretation of the
genetic test may be obtained from each selected classifier. A
correlation matrix including a numerical correlation of each pair of
classifiers may be determined. The numerical correlation may
indicate a correlation between the members of the pair. Factor
analysis of the correlation matrix may be performed to determine
whether the interpretation obtained from a classifier is
statistically independent from the interpretations of the remaining
classifiers. In an event the classifier is not statistically
independent from the interpretations of the remaining classifiers, a
consensus score for each classifier may be obtained. The consensus
score may be obtained as a function of a reduced matrix obtained
from the factor analysis of the correlation matrix. An overall
consensus score may be calculated using the consensus scores
determined for each classifier. The overall consensus score is
compared against a predetermined range. In an event the overall
consensus score falls within the predetermined range, the genetic
test interpretation associated with the predetermined threshold is
reported as an outcome of the genetic test.
[0010] In certain embodiments, the consensus score for each
classifier may be obtained as a function of a linear sum of the
interpretations obtained from the classifier in an event the
interpretations of all classifiers are statistically independent
from one another.
[0011] In some embodiments, the consensus score for each classifier
may be obtained as a function of a scalar product of the reduced
matrix and the interpretation obtained from the classifier.
[0012] In some embodiments, performing factor analysis of the
correlation matrix includes performing at least one of regression
modeling, common factor analysis, or principal component analysis
of the correlation matrix.
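As an illustration only, the weighting step described in paragraphs [0009]-[0012] might be sketched in Python as follows. The sketch assumes that principal component analysis is the chosen form of factor analysis, that the "reduced matrix" is the first principal component of the correlation matrix, and that its normalized absolute loadings serve as per-classifier weights. None of these choices is specified by the description, and all function names and example values are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def leading_component(corr: Matrix, iterations: int = 200) -> List[float]:
    """Approximate the first principal component of a correlation
    matrix by power iteration (hypothetical helper)."""
    n = len(corr)
    vec = [1.0 / n ** 0.5] * n  # uniform, unit-length starting vector
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in nxt) ** 0.5
        if norm == 0.0:
            # Happens only if the starting vector lies in the null space.
            raise ValueError("power iteration collapsed; change the start")
        vec = [v / norm for v in nxt]
    return vec


def consensus_weights(corr: Matrix) -> List[float]:
    """Assumed weighting: absolute loadings rescaled to sum to one."""
    loadings = [abs(v) for v in leading_component(corr)]
    total = sum(loadings)
    return [v / total for v in loadings]


def overall_consensus(scores: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Scalar product of the classifier scores and their weights."""
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical correlation matrix for three classifiers; the first two
# agree closely with each other and only weakly with the third.
corr = [[1.0, 0.9, 0.1],
        [0.9, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
weights = consensus_weights(corr)
score = overall_consensus([0.9, 0.8, 0.2], weights)
```

Under these assumptions the third classifier receives a smaller weight than the first two, because the first principal component is dominated by the two classifiers that agree with each other. The sketch also assumes that all classifier outputs are oriented in the same direction (higher means more pathogenic), so that the correlations are mostly positive.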
[0013] In some embodiments, a database including the data relating
to gene variants and their known phenotypes may be accessed and each
classifier may be trained using the data obtained from the
database. In some embodiments, the data relating to gene variants
and their known phenotypes may include Rearranged During
Transformation (RET) proto-oncogene data.
[0014] In certain embodiments, the predetermined range may be
determined using a first numerical consensus score reference
interval determined for known benign phenotypes associated with
gene variants and a second numerical consensus score reference
interval determined for known pathogenic phenotypes associated with
gene variants. In some embodiments, the overall consensus score may
be compared against the first and second numerical consensus score
reference intervals and a benign gene test outcome may be reported
in an event the consensus score falls within the first numerical
interval or a pathogenic gene test outcome may be reported in an
event the consensus score falls within the second numerical
interval.
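The comparison against the two reference intervals can be sketched as below. This is a minimal illustration, assuming closed numerical intervals and assuming that a score falling in neither interval, or in an overlap of both, is reported as uncertain. The interval bounds and the function name are hypothetical and are not taken from the description.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound), inclusive


def interpret(score: float, benign: Interval, pathogenic: Interval) -> str:
    """Map an overall consensus score to a reported outcome."""
    in_benign = benign[0] <= score <= benign[1]
    in_pathogenic = pathogenic[0] <= score <= pathogenic[1]
    if in_benign and not in_pathogenic:
        return "benign"
    if in_pathogenic and not in_benign:
        return "pathogenic"
    # Assumption: neither interval, or both, is treated as uncertain.
    return "uncertain"


# Hypothetical reference intervals on a 0-to-1 consensus scale.
BENIGN = (0.0, 0.35)
PATHOGENIC = (0.65, 1.0)

outcome = interpret(0.8, BENIGN, PATHOGENIC)
```

In practice the two intervals would be calculated from consensus scores of variants with known benign and known pathogenic phenotypes, as the paragraph above describes.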
[0015] In some embodiments, a numerical representation of the
interpretation obtained from each classifier may be obtained. The
numerical representation may include at least one of mean, median,
standard deviation, minimum, and maximum of numerical values output
from the classifier. In certain embodiments, the correlation matrix
may be obtained as a function of the numerical representations of
the plurality of the classifiers. In some embodiments, the
numerical correlation between each pair of classifiers may be
obtained by determining a Spearman's rank correlation coefficient
for the pair of classifiers.
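A pairwise Spearman correlation matrix of the kind described above might be computed as in the following sketch. It assumes that each classifier has produced one numerical value per variant over the same set of variants and that no classifier's output is constant (which would make the coefficient undefined). The function names and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(outputs: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Square matrix of Spearman coefficients, in dictionary key order."""
    names = list(outputs)
    return [[spearman(outputs[a], outputs[b]) for b in names] for a in names]
```

Rank correlation is insensitive to the differing numerical scales of the individual classifiers, which is one plausible reason for preferring it over a raw Pearson correlation here.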
[0016] In some embodiments, the genetic test interpretation may be
reported by graphically displaying the overall consensus score on
the predetermined range. In certain embodiments, the consensus
scores of the plurality of classifiers may be displayed on a radial
plot.
[0017] In some embodiments, in an event the overall consensus score
falls outside of the predetermined range, an uncertain outcome for
the genetic test may be reported. In some embodiments, the overall
consensus score for the uncertain result may be displayed
superimposed on the predetermined range. In certain embodiments, the
consensus scores of the plurality of classifiers may be displayed on
a radial plot. Visualization of the consensus output may be used to
augment
available clinical information and assist in improving prediction
algorithms as gene variant knowledge increases.
[0018] The advantages and novel features are set forth in part in
the description which follows, and in part will become apparent to
those skilled in the art upon examination of the following and the
accompanying drawings or may be learned by production or operation
of the examples. The advantages of the present teachings may be
realized and attained by practice or use of the methodologies,
instrumentalities and combinations described herein.
[0019] It is understood that other configurations of the subject
technology will become readily apparent to those skilled in the art
from the following detailed description, wherein various
configurations of the subject technology are shown and described by
way of illustration. As will be realized, the subject technology is
capable of other and different configurations and its several
details are capable of modification in various other respects, all
without departing from the scope of the subject technology.
[0020] Accordingly, the drawings and detailed description are to be
regarded as illustrative in nature and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] The drawing figures depict one or more implementations in
accord with the present teachings, by way of example only, not by
way of limitation. In the figures, like reference numerals refer to
the same or similar elements.
[0022] FIG. 1 is a high-level block diagram of an embodiment of the
present invention for interpreting a gene test.
[0023] FIG. 2 illustrates RET protein domains and their reported
disease causing variants as associated with different MEN2
phenotypes.
[0024] FIG. 3 is a flow diagram of the procedures for interpreting
gene test results using a PSAAP classifier.
[0025] FIG. 4 is a table that summarizes performance of various
classifiers in interpreting gene test results using a dataset of
RET gene variant-disease data.
[0026] FIG. 5 is a high-level block diagram of procedures for
determining a consensus score of multiple classifiers according to
certain embodiments disclosed herein.
[0027] FIGS. 6A-6B illustrate the use of radar plots for consensus
scoring.
[0028] FIGS. 7A-7B include examples of a comprehensive display for
consensus scoring.
DETAILED DESCRIPTION
[0029] The detailed description set forth below is intended as a
description of various configurations of the subject technology and
is not intended to represent the only configurations in which the
subject technology may be practiced. The appended drawings are
incorporated herein and constitute a part of the detailed
description. The detailed description includes specific details for
the purpose of providing a thorough understanding of the subject
technology. However, it will be apparent to those skilled in the
art that the subject technology may be practiced without these
specific details. In some instances, well-known structures and
components are shown in block diagram form in order to avoid
obscuring the concepts of the subject technology. Like components
are labeled with identical element numbers for ease of
understanding.
[0030] FIG. 1 is a high-level block diagram 100 of an embodiment of
the present invention for interpreting a gene test. A user 102 of a
computing device 101 may access a database 150 to obtain gene
variant-disease data 160 from the database 150. The term "gene
variant" refers to a specific form of alteration/variation in the
normal sequence of a gene. Genetic variations among individuals may
occur on different scales, ranging from variations in the number
and appearance of the chromosomes to changes in a single nucleotide. Although the
significance of a gene variant or a gene variation is often
unclear, in some cases, available studies of genotypes and their
corresponding phenotype may be used to determine the significance
of a gene variant. The database 150 may store, collect, and/or
display information regarding gene variants and their possible
disease association (i.e., gene variant-disease data).
[0031] The database 150 may be local or remote to the computing
device. Although not shown in FIG. 1, in certain embodiments, the
user 102 may access more than one database of gene variant-disease
data, each of which may be local or remote to the computing device.
In some embodiments, the computing device user 102 may access a
remote database 150 via a network (e.g., band limited
communications network). In some embodiments, the computing device
user may submit a request for the gene variant-disease data 160 to
the database 150 and receive the data 160 from the database in
response to that request.
[0032] The computing device 101 may include a machine-implemented
prediction tool 105, which may be used to interpret a gene test. The
prediction tool 105 may include a consensus classifier 110 that is
responsible for performing the procedures required for classifying
and interpreting the gene test. The details of the consensus
classifier 110 are described later with reference to FIGS. 5-7.
[0033] In some embodiments, the interpretation results may be
reported to the computing device user 102 on the display 103 of the
computing device 101. In certain embodiments, in addition to or in
place of displaying the gene test interpretation results, the
computing device 101 may employ other reporting schemes known in
the art to report the gene test interpretation results.
[0034] The term "gene test," as used herein, refers to a test
involving examination of deoxyribonucleic acid (DNA) molecules in
search for genetic disorders, identifying individuals carrying a
copy of a gene that may be responsible for a disease (carrier
screening), diagnostic testing, pre-symptomatic testing, etc.
[0035] In some embodiments, the database 150 may store information
on single nucleotide variants and their disease associations.
Generally, a "single nucleotide variant" (SNV) or a "single
nucleotide polymorphism" (SNP) refers to variations occurring when
a single nucleotide in the genome differs between paired
chromosomes of an individual or members of a biological species.
Further, the term "non-synonymous single nucleotide polymorphism"
(nsSNP) may be used to refer to a point mutation or a change in
amino acid sequence as compared to a wild type or reference
sequence.
[0036] Certain nsSNP variants have been shown to be causative of
disease. Therefore, investigating the functional effect of SNPs has
been of interest for many years. Due to the cost, labor, and
expertise required for wet-bench molecular evaluation,
computational tools have been used to assist in investigating the
functional effect of SNPs. These computational tools often focus on SNP
variants in protein coding regions that change one amino acid for
another. The severity of a given amino acid sequence change may
range from mild to severe, and has been reported to impact various
medical areas, including genetic disease susceptibility (e.g.,
sickle cell anemia), common disease risks (e.g., Alzheimer's
disease risk), or drug sensitivities (e.g., in warfarin
treatment). Historically, physical and chemical properties of amino
acids have been used as a proxy to assess the functional impact of
these substitution mutations.
[0037] Some early efforts in predicting amino acid substitution
effects focused on metrics of estimating the expected evolutionary
distance between each possible amino acid pair. For example, Point
Accepted Mutation (PAM) matrices have been used to approximate the
evolutionary distance and frequency of amino acids for equivalent
protein positions in closely related species. Each PAM matrix
lists the standard amino acids along both its rows and its
columns, such that the value in a given cell
represents the probability of having one amino acid substituted for
another. Such matrices are commonly referred to as "substitution"
matrices.
[0038] A substitution matrix may be used to derive a scoring matrix
that may be used to assess the similarities between two aligned
sequences. The Blocks of Amino Acid Substitution Matrix (BLOSUM) is
an example of a substitution matrix that may be used for sequence
alignment of proteins. BLOSUM considers highly conserved protein
regions and may be used for more distantly related species. Both
PAM and BLOSUM employ raw mutation rates to compute a score for
each amino acid substitution and calculate the likelihood that the
mutation is caused by an evolutionary change (i.e., over time) and
not by sheer chance. Further, these substitution matrices assume
that substitutions that are consistent with evolutionary trends
conserved across many species are less likely to disrupt protein
function. Conversely, substitutions that are not consistent with
evolution (i.e., non-conserved substitutions) are more likely
associated with disease.
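The comparison of an observed substitution rate against chance is commonly expressed as a log-odds score. The following sketch shows that general form under stated assumptions: the frequencies are hypothetical illustrative numbers, the function name is invented for this example, and published PAM and BLOSUM matrices additionally scale and round such values.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float) -> float:
    """Log2 ratio of the observed frequency of an aligned amino acid pair
    to the frequency expected if the two residues paired by chance."""
    expected = freq_a * freq_b
    return math.log2(pair_freq / expected)


# Hypothetical frequencies: both residues have background frequency 0.1.
conserved = log_odds_score(0.02, 0.1, 0.1)      # seen more often than chance
non_conserved = log_odds_score(0.005, 0.1, 0.1)  # seen less often than chance
```

A positive score corresponds to a substitution that evolution appears to tolerate, and a negative score to a non-conserved substitution of the kind the paragraph above associates with disease.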
[0039] Alternative approaches utilizing amino acid properties have
considered how physicochemical properties differ with changes in
volume, hydrophobicity, net charge, packing density, and solvent
accessibility, all of which have been shown to correlate with the
predicted functional impact of SNP variants. For example, the
Grantham distance method
combines the biophysical properties and evolutionary distances
between amino acid pairs, in a setting where the significance of
the amino acid substitution is quantified in a three-dimensional
(3D) space. Specifically, the significance of the amino acid
substitution is quantified, as a weighted Euclidean distance, in
the three-dimensional space having amino acid side chain
composition, polarity and volume as coordinates. The weighted
Euclidean distance is modeled to estimate amino acid substitution
mutation rates.
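The weighted Euclidean distance described above can be sketched as follows. The property values and weights in the example are hypothetical placeholders chosen for easy arithmetic, not the published Grantham constants, and the type and function names are invented for this illustration.

```python
import math
from typing import NamedTuple


class SideChain(NamedTuple):
    """Coordinates of one amino acid in the three-dimensional space."""
    composition: float
    polarity: float
    volume: float


class Weights(NamedTuple):
    """One weight per coordinate; values here are placeholders."""
    composition: float
    polarity: float
    volume: float


def weighted_distance(a: SideChain, b: SideChain, w: Weights) -> float:
    """Weighted Euclidean distance between two amino acids."""
    return math.sqrt(
        w.composition * (a.composition - b.composition) ** 2
        + w.polarity * (a.polarity - b.polarity) ** 2
        + w.volume * (a.volume - b.volume) ** 2
    )


# Hypothetical residues and unit weights, for illustration only.
residue_x = SideChain(composition=0.0, polarity=5.0, volume=100.0)
residue_y = SideChain(composition=0.0, polarity=8.0, volume=104.0)
distance = weighted_distance(residue_x, residue_y, Weights(1.0, 1.0, 1.0))
```

The weights exist because the three properties are measured on very different numerical scales; without them the coordinate with the largest range would dominate the distance.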
[0040] Further, some computational algorithms focus on the fact
that the importance of the evolutionary distance separating a pair
of amino acids depends on the position where the amino acid
substitution occurs. Specifically, positions that are functionally
or structurally important in a protein family may not tolerate a
variety of amino acid changes, and this is reflected in the amino
acid distribution at equivalent positions across the family. These
equivalent positions may be found by constructing an alignment from
multiple related protein sequences. Thus, amino acid residues at
highly conserved alignment positions may be assumed to be under
some purifying evolutionary selection and important for normal
protein function. Computational algorithms may be used to quantify
this conservation, for example by calculating the frequency of the
most common amino acid in an alignment column. As another example,
Shannon entropy may be computed over the distribution of all amino
acids at a specific aligned position. This idea may further be
improved by using relative entropy, which compares the amino acid
distribution at a conserved alignment position against the amino
acid background distribution.
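The two conservation measures mentioned above can be sketched for a single alignment column as follows. The sketch assumes the column is given as a string of one-letter amino acid codes with no gap characters, and it uses a uniform background distribution purely for illustration; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def column_frequencies(column: str) -> Dict[str, float]:
    """Observed frequency of each amino acid in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return {aa: n / total for aa, n in counts.items()}


def shannon_entropy(column: str) -> float:
    """Entropy in bits; 0 means the column is perfectly conserved."""
    freqs = column_frequencies(column)
    return -sum(p * math.log2(p) for p in freqs.values())


def relative_entropy(column: str, background: Dict[str, float]) -> float:
    """Divergence (in bits) of the column distribution from a background
    amino acid distribution; larger values indicate stronger conservation."""
    freqs = column_frequencies(column)
    return sum(p * math.log2(p / background[aa]) for aa, p in freqs.items())


# Assumption for illustration: all twenty amino acids equally likely.
UNIFORM = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

conserved_entropy = shannon_entropy("LLLL")
variable_entropy = shannon_entropy("LIVM")
conserved_divergence = relative_entropy("LLLL", UNIFORM)
```

With a realistic, non-uniform background, relative entropy rewards conservation of rare residues more than conservation of common ones, which plain Shannon entropy cannot do.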
[0041] Some mutation prediction computational algorithms and
prediction scoring tools for interpreting gene tests consider both
physicochemical properties of amino acid substitution and
evolutionary conservation. Examples of such methods include Sorting
Intolerant From Tolerant (SIFT), Position-Specific Independent
Counts (PSIC), Align Grantham Variation Grantham Deviation (AGVGD),
and Multivariate Analysis of Protein Polymorphism (MAPP) score.
[0042] The SIFT algorithm may be used to compute a weighted
frequency average of which amino acid residue appears in a multiple
alignment position, coupled with an estimate of unobserved variant
frequencies.
[0043] The PSIC profile score method considers the difference of
likelihood between reference and variant amino acid at a given
aligned position using a position-specific scoring matrix
(PSSM).
[0044] The AGVGD method is an extension of the original Grantham
distance method that may be used in multiple sequence alignments
and true simultaneous multiple comparisons. Grantham variation (GV)
may be computed by replacing each value-pair of a given amino acid
residue component for composition, polarity, and volume with the
maximum and minimum value in that alignment position.
[0045] The MAPP score constructs a statistical summary of an
alignment column using a phylogenetic tree, weighting each sequence
by tree topology and branch length.
[0046] Furthermore, some computational algorithms consider protein
structure-function relationships of amino acid substitution. For
example, solvent accessibility of an amino acid may be used as a
predictor of functional impact, where substituting various amino
acid residues may disrupt the hydrophobic core of a soluble
protein. Structural modeling of disease proteins may be used to
determine whether a nsSNP variant results in protein backbone
strain or leads to overpacking substitutions. A large number of
X-ray crystal structures have been determined which often include
protein interacting partners, and/or small molecule, peptide
ligands or inhibitors. The ability to locate a nsSNP variant on a
computational protein structure makes it possible to evaluate
whether the amino acid substitution occurs in or near a binding or
catalytic site or at a domain-domain interface of protein
interaction.
[0047] Polymorphism Phenotyping (PolyPhen) is an example of an
algorithm that takes advantage of structural modeling.
Specifically, PolyPhen is an automated tool that may be used to
evaluate any possible impact of amino acid substitution on the
structure and function of a human protein. PolyPhen uses a
Dictionary of Secondary Structure in Proteins (DSSP) to map a given
substitution site to known protein 3D structures.
[0048] Mutation Prediction (MutPred) is another example of a
prediction algorithm that may be used with the embodiments
described herein. MutPred generates mutability profiles of amino
acid sequences from the corresponding complementary DNA sequences
and generates weighted and un-weighted profiles. In the weighted
profiles relative mutabilities are multiplied by the likelihood of
clinical detection depending on chemical differences.
[0049] PMUT is another example of a mutation prediction algorithm.
PMUT uses a two-layer neural network and is trained using human
mutational data. PMUT allows for either prediction of single point
amino acid mutations or scanning of mutational hot spots. Results
are obtained by alanine scanning, identifying massive mutations,
and genetically accessible mutations.
[0050] Although clinicians often rely on patient history, family
segregation, literature review, and trusted colleagues to stay
informed of the phenotypic consequences of a given gene variant
found in a gene test, in the absence of traditional evidence,
well-established machine learning or computational tools may be used
to predict and assess phenotypic consequences of the gene variant.
However, established algorithms do not always complete the
prediction, and furthermore are not always in agreement with the
curated data or each other.
[0051] RET (Rearranged During Transformation) proto-oncogene data
are an example of the gene-disease data 160 that may be used with
embodiments of the present invention. In some embodiments,
well-curated gene variant collections, such as RET data, may be
used. Further, in some embodiments, physicochemical properties of
amino acids in the coded proteins may be utilized to determine
mutation severity.
[0052] The RET oncogene is located on chromosome 10q11, with 21
exons coding a full length protein of 1,114 amino acids. Conserved
functional domains found within the protein include a signal
peptide, cadherin repeat domains, transmembrane domain, and protein
tyrosine kinase. Mutations in the RET oncogene have been directly
associated with Multiple Endocrine Neoplasia type 2 (MEN2), a
hereditary thyroid carcinoma syndrome. Although well known
mutations often guide patient therapy and surgical options, other
RET sequence mutations vary in functional severity. Some mutations
may be pathogenic, some may be benign, and some may be of unknown
significance. Curated RET oncogene mutations for MEN2 have been
reported, many of which have documented phenotype outcomes.
[0053] FIG. 2 illustrates RET protein domains and their reported
disease causing variants as associated with different MEN2
phenotypes. Specifically, conserved domains of signal peptide (SP),
cadherin repeat domains (CAD), cysteine rich region (CYS),
transmembrane domain (TM), and protein tyrosine kinase (Kinase) are
shown. Three specific disease phenotypes have been shown to be
associated with these domains. Specifically, familial medullary
thyroid cancer (FMTC), multiple endocrine neoplasia type 2A
(MEN2A), and multiple endocrine neoplasia type 2B (MEN2B) have been
shown to correspond to CYS and Kinase domains.
[0054] The RET gene belongs to the cadherin super family and
encodes a receptor tyrosine kinase which functions in signaling
pathways for cell growth and differentiation. The RET gene plays a
critical role in neural crest development and may undergo oncogenic
activation, in vivo and in vitro, by cytogenetic rearrangement. The
RET gene may further be classified by Gene Ontology (GO) categories
of biological process of homophilic cell adhesion, posterior midgut
development, and protein amino acid phosphorylation. The GO
annotated cellular component of RET is integral to membrane, and
the GO molecular function category lists ATP binding, calcium ion
binding, and transmembrane receptor protein tyrosine kinase
activity.
[0055] As explained above, to date, various computational
algorithms and prediction scoring tools for classifying gene test
results (e.g., SIFT, PSIC, AGVGD, MutPred, or MAPP) have been
developed. Traditional classification schemes may also be used to
classify and interpret gene test results. For example, classifiers
such as Zero Rules (ZeroR), naive Bayesian, Simple Logistic
Regression (Simple Logistic), Support Vector Machine (SMO),
k-nearest neighbor (IBk), and Random Forest Regression (Random
Forest) may be used to interpret gene test results.
[0056] In one embodiment of the present invention, curated RET
gene-disease data are used to train, test, and verify performance
of various mutation classification and prediction tools (e.g.,
SIFT, PSIC, AGVGD, MutPred, or MAPP) as well as various traditional
classification tools (e.g., ZeroR, Simple Logistic, SMO, IBk, or
Random Forest). In one embodiment, k-fold cross validation may be
used to assess classifier performance. In k-fold cross validation,
the original dataset is partitioned into k subsamples; k-1 of the
subsamples are used as training data for the classifier and the
remaining subsample is held out for validation. The cross
validation is repeated k times, with each of the k subsamples used
exactly once as the validation data. The resulting k outcomes are
averaged to produce a single estimate. In one embodiment, the
weighted average from a three-fold cross validation (i.e., k=3) of
sensitivity (true positive rate), specificity (true negative rate),
and positive predictive value (precision) may be calculated for
each classifier algorithm. Specifically, assuming that the
probability of having a true detection (hit) is p, the probability
of having a false detection (miss) is q=1-p. The probability of
having a true positive may be calculated as $p^2$ and the
probability of having a false negative may be calculated as $pq$.
Using this definition, the sensitivity of the classifier is
$p^2/(p^2+pq)=p$ and the specificity of the classifier is
$q^2/(q^2+pq)=q$. The performance of the classifier (i.e.,
positive predictive value) may be measured as a function of the
sensitivity and specificity.
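The k-fold procedure described above can be sketched as follows. This is a minimal illustration only: the function names `k_fold_indices` and `cross_validate` and the caller-supplied `train_and_score` callback are hypothetical and are not named in this disclosure, and a plain (unweighted) average over folds is assumed.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Partition the indices 0..n_items-1 into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[j::k] for j in range(k)]

def cross_validate(n_items, k, train_and_score, seed=0):
    """Average a score over k rounds; each fold is held out exactly once.

    train_and_score(train_idx, test_idx) is a hypothetical callback that
    trains a classifier on train_idx and returns a score on test_idx.
    """
    folds = k_fold_indices(n_items, k, seed)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k
```

With k=3 and a dataset of 97 variants, for instance, this yields folds of 33, 32, and 32 variants, and every variant is used for validation exactly once.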
[0057] Primary Sequence Amino Acid Properties (PSAAP) classifier is
another example of a classification and prediction tool that may be
used to interpret gene test results. The details of the PSAAP
prediction algorithm are described in Attorney Docket No.
076950-0130, U.S. patent application Ser. No. 13/471,294, filed on
May 14, 2012, the teaching of which is hereby incorporated by
reference in its entirety for all purposes.
[0058] FIG. 3 is a flow diagram of the procedures for interpreting
gene test results using a PSAAP classifier. The RET variant data
may be used to test and train the PSAAP algorithm 305. In some
embodiments, non-synonymous RET variant data may be used 305.
Non-synonymous RET variants are characterized by physicochemical
differences in primary amino acid sequence resulting from the
mutation. Regardless of the type of RET variant data used, the RET
variant data may include exonic nsSNP variants 310 with known
outcomes of benign and pathogenic 320. Attribute selection (feature
selection) 340 may be performed to select a subset of relevant
features that may be used for classification. Specifically,
attributes of mutation status may be characterized using values of
physical, chemical, conformational, or energetic properties of the
genes. Attribute selection (feature selection) 340 may be performed
during classification training/testing.
[0059] The properties used in attribute selection 340 may be
obtained from an AAindex database 330. AAindex 330 is a database of
numerical indices that represent various physicochemical and
biochemical properties of amino acids and pairs of amino acids. For
each RET variant, matrices of delta values 335 for each biochemical
property of the substituted amino acid are calculated using the
corresponding AAindex 330. The resulting mutations are described by
an array of variables, archived using a structured query language
(SQL), that corresponds to the absolute value of the difference
between the value of the property in the amino acid present in the
wild type and the one in the mutant.
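The delta-value calculation described above may be sketched as follows. This is an illustrative assumption rather than the disclosed implementation: the helper names are hypothetical, and a small excerpt of the Kyte-Doolittle hydropathy scale stands in for the many indices that would be loaded from the AAindex database.

```python
# Illustrative property table: Kyte-Doolittle hydropathy values for a few residues.
# A real run would load many property indices from AAindex instead (assumption).
HYDROPATHY = {"C": 2.5, "Y": -1.3, "R": -4.5, "G": -0.4, "S": -0.8, "V": 4.2, "A": 1.8}

def parse_variant(variant):
    """Split a variant such as 'C609Y' into (wild-type residue, position, mutant residue)."""
    return variant[0], int(variant[1:-1]), variant[-1]

def delta_features(variant, property_tables):
    """Return {property name: |wild-type value - mutant value|} for one variant."""
    wild, _position, mutant = parse_variant(variant)
    return {
        name: abs(table[wild] - table[mutant])
        for name, table in property_tables.items()
    }

features = delta_features("C609Y", {"hydropathy": HYDROPATHY})
```

Each variant is thus reduced to one non-negative number per property, which is the form of attribute vector the classifier consumes.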
[0060] Random selection may be used to build a training set 350 and
a test set 360. Although training and test sets include different
disease subtypes such as MEN2A, MEN2B, FMTC, MEN2A and FMTC, class
labels of "pathogenic" and "benign" 320 may be used to describe all
curated disease associations.
[0061] A classifier, such as a naive Bayesian classifier 355, may
be employed to classify the variants. Specifically, the training
set 350 may be used to train the classifier 355. The test set 360
is then tested using the classifier 355 and the outcome of the test
is used to assign disease association 365 to the gene variants in
the test set 360. Uncertain variants 370 are also analyzed and
their predicted disease association 375 is output from the PSAAP
classifier.
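The random split and naive Bayesian classification steps may be sketched as follows. This is a minimal sketch under stated assumptions: a Gaussian likelihood per feature is assumed (the disclosure does not specify the likelihood model), and the function names and the 30% test fraction are hypothetical.

```python
import math
import random

def random_split(items, test_fraction=0.3, seed=0):
    """Randomly split items into (training set, test set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class label."""
    model = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        columns = list(zip(*members))
        means = [sum(col) / len(members) for col in columns]
        variances = [
            sum((x - m) ** 2 for x in col) / len(members) + 1e-9  # floor avoids zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (len(members) / len(rows), means, variances)
    return model

def predict(model, row):
    """Return the label with the highest log posterior for one feature row."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)
```

Under this sketch, variants of uncertain significance would be passed through `predict` in the same way as the test set, using the model fitted on the training set.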
[0062] The performance of the PSAAP algorithm may be evaluated
using calculated values of sensitivity (true positive rate),
specificity (true negative rate), and positive predictive value
(precision).
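The three performance measures named above follow directly from confusion-matrix counts. The sketch below assumes paired lists of actual and predicted labels; the function names are hypothetical, and an undefined ratio is returned as NaN by assumption.

```python
def confusion_counts(actual, predicted, positive="pathogenic"):
    """Count true/false positives and negatives for paired label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn

def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def performance(actual, predicted, positive="pathogenic"):
    """Sensitivity, specificity, and positive predictive value."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "ppv": _ratio(tp, tp + fp),          # precision
    }
```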
[0063] FIG. 4 is a table that summarizes performance of various
classifiers (e.g., mutation prediction tools and traditional
classifiers) in interpreting gene test results using a dataset of
RET gene variant-disease data. The classifier performance is ranked
by positive predictive value (PPV), that is, the percentage of
variants classified as pathogenic that actually were pathogenic.
[0064] For the dataset used, the ZeroR classifier (zero rules),
which selects the majority class by default, yields a baseline
performance of 55.7%. The nearest neighbor, random forest, support
vector machine, and simple logistic give similar performance to
each other with 77.6%, 78.9%, 79.1%, and 81.4%, respectively. Among
the traditional classifiers, the naive Bayesian classifier appears
to be the best performing algorithm with a positive predictive value
of 82.7%, which translates to a gain in performance of 27 percentage
points over the ZeroR classifier. Further, as shown, the traditional
classifiers (e.g.,
ZeroR, IBk, Random Forest, SMO, Simple Logistic, Naive Bayesian)
perform better than or similar to the existing mutation prediction
algorithms (e.g., PolyPhen, SIFT, MutPred, and PMUT). Specifically,
the PolyPhen, SIFT, MutPred, and PMUT result in positive predictive
values of 54.1%, 77.9%, 84.3%, and 72.3%, respectively. The PSAAP
prediction algorithm yields the highest performance at 8.3%.
[0065] FIG. 5 is a high-level block diagram of procedures for
determining a consensus score for multiple classifiers according to
certain embodiments disclosed herein. The consensus classifier 110,
shown in FIG. 1, may employ the procedures shown in FIG. 5 to
classify and interpret test results.
[0066] Various classifiers 510-1, . . . , 510-n, such as those
outlined in FIG. 4, may be used. In one embodiment, a user may
select a finite number of classifiers from among various available
classifiers. For example, in one embodiment, the MutPred, PMUT,
PolyPhen, SIFT, and PSAAP classifiers may be used. Various
combinations of classifiers may be used with embodiments of the
present invention. The classifiers outlined in FIG. 4 are intended
to serve as non-limiting examples of classifiers that may be used
with the embodiments disclosed herein. One skilled in the art
appreciates that any classifier and prediction tool known in the
art may be used with the embodiments disclosed herein.
[0067] As explained above, each classifier 510-i may be trained and
tested with gene variant-disease data 160 (FIG. 1), such as RET
gene-disease data. Complementary methods (not shown, e.g., sequence
alignment, structural alignment, amino acid substitution penalties,
structural disruption, sequence homology, etc.) may be applied to
the data prior to performing classification and prediction.
[0068] Gene test data 501 from genetic testing may be input to the
classifiers 510-1, . . . , 510-n. A descriptive statistics
calculator 530 may calculate descriptive statistics values 530-i
for each classifier outcome (hereinafter referenced as 510-i) using
the numerical output obtained from each classifier 510-i. The
descriptive statistics values 530-i for each classifier may include
elements such as mean, median, standard deviation, minimum, and
maximum of the numerical values output from that classifier. The
descriptive statistics 530-i of each classifier 510-i may be used
to summarize and quantitatively represent the features of the data
collected from each classifier 510-i.
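The descriptive statistics listed above may be computed per classifier as sketched below. The function name `describe` is hypothetical, and the sample (n-1) standard deviation is an assumption, since the disclosure does not state which form is used.

```python
import statistics

def describe(scores):
    """Summarize one classifier's numerical outputs over a set of variants."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation (assumption)
        "min": min(scores),
        "max": max(scores),
    }

# Hypothetical outputs of a single classifier for five variants.
summary = describe([10, 20, 30, 40, 50])
```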
[0069] A correlation calculator 550 may calculate the correlation
between each pair of classifiers. For example, if five classifiers
(C1, C2, C3, C4, C5) are being used, the correlation between
classifiers (C1, C2), (C1, C3), (C1, C4), (C1, C5), (C2, C3), (C2,
C4), (C2, C5), (C3, C4), (C3, C5), and (C4, C5) may be obtained. In
some embodiments, the obtained correlation coefficients 555-i may
be arranged into a correlation matrix. In certain embodiments, the
correlation coefficients 555-i may be obtained using the
descriptive statistics 530-i obtained from each classifier 510-i.
In some embodiments, a Spearman's rank correlation coefficient may
be calculated between the classifier pairs and used as a numerical
or non-parametric measure of the statistical dependence between the
classifier pairs. One skilled in the art appreciates that other
available methods for determining correlation dependence may be
used in addition to or in place of Spearman's correlation to
determine the correlation coefficient 555-i.
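A pairwise Spearman correlation matrix of the kind described above may be sketched as follows. The sketch computes Spearman's coefficient as the Pearson correlation of average ranks, which handles ties; the function names and the dictionary input layout are hypothetical, and a classifier with constant output (zero rank variance) is not handled.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def correlation_matrix(outputs):
    """outputs maps classifier name -> scores over the same ordered variants."""
    names = list(outputs)
    return names, [[spearman(outputs[a], outputs[b]) for b in names] for a in names]
```

For five classifiers this produces a symmetric 5x5 matrix with ones on the diagonal, whose ten distinct off-diagonal entries correspond to the ten classifier pairs enumerated above.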
[0070] Further, a variance analyzer 560 may be used to determine
the variance between the independent classifiers 510-1, . . . ,
510-n using the descriptive statistics 530-i obtained from each
classifier 510-i. For example, Factor Analysis may be used to
describe the variance among the independent classifiers 510-1, . .
. , 510-n. Factor analysis describes the variability among the
classifiers in terms of a number of variables or "factors."
[0071] Factor analysis may be performed using principal components
to determine the weights of association between the different
classifiers 510-i. Specifically, in one embodiment, a set of
eigenvectors is applied to weight each classifier 510-i accordingly
by eigenvalues from principal components, with more than 80% of the
cumulative variance reached using only the first three
eigenvalues.
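The cumulative-variance criterion described above may be sketched as follows. The eigenvalues shown are hypothetical values chosen for illustration (they sum to 5, as the eigenvalues of a 5x5 correlation matrix would); they are not taken from this disclosure, and the function name is likewise an assumption.

```python
def components_needed(eigenvalues, threshold=0.80):
    """Smallest number of leading eigenvalues whose share of the total reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(ordered)

# Hypothetical eigenvalues for a correlation matrix of five classifiers.
example_eigenvalues = [2.4, 1.1, 0.7, 0.5, 0.3]
kept = components_needed(example_eigenvalues)  # cumulative shares: 0.48, 0.70, 0.84
```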
[0072] Further, to compensate for the lack of independence between
the variables (i.e., descriptive statistics values 530-i), a
weighted average calculator 570 may use the resulting correlation
rank 555-i and variance 565-i values for each classifier to
determine a weighted average 575-i for each classifier. A consensus
score calculator 580 may determine a consensus score 585-i for each
classifier based on the weighted average score of each classifier
510-i.
[0073] For example, in one embodiment, the classifiers 510-1, . . .
, 510-n are used to determine a numerical prediction for each gene
variant. The numerical prediction of each classifier is then used
to obtain descriptive statistics 530-i for each classifier 510-i.
Spearman correlation coefficients, describing the correlation of
each classifier with other classifiers are then calculated. The
Spearman correlation coefficients may be arranged into a matrix
whose eigenvalues are obtained during principal component analysis.
Once principal component analysis is performed, the obtained matrix
is analyzed to determine if the classifiers are independent of one
another. If the classifiers are determined to be independent, a
consensus score for each classifier may be obtained as a linear sum
of the numerical outcomes obtained from that classifier. If the
classifiers are determined to be dependent, a consensus score for
each classifier may be obtained as a function of a reduced matrix
resulting from the principal component analysis of the correlation
matrix. Specifically, a scalar product (inner product or dot
product) of the reduced matrix and the numerical outcomes of a
classifier may be used to obtain the consensus score for that
classifier.
[0074] A minimum number of eigenvectors that cumulatively explain
the variance of the independent classifiers are selected with
respect to a predetermined threshold. The reduced set of
eigenvectors is used to create a weighted average sum for each
classifier. Specifically, the weighted average sum, in some
embodiments, may be calculated by multiplying (via an inner or dot
product) the reduced eigenvector by the descriptive statistics of
each classifier. This results in a single number that may be used
as the "consensus score" of a classifier.
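The inner-product step described above may be sketched as follows. How the reduced eigenvectors are combined into a single weight vector is not detailed here, so the weight vector below is a purely hypothetical input, as are the function name and the example values.

```python
def weighted_sum(weights, values):
    """Inner (dot) product of a weight vector and a vector of numerical values."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical weights (e.g., derived from a reduced eigenvector) and values.
score = weighted_sum([0.5, 0.3, 0.2], [10, 20, 30])
```

The result is a single number, which is the property the text relies on when it treats the weighted sum as a consensus score.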
[0075] In some embodiments, a reference range for the consensus
score may be determined. Specifically, a reference range may be
defined such that if the consensus score for a gene test falls
within that range, the gene variant is classified as pathogenic and
if the consensus score for a gene test falls outside of that range
it is classified as benign.
[0076] For example, the reference range may be calculated for RET
gene variants with known disease outcome by analogy to
calculating analyte reference intervals for age or gender in
traditional laboratory testing. A nonparametric reference interval
may be used for benign (n=46) and pathogenic (n=51) with 95%
confidence intervals (CI) for the lower and upper bounds. The
confidence ratio of the reference interval may also be
calculated.
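A nonparametric reference interval and the resulting range-based call may be sketched as follows. This sketch assumes the interval is the central 95% of the observed consensus scores (2.5th to 97.5th percentile, as is conventional for laboratory reference intervals) with linear interpolation between order statistics; the confidence intervals of the bounds are not computed, and all names and example scores are hypothetical.

```python
def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted list (fraction in [0, 1])."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight

def reference_interval(scores, coverage=0.95):
    """Central nonparametric interval covering the given share of the scores."""
    ordered = sorted(scores)
    tail = (1 - coverage) / 2
    return percentile(ordered, tail), percentile(ordered, 1 - tail)

def classify(score, pathogenic_interval):
    """Call a variant pathogenic if its score lies inside the reference range."""
    low, high = pathogenic_interval
    return "pathogenic" if low <= score <= high else "benign"

# Hypothetical consensus scores for variants with known pathogenic outcome.
interval = reference_interval(list(range(0, 101)))
```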
[0077] The overall consensus score may be used to cover cases in
which a gene-specific classifier does not outperform the existing
tools. This advantage of a consensus predictor over a single
predictor may be seen by removing seven RET gene variants with
known disease association that were originally classified as
variants of uncertain significance. After excluding these seven
variants from the gene-specific training set, analysis using the
Consensus Score Calculator is repeated. Under this setup, the
Consensus score correctly predicts six of the seven variants. Closer
inspection showed the remaining seventh variant is a nucleotide
level "silent" polymorphism (no amino acid change), which could
have been recognized by splice effect prediction software.
[0078] In some embodiments, a graphing display may be used in the
consensus score calculator 580 to preserve contribution of each
variable (classifier). For example, radial plots (also known as
radar or spider plots) may be employed.
[0079] FIGS. 6A-6B illustrate the use of radar plots for consensus
scoring. As shown in FIGS. 6A-6B, using radar plots for consensus
scoring may preserve the contribution of each predictor to the
total sum. For example, as shown in FIG. 6A, consensus score plot
of 470 (85, 90, 98, 97, 100) for the pathogenic gene variant C609Y
is obtained. As shown in FIG. 6B, consensus output of 83 (7, 13,
19, 4, 40) for a benign variant V376A is obtained.
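The totals quoted above can be checked by summing the five per-predictor values on each radar plot. The short sketch below uses only the numbers given in the text; the function and variable names are hypothetical, and the values are kept in the order quoted because the text does not state which predictor each value belongs to.

```python
def consensus_total(axis_values):
    """Sum the per-predictor values plotted on the radar axes."""
    return sum(axis_values)

pathogenic_c609y = (85, 90, 98, 97, 100)  # FIG. 6A values quoted in the text
benign_v376a = (7, 13, 19, 4, 40)         # FIG. 6B values quoted in the text

print(consensus_total(pathogenic_c609y))  # 470
print(consensus_total(benign_v376a))      # 83
```

Keeping the individual values alongside the total is what allows the plot to show which predictors drive a high or low consensus score.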
[0080] Further, a more comprehensive display for consensus scoring
may be used for augmenting clinical decision making. FIGS. 7A-7B
include examples of such a display. As shown, the display may
incorporate features such as classifier output, predictor calls,
and weighted sum, and may be presented in a color-metric scale. FIG.
7A shows a pathogenic gene variant C634R with a Consensus score of
367, and FIG. 7B shows a benign variant G691S with a Consensus score
of 97.
[0081] Those of skill in the art would appreciate that the various
illustrative blocks, modules, elements, components, methods, and
algorithms described herein may be implemented as electronic
hardware, computer software, or combinations of both. To illustrate
this interchangeability of hardware and software, various
illustrative blocks, modules, elements, components, methods, and
algorithms have been described above generally in terms of their
functionality. Whether such functionality is implemented as
hardware or software depends upon the particular application and
design constraints imposed on the overall system. Skilled artisans
may implement the described functionality in varying ways for each
particular application. Various components and blocks may be
arranged differently (e.g., arranged in a different order, or
partitioned in a different way) all without departing from the
scope of the subject technology.
[0082] It is understood that the specific order or hierarchy of
steps in the processes disclosed is an illustration of exemplary
approaches. Based upon design preferences, it is understood that
the specific order or hierarchy of steps in the processes may be
rearranged. Some of the steps may be performed simultaneously. The
accompanying method claims present elements of the various steps in
a sample order, and are not meant to be limited to the specific
order or hierarchy presented.
[0083] The previous description is provided to enable any person
skilled in the art to practice the various aspects described
herein. The previous description provides various examples of the
subject technology, and the subject technology is not limited to
these examples. Various modifications to these aspects will be
readily apparent to those skilled in the art, and the generic
principles defined herein may be applied to other aspects. Thus,
the claims are not intended to be limited to the aspects shown
herein, but are to be accorded the full scope consistent with the
language of the claims, wherein reference to an element in the
singular is
not intended to mean "one and only one" unless specifically so
stated, but rather "one or more." Unless specifically stated
otherwise, the term "some" refers to one or more.
[0084] A phrase such as an "aspect" does not imply that such aspect
is essential to the subject technology or that such aspect applies
to all configurations of the subject technology. A disclosure
relating to an aspect may apply to all configurations, or one or
more configurations. An aspect may provide one or more examples. A
phrase such as an aspect may refer to one or more aspects and vice
versa. A phrase such as an "embodiment" does not imply that such
embodiment is essential to the subject technology or that such
embodiment applies to all configurations of the subject technology.
A disclosure relating to an embodiment may apply to all
embodiments, or one or more embodiments. An embodiment may provide
one or more examples. A phrase such as an embodiment may refer to
one or more embodiments and vice versa. A phrase such as a
"configuration" does not imply that such configuration is essential
to the subject technology or that such configuration applies to all
configurations of the subject technology. A disclosure relating to
a configuration may apply to all configurations, or one or more
configurations. A configuration may provide one or more examples. A
phrase such as a configuration may refer to one or more
configurations and vice versa.
[0085] All structural and functional equivalents to the elements of
the various aspects described throughout this disclosure that are
known or later come to be known to those of ordinary skill in the
art are expressly incorporated herein by reference and are intended
to be encompassed by the claims. Moreover, nothing disclosed herein
is intended to be dedicated to the public regardless of whether
such disclosure is explicitly recited in the claims. No claim
element is to be construed under the provisions of 35 U.S.C.
§ 112, sixth paragraph, unless the element is expressly recited
using the phrase "means for" or, in the case of a method claim, the
element is recited using the phrase "step for." Furthermore, to the
extent that the term "include," "have," or the like is used in the
description or the claims, such term is intended to be inclusive in
a manner similar to the term "comprise" as "comprise" is
interpreted when employed as a transitional word in a claim.
[0086] Various modifications may be made to the examples described
in the foregoing, and any related teachings may be applied in
numerous applications, only some of which have been described
herein. It is intended by the following claims to claim any and all
applications, modifications and variations that fall within the
true scope of the present teachings.
* * * * *