U.S. patent application number 16/197267 was filed with the patent office on 2019-06-27 for systems and methods for snp characterization and identifying off target variants.
This patent application is currently assigned to Affymetrix, Inc.. The applicant listed for this patent is Affymetrix, Inc.. Invention is credited to Hong Gao, Ali Pirani, Mei-Mei Shen, Teresa Webster.
Application Number | 20190198134 16/197267 |
Document ID | / |
Family ID | 51529813 |
Filed Date | 2019-06-27 |
View All Diagrams
United States Patent
Application |
20190198134 |
Kind Code |
A1 |
Gao; Hong ; et al. |
June 27, 2019 |
SYSTEMS AND METHODS FOR SNP CHARACTERIZATION AND IDENTIFYING OFF
TARGET VARIANTS
Abstract
Methods for processing data using information gained from
examining biological materials identifies and characterized probes
for Single Nucleotide Polymorphisms and identifies Off Target
Variants.
Inventors: |
Gao; Hong; (Santa Clara,
CA) ; Pirani; Ali; (Santa Clara, CA) ;
Webster; Teresa; (Santa Clara, CA) ; Shen;
Mei-Mei; (Santa Clara, CA) |
|
Applicant: |
Name |
City |
State |
Country |
Type |
Affymetrix, Inc. |
Carlsbad |
CA |
US |
|
|
Assignee: |
Affymetrix, Inc.
Carlsbad
CA
|
Family ID: |
51529813 |
Appl. No.: |
16/197267 |
Filed: |
November 20, 2018 |
Related U.S. Patent Documents
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
|
|
13844285 |
Mar 15, 2013 |
|
|
|
16197267 |
|
|
|
|
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G16B 20/00 20190201 |
International
Class: |
G16B 20/00 20060101
G16B020/00 |
Claims
1-15. (canceled)
16. A method of genotyping single nucleotide polymorphisms (SNPs)
by identifying off-target variants (OTVs) using a computer system
comprising: accessing genotype clustering data from a number of
samples from a probeset location, the genotype clustering data
comprising genotype cluster calls of samples, cluster centers, and
variance; accessing sample data that indicates contrast and
intensity detected at a probeset location across a number of
samples; searching for optimal clustering of said sample data in
both contrast and intensity dimensions, allowing for an OTV
genotype cluster; iteratively updating sample assignments to
clusters and cluster centers until convergence; and outputting
genotyping cluster results for said samples.
17. The method according to claim 16 wherein said identifying a
probeset as an OTV probeset is not based on characteristics of any
single sample.
18. The method according to claim 16 wherein said searching and
said iteratively updating are performed as part of an expectation
maximization.
19. The method according to claim 16 wherein said searching and
said iteratively updating use portions of sample intensity data
that were not used to generate said genotype clustering data.
20. The method according to claim 16 wherein said searching and
said iteratively updating give sample intensity data more weight
than was used for clustering said genotype clustering data.
21. The method according to claim 16 further comprising:
determining a distance from the center of a heterozygote cluster to
a line connecting homozygote clusters when said samples are plotted
according to contrast and intensity; comparing the distance to a
threshold; labeling as OTV clusters, clusters characterized by good
cluster resolution according to a plurality of cluster parameters
and by said distance exceeding a threshold.
22. The method according to claim 21 wherein said distance is
referred to as a heterozygote strength offset (HetSO).
23. The method according to claim 16 further comprising:
determining if samples outside of a populated OTV cluster
constitute a genotype.
24. The method according to claim 16 wherein the genotype
clustering data is generated by an automated clustering algorithm
comprising one or more selected from the group: AxiomGT1, BRLMM-P,
or any other suitable clustering algorithm.
25. The method according to claim 16 wherein said searching and
iteratively updating is performed by a clustering algorithm
selected from the group: Expectation-Maximization (EM), 2D
Baum-Welch, Viterbi, HMM analysis.
26. The method according to claim 16 further comprising:
distinguishing heterozygotes clusters from OTV clusters.
27. The method according to claim 16 further comprising:
determining a cluster is an OTV cluster when: the samples overall
have good genotype cluster properties with the exception of a
cluster in an OTV position; Fishers linear discriminant (FLD) of
the cluster is greater than or equal to a threshold; HomRO is
greater than or equal to threshold; homozygote cluster(s) is offset
from zero in the contrast dimension; and HetSO is less than or
equal to a threshold.
28. The method according to claim 16 further comprising:
determining a cluster is an OTV cluster when: the initial sample
clusters have good genotype cluster properties according to one or
more cluster parameters, with the exception of a cluster with a
HetSO value less than or equal to a threshold; Fishers linear
discriminant (FLD) of the cluster is greater than or equal to a
threshold; HomRO is greater than or equal to threshold; homozygote
cluster(s) is offset from zero in the contrast dimension; and HetSO
is less than or equal to a threshold.
29. The method according to claim 16 further comprising: selecting
between four and three genotype clusters models by, for each model:
initiating by using posterior information of genotype cluster from
a clustering algorithm; assigning each sample into one of genotype
clusters according to expectation/likelihood, appropriate for the
model; calculating the genotype cluster center locations by
maximizing likelihood; and selecting the model with the greatest
maximum likelihood.
30. A computer readable medium containing computer interpretable
instructions that when loaded into an appropriately configured
information processing device will cause the device to operate in
accordance with the method of claim 16.
31. The method according to claim 16 further comprising: using said
selected probeset to genotype said SNP.
32. The method according to claim 16 further comprising: using said
selected probeset to select probes for use in a genomic assay
system or device, such as a genotyping array.
33. (canceled)
34. A computer readable medium containing computer interpretable
instructions that when loaded into an appropriately configured
information processing device will cause the device to operate in
accordance with the method of claim 16.
35-56. (canceled)
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. application Ser.
No. 13/844,285 filed on Mar. 15, 2013. All referenced documents and
applications herein and all documents referenced therein are
incorporated herein by reference for all purposes. This application
may be related to other patent applications and issued patents
assigned to the assignee indicated above. These applications and
issued patents are incorporated herein by reference to the extent
allowed under applicable law.
COPYRIGHT NOTICE
[0002] Pursuant to 37 C.F.R. 1.71(e), applicant notes that a
portion of this disclosure contains material that is subject to and
for which is claimed copyright protection (such as, but not limited
to, source code listings, screen shots, user interfaces, or user
instructions, or any other aspects of this submission for which
copyright protection is or may be available in any jurisdiction).
The copyright owner has no objection to the facsimile reproduction
by anyone of the patent document or patent disclosure, as it
appears in the Patent and Trademark Office patent file or records.
All other rights are reserved, and all other reproduction,
distribution, creation of derivative works based on the contents,
public display, and public performance of the application or any
part thereof are prohibited by applicable copyright law.
APPENDIX
[0003] This application is being filed electronically with a source
code appendix that sets out selected source code portions from a
copyrighted software program, owned by the assignee of this patent
document, which manifests the invention according to specific
embodiments. This appendix and all other papers filed herewith,
including papers filed in any attached Information Disclosure
Statement (IDS), are incorporated herein by reference. The appendix
contains further examples and information related to various
embodiments of the invention at various stages of development.
[0004] Permission is granted to make copies of the appendices
solely in connection with the making of facsimile copies of this
patent document in accordance with applicable law; all other rights
are reserved, and all other reproduction, distribution, creation of
derivative works based on the contents, public display, and public
performance of the appendix or any part thereof are prohibited by
the copyright laws.
BACKGROUND
1. Field of the Invention
[0005] The present invention relates to systems and methods for
processing data using information gained from examining biological
material. In particular, a preferred embodiment of the invention
relates to analysis of processed image data from scanned biological
probe arrays for determining genotype information via
identification of polymorphisms, e.g., Single Nucleotide
Polymorphisms (SNPs), insertion/deletion polymorphisms (indels),
whole or partial chromosome deletions or duplications.
2. Related Art
[0006] Synthesized nucleic acid probe arrays, such as
Affymetrix.RTM. arrays (Affymetrix, Inc., Santa Clara, Calif.) in
the form of GeneChip.RTM. array cartridges, peg array strips, and
Axiom.RTM. peg array plates, Illumina.RTM. BeadArray.RTM. arrays
(Illumina, Inc., San Diego, Calif.), and Agilent.RTM.
SurePrint.RTM. arrays (Agilent Technologies, Inc., Santa Clara,
Calif.) and spotted probe arrays, have been used to generate
unprecedented amounts of information about biological systems. A
variety of techniques are known for the manufacture and use of
nucleic acid probe arrays, such as the techniques disclosed within,
e.g., U.S. Pat. No. 5,143,854 to Pirrung et al.; U.S. Pat. No.
5,744,305 to Fodor et al.; U.S. Pat. No. 7,332,273 to Trulson et
al.; U.S. Pat. Nos. 5,945,334 and 6,140,044 to Besemer et al.; U.S.
Pat. No. 5,545,531 to Rava et al.; U.S. Pat. No. 6,660,233 to
Coassin et al.; U.S. Patent Application Publication Nos.
2004/0038388 and 2006/0088863 to Yamamoto et al.; U.S. Patent
Application Publication No. 2005/0023672 to Oostman et al.; U.S.
Patent Application Publication No. 2008/0003667 to Jones et al.;
U.S. Patent Application Publication Nos. 2006/0246576,
2006/0234371, 2011/0136699 and 2010/0248981 to Shirazi; pending
U.S. patent application Ser. No. 13/157,268, filed Jun. 9, 2011;
U.S. Pat. No. 6,242,266 to Schleifer et al.; U.S. Pat. No.
6,375,903 to Cerrina et al.; U.S. Pat. No. 5,436,327 to Southern et
al.; U.S. Pat. No. 5,474,796 to Brennan; U.S. Pat. No. 5,658,802 to
Hayes et al.; U.S. Pat. No. 5,770,151 to Roach et al.; U.S. Pat.
No. 5,807,522 to Brown et al.; U.S. Pat. No. 5,981,733 to Gamble et
al.; U.S. Pat. No. 6,101,946 to Martinsky; U.S. Pat. Nos. 6,355,431
and 6,429,027 to Chee et al.; U.S. Pat. No. 7,510,841 to
Stuelpnagel et al., U.S. Pat. Nos. 7,745,091 and 7,745,092 to True;
U.S. Patent Application Publication No. 2010/0297448 to True et
al.; and U.S. Patent Application Publication Nos. 2010/0227279,
2010/0227770 and 2009/0149340 to True, all of which are expressly
incorporated herein by reference for all purposes. For example, the
Genome-Wide Human SNP Array 6.0 from Affymetrix, Inc. is able to
genotype more than 900,000 SNPs while Axiom.RTM. Genotyping Arrays
from Affymetrix, Inc. are able to genotype from a customizable
selection of between 1,500 and 2.6 million SNPs per array. Analysis
of genotype data from such microarrays may lead to the development
of new drugs, new varieties or strains of organisms, including
plants, animals, bacteria, archaea and fungi, new diagnostic tools
and treatments based upon genetic information (including
information tailored to specific target populations) and the
correlation of such information to diseases such cancer, and
agrigenomic testing to maximize the economic value of plant and
animal species such as bovine, buffalo, chicken, rice, lettuce and
pepper.
[0007] Many methods and techniques for genotyping are known in the
art. For example, see U.S. Pat. Nos. 6,300,063; 6,687,692;
6,223,127; 7,099,777; 6,850,846; 6,988,040; and 7,031,846, each of
which is incorporated by reference in their entireties. See also
U.S. Published Patent Application Publication Nos. 2002/0168651;
2003/0120431; 2005/0123971; 2004/0157243; 2006/0134674;
2004/0117128; 2004/0117127; 2005/0009069; and 2007/0128647, each of
where is incorporated by reference in their entireties.
[0008] The discussion of any work, publications, sales, or activity
anywhere in this submission, including in any documents submitted
with this application, shall not be taken as an admission that any
such work constitutes prior art. The discussion of any activity,
work, or publication herein is not an admission that such activity,
work, or publication existed or was known in any particular
jurisdiction.
Other Features & Benefits
[0009] The invention and various specific aspects and embodiments
will be better understood with reference to the following drawings
and detailed descriptions. For purposes of clarity, this discussion
refers to devices, methods, and concepts in terms of specific
examples. However, the invention and aspects thereof may have
applications to a variety of types of devices and systems. It is
therefore intended that the invention not be limited except as
provided in the attached claims and equivalents.
[0010] Furthermore, it is well known in the art that logic systems
and methods such as described herein can include a variety of
different components and different functions in a modular fashion.
Different embodiments of the invention can include different
mixtures of elements and functions and may group various functions
as parts of various elements. For purposes of clarity, the
invention is described in terms of systems that include many
different innovative components and innovative combinations of
innovative components and known components. No inference should be
taken to limit the invention to combinations containing all of the
innovative components listed in any illustrative embodiment in this
specification.
[0011] Before describing the present invention in detail, it is to
be understood that this invention is not limited to particular
compositions or systems, which can, of course, vary. It is also to
be understood that the terminology used herein is for the purpose
of describing particular embodiments only, and is not intended to
be limiting. As used in this specification and the appended claims,
the singular forms "a", "an" and "the" include plural referents
unless the content and context clearly dictates otherwise. Thus,
for example, reference to "a device" includes a combination of two
or more such devices, and the like.
[0012] Unless defined otherwise, technical and scientific terms
used herein have meanings as commonly understood by one of ordinary
skill in the art to which the invention pertains. Although any
methods and materials similar or equivalent to those described
herein can be used in practice or for testing of the present
invention, the preferred materials and methods are described
herein.
SUMMARY OF THE INVENTION
[0013] According to specific embodiments, the present invention is
involved with methods and/or systems and/or devices that can be
used together or independently to evaluate biological assay or
testing or experiments and provide or evaluate results. In specific
embodiments, the invention is involved with an information
processing device, such as a computer or laboratory equipment,
configured with logic instructions or modules to access data and
perform steps as described herein. In further embodiments, the
invention is involved with logic instructions and/or data recorded
on a tangible media.
[0014] In SNP genotyping and other polymorphism genotyping, a
no-call or incorrect genotype call can occur as the result of
abnormal hybridization intensity for a sample at a given location.
One source of genotype calling errors in this context is biological
in origin and can be attributed to previously uncharacterized
variation in genomic DNA, either in the sequence targeted by a
probe set or with respect to the proximal and/or distal restriction
sites used with certain reduced complexity genome-wide
amplification approaches. These variants can reduce hybridization
intensity sufficiently to eliminate or reverse the contrast between
allelic probes such that an incorrect genotype call (or no-call) is
made. These variants and are referred to herein as "off-target
variants" (OTVs).
[0015] OTVs tend to produce an additional genotype cluster besides
the normal two or three genotypes for diploid species: AA, AB and
BB, or AA, BB. This cluster can be referred to as the variant or
"V" cluster or OTV clusters. The OTV samples create problems in
genotype calling. These samples are often miscalled as "AB," the
heterozygote cluster. According to specific embodiments, systems
and methods as described herein separate this OTV cluster from the
other genotype clusters, thereby improving genotyping calls.
[0016] For example, the technique from MouseDivGenoR (Didion)
package fails to detect some OTVs. The method uses an OTV score, a
product of two p values for a sample to determine OTVs. The two p
values measure deviation from the center of its cluster and
deviation from the center of all the samples in terms of size
dimension, respectively. Then the Didion method uses a nearest
neighbor algorithm to cluster. The technique uses a stringent
threshold (score>0.9999) and fails particularly when three
genotype clusters exist, AA, BB, and V.
[0017] According to specific embodiments, an OTV genotype analysis
uses the posterior information from a genotype cluster analysis
(e.g., from an initial cluster algorithm, such as AxiomGT1) to
initiate an expectation maximization (EM) algorithm or similar
technique to determine whether an OTV cluster is present and
identify which samples are in the OTV cluster. The method can also
repeat the calls for samples in the AA, AB, and BB clusters. In
particular embodiments, a post-processing technique for OTV-Typing
uses a 2D EM algorithm with constraints to cluster based on both
contrast and strength dimensions, where use of the strength
dimension, according to specific embodiments, in determining OTV
clusters is more extensive than use of the strength in the initial
clustering In one example, the method uses clustering posteriors
from the Affymetrix Power Tool AxiomGT1 algorithm as the starting
points for AA, AB, BB and OTV clusters in the EM algorithm. Then
the algorithm iteratively updates sample assignments and cluster
centers until convergence.
[0018] According to further embodiments, an OTV genotype analysis
is done as part of the overall genotype cluster analysis, using
methods as described herein as will be understood to persons of
skill in the art having benefit of the teachings provided
herein.
[0019] In a number of prior methods and systems, part of the final
genotyping analysis for at least some polymorphisms generally
rested on a human expert or technician to either characterize one
or more probesets according quality or select among multiple
probesets for a particular polymorphisms based on perceived
quality. In many genotyping devices or systems, there are more than
one probes or probesets (e.g., 2 probes) for detecting a particular
polymorphism. The probesets are in different locations on a gene
chip, for example, but often adjacent to each other. Where 2 or
more probesets existed for a polymorphism, a human technician
viewing the data would select the probeset location for a
particular polymorphism that "looked best" in terms of having
strength, intensity, contrast, or shape characteristics that would
be associated with a good set of genotyping calls. Where only one
probeset exist for a polymorphism, a human technician viewing the
data may characterized the results for a particular location as
poor and possibly delete those results from the overall
analysis.
[0020] According to further embodiments, SNP or other polymorphism
data are analyzed by a computer system that determines a best
probeset for one or more polymorphisms automatically so that a
final best genotyping call can be made without expert human review.
According to further embodiments, SNP or other polymorphism data
are automatically classified by a computer system according to a
number of quality classifications. In a specific embodiment, six
example classifications are: (1) PolyHighResolution, characterized
by good cluster resolution, and at least 2 example of the minor
allele: (2) MonoHighResolution characterized by less than 2
examples of the minor allele usually due to low minor allele
frequency (MAF) samples, but possible cluster fusion/compression;
(3) OTV, where an OTV cluster has been called; (4) No Minor Hom
characterized by two clusters with no examples of the minor
homozygous genotypes; (5) Call Rate Below Threshold characterized
by SNP call rate below threshold, but other cluster properties are
above threshold; (6) Other, wherein one or more cluster properties
are below threshold and therefore lower quality genotypes are
likely.
[0021] According to further embodiments, a number of new quality
metrics for SNP classification are defined.
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] The above and further features will be more clearly
appreciated from the following detailed description when taken in
conjunction with the accompanying drawings. The file of this patent
contains a least one drawing executed in color. Copies of this
patent with color drawings will be provided by the United States
Patent and Trademark Office upon request and payment of the
necessary fee.
[0023] FIG. 1A provides examples of re-clustering or re-calling
genotyping data after allowing for a variant cluster according to
specific embodiments.
[0024] FIG. 1B provides examples of re-clustering or re-calling
genotyping data after allowing for a variant cluster according to
specific embodiments.
[0025] FIG. 2 illustrates workflow of an algorithm (referred to at
times as OTVCaller.TM.) according to specific embodiments, wherein
two models are considered, including a three-genotype model (AA,
BB, and OTV) and a four-genotype model (AA, AB, BB, and OTV).
[0026] FIG. 3 is a flowchart of an example method for clustering
genotyping data into genotype clusters and an OTV cluster using
posterior data from a genotype clustering according to specific
embodiments.
[0027] FIG. 4 is a flowchart of an example method for clustering
genotyping data into genotype clusters and an OTV cluster using
posterior data from a genotype clustering and evaluating more than
one group of possible genotype clusters according to specific
embodiments.
[0028] FIG. 5 is a flowchart of an example method for initially
clustering genotyping data into genotype clusters and an OTV
cluster and optionally evaluating more than one group of possible
genotype clusters according to specific embodiments.
[0029] FIG. 6A illustrates an example of a SNP genotyping data
result with a high call rate but a low FLD value according to
specific embodiments.
[0030] FIG. 6B illustrates well-clustered SNP genotyping data with
well-separated genotype cluster centers according to specific
embodiments.
[0031] FIG. 7A illustrates an example of SNP genotyping data, where
sequence mismatches between the sample and the reference genome
have produced a HetSO value outside of the acceptable range.
[0032] FIG. 7B illustrates an example of SNP genotyping data, where
sequence mismatches between the sample and the reference genome
have produced a HetSO value outside of the acceptable range.
[0033] FIG. 8A illustrates an example of SNP genotyping data
misclustered and with a negative HomRO.
[0034] FIG. 8B illustrates an example of well clustered SNP
genotyping data.
[0035] FIG. 9A illustrates an example output of a logic routine
that determines genotyping data quality metrics or probeset assay
data quality metrics according to specific embodiments.
[0036] FIG. 9B illustrates an example output of a logic routine
that determines genotyping data quality metrics or probeset assay
data quality metrics according to specific embodiments.
[0037] FIG. 10 is an illustration of six genotyping data type
classification according to specific embodiments.
[0038] FIG. 11 is a flowchart illustrating overall steps for
polymorphism (e.g., SNP) analysis, including classification and OTV
identification according to specific embodiments.
[0039] FIG. 12 shows an example of visualization of a SNP
genotyping data probeset results with interpretations using a logic
routing according to specific embodiments.
[0040] FIG. 13 is a block diagram showing a representative example
logic device in which various aspects of the present invention may
be embodied.
DESCRIPTION OF SPECIFIC EMBODIMENTS
Glossary
[0041] OTV stands for off target variant (OTV), which is a variant
of a genomic DNA sequence that was previously uncharacterized or
otherwise not accounted for within design of, e.g., a probe set
intended to interrogate a SNP within that genomic DNA sequence.
This unaccounted for variation in turn affects the overall
intensity and contrast in the genotyping cluster data (e.g., the
data associated with A and B allele signals) based upon the
hybridization of the probes within the probe set with the OTV.
[0042] An OTV cluster is an additional genotype cluster that is
utilized according to embodiments described herein to improve
genotyping calls believed to be affected by OTVs. Thus, in the
example of a diploid organism that presents three possible
genotypes for a particular SNP (e.g., AA, BB, AB), an OTV cluster
would be a fourth cluster used to improve the genotyping calls of
AA, BB, and AB for the particular SNP within the samples at
issue.
[0043] A single-nucleotide polymorphism (SNP, pronounced snip;
plural snips) is a DNA sequence variation occurring when a single
nucleotide--A, T, C or G--in the genome (or other shared sequence)
differs between members of a biological species or paired
chromosomes in a human. For example, two sequenced DNA fragments
from different individuals, AAGCCTA to AAGCTTA, contain a
difference in a single nucleotide. In this case, we say that there
are two alleles. Almost all common SNPs have only two alleles. The
genomic distribution of SNPs is not homogenous; SNPs usually occur
in non-coding regions more frequently than in coding regions or, in
general, where natural selection is acting and fixating the allele
of the SNP that constitutes the most favorable genetic adaptation.
Besides natural selection, other factors like genetic recombination
and mutation rate can also determine SNP density. In many example
species, one in 100-1000 nucleotides is a SNP. SNP frequency
differs between ethnic/geographical groups. In some assay systems,
methods used for distinguishing SNP alleles can also distinguish
other alleles common in the genome, such as insertions/deletions.
When used herein, unless the context requires, SNP encompasses any
polymorphism that can be detected according to the methods for
detecting SNPs.
[0044] Probesets are sets of oligonucleotide sequences that are
used to determine the presence of a particular SNP.
REFS
[0045] 1. Didion J P, Yang H, Sheppard K, Fu C P, McMillan L,
Pardo-Manuel de Villena F, Churchill G A: Discovery of novel
variants in genotyping arrays improves genotype retention and
reduces ascertainment bias. BMC Genomics 2012, 13:34.
[0046] 2. Yang H, Ding Y, Hutchins L N, Szatkiewicz J, Bell T A,
Paigen B J, Graber J H, Pardo-Manuel de Villena F, Churchill G A: A
customized and versatile high-density genotyping array for the
mouse. Nat Methods 2009, 6:663-6.
[0047] 3. Yang H, Wang J R, Didion J P, Buus R J, Bell T A, Welsh C
E, Bonhomme F, Yu A H, Nachman M W, Pialek J, Tucker P, Boursot P,
McMillan L, Churchill G A, Pardo-Manuel de Villena F: Subspecific
origin and haplotype diversity in the laboratory mouse. Nat Genet.
2011 43:648-5.
OTVs
Overview
[0048] SNP analysis is a known tool for studying genetic variation
generally in closely related subjects (e.g., different individuals
of one species, different strains of a species, related or closely
related species, etc.) Modern SNP genome analysis generally targets
a very large number of SNPs and optionally some other
polymorphisms, each having at least one probeset. In some systems,
many polymorphisms may have two or more different probesets, with
each of the different probesets providing a possible genotyping
result for the polymorphism. In one method, an individual sample is
exposed to a genotyping array or other probeset system to determine
the presence of different polymorphism alleles in the sample.
Because most organisms have multiple copies of every chromosome,
there can be different alleles detected for the same sample. Thus,
a sample is generally characterized by multiple alleles (e.g., 2)
of each polymorphism. Determining the multiple alleles for a
polymorphism is generally referred to in the art as genotyping or
SNP genotyping.
[0049] One example of a recent, high-density, SNP detection array
is the Affymetrix.RTM. Mouse Diversity Genotyping Array. The entire
array is tiled (populated) with more than seven million 25-base
pair oligonucleotide probes, which can assay nearly 625,000 SNPs
and more than 900,000 invariant genomic probes (IGPs). The probes
bind to labeled DNA from a target sample. Generally, analysis
software is used to quantify the brightness of each fluorescing
DNA-probe complex on a gridded image. High intensity spots indicate
high affinity between the probe and target DNA sequences and are
used to decode the genotypes of individual SNPs. Affymetrix
provides other arrays, including human, dog, and other mouse
arrays.
[0050] SNP or polymorphism genotype calling refers to the process
of determining at a polymorphism location what alleles are present.
In bi-allelic SNP polymorphisms, there are generally two different
base pairs that can be present at a location, which may be referred
to as allele A and allele B. The genotype of an SNP is generally
one of (A, A), (B, B), or (A, B). The first two genotypes are
generally referred to as homogeneous and the last as heterogeneous.
The same is true for some other polymorphisms, such as a number of
insertion/deletion polymorphisms.
[0051] Methods, comprising one or more of software programs, logic
modules, and data capture systems, use a variety of methods to
infer or determine discrete genotypes for an SNP location from
continuous intensity data from a number of samples. Many methods,
(such as the BRLMM-P or AxiomGT1 algorithm developed by Affymetrix)
employ clustering of multiple samples based on the contrast between
allelic probe intensities. Samples belonging to the two clusters
with a large absolute contrast are called as homozygous genotypes
and samples with low contrast are called as heterozygous genotypes.
Samples that do not fall within any of the defined (e.g., two or
three) clusters in the contrast dimension are generally uncalled,
or null, often referred to as "no call." Aspects related to the
BRLMM-P algorithm are discussed in U.S. Pat. No. 8,200,440 to
Hubbell and assigned to Affymetrix. The entire contents of that
patent, including all incorporated references, are incorporated
herein by reference for all purposes. The BRLMM-P algorithm with
some modifications and improvements is incorporated into the
Affymetrix AxionGT1 analysis system.
[0052] A no-call or incorrect genotype call is the result of
abnormal hybridization intensity for a sample at a given SNP
location. One source of genotype calling errors is biological in
origin and can be attributed to previously uncharacterized
variation in genomic DNA, either in the sequence targeted by a
probe set or in the proximal or distal restriction sites used for
genome-wide amplification. These variants can reduce hybridization
intensity sufficiently to eliminate or reverse the contrast between
allelic probes such that an incorrect genotype call (or no-call) is
made. Didion termed such variants "off-target variants" (OTVs) to
distinguish them from the expected variant targeted by the SNP
probe set and termed probe sets affected by OTVs as variable
intensity oligonucleotides (VINOs) due to the dynamic effect of
OTVs on hybridization intensity. In the present discussion, the
term OTV is used to refer to the both the allele and probe sets
affected by it. Genotyping errors due to uncharacterized sequence
variation were largely unaddressed in prior SNP genotyping studies.
Didion proposed a MouseDivGenoR genotype-calling algorithm
implemented as a package for the R language that is claimed capable
of recognizing OTVs and treating samples having OTVs as a distinct
genotype class. Didion also proposed a tool for visualization of
probe set hybridization intensities in which intensity contrast is
plotted against average intensity. These plots revealed that most
erroneous heterozygous calls in inbred strains are easily
distinguished from true heterozygous calls by greatly reduced
average intensity. Didion suggested that genotyping algorithms that
discriminate based only on contrast (such as BRLMM-P) would be
unable to detect OTVs.
[0053] Thus, "off-target variants" (OTVs) tend to produce an
additional genotype cluster besides the normal two or three
genotypes for diploid species, AA, AB and BB, or AA, BB. This
cluster can be referred to as the variant or "V" cluster or the OTV
cluster. The OTV cluster samples create problems in genotype
calling. These samples are often miscalled as "AB," the
heterozygote cluster. According to specific embodiments, systems
and methods as described herein separate this OTV cluster from the
other genotype clusters more effectively than previous methods.
[0054] For example, the technique from MouseDivGenoR (Didion)
package fails to detect some OTVs. The method uses an OTV score, a
product of two p values for to determine OTVs. The two p values
measure deviation from the center of cluster and deviation from the
center of all the samples in terms of size (or intensity)
dimension, respectively. Then the Didion method uses a nearest
neighbor algorithm to determine the V cluster. The technique uses a
stringent threshold (score>0.9999) and fails particularly when
three genotype clusters exist, AA, BB, and V.
OTV Identification and Genotype Clustering
[0055] FIG. 1A-B provide examples of re-clustering or re-calling
genotyping data after allowing for a variant cluster according to
specific embodiments. According to specific embodiments, an OTV
genotype analysis uses the posterior information from a genotype
cluster analysis (e.g., from an initial cluster algorithm, such as
AxiomGT1.RTM.) to initiate an expectation maximization EM algorithm
or similar technique to determine whether an OTV cluster is present
and to identify or call which samples are in the OTV cluster.
According to specific embodiments, a method can also repeat the
calls for samples in the AA, AB, and BB clusters. In particular
embodiments, a post-processing technique for OTV-Typing uses a 2D
EM algorithm with constraints to cluster based on both contrast and
strength dimensions. In one example, the method using the
posteriors from the Affymetrix Power Tool (APT) AxiomGT10 algorithm
as the starting points for AA, AB, BB and OTV clusters in the EM
algorithm. Then the algorithm iteratively updates sample
assignments and cluster centers until convergence is reached.
[0056] Thus, according to specific embodiments, a statistical
approach generally considering all the sample data from a probeset
is used to genotype the OTVs. In one example, the posterior
genotyping information generated by the automated clustering
algorithm AxiomGT10 is used to initiate the 2D Baum-Welch algorithm
in order to search for the optimal clustering including the
possibility of an OTV cluster using both intensity and contrast
information from the samples. (In the example figures provided
herein, sample data is plotted with intensity on the X-axis and
contrast on the Y-axis and these orientations are used in some of
the description below. As is known in the art, different plots of
the same data are equivalent.) As is known in the art, the 2D
Baum-Welch algorithm then iteratively updates sample assignments
and cluster centers until convergence is reached. An overview
workflow of such a method, (referred to at times as the
OTVCaller.TM.) algorithm is illustrated in FIG. 2. (In one example
embodiment, the method is implemented with other methods, including
methods described herein, in a combined software package).
[0057] The AxiomGT1 algorithm is based on the BRLMM-P algorithm and
incorporates a multi-channel processing method and features for
pre-processing and genotype calling Clusters in this particular
system are represented as two-dimensional Gaussians and resistance
to non-Gaussian cluster behavior has been improved. Training data
has been used to generate SNP-specific models that represent the
cluster properties learned for each marker. Unlike the DMET Plus
Assay, which is designed to call in a single-sample mode without
adapting to the data, the AxiomGT1 default behavior uses dynamic
clustering to adapt the clusters to the observed data. Although a
single sample can be run by itself, more samples allow the
algorithm to learn more from the training data.
Off Target Variant (OTV)
[0058] OTV SNPs or other polymorphisms according to specific
embodiments are identified or genotyped using posterior genotyping
information as the prior information for the genotype clusters
including an OTV cluster, e.g., (AA, AB, BB and V), (AA, BB and V).
An EM algorithm is applied to iteratively learn the optimal
locations of centers of these clusters and make OTV calls for the
OTV cluster until convergence is reached. According to specific
embodiments, the method successfully identifies an OTV cluster in
the presence of three genotypes AA, AB and BB or two genotypes AA
and BB.
[0059] Further embodiments, use a metric for SNP probeset quality
control referred to at times as heterozygote strength offset
(HetSO). HetSO in an example embodiment is defined generally as the
vertical distance or other distance depending on orientation (as
measured by [A+B]/2 or signal size) from the center of the
heterozygote cluster to the line connecting the centers of the
homozygote clusters. Generally, large negative values are
associated with or indicate OTV clusters. According to specific
embodiments, OTVs are therefore recognized as genotypes with good
cluster resolution but with a HetSO (or related value) meeting
specific criteria (e.g., a threshold of HetSO<about -0.3). The
criteria are customizable or adjustable according to specific
embodiments.
Specific Example
[0060] A highly detailed description of a method for OTV genotyping
is presented below. This description is presented for illustrative
purposes. A number of possible variations will be understood to
those of skill in the are have the benefits of the teachings
provided herein.
[0061] One example OTV genotyping algorithm analyzes probesets one
by one and proceeds generally as follows. [0062] For a given
probeset, assume that there are K genotypes, where K can be 3 (AA,
BB, OTV) or 4 (AA, AB, BB, OTV). There are N samples genotyped at
that SNP. [0063] Denote x.sub.j to be contrast value of sample j
(where j=1, 2, . . . , N), which is calculated as log 2(intensity
from channel A)-log 2(intensity from channel B). [0064] Denote
y.sub.j to be average intensity strength of sample j, also called
as "size", which is calculated as (log 2(intensity from channel
A)+log 2(intensity from channel B))/2. [0065] Denote z.sub.jk to be
the probability that sample j belongs to genotype cluster k, where
j=1, 2, . . . , N and k=1, 2, . . . , K. [0066] Denote a.sub.k to
be center location (mean) of size dimension for genotype cluster k
and m.sub.k to be center location (mean) of contrast dimension for
genotype cluster k. [0067] Denote Va.sub.k to be variance of size
dimension for genotype cluster k and Vm.sub.k to be variance of
contrast dimension for genotype cluster k. [0068] Denote p.sub.k to
be the prior probability of assignment of any sample to genotype
cluster k.
[0069] A 2D Gaussian mixture model is fitted to both the size and
the contrast dimension data. It uses the Expectation-Maximization
(EM) algorithm. For maximization step, the mean and variance of
genotype cluster k for both size and contrast dimensions are
calculated to maximize the likelihood. To initialize the EM
algorithm, obtain the values from prior clustering, To estimate the
maximum likelihood estimator (MLE) of contrast mean of genotype
cluster k:
m k = j = 1 N x j z jk j = 1 N z jk ##EQU00001##
[0070] The MLE of size mean of genotype cluster k is
a k = j = 1 N y j z jk j = 1 N z jk ##EQU00002##
[0071] The MLE of size variance of genotype cluster k is
Vm k = j = 1 N ( x j - m k ) 2 z jk j = 1 N z jk ##EQU00003##
[0072] The MLE of size variance of genotype cluster k is
Va k = j = 1 N ( y j - a k ) 2 z jk j = 1 N z jk ##EQU00004##
[0073] The MLE of probability that any sample belongs to genotype
cluster k is
p k = j = 1 N z jk N ##EQU00005##
[0074] For expectation step, the probability of sample j assigning
to genotype cluster k is:
z jk = p k Norm ( x j , m k , Vm k ) Norm ( y j , a k , Va k ) i =
1 K p i Norm ( x j , m i , Vm i ) Norm ( y j , a ki , Va i )
##EQU00006##
[0075] The algorithm iterates the Expectation step and the
Maximization step until convergence is reached. Then Bayesian
Information Criteria (BIC) is used, as will be understood in the
art, to select among the models with three or four genotype
clusters.
[0076] FIG. 3 is a flowchart of an example method for clustering
genotyping data into genotype clusters and an OTV cluster using
posterior data from a genotype clustering according to specific
embodiments.
[0077] FIG. 4 is a flowchart of an example method for clustering
genotyping data into genotype clusters and an OTV cluster using
posterior data from a genotype clustering and evaluating more than
one group of possible genotype clusters according to specific
embodiments.
[0078] FIG. 5 is a flowchart of an example method for initially
clustering genotyping data into genotype clusters and an OTV
cluster and optionally evaluating more than one group of possible
genotype clusters according to specific embodiments.
SNP Probeset Quality Control (QC) Metrics
[0079] According to further specific embodiments, a set of probeset
quality control (qc) metrics detects misclustered SNPs or other
polymorphisms and other properties to provide better analysis.
According to specific embodiments, the set of QC metrics can
include one or more of the metrics discussed below. According to
further specific embodiments, one or more of these metrics are used
to classify probesets into quality or characteristic classes.
[0080] Some SNP metrics have previously been used to rank probes
for genotyping, see, e.g.,
affymetrix(.)com/support/developer/powertools/changelogNIGNETTE-WGSA-empi-
-rical-probe-ranking html. According to specific embodiments,
combinations of previously used metrics and new metrics have been
developed to classify SNP probeset genotyping results into one of
several classifications, including an OTV classification.
SNP Call Rate
[0081] According to specific embodiments, the SNP call rate is an
indication of how many SNPs, generally across multiple samples,
were called into an available genotype. It may be expressed as a
percentage or ratio or decimal and while variations are possible,
is generally expressed as (Samples Called/Total Samples.) In
specific embodiments, SNP call rate=Number of Samples Called/N
N=the number of samples over which a genotype call is attempted for
the SNP. According to specific embodiments, number of samples
called is the number of samples assigned a genotype call (e.g.,
either AA, BB or AB, AA or BB, genotypes for polyploidy, optionally
genotypes including an OTV cluster) at the SNP locus. This is also
generally the number of samples that do not have a "No Call"
assignment. According to specific embodiments, SNP call rate (which
is different than sample call rate) is a measure of both data
completeness and of genotype cluster quality (at low values). Very
low SNP call rates are generally due to a failure to resolve the
genotype clusters, and poor cluster resolution may produce
inaccurate genotypes in the samples that are called. Even if the
called genotypes are correct, if the no-calls are non-randomly
distributed across the genotype clusters, this missing information
may lead to false positive associations in a genome-wide
association (GWA) study.
FLD
[0082] Fisher's Linear Discriminant is a well-known statistical
method generally used to find a combination of features that
characterizes or separates two or more classes of objects or events
and previously used to characterized probesets. According to
specific embodiments, Fisher's Linear Discriminant (FLD) is applied
to the genotype clusters and is a measure of the cluster quality of
a probeset. It is used with other parameters and metrics as
discussed herein to characterize probesets. High-quality clusters
have well-separated centers (with respect to other clusters) and
little variance about the center of the cluster (i.e., the clusters
are narrow). High-quality clusters can be identified in cluster
plots by examining the shape and separation of the SNP probeset
posteriors that are produced during genotyping. The posteriors
(e.g., the data or a diagrammatic representation (e.g., an
ellipse)] that identifies the cluster location and variance of each
genotype cluster that is produced by the genotyping algorithm) for
the individual clusters should be narrow and have centers that are
well separated from each other. It is recommended that probesets
with an FLD value less than a threshold (e.g., 3.6 in one example
system) be removed from the study prior to downstream analysis. SNP
call rates and FLD values are normally correlated, but in some
cases FLD will detect problems not detected by the SNP call rate
metric. FIG. 6A illustrates an example of a SNP genotyping data
result with a high call rate but a low FLD value and in contrast, B
illustrates well-clustered SNP genotyping data with well-separated
genotype cluster centers. In FIG. 6A, the clustering algorithm has
found the location of the BB cluster too close to the AB cluster
and has assigned two data points (which are actually AB genotypes)
to the BB genotype cluster (blue). The misclustering event is
detected by measuring the FLD value for this SNP because the
mislabeled BB cluster center is very close to the AB cluster
center, thus producing an FLD value of 3.30. In contrast, the
well-clustered The SNP in FIG. 6B has well-separated genotype
cluster centers and an FLD value of 4.9 FLD in this figure is
basically the smallest difference between the cluster centers as
indicated by the triangles. BB cluster samples are shown as blue
inverted triangles. AA cluster samples are shown as red triangles.
AB cluster samples are shown as yellow circles and No Calls are
shown as grey boxes.
[0083] According to specific embodiments, the FLD is computed
directly from the genotypes and the data, consisting of the
distance between cluster centers divided by the standard deviation
within clusters. This measures the separation between clusters
directly in terms of cluster variation. While a useful metric, it
is difficult to interpret when genotypes are missing for one or
more clusters (and is filled in by prior information). In specific
examples, there are 3 FLD values, one for each cluster pair.
Generally, the largest FLD is the `best`. [0084] FLDAH-FLD between
AA and AB [0085] FLDHB-FLD between AB and BB [0086] FLDAB-FLD
between AA and BB
[0087] According to specific embodiment, an FLD for the genotype
clusters of an SNP can be determined generally as follows:
(FLD)=Min(I=AA, BB) {M.sub.AB-M.sub.i/ad} [0088] Where:
M.sub.AB=center of heterozygous cluster in log ratio dimension,
M.sub.AA, M.sub.BB=center of homozygote A, B cluster in log ration
dimension; sd=square root of variance pooled across all three
distributions. FLD is undefined if either of the two clusters (AB
and either AA or BB) is unpopulated.
Heterozygous Cluster Strength Offset (HetSO)
[0089] Heterozygous Cluster Strength Offset (HetSO) is defined
according to specific embodiments as the vertical distance (as
measured by [A+B]/2 or signal size) from the center of the
Heterozygous (e.g. AB) cluster to the line connecting the centers
of the homozygous clusters (e.g., AA, BB). Low HetSO values are
produced by misclustering events and by inclusion of samples with
high degree of mismatches relative to the reference genome (i.e.,
samples with variation in the genome region against which the 30
mer SNP probe sequence was designed). FIGS. 7A and B illustrate an
example of SNP genotyping data, where sequence mismatches between
the sample and the reference genome have produced a HetSO value
outside of the acceptable range. When there is a high degree of
mismatch between the sample and reference genome, the A and B
intensities produced by the samples are low, and fall into the
location of the heterozygous cluster. If enough samples fall into
this heterozygous cluster location, then the clustering algorithm
incorrectly calls these cases as AB (instead of No Calls). These
samples can be identified by low HetSO values. Visually, the
average signal value (as measured along the y-axis) is much lower
for the heterozygous cluster than for the homozygous clusters.
Generally, there is a recommendation to remove any SNPs with HetSO
less than -0.1 or some other threshold from the downstream
genotyping analysis.
HetSO=A.sub.AB-A.sub.BB+(A.sub.AA-A.sub.BB).times.(M.sub.AB-M.sub.BB/M.s-
ub.AA-M.sub.BB) [0090] Where (M.sub.AA, A.sub.BB)=center of AA
cluster, etc.
Homozygote Ratio Offset (HomRO)
[0091] Homozygote Ratio Offset (HomRO) is the location, in the
contrast dimension (e.g., the X axis in example figures provided
herein) of the homozygous genotype cluster center that is closest
to zero (the heterozygote position), and/or most likely to be
misplaced. If the homozygous cluster is on the expected side of
zero (the heterozygous position), the value is positive, otherwise
negative. A low or negative value tends to indicate that the
genotyping algorithm has mislabeled the clusters, producing
incorrect genotype cells. Considering only populated homozygote
clusters: [0092] If both are on the proper side of zero contrast,
the HomRO=min(X.sub.AA, abs(X.sub.BB)), where X.sub.AA=the
posterior mean of the AA cluster, and X.sub.BB=the posterior mean
of the BB cluster. [0093] If both are positive (on the AA side of
zero), HomRO=-X.sub.BB (negative indicates "wrong side") [0094] If
both are negative (on the BB side of zero), HomRO=X.sub.AA.
[0095] FIG. 8A illustrates an example of SNP genotyping data
misclustered and with a negative HomRO value and B illustrates a
well clustered SNP genotyping data. In this example the HomRO value
is -2.26. The putative AA (red) genotype cluster is on the
wrong/negative side in contrast space. As a consequence of
incorrectly identifying most of the data points as AA, the
genotyping algorithm has additionally produced a cluster split and
labeled some of the points (all actually BB) as AB, where AA (red)
genotype cluster is positive (0.84) as expected.
[0096] In a specific example embodiment, one or more of these
metrics are calculated for a plurality of available probesets from
a probeset cluster data file. FIG. 9A-B illustrate an example
output of a logic routine that determines genotyping data quality
metrics or probeset assay data quality metrics according to
specific embodiments.
SNP Type Classification
[0097] FIG. 10 is an illustration of six genotyping data type
classification according to specific embodiments. In one example,
this classification is enabled with logic routines as part of a
larger SNP analysis system, (e.g., Axiom Analysis Workflow.RTM.).
For example, the illustrated classification examples were extracted
from Maize. As shown in the figure, the six example classifications
are: [0098] (1) PolyHighResolution, characterized by good cluster
resolution, and at least 2 examples of the minor allele. According
to specific embodiments, these can be understood as converted
polymorphic SNPs. [0099] (2) MonoHighResolution, characterized by
less than 2 sample points of the minor allele usually due to low
MAF samples, but possible cluster fusion/compression According to
specific embodiments, these can be understood as converted
monomorphic SNPs. [0100] (3) Off Target Variant (OTV), where an OTV
cluster has been called as discussed herein. According to specific
embodiments, these probesets give very low signal intensity in some
individuals. Often the result of a nearby destabilizing SNPs or a
small deletion. [0101] (4) No Minor Homozygote, characterized by
two clusters with no examples of the minor homozygous genotypes.
According to specific embodiments, these can be understood as 2
cluster SNPs found in homozygous inbred individuals. [0102] (5)
Call Rate Below Threshold, characterized by SNP call rate below
threshold, but other cluster properties are above threshold. [0103]
(6) Other, wherein one or more cluster properties are below
threshold and therefore lower quality genotypes are likely. These
are effectively unconverted SNPs.
Example Rules for Identifying SNP Types
[0104] According to specific embodiments, one or more SNP types are
defined in terms of quality metrics as discussed above and other
parameters of SNPs. An example of a set of SNP types and example
rules for defining them according to specific embodiments is
provided below. Not all rules may be necessary or used for defining
the classifications. [0105] I. PolyHighResolution [0106] SNP
Cluster passes all thresholds as discussed below. [0107] II.
MonoHighResolution [0108] SNP Cluster passes all thresholds, except
nMinorAllele [0109] III. Off Target Variant (OTV) [0110] SNP(s)
have good cluster properties, with the exception of a cluster in
the OTV position. [0111] FLD>=threshold (e.g., 3.6) [0112]
HomRO>=threshold [0113] HomHet==1, if set [0114] Heterozygous
cluster is populated [0115] HetSO<threshold (e.g., -0.3) [0116]
Homozygote cluster(s) is offset from zero in the contrast dimension
[0117] IV. CallRateBelowThreshold [0118] SNP Cluster passes all
thresholds, except Call Rate, with nMinorAllele ignored [0119] V.
NoMinorHom [0120] Only if HomHet is set [0121] SNP Cluster passes
all thresholds and is a two cluster SNP (one Homozygous and one
Heterozygous), with HetSO and nMinorAllele ignored
"No Minor Hom" Classification Considerations
[0122] When genome is polyploid, unambiguous assignment of the
genotype for a 2 cluster SNP ("NoMinorHom" SNP classification) can
be challenging when one of the cluster in the diploid heterozygous
position without prior knowledge of the sample genetics (rapeseed
example) However, when lines are in-bred, heterozygous genotypes
assignment without examples of both homozygous genotypes may be
unexpected (e.g., for wheat example.)
Example Thresholds for PolyHighResolution SNPs
[0123] Threshold values for the parameters described above and for
parameters as known in the art as used in the SNP classifier as
described above. These values may be varied in different systems
and for a number of different reasons. One example set of values is
as provided below. [0124] CR.cut-Call Rate threshold; usually about
95-97 [0125] FLD.cut-FLD threshold; usually about 3.6 [0126]
HetSO.cut-HetSOthreshold usually about -0.1 [0127]
HomRO2.cut-HomROthreshold for 2-cluster SNPs; usually about 0.3
[0128] HomRO3.cut-HomROthreshold for 3-cluster SNPs; usually about
-0.9 [0129] nMinorAllele-Number of Minor Alleles; usually two
[0130] HomHet.flag=FALSE
Using SNP Classifications to Select SNP Probe Sets
[0131] According to specific embodiments, the classifications
described above can be used to select a best probe set. A "best"
probeset is generally selected in various situations where more
than one probeset is available for an SNP, as would be understood
in the art. Probeset selection is done in many different situations
and according to many different criteria.
[0132] According to specific embodiments as described herein, one
method for probeset selection is to (1) Categorize probesets into
the available types (e.g., the six types described above). For an
SNP, select a probeset that has the highest priority as categorized
by the types.
[0133] In a specific embodiment, the type priority is:
PolyHighResolution, MonoHighResolution, OTV,
CallRateBelowThreshold, NoMinorHom, Other. Call Rate and FLD are
used to break ties.
Example Software Package
[0134] According to further specific embodiments, one or more of
the above methods is included in a software package to
automatically genotype polymorphisms from array data or similar
genotype data. Such a software package may include initial
clustering or such a software package may be a post-processing
package applied after initial clustering. In either case, according
to specific embodiments, such a software package provides fully
automated probeset selection for polymorphisms as described herein.
According to further embodiments, such a software package
identifies the presence of OTV probes using statistical
characteristics of the majority or all of the sample data from the
probe.
[0135] SNPolisher.TM. is one particular example of such a software
system, in this case designed to post-process genotyping results by
Affymetrix Axiom Genotyping Arrays, though other genotyping data
can be handled so long as the data is presented in the formats used
by the program. Alternatively, SNPolisher can be modified or
customized to work with any comparable data sets. FIG. 11 is a
flowchart illustrating overall steps for polymorphism (e.g., SNP)
analysis, including classification and OTV identification according
to specific embodiments.
[0136] According to specific embodiments, SNPolisher contains four
major functions (1) SNP Quality Control (QC) Metrics Generation,
(2) SNP Classification based on QC metrics, (3) SNP Genotype
Cluster Visualization, and (4) Off-target variant (OTV) genotyping.
Input files in an example system are the standard genotyping output
files from Affymetrix Power Tools (APT) or Genotyping Console (GTC)
for the Axiom arrays (note ps2snp.file is in the library file
package). SNPolisher calculates the QC metrics for each
SNP/probeset to determine the probeset's quality and optionally
classifies SNPs/probesets into six major categories or types as
described above. SNPolisher can select the best probeset to
represent a SNP if multiple probesets exist for a specific SNP.
SNPolisher can also generate the visualization for each
SNP/probeset to evaluate its quality. FIG. 12 shows an example of
visualization of a SNP genotyping data probeset results with
interpretations using a logic routing according to specific
embodiments. Third, it can OTV genotype those OTV SNPs to produce
AA, AB, BB and OTV genotype clusters.
[0137] According to specific embodiments, SNPolisher.TM. is
implemented as a well-known statistical programming module referred
to as an R package. SNPolisher.TM. is at times distributed with two
test data sets that are used as examples to describe operations
below. Testdata1 contains genotype data from a diploid species with
500 probesets genotyped in 300 samples. Testdata2 contains genotype
data from an allo-hexaploid species with 800 probesets genotyped in
100 samples. Below is described in general terms the usage of the
four major functions using these test data-sets.
Quality Control Metrics Generation and Probeset Classifications
[0138] In an example system, metrics and classification are
generated using two input files, e.g.,
"AxiomGT1.snp-posteriors.txt" containing the posterior information
for each genotype cluster, and the other is "AxiomGT1.calls.txt"
containing inferred genotypes. QC metrics generation and probeset
classifications are performed generally as described above. One
example system uses and additional, 7.sup.th classifications,
"Hemizygous" to indicate SNPs on chrY/chrW/Mito.
Hardware
[0139] Systems, methods, and products to address these and other
needs are described herein with respect to illustrative,
non-limiting, implementations. Various alternatives, modifications
and equivalents are possible. For example, certain systems,
methods, and computer software products are described herein using
exemplary implementations for analyzing data from arrays of
biological materials made by spotting or other methods such as
photolithography or bead based systems. However, these systems,
methods, and products may be applied with respect to many other
types of probe arrays and, more generally, with respect to numerous
parallel biological assays produced in accordance with other
conventional technologies and/or produced in accordance with
techniques that may be developed in the future. For example, the
systems, methods, and products described herein may be applied to
parallel assays of nucleic acids, PCR products generated from cDNA
clones, proteins, antibodies, or many other biological materials.
These materials may be disposed on slides (as typically used for
spotted arrays), on substrates employed for GENECHIP.TM. arrays, or
on beads, optical fibers, or other substrates or media, which may
include polymeric coatings or other layers on top of slides or
other substrates. Moreover, the probes need not be immobilized in
or on a substrate, and, if immobilized, need not be disposed in
regular patterns or arrays. For convenience, the term "probe array"
will generally be used broadly hereafter to refer to all of these
types of arrays and parallel biological assays.
[0140] An embodiment of a method of analyzing data from processed
images of biological probe arrays is described that comprises
receiving one or more data files comprising a plurality of
intensity values associated with a probe on a biological probe
array or comprising cluster assignments from or other output
derived from such data; and assigning a plurality of genotype calls
using a distance of the one or more intensity values from the
posterior estimate.
[0141] The above embodiments and implementations are not
necessarily inclusive or exclusive of each other and may be
combined in any manner that is non-conflicting and otherwise
possible, whether they be presented in association with a same, or
a different, embodiment or implementation. The description of one
embodiment or implementation is not intended to be limiting with
respect to other embodiments and/or implementations. In addition,
any one or more function, step, operation, or technique described
elsewhere in this specification may, in alternative
implementations, be combined with any one or more function, step,
operation, or technique described in the summary. Thus, the above
embodiment and implementations are illustrative rather than
limiting.
Creating Probesets and Assays
[0142] It is generally well known that the method involves
extracting genomic DNA from samples, digesting genomic DNA using
restriction enzymes, labeling DNA fragments for hybridization to
the probes on the chip. After hybridization, the chips are washed
and scanned and a digital image is taken. For each probeset,
summarize pixels to probeset intensities. This produces the raw
data for the SNP analysis.
[0143] Various methods are used for genotype calling: plot each SNP
in two dimensions (A & B allele intensities), though not every
SNP looks alike Specific methods to call genotypes include
Affymetrix Birdseed, CRLMM, Affy 500 k: Chiamo.
[0144] Various embodiments of the present invention provide methods
and/or systems for probeset genotyping and characterization that
can be implemented on a general purpose or special purpose
information handling appliance or logic enabled system, such as a
laboratory or diagnostic or production system, using any suitable
programming language such as R, Java, C++, C#, Cobol, C, Pascal,
Fortran, PL1, LISP, assembly, etc., and any suitable data or
formatting specifications, such as HTML, XML, dHTML, TIFF, JPEG,
tab-delimited text, binary, etc. In the interest of clarity, not
all features of an actual implementation are described in this
specification. It will be understood that in the development of any
such actual implementation (as in any software development
project), numerous implementation-specific decisions must be made
to achieve the developers' specific goals and subgoals, such as
compliance with system-related and/or business-related constraints,
which will vary from one implementation to another. Moreover, it
will be appreciated that such a development effort might be complex
and time-consuming, but would nevertheless be a routine undertaking
of software engineering for those of ordinary skill having the
benefit of this disclosure.
Embodiment in a Programmed Information Appliance
[0145] FIG. 13 is a block diagram showing a representative example
logic device in which various aspects of the present invention may
be embodied. As will be understood to practitioners in the art from
the teachings provided herein, the invention can be implemented in
hardware and/or software. In some embodiments of the invention,
different aspects of the invention can be implemented in either
client-side logic or server-side logic. As will be understood in
the art, the invention or components thereof may be embodied in a
fixed media program component containing logic instructions and/or
data that when loaded into an appropriately configured computing
device cause that device to perform according to the invention. As
will be understood in the art, a fixed media containing logic
instructions may be delivered to a user on a fixed media for
physically loading into a user's computer or a fixed media
containing logic instructions may reside on a remote server that a
user accesses through a communication medium in order to download a
program component.
[0146] FIG. 13 shows an information appliance (or digital device)
700 that may be understood as a logical apparatus that can read
instructions from media 717 and/or network port 719, which can
optionally be connected to server 720 having fixed media 722.
Apparatus 700 can thereafter use those instructions to direct
server or client logic, as understood in the art, to embody aspects
of the invention. One type of logical apparatus that may embody the
invention is a computer system as illustrated in 700, containing
CPU 707, optional input devices 709 and 711, disk drives 715 and
optional monitor 705. Fixed media 717, or fixed media 722 over port
719, may be used to program such a system and may represent a
disk-type optical or magnetic media, magnetic tape, solid state
dynamic or static memory, etc. In specific embodiments, the
invention may be embodied in whole or in part as software recorded
on this fixed media. Communication port 719 may also be used to
initially receive instructions that are used to program such a
system and may represent any type of communication connection.
[0147] The invention also may be embodied in whole or in part
within the circuitry of an application specific integrated circuit
(ASIC) or a programmable logic device (PLD). In such a case, the
invention may be embodied in a computer understandable descriptor
language, which may be used to create an ASIC, or PLD that operates
as herein described.
[0148] The general structure and techniques, and more specific
embodiments that can be used to effect different ways of carrying
out the more general goals are described herein.
[0149] Although only a few embodiments have been disclosed in
detail above, other embodiments are possible and the inventor(s)
intend these to be encompassed within this specification. The
specification describes specific examples to accomplish a more
general goal that may be accomplished in another way. This
disclosure is intended to be exemplary, and the claims are intended
to cover any modification or alternative that might be predictable
to a person having ordinary skill in the art. For example, While
Affymetrix.TM. arrays are described in the embodiments, other
embodiments may use other kinds of genotyping devices or
systems.
[0150] Also, the inventors intend that only those claims which use
the words "means for" are intended to be interpreted under 35 USC
112, sixth paragraph. Moreover, no limitations from the
specification are intended to be read into any claims, unless those
limitations are expressly included in the claims. The computers
described herein may be any kind of computer, either general
purpose, or some specific purpose computer such as a workstation.
The computer may be an Intel (e.g., Pentium or Core 2 duo) or AMD
based computer, running Windows XP or Linux, or may be a Macintosh
computer. The computer may also be a handheld computer, such as a
PDA, cellphone, or laptop.
[0151] The programs may be written in C or Python, or Java, Brew or
any other programming language. The programs may be resident on a
storage medium, e.g., magnetic or optical, e.g. the computer hard
drive, a removable disk or media such as a memory stick or SD
media, wired or wireless network based or Bluetooth based Network
Attached Storage (NAS), or other removable medium, or other
removable medium. The programs may also be run over a network, for
example, with a server or other machine sending signals to the
local machine, which allows the local machine to carry out the
operations described herein.
[0152] Where a specific numerical value is mentioned herein, it
should be considered that the value may be increased or decreased
by 20%, while still staying within the teachings of the present
application, unless some different range is specifically mentioned.
Where a specified logical sense is used, the opposite logical sense
is also intended to be encompassed.
* * * * *