U.S. patent application number 17/249520 was filed with the patent office on 2021-06-24 for phase-aware determination of identity-by-descent dna segments.
The applicant listed for this patent is 23andMe, Inc.. Invention is credited to Adam Auton, William Allen Freyman, Ethan Macneil Jewett, Kimberly Faith McManus, Suyash S. Shringarpure.
Application Number | 20210193257 17/249520 |
Document ID | / |
Family ID | 1000005434592 |
Filed Date | 2021-06-24 |
United States Patent
Application |
20210193257 |
Kind Code |
A1 |
Freyman; William Allen ; et
al. |
June 24, 2021 |
PHASE-AWARE DETERMINATION OF IDENTITY-BY-DESCENT DNA SEGMENTS
Abstract
The disclosed embodiments concern methods, apparatus, systems
and computer program products for estimating IBD segments. Some
implementations use a templated positional Burrows-Wheeler
transform (PBWT) technique and a phase switch error heuristic to
correct genotyping errors and phase switch errors to make fast and
accurate phase aware IBD estimates. In some implementations a
templated PBWT technique and a probabilistic hidden Markov model
(HMM) are used to correct genotyping errors and phase switch
errors.
Inventors: |
Freyman; William Allen;
(Menlo Park, CA) ; McManus; Kimberly Faith; (San
Francisco, CA) ; Shringarpure; Suyash S.; (Mountain
View, CA) ; Jewett; Ethan Macneil; (San Jose, CA)
; Auton; Adam; (Menlo Park, CA) |
|
Applicant: |
Name |
City |
State |
Country |
Type |
23andMe, Inc. |
Sunnyvale |
CA |
US |
|
|
Family ID: |
1000005434592 |
Appl. No.: |
17/249520 |
Filed: |
March 4, 2021 |
Related U.S. Patent Documents
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
|
|
16947107 |
Jul 17, 2020 |
|
|
|
17249520 |
|
|
|
|
62876497 |
Jul 19, 2019 |
|
|
|
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
C12Q 1/6827 20130101;
G16B 30/00 20190201; G16B 20/20 20190201 |
International
Class: |
G16B 20/20 20060101
G16B020/20; C12Q 1/6827 20060101 C12Q001/6827; G16B 30/00 20060101
G16B030/00 |
Claims
1. A computer implemented method of processing haplotypes to reduce
genotyping errors when determining identity by descent (IBD)
segments between haplotypes, the method comprising: providing a
first digital template comprising a first arrangement of masked and
unmasked sites in a window of consecutive haplotype sites;
providing a second digital template comprising a second arrangement
of masked and unmasked sites in a window of consecutive haplotype
sites, wherein the first and second arrangements are different;
providing two or more haplotypes strings for identification of IBD
segments therebetween, each of the two or more haplotype strings
representing a sequence of allele values at polymorphic sites in a
haplotype of an organism; and computationally identifying IBD
segments between the two or more haplotype strings by (i)
identifying first matches among alleles of the haplotype strings at
unmasked sites produced by applying the first digital template to
the two or more haplotype strings, (ii) identifying second matches
among alleles of the haplotype at unmasked sites produced by
applying the second digital template to the two or more haplotype
strings, and (iii) merging the first and second matches among
alleles to produce a merged set of IBD segments, wherein the merged
set of IBD segments has reduced impact from genotyping errors
compared to a set of IBD segments generated without applying the
first and second digital templates.
2. The method of claim 1, wherein the first and second templates
each have a size of at least four consecutive haplotype sites.
3. The method of claim 1, wherein identifying the first matches
among alleles at unmasked sites comprises sequentially applying the
first digital template to the two or more haplotype strings, each
time moving to a next sequential section of the two or more
haplotype strings.
4. The method of claim 1, wherein computationally identifying IBD
segments between the two or more haplotype strings further
comprises: computationally identifying additional matches among
alleles at unmasked sites produced by applying one or more
additional digital templates to the two or more haplotype strings,
wherein the one or more additional digital templates have
additional arrangements of masked and unmasked sites in windows of
consecutive haplotype sites, and each of the additional
arrangements is different from both the first and the second
arrangements, and wherein merging the first and second matches
among alleles to produce a merged set of IBD segments further
comprises computationally merging the additional matches with the
first and second matches to produce the merged set of IBD
segments.
5. The method of claim 4, wherein computationally identifying
additional matches among alleles at unmasked sites employs a third
digital template, a fourth digital template, a fifth digital
template, and a sixth digital template.
6. The method of claim 5, wherein the first through sixth digital
templates each comprise two masked sites and two unmasked
sites.
7. The method of claim 1, wherein the first digital template and
the second digital template each have a ratio of masked sites to
unmasked sites of between about 2:1 to about 1:2.
8. The method of claim 1, wherein the two or more haplotype strings
comprise at least one thousand haplotype strings.
9. The method of claim 1, wherein the two or more haplotype strings
comprise at least one million haplotype strings.
10. The method of claim 1, wherein computationally identifying IBD
segments between the two or more haplotype strings comprises
performing a positional Burrows-Wheeler transform (PBWT) on the
unmasked sites produced by applying the first and second templates
to the two or more haplotype strings.
11. The method of claim 10, wherein computationally merging the
first and second matches among alleles is performed while
considering individual polymorphic sites of the two or more
haplotype strings using the PBWT.
12. The method of claim 1, wherein the total number of digital
templates is between 2 and k, where k is the number of haplotype
sites in the window.
13. The method of claim 1, wherein the total number of digital
templates is k!/(m!*(k-m)!), where k is the number of haplotype
sites in the window and m is the number of masked sites in the
window.
14. The method of claim 1, wherein applying the first digital
template comprises a deterministic process employing the first
arrangement of masked and unmasked sites.
15. A system for processing haplotypes to reduce genotyping errors
when determining identity by descent (IBD) segments between
haplotypes, the system comprising: (a) one or more processors and
associated memory; (b) computer readable instructions for:
providing a first digital template comprising a first arrangement
of masked and unmasked sites in a window of consecutive haplotype
sites; providing a second digital template comprising a second
arrangement of masked and unmasked sites in a window of consecutive
haplotype sites, wherein the first and second arrangements are
different; providing two or more haplotype strings for
identification of IBD segments therebetween, each of the two or
more haplotype strings representing a sequence of allele values at
polymorphic sites in a haplotype of an organism; and identifying
IBD segments between the two or more haplotype strings by (i)
identifying first matches among alleles of the haplotype strings at
unmasked sites produced by applying the first digital template to
the two or more haplotype strings, (ii) identifying second matches
among alleles of the haplotype at unmasked sites produced by
applying the second digital template to the two or more haplotype
strings, and (iii) merging the first and second matches among
alleles to produce a merged set of IBD segments, wherein the merged
set of IBD segments has reduced impact from genotyping errors
compared to a set of IBD segments generated without applying the
first and second digital templates.
16. The system of claim 15, wherein the first and second templates
each have a size of at least four consecutive haplotype sites.
17. The system of claim 15, wherein the instructions for
identifying the first matches among alleles at unmasked sites
comprises instructions for sequentially applying the first digital
template to the two or more haplotype strings, each time moving to
a next sequential section of the two or more haplotype strings.
18. The system of claim 15, wherein the instructions for
identifying IBD segments between the two or more haplotype strings
further comprise instructions for: computationally identifying
additional matches among alleles at unmasked sites produced by
applying one or more additional digital templates to the two or
more haplotype strings, wherein the one or more additional digital
templates have additional arrangements of masked and unmasked sites
in windows of consecutive haplotype sites, and each of the
additional arrangements is different from both the first and the
second arrangements, and wherein merging the first and second
matches among alleles to produce a merged set of IBD segments
further comprises computationally merging the additional matches
with the first and second matches to produce the merged set of IBD
segments.
19. The system of claim 18, wherein the instructions for
identifying additional matches among alleles at unmasked sites
employ a third digital template, a fourth digital template, a fifth
digital template, and a sixth digital template.
20. The system of claim 15, wherein the two or more haplotype
strings comprise at least one thousand haplotype strings.
21. The system of claim 15, wherein the instructions for
identifying IBD segments between the two or more haplotype strings
comprise instructions performing a positional Burrows-Wheeler
transform (PBWT) on the unmasked sites produced by applying the
first and second templates to the two or more haplotype
strings.
22. The system of claim 21, wherein the instructions for merging
the first and second matches among alleles comprise instructions
for performing the merging while considering individual polymorphic
sites of the two or more haplotype strings using the PBWT.
23. A method of identifying IBD segments between two or more
haplotype strings, each of the two or more haplotype strings
representing a sequence of allele values at polymorphic sites in a
haplotype of an organism, the method comprising: (a)
computationally identifying IBD segments between the two or more
haplotype strings by (i) identifying first matches among alleles of
two or more haplotype strings at unmasked sites produced by
applying a first digital template to the two or more haplotype
strings, (ii) identifying second matches among alleles of the
haplotype at unmasked sites produced by applying a second digital
template to the two or more haplotype strings, and (iii) merging
the first and second matches among alleles to produce a merged set
of IBD segments, wherein the first digital template comprises a
first arrangement of masked and unmasked sites in a window of
consecutive haplotype sites, wherein the second digital template
comprises a second arrangement of masked and unmasked sites in the
window of consecutive haplotype sites, and wherein the first and
second arrangements are different; and (b) identifying a potential
phase switch error in at least one of the two or more haplotype
strings; and (c) correcting the phase switch error.
24. The method of claim 23, wherein identifying the potential phase
switch error comprises identifying proximal IBD segments in at
least one pair of the two or more haplotype strings.
25. The method of claim 23, wherein identifying the potential phase
switch error further comprises computationally iterating through
the two or more haplotype strings, by: (i) identifying a first
potential IBD segment between the two or more haplotype strings;
(ii) comparing the first site of the first potential IBD segment to
the last site of a previously identified second potential IBD
segment; and (iii) determining that the last site of the second
potential IBD segment and the first site of the first potential IBD
segment are within a threshold number of sites of each other; and
wherein correcting the phase switch error comprises merging the
first potential IBD segment and the second potential IBD segment to
form a combined potential IBD segment based on determining that the
last site of the second potential IBD segment and the first site of
the first potential IBD segment are within the threshold number of
sites of each other.
26. The method of claim 25, wherein the first potential IBD segment
and the second potential IBD segment each have a length of at least
the threshold number of sites.
27. The method of claim 25, wherein the threshold number of sites
is between about 0 and 500 SNPs.
Description
INCORPORATION BY REFERENCE
[0001] An Application Data Sheet is filed concurrently with this
specification as part of the present application. Each application
that the present application claims benefit of or priority to as
identified in the concurrently filed Application Data Sheet is
incorporated by reference herein in its entirety and for all
purposes.
BACKGROUND
[0002] Modern genetic data sets already number in the millions of
genomes and are growing rapidly. Inferring the genomic location and
length of identical-by-descent (IBD) segments among the related
individuals in these data sets is a central step in many genetic
analyses.
[0003] IBD estimates can best be exploited when they are made using
phased haplotypes; this means each individual in the data set is
represented by two sequences each of which consists of alleles
co-located on the same chromosome and inherited from a different
parent. IBD estimates that are phase aware can improve relationship
and pedigree inference, allow health and trait inheritance to be
traced, and make possible a range of other inferences regarding
demographic history and ancestry that are not possible when IBD
estimates are made using only unphased genotype data. Therefore,
methods and systems that can improve performance of phase aware IBD
estimates have significant value.
[0004] All patents, patent applications, and other publications,
including all sequences disclosed within these references, referred
to herein are expressly incorporated herein by reference, to the
same extent as if each individual publication, patent or patent
application was specifically and individually indicated to be
incorporated by reference. All documents cited are, in relevant
part, incorporated herein by reference in their entireties for the
purposes indicated by the context of their citation herein.
However, the citation of any document is not to be construed as an
admission that it is prior art with respect to the present
disclosure.
SUMMARY
[0005] The disclosed implementations concern methods, apparatus,
systems, and computer program products for processing haplotype
data to accurately estimate IBD segments between individuals.
[0006] A first aspect of the disclosure provides
computer-implemented methods for estimating IBD segments between
individuals.
[0007] Another aspect of the disclosure provides systems for
estimating IBD segments. In some implementations, the system
involves: a sequencer for sequencing nucleic acids of the test
sample; a processor; and one or more computer-readable storage
media having stored thereon instructions for execution on said
processor to estimate IBD segments between individuals.
[0008] Another aspect of the disclosure provides a computer program
product including a non-transitory machine readable medium storing
program code that, when executed by one or more processors of a
computer system, causes the computer system to implement the
methods above for estimating IBD segments.
[0009] A computer implemented method of processing haplotypes to
reduce genotyping errors when determining identity by descent (IBD)
segments between haplotypes is provided, the method including:
providing a first digital template including a first arrangement of
masked and unmasked sites in a window of consecutive haplotype
sites; providing a second digital template including a second
arrangement of masked and unmasked sites in a window of consecutive
haplotype sites, wherein the first and second arrangements are
different; providing two or more haplotypes strings for
identification of IBD segments therebetween, each of the two or
more haplotype strings representing a sequence of allele values at
polymorphic sites in a haplotype of an organism; and
computationally identifying IBD segments between the two or more
haplotype strings by (i) identifying first matches among alleles of
the haplotype strings at unmasked sites produced by applying the
first digital template to the two or more haplotype strings, (ii)
identifying second matches among alleles of the haplotype at
unmasked sites produced by applying the second digital template to
the two or more haplotype strings, and (iii) merging the first and
second matches among alleles to produce a merged set of IBD
segments, wherein the merged set of IBD segments has reduced impact
from genotyping errors compared to a set of IBD segments generated
without applying the first and second digital templates.
[0010] In some embodiments, the first and second templates each
have a size of at least four consecutive haplotype sites. In some
embodiments, identifying the first matches among alleles at
unmasked sites includes sequentially applying the first digital
template to the two or more haplotype strings, each time moving to
a next sequential section of the two or more haplotype strings. In
some embodiments, computationally identifying IBD segments between
the two or more haplotype strings further includes: computationally
identifying additional matches among alleles at unmasked sites
produced by applying one or more additional digital templates to
the two or more haplotype strings, wherein the one or more
additional digital templates have additional arrangements of masked
and unmasked sites in windows of consecutive haplotype sites, and
each of the additional arrangements is different from both the
first and the second arrangements, and wherein merging the first
and second matches among alleles to produce a merged set of IBD
segments further includes computationally merging the additional
matches with the first and second matches to produce the merged set
of IBD segments. In some embodiments, computationally identifying
additional matches among alleles at unmasked sites employs a third
digital template, a fourth digital template, a fifth digital
template, and a sixth digital template. In some embodiments, the
first through sixth digital templates each include two masked sites
and two unmasked sites. In some embodiments, the first digital
template and the second digital template each have a ratio of
masked sites to unmasked sites of between about 2:1 to about
1:2.
[0011] In some embodiments, the two or more haplotype strings
include at least one thousand haplotype strings. In some
embodiments, the two or more haplotype strings include at least one
million haplotype strings. In some embodiments, computationally
identifying IBD segments between the two or more haplotype strings
includes performing a positional Burrows-Wheeler transform (PBWT)
on the unmasked sites produced by applying the first and second
templates to the two or more haplotype strings. In some
embodiments, computationally merging the first and second matches
among alleles is performed while considering individual polymorphic
sites of the two or more haplotype strings using the PBWT. In some
embodiments, the total number of digital templates is between 2 and
k, where k is the number of haplotype sites in the window. In some
embodiments, the total number of digital templates is
k!/(m!*(k-m)!), where k is the number of haplotype sites in the
window and m is the number of masked sites in the window. In some
embodiments, applying the first digital template comprises a
deterministic process employing the first arrangement of masked and
unmasked sites.
[0012] In another aspect of the embodiments provided herein, a
system for processing haplotypes to reduce genotyping errors when
determining identity by descent (IBD) segments between haplotypes
is provided, the system including: (a) one or more processors and
associated memory; (b) computer readable instructions for:
providing a first digital template including a first arrangement of
masked and unmasked sites in a window of consecutive haplotype
sites; providing a second digital template including a second
arrangement of masked and unmasked sites in a window of consecutive
haplotype sites, wherein the first and second arrangements are
different; providing two or more haplotypes strings for
identification of IBD segments therebetween, each of the two or
more haplotype strings representing a sequence of allele values at
polymorphic sites in a haplotype of an organism; and identifying
IBD segments between the two or more haplotype strings by (i)
identifying first matches among alleles of the haplotype strings at
unmasked sites produced by applying the first digital template to
the two or more haplotype strings, (ii) identifying second matches
among alleles of the haplotype at unmasked sites produced by
applying the second digital template to the two or more haplotype
strings, and (iii) merging the first and second matches among
alleles to produce a merged set of IBD segments, wherein the merged
set of IBD segments has reduced impact from genotyping errors
compared to a set of IBD segments generated without applying the
first and second digital templates.
[0013] In some embodiments, the first and second templates each
have a size of at least four consecutive haplotype sites. In some
embodiments, the instructions for identifying the first matches
among alleles at unmasked sites includes instructions for
sequentially applying the first digital template to the two or more
haplotype strings, each time moving to a next sequential section of
the two or more haplotype strings. In some embodiments, the
instructions for identifying IBD segments between the two or more
haplotype strings further include instructions for: computationally
identifying additional matches among alleles at unmasked sites
produced by applying one or more additional digital templates to
the two or more haplotype strings, wherein the one or more
additional digital templates have additional arrangements of masked
and unmasked sites in windows of consecutive haplotype sites, and
each of the additional arrangements is different from both the
first and the second arrangements, and wherein merging the first
and second matches among alleles to produce a merged set of IBD
segments further includes computationally merging the additional
matches with the first and second matches to produce the merged set
of IBD segments. In some embodiments, the instructions for
identifying additional matches among alleles at unmasked sites
employ a third digital template, a fourth digital template, a fifth
digital template, and a sixth digital template. In some
embodiments, the first through sixth digital templates each include
two masked sites and two unmasked sites. In some embodiments, the
first digital template and the second digital template each have a
ratio of masked sites to unmasked sites of between about 2:1 to
about 1:2.
[0014] In some embodiments, the two or more haplotype strings
include at least one thousand haplotype strings. In some
embodiments, the two or more haplotype strings include at least one
million haplotype strings. In some embodiments, the instructions
for identifying IBD segments between the two or more haplotype
strings include instructions performing a positional
Burrows-Wheeler transform (PBWT) on the unmasked sites produced by
applying the first and second templates to the two or more
haplotype strings. In some embodiments, the instructions for
merging the first and second matches among alleles include
instructions for performing the merging while considering
individual polymorphic sites of the two or more haplotype strings
using the PBWT. In some embodiments, the total number of digital
templates is between 2 and k, where k is the number of haplotype
sites in the window. In some embodiments, the total number of
digital templates is k!/(m!*(k-m)!), where k is the number of
haplotype sites in the window and m is the number of masked sites
in the window.
[0015] In another aspect of the embodiments herein, a method of
identifying IBD segments between two or more haplotypes strings,
each of the two or more haplotype strings representing a sequence
of allele values at polymorphic sites in a haplotype of an organism
is provided, the method including: (a) computationally identifying
IBD segments between the two or more haplotype strings by (i)
identifying first matches among alleles of two or more haplotype
strings at unmasked sites produced by applying a first digital
template to the two or more haplotype strings, (ii) identifying
second matches among alleles of the haplotype at unmasked sites
produced by applying a second digital template to the two or more
haplotype strings, and (iii) merging the first and second matches
among alleles to produce a merged set of IBD segments, wherein the
first digital template includes a first arrangement of masked and
unmasked sites in a window of consecutive haplotype sites, wherein
the second digital template includes a second arrangement of masked
and unmasked sites in the window of consecutive haplotype sites,
and wherein the first and second arrangements are different; and
(b) identifying a potential phase switch error in at least one of
the two or more haplotype strings; and (c) correcting the phase
switch error. In some embodiments, identifying the potential phase
switch error includes identifying proximate IBD segments in at
least one pair of the two or more haplotype strings.
[0016] In another aspect of the embodiments herein, a computer
implemented method of determining identity by descent (IBD)
segments is provided, the method including: determining first
potential IBD segments among phased haplotype data for a plurality
of individuals, wherein the first potential IBD segments have an
end site; determining second potential IBD segments among haplotype
data for the plurality of individuals, wherein the second potential
IBD segments have a start site; determining that the end site of
the first potential IBD segments and the start site of the second
potential IBD segments are within a threshold number of sites of
each other; and merging the first potential IBD segments and the
second potential IBD segments to form a combined potential IBD
segment.
[0017] In some embodiments, the first potential IBD segments and
the second potential IBD segments are on different haplotypes for
an individual of the plurality of individuals, and the method
further includes: determining a phase switch error occurred at a
site between the first potential IBD segment and the second
potential IBD segment for the individual; and swapping the
haplotypes for the individual from the position of the phase switch
error. In some embodiments, the first potential IBD segments and
the second potential IBD segments overlap for an individual of the
plurality of individuals. In some embodiments, the first potential
IBD segment and the second potential IBD segment each span at least
the threshold number of sites. In some embodiments, the threshold
number of sites is between about 0 and 500 SNPs. In some
embodiments, the plurality of individuals do not share a
parent-child relationship.
[0018] In some embodiments, the method further includes:
determining a third potential IBD segments among phased haplotype
data for a plurality of individuals, wherein the third potential
IBD segments have a start site; determining that the end site of
the combined potential IBD segments and the start site of the third
potential IBD segments are within the threshold number of SNPs of
each other; and merging the combined potential IBD segments and the
third potential IBD segments. In some embodiments, the combined
potential IBD segments and the third potential IBD segments are on
different haplotypes for an individual of the plurality of
individuals, and the method further includes: determining a phase
switch error occurred at a site between the combined potential IBD
segment and the third potential IBD segment for the individual; and
swapping the haplotypes for the individual from the position of the
phase switch error. In some embodiments, the method further
includes determining that the combined potential IBD segments have
a minimum length in centimorgans and storing the combined potential
IBD segments as IBD segments for the plurality of individuals.
[0019] In another aspect of the embodiments herein, a computer
implemented method of processing haplotypes to reduce errors when
determining identity by descent (IBD) segments between haplotypes
is provided, the method including: providing two or more paired
haplotypes strings for identification of IBD segments therebetween,
each of the two or more paired haplotype strings representing a
sequence of allele values at polymorphic sites in a haplotype of an
organism; and computationally iterating through the two or more
paired haplotype strings by: (i) identifying a first potential IBD
segment between the two or more haplotype strings by identifying
matches among alleles of the haplotype strings; (ii) comparing the
first site of the first potential IBD segment to the last site of a
previously identified second potential IBD segment (iii)
determining that the last site of the second potential IBD segment
and the first site of the first potential IBD segment are within a
threshold number of sites of each other; and (iv) merging the first
potential IBD segment and the second potential IBD segment to form
a combined potential IBD segment.
[0020] In another aspect of the embodiments disclosed herein, a
computer implemented method of processing haplotypes to reduce
errors when determining identity by descent (IBD) segments between
haplotypes is provided, the method including: (a) computationally
identifying initial IBD segments between two or more haplotype
strings by identifying first matches among alleles of the haplotype
strings using a plurality of templates, each including a unique
arrangement of masked and unmasked sites in a window of consecutive
haplotype sites; and (b) providing information characterizing the
initial IBD segments to a hidden Markov model (HMM) which removes
potential phase switch errors to produce final IBD segment, wherein
the HMM analyzes the information characterizing the initial IBD
segments using distances between consecutive haplotype sites on a
chromosome, one or more rates of recombination based on meiosis,
and one or more rates of phase switch error based on a phasing
method employed to phase the haplotypes.
[0021] In some embodiments, the method further includes, after (a)
and before (b), removing some initial IBD segments determined to
belong to haplotypes having less than a threshold amount of initial
IBD segments, wherein the initial IBD segments provided to the EIHM
in (b) have had some initial IBD segments removed. In some
embodiments, the threshold amount of initial IBD segments is less
than two initial IBD segments per chromosome.
[0022] In another aspect of the embodiments described herein, a
computer implemented method of determining identical-by-descent
(IBD) segments is provided, the method including: (a) for each
polymorphic site in a series of polymorphic sites of two
individuals, obtaining an IBD state that indicates whether alleles
of the two individuals at the polymorphic site are part of an IBD
segment, and, if so, which of the two individuals' phased
haplotypes are part of the IBD segment, wherein the series of
polymorphic sites are included in one or more pairs of chromosomes;
and (b) applying a hidden Markov model (HMM) to the IBD states to
produce one or more error-corrected IBD segments, wherein the HMM
model takes as input, in addition to the IBD states as observed IBD
states, (i) a rate of recombination based on a number of meioses,
and (ii) at least one rate of phase switch error based on a phasing
method employed to phase the haplotypes; wherein applying the HMM
removes likely phase switch errors and produces the error-corrected
IBD segments based on a most likely sequence of hidden IBD states;
and wherein operations (a) and (b) are performed by one or more
processors of a computer system.
[0023] In some embodiments, the HMM takes as input: (iii) genetic
distances between consecutive sites on a chromosome. In some
embodiments, transition rates of the HMM are based on a rate at
which IBD segments start, which rate is modeled as a function of
the number of meioses. In some embodiments, the rate at which IBD
segments start (.alpha..sub.s) is modeled as follows:
.alpha. s = 1 1 + ( 1 0 0 . 0 .times. m ) ##EQU00001##
wherein m is the number of meioses. In some embodiments, transition
rates of the HMM are based on a rate at which IBD segments end. In
some embodiments, the rate at which IBD segments end is modeled as
a function of the number of meioses. In some embodiments, the rate
at which IBD segments end (.alpha..sub.e) is modeled as
follows:
.alpha. e = m 1 0 0 . 0 ##EQU00002##
wherein m is the number of meioses. In some embodiments, the IBD
states include nine different IBD states, which indicate nine
conditions of zero IBD, half IBD, and full IBD. In some
embodiments, transition rates of the HMM are based on a transition
matrix Q.sub.a in FIG. 8B. In some embodiments, transition rates of
the HMM are weighted by a probability that full IBD between the two
individuals is truly present. In some embodiments, the probability
that full IBD between the two individuals is truly present is
modeled as a logistic function of an amount of estimated full IBD.
In some embodiments, the probability that full IBD between the two
individuals is truly present (.beta.) is modeled as follows:
.beta. = 1 1 + e .eta. ( 0 . 1 2 5 - 2 ) ##EQU00003##
wherein i.sub.2 is the amount of estimated full IBD, and .eta. is
an empirical parameter defining the steepness of the logistic
function. In some embodiments, the transition rates are weighted by
weighting transitions into full IBD states with .beta. and
weighting transitions out of full IBD states with 1/.beta.. In some
embodiments, the IBD states include 9 different IBD states, and the
transition rates are based on a transition matrix Q.sub..beta. in
(Eq. 5). In some embodiments, transition rates of the HMM are based
on the at least one rate of phase switch error. In some
embodiments, the at least one rate of phase switch error includes a
rate of phase switch error for each of the two individuals, there
are 4 types of phase switch errors, the IBD states include 9
different IBD states, and the transition rates are based on a
36.times.36 transition matrix Q in (Eq. 6). In some embodiments,
transition probabilities of the HMM are based on the genetic
distances between consecutive sites on a chromosome.
[0024] In some embodiments, the transition probabilities are
obtained by exponentiating a transition matrix. In some
embodiments, transition probabilities of hidden IBD states
Y.sub.l+1 given hidden IBD states Y.sub.l are modeled as:
P(Y.sub.l+1|Y.sub.l, m, .mu..sub.0, .mu..sub.1,
i.sub.2)=e.sup.Qd.sup.l wherein m is the number of meioses,
.mu..sub.0 is a phase switch error rate for a first individual of
the two individuals, .mu..sub.1 is a phase switch error rate for a
second individual of the two individuals, i.sub.2 is an amount of
estimated full IBD, Q is a transition matrix described by Eq. (Q),
and d.sub.l is the genetic distances between sites l and l+1.
[0025] In some embodiments, emission probabilities of the HMM are
dependent on phase switch errors. In some embodiments, the emission
probabilities are defined by a uniform error term that weights
probabilities of observed IBD states based on four different ways
the two individuals may be in phase switch errors. In some
embodiments, (b) includes using transition probabilities and
emission probabilities of the HMM to identify the most likely
sequence of hidden IBD states given the observed states. In some
embodiments, the mostly likely sequence of hidden IBD states is
identified using a Viterbi dynamic programming process. In some
embodiments, the method further includes: performing (a) and (b)
for a plurality of iterations, each iteration using a different
number of meioses for the rate of recombination, thereby producing
a plurality of sets of error-corrected IBD segments; and selecting
a set of error-corrected IBD segments having a highest likelihood
or probability in the plurality of sets as a final estimate of one
or more IBD segments. In some embodiments, the different numbers of
meioses are in a range from 1 to 14. In some embodiments, the
method is initiated when the two individuals' IBD segments
including the series of polymorphic sites meet a criterion. In some
embodiments, the two individuals' IBD segments include two or more
IBD segments on a single chromosome. In some embodiments, the two
individuals' IBD segments exceed a minimum total amount of shared
IBD
[0026] Although the examples herein concern humans and the language
is primarily directed to human concerns, the concepts described
herein are applicable to genomes from any biological organism.
These and other objects and features of the present disclosure will
become more fully apparent from the following description and
appended claims in conjunction with the associated drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0027] The accompanying drawings, which are included as part of the
present specification, illustrate embodiments and, together with
the general description given above and the detailed description of
the embodiment given below, serve to explain and teach the
principles described herein.
[0028] FIG. 1 presents possible phase switch errors in long IBD
segments.
[0029] FIGS. 2A and 2B present example process flows according to
various embodiments discussed herein.
[0030] FIG. 3 presents an example process for finding IBD
segments.
[0031] FIG. 4 illustrates haplotype strings, a positional prefix
array, and a divergence array in accordance with various
embodiments discussed herein.
[0032] FIGS. 5A and 5B present process flows according to various
embodiments herein.
[0033] FIG. 5C presents pseudocode for an algorithm according to
various embodiments herein.
[0034] FIG. 5D presents an illustration of the various data
structures used in accordance with a Templated PBWT process as
described herein.
[0035] FIGS. 6A and 6C present process flows for a phase switch
error correction heuristic according to various embodiments
herein.
[0036] FIG. 6B presents various types of phase switch errors that
may occur between two pairs of haplotypes.
[0037] FIG. 7 illustrates using a HMM to process four haplotypes of
two individuals to correct phase switch errors.
[0038] FIG. 8 presents a flow diagram for correcting phase switch
errors in IBD segments using a HMM according to various
embodiments.
[0039] FIG. 9A illustrates the structure of an example HMM
model.
[0040] FIG. 9B illustrates a transition matrix that may be used as
part of a HMM according to various embodiments herein.
[0041] FIG. 10 presents a functional diagram of a computer system
for performing various embodiments disclosed herein.
[0042] FIG. 11 presents a block diagram of an IBD-based personal
genomics service according to various embodiments herein.
[0043] FIG. 12 presents a plot comparing the speed of various IBD
inference methods.
[0044] FIGS. 13-16 present plots comparing the IBD estimate errors
of various methods.
[0045] FIGS. 17 and 18 present plots of errors of various methods
for various simulated pairs of haplotypes.
[0046] FIG. 19 presents plots of the false positive and false
negative rates of various methods.
[0047] FIGS. 20A and 20B illustrate the runtime of various methods
and the parameters used for assessing runtime.
[0048] FIGS. 21 and 22 present illustrates of IBD haplotype sharing
across Mexican states as determined by a Templated PBWT method as
described herein.
DETAILED DESCRIPTION
[0049] The disclosure concerns methods, apparatus, systems, and
computer program products for estimating IBD segments between
individuals using haplotype data.
[0050] Numeric ranges are inclusive of the numbers defining the
range. It is intended that every maximum numerical limitation given
throughout this specification includes every lower numerical
limitation, as if such lower numerical limitations were expressly
written herein. Every minimum numerical limitation given throughout
this specification will include every higher numerical limitation,
as if such higher numerical limitations were expressly written
herein. Every numerical range given throughout this specification
will include every narrower numerical range that falls within such
broader numerical range, as if such narrower numerical ranges were
all expressly written herein.
[0051] The headings provided herein are not intended to limit the
disclosure.
[0052] Unless defined otherwise herein, all technical and
scientific terms used herein have the same meaning as commonly
understood by one of ordinary skill in the art. Various scientific
dictionaries that include the terms included herein are well known
and available to those in the art. Although any methods and
materials similar or equivalent to those described herein find use
in the practice or testing of the embodiments disclosed herein,
some methods and materials are described.
[0053] The terms defined immediately below are more fully described
by reference to the Specification as a whole. It is to be
understood that this disclosure is not limited to the particular
methodology, protocols, and reagents described, as these may vary,
depending upon the context they are used by those of skill in the
art.
Definitions
[0054] As used herein, the singular terms "a," "an," and "the"
include the plural reference unless the context clearly indicates
otherwise.
[0055] Unless otherwise indicated, nucleic acids are written left
to right in 5' to 3' orientation and amino acid sequences are
written left to right in amino to carboxy orientation,
respectively.
[0056] The term "plurality" refers to more than one element. For
example, the term is used herein in reference to a number of
nucleic acid molecules or sequence reads that is sufficient to
identify significant differences in repeat expansions in test
samples and control samples using the methods disclosed herein.
[0057] A DNA segment is identical by state (IBS) in two or more
individuals if they have identical nucleotide sequences in this
segment. An IBS segment is identical by descent (IBD) in two or
more individuals if they have inherited it from a common ancestor
without recombination, that is, the segment has the same ancestral
origin in these individuals. DNA segments that are IBD are IBS per
definition, but segments that are not IBD can still be IBS due to
the same mutations in different individuals or recombinations that
do not alter the segment.
[0058] The terms "polynucleotide," "nucleic acid" and "nucleic acid
molecules" are used interchangeably and refer to a covalently
linked sequence of nucleotides (i.e., ribonucleotides for RNA and
deoxyribonucleotides for DNA) in which the 3' position of the
pentose of one nucleotide is joined by a phosphodiester group to
the 5' position of the pentose of the next. The nucleotides include
sequences of any form of nucleic acid, including, but not limited
to RNA and DNA molecules such as cell-free DNA (cfDNA) molecules.
The term "polynucleotide" includes, without limitation, single- and
double-stranded polynucleotides.
[0059] The term "parameter" herein refers to a numerical value that
characterizes a physical property. Frequently, a parameter
numerically characterizes a quantitative data set and/or a
numerical relationship between quantitative data sets. For example,
a ratio (or function of a ratio) between the number of sequence
tags mapped to a chromosome and the length of the chromosome to
which the tags are mapped, is a parameter.
[0060] The term "site" refers to a unique position (i.e. chromosome
ID, chromosome position and orientation) on a reference genome. In
some embodiments, a site may be a residue, a sequence tag, or a
segment's position on a sequence.
[0061] The term "based on" when used in the context of obtaining a
specific quantitative value, herein refers to using another
quantity as input to calculate the specific quantitative value as
an output. The specific quantitative value may be based on multiple
other quantities, not just the one identified.
[0062] As used herein the term "chromosome" refers to the
heredity-bearing gene carrier of a living cell, which is derived
from chromatin strands comprising DNA and protein components
(especially histones). The conventional internationally recognized
individual human genome chromosome numbering system is employed
herein.
Introduction and Overview
[0063] Modern genetic data sets already number in the millions of
genomes and are growing rapidly. Inferring the genomic location and
length of identical-by-descent (IBD) segments among the related
individuals in these data sets is a central step in many genetic
analyses. IBD estimates can best be exploited when they are made
using phased haplotypes; this means each individual in the data set
is represented by two sequences each of which consist of alleles
co-located on the same chromosome and inherited from a different
parent. IBD estimates that are phase aware can improve relationship
and pedigree inference, allow health and trait inheritance to be
traced, and make possible a range of other inferences regarding
demographic history and ancestry that are not possible when IBD
estimates are made using only unphased genotype data.
[0064] Estimating phase aware IBD segments is challenging not only
because of the large sizes of the genomic data sets but also due to
two types of error that break up IBD segments: genotyping error and
phase switch/phasing error. Because IBD segments are broken up by
meiotic recombination events they are expected to be longer for
close relatives. However, long IBD segments are more likely to be
impacted by genotype and phasing errors compared to short segments.
Thus, errors are particularly problematic when detecting IBD among
individuals that are closely related (e.g. first, second, and third
degree relatives) since long IBD segments are more likely to be
fragmented by these errors. This makes accurate inference of phase
aware IBD among close relatives particularly problematic.
[0065] Genotyping error is error introduced via genotyping (e.g.,
sequencing) in which an allele that is actually of one type (e.g.,
a T) gets called a different allele (e.g., an A). This commonly has
the effect of prematurely terminating a sequence of matches that
would otherwise be long enough to be considered an IBD segment.
Phase switch error is error introduced by phasing: maternal and
paternal copies are reversed. See FIG. 1. In some typical cases,
statistical phasing processes introduce a phase switch error, on
average, about every 12 centimorgans.
[0066] FIG. 1 illustrates phase switch error in fragment long IBD
segments. As shown in the left panel, two IBD segments (thicker
lines) are shared on a single chromosome in two related individuals
(top and bottom). As shown in the right panel, phase switch errors
(illustrated by vertical dashed lines) occur at different positions
along the chromosome in the two individuals, fragmenting the true
IBD segments into many erroneous false IBD segments. The individual
represented by the top two haplotypes has six phase switch errors
and the individual represented by the bottom two haplotypes has
four phase switch errors.
[0067] Various embodiments disclosed herein address phase shift
and/or genotyping error and thereby improve IBD segment
identification using phased haplotype data. In certain embodiments,
computational methods employ different templates of masked and
unmasked regions to exclude certain haplotype sites during IBD
segment identification. By combining results using different
templates, the computational methods mitigate genotyping error. In
certain embodiments, a probabilistic model considering meiotic
rates, phase switch error in phased data, and distance between
adjacent haplotype sites is used to identify and remove some phase
shift error. In some implementations, phased haplotype data is
processed using a method based on the positional Burrows-Wheeler
transform (PBWT) and a probabilistic hidden Markov model (HMM). In
some implementations, phased haplotype data is processed using a
method based on the PBWT and a heuristic to correct phase switch
errors. These approaches have been found to make fast and accurate
phase aware IBD estimates.
[0068] Various IBD segment finding methods to compute phase aware
IBD segments are disclosed herein. These methods may minimize
genotyping and phasing error using one or more of the following
techniques:
[0069] 1. To handle genotyping error, an IBD segment computation
procedure passes multiple digital templates over phased haplotypes
to differentially mask haplotype sites, thereby temporarily
ignoring different sites of potential genotype error. The IBD
segments being generated by the different templates are combined to
effectively remove fragmentation caused by genotyping error. In
some embodiments, the procedure is a templated positional
Burrows-Wheeler transform. This is the positional Burrows-Wheeler
transform (PBWT; Durbin 2014) with substantial modifications to
handle genotyping errors and missing data.
[0070] 2. To handle phase switch error reduction, IBD segments that
are likely fragmented by phase switch errors introduced during
statistical phasing are addressed using a heuristic that recognizes
likely occurrences of phase switch errors and/or a probabilistic
model that accounts for error rates of the phasing techniques.
[0071] In certain embodiments, the model is applied to
haplotype/IBD segment data generated using the templating approach
described in 1. Some implementations apply a hidden Markov model
(HMM) that accounts for both recombination and the phase switching
process to reduce these errors. The HMM passes along the
chromosomes of the two individuals "stitching" fractured IBD
segments together.
[0072] In some embodiments, a heuristic is applied that identifies
fractured IBD segments based on potential IBD segments that start
or end within a distance of the start or end of another IBD
segment. The heuristic may stitch the IBD segments together by, for
example, swapping haplotype segments in a target individual. In
some embodiments, the heuristic may be applied as potential IBD
segments are generated using the templating approach described in 1
without a separate iteration over the IBD segment data. In some
embodiments, the heuristic assumes that IBD segments within a
threshold distance of each other are likely to be a single IBD
segment fractured by one or more phase switch errors.
[0073] An example workflow is illustrated in the flow chart of FIG.
2A. As shown, a process 201 begins by initially obtaining phased
sequence data for two or more individuals who are to be compared to
detect IBD segments. See block 203. In various embodiments, phased
data for many more individuals is obtained. In some contexts,
hundreds, thousands, or even millions of individuals have their
phased haplotype data compared using methods disclosed herein. In
some implementations data for greater than about 10 million
individuals' phased haplotype data are compared. In any event, the
operation 203 involves identifying individuals or haplotypes on
which an IBD comparison is to be run.
[0074] The phased data includes two strings of haplotype data for
each individual (one per chromosome). In other words, there are
four haplotypes to be considered for two individuals. Phased
haplotype data may be obtained from various sources including
statistical techniques such as BEAGLE, FINCH, EAGLE, and other
known techniques. An example discussion of phasing techniques is
presented in U.S. Pat. No. 9,836,576, filed Mar. 13, 2013, and
incorporated herein by reference in its entirety.
[0075] The haplotype data may be represented as strings of allele
values (e.g., 1s and 0s) for sites in the haplotype, each of which
is the site of a polymorphism. Each such site may be referred to as
an index in the haplotype string. In various embodiments, the
process assumes that each haplotype site is a biallelic site on a
chromosome. It may be given a value of, e.g., 0 for one allele and
a value of 1 for the other allele. A typical chromosome may provide
hundreds of thousands of sites. Each haplotype may be given its own
unique identifier, which may be arbitrarily set.
[0076] The phased haplotype data is provided to a first processing
block as illustrated by a block 205. This processing may reduce the
fragmenting impact of genotyping error in IBD segment finding. In
the first processing block, multiple operations are performed in
parallel, sequentially, or in some combination thereof. In various
embodiments, significant computational efficiency is realized by
performing these operations together for a given haplotype (e.g.,
in an inner loop of a software routine as illustrated in the sample
code below). The operations performed in the first phase include
the following: (a) applying digital templates with masked and
unmasked positions to exclude certain haplotype sites along the
length of the haplotype, (b) identifying matching allele values at
unmasked positions along the haplotypes to identify putative IBD
segments for the various digital templates, and (c) merging the
resulting IBD segments (e.g., as they are being generated) from the
various digital templates.
[0077] As explained in more detail elsewhere herein, the IBD
segment finding logic may be implemented using the PBWT method.
Regardless of whether PBWT or another IBD segment finding process
is used, the method produces multiple putative IBD sub-segments,
one for each different digital template used. The method also
merges or stitches together the IBD sub-segment as generated using
each of the various templates. As suggested, in some embodiments,
the templating, comparing, and merging operations are performed at
a given site, before considering the next site, where the three
operations are again performed.
[0078] There are many different ways to construct the digital
templates. In various embodiments, they are constructed as small
windows that can be "slid" or "ratcheted" along the length of the
haplotype strings, considering consecutive sub-segments of the
haplotype sites as they go. Criteria to consider in choosing
template structures are the length of the template, the number of
masked or null sites in the template, and the arrangement of masked
and unmasked sites in the template. Typically, full sets of
templates are used in the process that contain all possible
arrangements of masked and unmasked sites in a template length. An
example set of four site digital templates, each employing two
masked and two unmasked positions is described below. Of course,
the process may alternatively employ larger (or smaller) templates
and/or use templates having a higher or lower proportion of null
positions per template.
[0079] The output of the first processing block implemented in
operation 205 is a set of IBD segments or other haplotype matching
data for combinations of the various individuals whose phased
haplotype data is processed. This data is then passed to a second
processing block for processing as illustrated in an operation 207.
A goal of this second operation may involve reducing the
fragmenting impact of phase switch error in IBD segment
finding.
[0080] In operation 207, the haplotype/IBD data is subjected to a
probabilistic model that accounts for recombination rates based on
meioses, which vary based on degree or relatedness of any two
individuals, and rates of phase switch errors introduced by the
phasing technique(s) employed to generate the phased haplotype
data. The model may also account for other inputs such as the
genetic distance between adjacent sites on the haplotype and/or the
probability of having full IBD state. As exemplified below, a
hidden Markov model may be used to implement the probabilistic
model.
[0081] An optional last operation of the process 201 involves
presenting the processed IBD information in a way that can show the
degree of relatedness of the two or more haplotypes that are
compared. See the operation represented in block 209.
[0082] Another example workflow is illustrated in the flow chart of
FIG. 2B. As shown, a process 251 begins by initially obtaining
phased sequence data for two or more individuals who are to be
compared for identifying IBD segments. See block 253. As with other
embodiments, phased data for many more individuals may be obtained,
e.g., hundreds to millions of individuals. Regardless, the
operation of block 253 involves identifying individuals and/or
haplotypes on which an IBD comparison is to be run.
[0083] Computational aspects of process 251 include sequentially
considering haplotype positions for genotype errors and phase
switch errors within individual haplotypes while keeping track of
match segments between haplotypes. Processing each new haplotype
position is initiated at a process operation 254, which selects the
next position in the haplotypes under consideration.
[0084] For a given haplotype position, a first processing operation
255 considers possible errors in the individual haplotypes using
multiple templates such as those described elsewhere herein.
Haplotype position matches are determined using these templates,
and, from these results, an overall decision on matching segments
is made. In some implementations, the resulting match segments have
reduced genotyping error.
[0085] The haplotype position under consideration is then analyzed
in a processing operation as illustrated in block 257. A result of
this operation may involve reducing the fragmenting impact of
genotyping error in IBD segment finding.
[0086] In operation 257, the haplotype/IBD data may be analyzed by
one or more phase switch heuristics and/or models that identify
situations where phase switch errors are likely to have occurred.
As an example, a heuristic may identify situations where one or
more IBD segments between individuals end at a first position and
then new IBD segments begin at a second position within a threshold
distance from the first position. An identified likely phase switch
error may be corrected by joining the IBD segments in an individual
identified to possess the likely phase switch error. In some cases,
error is corrected by swapping haplotype segments within the
identified individual. Regardless of the technique implemented in
operation 257, a result of the operation may involve reducing the
fragmenting impact of phase switch error in IBD segment
finding.
[0087] An optional last operation of the process 251 involves
presenting the processed IBD information in a way that can show the
degree of relatedness of the two or more haplotypes that are
compared. See the operation represented in block 259.
IBD Segments
[0088] Various repeat expansion analyses using DNA samples involve
aligning or mapping sequence reads from a sequencer to a reference
sequence. A reference sequence may be the sequence of a whole
genome, the sequence of a chromosome, the sequence of a
sub-chromosomal region, etc. From a computational perspective,
repeats create ambiguities in alignment, which, in turn, can
produce biases and errors even at the whole chromosome counting
level. Paired end reads coupled with adjustable insert length in
various embodiments can help to eliminate ambiguity in alignment of
repeating sequences and detection of repeat expansion.
[0089] In various embodiments, a goal of the process is to use
alignment of multiple haplotypes to determine genetic
relationship(s) between two or more individuals, or in some cases
that potentially involve inbreeding, within a single individual.
Fundamentally, the process determines relationships between two
haplotypes. IBD may be used for this purpose.
[0090] IBD can be understood in the context of meiosis and
recombinable DNA. Because of recombination and independent
assortment of chromosomes, the autosomal DNA and X chromosome DNA
(collectively referred to as recombinable DNA) from the parents is
shuffled at the next generation, with small amounts of mutation.
Thus, only relatives will share long stretches of genomic regions
where their recombinable DNA is completely or nearly identical.
Such regions are referred to as "identical-by-descent" (IBD)
regions because they arose from the same DNA sequences in an
earlier generation/common ancestor.
[0091] In some embodiments, locating IBD regions includes
sequencing the entire genomes of the individuals and comparing the
genome sequences. In some embodiments, locating IBD regions
includes assaying a large number of markers that tend to vary in
different individuals and comparing the markers. Examples of such
markers include Single Nucleotide Polymorphisms (SNPs), which are
points along the genome with two or more variations; e.g., Short
Tandem Repeats (STRs), which are repeated patterns of two or more
repeated nucleotide sequences adjacent to each other; and
Copy-Number Variants (CNVs), which include longer sequences of DNA
that could be present in varying numbers in different individuals.
Long stretches of DNA sequences from different individuals' genomes
in which markers in the same locations are the same indicate that
the rest of the sequences, although not assayed directly, are also
likely identical.
[0092] Techniques for matching individual haplotypes, e.g., by
using IBD are known. Some of them cannot efficiently handle large
numbers of haplotypes and even those that can, do not adequately
account for errors. Some discussion of IBD segments may be found in
U.S. Pat. No. 8,463,554 filed Dec. 22, 2009, and incorporated
herein by reference in its entirety.
Templated PBWT
[0093] To initially detect IBD segments one may extend the
positional Burrows-Wheeler transform (PBWT; Durbin 2014). In
certain embodiments, a PBWT process is implemented according to the
following description. Initially, each haplotype under
consideration is given its own unique identifier, which may be
arbitrarily set. Then, during execution, the method steps through
the sites of all haplotypes under consideration,
position-by-position, starting at a first position, which may be
identified as position 0. As the method steps through the haplotype
sites, it keeps track of two arrays, which are updated for every
position (index) in the haplotypes. Also, during a pass through the
haplotype sites, a templated PBWT process may apply one, some, or
all of the digital templates at each position.
[0094] The first array is a "positional prefix array" that contains
a list of all haplotypes under consideration. It is populated with
IDs of all the haplotypes. A separate instance of the positional
prefix array is produced each time a new site is encountered while
traversing the haplotype string. Over the course of the process,
and while certain haplotypes have identical allele values from one
position to the next, these haplotypes are grouped together in the
positional prefix array. In other words, haplotypes having matching
allele values, remain together (in the same block) within the
positional prefix array for as long as their alleles match. By
keeping the haplotypes together while alleles match, the positional
prefix array contains information about putative IBD segments.
[0095] The second array is a "divergence array" that indicates
where matches between any two haplotypes under consideration began.
It reflects how many positions/sites back in the haplotype string
until there was a difference. In other words, this matrix keeps
track of the last time that two haplotypes did not match by, e.g.,
providing the index value of the last mismatch for any two
haplotypes.
[0096] An example of a general IBD segment finding process 301 is
depicted in FIG. 3. As illustrated, the process begins by receiving
haplotype strings representing allele values across all haplotypes
to be considered in the process. See block 303. This may correspond
to block 203 in FIG. 2A and/or block 253 of FIG. 2B.
[0097] Before considering the first site in the haplotypes, the
process lists all haplotypes in the positional prefix array. It may
do this randomly or in some order, but typically it does not yet
account for the allele values at any haplotype site. The individual
haplotypes may be listed by unique identifiers. Further, before
considering the first site on the haplotype, the values in
divergence array are all set 0 because there are no previous sites
that have been considered. The array initializations are
illustrated by operation 305 in process 301 of FIG. 3.
[0098] At the first site of the haplotype strings (which may be
reflected as the first column in an aligned set of the haplotype
strings), the process goes through all haplotypes in the order of
the positional prefix array (which may initially be random or
otherwise arbitrary) and orders the haplotypes such that those that
have a first allele value (e.g., 0) in the current position are
grouped together at the top, and all that have a second allele
value (e.g., 1) in the position are grouped together at the bottom.
See operation 309. Of course, this order could be reversed or even
extended in the case of multiallelic sites. Either way, this
operation produces a new positional prefix array in which all
haplotype indexes that have a 0 at the current position are grouped
together in the array, and all haplotype indexes that have a 1 at
the position are grouped together. By "grouped together," it is
meant that haplotype identifiers are provided in adjacent positions
in the positional prefix array. This is illustrated in FIG. 4 which
shows a group of six haplotype strings and the associated
positional prefix array at a few haplotype sites. Obviously, a
typical haplotype has many more sites than illustrated for the
haplotypes in FIG. 4. Further, a typical IBD segment finding
process may simultaneously evaluate many more haplotype strings
(e.g., hundreds, thousands, or millions).
[0099] Regarding the divergence array, by considering the allele
values at the first index position with those in the previous
position (which does not exist in this iteration), the process
notes that all potential IBD segments begin at site 0 and therefore
they effectively have a mismatch at position 0. Therefore, the
first entry in the divergence array is all zeros. See operation 311
in process 301 and the first column in the divergence array of FIG.
4.
[0100] The order of haplotypes in the divergence array is the same
as in the positional prefix array. Using this order of haplotypes
in the divergence array, the values in the divergence array are,
for currently matching haplotypes, the sites (index values) of the
first matching position between the two adjacent haplotypes within
the array. Thus, for adjacent haplotypes that currently match, the
value in the divergence array is the first matching position of the
current segment. However if two adjacent haplotypes no longer match
at the current site the method assigns the next site to the new div
array even though it has not peeked ahead and checked if the two
haplotypes actually match at the next site. If, in the next
iteration, the method learns that the segments still do not match,
the relevant value in the divergence array simply gets updated
again.
[0101] Returning to FIG. 3 and ignoring for now operations 313 and
315, when the process has completed operations on a given haplotype
site, it determines whether there are any further sites to
consider. See operation 317. If there are such sites, the process
loops back to an operation 307, where the next site/index is
selected for consideration.
[0102] At the next column (associated with the next site in the
aligned haplotype strings), the process again goes through the
haplotypes and again rearranges the haplotype identifiers in the
positional prefix array so that those having the same allele value
at the current position are grouped together, e.g., all haplotypes
having a 0 allele value in the current position are grouped at the
top of the array and all those having a 1 allele value are grouped
at the bottom. Haplotype strings that have the same alleles over
multiple consecutive positions stay near one another in the
positional prefix array.
[0103] The divergence array uses the new arrangement of haplotypes
(from the positional prefix array), flags any mismatches between
adjacent haplotypes and the current position and inserts the next
haplotype site number for mismatching pairs. The next site number
is the location of the next possible start position for a new match
segment.
[0104] Thus, in some implementations, element i in the divergence
array indicates when a current segment match began between the
haplotype at ppa[i] and the haplotype at ppa[i-1]. For purposes of
example, consider the case where at position 5 in the haplotype
alignment, the positional prefix array has the following values:
[0105] 2 [0106] 3 [0107] 1 [0108] 4 And the divergence array has
the following values: [0109] 0 [0110] 0 [0111] 3 [0112] 2
[0113] In this example haplotypes 2 and 3 have a match that extends
from the beginning of the alignment (position 0) to the current
position (position 5). Haplotype 1 matches with haplotype 3 from
position 3 to position 5 (which also implies haplotype 1 and
haplotype 2 have the same match). And haplotype 4 matches with
haplotype 1 from position 2 to position 5 (which also implies
haplotype 4 matches haplotypes 1, 2, and 3 between positions 3 and
5). There are no mismatched haplotypes at the current site in this
example. The routines use the alleles at the current position in
the alignment to construct the divergence array for the next
position. So, in the above example, if at position 5 two haplotypes
do not match, the routine inserts 6 into the position of the
haplotypes in the divergence array. Note that the method does not
check whether the two haplotypes actually match at position 6,
which is why the divergence array does not always contain the
beginning position of matches. Once the method actually visits
position 6 this value will be overwritten if the haplotypes do not
match at site 7. The method continues in this fashion (overwriting
values in the divergence array) until actual matches are found.
[0114] Over the course of the process, the lengths of matching
segments of all haplotype pairs are tracked. When the match length
of two haplotypes is greater than a preset threshold, the process
flags the two haplotypes as having a potential IBD segment. In the
example of process 301, it does this by creating new match segment
records when two haplotypes have a number of consecutive shared
matches that is greater than threshold number of consecutive sites.
See operation 313. The threshold value may be chosen to balance
speed and sensitivity. In certain examples, the threshold number is
between about 50 and 1000 sites (e.g., about 200 matching sites).
Similarly, the process may complete a match segment record when two
matching haplotypes finally diverge in allele values, thereby
ending the match segment. See operation 315. To this end, the
process may maintain a separate report populated with matches of
greater than the threshold length. The matches may be identified by
start position and end position (indexes) and the haplotypes
involved in the match (e.g., haplotype ID #11 and haplotype ID #5).
In certain embodiments, the match segment includes both the
starting and ending sites of the match segment.
[0115] In cases where a match segment does not end--such as when
the end of a haplotype string ends and there are no further sites
to consider--the process may still flag the match segment for
further consideration. As with matches having defined end points,
the match is identified by the two matching haplotypes and their
starting index for the match segment. The ending index for the
match region is the site at the end of the haplotype.
[0116] For the sake of clarity, the example of FIG. 3 does not
illustrate treatment of the haplotype strings by different digital
templates. This will be discussed below.
[0117] The process 301 proceeds through operations 307-315 for each
successive haplotype site and reorders the haplotype IDs in the
positional prefix array based on matches (the haplotypes having a 0
at the current position are grouped together and those having a 1
are grouped together).
[0118] In general, the haplotypes that have long stretches of
matched sites stay together in the positional prefix array for long
durations. This is because all matching haplotypes stay together in
the positional prefix array until one of them has a different
allele value at a particular haplotype position. At that point, the
one or more haplotypes that diverge from the larger group are moved
to a different position in the positional prefix array. By
preserving the positional prefix arrays for each position in the
haplotype, the method keeps sufficient information to reconstruct
all IBD segments for any two haplotypes. This includes all
haplotypes under consideration, including the first and last
haplotypes in the positional prefix array.
[0119] In the divergence array, the values for two adjacent
haplotypes (in the array) remain at a value of their first match
until they again mismatch, at which time the corresponding
divergence array value is marked with the next index position after
the new mismatch. FIG. 4 presents an example of positional prefix
array values and divergence array values for a few sites on an
alignment of haplotype strings.
[0120] When there are no further haplotype sites to consider, as
indicated by process block 317, potential IBD segments have been
identified and these may be processed in various ways such as,
optionally, being used in a relatedness analysis of individuals
whose haplotypes were considered in the analysis. See operation
319. Alternatively, and as described with reference to FIG. 2A and
as further described below, the potential IBD segments may be
further processed by a model such as an HMM model to correct for
phase shift errors. However, it should be understood that this
additional processing is optional, particularly in situations where
phased haplotype data can be expected to have relatively few phase
shift errors. In some embodiments, potential IBD segments that are
flagged by the PBWT process, but are shorter than a defined genetic
length, are excluded from further consideration. An example of a
threshold genetic length is between about 1.5 and 3 centimorgans.
Thus, for example, if a segment extends over 200 sites but does not
extend the full required genetic distance, the method discards the
segment.
[0121] The PBWT process, as well as many other IBD segment finding
processes, assumes that there are no errors. If there is in fact an
error, it may prematurely truncate a sequence of matches and/or
artificially prolong a sequence. Typically, long matches that in
fact exist (e.g., between close relatives) are prematurely broken
due to genotyping and/or phase switch errors.
[0122] As mentioned, integrating digital templates into an IBD
segment finding process can mitigate the impact of some errors,
particularly genotyping errors. One approach employs a digital
template that shifts over the haplotype strings and masks certain
haplotype sites from consideration as it goes. This approach takes
a normal haplotype alignment but applies the template to skip over
some sites that would otherwise be considered. With the excluded
sites removed from consideration, the process identifies putative
IBD segment matches using a general approach such as PBWT. By
masking some sites from comparison, sites of erroneous calls may be
ignored. Some templates may consider the erroneously called alleles
while others exclude them. By considering putative IBD segments
created using all the templates, the process can remove breaks and
more accurately identify complete IBD segments.
[0123] The template provides a sliding window of consecutive sites
having, in some embodiments, a fixed mask pattern. The template is
moved successively along the haplotype string, typically with no
overlap of sites between one application of the template and the
next. Other than masking some sites, the process is similar to a
generic IBD segment finding method such as the PBWT process. That
is, in some embodiments, the process generates a positional prefix
array and a divergence array for haplotype strings modified by the
template. In the course of the process, the computational system
flags and records matching segments as before. But the matching
segments produced by single templates have some sites excluded. In
alternative embodiments, the templating function employs a
probabilistic function to pick mask sites.
[0124] In some embodiments, the mask pattern is deterministic based
on the template. The masking of sites may follow a specific
pattern, based on each template, rather than a random selection or
masking of sites. For example, the mask pattern may remain fixed as
the process moves from one haplotype site to the next. In some
cases, however, the mask pattern may vary as the process moves over
haplotype sites, but such variation may be deterministic rather
than random.
[0125] In certain embodiments, the process employs multiple fixed
templates for a given matching problem. Examples of templates
include OhOh (all odd sites), hOhO (all even sites), OOhh, hhOO,
OhhO, and hOOh, where sites at O will be masked out and only sites
at h will be used to construct the method. The choice of templates
to use together in a process may be made such that for the fixed
length of the templates (e.g., the four site templates exemplified
here), may guarantee that if there were any errors (e.g., two
errors) within this window, at least one of these templates
correctly report a match. For example, if there were errors at
sites 2 and 4 at one application of a four site template, only the
hOhO would give an error-free read.
[0126] Depending on the number of indexes/sites in a template and
the number excluded within a window, different numbers of templates
may be used in a given process. Broadly, the total number of
digital templates may be between 1 and k, where k is the number of
haplotype sites in the window. In some implementations, the total
number of digital templates is k!/(m!*(k-m)!), where k is the
number of haplotype sites in the window and m is the number of
masked sites in the window.
[0127] In certain embodiments, such as the case of a four site
template with two null sites, six templates may be used. However,
other template combinations may be employed; e.g., at least two
templates, at least six templates, at least eight templates, at
least ten templates, etc. In various embodiments, the templates are
characterized by a ratio of masked to unmasked sites which ranges
between about 1/w and (w-1)/w, where w is the length of the
template window In various embodiments, the templates are
characterized by a length equivalent to the total size of the
haplotype alignment. As examples, a range of template lengths is
between about three and ten consecutive sites.
[0128] A templating function can be tuned to alter sensitivity to
error. As suggested, one templating function may be implanted as a
decision tree that uses a window size of 4 haplotype sites and 6
templates, and so guarantees any matches within that 4 site window
as long as there are no more than 2 errors. If i is the current
template (range 0 to 5) and k is the current position within a
template window (range 0 to 3), then this templating function T(i,
k) may be represented as:
TABLE-US-00001 T(i, k): if i == 0 and (k == 0 or k == 2): return 1
else if i == 1 and (k == 1 or k == 3): return 1 else if i == 2 and
(k == 0 or k == 1): return 1 else if i == 3 and (k == 2 or k == 3):
return 1 else if i == 4 and (k == 0 or k == 3): return 1 else if i
== 5 and (k == 1 or k == 2): return 1 else return 0
[0129] FIG. 5A illustrates application of templates to an IBD
segment finding method. The depicted process 501 begins by
receiving data and setting up parameters for the routine. This may
involve receiving phased haplotypes to be used in the templated
matching routine (block 503), defining templates to apply (block
505), and setting up any needed matrices and arrays such as a
positional prefix array and a divergence array. The depicted
process loops over the various sites of the haplotypes, and at each
site it loops over the available templates. This is depicted as
follows.
[0130] The process increments to the next haplotype site at an
operation 507, and while at the current site, it iterates over the
various templates, starting by incrementing to the next template at
an operation 509. While the routine is fixed at a particular
template, the process identifies matches and mismatches among the
haplotype strings (block 511) and merges match segments for the
various templates (block 513). Operation 511 identifies matches
only if the current haplotype site is unmasked in the current
template. Assuming that the current site is unmasked, operation 511
may be implemented in various ways such as by updating positional
prefix and divergence arrays. Note that each template may have its
own match segment information. Using this information, operation
513 may merge currently pending segments (at the current haplotype
site) from among the various templates.
[0131] Operation 515 serves to iterate over all the templates while
the process is fixed at a given haplotype site and operation 517
serves to iterate over all haplotype sites. Ultimately all
haplotype sites are considered and the error-corrected IBD segments
are completed. See operation 519.
[0132] FIG. 5B illustrates another application of templating to an
IBD segment finding method. The depicted process 551 begins by
receiving data and set up parameters for the routine. This may
involve receiving phased haplotypes (block 553) to be used in the
templated matching routine, defining templates to apply (block
555), and setting up any needed matrices and arrays such as a
positional prefix array and a divergence array.
[0133] The depicted process loops over the various sites of the
haplotypes, and at each site it loops over the available templates.
This is depicted in the figure as follows. The process increments
to the next haplotype site (the current haplotype site) at an
operation 557, and while at the current site, it iterates over the
various templates, starting by incrementing to the next template at
an operation 559. While the routine is fixed at a particular
template, the process identifies matches and mismatches among the
haplotype strings (block 561) and merges match segments for the
various templates (block 563). An operation 561 identifies matches
only if the current haplotype site is unmasked in the current
template. Assuming that the current site is unmasked, operation 561
may be implemented in various ways such as by updating positional
prefix and divergence arrays. Note that each template may have its
own match segment information. Using this information, operation
563 may merge currently pending segments (at the current haplotype
site) from among the various templates. Operation 565 serves to
iterate over all the templates while the process is fixed at a
given haplotype site.
[0134] An optional operation 566 identifies and addresses phase
switch errors at the current haplotype position. As indicated, in
some embodiments, phase switch errors may be addressed using a
heuristic that recognizes typical phase switch errors.
[0135] An operation 567 serves to iterate over all haplotype sites.
If any of the templates indicates a continuous sequence of matching
sites including the current site or sites adjacent to the current
site, the match sequence is deemed to continue, even if one or more
of the templates indicates a gap in the match sequence. Ultimately
all haplotype sites are considered and the error-corrected IBD
segments are completed. See operation 569.
[0136] While processes 501 and 551 are implemented in a manner in
which all templates are considered at one site before incrementing
to the next site, the templating process need not be implemented in
this manner. A different approach considers all sites for one
template, saves that template's putative IBD segments, considers
all sites for a second template, saves that template's putative IBD
segments, and so on until all templates are considered. The
resulting putative, template-specific IBD segments may then be
merged.
[0137] Merging may involve aligning the putative IBD segments from
each templated result, and then scanning through the
template-specific segments for pairs of haplotypes. During this
process, as long as one of the six templates (or however many are
used) still shows a continuing segment, the method keeps a merged
IBD segment intact.
[0138] In various embodiments, the methods assume that any IBD
start or end points within an otherwise continuous IBD segment are
caused by errors. This is a reasonable assumption because the
comparison is made between two individuals. There is a very low
probability that two haplotypes will match, for greater than a
threshold length, by chance.
[0139] In some embodiments, an additional filtering operation to
remove some putative IBD segments is performed after one of the
above-described processes such as process 301 or process 501. For
example, the filter may operate by discarding putative IBD segments
of size below three centimorgans.
[0140] To help illustrate the range of implementations, the
following example description of templated PBWT is provided. Given
M haplotypes with N bi-allelic sites, the PBWT algorithm can
identify identical subsequences of the haplotypes in O(NM) time. A
limitation of PBWT is that it requires exact subsequence matches
with no errors or missing data. To reduce sensitivity to error and
missing data, a templated PBWT may be used. A templated PBWT may be
designed or configured to identify matching subsequences of the
haplotypes despite missing data and genotyping errors with only a
small linear increase in computational time compared to the
PBWT.
[0141] One approach for extending PBWT to report matching
haplotypes that include some errors involves constructing multiple
replicates of the PBWT data structure. Each of these PBWTs is built
by masking the haplotype alignment using a different repeating
template. Each PBWT may then be individually swept through
identifying exact subsequence matches. The matching subsequences
from all PBWTs (each from a different template) may then be merged
to produce all matching subsequences within the full (unmasked)
haplotype alignment.
[0142] Many different digital template repeating structures may be
used. One example uses different repeating templates: for example
OhOh, hOhO, OOhh, hhOO, OhhO, and hOOh, where sites at O will be
masked out and only sites at h are used to construct the IBD
segments using, e.g., PBWT. These example templates address
haplotype data with no more than two errors per four site window.
The design of these six specific templates guarantees that all
matches across any given four site window will be found as long as
there are no more than two errors within the window. This is
because given any possible arrangement of two errors across four
sites in the original haplotype alignment at least one of the PBWT
replicates will have those errors masked out and therefore still
deliver the match correctly.
[0143] This method's sensitivity to errors may be modified by
changing the arrangement and number of templates. For example, more
templates could be utilized to ensure matches across longer
windows; indeed (n/k) templates are required to ensure all matches
across windows of size n with no more than k errors per window. In
practice genotyping errors are often low enough that six templates
would be adequate (templates of length 4 with two sites masked);
even with a genotyping error rate as high as 0.001 the probability
of three errors within a four site window is 3.996.times.10.sup.-9.
Running each templated PBWT replicate can be easily
parallelized.
[0144] Templating the PBWT as described above to handle errors and
return subsequence matches can be executed in linear time by
passing through the data only once and avoiding the need for a
post-hoc merging algorithm. In some PBWT implementations, at each
position k within the haplotype alignment two arrays are
constructed: ppak the positional prefix array and divk the
divergence array (Durbin 2014). ppak is a list of the haplotypes
sorted so that their reversed prefixes (from k-1 to 0) are ordered.
This ordering ensures that haplotypes that match through position
k-1 will end up adjacent to one another in ppak. The divergence
array divk keeps track of where those matches began, the ith
element in divk represents the beginning of the match between the
ith element in ppak and the i-1th element in ppak.
[0145] In certain embodiments, to create a templated PBWT, the
method constructs a separate ppaj,k and divj,k for each template j
used at site k. In this approach, a set of templates (as described
above) may be formalized as an indicator function T (j, k) with the
value 0 when the template j skips over site k and 1 if template j
processes site k. As the haplotype alignment is passed through, T
(j, k) is called for each template j; if T (j, k) is 1 then ppaj,k
and divj,k are assembled accordingly. When a matching subsequence
of at least Lm sites terminates at site k under template j the
start and end positions of the match are stored in auxiliary data
structures Ps and Pe, respectively. Ps and Pe are both M by M two
dimensional arrays in which the position x, y holds the start/end
positions of the match between haplotype x and haplotype y. If
another subsegment has already been stored the routine checks to
see if the new matching subsegment overlaps and possibly extends
the existing subsegment. If they do not overlap, the routine checks
if the old matching segment has a genetic length (in cM) of at
least Lf and then reports it. The new matching subsegment is then
stored in its place. In this way the "templating" of the haplotype
alignment is performed within this modified form of the PBWT
itself, and matching subsegments from each template are merged and
extended directly as the haplotype alignment is passed through.
Depending on the choice of T (j, k), the templated PBWT has a
worst-case time complexity of O(NMt) where t represents the number
of templates defined within T (j, k); thus the method represents a
linear tradeoff between the speed of PBWT and sensitivity to error.
However, using the templates described in the paragraphs above the
templated PBWT takes time O(NM3) because t=6 and N becomes N/2
since each template only processes 2 out of every 4 positions in
the alignment.
[0146] An example templated PBWT is further detailed as pseudocode
in Algorithm 1. The algorithm employs 2 parameters: (1) Lm is the
minimum number of sites that a sub-segment must span within the
haplotype alignment to be merged and extend other sub-segments, and
(2) Lf is the final minimum length (in cM) that a segment must have
to be reported by the algorithm. Notably the algorithm handles
missing data by extending the current longest match. At site k the
longest matching haplotype to haplotype ppaj,k will be either
ppaj-1,k or ppaj+1,k, so if missing data in ppaj,k is encountered
it is simply assumed the haplotype continues to extend the longest
match. One additional detail is omitted in Algorithm 1 for space
considerations; after passing through all sites in the haplotype
alignment the routine loops through the haplotypes one last time to
report any "trailing" matches (matches that extend all the way
through the end of the haplotypes).
[0147] FIG. 5D illustrates the Templated PBWT data structures. To
identify haplotype sharing among a large panel of haplotypes, the
TPBWT passes once through an M by N by t three-dimensional
structure where M is the number of haplotypes, N is the number of
bi-allelic sites, and t is the number of templates. Each template
is a pattern at which sites are masked out (shaded out in the
figure). During the left-to-right pass through this structure, at
each site k, two arrays are updated. The positional prefix array
ppa and the divergence array div are both two dimensional arrays of
size M by t. At site k, each of the t columns of ppa and div are
updated for the templates that are not masked out. Each of the t
columns in ppa contains the haplotypes sorted in order of their
reversed prefixes. Similarly, each of the t columns in div contains
the position at which matches began between haplotypes adjacent to
one another in the sorted order of ppa. During the left-to-right
pass through this structure, short fragments of IBD shared between
haplotypes i and j, broken up by errors, are identified by each of
the t templates (green arrows). As these fragments are identified
they are merged and extended with one another in the current match
arrays Ps and Pe. While merging and extending IBD fragments a
heuristic may be used to scan for and fix putative phase switch
errors, as will be discussed further herein.
Algorithm 1
[0148] Templated PBWT algorithm to find matching subsequences. Here
t represents the number of templates defined within the templating
function T(t, k), N is the number of sites in the haplotype
alignment, and M is the number of haplotypes. Additionally
A.sub.j,k is the allele at position k for haplotype ppa.sub.t,j.
FIG. 5C shows pseudocodes of the algorithm.
Phase Correction Heuristic
[0149] As noted above, long IBD segments may be fractured by phase
switch errors introduced by phasing techniques used to phase the
haplotype data of the individuals. The locations and frequencies of
such fractures may occur in predictable ways. In some embodiments a
heuristic is employed to correct phase switch errors as IBD
segments are identified. As noted herein in FIG. 5B and operation
566, a heuristic may be applied to identify and correct potential
phase switch errors by analysis of sequential haplotype sites. In
certain embodiments, a heuristic involves merging adjacent
potential IBD segments (matching segments) that end within a
threshold distance and then joining or swapping haplotype segments
of a given individual, as necessary, to correct phase switch
errors. While not wishing to be bound by any particular theory,
this heuristic may improve IBD determination because it is
biologically unlikely that two IBD segments on the same or opposite
haplotypes of an individual both end within a particular distance
(e.g., about 500 or fewer SNPs on the same or opposite
haplotypes).
[0150] In some embodiments, the phase switch heuristic is turned
off between closely related pairs of individuals, e.g., between
parent and child. For example, if an individual is trio-phased
(phasing a child's genotype compared to the parent's genotype), the
phasing is considered highly accurate and there are few to no phase
switch errors. While the phase switch heuristic is discussed in the
context of a Templated PBWT process, the heuristic may be used
alone or in conjunction with any of various other algorithms that
identify IBD segments for phased haplotype data. Such other
algorithms may or may not include analyses that identify and/or
correct genotyping and similar errors.
[0151] As a new IBD segment is identified, the start position of
the new IBD segment is compared to the end position of an adjacent
IBD segment. In some cases, the start position of the new IBD
segment and the end position of the adjacent IBD segment are on the
same haplotype with a gap between them. In some cases, the start
position of the new IBD segment and the end position of the
adjacent IBD segment are on opposing haplotypes with either a gap
between them or an overlap.
[0152] If the length of the gap or overlap between adjacent IBD
segments is within a threshold value, then the two IBD segments may
be merged to form a single IBD segment. In some embodiments, the
threshold value is between about 0-500 SNPs, about 0-300 SNPs,
about 200-300 SNPs, or about 0-100 SNPs. In some embodiments, the
threshold value for merging adjacent IBD segments is the same
threshold value for determining that two haplotypes have a minimum
number of sites that a sub-segment spans to be considered a
potential IBD segment (Lm). If the two IBD segments are on opposite
haplotypes, portions of the haplotypes (i.e., haplotype segments)
may be swapped starting at the location of a break in the IBD
segments. Thus, as the process proceeds to the next haplotype site
and beyond, the haplotypes remain swapped unless/until the
heuristic determines another phase switch error has occurred and
swaps the haplotypes. As the Templated PBWT continues along the
chromosome sites, the haplotypes used to identify potential IBD
segments remain swapped. In effect, in some embodiments the
heuristic is used to correct the actual haplotypes for phase switch
errors in addition to correcting IBD segments.
[0153] In some implementations, the merged potential IBD segment
must have a minimum length Lf to be deemed an IBD segment. Of
course, a merged segment that does not initially qualify as an IBD
segment may grow to a length required to be an IBD segment.
[0154] FIG. 6A illustrates via a series of diagrams how the
heuristic may be applied for four individuals that share IBD
segments with a focal person. Each pair of haplotypes (0 and 1;
dotted lines) represents a copy of the haplotypes of the focal
person, while the grey bars represent the IBD segments the focal
person shares with four other individuals. In other words, the top
pair of haplotypes shows IBD segments of the focal person and
another individual, the second pair of haplotypes shows IBD
segments of the same focal person but with a second other
individual, and so on. While a focal person is provided for
purposes of illustrating the heuristic, in some embodiments the
heuristic is applied to correct phase switch errors in multiple or
all individuals simultaneously. In such embodiments, there may not
be a focal person, per se. However, to simplify illustration of
such embodiments, correcting phase switch errors may be presented
with reference to a "focal" individual. In some embodiments, the
process is implemented using one or more focal individuals. For
example, a focal person may be a new user or customer that is added
to the database of, e.g., a person genetics platform. In some
implementations, the process runs using the new user or customer as
the focal individual against the entire database.
[0155] Panels B through F represent the TPBWT's sweep along the
chromosome from left to right, with the black arrow labeled TPBWT
representing the current position. As the TPBWT sweeps along the
haplotypes identifying IBD matches it uses a heuristic to identify
and fix putative phase switch errors. In diagram A, two haplotypes
(0 and 1; dotted lines) of the focal person and the IBD segments
they share with the four other individuals in the haplotype
alignment are plotted. The focal person has two phase switch errors
(red dashed lines) that break up long IBD segments. In diagram B,
the Templated PBWT scans left to right along the chromosome,
keeping track of IBD segments shared among all pairs of
individuals. When a phase switch error is encountered in the focal
person all IBD segments shared with the focal person are fragmented
at the position of the switch error. In diagram C, the Templated
PBWT continues to scan left to right and finds another IBD segment.
If the new segment begins near the end of all the old segments but
on the complementary haplotype of the focal person, then the
Templated PBWT infers a phase switch error to have occurred. In
diagram D, since a phase switch error is inferred within the focal
person, the focal person's haplotypes are now swapped (at or near
the point of the phase switch error) so new IBD segments now merge
and extend the fragments on the complementary haplotype that were
broken up by the phase switch error. If instead the phase switch
error is inferred within one of the other individuals, then that
other individual's haplotypes are swapped, and the focal person's
haplotypes remain unswitched. In diagram E, when the arrangement of
IBD segments on the complementary haplotypes again suggests another
phase switch error has been encountered the algorithm swaps the
focal person's haplotypes again, but this time at the location of
the other phase switch error. In diagram F, the Templated PBWT
continues to the end of haplotypes after successfully identifying
phase switch errors and "stitching" IBD fragments back into correct
long IBD segments.
[0156] In some embodiments, the heuristic is applied to correct
phase switch errors when a new potential IBD segment is identified.
As described above, a potential IBD segment is identified when the
Templated PBWT reaches the rightmost end of the potential IBD
segment. For example, in diagrams C and E only a single new
potential IBD segment is identified because the TPBWT has not
reached the end of the other IBD segments, triggering their
identification as potential IBD segments and application of the
heuristic. To illustrate this point, note that in panel E the
rightmost fragment in the second from top haplotype pair has not
yet been identified since the TPBWT operation has not reached the
fragment's rightmost end. However, by panel F, the TPBWT has
scanned further right along the chromosome and identified that
fragment and applied the heuristic to it (which merged it into the
long IBD segment).
[0157] As may be appreciated by FIG. 5B, the heuristic is applied
as the Templated PWBT iterates through successive sites along all
chromosomes. As potential IBD segments are identified, the
proximity of the identified IBD segment to a prior IBD segment on
either haplotype is determined to infer whether there is a phase
switch error and the IBD segments should be `stitched` together to
form a single IBD segment.
[0158] FIG. 6B illustrates the possible scenarios considered by the
Templated PBWT for adjacent IBD segments. Diagram A shows a first
IBD segment shared by P and Q. Diagrams B-E show the various
combinations of second IBD segments between P and Q that may be
considered by the heuristic (the grey box indicates the two
potential IBD segments are within a threshold length of each
other). The second IBD segments are within a threshold number of
SNPs of the first IBD segments. In Diagram B, all IBD segments are
on the same haplotype, but are separated by a gap. This may result
from an even number of phase switch errors, causing the haplotypes
to be swapped at both ends of the gap such that the potential IBD
segments are on the same haplotype. It should also be understood
that the exact number of phase switch errors within the gap may be
unknown and is not necessary to infer which individual harbors the
phase switch error(s). For example, in Diagram B phase switch
errors may have occurred in P, Q, or P and Q, and the heuristic may
be applied as a result of two potential IBD segments being
identified within a threshold range of each other, as discussed
herein.
[0159] It should be also be understood that while the Templated
PBWT may be used to correct for short gaps, e.g., 1-3 SNPs, the gap
illustrated here may be larger, for example up to about 100 SNPs,
or about 300 SNPs, or about 500 SNPs. This may be caused by various
errors, including multiple phase switch errors within the gap, such
that the matching sites are insufficiently long to be considered
potential IBD segments. The heuristic as described herein infers
that two segments within the threshold distance are likely to be a
single segment broken up by errors, and thus merges them despite
the gap.
[0160] Diagram C illustrates the second IBD segments being on
opposite haplotypes for both P and Q, which may be the result of a
phase switch error in both individuals. In such cases, the
haplotypes may be swapped in both individuals. Diagrams D and E
illustrate either Q or P, respectively, having second IBD segments
on the opposite haplotype. In these scenarios, if the second IBD
segments are within a threshold distance of the first IBD segments
but on the opposite haplotype, a phase switch error is inferred and
the haplotypes from the second IBD segments forward may be swapped
and the first and second IBD merged. In cases D and E, the
individuals having the first and second IBD segments on opposite
haplotypes are the ones inferred to have the phase switch error,
and only those individuals' haplotypes are swapped. The swaps begin
at the points on or near where the breaks between the first and
second IBD segments occur.
[0161] As described above, the Templated PBWT handles haplotype
error (miscalls) and missing data. It is also robust to "blip"
phase switch errors in which the phase at a single site is swapped.
However, phase switch errors spaced out along the chromosome will
cause long regions of the haplotypes to be swapped and fragment IBD
segments as illustrated in FIG. 1. To handle these errors the
Templated PBWT may apply a phase correction heuristic that scans
for certain patterns of haplotype sharing to identify and correct
phase switch errors. Note that for haploid data sets such as human
male sex chromosomes this heuristic can be turned off. Large
cohorts of samples have patterns of haplotype sharing that are
highly informative regarding the location of phase switch errors.
The phase switch errors in an individual will fragment all IBD
segments shared with that individual at the position of the switch
error. Each IBD segment that spans the switch error will be broken
into two fragments at the position of the error: these fragments
will be on complementary haplotypes within the individual with the
error and yet may remain on the same haplotype within the other
individual. For some closely related pairs (parent-child) this
pattern of haplotype sharing may be the result of actual
recombination patterns, however for the majority of more distantly
related individuals the pattern can be used to identify phase
switch errors.
[0162] As the TPBWT scans left to right through the haplotype
alignment finding new IBD segments it keeps track of previously
found IBD segments shared among pairs of haplotypes in P.sub.s and
P.sub.e. When a new segment shared between two individuals P and Q
is found to be adjacent to an existing segment (either slightly
overlapping or with a small gap between them; determined by the
parameter Lm) there are a number of possible scenarios (FIG. 6C).
If the new segment is on the same haplotypes as the existing
segment, then possibly the two segments are fragments of a longer
segment broken up by error and should be merged. If the new segment
begins near the end of the existing segment and the new segment is
not on the same haplotypes as the old segment, then possibly there
was a phase switch error in both individuals. If the new segment
begins near the end of the existing segment and the new segment is
on the same haplotype as the existing segment in individual P but
on the complementary haplotypes in individual Q, then possibly
there was a phase switch error in individual Q. And of course, the
opposite pattern could exist suggesting a phase switch error in
individual P. If a phase switch error has been identified in either
individual P, Q, or both, then the TPBWT will swap the haplotypes
for the individuals containing the error (See FIG. 6B). Now the new
IBD segments merge and extend the fragments on the complementary
haplotype that were broken up by the phase switch error. When the
arrangement of IBD segments on the complementary haplotypes again
suggests another phase switch error has been encountered the
algorithm stops swapping the individual's haplotypes. This simple
heuristic continues to the end of haplotypes "stitching" short
stretches of IBD fragmented by errors back into the correct long
IBD segments.
[0163] FIG. 6C illustrates a process for using a phase switching
error correction heuristic as described herein. The process 600
begins by receiving haplotype data for a plurality of individuals.
The process 600 may be performed while scanning the haplotype data
using a method such as a Templated PBWT (e.g., during operation 566
in FIG. 5B) or may be performed as a standalone process. The
process begins by receiving phased haplotype data to be considered.
The process loops over multiple haplotype sites, considering each
one separately, but also considering adjacent and/or near sites
that contain IBD segments, particularly nearly terminated IBD
segments along with newly started IBD segments. An operation 602
sets the next haplotype site for consideration. In embodiments
where this heuristic is integrated with another analysis, e.g.,
Templated PBWT, this site incrementing operation may already have
been performed. An operation 603 determines phased haplotypes of at
least two individuals have first potential IBD segments that
terminate at a first location. In some embodiments, first potential
IBD segments may be identified after a matching subsequence having
at least Lm sites terminates. The subsequences must possess more
than a threshold length Lm to be considered possible IBD segments.
The first potential IBD segments terminate at a first location
which may be stored for later reference.
[0164] An operation 605 determines that the at least two
individuals have second IBD segments that start at a second
location within a threshold distance of where the first IBD segment
ended. The threshold distance may be as described above. In some
implementations, the distance may be either a gap or an overlap
between the first and second potential IBD segments.
[0165] An operation 607 identifies or infers which individual, from
among those having second IBD segments that starts at a location
within the threshold distance of where the first IBD segment ended,
likely has a phase switch error. The second potential IBD segments
may be between any combination of haplotypes of the at least two
individuals. See FIG. 6B. When the second potential IBD segment
begins on the opposite haplotype as the first potential IBD segment
in at least one of the individuals, a phase switch error is implied
in the individual that has the first and second IBD segments on
opposite haplotypes. In various embodiments, where two fragments of
a fragmented IBD segment occur on opposite haplotype strings of
single individual, the operation infers that that individual has
the phase switch error, and only that individual's haplotypes need
correction for the phase switch error. In cases where both
individuals sharing an IBD segment have IBD fragments on opposite
haplotype strings, the operation may infer that both individuals
have phase switch errors at the same or proximate positions. In
some cases operation 607 may be skipped. For example, as shown in
diagram B of FIG. 6B, above, the potential IBD segments to be
merged are on the same haplotype. In such embodiments it may be
difficult to determine which individual had a phase switch error
and also unnecessary to properly merge the potential IBD segments
(as swapping the haplotypes is not necessary to have the potential
IBD segments on the same haplotype).
[0166] In operation 609, based on the first potential IBD segments
and the second potential IBD segments being within the threshold
distance of each other, the first potential IBD segments and the
second potential IBD segments are merged. If the first potential
IBD segments and the second potential IBD segments are on opposite
haplotypes for any of the at least two individuals (i.e., a phase
switch error occurred for those individuals), the haplotypes may be
swapped for those individuals. The swap may occur at the location
of the phase switch error.
[0167] Operation 611 is an optional operation to determine whether
each potential IBD segment is sufficiently long and/or meets other
criteria to be considered a true IBD (e.g., a minimum length Lf).
If the criteria are met, the potential IBD segments are determined
to be actual IBD segments.
[0168] Operation 613 is an optional operation to correct for
potential genotyping errors. See e.g., the discussion of the
Templated PBWT.
[0169] In block 615 the current haplotype site is checked for
whether it is the last haplotype site. If it is the last haplotype
site, the process finishes. If it is not the last haplotype site,
the process returns to operation 602 to select the next haplotype
site and continue scanning for IBD segments. In some embodiments
where process 600 is part of another method to identify IBD
segments, e.g., a Templated PBWT, the loop may also allow for the
Templated PBWT algorithm to continue scanning the next haplotype
site.
Error Correction Using Hidden Markov Model (HMM)
[0170] Description and Application of HMM
[0171] As shown in FIG. 1, long IBD segments are likely fractured
by phase switch errors introduced by phasing techniques used to
phase the haplotype data of the individuals. FIG. 7 schematically
illustrates using a HMM to process four haplotypes of two
individuals (individual 1 and individual 2, two haplotypes for each
individual on chromosome 5) to correct phase switch errors,
"stitching" IBD segments fractured by phase switch errors. The HMI
process covers the full span of the four haplotypes from left to
right shown sequentially in the top panel, lower left panel, and
lower right panel.
[0172] FIG. 8 shows a flow diagram illustrating process 800 for
correcting phase switch errors in IBD segments using a hidden
Markov model (HMM) according to some implementations. In some
implementations, the error correction optionally is initiated or
triggered when IBD segments of the two individuals being compared
meet one or more criteria. See decision box 802. This conditional
trigger can avoid processing IBD segments that may not need
corrections. In some implementations, for example, the HMM error
correction process is triggered when the two individuals' IBD
segments include two or more IBD segments on a single chromosome.
This can avoid applying error correction when there is only a
single IBD segment on a single chromosome where no phase switch
errors have occurred. In some implementations, the criterion is met
when the two individuals' IBD segments exceed a minimum total
amount of shared IBD.
[0173] In these implementations, if the IBD segments of two
individuals do not meet the criterion or criteria, the process
ends. See box 802, "No" branch, and box 814. If the IBD segments of
the two individuals meet the criterion or criteria, process 800
proceeds to obtain an IBD state for each polymorphic site of a
series of polymorphic sites of the two individuals. See the box
802, "Yes" branch and box 804. The IBD state indicates whether
alleles of the two individuals at the polymorphic site are part of
an IBD segment, and if so, which of the two individuals' phased
haplotypes are part of the IBD segment. The series of polymorphic
sites are located in one or more pairs of chromosomes of each
individual. In some implementations, the polymorphic sites are
biallelic sites. In other implementations, more than two alleles
may be implemented at a site. In some implementations, the IBD
states indicate different conditions of zero IBD, half IBD, and
full IBD. In some implementations when the polymorphic site is a
biallelic site, the IBD states include nine different IBD states
corresponding to nine conditions of zero IBD, half IBD, and full
IBD as further described in examples hereinafter.
[0174] Process 800 then involves applying the HMM to the IBD
states. Box 806. The HMM model takes the IBD states as inputs and
uses them as observed states of the model. The HMM model also takes
as input (i) a rate of recombination based on a number of meioses
(m), (ii) at least one rate of phase switch error based on a
phasing method employed to phase the haplotypes, and, optionally,
(iii) genetic distances between consecutive sites on a chromosome.
In some implementations, genetic distances between consecutive
sites on a chromosome may be omitted. The term "model input" herein
refers to both variables and parameters. The HMM model's
transmission rates or probabilities depend on (i) and (ii), and
optionally (iii). The application of the HMM model removes likely
phase switch errors and produces error corrected IBD segments based
on a most likely sequence of hidden IBD states given the observed
IBD states. See block 808. Applying the HMM involves using
transition probabilities and emission probabilities of the HMM to
identify the most likely sequence of hidden IBD states given the
observed IBD states. In some implementations, the most likely
sequence of hidden IBD states is identified using the Viterbi
dynamic programming process.
[0175] Process 800 is implemented using a computer. It is not
practical or feasible to apply the model without a computer due to
the complexity of the model. For example, applying the HMM requires
using a 36.times.36 transmission matrix and a 36.times.36 emission
matrix for each polymorphic site, often at hundreds of thousands of
polymorphic site, to calculate a most likely sequence. It can take
many years and errors for a person to calculate just a single
Viterbi sequence.
[0176] In some implementations, the error correction process
involves only the operations illustrated in boxes 804, 806, and
808. Such implementations include: (a) for each polymorphic site in
a series of polymorphic sites of two individuals, obtaining an IBD
state that indicates whether alleles of the two individuals at the
polymorphic site are part of an IBD segment, and, if so, which of
the two individuals' phased haplotypes are part of the IBD segment,
wherein the series of polymorphic sites are comprised in or lie
along one or more pairs of chromosomes; and (b) applying a hidden
Markov model (HMM) to the IBD states to produce one or more
error-corrected IBD segments, wherein the HMM model takes as input,
in addition to the IBD states as observed IBD states, (i) a rate of
recombination based on a number of meioses, (ii) at least one rate
of phase switch error based on a phasing method employed to phase
the haplotypes, and (iii) genetic distances between consecutive
sites on a chromosome. In some implementations, genetic distances
between consecutive sites on a chromosome may be omitted.
[0177] Some implementations of the disclosure include multiple
iterations of applying the HMM to test different numbers of meioses
(m). As illustrated in FIG. 8, process 800 determines whether there
are additional values of m that need to be tested. If so, the
process loops back to box 804 to obtain IBD states and apply the
HMM using a different value of m. In some implementations, the
different numbers of meioses are in the range from 0.1 to 14
crossovers. In some implementations, the values of m are in the
range from 1 to 14. See block 810, "Yes" branch. If there are no
additional values of m to be tested, process 800 proceeds to use
the set of error corrected IBD segments having the highest
probability as a final estimate of IBD segments for the two
individuals. See decision block 810, "No" branch, and block 812.
Thereafter, process 800 ends at block 814.
[0178] In some implementations, m is fixed at 1, requiring no
multiple iterations. In other implementations, m=1 provides the set
of error corrected IBD segments with the highest probability among
multiple values of m.
[0179] FIG. 9A schematically illustrates the structure of the HMM
model. It includes a series of hidden states (illustrated as
circles on top) representing the ground-truth IBD states at a
series of polymorphic sites and a series of observed states
(illustrated as circles at the bottom) representing the observed
IBD states based on phased haplotype data of the two individuals.
The arrows in the diagram denote conditional dependencies. The
hidden states obey the Markov property, such that the hidden state
at any site depends on only the hidden state at the immediately
previous site. In other words, H.sub.l depends only on H.sub.l-1.
Moreover, the observed state at a particular site depends only on
the hidden state at the particular site. In other words, O.sub.l
depends on only H.sub.l.
[0180] In the standard type of hidden Markov model considered here,
the state space of the hidden variable is discrete. The parameters
of a HMM are of two types, transition probabilities and emission
probabilities. The transition probabilities between site l-1 and
site/determine the probability of H.sub.l given H.sub.l-1. The
emission probabilities at site/determine the probability of O.sub.l
given H.sub.l.
[0181] The probability of a series of observed states and a series
of hidden states is:
Pr(H.sub.1, H.sub.2, H.sub.3, . . . , O.sub.1, O.sub.2, O.sub.3, .
. .
)=Pr(H.sub.1)Pr(O.sub.1|H.sub.1)Pr(H.sub.2|H.sub.1)Pr(O.sub.2|H.sub.2)Pr(-
H.sub.3|H.sub.2)Pr(O.sub.3|H.sup.3) . . . (Eq. 1)
[0182] where probabilities Pr(O.sub.i|H.sub.i) are emission
probabilities/parameters, and probabilities Pr(H.sub.i|H.sub.i-1)
are the transition probabilities/parameters.
[0183] The hidden state space assumes one of N possible values,
modeled as a discrete distribution. For each of the N possible
states that a hidden variable at point l can be in, there is a
transition probability from this state to each of the N possible
states of the hidden variable at point l+1, for a total of N.sup.2
transition probabilities. Note that the set of transition
probabilities for transitions from any given state must sum to 1.
As such, the N.times.N matrix of transition probabilities is a
Markov matrix.
[0184] In addition, for each of the N possible states, there is a
set of emission probabilities governing the distribution of the
observed variable at a particular point given the state of the
hidden variable at that point. The size of this set depends on the
nature of the observed variable. For example, if the observed
variable is discrete with M possible values, governed by a discrete
distribution, there will be a total of N.times.M emission
probabilities.
[0185] In some implementations, each polymorphic site is biallelic,
and the IBD states at any site can include nine different IBD
states, indicating nine conditions of zero IBD, half IBD, and full
IBD. Considering 4 phased haplotypes for 2 individuals there are 9
different ways site l can be observed as IBD between the two
individuals. In some implementations, the IBD state at site l
notated as c.sub.*l is represented by a string of 4 integers each
corresponding to the 4 haplotypes. The first two integers refer to
the maternal and paternal haplotypes in individual 0 and the last
two integers refer to the maternal and paternal haplotypes in
individual 1. When the haplotype at site l is not IBD practitioners
represent it as a 0. Therefore, c.sub.*l=0000 indicates that the
two individuals at site l are not IBD, or zero IBD. Accordingly,
there are 4 different ways the two individuals could be half IBD:
0101 is when the individuals are IBD through their paternal
haplotypes, 1001 is when the individual 0's maternal haplotype is
IBD with individual 1's paternal haplotype, 0110 is when the
individual 0's paternal haplotype is IBD with individual 1's
maternal haplotype, and 1010 is when the individuals are IBD
through their maternal haplotypes.
[0186] Similarly there are 4 different ways the two individuals
could be full IBD: 1212, 2112, 1221, and 2121. According to the
model IBD segments follow a Markovian process in which segments
begin and end independently. This means that when modeling full IBD
practitioners must keep track of the identity of each of the two
IBD segments so one cannot simply represent full IBD as 1111.
[0187] Of course, other notations may be used to represent the nine
different IBD conditions to a similar effect.
[0188] In some implementations, the IBD states are expanded by
multiplying these different 9 conditions of IBD with four types of
phase switch errors. But if one disregards the phase switch error
types, there would be 9.times.9 transition rates between hidden
states of two consecutive sites.
[0189] In some implementations, transition rates of the HMM are
based upon a rate at which IBD segments start. In some
implementations, the rate at which IBD segments start is modeled as
a function of the number of meioses. See box 706, input (i). In
some implementations, the rate at which IBD segments start
(.alpha..sub.s) is modeled as follows.
.alpha. s = 1 1 + ( 1 0 0 . 0 .times. m ) ( Eq . 2 )
##EQU00004##
[0190] wherein m is the number of meioses, and the recombination
rate is assumed to be 1 crossover per 100 cM.
[0191] In some implementations, transition rates of hidden IBD
states are based on a rate at which IBD segments end. In some
implementations, the rate at which IBD segments end is modeled as a
function of the number of meioses. In some implementations, the
rate at which IBD segments ends (.alpha..sub.e) is modeled as
follows.
.alpha. e = m 1 0 0 . 0 ( Eq . 3 ) ##EQU00005##
[0192] wherein m is the number of meioses.
[0193] In some implementations, the IBD states include nine
different IBD states, and transition rates are based on a
transition matrix Q.sub.a in FIG. 9B. In matrix Q.sub.a, each row
includes rates for transitioning from the IBD state denoted by the
four-letter string at site l to nine IBD states at l+1.
[0194] In some implementations, the transition rates of hidden IBD
states are weighted by a probability that full IBD between the two
individuals is truly present. In some implementations, the
probability that the full IBD between the two individuals is truly
present is modeled as a logistic function of an amount of estimated
full IBD. In some implementations, the probability that full IBD
between the two individuals is truly present (.beta.) is modeled as
follows.
.beta. = 1 1 + e .eta. ( 0 . 1 2 5 - 2 ) ( Eq . 4 )
##EQU00006##
[0195] wherein i.sub.2 is the amount of estimated full IBD, and
.eta. is an empirical parameter defining the steepness of the
logistic function.
[0196] In some implementations, the transition rates of hidden IBD
states are weighted by weighting transitions into full IBD states
with .beta., and waiting transitions out of full IBD states with
1/.beta.. In some implementations, the IBD states include nine
different IBD states, and the transition rates of hidden IBD states
are based on a transition matrix as follows.
Q .beta. i j = { Q .alpha. i j .times. .beta. , if i .ltoreq. 5 and
j > 5 Q .alpha. i j .times. 1 .beta. , if i > 5 and j
.ltoreq. 5 Q .alpha. i j , otherwise ( Eq . 5 ) ##EQU00007##
[0197] In some implementations, the transition rates of hidden IBD
states are based on the at least one rate of phase switch error.
See block 706, model input (ii). In some implementations, there are
four types of phase switch errors in the two individuals: no switch
error in either individual, switch error in individual 0, switch
error in individual 1, and switch error in both individuals. The
IBD states include nine different IBD states as described herein.
The at least one rate of phase switch error includes a rate of
phase switch error for each of the two individuals, .mu..sub.1 and
.mu..sub.2, respectively. In some implementations, the phase switch
error rates for the two individuals are the same when the same
phasing method is used for both individuals. In some
implementations, the transition rates are based on the 36.times.36
transition matrix described as follows.
Q = [ no switch Q .beta. .mu. 0 Q .beta. .mu. 1 Q .beta. .mu. 0
.mu. 1 Q .beta. switch in 0 .mu. 0 Q .beta. Q .beta. .mu. 0 .mu. 1
Q .beta. .mu. 1 Q .beta. switch in 1 .mu. 1 Q .beta. .mu. 0 .mu. 1
Q .beta. Q .beta. .mu. 0 Q .beta. switch in both .mu. 0 .mu. 1 Q
.beta. .mu. 1 Q .beta. .mu. 0 Q .beta. Q .beta. ] ( Eq . 6 )
##EQU00008##
[0198] In some implementations, transition probabilities of hidden
IBD states are based upon genetic distances between consecutive
sites on a chromosome. See box 706, model input (iii). In some
implementations, transition probabilities of hidden IBD states are
obtained by exponentiating a transition matrix. In some
implementations, transition probabilities of hidden IBD states
Y.sub.l+1 given hidden IBD states Y.sub.l are modeled as:
P(Y.sub.l+1|Y.sub.l, m, .mu..sub.0, .mu..sub.1,
i.sub.2)=e.sup.Qd.sup.i (Eq. 7)
[0199] where m is the number of meioses, .mu..sub.0 is a phase
switch error rate for a first individual of the two individuals,
.mu..sub.1 is a phase switch error rate for a second individual of
the two individuals, i.sub.2 is an amount of estimated full IBD, Q
is a transition matrix described by Eq. 6, and d.sub.l is the
genetic distances between sites l and l+1.
[0200] In some implementations, the emission probabilities of the
HMM are dependent on phase switch errors. In some implementations,
the emission probabilities are defined by a uniform error term that
weights probabilities of observed IBD states based on the four
different ways the two individuals may be in phase switch
errors.
[0201] Example HMM Implementation
[0202] To help illustrate the range of implementations, the
following example description of using HMM to determine IBD
segments, correcting phase switch error is provided. Under the
model, IBD segments shared between two related individuals are
generated by passing along the four haplotypes of the two
individuals. IBD segments begin and end following a Poisson process
with rates that are determined by the number of meioses m that
occurred on the pedigree between the two individuals. Phase switch
errors occur following a Poisson process with a rate .mu.
determined by empirically testing statistical phasing methods.
[0203] Let c.sub.*={c.sub.*1, . . . , c.sub.*L} be the L sites
observed along a chromosome. Practitioners represent the different
ways site l can be observed as IBD between the two individuals as
c.sub.*l. Additionally, let {right arrow over (d)}={d.sub.1, . . .
, d.sub.L} be a vector of genetic distances where the distance
between sites l and l+1 is d.sub.l. Finally, let .epsilon. be an
error term that captures the probability that the IBD state
practitioners observed at site l is incorrect due to phasing and/or
genotyping errors. The conditional probability P(c.sub.*|m, .mu.,
{right arrow over (d)}, .epsilon.) is structured as a hidden Markov
model (HMM) with latent variables {right arrow over (Y)}={Y.sub.1,
. . . , Y.sub.L}. Here Y.sub.l represents the different ways site l
could be observed as IBD plus the different ways the two
individuals may be in a phase switch error.
[0204] State Space
[0205] When considering the 4 phased haplotypes for 2 individuals,
there are 9 different ways site l can be observed as IBD between
the two individuals. Practitioners notate the IBD state at site l
as c.sub.*l, which is represented by a string of 4 integers each
corresponding to the 4 haplotypes. The first two integers refer to
the maternal and paternal haplotypes in individual 0 and the last
two integers refer to the maternal and paternal haplotypes in
individual 1. When the haplotype at site 1 is not IBD inventors
represent it as a 0.
[0206] Therefore, c.sub.*l=0000 indicates that the two individuals
at site 1 are not IBD, or zero IBD. Accordingly, there are 4
different ways the two individuals could be half IBD: 0101 is when
the individuals are IBD through their paternal haplotypes, 1001 is
when the individual 0's maternal haplotype is IBD with individual
1's paternal haplotype, 0110 is when the individual 0's paternal
haplotype is IBD with individual 1's maternal haplotype, and 1010
is when the individuals are IBD through their maternal
haplotypes.
[0207] Similarly there are 4 different ways the two individuals
could be full IBD: 1212, 2112, 1221, and 2121. According to the
model IBD segments follow a Markovian process in which segments
begin and end independently. This means that when modeling full IBD
practitioners must keep track of the identity of each of the two
IBD segments so one cannot simply represent full IBD as 1111.
[0208] For this HMM, hidden states Y.sub.l represents the different
ways site l could be observed as IBD and also includes information
about the different ways in which the two individuals may or may
not be in a switch error. There are 4 ways in which the two
individuals may or may not be in a switch error: neither are in a
switch error, individual 0 is in a switch error, individual 1 is in
a switch error, or both individuals could be in a switch error.
Since there are 9 ways site l could be observed to be IBD and 4
ways phase switch errors could obfuscate the true IBD state, there
are a total of 36 states for the latent variables Y.sub.l.
[0209] IBD Segment Model
[0210] Practitioners model the transitions among hidden states
Y.sub.l with an instantaneous transition rate matrix. If, for a
moment, practitioners do not consider transitions in which phase
switch errors may occur and practitioners only consider transitions
among the 9 IBD states that can be observed, practitioners can
define the transition matrix Q.sub.a shown in FIG. 9A.
[0211] The matrix Q.sub.a defines the way the model moves between
zero, half, and full IBD states. As the model passes along the
chromosome as is the rate at which IBD segments begin
.alpha. s = 1 1 + ( 1 0 0 . 0 .times. m ) ( Eq . 2 )
##EQU00009##
[0212] which depends on the number of meioses m on the pedigree
between the two individuals being compared. Similarly, the rate at
which IBD segments end is
.alpha. e = m 1 0 0 . 0 ( Eq . 3 ) ##EQU00010##
[0213] Another way to interpret .alpha..sub.e is that it represents
the length of the IBD segments shared between individuals 0 and 1.
Likewise .alpha..sub.s represents the length of segments with no
IBD shared between the two individuals.
[0214] Full IBD Error Model
[0215] Phase switch errors break up half IBD segments into shorter
adjacent half IBD segments on different haplotypes. Since the
templated PBWTs procedure described above imperfectly estimates the
start and end positions of IBD segments, when the lengths of the
two adjacent half IBD segments are over estimated this can result
in a short region of erroneous full IBD. Since full IBD is not
expected for most pairs of relatives we model the error in the
observed proportion of full IBD using a simple logistic function.
Practitioners indicate the probability of full IBD truly being
present as .beta., which is defined as
.beta. = 1 1 + e .eta. ( 0 . 1 2 5 - 2 ) ( Eq . 4 )
##EQU00011##
[0216] Here i.sub.2 is the amount of full IBD estimated by the
templated PBWTs. When i.sub.2.gtoreq.25% (the amount expected for
full siblings) then the probability of there truly being full IBD
is .beta.=1. As the amount of full IBD estimated by the templated
PBWTs i.sub.2 approaches zero, .beta. also approaches zero. The
steepness of the logistic curve is defined by .eta.. Simulation
tests showed .eta.=192 was sufficient to reduce the error in full
IBD introduced by the templated PBWTs.
[0217] We incorporate .beta. into the HMM by weighting the
transitions into full IBD states with .beta. and by weighting the
transitions out of full IBD states 1/.beta.. More explicitly,
practitioners define Q.sub..beta.as:
Q .beta. i j = { Q .alpha. i j .times. .beta. , if i .ltoreq. 5 and
j > 5 Q .alpha. i j .times. 1 .beta. , if i > 5 and j
.ltoreq. 5 Q .alpha. i j , otherwise ( Eq . 5 ) ##EQU00012##
[0218] When .beta.=1:0 note that Q.sub..beta. is equal to
Q.sub..alpha..
[0219] Phase Switch Error Model
[0220] Finally practitioners are ready to incorporate phase switch
errors into the HMM. Practitioners now expand the 9 state
Q.sub..beta. matrix into the full 36 states that are possible among
the hidden states Y.sub.l.
Q = [ no switch Q .beta. .mu. 0 Q .beta. .mu. 1 Q .beta. .mu. 0
.mu. 1 Q .beta. switch in 0 .mu. 0 Q .beta. Q .beta. .mu. 0 .mu. 1
Q .beta. .mu. 1 Q .beta. switch in 1 .mu. 1 Q .beta. .mu. 0 .mu. 1
Q .beta. Q .beta. .mu. 0 Q .beta. switch in both .mu. 0 .mu. 1 Q
.beta. .mu. 1 Q .beta. .mu. 0 Q .beta. Q .beta. ] ( Eq . 6 )
##EQU00013##
[0221] Here practitioners incorporate two switch error rates
.mu..sub.0 and .mu..sub.1 for individual 0 and individual 1,
respectively. If both individuals were phased using the same method
then one may set .mu..sub.0=.mu..sub.1. Given the genetic distance
d.sub.l between sites l and l+1, the number of meioses m, the
switch error rates .mu..sub.0 and .mu..sub.1, and the amount of
full IBD observed i.sub.2, practitioners can find the probability
of transitioning between the states Y.sub.l and Y.sub.l+1 by
exponentiating matrix Q:
P(Y.sub.l+1Y.sub.l, m, .mu..sub.0, .mu..sub.1,
i.sub.2)=e.sup.Qd.sup.l (Eq. 7)
[0222] Probability of Observed States
[0223] The probability of observing the IBD state c.sub.*l given
the possible phase switch errors at site l is P(c.sub.*l|Y.sub.l),
which are emission probabilities. Practitioners define these
emission probabilities using a simple uniform error term
.epsilon..gtoreq.1 that weights observed states based upon the 4
different ways the two individuals may be in phase switch errors.
For example, when there are no phase switch errors in either
individual, practitioners weight the emission probability so that
it is more probable that the observed IBD state is the "true"
hidden state:
P(c.sub.*l|Y.sub.l)=P(0101|0101; no switch
errors)=.epsilon..times.P(0101|1001; no switch errors) (Eq. 8)
[0224] Conversely, when there is a switch error in one or both of
the two individuals practitioners do not expect the "true" hidden
IBD state to be the same as the observed IBD state. For example, if
there is a switch error in individual 0 then if the "true" IBD
state is 1001 practitioners would most probably observe the state
0101:
P ( c * l Y l ) = P ( 0101 1001 ; switch error in ind 0 ) = .times.
P ( 0101 0101 ; switch error in ind 0 ) ( Eq . 9 ) ##EQU00014##
[0225] Similarly if there were switch errors in both individuals,
when the "true" IBD state is 2121 practitioners would observe 1212
with the highest probability. As .di-elect cons..fwdarw..infin.,
the phase switch errors will entirely determine what IBD state can
be observed. If .di-elect cons.=1, all IBD states can be observed
with equal probability regardless of the hidden state Y.sub.l.
[0226] Integrating the Templated PBWT and the HMM
[0227] An implementation described here is named Phased IBD. It is
used in the experiments described hereinafter. It has two stages:
First the templated PBWT and then the phase-correcting HMM. The
templated PBWT stage generates the IBD segments among all
haplotypes very quickly and efficiently. However, if run on many
pairs (e.g., thousands or millions pairs) of individuals, the
second stage of the algorithm (the HMM) would be prohibitively slow
for large datasets. In those datasets, though, most individuals
will not share any IBD and so they do not require any phase switch
error correction. Moreover, the vast majority of related
individuals will only be distantly related and share few IBD
segments. If they share at most one IBD segment per chromosome then
phase switch errors have not broken up their observed IBD segments
and so the HMM does not apply. The HMM, the slow stage of the
2-part algorithm, is thus only applied to the small number of
individuals within the dataset that are closely related.
Practitioners require a pair of individuals to have at least 2
observed IBD segments on a single chromosome before running them
through the phase-correcting HMM, though additionally we can
require a minimum total amount of shared IBD (in cM) to increase
the speed of the entire algorithm.
Processing IBD Segments
[0228] Identified IBD segments can be used for a wide range of
purposes. For instance, the amount (length and number) of IBD
sharing depends on the familial relationships between the tested
individuals. Therefore, one application of IBD segment detection is
to quantify relatedness. For example, methods for using IBD
segments to quantify relatedness are described in U.S. Pat. No.
8,463,554, issued Jul. 11, 2013, which is incorporated by reference
in its entirety for all purposes.
[0229] In some implementations, the number of shared IBD segments
and the amount of DNA shared by two users are computed based on the
IBD segments obtained as described above. In some implementations,
the longest IBD segment is determined. In some implementations, the
amount of DNA shared includes the sum of the lengths of IBD regions
and/or percentage of DNA shared. The sum is referred to as IBDhalf
or half IBD because the individuals share DNA identical by descent
for at least one of the homologous chromosomes. The predicted
relationship between the users, the range of possible
relationships, or both, is determined using the IBDhalf and number
of segments, based on the distribution pattern of IBDhalf and
shared segments for different types of relationships. For example,
in a first degree parent/child relationship, the individuals have
IBDhalf that is 100% the total length of all the autosomal
chromosomes and 22 shared autosomal chromosome segments; in a
second degree grandparent/grandchild relationship, the individuals
have IBDhalf that is approximately half the total length of all the
autosomal chromosomes and many more shared segments; in each
subsequent degree of relationship, the percentage of IBD half of
the total length is about 50% of the previous degree. Also, for
more distant relationships, in each subsequent degree of
relationship, the number of shared segments is approximately half
of the previous number.
[0230] There is a statistical range of possible relationships for
the same IBDhalf and shared segment number. In some
implementations, the distribution patterns are determined
empirically based on survey of real populations. Different
population groups may exhibit different distribution patterns. For
example, the level of homozygosity within endogamous populations is
found to be higher than in populations receiving gene flow from
other groups. In some implementations, the bounds of particular
relationships are estimated using simulations of IBD using
generated family trees. Based at least in part on the distribution
patterns, the IBDhalf, and shared number of segments, the degree of
relationship between two individuals can be estimated.
[0231] IBD segments can also be used determine ethnicity or
ancestry. See, e.g., U.S. patent application Ser. No. 15/664,619,
filed Jul. 31, 2017, which is incorporated by reference in its
entirety for all purposes.
[0232] Moreover, IBD can be used to perform genotype imputation.
Genotype imputation refers to the statistical inference of genotype
information not directed assayed. This is especially helpful
because many individuals only have sparsely assayed genotype data,
usually targeting a limited number of genetic markers in the
genome. If IBD segments are determined between two individuals, it
can be inferred that the genotype of the two individuals are the
same in the IBD segments. Thus the known genotype information of an
IBD segment of one of the two individuals can be "imputed" into
that of the other individual. This further allows association study
between phenotypes and genotypes even using individuals that have
only the phenotype data collected but not the genotype data
assayed. See, e.g., U.S. patent application Ser. No. 15/256,388,
filed Sep. 2, 2016, which is incorporated by reference in its
entirety for all purposes.
Apparatus and Systems
[0233] FIG. 10 is a functional diagram illustrating a programmed
computer system for performing the pipelined ancestry prediction
process in accordance with some implementations. Computer system
100, which includes various subsystems as described below, includes
at least one microprocessor subsystem (also referred to as a
processor or a central processing unit (CPU)) 102. For example,
processor 102 can be implemented by a single-chip processor or by
multiple processors. In some implementations, processor 102 is a
general purpose digital processor that controls the operation of
the computer system 100. Using instructions retrieved from memory
110, the processor 102 controls the reception and manipulation of
input data, and the output and display of data on output devices
(e.g., display 118). In some implementations, processor 102
includes and/or is used to provide phasing, genotype error
correction, and/or phasing error correction, etc. as described
herein.
[0234] Processor 102 is coupled bi-directionally with memory 110,
which can include a first primary storage, typically a random
access memory (RAM), and a second primary storage area, typically a
read-only memory (ROM). As is well known in the art, primary
storage can be used as a general storage area and as scratch-pad
memory, and can also be used to store input data and processed
data. Primary storage can also store programming instructions and
data, in the form of data objects and text objects, in addition to
other data and instructions for processes operating on processor
102. Also as is well known in the art, primary storage typically
includes basic operating instructions, program code, data, and
objects used by the processor 102 to perform its functions (e.g.,
programmed instructions). For example, memory 110 can include any
suitable computer-readable storage media, described below,
depending on whether, for example, data access needs to be
bi-directional or uni-directional. For example, processor 102 can
also directly and very rapidly retrieve and store frequently needed
data in a cache memory (not shown).
[0235] A removable mass storage device 112 provides additional data
storage capacity for the computer system 100, and is coupled either
bi-directionally (read/write) or uni-directionally (read only) to
processor 102. For example, storage 112 can also include
computer-readable media such as magnetic tape, flash memory,
PC-CARDS, portable mass storage devices, holographic storage
devices, and other storage devices. A fixed mass storage 120 can
also, for example, provide additional data storage capacity. The
most common example of mass storage 120 is a hard disk drive. Mass
storage 112, 120 generally store additional programming
instructions, data, and the like that typically are not in active
use by the processor 102. It will be appreciated that the
information retained within mass storage 112 and 120 can be
incorporated, if needed, in standard fashion as part of memory 110
(e.g., RAM) as virtual memory.
[0236] In addition to providing processor 102 access to storage
subsystems, bus 114 can also be used to provide access to other
subsystems and devices. As shown, these can include a display
monitor 118, a network interface 116, a keyboard 104, and a
pointing device 106, as well as an auxiliary input/output device
interface, a sound card, speakers, and other subsystems as needed.
For example, the pointing device 106 can be a mouse, stylus, track
ball, or tablet, and is useful for interacting with a graphical
user interface.
[0237] The network interface 116 allows processor 102 to be coupled
to another computer, computer network, or telecommunications
network using a network connection as shown. For example, through
the network interface 116, the processor 102 can receive
information (e.g., data objects or program instructions) from
another network or output information to another network in the
course of performing method/process steps. Information, often
represented as a sequence of instructions to be executed on a
processor, can be received from and outputted to another network.
An interface card or similar device and appropriate software
implemented by (e.g., executed/performed on) processor 102 can be
used to connect the computer system 100 to an external network and
transfer data according to standard protocols. For example, various
process implementations disclosed herein can be executed on
processor 102, or can be performed across a network such as the
Internet, intranet networks, or local area networks, in conjunction
with a remote processor that shares a portion of the processing.
Additional mass storage devices (not shown) can also be connected
to processor 102 through network interface 116.
[0238] An auxiliary I/O device interface (not shown) can be used in
conjunction with computer system 100. The auxiliary I/O device
interface can include general and customized interfaces that allow
the processor 102 to send and, more typically, receive data from
other devices such as microphones, touch-sensitive displays,
transducer card readers, tape readers, voice or handwriting
recognizers, biometrics readers, cameras, portable mass storage
devices, and other computers.
[0239] In addition, various implementations disclosed herein
further relate to computer storage products with a computer
readable medium that includes program code for performing various
computer-implemented operations. The computer-readable medium is
any data storage device that can store data which can thereafter be
read by a computer system. Examples of computer-readable media
include, but are not limited to, all the media mentioned above:
magnetic media such as hard disks, floppy disks, and magnetic tape;
optical media such as CD-ROM disks; magneto-optical media such as
optical disks; and specially configured hardware devices such as
application-specific integrated circuits (ASICs), programmable
logic devices (PLDs), and ROM and RAM devices. Examples of program
code include both machine code, as produced, for example, by a
compiler, or files containing higher level code (e.g., script) that
can be executed using an interpreter.
[0240] The computer system shown in FIG. 10 is but an example of a
computer system suitable for use with the various implementations
disclosed herein. Other computer systems suitable for such use can
include additional or fewer subsystems. In addition, bus 114 is
illustrative of any interconnection scheme serving to link the
subsystems. Other computer architectures having different
configurations of subsystems can also be utilized.
[0241] FIG. 11 is a block diagram illustrating an implementation of
an IBD-based personal genomics services system that provides
services based on IBD information, which include but are not
limited to relatedness estimation, relative detection, ancestry
determination, and genotype-phenotype association. In this example,
a user uses a client device 1102 to communicate with an IBD-based
personal genomics services system 1106 via a network 1104. Examples
of device 1102 include a laptop computer, a desktop computer, a
smart phone, a mobile device, a tablet device or any other
computing device. IBD-based personal genomics services system 1106
is used to perform a pipelined process to predict ancestry based on
a user's IBD information. IBD-based personal genomics services
system 1106 can be implemented on a networked platform (e.g., a
server or cloud-based platform, a peer-to-peer platform, etc.) that
supports various applications. For example, implementations of the
platform perform ancestry prediction and provide users with access
(e.g., via appropriate user interfaces) to their personal genetic
information (e.g., genetic sequence information and/or genotype
information obtained by assaying genetic materials such as blood or
saliva samples) and predicted ancestry information. In some
implementations, the platform also allows users to connect with
each other and share information. Device 110 can be used to
implement 1102 or 1106.
[0242] In some implementations, DNA samples (e.g., saliva, blood,
etc.) are collected from genotyped individuals and analyzed using
DNA microarray or other appropriate techniques. The genotype
information is obtained (e.g., from genotyping chips directly or
from genotyping services that provide assayed results) and stored
in database 1108 and is used by system 1106 to make ancestry
predictions. Reference data, including genotype data of reference
individuals, simulated data (e.g., results of machine-based
processes that simulate biological processes such as recombination
of parents' DNA), pre-computed data (e.g., a precomputed reference
haplotype data used in phasing and model training) and the like can
also be stored in database 1108 or any other appropriate storage
unit.
Experimental
[0243] This experiment compares a method according to some
implementations as described above to other computer implemented
methods known in the art. All of these methods are
computer-implemented. IBD accuracies and computer performances are
compared among the methods.
[0244] The method according to some implementations is labeled as
Phased IBD. It includes techniques as described in the templated
PBWT and the HMM examples above.
[0245] A method described by Durbin is labeled as PBWT. See, R.
Durbin. Efficient haplotype matching and storage using the
positional Burrows-Wheeler transform (PBWT). Bioinformatics, 30(9):
1266-1272, 2014.
[0246] The method described by Naseri is labeled as RaPID. See, A.
Naseri, X. Liu, S. Zhang, and D. Zhi. Ultra-fast identity by
descent detection in biobank-xcale cohorts using positional
Burrows-Wheeler transform. bioRxiv, page 103325, 2017.
[0247] The method described by Browning is labeled as Refined IBD.
See, B. L. Browning and S. R. Browning. Improving the accuracy and
efficiency of identity-by-descent detection in population data.
Genetics, 194(2):459-471, 2013.
[0248] A method described by Zhou is labeled hap-IBD. See, Y. Zhou,
S. R. Browning, and B. L. Browning. A fast and simple method for
detecting identity by descent segments in large-scale data.
BioRxiv, 2019.
[0249] A method described by Shemirani is labeled iLASH. See, R.
Shemirani, G. M. Belbin, C. L. Avery, E. E. Kenny, C. R. Gignouz,
and J. L. Ambite. Rapid detection of identity-by-descent tracts for
mega-scale datasets. BioRxiv, page 749507, 2019.
[0250] FIG. 12 shows results comparing the speed of different IBD
inference methods. IBD segments were computed for 50781 SNPs from
human chromosome 1. The x-axis shows the number of haplotypes and
the y-axis shows the time in seconds to infer IBD. Phased IBD used
a minimum IBD segment length of 200 sites and 1.5 cM. Durbin's PBWT
used a 200 site minimum. RaPID (version 1.2.3) used 10 runs, 2
successes, window size of 35, and minimum length of 1.5 cM. Refined
IBD used a minimum length of 1.5 cM and all default parameter value
settings. The results of FIG. 11 show that Phased IBD and PBWT are
the fastest among the four methods and similar to each other. RaPID
is the slowest.
[0251] As explained above and illustrated below, Phased IBD can
correct various errors (including genotyping errors and phase
switch errors) that cannot be addressed by PBWT, it is noteworthy
that Phased IBD achieves similar computational speed as PBWT.
[0252] On the other hand, although RaPID and Refined IBD can
correct errors, albeit to a lesser extent than Phased IBD as shown
in FIGS. 14 and 15, they require significantly longer computer run
time.
[0253] FIGS. 13-16 compare the IBD estimate errors (or the opposite
of accuracy) of various methods. FIG. 13 compares the absolute
error in number of IBD segments between the Templated PBWT method
(x-axis) and Phased IBD (y-axis) that includes both Templated PBWT
component and the HMM component. The results of FIG. 13 show that
the HMM process greatly reduce the error rates, reducing maximum
error segments by about three folds from about 300 to 100.
[0254] FIG. 14 shows that Phased IBD is more accurate than PBWT.
Each axis shows the proportion of the genome with incorrect IBD
estimates. PBWT (x-axis) is sensitive to both genotyping and
phasing errors compared to Phased IBD (y-axis).
[0255] FIG. 15 shows that Phased IBD is more accurate than Refined
IBD. Refined IBD (x-axis) has higher error than Phased IBD
(y-axis), although Refined IBD outperforms both PBWT and RaPID.
[0256] FIG. 16 shows that Phased IBD is more accurate than RaPID
(version 1.2.3). RaPID (x-axis), like PBWT, has high error in
relationship types with long IBD segments like parent-child and
siblings. Unlike other methods, Phased IBD "stitches" together long
IBD segments highly fragmented by phasing and genotyping
errors.
Simulation Study
[0257] A simulation study was performed to assess the accuracy of
IBD inference methods. Simulated haplotype data sets in which the
IBD segments shared were perfectly known were created and then
modified to introduce realistic levels of genotyping and phasing
errors to test the impact of the errors on IBD segment
determinations. Haplotypes inherited with recombination over 400
replicated pedigrees were simulated. Each pedigree had three
generations and included at least one pair of each type of close
relatives that were used for the simulation study: parent-child,
grandparent-grandchild, aunt-niece, first cousins, and siblings.
Each pedigree founder consisted of a randomly sampled and unrelated
research consented 23andMe customer. Recombination was simulated
using a Poisson model with a rate of 1 expected crossover per 100
cM. This resulted in simulated haplotypes for 2000 closely related
pairs of individuals with perfectly known IBD segments, 400 pairs
of each relationship type: parent-child, grandparent-grandchild,
aunt-niece, first cousins, and siblings.
[0258] Genotyping errors were introduced into the simulated data
set using a simple model. At each position along the simulated
chromosomes an error in the genotype call was introduced with a
probability of 0.001. When a site was selected for an error, half
of the genotype call would be flipped with equal probability (e.g.,
a 0/0 genotype would be converted to a 1/0 or a 0/1 with equal
probability).
[0259] Statistical phasing errors were also introduced into the
simulated haplotype datasets. All of the simulated haplotypes were
converted into their respective diploid genotypes and then the
statistical haplotype phasing method Eagle2 was used. For the
phasing reference panel a phasing panel that included about 200000
non-Europeans and about 300000 Europeans was used.
[0260] The various methods used to analyze the simulated data had
the following parameters:
TABLE-US-00002 TABLE 1 Algorithm parameter values used for the IBD
inference methods during analysis of simulated data. Software
Parameters TPBWT Default templates, L_m = 200, L_f = 3.0, phase
switch error correction heuristic on PBWT -long Within 200 Hap-IBD
v1.0 Default options, except nthreads = 1 length = 3.0 RaPID v1.7
-w 3 -r 10 -s 2 -d 3.0 Refined IBD Default options, except nthreads
= 1 length = 3.0 v16May19 iLASH Perm_count 12 shingle_size 20
shingle_overlap 0 bucket_count 4 max_thread 1 match_threshold 0.99
interest_threshold 0.70 min_length 3.0 auto_slice 1 cm_overlap
1.4
[0261] FIG. 17 shows that Templated PBWT had less error in the
estimated number of IBD segments shared between relatives than all
other methods analyzed. The y-axis represents the number of
erroneous IBD segments estimated for a simulated pair of relatives.
Error was highest in closely related pairs that shared long IBD
segments, particularly parent-child and siblings.
[0262] FIG. 18 shows that Templated PBWT had less error in the
estimated percentage of the genome that is IBD in simulated
relatives than other methods. PBWT had less than error than other
methods except Templated PBWT, while hap-IBD and Refined IBD had
the largest error. Error was higher in simulated pairs that shared
long IBD segments, such as parent-child, compared to more distance
relatives pairs such as first cousins. Compared to Templated PBWT,
the other methods were more sensitive to phasing and genotyping
errors in estimated IBD segments.
[0263] FIG. 19 shows false negative (charts 1901 and 1905) and
false positive rates (charts 1903a-b and 1907a-b) of inferring IBD
by various methods. Rates were calculated for bins of IBD segment
lengths. False negative rate by segment is the proportion of true
segments in a size bin that do not overlap any segment compared to
the total number of true segments in the size bin. False negative
rate by segment coverage is the proportion of the length of true
segments in a size bin not covered by any estimated segment
compared to the total length of true segments in the size bin.
False positive rate by segment is the proportion of estimated
segments in a size bin that do not overlap any true segment
compared to the total number of estimated segments in the size bin.
False positive rate by segment coverage is the proportion of the
length of estimated segments in a size bin not covered by any true
segment compared to the total length of estimated segments in the
size bin. Plots 1903b and 1907b present the false positive rate
with a smaller y-axis scale than plots 1903a and 1907a,
respectively. For IBD segments .gtoreq.4 cM all methods had low
false positive rates. For IBD segments greater than .gtoreq.6 cM
the Templated PBWT outperformed all other methods.
[0264] FIG. 20A shows IBD computation runtimes for various methods.
All methods were run using 1 CPU core. Templated PBWT was faster
than all other methods except Durbin's PBWT. The relative time
shows the runtime to compute IBD for each haplotype in sample sizes
of 400 to 20000 haplotypes relative to the time needed to compute
IBD for each haplotype in a sample size of 400. A slope near zero
indicates linear time complexity, while a positive slope indicates
super-linear time complexity. Templated PBWT shows a near linear
time complexity. FIG. 20B provides additional compute times for
parallelized IBD analyses with large sample sizes. Times are shown
for in-sample IBD computes on 1 million individuals, out-of-sample
IBD computes on 10k individuals against 1 million, and
out-of-sample IBD computes on 10k individuals against 10 million.
The first two rows show the compute times measured when IBD was
estimated over 42927 sites of human chromosome 1. The last three
rows show those compute times extrapolated to 23 chromosomes with a
total of 600k sites. The last row additionally extrapolates the
time for an out-of-sample analysis on 1 million to 10 million
individuals. CPU time is the sum of the computation time for all
compute cores. Wall clock time is the "real" time that the entire
analysis took to run.
Haplotype Sharing in Mexico
[0265] To demonstrate the utility of the IBD estimates made using
the Templated PBWT and the 23andMe database a brief case study was
performed to examine the geographic patterns of haplotype sharing
within Mexico. 9517 research consented 23andMe customers who self
reported that all 4 of their grandparents were from the same
Mexican state were identified. Each customer was genotyped on
either the 23andMe v4 or v5 bead chip genotyping platform. SNPs
with <85% genotyping rate, SNPs with MAF<0.001, SNPs with low
trio concordance (effect <0.6 and p-value <1e-20), and SNPs
with allele counts of 0 within the samples selected for the phasing
reference panel were removed. After this quality control filtering
the v4 platform had 453065 SNPs and v5 platform had 544042 SNPs.
Haplotypes were phased using Eagle2 as described in Loh et. al.,
Reference-based phasing using the haplotype reference consortium
panel. Nature genetics, 48(11):1443, 2016. Individuals on the v4
platform were phased with a reference panel containing 691759
samples. Individuals on the v5 platform were phased with a
reference panel containing 286305 samples.
[0266] IBD sharing among the 9517 individuals was computed using
the Templated PBWT with the parameters described in Table 1. IBD
estimates among individuals on the same genotyping platform were
made using the in-sample method described above, and estimates made
among individuals on different platforms was made using the
out-of-sample approach described above over the intersection of
platform SNPs (only the SNPs present in both the v4 and v5
genotyping platforms). Hierarchical clustering of the mean pairwise
IBD haplotype sharing across Mexican states was performed using
Ward's method (Ward Jr 1963) in R. To remove close relatives we
excluded any pair of individuals that shared more than 20 cM.
Geographic maps of the mean pairwise IBD shared across Mexican
states were made using the R packages mxmaps, ggplot2, and viridis
(Valle-Jones 2019; Wickham 2016; Gamier 2018).
[0267] FIGS. 21 and 22 show IBD haplotype sharing across Mexican
states as determined by a Templated PBWT method. Hierarchal
clustering of IBD sharing across Mexican states identified
geographic clusters with elevated levels of haplotype sharing, as
shown in FIG. 21. There were two large clusters: one cluster in the
Yucatan peninsula and the southern Mexican states, and another
cluster representing Mexico City and the central and northern
states. The clusters were further subdivided into individual
states.
[0268] Mean pairwise IBD haplotype sharing was highest within
states and among geographically neighboring states, as shown in
FIG. 22. For example, mean IBD shared among individuals with all 4
grandparents from Nuevo Leon was over 12 cM, and the mean pairwise
IBD shared between individuals with all 4 grandparents from Nuevo
Leon and individuals with all 4 grandparents from neighboring
Coahuila and Tamaulipas was over 10 cM. In contrast, mean pairwise
sharing between individuals with all 4 grandparents from Nuevo Leon
and individuals with all 4 grandparents from Yucatan was less than
6 cM. Similar geographically stratified IBD sharing was found
throughout Mexico, as shown in FIG. 22.
Conclusion
[0269] In the description above, for purposes of explanation only,
specific nomenclature is set forth to provide a thorough
understanding of the present disclosure. However, it will be
apparent to one skilled in the art that these specific details are
not required to practice the teachings of the present
disclosure.
[0270] The language used to disclose various embodiments describes,
but should not limit, the scope of the claims. For example, in the
previous description, for purposes of clarity and conciseness of
the description, not all of the numerous components shown in the
figures are described. The numerous components are shown in the
drawings to provide a person of ordinary skill in the art a
thorough, enabling disclosure of the present specification. The
operation of many of the components would be understood and
apparent to one skilled in the art. Similarly, the reader is to
understand that the specific ordering and combination of process
actions described is merely illustrative, and the disclosure may be
performed using different or additional process actions, or a
different combination of process actions.
[0271] Each of the additional features and teachings disclosed
herein can be utilized separately or in conjunction with other
features and teachings for protective coverings. Representative
examples using many of these additional features and teachings,
both separately and in combination, are described in further detail
with reference to the attached drawings. This detailed description
is merely intended for illustration purposes to teach a person of
skill in the art further details for practicing preferred aspects
of the present teachings and is not intended to limit the scope of
the claims. Therefore, combinations of features disclosed in the
detailed description may not be necessary to practice the teachings
in the broadest sense, and are instead taught merely to describe
particularly representative examples of the present disclosure.
Additionally and obviously, features may be added or subtracted as
desired without departing from the broader spirit and scope of the
disclosure. Accordingly, the disclosure is not to be restricted
except in light of the attached claims and their equivalents.
[0272] Moreover, the various features of the representative
examples and the dependent claims may be combined in ways that are
not specifically and explicitly enumerated in order to provide
additional useful embodiments of the present teachings. It is also
expressly noted that all value ranges or indications of groups of
entities disclose every possible intermediate value or intermediate
entity for the purpose of original disclosure, as well as for the
purpose of restricting the claimed subject matter. It is also
expressly noted that the dimensions and the shapes of the
components shown in the figures are designed to help to understand
how the present teachings are practiced, but not intended to limit
the dimensions and the shapes shown in the examples.
[0273] None of the pending claims includes limitations presented in
"means plus function" or "step plus function" form. (See, 35 USC
.sctn. 112(f)). It is Applicant's intent that none of the claim
limitations be interpreted under or in accordance with 35 U.S.C.
.sctn. 112(f).
* * * * *