U.S. patent application number 17/337186 was filed with the patent office on 2021-06-02 and published on 2021-10-14 for assay methods and compositions for detecting contamination of nucleic acid identifiers.
The applicant listed for this patent is Agilent Technologies, Inc. Invention is credited to Paige Anderson, Javelin Chi, Henrik Johansson, Katie Leigh Zobeck.
Application Number | 20210317442 17/337186 |
Document ID | / |
Family ID | 1000005681554 |
Filed Date | 2021-06-02 |
United States Patent Application | 20210317442 |
Kind Code | A1 |
Zobeck; Katie Leigh; et al. | October 14, 2021 |
ASSAY METHODS AND COMPOSITIONS FOR DETECTING CONTAMINATION OF
NUCLEIC ACID IDENTIFIERS
Abstract
The present invention relates to nucleic acid samples for
massively parallel sequencing. More particularly, the present
invention relates to assay methods, compositions and kits for
detecting contamination of nucleic acid identifiers such as sample
barcodes.
Inventors: | Zobeck; Katie Leigh (Campbell, CA); Anderson; Paige (Belmont, CA); Chi; Javelin (Sunnyvale, CA); Johansson; Henrik (Mountain View, CA) |
Applicant: |
Name | City | State | Country | Type |
Agilent Technologies, Inc. | Santa Clara | CA | US | |
Family ID: |
1000005681554 |
Appl. No.: |
17/337186 |
Filed: |
June 2, 2021 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
16792813 | Feb 17, 2020 | |
17337186 | | |
15645085 | Jul 10, 2017 | 10633651 |
16792813 | | |
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
C12Q 1/6806 20130101;
A63F 13/55 20140902; C12N 15/1065 20130101; C12Q 1/6848 20130101;
C12Q 1/689 20130101; C12Q 1/6874 20130101 |
International
Class: |
C12N 15/10 20060101
C12N015/10; C12Q 1/6848 20060101 C12Q001/6848; C12Q 1/6874 20060101
C12Q001/6874; C12Q 1/6806 20060101 C12Q001/6806; C12Q 1/689
20060101 C12Q001/689; A63F 13/55 20060101 A63F013/55 |
Claims
1. A kit for an assay for a set of oligonucleotide samples, the kit
comprising: a set of oligonucleotide samples comprising
oligonucleotides, each oligonucleotide having a 5' constant region,
a sample identifier, and a 3' constant region, wherein each sample
identifier is unique within the set; and a set of assay primers
comprising a priming portion and an assay identifier, wherein the
priming portion is the same as or complementary to one of the
constant regions of the oligonucleotides, wherein each assay
identifier is unique within the set.
2. The kit of claim 1, wherein each of the oligonucleotide samples
of the set is in a separate vessel, and each vessel comprises only
one sample identifier unless one or more of the samples is
contaminated.
3. The kit of claim 1, wherein the set of oligonucleotide samples
comprises at least 8 samples.
4. The kit of claim 3, wherein the set of assay primers comprises
at least 8 assay primers.
5. The kit of claim 1, wherein the set of oligonucleotide samples
comprises at least 32 samples.
6. The kit of claim 5, wherein the set of assay primers comprises
at least 32 assay primers.
7. The kit of claim 1, wherein the set of oligonucleotide samples
comprises at least 96 samples.
8. The kit of claim 7, wherein the set of assay primers comprises
at least 96 assay primers.
9. The kit of claim 1, wherein the assay primers further comprise a
5' constant region comprising a standard 5' amplification region
for a sequencing platform and a sequencing priming region.
10. The kit of claim 9, wherein the standard 5' amplification
region comprises a P5 sequence or a P7 sequence.
11. The kit of claim 1, wherein the 5' constant region of the
oligonucleotides comprises a sequencing priming region.
12. The kit of claim 1, wherein the 3' constant region of the
oligonucleotides comprises a standard 3' amplification region for a
sequencing platform.
13. The kit of claim 1, wherein the assay identifies contamination
in the set of oligonucleotide samples.
14. A kit for an assay for a set of oligonucleotide samples
comprising oligonucleotides, each oligonucleotide having a 5'
constant region, a sample identifier, and a 3' constant region,
wherein each sample identifier is unique within the set, the kit
comprising: an assay primer comprising a priming portion and an
assay identifier, wherein the priming portion is the same as or
complementary to one of the constant regions of the
oligonucleotides, wherein: each assay identifier is unique within
the set, and the set comprises at least 8 assay primers in separate
vessels.
15. The kit of claim 14, wherein the set of assay primers comprises
at least 16 assay primers in separate vessels.
16. The kit of claim 14, wherein the set of assay primers comprises
at least 32 assay primers in separate vessels.
17. The kit of claim 14, wherein the set of assay primers comprises
at least 48 primers in separate vessels.
18. The kit of claim 14, wherein the set of assay primers comprises
at least 96 primers in separate vessels.
19. The kit of claim 14, wherein the assay primers further comprise
a 5' constant region comprising a standard 5' amplification region
for a sequencing platform and a sequencing priming region.
20. The kit of claim 14, wherein the assay identifies contamination
in sets of oligonucleotides comprising sample identifiers.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent application
Ser. No. 16/792,813, filed on Feb. 17, 2020, which is a divisional
of U.S. patent application Ser. No. 15/645,085, filed on Jul. 10,
2017, now U.S. Pat. No. 10,633,651, the contents of all of which
are fully incorporated herein by reference.
SEQUENCE LISTING
[0002] This instant application contains a Sequence Listing which
has been submitted in ASCII format via EFS-Web and is hereby
incorporated by reference in its entirety. Said ASCII copy, created
on Jun. 28, 2021, is named 20170066-07_Sequence_Listing.txt, and is
1,914 bytes in size.
FIELD OF THE INVENTION
[0003] The present invention relates to the field of molecular
biology. In particular, the present invention relates to assay
methods and compositions for detecting contamination of nucleic
acid identifiers such as sample barcodes.
BACKGROUND OF THE INVENTION
[0004] Identifiers (e.g., sample barcodes or molecular barcodes)
can be present in nucleic acids for a variety of purposes. Most
commonly, sample barcodes are added to target nucleic acid
molecules prior to the amplification and/or sequencing of such
molecules, so that the origin or source of sequence information can
be identified. Nucleic acid molecules from different samples can be
pooled together and subjected to massively parallel sequencing in
order to efficiently determine sequence information from numerous
different samples. Prior to sequencing, sample identifiers (often
referred to as sample barcodes) can be added to the nucleic acid
molecules, and this facilitates grouping, analysis, and
interpretation of information. As another example, molecular
barcodes can be added to target nucleic acid molecules prior to
amplification, so that the replicates of the initial target
molecule can subsequently be identified and grouped together.
[0005] Sample barcodes are frequently used with target molecules
that will be analyzed by massively parallel sequencing, so that
nucleic acid molecules from different samples can be pooled for
sequencing, and the sequence information can be assigned to a
sample. Scientists and laboratories that perform massively parallel
sequencing occasionally detect a sample barcode in a pool even when
this sample barcode was not included in the sequencing pool. This
indicates that a contaminating sample barcode is present in the
pooled nucleic acids, which may be caused by a sample barcode
aliquot containing more than one sample barcode sequence, namely
the expected barcode sequence and the contaminating barcode
sequence. Contaminating barcodes could be introduced at any stage
of the preparation of sample barcode aliquots, beginning from the
earliest stage, including the synthesis and purification of DNA
oligos, or through handling steps in the process of diluting and
aliquoting sample barcode sequences. Even when present at low
frequencies, such as 1% or lower, the presence of contaminating
sample barcodes can create problems with regard to the reliability
and interpretation of the sequence information.
[0006] Sample barcodes are often provided in a set of containers,
such as a well plate, where each container holds a different sample
barcode. When the sample barcodes are used in laboratory analysis,
such as by pipetting the sample barcodes from their containers to
the various samples to be analyzed, there is a risk that a
container or sample may become contaminated.
[0007] Contamination of sample barcodes could be detected by
preparing individual sequencing libraries for each sample barcode
and sequencing them individually. Alternatively, contamination could
be detected with a pooling scheme that provides the ability to
compare a sample barcode and contamination of another sample
barcode in at least one of the pools. However, a large number of
pools would have to be prepared and sequenced in separate
sequencing runs in order to isolate sample barcodes from a large
number of samples, such as 48 or 96 samples. This would be
expensive, inefficient and time-consuming. It also has the
potential of erroneously finding contamination in a sample barcode
that was not present in the tube, but instead introduced in one of
the many library preparation steps, leading to false positives.
SUMMARY OF THE INVENTION
[0008] As one aspect of the present invention, methods are provided
for attaching assay identifiers (e.g., quality control barcodes) to
a set of oligonucleotide samples comprising oligonucleotides, where
each oligonucleotide comprises a 5' constant region, a sample
identifier (e.g., a sample barcode), and a 3' constant region, and
each sample identifier is unique in the set in the absence of
contamination. In some embodiments, the constant regions comprise
standard amplification regions for a sequencing platform, or their
reverse complement. For example, in some embodiments, the 5'
constant region is an Illumina Index 1 sequence and the 3' constant
region is the reverse complement of Illumina P7 sequence (P7'), and
in other embodiments, the orientation is reversed such that the 5'
constant region is an Illumina P7 sequence and the 3' constant
region is an Illumina Read 2 sequence. The methods comprise
providing each of the oligonucleotide samples of the set in a
separate vessel, so that each vessel comprises only one sample
identifier unless one or more of the samples is contaminated. The
methods also comprise amplifying the oligonucleotides with an assay
primer and a second primer in each vessel. Assay primers comprise
one or more constant regions (such as P5 and a Read 1 Primer
sequence), an assay identifier, and a priming portion that is the
same as or complementary to one of the constant regions of the
oligonucleotides. Each vessel comprises only one assay identifier
unless one or more of the assay primers are contaminated. The
method thus provides oligonucleotide amplicons comprising an assay
identifier and a sample identifier.
[0009] As another aspect, methods are provided for detecting
contamination in a set of oligonucleotides comprising sample
identifiers. The methods comprise providing a set of
oligonucleotide samples comprising oligonucleotides, each
oligonucleotide having a 5' constant region, a sample identifier
(such as a sample barcode), and a 3' constant region.
Oligonucleotides within a sample have the same sample identifier
and each of the samples within the set has a different sample
identifier, unless one or more of the samples is contaminated. The
methods also comprise amplifying the oligonucleotides or
complements of the oligonucleotides with assay primers and a second
primer. A different assay primer is used for each sample, and each
assay primer comprises a priming portion and an assay identifier
(such as a QC barcode), thereby generating a set of oligonucleotide
amplicons. Each oligonucleotide amplicon comprises one of the
assay identifiers, the 5' constant region, one of the sample
identifiers, and the 3' constant region. The methods also comprise
pooling the oligonucleotide amplicons in one or more pools;
sequencing the one or more pools to determine sequence information
for at least the sample identifier and the assay identifier of the
oligonucleotide amplicons; determining whether the sample
identifiers in a first pool include a contaminating sample
identifier; and determining whether the assay identifiers in the
first pool include a contaminating assay identifier.
[0010] In some embodiments, the present methods comprise pooling
the oligonucleotide amplicons in at least two pools, and separately
sequencing the first pool and the second pool to determine
sequences for at least the sample identifier and the assay
identifier of the oligonucleotide amplicons. The present methods
can also comprise determining whether the sample identifiers in the
second pool include a contaminating sample identifier. In some
embodiments, the present methods also comprise determining whether
the assay identifiers in the second pool include a contaminating
assay identifier. In some embodiments, the present methods further
comprise identifying a contaminating sample identifier in a first
pool by determining that the contaminating sample identifier is
from a second pool. In some embodiments, the present methods
further comprise identifying a contaminating sample identifier in a
first pool by determining that the second pool does not include a
contaminating assay identifier. In some embodiments, the present
methods further comprise identifying a contaminating assay
identifier in a first pool by determining that the second pool
includes a contaminating assay identifier. In some embodiments, the
contaminating sample identifier is determined by one or both of (i)
identifying one or more of the sample identifiers that are
associated with more than one assay identifier, and (ii)
identifying assay identifiers that are associated with more than
one sample identifier.
[0011] As another aspect, compositions are provided which are
useful in assays adapted for determining contamination in a set of
oligonucleotides comprising sample identifiers. The compositions
comprise at least one oligonucleotide having a 5' constant region,
a sample identifier (such as a sample barcode), and a 3' constant
region, and at least one assay primer comprising a priming portion
and an assay identifier. In some embodiments, the compositions
further comprise one or more of a DNA polymerase, and
deoxynucleotides.
[0012] As yet another aspect, kits are provided for assays adapted
for determining contamination in a set of oligonucleotides
comprising sample identifiers. The kits comprise at least 8 assay
primers, alternatively at least 16 assay primers, alternatively at
least 32 assay primers, alternatively at least 48 primers or at
least 96 primers, in separate vessels. Each assay primer
comprises a priming portion and an assay identifier.
[0013] In some embodiments of the foregoing aspects, a set or pool
of oligonucleotide samples comprises at least 8 samples,
alternatively at least 16 samples, alternatively at least 32
samples, alternatively at least 48 samples, alternatively at least
96 samples, where each sample has a sample identifier that is
unique within the set or pool. In some embodiments, a set of assay
primers comprises at least 32 assay identifiers, alternatively at
least 48 assay identifiers, alternatively at least 96 assay
identifiers, where each assay primer has an assay identifier
that is unique within the set or pool.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] FIGS. 1A, 1B and 1C show embodiments of the present methods
of attaching an assay identifier to an oligonucleotide having a
sample identifier.
[0015] FIG. 2 shows the sequences of two different embodiments of assay
primers according to the present disclosure. The two embodiments
contain many of the same regions, but the 5' constant regions are
different. In version 2, there is less overlap between the 5'
constant region and the 3' constant region.
[0016] FIG. 3 shows the distribution of amplicon sizes from
amplification of an oligonucleotide using the assay primer of the
first embodiment in FIG. 2.
[0017] FIG. 4 shows the distribution of amplicon sizes from
amplification of an oligonucleotide using the assay primer of the
second embodiment in FIG. 2.
[0018] FIG. 5 shows another embodiment of the present methods of
attaching an assay identifier to an oligonucleotide having a sample
identifier, where the identifier is attached at a 3' location
relative to the sample identifier.
[0019] FIG. 6 shows another embodiment of the present methods of
attaching an assay identifier to an oligonucleotide having a sample
identifier, where the constant regions of the oligonucleotide are
not compatible with a desired sequencing platform.
[0020] FIG. 7 shows a pooling scheme for detecting contamination of
sample identifiers using the present methods and compositions.
DETAILED DESCRIPTION OF THE INVENTION
[0021] The present methods, compositions and kits are useful for
detecting contamination in a set of oligonucleotides for nucleic
acid samples and allow the production of sample identifier sets
that are substantially free of contamination. This is a significant
advance and benefit, as the presence of sample barcode
contamination may result in false calling of genetic
variants which can have severe consequences for research and
clinical applications.
[0022] The methods, compositions and kits employ oligonucleotides
which have a 5' constant region, a sample identifier, and a 3'
constant region. Each of the oligonucleotides within a sample has
the same sample identifier and each of the samples within the set
has different sample identifiers, unless one or more of the samples
is contaminated by a contaminating sample identifier. In some
embodiments, each of the samples within the set has a sample
identifier which is unique in the set, meaning that it is intended
to be and will be unique in the absence of contamination.
[0023] A "sample identifier" comprises a sample barcode or any
degenerate or random sequence that can be used to identify a
sample. Sample identifiers may be flanked (directly or indirectly)
by constant regions. In some embodiments, the sample identifier can
be a sample barcode comprising 6 or more random or degenerate
nucleotides; alternatively the sample identifier can be a sample
barcode comprising 8 or more random or degenerate nucleotides, or
10 or more random or degenerate nucleotides. In some embodiments, a
sample identifier comprises 8 known bases, and an assay identifier
comprises 10 degenerate bases. In other embodiments, a sample
identifier comprises 4 known bases or 6 known bases. In some
embodiments, the number of bases in the sample identifier can be
selected based on the number of samples to be distinguished. Longer
sample identifiers and sample barcodes are also possible. For
example, a sample identifier comprising 18 bases (8 known bases and
10 degenerate bases) has been employed to prepare a library of
oligonucleotides for an Ion Torrent sequencing platform. A sample
identifier with more than 19 bases is also feasible and may be
desired, especially if the assay is used for other sequencing
platforms and applications. In some embodiments, the complement of
an initial sample barcode is in an oligonucleotide amplicon, and
this complement is also considered a sample identifier.
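As a rough illustration of selecting identifier length from the number of samples, the following sketch computes the smallest number of bases that can give each sample a distinct sequence, using only the fact that n bases over the four-letter DNA alphabet allow 4^n sequences. The function name is hypothetical and not part of the disclosure. The result is a theoretical lower bound; practical barcode sets, such as the 6-base and 8-base identifiers mentioned above, are typically longer so that identifiers remain distinguishable despite sequencing errors.

```python
def min_barcode_length(num_samples: int) -> int:
    """Return the smallest n such that 4**n >= num_samples.

    Hypothetical helper: a lower bound only, ignoring error tolerance.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    length = 1
    # Each added base multiplies the number of possible sequences by 4.
    while 4 ** length < num_samples:
        length += 1
    return length


if __name__ == "__main__":
    for n in (8, 32, 96):
        print(n, "samples need at least", min_barcode_length(n), "bases")
```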
[0024] A "constant" region is one that comprises a known sequence,
and because it is known, it can serve a desired function. A
constant region will generally be the same or substantially the
same among oligonucleotides of a set. The known sequence can serve
as a priming site (region) for amplification or primer extension,
and/or can hybridize to a nucleic acid attached to a support. In
some embodiments, a constant region comprises a sequence of a
standard region, such as a standard amplification region used in a
sequencing platform. A constant region can comprise a number of
nucleotides from a known or standard region sufficient for the
function of the standard region, such as a sufficient number of
nucleotides to hybridize to a standard primer for
amplification.
[0025] A "contaminating" molecule or sequence is one that is not
designed to be in a set or pool, or should not be present in a set
or pool or sample unless there is some contamination. For example,
a barcode in a first set or pool of sequences is a contaminating
barcode if it should not be present in the first set or pool and/or
should only be present in a second set or pool.
[0026] The present methods and compositions provide a solution to
the problem of identifying contamination in sets of
oligonucleotides comprising sample identifiers such as sample
barcodes. The present techniques have a relatively small number of
handling steps, which is desirable since handling steps increase
risk of contamination. Additionally, a pooling scheme and analysis
method is provided which reduces the number of pools and sequencing
runs required to detect contamination between samples. Instead of a
large number of pools, the present method can reduce the pools
used to detect contamination in a set of 96 sample identifiers. In
some embodiments, two sequencing pools are used to detect sample
identifier contamination in a set of 96 sample identifiers.
[0027] The present methods and compositions can also be used to
amplify oligonucleotides (such as library molecules, adaptors,
aptamers or other ssDNA molecules used to target proteins or
peptides) which have a series of random nucleotides (which are
considered sample identifiers herein) between two constant regions
in order to detect sequence diversity, including detection of
molecular barcodes. They could also be used to identify single
nucleotide polymorphisms (SNPs) or sites of mutagenesis in known
regions of DNA.
[0028] The oligonucleotides which may be assayed by the present
methods include adaptors for nucleic acid molecules or regions from
standard adaptors, such as the amplification region from a standard
adaptor for a sequencing platform. The oligonucleotide can also
include a label, tag, or other moiety. By way of example, the
oligonucleotide includes a biotin moiety, allowing for enrichment
of the oligonucleotides by binding to avidin or streptavidin. This
approach is used in the commercially available Haloplex kit
(Agilent Technologies). The oligonucleotides which may be assayed
by the present methods include library molecules, which are
molecules prepared to be part of a library for a sequencing
platform. A library molecule generally comprises an insert to which
a sample identifier and one or more standard regions for sequencing
platforms are attached. Other regions can also be included in a
library molecule. With a library molecule, the sample identifier
can be a molecular barcode, or it can be a second sample barcode
that is in addition to a first sample barcode.
[0029] The methods also comprise amplifying the oligonucleotides or
complements of the oligonucleotides with assay primers and a second
primer. A different assay primer is used for each sample, and each
assay primer comprises a priming portion and an assay identifier
(such as a QC barcode), thereby generating a set of oligonucleotide
amplicons. Each oligonucleotide amplicon comprises one of the assay
identifiers, the 5' constant region, one of the sample identifiers,
and the 3' constant region. The present assay methods can be
readily adapted to various standardized sequencing platforms (for
example, the Illumina and Ion Torrent sequencing platforms), by
selecting constant regions that are standard for those
platforms.
[0030] In some embodiments, the present methods detect sample
identifier contamination at a level less than 1%, alternatively
less than 0.5%, alternatively less than 0.1% using a small number
of handling steps to avoid or prevent assay-induced contamination,
and provide a method of pooling and analysis, such that a small
number of sequencing runs is performed. The present disclosure
provides a fast and relatively inexpensive method to prepare
libraries from potentially contaminated oligonucleotides having
sample identifiers. The libraries are adapted for sequencing,
especially massively parallel sequencing, on one or more desired
sequencing platforms.
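To make the contamination levels above concrete, the following sketch estimates, for the reads carrying a single assay identifier, what fraction of those reads carries each unexpected sample identifier, and reports those at or above a chosen threshold. The function names, the example sequences, and the default threshold of 0.1% are illustrative assumptions, not values or procedures prescribed by the disclosure.

```python
from collections import Counter


def contamination_fractions(sample_ids_seen, expected_id):
    """Fraction of reads carrying each unexpected sample identifier.

    sample_ids_seen: sample identifier strings, one per read, all from
    reads that share a single assay identifier (hypothetical input).
    """
    counts = Counter(sample_ids_seen)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sid: n / total for sid, n in counts.items() if sid != expected_id}


def flag_contaminants(fractions, threshold=0.001):
    """Keep identifiers at or above the threshold (0.001 is 0.1%, an assumed cutoff)."""
    return {sid: f for sid, f in fractions.items() if f >= threshold}


if __name__ == "__main__":
    # Made-up reads: 9,990 with the expected identifier, 10 with another one.
    reads = ["AAAACCCC"] * 9990 + ["GGGGTTTT"] * 10
    fractions = contamination_fractions(reads, expected_id="AAAACCCC")
    print(flag_contaminants(fractions))
```

In practice a threshold would need to account for the sequencing error rate, since miscalled bases can make one identifier resemble another.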
[0031] In some embodiments, the oligonucleotide amplicons comprise
a 5' constant region and a 3' constant region. Furthermore, the 5'
constant region comprises a standard 5' adaptor for a sequencing
platform and a sequencing priming region, an assay identifier, a
middle constant region comprising a sequencing priming region, and
a sample identifier, and the 3' constant region comprising a
standard 3' adaptor for a sequencing platform. In some embodiments,
the oligonucleotide amplicons comprise (i) a 5' constant region
comprising a standard 5' adaptor for a sequencing platform and a
sequencing priming region, (ii) an assay identifier, (iii) a middle
constant region comprising a sequencing priming region, (iv) a
sample identifier, and (v) a 3' constant region comprising a
standard 3' adaptor for a sequencing platform. For example, a
standard 5' adaptor can comprise an Illumina P5 or P5' sequence,
and a standard 3' adapter can comprise an Illumina P7 or P7'
sequence. P7' indicates the complement of P7; likewise, P5'
indicates the complement of P5. In other embodiments, the
oligonucleotide amplicon comprises a 5' constant region comprising
a standard 5' adapter, a sample identifier, a middle constant
region, an assay identifier, and a 3' constant region comprising a
standard 3' adapter.
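The amplicon layout (i) through (v) described above can be sketched as a simple string model: a 5' constant region, an assay identifier, a middle constant region, a sample identifier, and a 3' constant region. In the sketch below, the three constant sequences are short made-up placeholders, not actual Illumina P5, Read 1, or P7 sequences, and the function names are hypothetical. The parser assumes exact matches and that the identifiers do not themselves contain the constant sequences.

```python
# Placeholder constant regions (hypothetical; not real platform sequences).
FIVE_PRIME_CONSTANT = "ACACGACG"
MIDDLE_CONSTANT = "GATCGGAAGA"
THREE_PRIME_CONSTANT = "ATCTCGTATG"


def build_amplicon(assay_id: str, sample_id: str) -> str:
    """Concatenate the regions in the order (i) through (v)."""
    return (FIVE_PRIME_CONSTANT + assay_id + MIDDLE_CONSTANT
            + sample_id + THREE_PRIME_CONSTANT)


def parse_amplicon(read: str):
    """Return (assay_id, sample_id), or None if the layout is not found."""
    if not read.startswith(FIVE_PRIME_CONSTANT):
        return None
    assay_start = len(FIVE_PRIME_CONSTANT)
    middle_start = read.find(MIDDLE_CONSTANT, assay_start)
    if middle_start == -1:
        return None
    sample_start = middle_start + len(MIDDLE_CONSTANT)
    three_prime_start = read.find(THREE_PRIME_CONSTANT, sample_start)
    if three_prime_start == -1:
        return None
    return read[assay_start:middle_start], read[sample_start:three_prime_start]


if __name__ == "__main__":
    read = build_amplicon("AACCGGTT", "TTGGCCAA")
    print(parse_amplicon(read))
```

Real reads would usually call for mismatch-tolerant matching or fixed-offset extraction; exact string search is used here only to keep the layout visible.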
[0032] The present methods, compositions and kits can also be used
to modify an oligonucleotide comprising a region that is standard
for a first sequencing platform (for example, an amplification
region or a sequencing primer site (region)), so that it includes a
region that is standard for a different sequencing platform. In
some embodiments, a second primer comprises a 3' region
complementary to a 3' constant region of the oligonucleotides, and
the second primer further comprises a 5' region comprising a
standard amplification region, wherein the 3' constant region of
the oligonucleotides comprises a standard amplification region for
a different sequencing platform than the standard amplification
region of the 5' region of the second primer.
[0033] The present disclosure also provides novel pooling and
sequencing schemes for identifying contamination of sample
identifiers and assay identifiers. In some embodiments, the present
methods comprise pooling the oligonucleotide amplicons in at least
two pools; sequencing the two pools to determine the sequences of
at least portions of the oligonucleotide amplicons comprising the
sample identifiers and the assay identifiers; determining whether
the sample identifiers in the second pool include a contaminating
sample identifier; and determining whether the assay identifiers in
the second pool include a contaminating assay identifier. In some
embodiments, the present methods further comprise determining a
contaminating sample identifier by determining that the
contaminating sample identifier is from a second pool. In some
embodiments, the methods further comprise identifying a
contaminating sample identifier by determining that the second pool
does not include a contaminating assay identifier. In some
embodiments, the present methods further comprise identifying a
contaminating assay identifier by determining that the second pool
does not include a contaminating assay identifier.
[0034] In some embodiments, the present methods further comprise
grouping sequences of the oligonucleotide amplicons according to
the assay identifiers to form assay groups; and determining if
there is more than one sample identifier sequence in each of the
assay groups. In some embodiments, the present methods further
comprise grouping sequences of the oligonucleotide amplicons
according to the sample identifiers to form sample groups; and
determining if there is more than one assay identifier sequence in
each of the sample groups. In some embodiments, the methods
comprise forming at least two pools from the oligonucleotide
amplicons; sequencing at least two pools of amplicons to obtain
sequence information of the oligonucleotide amplicons; wherein the
sequence information for the individual oligonucleotide amplicon at
least comprises the sequence of the assay identifier and the sample
identifier. In some embodiments, the present methods can comprise
grouping amplicon sequence information according to the assay
identifier; and determining if grouped amplicon sequence
information contains more than one of the sample identifiers.
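The grouping step described above can be sketched as follows, assuming each sequenced amplicon has already been reduced to an (assay identifier, sample identifier) pair. The function names and example identifiers are hypothetical. Grouping by assay identifier yields the assay groups, grouping by sample identifier yields the sample groups, and any group with more than one partner identifier is reported for follow-up.

```python
from collections import defaultdict


def group_pairs(pairs, key_index):
    """Group (assay_id, sample_id) pairs by one element; collect the other.

    key_index=0 forms assay groups; key_index=1 forms sample groups.
    """
    groups = defaultdict(set)
    for pair in pairs:
        groups[pair[key_index]].add(pair[1 - key_index])
    return dict(groups)


def groups_with_multiple_partners(pairs, key_index):
    """Return only the groups that contain more than one partner identifier."""
    return {key: partners
            for key, partners in group_pairs(pairs, key_index).items()
            if len(partners) > 1}


if __name__ == "__main__":
    # Made-up observations: assay identifier QC1 is seen with two sample identifiers.
    observed = [("QC1", "S1"), ("QC1", "S1"), ("QC1", "S2"), ("QC2", "S2")]
    print("assay groups with >1 sample id:",
          groups_with_multiple_partners(observed, key_index=0))
    print("sample groups with >1 assay id:",
          groups_with_multiple_partners(observed, key_index=1))
```

A group with several partners only indicates that something is contaminated; deciding whether the sample identifier or the assay identifier is the contaminant relies on the pooling analysis described elsewhere in the disclosure.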
[0035] The methods can comprise determining if there is a mismatch
between an assay identifier and a sample identifier, such as where
at least one of the sample identifiers is associated with an assay
identifier that it should not be associated with, and/or where at
least one of the assay identifiers is associated with a sample
identifier that it should not be associated with.
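When the intended pairing is known, for example because a specific assay primer was added to each vessel, mismatches can be checked directly against that design. The sketch below assumes a hypothetical dictionary mapping each assay identifier to the sample identifier it was intended to be paired with; the function name and data are illustrative only.

```python
def find_mismatches(observed_pairs, expected_sample_for_assay):
    """List observed pairs that disagree with the intended design.

    observed_pairs: iterable of (assay_id, sample_id) tuples from sequencing.
    expected_sample_for_assay: dict of assay_id -> intended sample_id
    (hypothetical record of the plate layout).
    Returns (assay_id, observed_sample_id, expected_sample_id) tuples;
    expected is None when the assay identifier is not in the design at all.
    """
    mismatches = []
    for assay_id, sample_id in observed_pairs:
        expected = expected_sample_for_assay.get(assay_id)
        if expected != sample_id:
            mismatches.append((assay_id, sample_id, expected))
    return mismatches


if __name__ == "__main__":
    design = {"QC1": "S1", "QC2": "S2"}
    observed = [("QC1", "S1"), ("QC1", "S2"), ("QC3", "S1")]
    for assay_id, seen, expected in find_mismatches(observed, design):
        print(f"{assay_id}: saw {seen}, expected {expected}")
```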
[0036] The present methods can be used with sample preparation kits
for NGS. They can also be used with library preparation reagents.
The present methods can also be employed to assay target enrichment
kits and sets that contain sample barcodes or other identifiers,
including SureSelect reagent kits. SureSelect kits (available from
Agilent Technologies) contain oligonucleotides having a sample
identifier and having one or more constant regions 5' and 3' to the
sample identifier, namely PCR primers.
[0037] The present disclosure allows for the production of sample
identifier sets or kits that are substantially free of
contamination, such as having less than 0.1% of a contaminating
sample identifier, or less than 0.01%.
[0038] In FIG. 1A, an oligonucleotide 102 comprises a 5' constant
region 110, a sample identifier 112, and a 3' constant region 114.
For example, the 5' constant region 110 can comprise a standard
sequence such as an Illumina Index 1 sequence, and the 3' constant
region 114 can comprise a standard amplification sequence, such as
the Illumina P7' sequence.
The constant regions can comprise any standard priming site
(region) for amplification or sequencing. The oligonucleotide 102
is amplified using a primer 104 having a priming region 115
complementary to at least a portion of the 3' constant region 114.
For example, the primer 104 can be a P7 primer. In the same step or
a subsequent step, the oligonucleotide 102 or complement thereof is
amplified with a primer 106 having a priming region 120
complementary to at least a portion of the 5' constant region 110
or its complement 111. The primer also comprises an assay
identifier 122 and one or more constant regions 126, 124 (for
example an Illumina P5 sequence 126 and a read 1 sequencing primer
124). Additional rounds of amplification produce oligonucleotide
amplicons 108 comprising one or more constant regions 126, 124, the
assay identifier 122, the sequence of the 5' constant region 120 of
the initial oligonucleotide, the sample identifier sequence 128,
and the 3' constant region 130 of the initial oligonucleotide. The
sample identifier sequence 128 in the amplicons 108 is generally an
identical copy of the sample identifier 112 of the oligonucleotide
102. Constant region 120 of the amplicon 108 will be mostly
identical to constant region 110 of the oligonucleotide 102;
however, either could be partially truncated. For example, constant
region 110 could be truncated on the 5' end, and constant region
120 could be truncated on the 3' end. Likewise, constant region 130
of the amplicon 108 and constant region 114 of the oligonucleotide
102 will generally be the same, though constant region 114 could be
partially truncated on the 3' end, and constant region 130 could be
partially truncated on the 5' end. The oligonucleotide amplicons
108 are adapted for sequencing on a standard platform for massively
parallel sequencing due to the constant regions.
[0039] FIG. 1B shows another embodiment of the present methods. In
this embodiment, oligonucleotide 103 comprises a 3' constant region
111, a sample identifier 113, and a 5' constant region 115. For
example, the 3' constant region 111 can be the Illumina Read 2
sequence (or another standard region for a sequencing platform),
and the 5' constant region 115 can be an Illumina P7 sequence or
any standard priming site (a region) for amplification or
sequencing. Amplification produces oligonucleotide amplicons 109
comprising one or more constant regions 127, 125, the assay
identifier 123, the 3' constant region 111, the sample identifier
sequence 113, and the sequence of the 5' constant region 115.
Additional rounds of amplification can be conducted with primer 131
which has the same sequence as a portion of constant region 115
sufficient to function as a primer.
[0040] FIG. 1C demonstrates how the assay method can be performed
when the initial oligonucleotide is a library molecule, that is, a
molecule comprising an insert to which a sample identifier and
standard regions for sequencing platforms are attached. In this
embodiment, the assay method can detect contamination that occurred
during the library preparation. The oligonucleotide 102 comprises a
first 5' constant region 110, a sample identifier 112, a 3'
constant region 114, and further comprises an insert 140, a second
5' constant region 142 (such as a Read 1 priming site), an
optional second sample identifier 144, and a third 5' constant
region 146 (for example, an amplification priming site). The insert
140 comprises a target sequence to be studied, analyzed or
subjected to additional testing, such as sequencing on a massively
parallel sequencing platform. A second sample identifier 144 is
optionally included in many library preparations. Oligonucleotide
103 (which is a complementary strand of oligonucleotide 102)
comprises a first 3' constant region 111, a sample identifier 113,
a 5' constant region 115, and further comprises an insert 141, a
second 3' constant region 143 (such as a Read 1 priming site), an
optional second sample identifier 145, and a third 3' constant
region 147 (for example, an amplification priming site). The
oligonucleotide 102 is amplified using an assay primer 104 having a
priming region 115 complementary to at least a portion of the 3'
constant region 111. For example, the primer 104 can be a P7
primer. In the same step or a subsequent step, the oligonucleotide
102 or complement thereof is amplified with a primer 106 having a
priming region 120 complementary to at least a portion of the 5'
constant region 110 or its complement 111. The primer also
comprises an assay identifier 122 and one or more constant regions
126, 124 (for example an Illumina P5 sequence 126 and a Read 1
sequencing primer region 124). Additional rounds of amplification
produce oligonucleotide amplicons 108 comprising one or more
constant regions 126, 124, the assay identifier 122, the sequence
of the 5' constant region 120 of the initial oligonucleotide, the
sample identifier sequence 128, and the 3' constant region 130 of
the initial oligonucleotide. In the embodiment shown,
oligonucleotide amplicon 108 does not include insert 140, but in
some embodiments, primer 106 binds a region 3' to the insert 140,
and the insert 140 is thereby included in the amplicons. A pooling
method (as described in Example 4) can be employed on a library
prepared with two or more sample barcodes (where the barcodes are
attached via either ligation or amplification) and the pooling
method can be used to identify if sample barcode contamination
occurred after the library preparation was performed.
[0041] By the selection of constant regions and priming regions on
the assay primers, this method is adaptable for different library
preparation methods (including Haloplex XTHS, Haloplex HS,
SureSelect XT, and SureSelect QXT, all from Agilent) and different
standardized sequencing platforms (including Illumina and Ion
Torrent). Sequencing platforms for massively parallel sequencing
include Ion Torrent PGM and Proton semiconductor sequencers, and
Illumina MiSeq, HiSeq, MiniSeq, and NextSeq. Other sequencing
platforms are in development and the present compositions and
methods can be used with the standard amplification regions for
those platforms.
[0042] In some embodiments, constant regions on the oligonucleotide
and/or the assay identifier comprise sequences suitable for use on
a standardized sequencing platform. For example, a constant region
can have the sequence of an amplification region for an Illumina
sequencing platform, such as an Illumina P5 sequence or an Illumina
P7 sequence, or such as an Ion Torrent Adapter A sequence or an Ion
Torrent Adapter P1 sequence, or such as the sequencing primer
regions, such as Illumina Read1, Index1, Read2 or Index2. Other
amplification regions or sequencing primer regions can be used for
different platforms. Table 1 sets forth the sequences of standard
regions currently used in Illumina and Ion Torrent sequencing
platforms:
TABLE 1

Region          Sequence (5' to 3')                        SEQ ID NO
Illumina P5     5'-AATGATACGGCGACCACCGA-3'                 1
Illumina P7     5'-CAAGCAGAAGACGGCATACGAGAT-3'             2
Illumina Read1  5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'    3
Illumina Index1 5'-GATCGGAAGAGCACACGTCTGAACTCCAGTCAC-3'    4
Illumina Read2  5'-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3'   5
Illumina Index2 5'-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3'    6
IonTorrent A    5'-CCATCTCATCCCTGCGTGTCTCCGACTCAG-3'       7
IonTorrent P1   5'-CCTCTCTATGGGCAGTCGGTGAT-3'              8

In some embodiments, a constant region of an oligonucleotide
comprises a sequence selected from the sequences set forth in Table
1.
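
As a consistency check on Table 1, the following minimal sketch stores
the eight sequences in a Python dictionary and applies a small
reverse-complement helper. The names TABLE_1 and reverse_complement
are illustrative assumptions and are not part of the original
disclosure. With the sequences as printed, Index2 is the reverse
complement of Read1, and Index1 matches the reverse complement of
Read2 without its first base.

```python
# Sequences copied from Table 1 (SEQ ID NOs: 1-8), written 5' to 3'.
TABLE_1 = {
    "Illumina P5": "AATGATACGGCGACCACCGA",
    "Illumina P7": "CAAGCAGAAGACGGCATACGAGAT",
    "Illumina Read1": "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "Illumina Index1": "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "Illumina Read2": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
    "Illumina Index2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
    "IonTorrent A": "CCATCTCATCCCTGCGTGTCTCCGACTCAG",
    "IonTorrent P1": "CCTCTCTATGGGCAGTCGGTGAT",
}

# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    for name, seq in TABLE_1.items():
        print(f"{name}: {len(seq)} nt")
    # Relationships between the sequencing primer regions of Table 1.
    print(reverse_complement(TABLE_1["Illumina Read1"])
          == TABLE_1["Illumina Index2"])
    print(reverse_complement(TABLE_1["Illumina Read2"])[1:]
          == TABLE_1["Illumina Index1"])
```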
[0043] FIG. 5 shows how the present methods and compositions can be
used to add an assay identifier at a 3' location relative to the
sample identifier. This approach is especially suitable for
oligonucleotides which are adapters configured for attachment to 5'
ends of target molecules to be sequenced or primers intended to
amplify the 5' end of target molecules. Thus, in this embodiment, the
present methods are particularly suited for detecting identifiers
present in a 5' adaptor (and is an alternative to a 3' adaptor as
shown in FIG. 1).
[0044] In FIG. 5, an oligonucleotide 502 comprises a 5' constant
region 510, a sample identifier 512, and a 3' constant region 514.
For example, the 5' constant region 510 can be an Illumina P5
sequence, the sample identifier 512 can be a sample barcode, and
the 3' constant region 514 can be an Illumina Read 1 sequence. The
oligonucleotide 502 is amplified using a primer 504 having a
priming region 515 complementary to at least a portion of the 3'
constant region 514. For example, the priming region 515 can be the
reverse complement of the 3' constant region 514, that is, the
reverse complement of an Illumina Read 1 sequence. Primer 504 also
comprises an assay identifier 517 and an adapter 519 for a
sequencing platform or its complement, for example the reverse
complement of Illumina P7 (P7'). The oligonucleotide 502 or
complement thereof is amplified with a primer 506 having a priming
region 520 complementary to at least a portion of the 5' constant
region 510 or its complement. Additional rounds of amplification
produce oligonucleotide amplicons 508 comprising a 3' adapter 518,
the assay identifier 522, 516, the 3' constant region 514, the
sample identifier 512, and 5' constant region 520. The
oligonucleotide amplicons 508 are adapted for sequencing on a
standard platform for massively parallel sequencing because at
least one, and often both, constant regions include an adapter for
such a platform.
[0045] FIG. 6 shows how the present assay methods and compositions
can be used to detect contamination in the oligonucleotides when
they are surrounded by two constant regions, and neither of those
constant regions is compatible with the sequencing platform to be
used for the assay. Alternatively, this approach can be used to
convert adaptors and primers from one sequencing platform so that
they can be sequenced on another platform. For example, the
oligonucleotides such as adaptors used in an Ion Torrent HaloPlex
assay can be assayed using an assay primer containing: Illumina P5,
QXT Read1, QC index, IonTorrent Read primers; and an amplification
primer containing: Illumina P7 and the reverse complement to the
Haloplex dark bases (dark bases are those that do not generate the
fluorescence associated with nucleotide incorporation during
sequencing). This allows these primers to be assayed for
contamination on an Illumina sequencer. This approach can also be
used to allow sequencing of oligonucleotides that are not intended
for sequencing and do not include amplification regions for
sequencing platforms, provided those oligonucleotides comprise a 5'
constant region, an unknown region, and a 3' constant region.
[0046] In FIG. 6, an oligonucleotide 602 comprises a 5' constant
region 610, a sample identifier 612, and a 3' constant region 614.
In this embodiment, the constant regions 610, 614 of
oligonucleotide 602 are for a first sequencing platform, such as an
Ion Torrent sequencing platform, but it is desired to sequence the
oligonucleotide 602 on a second sequencing platform, such as an
Illumina sequencing platform. For example, the 5' constant region
610 can be an Ion Torrent Adapter A sequence, the sample identifier
612 can be a sample barcode, and the 3' constant region 614 can be
dark bases provided to allow for ligation and quality control. The
oligonucleotide 602 is amplified using a primer 604 having a
priming region 615 complementary to at least a portion of the 3'
constant region 614 (that is, complementary to at least a portion
of the dark bases). Primer 604 also comprises a region 617
comprising a region corresponding to a standard amplification
region for a sequencing platform, for example, an Illumina P7
sequence. The oligonucleotide 602 or complement thereof is
amplified with a primer 606 having a priming region 620
complementary to at least a portion of the 5' constant region 610
or its complement 611. The primer 606 also comprises an assay
identifier 622 and one or more constant regions (for example an
Illumina P5 sequence 626 and an Illumina Read 1 sequence 624).
Amplification continues with primer 606 and primer 604 using
suitable amplification cycles to provide oligonucleotide amplicons
suitable for sequencing on an Illumina sequencing platform.
Additional rounds of amplification produce oligonucleotide
amplicons 608 comprising one or more constant regions 626, 624, the
assay identifier 622, the sequence 620 of the 5' constant region
610 of the initial oligonucleotide 602, the sample identifier 612,
the sequence of the 3' constant region 614 and an amplification
region 628. The oligonucleotide amplicons 608 are adapted for
sequencing on a standard platform for massively parallel sequencing
due to the constant regions 626 and/or amplification region
628.
[0047] In some embodiments, the presence of a complementary DNA
strand (as in the case of an adaptor) may cause problems with
detecting contamination or sequence variation, if the complementary
adaptor strand contains both of the binding regions for
amplification primers. In such situations, both strands will be
amplified and any detected contamination/sequence variation could
be due to differences in the sequence of the barcode sequence
present on the two strands. In many cases, the adaptor design is
such that this will not occur.
EXAMPLE 1
[0048] An embodiment of the present methods is employed to
determine whether there is sample barcode contamination in a kit
having Illumina adapter sequences. As shown in FIG. 1A, an
oligonucleotide 102 having a sample identifier 112 is flanked by
Illumina Index1 sequence as its 5' constant region 110, and an
Illumina P7' sequence as its 3' constant region 114. P7' indicates
the complement of P7; likewise, P5' indicates the complement of P5.
FIG. 1 illustrates a method for detecting contamination of this
oligonucleotide 102 with oligonucleotides having a different sample
identifier. Amplification can be performed using a standard DNA
polymerase, a P7 primer, and another primer containing P5, a Read 1
Primer sequence, a QC barcode and Index 1 sequence (from 5' to 3',
respectively). A high fidelity DNA polymerase can be used to reduce
or minimize erroneous contamination detection due to PCR
errors.
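
The layout of the amplicon described in this example can be sketched
in code. The sketch below is illustrative only: it assumes the
Table 1 sequences for P5, Read 1, Index 1 and P7 (the actual assay
primer sequences are those of FIG. 2, which are not reproduced here),
and the 8-nucleotide barcodes and the function names build_amplicon
and parse_amplicon are hypothetical. It assembles one strand in the
order P5, Read 1, QC barcode, Index 1, sample barcode, P7', and then
recovers the two barcodes from that strand.

```python
# Constant regions taken from Table 1 (assumed stand-ins for FIG. 2 primers).
P5 = "AATGATACGGCGACCACCGA"
READ1 = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
INDEX1 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
P7 = "CAAGCAGAAGACGGCATACGAGAT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_amplicon(qc_barcode, sample_barcode):
    """Assemble P5 - Read1 - QC barcode - Index1 - sample barcode - P7'."""
    return (P5 + READ1 + qc_barcode + INDEX1 + sample_barcode
            + reverse_complement(P7))


def parse_amplicon(amplicon, qc_len, sample_len):
    """Return (qc_barcode, sample_barcode), or None if the layout differs."""
    prefix = P5 + READ1
    suffix = reverse_complement(P7)
    expected_len = (len(prefix) + qc_len + len(INDEX1)
                    + sample_len + len(suffix))
    if len(amplicon) != expected_len:
        return None
    if not (amplicon.startswith(prefix) and amplicon.endswith(suffix)):
        return None
    qc_start = len(prefix)
    index_start = qc_start + qc_len
    sample_start = index_start + len(INDEX1)
    # The Index 1 constant region must sit between the two barcodes.
    if amplicon[index_start:sample_start] != INDEX1:
        return None
    return (amplicon[qc_start:index_start],
            amplicon[sample_start:sample_start + sample_len])


if __name__ == "__main__":
    # Hypothetical barcodes, chosen only for illustration.
    amplicon = build_amplicon("ACGTACGT", "TTGCAAGC")
    print(parse_amplicon(amplicon, qc_len=8, sample_len=8))
```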
[0049] Two versions or embodiments of the assay primers were used
to develop the assay. The sequences of these two versions are shown
in FIG. 2. Initial attempts using version 1 of the assay primer,
which contains both the Illumina Read 1 primer and the reverse
complement of Illumina Read 2 (Index 1) primer sequence in the
assay primer, resulted in a small amount of the expected 130 bp
amplicon and a large amount of shorter amplification products (Lane
B1 in FIG. 3). These products potentially come from secondary
amplification products that are created due to the 13 bp
complementarity between the 3' end of Read 1 and the 5' end of
Index 1. By changing the sequence of Read 1 from the Illumina
sequence to the QXT Read 1 sequence (version 2 of the assay
primer), these secondary amplification products were largely
eliminated (Lane B1 in FIG. 4).
EXAMPLE 2
[0050] Haloplex and Haloplex HS Kits were tested to see if the
oligonucleotide containing the sample barcodes could be amplified
in the index solution supplied in the kits. It was found
that the oligonucleotides could be cleanly amplified as a strong
amplification product was generated when using the assay primer
(FIG. 4, lane B1 (supplied index solution)).
EXAMPLE 3
[0051] Assay primers were tested with SureSelect XT and SureSelect
XT2 reagent kits, and oligonucleotides were successfully amplified.
The present assay primers were also used to test SureSelect XTHS
reagent kits, with modifications to the overlap sequence, and
oligonucleotides were successfully amplified.
[0052] Amplification of these libraries can occur even when the
oligonucleotide is modified in a way to prevent elongation, as
subsequent rounds after the first two rounds use the synthesized
molecule as a template. The amplification method also works in the
presence of 5' biotin modifications.
EXAMPLE 4
[0053] A set of 96 or more sample identifiers is provided. The set
can be used to add sample identifiers to nucleic acids prior to
amplification and/or prior to pooling before sequencing. However,
if contamination occurred in one of these sample identifiers during
kit assembly or reagent preparation, it could cause the detection
of a low allele variant in a sample. To be confident about lack of
contamination, it would take a large number of sequencing runs to
ensure every sample identifier could be confirmed as having no
contamination.
[0054] The following scheme overcomes this limitation and can be
used to determine contamination of sample identifiers (also
referred to as sample barcodes or SBCs in this example) and/or
assay identifiers (also referred to as QC barcodes or QCBCs in this
example). A set of 96 oligonucleotides containing different sample
identifiers are split into two groups: Group 1 and Group 2, each
containing 48 of the oligonucleotides. Group 1 has SBC1 to SBC48,
and Group 2 has SBC49 to SBC96. Each sample identifier in Group 1
is amplified with an assay primer containing one of 48 different
assay identifiers (QCBC1 to QCBC48). Each sample identifier in
Group 2 is amplified with one of the same 48 assay identifiers that
was used in Group 1, such that every assay identifier (QCBC1
through QCBC48) is present in both Groups and in two amplification
reactions, and every sample identifier (SBC1 through SBC96) is
present in only one Group and in one amplification reaction. The
association of assay identifiers (QCBCs) with sample identifiers
(SBCs) according to the scheme is shown in FIG. 7. For illustrative
purposes, the SBCs are shown as being arranged in a 96-well plate,
though they do not have to be provided or used in well plates.
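
The two-group scheme can be written down as a small lookup. The
following sketch assumes that SBCn is paired with QCBCn in Group 1 and
that SBC(n+48) is paired with QCBCn in Group 2; this pairing agrees
with the associations discussed in this example (for instance SBC61
with QCBC13), but the exact layout of FIG. 7 is not reproduced here.
The function names are illustrative.

```python
N_QCBC = 48   # number of assay identifiers (QCBC1..QCBC48)
N_SBC = 96    # number of sample identifiers (SBC1..SBC96)


def _check(sbc):
    if not 1 <= sbc <= N_SBC:
        raise ValueError(f"SBC number out of range: {sbc}")


def group_of(sbc):
    """Return 1 for SBC1..SBC48 and 2 for SBC49..SBC96."""
    _check(sbc)
    return 1 if sbc <= N_QCBC else 2


def expected_qcbc(sbc):
    """Return the QCBC number assumed to be paired with a given SBC."""
    _check(sbc)
    return (sbc - 1) % N_QCBC + 1


def build_scheme():
    """Map each SBC number to its (group, QCBC) assignment."""
    return {sbc: (group_of(sbc), expected_qcbc(sbc))
            for sbc in range(1, N_SBC + 1)}


if __name__ == "__main__":
    scheme = build_scheme()
    print(scheme[13], scheme[61])   # both use QCBC13, in different groups
```

Under this assumed pairing, each assay identifier appears in exactly
two amplification reactions (one per group), and each sample
identifier appears in exactly one.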
[0055] PCR amplification produces oligonucleotide amplicons having
a QCBC and an SBC. In the absence of contamination, each SBC is
associated with one QCBC. In other words, when sequenced, the
sequence information for each SBC should have a single QCBC
associated with it. FIG. 7 shows the associations that will be
produced using this scheme. However, it is desirable to sequence
amplicons in pools rather than individually using massively
parallel sequencing, thereby reducing time, expense, and effort
required for sequencing. Thus, the oligonucleotide amplicons
generated in Group 1 are pooled together and sequenced, and the
oligonucleotide amplicons from Group 2 are pooled together and
sequenced. The sequencing of the pools produces sequence
information for the various amplicons included in the pools, and
the sequencing information for a given amplicon will have a sample
identifier and an assay identifier associated with it.
[0056] Sequencing in this manner will allow for the detection of
contamination due to sample identifiers or assay identifiers based
on the associations identified after analysis of the sequence
information. For this analysis, it is helpful to include all the
potential sample identifiers (whether they are intended to be
present in the pool or not) in the analysis of the sequencing
information. If contamination occurs, it can be from the sample
identifier or the assay primer. The pattern in which sample
identifiers and assay identifiers appear in the two sequencing
pools (from Group 1 and Group 2) will determine whether it is
sample identifier contamination or assay identifier contamination.
The present scheme allows one to determine which is the source of
the contamination.
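
One way to carry out this analysis is to tally every observed pair of
sample identifier and assay identifier in a pool and compare the
pairs against the expected associations. The sketch below is a
minimal illustration, not the analysis pipeline of the present
disclosure. It assumes that reads have already been reduced to
(SBC number, QCBC number) tuples, that SBCn and SBC(n+48) share QCBCn,
and that the fraction of interest is computed among reads carrying
the same assay identifier. All names are hypothetical.

```python
from collections import Counter


def group_of(sbc):
    """Return 1 for SBC1..SBC48 and 2 for SBC49..SBC96."""
    return 1 if sbc <= 48 else 2


def expected_qcbc(sbc):
    """Assumed pairing: SBCn and SBC(n+48) both use QCBCn."""
    return (sbc - 1) % 48 + 1


def unexpected_fractions(pairs, pool_group):
    """Return {(sbc, qcbc): fraction} for pairs not expected in this pool.

    The fraction is the share of reads carrying that QCBC which also
    carry the unexpected SBC.
    """
    counts = Counter(pairs)
    per_qcbc = Counter()
    for (sbc, qcbc), n in counts.items():
        per_qcbc[qcbc] += n
    result = {}
    for (sbc, qcbc), n in counts.items():
        is_expected = (group_of(sbc) == pool_group
                       and expected_qcbc(sbc) == qcbc)
        if not is_expected:
            result[(sbc, qcbc)] = n / per_qcbc[qcbc]
    return result


if __name__ == "__main__":
    # Hypothetical pool 1 data: mostly SBC10/QCBC10, a few SBC66/QCBC10.
    pool1 = [(10, 10)] * 9990 + [(66, 10)] * 10
    print(unexpected_fractions(pool1, pool_group=1))
```

The resulting fractions can then be compared with a chosen limit, such
as 0.1% or 0.01%.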
[0057] If a sample identifier from Group 2 is observed in Group 1
(for example, if the sequence of SBC66 is found in the sequencing
information for Group 1), this indicates contamination of one of
the sample barcodes in Group 1, as there are 49 sample identifiers
rather than the expected 48. However, this knowledge alone does not
indicate which of the sample identifiers in Group 1 was
contaminated with SBC66. The specific sample barcode contaminated
is determined based on which assay identifier is associated with
the contaminating SBC66. If the SBC66 found in the first pool is
associated with QCBC10, then SBC10 is the sample identifier that
was contaminated with SBC66. Whichever sample identifier in Group 1
has the same assay identifier associated with it as the
contaminating sample identifier, that is the sample identifier that
is contaminated.
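
The reasoning of this paragraph can be sketched as follows. The
sketch assumes reads reduced to (SBC number, QCBC number) tuples and
the pairing in which SBCn and SBC(n+48) share QCBCn; the function
name is hypothetical. For each sample identifier that belongs to the
other group, the associated assay identifier points to the sample
identifier in the pool that was contaminated.

```python
def group_of(sbc):
    """Return 1 for SBC1..SBC48 and 2 for SBC49..SBC96."""
    return 1 if sbc <= 48 else 2


def contaminated_sample_identifiers(pool_pairs, pool_group):
    """Map each foreign SBC in a pool to the in-pool SBC(s) it contaminated.

    A foreign SBC is one that belongs to the other group. The QCBC it
    carries identifies the reaction, and hence the in-pool SBC, in which
    it was amplified.
    """
    findings = {}
    for sbc, qcbc in set(pool_pairs):
        if group_of(sbc) != pool_group:
            # Assumed pairing: QCBCn serves SBCn in pool 1, SBC(n+48) in pool 2.
            host = qcbc if pool_group == 1 else qcbc + 48
            findings.setdefault(sbc, set()).add(host)
    return findings


if __name__ == "__main__":
    # Hypothetical pool 1 data in which SBC66 appears with QCBC10.
    pool1 = [(10, 10)] * 1000 + [(11, 11)] * 1000 + [(66, 10)] * 2
    print(contaminated_sample_identifiers(pool1, pool_group=1))
```

With the hypothetical data shown, the sketch reports that SBC66 was
found in the reaction of SBC10, matching the example in the text.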
[0058] Additionally, the present methods, compositions and kits can
also detect contamination within a pool by identifying sample
identifiers that are associated with more than one assay identifier
and/or by identifying assay identifiers that are associated with
more than one sample identifier. If sequence information indicates
the presence of amplicons having SBC13 and QCBC13, as well as
amplicons having SBC13 and QCBC29 (that is, SBC13 is associated
with QCBC13 and with QCBC29), this indicates there is some
contamination. However, this knowledge alone does not indicate
whether SBC29 was contaminated with SBC13, or whether QCBC13 was
contaminated with QCBC29. By identifying whether there is
contamination of the same assay identifier in the second pool, one
can identify the source of contamination. In the second pool, SBC61
will only be associated with QCBC13 in the absence of
contamination. However, if SBC61 is also associated with QCBC29,
this indicates that QCBC13 was contaminated, since the
contamination occurred in both pools. If SBC61 is not associated
with QCBC29, then QCBC13 is not contaminated, and SBC29 was the
source of contamination in the first pool. The same approach also
works for Group 1 sample identifiers present in the Group 2 pool.
The present methods provide the ability to differentiate between
contamination of a sample identifier and contamination of an assay
identifier using two sequencing pools.
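
The two-pool decision described above can be sketched in code. This
is a minimal illustration under stated assumptions: reads are reduced
to (SBC number, QCBC number) tuples, SBCn and SBC(n+48) share QCBCn,
and the unexpected pair involves a Group 1 sample identifier observed
in the first pool. The function name classify_contamination is
hypothetical.

```python
def expected_qcbc(sbc):
    """Assumed pairing: SBCn and SBC(n+48) both use QCBCn."""
    return (sbc - 1) % 48 + 1


def classify_contamination(unexpected_pair, pool2_pairs):
    """Decide the source of an unexpected (SBC, QCBC) pair seen in pool 1.

    unexpected_pair: a Group 1 SBC observed with a QCBC other than its own,
                     for example (13, 29).
    pool2_pairs:     iterable of (SBC, QCBC) pairs observed in pool 2.
    """
    sbc, foreign_qcbc = unexpected_pair
    if not 1 <= sbc <= 48:
        raise ValueError("sketch handles Group 1 sample identifiers only")
    own_qcbc = expected_qcbc(sbc)
    if foreign_qcbc == own_qcbc:
        raise ValueError("pair is the expected association")
    # Group 2 sample identifier amplified with the same assay identifier.
    partner_sbc = own_qcbc + 48
    if (partner_sbc, foreign_qcbc) in set(pool2_pairs):
        # The same foreign QCBC shows up in both pools: assay primer issue.
        return f"QCBC{own_qcbc} contaminated with QCBC{foreign_qcbc}"
    # Otherwise the Group 1 SBC amplified with the foreign QCBC is the host.
    return f"SBC{foreign_qcbc} contaminated with SBC{sbc}"


if __name__ == "__main__":
    clean_pool2 = [(61, 13)] * 1000
    dirty_pool2 = clean_pool2 + [(61, 29)] * 3
    print(classify_contamination((13, 29), clean_pool2))
    print(classify_contamination((13, 29), dirty_pool2))
```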
[0059] The present methods and compositions can also be used to
determine sequence variation of random nucleotides found between
two constant regions. The assay identifier can act as a standard
sample barcode and only one pool of samples would be required,
assuming sequencing output is sufficient to detect the level of
contamination desired. For instance, this assay can be used to
identify low levels of contamination occurring in sequences
where a small variable region exists between two constant regions
and may be beneficial for identifying contamination or variation in
oligonucleotides used for any intended applications.
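
Whether sequencing output is sufficient for a desired level of
contamination can be estimated with simple arithmetic. The sketch
below is not part of the original disclosure. It assumes that reads
are sampled independently, so that the probability of seeing at least
one contaminating read among N reads at contaminating fraction f is
1 - (1 - f)^N. The function names are hypothetical, and a practical
limit of detection would also need to account for sequencing and
amplification errors.

```python
import math


def detection_probability(fraction, reads):
    """Probability of at least one contaminating read among `reads` reads."""
    return 1.0 - (1.0 - fraction) ** reads


def reads_needed(fraction, confidence):
    """Smallest read count giving at least `confidence` detection probability."""
    if not (0.0 < fraction < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("fraction and confidence must lie strictly in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


if __name__ == "__main__":
    # Contamination levels of 0.1% and 0.01%, at an assumed 95% confidence.
    for fraction in (1e-3, 1e-4):
        n = reads_needed(fraction, 0.95)
        print(fraction, n, round(detection_probability(fraction, n), 4))
```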
[0060] The foregoing description of exemplary or preferred
embodiments should be taken as illustrating, rather than as
limiting, the present invention which is defined by the claims. As
will be readily appreciated, numerous variations and combinations
of the features set forth above can be utilized without departing
from the present invention as set forth in the claims. Such
variations are not regarded as a departure from the scope of the
invention, and all such variations are intended to be included
within the scope of the following claims. All references cited
herein are incorporated by reference in their entireties.
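Sequence Listing

Number of SEQ ID NOs: 8

SEQ ID NO: 1; Length: 20; Type: DNA; Organism: Artificial Sequence;
Feature: Synthetic Sequence
aatgatacgg cgaccaccga 20

SEQ ID NO: 2; Length: 24; Type: DNA; Organism: Artificial Sequence;
Feature: Synthetic Sequence
caagcagaag acggcatacg agat 24

SEQ ID NO: 3; Length: 33; Type: DNA; Organism: Artificial Sequence;
Feature: Synthetic Sequence
acactctttc cctacacgac gctcttccga tct 33

SEQ ID NO: 4; Length: 33; Type: DNA; Organism: Artificial Sequence;
Feature: Synthetic Sequence
gatcggaaga gcacacgtct gaactccagt cac 33

SEQ ID NO: 5; Length: 34; Type: DNA; Organism: Artificial Sequence;
Feature: Synthetic Sequence
gtgactggag ttcagacgtg tgctcttccg atct 34

SEQ ID NO: 6; Length: 33; Type: DNA; Organism: Artificial Sequence;
Feature: Synthetic Sequence
agatcggaag agcgtcgtgt agggaaagag tgt 33

SEQ ID NO: 7; Length: 30; Type: DNA; Organism: Artificial Sequence;
Feature: Synthetic Sequence
ccatctcatc cctgcgtgtc tccgactcag 30

SEQ ID NO: 8; Length: 23; Type: DNA; Organism: Artificial Sequence;
Feature: Synthetic Sequence
cctctctatg ggcagtcggt gat 23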
* * * * *