U.S. patent application number 13/132027 was published on 2011-11-24 for construction method and system of fragments assembling scaffold, and genome sequencing device.
Invention is credited to Xiaodong Fang, Peixiang Ni, Jian Wang, Jun Wang, Huanming Yang.
Application Number | 20110288845 13/132027 |
Document ID | / |
Family ID | 40976941 |
Publication Date | 2011-11-24 |
United States Patent
Application |
20110288845 |
Kind Code |
A1 |
Ni; Peixiang; et al. |
November 24, 2011 |
CONSTRUCTION METHOD AND SYSTEM OF FRAGMENTS ASSEMBLING SCAFFOLD,
AND GENOME SEQUENCING DEVICE
Abstract
The present invention relates to the gene engineering field, and
provides a genome sequencing device, construction method of
fragments assembling scaffold and system thereof. The method
comprises the following steps: mapping the double-barreled data
obtained through sequencing to contigs; calculating the mean length
between contigs based on multiple pairs of double-barreled data
mapped to contigs, which is taken as the gap size between contigs;
constructing scaffold based on gap size between contigs and the
double-barreled relation between contigs; and obtaining complete
scaffold graph. Since the mean length between contigs is calculated
from multiple pairs of double-barreled data and is taken as the gap
size between contigs, the estimation precision of gap size between
contigs is improved greatly. The method can be used in genome
sequencing, including sequencing with short read lengths, to
complete the task of assembling sequencing fragments.
Inventors: |
Ni; Peixiang; (Shenzhen,
CN) ; Fang; Xiaodong; (Shenzhen, CN) ; Wang;
Jun; (Shenzhen, CN) ; Yang; Huanming;
(Shenzhen, CN) ; Wang; Jian; (Shenzhen,
CN) |
Family ID: |
40976941 |
Appl. No.: |
13/132027 |
Filed: |
December 11, 2009 |
PCT Filed: |
December 11, 2009 |
PCT NO: |
PCT/CN2009/001428 |
371 Date: |
August 10, 2011 |
Current U.S.
Class: |
703/11 |
Current CPC
Class: |
G16B 30/00 20190201 |
Class at
Publication: |
703/11 |
International
Class: |
G06G 7/48 20060101
G06G007/48 |
Foreign Application Data
Date |
Code |
Application Number |
Dec 12, 2008 |
CN |
200810218342.5 |
Claims
1. A construction method of fragments assembling scaffold,
comprising: mapping the double-barreled data obtained through
sequencing to contigs; obtaining the gap size between said contigs
based on said double-barreled data mapped to said contigs;
constructing fragments assembling scaffold based on the gap size
between contigs and the double-barreled relationship between
contigs, and obtaining fragments assembling scaffold graph.
2. The method according to claim 1, wherein the step of obtaining
the gap size between said contigs based on said double-barreled
data mapped to said contigs comprises: calculating the mean or
median length between contigs based on multiple pairs of
double-barreled data mapped to contigs, which is taken as the gap
size between contigs.
3. The method according to claim 1, further comprising the step of:
detecting repeat contigs in said fragments assembling scaffold
graph, and masking the detected repeat contigs.
4. The method according to claim 3, wherein said repeat contig is
linked in one direction to a plurality of contigs that have
overlaps.
5. The method according to claim 1, further comprising the step of:
linearizing the fragments assembling scaffold graph based on the
gap size between contigs and the double-barreled relationship
between contigs in the fragments assembling scaffold graph.
6. The method according to claim 5, further comprising the step of:
recalculating the gap size between contigs in the linearized
fragments assembling scaffold graph.
7. The method according to claim 3 or claim 4, further comprising
the step of: when the masked repeat contig locates between two
unique contigs, recovering the masked repeat contig.
8. A construction system of fragments assembling scaffold
comprising: a double-barreled data mapping unit for mapping the
double-barreled data obtained through sequencing to contigs; a gap
size obtaining unit for obtaining the gap size between said contigs
based on said double-barreled data mapped to said contigs; a
scaffold construction unit for constructing fragments assembling
scaffold based on the gap size between contigs and the
double-barreled relation between contigs, and obtaining fragments
assembling scaffold graph.
9. The system according to claim 8, further comprising: a repeat
contig masking unit for detecting repeat contigs in the fragments
assembling scaffold graph, and masking the detected repeat
contigs.
10. The system according to claim 9, further comprising: a
linearization unit for linearizing the fragments assembling
scaffold graph based on the gap size between contigs and the
double-barreled relationship between contigs in the fragments
assembling scaffold graph.
11. The system according to claim 10, wherein the gap size
obtaining unit is also used for recalculating the gap size between
contigs in the linearized fragments assembling scaffold graph.
12. The system according to claim 9, further comprising: a repeat
contig recovering unit for recovering the masked repeat contig when
the masked repeat contig locates between two unique contigs.
13. The system according to claim 8, wherein the gap size obtaining
unit calculates the mean or median length between contigs based on
multiple pairs of double-barreled data mapped to contigs, which is
taken as the gap size between contigs.
14. A genome sequencing device comprising the construction system
of fragments assembling scaffold according to any one of claims
8-13.
Description
FIELD OF THE INVENTION
[0001] The present invention relates to gene engineering field, in
particular a construction method and system of fragments assembling
scaffold and a genome sequencing device.
BACKGROUND OF INVENTION
[0002] Genomics studies and analyzes the whole genetic information
of an organism in order to understand the mechanism and function of
that information. One basic step in genomics is to obtain the whole
sequence of an organism. Currently, there are First-Generation
Sequencing Methods such as whole genome shotgun sequencing (Sanger
Method), as well as Second-Generation Sequencing Methods such as
the Solexa, SOLiD and 454 methods.
[0003] The Sanger Method is briefly described as follows: the whole
genome is broken up into small DNA fragments of varying length to
construct the shotgun library; the shotgun library is randomly
sequenced; the sequenced fragments are then assembled into a whole
genome sequence by bioinformatics methods. This method is
characterized by long sequencing reads.
[0004] The Solexa Method is briefly described as follows: the whole
genome is broken up into DNA fragments of 100-200 bp. Then an
adaptor is linked to the DNA fragments and a library is obtained by
polymerase chain reaction. The adaptor linked DNA fragment is
subsequently immobilized to adaptor linked flow cell. After
reaction, different DNA fragments are amplified. In the next step,
a sequencing-by-synthesis step is performed using 4 fluorescence
labeled dyes. This method is characterized by high throughput, low
cost, low sequencing error and short sequencing reads.
[0005] The construction of the fragments assembling scaffold has
always been the key step in de novo assembly. It is used for
determining the position relationship between contigs and for
building the basic framework of the genome assembly. The quality of
this process directly affects the final whole genome sequence. The
current construction method of scaffold links those sequencing
fragments that have overlaps, so as to finish the task of
assembling sequencing fragments. In the case of short sequencing
reads, the overlaps between sequencing fragments are relatively
short, which leads to low precision for the current construction
method of scaffold. Given that Second-Generation Sequencing Methods
such as the Solexa, SOLiD and 454 methods have shorter sequencing
reads than First-Generation Sequencing Methods, the current
construction method of scaffold can hardly be applied to
Second-Generation Sequencing Methods to finish the task of
assembling sequencing fragments.
SUMMARY OF THE INVENTION
[0006] A first object of the present invention is to provide a
construction method of fragments assembling scaffold to solve the
above mentioned problem.
[0007] In one aspect, the present invention provides a construction
method of fragments assembling scaffold, the method comprising the
steps of:
[0008] mapping the double-barreled data (paired-end information)
obtained through sequencing to contigs;
[0009] obtaining the gap size between said contigs based on said
double-barreled data mapped to said contigs;
[0010] constructing fragments assembling scaffold based on the gap
size between contigs and the double-barreled relation between
contigs, and obtaining fragments assembling scaffold graph.
[0011] A second object of the present invention is to provide a
construction system of fragments assembling scaffold, the system
comprising:
[0012] a double-barreled data mapping unit for mapping the
double-barreled data obtained through sequencing to contigs;
[0013] a gap size obtaining unit for obtaining the gap size between
said contigs based on said double-barreled data mapped to said
contigs;
[0014] a scaffold construction unit for constructing fragments
assembling scaffold based on the gap size between contigs and the
double-barreled relation between contigs, and obtaining fragments
assembling scaffold graph.
[0015] Another object of the present invention is to provide a
genome sequencing device comprising the above construction system
of fragments assembling scaffold.
[0016] In the embodiments of the present invention, by mapping the
double-barreled data obtained through sequencing to contigs and
obtaining the gap size between said contigs based on multiple pairs
of double-barreled data between said contigs, the estimated
precision of gap size between contigs in the construction of
fragments assembling scaffold is improved greatly. Then the
fragments assembling scaffold is constructed based on the gap size
between contigs and the double-barreled relation between contigs,
and a complete fragments assembling scaffold graph is obtained. As
such, even if a genome sequencing method with short sequencing
reads is used, it is also possible to finish the task of assembling
sequencing fragments. Meanwhile, the error rate in assembling
sequencing fragments is reduced.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] FIG. 1 is a flow chart of one example of the construction
method of fragments assembling scaffold of the present
invention.
[0018] FIG. 2 is a flow chart of another example of the
construction method of fragments assembling scaffold of the present
invention.
[0019] FIG. 3 is a representative diagram of a scaffold graph
constructed by mapping the double-barreled data to contigs.
[0020] FIG. 4 is a diagram showing the masking of repeat
contigs.
[0021] FIGS. 5a and 5b are representative diagrams of a linearized
scaffold graph.
[0022] FIG. 6 is a diagram showing the recovery of repeat
contigs.
[0023] FIG. 7 is a diagram of one example of the construction
system of fragments assembling scaffold of the present
invention.
[0024] FIG. 8 is a diagram of another example of the construction
system of fragments assembling scaffold of the present
invention.
DETAILED DESCRIPTION OF THE INVENTION
[0025] To make the objects, technical solutions and advantages of
the present invention clearer and easier to understand, the
present invention is further described below in detail with
reference to the accompanying Figures and examples. It should be
understood that the described embodiments only illustrate the
present invention and are not intended to be limiting. In the
Figures, identical reference numerals indicate the same or similar
components or elements.
[0026] In the embodiments of the present invention, by mapping the
double-barreled data obtained through sequencing to contigs and
calculating the gap size between said contigs based on multiple
pairs of double-barreled data, the fragments assembling scaffold is
constructed based on the gap size between contigs and the
double-barreled relation between contigs, and a complete fragments
assembling scaffold graph is obtained.
[0027] FIG. 1 shows the flow chart of one example of the
construction method of fragments assembling scaffold of the present
invention.
[0028] Referring to FIG. 1, in step 102, double-barreled data (also
referred to as double-barreled reads) obtained through sequencing
is mapped to contigs.
[0029] In the examples of the present invention, the genome may be
sequenced through various sequencing methods, such as the First
Generation Sequencing Method or the Second Generation Sequencing
Method, thereby obtaining multiple short sequences having
double-barreled relationships (designated as double-barreled data).
In one example of the present invention, the genome is sequenced
through the Second Generation Sequencing Method, which is
characterized by high throughput and short sequencing reads,
thereby reducing the complexity of the construction method of
scaffold.
[0030] Various mapping programs may be used to map double-barreled
data obtained through sequencing to contigs, such as SOAP, ELAND,
MAQ or BLAT. Upon mapping the double-barreled data obtained through
sequencing to contigs, the positions and orientations of the
double-barreled data on the contigs are obtained.
[0031] For a case where double-barreled data obtained through
sequencing is reads1 and reads1', reads2 and reads2', reads3 and
reads3', FIG. 3 shows a representative scaffold graph after mapping
the double-barreled data to contigs.
[0032] In step 104, the gap size between the contigs is obtained
based on the double-barreled data mapped to the contigs.
[0033] Two contigs are linked together via one pair of
double-barreled data. The gap length between two contigs can be
calculated based on each pair of double-barreled data mapped to the
contigs. If there are multiple pairs of double-barreled data
between two contigs, a gap length is calculated from each pair, and
the median or mean gap length is taken as the final gap size
between the two contigs.
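As a minimal sketch of the calculation in paragraph [0033], the Python below derives one gap estimate per read pair and combines the estimates by median or mean. The per-pair geometry is an assumption and is not taken from the source: the library insert size is treated as known, read 1 lies on the upstream contig, read 2 lies on the downstream contig, and coordinates are 0-based. The names pair_gap and gap_size and all numbers are hypothetical.

```python
from statistics import mean, median

def pair_gap(insert_size, contig1_len, read1_start, read2_end):
    """Gap implied by one read pair (assumed geometry, see lead-in)."""
    # Part of the sequenced fragment that lies on the upstream contig:
    # from the start of read 1 to the end of that contig.
    on_contig1 = contig1_len - read1_start
    # Part of the fragment that lies on the downstream contig:
    # from the start of that contig to the end of read 2.
    on_contig2 = read2_end
    return insert_size - on_contig1 - on_contig2

def gap_size(pair_gaps, use_median=True):
    """Combine per-pair gap estimates into a single gap size."""
    if not pair_gaps:
        raise ValueError("no double-barreled pairs link these contigs")
    return median(pair_gaps) if use_median else mean(pair_gaps)

# Three hypothetical pairs linking a 1000 bp contig to its neighbor,
# sequenced from a library with a 500 bp insert size.
gaps = [
    pair_gap(500, 1000, 800, 100),  # 500 - 200 - 100 = 200
    pair_gap(500, 1000, 850, 120),  # 500 - 150 - 120 = 230
    pair_gap(500, 1000, 900, 230),  # 500 - 100 - 230 = 170
]
final_gap = gap_size(gaps)
```

The median is less sensitive to a single mis-mapped pair than the mean, which may be one reason the source allows either statistic.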
[0034] In one example of the present invention, the number of pairs
of double-barreled data spanning two contigs is recorded and marked
as the weight. A particular threshold is chosen as appropriate.
Links whose weight is higher than the threshold are considered
effective links, in order to increase the accuracy of the linking
relationship.
[0035] In the examples of the present invention, the respective gap
sizes between two contigs are calculated based on multiple pairs of
double-barreled data between those contigs, and the mean gap size
is taken as the gap size between them. Referring to FIG. 3, when
there are 3 pairs of double-barreled data between contig 1 and
contig 2, 3 gap sizes are calculated based on these 3 pairs of
double-barreled data. The mean of these 3 gap sizes is taken as the
gap size between contig 1 and contig 2. The gap sizes between all
contigs having double-barreled relationships are determined in the
same way, with the mean gap size taken as the gap size between each
pair of contigs. Meanwhile, the number of pairs of double-barreled
data between two contigs is recorded and marked as the weight (the
number of pairs of double-barreled data between contig1 and contig2
is 3). When the weight is higher than the predefined threshold, the
link between contig1 and contig2 is considered an effective link,
in order to increase the accuracy of the linking relationship.
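The weighting rule described above can be sketched as follows. This is an illustrative sketch only: the function name effective_links, the tuple layout of the input and the example threshold of 2 are assumptions, and the source does not prescribe a particular threshold.

```python
from collections import defaultdict

def effective_links(pair_links, threshold):
    """Keep contig links supported by more than `threshold` read pairs.

    pair_links: iterable of (contig_a, contig_b, gap) tuples, one per
    pair of double-barreled data (hypothetical input layout).
    """
    per_link = defaultdict(list)
    for contig_a, contig_b, gap in pair_links:
        per_link[(contig_a, contig_b)].append(gap)

    links = {}
    for key, gaps in per_link.items():
        weight = len(gaps)          # number of supporting pairs
        if weight > threshold:      # "higher than the threshold"
            links[key] = {"weight": weight, "gap": sum(gaps) / weight}
    return links

pairs = [
    ("contig1", "contig2", 200),
    ("contig1", "contig2", 230),
    ("contig1", "contig2", 170),
    ("contig2", "contig3", 50),   # supported by a single pair only
]
links = effective_links(pairs, threshold=2)
```

With these hypothetical inputs only the contig1 to contig2 link survives, with weight 3 and a mean gap of 200.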
[0036] If the gap size between two contigs calculated based on one
pair of double-barreled data is $X_i$, which follows a normal
distribution $N(\mu, \sigma^2)$, in which $\mu$ denotes the
expected value while $\sigma^2$ denotes the variance, then the mean
gap size between contigs calculated from $N$ pairs of
double-barreled data follows the distribution
$N(\mu, \sigma^2/N)$. As such, when the covering degree of
double-barreled data on the contigs is high, the estimation
precision of gap size between contigs would be improved greatly.
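The step from single-pair estimates to the distribution of their mean can be written out as follows, under the additional assumption (implicit in the source but not stated there) that the N per-pair estimates are independent:

```latex
\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad
\mathrm{E}[\bar{X}] = \frac{1}{N}\sum_{i=1}^{N}\mathrm{E}[X_i] = \mu, \qquad
\mathrm{Var}(\bar{X}) = \frac{1}{N^{2}}\sum_{i=1}^{N}\mathrm{Var}(X_i)
                      = \frac{\sigma^{2}}{N}
```

The standard deviation of the mean is therefore $\sigma/\sqrt{N}$. For instance, with a hypothetical per-pair standard deviation of 30 bp, averaging 9 pairs gives a standard deviation of 10 bp for the estimated gap size.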
[0037] In step 106, the scaffold between contigs is constructed
based on the gap size between contigs and the double-barreled
relationship between contigs, and a complete scaffold graph is
constructed based on each contig, wherein the double-barreled
relationship between contigs may be directly determined by the
position relationship given by raw experimental data.
[0038] Referring to FIG. 3, the gap size between contig1 and
contig2 is calculated from 3 pairs of double-barreled data between
contig1 and contig2 as shown in FIG. 3, then the scaffold between
contig1 and contig2 may be constructed based on the gap size
between contig1 and contig2 and the double-barreled relationship
between contig1 and contig2, as shown in FIG. 3. Similarly, the
scaffold of all contigs having double-barreled relationship may be
constructed based on the gap size of all contigs having
double-barreled relationship and the double-barreled relationship
of all contigs having double-barreled relationship, thereby linking
all contigs having double-barreled relationship to obtain the
complete scaffold graph, as shown in FIG. 4.
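One possible in-memory form of such a scaffold graph is sketched below. The representation (a dictionary of forward and reverse neighbor maps) and the name build_scaffold_graph are assumptions made for illustration; the source does not specify a data structure.

```python
from collections import defaultdict

def build_scaffold_graph(links):
    """Build a scaffold graph from effective links.

    links: dict mapping (upstream, downstream) contig names to the
    gap size between them (hypothetical input layout).
    """
    graph = defaultdict(lambda: {"forward": {}, "reverse": {}})
    for (upstream, downstream), gap in links.items():
        # The downstream contig is a forward neighbor of the upstream one,
        # and the upstream contig is a reverse neighbor of the downstream one.
        graph[upstream]["forward"][downstream] = gap
        graph[downstream]["reverse"][upstream] = gap
    return graph

graph = build_scaffold_graph({
    ("contig1", "contig2"): 200,
    ("contig2", "contig3"): 50,
})
```

Storing neighbors per direction keeps the later steps simple, since repeat detection only needs to inspect the neighbors on one side of a contig at a time.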
[0039] FIG. 2 shows the flow chart of another example of the
construction method of fragments assembling scaffold of the present
invention.
[0040] As shown in FIG. 2, in step 202, double-barreled data
obtained through sequencing is mapped to contigs.
[0041] In step 204, the mean gap length between the contigs is
calculated based on multiple pairs of double-barreled data mapped
to the contigs, which is taken as the gap size between the
contigs.
[0042] In step 206, a scaffold graph is constructed based on the
gap size between contigs and the double-barreled relationship
between contigs.
[0043] In step 208, the constructed scaffold graph is checked for
repeat contigs. The detected repeat contigs are masked. It is
possible that there is a plurality of repeat contigs in the
scaffold graph constructed according to the above discussed method,
thereby reducing the accuracy of genome sequencing. By masking
repeat contigs in this step, the accuracy of genome sequencing
would be increased.
[0044] In the examples of the present invention, if one contig is
linked in one direction to a plurality of contigs that have
overlaps, then this contig is considered a repeat contig. Repeat
contigs are masked upon detection.
[0045] For a scaffold constructed as shown in FIG. 4, contig R is
linked to contig A and contig B in the reverse direction, and there
is overlap between contig A and contig B. Meanwhile, contig R is
linked to contig D, contig E and contig F in the forward direction,
and there is overlap between contig E and contig F. Contig R is
therefore a repeat contig, which would be masked.
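The repeat criterion can be sketched as an interval test. The approach below is an assumption about how overlap might be decided: each neighbor is placed at a distance equal to its gap size from contig R, and two neighbors on the same side are said to overlap when their implied intervals intersect. The contig lengths and gap sizes are hypothetical and are not read from FIG. 4.

```python
from itertools import combinations

def intervals_overlap(a, b):
    """True when half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def is_repeat(neighbors_by_direction, lengths):
    """neighbors_by_direction: {"forward": {name: gap}, "reverse": {name: gap}}."""
    for neighbors in neighbors_by_direction.values():
        # Interval occupied by each neighbor, measured outward from the contig.
        spans = [(gap, gap + lengths[name]) for name, gap in neighbors.items()]
        if any(intervals_overlap(s, t) for s, t in combinations(spans, 2)):
            return True
    return False

lengths = {"A": 500, "B": 400, "D": 300, "E": 600, "F": 500}

# Contig R: A and B overlap on the reverse side, E and F on the forward side.
r_links = {
    "reverse": {"A": 100, "B": 300},
    "forward": {"D": 50, "E": 400, "F": 700},
}
# A unique contig: its forward neighbors D and E do not overlap.
u_links = {"reverse": {}, "forward": {"D": 50, "E": 400}}

r_is_repeat = is_repeat(r_links, lengths)
u_is_repeat = is_repeat(u_links, lengths)
```

In practice some tolerance would likely be needed, because gap sizes are estimates; the sketch treats any intersection as an overlap.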
[0046] In order to obtain a scaffold of sufficient length within a
controllable error range and to determine the proper position
relationship of as many contigs as possible, in another example of
the present invention, the construction method of scaffold further
comprises the steps of:
[0047] in step 210, the scaffold graph is linearized based on the
gap size between contigs and the double-barreled relationship
between contigs.
[0048] In the examples of the present invention, when repeat
contigs are contained in the scaffold constructed in step 206, such
repeat contigs are masked via step 208, and the scaffold graph is
linearized after masking. When no repeat contig is contained in the
scaffold constructed in step 206, the scaffold graph is linearized
directly. The step of linearization is as follows:
[0049] placing each contig at the appropriate position in the
sub-graph based on the gap size between contigs and the
double-barreled relationship between contigs; and, if no
significant overlap exists between any two contigs, performing
linearization according to the position relationship of the
contigs.
[0050] For a scaffold as shown in FIG. 5a, wherein the gap size and
the double-barreled relationship between contig A and contig B, the
gap size and the double-barreled relationship between contig E and
contig D, the gap size and the double-barreled relationship between
contig A and contig E, the gap size and the double-barreled
relationship between contig E and contig C are known, the linear
structural relationship would be deduced therefrom as AEBCD. In
other words, the scaffold graph as shown in FIG. 5a can be
linearized as the scaffold graph as shown in FIG. 5b directly.
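A minimal sketch of this placement step is given below. It assumes each link records an upstream contig, a downstream contig and an edge-to-edge gap, and it assigns start coordinates by walking the links outward from a chosen starting contig. The function names, the contig lengths and the gap sizes are hypothetical; only the set of links (A to B, E to D, A to E, E to C) follows paragraph [0050].

```python
from collections import deque

def place_contigs(lengths, links, start):
    """Assign a start coordinate to every contig reachable from `start`.

    links: dict mapping (upstream, downstream) to the gap between them.
    """
    pos = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for (up, down), gap in links.items():
            if up == cur and down not in pos:
                # Downstream contig starts after the upstream contig plus the gap.
                pos[down] = pos[up] + lengths[up] + gap
                queue.append(down)
            elif down == cur and up not in pos:
                # Upstream contig ends one gap before the downstream contig starts.
                pos[up] = pos[down] - gap - lengths[up]
                queue.append(up)
    return pos

def overlapping_pairs(pos, lengths, tolerance=0):
    """Adjacent contigs (by start coordinate) overlapping by more than `tolerance`."""
    order = sorted(pos, key=pos.get)
    return [(a, b) for a, b in zip(order, order[1:])
            if pos[a] + lengths[a] - pos[b] > tolerance]

lengths = {name: 100 for name in "ABCDE"}
links = {("A", "B"): 200, ("E", "D"): 350, ("A", "E"): 50, ("E", "C"): 200}

pos = place_contigs(lengths, links, start="A")
order = "".join(sorted(pos, key=pos.get))
conflicts = overlapping_pairs(pos, lengths)
```

With these hypothetical values the contigs start at 0, 150, 300, 450 and 600, giving the order AEBCD with no overlapping neighbors, so the sub-graph can be linearized.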
[0051] The gap size between contigs in the scaffold might be
changed due to the linearization of the scaffold graph. In order to
present the gap size between contigs in the linearized scaffold
graph accurately, the construction method of scaffold of the
present invention further comprises:
[0052] recalculating the gap size between contigs in the linearized
scaffold graph.
[0053] The step of recalculating the gap size between contigs in
the linearized scaffold graph comprises: based on the position
relationship of the contigs in the linearized scaffold graph,
recalculating the gap size between each two adjacent contigs; and
relinking adjacent contigs, thereby converting the scaffold graph
into a true linear structure. Referring to FIGS. 5a and 5b, after
converting the linking relationship of AB, AC, EC, ED of FIG. 5a
into the linking relationship of AE, EB, BC, CD of FIG. 5b, the gap
size between each pair of adjacent contigs is calculated from the
gap sizes that have already been obtained. For example, the gap
size of AE can be calculated simply as AE=AC-EC.
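The recalculation can be sketched as follows, assuming start coordinates have already been assigned to the contigs of the linearized scaffold. All coordinates and lengths are hypothetical. One caveat on conventions: the sketch measures every gap edge to edge (end of one contig to start of the next), so deriving AE from AC and EC also subtracts the length of contig E. If EC is instead measured from the start of contig E, the relation holds exactly as the source writes it, AE=AC-EC.

```python
def adjacent_gaps(order, pos, lengths):
    """Edge-to-edge gap between each pair of neighboring contigs in `order`."""
    gaps = {}
    for left, right in zip(order, order[1:]):
        gaps[(left, right)] = pos[right] - (pos[left] + lengths[left])
    return gaps

lengths = {name: 100 for name in "ABCDE"}
pos = {"A": 0, "E": 150, "B": 300, "C": 450, "D": 600}
order = sorted(pos, key=pos.get)          # ['A', 'E', 'B', 'C', 'D']

gaps = adjacent_gaps(order, pos, lengths)

# Original (non-adjacent) links of FIG. 5a under the same convention.
gap_ac = pos["C"] - (pos["A"] + lengths["A"])   # 350
gap_ec = pos["C"] - (pos["E"] + lengths["E"])   # 200
gap_ae_from_links = gap_ac - gap_ec - lengths["E"]
```

Here the gap recomputed from coordinates and the gap derived from the existing links AC and EC agree at 50 bp.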
[0054] After masking of repeat contigs in the scaffold graph and
linearization of the sub-graph, it is possible that a previously
masked repeat contig is located between two unique contigs, since
the gap sizes between contigs in the scaffold have changed. In this
case, in order to reduce the internal gap size in the scaffold and
allow the scaffold to be filled as much as possible, the
construction method of scaffold further comprises the steps of:
[0055] in step 212, when a masked repeat contig is located between
two unique contigs, recovering the masked repeat contig.
[0056] FIG. 6 shows the scaffold graph obtained after step 208 and
step 210. If the previously masked contig R is located between
unique contig A and unique contig D of the scaffold graph, the
previously masked contig R would be recovered directly.
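The recovery rule can be illustrated with the sketch below. The test used here, that a masked repeat is recovered when its estimated interval fits entirely inside the gap between two adjacent unique contigs, is an assumption; the source only states that the repeat is located between two unique contigs. All names and coordinates are hypothetical.

```python
def recover_repeats(scaffold, masked):
    """Re-insert masked repeats that fall between adjacent unique contigs.

    scaffold: list of (name, start, end) for unique contigs, sorted by start.
    masked:   list of (name, start, end) for masked repeat contigs.
    """
    recovered = list(scaffold)
    for name, start, end in masked:
        for (_, _, left_end), (_, right_start, _) in zip(scaffold, scaffold[1:]):
            # The repeat must lie wholly inside the gap between the two contigs.
            if left_end <= start and end <= right_start:
                recovered.append((name, start, end))
                break
    return sorted(recovered, key=lambda contig: contig[1])

scaffold = [("A", 0, 500), ("D", 900, 1200)]
masked = [("R", 600, 800), ("S", 1300, 1500)]

result = recover_repeats(scaffold, masked)
names = [name for name, _, _ in result]
```

In this example contig R is recovered between A and D, while the hypothetical repeat S stays masked because it lies beyond the last unique contig.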
[0057] FIG. 7 shows a diagram of one example of the construction
system of fragments assembling scaffold of the present invention.
As shown in FIG. 7, the construction system of fragments assembling
scaffold comprises a double-barreled data mapping unit 71; a gap
size obtaining unit 72; and a scaffold (fragments assembling
scaffold) construction unit 73, wherein:
[0058] the double-barreled data mapping unit 71 is used for mapping
the double-barreled data obtained through sequencing to contigs. In
the examples of the present invention, the genome may be sequenced
through various sequencing methods, such as the First Generation
Sequencing Method or the Second Generation Sequencing Method,
thereby obtaining multiple short sequences having double-barreled
relationships (designated as double-barreled data). In one example
of the present invention, the genome is sequenced through the
Second Generation Sequencing Method, which is characterized by high
throughput and short sequencing reads, thereby reducing the
complexity of the construction method of scaffold. Various mapping
programs may be used to map double-barreled data obtained through
sequencing to contigs, such as SOAP, ELAND, MAQ or BLAT. Upon
mapping the double-barreled data obtained through sequencing to
contigs, the positions and orientations of the double-barreled data
on the contigs are obtained. FIG. 3 shows a representative scaffold
graph after mapping the double-barreled data to contigs.
[0059] The gap size obtaining unit 72 is used for obtaining the gap
size between the contigs based on the double-barreled data mapped
to the contigs. For example, the mean or median gap length
calculated from multiple pairs of double-barreled data mapped to
the contigs is taken as the gap size between two contigs. In
addition, the number of pairs of double-barreled data spanning two
contigs is recorded and marked as the weight.
[0060] In the examples of the present invention, if the gap size
between two contigs calculated based on one pair of double-barreled
data is $X_i$, which follows a normal distribution
$N(\mu, \sigma^2)$, in which $\mu$ denotes the expected value while
$\sigma^2$ denotes the variance, then the mean gap size between
contigs calculated from $N$ pairs of double-barreled data follows
the distribution $N(\mu, \sigma^2/N)$. As such, when the covering
degree of double-barreled data on the contigs is high, the
estimation precision of gap size between contigs would be improved
greatly.
[0061] The scaffold construction unit 73 is used for constructing
fragments assembling scaffold based on the gap size between contigs
and the double-barreled relation between contigs, and obtaining
fragments assembling scaffold graph, wherein the double-barreled
relationship between contigs may be directly determined by the
position relationship given by raw experimental data.
[0062] Referring to FIG. 3, the gap size between contig1 and
contig2 is calculated from 3 pairs of double-barreled data between
contig1 and contig2 as shown in FIG. 3, then the scaffold between
contig1 and contig2 may be constructed based on the gap size
between contig1 and contig2 and the double-barreled relationship
between contig1 and contig2, as shown in FIG. 3. Similarly, the
scaffold of all contigs having double-barreled relationship may be
constructed based on the gap size of all contigs having
double-barreled relationship and the double-barreled relationship
of all contigs having double-barreled relationship, thereby linking
all contigs having double-barreled relationship to obtain the
complete scaffold graph, as shown in FIG. 4.
[0063] FIG. 8 shows a diagram of another example of the
construction system of fragments assembling scaffold of the present
invention. As shown in FIG. 8, the construction system of fragments
assembling scaffold comprises a double-barreled data mapping unit
71; a gap size obtaining unit 72; and a scaffold construction unit
73; and optionally, a repeat contig masking unit 84; a
linearization unit 85 and a repeat contig recovering unit 86,
wherein the double-barreled data mapping unit 71, the gap size
obtaining unit 72 and the scaffold construction unit 73 are the
same as those in FIG. 7 and are described above.
[0064] It is possible that there is a plurality of repeat contigs
in the scaffold graph constructed by the scaffold construction unit
73, thereby reducing the accuracy of genome sequencing. In order to
increase the accuracy of genome sequencing, in another example of
the present invention, the construction system of scaffold further
comprises a repeat contig masking unit 84. The repeat contig
masking unit 84 detects and masks repeat contigs in the scaffold
graph. In the examples of the present invention, if one contig is
linked in one direction to a plurality of contigs that have
overlaps, then this contig is considered a repeat contig.
[0065] In order to obtain a scaffold of sufficient length within a
controllable error range and to determine the proper position
relationship of as many contigs as possible, in another example of
the present invention, the construction system of scaffold further
comprises a linearization unit 85. In the linearization unit 85,
the scaffold graph is linearized based on the gap size between
contigs and the double-barreled relationship between contigs. The
step of linearization is as follows: placing each contig at the
appropriate position in the sub-graph based on the gap size between
contigs and the double-barreled relationship between contigs; and,
if no significant overlap exists between any two contigs,
performing linearization according to the position relationship of
the contigs.
[0066] The gap size between contigs in the scaffold might be
changed due to the linearization of the scaffold graph. In order to
present the gap size between contigs in linearized scaffold graph
accurately, in another example of the present invention, in the gap
size obtaining unit 72, the gap size between contigs in the
linearized scaffold graph will be recalculated.
[0067] The step of recalculating the gap size between contigs in
the linearized scaffold graph comprises: based on the position
relationship of the contigs in the linearized scaffold graph,
recalculating the gap size between each two adjacent contigs; and
relinking adjacent contigs, thereby converting the scaffold graph
into a true linear structure. Referring to FIGS. 5a and 5b, after
converting the linking relationship of AB, EC, AC, ED of FIG. 5a
into the linking relationship of AE, EB, BC, CD of FIG. 5b, the gap
size between each pair of adjacent contigs is calculated from the
gap sizes that have already been obtained. For example, the gap
size of AE can be calculated simply as AE=AC-EC.
[0068] After masking of repeat contigs in the scaffold graph and
linearization of the sub-graph, it is possible that a previously
masked repeat contig is located between two unique contigs, since
the gap sizes between contigs in the scaffold have changed. In this
case, in order to reduce the internal gap size in the scaffold and
allow the scaffold to be filled as much as possible, the
construction system of scaffold further comprises a repeat contig
recovering unit 86. In the repeat contig recovering unit 86, when a
masked repeat contig is located between two unique contigs, the
masked repeat contig is recovered.
[0069] FIG. 6 shows the scaffold graph obtained in the scaffold
construction unit 73. If the previously masked contig R is located
between unique contig A and unique contig D of the scaffold graph,
the previously masked contig R would be recovered directly.
[0070] It is to be noted that, although the repeat contig masking
unit 84, the linearization unit 85 and the repeat contig recovering
unit 86 are shown simultaneously in FIG. 8, one skilled in the art
would understand that, in addition to the double-barreled data
mapping unit 71, the gap size obtaining unit 72 and the scaffold
construction unit 73, the construction system of fragments
assembling scaffold could comprise only the repeat contig masking
unit 84 or the linearization unit 85; or both the repeat contig
masking unit 84 and the linearization unit 85; or simultaneously
the repeat contig masking unit 84, the linearization unit 85 and
the repeat contig recovering unit 86.
[0071] For better understanding, the above description only shows
the relevant part of the examples of the present invention. One
skilled in the art would understand that the construction system of
scaffold may be a software unit, a hardware unit or a combined
software and hardware unit within a genome sequencing device; or
may otherwise be integrated as an independent component into a
genome sequencing device or an application system of a genome
sequencing device.
[0072] In the embodiments of the present invention, by mapping the
double-barreled data obtained through sequencing to contigs and
obtaining the gap size between said contigs based on multiple pairs
of double-barreled data mapped to said contigs, the estimation
precision of gap size between contigs in the construction of
fragments assembling scaffold is improved greatly. Then the
fragments assembling scaffold is constructed based on the gap size
between contigs and the double-barreled relation between contigs,
and a complete fragments assembling scaffold graph is obtained. As
such, even if a genome sequencing method with short sequencing
reads is used, it is still possible to finish the task of
assembling sequencing fragments, and the error rate in assembling
sequencing fragments is reduced. In addition, by masking the repeat
contigs in the constructed scaffold graph, mis-assembly due to
repeat contigs is avoided, and therefore the accuracy of scaffold
construction is greatly improved. By linearization of the
constructed scaffold graph, the position relationship of contigs is
determined, and therefore the coverage length of the scaffold is
increased. By recovering masked repeat contigs, the information of
repeat contigs is used fully, and as many internal gaps in the
scaffold as possible are filled.
[0073] The above description is only a preferred embodiment of the
present invention, and is not intended to be limiting. Any
modifications, equivalent substitutions and improvements that are
made without departing from the spirit and principle of the
invention are contained within the scope of the present
invention.
* * * * *