U.S. patent number 4,771,384 [Application Number 06/889,981] was granted by the patent office on 1988-09-13 for system and method for fragmentation mapping.
This patent grant is currently assigned to Dnastar, Inc.. Invention is credited to Frederick R. Blattner, Donna L. Daniels, John L. Schroeder, Michael Waterman.
United States Patent |
4,771,384 |
Daniels , et al. |
September 13, 1988 |
**Please see images for:
( Certificate of Correction ) ** |
System and method for fragmentation mapping
Abstract
A system and method for the construction of one-dimensional maps
from fragmentation data is disclosed. Particularly useful for
construction of restriction maps of DNA, the system and method
completely permutes sites, single digest fragments, and any
available multiple digest fragments, and displays maps in
rank-order according to a quality factor. Display of constructed
maps includes information about relative ordering of all fragments,
sites, and particularly about closely-spaced sites and
fragments.
Inventors: |
Daniels; Donna L. (Cross
Plains, WI), Schroeder; John L. (Madison, WI), Blattner;
Frederick R. (Madison, WI), Waterman; Michael (Culver
City,, CA) |
Assignee: |
Dnastar, Inc. (Madison,
WI)
|
Family
ID: |
25396069 |
Appl.
No.: |
06/889,981 |
Filed: |
July 24, 1986 |
Current U.S.
Class: |
382/129; 204/456;
435/6.12; 702/20 |
Current CPC
Class: |
C12Q
1/683 (20130101); G01N 27/44717 (20130101) |
Current International
Class: |
C12Q
1/68 (20060101); G01N 27/447 (20060101); G05F
015/42 (); G06G 007/58 () |
Field of
Search: |
;364/413,497,498,499
;204/182.8,182.9 ;935/3,9,75,76 |
References Cited
[Referenced By]
U.S. Patent Documents
Foreign Patent Documents
|
|
|
|
|
|
|
3300632 |
|
Jul 1984 |
|
DE |
|
59-182366 |
|
Oct 1984 |
|
JP |
|
59-193355 |
|
Nov 1984 |
|
JP |
|
Other References
Pearson, William R., Automatic Construction of Restriction Site
Maps, Nucleic Acids Research, vol. 10 No. 1 1982. .
Nolan, Garry P. et al., Plasmid Mapping Computer Program, Nucleic
Acids Research, vol. 12 No. 1 1984. .
Durand and Gregegere, An Efficient Program to Construct Restriction
Maps from Experimental Data with Realistic Error Levels, Nucleic
Acids Research, vol. 12 No. 1 1984. .
Wulkan and Lott, Computer-Aided Construction of Nucleic Acid
Restriction Maps Using Defined Vectors, Cabios, vol. 1, No. 4 1985.
.
Fitch, Smith & Ralph, Mapping the Order of DNA Restriction
Fragments, Gene, vol. 22 1983..
|
Primary Examiner: Kucia; R. R.
Attorney, Agent or Firm: Isaksen, Lathrop, Esch, Hart &
Clark
Claims
What is claimed is:
1. A method for constructing fragmentation maps of molecules of DNA
comprising the steps of
(a) digesting molecules of the DNA with at least two digesting
agents which fragment the DNA at characteristics restriction sites,
said digesting being conducted with each agent separately as well
as with both agents together;
(b) analyzing the approximate length of the fragments created by
the digesting step; and
(c) entering the approximate length values for the fragments into a
digital computer programmed with the steps of
(i) permuting incrementally from a beginning fragment form
digestion by a first of the agents arrangements of additional sites
and fragments from digestion by a second and by both of the agents
to construct a plurality of hypotheses for correct additions to the
fragmentation map begun with the beginning fragment;
(ii) testing each hypothesis by computing local additive lengths of
the additions to said map hypothesis and testing for the existence
of additional fragments of a length corresponding to said computed
length;
(iii) rejecting each of said hypotheses for which no such
additional fragment is found of the correct length;
(iv) accepting incrementally each of said map hypotheses for which
said additional fragment is found of the correct length; and
(v) outputting all the accepted map hypotheses which utilize all
the fragments as possibly correct fragmentation maps.
2. The method of calim 1, wherein in step (i) for all sites
calculated from said digestion by the first agent, every
permutation of said sites each in combination with every
permutation of the fragments from the digestion by the first agent,
and each in combination with every permutation of the fragments
from the digestion by both agents is generated as an hypothesis for
a possibly correct fragmentation map.
3. The method of claim 1, wherein in step (i) every permutation of
the fragments from the digestion by the first agent in combination
with every permutation of said sites which combination is
consistent with the sizes of said fragments from the digestion by
the first agent within predetermined error limits, and each in
combination with every permutation of the fragments from the
digestion by both agents is generated as an hypothesis for a
possibly correct fragmentation map.
4. The method of claim 3 further including the step of
evaluating said permutations on the basis of consistency with the
length of the fragments from the digestion with both agents, and
rejecting any combination which is inconsistent with said data.
5. A method for constructing fragmentation maps of DNA molecules
comprising the steps of
(a) digesting the DNA molecules separately with two different
enzymes which cut the DNA at characteristic sites into fragments
and also digesting the DNA molecules together with both
enzymes;
(b) analyzing the approximate length of the fragments created by
the digesting step; and
(c) entering the approximate lengths of the fragments into a
digital computer programmed to perform the steps of
(i) generating each permutation of all fragments from digestion by
one enzyme in combiantion with all permutations of sites which are
consistent with the lengths of said fragments from digestion by one
enzyme within predetermined error limits,
(ii) computing the minimum and maximum possible sizes for each
interval between sites in said combinations by summing the lengths
and the possible error,
(iii) selecting fragments from the fragments from digestion by the
other enzyme and by both enzyme that are consistent with said
computed sizes within possible error,
(iv) generating all possible permutation of the order of the
fragments so selected, and
(v) evaluating the overall cumulative error of the fit of said
selected orders of fragments to provide an indication of overall
probability of said orders to a user.
6. A method of constructing fragmentation maps of DNA molecules
comprising the step of
(a) digesting the DNA molecules separately with at least first and
second digestion agents, and jointly with both agents, each agent
cutting the DNA molecule at a characteristic site;
(b) analyzing the lengths of the fragments created by the
digestions; and
(c) entering the lengths of the fragments into a digital computer
programmed to perform the steps of
(i) maintaining three parallel tentative maps of the order of the
fragments from the digestion by the first agent, by the second
agent, and by both agents,
(ii) beginning with a fragment from the digestion by the first
agent;
(iii) selecting a site for addition to the tentative maps, the site
being selected as the end of the shorter of the two maps from
digestion with one of the first and second agents,
(iv) tentatively adding to the site a fragment selected from the
fragments from the digest by the respective agent for that map, to
thus create a hypothesis,
(v) testing the hypothesis by testing for the existence of
fragments from the digestion by both agents for addition to the map
for both agents which is consistent with the hypothesis,
(vi) if the testing of the hypothesis fails, repeat steps (iv) and
(v) for each remaining fragment for the respective map,
(vii) if the testing of the hypothesis proceeds, repeat steps (iii)
through (vi) until all fragments from all digests are assigned to
one of the fragmentation maps, and
(viii) providing as an output to the user all sets of fragmentation
maps which use all fragments and which tested correctly.
7. The method of claim 6 wherein the computer is programmed before
step (viii) with the step of if no set of fragmentation maps is
generated which uses all fragments and tests correctly, then
selecting a different fragment from the digestion by the first
agent as the beginning in step (ii) and repeating step (iii)
through (vii).
8. The method of claim 6 wherein an error is assigned to the
lengths of each of the fragments in the step of analyzing, which
error is entered into the computer and wherein the computer is
further programmed to sum the combined error for each complete set
of fragmentation maps and to indicate to the user the relative
combined error for each such set.
9. The method of claim 6 wherien the computer is further programmed
to, as a result of the testing, to reject all additional hypotheses
which are mirror images or which are additions to each hypothesis
which fails the test in step (v).
10. The method of claim 6 wherein the user may input constraining
information into the computer to constrain the selection in steps
(ii), (iii) or (iv).
11. The method of claim 10 wherein the constraining information
comprises at least one of
a fragment occupying a defined position on a map,
a fragment is cut by a selected agent,
a fragment from the digestion by both agents is contained in a
certain fragment from the digestion by a single agent,
a fragment is adjacent a certain fragment,
a fragment from one digest is identical to a certain fragment from
another digest, and
a fragment has a size such that its length has not been determined.
Description
FIELD OF THE INVENTION
The present invention relates to the computer-assisted construction
of one-dimensional linear or circular maps from fragment size data.
The preferred embodiment allows construction of restriction maps of
deoxyribonucleic acid (DNA) for use by experimenters in the fields
of molecular biology, molecular genetics and genetic
engineering.
BACKGROUND OF THE INVENTION
Restriction enzymes are endonucleases (see Glossary of Terms) which
cleave DNA at short, specific nucleotide sequences. Purified DNA
can be cleaved at these specific sites on the molecule and the
length of the cleavage products measured. By a variety of
experimental techniques, the order of these fragments in the
molecule can be determined and thus one can "map" the restriction
sites of a genome. At minimum, a map is an ordered list of
restriction sites and their positions on the DNA molecule. A map
may also contain an identification of every fragment in the data
set as belonging to an interval on the map. For most purposes, the
usefulness of a map is dependent on this information. Such a map is
a physical map of the genome and can be correlated with a genetic
map or with physical maps derived by other means (such as EM
Heteroduplex analysis).
Restriction enzymes have proven useful for the physical dissection
of the genomes of organisms ranging from bacteria to mammals. These
endonucleases have been used to map and compare genomes, to produce
DNA fragments for DNA sequencing and to construct recombinant DNA
molecules. Maps are useful both for the isolation of defined
regions of the genome and in the analysis of hybrid chromosomes and
deletion mutants.
For example, restriction enzymes are the "scissors" used to cut
large genomes (billions of nucleotides (bases) long) into pieces
for subsequent molecular cloning. It was only the discovery of
restriction enzymes and their characteristics which allowed the
development of the analysis techniques collectively known as
"recombinant DNA technology". The ability to clone pieces of DNA of
interest out of the huge background of the genome and replicate
them along with the cloning vector permits the isolation of useful
amounts of reasonably pure DNA for study. This has led to
experiments for the elucidation of molecular genetic details about
gene structure and function that were impossible a decade ago.
Restriction enzymes are also an important tool of genome analysis.
For example, when screening a "shotgun" (large, mixed collection of
all clones of a genome) for a particular feature of interest, an
experimenter generally finds a number of positive candidates. The
relationship of the candidates can be elucidated by restriction
mapping. Two candidates might be identical (and would thus have an
identical restriction map), they might be overlapping (and would
thus each have a part of its map identical with the other) or they
might be unrelated (and would thus have quite different restriction
maps). In the case of overlapping clones, this kind of analysis
further locates the feature selected for when isolating the
positive candidates to the DNA contained in both clones (i.e., the
overlapping region of the restriction maps). If two of the
candidates selected have unrelated maps, this suggests that two
different regions of the genome (two different "genes") can produce
the same feature selected for.
Similarly, genomes related in a number of ways can be compared by
their restriction maps. For example, two related but distinct
organisms (such as bacteriophage lambda and bacteriophage phi 80)
can genetically recombine to yield recombinent progeny with some of
the features of each parent. If one determines a restriction map of
the recombinant and compares it with the maps of the parents, one
can determine which region of the genome came from which parent,
where the recombination events occured in the DNA, and may be able
to assign some of the features to particular regions in the
DNA.
In recent years it has become possible to determine the nucleotide
sequence of DNA in a particular gene or region of interest. DNA is
a linear string of subunits, each subunit can be one of four types
(A,G,C, or T). The experiments to determine the sequence or order
of subunits in a given region of DNA are greatly facilitated by
having a restriction map of the region of interest.
DESCRIPTION OF THE PRIOR ART
Since maps are so useful, constructing restriction maps from
fragment size data (usually derived from gel electrophoresis of
digested DNA) is a common activity of molecular biologists. One
method frequently used is referred to in the literature as "by
inspection". This generally means that DNA which has been digested
to completion by a particular restriction enzyme is run in one
channel of an electrophoresis gel, a second complete digest with a
different enzyme in another channel and the complete digest with
both enzymes together in a channel between the two. Electrophoresis
is known to separate DNA fragments on the basis of size. The sizes
of the DNA fragments in the digestion are then measured by
comparing their mobility on the gel to the mobility of a set of
known size standards.
The researcher then tries to piece together all of these fragments,
much like a jigsaw puzzle. He/she tries to determine an ordering of
every fragment in each of the three digests. In doing so, the
researcher follows rules which express the underlying logic of the
original arrangement of the fragments. For example, each fragment
in each single digest must be identified with some subset of the
fragments in the double digest whose sum of lengths adds up within
error to the length of the single digest fragment. Similarly, each
end of each double digest fragment must be identified with an end
of a fragment from one or the other of the single digests.
The result is a linear map of the original DNA with an ordering and
position (plus or minus error) of each site for both enzymes and an
identification of every fragment on the gel to an interval between
two sites on the map and of every interval on the map to a fragment
on the gel.
Mapping "by inspection" (i.e. by hand) is most common among
molecular biologists. However, several computer programs or
algorithms have been reported in the literature which claim to find
all possible maps. All use as input data the fragment sizes of
fragments in single and multiple digests. As will be discussed
below, each of the prior art systems suffers from some inherent
defect which leave much to be desired by the genetic
researcher.
Prior Art Algorithms and Computer Programs
Stefik--Artificial Inteligence (1978) 11:85-114. Doctoral
Dissertation--1980.
Implemented for Stanford Sumex system. This method permutes sites
and intervals, termed "segments", and rejects maps using heuristic
rules. It is not an algorithm, but is a set of heuristic rules.
According to Stefik's dissertation, the method is not intended to
be a complete solution generator, but is only a small subset of the
possible solution implemented as a rule-driven artificial
intelligence program.
Pearson NAR (1982) 10:217-227.
Implemented on a DEC-10 minicomputer. Permutes only fragments from
single digests. This method "predicts" double digestion fragments
(by an unexplained method), assigns double digestion fragments to
these predictions in order, and takes least sum of squares of
absolute differences as the goodness of fit. The configuration
space identified as (A!.times.B!). The method is said to be
extendable to more than two enzymes. It computes pairs first, then
adds third, etc. to list of possible maps.
The Pearson method does not permute double digest fragments nor
does it consider alternative orders of closly spaced sites. On
particular sets of input data, this method has been shown to miss
the correct map.
Fitch, et. al. Gene (1983) 22:19-29.
This approach is a branch and bounds method. The method is limited
to two enzymes and 3 digests. There is no indication of any ability
to handle additional enzymes.
The method appears to attempt to do what is commonly done by hand
in a "systematic" way. The "bounding rules" described are complex
and their interaction with each other and the procedure cannot be
clearly understood from the description.
Fitch, et al. start by trying to determine all possible
"contained-in" relationships for the smallest single digest
fragment. The method uses local error calculations to verify that
the sum of double digest fragments' length could fit within a
selected single digest fragment. A difficulty in this approach is
that the method attaches great value to assigning uncut fragments
(a necessarily tentative initial assignment). The method then tries
to make the uncut assignment definite by other information. When
all "contained-in" possibilities are enumerated for every single
digest, they have to be resolved into all possible consistent sets
of one per fragment and then maps extracted for each set.
Durand, et. al. NAR (1984) 12:703-716.
Durand, et al. propose another method using branch and bounds
(although not called such in the paper). The method handles many
enzymes at once, and builds from the left end of the molecule. The
method does not assign double digest fragments, however. It also
does not consider alternative orderings of closely spaced sites. No
use of local error is made, thus preventing the method from
computing long maps efficiently (although the paper suggests
starting from both ends at once). The algorithm can accept as input
partial information about the map.
Wulkan et. al. CABIOS (1985) 1:4 pp 235-239.
Wulkan et. al. present an implementation of the method of Pearson
for a microcomputer. The addition of "clues" such as the
contained-within relationships of the data speeds computation of
maps.
Nolan, et. al. NAR (1984) 12:717
This method produces maps using a branch and bounds method
involving construction of "forks" and permuting them. "Forks" are
sets of fragments from one double digest that sum to the same
length within error as one fragment from a single digest. As in
Fitch, et al., the first step is to determine all possible forks in
the data set.
The method is inherently limited to two enzymes and one double
digest. Two versions of the method exist for circular and linear
maps. It is not clear whether the "fork" generation method covers
all combinations. (Combinatorics of the permutation method are:
F!+sum(F.sub.i !) where F is the number of forks and F.sub.i is the
number of double digest fragments in the i'th fork.) It builds
forks together using global error. A set of heuristics is also
presented which is intended to bound the set of fork
permutations.
BRIEF DESCRIPTION OF THE FIGURES
FIG. 1 is an example of a restriction map and fragment size
data.
FIG. 2 depicts two examples (a and b) of pairs of different
restriction maps which prior art mapping algorithms would not
identify as different.
FIG. 3 is an example of a starting point used when mapping "by
hand".
FIG. 4 is a flow chart of the computer program which implements the
present invention.
FIG. 5 is a flow chart of recursive procedure BUILDMAP, which is a
part of the computer program implementation of the present
invention.
FIG. 6 is a detailed flowchart of step 120 of BUILDMAP showing
special cases for starting maps.
FIG. 7 is a diagramatic representation of the steps (a, b & c)
of map construction by the method of the present invention.
FIG. 8 is a diagramatic representation of the method of fitting
double digest fragments (a & b) into the growing map using
local error.
FIG. 9 is a collection of maps showing possible error relationships
between sites.
FIG. 10 is a map illustrating how to find reference point.
FIG. 11 is a diagram of the special case of starting a circular
map.
INTRODUCTION TO AN UNDERSTANDING OF THE PRESENT INVENTION
In accordance with the present invention, there is provided a
method for constructing restriction maps of DNA from data derived
from single and multiple digestions of purified DNA with various
restriction enzymes. Such data typically consists of the number of
fragments produced by digestion and their sizes. After digestion of
a DNA molecule with one or more restriction enzymes, the resulting
fragments are separated by gel electrophoresis. When compared to
standard compounds of known length (or molecular weight), the
fragments may be listed with their approximate length. (The error
inherent in such a procedure is on the order of 1-10%.)
Even if there were no error in measurement of the lengths of the
restricition fragments, there could be alternative arrangements of
the fragment data which would be consistent with the known
information about the genome of interest. For instance, if a linear
DNA molecule were being studied, and if the identities of both of
the ends were known with certainty (perhaps through
radio-labelling), there might still be ambiguity in the data if two
or more fragments having the same end moieties and identical
lengths resulted from a multiple digestion.
Given the error inherent in the measuring procedures employed, the
number of possible arrangements of fragments consistent with the
data is very large. The mapping problem is of the type "NP
Complete", which indicates that no program can be written which
solves all cases in less than exponential time.
Although the complete solution to the mapping problem has been
claimed by researchers in the art, investigations by the present
inventors have revealed that each solution proposed in the prior
art has been deficient. Since the correct solution is known to
exist (i.e., a DNA molecule does have a sequence, and a map),
generation of that solution among several possibilities is one
crucial measure of prior art methods. Any algorithm which fails to
give the correct result has either failed to generate that result
through an incomplete definition of the configuration space of the
mapping problem, or has rejected a generated solution for incorrect
or invalid reasons.
Once a set of possible maps for a given set of data has been
produced, the next desirable step is the ability to rank-order
those possible solutions in accordance with their conformity with
the data. In this way, the investigator may easily focus his or her
efforts on those possible solutions which hold the most
promise.
In accordance with the present invention a map generation system
correctly and completely defines the configuration space of the
mapping problem, evaluates possible maps on the basis of their
conformity with the experimental data, and displays the possible
maps generated to the user in novel and useful formats.
Explanation of restriction mapping in molecular biology
What is a map?
Referring now to FIG. 1(a) there is shown a restriction map of a
DNA molecule. The DNA is cut by restriction enzymes labeled Eco,
Bam, and Hin at specific sites located at positions labeled A to F.
This particular DNA is linear and the left and right ends are
denoted LEND and REND. (The left and right ends are not inherently
different and which is which is just a convention accepted by
scientists working on that particular DNA).
FIG. 1(b) shows the fragments obtained by digestion of the DNA with
any one of the enzymes on the map or with any two in
combination.
FIG. 1(c) shows the sizes of fragments in the various digests in
sorted (decreasing) order. This is the form of the data obtained by
experiments. The goal of mapping is to deduce the fragment
arrangement (as in 1(b)) and thus the map (as in 1(a)) from the
fragmentation data. Since the map is one-dimensional and intervals
on the map are additive, the test for correctness of a map is based
on this additive property. Finding the correct map is greatly
complicated by the fact that fragment-size measurements are not
perfect, but have an inherent measurement error. It is possible,
however, to put a bound on the magnitude of the measurement error.
A further complication is that some fragments may be "missed" in
the experiment, either because they are too small for the detection
method or are not resolved from another very similarly sized
fragment and thus not identified. The various digests can be
checked against each other for internal consistency of fragments
number and total size but "completeness" of the data is not
guaranteed.
The researcher's problem is to deduce all possible maps which are
consistent with the fragment size data (within error), and to
identify the more likely ones.
Depending on the use to which he/she wishes to put the map, the
experimenter will either do additional experiments to distinguish
between the possibilities or will decide to work around the
existing ambiguities in the map.
What constitutes a valid map, and whether two maps differ depends
on the definition used, which may vary depending on the use to
which the map is to be put. A map usually contains some or all of
the following (not independent) information about the molecule.
1. An ordered list of sites generally with coordinates and perhaps
with .+-. error on the coordinates
2. An ordering of fragments from the digests (which can include
error of the fragments). Either an ordering of just the single
digest fragments, or an ordering of fragments of all the
digests.
3. An assignment of every fragment in the data to an interval on
the map and of every interval on the map to a fragment in the data.
(Note that this is essentially equivalent to "1"+"2".)
4. An enumeration of the relationships between fragments of
different digests. Particularly those fragments from single-enzyme
digests are related to those subfragments from all multiple digests
involving that enzyme and vice versa. That is, for every pair of
digests involving that enzyme, one single and one multiple,
completely identify the contained-within and containing
relationships. This information is generally needed for cloning
and/or sequencing experiments and is completely predictable from
"1"+"2".
"1" is the format in which maps are usually presented. However, it
is important to note that because of error, "1" does not
unambiguously predict "2", nor does "2" unambiguously predict
"1".
FIG. 2(a) shows two possible maps which are both consistent with
the experimentally-derived fragment size data. The maps have the
same fragment order for all digests, both single and double, but
differ in the relative ordering of sites. This means that in map
(ii) the 300-long Eco fragment is cut by Bam into a very slightly
shorter new fragment with one Eco end and one Bam end and a 10-long
new fragment with one Eco end and one Bam end. The 1100-long Bam
fragment is similarly cut by Eco in the double digest. In the
second map (ii), however, the 300-long Eco fragment is not cut by
Bam nor is the 1100-long Bam fragment cut by Eco. These differences
may be critical in the design of some experiments, such as cloning
or sequencing.
FIG. 2(b) shows two maps which have the same site order and error
ranges on the coordinates but differ in the ordering of fragments
in the double digest. Depending on the use, this difference may be
very important and the experimenter may need to know that two
alternatives exist. This is important for any experiment where one
wishes to physically isolate a particular sub-region of the DNA
molecule for subsequent experiments, such as cloning, sequencing,
or protein-binding assays.
As an example of the typical prior art procedure employed by
researchers in the field, (other than computer methods referred to
above as "prior art",) the following description may be useful.
Procedure for Restriction Mapping "by hand"
Do experiments to derive the data.
(a) Isolate DNA of a single molecular species. (Usually a plasmid
from a bacteria or the genome from a virus).
(b) Digest aliquots of the DNA with a variety of restriction
enzymes employed singly and in various combinations.
(c) Run the digestions on an electrophoretic gel to separate the
fragments according to size.
Determining the map from the data involves several steps:
1. The researcher measures the length of fragments resulting from
digestion by comparing the position of the bands on the gel with
the positions of known length standards.
2. He/she notes significant facts (clues to the map as it were)
about the pattern of the bands in the three lanes. (one lane for
each of the two single and one double digests.)
(a) Which bands in the single digest are not in the double digest?
(This means that there is at least one cut for the second enzyme
between the two ends of the single digest fragment)
(b) Which bands in the single digests are (tentatively) also
present in the double digest? (Its presence in both digests means
that there is no site for the second enzyme between the two first
enzyme sites which generated the fragment).
(c) Which bands in the double digest are (tentatively) identical to
bands in which of the single digests and which bands are new? (This
identifies the ends of the fragments as being both of one type or
both of the other or one of each type.)
(d) Are the total number of fragments in the three digests
consistent and is the number of new fragments in the double digest
consistent with the number of cut fragments in the singles?
3. An effort is made to resolve discrepancies by careful
reinspection of the gel. (see Table 1)
(a) Search for "missing" fragments. Two similarly sized fragments
may not be resolved into two bands. A band on the gel composed of
two fragments (a doublet) is generally recognizable as such because
it stains more intensly than a singlet and it may be broader. The
other cause of missing fragments is that they are so small they
either ran off the bottom of the gel or do not stain intensely
enough to be seen. If it is fairly certain that doublets have been
identified, one tentatively assumes the existence of "not visible"
fragments in the double digest which are smaller than some maximum
(the size which would certainly have been visible).
(b) If the number of cut fragments and new fragments is
inconsistent, there is a mis-assignment of a cut fragment in a
single digest and a similarly-sized new fragment in the double
digest as identical and uncut. Re-examination of the gel may detect
this.
4. The researcher tries to construct a map which is consistent with
all the above information. More than one map may be consistent and
he/she tries to find all possibilities. The mental processes used
and the rate of progress are highly dependent on the skill and
experience of the mapper, as well as the peculiarities of the data
being considered.
Generally, what is done is to find a small sub-region of the map
which is "easy" (certain, self-evident, unambiguous, etc). He/she
determines the containing and contained-within relationships for
this region and constructs a map of this sub-section. He/she then
proceeds in both directions, identifying overlapping fragments from
the two single digests by which ones contain a double digest
fragment in common, until a map is achieved. Progress becomes
easier as more map is determined because the mapped region and
fragments already assigned to them can be eliminated from further
consideration in constructing the not-yet-mapped regions.
Most maps have a readily evident region which can be a starting
point. For example, a large "new" fragment in a double digest can
only be contained in an even larger "cut" fragment (necessarily one
in each single). Frequently, there is only one choice for each of
these. Subtraction of the size of the new fragment from the size of
the containing single digest fragments yields a possible size for
two other new fragments and if either or both exist, a possible map
is started (see FIG. 3).
Similarly, a small cut fragment can only be cut into an even
smaller new fragment, or a large uncut fragment must be a
sub-region of an even larger cut fragment in the other single
digest. In most data sets, there is usually a natural starting
point for the mapping process. If not, what is usually done is to
resort to a different experiment with a different, easier pair of
enzymes. After some map is constructed, the more difficult enzymes
can be added more readily.
Alternatively, there may already be a known section of the map.
When mapping clones for example, the vector portion is already
known.
Mapping by Computer
The construction of maps by hand using the human mind to perceive
relationships is a difficult skill in which proficiency comes only
with practice. Therefore it cannot usually be assigned to unskilled
personel. In complex maps the number of possibilities is understood
to exceed the ability of the human to solve. Computational
assistance is therefore definitly welcome, not only because of the
time saved but because of potentially greater accuracy. Even the
most experienced and skilled practitioner of this art is aware and
concerned that the map he/she has deduced (really induced) might
not be the only one or even the best that fits the data. Just as
the human mind is good at solving puzzles by intuitive leaps, it is
also easy to fixate on one approach leaving blind spots that may
mask the truth.
BRIEF DESCRIPTION OF THE INVENTION
The present invention comprises a computer-assisted method for
determining the order (map) of fragments in a one-dimensional array
(either linear or circular), based on fragment length and fragment
termination data. While the invention may have utility in other
types of problems, its primary utility is that of restriction
mapping of DNA molecules.
With reference to DNA restriction mapping, the method of this
invention comprises constructing hypotheses of possible maps by
permuting at least some, and preferably all, arrangements of
fragments (each of a characteristic fragment length) and some, and
preferably all, fragment termination information, known as "sites".
The number and types of sites in the molecule are determined by
counting the numbered fragments in restriction enzyme digestions,
and fragment length is one of a number of possible method such as,
for example, by gel elctrophoresis. Each such permutation is then
tested for validation by comparing possible local sub-regions of
the original array with local additive lengths of fragments. Each
validated hypothesis is then output as a possibly correct
fragmentation map of the DNA under investigation.
Preferably, multiple possibly correct maps are then ordered as to
their probability of correctness by reference to their
goodness-of-fit to the data from multiple enzyme digestions.
In an alternative embodiment, a restriction map is constructed by
selecting a restriction site as a starting point then selecting all
possibly adjacent fragments as candidate neighbors. An extension of
this hypothetical map at the distal terminus (if the candidate
neighbor fragment is consistent with all available data) is then
constructed in a like manner, and the process is reiterated, in
essence "growing" the hypothetical map until it is complete, thus
arriving at a pre-validated possibly correct fragmentation map. At
each step of the growth process, local additivity of the
sub-regions of the partial map are used to check validity of the
hypothetical map.
DESCRIPTION OF THE PREFERRED EMBODIMENT THEORY AND OPERATION
In accordance with the present invention there is provided a
process for constructing maps from fragmentation data. For clarity,
the present invention will be described in terms of its application
to mapping of DNA molecules fragmented by restriction enzymes,
commonly known in the field of molecular biology as restriction
mapping. Other types of linear or circular structures and other
fragmentation agents could also be mapped by the method.
Objectives of a mapping system
A system for constructing maps should take fragmentation data in
the form obtained by the researcher and produce a systematic search
of all the possible arrangments of this data to identify maps that
are consistent with it. Since, in general, more than one
potentially correct map will be found, the system should output any
that are consistent with the data. Within this framework the
following criteria should be met:
Completeness of solution
Speed of solution
Ability to handle fragment data with .+-. errors
Ranking of possible maps by quality
Circular and linear maps supported
Experimentally realistic range of data handled.
Completeness of Solution
Since many possible maps may be consistent with a given set of data
the system must find them all, although the user might wish to be
shown only the best few. It is only the assurance that all
possibilities have been found that allows the researcher to have
confidence in the method, for if possible solutions failed to be
considered, the correct map could be among the missing. The
criterion of completeness is of such central importance that it is
a definitive requirement for a mapping system and method.
Logical approaches for deriving solutions from data can be thought
of as either using negative inference to reject elements of the
configuration space because of conflicts with the data or using
positive inference to deduce solutions from the data. A method
which uses only negative inference is complete if the hypothesis
generator exhaustively models the solution space and the rejection
criteria never cause rejection of a valid solution. A method which
uses only positive inference is complete if the inference rules
generate all possible hypotheses fitting the criteria. In choosing
an approach for solving a problem, the character of the data, of
the configuration space, and of the inference rules which can be
used must all be considered. Many known algorithms for problem
solving use a combination of both approaches.
The following description discloses two variations of a method for
solving restriction mapping problems. First described is an
exhaustive hypothesis generator which completely models the
configuration space and passes hypotheses one at a time to an
evaluator which rejects those inconsistent with the data. It is
this complete accuracy of the map generator of the present
invention that instills the necessary confidence in its users. Also
described are specific positive inference rules which could be used
to limit the configuration space.
The preferred embodiment of the invention is termed the "branch and
bounds method" and is based on the exhaustive hypothesis generator
as limited by positive inference rules. The method uses negative
inference to reject many elements of the solution space at once.
This is accomplished by intermixing steps of map generation and map
testing at a fine level.
Speed of Solution
The second criterion of acceptability is speed. The mapping problem
is in the category known as NP complete, which means that there is
no known way to solve it in less than exponential time. No matter
how efficient a method is employed, the time required to solve
increasingly complex maps increases exponentially. Many such
problems would be insoluable for all practical purposes, regardless
of the computational power available. The impact of an efficient
algorithm versus an inefficient one, however, can be worth orders
of magnitude in time. In effect, the quantitative difference
becomes qualitative, making useful problems soluble that would
otherwise be (for all practical purposes) insoluble. The preferred
embodiment is an exhaustive map generator that can solve maps
commonly encountered in molecular biology at practical speed, even
on a personal computer.
Ability to handle fragment data with .+-. errors
Real (experimentally derived) data has error limits associated with
it and a method must allow for the possibility that any fragment in
the data deviates from its mean by the allowed amount in either
direction. Real maps are one-dimensional and intervals on the map
are additive. Fragment sizes are measurements of these intervals.
If any proposed map has fragments assigned to intervals such that
the additive relationship within error is not maintained, then the
map is incorrect.
Ranking of maps by quality
When a large number of possible maps (each reasonably consistent
with the data given the error limits) are found by a method, there
is considerable difficulty for the user in deciding which is the
most probable one. The determination of a "figure of merit" for
each map, based on the overall fit of the data to the map, provides
a basis for ranking the output. This is a considerable help to the
user.
Circular and linear maps supported
Natural DNA is either linear or circular. The ability of a method
to handle both cases is therefore a significant benefit.
Experimentally realistic range of data handled
The method of the present invention is not inherently limited in
the number of enzymes that can be handled. For example, some prior
art methods can handle only two enzymes. Similarly, the method of
the present invention does not require that every double enzyme
digestion be done in order to solve multi-enzyme maps. There may be
used as much double digest data as is available, but only a few
double digestion experiments are necessary. The minimum requirement
is only enough multiple digests to involve all the enzymes. As more
double or multiple digest data are supplied, better results are
produced, ie. fewer "possible" maps found in a shorter time.
A map generator and a map evaluator are provided. The generator has
the job of enumerating all possible hypotheses (in a combinatoric
sense) based on a model of the configuration space. Each hypothesis
is then passed to the map evaluator. The evaluator functions to
reject maps which are inconsistent with the data and presents all
others to the user. The evaluator comprises three parts: the map
hypothesis rejector, the site positioner and the map quality
evaluator. The hypothesis rejector screens hypotheses produced by
the generator and rejects those with an unacceptable level of
inconsistency with the data. For each map not rejected, the site
positioner assigns numerical positions to each site on the map and
the evaluator checks each map against the data for quality of fit.
The maps with an acceptable quality are the solutions and they are
passed to the user through the output channel. The generator and
the elements of the evaluator will be discussed in detail in the
following three sections.
The map generator
The map generator can be of any form that produces a complete
coverage of all mathmatically possible maps given the single and
multiple digest data. A brute force procedure that is a correct but
computationally-intensive method for generating all maps is:
(PERMUTE ALL SITES) AND (PERMUTE ALL FRAGMENTS)
This is described in more detail as the steps below.
1. Generate list of all sites in molecule {determine the number of
sites Si for each enzyme i by counting the fragments in each single
digest (and subtracting one if map is linear)}
2. Determine the total number of sites, N, in the map by adding up
the individual totals.
3. While an untried permutation of sites exist {number of
permutations of sites is: ##EQU1##
4. Select a permutation of all sites
5. While an untried permutation of fragments in single enzyme
digests exists {number of permutations is F1! X F2! X . . . Fi!,
where Fi is the number of fragments in digest i}
6. Select a permutation of single digest fragments
7. For fragments in all single digests, assign all fragments (in
selected order) to intervals on the map bounded on both ends by
sites of the type used in the digest and containing no interior
sites of that type.
8. While untried permutation of fragments in multiple enzyme
digests exists
9. Select a permutation of multiple digest fragments {number of
permutations is M1! X M2! X M3! . . . X Mk!, where M is the number
of fragments in multiple digest k}
10. For all fragments in all multiple digests included in the input
data, distribute all fragments (in selected order) to the map
intervals bounded by sites of any type used in the digest and
containing no interior sites of any of these types.
11. Pass this hypothesis to the hypothesis evaluator. {Each
configuration defined by the combinations of site orders and
combinations of fragment distributions is a theoretically possible
hypothesis to be passed to the hypothesis evaluator.}
______________________________________ EndWhile {permute all
experi- mentally available multiple digest fragments} EndWhile
{permute all single di- gest fragments} EndWhile {permute all
sites} ______________________________________
It should be noted that the map generator can be used equally well
to generate linear or circular maps. The only detail that must be
changed is whether fragments are to be assigned to an interval that
spans the end. If the map is linear such assignments must be
eliminated from consideration.
Other suitable embodiments of the "brute force" map generator can
be obtained by interchanging the order of steps. For example, one
could consider the possible ordering of fragments first (in the
outer loop) and then consider the ordering of sites. Alternatively,
the single fragments could be permuted first, then the sites, and
then the multiple fragments. Finally, as discussed below with
respect to the preferred embodiment, these steps may be intermixed
at a fine level, some fragment permutations being done, then some
site permutations, then some more fragments etc. The possible
variations in the order of steps are minor modifications of the
method of the present invention.
A critical feature of the "brute force" map generator of the
present invention is that it does not require that double digest
data be available from all the combinations of enzymes of the map.
This is important because the number of double digests increases
rapidly with the number of single digests. {EX(E-1)/2 , where E is
the number of enzymes to be mapped} This can easily lead to a great
deal of experimental work for maps with several enzymes. Many
methods of the prior art which require all double digest data to
set up the map generator are unsuitable because of the
experimentally unrealistic requirement that double digest data be
available for every pairwise combination of enzymes.
As is common to all methods, that of the present invention requires
single digests of all the enzymes to be mapped. These are used to
generate the list of sites to be permuted. Fragment permutations
are done on the single digest fragments and on only those multiple
digests for which experimental data are available.
The hypothesis evaluator
The map hypothesis rejector checks a number of intervals on the
hypothesized map for additivity of fragment sizes as explained
above. It is important to note that if any interval on the
hypothesised map is non-additive, (beyond the error limits of the
data), then the map is inconsistent with the data and is considered
to be wrong. (It is this fact that is used to reject whole sets of
hypotheses in the solution space by the prefered embodiment, the
branch and bounds method described below.) Pearson's map rejector
looks only at the whole map (sum) deviation of the data, not at
individual data points, and rejects those maps whose total
deviation is greater than a preset limit. Thus, correct maps may be
rejected if most of the intervals are near the error limits of the
data since the total error will be high, while obviously incorrect
maps may be accepted if most of the intervals are coincidentally
close to the data while one diagnostic interval is substantially
beyond the acceptable limits of non-additivity.
The site positioner is responsible for assigning each restriction
site a numerical position on the "possible" map based on the order
of sites that has been selected and fragments have been assigned to
the map intervals by the map generator. This step is required
because every measurement of fragment length has a .+-. error
associated with it. As a result, the length of the intervals of the
map between sites can be determined in many different ways by
adding up the lengths of the various combinations of sub-fragments
that make it up.
Several alternatives exist for the design of the site positioning
algorithm. In accordance with Pearson's method, single digests may
be used alone after normalization of the total length of fragments
to an average value. This provides a single position for each site,
but the method avoids conflicts by throwing away the multiple
digestion data. This technique has an adverse effect on accuracy
since the single digest fragments, as the longest in the data, have
the largest absolute errors.
An alternative method is to use the shortest fragment available to
define every map interval. This also throws away data but is more
accurate since the data thrown away is the less accurate data
rather than the more accurate data.
The preferred method is that disclosed by Schroeder and Blattner.
This method makes use of all the data by performing a least squares
calculation to minimize the sum of the squares of the error of each
fragment length relative to the length of the map interval to which
it is assigned. This method gives the optimum map positions in the
least squares sense for every site. As a slight variation of this
method, it is possible to minimize the sum of squares of the errors
weighted by various factors, such as the inverse of the relative
error limits of the input data.
The map quality evaluator is used to give possible maps a "figure
of merit" so that they may be ranked for the user. Once the site
positions have been determined, the deviation of the length of each
actual data fragment from the calculated length of its interval may
be determined. Any map that predicts a fragment size that lies
considerably outside the assigned error limits for that fragment is
rejected. (It should be noted that this operation rejects maps. The
map rejector described above is not essential nor need it be
exhaustive in its rejection of inadequate maps since the function
can be done here. However, the least squares assignment of site
positions is computationly expensive and it is better to do it only
on those very few maps, relative to configuration space, which are
qualitatively reasonably consistent with the data.) This leaves the
maps that are qualitatively consistent with the data. For these a
figure of merit for the overall map is determined by any of several
statistical formulae; e.g. the root mean square average deviation,
the root mean square average fractional deviation or the
corresponding straight averages of absolute values of the
deviations or fractional deviations. The figure of merit may, in
fact, be the maximum fragment error. It is convenient to store
these maps in a sorted list so they may be presented to the user in
order of the figure of merit.
In summary, then, the present invention provides a method having a
proper map generator (which includes permutation of BOTH sites AND
single digest fragments AND multiple digest fragments in all
possible combinations), and a hypothesis evaluator which rejects
only those maps in which any region is unacceptably (beyond the
error limits of the data) non-additive.
The brute force method (wherein the map generator passes one
hypothesis at a time to the evaluator) bogs down with maps of
modest complexity. The number of trials is far greater than the
product of the factorials of the numbers of single digest fragments
(as was suggested by Pearson).
Since for a two enzyme linear map (with the double digest) there
are A! x B! x (A+B)! permutations of fragments plus the
permutations of sites, a brute force method which checks all
permutations one at a time becomes rapidly unusable as the number
of fragments increases.
Therefore, an efficient map generator/map evaluator combination
represents a great improvment. An efficient map generator is one
that accomplishes the same result as the "brute force" generator in
many fewer computational steps, yet does not fail to consider any
of the possibilities that might be correct maps.
The preferred embodiment of the present invention is based on the
following three concepts:
1. It is not really necessary to examine all combinations of every
permutation of single digest fragments with every permutation of
sites ONE AT A TIME. While the order of sites is not uniquely
deduceable from the order of single digest fragments unless the
data is perfect (0 error), it is possible with linear maps to
calculate which site permutations (one to several) are consistent
with each proposed permutation of single digest fragments. Thus the
configuration space to be exhaustively examined can be
significantly reduced by considering only these relative few
permutations for each permutation of single digest fragments.
The situation with circular maps is considerably more complicated
since possible site orders can be calculated only from permutations
of single digest fragments and some (data dependent) assignments of
some of the double digest fragments.
2. Once both a single digest fragment permutation and a compatible
site permutation (and with circular maps, enough double digest
fragments assigned to allow this) are selected, then the distance
between any two sites on the map can be calculated. (i.e. we can
calculate the size that a fragment assigned to such an interval can
be, without conflicting with the already selected fragment and site
permutations and assignments.) All unassigned multiple digest
fragments must then be assigned in all possible orders to all
appropriate intervals. (i.e. Permute multiple digest fragments and
assign to intervals--the inside loop of the general description.)
For each such permutation (a hypothesis) if any fragment is
inconsistent with the calculated allowable size of the interval,
then the map is rejected. If no fragments are inconsistent, the map
is accepted as possible and output. After all possibilities are
tried, the process goes back to the next outside loop and tries the
next permutation of single digest fragments AND sites.
2a. If no consistent (with the selected single digest permutation
and site permutation) permutations of double digest fragments
exist, then there are no possible maps for this particular
combination. It is not always necessary to enumerate all possible
permutations of double digest fragments to discover this. For
example: for any interval for which double digest data is
available, if no fragment of the appropriate size exists in the
appropriate double digest, then no permutation of multiple digest
fragments exists which is consistent and the single digest fragment
permutation-site permutation combination may be rejected without
individually considering all double digest permutations.
The preceeding two rules are examples of positive inference which
may be used to limit the hypothsis generator and thus make it more
efficient. Using these two rules, it would still be necessary to
individually examine all single digest fragment permutations in
combination with all consistent possible site permutations, reject
some of them by criteria based on the existence or non-existence of
particular multiple digest fragments (within error), and examine
the rest in combination with multiple digest fragment
permutations.
3. If the generator produces maps in the right order, and rejection
criteria are applied at intermediate steps in map generation, (not
just at the end), an inconsistency in a map may be used to reject
entire sets of maps at once. That is, if a map is rejected because
it has a particular inconsistent subregion, any and all maps which
have the same subregion can be rejected at the same time.
The preferred embodiment specifies a particular arrangment of the
steps of fragment and site permutation which can be represented in
the form of a tree. Traversal of every branch of the tree from the
root to every leaf would be equivalent to the brute force method.
(This is equivalent to saying that hypothesis evaluation is applied
to complete maps.) But the arrangment of the tree has been chosen
to increase the efficiency and to eliminate computational work, by
applying the test for local additivity at each branch and rejecting
all leaves at once for any branch failing the test.
Starting at the root of the tree and moving outward, the nodes
alternate between permuting sites, permuting single fragments, and
permuting double fragments. (The term "selection of growing point
enzyme" is used herein to describe the act of permuting sites).
Because of this alternating arrangment, the order of steps is not
only different from the described brute force implementation of the
invention, but may differ depending on the data.
Because of the way the tree is designed, every node of it
corresponds to a partially completed map, and these are nested so
that each step farther out on the tree from the root adds fragments
or sites to the end of the map that existed at the preceeding
level. With this structure in mind, it is easy to see how the
discovery of an inconsistency at any node (a local non-additivity)
leads to a simple way of rejecting large numbers of hypotheses at
once and thus limits vastly the computational work. In traversing
the tree, as soon as a node is encountered whose partial map is
inconsistent with the data, everything beyond it need not be
traversed because all maps that begin the same way, are
rejected.
The equivalence of the growing point selection rule technique to
the full permutation of all sites used in the brute force method is
clear. The savings accomplished through the use of the more
efficient method is quite large in most cases where real data is
involved because it is only necessary to apply site permutations in
situations where the accumulated error in the positions of
restriction sites at the end of the partially completed map of the
node overlap. If these positions did not overlap, the permutation
would, perforce create a negative fragment. The error control that
results from tacking corresponding sites together as the map grows
makes this a major savings over the more inefficient brute force
method.
The preferred embodiment of the present invention uses a branch and
bounds method and is described below. The input data are sets of
ordered lists of fragment sizes, corresponding to the sizes of
fragments in restriction enzyme digestions and the topology of the
molecule (whether the molecule is linear or circular). The method
requires that for each enzyme to be mapped, there must be a data
list for the single enzyme digest and each enzyme must be involved
in at least one multiple enzyme digest. The topology of the
configuration space of the mapping problem requires that all pairs
of single enzymes can be related through one or more multiple
digests. (This is distinct from the first requirement only if there
are more than three enzymes to be mapped). In the prefered
embodiment, in order to ensure complete coverage of the
configuration space of a circular map, there is a further
requirement that there must be enough multiple digest data that at
least one of the enzymes to be mapped is involved in a multiple
digest with every other enzyme.
Overview of procedure
Maps are constructed by building the map up one step at a time. At
each step there are a (data dependent) number of choices that could
be made. At each step, every possibility must be tried and at each
try the result is checked for internal consistency. (Additivity of
map intervals within the error bounds of the measured fragment
sizes must always be maintained). Any inconsistency causes pruning
of that branch. That is, we need not continue any further down the
wrong path of map construction but back up and try again with a
different choice. If there are no more choices we backup one step
further and try again.
There are three different kinds of steps in map construction which
are repeated until a complete map is achieved (and/or until all
possibilities have been tried and rejected). Pruning rules are
applied at each step.
1. Choose a site. The hypothesis to be tested is that this site is
the next site on the map. (No intervening sites between the
previous site, chosen in the previous cycle, and this one). This
site is called the growing point. It is chosen to be consistent
with single digest fragments assigned in previous cycles. All
possibilities must be tried. No more choices to try causes the
method to reject this branch and backup to the previous step. (step
3 in previous cycle).
2. Assign multiple digest fragments which end at this site.
Any multiple digest which involves the enzyme of the growing point
type will have a fragment in it that ends at the growing point and
maintains additivity of interval on the partially constructed map,
if it is correct so far. There may be more than one choice and
eventually all choices must be tried. If there are no more choices
to try this branch is pruned and the method backs up to the
previous step. (step 3 above)
3. Assign a single digest fragment that begins at the growing point
and extends into the not yet constructed part of the growing map.
(All choices must be eventually tried). If there are no more
choices to try this branch is pruned and the method backs up one
step. (step 2 above)
These three steps are repeated until the entire configuration space
has been covered. Each cycle adds to the growing map, one site, one
fragment from each multiple enzyme digest which ends at that site,
and one fragment from the single enzyme digest which begins at that
site.
The steps of the method are cyclic so the above numbering of the
steps (i.e. 1,2,3) is an arbitrary break in the cycle (i.e.
2,3,1,2,3,1, . . . is an equally valid representation). An
alternative way to think about the cycle is described below.
Maps are constructed by adding one single digest fragment at a time
to the growing map. The fragment is then accepted or rejected by
checking the consistency of the map to date with the available
multiple digest data. An inconsistency causes rejection of the
newly added fragment and the the next available fragment is tried
instead. If there are no more available fragments to try, the
algorithm backs up one step, removes the last added, tentatively
accepted, fragment and tries the next available fragment to add at
that point. If the fragment is accepted, the construction process
continues, another single-digest fragment is added to the map,
accepted or rejected, and so on. A possible map is found when all
single-digest fragments have been added to the map and accepted.
The map is processed (checked, refined and output). Then the
algorithm backs up a step and continues, since we want to find all
possible maps which are consistent. A flow chart of the method is
given in FIGS. 4,5 and 6. It is described below.
In the preferred embodiment, the overall order in which the map is
constructed is from one end toward the other (left to right if ends
are known) for linear molecules, and from an arbitrary point for
circular maps. Different overall orders are also possible (e.g.
right to left, or one end for a while then the other end for a
while, or both ends at once with a parallel processor, or from more
than one point in a circular map either alternating or in parallel,
etc.).
The point to which the method will add another single digest
fragment is called the growing point. (It is also the point up to
which double digest fragments have been fit in the previous step.)
It is defined as the next (from the previous growing point) site on
the growing map. It is found as the right end of a single enzyme
map which is the leftmost amongst the single digest maps. Since the
relative map positions of the endpoints have an inherent
uncertainty due to the measurement error of the input data, the
relative order of closely spaced sites may not be determinable.
When ambiguity of growing points is encountered, all possible
growing points are explored, one after the other. Thus the method
of the present invention properly handles closly spaced sites, a
particular weakness of various prior art methods. A valuable
enhancement to the selection of growing point is the use of
relative error (relative to a reference point in the accepted part
of the growing map) in comparing endpoints. This is described in
the Errors section below.
A single digest fragment is added to the end of one of the maps
(the growing point) and the recursive procedure BUILDMAP is called
to see if this added fragment is accepted and if it is to add
another fragment to the (next) growing point.
The procedure BUILDMAP executes as follows: The new ends are
compared (next growing point is determined) and the sizes of
fragments to expect from each multiple digest are calculated. The
fragment we are trying to find runs from the right end of the
double digest map to the just determined (next) growing point (see
FIG. 7). Each multiple digest is checked for the existence of a
fragment of the appropriate size. Its size must be such that the
sum of its length and the lengths of all other fragments assigned
to the map from that double digest between the previous end of the
growing point enzyme map and the current end equals, within error,
the size of the last fragment in the new growing point digest (see
FIG. 8). This value range could be calculated in a number of ways.
The important point is that the error to be used is local to the
region where the fragment is being fit, and includes only the error
in fragments which overlap the last fragment in the single digest
map of the growing point enzyme. This is a unique and crucial
feature of the present invention. If there is not a set (one from
each multiple) of fragments which fit, the single-digest fragment
is rejected (BUILDMAP returns) and a different single-digest
fragment tried.
After the existence of at least one appropriately sized fragment
from each appropriate multiple digest is verified, then the added
single digest fragment is accepted, one fragment from each multiple
digest is assigned to the map, i.e. it can not be used in any other
position in this map. If there is more than one fragment which will
fit, all possibilities are tried. If more than one multiple digest
has more than one fragment which will fit, all permutations--i.e.
all sets of one fragment from each multiple--are tried.
When multiple digest fragments which fit are found and assigned to
this map position, a coordinate correction is performed on the
growing point using the overlapping lengths of (1) the single
digest fragment ending at the growing point and (2) the sum of the
lengths of the multiple digest fragments contained within it, for
each multiple digest involving the growing point enzyme in the data
set. The coordinate adjustment employed is to set the coordinate
range of the site (growing point) to the intersection of endpoint
ranges of the overlapping values. The effect of this is to identify
the endpoint of the single digest map and the multiple digest map
as the same site by setting the coordinates to the same value. This
assignment of the double digest fragments to the growing overlapped
regions of the single digest maps, followed by coalescing of the
endpoint sites, merges the originally independent single enzyme
maps into a single unified map. In the unified map the length of
intervals between sites (fragment sizes), both like enzyme sites
(single digest fragments and different enzyme sites (multiple
digest fragments), are used to set site coordinates (See FIG. 7).
As grapically represented in FIG. 7, the growing end of the map is
a set of independent, parallel maps, one for each enzyme. As we
move along, adding and checking, the parallel maps are linked
together.
The practical value of unifying the maps as we proceed is twofold.
First, it allows the use of relative error (relative to a reference
point in the unified map) for finding the next growing point, thus
reducing the ambiguity of relative position of endpoint coordinates
enough that all possibilities can be practically explored.
Experiments have shown that this increases the efficiency of the
algorithm by orders of magnitude.
Second, it could make the calculating of the size of fragment to
look for in the double digest slightly more efficient. We could use
subtraction of endpoint coordinates (using a reference point in the
unified map to get relative, ie. local, error) and compare this
value against the query fragment in the multiple digest instead of
summing all appropriate fragments (and errors) with the query
fragment (with errors) and then comparing this size to the size of
the single digest fragment (with errors).
The order of map construction disclosed (trying all possible
growing points) is unique to the present invention as is the
fitting of double digest fragments using local error. The
assignment of all permutations of double digest fragments which fit
is unique to the present invention. While not critical, it assures
the identification of maps which differ only in the alternative
location of double digest fragments (as in FIG. 2) and permits the
coalescing of the parallel maps from the different digests. The
coalescing of the maps is unique to the present invention, and
while not critical to the algorithm in general, it allows for the
next point. That is, the method of calculating the next growing
point(s) is unique in the use of relative error and increases the
efficiency dramatically, making the use of the method on a computer
practical.
Error handling for determining growing point
Two unique features of the present invention are the use of local
errors to fit fragments from the multiple digest against the single
digests and the use of relative error in calculating growing point.
The use of local errors in fitting double digest fragments is
described above, in FIG. 7, and in the Details section.
The use of relative error in determining the next growing point is
important for efficiency. There are various, equivalent ways that
the error handling may be implemented. The principle is described
here and more details of the preferred embodiment are described in
the Details section.
The calculated position error of any two sites on a line might be
dependent on each other or independent of each other. For example,
if the coordinate and error of a site was determined by adding the
length and error of the interval between it and another site to the
coordinate and error of the other site, then the positions of the
sites are dependent. They can only vary independently over part of
the error range. Therefore, the difference between sites which are
dependent in this way can be obtained by subtracting the
coordinates, and the error on this distance can be obtained by
subtracting the errors of the two sites (FIG. 9). On the other
hand, two independent sites can vary over the entire range of their
errors independently. The distance between these sites is obtained
by subtracting the coordinates and the error is arrived at
independently by adding the errors. If two sites are not known
relative to each other, but each is known relative to a third site,
the two sites are not totally independent. The relative position of
the two sites may be calculated including only that amount of the
site errors which is independent. That is, their relative positions
may be determined more accurately by referal to the third site than
by subtraction of coodinates and summing of errors. One can
subtract the error on the common reference point from each of the
sites, before comparing them. The best reference point to use is
one closest (fewest intervening steps of dependent sites) to the
two sites.
let dist(x,y) be the distance between site x and site y
and x(coord), y(coord), be the locations of x, y
and x(error) and y(error) be the error on the location
The following cases are handled:
Dependent sites:
dist(x,y)=y(coord)-x(coord).+-..vertline.y(error)-x(error).vertline.
Independent sites:
dist(x,y)=y(coord)-x(coord).+-.(y(error)+x(error))
Partially dependent sites, by referal to a third site, z, upon
which both x and y are dependent: ##EQU2##
Applying this to the comparison of two restriction sites for
determination of growing point(s) in the algorithm, any two sites
on the growing map of the same enzyme type are dependent because
the interval between them is known (fragment length of fragment
from single digest or sum of more than one fragment) and is used to
determine the coordinate and the error. Any two sites of different
types are dependent if the double digest for the two enzymes
exists, and the fragment between the two sites has been assigned
and used to determine the coordinates of the two sites.
In order to determine the next growing point, it is necessary to
compare right ends of the single digest maps (coordinates of
different enzymes). Since the error on these sites generally gets
larger and larger as we get further in map construction, using the
global error to compare the sites would lead to exploring many
"possible" growing points and very slow operation. However, the
sites in fact are not totally independent. They can be compared by
referal to a third site, a reference point, upon which both of them
are dependent, as shown in FIG. 10.
The best reference point to use is the closest. When comparing two
map right-end sites (and the double digest for the two enzymes
exists), the best reference point is the leftmost previous end of
the two maps. If appropriate double digest data doesn't exist, then
there is another site from which there is a common path of
dependent sites to the two sites being compared. The best reference
point is different for any two sites and depends on which double
digests exist. For present purposes (reducing ambiguity of growing
point) using the best reference point is not essential. Any fairly
close reference point will do at the cost of some increased
ambiguity. Because finding the best for every pairwise combination
of enzymes would be a complex and time consuming step, the closest
site which can be a reference point for any two map right ends may
be used. (Termed the "greatest-common-whole-map-reference-point").
It is the leftmost of the "best" reference points for pairs, and is
found as the leftmost previous end among the single digest maps.
Since the input data is reqired to contain multiple digests so that
all enzymes are represented in the multiples, the entire set of
enzymes is tied together through multiple digest(s) and the use of
a single reference point is valid.
Procedure for starting a map
The previous sections describe the process of constructing
restriction maps by adding fragments one at a time to a growing
map. The following discussion details the procedures for starting a
new map.
Linear maps
Linear maps have two sites in common among all the digests, the two
ends. To start a linear map, an arbitrary order is imposed on the
enzymes' single digests. Only two minor steps need be incorporated
into the BUILDMAP procedure for beginning a map: the method of
choosing the growing point and the method of fitting multiple
digest fragments. At the begining of a map (i.e. there is at least
one single digest with no fragments yet in the map), the growing
point is chosen as an unstarted single digest map (arbitraily the
next). After choosing such a growing point, no multiple digest
fragments need to be fit. The method proceeds by adding a single
digest fragment to the selected growing point.
Circular maps
Circular maps create special problems at the beginning and ending
of map construction, but once underway can proceed as described
above. The method of the present invention can handle circular maps
without special user input, except, of course, input of an
indication that the DNA is circular.
In determining a growing point, if any single digest map is not
started, there is a growing point ambiguity. Possibilities for the
growing point include the leftmost site(s) of the started maps and
all unstarted enzyme maps. All of these possibilities must be
tried.
When the selected growing point belongs to an unstarted map, then
fitting multiple digest fragments employs special rules. If all of
the other enzymes involved in a multiple digest belong to unstarted
maps, then no multiple digest fragment is placed in the map. The
fragment that would have been chosen will actually be chosen at the
right end of the map, since the left end of this fragment is
positioned to the left of our arbitrary starting point and the map
is circular. If one or more of the other enzymes involved in the
multiple digest belong to started maps, then the fragment chosen
from the multiple digest is allowed to fit by much weaker rules
than normal. This situation is analogous to guessing where to start
the growing point map. Any fragment chosen from the multiple digest
will position the first site in the growing point digest, but the
method has already made one assumption that must not be
contradicted. The position of the growing point must not be to the
right of the right end of any other started map (i.e. contradicting
the definition of growing point).
Ending a Map
For linear problems involving only two enzymes, the method of the
present invention always completely solves the map at the step
numbered 9 in FIG. 5. Extending the number of enzymes, however,
involves some complications.
For linear maps, the method will have chosen one enzyme as the last
growing point (which is synonymous with the right end of the map)
and will have fit fragments from all multiple digests involving
that enzyme. Since the right end is a site which is common to all
maps and the growing point enzyme is a site which is not, the
method must fit a last fragment into all multiple digest maps that
do not contain the growing point enzyme. The fitting procedure is
the same as that depicted in FIG. 8, but each of the enzymes
involved in the multiple digest maps is hypothesized to be the
growing point, one at a time.
For circular maps, the method will always have two loose ends at
step 9 of the flowchart in FIG. 5. First, as in linear maps, there
is one last fragment to fit in each of the multiple digest maps
that do not involve the enzyme at the end of the map (and in this
case also at the beginning). Since the maps are circular, the
method must use special fitting rules to accomodate the part of the
map that extends beyond the arbitrary ending/starting point
wrapping around to the first site in the multiple digest map. The
difference in starting positions .+-. error is added to each single
digest map when this fragment is fit. In addition, each single
digest map must be tested for fit between the part of the map that
extends beyond the last growing point and the position in the map
where that digest begins. The fitting rules employed are equivalent
to rotating the map leftward to each of the single digest starting
sites and applying the standard rules for fit.
Evaluating and presenting a map
Once the method of the present invention has produced a site order
and an assignment of fragment order in every digest, it has thus
produced an assignment of every fragment to a map interval. This is
the requirement for use of the method of Schroeder and Blattner to
determine the coordinates of the sites of the restriction map
(Schroeder and Blattner, 1978). This method assigns map coordinates
so as to minimize the sum of the squares of the fractional
deviation between the fragment computed length (difference of
coordinates) and the fragment measured length. The present
invention employs this method to assign map coordinates to the maps
constructed. Maps are then sorted in order of best "score".
Score may be defined in a number of ways. The present inventors
have used both the normalized standard deviation of fragments and
the standard deviation weighted by measurement error. In either
case, the smallest score is the best fit. The first uses the sum of
squares of the fractional deviation (the same thing which is
minimized by the fitting method). The second uses the sum of
squares of the fractional deviation weighted by the inverse of %
fragment-error for the individual fragments. Thus a difference in
an interval whose measurement was known to be bad (proportionally)
does not count as much as a fragment whose measurement was known to
be good.
Enhancements and Variations
The following variations are alternative embodiments which may be
used in combination with the method described herein, or with other
methods of map computation.
(1) The method may impose an ordering so as to eliminate inverted
(mirror image) maps.
(2) When fitting doubles, the method may check "clues". A fragment
labelled as "new" in the double digest (either input by the user or
automatically labelled by the program because the size is distinct
from all single-digest fragments) is rejected if the method is
searching for a fragment with both ends of the same type. Those
fragments which can be automatically labeled by the program would
automatically be rejected because of size, but "clue" checking is
known to be much faster than checking the size. Those which are
close in size to a non-new fragment can't be distinguished as new
just by size of the input data but the user may be able to tell by
inspection of the gel and thus many false maps can be rejected if
the user also inputs clues where possible.
(3) In trying to add a single-digest fragment, use of clues such as
(cut-by enz X) to eliminate some fragments from consideration would
be a saving of computational effort.
(4) For linear molecules, use of clues such as: Left end, Right
end, or someEnd increases efficiency.
(5) It would be efficient to consider "identical" sized fragments
as the same so that both permutations are not output, thus
eliminating "identical" (for all practical purposes) maps. The
preferred embodiment uses "identical to the last digit", but this
may be varied by the user.
(6) It will be useful to extend the method to allow adding of
digests using additional enzymes to an existing map. Instead of
trying all orderings of single-digest fragments, for a mapped
enzyme, the method would allow adding them only in the mapped
order.
(6a) Similarly, it will be useful to extend the method to allow use
of any available partial mapping information, such as some of the
contained-in relationships or a map of part of the molecule such as
one has when mapping clones of an unknown DNA in a known vector.
This may be done either by constraining the order of adding
fragments or by by elimination of maps inconsistent with the
partial mapping information.
(7) The existence of "not-visible" fragments is readily detected by
consistency rules and easily handled by the present invention.
According to the preferred embodiment, the consistency checker (of
the data input and editing program) finds out if there are small
(not visible) missing fragments and the user is advised to add a
small fragment with value < some amount. This could be more
automated by letting the method assume existence of such
fragments.
(8) For those instances where an experimentor purifies a gel band
and redigests, it may be useful to add a "contained.sub.-- in" clue
for those fragments' relationship to other fragments.
(9) The inclusion of some indication of adjacency within a
particular digest may be added.
(10) The inclusion of some indication of the existence of fragments
obtained by partial (incomplete) digestions consisting of adjacent
fragments not cut may be useful. Such an indicator would be used
either to eliminate some "close" solutions, or to check additive
length of fragments.
Details of the Preferred Embodiment Computer Implementation
An implementation of the method of the present invention, written
in the `C` computer language for the IBM Personal Computer is
attached hereto as Appendix A. The system is structured into three
major parts: a fragment/digest editor, a map generator and a map
viewer.
The fragment/digest editor allows a user to enter digestion data
from the keyboard in a convenient manner and will also accept
measurements produced by a digitizer. The editor is resposible for
determining the overall consistency of the data and identifying
conditions that would cause problems for the map generator, such as
missing fragments.
The map generator is responsible for producing all maps consistent
with the fragment measurements and storing completed maps for
review.
The map viewer is a screen oriented system which allows the user to
examine the maps produced in various levels of detail and in a
variety of formats. Foremost is the ease of comparing any two maps.
The viewer also allows the user to further process any selected
maps into files compatible with other analysis and formatting
tools.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Data Structures of RMAP
Refering to Block 10 in FIG. 4, there are shown three data
structures for the control of the program RMAP during the
construction of possible restriction maps.
The FRAGMENT.sub.-- DATA are a set of ordered lists, one for each
different digestion in the input data. Each element of each list
contains a fragment size, fragment error, an assigned-or-available
flag, and (optionally) any "clues" about the fragment. Each list
also has an associated pointer which is the "next-available
fragment to try". It means that all available fragments above the
pointer have been tried and all available fragments below the
pointer have not been tried. Upon rejection of a fragment the
program sets the pointer to the next (after the rejected one)
available fragment in the ordered list. Upon acceptance of a
fragment the program sets the pointer to the beginning of the list
(first available fragment) in readiness for the next step of the
recursion.
The MAP.sub.-- TO.sub.-- DATE is a set of arrays, one for each
different digest in the input data. Each element of the array is a
fragment identifier. As the program proceeds with map construction
each fragment added to the growing map is represented by assigning
its identifier to the next element in the array. Each map array has
an associated integer, the number of fragments assigned to the map
to date, and two associated coordinate ranges, the minimum and
maximum values for the right end of the map and for the previous
end (i.e. left end of last fragment added).
The HEAP is a dynamic structure which stores the information needed
as the program proceeds back and forth through the recursion
levels. It is a stack of objects referencing lists of growing
points yet to be tried for a specific map position. Each element of
the list contains the growing-point enzyme type, right-end
coordinate range and previous coordinate range. A new list is
created with each recursion and "heaped" upon the last. Each
element of the list is examined, in turn, before the algorithm
retreats to the previous heap level.
At the completion of a map, the sites of the map may be read from
the heap in left to right order by traversing the heap from bottom
to top and referencing the first list element at each level.
Reading the Data
Data is read into the program from a data file. This file is
created by an editor written especially for the purpose. Input for
the editor can be either typed at the keyboard or obtained via
measurement by a digitizer tablet. The editor checks fragment
consistency using the rules outlined in Table 1.
Starting the map (FIG. 4 Block 20)
To start the map a fragment is selected from one of the single
digests. There are some constraints on which single enzyme digest
and which fragment from the digest may be used. For linear maps any
digest will do. The algorithm tries all fragments as the starting
fragment (see flow chart) however, fragment selection for the first
fragment is rejected if it conflicts with a "clue". i.e. if
"leftend" is labelled in the starting digest, it is the only
fragment from that digest which is possible as the starting
fragment. Similarly, "right end" is rejected as a starting
fragment, or if two directionally unspecified ends are in the data
set ("ends") only those two are tried. If clues for endpoints have
been specified, then one of the single digests containing clues
must be chosen to start the map, otherwise we may inadvertently
chose a fragment from an un-clued digest which conflicts with the
end clues. All unstarted maps are assigned a range for the position
of their right end site of zero.+-.zero.
For circular maps the enzyme digest to start is one which is
involved in at least one multiple digest with every other enzyme to
be mapped. This ensures complete coverage of the configuration
space. If there is not such a digest, an arbitry one is choosen.
All unstarted maps are assigned a range for the position of their
right end site that spans zero to the length of the first
fragment.
Recursive procedure BUILDMAP (FIG. 5)
We have implemented this method as a recursive procedure. However,
since the maximum depth of recursion is known for any given data
set (total number of single digest fragments) it could be
implemented non-recursively. The Pseudocode Listing of Table 2
outlines the computer procedure implementing the mapping method
described in FIG. 5.
The details described in the paragraphs below are numbered and
formatted in parallel to the Pseudocode Listing so the reader can
more easily refer back to the "big picture".
BEGIN BUILDMAP {a recursive procedure which tests to see if a
single-digest fragment which has been added to the map is accepted
and to try adding another fragment if it is}
1. (a). Determine possible GrowingPoints by finding leftmost right
end site(s) by comparing coordinates of the sites (stored on the
map-to-date). For this comparison, instead of using the global
range (error relative to left end) of the coordinates (stored in
map) use the relative error. This is obtained by subtracting the
error of the greatest-common reference point for the entire map
from the range of each right end coordinate. The site with the
leftmost coordinate and any other site(s) whose
relative-error-ranges for its coordinate overlap that of the
leftmost site are possible growing points.
i.e. Find RightEnd with lowest Low end of range. This is one new
GrowingPoint. Compare that G.P. High end with the low end for every
other RightEnd site. If G.P.High - RefError>otherLow then the
other site is also a potential GrowingPoint OR if
G.P.High>otherHigh then other is a potential GrowingPoint.
note: RefError is the error range of the leftmost PreviousEnd of
all the single digest maps.
(Special case if just starting a linear map--choose exactly one
unstarted map as growing point rather than all that intersect.)
(b). Place a new element on the heap containing a growing point
node for each possible growing point.
This contains (i) The enzyme type of the GrowingPoint
(ii) The coordinate range (global error) of the growing point
(iii) The coordinate range of the previous growing point
2. While untried GrowingPoints, Select Next GrowingPoint at the top
of the heap
3. Calculate size of fragments from multiple digests to look for
{use local errors}
(a) Examine all multiple digests which contain the growing point
enzyme.
(b) Find a fragment which runs from the current right end site of
the multiple digest map to the growing point.
(c) Use as the reference point the left end of the single fragment
in the growing point enzyme single digest.
(d) Criteria for fit: Sum of fragments with errors in the multiple
digest from the reference point rightwards, plus the query fragment
with error must intersect the size of the single digest fragment
(of growing point type) with its error.
(Note: special case if at beginning of circular map--sum of query
fragment minus error and starting position of multiple digest map
minus error must be less than the leftmost single digest map right
end plus error. This effectively places the first site in the
growing point digest map anywhere from the last growing point to
the next most likely growing point. After fitting one fragment
using this rule all others must fit as if the growing point digest
map begins at this position.)
4. While an untried set of fragments which fit exist, Select
fragments from multiple digests.
One fragment from each multiple digest 3(a) must fit 3(d).
Existence of set that does fit allows acceptance of single digest
fragment added to map in previous recursion and the algorithm can
proceed forward. This loop tries all permutations of one fragment
from each multiple which fits against the single.
At this point, the single-digest fragment added in the last
recursion has been accepted. Now proceed to get ready to add the
next fragment to the map.
5. Coalesce end of single digest map with each multiple digest.
Using the set of fragments from multiples selected above (Step
4):
(a) take the intersection of the ranges of all the sums used to fit
the multiple digest fragments and the range of the single. If there
is an empty intersection, reject this permutation of multiple
digest fragments. {continue with step 4}
(b) add this range (a) to the PreviousEnd of the growing point
digest. Also sum the errors.
c) adjust the coordinate of the growing point by putting the result
(b) in the right end coordinate of the map of the single
digest.
Note: this is where the starting position of circular maps of the
single digests is determined and why this method can map circular
molecules without a specific user input offset. If we are just
starting the growing point digest, then we substitute 0.+-.0 for
the PreviousEnd in step 5(b).
{start at beginning of list of unassigned fragments for single
digest of GrowingPoint enzyme. This important point is handled by
the selection routine for single digest fragments so the pointer is
set to the top of the list in a previous recursion when the last
fragment was selected.}
6. Are any single digest fragments untried? if not, then proceed to
step 9, otherwise
While still untried fragments in single digest of GrowingPoint
type,
(a) Select next fragment from single digest to try adding to map at
GrowingPoint.
(b) Calculate new right end for that map.
(i) save the right-end as PreviousEnd.
(ii) The new right-end is right-end (stored in map)+(fragment size
of fragment selected.+-.error on selected fragment).
7. Call BUILDMAP {to see if added fragment is accepted and to try
adding another fragment if it is. return here if fragment selected
in 6 above is rejected}
8. Remove fragment from Map. Return fragment to fragment list, Set
next-frag-to-try pointer to next fragment after this one. Restore
PreviousEnd and RightEnd (by reading from heap) end Loop 6 (while
still fragments left to try)
9. Query is this a map? If yes, record the map.
Remove multiple digest fragments selected in step 4 from map.
Return fragments to fragment list.
end Loop 4 (while still permutations of fragments from each double
digest that fit)
remove GrowingPoint node from Heap
end Loop 2 (while still untried growing points)
remove top level from heap (contains no more GrowingPoint nodes
now)
RETURN {fragment added in previous recursion doesn't fit}
Starting Maps
FIG. 6 outlines the procedure for fitting double digest fragments.
This includes the special cases of just starting a map.
Starting a linear map
When any of the single digest maps are not yet started (i.e. no
fragments yet assigned to map) special rules apply to finding the
(next) growing point and to fitting fragments. The growing point
selected is the next (abitrary order) unstarted digest. Since the
growing point is synonymous with the left end, we would look for a
fragment of zero length to be fit and obviously no such fragment
would be represented in the data. The previously added single
digest fragment is accepted (without assigning any multiple digest
fragments) and a single digest fragment from the new growing point
is added.
Starting a circular map
When any of the single digest maps are not yet started special
rules apply to finding the (next) growing point and to fitting
fragments. The possible growing points are the leftmost right end
site(s) of the started digest map(s) and all unstarted digest maps
(eg. at the first call of BUILDMAP all single digest maps are
determined to be possible growing points). All possibilities must
be tried one after the other.
Special rules also apply to fitting multiple digest fragments. The
various cases are flowcharted in FIG. 6 and fitting rules for the
special case diagramed in FIG. 10. In order to understand why these
rules work it is necessary to realize that the selection of a
growing point has made an hypothesis about the order of sites in
the map. Since we try all possible growing points this is
legitimate. Any multiple digest involving the selected growing
point enzyme must be considered. A fragment to be fit is to be
placed left of (ending at) the growing point.
If the growing point enzyme map has already been started, there are
two cases to consider, the other enzyme maps in the multiple digest
have not been started or they (any one) have been started. If both
the growing point digest map and (one of) the other enzyme maps
have been started, then the regular rules apply. If none of the
other enzymes in the multiple digest have started maps then the
first site for that enzyme has not yet been encounted and is right
of the growing point. Since the growing point enzyme map has
started, the fragment to be fit is the same size as the last
fragment in the growing point enzyme digest map.
If the growing point enzyme map has not been started then, we don't
really know the position of the growing point except that it is
right of the previous growing point and left of any other right end
in the map. If none of the maps for any of the other enzymes in the
double digest have been started, the fragment left of the growing
point extends left of the arbitrary starting point for the map and
so no fragment needs be fit. (Unlike the case of linear where the
concept of a fragment left of the left end is meaningless, there
really is such a fragment in a circular map. It is fit at the end
of the process.) However, when the growing point digest map has not
been started and the other enzyme map has been started, the
fragment to be fit runs from the other enzyme to the growing point.
Thus the first double digest fragment to be fit has a large error
since the uncertainty of the location of the growing point site is
large. After the first multiple digest fragment is selected the
site location is set using this fragment and fragments to be fit
from any other appropriate multiple digests have errors based on
the local error of the first (i.e. sum of overlapping
fragments).
Ending a Map
When all fragments from single digests have been placed in the map
(FIG. 5, Block 200) the algorithm has reached a stage where
BUILDMAP has made all the forward progress that it can. In linear
maps with two enzymes this stage represents a completed map, but in
more complicated maps one last fitting step still remains before we
have a complete map.
For linear maps of three or more enzymes, we have chosen only one
of the single digest enzymes as the last growing point. Since that
site represents the right end of the map, it is a site common to
all digests, but we have only fit multiple digest fragments for
digests that contain the growing point enzyme. Therefore, we need
to determine that the last unchosen fragment from each of the
multiple digests NOT containing the growing point enzyme will fit
at the right end of the map.
To accomplish that, we scan the set of multiple digests in a
predefined order (our implementation uses the order of occurrence
in the data). Each multiple digest that contains an unchosen
fragment has that fragment chosen and tested against each of the
single digest maps for the enzymes it contains. This involves
hypothesizing each other enzyme as the last growing point and then
applying the fitting rules shown in FIG. 8. If all such fragments
fit, then the map is complete and we proceed to Block 210 of FIG.
5.
In order to restore the state of the map in such a way that
BUILDMAP can continue we must replace exactly the set of fragments
chosen by this last step. This is accomplished by replacing one
fragment from the end of each multiple digest map NOT containing
the last growing point enzyme in the same predefined order in which
these fragments were selected. The ordering is important if any of
these fragments fails to fit, since we must replace ONLY those
fragments chosen by this last step.
For circular maps of two or more enzymes, we have completed the map
for all digests containing the enzyme of the single digest used to
begin the map. Although all single digest fragments have been
chosen at this step, we still need to fit a fragment from each
multiple digest NOT containing the growing point enzyme (same as
the enzyme site at the start/end of the map). The procedure for
fitting these remaining fragments is much the same as for the
linear case detailed above, but only the leftmost of the single
digest right ends for enzymes contained in the multiple digest is
tested for fit, since all fragments to the right of this point have
already been fit at the beginning of the map.
Since circular maps allow weak fitting rules in order to start
single digest maps, we are in a position at the end of the map to
test this hypothesis using the stronger fitting rules of FIG. 8.
Although all fragments have been placed in the map, we may have
managed to avoid testing the fit of some single digest fragments
that span our arbitrary starting/ending site. The only fragments
which have escaped scrutiny are those which span the start/end site
and are composed of three or more multiple digest fragments. Rather
than identifying these individual cases, our implementation simply
tests every one of the rightmost single digest fragments against
the multiple digest fragments contained within it using the rules
shown in FIG. 8 modified to cycle to the beginning of the map at
the start/end site.
Recording a completed map
Once a complete map has been identified, we are in a position to
evaluate the quality of the map in relation to the others generated
by BUILDMAP. The present invention applies the method of Schroeder
and Blattner to compute positions for all map sites based upon the
generated order of fragmentation data. A measure of goodness of fit
is then calculated using the formula: ##EQU3## where: Fm=measured
fragment size from site i to site j
Fc=computed fragment size from site i to site j
n=total number of fragments represented in data
M=measure of deviation evaluated over all pairs i,j which are
measured
A perfect map produces a measure of zero and larger measures
indicate greater deviation from an optimal map. The preferred
embodiment ranks maps in order of increasing measure of deviation
and records maps in a randomly accessed file indexed by a file of
deviation measures. Purely as a matter of convenience, only up to
fifty maps are stored. Beyond fifty maps, measures better than the
fiftieth map cause replacement by the current map and an update of
the index file maintaining the measures in increasing order.
Each map is represented in the randomly accesed file by a record
composed of:
(1) the vector of site coordinates produced by the method.
(2) the order of single digest enzymes corresponding to the sites
of (1) that is internally represented by the first element of each
heap list.
(3) the generated order of fragment identifiers for each digest map
represented in the data.
All other information (such as fragment sizes, error, enzyme names,
etc.) is assumed to be reproducible from the input data.
The index contains all references necessary to reproduce the input
data (specifically the data pathname), the number of maps stored in
the randomly accessed file and a record for each map
containing:
(1) the position of the map record in the random file.
(2) the measure of deviation for that map.
Structuring the output this way promotes complete modularity of the
reviewing process, therefore it may be implemented as a completely
independent program.
Map Consistency
In order to successfully apply the map generation algorithm,
several requirements of the input data must be met. These
requirements are embodied by the following assumptions:
______________________________________ Purity Rule: (1) The DNA to
be mapped is a single molecular strain. Topology Rule: (2) The DNA
is known to be exclu- sively either circular or linear. Digestion
Rule: (3) Digestion of the DNA has proceeded to completion.
Combination Rule: (4) At least two independent sets of cleavage
products are represented, separately and in combination. Orphan
Rule: (5) Each enzyme or proper subset of enzymes involved in a
multiple digestion is represented by a separate (single) digestion
and identified as such Completeness Rule: (6) Every fragment has
been represented in the data and a - measurement and bounding error
has been assigned to it. Uniqueness Rule: (7) Any fragments of
identical size are individually represented.
______________________________________
From these primary assumptions we have formulated tests which may
be used to qualify the data for map generation (Table 1). Data
failing any one of these tests will guarantee incomplete coverage
of the configuration space and will give unsatisfactory (or no)
results, therefore, all tests must succeed before any attempt at
map generation is made.
The tests in Table 1 apply assumptions 1,2,3,6, and 7 in
combination to yield equations relating the number of fragments
between single and multiple digests. Any digests not meeting these
equalities are assumed to have violated either the Completeness
Rule (6) or the Uniqueness Rule (7) and the user is directed to
re-examine the experiment in order to identify the cause of the
inconsistency.
Application of the Combination Rule (4) involves identifying at
least two single digestions and at least one multiple digestion in
the data. Once that has been satisfied we may apply the Orphan Rule
(5) to determine whether any set of single digestions is missing.
Passing this last test qualifies the data for map generation.
STATEMENT OF INDUSTRIAL UTILITY
The present invention may be useful for the generation of
restriction maps of DNA or for solutions to similar mapping
problems. DNA restriction maps find application in the design of
recombinant DNA, and in the elucidation of DNA structures for
purposes of genetic comparison, diagnosis, and other such uses.
Glossary of Terms
Decomposition Fragment--See Fragment.
DNA--Deoxyribonucleic Acid. For mapping purposes it is a
one-dimensional entity of finite length either linear (two ends) or
circular (no ends).
Double-digest--All the fragments produced by incubation of a DNA
molecule with two restriction enzymes at once.
Fragment--A subsection of a DNA molecule created by cleaving the
DNA with restriction enzyme(s). The ends of the fragment are two
restriction sites. The size (length) of the fragment is the
distance between the two restriction sites on the DNA molecule from
which the fragment was cleaved. Fragments may also be sub-fragments
of larger fragments produced by single digests.
Genome--The entire genetic content of an organism. For most
organisms, this comprises one or more molecules of DNA. (Exceptions
are some viruses, e.g. retroviruses which have RNA genomes.)
Growing Point--The data point used in the construction of maps that
according to the present invention which is defined as the site in
the restriction map chosen from the set of right ends of all single
digest maps which is assumed to be furthest to the left.
Growing Point Enzyme--The enzyme used in the single digest which
has been chosen as the growing point.
Map--An enumeration of the sites for one or more restriction
enzymes on a particular DNA molecule.
Mapping--The process of deducing the map of a particular DNA
molecule for one or more restriction enzymes from single and
multiple digests.
Measurement error--The length of fragments can be measured. Methods
generally used to obtain fragment sizes for the purposes of mapping
have a measurement error roughly proportional to fragment length.
(i.e. the error is proportional to fragment length, such as .+-.5%
rather than absolute like .+-.620)
Multiple-Digest--All the fragments produced by incubation of a DNA
molecule with two or more restriction enzymes at once.
Permutation--A unique ordering of objects. (E.g. a permutation of
fragments or sites).
Restriction enzyme--An endonuclease which cleaves DNA at one or
more specific sites.
Restriction site--A site on a DNA molecule at which a restriction
enzyme cleaves. It is usually identified with the name of the
enzyme which cleaves at that site and the position of the site
(.+-. error) on the DNA molecule in question.
Single-digest--All the fragments produced by incubation of a DNA
molecule with any one restriction enzyme.
Reference Point--A site in the restriction map used for the purpose
of comparing the right ends of two single digest maps where error
is localized.
TABLE 1 ______________________________________ FRAGMENT CONSISTENCY
RULES ______________________________________ let f.sub.1 = number
of fragments in digest 1 f.sub.2 = number of fragments in digest 2
f.sub.1,2 = number of fragments in 1 + 2 double digest s.sub.1 =
number of sites of type 1 s.sub.2 = number of sites of type 2
s.sub.1,2 = number of sites of either type 1 or 2 for linear
molecules f.sub.1 = s.sub.1 + 1 f.sub.2 = s.sub.2 + 1 f.sub.1,2 =
s.sub.1 + s.sub.2 + 1 = f.sub.1 + f.sub.2 - 1 for circular
molecules f.sub.1 = s.sub.1 f.sub.2 = s.sub.2 f.sub.1,2 = f.sub.1 +
f.sub.2 For digests containing "clue" information: let cut.sub.1 =
number of fragments of digest 1 cut by enzyme 2 uncut.sub.1 =
number of fragments of digest 1 not cut by enzyme 2 new = number of
fragments in the double digest which are not in either single
digest cut.sub.1 + uncut.sub.1 = f.sub.1 cut.sub.2 + uncut.sub.2 =
f.sub.2 uncut.sub.1 + uncut.sub.2 + new = f.sub.1,2 assume s.sub.1
and s.sub.2 > 0, 1 .ltoreq. cut.sub.1 .ltoreq. s.sub.2 1
.ltoreq. cut.sub.2 .ltoreq. s.sub.1 for linear molecules new =
cut.sub.1 + cut.sub.2 - 1 .vertline.f.sub.1 - f.sub.2 .vertline. -
1 .ltoreq. uncut.sub.1 + uncut.sub.2 .ltoreq. .vertline.f.sub.1 -
f.sub.2 .vertline. + 1 1 .ltoreq. new .ltoreq. 2*
min(s.sub.1,s.sub.2) for circular molecules new = cut.sub.1 +
cut.sub.2 uncut.sub.1 + uncut.sub.2 = .vertline.f.sub.1 - f.sub.2
.vertline. 2 .ltoreq. new .ltoreq. 2* min(s.sub.1,s.sub.2)
______________________________________
TABLE 2
__________________________________________________________________________
Outline of Procedure Buildmap formated like the implemented C
program. This procedure implements the method FlowCharted in FIG.
5. Steps are numbered in accordance with FlowChart in FIG. 5. Terms
are defined in the text and the method is explained in text and
figures. In this table, messages surounded by brackets are
{comments}. Overview of Recursive Procedure BUILDMAP for
constructing restriction maps from fragment size data
__________________________________________________________________________
Begin BUILDMAP {to see if added fragment is accepted and to try
adding another fragment if it is} Determine which right ends may be
growing points by finding leftmost right end site(s) {by comparing
the coordinates using relative error} While untried GrowingPoints,
2. Select GrowingPoint 3. Calculate size of fragments from multiple
digests to look for {use local errors} While a set of untried
fragments which fit exist 4. Select a permutation fragments from
multiple {fragments must fit. Existence of set that do allows
acceptance of single digest fragment added to map in last
recursion} 5. rectify maps by coalescing sites {using selected set
of fragments from multiples which fit to adjust the coordinate of
the growing point} {start at beginning of list of unassigned
fragments for single digest of GrowingPoint enzyme. note this is
handled by the Select routine} While still untried fragments in
single digest of GrowingPoint Enzyme 6. Select next fragment from
single digest to try adding to GrowingPoint and Calculate new right
end for that map. 7. Call BUILDMAP {to see if added fragment is
accepted try adding another fragment if it is} {return to here if
reject fragment selected in 6 above} 8. Remove fragment (selected
in 6) from map, mark it as tried and replace it in fragment list
end while {still fragments left to try} 9. Query is this a map? If
yes, go to 10. OUTPUT PROCEDURE {return here after successful - map
output so we can continue and get all maps} end while {still
permutations of fragments from each double digest that fit} end
while {still untried growing points} RETURN {fragment added in
previous recursion doesn't fit}
__________________________________________________________________________
##SPC1##
* * * * *