System and method for fragmentation mapping Patent Grant Daniels , et al. September 13, 1 [Dnastar, Inc.]

System and method for fragmentation mapping

Patent Grant 4771384

U.S. patent number 4,771,384 [Application Number 06/889,981] was granted by the patent office on 1988-09-13 for system and method for fragmentation mapping. This patent grant is currently assigned to Dnastar, Inc. Invention is credited to Frederick R. Blattner, Donna L. Daniels, John L. Schroeder, Michael Waterman.

United States Patent	4,771,384
Daniels , et al.	September 13, 1988

**Please see images for: ( Certificate of Correction ) **

Abstract

A system and method for the construction of one-dimensional maps from fragmentation data is disclosed. Particularly useful for construction of restriction maps of DNA, the system and method completely permutes sites, single digest fragments, and any available multiple digest fragments, and displays maps in rank-order according to a quality factor. Display of constructed maps includes information about relative ordering of all fragments, sites, and particularly about closely-spaced sites and fragments.

Inventors:	Daniels; Donna L. (Cross Plains, WI), Schroeder; John L. (Madison, WI), Blattner; Frederick R. (Madison, WI), Waterman; Michael (Culver City, CA)
Assignee:	Dnastar, Inc. (Madison, WI)
Family ID:	25396069
Appl. No.:	06/889,981
Filed:	July 24, 1986

Current U.S. Class:	382/129; 204/456; 435/6.12; 702/20
Current CPC Class:	C12Q 1/683 (20130101); G01N 27/44717 (20130101)
Current International Class:	C12Q 1/68 (20060101); G01N 27/447 (20060101); G05F 015/42 (); G06G 007/58 ()
Field of Search:	;364/413,497,498,499 ;204/182.8,182.9 ;935/3,9,75,76

References Cited

U.S. Patent Documents


4412288	October 1983	Herman
4518690	May 1985	Guntaka
4675283	June 1987	Roninson

Foreign Patent Documents


3300632	Jul 1984	DE
59-182366	Oct 1984	JP
59-193355	Nov 1984	JP

Other References

Pearson, William R., Automatic Construction of Restriction Site Maps, Nucleic Acids Research, vol. 10, No. 1, 1982.
Nolan, Garry P. et al., Plasmid Mapping Computer Program, Nucleic Acids Research, vol. 12, No. 1, 1984.
Durand and Bregegere, An Efficient Program to Construct Restriction Maps from Experimental Data with Realistic Error Levels, Nucleic Acids Research, vol. 12, No. 1, 1984.
Wulkan and Lott, Computer-Aided Construction of Nucleic Acid Restriction Maps Using Defined Vectors, CABIOS, vol. 1, No. 4, 1985.
Fitch, Smith & Ralph, Mapping the Order of DNA Restriction Fragments, Gene, vol. 22, 1983.

Primary Examiner: Kucia; R. R.
Attorney, Agent or Firm: Isaksen, Lathrop, Esch, Hart & Clark

Claims

What is claimed is:

1. A method for constructing fragmentation maps of molecules of DNA comprising the steps of

(a) digesting molecules of the DNA with at least two digesting agents which fragment the DNA at characteristic restriction sites, said digesting being conducted with each agent separately as well as with both agents together;

(b) analyzing the approximate length of the fragments created by the digesting step; and

(c) entering the approximate length values for the fragments into a digital computer programmed with the steps of

(i) permuting incrementally from a beginning fragment from digestion by a first of the agents arrangements of additional sites and fragments from digestion by a second and by both of the agents to construct a plurality of hypotheses for correct additions to the fragmentation map begun with the beginning fragment;

(ii) testing each hypothesis by computing local additive lengths of the additions to said map hypothesis and testing for the existence of additional fragments of a length corresponding to said computed length;

(iii) rejecting each of said hypotheses for which no such additional fragment is found of the correct length;

(iv) accepting incrementally each of said map hypotheses for which said additional fragment is found of the correct length; and

(v) outputting all the accepted map hypotheses which utilize all the fragments as possibly correct fragmentation maps.

2. The method of claim 1, wherein in step (i) for all sites calculated from said digestion by the first agent, every permutation of said sites each in combination with every permutation of the fragments from the digestion by the first agent, and each in combination with every permutation of the fragments from the digestion by both agents is generated as an hypothesis for a possibly correct fragmentation map.

3. The method of claim 1, wherein in step (i) every permutation of the fragments from the digestion by the first agent in combination with every permutation of said sites which combination is consistent with the sizes of said fragments from the digestion by the first agent within predetermined error limits, and each in combination with every permutation of the fragments from the digestion by both agents is generated as an hypothesis for a possibly correct fragmentation map.

4. The method of claim 3 further including the step of

evaluating said permutations on the basis of consistency with the length of the fragments from the digestion with both agents, and rejecting any combination which is inconsistent with said data.

5. A method for constructing fragmentation maps of DNA molecules comprising the steps of

(a) digesting the DNA molecules separately with two different enzymes which cut the DNA at characteristic sites into fragments and also digesting the DNA molecules together with both enzymes;

(b) analyzing the approximate length of the fragments created by the digesting step; and

(c) entering the approximate lengths of the fragments into a digital computer programmed to perform the steps of

(i) generating each permutation of all fragments from digestion by one enzyme in combination with all permutations of sites which are consistent with the lengths of said fragments from digestion by one enzyme within predetermined error limits,

(ii) computing the minimum and maximum possible sizes for each interval between sites in said combinations by summing the lengths and the possible error,

(iii) selecting fragments from the fragments from digestion by the other enzyme and by both enzymes that are consistent with said computed sizes within possible error,

(iv) generating all possible permutations of the order of the fragments so selected, and

(v) evaluating the overall cumulative error of the fit of said selected orders of fragments to provide an indication of overall probability of said orders to a user.

6. A method of constructing fragmentation maps of DNA molecules comprising the steps of

(a) digesting the DNA molecules separately with at least first and second digestion agents, and jointly with both agents, each agent cutting the DNA molecule at a characteristic site;

(b) analyzing the lengths of the fragments created by the digestions; and

(c) entering the lengths of the fragments into a digital computer programmed to perform the steps of

(i) maintaining three parallel tentative maps of the order of the fragments from the digestion by the first agent, by the second agent, and by both agents,

(ii) beginning with a fragment from the digestion by the first agent;

(iii) selecting a site for addition to the tentative maps, the site being selected as the end of the shorter of the two maps from digestion with one of the first and second agents,

(iv) tentatively adding to the site a fragment selected from the fragments from the digest by the respective agent for that map, to thus create a hypothesis,

(v) testing the hypothesis by testing for the existence of fragments from the digestion by both agents for addition to the map for both agents which is consistent with the hypothesis,

(vi) if the testing of the hypothesis fails, repeat steps (iv) and (v) for each remaining fragment for the respective map,

(vii) if the testing of the hypothesis proceeds, repeat steps (iii) through (vi) until all fragments from all digests are assigned to one of the fragmentation maps, and

(viii) providing as an output to the user all sets of fragmentation maps which use all fragments and which tested correctly.

7. The method of claim 6 wherein the computer is programmed before step (viii) with the step of if no set of fragmentation maps is generated which uses all fragments and tests correctly, then selecting a different fragment from the digestion by the first agent as the beginning in step (ii) and repeating step (iii) through (vii).

8. The method of claim 6 wherein an error is assigned to the lengths of each of the fragments in the step of analyzing, which error is entered into the computer and wherein the computer is further programmed to sum the combined error for each complete set of fragmentation maps and to indicate to the user the relative combined error for each such set.

9. The method of claim 6 wherein the computer is further programmed, as a result of the testing, to reject all additional hypotheses which are mirror images or which are additions to each hypothesis which fails the test in step (v).

10. The method of claim 6 wherein the user may input constraining information into the computer to constrain the selection in steps (ii), (iii) or (iv).

11. The method of claim 10 wherein the constraining information comprises at least one of

a fragment occupying a defined position on a map,

a fragment is cut by a selected agent,

a fragment from the digestion by both agents is contained in a certain fragment from the digestion by a single agent,

a fragment is adjacent a certain fragment,

a fragment from one digest is identical to a certain fragment from another digest, and

a fragment has a size such that its length has not been determined.

Description

FIELD OF THE INVENTION

The present invention relates to the computer-assisted construction of one-dimensional linear or circular maps from fragment size data. The preferred embodiment allows construction of restriction maps of deoxyribonucleic acid (DNA) for use by experimenters in the fields of molecular biology, molecular genetics and genetic engineering.

BACKGROUND OF THE INVENTION

Restriction enzymes are endonucleases (see Glossary of Terms) which cleave DNA at short, specific nucleotide sequences. Purified DNA can be cleaved at these specific sites on the molecule and the length of the cleavage products measured. By a variety of experimental techniques, the order of these fragments in the molecule can be determined and thus one can "map" the restriction sites of a genome. At minimum, a map is an ordered list of restriction sites and their positions on the DNA molecule. A map may also contain an identification of every fragment in the data set as belonging to an interval on the map. For most purposes, the usefulness of a map is dependent on this information. Such a map is a physical map of the genome and can be correlated with a genetic map or with physical maps derived by other means (such as EM Heteroduplex analysis).

Restriction enzymes have proven useful for the physical dissection of the genomes of organisms ranging from bacteria to mammals. These endonucleases have been used to map and compare genomes, to produce DNA fragments for DNA sequencing and to construct recombinant DNA molecules. Maps are useful both for the isolation of defined regions of the genome and in the analysis of hybrid chromosomes and deletion mutants.

For example, restriction enzymes are the "scissors" used to cut large genomes (billions of nucleotides (bases) long) into pieces for subsequent molecular cloning. It was only the discovery of restriction enzymes and their characteristics which allowed the development of the analysis techniques collectively known as "recombinant DNA technology". The ability to clone pieces of DNA of interest out of the huge background of the genome and replicate them along with the cloning vector permits the isolation of useful amounts of reasonably pure DNA for study. This has led to experiments for the elucidation of molecular genetic details about gene structure and function that were impossible a decade ago.

Restriction enzymes are also an important tool of genome analysis. For example, when screening a "shotgun" (large, mixed collection of all clones of a genome) for a particular feature of interest, an experimenter generally finds a number of positive candidates. The relationship of the candidates can be elucidated by restriction mapping. Two candidates might be identical (and would thus have an identical restriction map), they might be overlapping (and would thus each have a part of its map identical with the other) or they might be unrelated (and would thus have quite different restriction maps). In the case of overlapping clones, this kind of analysis further locates the feature selected for when isolating the positive candidates to the DNA contained in both clones (i.e., the overlapping region of the restriction maps). If two of the candidates selected have unrelated maps, this suggests that two different regions of the genome (two different "genes") can produce the same feature selected for.

Similarly, genomes related in a number of ways can be compared by their restriction maps. For example, two related but distinct organisms (such as bacteriophage lambda and bacteriophage phi 80) can genetically recombine to yield recombinant progeny with some of the features of each parent. If one determines a restriction map of the recombinant and compares it with the maps of the parents, one can determine which region of the genome came from which parent, where the recombination events occurred in the DNA, and may be able to assign some of the features to particular regions in the DNA.

In recent years it has become possible to determine the nucleotide sequence of DNA in a particular gene or region of interest. DNA is a linear string of subunits, each of which can be one of four types (A, G, C, or T). The experiments to determine the sequence or order of subunits in a given region of DNA are greatly facilitated by having a restriction map of the region of interest.

DESCRIPTION OF THE PRIOR ART

Since maps are so useful, constructing restriction maps from fragment size data (usually derived from gel electrophoresis of digested DNA) is a common activity of molecular biologists. One method frequently used is referred to in the literature as "by inspection". This generally means that DNA which has been digested to completion by a particular restriction enzyme is run in one channel of an electrophoresis gel, a second complete digest with a different enzyme in another channel and the complete digest with both enzymes together in a channel between the two. Electrophoresis is known to separate DNA fragments on the basis of size. The sizes of the DNA fragments in the digestion are then measured by comparing their mobility on the gel to the mobility of a set of known size standards.

The researcher then tries to piece together all of these fragments, much like a jigsaw puzzle. He/she tries to determine an ordering of every fragment in each of the three digests. In doing so, the researcher follows rules which express the underlying logic of the original arrangement of the fragments. For example, each fragment in each single digest must be identified with some subset of the fragments in the double digest whose sum of lengths adds up within error to the length of the single digest fragment. Similarly, each end of each double digest fragment must be identified with an end of a fragment from one or the other of the single digests.
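The additivity rule stated above can be illustrated with a short sketch. It is not taken from the disclosure: the function name `containing_subsets`, the relative error `rel_err`, and the assumption that the tolerance grows linearly with the lengths being compared are all hypothetical choices made for illustration.

```python
from itertools import combinations

def containing_subsets(single_len, double_lens, rel_err=0.05):
    """Return index tuples of double-digest fragments whose summed length
    matches one single-digest fragment within error (illustrative only)."""
    matches = []
    for k in range(1, len(double_lens) + 1):
        for idx in combinations(range(len(double_lens)), k):
            total = sum(double_lens[i] for i in idx)
            # Assumed tolerance: the relative error applies to every measured
            # length involved, so the allowances are simply added together.
            tol = rel_err * (single_len + total)
            if abs(total - single_len) <= tol:
                matches.append(idx)
    return matches

# Hypothetical data: one 1000-long single-digest fragment and four
# double-digest fragments.
print(containing_subsets(1000, [700, 300, 450, 50], rel_err=0.02))
```

Exhaustive subset enumeration grows exponentially with the number of double digest fragments, which is one reason the hand procedure becomes tedious for larger data sets.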

The result is a linear map of the original DNA with an ordering and position (plus or minus error) of each site for both enzymes and an identification of every fragment on the gel to an interval between two sites on the map and of every interval on the map to a fragment on the gel.

Mapping "by inspection" (i.e. by hand) is most common among molecular biologists. However, several computer programs or algorithms have been reported in the literature which claim to find all possible maps. All use as input data the fragment sizes of fragments in single and multiple digests. As will be discussed below, each of the prior art systems suffers from some inherent defect which leaves much to be desired by the genetic researcher.

Prior Art Algorithms and Computer Programs

Stefik--Artificial Intelligence (1978) 11:85-114. Doctoral Dissertation--1980.

Implemented for the Stanford Sumex system. This method permutes sites and intervals, termed "segments", and rejects maps using heuristic rules. It is not an algorithm, but a set of heuristic rules. According to Stefik's dissertation, the method is not intended to be a complete solution generator; it produces only a small subset of the possible solutions and is implemented as a rule-driven artificial intelligence program.

Pearson NAR (1982) 10:217-227.

Implemented on a DEC-10 minicomputer. Permutes only fragments from single digests. This method "predicts" double digestion fragments (by an unexplained method), assigns double digestion fragments to these predictions in order, and takes the least sum of squares of absolute differences as the goodness of fit. The configuration space is identified as $A! \times B!$. The method is said to be extendable to more than two enzymes. It computes pairs first, then adds a third, etc., to the list of possible maps.
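For orientation only, a permute-and-score scheme of this general kind might look like the sketch below. Because the prediction step is not explained in the cited paper, everything here is an assumption: the double digest is predicted by overlaying the cut positions implied by each ordering of single digest fragments on a linear molecule, and the names `predict_double`, `score`, and `rank_maps` are hypothetical.

```python
from itertools import accumulate, permutations

def predict_double(order_a, order_b):
    """Overlay two single-digest orderings on a linear molecule and return
    the implied double-digest fragment lengths, left to right."""
    cuts_a = list(accumulate(order_a))[:-1]  # interior cut positions, enzyme A
    cuts_b = list(accumulate(order_b))[:-1]  # interior cut positions, enzyme B
    points = sorted([0] + cuts_a + cuts_b + [sum(order_a)])
    return [right - left for left, right in zip(points, points[1:])]

def score(predicted, observed):
    """Sum of squared differences after pairing fragments in size order."""
    if len(predicted) != len(observed):
        return float("inf")
    pairs = zip(sorted(predicted, reverse=True), sorted(observed, reverse=True))
    return sum((p - o) ** 2 for p, o in pairs)

def rank_maps(frags_a, frags_b, observed_double):
    """Score every pair of orderings (the A! x B! space) and sort by fit."""
    results = []
    for pa in set(permutations(frags_a)):
        for pb in set(permutations(frags_b)):
            results.append((score(predict_double(pa, pb), observed_double), pa, pb))
    return sorted(results)

# Hypothetical, error-free data for a 1000-long linear molecule.
for fit, pa, pb in rank_maps([300, 700], [600, 400], [400, 300, 300]):
    print(fit, pa, pb)
```

In this toy data set two orderings fit perfectly; they are mirror images of one another, which is why a mapping program may wish to report only one member of each mirror-image pair.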

The Pearson method does not permute double digest fragments nor does it consider alternative orders of closely spaced sites. On particular sets of input data, this method has been shown to miss the correct map.

Fitch, et al. Gene (1983) 22:19-29.

This approach is a branch-and-bound method. The method is limited to two enzymes and 3 digests. There is no indication of any ability to handle additional enzymes.

The method appears to attempt to do what is commonly done by hand in a "systematic" way. The "bounding rules" described are complex and their interaction with each other and the procedure cannot be clearly understood from the description.

Fitch, et al. start by trying to determine all possible "contained-in" relationships for the smallest single digest fragment. The method uses local error calculations to verify that the sum of double digest fragments' length could fit within a selected single digest fragment. A difficulty in this approach is that the method attaches great value to assigning uncut fragments (a necessarily tentative initial assignment). The method then tries to make the uncut assignment definite by other information. When all "contained-in" possibilities are enumerated for every single digest, they have to be resolved into all possible consistent sets of one per fragment and then maps extracted for each set.

Durand, et al. NAR (1984) 12:703-716.

Durand, et al. propose another method using branch and bound (although not called such in the paper). The method handles many enzymes at once, and builds from the left end of the molecule. The method does not assign double digest fragments, however. It also does not consider alternative orderings of closely spaced sites. No use of local error is made, thus preventing the method from computing long maps efficiently (although the paper suggests starting from both ends at once). The algorithm can accept as input partial information about the map.

Wulkan et al. CABIOS (1985) 1:4 pp 235-239.

Wulkan et al. present an implementation of the method of Pearson for a microcomputer. The addition of "clues" such as the contained-within relationships of the data speeds computation of maps.

Nolan, et al. NAR (1984) 12:717

This method produces maps using a branch-and-bound method involving construction of "forks" and permuting them. "Forks" are sets of fragments from one double digest that sum to the same length within error as one fragment from a single digest. As in Fitch, et al., the first step is to determine all possible forks in the data set.

The method is inherently limited to two enzymes and one double digest. Two versions of the method exist for circular and linear maps. It is not clear whether the "fork" generation method covers all combinations. (Combinatorics of the permutation method are: $F! + \sum_i F_i!$, where $F$ is the number of forks and $F_i$ is the number of double digest fragments in the $i$-th fork.) It builds forks together using global error. A set of heuristics is also presented which is intended to bound the set of fork permutations.
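To give a feel for the size of the quoted expression, the following sketch simply evaluates F! plus the sum of the F_i! terms for a list of fork sizes. The function name `fork_permutation_count` and the example sizes are hypothetical.

```python
from math import factorial

def fork_permutation_count(fork_sizes):
    """Evaluate F! + sum(F_i!) as quoted above, where fork_sizes[i] is the
    number of double-digest fragments in the i-th fork."""
    f = len(fork_sizes)
    return factorial(f) + sum(factorial(n) for n in fork_sizes)

# Three hypothetical forks containing 2, 3 and 1 double-digest fragments.
print(fork_permutation_count([2, 3, 1]))
```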

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is an example of a restriction map and fragment size data.

FIG. 2 depicts two examples (a and b) of pairs of different restriction maps which prior art mapping algorithms would not identify as different.

FIG. 3 is an example of a starting point used when mapping "by hand".

FIG. 4 is a flow chart of the computer program which implements the present invention.

FIG. 5 is a flow chart of recursive procedure BUILDMAP, which is a part of the computer program implementation of the present invention.

FIG. 6 is a detailed flowchart of step 120 of BUILDMAP showing special cases for starting maps.

FIG. 7 is a diagrammatic representation of the steps (a, b & c) of map construction by the method of the present invention.

FIG. 8 is a diagrammatic representation of the method of fitting double digest fragments (a & b) into the growing map using local error.

FIG. 9 is a collection of maps showing possible error relationships between sites.

FIG. 10 is a map illustrating how to find a reference point.

FIG. 11 is a diagram of the special case of starting a circular map.

INTRODUCTION TO AN UNDERSTANDING OF THE PRESENT INVENTION

In accordance with the present invention, there is provided a method for constructing restriction maps of DNA from data derived from single and multiple digestions of purified DNA with various restriction enzymes. Such data typically consists of the number of fragments produced by digestion and their sizes. After digestion of a DNA molecule with one or more restriction enzymes, the resulting fragments are separated by gel electrophoresis. When compared to standard compounds of known length (or molecular weight), the fragments may be listed with their approximate length. (The error inherent in such a procedure is on the order of 1-10%.)

Even if there were no error in measurement of the lengths of the restriction fragments, there could be alternative arrangements of the fragment data which would be consistent with the known information about the genome of interest. For instance, if a linear DNA molecule were being studied, and if the identities of both of the ends were known with certainty (perhaps through radio-labelling), there might still be ambiguity in the data if two or more fragments having the same end moieties and identical lengths resulted from a multiple digestion.

Given the error inherent in the measuring procedures employed, the number of possible arrangements of fragments consistent with the data is very large. The mapping problem is of the type "NP-complete", which indicates that no known program solves all cases in less than exponential time.

Although the complete solution to the mapping problem has been claimed by researchers in the art, investigations by the present inventors have revealed that each solution proposed in the prior art has been deficient. Since the correct solution is known to exist (i.e., a DNA molecule does have a sequence, and a map), generation of that solution among several possibilities is one crucial measure of prior art methods. Any algorithm which fails to give the correct result has either failed to generate that result through an incomplete definition of the configuration space of the mapping problem, or has rejected a generated solution for incorrect or invalid reasons.

Once a set of possible maps for a given set of data has been produced, the next desirable step is the ability to rank-order those possible solutions in accordance with their conformity with the data. In this way, the investigator may easily focus his or her efforts on those possible solutions which hold the most promise.

In accordance with the present invention a map generation system correctly and completely defines the configuration space of the mapping problem, evaluates possible maps on the basis of their conformity with the experimental data, and displays the possible maps generated to the user in novel and useful formats.

Explanation of restriction mapping in molecular biology

What is a map?

Referring now to FIG. 1(a) there is shown a restriction map of a DNA molecule. The DNA is cut by restriction enzymes labeled Eco, Bam, and Hin at specific sites located at positions labeled A to F. This particular DNA is linear and the left and right ends are denoted LEND and REND. (The left and right ends are not inherently different and which is which is just a convention accepted by scientists working on that particular DNA).

FIG. 1(b) shows the fragments obtained by digestion of the DNA with any one of the enzymes on the map or with any two in combination.

FIG. 1(c) shows the sizes of fragments in the various digests in sorted (decreasing) order. This is the form of the data obtained by experiments. The goal of mapping is to deduce the fragment arrangement (as in 1(b)) and thus the map (as in 1(a)) from the fragmentation data. Since the map is one-dimensional and intervals on the map are additive, the test for correctness of a map is based on this additive property. Finding the correct map is greatly complicated by the fact that fragment-size measurements are not perfect, but have an inherent measurement error. It is possible, however, to put a bound on the magnitude of the measurement error. A further complication is that some fragments may be "missed" in the experiment, either because they are too small for the detection method or are not resolved from another very similarly sized fragment and thus not identified. The various digests can be checked against each other for internal consistency of fragment number and total size but "completeness" of the data is not guaranteed.
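The cross-check of total size between digests can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed procedure: every fragment is assumed to carry the same bounded relative error `rel_err`, and the function name `digest_totals_consistent` and the enzyme data are hypothetical.

```python
def digest_totals_consistent(digests, rel_err=0.05):
    """digests maps a digest name to its list of fragment lengths.
    Returns (ok, bounds): ok is True when the error ranges of all the
    digest totals share at least one common value."""
    bounds = {}
    for name, frags in digests.items():
        total = sum(frags)
        # With a bounded relative error on each fragment, the true total
        # lies between total*(1 - rel_err) and total*(1 + rel_err).
        bounds[name] = (total * (1 - rel_err), total * (1 + rel_err))
    lowest_upper = min(hi for _, hi in bounds.values())
    highest_lower = max(lo for lo, _ in bounds.values())
    return highest_lower <= lowest_upper, bounds

# Hypothetical data: the second call mimics a "missed" double-digest fragment.
ok, _ = digest_totals_consistent(
    {"Eco": [300, 700], "Bam": [610, 400], "Eco+Bam": [300, 310, 400]})
missed, _ = digest_totals_consistent(
    {"Eco": [300, 700], "Eco+Bam": [300, 400]})
print(ok, missed)
```

A failed check of this kind suggests a missed or unresolved fragment, but a passed check does not prove that the data are complete.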

The researcher's problem is to deduce all possible maps which are consistent with the fragment size data (within error), and to identify the more likely ones.

Depending on the use to which he/she wishes to put the map, the experimenter will either do additional experiments to distinguish between the possibilities or will decide to work around the existing ambiguities in the map.

What constitutes a valid map, and whether two maps differ depends on the definition used, which may vary depending on the use to which the map is to be put. A map usually contains some or all of the following (not independent) information about the molecule.

1. An ordered list of sites generally with coordinates and perhaps with ± error on the coordinates

2. An ordering of fragments from the digests (which can include error of the fragments). Either an ordering of just the single digest fragments, or an ordering of fragments of all the digests.

3. An assignment of every fragment in the data to an interval on the map and of every interval on the map to a fragment in the data. (Note that this is essentially equivalent to "1"+"2".)

4. An enumeration of the relationships between fragments of different digests. Particularly those fragments from single-enzyme digests are related to those subfragments from all multiple digests involving that enzyme and vice versa. That is, for every pair of digests involving that enzyme, one single and one multiple, completely identify the contained-within and containing relationships. This information is generally needed for cloning and/or sequencing experiments and is completely predictable from "1"+"2".

"1" is the format in which maps are usually presented. However, it is important to note that because of error, "1" does not unambiguously predict "2", nor does "2" unambiguously predict "1".

FIG. 2(a) shows two possible maps which are both consistent with the experimentally-derived fragment size data. The maps have the same fragment order for all digests, both single and double, but differ in the relative ordering of sites. This means that in map (i) the 300-long Eco fragment is cut by Bam into a very slightly shorter new fragment with one Eco end and one Bam end and a 10-long new fragment with one Eco end and one Bam end. The 1100-long Bam fragment is similarly cut by Eco in the double digest. In the second map (ii), however, the 300-long Eco fragment is not cut by Bam nor is the 1100-long Bam fragment cut by Eco. These differences may be critical in the design of some experiments, such as cloning or sequencing.
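The ambiguity between site order and fragment order can be reproduced with a small sketch. The coordinates below are hypothetical and are not read from FIG. 2; the representation of a map as a list of (position, enzyme) pairs and the names `digest` and `same_within_error` are likewise assumptions made for illustration.

```python
def digest(sites, length, enzymes):
    """Return fragment lengths, left to right, for a linear molecule of the
    given length cut at every site belonging to one of the named enzymes."""
    cuts = sorted(pos for pos, enz in sites if enz in enzymes)
    points = [0] + cuts + [length]
    return [right - left for left, right in zip(points, points[1:])]

def same_within_error(frags_x, frags_y, rel_err=0.05):
    """True when two ordered fragment lists agree position by position."""
    return len(frags_x) == len(frags_y) and all(
        abs(x - y) <= rel_err * max(x, y) for x, y in zip(frags_x, frags_y))

# Two hypothetical 1400-long maps differing only in the order of two
# closely spaced sites.
map_i = [(295, "Bam"), (305, "Eco")]
map_ii = [(295, "Eco"), (305, "Bam")]

for enzymes in ({"Eco"}, {"Bam"}, {"Eco", "Bam"}):
    frags_i = digest(map_i, 1400, enzymes)
    frags_ii = digest(map_ii, 1400, enzymes)
    print(sorted(enzymes), frags_i, frags_ii, same_within_error(frags_i, frags_ii))
```

Every digest of the two hypothetical maps agrees within a 5% error, yet only in `map_i` does the Bam site fall inside the left-hand Eco fragment.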

FIG. 2(b) shows two maps which have the same site order and error ranges on the coordinates but differ in the ordering of fragments in the double digest. Depending on the use, this difference may be very important and the experimenter may need to know that two alternatives exist. This is important for any experiment where one wishes to physically isolate a particular sub-region of the DNA molecule for subsequent experiments, such as cloning, sequencing, or protein-binding assays.

As an example of the typical prior art procedure employed by researchers in the field (other than the computer methods referred to above as "prior art"), the following description may be useful.

Procedure for Restriction Mapping "by hand"

Do experiments to derive the data.

(a) Isolate DNA of a single molecular species. (Usually a plasmid from a bacterium or the genome from a virus).

(b) Digest aliquots of the DNA with a variety of restriction enzymes employed singly and in various combinations.

(c) Run the digestions on an electrophoretic gel to separate the fragments according to size.

Determining the map from the data involves several steps:

1. The researcher measures the length of fragments resulting from digestion by comparing the position of the bands on the gel with the positions of known length standards.

2. He/she notes significant facts (clues to the map as it were) about the pattern of the bands in the three lanes. (One lane for each of the two single digests and one for the double digest.)

(a) Which bands in the single digest are not in the double digest? (This means that there is at least one cut for the second enzyme between the two ends of the single digest fragment)

(b) Which bands in the single digests are (tentatively) also present in the double digest? (Its presence in both digests means that there is no site for the second enzyme between the two first enzyme sites which generated the fragment).

(c) Which bands in the double digest are (tentatively) identical to bands in which of the single digests and which bands are new? (This identifies the ends of the fragments as being both of one type or both of the other or one of each type.)

(d) Are the total number of fragments in the three digests consistent and is the number of new fragments in the double digest consistent with the number of cut fragments in the singles?

3. An effort is made to resolve discrepancies by careful reinspection of the gel. (see Table 1)

(a) Search for "missing" fragments. Two similarly sized fragments may not be resolved into two bands. A band on the gel composed of two fragments (a doublet) is generally recognizable as such because it stains more intensely than a singlet and it may be broader. The other cause of missing fragments is that they are so small they either ran off the bottom of the gel or do not stain intensely enough to be seen. If it is fairly certain that doublets have been identified, one tentatively assumes the existence of "not visible" fragments in the double digest which are smaller than some maximum (the size which would certainly have been visible).

(b) If the number of cut fragments and new fragments is inconsistent, there is a mis-assignment of a cut fragment in a single digest and a similarly-sized new fragment in the double digest as identical and uncut. Re-examination of the gel may detect this.

4. The researcher tries to construct a map which is consistent with all the above information. More than one map may be consistent and he/she tries to find all possibilities. The mental processes used and the rate of progress are highly dependent on the skill and experience of the mapper, as well as the peculiarities of the data being considered.

Generally, what is done is to find a small sub-region of the map which is "easy" (certain, self-evident, unambiguous, etc). He/she determines the containing and contained-within relationships for this region and constructs a map of this sub-section. He/she then proceeds in both directions, identifying overlapping fragments from the two single digests by which ones contain a double digest fragment in common, until a map is achieved. Progress becomes easier as more map is determined because the mapped region and fragments already assigned to them can be eliminated from further consideration in constructing the not-yet-mapped regions.

Most maps have a readily evident region which can be a starting point. For example, a large "new" fragment in a double digest can only be contained in an even larger "cut" fragment (necessarily one in each single). Frequently, there is only one choice for each of these. Subtraction of the size of the new fragment from the size of the containing single digest fragments yields a possible size for two other new fragments and if either or both exist, a possible map is started (see FIG. 3).
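The subtraction step described above can be illustrated with a short Python sketch. The function name `flanking_candidates`, the tolerance parameter `tol`, and the band sizes are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
def flanking_candidates(new_size, container_a, container_b, double_digest, tol=0.25):
    """Subtract a large 'new' double-digest fragment from the two single-digest
    fragments assumed to contain it, then look for each remainder among the
    double-digest bands. Names and tolerance are illustrative assumptions."""
    result = {}
    for label, container in (("A", container_a), ("B", container_b)):
        remainder = container - new_size
        matches = [f for f in double_digest if abs(f - remainder) <= tol]
        result[label] = (remainder, matches)
    return result


# Hypothetical data: double-digest bands of 3, 4, 5 and 8 units. The new
# 5-unit fragment is assumed to lie inside a 9-unit fragment of the first
# single digest and a 13-unit fragment of the second.
hits = flanking_candidates(5, 9, 13, [3, 4, 5, 8])
```

If either remainder has a match in the double digest, the three fragments form a candidate starting region; an empty match list suggests that the assumed containing fragment is the wrong choice.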

Similarly, a small cut fragment can only be cut into an even smaller new fragment, or a large uncut fragment must be a sub-region of an even larger cut fragment in the other single digest. In most data sets, there is usually a natural starting point for the mapping process. If not, what is usually done is to resort to a different experiment with a different, easier pair of enzymes. After some map is constructed, the more difficult enzymes can be added more readily.

Alternatively, there may already be a known section of the map. When mapping clones for example, the vector portion is already known.

Mapping by Computer

The construction of maps by hand using the human mind to perceive relationships is a difficult skill in which proficiency comes only with practice. Therefore it cannot usually be assigned to unskilled personnel. In complex maps the number of possibilities is understood to exceed the ability of the human to solve. Computational assistance is therefore definitely welcome, not only because of the time saved but because of potentially greater accuracy. Even the most experienced and skilled practitioner of this art is aware and concerned that the map he/she has deduced (really induced) might not be the only one or even the best that fits the data. Just as the human mind is good at solving puzzles by intuitive leaps, it is also easy to fixate on one approach, leaving blind spots that may mask the truth.

BRIEF DESCRIPTION OF THE INVENTION

The present invention comprises a computer-assisted method for determining the order (map) of fragments in a one-dimensional array (either linear or circular), based on fragment length and fragment termination data. While the invention may have utility in other types of problems, its primary utility is that of restriction mapping of DNA molecules.

With reference to DNA restriction mapping, the method of this invention comprises constructing hypotheses of possible maps by permuting at least some, and preferably all, arrangements of fragments (each of a characteristic fragment length) and some, and preferably all, fragment termination information, known as "sites". The number and types of sites in the molecule are determined by counting the numbered fragments in restriction enzyme digestions, and fragment length is determined by one of a number of possible methods such as, for example, gel electrophoresis. Each such permutation is then tested for validation by comparing possible local sub-regions of the original array with local additive lengths of fragments. Each validated hypothesis is then output as a possibly correct fragmentation map of the DNA under investigation.

Preferably, multiple possibly correct maps are then ordered as to their probability of correctness by reference to their goodness-of-fit to the data from multiple enzyme digestions.

In an alternative embodiment, a restriction map is constructed by selecting a restriction site as a starting point then selecting all possibly adjacent fragments as candidate neighbors. An extension of this hypothetical map at the distal terminus (if the candidate neighbor fragment is consistent with all available data) is then constructed in a like manner, and the process is reiterated, in essence "growing" the hypothetical map until it is complete, thus arriving at a pre-validated possibly correct fragmentation map. At each step of the growth process, local additivity of the sub-regions of the partial map are used to check validity of the hypothetical map.

DESCRIPTION OF THE PREFERRED EMBODIMENT THEORY AND OPERATION

In accordance with the present invention there is provided a process for constructing maps from fragmentation data. For clarity, the present invention will be described in terms of its application to mapping of DNA molecules fragmented by restriction enzymes, commonly known in the field of molecular biology as restriction mapping. Other types of linear or circular structures and other fragmentation agents could also be mapped by the method.

Objectives of a mapping system

A system for constructing maps should take fragmentation data in the form obtained by the researcher and produce a systematic search of all the possible arrangements of this data to identify maps that are consistent with it. Since, in general, more than one potentially correct map will be found, the system should output any that are consistent with the data. Within this framework the following criteria should be met:

Completeness of solution

Speed of solution

Ability to handle fragment data with .+-. errors

Ranking of possible maps by quality

Circular and linear maps supported

Experimentally realistic range of data handled.

Completeness of Solution

Since many possible maps may be consistent with a given set of data the system must find them all, although the user might wish to be shown only the best few. It is only the assurance that all possibilities have been found that allows the researcher to have confidence in the method, for if possible solutions failed to be considered, the correct map could be among the missing. The criterion of completeness is of such central importance that it is a definitive requirement for a mapping system and method.

Logical approaches for deriving solutions from data can be thought of as either using negative inference to reject elements of the configuration space because of conflicts with the data or using positive inference to deduce solutions from the data. A method which uses only negative inference is complete if the hypothesis generator exhaustively models the solution space and the rejection criteria never cause rejection of a valid solution. A method which uses only positive inference is complete if the inference rules generate all possible hypotheses fitting the criteria. In choosing an approach for solving a problem, the character of the data, of the configuration space, and of the inference rules which can be used must all be considered. Many known algorithms for problem solving use a combination of both approaches.

The following description discloses two variations of a method for solving restriction mapping problems. First described is an exhaustive hypothesis generator which completely models the configuration space and passes hypotheses one at a time to an evaluator which rejects those inconsistent with the data. It is this complete accuracy of the map generator of the present invention that instills the necessary confidence in its users. Also described are specific positive inference rules which could be used to limit the configuration space.

The preferred embodiment of the invention is termed the "branch and bounds method" and is based on the exhaustive hypothesis generator as limited by positive inference rules. The method uses negative inference to reject many elements of the solution space at once. This is accomplished by intermixing steps of map generation and map testing at a fine level.

Speed of Solution

The second criterion of acceptability is speed. The mapping problem is in the category known as NP-complete, which means that there is no known way to solve it in less than exponential time. No matter how efficient a method is employed, the time required to solve increasingly complex maps increases exponentially. Many such problems would be insoluble for all practical purposes, regardless of the computational power available. The impact of an efficient algorithm versus an inefficient one, however, can be worth orders of magnitude in time. In effect, the quantitative difference becomes qualitative, making useful problems soluble that would otherwise be (for all practical purposes) insoluble. The preferred embodiment is an exhaustive map generator that can solve maps commonly encountered in molecular biology at practical speed, even on a personal computer.

Ability to handle fragment data with .+-. errors

Real (experimentally derived) data has error limits associated with it and a method must allow for the possibility that any fragment in the data deviates from its mean by the allowed amount in either direction. Real maps are one-dimensional and intervals on the map are additive. Fragment sizes are measurements of these intervals. If any proposed map has fragments assigned to intervals such that the additive relationship within error is not maintained, then the map is incorrect.
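The additivity test can be sketched in Python as follows. This is a minimal illustration, assuming that worst-case errors simply add; the function name `is_additive` and its parameters are hypothetical, and other error models are equally possible.

```python
def is_additive(whole, parts, whole_err, part_errs):
    """Return True if the parts sum to the whole within the combined limits.

    Assumption (not from the source): the allowed discrepancy is the sum of
    the error limit of the whole and the error limits of all the parts."""
    allowed = whole_err + sum(part_errs)
    return abs(whole - sum(parts)) <= allowed
```

For example, a 10.0 ± 0.3 fragment assigned to an interval that is also covered by fragments of 3.0 ± 0.1 and 6.8 ± 0.2 passes the test, while the same interval covered by 3.0 ± 0.1 and 6.0 ± 0.2 fails it, so a map requiring that assignment would be rejected.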

Ranking of maps by quality

When a large number of possible maps (each reasonably consistent with the data given the error limits) are found by a method, there is considerable difficulty for the user in deciding which is the most probable one. The determination of a "figure of merit" for each map, based on the overall fit of the data to the map, provides a basis for ranking the output. This is a considerable help to the user.

Circular and linear maps supported

Natural DNA is either linear or circular. The ability of a method to handle both cases is therefore a significant benefit.

Experimentally realistic range of data handled

The method of the present invention is not inherently limited in the number of enzymes that can be handled. For example, some prior art methods can handle only two enzymes. Similarly, the method of the present invention does not require that every double enzyme digestion be done in order to solve multi-enzyme maps. There may be used as much double digest data as is available, but only a few double digestion experiments are necessary. The minimum requirement is only enough multiple digests to involve all the enzymes. As more double or multiple digest data are supplied, better results are produced, i.e., fewer "possible" maps are found in a shorter time.

A map generator and a map evaluator are provided. The generator has the job of enumerating all possible hypotheses (in a combinatoric sense) based on a model of the configuration space. Each hypothesis is then passed to the map evaluator. The evaluator functions to reject maps which are inconsistent with the data and presents all others to the user. The evaluator comprises three parts: the map hypothesis rejector, the site positioner and the map quality evaluator. The hypothesis rejector screens hypotheses produced by the generator and rejects those with an unacceptable level of inconsistency with the data. For each map not rejected, the site positioner assigns numerical positions to each site on the map and the evaluator checks each map against the data for quality of fit. The maps with an acceptable quality are the solutions and they are passed to the user through the output channel. The generator and the elements of the evaluator will be discussed in detail in the following three sections.

The map generator

The map generator can be of any form that produces a complete coverage of all mathematically possible maps given the single and multiple digest data. A brute force procedure that is a correct but computationally-intensive method for generating all maps is:

(PERMUTE ALL SITES) AND (PERMUTE ALL FRAGMENTS)

This is described in more detail as the steps below.

1. Generate list of all sites in molecule {determine the number of sites Si for each enzyme i by counting the fragments in each single digest (and subtracting one if map is linear)}

2. Determine the total number of sites, N, in the map by adding up the individual totals.

3. While an untried permutation of sites exist {number of permutations of sites is: ##EQU1##

4. Select a permutation of all sites

5. While an untried permutation of fragments in single enzyme digests exists {number of permutations is F1! X F2! X . . . Fi!, where Fi is the number of fragments in digest i}

6. Select a permutation of single digest fragments

7. For fragments in all single digests, assign all fragments (in selected order) to intervals on the map bounded on both ends by sites of the type used in the digest and containing no interior sites of that type.

8. While untried permutation of fragments in multiple enzyme digests exists

9. Select a permutation of multiple digest fragments {number of permutations is M1! X M2! X M3! . . . X Mk!, where Mk is the number of fragments in multiple digest k}

10. For all fragments in all multiple digests included in the input data, distribute all fragments (in selected order) to the map intervals bounded by sites of any type used in the digest and containing no interior sites of any of these types.

11. Pass this hypothesis to the hypothesis evaluator. {Each configuration defined by the combinations of site orders and combinations of fragment distributions is a theoretically possible hypothesis to be passed to the hypothesis evaluator.}

EndWhile {permute all experimentally available multiple digest fragments}

EndWhile {permute all single digest fragments}

EndWhile {permute all sites}
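The nested loops above can be sketched in Python for the linear case. This is an illustrative sketch only: the names `distinct_site_orders` and `brute_force_hypotheses` are hypothetical, the permutation of multiple digest fragments (steps 8 to 10) is omitted for brevity, and sites of the same enzyme are assumed to be interchangeable, so only distinct label orders are enumerated.

```python
from itertools import permutations, product


def distinct_site_orders(site_counts):
    """All distinct left-to-right orders of site labels, e.g. ('A', 'B', 'A')."""
    labels = [enzyme for enzyme, n in sorted(site_counts.items()) for _ in range(n)]
    return sorted(set(permutations(labels)))


def brute_force_hypotheses(single_digests):
    """Yield (site_order, fragment_orders) pairs for a linear map.

    single_digests maps an enzyme name to its list of fragment sizes.
    Each yielded pair would be handed to a hypothesis evaluator."""
    enzymes = sorted(single_digests)
    # Linear map: one fewer site than fragments for each enzyme (step 1).
    site_counts = {e: len(single_digests[e]) - 1 for e in enzymes}
    for site_order in distinct_site_orders(site_counts):       # steps 3 and 4
        per_enzyme = (permutations(single_digests[e]) for e in enzymes)
        for frag_orders in product(*per_enzyme):                # steps 5 and 6
            yield site_order, dict(zip(enzymes, frag_orders))   # step 11
```

Even this reduced enumeration grows factorially: two digests with three and two fragments already give 3 site orders times 3! X 2! fragment orders, or 36 hypotheses, before any multiple digest fragments are permuted.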

It should be noted that the map generator can be used equally well to generate linear or circular maps. The only detail that must be changed is whether fragments are to be assigned to an interval that spans the end. If the map is linear such assignments must be eliminated from consideration.

Other suitable embodiments of the "brute force" map generator can be obtained by interchanging the order of steps. For example, one could consider the possible ordering of fragments first (in the outer loop) and then consider the ordering of sites. Alternatively, the single fragments could be permuted first, then the sites, and then the multiple fragments. Finally, as discussed below with respect to the preferred embodiment, these steps may be intermixed at a fine level, some fragment permutations being done, then some site permutations, then some more fragments etc. The possible variations in the order of steps are minor modifications of the method of the present invention.

A critical feature of the "brute force" map generator of the present invention is that it does not require that double digest data be available from all the combinations of enzymes of the map. This is important because the number of double digests increases rapidly with the number of single digests {E X (E-1)/2, where E is the number of enzymes to be mapped}. This can easily lead to a great deal of experimental work for maps with several enzymes. Many methods of the prior art which require all double digest data to set up the map generator are unsuitable because of the experimentally unrealistic requirement that double digest data be available for every pairwise combination of enzymes.

As is common to all methods, that of the present invention requires single digests of all the enzymes to be mapped. These are used to generate the list of sites to be permuted. Fragment permutations are done on the single digest fragments and on only those multiple digests for which experimental data are available.

The hypothesis evaluator

The map hypothesis rejector checks a number of intervals on the hypothesized map for additivity of fragment sizes as explained above. It is important to note that if any interval on the hypothesized map is non-additive (beyond the error limits of the data), then the map is inconsistent with the data and is considered to be wrong. (It is this fact that is used to reject whole sets of hypotheses in the solution space by the preferred embodiment, the branch and bounds method described below.) Pearson's map rejector looks only at the whole map (sum) deviation of the data, not at individual data points, and rejects those maps whose total deviation is greater than a preset limit. Thus, correct maps may be rejected if most of the intervals are near the error limits of the data since the total error will be high, while obviously incorrect maps may be accepted if most of the intervals are coincidentally close to the data while one diagnostic interval is substantially beyond the acceptable limits of non-additivity.
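The difference between the two rejection rules can be made concrete with a small Python sketch. Both functions are hypothetical; in particular `reject_by_total` is only a simplified stand-in for a whole-map criterion and is not claimed to reproduce Pearson's actual calculation. The sample numbers are invented.

```python
def reject_by_interval(intervals):
    """intervals: list of (predicted, measured, error_limit) tuples.
    Reject if ANY single interval deviates beyond its own error limit."""
    return any(abs(p - m) > err for p, m, err in intervals)


def reject_by_total(intervals, total_limit):
    """Simplified whole-map rule (an assumption for illustration): reject if
    the summed absolute deviation exceeds a preset limit."""
    return sum(abs(p - m) for p, m, _ in intervals) > total_limit


# Hypothetical maps: every interval close to its limit, versus three good
# intervals and one interval far outside its limit.
near_limit = [(10.9, 10.0, 1.0)] * 4
one_bad = [(10.1, 10.0, 1.0)] * 3 + [(12.5, 10.0, 1.0)]
```

With a total limit of 3.0, the whole-map rule rejects `near_limit` (total about 3.6) although every interval is within its limit, and accepts `one_bad` (total about 2.8) although one interval is off by 2.5. The per-interval rule gives the opposite, and intended, verdicts.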

The site positioner is responsible for assigning each restriction site a numerical position on the "possible" map based on the order of sites that has been selected and the fragments that have been assigned to the map intervals by the map generator. This step is required because every measurement of fragment length has a .+-. error associated with it. As a result, the length of the intervals of the map between sites can be determined in many different ways by adding up the lengths of the various combinations of sub-fragments that make it up.

Several alternatives exist for the design of the site positioning algorithm. In accordance with Pearson's method, single digests may be used alone after normalization of the total length of fragments to an average value. This provides a single position for each site, but the method avoids conflicts by throwing away the multiple digestion data. This technique has an adverse effect on accuracy since the single digest fragments, as the longest in the data, have the largest absolute errors.

An alternative method is to use the shortest fragment available to define every map interval. This also throws away data but is more accurate since the data thrown away is the less accurate data rather than the more accurate data.

The preferred method is that disclosed by Schroeder and Blattner. This method makes use of all the data by performing a least squares calculation to minimize the sum of the squares of the error of each fragment length relative to the length of the map interval to which it is assigned. This method gives the optimum map positions in the least squares sense for every site. As a slight variation of this method, it is possible to minimize the sum of squares of the errors weighted by various factors, such as the inverse of the relative error limits of the input data.
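A least squares site positioner of this general kind can be sketched in plain Python. The sketch below is not the Schroeder and Blattner implementation; it is an unweighted illustration under stated assumptions: a linear map whose left end is index 0 fixed at position 0, each fragment given as a hypothetical tuple `(i, j, length)` meaning that it spans from position index `i` to index `j`, and the normal equations solved by Gaussian elimination.

```python
def position_sites(n_positions, measurements):
    """Least squares positions for indices 0..n_positions-1 (index 0 fixed at 0).

    measurements: list of (i, j, length) with position[j] - position[i] ~ length.
    """
    m = n_positions - 1                      # unknowns x_1 .. x_m
    ata = [[0.0] * m for _ in range(m)]      # normal matrix A^T A
    atb = [0.0] * m                          # right-hand side A^T b
    for i, j, length in measurements:
        row = {}
        if j > 0:
            row[j - 1] = row.get(j - 1, 0.0) + 1.0
        if i > 0:
            row[i - 1] = row.get(i - 1, 0.0) - 1.0
        for p, cp in row.items():
            atb[p] += cp * length
            for q, cq in row.items():
                ata[p][q] += cp * cq
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(ata[r][col]))
        if abs(ata[pivot][col]) < 1e-12:
            raise ValueError("site positions are not determined by the data")
        ata[col], ata[pivot] = ata[pivot], ata[col]
        atb[col], atb[pivot] = atb[pivot], atb[col]
        for r in range(col + 1, m):
            factor = ata[r][col] / ata[col][col]
            for c in range(col, m):
                ata[r][c] -= factor * ata[col][c]
            atb[r] -= factor * atb[col]
    # Back substitution.
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = atb[r] - sum(ata[r][c] * x[c] for c in range(r + 1, m))
        x[r] = s / ata[r][r]
    return [0.0] + x
```

The weighted variation mentioned above could be obtained by multiplying each measurement's contribution to `ata` and `atb` by a weight, for example the inverse of its relative error limit.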

The map quality evaluator is used to give possible maps a "figure of merit" so that they may be ranked for the user. Once the site positions have been determined, the deviation of the length of each actual data fragment from the calculated length of its interval may be determined. Any map that predicts a fragment size that lies considerably outside the assigned error limits for that fragment is rejected. (It should be noted that this operation rejects maps. The map rejector described above is not essential nor need it be exhaustive in its rejection of inadequate maps since the function can be done here. However, the least squares assignment of site positions is computationally expensive and it is better to do it only on those very few maps, relative to configuration space, which are qualitatively reasonably consistent with the data.) This leaves the maps that are qualitatively consistent with the data. For these a figure of merit for the overall map is determined by any of several statistical formulae; e.g. the root mean square average deviation, the root mean square average fractional deviation or the corresponding straight averages of absolute values of the deviations or fractional deviations. The figure of merit may, in fact, be the maximum fragment error. It is convenient to store these maps in a sorted list so they may be presented to the user in order of the figure of merit.
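The candidate figures of merit named above can be computed as in the following Python sketch. The function name and dictionary keys are hypothetical, and the fractional deviation is taken relative to the measured fragment length, which is an assumption; the text does not say which length is used as the denominator.

```python
from math import sqrt


def figures_of_merit(pairs):
    """pairs: list of (measured_length, calculated_interval_length)."""
    devs = [measured - calc for measured, calc in pairs]
    # Assumption: fractional deviation is relative to the measured length.
    fracs = [d / measured for d, (measured, _) in zip(devs, pairs)]
    n = len(pairs)
    return {
        "rms_deviation": sqrt(sum(d * d for d in devs) / n),
        "rms_fractional_deviation": sqrt(sum(f * f for f in fracs) / n),
        "mean_abs_deviation": sum(abs(d) for d in devs) / n,
        "mean_abs_fractional_deviation": sum(abs(f) for f in fracs) / n,
        "max_abs_deviation": max(abs(d) for d in devs),
    }
```

Accepted maps could then be kept in rank order by sorting on whichever figure is chosen, smallest first.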

In summary, then, the present invention provides a method having a proper map generator (which includes permutation of BOTH sites AND single digest fragments AND multiple digest fragments in all possible combinations), and a hypothesis evaluator which rejects only those maps in which any region is unacceptably (beyond the error limits of the data) non-additive.

The brute force method (wherein the map generator passes one hypothesis at a time to the evaluator) bogs down with maps of modest complexity. The number of trials is far greater than the product of the factorials of the numbers of single digest fragments (as was suggested by Pearson).

Since for a two enzyme linear map (with the double digest) there are A! x B! x (A+B)! permutations of fragments plus the permutations of sites, a brute force method which checks all permutations one at a time becomes rapidly unusable as the number of fragments increases.

Therefore, an efficient map generator/map evaluator combination represents a great improvement. An efficient map generator is one that accomplishes the same result as the "brute force" generator in many fewer computational steps, yet does not fail to consider any of the possibilities that might be correct maps.

The preferred embodiment of the present invention is based on the following three concepts:

1. It is not really necessary to examine all combinations of every permutation of single digest fragments with every permutation of sites ONE AT A TIME. While the order of sites is not uniquely deducible from the order of single digest fragments unless the data is perfect (0 error), it is possible with linear maps to calculate which site permutations (one to several) are consistent with each proposed permutation of single digest fragments. Thus the configuration space to be exhaustively examined can be significantly reduced by considering only these relatively few permutations for each permutation of single digest fragments.

The situation with circular maps is considerably more complicated since possible site orders can be calculated only from permutations of single digest fragments and some (data dependent) assignments of some of the double digest fragments.

2. Once both a single digest fragment permutation and a compatible site permutation (and with circular maps, enough double digest fragments assigned to allow this) are selected, then the distance between any two sites on the map can be calculated. (i.e. we can calculate the size that a fragment assigned to such an interval can be, without conflicting with the already selected fragment and site permutations and assignments.) All unassigned multiple digest fragments must then be assigned in all possible orders to all appropriate intervals. (i.e. Permute multiple digest fragments and assign to intervals--the inside loop of the general description.) For each such permutation (a hypothesis) if any fragment is inconsistent with the calculated allowable size of the interval, then the map is rejected. If no fragments are inconsistent, the map is accepted as possible and output. After all possibilities are tried, the process goes back to the next outside loop and tries the next permutation of single digest fragments AND sites.

2a. If no consistent (with the selected single digest permutation and site permutation) permutations of double digest fragments exist, then there are no possible maps for this particular combination. It is not always necessary to enumerate all possible permutations of double digest fragments to discover this. For example: for any interval for which double digest data is available, if no fragment of the appropriate size exists in the appropriate double digest, then no permutation of multiple digest fragments exists which is consistent and the single digest fragment permutation-site permutation combination may be rejected without individually considering all double digest permutations.

The preceding two rules are examples of positive inference which may be used to limit the hypothesis generator and thus make it more efficient. Using these two rules, it would still be necessary to individually examine all single digest fragment permutations in combination with all consistent possible site permutations, reject some of them by criteria based on the existence or non-existence of particular multiple digest fragments (within error), and examine the rest in combination with multiple digest fragment permutations.

3. If the generator produces maps in the right order, and rejection criteria are applied at intermediate steps in map generation, (not just at the end), an inconsistency in a map may be used to reject entire sets of maps at once. That is, if a map is rejected because it has a particular inconsistent subregion, any and all maps which have the same subregion can be rejected at the same time.

The preferred embodiment specifies a particular arrangement of the steps of fragment and site permutation which can be represented in the form of a tree. Traversal of every branch of the tree from the root to every leaf would be equivalent to the brute force method. (This is equivalent to saying that hypothesis evaluation is applied to complete maps.) But the arrangement of the tree has been chosen to increase the efficiency and to eliminate computational work, by applying the test for local additivity at each branch and rejecting all leaves at once for any branch failing the test.

Starting at the root of the tree and moving outward, the nodes alternate between permuting sites, permuting single fragments, and permuting double fragments. (The term "selection of growing point enzyme" is used herein to describe the act of permuting sites). Because of this alternating arrangement, the order of steps is not only different from the described brute force implementation of the invention, but may differ depending on the data.

Because of the way the tree is designed, every node of it corresponds to a partially completed map, and these are nested so that each step farther out on the tree from the root adds fragments or sites to the end of the map that existed at the preceding level. With this structure in mind, it is easy to see how the discovery of an inconsistency at any node (a local non-additivity) leads to a simple way of rejecting large numbers of hypotheses at once and thus limits vastly the computational work. In traversing the tree, as soon as a node is encountered whose partial map is inconsistent with the data, everything beyond it need not be traversed because all maps that begin the same way are rejected.

The equivalence of the growing point selection rule technique to the full permutation of all sites used in the brute force method is clear. The savings accomplished through the use of the more efficient method is quite large in most cases where real data is involved because it is only necessary to apply site permutations in situations where the accumulated errors in the positions of restriction sites at the end of the partially completed map of the node overlap. If these positions did not overlap, the permutation would, perforce, create a negative fragment. The error control that results from tacking corresponding sites together as the map grows makes this a major savings over the less efficient brute force method.

The preferred embodiment of the present invention uses a branch and bounds method and is described below. The input data are sets of ordered lists of fragment sizes, corresponding to the sizes of fragments in restriction enzyme digestions and the topology of the molecule (whether the molecule is linear or circular). The method requires that for each enzyme to be mapped, there must be a data list for the single enzyme digest and each enzyme must be involved in at least one multiple enzyme digest. The topology of the configuration space of the mapping problem requires that all pairs of single enzymes can be related through one or more multiple digests. (This is distinct from the first requirement only if there are more than three enzymes to be mapped). In the preferred embodiment, in order to ensure complete coverage of the configuration space of a circular map, there is a further requirement that there must be enough multiple digest data that at least one of the enzymes to be mapped is involved in a multiple digest with every other enzyme.
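The input requirements stated above can be checked before mapping begins. The following Python sketch is one possible reading of those requirements; the function name `check_digest_coverage`, its arguments, and the wording of the messages are hypothetical. Each multiple digest is represented simply by the set of enzyme names used in it.

```python
def check_digest_coverage(enzymes, multiple_digests, circular=False):
    """Return a list of problems with the available multiple digest data.

    enzymes: names of the enzymes that have single digests.
    multiple_digests: iterable of collections of enzyme names, one per digest.
    """
    enzymes = list(enzymes)
    digests = [frozenset(d) for d in multiple_digests]
    problems = []
    # Requirement 1: every enzyme appears in at least one multiple digest.
    for e in enzymes:
        if not any(e in d for d in digests):
            problems.append(f"{e} is in no multiple digest")
    # Requirement 2: all enzymes are linked through chains of multiple digests.
    reached = {enzymes[0]}
    changed = True
    while changed:
        changed = False
        for d in digests:
            if d & reached and not d <= reached:
                reached |= d
                changed = True
    if not set(enzymes) <= reached:
        problems.append("enzymes are not all linked through multiple digests")
    # Circular maps: some enzyme must share a digest with every other enzyme.
    if circular:
        def partners(e):
            return set().union(*(d for d in digests if e in d))
        if not any(set(enzymes) <= partners(e) | {e} for e in enzymes):
            problems.append("no enzyme shares a multiple digest with every other enzyme")
    return problems
```

With four enzymes and only the digests A+B and C+D, every enzyme is in a multiple digest but the second requirement fails, which illustrates why the two requirements differ once more than three enzymes are mapped.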

Overview of procedure

Maps are constructed by building the map up one step at a time. At each step there are a (data dependent) number of choices that could be made. At each step, every possibility must be tried and at each try the result is checked for internal consistency. (Additivity of map intervals within the error bounds of the measured fragment sizes must always be maintained). Any inconsistency causes pruning of that branch. That is, we need not continue any further down the wrong path of map construction but back up and try again with a different choice. If there are no more choices we back up one step further and try again.

There are three different kinds of steps in map construction which are repeated until a complete map is achieved (and/or until all possibilities have been tried and rejected). Pruning rules are applied at each step.

1. Choose a site. The hypothesis to be tested is that this site is the next site on the map. (No intervening sites between the previous site, chosen in the previous cycle, and this one). This site is called the growing point. It is chosen to be consistent with single digest fragments assigned in previous cycles. All possibilities must be tried. No more choices to try causes the method to reject this branch and back up to the previous step. (step 3 in previous cycle).

2. Assign multiple digest fragments which end at this site.

Any multiple digest which involves the enzyme of the growing point type will have a fragment in it that ends at the growing point and maintains additivity of interval on the partially constructed map, if it is correct so far. There may be more than one choice and eventually all choices must be tried. If there are no more choices to try this branch is pruned and the method backs up to the previous step. (step 3 above)

3. Assign a single digest fragment that begins at the growing point and extends into the not yet constructed part of the growing map. (All choices must be eventually tried). If there are no more choices to try this branch is pruned and the method backs up one step. (step 2 above)

These three steps are repeated until the entire configuration space has been covered. Each cycle adds to the growing map, one site, one fragment from each multiple enzyme digest which ends at that site, and one fragment from the single enzyme digest which begins at that site.

The steps of the method are cyclic so the above numbering of the steps (i.e. 1,2,3) is an arbitrary break in the cycle (i.e. 2,3,1,2,3,1, . . . is an equally valid representation). An alternative way to think about the cycle is described below.

Maps are constructed by adding one single digest fragment at a time to the growing map. The fragment is then accepted or rejected by checking the consistency of the map to date with the available multiple digest data. An inconsistency causes rejection of the newly added fragment and the next available fragment is tried instead. If there are no more available fragments to try, the algorithm backs up one step, removes the last added, tentatively accepted, fragment and tries the next available fragment to add at that point. If the fragment is accepted, the construction process continues, another single-digest fragment is added to the map, accepted or rejected, and so on. A possible map is found when all single-digest fragments have been added to the map and accepted. The map is processed (checked, refined and output). Then the algorithm backs up a step and continues, since we want to find all possible maps which are consistent. A flow chart of the method is given in FIGS. 4,5 and 6. It is described below.

In the preferred embodiment, the overall order in which the map is constructed is from one end toward the other (left to right if ends are known) for linear molecules, and from an arbitrary point for circular maps. Different overall orders are also possible (e.g. right to left, or one end for a while then the other end for a while, or both ends at once with a parallel processor, or from more than one point in a circular map either alternating or in parallel, etc.).

The point to which the method will add another single digest fragment is called the growing point. (It is also the point up to which double digest fragments have been fit in the previous step.) It is defined as the next (from the previous growing point) site on the growing map. It is found as the right end of a single enzyme map which is the leftmost amongst the single digest maps. Since the relative map positions of the endpoints have an inherent uncertainty due to the measurement error of the input data, the relative order of closely spaced sites may not be determinable. When ambiguity of growing points is encountered, all possible growing points are explored, one after the other. Thus the method of the present invention properly handles closely spaced sites, a particular weakness of various prior art methods. A valuable enhancement to the selection of growing point is the use of relative error (relative to a reference point in the accepted part of the growing map) in comparing endpoints. This is described in the Errors section below.

A single digest fragment is added to the end of one of the maps (the growing point) and the recursive procedure BUILDMAP is called to see if this added fragment is accepted and if it is to add another fragment to the (next) growing point.

The procedure BUILDMAP executes as follows: The new ends are compared (next growing point is determined) and the sizes of fragments to expect from each multiple digest are calculated. The fragment we are trying to find runs from the right end of the double digest map to the just determined (next) growing point (see FIG. 7). Each multiple digest is checked for the existence of a fragment of the appropriate size. Its size must be such that the sum of its length and the lengths of all other fragments assigned to the map from that double digest between the previous end of the growing point enzyme map and the current end equals, within error, the size of the last fragment in the new growing point digest (see FIG. 8). This value range could be calculated in a number of ways. The important point is that the error to be used is local to the region where the fragment is being fit, and includes only the error in fragments which overlap the last fragment in the single digest map of the growing point enzyme. This is a unique and crucial feature of the present invention. If there is not a set (one from each multiple) of fragments which fit, the single-digest fragment is rejected (BUILDMAP returns) and a different single-digest fragment tried.
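The local-error fit test can be sketched as an interval check. The sketch below is an illustration under stated assumptions, not the code of the preferred embodiment: each fragment is a hypothetical (size, error) pair, errors are summed linearly, and the function name fits_locally is invented for this example.

```python
def fits_locally(assigned, query, single):
    """Return True if the query multiple digest fragment fits under the last single fragment.

    assigned: (size, error) pairs of multiple digest fragments already placed
              between the previous end and the current end of the growing point map.
    query:    (size, error) of the candidate multiple digest fragment.
    single:   (size, error) of the last single digest fragment of the growing point enzyme.
    Only errors of fragments overlapping that single fragment are included (local error).
    """
    total = sum(size for size, _ in assigned) + query[0]
    total_err = sum(err for _, err in assigned) + query[1]
    low, high = total - total_err, total + total_err
    s_low, s_high = single[0] - single[1], single[0] + single[1]
    # The two ranges intersect when neither lies wholly beyond the other.
    return low <= s_high and s_low <= high

print(fits_locally([(400, 20)], (580, 29), (1000, 50)))  # ranges 931..1029 and 950..1050
print(fits_locally([(400, 20)], (450, 22), (1000, 50)))  # ranges 808..892 and 950..1050
```

Because errors accumulated elsewhere on the map do not enter the sum, the acceptance window stays narrow even late in map construction.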

After the existence of at least one appropriately sized fragment from each appropriate multiple digest is verified, then the added single digest fragment is accepted, one fragment from each multiple digest is assigned to the map, i.e. it can not be used in any other position in this map. If there is more than one fragment which will fit, all possibilities are tried. If more than one multiple digest has more than one fragment which will fit, all permutations--i.e. all sets of one fragment from each multiple--are tried.

When multiple digest fragments which fit are found and assigned to this map position, a coordinate correction is performed on the growing point using the overlapping lengths of (1) the single digest fragment ending at the growing point and (2) the sum of the lengths of the multiple digest fragments contained within it, for each multiple digest involving the growing point enzyme in the data set. The coordinate adjustment employed is to set the coordinate range of the site (growing point) to the intersection of endpoint ranges of the overlapping values. The effect of this is to identify the endpoint of the single digest map and the multiple digest map as the same site by setting the coordinates to the same value. This assignment of the double digest fragments to the growing overlapped regions of the single digest maps, followed by coalescing of the endpoint sites, merges the originally independent single enzyme maps into a single unified map. In the unified map the length of intervals between sites (fragment sizes), both like enzyme sites (single digest fragments) and different enzyme sites (multiple digest fragments), are used to set site coordinates (See FIG. 7). As graphically represented in FIG. 7, the growing end of the map is a set of independent, parallel maps, one for each enzyme. As we move along, adding and checking, the parallel maps are linked together.
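The coordinate correction can be sketched as an intersection of ranges followed by an addition to the previous end. This is a minimal illustration assuming each range is a hypothetical (low, high) pair. The names intersect and coalesce are not taken from the source.

```python
def intersect(ranges):
    """Intersection of (low, high) ranges, or None if it is empty."""
    low = max(r[0] for r in ranges)
    high = min(r[1] for r in ranges)
    return (low, high) if low <= high else None

def coalesce(previous_end, single_range, multiple_sum_ranges):
    """Corrected coordinate range for the growing point.

    previous_end:        (low, high) range of the left end of the last single fragment.
    single_range:        (low, high) length range of that single digest fragment.
    multiple_sum_ranges: one (low, high) length range per multiple digest, each the
                         summed length of the fragments fitted under the single fragment.
    """
    length = intersect([single_range] + list(multiple_sum_ranges))
    if length is None:
        return None  # empty intersection: this permutation is rejected
    return (previous_end[0] + length[0], previous_end[1] + length[1])

print(coalesce((2000, 2100), (950, 1050), [(931, 1029), (960, 1070)]))
```

With these example numbers the interval length narrows to 960..1029, so the right end is placed at 2960..3129 rather than at the wider range that the single digest fragment alone would give.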

The practical value of unifying the maps as we proceed is twofold. First, it allows the use of relative error (relative to a reference point in the unified map) for finding the next growing point, thus reducing the ambiguity of relative position of endpoint coordinates enough that all possibilities can be practically explored. Experiments have shown that this increases the efficiency of the algorithm by orders of magnitude.

Second, it could make the calculating of the size of fragment to look for in the double digest slightly more efficient. We could use subtraction of endpoint coordinates (using a reference point in the unified map to get relative, i.e. local, error) and compare this value against the query fragment in the multiple digest instead of summing all appropriate fragments (and errors) with the query fragment (with errors) and then comparing this size to the size of the single digest fragment (with errors).

The order of map construction disclosed (trying all possible growing points) is unique to the present invention as is the fitting of double digest fragments using local error. The assignment of all permutations of double digest fragments which fit is unique to the present invention. While not critical, it assures the identification of maps which differ only in the alternative location of double digest fragments (as in FIG. 2) and permits the coalescing of the parallel maps from the different digests. The coalescing of the maps is unique to the present invention, and while not critical to the algorithm in general, it enables the next point: the method of calculating the next growing point(s) is unique in its use of relative error and increases the efficiency dramatically, making the use of the method on a computer practical.

Error handling for determining growing point

Two unique features of the present invention are the use of local errors to fit fragments from the multiple digest against the single digests and the use of relative error in calculating growing point. The use of local errors in fitting double digest fragments is described above, in FIG. 7, and in the Details section.

The use of relative error in determining the next growing point is important for efficiency. There are various, equivalent ways that the error handling may be implemented. The principle is described here and more details of the preferred embodiment are described in the Details section.

The calculated position error of any two sites on a line might be dependent on each other or independent of each other. For example, if the coordinate and error of a site was determined by adding the length and error of the interval between it and another site to the coordinate and error of the other site, then the positions of the sites are dependent. They can only vary independently over part of the error range. Therefore, the difference between sites which are dependent in this way can be obtained by subtracting the coordinates, and the error on this distance can be obtained by subtracting the errors of the two sites (FIG. 9). On the other hand, two independent sites can vary over the entire range of their errors independently. The distance between these sites is obtained by subtracting the coordinates and the error is arrived at independently by adding the errors. If two sites are not known relative to each other, but each is known relative to a third site, the two sites are not totally independent. The relative position of the two sites may be calculated including only that amount of the site errors which is independent. That is, their relative positions may be determined more accurately by referral to the third site than by subtraction of coordinates and summing of errors. One can subtract the error on the common reference point from each of the sites, before comparing them. The best reference point to use is one closest (fewest intervening steps of dependent sites) to the two sites.

let dist(x,y) be the distance between site x and site y

and x(coord), y(coord), be the locations of x, y

and x(error) and y(error) be the error on the location

The following cases are handled:

Dependent sites:

dist(x,y) = y(coord) - x(coord) ± |y(error) - x(error)|

Independent sites:

dist(x,y) = y(coord) - x(coord) ± (y(error) + x(error))

Partially dependent sites, by referral to a third site, z, upon which both x and y are dependent: ##EQU2##
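The three cases can be compared numerically with a short sketch. The dependent and independent formulas follow the expressions above. The partially dependent case is not reproduced in this text, so the form used here, subtracting the reference error z(error) from each site error before summing, is an assumption drawn from the preceding paragraph. Sites are hypothetical (coord, error) pairs.

```python
def dist_dependent(x, y):
    """Distance and error when one site's coordinate was derived from the other."""
    return (y[0] - x[0], abs(y[1] - x[1]))

def dist_independent(x, y):
    """Distance and error when the two sites vary independently."""
    return (y[0] - x[0], y[1] + x[1])

def dist_via_reference(x, y, z_error):
    """Distance and error by referral to a common reference site z.

    Assumed form: the error shared through z is removed from each site
    before the remaining (independent) errors are summed.
    """
    return (y[0] - x[0], (y[1] - z_error) + (x[1] - z_error))

x = (100, 10)  # site x at coordinate 100 with error 10
y = (130, 12)  # site y at coordinate 130 with error 12
print(dist_dependent(x, y))         # (30, 2)
print(dist_independent(x, y))       # (30, 22)
print(dist_via_reference(x, y, 8))  # (30, 6)
```

The referral to a third site gives an error between the fully dependent and fully independent cases, which is what makes the comparison of right ends sharper than a comparison using global error.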

Applying this to the comparison of two restriction sites for determination of growing point(s) in the algorithm, any two sites on the growing map of the same enzyme type are dependent because the interval between them is known (fragment length of fragment from single digest or sum of more than one fragment) and is used to determine the coordinate and the error. Any two sites of different types are dependent if the double digest for the two enzymes exists, and the fragment between the two sites has been assigned and used to determine the coordinates of the two sites.

In order to determine the next growing point, it is necessary to compare right ends of the single digest maps (coordinates of different enzymes). Since the error on these sites generally gets larger and larger as we get further in map construction, using the global error to compare the sites would lead to exploring many "possible" growing points and very slow operation. However, the sites in fact are not totally independent. They can be compared by referral to a third site, a reference point, upon which both of them are dependent, as shown in FIG. 10.

The best reference point to use is the closest. When comparing two map right-end sites (and the double digest for the two enzymes exists), the best reference point is the leftmost previous end of the two maps. If appropriate double digest data doesn't exist, then there is another site from which there is a common path of dependent sites to the two sites being compared. The best reference point is different for any two sites and depends on which double digests exist. For present purposes (reducing ambiguity of growing point) using the best reference point is not essential. Any fairly close reference point will do at the cost of some increased ambiguity. Because finding the best for every pairwise combination of enzymes would be a complex and time consuming step, the closest site which can be a reference point for any two map right ends may be used. (Termed the "greatest-common-whole-map-reference-point"). It is the leftmost of the "best" reference points for pairs, and is found as the leftmost previous end among the single digest maps. Since the input data is required to contain multiple digests so that all enzymes are represented in the multiples, the entire set of enzymes is tied together through multiple digest(s) and the use of a single reference point is valid.

Procedure for starting a map

The previous sections describe the process of constructing restriction maps by adding fragments one at a time to a growing map. The following discussion details the procedures for starting a new map.

Linear maps

Linear maps have two sites in common among all the digests, the two ends. To start a linear map, an arbitrary order is imposed on the enzymes' single digests. Only two minor steps need be incorporated into the BUILDMAP procedure for beginning a map: the method of choosing the growing point and the method of fitting multiple digest fragments. At the beginning of a map (i.e. there is at least one single digest with no fragments yet in the map), the growing point is chosen as an unstarted single digest map (arbitrarily the next). After choosing such a growing point, no multiple digest fragments need to be fit. The method proceeds by adding a single digest fragment to the selected growing point.

Circular maps

Circular maps create special problems at the beginning and ending of map construction, but once underway can proceed as described above. The method of the present invention can handle circular maps without special user input, except, of course, input of an indication that the DNA is circular.

In determining a growing point, if any single digest map is not started, there is a growing point ambiguity. Possibilities for the growing point include the leftmost site(s) of the started maps and all unstarted enzyme maps. All of these possibilities must be tried.

When the selected growing point belongs to an unstarted map, then fitting multiple digest fragments employs special rules. If all of the other enzymes involved in a multiple digest belong to unstarted maps, then no multiple digest fragment is placed in the map. The fragment that would have been chosen will actually be chosen at the right end of the map, since the left end of this fragment is positioned to the left of our arbitrary starting point and the map is circular. If one or more of the other enzymes involved in the multiple digest belong to started maps, then the fragment chosen from the multiple digest is allowed to fit by much weaker rules than normal. This situation is analogous to guessing where to start the growing point map. Any fragment chosen from the multiple digest will position the first site in the growing point digest, but the method has already made one assumption that must not be contradicted. The position of the growing point must not be to the right of the right end of any other started map (i.e. contradicting the definition of growing point).

Ending a Map

For linear problems involving only two enzymes, the method of the present invention always completely solves the map at the step numbered 9 in FIG. 5. Extending the number of enzymes, however, involves some complications.

For linear maps, the method will have chosen one enzyme as the last growing point (which is synonymous with the right end of the map) and will have fit fragments from all multiple digests involving that enzyme. Since the right end is a site which is common to all maps and the growing point enzyme is a site which is not, the method must fit a last fragment into all multiple digest maps that do not contain the growing point enzyme. The fitting procedure is the same as that depicted in FIG. 8, but each of the enzymes involved in the multiple digest maps is hypothesized to be the growing point, one at a time.

For circular maps, the method will always have two loose ends at step 9 of the flowchart in FIG. 5. First, as in linear maps, there is one last fragment to fit in each of the multiple digest maps that do not involve the enzyme at the end of the map (and in this case also at the beginning). Since the maps are circular, the method must use special fitting rules to accommodate the part of the map that extends beyond the arbitrary ending/starting point wrapping around to the first site in the multiple digest map. The difference in starting positions ± error is added to each single digest map when this fragment is fit. In addition, each single digest map must be tested for fit between the part of the map that extends beyond the last growing point and the position in the map where that digest begins. The fitting rules employed are equivalent to rotating the map leftward to each of the single digest starting sites and applying the standard rules for fit.

Evaluating and presenting a map

Once the method of the present invention has produced a site order and an assignment of fragment order in every digest, it has thus produced an assignment of every fragment to a map interval. This is the requirement for use of the method of Schroeder and Blattner to determine the coordinates of the sites of the restriction map (Schroeder and Blattner, 1978). This method assigns map coordinates so as to minimize the sum of the squares of the fractional deviation between the fragment computed length (difference of coordinates) and the fragment measured length. The present invention employs this method to assign map coordinates to the maps constructed. Maps are then sorted in order of best "score".

Score may be defined in a number of ways. The present inventors have used both the normalized standard deviation of fragments and the standard deviation weighted by measurement error. In either case, the smallest score is the best fit. The first uses the sum of squares of the fractional deviation (the same thing which is minimized by the fitting method). The second uses the sum of squares of the fractional deviation weighted by the inverse of % fragment-error for the individual fragments. Thus a difference in an interval whose measurement was known to be bad (proportionally) does not count as much as a fragment whose measurement was known to be good.
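The two scores can be sketched as follows. This is an illustration only: the source does not state the exact normalization, so the sketch returns the plain sums, and the weighting by 1/(percent error) is a literal reading of the paragraph above. Function names and sample values are hypothetical.

```python
def fractional_score(measured, computed):
    """Sum of squares of the fractional deviation between computed and measured lengths."""
    return sum(((c - m) / m) ** 2 for m, c in zip(measured, computed))

def weighted_score(measured, computed, percent_error):
    """Same sum, with each term weighted by the inverse of that fragment's percent error."""
    return sum(((c - m) / m) ** 2 / p
               for m, c, p in zip(measured, computed, percent_error))

measured = [100, 200]      # fragment lengths as read from the gel
computed = [110, 190]      # differences of fitted site coordinates
percent_error = [5, 10]    # stated measurement error for each fragment, in percent

print(fractional_score(measured, computed))
print(weighted_score(measured, computed, percent_error))
```

Maps would then be ranked by ascending score, the smallest value indicating the best fit.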

Enhancements and Variations

The following variations are alternative embodiments which may be used in combination with the method described herein, or with other methods of map computation.

(1) The method may impose an ordering so as to eliminate inverted (mirror image) maps.

(2) When fitting doubles, the method may check "clues". A fragment labelled as "new" in the double digest (either input by the user or automatically labelled by the program because the size is distinct from all single-digest fragments) is rejected if the method is searching for a fragment with both ends of the same type. Those fragments which can be automatically labeled by the program would automatically be rejected because of size, but "clue" checking is known to be much faster than checking the size. Those which are close in size to a non-new fragment can't be distinguished as new just by size of the input data but the user may be able to tell by inspection of the gel and thus many false maps can be rejected if the user also inputs clues where possible.

(3) In trying to add a single-digest fragment, use of clues such as (cut-by enz X) to eliminate some fragments from consideration would be a saving of computational effort.

(4) For linear molecules, use of clues such as: Left end, Right end, or someEnd increases efficiency.

(5) It would be efficient to consider "identical" sized fragments as the same so that both permutations are not output, thus eliminating "identical" (for all practical purposes) maps. The preferred embodiment uses "identical to the last digit", but this may be varied by the user.

(6) It will be useful to extend the method to allow adding of digests using additional enzymes to an existing map. Instead of trying all orderings of single-digest fragments, for a mapped enzyme, the method would allow adding them only in the mapped order.

(6a) Similarly, it will be useful to extend the method to allow use of any available partial mapping information, such as some of the contained-in relationships or a map of part of the molecule such as one has when mapping clones of an unknown DNA in a known vector. This may be done either by constraining the order of adding fragments or by elimination of maps inconsistent with the partial mapping information.

(7) The existence of "not-visible" fragments is readily detected by consistency rules and easily handled by the present invention. According to the preferred embodiment, the consistency checker (of the data input and editing program) finds out if there are small (not visible) missing fragments and the user is advised to add a small fragment with value < some amount. This could be more automated by letting the method assume existence of such fragments.

(8) For those instances where an experimenter purifies a gel band and redigests, it may be useful to add a "contained_in" clue for those fragments' relationship to other fragments.

(9) The inclusion of some indication of adjacency within a particular digest may be added.

(10) The inclusion of some indication of the existence of fragments obtained by partial (incomplete) digestions consisting of adjacent fragments not cut may be useful. Such an indicator would be used either to eliminate some "close" solutions, or to check additive length of fragments.

Details of the Preferred Embodiment Computer Implementation

An implementation of the method of the present invention, written in the `C` computer language for the IBM Personal Computer is attached hereto as Appendix A. The system is structured into three major parts: a fragment/digest editor, a map generator and a map viewer.

The fragment/digest editor allows a user to enter digestion data from the keyboard in a convenient manner and will also accept measurements produced by a digitizer. The editor is responsible for determining the overall consistency of the data and identifying conditions that would cause problems for the map generator, such as missing fragments.

The map generator is responsible for producing all maps consistent with the fragment measurements and storing completed maps for review.

The map viewer is a screen oriented system which allows the user to examine the maps produced in various levels of detail and in a variety of formats. Foremost is the ease of comparing any two maps. The viewer also allows the user to further process any selected maps into files compatible with other analysis and formatting tools.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Data Structures of RMAP

Referring to Block 10 in FIG. 4, there are shown three data structures for the control of the program RMAP during the construction of possible restriction maps.

The FRAGMENT_DATA are a set of ordered lists, one for each different digestion in the input data. Each element of each list contains a fragment size, fragment error, an assigned-or-available flag, and (optionally) any "clues" about the fragment. Each list also has an associated pointer which is the "next-available fragment to try". It means that all available fragments above the pointer have been tried and all available fragments below the pointer have not been tried. Upon rejection of a fragment the program sets the pointer to the next (after the rejected one) available fragment in the ordered list. Upon acceptance of a fragment the program sets the pointer to the beginning of the list (first available fragment) in readiness for the next step of the recursion.
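The pointer behaviour described for the fragment lists can be sketched with two small classes. This is a hypothetical Python rendering for illustration, not the C structures of Appendix A. The class and method names (Fragment, DigestList, current, accept, reject) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    size: int
    error: int
    assigned: bool = False   # assigned-or-available flag
    clues: tuple = ()        # optional clues about the fragment

@dataclass
class DigestList:
    fragments: list          # ordered list of Fragment, one list per digest
    pointer: int = 0         # index of the next available fragment to try

    def _next_available(self, start):
        # Skip over fragments already assigned to the map.
        i = start
        while i < len(self.fragments) and self.fragments[i].assigned:
            i += 1
        return i

    def current(self):
        """Index of the next available fragment to try, or None if none remain."""
        self.pointer = self._next_available(self.pointer)
        return self.pointer if self.pointer < len(self.fragments) else None

    def accept(self, index):
        # On acceptance the pointer returns to the first available fragment.
        self.fragments[index].assigned = True
        self.pointer = self._next_available(0)

    def reject(self, index):
        # On rejection the pointer moves to the next available fragment after it.
        self.fragments[index].assigned = False
        self.pointer = self._next_available(index + 1)

digest = DigestList([Fragment(500, 25), Fragment(300, 15), Fragment(200, 10)])
digest.accept(digest.current())   # fragment 0 is placed on the map
digest.reject(digest.current())   # fragment 1 is tried and rejected
print(digest.current())           # fragment 2 is the next one to try
```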

The MAP_TO_DATE is a set of arrays, one for each different digest in the input data. Each element of the array is a fragment identifier. As the program proceeds with map construction each fragment added to the growing map is represented by assigning its identifier to the next element in the array. Each map array has an associated integer, the number of fragments assigned to the map to date, and two associated coordinate ranges, the minimum and maximum values for the right end of the map and for the previous end (i.e. left end of last fragment added).

The HEAP is a dynamic structure which stores the information needed as the program proceeds back and forth through the recursion levels. It is a stack of objects referencing lists of growing points yet to be tried for a specific map position. Each element of the list contains the growing-point enzyme type, right-end coordinate range and previous coordinate range. A new list is created with each recursion and "heaped" upon the last. Each element of the list is examined, in turn, before the algorithm retreats to the previous heap level.

At the completion of a map, the sites of the map may be read from the heap in left to right order by traversing the heap from bottom to top and referencing the first list element at each level.

Reading the Data

Data is read into the program from a data file. This file is created by an editor written especially for the purpose. Input for the editor can be either typed at the keyboard or obtained via measurement by a digitizer tablet. The editor checks fragment consistency using the rules outlined in Table 1.

Starting the map (FIG. 4 Block 20)

To start the map a fragment is selected from one of the single digests. There are some constraints on which single enzyme digest and which fragment from the digest may be used. For linear maps any digest will do. The algorithm tries all fragments as the starting fragment (see flow chart); however, selection of the first fragment is rejected if it conflicts with a "clue", i.e. if "leftend" is labelled in the starting digest, it is the only fragment from that digest which is possible as the starting fragment. Similarly, "right end" is rejected as a starting fragment, or if two directionally unspecified ends are in the data set ("ends") only those two are tried. If clues for endpoints have been specified, then one of the single digests containing clues must be chosen to start the map, otherwise we may inadvertently choose a fragment from an un-clued digest which conflicts with the end clues. All unstarted maps are assigned a range for the position of their right end site of zero ± zero.

For circular maps the enzyme digest to start is one which is involved in at least one multiple digest with every other enzyme to be mapped. This ensures complete coverage of the configuration space. If there is not such a digest, an arbitrary one is chosen. All unstarted maps are assigned a range for the position of their right end site that spans zero to the length of the first fragment.

Recursive procedure BUILDMAP (FIG. 5)

We have implemented this method as a recursive procedure. However, since the maximum depth of recursion is known for any given data set (total number of single digest fragments) it could be implemented non-recursively. The Pseudocode Listing of Table 2 outlines the computer procedure implementing the mapping method described in FIG. 5.

The details described in the paragraphs below are numbered and formatted in parallel to the Pseudocode Listing so the reader can more easily refer back to the "big picture".

BEGIN BUILDMAP {a recursive procedure which tests to see if a single-digest fragment which has been added to the map is accepted and to try adding another fragment if it is}

1. (a). Determine possible GrowingPoints by finding leftmost right end site(s) by comparing coordinates of the sites (stored on the map-to-date). For this comparison, instead of using the global range (error relative to left end) of the coordinates (stored in map) use the relative error. This is obtained by subtracting the error of the greatest-common reference point for the entire map from the range of each right end coordinate. The site with the leftmost coordinate and any other site(s) whose relative-error-ranges for its coordinate overlap that of the leftmost site are possible growing points.

i.e. Find RightEnd with lowest Low end of range. This is one new GrowingPoint. Compare that G.P. High end with the low end for every other RightEnd site. If G.P.High - RefError>otherLow then the other site is also a potential GrowingPoint OR if G.P.High>otherHigh then other is a potential GrowingPoint.

note: RefError is the error range of the leftmost PreviousEnd of all the single digest maps.

(Special case if just starting a linear map--choose exactly one unstarted map as growing point rather than all that intersect.)

(b). Place a new element on the heap containing a growing point node for each possible growing point.

This contains (i) The enzyme type of the GrowingPoint

(ii) The coordinate range (global error) of the growing point

(iii) The coordinate range of the previous growing point

2. While untried GrowingPoints, Select Next GrowingPoint at the top of the heap

3. Calculate size of fragments from multiple digests to look for {use local errors}

(a) Examine all multiple digests which contain the growing point enzyme.

(b) Find a fragment which runs from the current right end site of the multiple digest map to the growing point.

(c) Use as the reference point the left end of the single fragment in the growing point enzyme single digest.

(d) Criteria for fit: Sum of fragments with errors in the multiple digest from the reference point rightwards, plus the query fragment with error must intersect the size of the single digest fragment (of growing point type) with its error.

(Note: special case if at beginning of circular map--sum of query fragment minus error and starting position of multiple digest map minus error must be less than the leftmost single digest map right end plus error. This effectively places the first site in the growing point digest map anywhere from the last growing point to the next most likely growing point. After fitting one fragment using this rule all others must fit as if the growing point digest map begins at this position.)

4. While an untried set of fragments which fit exist, Select fragments from multiple digests.

One fragment from each multiple digest 3(a) must fit 3(d). Existence of set that does fit allows acceptance of single digest fragment added to map in previous recursion and the algorithm can proceed forward. This loop tries all permutations of one fragment from each multiple which fits against the single.

At this point, the single-digest fragment added in the last recursion has been accepted. Now proceed to get ready to add the next fragment to the map.

5. Coalesce end of single digest map with each multiple digest.

Using the set of fragments from multiples selected above (Step 4):

(a) take the intersection of the ranges of all the sums used to fit the multiple digest fragments and the range of the single. If there is an empty intersection, reject this permutation of multiple digest fragments. {continue with step 4}

(b) add this range (a) to the PreviousEnd of the growing point digest. Also sum the errors.

(c) adjust the coordinate of the growing point by putting the result (b) in the right end coordinate of the map of the single digest.

Note: this is where the starting position of circular maps of the single digests is determined and why this method can map circular molecules without a specific user input offset. If we are just starting the growing point digest, then we substitute 0 ± 0 for the PreviousEnd in step 5(b).

{start at beginning of list of unassigned fragments for single digest of GrowingPoint enzyme. This important point is handled by the selection routine for single digest fragments so the pointer is set to the top of the list in a previous recursion when the last fragment was selected.}

6. Are any single digest fragments untried? if not, then proceed to step 9, otherwise

While still untried fragments in single digest of GrowingPoint type,

(a) Select next fragment from single digest to try adding to map at GrowingPoint.

(b) Calculate new right end for that map.

(i) save the right-end as PreviousEnd.

(ii) The new right-end is right-end (stored in map) + (fragment size of fragment selected ± error on selected fragment).

7. Call BUILDMAP {to see if added fragment is accepted and to try adding another fragment if it is. return here if fragment selected in 6 above is rejected}

8. Remove fragment from Map. Return fragment to fragment list, Set next-frag-to-try pointer to next fragment after this one. Restore PreviousEnd and RightEnd (by reading from heap) end Loop 6 (while still fragments left to try)

9. Query: is this a map? If yes, record the map.

Remove multiple digest fragments selected in step 4 from map. Return fragments to fragment list.

end Loop 4 (while still permutations of fragments from each double digest that fit)

remove GrowingPoint node from Heap

end Loop 2 (while still untried growing points)

remove top level from heap (contains no more GrowingPoint nodes now)

RETURN {fragment added in previous recursion doesn't fit}
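By way of illustration only, the range arithmetic of steps 5(a) through 5(c) may be sketched as follows. The sketch is written in Python, not the C of the implemented program, and it assumes that each size or coordinate with its error is held as a (low, high) pair; the names Range, intersect, add_ranges and coalesce are hypothetical and do not appear in the implementation.

```python
from typing import Optional, Sequence, Tuple

Range = Tuple[float, float]  # (low, high) bounds; assumed representation of value ± error


def intersect(ranges: Sequence[Range]) -> Optional[Range]:
    """Step 5(a): common range of all inputs, or None if the intersection is empty."""
    low = max(r[0] for r in ranges)
    high = min(r[1] for r in ranges)
    return (low, high) if low <= high else None


def add_ranges(x: Range, y: Range) -> Range:
    """Step 5(b): adding bounds pairwise adds the values and sums the errors."""
    return (x[0] + y[0], x[1] + y[1])


def coalesce(previous_end: Range, single: Range,
             multiple_sums: Sequence[Range]) -> Optional[Range]:
    """Return the new right end of the growing point digest (step 5(c)),
    or None when this permutation of multiple digest fragments is rejected."""
    common = intersect([single, *multiple_sums])
    if common is None:
        return None  # empty intersection: continue with step 4
    return add_ranges(previous_end, common)


# Just starting the growing point digest: PreviousEnd is 0 ± 0.
new_end = coalesce((0.0, 0.0), (950.0, 1050.0), [(980.0, 1100.0)])
```

Here a single digest fragment of 1000±50 and a multiple digest sum bounded by 980 and 1100 narrow the right end to the range 980 to 1050; a sum bounded by 1060 and 1100 would give an empty intersection and the permutation would be rejected.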

Starting Maps

FIG. 6 outlines the procedure for fitting double digest fragments. This includes the special cases of just starting a map.

Starting a linear map

When any of the single digest maps are not yet started (i.e. no fragments yet assigned to map), special rules apply to finding the (next) growing point and to fitting fragments. The growing point selected is the next (arbitrary order) unstarted digest. Since the growing point is synonymous with the left end, we would look for a fragment of zero length to be fit, and obviously no such fragment would be represented in the data. The previously added single digest fragment is therefore accepted (without assigning any multiple digest fragments) and a single digest fragment from the new growing point is added.

Starting a circular map

When any of the single digest maps are not yet started, special rules apply to finding the (next) growing point and to fitting fragments. The possible growing points are the leftmost right end site(s) of the started digest map(s) and all unstarted digest maps (e.g. at the first call of BUILDMAP all single digest maps are determined to be possible growing points). All possibilities must be tried one after the other.
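The selection of possible growing points for a circular map may be sketched as follows. This is an illustrative Python sketch only. It assumes that each started map's right end is held as a (low, high) range and that an unstarted map is marked by None; it further assumes that a right end counts as "leftmost" whenever its range overlaps the smallest upper bound, which is one plausible reading of comparing coordinates using relative error. The function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[float, float]  # (low, high) bounds of a right end coordinate


def growing_point_candidates(right_ends: Dict[str, Optional[Range]]) -> List[str]:
    """Return enzymes whose right end may be leftmost, plus all unstarted digests."""
    started = {name: r for name, r in right_ends.items() if r is not None}
    unstarted = [name for name, r in right_ends.items() if r is None]
    candidates: List[str] = []
    if started:
        # Assumed test: an end may be leftmost if its lower bound does not
        # exceed the smallest upper bound among the started maps.
        smallest_high = min(high for _, high in started.values())
        candidates = [name for name, (low, _) in started.items()
                      if low <= smallest_high]
    return candidates + unstarted


ends = {"EcoRI": (900.0, 1100.0), "BamHI": (1050.0, 1250.0),
        "HindIII": (2000.0, 2100.0), "PstI": None}
candidates = growing_point_candidates(ends)
```

With these hypothetical right ends, EcoRI and BamHI overlap and are both possible growing points, HindIII lies clearly to the right, and the unstarted PstI digest is always a candidate. Each candidate would then be tried in turn.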

Special rules also apply to fitting multiple digest fragments. The various cases are flowcharted in FIG. 6 and the fitting rules for the special case are diagrammed in FIG. 10. In order to understand why these rules work it is necessary to realize that the selection of a growing point has made a hypothesis about the order of sites in the map. Since we try all possible growing points this is legitimate. Any multiple digest involving the selected growing point enzyme must be considered. A fragment to be fit is to be placed left of (ending at) the growing point.

If the growing point enzyme map has already been started, there are two cases to consider: either none of the other enzyme maps in the multiple digest has been started, or at least one of them has. If both the growing point digest map and (one of) the other enzyme maps have been started, then the regular rules apply. If none of the other enzymes in the multiple digest have started maps, then the first site for that enzyme has not yet been encountered and is right of the growing point. Since the growing point enzyme map has started, the fragment to be fit is the same size as the last fragment in the growing point enzyme digest map.

If the growing point enzyme map has not been started, then we don't really know the position of the growing point except that it is right of the previous growing point and left of any other right end in the map. If none of the maps for any of the other enzymes in the double digest have been started, the fragment left of the growing point extends left of the arbitrary starting point for the map and so no fragment need be fit. (Unlike the linear case, where the concept of a fragment left of the left end is meaningless, there really is such a fragment in a circular map. It is fit at the end of the process.) However, when the growing point digest map has not been started and the other enzyme map has been started, the fragment to be fit runs from the other enzyme to the growing point. Thus the first double digest fragment to be fit has a large error, since the uncertainty of the location of the growing point site is large. After the first multiple digest fragment is selected, the site location is set using this fragment, and fragments to be fit from any other appropriate multiple digests have errors based on the local error of the first (i.e. sum of overlapping fragments).

Ending a Map

When all fragments from single digests have been placed in the map (FIG. 5, Block 200) the algorithm has reached a stage where BUILDMAP has made all the forward progress that it can. In linear maps with two enzymes this stage represents a completed map, but in more complicated maps one last fitting step still remains before we have a complete map.

For linear maps of three or more enzymes, we have chosen only one of the single digest enzymes as the last growing point. Since that site represents the right end of the map, it is a site common to all digests, but we have only fit multiple digest fragments for digests that contain the growing point enzyme. Therefore, we need to determine that the last unchosen fragment from each of the multiple digests NOT containing the growing point enzyme will fit at the right end of the map.

To accomplish that, we scan the set of multiple digests in a predefined order (our implementation uses the order of occurrence in the data). Each multiple digest that contains an unchosen fragment has that fragment chosen and tested against each of the single digest maps for the enzymes it contains. This involves hypothesizing each other enzyme as the last growing point and then applying the fitting rules shown in FIG. 8. If all such fragments fit, then the map is complete and we proceed to Block 210 of FIG. 5.

In order to restore the state of the map in such a way that BUILDMAP can continue we must replace exactly the set of fragments chosen by this last step. This is accomplished by replacing one fragment from the end of each multiple digest map NOT containing the last growing point enzyme in the same predefined order in which these fragments were selected. The ordering is important if any of these fragments fails to fit, since we must replace ONLY those fragments chosen by this last step.

For circular maps of two or more enzymes, we have completed the map for all digests containing the enzyme of the single digest used to begin the map. Although all single digest fragments have been chosen at this step, we still need to fit a fragment from each multiple digest NOT containing the growing point enzyme (same as the enzyme site at the start/end of the map). The procedure for fitting these remaining fragments is much the same as for the linear case detailed above, but only the leftmost of the single digest right ends for enzymes contained in the multiple digest is tested for fit, since all fragments to the right of this point have already been fit at the beginning of the map.

Since circular maps allow weak fitting rules in order to start single digest maps, we are in a position at the end of the map to test this hypothesis using the stronger fitting rules of FIG. 8. Although all fragments have been placed in the map, we may have managed to avoid testing the fit of some single digest fragments that span our arbitrary starting/ending site. The only fragments which have escaped scrutiny are those which span the start/end site and are composed of three or more multiple digest fragments. Rather than identifying these individual cases, our implementation simply tests every one of the rightmost single digest fragments against the multiple digest fragments contained within it using the rules shown in FIG. 8 modified to cycle to the beginning of the map at the start/end site.

Recording a completed map

Once a complete map has been identified, we are in a position to evaluate the quality of the map in relation to the others generated by BUILDMAP. The present invention applies the method of Schroeder and Blattner to compute positions for all map sites based upon the generated order of fragmentation data. A measure of goodness of fit is then calculated using the formula: ##EQU3## where:

Fm=measured fragment size from site i to site j

Fc=computed fragment size from site i to site j

n=total number of fragments represented in data

M=measure of deviation evaluated over all pairs i,j which are measured

A perfect map produces a measure of zero and larger measures indicate greater deviation from an optimal map. The preferred embodiment ranks maps in order of increasing measure of deviation and records maps in a randomly accessed file indexed by a file of deviation measures. Purely as a matter of convenience, only up to fifty maps are stored. Beyond fifty maps, measures better than the fiftieth map cause replacement by the current map and an update of the index file maintaining the measures in increasing order.
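The bookkeeping for the fifty best maps may be sketched as follows. This is a minimal Python illustration, not the implemented file handling: it assumes the index is held in memory as a sorted list of (deviation, map identifier) pairs and takes the deviation measure as an already computed input. The names MAX_MAPS and record_map are hypothetical.

```python
import bisect
from typing import List, Tuple

MAX_MAPS = 50  # limit chosen purely as a matter of convenience


def record_map(index: List[Tuple[float, int]], deviation: float,
               map_id: int, limit: int = MAX_MAPS) -> bool:
    """Insert a map into the index, kept in increasing order of deviation.

    Returns False when the index is full and the new map is no better
    than the worst stored map.
    """
    if len(index) >= limit and deviation >= index[-1][0]:
        return False
    bisect.insort(index, (deviation, map_id))
    if len(index) > limit:
        index.pop()  # the previously worst map is replaced
    return True


# Small demonstration with a limit of three instead of fifty.
index: List[Tuple[float, int]] = []
for map_id, dev in enumerate([0.5, 0.2, 0.9]):
    record_map(index, dev, map_id, limit=3)
rejected = record_map(index, 0.95, 3, limit=3)
accepted = record_map(index, 0.1, 4, limit=3)
```

In the demonstration the map with deviation 0.95 is not stored, while the map with deviation 0.1 displaces the map with deviation 0.9 and moves to the head of the index.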

Each map is represented in the randomly accessed file by a record composed of:

(1) the vector of site coordinates produced by the method.

(2) the order of single digest enzymes corresponding to the sites of (1) that is internally represented by the first element of each heap list.

(3) the generated order of fragment identifiers for each digest map represented in the data.

All other information (such as fragment sizes, error, enzyme names, etc.) is assumed to be reproducible from the input data.

The index contains all references necessary to reproduce the input data (specifically the data pathname), the number of maps stored in the randomly accessed file and a record for each map containing:

(1) the position of the map record in the random file.

(2) the measure of deviation for that map.

Structuring the output this way promotes complete modularity of the reviewing process, which may therefore be implemented as a completely independent program.

Map Consistency

In order to successfully apply the map generation algorithm, several requirements of the input data must be met. These requirements are embodied by the following assumptions:

______________________________________
Purity Rule:        (1) The DNA to be mapped is a single molecular strain.
Topology Rule:      (2) The DNA is known to be exclusively either circular or linear.
Digestion Rule:     (3) Digestion of the DNA has proceeded to completion.
Combination Rule:   (4) At least two independent sets of cleavage products are represented, separately and in combination.
Orphan Rule:        (5) Each enzyme or proper subset of enzymes involved in a multiple digestion is represented by a separate (single) digestion and identified as such.
Completeness Rule:  (6) Every fragment has been represented in the data and a measurement and bounding error has been assigned to it.
Uniqueness Rule:    (7) Any fragments of identical size are individually represented.
______________________________________

From these primary assumptions we have formulated tests which may be used to qualify the data for map generation (Table 1). Data failing any one of these tests will guarantee incomplete coverage of the configuration space and will give unsatisfactory (or no) results. Therefore, all tests must succeed before any attempt at map generation is made.

The tests in Table 1 apply assumptions 1, 2, 3, 6, and 7 in combination to yield equations relating the number of fragments between single and multiple digests. Any digests not meeting these equalities are assumed to have violated either the Completeness Rule (6) or the Uniqueness Rule (7) and the user is directed to re-examine the experiment in order to identify the cause of the inconsistency.

Application of the Combination Rule (4) involves identifying at least two single digestions and at least one multiple digestion in the data. Once that has been satisfied we may apply the Orphan Rule (5) to determine whether any set of single digestions is missing. Passing this last test qualifies the data for map generation.
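The Combination Rule and Orphan Rule tests may be sketched as follows. This Python sketch is illustrative only. It assumes each digest is described by the set of enzyme names used in it, and it checks only that every enzyme of a multiple digest has its own single digest; it does not check proper subsets of three or more enzymes. The function name and the message wording are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def check_combination_and_orphans(digests: Sequence[FrozenSet[str]]) -> List[str]:
    """Return a list of rule violations; an empty list qualifies the data."""
    singles = {next(iter(d)) for d in digests if len(d) == 1}
    multiples = [d for d in digests if len(d) > 1]
    problems: List[str] = []
    # Combination Rule (4): at least two single digests and one multiple digest.
    if len(singles) < 2 or not multiples:
        problems.append("Combination Rule: need two single digests "
                        "and one multiple digest")
    # Orphan Rule (5): every enzyme in a multiple digest needs a single digest.
    for m in multiples:
        missing = sorted(set(m) - singles)
        if missing:
            problems.append("Orphan Rule: no single digest for "
                            + ", ".join(missing))
    return problems


good = [frozenset({"EcoRI"}), frozenset({"BamHI"}),
        frozenset({"EcoRI", "BamHI"})]
bad = [frozenset({"EcoRI"}), frozenset({"EcoRI", "BamHI"})]
good_result = check_combination_and_orphans(good)
bad_result = check_combination_and_orphans(bad)
```

The first data set passes both tests. The second lacks a BamHI single digest and so fails both the Combination Rule and the Orphan Rule.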

STATEMENT OF INDUSTRIAL UTILITY

The present invention may be useful for the generation of restriction maps of DNA or for solutions to similar mapping problems. DNA restriction maps find application in the design of recombinant DNA, and in the elucidation of DNA structures for purposes of genetic comparison, diagnosis, and other such uses.

Glossary of Terms

Decomposition Fragment--See Fragment.

DNA--Deoxyribonucleic Acid. For mapping purposes it is a one-dimensional entity of finite length either linear (two ends) or circular (no ends).

Double-digest--All the fragments produced by incubation of a DNA molecule with two restriction enzymes at once.

Fragment--A subsection of a DNA molecule created by cleaving the DNA with restriction enzyme(s). The ends of the fragment are two restriction sites. The size (length) of the fragment is the distance between the two restriction sites on the DNA molecule from which the fragment was cleaved. Fragments may also be sub-fragments of larger fragments produced by single digests.

Genome--The entire genetic content of an organism. For most organisms, this comprises one or more molecules of DNA. (Exceptions are some viruses, e.g. retroviruses which have RNA genomes.)

Growing Point--The data point used in the construction of maps according to the present invention, defined as the site in the restriction map, chosen from the set of right ends of all single digest maps, which is assumed to be furthest to the left.

Growing Point Enzyme--The enzyme used in the single digest which has been chosen as the growing point.

Map--An enumeration of the sites for one or more restriction enzymes on a particular DNA molecule.

Mapping--The process of deducing the map of a particular DNA molecule for one or more restriction enzymes from single and multiple digests.

Measurement error--The length of fragments can be measured. Methods generally used to obtain fragment sizes for the purposes of mapping have a measurement error roughly proportional to fragment length (i.e. the error is relative, such as ±5%, rather than absolute, such as ±620).

Multiple-Digest--All the fragments produced by incubation of a DNA molecule with two or more restriction enzymes at once.

Permutation--A unique ordering of objects. (E.g. a permutation of fragments or sites).

Restriction enzyme--An endonuclease which cleaves DNA at one or more specific sites.

Restriction site--A site on a DNA molecule at which a restriction enzyme cleaves. It is usually identified with the name of the enzyme which cleaves at that site and the position of the site (± error) on the DNA molecule in question.

Single-digest--All the fragments produced by incubation of a DNA molecule with any one restriction enzyme.

Reference Point--A site in the restriction map used for the purpose of comparing the right ends of two single digest maps where error is localized.

TABLE 1
______________________________________
FRAGMENT CONSISTENCY RULES
______________________________________
let
  f_1     = number of fragments in digest 1
  f_2     = number of fragments in digest 2
  f_{1,2} = number of fragments in 1 + 2 double digest
  s_1     = number of sites of type 1
  s_2     = number of sites of type 2
  s_{1,2} = number of sites of either type 1 or 2

for linear molecules
  f_1 = s_1 + 1
  f_2 = s_2 + 1
  f_{1,2} = s_1 + s_2 + 1 = f_1 + f_2 - 1

for circular molecules
  f_1 = s_1
  f_2 = s_2
  f_{1,2} = f_1 + f_2

For digests containing "clue" information:
let
  cut_1   = number of fragments of digest 1 cut by enzyme 2
  uncut_1 = number of fragments of digest 1 not cut by enzyme 2
  new     = number of fragments in the double digest which are not in either single digest

  cut_1 + uncut_1 = f_1
  cut_2 + uncut_2 = f_2
  uncut_1 + uncut_2 + new = f_{1,2}

assume s_1 and s_2 > 0,
  1 ≤ cut_1 ≤ s_2
  1 ≤ cut_2 ≤ s_1

for linear molecules
  new = cut_1 + cut_2 - 1
  |f_1 - f_2| - 1 ≤ uncut_1 + uncut_2 ≤ |f_1 - f_2| + 1
  1 ≤ new ≤ 2* min(s_1,s_2)

for circular molecules
  new = cut_1 + cut_2
  uncut_1 + uncut_2 = |f_1 - f_2|
  2 ≤ new ≤ 2* min(s_1,s_2)
______________________________________
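The equalities of Table 1 may be applied to fragment counts as in the following sketch. It is an illustrative Python sketch covering only the equalities for a pair of single digests and their double digest; the inequalities of the table are not checked. The function names are hypothetical.

```python
def double_digest_count_ok(f1: int, f2: int, f12: int, circular: bool) -> bool:
    """Check f_{1,2} = f_1 + f_2 (circular) or f_1 + f_2 - 1 (linear)."""
    expected = f1 + f2 if circular else f1 + f2 - 1
    return f12 == expected


def clue_counts_ok(f1: int, f2: int, f12: int,
                   cut1: int, uncut1: int, cut2: int, uncut2: int,
                   new: int, circular: bool) -> bool:
    """Check the "clue" equalities relating cut, uncut and new fragments."""
    expected_new = cut1 + cut2 if circular else cut1 + cut2 - 1
    return (cut1 + uncut1 == f1
            and cut2 + uncut2 == f2
            and uncut1 + uncut2 + new == f12
            and new == expected_new)


# Linear molecule of length 10 with one site of type 1 at 3 and one of type 2 at 5:
# digest 1 = {3, 7}, digest 2 = {5, 5}, double digest = {3, 2, 5}.
linear_ok = double_digest_count_ok(2, 2, 3, circular=False)
linear_clues_ok = clue_counts_ok(2, 2, 3, 1, 1, 1, 1, 1, circular=False)
```

In the worked example only the fragment of size 2 is new, one fragment of each single digest is cut by the other enzyme, and both checks succeed. A double digest reporting four fragments for the same linear data would fail the first check and point to a violation of the Completeness or Uniqueness Rule.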

TABLE 2
__________________________________________________________________________
Outline of Procedure Buildmap formatted like the implemented C program. This procedure implements the method flowcharted in FIG. 5. Steps are numbered in accordance with the flowchart in FIG. 5. Terms are defined in the text and the method is explained in text and figures. In this table, messages surrounded by brackets are {comments}.

Overview of Recursive Procedure BUILDMAP for constructing restriction maps from fragment size data
__________________________________________________________________________
Begin BUILDMAP {to see if added fragment is accepted and to try adding another fragment if it is}
   Determine which right ends may be growing points by finding leftmost right end site(s) {by comparing the coordinates using relative error}
   While untried GrowingPoints,
      2. Select GrowingPoint
      3. Calculate size of fragments from multiple digests to look for {use local errors}
      While a set of untried fragments which fit exist
         4. Select a permutation of fragments from multiple {fragments must fit. Existence of set that do allows acceptance of single digest fragment added to map in last recursion}
         5. rectify maps by coalescing sites {using selected set of fragments from multiples which fit to adjust the coordinate of the growing point}
         {start at beginning of list of unassigned fragments for single digest of GrowingPoint enzyme. note this is handled by the Select routine}
         While still untried fragments in single digest of GrowingPoint Enzyme
            6. Select next fragment from single digest to try adding to GrowingPoint and Calculate new right end for that map.
            7. Call BUILDMAP {to see if added fragment is accepted try adding another fragment if it is}
            {return to here if reject fragment selected in 6 above}
            8. Remove fragment (selected in 6) from map, mark it as tried and replace it in fragment list
         end while {still fragments left to try}
         9. Query is this a map? If yes, go to 10. OUTPUT PROCEDURE {return here after successful map output so we can continue and get all maps}
      end while {still permutations of fragments from each double digest that fit}
   end while {still untried growing points}
RETURN {fragment added in previous recursion doesn't fit}
__________________________________________________________________________
##SPC1##
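The select, recurse and remove pattern of Table 2 may be illustrated by the following greatly simplified sketch. It is not the implemented BUILDMAP: it is written in Python, handles only a linear molecule with two enzymes, uses a single absolute tolerance in place of accumulated per-fragment errors, and keeps its state in function arguments where the implementation uses a heap. The name build_maps and all parameter names are hypothetical.

```python
from typing import List, Tuple

Map = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (order of digest a, order of digest b)


def build_maps(a: List[float], b: List[float], d: List[float],
               tol: float = 0.0) -> List[Map]:
    """Enumerate fragment orders of single digests a and b that explain double digest d."""
    maps: List[Map] = []

    def recurse(order_a, order_b, rest_a, rest_b, rest_d, last_site):
        end_a, end_b = sum(order_a), sum(order_b)
        growing_point = min(end_a, end_b)   # leftmost right end
        need = growing_point - last_site    # double digest size to look for
        tried_d = set()
        for i, frag in enumerate(rest_d):
            if abs(frag - need) > tol or frag in tried_d:
                continue
            tried_d.add(frag)
            left_d = rest_d[:i] + rest_d[i + 1:]
            if not rest_a and not rest_b:
                # All single digest fragments are placed: is this a map?
                if not left_d and abs(end_a - end_b) <= tol:
                    maps.append((tuple(order_a), tuple(order_b)))
                continue
            grow_a = end_a <= end_b
            pool = rest_a if grow_a else rest_b
            tried = set()
            for j, single in enumerate(pool):
                if single in tried:
                    continue  # equal-sized fragments give identical maps
                tried.add(single)
                rest = pool[:j] + pool[j + 1:]
                if grow_a:
                    recurse(order_a + [single], order_b, rest, rest_b,
                            left_d, growing_point)
                else:
                    recurse(order_a, order_b + [single], rest_a, rest,
                            left_d, growing_point)

    # Start both single digest maps at the left end with each distinct first fragment.
    for i, fa in enumerate(a):
        if fa in a[:i]:
            continue
        for j, fb in enumerate(b):
            if fb in b[:j]:
                continue
            recurse([fa], [fb], a[:i] + a[i + 1:], b[:j] + b[j + 1:], list(d), 0.0)
    return maps


found = build_maps([3, 7], [5, 5], [3, 2, 5])
```

For the small data set shown, the sketch finds the map with digest a ordered 3, 7 and its mirror image ordered 7, 3; a linear molecule is always reported in both orientations. Because each recursive call receives fresh lists, the removal and restoration of step 8 is implicit here, whereas the implementation performs it explicitly on shared state.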

* * * * *