U.S. patent application number 10/855367 was filed with the patent office on 2004-05-28 and published on 2005-12-01 for method and system for the detection of atypical sequences via generalized compositional methods.
This patent application is currently assigned to International Business Machines Corporation. The invention is credited to Isidore Rigoutsos and Aristotelis Tsirigos.
Application Number | 20050267692 10/855367 |
Family ID | 35426500 |
Filed Date | 2004-05-28 |
United States Patent
Application |
20050267692 |
Kind Code |
A1 |
Tsirigos, Aristotelis ; et
al. |
December 1, 2005 |
Method and system for the detection of atypical sequences via
generalized compositional methods
Abstract
A method and system for determining whether a sequence fragment
g is atypical with respect to a reference sequence G using
compositional methods and including constructing a template from G
and g respectively containing a sequence of characters for a
comparison with one another, wherein a number of characters
contained in the template exceeds two. For the case where the
sequences at hand are genetic, the atypicality detection can be
used to determine whether a given sequence fragment g is the result
of a horizontal transfer event.
Inventors: |
Tsirigos, Aristotelis;
(Astoria, NY) ; Rigoutsos, Isidore; (Astoria,
NY) |
Correspondence
Address: |
MCGINN INTELLECTUAL PROPERTY LAW GROUP, PLLC
8321 OLD COURTHOUSE ROAD
SUITE 200
VIENNA
VA
22182-3817
US
|
Assignee: |
International Business Machines
Corporation
Armonk
NY
|
Family ID: |
35426500 |
Appl. No.: |
10/855367 |
Filed: |
May 28, 2004 |
Current U.S.
Class: |
702/20 |
Current CPC
Class: |
G16B 30/10 20190201;
Y10S 707/99936 20130101; G16B 30/00 20190201 |
Class at
Publication: |
702/020 |
International
Class: |
C12Q 001/68; G06F
019/00; G01N 033/48; G01N 033/50 |
Claims
Having thus described our invention, what we claim as new and
desire to secure by Letters Patent is as follows:
1. A method of determining that a sequence fragment g is atypical
with respect to a reference sequence G, said method comprising:
constructing at least one template, each said template containing a
sequence of characters; counting a number of occurrences in said
sequence fragment g for each of said at least one template; and
counting a number of occurrences in said reference sequence G for
each of said at least one template, wherein a number of said
characters contained in each of said at least one template exceeds
two.
2. The method of claim 1, wherein the reference sequence G
comprises a genetic sequence.
3. The method of claim 1, wherein the sequence fragment g comprises
a genetic sequence.
4. The method of claim 2, wherein the reference sequence G
comprises a partial sequence of a genome.
5. The method of claim 2, wherein the reference sequence G
comprises a complete sequence of a genome.
6. The method of claim 2, wherein the reference sequence G
comprises a collection of segments.
7. The method of claim 6, wherein G comprises a genome and the
sequence fragment g and the reference sequence G comprise genes
from the genome G.
8. The method of claim 3, wherein the sequence fragment g comprises
a non-coding sequence.
9. The method of claim 3, wherein the sequence fragment g comprises
a coding sequence.
10. The method of claim 3, wherein a part of the sequence fragment
g comprises a coding sequence and the remaining part comprises a
non-coding sequence.
11. The method of claim 2, wherein said number of characters is
less than nine.
12. The method of claim 3, wherein said number of characters is
less than nine.
13. The method of claim 2, wherein said number lies in a range of
six to eight.
14. The method of claim 3, wherein said number lies in a range of
six to eight.
15. The method of claim 1, wherein said template optionally
includes at least one "don't care" character.
16. The method of claim 1, further comprising: recording a
representation of the reference sequence G as a sequence of said
number of occurrences of at least one template.
17. The method of claim 16, further comprising: converting said
number of occurrences into a frequency.
18. The method of claim 1, further comprising: recording a
representation of sequence fragment g as a sequence of said number
of occurrences of said at least one template.
19. The method of claim 18, further comprising: converting said
number of occurrences into a frequency.
20. The method of claim 16, further comprising: using a similarity
function to compare said sequence of number of occurrences of said
sequence fragment g that is being evaluated with a similar sequence
of number of occurrences of said reference sequence G.
21. The method of claim 20, wherein said similarity function
comprises one of: Pearson correlation; standard χ² values;
Mahalanobis distance; and Kullback-Leibler distance.
22. The method of claim 20, further comprising: for sequence
fragment g and reference sequence G, determining whether the result
of comparison (e.g., a typicality score) falls below a threshold
value.
23. The method of claim 6, wherein the sequence fragment g
comprises one of the segments comprising the reference sequence
G.
24. A method of representing a genetic sequence fragment g, said
method comprising: constructing at least one template containing a
sequence of characters; and counting a number of occurrences in the
sequence fragment g for at least one template, wherein a number of
said characters contained in said at least one template exceeds
two.
25. The method of claim 24, wherein the sequence fragment g
comprises a genetic sequence.
26. The method of claim 25, wherein the sequence fragment g
comprises a partial sequence of a genome.
27. The method of claim 25, wherein the sequence fragment g
comprises a complete sequence of a genome.
28. The method of claim 25, wherein sequence fragment g comprises a
collection of segments.
29. The method of claim 25, wherein the sequence fragment g
comprises a non-coding sequence.
30. The method of claim 25, wherein the sequence fragment g
comprises a coding sequence.
31. The method of claim 25, wherein a part of the sequence fragment
g comprises a coding sequence and a remaining part is a non-coding
sequence.
32. The method of claim 25, wherein said number of characters is
less than nine.
33. The method of claim 25, wherein said number lies in a range of
six to eight.
34. The method of claim 24, wherein said template optionally
includes at least one "don't care" character.
35. The method of claim 24, further comprising: counting the number
of occurrences of at least one said template in the sequence
fragment g; and recording a representation of sequence fragment g
as a sequence of said number of occurrences of said at least one
said template.
36. The method of claim 35, further comprising: converting said
number of occurrences into a frequency.
37. A system, comprising: a compositional feature vector calculator
to compute a number of occurrences of a given template having at
least two template characters in a genetic sequence fragment under
evaluation; and a memory interface to retrieve said genetic
sequence fragment under evaluation from a memory and to record said
number of occurrences of said template in a genetic sequence
fragment under evaluation.
38. The system of claim 37, wherein said template includes a number
of characters that is less than nine.
39. The system of claim 37, wherein said number lies in a range of
six to eight.
40. The system of claim 37, wherein said template optionally
includes at least one "don't care" character.
41. The system of claim 37, further comprising: a graphical user
interface to allow a user to enter data for said template; a
typicality score calculator module to calculate a typicality score
for a sequence fragment on a reference sequence; and a threshold
value calculator module to calculate an atypicality threshold for
the sequence fragment under evaluation.
42. The system of claim 41, further comprising: a lateral gene
transfer (LGT) calculator to determine whether said typicality
score for said genetic sequence is below said threshold.
43. A signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus to perform a method of detecting horizontal gene transfer
in an organism, said program comprising: a compositional feature
vector calculator to compute a number of occurrences of a given
template having at least two template characters in a genetic
fragment under evaluation; and a memory interface module to
retrieve said genetic fragment under evaluation from a memory and
to record said number of occurrences of said template in said
genetic fragment under evaluation.
44. The signal-bearing medium of claim 43, wherein said template
has a number of characters that is less than nine.
45. The signal-bearing medium of claim 43, wherein said number lies
in a range of six to eight.
46. The signal-bearing medium of claim 43, wherein said template
optionally includes at least one "don't care" character.
47. The signal-bearing medium of claim 43, further comprising: a
graphical user interface module to allow a user to enter data for
said template; a typicality score calculator module to calculate a
typicality score for a sequence fragment on a reference sequence;
and a threshold value calculator module to calculate an atypicality
threshold for said sequence fragment under evaluation.
48. The signal-bearing medium of claim 47, further comprising: a
lateral gene transfer (LGT) calculator module to determine whether
said typicality score for said genetic sequence is below said
threshold.
49. The method of claim 22, wherein the result of said comparison
comprises a typicality score.
50. A system of determining that a sequence fragment g is atypical
with respect to a reference sequence G, said system comprising:
means for constructing at least one template, each said template
containing a sequence of characters; means for counting a number of
occurrences in said sequence fragment g for each of said at least
one template; and means for counting a number of occurrences in
said reference sequence G for each of said at least one template,
wherein a number of said characters contained in each of said at
least one template exceeds two.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] The present invention generally relates to a method of
detecting atypical sequence fragments contained in longer
one-dimensional sequences composed of letters from a fixed
alphabet. More specifically, a method of representing sequences and
sequence fragments, as based on an experimentally-determined
optimal range of pattern template lengths, in combination with a
correlation function and a threshold calculation, improves upon
conventional methods for carrying out this task. For clarity, we
develop and discuss our method for the case where the sequences are
genomic sequences, and in particular DNA; in this case, atypical
sequences are generally thought to be the result of horizontal
transfer events. The described method is nonetheless applicable in a
more general context, as would be immediately apparent to someone
skilled in the art.
[0003] 2. Description of the Related Art
[0004] As already mentioned above, the system and method that we
describe below is generally applicable to the case where a long
sequence comprised of letters from a given alphabet contains a
fragment which is "atypical," i.e. unlike the remainder of the
sequence. For the purpose of presenting a clear description of the
approach, we have chosen to develop the discussion for the
specific case where the sequences at hand are genetic sequences,
and in particular DNA; therein we seek to locate atypical sequence
fragments. Generally, atypical sequence fragments overlap with
regions defining genes but this is not a requirement.
[0005] Let us now continue our discussion in the context of genomic
sequences and complete genomes. In recent years, the genomes of an
increasing number of organisms have been experimentally
determined. Through study of these genomic sequences
and with the help of other types of analyses, it has been
discovered that organisms can acquire genetic sequences from
organisms that are not necessarily related to them, through a
process that has been termed "horizontal gene transfer" (HGT) or
"lateral gene transfer" (LGT). In the discussion that follows, the
terms "horizontal gene transfer" and "lateral gene transfer" and
the abbreviations HGT and LGT will be used interchangeably.
[0006] An exemplary problem addressed by the present invention is
that of determining whether an observed genetic sequence from an
organism is native to that organism's genome or represents an
"atypicality" acquired through the HGT process. And as already
mentioned, this exemplary problem is a special case of the problem
of identifying atypical sequence fragments within longer
sequences.
[0007] Before proceeding with a discussion of the present
invention, it is noted that a genome can be thought of as a
sequence of nucleic acids; among other things, a genomic sequence
contains the definitions of the corresponding organism's genes. The
above-mentioned exemplary problem, therefore, can be described as
the process of deciding for a given segment of a given genetic
sequence, referred to as the "query," whether the sequence is
native to the organism's genome or is the result of a transfer
event. The query could be distinct from a gene coding region,
partially overlapping a gene coding region, or wholly-contained
within a gene coding region.
[0008] This scenario is analogous to the problem of determining
which words, if any, from a given sequence of natural language
words have actually originated in another language. It is noted
that this "donor" language need not be known. The sought
determination is to be made by looking at the words alone, without
knowing the meaning of each word and without a dictionary upon which
to rely in order to make the decision. Analogously, in the specific
problem under consideration that we use to develop our method, and
since genomic sequences are available for only a relatively small
number of organisms, one cannot rely on the availability of a
repository of reference sequences. In general, it is entirely
possible that a given genetic sequence has been transferred
horizontally from a donor organism whose genome has not yet been
sequenced.
[0009] Techniques for determining whether a genetic sequence of a
given size is atypical are known and will be discussed
shortly. However, there is a growing need for new methods that
exhibit an increased sensitivity in detecting genetic atypicalities
while at the same time reducing the number of false positive
predictions.
[0010] The present invention addresses the problem of
characterizing a sequence fragment of a given sequence in terms of
its atypicality. We develop the invention for the concrete case
where the given sequence is a genome sequence. Recall that for a
given organism, those genomic fragments whose origin can be traced
to an exogenous source, with high probability, are ideal candidates
for being atypical and putative instances of horizontal gene
transfer (HGT).
SUMMARY OF THE INVENTION
[0011] In view of the foregoing, and other, exemplary problems,
drawbacks, and disadvantages of the conventional systems, it is an
exemplary feature of the present invention to provide a system and
method in which detection of atypical sequences is improved over
conventional methods.
[0012] It is another exemplary feature to provide a template having
optimal length for the purpose of detecting atypical sequences.
[0013] To achieve the above and other features, in a first
exemplary aspect of the present invention, described herein is a
method and system for detecting atypical sequence fragments, the
method including constructing a template that subselects a sequence
of characters from a sequence under evaluation, wherein a number of
characters contained in the template exceeds two.
[0014] In a second exemplary aspect of the present invention, also
described herein is a method for representing a genetic sequence,
including constructing a template containing a sequence of genetic
characters for a comparison, counting the number of occurrences of
the template in the genetic sequence, and recording a
representation of the genetic sequence being evaluated as a feature
vector comprising a sequence of number of occurrences of at least
one template, wherein a minimum number of the characters contained
in the template is two.
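As an illustration only, the template-counting representation described above might be sketched as follows. The sketch assumes a DNA alphabet of A, C, G and T, overlapping (sliding-window) counting, and a "." symbol for the "don't care" character; the names `template_counts`, `count_template`, and `to_frequencies` are hypothetical and do not appear in the application.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet


def template_counts(sequence, n):
    """Feature vector of occurrence counts for all 4**n contiguous templates.

    Counting is done with an overlapping sliding window (an assumption).
    Windows containing characters outside ALPHABET are ignored.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # A fixed template ordering keeps vectors from different sequences comparable.
    return [counts.get("".join(w), 0) for w in product(ALPHABET, repeat=n)]


def count_template(sequence, template, wildcard="."):
    """Count overlapping matches of a single template.

    `wildcard` plays the role of a "don't care" character: it matches anything.
    """
    seq = sequence.upper()
    n = len(template)
    return sum(
        all(t == wildcard or t == s for t, s in zip(template, seq[i:i + n]))
        for i in range(len(seq) - n + 1)
    )


def to_frequencies(counts):
    """Convert occurrence counts into relative frequencies."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]
```

For example, `template_counts("ACGTACGT", 3)` yields a 64-element vector whose entries sum to the six windows of the input, and `count_template("ACGTACGT", "A.G")` counts the two positions where an A is followed two characters later by a G.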
[0015] In a third exemplary aspect of the present invention, also
described herein is a signal-bearing medium tangibly embodying a
program of machine-readable instructions executable by a digital
processing apparatus to perform a method of detecting horizontal
gene transfer in an organism, the program including a feature
vector calculator module to compute the number of occurrences of at
least one template in a sequence under evaluation and a memory
interface module to retrieve the genetic sequence under evaluation
from a memory and to record the number of occurrences of the
template in the genetic sequence under evaluation, wherein the
template contains at least two characters.
[0016] With the above and other unique and unobvious exemplary
aspects of the present invention, it is possible to improve the
detection of atypical sequence fragments. For the special case
where the query sequence coincides with the extent of a gene coding
region, the present invention permits the elucidation of putative
HGTs.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The foregoing and other purposes, aspects and advantages
will be better understood from the following detailed description
of an exemplary embodiment for the special instance where the
invention is applied to the problem of identifying LGTs, with
reference to the drawings, in which:
[0018] FIG. 1 shows an overall performance result 100 of scoring
methods for LGT detection wherein a specific instance of the
present invention is compared with four conventional methods;
[0019] FIG. 2 shows the average relative improvement 200 of a
specific instance of the present invention when compared with the
second best performing conventional method (i.e., CAI), as a
function of amount of artificially introduced LGTs;
[0020] FIG. 3 shows the improvement 300 of a specific instance of
the present invention relative to the second best performing
conventional method (CAI) for each of 123 genomes that were used in
the experiments;
[0021] FIG. 4 shows the improvement 400 of a specific instance of
the present invention relative to the next best performing
conventional method (CG) for each of 123 genomes that were used in
the experiments;
[0022] FIG. 5 shows the result 500 of LGT detection by the present
invention as a function of the length n of the template used and
for various amounts of artificially introduced LGTs;
[0023] FIG. 6 is an exemplary graph 600 from the experimental data
that illustrates the LGT threshold using the genomic sequence of
Aeropyrum pernix as the input;
[0024] FIG. 7 exemplarily illustrates a template 700 used in the
present invention having word width n;
[0025] FIG. 8 illustrates an exemplary flowchart 800 of a software
program to derive the compositional feature vector used in the
present invention;
[0026] FIG. 9 illustrates an exemplary flowchart 900 of a software
program to derive the typicality scores (and LGT likelihood) for
each sequence fragment of interest;
[0027] FIG. 10 illustrates an exemplary block diagram 1000 of a
software module for executing the present invention;
[0028] FIG. 11 illustrates an exemplary hardware/information
handling system 1100 for incorporating the present invention
therein; and
[0029] FIG. 12 illustrates a signal bearing medium 1200 (e.g.,
storage medium) for storing steps of a program of a method
according to the present invention.
DETAILED DESCRIPTION OF AN EXEMPLARY EMBODIMENT OF THE
INVENTION
[0030] Referring now to the drawings, and more particularly to
FIGS. 1-12, exemplary embodiments of the present invention will now
be described.
[0031] The present invention evolved from an effort by the present
inventors to study and improve conventional methods for detecting
atypicalities. For the special case where the atypicalities
corresponded to putative LGTs, as described below, the present
inventors have carried out considerable experimentation to
determine an optimal range for template lengths appropriate for
this task.
[0032] In the following discussion, an alternative representation
scheme is presented that is based on a sequence's composition. A
focus is on devising novel methodologies which can help identify
sequences with atypical features (e.g., genes whose features
deviate from those of the majority of the genes in the genome under
consideration). In the case of genomic sequences, the underlying
conjecture is that such atypical genes are most likely the result
of lateral transfer events. An extensive set of experiments with
123 archaeal and bacterial genomes is described and demonstrates
that in the special instance of studying the problem of horizontal
transfers, the present method exhibits a markedly improved ability
in identifying horizontally-transferred genes, when compared to
previously reported approaches.
[0033] In addition, it is shown that improvement generally
increases with the size of the templates that are used as features
and in conjunction with an appropriate similarity metric. For the
case of genetic sequences, it appears that an optimum set of
conditions exists, as we discuss below, and that further increase
in template size leads to a decrease in performance. Finally, the
method is shown to extend to the case where,
instead of individual sequence fragments, the objective is to
detect clusters of atypical sequence fragments (e.g. groups of
genes that appear close to one another in the genome of the
receiving organism).
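To make the role of the similarity metric concrete, the following is a minimal sketch of scoring fragments against a reference by Pearson correlation (one of the similarity functions named in the claims) and flagging those that fall below a threshold. The function names and the way the threshold is supplied are assumptions made for illustration; how the threshold is calculated is not addressed here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector; 0.0 is a convention chosen here
    return cov / math.sqrt(var_x * var_y)


def flag_atypical(fragment_vectors, reference_vector, threshold):
    """Return (typicality score, is_atypical) for each fragment feature vector.

    A fragment is flagged when its score falls below `threshold`,
    which the caller must supply (hypothetical interface).
    """
    results = []
    for vec in fragment_vectors:
        score = pearson(vec, reference_vector)
        results.append((score, score < threshold))
    return results
```

In this sketch each vector would hold template counts or frequencies for one fragment, and `reference_vector` the corresponding values for the reference sequence G.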
[0034] In recent years, the increase in the amounts of available
genomic data has made it easier to appreciate the extent to which
prokaryotes make use of horizontal gene transfer in order to
increase their genetic diversity. Horizontally transferred genes
give rise to extremely dynamic genomes in which substantial amounts
of coding DNA are incorporated into the chromosome from external
sources. Unlike eukaryotes, which evolve principally through the
modification of existing genetic information, prokaryotes seem to
have obtained a significant proportion of their genetic diversity
through the acquisition of sequences from organisms to which they
can be rather distantly related. Thus, in principle, horizontal
gene transfer has the potential to produce extremely dynamic
genomes in which substantial amounts of DNA, exogenous in origin,
are inserted in the chromosome. Such transfers have the potential
to change both the ecological and the pathogenic character of
archaea and bacteria.
[0035] The significance of horizontal gene transfer for prokaryotic
evolution was not recognized until the 1950s, when pathogenic
bacteria began developing multi-drug resistance patterns, on a
worldwide scale. The ease with which certain bacteria developed
resistance to the same spectrum of antibiotics indicated that these
traits were being transferred across taxa, rather than being
generated de novo by each lineage.
[0036] It was not until much later that the widespread impact of
horizontal gene transfer on prokaryotic evolution was adequately
appreciated. These early attempts to study rapid evolution by gene
acquisition encompassed four issues that have since become central
to all subsequent analyses.
[0037] First, it is imperative to devise methods for detecting and
identifying cases of horizontal gene transfer. Second, one would
ideally like to determine the source organism for each one of the
horizontally-transferred genes. Third, it is essential to
understand how many and which traits have been conveyed to the host
genome as a result of the identified transfer events. Fourth, one
would like to estimate the relative rates at which different
classes of genes are mobilized across genomes.
[0038] Naturally, to establish whether a new trait, or a specific
region in a genome is the result of horizontal transfer processes,
it would be most satisfying if one could observe the conversion of
a deficient strain in the presence of an appropriate donor. It
would be all the more convincing to establish that genetic material
had indeed been transferred and also to determine the manner in
which it was acquired.
[0039] But, in the absence of controlled experimental settings,
actual transfer events cannot be easily pinpointed. Thus,
unambiguous evidence of their occurrence must be derived from other
sources. This has given rise to a number of techniques that
have been developed for the detection of genetic exchange events
across species, and which pursue this goal by processing
sequence data.
[0040] Horizontally-transferred genes share an unusually high
degree of affinity with the donor organism. Moreover, both donor
and recipient genomes are expected to share every trait that is
associated with the genes in question. Since each transfer event
involves the introduction of DNA into a single lineage, the
acquired trait(s) will be limited to the descendants of the
recipient strain and will remain absent from closely related taxa.
This will, in turn, produce a scattered phylogenetic
distribution.
[0041] On the other hand, horizontal gene transfer need not be
invoked as the causal agent in attempts to explain the sporadic
occurrence of certain phenotypic characteristics (e.g., the
ability to withstand particular antibiotics). In fact, these
properties can frequently originate through point mutations in
existing genes and, therefore, may be the result of independent
evolution in diverging lineages.
[0042] Frequently, additional information has been brought to bear
in an effort to distinguish between convergent evolution and
horizontal gene transfers. It should come as no surprise that the
strongest evidence for (or against) horizontal gene transfer
derives from a genetic analysis of the implicated DNA
sequences.
[0043] DNA sequence information has been used in diverse ways to
identify cases of horizontal gene transfer, but the approach
underlying most proposed analyses is the discovery of features
indicating that the evolutionary history of genes within a
particular region differs substantially from that of ancestral
(i.e., vertically-transmitted) genes.
[0044] Similarly to distinctive phenotypic properties, DNA segments
gained through horizontal gene transfer often display a restricted
phylogenetic distribution across related strains or species. In
addition, these species-specific regions may show unduly high
levels of DNA or protein-sequence similarity to genes from taxa
that are inferred to be very divergent by means of other criteria.
The significance of aberrant phylogenies can be evaluated by
phylogenetic congruency tests or other means.
[0045] Although gene comparisons and their phylogenetic
distributions can be useful in detecting horizontal transfers, it
is the DNA sequences of the genes themselves which provide the best
clues as to their origin and ancestry. Prokaryotic species display
a wide degree of variation in their overall G+C content, but the
genes in a particular species' genome are fairly similar with
respect to their base composition.
[0046] Similar observations have been made for patterns of codon
usage and the frequencies of di- and trinucleotides. Consequently,
any sequences that may have been introduced in a genome through
recent horizontal transfer ought to retain, at least initially, the
sequence characteristics of the donor genome and thus could, in
principle, be distinguished from the DNA that has been ancestrally
present in the recipient genome through examination of the actual
nucleotide sequence.
[0047] It is not surprising that genomic regions often manifest
several attributes which are tell-tale signs of their acquisition
through horizontal gene transfer. For example, there is a large
number of S. enterica genes that are absent from E. coli (as well
as other enteric bacteria) and which have base compositions that
are significantly different from the average G+C content (=52%) of
S. enterica's genome. Also, certain serovars of S. enterica may
contain more than a million bases not present in other serovars, as
assessed by a genomic subtraction procedure.
[0048] The base compositions of these anonymous, serovar-specific
sequences suggest that at least half of them are the result of
horizontal transfer events. Additionally, the regions adjacent to
these putative horizontal transfers often contain vestiges of
sequences facilitating their integration in the recipient's genome,
such as remnants of transposable elements, transfer origins of
plasmids, or known attachment sites of phage integrases. All this
orthogonal evidence further attests to the sequences' foreign
origin.
[0049] In what follows, a novel methodology is presented that is
based on genomic composition and can identify atypical sequence
fragments. We showcase the methodology in the special case of
identifying putative horizontally-transferred genes. With the help
of a very extensive set of experiments with 123 archaeal and
bacterial genomes, we also demonstrate that this method, hereinafter
referred to as Wn, outperforms previously published approaches,
such as C+G and its variations, dinucleotides, and CAI.
[0050] Before outlining the details of the present invention, we
begin with a brief overview of previously reported methods for
identifying atypical genetic sequences that are based on genomic
composition. We then follow the description of the
Wn technique of the present invention with a description of
experiments and the reporting of obtained results on a collection
of 123 genomes; these results make very apparent the improvement
over traditionally used methods that can be achieved by the Wn
technique.
[0051] An Overview of Composition-Based Methods
[0052] Previously published methods for horizontal gene transfer
detection which have been based on gene content assume that in a
given organism there exist compositional features that remain
relatively unchanged across its genomic sequence. Genes that
display atypical nucleotide composition compared to the prevalent
compositional features of their containing genome are likely to
have been horizontally acquired.
[0053] Consequently, over the years, multiple features have been
used to define characteristic 'signatures' for a genome: once such
a signature has been computed, any sequence fragment that deviates
from it can be marked as a horizontal transfer candidate.
[0054] The simplest and historically earliest type of genomic
signature is its composition in the bases G and C (i.e. the
genome's G+C content). It is important to note that due to the
periodicity of the DNA code, implied by the organization of the
coding regions into codons, the G+C content varies significantly
based on the position within the codon. As a result, four discrete
G+C content signatures can be identified. The first of the four G+C
signatures corresponds to the overall G+C content and is computed
by considering the entire nucleotide sequence of a genome, and
includes both coding and non-coding regions. The remaining three
signatures are denoted by G+C(k), k=1,2,3 and correspond to the
value of the G+C content as determined by considering only those
nucleotides occupying the k-th position within a codon. Unlike the
first of these four signatures, only coding regions are used in the
computation of G+C(k).
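The four G+C signatures just described can be illustrated with a short sketch. It assumes that each coding sequence is supplied in frame, starting at the first base of its first codon; the function names `gc_content` and `gc_k` are hypothetical.

```python
def gc_content(sequence):
    """Overall G+C content: fraction of bases that are G or C."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_k(coding_sequences, k):
    """G+C(k): G+C content restricted to codon position k (1, 2 or 3).

    Only coding sequences are used. Each one is assumed to be in frame;
    any trailing partial codon is dropped.
    """
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2 or 3")
    picked = []
    for cds in coding_sequences:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # trim an incomplete final codon
        picked.append(cds[k - 1:usable:3])  # every third base, starting at position k
    return gc_content("".join(picked))
```

For the single coding sequence `"ATGGCC"` (codons ATG and GCC), the first codon positions are A and G, so `gc_k(["ATGGCC"], 1)` is 0.5, while the third positions are G and C, giving 1.0.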
[0055] A related variation of the G+C(k) content idea is the
so-called Codon Adaptation Index (CAI) which was introduced by
Ikemura in 1985. CAI measures how well a given gene's codon usage
correlates with the codon usage of highly expressed genes from the
organism under consideration.
[0056] In yet another variation introduced in the context of a
study of the E. coli genome, Lawrence and Ochman identified
atypical coding regions by simultaneously combining G+C(1) and
G+C(3). Additionally, for each gene they computed a "codon usage"
that assessed the degree of bias in the use of synonymous codons
compared to what was expected from each of the three G+C(k) values.
A gene was deemed atypical when its relative "codon usage", as
defined above, differed significantly from its CAI index.
[0057] The codon usage patterns in E. coli were also investigated
by Karlin et al, who found that the codon biases observed in
ribosomal proteins deviate the most from the biases of the average
E. coli gene. Using this observation, they defined "alien" genes as
those genes whose codon bias: a) exceeded a threshold when compared
to that of the average gene, and, b) was high relative to the one
observed in ribosomal proteins.
[0058] Another popular genomic signature is the relative abundance
of dinucleotides compared to single nucleotide composition. As
demonstrated by Karlin and co-workers, despite the fact that
genomic sequences display various kinds of internal heterogeneity,
including G+C content variation, coding versus non-coding, mobile
insertion sequences, etc., they nonetheless preserve an
approximately unchanged distribution of dinucleotide relative
abundance values, when calculated over non-overlapping 50-kb-wide
windows covering the genome. Moreover, the dinucleotide relative
abundance values of different sequence samples of DNA from the same
organism are generally much more similar to each other than they
are to sequence samples from other organisms. Finally, closely
related organisms generally have more similar dinucleotide relative
abundance values than do distantly related organisms.
[0059] Karlin and co-workers also introduced the "codon signature"
which is defined as the dinucleotide relative abundances at the
distinct codon positions 1-2, 2-3 and 3-4 (4=1 of the next codon).
For large collections of genes (50 or more), they found that the
codon signature, like the genome signature, is essentially
invariant. The codon signature largely adheres to the genome
signature but accommodates amino acid constraints.
[0060] Hooper and Berg proposed as a genomic signature the
dinucleotide composed of the nucleotide in the third codon position
and the first position nucleotide of the following codon. Using the
16 possible dinucleotide combinations, they calculate how well
individual genes conform to the computed mean dinucleotide
frequencies of the genome they belong to. Mahalanobis distance,
instead of Euclidean distance, is used to generate a distance
metric on the dinucleotide distribution. As it turned out, genes
from different genomes could be separated with a high degree of
accuracy.
[0061] Hayes and Borodovsky have also demonstrated the connection
between gene prediction and atypical sequence detection. While
addressing the problem of accurate statistical modeling of DNA
sequences as coding or noncoding for bacterial species, they
observed that more than one statistical model was necessary to
describe the gene-containing regions. This was attributed to the
diversity of oligo-nucleotide compositions among the gene coding
regions, and specifically the variety of the underlying codon usage
strategies. In the simplest case, two models sufficed, one for
typical and one for atypical genes. The atypical model that they
introduced allowed the prediction of genes which escape
identification by the typical model while at the same time
suggesting good horizontal transfer candidates.
[0062] S. Garcia-Vallve, A. Romeu, and J. Palau, in their 2000
article "Horizontal Gene Transfer in Bacterial and Archaeal
Complete Genomes", identified horizontal gene transfer candidates
by combining multiple sources of information. In particular, their
analysis was based on G+C and G+C(k) content, on codon usage, on
amino-acid usage and on gene position. Genes whose G+C content
significantly deviated from the mean G+C content of the organism
were considered to be candidate gene transfers provided that: a)
they also have an extraneous codon usage (computed in a similar
way), b) their size is more than 300 bp, and c) the amino-acid
composition of the corresponding protein deviated from the average
amino-acid composition of the genome. The same authors stress the
importance of excluding highly expressed genes from the candidate
set of gene transfers, since they may deviate from the mean values
of codon usage simply because they must adapt in order to reflect
changes in tRNA abundance: for example, ribosomal proteins are
filtered and are not included in the predictions.
[0063] In a 2003 article, entitled "HGT-DB: a database of putative
horizontally transferred genes in prokaryotic complete genomes," S.
Garcia-Vallve, E. Guzman, M. A. Montero, and A. Romeu used this
method to generate results for 88 complete bacterial and archaeal
genomes: the putative horizontally transferred genes were collected
in the HGT-DB database that is accessible on the world-wide
web.
[0064] It should be stressed however, that Garcia-Vallve and
co-workers did not introduce a new genomic representation scheme,
but rather combined several distinct, previously-published
modalities into one feature vector. The complication with this
approach, as is always the case with feature vectors comprising
distinct non-uniform features, is that it is difficult to derive a
distance function which properly takes into account the distinct
features, the different units, the different dynamic ranges of
values, etc.
[0065] In direct contrast to this, the method of the present
invention preferably uses a single feature to determine whether a
sequence fragment is atypical when compared with the rest of the
sequence that contains it.
[0066] Before concluding our brief summary of earlier work, we
should also mention an approach that is radically distinct from the
ones described above in that it is not composition-based. In
"Distributional profiles of homologous open reading frames among
bacterial phyla: implications for vertical and horizontal
transmission", M. Ragan and R. Charlebois organize ORFs from
different genomes into groups of high sequence similarity (using
gapped-BLAST) and look at the distributional profile of each group
across the genomes. ORFs whose distribution profile cannot be
reconciled parsimoniously with a tree-like descent and loss are
candidates for having been transmitted through horizontal gene
transfer. In other words, instead of deciding whether a sequence
fragment is typical or atypical by comparing its composition to
that of the containing genome, they carry out a statistical
comparison of similar genes across genomes.
[0067] The Wn Method of the Present Invention
[0068] We now detail the Wn method of the present invention for
deriving generalized compositional features (single modality). The
Wn method extends and generalizes composition-based methods in
three distinct ways:
[0069] first, higher-order tuples of letters are used; this
overcomes the diminished discrimination power exhibited by the
previously proposed lower-order tuples (e.g. the di- and
tri-nucleotide models). Use of richer compositional features leads
to an improved ability to classify. It also leads to an increased
sensitivity in identifying genes with atypical compositions.
[0070] second, in our method, we extend the composition-based model
in a manner that allows certain positions of the sequence at hand
to be "ignored". This is achieved through the use of generating
templates that include "gaps", where each gap is denoted by the "."
(also known as the "don't care" character); sequence letters that
occupy a "don't care" position are ignored during the
computation. As an example, the template A.G, which matches each of
AAG, ACG, AGG, and ATG, will effectively ignore the identity of the
nucleotide occupying the middle position.
[0071] third, our generalized model permits the consideration and
use of templates that do not occupy consecutive positions. For
example, in the case of genetic sequences we can use this property
to optionally take the periodicity of the DNA code into account:
when calculating codon frequencies (through the use of
tri-nucleotide templates) one can incorporate the requirement that
only the templates that start at positions 3k+1, where k is a
non-negative integer, be examined and participate in any subsequent
computations.
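The second and third generalizations can be sketched together as follows. This is a minimal illustration, not the implementation of the invention: the names `matches` and `count_templates`, and the `start` and `step` parameters used to express position-specific counting, are assumptions introduced here. Positions are 0-based in the code, so the 1-based positions 3k+1 correspond to `start=0, step=3`.

```python
def matches(template: str, word: str) -> bool:
    """True if word matches template, where '.' matches any letter."""
    return len(template) == len(word) and all(
        t == "." or t == w for t, w in zip(template, word)
    )


def count_templates(seq: str, templates, start: int = 0, step: int = 1) -> dict:
    """Count occurrences of each template in seq.

    start/step restrict the starting positions that are examined:
    step=1 counts every position, step=3 counts only in-frame positions.
    """
    counts = {t: 0 for t in templates}
    for t in templates:
        n = len(t)
        for pos in range(start, len(seq) - n + 1, step):
            if matches(t, seq[pos:pos + n]):
                counts[t] += 1
    return counts
```

In the sequence CAAGACGAT the template A.G occurs twice when all positions are examined (AAG and ACG), but not at all when only the in-frame positions 0, 3 and 6 are examined.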
[0072] In this augmented model, we define the compositional feature
vector .phi.(s) for any given sequence s over a set of templates
.pi.={.pi..sub.1, .pi..sub.2, . . . , .pi..sub.q} as
.phi.(s)=(.alpha..sub.1, .alpha..sub.2, . . . , .alpha..sub.q),
where .alpha..sub.i is the frequency of template .pi..sub.i in
sequence s. Instead of using the absolute template frequencies,
these frequencies can be optionally normalized over the expected
template frequencies given the expected single nucleotide
composition with respect to some background reference sequence.
Generally, if g denotes the sequence fragment whose typicality
property matches that of G which is the reference sequence, then
the single-letter frequencies of the reference sequence G are
considered to also match the expected single-letter frequencies of
the sequence fragment g. In the case of genomic sequences, g could
be a gene or an arbitrary DNA fragment, whereas G could be a genome
or part of a genome. The relative (normalized) frequencies are
given by the following equation:

$$\alpha_i = \frac{P_g(\pi_i)}{\prod_{j=1}^{|\pi_i|} P_G(\pi_{ij})}$$
[0073] where .pi..sub.ij is the j-th letter of template .pi..sub.i.
The probability of the "." character is one. The probability in the
numerator is the observed template frequency in the sequence
fragment, whereas the single-letter probabilities in the
denominator are computed from the reference sequence G; as we have
mentioned already, the computations can be carried out with the
help of templates that can be made position-specific.
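A minimal sketch of this normalization follows. The function names are hypothetical, and the sketch assumes that the observed template frequency is the fraction of starting positions in the fragment at which the template matches; the text does not fix that detail.

```python
from collections import Counter
from math import prod


def letter_probs(reference: str) -> dict:
    """Single-letter probabilities estimated from the reference sequence G."""
    counts = Counter(reference)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def template_frequency(seq: str, template: str) -> float:
    """Fraction of starting positions in seq at which template matches."""
    n = len(template)
    windows = len(seq) - n + 1
    if windows <= 0:
        return 0.0
    hits = sum(
        all(t == "." or t == w for t, w in zip(template, seq[p:p + n]))
        for p in range(windows)
    )
    return hits / windows


def normalized_frequency(fragment: str, reference: str, template: str) -> float:
    """Observed frequency in g divided by the expected frequency under G."""
    p_ref = letter_probs(reference)
    # The "." character contributes a probability of one.
    expected = prod(1.0 if t == "." else p_ref.get(t, 0.0) for t in template)
    if expected == 0.0:
        return 0.0  # assumption: templates impossible under G score zero
    return template_frequency(fragment, template) / expected
```

With a reference in which all four nucleotides are equally likely, the template A.G has an expected frequency of 0.25 x 1 x 0.25; a fragment in which it matches at half of the positions therefore has a normalized frequency of 8.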
[0074] Turning Compositional Features to Typicality Scores
[0075] Given a reference sequence G, and a sequence fragment g, the
objective is to determine g's "atypicality" with respect to the
reference sequence G. It should be noted that, in general, G does
not have to be a single sequence of letters, but can be thought of
as a collection of n.sub.G fragments--in this case, any
computations involving G will be carried out with each of the
n.sub.G fragments in turn and the results will be aggregated.
[0076] In the case of genomic sequences, we wish to characterize
any arbitrary sequence fragment g (whether it corresponds to a
coding region, to a non-coding region, or both) in terms of its
similarity to G (which can in turn be a whole genome, or part of a
genome, or a collection of n.sub.G genomic fragments). The
assumption here is that G exhibits a relatively unchanged
composition over extended intervals that may not be contiguous
necessarily: thus, a sequence fragment g whose template composition
differs substantially from the typical composition of the reference
sequence G will be considered "atypical" with respect to G.
[0077] In our method, a score S.sub.G(g) is assigned to each
sequence fragment g as a result of a comparison with G. The higher
the score, the more typical g is with respect to G. In the case of
genomic data, a sequence fragment with low S.sub.G(g) value is
likely to have been acquired through a horizontal gene transfer
event from a (generally unknown) donor into G.
[0078] Given a sequence fragment's feature vector .phi.(g), a
straightforward way to compute a typicality score is to compare it
with the feature vector .phi.(G) of the reference sequence G. The
comparison can be performed in multiple ways, and it will yield a
score S.sub.G(g) of similarity between g and G.
[0079] Four commonly used similarity metrics are correlation,
.chi..sup.2 test, Mahalanobis distance and relative entropy. The
first method involves the calculation of the classic Pearson
correlation between the vectors for g and G. In this case, g's
score S.sub.G(g) with respect to the reference sequence G can be
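computed as follows:

$$S_G(g) = \frac{E\left[(\phi(g) - \mu_{\phi(g)})(\phi(G) - \mu_{\phi(G)})\right]}{\sigma_{\phi(g)}\,\sigma_{\phi(G)}}$$

where \mu and \sigma denote the mean and the standard deviation of
the components of the indicated vector.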
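The correlation score can be sketched in a few lines. The function name is hypothetical; the two vectors are assumed to be computed over the same ordered set of templates, and a constant vector is assigned a score of zero by convention.

```python
from math import sqrt


def correlation_score(phi_g, phi_G) -> float:
    """Pearson correlation between two feature vectors of equal length."""
    n = len(phi_g)
    if n == 0 or n != len(phi_G):
        raise ValueError("feature vectors must be non-empty and of equal length")
    mean_g = sum(phi_g) / n
    mean_G = sum(phi_G) / n
    cov = sum((a - mean_g) * (b - mean_G) for a, b in zip(phi_g, phi_G))
    var_g = sum((a - mean_g) ** 2 for a in phi_g)
    var_G = sum((b - mean_G) ** 2 for b in phi_G)
    if var_g == 0.0 or var_G == 0.0:
        return 0.0  # assumption: correlation is undefined for constant vectors
    return cov / sqrt(var_g * var_G)
```

The score ranges from -1 to 1, with values near 1 indicating that the fragment uses the templates in nearly the same proportions as the reference sequence.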
[0080] The standard .chi..sup.2 test measures the deviation of a
vector from its expected value by summing up the deviations of each
vector component. In this case, g's score is obtained by negating
it, so that high .chi..sup.2 values (and thus high deviations) will
correspond to atypical scores:

$$S_G(g) = -\sum_k \frac{(\phi_k(g) - E[\phi_k(g)])^2}{E[\phi_k(g)]}$$
[0081] where the expected value for component k is estimated by the
mean value of the component across all n.sub.G segments comprising
G (clearly, if G consists of one sequence only, then n.sub.G=1):

$$E[\phi_k(g)] = \frac{1}{n_G} \sum_{g \in G} \phi_k(g)$$
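As a sketch of the .chi..sup.2 score and of the expected-value estimate, under the assumption that components with a zero expected value are skipped (the text does not say how they are handled), and with hypothetical function names:

```python
def expected_vector(segment_vectors):
    """Component-wise mean of the feature vectors of the n_G segments of G."""
    n = len(segment_vectors)
    q = len(segment_vectors[0])
    return [sum(v[k] for v in segment_vectors) / n for k in range(q)]


def chi2_score(phi_g, expected) -> float:
    """Negated chi-square deviation of phi_g from the expected vector."""
    # Components with zero expectation are skipped (an assumption).
    return -sum((o - e) ** 2 / e for o, e in zip(phi_g, expected) if e > 0)
```

A fragment whose vector equals the expected vector scores 0, the highest possible value; larger deviations give increasingly negative scores.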
[0082] The need to use the Mahalanobis distance arises in the case
where the selected compositional features are significantly
correlated with each other. As a result their covariance matrix K
contains important information. The score induced by the
Mahalanobis distance is:
S.sub.G(g)=-(.phi.(g)-.phi.(G)).sup.TK.sup.-1(.phi.(g)-.phi.(G))
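The Mahalanobis score can be evaluated without forming the inverse of K explicitly, by solving a linear system instead. The sketch below is written in plain Python; the helper `solve` and the function name `mahalanobis_score` are hypothetical, and the covariance matrix K is assumed to be supplied by the caller and to be non-singular.

```python
def solve(K, d):
    """Solve K x = d by Gauss-Jordan elimination with partial pivoting."""
    n = len(d)
    a = [[float(v) for v in row] + [float(d[i])] for i, row in enumerate(K)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                for c in range(col, n + 1):
                    a[r][c] -= factor * a[col][c]
    return [a[i][n] / a[i][i] for i in range(n)]


def mahalanobis_score(phi_g, phi_G, K) -> float:
    """Negated quadratic form (phi_g - phi_G)^T K^-1 (phi_g - phi_G)."""
    d = [x - y for x, y in zip(phi_g, phi_G)]
    x = solve(K, d)  # x = K^-1 d
    return -sum(di * xi for di, xi in zip(d, x))
```

When K is the identity matrix the score reduces to the negated squared Euclidean distance between the two vectors.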
[0083] When the feature vector defines a distribution (e.g.
frequencies of all possible tri-nucleotides), a score can be
assigned to a fragment g by measuring the distance of the
distribution defined by g's vector from the one defined by G's
vector using the concept of relative entropy (also known as
Kullback-Leibler distance):

$$S_G(g) = -\sum_k \phi_k(g) \ln \frac{\phi_k(g)}{\phi_k(G)}$$
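A sketch of the relative-entropy score follows. Both vectors are assumed to be distributions over the same templates. The small constant `eps`, which guards against templates that never occur in G, is an assumption of this sketch and not part of the method as described.

```python
from math import log


def relative_entropy_score(phi_g, phi_G, eps: float = 1e-12) -> float:
    """Negated Kullback-Leibler distance of g's distribution from G's."""
    score = 0.0
    for p, q in zip(phi_g, phi_G):
        if p > 0:  # terms with p = 0 contribute nothing
            score -= p * log(p / max(q, eps))
    return score
```

Identical distributions score 0; any difference between the two distributions yields a negative score.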
[0084] The Wn Method for Atypicality Detection
[0085] Given a reference sequence G that comprises n.sub.G
segments, one instance of the problem solved by the present
invention is to determine the subset of those n.sub.G segments that
are deemed atypical. Clearly, the method can be trivially extended
to determine whether a given sequence fragment g (not necessarily
in G) is atypical with respect to G.
[0086] In the case of genomic sequences, the method of the present
invention can be used to determine whether a given region g of
genome G is the result of a horizontal transfer or not. In another
obvious extension, the sequence fragments g correspond to coding
regions, i.e. genes. An exemplary goal is to first compute for each
gene g in genome G a typicality score with respect to G. This score
reflects the similarity of g to the whole genome G as this
similarity can be assessed by the selected compositional
features.
[0087] For our analysis, and in addition to the previously proposed
methods (e.g. G+C content, CAI, etc.), we also evaluated the
performance of methods which make use of templates of size greater
than or equal to 3 (=size of a codon) to form compositional
features.
[0088] As a first observation, we have found that at template sizes
greater than 2, the optimal performance was obtained by not taking
the periodicity of the genetic code into account, i.e. by counting
all the templates without skipping the 2.sup.nd and 3.sup.rd codon
positions, and by choosing correlation as the similarity metric for
computing the final scores. For template size n, the corresponding
method is denoted as Wn, where n is greater than 2.
[0089] Furthermore, for genomic sequence data, the performance
increased with the template size, reaching a maximum value at sizes
6 through 8 and then dropped sharply as the size increased further.
The user has a choice regarding which of the three template sizes
(i.e. 6, 7 or 8) to use, and the answer depends on the size of the
genomic sequence at hand: short sequences will dictate a value of 6
whereas long sequences will dictate a value of 8.
[0090] Clearly, and in order to achieve greater specificity, the
largest possible template size should always be opted for, provided
that the sequence being processed can yield a substantial number of
templates of that size. As a rule of thumb, a lower size template
should be used when individual gene transfers are sought, whereas
larger size templates should be reserved for when the user tries to
identify clusters of laterally transferred genes: a sliding window
method to that end is described next.
[0091] The Wn Variant for Atypicality Detection that Uses Sliding
Windows
Here, for completeness, we briefly describe how the Wn
method can be applied to the problem of detecting clusters of
putative transfers in genomic sequences: instead of computing the
feature vector .phi.(.) for an individual gene, the computation is
carried out across the extent of a sliding window that extends over
multiple neighboring genes simultaneously.
[0092] The size of the window, i.e. the number of genes that are
included in the computation of the feature vector .phi.(.) is a
parameter of the algorithm. A score is obtained for each placement
of the window, and the score of each gene is simply the average of
the scores generated by all the windows in which the gene
participates.
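The averaging step can be sketched as follows. The callable `score_window`, which is assumed to map a list of neighboring genes to a single typicality score, is hypothetical and stands in for whichever feature vector and similarity metric are chosen.

```python
def sliding_window_gene_scores(genes, window_size: int, score_window):
    """Score each gene as the mean score of all windows that contain it."""
    n = len(genes)
    if not 1 <= window_size <= n:
        raise ValueError("window_size must be between 1 and the number of genes")
    totals = [0.0] * n
    counts = [0] * n
    for start in range(n - window_size + 1):
        s = score_window(genes[start:start + window_size])
        for i in range(start, start + window_size):
            totals[i] += s
            counts[i] += 1
    # Every gene lies in at least one window, so counts are never zero.
    return [t / c for t, c in zip(totals, counts)]
```

Genes near either end of the sequence participate in fewer windows than genes in the interior, so their scores are averages over fewer placements.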
[0093] The Wn Method for LGT Detection: Automated Threshold
Selection
[0094] Given the typicality scores for each sequence fragment g,
there is also a need for the automatic determination of a threshold
value: any sequence fragment g that falls below this threshold is
then considered to be a horizontal transfer. We illustrate the
process of automatically deciding the threshold using the
typicality scores for each of the genes contained in the genome of
A. pernix. The distribution of the scores f, sorted in
increasing order, is shown in FIG. 6. In the same Figure, the
derivative f' of the distribution is also shown, properly
smoothed by taking the average over sliding windows. It is observed
that the scores drop very fast for the first few genes, and then
the derivative almost drops to zero. The following threshold T is
defined based on the values of the derivatives:

$$T = \min_i f'(i) + 2\left[\mathrm{avg}_i f'(i) - \min_i f'(i)\right]$$
[0095] The smallest value of i for which the derivative drops below
threshold T represents the number of predicted putative horizontal
gene transfers for the genome at hand.
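One possible reading of this procedure is sketched below. Several details are assumptions of the sketch and are not stated in the text: the derivative is approximated by forward differences of the sorted scores, it is smoothed with a centered moving average of width `smooth`, indices are 0-based, and a count of zero is returned when the derivative never falls below T.

```python
def predicted_transfer_count(scores, smooth: int = 5) -> int:
    """Number of lowest-scoring fragments predicted to be transfers."""
    f = sorted(scores)
    if len(f) < 2:
        return 0
    deriv = [f[i + 1] - f[i] for i in range(len(f) - 1)]
    half = smooth // 2
    smoothed = []
    for i in range(len(deriv)):
        window = deriv[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    lowest = min(smoothed)
    average = sum(smoothed) / len(smoothed)
    threshold = lowest + 2 * (average - lowest)
    for i, d in enumerate(smoothed):
        if d < threshold:
            return i  # fragments f[0] .. f[i-1] precede the flat region
    return 0
```

For example, with three scores that are well separated from a dense cluster of fifty higher scores and no smoothing (`smooth=1`), the derivative first falls below T at index 3, so three fragments are predicted to be transfers.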
[0096] Experiments and Results
[0097] Below, we demonstrate the potential of using compositional
features to detect atypical sequence fragments through a very large
number of experiments where we created artificial horizontal gene
transfer events. The experimental procedure was as follows:
[0098] first, we created a pool of donor genes. Alternatively, one
could have formed the pool of donors by forming a collection of
variable size genomic fragments from a variety of genomes. The pool
of donor genes that we used comprised the gene complement of the 27
phages shown in Table 1 and contained a total of 1,485 donor
genes.
[0099] then, for each genome G under consideration, i.e. for each
reference sequence G, a group of D.sub.G genes was randomly
subselected, with replacement, from the donor pool and incorporated
into the host genome under consideration. The number D.sub.G of the
artificial transfer events was selected to be a fixed percentage of
the number of genes contained in G, the host genome. Each genome in
turn, from a collection of 123 fully-sequenced prokaryotic genomes
(archaea and bacteria), served as a host genome for these
experiments.
[0100] The purpose of this simulation was to recover as many as
possible of the D.sub.G genes that we artificially incorporated in
the genome G. In order to generate trustworthy estimates of the
achieved performance, we repeated the above experiment a total of
k=100 times for each one of the genomes G, and averaged the
results. Finally, and in an effort to gauge the performance of our
method for different amounts of horizontally transferred genes, we
repeated the above experiments for different "mixture" percentages,
i.e. for values of D.sub.G that ranged from 1% to 8% of the number
of genes originally contained in the genome G.
[0101] Clearly, the D.sub.G genes that were incorporated in genome
G during our simulation would be competing for the same positions
with those horizontally transferred genes, if any, that are already
present in G. Since all of the tested methods had to process the
same input sets, the testing conditions were consistent across all
used methods.
[0102] Each of the tested methods generated a typicality score for
each gene, using the tested method's compositional features. The
methods were then evaluated according to the percentage of genes
from the phage donor pool that occupied the D.sub.G-lowest
typicality scores. We refer to this percentage as the "hit
ratio."
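The hit ratio for a single experiment can be sketched as follows. The function name and the boolean list `is_donor`, which marks the artificially inserted genes, are hypothetical.

```python
def hit_ratio(scores, is_donor) -> float:
    """Fraction of the D_G lowest-scoring genes that came from the donor pool."""
    d = sum(is_donor)  # D_G, the number of inserted donor genes
    if d == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sum(is_donor[i] for i in order[:d]) / d
```

A method that ranks every inserted gene below every native gene attains a hit ratio of 1.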
[0103] Let m be a gene-scoring method, and r.sub.i.sup.m (G) be the
method's hit ratio as this is estimated from the i-th experiment,
i=1, 2, 3 . . . , k, with host genome G. We calculate the
performance of method m on genome G to be the average across the k
experiments:

$$r^m(G) = \frac{1}{k} \sum_{i=1}^{k} r_i^m(G)$$
[0104] With the help of r.sub.i.sup.m (G), we calculate the overall
performance of method m to be the average performance across the N
genomes G used in our experiments:

$$r^m = \frac{1}{N} \sum_G r^m(G)$$
[0105] We evaluated the above performance values for several
methods m. In Table 2, we show results for five methods: four of
these methods were proposed in earlier work, whereas the fifth
method is the Wn method of the present invention. As can be seen
from this Table, Wn outperforms all four other methods.
[0106] The first method whose performance is listed in Table 2 is
the Codon Adaptation Index (CAI); the latter is computed for each
gene and becomes the gene's score. The lower this score is, the
more atypical the gene is considered to be compared to the rest of
the genome. The CAI index for gene g in genome G is given by the
following formula:

$$\mathrm{CAI}(g) = \exp\left(\sum_i f_i \ln W_i\right)$$

[0107] where f.sub.i is the relative frequency of codon i
in the coding sequence, and W.sub.i the ratio of the frequency of
codon i to the frequency of the major codon for the same amino-acid
in the whole genome.
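A sketch of the CAI computation as defined here is given below. It assumes the standard genetic code, derives the weights from whole-genome codon counts supplied by the caller, and skips codons for which no weight is available; the function names and these handling choices are assumptions of the sketch.

```python
from math import exp, log

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def codon_weights(genome_codon_counts: dict) -> dict:
    """W_i: frequency of codon i relative to the major synonymous codon."""
    best = {}
    for codon, n in genome_codon_counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {
        codon: n / best[CODON_TABLE[codon]]
        for codon, n in genome_codon_counts.items()
        if n > 0
    }


def cai(coding_seq: str, weights: dict) -> float:
    """exp of the frequency-weighted sum of ln W_i over the gene's codons."""
    codons = [coding_seq[i:i + 3] for i in range(0, len(coding_seq) - 2, 3)]
    codons = [c for c in codons if c in weights]  # skip codons without a weight
    if not codons:
        return 0.0
    # Averaging ln W over codon occurrences equals summing f_i * ln W_i.
    return exp(sum(log(weights[c]) for c in codons) / len(codons))
```

A gene that uses only the major codon for every amino acid has a CAI of 1; use of rarer synonymous codons lowers the value.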
[0108] In the second method, the G+C content for each gene is
computed and compared against the G+C content of the genome using
the .chi..sup.2 test; the .chi..sup.2 value is negated in order to
yield the gene typicality score.
[0109] The third method, labeled 3/4 hereafter, is based on the
composition of the dinucleotides comprising the nucleotides in the
third codon position and the first position of the following codon.
Again the .chi..sup.2 test is used to compute the gene scores.
[0110] The remaining two methods, CODONS and W6 (i.e., the method
of the present invention) use correlation as a similarity metric
and templates of size 3 and 6 respectively as compositional
features. In the case of CODONS, only tri-nucleotides that
correspond to codons are taken into account. On the other hand, in
the case of W6, all 6-nucleotide templates are counted.
[0111] In Table 3, the performance of all 5 methods is shown in
absolute terms and for different percentages of
artificially transferred genes. In all cases, the W6 method, which
is the method of the present invention, outperforms the rest. FIG.
1 summarizes pictorially the results contained in Table 3. Table 4
shows the performance of all 5 methods in relative terms.
[0112] Of the traditional methods, CAI has the best performance.
But the method presented in the present invention outperforms CAI
and leads to a relative improvement over CAI which ranges from 11%
to 38%. Notably, the improvement of our method over the methods we
have tested increases as the % of added genes decreases; in other
words, Wn is particularly sensitive in those cases where the number
of horizontally transferred genes is small compared to the number
of genes comprising the genome G. FIG. 2 shows pictorially the
average relative improvement of the Wn method of the present
invention over CAI. For each organism, the relative increase in the
hit ratio is calculated as

$$R^m(G) = \frac{r^{W6}(G) - r^m(G)}{r^m(G)}$$
[0113] where m corresponds to the method that is used as a
reference each time. The average relative improvement is obtained
by taking the mean of the relative increase in the hit ratio across
all organisms:

$$R^m = \frac{1}{N} \sum_G R^m(G)$$
[0114] This last figure is a measure of how many additional
horizontal transfer events are detected when the Wn method of the
present invention is compared with method m. For example, in the
experiments with 2% added genes, the Wn method discovers an average
of 25% more artificially transferred genes than CAI, and about 66%
more than the G+C method.
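The two relative-improvement quantities can be sketched as follows. The function names are hypothetical, and the per-genome hit ratios are assumed to be supplied as dictionaries keyed by genome.

```python
def relative_improvement(r_w6: float, r_m: float) -> float:
    """Relative increase in hit ratio of W6 over method m for one genome."""
    return (r_w6 - r_m) / r_m


def average_relative_improvement(r_w6_by_genome: dict, r_m_by_genome: dict) -> float:
    """Mean of the per-genome relative improvements across all genomes."""
    values = [
        relative_improvement(r_w6_by_genome[g], r_m_by_genome[g])
        for g in r_m_by_genome
    ]
    return sum(values) / len(values)
```

With made-up hit ratios of 0.5 for W6 and 0.4 for a reference method on one genome, the relative improvement is 0.25, i.e. 25% more of the inserted genes are recovered.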
[0115] In FIG. 3, a detailed comparative analysis is shown of the
performance of the inventive method and the CAI method for the
experiments we carried out using a 2% mixture. For each organism,
the performance of the Wn method is marked with a dot, and the
performance of the CAI method with a "+". The curve shows the
relative improvement achieved by our method across organisms. For
visualization purposes, the 123 genomes are listed in order of
increasing relative improvement achieved by Wn. On the left side of
the graph, where the line assumes negative values, the CAI method
outperforms the Wn method, whereas on the right side the Wn method
greatly outperforms the CAI method. In fact, for the majority of
the organisms (91 versus 32) the Wn method of the present invention
provides significantly improved performance over the CAI method.
FIG. 4 shows the same kind of comparative analysis but this time
the Wn method is compared with the G+C method.
[0116] In additional experiments, we also studied the effect of
template size (i.e. the n in Wn) on the overall performance of the
Wn method. Using the above experimental protocol with 100
experiments per organism, we discovered that for template sizes
greater than 2, the optimal performance was achieved by a) ignoring
the periodicity of the genetic code and by b) choosing correlation
as the similarity metric for computing the final scores.
[0117] FIG. 5 shows how the optimal feature/metric combination
affects the gene detection performance as a function of the size n
of the template. Initially, as the template size increases
performance improves reaching a maximum value at sizes 6 through 8.
However, any further increases in the size of the template lead to
a degradation of performance. The choice of the template size to
use depends in turn on the size of the sequence fragment g. In
order to achieve high specificity, the largest possible template
size should be opted for, provided that the sequence fragment g is
sizeable enough to yield a substantial number of templates of the
chosen size. Large template sizes should also be used with the
sliding window variant and large windows. On the other hand, a
smaller size template should be used when individual gene transfers
are sought.
[0118] Following is a discussion of the exemplary inventive method
as optionally implemented in software and/or hardware. First, as
explained above and exemplarily shown in FIG. 7, the template of
the present invention is a word of characters having width n, where
n>=3, and where the characters are allowed to include a "." or
"don't care" symbol. Moreover, templates can be formed in
position-specific manner.
[0119] FIG. 8 illustrates an exemplary flowchart of a method 800
for determining the template counts and for calculating the
compositional feature vector .phi.(g) of a sequence fragment g. In
step 801, data for at least one template Wn is entered. As
explained above, more than one template can be entered so that a
set of templates .pi.={.pi..sub.1, .pi..sub.2, . . . , .pi..sub.q}
can be counted in parallel.
[0120] In step 802, data from the sequence fragment g is retrieved
from memory and, in step 803 the number of occurrences in g of each
template is determined. If the template occurrence data is to be
converted into the compositional feature vector format
.phi.(s)=(.alpha..sub.1, .alpha..sub.2, . . . , .alpha..sub.q),
discussed above, where .alpha..sub.i is the frequency of template
.pi..sub.i in g, then this data conversion is performed in step
804.
[0121] FIG. 9 illustrates an exemplary flowchart 900 that can be
used to compute the atypicality score of a
sequence fragment g. In step 901 template data are entered. In step
902, a sequence fragment g and a reference sequence G are entered.
In steps 903 and 904 the compositional feature vector .phi.(g) for
sequence fragment g and the compositional feature vector .phi.(G)
for reference sequence G are calculated, as discussed above for the
steps exemplarily shown in FIG. 8. If the reference sequence G is a
collection of n.sub.G distinct segments, then the compositional
feature vector .phi.(G) can be calculated as the average of the
compositional feature vectors of its segments.
[0122] In step 905, a "comparison" function is applied to determine
for the sequence fragment g its atypicality with respect to the
reference sequence G. As mentioned above, a number of similarity
measures can be used and the present invention is not intended to
be limited in the specific measure to be used. Thus, for example,
this step 905 can be executed using any of a number of schemes
including the classic Pearson correlation, standard .chi..sup.2
values, Mahalanobis distance, the Kullback-Leibler distance,
etc.
[0123] In step 906, the typicality score for g is compared with the
automatically derived threshold value, such as discussed above.
Finally, in step 907, if the typicality score for g is below
threshold, then the fragment g will be reported as atypical.
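The flow of steps 901 through 907 can be sketched end to end as follows. This is an illustration under several assumptions, not the implementation of the invention: the templates are all gap-free words of width n over the alphabet ACGT, counted at every position; the reference vector is the average over the supplied segments; correlation is the similarity measure; and the threshold is passed in by the caller rather than derived automatically. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def feature_vector(seq: str, n: int):
    """Frequencies of all width-n words over ALPHABET (steps 903 and 904)."""
    windows = max(len(seq) - n + 1, 0)
    counts = Counter(seq[i:i + n] for i in range(windows))
    return [
        counts["".join(w)] / windows if windows else 0.0
        for w in product(ALPHABET, repeat=n)
    ]


def average_vector(vectors):
    """Component-wise mean, used when G is a collection of segments."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def pearson(x, y) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def is_atypical(fragment: str, reference_segments, n: int, threshold: float):
    """Return (typicality score, True if the score is below threshold)."""
    phi_g = feature_vector(fragment, n)
    phi_G = average_vector([feature_vector(s, n) for s in reference_segments])
    score = pearson(phi_g, phi_G)           # step 905
    return score, score < threshold         # steps 906 and 907
```

With a toy reference made of alternating A and C letters, a fragment of the same kind scores close to 1 and is reported as typical, whereas a run of G letters receives a negative score and is reported as atypical.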
[0124] FIG. 10 illustrates an exemplary block diagram 1000 of a
system of software modules for a computer implementation of the
present invention. This system would include a graphical user
interface 1001 to allow a user to enter template data and provide
other system controls. Memory interface module 1002 allows data
from a database to be entered for evaluation and results to be
stored throughout the execution. Feature vector calculator module
1003 executes the process described in FIG. 8, and typicality
score module 1004, threshold value module 1005, and LGT calculator
module 1006 execute the equations discussed above for their
respective values.
[0125] FIG. 11 illustrates a typical hardware configuration of an
information handling/computer system in accordance with the
invention and which preferably has at least one processor or
central processing unit (CPU) 1111.
[0126] The CPUs 1111 are interconnected via a system bus 1112 to a
random access memory (RAM) 1114, read-only memory (ROM) 1116,
input/output (I/O) adapter 1118 (for connecting peripheral devices
such as disk units 1121 and tape drives 1140 to the bus 1112), user
interface adapter 1122 (for connecting a keyboard 1124, mouse 1126,
speaker 1128, microphone 1132, and/or other user interface device
to the bus 1112), a communication adapter 1134 for connecting an
information handling system to a data processing network, the
Internet, an Intranet, a personal area network (PAN), etc., and a
display adapter 1136 for connecting the bus 1112 to a display
device 1138 and/or printer 1139 (e.g., a digital printer or the
like).
[0127] In addition to the hardware/software environment described
above, a different aspect of the invention includes a
computer-implemented method for performing the above method. As an
example, this method may be implemented in the particular
environment discussed above.
[0128] Such a method may be implemented, for example, by operating
a computer, as embodied by a digital data processing apparatus, to
execute a sequence of machine-readable instructions. These
instructions may reside in various types of signal-bearing
media.
[0129] Thus, this aspect of the present invention is directed to a
programmed product, comprising signal-bearing media tangibly
embodying a program of machine-readable instructions executable by
a digital data processor incorporating the CPU 1111 and hardware
above, to perform the method of the invention.
[0130] This signal-bearing media may include, for example, a RAM
contained within the CPU 1111, as represented by the fast-access
storage. Alternatively, the instructions may be contained in
another signal-bearing medium, such as a magnetic data storage
diskette 1200 (FIG. 12), directly or indirectly accessible by the
CPU 1111.
[0131] Whether contained in the diskette 1200, the computer/CPU
1111, or elsewhere, the instructions may be stored on a variety of
machine-readable data storage media, such as DASD storage (e.g., a
conventional "hard drive" or a RAID array), magnetic tape,
electronic read-only memory (e.g., ROM, EPROM, or EEPROM), an
optical storage device (e.g. CD-ROM, WORM, DVD, digital optical
tape, etc.), paper "punch" cards, or other suitable signal-bearing
media including transmission media such as digital and analog
communication links and wireless links. In an illustrative embodiment of
the invention, the machine-readable instructions may comprise
software object code.
[0132] Thus, the present invention provides a new method Wn for
atypicality detection that is based on generalized compositional
features. Given a reference sequence G and a sequence fragment g,
the method assigns the latter a score that reflects its similarity
to the reference sequence G. For the specific instance of the
invention where the involved sequences were genetic sequences, it
was found that CAI and W6 scores yield the best performance, with
W6 outperforming CAI. W6 is the exemplary method of the present
invention. The performance of various methods including CAI and W6
was measured through a series of experiments in which a random set
of genes from phages was added to each of a number of
fully-sequenced prokaryotic genomes in turn, thus creating an
artificial organism, with the objective of recovering as many of
the added genes as possible in the lowest scoring positions
without any a priori knowledge.
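By way of illustration only, the recovery experiment summarized
above may be sketched as follows. The function name recovery_rate,
the dictionary of per-gene scores, and the default cutoff k (taken
here to equal the number of added genes) are assumptions made for
this sketch and are not taken from the specification; the sketch
merely ranks genes by ascending score and reports the fraction of
added genes found among the k lowest scoring positions.

```python
def recovery_rate(scores, added, k=None):
    """Fraction of added genes found among the k lowest-scoring genes.

    scores: dict mapping gene id -> typicality score (lower = more atypical)
    added:  iterable of gene ids that were artificially inserted
    k:      number of lowest-scoring positions to inspect
            (assumption: defaults to the number of added genes)
    """
    added = set(added)
    if not added:
        return 0.0
    if k is None:
        k = len(added)
    # Sort gene ids by ascending score so the most atypical come first.
    ranked = sorted(scores, key=scores.get)
    lowest = set(ranked[:k])
    return len(lowest & added) / len(added)


if __name__ == "__main__":
    # Hypothetical scores: "h*" are host genes, "p*" are added phage genes.
    demo = {"h1": 0.90, "h2": 0.80, "h3": 0.70, "p1": 0.10, "p2": 0.85}
    print(recovery_rate(demo, {"p1", "p2"}))
```

In this hypothetical example only one of the two added genes falls
within the two lowest scoring positions, so half of the added genes
are recovered.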
[0133] While the invention has been described in terms of a single
preferred embodiment, those skilled in the art will recognize that
the invention can be practiced with modification within the spirit
and scope of the appended claims.
[0134] Further, it is noted that Applicants' intent is to
encompass equivalents of all claim elements, even if amended later
during prosecution.
TABLE 1 List of phages

Phage                                                GenBank ID  Genes
Streptococcus thermophilus bacteriophage Sfi21       NC_000872      50
Coliphage alpha3                                     NC_001330      10
Mycobacterium phage L5                               NC_001335      85
Haemophilus phage HP1                                NC_001697      42
Methanobacterium phage psiM2                         NC_001902      32
Mycoplasma arthritidis bacteriophage MAV1            NC_001942      15
Chlamydia phage 2 virion                             NC_002194       8
Methanothermobacter wolfeii prophage psiM100         NC_002628      35
Bacillus phage GA-1 virion                           NC_002649      35
Lactococcus lactis bacteriophage TP901-1             NC_002747      56
Streptococcus pneumoniae bacteriophage MM1 provirus  NC_003050      53
Sulfolobus islandicus filamentous virus              NC_003214      72
Bacteriophage PSA                                    NC_003291      59
Halovirus HF2                                        NC_003345     114
Cyanophage P60                                       NC_003390      80
Lactobacillus casei bacteriophage A2 virion          NC_004112      61
Vibrio cholerae O139 fs1 phage                       NC_004306      15
Salmonella typhimurium phage ST64B                   NC_004313      56
Pseudomonas aeruginosa phage PaP3                    NC_004466      71
Streptococcus pyogenes phage 315.4 provirus          NC_004587      64
Staphylococcus aureus phage phi 13 provirus          NC_004617      49
Yersinia pestis phage phiA1122                       NC_004777      50
Xanthomonas oryzae bacteriophage Xp10                NC_004902      60
Enterobacteria phage RB69                            NC_004928     179
Burkholderia cepacia phage BcepNazgul                NC_005091      75
Ralstonia phage p12J virion                          NC_005131      10
Bordetella phage BPP-1                               NC_005357      49
[0135]
TABLE 2 Gene scoring methods

Name    Width  Step  Metric       Description
CG      1      1     χ²           G + C content
3/4     2      3     χ²           Dinucleotide composition of codon positions 3 and 1
CODONS  3      3     χ²           Codon composition
CAI     3      3     N/A          Codon Adaptation Index
W6      6      1     correlation  6-nucleotide composition
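
By way of illustration only, the width and step parameters of
Table 2 may be read as a sliding template over the sequence, with
the W6 row corresponding to a width of 6 and a step of 1. The
following is a minimal sketch under stated assumptions and is not
the implementation of the invention: the function names
composition_vector, pearson, and w6_score are hypothetical; counts
are normalized to relative frequencies over the alphabet A, C, G,
T on a single strand; and the Pearson coefficient is assumed for
the "correlation" metric. The equations discussed above govern
where they differ.

```python
from collections import Counter
from itertools import product
from math import sqrt


def composition_vector(seq, width, step):
    """Relative frequency of every `width`-mer, sampled every `step` bases.

    Assumption: single strand, alphabet ACGT; windows containing any
    other character (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=width)]
    counts = Counter(
        seq[i:i + width] for i in range(0, len(seq) - width + 1, step)
    )
    total = sum(counts[k] for k in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[k] / total for k in kmers]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(vx * vy)


def w6_score(fragment, reference):
    """Sketch of a W6-style score: width 6, step 1, correlation metric."""
    return pearson(
        composition_vector(fragment, 6, 1),
        composition_vector(reference, 6, 1),
    )
```

Under the same assumptions, the CODONS row of Table 2 would
correspond to composition_vector(seq, 3, 3), with a different
metric applied to the resulting vectors.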
[0136]
TABLE 3 Performance results

% LGT  CG      3/4     CODONS  CAI     W6
1%     38.66%  36.89%  27.79%  43.90%  50.28%
2%     44.40%  42.99%  34.48%  49.61%  55.59%
4%     50.33%  49.43%  41.68%  55.30%  60.58%
8%     56.38%  56.21%  49.66%  61.02%  65.33%
[0137]
TABLE 4 Absolute improvement of the new method, W6, over the rest

% LGT  W6 vs. CG  W6 vs. 3/4  W6 vs. CODONS  W6 vs. CAI
1%     11.62%     13.39%      22.49%         6.38%
2%     11.19%     12.60%      21.11%         5.98%
4%     10.25%     11.14%      18.90%         5.28%
8%     8.95%      9.12%       15.67%         4.31%
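
By way of illustration only, the entries of Table 4 may be checked
against Table 3 by subtracting each method's value from the W6
value in the same row. The sketch below copies the values of
Tables 3 and 4; the names TABLE3, TABLE4, absolute_improvement,
and find_mismatches are hypothetical. Decimal arithmetic is used
so that the two-decimal values are compared exactly.

```python
from decimal import Decimal as D

METHODS = ["CG", "3/4", "CODONS", "CAI", "W6"]

# Values copied from Table 3 (percent), in the order given by METHODS.
TABLE3 = {
    "1%": ["38.66", "36.89", "27.79", "43.90", "50.28"],
    "2%": ["44.40", "42.99", "34.48", "49.61", "55.59"],
    "4%": ["50.33", "49.43", "41.68", "55.30", "60.58"],
    "8%": ["56.38", "56.21", "49.66", "61.02", "65.33"],
}

# Values copied from Table 4 (W6 vs. CG, 3/4, CODONS, CAI).
TABLE4 = {
    "1%": ["11.62", "13.39", "22.49", "6.38"],
    "2%": ["11.19", "12.60", "21.11", "5.98"],
    "4%": ["10.25", "11.14", "18.90", "5.28"],
    "8%": ["8.95", "9.12", "15.67", "4.31"],
}


def absolute_improvement(row):
    """W6 value minus each other method's value, in percentage points."""
    values = [D(v) for v in row]
    w6 = values[-1]
    return [w6 - v for v in values[:-1]]


def find_mismatches():
    """List (level, method, computed, listed) where Table 4 differs."""
    out = []
    for level, row in TABLE3.items():
        computed = absolute_improvement(row)
        for method, got, listed in zip(METHODS, computed, TABLE4[level]):
            if got != D(listed):
                out.append((level, method, str(got), listed))
    return out


if __name__ == "__main__":
    print(find_mismatches())
```

Every entry of Table 4 is reproduced except the 4% value for W6
vs. 3/4, where the difference of the Table 3 values is 11.15
rather than the listed 11.14; this may reflect rounding of values
that were computed before the Table 3 entries were rounded.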
[0138]
TABLE 5 Relative improvement of the new method, W6, over the rest

% LGT  W6 vs. CG  W6 vs. 3/4  W6 vs. CODONS  W6 vs. CAI
1%     170.36%    85.99%      206.05%        38.25%
2%     66.40%     57.82%      124.86%        25.42%
4%     30.81%     34.75%      75.79%         16.92%
8%     18.58%     20.92%      44.35%         11.20%
* * * * *