U.S. patent number 6,907,393 [Application Number 10/110,242] was granted by the patent office on 2005-06-14 for methods for screening candidates for artificial promotors.
This patent grant is currently assigned to Toagosei Co., Ltd. Invention is credited to Tetsuhiko Yoshida.
United States Patent 6,907,393
Yoshida
June 14, 2005
Methods for screening candidates for artificial promotors
Abstract
The present invention provides methods for selecting artificial
promotor candidates that reduce the number of promotors required to
be evaluated by actual experimentation when designing artificial
promotor candidates that are potentially effective for structural
genes. In these artificial promotor candidate selection methods, a
randomly selected nucleotide sequence is joined upstream of a
structural gene (S1). Next, nucleotides are extracted in a
predetermined pattern from a designated region containing at least
the transcription start site of the joined nucleotide sequence,
thereby generating a virtual amino acid sequence, and an index
curve is generated from the virtual amino acid sequence (S2, S3).
Then, the nucleotide sequence joined upstream of the structural
gene is selected as an artificial promotor candidate if the index
curve exhibits a sign change near the transcription start site
(S4).
Inventors: Yoshida; Tetsuhiko (Nagoya, JP)
Assignee: Toagosei Co., Ltd. (Tokyo, JP)
Family ID: 17767047
Appl. No.: 10/110,242
Filed: May 7, 2002
PCT Filed: October 12, 2000
PCT No.: PCT/JP00/07105
371(c)(1),(2),(4) Date: May 07, 2002
PCT Pub. No.: WO01/27259
PCT Pub. Date: April 19, 2001
Foreign Application Priority Data
Oct 13, 1999 [JP] 11-291295
Current U.S. Class: 703/11; 365/94; 700/1; 702/20
Current CPC Class: C12N 15/1089 (20130101)
Current International Class: C12N 15/10 (20060101); G06G 007/60 ();
G06F 019/00 (); G05B 015/00 (); G11C 017/00 ()
Field of Search: 703/11; 702/20; 700/1; 365/94
Other References
Tomii et al., "Analysis of Amino Acid Indices and Mutation Matrices
for Sequence Comparison and Structure Prediction of Proteins,"
Protein Engineering, vol. 9, No. 1, pp. 27-36 (1996).
Kawashima et al., "AAindex: Amino Acid Index Database," Nucleic
Acids Research, vol. 27, No. 1, pp. 368-369 (Jan. 1, 1999).
Von Heijne et al., "Trans-membrane Translocation of Proteins: The
Direct Transfer Model," European Journal of Biochemistry, vol. 97,
No. 1, pp. 175-181 (1979).
Fickett et al., "Eukaryotic Promoter Recognition," Genome Research,
vol. 7, No. 9, pp. 861-878 (Sep. 1997).
Pedersen et al., "The Biology of Eukaryotic Promoter Prediction-a
Review," Computers & Chemistry, vol. 23, No. 3-4, pp. 191-207
(Jun. 15, 1999).
Qing K. Chen et al., "PromFD 1.0: a computer program that predicts
eukaryotic pol II promoters using strings and IMD matrices," CABIOS
1997, vol. 13, No. 1, pp. 29-35.
Petra Sazelova et al., "In vitro and in vivo transcription from a
computer predicted promoter," 1997, vol. 187, No. 2, pp.
281-287.
Primary Examiner: Brusca; John S.
Attorney, Agent or Firm: Oliff & Berridge, PLC
Claims
What is claimed is:
1. A method for selecting an artificial promotor candidate,
comprising: generating a first virtual amino acid sequence from a
candidate nucleotide sequence and generating an amino acid index
curve from the first virtual amino acid sequence, generating a
second virtual amino acid sequence from a nucleotide sequence near
a transcription start site of a transcription region of a gene and
generating an amino acid index curve from the second virtual amino
acid sequence, and selecting the candidate nucleotide sequence as
an artificial promotor candidate if the amino acid index curve of
the candidate nucleotide sequence has a sign opposite of the amino
acid index curve of the nucleotide sequence near the transcription
start site of the transcription region.
2. A method for selecting an artificial promotor candidate,
comprising the steps of: joining nucleotide sequence data of an
artificial promotor to an upstream side of nucleotide sequence data
of a transcription region of a gene; and selecting the artificial
promotor as a promotor candidate for the transcription region when
an amino acid index curve based on a virtual amino acid sequence,
which is obtained from the joined nucleotide sequence data,
exhibits a sign change near the transcription start site of the
transcription region.
3. A method as claimed in claim 2, wherein the virtual amino acid
sequence, which is obtained from the joined nucleotide sequence
data, is generated based on amino acid units formed from units of
three nucleotides each extracted from the joined nucleotide
sequence data while shifting one nucleotide for each
extraction.
4. A method for selecting a potentially effective artificial
promotor candidate, comprising the steps of: generating joined
nucleotide sequence data by joining nucleotide sequence data
representing a candidate artificial promotor to an upstream side of
nucleotide sequence data representing a transcription region of a
gene; generating a first progression from index values of
respective amino acids found in a virtual amino acid sequence
formed from units of three nucleotides each extracted from the
joined nucleotide sequence data starting at the beginning of the
joined nucleotide sequence data and shifting one nucleotide for
each extraction; smoothing the first progression, thereby obtaining
a second progression; and selecting the candidate artificial
promotor as a potentially effective artificial promotor candidate
if the sign of the second progression changes near the position of the
transcription start site of the transcription region.
5. An artificial promotor candidate selection device comprising:
means for generating an amino acid index curve based upon index
values of respective amino acids in a virtual amino acid sequence,
which is obtained from nucleotide sequence data generated by
joining nucleotide sequence data of an artificial promotor to an
upstream side of nucleotide sequence data of a transcription region
of a gene; and means for selecting the artificial promotor as a
promotor candidate for the transcription region when the sign of
the amino acid index curve changes near the transcription start
site of the transcription region.
6. An artificial promotor candidate selection device comprising:
means for generating joined nucleotide sequence data by joining
nucleotide sequence data of a candidate artificial promotor to an
upstream side of nucleotide sequence data of a transcription region
of a gene; means for generating a first progression from index
values of respective amino acids in a virtual amino acid sequence
formed from units of three nucleotides each extracted from the
joined nucleotide sequence data starting at the beginning of the
joined nucleotide sequence data, which is generated by the joined
nucleotide sequence data generating means, and shifting one
nucleotide for each extraction; means for generating a second
progression by smoothing the first progression, which is generated
by the first progression generating means; and means for selecting
the candidate artificial promotor as a potentially effective
artificial promotor if the sign of the second progression, which is
obtained by the second progression generating means, changes near
the transcription start site of the transcription region.
7. A computer-readable storage medium storing a program for
selecting an artificial promotor as a promotor candidate for a
transcription region of a gene by joining nucleotide sequence data
of the artificial promotor upstream of nucleotide sequence data of
the transcription region when an amino acid index curve based on a
virtual amino acid sequence, which is obtained from the joined
nucleotide sequence data, exhibits a sign change near the
transcription start site of the transcription region.
8. A computer-readable storage medium storing a program comprising
the steps of: generating joined nucleotide sequence data by joining
nucleotide sequence data of a candidate artificial promotor to an
upstream side of nucleotide sequence data of a transcription region
of a gene in order to identify potentially effective artificial
promotor candidates for the transcription region; generating a
first progression from index values of respective amino acids in a
virtual amino acid sequence formed from units of three nucleotides
each extracted from the joined nucleotide sequence data starting at
the beginning of the joined nucleotide sequence data and shifting
one nucleotide for each extraction; smoothing the first
progression, thereby generating a second progression; and selecting
the candidate artificial promotor as a potentially effective
artificial promotor if the sign of the second progression changes
near the transcription start site of the transcription region.
Description
FIELD OF THE INVENTION
The present invention relates to artificial promotor candidate
selection technology and more particularly, to artificial promotor
selection technology for selecting potentially effective artificial
promotor candidates from among a set of virtually-generated
promotor candidates.
BACKGROUND ART
Progress in the biological sciences has enabled the analysis of the
chromosomal structure at the molecular level. This analysis has
determined that DNA (deoxyribonucleic acid) molecules in
chromosomes are continuous, singular, thread-like strands. Four
types of nucleotides, i.e., adenine (a), guanine (g), cytosine (c),
and thymine (t), are present in DNA, and the DNA molecules in
chromosomes are composed of combinations of these nucleotides,
together with sugars and phosphoric acids that are bound thereto.
The DNA molecules in chromosomes encode a variety of information by
virtue of differences in nucleotide sequences.
Through the analysis of DNA, it has been determined that portions
of the DNA material are structural units that carry genetic
information and determine genotype, i.e., genes. Genes that
determine the primary structure of products such as proteins, tRNA,
and rRNA are specifically defined as structural genes. The
gene-based mechanism of protein synthesis will be explained with
reference to FIG. 18.
As shown in FIG. 18, a gene is divided into three regions: a
promotor region, a transcription region (the structural gene region
containing information relating to the amino acid sequence of the
protein) and a termination region. The promotor region serves to
control the start of transcription, the transcription region is the
region that is actually transcribed, and the termination region
controls the termination of transcription. The site where
transcription begins is called the transcription start site.
Protein synthesis from genes, which are structured as described
above, occurs according to the following steps. First, RNA
polymerase (an enzyme that transcribes a DNA sequence) binds to a
DNA sequence slightly ahead of the promotor region and then moves
toward the promotor region. Once this RNA polymerase has passed the
promotor region and moved to the transcription-region side of the
gene, it generates messenger RNA that corresponds to the DNA
sequence in the transcription region. Further, when the RNA
polymerase reaches the termination region, the transcription of the
DNA sequence ceases. The messenger RNA subsequently moves from the
cell nucleus to the cytoplasm and binds with ribosomes. The
ribosomes synthesize proteins based upon the messenger RNA. The DNA
sequence of a gene is transcribed in this manner, and a specific
protein is synthesized based upon this transcription.
The rate of such protein synthesis depends upon the rate at which
RNA polymerase transcribes messenger RNA, and the promotor sequence
of the gene controls the rate at which messenger RNA is
transcribed.
Therefore, it is theoretically possible to create promotors that
synthesize proteins at a high rate and promotors that synthesize
proteins at a low rate by manipulating the DNA sequence of the
promotor region.
In recent years, there have been many attempts to actively control
the protein synthesis rate by artificially altering the DNA
sequence of promotor regions. By controlling the rate of protein
synthesis, it would be possible to create, e.g., promoters having a
high transcription activity for use in artificially expressing
specific proteins in large quantities. As a result, gene therapy
would be enabled at the molecular level in which, e.g., cancer
cells are infected with an adequate amount of a virus (in gene
therapy, viruses of reduced toxicity are used as virus vectors to
transport normal genes into cells) and proteins are then
specifically expressed from these normal genes (integrated with
powerful artificial promoters) in amounts that are adequate to
suppress the cancer.
However, at present, a method for designing artificial promotors
having high transcription activity, i.e., artificial promotors
containing a DNA sequence that enables the above-noted
characteristics, has not yet been established. Currently, randomly
sequenced nucleotides (promotor candidates) are placed ahead of
structural genes (test genes), and the efficacy of these promotor
candidates is evaluated by determining whether or not the test gene
is expressed.
According to this method, it is extremely difficult to identify
nucleotide sequences suitable as potentially highly effective
artificial promoters from a virtually infinite number of nucleotide
sequences (the number grows exponentially, as a power of 4, with
sequence length) because the artificial promotor candidates used in
the associated experimentation have randomly determined nucleotide
sequences.
Therefore, it is an object of the present invention to provide
artificial promotor candidate selection methods capable of reducing
the number of nucleotide sequences that are actually tested by
selecting, in advance of testing, nucleotide sequences that are
highly likely to function as potentially effective promotors from
among a set of nucleotide sequences under consideration.
DISCLOSURE OF THE INVENTION
As a result of research on the characteristics of nucleotide
sequences of known promotors, virtual amino acid sequences were
created by extracting amino acid information from nucleotide
sequences of the promotors and structural genes using certain
patterns; then, curves were generated using index values
corresponding to the respective amino acids in the virtual amino
acid sequences. The invention was conceived upon discovery that a
sign change is very often present in the curve near the
transcription start site.
According to the artificial promotor candidate selection methods of
the present invention, artificial promotor candidates are selected
based upon DNA nucleotide sequences, in which the corresponding
virtual amino acid sequence exhibits a sign change in the amino
acid index curve at a point near the transcription start site of
the transcription region of the structural gene.
The artificial promotor candidate selection methods of the present
invention are based upon the newly-discovered fact that a sign
change usually exists near the transcription start site of a given
nucleotide sequence, which contains a promotor and a structural
gene, between the amino acid index curve for the promotor and the
amino acid index curve for the structural gene. For example, a
curve is first determined based upon the structural gene using
the amino acid index values of a virtual amino acid sequence near
the transcription start site; then, a nucleotide sequence for an
artificial promotor candidate can be selected such that the virtual
amino acid sequence for the promotor region will be opposite in
sign near the transcription start site relative to the curve of the
structural gene. In addition, a pre-determined nucleotide sequence
may be virtually joined upstream of a structural gene, and then the
efficacy of the pre-determined nucleotide sequence can be evaluated
by determining whether or not the index curve of the virtual amino
acid sequence, which is generated from the joined nucleotide
sequence, exhibits a sign change near the transcription start
site.
In another artificial promotor candidate selection method of the
present invention, a candidate nucleotide sequence of an artificial
promotor is virtually joined upstream of a nucleotide sequence of a
structural gene; then, if the amino acid index curve of the virtual
amino acid sequence, which is generated based upon the joined
nucleotide sequences, exhibits a sign change near the transcription
start site, the nucleotide sequence of an artificial promotor will
be selected as a promotor candidate for the structural gene.
In other words, when artificial promotor candidates are being
considered for selection as being potentially effective for a
structural gene having a known nucleotide sequence, a candidate
nucleotide sequence is joined upstream of the structural gene. If
the index curve based on a virtual amino acid sequence, which is
extracted in a pre-determined pattern from the joined nucleotide
sequence in the region that contains at least the transcription
start site of the joined nucleotide sequence, exhibits a sign
change near the transcription start site, the candidate nucleotide
sequence is selected as an artificial promotor candidate.
The methods for extracting a virtual amino acid sequence from a
given nucleotide sequence and the methods for expressing the index
curve, which index curve is based upon the virtual amino acid
sequence, may be appropriately selected in order to make the
changes at the transcription start site easier to observe.
In one example of a method for extracting a virtual amino acid
sequence from a given nucleotide sequence, a virtual amino acid
sequence can be generated from the nucleotide sequence by selecting
each respective amino acid based upon units of three nucleotides
and by moving along the nucleotide sequence one nucleotide at a
time in order to select successive respective amino acids.
However, in addition to the above-mentioned method, in which units
of three nucleotides are extracted from the nucleotide sequence
while moving along the sequence one nucleotide at a time in order
to generate virtual amino acid sequences, a variety of methods can
be utilized. For instance, three connected
and neighboring nucleotides could be extracted from a nucleotide
sequence, three nucleotides could be extracted while omitting every
other nucleotide, or three nucleotides could be extracted while
omitting two nucleotides out of each three nucleotides.
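The extraction patterns described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name extract_triplets and its parameters gap and shift are hypothetical, and reading "omitting every other nucleotide" as gap=1 and "omitting two nucleotides out of each three" as gap=2 is an assumption about wording the text leaves open.

```python
def extract_triplets(seq, gap=0, shift=1):
    """Return units of three nucleotides taken from seq.

    gap   -- nucleotides skipped between the picked ones (0 = adjacent);
             hypothetical parameter, an interpretation of the patterns above
    shift -- how far the start position moves between extractions
    """
    span = 3 + 2 * gap  # number of positions one unit stretches over
    units = []
    for start in range(0, len(seq) - span + 1, shift):
        # Pick every (gap + 1)-th nucleotide inside the span: exactly three.
        units.append(seq[start:start + span:gap + 1])
    return units


print(extract_triplets("gatctc"))           # sliding window, shift of one
print(extract_triplets("gatctc", shift=3))  # non-overlapping triplets
print(extract_triplets("gatctc", gap=1))    # skipping every other nucleotide
```

Keeping gap and shift as separate parameters lets one function cover the sliding pattern used in the embodiments as well as the alternatives, so the later steps do not depend on which pattern was chosen.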
Another artificial promotor candidate selection method of the
present invention includes the following steps: a candidate
nucleotide sequence for an artificial promotor is joined to the
upstream end of a nucleotide sequence of a structural gene; units
of three nucleotides are extracted from the nucleotide sequence in
which the artificial promotor is present, starting from the
beginning of the nucleotide sequence and then shifting down the
nucleotide sequence one
nucleotide per extraction; a first progression is obtained from the
index values of the respective amino acids in the virtual amino
acid sequence thereby formed; the first progression is smoothed to
obtain a second progression; and the artificial promotor is
selected as a potentially effective artificial promotor candidate
if a sign change occurs near the transcription start site within
the second progression.
In this artificial promotor candidate selection method, a
nucleotide sequence is formed by joining the nucleotide sequence of
a virtually designed artificial promotor upstream to the nucleotide
sequence for a structural gene. Specifically, the nucleotide
sequence of an artificial promotor is joined upstream of the
nucleotide sequence for a structural gene that is targeted for
transcription.
As was explained above, a large number of nucleotide sequences may
be evaluated as potential artificial promoters. Therefore, the
computation task is preferably performed using a computer. A
computer can be used to perform the task of automatically
generating nucleotide sequences of potentially effective
artificial promoters. Nucleotide sequences of artificial promotors,
which are expected to have a certain degree of effect according to
the computer, may then be manually modified by a researcher. For
example, artificial promotor nucleotide sequences are sometimes
modified by changing a portion of a known artificial promotor
nucleotide sequence.
Next, using a selected method, units of three nucleotides per
extraction are extracted from the nucleotide sequence of the
artificial promotor starting from the beginning of the nucleotide
sequence in order to generate a virtual amino acid sequence. The
index values of the respective amino acids from the virtual amino
acid sequence are determined, and a first progression comprising
the amino acid index values is generated.
Herein, the term "amino acid index value" (e.g., transfer free
energy to lipophilic phase, von Heijne-Blomberg, 1979) refers to a
numerical value that was determined based upon measurements of the
physical properties of amino acids; presently, 434 types of
measurements are covered, and the amino acid index values are
available on the Internet. Amino acid index values are utilized
herein because they reflect certain higher-level information that
is not ascertainable simply by analyzing nucleotide sequences.
Next, the first progression is smoothed to produce a second
progression. This smoothing may be accomplished, e.g., by summing
and averaging a certain number (e.g., 3-6) of values following the
value in question, summing and averaging a certain number (e.g.,
3-6) of values preceding the value in question, or summing and
averaging a certain number (e.g., 3-6) of sequential values before
and after the value in question. In each of the above methods,
although an average value is determined by dividing the sum of the
added values by the number of values added, the sum of the values
itself also could be used. The associated graph, as will be
explained below in a plotting step, can realize an identical
pattern by adjusting the scale of the graph. Smoothing is not
required to be performed according to the above methods. Further, a
variety of mathematical methods may be utilized to smooth the data.
Smoothing may be performed, for example, according to a moving
average. Any method is acceptable if smoothing is provided to an
extent such that the pattern (variation) of a graph of a smoothed
progression can be recognized.
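The three window placements and the sum-versus-average choice described above can be sketched in Python as follows. The function name smooth, the mode names, and the choice to count the value in question as part of its own window are assumptions made for illustration; the text does not fix these details.

```python
def smooth(values, window=3, mode="following", average=True):
    """Smooth a progression with a fixed-size window.

    Returns a dict mapping the position of "the value in question" to the
    smoothed value. Only positions with a complete window are included.
    mode -- "following": the value and the ones after it
            "preceding": the value and the ones before it
            "centered":  values before and after (use an odd window)
    """
    leads = {"following": 0, "preceding": window - 1, "centered": window // 2}
    if mode not in leads:
        raise ValueError("unknown mode: " + mode)
    lead = leads[mode]
    result = {}
    for start in range(len(values) - window + 1):
        total = sum(values[start:start + window])
        # The sum and the average differ only by the constant factor
        # `window`, so the sign pattern of the curve is the same either way.
        result[start + lead] = total / window if average else total
    return result


print(smooth([1.0, 2.0, 3.0, 4.0]))
print(smooth([1.0, 2.0, 3.0, 4.0], mode="centered"))
```

In this sketch the three modes produce the same window values and differ only in the position to which each value is assigned, which shifts the resulting curve along the horizontal axis by a few positions.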
Then, a determination is made as to whether or not the sign changes
near the transcription start site in the second progression; if the
sign does change, the artificial promotor is selected as a
potentially effective artificial promotor candidate.
When the determination is made as to whether or not the sign
changes near the transcription start site in the second
progression, the progression and its corresponding values can be
represented graphically in the form of an index curve, thereby
simplifying the determination by highlighting the sign inversion
(reversal) trends in the index curve near the transcription start
site.
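The determination described above can also be made programmatically. The Python sketch below is illustrative only: the text does not quantify "near", so the tolerance parameter and the function names are hypothetical, and treating a value of exactly zero as having no sign is an assumption.

```python
def sign_change_positions(progression):
    """Indices i at which progression[i - 1] and progression[i] differ in sign.

    A value of exactly zero is treated as having no sign, so a step into or
    out of zero is not counted (an assumption of this sketch).
    """
    return [i for i in range(1, len(progression))
            if progression[i - 1] * progression[i] < 0]


def changes_sign_near(progression, tss_index, tolerance=5):
    """True if a sign change lies within `tolerance` positions of the
    transcription start site. `tolerance` is a hypothetical parameter."""
    return any(abs(i - tss_index) <= tolerance
               for i in sign_change_positions(progression))


print(sign_change_positions([1.12, -12.55, -6.96]))
```

A candidate would then be kept when changes_sign_near returns True for the smoothed second progression of the joined sequence.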
Using the above-mentioned artificial promotor candidate selection
method of the present invention, artificial promotor candidates
that are very likely to function as promoters are selected from a
large number of virtually-created, artificial promoters; therefore,
the number of promotors that must be actually tested in order to
evaluate their effect can be reduced.
As shown in the lower half of FIG. 1, prior-art promotor design
methods comprised the following steps: promotor candidates are
synthesized so as to have random nucleotide sequences (S1), the
promotor candidates are joined with a structural gene (gene for
testing) (S2), the resulting product is introduced into a cell and
the effectiveness of the promotor candidate is evaluated by
checking for expression of the structural gene. According to the
present invention, on the other hand, as illustrated in the top
half of FIG. 1, promotor candidates are first selected in step S1
by using the above-described methods, the promotor candidates are
then synthesized (S2) and joined with structural genes (genes for
testing) (S3), and finally tested (S4). Thus, the number of
promotors that must be actually synthesized and tested is
reduced.
BRIEF EXPLANATION OF THE DRAWINGS
FIG. 1 illustrates the effect of the present invention.
FIG. 2 is a flowchart illustrating a method for selecting
artificial promotor candidates of the present invention.
FIG. 3 illustrates methods for converting from a nucleotide
sequence (containing SEQ ID NOs: 6 and 7) to an amino acid index
sequence. The pattern 1, 2 and 3 nucleotide sequences include
nucleotides 1 to 17 of SEQ ID NO: 2.
FIG. 4 is a genetic code table.
FIG. 5 shows representative amino acid index values.
FIG. 6 is a conceptual illustration of an index curve.
FIG. 7 shows the nucleotide sequence of the promotor region of Gene
J01567 (SEQ ID NO: 1).
FIG. 8 shows the nucleotide sequence of the promotor region of Gene
X87994 (SEQ ID NO: 2).
FIG. 9 shows the nucleotide sequence of the promotor region of Gene
EPD31005 (SEQ ID NO: 3).
FIG. 10 shows the nucleotide sequence of the promotor region of
Gene EPD35038 (SEQ ID NO: 4).
FIG. 11 shows the nucleotide sequence of the promotor region of
Gene J01567 (SEQ ID NO: 1) in order of one nucleotide at a time
beginning from the upstream.
FIG. 12 shows a sequence created by extracting nucleotides from the
nucleotide sequence shown in FIG. 11 in groups of three (SEQ ID NO:
5).
FIG. 13 shows a progression obtained by converting the sequences of
three nucleotides shown in FIG. 12 into amino acid index
values.
FIG. 14 shows a progression obtained by smoothing the progression
shown in FIG. 13.
FIG. 15 shows representative index curves generated from known
promotors.
FIG. 16 is a hardware configuration diagram showing the artificial
promotor candidate selection device of one embodiment of the
present invention.
FIG. 17 is a flowchart illustrating a method for selecting
artificial promotor candidates using the artificial promotor
candidate selection device shown in FIG. 16.
FIG. 18 illustrates the steps of DNA transcription and protein
synthesis.
PREFERRED EMBODIMENTS OF THE INVENTION
Hereinafter, embodiments of the artificial promotor candidate
selection methods of the present invention will be discussed. FIG. 2
shows a method for selecting artificial promotor candidates.
As shown in FIG. 2, in the artificial promotor candidate selection
methods of the invention, a virtual nucleotide sequence is created
by joining the nucleotide sequence of a virtual artificial promotor
to the nucleotide sequence of a structural gene (gene for testing)
(S1).
This step will be described in greater detail using FIG. 3. As
shown in FIG. 3, the nucleotides of the virtually created
artificial promotor nucleotide sequence are gatct . . . ca, and the
nucleotides of the structural gene are cccgtccag . . . . In Step
S1, the gatct . . . ca sequence of the artificial promotor
nucleotide sequence is inserted before the cccgtccag . . .
nucleotide sequence of the structural gene, thereby generating the
nucleotide sequence gatct . . . cacccgtccag . . . . (containing SEQ
ID NOs: 8 and 7).
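Step S1 amounts to string concatenation. The Python sketch below assumes that the structural-gene data begins at the transcription start site, so that the 0-based index of the start site in the joined sequence equals the length of the promotor. The function name join_upstream is hypothetical, and the two short strings are only the visible fragments of the FIG. 3 sequences with the elided middle omitted, not the full sequences.

```python
def join_upstream(promoter, gene):
    """Join promoter sequence data upstream of gene sequence data.

    Returns the joined sequence and the 0-based index of the first gene
    nucleotide, taken here to be the transcription start site (assumption).
    """
    joined = promoter.lower() + gene.lower()
    tss_index = len(promoter)
    return joined, tss_index


# Truncated stand-ins for the elided sequences shown in FIG. 3.
joined, tss_index = join_upstream("gatctca", "cccgtccag")
print(joined, tss_index)
```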
Next, a first progression is prepared, which first progression
includes the amino acid index values of the virtual amino acid
index sequence. The virtual amino acid index sequence is generated
by extracting nucleotides in units of three in a pre-determined
pattern starting from the first nucleotide in the sequence
(S2).
An example of a method for sequentially extracting nucleotides in
units of three from the nucleotide sequence will be presented.
Pattern 1 is a method for sequentially extracting nucleotides in
units of three starting at the beginning of the nucleotide sequence
gatct . . . and shifting one nucleotide per extraction. The first
three nucleotides extracted by this method are gat, the second
three nucleotides are atc, and the third three nucleotides are
tct.
As shown in FIG. 4 (genetic code table), the three nucleotides gat
correspond to aspartic acid (D: hereinafter, amino acids will be
referred to by their single-letter codes), atc corresponds to
isoleucine (I), and tct corresponds to serine (S). Therefore, the
amino acid sequence produced by extracting nucleotides in units of
three, while shifting one nucleotide per extraction, from this
nucleotide sequence is represented by the virtual amino acid
sequence D, I, S, L, S . . . . (SEQ ID NO: 9).
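The Pattern 1 conversion can be sketched in Python as follows, assuming that the table in FIG. 4 is the standard genetic code and writing stop codons as X, as the description does. The names CODON_TABLE and virtual_amino_acids are illustrative.

```python
from itertools import product

# Standard genetic code with bases ordered t, c, a, g; stop codons are "X".
BASES = "tcag"
AMINO = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): amino
               for codon, amino in zip(product(BASES, repeat=3), AMINO)}


def virtual_amino_acids(seq):
    """Translate every overlapping triplet of seq, shifting one nucleotide
    per extraction (Pattern 1), into a single-letter amino acid code."""
    seq = seq.lower()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(len(seq) - 2))


print(virtual_amino_acids("gatct"))  # DIS
```

Because the window moves one nucleotide at a time, a sequence of n nucleotides yields n - 2 virtual amino acids, and all three reading frames contribute to the same sequence.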
Once the virtual amino acid sequence is prepared, the respective
amino acids within the virtual amino acid sequence are converted
into their respective index values to create a progression. von
Heijne-Blomberg's transfer free energy to lipophilic phase is
presented in FIG. 5 as an example of amino acid index values for
the 20 types of amino acids. Thus, the above amino acid sequence is
converted into the following progression of numerical values
(Progression 1): 23.22, -18.32, -1.54, -17.79, -1.54 . . . , which
corresponds to the first progression as utilized in the claims.
If a stop codon (X) is encountered in the amino acid sequence, then
there is no corresponding amino acid or amino acid index value;
thus, in this embodiment, the amino acid index value of the stop
codon (X) is set to zero. Removal of the stop codon (X) would
result in the removal of a nucleotide from the nucleotide sequence,
which would change the amino acid index curve.
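The conversion to Progression 1, including the zero value for stop codons, can be sketched in Python as follows. The dictionary holds only the four index values that can be read off the example above (D, I, S and L); a full table would come from FIG. 5, which is not reproduced here. The names INDEX_VALUES and to_progression are illustrative.

```python
# Only the values quoted in the worked example; the remaining amino acids
# would have to be filled in from FIG. 5 (not reproduced here).
INDEX_VALUES = {"D": 23.22, "I": -18.32, "S": -1.54, "L": -17.79}
STOP = "X"


def to_progression(amino_acids, index_values):
    """Map each virtual amino acid to its index value.

    A stop codon keeps its position and contributes 0.0, so later
    positions are not shifted.
    """
    return [0.0 if aa == STOP else index_values[aa] for aa in amino_acids]


print(to_progression("DISLS", INDEX_VALUES))
```

Passing the index table as an argument keeps the function independent of any particular amino acid index, which matters because the description allows other indices to be chosen.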
Next, Progression 1 (23.22, -18.32, -1.54, -17.79, -1.54 . . . ) is
smoothed (S3). Progression 1 is smoothed in this step because
changes (patterning) in the amino acid index values in Progression
1 are difficult to recognize due to extreme variation in
neighboring values. Therefore, smoothing may be performed so that a
graph containing the smoothed numbers will have a discernable
pattern.
Any of several known smoothing methods in the field of data
processing may be used, but as an example for the purpose of
explanation, the values of Progression 1 will be smoothed by taking
the average of a certain number (3) of these values and
sequentially moving toward the end of the virtual sequence in order
to generate a smoothed progression.
The average of the first three values of Progression 1 (23.22,
-18.32, -1.54, -17.79, -1.54 . . . ) is
(23.22+(-18.32)+(-1.54))/3=1.12, the average of the next three
values is ((-18.32)+(-1.54)+(-17.79))/3=-12.55, and the average of
the next three values is ((-1.54)+(-17.79)+(-1.54))/3=-6.96. The
averages of the subsequent values are generated in a similar
manner.
Taking the average for Progression 1 for every few values results
in a smoothed Progression 2 (1.12, -12.55, -6.96 . . . ), which
corresponds to the second progression as used in the claims.
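The arithmetic of this example can be checked with a few lines of Python. The sketch uses only the five values of Progression 1 quoted above; the function name moving_average is illustrative, and rounding to two decimals is done only to match the precision used in the text.

```python
def moving_average(values, window=3):
    """Average each run of `window` consecutive values, moving one
    position toward the end of the progression at a time."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


progression_1 = [23.22, -18.32, -1.54, -17.79, -1.54]
progression_2 = [round(v, 2) for v in moving_average(progression_1)]
print(progression_2)  # [1.12, -12.55, -6.96]
```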
Next, a determination is made as to whether or not there is a sign
inversion in Progression 2 near the transcription start site; if
there is a sign inversion, the artificial promotor is selected as a
potentially effective artificial promotor candidate.
This determination method preferably produces an index curve by
graphing Progression 2 and evaluates the particular artificial
promotor based on the pattern of the index curve (S4).
The positions of the numerical values of Progression 2 (1.12,
-12.55, -6.96 . . . ) are plotted onto the horizontal axis, and the
smoothed amino acid index values are plotted onto the vertical axis
in order to create an index curve. FIG. 6 conceptually illustrates
an example of a portion of an index curve around the transcription
start site.
Artificial promotor candidates that are very likely to function as
promoters can be selected based upon the pattern of the index curve
of a virtual amino acid sequence around the transcription start
site.
As discussed above, index curves of known promotors typically show
a sign reversal (observable in FIG. 15) near the transcription
start site. If the index curve of the virtual amino acid sequence,
which is prepared by joining a virtual artificial promotor
nucleotide sequence to a nucleotide sequence of a structural gene,
exhibits a sign reversal near the transcription start site, the
virtual artificial promotor having this nucleotide sequence could
be selected as an artificial promotor candidate as possibly being
effective for the structural gene.
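As an illustration only, the sign-reversal check could be automated along the following lines. The function name has_sign_reversal and the parameters tss_index and radius are hypothetical. The text does not quantify "near", so the width of the neighborhood is an assumption left to the user, as is the mapping from nucleotide position to position within the smoothed progression.

```python
def has_sign_reversal(curve, tss_index, radius=10):
    """Return True if consecutive values change sign within `radius` of tss_index.

    `curve` is the smoothed progression; `tss_index` is the position in that
    progression taken to correspond to the transcription start site.
    `radius` is an assumed neighborhood width, not a value from the text.
    """
    lo = max(0, tss_index - radius)
    hi = min(len(curve), tss_index + radius + 1)
    # Zeros carry no sign, so they are skipped before comparing neighbors.
    window = [v for v in curve[lo:hi] if v != 0]
    return any(a * b < 0 for a, b in zip(window, window[1:]))


# Start of Progression 2 from the text: positive, then negative.
print(has_sign_reversal([1.12, -12.55, -6.96], tss_index=1, radius=1))  # True
```

A candidate for which this check returns False would simply not be carried forward to synthesis; the final evaluation still rests on the actual experiment.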
The efficacy of this artificial promotor candidate selection method
will be described below in terms of genes (promotor regions) having
known nucleotide sequences.
The selected genes (promotor regions) having known nucleotide
sequences were J01567 Plasmid Colicin E1 (from E. coli) strong
promotor region DNA, X87994 C. xyli DNA for strong promotor (569
bp), EPD31005 (+)Ph EPSP synthase P2+, and EPD35038 (+)Le LAT52;
range -499 to 100.
The nucleotide sequences of J01567 and X87994 were obtained from
the DNA Data Bank of Japan (DDBJ), and the other sequences were
found in the Eukaryotic Promotor Database (EPD). (X87994 sequence
data was used beginning at number 481.)
These nucleotide sequences are shown in FIGS. 7 to 10. The
nucleotide sequence of J01567 is shown in FIG. 7, and the
transcription start site is found at number 84. The nucleotide
sequence of X87994 is shown in FIG. 8 and the transcription start
site is found at number 542. (Note that this transcription start
site corresponds to number 62 in FIG. 15(b), because the sequence
listing in FIG. 15 begins with number 481.) The nucleotide sequence
of EPD31005 is shown in FIG. 9 and the transcription start site is
found at number 140. Finally, the nucleotide sequence of EPD35038
is shown in FIG. 10 and the transcription start site is found at
number 140.
The index curve of the nucleotide sequence was plotted according to
the methods described above, which will be described in detail with
reference to FIGS. 11 to 14, and using as an example J01567, which
nucleotide sequence is shown in FIG. 7. The index curve generating
method, which is described below, is only one example and does not
limit the present invention.
(1) Three sequential nucleotides are repeatedly extracted as units
starting from the beginning of nucleotide sequence J01567, which is
shown in FIG. 11, by moving one nucleotide per extraction until the
final nucleotide is reached, thereby generating the sequence shown
in FIG. 12, which lists the nucleotides in groups of three.
(2) The three nucleotide units are then converted into the
corresponding amino acids using the sequence generated in the above
Step (1) (FIG. 12), thereby generating a virtual amino acid
sequence.
(3) The progression shown in FIG. 13 is then calculated, which
progression includes amino acid index values obtained by converting
each amino acid within the sequence generated in the above Step (2)
into the corresponding hydrophobic amino acid index value.
(4) The progression shown in FIG. 14 is then calculated, which
progression contains values generated by adding the five values in
front of each value from the progression generated in the above
Step (3) (FIG. 13).
(5) An index curve is produced by graphing the progression
generated in the above Step (4) (FIG. 14) along the horizontal axis
starting from the first nucleotide and by expressing the amino acid
index value corresponding to each position on the vertical
axis.
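Steps (1) to (4) above can be sketched as follows, under several stated assumptions. The standard genetic code is assumed to correspond to the conversion data of FIG. 4. The four index values in PARTIAL_INDEX are not taken from FIG. 5 directly; they are inferred by pairing the first values of Progression 1 quoted earlier with the virtual amino acid sequence Asp Ile Ser Leu Ser of SEQ ID NO: 9, so the table is deliberately incomplete. The summation in Step (4) is interpreted here as the sum of each run of five consecutive values, which is one possible reading. All function and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code (assumed to match FIG. 4), keyed by lowercase codon.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Inferred, incomplete index table (see the assumptions stated above).
PARTIAL_INDEX = {"D": 23.22, "I": -18.32, "S": -1.54, "L": -17.79}


def sliding_triplets(seq):
    """Step (1): three sequential nucleotides, moving one nucleotide at a time."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def virtual_amino_acids(seq):
    """Step (2): convert each triplet into its one-letter amino acid code."""
    return [CODON_TABLE[t] for t in sliding_triplets(seq.lower())]


def index_progression(amino_acids, index_table):
    """Step (3): replace each amino acid with its index value.

    Raises KeyError for residues (or stop codons, '*') missing from the table.
    """
    return [index_table[aa] for aa in amino_acids]


def windowed_sums(values, width=5):
    """Step (4), as interpreted here: sum of each run of `width` values."""
    return [sum(values[i:i + width]) for i in range(len(values) - width + 1)]


seq = "gatctcc"  # first seven nucleotides of SEQ ID NO: 2
acids = virtual_amino_acids(seq)
print(acids)  # ['D', 'I', 'S', 'L', 'S']
values = index_progression(acids, PARTIAL_INDEX)
print(values)
print([round(v, 2) for v in windowed_sums(values)])
```

A complete run over J01567 would require the full tables of FIGS. 4 and 5 in place of the partial table used here, together with a rule for triplets that correspond to stop codons.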
The resulting index curves are shown in FIG. 15. (The graphs are
expressed in a bar-graph format in order to visually enhance the
sign reversals.)
Index curves produced from the nucleotide sequences of the above
identified four genes are shown in FIG. 15. Each index curve
exhibits a sign reversal near the transcription start site.
By creating index curves of virtual amino acid sequences, which are
obtained by joining the nucleotide sequence of a virtual artificial
promotor to a nucleotide sequence of a structural gene, virtual
artificial promotors can be selected as potentially effective
artificial promotor candidates for the structural gene in question
when a sign change is present in the index curve near the
transcription start site. The artificial promotor candidates thus
selected are subsequently synthesized, joined to the actual
corresponding nucleotide sequence, and tested in order to make a
final evaluation.
When artificial promotor candidates are selected according to the
above method, the number of artificial promotor candidates actually
synthesized and tested can be dramatically reduced, and potentially
effective artificial promotor candidates (i.e., nucleotide
sequences) can be efficiently designed.
When the nucleotide sequence of a potentially effective artificial
promotor is being designed, the portion of the index curve that is
evaluated can be limited to the portion near the transcription
start site in order to simplify the evaluation.
If artificial promotor candidates having high transcription
activity can be effectively designed using the methods of the
present invention, then promotors useful in the field of gene
therapy, for example, could be efficiently designed. Cancer cells,
for example, could be infected with an adequate amount of a virus
(a normal gene-carrying vector virus) in order to cause specific
expression of proteins from these normal genes (integrated with
powerful artificial promotors) in amounts that are adequate to
suppress the cancer. An artificially strong active promotor must be
designed in order to cause specific and powerful expression of the
normal gene.
An embodiment of an artificial promotor candidate selection device
of the present invention, which is capable of selecting favorable
artificial promotor candidates, will be described. FIG. 16 is a
hardware configuration diagram of the artificial promotor candidate
selection device of this embodiment.
As shown in FIG. 16, the artificial promotor candidate selection
device of this embodiment includes a controller 10 (hereinafter
referred to as a main controller), which is programmed to control
the entire artificial promotor candidate selection device, a memory
device 20 connected to the main controller 10 that stores various
types of data, an input device 11 comprising a keyboard, mouse, or
other pointing device connected to the main controller 10 via an I/O
controller 14, a display device 12, such as a monitor that shows
index curves and other images, and an output device 13, such as a
printer, that prints the nucleotide sequences of selected promotor
candidates. The main controller 10 includes an operating system
(OS) or other control program, a conversion program for generating
a first progression, in which an input nucleotide sequence is
converted to amino acid index values, a program for smoothing the
first progression, an image processing program for displaying the
smoothed data on the display device 12, and an internal memory for
storing required data.
The memory device 20 is a storage means, such as a hard disk,
floppy disk, or optical disk, and stores a nucleotide sequence file
21, a nucleotide/amino acid file 22, an amino acid index value file
23, and an artificial promotor candidate file 24 that contains the
nucleotide sequences of selected artificial promotor
candidates.
The nucleotide sequence file 21 is a file containing nucleotide
sequences created by joining the nucleotide sequence of a virtually
produced artificial promotor candidate to the nucleotide sequence
of a structural gene. Nucleotide sequences input by the operator
via the keyboard or other implement of input device 11 are stored
in the memory device 20.
The nucleotide/amino acid file 22 is a file containing data that
converts the various three-nucleotide combinations into the
corresponding amino acids. Specifically, it contains data such as
the data shown in FIG. 4.
The amino acid index value file 23 is a file containing the amino
acid index values of alanine, arginine, and the other amino acids.
Specifically, the amino acid index value file 23 comprises data
such as the data shown in FIG. 5.
The artificial promotor candidate file 24 is a file containing
artificial promotor nucleotide sequences selected as being highly
likely to effectively function as artificial promotors based upon
the determination using the generated index curve.
Next, a method for selecting artificial promotor candidates from
nucleotide sequences input using the artificial promotor candidate
selection device of this embodiment will be described with
reference to FIG. 17, which is a flowchart illustrating the method
for selecting artificial promotor candidates.
First, an operator inputs a virtual nucleotide sequence via the
input device 11, which virtual nucleotide sequence includes the
nucleotide sequence of a virtual promotor joined to the nucleotide
sequence of a structural gene (S1). The nucleotides input by the
operator are displayed as e.g., a, g, c, or t, on the display
device 12 so the operator can perform data entry while monitoring
the input. The nucleotide sequence is stored in the nucleotide
sequence file 21 of the memory device 20 via the I/O controller 14
and the main controller 10. This example will concern inputting one
nucleotide sequence (artificial promotor candidate).
Next, via the input device 11, the operator inputs the method for
extracting units of three nucleotides from the nucleotide sequence
input in Step S1 (S2). As shown in FIG. 3, the promotor candidate
selection device of this embodiment can utilize Selection Method 1,
in which three sequential nucleotides are extracted, Selection
Method 2, in which three nucleotides are extracted by skipping
every other nucleotide, or Selection Method 3, in which three
nucleotides are extracted by selecting one nucleotide out of each
three nucleotides.
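The three selection methods can be expressed with a single routine in which only the spacing between the extracted nucleotides differs. This is a sketch under assumptions: the function name extract_units and the parameters gap and step are hypothetical, and advancing the starting position by one nucleotide per extraction for Selection Methods 2 and 3 is assumed by analogy with Selection Method 1.

```python
def extract_units(seq, gap=0, step=1):
    """Extract units of three nucleotides spaced `gap` nucleotides apart.

    gap=0: Selection Method 1 (three sequential nucleotides)
    gap=1: Selection Method 2 (every other nucleotide skipped)
    gap=2: Selection Method 3 (one nucleotide out of each three)
    `step` is how far the starting position advances per extraction
    (assumed to be one nucleotide).
    """
    stride = gap + 1
    span = 2 * stride + 1  # distance covered by one unit, inclusive
    return [
        seq[i:i + span:stride]
        for i in range(0, len(seq) - span + 1, step)
    ]


seq = "gatctcctc"  # SEQ ID NO: 6
print(extract_units(seq, gap=0))
print(extract_units(seq, gap=1))
print(extract_units(seq, gap=2))
```

Each unit returned is then converted into an amino acid in the same manner regardless of which selection method produced it.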
The operator additionally specifies the smoothing procedure via the
input device 11 (S3). In the artificial promotor candidate selection
device of this embodiment, the operator inputs the number of values
(3 to 10) over which each average is taken.
Once the data is input in Steps S1 to S3, the main controller 10
extracts three nucleotides from the nucleotide sequence input in
Step S1 according to the extraction method input in Step S2. The
virtual amino acid sequence formed by this nucleotide sequence is
converted into amino acid index values, thereby generating a first
progression (S4). That is, groups of three nucleotides are
extracted from the nucleotide sequence stored in the nucleotide
sequence file 21, the amino acids corresponding to the nucleotide
sequence are selected based upon the data stored in the
nucleotide/amino acid file 22 and subsequently, the amino acids are
converted into the first progression based upon the data stored in
the amino acid index value file 23. The first progression thus
created is stored in an internal memory within the main controller
10.
Next, the main controller 10 smoothes the first progression stored
in the internal memory (S5). Then, the main controller 10 displays
an index curve on the display device 12 based upon the smoothed
values (S6). In this embodiment, smoothing is performed by taking
the average of a number of values, which number corresponds to the
number input in Step S3 (e.g., 3). A detailed explanation of this
smoothing method was provided above.
Then, based upon the index curve displayed on the display device
12, the operator determines the efficacy of the candidate promotor
having the virtual nucleotide sequence (S7). Specifically, the
index curve for the input nucleotide sequence is displayed on the
display device 12, the operator observes the index curve and the
operator determines whether or not a sign change exists near the
transcription start site.
When a promotor is determined to have a high likelihood of
functioning as an artificial promotor in Step S7, the corresponding
nucleotide sequence is selected as an artificial promotor candidate
(S8). The nucleotide sequence of the selected artificial promotor
candidate is stored in the artificial promotor candidate file 24
within the memory device 20, and the operator can display it on the
display device 12 or output it using an output device (printer) on
an as-needed basis.
As described above, in the artificial promotor candidate selection
device of this embodiment, the simple input by an operator of a
nucleotide sequence, which is the nucleotide sequence of a virtual
artificial promotor candidate and the nucleotide sequence of a
targeted structural gene, allows a determination to be made as to
whether or not the virtual promotor candidate could potentially
function as an artificial promotor, thereby reducing the number of
artificial promotor candidates required to be actually synthesized
and tested to a certain extent and enabling efficient artificial
promotor design.
Although the description of this embodiment concerned an example in
which only one nucleotide sequence (artificial promotor candidate)
was input, a plurality of nucleotide sequences can be input at one
time and stored in the nucleotide sequence file 21 of the memory
device 20. In such case, Steps S2 to S8, shown in FIG. 17, would be
repeated for each nucleotide sequence stored in the nucleotide
sequence file 21.
The operator is not required to input nucleotide sequences as
discussed above. Instead, the items input into the artificial
promotor candidate selection device could be limited to the number
of nucleotides for the artificial promotor and the nucleotide
sequence of the structural gene. The main controller 10 would then
automatically generate all conceivable nucleotide sequences.
That is, DNA comprises four nucleotides a, g, c, and t. Therefore,
if the number of nucleotides in the promotor region (n) is known,
then all conceivable nucleotide sequences (4.sup.n) could be
automatically created and combined with the nucleotide sequence of
a structural gene. By equipping the artificial promotor candidate
selection device with such an automatic nucleotide sequence
generating means, artificial promotor candidates could be
efficiently generated and the determination of the efficacy thereof
could be efficiently performed; thus, nucleotide sequence analysis
could be performed without the omissions associated with prior art
random artificial promotor generation.
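The automatic generation described above can be sketched with a lazy generator. The function name candidate_sequences and its parameters are hypothetical, and simple concatenation of the promotor region in front of the structural gene is an assumption about how the two sequences are joined.

```python
from itertools import product


def candidate_sequences(n, structural_gene):
    """Yield every n-nucleotide promotor region joined upstream of the gene.

    Sequences are produced one at a time because their number, 4**n,
    grows too quickly to hold in memory for realistic values of n.
    """
    for bases in product("agct", repeat=n):
        yield "".join(bases) + structural_gene


# Two-nucleotide promotor regions joined to a placeholder gene fragment.
joined = list(candidate_sequences(2, "atg"))
print(len(joined))  # 16
print(joined[:3])   # ['aaatg', 'agatg', 'acatg']
```

Each sequence yielded in this way would then be passed through the extraction, conversion, and smoothing steps before its index curve is evaluated.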
In the above description, an operator (researcher) evaluates the
index curves displayed on the display device 12; however, computer
pattern recognition technology could be employed to enable
automatic recognition by the main controller 10. The device also
could be combined with a means for automatically generating
nucleotide sequences that automatically generates nucleotide
sequences using the main controller 10 and automatically evaluates
the index curves using the main controller 10, thereby enabling
automatic evaluation of all nucleotide sequences.
Although the preferred embodiments of the present invention have
been described above, each is but an example, and the invention may
be embodied through a variety of modifications or improvements
based on the knowledge of persons skilled in the art.
SEQUENCE LISTING <100> GENERAL INFORMATION: <160>
NUMBER OF SEQ ID NOS: 9 <200> SEQUENCE CHARACTERISTICS:
<210> SEQ ID NO 1 <211> LENGTH: 152 <212> TYPE:
DNA <213> ORGANISM: Plasmid Colicin E1 <400> SEQUENCE:
1 ccggattagc agagcgatga tggcacaaac ggtgctacag agttcttgaa gtagtggccc
60 gactacggct acactagaag gacagtattt ggtatctgcg ctctgctgaa
gccagttacc 120 ttcggaaaaa gagttggtag ctcttgatcc gg 152 <200>
SEQUENCE CHARACTERISTICS: <210> SEQ ID NO 2 <211>
LENGTH: 569 <212> TYPE: DNA <213> ORGANISM: Clavibacter
xyli <400> SEQUENCE: 2 gatctcctcg gcgaacacac ccgtccagct
cggcttggac gggctgccga tctcacggga 60 gagctcggtc gtggcgtcga
ggtgggcrag tgtgccggcg ccgggtcggt gtagtcgtcc 120 gcggggacgt
agatcgcgtc ggtttccgtg tccgcacggg agaatccatt gtctgcatta 180
ccgctgtgta acgttttkcg agcctcgttt tgcattcatt gccgacggta cctgagggtc
240 agtcttatcc cgcgccggac caggcactct cgcacgcccg tgcsgcgcgc
aacacttcgc 300 acaggaggaa cattgcacat caagggaaga aaggtcgggc
tcgcggtcag cgccgtcgcc 360 ggtgtcgcsc tctcgctctg tcagcctgca
ccacgtccgg caccggtggt tcgtccgctg 420 ctaaaggggg catggtgaca
gtggcggtcg tcaacgacct cacctcactg aactcgcaga 480 ccccgcaggg
caacctggac accaacggcc aggtcggcta cctgaagttc ctacggaacc 540
ggtttccagt acatcgacaa caactacaa 569 <200> SEQUENCE
CHARACTERISTICS: <210> SEQ ID NO 3 <211> LENGTH: 175
<212> TYPE: DNA <213> ORGANISM: Petunia hybrida
<400> SEQUENCE: 3 gagaacacag ctggaatttt ttacaaaggt agttggtgaa
gctagtcagc gaatcccatt 60 accttccact ctacctaacc cccttcacca
acaacaaatt tctgtaattt aaaaactagc 120 caaaaaagaa ctctctttta
caaagagcca aagactcaat ctttactttc aagaa 175 <200> SEQUENCE
CHARACTERISTICS: <210> SEQ ID NO 4 <211> LENGTH: 239
<212> TYPE: DNA <213> ORGANISM: Lycopersicon esculentum
<400> SEQUENCE: 4 tcaattgcta caatcacttc attattaatt ttaattaata
tatgtggtta tatatgaaac 60 tgttagagaa ataatagctc caccatattt
ttttctcaat ttattttcac tataaaaagg 120 ctatttcatt ataatcaaaa
caagacacac acaaagagaa ggagcaataa aataaaagta 180 aacaacaatt
tgtgtgttta aaaaaaaaaa aaaagtacac acaccaaaaa aaaaaattc 239
<200> SEQUENCE CHARACTERISTICS: <210> SEQ ID NO 5
<211> LENGTH: 453 <212> TYPE: DNA <213> ORGANISM:
Artificial <220> FEATURE: <223> OTHER INFORMATION:
Virtually-generated promotor candidate. <400> SEQUENCE: 5
ccgcggggag atattttata gagcgcacag agagagagcg cgcgagatat gtgagatatg
60 tggggcgcac acacacaaaa aaacacgcgg ggtgtgtgcg ctctatacac
acagagagag 120 agtgttttct ctcttttgtg agaaaagagt gtatagagtg
tgtggggcgc ccccccgcga 180 gacactctat acacgcgggg cgctctatac
acacacactc tatagagaga aaagagggga 240 gacacacaga gtgtatatat
ttttttgtgg ggtgtatata tctctctgtg cgcgcgcgct 300 ctctctctgt
gcgctctgtg agaaaagagc gccccacaga gtgttttata cacccctctt 360
ttctcgcggg gagaaaaaaa aaaaaagaga gagagtgttt tgtggggtgt atagagcgct
420 ctctctcttt tgtgagatat ctccccgcgg ggg 453 <200> SEQUENCE
CHARACTERISTICS: <210> SEQ ID NO 6 <211> LENGTH: 9
<212> TYPE: DNA <213> ORGANISM: Artificial <220>
FEATURE: <223> OTHER INFORMATION: Partial sequence of
promotor portion of exemplified nucleotide sequence. <400>
SEQUENCE: 6 gatctcctc 9 <200> SEQUENCE CHARACTERISTICS:
<210> SEQ ID NO 7 <211> LENGTH: 11 <212> TYPE:
DNA <213> ORGANISM: Artificial <220> FEATURE:
<223> OTHER INFORMATION: Partial sequence of promotor and
structural gene of exemplified nucleotide sequence. <400>
SEQUENCE: 7 cacccgtcca g 11 <200> SEQUENCE CHARACTERISTICS:
<210> SEQ ID NO 8 <211> LENGTH: 5 <212> TYPE: DNA
<213> ORGANISM: Artificial <220> FEATURE: <223>
OTHER INFORMATION: Partial sequence of promotor portion of
exemplified nucleotide sequence. <400> SEQUENCE: 8 gatct 5
<200> SEQUENCE CHARACTERISTICS: <210> SEQ ID NO 9
<211> LENGTH: 5 <212> TYPE: PRT <213> ORGANISM:
Artificial <220> FEATURE: <223> OTHER INFORMATION:
Virtual amino acid sequence. <400> SEQUENCE: 9 Asp Ile Ser
Leu Ser 1 5
* * * * *