U.S. patent number RE46,033 [Application Number 14/121,743] was granted by the patent office on 2016-06-21 for genomic plant sequences and uses thereof.
This patent grant is currently assigned to Monsanto Technology LLC. The grantee listed for this patent is Monsanto Technology LLC. Invention is credited to Andrey A. Boukharov, Yongwei Cao, Stanton B. Dotson, Jeffrey M. Koshi, David K. Kovalic, Jingdong Liu, James D. McIninch, Wei Wu.
United States Patent |
RE46,033 |
Boukharov, et al. |
June 21, 2016 |
**Please see images for:
( Certificate of Correction ) ** |
Genomic plant sequences and uses thereof
Abstract
The present invention discloses rice genomic promoter sequences.
The promoters are particularly suited for use in rice and other
cereal crops. Methods of modifying, producing, and using the
promoters are also disclosed. The invention further discloses
compositions, transformed host cells, transgenic plants, and seeds
containing the rice genomic promoter sequences, and methods for
preparing and using the same.
Inventors: |
Boukharov; Andrey A.
(Chesterfield, MO), Cao; Yongwei (Lexington, MA), Dotson;
Stanton B. (Chesterfield, MO), Koshi; Jeffrey M.
(Cambridge, MA), Kovalic; David K. (University City, MO),
Liu; Jingdong (Ballwin, MO), McIninch; James D.
(Burlington, MA), Wu; Wei (St. Louis, MO) |
Applicant: |
Name | City | State | Country | Type
Monsanto Technology LLC | St. Louis | MO | US |
Assignee: | Monsanto Technology LLC (St. Louis, MO) |
Family ID: | 56118440 |
Appl. No.: | 14/121,743 |
Filed: | October 10, 2014 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number | Issue Date
09702134 | Oct 31, 2000 | |
09620392 | Jul 19, 2000 | |
60144351 | Jul 20, 1999 | |
Reissue of: 09815264 | Mar 23, 2011 | 7365185 | Apr 29, 2008
Current U.S. Class: | 1/1 |
Current CPC Class: | C12N 15/8216 (20130101); C12Q 1/6895 (20130101);
C07K 14/415 (20130101); C12Q 2600/13 (20130101); C12N 15/8227
(20130101) |
Current International Class: | C07H 21/04 (20060101); C07K 14/415
(20060101); C07H 21/02 (20060101); A01H 5/00 (20060101); C12N 15/82
(20060101) |
Other References
Adams et al., "Complementary DNA sequencing: expressed sequence tags
and human genome project," Science, 252(5013):1651-1656, 1991. cited
by applicant .
Ananiev et al., "Oat-maize chromosome addition lines: a new system
for mapping the maize genome," Proc. Natl. Acad. Sci. USA,
94:3524-3529, 1997. cited by applicant .
Birkenbihl et al., "Cosmid-derived map of E. coli strain BHE2600 in
comparison to the map of strain W3110." Nucleic Acids Res.,
17(13):5057-5069, 1989. cited by applicant .
Bukanov et al., "Ordered cosmid library and high-resolution
physical-genetic map of Helicobacter pylori strain NCTC11638," Mol.
Microbiol., 11(3):509-523, 1994. cited by applicant .
Coulson et al., "Toward a physical map of the genome of the
nematode caenorhabditis elegans," Proc. Natl. Acad. Sci. USA,
83:7821-7825, 1986. cited by applicant .
Ebert et al., "Identification of an essential upstream element in
the nopaline synthase promoter by stable and transient assays,"
Proc. Natl. Acad. Sci. USA, 84(16):5745-5749, 1987. cited by
applicant .
Efstratiadis et al., "Enzymatic in vitro synthesis of globin
genes," Cell, 7:279-288, 1976. cited by applicant .
Eiglmeier et al., "Use of an ordered cosmid library to deduce the
genomic organization of Mycobacterium leprae," Mol. Microbiol.,
7(2):197-206, 1993. cited by applicant .
Goff, "Rice as a model for cereal genomics," Curr. Opin. Plant
Biol., 2:86-89, 1999. cited by applicant .
Hong, "A rapid and accurate strategy for rice contig map
construction by combination of fingerprinting and hybridization,"
Plant Mol. Biol., 35:129-133, 1997. cited by applicant .
Kidwell et al., "Transposable elements as sources of variation in
animals and plants," Proc. Natl. Acad. Sci. USA, 94:7704-7711,
1997. cited by applicant .
Kim et al., "Construction and characterization of a human bacterial
artificial chromosome library," Genomics, 34:213-218, 1996. cited
by applicant .
Knott et al., "Randomly picked cosmid clones overlap the pyrB and
oriC gap in the physical map of the E. coli chromosome," Nucleic
Acids Res., 16:2601-2612, 1988. cited by applicant .
Ko et al., "An `equalized cDNA` library by the reassociation of
short double-stranded cDNA," Nucleic Acids Res., 18(19):5705-5711,
1990. cited by applicant .
Kurata et al., "A 300 kilobase interval genetic map of rice
including 883 expressed sequences," Nature Gen., 8(4):362-372, 1994.
cited by applicant .
McCombie et al., "Caenorhabditis elegans expressed sequence tags
identify gene families and disease gene homologues," Nature Gen.,
1:124-131, 1992. cited by applicant .
Meinkoth et al., "Hybridization of nucleic acids immobilized on
solid supports," Anal. Biochem., 138:267-284, 1984. cited by
applicant .
Mohan et al., "Genome mapping, molecular markers and
marker-assisted selection crop plants," Mol. Breed, 3:87-103, 1997.
cited by applicant .
Okubo et al., "Large scale cDNA sequencing for analysis of
quantitative and qualitative aspects of gene expression," Nature
Gen., 2:173-179, 1992. cited by applicant .
Tanksley et al., "Chromosome landing: a paradigm for map-based gene
cloning in plants with large genomes," Trends in Genet.,
11(2):63-68, 1995. cited by applicant .
Venter et al., "A new strategy for genome sequencing," Nature,
381:364-366, 1996. cited by applicant .
Wang et al., "Construction of a rice bacterial artificial
chromosome library and identification of clones linked to the Xa-21
disease resistance locus," Plant J., 7(3):525-533, 1995. cited by
applicant .
Wenzel et al., "Physical mapping of the mycoplasma pneumoniae
genome," Nucleic Acids Res., 16(17):8323-8336, 1988. cited by
applicant .
Yomo et al., "Histochemical studies on protease formation in the
cotyledons of germinating bean seeds," Planta, 112(1):35-43, 1973.
cited by applicant .
Zhang et al., "Construction and characterization of two rice
bacterial artificial chromosome libraries from the parents of a
permanent recombinant inbred mapping population," Mol. Breeding,
2:11-24, 1996. cited by applicant .
Zhang et al., "Physical mapping of the rice genome with BACs,"
Plant Mol. Biol., 35:115-127, 1997. cited by applicant .
Zwick et al., "Physical mapping of the liguleless linkage group in
sorghum bicolor using rice RFLP-selected sorghum BACs," Genetics,
148:1983-1992, 1998. cited by applicant .
Meinkoth et al. Analyt. Biochem. (1984) vol. 138, pp. 267-284.
cited by examiner .
Wing et al. NCBI accession number AZ134591, Jun. 2, 2000. cited by
examiner .
Chen et al., Proc. Natl. Acad. Sci. USA, 94:3431-3435 (1997). cited
by applicant.
|
Primary Examiner: Campell; Bruce
Attorney, Agent or Firm: Dentons US LLP Doyle Esq.; Carine
M.
Parent Case Text
REFERENCES TO RELATED APPLICATIONS
This application is a continuation-in-part under 35 U.S.C.
§ 120 of U.S. application Ser. No. 09/620,392, filed Jul. 19,
2000.Iadd., now abandoned, which claims the benefit of U.S.
Provisional Application Ser. No. 60/144,351, filed Jul. 20,
1999.Iaddend.; .Iadd.and is a continuation-in-part of U.S.
application .Iaddend.Ser. No. 09/702,134, filed Oct. 31, 2000,
.Iadd.now abandoned, .Iaddend.the disclosures of which applications
are incorporated herein by reference in their entirety.
Claims
We claim:
1. A substantially purified nucleic acid molecule that comprises
the nucleotide sequence of SEQ ID NO: 1.Iadd., .Iaddend.or a
complement thereof.Iadd., operably linked to a heterologous
structural nucleic acid sequence.Iaddend..
2. A substantially purified nucleic acid molecule that consists of
the nucleotide sequence of SEQ ID NO: 1.Iadd., .Iaddend.or a
complement thereof.Iadd., operably linked to a heterologous
structural nucleic acid sequence.Iaddend..
3. A substantially purified nucleic acid molecule comprising a
nucleic acid sequence wherein the nucleic acid sequence: i)
hybridizes under high stringency conditions with the sequence of SEQ
ID NO:1 or a complement thereof; or ii) exhibits an 85% or greater
identity to the sequence of SEQ ID NO:1.Iadd.; wherein the nucleic
acid sequence is operably linked to a heterologous structural
nucleic acid sequence.Iaddend..
4. The nucleic acid molecule of claim 3, wherein the nucleic acid
sequence exhibits a 90% or greater identity to the nucleic acid
sequence of SEQ ID NO:1.
5. The nucleic acid molecule of claim 3, wherein the nucleic acid
sequence exhibits a 95% or greater identity to the nucleic acid
sequence of SEQ ID NO:1.
6. The nucleic acid molecule of claim 3, wherein the nucleic acid
sequence exhibits a 99% or greater identity to the nucleic acid
sequence of SEQ ID NO:1.
7. The nucleic acid molecule of claim 3, wherein said nucleic acid
sequence comprises the sequence of SEQ ID NO: 1.
8. The nucleic acid sequence of claim 3, wherein the nucleic acid
molecule further comprises one or more cis-acting nucleic acid
elements.
9. The nucleic acid molecule of claim 3, wherein the nucleic acid
molecule further comprises a 5' leader sequence selected from the
group consisting of dSSU 5', PetHSP70 5', and GmHSP17.9 5'.
10. The nucleic acid molecule of claim 3, wherein the nucleic acid
molecule further comprises a 3' untranslated region.
11. The nucleic acid molecule of claim 10, wherein the 3'
untranslated region is selected from the group consisting of NOS
3', E9 3', ADR12 3', 7Sα 3', 11S 3', and albumin 3'.
12. A transgenic plant comprising a recombinant nucleic acid
molecule having the nucleic acid sequence of claim 3.
13. A host cell comprising a recombinant nucleic acid molecule
having the nucleic acid molecule of claim 3.
14. The host cell of claim 13, wherein said host cell is a plant
cell.
15. A transgenic plant comprising the host cell of claim 13.
Description
INCORPORATION OF SEQUENCE LISTING
Two copies of the sequence listing (Copy 1 .Iadd.Replacement
.Iaddend.and Copy 2 .Iadd.Replacement.Iaddend.) and a computer
readable form of the sequence listing .Iadd.(Computer Readable Form
(CRF) Replacement).Iaddend., all on CD-ROMs, each containing the
file named .[.Pa2_00329.txt.]. .Iadd.MONS232USRE seq.txt
.Iaddend.which is .[.420,819,499.]. .Iadd.420,834,519
.Iaddend.bytes (measured in MS-DOS) and was created on .[.Mar. 23,
2001.]. .Iadd.Sep. 26, 2014.Iaddend., are herein incorporated by
reference.
INCORPORATION OF TABLES 1, 3, 4, 5 AND 6
Two copies of Tables 1, 3, 4, 5, and 6 on CD-ROMs, each containing
47,041,202 bytes (measured in MS-DOS) and each having the file name
Pa_00329.txt all created on .[.Mar. 16, 2001.]. .Iadd.Apr. 15,
2009.Iaddend., are herein incorporated by reference.
LENGTHY TABLES
The patent contains a lengthy
table section. A copy of the table is available in electronic form
from the USPTO web site
(http://seqdata.uspto.gov/?pageRequest=docDetail&DocID=USRE046033E1).
An electronic copy of the table will also be available from the
USPTO upon request and payment of the fee set forth in 37 CFR
1.19(b)(3).
FIELD OF THE INVENTION
The present invention relates to the field of plant biochemistry
and genetics. Specifically, the invention relates to regulatory
elements comprising genomic nucleic acid sequences from rice
plants, and nucleic acid molecules containing the same. More
specifically, the invention discloses nucleic acid sequences from
Oryza sativa (rice) containing regulatory elements, such as
promoters. The invention also discloses methods of modifying,
producing, and using the regulatory elements.
BACKGROUND OF THE INVENTION
Promoters
The genetic enhancement of plants and seeds provides significant
benefits to society. For example, plants and seeds may be enhanced
to have desirable agricultural, biosynthetic, commercial, chemical,
insecticidal, industrial, nutritional, or pharmaceutical
properties. Despite the availability of many molecular tools,
however, the genetic modification of plants and seeds is often
constrained by an insufficient or poorly localized expression of
the engineered transgene.
Many intracellular processes may impact overall transgene
expression, including transcription, translation, protein assembly
and folding, methylation, phosphorylation, transport, and
proteolysis. Intervention in one or more of these processes can
increase the amount of transgene expression in genetically
engineered plants and seeds. For example, raising the steady-state
level of mRNA in the cytosol often yields an increased accumulation
of the transgene product. Many factors may contribute to increasing
the steady-state level of an mRNA in the cytosol, including the
rate of transcription, promoter strength and other regulatory
features of the promoter, efficiency of mRNA processing, and the
overall stability of the mRNA.
Among these factors, the promoter plays a central role. Along the
promoter, the transcription machinery is assembled and
transcription is initiated. This early step is often rate-limiting
relative to subsequent stages of protein production. Transcription
initiation at the promoter may be regulated in several ways. For
example, a promoter may be induced by the presence of a particular
compound or external stimulus, express a gene only in a specific
tissue, express a gene during a specific stage of development, or
constitutively express a gene. Thus, transcription of a transgene
may be regulated by operably linking the coding sequence to
promoters with different regulatory characteristics. Accordingly,
regulatory elements, such as promoters, play a pivotal role in
enhancing the agronomic, pharmaceutical, or nutritional value of
crops.
At least two types of information are useful in predicting promoter
regions within a genomic DNA sequence. First, promoters may be
identified on the basis of their sequence "content," such as
transcription factor binding sites and various known promoter
motifs. (Stormo, Genome Research 10: 394-397 (2000)). Such signals
may be identified by computer programs that identify sites
associated with promoters, such as TATA boxes, transcription factor
(TF) binding sites, and CpG islands.
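A content-based scan of this kind can be sketched in a few lines of Python. The sketch below is illustrative only: the TATA-box pattern is a simplified consensus chosen for the example, the function names are hypothetical, and the CpG observed/expected ratio is one commonly used statistic. None of these is taken from the patent or from any particular promoter-prediction program.

```python
import re

# Simplified TATA-box consensus (TATA[AT]A[AT]). This pattern is an
# illustrative assumption, not a motif definition from the patent.
TATA_BOX = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(seq):
    """Return 0-based start positions of non-overlapping TATA-box-like motifs."""
    return [m.start() for m in TATA_BOX.finditer(seq.upper())]


def cpg_observed_expected(seq):
    """Return the CpG observed/expected ratio.

    Computed as (CpG count * sequence length) / (C count * G count).
    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


# Toy sequence: a GC-rich stretch, a TATA-like motif, then more GC-rich bases.
example = "GGCGCGCCTATAAATAGGCGCG"
print(find_tata_boxes(example))
print(round(cpg_observed_expected(example), 2))
```

A real analysis would combine many such signals and score them statistically; the sketch only shows what a sequence "content" signal looks like in practice.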
Second, promoters may be identified on the basis of their
"location," i.e. their proximity to a known or suspected coding
sequence. (Stormo, Genome Research 10: 394-397 (2000)). Promoters
are typically contained within a region of DNA extending
approximately 150-1500 basepairs in the 5' direction from the start
codon of a coding sequence. Thus, promoter regions may be
identified by locating the start codon of a coding sequence, and
moving beyond the start codon in the 5' direction to locate the
promoter region.
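The location-based approach amounts to taking a window of sequence immediately 5' of a start codon. The following Python sketch assumes the genomic sequence is given as a coding-strand string and that the start codon position is already known. The function name and the default window bounds (150 and 1500 basepairs, from the range quoted above) are choices made for this example.

```python
def upstream_region(genome, start_codon_pos, min_len=150, max_len=1500):
    """Return the candidate promoter window 5' of a start codon.

    genome is the coding-strand sequence and start_codon_pos is the
    0-based index of the A in ATG. The window extends up to max_len
    bases upstream and is clipped at the start of the sequence.
    Returns (window_start, window_end, sequence), or None when fewer
    than min_len bases are available upstream.
    """
    if genome[start_codon_pos:start_codon_pos + 3].upper() != "ATG":
        raise ValueError("no ATG start codon at the given position")
    if start_codon_pos < min_len:
        return None
    window_start = max(0, start_codon_pos - max_len)
    return window_start, start_codon_pos, genome[window_start:start_codon_pos]


# Toy genome: 2000 filler bases, then a start codon and one more codon.
toy_genome = "C" * 2000 + "ATG" + "GGG"
start, end, window = upstream_region(toy_genome, 2000)
print(start, end, len(window))
```

For a gene on the reverse strand, the same logic would be applied to the reverse complement of the sequence.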
Rice
Approximately half a billion tons of rice is produced each year
world-wide. More than 90% of this rice is for human consumption
(Goff, Curr. Opin. Plant Biol. 2:86-89 (1999)). Rice, however, is
not only a commercially important crop; it is also a model for
other cereal crops, such as sorghum, maize, barley and wheat.
Rice is a model crop for several reasons. First, the genes in rice
are predicted to be generally arranged in the genome in an order
that is similar to other cereal crops. In fact, comparisons of the
physical and genetic maps of cereal genomes have suggested the
existence of a colinearity of gene order among the various cereal
genomes studied. (Goff, Curr. Opin. Plant Biol. 2:86-89
(1999)).
Second, studies of a number of individual genes indicate that there
is considerable homology within gene families found among various
cereal crops. This conservation of gene and protein sequences
suggests that functional studies of genes or proteins from one
cereal crop can help elucidate the function of similar genes or
proteins in other cereal crops. Likewise, non-coding regulatory
elements in rice, such as promoters, are predicted to display
similar functions compared to related regulatory elements found in
other cereal crops. Accordingly, a strong constitutive or
tissue-specific promoter from one cereal is more likely to retain
its function when introduced as a portion of a transgene into
another cereal crop species (Goff, Curr. Opin. Plant Biol. 2:86-89
(1999)).
Third, rice can be used as a model for other cereal genomes because
its genome is smaller than those of other major cereals. The size
of the rice genome is estimated at 420 to 450 megabase pairs.
Sorghum, maize, barley and wheat have larger genomes (1000, 3000,
5000 and 16000 Mbp, respectively). Despite such differences in
genome size, however, the number of genes in each of these crops is
on the same order of magnitude. Thus, the smaller genome size of
rice results in a higher gene density relative to the other
cereals. Based on estimates of 30,000 genes in a cereal genome,
rice will have on average one gene approximately every 15 Kbp. In
contrast, maize and wheat have one gene approximately every 100 and
500 Kbp, respectively. This higher gene density makes rice an
attractive target for cereal gene discovery efforts, genomic
sequence analysis, and identification of regulatory elements, such
as promoters (Goff, Curr. Opin. Plant Biol. 2:86-89 (1999)).
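The gene-density figures above follow from dividing genome size by the estimated gene count. A short Python check is given below; the rice value of 435 Mbp is the midpoint of the quoted 420 to 450 Mbp range and is an assumption made for this example.

```python
GENES_PER_GENOME = 30_000  # estimate quoted in the text

# Genome sizes in Mbp as quoted in the text. The rice value is the
# midpoint of the 420-450 Mbp range (an assumption for this example).
genome_size_mbp = {
    "rice": 435,
    "sorghum": 1000,
    "maize": 3000,
    "barley": 5000,
    "wheat": 16000,
}


def kbp_per_gene(size_mbp, genes=GENES_PER_GENOME):
    """Return the average spacing between genes in Kbp."""
    return size_mbp * 1000 / genes


for crop, size in genome_size_mbp.items():
    print(f"{crop}: one gene per {kbp_per_gene(size):.1f} Kbp")
```

Under these assumptions rice comes out at about 14.5 Kbp per gene, maize at 100 Kbp, and wheat at about 533 Kbp, consistent with the rounded values of 15, 100, and 500 Kbp given in the text.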
For these reasons, rice is a model for other crops. Accordingly,
discoveries in rice may be extended to other crops. Thus, the
identification of new genes, regulatory elements (e.g., promoters),
etc. that function in rice is useful not only in developing
enhanced varieties of rice, but also in developing enhanced
varieties of other crops. In particular, developments in rice are
applicable to other cereal crops, such as sorghum, maize, barley
and wheat.
Clearly, there exists a need in the art for new regulatory
elements, such as promoters, that are capable of expressing
heterologous nucleic acid sequences in important crop species.
SUMMARY OF THE INVENTION
The present invention includes and provides a substantially
purified nucleic acid molecule comprising a nucleic acid sequence
wherein the nucleic acid sequence: i) hybridizes under stringent
conditions with a sequence selected from the group consisting of
SEQ ID NO:1 through 57,467, and the complements thereof; or ii)
exhibits an 85% or greater identity to a sequence selected from the
group consisting of SEQ ID NO:1 through 57,467.
The present invention includes and provides a transgenic plant
containing a nucleic acid molecule that comprises in the 5' to 3'
direction: a nucleic acid sequence that: i) hybridizes under
stringent conditions with a sequence selected from the group
consisting of SEQ ID NO:1 through 57,467, and the complements
thereof; or ii) exhibits an 85% or greater identity to a sequence
selected from the group consisting of SEQ ID NO:1 through 57,467;
operably linked to a structural nucleic acid sequence; wherein the
nucleic acid sequence is heterologous with respect to the
structural nucleic acid sequence.
The present invention includes and provides a seed from a
transgenic plant containing a nucleic acid molecule that comprises
in the 5' to 3' direction: a nucleic acid sequence that: i)
hybridizes under stringent conditions with a sequence selected from
the group consisting of SEQ ID NO:1 through 57,467, and the
complements thereof; or ii) exhibits an 85% or greater identity to
a sequence selected from the group consisting of SEQ ID NO:1
through 57,467; operably linked to a structural nucleic acid
sequence; wherein the nucleic acid sequence is heterologous with
respect to the structural nucleic acid sequence.
The present invention includes and provides a fertile transgenic
plant containing a nucleic acid molecule that comprises in the 5'
to 3' direction: a nucleic acid sequence that: i) hybridizes under
stringent conditions with a sequence selected from the group
consisting of SEQ ID NO:1 through 57,467, and the complements
thereof; or ii) exhibits an 85% or greater identity to a sequence
selected from the group consisting of SEQ ID NO:1 through 57,467;
operably linked to a structural nucleic acid sequence; wherein the
nucleic acid sequence is heterologous with respect to the
structural nucleic acid sequence.
The present invention includes and provides a method of
transforming a host cell comprising: a) providing a nucleic acid
molecule that comprises in the 5' to 3' direction: a nucleic acid
sequence that: i) hybridizes under stringent conditions with a
sequence selected from the group consisting of SEQ ID NO:1 through
57,467, and the complements thereof; or ii) exhibits an 85% or
greater identity to a sequence selected from the group consisting
of SEQ ID NO:1 through 57,467; operably linked to a structural
nucleic acid sequence; and b) transforming said host cell with the
nucleic acid molecule.
DEFINITIONS
The following definitions are provided as an aid to understanding
the detailed description of the present invention.
The phrases "coding sequence," "structural sequence," and
"structural nucleic acid sequence" refer to a physical structure
comprising an orderly arrangement of nucleic acids. The nucleic
acids are arranged in a series of nucleic acid triplets that each
form a codon. Each codon encodes a specific amino acid. Thus,
the coding sequence, structural sequence, and structural nucleic
acid sequence encode a series of amino acids forming a protein,
polypeptide, or peptide sequence. The coding sequence, structural
sequence, and structural nucleic acid sequence may be contained,
without limitation, within a larger nucleic acid molecule, vector,
etc. In addition, the orderly arrangement of nucleic acids in these
sequences may be depicted, without limitation, in the form of a
sequence listing, figure, table, electronic medium, etc.
The phrases "DNA sequence" and "nucleic acid sequence" refer to a
physical structure comprising an orderly arrangement of nucleic
acids. The DNA sequence or nucleic acid sequence may be contained
within a larger nucleic acid molecule, vector, or the like. In
addition, the orderly arrangement of nucleic acids in these
sequences may be depicted in the form of a sequence listing,
figure, table, electronic medium, or the like.
The term "expression" refers to the transcription of a gene to
produce the corresponding mRNA and translation of this mRNA to
produce the corresponding gene product (i.e., a peptide,
polypeptide, or protein) and activity of the protein to confer a
function.
The term "expression of antisense RNA" refers to the transcription
of a DNA to produce a first RNA molecule capable of hybridizing to
a second RNA molecule.
The term "gene" refers to chromosomal or genomic DNA, plasmid DNA,
cDNA, synthetic DNA, or other DNA that encodes a peptide,
polypeptide, protein, or RNA molecule.
"Homology" refers to the level of similarity between two or more
nucleic acid or amino acid sequences in terms of percent of
positional identity (i.e., sequence similarity or identity).
Homology also refers to the concept of similar functional
properties among different nucleic acids or proteins.
The term "heterologous" refers to the relationship between two or
more nucleic acid or protein sequences that are derived from
different sources. For example, a promoter is heterologous with
respect to a coding sequence if such a combination is not normally
found in nature. In addition, a particular sequence may be
"heterologous" with respect to a cell or organism into which it is
inserted (i.e. does not naturally occur in that particular cell or
organism).
The term "hybridization" refers generally to the ability of nucleic
acid molecules to join via complementary base strand pairing. Such
hybridization may occur when nucleic acid molecules are contacted
under appropriate conditions (see also, "specific hybridization,"
below).
The phrase "operably linked" refers to the functional spatial
arrangement of two or more nucleic acid regions or nucleic acid
sequences. For example, a promoter region may be positioned
relative to a nucleic acid sequence such that transcription of the
nucleic acid sequence is directed by the promoter region. Thus, the
promoter region is "operably linked" to the nucleic acid
sequence.
The terms "promoter," "promoter region," and "promoter sequence"
refer to a nucleic acid sequence, usually found upstream (5') to a
coding sequence, that directs transcription of a nucleic acid
sequence into mRNA. The promoter or promoter region typically
provides a recognition site for RNA polymerase and the other factors
necessary for proper initiation of transcription. As contemplated
herein, a promoter or promoter region includes variations of
promoters derived by inserting or deleting regulatory regions,
subjecting the promoter to random or site-directed mutagenesis,
etc. The activity or strength of a promoter may be measured in
terms of the amounts of RNA it produces, or the amount of protein
accumulation in a cell or tissue, relative to a promoter whose
transcriptional activity has been previously assessed.
The term "recombinant vector" refers to any agent such as a
plasmid, cosmid, virus, autonomously replicating sequence, phage,
or linear or circular single-stranded or double-stranded DNA or RNA
nucleotide sequence. The recombinant vector may be derived from any
source; is capable of genomic integration or autonomous
replication; and comprises a promoter nucleic acid sequence
operably linked to one or more nucleic acid sequences. A
recombinant vector is typically used to introduce such operably
linked sequences into a suitable host.
"Regulatory sequence" refers to a nucleotide sequence located
upstream (5'), within, or downstream (3') to a coding sequence.
Transcription and expression of the coding sequence are typically
impacted by the presence or absence of the regulatory sequence.
"Specifically hybridizes" refers to the ability of two nucleic acid
molecules to form an anti-parallel, double-stranded nucleic acid
structure. A nucleic acid molecule is said to be the "complement"
of another nucleic acid molecule if they exhibit "complete
complementarity," i.e., each nucleotide in one sequence is
complementary to its base pairing partner nucleotide in another
sequence. Two molecules are said to be "minimally complementary" if
they can hybridize to one another with sufficient stability to
permit them to remain annealed to one another under at least
conventional "low-stringency" conditions. Similarly, the molecules
are said to be "complementary" if they can hybridize to one another
with sufficient stability to permit them to remain annealed to one
another under conventional "high-stringency" conditions. Nucleic
acid molecules that hybridize to other nucleic acid molecules,
e.g., at least under low stringency conditions are said to be
"hybridizable cognates" of the other nucleic acid molecules.
Conventional low stringency and high stringency conditions are
described herein and by Sambrook et al., Molecular Cloning, A
Laboratory Manual, 2nd Ed., Cold Spring Harbor Press, Cold Spring
Harbor, N.Y. (1989) and by Haymes et al., Nucleic Acid
Hybridization, A Practical Approach, IRL Press, Washington, D.C.
(1985). Departures from complete complementarity are permissible,
as long as such departures do not completely preclude the capacity
of the molecules to form a double-stranded structure.
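The notion of "complete complementarity" can be illustrated computationally. In the Python sketch below, two DNA strands written 5' to 3' are treated as complements when one equals the reverse complement of the other. The function names are hypothetical, and the sketch models only Watson-Crick pairing of A, C, G, and T. It says nothing about hybridization stability under a given stringency.

```python
# Translation table for Watson-Crick base pairing (DNA only).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_complete_complement(a, b):
    """Return True if every base of a pairs with its partner in b.

    Both strands are written 5' to 3', so the duplex is anti-parallel.
    """
    return a.upper() == reverse_complement(b).upper()


print(reverse_complement("ATGC"))
print(is_complete_complement("ATGC", "GCAT"))
print(is_complete_complement("ATGC", "GCAA"))
```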
The term "substantially homologous" refers to two sequences which
are at least 90% identical in sequence, as measured by the BestFit
program described herein (Version 10; Genetics Computer Group,
Inc., Madison, Wis.), using default parameters.
"Substantially purified" refers to a molecule separated from
substantially all other molecules normally associated with it in
its native state. More preferably a substantially purified molecule
is the predominant species present in a preparation. A
substantially purified molecule may be greater than 60% free,
preferably 75% free, more preferably 90% free, and most preferably
95% free from the other molecules (exclusive of solvent) present in
the natural mixture. The term "substantially purified" is not
intended to encompass molecules present in their native state.
The term "transformation" refers to the introduction of nucleic
acid into a recipient host. The term "host" refers to bacteria
cells, fungi, animals and animal cells, plants and plant cells, or
any plant parts or tissues including protoplasts, calli, roots,
tubers, seeds, stems, leaves, seedlings, embryos, and pollen.
The term "transgenic" refers to an animal, plant, or other organism
containing one or more heterologous nucleic acid sequences.
DETAILED DESCRIPTION OF THE INVENTION
The present invention includes nucleic acid molecules comprising
promoter sequences useful for transcribing a heterologous
structural nucleic acid sequence in plants, and methods of
modifying, producing, and using the same. The invention also
includes compositions, transformed host cells, transgenic plants,
and seeds containing the promoters, and methods for preparing and
using the same.
Nucleic Acid Molecules
The present invention includes a nucleic acid molecule having a
nucleic acid sequence that hybridizes to SEQ ID NO:1 through SEQ ID
NO:57,467, or any complements thereof; or any fragments thereof.
The present invention also provides a nucleic acid molecule
comprising a nucleic acid sequence selected from the group
consisting of SEQ ID NO:1 through SEQ ID NO:57,467, any complements
thereof, and any fragments thereof.
Nucleic acid hybridization is a technique well known to those of
skill in the art of DNA manipulation. The hybridization properties
of a given pair of nucleic acids are an indication of their
similarity or identity.
Low stringency conditions may be used to select nucleic acid
sequences with lower sequence identities to a target nucleic acid
sequence. One may wish to employ conditions such as about 0.15 M to
about 0.9 M sodium chloride, at temperatures ranging from about
20° C. to about 55° C.
High stringency conditions may be used to select for nucleic acid
sequences with higher degrees of identity to the disclosed nucleic
acid sequences (Sambrook et al., 1989).
High stringency conditions typically involve nucleic acid
hybridization in about 2× to about 10× SSC (diluted from
a 20× SSC stock solution containing 3 M sodium chloride and
0.3 M sodium citrate, pH 7.0 in distilled water), about 2.5×
to about 5× Denhardt's solution (diluted from a 50×
stock solution containing 1% (w/v) bovine serum albumin, 1% (w/v)
ficoll, and 1% (w/v) polyvinylpyrrolidone in distilled water),
about 10 mg/mL to about 100 mg/mL fish sperm DNA, and about 0.02%
(w/v) to about 0.1% (w/v) SDS, with an incubation at about
50° C. to about 70° C. for several hours to
overnight. High stringency conditions are preferably provided by
6× SSC, 5× Denhardt's solution, 100 mg/mL fish sperm DNA,
and 0.1% (w/v) SDS, with an incubation at 55° C. for several
hours.
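The working concentrations above are obtained by diluting the stated stock solutions. A minimal Python sketch of that arithmetic follows, using the stock compositions given in the text (20× SSC containing 3 M sodium chloride and 0.3 M sodium citrate; 50× Denhardt's solution). The function names and the 100 mL batch size are assumptions made for the example.

```python
def stock_volume_ml(final_x, stock_x, final_volume_ml):
    """Return the volume of stock needed, from C1 * V1 = C2 * V2."""
    if final_x > stock_x:
        raise ValueError("cannot dilute to a concentration above the stock")
    return final_volume_ml * final_x / stock_x


def ssc_molarity(x):
    """Return salt molarities at x-fold SSC, given the 20x stock in the text."""
    return {"NaCl_M": 3.0 * x / 20, "citrate_M": 0.3 * x / 20}


# Preferred high stringency buffer, for a hypothetical 100 mL batch.
batch_ml = 100
print("20x SSC stock:", stock_volume_ml(6, 20, batch_ml), "mL")
print("50x Denhardt's stock:", stock_volume_ml(5, 50, batch_ml), "mL")
print("6x SSC salts:", ssc_molarity(6))
```

By this arithmetic, 6× SSC corresponds to 0.9 M sodium chloride and 1× SSC to 0.15 M sodium chloride.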
Hybridization is generally followed by several wash steps. The wash
compositions generally comprise 0.5× to about 10× SSC,
and 0.01% (w/v) to about 0.5% (w/v) SDS with a 15 minute incubation
at about 20° C. to about 70° C. Preferably, the
nucleic acid segments remain hybridized after washing at least one
time in 0.1× SSC at 65° C.
A nucleic acid molecule preferably comprises a nucleic acid
sequence that hybridizes, under low or high stringency conditions,
with SEQ ID NO:1 through SEQ ID NO:57,467, any complements thereof,
or any fragments thereof. A nucleic acid molecule most preferably
comprises a nucleic acid sequence that hybridizes under high
stringency conditions with SEQ ID NO:1 through SEQ ID NO:57,467,
any complements thereof, or any fragments thereof.
In an alternative embodiment, the nucleic acid molecule comprises a
nucleic acid sequence that exhibits 85% or greater identity, and
more preferably at least 86 or greater, 87 or greater, 88 or
greater, 89 or greater, 90 or greater, 91 or greater, 92 or
greater, 93 or greater, 94 or greater, 95 or greater, 96 or
greater, 97 or greater, 98 or greater, or 99% or greater identity
to a nucleic acid molecule selected from the group consisting of
SEQ ID NO:1 through SEQ ID NO:57,467 and complements thereof. The
nucleic acid molecule most preferably comprises a nucleic acid
sequence selected from the group consisting of SEQ ID NO:1 through
SEQ ID NO:57,467 and complements thereof.
The percent of sequence identity is preferably determined using the
"BestFit" or "Gap" program of the Sequence Analysis Software
Package.TM. (Version 10; Genetics Computer Group, Inc., Madison,
Wis.). "Gap" utilizes the algorithm of Needleman and Wunsch
(Needleman and Wunsch, Journal of Molecular Biology 48:443-453,
1970) to find the alignment of two sequences that maximizes the
number of matches and minimizes the number of gaps. "BestFit"
performs an optimal alignment of the best segment of similarity
between two sequences and inserts gaps to maximize the number of
matches using the local homology algorithm of Smith and Waterman
(Smith and Waterman, Advances in Applied Mathematics, 2:482-489,
1981; Smith et al., Nucleic Acids Research 11:2205-2220, 1983). The
percent identity is most preferably determined using the "BestFit"
program.
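To illustrate the kind of global alignment attributed above to Needleman and Wunsch, the following is a minimal dynamic-programming sketch. It is not the "Gap" program itself: the scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions chosen for brevity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))
```

A local (Smith and Waterman style) variant would differ mainly in clamping cell scores at zero and starting the traceback from the highest-scoring cell.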
As used herein "sequence identity" refers to the extent to which
two optimally aligned polynucleotide or peptide sequences are
invariant throughout a window of alignment of components, e.g.,
nucleotides or amino acids. An "identity fraction" for aligned
segments of a test sequence and a reference sequence is the number
of identical components which are shared by the two aligned
sequences divided by the total number of components in reference
sequence segment, i.e., the entire reference sequence or a smaller
defined part of the reference sequence. "Percent identity" is the
identity fraction times 100.
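The definition above can be written out directly: count the identical aligned positions and divide by the number of components in the reference segment. The sketch below assumes the two sequences are already aligned to equal length and that "-" marks a gap; both the gap symbol and the function name are illustrative assumptions.

```python
def percent_identity(aligned_test, aligned_ref, gap="-"):
    """Identity fraction times 100, with the reference segment as denominator."""
    if len(aligned_test) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    # Identical components shared by the two aligned sequences (gaps never count).
    identical = sum(
        1 for t, r in zip(aligned_test, aligned_ref) if t == r and r != gap
    )
    # Total number of components in the reference segment (gap columns excluded).
    ref_len = sum(1 for r in aligned_ref if r != gap)
    if ref_len == 0:
        raise ValueError("reference segment is empty")
    return 100.0 * identical / ref_len
```

Because the denominator is the reference segment, swapping the test and reference sequences can change the result when the alignment contains gaps.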
Useful methods for determining sequence identity are also disclosed
in Guide to Huge Computers, Martin J. Bishop, ed., Academic Press,
San Diego, 1994, and Carillo, H., and Lipton, D., Applied Math
(1988) 48:1073. More particularly, preferred computer programs for
determining sequence identity include the Basic Local Alignment
Search Tool (BLAST) programs which are publicly available from
the National Center for Biotechnology Information (NCBI) at the
National Library of Medicine, National Institutes of Health, Bethesda,
Md. 20894; see BLAST Manual, Altschul et al., NCBI, NLM, NIH; Altschul
et al., J. Mol. Biol. 215:403-410 (1990); version 2.0 or higher of
BLAST programs allows the introduction of gaps (deletions and
insertions) into alignments; for peptide sequence BLASTX can be
used to determine sequence identity; and, for polynucleotide
sequence BLASTN can be used to determine sequence identity.
For purposes of this invention "percent identity" may also be
determined using BLASTX version 2.0 for translated nucleotide
sequences and BLASTN version 2.0 for polynucleotide sequences. In a
preferred embodiment of the present invention, the presently
disclosed rice genomic promoter sequences comprise nucleic acid
molecules or fragments having a BLAST score of more than 200,
preferably a BLAST score of more than 300, and even more preferably
a BLAST score of more than 400 with their respective
homologues.
Nucleic acid molecules of the present invention include nucleic
acid sequences that are between about 0.01 Kb and about 50 Kb, more
preferably between about 0.1 Kb and about 25 Kb, even more
preferably between about 1 Kb and about 10 Kb, and most preferably
between about 3 Kb and about 10 Kb, about 3 Kb and about 7 Kb,
about 4 Kb and about 6 Kb, about 2 Kb and about 4 Kb, about 2 Kb
and about 5 Kb, about 1 Kb and about 5 Kb, about 1 Kb and about 3
Kb, or about 1 Kb and about 2 Kb.
Promoters
Any of the nucleic acid molecules described herein may comprise
nucleic acid sequences comprising promoters. Promoters of the
present invention can include between about 300 bp upstream and
about 10 kb upstream of the trinucleotide ATG sequence at the start
site of a protein coding region. Promoters of the present invention
can preferably include between about 300 bp upstream and about 5 kb
upstream of the trinucleotide ATG sequence at the start site of a
protein coding region. Promoters of the present invention can more
preferably include between about 300 bp upstream and about 2 kb
upstream of the trinucleotide ATG sequence at the start site of a
protein coding region. Promoters of the present invention can
include between about 300 bp upstream and about 1 kb upstream of
the trinucleotide ATG sequence at the start site of a protein
coding region. While in many circumstances a 300 bp promoter may be
sufficient for expression, additional sequences may act to further
regulate expression, for example, in response to biochemical,
developmental or environmental signals.
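The promoter windows described above are all measured upstream of the ATG at the start of a coding region. As a rough sketch, and assuming a genomic sequence held as a plain string with a known 0-based ATG position (the function name and interface are hypothetical), the candidate region can be sliced out as follows.

```python
def upstream_of_atg(genomic, atg_index, length_bp):
    """Return up to length_bp bases immediately 5' of the ATG at atg_index."""
    if genomic[atg_index:atg_index + 3].upper() != "ATG":
        raise ValueError("no ATG at the given position")
    # Clamp at the start of the sequence if less upstream sequence is available.
    start = max(0, atg_index - length_bp)
    return genomic[start:atg_index]


# Window sizes named in the text, in base pairs.
WINDOWS_BP = {"300 bp": 300, "1 kb": 1000, "2 kb": 2000, "5 kb": 5000, "10 kb": 10000}
```

For a gene on the reverse strand, the sequence would first need to be reverse-complemented so that "upstream" still means 5' of the ATG.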
It is also preferred that the promoters of the present invention
contain a CAAT and a TATA cis element. Moreover, the promoters of
the present invention can contain one or more cis elements in
addition to a CAAT and a TATA box.
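A simple way to check a candidate promoter for the two elements named above is a literal motif scan. The sketch below searches for the exact strings "CAAT" and "TATA" only; real CAAT and TATA boxes follow looser consensus patterns, so exact matching is a simplifying assumption, and the function names are hypothetical.

```python
import re


def find_motifs(seq, motifs=("CAAT", "TATA")):
    """Map each motif to the 0-based start positions of all (overlapping) hits."""
    seq = seq.upper()
    return {
        # A zero-width lookahead lets overlapping occurrences be reported.
        m: [hit.start() for hit in re.finditer("(?=" + re.escape(m) + ")", seq)]
        for m in motifs
    }


def has_caat_and_tata(seq):
    """True if the sequence contains at least one CAAT and one TATA match."""
    hits = find_motifs(seq)
    return bool(hits["CAAT"]) and bool(hits["TATA"])
```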
By "regulatory element" it is intended a series of nucleotides that
determines if, when, and at what level a particular gene is
expressed. The regulatory DNA sequences specifically interact with
regulatory or other proteins. Many regulatory elements act in cis
("cis elements") and are believed to affect DNA topology, producing
local conformations that selectively allow or restrict access of
RNA polymerase to the DNA template or that facilitate selective
opening of the double helix at the site of transcriptional
initiation. Cis elements occur within, but are not limited to
promoters, and promoter modulating sequences (inducible elements).
Cis elements can be identified using known cis elements as a target
sequence or target motif in the BLAST programs of the present
invention.
Promoters of the present invention include homologues of cis
elements known to effect gene regulation that show homology with
the promoter sequences of the present invention. These cis elements
include, but are not limited to, oxygen responsive cis elements
(Cowen et al., J Biol. Chem. 268(36):26904-26910 (1993)), light
regulatory elements (Bruce and Quail, Plant Cell 2(11):1081-1089
(1990); Bruce et al., EMBO J. 10:3015-3024 (1991); Rocholl et al.,
Plant Sci. 97:189-198 (1994); Block et al., Proc. Natl. Acad. Sci.
USA 87:5387-5391 (1990); Giuliano et al., Proc. Natl. Acad. Sci.
USA 85:7089-7093 (1988); Staiger et al., Proc. Natl. Acad. Sci. USA
86:6930-6934 (1989); Izawa et al., Plant Cell 6: 1277-1287 (1994);
Menkens et al., Trends in Biochemistry 20:506-510 (1995); Foster et
al., FASEB J. 8:192-200 (1994); Plesse et al., Mol Gen Genet
254:258-266 (1997); Green et al., EMBO J. 6:2543-2549 (1987);
Kuhlemeier et al., Ann. Rev Plant Physiol. 38:221-257 (1987);
Villain et al., J. Biol. Chem. 271:32593-32598 (1996); Lam et al.,
Plant Cell 2:857-866 (1990); Gilmartin et al., Plant Cell 2:369-378
(1990); Datta et al., Plant Cell 1:1069-1077 (1989); Gilmartin et
al., Plant Cell 2:369-378 (1990); Castresana et al., EMBO J.
7:1929-1936 (1988); Ueda et al., Plant Cell 1:217-227 (1989);
Terzaghi et al, Annu. Rev. Plant Physiol. Plant Mol. Biol.
46:445-474 (1995); Green et al., EMBO J. 6:2543-2549 (1987);
Villain et al., J. Biol. Chem. 271:32593-32598 (1996); Tjaden et
al., Plant Cell 6: 107-118 (1994); Tjaden et al., Plant Physiol.
108:1109-1117 (1995); Ngai et al., Plant J. 12:1021-1034 (1997);
Bruce et al., EMBO J 10:3015-3024 (1991); Ngai et al., Plant J.
12:1021-1034 (1997)), elements responsive to gibberellin, (Muller
et al., J. Plant Physiol. 145:606-613 (1995); Croissant et al.,
Plant Science 116:27-35 (1996); Lohmer et al., EMBO J. 10:617-624
(1991); Rogers et al., Plant Cell 4:1443-1451 (1992); Lanahan et
al., Plant Cell 4:203-211 (1992); Skriver et al., Proc. Natl. Acad.
Sci. USA 88:7266-7270 (1991); Gilmartin et al., Plant Cell
2:369-378 (1990); Huang et al., Plant Mol. Biol. 14:655-668 (1990),
Gubler et al., Plant Cell 7:1879-1891 (1995)), elements responsive
to abscisic acid, (Busk et al., Plant Cell 9:2261-2270 (1997);
Guiltinan et al., Science 250:267-270 (1990); Shen et al., Plant
Cell 7:295-307 (1995); Shen et al., Plant Cell 8:1107-1119 (1996);
Seo et al., Plant Mol. Biol. 27:1119-1131 (1995); Marcotte et al.,
Plant Cell 1:969-976 (1989); Shen et al., Plant Cell 7:295-307
(1995); Iwasaki et al., Mol Gen Genet 247:391-398 (1995); Hattori
et al., Genes Dev. 6:609-618 (1992); Thomas et al., Plant Cell
5:1401-1410 (1993)), elements similar to abscisic acid responsive
elements, (Ellerstrom et al., Plant Mol. Biol. 32:1019-1027
(1996)), auxin responsive elements (Liu et al., Plant Cell
6:645-657 (1994); Liu et al., Plant Physiol. 115:397-407 (1997);
Kosugi et al., Plant J. 7:877-886 (1995); Kosugi et al., Plant Cell
9:1607-1619 (1997); Ballas et al., J. Mol. Biol. 233:580-596
(1993)), a cis element responsive to methyl jasmonate treatment
(Beaudoin and Rothstein, Plant Mol. Biol. 33:835-846 (1997)), a cis
element responsive to abscisic acid and stress response (Straub et
al., Plant Mol. Biol. 26:617-630 (1994)), ethylene responsive cis
elements (Itzhaki et al., Proc. Natl. Acad. Sci. USA 91:8925-8929
(1994); Montgomery et al., Proc. Natl. Acad. Sci. USA 90:5939-5943
(1993); Sessa et al., Plant Mol. Biol. 28:145-153 (1995); Shinshi
et al., Plant Mol. Biol. 27:923-932 (1995)), salicylic acid cis
responsive elements, (Strange et al., Plant J. 11:1315-1324 (1997);
Qin et al., Plant Cell 6:863-874 (1994)), a cis element that
responds to water stress and abscisic acid (Lam et al., J. Biol.
Chem. 266:17131-17135 (1991); Thomas et al., Plant Cell 5:1401-1410
(1993); Pla et al., Plant Mol Biol 21:259-266 (1993)), a cis
element essential for M phase-specific expression (Ito et al.,
Plant Cell 10:331-341 (1998)), sucrose responsive elements (Huang
et al., Plant Mol. Biol. 14:655-668 (1990); Hwang et al., Plant Mol
Biol 36:331-341 (1998); Grierson et al., Plant J. 5:815-826
(1994)), heat shock response elements (Pelham et al., Trends Genet.
1:31-35 (1985)), elements responsive to auxin and/or salicylic acid
and also reported for light regulation (Lam et al., Proc. Natl.
Acad Sci. USA 86:7890-7897 (1989); Benfey et al., Science
250:959-966 (1990)), elements responsive to ethylene and salicylic
acid (Ohme-Takagi et al., Plant Mol. Biol. 15:941-946 (1990)),
elements responsive to wounding and abiotic stress (Loake et al.,
Proc. Natl. Acad. Sci. USA 89:9230-9234 (1992); Mhiri et al., Plant
Mol. Biol. 33:257-266 (1997)), antioxidant response elements
(Rushmore et al., J. Biol. Chem. 266:11632-11639; Dalton et al.,
Nucleic Acids Res. 22:5016-5023 (1994)), Sph elements (Suzuki et
al., Plant Cell 9:799-807 (1997)), elicitor responsive elements,
(Fukuda et al., Plant Mol. Biol. 34:81-87 (1997); Rushton et al.,
EMBO J. 15:5690-5700 (1996)), metal responsive elements (Stuart et
al., Nature 317:828-831 (1985); Westin et al., EMBO J. 7:3763-3770
(1988); Thiele et al., Nucleic Acids Res. 20:1183-1191 (1992);
Faisst et al., Nucleic Acids Res. 20:3-26 (1992)), low temperature
responsive elements, (Baker et al., Plant Mol. Biol. 24:701-713
(1994); Jiang et al., Plant Mol. Biol. 30:679-684 (1996); Nordin et
al., Plant Mol. Biol. 21:641-653 (1993); Zhou et al., J. Biol.
Chem. 267:23515-23519 (1992)), drought responsive elements,
(Yamaguchi et al., Plant Cell 6:251-264 (1994); Wang et al., Plant
Mol. Biol. 28:605-617 (1995); Bray E A, Trends in Plant Science
2:48-54 (1997)) enhancer elements for glutenin, (Colot et al., EMBO
J. 6:3559-3564 (1987); Thomas et al., Plant Cell 2:1171-1180
(1990); Kreis et al., Philos. Trans. R. Soc. Lond., B314:355-365
(1986)), light-independent regulatory elements, (Lagrange et al.,
Plant Cell 9:1469-1479 (1997); Villain et al., J. Biol. Chem.
271:32593-32598 (1996)), OCS enhancer elements, (Bouchez et al., EMBO
J. 8:4197-4204 (1989); Foley et al., Plant J. 3:669-679 (1993)),
ACGT elements, (Foster et al., FASEB J. 8:192-200 (1994); Izawa et
al., Plant Cell 6:1277-1287 (1994); Izawa et al., J. Mol. Biol.
230:1131-1144 (1993)), negative cis elements in plastid related
genes, (Zhou et al., J. Biol. Chem. 267:23515-23519 (1992);
Lagrange et al., Mol. Cell Biol. 13:2614-2622 (1993); Lagrange et
al., Plant Cell 9:1469-1479 (1997); Zhou et al., J. Biol. Chem.
267:23515-23519 (1992)), prolamin box elements, (Forde et al.,
Nucleic Acids Res. 13:7327-7339 (1985); Colot et al., EMBO J.
6:3559-3564 (1987); Thomas et al., Plant Cell 2:1171-1180 (1990);
Thompson et al., Plant Mol. Biol. 15:755-764 (1990); Vicente et
al., Proc. Natl. Acad. Sci. USA 94:7685-7690 (1997)), elements in
enhancers from the IgM heavy chain gene (Gillies et al., Cell
33:717-728 (1983); Whittier et al., Nucleic Acids Res. 15:2515-2535
(1987)).
Promoter Activity
The activity or strength of a promoter may be measured in terms of
the amount of mRNA or protein accumulation it specifically
produces, relative to the total amount of mRNA or protein. The
promoter preferably expresses an operably linked nucleic acid
sequence at a level greater than 0.01%; more preferably greater
than 0.05, 0.1, 0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, or 20% (w/w) of the total cellular
RNA or protein.
As used herein, an "expression pattern" is any pattern of
differential gene expression. In a preferred embodiment, an
expression pattern is selected from the group consisting of tissue,
temporal, spatial, developmental, stress, environmental,
physiological, pathological, cell cycle, and chemically responsive
expression patterns.
As used herein, an "enhanced expression pattern" is any expression
pattern for which an operably linked nucleic acid sequence is
expressed at a level greater than 0.01%; more preferably greater
than 0.05, 0.1, 0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, or 20%(w/w) of the total cellular
RNA or protein.
Alternatively, the activity or strength of a promoter may be
expressed relative to a well-characterized promoter (for which
transcriptional activity was previously assessed). For example, a
less-characterized promoter may be operably linked to a reporter
sequence (e.g., GUS) and introduced into a specific cell type. A
well-characterized promoter (e.g. the 35S promoter) is similarly
prepared and introduced into the same cellular context.
Transcriptional activity of the unknown promoter is determined by
comparing the amount of reporter expression, relative to the well
characterized promoter. In one embodiment, the activity of the
present promoter is as strong as the 35S promoter when compared in
the same cellular context. The cellular context is preferably rice,
sorghum, maize, barley, wheat, canola, or soybean; and more
preferably is rice, sorghum, maize, barley, or wheat; and most
preferably is rice.
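The comparison against a well-characterized promoter reduces to a ratio of reporter signals measured in the same cellular context. The following sketch assumes replicate reporter readings (for example GUS activity in arbitrary units) are available for both constructs; the function name and the numbers in the usage note are hypothetical.

```python
from statistics import mean


def relative_activity(test_readings, reference_readings):
    """Mean reporter signal of the test promoter divided by that of the reference."""
    reference_mean = mean(reference_readings)
    if reference_mean == 0:
        raise ValueError("reference promoter shows no reporter signal")
    return mean(test_readings) / reference_mean
```

With hypothetical readings, `relative_activity([150.0, 170.0], [100.0, 60.0])` gives 2.0, i.e. the test promoter drives twice the reporter signal of the reference; a value of about 1.0 would correspond to a promoter "as strong as" the reference.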
Structural Nucleic Acid Sequences
The promoter of the present invention may be operably linked to a
structural nucleic acid sequence that is heterologous with respect
to the promoter. The structural nucleic acid sequence may generally
be any nucleic acid sequence for which an increased level of
transcription is desired. The structural nucleic acid sequence
preferably encodes a polypeptide that is suitable for incorporation
into the diet of a human or an animal. Suitable structural nucleic
acid sequences include those encoding a yield protein, a stress
resistance protein, a developmental control protein, a tissue
differentiation protein, a meristem protein, an environmentally
responsive protein, a senescence protein, a hormone responsive
protein, an abscission protein, a source protein, a sink protein, a
flower control protein, a seed protein, an herbicide resistance
protein, a disease resistance protein, a fatty acid biosynthetic
enzyme, a tocopherol biosynthetic enzyme, an amino acid
biosynthetic enzyme, and an insecticidal protein.
Alternatively, the promoter and structural nucleic acid sequence
may be designed to down-regulate a specific nucleic acid sequence.
This is typically accomplished by linking the promoter to a
structural nucleic acid sequence that is oriented in the antisense
direction. One of ordinary skill in the art is familiar with such
antisense technology. Briefly, as the antisense nucleic acid
sequence is transcribed, it hybridizes to and sequesters a
complementary nucleic acid sequence inside the cell. This duplex
RNA molecule cannot be translated into a protein by the cell's
translational machinery. Any nucleic acid sequence may be
negatively regulated in this manner.
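Orienting a sequence in the antisense direction amounts to placing its reverse complement downstream of the promoter, so that the transcript is complementary to the target. A minimal sketch of that operation on a DNA string follows; the function name is an illustrative choice.

```python
# Translation table mapping each base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (case preserved)."""
    return dna.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which is a convenient sanity check when building antisense constructs in silico.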
Modified Structural Nucleic Acid Sequences
The promoter of the present invention may also be operably linked
to a modified structural nucleic acid sequence that is heterologous
with respect to the promoter. The structural nucleic acid sequence
may be modified to provide various desirable features. For example,
a structural nucleic acid sequence may be modified to increase the
content of essential amino acids, enhance translation of the amino
acid sequence, alter post-translational modifications (e.g.,
phosphorylation sites), transport a translated product to a
compartment inside or outside of the cell, improve protein
stability, insert or delete cell signaling motifs, etc.
Codon Usage in Structural Nucleic Acid Sequences
Due to the degeneracy of the genetic code, different nucleotide
codons may be used to code for a particular amino acid. A host cell
often displays a preferred pattern of codon usage. Structural
nucleic acid sequences are preferably constructed to utilize the
codon usage pattern of the particular host cell. This generally
enhances the expression of the structural nucleic acid sequence
in a transformed host cell. Any of the above described nucleic acid
and amino acid sequences may be modified to reflect the preferred
codon usage of a host cell or organism in which they are contained.
Modification of a structural nucleic acid sequence for optimal
codon usage in plants is described in U.S. Pat. No. 5,689,052.
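Recoding for host codon preference can be sketched as a codon-by-codon substitution that leaves the encoded amino acid sequence unchanged. The example below builds the standard genetic code and swaps in a preferred codon where one is supplied. The preference table used in the usage note is purely hypothetical and is not rice codon usage data; the function names are likewise assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate(cds):
    """Translate a coding sequence (length divisible by 3) to one-letter amino acids."""
    cds = cds.upper()
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def recode(cds, preferred):
    """Replace each codon with the host-preferred synonym, if one is given."""
    cds = cds.upper()
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        amino = CODON_TABLE[codon]
        new_codon = preferred.get(amino, codon)
        # Guard against a preference table that would change the protein.
        if CODON_TABLE[new_codon] != amino:
            raise ValueError("preferred codon does not encode " + amino)
        out.append(new_codon)
    return "".join(out)
```

For instance, with the hypothetical preferences `{"A": "GCC", "L": "CTG", "K": "AAG"}`, `recode("ATGGCATTAAAATAA", ...)` yields `"ATGGCCCTGAAGTAA"`, and both sequences translate to `"MALK*"`.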
Other Modifications of Structural Nucleic Acid Sequences
Additional variations in the structural nucleic acid sequences
described above may encode proteins having equivalent or superior
characteristics when compared to the proteins from which they are
engineered. Mutations may include deletions, insertions,
truncations, substitutions, fusions, shuffling of motif sequences,
and the like.
Mutations to a structural nucleic acid sequence may be introduced
in either a specific or random manner, both of which are well known
to those of skill in the art of molecular biology. A myriad of
site-directed mutagenesis techniques exist, typically using
oligonucleotides to introduce mutations at specific locations in a
structural nucleic acid sequence. Examples include single strand
rescue (Kunkel et al., Proc. Natl. Acad. Sci. U.S.A., 82: 488-492,
1985), unique site elimination (Deng and Nickoloff, Anal. Biochem.
200:81, 1992), nick protection (Vandeyar, et al. Gene 65: 129-133,
1988), and PCR (Costa et al., Methods Mol. Biol. 57: 31-44, 1996).
Random or non-specific mutations may be generated by chemical
agents (for a general review, see Singer and Kusmierek, Ann. Rev.
Biochem. 52: 655-693, 1982) such as nitrosoguanidine (Cerda-Olmedo
et al., J. Mol. Biol. 33: 705-719, 1968; Guerola, et al. Nature New
Biol. 230: 122-125, 1971) and 2-aminopurine (Rogan and Bessman, J.
Bacteriol. 103: 622-633, 1970); or by biological methods such as
passage through mutator strains (Greener, et al. Mol. Biotechnol.
7:189-195,1997).
The modifications may result in either conservative or
non-conservative changes in the amino acid sequence. Conservative
changes result from additions, deletions, substitutions, etc. in
the structural nucleic acid sequence which do not alter the final
amino acid sequence of the protein. In a preferred embodiment, the
protein has between 20 and 500 conservative changes, more
preferably between 15 and 300 conservative changes, even more
preferably between 10 and 150 conservative changes, and most
preferably between 5 and 75 conservative changes.
Non-conservative changes include additions, deletions, and
substitutions which result in an altered amino acid sequence. In a
preferred embodiment, the protein has between 10 and 250
non-conservative amino acid changes, more preferably between 5 and
100 non-conservative amino acid changes, even more preferably
between 2 and 50 non-conservative amino acid changes, and most
preferably between 1 and 30 non-conservative amino acid
changes.
Additional methods of making the alterations described above are
described by Ausubel et al., Current Protocols in Molecular
Biology, John Wiley and Sons, Inc., 1995, Bauer et al., Gene,
37:73, 1985; Craik, BioTechniques, 3: 12-19, 1985; Frits Eckstein
et al., Nucleic Acids Research, 10: 6487-6497, 1982; Sambrook, et
al., Molecular Cloning: A Laboratory Manual, Second Edition, Cold
Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 1989,
Smith, et al., In: Genetic Engineering: Principles and Methods,
Setlow et al., Eds., Plenum Press, N.Y., 1-32, 1981, and Osuna, et
al., Critical Reviews In Microbiology, 20: 107-116, 1994.
Modifications may be made to the protein sequences of the present
invention and the nucleic acid segments which encode them that
maintain the desired properties of the molecule. The following is a
discussion based upon changing the amino acid sequence of a protein
to create an equivalent, or possibly an improved, second-generation
molecule. The amino acid changes may be achieved by changing the
codons of the structural nucleic acid sequence, according to the
codons given in Table A.
TABLE-US-00001 TABLE A Codon degeneracy of amino acids

Amino acid      One letter  Three letter  Codons
Alanine         A           Ala           GCA GCC GCG GCT
Cysteine        C           Cys           TGC TGT
Aspartic acid   D           Asp           GAC GAT
Glutamic acid   E           Glu           GAA GAG
Phenylalanine   F           Phe           TTC TTT
Glycine         G           Gly           GGA GGC GGG GGT
Histidine       H           His           CAC CAT
Isoleucine      I           Ile           ATA ATC ATT
Lysine          K           Lys           AAA AAG
Leucine         L           Leu           TTA TTG CTA CTC CTG CTT
Methionine      M           Met           ATG
Asparagine      N           Asn           AAC AAT
Proline         P           Pro           CCA CCC CCG CCT
Glutamine       Q           Gln           CAA CAG
Arginine        R           Arg           AGA AGG CGA CGC CGG CGT
Serine          S           Ser           AGC AGT TCA TCC TCG TCT
Threonine       T           Thr           ACA ACC ACG ACT
Valine          V           Val           GTA GTC GTG GTT
Tryptophan      W           Trp           TGG
Tyrosine        Y           Tyr           TAC TAT
In making such changes, the hydropathic index of amino acids may be
considered. The importance of the hydropathic amino acid index in
conferring interactive biological function on a protein is
generally understood in the art (Kyte and Doolittle, J. Mol. Biol.,
157: 105-132, 1982). It is accepted that the relative hydropathic
character of the amino acid contributes to the secondary structure
of the resultant protein, which in turn defines the interaction of
the protein with other molecules, for example, enzymes, substrates,
receptors, DNA, antibodies, antigens, and the like.
Each amino acid has been assigned a hydropathic index on the basis
of its hydrophobicity and charge characteristics. These are:
isoleucine (+4.5); valine (+4.2); leucine (+3.8); phenylalanine
(+2.8); cysteine/cystine (+2.5); methionine (+1.9); alanine
(+1.8); glycine (-0.4); threonine (-0.7); serine (-0.8); tryptophan
(-0.9); tyrosine (-1.3); proline (-1.6); histidine (-3.2);
glutamate/glutamine/aspartate/asparagine (-3.5); lysine (-3.9); and
arginine (-4.5).
It is known in the art that certain amino acids may be substituted
by other amino acids having a similar hydropathic index or score
and still result in a protein with similar biological activity,
i.e., still obtain a biologically functional protein. In making
such changes, the substitution of amino acids whose hydropathic
indices are within .+-.2 is preferred, those within .+-.1 are more
preferred, and those within .+-.0.5 are most preferred.
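The three preference tiers above can be expressed as a lookup on the difference between two hydropathic indices. The sketch below uses the Kyte and Doolittle values (arginine is -4.5 on that scale) and treats each bound as inclusive, which is an assumption about how "within" is meant; the function name is hypothetical.

```python
# Kyte-Doolittle hydropathic indices, keyed by one-letter amino acid code.
HYDROPATHY = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9,
    "R": -4.5,
}


def hydropathy_tier(original, substitute):
    """Classify a substitution by the difference in hydropathic index."""
    # Round to suppress floating-point noise at the tier boundaries.
    diff = round(abs(HYDROPATHY[original] - HYDROPATHY[substitute]), 6)
    if diff <= 0.5:
        return "most preferred"
    if diff <= 1.0:
        return "more preferred"
    if diff <= 2.0:
        return "preferred"
    return None  # outside the +/-2 window
```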
It is also understood in the art that the substitution of like
amino acids may be made effectively on the basis of hydrophilicity.
U.S. Pat. No. 4,554,101 (Hopp) states that the greatest local
average hydrophilicity of a protein, as governed by the
hydrophilicity of its adjacent amino acids, correlates with a
biological property of the protein. The following hydrophilicity
values have been assigned to amino acids: arginine/lysine (+3.0);
aspartate/glutamate (+3.0.+-.1); serine (+0.3);
asparagine/glutamine (+0.2); glycine (0); threonine (-0.4); proline
(-0.5.+-.1); alanine/histidine (-0.5); cysteine (-1.0); methionine
(-1.3); valine (-1.5); leucine/isoleucine (-1.8); tyrosine (-2.3);
phenylalanine (-2.5); and tryptophan (-3.4).
It is understood that an amino acid may be substituted by another
amino acid having a similar hydrophilicity score and still result
in a protein with similar biological activity, i.e., still obtain a
biologically functional protein. In making such changes, the
substitution of amino acids whose hydrophilicity values are within
.+-.2 is preferred, those within .+-.1 are more preferred, and
those within .+-.0.5 are most preferred.
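The same kind of window can be applied to the hydrophilicity values listed above to enumerate candidate replacements for a given residue. This sketch uses the central values only, ignoring the .+-.1 tolerance quoted for aspartate/glutamate and proline, and treats the window bound as inclusive; both simplifications and the function name are assumptions.

```python
# Hydrophilicity values as listed in the text, keyed by one-letter code.
HYDROPHILICITY = {
    "R": 3.0, "K": 3.0, "D": 3.0, "E": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "T": -0.4, "P": -0.5, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "L": -1.8, "I": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}


def similar_by_hydrophilicity(residue, window=2.0):
    """List other residues whose value lies within +/-window, closest first."""
    base = HYDROPHILICITY[residue]
    scored = []
    for other, value in HYDROPHILICITY.items():
        if other == residue:
            continue
        # Round to suppress floating-point noise at the window boundary.
        diff = round(abs(value - base), 6)
        if diff <= window:
            scored.append((diff, other))
    return [other for diff, other in sorted(scored)]
```

Note that a purely hydrophilicity-based window groups lysine with aspartate and glutamate, so charge and size would still need to be considered separately, as the text goes on to say.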
As outlined above, amino acid substitutions are therefore based on
the relative similarity of the amino acid side-chain substituents,
for example, their hydrophobicity, hydrophilicity, charge, size,
and the like. Exemplary substitutions which take various of the
foregoing characteristics into consideration are well known to
those of skill in the art and include: arginine and lysine;
glutamate and aspartate; serine and threonine; glutamine and
asparagine; and valine, leucine, and isoleucine. Changes which are
not expected to be advantageous may also be used if these result in
proteins having improved rumen resistance, increased resistance to
proteolytic degradation, or both improved rumen resistance and
increased resistance to proteolytic degradation, relative to the
unmodified polypeptide from which they are engineered.
Recombinant Vectors
Any of the promoters and structural nucleic acid sequences
described above may be provided in a recombinant vector. A
recombinant vector typically comprises, in a 5' to 3' orientation:
a promoter to direct the transcription of a structural nucleic acid
sequence and a structural nucleic acid sequence. The recombinant
vector may further comprise a 3' transcriptional terminator, a 3'
polyadenylation signal, other untranslated nucleic acid sequences,
transit and targeting nucleic acid sequences, selectable markers,
enhancers, and operators, as desired.
Means for preparing recombinant vectors are well known in the art.
Methods for making recombinant vectors particularly suited to plant
transformation include, without limitation, those described in U.S.
Pat. Nos. 4,971,908, 4,940,835, 4,769,061 and 4,757,011. These types
of vectors have also been reviewed (Rodriguez, et al. Vectors: A
Survey of Molecular Cloning Vectors and Their Uses, Butterworths,
Boston, 1988; Glick et al., Methods in Plant Molecular Biology and
Biotechnology, CRC Press, Boca Raton, Fla., 1993).
Typical vectors useful for expression of nucleic acids in higher
plants are well known in the art and include vectors derived from
the tumor-inducing (Ti) plasmid of Agrobacterium tumefaciens
(Rogers, et al., Meth. In Enzymol, 153: 253-277, 1987). Other
recombinant vectors useful for plant transformation, including the
pCaMVCN transfer control vector, have also been described (Fromm et
al., Proc. Natl. Acad. Sci. USA, 82(17): 5824-5828, 1985).
Promoters in the Recombinant Vectors
The promoter used in the recombinant vector preferably transcribes
a heterologous structural nucleic acid sequence at a high level in
a plant. More preferably, the promoter hybridizes to a nucleic acid
sequence selected from the group consisting of SEQ ID NO:1 through
SEQ ID NO:57,467, or any complements thereof; or any fragments
thereof. Suitable hybridization conditions include those described
above. A nucleic acid sequence of the promoter preferably
hybridizes, under low or high stringency conditions, with SEQ ID
NO:1 through SEQ ID NO:57,467, and any complements thereof. The
promoter most preferably hybridizes under high stringency
conditions to a nucleic acid sequence selected from the group
consisting of SEQ ID NO:1 through SEQ ID NO:57,467, and any
complements thereof.
In an alternative embodiment, the promoter comprises a nucleic acid
sequence that exhibits 85% or greater identity, and more preferably
at least 86 or greater, 87 or greater, 88 or greater, 89 or
greater, 90 or greater, 91 or greater, 92 or greater, 93 or
greater, 94 or greater, 95 or greater, 96 or greater, 97 or
greater, 98 or greater, or 99% or greater identity to a nucleic
acid sequence selected from the group consisting of SEQ ID NO:1
through SEQ ID NO:57,467, and complements thereof. The promoter
most preferably comprises a nucleic acid sequence selected from the
group consisting of SEQ ID NO:1 through SEQ ID NO:57,467, any
complements thereof, and any fragments thereof.
Additional Promoters in the Recombinant Vector
One or more additional promoters may also be provided in the
recombinant vector. These promoters may be operably linked to any
of the structural nucleic acid sequences described above.
Alternatively, the promoters may be operably linked to other
nucleic acid sequences, such as those encoding transit peptides,
selectable marker proteins, or antisense sequences.
These additional promoters may be selected on the basis of the cell
type into which the vector will be inserted. Promoters which
function in bacteria, yeast, and plants are all well taught in the
art. The additional promoters may also be selected on the basis of
their regulatory features. Examples of such features include
enhancement of transcriptional activity, inducibility,
tissue-specificity, and developmental stage-specificity. In plants,
promoters that are inducible, of viral or synthetic origin,
constitutively active, temporally regulated, and spatially
regulated have been described (Poszkowski, et al., EMBO J., 3:
2719, 1989; Odell, et al., Nature, 313:810, 1985; Chau et al.,
Science, 244:174-181, 1989).
Often-used constitutive promoters include the CaMV 35S promoter
(Odell, et al., Nature, 313: 810, 1985), the enhanced CaMV 35S
promoter, the Figwort Mosaic Virus (FMV) promoter (Richins, et al.,
Nucleic Acids Res. 20: 8451, 1987), the mannopine synthase (mas)
promoter, the nopaline synthase (nos) promoter, and the octopine
synthase (ocs) promoter.
Useful inducible promoters include promoters induced by salicylic
acid or polyacrylic acids (PR-1; Williams, et al., Biotechnology
10:540-543, 1992), induced by application of safeners (substituted
benzenesulfonamide herbicides; Hershey and Stoner, Plant Mol. Biol.
17: 679-690, 1991), heat-shock promoters (Ou-Lee et al., Proc.
Natl. Acad. Sci U.S.A. 83: 6815, 1986; Ainley et al., Plant Mol.
Biol. 14: 949, 1990), a nitrate-inducible promoter derived from the
spinach nitrite reductase structural nucleic acid sequence (Back et
al., Plant Mol. Biol. 17: 9, 1991), hormone-inducible promoters
(Yamaguchi-Shinozaki et al., Plant Mol. Biol. 15: 905, 1990), and
light-inducible promoters associated with the small subunit of RuBP
carboxylase and LHCP families (Kuhlemeier et al., Plant Cell 1:
471, 1989; Feinbaum et al., Mol. Gen. Genet. 226: 449-456, 1991;
Weisshaar, et al., EMBO J. 10: 1777-1786, 1991; Lam and Chua, J.
Biol. Chem. 266: 17131-17135, 1990; Castresana et al., EMBO J. 7:
1929-1936,1988; Schulze-Lefert, et al., EMBO J. 8: 651, 1989).
Examples of useful tissue-specific, developmentally-regulated
promoters include the .beta.-conglycinin 7S.alpha. promoter (Doyle
et al., J. Biol. Chem. 261: 9228-9238,1986; Slighton and Beachy,
Planta 172: 356, 1987), and seed-specific promoters (Knutzon, et
al., Proc. Natl. Acad. Sci. U.S.A. 89: 2624-2628, 1992; Bustos, et
al., EMBO J 10: 1469-1479, 1991; Lam and Chua, Science 248: 471,
1991). Plant functional promoters useful for preferential
expression in seed plastid include those from plant storage
proteins and from proteins involved in fatty acid biosynthesis in
oilseeds. Examples of such promoters include the 5' regulatory
regions from such structural nucleic acid sequences as napin (Kridl
et al., Seed Sci. Res. 1: 209, 1991), phaseolin, zein, soybean
trypsin inhibitor, ACP, stearoyl-ACP desaturase, and oleosin.
Seed-specific regulation is discussed in EP 0 255 378.
Another exemplary tissue-specific promoter is the lectin promoter,
which is specific for seed tissue. The lectin protein in soybean
seeds is encoded by a single structural nucleic acid sequence (Lel)
that is only expressed during seed maturation and accounts for
about 2 to about 5% of total seed mRNA. The lectin structural
nucleic acid sequence and seed-specific promoter have been fully
characterized and used to direct seed-specific expression in
transgenic tobacco plants (Vodkin, et al., Cell, 34: 1023, 1983;
Lindstrom, et al., Developmental Genetics, 11: 160, 1990).
Particularly preferred additional promoters in the recombinant
vector include the nopaline synthase (nos), mannopine synthase
(mas), and octopine synthase (ocs) promoters, which are carried on
tumor-inducing plasmids of Agrobacterium tumefaciens; the
cauliflower mosaic virus (CaMV) 19S and 35S promoters; the enhanced
CaMV 35S promoter; the Figwort Mosaic Virus (FMV) 35S promoter; the
light-inducible promoter from the small subunit of
ribulose-1,5-bisphosphate carboxylase (ssRUBISCO); the EIF-4A
promoter from tobacco (Mandel, et al., Plant Mol. Biol, 29:
995-1004, 1995); corn sucrose synthetase 1 (Yang, et al., Proc.
Natl. Acad. Sci. USA, 87: 4144-48, 1990); corn alcohol
dehydrogenase 1 (Vogel, et al., J. Cell Biochem., (Suppl) 13D: 312,
1989); corn light harvesting complex (Simpson, Science, 233: 34,
1986); corn heat shock protein (Odell, et al., Nature, 313: 810,
1985); the chitinase promoter from Arabidopsis (Samac, et al.,
Plant Cell, 3:1063-1072, 1991); the LTP (Lipid Transfer Protein)
promoters from broccoli (Pyee, et al., Plant J., 7: 49-59, 1995);
petunia chalcone isomerase (Van Tunen, et al., EMBO J. 7: 1257,
1988); bean glycine rich protein 1 (Keller, et al., EMBO J., 8:
1309-1314, 1989); potato patatin (Wenzler, et al., Plant Mol.
Biol., 12: 41-50, 1989); the ubiquitin promoter from maize
(Christensen et al., Plant Mol. Biol., 18: 675-689, 1992); and the
actin promoter from rice (McElroy, et al., Plant Cell, 2:163-171,
1990).
The additional promoter is preferably seed selective, tissue
selective, constitutive, or inducible. The promoter is most
preferably the nopaline synthase (NOS), octopine synthase (OCS),
mannopine synthase (MAS), cauliflower mosaic virus 19S and 35S
(CaMV19S, CaMV35S), enhanced CaMV (eCaMV), ribulose
1,5-bisphosphate carboxylase (ssRUBISCO), figwort mosaic virus
(FMV), CaMV derived AS4, tobacco RB7, wheat PDX1, tobacco EIF-4,
lectin protein (Lel), or rice RC2 promoter.
Structural Nucleic Acid Sequences in the Recombinant Nucleic Acid
Vector
The promoter in the recombinant vector is preferably operably
linked to a structural nucleic acid sequence. Exemplary structural
nucleic acid sequences, and modified forms thereof, are described
in detail above. The promoter of the present invention may be
operably linked to a structural nucleic acid sequence that is
heterologous with respect to the promoter. In one aspect, the
structural nucleic acid sequence may generally be any nucleic acid
sequence for which an increased level of transcription is desired.
The structural nucleic acid sequence preferably encodes a
polypeptide that is suitable for incorporation into the diet of a
human or an animal. Suitable structural nucleic acid sequences
include those encoding a yield protein, a stress resistance
protein, a developmental control protein, a tissue differentiation
protein, a meristem protein, an environmentally responsive protein,
a senescence protein, a hormone responsive protein, an abscission
protein, a source protein, a sink protein, a flower control
protein, a seed protein, an herbicide resistance protein, a disease
resistance protein, a fatty acid biosynthetic enzyme, a tocopherol
biosynthetic enzyme, an amino acid biosynthetic enzyme, and an
insecticidal protein.
Alternatively, the promoter and structural nucleic acid sequence
may be designed to down-regulate a specific nucleic acid sequence.
This is typically accomplished by linking the promoter to a
structural nucleic acid sequence that is oriented in the antisense
direction. One of ordinary skill in the art is familiar with such
antisense technology. Using such an approach, a cellular nucleic
acid sequence is effectively down regulated as the subsequent steps
of translation are disrupted. Nucleic acid sequences may be
negatively regulated in this manner.
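As a minimal illustration of what "antisense orientation" means at
the sequence level, the following Python sketch computes the reverse
complement of a sense-strand fragment. The function name and the
example fragment are hypothetical and are not taken from the
disclosed sequences.

```python
# Illustrative sketch only: helper name and example fragment are
# hypothetical, not part of the disclosure.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


sense_fragment = "ATGGCTTCA"  # made-up sense-strand fragment
antisense_fragment = reverse_complement(sense_fragment)
print(antisense_fragment)
```

Placing the reverse-complemented fragment downstream of a promoter
is one way to picture a structural sequence "oriented in the
antisense direction"; an actual construct would also need the other
vector elements described in this section.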
Recombinant Vectors Having Additional Structural Nucleic Acid
Sequences
The recombinant vector may also contain one or more additional
structural nucleic acid sequences. These additional structural
nucleic acid sequences may generally be any sequences suitable for
use in a recombinant vector. Such structural nucleic acid sequences
include any of the structural nucleic acid sequences, and modified
forms thereof, described above. The additional structural nucleic
acid sequences may also be operably linked to any of the
above-described promoters. The one or more structural nucleic acid
sequences may each be operably linked to separate promoters.
Alternatively, the structural nucleic acid sequences may be
operably linked to a single promoter (i.e. a single operon).
The additional structural nucleic acid sequences preferably encode
a yield protein, a stress resistance protein, a developmental
control protein, a tissue differentiation protein, a meristem
protein, an environmentally responsive protein, a senescence
protein, a hormone responsive protein, an abscission protein, a
source protein, a sink protein, a flower control protein, a seed
protein, an herbicide resistance protein, a disease resistance
protein, a fatty acid biosynthetic enzyme, a tocopherol
biosynthetic enzyme, an amino acid biosynthetic enzyme, and an
insecticidal protein.
Alternatively, the second structural nucleic acid sequence may be
designed to down-regulate a specific nucleic acid sequence. This is
typically accomplished by operably linking the second structural
nucleic acid sequence, in an antisense orientation, with a
promoter. One of ordinary skill in the art is familiar with such
antisense technology. The process is also briefly described above.
Any nucleic acid sequence may be negatively regulated in this
manner.
Selectable Markers
The recombinant vector may further comprise a selectable marker.
The nucleic acid sequence serving as the selectable marker
functions to produce a phenotype in cells which facilitates their
identification relative to cells not containing the marker.
Examples of selectable markers include, but are not limited to, a
neo gene (Potrykus, et al., Ann. Rev. Plant Physiol. Plant Mol.
Biol., 42: 205, 1991), which codes for kanamycin resistance and can
be selected for using kanamycin, G418, etc.; a bar gene which codes
for bialaphos resistance; a mutant EPSP synthase gene (Hinchee, et
al., Bio/Technology 6:915-922, 1988) which encodes glyphosate
resistance; a nitrilase gene which confers resistance to bromoxynil
(Stalker, et al., J. Biol. Chem. 263:6310-6314, 1988); a mutant
acetolactate synthase gene (ALS) which confers imidazolinone or
sulphonylurea resistance (European Patent Application No. 0154204);
green fluorescent protein (GFP); and a methotrexate resistant DHFR
gene (Thillet, et al., J. Biol. Chem. 263:12500-12508, 1988).
Other exemplary selectable markers include: a .beta.-glucuronidase
or uidA gene (GUS), which encodes an enzyme for which various
chromogenic substrates are known (Jefferson, Plant Mol. Biol. Rep.
5:387-405, 1987; Jefferson, et al., EMBO J. 6:3901-3907, 1987); an
R-locus gene, which encodes a product that regulates the production
of anthocyanin pigments (red color) in plant tissues (Dellaporta et
al., Stadler Symposium 11:263-282, 1988); a .beta.-lactamase gene
(Sutcliffe et al., Proc. Natl. Acad. Sci. (U.S.A.) 75:3737-3741,
1978), which encodes an enzyme for which various chromogenic
substrates are known (e.g., PADAC, a chromogenic cephalosporin); a
luciferase gene (Ow, et al., Science 234:856-859, 1986); a xylE
gene (Zukowsky, et al., Proc. Natl. Acad. Sci. (U.S.A.)
80:1101-1105, 1983) which encodes a catechol dioxygenase that can
convert chromogenic catechols; an .alpha.-amylase gene (Ikuta et
al., Bio/Technol. 8:241-242, 1990); a tyrosinase gene (Katz et al.,
J. Gen. Microbiol. 129:2703-2714, 1983), which encodes an enzyme
capable of oxidizing tyrosine to DOPA and dopaquinone (which in
turn condenses to melanin); and an .alpha.-galactosidase, which
will turn over a chromogenic .alpha.-galactose substrate.
Included within the term "selectable markers" are also genes which
encode a secretable marker whose secretion can be detected as a
means of identifying or selecting for transformed cells. Examples
include markers that encode a secretable antigen that can be
identified by antibody interaction, or even secretable enzymes
which can be detected catalytically. Selectable secreted marker
proteins fall into a number of classes, including small, diffusible
proteins which are detectable, (e.g., by ELISA), small active
enzymes which are detectable in extracellular solution (e.g.,
.alpha.-amylase, .beta.-lactamase, phosphinothricin transferase),
or proteins which are inserted or trapped in the cell wall (such as
proteins which include a leader sequence such as that found in the
expression unit of extensin or tobacco PR-S). Other possible
selectable marker genes will be apparent to those of skill in the
art.
The selectable marker is preferably GUS, green fluorescent protein
(GFP), neomycin phosphotransferase II (nptII), luciferase (LUX), an
antibiotic resistance coding sequence, or an herbicide (e.g.,
glyphosate) resistance coding sequence. The selectable marker is
most preferably a kanamycin, hygromycin, or herbicide resistance
marker.
Other Elements in the Recombinant Vector
Various cis-acting untranslated 5' and 3' regulatory sequences may
be included in the recombinant nucleic acid vector. Any such
regulatory sequences may be provided in a recombinant vector with
other regulatory sequences. Such combinations can be designed or
modified to produce desirable regulatory features.
A 3' non-translated region typically provides a transcriptional
termination signal, and a polyadenylation signal which functions in
plants to cause the addition of adenylate nucleotides to the 3' end
of the mRNA. These may be obtained from the 3' regions to the
nopaline synthase (nos) coding sequence, the soybean 7S.alpha.
storage protein coding sequence, the albumin coding sequence, and
the pea ssRUBISCO E9 coding sequence. Particularly preferred 3'
nucleic acid sequences include nos 3', E9 3', ADR12 3', 7S.alpha.
3', 11S 3', and albumin 3'.
Typically, nucleic acid sequences located a few hundred base pairs
downstream of the polyadenylation site serve to terminate
transcription. These regions are required for efficient
polyadenylation of transcribed mRNA.
Translational enhancers may also be incorporated as part of the
recombinant vector. Thus the recombinant vector may preferably
contain one or more 5' non-translated leader sequences which serve
to enhance expression of the nucleic acid sequence. Such enhancer
sequences may be desirable to increase or alter the translational
efficiency of the resultant mRNA. Preferred 5' nucleic acid
sequences include dSSU 5', PetHSP70 5', and GmHSP17.9 5'.
The recombinant vector may further comprise a nucleic acid sequence
encoding a transit peptide. This peptide may be useful for
directing a protein to the extracellular space, a chloroplast, or
to some other compartment inside or outside of the cell (see, e.g.,
European Patent Application Publication Number 0218571).
The structural nucleic acid sequence in the recombinant vector may
comprise introns. The introns may be heterologous with respect to
the structural nucleic acid sequence. Preferred introns include the
rice actin intron and the corn HSP70 intron.
Fusion Proteins
Any of the above described structural nucleic acid sequences, and
modified forms thereof, may be linked with additional nucleic acid
sequences to encode fusion proteins. The additional nucleic acid
sequence preferably encodes at least 1 amino acid, peptide, or
protein. Production of fusion proteins is routine in the art and
many possible fusion combinations exist.
For instance, the fusion protein may provide a "tagged" epitope to
facilitate detection of the fusion protein, such as GST, GFP, FLAG,
or polyHIS. Such fusions preferably encode between 1 and 50 amino
acids, more preferably between 5 and 30 additional amino acids, and
even more preferably between 5 and 20 amino acids.
Alternatively, the fusion may provide regulatory, enzymatic, cell
signaling, or intercellular transport functions. For example, a
sequence encoding a chloroplast transit peptide may be added to
direct a fusion protein to the chloroplasts within a plant cell.
Such fusion partners preferably encode between 1 and 1000
additional amino acids, more preferably between 5 and 500
additional amino acids, and even more preferably between 10 and 250
amino acids.
Probes and Primers
Short nucleic acid sequences having the ability to specifically
hybridize to complementary nucleic acid sequences may be produced
and utilized in the present invention. These short nucleic acid
molecules may be used as probes to identify the presence of a
complementary nucleic acid sequence in a given sample. Thus, by
constructing a nucleic acid probe which is complementary to a small
portion of a particular nucleic acid sequence, the presence of that
nucleic acid sequence may be detected and assessed.
Use of these probes may greatly facilitate the identification of
transgenic plants which contain the presently disclosed nucleic
acid molecules. The probes may also be used to screen cDNA or
genomic libraries for additional nucleic acid sequences related or
sharing homology to the presently disclosed promoters and
structural nucleic acid sequences.
Alternatively, the short nucleic acid sequences may be used as
oligonucleotide primers to amplify or mutate a complementary
nucleic acid sequence using PCR technology. These primers may also
facilitate the amplification of related complementary nucleic acid
sequences (e.g. related nucleic acid sequences from other
species).
The short nucleic acid sequences may be used as probes and
specifically as PCR probes. A PCR probe is a nucleic acid molecule
capable of initiating a polymerase activity while in a
double-stranded structure with another nucleic acid. Various
methods for determining the structure of PCR probes and PCR
techniques exist in the art. Computer generated searches using
programs such as Primer3
(www-genome.wi.mit.edu/cgi-bin/primer/primer2.cgi), STSPipeline
(www-genome.wi.mit.edu/cgi-bin/www.STS_Pipeline), or GeneUp
(Pesole, et al., BioTechniques 25:112-123, 1998), for example, can
be used to identify potential PCR primers.
The primer or probe is generally complementary to a portion of a
nucleic acid sequence that is to be identified, amplified, or
mutated. The primer or probe should be of sufficient length to form
a stable and sequence-specific duplex molecule with its complement.
The primer or probe preferably is about 10 to about 200 nucleotides
long, more preferably is about 10 to about 100 nucleotides long,
even more preferably is about 10 to about 50 nucleotides long, and
most preferably is about 14 to about 30 nucleotides long.
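The nested length preferences above can be sketched as a simple
classifier. This is only an illustration: the function name is
hypothetical, and the word "about" in the text is treated here as
exact inclusive bounds, which is a simplifying assumption.

```python
# Illustrative sketch only. The ranges come from the paragraph above;
# treating "about N" as an exact inclusive bound is an assumption.
def primer_length_class(primer: str) -> str:
    """Classify a primer or probe by the stated length preferences."""
    n = len(primer)
    if 14 <= n <= 30:
        return "most preferred"
    if 10 <= n <= 50:
        return "even more preferred"
    if 10 <= n <= 100:
        return "more preferred"
    if 10 <= n <= 200:
        return "preferred"
    return "outside stated ranges"


# Example with a made-up 20-nucleotide primer.
print(primer_length_class("ACGTACGTACGTACGTACGT"))
```

The checks run from the narrowest range to the widest so that each
primer is reported under the most specific preference it satisfies.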
The primer or probe may be prepared by direct chemical synthesis,
by PCR (see, for example, U.S. Pat. Nos. 4,683,195 and 4,683,202),
or by excising the nucleic acid specific fragment from a larger
nucleic acid molecule.
Sequence Analysis
In the present invention, sequence similarity or identity is
preferably determined using the "BestFit" or "Gap" programs of the
Sequence Analysis Software Package.TM. (Version 10; Genetics
Computer Group, Inc., Center, Madison, Wis.). "Gap" utilizes the
algorithm of Needleman and Wunsch (Needleman and Wunsch, Journal of
Molecular Biology 48:443-453, 1970) to find the alignment of two
sequences that maximizes the number of matches and minimizes the
number of gaps. "BestFit" performs an optimal alignment of the best
segment of similarity between two sequences. Optimal alignments are
found by inserting gaps to maximize the number of matches using the
local homology algorithm of Smith and Waterman (Smith, et al., In:
Genetic Engineering: Principles and Methods, Setlow et al., Eds.,
Plenum Press, N.Y., 1-32, 1981; Smith and Waterman Advances in
Applied Mathematics, 2:482-489, 1981).
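To make the global alignment idea concrete, the following Python
sketch implements a basic Needleman and Wunsch style dynamic
program with a traceback and then reports percent identity. It is
not the "Gap" program: the scoring values (match +1, mismatch -1,
linear gap -1), the function names, and the definition of percent
identity as matches divided by alignment length are all assumptions
chosen for illustration.

```python
# Illustrative sketch only; scoring parameters and the identity
# definition are assumptions, not those of the "Gap" program.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


s, al_a, al_b = needleman_wunsch("GATTACA", "GATACA")
print(s, al_a, al_b, round(percent_identity(al_a, al_b), 1))
```

A local alignment in the style of Smith and Waterman differs mainly
in clamping cell scores at zero and starting the traceback from the
highest-scoring cell rather than the corner.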
The Sequence Analysis Software Package described above contains a
number of other useful sequence analysis tools for identifying
homologues of the presently disclosed nucleotide and amino acid
sequences. For example, the "BLAST" program (Altschul, et al.,
Journal of Molecular Biology 215: 403-410, 1990) searches for
sequences similar to a query sequence (either peptide or nucleic
acid) in a specified database (e.g., sequence databases maintained
at the National Center for Biotechnology Information (NCBI) in
Bethesda, Md., USA); "FastA" (Lipman and Pearson, Science,
227:1435-1441, 1985; see also Pearson and Lipman, Proceedings of
the National Academy of Sciences USA 85, 2444-2448, 1988; Pearson,
Methods in Enzymology, (R. Doolittle, ed.), 183, 63-98, Academic
Press, San Diego, Calif., USA, 1990) performs a Pearson and Lipman
search for similarity between a query sequence and a group of
sequences of the same type (nucleic acid or protein); "TfastA"
performs a Pearson and Lipman search for similarity between a
protein query sequence and any group of nucleotide sequences (it
translates the nucleotide sequences in all six reading frames
before performing the comparison); "FastX" performs a Pearson and
Lipman search for similarity between a nucleotide query sequence
and a group of protein sequences, taking frameshifts into account;
and "TfastX" performs a Pearson and Lipman search for similarity
between a protein query sequence and any group of nucleotide
sequences, taking frameshifts into account (it translates both
strands of the nucleic acid sequence before performing the
comparison).
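The six-reading-frame translation mentioned for "TfastA" can be
sketched as follows. This is an illustration of the general idea
rather than the program's implementation: it assumes the standard
genetic code, uses hypothetical function names, and reports
incomplete or ambiguous codons as "X".

```python
# Illustrative sketch only; assumes the standard genetic code.
from itertools import product

_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated TTT, TTC, TTA, TTG, TCT, ... to match _AMINO.
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(_BASES, repeat=3), _AMINO)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate seq codon by codon from its first base ('*' = stop)."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> dict:
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rev = seq.translate(_COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


for name, protein in sorted(six_frame_translations("ATGGCTTCA").items()):
    print(name, protein)
```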
Transgenic Plants and Plant Cells
The invention also includes and provides transformed plant cells
which comprise a nucleic acid molecule of the present
invention.
Preferred nucleic acid sequences of the present invention, including,
without limitation, recombinant vectors, structural nucleic acid
sequences, promoters, and other regulatory elements, are described
above. A promoter preferably comprises a nucleic acid sequence that
hybridizes under stringent conditions with a nucleic acid sequence
selected from the group consisting of SEQ ID NO:1 through SEQ ID
NO:57,467, any complement thereof; and any fragment thereof; or
exhibits 85% or greater identity, and more preferably at least 86
or greater, 87 or greater, 88 or greater, 89 or greater, 90 or
greater, 91 or greater, 92 or greater, 93 or greater, 94 or
greater, 95 or greater, 96 or greater, 97 or greater, 98 or
greater, or 99% or greater identity to a nucleic acid sequence
selected from the group consisting of SEQ ID NO:1 through SEQ ID
NO:57,467, any complement thereof; and any fragment thereof. A
promoter most preferably comprises SEQ ID NO:1 through SEQ ID
NO:57,467.
Methods for preparing such recombinant vectors are well known in
the art. For example, methods for making recombinant vectors
particularly suited to plant transformation are described in U.S.
Pat. Nos. 4,971,908, 4,940,835, 4,769,061 and 4,757,011. These
vectors have also been reviewed (Rodriguez et al., Vectors: A
Survey of Molecular Cloning Vectors and Their Uses, Butterworths,
Boston, 1988; Glick, et al., Methods in Plant Molecular Biology and
Biotechnology, CRC Press, Boca Raton, Fla., 1993) and are described
above.
Typical vectors useful for expression of nucleic acids in plant
cells are well known in the art and include vectors derived from
the tumor-inducing (Ti) plasmid of Agrobacterium tumefaciens
(Rogers, et al., Meth. In Enzymol, 153: 253-277, 1987). Other
recombinant vectors useful for plant transformation have also been
described (Fromm et al., Proc. Natl. Acad. Sci. USA, 82(17):
5824-5828, 1985). Elements of such recombinant vectors are
discussed above.
The transformed plant or cell may generally be any plant or cell
that is compatible with the present invention. The plant or cell
preferably is alfalfa, apple, banana, barley, bean, broccoli,
cabbage, carrot, castorbean, celery, citrus, clover, coconut,
coffee, corn, cotton, cucumber, garlic, grape, linseed, melon, oat,
olive, onion, palm, parsnip, pea, peanut, pepper, potato, radish,
rapeseed, rice, rye, sorghum, soybean, spinach, strawberry,
sugarbeet, sugarcane, sunflower, tobacco, tomato, or wheat. The
transformed plant or cell is more preferably rice, sorghum, maize,
barley, wheat, canola, or soybean; even more preferably rice,
sorghum, maize, barley, or wheat; and most preferably rice. The
rice plant or cell is preferably Oryza sativa L. (japonica type),
and more preferably Oryza sativa L. (japonica type), cv.
Nipponbare.
Method for Preparing Transformed Cells
The invention is also directed to a method of producing transformed
cells which comprise, in a 5' to 3' orientation, a promoter
operably linked to a heterologous structural nucleic acid sequence.
Other sequences may also be introduced into the cell along with the
promoter and structural nucleic acid sequence. These other
sequences may include 3' transcriptional terminators, 3'
polyadenylation signals, other untranslated sequences, transit or
targeting sequences, selectable markers, enhancers, and
operators.
Preferred recombinant vectors, structural nucleic acid sequences,
promoters, and other regulatory elements are described above. The
promoter preferably has a nucleic acid sequence that hybridizes
under stringent conditions with SEQ ID NO:1 through SEQ ID
NO:57,467, or any complement thereof; or exhibits 85% or greater
identity, and more preferably at least 86 or greater, 87 or
greater, 88 or greater, 89 or greater, 90 or greater, 91 or
greater, 92 or greater, 93 or greater, 94 or greater, 95 or
greater, 96 or greater, 97 or greater, 98 or greater, or 99% or
greater identity to SEQ ID NO:1 through SEQ ID NO:57,467.
The method generally comprises the steps of selecting a suitable
host cell, transforming the host cell with a recombinant vector,
and obtaining the transformed host cell.
There are many methods for introducing nucleic acids into plant
cells. Suitable methods include bacterial infection (e.g.,
Agrobacterium), binary bacterial artificial chromosome vectors,
and direct delivery of DNA (e.g., via PEG-mediated transformation,
desiccation/inhibition-mediated DNA uptake, electroporation,
agitation with silicon carbide fibers, and acceleration of
DNA-coated particles) (reviewed in Potrykus, et al., Ann. Rev.
Plant Physiol. Plant Mol. Biol., 42: 205, 1991).
Technology for introduction of DNA into cells is well known to
those of skill in the art. These methods can generally be
classified into four categories: (1) chemical methods (Graham and
Van der Eb, Virology, 54(2): 536-539, 1973; Zatloukal, et al., Ann.
N.Y. Acad. Sci., 660: 136-153, 1992); (2) physical methods such as
microinjection (Capecchi, Cell, 22(2): 479-488, 1980),
electroporation (Wong and Neumann, Biochem. Biophys. Res. Commun.,
107(2): 584-587, 1982; Fromm et al., Proc. Natl. Acad. Sci. USA,
82(17): 5824-5828, 1985; U.S. Pat. No. 5,384,253) and particle
acceleration (Johnston and Tang, Methods Cell Biol., 43(A):
353-365, 1994; Fynan et al., Proc. Natl. Acad. Sci. USA, 90(24):
11478-11482, 1993); (3) viral vectors (Clapp, Clin. Perinatol.,
20(1): 155-168, 1993; Lu, et al., J. Exp. Med., 178(6): 2089-2096,
1993; Eglitis and Anderson, Biotechniques, 6(7): 608-614, 1988);
and (4) receptor-mediated mechanisms (Curiel et al., Hum. Gen.
Ther., 3(2):147-154, 1992; Wagner, et al., Proc. Natl. Acad. Sci.
USA, 89(13): 6099-6103, 1992). Alternatively, nucleic acids can be
directly introduced into pollen by directly injecting a plant's
reproductive organs (Zhou, et al., Methods in Enzymology, 101: 433,
1983; Hess, Intern Rev. Cytol., 107: 367, 1987; Luo, et al., Plant
Mol Biol. Reporter 6: 165, 1988; Pena, et al., Nature, 325: 274,
1987). The nucleic acids may also be injected into immature embryos
(Neuhaus, et al., Theor. Appl. Genet., 75: 30, 1987).
The recombinant vector used to transform the host cell typically
comprises, in a 5' to 3' orientation: a promoter to direct the
transcription of a structural nucleic acid sequence, a structural
nucleic acid sequence, a 3' transcriptional terminator, and a 3'
polyadenylation signal. The recombinant vector may further comprise
untranslated nucleic acid sequences, transit and targeting nucleic
acid sequences, selectable markers, enhancers, or operators.
Suitable recombinant vectors, structural nucleic acid sequences,
promoters, and other regulatory elements include, without
limitation, those described above.
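The 5' to 3' arrangement described above can be pictured as an
ordered list of elements that is checked and then concatenated. The
sketch below is purely illustrative: the class, the function, the
element kinds, and the placeholder base strings are hypothetical and
do not correspond to any disclosed sequence.

```python
# Illustrative sketch only; all names and sequences are placeholders.
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    kind: str      # "promoter", "structural", "terminator", or "polyA"
    name: str
    sequence: str  # placeholder bases, not a real sequence


# Order stated in the text, read 5' to 3'.
REQUIRED_ORDER = ["promoter", "structural", "terminator", "polyA"]


def assemble_cassette(elements):
    """Concatenate elements after checking their 5' to 3' order."""
    kinds = [e.kind for e in elements]
    if kinds != REQUIRED_ORDER:
        raise ValueError(f"expected {REQUIRED_ORDER}, got {kinds}")
    return "".join(e.sequence for e in elements)


cassette = assemble_cassette([
    Element("promoter", "example promoter", "AAAA"),
    Element("structural", "example coding region", "CCCC"),
    Element("terminator", "example terminator", "GGGG"),
    Element("polyA", "example polyadenylation signal", "TTTT"),
])
print(cassette)
```

Optional elements such as transit sequences, selectable markers, or
enhancers are left out of this sketch; a fuller model would need to
allow them at the positions appropriate to each element.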
The regeneration, development, and cultivation of plants from
transformed plant protoplasts or explants is well taught in the art
(Weissbach and Weissbach (Eds.), Methods for Plant Molecular
Biology, Academic Press, Inc., San Diego, Calif., 1988; Horsch et
al., Science, 227: 1229-1231, 1985). In this method, transformants
are generally cultured in the presence of a selective medium which
selects for the successfully transformed cells and induces the
regeneration of plant shoots (Fraley et al., Proc. Natl. Acad. Sci.
U.S.A., 80: 4803, 1983). These shoots are typically obtained within
two to four months.
The shoots are then transferred to an appropriate root-inducing
medium containing the selective agent and an antibiotic to prevent
bacterial growth. Many of the shoots will develop roots. These are
then transplanted to soil or other media to allow the continued
development of roots. The method, as outlined, will generally vary
depending on the particular plant strain employed.
The regenerated transgenic plants are self-pollinated to provide
homozygous transgenic plants. Alternatively, pollen obtained from
the regenerated transgenic plants may be crossed with
non-transgenic plants, preferably inbred lines of agronomically
important species. Conversely, pollen from non-transgenic plants
may be used to pollinate the regenerated transgenic plants.
The transgenic plant may pass along the transformed nucleic acid
sequence to its progeny. The transgenic plant is preferably
homozygous for the transformed nucleic acid sequence and transmits
that sequence to all of its offspring as a result of sexual
reproduction. Progeny may be grown from seeds produced by the
transgenic plant. These additional plants may then be
self-pollinated to generate a true breeding line of plants.
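The expected outcome of self-pollination can be worked through with
a small Punnett-square calculation. The sketch assumes a single
transgene insertion at one locus with ordinary Mendelian
segregation, which will not hold for every transformant; "T" marks
the transgene and "-" its absence, and the function name is
hypothetical.

```python
# Illustrative sketch only; assumes one insertion locus and Mendelian
# segregation. "T" = transgene present, "-" = transgene absent.
from collections import Counter
from fractions import Fraction
from itertools import product


def self_cross(genotype: str) -> dict:
    """Expected offspring genotype fractions from selfing a diploid."""
    gametes = list(genotype)
    # Sort each pair so that "T-" and "-T" count as the same genotype.
    offspring = Counter("".join(sorted(a + b, reverse=True))
                        for a, b in product(gametes, repeat=2))
    total = sum(offspring.values())
    return {g: Fraction(c, total) for g, c in offspring.items()}


# Selfing a primary transformant carrying one copy of the transgene.
print(self_cross("T-"))
# Selfing a homozygous plant yields only homozygous progeny.
print(self_cross("TT"))
```

Under these assumptions one quarter of the first selfed generation
is homozygous for the transgene, and selfing those homozygous plants
gives the true breeding line described above.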
The progeny from these plants are evaluated, among other things,
for gene expression. The gene expression may be detected by several
common methods such as western blotting, northern blotting,
immunoprecipitation, and ELISA.
Methods for transforming dicots, primarily by use of Agrobacterium
tumefaciens and obtaining transgenic plants have been published for
cotton (U.S. Pat. Nos. 5,004,863; 5,159,135; 5,518,908); soybean
(U.S. Pat. Nos. 5,569,834; 5,416,011; McCabe, et al., Bio/Technology,
6: 923, 1988; Christou et al., Plant Physiol. 87:671-674 (1988));
Brassica (U.S. Pat. No. 5,463,174); peanut (Cheng et al., Plant
Cell Rep. 15:653-657 (1996), McKently et al., Plant Cell Rep.
14:699-703 (1995)); papaya; and pea (Grant et al., Plant Cell Rep.
15:254-258 (1995)).
Transformation of monocotyledons using electroporation, particle
bombardment and Agrobacterium has also been reported.
Transformation and plant regeneration have been achieved in
asparagus (Bytebier et al., Proc. Natl. Acad. Sci (USA) 84:5354
(1987)); barley (Wan and Lemaux, Plant Physiol 104:37 (1994));
maize (Rhodes et al., Science 240:204 (1988); Gordon-Kamm et al.,
Plant Cell 2:603-618 (1990); Fromm et al., Bio/Technology 8:833
(1990); Koziel et al., Bio/Technology 11: 194 (1993); Armstrong et
al., Crop Science 35:550-557 (1995)); oat (Somers et al.,
Bio/Technology 10:1589 (1992)); orchard grass (Horn et al., Plant
Cell Rep. 7:469 (1988)); rice (Toriyama et al., Theor. Appl. Genet.
205:34 (1986); Park et al., Plant Mol. Biol. 32:1135-1148 (1996);
Abedinia et al., Aust. J. Plant Physiol. 24:133-141 (1997); Zhang
and Wu, Theor. Appl. Genet. 76:835 (1988); Zhang et al., Plant Cell
Rep. 7:379 (1988); Battraw and Hall, Plant Sci. 86:191-202 (1992);
Christou et al., Bio/Technology 9:957 (1991)); rye (De la Pena et
al., Nature 325:274 (1987)); sugarcane (Bower and Birch, Plant J.
2:409 (1992)); tall fescue (Wang et al., Bio/Technology 10:691
(1992)) and wheat (Vasil et al., Bio/Technology 10:667 (1992); U.S.
Pat. No. 5,631,152).
Other Transformed Organisms
Any of the above described promoters and structural nucleic acid
sequences may be introduced into any cell or organism such as a
mammalian cell, mammal, fish cell, fish, bird cell, bird, algae
cell, algae, fungal cell, fungi, or bacterial cell. Preferred hosts
and transformants include: fungal cells such as Aspergillus,
yeasts, mammals (particularly bovine and porcine), insects,
bacteria and algae.
The transformed cell or organism is preferably prokaryotic, more
preferably a bacterial cell, even more preferably an Agrobacterium,
Bacillus, Escherichia, or Pseudomonas cell, and most preferably is an
Escherichia coli cell. Alternatively, the transformed organism is
preferably a yeast or fungal cell. The yeast cell is preferably a
Saccharomyces cerevisiae, Schizosaccharomyces pombe, or Pichia
pastoris.
Methods to transform such cells or organisms are known in the art
(EP 0238023; Yelton et al., Proc. Natl. Acad. Sci. (U.S.A.),
81:1470-1474 (1984); Malardier et al., Gene, 78:147-156 (1989);
Becker and Guarente, In: Abelson and Simon (eds.), Guide to Yeast
Genetics and Molecular Biology, Methods Enzymol., Vol. 194, pp.
182-187, Academic Press, Inc., New York; Ito et al., J.
Bacteriology, 153:163 (1983); Hinnen et al., Proc. Natl. Acad. Sci.
(U.S.A.), 75:1920 (1978); Bennett and LaSure (eds.), More Gene
Manipulations in Fungi, Academic Press, Calif. (1991)). Methods to
produce proteins of the present invention from such organisms are
also known (Kudla et al., EMBO, 9:1355-1364 (1990); Jarai and
Buxton, Current Genetics, 26:2238-2244 (1994); Verdier, Yeast,
6:271-297 (1990); MacKenzie et al., Journal of Gen. Microbiol.,
139:2295-2307 (1993); Hartl et al., TIBS, 19:20-25 (1994); Bergeron
et al., TIBS, 19:124-128 (1994); Demolder et al., J. Biotechnology,
32:179-189 (1994); Craig, Science, 260:1902-1903 (1993); Gething
and Sambrook, Nature, 355:33-45 (1992); Puig and Gilbert, J. Biol.
Chem., 269:7764-7771 (1994); Wang and Tsou, FASEB Journal,
7:1515-1517 (1993); Robinson et al., Bio/Technology, 1:381-384
(1994); Enderlin and Ogrydziak, Yeast, 10:67-79 (1994); Fuller et
al., Proc. Natl. Acad. Sci. (U.S.A.), 86:1434-1438 (1989); Julius
et al., Cell, 37:1075-1089 (1984); Julius et al., Cell, 32:839-852
(1983)).
Exemplary Uses
The presently disclosed promoter sequences may be used as genetic
markers and employed in genetic mapping studies using linkage
analysis. A genetic linkage map shows the relative locations of
specific DNA markers along a chromosome. Maps are used for the
identification of genes associated with genetic diseases or
phenotypic traits, comparative genomics, and as a guide for
physical mapping. Through genetic mapping, a fine scale linkage map
can be developed using DNA markers, and, then, a genomic DNA
library of large-sized fragments can be screened with molecular
markers linked to the desired trait. In a preferred embodiment of
the present invention, a genomic library is screened with the
promoter sequences of the present invention.
Mapping marker locations is based on the observation that two
markers located near each other on the same chromosome will tend to
be passed together from parent to offspring. During gamete
production, DNA strands occasionally break and rejoin in different
places on the same chromosome or on the homologous chromosome. The
closer the markers are to each other, the more tightly linked and
the less likely a recombination event will fall between and
separate them. Recombination frequency thus provides an estimate of
the distance between two markers.
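The relationship between recombination frequency and map distance can be sketched as follows. This is an illustrative example only: the Haldane and Kosambi mapping functions are standard textbook conversions and are not named in the text above, and the function names are hypothetical.

```python
import math


def recombination_frequency(recombinants: int, total: int) -> float:
    """Fraction of scored progeny that are recombinant between two markers."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= recombinants <= total:
        raise ValueError("recombinants must lie between 0 and total")
    return recombinants / total


def haldane_cm(r: float) -> float:
    """Map distance in centimorgans, assuming no crossover interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r: float) -> float:
    """Map distance in centimorgans, allowing for some interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    r = recombination_frequency(10, 100)
    print(f"r = {r:.3f}")
    print(f"Haldane distance: {haldane_cm(r):.2f} cM")
    print(f"Kosambi distance: {kosambi_cm(r):.2f} cM")
```

For small recombination frequencies both functions give a distance close to 100 x r centimorgans, which is why tightly linked markers are often reported directly in percent recombination.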
In segregating populations, target genes have reportedly been
placed within an interval of 5-10 cM with a high degree of
certainty (Tanksley et al., Trends in Genetics 11(2):63-68 (1995)).
The markers defining this interval are used to screen a larger
segregating population to identify individuals derived from one or
more gametes containing a crossover in the given interval. Such
individuals are useful in orienting other markers closer to the
target gene. Once identified, these individuals can be analyzed in
relation to all molecular markers within the region to identify
those closest to the target.
Markers of the present invention can be employed to construct
linkage maps and to locate genes with qualitative and quantitative
effects. The genetic linkage of additional marker molecules can be
established by a genetic mapping model such as, without limitation,
the flanking marker model reported by Lander and Botstein,
Genetics, 121:185-199 (1989), and the interval mapping, based on
maximum likelihood methods described by Lander and Botstein,
Genetics, 121:185-199 (1989), and implemented in the software
package MAPMAKER/QTL (Lincoln and Lander, Mapping Genes Controlling
Quantitative Traits Using MAPMAKER/QTL, Whitehead Institute for
Biomedical Research, Massachusetts, (1990)). Additional software
includes Qgene, Version 2.23 (1996) (Department of Plant Breeding
and Biometry, 266 Emerson Hall, Cornell University, Ithaca, N.Y.).
Use of the Qgene software is a particularly preferred approach.
A maximum likelihood estimate (MLE) for the presence of a marker is
calculated, together with an MLE assuming no QTL effect, to avoid
false positives. A log10 of an odds ratio (LOD) is then calculated
as: LOD=log10 (MLE for the presence of a QTL/MLE given no linked
QTL).
The LOD score essentially indicates how much more likely the data
are to have arisen assuming the presence of a QTL than in its
absence. The LOD threshold value for avoiding a false positive with
a given confidence, say 95%, depends on the number of markers and
the length of the genome. Graphs indicating LOD thresholds are set
forth in Lander and Botstein, Genetics, 121:185-199 (1989), and
further described by Arús and Moreno-González, Plant Breeding,
Hayward, Bosemark, Romagosa (eds.) Chapman & Hall, London, pp.
314-331 (1993).
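The LOD calculation described above can be sketched as follows. This is a minimal illustration, not the MAPMAKER/QTL or Qgene implementation; the function names are hypothetical, and the likelihood values are assumed to have been obtained elsewhere (for example, from a fitted interval-mapping model).

```python
import math


def lod_from_likelihoods(lik_qtl: float, lik_null: float) -> float:
    """LOD = log10(likelihood with a QTL / likelihood with no linked QTL)."""
    if lik_qtl <= 0.0 or lik_null <= 0.0:
        raise ValueError("likelihoods must be positive")
    return math.log10(lik_qtl / lik_null)


def lod_from_log_likelihoods(loglik_qtl: float, loglik_null: float) -> float:
    """Same quantity, computed from natural-log likelihoods.

    Working with log-likelihoods avoids numerical underflow when the raw
    likelihoods are very small, as is typical for large populations.
    """
    return (loglik_qtl - loglik_null) / math.log(10.0)


if __name__ == "__main__":
    # Data 1000 times more likely under the QTL model than under the null.
    print(lod_from_likelihoods(1000.0, 1.0))
    print(lod_from_log_likelihoods(math.log(1000.0), 0.0))
```

A LOD of 3 therefore means the data are 1000 times more likely under the QTL model than under the null model; whether that exceeds the significance threshold still depends on marker number and genome length, as noted above.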
Additional models can be used. Many modifications and alternative
approaches to interval mapping have been reported, including the
use of non-parametric methods (Kruglyak and Lander, Genetics,
139:1421-1428 (1995)). Multiple regression methods or models can
also be used, in which the trait is regressed on a large number of
markers (Jansen, Biometrics in Plant Breed, van Oijen, Jansen
(eds.) Proceedings of the Ninth Meeting of the Eucarpia Section
Biometrics in Plant Breeding, The Netherlands, pp. 116-124 (1994);
Weber and Wricke, Advances in Plant Breeding, Blackwell, Berlin, 16
(1994)). Procedures combining interval mapping with regression
analysis, whereby the phenotype is regressed onto a single putative
QTL at a given marker interval, and at the same time onto a number
of markers that serve as `cofactors,` have been reported by Jansen
and Stam, Genetics, 136:1447-1455 (1994) and Zeng, Genetics,
136:1457-1468 (1994). Generally, the use of cofactors reduces the
bias and sampling error of the estimated QTL positions (Utz and
Melchinger, Biometrics in Plant Breeding, van Oijen, Jansen (eds.)
Proceedings of the Ninth Meeting of the Eucarpia Section Biometrics
in Plant Breeding, The Netherlands, pp. 195-204 (1994)), thereby
improving the precision and efficiency of QTL mapping (Zeng,
Genetics, 136:1457-1468 (1994)). These models can be extended to
multi-environment experiments to analyze genotype-environment
interactions (Jansen et al., Theo. Appl. Genet. 91:33-37
(1995)).
Selection of an appropriate mapping population is important to map
construction. The choice of appropriate mapping population depends
on the type of marker systems employed (Tanksley et al., J. P.
Gustafson and R. Appels (eds.), Plenum Press, New York, pp. 157-173
(1988)). Consideration must be given to the source of parents
(adapted vs. exotic) used in the mapping population. Chromosome
pairing and recombination rates can be severely disturbed
(suppressed) in wide crosses (adapted x exotic) and generally yield
greatly reduced linkage distances. Wide crosses will usually
provide segregating populations with a relatively large array of
polymorphisms when compared to progeny in a narrow cross (adapted x
adapted).
An F2 population is the first generation of selfing after the
hybrid seed is produced. Usually a single F1 plant is selfed to
generate a population segregating for all the genes in Mendelian
(1:2:1) fashion. Maximum genetic information is obtained from a
completely classified F2 population using a codominant marker
system (Mather, Measurement of Linkage in Heredity: Methuen and
Co., (1938)). In the case of dominant markers, progeny tests (e.g.,
F3, BCF2) are required to identify the heterozygotes, thus making
it equivalent to a completely classified F2 population. However,
this procedure is often prohibitive because of the cost and time
involved in progeny testing. Progeny testing of F2 individuals is
often used in map construction where phenotypes do not consistently
reflect genotype (e.g., disease resistance) or where trait
expression is controlled by a QTL. Segregation data from progeny
test populations (e.g., F3 or BCF2) can be used in map
construction. Marker-assisted selection can then be applied to
cross progeny based on marker-trait map associations (F2, F3),
where linkage groups have not been completely disassociated by
recombination events (i.e., maximum disequilibrium).
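Whether a codominant marker segregates in the expected Mendelian 1:2:1 fashion in an F2 population can be checked with a chi-square goodness-of-fit test. The sketch below is illustrative only: the function names are hypothetical, and 5.991 is the standard critical value for 2 degrees of freedom at the 5% level.

```python
# Critical chi-square value for 2 degrees of freedom at alpha = 0.05.
CHI2_CRITICAL_DF2_ALPHA_05 = 5.991


def chi_square_1_2_1(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic for observed F2 genotype counts against 1:2:1."""
    total = n_aa + n_ab + n_bb
    if total <= 0:
        raise ValueError("at least one individual must be scored")
    observed = (n_aa, n_ab, n_bb)
    expected = (0.25 * total, 0.50 * total, 0.25 * total)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_mendelian_ratio(n_aa: int, n_ab: int, n_bb: int) -> bool:
    """True when the counts do not deviate significantly from 1:2:1."""
    return chi_square_1_2_1(n_aa, n_ab, n_bb) < CHI2_CRITICAL_DF2_ALPHA_05


if __name__ == "__main__":
    print(chi_square_1_2_1(25, 50, 25), fits_mendelian_ratio(25, 50, 25))
    print(chi_square_1_2_1(40, 40, 20), fits_mendelian_ratio(40, 40, 20))
```

Markers that fail such a test show segregation distortion and are usually examined before being placed on a linkage map.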
Recombinant inbred lines (RIL) (genetically related lines; usually
>F5, developed from continuously selfing F2 lines towards
homozygosity) can be used as a mapping population. Information
obtained from dominant markers can be maximized by using RIL
because all loci are homozygous or nearly so. Under conditions of
tight linkage (i.e., about <10% recombination), dominant and
co-dominant markers evaluated in RIL populations provide more
information per individual than either marker type in backcross
populations (Reiter, Proc. Natl. Acad. Sci. USA 89:1477-1481
(1992)). However, as the distance between markers becomes larger
(i.e., loci become more independent), the information in RIL
populations decreases dramatically when compared to codominant
markers.
Backcross populations (e.g., generated from a cross between a
successful variety (recurrent parent) and another variety (donor
parent) carrying a trait not present in the former) can be utilized
as a mapping population. A series of backcrosses to the recurrent
parent can be made to recover most of its desirable traits. Thus a
population is created consisting of individuals nearly like the
recurrent parent but each individual carries varying amounts or
mosaic of genomic regions from the donor parent. Backcross
populations can be useful for mapping dominant markers if all loci
in the recurrent parent are homozygous and the donor and recurrent
parent have contrasting polymorphic marker alleles (Reiter et al.,
Proc. Natl. Acad. Sci. USA 89:1477-1481 (1992)). Information
obtained from backcross populations using either codominant or
dominant markers is less than that obtained from F2 populations
because one, rather than two, recombinant gametes are sampled per
plant. Backcross populations, however, are more informative (at low
marker saturation) when compared to RILs as the distance between
linked loci increases in RIL populations (i.e., about 0.15%
recombination). Increased recombination can be beneficial for
resolution of tight linkages, but may be undesirable in the
construction of maps with low marker saturation.
Near-isogenic lines (NIL) (created by many backcrosses to produce an
array of individuals that are nearly identical in genetic
composition except for the trait or genomic region under
interrogation) can be used as a mapping population. In mapping with
NILs, only a portion of the polymorphic loci are expected to map to
a selected region.
Bulk segregant analysis (BSA) is a method developed for the rapid
identification of linkage between markers and traits of interest
(Michelmore et al., Proc. Natl. Acad. Sci. USA 88:9828-9832
(1991)). In BSA, two bulked DNA samples are drawn from a
segregating population originating from a single cross. These bulks
contain individuals that are identical for a particular trait
(resistant or susceptible to a particular disease) or genomic region
but arbitrary at unlinked regions (i.e., heterozygous). Regions
unlinked to the target region will not differ between the bulked
samples of many individuals in BSA.
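The logic of BSA can be sketched by comparing marker allele frequencies between the two bulks. This is a simplified illustration under stated assumptions: genotypes are written as two-letter strings such as "AA", "Aa" or "aa", and the function names are hypothetical.

```python
from typing import Sequence


def allele_frequency(genotypes: Sequence[str], allele: str) -> float:
    """Frequency of one allele among all alleles in a bulk of genotypes."""
    alleles = "".join(genotypes)
    if not alleles:
        raise ValueError("bulk is empty")
    return alleles.count(allele) / len(alleles)


def bulk_difference(bulk_1: Sequence[str], bulk_2: Sequence[str],
                    allele: str = "A") -> float:
    """Absolute difference in allele frequency between two bulks.

    A marker linked to the target region is expected to show a large
    difference; an unlinked marker should show a difference near zero.
    """
    return abs(allele_frequency(bulk_1, allele)
               - allele_frequency(bulk_2, allele))


if __name__ == "__main__":
    resistant = ["AA", "AA", "Aa"]
    susceptible = ["aa", "aa", "aa"]
    print(f"linked marker:   {bulk_difference(resistant, susceptible):.2f}")

    unlinked_1 = ["AA", "aa", "Aa"]
    unlinked_2 = ["Aa", "Aa", "Aa"]
    print(f"unlinked marker: {bulk_difference(unlinked_1, unlinked_2):.2f}")
```

In practice the bulks are pooled DNA samples rather than individually scored genotypes, so the comparison is made on band presence or intensity; the frequency comparison above only illustrates the underlying reasoning.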
It is understood that one or more of the nucleic acid molecules of
the present invention may in one embodiment be used as markers in
genetic mapping. In a preferred embodiment, nucleic acid molecules
of the present invention may be used as markers
with rice.
Nucleic acid molecules of the present invention can be used in
comparative mapping (physical and genetic) and to isolate molecules
from other cereals based on the syntenic relationship between
cereals. Comparative mapping within families provides a method to
assess the degree of sequence conservation, gene order, ploidy of species,
ancestral relationships and the rates at which individual genomes
are evolving. Comparative mapping has been carried out by
cross-hybridizing molecular markers across species within a given
family.
In a preferred embodiment, the nucleic acid molecules of the
present invention can be utilized to isolate corresponding syntenic
regions in non-rice plants (Bennetzen and Freeling, Trends in
Genet., 9(8):259-261 (1993); Ahn et al., Mol. Gen. Genet.,
241(5-6):483-490 (1993); Schwarzacher, Cur. Opin. Genet. &
Devel., 4(6): 868-874 (1994); Kurata et al., Bio/Technology,
12:276-278 (1994); Kilian et al., Nucl. Acids Res.,
23(14):2729-2733 (1995); Bennett, Symp. Soc. Exp. Biol., 50:45-52
(1996); Hu et al., Genetics, 142(3):1021-1031 (1996); Kilian, Plant
Mol. Biol., 35:187-195 (1997); Bennetzen and Freeling, Genome Res.,
7(4):301-306 (1997); Foote et al., Genetics, 147(2):801-807 (1997);
Gallego et al., Genome, 41(3):328-336 (1998); Gale and Devos, Proc.
Natl. Acad. Sci. USA 95:1971-1974 (1998); Bennetzen et al., Proc.
Natl. Acad. Sci. USA, 95:1975-1978 (1998); Messing and Llaca, Proc.
Natl. Acad. Sci. USA 95:2017-2020 (1998); McCouch, Proc. Natl.
Acad. Sci. USA, 95:1983-1985 (1998); Goff, Curr. Opin. Plant Biol.
2:85-89 (1999); Bailey et al., Theor. Appl. Genet., 98:281-284
(1999); Zhang et al., Proc. Natl. Acad. Sci. USA, 91:8675-8679
(1994); Yano and Sasaki, Plant Mol. Biol., 35:145-153 (1997);
Leister et al., Proc. Natl. Acad. Sci. USA, 95:370-375 (1998); Lin
et al., Phytopathology 86(11):1156-1159 (1996); Havukkala, Curr.
Opin. Genet. Dev., 96:711-713 (1996); and Lee, The Society for
Experimental Biology, pp. 31-38 (1996)). Synteny between rice and
barley has recently been reported in the genomic region carrying
malting quality Quantitative Trait Loci (QTL) (Kleinhofs et al.,
Genome 41:373-380 (1998)). Likewise, mapping of the liguleless region
of sorghum, a region containing a developmental control gene, was
facilitated using molecular markers from a syntenic region of the
rice genome (Christou et al., Genetics 148:1983-1992 (1998)).
In a particularly preferred embodiment, the nucleic acid molecules
of the present invention that define a genomic region in rice
plants associated with a desirable phenotype are utilized to obtain
corresponding syntenic regions in non-rice plants. A region can be
defined either physically or genetically. In an even more preferred
embodiment, the nucleic acid molecules of the present invention
that define a genomic region in rice plants associated with a
desirable phenotype are utilized to obtain corresponding syntenic
regions in corn plants. A region can be defined either physically
or genetically.
One or more of the nucleic acid molecules may be used to define a
physical genomic region. For example, two nucleic acid molecules of
the present invention can act to define a physical genomic region
that lies between them. Moreover, for example, a physical genomic
region may be defined by a distance relative to a nucleic acid
molecule. In a preferred embodiment of the present invention, the
defined physical genomic region is less than about 1,000 kb, more
preferably less than about 500 kb, even more preferably less than
about 100 kb or less than about 50 kb.
One or more of the nucleic acid molecules may be used to define a
genomic region by its genetic distance from one or more of the
nucleic acid molecules of the present invention. In a preferred
embodiment of the present invention, the genomic region is defined
by its linkage to a nucleic acid molecule of the present invention.
In such a preferred embodiment, the genomic region that is defined
by one or more nucleic acid molecules of the present invention is
located within about 50 centimorgans, more preferably within about
20 centimorgans, even more preferably within about 10, about 5 or
about 2 centimorgans of the trait or marker at issue.
In another particularly preferred embodiment, two or more nucleic
acid molecules of the present invention derived from rice plants
that flank a genomic region of interest in rice plants are used to
isolate the syntenic region in another cereal, more preferably
maize, sorghum, barley, or wheat. Regions of interest in rice
include, without limitation, those regions that are associated with
a commercially desirable phenotype in rice. In another particularly
preferred embodiment the desirable phenotype in rice is the result
of a quantitative trait locus (QTL) present in the region.
One exemplary approach to isolate syntenic genomic regions is as
follows. Nucleic acid sequences derived from rice of the present
invention can be used to select large insert clones from a total
genomic DNA library of a related species such as maize, sorghum,
barley, or wheat. Any appropriate method to screen the genomic
library with a nucleic acid molecule of the present invention may
be used to select the required clones (See, for example, Birren et
al., Detecting Genes: A Laboratory Manual, Cold Spring Harbor, New
York, N.Y. (1998)). For example, direct hybridization of a nucleic
acid molecule of the present invention to mapping filters
comprising the genomic DNA of the syntenic species can be used to
select large insert clones from a total genomic DNA library of a
related species. The selected clones can then be used to physically
map the region in the target species. An advantage of this method
for comparative mapping is that no mapping population or linkage
map of the target species is needed and the clones may also be used
in other closely related species. By comparing the results obtained
by genetic mapping in model plants, with those from other species,
similarities of genomic structure among plant species can be
established. Cross-hybridization of RFLP markers has been reported
and conserved gene order has been established in many studies. Such
macroscopic synteny is utilized for the estimation of
correspondence of loci among these crops. These loci include not
only Mendelian genes but also Quantitative Trait Loci (QTL) (Mohan
et al., Molecular Breeding 3:87-103 (1997)). Other methods to
isolate syntenic nucleic acid molecules may be used.
It is understood that markers of the present invention may be used
in comparative mapping. In a preferred embodiment the markers of
the present invention may be used in the comparative mapping of
cereals, more preferably maize, barley, sorghum, and wheat.
It is understood that markers of the present invention may be used
to isolate promoters and other nucleic acid sequences from other
cereals based on the syntenic relationship between such cereals. In
a preferred embodiment the cereal is selected from the group of
maize, sorghum, barley, and wheat.
The nucleic acid molecules of the present invention can be used to
identify polymorphisms. In one embodiment, one or more of the
nucleic acid molecules may be employed as a marker nucleic acid
molecule to identify such polymorphism(s). Alternatively, such
polymorphisms can be detected through the use of a marker nucleic
acid molecule or a marker protein that is genetically linked to
(i.e., a polynucleotide that co-segregates with) such
polymorphism(s). In a preferred embodiment, the plant is selected
from the group consisting of cereals, and more preferably rice,
maize, barley, sorghum, and wheat.
In an alternative embodiment, such polymorphisms can be detected
through the use of a marker nucleic acid molecule that is
physically linked to such polymorphism(s). For this purpose, marker
nucleic acid sequences located within 1 Mb of the polymorphism(s),
and more preferably within 100 kb of the polymorphism(s), and most
preferably within 10 kb of the polymorphism(s) can be employed.
The genomes of animals and plants naturally undergo spontaneous
mutation in the course of their continuing evolution (Gusella, Ann.
Rev. Biochem. 55:831-854 (1986)). A "polymorphism" is a variation
or difference in the sequence of the gene or its flanking regions
that arises in some of the members of a species. The variant
sequence and the "original" sequence co-exist in the species'
population. In some instances, such co-existence is in stable or
quasi-stable equilibrium.
A polymorphism is thus said to be "allelic," in that, due to the
existence of the polymorphism, some members may have the original
sequence (i.e., the original "allele") whereas other members may
have the variant sequence (i.e., the variant "allele"). In the
simplest case, only one variant sequence may exist, and the
polymorphism is thus said to be di-allelic. In other cases, the
population may contain multiple alleles, and the polymorphism is
termed tri-allelic, etc. A single gene may have multiple different
unrelated polymorphisms. For example, it may have a di-allelic
polymorphism at one site, and a multi-allelic polymorphism at
another site.
The variation that defines the polymorphism may range from a single
nucleotide variation to the insertion or deletion of extended
regions within a gene. In some cases, the DNA sequence variations
are in regions of the genome that are characterized by short tandem
repeats (STRs) that include tandem di- or tri-nucleotide repeated
motifs of nucleotides. Polymorphisms characterized by such tandem
repeats are referred to as "variable number tandem repeat" ("VNTR")
polymorphisms. VNTRs have been used in identity analysis (Weber,
U.S. Pat. No. 5,075,217; Armour et al., FEBS Lett. 307:113-115
(1992); Jones et al., Eur. J. Haematol. 39:144-147 (1987); Horn et
al., PCT Application WO91/14003; Jeffreys, European Patent
Application 370,719; Jeffreys, U.S. Pat. No. 5,175,082; Jeffreys et
al., Amer. J. Hum. Genet. 39:11-24 (1986); Jeffreys et al., Nature
316:76-79 (1985); Gray et al., Proc. R. Acad. Soc. Lond.
243:241-253 (1991); Moore et al., Genomics 10:654-660 (1991);
Jeffreys et al., Anim. Genet. 18:1-15 (1987); Hillel et al., Anim.
Genet. 20:145-155 (1989); Hillel et al., Genet. 124:783-789
(1990)).
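How alleles of a tandem-repeat polymorphism differ can be sketched by counting consecutive copies of a repeat motif in each allele. The sequences and the function name below are hypothetical and chosen only for illustration.

```python
import re


def longest_tandem_repeat(sequence: str, motif: str) -> int:
    """Largest number of consecutive copies of `motif` found in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    best = 0
    for match in pattern.finditer(sequence):
        best = max(best, len(match.group(0)) // len(motif))
    return best


if __name__ == "__main__":
    # Two hypothetical alleles that differ only in the number of CA repeats.
    allele_1 = "GGCACACACATT"
    allele_2 = "GGCACATT"
    print(longest_tandem_repeat(allele_1, "CA"))
    print(longest_tandem_repeat(allele_2, "CA"))
```

Because the alleles differ in length by a multiple of the motif size, they can be distinguished by the size of an amplified fragment spanning the repeat.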
The detection of polymorphic sites in a sample of DNA may be
facilitated through the use of nucleic acid amplification methods.
Such methods specifically increase the concentration of
polynucleotides that span the polymorphic site, or include that
site and sequences located either distal or proximal to it. Such
amplified molecules can be readily detected by gel electrophoresis
or other means.
The most preferred method of achieving such amplification employs
the polymerase chain reaction ("PCR") (Mullis et al., Cold Spring
Harbor Symp. Quant. Biol. 51:263-273 (1986); Erlich et al.,
European Patent Appln. 50,424; European Patent Appln. 84,796,
European Patent Application 258,017, European Patent Appln.
237,362; Mullis, European Patent Appln. 201,184; Mullis, et al.,
U.S. Pat. No. 4,683,202; Erlich, U.S. Pat. No. 4,582,788; and
Saiki et al., U.S. Pat. No. 4,683,194), using primer pairs that are
capable of hybridizing to the proximal sequences that define a
polymorphism in its double-stranded form.
In lieu of PCR, alternative methods, such as the "Ligase Chain
Reaction" ("LCR") may be used (Barany, Proc. Natl. Acad. Sci. USA
88:189-193 (1991)). LCR uses two pairs of oligonucleotide probes to
exponentially amplify a specific target. The sequences of each pair
of oligonucleotides are selected to permit the pair to hybridize to
abutting sequences of the same strand of the target. Such
hybridization forms a substrate for a template-dependent ligase. As
with PCR, the resulting products thus serve as a template in
subsequent cycles and an exponential amplification of the desired
sequence is obtained.
LCR can be performed with oligonucleotides having the proximal and
distal sequences of the same strand of a polymorphic site. In one
embodiment, either oligonucleotide will be designed to include the
actual polymorphic site of the polymorphism. In such an embodiment,
the reaction conditions are selected such that the oligonucleotides
can be ligated together only if the target molecule either contains
or lacks the specific nucleotide that is complementary to the
polymorphic site present on the oligonucleotide. Alternatively, the
oligonucleotides may be selected such that they do not include the
polymorphic site (see Segev, PCT Application WO 90/01069).
The "Oligonucleotide Ligation Assay" ("OLA") may alternatively be
employed (Landegren et al., Science 241:1077-1080 (1988)). The OLA
protocol uses two oligonucleotides which are designed to be capable
of hybridizing to abutting sequences of a single strand of a
target. OLA, like LCR, is particularly suited for the detection of
point mutations. Unlike LCR, however, OLA results in "linear"
rather than exponential amplification of the target sequence.
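The difference between exponential amplification (PCR, LCR) and linear amplification (OLA) can be illustrated numerically. This is an idealized sketch: it assumes that every template is copied in every cycle, whereas real reactions are less than fully efficient and eventually plateau. The function names are hypothetical.

```python
def exponential_copies(templates: int, cycles: int,
                       efficiency: float = 1.0) -> float:
    """Copies after `cycles` rounds when products also serve as templates.

    With perfect efficiency (1.0) the copy number doubles every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return templates * (1.0 + efficiency) ** cycles


def linear_products(templates: int, cycles: int) -> int:
    """Products after `cycles` rounds when only the original target is a
    template, so each cycle adds one product per target molecule."""
    return templates * cycles


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, exponential_copies(1, n), linear_products(1, n))
```

After 10 ideal cycles from a single target, the exponential scheme yields 2**10 = 1024 copies while the linear scheme yields 10 products, which is why OLA is often paired with a prior amplification step.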
Nickerson et al. have described a nucleic acid detection assay that
combines attributes of PCR and OLA (Nickerson et al., Proc. Natl.
Acad. Sci. USA 87:8923-8927 (1990)). In this method, PCR is used to
achieve the exponential amplification of target DNA, which is then
detected using OLA. In addition to requiring multiple, and
separate, processing steps, one problem associated with such
combinations is that they inherit all of the problems associated
with PCR and OLA.
Schemes based on ligation of two (or more) oligonucleotides in the
presence of nucleic acid having the sequence of the resulting
"di-oligonucleotide", thereby amplifying the di-oligonucleotide,
are also known (Wu et al., Genomics 4:560 (1989)), and may be
readily adapted to the purposes of the present invention.
Other known nucleic acid amplification procedures, such as
allele-specific oligomers, branched DNA technology,
transcription-based amplification systems, or isothermal
amplification methods may also be used to amplify and analyze such
polymorphisms (Malek et al., U.S. Pat. No. 5,130,238; Davey et al.,
European Patent Application 329,822; Schuster et al., U.S. Pat. No.
5,169,766; Miller et al., PCT Application WO 89/06700; Kwoh et al.,
Proc. Natl. Acad. Sci. USA 86:1173-1177 (1989); Gingeras et al.,
PCT Application WO 88/10315; Walker et al., Proc. Natl. Acad. Sci.
USA 89:392-396 (1992)).
The identification of a polymorphism can be determined in a variety
of ways. By correlating the presence or absence of it in a plant
with the presence or absence of a phenotype, it is possible to
predict the phenotype of that plant. If a polymorphism creates or
destroys a restriction endonuclease cleavage site, or if it results
in the loss or insertion of DNA (e.g., a VNTR polymorphism), it
will alter the size or profile of the DNA fragments that are
generated by digestion with that restriction endonuclease. As such,
individuals that possess a variant sequence can be distinguished
from those having the original sequence by restriction fragment
analysis. Polymorphisms that can be identified in this manner are
termed "restriction fragment length polymorphisms" ("RFLPs"). RFLPs
have been widely used in human and plant genetic analyses
(Glassberg, UK Patent Application 2135774; Skolnick et al.,
Cytogen. Cell Genet. 32:58-67 (1982); Botstein et al., Am. J. Hum.
Genet. 32:314-331 (1980); Fischer et al., PCT Application
WO90/13668; Uhlen, PCT Application WO90/11369).
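The way a polymorphism that destroys a restriction site changes the fragment pattern can be sketched with a simple in-silico digest. The example uses the EcoRI recognition sequence GAATTC (cut after the first G); the DNA sequences and function name are hypothetical, and the DNA is treated as a linear, single-stranded string for simplicity.

```python
from typing import List


def fragment_lengths(sequence: str, site: str, cut_offset: int) -> List[int]:
    """Lengths of fragments produced by cutting a linear sequence at every
    occurrence of `site`, `cut_offset` bases into the site."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    ecori_site, ecori_offset = "GAATTC", 1  # EcoRI cuts G^AATTC

    original = "AAAA" + "GAATTC" + "CCCC"   # site present
    variant = "AAAA" + "GAGTTC" + "CCCC"    # single-base change removes site

    print(fragment_lengths(original, ecori_site, ecori_offset))
    print(fragment_lengths(variant, ecori_site, ecori_offset))
```

The original allele yields two fragments and the variant allele yields one uncut fragment, so the two alleles give different band patterns on a gel.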
Polymorphisms can also be identified by Single Strand Conformation
Polymorphism (SSCP) analysis. The SSCP technique is a method
capable of identifying most sequence variations in a single strand
of DNA, typically between 150 and 250 nucleotides in length (Elles,
Methods in Molecular Medicine: Molecular Diagnosis of Genetic
Diseases, Humana Press (1996); Orita et al., Genomics 5:874-879
(1989)). Under denaturing conditions a single strand of DNA will
adopt a conformation that is uniquely dependent on its sequence
conformation. This conformation usually will be different, even if
only a single base is changed. Most conformations have been
reported to alter the physical configuration or size sufficiently
to be detectable by electrophoresis. A number of protocols have
been described for SSCP including, but not limited to, Lee et al.,
Anal. Biochem. 205:289-293 (1992); Suzuki et al., Anal. Biochem.
192:82-84 (1991); Lo et al., Nucleic Acids Research 20:1005-1009
(1992); Sarkar et al., Genomics 13:441-443 (1992). It is
understood that one or more of the nucleic acids of the present
invention, may be utilized as markers or probes to detect
polymorphisms by SSCP analysis.
Polymorphisms may also be found using a DNA fingerprinting
technique called amplified fragment length polymorphism (AFLP),
which is based on the selective PCR amplification of restriction
fragments from a total digest of genomic DNA to profile that DNA
(Vos et al., Nucleic Acids Res. 23:4407-4414 (1995)). This method
allows for the specific co-amplification of high numbers of
restriction fragments, which can be visualized by PCR without
knowledge of the nucleic acid sequence.
AFLP employs basically three steps. Initially, a sample of genomic
DNA is cut with restriction enzymes and oligonucleotide adapters
are ligated to the restriction fragments of the DNA. The
restriction fragments are then amplified using PCR by using the
adapter and restriction sequence as target sites for primer
annealing. The selective amplification is achieved by the use of
primers that extend into the restriction fragments, amplifying only
those fragments in which the primer extensions match the nucleotide
flanking the restriction sites. These amplified fragments are then
visualized on a denaturing polyacrylamide gel.
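The selective amplification step can be sketched as a filter over restriction fragments. This is a simplified model under stated assumptions: each fragment is given as the sequence lying between the two restriction sites, one primer carries selective bases that must match the start of the fragment, and the other primer carries selective bases that must match the start of the reverse complement. The names and sequences are hypothetical.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def selectively_amplified(fragments: List[str], left_selective: str,
                          right_selective: str) -> List[str]:
    """Fragments whose flanking bases match both selective extensions."""
    kept = []
    for fragment in fragments:
        matches_left = fragment.startswith(left_selective)
        matches_right = reverse_complement(fragment).startswith(right_selective)
        if matches_left and matches_right:
            kept.append(fragment)
    return kept


if __name__ == "__main__":
    fragments = ["ACGGTTAGT", "TCGGTTAGT", "ACGGTTAGC"]
    print(selectively_amplified(fragments, "A", "A"))
```

Each additional selective base is expected to reduce the number of amplified fragments roughly fourfold, which is how the complexity of the fingerprint is tuned.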
AFLP analysis has been performed on Salix (Beismann et al., Mol.
Ecol. 6:989-993 (1997)), Acinetobacter (Janssen et al., Int. J.
Syst. Bacteriol. 47:1179-1187 (1997)), Aeromonas popoffi (Huys et
al., Int. J. Syst. Bacteriol. 47:1165-1171 (1997)), rice (McCouch
et al., Plant Mol. Biol. 35:89-99 (1997); Nandi et al., Mol. Gen.
Genet. 255:1-8 (1997); Cho et al., Genome 39:373-378 (1996)),
barley (Hordeum vulgare) (Simons et al., Genomics 44:61-70 (1997);
Waugh et al., Mol. Gen. Genet. 255:311-321 (1997); Qi et al., Mol.
Gen. Genet. 254:330-336 (1997); Becker et al., Mol. Gen. Genet.
249:65-73 (1995)), potato (Van der Voort et al., Mol. Gen. Genet.
255:438-447 (1997); Meksem et al., Mol. Gen. Genet. 249:74-81
(1995)), Phytophthora infestans (Van der Lee et al., Fungal Genet.
Biol. 21:278-291 (1997)), Bacillus anthracis (Keim et al., J.
Bacteriol. 179:818-824 (1997)), Astragalus cremnophylax (Travis et
al., Mol. Ecol. 5:735-745 (1996)), Arabidopsis (Cnops et al., Mol.
Gen. Genet. 253:32-41 (1996)), Escherichia coli (Lin et al., Nucleic
Acids Res. 24:3649-3650 (1996)), Aeromonas (Huys et al., Int. J.
Syst. Bacteriol. 46:572-580 (1996)), nematode (Folkertsma et al.,
Mol. Plant Microbe Interact. 9:47-54 (1996)), tomato (Thomas et
al., Plant J. 8:785-794 (1995)), and human (Latorra et al., PCR
Methods Appl. 3:351-358 (1994)). AFLP analysis has also been used
for fingerprinting mRNA (Money et al., Nucleic Acids Res.
24:2616-2617 (1996); Bachem, et al., Plant J. 9:745-753 (1996)). It
is understood that one or more of the promoter sequences of the
present invention, may be utilized as markers or probes to detect
polymorphisms by AFLP analysis for fingerprinting mRNA.
Polymorphisms may also be found using random amplified polymorphic
DNA (RAPD) (Williams et al., Nucl. Acids Res. 18:6531-6535 (1990))
and cleavable amplified polymorphic sequences (CAPS) (Lyamichev et
al., Science 260:778-783 (1993)). It is understood that one or more
of the promoter sequences of the present invention, may be utilized
as markers or probes to detect polymorphisms by RAPD or CAPS
analysis.
Promoter sequences of the present invention can be used in a
microarray-based method for high-throughput screening of plant
genomic DNA. This `chip`-based approach involves using microarrays
of nucleic acid molecules as gene-specific hybridization targets to
identify and quantitatively measure the corresponding plant genes
(Schena et al., Science 270:467-470 (1995); Shalon, Ph.D. Thesis.
Stanford University (1996)). Every nucleotide in a large sequence
can be queried at the same time. Hybridization can be used to
efficiently analyze nucleotide sequences.
Several microarray methods have been described. For example,
microarrays of BACs may be prepared to sufficiently cover 3×
of an entire genome. Such microarrays can be used in a variety of
genomics experiments including gene mapping, DNA fingerprinting and
promoter identification. Microarrays of genomic DNA can also be
used for parallel analysis of genomes at single gene resolution
(Lemieux et al., Molecular Breeding 277-289 (1988)). It is
understood that one or more of the molecules of the present
invention, preferably one or more of the promoter sequences of the
present invention may be utilized in a genomic microarray based
method. In a preferred embodiment of the present invention, one or
more of the rice genomic promoter sequences may be utilized in a
genomic microarray based method. For example, Genomic Mismatch
Scanning (GMS), a hybridization-based method of linkage analysis
that allows rapid identification of regions of identity-by-descent
between two related individuals, can be carried out with
microarrays. GMS is reported to have been used to identify
genetically common chromosomal segments based on the ability of
these DNA sequences to form extensive regions of mismatch-free
heteroduplexes. A series of enzymatic steps, coupled with filter
binding, is used to selectively remove heteroduplexes that contain
mismatches (i.e., chromosomal regions that do not share
identity-by-descent). Fragments of chromosomal DNA representing inherited
regions are hybridized to a microarray of ordered genomic clones
and positive hybridization signals pinpoint regions of
identity-by-descent at high resolution (Lemieux et al., Molecular
Breeding 277-289 (1988)).
It is understood that one or more of the nucleic acid molecules of
the present invention may be utilized in a GMS microarray based
method to locate regions of identity-by-descent between related
individuals. In a preferred embodiment of the present invention,
one or more of the nucleic acid molecules of the present invention
may be utilized in a GMS microarray based method to locate regions
of identity-by-descent between related individuals.
A particularly preferred microarray embodiment of the present
invention is a microarray comprising nucleic acid molecules that
are homologues of known sequences but elicit only limited or no
matches to known nucleic acid molecules. A further preferred
microarray embodiment of the present invention is a microarray
comprising genomic nucleic acid molecules of the present invention
that elicit only limited or no matches to known genes.
It is understood that one or more of the molecules of the present
invention, preferably one or more of the promoter sequences of the
present invention may be utilized in a microarray based method.
In a preferred embodiment of the present invention, one or more of
the nucleic acid molecules of the present invention may be utilized
in a microarray based method.
Computer Related Uses of the Invention
A nucleic acid molecule comprising SEQ ID NO:1 through SEQ ID NO:
57,467, complements thereof and fragments of either, or a nucleic
acid molecule that hybridizes under stringent conditions with SEQ
ID NO: 1 through SEQ ID NO:57,467, or any complement thereof; or
exhibits 85% or greater identity, and more preferably at least 86
or greater, 87 or greater, 88 or greater, 89 or greater, 90 or
greater, 91 or greater, 92 or greater, 93 or greater, 94 or
greater, 95 or greater, 96 or greater, 97 or greater, 98 or
greater, or 99% or greater identity to SEQ ID NO:57,467; can be
"provided" in a variety of mediums to facilitate its use. Such a medium
can also provide a subset thereof in a form that allows a skilled
artisan to examine the sequences.
In a preferred embodiment, at least 20, 50, 100, 500, 1,000, 2,000,
3,000, or 4,000 of the nucleic acid sequences of the present
invention are provided in a variety of mediums. In one application
of this embodiment, a nucleotide sequence of the present invention
can be recorded on computer readable media. As used herein,
"computer readable media" refers to any medium that can be read and
accessed directly by a computer. Such media include, but are not
limited to: magnetic storage media, such as floppy discs, hard
disc storage medium, and magnetic tape; optical storage media such
as CD-ROM; electrical storage media such as RAM and ROM; and
hybrids of these categories such as magnetic/optical storage media.
A skilled artisan can readily appreciate how any of the presently
known computer readable mediums can be used to create a manufacture
comprising computer readable medium having recorded thereon a
nucleotide sequence of the present invention.
As used herein, "recorded" refers to a process for storing
information on computer readable medium. A skilled artisan can
readily adopt any of the presently known methods for recording
information on computer readable medium to generate media
comprising the nucleotide sequence information of the present
invention. A variety of data storage structures are available to a
skilled artisan for creating a computer readable medium having
recorded thereon a nucleotide sequence of the present invention.
The choice of the data storage structure will generally be based on
the means chosen to access the stored information. In addition, a
variety of data processor programs and formats can be used to store
the nucleotide sequence information of the present invention on
computer readable medium. The sequence information can be
represented in a word processing text file, formatted in
commercially-available software such as WordPerfect and Microsoft
Word, or represented in the form of an ASCII file, stored in a
database application, such as DB2, Sybase, Oracle, or the like. A
skilled artisan can readily adapt any number of data processor
structuring formats (e.g., text file or database) in order to
obtain computer readable medium having recorded thereon the
nucleotide sequence information of the present invention.
By providing one or more of nucleotide sequences of the present
invention, a skilled artisan can routinely access the sequence
information for a variety of purposes. Computer software is
publicly available which allows a skilled artisan to access
sequence information provided in a computer readable medium. The
examples which follow demonstrate how software which implements the
BLAST (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) and BLAZE
(Brutlag et al., Comp. Chem. 17:203-207 (1993)) search algorithms
on a Sybase system can be used to identify open reading frames
(ORFs) within the genome that contain homology to ORFs or proteins
from other organisms. Such ORFs are protein-encoding fragments
within the sequences of the present invention and are useful in
producing commercially important proteins such as enzymes used in
amino acid biosynthesis, metabolism, transcription, translation,
RNA processing, nucleic acid and protein degradation, protein
modification, and DNA replication, restriction, modification,
recombination, and repair.
The present invention further provides systems, particularly
computer-based systems, which contain the sequence information
described herein. Such systems are designed to identify
commercially important fragments of the nucleic acid molecule of
the present invention. As used herein, "a computer-based system"
refers to the hardware means, software means, and data storage
means used to analyze the nucleotide sequence information of the
present invention. The minimum hardware means of the computer-based
systems of the present invention comprises a central processing
unit (CPU), input means, output means, and data storage means. A
skilled artisan can readily appreciate that any one of the
currently available computer-based systems is suitable for use in
the present invention.
As indicated above, the computer-based systems of the present
invention comprise a data storage means having stored therein a
nucleotide sequence of the present invention and the necessary
hardware means and software means for supporting and implementing a
search means. As used herein, "data storage means" refers to memory
that can store nucleotide sequence information of the present
invention, or a memory access means which can access manufactures
having recorded thereon the nucleotide sequence information of the
present invention. As used herein, "search means" refers to one or
more programs which are implemented on the computer-based system to
compare a target sequence or target structural motif with the
sequence information stored within the data storage means. Search
means are used to identify fragments or regions of the sequence of
the present invention that match a particular target sequence or
target motif. A variety of known algorithms are disclosed publicly
and a variety of commercially available software for conducting
search means are available and can be used in the computer-based
systems of the present invention. Examples of such software
include, but are not limited to, MacPattern (EMBL), BLASTN and
BLASTX (NCBI). One of the available algorithms or implementing
software packages for conducting homology searches can be adapted
for use in the present computer-based systems.
As used herein, "a target structural motif," or "target motif,"
refers to any rationally selected sequence or combination of
sequences in which the sequence(s) are chosen based on primary
sequence composition or a three dimensional configuration which is
formed upon folding of the target motif. There are a variety of
target motifs known in the art. Target motifs include, but are not
limited to, transcription factor binding sites, repressor binding
sites, inducible expression elements, transcriptional activation
sites, transcription initiation sites, untranslated leaders, intron
splicing sites, methylation sites, histone binding sites, RNA
processing sites, non-histone structural protein binding sites,
replication sites, sites which influence the stability of
transcribed mRNA message and hairpin sites.
Thus, the present invention further provides an input means for
receiving a target sequence, a data storage means for storing the
target sequences of the present invention identified using
a search means as described above, and an output means for
outputting the identified homologous sequences. A variety of
structural formats for the input and output means can be used to
input and output information in the computer-based systems of the
present invention. A preferred format for an output means ranks
fragments of the sequence of the present invention by varying
degrees of homology to the target sequence or target motif. Such
presentation provides a skilled artisan with a ranking of sequences
which contain various amounts of the target sequence or target
motif and identifies the degree of homology contained in the
identified fragment.
A variety of comparing means can be used to compare a target
sequence or target motif with the data storage means to identify
sequence fragments of the present invention. For example,
software which implements the BLAST and BLAZE
algorithms (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) can
be used to identify open reading frames within the nucleic acid molecules
of the present invention. A skilled artisan can readily recognize
that any one of the publicly available homology search programs can
be used as the search means for the computer-based systems of the
present invention.
Having now generally described the invention, the same will be more
readily understood through reference to the following examples
which are provided by way of illustration, and are not intended to
be limiting of the present invention, unless specified.
Each periodical, patent, and other document or reference cited
herein is herein incorporated by reference in its entirety.
EXAMPLES
The following examples are provided to better illustrate the
practice of the present invention and should not be interpreted in
any way to limit the scope of the present invention. Those skilled
in the art will recognize that various modifications, truncations,
etcetera can be made to the methods and genes described herein
while not departing from the spirit and scope of the present
invention.
Example 1
Generating a Genomic Bacterial Artificial Chromosome (BAC)
Library
BACs are stable, non-chimeric cloning systems having genomic
fragment inserts (100-300 kb) and their DNA can be prepared for
most types of experiments including DNA sequencing. The BAC vector
pBeloBAC11 is derived from the endogenous E. coli F-factor
plasmid, which contains genes for strict copy number control and
unidirectional origin of DNA replication. Additionally, pBeloBAC11
has three unique restriction enzyme sites (Hind III, Bam HI and Sph
I) located within the LacZ gene which can be used as cloning sites
for megabase-size plant DNA. Indigo, another BAC vector, contains
Hind III and Eco RI cloning sites. This vector also contains a
random mutation in the LacZ gene that allows for darker blue
colonies.
As an alternative, the P1-derived artificial chromosome (PAC) can
be used as a large DNA fragment cloning vector (Ioannou, et al.,
Nature Genet. 6:84-89 (1994); Suzuki, et al., Gene 199:133-137
(1997)). The PAC vector has most of the features of the BAC system,
but also contains some of the elements of the bacteriophage P1
cloning system.
BAC libraries are generated by ligating size-selected restriction
digested DNA with pBeloBAC11 followed by electroporation into E.
coli. BAC library construction and characterization is extremely
efficient when compared to YAC (yeast artificial chromosome)
library construction and analysis, particularly because of the
chimerism associated with YACs and difficulties associated with
extracting YAC DNA.
There are general methods for preparing megabase-size DNA from
plants. For example, the protoplast method yields megabase-size DNA
of high quality with minimal breakage. A process involves preparing
young leaves which are manually feathered with a razor-blade before
being incubated for four to five hours with cell-wall-degrading
enzymes. A second method developed by Zhang et al., Plant J.
7:175-184 (1995), is a universal nuclei method that works well for
several divergent plant taxa. Fresh or frozen tissue is homogenized
with a blender or mortar and pestle. Nuclei are then isolated and
embedded. DNA prepared by the nuclei method is often more
concentrated and is reported to contain lower amounts of
chloroplast DNA than the protoplast method.
Once protoplasts or nuclei are produced, they are embedded in an
agarose matrix as plugs or microbeads. The agarose provides a
support matrix to prevent shearing of the DNA while allowing
enzymes and buffers to diffuse into the DNA. The DNA is purified
and manipulated in the agarose and is stable for more than one year
at 4.degree. C.
Once high molecular weight DNA is prepared, it is fragmented to the
desired size range. In general, DNA fragmentation utilizes two
general approaches, 1) physical shearing and 2) partial digestion
with a restriction enzyme that cuts relatively frequently within
the genome. Since physical shearing is not dependent upon the
frequency and distribution of particular restriction enzymes sites,
this method should yield the most random distribution of DNA
fragments. However, the ends of the sheared DNA fragments must be
repaired and cloned directly or restriction enzyme sites added by
the addition of synthetic linkers. Because of the subsequent steps
required to clone DNA fragmented by shearing, most protocols
fragment DNA by partial restriction enzyme digestion. The advantage
of partial restriction enzyme digestion is that no further
enzymatic modification of the ends of the restriction fragments is
necessary. Four common techniques that can be used to achieve
reproducible partial digestion of megabase-size DNA are 1) varying
the concentration of the restriction enzyme, 2) varying the time of
incubation with the restriction enzyme 3) varying the concentration
of an enzyme cofactor (e.g., Mg2+) and 4) varying the ratio of
endonuclease to methylase.
There are three cloning sites in pBeloBAC11, but only Hind III and
Bam HI produce 5' overhangs for easy vector dephosphorylation.
These two restriction enzymes are primarily used to construct BAC
libraries. The optimal partial digestion conditions for
megabase-size DNA are determined by wide and narrow window
digestions. To determine the optimum amount of Hind III, 1, 2, 3,
10, and 5- units of enzyme are each added to 50 ml aliquots of
microbeads and incubated at 37.degree. C. for 20 minutes.
After partial digestion of megabase-size DNA, the DNA is run on a
pulsed-field gel, and DNA in a size range of 100-500 kb is excised
from the gel. This DNA is ligated to the BAC vector or subjected to
a second size selection on a pulsed field gel under different
running conditions. Studies have previously reported that two
rounds of size selection can eliminate small DNA fragments
co-migrating with the selected range in the first pulse-field
fractionation. Such a strategy results in an increase in insert
sizes and a more uniform insert size distribution. A practical
approach to performing size selections is to first test for the
number of clones/microliter of ligation and insert size from the
first size selected material. If the numbers are good (500 to 2000
white colonies/microliter of ligation) and the size range is also
good (50 to 300 kb) then a second size selection is practical. When
performing a second size selection one expects an 80 to 95% decrease
in the number of recombinant clones per transformation.
Twenty to two hundred nanograms of the size-selected DNA is ligated
to dephosphorylated BAC vector (molar ratio of 10 to 1 in BAC
vector excess). Most BAC libraries use a molar ratio of 5 to 15:1
(size selected DNA:BAC vector).
Transformation is carried out by electroporation and the
transformation efficiency for BACs is about 40 to 1,500
transformants from one microliter of ligation product or 20 to 1000
transformants/ng DNA.
Several tests can be carried out to determine the quality of a BAC
library. Three basic tests to evaluate the quality include: the
genome coverage of a BAC library-average insert size, average
number of clones hybridizing with single copy probes and
chloroplast DNA content.
The determination of the average insert size of the library is
assessed in two ways. First, during library construction every
ligation is tested to determine the average insert size by assaying
20-50 BAC clones per ligation. DNA is isolated from recombinant
clones using a standard mini preparation protocol, digested with
Not I to free the insert from the BAC vector and then sized using
pulsed field gel electrophoresis (Maule, Molecular Biotechnology
9:107-126 (1998)).
To determine the genome coverage of the library, it is screened
with single copy RFLP markers distributed randomly across the
genome by hybridization. Microtiter plates containing BAC clones
are spotted onto Hybond membranes. Bacteria from 48 or 72 plates
are spotted twice onto one membrane resulting in 18,000 to 27,648
unique clones on each membrane in either a 4.times.4 or 5.times.5
orientation. Since each clone is present twice, false positives are
easily eliminated and true positives are easily recognized and
identified.
Finally, the chloroplast DNA content in the BAC library is
estimated by hybridizing three chloroplast genes spaced evenly
across the chloroplast genome to the library on high density
hybridization filters.
There are strategies for isolating rare sequences within the
genome. For example, higher plant genomes can range in size from
100 Mb/1C (Arabidopsis) to 15,966 Mb/C (Triticum aestivum),
(Arumuganathan and Earle, Plant Mol. Biol. Rep. 9:208-219 (1991)). The
number of clones required to achieve a given probability that any
DNA sequence will be represented in a genomic library is
N=ln(1-P)/ln(1-L/G), where N is the number of clones required, P
is the probability desired to get the target sequence, L is the
length of the average clone insert in base pairs and G is the
haploid genome length in base pairs (Clarke et al., Cell 9:91-100
(1976)).
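The clone-number formula above can be evaluated directly. The following is a minimal sketch in Python; the function name and the example genome size and insert length are illustrative assumptions and are not taken from this disclosure.

```python
import math


def clones_required(p, insert_bp, genome_bp):
    """Estimate N = ln(1 - P) / ln(1 - L/G), rounded up to a whole clone.

    p          -- desired probability (0 < p < 1) of recovering a given sequence
    insert_bp  -- average clone insert length L, in base pairs
    genome_bp  -- haploid genome length G, in base pairs
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 < insert_bp < genome_bp:
        raise ValueError("insert length must be positive and shorter than the genome")
    return math.ceil(math.log(1 - p) / math.log(1 - insert_bp / genome_bp))


# Hypothetical inputs chosen only for illustration:
# a 430 Mb haploid genome, 150 kb average inserts, 99% probability.
n = clones_required(0.99, 150_000, 430_000_000)
print(n)
```

Because N grows with ln(1-P), raising the desired probability from 0.95 to 0.99 increases the required library size by roughly half rather than by a constant number of clones.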
The rice BAC library of the present invention is constructed in the
pBeloBAC11 or similar vector. Inserts are generated by partial Eco
RI or other enzymatic digestion of DNA. The 25.times. library
provides 4-5.times. sequence coverage from BAC clones across the
genome.
Example 2
Sequencing Genomic DNA Inserts from a Genomic BAC Library
Two basic methods can be used for DNA sequencing, the chain
termination method of Sanger et al., Proc. Natl. Acad. Sci. USA
74:5463-5467 (1977), and the chemical degradation method of Maxam
and Gilbert, Proc. Natl. Acad. Sci. USA 74:560-564 (1977).
Automation and advances in technology such as the replacement of
radioisotopes with fluorescence-based sequencing have reduced the
effort required to sequence DNA (Craxton, Methods, 2:20-26 (1991);
Ju et al., Proc. Natl. Acad. Sci. USA 92:4347-4351 (1995); Tabor
and Richardson, Proc. Natl. Acad. Sci. USA 92:6339-6343 (1995)).
Automated sequencers are available from, for example, Pharmacia
Biotech, Inc., Piscataway, N.J. (Pharmacia ALF), LI-COR, Inc.,
Lincoln, Nebr. (LI-COR 4,000) and Millipore, Bedford, Mass.
(Millipore BaseStation).
In addition, advances in capillary gel electrophoresis have also
reduced the effort required to sequence DNA and such advances
provide a rapid high resolution approach for sequencing DNA samples
(Swerdlow and Gesteland, Nucleic Acids Res. 18:1415-1419 (1990);
Smith, Nature 349:812-813 (1991); Luckey et al., Methods Enzymol.
218:154-172(1993); Lu et al., J. Chromatog. A. 680:497-501 (1994);
Carson et al., Anal. Chem. 65:3219-3226 (1993); Huang et al, Anal.
Chem. 64:2149-2154 (1992); Kheterpal et al., Electrophoresis
17:1852-1859 (1996); Quesada and Zhang, Electrophoresis
17:1841-1851 (1996); Baba, Yakugaku Zasshi 117:265-281 (1997)).
A number of sequencing techniques are known in the art, including
fluorescence-based sequencing methodologies. These methods have the
detection, automation and instrumentation capability necessary for
the analysis of large volumes of sequence data. Currently, the
377 DNA Sequencer (Perkin-Elmer Corp., Applied Biosystems Div.,
Foster City, Calif.) allows the most rapid electrophoresis and data
collection. With these types of automated systems, fluorescent
dye-labeled sequence reaction products are detected and data
entered directly into the computer, producing a chromatogram that
is subsequently viewed, stored, and analyzed using the
corresponding software programs. These methods are known to those
of skill in the art and have been described and reviewed (Birren et
al., Genome Analysis: Analyzing DNA, 1, Cold Spring Harbor, N.Y.
(1999)).
PHRED is used to call the bases from the sequence trace files
(www-mbt.washington.edu). PHRED uses Fourier methods to examine the
four base traces in the region surrounding each point in the data
set in order to predict a series of evenly spaced predicted
locations. That is, it determines where the peaks would be centered
if there were no compressions, dropouts, or other factors shifting
the peaks from their "true" locations. Next, PHRED examines each
trace to find the centers of the actual, or observed peaks and the
areas of these peaks relative to their neighbors. The peaks are
detected independently along each of the four traces so many peaks
overlap. A dynamic programming algorithm is used to match the
observed peaks detected in the second step with the predicted peak
locations found in the first step.
After the base calling is completed, contaminating sequences (E.
coli, BAC vector sequences >50 bases and sub-cloning vector) are
removed and constraints are made for the assembler. Contigs are
assembled using CAP3 (Huang, et al., Genomics 46: 37-45
(1997)).
A two-step re-assembly process is employed to reduce sequence
redundancies caused by overlaps between BAC clones. In the first
step, BAC clones are grouped into clusters based on overlaps
between contig sequences from different BACs. These overlaps are
identified by comparing each sequence in the dataset against every
other sequence by BLASTN. BACs containing overlaps greater than
5,000 base pairs in length and greater than 94% in sequence
identity are put into the same cluster. Repetitive sequences are
masked prior to this procedure to avoid false joining by repetitive
elements present in the genome. In the second step, sequences from
each BAC cluster are assembled by PHRAP.longread, which is able to
handle very long sequences. A minimum match is set at 100 bp and a
minimum score is set at 600 as a threshold to join input contigs
into longer contigs.
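The first, clustering step of this re-assembly can be sketched as follows. This is a minimal illustration only: it assumes that the pairwise BLASTN overlaps have already been reduced to tuples of (BAC, BAC, overlap length, percent identity), and it uses a union-find structure as one convenient way to form the clusters. The function and variable names are hypothetical.

```python
from collections import defaultdict

MIN_OVERLAP_BP = 5000      # overlaps must be greater than 5,000 base pairs
MIN_IDENTITY_PCT = 94.0    # and greater than 94% sequence identity


def cluster_bacs(bac_ids, overlaps):
    """Group BACs into clusters linked by qualifying overlaps.

    overlaps -- iterable of (bac_a, bac_b, overlap_bp, identity_pct) tuples,
                assumed to come from repeat-masked BLASTN comparisons.
    Returns a sorted list of sorted BAC-id lists.
    """
    parent = {bac: bac for bac in bac_ids}

    def find(x):
        # Follow parent links to the cluster representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, overlap_bp, identity_pct in overlaps:
        if a != b and overlap_bp > MIN_OVERLAP_BP and identity_pct > MIN_IDENTITY_PCT:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for bac in bac_ids:
        clusters[find(bac)].add(bac)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical example: A-B and B-C qualify, C-D is too short to join.
example = cluster_bacs(
    ["A", "B", "C", "D"],
    [("A", "B", 6000, 95.0), ("B", "C", 8000, 99.0), ("C", "D", 4000, 99.0)],
)
print(example)
```

Each resulting cluster would then be passed to the assembler in the second step; that assembly itself is not modeled here.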
Example 3
Identifying Genes within a Genomic BAC Library
This example illustrates the identification of combigenes within
the rice genomic contig library as assembled in Example 2. The
genes and partial genes that are embedded in such contigs are
identified through a series of informatic analyses. The tools to
define genes fall into two categories: homology-based and
predictive-based methods. Homology-based searches (e.g., GAP2,
BLASTX supplemented by NAP and TBLASTX) detect conserved sequences
during comparisons of DNA sequences or hypothetically translated
protein sequences to public and/or proprietary DNA and protein
databases. Existence of an Oryza sativa gene is inferred if
significant sequence similarity extends over the majority of the
target gene. Since homology-based methods may overlook genes unique
to Oryza sativa, for which homologous nucleic acid molecules have
not yet been identified in databases, gene prediction programs are
also used. Predictive methods employed in the definition of the
Oryza sativa genes included the use of the GenScan gene predictive
software program which is available from Stanford University (e.g.,
at the website: www-gnomic/stanford.edu/GENSCANW.html) and the
GeneMark.hmm for Eukaryotes program from Gene Probe, Inc. (Atlanta,
Ga.; www-geneprobe.net/index.htm). GenScan, in general terms,
infers the presence and extent of a gene through a search for
"gene-like" grammar. GeneMark.hmm searches a file containing DNA
sequence data for genes. It employs a Hidden Markov Model algorithm
with a species-specific inhomogeneous Markov model of gene-encoding
regions of DNA.
The homology-based methods that are used to define the Oryza sativa
gene set included GAP2, BLASTX supplemented by NAP and TBLASTX. For
a description of BLASTX and TBLASTX see Coulson, Trends in
Biotechnology 12:76-80 (1994) and Birren et al., Genome Analysis,
1:543-559 (1997). GAP2 and NAP are part of the Analysis and
Annotation Tool (AAT) for Finding Genes in Genomic Sequences which
was developed by Xiaoqiu Huang at Michigan Tech University and is
available at the web site www-genome.cs.mtu.edu/. The AAT package
includes two sets of programs, one set DPS/NAP (referred to as
"NAP") for comparing the query sequence with a protein database,
and the other set DDS/GAP2 (referred to as "GAP2") for comparing
the query sequence with a cDNA database. Each set contains a fast
database search program and a rigorous alignment program. The
database search program identifies regions of the query sequence
that are similar to a database sequence. Then the alignment program
constructs an optimal alignment for each region and the database
sequence. The alignment program also reports the coordinates of
exons in the query sequence. See Huang, et al., Genomics 46: 37-45
(1997). The GAP2 program computes an optimal global alignment of a
genomic sequence and a cDNA sequence without penalizing terminal
gaps. A long gap in the cDNA sequence is given a constant penalty.
The DNA-DNA alignment by GAP2 adjusts penalties to accommodate
introns. The GAP2 program makes use of splice site consensuses in
alignment computation. GAP2 delivers the alignment in linear space,
so long sequences can be aligned. See Huang, Computer Applications
in the Biosciences 10:227-235 (1994). The GAP2 program aligns the
Oryza sativa contigs with a library of 42,260 Oryza sativa
cDNAs.
The NAP program computes a global alignment of a DNA sequence and a
protein sequence without penalizing terminal gaps. NAP handles
frameshifts and long introns in the DNA sequence. The program
delivers the alignment in linear space, so long sequences can be
aligned. It makes use of splice site consensuses in alignment
computation. Both strands of the DNA sequence are compared with the
protein sequence and one of the two alignments with the larger
score is reported. See Huang, and Zhang, Computer Applications in
the Biosciences 12(6), 497-506 (1996).
NAP takes a nucleotide sequence, translates it in three forward
reading frames and three reverse complement reading frames, and
then compares the six translations against a protein sequence
database (e.g., the non-redundant protein (i.e., nr-aa) database
maintained by the National Center for Biotechnology Information as
part of GenBank and available at the web site:
www.ncbi.nlm.nih.gov).
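The six-frame translation described above can be illustrated with a short sketch. It assumes the standard genetic code and a sequence containing only A, C, G and T; it shows only the translation step and is not the code of NAP or any other named program. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
print(frames["+1"])  # MA*
```

Each of the six translations would then be compared against the protein database by the search program of choice.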
The first homology-based search for genes in the Oryza sativa
contigs is effected using the GAP2 program and the Oryza sativa
library of clustered Oryza sativa cDNA. The Oryza sativa clusters
are mapped onto an assembly of Oryza sativa contigs using the GAP2
program. GAP2 standards for selecting a DNA-DNA match are
.gtoreq.92% sequence identity with the following parameters: gap
extension penalty=1; match score=2; gap open penalty=6; gap length for
constant penalty=20; mismatch penalty=2; minimum exon length=21;
minimum total length of all exons in a gene (in nucleotides)=200.
When a particular Oryza sativa cDNA aligns to more than one Oryza
sativa contig, the alignment with the highest identity is selected
and alignments with lower levels of identity are filtered out as
spurious alignments. Oryza sativa cDNA sequences aligning to
Oryza sativa contigs with exceptionally low complexity are filtered
out when the basis for alignment included a high number of cDNAs
with poly A tails aligning to genomic regions with extended repeats
of A or T.
The second homology-based method used for gene discovery is BLASTX
hits extended with the NAP software package. BLASTX is run with the
Oryza sativa genomic contigs as queries against the GenBank
non-redundant protein data library identified as "nr-aa". NAP is
used to better align the amino acid sequences as compared to the
genomic sequence. NAP extends the match in regions where BLASTX has
identified high-scoring-pairs (HSPs), predicts introns, and then
links the exons into a single ORF prediction. Experience suggests
that NAP tends to mis-predict the first exon. The NAP parameters
are: gap extension penalty=1; gap open penalty=15; gap length for
constant penalty=25; minimum exon length (in aa)=7; minimum total length
of all exons in a gene (in nucleotides)=200; homology>40%.
The NAP alignment score and GenBank reference number for best match
are reported for each contig for which there is a NAP hit.
In the final homology-based method, TBLASTX is used with cDNA
information from four plant sequencing projects: 27,037 sequences
from Triticum aestivum, 136,074 sequences from Glycine max, 71,822
sequences from Zea mays and 68,517 sequences from Arabidopsis
thaliana. Conservative standards for inclusion of TBLASTX hits into
the gene set are utilized. These standards are a minimal E value of
1E-16, and a minimal match of 150 bp in the Oryza sativa contig.
The GenScan program is "trained" with Arabidopsis thaliana
characteristics. Though better than the "off-the-shelf" version,
the GenScan trained to identify Oryza sativa genes proved more
proficient at predicting exons than predicting full-length genes.
Predicting full-length genes is compromised by point mutations in
the unfinished contigs, as well as by the short length of the
contigs relative to the typical length of a gene. Due to the errors
found in the full-length gene predictions by GenScan, inclusion of
GenScan-predicted genes is limited to those genes and exons whose
probabilities are above a conservative probability threshold. The
GenScan parameters are: weighted mean GenScan P value>0.4; mean
GenScan T value>0; mean GenScan Coding score>50; length>200
bp; minimum total length of all exons in a gene=500.
The weighted mean GenScan P value is a probability for correctly
predicting ORFs or partial ORFs and is defined as
(1/Σ l_i)(Σ l_i P_i), where l_i is the length of exon i and P_i is
the probability of correctness for exon i.
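As a worked illustration of this length-weighted mean, the following sketch computes the value for one predicted gene from a list of (exon length, exon probability) pairs. The function name and the example numbers are hypothetical.

```python
def weighted_mean_p(exons):
    """Length-weighted mean exon probability: sum(l_i * P_i) / sum(l_i).

    exons -- list of (length_bp, probability) pairs for one predicted gene.
    """
    total_length = sum(length for length, _ in exons)
    if total_length <= 0:
        raise ValueError("at least one exon with positive length is required")
    return sum(length * prob for length, prob in exons) / total_length


# Hypothetical gene with a short confident exon and a long uncertain one:
# (100 * 0.9 + 300 * 0.5) / 400 = 0.6
p_value = weighted_mean_p([(100, 0.9), (300, 0.5)])
print(p_value)
```

Weighting by length means that a long, poorly supported exon lowers the gene-level value more than a short one does.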
The GeneMark.hmm for Eukaryotes program uses the Hidden Markov
model for species Oryza sativa. Minimum total length of all exons
in a gene is 500 bp. Except for the model selection, there is no
specific run-time parameter for GeneMark.hmm.
The gene predictions from these programs are stored in a database
and then combigenes are derived from these predictions. A combigene
is a cluster of putative genes which satisfy the following
criteria:
All genes making up a single combigene are located on the same
strand of a contig;
Maximum intron size of a valid gene is 4000 bp;
Maximum distance between any two genes in the same combigene is 200
bp, as measured by the bases between adjacent ending exons;
If an individual gene is predicted by NAP it has at least 40%
sequence identity to its hit;
If an individual gene is predicted by GAP2 it has at least 92%
sequence identity to its hit;
If an individual gene is predicted by Genscan the weighted average
of the probabilities calculated for all of its exons is not less
than 0.4. The gene boundaries of a Genscan-predicted gene are
determined while taking into account only exons.
Since TBLASTX-predicted genes are strandless, the combigene which is
made up of such genes can be assigned a strand only if there is a
gene in the cluster that was predicted by a strand-defining
gene-predicting program.
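The clustering criteria above can be sketched, without limitation, as a greedy merge of position-sorted gene predictions. The `Gene` record, the 1-based inclusive coordinates and the greedy strategy are assumptions of this sketch. The per-program filters (NAP identity, GAP2 identity, GenScan probability, maximum intron size) are assumed to have been applied beforehand.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_GAP_BP = 200  # maximum distance between genes in one combigene (from the text)


@dataclass
class Gene:
    start: int             # 1-based start on the contig (assumed convention)
    end: int               # 1-based inclusive end
    strand: Optional[str]  # "+", "-", or None for strandless TBLASTX genes
    method: str            # e.g. "GenScan", "AAT/NAP", "TBLASTX"


def cluster_strand(cluster: List[Gene]) -> Optional[str]:
    """Strand of a cluster, taken from its first strand-defining member, if any."""
    for gene in cluster:
        if gene.strand is not None:
            return gene.strand
    return None


def build_combigenes(genes: List[Gene]) -> List[List[Gene]]:
    """Greedily merge position-sorted genes that are close and strand-compatible."""
    clusters: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if clusters:
            current = clusters[-1]
            gap = gene.start - max(g.end for g in current) - 1
            strand = cluster_strand(current)
            compatible = strand is None or gene.strand is None or strand == gene.strand
            if gap <= MAX_GAP_BP and compatible:
                current.append(gene)
                continue
        clusters.append([gene])
    return clusters


# Hypothetical predictions on one contig.
example_genes = [
    Gene(100, 900, "+", "GenScan"),
    Gene(1000, 1800, None, "TBLASTX"),
    Gene(2500, 3000, "+", "AAT/NAP"),
    Gene(3100, 3500, "-", "GenScan"),
]
example_combigenes = build_combigenes(example_genes)
```

In this sketch a cluster made up only of strandless genes returns `None` from `cluster_strand`, mirroring the rule that such a combigene cannot be assigned a strand.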
Example 4
Identifying Promoters in the Genomic BAC Library Using
Bioinformatic Techniques
Candidate promoter sequences are selected by identifying the
regions of DNA located immediately upstream of "combigenes" as
described and defined in Example 3. The length of the region to be
extracted from the corresponding contig's sequence is set to be
1500 nucleotides plus the very first nucleotide of a combigene.
Thus, if a combigene is sufficiently far from the edge of a contig
a 1501 nucleotide sequence is obtained, otherwise the sequence will
be shorter. Only coding region predictions are considered when
building combigenes. Therefore, the 5' UTR of the putative cDNA is
included as part of the combigene upstream region.
If there is an AAT/NAP-predicted component in a combigene, then the
putative promoter sequence is extracted upstream of the beginning
of that component; otherwise, the sequence is extracted upstream of
the beginning of the combigene (which may correspond to Genscan,
AAT/GAP or a TBLASTX prediction).
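The extraction step can be illustrated by the following Python sketch. The coordinate convention (1-based, inclusive, given on the forward strand of the contig) and the function names are assumptions of this sketch and are not taken from the original pipeline.

```python
UPSTREAM_NT = 1500  # nucleotides taken upstream of the combigene (from the text)

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_promoter(contig: str, cg_start: int, cg_end: int, strand: str,
                     upstream: int = UPSTREAM_NT) -> str:
    """Return up to `upstream` nucleotides 5' of the combigene plus its first base.

    cg_start and cg_end are 1-based inclusive forward-strand coordinates.
    The result is shorter than upstream + 1 when the contig edge is reached.
    """
    if strand == "+":
        lo = max(0, cg_start - 1 - upstream)
        return contig[lo:cg_start]
    if strand == "-":
        hi = min(len(contig), cg_end + upstream)
        return reverse_complement(contig[cg_end - 1:hi])
    raise ValueError("strand must be '+' or '-'")


# Hypothetical contigs used only to exercise the function.
far_from_edge = extract_promoter("G" * 1600 + "ATG" + "C" * 100, 1601, 1703, "+")
near_edge = extract_promoter("T" * 10 + "ATG" + "C" * 100, 11, 113, "+")
minus_strand = extract_promoter("CATGGGGG", 1, 3, "-")
```

When a combigene contains an AAT/NAP-predicted component, the start of that component would be passed in place of the combigene start.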
Promoter candidates are further selected using bioinformatic
analysis of the candidate promoter sequence.
The candidate promoter regions listed in SEQ ID NO:1 through SEQ ID
NO:57467 are analyzed for known promoter motifs listed in Table
2.
The identification of such motifs provides important information
about the candidate promoter. For example, some motifs are
associated with informative annotations such as "light inducible
binding site" or "stress inducible binding motif" and can be used
to select with confidence a promoter that is able to confer light
inducibility or stress inducibility to an operably-linked
transgene, respectively.
Putative promoter sequences are also searched with matrices for the
TATA box, GC box (factor name: V_GC_01) and CCAAT box (factor name:
F_HAP234_01). The matrix for the TATA box is from the Eukaryotic
Promoter Database (www.epd.isb-sib.ch/) and the matrices for the GC
box and the CCAAT box are from Transfac
(www-transfac.gbf.de/TRANSFAC/).
The algorithm that is used to annotate promoters searches for
matches to both sequence motifs and matrix motifs. First,
individual matches are found. For sequence motifs, a maximum number
of mismatches is allowed (see Table 2). If the code M, R, W, S, Y, or K
is listed in the sequence motif (each of which is a degenerate
code for 2 nucleotides), 1/2 mismatch is allowed. If the code B, D,
H, or V is listed in the sequence motif (each of which is a
degenerate code for 3 nucleotides), 1/3 mismatch is allowed. P
values are determined by simulation: 5 Mb of random DNA with
the same dinucleotide frequency as the test set is generated, and
the probability of a given matrix score is determined (number of
hits/5e7). Once the individual hits have been found, the putative
promoter sequence is searched for clusters of hits in a 250 bp
window. The score for a cluster is found by summing the negative
natural log of the p value for each individual hit. Using 100 Mb
simulations as described above, the probability of a window having
a cluster score greater than or equal to the given value is
determined. Clusters with a p value more significant than p<1e-6
are reported. Only the top 287 hits are taken and are ranked by p
value. Effects of repetitive elements are screened. If the 287th
ranked hit has the same p value as the first ranked hit, no results
are reported for that factor.
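A minimal Python sketch of the sequence motif search and the cluster score follows. The text does not fully specify how the fractional mismatches are applied. This sketch assumes one possible reading, in which a mismatch at a two-fold degenerate position is charged 1/2, a mismatch at a three-fold degenerate position is charged 1/3, and any other mismatch is charged 1. The function names are likewise assumptions.

```python
import math
from fractions import Fraction

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatch_cost(motif: str, site: str) -> Fraction:
    """Total mismatch charge of `site` against `motif` (assumed charging rule)."""
    cost = Fraction(0)
    for code, base in zip(motif, site):
        allowed = IUPAC[code]
        if base not in allowed:
            cost += Fraction(1, len(allowed))
    return cost


def find_hits(motif: str, seq: str, max_mismatch: float) -> list:
    """0-based start positions where the motif matches within the mismatch budget."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1)
            if mismatch_cost(motif, seq[i:i + k]) <= max_mismatch]


def best_cluster(hits, window: int = 250):
    """Best-scoring window of hits; the score is the sum of -ln(p) over its hits.

    hits: iterable of (position, p_value). Returns (score, window_start, n_hits).
    """
    hits = sorted(hits)
    best = (0.0, None, 0)
    for i, (start, _) in enumerate(hits):
        members = [p for pos, p in hits[i:] if pos - start < window]
        score = sum(-math.log(p) for p in members)
        if score > best[0]:
            best = (score, start, len(members))
    return best


# Hypothetical inputs used only to exercise the functions.
example_positions = find_hits("CATGCAY", "TTCATGCATGG", 0)
example_cluster = best_cluster([(10, 1e-3), (100, 1e-2), (400, 1e-3)])
```

The significance of a cluster score would then be estimated against simulated random DNA, which this sketch does not attempt.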
For matrix motifs, a p value cutoff is used on a matrix score. The
matrix score is determined by adding the path of a given DNA
sequence through a matrix. P values are determined by simulation; 5
Mb of random DNA with the same dinucleotide frequency as a test set
is generated to test individual matrix hits and 100 Mb is used to
test clusters; the probability of a given matrix score and the
probability scores for clusters are determined as for the sequence
motifs. The usual cutoff for matrices is 2.5e-4. No clustering is
done for the TATA box, GC box or CCAAT box.
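The matrix scoring and the simulation-based p value can be sketched in Python as follows. The toy matrix is hypothetical and is not the TATA box matrix of the Eukaryotic Promoter Database. The first-order Markov generator is one common way to produce random DNA that preserves dinucleotide frequencies and is assumed here, as is the choice to divide the number of hits by the number of windows scanned.

```python
import random
from collections import Counter, defaultdict


def matrix_score(matrix, site: str) -> float:
    """Sum the matrix weights along the path that `site` takes through the matrix."""
    return sum(column[base] for column, base in zip(matrix, site))


def markov_dna(template: str, length: int, rng: random.Random) -> str:
    """Random DNA from a first-order Markov chain fit to the template's dinucleotides."""
    transitions = defaultdict(Counter)
    for a, b in zip(template, template[1:]):
        transitions[a][b] += 1
    out = [rng.choice(template)]
    while len(out) < length:
        following = transitions[out[-1]]
        if not following:  # base seen only at the template's end: restart
            out.append(rng.choice(template))
            continue
        bases, weights = zip(*following.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


def empirical_p(matrix, score: float, background: str) -> float:
    """Fraction of background windows scoring at least `score`."""
    k = len(matrix)
    n_windows = len(background) - k + 1
    hits = sum(matrix_score(matrix, background[i:i + k]) >= score
               for i in range(n_windows))
    return hits / n_windows


# Hypothetical 4-column weight matrix favouring "TATA".
TOY_MATRIX = [
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
]
toy_score = matrix_score(TOY_MATRIX, "TATA")
toy_p = empirical_p(TOY_MATRIX, toy_score, "TATACCCC")
toy_background = markov_dna("ACGTACGTTTGA", 50, random.Random(0))
```

In practice the background would be on the order of megabases, as described above, rather than the short strings used here.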
Candidate promoters are also selected based on the expression
characteristics of the gene that is cis-associated with the
candidate promoter (i.e., the native gene). For example, a promoter
region located 5' to a gene, which is expressed during a specific
stage of development, likely plays a key role in the temporal
regulation of that gene. Thus the promoter, when operably linked to
a heterologous coding sequence, may similarly regulate the
heterologous coding sequence.
Combining the motif analysis with the expression analysis, the list
of candidate promoters having desired properties can be narrowed.
This decreases the overall number of candidate promoters that must
be screened to confirm the promoter's function. For example, one
can start with seed-expressed transcription factors, identify
candidate promoters that match the consensus regulation sites for
seed-expressed transcription factors, and then test the identified
candidate promoters to confirm the promoter sub-set which are
capable of conferring seed-specific expression to a gene.
Example 5
Identifying Promoters in the Genomic BAC Library Using an
Expression Assay
Promoters may also be identified based on quantitative analysis of
genes that are cis-associated with candidate promoters (i.e., the
native genes). In this method, the native genes associated with SEQ
ID NO:1 through SEQ ID NO:57,467 are analyzed on a digital northern
blot. Digital northern data can be generated from EST sequencing,
SAGE and other methods, which in effect count RNA molecules
expressed in a cell. This data can be generated as needed, or is
generally available to the public on a number of web sites (e.g.,
www.tigr.org). Data can be obtained from any plant species,
although data on rice gene expression is particularly preferred.
Promoters are selected based on the expression information of the
digital northern. For example, identifying genes expressed
under stress-related conditions would provide a group of promoters
able to confer such stress-inducible expression to other genes.
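A digital northern comparison of this kind can be sketched in Python as follows. The library contents, the normalization to tags per million, the pseudocount and the fold-change threshold are all hypothetical choices made for this sketch.

```python
from collections import Counter
from typing import Dict, List


def tags_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalize raw EST or SAGE tag counts to tags per million for one library."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def stress_induced(stress: Dict[str, int], control: Dict[str, int],
                   min_fold: float = 5.0, pseudocount: float = 1.0) -> List[str]:
    """Genes whose normalized abundance is at least min_fold higher under stress."""
    s = tags_per_million(stress)
    c = tags_per_million(control)
    selected = []
    for gene, level in s.items():
        fold = (level + pseudocount) / (c.get(gene, 0.0) + pseudocount)
        if fold >= min_fold:
            selected.append(gene)
    return sorted(selected)


# Hypothetical tag counts for a stress library and a control library.
stress_library = Counter({"g1": 50, "g2": 10, "g3": 40})
control_library = Counter({"g1": 5, "g2": 10, "g3": 85})
induced_genes = stress_induced(stress_library, control_library)
```

The promoters cis-associated with the selected genes would then be carried forward as stress-inducible candidates.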
Example 6
Identifying Promoters in the Genomic BAC Library Using Microarray
Analysis
Promoters may also be selected by transcriptional profiling
or microarray analysis. Transcriptional profiling can be completed
on large scale for each cis-linked gene associated with SEQ ID NO:1
through SEQ ID NO:57,467. Transcription profiling data can be
obtained on RNA prepared from any plant species using a chip
comprised of sequences from any plant species, although data
generated from rice using a rice chip is preferred.
A comprehensive database of transcription profiling data narrows
down the list of promoter candidates that confer a desired
expression pattern. For example, a promoter that confers
drought-specific expression can be selected by identifying a
cis-linked gene that is induced under drought conditions (on the
microarray), but is not expressed at other stages of plant growth
and development. Such a promoter is likely to confer drought
inducibility to an operably linked transgene. Public databases of
transcript profiling data are becoming more comprehensive and
thereby enabling this type of analysis.
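As a non-limiting sketch, selecting genes that are induced under one condition but not expressed under any other can be written in Python as follows. The expression values, the condition names and the "on" and "off" thresholds are hypothetical.

```python
from typing import Dict, List

Profile = Dict[str, float]  # condition name -> normalized expression level


def condition_specific(profiles: Dict[str, Profile], condition: str,
                       on: float = 2.0, off: float = 0.5) -> List[str]:
    """Genes expressed at or above `on` in `condition` and at or below `off` elsewhere."""
    selected = []
    for gene, levels in profiles.items():
        if levels.get(condition, 0.0) < on:
            continue
        if all(v <= off for c, v in levels.items() if c != condition):
            selected.append(gene)
    return sorted(selected)


# Hypothetical microarray profiles for three cis-linked genes.
example_profiles = {
    "geneA": {"drought": 6.0, "seedling": 0.1, "flowering": 0.2},
    "geneB": {"drought": 5.0, "seedling": 4.0, "flowering": 0.1},
    "geneC": {"drought": 0.3, "seedling": 0.2, "flowering": 0.1},
}
drought_specific = condition_specific(example_profiles, "drought")
```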
Example 7
Functional Screening of Promoters in an Expression Assay
Promoters are screened in an expression assay. The promoters in SEQ
ID NO:1 through SEQ ID NO:57,467 are amplified by PCR from rice
genomic DNA and cloned into an expression vector containing a
reporter transgene (e.g., GUS or GFP). The individual promoter or a
collection of promoters ("promoter library") are then screened in
an expression assay for the ability to express the reporter
transgene. In a common expression assay for leaf promoters, the
promoters are transfected into rice or maize leaf protoplasts.
Reporter gene expression in the protoplasts indicates a promoter
capable of conferring gene-expression in the leaf. The promoters
are also transfected into protoplasts from other tissues or plant
species to identify other regulatory features of the promoter.
Alternatively, promoters may be screened using a particle gun
technique to bombard the cells, tissues or plants. The bombarded
samples are visually inspected for reporter gene expression.
Reporter gene expression observed in any bombarded samples
indicates the presence of a promoter able to confer expression of a
transgene in that cell, tissue or plant.
The promoters may also be screened in plants where transformation
protocols have been greatly enhanced to facilitate the screening of
large numbers of promoters. In this approach, the individual rice
promoters or "promoter library" is transformed into Arabidopsis
plants. The resulting transformed tissues or progeny are scored for
reporter expression. Again, reporter gene expression in a given
tissue indicates that a promoter is able to confer transgene
expression in that tissue.
For some promoters, such as those providing constitutive
expression, a reporter transgene can be replaced with a selectable
marker transgene, such as a gene conferring glyphosate tolerance.
Transformed cells, tissues or plants expressing the selectable
marker are selected, rather than visually scored. For example, the
promoter is linked to a selectable marker, such as glyphosate
resistance, and the plants are then screened for male sterility. The
selected plants, in this case male sterile plants, may contain a
promoter for male reproductive tissues.
The promoters described herein can also be used to ablate or kill
cells expressing a gene from the promoter. In such cases, the
promoter is operably linked to a negative selectable marker gene,
including but not limited to the diphtheria toxin gene, or to a
conditional lethal gene, including but not limited to the
phosphonate ester hydrolase gene (pehA). The negative selectable
marker gene is transformed into cells, tissues or plants. The
cells, tissues or plants which express the negative selectable gene
from the promoter are selectively killed. In the case of the
conditional lethal gene, the transformed cells, tissues or plants
which express the conditional lethal gene are only killed in the
presence of the negative selective agent or negative selective
condition. In the example of the phosphonate ester hydrolase gene,
the transformed cells, tissues or plants which express the
conditional lethal gene are only killed in the presence of glyceryl
glyphosate.
Table 1
The data in Table 1 provides features relating to the putative
promoter sequences.
*Column headings
Seq Num: Provides the SEQ ID NO. for the rice contigs on which the
putative promoter sequences are found.
Contig ID: Unique identifier of the rice contig.
CmbGID: Name of the putative promoter sequence. Putative promoters
are named as cg_"no". The "no" refers to the combigene from which the
putative promoter is selected.
CNTG LEN: The length of the contig.
BEGN POS: Starting position of the putative promoter sequence.
STRND: DNA strand on which the combigene is located.
LEN: Length of the putative promoter sequence.
Table 2
Table 2 lists the sequence motifs that are searched in the putative
promoter sequences.
TABLE 2
SEQ ID NO. | Transcription Factor Name | Sequence Motif name | Sequence Motif | Sequence Motif Length | Maximum mis-matches allowed | Reference for transcription factors and sequence motifs
109670 | Fac006 | ABADESI1 | RTACGTGGCR | 10 | 1 | PLACE
109671 | Fac037 | ABADESI2 | GGACGCGTGGC | 11 | 2 | PLACE
109672 | Fac010 | ABFOS | GCATCTTTACTTTAGCATC | 19 | 6 | PLACE
109673 | Fac016 | ABRE3OSRAB16 | GTACGTGGCGC | 11 | 2 | PLACE
109674 | Fac016 | ABREATRD22 | RYACGTGGYR | 10 | 0 | PLACE
109675 | Fac020 | ABREOSRAB21 | ACGTSSSC | 8 | 0 | PLACE
109676 | Fac021 | ABREOSRGA1 | CCACGTGG | 8 | 0 | PLACE
109677 | Fac021 | ABRETAEM | GGACACGTGGC | 11 | 2 | PLACE
109678 | Fac022 | ACGTABOX | TACGTA | 6 | 0 | PLACE
109679 | Fac031 | AMYBOX1 | TAACARA | 7 | 0 | PLACE
109680 | Fac032 | AMYBOX2 | TATCCAT | 7 | 0 | PLACE
109681 | Fac060 | DREDR1ATRD29AB | TACCGACAT | 9 | 1 | PLACE
109682 | Fac064 | EREGCCNTCHN | TAAGAGCCGCC | 11 | 2 | PLACE
109683 | Fac066 | GARE2R | TAACARANTCYGG | 14 | 2 | PLACE
109684 | Fac068 | GBOXRELOSAMY3 | CTACGTGGCCA | 11 | 2 | PLACE
109685 | Fac070 | GLUTAACAOS | AACAAACTCTAT | 12 | 2 | PLACE
109686 | Fac071 | 1OSGT2GLUTEBOX | ATATCATGAGTCACTTCA | 18 | 4 | PLACE
109687 | Fac071 | 1OSGT2GLUTEBOX | ATATCATGAGTCACTTCA | 18 | 4 | PLACE
109688 | Fac071 | 1OSGT3GLUTEBOX | TATCTAGTGAGTCACTTCA | 19 | 5 | PLACE
109689 | Fac071 | 1OSGT3GLUTEBOX | TATCTAGTGAGTCACTTCA | 19 | 5 | PLACE
109690 | Fac072 | 2OSGT2GLUTEBOX | TCCGTGTACCA | 11 | 2 | PLACE
109691 | Fac072 | 2OSGT2GLUTEBOX | TCCGTGTACCA | 11 | 2 | PLACE
109692 | Fac072 | 2OSGT3GLUTEBOX | CTTTTGTGTACCTTA | 15 | 3 | PLACE
109693 | Fac072 | 2OSGT3GLUTEBOX | CTTTTGTGTACCTTA | 15 | 3 | PLACE
109694 | Fac073 | GLUTEBP1OS | AAGCAACACACAAC | 14 | 3 | PLACE
109695 | Fac074 | GLUTEBP2OS | ATGCTCAATAGATATAAGT | 19 | 5 | PLACE
109696 | Fac075 | GLUTECOREOS | CTTTCGTGTAC | 11 | 2 | PLACE
109697 | Fac079 | GT2OSPHY | AGCGGTAATT | 9 | 1 | PLACE
109698 | Fac105 | MYBGAHV | TAACAAA | 7 | 0 | PLACE
109699 | Fac129 | PROLAMINBOX | CACATGTGTAAAGGT | 15 | 4 | PLACE
109700 | Fac135 | RGATAOS | CAGAAGATA | 9 | 1 | PLACE
109701 | Fac136 | RNFG1OS | GATCATCGATC | 11 | 2 | PLACE
109702 | Fac137 | RNFG2OS | CCAGTGTGCCCCTGG | 15 | 4 | PLACE
109703 | Fac139 | RYREPEAT4 | TCCATGCATGCAC | 13 | 3 | PLACE
109704 | Fac139 | RYREPEAT4 | TCCATGCATGCAC | 13 | 3 | PLACE
109705 | Fac139 | RYREPEATGMGY2 | CATGCAT | 7 | 0 | PLACE
109706 | Fac139 | RYREPEATGMGY2 | CATGCAT | 7 | 0 | PLACE
109707 | Fac139 | RYREPEATLEGUMINBOX | CATGCAY | 7 | 0 | PLACE
109708 | Fac139 | RYREPEATLEGUMINBOX | CATGCAY | 7 | 0 | PLACE
109709 | Fac139 | RYREPEATVFLEB4 | CATGCATG | 8 | 0 | PLACE
109710 | Fac139 | RYREPEATVFLEB4 | CATGCATG | 8 | 0 | PLACE
109711 | Fac149 | SITEIIAOSPCNA | TGGGCCCGT | 9 | 1 | PLACE
109712 | Fac150 | SITEIIBOSPCNA | TGGTCCCAC | 9 | 1 | PLACE
109713 | Fac151 | SITEIOSPCNA | CCAGGTGG | 8 | 1 | PLACE
109714 | Fac163 | AACAOSGLUB1 | CAACAAACTATATC | 14 | 3.5 | PLACE
109715 | Fac165 | ACGTOSGLUB1 | GTACGTG | 7 | 0 | PLACE
109716 | Fac180 | GT1CONSENSUS | GRWAAW | 6 | 0 | PLACE
109717 | Fac201 | PYRIMIDINEBOXOSRAMY1A | CCTTTT | 6 | 0 | PLACE
109718 | Fac218 | ABREMOTIFAOSOSEM | TACGTGTC | 8 | 0.5 | PLACE
109719 | Fac219 | ABREMOTIFIIIOSRAB16B | GCCGCGTGGC | 10 | 1.5 | PLACE
109720 | Fac220 | ABREMOTIFIOSRAB16B | AGTACGTGGC | 10 | 1.5 | PLACE
109721 | Fac223 | CE3OSOSEM | AACGCGTGTC | 10 | 1.5 | PLACE
109722 | Fac267 | POLASIG2 | AATTAAA | 7 | 0 | PLACE
109723 | OS_A-box | OS_A-box | TATCCATCCATCC | 13 | 3 | PlantCARE
109724 | OS_A-box2 | OS_A-box2 | AATAACAAACTCC | 13 | 3 | PlantCARE
109725 | OS_AACA | OS_AACA | TAACAAACTCCA | 12 | 2.5 | PlantCARE
109726 | OS_ABRE | OS_ABRE | GACACGTACGT | 11 | 2 | PlantCARE
109727 | OS_ABRE2 | OS_ABRE2 | ACGTACGTGTCGCGC | 15 | 4 | PlantCARE
109728 | OS_AP-2-like | OS_AP-2-like | CGCGCCGG | 8 | 0.5 | PlantCARE
109729 | OS_AP-2-like2 | OS_AP-2-like2 | CGACCAGG | 8 | 0.5 | PlantCARE
109730 | OS_ATGCAAAT | OS_ATGCAAAT | ATACAAAT | 8 | 0.5 | PlantCARE
109731 | OS_CE3 | OS_CE3 | GACGCGTGTC | 10 | 1.5 | PlantCARE
109732 | OS_GATT | OS_GATT | CTCCTGATTGGA | 12 | 2.5 | PlantCARE
109733 | OS_GCN4 | OS_GCN4 | TGWGTCA | 7 | 0 | PlantCARE
109734 | OS_GCN4_2 | OS_GCN4_2 | CAAGCCA | 7 | 0 | PlantCARE
109735 | OS_P-box | OS_P-box | GCCTTTTGAGT | 11 | 2 | PlantCARE
109736 | OS_P-box2 | OS_P-box2 | CCTTTTG | 7 | 0 | PlantCARE
109737 | OS_Prolamin_box | OS_Prolamin_box | TGCAAAGT | 8 | 0.5 | PlantCARE
109738 | OS_Skn-1 | OS_Skn-1 | GTCAT | 5 | 0 | PlantCARE
109739 | OS_TATC-box | OS_TATC-box | TATCCCA | 7 | 0 | PlantCARE
109740 | OS_TGGCA | OS_TGGCA | GACACCAAGTGGCA | 14 | 3.5 | PlantCARE
109741 | OS_light | OS_light | AACCAATCTCATCCATCC | 18 | 5.5 | PlantCARE
109742 | AS_RF2A_01 | AS_RF2A_01 | CCAGTGTGGCGCTGG | 15 | 4 | TRANS
109743 | AT_RS1A_01 | AT_RS1A_01 | CTTCCACGTGGCA | 13 | 3 | TRANS
109744 | PV_GRP18_01 | PV_GRP18_01 | TGGATGTGGAAGACAGCA | 18 | 5.5 | TRANS
109745 | RICE_ACT_01 | RICE_ACT_01 | GCCCAACCCAACCCAAC | 17 | 5 | TRANS
109746 | RICE_AGB_01 | RICE_AGB_01 | GCCACGTAAG | 10 | 1.5 | TRANS
109747 | RICE_AGB_03 | RICE_AGB_03 | GCCACGTCAG | 10 | 1.5 | TRANS
109748 | RICE_EM_01 | RICE_EM_01 | TACGTGT | 7 | 0 | TRANS
109749 | RICE_EM_02 | RICE_EM_02 | GACGTGT | 7 | 0 | TRANS
109750 | RICE_GL51_01 | RICE_GL51_01 | AAGTCATAACTG | 12 | 2.5 | TRANS
109751 | RICE_GL51_02 | RICE_GL51_02 | CCATGTCATATT | 12 | 2.5 | TRANS
109752 | RICE_GL51_03 | RICE_GL51_03 | AATGATGTGTCAAT | 14 | 3.5 | TRANS
109753 | RICE_GL51_04 | RICE_GL51_04 | TTCCGTGTACCAC | 13 | 3 | TRANS
109754 | RICE_GL51_05 | RICE_GL51_05 | TGAGTCA | 7 | 0 | TRANS
109755 | RICE_GLU2_01 | RICE_GLU2_01 | CCTTTCGTGTACC | 13 | 3 | TRANS
109756 | RICE_GLUB1_01 | RICE_GLUB1_01 | CTGAGTCAT | 9 | 1 | TRANS
109757 | RICE_NITR_01 | RICE_NITR_01 | CACGTCAC | 8 | 0.5 | TRANS
109758 | RICE_RAB16A_01 | RICE_RAB16A_01 | TACGTGGCNNNNCCGCCGCGCCT | 23 | 6 | TRANS
109759 | RICE_RAB16A_03 | RICE_RAB16A_03 | GTACGTGG | 8 | 0.5 | TRANS
109760 | TAF-1 | AS_TAF1_01 | GCAACGTGGC | 10 | 1.5 | TRANS
109761 | TAF-1 | RICE_RAB16B_01 | GGTACGTGGCG | 11 | 2 | TRANS
109762 | WHEAT_H3_01 | WHEAT_H3_01 | CCACGTCA | 8 | 0.5 | TRANS
109763 | Seed_sp | Seed_AACA_motif | AACAAACTCTATC | 13 | 3 | lit1
109764 | Seed_sp | Seed_GCN4 | GTGAGTCAC | 9 | 1 | lit1
109765 | SugRep | ACGTABOX | TACGTA | 6 | 0 | lit2
109766 | SugRep | TCmotif | TATCCAY | 7 | 0 | lit2
109767 | Amy3D | AMYBOX2 | TATCCAT | 7 | 0 | lit3
109768 | Amy3D | GBOXRELOSAMY3 | CTACGTGGCCA | 11 | 2 | lit3
*Column Headings for Table 2
Transcription Factor Name: Name of the transcription factor which
binds to a sequence motif within a promoter region.
Sequence Motif name: Name of the sequence motif to which the
transcription factor binds.
Sequence Motif: Sequences searched to further annotate putative
promoters.
Maximum mismatches allowed: Number of mismatches that are permitted
when searching for motifs to annotate putative promoters.
Reference for transcription factors and sequence motifs: Motifs and
transcription factors are found in one of three databases, PLACE,
PlantCARE or TRANS (respectively, www-dna.affrc.go.jp/htdocs/PLACE/,
www-sphinx.rug.ac.be:8080/PlantCARE/index.htm and
www-transfac.gbf.de/TRANSFAC/), or in lit1 (Yoshihara et al., FEBS
Letters 383, 1996, pp 213-218), lit2 (Toyofuku K et al. FEBS Lett
428:275-280 (1998)) or lit3 (Huang et al Plant Mol Biol 14:655-668
(1990)).
Table 3
Table 3 describes those putative promoter sequences containing TATA
boxes, GC boxes or CCAAT boxes as determined by matrix motifs.
*Column headings for Table 3
Seq num: Provides the SEQ ID NO. for the listed sequences.
Seq ID: Arbitrarily assigned identifier for each putative promoter
sequence.
Start: Indicates the start position of the TATA box, GC box or CCAAT
box.
End: Indicates the end position of the TATA box, GC box or CCAAT box.
p-Value: Probability value, determined by simulation as described
above.
-ln (p-Value): Indicates the negative natural log of the p-Value.
# of hits in cluster: No clustering is done in Table 3. Therefore,
all entries in this column are listed as "1".
Factor Name: Transcription factors associated with the TATA box
(TATA-plant), GC box (V_GC_01) and CCAAT box (F_HAP234_01) are listed
under this column heading when the matrix motifs for the TATA box, GC
box and CCAAT box are identified.
Site name: Lists whether the search is done for the TATA box, GC box
or CCAAT box.
Table 4
Table 4 describes those putative promoter sequences containing
specific sequence motifs as listed in Table 2.
Column headings for Table 4
Seq num: Provides the SEQ ID NO. for the listed sequences.
Seq ID: Arbitrarily assigned identifier for each putative promoter
sequence.
Start: Indicates the start position of a sequence motif. Note that if
multiple motifs are present in a window of sequence, the start
position, the first time it is indicated, lists the start position
of the first motif in a window. The second or subsequent times a
start position is listed, this heading refers to the start position
of the subsequent individual sequence motif within the window of
sequence.
End: Indicates the end position of a sequence motif. Note that if
multiple motifs are present in a window of sequence, the end
position, the first time it is indicated, lists the end position of
the first motif in a window. The second or subsequent times an end
position is listed, this heading refers to the end position of the
subsequent individual sequence motif within the window.
Strand: The strand of genomic DNA on which the sequence motif is
located (+/-).
p-Value: Probability value as determined by simulation as described
above. Note that if multiple motifs are present in a window of
sequence, the p-Value, the first time it is indicated, lists the
p-Value for all of the motifs in a window of sequence. The second or
subsequent times a p-Value is listed, this heading refers to the
p-Value of the subsequent individual sequence motif within the
window.
-ln (p-Value): Indicates the negative natural log of the p-Value as
described in the column heading above.
# of hits in cluster: Clustering is described above. The number of
hits in a cluster refers to the number of times sequence motifs
appear in a window of sequence.
Factor Name: Transcription factors associated with the sequence
motifs listed in Table 2 are included under this column heading.
Site name: Lists the sequence motif for which a search is done on a
putative promoter sequence.
Table 5
Table 5 lists the putative promoter sequences for which sequence or
matrix motifs are not identified.
Column headings for Table 5
Seq num: Provides the SEQ ID NO. for the listed sequences.
Seq ID: Arbitrarily assigned identifier for each putative promoter
sequence.
Table 6
Table 6
lists the contigs, combigenes and description information
associated with each gene prediction.
*Column Headings:
Seq num: Provides the SEQ ID NO. for the listed sequences.
Contig id: Arbitrarily assigned name for each contig.
CDS: The location of the exons found within the gene as determined by
the gene-predicting program (Method).
CG ID: Arbitrarily assigned name for each combigene.
CG Start: Indicates the start position of the combigene.
CG End: Indicates the end position of the combigene.
Strand: Indicates the strand location of the gene (+/-).
Gene: Indicates an arbitrarily assigned gene name based on the method
used to predict the gene.
Method: Indicates the gene-predicting program used. These programs
are GenScan, AAT/NAP, AAT/GAP, TBLASTX or Genemark.hmm.
Gene Start: The start position of the putative gene making up a
combigene as predicted by the particular gene-predicting program
used.
Gene End: The end position of the putative gene making up a combigene
as predicted by the particular gene-predicting program used.
Hit Score: The aat_nap score (under Hit score in the rows where the
method is AAT/NAP) is reported by the nap program in the aat package.
It is an alignment score in which each match and mismatch is scored
based on the BLOSUM62 scoring matrix. The aat_gap score (under Hit
score in the rows where the method is AAT/GAP) is the alignment score
for each hit sequence, as reported by AAT/GAP. For TBLASTX, the bit
score of the BLAST match generated by the sequence comparison of the
genomic contig with the Monsanto cDNA sequence named under the GI
column is listed. The E-value corresponding to a given bit score is
$E = mn2^{-S'}$, where "m" and "n" are the lengths of the two proteins
compared, "E" is the E value and S' is the bit score.
GI: Each sequence in the GenBank public database is arbitrarily
assigned a unique NCBI gi (National Center for Biotechnology
Information GenBank Identifier) number. In this table, the NCBI gi
number which is associated (in the same row) with a given contig or
singleton refers to the particular GenBank sequence which is the best
match for that sequence. If the hit is based on cDNAs from Monsanto's
SeqDB, the name of the cDNA sequence it hit is given.
Description: The Description column provides a description of the
NCBI gi referenced in the "GI" column.
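For illustration, the relationship between the bit score and the E-value given under Hit Score, E = m n 2^(-S'), can be evaluated with a few lines of Python. The function name and the example lengths are assumptions of this sketch.

```python
def evalue_from_bit_score(m: int, n: int, bit_score: float) -> float:
    """E-value for a bit score S': E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bit_score)


# Hypothetical comparison: lengths 100 and 1000, bit score 10.
example_evalue = evalue_from_bit_score(100, 1000, 10)
```

Each additional bit of score halves the E-value, so a modest gain in score corresponds to a large gain in significance.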
SEQUENCE LISTINGS
The patent contains a lengthy "Sequence Listing" section. A copy of
the "Sequence Listing" is available in electronic form from the USPTO
web site
(http://seqdata.uspto.gov/?pageRequest=docDetail&DocID=USRE046033E1).
An electronic copy of the "Sequence Listing" will also be
available from the USPTO upon request and payment of the fee set
forth in 37 CFR 1.19(b)(3).
* * * * *
References