Compositions and systems for identifying and comparing expressed genes (mRNAs) in eukaryotic organisms Kane; Michael D. ; et al. [Dombkowski; Alan A.]

Compositions and systems for identifying and comparing expressed genes (mRNAs) in eukaryotic organisms

Kane; Michael D. ; et al.

Patent Application Summary

U.S. patent application number 11/214436 was filed with the patent office on 2006-05-18 for compositions and systems for identifying and comparing expressed genes (mrnas) in eukaryotic organisms. Invention is credited to Alan A. Dombkowski, Michael D. Kane, Aaron C. Nagel.

Application Number	20060105362 11/214436
Document ID	/
Family ID	22924667
Filed Date	2006-05-18

United States Patent Application	20060105362
Kind Code	A1
Kane; Michael D. ; et al.	May 18, 2006

Compositions and systems for identifying and comparing expressed genes (mRNAs) in eukaryotic organisms

Abstract

The invention comprises compositions and systems to identify and compare expressed genes in a given in vivo or in vitro RNA sample, as well as the relative difference in mRNA expression between two or more samples, where desired. Furthermore, the invention comprises compositions and systems to identify novel genes. The invention comprises, without limitation, one or more mRNA specific identimers for use in reverse transcription that themselves comprise an oligo-T nucleotide sequence (at the 5' end) linked to a nucleotide sequence VNx (at the 3' end) where the V nucleotide immediately adjacent to the oligo-T segment is not a T.

Inventors:	Kane; Michael D.; (Ann Arbor, MI) ; Nagel; Aaron C.; (Ypsilanti, MI) ; Dombkowski; Alan A.; (Canton, MI)
Correspondence Address:	RADER, FISHMAN & GRAUER PLLC 39533 WOODWARD AVENUE SUITE 140 BLOOMFIELD HILLS MI 48304-0610 US
Family ID:	22924667
Appl. No.:	11/214436
Filed:	August 29, 2005

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
10002536	Nov 1, 2001	6955876
11214436	Aug 29, 2005
60244933	Nov 1, 2000

Current U.S. Class:	435/6.11 ; 435/6.12; 435/91.2
Current CPC Class:	C12Q 2531/143 20130101; C12Q 2525/143 20130101; C12Q 2525/173 20130101; C12Q 1/6809 20130101; C12Q 1/6809 20130101; C12N 15/1096 20130101
Class at Publication:	435/006 ; 435/091.2
International Class:	C12Q 1/68 20060101 C12Q001/68; C12P 19/34 20060101 C12P019/34

Claims

1-33. (canceled)

34. A system for identification and characterization of gene expression in one or more samples, comprised of: (a) providing one or more samples comprising one or more mRNA molecules; (b) providing an identimer comprising an oligo-dT primer of sequence, from 5' to 3' end, of Tn-VNx, where n is an integer 8 or greater but not more than 50 representing the number of T's , V equals a nucleotide A, C, or G but not T, each N equals a nucleotide A, C, G, or T, and x is an integer 3 or greater but not more than 10 representing the number of N nucleotides, said identimer also comprising a detectable marker at its 5' end; (c) contacting said mRNA with said identimer such that the polyT portion of the identimer hybridizes to the polyA tail of the mRNA and the VNx portion of the identimer hybridizes with portions of the mRNA immediately upstream of the polyA tail; (d) reverse transcribing the mRNA to produce a first strand cDNA that includes the identimer; (e) synthesizing a second DNA strand complementary to the first strand cDNA to form a duplex; (f) cleaving the duplex with at least one sequence-specific cleaving agent to provide one or more duplex cleavage fragments; (g) ligating an adaptamer comprising an RNA polymerase promoter site to one or more of said cleavage fragments; and (h) amplifying the one or more ligated cleavage fragments using the identimer to produce one or more amplified fragments comprising sequences complementary to a 3' end of the mRNA, (i) identifying and characterizing the cleavage fragments according to the presence of the marker, the sequences corresponding to the VNx nucleotide sequence and the sequence associated with the sequence-specific cleaving agent, and the size of the fragment, and (j) identifying any gene associated with the cleavage fragments by comparing the sequence and size characteristics of the cleavage fragment with a database contacting sequence and size characteristics of RNAs associated with known genes, whereby said comparison is conducted by means of software operated on a microprocessor, where said detectable marker is comprised of a fluorescent molecule, a hapten, biotin, a radioisotope, or any combination thereof.

Description

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application is a continuation of U.S. patent application Ser. No. 10/002,536, filed on Nov. 1, 2001, which claims priority based on U.S. provisional patent application No. 60/244,933 filed Nov. 1, 2000.

FIELD OF THE INVENTION

[0002] The present invention relates to the fields of genomic and proteomic analysis. In particular, the present invention relates to the field of gene expression analysis.

BACKGROUND OF THE INVENTION

[0003] With progress in sequencing many genomes, including among them the human genome, there is additional interest in understanding the significance of changes in gene expression. The ability to correlate changes in gene expression, for example, with specific treatments and phenotypes in clinical and non-clinical biological systems, allows scientists to understand the underlying cell biology and identify the roles of specific genes, receptors and signaling pathways. One objective, among many, of this research is to identify specific genes that may serve, for example, as biomarkers for disease progression or diagnostic criteria, as well as to identify gene expression products (e.g., proteins) that can be targeted as or by new therapeutic compounds in order to study, diagnose, prevent or cure disease.

[0004] There have been significant advancements in the human genome-sequencing project and in similar sequencing efforts that involve organisms of interest to basic and preclinical research, genetics, and agronomics. This progress has generated and continues to generate deoxyribonucleic acid ("DNA") and ribonucleic acid ("RNA") sequence databases that serve as informational resources and support advancement in the methods by which genomic and proteomic research is carried out.

[0005] However, current genomic tools and techniques continue to require significant known genomic sequence information for the organism or tissue under investigation, or require that the investigators derive libraries of clones from the particular organism or tissue. In order to build a DNA microarray that represents essentially all genes for a particular species under investigation (e.g., human), the investigating scientist must expend tremendous resources to identify all possible messenger RNAs ("mRNAs") that may be present in the studied sample. For example, high-density DNA microarrays, using large numbers of known genes, are required to conduct mRNA expression profiling in such samples. By comparison, use of low-density DNA microarrays creates a higher probability of "missing" genes (by omission from the array) that may be relevant to a given experimental paradigm.

[0006] Alternative methods, such as differential display and serial analysis of gene expression, may permit detection of differences in mRNA species between or among RNA samples. However, these methods also require significant resources to identify expressed genes and related expression products, such as mRNAs. For example, in order to identify differences in specific genes using differential display, segregated bands must be removed (excised) from an electrophoresis gel, amplified using polymerase chain reaction ("PCR") techniques, and then sequenced. Similarly, serial analysis of gene expression ("SAGE") requires significant sequencing resources to identify any differences in known and unknown genes.

[0007] The present invention addresses limitations in the prior art by comprising compositions and systems that incorporate novel strategies whereby molecular or biochemical assay compositions and systems are linked to DNA or RNA sequence databases for optimal resource efficiency in assaying gene expression.

SUMMARY OF THE INVENTION

[0008] The invention comprises compositions and systems to identify any or all genes expressed in a given in vivo or in vitro RNA sample, as well as the relative differences in mRNA between two or more samples, where desired. Furthermore, the invention comprises compositions and systems to identify novel genes (expressed as mRNA), by way of one example only, by detecting mRNA 3' end fragments that do not correspond to any known sequence. Thus, embodiments of the invention (1) may identify any or all genes (mRNAs) expressed in a given eukaryotic sample, (2) support discovery of novel genes, or (3) may identify mRNAs that are expressed at different levels between two or more samples. The invention further comprises custom microarray design and production where genes shown to be present and/or differentially regulated between two samples can be used to produce project- or disease-specific microarrays that detect the genes of interest. Moreover, embodiments of the invention comprise systems of nucleic acid fragment collection and analysis.

[0009] The invention comprises compositions and systems to identify the relative expression level of any or all eukaryotic mRNAs in one or more samples. The invention comprises, without limitation, one or more mRNA specific primers for use in reverse transcription that themselves comprises an oligo-dT nucleotide sequence (at the 5' end) linked to a nucleotide sequence (at the 3' end) where the nucleotide immediately adjacent to the oligo-dT segment is not a T. This sequence can be written (from 5' to 3' end) as Tn-VNx, where n=any integer of 8 or greater describing how many T nucleotides are present; V=nucleotides A, G, or C; each N=nucleotides A, G, C, or T, and x=any integer 3 or greater that describes how many N nucleotides are present (SEQ ID NO: 1). (For purposes of the invention, the designation "d", or "deoxy", shall also include the "nondeoxy" form where appropriate as known by those of ordinary skill). The complete primer (oligo-dT region+VNx sequence) of the invention is called an "identimer."

[0010] Embodiments of the invention may employ different or every combination of the Tn-VNx sequence. The 5' end of the identimer comprises a reporter marker or molecule, by way of example only and without limitation, a fluorescent molecule, which allows detection of the resulting fragment. The Tn-VNx sequence of the invention includes priming of all genes containing the complementary sequence at the 3' end (immediately adjacent to the poly-adenylation tail), providing information about the gene's identity (i.e. the mRNA's sequence at the 3' end), limiting the number of reaction products to a useful number, and enabling detection of the resulting 3' end fragment, as one example only, by fluorescence detection.

[0011] After deriving double-stranded complementary DNA ("cDNA") in the invention, one or more 3' end fragments are generated by a sequence-specific cleavage of the double-stranded cDNA, for example, by restriction endonuclease cleavage or other sequence-specific cleaving agents known to those of ordinary skill. The invention generates 3' end fragments of cDNA where the 5' end is known (complementary to the VNx nucleotide sequence), the 3' end is known (e.g., the restriction endonuclease recognition sequence), and the size is known (by way of one example only, from electrophoretic separation or reverse HPLC). In some embodiments, the resulting data is accumulated, analyzed, and stored in a database. The resulting data permits identification of expressed mRNA in a single sample, or, by comparing the abundance of fragments from a control sample to a test (or unknown) sample, the relative differences in expression of genes of interest may be determined among samples.

[0012] The invention allows research and clinical scientists to identify any or all genes expressed in a given in vivo or in vitro RNA sample, as well as the relative differences in mRNA between two or more samples, where desired. Furthermore, novel gene (mRNA) discovery is made possible since mRNA 3' end fragments can be identified that do not correspond to any known sequence.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] FIG. 1 shows a schematic diagram of one embodiment of the invention which generates 3' cDNA fragments for the assay of poly-adenylated mRNAs in eukaryotic samples.

[0014] FIG. 2 shows a comparative fingerprint analysis between PCR amplified samples by an existing method and by the invention using an intermediate linear amplification step.

[0015] FIG. 3 shows a reverse phase HPLC separation of control and treated samples showing a 3' cDNA fragment of similar size but different abundance, indicating a difference in expression level for a specific mRNA between the samples.

[0016] FIG. 4 shows the identification of mRNA in a test sample from cDNA fragments.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0017] As is commonly known, transcription is the transfer of the genetic information from a host's archival copy of DNA to mRNA. In typical transcription, RNA polymerase binds to a particular region of archival DNA and begins to make a strand of mRNA with a base sequence complementary to the DNA template that is "downstream" of the RNA polymerase binding site. When this transcription is finished, the portion of the DNA that coded for a protein, i.e., a gene, is now represented by an mRNA molecule that can be used as a template for translation into gene products, such as proteins. Thus, levels of gene expression can be evaluated by identifying and characterizing the mRNA complement of a host.

[0018] Those of ordinary skill will also appreciate that, to further evaluate the levels of gene expression, polyA mRNA may be prepared from the desired organism or tissue, and the first strand of cDNA may be synthesized from the mRNA template using, as some examples only, an RNA-dependent DNA polymerase, such as reverse transcriptase, and an oligonucleotide primer. A second strand of cDNA may be synthesized by one of the methods known to one of ordinary skill in the art, for example, by replacement synthesis or by primed synthesis. The resulting cDNA strands are thus available for further analysis and characterization.

[0019] With this background in mind, the invention comprises an identimer with three or more nucleotides upstream of a poly-T tail, combined with a restriction enzyme that cleaves ds DNA in a sequence-specific fashion, to generate 3-prime end cDNA fragments of expressed genes. The expression level of a given gene is proportional to and correlates with the amount or abundance of the respective 3' cDNA fragment level. Genes (expressed as mRNAs) are identified by combining the known sequence of 3 or more nucleotides immediately adjacent to the poly-A tail (complementary to the Nx base-anchored primer)(SEQ ID NO: 1), the specific DNA sequence recognized or cut by the restriction enzyme(s) employed, and the size of the 3' fragment. The size of the 3' fragment represents the distance between the Nx base-anchored priming (poly-adenylation site) and the nearest restriction enzyme cut site. The identity of the mRNA (gene) may be derived by searching an mRNA or DNA database for the nucleotide sequence that matches the N(x)-base priming site, the restriction enzyme cut site, and the distance between the priming site and the cut site. Ambiguous calls are avoided by repeating the protocol with one or more restriction enzymes that recognize or cut a different nucleic acid sequence.

[0020] In one embodiment, without limitation, the invention comprises up to 192 different identimers that represent all combinations of the primer designated as 5' Tn-VNNN 3' (n=an integer of preferably 21 representing the number of T's; V=nucleotides A, G, or C but not T; each N=nucleotide A, G, C, or T)(SEQ ID NO: 2), which is designed to identify the 4 nucleotides immediately adjacent to the poly-adenylation site in eukaryotic mRNA. The Tn or poly-dT in the identimer is designed to anneal to the poly-A tail in eukaryotic mRNA. In one embodiment, without limitation, more than one set of 192 identimers (the permutations of VNNN=3.times.4.times.4.times.4) is employed, and each set is used with a single RNA sample. These identimer sets, and thus the samples, are differentiated by adding a distinct, detectable molecular label or marker, by way of one example only, a fluorescent label, to the 5' end of each identimer in a set, with all identimers within a set having a similar 5' marker. The identimer is annealed to mRNA using buffer and temperature conditions that are known in the art for optimal sequence-specific priming, and reverse transcription is carried out. Second-strand synthesis is subsequently carried out to produce double-stranded ("ds") cDNA that is amenable to restriction enzyme cleavage. Enzyme-mediated, sequence-specific cleavage is carried out, resulting in fragmented ds cDNA. For each set where different cleavage enzymes or agents are used, the invention will generate different 3' end fragments for characterization. In this manner, the invention generates and analyzes cDNA fragments that are assayed for size (e.g., mobility in a gel) and amount.

[0021] In one embodiment, without limitation, the invention generates 3' cDNA fragments where the 4 nucleotides immediately adjacent to the poly-adenylated tail (VNNN) are known, the sequence-specific location (i.e., restriction site) is known, and the size of the fragment is determined and used to establish the distance between the restriction site and the poly-adenylated tail. The amount of the cDNA fragment, as well as the sample it represents, is determined by assaying the signal intensity of the 5' label on the employed identimer set by means known to those of ordinary skill in the art. Differences in the normalized intensity of the 5' labels for a specific cDNA fragment indicate differences in the respective mRNA among or between the RNA sample sets under investigation. Information regarding the identity of an expressed mRNA is derived from knowing the 4 nucleotides immediately adjacent to the poly-adenylated tail, the nucleotide sequence for the restriction site employed, and the distance between the poly-adenylated tail and the restriction site nearest to the poly-adenylation site with respect to the primary nucleic acid sequence (i.e., the most 3' restriction site in the mRNA sequence).

[0022] In another embodiment, without limitation, samples representing control and one or more experimental samples of mRNA may each be divided into 192 aliqouts, with one identimer of the 192 identimers added to each respective aliquots. The respective identimer sets may contain different distinct, detectable molecular labels or markers on the 5' end of each identimer in a set, with all identimers within a set having a similar 5' label. Reverse transciption and second strand synthesis is similarly conducted according to means known to those of ordinary skill in the art to produce ds cDNA that is amenable to restriction enzyme cleavage. Enzyme-mediated, sequence-specific cleavage is carried out, resulting in fragmented ds cDNA. Comparison of fragmented ds cDNA corresponding to each respective identimer yields data showing the relative expression of the corresponding gene between control and test samples. One of ordinary skill will appreciate that other useful embodiments are apparent, as some examples only, by varying the type and number of markers in each identimer set, as well as by varying the sequence-specific cleavage enzymes used to produce fragmented ds cDNA.

[0023] In one embodiment, without limitation, the invention comprises a system that compares two or more RNA samples resulting in (1) cDNA fragments of essentially all mRNAs (genes) present in the samples, (2) identifies the subset of genes that are different between the samples, and (3) provides immediate sequence information to identify known genes from the cDNA fragments. This invention thus has significant advantages over existing art because prior sequence information or clone library construction is not needed to enable the assay, as in the case of DNA microarrays that require significant resources to produce the DNA microarray. Instead, the invention provides immediate sequence information, in addition to information concerning changes or differences in mRNA level, to determine mRNA expression level and mRNA identification in one assay. Furthermore, the invention generates cDNA fragments from all mRNAs present in the sample for subsequent investigation by common molecular biology techniques such as cloning, PCR, sequencing, etc. It is important to note that the invention does not require prior knowledge of the sequence of the genome of the organism under investigation and can be employed in organisms that lack significant genomic sequence information. Thus by identifying specific differences in mRNAs levels, critical changes in gene activity relevant to the research or disease model under study can be derived where little or no genomic sequence information is available. In this paradigm, preliminary information can be utilized to direct selective cloning and sequencing efforts, as well as to produce a DNA microarray specific to the relevant mRNAs for a more exhaustive, high throughput study across many samples in the research or disease model under investigation.

[0024] The determination of the size and abundance of 3' cDNA fragments requires labeling the fragments with a detectable marker entity or grouping. In one embodiment of the invention, without limitation, labeling the 3' cDNA fragments involves fluorescence. By way of one example only, a sample derived from a biological reference or control group is labeled with one fluorescent molecule or group, and a sample derived from a biological test or study group is labeled with a different fluorescent molecule or group. The detection method of the invention comprises a fluorescence detector that can identify and distinguish the fluorescent molecules or groups employed during labeling. One method of labeling involves the addition of one or more fluorescent groups to the identimer, preferably at the 5' end of the identimer to avoid interfering with reverse transcription priming. Another labeling method, among others, uses fluorescence-modified reverse transcription substrates (e.g., fluorescence-modified trinucleotides) that are added to the reverse transcription reaction and incorporated into the cDNA during reverse transcription.

[0025] FIG. 1 shows a molecular protocol of one embodiment of the invention to generate 3' cDNA fragments for the assay of all poly-adenylated mRNAs in eukaryotic samples. The sample under investigation is divided into 192 aliquots, and first strand synthesis (reverse transcription) is carried out using all VNNN combinations of the identimer, followed by second strand synthesis. The ds cDNA is cleaved in a sequence-specific manner using a restriction enzyme that involves a 4-base recognition sequence (e.g., NlaIII). The resulting fragments are ligated to an adaptamer that contains one or more RNA polymerase promoter sites for subsequent in vitro transcription. The 3' cDNA fragments are initially enriched using PCR, primed at the adaptamer and the poly-adenylation site (i.e. identimer), and subsequently employed as a template for in vitro transcription promoted within the adaptamer (e.g. T7 polymerase promoter in the ligated adaptamer). This results in an amplification of the sequence adjacent to, and downstream from, the RNA polymerase promoter sequence, which includes the restriction site and the poly-adenylation site. "Second round" first strand synthesis is carried out using a fluorescence-labeled primer (identimer) to enable the detection of all 3' cDNA fragments for size and abundance (fluorescence label is denoted as an "*" at the 5' end of the identimer in FIG. 1). The entire process is repeated using a different restriction enzyme that employs a different recognition sequence (e.g. MboI). Gene (mRNA) identification is made by collecting knowledge of the 4 nucleotides upstream of the poly-adenylation site (determined by identimer priming), the sequence of the restriction enzyme recognition site, and the size of the fragment that provides the distance between the poly-adenylation site and the proximal restriction site. This information is employed to search the known sequence database(s) to identify the mRNA(s) that match these criteria.

[0026] In some embodiments, the identification of a gene or mRNA utilizes information derived from the identimer sequence, the restriction enzyme recognition sequence(s), and the size of the resulting cDNA fragments. This information is then employed to search an mRNA sequence database to identify the specific genes or mRNAs in the samples under investigation. The data used to search the mRNA database are derived by means of the invention. The mRNA nucleotide sequence of the bases immediately adjacent to the poly-A tail are derived from knowledge of the complementary identimer sequence. For example, if the identimer for a given reaction has the sequence 5'-TTTTTTTTTTTTTTTTTTTTAAAC-3' (SEQ ID NO: 3), then any mRNAs identified from this reaction will contain the sequence 5'-GTTTAAAAAAAAAAAAAAAAAAAA-3' (SEQ ID NO: 4). Further information is derived from the determination of the length of labeled cDNA fragments and the restriction enzyme employed to generate the fragments. For example, if the first restriction digest of the identimer reaction above employs the restriction enzyme NlaIII, which cuts at the sequence 5'-CATG-'3, then a cDNA fragment that is 334 bases in length identifies and mRNA sequence that contains the 5'-CATG-'3 sequence 314 bases from the poly-adenylation site. This take into account the 20 "T" bases on the identimer (i.e. 334-20=314). If the second restriction digest of the identimer reaction employs the restriction enzyme MboI, which cuts at the sequence 5'-GATC-3', then a cDNA fragment that is 889 bases in length identifies an mRNA sequence that contains the 5'-GATC-5' sequence 869 bases from the poly-adenylation sequence. Using this information to search an appropriate database, one can identify the mRNA as human precerebellin (GI# 180250), which matches the analytical data. If no mRNA is present in the database, then one can employ a similar bioinformatical strategy to predict the identity of the unknown mRNA or approximate the identity of mRNA or gene family involved. Similarly, if the samples are derived from an organism that lacks an adequate mRNA or gene sequence database, the mRNA is identified using the database from a closely related species.

[0027] In some embodiments, the invention comprises an in vitro transcription step for optimal sensitivity. Each reaction employs one or more primed reverse transcription step resulting in an mRNA:cDNA hybrid. The mRNA:cDNA hybrid is converted to ds cDNA using a "second strand" reaction that may involve, as examples only, RNase H, a DNA polymerase, and a DNA ligase. Subsequently the ds cDNA is fragmented using one or more restriction enzymes or other cleaving agents that cleave DNA in a sequence-dependent fashion. Subsequently a specific DNA sequence, or "adaptamer", containing one or more RNA polymerase promoters, is ligated onto the restriction site, allowing in vitro transcription-mediated amplification of the adaptamer-ligated fragments. The resulting RNA sequences include a 3' poly-A tail and serve as a template for identimer-primed reverse transcription that produces 3' cDNA fragments. These fragments, which may correspond to cleaved fragments, are flanked by known sequences from the identimer hybridization site and restriction site, and the fragments can be analyzed for size and abundance to determine the identity of the mRNA and the expression level in the sample, respectively.

[0028] In one embodiment of the invention, without limitation, identification of the gene associated with a given mRNA fragment, observed as a gel band or chromatographic mobility peak, is attained by an automated matching of the fragment length, restriction enzyme(s), and the known VNx recognition sequence to all predicted fragments obtained in a computational analysis of an mRNA database. A representative set of mRNA sequences for the organism under investigation is submitted to a computational analysis that, for each sequence, identifies the start of the poly-adenylation (polyA) site, the VNx sequence immediately upstream of the polyA site (recognition sequence), and the location of all restriction sites that would be subject to cleavage by the restriction enzymes used in the biochemical protocol. For each restriction enzyme used, the cleavage site upstream and proximal to the mRNA 3' end is identified. The predicted fragment length is calculated by counting the number of nucleotides from the proximal cleavage site to the beginning of the polyA site and adding the number of nucleotides for the length of the oligo-T portion of the identimer primer. Thus, for each mRNA sequence in the representative database, an algorithm may predict the fragment lengths that would be observed if the given mRNA sequence were present in the sample and analyzed with the biochemical protocol. A computer application calculates all predicted fragments for a specified set of restriction enzymes and four-base recognition elements. To associate a gene with a fragment obtained in the biochemical protocol (the "target fragment"), the target fragment is compared to all predicted fragments. Predicted fragments matching the target fragment length, restriction site, and VNx recognition sequence are putative matches. An unambiguous identification is obtained when only one predicted fragment matches the target fragment. Gene identification is accomplished by referring to the sequence information associated with the predicted fragment. In the case where the target fragment matches multiple predicted fragments (multiple genes), the use of additional restriction enzymes provides an unambiguous identification. In such a case, identification requires that all predicted fragments for the putative gene must match target fragments observed in the experiment. In other words, each restriction enzyme may produce an observed fragment for the given gene; when each of these target fragments is matched to predicted fragments from a single gene then identification of the target fragments can be made.

[0029] In some embodiments, the invention comprises providing a transcript (mRNA) sequence database for the organism under investigation, as well as an executable program to mine and match the database with the 3' cDNA fragments.

[0030] Determining relative differences between two different RNA samples involves comparing the abundance of all 3' fragments, which have been differentially labeled for detection. For example, a comparative increase in the signal intensity of a specific 3' cDNA fragment(s) indicates the mRNA that gave rise to the fragment(s) is more abundant in the respective sample. Furthermore, the appearance or disappearance of a specific 3' cDNA fragment(s) indicates induction or repression, respectively, of the mRNA that gave rise to the fragment(s).

EXAMPLES

[0031] The following examples illustrate embodiments of the invention but in no way restrict the overall scope of the invention to only those described below.

Example 1

[0032] First and second strand cDNA synthesis. First strand synthesis is performed by means known to those of ordinary skill (using any experimental cell/tissue type) on the total RNA population utilizing a four-base identimer of sequence NNNVT.sub.21, where each N=A, C, T, or G, and V=A, C, or G but not T (SEQ ID NO: 2). In practical application, the total number of unique identimer tags theoretically required to span the total estimated mRNA population (in a eukaryotic organism) would be 192 (thus 192 unique subsets). Compared with most differential display protocols, which typically utilize a two-base anchored primer for first strand synthesis, a four-based identimer has advantages by: (1) significantly reducing the complexity of the mRNA pool by a factor of 16 (192/12=.about.16), thereby reducing the number of bands displayed per fingerprint (or subset); (2) providing a more accurate prediction of the candidate mRNA(s) of interest through the additional two nucleotide sequence at the 3'-end of each mRNA preceding the poly-A (along with restriction site); and (3) allowing for more stringent annealing temperature (e.g., 50.degree. C.), thereby reducing potential mispriming during first strand synthesis. Following first strand synthesis, ds cDNA 5' was synthesized using a cocktail of requisite enzymes (DNA Polymerase I, RNase H, and E. coli DNA ligase), for example, according to the method of Gubler and Hoffman (Life Technologies Instruction and Technical manuals).

[0033] Restriction enzyme digestion. Following second strand synthesis, ds cDNAs are digested separately with any four-base recognition sequence-specific class IIS restriction enzymes, yielding a 4-base cohesive or recessive end, ideal for improving the efficiency of subsequent ligation reactions. Assuming the average size of a synthesized cDNA is approximately 1200 base pairs ("bp"), the 4-base recognition enzyme will cut once every 256 bp (on average), generating 5 different fragments (on average). However, the fragment of interest is the most 3' fragment, which is selected for in subsequent steps.

[0034] RNA polymerase-specific adaptor ligations. An extended ds promoter recognition sequence specific for the respective RNA polymerase of choice (e.g. T7) is ligated (using a standard T4 DNA ligase protocol) to the 5' protruding region at the 3' end of each ds cDNA containing the four-base cut site. The RNA polymerase adaptor has an extended complementary sequence to the cohesive or recessive end generated by each respective enzyme. All cDNAs now acquire a universal promoter site (and primer site) specific for the RNA polymerase employed.

[0035] Selective PCR amplification of 3'-cDNA fragments. Ligation products are subjected to a selective PCR amplification of the representative 3'-cDNA pools by known means, using a sense strand specific primer derived from the RNA polymerase-specific ds adaptor and each individual identimer tag(s) originally used for first strand synthesis. This step selectively amplifies the most 3' ds fragments that are flanked by the adaptor sequence ("forward primer annealing site") and the poly-A site ("reverse primer" or "identimer annealing" site). PCR amplifications were performed using a HotStar.TM. Taq master mix (Qiagen, Valencia, Calif., USA) in a 30 .mu.l reaction format as follows: 2.5 U HotStar Taq.TM. DNA Polymerase (Qiagen); 1.times.PCR buffer; 200 .mu.M dNTPs; 0.02 .mu.M identimer tag; 1.0 .mu.M T3 sense primer. The PCR conditions were as follows: initial activation and denaturing step at 95.degree. C. for 15 min followed by 25 cycles at 94.degree. C. for 30 sec, 50.degree. C. for 1 min, 72.degree. C. for 1 min with a final extension step for 5 min at 72.degree. C. 3' PCR fragments are purified from residual reaction components, quantified, and used for proceeding reactions.

[0036] Selective linear amplification of 3' PCR pools. Another drawback to most current differential display-derived methods is the underrepresentation of the mRNA pool due to differential display's preferential bias toward high copy number mRNAs. Given that the percentage of low to rare abundance mRNAs can comprise up to 90% of the total mRNA pool, alternative or extended strategies (following PCR) can be employed to obtain more accurate representation of all expressed mRNAs. This entails a linear amplification event (i.e., in vitro transcription) following an exponential amplification event (PCR) as a non-biased approach to further analyzing those messages that otherwise would not be observed following PCR.

[0037] Each PCR reaction (containing a representative 3' pool based on the identimer used) is subjected to an in vitro transcription reaction using the RNA polymerase of interest for the appropriate time interval using the following reaction components (final concentration): 1-2 .mu.g template DNA; 7.5 mM of each individual nucleotide; 10 mM DTT; 1.times. reaction buffer; RNA polymerase. Following linear amplification, in vitro transcribed mRNAs are purified, quantified, and used for a second-round first strand synthesis, respective to the first step in this overall method.

[0038] FIG. 2 shows a comparative fingerprint analysis between PCR amplified samples using an existing approach and the invention's intermediate linear amplification step. The samples are: Lane 1, 100 bp ladder; Lane 2, 2 .mu.g `control` PCR fingerprint; Lane 3, 2 .mu.g `treated` PCR fingerprint; Lane 4, 2 .mu.g `control` RT product; Lane 5, 2 .mu.g `treated` RT product. Control and treated samples represent total RNA samples from untreated and etoposide-induced apoptosis in HEK293 cell cultures, respectively. The arrows in FIG. 2 indicate unique bands displayed using the invention's linear amplification step for comparing expression profiles, which would otherwise not be detected using current art. Differences in migration patterns between PCR fingerprints and RT fingerprints are attributed to the RNA polymerase recognition site ligated into the fragments, which are not transcribed during in vitro transcription in the system of the invention.

[0039] Second-round first strand synthesis and display of fingerprints.

[0040] In vitro transcribed mRNAs are subjected to a second-round first strand synthesis reaction by ordinary means in order to generate a double-stranded mRNA:cDNA duplex. Experimental samples are end-labeled using 5'-fluorescence-labeled identimer tags (i.e., control sample, cy-3; treated sample, cy-5). Following synthesis, 3' fingerprints are analyzed for differences in expression levels using denaturing high-performance liquid chromatography (DHPLC) or gel electrophoresis. All detectable fragments are analyzed for abundance and subsequently employed for gene (mRNA) identification by a bioinformatic method that employs the identimer sequence, the restriction site, and the length of the fragments. Fragments can be collected for subsequent applications or investigation such as DNA microarray production, sequencing, cloning, etc.

[0041] FIG. 3 shows an example of reverse phase HPLC separation of control and treated samples. The results showing a 3' cDNA fragment (arrows) of similar size but different abundance, indicating a difference in expression level for a specific mRNA between the samples. Note that all other peaks (bands) in the trace co-migrate and are at the same abundance (based on peak height), indicating these 3' cDNA fragments are derived from mRNAs that are present in both samples at the same expression level. Control and treated samples represent total RNA samples from untreated and etoposide-induced apoptosis in HEK293 cell cultures, respectively.

Example 2

[0042] In one embodiment, the invention is a system whereby the identity and relative expression level of all eukaryotic mRNAs (messenger RNAs) are determined. Some components of the invention include, without limitation (1) primer design & reverse transcription, (2) production of double-stranded cDNA, (3) sequence-specific cleavage of ds cDNA, and (4) fragment detection & analysis.

[0043] Primer Design. The invention takes advantage of the polyadenylation of eukaryotic mRNAs by utilizing an anchored oligo-T primer. The basic primer design includes an oligo-T nucleotide sequence (5' end) linked to a 5-nucleotide sequence (3' end) where the bases immediately adjacent to the oligo-T stretch is not a T. This sequence can be written (5' to 3') as Tn-VNNNN (n=any single integer of 8 or greater representing how many T bases are present). The 5' end of the primer contains one or more reporter molecules or markers (e.g. fluorescent molecule, hapten, biotin, radioisotope, etc.) that allows for detection of the resulting fragment (e.g. size determination) and enables collection of the resulting fragment if desired. Every combination of the VNNNN sequence is employed (in this case, 768 combinations) and occasional modifications in length are utilized to accommodate common vs. rate 3' end sequences. The 768 combinations (i.e. reactions) are managed by employing multi-well plates (e.g. 384-well plate) and multiple reporter molecules (e.g. two reactions per well in a 384-well plate using two different fluorescent reporter molecules). The Tn-VNNNN sequence comprises priming of all genes containing the complementary sequence at the 3' end (immediately adjacent to the poly-adenylation tail), providing information about the genes identity (i.e. the mRNA's sequence at the 3' end), and enabling detection of the resulting 3' end fragment (e.g. fluorescence detection). The VNNNN sequence primer (including the oligo-T region) is called the "identimer" and the complementary sequence (i.e. mRNA sequence) to the 5 (non-oligo-T region) nucleotides of the identimer sequence is called the "identifier" sequence (which utilizes DNA sequence nomenclature rather than RNA; T rather than U). The "identifier" sequence is used in identifying the specific mRNA in a sequence database, along with the 3' fragment size and restriction enzyme recognition sequence.

[0044] Reverse Transcription. Reverse transcription ("RT") is a molecular biology protocol known to those of ordinary skill that allows a complementary DNA sequence to be synthesized using an RNA template. Many enzymes are available to carry out this reaction, which involves adding nucleotides to the 3' end of the primer or growing DNA strand. A result of RT is a RNA-DNA heteroduplex. Since the identimer primer utilizes the poly-A tail on eukaryotic mRNAs, almost all RNA will be employed for RT. The basic premise involves annealing the identimer primer to the RNA that enables RT.

[0045] Production of Double-Stranded cDNA. Once the RT reaction is complete, ds-cDNA is generated by any method utilized by those of ordinary skill in the art. This protocol known to those of ordinary skill may employ, for example, RNAse H to produce nicks and gaps in the RNA strand on the RNA-DNA heteroduplex while simultaneously employing DNA polymerase I, as well as DNA ligase, to replace the RNA strand with "second strand" DNA.

[0046] Sequence-Specific cleavage of Double-Stranded cDNA. After deriving ds cDNA, one or more 3' end fragments are generated by a sequence-specific cleavage of the ds DNA. One embodiment, among others, would involve using a restriction enzyme ("RE") that includes a 4 base recognition sequence. The average length of the cut fragments may be estimated at 256 bases using a RE with a 4-base recognition sequence. Furthermore, the RE recognition sequence provides the 5' end sequence information on the fragment generated (i.e. the RE recognition sequence itself). The resulting fragments can be ligated to a common primer for amplification if desired.

[0047] Fragment Detection & Analysis. The previous steps generate 3' end fragments where the identimer sequence and the RE recognition sequence are known. Determining the size of the fragment in the system (as one example, by capillary electrophoresis) provides information about the location of the RE recognition sequence and enables analysis of the fragment information with the appropriate database. This creates a novel and desirable data set describing the mRNA expression profile(s) of RNA isolated from any eukaryotic samples or model system. Furthermore, collection of these fragments for sequencing may be utilized for identifying novel sequences.

[0048] FIG. 4 shows the identification of mRNA in an HEK 293 test sample from cDNA fragments. The identity of the specific mRNAs has been established from combining the specific VNNNN identimer used in the experiment (5' TTTTTTTTTTTTTTTTTTTTGGTTT 3' (SEQ ID NO: 5)), the specific restriction enzyme employed (NlaIII, cut site is 5' CATG 3'), and the size of the fragments from the gel. Using this information, a search of publicly available mRNA and DNA sequence databases produced the results for the samples shown in Table 1: TABLE-US-00001 TABLE 1 Band Size (bp) GenBank ID Gene/mRNA Description 1 380 5102748 Homo Sapiens mRNA full length insert cDNA clone EUROIMAGE 35971 2 315 180250 Human precerebellin and cerebellin mRNA, complete cds 3 275 5102752 Homo Sapiens mRNA full length cDNA clone EUROIMAGE 609395 4 250 No ID No matching mRNA sequence 5 220 No ID No matching mRNA sequence 6 200 No ID No matching mRNA sequence 7 155 6807752 Homo Sapiens mRNA: cDNA DKFZp434L1016 8 120 3366801 Homo Sapiens orphan G protein- coupled receptor HG38 mRNA, complete cds

[0049] Preferred embodiments of the present Invention have been disclosed. A person of ordinary skill in the art would realize, however, that certain modifications would come within the teachings of the invention, and the following claims should be studied to determine the true scope and content of the invention. In addition, the methods and structures of the present invention can be incorporated in the form of a variety of embodiments, only a few of which are described herein. It will be apparent to the artisan that other embodiments exist that do not depart from the spirit of the invention. Thus, the described embodiments are illustrative and should not be construed as restrictive.

Sequence CWU 1

1

5 1 61 DNA Artificial Sequence This is a synthesized sequence. 1 tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt vnnnnnnnnn 60 n 61 2 25 DNA Artificial Sequence This is a synthesized sequence. 2 tttttttttt tttttttttt tvnnn 25 3 24 DNA Artificial Sequence This is a synthesized sequence. 3 tttttttttt tttttttttt aaac 24 4 24 DNA Artificial Sequence This is a synthesized sequence. 4 gtttaaaaaa aaaaaaaaaa aaaa 24 5 25 DNA Artificial Sequence This is a synthesized sequence. 5 tttttttttt tttttttttt ggttt 25 Attorney File No. 65446-280 EV 130836112US

* * * * *