Integrated Analyses of Breast and Colorectal Cancers Vogelstein; Bert ; et al. [The Johns Hopkins University]

Patent Application Summary

U.S. patent application number 12/619726, for integrated analyses of breast and colorectal cancers, was filed with the patent office on 2009-11-17 and published on 2010-06-03. This patent application is currently assigned to The Johns Hopkins University. Invention is credited to Kenneth W. Kinzler, Rebecca J. Leary, Victor E. Velculescu, Bert Vogelstein.

Application Number	20100136560 12/619726
Family ID	42223167
Publication Date	2010-06-03

United States Patent Application	20100136560
Kind Code	A1
Vogelstein; Bert ; et al.	June 3, 2010

Integrated Analyses of Breast and Colorectal Cancers

Abstract

Genome-wide analysis of copy number changes in breast and colorectal tumors used approaches that can reliably detect homozygous deletions and amplifications. The number of genes altered by major copy number changes--deletion of all copies or amplification of at least twelve copies per cell--averaged thirteen per tumor. These data were integrated with previous mutation analyses of the Reference Sequence genes in these same tumor types to identify genes and cellular pathways affected by both copy number changes and point alterations. Pathways enriched for genetic alterations include those controlling cell adhesion, intracellular signaling, DNA topological change, and cell cycle control. These analyses provide an integrated view of copy number and sequencing alterations on a genome-wide scale and identify genes and pathways that are useful for cancer diagnosis and therapy.

Inventors:	Vogelstein; Bert; (Baltimore, MD) ; Kinzler; Kenneth W.; (Baltimore, MD) ; Leary; Rebecca J.; (Baltimore, MD) ; Velculescu; Victor E.; (Dayton, MD)
Correspondence Address:	BANNER & WITCOFF, LTD. 1100 13th STREET, N.W., SUITE 1200 WASHINGTON DC 20005-4051 US
Assignee:	The Johns Hopkins University Baltimore MD
Family ID:	42223167
Appl. No.:	12/619726
Filed:	November 17, 2009

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
61119103	Dec 2, 2008

Current U.S. Class:	435/6.12
Current CPC Class:	C12Q 2600/156 20130101; A61P 35/00 20180101; C12Q 1/6886 20130101
Class at Publication:	435/6
International Class:	C12Q 1/68 20060101 C12Q001/68

Government Interests

[0001] The disclosed invention was made using funds from the U.S. government, particularly National Institutes of Health grants CA 043460, CA 057345, CA 062924, and CA 121113. The U.S. government therefore retains certain rights in the invention.

Claims

1. A method of characterizing a breast or colon tumor in a human, comprising the steps of: determining a mutated pathway selected from those listed in Table 3 or SI Table 6 in a breast or colon tumor sample by determining at least one somatic mutation in a gene in the pathway in a test sample relative to a normal sample of the human; assigning the breast or colon tumor to a first group of breast or colon tumors that have a somatic mutation in a gene in said pathway.

2. The method of claim 1 wherein the first group comprises tumors with mutations in a plurality of genes in said pathway.

3. The method of claim 1 wherein the first group consists of tumors with mutation in a gene in said pathway.

4. The method of claim 1 wherein a mutation is determined by DNA sequencing.

5. The method of claim 1 wherein a mutation is determined by DNA sequencing in which a polymerase or ligase enzyme is used to join together nucleotides and oligonucleotides complementary to the gene.

6. The method of claim 1 wherein a mutation is determined by DNA sequencing in which a single stranded DNA molecule obtained from the gene is hybridized to a single stranded DNA reagent.

7. The method of claim 1 wherein a mutation is determined by DNA sequencing in which DNA molecules are separated by length or mass.

8. The method of claim 1 wherein a mutation is determined by DNA sequencing employing a dideoxynucleotide inhibitor.

9. The method of claim 1 wherein a mutation is determined by DNA sequencing in an automated sequencing machine.

10. The method of claim 1 further comprising the steps of: administering a drug or drug candidate to the first group.

11. The method of claim 10 further comprising the steps of: administering a drug or drug candidate to a second group of breast or colon tumors that does not have a mutation in said pathway.

12. The method of claim 11 further comprising the steps of: comparing efficacy of the drug or drug candidate on the first group to efficacy on the second group; identifying a pathway which correlates with increased or decreased efficacy of the drug or candidate drug in the first group relative to the second group.

13. The method of claim 1 further comprising the steps of: comparing efficacy of a candidate or known anti-cancer therapeutic on the first group to efficacy on a second group of breast or colon tumors that does not have a mutation in said pathway; identifying a pathway which correlates with increased or decreased efficacy of the candidate or known anti-cancer therapeutic in the first group relative to other groups.

14. The method of claim 1 wherein the mutation is selected from the group consisting of a point mutation, a homozygous deletion, and a genomic amplification.

15. A method of detecting or diagnosing a breast or colon tumor or minimal residual disease of a breast or colon tumor or molecular relapse of a breast or colon tumor in a human, comprising the steps of: determining in a test sample of a tumor or suspected tumor of the human, a genomic amplification of at least one genomic region, said genomic region selected from the group consisting of those listed in SI Table 4 or Table 1; identifying the human as likely to have a breast or colon tumor, minimal residual disease, or molecular relapse of breast or colon tumor when the amplification is determined.

16. The method of claim 15 wherein genomic amplification is determined by generating fragments of genomic DNA from the test sample, ligating the fragments into a concatenate, and sequencing the concatenate.

17. The method of claim 15 wherein genomic amplification is determined by hybridizing genomic DNA from the test sample to an array of oligonucleotides.

18. The method of claim 15 wherein the genomic amplification is at least 6-fold increased relative to a normal sample of the human.

19. A method of detecting or diagnosing a breast or colon tumor or minimal residual disease of a breast or colon tumor or molecular relapse of a breast or colon tumor in a human, comprising the steps of: determining in a test sample of a tumor or suspected tumor of the human, a genomic deletion of at least one genomic region, said genomic region selected from the group consisting of those listed in SI Table 5 or Table 2; identifying the human as likely to have a breast or colon tumor, minimal residual disease, or molecular relapse of breast or colon tumor when the homozygous deletion is determined.

20. The method of claim 19 wherein genomic deletion is determined by generating fragments of genomic DNA from the test sample, ligating the fragments into a concatenate, and sequencing the concatenate.

21. The method of claim 19 wherein genomic deletion is determined by hybridizing genomic DNA from the test sample to an array of oligonucleotides.

22. The method of claim 19 wherein the genomic deletion is homozygous.

Description

TECHNICAL FIELD OF THE INVENTION

[0002] This invention is related to the area of classifying, characterizing, detecting and diagnosing cancers. In particular, it relates to breast and colorectal cancers.

BACKGROUND OF THE INVENTION

[0003] It is well accepted that cancer is the result of the sequential mutations of oncogenes and tumor suppressor genes (1). Historically, the discovery of these genes has been accomplished through analyses of individual candidate genes chosen on the basis of functional or biologic data implicating them in the tumorigenic process. Recent advances in genomic technologies and bioinformatics have permitted simultaneous evaluation of many genes, thereby offering more comprehensive and unbiased information (2, 3). For example, the sequence of large families of genes, and even the human genes in the Reference Sequence (RefSeq) database, have been determined in subsets of human cancers (4, 5). However, the alterations detected by sequencing represent only one category of genetic change that occurs in human cancer. Other alterations include gains (amplifications) and losses (deletions) of discrete chromosomal sequences that occur during tumor progression. Dramatic amplifications of oncogenes such as ERBB2 (6) or MYC (7) and deletions of tumor suppressor genes such as CDKN2A (8), PTEN (9, 10) and SMAD4 (11) have demonstrated the importance of these mechanisms of genetic alteration in particular tumor types. A comprehensive picture of genetic alterations in human cancer should therefore include the integration of sequence based alterations together with copy number gains and losses.

[0004] Evaluations of copy number changes in cancers using a variety of array types have been previously reported (12). Several of the more recent studies employed oligonucleotide arrays capable of distinguishing >100,000 genomic loci in colon, breast, lung, pancreatic, and skin cancers as well as certain leukemias (13-20). However, identification of focal, high copy amplifications or homozygous deletions (HDs) has infrequently been reported because many prior copy number analyses on arrays have used genomic DNA purified from primary tumors. Primary tumors contain varying proportions of non-neoplastic cells, thereby reducing the apparent extent of amplification and making it difficult to distinguish focal amplifications--defined by the increased copy number of a small region of the genome--from simple gains of whole chromosome arms. Furthermore, HDs can be difficult to discern in primary tumors due to confounding hybridization signals from non-neoplastic cells (17).

[0005] Many of the problems encountered with primary tumor samples can be overcome by use of early passage cancer cell lines or xenografts which are devoid of human non-neoplastic cells. Previous studies have shown that the process of generating such in vitro or in vivo cultures is not associated with the development of additional genetic alterations (21). It is now widely recognized that HDs found in cell lines and xenografts represent true genetic alterations that are present in clonal fashion in primary tumors but are difficult to document in the latter because of contaminating non-neoplastic cells (22, 23).

[0006] There is a continuing need in the art for methods to characterize, classify, detect and diagnose breast and colorectal cancers.

SUMMARY OF THE INVENTION

[0007] According to one embodiment of the invention a method of characterizing a breast or colon tumor in a human is provided. A mutated pathway selected from those listed in Table 3 or SI Table 6 is determined in a breast or colon tumor by determining at least one somatic mutation in a gene in the pathway in a test sample relative to a normal sample of the human. The breast or colon tumor is assigned to a first group of breast or colon tumors that have a somatic mutation in at least one gene in said pathway.

[0008] According to another embodiment of the invention a method of detecting or diagnosing a breast or colon tumor or minimal residual disease of a breast or colon tumor or molecular relapse of a breast or colon tumor in a human is provided. A genomic amplification of at least one genomic region is determined in a test sample of a tumor or suspected tumor of the human. The genomic region is selected from the group consisting of those listed in SI Table 4 or Table 1. The human is identified as likely to have a breast or colon tumor, minimal residual disease, or molecular relapse of breast or colon tumor when the amplification is determined.

[0009] According to another embodiment a method is provided of detecting or diagnosing a breast or colon tumor or minimal residual disease of a breast or colon tumor or molecular relapse of a breast or colon tumor in a human. A genomic deletion of at least one genomic region is determined in a test sample of a tumor or suspected tumor of the human. The genomic region is selected from the group consisting of those listed in SI Table 5 or Table 2. The human is identified as likely to have a breast or colon tumor, minimal residual disease, or molecular relapse of breast or colon tumor when the homozygous deletion is determined.

[0010] These and other embodiments which will be apparent to those of skill in the art upon reading the specification provide the art with methods for detecting, classifying, characterizing and diagnosing breast and colorectal tumors.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] FIG. 1 shows alterations in the combined FGF, EGFR, ERBB2 and PI3K pathways. Genes affected by copy number alterations are circled in red, while those altered by point mutations are circled in blue. The numbers of breast (B) and colorectal (C) tumors containing alterations are indicated in boxes adjacent to each gene.

[0012] FIG. 2A-2B shows the genomic landscape of copy number and nucleotide alterations in two typical cancer samples. FIG. 2A indicates breast cancer alterations while FIG. 2B indicates colorectal cancer alterations. The telomere of the short arm of chromosome 1 is represented in the rear left corner of the green plane and ascending chromosomal positions continue in the direction of the arrow. Chromosomal positions that follow the front edge of the plane are continued at the back edge of the plane of the adjacent row and chromosomes are appended end to end. Peaks indicate the 60 highest-ranking candidate cancer genes for each tumor type, with peak heights reflecting the passenger probability scores. The yellow and blue peaks correspond to genes that are altered by copy number changes, while those altered only by point mutations are purple. The dots represent genes that were altered by copy number changes (red squares) or point mutations (white circles) in the B9C breast or Mx27 colorectal tumor samples. Altered genes participating in significant gene groups or pathways (FIG. 13; SI Table 6) are indicated as black circles or squares.

[0013] FIG. 3 (SI FIG. 1.) shows a schematic of the experimental approach for integration of copy number and sequence alterations in breast and colorectal cancers.

[0014] FIG. 4 (SI FIG. 2.) shows detection of amplifications and homozygous deletions using Illumina arrays and Digital Karyotyping. Digital Karyotyping results are shown in the top graphs, with the chromosomal coordinates indicated on the horizontal axis and the Digital Karyotyping tag density ratio indicated on the vertical axis. Illumina array results are shown in the bottom graphs, with the chromosomal coordinates indicated on the horizontal axis and the Log R Ratio indicated on the vertical axis. Digital Karyotyping data were used to validate the Illumina arrays and to develop approaches for sensitive and specific detection of focal amplifications and homozygous deletions.

[0015] FIG. 5 (Table 1) Top candidate cancer genes in breast and colorectal cancer amplifications.

[0016] FIG. 6 (Table 2) Top candidate cancer genes in breast and colorectal cancer homozygous deletions.

[0017] FIG. 7 (Table 3) Candidate cancer pathways altered in breast and colorectal cancers.

[0018] FIG. 8 (SI Table 1) Comparison between Illumina™ array and Digital Karyotyping copy number analyses.

[0019] FIG. 9 (SI Table 2) Copy number changes detected by Digital Karyotyping in colorectal cancer.

[0020] FIG. 10 (SI Table 3) Amplifications and homozygous deletions detected by Illumina™ arrays in breast and colorectal cancers.

[0021] FIG. 11 (SI Table 4) Amplified genes in breast and colorectal cancers.

[0022] FIG. 12 (SI Table 5) Homozygously deleted genes in breast and colorectal cancers.

[0023] FIG. 13 (SI Table 6) Pathways enriched for copy number alterations and point mutations.

[0024] FIG. 14 (SI Table 7) Breast and colorectal cancer samples used in these analyses.

DETAILED DESCRIPTION OF THE INVENTION

[0025] The inventors have developed means of diagnosing, classifying, characterizing, and detecting breast and colorectal tumors based on somatic mutations in genes in pathways, including point mutations, genomic amplifications, and genomic deletions.

[0026] Xenografts or cell lines derived from breast and colorectal cancers were examined to obtain high resolution analyses of copy number and nucleotide alterations. Tumors were evaluated with microarrays containing at least 317,000 SNP probes and selected samples were also evaluated with Digital Karyotyping (24). This latter method provides a highly quantitative measure of gene copy number and was used to validate the sensitivity and specificity of the microarray data. The sequences of the 18,191 genes from the RefSeq database previously determined for breast and colorectal cancers were integrated with these results, providing a genome-wide analysis of sequence and copy number alterations.

[0027] The integrated mutational analysis described here provides a global picture of the genetic alterations of breast and colorectal cancers. The combination of sequencing and copy number analysis at the whole genome level permits the identification of genes and pathways that may not be easily detected by either analysis alone. The analysis of point mutations can provide independent information that can help identify candidate target genes in regions of amplification or HD. As gene groups and pathways can be affected by sequence and copy number changes, a combined analysis can highlight the groups that are enriched for these somatic alterations.

[0028] The analysis of copy number changes can also provide general insights into the functional effects of point mutations. Single nucleotide substitutions in genes that are observed to be deleted are more likely to be inactivating, while substitutions in genes that are amplified are more likely to be activating. This was confirmed by the observation of HDs and point mutations in TP53, SMAD2, SMAD3, and PTEN, all of which are thought to be tumor suppressors. If copy number changes faithfully reflect the overall effect of target genes, one would expect to infrequently see both amplifications and HDs of the same set of genes in human tumors. Accordingly, we observed an under-representation of genes that are homozygously deleted in one tumor and amplified in another: only two of the 1148 altered genes identified were altered by both amplification and HD (p<0.01, binomial test), and neither was considered a good candidate by the integrated statistical analyses.

[0029] In addition to identifying genes through the integrated analysis of point mutations and copy number changes, a number of issues arise from these studies that have implications for future large scale genomic analyses (36). One is that the complexity of genetic alterations in human cancer increases when considering both point alterations and copy number changes. In addition to a median of 84 and 76 genes altered by point mutation, breast and colorectal cancers have a median of 24 and 9 genes altered by a major copy number change. These observations support a view of the breast and colorectal cancer genomic landscape where a few commonly affected "gene mountains" are scattered among a much larger number of "gene hills" that are infrequently altered by either point mutation or copy number changes. An example of a cancer genome landscape that incorporates copy number changes, illustrated in FIG. 2, shows new gene mountains and hills that result from the combined analysis.

[0030] Though cancer genome landscapes are complex, they may be better understood by placing all genetic alterations within defined cellular pathways. Our analyses identified several converging gene pathways, including the ERBB2, EGFR and PI3K pathways, that were affected by copy number changes and point alterations in both breast and colorectal cancers. In addition, many pathways implicated in colorectal tumor progression (Notch, AKT, and MAPK) were enriched for alterations. Interestingly, many gene groups contained genes that were both amplified and others that were deleted, suggesting that different genes within the same group or pathway may be affected through alternate mechanisms. This is consistent with the observation that most signaling pathways contain both positive and negative regulators and alterations in any of these can lead to dysregulated signaling.

[0031] The copy number and sequence alterations reported here should be placed in the context of other analyses to reveal the full compendium of molecular changes in a tumor cell. One limitation of our approach is that the copy number analyses we performed may have missed very small regions (<20 kb) that were amplified or deleted. Use of arrays with higher numbers of SNPs or larger DK libraries generated using next generation sequencing approaches will help improve the sensitivity of these analyses. Additionally, the incorporation of approaches that detect structural changes (e.g. translocations) and epigenetic alterations will likely prove to be useful. Finally, as has been done with karyotypic abnormalities (37), it will be important to determine the timing of these alterations within each tumor type by analysis of additional tumor samples from different stages. In this regard, it should be noted that other methods of tumor isolation may not result in tumor DNA purity that will allow the sensitive and quantitative detection of copy number alterations afforded by our studies (38, 39).

[0032] The development of approaches to identify genetic alterations on a genome-wide scale has made the discovery of mutations the "easy" part of cancer gene discovery efforts. Functional studies to identify the culprits underlying the 1077 copy number changes discovered from our study would currently be impractical. The statistical techniques we developed highlight the best candidates for future functional studies, but it remains possible that specific loci are more likely to be altered by copy number changes than others because they are located near fragile sites or other hotspots for recombination (40). Therefore, these genetic analyses can only identify candidate genes that may play a role in cancer and do not definitively implicate any gene in the neoplastic process.

[0033] Several of the pathways identified affected a relatively high fraction of cancers and may be useful for cancer diagnosis or therapy. Alterations in signaling pathways of FGFR, EGFR, ERBB2 and PI3K were detected in nearly two thirds of breast and colorectal tumors that were comprehensively examined in this study. These data suggest that the ERBB2 inhibitors may be useful not only in breast cancer but also in selected colorectal cancer patients in combination with existing therapeutic agents. Additionally, a significant fraction of the breast tumors analyzed had genetic alterations in a process regulating DNA topology. Although TOP2A is co-amplified with ERBB2 and therefore does not represent the likely driver of this amplicon, alterations of TOP2A may still be of clinical utility. As higher doses of anthracyclines may improve clinical outcomes in breast cancer patients with TOP2A amplifications (41, 42), our observations suggest that the additional alterations that we identified could be used to select patients that may respond to topoisomerase-targeted therapies. In a similar fashion, tumor cells deficient in certain cellular processes as a result of HDs could be targeted pharmacologically through synthetic lethality. In a general sense, our discovery that a typical colorectal or breast cancer has 4 to 7 genes homozygously deleted suggests that further development of strategies targeting such HDs (43) could be widely applicable.

[0034] Mutations, including homozygous deletions, genomic amplifications, and point mutations, can be determined by any means known in the art, including but not limited to the methods described below. Sequencing, digital karyotyping, and hybridization to SNP arrays are non-limiting examples of techniques which can be used. DNA sequencing can be performed using any techniques which are known in the art, for example, based on chemical degradation, enzymatic synthesis, ligation, hybridization, etc. Enzymes which can be used include but are not limited to polymerases and ligases. Synthesized or degraded nucleic acids can be analyzed using techniques which separate molecules based on length or mass, for example. Sequence determinations can be performed manually or in an automated fashion. Some techniques which can be exploited utilize radiolabeled or fluorescently labeled nucleotides. Single stranded oligonucleotides can be employed as probes or primers, both of which may hybridize to the analyte. Some methods utilize dideoxynucleotides which act as monomers and terminators of DNA synthesis.

[0035] Mutation, deletion, or amplification determination involves one or more ex vivo samples which are processed in order to analyze the genetic material (or sometimes the proteins encoded by the genetic material). Typically this involves purification or enrichment of nucleic acids and removal or de-enrichment of other cellular components, such as proteins, lipids, and carbohydrates. The nucleic acids are further reacted chemically or enzymatically to yield readily detectable products which correspond to the nucleic acids in the ex vivo samples. Determination of a somatic mutation is done by comparing a tumor sample or characteristic to a normal sample of the same individual. Differences can be observed and recorded by a human or a machine or a computer.

[0036] Changes in copy number of a genomic segment can be determined by any means known in the art. In one technique, fragments (enzymatically generated or random) are generated and ligated together to form a chain or concatenate. The concatenates can be sequenced, and underrepresented or overrepresented fragments of the genome can be noted. Alternatively genomic DNA fragments can be hybridized to an array of oligonucleotides and their relative prevalence scored. Such techniques may detect deletions or amplifications. Changes in copy number may be from diploid to homozygous deletion, or amplifications ranging from diploid to at least 5-, at least 6-, at least 7-, at least 8-, at least 9-, at least 10-, at least 15-, at least 20-, at least 25-fold of diploid.

[0037] Tumors or patients bearing tumors can be divided into or assigned to groups based on the presence or absence of a particular somatic mutation. The group with the mutation may optionally contain tumors with a particular mutation in a particular gene, tumors with mutations in a single gene, or tumors with mutations in a single pathway. Groups comprising tumors with mutations in a single gene or a single pathway may have the same or different types of mutations.

[0038] Groups that are divided on the basis of a mutation in a gene or in a pathway may be used to evaluate drugs or other therapeutic treatments. This permits the determination of groups which are susceptible or refractory to the treatment. Thus patients who are susceptible can be successfully treated, and patients who are refractory can avoid expensive, potentially hazardous, and ultimately ineffective treatments.
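
By way of non-limiting illustration, the grouping and comparison described above can be sketched as follows. The tumor identifiers, the mutation sets, the response data, and the two-gene "pathway" are hypothetical placeholders chosen for the example; they are not taken from Table 3 or SI Table 6, and the function names are assumptions.

```python
def assign_groups(tumor_mutations, pathway_genes):
    """Split tumors by whether any somatically mutated gene lies in the pathway."""
    first_group, second_group = [], []
    for tumor_id, mutated_genes in tumor_mutations.items():
        if mutated_genes & pathway_genes:
            first_group.append(tumor_id)
        else:
            second_group.append(tumor_id)
    return first_group, second_group


def response_rate(group, responders):
    """Fraction of tumors in the group that responded; None for an empty group."""
    if not group:
        return None
    return sum(1 for tumor_id in group if tumor_id in responders) / len(group)


# Hypothetical inputs for illustration only.
pathway = {"ERBB2", "EGFR"}
tumors = {"T1": {"ERBB2", "TP53"}, "T2": {"SMAD4"}, "T3": set()}
first, second = assign_groups(tumors, pathway)
```

Comparing `response_rate(first, responders)` with `response_rate(second, responders)` for a given treatment would then indicate, in this simplified setting, whether the pathway correlates with increased or decreased efficacy.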

[0039] The mutations in genes and pathways, including point mutations, homozygous deletions, and amplifications, can also be used to detect or diagnose breast or colon tumors, or minimal residual disease of such tumors, or molecular relapse of such tumors. The mutations and genes and pathways which have been found are characteristic of these cancers and can be used to identify them in various stages of disease. Characteristic mutations are not necessarily present in all or even in a majority of tumors of the breast or colon.

[0040] Mutations found in tumors can be determined or confirmed by comparison to normal tissue. Somatic mutations are ones that occur in the tumor but are not found in normal tissue of the individual. Thus a comparison between tumor and normal can be used for identification and confirmation.

[0041] The above disclosure generally describes the present invention. All references disclosed herein are expressly incorporated by reference. A more complete understanding can be obtained by reference to the following specific examples which are provided herein for purposes of illustration only, and are not intended to limit the scope of the invention.

Example 1

Optimization of Copy Number Analysis with Digital Karyotyping

[0042] Digital Karyotyping (DK) was used as a standard to develop criteria for assessing amplifications and HDs with Illumina high density SNP arrays. Analysis of DK libraries from 18 colorectal tumor samples identified a total of 21 amplification events, each containing relatively small chromosomal regions (41 kb to 2.3 Mb) with 12 to 186 copies per nucleus (SI Table 2). We also found 4 regions within the autosomal chromosomes where the tag density reached zero, representing HDs. As expected, we identified low-amplitude gains and losses of entire chromosomes, chromosomal arms, or other large genomic regions. We did not pursue these low-amplitude copy number changes as it is difficult to reliably identify candidate cancer genes from such large regions. To ensure that the copy number changes identified by DK were bona fide amplifications or HDs, we independently examined 12 alterations by quantitative PCR and confirmed the presence of the genomic alterations in every case examined.

[0043] We then directly compared DK data to those obtained through genomic hybridization of the same DNA samples to Illumina high density oligonucleotide arrays. The Illumina platform employs a two step procedure based on oligo hybridization and single base extension for analysis of genomic SNPs (25). The combination of these two steps leads to greater fidelity of SNP calls and decreases false hybridization signals. Using fluorescence intensity measurements we developed an approach to detect amplifications resulting in 12 or more copies per nucleus (6-fold or greater amplification compared to the diploid genome) as well as deletions of both copies of a gene (HDs) (see SI Methods).
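
As a rough illustration of the threshold arithmetic, the sketch below converts a Log R Ratio into an idealized copy number and applies the 12-copy criterion. It assumes that the Log R Ratio is the base-2 logarithm of observed over expected intensity for a diploid reference and that intensity scales linearly with copy number. The function names and the homozygous deletion cutoff are hypothetical; the criteria actually used are those of the SI Methods.

```python
import math

DIPLOID_COPIES = 2
AMPLIFICATION_MIN_COPIES = 12   # from the text: 12 or more copies per nucleus
HD_MAX_LOG_R = -2.0             # hypothetical cutoff, not from the application


def copies_from_log_r(log_r_ratio: float) -> float:
    """Idealized copy number, assuming intensity is proportional to copy number."""
    return DIPLOID_COPIES * 2.0 ** log_r_ratio


def classify_segment(mean_log_r: float) -> str:
    """Label a segment from its mean Log R Ratio under the assumptions above."""
    if copies_from_log_r(mean_log_r) >= AMPLIFICATION_MIN_COPIES:
        return "amplification"
    if mean_log_r <= HD_MAX_LOG_R:
        return "homozygous deletion"
    return "no major change"


# 12 copies is 6-fold over diploid, i.e. an idealized Log R Ratio of log2(6).
threshold_log_r = math.log2(AMPLIFICATION_MIN_COPIES / DIPLOID_COPIES)
```

Because arrays are known to underestimate the copy number of amplifications, a practical cutoff would likely sit below this idealized value and would need to be calibrated against an independent method such as Digital Karyotyping.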

[0044] Using this new approach, 14 amplification events and 3 HD events identified by DK in 3 representative tumor samples were detected by Illumina arrays (SI Table 1 and SI FIG. 2). In all cases, the genomic boundaries identified by both approaches were similar and within the resolution expected for both methods. The copy number of amplifications determined by DK was higher than that determined for the same regions by Illumina arrays. It is known that arrays underestimate the copy number of amplifications (26) and this was confirmed by real-time PCR for amplifications in SI Table 1. No additional copy number changes of the sizes expected to be detected by DK were identified by the microarray approach in these samples. We did identify 25 additional small HDs; all of these were <250 kb in length and would not have been possible to detect with DK given the number of tags analyzed (24). To independently validate such smaller HDs, we used PCR and Sanger sequencing to examine genes located within small HDs and found that in each case, multiple exons of each gene could not be amplified or sequenced. These results suggested that our approach for analysis of Illumina array data provided a sensitive and specific method for identification of amplifications and HDs, including relatively small alterations of either type.

Example 2

Detection of Amplifications and Homozygous Deletions

[0045] A total of 45 breast and 36 colorectal tumors were analyzed by Illumina arrays containing either ~317,000 or ~550,000 SNPs (SI FIG. 1). To determine the fraction of alterations that were likely to be somatic (i.e., tumor derived), we analyzed these regions in 23 matched normal samples. In the normal samples, no amplifications and only four distinct HDs were detected. We removed these alterations from further analysis, as well as those corresponding to known copy number variation in normal human cells (27, 28). Finally, we removed any copy number changes where the boundaries were identical in two or more samples, as these were likely to represent germline variants. Based on this conservative strategy, we estimated that >95% of the 614 amplifications and 463 HDs (SI Table 3) represented true somatic alterations.
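The filtering strategy described above can be sketched in code. The following Python fragment is a minimal illustration and not the procedure actually used: the record layout, the function name filter_somatic, and the use of simple coordinate overlap to match calls against normal-sample and known-variant regions are all assumptions made for the example.

```python
def filter_somatic(calls, normal_regions, known_cnv_regions):
    """Keep copy number calls that are likely to be somatic.

    calls: list of dicts with keys "sample", "chrom", "start", "end".
    normal_regions, known_cnv_regions: lists of (chrom, start, end) tuples.
    All names and the overlap rule are hypothetical.
    """
    def overlaps(call, regions):
        # Assumption: any coordinate overlap counts as a match.
        return any(
            call["chrom"] == chrom and call["start"] <= end and start <= call["end"]
            for chrom, start, end in regions
        )

    # Record which samples share each exact (chrom, start, end) boundary.
    samples_by_boundary = {}
    for call in calls:
        key = (call["chrom"], call["start"], call["end"])
        samples_by_boundary.setdefault(key, set()).add(call["sample"])

    kept = []
    for call in calls:
        key = (call["chrom"], call["start"], call["end"])
        if overlaps(call, normal_regions):
            continue  # also seen in a normal sample
        if overlaps(call, known_cnv_regions):
            continue  # corresponds to known copy number variation
        if len(samples_by_boundary[key]) >= 2:
            continue  # identical boundaries in two or more samples
        kept.append(call)
    return kept
```

Calls that survive all three filters are treated as candidate somatic alterations; the order of the filters does not affect the result.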

[0046] Breast cancers contributed to a majority of the alterations identified, comprising 68% and 81% of the total HDs and amplifications, respectively. Individual colorectal and breast tumors had on average 7 and 18 copy number alterations, respectively. Each colorectal cancer had an average of 4 HDs and 3 amplifications. Breast cancers had on average 7 HDs and 11 amplifications. Several of the tumor samples contained copy number alterations that were separated by short non-amplified or deleted sequences, presumably reflecting the complex structure of these alterations (29, 30).

[0047] The copy number alterations observed encompassed on average 1.7 and 2.4 Mb of colorectal and breast haploid genomic sequence, respectively. Each HD affected the coding region of one gene on average, while an average amplicon contained two genes. The average numbers of protein-coding genes that were affected by either amplification or HD were 24 and 9 per breast and colorectal cancer, respectively.

Example 3

Genes Altered in More than One Tumor

[0048] One of the main challenges in the analysis of somatic alterations in cancers involves distinguishing those changes that are selected for during tumorigenesis (driver alterations) from those that provide no selective advantage (passenger alterations). Even in regions that have multiple copy number alterations, this distinction can be particularly difficult because regions of amplification and HD can contain multiple genes, only a subset of which are presumably the underlying targets. We reasoned that the integration of copy number analyses with sequence data would help reveal the driver genes that were more likely to contain genetic alterations. To accomplish this integration, we developed a new statistical approach for determining whether the observed genetic alterations of any type in any gene were likely to reflect an underlying mutation frequency that was significantly higher than the passenger rate. To analyze the probability that a given gene would be involved in a copy number alteration, we made the conservative assumption that the frequency of all amplifications and HDs observed in each tumor type represented the passenger mutation frequency (i.e., we assumed that all copy number changes were passengers). The number of actual copy number alterations affecting each gene in all tumors was then compared to the simulated number of expected passenger alterations, taking into account gene size, the distribution of SNP locations, and the frequency of passenger amplifications and HDs in breast and colorectal cancers.

[0049] We integrated these copy number analyses with the sequence data of the Sjoblom et al. and Wood et al. studies (5, 31). In these studies, the protein coding sequences of 20,857 transcripts from the 18,191 genes in the RefSeq database were determined in 11 breast and 11 colorectal cancer samples, allowing detection of somatic sequence alterations. Genes containing somatic alterations were subsequently analyzed for mutations in additional tumors of the same type. In the current study, the same 22 breast and colorectal tumor samples were analyzed in parallel by Illumina arrays, together with additional samples of each tumor type (SI FIG. 1 and SI Table 7). To integrate these different mutational data for each tumor type, we combined the probability that a gene was a driver gene based on the type and frequency of point mutations previously observed with the probability that the gene was a driver based on the number of observed amplifications and HDs.

[0050] Table 1 lists the loci that were amplified in at least one tumor and had the highest probability of containing driver genes as determined by the combined mutation analysis (a complete list of amplifications is provided in SI Table 3 and amplified genes in SI Table 4). For genes to be considered potential targets of the amplification, the entire coding region of the gene was required to be contained within a focal amplicon. A few candidate genes in this list (e.g., CCNE1 (cyclin E) and ERBB2) were amplified in multiple tumors but were not found to be mutated by sequencing. The majority of candidate genes, however, harbored point mutations in some tumors and amplifications in others. The most striking aspect of this list of candidate genes is that only some of them had been implicated in cancer in the past. Of the 19 genes indicated in Table 1, only 8 had been previously implicated in tumorigenesis. The known cancer genes included MYC, ERBB2 (HER2/NEU), CCNE1, CCND1, EGFR, FGFR2, and IRS2, each of which had been shown to be amplified. In addition, MRE11, which was amplified in breast cancers, has been shown to be mutated in a small fraction of colorectal cancers and is thought to play an essential role in maintaining chromosomal stability (32). Some genes were shown to be altered in both breast and colorectal cancers, with at least one of the tumors containing amplifications. Interestingly, among these genes, ERBB2 was found to be amplified in both breast and colorectal cancers, and FGFR2 was found to be mutated in breast cancers and amplified in colorectal cancers.

[0051] Table 2 similarly lists the loci that were homozygously deleted in at least one tumor and had the highest probability of containing drivers as determined by the combined mutation analysis (a complete list of HDs is provided in SI Table 3 and homozygously deleted genes in SI Table 5). For each of these genes, a portion of the coding region was affected by the HD. A number of genes previously known to be inactivated in colorectal or breast tumorigenesis, such as CDKN2A, PTEN, and TP53 are found in this list. We also identified genes, such as CHD5, MAP2K4, SMAD2, and SMAD3 that have been previously shown to be deleted in other tumor types, but not in colorectal or breast cancers. Finally, we discovered a number of genes not previously known to be affected by HD in any tumor type. For example, HDs as well as point mutations were found in OMA1 and ZNF521 in colorectal cancers and in MANEA, PCDH8, SATL1, and ZNF674 in breast cancers. During the course of preparing this manuscript, we identified through independent experimentation that PCDH8 is mutated and homozygously deleted in breast cancer (33). A number of genes that were less frequently altered in any one tumor type were shown to be affected at significant levels in both tumor types, including CDH20, FHOD3 and FNDC1.

Example 4

Pathways Enriched for Copy Number and Point Alterations

[0052] We examined whether groups of genes belonging to certain cellular processes or pathways were preferentially affected by genetic alterations. For this purpose, we developed a statistical approach that provided a probability that a pathway contained driver alterations, taking into account both the copy number changes and point mutations. This approach was similar to that described above for evaluating individual genes but in this case was applied to entire groups of genes involved in specific pathways or functional groups. Because the net effect of a pathway can be the same whether certain components are amplified or others deleted, all copy number alterations within a gene group were considered. The analysis was performed using three well-annotated GeneGo MetaCore databases: gene ontology (GO), canonical gene pathway maps (MA), and genes participating in defined cellular processes and networks (GG) (34). For each gene group, we considered whether the component genes were more likely to be affected by point mutations, amplifications, or HDs, as compared to all genes analyzed. Importantly, these analyses were based on analysis of the rankings of altered genes within each group using a modified version of gene set enrichment analysis (GSEA) (35), rather than the total number of mutations within individual groups. This approach limits the effects of single highly mutated genes and requires the involvement of multiple genes to score a pathway as significantly affected.

[0053] These analyses identified gene groups that were enriched for genetic alterations in these tumor types (Table 3). In particular, the EGFR and ERBB gene families were enriched for alterations. Interestingly, both of these signaling pathways involved various components of the PI3 kinase pathway, suggesting that the observed alterations may result in similar effects in these tumor cells (FIG. 1). A third of the genes in these combined pathways were altered by sequence changes, amplifications, or HDs. Enrichment of alterations in other canonical gene groups, including the Notch and G1-S cell cycle transition pathways, was also detected. The latter group included HDs of CDKN2A and CDKN2B genes as well as amplifications of cyclin D1, cyclin D3, and cyclin E3 genes in breast cancers. For all these gene groups, new genes were identified that had not previously been implicated by genetic alterations in these cellular processes. Finally, a variety of gene groups not previously known to be enriched for copy number changes in tumorigenesis were identified. These included genes implicated in cell-cell interaction and adhesion, including cadherins and metalloproteases, as well as other genes implicated in cellular interactions during early embryonic and neural development.

[0054] As an example, in colorectal cancers, a total of 33 cadherin and protocadherin genes were detected as being affected by copy number or sequence changes. In breast cancers, there was also enrichment in genes implicated in DNA topological control, including alterations in a number of topoisomerases (TOP1, TOP2A, TOP2B and TOP3A) and helicases. All pathways showing significant enrichment for genetic alterations are listed in SI Table 6.

REFERENCES

[0055] The disclosure of each reference cited is expressly incorporated herein.
[0056] 1. Vogelstein, B. & Kinzler, K. W. (2004) Cancer genes and the pathways they control. Nat Med 10:789-99.
[0057] 2. Bardelli, A. & Velculescu, V. E. (2005) Mutational analysis of gene families in human cancer. Curr Opin Genet Dev 15:5-12.
[0058] 3. Strausberg, R. L., Levy, S. & Rogers, Y. H. (2008) Emerging DNA sequencing technologies for human genomic medicine. Drug Discov Today 13:569-77.
[0059] 4. Greenman, C., et al. (2007) Patterns of somatic mutation in human cancer genomes. Nature 446:153-8.
[0060] 5. Wood, L. D., et al. (2007) The genomic landscapes of human breast and colorectal cancers. Science 318:1108-13.
[0061] 6. Slamon, D. J., et al. (1987) Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science 235:177-82.
[0062] 7. Collins, S. & Groudine, M. (1982) Amplification of endogenous myc-related DNA sequences in a human myeloid leukaemia cell line. Nature 298:679-81.
[0063] 8. Kamb, A., et al. (1994) A cell cycle regulator potentially involved in genesis of many tumor types. Science 264:436-40.
[0064] 9. Li, J., et al. (1997) PTEN, a putative protein tyrosine phosphatase gene mutated in human brain, breast, and prostate cancer. Science 275:1943-7.
[0065] 10. Steck, P. A., et al. (1997) Identification of a candidate tumour suppressor gene, MMAC1, at chromosome 10q23.3 that is mutated in multiple advanced cancers. Nat Genet 15:356-62.
[0066] 11. Hahn, S. A., et al. (1996) DPC4, a candidate tumor suppressor gene at human chromosome 18q21.1. Science 271:350-3.
[0067] 12. Pinkel, D. & Albertson, D. G. (2005) Array comparative genomic hybridization and its applications in cancer. Nat Genet 37 Suppl:S11-7.
[0068] 13. Camps, J., et al. (2008) Chromosomal breakpoints in primary colon cancer cluster at sites of structural variants in the genome. Cancer Res 68:1284-95.
[0069] 14. Weir, B. A., et al. (2007) Characterizing the cancer genome in lung adenocarcinoma. Nature 450:893-8.
[0070] 15. Haverty, P. M., et al. (2008) High-resolution genomic and expression analyses of copy number alterations in breast tumors. Genes Chromosomes Cancer 47:530-42.
[0071] 16. Mullighan, C. G., et al. (2007) Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature 446:758-64.
[0072] 17. Harada, T., et al. (2008) Genome-wide DNA copy number analysis in pancreatic cancer using high-density single nucleotide polymorphism arrays. Oncogene 27:1951-60.
[0073] 18. Stark, M. & Hayward, N. (2007) Genome-wide loss of heterozygosity and copy number analysis in melanoma using high-density single-nucleotide polymorphism arrays. Cancer Res 67:2632-42.
[0074] 19. Nagayama, K., et al. (2007) Homozygous deletion scanning of the lung cancer genome at a 100-kb resolution. Genes Chromosomes Cancer 46:1000-10.
[0075] 20. Zhao, X., et al. (2005) Homozygous deletions and chromosome amplifications in human lung carcinomas revealed by single nucleotide polymorphism array analysis. Cancer Res 65:5561-70.
[0076] 21. Jones, S., et al. (2008) Comparative lesion sequencing provides insights into tumor evolution. Proc Natl Acad Sci USA 105:4283-8.
[0077] 22. Cairns, P., et al. (1995) Frequency of homozygous deletion at p16/CDKN2 in primary human tumours. Nat Genet 11:210-2.
[0078] 23. Liggett, W. H., Jr. & Sidransky, D. (1998) Role of the p16 tumor suppressor gene in cancer. J Clin Oncol 16:1197-206.
[0079] 24. Wang, T. L., et al. (2002) Digital karyotyping. Proc Natl Acad Sci USA 99:16156-61.
[0080] 25. Steemers, F. J., et al. (2006) Whole-genome genotyping with the single-base extension assay. Nat Methods 3:31-3.
[0081] 26. Peiffer, D. A., et al. (2006) High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 16:1136-48.
[0082] 27. Conrad, D. F., Andrews, T. D., Carter, N. P., Hurles, M. E. & Pritchard, J. K. (2006) A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38:75-81.
[0083] 28. Sebat, J., et al. (2004) Large-scale copy number polymorphism in the human genome. Science 305:525-8.
[0084] 29. Volik, S., et al. (2003) End-sequence profiling: sequence-based analysis of aberrant genomes. Proc Natl Acad Sci USA 100:7696-701.
[0085] 30. Bignell, G. R., et al. (2007) Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. Genome Res 17:1296-303.
[0086] 31. Sjoblom, T., et al. (2006) The consensus coding sequences of human breast and colorectal cancers. Science 314:268-74.
[0087] 32. Wang, Z., et al. (2004) Three classes of genes mutated in colorectal cancers with chromosomal instability. Cancer Res 64:2998-3001.
[0088] 33. Yu, J. S., et al. (2008) PCDH8, the human homolog of PAPC, is a candidate tumor suppressor of breast cancer. Oncogene.
[0089] 34. Ekins, S., Nikolsky, Y., Bugrim, A., Kirillov, E. & Nikolskaya, T. (2007) Pathway mapping tools for analysis of high content data. Methods Mol Biol 356:319-50.
[0090] 35. Subramanian, A., et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102:15545-50.
[0091] 36. Collins, F. S. & Barker, A. D. (2007) Mapping the cancer genome. Pinpointing the genes involved in cancer will help chart a new course across the complex landscape of human malignancies. Sci Am 296:50-7.
[0092] 37. Hoglund, M., et al. (2002) Dissecting karyotypic patterns in colorectal tumors: two distinct but overlapping pathways in the adenoma-carcinoma transition. Cancer Res 62:5939-46.
[0093] 38. Paez, J. G., et al. (2004) Genome coverage and sequence fidelity of phi29 polymerase-based multiple strand displacement whole genome amplification. Nucleic Acids Res 32:e71.
[0094] 39. Arriola, E., et al. (2007) Evaluation of Phi29-based whole-genome amplification for microarray-based comparative genomic hybridisation. Lab Invest 87:75-83.
[0095] 40. Popescu, N. C. (2004) Fragile sites and cancer genes on the short arm of chromosome 8. Lancet Oncol 5:77; discussion 77.
[0096] 41. Tanner, M., et al. (2006) Topoisomerase IIalpha gene amplification predicts favorable treatment response to tailored and dose-escalated anthracycline-based adjuvant chemotherapy in HER-2/neu-amplified breast cancer: Scandinavian Breast Group Trial 9401. J Clin Oncol 24:2428-36.
[0097] 42. Coon, J. S., et al. (2002) Amplification and overexpression of topoisomerase IIalpha predict response to anthracycline-based therapy in locally advanced breast cancer. Clin Cancer Res 8:1061-7.
[0098] 43. Varshavsky, A. (2007) Targeting the absence: homozygous DNA deletions as immutable signposts for cancer therapy. Proc Natl Acad Sci USA 104:14935-40.
[0099] 44. Leary, R. J., Cummins, J., Wang, T. L. & Velculescu, V. E. (2007) Digital karyotyping. Nat Protoc 2:1973-86.

Example 5

Materials and Methods

[0100] DNA samples from tumor derived xenografts and cell lines were obtained and purified. DK libraries were generated and analyzed as previously described (24, 44). The Illumina SNP arrays were used to analyze tumor samples. Bioinformatic analyses were used to determine focal amplifications and HDs. Statistical methods were employed to determine the likelihood that genetic alterations occurred at a frequency higher than the passenger rate, and to identify gene groups enriched for copy number and sequence alterations.

Clinical Samples and Cell Lines

[0101] DNA samples were obtained from xenografts and cell lines of ductal breast and colorectal carcinoma. Normal DNA samples were obtained from matched normal tissue or peripheral blood. Twenty-two of the DNA samples were those used in the Discovery Screen of Sjoblom et al. and Wood et al. (1, 2). All tumor samples used for copy number analyses are listed in SI Table 9. For the Illumina analyses, the colorectal cancer samples used were cell lines (10) or xenografts (26), each developed from a liver metastasis of a different patient. The breast cancer samples used were cell lines (22) and xenografts (23), each developed from a different patient. In addition, 11 colorectal cancer metastases (immunopurified using the BerEP4 antibody as previously described (3)) and 7 cell lines were analyzed by Digital Karyotyping. Clinical information for samples that were analyzed by copy number and sequence analyses is available in Table S2 of reference (2). All samples were obtained in accordance with the Health Insurance Portability and Accountability Act (HIPAA).

Digital Karyotyping

[0102] Digital Karyotyping libraries were constructed as previously described (4, 5). In brief, 17 bp tags of genomic DNA were generated using the NlaIII mapping and SacI fragmenting restriction enzymes. For each library, the experimental tags obtained were concatenated, cloned and sequenced. SAGE2002 software was used to extract the experimental tags from the sequencing data. The sequences of the experimental tags were compared to the predicted virtual tags extracted from the human genome reference sequence hg16 (NCBI Build 34, July 2003) and were visualized using the SageGenie DKView to identify potential alterations (http://cgap.nci.nih.gov/SAGE/DKViewHome). The coordinates of all identified alterations were translated to the human genome reference sequence hg17 (NCBI Build 35, May 2004) to allow comparison to Illumina data.

[0103] Homozygous deletions were identified using a sliding window size of 175 virtual tags (~700 kb in size). Windows with a tag density ratio (observed tags in window/expected tags in window) <0.01 were considered to represent putative homozygous deletions and were further examined. Regions of homozygous deletions were defined as containing no experimental tags and the boundaries were determined as the outermost virtual tags with no matching experimental tags.
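As an illustration of the sliding-window scan described above, the following Python sketch flags windows with a low tag density ratio and then extends a flagged window to the outermost virtual tags without matching experimental tags. It is a simplified sketch under stated assumptions: the input is a list of per-virtual-tag experimental tag counts ordered by genomic position, the expected count per window is taken from the genome-wide mean, and the function names are hypothetical.

```python
def putative_hd_windows(tag_counts, window=175, threshold=0.01):
    """Return start indices of windows whose tag density ratio is below threshold.

    tag_counts[i] is the number of experimental tags matching the i-th
    virtual tag, with virtual tags ordered by genomic position.
    """
    n = len(tag_counts)
    total = sum(tag_counts)
    if n < window or total == 0:
        return []
    # Assumption: expected tags per window = window * genome-wide mean per virtual tag.
    expected = window * total / n
    hits = []
    observed = sum(tag_counts[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one virtual tag to the right.
            observed += tag_counts[start + window - 1] - tag_counts[start - 1]
        if observed / expected < threshold:
            hits.append(start)
    return hits


def hd_boundaries(tag_counts, start, window):
    """Extend an all-zero window to the outermost virtual tags with no experimental tags."""
    lo, hi = start, start + window - 1
    while lo > 0 and tag_counts[lo - 1] == 0:
        lo -= 1
    while hi < len(tag_counts) - 1 and tag_counts[hi + 1] == 0:
        hi += 1
    return lo, hi
```

The boundary function assumes the window passed to it contains no experimental tags, in keeping with the definition of a homozygous deletion given above.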

[0104] Amplifications were identified using sliding windows of variable sizes, as the most accurate window size for detection and quantification of amplifications is the exact size of the altered region. Windows with tag density ratios ≥6 were considered to represent amplified regions. Boundaries of the amplified region were determined by the outermost tag contained in a window with a tag density ratio >3 or by the virtual tag position after which there was a sharp decline in the observed experimental tags.
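The variable-window scan for amplifications can be sketched in the same way. The Python fragment below is illustrative only: it assumes the same per-virtual-tag count representation as above, takes the expected count from the genome-wide mean, and uses a hypothetical function name. A tag density ratio of 6 corresponds to roughly 12 copies relative to a diploid genome.

```python
def amplified_windows(tag_counts, window_sizes, min_ratio=6.0):
    """Return (first_index, last_index, ratio) for windows with ratio >= min_ratio.

    tag_counts[i] is the number of experimental tags matching the i-th
    virtual tag; window_sizes lists the window lengths (in virtual tags) to try.
    """
    n = len(tag_counts)
    total = sum(tag_counts)
    if n == 0 or total == 0:
        return []
    mean_per_tag = total / n  # assumption: genome-wide mean as the expectation

    # Prefix sums give the observed count of any window in constant time.
    prefix = [0]
    for count in tag_counts:
        prefix.append(prefix[-1] + count)

    hits = []
    for size in window_sizes:
        expected = size * mean_per_tag
        for start in range(n - size + 1):
            ratio = (prefix[start + size] - prefix[start]) / expected
            if ratio >= min_ratio:
                hits.append((start, start + size - 1, ratio))
    return hits
```

Trying several window sizes reflects the point made above that the most accurate window is the one matching the true size of the altered region.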

High Density SNP Arrays

[0105] The Illumina Infinium II Whole Genome Genotyping Assay employing the BeadChip platform was used to analyze tumor samples at 317,503 (317 k), 555,351 (550 k V1), or 561,466 (550 k V3) SNP loci from the Human HapMap collection. All SNP positions were based on the hg17 (NCBI Build 35, May 2004) version of the human genome reference sequence. The genotyping assay is a two-step procedure that is based on hybridization to a 50 nucleotide oligo, followed by a two-color fluorescent single base extension. The image files of fluorescence intensities were processed using Illumina BeadStation software to provide intensity values for each SNP position. For each SNP, the normalized experimental intensity value (R) was compared to the intensity values for that SNP from a training set of normal samples and represented as a ratio (called the "Log R Ratio") of log2(R_experimental/R_training set).
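For concreteness, the Log R Ratio defined above can be computed as follows. This is a minimal Python sketch; the function name and the choice to reject non-positive intensities are assumptions for illustration and do not describe the vendor software.

```python
import math


def log_r_ratio(r_experimental, r_training):
    """Return log2 of the experimental intensity over the training-set intensity."""
    if r_experimental <= 0 or r_training <= 0:
        # Assumption: non-positive intensities are treated as invalid input.
        raise ValueError("intensities must be positive")
    return math.log2(r_experimental / r_training)
```

Under this definition an unaltered locus gives a value near 0, and a locus whose intensity falls to one quarter of the training value gives -2.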

Bioinformatic Analysis of High Density SNP Array Data

[0106] Digital Karyotyping was used to inform and optimize the criteria for detection of focal homozygous deletions and high-copy amplifications using the Illumina arrays. Three colorectal cancer samples (Co44, Co82 and Co84) were assessed by Digital Karyotyping tag libraries as well as the Illumina arrays (SI Table 1). From these analyses criteria were developed to permit sensitive and specific detection of the Digital Karyotyping alterations using the Illumina platform as described below. These criteria were subsequently used to analyze an additional 46 breast and 33 colorectal cancers.

Detection of Homozygous Deletions

[0107] Homozygous deletions (HDs) were defined as two or more consecutive SNPs with a Log R Ratio value of ≤-2. The first and last SNPs of the identified HD region were considered to be the boundaries of the alteration for subsequent analyses. The deletion breakpoint would be expected to be located between the boundary deleted SNPs and adjacent non-deleted SNPs; use of the inner deleted SNP boundaries provides the most conservative approach as use of the outer boundaries may include non-deleted regions. To eliminate chip artifacts and potential copy number polymorphisms, we removed all HDs that were included in copy number polymorphism databases (6, 7). As these analyses showed that copy number polymorphisms had conserved boundaries, we also removed all observed HDs with identical boundaries that occurred in multiple samples. Adjacent homozygous deletions separated by one or two SNPs were considered to be part of the same alteration. Adjacent HDs were evaluated separately for the purposes of determining affected genes, but were counted as single entries in Table 2 and SI Table 5. To identify genes affected by HDs, we compared the location of coding exons in the RefSeq and CCDS databases with the genomic coordinates of the observed HDs. Any gene with a portion of its coding region contained within a homozygous deletion was considered to be affected by the deletion.
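The calling and merging rules for HDs can be illustrated with a short Python sketch. It assumes the Log R Ratio values for one chromosome of one sample are supplied in genomic order, and it returns SNP index ranges for the merged alterations; the function and parameter names are hypothetical.

```python
def call_hds(log_r, threshold=-2.0, min_snps=2, max_gap=2):
    """Return (first_index, last_index) ranges of homozygous deletions.

    A run needs at least min_snps consecutive SNPs with Log R Ratio <= threshold.
    Runs separated by at most max_gap SNPs are merged into one alteration.
    """
    runs = []
    start = None
    for i, value in enumerate(log_r):
        if value <= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last SNP.
    if start is not None and len(log_r) - start >= min_snps:
        runs.append((start, len(log_r) - 1))

    merged = []
    for run in runs:
        # Number of SNPs lying between this run and the previous one.
        if merged and run[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged
```

The returned boundaries are the first and last deleted SNPs, which corresponds to the conservative inner-boundary convention described above.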

Detection of Amplifications

[0108] High copy amplifications (i.e., ≥12 chromosomal copies as determined by Digital Karyotyping) were defined as regions having at least one SNP with a Log R Ratio ≥1.4, at least one in ten SNPs with a Log R Ratio ≥1, and an average Log R Ratio over the entire region of ≥0.9. The boundaries of amplified regions were delimited by the outermost SNPs with Log R Ratios >1. Similar to analyses of homozygous deletions, we removed all amplifications that had identical boundaries and occurred in multiple samples.
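The three amplification criteria can be expressed as a single test on the Log R Ratio values of a candidate region. The Python sketch below is an illustration under one stated assumption: "at least one in ten SNPs" is read as a fraction of at least 0.1 of the SNPs in the region. The function name is hypothetical.

```python
def is_high_copy_amplification(log_r):
    """Check the three criteria on the Log R Ratio values of one candidate region."""
    if not log_r:
        return False
    has_peak = max(log_r) >= 1.4                         # at least one SNP >= 1.4
    fraction_high = sum(v >= 1.0 for v in log_r) / len(log_r)
    mean_value = sum(log_r) / len(log_r)                 # average over the region
    return has_peak and fraction_high >= 0.1 and mean_value >= 0.9
```

All three conditions must hold together, so a single extreme SNP in an otherwise unremarkable region is not called as an amplification.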

[0109] As focal amplifications are more likely to be useful in identifying specific target genes, a second set of criteria was used to remove large chromosomal regions or entire chromosomes that showed copy number gains. These large alterations, called "complex amplifications", were thus distinguished from small focal alterations, called "simple amplifications". Based on observations from Digital Karyotyping, several steps were used to identify and remove complex amplifications. First, amplifications >3 Mb in size and groups of nearby amplifications (within 1 Mb) that were also >3 Mb in size were considered complex. Second, amplifications or groups of amplifications that occurred at a frequency of ≥4 amplifications in a 10 Mb region, or ≥5 amplifications per chromosome, were deemed to be complex. The amplifications remaining after these filtering steps were considered to be simple amplifications and were further examined. The complex regions were not included in subsequent statistical analyses but those containing candidate cancer genes are indicated in Table 1. To identify protein coding genes affected by amplifications, we compared the location of the start and stop positions of each gene within the RefSeq and CCDS databases with the genomic coordinates of the observed amplifications. As amplifications of a sub-genic region (i.e., containing only a fraction of a gene) are less likely to have a functional consequence, we focused our analyses on genes whose entire coding regions were included in the observed amplifications.
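One possible reading of the complex-amplification filter is sketched below in Python for the amplifications of a single tumor sample. Several details are assumptions made for the example and are not specified above: a group is formed by chaining amplifications whose gap to the preceding ones is at most 1 Mb, group size is measured from the first start to the last end, and a 10 Mb region is anchored at the start of each amplification. The function name is hypothetical.

```python
from collections import defaultdict

MB = 1_000_000


def classify_amplifications(amps):
    """Split (chrom, start, end) tuples into (simple, complex) lists."""
    by_chrom = defaultdict(list)
    for amp in amps:
        by_chrom[amp[0]].append(amp)

    flagged = set()
    for items in by_chrom.values():
        items.sort(key=lambda a: a[1])
        # Rule: five or more amplifications on one chromosome.
        if len(items) >= 5:
            flagged.update(items)
            continue
        # Rule: single amplifications or chained groups spanning more than 3 Mb.
        groups, group = [], [items[0]]
        for amp in items[1:]:
            if amp[1] - max(a[2] for a in group) <= 1 * MB:
                group.append(amp)
            else:
                groups.append(group)
                group = [amp]
        groups.append(group)
        for g in groups:
            if max(a[2] for a in g) - g[0][1] > 3 * MB:
                flagged.update(g)
        # Rule: four or more amplifications within a 10 Mb region.
        for i, first in enumerate(items):
            nearby = [a for a in items[i:] if a[2] - first[1] <= 10 * MB]
            if len(nearby) >= 4:
                flagged.update(nearby)

    simple = [a for a in amps if a not in flagged]
    complex_amps = [a for a in amps if a in flagged]
    return simple, complex_amps
```

Only the amplifications returned in the first list would be carried forward as simple (focal) amplifications in this sketch.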

[0110] A number of genes co-amplified or co-deleted with known oncogenes (CCND1, ERBB2, CCNE1, EGFR, MYC) or tumor suppressors (CDKN2A, PTEN, MAP2K4, TP53) were considered "known passengers" and eliminated from further statistical analysis. However, for completeness, these known passengers were listed along with their respective copy number alterations in SI Tables 4 and 5. Copy number alterations of known passengers were also listed in SI Tables 6 and 7, but these alterations were not used to calculate the passenger probabilities listed in the same tables. Alterations of known passengers were also excluded from statistical analysis of pathways (SI Table 8).

Statistical Analysis of Deletions and Amplifications

[0111] For each of the genes involved in amplifications or deletions, we quantify the strength of the evidence that they may be drivers of carcinogenesis by reporting a passenger probability, separately for amplifications and deletions. In each case, the passenger probability is an a posteriori probability that integrates information from the somatic mutation analysis of Wood et al. (2) with the data presented in this article. The passenger probabilities reported in Wood et al. (2) serve as a priori probabilities. These are available for three different scenarios of passenger mutation rates and results are presented separately for each. If a gene was not found to be mutated in Wood et al. (2) the prior passenger probability is set to the estimated proportion of passengers in the RefSeq set. Then, a likelihood ratio for "driver" versus "passenger" was evaluated using as evidence the number of samples in which a gene was found to be amplified (or deleted). Analysis is carried out separately by type of array, and then combined by multiplication of the relevant likelihood terms. The passenger term is the probability that the gene in question is amplified (deleted). For each sample, we begin by computing the probability that the observed amplifications (deletions) will include the gene in question by chance. Inclusion of all available SNPs is required for amplification, while any overlap of SNPs is sufficient for deletions. Specifically, if in a specific sample N SNPs are typed, and K amplifications are found, whose sizes, in terms of SNPs involved, are A1 . . . AK, a gene with G SNPs will be included at random with probability

(A1-G+1)/N+ . . . +(AK-G+1)/N

for amplifications and

(A1+G-1)/N+ . . . +(AK+G-1)/N

for deletions.
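The two expressions above can be evaluated directly. The following Python sketch does so for one sample; the function name and argument names are hypothetical, and the clipping steps are assumptions added for the example (an amplification smaller than the gene contributes zero, and the total is capped at 1), since the expressions as stated do not address those cases.

```python
def chance_inclusion_probability(n_snps, alteration_sizes, gene_snps, kind):
    """Probability that a gene with gene_snps SNPs is included by chance.

    n_snps: number of SNPs typed in the sample (N).
    alteration_sizes: sizes of the observed alterations in SNPs (A1 .. AK).
    kind: "amplification" (whole gene inside) or "deletion" (any overlap).
    """
    if kind == "amplification":
        # (A - G + 1) placements put the whole gene inside an alteration.
        # Assumption: negative values are clipped to zero.
        placements = [max(a - gene_snps + 1, 0) for a in alteration_sizes]
    elif kind == "deletion":
        # (A + G - 1) placements give at least one overlapping SNP.
        placements = [a + gene_snps - 1 for a in alteration_sizes]
    else:
        raise ValueError("kind must be 'amplification' or 'deletion'")
    # Assumption: the sum is capped at 1 so it remains a probability.
    return min(sum(placements) / n_snps, 1.0)
```

For example, with N = 1000, two alterations of 10 and 20 SNPs, and a gene of 3 SNPs, the expressions give (8 + 18)/1000 for amplifications and (12 + 22)/1000 for deletions.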

[0112] We then compute the probability of the observed number of amplifications (deletions) assuming that the samples are independent but not identically distributed Bernoulli random variables, using the Thomas and Taub algorithm (8), as implemented in R by M. Newton. Our approach to evaluating the passenger probabilities provides an upper bound, as it assumes that all the deletions and amplifications observed only include passengers. The driver term of the likelihood ratio was approximated as for the passenger term, after multiplying the sample-specific passenger rates above by a gene-specific factor reflecting the increase (alternative hypothesis) of interest. This increase is estimated by the ratio between the empirical deletion rate of the gene and the expected deletion rate for that gene.
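To make the calculation concrete, the Python sketch below computes the distribution of the number of altered samples when each sample has its own chance probability, and then combines the passenger and driver likelihoods with a prior passenger probability. It is not the Thomas and Taub algorithm or the R implementation referred to above: it substitutes a plain dynamic-programming convolution, which yields the same distribution. The function names, the capping of scaled rates at 1, and the use of the point probability at the observed count are assumptions made for the example.

```python
def poisson_binomial_pmf(probs):
    """Distribution of the number of successes among independent Bernoulli trials.

    probs[i] is the success probability of trial i; the trials need not be
    identically distributed. Returns a list pmf with pmf[k] = P(k successes).
    """
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this trial fails
            nxt[k + 1] += mass * p       # this trial succeeds
        pmf = nxt
    return pmf


def posterior_passenger_probability(prior_passenger, sample_probs, observed, factor):
    """Bayes update of the passenger probability for one gene.

    sample_probs: per-sample chance-inclusion probabilities (passenger model).
    observed: number of samples in which the gene was altered.
    factor: gene-specific multiplier applied to the rates under the driver model.
    """
    passenger_term = poisson_binomial_pmf(sample_probs)[observed]
    # Assumption: scaled rates are capped at 1 so they remain probabilities.
    driver_probs = [min(p * factor, 1.0) for p in sample_probs]
    driver_term = poisson_binomial_pmf(driver_probs)[observed]
    numerator = prior_passenger * passenger_term
    return numerator / (numerator + (1.0 - prior_passenger) * driver_term)
```

With a factor of 1 the driver and passenger terms coincide and the posterior equals the prior; larger factors combined with more altered samples than expected by chance lower the passenger probability.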

[0113] For each of the gene sets considered we quantify the strength of the evidence that they may include a higher-than-average proportion of driver genes. For each set, in a list of all the RefSeq genes sorted by a score combining information on mutations, amplifications and deletions, we compared the ranking of the genes contained in the set with the ranking of those outside, using the rank-sum test, as implemented by the Limma package in Bioconductor (9). Scores were obtained by adding three log likelihood ratios for mutations, amplifications and deletions. This combination approach makes an approximating assumption of independence of amplifications and deletions. In general, amplified genes cannot be deleted, so independence is technically violated. However, because of the relatively small number of dramatic amplifications and deletions, this assumption is tenable for the purposes of gene set analysis. Inspection of the log likelihoods suggests that they are roughly linear in the number of events, supporting the validity of this approximation as a scoring system. The statistical significance of deviation from the null hypothesis of a random distribution was calculated using Limma and then corrected for multiplicity by the q-value method (10) as implemented in version 1.1 of the package "q-value".
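The scoring and ranking comparison can be illustrated with a self-contained Python sketch. It is not the Limma or q-value implementation referred to above: it uses a plain Wilcoxon rank-sum statistic with a normal approximation, without tie or continuity corrections, and it omits the multiplicity correction. All function names are hypothetical.

```python
import math


def combined_score(llr_mutation, llr_amplification, llr_deletion):
    """Gene score: the sum of the three log likelihood ratios."""
    return llr_mutation + llr_amplification + llr_deletion


def average_ranks(values):
    """Return 1-based ranks in ascending order, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_p_value(scores, in_set):
    """One-sided p-value that genes in the set tend to have higher scores.

    scores: one score per gene; in_set: parallel list of booleans.
    Uses a normal approximation to the rank-sum statistic.
    """
    n1 = sum(1 for member in in_set if member)
    n2 = len(scores) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    ranks = average_ranks(scores)
    w = sum(rank for rank, member in zip(ranks, in_set) if member)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
```

Because the test works on ranks rather than raw totals, a single gene with a very large score cannot by itself make a gene set significant.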

METHODS REFERENCES

Example 5

[0114] 1. Sjoblom, T., Jones, S., Wood, L. D., Parsons, D. W., Lin, J., Barber, T. D., Mandelker, D., Leary, R. J., Ptak, J., Silliman, N., Szabo, S., Buckhaults, P., Farrell, C., Meeh, P., Markowitz, S. D., Willis, J., Dawson, D., Willson, J. K., Gazdar, A. F., Hartigan, J., Wu, L., Liu, C., Parmigiani, G., Park, B. H., Bachman, K. E., Papadopoulos, N., Vogelstein, B., Kinzler, K. W. & Velculescu, V. E. (2006) The consensus coding sequences of human breast and colorectal cancers. Science 314:268-74.
[0115] 2. Wood, L. D., Parsons, D. W., Jones, S., Lin, J., Sjoblom, T., Leary, R. J., Shen, D., Boca, S. M., Barber, T., Ptak, J., Silliman, N., Szabo, S., Derso, Z., Ustyanksky, V., Nikolskaya, T., Nikolsky, Y., Karchin, R., Wilson, P. A., Kaminker, J. S., Zhang, Z., Croshaw, R., Willis, J., Dawson, D., Shipitsin, M., Willson, J. K., Sukumar, S., Polyak, K., Park, B. H., Pethiyagoda, C. L., Pant, P. V., Ballinger, D. G., Sparks, A. B., Hartigan, J., Smith, D. R., Suh, E., Papadopoulos, N., Buckhaults, P., Markowitz, S. D., Parmigiani, G., Kinzler, K. W., Velculescu, V. E. & Vogelstein, B. (2007) The genomic landscapes of human breast and colorectal cancers. Science 318:1108-13.
[0116] 3. Saha, S., Bardelli, A., Buckhaults, P., Velculescu, V. E., Rago, C., St Croix, B., Romans, K. E., Choti, M. A., Lengauer, C., Kinzler, K. W. & Vogelstein, B. (2001) A phosphatase associated with metastasis of colorectal cancer. Science 294:1343-6.
[0117] 4. Wang, T. L., Maierhofer, C., Speicher, M. R., Lengauer, C., Vogelstein, B., Kinzler, K. W. & Velculescu, V. E. (2002) Digital karyotyping. Proc Natl Acad Sci USA 99:16156-61.
[0118] 5. Leary, R. J., Cummins, J., Wang, T. L. & Velculescu, V. E. (2007) Digital karyotyping. Nat Protoc 2:1973-86.
[0119] 6. Conrad, D. F., Andrews, T. D., Carter, N. P., Hurles, M. E. & Pritchard, J. K. (2006) A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38:75-81.
[0120] 7. Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M., Navin, N., Lucito, R., Healy, J., Hicks, J., Ye, K., Reiner, A., Gilliam, T. C., Trask, B., Patterson, N., Zetterberg, A. & Wigler, M. (2004) Large-scale copy number polymorphism in the human genome. Science 305:525-8.
[0121] 8. Thomas, M. A. & Taub, A. E. (1982) Calculating binomial probabilities when the trial probabilities are unequal. Journal of Statistical Computation and Simulation 14:125-131.
[0122] 9. Smyth, G. K. (2005) in Bioinformatics and Computational Biology Solutions using R and Bioconductor, eds. Gentleman, R., Carey, V., Dudoit, S., Irizarry, R. & Huber, W. (Springer, New York), pp. 397-420.
[0123] 10. Storey, J. D. & Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100:9440-5.

* * * * *
