Introduction

For generations, the heritable contribution to cancer has been the subject of intense study. Initially, twin and family studies provided evidence that specific cancers harbored probable genetic contributions but, until the last decade, progress was slow. The pace of discovery accelerated dramatically after the generation of a draft human genome sequence, which quickly led to international efforts to annotate different types of genetic variation in distinct populations (Lander et al. 2001; Venter et al. 2001; International HapMap Consortium 2003; Durbin et al. 2010). Some cancers, such as testicular, thyroid and melanoma, have stronger familial components than others, such as lung and endometrial (Hemminki et al. 2006; Lichtenstein et al. 2000). Equipped with first generation genotyping and sequencing technologies, such as restriction fragment length polymorphisms (RFLPs) for microsatellites and Sanger sequencing, investigators followed families with multiple affected individuals. Initially, mapping disease loci in humans, based on genetic linkage analysis, uncovered rare or uncommon mutations with large effect sizes. This approach utilized polymorphic microsatellite markers across the genome to scan for markers across a haplotype segregating within a family structure (NIH/CEPH Collaborative Mapping Group 1992; Elston and Cordell 2001). Identification of linkage peaks pointed towards regions that required further sequence analysis. In many circumstances, the regions harbored many possible genes for analysis. Consequently, only a small proportion of causative mutations were characterized, particularly in families loaded with breast cancer, colorectal cancer, melanoma or a constellation of cancers, such as Li-Fraumeni Syndrome (Hussussian et al. 1994; Kamb et al. 1994; Malkin et al. 1990; Miki et al. 1994; Wooster et al. 1995; Hall et al. 1990).

The success of using markers to map high penetrance mutations provided an important impetus for pursuing variants with smaller effect sizes now that annotation of common variation was beginning to take shape on the horizon. It was proposed that association testing could be more efficient for discovery of common variants in unrelated populations (Risch and Merikangas 1996; Risch 2000; Lander and Kruglyak 1995). The promise of association testing shifted many towards studies of unrelated subjects in search of common variants that could explain cancer patterns in the general population. For a decade and a half, the field pursued candidate genes based on plausible hypotheses that known variants in specific genes could sufficiently alter either function or expression, resulting in risk for cancer (Erichsen and Chanock 2004). Despite the enormous effort expended, this approach has yielded only a handful of conclusive associations; the most notable ones include NAT2 and GSTM1 in bladder cancer (Garcia-Closas et al. 2005, 2011; Moore et al. 2011) and alcohol dehydrogenase genes (ADH1B and ADH7) in aerodigestive cancers (Hashibe et al. 2008). Investigators chose coding variants that altered the predicted amino acids, but over time very few coding variants were confirmed, suggesting that the majority of common alleles, particularly those discovered in genome-wide screens, contribute through alterations in regulation or expression of genes or pathways.

New genotyping technologies, together with a comprehensive map of human haplotypes (International HapMap Project) (Altshuler et al. 2010; Frazer et al. 2007; International HapMap Consortium 2003, 2005), have enabled investigators to scan across the genome in progressively larger sets of cases and controls without prior hypotheses in search of common susceptibility alleles with low effect sizes. Genome-wide association studies (GWAS) have emerged as an important tool for discovering regions in the genome using hundreds of thousands of common single nucleotide polymorphism (SNP) markers. This ‘agnostic’ approach is predicated on using a set of markers as surrogates to test a much larger set of variants, namely those in strong linkage disequilibrium based on a high coefficient of correlation, r² > 0.8, commonly assessed in the HapMap reference populations (Barrett and Cardon 2006). In this regard, the markers are used to discover regions that harbor one or more variant(s) that are directly responsible for the association signal.
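The surrogate-marker idea above rests on the squared correlation between two SNPs. As a minimal sketch, assuming phased haplotypes with alleles coded 0/1 (the function names `ld_r2` and `is_tagged` are hypothetical, not taken from any cited tool), the statistic can be computed as follows:

```python
def ld_r2(haplotypes):
    """Squared correlation (r^2) between two biallelic SNPs.

    `haplotypes` is a list of (a, b) pairs, one per phased chromosome,
    with each allele coded 0 or 1. This input format is an assumption
    made for illustration.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


def is_tagged(haplotypes, threshold=0.8):
    """True if the marker is a surrogate for the variant at r^2 > threshold."""
    return ld_r2(haplotypes) > threshold


# Two SNPs that always travel together are perfect surrogates for each other.
perfect = [(1, 1)] * 4 + [(0, 0)] * 6
print(is_tagged(perfect))
```

In practice genotypes are usually unphased, so real analyses estimate haplotype frequencies first; the sketch skips that step.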

Shortly after the first GWAS were reported in noncancerous outcomes (Klein et al. 2005; Sladek et al. 2007), the field began to adopt a standard for testing and reporting findings from GWAS, known as genome-wide significance. The testing of thousands of related markers necessitated a comprehensive approach to address the problem of multiple testing, which in turn necessitated confirmation of signals in independent studies (Chanock et al. 2007). Some have advocated that the combined P value across studies in different stages provides an important benchmark for the observation (Skol et al. 2006). The value of reporting small P values, less than 5 × 10⁻⁷ or 10⁻⁸, has been to limit false positive reports and, in this regard, provides investigators with a threshold for the pursuit of follow-up studies (Wellcome Trust Case Control Consortium 2007; Hirschhorn and Daly 2005; Skol et al. 2006; Donnelly 2008). Since GWAS are conducted in unrelated subjects, methods have evolved to address differences in the background population substructure of cases and controls and determine when it is suitable to adjust for the differences, as opposed to removing outlier subjects (Patterson et al. 2006; Price et al. 2006; Yu et al. 2008).
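One simple way to see where thresholds of this magnitude come from is a Bonferroni correction, sketched below. This is an illustration only: the number of tests, the 0.05 family-wise error rate and the SNP labels are assumptions, and because GWAS markers are correlated, a Bonferroni bound is conservative rather than the definition of genome-wide significance.

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P value cut-off that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def genome_wide_hits(p_values, n_tests, alpha=0.05):
    """Return the markers whose P value falls below the corrected threshold.

    `p_values` maps a marker label to its association P value.
    """
    threshold = bonferroni_threshold(n_tests, alpha)
    return {snp: p for snp, p in p_values.items() if p < threshold}


# Hypothetical marker labels and P values, for illustration only.
scan = {"snp_a": 3e-9, "snp_b": 2e-6}
print(genome_wide_hits(scan, n_tests=1_000_000))
```

With 100,000 assumed tests the cut-off is 5 × 10⁻⁷, and with one million it is 5 × 10⁻⁸, which brackets the range quoted above.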

In response to the advent of the practical use of next generation sequencing technologies, the field is steadily shifting towards sequence analysis (Mardis 2011). Already, the International 1000 Genome Project is committed to sequencing over 1,500 unique subjects from the major continents to identify a comprehensive catalog of variants with an estimated minor allele frequency (MAF) of greater than 0.5% (Durbin et al. 2010); other international programs will augment the annotation of germline variations in the coming years. Imputation programs (Howie et al. 2009; Marchini et al. 2007; Browning and Browning 2007; Browning and Browning 2009; Li et al. 2009, 2010), based on analytical algorithms to infer common haplotypes, are being used for further “in silico” testing of additional promising variants, ones that ultimately require lab confirmation and sufficient evidence for replication (Chanock et al. 2007). However, the probability of accurate imputation appears to diminish as the MAF decreases below 3%, except in isolated, homogeneous populations (e.g. Iceland) (Helgason et al. 2005; Holm et al. 2011).

Discovery trends in cancer GWAS

To date, more than 150 associations have been reported for two dozen cancers by GWAS studies at or below the threshold for genome-wide significance (http://www.genome.gov/gwastudies/; see previous reviews, Chung et al. 2010; Hindorff et al. 2009; Varghese and Easton 2010). It is notable that all of the reported loci have been reported in multi-stage GWAS, in which a second or third round of analysis has been required to confirm the strength of signal. The vast majority of new loci in cancer GWAS have susceptibility alleles with MAFs greater than 10%, which is not surprising because of the sample sizes needed to detect regions with estimated per allele odds ratios of less than 1.4. The content of the commercial chips used to conduct the initial scans has been designed to most efficiently capture SNPs with MAFs greater than 10%. A small subset of loci have MAFs less than 10% but more than 5%, but as GWAS continue to increase in size, through either new scans or meta-analyses, it is likely that additional variants will be discovered in this range of MAFs.
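The dependence of sample size on MAF and effect size can be made concrete with a standard two-proportion approximation for an allelic case-control test. The sketch below is a rough guide under stated assumptions: equal numbers of cases and controls, the control allele frequency equal to the population MAF, the tested marker being the risk variant itself, and a two-sided significance level of 5 × 10⁻⁸. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Risk allele frequency in cases implied by a per allele odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def cases_needed(maf, odds_ratio, alpha=5e-8, power=0.8):
    """Approximate number of cases (with as many controls) for an allelic test.

    Uses the normal approximation for comparing two proportions, counting
    two alleles per subject. Assumes odds_ratio != 1.
    """
    if odds_ratio == 1:
        raise ValueError("no sample size can detect an odds ratio of exactly 1")
    p0 = maf
    p1 = case_allele_freq(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return math.ceil(alleles_per_group / 2)   # two alleles per subject


# Compare a low-frequency and a common allele at the same effect size.
for maf in (0.05, 0.10, 0.30):
    print(maf, cases_needed(maf, odds_ratio=1.2))
```

Under these assumptions the required number of cases rises steeply as the MAF or the odds ratio falls, which is consistent with most reported loci having MAFs above 10%.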

Overall, the per allele estimated effect size for each region has been lower than initially anticipated, generally between 1.1 and 1.4. The notable exception to the observed low effect size for many loci is the chromosome 12 locus in testicular cancer. This locus harbors the KITLG gene and has a significantly larger per allele effect estimated in the range of 2.5 (Kanetsky et al. 2009; Rapley et al. 2009), a result that is not entirely surprising given the strong familial component of testicular cancer. It is also notable that there are very few loci with effect sizes greater than 1.4 that have been found in cancer GWAS, suggesting that the underlying genetic architecture is more complex than previously anticipated. Moreover, to find the loci with smaller effect sizes, we have had to set aside epidemiological safeguards for the sake of expediency in discovery.

To achieve the power needed to detect common markers with small effect sizes, some studies have adopted designs that may be affected by subtle biases (e.g., uncontrolled confounding, survival bias or major differences in recruitment strategies for case only series). Some studies have increased sample size by comparing newly genotyped cases against previously genotyped, publicly available controls, which may not be well-matched to the cases on age or environmental risk factors (Crowther-Swanepoel et al. 2010; McKay et al. 2008; Shete et al. 2009; Wang et al. 2008). In many studies, available controls from registered access repositories (i.e., Cancer Genetics Markers of Susceptibility, CGEMS, or Wellcome Trust Case Control Consortium, WTCCC) (Wellcome Trust Case Control Consortium 2007; Yeager et al. 2007) have been used to discover new loci. Others have consciously adopted design strategies that are unbiased under the null of no genetic association but yield effect estimates that are not generalizable, such as comparing cases with a family history of cancer to population controls (Easton et al. 2007).

While some have criticized GWAS for failing to adequately control for environmental exposures, the large number of new regions discovered across studies of different designs, namely, case control, case-cohort and clinical series, have validated the utility of this approach. It is important to stress that the findings have been robust across study designs, many of which have used convenient controls and still replicated in the more rigorous nested case–control studies. In this regard, different design strategies can successfully discover new loci and this raises important questions about what constitutes an epidemiologically rigorous design for GWAS; it has been suggested that unnecessary matching or adjustment for known risk factors can lead to a decrease in the power to detect new loci (Kuo and Feingold 2010; Xing and Xing 2010). In the near future, additional studies will be needed to pursue the current crop of leads to better understand the differential contribution of loci to subtypes and the interaction with environmental/lifestyle observations.

Differences in study design can lead to different conclusions regarding the biological interpretation of an association. Initially, two distinct GWAS in prostate cancer yielded different results for chromosome 19q13.33, which harbors the gene encoding prostate-specific antigen (PSA) (Eeles et al. 2008; Thomas et al. 2008). In a GWAS using cohort studies, the effect appears to be related to PSA screening levels whereas in a study using advanced cases and controls with low PSA levels, the effect points towards carcinogenesis (Ahn et al. 2008; Eeles et al. 2008; Parikh et al. 2010). After further resequence analysis and follow-up genotyping, the results suggest that variants in KLK3, including a nonsynonymous SNP, could contribute to both prostate carcinogenesis and PSA levels (Parikh et al. 2010, 2011; Kote-Jarai et al. 2011).

Phenotypic heterogeneity, such as merging estrogen receptor negative (ER-neg) and positive cases (ER-pos), has been necessary to discover loci that contribute to different types of breast cancer. It will be the next generation of studies that further dissects the effects by subtypes to determine the relative contribution of known loci to known subtypes or identify loci associated with particular subtypes in specific cancers (Kraft and Haiman 2010; Bolton et al. 2010). Preliminary analyses in breast cancer studies have shown that a subset of the discovered loci may be specific to ER-pos breast cancer while select loci could be more important for ER-neg breast cancer (Garcia-Closas et al. 2008; Milne et al. 2009). New studies are needed to include the newer classification of breast cancer based on micro-array profiles (e.g., luminal A, luminal B, basal, and others) (Perou et al. 2000; Ross et al. 2000). Similarly, in non-Hodgkin’s lymphoma, distinct regions have been identified in the chronic lymphocytic leukemia (Papaemmanuil et al. 2009) and follicular subtypes (Skibola et al. 2009). In lung cancer, the initial signal that maps to a region with the telomerase gene (TERT) on 5p15.33 is primarily related to the adenocarcinoma subtype (Landi et al. 2009). Interestingly, a scan in Asian women who did not smoke confirmed this finding and reported a higher odds ratio for this locus (Hsiung et al. 2010).

Interestingly, none of the associations with susceptibility to a specific cancer have yet been strongly correlated with clinical outcomes, such as survivorship or evidence of metastatic disease. In prostate cancer, there are at least 35 distinct loci harboring common susceptibility alleles identified by GWAS yet not a single one clearly distinguishes between aggressive and non-aggressive disease, despite several promising preliminary reports (Wiklund et al. 2009; Duggan et al. 2007; Liu et al. 2009; Xu et al. 2010). This observation suggests that there could be regions that distinctly contribute to cancer risk separate from cancer outcomes (e.g., aggressive or metastatic disease).

The majority of cancer GWAS have been conducted in individuals of European background using commercial array chips optimized for populations of European background. Recently, there has been a steady increase in the number of loci reported in Asian subjects with nearly two dozen novel markers reported in eight different cancers, some in cancers with higher incidence in Asia. Because of the differences in LD patterns across distinct populations, it is not surprising that the initial markers do not yield robust signals in other populations (Clarke et al. 2011). Studies are underway in individuals of African American background to investigate known regions with different linkage disequilibrium patterns, which could narrow the windows for subsequent functional studies. In progress are sufficiently sized GWAS in African Americans for breast, colorectal, lung and prostate cancers. Already, a new locus has been conclusively identified in African American men with prostate cancer (Haiman et al. 2011). Additional variants in 8q24 have been reported in men of African American background, but the sum of these cannot adequately explain the difference in incidence at this time (Haiman et al. 2007).

Few regions discovered in GWAS have been conclusively shown to interact with environmental exposures. This is striking in view of the discovery of 200 regions in the genome in large meta-analyses of height (Lango Allen et al. 2010). In a recent bladder cancer GWAS, the effect of a tagging SNP for NAT2, an established candidate gene, is only seen in ever-smokers whereas in never-smokers there is no effect (Garcia-Closas et al. 2011; Rothman et al. 2010). For GWAS of lung cancer, a disease strongly driven by exposure to tobacco products, so far only three or four regions have been conclusively established (McKay et al. 2008; Wang et al. 2008; Hung et al. 2008; Thorgeirsson et al. 2008; Landi et al. 2009). Of these regions, the signal on chromosome 15 could also be related to smoking (Chanock and Hunter 2008; Thorgeirsson et al. 2010; Truong et al. 2010).

The exploration of candidate pathways in GWAS has been hampered by inadequate sample sizes to discover and confirm findings. As progressively larger meta-analyses are conducted, it will be possible to explore this approach more comprehensively. Still, in the first discovery phase, the majority of cancer GWAS loci have pointed to primarily new or unknown regions and genes. A notable exception has been the GWAS of pediatric lymphoblastic leukemia, which have uncovered three sets of markers pointing to genes involved in B-cell development (Papaemmanuil et al. 2009; Trevino et al. 2009). Similarly, GWAS of testicular cancer have identified a number of regions harboring plausible candidate genes that are involved in the development of the testes. Moreover, for a disease such as breast cancer, which has been epidemiologically linked to hormones, surprisingly only one signal so far maps to a region harboring estrogen/progesterone-related genes (Zheng et al. 2009b).

Though many have suggested that copy number variations (CNVs) would explain a portion of the contribution of common variants to cancer risk, very few CNVs have been conclusively associated with cancer risk in unrelated populations. This is probably a consequence of the daunting technical problem of stably calling CNVs (Marenne et al. 2011) and the paucity of CNVs sufficiently common to be analyzed in an ‘agnostic’ scan. Also, it has been estimated that a large proportion of common CNVs (with MAF > 5%) are in strong LD with common SNPs, which suggests that many have been indirectly surveyed in scans thus far (Korn et al. 2008; McCarroll et al. 2008). Since a preliminary analysis of CNVs in the 1000 Genome Project suggests that the majority of CNVs have MAFs well below 5%, it is expected that a subset of CNVs could be identified in families or isolated populations (Mills et al. 2011). Still, it is notable that GWAS of neuroblastoma, a rare pediatric cancer, has identified copy number variants on chromosome 1q21.1 (Capasso et al. 2009; Diskin et al. 2009; Maris et al. 2008).

Nexus regions discovered in cancer GWAS

At least seven regions have been associated with more than one distinct cancer type. What is striking in a few instances is the localization of multiple cancers to small contiguous regions with complex linkage disequilibrium patterns (see Figs. 1, 2). At the same time, it is not unexpected to observe variants in the HLA regions on chromosome 6 for cancers of the immune system, such as non-Hodgkin’s lymphoma (NHL) (Skibola et al. 2009; Conde et al. 2010; Wang et al. 2011), and for cancers with established viral etiologies such as nasopharyngeal carcinoma associated with Epstein–Barr virus (Bei et al. 2010; Tse et al. 2009). The HLA has been the subject of intense study for decades in small underpowered studies. However, the current commercial arrays provide only partial coverage of this region, known for its complex structure and gene conversion. Other technologies will be required to comprehensively explore this region in cancer susceptibility.

Fig. 1

Multiple-cancer susceptibility loci on 8q24 defined by cancer GWAS. Approximate locations of GWAS reported cancer susceptibility loci are indicated by vertical arrows on the linkage disequilibrium pattern of the 1000 genome CEU data (Nov 2010 release, chr8:127,878–128,880 kb genomic region, reference build 37.1). The arrowheads indicate probable recombination hotspots as per HapMap I+II. Five distinct regions have been associated with prostate cancer risk (regions 1–5). Region 3 is also conclusively associated with colorectal cancer and precancerous colorectal adenomas. Region 4 also harbors a breast cancer susceptibility locus, rs13281615. A bladder cancer susceptibility locus, rs9642880, is telomeric to region 1 and approximately 30 kb centromeric to the MYC oncogene. A recently identified ovarian cancer susceptibility locus, rs10088218, lies >700 kb telomeric to MYC

Fig. 2

GWAS regions on 5p15.33, including the TERT-CLPTM1L locus. Cancer GWAS have identified susceptibility loci for seven cancers on 5p15.33, depicted on a linkage disequilibrium heat map of the 1000 genome CEU data (Oct 2010 release, chr5:1,301–1,404 kb genomic region, reference build 36.3). Approximate location of TERT and CLPTM1L genes are depicted by thick black lines and each susceptibility locus is labeled with a color letter block

The region flanking the MYC oncogene on 8q24 harbors at least five independent loci associated with prostate cancer as well as loci associated with cancers of the breast, colon, bladder, ovaries and chronic lymphocytic leukemia (Al Olama et al. 2009; Crowther-Swanepoel et al. 2010; Easton et al. 2007; Gudmundsson et al. 2007; Kiemeney et al. 2008; Tomlinson et al. 2007; Zanke et al. 2007; Amundadottir et al. 2006; Yeager et al. 2007, 2009a) (Fig. 1). Though the MYC oncogene remains an attractive plausible gene, it is not clear exactly how the variants centromeric to MYC, none of which are in LD with variants within the gene or proximal promoter, alter expression. Preliminary work suggests that allelic differences in enhancers could directly or indirectly interact with MYC (Pomerantz et al. 2009; Tuupanen et al. 2009). Other pathways have been suggested in preliminary studies, but additional work is needed to dissect the genetic basis of these findings.

On 5p15.33, there are at least seven cancers, namely, basal cell carcinoma, bladder, brain, lung (including the subtype, adenocarcinoma), melanoma, pancreas and testicular that map to the TERT-CLPTM1L locus (Fig. 2) (Hsiung et al. 2010; Landi et al. 2009; Turnbull et al. 2010b; Shete et al. 2009; Broderick et al. 2009; McKay et al. 2008; Petersen et al. 2010; Wang et al. 2008; Rothman et al. 2010). There is additional evidence for associations with other cancers, such as prostate, breast and uterine cervix, based on candidate studies in follow-up of GWAS hits (Rafnar et al. 2009). This locus is notable for the telomerase gene (TERT) in which rare mutations have been associated with dyskeratosis congenita (an inherited bone marrow failure syndrome), idiopathic pulmonary fibrosis, acute myelogenous leukemia and chronic lymphocytic leukemia (Terrin et al. 2007; Calado et al. 2009a, b; Yamaguchi et al. 2005; Tsakiri et al. 2007; Mushiroda et al. 2008; Armanios et al. 2007). What is also remarkable is that a protective allele for one cancer appears to be a susceptibility allele for another cancer. It is particularly surprising to observe this inverse relationship between basal cell carcinoma and melanoma, two cancers of the skin strongly associated with sun exposure (Stacey et al. 2009; Rafnar et al. 2009).

A region on chromosome 11q13 harbors variants associated with risk for prostate cancer, renal cancer and breast cancer (Eeles et al. 2008; Thomas et al. 2008; Purdue et al. 2011; Turnbull et al. 2010a), three unrelated epithelial tumors with distinct epidemiological patterns and risk factors (Fig. 3). Initially, one locus was reported for prostate cancer, but in two fine mapping studies, as many as three independent loci have been established, two separated from the centromeric loci by a recombination hot spot (Chung et al. 2011; Zheng et al. 2009a). The prostate signals map to an intergenic region flanked by TPCN2 at its centromeric end and by MYEOV at its telomeric end; the former encodes two-pore segment channel 2, which was recently found to contain two coding SNPs, associated with blond versus brown hair color (Sulem et al. 2008), whereas MYEOV is frequently over-expressed in multiple myeloma, breast cancer and oral and esophageal squamous cell carcinomas (Janssen et al. 2000, 2002).

Fig. 3

Multiple-cancer susceptibility region on 11q13. Five susceptibility loci for three types of cancer (prostate, kidney and breast) localize to a region of less than 400 kb on 11q13. The annotated surrogates (r² > 0.8) are superimposed on a linkage disequilibrium heat map of the 1000 genome CEU data (July 2010 release, chr11:68,564–69,233 kb genomic region, reference build 36.3). Reference genes (black bar) and recombination hotspots according to HapMap (black arrowheads) found in the UCSC browser are annotated

GWAS of gastric cancer and esophageal squamous cell cancer in China identified a set of highly correlated variants across the PLCE1 gene on chromosome 10q23 (Abnet et al. 2010; Wang et al. 2010). Gastric cancer and esophageal cancer patterns in China are highly similar, and together represent a major source of mortality, whereas in the US, the two represent less than 1% of cancer mortality. In the GWAS, the effect was seen in gastric cancer of the cardia (near the Z line of the gastro-esophageal junction) whereas the signal was not significant in gastric cancers of the main body of the stomach. Even in this initial discovery GWAS, important subtype differences demonstrate distinct etiological bases for cancers in close proximity.

Even though the epidemiologic literature has strongly connected gastric cancer risk with ABO blood group type (Edgren et al. 2010; You et al. 2000), the genotypes that determine the ABO blood group did not rise to the top of the initial gastric cancer scans. Instead, in a GWAS of pancreatic cancer, a variant in the ABO blood group antigen was the most significant locus, confirming a finding suggested 50 years ago (Amundadottir et al. 2009; Bodmer and Bonilla 2008). In this case, a GWAS finding tracked back to a finding previously described in the epidemiologic literature, recently confirmed for self-described blood group type (Wolpin et al. 2009).

Follow-up to GWAS discovery

As the first stage of cancer GWAS has laid the foundation to discover new regions that harbor susceptibility alleles, the follow-up of regions is critical to elucidate the biological underpinnings of the direct association of common susceptibility alleles. In contrast to the GWAS paradigm that uses a standard multistage approach towards discovery of regions, follow-up studies will need to investigate each region individually based on the genomic structure, including linkage disequilibrium patterns of additional variants that are highly correlated with the initial marker(s), annotation of plausible candidate genes and the co-existence of susceptibility alleles for other phenotypes, both those that are related and those seemingly unrelated. Each region will require formidable resources to conduct the required two stage approach: first, the mapping of possible variants; second, the follow-up functional studies that provide biological plausibility. As of yet, the field is very early in pursuing the known regions and only a few have sufficient peer-reviewed data to begin explaining the biological basis of the direct association. Since there are often many common alleles, each providing a small contribution, it might not be appropriate to invoke the term ‘causal’ but instead ‘direct association’, namely the variant(s) whose function(s) explains the basis of the observed association.

Fine mapping of each region has been accelerated by the 1000 Genome Project along with International HapMap, though each sample set is insufficient to capture all variants. Hence, some have advocated regional resequencing to augment the public databases (Parikh et al. 2010; Yeager et al. 2008, 2009b) for pursuit of common variants. In some cases, upon closer inspection, more than one common variant, each with small effect sizes, contributes to the specific cancer susceptibility as has been demonstrated for 8q24, 11q13 and the HNF1B locus on chromosome 17 (Amundadottir et al. 2006; Eeles et al. 2009; Gudmundsson et al. 2007; Haiman et al. 2007; Yeager et al. 2007, 2009a; Chung et al. 2011; Zheng et al. 2009a; Sun et al. 2008). It is still too early in the post-GWAS era to assess whether many of the GWAS signals will be explained by common variants or perhaps, by less common or rare variants residing on the background of common haplotypes, as suggested by some but questioned by others (Dickson et al. 2010; Wray et al. 2011).

As mentioned above, differences in population genetics history have yielded distinct patterns of linkage disequilibrium. These can be used to narrow the window for possible direct association of variants. In admixed individuals (e.g. African, East Asian or Latino/admixed), it is possible to search for admixture markers that might explain differences in disease disparity among different ethnic groups (Eeles et al. 2009; Haiman et al. 2007; Winkler et al. 2010).

So far, nearly all GWAS signals have mapped to non-coding regions, suggesting the contribution of regulatory mechanisms in neighboring genes. With further mapping of regions, it is possible that known markers could tag distant coding variants. The absence of coding variants suggests that minor shifts in protein function may be difficult to detect in the context of a complex disease such as cancer (Hindorff et al. 2009). In this regard, bioinformatic analyses become central to determining possible new transcripts or microRNA species that could be altered by the SNP either directly or indirectly (Coetzee et al. 2010). Still, it is possible that some variants act at a distance, meaning not on the closest genes but elsewhere; pilot studies have shown this to be the case in lymphoblastoid cell lines (Dimas et al. 2009; Stranger et al. 2007).

Initially, select regions identified by GWAS have been shown to harbor plausible candidate genes, e.g., FGFR2 for breast cancer or MSMB/NCOA4 for prostate cancer. For example, the best marker for the prostate cancer locus on 10q11, known as rs10993994, had been previously described in the promoter of the MSMB gene, which encodes the protein PSP94 and is a gene under intense investigation as a biomarker for prostate cancer (Eeles et al. 2008; Thomas et al. 2008). The T allele of rs10993994 is associated with lower MSMB gene expression in cell lines and tumor tissue, evidence corroborated by electrophoretic mobility shift assays and luciferase transfection studies (Chang et al. 2009; Liu et al. 2008; Lou et al. 2009). In further analysis, the same promoter SNP influences the expression of the neighboring gene, NCOA4, an androgen receptor coactivator (Pomerantz et al. 2010).

GWAS and future clinical utility

Since we are still early in the discovery of common genetic variants associated with risk for specific cancers, it should not come as a surprise that the application of common SNPs to disease prediction is not ready for widespread use. Common genetic variants represent only a proportion of genetic variants that contribute to disease risk (Manolio et al. 2009); uncommon, rare and copy number variants will undoubtedly contribute to risk in both familial and unrelated settings, and their relative contributions are expected to vary by cancer site. It follows that common SNPs alone do not adequately explain the heritability of common, complex diseases (Lee et al. 2011).

The identification of susceptibility alleles for specific cancers has prompted investigation into applying common SNPs to individualized disease prevention and perhaps, public health policy. To date, the data do not adequately support the clinical utility, despite the suggestions of commercial direct-to-consumer vendors. Common alleles do not provide sufficient discrimination to warrant general use but instead should be further evaluated in the context of risk stratification based on additional factors for disease prevention in the general population.

Shortly after the discovery of the first handful of SNPs associated with breast cancer and prostate cancer, reports appeared suggesting a clinical utility for disease risk prediction that also included family history (Zheng et al. 2008). While initially some argued that the population attributable risk for each locus could be an effective measure, with time interest in this metric has subsided. Instead, many have focused on shifts in the area under the curve (AUC) for receiver operating characteristic (ROC) curves (Kraft et al. 2009; Wray et al. 2010). It is notable that ten GWAS SNPs for breast cancer perform comparably to 30 years of clinical risk factors: the AUC for each moved from roughly 50% to less than 60%, but when combined provided little additive benefit (Wacholder et al. 2010; Gail 2009). Others have argued that SNPs could be useful in assessing the highest and lowest extremes of the distribution of common risk alleles (Pharoah et al. 2008). New approaches focused on applying common genetic variants could be beneficial in the reclassification of individuals based on genetic risk profiles or alternatively, the identification of subsets of individuals at high risk who might undergo a diagnostic or therapeutic intervention (Sun et al. 2011).
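To give a feel for why a handful of low-effect SNPs yields only a modest AUC, the sketch below uses a common normal approximation for a log-additive risk score. It is an illustration under strong assumptions, not the method of the cited studies: loci are independent and in Hardy-Weinberg equilibrium, risks combine multiplicatively, the disease is rare, and the score is approximately normal in cases and controls. The function names and the example panel are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def log_risk_variance(snps):
    """Variance of the log relative risk score across the population.

    `snps` is an iterable of (risk_allele_frequency, per_allele_odds_ratio).
    Each locus contributes 2p(1-p) * (ln OR)^2 under Hardy-Weinberg equilibrium.
    """
    return sum(2 * p * (1 - p) * log(odds_ratio) ** 2 for p, odds_ratio in snps)


def expected_auc(snps):
    """Approximate AUC of the score: Phi(sigma / sqrt(2)).

    Under the stated assumptions the score has the same variance sigma^2 in
    cases and controls, and its mean is shifted by sigma^2 in cases.
    """
    sigma = sqrt(log_risk_variance(snps))
    return NormalDist().cdf(sigma / sqrt(2))


# Hypothetical panel: ten SNPs, each with risk allele frequency 0.3 and OR 1.2.
panel = [(0.3, 1.2)] * 10
print(round(expected_auc(panel), 3))
```

Because the AUC grows with the square root of the summed variance, adding more loci of similar effect gives diminishing gains in discrimination.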

Though the discovery of additional genetic variants should improve risk prediction, the different underlying genomic architecture of each cancer will need to be considered. Table 1 depicts the marked differences between four leading cancers. There is a notable difference between the number of common loci for prostate and lung cancer. Park et al. (2010) analyzed existing GWAS data for breast, colorectal and prostate cancer to estimate the number of additional variants expected to be found with further scans. For each of these three cancers, the limit for the AUC appears to be 0.8, suggesting that additional uncommon variants exist and explain a proportion of the risk.

Table 1 Number of independent loci discovered by GWAS for common cancers based on published findings of cancer GWAS as of April 2011

The application of GWAS to therapeutic outcomes and toxicities is lagging behind the etiologic studies, partly due to the lack of sufficiently large sample sizes and also the daunting problem of clinical heterogeneity in treatment protocols and support. However, for pharmacogenomics to eventually have an impact in the clinic, the discovery of regions will need to take place in larger clinical studies prior to integration into risk models or stimulation of new targets for therapy and chemoprevention.

Recently, one group published a GWAS looking at clinical outcomes in children receiving therapy for acute lymphoblastic leukemia (Yang et al. 2011). The study suggested that there were specific admixed regions possibly harboring loci that could explain differences in outcome by ancestry, an observation specifically related to Hispanic children. Using the GWAS as a rough biomarker to examine ancestry overall, it was possible to see a significant difference in outcome (Chanock 2011). Larger studies are needed to pinpoint the specific regions that harbor candidate genes for study and possible targeting, but in the short term, it might be feasible to use the GWAS as a biomarker after confirmatory studies. This application of GWAS should be carried out sensibly, with care for social and ethical considerations, as an interim strategy until the specific regions are conclusively mapped.

Conclusion

Cancer GWAS have been successful in discovering new regions but have not yet delivered tools to clinical practice. Advances in identifying new regions have come from the combined efforts of epidemiologists, geneticists and statisticians, who have discovered biologically compelling regions that now require careful investigation to understand the basis for the direct associations. The availability of a comprehensive map of genetic variation with greater granularity, namely the capture of less common variants, will enable the mapping of many more cancer susceptibility alleles and, hopefully, cancer outcome alleles. Since the tools to survey environmental and lifestyle exposures are not as advanced, it is likely that genetic discovery will continue to outpace the study of gene–environment interactions for the near future.

The complex genomic architecture of disease susceptibility will require further discovery of uncommon and rare variants; next generation sequencing technologies will play an important role in this discovery. The capability of next generation sequencers has already outpaced analytical capacity, shifting the short-term direction back towards sequence analysis in families or special populations, defined on the basis of demographic history or discrete exposures. The analytical challenge of prioritizing so many possible variants will require new methods for analyzing sets of highly correlated variants. Furthermore, the underlying population genetic issues become more complex with lower MAFs, which has important implications for the design and analysis of association studies. There is an opportunity to use denser arrays for scanning SNPs at lower MAFs ranging between 1 and 10%, but such scans will require substantially larger sample sizes to capture the comprehensive set of variants in each distinct space defined by MAF and effect size (Fig. 4).
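
The dependence of sample size on MAF and effect size can be illustrated with a rough calculation. The Python sketch below uses a textbook normal approximation for comparing allele frequencies between equal numbers of cases and controls; it is not taken from the studies cited here. The function names, the multiplicative (per-allele) odds ratio model, the default significance threshold of 5e-8, and the default power of 80% are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def case_allele_freq(maf, odds_ratio):
    """Expected risk-allele frequency in cases, given the control
    frequency (maf) and an assumed per-allele odds ratio."""
    return odds_ratio * maf / (1.0 + maf * (odds_ratio - 1.0))


def cases_needed(maf, odds_ratio, alpha=5e-8, power=0.8):
    """Approximate number of cases (with an equal number of controls)
    for a two-sided allelic test; alpha and power are assumed defaults."""
    p0 = maf
    p1 = case_allele_freq(maf, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_beta * math.sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))
    ) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    # Each individual contributes two alleles.
    return math.ceil(alleles_per_group / 2.0)
```

Under these assumptions, calling `cases_needed(0.01, 1.2)` returns a markedly larger number than `cases_needed(0.10, 1.2)`, reflecting the point that scans at lower MAFs or smaller effect sizes require substantially larger studies.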

Fig. 4 The underlying genomic architecture of breast cancer susceptibility. Breast cancer risk alleles and estimated effect sizes are portrayed in this graph. Rare, highly penetrant breast cancer risk loci and risk loci of moderate to low effect sizes identified by multiple GWA studies are depicted. Additional loci with effect size < 1.2 are expected to be discovered by meta-analyses with larger sets of scanned subjects (Park et al. 2010). The blue box outlines the probable location of less common and rare alleles with moderate effect sizes to be discovered by next generation sequencing technologies (exome and whole genome).

One of the next major challenges is the integration of germline susceptibility alleles with environmental exposures to explore the underlying mechanisms of driver somatic mutations. As the large-scale cancer genome sequencing programs progress using paired germline and tumor tissue (Hudson et al. 2010; Cancer Genome Atlas Research Network 2008), it will be possible to pursue this avenue of investigation. At the same time, we must recognize that it will be well into the future before enough genomes are sequenced and datasets are analyzed to move beyond the already known pathways.

It will be critical to assess the applicability of genetic tests in specific clinical settings, such as when to perform screening tests with calculable risks (e.g., biopsies or chemoprevention) before incorporating SNPs into clinical practice. To integrate the fruits of current genomic observations, further studies will need to be designed to validate the utility of known genetic variants in risk assessment for cancer as well as its outcomes.