
Genetic architecture of colorectal cancer
  1. Ulrike Peters1,2,
  2. Stephanie Bien1,
  3. Niha Zubair1
  1. 1Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  2. 2Department of Epidemiology, University of Washington School of Public Health, Seattle, Washington, USA
  1. Correspondence to Dr U Peters, Dr S Rosse Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, M4-B402, PO Box 19024, Seattle, WA 98109, USA; upeters{at} srosse{at}


Colorectal cancer (CRC) is a complex disease that develops as a consequence of both genetic and environmental risk factors. A small proportion (3–5%) of cases arise from hereditary syndromes predisposing to early onset CRC as a result of mutations in over a dozen well defined genes. In contrast, CRC is predominantly a late onset ‘sporadic’ disease, developing in individuals with no obvious hereditary syndrome. In recent years, genome wide association studies have discovered that over 40 genetic regions are associated with weak effects on sporadic CRC, and it has been estimated that increasingly large genome wide scans will identify many additional novel genetic regions. Subsequent experimental validations have identified the causally related variant(s) in a limited number of these genetic regions. Further biological insight could be obtained through ethnically diverse study populations, larger genetic sequencing studies and development of higher throughput functional experiments. Along with inherited variation, integration of the tumour genome may shed light on the carcinogenic processes in CRC. In addition to summarising the genetic architecture of CRC, this review discusses genetic factors that modify environmental predictors of CRC, as well as examples of how genetic insight has improved clinical surveillance, prevention and treatment strategies. In summary, substantial progress has been made in uncovering the genetic architecture of CRC, and continued research efforts are expected to identify additional genetic risk factors that further our biological understanding of this disease. Subsequently these new insights will lead to improved treatment and prevention of colorectal cancer.




It is estimated that in 2015 there will be 777 987 new cases and 352 589 deaths from colorectal cancer (CRC) in developed countries.1 The average lifetime risk in these populations varies across countries, ranging from 4.3% to 5.3% for men and from 2.7% to 4.9% for women.2 ,3

Inherited susceptibility is a major component of CRC predisposition, with an estimated 12–35% of risk attributed to genetic factors.4 ,5 Over the past two decades, substantial progress has been made towards uncovering the genetic architecture of CRC, and yet there remains great opportunity for discovering additional variants. The genetic risk factors established thus far span two extremes, from (1) rare high penetrance mutations, each conferring marked elevations in risk for hereditary syndromes, to (2) common variants, also called polymorphisms, conferring weak effects on ‘sporadic’ risk in individuals with or without family history of CRC (figure 1). Revealing genetic factors underlying high penetrance syndromes has led to more effective disease management for patients and their families. Further discovery of risk loci with weak effects could similarly improve clinical surveillance and prevention strategies. Although each common variant is associated with only a weak effect, collectively these variants enable a more accurate prediction of an individual's risk given that the number of risk alleles carried by an individual varies substantially in a population. Moreover, common variants may modify the risk of CRC in individuals with hereditary syndromes.6 In addition to personalised risk prediction, a deeper understanding of genetic aetiology often implicates novel carcinogenic pathways and in turn new potential targets for therapeutics.
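As a minimal sketch of how many weak effects can be combined into a single individual-level score, the Python below counts risk alleles and sums log odds ratios across loci. The locus names and OR values are hypothetical placeholders, and the calculation assumes independent loci acting multiplicatively on the odds scale (a log-additive model); it is an illustration, not a validated risk model.

```python
import math

# Hypothetical per-allele odds ratios for three illustrative loci; the names
# and values are placeholders, not estimates from any study.
PER_ALLELE_OR = {"locus_A": 1.10, "locus_B": 1.05, "locus_C": 1.20}


def risk_allele_count(genotypes):
    """Unweighted score: total number of risk alleles (0, 1 or 2 per locus)."""
    return sum(genotypes.values())


def weighted_log_odds_score(genotypes, per_allele_or=PER_ALLELE_OR):
    """Weighted score: sum over loci of risk-allele count times log(OR)."""
    return sum(count * math.log(per_allele_or[locus])
               for locus, count in genotypes.items())


def relative_odds(genotypes_a, genotypes_b, per_allele_or=PER_ALLELE_OR):
    """Odds of person A relative to person B, assuming independent,
    log-additive effects across loci."""
    diff = (weighted_log_odds_score(genotypes_a, per_allele_or)
            - weighted_log_odds_score(genotypes_b, per_allele_or))
    return math.exp(diff)


# Two hypothetical individuals at opposite ends of the score distribution.
low = {"locus_A": 0, "locus_B": 0, "locus_C": 0}
high = {"locus_A": 2, "locus_B": 2, "locus_C": 2}
print(risk_allele_count(high), round(relative_odds(high, low), 2))
```

Even with three weak loci the two extremes differ noticeably in odds, which is why the spread in risk-allele counts across a population matters for prediction.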

Figure 1

Genetic architecture of known colorectal cancer genetic susceptibility loci. Allele frequency shown for the risk allele frequency of the ethnicity in which the locus was discovered, except for variants with a recessive effect (MUTYH), for which the frequency of the homozygote rare allele is shown. Online supplementary table S1 provides details on each genetic variant presented in figure 1. BMP, bone morphogenic protein; CRC, colorectal cancer; GWAS, genome wide association studies; MAPK, mitogen activated protein kinases; TGFβ, transforming growth factor β.

This review begins with a summary of what is known about the genetic architecture of both rare hereditary syndromes and more common sporadic development of CRC. In addition, new approaches to investigate rarer variation, as well as studies that integrate the tumour genome, will be reviewed. Given that risk prediction and biological insight are improved by the identification of functional variants within associated regions, this review will also describe the current state of laboratory follow-up studies. Next, several noteworthy gene–environment interactions with suggestive influences on CRC susceptibility are summarised. Lastly, the importance of translating genetic findings into clinical and public health practices is discussed.

Genetic mutations predisposing to hereditary syndromes

Hereditary syndromes resulting from high penetrance germline mutations account for approximately 3–5% of all CRC.7 Although rare, the mutations underlying these conditions were readily detected through relatively small linkage studies. To date, 14 genes underlying CRC syndromes have been identified (table 1), beginning with the discovery of mutations in the adenomatous polyposis coli (APC) gene predisposing to familial adenomatous polyposis (FAP).8 Later, the human homologues of the DNA mismatch repair (MMR) genes (MLH1, MSH2, MSH6, PMS2) were implicated in a non-polyposis familial condition, now referred to as Lynch syndrome.9 Subsequently, mutations in the genes STK11,10 BMPR1A,11 SMAD4,11 PTEN12 and MutY homologue (MUTYH)13 have been identified as additional genetic causes of polyposis syndromes. These genes highlight several important molecular pathways, many of which are now thought to play a larger role in CRC pathogenesis, supported by results from genome wide association studies (GWAS) (figure 1, table 3). Previous genetic reviews of these syndromes7 ,14 ,15 have discussed clinical management and screening strategies in greater detail. Here we describe the genetic aetiologies and implicated pathways of hereditary syndromes.

Table 1

Genes with predisposing mutations to inherited colorectal cancer syndromes

APC and β-catenin

In 1987, the APC tumour suppressor gene was found to harbour causative germline mutations for the most severe polyposis syndrome, FAP.8 Development of FAP requires the inheritance of a single mutated copy of APC and is characterised by the onset of hundreds to thousands of small adenomatous polyps throughout the entire length of the colon after the first decade of life. If left untreated, the risk of CRC by age 40 years is nearly 100%. Decreased APC function is now understood to play an important role in colorectal tumorigenesis via aberrant regulation of intracellular levels of β-catenin (encoded by CTNNB1) within the wingless signal transduction (Wnt) pathway.14 ,16 This pathway controls cell division, adhesion and migration, making it particularly important for cells with rapid turnover, such as intestinal epithelial cells, which renew every 4–5 days.17 ,18

Classic FAP results from deleterious APC mutations that are typically positioned in or near functional domains that bind β-catenin. However, recent studies suggest that decreased transcription of APC through promoter mutations may also result in FAP predisposition.19 Alternatively, mutations in the 5′ and 3′ ends of APC can result in a less profuse polyposis syndrome termed attenuated FAP (AFAP).20 ,21 In comparison with the classic syndrome, AFAP has a delayed age of onset (mean 56 years) for CRC with reduced average lifetime risk of approximately 70%.15 ,22 ,23

MutY homologue

Unlike other hereditary syndromes, the inheritance of mutations in both copies of the MUTYH (alias MYH) gene results in a recessive form of adenomatous polyposis, referred to as MUTYH associated polyposis (MAP).22 The MUTYH gene is involved in base excision repair of oxidative DNA damage. In some cases, MAP is phenotypically indistinguishable from FAP or AFAP, with an average lifetime CRC risk of about 80%.7 The MAP carcinogenic pathway often involves a high frequency of somatically acquired APC mutations.24

Mismatch repair genes

Over the past two decades, defective MMR genes have been discovered to play a role in subsets of both hereditary and sporadic CRC.25 Specifically, Lynch syndrome (also known as hereditary non-polyposis CRC) is the most common of the hereditary syndromes, accounting for 2–3% of all CRC cases.26 Lynch syndrome is characterised by early onset CRC (mean 45 years), and an average lifetime risk of 66% for men and approximately 43% for women.27 Lynch syndrome results from germline mutations in genes involved with DNA MMR (MLH1, MSH2, MSH6, PMS2 and EPCAM).15 ,28 Loss of MMR activity leads to defective repair of single base mismatches and insertion/deletions during DNA replication. The subsequent errors in replication result in the accumulation of changes in repetitive short nucleotide sequences referred to as microsatellites. Accordingly, high microsatellite instability is a hallmark and critical component for the diagnosis of Lynch syndrome, although it should also be noted that approximately 12% of sporadic CRC are characterised by high microsatellite instability.29

DNA polymerase genes

DNA replication during cell division is inherently susceptible to errors that can be transmitted to daughter cells and incorporated as permanent mutations, which in turn can predispose to cancer. However, there are several conserved mechanisms to safeguard against replication error and subsequent somatic mutations. The role of such mechanisms in CRC risk has already been discussed in this review, for example in relation to mutations in base excision repair (MAP) and in MMR genes (Lynch syndrome). The recent discovery of germline mutations in the proofreading domains of two DNA polymerases (POLE and POLD1) now implicates a new highly penetrant hereditary syndrome referred to as ‘polymerase proofreading associated polyposis’, leading to a large number of somatically acquired mutations (hypermutant tumours).30–33 These discoveries reinforce the importance of mechanisms related to correct DNA replication, and show that reduced fidelity can result in a mutator phenotype increasing cancer susceptibility.

STK11 and PTEN

The hamartomatous polyposis syndromes (Peutz–Jeghers syndrome (PJS), Cowden syndrome (CS), a subtype of CS (Bannayan–Riley–Ruvalcaba (BRR) syndrome) and hereditary mixed polyposis syndrome (HMPS)) are very rare, with frequencies of approximately 1 in 50 000–200 000 births for PJS and 1 in 200 000–250 000 births for CS, while the prevalence of BRR and HMPS is unknown.34

Serine/threonine protein kinase 11 (STK11) and phosphatase and tensin homologue deleted on chromosome 10 (PTEN) have been identified as genes underlying PJS, CS and BRR syndrome. STK11 (alias LKB1) encodes a tumour suppressor that is inactivated in PJS and is related to cellular energy homeostasis and regulation of the mammalian target of rapamycin (mTOR), a pathway that is central to metabolic signalling. In addition, STK11 governs whole body insulin sensitivity.35 Similarly, PTEN is linked to CS, as well as BRR syndrome, and regulates metabolic signalling via the phosphatidylinositol 3-kinase (PI3K) and the v-akt murine thymoma viral oncogene homologue 1 (Akt1) pathway.36 Moreover, the PI3K–Akt pathway is an important regulator of cell proliferation and is thought to mediate the effects of mTOR. As such, at least part of the activities of STK11 and PTEN are expected to converge through the mTOR pathway.

TGFβ superfamily

Other germline mutations linked to hereditary CRC syndromes, such as juvenile polyposis syndrome (JPS) and HMPS, implicate genes that enhance growth and invasiveness, such as those belonging to the transforming growth factor (TGF) β superfamily and the bone morphogenic protein (BMP) subfamilies (eg, SMAD4,37 ,38 BMPR1A,39 ,40 BMP4,41 GREM1,42 ENG43 ,44). The BMPs play a critical role in orchestrating proper tissue architecture through their interaction with specific surface receptors, referred to as BMP receptors (BMPRs), which mobilise the SMAD family proteins.45 Development of JPS is linked to mutations in one of two known genes in the TGFβ/BMP pathway. Specifically, 20% of JPS cases are linked to mutations in SMAD4, while a similar proportion of cases are attributable to BMPR1A mutations. To date, more than 40 mutations (both single nucleotide and larger structural) in these genes have been linked to CRC.46

Novel approaches to further discovery of higher penetrant mutations

Given their high penetrance and apparent clustering in families, the aforementioned mutations could be discovered through relatively small linkage studies. While many genes underlying these syndromes have been uncovered, it is expected that additional higher penetrant genes exist but are more difficult to detect. For example, familial colorectal cancer type X (FCCTX) is a condition that meets the clinical criteria for Lynch syndrome, with the caveat that FCCTX does not show mutations in any of the MMR genes. Although FCCTX is currently classified as a single condition, it is possible that FCCTX represents more than one underlying disorder resulting from various unknown genetic causes.20 ,47 As such, additional research on familial CRC is likely to uncover additional rarer variants with moderate to strong effects on risk. However, until recent advances in sequencing technology, the investigation of such variation was prohibitively expensive.

It is expected that whole exome or whole genome sequencing of families at high risk with undetermined germline mutations will uncover higher penetrance mutations within pathways or mechanisms not previously implicated in CRC. For instance, the discovery of POLE and POLD1 mutations resulted from such analyses.45–48 In addition to discovering new genes, it can be expected that sequencing studies will discover additional mutations in known CRC genes, given that improved sequencing technologies capture entire genes more comprehensively and enable the study of more complex structural variation, such as copy number variations, translocations and inversions.49

Whole exome and whole genome studies may also identify an increasing number of variants with unknown significance in hereditary CRC genes.48–50 Despite residing within genes known to harbour very rare and highly penetrant mutations, these uncharacterised variants with unknown significance can range from benign to pathogenic and thus pose a challenging problem, particularly for clinical testing. For instance, unlike the well characterised mutations linked to FAP and AFAP, more common variants in APC have also been observed near domains known to have functional importance. An example of this is the APC-I1307K variant that is very rare in most populations but has a frequency of approximately 6% in Ashkenazim and a modest risk (OR of 2.17) for CRC (figure 1).51 However, such effects (OR values near 2) can only be found with confidence if tested in sizable studies (several thousand participants). As the penetrance of mutations in syndromic genes varies by position within a gene and between genes, understanding the increasing discovery of variation within these genes through exome sequencing is a growing challenge.
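The dependence of a modest OR on study size can be illustrated with a short calculation. The sketch below computes an odds ratio and an approximate 95% CI from a 2×2 carrier-by-disease table using the standard log (Woolf) method; all counts are hypothetical and were chosen only so that both studies share the same carrier proportions.

```python
import math


def odds_ratio_with_ci(carrier_cases, noncarrier_cases,
                       carrier_controls, noncarrier_controls, z=1.96):
    """Odds ratio and approximate 95% CI (log/Woolf method) from a 2x2 table."""
    odds_ratio = ((carrier_cases * noncarrier_controls)
                  / (noncarrier_cases * carrier_controls))
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / carrier_cases + 1 / noncarrier_cases
                          + 1 / carrier_controls + 1 / noncarrier_controls)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: identical carrier proportions at two study sizes.
small = odds_ratio_with_ci(12, 88, 6, 94)          # 100 cases, 100 controls
large = odds_ratio_with_ci(360, 2640, 180, 2820)   # 3000 cases, 3000 controls

for label, (or_, lo, hi) in (("small", small), ("large", large)):
    print(f"{label}: OR={or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

With these illustrative counts the point estimate is the same in both studies, but only the larger study yields an interval that excludes 1.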

To help catalogue and better classify these variants, the International Society for GI Hereditary Tumours (InSiGHT) is curating a comprehensive database of observed variants in known syndromic genes for CRC. Such large scale efforts in both assembling locus specific databases and in expert review of variants will be important to establish consistent management of individuals suspected to have Lynch syndrome,50 as well as other hereditary syndromes.

Discoveries of common low penetrance variants from GWAS of sporadic CRC

Family based linkage studies can successfully identify high penetrance mutations; however, in a key work, Risch and Merikangas52 argued that for many complex diseases, such as CRC, linkage studies have limited power to detect more common variants with weaker effects. However, in combination, low penetrance mutations may contribute substantially to overall disease heritability given their higher prevalence in the population. As technologies were not available at the time to conduct comprehensive genome wide scans, initial successes were limited to candidate gene approaches.53–55 However, as genotyping technology rapidly evolved to allow simultaneous testing of more than a million single nucleotide variants, a growing number of lower penetrance variants associated with sporadic CRC have been identified.56–69

To date, GWAS have identified over 40 independent loci providing deeper insight into the underlying biology of CRC (table 2). The risk allele frequency of these variants ranges from 0.04 to 0.9, and the genetic effect (OR per risk allele) ranges from approximately 1.04 to 1.56 (table 2). Importantly, most common susceptibility loci are positioned outside of coding regions many kilobases (kb) away from the nearest candidate gene. Although associated loci are often positioned in intergenic regions, their proximity to candidate target genes has implicated many known CRC related genes and pathways. For example, GWAS have identified common variation in putative regulatory elements that impact the expression of genes within the TGFβ/BMP pathways (eg, BMP2, BMP4, SMAD7, CCND2, GREM1)82 and genes in the mitogen activated protein kinases pathway (eg, DUSP10, MYO1B, MYC, CCND2, SH2B).

Table 2

Discovered loci with weak effects on colorectal cancer susceptibility

In addition to genes within the TGFβ/BMP pathway, other GWAS findings have similarly demonstrated that genes linked to hereditary syndromes may also harbour common risk variants with weaker effects (table 3). For instance, a truncating mutation in CDH1 was previously linked to early onset colorectal and gastric cancers83 and more common independent variants in the same gene show weaker associations with sporadic CRC.62 Like APC and β-catenin (CTNNB1), the gene product of CDH1 is a component of adherens junctions and is involved with Wnt signalling, suggesting that aberrant regulation of CDH1 expression could underlie the observed CRC association at the 16q22.1 locus near CDH1 (rs9929218).62 ,84 Similarly, GWAS have implicated POLD374 in CRC development. In light of the recent discovery of higher penetrance mutations in POLE and POLD1 described earlier, these findings further suggest that DNA polymerase genes, and more generally both high and low penetrance variants within the same biological pathways, may play a role in CRC.

Table 3

Biological mechanisms marked by common risk loci

Although many GWAS identified susceptibility loci are positioned in or near genes involved in established CRC related pathways, many GWAS identified regions do not harbour known candidate genes. This supports the utility of using agnostic genome wide approaches, such as GWAS, to gain novel insight into the genetics of complex diseases. For instance, CDKN1A, EIF3H, TPD52L3, ITIH2, LAMA5 and LAMC1 represent genes in pathways not previously linked to CRC. CDKN1A encodes p21 which impacts multiple tumour suppressor pathways and represses MYC dependent transcription.74 TPD52L3 belongs to the family of tumour protein D52 genes, which have been implicated in cell proliferation and apoptosis and also serve as potential cancer biomarkers.85 ITIH genes have been found to be downregulated in multiple human solid tumours, including colon, breast and lung. As such, ITIH genes may represent a family of understudied putative tumour suppressor genes.86 LAMA5 and LAMC1 belong to the laminin gene family, which is involved in the maintenance of cell adhesion, migration and signalling, suggesting that laminin genes may play an important role in the development of CRC.87–89

While we attempt to describe the most likely candidate gene linked to each of the newly identified genetic loci, it is important to note that for most loci, functional evaluations, as described below, are still missing and hence it is possible that for some loci the candidate genes may change as more functional evidence becomes available.

What comes after discovery?

Fine mapping and functional follow-up

While results for high penetrance mutations usually point directly to the underlying causal variants, low penetrance variants are typically correlated with (ie, tag) the region that contains the underlying causal variant(s). For example, in figure 2, any of the correlated single nucleotide polymorphisms (SNPs) with lower p values could be the underlying causal variant in this GWAS locus (note the most significant variant is not necessarily the causal variant). To identify the causal variant(s), fine mapping and laboratory (ie, functional) follow-up studies are needed. Fine mapping studies attempt to genotype or impute all genetic variants in a GWAS locus to test the association with CRC risk, and provide a comprehensive list of all potential candidate SNPs that could be the underlying causal variant.90 Fine mapping studies can further investigate the existence of multiple independent causal variants in the region by simultaneously including multiple variants in a single model (ie, conditional analysis). This approach has successfully identified secondary independent CRC related SNPs in five regions (BMP4-14q22.2, BMP2-20p12.3, CCND2-12p13.32, SMAD7-18q21 and TCF7L2-10q25.2) (table 2).64 As has been shown for many other traits,91–97 fine mapping is particularly powerful if conducted in multiple ethnicities with different haplotype structures to refine the number of possible causal variants. This is particularly true for participants of African descent, given shorter haplotype blocks; unfortunately, there are currently limited numbers of available CRC studies in African descent populations.90
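The notion of one SNP tagging another can be made concrete with the pairwise correlation r2. The sketch below computes the squared Pearson correlation between two SNPs coded as 0/1/2 risk-allele counts; this genotype-based value is a commonly used approximation to haplotype-based r2, and the genotype vectors are hypothetical.

```python
def ld_r_squared(genotypes_x, genotypes_y):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele
    counts. Returns None if either SNP shows no variation in the sample."""
    n = len(genotypes_x)
    if n == 0 or n != len(genotypes_y):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean_x = sum(genotypes_x) / n
    mean_y = sum(genotypes_y) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotypes_x, genotypes_y))
    var_x = sum((x - mean_x) ** 2 for x in genotypes_x)
    var_y = sum((y - mean_y) ** 2 for y in genotypes_y)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a monomorphic SNP
    return cov * cov / (var_x * var_y)


# Hypothetical genotypes for eight individuals at three SNPs.
index_snp = [0, 1, 2, 1, 0, 2, 1, 0]
perfect_tag = [0, 1, 2, 1, 0, 2, 1, 0]   # identical pattern to the index SNP
weak_tag = [1, 0, 2, 0, 1, 1, 2, 0]      # only loosely related pattern

print(ld_r_squared(index_snp, perfect_tag))
print(ld_r_squared(index_snp, weak_tag))
```

A SNP with r2 close to 1 against the index SNP will show a nearly identical association signal, which is why association statistics alone cannot separate the causal variant from its tags.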

Figure 2

Fine mapping of findings of genome wide association studies. Association results (p values) and correlation structure for all single nucleotide polymorphisms (SNPs) in the 8q24 risk locus. Physical position is along the x axis, and the −log10 of the SNP colorectal cancer (CRC) association p value is on the y axis. Each dot on the plot represents the p value of the association for one SNP with a risk of CRC. The most significant SNP (rs6983267) is marked as a purple diamond. The colour scheme represents the pairwise correlation (r2) for the SNPs across the 8q24 region with the most significant SNP (rs6983267) based on the European descent participants from the 1000 Genomes Project data. Grey indicates that correlation was missing for this p value because the variant had no r2 estimation due to low minor allele frequency or because the SNP is not in older versions of the 1000 Genomes data. The bottom half of the figure shows the position of the genes across the region.

Once the list of potential causal variants has been refined, in silico functional follow-up can help to prioritise the most likely candidates for further evaluation in the laboratory to detect allele specific effects and demonstrate likely target gene(s).98 As stated earlier, most high penetrance mutations associated with hereditary syndromes disrupt the coding of a protein, resulting in qualitative differences in protein structure that can be more readily predicted based on our knowledge of the genetic code and biochemical properties of the encoded amino acids. However, most low penetrance variants are located in non-coding regions and are predicted to confer weak effects on the expression of an often unknown target gene. Our limited understanding of transcriptional regulation and the consequences of polymorphisms in putative functional elements make interpretation of non-coding loci far more challenging. To address this, Encyclopaedia of non-coding DNA elements (ENCODE) and Roadmap epigenomics projects have created comprehensive catalogues of histone modifications and chromatin structure across many cell types and tissues, allowing researchers to prioritise variants that are most likely to disrupt regulatory elements. In addition, these resources allow researchers to form testable hypotheses about allelic effects on gene expression or chromatin structure for further laboratory follow-up.

When the first GWAS identified CRC related variant (rs6983267) was discovered in 2007, little was understood about its function. The locus was positioned in a gene desert, suggesting that if the association was real, the effects must be exerted through some regulatory mechanism. However, the closest putative target gene, MYC, was positioned more than 300 kb away, and rs6983267 was not associated with differences in MYC expression.76 It took an additional 3 years for functional studies to demonstrate that the variant was located in a transcriptional enhancer that differentially binds TCF7L2 (also known as TCF4) and physically interacts with the MYC promoter (table 4).99–102 Additional animal studies showed that knocking out the enhancer did not impact normal function. However, when crossed with APCmin mice, which have a mutated APC gene leading to multiple intestinal neoplasia, those mice with knocked out MYC enhancer had significantly reduced numbers of spontaneously developing colorectal tumours.107 Accordingly, this extensive functional work was able to identify the causal variant and describe the mechanism by which it exerts long range regulation of a gene. Notably, gene expression analysis was unable to reveal MYC as the target, demonstrating that even in the rare circumstance that the index association (or tagging SNP) is the causal variant, extensive functional follow-up is likely necessary to reveal the underlying biology driving CRC associations in GWAS.

Table 4

Functional evidence for variants in common risk loci

For many years researchers have focused on MYC as a candidate drug target,108 but direct inhibition was difficult.109 As such, identification of the regulatory element through the GWAS variant rs6983267, positioned several hundred kilobases upstream of MYC and influencing MYC expression, opens new avenues for drug development.107 Given that most GWAS loci are not located in coding regions and may or may not impact the closest genes,110 functional work is critical to fully understand the importance of GWAS loci and reveal potential drug targets. However, functional work requires very different expertise and tools than discovery of susceptibility loci, necessitating interdisciplinary collaborations, such as those funded by the GAME-ON Initiative.111 ,112 Furthermore, it is critical to develop novel functional assays with higher throughput113 that simultaneously evaluate the growing number of GWAS loci, which can bridge the expanding gap between the numerous susceptibility loci and the limited number with laboratory validated function (table 4).

Future direction in discovery

Identifying new risk loci through whole genome and whole exome sequencing

Common CRC susceptibility loci are expected to explain approximately 7–8% of the heritability of CRC114 but the heritability explained currently by GWAS identified common susceptibility loci is only about 1–4%.56–64 ,66 ,70 ,73 ,74 ,114 ,115 This gap suggests that many common variants remain to be discovered. Notably, a third of common CRC susceptibility loci were only discovered in the last year, highlighting that, through larger study populations as well as improved technologies and analytic approaches, additional GWAS discoveries will likely be made. This is consistent with other common cancers and complex diseases, such as breast116 and prostate cancer,117 in which increasingly larger meta-analyses are discovering many novel genetic loci. Ongoing large scale consortia, for example, funded through the GAME-On Initiative of the National Cancer Institute and other non-US agencies, are expected to further facilitate these discovery efforts.118

While GWAS discoveries are ongoing, next generation sequencing as well as denser genotyping arrays are increasingly being used to investigate less frequent (minor allele frequency (MAF)=1–5%) and rare (MAF <1%) variants. Importantly, these variants contribute to the vast majority of the genetic variation in the genome (figure 3) and, hence, likely account for part of the missing heritability of CRC. Progress in discovering these variants will depend on their effect size; initial data suggest that at least some of the less frequent and rare variants have stronger effects (OR >1.5).121–124 This is consistent with the discovery of several independent signals in GWAS125–127 and high penetrance regions128 with MAF <5% and ORs between 1.5 and 4.3. However, it is important to note that these initial findings likely overestimate the effect of less frequent variants as they represent the most easily detectable of these variants. Subsequent discoveries of less frequent and rare variants with weaker effects will require rigorous study designs at a much larger scale.

Figure 3

Most genetic variants are rare—distribution of genetic variants by minor allele frequency. Sources: Gorlov et al119 and edu/drupal.120

Current sequencing studies focus on either the 1–2% of the genome that encodes proteins (the ‘exome’) or on sequencing the ‘whole genome’. Exome sequencing studies are successful in identifying high penetrance mutations; however, it remains unclear what fraction of lower penetrance variants will fall within the exome. Sequencing studies, when conducted at sufficient depth, allow for the investigation of more complex genetic variants, such as insertions or deletions (indels) or copy number variations (also called structural variation).129–135 Although the absolute number of these complex genetic variants is substantially lower than the number of single nucleotide variants, the fraction of the genome affected by copy number variations is substantially larger.136–143 The global assessment of complex variation has remained mostly elusive133 ,144–147 and it is currently unknown to what extent these types of variations contribute to the heritability of CRC.

Given the tens of millions of rare variants currently discovered through whole genome sequencing studies, even large sequencing studies will have limited statistical power to detect associations with these variants. For instance, to detect a low frequency (1% allele frequency) CRC risk variant with modest effect (OR=1.5) among approximately five million tested variants would require 21 800 CRC cases and 21 800 controls, as this variant would need to reach a p value of 1×10⁻⁸ (α threshold=0.05/5 000 000) to adequately account for the multiple comparisons. However, novel statistical methods that incorporate the growing body of functional data, such as ENCODE,148 RoadMap,149 Genotype-Tissue Expression (GTEx) project150 and The Cancer Genome Atlas (TCGA),151 are expected to improve prediction of variants likely to have functional importance, which will in turn enable more hypothesis driven discovery of novel CRC susceptibility loci. Rigorous bioinformatics that combine data from various laboratory based assays (eg, RNA-seq, ChIP-seq, DNase-seq, chromosome capture and motif enrichment analysis) have improved substantially the resolution of predicted functional elements. These efforts have also enabled better prediction of long range interactions between regulatory regions and target genes. As such, functional information is beginning to reach sufficient maturity to help inform association testing of rare variants uncovered through sequencing efforts.
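The multiple testing threshold and the scale of the required sample can be approximated with a short calculation. The sketch below applies a Bonferroni correction and a textbook normal-approximation sample size formula for comparing allele frequencies between cases and controls. The 80% power target, the equal numbers of cases and controls, and the two-sided allelic test are assumptions made here for illustration; the text does not state how its figure was derived.

```python
import math
from statistics import NormalDist


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def cases_needed(control_freq, odds_ratio, p_threshold, power=0.80):
    """Approximate number of cases (with an equal number of controls) for a
    two-sided allelic test, using a normal-approximation formula for the
    difference between two proportions. Power of 80% is an assumption."""
    # Risk-allele frequency in cases implied by the per-allele odds ratio.
    case_freq = (odds_ratio * control_freq
                 / (1 - control_freq + odds_ratio * control_freq))
    z_alpha = -NormalDist().inv_cdf(p_threshold / 2)
    z_beta = NormalDist().inv_cdf(power)
    mean_freq = (control_freq + case_freq) / 2
    null_sd = math.sqrt(2 * mean_freq * (1 - mean_freq))
    alt_sd = math.sqrt(control_freq * (1 - control_freq)
                       + case_freq * (1 - case_freq))
    alleles_per_group = ((z_alpha * null_sd + z_beta * alt_sd)
                         / (case_freq - control_freq)) ** 2
    return math.ceil(alleles_per_group / 2)  # each person carries two alleles


threshold = bonferroni_threshold(0.05, 5_000_000)
n_cases = cases_needed(control_freq=0.01, odds_ratio=1.5, p_threshold=threshold)
print(threshold, n_cases)
```

Under these assumptions the estimate comes out at roughly 20 000 cases and as many controls, the same order of magnitude as the figure quoted above; different power targets or test statistics would shift the exact number.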

Integration of the cancer genome

Cancer is characterised by genetic and epigenetic alterations occurring in the tumour, also referred to as somatic mutations. In the past, there has been limited exchange between cancer genetics research focusing on somatic mutations and germline genetics research focusing on inherited disease related variants. However, high and low penetrance germline variants for CRC are located in genes that are often somatically mutated in CRC tumours and that lie in pathways thought to influence tumour development, such as Wnt, TGFβ and mitogen activated protein kinase signalling.152 ,153 This is not surprising given that germline high penetrance mutations in tumour suppressor genes (eg, APC, BMPR1A, PTEN) only progress to cancer after a somatic mutation results in the dysfunction of the second copy of the gene. Further demonstrating the complexity of CRC, low penetrance genetic variants are distinct because they are predominantly positioned outside of coding regions (yet often close to somatic driver genes) and confer more subtle effects, leading to small increases in CRC risk. Because the link between somatic mutations and germline variants is an inherent feature of cancer development, it is important for future studies to integrate both germline and tumour genetics to obtain a more comprehensive understanding of the carcinogenic processes in CRC.

Gene–environment interactions

CRC has several established environmental risk factors, many of which are modifiable. The most consistently observed positive associations are seen with age, male sex, obesity, height, smoking, alcohol, and red and processed meat; protective associations are seen with physical activity, non-steroidal anti-inflammatory drug use, exogenous hormone use, calcium, vitamin D, folate and, to a lesser extent, fruits, vegetables and fibre.154–159 Extensive methodological and applied research provides a strong rationale for examining gene–environment (GxE) interactions.160–164 GxE interaction analyses have identified an interaction between a known GWAS locus, rs16892766 (8q23.3), and vegetable intake,165 ,166 and genome wide approaches have found statistically significant interactions with several environmental risk factors, such as processed meat and aspirin/non-steroidal anti-inflammatory drug use.168 However, it is important to note that the investigation of GxE interactions is still at an early stage because sufficiently powered studies require well characterised samples with environmental data harmonised across multiple ongoing studies. Statistical power is a particular challenge for GxE analysis given that the discovery of a GxE interaction requires approximately four times the sample size needed for the discovery of marginal effects of genetic variants.169 However, these explorations are of particular interest to the public as they can help identify subpopulations for which modifiable environmental exposures are most influential.170
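To make the notion of a GxE interaction concrete, the sketch below computes a multiplicative interaction as the ratio of the genotype odds ratio in exposed individuals to that in unexposed individuals. This is a simplified illustration using crude 2×2 tables, not the method of the cited studies, which typically rely on regression models with covariate adjustment. All counts and function names are hypothetical.

```python
def odds_ratio(case_carriers: int, case_noncarriers: int,
               control_carriers: int, control_noncarriers: int) -> float:
    """Crude odds ratio for carrying the variant, from a 2x2 table."""
    return (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers)


def interaction_odds_ratio(exposed: tuple, unexposed: tuple) -> float:
    """Ratio of genotype odds ratios across exposure strata.

    A value of 1 means the genetic effect is the same in both strata
    (no interaction on the multiplicative scale).
    """
    return odds_ratio(*exposed) / odds_ratio(*unexposed)


# Hypothetical counts, ordered as:
# (case carriers, case non-carriers, control carriers, control non-carriers)
high_exposure = (150, 200, 100, 200)
low_exposure = (100, 200, 100, 200)

print("OR in high exposure stratum:", odds_ratio(*high_exposure))
print("OR in low exposure stratum:", odds_ratio(*low_exposure))
print("Interaction OR:", interaction_odds_ratio(high_exposure, low_exposure))
```

Because the interaction estimate is built from two stratum specific odds ratios, each estimated from only part of the sample, it is less precise than a marginal genetic effect, which is one intuition for the larger sample sizes noted above.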

Impact of genetic loci on treatment

As the discovery and functional follow-up of the many CRC related genetic loci are ongoing, it is important to consider the potential implications of these findings. Here we present examples across multiple diseases that demonstrate the potential opportunities for genetic data. In Crohn's disease, for instance, GWAS loci implicated previously less appreciated physiologic processes, such as autophagy, innate immunity and interleukin 23R signalling.67 ,171 ,172 These discoveries have already led to chemical screens for candidate therapeutic agents.67 ,173 ,174 For age related macular degeneration, GWAS identified several genes involved in inflammation, a link that was not previously established and has now opened up new treatment approaches and prevention strategies.175 ,176 Identifying the genetic basis of several Mendelian disorders has led to the development of Food and Drug Administration (FDA) approved drugs.177 Furthermore, genomic information can help improve clinical trial design (eg, screening for subtypes and adverse drug reactions).178 ,179 For the HIV antiviral drug, abacavir, genetically guided prescription is now standard of care.180 This is also true for CYP2C19/clopidogrel, CYP2D6/codeine, TPMT/azathioprine and 27 other interactions now endorsed by the American Society of Health System Pharmacists.181 These and other examples have led to new therapies177 ,182–185 and improved medical practices,186 ,187 demonstrating the potential of genetic findings.68 ,188 However, as drug development takes years to establish efficacy and effectiveness in clinical settings,189 it is likely that the full impact of the many recently discovered genetic findings is only beginning to be understood. In some instances, however, identification of underlying genes will not readily translate into improved treatment. 
For example, the genes for cystic fibrosis and sickle cell anaemia were identified more than 20 years ago,67 although recent findings suggest that treatment options remain possible.177 ,190 GWAS findings may also be useful for repositioning approved drugs.191 For instance, GWAS findings revealed that variants in dopamine β-hydroxylase (DBH) influence smoking cessation and thereby opened the possibility for targeted use of nepicastat (a drug targeting DBH and traditionally used to treat post-traumatic stress disorder). Many more examples for drug repositioning are available.191 As these examples include both rare high penetrance and common low penetrance variants, it is clear that neither the frequency nor the effect size will determine which susceptibility variants will lead to new treatment strategies. Accordingly, coordinated efforts to screen the growing number of susceptibility loci for putative drug targets seem promising, particularly if combined with functional studies to identify the underlying functional variant(s).

Impact of genetic loci on CRC prevention using risk prediction modelling

Although common susceptibility variants have limited power in discriminating cancer outcomes and many have yet to be identified,118 ,192–194 studies have begun to explore the potential clinical applications of polygenic risk profiling.195 Such models could potentially identify individuals at higher CRC risk for targeted screening and intervention.193 ,195–197

CRC remains the second leading cause of cancer death despite slight declines in CRC incidence. Paradoxically, it is among the most preventable and treatable of neoplastic diseases when detected early. For instance, in the USA, endoscopic screening, particularly colonoscopy, is the most commonly used strategy198–203; however, it is costly, invasive and carries risk.201 ,204–210 Although there have been improvements in screening uptake, 40–50% of eligible persons do not follow current screening recommendations.204 ,210–214 Current recommendations are primarily based on age (over 50 years) and family history of CRC,215 despite the knowledge that the incidence of CRC varies substantially in the population and most cases occur in those without a positive family history.216 ,217 Therefore, risk prediction models that stratify the population into risk groups according to their risk profile could result in more effective screening. Furthermore, those at higher susceptibility may be more likely to follow recommendations once they have been made aware of their increased risk.218–223 For instance, those with a positive family history of CRC are more likely to undergo endoscopy screening.224

We recently showed that a genetic risk score incorporating the first 27 known CRC GWAS findings in addition to family history improved the discriminatory accuracy.224 Specifically, the AUC improved from 0.51 to 0.59 for men and from 0.52 to 0.56 for women; these results were similar to those in a previous study.225 Although the improvement in AUC is modest, the genetic risk score could be used to develop age specific guidance for screening that is more reflective of the individual's genetic risk. Current recommendations in the USA advise that screening should commence at the age of 50 years in those without a first degree family history of CRC, or at age 40 years in individuals with a family history of CRC. However, the genetic risk score identified a large fraction of men and women without a positive family history of CRC who had a higher risk for CRC (comparable to or higher than that of those with a positive family history of CRC), justifying an earlier screening age than 50 years in a large subset of individuals without a positive family history.217 These results demonstrate that the combination of multiple common susceptibility loci can lead to improved screening recommendations tailored to an individual's risk despite the weak effect of any individual locus.226 In fact, an extensive GWAS of inflammatory bowel disease recently showed that including hundreds of variants with suggestive but not genome wide significant associations substantially improved risk prediction compared with models limited to GWAS findings.227 Accordingly, it can be expected that incorporating a large number of genetic variants from genome wide scans can further improve risk prediction models, particularly if based on large sample sizes.227–231
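The two quantities discussed above, a genetic risk score and the AUC, can be illustrated with a small sketch. It assumes a score formed as the sum of risk allele counts weighted by the log odds ratio of each locus, which is one common construction and may differ from the score used in the cited study. The odds ratios, genotypes and function names are made up for illustration.

```python
from math import log


def genetic_risk_score(allele_counts, odds_ratios):
    """Weighted risk score: sum of risk allele counts (0, 1 or 2 per locus)
    multiplied by the log odds ratio assumed for that locus."""
    return sum(count * log(or_value)
               for count, or_value in zip(allele_counts, odds_ratios))


def auc(case_scores, control_scores):
    """Area under the ROC curve: the probability that a randomly chosen
    case scores higher than a randomly chosen control (ties count half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical per-allele odds ratios for three loci.
locus_odds_ratios = [1.10, 1.15, 1.25]

# Hypothetical genotypes (risk allele counts per locus) for a few people.
case_genotypes = [[2, 1, 1], [1, 1, 2], [0, 1, 0]]
control_genotypes = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

case_scores = [genetic_risk_score(g, locus_odds_ratios) for g in case_genotypes]
control_scores = [genetic_risk_score(g, locus_odds_ratios) for g in control_genotypes]

print("AUC on toy data:", auc(case_scores, control_scores))
```

An AUC of 0.5 corresponds to no discrimination between cases and controls and 1.0 to perfect discrimination, which puts the reported values of 0.56 to 0.59 in context.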

Risk models should provide equitable benefits to the public. Currently, the majority of risk models have been evaluated among those of European descent. Evaluation of models across diverse ethnicities is critical because many GWAS findings are tagging a region, rather than the causal or disease influencing variant, and this correlation structure can vary across ethnicities. Another concern with risk prediction models is the potential for over or under prediction of risk due to failure to validate in an independent study.217 ,232 ,233 These issues are no longer conceptual as genetic risk profiling has now entered the commercial world. For instance, close to half a million individuals have purchased genetic information from 23andMe, which provided customers with vastly incomplete CRC risk estimates based on only four GWAS loci. Although the FDA has now restricted 23andMe commercialisation of genetic risk profiling, clinical practice will increasingly be confronted with ambiguous genetic information.234 Accordingly, in the era of direct to consumer genetic testing, rigorous evaluation of screening models that include new genetic and non-genetic information is of critical importance and expected to reduce the burden of CRC and other common complex diseases.235


In summary, substantial progress has been made in discovering high penetrance mutations and common variants, and we expect that many additional variants will be discovered. Discovery of CRC loci is driven by sample sizes and available technologies. Genetic variants can be discovered in well defined CRC pathways, such as TGFβ or Wnt, but can also point to novel genes and pathways not previously implicated in CRC. It remains critical that we stay on the path to uncovering the complete genetic architecture of CRC to more fully understand the aetiology of this severe disease; these findings, in turn, can lead to improved treatment and prevention.




  • Contributors UP, SR and NZ were involved in manuscript writing.

  • Funding Supported by grants from the National Cancer Institute (R01 CA059045, U01 CA137088, U01 CA164930, R01 CA120582).

  • Competing interests None.

  • Provenance and peer review Commissioned; externally peer reviewed.