Article Text

Original research
Systematic meta-analyses, field synopsis and global assessment of the evidence of genetic association studies in colorectal cancer
1. Zahra Montazeri1,
2. Xue Li2,
3. Christine Nyiraneza1,
4. Xiangyu Ma3,
5. Maria Timofeeva4,
6. Victoria Svinti4,
7. Xiangrui Meng2,
8. Yazhou He2,
9. Yacong Bo5,
10. Samuel Morgan1,
11. Sergi Castellví-Bel6,
12. Clara Ruiz-Ponte7,
13. Ceres Fernández-Rozadilla7,
14. Ángel Carracedo7,
15. Antoni Castells6,
16. Timothy Bishop8,
17. Daniel Buchanan9,10,11,
18. Mark A Jenkins9,
19. Temitope O Keku12,
20. Annika Lindblom13,14,
21. Fränzel J B van Duijnhoven15,
22. Anna Wu16,
23. Susan M Farrington4,
24. Malcolm G Dunlop4,
25. Harry Campbell2,
26. Evropi Theodoratou2,17,
27. Wei Zheng18,
28. Julian Little1
1. 1 School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
2. 2 Centre for Global Health, Usher Institute, The University of Edinburgh, Edinburgh, UK
3. 3 Department of Epidemiology, College of Preventive Medicine, Third Military Medical University, Chongqing, Chongqing, China
4. 4 Colon Cancer Genetics Group, Cancer Research UK Edinburgh Centre and Medical Research Council Human Genetics Unit, Medical Research Council Institute of Genetics & Molecular Medicine, The University of Edinburgh, Edinburgh, UK
5. 5 Jockey Club School of Public Health and Primary Care, The Chinese University of Hong Kong, Shenzhen, Hong Kong
6. 6 Gastroenterology Department, Institut D'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Hospital Clínic de Barcelona, Universitat de Barcelona, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBEREHD), Barcelona, Spain
7. 7 Fundación Pública Galega de Medicina Xenómica, Grupo de Medicina Xenómica, Santiago de Compostela, Spain; Instituto de Investigación Sanitaria de Santiago (IDIS), Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), Santiago de Compostela, Spain
8. 8 Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
9. 9 Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne, Melbourne, Victoria, Australia
10. 10 Colorectal Oncogenomics Group, Genetic Epidemiology Laboratory, Department of Pathology, The University of Melbourne, Parkville, Victoria, Australia
11. 11 Genetic Medicine and Family Cancer Clinic, The Royal Melbourne Hospital, Parkville, Victoria, Australia
12. 12 Center for Gastrointestinal Biology and Disease, University of North Carolina, Chapel Hill, North Carolina, USA
13. 13 Department of Clinical Genetics, Karolinska University Hospital, Stockholm, Sweden
14. 14 Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden
15. 15 Division of Human Nutrition, Wageningen University and Research, Wageningen, The Netherlands
16. 16 University of Southern California, Preventative Medicine, Los Angeles, California, USA
17. 17 Cancer Research UK Edinburgh Centre, Medical Research Council Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, UK
18. 18 Division of Epidemiology, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
1. Correspondence to Dr Evropi Theodoratou, School of Molecular, Genetic and Population Health Sciences, University of Edinburgh, Edinburgh, Lothian, UK; e.theodoratou{at}ed.ac.uk

## Abstract

Objective To provide an understanding of the role of common genetic variations in colorectal cancer (CRC) risk, we report an updated field synopsis and comprehensive assessment of evidence to catalogue all genetic markers for CRC (CRCgene2).

Design We included 869 publications after parallel literature review and extracted data for 1063 polymorphisms in 303 different genes. Meta-analyses were performed for 308 single nucleotide polymorphisms (SNPs) in 158 different genes with at least three independent studies available for analysis. Scottish, Canadian and Spanish data from genome-wide association studies (GWASs) were incorporated for the meta-analyses of 132 SNPs. To assess and classify the credibility of the associations, we applied the Venice criteria and Bayesian False-Discovery Probability (BFDP). Genetic associations classified as ‘positive’ and ‘less-credible positive’ were further validated in three large GWAS consortia conducted in populations of European origin.

Results We initially identified 18 independent variants at 16 loci that were classified as ‘positive’ polymorphisms for their highly credible associations with CRC risk and 59 variants at 49 loci that were classified as ‘less-credible positive’ SNPs; 72.2% of the ‘positive’ SNPs were successfully replicated in three large GWASs and the ones that were not replicated were downgraded to ‘less-credible’ positive (reducing the ‘positive’ variants to 14 at 11 loci). For the remaining 231 variants, which were previously reported, our meta-analyses found no evidence to support their associations with CRC risk.

Conclusion The CRCgene2 database provides an updated list of genetic variants related to CRC risk by using harmonised methods to assess their credibility.

• colorectal cancer
https://creativecommons.org/licenses/by/4.0/

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.

## Statistics from Altmetric.com

### Summary box

#### ​What is already known on this subject?

• Colorectal cancer (CRC) is a global public health challenge. A large number of genetic association studies have been conducted to assess the potential correlation between common genetic variations and CRC risk.

#### ​What are the new findings?

• Using an established framework for grading credibility of genetic associations, we classified 14 independent variants at 12 loci (MUTYH, SMAD7, TERT, CDH1, RHPN2, BMP2, TGFB1 and common variants tagging loci at 8q24, 8q23.3, 10p14, 11q23.1, 20p12.3) as highly credibly associated with CRC risk. A total of 63 variants at 52 loci were classified as ‘less-credible positive’ SNPs; variants of nine of these genes could be mostly highly prioritised for further investigation. For 231 variants previously reported to be associated with CRC, our meta-analyses found no evidence to support such associations.

#### ​How might it impact on clinical practice in the foreseeable future?

• This database will be helpful for future research by promoting the investigation of these variants and corresponding genetic loci in populations other than of European origin, serving as a genetic basis for predicting risk estimates for population groups and providing candidate genes for further functional studies or gene-environment interaction studies.

## Introduction

Colorectal cancer (CRC) is one of the most commonly diagnosed malignancies and the leading cause of cancer deaths in the world, with 1.65 million new cases and about 835 000 deaths in 2015.1 The global burden of CRC is expected to increase by 60%, with more than 2.2 million new cases and almost 1.1 million deaths occurring annually by 2030.2 The distribution of CRC global burden varies widely, with more than two-thirds of all cases and about 60% of all deaths occurring in countries with a high or very high human development index, including Australia and New Zealand, Europe and North America, while incidence and mortality rates in Africa and South-Central Asia are relatively low.1 2 These geographic differences appear to be attributable to the differences in both environmental exposures and the background of genetically determined susceptibility.3

It is estimated that 15%–25% of CRC risk variance is attributed to inherited genetic factors, and the first-degree relatives of CRC patients have two to four times higher risk of developing CRC.4 5 The inherited genetic risk of CRC can be partly accounted for by a combination of rare high-penetrance mutations and large numbers of common genetic variants each conferring small risk.6 Although a number of highly penetrant mutations (eg, DNA mismatch repair genes, APC, SMAD4, LKB1/STK11, MUTYH) have been identified to influence CRC susceptibility with large effects, overall they account for only 2%–5% of incident CRC cases in the general population because these mutations are very rare.7–9 Candidate gene association studies have investigated the role of a large number of common genetic variants in CRC risk, but only a small number of them have been successfully replicated in subsequent studies.10 11

In 2012 and 2014, we reported two independently conducted series of meta-analyses to systematically evaluate associations between CRC and common variants using data from candidate gene studies and genome-wide association studies (GWASs) and identified a number of promising genetic risk variants for CRC risk.10 11 Our first field synopsis (published in 2012) reported 16 variants in 13 independent loci (MUTYH, MTHFR, SMAD7, 8q24, 8q23.3, 11q23.1, 14q22.2, 1q41, 20p12.3, 20q13.33, 3q26.2, 16q22.1, 19q13.1),10 and the second field synopsis (published in 2014) identified 8 additional variants in 5 independent loci (APC, CHEK2, DNMT3B, MLH1, MUTYH) that were strongly associated with CRC.11 These two synopses used slightly different methodologies, in that the 2012 field synopsis only included variants reported in four or more studies in meta-analyses, whereas the 2014 included variants with three or more studies, and there were some differences in the criteria applied for the evidence appraisal.10 11

In this study, we aimed to perform an updated field synopsis for CRC risk by including the most recently published genetic association studies, by following established guidelines12 13 and using harmonised methods for evidence appraisal.14–16 We systematically captured all published genetic association data on CRC for meta-analyses and subsequently incorporated data from GWAS consortia for interrogation. This study provides an up-to-date and publicly available database for CRC genetics (CRCgene2) and presents these data within a defined statistical and causal inference framework to aid interpretation of the results.13 We aimed to provide new insights into the fundamental biological mechanisms involved in colorectal carcinogenesis.

## Methods

### Literature search and data collection

To identify genetic association studies of CRC risk, we searched the Medline database via the Ovid gateway and the search terms comprising medical subject headings (MeSH) and keywords relating to colorectal neoplasms, the MeSH heading ‘genetic predisposition to disease’, and the keywords ‘gene$’ and ‘associate$’ were applied to terms in the entire article. The latest literature search was performed on 21 November 2018. We screened the eligibility of retrieved publications in a three-step parallel review of title, abstract and full text by following the predefined inclusion and exclusion criteria. Thus, each eligible study evaluated the association between a polymorphic genetic variant (with minor allele frequency ≥0.01 in the reference panel of the 1000 genomes) and a sporadic CRC. Studies investigating only premalignant conditions such as adenomas, polyps or dysplastic tissue were excluded. Studies investigating hereditary CRC syndromes, such as familial adenomatous polyposis, hereditary non-polyposis CRC, juvenile polyposis syndrome and Gardner’s syndrome; solely focusing on the progression or histological phenotype of CRC; or studies in animals; were excluded. Case-control, cohort and GWASs were included, while family-based studies were excluded. All included studies were published in English in a peer-reviewed journal; studies only reported as conference abstracts were excluded. Data from the eligible studies were abstracted into two standardised tables, including the key variables with regard to the study identifiers and context, study design and limitations, genotype information and outcome effects.

A list of genetic variants that were investigated in meta-analyses was summarised and data from three GWAS consortia (Scotland, Canada and Spain) were incorporated for meta-analyses, when the genotype data are available for the listed variants. In brief, the Study of CRC in Scotland (SOCCS) is a population-based case-control GWAS that includes 3417 cases and 3500 controls. The Assessment of Risk for Colorectal Tumors In Canada (ARCTIC) is a case-control GWAS database that includes 1231 cases and 1240 controls. The population-based cohort study in Spain comprised two phases (EPICOLON I and EPICOLON II) adding up to 2000 cases and 2000 controls. Restricted candidate gene genotyping data were available from both phases and GWAS data were only accessible from 881 cases and 667 controls from phase 2.17 18 More details about these GWAS datasets are present in online supplementary text.

### Statistical analysis

Meta-analysis was performed for genetic variants with data available from at least three independent studies. Summary crude ORs and 95% CI for allelic, recessive and dominant genetic models were calculated by applying either the fixed-effect model (Mantel-Haenszel method) or the random-effects model (DerSimonian-Laird method) in case of the existence of substantial heterogeneity. The Q statistic (with a threshold of p value <0.05) and I2 metric were calculated to quantify between-study heterogeneity. Funnel plot analysis with an Egger test was conducted to test for small study effects. We also estimated the statistical power of each meta-analysis based on the significance level of α=0.05, the effect sizes and the allele frequencies of genetic variants (an integral component of the Bayesian False-Discovery Probability (BFDP) analysis).19 All statistical analysis was conducted by using R software (R x64 3.1.0).

### Credibility of the identified genetic associations

We first applied the BFDP19 and the Venice criteria12 13 to assess the credibility of any observed genetic associations with p<0.05 in at least one genetic model. We then validated these associations in three additional GWAS consortia: Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO),20 Colorectal Transdisciplinary Study (CORECT, https://research.fhcrc.org/peters/en/corect-study.html) and Colon Colorectal Cancer Family Registry (CFR).21 With meta-analysis of these three GWAS datasets, we validated the observed genetic associations using data from 58 131 CRC cases and 67 347 controls (online supplementary text), and the statistical power of validation was estimated accordingly.

The BFDP assesses the noteworthiness of an observed association. The BFDP was selected rather than the false-positive report probability (FPRP) because it uses more information, defines the noteworthiness threshold explicitly in terms of the costs of false discovery and non-discovery, and does not suffer from the inferential limitations identified for the FPRP.22 We calculated BFDP values at two levels of prior probability: a medium/low prior level (0.05 to 10−3), close to what would be expected for a candidate gene, and a very low prior level (10−4 to 10−6), close to what would be expected for a random SNP. A noteworthy threshold was defined as 0.2 based on the assumption that the cost of false discovery would be four times higher than that of false non-discovery.19

According to the Venice criteria, the credibility of associations is assessed for three aspects: the amount of evidence, the extent of replication and the protection from bias.12 13 We used statistical power to assess the volume of evidence and a grade of A, B and C was assigned, respectively, when statistical power was greater than 80%, 50%–79% or less than 50%. The extent of replication was assessed by the measurement of heterogeneity (I2 criterion), and a grade of A, B and C was assigned, respectively, when I2 was less than 25%, 25%–49% or greater than 50%.12 For protection from bias, a complete assessment is difficult; instead, we considered the following aspects: (1) the phenotype definition was addressed by our inclusion criteria—namely that cases would have newly incident CRC; (2) genotyping error rates are generally low; (3) the criterion of replication across studies in part addresses potential concerns about variation in genotyping quality between studies; and (4) the magnitude of effect of population stratification appears to be small in general.23

Genetic associations were then classified into four categories based on the following criteria. Associations were classified as ‘positive’ if they: (1) were statistically significant at a p value level of 0.05 in at least two of the genetic models, (2) had a BFDP less than 0.20 at least at the p value level of 0.05, (3) had a statistical power greater than 80%, (4) had an I2 less than 50%. A class of ‘less-credible positive’, with a less-stringent threshold, was assigned to the associations (1) that were statistically significant at a p value threshold of 0.05 in at least one of the genetic models, but (2) their BFDP was greater than 0.20 or their statistical power was between 50% and 79% or had an I2 greater than 50%. Associations with p value large than 0.05 were further classified as ‘null’ or ‘negative’ by assessing if there are more than 5000 cases. After credibility assessment, genetic variants classified as ‘positive’ and ‘less-credible positive’ were sent to the CORECT coordinating centre to validate their associations with CRC risk using additive, dominant and recessive models. At this stage, ‘positive’ associations that failed to be validated (at p<0.05) were downgraded to ‘less-credible positive’. A schematic diagram is shown in figure 1 to demonstrate datasets included in each phase of the analysis.

Figure 1

Diagram of the study design. ARCTIC, Assessment of Risk for Colorectal Tumors In Canada; BFDP, Bayesian False-Discovery Probability; CFR, Colorectal Cancer Family Registry; CORECT, Colorectal Transdisciplinary Study; CRC, colorectal cancer; EPICOLON, Gastrointestinal Oncology Group of the Spanish Gastroenterological Association; GECCO, Genetics and Epidemiology of Colorectal Cancer Consortium; GWASs, genome-wide association studies; SOCCS, Study of CRC in Scotland.

## Results

### Literature search and data collection

A total of 20 900 citations were identified from literature search. Of these, 6770 (32.4%) papers were published after the search period of the most recent field synopsis (31 December 2012).11 After eligibility screening, we finally included and extracted data from 869 publications (figure 2), reporting the association of CRC risk with 1063 polymorphisms in 303 different genes, of which 308 polymorphisms were reported in at least three independent studies.

Figure 2

The distribution of included studies published from 1991 to 2018.

### Meta-analyses

Meta-analyses were conducted for 308 polymorphisms in 158 different loci with data available in three or more candidate or GWA studies (online supplementary table 1). On average, these meta-analyses were based on 6149 CRC cases (median; IQR=2301–7334) and 7337 controls (median; IQR=2809–8885) originating from 8 (median; IQR=4–9) case–control studies. Data from the Scottish, Canadian and/ or Spanish GWAS were incorporated in the meta-analyses for 132 SNPs. Summary crude ORs and 95% CI for the allelic, dominant and recessive models are presented in table 1. Of the meta-analyses for 308 polymorphisms (tagged at different 158 loci), a total of 77 SNPs (25.6%) (tagged in 61 different loci) were identified to have a nominally statistically significant association (p value <0.05) with CRC risk in at least one of the three genetic models and were eligible for credibility assessment using the BFDP19 (online supplementary tables 2–4) and the Venice criteria12 13 and for validation in the three (GECCO, CORECT and CCFR) GWAS consortia (online supplementary tables 5–7).

### Supplemental material

Table 1

Summary of ‘credible positive associations’ from meta-analyses and credibility assessment

Credibility assessment indicated 18 variants (5.8% of the meta-analysed SNPs) tagging 16 loci (rs36053993 and rs34612342 in MUTYH, rs2066847 in NOD2, rs12953717 and rs4464148 in SMAD7, rs1569686 in DNMT3B, rs2736100 in TERT, rs9858822 in PPAR-gamma, rs1862748 in CDH1, rs7259371 in RHPN2, rs355527 in BMP2, rs1800469 in TGFB1, rs10505477 in 8q24, rs16892766 in 8q23.3, rs3802842 in 11q23.1, rs961253 in 20p12.3, rs10795668 in 10p14, rs4951291 in 1q32.1) had the most credible associations with CRC risk and are therefore referred to as ‘positive’ SNPs (table 1, online supplementary tables 2–4). These findings are based on accrued data on 1224 to 43 652 cases and on 1381 to 60 883 controls, with a median of 17 100 cases per meta-analysis. The linkage disequilibrium (LD) between these ‘positive’ polymorphisms was checked pairwise using the Ensembl LD calculator with reference to the 1000 genomes: phase 3 CEU population and we found two pairs of SNPs (rs355527 and rs961253, rs12953717 and rs4464148) with r2 >0.20 (online supplementary tables 8). The other 59 variants (19.2% of the meta-analysed SNPs) in 49 loci with p value <0.05 were classified as ‘less-credible positive’ SNPs, given high heterogeneity, low statistical power or a high possibility of being false positive (BFDP >0.2) for their association with CRC risk (table 2, online supplementary tables 2–4). The summary findings for ‘less-credible positive’ SNPs were based on accrued data on 246 to 51 730 CRC cases and on 399 to 53 589 controls, with a median of at least 4287 CRC cases per meta-analysis.

Table 2

Summary of ‘less-credible associations’ from meta-analyses and credibility assessment

Polymorphisms classified as either ‘positive’ or ‘less-credible positive’ SNPs were sent for validation in synthesised data from the GECCO, CORECT and CCFR consortia. Validation was only able to be performed for 68 out of 77 polymorphisms, as one ‘positive’ variant (rs34612342 in MUTYH [Y179C] gene) was dropped due to its low imputation quality in these GWASs; 5 ‘less-credible positive’ polymorphisms (in GSTT1, GSTM1, TP73, CHEK2 and CYP2E1) had no rs numbers; and 3 polymorphisms (in TP53 and KRAS) were not available in these GWASs; therefore, they were not able to be validated. Of the 17 ‘positive’ SNPs sent for validation, 7 polymorphisms (41.2%) in 6 different loci (rs12953717 and rs4464148 in SMAD7, rs355527 in BMP2, rs10505477 in 8q24, rs961253 in 20p12.3, rs16892766 in 8q23.3, rs3802842 in 11q23.1) reached a genome-wide statistical significance (p≤5×10−8) in at least one meta-analysis model and 6 polymorphisms (35.3%) in 6 different loci (rs7259371 in RHPN2, rs2736100 in TERT, rs10795668 in 10p14, rs36053993 in MUTYH, rs1862748 in CDH1 and rs1800469 in TGFB1) reached a nominally statistical significance (p<0.05) in at least one meta-analysis model. However, the remaining 4 polymorphisms (23.5%) in 4 different loci (rs2066847 in NOD2, rs1569686 in DNMT3B, rs9858822 in PPAR-gamma and rs4951291 in 1q32.1) failed in the GWAS validation of all genetic models (online supplementary tables 5–7) and they were therefore downgraded to ‘less-credible positive’. Of the 51 ‘less-credible positive’ SNPs sent for validation, 2 polymorphisms (3.9%) in 2 different loci (rs6983267 in 8q24 and rs1801155 in APC) reached a genome-wide statistical significance (p≤5×10−8) in at least one meta-analysis model and 11 polymorphisms (21.6%) in 11 different loci (rs1136410 in PARP1, rs11568820 in VDR, rs1342387 in ADIPOR1, rs719725 in TPD52L3, rs20417 in PTGS2/COX2, rs2665802 in GH1, rs4803455 in TGFB1, rs7849 in SCD, rs8752 in HPGD, rs36053993 in MUTYH and rs928554 in ESR2) reached a nominally statistical significance (p<0.05) in at least one meta-analysis model, whereas the remaining 38 polymorphisms (74.5%) failed in the GWAS validation of all genetic models (online supplementary tables 5–7). Overall, 26 (33.8%) out of the 77 nominally significant polymorphisms tested via meta-analysis were successfully validated in these GWASs (p<0.05).

Funnel plots were produced for all ‘credible’ (online supplementary figures 1–14) and ‘less-credible’ SNPs (online supplementary figures 15–77) for their summary estimates in allelic model. Small study effects were reported for 2 (14.3%) of the 14 ‘positive’ polymorphisms (rs36053993 in MUTYH and rs4464148 in SMAD7) and for 10 (15.9%) of 63 ‘less-credible’ polymorphisms (rs1229984 in ADH1B, rs25487 in XRCC1, rs2240308 in AXIN2, rs1799750 in MMP1, rs1048943 in CYP1A1, rs373572 in RAD18, rs1800566 in NQ01, rs2066844 in NOD2, rs2665802 in GH1, rs712 in KRAS). Therefore, their reported association with CRC risk should be interpreted with caution.

### Supplemental material

The remaining 231 polymorphisms (75.0%) assessed in meta-analyses were reported to have no statistically significant association (p>0.05) with CRC risk in all three genetic models (online supplementary table 9), based on accrued data on 192 to 21 929 cases and 251 to 24 054 controls, with median of at least 4017 cases per meta-analysis. Of them, 148 polymorphisms were classified as having negative associations with CRC risk because the number of cases was less than 5000 for which the null results could be due to limited statistical power, while another 83 SNPs with p>0.05 and the number of cases>5000 were classified as null variants with adequate statistical power (online supplementary table 9).

## Discussion

This systematic, comprehensive field synopsis of genetic association studies on CRC updates the two previous field synopses10 11 with application of harmonised methods for evidence appraisal and further validation of the identified genetic associations in three GWAS consortia. Specifically, we extracted and collated data for 1063 polymorphisms in 303 different genes from 869 publications and performed up-to-date meta-analyses for 308 variants in 158 different genes that had data from at least three independent studies available for analysis. After credibility assessment and validation, we identified a total of 12 genetic loci credibly associated with CRC risk, of which 6 loci (MUTYH, SMAD7, 8q24, 8q23.3, 11q23.1, 20p12.3) were also classified as credibly associated with CRC risk in the previous field synopses and the other 6 loci (TGFB1, TERT, CDH1, RHPN2, BMP2 and 10p14) are novel findings as they have not been assessed or reported as credible risk loci in the previous field synopses.

We note that a synopsis was undertaken of literature on genetic associations with CRC published in the period 2012–2017.24 Our analysis differs from that study, in that it includes and updates data from our previous field synopses,10 11 includes data from three GWA studies in the meta-analysis and further validates the findings in three GWA consortia.

Similar to our previous field synopses,10 11 the present study reported two SNPs at 8q24 locus (rs6983267 and rs10505477) with strong evidence supporting significant associations with CRC risk and with these associations further replicated in three GWAS consortia. In biopsies of the rectum, sigmoid colon and cecum mucosa, proliferation has been reported to be higher among homozygotes for the risk alleles of the rs6983267 and rs10505477 variants compared with those with other genotypes in the general population.25 The rs6983267 variant has been assessed as having a highly credible association with colorectal adenomas.26 27 In fine-mapping and bioinformatic analysis performed within the GECCO-CCFR consortia, the rs6983267 variant was appraised as having a strong functional evidence.28 The rs6983267 may be a somatic target in CRC29 and may be associated with enhanced responsiveness to Wnt signalling.30 Furthermore, rs6983267 has also been found to be associated with other types of cancer, including prostate cancer.31–33 Interaction with the MYC proto-oncogene has been controversial,34–37 but in functional studies in cell lines, interaction between enhancer elements in the 8q24 locus and the MYC promoter, via transcription factor Tcf-4 binding and allele-specific regulation of MYC expression, has been demonstrated.38 Expression levels of one of these, CARLo-5, in normal colon tissue have been found to be statistically significantly correlated with rs6983267, and chromosome conformation capture analysis of genomic DNA from CRC-derived cell lines provided evidence of physical interaction between the active regulatory region of the CARLo-5 promoter and the MYC enhancer region.39 Since the end of our search period (21 November 2018), an analysis of GWAS data on 22 775 cases and 47 731 controls from 14 studies in East Asia detected a genome-wide significant association with the rs6983267 variant.40 In addition, in a combined meta-analysis of up to 58 131 cases and 67 347 controls from the GECCO, CORECT and CCFR consortia, in which imputed variants from a whole-genome sequencing analysis and Haplotype Reference Consortium panel variants were included, analysis in the 8q24.21 region conditioned on the rs6983267 and rs7013278 variants identified a genome-wide significant association with the rs4313119.41

Two genetic polymorphisms (rs12953717 and rs4464148) tagging in SMAD7 were identified to be associated with CRC with highly credible evidence. Associations with rs12953717 and rs4464148 were successfully validated in the three GWAS consortia. In fine-mapping and bioinformatics analysis performed within the GECCO-CCFR consortia, rs4464148 was appraised as having less strong functional evidence than the highly correlated rs9932005 variant, which is located within 5 kb away.28 The SMAD7 protein is an inhibitor for the TGF-ß signalling pathway.42 43 There were highly credible associations with the rs1862748 (tagging CDH1), rs355527 (BMP2), rs961253 (BMP2) and rs7259371 (RHPN2) variants, which are TGF-ß related, and replicated in the data from the GWAS consortia. In the recent East Asian analysis, the association with the rs961253 (BMP2) variant was replicated, as well as additional variants of SMAD7 (rs7229639, rs4939827), CDH1 (rs9929218), BMP2 (rs4813802) and RHPN2 (rs10411210).40 In the analysis from the GECCO, CORECT and CCFR consortia, conditioned on the rs4813802 and rs189583 variants of BMP2 as well as each other, novel associations with the BMP2 variants rs28488 and rs994308 were detected.41

Additionally, two variants (rs34612342 and rs36053993) tagging the MUTYH gene were highly credibly associated with increased CRC risk, of which rs36053993 was validated in the three GWAS consortia data, while rs34612342 was not tested due to its poor imputation quality. The MUTYH gene is known to be involved in the dysfunction of base-excision repair, which is the major pathway for repairing oxidative damage. This biological pathway in which the MUTYH gene is involved contributes to the development of multiple colorectal adenomas and carcinomas (MUTYH-associated polyposis (MAP) syndrome).44 In our analysis, the p values were nominally significant for all models for the rs34612342 variant and for the allelic and dominant models for the rs36053993 variant. For the former variant, the magnitude of effect was greatest for the recessive model, although with wide CIs, which is intriguing as MAP, in which highly penetrant mutations are implicated, and is inherited in an autosomal recessive manner.45

Highly credible associations were also reported for four variants tagging four different genes that are involved in inflammation or immune response. First, a positive association with the rs3802842 variant in the 11q23.1 was identified for CRC risk and this association was replicated in the data from the GWAS consortia and in the recent East Asian analysis.40 Fine-mapping and bioinformatics analysis performed within the GECCO-CCFR consortia support this variant with strong functional evidence.28 Fine mapping identified two genes COLCA1 and COLCA2 arranged on opposite strands and sharing a regulatory region containing rs3802842.46 It is reported that carrying the risk allele of rs3802842 is associated with the expression levels of COLCA1 and COLCA2, which is further correlated with lymphocyte infiltration of colonic lamina propia. Further, in an expression quantitative trait locus analysis in colorectal tissue, there were signals for COLCA1 and COLCA2.47 The polymorphism rs10795668 in 10p14 locus tagging GATA3 gene was reported with a highly credible association with CRC, and this association was further replicated in the GWAS consortia and in the recent East Asian analysis.40 However, in fine-mapping and bioinformatics analysis, this variant presents with weak functional evidence.28 Another less-credible positive association with the rs9858822 variant of PPARγ was identified; however, this association was not replicated in the GWAS consortia. Bioinformatics analysis showed that PPARγ and its ligands have been found to block proinflammatory genes in colon cancer cell lines, activated macrophages and monocytes.48 PPARγ is also involved in lipid metabolism, adipocyte differentiation, and glucose homeostasis and insulin sensitivity.48 A less-credible positive association also reported one variant (rs2066847) in NOD2 gene; however, this association was not replicated. Evidence from experimental studies in mice investigating the role of Nod2 in colorectal tumour risk has also been inconsistent.49–51 The recent analyses in East Asia,40 the GECCO, CORECT and CCFR consortia41 and of five UK studies and a further 10 from the COGENT consortium47 identified new associations in the major histocompatibility region.

The remaining two variants with highly credible associations with CRC risk included rs2736100 tagging TERT and rs16892766 at 8q23.3 tagging EIF3H. These variants were all validated in the GWAS consortia datasets. Fine-mapping and bioinformatics analysis performed within the GECCO-CCFR consortia supported these variants with strong functional evidence, for which polymorphism rs2736100 in TERT gene has been reported to be associated with telomere length and the risk of different types of cancer and chronic diseases other than cancer.52 For the association with the rs16892766 variant at 8q23.3 tagging EIF3H, another variant rs16888589 was previously reported with the lowest p value for the association with CRC in this locus.25 Functional analysis and chromosome conformation capture analysis in CRC cell lines have found that the genomic region harbouring rs16888589 increases EIF3H expression,53 but analysis of expression quantitative trait loci around rs16892766 suggested that UTP23 rather than EIF3H is the target of genetic variation associated with CRC in this region.54

The less credible associations with 63 variants of 52 genes involved the following pathways—adhesion (AXIN2, MMP1); alcohol metabolism (ADH1B); angiogenesis (VEGF); blood clotting (SERPINE1); DNA repair (CHEK2, ERCC5, MSH2, MSH3, PARP1, RAD18, XPC, XRCC1); hormone metabolism (ESR2, GH1, PGR); inflammation and immune response (CRP, HPGD, PTGS2/COX2); inhibition of cell growth (CCND1, EGF, TGFB1); iron metabolism (HFE); lipid metabolism (ADIPOQ, ADIPOR1, LIPC, SCD); one-carbon metabolism (MTHFR, MTFD1, MTRR); substrate metabolism (ABCB1; CYP1A1, CYP2C9, CYP2E1, GSTM1, GSTT1, NAT2, NQ01); tumour suppression (ARLTS1, miR, TP73); vitamin D metabolism (VDR)—common low penetrance variants at 1q32.1 (rs4951039, LINC00303) and 9p24 (rs719725); and the common rs1801155 (I1307K) variant of APC, for which large numbers of rare variants have been identified,55 and rs63750447 (V384D) variant of MLH1, for which rare variants confer a high risk of Lynch syndrome.56 These variants were classified as less-credible SNPs because of either the substantial heterogeneity or the high possibility of false positive; however, we would like to highlight a number of less-credible genetic loci (PARP1, MYC, VDR, ADIPOR1, APC, PTGS2/COX2, SCD, HPGD and ESR2), which were replicated in the GWAS data and for which their linked pathways are worthy of further investigation in future studies.

Updating field synopses is challenging because genetic analysis is such a fast-moving field. Recent trends highlighted by three articles published since we completed our search in November 2018 include the extension of consortia to increase statistical power to detect and replicate associations,47 the investigation of populations other than of European origin,40 and the use of whole-genome sequencing and more comprehensive reference panels to extend the range of genetic variants considered to include those that are rare or of low frequency.41 These three articles have added new loci for CRC susceptibility and indicate that most of the risk loci previously associated with CRC in populations of European origin are also associated with CRC risk in East Asian populations.

We checked whether the 14 variants we classified as ‘highly credible’ were replicated in the paper of Law et al.47 For one variant, rs34612342 MUTYH, there was poor imputation quality. However, as there was no satisfactory proxy for this, we report the available information, which supported the association (p=0.029). All the remaining 13 variants had p values for association less than 5.0×10−5 with no more than moderate heterogeneity. Such an empirical comparison was not appropriate with the data of Huyghe et al 41 because of the considerable overlap of included participants.

In summary, we have conducted a comprehensive study to capture and meta-analyse all SNP data for common genetic variants. The analysis clearly identifies 14 variants at 12 loci for which there is robust evidence of their impact on CRC risk, 63 variants at 52 loci for which further evidence through international collaboration should be generated and 231 variants for which the overall evidence does not support any association with CRC risk. With increasing availability of data from multiple SNPs, it is clear that studies to test associations must achieve very high levels of statistical stringency. Nonetheless, the analysis here provides a resource for mining available data and puts into context the sample sizes required for the identification of true associations for common genetic variants. Future resequencing studies are expected to identify rarer variants (eg, prevalence 0.05%–5%) with intermediate or perhaps even large effects, and GWAS of structural variation will likely identify deletions, amplifications and other copy number variations that may also influence CRC risk.6 This study highlights a number of common genetic variants that could be incorporated into genetic risk-prediction algorithms as further risk factors are identified and highlights the loci at which further research effort should be targeted. All data are available from the CRCgene2 database.

## Acknowledgments

ET is supported by a Cancer Research UK Career Development Fellowship (C31250/A22804).

SOCCS: The work was supported by Programme Grant funding from Cancer Research UK (C348/A12076) and by funding for the infrastructure and staffing of the Edinburgh CRUK Cancer Research Centre. This work was also funded by a grant to MGD as Project Leader with the MRC Human Genetics Unit Centre Grant (U127527202 and U127527198 from 1/4/18). We are grateful to all who contribute to recruitment, data collection and data curation for the Study of Colorectal Cancer in Scotland studies. We acknowledge that these studies would not be possible without the patients and controls and their families. We acknowledge the expert support on sample preparation from the Genetics Core of the Edinburgh Wellcome Trust Clinical Research Facility.

ARCTIC: ARCTIC was funded by the Canadian Institutes of Health Research (CIHR) Team in Interdisciplinary Research on Colorectal Cancer (CTP-79845); a CIHR pilot project grant in colorectal cancer screening (200509CCS-152119-CCS-CECA-102806); the Cancer Risk Evaluation (CaRE) Program Grant from the Canadian Cancer Society Research Institute (18001).

EPICOLON: We are sincerely grateful to all patients, the Spanish National DNA Bank (BNADN) and Biobank of Hospital Clínic–IDIBAPS. SNP genotyping services were provided by the Spanish 'Centro Nacional de Genotipado' (CEGEN-ISCIII). This work was supported by grants from Fondo de Investigación Sanitaria/FEDER (PI17/00878, PI17/00509), Fundación Científica de la Asociación Española contra el Cáncer (GCB13131592CAST), PERIS (SLT002/16/00398, Generalitat de Catalunya), CERCA Programme (Generalitat de Catalunya) and Agència de Gestió d'Ajuts Universitaris i de Recerca (Generalitat de Catalunya, GRPRE 2017SGR21, GRC 2017SGR653). CIBEREHD and CIBERER are funded by the Instituto de Salud Carlos III. This article is based upon work from COST Action CA17118, supported by COST (European Cooperation in Science and Technology, www.cost.eu).

This work was also supported by grant U01 CA167551 from the National Cancer Institute and through cooperative agreements with the following Colon Cancer Family Registry sites: Australasian Colorectal Cancer Family Registry (U01 CA074778 and U01/U24 CA097735); Ontario Familial Colorectal Cancer Registry (U01/U24 CA074783); and Seattle Colorectal Cancer Family Registry (U01/U24 CA074794). The genome-wide association studies (GWAS) were supported by grants U01 CA122839, R01 CA143237 and U19 CA148107.

## Footnotes

• ZM and XL are joint first authors.

• Twitter @ccgg_edinburgh

• Contributors JL, WZ, ET and HC conceived the study; JL, WZ, ET, HC, ZM and XL designed it; ZM, XL, ET, JL and HC wrote the article with input from other authors; ZM, XL, CN, MT and VS undertook data manipulations and statistical analysis; ZM, XL, CN, XMa, XMeng, YH and YB undertook the literature review and coordinated and/or undertook related abstraction, handling and curation of the data; SM, SC-B, CR-P, CF-R, ACarracedo, ACastells, DB, MAJ, TOK, AL, FJBvD, AW, SMF and MGD provided access to GWAS data. HC, ET, WZ and JL are joint corresponding authors.

• Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

• Competing interests None declared.

• Patient consent for publication Not required.

• Provenance and peer review Not commissioned; externally peer reviewed.

• Data availability statement All data relevant to the study are included in the article or uploaded as supplementary information.

## Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.