Introduction

It is estimated that genetic variants explain 12–35% of the heritability in colorectal cancer (CRC) risk (Lichtenstein et al. 2000; Czene et al. 2002; Jiao et al. 2014). To date, Genome-Wide Association Studies (GWAS) have identified 56 independent common risk variants that are robustly associated with CRC (Peters et al. 2015; Schumacher et al. 2015; Orlando et al. 2016). However, the functional relevance of most discovered CRC-risk variants (89%) remains unclear. The biological mechanisms linking CRC-associated risk variants with target genes have only been validated in the laboratory for six regions [8q24 MYC (Pomerantz et al. 2009), 8q23.3 EIF3H (Pittman et al. 2010), 11q23.1 COLCA1 and COLCA2 (Biancolella et al. 2014), 15q13.3 GREM1 (Lewis et al. 2014), 16q22.1 CDH1 (Shin et al. 2004), and 18q21.1 SMAD7 (Fortini et al. 2014)]. Given that most of the associated loci do not include coding variants, a large portion of CRC genetic risk is thought to be explained by regulatory variation that modulates the expression of target genes. This hypothesis is supported by the observation that CRC risk variants are enriched in colon expression quantitative trait loci (eQTLs) (Hulur et al. 2015) and active regulatory regions of colorectal enhancers (Bien et al. 2017). Together, this evidence highlights the value of studying transcriptional regulation in relation to CRC risk.

Large-scale efforts are underway to map regulatory elements across tissues and cell types. Many transcriptome studies have been conducted where genotype and expression levels are jointly assayed for many individuals, enabling the discovery of tissue-specific eQTLs. For instance, the Genotype-Tissue Expression (GTEx) Project (GTEx Consortium 2013) is building a biospecimen repository to comprehensively map tissue-specific eQTLs across human tissues, which currently includes transcriptomes from 169 colon transverse samples. These data provide a remarkable new resource for understanding function in non-coding regions that can be used to inform GWAS.

We employed the computational method, PrediXcan (Gamazon et al. 2015), to perform a CRC transcriptome-wide association study using reference datasets to ‘impute’ unobserved expression levels into GWAS datasets. Variant prediction models were developed using colon transverse transcriptomes (n = 169) from GTEx (GTEx Consortium 2013) and a larger whole blood transcriptome panel (n = 922) from the depression genes and networks (DGN) (Battle et al. 2014). We included whole blood as a previous analysis demonstrated that gene regulatory elements of immune cell types from peripheral blood are enriched for variants with more significant CRC association P (Bien et al. 2017). Further, laboratory follow-up of the CRC GWAS locus 11q23 implicates two genes, COLCA1 and COLCA2, which are co-expressed in immune cell types and correlate with inflammatory processes (Peltekova et al. 2014). In addition to novel discovery, the PrediXcan approach can aid in prioritization of candidate target genes in non-coding GWAS loci and thereby inform testable hypotheses for laboratory follow-up. Therefore, as a secondary analysis we investigated the association of imputed gene expression with CRC in the 44 genetic regions harboring one or more of the 56 independent variants (r2 < 0.2) that are associated with CRC in previous GWAS (P ≤ 5 × 10− 8) and were replicated in an independent dataset.

We aimed to discover novel loci associated with CRC, and refine established regulatory risk loci by reducing the list of putative gene targets. Employing PrediXcan, we tested genetically regulated gene expression for association with CRC in a two-stage approach. In the discovery stage, up to 8277 gene sets were tested in 12,186 cases and 14,718 controls from the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO) and the Colon Cancer Family Registry (CCFR). This discovery set was also used to identify potential target genes in the 44 genetic regions harboring 56 known CRC risk variants. We attempted replication of three novel genes that were not positioned within 1 Mb of the 56 previously reported risk variants and with false discovery rate (FDR) ≤ 0.2 for CRC risk in a large and independent study of 32,825 cases and 39,933 controls from the Colorectal Transdisciplinary (CORECT) consortium, UK Biobank, and additional CRC GWAS (Fig. 1).

Fig. 1
figure 1

Schematic illustration of the study design training data was comprised of joint observations of imputed variant genotypes and tissue-specific gene expression from reference datasets (DGN and GTEx). Elastic net regularization was used to train genetic variant predictors of gene expression and downloaded from PredictDB.org. Models for colon transverse tissues and whole blood were used for imputation of expression into independent GWAS datasets for Colorectal Cancer (CRC). Imputed gene expression was then tested for association with case (ca.)–control (co.) status in the discovery stage. Novel gene associations with a false discovery rate (FDR) = 0.2 were assessed in an independent CRC GWAS dataset. As a secondary analysis, the association of genetically determined expression of genes in 44 GWAS-associated risk regions was examined

Results

Imputation of genetically regulated gene expression

Gene expression levels were imputed using previously published multi-variant models built using elastic net regularization (variant weight gene models V6 available online from PredictDB.org). For each tissue and gene, a quality metric referred to as predictive R2 was provided as the correlation between the observed and predicted expression from the multi-variant model based on a tenfold cross validation. After restricting to protein coding genes with a predictive R2 > 0.01 (≥ 10% correlation between predicted and observed expression), the discovery analysis tested the association of imputed expression for 4850 genes using colon transverse models and 8277 genes using whole blood models. On average, colon transverse models used 22 variants (SD = 19) per gene with a range of 1–173 variants. The number of variants in whole blood models were slightly larger on average with a mean of 34 variants (SD = 24) per gene, ranging from 1 to 213 variants. We report CRC association results and predictive R2 for imputed expression of each gene with P ≤ 0.05 in either colon transcriptome or whole blood analysis (Online Resource 2 Table S2).

Discovery of new CRC susceptibility genes

In total, multivariate logistic regression was used to test the association of CRC with genetically impute gene expression for 4850 genes from colon transverse models and 8277 genes from whole blood models. We employed PrediXcan in 12,186 cases and 14,718 controls from 16 GWAS studies. Replication was attempted for associations meeting an FDR = 0.2 threshold in the discovery phase if they were in a novel CRC region using an independent GWAS dataset comprised of 32,825 cases and 39,933 controls from the CORECT consortium, UK Biobank, and additional GWAS as described in Online Resource 1. In the discovery phase, colon transcriptome models identified CRC association with imputed genetically regulated gene expression in three putative novel regions. Two out of three genes tested in the replication dataset were significant after adjusting for multiple comparisons (α = 0.05/3 = 0.017) (Online Resource Fig S1, Table 1). In addition to being more than 1 Mb away from previously identified risk variants, we confirmed that none of the variant predictors used to impute gene expression for these three genes were in LD (r2 ≤ 0.1) with previously published CRC-risk variants. In the 7q22.1 locus, increased expression of TRIM4 was associated with reduced CRC risk with an odds ratio (OR) of 0.94 [95% confidence interval (CI) 0.91–0.97, discovery P = 2.2 × 10− 4]. Reduced CRC risk was also statistically associated with increased genetically regulated gene expression of TRIM4 in the independent replication dataset (P = 0.01). The second novel locus, 14q22.1, was also found to be inversely associated, where increased genetically regulated gene expression of PYGL was associated with decreased CRC risk, showing an OR of 0.90 (95% CI 0.85–0.96) in the discovery dataset (discovery P = 2.3 × 10− 4) as well as in the replication dataset (P = 7.9 × 10− 4). Imputed genetically regulated gene expression for SLC22A31 was associated with increased CRC risk in the discovery phase (P = 1.3 × 10− 4), but did not replicate in the independent dataset. We found no associations in novel regions using whole blood variant models that reached FDR = 0.2 in the discovery phase.

Table 1 Genes passing discovery threshold in novel loci from colon transverse PrediXcan

Colon Transverse PrediXcan analyses were repeated for TRIM4 and PYGL in the discovery dataset stratifying cases by proximal (n = 4454 cases), distal (n = 3580 cases), and rectal (n = 2936 cases) cancer sites. We excluded 1216 cases from the stratified analysis because the colon cancer site was unspecified. We found that for both genes the effects and p values were similar between the three sites. For TRIM4, the CRC association with genetically imputed gene expression had an OR of 0.94 (95% CI 0.90–0.98, P = 3 × 10− 3) in proximal colon cases compared to an OR of 0.95 (95% CI 0.90–1.0, P = 5 × 10− 2) in distal colon cases and an OR of 0.93 (95% CI 0.88–0.98, P = 2 × 10− 2) in rectal cases. There was no significant difference in the effect estimates between these cancer sites for TRIM4 (Q-test for heterogeneity P = 1.0). Similarly, for PYGL, the CRC association with genetically regulated gene expression had an OR of 0.89 (95% CI 0.82–0.97, P = 3 × 10− 3) in proximal colon cases compared to an OR of 0.91 (95% CI 0.83–1.0, P = 2 × 10− 2) in distal colon cases and an OR of 0.86 (95% CI 0.77–0.95, P = 5 × 10− 4) in rectal cases with no significant difference in effects (Q test for heterogeneity P = 0.98).

We further investigated the replicated CRC-associated PrediXcan genes by summarizing the single-variant CRC association results for variants that were included in the prediction models, referred to hereafter as ‘variant predictors’ (Online Resources 3–6 Fig S2). In TRIM4, the association was mostly driven by one LD block with 62 correlated genetic variant predictors used to impute genetically regulated gene expression in colon tissue models. Among the variant predictors of TRIM4, rs2527886 was most significantly associated with CRC (P = 1.8 × 10− 4). Bioinformatic follow-up of the TRIM4 locus showed that in the genomic region containing variants correlated with rs2527886, there were six enhancers with strong Chromatin Immunoprecipitation Sequencing (ChIP-seq) H3K27ac signal in either normal colorectal crypt cells or a CRC cell line (Online Resource 1 Fig S3). Using peak signal from H3K27ac activity to define enhancer regions, two enhancers were gained in ten or more CRC cell lines compared to normal colorectal crypt cells, referred to as recurrent variant enhancer loci (VEL) (Akhtar-Zaidi et al. 2012). Rs2527886 is positioned within one of these VEL. Peak ChIP-seq binding region for CTCF suggests that the VEL harboring rs2527886 may be in physical contact with the TRIM4 promoter. In the same VEL, one of the LD variants, rs2525548 (LD r2 = 0.99), is positioned within transcription factor binding sites for RUNX3, FOX, NR3C1, and BATF (Online Resource 1 Fig S3). In the PYGL locus, rs12589665 is the variant predictor with the strongest marginal association with CRC (P = 3.2 × 10− 4). We identified 7 enhancers in the region spanning the variants in LD with rs12589665, and three variants in LD with the lead predictor variant were positioned in VEL. Two of these variants, rs72685325 (r2 = 0.62) and rs72685323 (r2 = 0.53), were positioned within binding sites for 7 transcription factors (Online Resource 1 Fig S3).

A series of exploratory analyses were conducted to assess whether the observed inflation in association signals (λ = 1.1) was the result of bias in our data or modeling error. Results suggest that inflation was not driven by genes with low predictive R2 values (Online Resource 1 Fig S4), other potential confounding factors common to GWAS like genotyping batch effects (Online Resource 1 Fig S5) or cryptic population structure (Online Resource 1 Fig S6–S7), or due to inflated Z statistics by modeling genes with little variability in expression (Online Resource 1 Fig. S8–S11). Observed inflation was slightly reduced, but still elevated when looking at the marginal association results for the variant predictors (λ = 1.07; Online Resource 1 Fig S12) and when excluding genes with high predicted co-expression (λ = 1.07; Online Resource 1 Fig S13). Collectively, this exploration suggests that the observed inflation is less likely to be the result of modeling or analytical error and more likely reflects the polygenicity of CRC.

Refinement of known CRC GWAS-risk regions

We first assembled a list of 56 previously reported independent (r2 ≤ 0.2) CRC GWAS risk variants and defined a distance-based region surrounding each variant as the chromosomal position of the first reported (index) variant ± 1 Mb (Online Resource 1 Table S3). We then combined overlapping risk regions by taking the minimum and maximum chromosomal positions of all regions that overlapped, resulting in a total of 44 CRC risk regions harboring 1–4 independent CRC-risk variants. In these 44 regions, there was an average of 20 (SD ± 17) protein-coding genes per region annotated by the Consensus Coding Sequence Database (CCDS). The average number of protein-coding genes per region with imputed genetically regulated gene expression in the tissue-specific models was reduced to an average of 10 (SD ± 8) genes in colon transverse, and 14 (SD ± 11) genes in whole blood. Further, in these regions we found that of the total number of genes with genetically regulated gene expression across the two models, an average 45% of the genes overlapped. We found that 34/44 (77%) of CRC-risk regions overlapped the transcription start site of a gene associated with CRC at a P < 0.05. Comparing the number of genes with a P < 0.05 to the total number of CCDS genes within 1 Mb of an index variant resulted in an average reduction of 82% per region (Table 2).

Table 2 Known GWAS-risk regions overlapping genes that show association of genetically regulated gene expression with CRC

We further investigated the regions that did not show evidence of gene association and found that GWAS reported risk variants in 3/10 of these regions were a coding variant or were in LD with a coding variant (3q26-MYNN and LRRC34, 10q24.32-WBP1L, 14q22.2-BMP4). Additionally, 2/10 of the risk variants were originally discovered in East Asian populations and risk SNPs had weaker association in our study (10q22.3-rs704017 P = 1 × 10− 4 and 10q24.32-rs4919687 P = 1 × 10− 2). Another 2/10 GWAS risk variant did not replicate in our study (4q31.1-rs60745952 P = 0.8 and 16p13.2-rs79900961 P = 0.26). In the remaining 3/10 regions, we found that the index variants did not reach genome-wide significance, reflecting power limitations in our discovery dataset (4q32.2-rs35509282 P = 6 × 10− 3, 16q24.1-rs16941835 P = 4 × 10− 3, and 20p12.3-rs961253 P = 4 × 10− 5).

Among the 34 regions containing associated genes, we found that the most significant gene association in the PrediXcan analysis was often the strongest candidate based on either known CRC etiology and gene function or results from previous laboratory follow-up (e.g. COLCA2, LAMC1, POLD3, SMAD7, TGFB1). In addition to confirming suspected genes, new candidates were also identified. For example, CXCR1 (P = 8 × 10− 5) and CXCR2 (P = 9 × 10− 5) were among the strongest associations. Notably, these genes are biologically relevant targets given that they encode cytokine receptors known to be implicated in a variety of cancers.

Discussion

In this study, we employed the PrediXcan in 12,186 cases and 14,718 controls. Genetic variant predictors of gene expression from both colon transverse and whole blood transcriptomes were used to test the association of CRC risk with imputed gene expression. We replicated novel associations of TRIM4 and PYGL in a large independent study of over 70,000 participants. In addition, we identified strong gene targets in several known GWAS loci, including genes that were previously not reported as putative candidates.

The two novel gene associations discovered in colon transverse models implicate genes involved with hypoxia-induced metabolic reprogramming, which is a hallmark of tumorigenesis in solid tumors. TRIM4 is a member of a superfamily of ubiquitin E3 ligases comprised of over 70 genes notably defined by a highly conserved N-terminal RING finger domain. This family of proteins has been implicated in a number of oncogenic or tumor suppressor activities that involve pathways related to CRC (Myc, Ras, etc.) (Sato et al. 2012; Chen et al. 2012; Zaman et al. 2013; Tocchini et al. 2014; Zhou et al. 2014; Zhan et al. 2015), and recently have been implicated in inflammatory and immune related activities (Eames et al. 2012; Versteeg et al. 2014). Somatic alterations in other TRIM genes have been associated with a large number of cancers including colon (Glebov et al. 2006; Noguchi et al. 2011; Hatakeyama 2011). While TRIM4 has not previously been implicated in cancer risk, the strong homology across gene members of this family and their implications in cancer and immunity make this gene an interesting candidate. Moreover, a recent study suggests that expression of TRIM4 plays a role in sensitizing cells to oxidative stress-induced death and regulation of reactive oxygen species (ROS) levels (H2O2) through ubiquitination of the redox regulator peroxide reductase (Tomar et al. 2015). Regulation of ROS levels and the cellular antioxidant system has previously been implicated in the pathophysiology of many diseases including inflammation and tumorigenesis (López-Lázaro 2007; Holmdahl et al. 2013). ROS are associated with cell cycle, proliferation, differentiation and migration and are elevated in colon as well as other cancers (Vaquero et al. 2004; Kumar et al. 2008; Afanas’ev 2011; Lin et al. 2017). Notably, many of the established environmental risk factors for colon cancer implicate oxidative stress pathways, including high alcohol consumption, smoking, increased consumption of red and processed meats (Stevens et al. 1988; Bird et al. 1996), or decreased consumption of fruits and vegetables (La Vecchia et al. 2013). In future laboratory analysis, it would be interesting to investigate whether the association of increased TRIM4 expression with decreased CRC risk is mechanistically acting through the regulation of ROS and cell growth.

Under the hypoxic conditions of the tumor microenvironment, constant reprogramming of glycogen metabolism is essential for providing the energy requirements necessary for cell growth and proliferation. PYGL (the second novel finding) encodes the key enzyme involved in glycogen degradation, releases glucose-1-phosphate so that it can enter the pentose phosphate pathway, which is important for generating NADPH, nucleotides, amino acids, and lipids required for continued cell proliferation (Favaro et al. 2012). It has previously been shown that depletion of PYGL leads to oxidative stress (increased ROS levels), and subsequent P53-induced growth arrest in cancer cells (Favaro et al. 2012). Of note, small molecule inhibitors of PYGL are currently under investigation for the treatment of diabetes (Praly and Vidal 2010). However, while decreased expression of PYGL in the tumor may result in tumor senescence, our results suggest that decreased PYGL expression is associated with increased risk of CRC. Like the dynamic role of expression for genes involved in the TGF-beta pathway, these conflicting observations between cancer risk and effects of early versus late induction of PYGL on cancer survival are likely reflecting the importance of context and fluctuating nutrient and oxygen availability within the tumor microenvironment.

Importantly, we found that the PrediXcan analysis identified new candidate genes in known GWAS loci that had previously gone undetected. For instance, in the recently identified 2q35 locus (Orlando et al. 2016), the authors originally reported the two closest genes, PNKD and TMBIM1, as potential targets for the putative regulatory locus marked by the index variant, rs992157. The authors reported eQTL evidence showing that rs992157 was associated with expression of nearby genes PNKD and TMBIM1 in lymphoblastoid cells, but not colorectal adenocarcinoma cells. In our PrediXcan analysis, expression of two other genes in this region, CXCR1 and CXCR2, were among the most strongly associated genes in the entire analysis, while the associations for PNKD (P = 6 × 10− 3) and TMBIM1 (P = 0.01) showed weaker associations. Our study added independent evidence for an association of the locus with CRC given that the index variant was only borderline significantly associated in previous analysis and identify two promising targets, CXCR1 and CXCR2. These genes are of note due to their chemotherapeutic properties. Specifically, the CXCR inhibitor, Reparixin, is currently under investigation for progression free survival of metastatic triple negative breast cancer in a stage 2 clinical trial (NCT02370238). Interestingly, expression of CXCR1 and CXCR2 has been shown to be elevated in colon tumor epithelium relative to normal adjacent tissue (P < 0.001). While there is still much to be learned, it is possible that this drug could also be useful for the treatment of CRC (Dabkeviciene et al. 2015).

This study had many strengths, most notably the use of reference transcriptome data to perform gene-level association testing in several large GWAS studies to both uncover novel associations and identify likely functional gene targets in known loci. By integrating reference transcriptome data, this study focused on genes that are expressed in CRC-relevant tissues. Furthermore, this method provided biologically relevant sets to aggregate variants, thereby improving statistical power by reducing the burden of multiple comparisons. In addition, our study was quite large, being comprised of nearly 100,000 participants across the discovery and replication datasets.

Our study had several limitations. For many genes, the predictive R2 for genetic variant models was relatively low, indicating that a small proportion of the variance in gene expression was explained by these models. In a recent publication, Su et al. (2018) demonstrated through extensive simulations that while there is an attenuation of true signal as a results of this, the diminishment in power was less than anticipated and more importantly this does not increase type I error. Predictive performance values were relatively strong in the models used for PYGL (R2 = 0.26) TRIM4. (R2 = 0.51) corresponding to 51% and 71% correlation between predicted and observed expression, respectively. In general, larger sample sizes for the reference panel will be needed to achieve better prediction models, particularly for rarer variants. While PYGL and TRIM4 were discovered using the colon tissue model, the whole blood model also showed evidence of association. This finding was not surprising in light of the recent GTEx paper demonstrating that many GWAS loci implicate shared eQTLs (GTEx Consortium et al. 2017). It should also be noted that variant predictors could implicate enhancers influencing the expression of multiple genes and because this study only evaluates genetically influenced expression levels, there is uncertainty that the associated gene is the causally related gene. As such, laboratory follow-up remains a critical extension of these findings; however, this laborious work can now be more targeted based on results from this analysis.

The loci identified using GWAS are most often located in non-coding regions and provide little biological insight. In contrast, the PrediXcan method directly tests putative target genes providing strong hypotheses for subsequent laboratory follow-up. The CXCR1 and CXCR2 findings are of interest given their therapeutic potential. As such, these findings provide preliminary support for new molecular targets that could potentially repurpose a putative cancer therapeutic agent and highlight the utility of integrating functional data for discovery of, and biological insight into risk loci.

Future analyses would be improved by increasing the number of transcriptomes. Similarly, larger GWAS sample sizes, or imputation of other molecular phenotypes (ChIP-seq, DNase-Seq, etc.) as data become available could be fruitful in the identification of important enhancer(s) or other regulatory elements that could influence the expression of one or more genes.

In conclusion, we identified two novel loci through the association of genetically predicted gene expression for TRIM4 and PYGL with CRC risk and identified strong target genes in known loci. The CXCR1 and CXCR2 findings highlight the advantage of using gene-based methods to identify stronger candidate genes and potentially expedite clinically relevant discovery. Further functional studies are required to confirm our findings and understand their biologic implications. This, in turn, could provide further insight into CRC etiology and potentially new therapeutic targets.

Materials and methods

Description of study cohorts

The discovery phase was comprised of 26,904 participants (12,186 CRC cases and 14,718 controls) of European ancestral heritage across 16 studies (described in methods and materials of Online Resource 1). Details of genotyping, QC and single-variant GWAS have been previously reported (Peters et al. 2013; Schumacher et al. 2015). The replication phase included a total of 32,825 cases and 39,933 controls. In addition to previously published CRC GWAS studies from CORECT (Schumacher et al. 2015) we included UK Biobank (application number 8614) and new CRC GWAS from additional GWAS. A nested case–control dataset from the UK Biobank resource was constructed defining cases as subjects with primary invasive CRC diagnosed, or who died from CRC according to ICD9 (1530–1534, 1536–1541) or ICD10 (C180, C182–C189, C19, C20) codes. Control selection was done in a time-forward manner, selecting one control for each case, first from the risk set at the time of the case’s event, and then multiple passes were made to match second, third and fourth controls. For prevalent cases, each case was matched with four controls that exactly matched the following matching criteria: year at enrollment, race/ethnicity, and sex. In total, 5356 cases and 21,407 matched controls were included from UK Biobank in the replication analysis. For the site-stratified analysis, “proximal” colon cancer was defined as hepatic flexure, transverse colon, cecum and ascending colon (ICD9 1530,1531,1534,1536), “distal” colon cancer was defined as descending colon, sigmoid colon, and splenic flexure (ICD9 1532,1533,1537) and “rectal” was defined as rectosigmoid junction, and rectum (ICD9 1540,1541).

Studies, sample selection and matching are described in Online Resource 1, which provides details on sample numbers, and demographic characteristics of study participants. All participants provided written informed consent, and each study was approved by the relevant research ethics committee or institutional review board.

Whole-genome sequencing reference genotype imputation panel

We performed low-pass whole-genome sequencing of 2192 samples (details in Online Resource 1) at the University of Washington Sequencing Center (Seattle, WA, USA). A detailed description is provided in the Online Resource 1. In brief, after sample QC and removal of samples with estimated DNA contamination > 3% (16), duplicated samples (5) or related individuals (1), sex discrepancies (0), and samples with low concordance with genome-wide variant array data (11), there were a total of 1439 CRC cases and 720 controls of European ancestry available for subsequent imputation. These data were used as a reference imputation panel for the discovery and replication GWAS datasets.

GWAS genotype data and quality control

In brief, genotyped variants were excluded based on call rate (< 98%), lack of Hardy–Weinberg Equilibrium in controls (HWE, P < 1 × 10− 4), and low minor allele frequency (MAF < 0.05). We imputed the autosomal variants of all studies to an internal imputation reference panel derived from whole genome sequencing (described above). We employed a two-stage imputation strategy (Howie et al. 2012) where entire chromosomes were first pre-phased using SHAPEIT2 (Delaneau et al. 2013), followed by imputation using minimac3 (Das et al. 2016). Only variants with an imputation quality R2 > 0.3 were included for subsequent analyses.

Imputation of genetically regulated gene expression in study cohort

Jointly measured genome variant data and transcriptome data sets were used by Gamazon et al. to develop additive models of gene expression levels. The weights for the estimation were downloaded from the publicly available database (http://hakyimlab.org/predictdb/). We used these models to estimate genetically regulated expression of genes in colon transverse, and whole blood. These estimates represent multi-variant prediction of tissue-specific gene expression levels.

In-depth details of the reference cohort, datasets, and model building have previously been described (Gamazon et al. 2015). To summarize, jointly measured genome-wide genotype data and RNA-seq data were obtained from two different projects: (1) the DGN cohort (Battle et al. 2014) (whole blood, n = 922) and (2) GTEx (GTEx Consortium 2015) (transverse colon, n = 169), predominantly of European ancestry. Gamazon et al. used approximately 650,000 variants with MAF > 0.05 to impute non-genotyped dosages using the 1000G Phase 1 v3 reference panel variants with MAF > 0.05 and imputation R2 > 0.8 was retained for subsequent model building. In each tissue, Gamazon et al. normalized gene expression by adjusting for sex, the top 3 principal components (derived from genotype data) and the top 15 PEER factors (to quantify hidden experimental confounders). These genomic and transcriptomic data sets were used to train additive models of gene expression levels with elastic net regularization (Gamazon et al. 2015). The model can be written as

$${Y_g}=\mathop \sum \limits_{k} {w_{k,g}}{X_k}~+~\varepsilon ,$$
(1)

where Yg is the expression trait of gene g, wk,g is the effect size of genetic marker k for g, Xk is the number of reference variant alleles of marker k and ε is the contribution of other factors influencing gene expression. The effect sizes (wk,g) in Eq. (1) were estimated using the elastic net penalized approach. The summation in Eq. 1 is referred to as the genetically determined component of gene expression. The variant models (weights, w_k,g) were downloaded from the publicly available database (http://hakyimlab.org/predictdb/).

The heritability of gene expression was used to estimate how well the variant models predict gene expression levels. The narrow-sense heritability for each gene was calculated by Gamazon et al. (2015), using a variance-component model with a genetic relationship matrix (GRM) estimated from genotype data, as implemented in GCTA (Yang et al. 2011). The proportion of the variance in gene expression explained by these local variants was calculated using a mixed-effects model (Torres et al. 2014; Gamazon et al. 2015). This heritability was highly correlated with the predictive R2 (The cross-validated R2 value found when training the model). Only genes with R2 ≥ 0.01(≥ 10% correlation between predicted and observed expression) were tested for association with CRC. Furthermore, this analysis focused on the component of heritability driven by variants in the vicinity (1 Mb) of each gene (cis-variants) because the component based on distal variants could not be estimated with enough accuracy to make meaningful inferences.

Genotypes were treated as continuous variables (dosages). Using the variant weights provided by Gamazon et al. we estimated the genetically regulated gene expression (GReX) of each gene g

$${\text{GReX}}=~\mathop \sum \limits_{k} {w_{k,g}}{x_k},$$
(2)

where wk is the single-variant coefficient derived by regressing the gene expression trait Y on variant Xk using the reference transcriptome data. To address linkage disequilibrium among variant predictors, Gamazon et al. (2015) used the variable selection method to select a sparser set of (less correlated) of predictors. Specifically, variant weights (wk) were derived using elastic net with the R package glmnet with α = 0.5. These weights are available from http://hakyimlab.org/predictdb/. Using Eq. 2, and the reference variant predictor weights (wk,g), the (unobserved) genetically determined expression of each gene g (GReX) was estimated in our GWAS sample. For both transcriptome models, separate analyses were performed for genetically based expression of genes (up to 2 tests per gene). Genes with predictive R2 > 0.01 were tested for association with CRC in our cohort (colon transverse n = 4850 genes, and whole blood n = 8277 genes).

Gene level tests of CRC association with imputed genetically regulated gene expression

Discovery phase

Statistical analyses of all data were conducted centrally at the GECCO coordinating center on individual-level data. Multivariate logistic regression models were adjusted for age, sex (when appropriate), center (when appropriate), and genotyping batch (ASTERISK) and the first four principal components to account for potential population substructure. Imputed genetically regulated gene expression (GREx), was treated as a continuous variable. All studies were analyzed together in a pooled dataset using logistic regression models to obtain odds ratios (ORs) and 95% confidence intervals (CIs). Quantile–quantile (QQ) plots were assessed to determine whether the distribution of the P was consistent with the null distribution (except for the extreme tail). All analyses were conducted using the R software (Version 3.0.1). Novelty of a gene finding was determined by taking all variant predictors of the gene and determining if they were in linkage disequilibrium (LD ≥ 0.2 in Phase 3 Thousand Genomes Europeans) with a previously reported GWAS index variant.

$$\begin{aligned} {\text{logit}}({p_{{\text{CRC}}}})&={\beta _0}+~{\beta _1}{\text{GReX}}~+{\beta _2}{\text{age}}+~{\beta _2}~{\text{sex}}+{\beta _3}~{\text{center}}\\&\quad +~{\beta _4}~{\text{batch}}~+{\text{PC}}1+{\text{PC}}2+{\text{PC}}3+{\text{PC}}4.\end{aligned}$$
(3)

We identified suggestive findings in the discovery stage to be replicated in a second independent dataset. In the discovery stage we employed a false-discovery rate (FDR) threshold of 0.2 separately for colon transverse and whole blood models. FDR for each gene was calculated using the R statistical package p.adjust, which uses the method of Benjamini and Hochberg to calculate the expected proportion of false discoveries amongst the rejected hypotheses (Hochberg and Benjamini 1990). Genes meeting this threshold were carried forward for replication.

Replication phase

To replicate novel PrediXcan findings (n = 3 genes from colon transverse models) that had a FDR ≤ 0.2, we used the same GTEx colon transverse, elastic net prediction models (as we had done in the discovery GECCO-CCFR data) to impute genetically regulated gene expression in replication samples from (1) CORECT (pooled across consortium studies), (2) UK Biobank and (3) a pooled dataset of 5 independent GWAS datasets. Multivariate logistic regression was used to test the association of imputed genetically regulated gene expression with colorectal cancer risk in these three datasets and then meta-analyzed effects using inverse variance weighting of Z scores (details provided in Online Resource 1). A two-sided P value less than 0.05/(number of genes to be replicated) was considered statistically significant.

Definition of CRC risk regions and refinement of GWAS loci

The 56 previously reported CRC risk variants used in this analysis had an LD r2 ≤ 0.2 with other risk variants in our known list, or were otherwise previously reported to maintain statistical significance in regression models conditioning on other nearby risk variants (referred to hereon as ‘independent’ risk variants). For each of the 56 independent risk variants defined in Table S3, we further defined ‘risk regions’ as 1 megabase (Mb) upstream and 1 Mb downstream of each risk variant (2 Mb regions surrounding each risk variant). Overlapping 2 Mb risk regions were then combined into a single new risk region defined as the minimum and maximum chromosomal coordinates from one or more overlapping risk regions (the union of the overlapping regions). This resulted in a total of 44 regions harboring one or more risk variants (maximum of four independent risk variants). A list of transcription start sites (TSS) for genes that showed nominal association (P ≤ 0.05) between genetically regulated gene expression and CRC risk in colon transverse and whole blood models was then intersected with the list of 44 risk regions to identify a list a putative target genes regulated by non-coding GWAS risk variants.

Bioinformatic follow-up

Bioinformatic follow-up was performed for the TRIM4 and PYGL loci using the UCSC Genome Browser and publicly available functional data for CRC relevant tissues and cell-types from Roadmap, ENCODE, as well as previously published epigenomes (Akhtar-Zaidi et al. 2012). The TRIM4 and PYGL loci were defined as the genomic region containing all variants in LD (r2 ≥ 0.2 from Phase 3 Thousand Genomes Project) with the variant predictor having the strongest marginal CRC association (TRIM4-rs2527886 and PYGL-rs12589665). We then aligned the locus with refseq protein coding genes, epigenetic signals in normal crypts and CRC cell lines to identify recurrently gained and lost variant enhancer loci (VEL), and ChIP-seq transcription factor binding sites.

URLs

PrediXcan software, https://github.com/hakyimlab/PrediXcan; University of Michigan Imputation-Server, https://imputationserver.sph.umich.edu/start.html; GTEx Portal, http://www.gtexportal.org/; PredictDB, http://predictdb.org/.