Article Text

Original article
Deep resequencing of 131 Crohn's disease associated genes in pooled DNA confirmed three reported variants and identified eight novel variants
  1. Sung Noh Hong1*,
  2. Changho Park2,3*,
  3. Soo Jung Park4,
  4. Chang Kyun Lee5,
  5. Byong Duk Ye6,
  6. You Sun Kim7,
  7. Seungbok Lee2,3,8,
  8. Jeesoo Chae2,3,
  9. Jong-Il Kim2,3,8,9†,
  10. Young-Ho Kim1†,
  11. IBD Study Group of the Korean Association for the Study of Intestinal Diseases (KASID)
  1. 1Department of Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Korea
  2. 2Department of Biomedical Sciences, Seoul National University Graduate School, Seoul, Korea
  3. 3Department of Biochemistry and Molecular Biology, Seoul National University College of Medicine, Seoul, Korea
  4. 4 Department of Internal Medicine and Institute of Gastroenterology, Yonsei University College of Medicine, Seoul, Korea
  5. 5Department of Internal Medicine, Kyung Hee University School of Medicine, Seoul, Korea
  6. 6Department of Gastroenterology and Inflammatory Bowel Disease Center, Asan Medical Centre, University of Ulsan College of Medicine
  7. 7Department of Internal Medicine, Seoul Paik Hospital, Inje University College of Medicine, Seoul, Korea
  8. 8Medical Research Center, Genomic Medicine Institute (GMI), Seoul National University, Seoul, Korea
  9. 9Cancer Research Institute, Seoul National University College of Medicine, Seoul, Korea
  1. Correspondence to Professor Y-H Kim, Department of Medicine, Samsung Medical Centre, Sungkyunkwan University School of Medicine, Seoul 135-710, Republic of Korea; younghokim{at}skku.edu Professor J-I Kim, Department of Biochemistry and Molecular Biology, Seoul National University College of Medicine, Seoul 110-799, Republic of Korea; jongil{at}snu.ac.kr

Abstract

Objective Genome wide association studies (GWAS) and meta-analyses for Crohn's disease (CD) have not fully explained the heritability of CD, suggesting that additional loci are yet to be found and that the known loci may contain high effect rare risk variants that have thus far gone undetected by GWAS. While the cost of deep sequencing remains too high to analyse many samples, targeted sequencing of pooled DNA samples allows the efficient and cost effective capture of all variations in a target region.

Design We performed pooled sequencing in 500 Korean CD cases and 1000 controls to evaluate the coding exon and 5′ and 3′ untranslated regions of 131 CD associated genes. The identified genetic variants were validated using genotyping in an independent set of 500 CD cases and 1000 controls.

Results Pooled sequencing identified 30 common/low single nucleotide variants (SNVs) in 12 genes and 3 rare SNVs in 3 genes. Our results confirmed a significant association of CD with the following previously reported risk loci: rs3810936 in TNFSF15 (OR=1.83, p<2.2×10−16), rs76418789 in IL23R (OR=0.47, p=1.14×10−8) and rs2241880 in ATG16L1 (OR=1.30, p=5.28×10−6). In addition, novel loci were identified in TNFSF8 (rs3181374, OR=1.53, p=1.03×10−14), BTNL2 (rs28362680, OR=1.47, p=9.67×10−11), HLA-DQA2 (rs3208181, OR=1.36, p=4.66×10−6), STAT3 (rs1053004, OR=1.29, p=2.07×10−5), NFKBIA (rs2273650, OR=0.80, p=3.93×10−4), NKX2-3 (rs888208, OR=0.82, p=6.37×10−4) and DNAH12 (rs4462937, OR=1.13, p=3.17×10−2). A novel rare SNV, rs200735402 in CARD9, was shown to have a protective effect (OR=0.09, p=5.28×10−5).

Conclusions Our deep resequencing of 131 CD associated genes confirmed 3 reported risk loci and identified 8 novel risk loci for CD in Koreans, providing new insights into the genetic architecture of CD.

  • GENETIC POLYMORPHISMS
  • CROHN'S DISEASE

Statistics from Altmetric.com

Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.

Significance of this study

What is already known on this subject?

  • Genome wide association studies (GWAS) and meta-analyses for Crohn's disease (CD) have not fully explained the heritability of CD, suggesting that additional loci are yet to be found and that the known loci may contain high effect rare risk variants that have thus far gone undetected by GWAS.

  • While the cost of deep sequencing remains too high to analyse many samples, targeted sequencing of pooled DNA samples allows the efficient and cost effective capture of all variations in a target region.

What are the new findings?

  • Deep resequencing of 131 CD associated genes in pooled DNA confirmed 3 previously reported risk loci (rs3810936 in TNFSF15, rs76418789 in IL23R and rs2241880 in ATG16L1) and identified 8 novel risk loci (rs3181374 in TNFSF8, rs28362680 in BTNL2, rs3208181 in HLA-DQA2, rs1053004 in STAT3, rs2273650 in NFKBIA, rs888208 in NKX2-3, rs4462937 in DNAH12 and rs200735402 in CARD9).

How might it impact on clinical practice in the foreseeable future?

  • Although genetic variants in TNFSF15, IL23R and ATG16L1 that are known to be associated with CD in Caucasians also contribute to CD susceptibility in Asians, ethnicity dependence of the major susceptibility genes may exist.

  • Comprehensive efforts will be required to determine the overlapping and ethnicity specific CD susceptibility variants between Asians and Caucasians to interpret the pathogenesis of CD and to identify potential therapeutic targets of CD.

Introduction

Crohn's disease (CD) is a chronic idiopathic IBD. Although the exact pathophysiology is unknown, CD is thought to be caused by dysregulated mucosal immune responses to the gut flora in genetically susceptible individuals. Genome wide association studies (GWAS) and the subsequent meta-analyses for CD have identified more than 140 susceptible loci in Caucasians; however, the heritability of CD has not yet been fully explained, suggesting that additional loci remain to be identified and that the known loci may contain high effect rare risk variants that have not yet been detected by GWAS.1 ,2 Although the sequencing accuracy for each individual nucleotide is extremely high, the numerous nucleotides in the genome means that if an individual single nucleotide variant (SNV) in a certain gene is only sequenced once, then a significant number of sequencing errors will exist. Furthermore, because rare SNVs are common in the genome, increasing the sequencing accuracy even further by sequencing individual genes multiple times is necessary for distinguishing between sequencing errors and true rare SNVs.3 ,4 Therefore, deep sequencing in a target gene can be employed as a follow-up to GWAS to extensively sequence the regions surrounding the associated SNVs. In this manner, the full extent of allelic variation might be recovered—for example, in an 8 to 10 kb region of linkage disequilibrium (LD) or among many candidate genes.5 However, the cost of deep sequencing remains too high to analyse many samples whereas targeted sequencing of pooled DNA samples can allow researchers to efficiently and cost effectively capture all of the variations in a target region.3–5

CD is more common among Caucasians than non-Caucasian populations; however, a rapid increase in the incidence of CD has occurred in Asian populations, including Koreans, during the past two decades.6 The genetic susceptibility of Asian CD patients allegedly differs substantially from that of Caucasian CD patients.6–12 Among the well established CD susceptibility genes, NOD2 failed to be replicated in Asians,6–8 ,13–15 and reports regarding ATG16L16 ,9 ,16 ,17 and IL23R9 ,10 ,18 ,19 were inconsistent. However, previous studies in Asians tested a limited number of candidate SNVs in small samples and applied fairly low calling GWAS thresholds for detecting the risk variants and imperfect LD between the unobserved causal variants and the genotyped SNVs.20 Considering the low incidence of CD and the scarcity of biobanks in Asia, the deep sequencing of regions implicated by GWAS and subsequent meta-analyses may be effective ways to identify unrevealed genetic variations in Asians with CD.

We selected 131 CD associated genes containing GWAS or meta-analysis identified loci that are known to be associated with an increased or decreased risk for IBD and that are involved in the molecular pathways implicated in the pathophysiology of CD. We performed deep resequencing across the coding exons and the 5′ and 3′ untranslated regions (UTRs) of 131 CD associated genes using a pooled method to identify genetic susceptibility variations for CD.

Methods

Study population

All of the participants were of Korean descent and all of the case samples were obtained from the Korea Research Network for CD.21 Control samples were obtained from an Ansan-Sihwa population cohort, a Korean twin cohort of Seoul National University, and a health checkup population of Seoul National University Hospital, Seoul, Korea. CD was diagnosed based on the diagnostic guidelines of CD in Korea that were published by the IBD Study Group of the Korean Association for the Study of Intestinal Diseases (KASID) as follows:22 (1) clinical manifestation: history of abdominal pain, weight loss, malaise, diarrhoea and/or rectal bleeding for at least 6 weeks; (2) endoscopic, radiological or surgical findings: linear ulceration, mucosal cobblestoning, aphthous ulceration, stricture, fistula, skip areas and/or perianal disease; (3) pathological findings: transmural inflammation, epithelioid granulomas and/or chronic inflammation composed of increased lamina propria plasma cells and lymphocytes in association with chronic architectural distortion with patchy, mild to severe neutrophilic inflammation, including neutrophilic cryptitis, crypt abscesses, erosions or ulcers. All of the patients and control subjects provided written informed consent. The recruitment protocols and consent forms were approved by the institutional review boards at each participating institution.

In total, 1000 CD patient samples were randomly assigned to 500 cases in the screening group for the pooled sequencing, and 500 were randomly assigned to the validation group for TaqMan single nucleotide polymorphism (SNP) genotyping (Life Technologies, Carlsbad, California, USA). In addition, 1000 unrelated healthy control samples were obtained from an Ansan-Sihwa population cohort for the pooled sequencing,23 and 1000 unrelated healthy control samples were obtained from a Korean twin cohort of Seoul National University (n=500), and a health checkup population of Seoul National University Hospital (n=500) for validation testing.

Power tests were performed using PS Power and Sample Size Calculations software (V.3.0; Nashville, Tennessee, USA). Power analysis showed that 500 cases and 1000 controls were sufficient to detect an OR of 1.38 for common variants, with a minor allele frequency (MAF) of 0.3 in controls, and an OR of 3.16 for rare variants, with a MAF of 0.01 in controls (power=80% and α=0.05) (see online supplementary figure S1).

DNA preparation for pooled sequencing

Genomic DNA was extracted from peripheral blood samples using QIAamp DNA blood maxi kits according to the manufacturer's protocol (Qiagen, Hilden, Germany). Case and control DNA samples that were to be pooled were quantified in triplicate by fluorometry using a Quant-iT PicoGreen dsDNA kit (Invitrogen, Carlsbad, California, USA) and a Wallac Victor 2V Multilabel Counter (Perkin-Elmer, Boston, Massachusetts, USA). The two concentrations that were closest to the median value of the three readings were selected, and an average was taken. The concentration was only used if the difference between the two reading values was less than 5% of the average concentration. Equimolar amounts of the samples that were obtained from the 500 CD cases and from the 1000 unrelated healthy controls were pooled in batches of 50 cases and 50 controls, with 30 pooled groups in total for sequencing.

Selection of the CD associated genes

For the pooled sequencing, we searched for genes that are related to CD based on data from LD analyses,14 ,24–26 GWASs,18 ,27–36 meta-analyses3 ,37–40 and HuGE Navigator that were published before December 2011.41 In total, 296 genes were identified. Replication studies performed in other populations, particularly those studies that focused on the Asian population, were searched in Medline, and their functions were reviewed by SNH, CKL, BDY, YSK and SJP. According to the results of the review, 296 genes were identified and classified as follows (see online supplementary table S1): category 1: genes replicated as CD susceptible genetic variants in other populations/ethnic groups; category 2: genes that have a hypothetical functional association with IBD; category 3: genes that have not been replicated in other populations and show a low possibility of hypothetical functional association with CD; and category 4: unclassifiable genes. One hundred and thirty-one genes that were classified as category 1 or 2 were referred to as CD associated genes and were selected as targets for pooled sequencing.

Target enrichment design

The custom capture probes targeted the entire exonic regions based on five gene databases (RefSeq (NCBI Reference Sequence database), Ensembl, CCDS, GENCODE and VEGA) and were designed using Agilent SureDesign software (https://earray.chem.agilent.com/suredesign). The following parameters were used for the capture design sequences: density, 2×; masking, moderately stringent; and boosting, balanced. The target exonic sequences were selected based on the coding exons, and the 5′ and 3′ UTRs (50 bases from the 3′ end and 50 bases from the 5′ end) of the 131 CD candidate genes were composed of 2017 regions (region size: 933.460 kb). In total, 15 214 probes (total size: 962.178 kb; coverage: 97.4%) for each DNA pool were combined in equimolar amounts to obtain equal representation of all of the targets in the library construction.

Sequencing and variant discovery

DNA (3 µg per pooled sample) was sheared using a Covaris S2 sonicator (Covaris, Woburn, Massachusetts, USA). Target capture was performed using the standard Agilent SureSelect protocol (Agilent Technologies, Santa Clara, California, USA). Each pooled sample was sequenced on an Illumina HiSeq 2000 system with a read length of 2×101 bp. The 101 bp reads were aligned to the human reference genome hg19 using Bowtie 2 software (V.2.1.0). Duplicates were marked using Picard Tools (V.1.95), and the base qualities were recalibrated using GATK (Genome Analysis Toolkit, V.2.8-1). Finally, variant discovery was performed using CRISP (a multi-sample variant caller for high throughput pooled sequence data, version released on 27 December 2013), which was designed to analyse the sequencing data from the pooled DNA sequencing. Functional annotation of the genetic variants from the high throughput sequencing data was performed using the software tool ANNOVAR, and association analysis was performed using R package (V.3.0.1) for the p value and the false discovery rate adjusted p value (q value).

Replication study

To validate the SNPs identified by pooled sequencing, we performed ABI TaqMan SNP Genotyping Assays (Applied Biosystems ViiA 7 Real-Time PCR System, San Francisco, California, USA) on peripheral blood samples that were obtained from 500 CD patients and from 1000 unrelated healthy controls, which were independent of the samples that were used for the pooled sequencing. Genomic DNA was extracted using a QIAamp DNA blood maxi kit according to the manufacturer's protocol (Qiagen). DNA concentration was measured using a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, Delaware, USA). Reactions (5 µL) were prepared in 384 well plates, and the assays were performed and analysed according to the manufacturer's recommendations. All of the genotyped SNPs had an average genotyping call rate >95%. Control SNPs with a Hardy–Weinberg equilibrium p value<0.01 were excluded. Quality control and association analyses were performed using PLINK software (V.1.07). Determination of the LD between two SNPs was calculated using Haploview software (V.4.2) and SNP Annotation and Proxy Search (SNAP) V.2.2. The p values and ORs of the discovery and replication studies were calculated by allelic association tests with R package (V.3.0.1) and PLINK software (V.1.07). The combined p values and ORs were calculated by the Cochran–Mantel–Haenszel test using PLINK software (V.1.07).

Results

Pooled sequencing

A total of 500 CD samples and 1000 healthy control samples were pooled in batches of 50 cases or 50 controls and normalised such that the DNA pool reflected the sample allele frequencies. A target enrichment covering the entire exonic sequence and the 5′ and 3′ UTRs of the 131 CD associated genes was successfully designed to capture 15 214 probes (962 kb total) and 97.4% of the target sequences. This sequencing yielded much high quality data for each pool, which achieved 634.21-fold median coverage per pool (corresponding to 12.68-fold per sample). Variant discovery and pooled sequencing data analysis are shown in figure 1. In total, 6378 calls were detected, and 2580 calls showed a quality score >300 (figure 1).

Figure 1

Genetic variants discovery and pooled sequencing data analysis. MAF, minor allele frequency, SNVs, single nucleotide variants; UTR, untranslated region.

Common and low SNVs

A total of 530 calls had an MAF ≥0.01 for the case and/or control, and a quality score>20 000, and the calls were located in an exon, 5′ or 3′ UTR. Thirty variants in 12 genes showed statistical significance with p<0.01 and q<0.2 (see online supplementary table S2). Replication genotyping was performed to validate the SNVs that were identified via pooled sequencing. rs3848672 in RTEL1 was excluded from the replication genotyping because of its connection to rs61736615 and because a TaqMan probe could not be designed (see online supplementary figure S2).

When a CD susceptible gene identified by pooled sequencing included more than one SNV, replication genotyping was performed for a single representative SNV. The representative SNP was non-synonymous in the coding region and had a lower p value. Replication genotyping was performed for the following 11 representative SNVs: rs3810936 in TNFSF15, rs3181374 in TNFSF8, rs76418789 in IL23R, rs28362680 in BTNL2, rs2241880 in ATG16L1, rs3208181 in HLA-DQA2, rs1053004 in STAT3, rs2273650 in NFKBIA, rs888208 in NKX2-3, rs4462937 in DNAH12 and rs45450798 in PTPN2. However, rs45450798, which is the 3′ UTR SNV of PTPN2, deviated from Hardy–Weinberg equilibrium (p<0.001) and was excluded.

The results of pooled sequencing, replication genotyping and combined analysis of the CD susceptible common/low SNVs are shown in table 1. As a representative SNV of TNSF15, we selected the synonymous SNV, rs3810936, which is a well established CD susceptible SNV in both Caucasian and Asian populations.12 ,42 ,43 In our study, rs3810936 showed the strongest association with an increased risk for CD in the replication genotyping (OR=1.83, 95% CI 1.48 to 2.05, p<2.2×10−16). The pooled sequencing for the TNFSF8 gene identified 10 variants in the 3′ UTR that reached the significance threshold. We selected rs3181374 as the representative SNV because a lower p value was reported when we analysed the pooled sequencing data using a previous version of CRISP. During the replication study, a new version of CRISP was released, and we described the analysed results in this study using the new CRISP version. rs3181374 showed a significant association with an increased risk for CD (OR=1.83, 95% CI 1.48 to 2.05, p<2.2×10−16). Seven variants in the BTNL2 gene reached the significance threshold, and we selected the non-synonymous SNV rs28362680, which was used to identify rheumatoid arthritis susceptible variants in recent exome sequencing,44 as a representative SNV. rs3181374 showed a significant association with an increased risk for CD (OR=1.47, 95% CI 1.31 to 1.65, p=9.67×10−11). Two variants in the ATG16L1 gene reached the significance threshold, and rs2241880, which is a well established CD susceptible non-synonymous variant that is associated with autophagy in Caucasians,16 ,29 ,45 was selected as a representative SNV. The OR of the G allele in rs2241880 was 1.30 (95% CI 1.16 to 1.46, p=5.28×10−6). Two variants in the 3′ UTR of the STAT3 gene reached the significance threshold. The distance between these variants is only 346 bp, and we selected rs1053004 as the representative SNV because of its lower p value. rs1053004 is a novel CD susceptible SNP that was validated with an OR of 1.30 (95% CI 1.16 to 1.46, p=5.28×10−6).

Table 1

Common and low frequency single nucleotide variants associated with Crohn's disease

The remaining five genes had single variants in their exons, 3′ and 5′ UTRs. rs76418789 in IL23R, which was a previously reported risk loci in a case control study for Korean patients,19 was validated as a CD susceptible SNV (OR=0.47, 95% CI 0.36 to 0.61, p=1.14×10−8). rs3208181, which is a synonymous variant in HLA-DQA2, was found to be a novel CD susceptible locus in this study (OR=1.36, 95% CI 1.19 to 1.55, p=4.66×10−6). In addition, we identified rs2273650, which is a 3′ UTR variant in NFKBIA, as a novel CD protective variant (OR=0.80, 95% CI 0.70 to 0.92, p=3.93×10−4). rs888208 in NKX2.3, which was reported to be an ulcerative colitis susceptible SNV,46 was associated with a decreased risk for CD in the combined analysis (OR=0.82, 95% CI 0.73 to 0.92, p=6.37×10−4), despite only showing a trend in the replication genotyping. rs4462937 in DNAH12 was not significant in the replication study; however, the combined analysis showed a significant association with an increased risk for CD (OR=1.13, 95% CI 1.01 to 1.26, p=3.17×10−2).

Rare SNVs

Among the 5854 calls that passed the CRISP filter, 20 calls exhibited MAFs <0.01 for the case and/or control, a non-synonymous frameshift insertion or deletion, and stop-gain SNVs (see online supplementary table S4). Three SNVs in three genes, rs56118985 in JAK2, rs199858111 in NOD2 and rs200735402 in CARD9, showed statistical significance (p<0.02 and q<0.2, see online supplementary table S5). A replication study was performed, and among 500 CD patients and 1000 unrelated healthy controls, only rs200735402 in CARD9 showed significant differences, with an OR of 0.09 (95% CI 0.22 to 0.37, p=5.28×10−5) (table 2). We also performed an in silico analysis to estimate the pathogenicity of rs200735402 (E221K) using SIFT (http://sift.jcvi.org) and PolyPhen-2 software (http://genetics.bwh.harvard.edu/pph2). The SIFT (score 0.002) and PolyPhen-2 (score 1) results of the rare variant on rs200735402 predicted that this variant is likely to be damaging on the protein function or structure level. The results of the in silico analysis showed that the damaging effect on the CARD9 protein may have a functionally protective role in CD.

Table 2

Rare frequency single nucleotide variants associated with Crohn's disease

Discussion

Sequencing depth and coverage are key considerations in genomic analyses.47 However, sequencing costs often limit the amount of sequences that can be generated and, consequently, the biological outcomes that can be achieved from an experimental design. Deep resequencing of a target gene using pooled DNA samples is a fast, efficient and cost effective strategy to detect unrevealed variants that are associated with complex phenotypes in large cohorts.4 A recent pooled deep resequencing analysis of 56 target CD susceptible genes identified potentially causal variants in 8 of the 36 loci examined.4 In this study, we targeted 131 CD associated genes using pooled sequencing. Deep resequencing of 131 CD associated genes in the pooled DNA confirmed 3 previously reported risk loci (see online supplementary table S6) and identified 8 novel risk loci in Korean.

NOD2, which is the gene with the greatest risk for CD in Caucasians, has been identified as a potent autophagy inducer.13 ,14 ,17 Pooled sequencing identified rs199858111 in NOD2 as a rare novel risk SNV; however, the replication study could not call the genotypes for this rare missense variant. The genetic risk of rs199858111 in NOD2 for CD still needs to be validated in larger populations. The well established CD susceptible variants in NOD2 could not be identified in our study, similar to a previous Asian study. Interestingly, TNFSF15 showed the greatest risk for CD in Asians,2 ,12 and rs76418789 in IL23R was identified and replicated in the Korean population.2 ,19 The genetic risk of rs2241880 in ATG16L1 is well known in Caucasian populations16 ,29 ,45 but not in Asian populations.2 A recent letter and our study showed that rs2241880 in ATG16L1 contributes to CD susceptibility in Asian.17 Although genetic variants in TNFSF15, IL23R and ATG16L1 that are known to be associated with CD in Caucasians also contribute to CD susceptibility in Asians,12 ,17 ,19 ethnicity dependence of the major susceptibility genes may exist.

Among the eight identified novel risk loci, rs3181374, which is located in the 3′ UTR of TNFSF8, showed the greatest risk for CD. The distance between rs3810936 and rs3181374 was only 112 kb; however, no meaningful difference in the LD pattern was observed between the two SNVs compared with the JPT/CHB panel of HapMap (see online supplementary figure S3). TNFSF8 encodes a ligand for TNFRSF8/CD30 (CD30L), which is expressed on effector CD4+ T cells and which plays a critical role in the maintenance and activation of interleukin 17A producing γδ T cells.48 Positional candidate gene screening identified an association between a rare SNV located in TNFSF8 and spondyloarthritis.49 The genetic risk of rs3181374 in TNFSF8 for CD still needs to be validated in other populations.

The BTNL2 gene encodes an immune receptor that is involved in costimulation and that is associated with some Th1 dominated granulomatous diseases, such as sarcoidosis and tuberculoid leprosy.50 ,51 Recent exome sequencing of rheumatoid arthritis patients revealed that rs28362680 in BTNL2 gene is a rheumatoid arthritis susceptible variant.44 We found that rs28362680 in BTNL2 is associated with an increased risk for CD. The HLA-DQA2 protein binds peptides that are derived from antigens that are taken up by the endocytic route of antigen presenting cells and presents these peptides on the cell surface for recognition by CD4+T cells. The human major histocompatibility complex (MHC) represents the strongest susceptibility locus for autoimmune diseases. The International MHC and Autoimmunity Genetics Network previously reported three intronic SNPs in HLA-DQA2 (rs9276431, rs2213567 and rs2213568).52 We identified a novel but synonymous SNV, rs3208181 in HLA-DQA2. The STAT3 locus was significantly associated with CD susceptibility in a previous GWAS.34 Since then, several studies have demonstrated that STAT3 polymorphisms are associated with IBD; however, the results are inconsistent among different population cohorts.34 ,53 We identified the novel SNV rs1053004, which is located in the 3′ UTR of STAT3. A polymorphism of the NFKBIA gene is associated with CD patients lacking a predisposing allele of the CARD15/NOD2 gene.54 rs2273650 was reported as a psoriatic arthritis susceptible loci in a Canadian population.55 We identified that rs2273650 in NFKBIA is associated with an increased risk for CD. In contrast, rs888208 in NKX2-3 and rs4462937 in DNAH12 were not significant in the replication study; however, the combined analysis showed a significant association with an increased risk for CD. Further replication studies remain required.

In addition, we identified a rare non-synonymous variant, rs200735402 in CARD9. The CARD9 protein contains a protein interaction domain that is known to participate in the activation or suppression of CARD containing members of the caspase family and thus plays important roles as part of the innate immune response. CARD9 mediates signals from the pattern recognition receptor Dectin-1 to downstream signalling pathways, such as nuclear factor κB, thus activating proinflammatory and anti-inflammatory cytokines and subsequently inducing an appropriate innate and adaptive immune response for the efficient clearance of the infection.56 CARD9 shows a significant association with CD, which suggests that this locus may be a causal genetic variant.

Genetic risk evaluations have been widely performed for Caucasians but are limited in most Asian populations because of the lack of well organised patient cohorts and biobanks. In Korea, the KASID has recently taken a leading role in establishing the organisational structural basis for investigating IBD and has performed a series of nationwide studies in Korea.57 Members of KASID from each province of the nation have participated in multicentre studies and have registered patients with IBD. However, despite power analysis, the number of cases and controls remains smaller compared with Western studies based on the large scale Caucasian biobanks. The power limitations of our sample size limited our ability to draw firm conclusions that a particular variant identified in the discovery panel is not associated with a risk of CD. Because deep resequencing has an advantage for detecting missed rare variants with protein damaging impact, additional controls may be needed; however, publically available whole exomic sequencing datasets from healthy Korean populations are lacking. Therefore, we identified high impact variants, such as stop-gain/loss and frameshift insertion/deletion, in our pooled sequencing data and compared our high impact variants with those from the public Exome Aggregation Consortium (ExAC) database. One stop-gain SNV in TLR5 (Arg392Stop, rs5744168) seemed to have a weakly protective effect in our data. The MAFs of this SNV in Europeans and South Asians, as well as in East Asians, are relatively higher than that in our case. This nonsense polymorphism is known to be negatively associated with CD (table 3).58

Table 3

Putative high impact variants (stop-gain/loss and frameshift insertion/deletion) identified by pooled sequencing

GWAS have identified many CD susceptibility loci in CD. The majority of the loci identified in GWASs are in the non-coding intergenic and intronic regions. However, we focused more on the coding regions. Therefore, we needed to determine whether the coding SNVs identified in our study were likely causal or were in LD with more strongly associated variants in adjacent non-coding regions. Our targeted pooled resequencing of the exons of CD associated genes did not identify strongly associated variants in adjacent non-coding regions. Therefore, we performed an LD analysis of the identified SNVs with the SNVs located on adjacent non-coding regions that were identified in previous GWASs (see online supplementary table S7). In the LD analysis, rs12994997,59 rs3792109,60 rs382830934 and rs10210302,61 which were found in the non-coding region of the ATG16L1 gene, were reported in a Japanese (JPT) and Han Chinese (CHB) population in the 1000 Genomes Database, and these SNVs are all in strong LD (R2>0.8) with rs2241880. Therefore, we performed combined annotation dependent depletion analysis on these variants, and rs2241880 was predicted to be in the top 10% most deleterious substitutions in the human genome. In addition, rs2241880 is located on an exon whereas the other SNPs are located on intronic and intergenic regions. These results suggest that rs2241880, which was identified in this study, may likely be a meaningful variant compared with the other SNPs in the adjacent non-coding regions. However, additional functional studies are needed to confirm that rs2241880 is the causal variant (table 4).

Table 4

Combined annotation dependent depletion analysis of rs2241880 on 2q37.1 and of the adjacent single nucleotide variants

In this study, we performed deep resequencing of 131 candidate CD associated genes in a Korean population. Using cost effective pooled sequencing and a subsequent validation study, we confirmed three previously reported loci and identified eight novel loci associated with CD in Koreans. Although further validation studies and functional studies are required to confirm the roles of the newly discovered variants, the initial bioinformatics and underlying function of each gene suggest that these loci may have important roles in the development of CD. Comprehensive efforts will be required to determine the overlapping and ethnicity specific CD susceptibility variants between Asians and Caucasians to interpret the pathogenesis of CD and to identify potential therapeutic targets of CD.17

Acknowledgments

We would like to thank all of the participating patients and healthy donors who provided the DNA and clinical information necessary for this study. We acknowledge the use of case DNA collection from the Korea Research Network for Crohn's Disease (Drs Chang Hwan Choi, Hyun Joo Park, Won Moon, Suck-Ho Lee, Bo-In Lee, Jin Oh Kim, Kyu Chan Huh, Young Sook Park, Seok Won Jung, Kwang Ho Baik, Hiun Suk Chae, Geun Am Song, Sam Ryong Jee, Dong Soo Han, Eun Soo Kim, Sung Hee Jung, Chan-Guk Park, Eun Jeong Jang, Il Hyun Baek, Joo Sung Kim, Kang-Moon Lee, Jae Kyu Sung, Kyung Ho Kim, Yoon Tae Jeen, Byung Ik Jang, Hyun-Soo Kim, Jai Hyun Choi, Ja Seol Koo, Ji Hyun Lee, Sung-Ae Jung, Jae Hee Cheon, Suk-Kyun Yang, and Won Ho Kim). We acknowledge the use of control DNA samples from for an Ansan-Sihwa population cohort (Prof. Jeong-Sun Seo, Genomic Medicine Institute (GMI), Medical Research Center, Seoul National University), a health checkup population of Seoul National University Hospital (Prof. Belong Cho and Prof. Jin-ho Park. Department of Family Medicine, Seoul National University Hospital, Seoul National University College of Medicine) and a Korean twin cohort (Prof. Joohon Sung, Complex Disease and Genome Epidemiology Branch, Department of Epidemiology, School of Public Health, Seoul National University).

References

Supplementary materials

Footnotes

  • *SNH and CP contributed equally. †JIK and YHK contributed equally.

  • Contributors SNH: study design, acquisition of the data, interpretation of the data and drafting of the manuscript. CP: study design, DNA preparation, genotyping, acquisition of the data, analysis and interpretation of the data, and critical revision of the manuscript for important intellectual content. SJP, CKL, BDY and YSK: study design and acquisition of the data. SL and JC: analysis and interpretation of the data, statistical analysis and critical revision of the manuscript for important intellectual content. J-IK: study design, acquisition of the data, analysis and interpretation of the data, critical revision of the manuscript for important intellectual content and study supervision. Y-HK: study concept and design, acquisition of the data, analysis and interpretation of the data, critical revision of the manuscript for important intellectual content, study supervision and obtained funding.

  • Funding This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (No 2011-0016178) and the Korean Health technology R&D Project grant (A120176) by the Ministry of Health and Welfare, Republic of Korea.

  • Competing interests None.

  • Ethics approval The study was approved by the institutional review board of the Samsung Medical Centre. All samples were obtained with written informed consent under institutional review board approved protocols in each centre.

  • Provenance and peer review Not commissioned; externally peer reviewed.