Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer

Abstract

Association studies have linked microbiome alterations with many human diseases. However, they have not always reported consistent results, thereby necessitating cross-study comparisons. Here, a meta-analysis of eight geographically and technically diverse fecal shotgun metagenomic studies of colorectal cancer (CRC, n = 768), which was controlled for several confounders, identified a core set of 29 species significantly enriched in CRC metagenomes (false discovery rate (FDR) < 1 × 10⁻⁵). CRC signatures derived from single studies maintained their accuracy in other studies. By training on multiple studies, we improved detection accuracy and disease specificity for CRC. Functional analysis of CRC metagenomes revealed enriched protein and mucin catabolism genes and depleted carbohydrate degradation genes. Moreover, we inferred elevated production of secondary bile acids from CRC metagenomes, suggesting a metabolic link between cancer-associated gut microbes and a fat- and meat-rich diet. Through extensive validations, this meta-analysis firmly establishes globally generalizable, predictive taxonomic and functional microbiome CRC signatures as a basis for future diagnostics.

Fig. 1: Despite study differences, meta-analysis identifies a core set of gut microbes strongly associated with CRC.
Fig. 2: Co-occurrence analysis of CRC-associated gut microbial species reveals four clusters preferentially linked to specific patient subgroups.
Fig. 3: Both taxonomic and functional metagenomic classification models generalize across studies, in particular when trained on data from multiple studies.
Fig. 4: Meta-analysis identifies consistent functional changes in CRC metagenomes.
Fig. 5: Meta-analysis results are validated in three independent study populations.

Data availability

The raw sequencing data for the samples in the German study that have not been published before (see Methods) are available from the European Nucleotide Archive under study no. PRJEB27928. The metadata for these samples are available as Supplementary Table 6.

For the other studies included in the current study, the raw sequencing data can be found under the following European Nucleotide Archive identifiers: PRJEB10878 for Yu et al. (ref. 11); PRJEB12449 for Vogtmann et al. (ref. 10); ERP008729 for Feng et al. (ref. 9); and ERP005534 for Zeller et al. (ref. 8). The independent validation cohorts can be found in the Sequence Read Archive under the identifier no. SRP136711 for Thomas et al. (ref. 27) and in the DNA Data Bank of Japan database under identification no. DRA006684.

The filtered taxonomic and functional profiles used as input for the statistical modeling pipeline are available in Supplementary Data 1.

The code and all analysis results can be found under https://github.com/zellerlab/crc_meta.

References

  1. Tringe, S. G. & Rubin, E. M. Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. 6, 805–814 (2005).

  2. Tremaroli, V. & Bäckhed, F. Functional interactions between the gut microbiota and host metabolism. Nature 489, 242–249 (2012).

  3. Lynch, S. V. & Pedersen, O. The human intestinal microbiome in health and disease. N. Engl. J. Med. 375, 2369–2379 (2016).

  4. Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).

  5. Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).

  6. Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).

  7. Schirmer, M. et al. Dynamics of metatranscription in the inflammatory bowel disease gut microbiome. Nat. Microbiol. 3, 337–346 (2018).

  8. Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).

  9. Feng, Q. et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun. 6, 6528 (2015).

  10. Vogtmann, E. et al. Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS ONE 11, e0155362 (2016).

  11. Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).

  12. Bedarf, J. R. et al. Functional implications of microbial and viral gut metagenome changes in early stage L-DOPA-naïve Parkinson’s disease patients. Genome Med. 9, 39 (2017).

  13. Schmidt, T. S. B., Raes, J. & Bork, P. The human gut microbiome: from association to modulation. Cell 172, 1198–1215 (2018).

  14. Forslund, K. et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature 528, 262–266 (2015).

  15. Costea, P. I. et al. Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35, 1069–1076 (2017).

  16. Lozupone, C. A. et al. Meta-analyses of studies of the human microbiota. Genome Res. 23, 1704–1714 (2013).

  17. Duvallet, C., Gibbons, S. M., Gurry, T., Irizarry, R. A. & Alm, E. J. Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nat. Commun. 8, 1784 (2017).

  18. Shah, M. S. et al. Leveraging sequence-based faecal microbial community survey data to identify a composite biomarker for colorectal cancer. Gut 67, 882–891 (2018).

  19. Pasolli, E., Truong, D. T., Malik, F., Waldron, L. & Segata, N. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput. Biol. 12, e1004977 (2016).

  20. Dai, Z. et al. Multi-cohort analysis of colorectal cancer metagenome identified altered bacteria across populations and universal bacterial markers. Microbiome 6, 70 (2018).

  21. Maier, L. et al. Extensive impact of non-antibiotic drugs on human gut bacteria. Nature 555, 623–628 (2018).

  22. Milanese, M. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).

  23. Kultima, J. R. et al. MOCAT2: a metagenomic assembly, annotation and profiling framework. Bioinformatics 32, 2520–2523 (2016).

  24. Hothorn, T. et al. A Lego system for conditional inference. Am. Stat. 60, 257–263 (2006).

  25. Mandal, S. et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 26, 27663 (2015).

  26. Tjalsma, H., Boleij, A., Marchesi, J. R. & Dutilh, B. E. A bacterial driver-passenger model for colorectal cancer: beyond the usual suspects. Nat. Rev. Microbiol. 10, 575–582 (2012).

  27. Thomas, A. M. et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. https://doi.org/10.1038/s41591-019-0405-7 (2019).

  28. Huerta-Cepas, J. et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).

  29. Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, D199–D205 (2014).

  30. Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).

  31. Vieira-Silva, S. et al. Species–function relationships shape ecological properties of the human gut microbiome. Nat. Microbiol. 1, 16088 (2016).

  32. Hirayama, A. et al. Quantitative metabolome profiling of colon and stomach cancer microenvironment by capillary electrophoresis time-of-flight mass spectrometry. Cancer Res. 69, 4918–4925 (2009).

  33. Denkert, C. et al. Metabolite profiling of human colon carcinoma: deregulation of TCA cycle and amino acid turnover. Mol. Cancer 7, 72 (2008).

  34. Mal, M., Koh, P. K., Cheah, P. Y. & Chan, E. C. Metabotyping of human colorectal cancer using two-dimensional gas chromatography mass spectrometry. Anal. Bioanal. Chem. 403, 483–493 (2012).

  35. Weir, T. L. et al. Stool microbiome and metabolome differences between colorectal cancer patients and healthy adults. PLoS ONE 8, e70803 (2013).

  36. Goedert, J. J. et al. Fecal metabolomics: assay performance and association with colorectal cancer. Carcinogenesis 35, 2089–2096 (2014).

  37. Aykan, N. F. Red meat and colorectal cancer. Oncol. Rev. 9, 288 (2015).

  38. Diet, Nutrition, Physical Activity and Cancer: a Global Perspective. A Summary of the Third Expert Report (World Cancer Research Fund, 2018).

  39. Dutilh, B. E., Backus, L., van Hijum, S. A. & Tjalsma, H. Screening metatranscriptomes for toxin genes as functional drivers of human colorectal cancer. Best Pract. Res. Clin. Gastroenterol. 27, 85–99 (2013).

  40. Sears, C. L. & Garrett, W. S. Microbes, microbiota, and colon cancer. Cell Host Microbe 15, 317–328 (2014).

  41. Ridlon, J. M., Harris, S. C., Bhowmik, S., Kang, D. J. & Hylemon, P. B. Consequences of bile salt biotransformations by intestinal bacteria. Gut Microbes 7, 22–39 (2016).

  42. Yoshimoto, S. et al. Obesity-induced gut microbial metabolite promotes liver cancer through senescence secretome. Nature 499, 97–101 (2013).

  43. Ajouz, H., Mukherji, D. & Shamseddine, A. Secondary bile acids: an underrecognized cause of colon cancer. World J. Surg. Oncol. 12, 164 (2014).

  44. Boleij, A. et al. The Bacteroides fragilis toxin gene is prevalent in the colon mucosa of colorectal cancer patients. Clin. Infect. Dis. 60, 208–215 (2015).

  45. Wu, S. et al. A human colonic commensal promotes colon tumorigenesis via activation of T helper type 17 T cell responses. Nat. Med. 15, 1016–1022 (2009).

  46. Dejea, C. M. et al. Patients with familial adenomatous polyposis harbor colonic biofilms containing tumorigenic bacteria. Science 359, 592–597 (2018).

  47. Ridlon, J. M., Kang, D. J. & Hylemon, P. B. Isolation and characterization of a bile acid inducible 7α-dehydroxylating operon in Clostridium hylemonae TN271. Anaerobe 16, 137–146 (2010).

  48. Mallonee, D. H., White, W. B. & Hylemon, P. B. Cloning and sequencing of a bile acid-inducible operon from Eubacterium sp. strain VPI 12708. J. Bacteriol. 172, 7011–7019 (1990).

  49. Ocvirk, S. & O’Keefe, S. J. D. Influence of bile acids on colorectal cancer risk: potential mechanisms mediated by diet–gut microbiota interactions. Curr. Nutr. Rep. 6, 315–322 (2017).

  50. Gevers, D. et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 15, 382–392 (2014).

  51. Viennot, S. et al. Colon cancer in inflammatory bowel disease: recent trends, questions and answers. Gastroenterol. Clin. Biol. 33, S190–S201 (2009).

  52. Rubinstein, M. R. et al. Fusobacterium nucleatum promotes colorectal carcinogenesis by modulating E-cadherin/β-catenin signaling via its FadA adhesin. Cell Host Microbe 14, 195–206 (2013).

  53. Kostic, A. D. et al. Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment. Cell Host Microbe 14, 207–215 (2013).

  54. Arthur, J. C. et al. Intestinal inflammation targets cancer-inducing activity of the microbiota. Science 338, 120–123 (2012).

  55. Reddy, B. S. Diet and excretion of bile acids. Cancer Res. 41, 3766–3768 (1981).

  56. Ogino, S. et al. Integrative analysis of exogenous, endogenous, tumour and immune factors for precision medicine. Gut 67, 1168–1180 (2018).

  57. Ogino, S., Chan, A. T., Fuchs, C. S. & Giovannucci, E. Molecular pathological epidemiology of colorectal neoplasia: an emerging transdisciplinary and interdisciplinary field. Gut 60, 397–411 (2011).

  58. Hannigan, G. D., Duhaime, M. B., Ruffin, M. T. 4th, Koumpouras, C. C. & Schloss, P. D. Diagnostic potential and interactive dynamics of the colorectal cancer virome. MBio 9, e02248-18 (2018).

  59. zur Hausen, H. Red meat consumption and cancer: reasons to suspect involvement of bovine infectious factors in colorectal cancer. Int. J. Cancer 130, 2475–2483 (2012).

  60. Shkoporov, A. N. et al. Reproducible protocols for metagenomic analysis of human faecal phageomes. Microbiome 6, 68 (2018).

  61. Böhm, J. et al. Discovery of novel plasma proteins as biomarkers for the development of incisional hernias after midline incision in patients with colorectal cancer: The ColoCare study. Surgery 161, 808–817 (2017).

  62. Liesenfeld, D. B. et al. Metabolomics and transcriptomics identify pathway differences between visceral and subcutaneous adipose tissue in colorectal cancer patients: the ColoCare study. Am. J. Clin. Nutr. 102, 433–443 (2015).

  63. Pox, C. P. et al. Efficacy of a nationwide screening colonoscopy program for colorectal cancer. Gastroenterology 142, 1460–1467.e2 (2012).

  64. Furet, J. P. et al. Comparative assessment of human and farm animal faecal microbiota using real-time quantitative PCR. FEMS Microbiol. Ecol. 68, 351–362 (2009).

  65. Mende, D. R. et al. Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881–884 (2013).

  66. Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).

  67. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  68. Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 58, 267–288 (1996).

  69. Smialowski, P., Frishman, D. & Kramer, S. Pitfalls of supervised feature selection. Bioinformatics 26, 440–443 (2010).

  70. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).

  71. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).

  72. Oksanen, J. et al. vegan: Community Ecology Package (The Comprehensive R Archive Network, 2018).

  73. Costea, P. I., Zeller, G., Sunagawa, S. & Bork, P. A fair comparison. Nat. Methods 11, 359 (2014).

  74. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009).

  75. Peng, H., Long, F. & Ding, C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005).

  76. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).

  77. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

  78. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).

  79. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).

  80. Caporaso, J. G. et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl Acad. Sci. USA 108, 4516–4522 (2011).

Acknowledgements

We are thankful to members of the Zeller, Bork, and Arumugam groups for inspiring discussions. Additionally, we thank Y. P. Yuan and the EMBL Information Technology Core Facility for support with high-performance computing, and the EMBL Genomics Core Facility for their sequencing support. We are also grateful for the advice provided by B. Klaus, EMBL Centre for Statistical Data Analysis. We acknowledge funding from EMBL, the German Cancer Research Center, the Huntsman Cancer Foundation, the Intramural Research Program of the National Cancer Institute, ETH Zürich, and the following external sources: the European Research Council (CancerBiome grant no. ERC-2010-AdG_20100317 to P.B., Microbios grant no. ERC-AdG-669830 to P.B., and Meta-PG grant no. ERC-2016-STG-716575 to N.S.); the Novo Nordisk Foundation (grant no. NNF10CC1016515 to M.A.); the Danish Diabetes Academy supported by the Novo Nordisk Foundation and TARGET Research Initiative (Danish Strategic Research Council grant no. 0603-00484B to M.A.); the Matthias-Lackas Foundation (to C.M.U.); the National Cancer Institute (grant nos. R01 CA189184, R01 CA207371, U01 CA206110, and P30 CA042014 to C.M.U.); the Federal Ministry of Education and Research (BMBF; the de.NBI network no. 031A537B to P.B. and the ERA-NET TRANSCAN project no. 01KT1503 to C.M.U.); the Helmut Horten Foundation (to S.Sunagawa); and the Fundação de Amparo à Pesquisa do Estado de São Paulo (grant no. 16/23527-2 to A.M.T.). For the Italy validation cohorts, funding was provided by the Lega Italiana per La Lotta contro i Tumori. For the Japan validation cohort, funding was provided to T.Y. and S.Y. by the National Cancer Center Research and Development Fund (grant nos. 25-A-4, 28-A-4, and 29-A-6); Practical Research Project for Rare/Intractable Diseases from the Japan Agency for Medical Research and Development (grant no. JP18ek0109187); Japan Science and Technology Agency-PRESTO (grant no. JPMJPR1507); Japan Society for the Promotion of Science KAKENHI (grant nos. 16J10135, 142558, and 221S0002); Joint Research Project of the Institute of Medical Science, University of Tokyo; and the Takeda Science and Suzuken Memorial Foundations.

Author information

Contributions

G.Z., M.A., and P.B. conceived and supervised the study. P.S.K., N.H., C.M.U., H.B., E.V., and R.S. recruited the participants and collected the samples. E.K., A.Y.V., S.Sunagawa, and P.B. generated the metagenomic data. A.M., P.T.P., J.S.F., A.P., S.Sunagawa, L.P.C., G.Z., and M.A. developed the metagenomic profiling workflows and/or performed the taxonomic and functional profiling. J.W., G.Z., K.Z., P.T.P., A.K., M.A., and N.S. performed the statistical analysis and/or developed the statistical analysis workflows. E.K. and R.P. designed and performed the validation experiments. A.M.T., P.M., S.G., D.S., S.M., H.S., S.Shiba, T.S., S.Y., T.Y., L.W., A.N., and N.S. provided additional validation data. J.W., G.Z., M.A., P.T.P., and P.B. designed the figures. G.Z., J.W., M.A., and P.B. wrote the manuscript with contributions from P.T.P., A.M., S.Sunagawa, L.P.C., E.K., A.Y.V., E.V., R.S., P.S.K., H.B., E.N., N.S. and L.W. All authors discussed and approved the manuscript.

Corresponding authors

Correspondence to Manimozhiyan Arumugam, Peer Bork or Georg Zeller.

Ethics declarations

Competing interest

P.B., G.Z., A.Y.V., and S.Sunagawa are named inventors on a patent (EP2955232A1: Method for diagnosing adenomas and/or colorectal cancer (CRC) based on analyzing the gut microbiome).

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Potential confounding of individual microbial species associations by patient demographics and technical factors.

Variance explained by disease status (CRC versus CTRL) is plotted against variance explained by different putative confounding factors for individual microbial species. Each species is represented by a dot proportional in size to its abundance (see legend and Methods); core microbial markers identified in the meta-analysis are highlighted in red. For the confounder analysis, factors with continuous values were discretized into quartiles and the BMI was split into lean/overweight/obese according to conventional cutoffs. The variance explained by disease status was computed for all data; accordingly, the x values are the same in all panels and also in Fig. 1d. The variance explained by different confounding factors was computed using all samples for which data were available (indicated by the insets).

Extended Data Fig. 2 Study heterogeneity shows a strong influence on alpha and beta diversity.

a, Alpha diversity as measured with the Shannon index was computed for all gut microbial species (n = 849), reference mOTUs (n = 246), and meta-mOTUs (n = 603) separately. P values were computed using a two-sided Wilcoxon test, while the overall P value (on top) was calculated using a two-sided blocked Wilcoxon test (n = 575 independent observations; see Methods). The ANOVA F-statistic below the panel was calculated using the R function ‘aov’. b, Principal coordinate analysis of samples from all five included studies based on Bray–Curtis distance; the study is color-coded and disease status (CRC versus CTRL) is indicated by filled/unfilled circles. The boxplots on the side and below show samples projected onto the first two principal coordinates broken down by study and disease status, respectively. P values were computed using a two-sided Wilcoxon test for disease status and a Kruskal–Wallis test for study (n = 575 independent observations). For all boxplots, boxes denote the IQR with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. Country codes are as in Fig. 1b.
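
As a rough illustration of the two diversity measures named in this legend, the following Python sketch computes a Shannon index and a Bray–Curtis dissimilarity for toy abundance vectors. It is not the code used for the figure: the natural logarithm, the function names, and the toy samples are assumptions made here for illustration.

```python
import math


def shannon_index(abundances):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero abundance.

    The natural logarithm is an assumption; the legend does not state the base.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("sample has no abundance")
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two equally long abundance vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    denominator = sum(x) + sum(y)
    if denominator == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / denominator


# Hypothetical toy samples: one perfectly even, one dominated by a single taxon.
even_sample = [10, 10, 10, 10]
skewed_sample = [40, 0, 0, 0]

print(shannon_index(even_sample))    # equals ln(4) for four equally abundant taxa
print(shannon_index(skewed_sample))  # zero diversity for a single taxon
print(bray_curtis(even_sample, skewed_sample))
```

An even community maximizes the Shannon index for a given number of taxa, whereas a single-taxon community has an index of zero; Bray–Curtis ranges from 0 (identical profiles) to 1 (no shared taxa).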

Extended Data Fig. 3 The generalized fold change extends the established (median-based) fold change to provide higher resolution in sparse microbiome data.

a, In the top row, the logarithmic relative abundances for Bacteroides dorei/vulgatus, Parvimonas micra, and F. nucleatum subspecies animalis—examples for one high-prevalence and two low-prevalence species—are shown as swarm plots for the CTRL and CRC groups. The thick vertical lines indicate the medians in the different groups and the black horizontal line shows the difference between the two medians, which corresponds to the classical (median-based) fold change. Since F. nucleatum subspecies animalis is not detectable in more than 50% of cancer cases, there is no difference between the CTRL and CRC median; thus, the fold change is 0. The lower row shows the same data, but instead of only the median (or 50th percentile), 9 quantiles ranging from 10 to 90% are shown by the thinner vertical lines. The generalized fold change is indicated by the horizontal black line again, computed as the mean of the differences between the corresponding quantiles in both groups. In the case of the sparse data (for example, F. nucleatum), the differences in the 70, 80, and 90% quantiles cause the generalized fold change to be higher than 0. b, The median fold change is plotted against the newly developed generalized fold change for all microbial species. (The core set of microbial CRC marker species is highlighted in orange.) Marginal histograms visualize the distribution for both fold change and generalized fold change. c, Scatter plots showing the relationship between fold change and generalized fold change and the area under the ROC curve (AUROC) or the shift in prevalence between CRC and CTRL, with Spearman’s rank correlations (rho) added in the top left corners; the generalized fold change provides higher resolution (wider distribution around 0) and better correlation with the non-parametric AUROC effect size measure as well as prevalence shift, which captures the difference in prevalence of a species in CRC metagenomes relative to CTRL metagenomes.
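
The calculation described in panel a can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the pseudocount of 1e-5, the base-10 logarithm, the inclusive quantile interpolation, and the toy abundances are all choices made here for illustration (the actual settings are given in the Methods, which are not part of this excerpt).

```python
import math
import statistics


def log_abundances(values, pseudocount):
    """Log10-transform relative abundances after adding a pseudocount (assumed value)."""
    return [math.log10(v + pseudocount) for v in values]


def decile_cutpoints(values):
    """Return the 9 quantiles at 10%, 20%, ..., 90%."""
    return statistics.quantiles(values, n=10, method="inclusive")


def median_fold_change(ctrl, crc, pseudocount=1e-5):
    """Classical fold change: difference between the group medians in log space."""
    return (statistics.median(log_abundances(crc, pseudocount))
            - statistics.median(log_abundances(ctrl, pseudocount)))


def generalized_fold_change(ctrl, crc, pseudocount=1e-5):
    """Mean of the differences between corresponding quantiles of both groups."""
    q_ctrl = decile_cutpoints(log_abundances(ctrl, pseudocount))
    q_crc = decile_cutpoints(log_abundances(crc, pseudocount))
    return statistics.fmean([c - k for c, k in zip(q_crc, q_ctrl)])


# Hypothetical sparse species: undetected in all controls and in 7 of 10 cases.
ctrl = [0.0] * 10
crc = [0.0] * 7 + [1e-4, 1e-3, 1e-2]

print(median_fold_change(ctrl, crc))       # medians coincide, so this is 0
print(generalized_fold_change(ctrl, crc))  # upper quantiles differ, so this is positive
```

With these toy values the medians of both groups sit at the detection floor, so the median-based fold change vanishes, while the upper quantiles of the CRC group pull the generalized fold change above 0, mirroring the sparse-data case described for F. nucleatum.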

Extended Data Fig. 4 Microbial genera identified in the meta-analysis to be associated with CRC.

a, The meta-analysis significance of microbial genera, computed using a univariate, two-sided Wilcoxon test blocked for ‘study’ and ‘colonoscopy’ (n = 574 independent observations), is given by bar height (FDR = 0.005). Underneath, significance (FDR-corrected P value computed using a two-sided Wilcoxon test) and generalized fold changes (see Methods) within individual studies are displayed as heatmaps in gray and color, respectively (see keys). Genera are ordered by meta-analysis significance and direction of change. b, For highly significant genera (meta-analysis FDR = 1 × 10⁻⁵), association strength is quantified by the area under the ROC curve across individual studies (color-coded diamonds); 95% confidence intervals are depicted by gray lines. Country codes are as in Fig. 1b.
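
For readers unfamiliar with the AUROC as an effect size, the Python sketch below computes it for a single feature as the probability that a randomly chosen CRC sample has a higher value than a randomly chosen control, which is equivalent to the Mann–Whitney U statistic divided by the product of the group sizes. The function name and the toy values are hypothetical, and confidence intervals (computed in the paper with dedicated software) are not covered.

```python
def single_feature_auroc(ctrl_values, crc_values):
    """AUROC of one feature: P(CRC value > CTRL value), with ties counted as 0.5."""
    if not ctrl_values or not crc_values:
        raise ValueError("both groups need at least one sample")
    wins = 0.0
    for crc_value in crc_values:
        for ctrl_value in ctrl_values:
            if crc_value > ctrl_value:
                wins += 1.0
            elif crc_value == ctrl_value:
                wins += 0.5
    return wins / (len(crc_values) * len(ctrl_values))


# Hypothetical log abundances of one genus in controls and cases.
ctrl_values = [-5.0, -5.0, -4.2, -3.9]
crc_values = [-5.0, -3.5, -3.1, -2.8]

print(single_feature_auroc(ctrl_values, crc_values))
```

A value of 0.5 indicates no separation between the groups, values above 0.5 indicate enrichment in CRC, and values below 0.5 indicate depletion.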

Extended Data Fig. 5 The core set of CRC-enriched microbial species can be stratified into four clusters based on co-occurrence in CRC metagenomes.

a, The heatmap shows the Jaccard index (computed by comparing marker-positive samples; see Methods) for the core set of microbial marker species, computed on CRC cases only. Clustering was performed using the Ward algorithm as implemented in the R function ‘hclust’. The inset shows the distribution of Jaccard similarities within each cluster and for the background (all similarities between species not in the same cluster). b,c, Barplots show the fraction of CRC samples that are positive for a marker species cluster (defined as the union of positive marker species) broken down by patient subgroups based on BMI (b) and age (c) (see Fig. 2b–d for other patient subgroups). The significance of the associations between CRC subgroups and marker species clusters was tested using the Cochran–Mantel–Haenszel test blocked for ‘study’ and ‘colonoscopy’. (No significant associations were detected.) d, For the core set of microbial species with a genomic reference, the presence (red) or absence (white) of superoxide dismutase, peroxidase, and catalase are shown as heatmaps. Presence of the enzyme was determined by checking the protein annotations for the reference projects (see NCBI BioProject ID) in http://progenomes.embl.de/.
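
The co-occurrence measure in panel a can be illustrated with a short Python sketch that computes the Jaccard index between the sets of CRC samples positive for each marker species. How a sample is called marker-positive is defined in the Methods and is not reproduced here; the species labels and sample identifiers below are hypothetical, and the clustering step (Ward algorithm in R) is left out.

```python
def jaccard_index(positive_a, positive_b):
    """Size of the intersection over size of the union of two sample sets."""
    a, b = set(positive_a), set(positive_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two species never detected
    return len(a & b) / len(union)


def pairwise_jaccard(positives_by_species):
    """Jaccard index for every ordered pair of species, as a dictionary."""
    species = sorted(positives_by_species)
    return {
        (first, second): jaccard_index(positives_by_species[first],
                                       positives_by_species[second])
        for first in species
        for second in species
    }


# Hypothetical marker-positive CRC samples per species.
positives_by_species = {
    "species_A": {"s1", "s2", "s3"},
    "species_B": {"s2", "s3", "s4"},
    "species_C": {"s5"},
}

similarities = pairwise_jaccard(positives_by_species)
print(similarities[("species_A", "species_B")])
print(similarities[("species_A", "species_C")])
```

The resulting similarity matrix (or one minus it, as a distance) is the kind of input a hierarchical clustering routine would take.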

Extended Data Fig. 6 Coefficients of LOSO LASSO logistic regression models compared to models trained on individual studies.

a, The mean coefficients (feature weights) from LASSO cross-validation models trained on single studies (color-coded) are plotted against the single-feature AUROC for each species feature. The horizontal lines highlight the microbial species that are, for at least one study, selected in more than 50% of the models in cross-validation and account for more than 10% of the absolute model weight in at least 10% of the cross-validation models. b, The same as in a, for the models trained in the LOSO setting (see Methods). The colors indicate which study has been left out of the training set (and is used for validation). The weights of the LOSO models are spread across more species and are thus generally lower; species are highlighted by the horizontal lines if their weights account for more than 2.5% of the absolute model weight in at least 10% of cross-validation models and they have been selected in more than 50% of models in cross-validation. c, The inset shows the distribution of the number of non-zero coefficients across all cross-validation models. d, The bar height indicates the number of non-zero coefficients that are shared between the mean models for each study or left-out study, respectively. e, The study-to-study difference (computed as the median of all pairwise differences between model weights for a single species across the mean models) for cross-validation single-study models is plotted against the same measure for the LOSO models. Species with a study-to-study difference of more than 0.02 in the cross-validation models are highlighted and annotated, showing much larger variability between models trained on single studies compared to LOSO models. Country codes as in Fig. 1b.
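
The highlighting rule described for panels a and b can be written down as a small filter over the coefficient vectors of the cross-validation models. The Python sketch below is an illustration only: the representation of a model as a dictionary from feature name to coefficient, the function name, and the toy coefficients are assumptions made here, while the default thresholds follow the values stated for panel a.

```python
def highlighted_features(models, min_selected=0.5, min_share=0.10, min_models=0.10):
    """Return features selected in more than `min_selected` of the models that also
    carry more than `min_share` of the absolute weight in at least `min_models`
    of the models. Each model is a dict mapping feature name to coefficient."""
    n_models = len(models)
    if n_models == 0:
        return set()
    features = {feature for model in models for feature in model}
    highlighted = set()
    for feature in features:
        # A feature counts as selected when its coefficient is non-zero.
        n_selected = sum(1 for model in models if model.get(feature, 0.0) != 0.0)
        n_heavy = 0
        for model in models:
            total_weight = sum(abs(w) for w in model.values())
            if total_weight > 0 and abs(model.get(feature, 0.0)) / total_weight > min_share:
                n_heavy += 1
        if n_selected / n_models > min_selected and n_heavy / n_models >= min_models:
            highlighted.add(feature)
    return highlighted


# Hypothetical coefficients from four cross-validation models.
models = [
    {"sp1": 0.8, "sp2": 0.1, "sp3": 0.0},
    {"sp1": -0.5, "sp2": 0.0, "sp3": 0.5},
    {"sp1": 0.6, "sp2": 0.0, "sp3": 0.0},
    {"sp1": 0.9, "sp2": 0.05, "sp3": 0.0},
]

print(highlighted_features(models))                  # thresholds as in panel a
print(highlighted_features(models, min_share=0.025)) # weight threshold as in panel b
```

In this toy example only sp1 passes, because sp2 and sp3 are selected in no more than half of the models regardless of the weight threshold.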

Extended Data Fig. 7 Analysis of LOSO models for prediction bias.

a, To examine whether species- and gene family-level classification models are confounded, that is, biased toward certain patient subgroups, the prediction scores from the LOSO models are broken down into strata for each clinical parameter (for example, female and male for sex). The prediction bias for each variable was tested by Wilcoxon (for sex and BMI) or Kruskal–Wallis (all others) tests while blocking for study as the confounder. The boxes denote the IQRs, with the median as the horizontal black line and the whiskers extending up to the most extreme point within the 1.5-fold IQR. A significant difference in prediction score was detected only for the CRC stage. This stage bias is more pronounced for gene family than for species models. b, To examine the CRC stage bias further, the barplots show the true positive rate corresponding to an overall 10% FPR (see also Fig. 3c) for the different CRC stages, displaying slightly higher classification sensitivity for late-stage CRC for both species and gene family models.
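
The quantity shown in panel b, the true positive rate per CRC stage at a score cutoff fixed to an overall 10% FPR, can be sketched as follows. This is a simplified Python illustration under assumptions made here: the cutoff is taken as the lowest observed score that keeps the false positive rate at or below the target, and the stage labels and prediction scores are hypothetical.

```python
def threshold_at_fpr(ctrl_scores, crc_scores, max_fpr=0.10):
    """Lowest observed score cutoff whose false positive rate does not exceed max_fpr."""
    threshold = float("inf")
    for candidate in sorted(set(ctrl_scores) | set(crc_scores), reverse=True):
        fpr = sum(score >= candidate for score in ctrl_scores) / len(ctrl_scores)
        if fpr > max_fpr:
            break  # FPR only grows as the cutoff is lowered further
        threshold = candidate
    return threshold


def sensitivity_by_stage(ctrl_scores, crc_scores_by_stage, max_fpr=0.10):
    """True positive rate per stage at one cutoff derived from all samples."""
    pooled = [score for scores in crc_scores_by_stage.values() for score in scores]
    threshold = threshold_at_fpr(ctrl_scores, pooled, max_fpr)
    return {
        stage: sum(score >= threshold for score in scores) / len(scores)
        for stage, scores in crc_scores_by_stage.items()
    }


# Hypothetical prediction scores for controls and for cases grouped by stage.
ctrl_scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.8]
crc_scores_by_stage = {"early": [0.3, 0.5, 0.6], "late": [0.7, 0.9]}

print(threshold_at_fpr(ctrl_scores, [0.3, 0.5, 0.6, 0.7, 0.9]))
print(sensitivity_by_stage(ctrl_scores, crc_scores_by_stage))
```

Fixing one cutoff for all samples and only then stratifying keeps the stage-wise sensitivities comparable, since every stage is evaluated at the same false positive rate.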

Extended Data Fig. 8 Cross-study performance of statistical models based on KEGG KO abundances, single-gene abundances from the metagenomic gene catalog (IGC), and the combination of taxonomic and eggNOG database abundance profiles.

a–c, CRC classification accuracy resulting from cross-validation within each study (gray boxes along the diagonal) and study-to-study model transfer (external validations off the diagonal) as measured by the AUROC for the classification models trained on KEGG KOs (a), models based on the gene catalog (b), and models based on the combination of taxonomic and eggNOG database abundance profiles (c) (see Methods for the details on the statistical modeling workflows). The last column depicts the average AUROC across external validations. The barplots on the right show that the classification accuracy on a hold-out study improves if the data from all other studies are combined for training (LOSO validation) relative to models trained on data from a single study (study-to-study transfer, indicated by the bar color) consistently across different types of input data. Country codes as in Fig. 1b.
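
The LOSO validation scheme referred to here amounts to holding out all samples of one study in turn and training on the remaining studies. The Python sketch below only generates such splits; model fitting is omitted. The sample records, the "study" key, and the study labels are hypothetical and stand in for the real metadata.

```python
def leave_one_study_out(samples):
    """Yield (held_out_study, training_samples, test_samples) for every study."""
    studies = sorted({sample["study"] for sample in samples})
    for held_out in studies:
        training = [s for s in samples if s["study"] != held_out]
        test = [s for s in samples if s["study"] == held_out]
        yield held_out, training, test


# Hypothetical sample metadata with placeholder study labels.
samples = [
    {"id": "a1", "study": "study_A", "label": "CRC"},
    {"id": "a2", "study": "study_A", "label": "CTRL"},
    {"id": "b1", "study": "study_B", "label": "CRC"},
    {"id": "b2", "study": "study_B", "label": "CTRL"},
    {"id": "c1", "study": "study_C", "label": "CRC"},
]

for held_out, training, test in leave_one_study_out(samples):
    print(held_out, len(training), len(test))
```

Because no sample of the held-out study is seen during training, the resulting accuracy reflects transfer to an unseen cohort rather than within-study cross-validation.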

Source Data

Extended Data Fig. 9 Identification of bai genes in metagenomes.

Putative bai genes identified in the metagenomic IGC were clustered by co-abundance in metagenomes to infer genomic linkage (see Methods) and, from that, operon completeness and species of origin. a, For each resulting cluster of putative bile acid-converting genes, the mean relative abundance was plotted against the mean percentage of protein identity derived from global alignment against the known bile acid-converting genes from C. scindens and C. hylemonae (see Methods). Completeness, that is, how many of the 11 different bai gene functions are represented in each cluster, and mean gene-to-gene correlation of relative abundance within each cluster are encoded by dot size and color, respectively (see legend). The four clusters with a mean protein identity > 75% to known bai operon-containing genomes were included in the subsequent analysis and labeled with the most highly correlated mOTU (see b). b, Pearson correlation between gene cluster abundances and the relative abundance of the most highly correlated species (in logarithmic space) is given by the bar height for the four gene clusters identified in a. The most highly correlating species is highlighted in darker gray (see labeling of gene clusters in a). c, The log-transformed abundances of all bai genes and the four species identified in b are shown as boxplots for CTRLs (gray) and CRC cases (red). Assessing the significance of the differences between CRC and CTRLs (using a Wilcoxon test blocked for ‘study’ and ‘colonoscopy’) demonstrates a much more significant CRC enrichment of the aggregated metagenomic bai gene abundance than of the individual clostridial species to which these belong. d, ROC curve for the qPCR quantification of the baiF gene in the genomic DNA of a subset of samples in the German study (see Methods and Fig. 4e).
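
The species assignment in panel b rests on a Pearson correlation computed on log-transformed abundances. A minimal sketch of that idea follows. The pseudocount added before taking logarithms, the function names and the toy abundance values are assumptions for this example, not details given in the caption.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_matching_species(cluster_abundance, species_abundances, pseudocount=1e-6):
    """Correlate a gene cluster with each species in log space; return the top hit.

    `pseudocount` is an assumed guard against log(0), not a value from the paper.
    """
    log_cluster = [math.log10(v + pseudocount) for v in cluster_abundance]
    scores = {}
    for name, abundance in species_abundances.items():
        log_species = [math.log10(v + pseudocount) for v in abundance]
        scores[name] = pearson(log_cluster, log_species)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy relative abundances across four samples, invented for illustration.
    cluster = [1e-4, 1e-3, 1e-2, 1e-1]
    species = {
        "species_1": [2e-4, 2e-3, 2e-2, 2e-1],
        "species_2": [1e-1, 1e-2, 1e-3, 1e-4],
    }
    print(best_matching_species(cluster, species))
```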

Source Data

Extended Data Fig. 10 Validation of the meta-analysis of single-species associations in three independent cohorts.

a, Heatmap showing for the core set of CRC-associated species (see Fig. 1) the rank of the respective species within the associations of each study, including the three independent validation cohorts (see Table 1), compared to the rank in the meta-analysis (meta) on the left. b, Precision-recall curves for the different independent validation cohorts using the meta-analysis set of associated species at FDR = 0.005 (top) and FDR = 1 × 10−5 (bottom) as the ‘true’ set (see Methods) and the naïve (uncorrected) within-cohort significance as the predictor (see Supplementary Fig. 2). IT1, Italy 1; IT2, Italy 2; JP, Japan; other country codes are as in Fig. 1b.
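
The precision-recall evaluation in panel b can be sketched as follows: species are ranked by their uncorrected within-cohort P value, and at each rank the overlap with the meta-analysis 'true' set is scored. This is an illustrative sketch only. The function name, the dictionary input and the species identifiers are hypothetical, and tied P values are not treated specially.

```python
def precision_recall_points(p_values, true_set):
    """Return (recall, precision) after each species, ranked by ascending P value.

    p_values: dict mapping species name to its within-cohort P value.
    true_set: set of species called significant in the meta-analysis.
    """
    ranked = sorted(p_values, key=p_values.get)
    points = []
    true_positives = 0
    for k, species in enumerate(ranked, start=1):
        if species in true_set:
            true_positives += 1
        recall = true_positives / len(true_set)
        precision = true_positives / k
        points.append((recall, precision))
    return points


if __name__ == "__main__":
    # Toy P values and 'true' set, invented for illustration.
    cohort_p = {"sp1": 0.001, "sp2": 0.02, "sp3": 0.03, "sp4": 0.5}
    meta_set = {"sp1", "sp3"}
    for recall, precision in precision_recall_points(cohort_p, meta_set):
        print(f"recall={recall:.2f} precision={precision:.2f}")
```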

Source Data

Supplementary information

Supplementary Figures 1–8, Supplementary Tables 1, 2, and 5, and Supplementary References

Reporting Summary

Supplementary Tables

Supplementary Tables 3, 4, and 6

Supplementary Data

Supplementary Data 1 and 2

Source data

Source Data Fig. 1

Statistical Source Data

Source Data Fig. 2

Statistical Source Data

Source Data Fig. 3

Statistical Source Data

Source Data Fig. 4

Statistical Source Data

Source Data Fig. 5

Statistical Source Data

Source Data Extended Data Fig. 1

Statistical Source Data

Source Data Extended Data Fig. 2

Statistical Source Data

Source Data Extended Data Fig. 3

Statistical Source Data

Source Data Extended Data Fig. 4

Statistical Source Data

Source Data Extended Data Fig. 5

Statistical Source Data

Source Data Extended Data Fig. 6

Statistical Source Data

Source Data Extended Data Fig. 7

Statistical Source Data

Source Data Extended Data Fig. 8

Statistical Source Data

Source Data Extended Data Fig. 9

Statistical Source Data

Source Data Extended Data Fig. 10

Statistical Source Data

About this article

Cite this article

Wirbel, J., Pyl, P.T., Kartal, E. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med 25, 679–689 (2019). https://doi.org/10.1038/s41591-019-0406-6

  • DOI: https://doi.org/10.1038/s41591-019-0406-6
