Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer

Wirbel, Jakob; Pyl, Paul Theodor; Kartal, Ece; Zych, Konrad; Kashani, Alireza; Milanese, Alessio; Fleck, Jonas S.; Voigt, Anita Y.; Palleja, Albert; Ponnudurai, Ruby; Sunagawa, Shinichi; Coelho, Luis Pedro; Schrotz-King, Petra; Vogtmann, Emily; Habermann, Nina; Niméus, Emma; Thomas, Andrew M.; Manghi, Paolo; Gandini, Sara; Serrano, Davide; Mizutani, Sayaka; Shiroma, Hirotsugu; Shiba, Satoshi; Shibata, Tatsuhiro; Yachida, Shinichi; Yamada, Takuji; Waldron, Levi; Naccarati, Alessio; Segata, Nicola; Sinha, Rashmi; Ulrich, Cornelia M.; Brenner, Hermann; Arumugam, Manimozhiyan; Bork, Peer; Zeller, Georg

doi:10.1038/s41591-019-0406-6

Article
Published: 01 April 2019

Nature Medicine volume 25, pages 679–689 (2019)

Abstract

Association studies have linked microbiome alterations with many human diseases. However, they have not always reported consistent results, thereby necessitating cross-study comparisons. Here, a meta-analysis of eight geographically and technically diverse fecal shotgun metagenomic studies of colorectal cancer (CRC, n = 768), which was controlled for several confounders, identified a core set of 29 species significantly enriched in CRC metagenomes (false discovery rate (FDR) < 1 × 10⁻⁵). CRC signatures derived from single studies maintained their accuracy in other studies. By training on multiple studies, we improved detection accuracy and disease specificity for CRC. Functional analysis of CRC metagenomes revealed enriched protein and mucin catabolism genes and depleted carbohydrate degradation genes. Moreover, we inferred elevated production of secondary bile acids from CRC metagenomes, suggesting a metabolic link between cancer-associated gut microbes and a fat- and meat-rich diet. Through extensive validations, this meta-analysis firmly establishes globally generalizable, predictive taxonomic and functional microbiome CRC signatures as a basis for future diagnostics.

**Fig. 1: Despite study differences, meta-analysis identifies a core set of gut microbes strongly associated with CRC.**

**Fig. 2: Co-occurrence analysis of CRC-associated gut microbial species reveals four clusters preferentially linked to specific patient subgroups.**

**Fig. 3: Both taxonomic and functional metagenomic classification models generalize across studies, in particular when trained on data from multiple studies.**

**Fig. 4: Meta-analysis identifies consistent functional changes in CRC metagenomes.**

**Fig. 5: Meta-analysis results are validated in three independent study populations.**

Data availability

The raw sequencing data for the samples in the German study that have not been published before (see Methods) are available from the European Nucleotide Archive under study no. PRJEB27928. The metadata for these samples are available as Supplementary Table 6.

For the other studies included in the current study, the raw sequencing data can be found under the following European Nucleotide Archive identifiers: PRJEB10878 for Yu et al.¹¹; PRJEB12449 for Vogtmann et al.¹⁰; ERP008729 for Feng et al.⁹; and ERP005534 for Zeller et al.⁸. The independent validation cohorts can be found in the Sequence Read Archive under the identifier no. SRP136711 for Thomas et al.²⁷ and in the DNA Data Bank of Japan database under identification no. DRA006684.

The filtered taxonomic and functional profiles used as input for the statistical modeling pipeline are available in Supplementary Data 1.

The code and all analysis results can be found under https://github.com/zellerlab/crc_meta.

References

Tringe, S. G. & Rubin, E. M. Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. 6, 805–814 (2005).
Tremaroli, V. & Bäckhed, F. Functional interactions between the gut microbiota and host metabolism. Nature 489, 242–249 (2012).
Lynch, S. V. & Pedersen, O. The human intestinal microbiome in health and disease. N. Engl. J. Med. 375, 2369–2379 (2016).
Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).
Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).
Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
Schirmer, M. et al. Dynamics of metatranscription in the inflammatory bowel disease gut microbiome. Nat. Microbiol. 3, 337–346 (2018).
Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
Feng, Q. et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun. 6, 6528 (2015).
Vogtmann, E. et al. Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS ONE 11, e0155362 (2016).
Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).
Bedarf, J. R. et al. Functional implications of microbial and viral gut metagenome changes in early stage L-DOPA-naïve Parkinson’s disease patients. Genome Med. 9, 39 (2017).
Schmidt, T. S. B., Raes, J. & Bork, P. The human gut microbiome: from association to modulation. Cell 172, 1198–1215 (2018).
Forslund, K. et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature 528, 262–266 (2015).
Costea, P. I. et al. Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35, 1069–1076 (2017).
Lozupone, C. A. et al. Meta-analyses of studies of the human microbiota. Genome Res. 23, 1704–1714 (2013).
Duvallet, C., Gibbons, S. M., Gurry, T., Irizarry, R. A. & Alm, E. J. Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nat. Commun. 8, 1784 (2017).
Shah, M. S. et al. Leveraging sequence-based faecal microbial community survey data to identify a composite biomarker for colorectal cancer. Gut 67, 882–891 (2018).
Pasolli, E., Truong, D. T., Malik, F., Waldron, L. & Segata, N. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput. Biol. 12, e1004977 (2016).
Dai, Z. et al. Multi-cohort analysis of colorectal cancer metagenome identified altered bacteria across populations and universal bacterial markers. Microbiome 6, 70 (2018).
Maier, L. et al. Extensive impact of non-antibiotic drugs on human gut bacteria. Nature 555, 623–628 (2018).
Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).
Kultima, J. R. et al. MOCAT2: a metagenomic assembly, annotation and profiling framework. Bioinformatics 32, 2520–2523 (2016).
Hothorn, T. et al. A Lego system for conditional inference. Am. Stat. 60, 257–263 (2006).
Mandal, S. et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 26, 27663 (2015).
Tjalsma, H., Boleij, A., Marchesi, J. R. & Dutilh, B. E. A bacterial driver-passenger model for colorectal cancer: beyond the usual suspects. Nat. Rev. Microbiol. 10, 575–582 (2012).
Thomas, A. M. et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. https://doi.org/10.1038/s41591-019-0405-7 (2019).
Huerta-Cepas, J. et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).
Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, D199–D205 (2014).
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).
Vieira-Silva, S. et al. Species–function relationships shape ecological properties of the human gut microbiome. Nat. Microbiol. 1, 16088 (2016).
Hirayama, A. et al. Quantitative metabolome profiling of colon and stomach cancer microenvironment by capillary electrophoresis time-of-flight mass spectrometry. Cancer Res. 69, 4918–4925 (2009).
Denkert, C. et al. Metabolite profiling of human colon carcinoma: deregulation of TCA cycle and amino acid turnover. Mol. Cancer 7, 72 (2008).
Mal, M., Koh, P. K., Cheah, P. Y. & Chan, E. C. Metabotyping of human colorectal cancer using two-dimensional gas chromatography mass spectrometry. Anal. Bioanal. Chem. 403, 483–493 (2012).
Weir, T. L. et al. Stool microbiome and metabolome differences between colorectal cancer patients and healthy adults. PLoS ONE 8, e70803 (2013).
Goedert, J. J. et al. Fecal metabolomics: assay performance and association with colorectal cancer. Carcinogenesis 35, 2089–2096 (2014).
Aykan, N. F. Red meat and colorectal cancer. Oncol. Rev. 9, 288 (2015).
Diet, Nutrition, Physical Activity and Cancer: a Global Perspective. A Summary of the Third Expert Report (World Cancer Research Fund, 2018).
Dutilh, B. E., Backus, L., van Hijum, S. A. & Tjalsma, H. Screening metatranscriptomes for toxin genes as functional drivers of human colorectal cancer. Best Pract. Res. Clin. Gastroenterol. 27, 85–99 (2013).
Sears, C. L. & Garrett, W. S. Microbes, microbiota, and colon cancer. Cell Host Microbe 15, 317–328 (2014).
Ridlon, J. M., Harris, S. C., Bhowmik, S., Kang, D. J. & Hylemon, P. B. Consequences of bile salt biotransformations by intestinal bacteria. Gut Microbes 7, 22–39 (2016).
Yoshimoto, S. et al. Obesity-induced gut microbial metabolite promotes liver cancer through senescence secretome. Nature 499, 97–101 (2013).
Ajouz, H., Mukherji, D. & Shamseddine, A. Secondary bile acids: an underrecognized cause of colon cancer. World J. Surg. Oncol. 12, 164 (2014).
Boleij, A. et al. The Bacteroides fragilis toxin gene is prevalent in the colon mucosa of colorectal cancer patients. Clin. Infect. Dis. 60, 208–215 (2015).
Wu, S. et al. A human colonic commensal promotes colon tumorigenesis via activation of T helper type 17 T cell responses. Nat. Med. 15, 1016–1022 (2009).
Dejea, C. M. et al. Patients with familial adenomatous polyposis harbor colonic biofilms containing tumorigenic bacteria. Science 359, 592–597 (2018).
Ridlon, J. M., Kang, D. J. & Hylemon, P. B. Isolation and characterization of a bile acid inducible 7α-dehydroxylating operon in Clostridium hylemonae TN271. Anaerobe 16, 137–146 (2010).
Mallonee, D. H., White, W. B. & Hylemon, P. B. Cloning and sequencing of a bile acid-inducible operon from Eubacterium sp. strain VPI 12708. J. Bacteriol. 172, 7011–7019 (1990).
Ocvirk, S. & O’Keefe, S. J. D. Influence of bile acids on colorectal cancer risk: potential mechanisms mediated by diet–gut microbiota interactions. Curr. Nutr. Rep. 6, 315–322 (2017).
Gevers, D. et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 15, 382–392 (2014).
Viennot, S. et al. Colon cancer in inflammatory bowel disease: recent trends, questions and answers. Gastroenterol. Clin. Biol. 33, S190–S201 (2009).
Rubinstein, M. R. et al. Fusobacterium nucleatum promotes colorectal carcinogenesis by modulating E-cadherin/β-catenin signaling via its FadA adhesin. Cell Host Microbe 14, 195–206 (2013).
Kostic, A. D. et al. Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment. Cell Host Microbe 14, 207–215 (2013).
Arthur, J. C. et al. Intestinal inflammation targets cancer-inducing activity of the microbiota. Science 338, 120–123 (2012).
Reddy, B. S. Diet and excretion of bile acids. Cancer Res. 41, 3766–3768 (1981).
Ogino, S. et al. Integrative analysis of exogenous, endogenous, tumour and immune factors for precision medicine. Gut 67, 1168–1180 (2018).
Ogino, S., Chan, A. T., Fuchs, C. S. & Giovannucci, E. Molecular pathological epidemiology of colorectal neoplasia: an emerging transdisciplinary and interdisciplinary field. Gut 60, 397–411 (2011).
Hannigan, G. D., Duhaime, M. B., Ruffin, M. T. 4th, Koumpouras, C. C. & Schloss, P. D. Diagnostic potential and interactive dynamics of the colorectal cancer virome. MBio 9, e02248-18 (2018).
zur Hausen, H. Red meat consumption and cancer: reasons to suspect involvement of bovine infectious factors in colorectal cancer. Int. J. Cancer 130, 2475–2483 (2012).
Shkoporov, A. N. et al. Reproducible protocols for metagenomic analysis of human faecal phageomes. Microbiome 6, 68 (2018).
Böhm, J. et al. Discovery of novel plasma proteins as biomarkers for the development of incisional hernias after midline incision in patients with colorectal cancer: The ColoCare study. Surgery 161, 808–817 (2017).
Liesenfeld, D. B. et al. Metabolomics and transcriptomics identify pathway differences between visceral and subcutaneous adipose tissue in colorectal cancer patients: the ColoCare study. Am. J. Clin. Nutr. 102, 433–443 (2015).
Pox, C. P. et al. Efficacy of a nationwide screening colonoscopy program for colorectal cancer. Gastroenterology 142, 1460–1467.e2 (2012).
Furet, J. P. et al. Comparative assessment of human and farm animal faecal microbiota using real-time quantitative PCR. FEMS Microbiol. Ecol. 68, 351–362 (2009).
Mende, D. R. et al. Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881–884 (2013).
Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 58, 267–288 (1996).
Smialowski, P., Frishman, D. & Kramer, S. Pitfalls of supervised feature selection. Bioinformatics 26, 440–443 (2010).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
Oksanen, J. et al. vegan: Community Ecology Package (The Comprehensive R Archive Network, 2018).
Costea, P. I., Zeller, G., Sunagawa, S. & Bork, P. A fair comparison. Nat. Methods 11, 359 (2014).
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009).
Peng, H., Long, F. & Ding, C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005).
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).
Caporaso, J. G. et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl Acad. Sci. USA 108, 4516–4522 (2011).

Acknowledgements

We are thankful to members of the Zeller, Bork, and Arumugam groups for inspiring discussions. Additionally, we thank Y. P. Yuan and the EMBL Information Technology Core Facility for support with high-performance computing, and the EMBL Genomics Core Facility for their sequencing support. We are also grateful for the advice provided by B. Klaus, EMBL Centre for Statistical Data Analysis. We acknowledge funding from EMBL, the German Cancer Research Center, the Huntsman Cancer Foundation, the Intramural Research Program of the National Cancer Institute, ETH Zürich, and the following external sources: the European Research Council (CancerBiome grant no. ERC-2010-AdG_20100317 to P.B., Microbios grant no. ERC-AdG-669830 to P.B., and Meta-PG grant no. ERC-2016-STG-716575 to N.S.); the Novo Nordisk Foundation (grant no. NNF10CC1016515 to M.A.); the Danish Diabetes Academy supported by the Novo Nordisk Foundation and TARGET Research Initiative (Danish Strategic Research Council grant no. 0603-00484B to M.A.); the Matthias-Lackas Foundation (to C.M.U.); the National Cancer Institute (grant nos. R01 CA189184, R01 CA207371, U01 CA206110, and P30 CA042014 to C.M.U.); the Federal Ministry of Education and Research (BMBF; the de.NBI network no. 031A537B to P.B. and the ERA-NET TRANSCAN project no. 01KT1503 to C.M.U.); the Helmut Horten Foundation (to S.Sunagawa); and the Fundação de Amparo à Pesquisa do Estado de São Paulo (grant no. 16/23527-2 to A.M.T.). For the Italy validation cohorts, funding was provided by the Lega Italiana per La Lotta contro i Tumori. For the Japan validation cohort, funding was provided to T.Y. and S.Y. by the National Cancer Center Research and Development Fund (grant nos. 25-A-4, 28-A-4, and 29-A-6); Practical Research Project for Rare/Intractable Diseases from the Japan Agency for Medical Research and Development (grant no. JP18ek0109187); Japan Science and Technology Agency-PRESTO (grant no. JPMJPR1507); Japan Society for the Promotion of Science KAKENHI (grant nos. 16J10135, 142558, and 221S0002); Joint Research Project of the Institute of Medical Science, University of Tokyo; and the Takeda Science and Suzuken Memorial Foundations.

Author information

Luis Pedro Coelho
Present address: Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
These authors contributed equally: Jakob Wirbel, Paul Theodor Pyl.
These authors jointly supervised this work: Manimozhiyan Arumugam, Peer Bork, Georg Zeller.

Authors and Affiliations

Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
Jakob Wirbel, Ece Kartal, Konrad Zych, Alessio Milanese, Jonas S. Fleck, Anita Y. Voigt, Ruby Ponnudurai, Shinichi Sunagawa, Luis Pedro Coelho, Peer Bork & Georg Zeller
Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medicine, University of Copenhagen, Copenhagen, Denmark
Paul Theodor Pyl, Alireza Kashani, Albert Palleja & Manimozhiyan Arumugam
Division of Surgery, Oncology and Pathology, Department of Clinical Sciences Lund, Faculty of Medicine, Lund University, Lund, Sweden
Paul Theodor Pyl & Emma Niméus
Molecular Medicine Partnership Unit, Heidelberg, Germany
Ece Kartal & Peer Bork
The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
Anita Y. Voigt
Department of Biology, ETH Zürich, Zürich, Switzerland
Shinichi Sunagawa
Division of Preventive Oncology, National Center for Tumor Diseases and German Cancer Research Center, Heidelberg, Germany
Petra Schrotz-King & Hermann Brenner
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
Emily Vogtmann & Rashmi Sinha
Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
Nina Habermann
Division of Surgery, Department of Clinical Sciences Lund, Faculty of Medicine, Skane University Hospital, Lund, Sweden
Emma Niméus
Department CIBIO, University of Trento, Trento, Italy
Andrew M. Thomas, Paolo Manghi & Nicola Segata
Biochemistry Department, Chemistry Institute, University of São Paulo, São Paulo, Brazil
Andrew M. Thomas
IEO, European Institute of Oncology IRCCS, Milan, Italy
Sara Gandini & Davide Serrano
School of Life Science and Technology, Tokyo Institute of Technology, Tokyo, Japan
Sayaka Mizutani, Hirotsugu Shiroma & Takuji Yamada
Research Fellow of Japan Society for the Promotion of Science, Tokyo, Japan
Sayaka Mizutani
Division of Cancer Genomics, National Cancer Center Research Institute, Tokyo, Japan
Satoshi Shiba, Tatsuhiro Shibata & Shinichi Yachida
Laboratory of Molecular Medicine, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
Tatsuhiro Shibata
Department of Cancer Genome Informatics, Graduate School of Medicine/Faculty of Medicine, Osaka University, Osaka, Japan
Shinichi Yachida
PRESTO, Japan Science and Technology Agency, Saitama, Japan
Takuji Yamada
Graduate School of Public Health and Health Policy, City University of New York, New York, NY, USA
Levi Waldron
Institute for Implementation Science in Population Health, City University of New York, New York, NY, USA
Levi Waldron
Italian Institute for Genomic Medicine, Turin, Italy
Alessio Naccarati
Department of Molecular Biology of Cancer, Institute of Experimental Medicine, Prague, Czech Republic
Alessio Naccarati
Huntsman Cancer Institute and Department of Population Health Sciences, University of Utah, Salt Lake City, UT, USA
Cornelia M. Ulrich
Division of Clinical Epidemiology and Aging Research, German Cancer Research Center, Heidelberg, Germany
Hermann Brenner
German Cancer Consortium, German Cancer Research Center, Heidelberg, Germany
Hermann Brenner
Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark
Manimozhiyan Arumugam
Max Delbrück Centre for Molecular Medicine, Berlin, Germany
Peer Bork
Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
Peer Bork

Contributions

G.Z., M.A., and P.B. conceived and supervised the study. P.S.K., N.H., C.M.U., H.B., E.V., and R.S. recruited the participants and collected the samples. E.K., A.Y.V., S.Sunagawa, and P.B. generated the metagenomic data. A.M., P.T.P., J.S.F., A.P., S.Sunagawa, L.P.C., G.Z., and M.A. developed the metagenomic profiling workflows and/or performed the taxonomic and functional profiling. J.W., G.Z., K.Z., P.T.P., A.K., M.A., and N.S. performed the statistical analysis and/or developed the statistical analysis workflows. E.K. and R.P. designed and performed the validation experiments. A.M.T., P.M., S.G., D.S., S.M., H.S., S.Shiba, T.S., S.Y., T.Y., L.W., A.N., and N.S. provided additional validation data. J.W., G.Z., M.A., P.T.P., and P.B. designed the figures. G.Z., J.W., M.A., and P.B. wrote the manuscript with contributions from P.T.P., A.M., S.Sunagawa, L.P.C., E.K., A.Y.V., E.V., R.S., P.S.K., H.B., E.N., N.S. and L.W. All authors discussed and approved the manuscript.

Corresponding authors

Correspondence to Manimozhiyan Arumugam, Peer Bork or Georg Zeller.

Ethics declarations

Competing interests

P.B., G.Z., A.Y.V., and S.Sunagawa are named inventors on a patent (EP2955232A1: Method for diagnosing adenomas and/or colorectal cancer (CRC) based on analyzing the gut microbiome).

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Potential confounding of individual microbial species associations by patient demographics and technical factors.

Variance explained by disease status (CRC versus CTRL) is plotted against variance explained by different putative confounding factors for individual microbial species. Each species is represented by a dot proportional in size to its abundance (see legend and Methods); core microbial markers identified in the meta-analysis are highlighted in red. For the confounder analysis, factors with continuous values were discretized into quartiles and the BMI was split into lean/overweight/obese according to conventional cutoffs. The variance explained by disease status was computed for all data; accordingly, the x values are the same in all panels and also in Fig. 1d. The variance explained by different confounding factors was computed using all samples for which data were available (indicated by the insets).
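
As an illustration of the comparison described in this legend, the following minimal Python sketch computes the fraction of variance in a species' abundance that is explained by a categorical factor, taken as the between-group sum of squares divided by the total sum of squares. This one-way ANOVA-style decomposition is an assumption made for illustration and is not confirmed as the exact procedure in Methods. The function names are hypothetical, and the BMI cutoffs of 25 and 30 kg/m² are assumed to be the "conventional cutoffs" mentioned above.

```python
import statistics


def variance_explained(values, groups):
    """Fraction of total variance explained by a categorical factor
    (between-group sum of squares / total sum of squares)."""
    grand_mean = statistics.fmean(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    if ss_total == 0:
        return 0.0
    by_group = {}
    for x, g in zip(values, groups):
        by_group.setdefault(g, []).append(x)
    ss_between = sum(
        len(xs) * (statistics.fmean(xs) - grand_mean) ** 2
        for xs in by_group.values()
    )
    return ss_between / ss_total


def discretize_quartiles(values):
    """Map continuous values to quartile labels 0..3."""
    cuts = statistics.quantiles(values, n=4, method="inclusive")
    return [sum(x > c for c in cuts) for x in values]


def bmi_category(bmi):
    """Assumed cutoffs: below 25 lean, 25 to below 30 overweight, else obese."""
    if bmi < 25:
        return "lean"
    return "overweight" if bmi < 30 else "obese"


# Toy example: abundance of one species, disease status, and age per sample.
abundance = [1.0, 1.0, 5.0, 5.0]
status = ["CTRL", "CTRL", "CRC", "CRC"]
age = [40, 70, 45, 65]
by_status = variance_explained(abundance, status)
by_age = variance_explained(abundance, discretize_quartiles(age))
```

Plotting `by_status` against `by_age` for every species would give one panel of the kind described here, with one dot per species.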

Extended Data Fig. 2 Study heterogeneity shows a strong influence on alpha and beta diversity.

a, Alpha diversity as measured with the Shannon index was computed for all gut microbial species (n = 849), reference mOTUs (n = 246), and meta-mOTUs (n = 603) separately. P values were computed using a two-sided Wilcoxon test, while the overall P value (on top) was calculated using a two-sided blocked Wilcoxon test (n = 575 independent observations; see Methods). The ANOVA F-statistic below the panel was calculated using the R function ‘aov’. b, Principal coordinate analysis of samples from all five included studies based on Bray–Curtis distance; the study is color-coded and disease status (CRC versus CTRL) is indicated by filled/unfilled circles. The boxplots on the side and below show samples projected onto the first two principal coordinates broken down by study and disease status, respectively. P values were computed using a two-sided Wilcoxon test for disease status and a Kruskal–Wallis test for study (n = 575 independent observations). For all boxplots, boxes denote the IQR with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. Country codes are as in Fig. 1b.
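
The two diversity measures named in this legend can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the Shannon index is computed with the natural logarithm (the base is not stated above), inputs are per-sample abundance vectors over the same ordered set of species, and the function names are hypothetical. The blocked Wilcoxon test, the ANOVA, and the principal coordinate analysis are not reproduced here.

```python
import math


def shannon_index(abundances):
    """Shannon index H = -sum(p_i * ln p_i) over species with p_i > 0."""
    total = sum(abundances)
    if total == 0:
        return 0.0
    props = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in props)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors:
    sum(|u_i - v_i|) / sum(u_i + v_i)."""
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / denominator


# Toy profiles over four species.
sample_a = [0.25, 0.25, 0.25, 0.25]
sample_b = [0.70, 0.30, 0.00, 0.00]
alpha_a = shannon_index(sample_a)
beta_ab = bray_curtis(sample_a, sample_b)
```

A pairwise matrix of `bray_curtis` values over all samples would be the input to the ordination shown in panel b.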

Extended Data Fig. 3 The generalized fold change extends the established (median-based) fold change to provide higher resolution in sparse microbiome data.

a, In the top row, the logarithmic relative abundances for Bacteroides dorei/vulgatus, Parvimonas micra, and F. nucleatum subspecies animalis—examples for one high-prevalence and two low-prevalence species—are shown as swarm plots for the CTRL and CRC groups. The thick vertical lines indicate the medians in the different groups and the black horizontal line shows the difference between the two medians, which corresponds to the classical (median-based) fold change. Since F. nucleatum subspecies animalis is not detectable in more than 50% of cancer cases, there is no difference between the CTRL and CRC median; thus, the fold change is 0. The lower row shows the same data, but instead of only the median (or 50th percentile), 9 quantiles ranging from 10 to 90% are shown by the thinner vertical lines. The generalized fold change is indicated by the horizontal black line again, computed as the mean of the differences between the corresponding quantiles in both groups. In the case of the sparse data (for example, F. nucleatum), the differences in the 70, 80, and 90% quantiles cause the generalized fold change to be higher than 0. b, The median fold change is plotted against the newly developed generalized fold change for all microbial species. (The core set of microbial CRC marker species is highlighted in orange.) Marginal histograms visualize the distribution for both fold change and generalized fold change. c, Scatter plots showing the relationship between fold change and generalized fold change and the area under the ROC curve (AUROC) or the shift in prevalence between CRC and CTRL, with Spearman’s rank correlations (rho) added in the top left corners; the generalized fold change provides higher resolution (wider distribution around 0) and better correlation with the non-parametric AUROC effect size measure as well as prevalence shift, which captures the difference in prevalence of a species in CRC metagenomes relative to CTRL metagenomes.
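
The definition in panel a (the mean of the differences between corresponding quantiles, from 10 to 90%, of the logarithmic abundances in both groups) can be written as a short Python sketch. The pseudocount added before taking the logarithm and the linear-interpolation quantile method are assumptions made for illustration, since neither is specified in this legend, and the function names are hypothetical.

```python
import math
import statistics


def _log_abundances(values, pseudocount):
    return [math.log10(v + pseudocount) for v in values]


def median_fold_change(ctrl, crc, pseudocount=1e-5):
    """Classical fold change: difference between the group medians
    of the log10 relative abundances."""
    return (statistics.median(_log_abundances(crc, pseudocount))
            - statistics.median(_log_abundances(ctrl, pseudocount)))


def generalized_fold_change(ctrl, crc, pseudocount=1e-5):
    """Mean difference between the 10%, 20%, ..., 90% quantiles
    of the log10 relative abundances in CRC and CTRL."""
    q_ctrl = statistics.quantiles(
        _log_abundances(ctrl, pseudocount), n=10, method="inclusive")
    q_crc = statistics.quantiles(
        _log_abundances(crc, pseudocount), n=10, method="inclusive")
    return sum(b - a for a, b in zip(q_ctrl, q_crc)) / len(q_ctrl)


# Sparse toy species: undetected in all controls and in 7 of 10 cases.
ctrl = [0.0] * 10
crc = [0.0] * 7 + [1e-3, 1e-3, 1e-2]
fc = median_fold_change(ctrl, crc)        # medians coincide, so this is 0
gfc = generalized_fold_change(ctrl, crc)  # upper quantiles differ, so this is > 0
```

In this toy case the species is detected in fewer than half of the cases, so the median-based fold change is 0, while the differences in the 70, 80, and 90% quantiles make the generalized fold change positive, as described for F. nucleatum above.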

Source Data

Extended Data Fig. 4 Microbial genera identified in the meta-analysis to be associated with CRC.

a, The meta-analysis significance of microbial genera, computed using a univariate, two-sided Wilcoxon test blocked for ‘study’ and ‘colonoscopy’ (n = 574 independent observations), is given by bar height (FDR = 0.005). Underneath, significance (FDR-corrected P value computed using a two-sided Wilcoxon test) and generalized fold changes (see Methods) within individual studies are displayed as heatmaps in gray and color, respectively (see keys). Genera are ordered by meta-analysis significance and direction of change. b, For highly significant genera (meta-analysis FDR = 1 × 10⁻⁵), association strength is quantified by the area under the ROC curve across individual studies (color-coded diamonds); 95% confidence intervals are depicted by gray lines. Country codes are as in Fig. 1b.
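As a rough illustration of the per-study association strength in panel b, the AUROC can be computed as the probability that a randomly chosen CRC sample has a higher abundance than a randomly chosen control. The sketch below assumes this pairwise formulation with ties counted as one half; the study names and abundance values are hypothetical, and confidence intervals are not computed.

```python
def auroc(ctrl, crc):
    """Fraction of (CRC, CTRL) pairs in which the CRC value is higher; ties count 0.5."""
    wins = 0.0
    for x in crc:
        for y in ctrl:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(crc) * len(ctrl))

# Hypothetical per-study abundances of one genus: (controls, cases).
studies = {
    "study_A": ([0.0, 0.1, 0.2, 0.3], [0.2, 0.4, 0.5, 0.6]),
    "study_B": ([0.0, 0.0, 0.1, 0.2], [0.0, 0.1, 0.3, 0.4]),
}
per_study_auroc = {name: auroc(ctrl, crc) for name, (ctrl, crc) in studies.items()}
```

A value of 0.5 corresponds to no separation between groups and a value of 1.0 to perfect enrichment in CRC.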

Source Data

Extended Data Fig. 5 The core set of CRC-enriched microbial species can be stratified into four clusters based on co-occurrence in CRC metagenomes.

a, The heatmap shows the Jaccard index (computed by comparing marker-positive samples; see Methods) for the core set of microbial marker species, computed on CRC cases only. Clustering was performed using the Ward algorithm as implemented in the R function ‘hclust’. The inset shows the distribution of Jaccard similarities within each cluster and for the background (all similarities between species not in the same cluster). b,c, Barplots show the fraction of CRC samples that are positive for a marker species cluster (defined as the union of positive marker species) broken down by patient subgroups based on BMI (b) and age (c) (see Fig. 2b–d for other patient subgroups). The significance of the associations between CRC subgroups and marker species clusters was tested using the Cochran–Mantel–Haenszel test blocked for ‘study’ and ‘colonoscopy’. (No significant associations were detected.) d, For the core set of microbial species with a genomic reference, the presence (red) or absence (white) of superoxide dismutase, peroxidase, and catalase are shown as heatmaps. Presence of the enzyme was determined by checking the protein annotations for the reference projects (see NCBI BioProject ID) in http://progenomes.embl.de/.
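The co-occurrence measure in panel a and the cluster positivity in panels b and c can be illustrated as follows. This sketch assumes that marker positivity has already been called per sample (the thresholds are described in the Methods and are not reproduced here); species and sample names are hypothetical, and the Ward clustering step performed with 'hclust' is not shown.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two sets of marker-positive samples."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical marker-positive CRC samples for three species.
positive = {
    "species_A": {"s1", "s2", "s3"},
    "species_B": {"s2", "s3", "s4"},
    "species_C": {"s5"},
}

# Pairwise similarities, as shown in the heatmap.
similarity = {
    (x, y): jaccard(positive[x], positive[y])
    for x, y in combinations(sorted(positive), 2)
}

def cluster_positive(positive, members):
    """Samples positive for a cluster: union of its positive marker species."""
    samples = set()
    for species in members:
        samples |= positive[species]
    return samples
```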

Source Data

Extended Data Fig. 6 Coefficients of LOSO LASSO logistic regression models compared to models trained on individual studies.

a, The mean coefficients (feature weights) from LASSO cross-validation models trained on single studies (color-coded) are plotted against the single-feature AUROC for each species feature. The horizontal lines highlight the microbial species that are, for at least one study, selected in more than 50% of the models in cross-validation and account for more than 10% of the absolute model weight in at least 10% of the cross-validation models. b, The same is shown for the models trained in the LOSO setting (see Methods). The colors indicate which study has been left out of the training set (and is used for validation). The weights of the LOSO models are spread across more species and are thus generally lower; species are highlighted by the horizontal lines if their weights explain more than 2.5% of the absolute model weight in at least 10% of cross-validation models and they have been selected in more than 50% of models in cross-validation. c, The inset shows the distribution of the number of non-zero coefficients across all cross-validation models. d, The bar height indicates the number of non-zero coefficients that are shared between the mean models for each study or left-out study, respectively. e, The study-to-study difference (computed as the median of all pairwise differences between model weights for a single species across the mean models) for cross-validation single-study models is plotted against the same measure for the LOSO models. Species with a study-to-study difference of more than 0.02 in the cross-validation models are highlighted and annotated, showing much larger variability between models trained on single studies compared to LOSO models. Country codes as in Fig. 1b.
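The highlighting rule and the study-to-study difference described in this legend can be sketched as below. The data layout (one dictionary of coefficients per cross-validation model), the function names, and the use of absolute pairwise differences are assumptions made for illustration; the toy coefficients are hypothetical.

```python
from itertools import combinations
from statistics import median

def highlighted_features(models, min_selected=0.5, min_weight_share=0.10,
                         min_model_share=0.10):
    """models: list of dicts mapping feature name -> coefficient, one per CV model."""
    features = set().union(*models)
    n = len(models)
    result = []
    for feature in sorted(features):
        # Number of models in which the feature has a non-zero coefficient.
        selected = sum(1 for m in models if m.get(feature, 0.0) != 0.0)
        # Number of models in which it carries a large share of the absolute weight.
        heavy = 0
        for m in models:
            total = sum(abs(w) for w in m.values())
            if total > 0 and abs(m.get(feature, 0.0)) / total > min_weight_share:
                heavy += 1
        if selected / n > min_selected and heavy / n >= min_model_share:
            result.append(feature)
    return result

def study_to_study_difference(weights):
    """Median of all pairwise (absolute, by assumption) differences in weight."""
    return median(abs(a - b) for a, b in combinations(weights, 2))

# Hypothetical coefficients from four cross-validation models.
models = [
    {"sp_A": 0.8, "sp_B": 0.1, "sp_C": 0.1},
    {"sp_A": 0.6, "sp_B": 0.0, "sp_C": 0.4},
    {"sp_A": 0.9, "sp_B": 0.05, "sp_C": 0.05},
    {"sp_A": 0.7, "sp_B": 0.0, "sp_C": 0.3},
]
highlighted = highlighted_features(models)
```

For the LOSO models in panel b, the same function would be called with `min_weight_share=0.025`.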

Source Data

Extended Data Fig. 7 Analysis of LOSO models for prediction bias.

a, To examine whether species- and gene family-level classification models are confounded, that is, biased toward certain patient subgroups, the prediction scores from the LOSO models are broken down into strata for each clinical parameter (for example, female and male for sex). The prediction bias for each variable was tested by Wilcoxon (for sex and BMI) or Kruskal–Wallis (all others) tests while blocking for study as the confounder. The boxes denote the IQRs, with the median as the horizontal black line and the whiskers extending up to the most extreme point within the 1.5-fold IQR. A significant difference in prediction score was detected only for the CRC stage. This stage bias is more pronounced for gene family than for species models. b, To examine the CRC stage bias further, the barplots show the true positive rate corresponding to an overall 10% FPR (see also Fig. 3c) for the different CRC stages, displaying slightly higher classification sensitivity for late-stage CRC for both species and gene family models.
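The stratified sensitivity in panel b can be illustrated with a short sketch: a score cutoff is chosen so that at most 10% of controls exceed it, and the true positive rate is then computed per CRC stage. The cutoff rule, the stage labels, and all scores below are assumptions for illustration only.

```python
import math

def threshold_at_fpr(ctrl_scores, max_fpr=0.10):
    """Cutoff such that at most max_fpr of the controls score strictly above it."""
    ranked = sorted(ctrl_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ranked))  # tolerated false positives
    return ranked[allowed]

def tpr_by_stage(case_scores, case_stages, cutoff):
    """True positive rate within each stage at the given cutoff."""
    totals, hits = {}, {}
    for score, stage in zip(case_scores, case_stages):
        totals[stage] = totals.get(stage, 0) + 1
        hits[stage] = hits.get(stage, 0) + (1 if score > cutoff else 0)
    return {stage: hits[stage] / totals[stage] for stage in totals}

# Hypothetical prediction scores.
ctrl_scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.9]
case_scores = [0.3, 0.5, 0.6, 0.5, 0.7, 0.8]
case_stages = ["early", "early", "early", "late", "late", "late"]

cutoff = threshold_at_fpr(ctrl_scores)
rates = tpr_by_stage(case_scores, case_stages, cutoff)
```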

Source Data

Extended Data Fig. 8 Cross-study performance of statistical models based on KEGG KO abundances, single-gene abundances from the metagenomic gene catalog (IGC), and the combination of taxonomic and eggNOG database abundance profiles.

a–c, CRC classification accuracy resulting from cross-validation within each study (gray boxes along the diagonal) and study-to-study model transfer (external validations off the diagonal) as measured by the AUROC for the classification models trained on KEGG KOs (a), models based on the gene catalog (b), and models based on the combination of taxonomic and eggNOG database abundance profiles (c) (see Methods for the details on the statistical modeling workflows). The last column depicts the average AUROC across external validations. The barplots on the right show that the classification accuracy on a hold-out study improves if the data from all other studies are combined for training (LOSO validation) relative to models trained on data from a single study (study-to-study transfer, indicated by the bar color) consistently across different types of input data. Country codes as in Fig. 1b.
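The two validation schemes compared in this figure differ only in how samples are partitioned by study. A minimal sketch of both partitions is given below; the function names and study labels are hypothetical, and model training itself is not shown.

```python
def loso_splits(study_of_sample):
    """Yield (held_out_study, train_indices, test_indices) for LOSO validation."""
    for held_out in sorted(set(study_of_sample)):
        train = [i for i, s in enumerate(study_of_sample) if s != held_out]
        test = [i for i, s in enumerate(study_of_sample) if s == held_out]
        yield held_out, train, test

def study_to_study_pairs(study_of_sample):
    """All ordered (training study, test study) pairs for study-to-study transfer."""
    studies = sorted(set(study_of_sample))
    return [(a, b) for a in studies for b in studies if a != b]

# Hypothetical study label for each of five samples.
labels = ["A", "A", "B", "C", "C"]
splits = list(loso_splits(labels))
pairs = study_to_study_pairs(labels)
```

In the LOSO setting each held-out study is predicted by a model trained on all remaining studies, whereas study-to-study transfer trains on exactly one other study.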

Source Data

Extended Data Fig. 9 Identification of bai genes in metagenomes.

Putative bai genes identified in the metagenomic IGC were clustered by co-abundance in metagenomes to infer genomic linkage (see Methods) and thereby operon completeness and species of origin. a, For each resulting cluster of putative bile acid-converting genes, the mean relative abundance was plotted against the mean percentage of protein identity derived from global alignment against the known bile acid-converting genes from C. scindens and C. hylemonae (see Methods). Completeness, that is, how many of the 11 different bai gene functions are represented in each cluster, and mean gene-to-gene correlation of relative abundance within each cluster are encoded by dot size and color, respectively (see legend). The four clusters with a mean protein identity > 75% to known bai operon-containing genomes were included in the subsequent analysis and labeled with the most highly correlated mOTU (see b). b, Pearson correlation between gene cluster abundances and the relative abundance of the most highly correlated species (in logarithmic space) is given by the bar height for the four gene clusters identified in a. The most highly correlating species is highlighted in darker gray (see labeling of gene clusters in a). c, The log-transformed abundances of all bai genes and the four species identified in b are shown as boxplots for CTRLs (gray) and CRC cases (red). Assessing the significance of the differences between CRC and CTRLs (using a Wilcoxon test blocked for ‘study’ and ‘colonoscopy’) demonstrates a much more significant CRC enrichment of the aggregated metagenomic bai gene abundance than of the individual clostridial species to which these belong. d, ROC curve for the qPCR quantification of the baiF gene in the genomic DNA of a subset of samples in the German study (see Methods and Fig. 4e).
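The species assignment in panel b, which links each gene cluster to its most highly correlated species in logarithmic space, can be sketched as follows. The pseudocount, the mOTU names, and the abundance values are hypothetical, and the sketch assumes that no abundance vector is constant (otherwise the correlation is undefined).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def most_correlated_species(cluster_abundance, species_abundance, pseudocount=1e-6):
    """Return the best-correlated species and all correlations (log space)."""
    log_cluster = [math.log10(v + pseudocount) for v in cluster_abundance]
    scores = {
        name: pearson(log_cluster, [math.log10(v + pseudocount) for v in values])
        for name, values in species_abundance.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical abundances across four samples.
cluster = [1e-4, 2e-4, 4e-4, 8e-4]
species = {
    "mOTU_x": [1e-3, 2e-3, 4e-3, 8e-3],
    "mOTU_y": [8e-3, 1e-3, 4e-3, 2e-3],
}
best, scores = most_correlated_species(cluster, species)
```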

Source Data

Extended Data Fig. 10 Validation of the meta-analysis of single-species associations in three independent cohorts.

a, Heatmap showing for the core set of CRC-associated species (see Fig. 1) the rank of the respective species within the associations of each study, including the three independent validation cohorts (see Table 1), compared to the rank in the meta-analysis (meta) on the left. b, Precision-recall curves for the different independent validation cohorts using the meta-analysis set of associated species at FDR = 0.005 (top) and FDR = 1 × 10⁻⁵ (bottom) as the ‘true’ set (see Methods) and the naïve (uncorrected) within-cohort significance as the predictor (see Supplementary Fig. 2). IT1, Italy 1; IT2, Italy 2; JP, Japan; other country codes are as in Fig. 1b.
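The precision-recall evaluation in panel b can be illustrated with a short sketch in which species are ranked by their within-cohort P value and compared against the meta-analysis set treated as the 'true' set. The species names and P values are hypothetical, ties in the ranking are not handled specially, and the 'true' set is assumed to be non-empty.

```python
def precision_recall_curve(p_values, true_set):
    """p_values: dict species -> within-cohort P value.
    true_set: species called significant in the meta-analysis.
    Returns a list of (recall, precision) points, one per rank cutoff."""
    ranked = sorted(p_values, key=p_values.get)  # most significant first
    curve = []
    hits = 0
    for k, species in enumerate(ranked, start=1):
        if species in true_set:
            hits += 1
        curve.append((hits / len(true_set), hits / k))
    return curve

# Hypothetical uncorrected P values in one validation cohort.
p_values = {"sp_A": 0.001, "sp_B": 0.2, "sp_C": 0.01, "sp_D": 0.5}
meta_analysis_set = {"sp_A", "sp_B"}
curve = precision_recall_curve(p_values, meta_analysis_set)
```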

Source Data

Supplementary information

Supplementary Figures 1–8, Supplementary Tables 1, 2, and 5, and Supplementary References

Reporting Summary

Supplementary Tables

Supplementary Tables 3, 4, and 6

Supplementary Data

Supplementary Data 1 and 2

Source data

Source Data Fig. 1

Statistical Source Data

Source Data Fig. 2

Statistical Source Data

Source Data Fig. 3

Statistical Source Data

Source Data Fig. 4

Statistical Source Data

Source Data Fig. 5

Statistical Source Data

Source Data Extended Data Fig. 1

Statistical Source Data

Source Data Extended Data Fig. 2

Statistical Source Data

Source Data Extended Data Fig. 3

Statistical Source Data

Source Data Extended Data Fig. 4

Statistical Source Data

Source Data Extended Data Fig. 5

Statistical Source Data

Source Data Extended Data Fig. 6

Statistical Source Data

Source Data Extended Data Fig. 7

Statistical Source Data

Source Data Extended Data Fig. 8

Statistical Source Data

Source Data Extended Data Fig. 9

Statistical Source Data

Source Data Extended Data Fig. 10

Statistical Source Data

About this article

Cite this article

Wirbel, J., Pyl, P.T., Kartal, E. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med 25, 679–689 (2019). https://doi.org/10.1038/s41591-019-0406-6

Received: 30 July 2018
Accepted: 20 February 2019
Published: 01 April 2019
Issue Date: April 2019
DOI: https://doi.org/10.1038/s41591-019-0406-6
