Antibiotics-induced monodominance of a novel gut bacterial order

Objective The composition of the healthy human adult gut microbiome is relatively stable over prolonged periods, and representatives of the most highly abundant and prevalent species have been cultured and described. However, microbial abundances can change on perturbations, such as antibiotics intake, enabling the identification and characterisation of otherwise low abundant species. Design Analysing gut microbial time-series data, we used shotgun metagenomics to create strain level taxonomic and functional profiles. Community dynamics were modelled postintervention with a focus on conditionally rare taxa and previously unknown bacteria. Results In response to a commonly prescribed cephalosporin (ceftriaxone), we observe a strong compositional shift in one subject, in which a previously unknown species, U Borkfalki ceftriaxensis, was identified, blooming to 92% relative abundance. The genome assembly reveals that this species (1) belongs to a so far undescribed order of Firmicutes, (2) is ubiquitously present at low abundances in at least one third of adults, (3) is opportunistically growing, being ecologically similar to typical probiotic species and (4) is stably associated to healthy hosts as determined by single nucleotide variation analysis. It was the first coloniser after the antibiotic intervention that led to a long-lasting microbial community shift and likely permanent loss of nine commensals. Conclusion The bloom of U B. ceftriaxensis and a subsequent one of Parabacteroides distasonis demonstrate the existence of monodominance community states in the gut. Our study points to an undiscovered wealth of low abundant but common taxa in the human gut and calls for more highly resolved longitudinal studies, in particular on ecosystem perturbations.

In a first step, the complete genes were clustered at 95% nucleotide identity, a commonly used 137 cutoff in constructing gene catalogues (Sunagawa, Coelho, Chaffron, Kultima, Labadie, Salazar, 138 et al. 2015). For sequence clustering CD-HIT v4.6.1 [12] was used in est mode, employing 139 parameters adapted to full length genes: "-n 9 -G 1 -aS 0.95 -aL 0.6 -d 0 -c 0.95 -g 0". This 140 resulted in 3,257,016 clustered full length genes, onto which the incomplete genes were 141 mapped with bowtie2 [13]. Incomplete genes mapping with at least 95% nucleotide identity were 142 directly integrated into the initial clustering of complete genes. The remaining sequences were 143 clustered as before with CD-HIT, but changing alignment length parameters to "-aL 0.3 -aS 0.8" 144 to account for incomplete genes, resulting in 2,466,595 additional gene clusters. Additionally, 145 genes belonging to the 40 conserved marker genes were clustered separately, using clustering 146 identity thresholds as described in [14]. Merging marker gene clusters, incomplete and complete 147 clustered genes resulted in the gene catalogue, with a total of 5,790,967 non-redundant genes 148 at 95% nucleotide identity cut-off. The gene catalogue nucleotide and amino acid sequences 149 are available at http://vm-lux.embl.de/~hildebra/Bcef_GC. 150 151

Estimating species abundances in metagenomes 152
To estimate the species abundance per sample and at the same time also infer the species not 153 in our database, we used the 40 single copy ubiquitously present marker genes (MGs) that can 154 be predicted using specI [14] in the gene catalog. Since hidden markov models trained on the 155 tree of life are used for this prediction, we can assume that the MGs of unknown classes, 156 families, genera will be detected in this approach. From this set of marker genes, we used a 157 similarity based approach to identify species within these, mapping all predicted MGs with 158 lambda 1.9.3 [15] onto MGs clustered to specI's in the proGenomes database [16] and using a 159 MG specific similarity cutoff [14]. Hits were immediately accepted, if a metagenomic MG was 160 hitting a single proGenomes specI at the required similarity threshold. However, since several 161 metagenomic MGs had valid hits to multiple proGenomes MGs, we supplemented the 162 identification of a species using a coabundance approach, in concept similar to [17]. Briefly, first 163 for established specI's we calculated the average profile across all MGs. If the MG showed 164 similarity to a given specI, i.e. was a candidate, the remaining MGs that were not uniquely assigned to a specI were correlated to existing specI profiles. Here we also used species and 166 genus taxonomic assignments, to allow for mismatches in nucleotide sequence. MG's with a 167 spearman correlation coefficient > 0.9 to a candidate specI were immediately added to a given 168 specI. 169 Additionally, at different phases in the clustering algorithm, within each specI, MGs were 170 checked to correlate with the average profile (spearman) < 0.9, or were removed as false 171 positive assignments and iteratively tried to be added to different specI's, as described above. 172

Mapping reads against references 173
To estimate the abundance of genes in each assembly, the reads from a sample were mapped 174 against the assembly of these. Furthermore, in specific cases also all reads were mapped 175 against reference genomes (to estimate genome presence) or reference genome + decoy set 176 (see below) to map reads against a reference genome in order to call genetic variants. For 177 these use cases, all unfiltered reads were used in the mapping, irrespective of quality. This was 178 important as we were interested in very low abundant species and quality filtered reads were in 179 this project important to ensure a clean assembly; if reads were mapping at a good quality 180 against this assembly it was assumed that their quality was sufficient. To ensure good matches, 181 bowtie2 [13] was used with the options "--no-unal --end-to-end --score-min L,-0.6,-0.6" and 182 additionally "-X 10000" in the case of mate pair libraries. The resulting bam files were sorted, 183 duplicates removed and indexed using samtools 1.3.1 [18]. From these depth profiles were 184 created using bedtools which were translated with a custom C++ program "rdCov" 185 (https://github.com/hildebra/rdCover) into average coverage in a 50 bp window, per contig or per 186 gene predicted on each contig. 187 188

Functional annotations of protein sequences 189
Translated amino acid sequences of either complete genomes, gene catalogs or contigs were 190 predicted with prodigal [11] in normal mode, or metagenomic mode for the latter two cases. 191 These amino acid sequences were mapped to functional reference databases using diamond 192 [19] in blastp mode using options "-k 5 -e 1e-5 -sensitive". The reported mapping were further 193 filtered using customized Perl scripts, with filtering parameters set according on simulations 194 reported previously . In general, we required for each database that hits 195 had >25% AA identity, >30% gene coverage, >60 bit score, >60 AA acid alignment length and 196 an e-value <1e-7. The highest scoring hit was then accepted. 197 The following databases were used for functional assignments: Cazyme genes 198 Functional abundance matrices were normalized by total number of reads used in mapping. This 216 normalization considers differences in library size and includes the fraction of unmapped (that is 217 functionally unclassified) reads. Replication times of assembled genomes were estimated using 218 [28]. 219 220

Genome binning 221
Genomes were de novo binned for U B. ceftriaxensis and P. distasonis. This was based on 222 binning samples HD.S1.374 and HD.S1.377, respectively, with MaxBin [29] and a data set wide 223 binning via Canopy clustering [17]. MaxBin was run with default options, except for the changed 224 option "-min_contig_length 400". 225 Canopy clustering was run with options to optimize for rare, highly abundant species: "-p TC --226 die_on_kill --stop_criteria 250000 --cag_filter_min_sample_obs 5 --227 cag_filter_max_top3_sample_contribution 1 --filter_max_top3_sample_contribution 1" on a gene 228 abundance matrix that was normalized by sum using rtk [30]. The gene bins of both binning 229 methods were combined and the abundance profiles of the respective genes further analyzed in 230

R. 231
For each species, a pearson correlation matrix for the candidate gene profiles were created, 232 and the inverse of correlation coefficient was used to hierarchically cluster these (R function hclust, method = complete). The hierachical cluster was iteratively subdivided, until all 40 234 marker genes were in the same cluster (illustrated in Suppl. Fig. 16 for U B. ceftriaxensis). Based 235 on this clustering, the metagenomic assembly was selected combining most target genes in the 236 fewest scaffolds. In a last step the scaffolds based on HiSeq paired end reads were further 237 scaffolded with mate pair reads (insert size = 5000) from sample HD.S1.377. The mate pair 238 reads were mapped onto the selected species contigs using bowtie2 [13]. Based on these 239 mappings, scaffolds were calculated using BESST [31] using default parameters with the 240 exception of "-z 5000". In the case of U B. ceftriaxensis, this decreased the number of scaffolds in 241 the primary Spades assembly from 20 to 8 scaffolds. Genome completeness was estimated 242 In human variant calling, the reference genome (e.g. hs37d5) contains often a so-called decoy 287 contigs, that cover recurrently (but rarely) assembled regions that are not part of the primary 288 assembly of the human genome [9]. In order to decrease the rate of false positive read 289 mappings in the much more complex metagenome with many species that are sometimes 290 closely related, we used a similar principle. However, instead of adding new reference genomes 291 we took the inverse approach relying on our metagenomic assembly, which should contain the 292 best matching references to any reads from a given sample (see section metagenomic 293 assembly). From this assembly contigs similar to the mapping target were first removed. To do 294 this, the metagenomic assembly was mapped on the target sequences using blat [52] with the 295 following parameters: -t=dna -q=dna -minIdentity=95 -minScore=100. These identified matching 296 contigs were removed from the assembly, and the remaining assembled contigs and the 297 reference genome were used as new database to map the metagenomic reads against. In our 298 parameter tests, this procedure increased average mapping scores at variant calls (data not 299 shown) and was therefore used for all SNP calls. The script to create decoy mappings is 300 available in the MATAFILER pipeline. 301 302

Estimating dominant strain and its genetic variation 303
First reads from a given sample were mapped against the reference genome, using the above 304 described decoy mapping procedure. From these, nucleotide variants were called using 305 freebayes ver. 1.1.0 [53] with the following options -m 30 -q 30 -u -i -C 1 -F 0.1 -k -X --pooled-306 continuous --report-monomorphic --min-repeat-entropy 1 --use-best-n-alleles 2 -G 1. Freebayes 307 is a variant caller that corrects for misalignments using a local realignment among other 308 features, adapted to SNV calling in highly heterologous cancer genomes. The output vcf file was 309 filtered with a custom Perl script (vcf2cons.pl, available as part of the MATAFILER pipeline), that 310 created a consensus sequence from the reference allele, unless the alternative allele 311 frequencies was >0.501. The minimum coverage was set to 2 alleles. The 0.501 threshold 312 serves to avoid introducing a reference bias in the consensus sequence and to filter reads 313 assignments at a depth of 2, that have a 50% allele frequency. 314 Further, using the 50% allele frequency cutoff circumvents the possibility of several strains 315 being present in the same host, by only estimating the genome of this strain. If a second strain 316 would in another time point arise to become the dominant strain, this would be immediately 317 inferable from the within host diversity and flagged as a strain exchange (which happened in two 318 hosts for P. distasonis). 319 320 Estimating gene copy numbers on assembled genomes 321 Gene copies on a genome can severely influence SNV calling, as two similar copies of a gene 322 may accumulate mutations over time. These fixed mutations differing between the two genes 323 will be called as high confidence SNVs at 50% abundance, thus appearing as majority allele 324 based on a random process in underlying reads covering the duplicated genes. For this we 325 developed our own R scripts to use the coverage estimation across the species contigs in a 50 326 bp window (described above). Coverage across a contig is often not linear, due to growth of 327 bacteria and a coverage gradient between the origin and terminus of replications [54]. 328 Therefore, we first used a robust linear model (function rlm in R) to find a linear fit between 329 coverage and genomic position. We corrected the genome coverage, using the residual of this 330 linear model to obtain genome coverage estimates unobstructed by growth rate coverage 331 dynamics. Within this fit, we identified positions using the arbitrary threshold > 1.7 (corrected) average genome coverage for more than 4 consecutive 50 bp windows, to ensure that spurious 333 variation in genome coverage was not flagged as SNVs. 334 335

Phylogenies estimated from consensus genomes per sample 336
The consensus genome per sample (as described above) was first further masked, at the 337 beginning and end of contigs (200bp) and regions that were estimated to be copy number (option "-m TEST") to automatically determine the optimal nucleotide substitution model. To 348 further validate our mutation rate estimates, we also constructed bootstrapped trees for non-349 and synonymous trees, using the option "--fast" to obtain 100 phylogenetic trees that were 350 subsequently used to determine confidence intervals. 351 352

Statistical analysis 353
All statistical analysis was conducted in R 3.3.4 unless otherwise noted. Composition matrices 354 such as species of ARG abundance matrices were normalized by sum, unless otherwise noted. 355 Ordinations were visualized with non-metric multidimensional scaling (NMDS), as implemented 356 in the vegan function "metaMDS", engine "monoMDS", using between sample Bray-Curtis 357 distances, and restricted to two dimensions, unless otherwise noted. Ordination via principal 358 coordinate analysis (PCoA) was calculated via the capscale function, as implemented in vegan. 359 Differences between univariate variables such as taxonomic abundance or taxa richness were 360 tested using a non-parametric Wilcoxon rank-sum test, with Benjamini-Hochberg multiple testing 361 correction. Post-hoc statistical testing for significant differences between all combinations of 362 three or more groups was conducted only for taxa with p<0.2 in the Kruskal-Wallis test. For this, 363 wilcoxon rank-sum tests were calculated for all possible group combinations and corrected for 364 multiple testing using Benjamini-Hochberg multiple testing correction.  Methods), we had to reconstruct its genome from metagenomes in order improve our 393 understanding of the species' biology. A subset of metagenomic samples that had a relative 394 abundance of U B. ceftriaxensis >= 0.2% were selected and used in all subsequent metagenomic 395 analysis (132 samples total, Suppl. Table 8), although even in this subset U B. ceftriaxensis was 396 often just above the metagenomic detection threshold (Suppl. Fig. 1b), demonstrating its 397 rareness. These 132 metagenomes were de novo assembled and a novel gene catalog was 398 constructed at a 95% gene identity threshold, totaling 5,790,967 non-redundant genes (see 399 Methods). Through an initial draft genome binning of sample HD.S1.374, four species bins were 400 recovered (Fig. 1a), consisting of U B. ceftriaxensis, as well as Lactobacillus casei, 401 Streptococcus thermophilus and Propionibacterium freudenreichii. The GC vs abundance plots 402 for days 347, 376, 377, 378 and 380 illustrating these species are show in Suppl. Fig. 17. 403 The assembly quality of the three latter species was low, and those bins were therefore not 404 used in further analysis (data not shown). Further, we recovered an initial binning of P. 405 distasonis for sample HD.S1.377. Since the genomes were instable when rarefying to different 406 sequencing depths, we used additionally canopy clustering and identified the bins representing 407 these five species (Nielsen et al. 2014). Since predicted bins often differed by hundreds of 408 genes (Suppl. Fig. 16c), a custom binning method starting from these two bins (Suppl. Fig.  409 16a,b, see Methods) was combined with a mate-pair sequencing library generated from sample 410 HD.S1.380 (1.3% relative U B. ceftriaxensis abundance). Based on this, a draft genome for both 411 U B. ceftriaxensis and P. distasonis was obtained that were used for further analysis unless 412 otherwise mentioned. The U B. ceftriaxensis genome consisted of 8 scaffolds (10 contigs) with a 413 N50 of 1,925,355 bp and a total length of 2,608,799 bp (see Suppl. Table 3

Metabolism & Genetics 443
Functional annotation indicates that U B. ceftriaxensi is anaerobe with fermentative metabolism 444 and able to utilize a wide range of sugars (detailed listing in Suppl. Table 6, see also Suppl. Although the metabolism of U B. ceftriaxensi seems to function anaerobic, it's possible that it can 514 tolerate some degree of oxygenation, as we can find defense genes against reactive oxygen 515 species (ROS), like rubrerythrin and thioredoxin. However, this could also be related to defense 516 against white blood cells, that can use ROS to control bacteria [75]. 517 We could assign 359 genes to either the Virulence Factor database (VFDB) [76] or a custom 518 virulence factor subset from PATRIC [24]. However, most of these assignments do refer to parts 519 of the bacterial metabolism. Several endotoxins seem to be present (lipopolysaccharide and 520 lipooligosaccharides), genes involved in biofilm formation, O2 resistance and genes involved in 521 antiphagocytosis. Several genes are putatively involved in iron, copper and zinc uptake. Of note 522 is further Haemolysin and corresponding transporters, required to lyse eukaryotic cells. 523 Comparing to the expected distribution given other Firmicutes bacteria, we observed a n.s. 524 enrichment in the putative virulence categories Autophagy, Phosphate uptake, Escape from 525 phagosome and Chaperone (data not shown). 526 527 Resistance of U B. ceftriaxensis to beta-lactam antibiotics 528 U B. ceftriaxensis encodes genes related to drug resistance, such as genes coding for mate 529 efflux family proteins, putative beta-lactamases and phosphinothricin acetyltransferase, which 530 confer resistance to herbicides in Streptomyces hygroscopicus [77]. 531 We further tested the two putative beta-lactamases experimentally, that had a low similarity to 532 known beta-lactamases (33% and 31% AA identity, respectively to NCBI NR hypothetical beta-533 lactamases). Both candidate genes seem to be part of the core genome (peg.343, peg.1661, 534 Suppl. Table 6), as they were assembled into the largest contig. 535 To determine the beta-lactamase activity of these genes, they were expressed in E. coli and 536 beta-lactamase activity was tested on nitrocefin disks (Sigma). The disks changed color when 537 they had contact with the CTX-M 15 beta-lactamase expressing colony used as positive control 538 indicating strong beta-lactamase activity. A weaker, but clearly positive reaction was observed 539 for U B. ceftriaxensis fig|6666666.214148.peg.343 (Suppl. Ceftriaxone disrupts the peptidoglycan synthesis in the bacterial cell wall, inhibiting essential 547 penicillin binding proteins (pbp's). Bacterial resistance to beta-lactam antibiotics can be 548 conferred either through resistant forms of pbp's, beta lactamases or through spore formation 549 that survive the antibiotic treatment. U B. ceftriaxensis had two putative beta lactamases, but 550 both showed no or insufficient activity against ceftriaxone experimentally (see above). 551 Alternatively, U B. ceftriaxensis might survive as dormant persistor cells or spores, such as C. 552 difficile [78], as supported by high abundance of genes related to spore formation. 553 However, spore formation, that is persistence without growth, also seems unlikely as sole 554 explanation, since U B. ceftriaxensis absolute cell numbers were >1,100x increased compared to 555 baseline levels on the first day after antibiotic treatment. One would still expect a remaining 556 ceftriaxone concentration of 12-25% in the gut at this time point (see Methods). Thus, a direct 557 cephalosporine resistance of U B. ceftriaxensis seems likely, and this resistance is in gram 558 positive bacteria often conferred through resistant pbp' that are inherently resistance to beta-559 lactams [79]. The U B. ceftriaxensis genome encodes two class 5/6 pbp's, a class that has been 560 linked to confer cephalopsporin resistance in Enterococci and Listeria [80,81]. Before, it has 561 been shown to confer specifically ceftriaxone resistance [82]. While there are no molecular tools 562 to test the resistance of this pbp by transfer of the gene into another species, it seems most 563 plausible that U B. ceftriaxensis is resistant to ceftriaxone and that this resistance might be 564 intrinsically conferred through a combination of resistant pbp's and surviving as spores. 565

Detection of U B. ceftriaxensis in public samples 567
To investigate the prevalence of U B. ceftriaxensis in diverse environments we screened 3,692 568 public metagenomic datasets from human associated and other environments (see Methods). 569 The overall prevalence of U B. ceftriaxensis was 20.1% over all samples (Suppl. Table 7), being 570 most prevalent in gut samples from 3 continents (30.1% of 1,801) but entirely absent in ocean 571 samples (n=243). Exploring different human body sites, it is detectable in 8 out of 18 non-gut 572 sites. However, in non-gut samples detection rate was spurious (2.6% of 952 non-gut human 573 associated samples). We further analyzed 1,163 metagenomes balanced among animal guts 574 (mouse, pig, cat and dog), as well as ocean and soil, but did not detect this species in any of 575 these. 576 577

Density of bacterial cells in stool samples 578
Cell numbers are relatively stable within and between patients, ranging from 0.11 -1.03 e12 579 cells/ml stool. This seems realistic, although slightly higher than the average reported in a 580 metanalysis of several studies (0.92e11 cells/ml stool, [83]). Since our cell counts were 581 restricted to 19 samples total, we had no overlapping time point between pre-antibiotic U B. 582 ceftriaxensis detection and cell counts. To still infer absolute ratio change in U B. ceftriaxensis 583 bloom at day 374 compared to previous samples, we used the relative abundance of the deeply 584 sequenced day 7 and the mean abundance of all samples with cell counts before day 374, 585 namely day 0 and 60 to approximate the absolute abundance of U B. ceftriaxensis before day 586 374. 587 588 P. distasonis patient specific association 589 P. distasonis was associated to an antibiotic induced monodominance and therefore we wanted 590 to investigate if P. distasonis was replaced by another strain of the same species during the 591 antibiotic treatment. This analysis showed that P. distasonis remained stable over time in 592 subject HD.S1, but also most other subjects in the discovery dataset, with the exception of two 593 subjects that showed a clear change of P. distasonis genotypes (Suppl. Fig. 10a). Furthermore, 594