Image-based consensus molecular subtype classification (imCMS) of colorectal cancer using deep learning

Image analysis is a cost-effective tool to associate complex features of tissue organisation with molecular and outcome data. Here we predict consensus molecular subtypes (CMS) of colorectal cancer (CRC) from standard H&E sections using deep learning. Domain adversarial training of a neural classification network was performed using 1,553 tissue sections with comprehensive multi- omic data from three independent datasets. Image-based consensus molecular subtyping (imCMS) accurately classified CRC whole-slide images and preoperative biopsies, spatially resolved intratumoural heterogeneity and provided accurate secondary calls with higher discriminatory power than bioinformatic prediction. In all three cohorts imCMS established sensible classification in CMS unclassified samples, reproduced expected correlations with (epi)genomic alterations and effectively stratified patients into prognostic subgroups. Leveraging artificial intelligence for the development of novel biomarkers extracted from histological slides with molecular and biological interpretability has remarkable potential for clinical translation.


INTRODUCTION
Colorectal cancer (CRC) is a disease with heterogeneous molecular subtypes, variable clinical course and prognosis (1). An increasing understanding of CRC biology has led to the development of targeted treatments directed against key pro-oncogenic signalling pathways, but these treatments are only effective in a small proportion of patients (2,3). Molecular stratification of CRC patients is essential to form homogenised subgroups for personalised treatment and prognosis (4). Next generation sequencing (NGS) technologies enable the multi-omic profiling of malignant tumours but impact on clinical practice has been limited. This is due to high costs, difficulty in the standardisation of pre-analytical procedures, requirements for data storage and bioinformatics expertise (5,6). In contrast, histopathology slides are inexpensive to produce and principal stains such as haematoxylin and eosin (H&E) are firmly established in the pathology lab.
The application of traditional image analysis to histopathology facilitates the quantitative assessment of tissue architecture, cell distribution, and cellular morphology by light microscopy to generate feature libraries of unprecedented resolution and detail (7). More recently, deep learning is used to capture morphological differences with a precision that exceeds human performance. Coudray et al utilise this approach to detect targetable oncogenic driver mutations in lung cancer using deep neural classification networks (8). By combining an image-based analysis with molecular characterisation, it becomes feasible to identify novel genotype-phenotype correlations. For the first time it is now possible to characterise complex multi-scale morphological traits as well as genomic alterations at scale. Given that H&E processing allows analysis of large tissue sections at low cost and with short turn around without the need to modify existing clinical workflows, the discovery of morpho-molecular correlations holds the promise of revolutionising patient stratification in clinical practice (9). Imagebased methods are suitable for prioritisation of certain patient samples for additional molecular testing and for provision of additional guidance for the selection of tissue blocks. Ultimately, the biological interpretability of genomic alterations could revolutionise the development of new biomarkers.
In CRC, it is well known that tumour morphology, growth pattern and architecture hold important clues to differentiating biological subtypes with clinical impact (10). The composition of the tumour microenvironment is a key component determining the tumour progression and therapy response (11,12). Tumour and non-tumour tissue contribute to image information on the histological slide and to the consensus molecular classification (CMS) of CRC at the transcriptional level (13). The CMS classification distinguishes four groups of CRC with distinct clinical behaviour and biological interpretability. These include CMS1 (14%; microsatellite instability immune, favourable prognosis), CMS subgrouping shows a robust association with targetable alterations and may have potential to guide treatment allocation in clinical practice (1, 13). However, clinical implementation of the CMS classification has been held back by the considerable costs of RNA sequencing, the inability to bioinformatically obtain confident CMS calls from single samples, intratumoural heterogeneity, high levels of unclassified calls on biopsies and an unclear performance on FFPE material (13)(14)(15). Here, we derive a novel image-based CMS (imCMS) classification from H&E-stained tissue sections sourced from the Medical Research Council (MRC) and Cancer Research UK (CRUK) Stratification in COloRecTal cancer (S:CORT) program and The Cancer Genome Atlas (TCGA). We demonstrate the existence of distinct image phenotypes of CRC that reproducibly associate with CMS transcriptional classification, key oncogenic driver mutations and prognosis. Automatic, high-fidelity classification of three independent clinical cohorts including pre-operative biopsies underlines the applicability of this approach to heterogeneous sample sets and relevant clinical settings. We provide insight into classification calls for samples with considerable intratumoural heterogeneity and provide accurate secondary calls with higher discriminatory power than bioinformatic prediction. In all three cohorts, imCMS successfully classified CRC samples that were previously considered to have unknown biological and clinical behaviour and failed transcriptional classification. imCMS classification is standardised, inexpensive and could be carried out in a tele-pathology setting on routinely available H&E sections. This resolves key issues in the translation of transcriptional classification of CRC into clinical practice and has the potential to increase availability of molecular stratification in low resource settings.

Study design
This tudy was designed in accordance with the REMARK guidelines. The study design, cohorts and aims are outlined in [ Figure 1].

Patients
Cohort 1: FOCUS (Retrospective cohort, S:CORT) As part of the Stratification in COloRecTal cancer (S:CORT) program, 385 patients with available formalin-fixed paraffin embedded (FFPE) blocks of the primary CRC were selected from the MRC FOCUS randomised clinical trial (RCT) that tested different strategies of sequential and combination chemotherapy for patients with advanced CRC (30). Serial sections were cut from one representative block for H&E staining followed by four unstained sections for RNA extraction, a second H&E and eight unstained sections for DNA extraction for a total of 741 slides. H&E slides were re-reviewed by expert gastrointestinal pathologists and tumour tissue was annotated and used to guide RNA and DNA extractions from the first and second H&E respectively. RNA expression microarrays (Xcel array, Affymetrix), DNA target capture (SureSelect, Agilent) followed by NGS sequencing (Illumina) and DNA methylation arrays (EPIC arrays, Illumina) were applied in this order. All H&E slides were scanned at high resolution on an Aperio scanner at a total magnification of 200X. Digital slides were re-reviewed and tumour annotations were traced to generate region annotations for machine learning classification. Clinical data was retrieved from the trial database. Pathological TNM-stage and sidedness were extracted from pathological reports. Patients with synchronous disease were considered to be stage IV. 34 slides with technical failure of the staining or scanning procedure were excluded from further analysis. 41 slides had no available RNA expression for CMS classification for a final set of 666 slides (n=362 cases). Clinical and molecular data is summarised in [Table S1] and [ Figure 1A-B].  Table S1] and [ Figure 1A-B].

Cohort 3: GRAMPIAN (Retrospective cohort, S:CORT)
A total of 323 slides from 183 pre-treatment biopsy FFPE blocks from rectal cancer patients of the neoadjuvant setting were available for this study as part of the S:CORT program. All patients received pre-operative chemoradiotherapy followed by surgical resection. Slides and molecular profiling were processed as described for cohort 1 (FOCUS) but using 5 to 9 sections for RNA extraction and 9 for DNA. Pre-operative staging was derived from MRI scans. A total of 14 slides were excluded based on quality control criteria for a final set of 309 slides (n = 175 cases). Clinical and molecular data is summarised in [Table S1] and [ Figure 1A-B].

CMS calls
RNA microarray data was pre-processed and normalised using robust multi-array analysis with the R package affy (33) and probes collapsed by mean. CMS calls in all three cohorts were derived with the R package CMSclassifier (13) by random forest (RF) with the default posterior probability of 0.5. RF CMS classification of FFPE samples from the FOCUS and GRAMPIAN cohorts led to an increased frequency of unclassified samples as compared to the TCGA datasets derived from fresh frozen material. In order to derive calls with comparable frequencies, we therefore computed single sample predictor calls (R package CMSclassifier) after row-centring the expression data (13). Final CMS calls were generated when there was a match between both methods (RF and single sample predictor without applying any cut-off). There were 186 TCGA cases (n=191 slides) with discrepancies among our CMS calls and the calls originally reported by Guinney et al (13). These discrepant calls are most likely the result of the application of a clustering method that is strongly cohort-dependent in our analysis based on TCGA samples only and the original report combining thousands of samples from several selected cohorts. Due to lack of clear evidence of the ground truth CMS status, samples with classification discrepancies were labelled as unclassified.
Secondary CMS calls from RNA in classified samples were computed by RF using the second highest call with posterior probability above 0.3. The primary call was matched if no different CMS subtype was found. For unclassified samples, the first highest call above 0.3 was used, leaving the sample as unclassified if no subtype met this requirement. All these analyses were performed with R version 3.5.1 (34).

CIMP classification
Methylation array raw data from S:CORT cohorts 1 and 3 was processed with the R-package ChAMP (35). CIMP classification was generated by recursively partitioned mixture model as previously done in TCGA (36) and Guinney et al (13) with minor changes due to the higher number of probes. CIMP classification in TCGA according to Guinney et al was retrieved from Synapse (Synapse ID syn2623706).

imCMS classification
Pre-processing of image data and exclusion criteria For each of the three cohorts, digital slides were re-reviewed and invasive cancer regions were annotated by an expert gastrointestinal pathologist using the HALO TM software v2.3.2089.52 (Indica Labs, Corrales, NM, USA). For each slide, the annotated tumour areas were divided into tiles of 512x512 pixels. To avoid white background regions which did not provide useful information for classification, we excluded tiles with less than 50% tissue area. Total tissue area and the number of tiles is shown in [ Figure S1]. At 5x magnification, consecutive tiles were 50% overlapped in the FOCUS and TCGA cohorts (resections). To account for the small sample surface area of the tumour identified in the endoscopic biopsies of the GRAMPIAN cohort at 5x, tiles with a 75% overlap were used.
At 20x, no overlap in FOCUS and TCGA and 50% overlap in GRAMPIAN were used. imCMS classifier and the training procedure We trained a neural network to classify a given image tile taken from the marked tumour area into one of the four CMS classes using supervised learning. Inception V3 (37) pretrained on the ImageNet dataset (38) was trained on samples taken from the FOCUS cohort [ Figure 1C]. All instances in the training set were associated with corresponding molecular data. The class of each tile in the training set was matched to the overall RNA-based CMS call of the FOCUS slide. Tiles from unclassified slides were excluded. We trained 5 separate models with different subsets of the data in the manner akin to cross-validation. The data were split into 5 partitions while preserving the percentage of samples for each CMS class. For each model, 3 portions of the data were used for training, one for validation, and one for testing. The split was done at the patient level, meaning that no image tiles from the same patients would be used for training, validation, and testing at the same time. An inception V3 (37) model pretrained on the ImageNet dataset (38) was deployed. We minimised the cross-entropy loss of the model on our dataset via gradient backpropagation using Adam optimisation (39) with a learning rate of 0.0002 and a batch size of 32 for 100,000 iterations. To prevent the model from overfitting, the training image tiles were aggressively augmented using diverse optical and spatial transformations implemented in the imgaug library (40). To further avoid the class imbalance problem, we also sampled tiles according to the inverse of their class frequencies to guarantee that tiles from the minority classes such as CMS3 were sampled frequently in the training process. Finally, we selected the state of the model that yields the maximum macro-average AUC on the validation data.
We implemented the entire imCMS classification framework using the deep learning Pytorch library (41). All statistical analyses were performed in R version 3.5.1 (34).

Testing the model on independent cohorts
On the TCGA and GRAMPIAN datasets, we applied 5 versions of the, producing 5 different classification results for each tile which were then averaged to obtain the final prediction. This is analogous to an ensemble of experts' opinions (27). The prediction probability for each imCMS class was obtained from the proportion of the number of tiles assigned to that class, and the final imCMS call at the slide level was derived from the majority vote of tiles [ Figure 1D]. No unclassified slides were used in the evaluation. The classification performance of the model is reported in [ Table 2].

Domain adversarial training for better generalisation
To prevent the learning of dataset-dependent features that would limit the general applicability of the model we leveraged domain-adversarial training (26). Here the model was augmented with an additional classifier for predicting whether image tiles were drawn from training (FOCUS) or external cohorts (TCGA and GRAMPIAN) [ Figure 1C]. We forced this classifier to perform poorly to encourage the model to learn features which are dataset-independent. To train the domain-adversarial classifier, all image tiles from the FOCUS cohort and 30% of the tiles from the TCGA and GRAMPIAN datasets were used. Domain adversarial training did not involve imCMS class information. Our experiments demonstrate domain adversarial learning is critical to train a classifier that is suitable for this task [ Table 3].

Adjustment of the imCMS classification probability in the GRAMPIAN cohort
Image tiles containing histological features associated with the imCMS1 class in resection specimens (band like lymphocytic infiltration and mucin) were underrepresented in the rectal biopsies in the GRAMPIAN cohort. This resulted in very few biopsy samples considered as imCMS1 with high confidence [ Table 4] leading us to adjust the slide-level imCMS classification probabilities. To this end, we trained a RF classifier (42) with 100 trees of the maximum depth of 2 with 5-fold crossvalidation and only used the results from the test folds to avoid biased adjustment. imCMS classification of the CMS unclassified samples CMS unclassified samples from all three cohorts were re-classified using the imCMS classification algorithm. To this end, we trained a RF classifier (42) on the imCMS classification probabilities of classified samples in the cohort and then applied the learnt classifier to the unclassified samples to assign an imCMS call. Note that for the GRAMPIAN cohort, adjustment of the imCMS prediction probabilities were required as described in the previous section. represents the CMS classification probabilities from a RF CMSclassifier (13).

Assessment of the consistency between the imCMS and CMS classification heterogeneity
We assessed whether the level of similarity between the imCMS prediction probabilities and those of the transcriptomic CMS was better than the level of similarity produced by a random classifier.
Samples were stratified according to their primary and secondary CMS profile. For each comparison, a total of 100 random predictions were drawn from a 4-dimensional Dirichlet distribution with a concentration hyperparameter of 1.0 in each dimension in analogy to the imCMS classification probabilities. We calculated the cosine similarities of these random prediction probabilities and the mean of the CMS prediction probabilities.
The median difference between groups was compared using the Wilcoxon rank-sum test and the pvalues were adjusted to control false discovery rate (43). Any comparison that was highly underpowered due to the sample size (less than 2 data points in one of the populations) was discarded. For each group, outliers were detected via Tukey's rule (44) and removed. To avoid data correlation due to pairs of slides from the same samples, we performed two separate tests in which only one slide from a pair is used in each test. P-values <0.05 were considered statistically significant.

Survival analyses
Overall  Figure 1C]. The size distribution of annotated areas per slide and the number of tiles per slide is shown in [ Figure S1]. The imCMS class, prediction probability and spatial location for each tile were recorded. An overall imCMS call for each slide was assigned based on the majority classification of tiles [ Figure 1D].

imCMS classification is accurate, robust and generalisable
We systematically compared the performance of the imCMS classifier across all three cohorts. For benchmarking against molecular data, all unclassified samples were excluded from the test set.
Classification performance was compared using image tiles derived at a)  Figure S3B).

Histological patterns associated with imCMS status
To understand which specific morphological patterns associate with imCMS, we extracted and visually reviewed tiles with the highest prediction confidence for each imCMS subtype. The large-scale histology patterns corresponded well with the biological characteristics of the CMS1 and CMS4 classes as predicted from the molecular assay (13) imCMS classification on molecularly unclassified CMS samples Failure of the transcriptional CMS classification might represent a transition phenotype, intratumoural heterogeneity or might represent technical failure to classify (13). We therefore tested the performance of imCMS in samples categorised as unclassifiable by transcriptomic CMS [ Figure 2B].
As compared to transcriptional classification, imCMS yielded a significantly higher prediction confidence on the molecularly unclassified samples [ Figure S5]. Successful re-classification is underlined by a direct comparison of the key molecular profiles between classified samples and the imCMS reclassified samples. No major differences between these two groups in the majority of the traits except for CMS1 was found [ Figures 2D, S6, and Table S2]

Prognostic associations by imCMS status
We performed univariate Cox proportional hazard analysis to assess the prognostic value of the imCMS classification as compared to its molecular counterpart. In the FOCUS cohort, patient survival outcomes stratified by imCMS classification were highly in agreement with those of the transcriptional classification [ Figure 4A and Tables S6, S7]. The prognostic association of the imCMS classification was maintained in multivariate analysis including TNM stage, age and gender, indicating strong potential to stratify risk beyond pathological staging [ Table 5]. imCMS survival predictions were concordant when the input slides were replaced by sections cut at deeper tissue levels [Table S3 and Figure S9A]. For the TCGA cohort, PFI by both imCMS and CMS groups was highly consistent with CMS4 having the poorest prognosis [ Figure 4B and Table S4]. For OS, the CMS4 group was associated with the worst outcome while imCMS linked the imCMS1 group to adverse outcome [ Figure 4B and Table S4]. This discrepancy in the TCGA cohort could be explained by a less robust representation of disease biology by OS as compared to PFI but requires additional investigation in subsequent studies. We further explored the application of the imCMS classification for risk stratification in the unclassified samples of the TCGA cohort. In this previously unclassified group, the imCMS4 group was shown to have worse prognosis for both OS and PFI [ Figure S9b and Table S5].

DISCUSSION
H&E slides are generated as part of the standard work-up of any CRC treated by surgical resection (16,17). In the assessment of this histologic material, pathologists are presently limited to the strictly defined set of morphologic and anatomic criteria (16,17). This information supports the definition of broad prognostic risk groups but has no predictive value (16). The integration of genomic technologies in the clinical care of CRC patients has immense potential to drive personalised treatment but requires substantial financial, personnel and infrastructure resources (18). Combining morphological and molecular pathology to identify genotype-phenotype correlations is a promising approach to extend the amount of clinically relevant information that can be extracted from standard histologic slides (8).
In this study, we leverage artificial intelligence and image analysis technologies for the development of an image-based taxonomy of CRC with clear biological interpretability and clinical impact. Due to general applicability and low costs, morphomolecular classification of histopathology slides could become a new standard for patient stratification in clinical practice.
We trained and tested our image-based approach towards consensus molecular subtyping (imCMS) of CRC on three independent and well-characterised patient cohorts with availability of digital slides and transcriptional information from the CRUK MRC S:CORT program and TCGA. We specifically focused on relevant clinical scenarios in the management of CRC patients and investigated the imCMS classification of both preoperative biopsies and resection specimens. Our analyses demonstrate that the imCMS classifier is able to predict the consensus molecular signatures of CRC from histological slides with very high accuracy. While tissue features captured at low magnification proved most informative on CRC resection specimens, imCMS could be efficiently adapted for morpho-molecular classification of rectal cancer biopsy fragments at high magnification. Small biopsy fragments have previously proven difficult to analyse using genomic technologies due to the limited amount of tissue available (19). Pathologist assessment is therefore usually restricted to the diagnosis of cancer, a select panel of immunohistochemical studies and a limited assessment of additional prognostic features (17,20). Clinically approved assays that are predictive of therapeutic response from biopsy material are presently lacking, with up to 25% of rectal cancer patients gaining no benefit from current radiotherapy and chemotherapy protocols (21). As a stemlike (CMS4) transcriptional profile of CRC has been linked to poor prognosis and therapeutic resistance, imCMS could allow for more effective stratification of patients for primary surgery or neoadjuvant treatment (22,23).
Prospective studies are warranted to investigate the application of imCMS as a novel clinical stratification tool.  (1, 13). However, some key issues hamper clinical implementation of CMS such as the inability to obtain reliable calls from single samples. Two methods to call CMS were released by the original authors based on RF and single sample prediction (13).
The former provides reliable classification but is cohort-dependent and requires a high minimum number of samples while the latter generates calls on single samples with limited quality leading to underutilisation. Another problem is that some samples do not show enough evidence to make calls by either method leading to a substantial number of cases left as unclassified. Inconsistent classification calls could also be an expression of intratumoural heterogeneity or representative of a transition phenotype which is of considerable biological interest (1, 13). Spatial heterogeneity is an additional confounder that can result in CMS misclassification (14). imCMS is able to overcome all these problems. imCMS calls are intrinsically generated for single samples. With this paper we demonstrate that it is possible to identify CMS on the basis of tissue morphology.
The possibility of identifying morphological correlates that are associated with molecular subtypes opens new opportunities for in vitro diagnostics. However, the application of image-based patient stratification is presently limited by the availability of digital pathology infrastructure in routine diagnostic practice. This is met by broad scale initiatives for digitalization of medical infrastructure on a national and international level (28,29). Centralized testing could further compensate for the availability of computing infrastructure in low resource settings. Prospective validation of imCMS in independent studies will be critical to clinical translation. This includes both applications as a tool that could rationalize which cases would need confirmatory testing as well as stand-alone testing in cases where genomic methods fail to provide reliable classification. We hypothesise that the general principle can be applied not only to other cancer types but also to other diseases. It will therefore lay the foundation of a more systematic integration of image-based morphological analysis and molecular stratification.             Cosine similarity between the imCMS and CMS prediction scores, stratified by the primary and the secondary CMS calls. The levels of similarity were compared against those produced by a random classifier. Statistical analysis was performed using Wilcoxon rank-sum test, adjusted for the false discovery rate. P-value < 0.05 was considered statistically significant.       Supplementary Tables S1-S5

TABLES AND TABLE LEGENDS
Please see separate datafiles