Original research
Screening of normal endoscopic large bowel biopsies with interpretable graph learning: a retrospective study
  1. Simon Graham1,2,
  2. Fayyaz Minhas1,
  3. Mohsin Bilal1,
  4. Mahmoud Ali3,
  5. Yee Wah Tsang3,
  6. Mark Eastwood1,
  7. Noorul Wahab1,
  8. Mostafa Jahanifar1,
  9. Emily Hero4,
  10. Katherine Dodd3,
  11. Harvir Sahota3,
  12. Shaobin Wu5,
  13. Wenqi Lu1,
  14. Ayesha Azam3,
  15. Ksenija Benes3,6,
  16. Mohammed Nimir3,
  17. Katherine Hewitt3,
  18. Abhir Bhalerao1,
  19. Andrew Robinson3,
  20. Hesham Eldaly3,
  21. Shan E Ahmed Raza1,
  22. Kishore Gopalakrishnan3,
  23. David Snead2,3,7,
  24. Nasir Rajpoot1,2,3
  1. Department of Computer Science, University of Warwick, Coventry, UK
  2. Histofy Ltd, Birmingham, UK
  3. Department of Pathology, University Hospitals Coventry and Warwickshire NHS Trust, Coventry, UK
  4. Department of Pathology, University Hospitals of Leicester NHS Trust, Leicester, UK
  5. Department of Pathology, East Suffolk and North Essex NHS Foundation Trust, Colchester, UK
  6. Department of Pathology, Royal Wolverhampton Hospitals NHS Trust, Wolverhampton, UK
  7. Division of Biomedical Sciences, University of Warwick Warwick Medical School, Coventry, UK
Correspondence to Professor Nasir Rajpoot, Department of Computer Science, University of Warwick, Coventry CV4 7EZ, UK; n.m.rajpoot{at}warwick.ac.uk; Dr Simon Graham; simon.graham{at}warwick.ac.uk

Abstract

Objective To develop an interpretable artificial intelligence algorithm to rule out normal large bowel endoscopic biopsies, saving pathologist resources and helping with early diagnosis.

Design A graph neural network was developed incorporating pathologist domain knowledge to classify 6591 whole-slide images (WSIs) of endoscopic large bowel biopsies from 3291 patients (approximately 54% female, 46% male) as normal or abnormal (non-neoplastic and neoplastic) using clinically driven interpretable features. One UK National Health Service (NHS) site was used for model training and internal validation. External validation was conducted on data from two other NHS sites and one Portuguese site.

Results Model training and internal validation were performed on 5054 WSIs of 2080 patients resulting in an area under the curve-receiver operating characteristic (AUC-ROC) of 0.98 (SD=0.004) and AUC-precision-recall (PR) of 0.98 (SD=0.003). The performance of the model, named Interpretable Gland-Graphs using a Neural Aggregator (IGUANA), was consistent in testing over 1537 WSIs of 1211 patients from three independent external datasets with mean AUC-ROC=0.97 (SD=0.007) and AUC-PR=0.97 (SD=0.005). At a high sensitivity threshold of 99%, the proposed model can reduce the number of normal slides to be reviewed by a pathologist by approximately 55%. IGUANA also provides an explainable output highlighting potential abnormalities in a WSI in the form of a heatmap as well as numerical values associating the model prediction with various histological features.

Conclusion The model achieved consistently high accuracy showing its potential in optimising increasingly scarce pathologist resources. Explainable predictions can guide pathologists in their diagnostic decision-making and help boost their confidence in the algorithm, paving the way for its future clinical adoption.

  • endoscopy
  • colonic adenomas
  • colorectal cancer screening
  • colonic diseases

Data availability statement

WSIs from University Hospitals Coventry and Warwickshire NHS Trust, East Suffolk and North Essex NHS Foundation Trust, and South Warwickshire NHS Foundation Trust will be made available upon successful application to the PathLAKE data access committee. Relevant information on obtaining the data from the IMP cohort can be found in the original publication.

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.

WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Increasing screening rates for early detection of colon cancer are placing significant pressure on already understaffed and overloaded histopathology resources worldwide and especially in the UK.

  • Approximately one-third of endoscopic colon biopsies are reported as normal, and therefore, require minimal intervention, yet the biopsy results can take up to 2–3 weeks.

  • Artificial intelligence (AI) models hold great promise for reducing the burden of diagnostics for cancer screening but require incorporation of pathologist domain knowledge and explainability.

WHAT THIS STUDY ADDS

  • This study presents the first AI algorithm for ruling out normal large bowel endoscopic biopsies with high accuracy across different patient populations.

  • For colon biopsies predicted as abnormal, the model can highlight diagnostically important biopsy regions and provide a list of clinically meaningful features of those regions such as glandular architecture, inflammatory cell density and spatial relationships between inflammatory cells, glandular structures and the epithelium.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • The proposed tool can both screen out normal biopsies and act as a decision support tool for abnormal biopsies, thereby offering a significant reduction in pathologist workload and faster turnaround times.

Introduction

Histological examination is a vital component in ensuring accurate diagnosis and appropriate treatment of many diseases. In routine practice, it involves visual assessment of key histological and cellular patterns in the tissue, which is a major step in understanding the state of various conditions, such as cancer. Histopathology has been at the forefront of many advances in care including, but not limited to, cancer screening programmes, molecular pathology, tumour classification and companion diagnostic testing, resulting in a rapid rise in demand for histology-derived data.1 This extra workload is placing tremendous pressure on pathologists,2 with 78% of UK cellular pathology departments already facing significant staff shortages.3 The surging demand and staffing challenges ultimately lead to delays in diagnosis,4 negatively impacting patient care especially for those with abnormal conditions (eg, cancer or serious inflammation) where early intervention and treatment are critical.5

New National Institute for Health and Care Excellence guidelines for referral of suspected cancer forecast an unprecedented rise in demand for endoscopy, with more than 750 000 additional procedures performed per year by 2020,6 leading to a breach in standard wait times in a quarter of National Health Service (NHS) hospitals.7 8 Endoscopic large bowel biopsies constitute approximately 10% of all requests in the UK NHS pathology laboratories. During the examination process, the pathologist examines each biopsy slide searching for disease, typically working from low to high magnification, and analyses a set of predefined histological features, such as gland architecture, inflammation and nuclear atypia for signs of abnormality.9 10 The resulting report indicates the presence of any disease process and categorises the abnormality into the most appropriate diagnosis.11 12 An overview of the pathologist diagnostic decision process for reporting endoscopic colon biopsies is provided in online supplemental figure 1. Approximately one-third of colonic biopsy samples are reported as normal (online supplemental table 1), representing a substantial workload where the pathologist’s expertise is not fully used. The underlying hypothesis of this study is that automated screening of normal biopsies may help address rising histopathology capacity challenges.

Since the advent of digital pathology,13 there has been a sharp increase in the development of artificial intelligence (AI) tools that enable computational analysis of multi-gigapixel whole-slide images (WSIs). In particular, deep learning (DL) algorithms have achieved remarkable performance not only in routine diagnostic tasks, such as cancer grading14 and finding metastasis in lymph nodes, but also in finding origins for cancers of unknown primary15 and improved patient stratification.16 17 Notably, Campanella et al18 presented a seminal paper on clinical-grade WSI classification, while Ehteshami Bejnordi et al19 demonstrated that AI models are capable of surpassing pathologist performance for breast cancer metastasis detection. These models can be leveraged to help reduce inevitable errors in diagnosis, given that humans are naturally prone to mistakes, especially when faced with fatigue or distractions.20 21 Despite challenges associated with algorithm bias,22 23 AI tools are not as susceptible to these kinds of errors and therefore may help mitigate oversight, reduce workload and increase reproducibility.

Differentiating between normal and neoplastic colorectal WSIs using DL has previously been addressed, with reports of excellent performance.24–26 However, distinguishing normal from abnormal tissue samples required for large bowel biopsy screening remains a challenge, due to the difficulty in detecting various subtle conditions, such as mild inflammation. To the best of our knowledge, there are no existing multi-centric studies for normal versus abnormal classification of large bowel biopsies. Existing methods for colonic analysis operate on high power subimages (or image patches) and so do not explicitly model both the tissue microstructure and macrostructure, including glandular architecture, inflammatory cell density and spatial relationships between inflammatory cells, glandular structures and the epithelium. Relying solely on DL models to automatically detect histological patterns that are diagnostically relevant in small image regions may lead to suboptimal performance. Alternatively, explicitly incorporating histological features that are routinely used by pathologists during the colon biopsy diagnostic workflow may not only improve performance over conventional DL models but may also increase transparency and interpretability of the algorithm’s decision-making to the pathologist—a key requirement for trustworthy AI-based medical decision models.27 28

To help reduce the burden of large bowel biopsy screening, we propose the first interpretable AI algorithm for large bowel slide classification employing a gland-graph network named IGUANA (Interpretable Gland-Graphs using a Neural Aggregator). In the proposed approach, a WSI is modelled as a graph with nodes,29–33 each representing a gland associated with a set of 25 interpretable features capturing gland architecture, intra-gland nuclear morphology and inter-gland cell density. The interconnections between these nodes capture the spatial organisation of glands within the tissue. The node features were developed in collaboration with pathologists and in accordance with existing diagnostic pathways to boost predictive accuracy, interpretability and alignment with known histological characteristics of a wide range of colorectal pathologies. IGUANA identifies highly predictive regions in the biopsy tissue slide and provides an explanation as to why they may be highly predictive. Because of the use of biologically meaningful features, this explanation can easily be interpreted by a pathologist as the basis of the algorithm’s diagnostic decision-making. We validate our algorithm on an internal dataset containing 5054 WSIs and an independent multi-centre dataset containing 1537 WSIs, achieving the best performance compared with recent top-performing approaches. In addition, we analyse predictive regions identified by IGUANA along with local and WSI-level explanations and show that our approach can identify areas of abnormality, such as inflammation and neoplasia. The code for IGUANA is available in the open-source domain for research purposes (https://github.com/TissueImageAnalytics/iguana) and example results can be visualised in an interactive demo available at https://iguana.dcs.warwick.ac.uk.
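The gland-graph representation described above can be illustrated with a minimal sketch. This is not the IGUANA implementation: the text states only that each node is a gland with 25 features and that edges capture spatial organisation, so the `Gland` record, the k-nearest-neighbour edge rule and the `max_dist` cut-off are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

NUM_FEATURES = 25  # per-gland feature count stated in the text


@dataclass
class Gland:
    """Hypothetical node record: gland centroid plus its feature vector."""
    x: float
    y: float
    features: list


def build_gland_graph(glands, k=3, max_dist=500.0):
    """Connect each gland to its k nearest neighbours within max_dist.

    Both the k-NN rule and the distance cut-off are illustrative
    assumptions. Returns a set of undirected edges (i, j) with i < j.
    """
    edges = set()
    for i, g in enumerate(glands):
        dists = sorted(
            (math.hypot(g.x - h.x, g.y - h.y), j)
            for j, h in enumerate(glands)
            if j != i
        )
        for d, j in dists[:k]:
            if d <= max_dist:
                edges.add((min(i, j), max(i, j)))
    return edges


# Toy example: three neighbouring glands and one isolated gland.
glands = [
    Gland(0.0, 0.0, [0.0] * NUM_FEATURES),
    Gland(100.0, 0.0, [0.0] * NUM_FEATURES),
    Gland(0.0, 100.0, [0.0] * NUM_FEATURES),
    Gland(5000.0, 5000.0, [0.0] * NUM_FEATURES),
]
edges = build_gland_graph(glands, k=2, max_dist=500.0)
```

In this toy example the three nearby glands form a triangle, while the distant gland remains unconnected because all of its neighbours lie beyond the assumed cut-off.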

Materials and methods

Study design

Figure 1 summarises the datasets used and our overall pipeline, which consists of the following steps: (1) histological segmentation, (2) feature extraction and edge generation, (3) graph prediction and (4) graph explanation. An overview of the experiment design is provided in online supplemental figure 2 and an in-depth description of the datasets is given in online supplemental section S4.1, including the disease and demographic breakdown (online supplemental figures 3 and 4 and online supplemental tables 2–4). In addition, we provide a detailed method description in online supplemental sections S4.1–S4.7.

Figure 1

Illustration of the overall pipeline for colon tissue classification with gland-graph convolutional networks. (A) Overview of the data used in our experiments from four different centres using different scanners. (B) Summary of the pipeline, which involves graph construction, gland-graph inference and gland-graph explanation. (C) Zoomed-in image regions and corresponding results taken from the example in B. ESNE, East Suffolk and North Essex; UHCW, University Hospitals Coventry and Warwickshire; WSI, whole-slide images.

Patient and public involvement

Lay members have made a valuable contribution to this project by ensuring that the patient remains at its heart. Three lay advisors have been working with us since the conception of this project. One of the advisors is part of the National Cancer Research Institute consumer network and the Independent Cancer Patient’s Voice group, both of which are supportive of new technologies being brought into the NHS for patient benefit.

Results

Large-scale cross-validation for colon biopsy screening

To rigorously evaluate our approach for colon biopsy screening, we performed 3-fold cross-validation using 5054 H&E-stained colon biopsy WSIs from University Hospitals Coventry and Warwickshire (UHCW), where each slide was labelled as either normal or abnormal. Interpretable screening of normal colon biopsies is a challenging problem due to a wide spectrum of large bowel abnormalities including a variety of neoplastic and inflammatory conditions. Figure 2 shows the results of IGUANA, achieving an average area under the receiver operating characteristic (AUC-ROC) curve of 0.9783 ± 0.0036 and an AUC precision-recall (AUC-PR) of 0.9798 ± 0.0031. We also include results obtained using other existing slide-level classification algorithms such as Iterative Draw and Rank Sampling (IDaRS)34, Clustering-constrained Attention Multiple Instance Learning (CLAM)35 and a random forest (RF) baseline classifier using our glandular features (denoted by Gland-RF). We observe that IGUANA achieves the best performance compared with both patch-based methods (IDaRS and CLAM), demonstrating its strong predictive ability given that it uses only 25 features per gland. We provide additional comparative results between IGUANA and IDaRS in online supplemental figure 5. Detailed statistical results are also provided in online supplemental tables 5–9. Note that despite IGUANA outperforming it, the Gland-RF model produces comparable performance—signifying the strength of our set of clinically derived features—although without the localised interpretability provided by IGUANA. Also, as opposed to the two patch-based methods, IGUANA provides concrete justification as to why a certain diagnostic class was predicted. We go into further detail on interpretability and explainability later in this section.
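For readers unfamiliar with how a figure such as "AUC-ROC of 0.9783 ± 0.0036" is obtained, the sketch below computes AUC-ROC with the rank-based (Mann-Whitney) formulation and summarises it over folds. The toy scores and fold values are hypothetical, and the use of the sample SD is an assumption, since the text does not state which SD estimator was used.

```python
from statistics import mean, stdev


def auc_roc(labels, scores):
    """AUC-ROC as the probability that an abnormal slide outranks a normal one.

    labels: 1 = abnormal (positive), 0 = normal. Ties count as 0.5.
    Quadratic in the number of slides, which is acceptable for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarise_folds(fold_aucs):
    """Mean and sample SD across cross-validation folds (estimator assumed)."""
    return mean(fold_aucs), stdev(fold_aucs)


# Hypothetical toy data, not results from the study.
toy_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
fold_mean, fold_sd = summarise_folds([0.8, 0.9, 1.0])
```

Here three of the four abnormal/normal pairs are ranked correctly, so `toy_auc` is 0.75.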

Figure 2

Results obtained across the four cohorts used in our experiments. Here, we display the ROC and PR curves along with the respective AUC scores of our approach compared with IDaRS, CLAM and Gland-RF (a random forest approach using the same handcrafted features with global aggregation). We also display the specificities obtained at sensitivity cut-offs of 0.97, 0.98 and 0.99. The shaded areas in the curves and the error bars in the bar plots show one SD from the results. AUC, area under the curve; CLAM, Clustering-constrained Attention Multiple Instance Learning; ESNE, East Suffolk and North Essex; IDaRS, Iterative Draw and Rank Sampling; PR, precision-recall; RF, random forest; ROC, receiver operating characteristic; UHCW, University Hospitals Coventry and Warwickshire.

In addition, we assess differences in model performance across sex, age, ethnicity and anatomical site of the biopsy. For each subgroup-level analysis, we run 100 bootstrap runs to compute average AUC-ROC and its SD across subcategories (online supplemental table 10) and observe that our method is not biased towards any particular subgroup with only minor differences.
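A bootstrap estimate of subgroup-level AUC-ROC, as described above, can be sketched as follows. The 100 runs follow the text; the resampling-with-replacement scheme, the fixed seed, the handling of single-class resamples and the record layout `(subgroup, label, score)` are assumptions made for illustration.

```python
import random
from statistics import mean, stdev


def auc_roc(labels, scores):
    """Rank-based AUC-ROC; labels use 1 = abnormal, 0 = normal."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_runs=100, seed=0):
    """Mean and SD of AUC-ROC over n_runs resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_runs:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples containing a single class (assumed policy)
        aucs.append(auc_roc(ys, [scores[i] for i in idx]))
    return mean(aucs), stdev(aucs)


def subgroup_bootstrap(records, n_runs=100, seed=0):
    """records: iterable of (subgroup, label, score) tuples (hypothetical layout)."""
    groups = {}
    for group, label, score in records:
        ys, ss = groups.setdefault(group, ([], []))
        ys.append(label)
        ss.append(score)
    return {g: bootstrap_auc(ys, ss, n_runs, seed) for g, (ys, ss) in groups.items()}


# Hypothetical, perfectly separable toy data for two subgroups.
records = [(g, y, s) for g in ("female", "male")
           for y, s in [(0, 0.1), (0, 0.2), (0, 0.3), (1, 0.7), (1, 0.8), (1, 0.9)]]
results = subgroup_bootstrap(records)
```

Comparing the per-subgroup means against their SDs gives a rough sense of whether differences between subgroups exceed resampling noise.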

Model generalisation to independent cohorts

A true reflection of a model’s clinical utility requires the assessment of its performance on completely unseen cohorts. For this, we used three additional cohorts of H&E-stained colon biopsy slides, providing a total of 1537 WSIs. These cohorts consisted of 1132 slides from IMP Diagnostics Laboratory in Portugal,25 148 slides from East Suffolk and North Essex (ESNE) NHS Foundation Trust and 257 slides from South Warwickshire NHS Foundation Trust, where slides were again categorised as either normal or abnormal. We observe from figure 2 that our model attains high performance on all three external cohorts, reaching AUC-ROC scores of 0.9567 ± 0.0155, 0.9649 ± 0.0025 and 0.9789 ± 0.0023 and AUC-PR scores of 0.9731 ± 0.0105, 0.9466 ± 0.0034 and 0.9949 ± 0.0006 for ESNE, South Warwickshire and IMP datasets, respectively. It is evident that there is a large difference in performance between IGUANA and other approaches on the external cohorts, signifying that superior generalisation to unseen data is a strength of our model. At a sensitivity of 0.99, we obtain a percentage increase over IDaRS of 47.4%, 63.6% and 58.9% for IMP, ESNE and South Warwickshire cohorts, respectively. This may be partly due to the ability of our initial segmentation model to perform well across images with different staining protocols.36 Example results obtained by this model across the four datasets are shown in figure 3.

Figure 3

Example segmentation results obtained by our multi-task model across the four datasets used in our experiments. The top row shows normal examples, whereas the bottom row shows abnormal examples. In particular, the bottom-left example from ESNE shows a hyperplastic polyp and the bottom-right example from South Warwickshire shows inflammation. AUC-PR, area under the curve-precision-recall; CLAM, Clustering-constrained Attention Multiple Instance Learning; ESNE, East Suffolk and North Essex; IDaRS, Iterative Draw and Rank Sampling; IGUANA, Interpretable Gland-Graphs using a Neural Aggregator; RF, random forest; UHCW, University Hospitals Coventry and Warwickshire.

Analysis of expected reduction in pathologist workload

The real-world value of our approach is determined by its ability to reduce pathologist workload. As our model is intended for screening, it must achieve high sensitivity. Therefore, assessment of the specificity at high sensitivity cut-off thresholds provides a good indication of its potential effectiveness as a screening tool. Here, the specificity is indicative of the percentage reduction in normal slides that require pathologist review. In the middle column of figure 2, we display the specificity of our model at sensitivities of 0.97, 0.98 and 0.99 on all datasets used in our experiments, where we see that IGUANA sustains the best performance at various cut-offs compared with other methods. During internal cross-validation, we obtain specificities of 0.7865 ± 0.0429, 0.6720 ± 0.1128 and 0.5409 ± 0.1210 for sensitivities of 0.97, 0.98 and 0.99, respectively. For independent validation, our method obtains average specificities across the three external datasets of 0.7513 ± 0.0919, 0.6679 ± 0.0779 and 0.5487 ± 0.1599 for sensitivities of 0.97, 0.98 and 0.99. This indicates that, at a sensitivity of 0.99, our method is able to screen out around 54% of normal cases during both internal and external validation.
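Reading off the specificity at a fixed sensitivity cut-off can be sketched as below. This is an illustrative procedure under stated assumptions, not the evaluation code used in the study: slides with a score at or above the threshold are flagged as abnormal, and the threshold is taken as the highest value that still reaches the target sensitivity. The toy scores are hypothetical.

```python
import math


def specificity_at_sensitivity(labels, scores, target_sens):
    """Return (threshold, sensitivity, specificity) for a target sensitivity.

    labels: 1 = abnormal, 0 = normal. A slide is flagged abnormal when its
    score is >= threshold (assumed decision rule).
    """
    pos = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one abnormal and one normal slide")
    # Minimum number of abnormal slides that must be flagged; the small
    # epsilon guards against floating-point overshoot in the product.
    needed = max(1, math.ceil(target_sens * len(pos) - 1e-9))
    threshold = pos[needed - 1]  # highest threshold that reaches the target
    sens = sum(s >= threshold for s in pos) / len(pos)
    spec = sum(s < threshold for s in neg) / len(neg)
    return threshold, sens, spec


# Hypothetical toy scores: four abnormal and four normal slides.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1, 0.2, 0.4, 0.7]
at_075 = specificity_at_sensitivity(labels, scores, 0.75)
at_100 = specificity_at_sensitivity(labels, scores, 1.0)
```

In the toy data, demanding that every abnormal slide is caught lowers the threshold from 0.6 to 0.3 and reduces specificity from 0.75 to 0.5, which mirrors the trade-off reported above.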

In online supplemental figure 6, we show the proportion of slides that require pathologist review to achieve a certain sensitivity.18 In these plots, we consider a target sensitivity of 0.99, which is reasonable due to high levels of interobserver disagreement for conditions such as mild inflammation. We also show with a vertical dashed line the proportion of abnormal slides in each dataset, which indicates the minimum number of slides that need to be reviewed for screening. For each of the cohorts, we observe that for our target of 0.99 sensitivity our model can screen out 32%, 31%, 17% and 13% of slides from UHCW, South Warwickshire, ESNE and IMP datasets, respectively. If considering a sensitivity of 0.97, we can screen out 44% of slides from UHCW, 46% from South Warwickshire, 30% from ESNE and 19% from IMP.
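The link between specificity and the proportion of all slides screened out depends on how many slides in a cohort are abnormal. A minimal sketch of that relationship follows, assuming that every slide predicted normal is screened out (true negatives plus false negatives). The example inputs are hypothetical and are not taken from any cohort in the study.

```python
def fraction_screened_out(sensitivity, specificity, abnormal_fraction):
    """Fraction of all slides predicted normal and hence not reviewed.

    Assumes screened-out slides = true negatives + false negatives.
    """
    normal_fraction = 1.0 - abnormal_fraction
    true_negatives = specificity * normal_fraction
    false_negatives = (1.0 - sensitivity) * abnormal_fraction
    return true_negatives + false_negatives


# Hypothetical cohort: half of the slides abnormal, operating point
# at sensitivity 0.99 and specificity 0.54.
example = fraction_screened_out(0.99, 0.54, abnormal_fraction=0.5)
```

Under these assumed inputs roughly 27.5% of slides would be screened out. The same operating point yields a smaller reduction in a cohort with a higher proportion of abnormal slides, which is one reason the percentages differ across datasets.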

Local feature explanations increase model transparency

A major component of IGUANA is the ability to provide an interpretable and explainable output. In figure 4, we display visual explanations of the most predictive nodes and features given by IGUANA. Node explanations are shown in the form of a heatmap, where relatively high values indicate glandular areas that contribute to the slide being classified as abnormal. Therefore, we should expect that all glands in a normal slide will have low values in the associated heatmap as shown in figure 4A, where no glands contribute to the slide being classified as abnormal. Figure 4B–D shows WSIs with hyperplastic polyps, inflammation and adenocarcinoma, respectively. Hyperplastic polyps are often characterised by intraluminal folds and lumen dilation. On the other hand, inflammatory conditions usually have an increased number of lymphocytes, plasma cells, eosinophils and neutrophils within the lamina propria and potentially within the glands. Other indicators of inflammation can include crypt branching and crypt dropout. Colon adenocarcinoma is often denoted by irregular glandular morphology, epithelial nuclear atypia and multiple lumina. High-grade cancers typically lose their glandular appearance and form solid sheets of tumour cells. It can be observed that IGUANA is able to pick up abnormal glands with features in line with the above descriptions. In particular, we see that the most predictive glands in figure 4B contain lumen with a clearly irregular morphology, whereas highlighted glands in figure 4C show areas with a high degree of inflammation. The adenocarcinoma heatmap in figure 4D highlights areas that have lost their conventional glandular appearance. Specifically, epithelial nuclei are no longer arranged at the gland boundary, cribriform architecture is observed and glands appear much larger, due to the formation of tumour cell sheets.

Figure 4

Visualisation of node and feature explainability. We display the overlay of the node-level explanations in the form of a heat map showing the most predictive nodes in the WSI. We also show cropped images of the four most predictive nodes within each WSI along with the associated ten most predictive features and their feature importance value. The colour of the boundary of the top nodes (glands) indicates the corresponding value in the node explanation heatmap. (A–D) Show example slides that are normal, hyperplastic, inflammatory or cancerous, respectively. GEC, gland epithelial clustering; GECV, gland epithelial clustering variation; GED, gland epithelial density; GEO, gland epithelial organisation; GEoD, gland eosinophil density; GEOV, GEO variation; GES, gland epithelial size; GESV, GES variation; GD, gland density; GLD, gland lymphocyte density; GM, gland morphology; GND, gland neutrophil density; GS, gland size; ICD, Inflammatory cell density; LEO, Lumen epithelial organisation; LEOV, Lumen epithelial organisation variation; LPCP, lamina propria connective proportion; LPEoP, lamina propria eosinophil proportion; LPLP, lamina propria lymphocyte proportion; LPNP, lamina propria neutrophil proportion; LPPP, lamina propria plasma proportion; WSIs, whole-slide images.

In addition to the node explanation heatmap, IGUANA indicates why certain glands are being identified as abnormal. This is useful because it can provide confirmation that the correct features are being identified by the model, giving researchers and clinicians confidence that it is performing as expected. This strategy can also be used to identify additional features within abnormal conditions. To show this, in figure 4, we display the most predictive glands in each slide and provide the corresponding feature explanations. Specifically, we display the top ten features in descending order of significance, along with their corresponding feature importance values between 0 and 1. Here, we expect that the feature explanations should align with what is observed in the associated cropped regions. In our hyperplastic polyp example, we see that the top glands (ie, 1, 2 and 3) contain lumen with abnormal morphology, whereas lumen dilation is observed in top gland 4. In line with this, lumen morphology and lumen composition are high-scoring features across the provided examples. We also observe that lumen size and organisation of epithelial nuclei within the glands are often found to be important features. In the example shown in figure 4C, we observe that top glands have a high degree of inflammation, which is matched by top features, such as inflammatory cell density, gland density and lamina propria neutrophil proportion. In the adenocarcinoma example, we see that the top four glands are all large, have irregular morphology and often display solid sheets of tumour cells with no obvious glandular structure. This is highlighted in the feature explanation, where gland morphology, gland size and epithelial organisation are consistently top-ranked features. Here, epithelial organisation describes how the epithelial nuclei are positioned at the gland boundary. 
Due to the presence of solid tumour patterns across the top glands, this feature is frequently highlighted in cancerous cases. We provide additional visual examples of the interpretability of our model output in figure 5.
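The selection of the most predictive glands and their top-ranked features can be sketched as a simple ranking step. The node identifiers and importance values below are hypothetical; the feature abbreviations follow the figure captions, and how IGUANA derives the importance values themselves is not shown here.

```python
def top_k(scores, k):
    """Return the k highest-scoring (name, value) pairs, highest first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical node-level importance values (one per gland).
node_importance = {
    "gland_17": 0.91,
    "gland_03": 0.12,
    "gland_42": 0.77,
    "gland_08": 0.55,
    "gland_29": 0.64,
}

# Hypothetical per-gland feature importance values in [0, 1].
feature_importance = {
    "gland_17": {"GM": 0.9, "GS": 0.7, "GEO": 0.8, "ICD": 0.1},
}

# Four most predictive glands, then up to ten top features for each.
top_nodes = top_k(node_importance, k=4)
report = {
    node: top_k(feature_importance.get(node, {}), k=10)
    for node, _ in top_nodes
}
```

Glands without an entry in `feature_importance` simply receive an empty list in this sketch.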

Figure 5

Additional visualisation of node and feature explainability. As before, we display the overlay of the node-level explanations in the form of a heat map showing the most predictive nodes in the WSI. We also show cropped images of the four most predictive nodes within each WSI along with the associated ten most predictive features and their feature importance value. (A–D) Show slides that are normal, inflammatory (with crypt abscesses), high-grade dysplasia or adenomatous polyps, respectively. GEC, gland epithelial clustering; GECV, gland epithelial clustering variation; GED, gland epithelial density; GEO, gland epithelial organisation; GEoD, gland eosinophil density; GEOV, GEO variation; GES, gland epithelial size; GESV, GES variation; GD, gland density; GLD, gland lymphocyte density; GM, gland morphology; GND, gland neutrophil density; GS, gland size; ICD, Inflammatory cell density; LEO, Lumen epithelial organisation; LEOV, Lumen epithelial organisation variation; LPCP, lamina propria connective proportion; LPEoP, lamina propria eosinophil proportion; LPLP, lamina propria lymphocyte proportion; LPNP, lamina propria neutrophil proportion; LPPP, lamina propria plasma proportion; WSIs, whole-slide images.

WSI-level feature explanations are consistent with known histological patterns

In figure 6A, we show WSI-level explanations averaged over different subconditions in the UHCW and IMP cohorts. We focus on these datasets because they are the largest, with both containing over 1000 samples. Here, we plot top 10 features across the various subconditions for increased readability. These plots can be used both to confirm that the global explanations are as expected and to understand which features are particularly important for categorising a certain subcondition as abnormal. In both UHCW and IMP cohorts, the normal radar plots have a small radius, indicating that no feature contributes to the slide being classified as abnormal. For inflammatory cases, the UHCW and IMP radar plots show that a wide range of features can contribute to the slide being classified as abnormal, where there may be both cellular and architectural changes in the tissue. However, the most important features that can differentiate between other subconditions include inflammatory cell density, gland lymphocyte infiltration and gland density. Gland density can be indicative of gland dropout, which is a sign of inflammation. The UHCW radar plots for dysplasia and adenocarcinoma are similar, where the most important features are gland morphology, gland epithelial cell organisation, gland epithelial cell size and variation of gland epithelial cell size. This is in line with the key expected histological patterns observed within these tissue types. Likewise, these plots are similar to the low-grade and high-grade dysplasia plots for the IMP cohorts, indicating that the correct histological features are being highlighted when providing the WSI feature explanation. For hyperplastic polyps, we can see that lumen composition, lumen morphology and epithelial cell organisation have a large influence in the slide being classified as abnormal. 
Lumen composition is the ratio of lumen to gland size and can therefore identify glands with lumen dilation, which is a distinguishing feature of hyperplastic polyps. In addition, lumen serrations, which are present in hyperplastic polyps, can lead to irregular lumen morphology, further validating the feature explanations output by our model.
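Averaging WSI-level explanations over subconditions, as in the radar plots, amounts to a grouped mean of per-slide feature importance values. The sketch below assumes each slide is available as a `(subcondition, {feature: importance})` pair; this layout and the toy values are hypothetical.

```python
from collections import defaultdict


def mean_importance_by_subcondition(slides):
    """Average per-slide feature importance within each subcondition.

    slides: iterable of (subcondition, {feature: importance}) pairs.
    Assumes every slide in a subcondition reports the same features.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for subcondition, importance in slides:
        counts[subcondition] += 1
        for feature, value in importance.items():
            sums[subcondition][feature] += value
    return {
        sub: {f: total / counts[sub] for f, total in feats.items()}
        for sub, feats in sums.items()
    }


# Hypothetical toy values, not results from the study.
slides = [
    ("inflammation", {"ICD": 0.75, "GM": 0.25}),
    ("inflammation", {"ICD": 0.25, "GM": 0.25}),
    ("normal", {"ICD": 0.0, "GM": 0.0}),
]
averaged = mean_importance_by_subcondition(slides)
```

Each inner dictionary corresponds to one radar plot, with one axis per feature.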

Figure 6

Analysis of global explanations. (A) Radar plots showing global feature importance for subconditions in the UHCW and IMP datasets. (B) Hierarchical biclustering of feature importance values. 1–7 denote prominent clusters after biclustering, with the following distinguishing histological characteristics: (1) inflammation, without gland neutrophil infiltration; (2) inflammation with both gland lymphocytic and neutrophilic infiltration; (3) neoplasia with irregular gland morphology and large epithelial cells; (4) irregular gland morphology with minimal inflammation; (5) hyperplasia with irregular lumen morphology and composition with inflammation in the lamina propria; (6) eosinophilic infiltration in the lamina propria and (7) neoplasia with gland epithelial cell clustering. UHCW, University Hospitals Coventry and Warwickshire.

WSI-level feature explanations identify population subgroups

In figure 6B, we perform hierarchical biclustering of all abnormal slides and WSI-level feature importance scores to help identify various subgroups that exist within the UHCW dataset. At the bottom of the plot, we identify various patient clusters which have varying histological appearance. These are numbered as follows: (1) general sign of inflammation, without neutrophil infiltration; (2) inflammation with a high degree of both lymphocytic and neutrophilic gland infiltration; (3) mainly neoplastic slides with irregular-shaped glands and large epithelial cells; (4) irregular gland morphology, with minimal inflammation; (5) abnormal lumen morphology and composition, with signs of inflammation in the lamina propria; (6) increased eosinophilic infiltration in the lamina propria and (7) neoplastic slides with gland epithelial clustering. Therefore, this gives us confidence that the network is learning key histological differences among the dataset to make an informed WSI-level prediction. More fine-grained clusters can be observed by referring to the associated dendrograms in the biclustering plot.

Interactive visualisation of results

We provide an interactive demo at https://iguana.dcs.warwick.ac.uk showing sample IGUANA results and highlighting the full output of our model at global and local levels, including the intermediate gland, lumen and nuclear segmentation results. In particular, we display the node explanations overlaid as a heat map on top of the glands and show the local explanations when the user hovers over each node in the overlaid graph. Here, we show the top five features to give insight into what contributes to certain glands being flagged as abnormal. It may also be of interest to assess how features differ between nodes across the WSI. Therefore, we also enable visualisation of each of the 25 features overlaid on the glands as heat maps.
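To make the "top five features" idea concrete, the following minimal sketch ranks the features of a single node by importance and keeps the five largest. It is illustrative only: the function `top_k_features`, the placeholder feature names and the importance values are hypothetical and do not come from the IGUANA code or demo.

```python
import numpy as np


def top_k_features(node_importance, feature_names, k=5):
    """Return the k (name, score) pairs with the highest importance."""
    order = np.argsort(node_importance)[::-1][:k]  # indices, largest first
    return [(feature_names[i], float(node_importance[i])) for i in order]


# Hypothetical importance vector for one gland node over 25 features
feature_names = [f"feature_{i:02d}" for i in range(25)]  # placeholder names
node_importance = np.zeros(25)
node_importance[[3, 7, 11, 19, 23]] = [0.9, 0.5, 0.8, 0.7, 0.6]

top_five = top_k_features(node_importance, feature_names)
top_names = [name for name, _ in top_five]
```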

Discussion

There has been a staffing crisis in pathology for many years,37 which is being further exacerbated by the increased demand for histopathological examination. Embracing new technologies and AI in clinical practice may be necessary as hospitals seek new ways to improve patient care.38 AI screening of large bowel endoscopic biopsies holds great promise in helping to reduce these escalating workloads by filtering out normal specimens. However, no existing solution does this with high predictive performance. Also, explainable AI is now recognised as a key requirement for trustworthy AI in human-centred decision-making,28 yet it is often not considered in healthcare applications. Therefore, in this study, we developed an AI model that can accurately differentiate normal from abnormal large bowel endoscopic biopsies, while providing an explanation for why a particular diagnosis was made.

We demonstrated that our proposed method for automatic colon biopsy screening achieves strong performance during both internal cross validation (mean AUC-ROC=0.98, mean AUC-PR=0.98) and on three independent external datasets (mean AUC-ROC=0.97, mean AUC-PR=0.97). Highly sensitive tools for screening are required to minimise the number of undetected abnormal conditions, since a false negative report is likely to lead to delayed diagnosis and potential patient harm. We believe a sensitivity of 0.99 is a reasonable target because the ground truth being used is the diagnosis provided by pathologists, which may have less than perfect sensitivity. This is also reflected in guidelines for breast biopsy screening in the UK, where sensitivities of 0.99 are expected.39 Currently, we obtain promising specificities of 0.789 ± 0.043 at a sensitivity of 0.97 and 0.541 ± 0.121 at a sensitivity of 0.99, which could have a positive impact on reducing pathologist workloads. We also show in online supplemental figure 6 the expected reduction in clinical workload, where we report up to a 32% time saving by screening out normal biopsies that do not require assessment, while still maintaining a sensitivity of 0.99.
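Reporting specificity at a fixed sensitivity amounts to choosing the highest score threshold that still flags the required fraction of abnormal slides, then measuring how many normal slides fall below it. The sketch below is a minimal illustration under that assumption; the function names, labels and scores are hypothetical, and the fraction of slides screened out that it computes is a simple slide count, not the time-saving estimate reported in the supplemental figure.

```python
import numpy as np


def specificity_at_sensitivity(y_true, scores, target_sensitivity):
    """Pick the highest threshold meeting the target sensitivity.

    y_true: 1 for abnormal, 0 for normal. Slides with score >= threshold
    are flagged as abnormal and sent for pathologist review.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    abnormal_scores = np.sort(scores[y_true])
    # Number of abnormal slides that may be missed without
    # dropping below the target sensitivity
    max_missed = int(np.floor((1 - target_sensitivity) * len(abnormal_scores) + 1e-9))
    threshold = abnormal_scores[max_missed]
    flagged = scores >= threshold
    sensitivity = flagged[y_true].mean()
    specificity = (~flagged[~y_true]).mean()
    return threshold, sensitivity, specificity


def fraction_screened_out(scores, threshold):
    """Fraction of all slides below the threshold (not reviewed)."""
    return float((np.asarray(scores, dtype=float) < threshold).mean())


# Hypothetical example: 4 abnormal and 6 normal slides
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.35, 0.1, 0.2, 0.3, 0.4, 0.5, 0.05]

threshold, sensitivity, specificity = specificity_at_sensitivity(y_true, scores, 0.99)
screened_out = fraction_screened_out(scores, threshold)
```

In this toy example no abnormal slide may be missed, so the threshold drops to the lowest abnormal score and two of the six normal slides are still flagged for review.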

To understand misclassifications made by our model, we show six normal slides with the highest predicted abnormality scores in online supplemental figure 7. After inspection, we see that IGUANA correctly classifies these slides and therefore identifies mislabelling errors in the dataset. Here, the examples should have been labelled as either inflammatory or hyperplastic polyp. In the figure, we include sample image regions, as well as local and WSI-level feature explanations that are reflective of the true category of each slide. In addition, we performed a false negative analysis, where in online supplemental figure 8a we show the counts of various subconditions along with the corresponding number of false negatives. In online supplemental figure 8b, we show the false negative rate of each category. It can be observed that the model found slides with lymphocytic and collagenous colitis somewhat challenging, with false negative rates of 0.29 and 0.46, respectively. Explicit modelling of the subepithelial collagen band should enable us to better detect collagenous colitis. It may be worth noting that there was a relatively small number of collagenous colitis samples in all four cohorts and so they may not have a large impact on the overall performance. Also, a high false negative rate was observed in the mild inflammation category, but this is to be expected because these slides are visually similar to normal samples.
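The per-category false negative rate in such an analysis is the number of abnormal slides of a subcondition predicted as normal, divided by the total number of slides with that subcondition. A minimal sketch follows; the function name, category labels and predictions are hypothetical and are not taken from the study data.

```python
from collections import Counter


def false_negative_rate_by_category(categories, predicted_abnormal):
    """Map each abnormal subcondition to missed slides / total slides."""
    totals = Counter(categories)
    missed = Counter(
        category
        for category, flagged in zip(categories, predicted_abnormal)
        if not flagged  # abnormal slide predicted as normal
    )
    return {category: missed[category] / totals[category] for category in totals}


# Hypothetical abnormal slides with their subcondition and model decision
categories = (
    ["collagenous colitis"] * 4
    + ["mild inflammation"] * 5
    + ["adenocarcinoma"] * 3
)
predicted_abnormal = (
    [True, False, False, True]
    + [True, True, False, True, True]
    + [True, True, True]
)

fnr = false_negative_rate_by_category(categories, predicted_abnormal)
```

Rates computed on small categories are unstable, which is why the counts and the rates are best read together.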

In online supplemental figure 9, we show that our model output is well calibrated and hence can be interpreted as a measure of confidence. To enable explainable predictions, our algorithm relies on an accurate intermediate segmentation step, which requires many pixel-level annotations. This can be a time-consuming step and can therefore act as a bottleneck in the development of similar methods. In addition, the types of features that can be incorporated into our AI algorithm depend on which kinds of histological objects are initially localised. For example, we do not currently detect goblet cells and so do not include features indicative of goblet cell-rich hyperplastic polyps. Other histological objects that could be added include giant cells, signet ring cells and mitotic figures. In addition, although we segment the surface epithelium, we do not extract any associated features that can help identify conditions such as collagenous colitis. Our method also does not assess surface abnormality to detect intestinal spirochaetosis or pigment to detect melanosis coli. These shortcomings will be addressed in future work. Visual examples of features used within our framework, along with examples from the 5th and 95th percentiles, are given in online supplemental figure 10. We also provide a more in-depth description of these features, along with the conditions they can help detect, in online supplemental table 11. In online supplemental figure 1b, we highlight diagnostic features (in a red colour) that are not currently modelled in our framework.
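The calibration claim at the start of the preceding paragraph can be checked numerically as well as visually. One common summary, used here as an assumption because the source does not state which measure underlies the supplemental figure, is the expected calibration error: predictions are grouped into probability bins and the gap between mean predicted probability and observed abnormal fraction is averaged, weighted by bin size. The function name and the data are hypothetical.

```python
import numpy as np


def expected_calibration_error(y_true, probs, n_bins=10):
    """Weighted mean gap between predicted and observed abnormal rate."""
    y_true = np.asarray(y_true, dtype=float)
    probs = np.asarray(probs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per prediction; a probability of 1.0 lands in the last bin
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - probs[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of slides in bin
    return ece


# Well calibrated toy case: 20% of slides scored 0.2 are abnormal,
# 80% of slides scored 0.8 are abnormal
calibrated_probs = [0.2] * 10 + [0.8] * 10
calibrated_labels = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
ece_calibrated = expected_calibration_error(calibrated_labels, calibrated_probs)

# Overconfident toy case: every slide scored 0.9, only half abnormal
overconfident_probs = [0.9] * 10
overconfident_labels = [1] * 5 + [0] * 5
ece_overconfident = expected_calibration_error(overconfident_labels, overconfident_probs)
```

A value near zero supports reading the output score as a confidence; the overconfident case gives a gap of about 0.4.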

There have been recent AI approaches developed for cancer detection in colonic WSIs.24 40 41 However, such approaches cannot be used for screening in clinical practice because they often fail to identify non-cancerous abnormalities such as inflammation. Similarly, AI models have been developed for detecting polyps,42 43 inflammatory bowel disease44 or grading dysplasia,25 but again they do not address the problem of screening normal from all types of abnormality. Our approach uses retrospective biopsies from pathology archives, where data are accordingly labelled as normal or abnormal to reflect the clinical screening process. Therefore, unlike other approaches, our AI model can be directly implemented as a triaging tool and may therefore have a profound effect on reducing pathologist workloads. In addition, most recent automatic methods rely on weak supervision, where only the overall diagnosis is used to guide the algorithm. This strategy may be advantageous because it does not rely on the time-consuming task of collecting many annotations. However, this limits the interpretability of the output, which may hinder the acceptance of such models in hospitals.

Analysis of colon biopsy slides by visual examination, either under the microscope or more recently on the computer screen, is the current gold standard. However, the current practice is unsustainable given the increasing number of specimens that require examination and ongoing staff shortages, with only 3% of NHS hospitals reporting adequate staffing.3 With advances in cancer screening programmes and no immediate sign of the pathologist staffing crisis being resolved, additional measures to assist with reporting will be essential. Our proposed AI model addresses this unmet need by automatically filtering out, with a high degree of accuracy, normal colonic biopsies that require minimal intervention yet make up a substantial proportion of all cases. As a result, our model can reduce the number of samples that require review by pathologists.

AI models are now starting to be used in clinical practice for prostate cancer detection, where a clear advantage for clinicians has been demonstrated in terms of reducing workloads and increasing reporting accuracy.45 46 There is growing evidence that automated methods for tissue diagnosis can transform pathologist workflows and help drive new policies in healthcare. However, no such tool currently exists for screening large bowel endoscopic biopsies, perhaps because no automated tool has been able to accurately detect all kinds of abnormality, including inflammation, dysplasia, hyperplasia and neoplasia. With its triaging capability, the proposed model promises to have positive implications for patient treatment due to faster time to diagnosis, resulting in the potential for early intervention where it is needed the most.

The proposed model may be particularly advantageous in low-income countries, where there is an even greater shortage of pathologists. Despite the obvious benefit of outsourcing tasks to AI in these countries, there remains a lack of infrastructure for digital pathology, which is a requirement for our approach. A few options may be explored to overcome this challenge, such as using digital mobile phone cameras,35 47 acquiring low-cost consumer-grade scanners, or obtaining scanners via financing, leasing, philanthropic sources or non-profit organisations. Rather than investing in expensive hardware and performing full clinical integration, a cloud-based setup may be a more affordable option in low-resource settings, where scanned slides can be uploaded to the internet for processing. With AI models now appearing rapidly on the market, it is becoming increasingly important for policy-makers to put initiatives in place to help with the digitisation of pathology labs across the world, enabling the widespread adoption of computational pathology.

We have shown that IGUANA offers promise as an effective tool for AI-based colon biopsy screening with a strong emphasis on diagnostic interpretability: it provides concrete justification as to why a certain diagnostic class was predicted, making its predictions transparent and explainable. The proposed AI method can help alleviate current pathologist shortages in the NHS and worldwide and reduce turnaround times of screening results. Before deployment in clinical practice, a larger scale validation is required with further analysis of IGUANA’s feature explanation output. In addition, considerable time needs to be invested into extending the current user interface so that it easily integrates with pathologists’ current clinical workflows. This will involve a detailed study on the effectiveness of the decision support tool within abnormal biopsies and assessing its implications on time to report the diagnosis.

Data availability statement

WSIs from University Hospitals Coventry and Warwickshire NHS Trust, East Suffolk and North Essex NHS Foundation Trust, and South Warwickshire NHS Foundation Trust will be made available upon successful application to the PathLAKE data access committee. Relevant information on obtaining the data from the IMP cohort can be found in the original publication.

Ethics statements

Patient consent for publication

Ethics approval

This study was conducted under Health Research Authority National Research Ethics approval 15/NW/0843; IRAS 189095 and the Pathology image data Lake for Analytics, Knowledge and Education (PathLAKE) research ethics committee approval (REC reference 19/SC/0363, IRAS project ID 257932, South Central - Oxford C Research Ethics Committee). The study was conducted on retrospective data from histopathology archives relating to samples taken in the course of clinical care, and for which consent for research had not been taken. Gathering consent retrospectively was not feasible and was deemed not necessary by the research ethics committee, as referenced above. Data collection and usage of the IMP Diagnostics dataset were performed in accordance with the Portuguese national legal and ethical standards applicable to that cohort.

Acknowledgments

We acknowledge Dr SA Sanders and Dr Naresh Chachlani for their assistance in providing WSIs from South Warwickshire NHS Foundation Trust.

References

Footnotes

  • DS and NR are joint senior authors.

  • Twitter @simongraham73, @fayyazhere, @bilal_mohsin, @AyeshaSAzam, @sea_raza, @nmrajpoot

  • Contributors SG, DS and NR designed the study with support from all coauthors. SG led the development of the method with support from FM and NR. SG wrote the code and carried out the experiments. MB provided results using IDaRS and CLAM. MA, YWT, EH, KD, HS, AR, SW, AA, KB, MN, KH, HE, KG and DS provided diagnostic annotations of colonic biopsy slides. SG, FM, MA, KG, DS and NR performed analysis and interpretation of the results. MB, MJ, NW, WL, AB and SEAR provided technical and material support. SG, FM, DS and NR were all involved in the drafting of the paper. NR is the guarantor of the study. All authors read and approved the final paper. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. DS and NR are last authors.

  • Funding All authors would like to acknowledge the support from the PathLAKE digital pathology consortium which is funded by the Data to Early Diagnosis and Precision Medicine strand of the government’s Industrial Strategy Challenge Fund, managed and delivered by UK Research and Innovation (UKRI). FM acknowledges funding from EPSRC grant EP/W02909X/1.

  • Competing interests SG, DS and NR are co-founders of Histofy. DS reports personal fees from Royal Philips, outside the submitted work. NR and FM report research funding from GlaxoSmithKline.

  • Patient and public involvement Patients and/or the public were involved in the design, or conduct, or reporting, or dissemination plans of this research. Refer to the Methods section for further details.

  • Provenance and peer review Not commissioned; externally peer reviewed.