Article Text

Standalone performance of artificial intelligence for upper GI neoplasia: a meta-analysis
  1. Julia Arribas1,
  2. Giulio Antonelli2,3,
  3. Leonardo Frazzoni4,
  4. Lorenzo Fuccio4,
  5. Alanna Ebigbo5,
  6. Fons van der Sommen6,
  7. Noha Ghatwary7,
  8. Christoph Palm8,9,
  9. Miguel Coimbra10,
  10. Francesco Renna11,
  11. J J G H M Bergman12,
  12. Prateek Sharma13,
  13. Helmut Messmann5,
  14. Cesare Hassan2,
  15. Mario J Dinis-Ribeiro1
  1. 1 CIDES/CINTESIS, Faculty of Medicine, University of Porto, Porto, Portugal
  2. 2 Digestive Endoscopy Unit, Nuovo Regina Margherita Hospital, Rome, Italy
  3. 3 Department of Translational and Precision Medicine, Sapienza University of Rome, Rome, Italy
  4. 4 Department of Medical and Surgical Sciences, S.Orsola-Malpighi Hospital, University of Bologna, Bologna, BO, Italy
  5. 5 III Medizinische Klinik, UniversitatsKlinikum Augsburg, Augsburg, Germany
  6. 6 Department of Electrical Engineering, VCA group, Eindhoven University of Technology, Eindhoven, Netherlands
  7. 7 Department of Computer Engineering, Arab Academy for Science and Technology, Alexandria, Egypt
  8. 8 Regensburg Medical Image Computing (ReMIC), Ostbayerische Technische Hochschule Regensburg, Regensburg, Germany
  9. 9 Regensburg Center of Health Sciences and Technology (RCHST), OTH Regensburg, Regensburg, Germany
  10. 10 INESC TEC, Faculdade de Ciências, University of Porto, Porto, Portugal
  11. 11 Instituto de Telecomunicações, Faculdade de Ciencias, University of Porto, Porto, Portugal
  12. 12 Dept of Gastroenterology, Academic Medical Center, Amsterdam, The Netherlands
  13. 13 Department of Gastroenterology and Hepatology, University of Kansas Medical Center, Kansas City, Kansas, USA
  1. Correspondence to Dr Mario J Dinis-Ribeiro, CIDES/CINTESIS, Faculty of Medicine, University of Porto, Porto 4200-072, Portugal; mario{at}med.up.pt

Abstract

Objective Artificial intelligence (AI) may reduce the rate of underdiagnosed or overlooked upper GI (UGI) neoplastic and preneoplastic conditions, which are missed owing to their subtle appearance and low disease prevalence. Only disease-specific AI performances have been reported, generating uncertainty about its clinical value.

Design We searched PubMed, Embase and Scopus until July 2020, for studies on the diagnostic performance of AI in detection and characterisation of UGI lesions. Primary outcomes were pooled diagnostic accuracy, sensitivity and specificity of AI. Secondary outcomes were pooled positive (PPV) and negative (NPV) predictive values. We calculated pooled proportion rates (%), designed summary receiver operating characteristic curves with respective areas under the curve (AUCs) and performed metaregression and sensitivity analysis.

Results Overall, 19 studies on detection of oesophageal squamous cell neoplasia (ESCN), Barrett's esophagus-related neoplasia (BERN) or gastric adenocarcinoma (GAC) were included, with 218, 445 and 453 patients and 7976, 2340 and 13 562 images, respectively. AI sensitivity/specificity/PPV/NPV/positive likelihood ratio/negative likelihood ratio for UGI neoplasia detection were 90% (CI 85% to 94%)/89% (CI 85% to 92%)/87% (CI 83% to 91%)/91% (CI 87% to 94%)/8.2 (CI 5.7 to 11.7)/0.111 (CI 0.071 to 0.175), respectively, with an overall AUC of 0.95 (CI 0.93 to 0.97). No difference in AI performance across ESCN, BERN and GAC was found, the AUC being 0.94 (CI 0.52 to 0.99), 0.96 (CI 0.95 to 0.98) and 0.93 (CI 0.83 to 0.99), respectively. Overall, study quality was low, with a high risk of selection bias. No significant publication bias was found.

Conclusion We found a high overall AI accuracy for the diagnosis of any neoplastic lesion of the UGI tract that was independent of the underlying condition. This may be expected to substantially reduce the miss rate of precancerous lesions and early cancer when implemented in clinical practice.

  • diagnostic and therapeutic endoscopy
  • gastrointestinal endoscopy
  • gastric pre-cancer
  • Barrett's oesophagus
  • oesophageal lesions

Data availability statement

Data are available upon reasonable request. Complete dataset used for meta-analysis available with the corresponding author upon request.


Significance of this study

What is already known about this subject?

  • Miss rates for upper-GI cancers have been reported to range from 5% to 11%, and up to 40% for Barrett's-related early-stage neoplasia.

  • Many artificial intelligence (AI) systems have been developed, claiming to aid endoscopists in upper-GI neoplasia detection, but these are still often experimental in design, generating uncertainty on clinical applicability and value.

What are the new findings?

  • This analysis presents the pooled accuracy estimates for all reported AI systems for detection of all upper-GI neoplasia but also provides pooled analyses estimating AI accuracy in diverse clinical and methodological settings.

  • AI accuracy for upper-GI neoplasia approximated 90%, with similar values across the spectrum of different cancers; importantly, accuracy was higher in high-quality, multicentre studies using advanced imaging.

How might it impact on clinical practice in the foreseeable future?

  • According to available evidence, AI systems have shown high accuracy in the detection of any kind of upper-GI neoplasia.

  • Further developments should focus on delivering real-time AI systems capable of detecting simultaneously the different types of upper-GI neoplasia.

Introduction

Cancer of the upper GI (UGI) tract—namely oesophageal and stomach cancer—accounts for nearly 1.5 million cancer-related deaths every year worldwide, as these cancers are often diagnosed at a late stage, when curative treatment options are limited.1 Screening for early-stage cancers in high-risk populations and surveillance of precancerous conditions can lead to early diagnosis and a higher chance of successful treatment, and have been shown to reduce incidence and mortality.2–4 However, early lesions and precancerous conditions can still be overlooked,5 especially in the general outpatient endoscopy setting, where disease prevalence and endoscopist experience are lower, increasing the rate of underdiagnosed or overlooked neoplasia.6–9 Furthermore, the low prevalence of these precancerous/cancerous conditions limits the opportunity for structured training in their detection and characterisation. In detail, miss rates for UGI cancers in the European community have been reported to range from 5% to 11%, and up to 40% for Barrett's early-stage neoplasia.5–7 9 10 Advanced imaging techniques11 have shown good performance for detection and characterisation of these lesions, but these outcomes come mainly from academic settings, and their generalisability and cost-effectiveness in community-based endoscopy are questionable.12–14

Artificial intelligence (AI) has been claimed to reduce rates of underdiagnosed or overlooked neoplasia in the UGI tract owing to its high accuracy in lesion recognition.15 AI could help endoscopists detect and interpret findings, and could exploit the high-quality images provided by advanced imaging technologies to improve accuracy further.

In brief, the development of an AI system usually encompasses a training phase, in which the system autonomously ‘learns’ on image libraries where experts have manually selected the areas of interest (ie, visible neoplasia), a validation phase in which the system is again presented with images and recalibrated, and a testing phase in which the system is presented with new, unknown images, and its performance is evaluated.15

Several disease-specific systems have been tested for detection and characterisation of UGI cancerous and precancerous conditions, namely oesophageal squamous cell neoplasia (ESCN), Barrett’s esophagus-related neoplasia (BERN), as well as gastric adenocarcinoma (GAC), gastric atrophy and intestinal metaplasia, resulting in different estimates of accuracy.15 16 Most studies are still experimental in design, with ‘offline’ (ie, not in-vivo) testing,15 generating some uncertainty on the technical feasibility of real-time application as well as on its interaction with the human endoscopist.

In contrast to the disease-specific study setting, endoscopists performing UGI endoscopy need to detect any potential lesion, from ESCN to GAC, with reasonably high accuracy. Thus, a synoptic view of the actual contribution of AI across the whole spectrum of diseases is clinically relevant.

The aim of our systematic review was to summarise AI accuracy for detection and characterisation of the different UGI early neoplastic and preneoplastic lesions.

Methods

The methods of our analysis and inclusion criteria were based on Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) recommendations.17

PRISMA Checklist is available in online supplemental appendix 1.


Study registration

This review was registered in the PROSPERO international database (University of York Centre for Reviews, www.crd.york.ac.uk/prospero/), registration number 180176.

Inclusion and exclusion criteria: reference standard

For the purpose of this systematic review, we considered only studies published in English. All articles published only in abstract form and review articles were excluded. Studies with no histopathology data from which a 2×2 contingency table could be constructed were excluded, as were studies not providing data on a ‘per-image’ basis (eg, reporting only ‘per-pixel’ or ‘per-lesion’ data).

For the meta-analysis on neoplasia detection, we included studies reporting the use of AI for detection of any type of UGI neoplastic lesion as compared with human detection with histological confirmation as ground truth. Regarding the meta-analysis on histological prediction of precancerous conditions, the inclusion criteria were studies reporting the use of AI in characterisation of any kind of UGI precancerous condition and using histology as the reference standard.

Data sources and search strategy

We performed a comprehensive literature search of three scientific databases (PubMed/Medline, Embase and Scopus) up to July 2020 to identify full articles evaluating the diagnostic accuracy of AI-assisted examination on UGI neoplasia with or without the control of endoscopists. In addition, we set out to identify articles evaluating diagnostic accuracy of AI-assisted examination on UGI precancerous lesions, with or without the control of the endoscopists. Electronic searches were supplemented by manual searches of references of included studies and review articles. Complete search strategy and search strings used are available in online supplemental appendix 2.


Selection process

Titles of all the identified articles were screened by two reviewers (JA and GA) to exclude studies not related to study topic or meeting one of the exclusion criteria. The remaining potentially relevant studies were screened for eligibility by analysis of the abstract and the full text—disagreements were resolved through discussion with senior authors (CH and MJD-R) until consensus was achieved. The reasons for excluding trials were recorded. When there were multiple articles for a single study, we used the latest publication and supplemented it, if necessary, with data from the more complete version.

Data extraction

Using standardised forms, the same reviewers (JA and GA) extracted data independently from each eligible study. Reviewers resolved disagreements by discussion with senior authors (CH or MJD-R). The total number of images/cases used in the validation/test sets was extracted, as well as the total number of ‘ground truth’ (human-detected and histologically confirmed as neoplastic or non-neoplastic) images/cases. The numbers of images/cases that were true positive (TP; images/cases showing a neoplastic lesion detected/predicted as neoplastic by AI), true negative (TN; images/cases showing non-neoplastic mucosa without AI detection, or lesions predicted as non-neoplastic), false positive (FP; images/cases showing non-neoplastic mucosa or lesions detected/predicted as neoplastic by AI) or false negative (FN; images/cases showing a neoplastic lesion missed by AI or predicted as non-neoplastic) were extracted.
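The 2×2 counts described above determine every accuracy measure pooled in this meta-analysis. A minimal sketch of the standard definitions (an illustrative Python helper, not the authors' actual extraction tooling) might look like:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic accuracy measures from a 2x2 contingency table.

    tp: neoplastic images correctly flagged by AI
    fp: non-neoplastic images incorrectly flagged as neoplastic
    tn: non-neoplastic images correctly left unflagged
    fn: neoplastic images missed by AI
    """
    sens = tp / (tp + fn)  # sensitivity (true positive rate)
    spec = tn / (tn + fp)  # specificity (true negative rate)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),        # positive predictive value
        "npv": tn / (tn + fn),        # negative predictive value
        "lr_pos": sens / (1 - spec),  # positive likelihood ratio
        "lr_neg": (1 - sens) / spec,  # negative likelihood ratio
    }

# Hypothetical counts chosen to mirror the pooled 90%/89% estimates:
m = diagnostic_metrics(tp=90, fp=11, tn=89, fn=10)
```

With these illustrative counts, LR+ is about 8.2 and LR− about 0.112, close to the pooled values reported in the Results.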

In addition, year of publication, country where the study was conducted, type of study, number of patients, model of the endoscopy system, advanced imaging modalities and type and design of AI systems used were also retrieved. Study authors were contacted for missing information.

Study outcomes

The main outcomes of interest were the pooled diagnostic accuracy, sensitivity and specificity of AI in detection and characterisation of neoplastic lesions in UGI endoscopy. The accuracy of AI was defined as the area under the hierarchical summary receiver-operating characteristic (SROC) curve (area under the curve; AUC). Secondary outcomes were the calculation of pooled positive (PPV) and negative (NPV) predictive values as well as ROC curve calculations, with respective AUC.

Quality assessment

The degree of bias was assessed using a modified version of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS)18 score. In detail, we identified specific bias domains for diagnostic studies in AI. We divided these into two main domains and respective subdomains, namely training bias (subdomains: selection bias, spectrum bias and operator bias) and testing bias (subdomains: overfitting bias and operator bias). For overfitting bias, we considered at low risk of bias papers explicitly describing the use of overfitting-mitigation techniques such as data augmentation, dropout, batch normalisation, regularisation, early stopping and transfer learning from large datasets.

Statistical analysis

We pooled summary estimates of sensitivity, specificity, positive likelihood ratio (LR+) and negative likelihood ratio (LR−) of AI as a diagnostic test, through a bivariate mixed-effect regression model. We computed 95% CIs for the diagnostic accuracy parameters through the bivariate model, as well. PPV and NPV were computed for the pooled prevalence of cancer. Forest plots for sensitivity and specificity, and an SROC curve were provided. LR+ and LR− were applied to the pooled prevalence of the various types of cancer (ie, pretest probability), in order to compute the post-test probability in case of a positive or negative test result. A Fagan’s plot was provided, accordingly.
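The last step above (applying LR+ and LR− to the pooled pretest probability) is ordinary Bayes' rule on the odds scale. A minimal sketch in Python (illustrative only; the actual analysis used the mada package for R):

```python
def post_test_probability(pretest_prob: float, likelihood_ratio: float) -> float:
    """Convert a pretest probability to a post-test probability via a
    likelihood ratio (the computation a Fagan nomogram performs graphically)."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Example with the pooled estimates reported in the Results
# (46% prevalence, LR+ 8.2, LR- 0.111):
after_positive = post_test_probability(0.46, 8.2)    # ~0.87
after_negative = post_test_probability(0.46, 0.111)  # ~0.09
```

With the rounded pooled likelihood ratios this gives approximately 87% and 9%; the small difference from the 88% quoted in the Results reflects rounding of the pooled estimates.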

Heterogeneity was assessed through visual inspection of forest plots and SROC curve, and quantified by the between-study standard deviation (SD) for logit-transformed sensitivity and specificity.19 In order to explain heterogeneity, we performed sensitivity analyses based on subgroup meta-analyses and bivariate metaregression. Variables which might impact on the diagnostic accuracy of AI were selected a priori according to the following levels: (1) ‘cancer’ level (ie, cancer subgroup as divided into ESCN, BERN and GAC), (2) ‘study’ level regarding the methodological quality of the included studies (ie, study size, study design, mono vs multicentre studies, presence or absence of external validation and quality of study) and (3) ‘technological’ level (ie, convolutional neural networks (CNN) vs support vector machine (SVM) AI, and white-light imaging vs advanced imaging). We performed these analyses overall and according to cancer type (ie, ESCN, BERN and GAC). Subgroup analyses were performed if at least two studies provided the data of interest.

Publication bias was assessed by Deeks’ funnel plot with regression test. All analyses were performed with the mada package20 for R.21

Results

Search results

The search strategy yielded a total of 1678 studies. Once duplicates were removed, a total of 1294 studies were screened by analysis of title and abstract, and 1124 studies were removed because they were not related to the study topic or met one or more exclusion criteria. Overall, 170 studies were then further assessed for eligibility by full-text examination, and a further 134 were excluded for not meeting all inclusion/exclusion criteria. Reasons for exclusion were recorded. Finally, 36 articles16 22–56 were included in the final analysis (figure 1).

Figure 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow chart of included studies.

Study characteristics

Among the 36 included studies, 10 had a multicentre and 26 a single-centre design. Twenty-six studies were performed in an Eastern centre, nine in a Western centre and one was a multicentre study involving both Eastern and Western centres.

Almost all the AI systems used were CNNs. In five cases,31 36 44 45 53 SVM systems were used.

Of the included studies, 16 used only internal validation, and 20 used external validation. Of the 20 using external validation, only three16 29 33 used a live, clinical, in-vivo validation, while all other studies used an offline, image-based or video-based validation. Of the three prospective studies included, one29 used a single-screen graphic user interface (GUI) with the AI in overlay, providing real-time feedback, while the others16 33 used a double-screen GUI, with the real-time processing shown on the second screen. The reference standard in all detection studies was expert endoscopists manually delineating neoplasia in endoscopic images. In characterisation studies, the reference standard was histopathological evaluation.

Of the included studies, only 13 reported how many patients were included in the image extraction for the training set. The median number of patients in the training set was 384 (range: 17–15 040). In addition, 26 studies reported how many images were used for the training set (median: 2570; range: 39–125 898). For the testing/validation sets, 26 studies reported the number of patients included, totalling 86 429 patients (median: 60; range: 39–84 425). However, 84 425 of the 86 429 patients came from one very large series.16 In 30 studies, the number of images used was reported, accounting for a total of 1 108 639 (median: 203; range: 60–894 927). Again, however, 894 927 of the 1 108 639 images came from the same single large series.16 Complete characteristics of included studies are reported in table 1.

Table 1

Study details

Studies not included in quantitative synthesis

We contacted all authors of studies that did not present the data as per the inclusion criteria. Studies whose authors did not answer were not included in the quantitative synthesis. Furthermore, eight studies could not be included in the quantitative synthesis because they were the only study in their subgroup or because their methodology did not permit pooling of data. We included these studies in table 1 for completeness, but not in the final quantitative synthesis through meta-analysis.

Finally, 19 studies (all on detection) were included in the quantitative synthesis through meta-analysis, 3 studies22 23 26 on ESCN, 9 studies29–36 53 on BERN and 7 studies40 41 43–46 54 on GAC.

Primary outcome

The pooled prevalence of early-stage neoplasia among included studies was 46% (CI 38% to 54%). Overall, AI had sensitivity 90% (CI 85% to 94%), specificity 89% (CI 85% to 92%), PPV 87% (CI 83% to 91%), NPV 91% (CI 87% to 94%), LR+ 8.2 (CI 5.7 to 11.7) and LR− 0.111 (CI 0.071 to 0.175) for neoplasia detection (figures 2 and 3). Heterogeneity was present both for sensitivity and specificity (0.93 and 0.56 between-study SD in the logit scale, respectively). In absolute terms, at a 46% prevalence of early-stage neoplasia, a positive AI result would increase the disease probability to 88%, whereas a negative result would decrease it to 9% (see figure 4A). Additional analyses according to hypothesised 1% and 10% pretest prevalence of all cancers (ie, low-volume and high-volume endoscopy centres) are provided in online supplemental figure 1.
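As an aid to interpreting the between-study SDs quoted on the logit scale, they can be mapped back to the probability scale around the pooled estimate. An illustrative calculation (not part of the original analysis):

```python
import math

def logit(p: float) -> float:
    """Log-odds transform used for pooling sensitivity/specificity."""
    return math.log(p / (1 - p))

def expit(x: float) -> float:
    """Inverse of the logit transform."""
    return 1 / (1 + math.exp(-x))

# Pooled sensitivity 90% with a between-study SD of 0.93 on the logit scale:
centre = logit(0.90)
band = (expit(centre - 0.93), expit(centre + 0.93))
# band is roughly (0.78, 0.96): a +/-1 SD spread of study-level sensitivities
```

This is why an SD of 0.93 on the logit scale reflects substantial heterogeneity even though the pooled point estimate is high.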


Figure 2

Forest plots of artificial intelligence diagnostic performance for all cancers, and stratified by oesophageal squamous cell neoplasia, Barrett’s esophagus-related neoplasia and gastric adenocarcinoma. (A) Sensitivity. (B) Specificity.

Figure 3

Summary receiver operating characteristics (SROC) curve of artificial intelligence for cancer diagnosis. Each triangle represents an individual study; the triangle size is proportional to each study size. The diamond represents summary sensitivity and specificity. The ellipse represents the 95% confidence region. AUC, area under the curve.

Figure 4

Fagan plot depicting the impact of a positive or negative result of artificial intelligence on pretest probabilities (ie, the pooled prevalence of cancer). (A) All cancers. (B) Oesophageal squamous cell neoplasia (ESCN). (C) Barrett’s esophagus-related neoplasia (BERN). (D) Gastric adenocarcinoma (GAC).

Squamous oesophageal carcinoma detection

When pooling together the three studies22 23 26 on ESCN detection, total number of images used for training was 17 329 from 1130 patients. From one study,22 we were unable to extract the number of patients included in the training set. Total number of images used for testing was 176 841 from 218 patients. Among these, one study22 validated both on still images and videos, and we included only images in order to pool the results with the other studies.

The pooled prevalence of ESCN was 28% (CI 8% to 48%). Overall, AI had sensitivity 93% (CI 73% to 99%), specificity 89% (CI 77% to 95%), PPV 77% (CI 55% to 89%), NPV 97% (CI 88% to 100%), LR+ 8.5 (CI 3.2 to 19.8) and LR− 0.079 (CI 0.011 to 0.351). Details can be found in figure 2 and online supplemental figure 2A. Heterogeneity was present both for sensitivity and specificity (1.36 and 0.81 between-study SD in the logit scale, respectively). In absolute terms, at 28% ESCN prevalence, a positive AI result would increase the disease probability to 77%, whereas a negative result would decrease it to 3% (see figure 4B). Additional analyses according to hypothesised 1% and 10% pretest prevalence of ESCN (ie, low-volume and high-volume endoscopy centres) are provided in online supplemental figure 3.


BERN detection

When pooling together the nine studies29–36 53 on BERN detection, the total number of images used for training was 12 909 from 1506 patients; however, this information was available for only seven studies. The total number of images used for testing was 2340 from 445 patients.

The pooled prevalence of BERN was 50% (CI 42% to 58%). Overall, AI had sensitivity 89% (CI 83% to 93%), specificity 88% (CI 84% to 91%), PPV 88% (CI 84% to 91%), NPV 89% (CI 83% to 93%), LR+ 7.5 (CI 5.2 to 10.4), and LR− 0.120 (CI 0.074 to 0.198). Details can be found in figure 2 and online supplemental figure 2B. Heterogeneity was present both for sensitivity and specificity (0.60 and 0.29 between-study SD in the logit scale, respectively). In absolute terms, at 50% of early stage neoplasia prevalence, a positive result of AI would increase the disease probability to 88%, whereas a negative result would decrease the disease probability to 11% (see figure 4C). Additional analyses according to hypothesised 1% and 10% pretest prevalence of BERN (ie, low-volume and high-volume endoscopy centres) are provided in online supplemental figure 4.


GAC detection

When pooling together the seven studies40 41 43–46 54 on GAC detection, the total number of images used for training was 28 938. The total number of images used for testing was 13 562.

The pooled prevalence of GAC was 48% (CI 33% to 64%). Overall, AI had sensitivity 88% (CI 78% to 94%), specificity 89% (CI 82% to 93%), PPV 88% (CI 80% to 93%), NPV 89% (CI 80% to 94%), LR+ 8 (CI 4.3 to 13.4) and LR− 0.134 (CI 0.063 to 0.274). Details can be found in figure 2 and online supplemental figure 2C. Heterogeneity was present both for sensitivity and specificity (0.97 and 0.66 between-study SD in the logit scale, respectively). In absolute terms, at 48% cancer prevalence, a positive AI result would increase the disease probability to 88%, whereas a negative result would decrease it to 11% (see figure 4D). Additional analyses according to hypothesised 1% and 10% pretest prevalence of GAC (ie, low-volume and high-volume endoscopy centres) are provided in online supplemental figure 5.


Study quality and publication bias

Study quality assessment according to the modified QUADAS score is available in table 2. In detail, eight studies were considered of high quality, and 28 studies of low quality. A significant trend towards selection and spectrum bias was identified, with many studies including only the best images from non-consecutive cases in the training phase. Regarding testing, 20 studies included an external validation, although in most cases the images came from the same centres. In addition, 18 studies were tested against different endoscopists. No significant publication bias was found, according to visual inspection of the funnel plot (online supplemental figure 6) and to Deeks’ regression test (p=0.73).


Table 2

Quality assessment

Sensitivity analysis

In order to explore possible sources of heterogeneity, we performed several metaregression analyses for the overall study sample, and separately for the BERN and GAC subgroups (online supplemental table 1). We also performed subgroup analyses according to categorical variables identified as possible sources of heterogeneity (online supplemental table 2).


In the overall analysis, the sample size of the included studies significantly increased specificity (p=0.003). In the ESCN subgroup, the sample size of the included studies, external validation and high-quality studies significantly increased specificity (p=0.013, p=0.001 and p=0.001, respectively), whereas multicentre studies yielded higher sensitivity (p=0.001). In the BERN subgroup, the sample size of the included studies and the use of advanced endoscopic imaging significantly increased specificity (p=0.006 and p=0.005, respectively). No significant impact on the sensitivity and specificity of AI was found for any of the other prespecified variables (online supplemental table 1). We performed a further subgroup meta-analysis including only the two prospective studies considered in the quantitative analysis, and we found that AI had sensitivity 79% (CI 68% to 87%), specificity 89% (CI 77% to 95%), PPV 76% (CI 57% to 89%), NPV 90% (CI 84% to 94%), LR+ 7.1 (CI 2.9 to 17.3) and LR− 0.240 (CI 0.140 to 0.422). The forest plots for sensitivity and specificity can be found in online supplemental figure 7.


Discussion

According to our meta-analysis, AI systems show a high accuracy in the detection of neoplastic lesions, mostly high-grade dysplasia (HGD)/early cancers, in the UGI tract. Of note, the high performance of AI was consistent across the whole spectrum of UGI neoplastic lesions, including ESCN, BERN and GAC. In addition, the use of advanced imaging was associated with an increase in specificity, at least in BERN detection, suggesting a synergy between these techniques and AI. On the other hand, the main drawback is that these studies were mostly performed in an experimental setting, leaving uncertainty on the interaction between AI and human endoscopists in clinical practice, with a potential cumulative value.

The main clinical relevance of our analysis relates to the miss rate of UGI neoplasia and the impracticality of adequate training. First, most UGI lesions are non-polypoid, requiring a higher level of skill than colorectal polyp detection. Thus, most endoscopists are suboptimal in their detection, missing lesions that would be fully recognisable by an expert on the endoscopy screen. Second, in most Western countries, the prevalence of UGI neoplasia is extremely low, preventing adequate training in lesion recognition during postgraduate courses. Thus, a cumulative 91% sensitivity for early-stage neoplasia, as shown by AI in our study, is unlikely to be achievable by most community-based endoscopists. In addition, endoscopists tend to specialise in one specific disease, such as BE-neoplasia or GAC, also according to country-specific disease prevalence. By contrast, AI appears to be completely robust to such differences, presenting extremely similar values irrespective of the underlying condition. This prompts an urgent implementation of AI in the clinical setting, even more than in the lower GI tract, where the high prevalence and polypoid appearance of neoplasia simplify detection by the endoscopist.

According to our analysis, the negative likelihood ratio of AI techniques is 0.1, indicating an over 90% reduction of the pre-endoscopy probability of disease. In an extremely enriched population, such as that of our review, this was able to significantly reduce the probability of disease. However, most examinations are performed in an average-risk population with a very low overall risk of early-stage neoplasia, leaving a negligible residual risk after a negative AI result.8 57 In detail, our Fagan’s nomogram clearly showed that at a pretest prevalence of 1%, the residual post-test risk is virtually 0%. That is, a virtual miss rate of 0% if all the other quality parameters are observed.
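The 1% low-prevalence scenario can be checked directly from the definition of the likelihood ratio (a worked example using the pooled LR− of 0.111):

```latex
\text{pretest odds} = \frac{0.01}{1-0.01} \approx 0.0101, \qquad
\text{post-test odds} = 0.0101 \times 0.111 \approx 0.00112, \qquad
\text{post-test probability} = \frac{0.00112}{1+0.00112} \approx 0.11\%
```

so the residual risk of disease after a negative AI result is indeed close to zero.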

There are limitations in the included studies that prevent an immediate implementation of AI in clinical practice. First, most studies were carried out ‘offline’ and were developed and tested using a limited number of still, stored images. Very few studies included videos, and even fewer included live, in-vivo validation. On the other hand, the idea that an ‘AI system is watching’ may curtail the time and effort an endoscopist puts into producing high-quality video material: the AI will spot it anyway. As even the best AI system cannot compensate for poor input, there may be cases in which high-quality still images, collected after washing and suctioning, are superior to low-quality videos. Besides, due to the black-box nature of current deep-learning algorithms, a high volume of images is needed to train a trustworthy algorithm. Thus, there is uncertainty about the real-time performance of these devices. The present analysis should prompt the design of high-quality validation studies, possibly applied to real-time live endoscopy, to prove AI efficacy and speed its implementation in everyday clinical practice.

Second, the human interaction with AI findings is unknown. However, it is unlikely that an endoscopist unable to clearly recognise a lesion flagged by AI as neoplastic would refuse to take at least one biopsy to exclude a positive diagnosis. On the other hand, we cannot exclude that FP results will lead to overtreatment of non-neoplastic conditions. In this regard, the increase in specificity when using advanced imaging is relevant. Despite the high accuracy of advanced imaging in expert centres, the use of this technology in the community setting has been hampered by a lack of training and interest. However, if such technologies can increase AI accuracy without requiring additional training, this may reduce the current reluctance to use them.

Third, the possibility of spectrum bias (ie, selecting only the best images and presenting them to the machine) is a potential limitation of AI accuracy reporting. However, the relatively low performance of human experts on the same image databases administered to the machines argues against this bias.

Fourth, the implications of AI-driven detection for the clinical course of the disease are still to be determined and may differ according to stage-related prognosis. On the other hand, this limitation also applies to human detection.

Fifth, we present as primary outcomes the pooled performance of all AI systems for the detection of UGI neoplasia, regardless of the specific kind of neoplasia on which the systems were trained. It is important to note that most systems were trained to detect a specific neoplasia, and their performance may not apply to other kinds of cancer. On the other hand, the largest study published to date16 was trained and externally validated on all UGI neoplasia, with results similar to our pooled analysis, and we also present a separate detailed analysis for every kind of UGI neoplasia to provide the most complete picture of the available literature.

Last, we also set out to evaluate AI systems for neoplasia characterisation, but we did not include any of them in the pooled analysis, since the paucity and presentation of data remain a major limitation of the literature on this subject.

In this review, we propose a modified version of the QUADAS score,18 adapted to the specific biases expected in studies on the accuracy of AI. We think this is clinically valuable: AI is a relatively new tool in the field of endoscopy, many physicians are not familiar with the specific terminology and characteristics of these techniques, and the tools generally used to analyse the limitations and strengths of medical studies may not be accurate in this context. Overall, the majority of included studies were graded as low quality, mainly for the high risk of selection bias in the training or validation sets and for the lack of clinical, real-time validation, which as of today limits the widespread diffusion and use of AI in UGI cancer detection. The proposed modified version of the QUADAS score adapted to AI studies may be a useful tool for designing future high-quality studies. In detail, future studies should broaden the quantity and quality of training image libraries, possibly from multicentric settings. As for testing, in-vivo clinical validation should be considered the gold standard, including different levels of expertise and disease prevalence. Furthermore, standardised reporting of performance results should be strongly encouraged. In this regard, an updated version of the CONSORT statement for studies on AI, named CONSORT-AI,58 has recently been proposed and can greatly assist further standardisation of research results.

In conclusion, initial pooled estimates have shown high AI accuracy for the detection of early neoplasia of the UGI tract, irrespective of the specific disease, prompting further high-quality validation studies for its fast implementation to minimise the high risk of missed lesions in community endoscopy.

Data availability statement

Data are available upon reasonable request. The complete dataset used for the meta-analysis is available from the corresponding author upon request.

Ethics statements

References

Supplementary materials

Footnotes

  • JA and GA are joint first authors.

  • CH and MJD-R are joint senior authors.

  • Twitter @fvdsommen, @ReMIC_OTH

  • Contributors JA, GA, MJD-R and CH: conception and design. JA, GA, CH and MJD-R: data extraction and interpretation; drafting of the article. LFr and LF: statistical analysis. AE, FvdS, CP, MC, FR, HM, NG, LF, LFr, JJGHMB and PS: critical revision of the article for important intellectual content. All authors read and approved the final version of the manuscript. JA and GA, first authors, contributed equally to this manuscript. CH and MJD-R, senior authors, contributed equally to this manuscript.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests JB reports grants and personal fees from Olympus, Fujifilm, Pentax Endoscopy outside the submitted work; PS reports personal fees from Olympus and Boston Scientific, grants from CDx, US Endoscopy, Medtronic, Ironwood, Erbe, Fujifilm, outside the submitted work; CH reports personal fees from Medtronic, Fujifilm, Olympus, outside the submitted work; MJD-R reports grants from Olympus, Fujifilm, outside the submitted work. JA, GA, LF, LFr, AE, FvdS, NG, CP, MC, FR, HM have no COI to declare.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.