Original research
Histopathologist features predictive of diagnostic concordance at expert level among a large international sample of pathologists diagnosing Barrett’s dysplasia using digital pathology
  1. Myrtle J van der Wel1,
  2. Helen G Coleman2,
  3. Jacques J G H M Bergman3,
  4. Marnix Jansen4,
  5. Sybren L Meijer1
  6. on behalf of the BOLERO working group
    1. 1 Pathology, Amsterdam University Medical Center, Amsterdam, The Netherlands
    2. 2 Centre for Public Health, Queen’s University Belfast, Belfast, UK
    3. 3 Department of Gastroenterology, Academic Medical Center, Amsterdam, The Netherlands
    4. 4 Pathology, UCL Cancer Institute, London, UK
    1. Correspondence to Dr Marnix Jansen, UCL Cancer Institute, Room 234D, 72 Huntley Str, London WC1E 6AG, UK; m.jansen{at}; Dr Sybren L Meijer, Academic Medical Center, 1105 AZ Amsterdam, The Netherlands; s.l.meijer{at}


    Objective Guidelines mandate expert pathology review of Barrett’s oesophagus (BO) biopsies that reveal dysplasia, but there are no evidence-based standards to corroborate expert reviewer status. We investigated BO concordance rates and pathologist features predictive of diagnostic discordance.

    Design Pathologists (n=51) from over 20 countries assessed 55 digitised BO biopsies from across the diagnostic spectrum, before and after viewing matched p53 labelling. Extensive demographic and clinical experience data were obtained via online questionnaire. Reference diagnoses were obtained from a review panel (n=4) of experienced Barrett’s pathologists.

    Results We recorded over 6000 case diagnoses with matched demographic data. Of 2805 H&E diagnoses, we found excellent concordance (>70%) for non-dysplastic BO and high-grade dysplasia, and intermediate concordance for low-grade dysplasia (42%) and indefinite for dysplasia (23%). Major diagnostic errors were found in 248 diagnoses (8.8%), which reduced to 232 (8.3%) after viewing p53 labelled slides. Demographic variables correlating with diagnostic proficiency were analysed in multivariate analysis, which revealed that at least 5 years of professional experience was protective against major diagnostic error for H&E slide review (OR 0.48, 95% CI 0.31 to 0.74). Working in a non-teaching hospital was associated with increased odds of major diagnostic error (OR 1.76, 95% CI 1.15 to 2.69); however, this was neutralised when pathologists viewed p53 labelled slides. Notably, neither case volume nor self-identifying as an expert predicted diagnostic proficiency. Extrapolating our data to real-world case prevalence suggests that 92.3% of major diagnostic errors are due to overinterpreting non-dysplastic BO.

    Conclusion Our data provide evidence-based criteria for diagnostic proficiency in Barrett’s histopathology.

    • barrett's oesophagus
    • dysplasia
    • oesophageal cancer
    • health service research


    Significance of this study

    What is already known on this subject?

    • Pathology evaluation of surveillance biopsies of patients with Barrett’s is poorly reproducible.

    • Guidelines mandate that biopsies with dysplasia be reviewed by an expert, but there are no evidence-based criteria to corroborate expert reviewer status.

    What are the new findings?

    • Through a large online consensus study among more than 50 pathologists in over 20 countries, we reveal histopathologist-dependent predictors of major diagnostic error in the assessment of Barrett’s biopsies.

    • The size of our data set allows us to quantify the impact of these variables, such as experience commensurate with age and professional setting, in multivariate analysis.

    How might it impact on clinical practice in the foreseeable future?

    • Our data provide evidence-based criteria for diagnostic proficiency in Barrett’s histopathology and will help facilitate training and support to reduce diagnostic variability.

    Introduction


    Barrett’s oesophagus (BO) is a premalignant condition which predisposes to oesophageal adenocarcinoma (OAC), with a reported annual conversion rate of 0.1%–0.2%.1–3 BO is defined histopathologically as the replacement of normal stratified squamous epithelial lining of the distal oesophagus with columnar epithelium that can contain intestinal metaplasia. The implementation of formal surveillance strategies and widespread adoption of endoscopic treatment techniques, such as endoscopic resection and ablation for dysplastic BO, have led to a surge in diagnostic pathology workload. The goal of endoscopic surveillance and biopsy verification is objective risk stratification for patients according to their perceived progression risk to OAC.

    Previous studies have revealed, however, that diagnostic reproducibility (interobserver agreement) among pathologists grading dysplastic BO biopsy material is moderate to poor, even among expert reviewers (online supplementary table 1).4–17 Previous work from our group has shown that central pathology review by a dedicated panel within the context of prospective intervention trials failed to confirm an initial diagnosis of low-grade dysplasia (LGD) in over three-quarters of cases submitted for panel review. On follow-up, cases that had been downgraded to non-dysplastic BO (NDBO) revealed a nominal progression risk of about 0.5% per patient/year, while cases that had been confirmed LGD on central review showed a progression risk of about 10% per patient/year. These data clearly attest to the clinical return of dedicated pathology review.18 19 International BO management guidelines now mandate histopathology review of all BO biopsy cases found to reveal dysplasia by an independent expert pathologist.20 21 However, while major society guidelines have qualitatively defined an expert BO pathologist as ‘a pathologist with a special interest in BO-related neoplasia who is recognised as an expert in this field by their peers’, we lack firm evidence-based standards to corroborate expert reviewer status.21–26 This now represents an acute unmet need as these considerations also carry important medicolegal implications.

    Recently, the US Food and Drug Administration has approved the use of whole slide imaging (WSI) for primary diagnostic use.27 The advantages of WSI are numerous and include simultaneous assessment by multiple pathologists, streamlined expert consultation and digital image analysis. It is expected that digital pathology will rapidly gain widespread acceptance in the coming years, in particular in the context of distant case review. A number of large-scale diagnostic consensus studies have been performed, which have broadly suggested that the diagnostic discordance rate between pathologists using digital slide review is non-inferior to conventional glass slide diagnosis.28–30 However, these studies generally examined a large number of diagnostic categories without focusing on a particular category of known diagnostic discordance such as Barrett’s dysplasia. Establishing the validity of this new technology to BO histopathological work-up is therefore a clear priority.

    Here we set out to develop quantitative standards of expert reviewer status for guideline development purposes using massive online digital pathology reporting. We define expert reviewer status as evidence of diagnostic concordance on a par with consensus within an expert review panel, acknowledging that, in lieu of an objective biomarker of progression risk, there will be diagnostic variation among expert pathologists. We collected extensive demographic information of participating pathologists to understand operator-dependent predictors of diagnostic variation.

    Methods



    Sixty-five GI pathologists worldwide were approached to join this study through either professional GI pathology working groups or direct professional contacts. Fifty-nine pathologists responded positively to our enquiries and were recruited to this study, of whom 51 pathologists completed the entire case set of 55 H&E-stained and 55 matching p53 immunohistochemistry labelled slides (110 slides total). These 51 pathologists are henceforth referred to as participating pathologists. Participating pathologists received emails detailing the study objectives and were provided with personal log-in credentials to the purpose-built, online scoring environment described in the Electronic scoring environment section. The lead study author (MJvdW) provided assistance with participating pathologists’ log-in queries, evaluated study progress and chaired the panel consensus meeting.

    Four BO pathologists (including two study authors, MJ and SLM) with extensive experience in BO dysplasia assessment reviewed all slides as a reference pathologist panel. This group has successfully collaborated on previous BO intervention studies where patient outcome has been evaluated prospectively,18 19 31–37 as well as on the Amsterdam Barrett’s Advisory Committee.31 These four pathologists are henceforth referred to as reference pathologists.

    Slide selection and scanning

    The lead study author selected a representative case-mix of 55 BO biopsy cases from across the diagnostic spectrum (online supplementary table 2). Inclusion criteria were: a diagnosis confirmed by a second GI pathologist; documented clinical follow-up of at least 1 year; and an available tissue block. All cases were treatment-naïve. For each case, immunohistochemical staining for p53 was performed using a Ventana BenchMark XT autostainer (Ventana Medical Systems, Tucson, Arizona). Antigen retrieval was performed with Cell Conditioning Solution 1 (CC1) mild. p53 was detected with p53 antibody (Mouse DO-7 + BP 53-12, Thermo Scientific), and the sections were incubated in a 1:500 dilution for 32 min at room temperature. Bound antibody was detected using the Biotin Free ultraView Universal DAB Detection Kit (Roche Diagnostics), and slides were counterstained with haematoxylin (Roche Diagnostics).38 One H&E slide and one consecutive section p53 labelled slide were digitised from each case using a scanner with a 20× microscope objective (Slide, Olympus, Tokyo, Japan). Scans were checked for focus and acuity by the study coordinator and rescanned if necessary. Subsequently, slides were anonymised, randomised, renamed and stored on a secure server. The ‘Digital Slidebox 4.5’ (Slidepath, Leica Microsystems, Dublin, Ireland) virtual slide viewing software was used to evaluate the digital slides during the study. Endoscopic mucosal resection (EMR) specimens were not included in our study cohort.

    Electronic scoring environment

    Template electronic case record forms (CRFs) were custom-built within a web-based software tool designed to capture clinical study data (OpenClinica V.3.6, an open source Center for Translational Molecular Medicine, Translational Research IT (CTMM TraiT) project, Waltham, USA). The first CRF consisted of an extensive questionnaire documenting pathologist characteristics such as age, sex, host institution, and experience in reporting BO biopsies and digital pathology (full questionnaire details in online supplementary table 3). The second CRF was built to record individual case diagnoses and consisted of two parts, so that H&E and H&E plus p53 labelled slide diagnoses were recorded independently. The first part contained a dynamic URL link to the scanned H&E slide and included questions about slide quality and diagnosis, and whether the assessor would require a p53 labelled slide. The second part contained a dynamic link to the p53 labelled slide alongside the matching H&E slide, together with the corresponding slide assessment questions. Importantly, this second part only opened after the study pathologist had completed assessment of the H&E-stained slide and saved their case diagnosis for this slide.

    Digital case assessments

    Reference and participating pathologists were asked to assess each case according to the modified Vienna classification for GI neoplasia.39 40 Reference pathologists first assessed all cases individually and completed the questionnaire. An online consensus meeting was then convened after a 2-month washout period to discuss discrepancies and produce reference diagnoses for each of the 110 assessments (55 H&E-stained slides and 55 matching p53 labelled slides). If reference panel members achieved a majority diagnosis on a case directly from their independent scoring (ie, concordance between either three out of four, or four out of four pathologists), the panel assessment was taken forward as the reference diagnosis without further discussion. To mimic real-world practice, cases without a majority diagnosis were reviewed and discussed by the four pathologists as a group. No majority diagnosis had been reached after individual slide review for 21 cases based on H&E slide viewing and 13 cases based on the p53 labelled slide. These cases were reviewed during the panel discussion (21 H&E slides reviewed without matching p53 labelled slide, and 13 cases with H&E-stained slide and matching p53 labelled slides) to arrive at a consensus diagnosis for all 110 assessments.
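    The majority rule used to decide which cases went to panel discussion can be sketched as follows. This is a minimal illustration, not the study's software: the function name `panel_reference` and the example scores are hypothetical, and only the three-out-of-four threshold comes from the text.

```python
from collections import Counter

def panel_reference(diagnoses, min_agree=3):
    """Return the majority diagnosis if at least `min_agree` panel members
    agree on it; return None when the case needs panel discussion."""
    label, count = Counter(diagnoses).most_common(1)[0]
    return label if count >= min_agree else None

# Hypothetical independent scores from four reference pathologists.
print(panel_reference(["LGD", "LGD", "LGD", "IND"]))   # majority reached
print(panel_reference(["LGD", "LGD", "IND", "NDBO"]))  # no majority: discuss
```

    With four raters and a threshold of three, at most one diagnosis can reach the threshold, so the rule never yields a tie.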

    Two post-p53 case assessments were inadvertently left blank by participating pathologists (one each by two pathologists) after evaluating the case H&E slide. In these cases the H&E slide diagnosis was imputed as the post-p53 case diagnosis, corresponding to two high-grade dysplasia (HGD) diagnoses.

    Population estimates

    To extrapolate our findings to the proportional prevalence of Barrett’s dysplasia in real-world practice, we used incident and surveillance reports from the population-based Northern Ireland Barrett’s Oesophagus Register, the methods of which have been described elsewhere.41 42 The prevalence for the most recently available data in 2014 was applied, in which 2872 patients received a pathology diagnosis of NDBO (n=2627, 91.5%), indefinite for dysplasia (IND) (n=36, 1.2%), LGD (n=85, 3%) or HGD (n=124, 4.3%). These values were then used to estimate the population impact of interpretation discordance for each diagnostic category.

    Statistical analysis

    The characteristics of the 4 reference pathologists and the 51 participating pathologists were compared informally. We examined the overall concordance of the study pathologists compared with the consensus reference diagnosis per case. This process was conducted for each of the 4 individual members of the reference panel against the final consensus diagnosis of this panel, as well as for the overall sample of 51 pathologists against the consensus diagnosis. Per-pathologist scores were not calculated, since we aimed to study the cohort behaviour rather than the individual pathologist. Concordance was initially compared based on four relevant diagnostic categories (NDBO, IND, LGD, HGD), and then compared based on three relevant diagnostic categories (NDBO, IND, and LGD or HGD combined) to reflect the fact that HGD and LGD are now treated endoscopically in some settings.32 We calculated 95% CIs for overall concordance and per diagnostic category. We did not use kappa statistics because this cohort was strongly enriched for dysplasia and kappa is less reliable when cross tables are skewed.
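    As an illustration of a concordance proportion with a 95% CI, the sketch below uses a Wilson score interval. The interval method is an assumption, since the text does not state how the CIs were computed, and the function name `concordance_ci` is hypothetical. The example counts (643 concordant of 816 NDBO assessments on H&E) are those reported in the Results.

```python
import math

def concordance_ci(concordant, total, z=1.96):
    """Proportion concordant with a Wilson score interval.
    The choice of Wilson is an assumption, not the study's stated method."""
    p = concordant / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, centre - half, centre + half

p, lo, hi = concordance_ci(643, 816)
print(f"{p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

    This treats every assessment as independent. Because each pathologist contributed many assessments, an interval that accounts for clustering would probably be wider.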

    To evaluate the potential clinical impact of discordant interpretations across the cohort of participating pathologists, we then reclassified all discordant assessments as either major or minor discordances. Major overinterpretation is defined as NDBO reference diagnosis overinterpreted as either LGD or HGD, whereas, vice versa, major underinterpretation is LGD or HGD reference diagnosis underinterpreted as NDBO by the participating pathologist. These discordant interpretations would bear major consequences in clinical practice. All other discordant interpretations were classified as minor discordant interpretations. A tabular overview of interpretation classifications as major or minor is shown in online supplementary table 4. Since both major overinterpretation and major underinterpretation can have negative implications for patient management, these were further combined for the purposes of some analyses, as indicated.
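    The reclassification described above can be written as a small lookup. This is a sketch of the rule as stated in this paragraph; the full mapping is given in online supplementary table 4, which is not reproduced here, and the function name `classify` is hypothetical.

```python
DYSPLASTIC = {"LGD", "HGD"}

def classify(reference, observed):
    """Classify one assessment against its reference diagnosis."""
    if observed == reference:
        return "concordant"
    if reference == "NDBO" and observed in DYSPLASTIC:
        return "major overinterpretation"
    if reference in DYSPLASTIC and observed == "NDBO":
        return "major underinterpretation"
    # Every other mismatch (eg, LGD vs HGD, or anything involving IND).
    return "minor discordance"

print(classify("NDBO", "HGD"))  # major overinterpretation
print(classify("LGD", "NDBO"))  # major underinterpretation
print(classify("LGD", "HGD"))   # minor discordance
```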

    Unadjusted logistic regression analyses were then conducted to identify any pathologist characteristics that were associated with overall and major overinterpretation or underinterpretation of BO cases compared with the consensus diagnosis. Considering that age and professional experience are inextricably linked, we evaluated individual combinations of age and experience for odds of major overinterpretations and underinterpretations, and combined these into three categories within which similar ORs were observed (online supplementary table 5). Forward selection of significant factors was used to create multivariable-adjusted logistic regression models of characteristics associated with misinterpretation. Although routine use of p53 immunohistochemistry was not associated with diagnostic errors, this was retained in multivariate models for p53-stained slides. All statistical analyses were performed using Stata V.14.2.
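    For a single binary characteristic, the OR from an unadjusted logistic regression equals the cross-product ratio of the corresponding 2×2 table, with a Wald CI on the log scale. The sketch below illustrates that calculation only. The analyses in this study were run in Stata, the counts are hypothetical, and the function name `odds_ratio` is an assumption.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """OR and Wald 95% CI for a 2x2 table.
    a = exposed with major error,   b = exposed without major error,
    c = unexposed with major error, d = unexposed without major error."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    log_or = math.log(or_)
    return or_, math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical counts, not taken from the study.
or_, lo, hi = odds_ratio(60, 940, 100, 900)
print(f"OR {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

    The sketch ignores both the mutual adjustment used in the multivariable models and any correlation between assessments made by the same pathologist.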


    Results

    Study design

    This study is based on assessments of digitised slides to investigate diagnostic concordance of BO biopsies among a large and heterogeneous sample of GI pathologists. We investigated rates and features predictive of diagnostic concordance among these pathologists, with a particular focus on the demographic characteristics of the pathologists, the impact of viewing p53 labelled slides alongside H&E-stained slides, and on features associated with major diagnostic discordance that would negatively impact on patient stratification and treatment pathways. The purposes of this study were to build a quantitative model of expert BO pathologist review characteristics and to provide practical recommendations that could minimise errors in the interpretation of BO biopsies in the routine setting.

    The study flow chart is shown in figure 1A. All pathologists first filled out a baseline questionnaire for detailed demographic and clinical experience data. Pathologists then assessed the 110 digitised slides (55 H&E slides and matching p53 labelled slides) and recorded their answers on dedicated electronic CRFs. As detailed in the Methods section, diagnostic entries were recorded after viewing the H&E-stained slide and again after the matched p53 labelled slide was revealed alongside the case H&E slide.

    Figure 1

    Study design and study participants. (A) Fifty-five representative BO biopsies with H&E slide and consecutive p53 labelled slides were collected and scanned for digital diagnostic review. Each pathologist on the study first completed a detailed demographic questionnaire (online supplementary table 3). Pathologists then assessed 55 biopsy cases whereby diagnostic entries on H&E slide alone and after revealing matched p53 labelled slides were recorded separately allowing detailed insight into the added benefit of p53 labelled slides on diagnostic agreement. Reference diagnoses were established after consensus panel meeting. Within-group interobserver agreement was established for reference panel (n=4) and participating pathologists (n=51), and multivariate regression analyses were carried out to interrogate demographic predictors of diagnostic concordance, as detailed in the text. (B) Map showing geographical dispersion of pathologists participating in the BOLERO study, whereby every red dot signifies a residential city of one or more participating pathologists. BO, Barrett’s oesophagus; IHC, immunohistochemistry.

    The entire study set was completed by 55 pathologists working in over 20 countries and 5 continents (figure 1B). Of these 55 pathologists, 4 pathologists with extensive and published experience in BO dysplasia assessment were designated beforehand as reference pathologists.18 19 32 43 44 In sum, with 55 pathologists reviewing 55 biopsy cases, each of which includes one H&E-stained slide and a matched p53 labelled slide, this generated a massive data set of over 6000 case diagnoses with matched demographic data as input data for our Barrett’s digital pathology (Barrett mOlecuLar ExpeRt cOnsensus (BOLERO)) consensus study, one of the largest digital pathology consensus studies reported thus far. Case diagnoses were compared with reference diagnoses, and we searched for pathologist demographic features that predict diagnostic consensus at expert level.

    Patient characteristics of BO biopsy samples

    Patient characteristics of the sample biopsies are shown in online supplementary table 2. Of these patients, 94.5% were male (52 of 55). The median age at diagnosis was 65, the median body mass index was 27, and the median BO segment length was 4 cm circumferentially, with a maximum of 5 cm. Patients had a history of smoking in 63.6% of cases (35 of 55), a history of heartburn symptoms in 89% of cases (49 of 55) and used antireflux medication in 96.4% of cases (53 of 55).

    Pathologist characteristics

    The baseline characteristics of the pathologists taking part in the study are displayed in table 1 and online supplementary table 6. Participating pathologists represented a heterogeneous sample comprising a wide range of ages, workplace settings (academic teaching, private and/or district general hospital settings) and years of professional experience. Just over 50% of the participating pathologists reported dedicated fellowship experience, while the majority (72%) worked in a large laboratory with ≥10 pathologist colleagues. The guidelines to which pathologists most commonly adhered were North American, British or Japanese; however, a quarter of pathologists reported using other guidelines in their clinical practice. Two-thirds of the participating pathologists self-identified as expert GI pathologists. Note that although pathologists were approached through professional societies, no effort was made to purposely recruit experts into the study. Pathologists also reported on other parameters and working practices in their laboratories, such as typical number of BO cases reported per week, confidence and enjoyment in reporting BO, reporting of endoscopic resection specimens, frequency of adjunct p53 labelled slide use in BO reporting, participation in double-reporting, multidisciplinary team meetings, and use of WSI, as well as typical interactions and perceptions of practices of their endoscopy colleagues (table 1 and online supplementary table 6). Participating and reference pathologists were generally well matched for age ranges and professional experience, although all 4 reference pathologists were male, whereas 22 of 51 (43.1%) participating pathologists in the larger cohort were female.

    Table 1

    Demographics of pathologists reporting in the Barrett mOlecuLar ExpeRt cOnsensus (BOLERO) study

    Case assessment overview

    A total of 3025 diagnoses were generated based on H&E-stained slide case review and another 3025 diagnoses were recorded after viewing the matching p53 labelled slides for study cases (figure 2A,B). The corresponding waterfall plots showing the ranked distribution of assessments reveal a gradual transition from NDBO cases with high interobserver concordance, through intermediate diagnostic categories with lower concordance, to HGD cases with similarly high interobserver concordance. These plots also confirm that our case set includes representative biopsies from across the diagnostic spectrum of BO pathology. Relevant examples of study cases are shown in figure 2C.

    Figure 2

    Diagnostic variation across the study cohort. (A) Waterfall plot showing the ranked distribution of case assessments (n=3025) based on H&E slides alone for the entire cohort of pathologists. The x-axis shows the diagnostic concordance in percentages, and the y-axis shows ranked cases 1–55. Colour coding as in B. (B) Same visualisation for case assessments (n=3025) after revealing matched p53 labelled slides. (C) Four representative examples of the study set. Consensus diagnosis and cohort diagnoses are shown. HGD, high-grade dysplasia; IHC, immunohistochemistry; IND, indefinite for dysplasia; LGD, low-grade dysplasia; NDBE, non-dysplastic Barrett’s esophagus.

    Concordance of reference pathologists versus consensus diagnosis on H&E and p53 labelled slides

    Consensus diagnoses were generated following panel review. The reference panel consensus diagnoses for the H&E-stained slide case review included 16 NDBO, 6 IND, 18 LGD and 15 HGD case diagnoses. After the addition of matched p53 labelled slides and reference panel review, a small number of cases were reclassified, including 1 NDBO diagnosis as LGD, 1 LGD diagnosis as NDBO and 4 IND diagnoses as LGD, thus totalling 16 NDBO, 2 IND, 22 LGD and 15 HGD after p53 labelled slide review.
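    The post-p53 totals follow directly from the six reclassifications. The short check below reproduces that arithmetic from the counts given in this paragraph; the variable names are illustrative.

```python
from collections import Counter

# Reference panel consensus on H&E-stained slides (55 cases).
he_reference = Counter({"NDBO": 16, "IND": 6, "LGD": 18, "HGD": 15})

# (from, to, number of cases) reclassified after p53 labelled slide review.
moves = [("NDBO", "LGD", 1), ("LGD", "NDBO", 1), ("IND", "LGD", 4)]

p53_reference = Counter(he_reference)
for src, dst, n in moves:
    p53_reference[src] -= n
    p53_reference[dst] += n

print(dict(p53_reference))
```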

    Individual consensus panel member diagnoses were then compared with the final consensus panel diagnosis to obtain concordance rates between the four reference pathologists. This revealed excellent diagnostic agreement when reporting NDBO, LGD and HGD on H&E-stained slides alone (84.4%, 65.3% and 78.3%, respectively), rising to 89.4% when LGD and HGD diagnoses were combined. After revealing the matching p53 labelled slide for the 55 cases, agreement improved to 85.9% for NDBO and 72.7% for LGD, was 76.7% for HGD, and rose to 91.9% when LGD and HGD were combined (online supplementary table 7A,B).

    Concordance of participating pathologists versus consensus diagnosis on H&E and p53 labelled slides

    The complete set of 5610 case assessments recorded by the 51 participating pathologists was then compared with the reference panel diagnoses to obtain concordance rates and compare diagnostic agreement within and between categories. The diagnostic agreement between 51 participating pathologists for H&E-stained slide diagnoses is depicted in figure 3A–C and online supplementary figure 1A, while concordance percentages are shown in table 2A. We found excellent concordance between the participating pathologists for NDBO reference diagnosis cases (643 of 816 diagnoses; 78.8%) and HGD reference diagnosis cases (544 of 765 diagnoses; 71.1%). As expected, there was moderate concordance for LGD reference diagnosis cases (382 of 918; 41.6%) and poor concordance for IND reference diagnosis cases (70 of 306; 22.9%). However, if dysplastic assessments were grouped (ie, combining LGD and HGD reference diagnosis cases), then 77.5% (1305 of 1683) of cases were concordant. Major overinterpretation or underinterpretation was found in 8.8% of assessments (248 of 2805 diagnoses).
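    The denominators in this paragraph are the number of reference cases per category multiplied by the 51 participating pathologists. The sketch below recomputes the per-category concordance percentages from the reported counts; the function name `concordance_table` is hypothetical.

```python
N_PATHOLOGISTS = 51

# H&E reference diagnoses (cases per category) and concordant assessments, as reported.
reference_cases = {"NDBO": 16, "IND": 6, "LGD": 18, "HGD": 15}
concordant = {"NDBO": 643, "IND": 70, "LGD": 382, "HGD": 544}

def concordance_table(cases, hits, n_raters):
    """Return {category: (concordant, total assessments, percentage)}."""
    table = {}
    for dx, n_cases in cases.items():
        total = n_cases * n_raters
        table[dx] = (hits[dx], total, round(100 * hits[dx] / total, 1))
    return table

table = concordance_table(reference_cases, concordant, N_PATHOLOGISTS)
for dx, (hit, total, pct) in table.items():
    print(f"{dx}: {hit}/{total} = {pct}%")
```

    The same multiplication gives the overall H&E denominator of 55 × 51 = 2805 assessments used for the major error rate.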

    Figure 3

    Diagnostic variation per reference diagnoses. (A–F) Waterfall plots showing the ranked distribution of case assessments by participating pathologists per diagnostic category, as indicated. The left column (A–C) shows the diagnostic variation per reference diagnosis based on H&E slide review alone, and the right column (D–F) shows the diagnostic variation per reference diagnosis after revealing matched p53 labelled slides. The x-axis shows the diagnostic concordance in percentages, and the y-axis shows the ranked cases. Colour coding as in figure 2B. Diagnostic variation for indefinite for dysplasia cases is shown in online supplementary figure 1. IHC, immunohistochemistry.

    Table 2

    Cross table comparing the 51 participating pathologists’ diagnoses with the consensus-derived reference diagnoses for 55 oesophageal biopsy cases (A) on H&E-stained slides and (B) on H&E and p53 labelled slides for 5610 total case interpretations*

    Addition of matched p53 labelled slides improved diagnostic concordance (figure 3D–F and online supplementary figure 1B), with small but clinically meaningful improvements seen in the diagnostic concordance between participating pathologists for NDBO reference diagnosis cases (83.8% vs 78.8% on H&E slide) and LGD/HGD combined reference diagnosis cases (79.3% vs 77.5% on H&E slide) (table 2B). p53 labelled slides also had a small but beneficial impact on the number of major overinterpretations and underinterpretations (8.3%, 232 of 2805 diagnoses), representing 0.5 percentage points fewer major misinterpretations than with H&E-stained slide diagnosis alone.

    Characteristics associated with concordance on H&E slides

    This massive data set was then interrogated to reveal histopathologist predictors of over-reporting or under-reporting and major diagnostic errors in univariate analysis. To this end, all diagnostic discordances within our data set (ie, case diagnoses not matching reference diagnosis) were first reclassified as major or minor overinterpretation or underinterpretation (see the Methods section and online supplementary table 4). Factors associated with reduced odds of major diagnostic errors included ≥5 years of experience commensurate with age (OR 0.65, 95% CI 0.45 to 0.93); working in an academic teaching hospital (OR 0.59, 95% CI 0.43 to 0.81); routinely double-reporting IND cases (OR 0.70, 95% CI 0.52 to 0.94); working in a larger lab (≥10 vs <10 pathologists; OR 0.72, 95% CI 0.54 to 0.96); and using digital pathology (OR 0.63, 95% CI 0.47 to 0.89). In contrast, working within a district general hospital (OR 1.72, 95% CI 1.30 to 2.26) or private hospital (OR 1.41, 95% CI 1.04 to 1.91) or not using major society guidelines (OR 1.43, 95% CI 1.06 to 1.94) were all associated with increased odds of major diagnostic errors (online supplementary table 8A-C).

    Several factors were not associated with major diagnostic error, including pathologist sex. Participating in upper GI multidisciplinary team meetings was not associated with reduced odds of major diagnostic error, although it was associated with reduced odds of over-reporting. Notably, self-identifying as a Barrett’s pathology expert, holding a dedicated fellowship, and reporting greater enjoyment or confidence in Barrett’s reporting were not associated with decreased odds of major overinterpretation or underinterpretation (online supplementary table 8A). Finally, reporting ≥20 cases per week was associated with reduced odds of overinterpretation or underinterpretation of Barrett’s dysplasia (OR 0.69, 95% CI 0.53 to 0.89), although this association was attenuated when investigating major diagnostic errors (online supplementary table 8B).

    Multivariate analyses before and after revealing matched p53 labelled slides

    Multivariable models were then applied, including all factors associated with collective overinterpretation and underinterpretation on H&E digital slide review in univariate analysis, as shown in figure 4. At least 5 years of experience commensurate with age was the strongest protective factor against major diagnostic error on H&E slide review (OR 0.48, 95% CI 0.31 to 0.74). In contrast, working in a district general hospital was associated with increased odds of major diagnostic error (OR 1.76, 95% CI 1.15 to 2.69). Importantly, this effect was neutralised if pathologists in these settings viewed cases with additional p53 labelled slides (OR 1.44, 95% CI 0.92 to 2.28). As expected, routine use of p53 labelled slides was associated with reduced odds of major diagnostic error. Viewing 5–19 BO cases with p53-stained slides per week was associated with increased odds of major diagnostic errors, which was neutralised when viewing ≥20 cases per week. Most other results showed similar trends to those seen in univariate analysis, but these were no longer statistically significant (figure 4).

    Figure 4

    Characteristics associated with odds of major overinterpretation or underinterpretation of Barrett’s oesophagus with dysplasia in multivariable-adjusted analysis. *All characteristics mutually adjusted for each other. **Additional adjustment for p53 labelled slides in routine pathology practice. AGA, American Gastroenterological Association; BSG, British Society of Gastroenterology; IHC, immunohistochemistry; JES, Japan Esophageal Society.

    Population estimates

    To determine the impact of our results in a real-world clinical setting, we extrapolated the results from this case set (in which dysplastic biopsies were purposely over-represented) to the Barrett’s dysplasia prevalence reported from the population-based Northern Ireland Barrett’s Oesophagus Register. As shown in figure 5, 18.6% of all Barrett’s cases would be classified as having a major overinterpretation or underinterpretation, based on the findings of this study as applied to the real-world clinical setting of H&E slide plus adjunct p53 labelled slide viewing. The majority of these would be attributed to potential overinterpretation of NDBO (426 out of 461 cases, or 92.3%; figure 5).
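    The extrapolation described above amounts to reweighting per-class discordance rates by population prevalence. The sketch below illustrates that arithmetic only. Every number in it is a hypothetical placeholder chosen for readability, not a value from this study or from the Northern Ireland register, and the function name `population_discordance` is likewise illustrative.

```python
def population_discordance(prevalence, over_rate, under_rate):
    """Weight per-class discordance rates by population prevalence.

    All three arguments are dicts keyed by diagnostic class.
    `prevalence` holds population proportions and must sum to 1.
    Returns (over, under, total): the population share of over- and
    underinterpreted diagnoses per class, and the overall discordant share.
    """
    if abs(sum(prevalence.values()) - 1.0) > 1e-9:
        raise ValueError("prevalence must sum to 1")
    over = {c: prevalence[c] * over_rate[c] for c in prevalence}
    under = {c: prevalence[c] * under_rate[c] for c in prevalence}
    total = sum(over.values()) + sum(under.values())
    return over, under, total


# HYPOTHETICAL placeholder inputs, for illustration only.
prevalence = {"NDBO": 0.90, "ID": 0.04, "LGD": 0.04, "HGD": 0.02}
over_rate = {"NDBO": 0.20, "ID": 0.10, "LGD": 0.10, "HGD": 0.00}
under_rate = {"NDBO": 0.00, "ID": 0.10, "LGD": 0.20, "HGD": 0.25}

over, under, total = population_discordance(prevalence, over_rate, under_rate)
ndbo_share = over["NDBO"] / total  # share of all discordance due to NDBO

print(f"discordant share of all diagnoses: {total:.1%}")
print(f"share of discordance from overinterpreted NDBO: {ndbo_share:.1%}")
```

    The point of the sketch is that a class with a modest error rate can still dominate total discordance when its prevalence is high, which is the pattern reported for NDBO in figure 5.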

    Figure 5

    Population-level impact of diagnostic variation for Barrett’s oesophagus surveillance biopsies. The x-axis shows the population prevalence of diagnostic classes, where the width of each class is proportional to its prevalence (total 100%), and the y-axis shows the diagnostic concordance, with the total surface area adding up to all diagnoses made in 1 year. Diagnoses are shown as concordant (in white), overinterpreted (in blue) or underinterpreted (in magenta); the percentage shown is the share of diagnoses in each diagnostic class that would be confirmed on review by an expert pathologist panel (table 2). HGD, high-grade dysplasia; ID, indefinite for dysplasia; LGD, low-grade dysplasia.


    Discussion

    We have carried out the largest investigation of diagnostic concordance of BO biopsy reporting among GI pathologists to date. Previous studies had been limited to a small number of expert pathologists, which meant findings were not necessarily generalisable to real-world settings. This work has revealed several novel findings.

    First, overall concordance for H&E digital slide review of NDBO and LGD/HGD as a combined outcome was excellent (exceeding 77%), although concordance for IND and LGD as a stand-alone diagnosis was lower (23%–42%). These test characteristics replicate known glass slide test characteristics (online supplementary table 1), suggesting that distant BO biopsy slide review is reproducible and safe.

    Second, our multivariate analyses revealed several pathologist characteristics and working practices independently associated with the risk of misinterpretations. Reassuringly, pathologist experience commensurate with age was most protective against major overinterpretation or underinterpretation, confirming the validity of our experimental strategy. Our multivariate regression analyses also confirm that working within a teaching hospital environment protects against major diagnostic error. This provides supportive evidence for guideline statements that BO complicated by dysplasia is best managed within an expert centre.21–23 26 Importantly, self-identifying as an expert was not associated with decreased odds of major overinterpretation or underinterpretation.

    Lastly, our study design sheds light on the context-dependent impact of p53 labelled slides. We find that the overall prevalence of major misinterpretations (NDBO classified as LGD/HGD, or vice versa) across this biopsy series enriched for IND/LGD/HGD cases was 8.8%, which was reduced, marginally, by the addition of p53 labelled slides (8.3%). Although this would suggest a limited impact of the adjunct use of p53 labelled slides, our multivariate analysis allows us to unpack this figure and reveals that major discordance was reduced by viewing matched p53 labelled slides specifically for those pathologists working away from teaching hospital settings. This demonstrates that the beneficial impact of adjunct p53 labelled slides is dependent on context and is greatest outside expert centre settings where, indeed, most primary dysplasia diagnoses in surveillance are made. Extrapolating our concordance data to real-world dysplasia prevalence shows that the majority of major misdiagnoses in real-world practice overinterpret NDBO (426 out of 461 cases, or 92.3%; figure 5). In these cases, routine addition of adjunct p53 labelled slides may have substantial impact towards limiting overdiagnosis, although our study was not designed to examine the latter point. Routine use of p53 labelled slides is supported by several national guidelines,21 23 26 and our study confirms that this is appropriate.

    Taken together, our study provides, for the first time, an evidence-based quantitative model of BO histopathology diagnosis at expert consensus level. Our data reassuringly suggest that BO reporting on a par with expert consensus is not limited to a small league of experienced histopathologists but can be predicted from a small number of intuitive demographic predictors (experience, professional setting, use of p53 labelled slides). This suggests that practical interventions to reduce diagnostic variability are feasible through improved training and support. Implementing routine external review of dysplastic BO biopsies, as mandated by several major society guidelines, requires regional or national teams of dedicated GI pathologists with Barrett’s expertise. Combined with our observation that concordance rates for digital slide viewing were not inferior to conventional glass slide pathology review,18 19 these data suggest that distant digital review of challenging BO biopsy cases is safe to formally implement within current care delivery systems, provided quality benchmarks are met. In the Netherlands, such a set-up has been successfully implemented over the past 5 years to accommodate nationwide digital expert review of all dysplastic BO biopsies.44 45

    Our study has considerable strengths compared with previous interobserver variation studies of BO reporting. We have evaluated diagnostic concordance for dysplastic BO among the largest group of GI pathologists worldwide. The heterogeneous mix of pathologists involved in this study also enabled novel investigations into pathologist-dependent predictors associated with diagnostic discordance. The online reporting strategy mimicked routine workflow and facilitated data collection and curation in a flexible manner. The case set was purposely enriched for dysplastic cases in order to attain sufficient statistical power in our downstream regression analyses. Diagnostic concordance within a large group of pathologists with different levels of GI pathology expertise was excellent for LGD and HGD combined.

    This study also has limitations that are important to note. One caveat to our study design is that the original data set is skewed towards the inclusion of dysplastic biopsies. Our case-mix therefore does not represent a cross-section of diagnostic biopsy cases encountered in daily practice, which would be heavily weighted towards the NDBO end of the spectrum. Because a complete revision study whereby all consecutive surveillance biopsies are prospectively reviewed by a consensus panel of experienced pathologists is not practically feasible, we set out to extrapolate the population impact of histopathologist diagnostic variation from our data set. To this end, we used the dysplasia population prevalence from the Northern Ireland Barrett’s Oesophagus Register (see the Methods section) and modelled the impact of diagnostic variation using our concordance data (figure 5). We found that, across all diagnostic categories, 81.4% of all diagnoses would be confirmed by consensus of four experienced Barrett’s pathologists. Given that the overwhelming majority of Barrett’s surveillance biopsies were reported to contain non-dysplastic Barrett’s mucosa, proportionally the largest share of diagnostic discordance is seen in this category (92.3%). Conversely, a small number of biopsies in routine practice (estimated at 1.3% of total) will initially be reported as non-dysplastic Barrett’s mucosa, whereas consensus panel review would reveal HGD. These data suggest that the population impact of diagnostic variation is real and is most prominent for non-dysplastic Barrett’s biopsies that are overinterpreted, which may lead to overtreatment. A small number of patients would be undertreated despite the presence of abnormalities that mandate invasive management.

    A second limitation is that while our heterogeneous global group of pathologists allowed us to interrogate associations of a host of operator-dependent characteristics with diagnostic consensus (case volume, practice setting, diagnostic experience and so on), this study feature may limit the generalisability of our findings within the national setting. Replication of our findings in samples of pathologists within particular geographical regions adhering to one diagnostic guideline will be required to determine whether the quantitative predictive features described here are similarly applicable in that setting. Given that the majority of pathologists participating in this study were based either in Europe or North America, greater representation from low-income to middle-income settings would be particularly welcome. This could further enhance the value of this recursive exercise for teaching and registration purposes.

    In conclusion, using this rich data set of case assessments by a large, heterogeneous sample of GI pathologists, we have evaluated diagnostic concordance for BO diagnosis using digital case review. Our results reveal quantitative predictors of diagnostic performance that will aid formulation of quality assurance criteria for guideline development and standard implementation of digital pathology in BO biopsy review.


    We sincerely thank Joann Elmore and Gary Longdon for their helpful additions to our study protocol; Alden van Putten, Rudy Scholten, Rene Breet and David de Koning (OpenClinica) for their help in building and maintaining the online study environment; and Onno de Boer, Eelco Roos and Wim van Est for their help with scanning of the slides.


    Supplementary materials

    • Supplementary Data

      This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


    • MJ and SLM are shared corresponding and senior authors.

    • Twitter @jansen_marnix

    • MJvdW and HGC contributed equally.

    • Correction notice This article has been corrected since it published Online First. The corresponding author details have been updated.

    • Collaborators BOLERO study participants (in alphabetical order): Dr Junko Aida, Tokyo Metropolitan Institute of Gerontology, Tokyo, Japan; Dr Rossana Baiocco, General Hospital of Desenzano del Garda, Desenzano, Italy; Dr Camille Boulagnon-Rombi, Université de Reims Champagne-Ardenne, Reims, France; Dr Iva Brcic, Medical University of Graz, Graz, Austria; Dr Lodewijk Brosens, University Medical Center Utrecht, Utrecht, the Netherlands; Dr Fátima Carneiro, IPATIMUP, Porto, Portugal; Dr Gieri Cathomas, Kantonsspital Baselland, Liestal, Switzerland; Dr Denis Chatelain, CHU Amiens-Picardie, Amiens, France; Dr Allison Cluroe, Addenbrooke's Hospital, Cambridge, UK; Dr Parag Dabir, Regional Hospital, Randers, Denmark; Dr Giovanni De Petris, Penrose Hospital, Colorado Springs, USA; Dr Michael Doukas, Erasmus Medical Center, Rotterdam, the Netherlands; Dr Hala El-Zimaity, Toronto General Hospital, Toronto, Canada; Dr Matteo Fassan, University of Padua, Padua, Italy; Dr Roberto Fiocca, University of Genova, Genova, Italy; Dr Jean-François Fléjou, Saint-Antoine Hospital, Paris, France; Dr Alejandro García Varona, Hospital El Bierzo, Leon, Spain; Dr Elvira Gonzalez Obeso, Hospital Clinico Universitario, Valladolid, Spain; Dr Heike Grabsch, (1) Division of Pathology and Data Analytics, Leeds Institute of Medical Research at St James's, University of Leeds, Leeds, UK, and (2) Department of Pathology, GROW School for Oncology and Developmental Biology, Maastricht University Medical Center+, Maastricht, NL; Dr Federica Grillo, University of Genova, Genova, Italy; Dr Barbara Gruber, Patologia Bariloche, San Carlos de Bariloche, Argentina; Dr Laura Guerra Pastrian, University Hospital La Paz, Madrid, Spain; Dr Anne Hoorens, University Hospital Gent, Gent, Belgium; Dr Marnix Jansen, University College Hospital, London, UK; Dr Katerina Kamaradova, Charles University Hospital, Hradec Kralove, Czech Republic; Dr Ryoji Kushima, Shiga University of Medical Science, Shiga, Japan; Dr Cord Langner, Medical University of Graz, Graz, Austria; Dr Rupert Langer, University of Bern, Bern, Switzerland; Dr Felix Lasitschka, Universitätsklinikum Heidelberg, Heidelberg, Germany; Dr Ester Lörinc, University Hospital Lund and Malmö, Lund, Sweden; Dr Luca Mastracci, University of Genova, Genova, Italy; Dr Damian McManus, Belfast HSC Trust, Belfast, UK; Dr Sybren Meijer, Academic Medical Center, Amsterdam, the Netherlands; Dr Carmen Mendez, University Hospital La Paz, Madrid, Spain; Dr Anya Milne, Diakonessenhuis, Utrecht, the Netherlands; Dr Miriam Mitchison, University College Hospital, London, UK; Dr Masoud Mireskandari, Jena University Hospital, Jena, Germany; Dr Elizabeth Montgomery, Johns Hopkins Medical Institute, Baltimore, USA; Dr Cian Muldoon, St James’s Hospital, Dublin, Ireland; Dr Maria O’Donovan, Cambridge Cancer Centre, Cambridge, UK; Dr Rob Odze, Brigham and Women’s Hospital, Boston, USA; Dr Johan Offerhaus, University Medical Centre Utrecht, the Netherlands; Dr Gabriel Olmedilla, University Hospital La Paz, Madrid, Spain; Dr John Pauli, The Prince Charles Hospital, Brisbane, Australia; Dr Rachel S van der Post, Radboud university medical centre, Nijmegen, the Netherlands; Dr Bob Riddell, Mount Sinai Hospital, Toronto, Canada; Dr Ari Ristimaki, Haartman Institute, Helsinki, Finland; Dr Ana Rodriguez, University Hospital La Paz, Madrid, Spain; Dr Manual Rodriguez-Justo, University College Hospital, London, UK; Dr Shigeki Sekine, National Cancer Center Hospital, Tokyo, Japan; Dr Kees Seldenrijk, St Antonius Hospital, Nieuwegein, the Netherlands; Dr Tulio Souza, Hospital Aliança, Salvador, Brazil; Dr Matt Stachler, Brigham and Women’s Hospital, Boston, USA; Dr Michael Vieth, Klinikum Bayreuth, Bayreuth, Germany; Dr Vincenzo Villanacci, Spedali Civili di Brescia, Brescia, Italy; Dr Rhonda Yantiss, Weill Cornell Medical College, New York, USA.

    • Contributors Study concept and design: MJvdW, MJ, SLM. Acquisition of data: MJvdW, MJ, SLM. Analysis and interpretation of data: MJvdW, HGC. Drafting of the manuscript: MJvdW, HGC, MJ, SLM. Critical revision of the manuscript: MJvdW, HGC, JJGHMB, MJ, SLM. Study supervision: SLM, MJ.

    • Funding This study was funded by Cancer Research UK and Dutch Cancer Society.

    • Competing interests None declared.

    • Patient consent for publication Not required.

    • Ethics approval This study used anonymised, archived, formalin-fixed, paraffin-embedded material and did not require approval from the relevant institutional ethics committee under applicable local regulatory law (‘Code of conduct’, FEDERA).

    • Provenance and peer review Not commissioned; externally peer reviewed.

    • Data availability statement All data relevant to the study are included in the article or uploaded as supplementary information.