Objective Histopathology is potentially an important outcome measure in UC. Multiple histological disease activity (HA) indices, including the Geboes score (GS) and modified Riley score (MRS), have been developed; however, the operating properties of these instruments are not clearly defined. We assessed the reproducibility of existing measures of HA.
Design Five experienced pathologists with GI pathology fellowship training and expertise in IBD evaluated, on three separate occasions at least two weeks apart, 49 UC colon biopsies and scored the GS, MRS and a global rating of histological severity using a 100 mm visual analogue scale (VAS). The reproducibility of each grading system and of individual instrument items was quantified by estimates of intraclass correlation coefficients (ICCs) based on two-way random effects models. Uncertainty of estimates was quantified by 95% two-sided CIs obtained using the non-parametric cluster bootstrap method. Biopsies responsible for the greatest disagreement based on the ICC estimates were identified. A consensus process was used to determine the most common sources of measurement disagreement. Recommendations for minimising disagreement were subsequently generated.
Results Intrarater ICCs (95% CIs) for the total GS, MRS and VAS scores were 0.82 (0.73 to 0.88), 0.71 (0.63 to 0.80) and 0.79 (0.72 to 0.85), respectively. Corresponding inter-rater ICCs were substantially lower: 0.56 (0.39 to 0.67), 0.48 (0.35 to 0.66) and 0.61 (0.47 to 0.72). Correlation between the GS and VAS was 0.62 and between the MRS and VAS was 0.61.
Conclusions Although ‘substantial’ to ‘almost perfect’ ICCs for intrarater agreement were found in the assessment of HA in UC, ICCs for inter-rater agreement were considerably lower. According to the consensus process results, standardisation of item definitions and modification of the existing indices are required to create an optimal UC histological instrument.
- CHRONIC ULCERATIVE COLITIS
- INFLAMMATORY BOWEL DISEASE
- CLINICAL TRIALS
Significance of this study
What is already known on this subject?
Endoscopic bowel healing is associated with favourable long-term outcomes in UC but does not assure histologically inactive disease.
Histological inflammation is associated with an increased risk of relapse.
Quantitative histological indices are preferable to subjective assessment of symptoms as outcome measures in clinical trials.
Although multiple histological scoring systems have been developed and used in clinical trials, their operating properties have not been systematically validated.
Identification and standardisation of the histological features of UC disease activity may help define both treatment goals in clinical practice and outcome measures for clinical trials.
What are the new findings?
Intraclass correlation coefficients (ICCs) for intrarater agreement for histological disease activity using Geboes score (GS), modified Riley score (MRS) and 100 mm visual analogue scale (VAS) in UC are ‘substantial’ to ‘almost perfect’ when pathologists centrally score biopsies.
ICCs for inter-rater agreement are considerably lower than intrarater agreement.
Good correlation exists between a global VAS of histological disease activity and the GS and MRS.
A consensus process identified causes of disagreement and potential solutions for improving inter-rater agreement.
How might it impact on clinical practice in the foreseeable future?
Validated histological scoring systems are needed in UC.
Partially validated indices, such as GS or MRS, can provide preliminary indicators of disease activity and/or progression until fully validated instruments are available.
Standardisation of definitions and measures of image quality may improve future use of currently available indices.
Clinical symptoms and endoscopy are traditionally used to evaluate disease activity in UC. However, there is growing interest in the assessment of histological disease activity based on the concept that resolution of bowel inflammation beyond endoscopic healing may provide additional clinical benefit. For example, in a prospective study of 82 patients with quiescent UC, the presence of residual histological inflammation on rectal biopsy was associated with a threefold risk of relapse at 12 months of follow-up.1 However, before assessment of histological disease activity can be appraised and ultimately accepted as a useful metric in clinical research and clinical practice, validated histological disease activity instruments must be developed.2 The process of validating an index typically involves assessment of reproducibility, validity and responsiveness. Reproducibility, defined as the degree to which repeated measurements provide similar results, is typically assessed by measuring agreement or reliability. Measurement errors and variance between subjects play an important role in the interpretation of reproducibility based on these two measures. Reliability is defined as the extent to which raters are able to consistently distinguish between study subjects, whereas agreement is defined as the extent to which responses from multiple assessments are similar.3 As emphasised by Guyatt, agreement is a relatively more important property for the assessment of evaluative instruments, in contrast to discriminant instruments, where reliability is a priority.4 Validity, defined as the extent to which a score measures what it is intended to measure, is frequently assessed through ‘criterion validity’, which is measured by correlation with a gold standard. However, since there is no existing gold standard for histological disease activity in UC, a global measure of histological disease activity can be used to establish criterion-related validity through construct validity.
The two histological indices most commonly used to evaluate disease activity in UC are the Geboes score (GS) and the modified Riley score (MRS). The GS is a seven-item instrument that has been used as an outcome measure in clinical trials, which classifies histological changes as grade 0 (structural change only); grade 1 (chronic inflammation); grade 2 (a, lamina propria neutrophils; b, lamina propria eosinophils); grade 3 (neutrophils in the epithelium); grade 4 (crypt destruction) and grade 5 (erosions or ulcers), and generates a score from 0 to 5.4, with higher scores indicating greater inflammation. A decrease of the GS to grade 0 or 1 has been empirically designated as histological healing.5 The MRS is a six-item instrument that grades each item on a four-point scale (none, mild, moderate or severe). Scores range from 0 (no inflammation) to 7 (severe acute inflammation).1,6 The MRS, which excludes the items of architectural distortion found in the original Riley score1 based upon the premise that these are unlikely to be responsive to change following therapy, was used as a secondary outcome measure in one large randomised controlled trial.6
Further characterisation of these indices requires the assessment of validity, responsiveness and reproducibility. However, neither instrument was developed using a structured framework for index development nor are their operating properties well defined.7 In this paper, we assessed reproducibility through intrarater and inter-rater agreement testing, for the GS and the MRS, and identified the items of highest disagreement. Correlation between visual analogue scale (VAS), the GS and the MRS was also evaluated.
Rectal biopsies were obtained from patients with active UC who participated in a phase II randomised controlled trial of MLN-02, a monoclonal antibody directed to the alpha-4-beta-7 integrin. Results of this trial were reported previously.6 We elected to only evaluate patients from the control group of the trial as the specific effect of MLN-02, a highly selective inhibitor of gut lymphocyte trafficking, on inflammatory responses in the mucosa is currently unknown.
Biopsies were paraffin embedded, sectioned and H&E stained. The slides were scanned at 40× magnification on a Scanscope CS (Aperio, Vista, California, USA) slide scanner, and the digitised images hosted for viewing on a secure, regulatory-compliant website using proprietary ImageScope (Aperio, Vista, California, USA) software.
Five pathologists with GI pathology fellowship training and experience in IBD (KG, CB, KK, RP and CL) participated in this study. The pathologists were selected both for their expertise and for their willingness to commit time to the initiative and were trained in the use of a central image management system that hosted the digital histological images. Standardised training materials were provided on the GS and MRS that included examples of ideal digital images for each individual index item. Points of disagreement between readers regarding the definitions of items were discussed and consensus was reached prior to initiation of the study. During this training period, central readers selected additional items that did not comprise the GS or MRS but were thought to be relevant. These items were identified as being part of two other non-validated histological indices: Chicago8,9 and Harpaz.10
Each pathologist independently reviewed 50 digital slide images in random order three times, approximately 2 weeks apart. All of the individual items that comprise the GS and MRS were included for scoring. Additionally, items from the Chicago and Harpaz indices, and other items considered potentially relevant by the pathologists, were evaluated at each reading. Global severity of histological disease was assessed using a 100 mm VAS, where no disease activity was scored as 0 and the most severe activity was scored as 100. Images were reviewed independently in the absence of clinical information. A separate subjective assessment of the overall quality of each slide, based on three criteria (stain, section and image quality), was also generated. Following completion of the initial reading process, the sources of disagreement among readers were evaluated using a two-step procedure. First, outlying images were identified using case-deletion diagnostics for mixed models,11 which is performed by sequentially deleting items and examining which of them has the most influence on estimates of variance. Second, a consensus process was performed, in which five additional specialised pathologists (NH, RR, DD, MV and MP) were invited to join the initial central reading pathologists to reassess digital images with the greatest amount of disagreement. Following review of the images where disagreement was greatest, each pathologist completed a survey to identify potential sources of disagreement.12,13 Results of the survey were discussed among the group to attain consensus regarding the sources of variance and methods to standardise these assessments. Rules were created to aid in future studies that require reading of digital histological images.
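The case-deletion step described above can be illustrated with a deliberately simplified sketch. It is not the mixed-model diagnostic cited in the text: as an assumption made purely for illustration, the variance estimate is replaced by the mean between-rater variance per image, and each image is deleted in turn to see how much that statistic changes. The function names and the example scores are hypothetical.

```python
import numpy as np


def disagreement_variance(scores):
    """Mean across images of the between-rater variance (illustrative statistic)."""
    # scores has shape (n_images, n_raters): one value per rater per image.
    return float(np.mean(np.var(scores, axis=1, ddof=1)))


def rank_images_by_influence(scores):
    """Rank images by how much deleting each one lowers the statistic.

    Influence of image i = full-sample statistic minus the statistic
    recomputed with image i removed (leave-one-image-out).
    """
    scores = np.asarray(scores, dtype=float)
    full = disagreement_variance(scores)
    influence = np.array([
        full - disagreement_variance(np.delete(scores, i, axis=0))
        for i in range(scores.shape[0])
    ])
    order = np.argsort(-influence)  # most influential image first
    return order, influence


# Hypothetical scores: 6 images x 5 raters on a 0-100 VAS.
example = np.array([
    [10, 12, 11, 10, 13],
    [40, 42, 39, 41, 40],
    [70, 20, 85, 35, 60],  # raters disagree strongly on this image
    [55, 57, 54, 56, 55],
    [90, 88, 91, 89, 90],
    [25, 27, 24, 26, 25],
])
order, influence = rank_images_by_influence(example)
print("Most influential image index:", order[0])
```

Images at the top of the ranking would be the natural candidates for review in a consensus process of the kind described in the second step.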
Statistical analyses and sample size considerations
All analyses were conducted using SAS V.9.4 (SAS Institute, Cary, North Carolina, USA). Descriptive statistics were used to assess the clinical characteristics of the patients. Inter-rater agreement was defined as the correlation between two measurements on the same biopsy image made by two different pathologists, while intrarater agreement was defined as the correlation between two measurements on the same biopsy image made by the same pathologist. Intra-agreement and interagreement were estimated for each of the histological indices based on a two-way random effects model with interaction between images and raters using the restricted maximum likelihood method. The resulting intraclass and interclass correlation coefficients (ICCs) may be regarded as the most general statistics for agreement since kappa, weighted kappa and concordance correlation coefficients are considered to be special forms of ICCs.14–16 Associated two-sided 95% CIs were obtained using the non-parametric centile bootstrap method with 2000 samples obtained with replacement at the level of image to maintain data structure. This approach is commonly known as the cluster bootstrap method.17 The same approach was used for a subgroup analysis limited to images with no quality issues. The strength of agreement was evaluated according to the criteria of Landis and Koch whereby ICCs of <0.00, 0.00–0.20, 0.21–0.40, 0.41–0.60, 0.61–0.80 and 0.81–1.00 indicate ‘poor’, ‘slight’, ‘fair’, ‘moderate’, ‘substantial’ and ‘almost perfect’ agreement, respectively.18 These benchmarks are more conservative than those suggested by Cicchetti,19 who suggested that excellent agreement should be based on an ICC of no less than 0.75. Measurement errors (residuals) and variance components for each index and individual item are presented to allow for interpretation of the impact of the individual components of the ICC.
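The estimation approach described above can be sketched in Python under several stated assumptions. The study used SAS with restricted maximum likelihood; the sketch below instead uses ANOVA (method-of-moments) variance components for a balanced images × raters × replicates array, with negative estimates truncated at zero. It also assumes the usual reading of the definitions given above: the inter-rater ICC is the image variance divided by the total variance, and the intrarater ICC adds the rater and image-by-rater variances to the numerator. All function names and the simulated data are hypothetical.

```python
import numpy as np


def variance_components(y):
    """ANOVA variance components for a balanced (images, raters, replicates) array."""
    y = np.asarray(y, dtype=float)
    n, r, m = y.shape
    grand = y.mean()
    img = y.mean(axis=(1, 2))   # image means
    rat = y.mean(axis=(0, 2))   # rater means
    cell = y.mean(axis=2)       # image-by-rater cell means

    ms_img = r * m * np.sum((img - grand) ** 2) / (n - 1)
    ms_rat = n * m * np.sum((rat - grand) ** 2) / (r - 1)
    resid = cell - img[:, None] - rat[None, :] + grand
    ms_int = m * np.sum(resid ** 2) / ((n - 1) * (r - 1))
    ms_err = np.sum((y - cell[:, :, None]) ** 2) / (n * r * (m - 1))

    # Solve the expected mean squares; truncate negative estimates at zero.
    var_err = ms_err
    var_int = max((ms_int - ms_err) / m, 0.0)
    var_img = max((ms_img - ms_int) / (r * m), 0.0)
    var_rat = max((ms_rat - ms_int) / (n * m), 0.0)
    return var_img, var_rat, var_int, var_err


def iccs(y):
    """Return (intrarater ICC, inter-rater ICC) under the assumed definitions."""
    var_img, var_rat, var_int, var_err = variance_components(y)
    total = var_img + var_rat + var_int + var_err
    if total == 0:
        return float("nan"), float("nan")
    return (var_img + var_rat + var_int) / total, var_img / total


def cluster_bootstrap_ci(y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CIs obtained by resampling whole images with replacement."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    draws = np.array([iccs(y[rng.integers(0, n, size=n)]) for _ in range(n_boot)])
    lo, hi = np.nanpercentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    return (lo[0], hi[0]), (lo[1], hi[1])


# Simulated data only: 49 images, 5 raters, 3 readings each.
rng = np.random.default_rng(1)
n, r, m = 49, 5, 3
y = (rng.normal(0, 2.0, (n, 1, 1))       # image effect
     + rng.normal(0, 1.0, (1, r, 1))     # rater effect
     + rng.normal(0, 1.0, (n, r, 1))     # image-by-rater interaction
     + rng.normal(0, 1.0, (n, r, m)))    # replicate error
intra, inter = iccs(y)
intra_ci, inter_ci = cluster_bootstrap_ci(y)
print("intrarater ICC", round(intra, 2), "inter-rater ICC", round(inter, 2))
```

Resampling at the level of the image keeps all fifteen readings of a biopsy together, which is what preserves the data structure in the bootstrap samples.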
Correlation of the VAS with the GS and MRS was measured using Pearson's correlation coefficient, accounting for replicates using linear mixed models for point estimates. The cluster bootstrap method with 2000 replicates was used to generate the associated two-sided 95% CIs.20
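A minimal sketch of the correlation analysis follows, with one important simplification: the point estimate is the ordinary Pearson coefficient pooled over all readings, not the mixed-model estimate that accounts for replicates as described above. Only the resampling of whole images mirrors the cluster bootstrap. Function names, array layout (images × raters × replicates) and the simulated scores are assumptions for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Ordinary Pearson correlation of two equally shaped arrays."""
    x = np.ravel(x).astype(float)
    y = np.ravel(y).astype(float)
    return float(np.corrcoef(x, y)[0, 1])


def cluster_bootstrap_r(vas, index, n_boot=2000, alpha=0.05, seed=0):
    """Pooled r with a percentile CI from resampling images (axis 0)."""
    vas = np.asarray(vas, dtype=float)
    index = np.asarray(index, dtype=float)
    rng = np.random.default_rng(seed)
    n = vas.shape[0]
    draws = np.empty(n_boot)
    for b in range(n_boot):
        pick = rng.integers(0, n, size=n)  # same images for both measures
        draws[b] = pearson_r(vas[pick], index[pick])
    lo, hi = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return pearson_r(vas, index), (float(lo), float(hi))


# Simulated data only: both measures track a shared per-image severity.
rng = np.random.default_rng(2)
severity = rng.uniform(0, 100, (49, 1, 1))
vas = severity + rng.normal(0, 10, (49, 5, 3))
gs = severity / 20 + rng.normal(0, 1, (49, 5, 3))
r, ci = cluster_bootstrap_r(vas, gs)
print("r =", round(r, 2), "95% CI", tuple(round(v, 2) for v in ci))
```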
The study evaluated the intrarater and inter-rater reproducibility of the GS, MRS and VAS using a design in which five pathologists made three independent measurements of 50 biopsy images. This sample size was sufficient without consideration of the triplicate images.21 In particular, assuming a true ICC of 0.75, the study had an 83% chance of obtaining a one-sided 95% lower confidence bound for the ICC of 0.6, the ‘substantial’ agreement criterion.
The biopsy slides analysed in this study were obtained from a clinical trial that complied with all applicable regulatory requirements. The consent of study participants included the use of the collected data for other medical purposes, and thus additional consent for the present study was not obtained. All participant information used in the present study was de-identified and the pathologists were blinded to clinical information. Additionally, readers were blinded to the results of their previous reads and to scores of the other readers.
Table 1 shows the demographic and clinical characteristics of the patients. One of the 50 selected digital images was found to be unusable and the analysis is, therefore, based on the remaining 49 patients. The characteristics of the patients were generally representative of participants in induction trials of treatment for mild to moderately active UC.
The spectrum of disease activity of the study population according to VAS scores is presented in figure 1.
Intrarater ICCs for total GS (grades 0–5), MRS and VAS scores were 0.82 (0.73 to 0.88), 0.71 (0.63 to 0.80) and 0.79 (0.72 to 0.85), respectively, indicating ‘substantial’ to ‘almost perfect’ agreement. Inter-rater ICCs (95% CI) for the total GS (grades 0–5), MRS and VAS scores were 0.56 (0.39 to 0.67), 0.48 (0.35 to 0.60) and 0.61 (0.47 to 0.72), indicating ‘moderate’ to ‘substantial’ agreement. When ICCs for GS were measured on a collapsed categorical scale between 1 and 3 (category 1=grade 0 or 1 ‘inactive or mildly active’; category 2=grade 2 or 3 ‘moderately active’ and category 3=grade 4 or 5 ‘severely active with epithelial involvement’), intrarater ICC was 0.77 (95% CI 0.71 to 0.83) and inter-rater ICC was 0.51 (95% CI 0.40 to 0.63), indicating ‘substantial’ and ‘moderate’ intrarater and inter-rater agreement, respectively. Alternatively, when ICCs for GS were measured on a continuous scale from 0 to 22, intrarater ICC was 0.84 (95% CI 0.80 to 0.89) and inter-rater ICC was 0.60 (95% CI 0.46 to 0.71), indicating ‘almost perfect’ and ‘substantial’ intrarater and inter-rater agreement, respectively. Variance components and residuals for each index are presented in table 2.
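The collapsed GS categories and the Landis and Koch benchmarks used to label the ICCs can be written as two small helper functions. This is an illustrative sketch: the function names are hypothetical, and the handling of values that fall between the published bands (for example 0.205) is an assumption, here resolved by treating each band's upper limit as inclusive.

```python
def collapse_gs_grade(grade):
    """Map a GS grade (0-5) to the collapsed three-category scale."""
    categories = {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3}
    if grade not in categories:
        raise ValueError("GS grade must be an integer from 0 to 5")
    return categories[grade]


def landis_koch_label(icc):
    """Label an ICC using the Landis and Koch bands quoted in the Methods."""
    if icc < 0.0:
        return "poor"
    # Upper limits are treated as inclusive (assumption for in-between values).
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if icc <= upper:
            return label
    return "almost perfect"


# Point estimates reported for the total scores.
for name, value in [("GS intrarater", 0.82), ("MRS intrarater", 0.71),
                    ("GS inter-rater", 0.56), ("MRS inter-rater", 0.48)]:
    print(name, value, landis_koch_label(value))
```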
Individual item ICCs for the assessment of architectural features are summarised in table 3. The highest intrarater ICC was observed for crypt architectural distortion 0.85 (0.76 to 0.91) and the lowest intrarater ICC was patchiness 0.48 (0.33 to 0.59). The highest inter-rater ICC was observed for crypt architectural irregularities according to the MRS criteria 0.72 (0.59 to 0.80) and the lowest inter-rater ICC was with patchiness 0.19 (0.06 to 0.32).
Individual item ICCs for the assessment of acute inflammation are summarised in table 4. The highest intrarater ICC was observed for changes according to the Chicago index 0.76 (0.69 to 0.83) and the lowest intrarater ICC was observed for the assessment of crypt abscesses 0.55 (0.44 to 0.66). The highest inter-rater ICC was observed for the detection of neutrophils 0.52 (0.40 to 0.62), while the lowest inter-rater ICC was observed for the presence of lamina propria eosinophils according to the GS criteria 0.26 (0.15 to 0.37).
Individual item ICCs for the assessment of chronic inflammation are summarised in table 5. The highest intrarater ICC was also observed for the detection of a chronic inflammatory infiltrate according to the GS 0.81 (0.72 to 0.86), whereas the lowest intrarater ICC was observed for the assessment of basal plasmacytosis 0.81 (0.72 to 0.86). The highest inter-rater ICC was observed for the detection of a chronic inflammatory infiltrate according to both GS and MRS 0.81 (0.72 to 0.86) and the lowest inter-rater ICC was observed for the assessment of basal plasmacytosis 0.63 (0.48 to 0.74).
Individual item ICCs for the assessment of epithelial injury are summarised in table 6. The highest intrarater ICC was also observed for the detection of erosions or ulcerations according to the GS 0.78 (0.71 to 0.84), whereas the lowest intrarater ICC was observed for the identification of granulomas 0.49 (0.11 to 0.74). The highest inter-rater ICC was observed for the detection of erosions or ulcerations according to the GS 0.56 (0.43 to 0.67) and surface epithelial integrity according to MRS 0.56 (0.43 to 0.67). The lowest inter-rater ICC was observed for the identification of granulomas 0.56 (0.43 to 0.67).
Correlation of histological indices with global VAS
The intent of these analyses was to assess the criterion validity of scores against a global measure of disease activity. The correlation between overall histological severity as measured by the VAS scores and MRS was r=0.624 (95% CI 0.545 to 0.688). For GS, correlation with VAS was measured using three different approaches. First, correlation was measured with GS as a six-grade ordinal scale (0–5) and revealed r=0.61 (95% CI 0.50 to 0.67). Second, correlation between VAS and a total score generated using GS as a continuous scale (total cumulative score of 22) resulted in r=0.66 (95% CI 0.57 to 0.72). Finally, when GS was used as a categorical scale between 1 and 3 (inactive as grade 0 or 1, mildly active as grade 2 or 3 and severely active with epithelial involvement as grade 4 or 5) correlation testing showed r=0.58 (95% CI 0.48 to 0.64) (figure 2).
Disagreement and the consensus process
In total, 17 of the 49 biopsy images accounted for the majority of the disagreement. The most common sources of disagreement were the interpretation of several item definitions including artefact, granulation tissue, crypt destruction, crypt distortion, basal plasmacytosis, lamina propria neutrophils and approaches to scoring of slides with suboptimal and poor quality. Methods to standardise the interpretation of the items with the greatest disagreement were developed during the consensus process (table 7).
The pathologists identified 213 of 734 (29%) digital slides as being of suboptimal quality. The leading explanations for suboptimal image quality were over staining (16%), inadequate sampling (8%), poor orientation (1%), inability to focus adequately (1%) and other (5%). A total of 74 slides (10%) were deemed of poor quality and were subsequently excluded in a subgroup analysis of intrarater and inter-rater agreement.
Results of subgroup analysis excluding poor-quality images
Analysis excluding digital slides deemed as poor quality showed intrarater ICCs of 0.85 (0.75 to 0.90), 0.72 (0.62 to 0.79) and 0.81 (0.74 to 0.86) and inter-rater ICCs (95% CI) of 0.58 (0.41 to 0.71), 0.49 (0.35 to 0.61) and 0.60 (0.46 to 0.71) for GS, MRS and VAS, respectively. Individual ICCs for items that assess architecture, acute inflammation, chronic inflammation and epithelial injury are summarised separately in the online supplementary appendix.
In this large-scale evaluation of histological indices of disease activity in UC, we demonstrated ‘substantial’ to ‘almost perfect’ ICCs for intrarater agreement for both total GS and MRS scores and, with a few exceptions, for the individual items that constitute these instruments. These findings are encouraging since a high degree of reproducibility is a critical operating property of a valid disease activity instrument. The ICCs for inter-rater agreement were considerably lower than for intrarater agreement for the GS, MRS and VAS. Although these differences in intrarater and inter-rater agreement ICCs are expected, because observers are more likely to agree with themselves than with each other, the differences were greater than those observed in two recent studies of identical design that evaluated central endoscopic scoring in UC (endoscopic VAS intraobserver ICC=0.91 (0.80 to 0.94) and interobserver ICC=0.78 (0.70 to 0.85)) and Crohn's disease (Crohn's disease endoscopic index of severity intraobserver ICC=0.89 (0.86 to 0.93) and interobserver ICC=0.89 (0.86 to 0.93)).22,23 We hypothesise that this discordance is due to variations in reader interpretation of item definitions. The GS for UC was originally developed to compare a topical therapy with a systemically acting drug.5 Accordingly, crypt epithelial damage and surface epithelial damage were included as separate items. The index was therefore mainly intended to distinguish between quiescent (inactive disease or grade 1), mildly active disease (defined by the presence of polymorphonuclear cells or neutrophils or grades 2 and 3) and moderate to severely active disease (defined by epithelial cell damage or grades 4 and 5).5 Alternatively, a continuous score could be used.24 While a continuous score may show more variability in the intensity of the inflammation, a positive score may not distinguish between active and inactive disease.
The optimal score will need to be further defined by additional agreement and responsiveness testing, which we are currently undertaking.
Higher ICCs for inter-rater agreement were observed for items assessing acute inflammation including superficially located neutrophils compared with neutrophils in the lamina propria for both GS and MRS. However, inter-rater agreement for the assessment of neutrophils in lamina propria might be improved by standardising the quality of sectioning and staining of the biopsies (figure 3). Acute inflammation can be assessed based on the presence or absence of neutrophils scattered in the epithelium or in the colonic crypts as cryptitis or crypt abscesses. Features that suggest chronic inflammation, such as basal plasmacytosis, can have significant prognostic implications25 and, therefore, are important to detect. All features of chronic inflammation showed high intrarater and inter-rater ICCs according to GS and MRS. Epithelial injury may be considered a marker of severity but may also be confused with artefact that occurs when the endoscopic biopsy sample is acquired or when the slide is prepared. Our results showed a moderate degree of variation in agreement between items of this category. Crypt distortion according to GS and MRS demonstrated high agreement between central pathologists and central histological assessment and can therefore be considered reproducible for diagnosing IBD. However, architectural distortion is not a marker of disease activity. As end points for clinical trials and in clinical practice, acute and chronic inflammatory changes are more relevant than architectural changes and contribute more to decision making regarding treatment and prognosis.
Acquisition and preparation of biopsy samples and tissue sections are technically difficult processes that have potential for artefact and variation in quality. However, based on a subgroup analysis limited to slides with optimal quality, lower-quality slides (10% of the total sample) did not seem to significantly contribute to the observed disagreement among pathologists. Because assessment of histological features is interpretive rather than quantitative, inherent variation in assessment exists. In our study, 17 of the slides accounted for the majority of the disagreement. A systematic survey process of pathologists involved in the study generated new recommendations for scoring definitions that, if validated, may decrease inter-rater scoring variability in future studies. It should be pointed out that we only evaluated biopsies from the recto-sigmoid. Although we know of no data to address this issue, it is possible that regional differences in colonic histopathology might exist that would preclude generalising our results to biopsies taken from other locations. Future studies that incorporate standardised definitions, questionnaires and improved slide quality may help to objectively identify the best scoring system for use as an end point in clinical trials. Whether this instrument is a revised version of an existing index or a newly developed index requires further assessment.
Many of the individual items evaluated in this agreement study show high ICCs for intrarater and inter-rater agreement (crypt architectural distortion according to MRS, detection of neutrophils according to the Chicago index, chronic inflammatory changes according to GS, detection of chronic inflammatory changes according to GS and MRS, erosions or ulcerations according to GS, surface epithelial integrity according to MRS) and are likely the best candidates for inclusion in a new disease activity instrument provided they are responsive to change. Items that showed lower ICCs for intra-agreement and interagreement (patchiness, assessment of surface neutrophils, detection of lamina propria eosinophils according to GS, basal plasmacytosis, identifying granulomas) are likely to be highly problematic unless they can be improved through re-definition, training and increased quality of samples. Basal plasmacytosis, in particular, requires properly oriented biopsy sections. The consensus process provided us with potential approaches to improve agreement. Accordingly, we plan to re-evaluate most of the items in the current study and to also include a formal evaluation of item responsiveness and improved quality control on section preparation. We anticipate that this process will result in either uniformly defined versions of the current indices (GS and MRS) or a new histological disease activity index that is optimally configured to minimise disagreement and to maximise responsiveness to meaningful change. An evaluative index that lacks reproducibility is unlikely to be a valid outcome measure.4 Therefore, identification of a reproducible and responsive histological disease activity index suitable for use as an end point in clinical trials of drug development for UC remains a research priority. This index would provide an objective measurement of response to therapy that directly measures inflammation and that might be predictive of long-term clinical outcomes. 
We speculate that such an index could also potentially be useful in clinical practice. The strong correlation observed between the VAS and both the GS and MRS suggests that these two indices are potentially valid.
In conclusion, we found ‘substantial’ to ‘almost perfect’ ICCs for intrarater agreement among pathologists in the assessment of disease activity in UC using the GS and MRS, but only moderate inter-rater agreement. These findings suggest that while individual pathologists are highly reproducible in their assessment of UC histological disease activity using existing indices, more studies are needed to identify an index that is reproducible and responsive when scored by multiple pathologists. Results of a consensus process helped us characterise the most important sources of disagreement and generated recommendations that may potentially improve inter-rater agreement as a basis for revising the existing instruments or creating a new instrument.
Contributors Guarantor of the article: BGL. Development of study concept and design: MHM, BGF, WJS, GD, RK, CB, KK, DKD, LMS, KAB, JKM, MKV, KG, MAV, RP, CL, RR, NH, MS, MP, LWS, GYZ and BGL. Study supervision: MHM, BGF, WJS, GD, RK, MKV, KG and BGL. Acquisition, analysis and interpretation of the data: MHM, BGF, WJS, GD, RK, CB, KK, DKD, LMS, KAB, JKM, MAS, MKV, KG, MAV, RP, CL, RR, NH, MS, MP, LWS, GYZ and BGL. Statistical analysis: LWS and GYZ. Drafting of the manuscript: MHM, BGF, LMS, LWS, GYZ, MS and BGL. Critical revision of the manuscript for important intellectual content: MHM, BGF, WJS, GD, RK, CB, KK, DKD, LMS, KAB, JKM, MKV, KG, MAV, RP, CL, RR, NH, MS, MP, LWS, GYZ and BGL.
Competing interests The following authors are employees or are affiliated with Robarts Clinical Trials: MHM, BGF, WJS, GD, RK, CB, KK, LMS, KAB, JKM, MAS, MKV, KG, MAV, RP, CL, MP, LWS, GYZ and BGL. The following authors have no competing interests: DKD, RR, NH and MS.
Ethics approval University of Western Ontario Ethics Board.
Provenance and peer review Not commissioned; externally peer reviewed.