Original article
Development and validation of a histological index for UC
  1. Mahmoud H Mosli1,2,3,
  2. Brian G Feagan1,2,4,
  3. Guangyong Zou1,4,
  4. William J Sandborn1,5,
  5. Geert D'Haens1,6,
  6. Reena Khanna1,2,
  7. Lisa M Shackelton1,
  8. Christopher W Walker1,
  9. Sigrid Nelson1,
  10. Margaret K Vandervoort1,
  11. Valerie Frisbie1,
  12. Mark A Samaan1,
  13. Vipul Jairath1,7,8,
  14. David K Driman9,
  15. Karel Geboes10,
  16. Mark A Valasek11,
  17. Rish K Pai12,
  18. Gregory Y Lauwers13,14,
  19. Robert Riddell15,
  20. Larry W Stitt1,4,
  21. Barrett G Levesque1,5
  1. 1Robarts Clinical Trials, Robarts Research Institute, University of Western Ontario, London, Ontario, Canada
  2. 2Department of Medicine, University of Western Ontario, London, Ontario, Canada
  3. 3Department of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
  4. 4Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario, Canada
  5. 5Division of Gastroenterology, University of California San Diego, La Jolla, CA, USA
  6. 6Department of Gastroenterology, Academic Medical Center, Amsterdam, The Netherlands
  7. 7Nuffield Department of Medicine, University of Oxford, Oxford, UK
  8. 8Oxford Clinical Trials Research Unit, University of Oxford, Oxford, UK
  9. 9Department of Pathology, University of Western Ontario, London, Ontario, Canada
  10. 10Department of Pathology, University Hospital of KU Leuven and UZ Gent, Leuven, Belgium
  11. 11Department of Pathology, University of California, San Diego, USA
  12. 12Department of Laboratory Medicine and Pathology, Mayo Clinic Arizona, Scottsdale, USA
  13. 13Massachusetts General Hospital, Boston, USA
  14. 14Department of Pathology, Harvard Medical School, Boston, USA
  15. 15Department of Pathology, Mount Sinai Hospital, University of Toronto, Toronto, Ontario, Canada
  1. Correspondence to Dr Barrett G Levesque, Division of Gastroenterology, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093, USA; bglevesque{at}ucsd.edu

Abstract

Objective Although the Geboes score (GS) and modified Riley score (MRS) are commonly used to evaluate histological disease activity in UC, their operating properties are unknown. Accordingly, we developed an alternative instrument.

Design Four pathologists scored 48 UC colon biopsies using the GS, MRS and a visual analogue scale global rating. Intra-rater and inter-rater reliability for each index and individual index items were measured using intraclass correlation coefficients (ICCs). Items with high reliability were used to develop the Robarts histopathology index (RHI). The responsiveness/validity of the RHI and multiple histological, endoscopic and clinical outcome measures were evaluated by analyses of change scores, standardised effect size (SES) and Guyatt's responsiveness statistic (GRS) using data from a clinical trial of an effective therapy.

Results Inter-rater ICCs (95% CIs) for the total GS and MRS scores were 0.79 (0.63 to 0.87) and 0.80 (0.69 to 0.87). The correlation estimates between change scores in RHI and change score in GS and MRS were 0.75 (0.67 to 0.82) and 0.84 (0.79 to 0.88), respectively. The SES and GRS estimates for GS, MRS and RHI were: 1.87 (1.54 to 2.20) and 1.23 (0.97 to 1.50), 1.29 (1.02 to 1.56) and 0.88 (0.65 to 1.12), and 1.05 (0.79 to 1.30) and 0.88 (0.64 to 1.12), respectively.

Conclusions The RHI is a new histopathological index with favourable operating properties.

  • ULCERATIVE COLITIS
  • HISTOPATHOLOGY

Significance of this study

What is already known on this subject?

  • Histological disease activity is an important outcome measure in controlled studies of therapy for UC.

  • A need exists for reproducible and responsive histological indices.

  • The Geboes score (GS) and modified Riley score (MRS), which are the currently available histopathological instruments, have not been adequately validated.

  • Intra-rater reliability for histological assessment of UC disease activity by expert pathologists is adequate, but inter-rater reliability is suboptimal.

  • Sources of scoring disagreement among pathologists have been identified through a formal consensus process.

What are the new findings?

  • Scoring conventions improved inter-rater reliability among central expert pathologists; intraclass correlation coefficients for inter-rater reliability for histological disease activity using GS, MRS and 100 mm visual analogue score (VAS) in UC are ‘substantial’ to ‘almost perfect’.

  • Although all components of the GS were shown to be reliable, the relationship between GS and a global measure of histological inflammation (VAS) was not linear indicating that GS is not an optimal index.

  • Component items of GS that best predicted VAS were chronic inflammatory infiltrate, lamina propria neutrophils, neutrophils in the epithelium, and erosion or ulceration. These items were included in a new index, the Robarts histopathology index (RHI).

  • RHI was shown to be reproducible, responsive and valid.

How might it impact on clinical practice in the foreseeable future?

  • RHI is a new evaluative instrument that was developed using rigorous methodology.

  • External validation of RHI using an independent data set is required.

  • RHI is likely to be a valuable outcome measure for use in clinical trials.

Introduction

Disease activity in UC has traditionally been evaluated by symptoms and endoscopy.1 ,2 However, a need exists to develop additional objective measures for use in drug development and clinical care. Histological grading of disease activity has promise as an outcome measure in clinical trials of therapy for UC and as a prognostic marker in practice.3–8 Notwithstanding that disease-specific histological activity indices currently exist,9 the limited data available indicate that their operating properties may be suboptimal.7 ,10 ,11 Ideally an evaluative index should be reproducible, respond to clinically meaningful change in disease activity, remain unchanged in stable patients and detect relevant changes in disease status over time. The responsiveness of an evaluative instrument has important implications for sample size calculations in clinical trials, since a responsive instrument can minimise the number of patients required to show significant differences between interventions.12

In a previous study, we assessed the reliability among expert pathologists in evaluating the Geboes score (GS) and the modified Riley score (MRS), the two most commonly used histological instruments.9 Although high levels of intra-rater reliability were observed for both measures, inter-rater reliability was less satisfactory. Six index items were responsible for the majority of the variability in scoring. These findings have important negative connotations for the conduct of large clinical trials where it may be necessary, based upon logistic considerations, to have multiple pathologists score biopsies. In an attempt to understand and minimise sources of disagreement a RAND consensus process examined the component items of these two scores.13 The RAND appropriateness methodology uses a modified ‘Delphi’ panel approach to combine the best available evidence and the personal clinical experience of experts in the field.14 The panel aids decision-making through an iterative process in which questions are identified, alternative viewpoints are defined, experts rate the appropriateness of statements and then, using predefined criteria, consensus is obtained. The RAND appropriateness methodology is considered a robust and highly regarded method for evaluating appropriateness that has been widely used in medicine for a broad range of purposes.15 In this exercise, 10 pathologists with specific expertise in IBD participated in multiple electronic surveys with a goal to identify and minimise variances in item scoring. Ultimately, histological scoring conventions were generated (see online supplementary table S1 and figure S1) for the problematic items.

Given this previous research, the objectives of the current study were to: (1) reassess intra-rater and inter-rater reliability for the existing histological indices and their constituent items following introduction of the scoring conventions, (2) derive a new index using reliable histological items, (3) establish the construct validity of the new index, and (4) assess the responsiveness of the new index by evaluating the validity of its change scores and the ability to detect the treatment effect of a proven effective medical therapy for UC.

Methods

Study population

We used colonic biopsies obtained during the conduct of a multicentre, randomised, placebo-controlled phase 2 induction study of MLN02 (reference 16) to perform an integrated reproducibility/responsiveness/index development and validation study. MLN02 was a previous formulation of vedolizumab, a monoclonal antibody to the α4β7 integrin, which is now approved in the USA and Europe for the treatment of UC. In this trial MLN02 was evaluated as induction therapy in 181 adult patients with UC with clinically active disease (UC Clinical Score (UCCS) between 5 and 9 points with a score of at least 1 point on stool frequency or rectal bleeding), an endoscopic modified Baron score (MBS) >1 and a minimum of 25 cm of disease extent on sigmoidoscopic examination. Participants were randomly assigned in a 1:1:1 ratio to receive intravenous MLN02 at a dose of 0.5 mg/kg (n=58), 2.0 mg/kg (n=60) or placebo (n=63) at baseline and day 29. Clinical remission at week 6 was the primary outcome measure. Flexible sigmoidoscopy with colonic biopsy was performed at baseline, week 4 and week 6. At 6 weeks, 33%, 32% and 14% of patients in the 0.5 mg/kg, 2.0 mg/kg and placebo groups, respectively, were in clinical remission (overall p=0.03; p=0.02 for both comparisons with placebo). Corresponding improvements in favour of MLN02 were observed for mucosal healing, histopathology and quality of life.16 Subsequently the efficacy of vedolizumab, an improved formulation of MLN02, was confirmed in large-scale phase 3 trials for the induction and maintenance of remission in UC.17 For the purpose of this study, data from the two MLN02 dose groups were pooled for analysis because no important differences in efficacy were observed between these groups.

Study material

Original tissue blocks collected during the conduct of the MLN02 trial were used. Biopsies were prepared (paraffin embedded, sectioned, H&E stained) and scanned at ×400 magnification (40×objective×10×microscope eye piece) on a Ventana whole slide scanner (Mt Sinai Services, Toronto, Canada). Scanned images were compressed using Web Microscope Compressor (V.1.064) and hosted for viewing by the pathologist readers on the Robarts Web Microscope database hosted on a secure remote server.

Study design and analytical approaches

The overall design of the study is shown in figure 1. Digital images were read by expert histopathologists (MAV, DKD, GL and RKP), who were unaware of clinical information, treatment assignment or visit number. In the first part of the study which assessed reliability, histological disease activity was evaluated according to the GS, MRS and a global rating of histopathological severity measured on a 100 mm visual analogue scale (VAS). The GS is a seven-item ordinal instrument that has been used as an outcome measure in clinical trials, which classifies histological changes as grade 0 (structural change only); grade 1 (chronic inflammation); grade 2 (a, lamina propria neutrophils; b, lamina propria eosinophils); grade 3 (neutrophils in the epithelium); grade 4 (crypt destruction) and grade 5 (erosions or ulcers), and generates a score from 0 to 5.4, with higher scores indicating greater inflammation.10 Several methods have been used for calculating GS and no generally accepted scoring convention exists. In this study we calculated a total GS score using the original ordinal 6-point scale, which specifies grades 0–5. MRS is a six-item (presence of an acute inflammatory cell infiltrate (neutrophils in the lamina propria), crypt abscesses, mucin depletion, surface epithelial integrity, chronic inflammatory cell infiltrate (round cells in the lamina propria), crypt architectural irregularities) instrument graded on a 4-point scale (none, mild, moderate or severe). Scores range from 0 (no inflammation) to 7 (severe acute inflammation).7 ,16 VAS is an evaluative tool commonly used to measure constructs that range across a continuum of possible responses when no gold standard is available. It requires the evaluator to place a mark on a 100 mm line, where 0 means no disease activity and 100 means severe disease activity.18 Total scores and individual components for GS and MRS were assessed. 
Fifty images of biopsies taken during the MLN02 trial were randomly selected from the complete population of images, two of which were ultimately determined to be unusable due to poor image quality, leaving a total sample of 48 images. The sampling procedure was stratified according to MBS, an endoscopic measure of disease severity, to ensure that a full spectrum of disease activity was available. MBS ranges from 0 to 4 (0, normal mucosa; 1, granular mucosa with an abnormal vascular pattern; 2, friable mucosa; 3, microulceration with spontaneous bleeding; 4, gross ulceration).16 Images were scored three times by all four readers, in a random order, at least 2 weeks apart. GS, MRS and VAS were evaluated on each reading. The participating pathologists were extensively educated on the item scoring conventions developed during the previously described consensus process (see online supplementary figure S1). Representative examples of all of the items were provided to the readers in an electronic atlas (see online supplementary figure S1). Inter-rater and intra-rater reliability statistics were calculated for the overall histological scores and the component items of the GS.
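The reliability statistics described above can be illustrated with a minimal sketch of an intraclass correlation coefficient. The text does not state the exact ICC form used for the analyses; the one-way random effects form, ICC(1,1), is assumed here only because a one-way random effects model is mentioned for the sample size calculation. The function name `icc_oneway` and the layout of the input (one row per image, one column per rating) are illustrative assumptions.

```python
from statistics import mean


def icc_oneway(ratings):
    """ICC(1,1) from a one-way random effects ANOVA.

    ratings: list of rows, one row per image, each row holding the same
    number (k >= 2) of ratings of that image. Assumed layout, for illustration.
    """
    n = len(ratings)      # number of images
    k = len(ratings[0])   # ratings per image
    grand_mean = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]

    # Between-image and within-image mean squares
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy data (not study data): two ratings of three images in full agreement
print(icc_oneway([[1, 1], [2, 2], [3, 3]]))
```

With identical ratings within each image the within-image mean square is zero, so the coefficient equals 1; disagreement between ratings lowers it.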

Figure 1

Overall study design illustrating the major phases of index development for RHI. RCT, randomised-controlled trial; VAS, visual analogue scale; GS, Geboes score; MRS, modified Riley score; ICC, intraclass correlation coefficient; RHI, Robarts histopathology index.

For the new index development (Part 2 of the study), items with at least a ‘moderate’ level of reliability, based on Landis and Koch benchmarks, whereby intraclass correlation coefficients (ICCs) of <0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8 and >0.8 constitute ‘poor,’ ‘fair,’ ‘moderate,’ ‘substantial’ and ‘almost perfect’ reliability, respectively,19 were selected as candidate items in developing a new index, using the VAS score as the anchor. These empirical benchmarks, which were originally developed for grading kappa statistics, have become widely adopted for assessment of ICCs. The development of the new index, which was designated the Robarts Histopathology Index (RHI), centred on building the model that best predicts the VAS score.
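The benchmark cut-offs above can be sketched as a simple lookup. The five labels used here (poor, fair, moderate, substantial, almost perfect) and the handling of values that fall exactly on a boundary are assumptions, since the quoted ranges share their end points; the function names are hypothetical.

```python
def reliability_label(icc: float) -> str:
    """Map an ICC to a benchmark label (boundary handling is an assumption)."""
    if icc < 0.2:
        return "poor"
    if icc < 0.4:
        return "fair"
    if icc < 0.6:
        return "moderate"
    if icc <= 0.8:
        return "substantial"
    return "almost perfect"


def is_candidate_item(icc: float) -> bool:
    """Items needed at least 'moderate' reliability to enter index development."""
    return icc >= 0.4


for value in (0.48, 0.67, 0.83):
    print(value, reliability_label(value), is_candidate_item(value))
```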

Exploratory bivariate analyses between the VAS and each of the items selected based on reliability were performed first to guide the coding of each item. Specifically, we prespecified that GS item variables would be coded as continuous if a linear relationship was demonstrated between change in score and change in VAS. If a linear relationship was not evident, the bivariate relationships were used to collapse item levels. A full model was then obtained using all items, followed by a step-down model building approach with p=0.05 used as the criterion for item selection. Residuals from the final model were subjected to statistical diagnostics examination. The stability of the final model was assessed and calibrated using the bootstrap method with 2000 replicates.20 For ease of calculation of RHI, we standardised the regression coefficients by dividing by the smallest coefficient and rounding to integers.
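The final simplification step (dividing each coefficient by the smallest one and rounding to integers) can be sketched as follows. The coefficients in the example are hypothetical placeholders, not the fitted values from the regression model, and the function name `integer_weights` is an assumption made for illustration.

```python
from typing import Dict


def integer_weights(coefficients: Dict[str, float]) -> Dict[str, int]:
    """Divide each coefficient by the smallest one and round to integers."""
    smallest = min(abs(value) for value in coefficients.values())
    if smallest == 0:
        raise ValueError("coefficients must be non-zero")
    return {item: round(value / smallest) for item, value in coefficients.items()}


# Hypothetical coefficients, NOT the values reported for the final model
example = {"item_a": 2.5, "item_b": 5.4, "item_c": 9.8}
print(integer_weights(example))
```

The item with the smallest coefficient always receives a weight of 1, so the other weights express how much more each item contributes relative to it.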

The construct validity of the new index (Part 3 of the study) was evaluated by comparing Pearson product correlations between the index and clinical, endoscopic and health related quality of life instruments. In this approach a priori predictions were made regarding the correlations between RHI and other valid measures of disease activity (VAS, MBS, sum of the aggregate Mayo Clinic Score (MCS)21 rectal bleeding and stool frequency subscores and the IBD Questionnaire (IBDQ)22). The full MCS21 is composed of four subscores (bleeding, stool frequency, physician assessment and endoscopic appearance) rated from 0 to 3 that are summed to give a total score ranging from 0 to 12, with higher scores representing more severe disease. The IBDQ consists of 32 questions divided into four dimensions: bowel symptoms (10 items), systemic symptoms (5 items), emotional function (12 items) and social function (5 items). Every question has graded responses from 1 (worst situation) to 7 (best situation). Total scores range from 32 to 224 with higher scores representing better quality of life.22 These a priori correlation predictions were then compared with the observed correlations. A valid instrument should show appropriate, ordered relationships with the other relevant measures of disease activity. We also calculated RHI scores corresponding to usual definitions of clinical and endoscopic remission (MBS of 0 or 1, Mayo Clinic rectal bleeding score of 0, Mayo Clinic stool frequency score of ≤1, and a sum of the Mayo Clinic rectal bleeding and stool frequency scores of 0 or 1).

For measurement of responsiveness (Part 4 of the study), images from the baseline biopsies were paired with their corresponding week 6, post-treatment image (a week 4 image was used if no week 6 image was available). The original study included 181 patients, of whom 154 had a baseline and a week 6 (or week 4) image available (18 patients did not have a week 4 or week 6 image, 1 patient did not have a baseline image, 8 patients had unusable images at either baseline or week 4 or week 6). For these analyses, a single central reader (RKP) reviewed each pair of images and scored GS, MRS and RHI. Correlations between change scores were used to assess the validity of the new index; correlations exceeding a threshold of 0.7 were considered acceptable. To further evaluate the potential value of RHI as an outcome measure in clinical trials and to facilitate index interpretability, the standardised effect size (SES) and Guyatt's responsiveness statistic (GRS)23 were calculated using treatment allocation with patients assigned to MLN02 considered ‘changed’ and those assigned to placebo considered ‘unchanged.’ Furthermore, these statistics were also calculated using the total population and the following clinical definitions of change: (1) a 1-point difference from baseline in the MBS and (2) a 2-point change in the sum of the aggregate rectal bleeding and stool frequency subscores of the MCS. The degree of index responsiveness was classified using previously described conventions whereby effect sizes of 0.2, 0.5 and 0.8 represent low, moderate and large degrees of responsiveness.24
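The two responsiveness statistics can be sketched using their commonly cited definitions, which are assumed here because the text does not spell them out: SES as the mean change divided by the SD of baseline scores, and GRS as the mean change in ‘changed’ patients divided by the SD of change in ‘unchanged’ patients. The sign convention (change = baseline minus follow-up, so improvement is positive), the function names and the toy numbers are all illustrative assumptions.

```python
from statistics import mean, stdev


def standardised_effect_size(baseline, follow_up):
    """Mean change divided by the SD of baseline scores (assumed definition)."""
    changes = [b - f for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def guyatt_responsiveness(changed_changes, unchanged_changes):
    """Mean change in 'changed' patients over SD of change in 'unchanged' patients."""
    return mean(changed_changes) / stdev(unchanged_changes)


def responsiveness_label(effect_size):
    """Classify using the 0.2 / 0.5 / 0.8 conventions (boundaries assumed inclusive)."""
    size = abs(effect_size)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "low"
    return "below the low threshold"


# Toy data, not study data
ses = standardised_effect_size([10, 12, 14, 16], [6, 8, 10, 12])
grs = guyatt_responsiveness([4, 6, 8], [1, -1, 0, 2, -2])
print(round(ses, 2), responsiveness_label(ses))
print(round(grs, 2), responsiveness_label(grs))
```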

Sample size justification

In Part 1 of the study, the sample size calculation was based on a one-way random effects model as discussed by Zou.25 Assuming a true ICC of 0.7, rating of 50 images three times would have 80% chance of obtaining the one-sided 95% lower bound >0.5. For Part 2, the sample size was determined by applying the ‘rule-of-10’ which states that 10 observations per item are sufficient in a regression analysis. The purpose of the rule is to assure stability of the model estimates and although it was empirically derived, considerable experience exists that supports the validity of this approach.26 A total of 150 images was regarded as sufficient for the present study since 15 items were evaluated in the regression model. This sample size was also large enough to distinguish a difference in ICCs of 0.8 from 0.7 at the 5% significance level, suggesting the sample size for Part 3 and responsiveness in Part 4 were sufficient. The sample size for assessing the magnitude of effect size in Part 4 was determined according to a formula for an imbalanced study with smaller group sizes of n and m, respectively (ie, where r is the ratio of the larger group to smaller group, ES is the target effect size, z is the quantile value of the standard normal distribution). Thus, for r=2 (based upon a treatment assignment of 1:1:1 to placebo, low dose MLN02 and high dose), a total of 150 images (50 placebo and 100 pooled MLN02) would have 88% power to detect an effect size of 0.3 at a two-sided 5% significance level.
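The ‘rule-of-10’ arithmetic for Part 2 can be checked with a short sketch, together with the number of paired images reported as available for the responsiveness analysis. The function and variable names are illustrative assumptions; the counts are those stated in the text.

```python
def rule_of_ten(n_items: int, observations_per_item: int = 10) -> int:
    """Minimum number of observations under the rule-of-10."""
    return n_items * observations_per_item


required_images = rule_of_ten(15)  # 15 items evaluated in the regression model

# Paired images available: patients enrolled minus those with missing
# follow-up images, missing baseline images or unusable images
enrolled = 181
missing_follow_up, missing_baseline, unusable = 18, 1, 8
paired_images = enrolled - missing_follow_up - missing_baseline - unusable

print(required_images, paired_images, paired_images >= required_images)
```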

Ethical considerations

The biopsies analysed in this study were obtained from a clinical trial that complied with all applicable regulatory requirement(s). The consent of study participants included the use of the collected data for other medical purposes, and thus additional consent for the present study was not obtained. All participant information used in the present study was de-identified and the pathologists were blinded to clinical information.

Results

Study population

Baseline characteristics of the study patients are summarised in table 1. There were no important differences in baseline demographics or disease characteristics between patients treated with MLN02 and those treated with placebo, and the characteristics were generally representative of participants in induction trials of treatment for active UC.

Table 1

Baseline demographics and clinical characteristics

Index reliability

Individual ICCs for VAS, MRS and GS are summarised in table 2. Intra-rater ICCs (95% CI) for VAS, MRS and GS (5-point scale) were 0.83 (0.77 to 0.87), 0.85 (0.77 to 0.91) and 0.88 (0.79 to 0.93), respectively, indicating ‘almost perfect’ intra-rater reliability. Inter-rater ICCs (95% CI) for VAS, MRS and GS (5-point scale) were 0.67 (0.55 to 0.74), 0.80 (0.69 to 0.87) and 0.79 (0.63 to 0.87) indicating ‘substantial’ to ‘almost perfect’ inter-rater reliability (table 2). Intra-rater ICCs were also above the ‘substantial’ benchmark for scoring of all items included in GS. Inter-rater ICCs for the scoring of the individual items were all above the benchmark indicating ‘moderate’ reliability, with the lowest inter-rater ICC (0.48 (0.34 to 0.58)) observed for scoring of lamina propria eosinophils. These results suggest that all GS items following application of scoring conventions may be regarded as reliable and, hence were eligible candidate items for further index development. It is also notable that the point estimates for all of the index scores and the component items, with the exception of structural/architectural change, were superior to those obtained in a previous reliability study.9 This finding suggests that the scoring conventions improved inter-rater reliability. This impression was confirmed by examining the scores of the single reader (RKP) who participated in both studies. Clear evidence of a training effect was demonstrated (see online supplementary table S2).

Table 2

Reliability of VAS, MRS and GS

Item selection for index development

Figure 2A, which shows the bivariate relationships between each of the GS items and the VAS score, demonstrates a linear relationship between increments in VAS scores and the GS items ‘structural/architectural change’, ‘chronic inflammatory infiltrate’ and ‘lamina propria neutrophils.’ Thus, the scores for these items were treated as continuous values in the model. Figure 2A also shows that a VAS score of approximately 50 corresponds to scores of 1, 2 and 3 for the GS item ‘lamina propria eosinophils,’ which justified collapsing these three levels into one level. Therefore, for model development, ‘lamina propria eosinophils’ scores were recoded as 0 and 1. Similarly, ‘neutrophils in epithelium’ was recoded as 0, 1 (original scores=1 and 2) and 2 (original score=3), ‘crypt destruction’ was recoded as 0 and 1 (original score=1, 2 and 3), and ‘erosion or ulceration’ was recoded as 0, 1 (original scores=1 and 2), 2 (original score=3) and 3 (original score=4).
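The recoding of item levels for model development can be sketched as a set of lookup tables. The dictionary keys and the function name are hypothetical, and the map for ‘erosion or ulceration’ assumes that original levels 1 and 2 are the ones merged into recoded level 1, which is the reading consistent with original levels 3 and 4 mapping to 2 and 3.

```python
# Original level -> recoded level, for items whose levels were collapsed
RECODE = {
    "lamina_propria_eosinophils": {0: 0, 1: 1, 2: 1, 3: 1},
    "neutrophils_in_epithelium": {0: 0, 1: 1, 2: 1, 3: 2},
    "crypt_destruction": {0: 0, 1: 1, 2: 1, 3: 1},
    # Assumption: original levels 1 and 2 are merged
    "erosion_or_ulceration": {0: 0, 1: 1, 2: 1, 3: 2, 4: 3},
}


def recode(item: str, original_level: int) -> int:
    """Return the level used in the regression model for a given item."""
    mapping = RECODE.get(item)
    if mapping is None:
        # Items with a linear relationship to VAS keep their original scale
        return original_level
    return mapping[original_level]


print(recode("erosion_or_ulceration", 4))       # collapsed item
print(recode("lamina_propria_neutrophils", 3))  # kept as continuous
```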

Figure 2

(A) Univariable summaries of VAS scores as stratified by the levels of Geboes items. The figure shows the VAS scores for the histopathological items evaluated according to each of their levels. These data were used to select the number of item levels for regression analysis. For example, a linear relationship is present for ‘lamina propria neutrophils’ so four levels were included whereas only two levels were appropriate for ‘crypt destruction.’ The numbers of images are shown in the right margin. (B) Calibration plot of actual versus predicted VAS using the final model with four variables (chronic inflammatory infiltrate, lamina propria neutrophils, neutrophils in epithelium, and erosion or ulceration). The perfect prediction (Ideal) is shown by the 45° line. The model performance as assessed by the derivation sample is shown by the dotted line (Apparent). The model performance as assessed by bootstrap validation with 2000 replications is shown by the dashed line (Bias-corrected). The closeness between the Apparent and Bias-corrected plots suggests stability of the model performance in data sets other than that used to derive the model. VAS, visual analogue scale.

The model building process started with all seven GS item variables, followed by a step-down procedure with a bootstrap of 2000 resamples, which selected a final model with ‘chronic inflammatory infiltrate’, ‘lamina propria neutrophils’, ‘neutrophils in the epithelium’ and ‘erosion or ulceration’ as items that best predicted VAS with an R² value of 0.82 (table 3). The calibration plot using a bootstrap of 2000 resamples (figure 2B) shows that the final model has reasonable external validity. After simplification of the model, RHI can be calculated as:

RHI = 1 × chronic inflammatory infiltrate + 2 × lamina propria neutrophils + 3 × neutrophils in the epithelium + 5 × erosion or ulceration, with each item scored from 0 to 3.

Table 3

Final regression model for Robarts Histopathology Index

The total score ranges from 0 (no disease activity) to 33 (severe disease activity). The intra-rater and inter-rater ICCs (95% CIs) for RHI were 0.92 (0.88 to 0.94) and 0.82 (0.74 to 0.86), indicating ‘almost perfect’ intra-rater and inter-rater reliability.
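A minimal sketch of the index calculation follows. The integer weights 1, 2, 3 and 5 and the 0 to 3 scoring of each item are taken from the commonly cited published form of RHI and are an assumption here; they reproduce the stated range of 0 to 33 but should be checked against table 3. The dictionary keys and function name are hypothetical.

```python
# Assumed item weights; each item is assumed to be scored 0-3
RHI_WEIGHTS = {
    "chronic_inflammatory_infiltrate": 1,
    "lamina_propria_neutrophils": 2,
    "neutrophils_in_epithelium": 3,
    "erosion_or_ulceration": 5,
}


def robarts_histopathology_index(levels: dict) -> int:
    """Weighted sum of the four item levels."""
    total = 0
    for item, weight in RHI_WEIGHTS.items():
        level = levels[item]
        if level not in (0, 1, 2, 3):
            raise ValueError(f"{item} must be scored 0-3, got {level}")
        total += weight * level
    return total


most_severe = {item: 3 for item in RHI_WEIGHTS}
print(robarts_histopathology_index(most_severe))  # upper end of the range
```

Scoring every item at its highest level gives 3 × (1 + 2 + 3 + 5) = 33, and scoring every item at 0 gives 0, matching the range described above.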

Validity and correlation testing

The correlation matrix between histological, clinical and endoscopic measures of disease activity in UC is shown in table 4. The relationship between RHI, VAS, MBS, UCCS and IBDQ was in the predicted order and large differences were not observed between the predicted correlation coefficients and those observed, with the exception that RHI showed better than predicted correlation with IBDQ. These observations indicate that RHI has construct validity.

Table 4

Correlation matrix between histological, clinical and endoscopic measures of disease activity in UC

Definition of RHI remission

Online supplementary table S3 shows the RHI scores that correspond to various accepted definitions of endoscopic and clinical remission. Inspection of these data indicates that an RHI score ≤6 is most commonly observed in these patients.

Responsiveness

The correlation estimates between change scores in RHI and change score in GS and MRS were 0.75 (0.67 to 0.82) and 0.84 (0.79 to 0.88), respectively. Table 5 presents the estimates of correlation between RHI change scores and change scores in VAS, MBS, UCCS, as well as the IBDQ. The results show that RHI is highly responsive with correlation coefficients above 0.70 for change scores in VAS. In general, the observed correlations in change scores between RHI and the endoscopic and clinical instruments did not differ substantially from the a priori predicted correlations, although changes in the RHI score showed a somewhat lower than expected correlation with changes in MBS.

Table 5

Correlation matrix between changes in clinical and endoscopic measures of disease activity in UC

Estimates of effect size for each index based on treatment assignment to MLN02 and predefined criteria of change are summarised in table 6. All of the indices evaluated including RHI showed ‘moderate’ to ‘large’ responsiveness according to accepted conventions for classification. Caution should be taken in making comparisons of responsiveness between RHI, GS and MRS since application of the responsiveness statistical measures (SES and GRS) is predicated upon normally distributed data, which is not apparent for some GS items (figure 2A).

Table 6

Estimates (with 95% CIs) of effect size for VAS, MRS, GS and RHI based on treatment allocation, MBS and Mayo score definitions as criteria for change

Discussion

Evaluation of histopathological disease activity has many potential applications in research and clinical care in UC. However, important limitations are inherent to the evaluative indices currently available to assess severity of inflammation. Specifically, GS was developed as an ordinal classification system and although most of the items have face validity, formal assessment of index reliability, responsiveness and validity has not occurred. Similar deficiencies exist for the MRS.

Furthermore, our initial reliability study identified multiple GS items with poor reliability.9 Consequently, we used a scientifically rigorous process to develop a new measure of histopathological disease activity. Initially, standardised item scoring conventions were developed for all potential items and a reliability study was repeated in which these conventions were applied. We found that implementation of the conventions improved reliability since the point estimates for the inter-rater estimates were substantially higher than those documented in our previous study, for total index scores and the component index items. Analysis of data from one reader (RKP) who participated in both studies showed significant ‘before and after’ improvement in intra-rater reliability for all of the items, following application of the scoring conventions (see online supplementary table S2).

Since all of the items met our minimum reliability criterion of an ICC ≥0.4 (moderate reliability), they were all evaluated as predictors of global histological disease activity using multiple linear regression models. Seven of the most robust items were further assessed for responsiveness to change. ‘Structural-architectural changes’, ‘crypt destruction’ and ‘lamina propria eosinophils’ were eliminated due to poor responsiveness to therapy after 6 weeks of treatment with MLN02. Chronic architectural changes are known to be poorly responsive to acute therapy and may persist for an extended period of time. Eosinophils are poorly understood and are commonly increased in biopsies of patients with quiescent colitis.27 We could not identify a plausible explanation for the lack of responsiveness for crypt destruction. The remaining four items were combined to generate RHI, with weighting of the items according to the model regression coefficients. With one exception where responsiveness was moderate according to GRS (see table 6; absolute change in MBS of at least 1 point), the responsiveness of this new instrument was found to be large according to SES and GRS, using three different definitions of changed/unchanged (treatment of known efficacy, endoscopy by MBS and clinical symptoms of bleeding and loose bowel movements). We also evaluated the responsiveness of GS and MRS. Surprisingly, the point estimates for the responsiveness statistics were higher for these instruments than the RHI. However, comparisons across these indices are highly problematic for several reasons. First, several of the items in GS are not normally distributed (figure 2A). The estimates of responsiveness by SES and GRS are predicated on normally distributed data, which raises uncertainty about the validity of GS responsiveness estimates. Second, the statistical power to detect differences in the responsiveness statistics is limited.

Development of a responsive histopathological index is an aspirational goal for drug development in UC. Responsive indices are more statistically efficient and can therefore detect statistically significant differences between treatment groups with relatively fewer patients. This is particularly relevant for proof of concept and early phase dose-finding studies that typically have small sample sizes. Based on our data, RHI is able to detect a statistically significant difference with approximately 15–21 patients per study group. However, the minimum clinically important difference for RHI is unknown and will require further evaluation in prospective studies.
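
For readers less familiar with responsiveness statistics, the following is a minimal sketch of how SES and GRS are conventionally defined, and of why a larger standardised effect translates into fewer patients per group. It is not the analysis code used in this study and does not reproduce the sample size estimates reported above. The function names, the two-sided 5% significance level, the 80% power and the normal-approximation sample size formula are all assumptions made for illustration, as are the score lists in the usage note.

```python
import math
from statistics import NormalDist, mean, stdev


def standardised_effect_size(baseline, follow_up):
    """SES: mean change in score divided by the SD of baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def guyatt_responsiveness(changes_in_changed, changes_in_unchanged):
    """GRS: mean change among patients classified as changed, divided by
    the SD of change among patients classified as unchanged."""
    return mean(changes_in_changed) / stdev(changes_in_unchanged)


def patients_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided, two-sample comparison
    of means (normal approximation; assumed formula, not the study's)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    # Hypothetical index scores, for illustration only (not study data).
    baseline = [20, 15, 25, 10, 30]
    week_6 = [8, 9, 10, 6, 12]
    stable_changes = [1, -1, 2, -2, 0]
    changed = [after - before for before, after in zip(baseline, week_6)]

    print("SES:", round(standardised_effect_size(baseline, week_6), 2))
    print("GRS:", round(guyatt_responsiveness(changed, stable_changes), 2))
    for d in (0.5, 0.8, 1.0):
        print("effect size", d, "->", patients_per_group(d), "per group")
```

Under these assumptions, the required number of patients falls with the square of the standardised effect size, which is why a more responsive index is attractive for small proof of concept studies. The sign of SES and GRS depends only on whether improvement lowers or raises the score; their magnitude is what is interpreted.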

In the final phase of this study, we confirmed the validity of RHI by assessing construct validity. The predicted relationship between RHI, VAS, MBS, the sum of the MCS rectal bleeding and stool frequency subscores and the IBDQ scores was consistent with the observed relationship. The validity of the new index is also supported by the improvement in scores that was observed following treatment with MLN02, an effective induction agent for UC.

Our study had several methodological strengths. We were able to secure the collaboration of a large international team of experts in pathology, index development and biostatistics. The availability of a large and relatively homogeneous collection of biopsies that were processed in a standardised manner for the express purpose of conducting the study was an important feature. The use of specimens from a clinical trial of an agent of known efficacy also allowed increased confidence in our conclusions regarding the responsiveness of the indices and their component items.

The study had some relevant limitations. First, the participating pathologists were highly experienced and intensively trained on the item scoring conventions; thus, our results may not be generalisable to other settings. Second, the prognostic value of RHI was not assessed. Prospective cohort studies should define this potential application of the index. RHI was also not developed to predict the maintenance of therapeutic response, and other components of GS may be helpful in this setting. Finally, and most importantly, RHI was not independently evaluated in a second data set. This is a critical limitation of the study, and we plan to perform a second study using data from a recently performed phase 2 clinical trial, during which we will perform a full interpretability evaluation, including assessment of the minimum clinically important difference.

Despite these limitations, the development of RHI is an important advance. Multiple new therapies are currently being developed for the treatment of UC. On one hand, this escalation in research activity holds tremendous promise for improved patient care. On the other, it has created a substantial challenge. Recruitment of patients into clinical trials has become increasingly difficult, and therapies are being developed for UC with evolving definitions of treatment response that specify deeper levels of remission than endoscopic healing. As such, histological evaluation of disease activity is likely to become an integral part of the management of UC. We believe that RHI will help enable future clinical research in UC.

Acknowledgments

The authors thank Takeda Pharmaceuticals for the use of the biopsy samples as study material.

Footnotes

  • Contributors Guarantor of the article: BGL. Development of study concept and design: MHM, BGF, WJS, GD'H, RK, DKD, LMS, SN, MKV, KG, MAV, RKP, RR, LWS, GZ, BGL. Study supervision: MHM, BGF, WJS, GD'H, RK, MKV, KG, BGL. Acquisition, analysis and interpretation of data: MHM, BGF, WJS, GD'H, RK, DKD, LMS, MAS, MKV, KG, MAV, RKP, GL, RR, LWS, GZ, BGL. Statistical analysis: LWS, GZ. Drafting of the manuscript: MHM, BGF, LMS, LWS, GZ, BGL. Critical revision of the manuscript for important intellectual content: MHM, BGF, WJS, GD'H, RK, DKD, LMS, SN, MKV, KG, MAV, RKP, GL, RR, LWS, GZ, BGL.

  • Funding Robarts Clinical Trials, University of Western Ontario, London, Ontario, Canada (Central Reading, Database Management).

  • Ethics approval Research Ethics Board at the University of Western Ontario.

  • Competing interests MHM, LMS, CWW, SN, MKV and GZ are employees of Robarts Clinical Trials, which was the research organisation that conducted this study. BGF reports personal fees from Robarts Clinical Trials; grants from AbbVie, Amgen, Astra Zeneca, Bristol Myers Squibb, Janssen/JnJ (Canada, USA and Global), Biotech/Centocor, Roche/Genentech, Millennium, Pfizer, Receptos, Santarus, Sanofi, Tillotts Pharma AG and UCB; and personal fees from AbbVie (Canada, USA and Global), Actogenix, Akros, Albireo, Amgen, Astra Zeneca, Avaxia Biologics, Avir, Axcan, Baxter Healthcare, Biogen Idec, Boehringer-Ingelheim, Bristol Myers Squibb, Calypso Biotech, Celgene, Elan/Biogen, EnGene, Ferring, Roche/Genentech, GiCare, Gilead, Given Imaging, GSK, Ironwood, Janssen/JnJ/Biotech, Kyowa Hakko Kirin, Lexicon, Lilly, Lycera Biotech, Merck, Millennium, Nestle, Novo Nordisk, Pfizer, Prometheus Therapeutics & Diagnostics, Protagonist, Receptos, Salix, Serono, Shire, Sigmoid, Synergy, Takeda (Canada, USA and Global), Teva, TiGenix, Tillotts, UCB, Vertex, VHsquared, Warner-Chilcott, Wyeth, Zealand and Zyngenia, outside the submitted work.
WJS reports personal fees from Robarts Clinical Trials; has received consulting fees from Abbott, ActoGeniX NV, AGI Therapeutics, Alba Therapeutics, Albireo, Alfa Wassermann, Amgen, AM-Pharma BV, Anaphore, Astellas, Athersys, Atlantic Healthcare, Aptalis, BioBalance, Boehringer-Ingelheim, Bristol-Myers Squibb, Celgene, Celek Pharmaceuticals, Cellerix SL, Cerimon Pharmaceuticals, ChemoCentryx, CoMentis, Cosmo Technologies, Coronado Biosciences, Cytokine Pharmasciences, Eagle Pharmaceuticals, EnGene, Eli Lilly, Enteromedics, Exagen Diagnostics, Ferring Pharmaceuticals, Flexio Therapeutics, Funxional Therapeutics, Genzyme, Gilead Sciences, Given Imaging, GSK, Human Genome Sciences, Ironwood Pharmaceuticals, KaloBios Pharmaceuticals, Lexicon Pharmaceuticals, Lycera, Meda Pharmaceuticals, Merck Research Laboratories, Merck Serono, Millennium Pharmaceuticals, Nisshin Kyorin Pharmaceuticals, Novo Nordisk, NPS Pharmaceuticals, Optimer Pharmaceuticals, Orexigen Therapeutics, PDL Biopharma, Pfizer, Procter and Gamble, Prometheus Laboratories, ProtAb, Purgenesis Technologies, Relypsa, Roche, Salient Pharmaceuticals, Salix Pharmaceuticals, Santarus, Schering Plough, Shire Pharmaceuticals, Sigmoid Pharma, Sirtris Pharmaceuticals, SLA Pharma UK, Targacept, Teva Pharmaceuticals, Therakos, Tillotts Pharma AG, TxCell SA, UCB Pharma, Viamet Pharmaceuticals, Vascular Biogenics, Warner Chilcott UK and Wyeth; research grants from Abbott, Bristol Myers Squibb, Genentech, GSK, Janssen, Millennium Pharmaceuticals, Novartis, Pfizer, Procter and Gamble, Shire Pharmaceuticals and UCB Pharma; payments for lectures/speakers bureaux from Abbott, Bristol Myers Squibb and Janssen; and holds stock/stock options in Enteromedics.
GD'H reports grants and personal fees from AbbVie, Ferring, GlaxoSmithKline, Janssen Biologics, Hospira, Takeda, Merck Sharp & Dohme, Prometheus Labs, Robarts Clinical Trials and Tillotts; personal fees from Ablynx, Amakem, Amgen, AM Pharma, Boehringer Ingelheim, Bristol Myers Squibb, Cosmo, Celgene, Celltrion, Galapagos, Covidien, Engene, Medimetrics, Mitsubishi, Mundipharma, Novo Nordisk, Pfizer, Receptos, Salix, Sandoz, Setpoint, Shire, Teva, Tigenix, Topivert, Versant and Vifor; and grants from Photopill and Dr Falk Pharma, outside the submitted work. RK reports fees from Janssen, AbbVie and Takeda, Canada, outside the submitted work and is an employee of Robarts Clinical Trials, which was the research organisation that conducted this study. VF and LWS report receiving personal fees from Robarts Clinical Trials. MAS has nothing to disclose. VJ reports personal fees from Robarts Clinical Trials, has a salary that is partially funded by the UK National Institute for Health Research and has received scientific advisory board fees from AbbVie. RKP reports personal fees from Robarts Clinical Trials, during the conduct of the study. BGL reports personal fees from Takeda, Prometheus Labs, Nestle Health Sciences, AbbVie and Robarts Clinical Trials, outside the submitted work.

  • Provenance and peer review Not commissioned; externally peer reviewed.