## Setting the framework: the difference between reliability and agreement

On a daily basis, clinicians and researchers face the challenge of measuring multiple outcomes. From responses to therapies and assessments of disease activity, to certainty of diagnoses and innovation of cutting-edge diagnostic tools, it is essential within every field that outcome measurement be valid, reproducible and reliable.1 At first glance, validity, reproducibility, reliability and agreement may seem similar; however, there are fundamental differences among these concepts that are important for study design and execution, and for methodology and statistical analyses. Alvan Feinstein recognised this problem and introduced the term clinimetrics, or “the methodologic discipline focusing on measurement issues in clinical medicine”.2 The concept of clinimetrics is not new; on the contrary, it has been considered a subset of psychometrics.3 Terwee, de Vet, Mokkink and Knol, among others,4 developed tools to assess and evaluate health measurement instruments in clinical medicine. For this reason, the backbone of this paper relies on the *CO*nsensus-based *S*tandards for the selection of health *M*easurement *In*struments (COSMIN) initiative.

The COSMIN initiative is a multidisciplinary, international consensus which aimed to create standards to evaluate the methodological quality, design and preferred statistical analyses of a study on measurement properties.5 The initiative primarily focused on Health Related Patient-Reported Outcomes (HR-PRO) due to the complexity of these outcome measurements; however, these concepts still apply to other types of outcomes and will be followed here.4 For reader clarification, HR-PRO is defined by Mokkink *et al*4 as “any aspect of a patient's health status that is directly assessed by the patient, that is, without the interpretation of the patient's responses by a physician or anyone else”; examples include self-administered or computer-administered questionnaires.

The COSMIN taxonomy for the evaluation of a measurement instrument comprises three main quality domains: reliability, validity and responsiveness. As noted in the online supplementary figure S1 and table S1, within each domain there are several properties which help in the evaluation of the domain and relate to each other.4 ,5 While all domains are important in the evaluation of a new instrument, this review aims to provide guidance on the reliability domain because it is frequently encountered in GI research6 and is less well known than diagnostic accuracy studies. For a detailed description of other domains and their evaluation, the reader is referred elsewhere.4 ,5

This paper aims to accomplish three goals: (1) provide a review of key concepts in reliability and agreement; (2) give an overview of statistical tools available for reliability and agreement; and (3) propose a stepwise approach for the analysis/reporting of high quality reliability/agreement studies based on available checklists. To this effect, a search strategy was applied in PubMed to identify relevant papers in the field of reliability and agreement studies so that pertinent references are included in the present work (see online supplementary figure S2). This search was not designed to be a formal systematic review, but to provide an orientation on how references for the current narrative review were selected.

The reviewer proposed to follow a similar scheme as provided by Gisev *et al*,7 but adapting the COSMIN initiative checklist,4 ,8 ,9 the Quality Appraisal of Reliability Studies (QAREL, online supplementary table S2) checklist10 and the Guidelines for Reporting Reliability and Agreement Studies (GRRAS, online supplementary table S3),11 so that the researcher follows the recommended published guidelines (figure 1). Several STEPS are proposed in sequential fashion to guide authors navigating in the development of the paper.

The paper will provide limited mathematical formulations for two main reasons. First, such formulations may deter readers who are not familiar with statistics. Second, current statistical software, assuming it is used correctly, provides straightforward answers. This is why the reviewer will give ‘the punchline’ after each section on what method to use and provide an extensive accompanying bibliography for the keen reader as well as examples of STATA commands (StataCorp. 2011. *Stata Statistical Software: Release V.12*. College Station, Texas, USA: StataCorp LP). Finally, this paper will not assess the methodology of diagnostic accuracy studies (sensitivity, specificity).

## Differences in reliability and agreement: related but not synonymous (STEP 1)

According to de Vet *et al*,1 the measurement property *reliability* answers the question of how well patients can be distinguished from each other despite measurement errors. In contrast, *agreement* shows exactly how close the scores for repeated measurements are, and is related to measurement error, defined as a “systematic and random error of a patient's score that is not attributed to true changes in the construct (or concept) to be measured”.8 Reliability and agreement have often been confused and used interchangeably; however, they are distinguishable in key ways. Kottner and Streiner point out that “a clear distinction between the conceptual meanings of agreement and reliability is necessary to select appropriate statistical approaches and enable adequate interpretations”.12 Whereas “agreement points to the question, whether diagnoses, scores, or judgments are identical or similar or the degree to which they differ”, “reliability coefficients provide information about the ability of the scores to distinguish between subjects”.12 An excellent graphical description of the differences between reliability and agreement can be found in de Vet *et al*.1

*Reproducibility* is considered an umbrella term that encompasses reliability and agreement.1 On the other hand, the Standards for Reporting of Diagnostic Accuracy (STARD) define *diagnostic accuracy* as “the amount of agreement between the results from the index test and those from the reference standard”.13 As demonstrated in ‘final concepts from the COSMIN initiative: validity, responsiveness and interpretability’, this concept refers to criterion validity and will not be assessed as part of reliability/agreement studies discussed here.

One should be cautious of the numerous papers providing mixed terminology regarding which parameters belong to reliability and which to agreement.7 ,14 ,15 In order to provide a unified approach, this review follows the published guidelines from experts in the field and recommends that authors adopt this taxonomy accordingly (see online supplementary figure S1).8–11 ,16

In sum, reliability and agreement are two separate concepts: reliability describes how well the instrument detects real variability between subjects, whereas agreement describes whether the instrument yields consistent results over repeated measures. It is at this point that the researcher needs to define the objective of the study and decide on the analytical pathway to pursue (figure 1).

## What is the type of outcome variable and what is the most appropriate statistical parameter? (STEP 2–6)

### Parameter selection is based on objective and the nature of the measurement (STEP 2)

Depending on the type of measurement variable, there are different types of reliability and agreement parameters (table 1). At this stage, the reader already knows whether the instrument yields a number or a category (eg, blood pressure or a severity classification for pathology results, respectively) and may be ready to apply table 1; however, it is important to consider how many instruments (or raters) there are (STEP 3), as some of these statistics apply only to two raters and two measurements (eg, Cohen's κ). One should also consider whether the tests (or observers) are independent from each other (STEP 4). For each test, the text indicates whether it can accommodate multiple readers (marked as STEP 3; also noted in table 2 under “3 or more”); the focus then shifts to the often-ignored paired/dependent data.

### Number of raters/tests involved and paired data: what statistical tests should be used and why? (STEP 4)

Paired, correlated, clustered, matched data and repeated measures are all terms with the same statistical implication: unlike independent data, they involve a correlation between measurements, and this must be taken into account in the analyses to avoid analytical flaws. Examples of such paired data include observations that are made on the same patient simultaneously or over time (repeated measures, pre-post test), on the same organ (eg, colon, eyes), where there has been some type of matching by design (eg, matched case-control studies), or where there has been some clustering in that each observation belongs to the same patient or is part of a clustered randomised clinical trial.

In addition, common statistical tests are not appropriate for such data, and the dependency needs to be accounted for. In order to guide the reader, table 2 summarises common statistical tests used in any type of analyses for paired data.17 ,18 Should similar considerations be applied when running univariate and/or multivariate analyses? The answer is ‘yes’. Generalised estimating equations,19 for instance, are one tool used to account for this type of correlated data, and the reader is encouraged to apply them to correlated data and multivariate analyses.20 ,21 They are not, however, the only alternative, as other regression techniques adjusted for cluster effect have been developed.22

Such examples include the commonly used per-biopsy and per-patient analyses. The use of clustered methods such as the one previously described will depend on the research question. In other words, when the unit of interest is the patient and multiple biopsies are taken, or when some other type of clustering (techniques, hospitals) is of interest, there is dependency or correlation within clusters, and clustered/paired statistical analyses should be used. If not, the researcher will need to justify the choice of unclustered/unpaired analyses.

### Reliability parameters for categorical (nominal, dichotomous and ordinal) measurement variables

#### κ index and extensions

The κ index is the most commonly used method to assess reliability in categorical instruments. It is a family of indices23 ,24 which relates “the amount of agreement that is observed beyond chance agreement *to* the amount of agreement that can maximally be reached beyond chance agreement”.25 The underlying assumption when calculating chance agreement is that rater reports are statistically independent; that is, the expected cell proportions can be calculated by multiplying the corresponding marginal proportions.26 The term ‘measure of *agreement*’ may elicit confusion, but the general structure of the formula for κ is indeed derived from the general equation of reliability, rightly making it a measurement of reliability.27 Under certain circumstances (ie, quadratic weights on weighted κ), it is equivalent to the two-way mixed, single-measures intraclass correlation coefficient (ICC),28 which will be further explained in ‘ICC, a layman's terms definition and method of selection’ and figure 2. If the study has paired data, options are available for κ as well (STEP 4).29 ,30
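As a reminder of the structure behind the quoted definition, κ is commonly written as below. The symbols are introduced here for illustration and are not taken from the cited sources: $p_o$ is the observed proportion of agreement and $p_e$ is the proportion of agreement expected by chance under independence of the raters.

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

The numerator is the agreement observed beyond chance, and the denominator is the maximum agreement attainable beyond chance, so κ equals 1 for perfect agreement and 0 when agreement is no better than chance.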

##### Statistical inference, power calculation and software of κ indices

The κ indices can also be used for statistical inference (significance testing, SE estimation and CI calculation),31 and sample size formulae exist for such purposes.31–33 STATA, for example, provides an extensive list of statistical packages for such purposes, including calculation of CIs (kapci), sample size/power calculation (sskdlg, sskapp) for incomplete designs (such as missing data), and multiple raters (STEP 3; kappa2). When estimating power, the researcher should bear in mind that, as with any other variable in medicine, dichotomising a continuous variable may entail a substantial loss of power and therefore warrants caution.34

##### Interpretation and caveats of κ indices

Across the literature, three major interpretations, in which a value of 0.6 or beyond is considered acceptable or fair, have been consistently used (table 3).28 It is also possible to obtain negative κ indices, for example when one of the raters accidentally reverses the rating scale.25 ,33 While they have been widely used, κ indices are far from perfect and face several criticisms. *First*, the criteria to evaluate κ (table 3) are arbitrary. *Second*, κ is influenced by the marginal distribution of the ratings (‘the bottom rows and the end columns’ of the cross-tabulation) and the prevalence of the disease. Several methods have been proposed to correct for bias (unequal marginal distributions) and prevalence; although such adjusted κs may be reported, some authors consider them inaccurate.35 *Finally*, κ, as a summary statistic, can lose some information when examining the discrepancy between observers, and this loss is accentuated when the researcher converts a naturally continuous variable into a binary one.34 For example, in a hypothetical pathology grading of no cancer, preneoplastic and cancer, it would be important to see what the discordance was between a ‘no cancer’ and a ‘preneoplastic’ diagnosis. Consequently, it is important to provide raw data for readers to review as an online supplementary table in order to show the rates of agreement and the reasons for discrepancies.
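The influence of prevalence on κ can be illustrated with a minimal sketch. The two 2×2 tables below are hypothetical counts invented for illustration (rows are rater A, columns are rater B), and the function name `cohen_kappa` is ours. Both tables have the same observed agreement, yet κ differs markedly because the marginal distributions differ.

```python
def cohen_kappa(table):
    """Return (observed agreement, Cohen's kappa) for a square cross-tabulation.

    Rows are the categories assigned by rater A, columns those of rater B.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: proportion of subjects on the main diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_exp = sum(r * c for r, c in zip(row_p, col_p))
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Hypothetical data: balanced prevalence versus a highly prevalent positive rating.
balanced = [[40, 10], [5, 45]]
skewed = [[80, 10], [5, 5]]

agree_bal, kappa_bal = cohen_kappa(balanced)
agree_skw, kappa_skw = cohen_kappa(skewed)

print(f"balanced: agreement={agree_bal:.2f}, kappa={kappa_bal:.2f}")
print(f"skewed:   agreement={agree_skw:.2f}, kappa={kappa_skw:.2f}")
```

In this made-up example, both tables show 85% raw agreement, but κ drops when almost all ratings fall in one category. This is one reason to report the raw cross-tabulation alongside κ.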

#### Other alternatives for categorical reliability: weighted κs, ICCs and modelling patterns of agreement

##### Weighted κs and matrix of weighted κs

A weighted κ is an extension of the regular κ in which the researcher provides arbitrary weights to penalise the degree of disagreement between observers for multiple categories. Several weighting systems can be used including linear, quadratic and Cicchetti-Allison weights33 (STATA command kapwgt).
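A minimal sketch of how weights change κ for an ordinal scale follows. It assumes the common agreement-weight definitions, linear $1 - |i-j|/(k-1)$ and quadratic $1 - (i-j)^2/(k-1)^2$; the 3×3 table is hypothetical and the function name `weighted_kappa` is ours, not a STATA or library API.

```python
def weighted_kappa(table, scheme="none"):
    """Weighted kappa for a square table of ordered categories.

    scheme: "none" (unweighted), "linear" or "quadratic" agreement weights.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def weight(i, j):
        # Agreement weight: 1 on the diagonal, smaller for distant categories.
        distance = abs(i - j) / (k - 1)
        if scheme == "linear":
            return 1 - distance
        if scheme == "quadratic":
            return 1 - distance ** 2
        if scheme == "none":
            return 1.0 if i == j else 0.0
        raise ValueError(f"unknown weighting scheme: {scheme}")

    cells = [(i, j) for i in range(k) for j in range(k)]
    p_obs = sum(weight(i, j) * table[i][j] / n for i, j in cells)
    p_exp = sum(weight(i, j) * row_p[i] * col_p[j] for i, j in cells)
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical ratings on a three-level ordinal scale (rows: rater A, columns: rater B).
ratings = [[20, 5, 0],
           [5, 20, 5],
           [0, 5, 40]]

kappa_unweighted = weighted_kappa(ratings, "none")
kappa_linear = weighted_kappa(ratings, "linear")
kappa_quadratic = weighted_kappa(ratings, "quadratic")
print(kappa_unweighted, kappa_linear, kappa_quadratic)
```

Because all disagreements in this invented table are between adjacent categories, giving partial credit to near misses raises κ, and quadratic weights raise it more than linear weights.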

The matrix of weighted κ is another alternative method to evaluate agreement in ordinal variables and takes into account κ indices and correlation coefficients.36 For further review, the reader is referred to Roberts.31 ,37 This approach, however, has been rarely used in medicine.

##### ICC for ordinal variables: perhaps better than weighted κ

Kraemer,38 Streiner and Norman28 favour κ for two-by-two agreement assessment, but recommend using the ICC coefficient when treating anything other than dichotomous variables with more than two observers. This is so, because ICC, as noted per Streiner and Norman,28 provides the following: “(a) the ability to isolate factors affecting reliability; (b) flexibility, in that it is very simple to analyse reliability studies with more than two observers and more than two response options (STEP 3); (c) we can include or exclude systematic differences between raters (bias) as part of the error term; (d) it can handle missing data; (e) it is able to simulate the effect on reliability of increasing or decreasing the number of raters; (f) it is far easier to let the computer do the calculations; and (g) it provides a unifying framework that ties together different ways of measuring inter-rater agreement—all of the designs are just variations on the theme of the ICC”.28 A recent paper published in *Gut* illustrates how to apply the ICC methodology to examine agreement between ordinal measurements and can be followed as an example.6

##### Modelling patterns of reliability: log-linear models, latent class models and beyond

Advanced techniques are available for researchers who are interested in providing more information than a summary statistic. If, for instance, an investigator wanted to explore patterns of reliability by different characteristics or compare reliability adjusting for different covariates, there is no gold standard.39 Instead, a researcher may apply latent class models, in which reliability (or agreement, as used in the references) is considered an unobservable categorical truth (or latent class), and relate this latent class to the observed data.40–43 A good tutorial can be found on the website of Dr Uebersax44 and, for a more practical application, in the papers by Christensen *et al*45 and Dunn,16 which may be among the few studies in gastroenterology using this technique. Other approaches are log-linear models, which examine the association of cell counts on the level of categories (eg, observers) and model interaction patterns and associations.26 ,46 ,47 As with latent class models, there are only a handful of examples using log-linear models in GI reliability.16 ,48 Again, STATA provides commands for latent class models (gllamm) and log-linear models (ipf).

#### The punchline for categorical variables: κ with caution, perhaps ICC

Whenever researchers are presented with categorical variables, the use of κ indices is acceptable, provided that the proper extension is used and that the problems associated with these indices are correctly identified and acknowledged. Although not part of the COSMIN evaluation tool, the use of ICC is also a good alternative. Finally, while latent class and log-linear models provide a unique opportunity for research in the topic, they are less explored in the field of reliability studies.

### Reliability parameters for continuous measurement variables

#### Interclass versus intraclass: what is the difference?

The *inter*class correlation coefficient measures the relationship between two different classes or types of variables (eg, Pearson's r), which by definition do not share the same metric or variance. In contrast, when research seeks to evaluate the relationship between variables of a common class that share the same metric and variance, *intra*class correlation coefficients (ICCs) are used.

#### ICC, a layman's terms definition and method of selection

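The general equation of the reliability parameter (eg, ICC) is as follows:

$$\text{Reliability} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{error}}$$

where $\sigma^{2}_{p}$ is the variance due to ‘true’ differences between patients and $\sigma^{2}_{error}$ is the variance due to measurement error.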

This equation demonstrates that variability (ie, variance) between study objects is related to the measurement error or, as defined in the COSMIN initiative, reliability is “the proportion of the total variance in the measurements which is due to ‘true’ differences between patients”.4

##### Selection of the ICC

At least six types of ICCs have been described; they are derived from different methodologies and therefore lead to different inferences.28 ,49 ,50 Thus, when authors report ‘an ICC,’ a thorough explanation should be provided; however, the question remains which ICC should be used and why. Fortunately, there are guidelines to assist researchers (figure 2).49 ,51 No matter which ICC is chosen, it is important to provide readers with an online supplementary table of all variance components that can be used for interpretation and correction of measurement error (eg, table 2 in the paper by de Vet *et al*,1 or table 5.10 in chapter 5 of the book by de Vet *et al*52). Furthermore, it is also important to provide data to support the assumptions of analysis of variance (ANOVA), as noted by McGraw and Wong.49
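To make the choice between ICC types concrete, here is a minimal sketch of two single-measure ICCs computed from two-way ANOVA mean squares, following the widely used consistency and absolute-agreement definitions. The data are hypothetical, the function name `icc_two_way` is ours, and the sketch does not cover every ICC form or check the ANOVA assumptions.

```python
def icc_two_way(scores):
    """Single-measure consistency and absolute-agreement ICCs.

    scores: one row per subject, one column per rater (complete, crossed design).
    Returns (icc_consistency, icc_agreement).
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Mean squares for subjects, raters and residual error.
    ms_subjects = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_raters = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_error = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Consistency ignores systematic differences between raters.
    consistency = (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)
    # Absolute agreement counts rater differences as part of the error.
    agreement = (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
    return consistency, agreement


# Hypothetical data: rater B always scores exactly 2 points higher than rater A.
scores = [[1, 3], [2, 4], [3, 5], [4, 6], [5, 7]]
icc_c, icc_a = icc_two_way(scores)
print(f"consistency ICC = {icc_c:.3f}, agreement ICC = {icc_a:.3f}")
```

In this invented example, the raters rank subjects identically, so the consistency ICC is 1, while the constant 2-point offset lowers the absolute-agreement ICC to about 0.56. The same data can therefore support different conclusions depending on the ICC chosen.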

##### Statistical inference, power calculation and software

As discussed, several authors have shown different techniques of obtaining statistical inference for different ICCs.49 For sample size calculation, several formulae are available: one relies on hypothesis testing;32 ,53 another on the width of the ICC interval;54 ,55 and a third on the assurance probability that the ICC estimate reaches a certain value (eg, 90% assurance that the ICC would be 0.95).56 STATA provides support for statistical inference (icc, including some forms of the ICCs in figure 2) and sample size calculation (sampicc).

##### Interpretation and caveats of the ICC

The value of the ICC ranges from 0 (not reliable) to 1 (perfect reliability). Traditionally, ICC values less than 0.40 are considered poor, values between 0.40 and 0.59 fair, values between 0.60 and 0.74 good, and values between 0.75 and 1.00 excellent.50 ,57 ,58 However, careful observation of the general reliability equation shows that the ICC (and reliability parameters in general) depends on the relationship between person/observation variation and measurement error. For example, if the variation between study subjects is large compared with the measurement error, the instrument will show high reliability, that is, it will accurately distinguish between participants. On the other hand, if there is little variation between patients, even a small measurement error can make the instrument unreliable. As a consequence, reliability depends on the variability of the sample from which the index was derived and is influenced by the measurement error to some extent.
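The dependence on sample heterogeneity can be shown with a short sketch. It assumes the usual variance-ratio form of reliability, between-subject variance divided by the sum of between-subject and error variance; the variance values are hypothetical and the function name `reliability` is ours.

```python
def reliability(var_between, var_error):
    """Proportion of total variance due to true differences between subjects."""
    return var_between / (var_between + var_error)


# Same instrument (same error variance) applied to two hypothetical samples.
var_error = 4.0
heterogeneous = reliability(var_between=36.0, var_error=var_error)
homogeneous = reliability(var_between=1.0, var_error=var_error)

print(f"heterogeneous sample: {heterogeneous:.2f}")
print(f"homogeneous sample:   {homogeneous:.2f}")
```

With identical measurement error, the instrument appears ‘excellent’ in the heterogeneous sample and ‘poor’ in the homogeneous one. This is why an ICC should be interpreted together with the variability of the sample in which it was estimated.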

##### Extra notes on ICCs: other extensions and paired data

Haber *et al*14 and Muller *et al*59 provide other types of coefficients (parametric and non-parametric) for use when any of the underlying ICC assumptions are not met (eg, concordance correlation coefficient (CCC), Rothery). Chen and Barnhart60 provide guidance on CCC and ICC. The CCC is similar to the ICC except that it does not rely on ANOVA assumptions. This could be seen as an advantage, but the CCC is less developed than ICC-based methodology. In the case of repeated measures on the same subjects over time (STEP 4), ICC can be used to take such paired design into account by modelling the interaction of time with observation as well as of time with observer.

#### The punchline for continuous measurements: ICC is good but needs to specify model assumptions

Since the inferences of ICC heavily depend on the method of calculation (figure 2), researchers should follow an explicit rationale to select the best index based on their assumptions. It is also important to take into account that an ICC is applicable to other populations only if the underlying variance is the same. Because of this limited comparability and the potential uses for calibration (see later sections), it is important to provide an online supplementary table that describes variance for observations and observers.

### Generalisability and decision studies (G and D studies): beyond the usual report of ICCs and κs (STEP 7)

Obtaining the correct statistical parameters may be the end of an agreement study for some researchers, but not for all. For instance, Streiner and Norman make a distinction between the statistician's approach and the psychometrician's (or, in this review, the clinician's).61 A statistical mind would be interested in determining the effects (main effects or interactions) that are *statistically* significant, as well as the overall (treatment) effect, with any variation considered ‘noise’ and treated as ‘error.’ In contrast, the psychometrician (clinician) would be interested in measuring variance components and (hopefully) identifying and characterising the ‘error’ in order to understand what differentiates subjects (ie, reliability). Thus, it is quite possible that readers may wish to pursue this approach and find themselves becoming ‘G theorists’.61 In this setting of Generalisability and Decision studies (G and D studies), researchers might be interested in some (or all) of the questions described below:52 ,61

What is the reliability of the measurements if we compare for all objects, one measurement by one rater with another measurement by another rater?

What is the reliability of the measurements if we compare for all objects, the measurements performed by the same rater (ie, intrarater reliability)?

What is the reliability of the measurements if we compare for all objects, the measurements performed by different raters (ie, inter-rater reliability)?

Which strategy is to be recommended for increasing the reliability of the measurement—using the average of more measurements of the objects by rater, or using the average of one measurement by raters?

As introduced earlier, this is the approach used in generalisability theory (in contrast with classical test theory), where factors affecting reliability are examined *simultaneously*. This theory has several twists in methodology, uses ‘facets’ instead of ‘factors,’ and takes into account whether the design is crossed (all raters evaluate all objects) or nested (not all raters evaluate all objects). Overall, the methodology allows a three-way ANOVA and is summarised in figure 3. For further details of the underlying methodology and formulae, and application of generalisability theory to reliability studies, see Streiner and Norman.61
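The last question above (whether averaging over more raters improves reliability) can be sketched for the simplest case. This assumes a single-facet crossed design (subjects × raters) and the standard absolute-agreement coefficient for the mean of several raters; the variance components are hypothetical, the function name `averaged_reliability` is ours, and the fuller designs in figure 3 would require additional facets.

```python
def averaged_reliability(var_subjects, var_raters, var_residual, n_raters):
    """Absolute-agreement reliability of the mean score over n_raters raters.

    Error variance (rater plus residual) shrinks in proportion to the
    number of raters whose scores are averaged.
    """
    error = (var_raters + var_residual) / n_raters
    return var_subjects / (var_subjects + error)


# Hypothetical variance components from a G study.
components = dict(var_subjects=9.0, var_raters=1.0, var_residual=2.0)

one_rater = averaged_reliability(n_raters=1, **components)
three_raters = averaged_reliability(n_raters=3, **components)
print(f"1 rater: {one_rater:.2f}, mean of 3 raters: {three_raters:.2f}")
```

With these invented components, moving from one rater to the mean of three raters raises reliability from 0.75 to 0.90. This is the type of projection a D study provides before committing resources to a measurement protocol.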

### Agreement: related but not equal to reliability

As discussed earlier, agreement shows exactly how close scores for repeated measurements are, and is related to measurement error,1 which shall be minimised if possible, when the aim of the instrument is to evaluate patients over time. It is now the time to review different approaches to assess agreement for categorical and continuous measurements.

#### Agreement for categorical variables

According to de Vet *et al*25 there are really “no parameters of measurement error for categorical variables”, since only classification and ordering are important. Additionally, since there are no units, there are no clear parameters of measurement error. Nonetheless, percentage of agreement and proportions of specific agreement can be calculated, which will be discussed in the next section on the setting of continuous variables.

#### Agreement for continuous variables

##### Coefficient of variation

The coefficient of variation expresses the SD as a percentage of the mean (SD divided by the mean, multiplied by 100%). It is calculated for each pair of observations, and a zero value would represent perfect agreement. One major advantage of the coefficient of variation is that it is unitless, and can consequently be used for variable comparison and in testing models. Its interpretation is meaningful only when the scale takes positive values (for instance, it is not meaningful for scales that can be negative, such as Fahrenheit).62 Available commands are in STATA (estat cv).
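As a minimal illustration, the coefficient of variation for a pair of observations can be computed as below. The readings are hypothetical and the function name `cv_percent` is ours; the sample SD (n − 1 denominator) is an assumption, since the text does not specify which SD is meant.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation: sample SD as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)


# Hypothetical paired readings of the same subject by two observers.
identical_pair = [50, 50]
discordant_pair = [100, 110]

cv_identical = cv_percent(identical_pair)
cv_discordant = cv_percent(discordant_pair)
print(f"identical pair: {cv_identical:.1f}%, discordant pair: {cv_discordant:.1f}%")
```

Identical readings give a coefficient of variation of zero, whereas the discordant pair yields a value of just under 7%.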

##### Correlation coefficients: why they are not in table 1?

In the past, correlation coefficients (such as Spearman or Pearson) had been used to provide a measure of agreement; however, as Bland and Altman point out, the correlation coefficient measures the strength of the relation between two variables, not the agreement63 (table 1). Correlation coefficients measure how closely the values lie along a straight line, ranging from −1 to +1 (perfect negative and perfect positive correlation, respectively), so two raters can be perfectly correlated while one consistently scores higher than the other. Correlation coefficients also depend on the accuracy of the assessed variable and on the variation in the study sample (ie, they are population dependent).64 Pearson's correlation coefficient assumes that the variable is normally distributed (in contrast to its non-parametric counterpart, Spearman's rank correlation). Other misuses of correlation coefficients described by Altman include ignoring repeated measures over time, restricted sampling of individuals and mixing samples of different characteristics.65

Commands in STATA include the conventional correlation (correlate), variable-adjusted correlation (pcorr), CIs (corrcii), and power/sample size (power onecorrelation).

##### Proportion of positive agreement and specific agreement

The proportions of positive agreement and specific agreement provide raw estimation of important descriptive data, and these estimations convey the overall agreement between raters and for each particular category (positive or negative agreement). Agreement studies as those described here are used less frequently but should be considered a standard in the reporting of basic descriptive statistics. An excellent resource for agreement studies can be found on Uebersax's website66 and sampling distributions for positive agreement and negative agreement have been described by Samsa.67 It is, however, in the context of criterion validity (ie, diagnostic accuracy studies), when these proportions gain importance as they are required by the Food and Drug Administration to be part of the integral report.68
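A minimal sketch of these proportions for two raters and a dichotomous rating follows. It assumes the commonly used definitions, positive agreement $2a/(2a+b+c)$ and negative agreement $2d/(2d+b+c)$, where $a$ and $d$ are the concordant positive and negative counts and $b$ and $c$ the discordant counts. The counts are hypothetical and the function name `specific_agreement` is ours.

```python
def specific_agreement(a, b, c, d):
    """Overall, positive and negative agreement from a 2x2 table.

    a: both raters positive; d: both raters negative;
    b, c: the two discordant cells.
    """
    n = a + b + c + d
    overall = (a + d) / n
    positive = 2 * a / (2 * a + b + c)
    negative = 2 * d / (2 * d + b + c)
    return overall, positive, negative


# Hypothetical counts for two raters classifying 100 biopsies.
overall, positive, negative = specific_agreement(a=40, b=10, c=5, d=45)
print(f"overall={overall:.3f}, positive={positive:.3f}, negative={negative:.3f}")
```

Reporting the positive and negative proportions separately shows whether raters agree equally well on the presence and on the absence of a finding, which the overall percentage alone can hide.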

##### Standard error measurement and its association with ICC

A thorough summary of the relationship between measures of reliability and agreement is provided by de Vet *et al*.1 They define measurement error by the SE of measurement (SEM), which equals the square root of the error variance, $SEM = \sqrt{\sigma^{2}_{error}}$, where $\sigma^{2}_{error}$ can include a systematic error between observers/raters ($SEM_{agreement} = \sqrt{\sigma^{2}_{o} + \sigma^{2}_{residual}}$) or not ($SEM_{consistency} = \sqrt{\sigma^{2}_{residual}}$). De Vet *et al*1 provide a more in-depth explanation of the formulae, which demonstrates that since ICC uses different types of variances ($\sigma^{2}$), as shown in figure 2, it is possible to derive SEM from ICC. One note of caution is to avoid using the formula $SEM = SD \times \sqrt{1 - ICC}$ uncritically. Researchers have used ICCs obtained from different studies with different subject variability and, as readers already know, ICC depends on the heterogeneity of the study participants; thus, the formula is valid only when heterogeneity is similar.1 ,69

##### Bland-Altman plot: excellent resource to summarise measurement error

In 1986, Douglas Altman and J Martin Bland proposed a different method to assess systematic error between observers. They plotted the difference between the scores of the two observers against the mean of the two scores. For instance, a recent paper by Sedwick70 showed a Bland-Altman plot which investigated the agreement between primary care and daytime ambulatory monitoring of blood pressure. Online supplementary figure S3 shows the mean systematic difference between both readings (d, green line), and two other lines (red, dotted), which represent the ‘limits of agreement’ as estimated by d ± 1.96 × SD of the differences. Further work by de Vet *et al*71 shows an association between the limits of agreement and SEM using a 95% CI, the minimal or smallest detectable change (SDC): $SDC = 1.96 \times \sqrt{2} \times SEM$.

The limit of agreement represents the range within which most differences between measurements by the two methods will lie.72 In other words, the range of the CI will show an indirect magnitude of the measurement error; consequently, if a difference between two observations is noted beyond the interval, it is most likely that the differences are real as opposed to a measurement error. As noted in the online supplementary figure S3, the graph is quite powerful in illustrating the degree of agreement between two raters, allowing readers to see the relationship between extreme values and the degree of disagreement. As noted per Bland and Altman, “the decision about what is acceptable agreement is a clinical one; statistics alone cannot answer the question.” Nevertheless, given the data, the graph shows what the expected systematic and random errors are, making it possible to see the smallest detectable change. In addition, it would be possible to correct using the average difference in the measurement of several techniques.
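The quantities behind a Bland-Altman plot can be computed with a minimal sketch such as the one below. The paired readings are hypothetical, the function name `limits_of_agreement` is ours, and the sketch assumes approximately normally distributed differences with constant bias across the measurement range.

```python
from statistics import mean, stdev


def limits_of_agreement(method_a, method_b):
    """Return (mean difference, lower limit, upper limit) for paired readings."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    d = mean(diffs)    # systematic difference (bias) between methods
    sd = stdev(diffs)  # sample SD of the differences
    return d, d - 1.96 * sd, d + 1.96 * sd


# Hypothetical paired measurements of the same four subjects by two methods.
method_a = [10, 12, 14, 16]
method_b = [11, 12, 15, 18]

bias, lower, upper = limits_of_agreement(method_a, method_b)

# Points for the plot: x is the mean of each pair, y is the difference.
points = [((a + b) / 2, a - b) for a, b in zip(method_a, method_b)]
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

Plotting `points` with horizontal lines at `bias`, `lower` and `upper` reproduces the layout described for the online supplementary figure. Whether the resulting interval is acceptable remains a clinical judgement.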

The plot, however, has several *limitations* that need to be taken into account. First, it is assumed that the differences between measures are normally distributed (this can be checked using the usual histogram or normal plot). Second, it is assumed that the sample subjects are representative of the population to which the study aims to extrapolate. Third, the plot can only be used for pairwise observations, so whenever more than two techniques are present, each pairwise Bland-Altman plot needs to be reported separately. Finally, it is assumed that the systematic error (d) is constant across different values. The latter assumption is often criticised.73 Nonetheless, Bland and Altman have proposed several statistical techniques to deal with such criticisms, including log-transformation of the variable74 or running regression techniques to account for this non-uniformity across values.72 STATA also provides commands to produce the Bland-Altman plot.

##### Repeated measures for measurement errors in continuous variables

In their paper, Bland and Altman describe several methods to account for repeated measurements on the same subjects with similar and different numbers of replications based on variance methodology. Readers are referred to that paper to learn more about implementation.72

#### A word on calibration studies and statistical techniques on measurement error models

Calibration studies are commonly used in nutritional epidemiology, where the gold standard for reporting food intake is compared against other, less preferable questionnaires that are easy to complete and to administer in large numbers. Paraphrasing Carroll *et al*,75 calibration studies can be used to: (1) adjust relative risks when using a suboptimal method of ascertainment; (2) estimate the sample size required in the main study, since measurement error can influence the power; (3) estimate the correlation between intake from the less desired method and the gold (or preferred) standard; and (4) estimate the slope of the regression of the less preferred method on the gold standard method, a variable that is important in assessing the patterns of bias. Carroll *et al*75 offer a quick, step-by-step approach for power estimation, which requires knowing the correlations between each parameter of interest in both instruments, the total number of cases of disease needed to obtain a desired power to detect the outcome of interest, and the slope of the regression of the intake of one method compared with the gold standard (or preferred method). Another paper by the same authors also shows that, in order to improve calibration in this setting of nutritional epidemiology, increasing the number of subjects as compared with the number of repeated questionnaires yields the same results for calibration purposes.76 Calibration studies open the door to many possibilities in the field of measurement error in GI research.
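Items (3) and (4) of that list are ordinary descriptive quantities and can be sketched in a few lines. The example below is illustrative only: the function name, the returned keys and the intake values are assumptions, and it does not implement the power calculations of Carroll *et al*.

```python
import numpy as np


def calibration_summary(reference, questionnaire):
    """Correlation between the two methods, plus the slope and intercept of
    the regression of the questionnaire on the reference (preferred) method.
    """
    x = np.asarray(reference, dtype=float)
    y = np.asarray(questionnaire, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 3:
        raise ValueError("need two paired 1-D samples with at least 3 subjects")

    slope, intercept = np.polyfit(x, y, 1)   # questionnaire ~ reference
    correlation = np.corrcoef(x, y)[0, 1]    # Pearson correlation
    return {"slope": slope, "intercept": intercept, "correlation": correlation}


# Hypothetical daily fibre intake (g) from a reference method and a
# short questionnaire completed by the same eight subjects.
reference = [12.0, 18.5, 22.0, 25.5, 30.0, 15.0, 27.0, 20.0]
questionnaire = [14.0, 17.0, 20.5, 21.0, 26.0, 15.5, 22.5, 19.0]

print(calibration_summary(reference, questionnaire))
```

A slope well below 1 would indicate that the questionnaire compresses the range of intakes relative to the reference method, one of the patterns of bias the calibration study is meant to reveal.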

ICCs can also be used in calibration for regression dilution bias, a phenomenon common in biological continuous measurement and defined as “the attenuation in a regression coefficient that occurred when a single measured value of a covariate factor (eg, blood pressure) was used instead of the usual or average value over a period of time”.77 Knuiman *et al*78 provide an excellent resource which shows how using the ICC to adjust coefficients can correct for this phenomenon.
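In its simplest form, the correction divides the observed coefficient by the ICC of the covariate. The sketch below assumes a single covariate measured with classical random error in a simple linear regression; the function name and the numbers are hypothetical, and multivariable models need the more general methods in the cited resource.

```python
def correct_regression_dilution(beta_observed, icc):
    """Return the coefficient corrected for regression dilution.

    Assumes one covariate with classical (random, independent) measurement
    error, so the observed slope is attenuated by a factor equal to the ICC.
    """
    if not 0.0 < icc <= 1.0:
        raise ValueError("icc must lie in the interval (0, 1]")
    return beta_observed / icc


# Hypothetical example: observed slope of 0.30 per unit of a covariate
# whose single measurements have an ICC of 0.60.
beta_corrected = correct_regression_dilution(0.30, 0.60)
print(f"corrected slope = {beta_corrected:.2f}")
```

The lower the ICC of the single measurement, the larger the upward adjustment, which is why a poorly reproducible covariate can substantially understate an association.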

Finally, there are regression models that can help to correct for measurement error when some of the variables are measured with error. Dr Guolo79 has recently summarised a set of models which are less stringent on model assumptions (‘robust’) and has provided guidance on when to use them. These models include flexible parametric and semiparametric modelling, quasi-likelihood, estimating equations, empirical likelihood and the simulation-extrapolation method, among others. Readers are encouraged to consult a statistician when selecting a particular model. STATA, for instance, has several commands which can help in the evaluation of measurement errors.80

#### The punchline for measurement error: SEM and Bland-Altman

As noted above, measurement error needs to be explored with the SE of measurement (SEM), which is also related to a measure of agreement. It is for this reason that Bland and Altman72 advocate providing reliability and agreement data in the same study, because doing so helps readers to understand the reproducibility of the measurement. If multiple raters are present, pairwise Bland-Altman plots can be provided.

## Assessing biases in reliability and agreement studies (STEP 8)

As a general rule, most of the reporting guidelines recommend addressing potential biases,11 an often forgotten but integral part of a high-quality report. In epidemiological terms, bias is defined as a systematic error in the design or conduct of a study. It can appear at any stage of the study, from the method of selection of participants (selection bias) to the procedures for gathering relevant exposure/disease information (information bias). The presence of biases will impact the internal validity of the study and, as a consequence, make inferences invalid.

*Selection bias* is present when the association between the exposure and disease is different for those who participate and those who should be theoretically eligible for the study, including those who do not participate.81 *Information bias* occurs when participants are systematically measured with error (as compared with random error). Unfortunately, research is never bias-free, and authors are always expected to discuss potential biases (information and selection bias). Delgado-Rodriguez82 provides excellent guidance on bias, and investigators are encouraged to consider and report both types of bias, to assess whether the current research is subject to them, and to state what measures were implemented to control them.

Most of the literature addresses biases in diagnostic accuracy studies, but the same guidance can be applied to reliability and agreement studies. For example, verification bias, “when the execution of the gold standard is influenced by the results of the assessed test, typically the reference test is less frequently performed when the test result is negative”,82 can be corrected as described by Zhou.83 Similarly, for spectrum bias,82 “when researchers included only ‘clear’ or ‘definite’ cases, not representing the whole spectrum of disease presentation, and/or ‘clear’ or healthy control subjects, not representing the conditions in which a differential diagnosis should be carried out”, multivariate models exist for correction.84

The *punchline* is that researchers need to address and report potential biases and provide, if available, methods for correction.

## Final concepts from the COSMIN initiative: validity, responsiveness and interpretability

The current work has not addressed validity, “the degree to which an instrument truly measures the construct(s) it purports to measure”.8 The concept of validity has several interpretations that should be considered when researchers attempt such an approach, in particular for ‘unobservable’ constructs or concepts (for instance, in many areas of psychology). *Validity* has been divided into three major types: content validity, criterion validity and construct validity. As reported in de Vet *et al*,85 *content validity* focuses on whether the content of the instrument corresponds with the construct that one intends to measure with regard to relevance and comprehensiveness. *Criterion validity* is applicable in situations where there is a gold standard (such as many of the conditions studied in internal medicine) and the objective is to assess how well the measurement instrument agrees with the scores of the gold standard. Table 4, adapted from de Vet *et al*, illustrates how frequently researchers may deal with studies evaluating criterion validity. Proper statistical analyses and study design should be followed as recommended by the COSMIN initiative5 ,8 ,85 and the STARD,13 among others. Finally, *construct validity*, applicable in situations where there is no gold standard, refers to whether the instrument provides the expected scores based on existing knowledge about the construct. In summary, researchers should follow the correct methodology for validity studies in order to have accurate statistical reports and analyses.

*Responsiveness* is the ability of an instrument to detect changes over time.8 This is important when the gold standard (criterion approach) and/or the diagnostic criteria (if no gold standard available, construct approach) change over time. It is obvious that when both elements (instrument and comparator) are present, it is possible to assess whether the instrument is responsive.8 ,86

Finally, the COSMIN initiative addresses the concept of *interpretability* as “the degree to which one can assign qualitative meaning – that is, clinical or commonly understood connotations – to an instrument's quantitative scores or change in scores”.8 Four issues need to be considered for the evaluation of interpretability.87 *First*, what is the distribution of scores of a study sample on the instrument? As noted in many parts of this paper, some of the concepts (ICC) depend on the variability of the sample and might not be directly transportable to other populations. *Second*, are there floor and ceiling effects? Those effects occur when a high proportion of the population has scores at the lower or higher end of the scale. *Third*, are scores and change scores available for relevant subgroups (general population, specific disease groups)? *Finally*, is the minimal important difference known? The distinction between the smallest detectable change and the minimal important difference is quite important, since the first is related to numerical and measurement errors, whereas the second is based on clinical knowledge.88
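The second question, on floor and ceiling effects, is easy to check directly from the data. The sketch below is a hypothetical helper, not part of the COSMIN materials; the 15% cut-off in the example is an assumption taken from a commonly cited rule of thumb rather than from this paper.

```python
def floor_ceiling_proportions(scores, scale_min, scale_max):
    """Return the proportions of scores at the lowest and highest
    possible values of the scale, as (floor, ceiling)."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    floor = sum(1 for s in scores if s == scale_min) / n
    ceiling = sum(1 for s in scores if s == scale_max) / n
    return floor, ceiling


# Hypothetical symptom scores on a 0-10 scale for ten patients.
scores = [0, 0, 3, 5, 10, 10, 10, 7, 4, 6]
floor, ceiling = floor_ceiling_proportions(scores, scale_min=0, scale_max=10)

THRESHOLD = 0.15  # assumed rule of thumb, not taken from this paper
print(f"floor = {floor:.0%}, ceiling = {ceiling:.0%}")
print("floor effect" if floor > THRESHOLD else "no floor effect")
print("ceiling effect" if ceiling > THRESHOLD else "no ceiling effect")
```

Subjects clustered at either extreme cannot show further deterioration or improvement on the instrument, which limits both reliability estimates and responsiveness in that sample.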

## Getting the report ready (STEPS 9–11)

Up to now, the paper has addressed meaningful parameters of reliability and/or agreement based on the objective of the study, the characteristics of the population, the number of observers, and measurements (STEPS 1–6) with possible discussion of generalisability/decision (if needed, STEP 7). It has also addressed the potential biases that might be present as well as various methods of correction that have been used (STEP 8).

It is now time for the work to be self-evaluated and, hopefully, for additional measures to be taken to produce a manuscript that is of high methodological quality (STEPS 9 and 10) and of high reporting quality (STEP 11). For the last three steps (STEPS 9 to 11), researchers should consult the COSMIN checklist, boxes B and C (see online supplementary boxes S1 and S2)4; then the four-point checklist used to check individual studies in meta-analyses (a good opportunity for a final self-examination of the paper)89 (boxes B and C, online supplementary boxes S1 and S2); ending with the QAREL (see online supplementary table S2)10 and GRRAS (see online supplementary table S3)11 checks. By this step, the answer will likely be ‘yes to all.’

## Summary: putting all together for reliability and agreement studies

Reliability and agreement are two properties fundamental to clinical medicine. Instruments should be able to differentiate between subjects (reliability) and, if possible, yield the same value over repeated measures (agreement). Researchers should follow the proposed stepwise approach to analyse and report studies using some of the suggested statistical methods. The reviewer recommends self-evaluation of the research/paper by matching the data with the COSMIN checklist (Boxes B and C for reliability and agreement, respectively), accompanied by the four-point COSMIN initiative checklist. Additionally, all of these methods could be used as a screening tool for quality checks during peer review of papers.

Future directions in reliability and agreement studies in the measurement of GI disorders should explore latent class modelling and correction of measurement error by calibration studies.

## References

## Supplementary materials

## Supplementary Data

This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

**Files in this Data Supplement:**

- Data supplement 1 - Online supplement

## Footnotes

Correction notice This article has been corrected since it was published Online First. The formula under the heading ‘Reliability parameters for continuous measurement variables’ has been corrected.

Acknowledgements The author thanks Sandeep Mahajan, MD and Navneet K Jaswal, JD for editing the current manuscript.

Competing interests None.

Provenance and peer review Commissioned; externally peer reviewed.