
Reliability and agreement studies: a guide for clinical investigators
Ruben Hernaez
Correspondence to Dr Ruben Hernaez, Division of Gastroenterology and Hepatology, Department of Internal Medicine, The Johns Hopkins School of Medicine, 600 N. Wolfe Street, Blalock 439, Baltimore, Maryland 21287, USA; rhernae1@jhmi.edu


Setting the framework: the difference between reliability and agreement

On a daily basis, clinicians and researchers face the challenge of measuring multiple outcomes. From responses to therapies and assessments of disease activity, to certainty of diagnoses and innovation of cutting-edge diagnostic tools, it is essential within every field that outcome measurement be valid, reproducible and reliable.1 At first glance, validity, reproducibility, reliability and agreement may seem similar; however, there are fundamental differences among these concepts that are important for study design and execution, and for methodology and statistical analyses. Alvan Feinstein recognised this problem and introduced the term clinimetrics, that is, “the methodologic discipline focusing on measurement issues in clinical medicine”.2 The concept of clinimetrics is not new; on the contrary, it has been considered a subset of psychometrics.3 Terwee, de Vet, Mokkink and Knol, among others,4 developed tools to assess and evaluate health measurement instruments in clinical medicine. This is why the backbone of this paper relies on the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative.

The COSMIN initiative is a multidisciplinary, international consensus which aimed to create standards for evaluating the methodological quality, design and preferred statistical analyses of studies on measurement properties.5 The initiative primarily focused on Health Related Patient-Reported Outcomes (HR-PRO) because of the complexity of these outcome measurements; however, the concepts also apply to other types of outcomes and will be followed here.4 For clarification, HR-PRO is defined by Mokkink et al4 as “any aspect of a patient's health status that is directly assessed by the patient, that is, without the interpretation of the patient's responses by a physician or anyone else”; examples include self-administered or computer-administered questionnaires.

The COSMIN taxonomy for the evaluation of a measurement instrument comprises three main quality domains: reliability, validity and responsiveness. As noted in the online supplementary figure S1 and table S1, within each domain there are several properties which help in the evaluation of the domain and relate to each other.4, 5 While all domains are important in the evaluation of a new instrument, this review aims to provide guidance on the reliability domain because it is frequently encountered in GI research6 and less well known than diagnostic accuracy studies. For a detailed description of the other domains and their evaluation, the reader is referred elsewhere.4, 5

This paper aims to accomplish three goals: (1) provide a review of key concepts in reliability and agreement; (2) give an overview of the statistical tools available for reliability and agreement; and (3) propose a stepwise approach for the analysis and reporting of high quality reliability/agreement studies based on available checklists. To this effect, a search strategy was applied in PubMed to identify relevant papers in the field of reliability and agreement studies, so that pertinent references could be included in the present work (see online supplementary figure S2). This search was not designed to be a formal systematic review, but to show how references for the current narrative review were selected.

The reviewer proposes to follow a scheme similar to that of Gisev et al,7 adapted to the COSMIN initiative checklist,4, 8, 9 the Quality Appraisal of Reliability Studies (QAREL, online supplementary table S2) checklist10 and the Guidelines for Reporting Reliability and Agreement Studies (GRRAS, online supplementary table S3),11 so that researchers follow the recommended published guidelines (figure 1). Several STEPS are proposed in sequential fashion to guide authors through the development of the paper.

Figure 1

Stepwise approach to analyse and report reliability and agreement studies.

The paper will provide limited mathematical formulations for two main reasons. First, such formulations may intimidate statistics-naïve readers. Second, current statistical software, when used correctly, provides straightforward answers. This is why the reviewer will give ‘the punchline’ after each section on which method to use, provide an extensive accompanying bibliography for the keen reader, and give examples of STATA commands (StataCorp. 2011. Stata Statistical Software: Release V.12. College Station, Texas, USA: StataCorp LP). Finally, this paper will not assess the methodology of diagnostic accuracy studies (sensitivity, specificity).

Differences in reliability and agreement: related but not synonymous (STEP 1)

According to de Vet et al,1 the measurement property reliability answers the question of how well patients can be distinguished from each other despite measurement error. In contrast, agreement shows exactly how close the scores for repeated measurements are, and is related to measurement error, defined as the “systematic and random error of a patient's score that is not attributed to true changes in the construct (or concept) to be measured”.8 Reliability and agreement have often been confused and used interchangeably; however, they are distinguishable in key ways. Kottner and Streiner point out that “a clear distinction between the conceptual meanings of agreement and reliability is necessary to select appropriate statistical approaches and enable adequate interpretations”.12 Whereas “agreement points to the question, whether diagnoses, scores, or judgments are identical or similar or the degree to which they differ”, “reliability coefficients provide information about the ability of the scores to distinguish between subjects”.12 An excellent graphical description of the differences between reliability and agreement can be found in de Vet et al.1

Reproducibility is considered an umbrella term that encompasses both reliability and agreement.1 On the other hand, the Standards for Reporting of Diagnostic Accuracy (STARD) define diagnostic accuracy as “the amount of agreement between the results from the index test and those from the reference standard”.13 As discussed in ‘Final concepts from the COSMIN initiative: validity, responsiveness and interpretability’, this concept refers to criterion validity and will not be assessed as part of the reliability/agreement studies discussed here.

One should be cautious of the numerous papers providing mixed terminology regarding which parameters belong to reliability and which to agreement.7, 14, 15 In order to provide a unified approach, this review follows the published guidelines from experts in the field and recommends that authors adopt this taxonomy accordingly (see online supplementary figure S1).8–11, 16

In sum, reliability and agreement are two separate concepts that describe how well an instrument detects real variability between subjects and whether the instrument, over repeated measures, yields consistent results. This is the point at which the researcher needs to define the objective of the study and decide on the analytical pathway to pursue (figure 1).

What is the type of outcome variable and what is the most appropriate statistical parameter? (STEP 2–6)

Parameter selection is based on the objective and the nature of the measurement (STEP 2)

Depending on the type of measurement variable, there are different reliability and agreement parameters (table 1). At this stage, the reader already knows whether the instrument yields a number or a category—for example, blood pressure or a severity classification of pathology results, respectively—and may be ready to apply table 1; however, it is important to consider how many instruments (or raters) there are (STEP 3), as some of these statistics apply only to two raters and two measurements (Cohen's κ). One should also consider whether the tests (or observers) are independent from each other (STEP 4). For each test, whether it can accommodate multiple readers will be indicated (STEP 3 within the text; also noted in table 2 under “3 or more”), so that the focus here can shift to the often-ignored issue of paired/dependent data.

Table 1

Common statistical tests in reliability and agreement studies

Table 2

Common statistical tests for comparing groups by statistical independence

Number of raters/tests involved and paired data: what statistical tests should be used and why? (STEP 4)

Paired, correlated, clustered, matched data and repeated measures are all synonymous terms with the same statistical implication—unlike independent data, they carry a correlation between measurements, and this must be taken into account in the analyses to avoid flawed results. Examples of such paired data include observations made on the same patient simultaneously or over time (repeated measures, pre-post test), on the same organ (eg, colon, eyes), observations matched by design (eg, matched case-control studies), or clustered observations in which each observation belongs to the same patient or to a clustered randomised clinical trial.

When data are paired, common statistical tests that assume independence are not appropriate, and the dependency needs to be accounted for. To guide the reader, table 2 summarises common statistical tests used in any type of analysis of paired data.17, 18 Should similar considerations apply when running univariate and/or multivariate analyses? The answer is ‘yes’. Generalised estimating equations (GEE),19 for instance, are one tool used to account for this type of correlated data, and the reader is encouraged to apply them to correlated data and multivariate analyses.20, 21 They are not, however, the only alternative, as other regression techniques adjusted for cluster effects have been developed.22
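
As an illustration, the following sketch fits a GEE model with an exchangeable working correlation for two observations clustered within each patient; the data are simulated and all variable names are hypothetical.

  * Sketch: GEE for clustered binary data (simulated; names are hypothetical)
  clear
  set seed 99
  set obs 100                          // 100 patients
  gen patient = _n
  gen risk = rnormal()                 // patient-level covariate
  gen u = rnormal(0, 0.7)              // patient-level effect inducing within-patient correlation
  expand 2                             // two correlated observations (eg, lesions) per patient
  bysort patient: gen lesion = _n
  gen positive = runiform() < invlogit(-1 + 0.8*risk + u)
  xtset patient
  xtgee positive risk, family(binomial) link(logit) corr(exchangeable) vce(robust)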

A common example is the choice between per-biopsy and per-patient analyses. The use of clustered methods such as those described above depends on the research question. In other words, when the unit of interest is the patient but multiple biopsies are taken, or when some other type of clustering (techniques, hospitals) is of interest, there is dependency or correlation within clusters, and this type of clustered/paired statistical analysis should be used. If not, the researcher will need to justify the choice of unclustered/unpaired methods.

Reliability parameters for categorical (nominal, dichotomous and ordinal) measurement variables

κ index and extensions

The κ index is the most commonly used method to assess reliability for categorical instruments. The κ index is a family of indices23, 24 which relates “the amount of agreement that is observed beyond chance agreement to the amount of agreement that can maximally be reached beyond chance agreement”.25 The underlying assumption when calculating this parameter is that rater reports are statistically independent; that is, that chance agreement can be calculated by multiplying the marginal cell probabilities.26 The term ‘measure of agreement’ may cause confusion, but the general structure of the formula for κ is indeed derived from the general equation of reliability, rightly making it a measure of reliability.27 Under certain circumstances (ie, quadratic weights on the weighted κ), it is equivalent to the two-way mixed, single-measures intraclass correlation coefficient (ICC),28 which will be further explained in ‘ICC, a layman's terms definition and method of selection’ and figure 2. If the study has paired data, options are available for κ as well (STEP 4).29, 30
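
As a minimal sketch (simulated ratings; all variable names are hypothetical), Cohen's κ for two raters can be obtained in STATA with the kap command; listing additional rating variables extends the command to three or more raters (STEP 3).

  * Sketch: Cohen's kappa for two raters on a three-category scale (simulated data)
  clear
  set seed 1234
  set obs 80
  gen rater1 = ceil(3*runiform())                        // categories 1, 2, 3
  gen rater2 = cond(runiform() < 0.7, rater1, ceil(3*runiform()))
  kap rater1 rater2                                      // unweighted kappa; kap rater1 rater2 rater3 handles >2 raters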

Figure 2

Stepwise approach to select intraclass correlation coefficient (adapted from McGraw et al.49 and Shrout et al.51). ANOVA, analysis of variance.

Statistical inference, power calculation and software of κ indices

The κ indices can also be used for statistical inference (significance testing, SE estimation and CI calculation),31 and sample size formulae exist for such purposes.31–33 STATA, for example, provides an extensive list of statistical packages for these tasks, including calculation of CIs (kapci), sample size/power calculation (sskdlg, sskapp) for incomplete designs (such as missing data), and multiple raters (here STEP 3; κ2). When estimating power, the researcher should consider that, as with any other variable in medicine, dichotomisation of a continuous variable may carry an important loss of power, thus requiring careful decision-making.34

Interpretation and caveats of κ indices

Across the literature, three major interpretation schemes, in which a value of 0.6 or beyond is considered acceptable or fair, have been consistently used (table 3).28 It is also possible to obtain negative κ indices when one of the raters accidentally reverses the rating scale.25, 33 While they have been widely used, κ indices are far from perfect and face several criticisms. First, the criteria used to evaluate κ (table 3) are arbitrary. Second, κ is influenced by the marginal distribution of the ratings (‘the bottom rows and the end columns’ of the cross-tabulation) and by the prevalence of the disease. In this sense, several methods have been proposed to correct for bias (marginal distribution inequities) and prevalence; although such corrections may be reported, some authors consider the adjusted κs inaccurate.35 Finally, κ, as a summary statistic, can lose information about the discrepancies between observers, and this loss is even more accentuated when the researcher converts a naturally continuous variable into a binary one.34 For example, in a hypothetical pathology grading of ‘no cancer’, ‘preneoplastic’ and ‘cancer’, it would be important to see how much of the discordance was between a ‘no cancer’ diagnosis and ‘preneoplastic’. Consequently, it is important to provide the raw data as an online supplementary table so that readers can review the rates of agreement and the reasons for discrepancies.

Table 3

Several criteria for κ interpretation, adapted from Streiner and Norman28

Other alternatives for categorical reliability: weighted κs, ICCs and modelling patterns of agreement

Weighted κs and matrix of weighted κs

A weighted κ is an extension of the regular κ in which the researcher provides arbitrary weights to penalise the degree of disagreement between observers for multiple categories. Several weighting systems can be used including linear, quadratic and Cicchetti-Allison weights33 (STATA command kapwgt).
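
Continuing the simulated ordinal ratings (rater1, rater2) from the sketch above, a weighted κ can be obtained in STATA either with a user-defined weight matrix (kapwgt) or with the built-in linear (w) and quadratic (w2) weights; the weights shown below are arbitrary and for illustration only.

  * Sketch: weighted kappa for the three-category ratings simulated above
  kapwgt mywgt 1 \ .5 1 \ 0 .5 1       // user-defined 3x3 weight matrix (here, linear weights)
  kap rater1 rater2, wgt(mywgt)
  kap rater1 rater2, wgt(w2)           // built-in quadratic weights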

The matrix of weighted κ is another alternative method to evaluate agreement in ordinal variables and takes into account κ indices and correlation coefficients.36 For further review, the reader is referred to Roberts.31 ,37 This approach, however, has been rarely used in medicine.

ICC for ordinal variables: perhaps better than weighted κ

Kraemer,38 and Streiner and Norman,28 favour κ for two-by-two agreement assessment, but recommend using the ICC when dealing with anything other than dichotomous variables or with more than two observers. This is because the ICC, as noted by Streiner and Norman,28 provides the following: “(a) the ability to isolate factors affecting reliability; (b) flexibility, in that it is very simple to analyse reliability studies with more than two observers and more than two response options (STEP 3); (c) we can include or exclude systematic differences between raters (bias) as part of the error term; (d) it can handle missing data; (e) it is able to simulate the effect on reliability of increasing or decreasing the number of raters; (f) it is far easier to let the computer do the calculations; and (g) it provides a unifying framework that ties together different ways of measuring inter-rater agreement—all of the designs are just variations on the theme of the ICC”.28 A recent paper published in Gut illustrates how to apply the ICC methodology to examine agreement between ordinal measurements and can be followed as an example.6

Modelling patterns of reliability: log-linear models, latent class models and beyond

Advanced techniques are available for researchers interested in providing more information than a summary statistic, for instance, to explore patterns of reliability by different characteristics or to compare reliability adjusting for different covariates; for this there is no gold standard approach.39 Instead, a researcher may apply latent class models, in which reliability (or agreement, as used in the references) is considered an unobservable categorical truth (or latent class), and this latent class is related to the observed data.40–43 A good tutorial can be found on the website of Dr Uebersax44 and, for a more practical application, in the papers by Christensen et al45 and Dunn,16 which may be among the few studies in gastroenterology using this technique. Other approaches are log-linear models, which examine the association of cell counts at the level of categories (eg, observers) and model interaction patterns and associations.26, 46, 47 Similar to the latent class models, there are only a handful of examples using log-linear models in GI reliability.16, 48 Again, STATA provides commands for latent class models (gllamm) and log-linear models (ipf).

The punchline for categorical variables: κ with caution, perhaps ICC

Whenever researchers are presented with categorical variables, the use of κ indices is acceptable, provided that the proper extension is used and that the problems associated with these indices are acknowledged. Although not part of the COSMIN evaluation tool, the ICC is also a good alternative. Finally, while latent class and log-linear models provide a unique opportunity for research on the topic, they remain less explored in the field of reliability studies.

Reliability parameters for continuous measurement variables

Interclass versus intraclass: what is the difference?

An interclass correlation coefficient measures the relationship between two variables of different classes or types (eg, Pearson's r), which do not share the same metric and/or variance. In contrast, when the research seeks to evaluate the relationship between variables of a common class, which share the same metric and variance, intraclass correlation coefficients (ICCs) are used.

ICC, a layman's terms definition and method of selection

The general equation of the reliability parameter (eg, ICC) is as follows:

Reliability = σ²between patients / (σ²between patients + σ²measurement error)

This equation shows how the variability (ie, variance) between study subjects relates to the measurement error or, as defined in the COSMIN initiative, that reliability is “the proportion of the total variance in the measurements which is due to ‘true’ differences between patients”.4

Selection of the ICC

At least six types of ICC have been described; they derive from different methodologies and therefore lead to different inferences.28, 49, 50 Thus, when authors report ‘an ICC’, a thorough explanation should be provided; the question remains, however, of which ICC to use and why. Luckily, there are guidelines to assist researchers (figure 2).49, 51 No matter which ICC is chosen, it is important to provide readers with an online supplementary table of all variance components, which can be used for interpretation and correction of measurement error (eg, as in table 2 of the paper by de Vet et al,1 or table 5.10 in chapter 5 of the de Vet et al52 book). Furthermore, it is also important to provide data to support the assumptions of the analysis of variance (ANOVA), as noted by McGraw and Wong.49

Statistical inference, power calculation and software

As discussed, several authors have described techniques for obtaining statistical inference for the different ICCs.49 For sample size calculation, several formulae are available: one relies on hypothesis testing;32, 53 others on the width of the ICC confidence interval54, 55 or on the assurance probability that the ICC estimate reaches a certain value (eg, 90% assurance that the ICC would be 0.95).56 STATA provides support for statistical inference (icc, including some of the forms of ICC in figure 2) and for sample size calculation (sampicc).
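
The sketch below shows the built-in icc command on simulated data in long format (one row per rating; variable names are hypothetical); the absolute agreement and consistency options correspond to different branches of figure 2.

  * Sketch: two-way mixed effects ICCs with the built-in icc command (simulated data)
  clear
  set seed 5
  set obs 40                           // 40 patients (targets)
  gen target = _n
  gen true = rnormal(10, 3)
  expand 3                             // each patient rated by the same 3 raters
  bysort target: gen rater = _n
  gen rating = true + 0.5*(rater == 3) + rnormal(0, 1)
  icc rating target rater, mixed absolute       // two-way mixed effects, absolute agreement
  icc rating target rater, mixed consistency    // two-way mixed effects, consistency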

Interpretation and caveats of the ICC

The value of the ICC traditionally ranges from 0 (not reliable) to 1 (perfect reliability). Traditionally, ICC values less than 0.40 are considered poor, values between 0.40 and 0.59 fair, values between 0.60 and 0.74 good, and values between 0.75 and 1.00 excellent.50, 57, 58 However, careful observation of the above formula shows that, depending on the relationship between the variation across persons/observations and the measurement error, the ICC—and reliability parameters in general—can be easily influenced by measurement error. For example, if there is high variation between study subjects compared with the measurement error, the instrument will show high reliability, that is, it will accurately distinguish between participants. On the other hand, if there is not enough variation between patients, even a small measurement error can render the instrument unreliable. As a consequence, reliability depends on the variability of the sample from which the index was derived and is influenced by the measurement error to some extent.
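
A simple numerical illustration of this point (all variance values are hypothetical): with the same error variance, the ICC is high in a heterogeneous sample and much lower in a homogeneous one.

  * Sketch: same error variance (4), different between-subject variance (hypothetical values)
  display "ICC, heterogeneous sample: " 100/(100 + 4)     // approximately 0.96
  display "ICC, homogeneous sample:   " 5/(5 + 4)         // approximately 0.56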

Extra notes on ICCs: other extensions and paired data

Haber et al14 and Muller et al59 describe other types of coefficients (parametric and non-parametric) for use when any of the underlying ICC assumptions are not met (eg, concordance correlation coefficient (CCC), Rothery, etc). Chen and Barnhart60 provide guidance on the CCC and the ICC. The CCC is similar to the ICC except that it does not rely on ANOVA assumptions. This may be seen as an advantage, but the CCC methodology is less developed than that of the ICC. In the case of repeated measures on the same subjects over time (STEP 4), the ICC can accommodate such paired designs by modelling the interaction of time with observation as well as time with observer.
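
If the CCC is preferred, a user-written STATA command (concord, installable from SSC) estimates Lin's concordance correlation coefficient; the sketch below uses simulated paired measurements and hypothetical variable names.

  * Sketch: Lin's CCC via the user-written concord command
  * ssc install concord                // one-time installation from SSC
  clear
  set seed 11
  set obs 60
  gen method1 = rnormal(20, 4)
  gen method2 = 0.9*method1 + 2 + rnormal(0, 1.5)
  concord method1 method2              // reports the concordance correlation coefficient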

The punchline for continuous measurements: the ICC is good but model assumptions need to be specified

Since the inferences from an ICC depend heavily on the method of calculation (figure 2), researchers should follow a rationale to arrive at the best index based on their assumptions. It is also important to take into account that an ICC is applicable to other populations only if the underlying variance is the same. Because of this comparability, and the potential uses for calibration (see later sections), it is important to provide an online supplementary table that describes the variance attributable to observations and observers.

Generalisability and decision studies (G and D studies): beyond the usual report of ICCs and κs (STEP 7)

Obtaining the correct statistical parameters may be the end of an agreement study for some researchers, but not for all. For instance, Streiner and Norman make a distinction between the statistician's approach and the psychometrician's (or, in this review, the clinician's!).61 A statistical mind would be interested in determining the effects (main effects or interactions) that are statistically significant, as well as the overall (treatment) effect, with any variation considered ‘noise’ and treated as ‘error’. In contrast, the psychometrician (clinician) would be interested in measuring variance components and (hopefully) identifying and characterising the ‘error’ in order to understand what differentiates subjects (ie, reliability). Thus, it is quite possible that readers may wish to continue this approach and find themselves becoming ‘G theorists’.61 In this setting of Generalisability and Decision studies (G and D studies), researchers might be interested in some (or all) of the questions described below:52, 61

  1. What is the reliability of the measurements if we compare, for all objects, one measurement by one rater with another measurement by another rater?

  2. What is the reliability of the measurements if we compare, for all objects, the measurements performed by the same rater (ie, intrarater reliability)?

  3. What is the reliability of the measurements if we compare, for all objects, the measurements performed by different raters (ie, inter-rater reliability)?

  4. Which strategy is recommended for increasing the reliability of the measurement: using the average of more measurements of the objects by one rater, or using the average of one measurement by more raters?

As introduced earlier, this is the approach used in generalisability theory (in contrast with classical test theory), in which the factors affecting reliability are examined simultaneously. This theory has several methodological twists, uses ‘facets’ instead of ‘factors’, and takes into account whether the design is crossed (all raters evaluate all objects) or nested (not all raters evaluate all objects). Overall, the methodology allows three-way ANOVA analyses and is summarised in figure 3. For further details of the underlying methodology and formulae, and the application of generalisability theory to reliability studies, see Streiner and Norman.61
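
As a sketch of the variance-component estimation that underlies G and D studies, a mixed model with crossed random effects for subject and rater yields the subject, rater and residual variances from which generalisability coefficients can be built; the fully crossed design below is simulated and all names are hypothetical (xtmixed replaces mixed in older STATA releases).

  * Sketch: variance components for a fully crossed subject x rater design (simulated)
  clear
  set seed 2016
  set obs 30                           // 30 subjects
  gen subject = _n
  gen true = rnormal(50, 10)
  expand 3                             // every subject scored by the same 3 raters
  bysort subject: gen rater = _n
  gen score = true + 1.5*(rater == 2) + rnormal(0, 3)
  mixed score || _all: R.rater || subject:      // crossed random effects: rater and subject
  * the estimated rater, subject and residual variances are the building blocks of
  * G (relative) and D (absolute) coefficients in generalisability theory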

Figure 3

Stepwise approach for generalisability and decision studies (adapted from Streiner and Norman61). ANOVA, analysis of variance.

Agreement: related but not equal to reliability

As discussed earlier, agreement shows exactly how close the scores for repeated measurements are, and is related to measurement error,1 which should be minimised, if possible, when the aim of the instrument is to evaluate patients over time. It is now time to review the different approaches to assess agreement for categorical and continuous measurements.

Agreement for categorical variables

According to de Vet et al,25 there are really “no parameters of measurement error for categorical variables”, since only classification and ordering are important. Additionally, since there are no units, there are no clear parameters of measurement error. Nonetheless, the percentage of agreement and the proportions of specific agreement can be calculated; these are discussed below, in the section on continuous variables.

Agreement for continuous variables

Coefficient of variation

The coefficient of variation expresses the SD as a percentage of the mean (SD/mean × 100%). It is calculated for each pair of observations, and a value of zero represents perfect agreement. One major advantage of the coefficient of variation is that it is unitless and can consequently be used for comparisons across variables and in testing models. It is only meaningful, however, when the scales compared take positive values (for instance, it is not meaningful for scales that can take negative values, such as temperature in Fahrenheit).62 A command is available in STATA (estat cv).
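
A minimal sketch of the within-pair coefficient of variation (simulated paired measurements; names are hypothetical): for two measurements the SD of the pair equals the absolute difference divided by √2, so the coefficient can be computed directly.

  * Sketch: within-pair coefficient of variation (simulated data)
  clear
  set seed 42
  set obs 20
  gen m1 = rnormal(100, 15)
  gen m2 = m1 + rnormal(0, 5)
  gen pair_mean = (m1 + m2)/2
  gen pair_sd   = abs(m1 - m2)/sqrt(2)           // SD of two values
  gen cv_pct    = 100*pair_sd/pair_mean
  summarize cv_pct                               // average within-pair CV (%)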

Correlation coefficients: why are they not in table 1?

In the past, correlation coefficients (such as Spearman's or Pearson's) were used to provide a measure of agreement; however, as Bland and Altman point out, the correlation coefficient measures the strength of the relation between two variables, not the agreement between them63 (table 1). Correlation coefficients measure how closely the values lie along a straight line and range from −1 to +1 (perfect negative and perfect positive correlation, respectively). Correlation coefficients also depend on the accuracy of the assessed variable and on the variation in the study sample (ie, they are population dependent).64 Pearson's correlation coefficient assumes that the variable is normally distributed (in contrast to its non-parametric counterpart, Spearman's rank correlation). Other misuses of correlation coefficients described by Altman include ignoring repeated measures over time, restricted sampling of individuals, mixing samples with different characteristics, and so on.65

Commands in STATA include conventional correlation (correlate), partial correlation (pcorr), CIs (corrci) and power/sample size calculation (power onecorrelation).

Proportion of positive agreement and specific agreement

The proportions of overall and specific agreement provide raw estimates of important descriptive data: they convey the overall agreement between raters and the agreement for each particular category (positive or negative agreement). Such agreement estimates are reported less frequently but should be considered a standard part of basic descriptive statistics. An excellent resource for agreement studies can be found on Uebersax's website,66 and sampling distributions for positive and negative agreement have been described by Samsa.67 It is, however, in the context of criterion validity (ie, diagnostic accuracy studies) that these proportions gain importance, as the Food and Drug Administration requires them as part of the full report.68
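
For a 2×2 cross-tabulation of two raters, the overall agreement and the proportions of specific (positive and negative) agreement can be computed directly from the cell counts; the counts below are hypothetical.

  * Sketch: overall and specific agreement from hypothetical 2x2 cell counts
  scalar n_pp = 40                     // both raters positive
  scalar n_pn = 8                      // rater 1 positive, rater 2 negative
  scalar n_np = 5                      // rater 1 negative, rater 2 positive
  scalar n_nn = 47                     // both raters negative
  display "Overall agreement:  " (n_pp + n_nn)/(n_pp + n_pn + n_np + n_nn)
  display "Positive agreement: " 2*n_pp/(2*n_pp + n_pn + n_np)
  display "Negative agreement: " 2*n_nn/(2*n_nn + n_pn + n_np)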

Standard error measurement and its association with ICC

A thorough summary of the relationship between measures of reliability and agreement is provided by de Vet et al.1 They define measurement error by the SE of measurement (SEM), which equals the square root of the error variance, SEM = √(σ²error); the error variance can include a systematic error between observers/raters (SEMagreement = √(σ²observer + σ²residual)) or not (SEMconsistency = √(σ²residual)). De Vet et al1 provide a more in-depth explanation of the formulae, which demonstrates that, since the ICC uses different types of σ² (variances), as shown in figure 2, it is possible to derive the SEM from the ICC. One note of caution is to avoid using the formula SEM = SD × √(1 − ICC): researchers have used ICCs obtained from different studies with different subject variability and, as readers already know, the ICC depends on the heterogeneity of the study participants; thus, the formula is valid only when heterogeneity is similar.1, 69
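
As a numerical sketch, the SEM with and without the systematic observer component, and the corresponding ICC, can be computed directly from the variance components recommended for the online supplementary table; all values below are hypothetical.

  * Sketch: SEM and ICC from hypothetical variance components
  scalar var_patient  = 12.0
  scalar var_observer = 0.8
  scalar var_residual = 2.2
  display "SEM (agreement):   " sqrt(var_observer + var_residual)
  display "SEM (consistency): " sqrt(var_residual)
  display "ICC (agreement):   " var_patient/(var_patient + var_observer + var_residual)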

Bland-Altman plot: excellent resource to summarise measurement error

In 1986, Douglas Altman and J Martin Bland proposed a different method to assess systematic error between observers: the difference between the scores of the two observers is plotted against the mean of their scores. For instance, a recent paper by Sedgwick70 showed a Bland-Altman plot investigating the agreement between primary care and daytime ambulatory blood pressure monitoring. Online supplementary figure S3 shows the mean systematic difference between the two readings (d, green line) and two other lines (red, dotted) which represent the ‘limits of agreement’, estimated as d ± 1.96 × SD of the differences. Further work by de Vet et al71 relates the limits of agreement to the SEM through the minimal or smallest detectable change (SDC), estimated with a 95% CI as SDC = 1.96 × √2 × SEM.

The limits of agreement represent the range within which most differences between measurements by the two methods will lie.72 In other words, the width of this interval gives an indirect indication of the magnitude of the measurement error; consequently, if a difference between two observations falls beyond the interval, it is most likely real as opposed to measurement error. As noted in the online supplementary figure S3, the graph is quite powerful in illustrating the degree of agreement between two raters, allowing readers to see the relationship between extreme values and the degree of disagreement. As noted by Bland and Altman, “the decision about what is acceptable agreement is a clinical one; statistics alone cannot answer the question.” Nevertheless, given the data, the graph shows what the expected systematic and random errors are, making it possible to see the smallest detectable change. In addition, the mean systematic difference could be used to correct the measurements of one technique against another.

The plot, however, has several limitations that need to be taken into account. First, it assumes that the differences between measures are normally distributed (this can be checked using the usual histogram or normal plot). Second, it assumes that the sample subjects are representative of the population to which the study aims to extrapolate. Third, the plot can only be used for pairwise observations, so whenever more than two techniques are compared, each pairwise Bland-Altman plot needs to be reported separately. Finally, it assumes that the systematic error (d) is constant across different values; this last assumption is often criticised.73 Nonetheless, Bland and Altman have proposed several statistical techniques to deal with such criticisms, including log-transformation of the variable74 or regression techniques that account for non-uniformity across values.72 Commands to produce the Bland-Altman plot are also available in STATA.
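
The mean difference and limits of agreement can be computed and plotted directly, as in the following sketch (simulated data; all names are hypothetical).

  * Sketch: Bland-Altman limits of agreement (simulated data)
  clear
  set seed 7
  set obs 50
  gen methodA = rnormal(120, 15)
  gen methodB = methodA + 2 + rnormal(0, 6)      // systematic difference of about 2 units
  gen diff = methodA - methodB
  gen avg  = (methodA + methodB)/2
  quietly summarize diff
  local d  = r(mean)
  local lo = r(mean) - 1.96*r(sd)
  local hi = r(mean) + 1.96*r(sd)
  display "mean difference: " `d' "   limits of agreement: " `lo' " to " `hi'
  scatter diff avg, yline(`d' `lo' `hi') ytitle("Difference (A - B)") xtitle("Mean of A and B")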

Repeated measures for measurement errors in continuous variables

In their paper, Bland and Altman describe several methods, based on variance methodology, to account for repeated measurements on the same subjects with equal or unequal numbers of replications. Readers are referred to that paper to learn more about implementation.72

A word on calibration studies and statistical techniques on measurement error models

Calibration studies are commonly used in nutritional epidemiology, where a gold standard of food intake reporting is compared against other, less preferred questionnaires that are easier to complete and to apply in large numbers. Paraphrasing Carroll et al,75 calibration studies can be used to: (1) adjust relative risks when using a suboptimal method of ascertainment; (2) estimate the sample size required in the main study, since measurement error can influence the power; (3) estimate the correlation between intake from the less desired method and the gold (or preferred) standard; and (4) estimate the slope of the regression of the less preferred method on the gold standard method, a quantity that is important in assessing patterns of bias. Carroll et al75 offer a quick, step-by-step approach for power estimation, which requires knowing the correlations between each parameter of interest in both instruments, the total number of cases of disease needed to obtain the desired power to detect the outcome of interest, and the slope of the regression of the intake from one method on the gold standard (or preferred method). Another paper by the same authors shows that, in this nutritional epidemiology setting, increasing the number of subjects rather than the number of repeated questionnaires yields the same results for calibration purposes.76 Calibration studies open the door to endless possibilities in the field of measurement error in GI research.

ICCs can also be used in calibration for regression dilution bias, a phenomenon common in continuous biological measurements and defined as “the attenuation in a regression coefficient that occurred when a single measured value of a covariate factor (eg, blood pressure) was used instead of the usual or average value over a period of time”.77 Knuiman et al78 provide an excellent resource showing how the ICC can be used to correct adjusted coefficients for this phenomenon.
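
A simple hypothetical illustration of the correction: the attenuated regression coefficient is divided by the reliability (ICC) of the exposure measurement (both values below are assumed).

  * Sketch: correcting a regression coefficient for regression dilution bias
  scalar beta_observed = 0.15          // coefficient estimated from a single measurement
  scalar icc_exposure  = 0.60          // reliability (ICC) of the exposure measurement
  display "Corrected coefficient: " beta_observed/icc_exposure      // = 0.25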

Finally, there are regression models that can also help in the correction of measurement error when some of the variables are measured with error. Dr Guolo79 has recently summarised a set of models that are less stringent on model assumptions (‘robust’) and has provided guidance on when to use them. These include flexible parametric and semiparametric modelling, quasi-likelihood, estimating equations and empirical likelihood, and the simulation-extrapolation method, among others. Readers are encouraged to consult a statistician when selecting a particular model. STATA, for instance, has several commands that can help in the evaluation of measurement error.80

The punchline for measurement error: SEM and Bland-Altman

As noted above, measurement error should be explored with the SE of measurement, which is itself a measure of agreement. It is for this reason that Bland and Altman72 advocate providing reliability and agreement data in the same study, because together they help in understanding the reproducibility of the measurement. If multiple raters are present, pairwise Bland-Altman plots can be provided.

Assessing biases in reliability and agreement studies (STEP 8)

As a general rule, most reporting guidelines recommend addressing potential biases,11 an often forgotten but integral part of a high quality report. In epidemiological terms, bias is defined as a systematic error in the design or conduct of a study. It can appear at any stage of the study, from the method of selection of participants (selection bias) to the procedures for gathering relevant exposure/disease information (information bias). The presence of bias will affect the internal validity of the study and, as a consequence, may render inferences invalid.

Selection bias is present when the association between the exposure and the disease differs between those who participate and those who should be theoretically eligible for the study, including those who do not participate.81 Information bias occurs when participants are systematically measured with error (as opposed to random error). Unfortunately, research is not bias-free, and authors are always expected to discuss potential biases (information and selection bias). Delgado-Rodriguez82 provides excellent guidance on bias, and investigators are encouraged to consider and report both types of bias, to assess whether their research is subject to them, and to state what measures were implemented to control for them.

Most of the literature addresses biases in diagnostic accuracy studies, but the same guidance can be applied to reliability and agreement studies. For example, verification bias, present “when the execution of the gold standard is influenced by the results of the assessed test, typically the reference test is less frequently performed when the test result is negative”,82 can be corrected as described by Zhou.83 Similarly, for spectrum bias,82 present “when researchers included only ‘clear’ or ‘definite’ cases, not representing the whole spectrum of disease presentation, and/or ‘clear’ or healthy control subjects, not representing the conditions in which a differential diagnosis should be carried out”, multivariate models exist for correction.84

The punchline is that researchers need to address and report potential biases and, where available, apply methods for correction.

Final concepts from the COSMIN initiative: validity, responsiveness and interpretability

The current work has not addressed validity, “the degree to which an instrument truly measures the construct(s) it purports to measure”.8 The concept of validity has several interpretations that should be considered when researchers attempt such an evaluation, in particular for ‘unobservable’ constructs or concepts (for instance, in many areas of psychology). Validity has been divided into three major types: content validity, criterion validity and construct validity. As reported in de Vet et al,85 content validity focuses on whether the content of the instrument corresponds with the construct that one intends to measure with regard to relevance and comprehensiveness. Criterion validity is applicable in situations where there is a gold standard, such as many of the conditions studied in internal medicine, and the objective is to assess how well the measurement instrument agrees with the scores of the gold standard. Table 4, adapted from de Vet et al, illustrates how frequently researchers may deal with studies evaluating criterion validity. Proper statistical analyses and study design should be followed, as recommended by the COSMIN initiative5, 8, 85 and the STARD,13 among others. Finally, construct validity, applicable in situations where there is no gold standard, refers to whether the instrument provides the expected scores based on existing knowledge about the construct. In summary, researchers should follow the correct methodology for validity studies in order to produce accurate statistical reports and analyses.

Table 4

Statistical parameters commonly used in criterion validity studies

Responsiveness is the ability of an instrument to detect changes over time.8 This is important when the gold standard (criterion approach) and/or the diagnostic criteria (construct approach, if no gold standard is available) change over time. When both elements (instrument and comparator) are present, it is possible to assess whether the instrument is responsive.8, 86

Finally, the COSMIN initiative addresses the concept of interpretability as “the degree to which one can assign qualitative meaning—that is, clinical or commonly understood connotations—to an instrument's quantitative scores or change in scores”.8 Four issues need to be considered in the evaluation of interpretability.87 First, what is the distribution of scores of a study sample on the instrument? As noted throughout this paper, some of the concepts (eg, the ICC) depend on the variability of the sample and might not be directly transportable to other populations. Second, are there floor and ceiling effects? These effects occur when a high proportion of the population has scores at the lower or higher end of the scale. Third, are scores and change scores available for relevant subgroups (general population, specific disease groups)? Finally, is the minimal important difference known? The distinction between the smallest detectable change and the minimal important difference is quite important, since the first relates to numerical and measurement error, whereas the second is based on clinical knowledge.88

Getting the report ready (STEPS 9–11)

Up to now, the paper has addressed meaningful parameters of reliability and/or agreement based on the objective of the study, the characteristics of the population, and the number of observers and measurements (STEPS 1–6), with a possible discussion of generalisability/decision studies (if needed, STEP 7). It has also addressed the potential biases that might be present, as well as various methods of correction (STEP 8).

It is now time for the work to be self-evaluated and, hopefully, for additional measures to be taken to produce a manuscript of high methodological quality (STEPS 9 and 10) and high reporting quality (STEP 11). For the last three steps (STEPS 9 to 11), researchers should consult the COSMIN checklist, boxes B and C (see online supplementary boxes S1 and S2);4 the four-point checklist used to appraise individual studies in meta-analyses (a useful final self-examination of the paper);89 and, finally, the QAREL (see online supplementary table S2)10 and GRRAS (see online supplementary table S3)11 checklists. By this step, the answer will likely be ‘yes to all’.

Summary: putting it all together for reliability and agreement studies

Reliability and agreement are two essential properties fundamental to clinical medicine. Instruments should be able to differentiate between subjects (reliability) and, if possible, to yield the same value over repeated measures (agreement). Researchers should follow the proposed stepwise approach to analyse and report studies using some of the suggested statistical methods. The reviewer recommends self-evaluation of the research/paper by matching the data against the COSMIN checklist (boxes B and C for reliability and agreement, respectively), accompanied by the four-point COSMIN initiative checklist. Additionally, all of these methods could be used as a screening tool for quality checks during peer review of papers.

Future directions in reliability and agreement studies in the measure of GI disorders should explore latent class modelling and correction of measurement error by calibration studies.

References

Supplementary materials

  • Supplementary Data


Footnotes

  • Correction notice This article has been corrected since it was published Online First. The formula under the heading ‘Reliability parameters for continuous measurement variables’ has been corrected.

  • Acknowledgements The author thanks Sandeep Mahajan, MD and Navneet K Jaswal, JD for editing the current manuscript.

  • Competing interests None.

  • Provenance and peer review Commissioned; externally peer reviewed.