Introduction

Colonography comprises a complete colon examination that can be performed with either computed tomography (CT) [1] or magnetic resonance imaging (MRI) [2]. CT-colonography has been reported to be a feasible, safe, well-tolerated examination with good diagnostic accuracy for the detection of colorectal polyps and cancer (CRC) [3]. Unfortunately, CT-colonography requires ionising radiation, which poses a substantial drawback to large-scale use in patients at both average and increased risk of CRC [4]. Although some studies have shown that substantial dose reduction in CT-colonography is feasible [5, 6], an alternative imaging method that does not require ionising radiation would be preferable, especially for screening purposes.

Since 1997, several research groups have investigated the use of MR-colonography. These studies show a large variation in terms of bowel preparation used, luminal contrast agents and imaging features. However in general, two main strategies can be identified for the visualisation of the colonic lumen and wall, i.e. the bright lumen and the dark lumen strategy [2, 7].

To date, only one meta-analysis has provided an overall estimate of the diagnostic accuracy of MR-colonography for the diagnosis of colorectal masses [8]. However, that meta-analysis offered only a limited evaluation of detection at different polyp size thresholds. Additionally, it was performed in 2004 and mainly concerned earlier studies, which were conducted in relatively small population cohorts. Considering the number of studies performed since 2004 and the rapid developments and associated progress in the MRI field, an update seems warranted.

Therefore the primary aim of this study was to perform a systematic review and meta-analysis of the diagnostic accuracy of MR-colonography compared with the reference standard (colonoscopy) for the detection of colorectal lesions, with a special interest in different polyp size thresholds. Our secondary aim was to assess the methodological quality and accuracy of reporting of the available primary studies by using the QUADAS tool and thereby to propose future reporting recommendations.

Materials and methods

Literature search

A computer-assisted literature search was performed of the MEDLINE, EMBASE and Cochrane databases for relevant publications on the accuracy of MR-colonography in detecting colorectal lesions (see Appendix). We searched the databases for publications dating from May 1997, when MR-colonography was first described [2], to February 2009. There were no language restrictions. One observer (FZ) assessed the title and/or abstract of all retrieved papers to identify relevant articles for inclusion. Papers were considered ineligible if the title or abstract indicated that the paper was irrelevant, did not meet all the inclusion criteria or met any of the exclusion criteria. Reference lists of review articles and papers selected for inclusion were checked by hand to identify other relevant papers. The eligible articles were retrieved as full-text articles and independently checked by two reviewers (FZ, SB) against the inclusion and exclusion criteria.

Inclusion and exclusion criteria

Full prospective reports in which subjects at average or increased risk of CRC underwent 1.5-T or 3.0-T MR-colonography and completed colonoscopy for verification were considered for inclusion. Furthermore eligible studies needed to focus on the detection of colorectal polyps and CRC, irrespective of histological findings. Inclusion criteria also required the construction of 2 × 2 tables, either by extracting true-positive (TP), false-negative (FN), false-positive (FP) and true-negative (TN) values or by reconstructing from sensitivity and specificity values.
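The reconstruction of a 2 × 2 table from reported sensitivity and specificity values can be sketched as follows. This is a minimal illustration, not the procedure used by the reviewers: it assumes the report also states how many patients had and did not have the target lesion at colonoscopy, and the function name and example numbers are hypothetical. Because published percentages are usually rounded, the rebuilt counts should be checked against any totals given in the paper.

```python
def reconstruct_2x2(sensitivity, specificity, n_diseased, n_healthy):
    """Rebuild TP, FN, FP and TN counts from reported accuracy values.

    Assumption: n_diseased and n_healthy (patients with and without the
    target lesion at the reference standard) are given in the report.
    """
    tp = round(sensitivity * n_diseased)   # detected lesions
    fn = n_diseased - tp                   # missed lesions
    tn = round(specificity * n_healthy)    # correctly negative patients
    fp = n_healthy - tn                    # false alarms
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical example: 20 patients with polyps, 80 without
table = reconstruct_2x2(0.80, 0.95, n_diseased=20, n_healthy=80)
print(table)
```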

Studies that reported any diagnosis other than colorectal polyps and/or CRC or in which the accuracy of detecting colorectal polyps could not be extrapolated from the paper were excluded from our study. In addition, studies with fewer than 10 patients were excluded. If there was any suspicion of a duplicate study, with a noticeable overlap of the study population, the most recent study with the largest population cohort was considered for inclusion. Disagreement between the two reviewers regarding inclusion and exclusion criteria was resolved by consensus. If a primary study was considered for inclusion but additional information was required because of the incompleteness of the data sets, the corresponding author was contacted.

Study characteristics

Methodological quality assessment and relevant data extraction were independently performed by the same two reviewers using a standardised form. In the event of disagreement, a decision was made by consensus. No blinding to the authors’ information, publication year or journal title was applied.

Study quality assessment

To assess the methodological quality of the included studies and to detect potential bias, ten relevant items (a–j) of the Quality Assessment of Diagnostic Accuracy Studies in Systematic Reviews (QUADAS) tool were used [9]. We focused on the qualitative assessment of the included study population, index test and reference test. Therefore we assessed study population characteristics, such as number of included subjects; definition of potential CRC risk factors; whether subjects were consecutively recruited; mean or median age with age range and sex distribution (a). We determined whether a clear description of selection criteria was reported (b) and whether an accurate reference test (i.e. colonoscopy) was used (c). In addition, we determined the possibility of a disease progression bias. Therefore we documented the time interval between MR-colonography and colonoscopy, assuming that the index test always preceded the reference test (maximum time-interval 4 weeks) (d). To exclude the possibility of a partial verification bias, we assessed whether the whole sample or a random selection of the sample received verification by means of colonoscopy: we accepted a sample of at least 90% receiving the reference test as complete verification (e). Furthermore we recorded if a clear description was given for the execution of the index test (f), if the index test findings were interpreted without knowledge of the reference standard (g), and if the reference standard was potentially adjusted by the index test findings (e.g. segmental unblinding, reassessment) (h). Additionally we investigated whether intermediate test results were reported in the included studies (i) and if withdrawals were reported (j).

Imaging features

If available, the following characteristics were documented regarding bowel preparation methods: (a) type of bowel preparation; (b) in the case of limited bowel preparation methods specification of contrast material used; (c) type of dietary restrictions; (d) type of colonic luminal contrast method applied; (e) amount of enema and pressure used if recorded; (f) amount and type of spasmolytic drugs, if administered. In addition we recorded the following MR imaging characteristics: (g) magnetic field strength; (h) intravenous paramagnetic contrast material used; (i) imaging parameters (e.g. acquisition time and imaging plane); (j) imaging procedure positions; and (k) total examination time.

Imaging analysis

The following data regarding image analysis, data handling and the reference standard were extracted from the selected studies, if available: (a) image quality assessment (evaluation regarding bowel distension, motion artefacts, lumen homogeneity); (b) type of data interpretation (two-dimensional reading (2D), three-dimensional reading (3D) or both); (c) number of observers; (d) definition of observer experience; (e) definition of consensus reading in the case of multiple observers; (f) review time; and (g) histological findings.

Data extraction

For each report, we attempted to construct a 2 × 2 contingency table, consisting of TP, FN, FP and TN values for per patient analysis purposes. For the per patient analysis, 2 × 2 tables were constructed for patients with any polyp (irrespective of size) and patients with large polyps (10 mm or larger).

For per polyp analyses TP and FN values were extracted or reconstructed from each of the included studies. We attempted to stratify the extracted data into three different polyp size thresholds that are generally applied in colonography literature, based on the associated potential CRC risk [10]. Small polyps are generally defined as polyps measuring less than 6 mm, medium polyps measure between 6 and 9 mm, and large polyps have a size of 10 mm or larger. Additionally we attempted to extract subset analysis of adenomas and CRC, if data were available.
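The size stratification described above can be written as a small helper. This is only a sketch under the assumption that polyp sizes are recorded in whole millimetres; the function name is hypothetical, and a non-integer size such as 9.5 mm would need a rule that the text does not specify (here it would fall into the medium group).

```python
def polyp_size_category(size_mm):
    """Map a polyp size in mm to the three thresholds used in the text."""
    if size_mm < 6:
        return "small"    # less than 6 mm
    if size_mm < 10:
        return "medium"   # 6-9 mm
    return "large"        # 10 mm or larger


for size in (4, 6, 9, 10, 25):
    print(size, polyp_size_category(size))
```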

Data analysis

Per patient analysis

Per study we constructed 2 × 2 contingency tables for MR-colonography compared with the reference standard and calculated sensitivity as TP/(FN+TP) and specificity as TN/(FP+TN). For the assessment of heterogeneity the I² test statistic was used. The I² statistic is a measure of inconsistency describing the percentage of total variation between studies that is due to heterogeneity, with larger percentages indicating increasing heterogeneity [11]. In the case of an I² value larger than 75%, we assumed that data were significantly heterogeneous; consequently no data pooling was performed.
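As an illustration of these quantities, the sketch below computes sensitivity, specificity and an I² value from per-study counts. The text does not state on which scale I² was calculated, so the choices here are assumptions: proportions are logit-transformed with a 0.5 continuity correction, Cochran's Q is computed with inverse-variance weights, and I² is taken as (Q − df)/Q, truncated at zero. The function names are hypothetical.

```python
import math


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def logit_and_variance(events, non_events):
    # 0.5 continuity correction (an assumption) avoids log(0) and 1/0
    e, n = events + 0.5, non_events + 0.5
    return math.log(e / n), 1.0 / e + 1.0 / n


def i_squared(pairs):
    """I² in percent; pairs holds (events, non_events), e.g. (TP, FN) per study."""
    est = [logit_and_variance(e, n) for e, n in pairs]
    weights = [1.0 / v for _, v in est]
    pooled = sum(w * y for w, (y, _) in zip(weights, est)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, est))
    df = len(pairs) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical (TP, FN) counts from three studies
print(sensitivity(16, 4), i_squared([(16, 4), (30, 10), (9, 11)]))
```

An I² above 75% computed this way would correspond to the "no pooling" case described in the text.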

In all other cases (I² values less than 75%), we used the following bivariate statistical models to summarise results for meta-analysis: the random effects model (random for both sensitivity and specificity, both I² values between 25% and 75%), the fixed effects model (homogeneous for both sensitivity and specificity, both I² values less than 25%) or the mixed effects model (e.g. random for sensitivity and fixed for specificity, one I² value less than 25% and the other I² value between 25% and 50%).

The bivariate effects model [12] was used to summarise estimates of sensitivity and specificity with 95% confidence intervals. In this bivariate effects model, the logit-transformed sensitivities and logit-transformed specificities are assumed to follow a bivariate normal distribution across studies around a mean logit-sensitivity and mean logit-specificity, and therefore mean logit-sensitivity and mean logit-specificity with corresponding standard errors were obtained. After antilogit transformation, summary estimates of sensitivity and specificity with their 95% confidence intervals (CIs) were obtained.
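The final back-transformation step can be sketched as follows. This does not fit the bivariate model itself; it only shows, under the assumption of a normal approximation on the logit scale, how a mean logit and its standard error yield a summary estimate with a 95% confidence interval. The function names and input values are hypothetical.

```python
import math


def antilogit(x):
    """Inverse of the logit transform: maps a logit back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def summary_with_ci(mean_logit, se_logit, z=1.96):
    """Return (estimate, lower, upper) on the proportion scale."""
    lower = antilogit(mean_logit - z * se_logit)
    upper = antilogit(mean_logit + z * se_logit)
    return antilogit(mean_logit), lower, upper


# Hypothetical mean logit-sensitivity and standard error
estimate, lower, upper = summary_with_ci(2.0, 0.6)
print(f"{estimate:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Because the interval is built on the logit scale and then transformed, it always stays between 0 and 1 and is asymmetric around the estimate when the estimate is close to either bound.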

Per polyp analysis

For each threshold per study we calculated sensitivity as TP/(FN+TP). The I² test statistic was used to quantify heterogeneity for sensitivity in percentages. In the case of an I² value larger than 75% no data pooling was performed. In all other cases, we used either univariate random effects (I² values between 25% and 75%) or univariate fixed effects models (I² values less than 25%) to obtain summary estimates of sensitivity for meta-analysis. All analyses were executed using SAS software (SAS 9.2, PROC NLMIXED; SAS Institute, Cary, NC, USA).
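The per polyp pooling rule can be expressed as a simple decision function. This is a sketch of the rule as stated, not the SAS code used for the analysis; the function name is hypothetical, and the handling of values exactly equal to 25% or 75% is an assumption, since the text does not say to which group the boundaries belong.

```python
def choose_univariate_model(i2_percent):
    """Pick the pooling approach for per polyp sensitivity from I² (in %)."""
    if i2_percent > 75:
        return "no pooling"        # data considered too heterogeneous
    if i2_percent >= 25:           # boundary handling is an assumption
        return "random effects"
    return "fixed effects"


for i2 in (81, 51, 10):
    print(i2, choose_univariate_model(i2))
```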

Results

Search characteristics

We retrieved 353 articles on the initial search. After screening based on title and abstract, 316 papers were excluded from our study. Main considerations for rejection were duplicate studies (identical studies in MEDLINE, EMBASE and Cochrane databases), study design (e.g. review, letters and comments) and non-related topic (e.g. IBD, CT-colonography) (Fig. 1). Thirty-seven papers were considered for inclusion and the full-text papers were retrieved.

Fig. 1

Flow chart indicating selection of articles included for analysis (and potentially relevant studies that were excluded by reviewers [13–36])

Study design characteristics

Thirteen studies met all predefined criteria and were included in this study (Fig. 1). Study design characteristics of all included studies are outlined in Table 1. Ten selected studies [38–45, 48, 49] provided a clear description of the study population included, and in three studies [37, 46, 47] no indications of referral to colonoscopy were provided. In one study insufficient information was provided [37] regarding the time period between index test and the reference standard. In the remaining studies, colonoscopy was performed after MR-colonography within a time interval ranging from same day performance [40, 41, 44–49] to a maximum of 4 weeks [42]. MR-colonography findings were presented by segmental unblinding in four studies [37, 42, 45, 49]. Of the remaining studies one reported on a potential colonoscopic reassessment in the case of inconsistencies between MR-colonography and colonoscopy findings [38], two studies did not describe any details of colonoscopy (un)blinding methods [46, 48] and in six studies the gastroenterologist was unaware of the MR findings during the complete colonoscopy procedure. Uninterpretable results of MR-colonography were reported in eight studies [37–39, 42, 44–47]. In general, 11 studies fulfilled at least eight methodological criteria.

Table 1 Quality assessment of included studies using relevant items of the QUADAS tool

Patient characteristics

Patient characteristics are outlined in Table 2. In this meta-analysis we included 13 studies with in total 1,285 patients. Five studies reported a study population of more than 100 patients [38, 42, 45, 47, 48] and these studies comprised 908 (71%) patients of the total study population. The largest study population included 315 asymptomatic individuals with a normal risk profile for CRC [42]. In nine studies [38–41, 43–45, 48, 49] symptomatic and/or asymptomatic patients at increased risk of CRC were included, and in three studies [37, 46, 47] indications for colonoscopy were unclear.

Table 2 Study characteristics of included studies

Imaging features and image analysis

MR imaging features are outlined in Tables 3 and 4. Most of the studies reported all relevant data. However, there was variation in the bowel preparation as well as in the applied technical parameters. Dark-lumen MR-colonography was reported in nine studies [37, 39–45, 49] and a water-based enema and intravenous paramagnetic contrast administration were used in eight of these studies (89%).

Table 3 MR imaging characteristics of included studies
Table 4 MRI technical parameters of included studies

Individual reader experience was defined in four studies ranging from 40 cases [38] to over 50 [48, 49]. In two other studies [39, 40], reader experience was defined as more than 4 years [39] or 5–15 years’ [40] clinical experience with abdominal MRI; however, no proven competence was shown for reading MR-colonography in these studies (Table 5).

Table 5 Image analysis characteristics

Data extraction

For each included study we were able to construct 2 × 2 contingency tables of the extracted determinates. However no standard format of data presentation was found, as per patient data were not reported for each threshold. In 11 studies [39–49] per patient reporting concerned at least overall results, which included polyps of all sizes. In six of these studies overall results were presented with at least one additional threshold of per patient polyp data. In two studies per patient data were presented as sensitivity and specificity stratified to medium- and large-sized polyps combined (6 mm or larger) and polyps of 10 mm or larger; however, the overall polyp data were missing [37, 38]. Corresponding authors were contacted in order to obtain overall polyp data (including those for polyps smaller than 6 mm), and all supplied us with the required data. Per polyp data for each of the different polyp size categories could be obtained in 5 of the 13 studies (38%).

Data analysis

Per patient analysis

Inter-study heterogeneity (I²) for the detection of patients with polyps, irrespective of size, was significant for sensitivity (86%; 95% CI 79–91%) and proved moderate for specificity (58%; 95% CI 28–76%). Therefore, calculating summary estimates of sensitivity and specificity for the detection of all polyps was not sensible in this context (Fig. 2a). Outcomes for the detection of patients with large polyps (10 mm or larger) were available in six studies comprising 927 patients (72%). The I² percentage for the sensitivities was 37% (95% CI 10–63%) and for the specificities 60% (95% CI 17–80%). Because of this low to moderate heterogeneity, per patient data for polyps of 10 mm or larger were analysed with the use of a random effects approach. The per patient summary estimates of sensitivity and specificity for this polyp size threshold were 88% (95% CI 63–97%) and 99% (95% CI 95–100%), respectively (Fig. 2b).

Fig. 2

a Forest plot of per patient sensitivity and specificity, including sensitivity and specificity estimates, for all polyps. FN false-negative, FP false-positive, TN true-negative, TP true-positive values. Lauenstein (2005) compared two different sequences in the same study population [44]; results of both sequences are used for calculating sensitivity and specificity estimates. Heterogeneity (I²) between study results for sensitivities was 86% (CI 79–91%) and for specificities 58% (CI 28–76%). b Forest plot of per patient sensitivity and specificity, including pooled sensitivity and specificity, for polyps 10 mm or larger. Heterogeneity (I²) between study results for sensitivities was 37% (CI 10–63%) and for specificities 60% (CI 17–80%)

Per polyp analysis

For per polyp data, using the I² test statistic we found significant heterogeneity for polyps smaller than 6 mm (81%; 95% CI 65–90%) and polyps 6–9 mm (80%; 95% CI 62–89%), which impeded a reasonable meta-analysis for these two thresholds. Individual sensitivities for polyps smaller than 6 mm are presented in Fig. 3a. Individual per polyp sensitivities for polyps 6–9 mm were based on the data of six studies comprising 204 polyps (Fig. 3b). The I² for the sensitivity of polyps 10 mm or larger was 51% (95% CI 8–74%). For polyps 10 mm or larger the mean sensitivity estimate was 84% (95% CI 66–94%) (Fig. 3c); it was based on the results of 145 polyps of 10 mm or larger and obtained by the random effects approach. Reported individual detection rates of MR-colonography for CRC were 100%, comprising 32 carcinomas in 5 studies (Fig. 4). In two studies an additional subanalysis for adenomas was performed [40, 42]. Per patient sensitivity for detecting adenomatous polyps 10 mm or larger in these studies was 100% and 87%, respectively.

Fig. 3

Forest plot of per polyp sensitivity, including pooled per polyp sensitivity, for polyps smaller than 6 mm (a), polyps 6–9 mm (b) and polyps 10 mm or larger (c). FP false-positive, TN true-negative values. Heterogeneity (I²) among study results for sensitivities for polyps smaller than 6 mm was 81% (CI 65–90%); polyps 6–9 mm, 80% (CI 62–89%); and polyps 10 mm or larger, 51% (CI 8–74%)

Fig. 4

Forest plot of sensitivity of MR-colonography in the detection of CRC. Lauenstein (2005) and Saar (2008) compared two different sequences in the same study population [44, 49]; results of both sequences are outlined in this forest plot

Discussion

Our systematic review demonstrates an average per patient sensitivity of 88% (95% CI 63–97%) and specificity of 99% (95% CI 95–100%) for the detection of large polyps (10 mm or larger) with the use of MR-colonography. The sensitivity of MR-colonography in detecting CRC was 100%. At per polyp analysis, a summary sensitivity estimate of MR-colonography in detecting polyps 10 mm or larger was acceptable (84%; 95% CI 66–94%). Additionally, substantial variation is shown in data reporting between studies, as no standard format is used for presenting both per patient and per polyp results.

Important variability between study results was shown in sensitivity and specificity values, which is reflected by the high I² values (above 75%) for overall per patient data and for per polyp data in the detection of polyps smaller than 6 mm and polyps 6–9 mm, and which impeded a rational meta-analysis for these categories. This heterogeneity might be a consequence of a prominent diversity in technical aspects, as no consensus has been achieved regarding important study elements. This contrasts with CT-colonography, for which, because of the rapid development of this technique, a consensus statement has been established [50, 51]. Halligan et al. [52] proposed a minimum data set for study-level reporting for CT-colonography in order to improve the quality of reporting in this field. The most obvious measure is to adopt similar reporting of study characteristics to those of CT-colonography as far as possible. Still, compared with CT-colonography, research on MR-colonography is rather limited and, more importantly, to date no consensus has been achieved regarding imaging aspects. Therefore similar recommendations to those applied in CT-colonography can only be achieved for certain aspects of MR-colonography.

In a substantial number of included studies, important demographic characteristics could often not be derived from the available dataset after withdrawals were excluded from the initial included population. Regarding the description of the presence of risk factors for CRC in the study cohort, reports were detailed. Most primary studies included patients at increased risk of colorectal polyps, which leads to a higher prevalence of abnormalities and will ultimately result in better diagnostic outcomes [53]. One study exclusively reports on a screening population consisting of 315 individuals at no increased risk of CRC, and the overall prevalence of clinically relevant abnormalities in this cohort was 6.3% [42]. It should be stated that a detailed description regarding demographic characteristics and potential risk factors for CRC is required for study reporting. Moreover complete description of the data collection (prospective, retrospective) and participant sampling (consecutively) should be provided [54].

Similar to CT-colonography, the prerequisite for MR-colonography is a clean, well-distended colon with few residual faeces. Although the reported methods of achieving this baseline varied considerably, the technical specifications of the materials and methods used to perform MR-colonography were sufficiently described in all studies. Because of the small groups and heterogeneous data, we were not able to perform a formal subgroup analysis and therefore we are unable to propose recommendations regarding the application of specific MR-colonography techniques (i.e. dark lumen, bright lumen, bowel purgation, faecal tagging).

Six studies (46%) reported adequate determinates in order to calculate per patient sensitivity and specificity values for separate size thresholds. In two of these studies calculation could be performed for each of three different size thresholds with additional split analyses for adenomas. Although per patient analysis of overall polyp data was included in our statistical approach, we believe that, similar to CT-colonography, a reasonable minimum size for reported polyps is larger than 5 mm [51]. Therefore we recommend that per patient analysis be reported both stratified into thresholds of medium and large polyps (6–9 mm and 10 mm or larger, respectively) and combined. Additionally we propose reporting per polyp sensitivity results for polyps 6–9 mm and polyps 10 mm or larger, as this analysis clarifies the effective diagnostic performance of the test [55].

In CT-colonography, diagnostic performance is known to be closely related to the level of observers’ experience [56]. In our systematic review, most of the included studies insufficiently defined observer experience. The required level of experience has not yet been established for either MR-colonography or CT-colonography, but in CT-colonography 50 verified training cases have been specified as an absolute minimum [57]. MR-colonography is most likely more difficult to interpret than CT-colonography; therefore observer experience of just 40–50 validated MR-colonography cases, which was reported in several studies, is expected to be inadequate. As observer performance plays a substantial role in the measurement of accuracy, we recommend clearly describing the total number of observers per study together with a clear definition of the observers’ experience, quantified as the total number of verified cases interpreted.

To our knowledge, so far one meta-analysis has been carried out to evaluate the diagnostic performance of MR-colonography [8]. In that meta-analysis sensitivity and specificity estimates for the detection of polyps of all sizes combined were 75% and 96%, but the presence of significant heterogeneity between the different studies minimised the statistical value of these outcomes. Individual study sensitivities in our analysis differed markedly as well and hampered quantification of all extracted data. However to be confronted with statistical heterogeneity is almost unavoidable when performing a meta-analysis of diagnostic studies.

Despite the statistical heterogeneity, the clinical relevance of summarising the sensitivity estimate for the detection of polyps of all sizes, as calculated in the previous meta-analysis, is limited, and reporting per relevant size category is far more informative. Importantly, in the present study we were able to perform additional analysis for the clinically most relevant polyp size threshold on both a per patient and a per polyp basis. This was based on a considerable albeit not large number of polyps 10 mm or larger (145 polyps in total).

Furthermore, the previous meta-analysis included eight earlier comparative studies (1998–2004) with similar inclusion criteria to those we have set in our study. However in three of these eight included studies [18, 21, 25], we were not able to extract important determinates (e.g. FP, TNs) and these were consequently excluded from our analysis. As no false-positive findings of any size were extracted for the primary studies concerned, the authors of the previous meta-analysis reported an excellent overall pooled specificity. In total we included three studies that were also evaluated in the previous meta-analysis [43, 45, 47], as we additionally excluded one study based on the use of 1.0-T field strength [34] and one study based on the inclusion of fewer than 10 patients [58].

A limitation of our study was the exclusion of seven studies due to the absence of per patient polyp data and consequently not meeting our inclusion criteria. This could potentially result in a selection bias and ultimately in biased diagnostic estimates. Therefore we would like to emphasise the importance of completeness in data reporting in comparative MR-colonography studies as this will facilitate future meta-analyses.

In our study publication bias is inevitable, regardless of attempts to use appropriate analytical approaches and execute a wide search without essential restrictions. However, we did not evaluate publication bias because much controversy remains about the applied statistical methods and outcomes in studies detecting publication bias [59].

In recent meta-analyses [52, 55, 60], high sensitivity estimates (ranging from 85 to 93%) were reported for detecting patients with large polyps by using CT-colonography. Our results seem comparable as the applied inclusion criteria do not differ considerably, and therefore MR-colonography might be regarded as a future competitive diagnostic tool for this category of colorectal polyps. In this context the principal advantage of MR-colonography is the absence of ionising radiation. However, one must bear in mind that currently MR-colonography is hampered by its limited availability, unfavourable cost-effectiveness and longer examination time.

Moreover, in order to compare the accuracy of these two techniques, a direct head-to-head comparison study would be preferable. To date one study has compared CT-colonography with MR-colonography in the detection of colorectal abnormalities in the same study population [19]. However, CT-colonography was not performed using a state-of-the-art technique, which makes it difficult to draw meaningful conclusions.

In conclusion, this systematic review shows that MR-colonography can play a role in the detection of large colorectal polyps in patients at increased risk of CRC. More research is needed to define its role in the detection of medium-sized polyps in this population, as this is far from established to date. Sizeable prospective screening studies using state-of-the-art technique are warranted for this purpose. During our analysis we found little uniformity in the methods used with regard to MR-colonography and data reporting. Ultimately, this leads to considerable heterogeneity, and therefore we propose reporting recommendations regarding crucial study design characteristics (i.e. definition of observer experience in MR-colonography, standardised per patient and per polyp data presentation) for future studies. The methodology as used in CT-colonography studies can serve as a framework for new MR-colonography studies.