Molin Wang, Aya Kuchiba, Shuji Ogino, A Meta-Regression Method for Studying Etiological Heterogeneity Across Disease Subtypes Classified by Multiple Biomarkers, American Journal of Epidemiology, Volume 182, Issue 3, 1 August 2015, Pages 263–270, https://doi.org/10.1093/aje/kwv040
Abstract
In interdisciplinary biomedical, epidemiologic, and population research, it is increasingly necessary to consider pathogenesis and inherent heterogeneity of any given health condition and outcome. As the unique disease principle implies, no single biomarker can perfectly define disease subtypes. The complex nature of molecular pathology and biology necessitates biostatistical methodologies to simultaneously analyze multiple biomarkers and subtypes. To analyze and test for heterogeneity hypotheses across subtypes defined by multiple categorical and/or ordinal markers, we developed a meta-regression method that can utilize existing statistical software for mixed-model analysis. This method can be used to assess whether the exposure-subtype associations are different across subtypes defined by 1 marker while controlling for other markers and to evaluate whether the difference in exposure-subtype association across subtypes defined by 1 marker depends on any other markers. To illustrate this method in molecular pathological epidemiology research, we examined the associations between smoking status and colorectal cancer subtypes defined by 3 correlated tumor molecular characteristics (CpG island methylator phenotype, microsatellite instability, and the B-Raf protooncogene, serine/threonine kinase (BRAF), mutation) in the Nurses' Health Study (1980–2010) and the Health Professionals Follow-up Study (1986–2010). This method can be widely useful as molecular diagnostics and genomic technologies become routine in clinical medicine and public health.
Based on the underlying premise that individuals with the same disease name have similar etiologies and disease evolution, epidemiologic research typically aims to investigate the relationship between exposure and disease. With the advancement of biomedical sciences, it is increasingly evident that many human disease processes comprise a range of heterogeneous molecular pathological processes, modified by the exposome (1). Molecular classification can be utilized in epidemiology because individuals with similar molecular pathological processes likely share similar etiologies (2). Pathogenic heterogeneity has been considered in various neoplasms such as endometrial (3), colorectal (3–20), and lung (21–24) cancers, as well as nonneoplastic diseases such as stroke (25), cardiovascular disease (26), autism (27), infectious disease (28), autoimmune disease (29), glaucoma (30), and obesity (31).
New statistical methodologies to address disease heterogeneity are useful in not only molecular pathological epidemiology (MPE) (32) with bona fide molecular subclassification but also epidemiologic research that takes other features of disease heterogeneity (e.g., lethality, disease severity) into consideration. There are statistical methods for evaluating whether the association of an exposure with disease varies by subtypes that are defined by categorical (33–36) or ordinal (33–35) subclassifiers (M.W., unpublished manuscript, 2015); the published methods by Chatterjee (33), Chatterjee et al. (34), and Rosner et al. (35) apply to cohort studies, and the method by Begg et al. (36) focuses on case-control studies. For simplicity, we use the term “categorical variable” (or the adjective “categorical”) when referring to “nonordinal categorical variable” throughout this paper.
Given the complexity of molecular pathology and pathogenesis indicated by the unique disease principle (1), no single biomarker can perfectly subclassify any disease entity. Notably, molecular disease markers are often correlated (37). For example, in colorectal cancer, there is a strong association between high-level microsatellite instability (MSI) and high-level CpG island methylator phenotype (CIMP) and between high-level CIMP and the B-Raf protooncogene, serine/threonine kinase (BRAF), mutation (38).
Cigarette smoking has been associated with the risk of high-level MSI colorectal cancer (16–18, 20, 39–42), high-level CIMP colorectal cancer (17, 20, 42, 43), and BRAF-mutated colorectal cancer (17, 19, 20, 42). Given the correlations between these molecular markers, the association of smoking with a subtype defined by 1 marker may solely (or in part) reflect the association with a subtype defined by another marker. Thus, it remains unclear which molecular marker subtypes are primarily differentially associated with smoking, and how a marker can confound the association between smoking and subtypes defined by other markers. Although the published methods (33–35) are useful to analyze the exposure-subtype associations according to multiple subtyping markers in cohort studies using existing statistical software, analysis using those methods can become computationally infeasible in large data sets. In this article, we present an intuitive and computationally efficient biostatistical method for the analysis of disease and etiological heterogeneity when there are multiple disease subtyping markers (categorical and/or ordinal), which are possibly, but not necessarily, correlated.
METHODS
Cohort and nested case-control studies
One-stage method
Two-stage method
Unmatched case-control study
Interaction between markers
The adjusted proposed by Rosner et al. (35) can also be estimated in models 3 and 4. For example, suppose there are 2 binary markers, cross-classification of which defines 4 subtypes, and the second-stage model of the fixed-effects meta-regression method includes a main-effect term for each marker, with coefficient γp for the pth marker, p = 1, 2. Then γp represents the difference in exposure-disease subtype associations between the 2 subtypes defined by the pth marker while the level of the other marker is the same. The meta-regression method can also be used to evaluate whether the difference in exposure-disease subtype association across the subtypes defined by 1 marker depends on the level of another marker by including appropriate interaction terms for these markers in the meta-regression model. For example, in a second-stage fixed-effects model that also includes the interaction term between the 2 markers, with coefficient γ3, rejection of the null hypothesis H0 : γ3 = 0 implies that the difference in exposure-disease subtype associations across the subtypes defined by the first marker depends on the level of the second marker. The discussion above, which is for the fixed-effects 2-stage method, can be easily extended to the random-effects method.
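As a sketch of the 2-marker example above, the second-stage fixed-effects models might be written as follows. The marker indicators s_{1j} and s_{2j} (equal to 1 when subtype j is at the higher level of marker 1 or 2), the intercept γ0, and the use of a hat for the first-stage estimate are notation assumed here for illustration; only γ1, γ2, γ3, and the error term follow the text.

```latex
% Main-effects model: \gamma_p is the difference in association across
% levels of marker p, holding the other marker fixed.
\hat{\beta}_{1j} = \gamma_0 + \gamma_1 s_{1j} + \gamma_2 s_{2j} + e_j,
\qquad j = 1, \dots, 4.

% Interaction model: the difference across levels of marker 1 is
% \gamma_1 when s_{2j} = 0 and \gamma_1 + \gamma_3 when s_{2j} = 1.
\hat{\beta}_{1j} = \gamma_0 + \gamma_1 s_{1j} + \gamma_2 s_{2j}
                 + \gamma_3 s_{1j} s_{2j} + e_j,
\qquad H_0 : \gamma_3 = 0.
```

Under this notation, testing H0 : γ3 = 0 asks whether the marker-1 difference is the same at both levels of marker 2.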
Categorical exposures and multiple exposures
Let β1j = (β1j1, …, β1jK), K > 1, represent the subtype-specific exposure-disease association corresponding to binary indicators created for a categorical exposure with K + 1 levels, or multiple exposures, 1 or more of which could be categorical exposures, for which binary indicators are created. The first-stage analysis of the 2-stage method, which is the subtype-specific analysis for each cross-classified subtype, is the same as in the cases when β1j is scalar. At the second stage, 1 strategy is to conduct the meta-regression analysis for each element of β1j separately. For the kth element of β1j, the random-effects meta-regression model or the fixed-effects meta-regression model, which does not include the random-effects term bjk, may be used to characterize the relationship between β1jk and levels of the multiple markers. For any given k, in cohort and nested case-control studies, ejk's, j = 1, …, J, are independent, and in unmatched case-control studies,
EXAMPLE
To illustrate the proposed meta-regression method for multiple markers, we examine the associations between smoking status (never, former, current) and 8 possible colorectal cancer subtypes defined by 3 binary markers: CIMP (high vs. low/negative), MSI (high vs. microsatellite stable (MSS)), and BRAF (mutant vs. wild type). Smoking status is coded as 0 for never, 1 for former, and 2 for current, and the trend association is examined. The analysis includes 88,620 women in the Nurses’ Health Study (NHS), followed from 1980 to 2010, and 46,251 men in the Health Professionals Follow-up Study (HPFS), followed from 1986 to 2010, with 3,099,586 person-years of follow-up. In each cohort, 1 subtype with fewer than 5 cases (low-level/negative CIMP, high-level MSI, mutated BRAF) was excluded, leading to a total of 1,118 colorectal cancer cases (654 women in NHS and 464 men in HPFS) in the remaining 7 subtypes.
In the first stage of the 2-stage meta-regression approach, a subtype-specific multivariate Cox model analysis, stratified by age (months) and calendar year of the questionnaire cycle and adjusted for potential confounders, was performed for each cohort. Table 1 contains subtype definitions, subtype-specific case numbers, and the estimated smoking status-colorectal cancer subtype associations in the NHS and HPFS. In the second-stage analysis, we modeled the subtype- and cohort-specific log(RR) using the 3 markers considered (MSI, CIMP, and BRAF) and cohort (NHS vs. HPFS) and compared the results with those from the 1-stage method (33–35); in the 1-stage method, we conducted the Cox model analysis for each cohort using the data duplication method and then combined the estimates from NHS and HPFS by the fixed-effects meta-analysis approach. Table 2 shows inferences for the exponentiated coefficients of the marker variables in the model for log(RR), which represent the ratios of RRs between marker levels. For example, based on the meta-regression method, the estimated ratio of the RR for the association of smoking with high-level CIMP colorectal cancer over the RR for low-level/negative CIMP colorectal cancer, while the MSI and BRAF levels stay the same, was 1.23 (95% confidence interval: 0.84, 1.82). As shown in Table 2, the results from these 2 methods were consistent. The results from this analysis suggest that we do not have sufficient statistical evidence to conclude that the smoking-colorectal cancer subtype associations are different across subtypes defined by any 1 of the biomarkers (MSI, CIMP, and BRAF) while controlling for the other 2 biomarkers.
Table 1. Subtype definitions, subtype-specific case numbers, and estimated smoking status-colorectal cancer subtype associations in the NHS and HPFS (a)
Subtype | CIMP | MSI | BRAF | No. of Cases | RR | 95% CI (b) | P Value (b) |
---|---|---|---|---|---|---|---|
1 | L/N | MSS | Wild type | 832 | 1.12 | 1.01, 1.25 | 0.039 |
2 | L/N | MSS | Mutant | 47 | 0.86 | 0.54, 1.37 | 0.53 |
3 | L/N | High | Wild type | 42 | 1.35 | 0.80, 2.25 | 0.26 |
4 | High | MSS | Wild type | 34 | 1.28 | 0.71, 2.32 | 0.41 |
5 | High | MSS | Mutant | 31 | 1.00 | 0.57, 1.78 | 0.99 |
6 | High | High | Wild type | 43 | 1.93 | 1.18, 3.14 | 0.008 |
7 | High | High | Mutant | 95 | 1.45 | 1.05, 2.00 | 0.026 |
Abbreviations: BRAF, B-Raf protooncogene, serine/threonine kinase; CI, confidence interval; CIMP, CpG island methylator phenotype; L/N, low/negative; MSI, microsatellite instability; MSS, microsatellite stable; RR, relative risk.
a The analysis includes only subtypes with ≥5 cases. The subtype-specific analyses were controlled for body mass index expressed as weight (kg)/height (m)² (<25, 25–29.9, ≥30), family history of colorectal cancer (yes/no), physical activity in metabolic equivalent tasks (quintiles), red meat intake (quintiles of servings/day), alcohol consumption (0, quartiles of g/day), total caloric intake (quintiles of calories/day), regular aspirin use (2 or more tablets/week or at least 2 times/week or less) and stratified by age (months) and calendar year. Postmenopausal hormone use (never/ever) was also adjusted for in the Nurses’ Health Study.
b The cohort-specific estimates were combined by using a fixed-effects meta-analysis method.
Table 2. Ratios of relative risks between marker levels from the 2-stage and 1-stage approaches
Marker | Two-Stage RRR | Two-Stage 95% CI | Two-Stage P Value | One-Stage RRR | One-Stage 95% CI | One-Stage P Value |
---|---|---|---|---|---|---|
CIMP | 1.23 | 0.84, 1.82 | 0.29 | 1.28 | 0.87, 1.88 | 0.21 |
MSI | 1.34 | 0.93, 1.91 | 0.11 | 1.31 | 0.92, 1.87 | 0.13 |
BRAF | 0.78 | 0.55, 1.09 | 0.14 | 0.78 | 0.56, 1.10 | 0.16 |
Abbreviations: BRAF, B-Raf protooncogene, serine/threonine kinase; CI, confidence interval; CIMP, CpG island methylator phenotype; MSI, microsatellite instability; RRR, ratio of relative risks.
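The second-stage fixed-effects meta-regression can be sketched in a few lines as inverse-variance weighted least squares. The sketch below is an illustration under stated assumptions, not the authors' SAS implementation: it uses the cohort-combined RRs and confidence intervals from Table 1 (the paper's second stage used cohort-specific estimates and included a cohort term), it recovers standard errors from the 95% confidence limits, and the function name `fixed_effects_meta_regression` is hypothetical.

```python
import numpy as np

# Inputs transcribed from Table 1 (cohort-combined estimates):
# (CIMP high, MSI high, BRAF mutant, RR, lower 95% limit, upper 95% limit)
rows = [
    (0, 0, 0, 1.12, 1.01, 1.25),
    (0, 0, 1, 0.86, 0.54, 1.37),
    (0, 1, 0, 1.35, 0.80, 2.25),
    (1, 0, 0, 1.28, 0.71, 2.32),
    (1, 0, 1, 1.00, 0.57, 1.78),
    (1, 1, 0, 1.93, 1.18, 3.14),
    (1, 1, 1, 1.45, 1.05, 2.00),
]
data = np.array(rows, dtype=float)
markers = data[:, :3]                # marker indicators (second-stage covariates)
log_rr = np.log(data[:, 3])          # first-stage estimates on the log scale
# Assumption: SE recovered from the width of the 95% CI on the log scale.
se = (np.log(data[:, 5]) - np.log(data[:, 4])) / (2 * 1.96)


def fixed_effects_meta_regression(y, se, covariates):
    """Weighted least squares with known within-subtype variances."""
    X = np.column_stack([np.ones(len(y)), covariates])  # add intercept
    W = np.diag(1.0 / se**2)                            # inverse-variance weights
    cov = np.linalg.inv(X.T @ W @ X)                    # covariance of coefficients
    gamma = cov @ X.T @ W @ y
    return gamma, np.sqrt(np.diag(cov))


gamma, gamma_se = fixed_effects_meta_regression(log_rr, se, markers)
for name, g, s in zip(["CIMP", "MSI", "BRAF"], gamma[1:], gamma_se[1:]):
    lo, hi = np.exp(g - 1.96 * s), np.exp(g + 1.96 * s)
    print(f"{name}: RRR = {np.exp(g):.2f} (95% CI: {lo:.2f}, {hi:.2f})")
```

Because the inputs are rounded and already pooled across cohorts, the point estimates should fall close to the two-stage column of Table 2, but exact agreement is not expected.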
In a second analysis for illustrating the proposed meta-regression method, the first-stage analysis was the same as before, but in the second stage, we started from a model with all 3 markers, 2-way interactions of the markers, and cohort, and then used stepwise model selection with a cutoff P = 0.05 for entering or removing the variables. This analysis was for selecting covariates in the meta-regression model that are important for characterizing the subtype-specific exposure-disease association. Only MSI was in the final model (ratio of RR for high-level MSI vs. MSS = 1.38, 95% confidence interval: 1.07, 1.79; P value = 0.015).
DISCUSSION
When subtypes are defined by multiple categorical and/or ordinal markers, we propose a meta-regression method that is intuitive, does not need augmentation of the data set, and can be easily implemented using existing statistical software such as SAS procedures for mixed-model analysis. This meta-regression method can be used to test for etiological heterogeneity across multiple disease subtypes classified by multiple markers, to assess whether the exposure-disease subtype associations are different across subtypes defined by 1 marker while controlling for other markers, and to evaluate whether the difference in exposure-disease subtype association across subtypes defined by 1 marker depends on any of the other markers.
Addressing etiological heterogeneity by MPE research has relevance to disease prevention. As an example, we herein discuss smoking, colonoscopy, and colorectal cancer risk. Colonoscopy has been associated with lower colorectal cancer risk for up to 10 years after the procedure in individuals with average risk for developing colorectal cancer (52); however, it remains to be determined whether colonoscopy every 10 years is also effective for colorectal cancer prevention in high-risk individuals. A recent MPE study suggests that the preventive effect of colonoscopy may be weaker for high-level MSI colorectal cancer than for non–high-level MSI colorectal cancer (52). MPE studies (16–18, 20, 39–42) have also shown that smokers are susceptible to developing high-level MSI colorectal cancer. Taken together, these findings imply that colonoscopy may be less effective at preventing colorectal cancer in smokers than in nonsmokers. Hence, MPE research can help us move toward more personalized disease-prevention strategies.
In addition to heterogeneity between tumors across individuals, accumulating evidence has indicated heterogeneity within 1 tumor in 1 individual. An integrative concept (“the unique tumor principle”) on intra- and intertumor heterogeneity along with epidemiologic exposures has been discussed in detail (53). Though our current paper primarily addresses intertumor (or interindividual) heterogeneity, it is of interest to develop new statistical methodologies to address both intra- and intertumor heterogeneities in the future.
With advancements in biomedical technologies, molecular pathology tests are increasingly common in clinical practice as well as in epidemiologic studies (54–56). The MPE approach is useful not only for assessing the risk of developing disease but also for evaluating predictive biomarkers for intervention in a disease population (57). In the future, routine clinical molecular pathology data may be integrated into population-based disease registries and databases, and large-scale MPE studies may become routine research practice (58). Thus, our methodology will be widely useful.
We developed a user-friendly SAS macro, %stepmetareg, implementing this meta-regression method. It includes a stepwise selection procedure to select covariates considered in the meta-regression model that are important for characterizing the subtype-specific exposure-disease association, represented by β1j. The SAS macro can be obtained at the website http://www.hsph.harvard.edu/donna-spiegelman/software/.
This meta-regression method will be most useful in situations where the number of subtypes is relatively low; otherwise, the number of cases for each unique tumor subtype defined by cross-classification of the multiple markers may be too small to obtain stable estimates of each β1j. The minimum number of cases required for each tumor subtype for obtaining stable estimates of each β1j depends on the number of covariates in the first-stage model. A rule of thumb for the minimum events per covariate is 5–10. An advantage of the proposed 2-stage method for cohort studies is that β1j, j = 1, …, J, can be estimated separately, without using the data duplication method, which becomes computationally infeasible when the augmented data set becomes too large. In addition, the random-effects model has the advantage that it can incorporate additional heterogeneity between subtypes that cannot be explained by the given marker variables.
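To make the role of the random-effects term concrete, the sketch below estimates the between-subtype variance that the marker variables leave unexplained and refits the meta-regression with that variance added to each subtype's weight. Several things here are assumptions for illustration: the paper fits the random-effects model with mixed-model software, whereas this sketch swaps in a simple method-of-moments estimator; the inputs are the cohort-combined Table 1 values rather than cohort-specific estimates; and the names `moment_tau2` and `tau2` are hypothetical.

```python
import numpy as np

# (CIMP high, MSI high, BRAF mutant, RR, lower 95% limit, upper 95% limit),
# transcribed from Table 1.
rows = [
    (0, 0, 0, 1.12, 1.01, 1.25),
    (0, 0, 1, 0.86, 0.54, 1.37),
    (0, 1, 0, 1.35, 0.80, 2.25),
    (1, 0, 0, 1.28, 0.71, 2.32),
    (1, 0, 1, 1.00, 0.57, 1.78),
    (1, 1, 0, 1.93, 1.18, 3.14),
    (1, 1, 1, 1.45, 1.05, 2.00),
]
data = np.array(rows, dtype=float)
X = np.column_stack([np.ones(len(data)), data[:, :3]])  # intercept + 3 markers
y = np.log(data[:, 3])
# Assumption: within-subtype variance recovered from the 95% CI width.
v = ((np.log(data[:, 5]) - np.log(data[:, 4])) / (2 * 1.96)) ** 2


def moment_tau2(y, v, X):
    """Method-of-moments estimate of the residual between-subtype variance."""
    w = 1.0 / v
    W = np.diag(w)
    xtwx_inv = np.linalg.inv(X.T @ W @ X)
    gamma = xtwx_inv @ X.T @ W @ y           # fixed-effects fit
    resid = y - X @ gamma
    q_e = float(np.sum(w * resid**2))        # residual heterogeneity statistic
    n, p = X.shape
    denom = np.sum(w) - np.trace(xtwx_inv @ X.T @ W @ W @ X)
    tau2 = max(0.0, (q_e - (n - p)) / denom) # truncate at zero
    return tau2, q_e


tau2, q_e = moment_tau2(y, v, X)

# Random-effects refit: each subtype's variance is inflated by tau2.
W_star = np.diag(1.0 / (v + tau2))
cov_re = np.linalg.inv(X.T @ W_star @ X)
gamma_re = cov_re @ X.T @ W_star @ y
print(f"Q_E = {q_e:.2f} on {X.shape[0] - X.shape[1]} df, tau^2 = {tau2:.4f}")
print("Random-effects RRRs (CIMP, MSI, BRAF):", np.round(np.exp(gamma_re[1:]), 2))
```

When the residual statistic is below its degrees of freedom, the variance estimate truncates at zero and the random-effects fit coincides with the fixed-effects fit; a positive estimate would instead widen the confidence intervals for the marker coefficients.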
Disease subtype data are often missing in some proportion of cases. Chatterjee et al. (34) developed an estimating function method based on model 2 that can be used to handle missing subtype data under a missing-at-random assumption. That method can be used directly to handle missing subtype data for estimating β1j in the first stage of the 2-stage models. Statistical methods for handling missing marker data, which become covariate data in the second-stage model of the 2-stage method, may be developed through extension of existing methods for missing-covariate problems in mixed-model analysis; this is a topic of future research. Alternatively, we may use the conventional method of creating missing indicators for missing marker data, or impute the missing marker data based on regression models that link the marker data to covariates that contain information about them. When these methods are used, the 2-stage method with a random-effects meta-regression model could have the advantage of partially taking into account, through the random-effects term, the additional variability due to using missing indicators or imputed marker data; future research is needed on this topic.
In conclusion, in consideration of pathogenesis and etiological heterogeneity of disease, we developed a meta-regression method to study etiological heterogeneity across disease subtypes defined by multiple biomarkers. This method is useful in the emerging interdisciplinary field of molecular pathological epidemiology (32, 59). There is an increasing need to integrate molecular pathology and epidemiology to better understand disease etiologies and causalities (59–62). Our meta-regression method can be widely useful, as use of molecular pathology and genomic technologies is increasingly common in clinical medicine and public health.
ACKNOWLEDGMENTS
Author affiliations: Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts (Molin Wang, Shuji Ogino); Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts (Molin Wang); Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts (Molin Wang); Biostatistics Division, Center for Research Administration and Support, National Cancer Center, Tokyo, Japan (Aya Kuchiba); Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts (Shuji Ogino); and Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts (Shuji Ogino).
This work was supported by US National Institutes of Health grants (R01 CA151993 and R35 CA197735 to S.O., P01 CA87969 to S. E. Hankinson, UM1 CA186107 to M. J. Stampfer, P01 CA55075 and UM1 CA167552 to W. C. Willett, and P50 CA127003 to C. S. Fuchs); grants from The Paula and Russell Agrusa Fund for Colorectal Cancer Research (to C. S. Fuchs); and the Friends of the Dana-Farber Cancer Institute (to S.O.).
We deeply thank hospitals and pathology departments throughout the United States for generously providing us with tissue specimens. We also would like to thank the following state cancer registries for their help: Alabama, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Idaho, Illinois, Indiana, Iowa, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Nebraska, New Hampshire, New Jersey, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, Tennessee, Texas, Virginia, Washington, and Wyoming.
Conflict of interest: none declared.
REFERENCES
Author notes
Abbreviations: BRAF, B-Raf protooncogene, serine/threonine kinase; CIMP, CpG island methylator phenotype; HPFS, Health Professionals Follow-up Study; MPE, molecular pathological epidemiology; MSI, microsatellite instability; MSS, microsatellite stable; NHS, Nurses' Health Study; RR, relative risk.