|
|
||||||||
RESEARCH REPORT |
1 Section of Epidemiology and Biostatistics, Department of Obstetrics, Gynecology and Reproductive Sciences, University of Medicine and Dentistry of New Jersey, Robert Wood Johnson Medical School, 125 Paterson Street, New Brunswick, NJ 08901-1977; and 2 Division of Oral and Maxillofacial Radiology, Department of Diagnostic Sciences, University of Medicine and Dentistry of New Jersey, New Jersey Dental School, Newark, NJ;
* corresponding author, cande.ananth{at}umdnj.edu
| ABSTRACT |
|---|
|
|
|---|
KEY WORDS: clustered responses nested responses correlated binary responses alternating logistic regressions generalized estimating equations site-specific caries data
| INTRODUCTION |
|---|
|
|
|---|
Dental caries causes destruction of discrete tooth surfaces. Site-specific sampling (the tooth surface) results in not only clustered (correlated) but also nested (hierarchical) responses. One approach is to avoid the problem of clustered data by using the subject as the unit of analysis (Manji and Fejerskov, 1994). Most epidemiological and caries risk studies use the total number of decayed, missing, or filled teeth (DFMT) or surfaces (DFMS) as a subject-specific outcome. Another approach is to ignore the clustering and use ordinary logistic regression (OLR) to model caries risk. However, the variance of regression parameters from an OLR model will be imprecise and may lead to incorrect inferences.
There are several regression techniques for the analysis of clustered binary responses when modeling the marginal mean is the focus and the dependence structure is considered a nuisance. For such scenarios, first-order generalized estimating equations (GEE1) offer one elegant solution (Liang and Zeger, 1986; Prentice, 1988). However, when modeling the dependence structure is of scientific interest, second-order generalized estimating equations (GEE2), using the correlation coefficient (Prentice, 1988) or the odds ratio (Lipsitz et al., 1991) as the measure of pairwise dependence, provides an analytic solution. GEE2 methods have also been applied to marginal models for correlated binary responses with multiple classes and several levels of nested responses (Qagish and Liang, 1992; Podgor et al., 1996). However, when the cluster size gets large, the GEE2 procedure can quickly become computationally impractical. Alternating logistic regressions (ALR) overcomes this limitation, and can be used to model, simultaneously, the marginal probability and the pairwise association for large clusters (Carey et al., 1993).
In this paper, we illustrate ALR with an application to clustered data with multiple levels of nesting. The data came from a radiographic efficacy study (Kantor, 1989; Kantor and Stinnett, 1990), where the clustered response variable is the presence or absence of a caries lesion on each of 16 tooth surfaces per subject. There are 3 levels of nesting: tooth surfaces within an interproximal (IP) region, IP regions within a jaw, and jaws within a subject. We applied ALR to assess the extent of caries aggregation at each level of nesting.
| MATERIALS & METHODS |
|---|
|
|
|---|
|
|
The Multivariate Model
As we analyzed individual tooth surfaces for aggregation at 3 levels of nesting, there were 16 correlated response variables; hence, a multivariate approach to modeling was required. Following the notations of Qaqish and Liang (1992), let Yij denote the jth binary response on the ith cluster, 1
i
K and 1
j
ni; for example, the presence (Yij = 1) or absence (Yij = 0) of caries on the jth tooth surface for the ith subject (or cluster). For a single cluster i, Yi = (Yi1, Yi2,...,Yini)' denotes a vector of ni-binary responses. Let Xij (1
j
ni) denote a p-covariate vector associated with the jth binary response on the ith cluster Yij, and let Zijk (j < k) denote a q-covariate vector associated with the jth and kth response pair (for instance, the presence or absence of a caries lesion on 2 separate tooth surfaces for the ith subject). We suppress the conditioning of covariates in the models of the form E(Yij | xij), but rather denote them simply as E(Yij).
Let µij = E(Yij) denote the (ni x 1) first-order mean vector, and let µijk = E(Yij,Yik), 1
j < k
ni, denote the {ni(ni 1)/2 x 1} second-order mean vector. With Yij being distributed as a Bernoulli variate, the marginal probability of caries (i.e., caries risk) is modeled assuming a logistic representation:
![]() | (1) |
and the dependence (i.e., caries aggregation) as
![]() | (2) |
Eq. 1
is the log odds for E(Yij), and Eq. (2)
is the log of the odds ratio relating Yij to Yik.
The pairwise association for the 2 binary responses (Yij, Yik), (j < k), is measured in terms of a marginal odds ratio (Lipsitz et al., 1991), given by
![]() |
The log of the odds ratio,
ijk, for caries between 2 tooth surfaces (j,k) for the ith subject is the log of the ratio of the odds of caries on tooth surface j, given that tooth surface k is carious, to the odds of caries on tooth surface j, given that tooth surface k is caries-free. Therefore, when
ijk is zero, the odds ratio is 1, implying the absence of clustering of caries between the 2 surfaces (j,k) within a subject. An odds ratio significantly different from the null suggests the presence of "aggregation" of caries within a subject. The dependence structure in our study was modeled with the marginal odds ratio (Lipsitz et al., 1991), instead of the correlation coefficient (Prentice, 1988). A limitation of using the correlation coefficient as a measure of dependence involves range restrictions (Diggle et al., 2002), but using the odds ratio as a measure of association overcomes this limitation (Lipsitz et al., 1991; Liang et al., 1992). Finally, odds ratios are commonly used in epidemiologic studies and are easy to interpret.
Ordinary Logistic Regression (OLR)
When the clustering is ignored, Eq. (1)
is the OLR model. However, the estimated variances from this model can be imprecise, and so, statistical tests and biologic inferences may be incorrect.
First-order Generalized Estimating Equations (GEE1)
GEE1 are multivariate analogues of the quasi-likelihood estimating functions (Wedderburn, 1974). When interest centers solely on modeling the marginal mean, µij, and the dependence is considered a nuisance, then the estimating function for ß has the form
![]() |
Let Wi = (Y1Y2, Y1Y3,...,Yni1Yni)' denote the {ni(ni 1)/2 x 1} vector of pairwise (cross-products) outcomes of the vector Yi. Define
i = E(Wi). Since the joint distribution of Yi is not fully specified by the parameters
= (ß,
), where
is the "working" dependence structure, a full likelihood-based approach to estimation is not feasible (Liang et al., 1992). Liang and Zeger (1986) showed that the solution of
of U(
) = 0, as K
, has an asymptotic multivariate Gaussian distribution with mean 0 and covariance which can be consistently estimated using
, where H0 and H1 are defined as
![]() |
![]() |
The advantages of GEE1 are that even when the dependence structure is incorrectly specified, GEE1 provides consistent estimation of the marginal mean parameters.
Alternating Logistic Regressions (ALR)
When the dependence structure (i.e., caries aggregation) is the focus, GEE2 may be used to model µij and µijk simultaneously (Prentice, 1988). However, the simultaneous estimation of µij and µijk can become computationally prohibitive, especially when the cluster sizes become large. For instance, for a cluster of size ni, the dimension of the matrix cov(Yi,Wi) is of the order {ni + ni(ni 1)/2} x {ni + ni(ni 1)/2}. In our study, the maximum cluster size for each subject was 16 surfaces, resulting in a 136 x 136 matrix ({16 + 16(16 1)/2}) for cov(Yi,Wi). Moreover, when odds ratios are used to model the pairwise associations, estimation of cov(Yi,Wi) requires solving O(n3) cubic and O(n2) seventh-degree polynomial equations for each cluster (Carey et al., 1993).
ALR overcomes the computational difficulty of GEE2 and can be used to estimate
= (ß,
) for simultaneous modeling of µij and µijk (Carey et al., 1993). For cluster sizes
4, not only can using GEE2 to model µij and µijk become computationally prohibitive, but also the efficiency in the estimates of ß from GEE1 and ALR is > 90% compared with that of GEE2 (Heagerty and Zeger, 1996). Briefly, the ALR algorithm iterates between the following two steps until convergence:
Step 1: Given the initial values of
=
(r), calculate côv(r), and solve the following estimating equation (Score function) for an updated
(r+1):
![]() |
Step 2: Given
(r+1) and
(r), evaluate the offset from the following equation:
![]() |
and perform the offset logistic regression of Yij on ZijkYik, to obtain
(r+1).
Let
ijk = E(Yij | Yik = yik). Define Rijk = Yij
ijk to denote a vector of residuals, and let Si denote a matrix with diagonal elements
ijk(1
ijk). The ALR estimate of
= (ß,
) is the simultaneous solution of the following set of unbiased estimating equations:
![]() |
![]() |
The loss of efficiency in estimating the pairwise association parameters
is typically much less severe when centered values, yij E(yij), are used, rather than yij itself (Heagerty and Zeger, 1996). A limitation of ALR is that although modeling µij is robust to different choices of working correlation, µijk modeling is not (i.e., the model must be specified correctly). Thus, measures of association among responses need to be cautiously interpreted, since they are sensitive to the assumption that the model is correctly specified (Stokes et al., 2000). All regression models were fit in SAS version 8.2 (SAS Institute, Cary, NC, USA) with the GENMOD procedure based on the binomial distribution assumption and with the logit link function for uncentered products.
| RESULTS |
|---|
|
|
|---|
![]() | (3) |
where
ij = Pr(Yij = 1), and I(.) is an indicator function for observers. We also tested for other two-way interactions in Eq. (3)
, and none reached statistical significance. For comparison, we show the parameter estimates with naïve standard errors of OLR (model a) that ignored the clustering, and the estimates based on GEE1 (model b) that accounted for clustered observations (Table 2
). The naïve standard errors for all terms in the OLR model were smaller than with GEE1 estimation. We did not interpret the marginal model in depth, since our purpose was to illustrate the application of ALR for modeling the dependence structure, that is, caries aggregation.
|
![]() | (4) |
![]() | (5) |
![]() | (6) |
![]() | (7) |
All models were fitted with ALR. Eq. 4
was based on a constant pairwise log odds ratio within a subject (model c), Eq. 5
allowed for separate log odds ratios for within- and between-tooth surfaces (model d), Eq. 6
allowed for separate estimation of the pairwise association for the two jaws (model e), and finally, Eq. 7
allowed for estimation of the 10 separate associations of within- and between-IP regions (model f). Model (f) was preferred over models (c-e) since it allowed for an assessment of caries aggregation at the lowest level of nesting (i.e., IP region).
The parameter estimates for these 4 models with their robust standard errors are shown in Table 2
. A subject with caries on any tooth surface was exp(0.935) or 2.55 (95% CI 2.18, 2.97) times as likely to have caries on any other different surface (model c). Caries was more likely to aggregate within- and between-tooth surfaces (model d). A test of
1 =
2 =
3 = 0 resulted in a significant Wald-like
(Table 2
). Caries was also more likely to aggregate within and between jaws (model e) once again, a test of
4 =
5 =
6 = 0 being highly significant (Wald-like
).
Model (f) allows for separate estimation of the pairwise associations by IP regions. A simultaneous 10-df test of
7 = ... =
16 = 0 was significant (Wald-like
, P < 0.001), confirming the presence of strong caries aggregation. Odds ratios from model (f), with robust 95% confidence intervals, are presented in Table 3
. Three notable findings are apparent from this analysis. First, pairwise associations of caries appeared to be stronger within rather than between IP regions (P < 0.05); that is, the odds ratios on the diagonal are greater than on the off-diagonal entries. For example, a subject with caries on 1 surface in the PP IP region was exp(1.509) or 4.52 times as likely to have caries on the adjacent surface within the PP IP region, but at most exp(1.046) or 2.85 times as likely to have caries in another IP region (i.e., the PM region in this example). Second, as the distance between IP regions increased, the association grew weaker (P < 0.05). This is seen from the diminishing odds ratios going from the left to right for each row, and from bottom to top for each column in Table 3
. Third, regardless of the choice of the pairwise association structure, parameter estimates for the marginal probability and their robust standard errors among models (c-f) were fairly similar (Table 2
).
|
| DISCUSSION |
|---|
|
|
|---|
Given that modeling caries aggregation was the goal of our analysis, we could have applied either GEE2 or ALR. Since GEE2 is computationally inefficient relative to ALR when the number of clusters is large (Carey et al., 1993), we preferred ALR. Consistent estimation of the parameters in the marginal probability model with GEE2 requires the correct specification of the model for
ijk. However, ALR provides consistent estimates of the mean parameters, even when the dependence structure is misspecified.
Alternatively, the marginal probability and the conditional pairwise odds ratios of the multivariate binary responses can be modeled with the log-linear framework (Fitzmaurice and Laird, 1993). However, such models have several shortcomings. The primary disadvantages of log-linear models are two-fold. First, they are not reproducible (Liang et al., 1992), in that if the vector Yij = (Yi1, Yi2,...,YiKi)' satisfies a log-linear model, then a subset of Yij need not necessarily follow the same log-linear model. This restriction has rather profound implications in oral health studies when the cluster size varies across clusters. In contrast, marginal models are fully reproducible. The second disadvantage of log-linear models is that if one is interested in modeling the pairwise associations as a function of covariates, the log-linear framework is not applicable. These issues are discussed in greater detail by Liang et al.(1992) and Qaqish and Liang (1992).
In the case of clusters of size 2, Rosner (1989) proposed estimating the pairwise association based on beta-binomial regression models, and Ananth and Preisser (1999) demonstrated an application of bivariate logistic regression. Recently, the ALR algorithm has been extended to model ordinal responses when the association structure is the primary focus of analysis (Heagerty and Zeger, 1996). Population-averaged and cluster-specific regression models have been applied to a longitudinal study of periodontal disease progression where estimation of the marginal probability was the primary focus (Ten Have et al., 1995). The issue of analyzing clustered responses, specifically within- and between-subject comparisons, in periodontal disease studies has been the focus of recent papers (DeRouen et al., 1995; Mancl et al., 2000).
In summary, the ALR procedure offers a computationally convenient method for fitting complex regression structures that involve multiple classes and several levels of nesting which are commonly encountered in dental research. Specifically, in our case study, the analysis reveals that caries lesions appear to aggregate strongly within subjects.
| ACKNOWLEDGMENTS |
|---|
Received March 28, 2003; Last revision March 18, 2004; Accepted April 23, 2004
| REFERENCES |
|---|
|
|
|---|
Carey V, Zeger SL, Diggle P (1993). Modeling multivariate binary data with alternating logistic regressions. Biometrika 80:517526.
DeRouen TA, Hujoel PP, Mancl LA (1995). Statistical issues in periodontal research. J Dent Res 74:17311737.
Diggle PJ, Heagerty P, Liang K-Y, Zeger SL (2002). Analysis of longitudinal data. 2nd ed. New York, NY: Oxford Science Publications.
Fitzmaurice GM, Laird NM (1993). A likelihood-based method for analysing longitudinal binary responses. Biometrika 80:141151.
Heagerty PJ, Zeger SL (1996). Marginal regression models for clustered ordinal measurements. J Am Stat Assoc 91:10241036.
Kantor ML (1989). Radiographic caries detection as a function of film speed and kilovoltage (abstract). J Dent Res 68(Spec Iss):961.
Kantor ML, Stinnett SS (1990). Factors influencing radiographic detection of dental caries (abstract). J Dent Res 69(Spec Iss):288.
Liang K-Y, Zeger SL (1986). Longitudinal data analysis using generalized estimating equations. Biometrika 73:1322.
Liang K-Y, Zeger SL, Qaqish BF (1992). Multivariate regression analyses for categorical data (with discussion). J R Stat Soc [Ser B] 54:340.
Lipsitz SR, Laird NM, Harrington DP (1991). Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika 78:153160.
Mancl LA, Leroux BG, DeRouen TA (2000). Between-subject and within-subject statistical information in dental research. J Dent Res 79:17781781.
Manji F, Fejerskov O (1994). An epidemiological approach to dental caries. In: Textbook of clinical cariology. 2nd ed. Thylstrup A, Fejerskov O, editors. Copenhagen, DK: Munksgaard, pp. 159191.
Podgor MJ, Hiller R, the Framingham Eye Studies Group (1996). Associations of types of lens opacities between and within eyes of individuals: an application of second-order generalised estimating equations. Stat Med 15:145156.[ISI][Medline]
Prentice RL (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics 44:10331048.[ISI][Medline]
Qaqish BF, Liang K-Y (1992). Marginal models for correlated binary responses with multiple classes and multiple levels of nesting. Biometrics 48:939950.[ISI][Medline]
Rosner B (1989). Multivariate methods for clustered binary data with more than one level of nesting. J Am Stat Assoc 84:373380.
Stokes ME, Davis CS, Koch GG (2000). Categorical data analysis using the SAS system. 2nd ed. Cary, NC: SAS Institute, Inc., pp. 527533.
Ten Have TR, Landis R, Weaver SL (1995). Association models for periodontal disease progression: a comparison of methods for clustered binary data. Stat Med 14:413429.[ISI][Medline]
Wedderburn RWN (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61:439447.
This article has been cited by other articles:
![]() |
S. O. Manda, M. S Gilthorpe, Y.-K. Tu, A. Blance, and M. T Mayhew A Bayesian analysis of amalgam restorations in the Royal Air Force using the counting process approach with nested frailty effects Statistical Methods in Medical Research, December 1, 2005; 14(6): 567 - 578. [Abstract] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| IADR Journals | Advances in Dental Research ® |
| Journal of Dental Research ® | Critical Reviews (1990-2004) |