PROCEEDINGS |
Dental Health Unit, 3A Skelton House, Manchester Science Park, Manchester M15 6SH, UK; Helen.Worthington{at}man.ac.uk
| ABSTRACT |
|---|
KEY WORDS: multi-center trials, multi-site trials, statistical models, pooling data
| INTRODUCTION |
|---|
Although multi-center trials had been conducted since the 1940s, there was little written about the analysis of the resulting data until 1986, when Fleiss wrote both an article and a chapter on multi-center studies in his well-known book on clinical trials (Fleiss, 1986a,b). The pooling of the data from multi-center trials is considered in both of these. To understand the issues involved, one must understand the basic ideas behind pooling data.
As an illustration of this, Table 1 gives some summary data from a caries clinical trial comparing dentifrices containing 250 ppm F and 1000 ppm F. The overall results of the study showed a significant difference in efficacy between the two dentifrices on several outcomes, the 1000-ppm-F dentifrice being more effective at reducing 32-month caries increments (Mitropoulos et al., 1984). The clinical trial included children in five schools who were randomly allocated within each school to the test and control groups. Although this is not a multi-center study where each center had its own administrative structure, it will suffice to demonstrate the different ways of analyzing the data. For the purpose of this example, the data are assumed to be normally distributed with equal variances (homoscedasticity), and a fixed-factor analysis is assumed. The mean DFT increment over 32 months ($\bar{X}_{c1}$, $\bar{X}_{c2}$), based on clinical scores only, is shown for each study group (1 = 250 ppm F, 2 = 1000 ppm F) for each of the five schools (c = 1 to 5). The numbers of children varied between 84 and 188 per school. One way of pooling the data is by averaging the within-center differences, where

$$\bar{D}_W = \frac{\sum_c W_c \, (\bar{X}_{c1} - \bar{X}_{c2})}{\sum_c W_c}$$

is the estimator of $\mu_1 - \mu_2$, the assumed common difference between the treatment means, where the weights $W_c = n_{c1} n_{c2}/(n_{c1} + n_{c2})$ have been used.
In the example given in Table 1, $\bar{D}_W$ = 77.35/181.26 = 0.43, with SE($\bar{D}_W$) = 0.16. This weighted analysis is a General Linear Model (GLM) type II analysis, where each effect is adjusted for the other effects (Fleiss, 1986b). Another way of pooling the data is to put together all the data for one treatment, ignoring the centers. In Table 1, the overall mean for the 250-ppm-F group was 2.36 and that for the 1000-ppm-F group was 1.93, giving $\bar{D}$ = 0.43, with standard error 0.17. This is equivalent to undertaking a simple independent-sample t test to compare the group means, and both are unbiased estimators of $\mu_{c1} - \mu_{c2}$. With appropriate weights, the sampling variation of $\bar{D}$ will exceed that of $\bar{D}_W$, although there was only a small difference in this example. This pooled comparison corresponds to a GLM type I analysis, which tests the factors in order. The first ranked factor is tested for significance without any account at all taken of the other factor; then the factor ranked second is tested by averaging differences within each level of the first (Fleiss, 1986b). This method is also known as the hierarchical decomposition of the sum of squares method, and each term is adjusted only for the term that precedes it in the model (SPSS, 1999; http://www.spss.com).
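The two ways of pooling described above can be compared in a few lines of code. The following is a minimal sketch only: the function names are illustrative, the cell sizes and means are hypothetical values (Table 1 is not reproduced here), and the standard error assumes a common within-cell standard deviation `sigma`, in line with the homoscedasticity assumption made for the example.

```python
from math import sqrt

def center_weight(n1, n2):
    """Weight W_c = n_c1 * n_c2 / (n_c1 + n_c2) for one center."""
    return n1 * n2 / (n1 + n2)

def weighted_difference(centers):
    """Weighted average of the within-center differences in means.

    `centers` is a list of (n1, mean1, n2, mean2) tuples, one per center.
    """
    weights = [center_weight(n1, n2) for n1, _, n2, _ in centers]
    diffs = [m1 - m2 for _, m1, _, m2 in centers]
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)

def crude_difference(centers):
    """Difference between the overall group means, ignoring centers."""
    total1 = sum(n1 * m1 for n1, m1, _, _ in centers)
    total2 = sum(n2 * m2 for _, _, n2, m2 in centers)
    n1_all = sum(n1 for n1, _, _, _ in centers)
    n2_all = sum(n2 for _, _, n2, _ in centers)
    return total1 / n1_all - total2 / n2_all

def weighted_se(centers, sigma):
    """SE of the weighted estimator, assuming a common within-cell SD sigma."""
    return sigma / sqrt(sum(center_weight(n1, n2) for n1, _, n2, _ in centers))

# Hypothetical summary data (NOT the values in Table 1):
# (n in group 1, mean in group 1, n in group 2, mean in group 2)
centers = [
    (40, 2.5, 44, 2.0),
    (90, 2.2, 98, 1.9),
    (60, 2.6, 58, 2.0),
]

print("weighted difference:", weighted_difference(centers))
print("crude difference:   ", crude_difference(centers))
print("SE of weighted (sigma = 2, assumed):", weighted_se(centers, sigma=2.0))
```

When allocation is close to balanced within every center, as in the hypothetical data above, the two estimates tend to be close, which is consistent with the small difference reported for the example.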
The weighted analysis is the correct method when the random assignment of patients has been carried out separately and independently within each center, for the analysis will then properly have been dictated by the design. This holds provided that there is no interaction between the centers and groups. If there is an interaction, then the analysis consists of simple averaging of the differences $\mu_{c1} - \mu_{c2}$, with no clinic receiving greater or less weight than another. So, in Table 1, this would be simply the average of the differences $D_c$, namely 2.34/5 = 0.47, which is larger than that for the weighted model (0.43). This is easy to explain to clinical colleagues when the sample sizes of the clinics are similar but not when some are much larger, e.g., 10 times larger. Nevertheless, the test based on the interaction model is correct when an interaction exists, regardless of the sample sizes. This is known as a GLM type III analysis, where the hypothesis tested is that the unweighted averages of the treatment means are equal (Fleiss, 1986b). This method calculates the sum of squares of an effect in the design as the sum of squares adjusted for any other effects that do not contain it and orthogonal to any effects that contain it. The computations require non-empty cells to produce estimable contrasts. The output from carrying out both a GLM type II and a GLM type III sum of squares analysis for this example is shown in Tables 2 and 3. It can be seen that, in this example, there was no apparent interaction, and the choice of model makes little difference to the p-values for the treatment effect, with the p-value for the type III model including the interaction term being slightly lower.
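To see why the unweighted average can be hard to explain when one clinic is much larger than the others, it may help to compute both averages on a small made-up data set. This is a sketch under stated assumptions: the function names are illustrative and the three clinics below are hypothetical, chosen so that the large clinic shows a smaller treatment difference than the two small ones.

```python
def within_center_differences(centers):
    """Differences D_c = mean1 - mean2 for each center."""
    return [m1 - m2 for _, m1, _, m2 in centers]

def unweighted_average(centers):
    """Simple average of the D_c, every center counting equally."""
    diffs = within_center_differences(centers)
    return sum(diffs) / len(diffs)

def weighted_average(centers):
    """Average of the D_c with weights W_c = n1 * n2 / (n1 + n2)."""
    diffs = within_center_differences(centers)
    weights = [n1 * n2 / (n1 + n2) for n1, _, n2, _ in centers]
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)

# Hypothetical clinics: (n in group 1, mean in group 1, n in group 2, mean in group 2).
# The first clinic is 10 times the size of the other two.
centers = [
    (200, 2.4, 200, 2.2),
    (20, 2.5, 20, 1.7),
    (20, 2.6, 20, 1.8),
]

print("unweighted average of differences:", unweighted_average(centers))
print("weighted average of differences:  ", weighted_average(centers))
```

In this made-up case the unweighted average is pulled toward the two small clinics, while the weighted average is dominated by the large one; which of the two is appropriate depends on whether a treatment-by-center interaction is assumed, as discussed above.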
The ICH E9 statistical principles for clinical trials, published in 1999, advocate the use of multi-center studies and state, "the mean treatment effect may be investigated first by using a model which allows for center differences but does not include a term for treatment-by-center interaction" (ICH, 1999). This statement has caused controversy in statistical circles and raises many issues in the design and analysis of multi-center trials. The use of general linear models (GLMs) in the analysis of multi-center clinical trials is now commonplace, but diverse opinion exists as to the proper modeling approach in this setting (Schwemer, 2000). Approaches to the analysis of multi-center clinical trials using GLMs include: (a) analysis of the full model, including terms for treatment, center, and their interaction; (b) starting with the full model but removing the interaction term depending on its level of significance; and (c) using a reduced model from the outset, perhaps augmenting it with an interaction term if secondary analysis suggests its presence (as ICH advises).
If the sample sizes in each group and center are equal, the GLM type II and type III least-squares estimators are equal. In this case, the decision to remove terms from the model affects only the estimate of the residual variance and the degrees of freedom of the test statistic, so the concern is solely one of power. Usually, the sample sizes among cells are unbalanced (although there is often near-balance within the centers). With such imbalance, fitting the full model leads to a loss in efficiency if no interaction is present. However, if the model assumes no interaction when an interaction exists, the estimate will be biased (Goldberg and Koury, 1990).
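The equality of the two estimators under balance follows directly from the form of the weights. A short derivation is sketched below, writing $D_c$ for the difference in means within center $c$, $W_c$ for the weights $n_{c1} n_{c2}/(n_{c1} + n_{c2})$, and $C$ for the number of centers and $n$ for the common cell size (the last two symbols are notation introduced here for illustration).

```latex
% Assume every cell has the same size: n_{c1} = n_{c2} = n for c = 1, ..., C.
\begin{align*}
W_c &= \frac{n_{c1}\, n_{c2}}{n_{c1} + n_{c2}} = \frac{n \cdot n}{2n} = \frac{n}{2}, \\[4pt]
\frac{\sum_{c=1}^{C} W_c D_c}{\sum_{c=1}^{C} W_c}
    &= \frac{\tfrac{n}{2} \sum_{c=1}^{C} D_c}{C \cdot \tfrac{n}{2}}
     = \frac{1}{C} \sum_{c=1}^{C} D_c .
\end{align*}
% The weighted average of the within-center differences therefore reduces to
% the unweighted average when all cells are the same size.
```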
The majority of statisticians recommend the use of the unweighted (GLM type III) analysis, including main effects and treatment-by-center interaction terms, to obtain unbiased estimates of treatment contrasts. This is based on the classic textbooks on linear models along with recommendations in statistical software manuals such as those for SAS and SPSS (Searle, 1971; Milliken and Johnson, 1984; http://www.spss.com, 1999; http://www.sas.com, 2000). If the interaction term is not significant at the 0.10 or 0.20 level, it is usually dropped from the model. However, some statisticians take a different view and believe that the GLM type II model should always be used (Senn, 1998). Senn points out that the unweighted analysis is untenable and leads to paradoxes, in that an analysis based on additional centers could be less precise. This view is supported by the findings of one group of investigators who used simulations of trial data to compare the various estimators which result from the different approaches in terms of the mean square error and their power to reject the null hypothesis of no treatment difference (Jones et al., 1998). Another statistician also concludes that, to study patients as opposed to centers, one should base the analysis on weighting centers according to center sizes (Kallen, 1997). Lin considers this when discussing how the centers should be weighted in the statistical analysis of multi-center studies (Lin, 1999). That paper demonstrates why one should be careful about using the unweighted analysis as the primary statistical method: its power to reject the null hypothesis of no treatment difference is lower than that of the weighted analysis.
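The precision argument can be illustrated numerically. The sketch below is not taken from any of the cited papers; it uses the standard variance formulas for the two estimators under the assumptions of independent observations, a common within-cell variance `sigma2`, and no treatment-by-center interaction. The function names and the cell sizes are hypothetical.

```python
def center_weight(n1, n2):
    """Weight W_c = n_c1 * n_c2 / (n_c1 + n_c2) for one center."""
    return n1 * n2 / (n1 + n2)

def var_weighted(cells, sigma2=1.0):
    """Variance of the weighted estimator: sigma^2 / sum(W_c)."""
    return sigma2 / sum(center_weight(n1, n2) for n1, n2 in cells)

def var_unweighted(cells, sigma2=1.0):
    """Variance of the unweighted estimator: sigma^2 * sum(1 / W_c) / C^2."""
    inverse_weights = [1 / center_weight(n1, n2) for n1, n2 in cells]
    return sigma2 * sum(inverse_weights) / len(cells) ** 2

# Hypothetical designs: (n in group 1, n in group 2) per center.
two_centers = [(50, 50), (50, 50)]
three_centers = two_centers + [(2, 2)]  # add one very small center

for label, cells in [("two centers", two_centers), ("three centers", three_centers)]:
    print(label,
          "| weighted variance:", var_weighted(cells),
          "| unweighted variance:", var_unweighted(cells))
```

With these made-up cell sizes, adding the very small center increases the variance of the unweighted estimator while slightly reducing that of the weighted estimator, which is the kind of paradox referred to above.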
Some have argued that one can solve this problem by pooling very small centers with a larger one before conducting the unweighted analysis, so the patients in the very small centers do not have a disproportionate influence on the results. However, the validity of this practice relies on the assumption that the mean differences between the centers are small (Lin, 1999). Otherwise, the standard deviation of the pooled center can be arbitrarily inflated by the variation among small centers, and the statistical power is reduced.
When multi-center trials are conducted in the pharmaceutical industry, sample sizes at different clinical centers are usually not controlled to be balanced. Forcing the different centers to have balanced sample sizes is not only unnecessary, but also very costly, since the study could last unreasonably long (Lin, 1999). However, it is possible that the faster-enrolling centers, which may not be more reliable, will dominate the overall results of the study. It is therefore sensible to avoid major imbalances among the study sites, since the differences in sample size are likely to affect the power of the tests for interaction and the generalizability of the overall study findings.
The problem of whether to include an interaction term in the model has been examined in relation to the analysis of binary data (Agresti and Hartzel, 2000). It is concluded that, with many strata or sparse data, the power of the tests of the hypothesis of no interaction may be weak, and the safest approach is to use the interaction model.
Another issue to be considered in the analysis of data from multi-center trials is whether the clinics should be considered as fixed or random effects. The consensus is to model the clinics as fixed, not random, effects, a view which has been endorsed by several statisticians (Fleiss, 1986a; Senn, 1998; Agresti and Hartzel, 2000). This issue is looked at in detail by investigators who compared different methods for comparing treatments for a binary response (Agresti and Hartzel, 2000). Using several examples, they reached similar conclusions about the treatment effects, whether fixed or random effects were used. In their experience, the fixed-effects model and the random-effects model, assuming no interaction, tend to provide similar results about the common treatment effect. In contrast, they found that the two models provide quite different estimates of individual center or treatment effects. They concluded that the choice depends on the intended scope of the inference. Another investigator, using the empirical Bayes approach, compares the analysis of multi-center trials regarding the center effects as random rather than fixed (Gould, 1998). The results of empirical and conventional Bayes analyses are compared with the results of fixed- and mixed-model ANOVAs, based on data from a trial, and it is shown that the Bayesian methods can identify potential outliers and are more robust to outliers than ANOVA. However, it is concluded that the level of sophistication and insight provided by empirical and conventional Bayes may not justify the effort required to implement them in all circumstances.
From this review of the statistical literature on pooling results from multi-center studies, some strong themes emerged. There is controversy about the use of unweighted (type III) or weighted (type II) analysis. The weighted analysis provides the most powerful test of the treatment contrast if there is no interaction between treatment and center. If there is an interaction, the unweighted analysis leads to unbiased estimates. Although, from an estimation and hypothesis-testing standpoint, there is no need to balance the number of patients between the sites, it is sensible to avoid major imbalances among the study sites. The consensus view is to use a fixed-effects model for the analysis of multi-center trials.
| REFERENCES |
|---|
Fleiss J (1986a). Analysis of data from multiclinic trials. Control Clin Trials 7:267-275.
Fleiss J (1986b). The design and analysis of clinical experiments. New York: John Wiley & Sons.
Goldberg JD, Koury KJ (1990). Design and analysis of multicenter trials. In: Statistical methodology in the pharmaceutical sciences. Chapter 7. Berry D, editor. New York: Marcel Dekker.
Gould AL (1998). Multi-centre trial analysis revisited. Stat Med 17:1779-1797.
Hill AB (1962). Statistical methods in clinical and preventive medicine. New York: Oxford University Press.
ICH E9 Expert Working Group (1999). ICH harmonised tripartite guideline. Statistical principles for clinical trials. Stat Med 18:1905-1942.
Jones B, Teather D, Wang J, Lewis JA (1998). A comparison of various estimators of a treatment difference for a multi-centre clinical trial. Stat Med 17:1767-1777.
Kallen A (1997). Treatment-by-centre interaction: what is the issue? Drug Info J 31:927-936.
Lin Z (1999). An issue of statistical analysis in controlled multi-centre studies: how shall we weight the centres? Stat Med 18:365-373.
Milliken GA, Johnson DE (1984). Analysis of messy data. Vol. 1. Design of experiments. Belmont, CA: Life Learning Publications.
Mitropoulos CM, Holloway PJ, Davies TGH, Worthington HV (1984). Relative efficacy of dentifrices containing 250 or 1000 ppmF in preventing dental caries - report of a 32-month clinical trial. Community Dent Health 1:193-200.
SAS Institute, Inc. (2000). SAS OnlineDoc, Version 8 with PDF files (from http://www.sas.com). Cary, NC: SAS Institute, Inc.
Schwemer G (2000). General linear models for multicenter clinical trials. Control Clin Trials 21:21-29.
Searle SR (1971). Linear models. New York: John Wiley & Sons, Inc.
Senn S (1998). Some controversies in planning and analysing multi-centre trials. Stat Med 17:1753-1765.
SPSS (1999). SPSS for Windows release 10.0.5. Copyright ©SPSS Inc. http://www.spss.com.