RESEARCH REPORTS |
1 Department of Periodontology, Leeds Dental Institute, University of Leeds, Clarendon Way, Leeds, LS2 9LU, UK; and
2 Biostatistics Unit, Centre for Epidemiology & Biostatistics, Leeds Institute of Genetics, Health & Therapeutics, University of Leeds, 30/32 Hyde Terrace, Leeds, LS2 9LN, UK;
* corresponding author, y.k.tu{at}leeds.ac.uk
| ABSTRACT |
|---|
KEY WORDS: power, sample size, randomized controlled trials
| INTRODUCTION |
|---|
An important issue frequently overlooked in evaluations of the quality of randomized controlled trials (RCTs) in dental research is sample size and hence study power. The power of a hypothesis test is the probability of rejecting the null hypothesis when the alternative hypothesis is true (Dawson and Trapp, 2001). When a test incorrectly fails to reject the null hypothesis, a Type II error occurs. Power is defined as (1 − β), where β is the probability of a Type II error (Moye, 2000). The lack of power in periodontal research has been addressed previously (Hujoel et al., 1992; Gunsolley et al., 1998), although it is still not common practice for published trials to state clearly how power and sample size were determined before the start of the trial. Moreover, it is relatively unknown to dental researchers that the statistical analyses used to compare differences in treatment effects can have a substantial impact on study power.
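As a rough illustration of the definition of power as (1 − β), the sketch below approximates the power of a two-sided comparison of two group means. It assumes a known common SD and uses a normal approximation in place of the t distribution; the function name `approx_power` and its parameters are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, sd, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample comparison of means.

    Assumptions (illustrative only): equal group sizes, a known common SD,
    and a normal approximation instead of the t distribution.
    """
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)   # standard error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    shift = delta / se                  # standardized true difference
    # Probability that the test statistic falls in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, round(approx_power(delta=2.0, sd=2.0, n_per_group=n), 3))
```

When the true difference is zero, the function returns the significance level itself, which is a convenient sanity check: with no real effect, the only rejections are Type I errors.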
The aims of this study are to use computer simulations to explore how using different statistical methods of analyzing change in continuous outcomes affects study power, and the sample size required for RCTs to test the superiority of one treatment over another or the equivalence of one treatment with another. Simulations are based on the assumption that the treatment effect is either independent of baseline disease severity or is associated with baseline disease levels. By using different assumptions, we investigated which statistical methods give rise to the greatest power for different clinical scenarios.
| MATERIALS & METHODS |
|---|
For the superiority trial, pre-treatment clinical attachment level (CAL; precal) is 9 mm, with a standard deviation (SD) of 2 mm in both test and control treatment groups, based on original data in the literature (Esposito et al., 2003; Tu et al., 2004). The mean treatment effect is 4 mm for the treatment group and 2 mm for the control group. A 2-mm difference between treatment arms is chosen because it has been suggested that a 2-mm change is clinically significant and necessary for an expensive treatment such as guided tissue regeneration (GTR) to justify its routine use (Greenstein and Lamster, 2000). For the equivalence trial, the treatment effect is 4 mm for both the test group (enamel matrix derivative, EMD) and the active-control group (GTR). For the new treatment to claim equivalence to the established treatment, it is suggested that the former achieve 90% of the effect of the latter (Rethman and Nunn, 1999). In this study, 0.5 mm was the adopted equivalence margin (tolerance). Thus, the difference in the performance of the two treatments must fall within this range to claim that the treatment effects of EMD are comparable with those of GTR. Two one-sided test (TOST) procedures (Chow and Shao, 2002) were adopted to test the equivalence of the two treatments, i.e., therapeutic equivalence is concluded if and only if the 95% confidence interval (CI) of the estimate lies within the pre-defined equivalence margin. One-sided tests (OST) were adopted to test the non-inferiority of the new treatment to the established one, i.e., non-inferiority is concluded if the lower limit of the 95% CI does not extend beyond the lower margin.
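The confidence-interval decision rules described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `ci_decisions` is hypothetical, the estimate is assumed to be the test-minus-control difference in CAL gain (so negative values favor the established treatment), a normal critical value is used, and the same two-sided interval is reused for the non-inferiority check, which is an assumption about how the 95% CI is constructed.

```python
from statistics import NormalDist


def ci_decisions(estimate, se, margin=0.5, level=0.95):
    """Apply CI-based equivalence and non-inferiority rules to one trial result.

    estimate : test-minus-control difference in mean CAL gain, in mm (assumed sign convention)
    se       : standard error of that difference
    margin   : equivalence margin in mm (0.5 mm in the text)
    level    : confidence level of the two-sided interval (normal approximation)
    """
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = estimate - z_crit * se
    upper = estimate + z_crit * se
    return {
        "ci": (lower, upper),
        # Equivalence: the whole interval must lie inside (-margin, +margin)
        "equivalent": -margin < lower and upper < margin,
        # Non-inferiority: only the lower limit is checked against -margin
        "non_inferior": lower > -margin,
    }


if __name__ == "__main__":
    print(ci_decisions(estimate=0.0, se=0.1))
    print(ci_decisions(estimate=0.4, se=0.2))
```

Because the non-inferiority rule checks only one limit, any result declared equivalent under this sketch is also declared non-inferior, but not the reverse.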
Two SDs of post-treatment CAL (postcal) are considered: 2 mm and 1.5 mm. If there is no difference in the SD between precal and postcal, the treatment effects are not dependent on the baseline disease levels; where the post-treatment SD becomes smaller than the pre-treatment SD, there is a baseline effect (Tu et al., 2004).
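One standard way to see why equal pre- and post-treatment SDs correspond to no baseline effect is to look at the covariance between the change and the average of the two measurements. The notation below is introduced here for illustration only (x for precal, y for postcal, r for their correlation) and is not necessarily the derivation used in the cited work.

```latex
\begin{align*}
d &= x - y \quad \text{(CAL gain)}, \\
\operatorname{Var}(d) &= \sigma_x^2 + \sigma_y^2 - 2 r \sigma_x \sigma_y, \\
\operatorname{Cov}\!\left(d,\ \tfrac{x + y}{2}\right) &= \tfrac{1}{2}\left(\sigma_x^2 - \sigma_y^2\right).
\end{align*}
```

With σ_x = σ_y = 2 mm the covariance is zero, so the gain is unrelated to overall disease level. With σ_x = 2 mm and σ_y = 1.5 mm it is (4 − 2.25)/2 = 0.875 mm², so larger gains go with more severe disease. The second line also shows why the correlation matters for change scores: with both SDs equal to 2 mm, Var(d) is 7.2 mm² at r = 0.1 but only 0.8 mm² at r = 0.9.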
Four univariate and two multivariate statistical methods of analyzing change are used: (1) test only the post-treatment scores; (2) test the change scores; (3) test the percentage change scores; (4) analysis of covariance (ANCOVA); (5) random effects modeling (REM); and (6) multivariate analysis of variance (MANOVA). [A detailed explanation of these methods is provided in the online Appendix.] In superiority trials, the power of all six methods is compared. In equivalence trials, REM and MANOVA are not considered. The reason for omitting REM is that this method achieves exactly the same estimates (regression coefficients) and standard errors as using change scores, i.e., they achieve exactly the same power in equivalence trials. The reason for excluding MANOVA is that this method does not give rise to an estimate of the treatment difference, and hence an associated confidence interval. Consequently, only the four univariate statistical methods of analyzing changes are used to test equivalence.
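To make the four univariate approaches concrete, the sketch below computes the point estimate of the treatment difference under each of them for a single two-arm data set. It is an illustration under stated assumptions rather than the authors' implementation: the function name `univariate_estimates` is hypothetical, gain is defined as precal minus postcal, every estimate is signed so that positive values favor the test group, and ANCOVA is taken to be the common-slope model with baseline as the only covariate.

```python
from statistics import mean


def _sum_cross(x, y):
    """Sum of cross-products of deviations from the means of x and y."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def univariate_estimates(pre_t, post_t, pre_c, post_c):
    """Treatment-difference estimates (positive favors the test group)."""
    gain_t = [a - b for a, b in zip(pre_t, post_t)]
    gain_c = [a - b for a, b in zip(pre_c, post_c)]
    pct_t = [100.0 * g / a for g, a in zip(gain_t, pre_t)]
    pct_c = [100.0 * g / a for g, a in zip(gain_c, pre_c)]

    # ANCOVA: pooled within-group slope of post on pre, then adjust the
    # difference in post-treatment means for any baseline imbalance.
    slope = (_sum_cross(pre_t, post_t) + _sum_cross(pre_c, post_c)) / (
        _sum_cross(pre_t, pre_t) + _sum_cross(pre_c, pre_c)
    )
    post_diff = mean(post_c) - mean(post_t)  # lower post-treatment CAL is better
    ancova = post_diff - slope * (mean(pre_c) - mean(pre_t))

    return {
        "post_only": post_diff,
        "change": mean(gain_t) - mean(gain_c),
        "percentage_change": mean(pct_t) - mean(pct_c),
        "ancova": ancova,
    }


if __name__ == "__main__":
    # Toy data (made up for illustration): identical baselines in both arms
    est = univariate_estimates([8, 9, 10], [4, 5, 6], [8, 9, 10], [6, 7, 8])
    print(est)
```

When the two arms have identical baseline means, as in the toy data, the post-only, change-score, and ANCOVA estimates coincide; the methods differ in their standard errors, which is what drives the differences in power.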
Correlation between Pre- and Post-treatment Values
Variability in the treatment effect, as measured by the correlation between the pre- and post-treatment values, can affect study power. A previous study has shown that the correlation between pre- and post-treatment CAL might vary from nearly zero to a strong positive correlation, such as 0.9 (Tu et al., 2004). For superiority trials, the hypothetical sample size ranged from 10 to 30 in each group, although preliminary simulations revealed that, if there was a baseline effect on treatment, little difference was detected among the six methods when the sample size was 30 in each group. Consequently, simulations were performed with only 10 and 20 in each group when a baseline effect was assumed. For equivalence and non-inferiority trials, the hypothetical sample sizes were 50, 100, and 150. For each sample size, the correlation between pre- and post-treatment values was varied from 0.1 to 0.9, in steps of 0.2. Ten thousand simulations (sufficient to achieve robust estimates) were performed for each scenario.
For superiority trials, the power of the six tests is calculated as the percentage of simulated studies that show a statistically significant difference between the two groups. For equivalence and non-inferiority trials, the power of the four univariate tests is calculated as the percentage of simulated studies that show the treatment effects of both treatments to be equivalent. The objectives were to determine: (1) whether there were substantial differences in power among the various statistical methods; (2) how pre- and post-treatment correlation affects the performance of each method, according to whether treatment effects are associated with baseline disease severity; and (3) for equivalence trials only, whether there were any differences in power between TOST and OST. The 5% level of statistical significance was assumed throughout, and all the computer simulations were performed with the statistical software R version 1.8.1 (R Development Core Team, 2003).
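The idea of estimating power as the percentage of simulated trials that reach significance can be sketched as below. This is a simplified illustration, not a reproduction of the study's R code: it covers only two of the six methods (post-treatment scores and change scores), uses a normal critical value in place of the t distribution (which slightly overstates power at small sample sizes), defaults to 2,000 rather than 10,000 simulations, and all function names are hypothetical. The means and SDs follow the superiority scenario in the text.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def simulate_arm(n, mean_pre, sd_pre, mean_post, sd_post, r, rng):
    """Draw n (pre, post) pairs from a bivariate normal with correlation r."""
    pre, post = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pre.append(mean_pre + sd_pre * z1)
        post.append(mean_post + sd_post * (r * z1 + sqrt(1 - r * r) * z2))
    return pre, post


def significant(a, b, z_crit):
    """Two-sample test with pooled variance, judged against a normal critical value."""
    pooled = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    se = sqrt(pooled * (1 / len(a) + 1 / len(b)))
    return abs(mean(a) - mean(b)) / se > z_crit


def simulated_power(n, r, sd_post=2.0, n_sims=2000, alpha=0.05, seed=1):
    """Share of simulated superiority trials in which each method is significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = {"post_only": 0, "change": 0}
    for _ in range(n_sims):
        # Baseline 9 mm (SD 2); mean gain 4 mm in the test arm, 2 mm in the control arm
        pre_t, post_t = simulate_arm(n, 9.0, 2.0, 5.0, sd_post, r, rng)
        pre_c, post_c = simulate_arm(n, 9.0, 2.0, 7.0, sd_post, r, rng)
        gain_t = [a - b for a, b in zip(pre_t, post_t)]
        gain_c = [a - b for a, b in zip(pre_c, post_c)]
        hits["post_only"] += significant(post_t, post_c, z_crit)
        hits["change"] += significant(gain_t, gain_c, z_crit)
    return {method: count / n_sims for method, count in hits.items()}


if __name__ == "__main__":
    for r in (0.1, 0.5, 0.9):
        print(r, simulated_power(n=10, r=r))
```

Both methods are applied to the same simulated data sets, so differences in the estimated power reflect the analysis method rather than sampling noise between runs.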
| RESULTS |
|---|
Simulations for the Power of Equivalence
Power using only post-treatment values was unaffected by the correlation between pre- and post-treatment values (r_pre.post) and, for small correlations, was greater than the power using change or percentage change scores (Tables 3, 4). Power was dependent on r_pre.post when change score, percentage change score, or ANCOVA was used, with power steadily increasing as r_pre.post increased. Using percentage change score was slightly more powerful than using change score when r_pre.post was not large. When r_pre.post was small, the power of ANCOVA was close to that for only post-treatment values, although much higher than for change score or percentage change score. Generally, the most powerful test remained ANCOVA.
Simulations for the Power of Non-inferiority
Testing with OST procedures gives rise to greater power than testing with TOST. Differences are especially substantial when the correlations between pre- and post-treatment values are small. However, even for OST procedures, the sample sizes simulated did not always achieve satisfactory power (Tables 3, 4).
| DISCUSSION |
|---|
A recent systematic review of the efficacy of GTR compared with flap operation in the treatment of infrabony defects (Needleman et al., 2001) showed that few RCTs had 20 or more participants in each group, and most trials used paired t tests or two-sample t tests to analyze their data. Unless the differences in the treatment effect were larger than 2 mm, or the variability of treatments was relatively small, most studies would have been at risk of being underpowered as analyzed, while their sample size and power might have been adequate had they used ANCOVA. Nevertheless, differences in the treatment effect between GTR and open-flap operation seemed to be, on average, smaller than 2 mm (Needleman et al., 2001). Analyses using change score or percentage change score might therefore have an unacceptably high probability of a false-negative result, even when the sample size would have been adequate had an alternative statistical method such as ANCOVA been used.
This study demonstrates that using different statistical methods to analyze results from clinical trials in periodontal research has substantial impact on study power, since the correlation between pre- and post-treatment values varies greatly. ANCOVA should be used in preference to change score or percentage change score, reducing Type II error rates. Power calculations need to be performed at the planning stages of clinical trials so that the appropriate sample size can be determined.
| ACKNOWLEDGMENTS |
|---|
Received March 24, 2004; Last revision November 2, 2004; Accepted November 23, 2004
| REFERENCES |
|---|
Bonate P (2000). Analysis of pretest-posttest designs. Boca Raton, FL: Chapman & Hall/CRC.
Chow SC, Shao J (2002). A note on statistical methods for assessing therapeutic equivalence. Control Clin Trial 23:515–520.
Dawson B, Trapp RG (2001). Basic and clinical biostatistics. 3rd ed. New York: McGraw-Hill.
Esposito M, Coulthard P, Worthington HV (2003). Enamel matrix derivative (Emdogain®) for periodontal tissue regeneration in intrabony defects (Cochrane Review). In: The Cochrane Library, Issue 2. http://www.thecochranelibrary.com
Everitt B (1995). The analysis of repeat measures: a practical review with examples. Statistician 44:113–135.
Fleiss JL (1992). General design issues in efficacy, equivalency and superiority trials. J Periodontal Res 27:306–313.
Greenstein G, Lamster I (2000). Efficacy of periodontal therapy: statistical versus clinical significance. J Periodontol 71:657–662.
Gunsolley JC, Elswick RK, Davenport JM (1998). Equivalence and superiority testing in regeneration clinical trials. J Periodontol 69:521–527.
Hujoel PP, Baab DA, DeRouen TA (1992). The power of tests to detect differences between periodontal treatments in published studies. J Clin Periodontol 19:779–784.
Koch GG, Paquette DW (1997). Design principles and statistical considerations in periodontal clinical trials. Ann Periodontol 2:42–63.
Laster LL (1992). Some aspects of efficient experimental design and analysis in periodontal trials. J Periodontal Res 27:405–411.
Lehnhoff RW, Grainger RM (1974). Use of analysis of covariance in periodontal clinical trials. J Periodontal Res 9(Suppl 14):143–159.
Moher D, Schulz KF, Altman DG (2001). The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. BMC Med Res Methodol 1:2. http://biomedcentral.com
Montenegro R, Needleman I, Moles D, Tonetti M (2002). Quality of RCTs in periodontology: a systematic review. J Dent Res 81:866–870.
Moye LA (2000). Statistical reasoning in medicine: the intuitive P-value primer. New York: Springer-Verlag.
Needleman IG, Giedrys-Leepper E, Tucker RJ, Worthington HV (2001). Guided tissue regeneration for periodontal infra-bony defects. Cochrane Data Base Systematic Review. J Periodontal Res 37:380–388.
R Development Core Team (2003). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3, URL http://www.R-project.org.
Rethman MP, Nunn ME (1999). Clinical versus statistical significance. J Periodontol 70:700–702.
Senn S, Stevens L, Chaturvedi N (2000). Repeated measures in clinical trials: simple strategies for analysis using summary measures. Stat Med 19:861–877.
Tu YK, Maddick IH, Griffiths GS, Gilthorpe MS (2004). Mathematical coupling can undermine the statistical assessment of clinical research: illustration from the treatment of guided tissue regeneration. J Dent 32:133–142. Erratum J Dent 32:339–340, 2004.
Vickers AJ (2001). The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med Res Methodol 1:6. http://biomedcentral.com