J Dent Res 84(3):283-287, 2005
© 2005 International and American Associations for Dental Research


RESEARCH REPORTS
Clinical

Statistical Power for Analyses of Changes in Randomized Controlled Trials

Y.-K. Tu1,2,*, A. Blance1,2, V. Clerehugh1, and M.S. Gilthorpe2

1 Department of Periodontology, Leeds Dental Institute, University of Leeds, Clarendon Way, Leeds, LS2 9LU, UK; and
2 Biostatistics Unit, Centre for Epidemiology & Biostatistics, Leeds Institute of Genetics, Health & Therapeutics, University of Leeds, 30/32 Hyde Terrace, Leeds, LS2 9LN, UK;

* corresponding author, y.k.tu{at}leeds.ac.uk


   ABSTRACT
Randomized controlled trials (RCTs) are widely recommended as the most useful study design for generating reliable evidence and guidance for daily practice in medicine and dentistry. However, it is not well known in dental research that different statistical methods of data analysis can yield substantial differences in study power. In this study, computer simulations are used to explore how different univariate and multivariate statistical methods of analyzing change in continuous outcome variables affect study power and the sample size required for RCTs. Results show that, in general, analysis of covariance (ANCOVA) yields greater power than other statistical methods in testing the superiority of one treatment over another, or in testing the equivalence between two treatments. Therefore, ANCOVA should be used in preference to change score or percentage change score to reduce type II error rates.

KEY WORDS: power • sample size • randomized controlled trials


   INTRODUCTION
Randomized controlled trials (RCTs) are widely recommended as the most useful study design to generate reliable evidence. There are textbooks and journal articles devoted to the subject of assessing RCTs, and a consensus statement (CONSORT) has been published and updated regularly by an international team to give guidance on the design, conduct, and publication of RCTs (Moher et al., 2001). However, a recent article examined the quality of RCTs in periodontal research according to several criteria, and the results were discouraging (Montenegro et al., 2002): the quality of these RCTs frequently failed to reach recommended standards.

An important issue frequently overlooked in evaluations of the quality of RCTs in dental research is sample size and hence study power. The power of a hypothesis test is the probability of rejecting the null hypothesis when the alternative hypothesis is true (Dawson and Trapp, 2001). Incorrectly failing to reject the null hypothesis is known as a Type II error. Power is defined as (1 – β), where β is the probability of a Type II error (Moye, 2000). The lack of power in periodontal research has been addressed previously (Hujoel et al., 1992; Gunsolley et al., 1998), although it is still not common practice for published trials to state clearly how power and sample size were determined before the start of the trial. Moreover, it is relatively unknown to dental researchers that the statistical analyses used to compare differences in treatment effects can have a substantial impact on study power.

The aims of this study are to use computer simulations to explore how using different statistical methods of analyzing change in continuous outcomes affects study power, and the sample size required for RCTs to test the superiority of one treatment over another or the equivalence of one treatment with another. Simulations are based on the assumption that the treatment effect is either independent of baseline disease severity or is associated with baseline disease levels. By using different assumptions, we investigated which statistical methods give rise to the greatest power for different clinical scenarios.


   MATERIALS & METHODS
For this study, two research designs were simulated separately: one for investigating superiority trials, and the other for investigating equivalence trials. The hypothetical research question for superiority trials is whether a new treatment, such as guided tissue regeneration (GTR), achieves better treatment outcomes than a conventional treatment, such as the open flap operation. The hypothetical research question for an equivalence or non-inferiority trial is whether a new treatment, such as application of enamel matrix derivatives (EMD) into an infrabony defect, is able to attain treatment effects comparable with those of an established treatment, such as GTR. Sample size estimation for an equivalence trial is intended to ensure that the trial has enough power to demonstrate that the new treatment differs from the active control or ‘standard’ treatment by no more than a pre-specified margin. The study design is to allocate patients randomly to treatment and control groups, where each patient contributes only one lesion. The treatment outcome is clinical attachment level (CAL) gain, although this could also be probing pocket depth (PPD) reduction.

For the superiority trial, mean pre-treatment CAL (precal) is 9 mm, with a standard deviation (SD) of 2 mm in both test and control treatment groups, based on original data in the literature (Esposito et al., 2003; Tu et al., 2004). The mean treatment effect is 4 mm for the treatment group and 2 mm for the control group. A 2-mm difference between treatment arms is chosen because it has been suggested that a 2-mm change is clinically significant and necessary for an expensive treatment such as GTR to justify its routine use (Greenstein and Lamster, 2000). For the equivalence trial, the treatment effect is 4 mm for both the test group (EMD) and the active-control group (GTR). For the new treatment to claim equivalence to the established treatment, it is suggested that the former achieve 90% of the effect of the latter (Rethman and Nunn, 1999). In this study, 0.5 mm was the adopted equivalence margin (tolerance). Thus, the difference in the performance of the two treatments must fall within this range to claim that the treatment effects of EMD are comparable with those of GTR. Two one-sided test (TOST) procedures (Chow and Shao, 2002) were adopted to test the equivalence of the two treatments, i.e., therapeutic equivalence is concluded if and only if the 95% confidence interval (CI) of the estimate is within the pre-defined equivalence margin. One-sided tests (OST) were adopted to test the non-inferiority of the new treatment to the established one, i.e., non-inferiority is concluded if the lower limit of the 95% CI does not fall below the lower margin.
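The two decision rules can be illustrated with a short Python sketch. It is not the authors' R code and rests on stated assumptions: the estimate is the new-minus-established difference in treatment effect (positive favors the new treatment), the interval uses a normal quantile rather than Student's t, and the lower limit for non-inferiority is taken from the same two-sided interval. The function names and the `level` parameter are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate, se, level=0.95):
    """Two-sided interval using a normal quantile (assumption: large sample)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


def is_equivalent(estimate, se, margin=0.5, level=0.95):
    """Equivalence: the whole interval lies inside (-margin, +margin)."""
    lower, upper = confidence_interval(estimate, se, level)
    return -margin < lower and upper < margin


def is_non_inferior(estimate, se, margin=0.5, level=0.95):
    """Non-inferiority: only the lower limit is compared with -margin."""
    lower, _ = confidence_interval(estimate, se, level)
    return lower > -margin
```

Because non-inferiority checks one limit only, any estimate that passes the equivalence rule also passes the non-inferiority rule, but not the reverse.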

Two SDs of post-treatment CAL (postcal) are considered: 2 mm and 1.5 mm. If there is no difference in the SD between precal and postcal, the treatment effects are not dependent on the baseline disease levels; where the post-treatment SD becomes smaller than the pre-treatment SD, there is a ‘baseline effect’ (Tu et al., 2004).
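One way to generate data with these properties is sketched below in Python. This is an illustration, not the authors' simulation code: it assumes bivariate normal pre- and post-treatment values and assumes that a treatment effect lowers mean CAL, so that CAL gain is pre minus post. The name `simulate_arm` and its parameters are hypothetical.

```python
import math
import random


def simulate_arm(n, effect, rho, sd_pre=2.0, sd_post=2.0, mean_pre=9.0, rng=None):
    """Draw n (pre, post) CAL pairs with correlation rho between pre and post.

    Assumption: post-treatment mean = mean_pre - effect (a gain lowers CAL).
    """
    rng = rng or random.Random()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        pre = mean_pre + sd_pre * z1
        # Mix z1 and z2 so that corr(pre, post) = rho and SD(post) = sd_post.
        post = (mean_pre - effect) + sd_post * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        pairs.append((pre, post))
    return pairs
```

Setting `sd_post=2.0` corresponds to the scenario without a baseline effect, and `sd_post=1.5` to the scenario with one.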

Four univariate and two multivariate statistical methods of analyzing change are used: (1) test only the post-treatment scores; (2) test the change scores; (3) test the percentage change scores; (4) analysis of covariance (ANCOVA); (5) random effects modeling (REM); and (6) multivariate analysis of variance (MANOVA). [A detailed explanation of these methods is provided in the online Appendix.] In superiority trials, the power of all six methods is compared. In equivalence trials, REM and MANOVA are not considered. The reason for omitting REM is that this method achieves exactly the same estimates (regression coefficients) and standard errors as using change scores, i.e., they achieve exactly the same power in equivalence trials. The reason for excluding MANOVA is that this method does not give rise to an estimate of the treatment difference, and hence an associated confidence interval. Consequently, only the four univariate statistical methods of analyzing changes are used to test equivalence.
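For readers who want to see how the univariate outcomes differ computationally, the following Python sketch derives change scores, percentage change scores, and a baseline-adjusted (ANCOVA) treatment difference with a common slope. It is an assumption-laden illustration rather than the method in the online Appendix: gain is defined as pre minus post, and all function names are hypothetical.

```python
import math
from statistics import mean


def change_scores(pre, post):
    """CAL gain for each patient (assumption: gain = pre - post)."""
    return [x - y for x, y in zip(pre, post)]


def percentage_change_scores(pre, post):
    """Gain expressed as a percentage of the baseline value."""
    return [100.0 * (x - y) / x for x, y in zip(pre, post)]


def ancova_difference(pre_t, post_t, pre_c, post_c):
    """Baseline-adjusted difference in post-treatment means (test minus control).

    Fits a common within-group slope of post on pre and returns
    (adjusted difference, standard error).
    """
    mx_t, my_t = mean(pre_t), mean(post_t)
    mx_c, my_c = mean(pre_c), mean(post_c)
    # Pooled within-group sums of squares and cross-products.
    sxx = sum((x - mx_t) ** 2 for x in pre_t) + sum((x - mx_c) ** 2 for x in pre_c)
    syy = sum((y - my_t) ** 2 for y in post_t) + sum((y - my_c) ** 2 for y in post_c)
    sxy = (sum((x - mx_t) * (y - my_t) for x, y in zip(pre_t, post_t))
           + sum((x - mx_c) * (y - my_c) for x, y in zip(pre_c, post_c)))
    slope = sxy / sxx
    estimate = (my_t - my_c) - slope * (mx_t - mx_c)
    n_t, n_c = len(pre_t), len(pre_c)
    residual_ss = max(syy - slope * sxy, 0.0)  # guard against rounding below zero
    mse = residual_ss / (n_t + n_c - 3)
    se = math.sqrt(mse * (1.0 / n_t + 1.0 / n_c + (mx_t - mx_c) ** 2 / sxx))
    return estimate, se
```

Testing only post-treatment scores or change scores then amounts to a two-sample comparison of the corresponding lists, whereas ANCOVA compares post-treatment means after adjusting for baseline.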

Correlation between Pre- and Post-treatment Values
Variability in the treatment effect, as measured by the correlation between the pre- and post-treatment values, can affect study power. A previous study has shown that the correlation between pre- and post-treatment CAL might vary from nearly zero to a strong positive correlation, such as 0.9 (Tu et al., 2004). For superiority trials, the hypothetical sample size ranges from 10 to 30 in each group, although preliminary simulations revealed that, if there was a baseline effect on treatment, little difference was detected among the six methods when the sample size was 30 in each group. Consequently, simulations were performed with only 10 and 20 in each group when a baseline effect was assumed. For equivalence and non-inferiority trials, the hypothetical sample sizes were 50, 100, and 150. For each sample size, the correlation between pre- and post-treatment values was varied from 0.1 to 0.9, in steps of 0.2. Ten thousand simulations (sufficient to achieve robust estimates) were performed for each scenario.

For superiority trials, the power of the six tests is calculated as the percentage of simulated studies that show a statistically significant difference between the two groups. For equivalence and non-inferiority trials, the power of the four univariate tests is calculated as the percentage of simulated studies that show the treatment effects of both treatments to be equivalent. The objectives were to determine: (1) whether there were substantial differences in power among the various statistical methods; (2) how pre- and post-treatment correlation affects the performance of each method, according to whether treatment effects are associated with baseline disease severity; and (3) for equivalence trials only, whether there were any differences in power between TOST and OST. The 5% level of statistical significance was assumed throughout, and all the computer simulations were performed with the statistical software R version 1.8.1 (R Development Core Team, 2003).
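The definition of power as a percentage of significant simulated trials can be sketched as a Monte Carlo loop. The Python below is a simplified illustration, not a reproduction of the study: it covers only the post-treatment and change-score analyses, assumes bivariate normal data with CAL gain equal to pre minus post, and uses the large-sample critical value 1.96 instead of Student's t, so its results will not match the published tables exactly (power is somewhat overstated in small groups). All names are hypothetical.

```python
import math
import random
from statistics import mean

Z_CRIT = 1.96  # assumption: normal approximation to the two-sided 5% critical value


def draw_arm(rng, n, effect, rho, sd_pre=2.0, sd_post=2.0, mean_pre=9.0):
    """Simulate pre- and post-treatment CAL for one arm."""
    pre, post = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pre.append(mean_pre + sd_pre * z1)
        post.append(mean_pre - effect
                    + sd_post * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pre, post


def two_sample_statistic(a, b):
    """Difference in means divided by its pooled standard error."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    se = math.sqrt(ss / (na + nb - 2) * (1.0 / na + 1.0 / nb))
    return (ma - mb) / se


def estimate_power(n, rho, outcome, n_sims=10_000, seed=1):
    """Percentage of simulated superiority trials that reach significance."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        pre_t, post_t = draw_arm(rng, n, 4.0, rho)  # test arm: 4-mm mean effect
        pre_c, post_c = draw_arm(rng, n, 2.0, rho)  # control arm: 2-mm mean effect
        if outcome == "post":
            a, b = post_t, post_c
        else:  # "change": compare gains (pre - post)
            a = [x - y for x, y in zip(pre_t, post_t)]
            b = [x - y for x, y in zip(pre_c, post_c)]
        if abs(two_sample_statistic(a, b)) > Z_CRIT:
            significant += 1
    return 100.0 * significant / n_sims
```

For example, `estimate_power(20, 0.9, "change")` and `estimate_power(20, 0.9, "post")` can be compared to see how the correlation shifts the relative power of the two analyses.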


   RESULTS
Simulations for the Power of Superiority
Power with the use of only post-treatment values was unaffected by the correlation between pre- and post-treatment values (rpre.post), whereas power for the other five statistical methods was dependent on rpre.post (Tables 1, 2), with power exhibiting a dramatic improvement as rpre.post increased. If no baseline effect was assumed (Table 1), when rpre.post reached 0.5, the use of the change score, percentage change score, random effects modeling (REM), and MANOVA had power comparable with that when only post-treatment values were used. Change score and REM achieved comparable power throughout for different values of rpre.post and sample size, and using percentage change score was slightly more powerful than using change score. MANOVA had greater power than change score, percentage change score, and REM when rpre.post was low. In contrast, when rpre.post was high, MANOVA had less power than the other methods. Except when rpre.post was small, where using only post-treatment values had the greatest power, the most powerful test was generally ANCOVA.


Table 1. Percentages of Statistically Significant Differences (at the 5% level) between Test and Control Groups for Sample Sizes of 10, 20, and 30 in Each Group by Six Different Statistical Methods when the Correlations (ρ) between Pre-treatment CAL and Post-treatment CAL Range between 0.1 and 0.9
 

Table 2. Percentages of Statistically Significant Differences (at the 5% level) between Test and Control Groups for Sample Sizes of 10 and 20 in Each Group with Six Different Statistical Methods when the SD of Post-treatment is Reduced to 1.5 mm and the Correlations (ρ) between Pre-treatment CAL and Post-treatment CAL Range between 0.1 and 0.9
 
When the SD of post-treatment values was reduced, the power of all six methods improved compared with that observed with constant SD. Using change score or REM achieved power comparable with that of using only post-treatment values when rpre.post was 0.7, although using percentage change score had slightly greater power than change score. Therefore, using change score or percentage change score becomes less efficient when treatment effects are dependent on the baseline disease severity. MANOVA always had greater power than change score, percentage change score, or REM when treatment effects were dependent on the baseline disease severity. Generally, the most powerful test remained ANCOVA.
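The correlations at which change score catches up with post-treatment values (about 0.5 with equal SDs, about 0.7 with a post-treatment SD of 1.5 mm) are consistent with standard variance formulas for bivariate normal data. The Python sketch below evaluates those formulas; it is a large-sample illustration under that distributional assumption, not part of the original simulations, and `outcome_sd` is a hypothetical name.

```python
import math


def outcome_sd(rho, sd_pre=2.0, sd_post=2.0):
    """Standard deviation of the quantity each analysis compares between arms."""
    return {
        # Post-treatment values only: unaffected by the correlation.
        "post": sd_post,
        # Change score: Var(pre - post) = sd_pre^2 + sd_post^2 - 2*rho*sd_pre*sd_post.
        "change": math.sqrt(sd_pre ** 2 + sd_post ** 2 - 2.0 * rho * sd_pre * sd_post),
        # ANCOVA: residual SD of post after regressing on pre.
        "ancova": sd_post * math.sqrt(1.0 - rho ** 2),
    }
```

A smaller SD means a smaller standard error for the treatment difference and hence greater power. The change-score SD equals the post-treatment SD when rho = sd_pre / (2 × sd_post), which is 0.5 for equal SDs of 2 mm and about 0.67 when the post-treatment SD is 1.5 mm, whereas the ANCOVA residual SD never exceeds the post-treatment SD.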

Simulations for the Power of Equivalence
Power using only post-treatment values was unaffected by the correlation between pre- and post-treatment values (rpre.post) and, for small correlations, was greater than power using change or percentage change scores (Tables 3, 4). Power was dependent on rpre.post using change score, percentage change score, and ANCOVA, with power steadily increasing as rpre.post increased. Using percentage change score was slightly more powerful than using change score when rpre.post was not large. When rpre.post was small, the power of ANCOVA was close to that for only post-treatment values, although much higher than for change score or percentage change score. Generally, the most powerful test remained ANCOVA.


Table 3. Percentages of Statistically Therapeutic Equivalences (at the 5% level) between Test and Control Groups for Sample Sizes of 50, 100, and 150 in Each Group by Four Different Statistical Methods when the SD of Post-treatment is 2 mm (no baseline effect) and the Correlations (ρ) between Pre-treatment CAL and Post-treatment CAL Range between 0.1 and 0.9
 

Table 4. Percentages of Statistically Therapeutic Equivalences (at the 5% level) between Test and Control Groups for Sample Sizes of 50, 100, and 150 in Each Group by Four Different Statistical Methods when the SD of Post-treatment is Reduced to 1.5 mm (moderate baseline effect) and the Correlations (ρ) between Pre-treatment CAL and Post-treatment CAL Range between 0.1 and 0.9
 
Power generally increased when the SD of post-treatment values was reduced from 2 mm to 1.5 mm. Even though ANCOVA was typically the most powerful test, under TOST procedures when rpre.post was as large as 0.9, sample sizes of 50 patients in each group achieved only 50% power for a post-treatment SD of 2 mm and 83% power for a post-treatment SD of 1.5 mm (Tables 3, 4). As the sample size increased to 100 in each group, ANCOVA attained a more acceptable power when rpre.post was 0.9, at 92% for a post-treatment SD of 2 mm and 99% for a post-treatment SD of 1.5 mm. When rpre.post was 0.9, using change score attained 90% and 89% power for post-treatment SDs of 2 mm and 1.5 mm, respectively, whereas using percentage change score attained 62% and 97% power. With groups of 150 patients, ANCOVA attained at least 99% power for both post-treatment SDs when rpre.post was 0.9, and attained 81% and 94% power for a post-treatment SD of 1.5 mm when rpre.post was 0.5 and 0.7, respectively. Using change score also attained 99% power for both post-treatment SDs when rpre.post was 0.9. Using percentage change score gave rise to 89% and 100% power for a post-treatment SD of 1.5 mm when rpre.post was 0.7 and 0.9, respectively (Tables 3, 4). Using only post-treatment values never attained acceptable power in any circumstances under TOST procedures.

Simulations for the Power of Non-inferiority
OST procedures gave rise to greater power than TOST procedures. Differences were especially substantial when the correlations between pre- and post-treatment values were small. However, even for OST procedures, the sample sizes simulated did not always achieve satisfactory power (Tables 3, 4).


   DISCUSSION
Study power is determined by the sample size, the variability of treatment effects, and the statistical methods adopted (Everitt, 1995; Bonate, 2000; Senn et al., 2000; Vickers, 2001). Occasional discussions of the impact of sample size and the variability of treatment effects on study power are observed in the dental literature (Blomqvist and Dahlén, 1985; Fleiss, 1992; Laster, 1992; Koch and Paquette, 1997; Gunsolley et al., 1998), and the superiority of ANCOVA over testing change score in terms of study power was noted three decades ago (Lehnhoff and Grainger, 1974). Nevertheless, to our knowledge, this study is the first (in oral health research) to show that, for a fixed sample size, different univariate and multivariate methods result in substantial divergence in the resultant study power, and these differences in power are more prominent when rpre.post is low. Although study power can be improved by increasing the sample size, it consequently takes longer and costs more to recruit patients to accomplish the trial. Increasing sample size is not always achievable and might even be unethical if a smaller, more cost-effective trial could obtain the same results by simply adopting more powerful statistical methods, such as ANCOVA.

A recent systematic review of the efficacy of GTR compared with flap operation in the treatment of infrabony defects (Needleman et al., 2001) showed that few RCTs had sample sizes of 20 or more in each group, and most trials used paired t tests or two-sample t tests to analyze their data. Unless the differences in the treatment effect were larger than 2 mm, or the variability of treatment effects was relatively small, most studies would have been potentially at risk of being underpowered as analyzed, while their sample size and power might have been adequate had they used ANCOVA. Nevertheless, differences in the treatment effect between GTR and open-flap operation seemed to be, on average, smaller than 2 mm (Needleman et al., 2001). Analyses using change score or percentage change score might therefore have an unacceptably high probability of a false-negative result, even where the sample size would have been adequate had alternative statistical methods such as ANCOVA been used rather than those adopted.

This study demonstrates that using different statistical methods to analyze results from clinical trials in periodontal research has substantial impact on study power, since the correlation between pre- and post-treatment values varies greatly. ANCOVA should be used in preference to change score or percentage change score, reducing Type II error rates. Power calculations need to be performed at the planning stages of clinical trials so that the appropriate sample size can be determined.


   ACKNOWLEDGMENTS
 
All four authors were funded by the United Kingdom government’s Higher Education Funding Council for England (HEFCE).

Received March 24, 2004; Last revision November 2, 2004; Accepted November 23, 2004


   REFERENCES
Blomqvist N, Dahlén G (1985). Analysis of change—are base-line measurements needed? Some statistical comments on common experimental design. J Clin Periodontol 12:877–881.

Bonate P (2000). Analysis of pretest-posttest designs. Boca Raton, FL: Chapman & Hall/CRC.

Chow SC, Shao J (2002). A note on statistical methods for assessing therapeutic equivalence. Control Clin Trials 23:515–520.

Dawson B, Trapp RG (2001). Basic and clinical biostatistics. 3rd ed. New York: McGraw-Hill.

Esposito M, Coulthard P, Worthington HV (2003). Enamel matrix derivative (Emdogain®) for periodontal tissue regeneration in intrabony defects (Cochrane Review). In: The Cochrane Library, Issue 2. http://www.thecochranelibrary.com

Everitt B (1995). The analysis of repeat measures: a practical review with examples. Statistician 44:113–135.

Fleiss JL (1992). General design issues in efficacy, equivalency and superiority trials. J Periodontal Res 27:306–313.

Greenstein G, Lamster I (2000). Efficacy of periodontal therapy: statistical versus clinical significance. J Periodontol 71:657–662.

Gunsolley JC, Elswick RK, Davenport JM (1998). Equivalence and superiority testing in regeneration clinical trials. J Periodontol 69:521–527.

Hujoel PP, Baab DA, DeRouen TA (1992). The power of tests to detect differences between periodontal treatments in published studies. J Clin Periodontol 19:779–784.

Koch GG, Paquette DW (1997). Design principles and statistical considerations in periodontal clinical trials. Ann Periodontol 2:42–63.

Laster LL (1992). Some aspects of efficient experimental design and analysis in periodontal trials. J Periodontal Res 27:405–411.

Lehnhoff RW, Grainger RM (1974). Use of analysis of covariance in periodontal clinical trials. J Periodontal Res 9(Suppl 14):143–159.

Moher D, Schulz KF, Altman DG (2001). The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. BMC Med Res Methodol 1:2. http://biomedcentral.com

Montenegro R, Needleman I, Moles D, Tonetti M (2002). Quality of RCTs in periodontology—a systematic review. J Dent Res 81:866–870.

Moye LA (2000). Statistical reasoning in medicine: the intuitive P-value primer. New York: Springer-Verlag.

Needleman IG, Giedrys-Leepper E, Tucker RJ, Worthington HV (2001). Guided tissue regeneration for periodontal infra-bony defects. Cochrane Data Base Systematic Review. J Periodontal Res 37:380–388.

R Development Core Team (2003). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3, URL http://www.R-project.org.

Rethman MP, Nunn ME (1999). Clinical versus statistical significance. J Periodontol 70:700–702.

Senn S, Stevens L, Chaturvedi N (2000). Repeated measures in clinical trials: simple strategies for analysis using summary measures. Stat Med 19:861–877.

Tu YK, Maddick IH, Griffiths GS, Gilthorpe MS (2004). Mathematical coupling can undermine the statistical assessment of clinical research: illustration from the treatment of guided tissue regeneration. J Dent 32:133–142. Erratum J Dent 32:339–340, 2004.

Vickers AJ (2001). The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med Res Methodol 1:6. http://biomedcentral.com



