
Education And Debate

Estimating sample sizes for binary, ordered categorical, and continuous outcomes in two group comparisons

BMJ 1995; 311 doi: https://doi.org/10.1136/bmj.311.7013.1145 (Published 28 October 1995) Cite this as: BMJ 1995;311:1145

M J Campbell, reader in medical statistics,a S A Julious, statistician programmer,a D G Altman, headb

aMedical Statistics and Computing, University of Southampton, Southampton General Hospital, Southampton SO16 6YD
bMedical Statistics Laboratory, Imperial Cancer Research Fund, PO Box 123, London WC2A 3PX

Correspondence to: Dr Campbell.

Accepted 21 July 1995

Sample size calculations are now mandatory for many research protocols, but the formulas needed in common situations are not all easily accessible. This paper outlines ways of calculating sample sizes in two group studies for binary, ordered categorical, and continuous outcomes. Formulas and worked examples are given. Maximum power is usually achieved by having equal numbers in the two groups. However, this is not always possible, and calculations for unequal group sizes are given.

A sample size calculation is now almost mandatory in research protocols and to justify the size of clinical trials in papers.1 Nevertheless, one of the most common faults in papers reporting clinical trials is a lack of justification of the sample size, and it is a major concern that important therapeutic effects are being missed because of inadequately sized studies.2 A recent paper concluded that “the reporting of statistical power and sample size needs to be improved.”3 Recent articles in the BMJ have described the basis of sample size calculations4 5 and explained the fundamental concepts of statistical significance (α), effect size (δ), and power (1−β). A nomogram for sample size calculations for continuous data is also available.6 However, there have been some recent developments in the theory of sample size calculations which are likely to prove useful, and the purpose of this paper is to make available a collection of formulas and examples for a variety of situations likely to be encountered in practice. In particular, situations not dealt with in previous articles are two group comparisons with unequal sample sizes and sample sizes for ordered categorical outcomes (for example, the categories better, same, or worse). The paper describes sample size calculations, and provides tables, for studies comparing two groups of individuals with outcome variables that are binary (yes/no), ordered categorical, or continuous. A further paper will consider studies in which the data are paired. Further examples are given by Machin and Campbell.7

Parameter definition

Of all the parameters that have to be specified before the sample size can be determined, the most critical is the effect size. Halving the effect size quadruples the required sample size. The effect size can be interpreted as a “clinically important difference,” but this is often difficult to quantify. A valuable attempt at classification was made by Burnand et al, who reviewed three major medical journals, looked for words such as “impressive difference,” “important difference,” and “dramatic increase,” and then calculated a standardised effect size.8 This provides a guide to the size of effect regarded as important by other authors. Several other approaches to choosing sample sizes have been proposed: a Bayesian perspective has been given recently,9 along with an economic approach,10 and one based on patients' rather than clinicians' perceptions of benefit.11

In statistical significance tests one sets up a null hypothesis and, given the observed difference of interest, calculates the probability of observing the difference (or a more extreme one) under the null hypothesis. This yields the P value. If the P value is less than some prespecified level then we reject the null hypothesis. This level is known as the significance level (α). If we reject the null hypothesis when it is true we make a type I error, and we set α, the significance level, to control the probability of doing this. If the null hypothesis is in fact false but we fail to reject it, we make a type II error, and the probability of a type II error is denoted by β. The probability of rejecting the null hypothesis when it is false is termed the power and is defined as 1−β.

Unequal numbers in each group

For a given total sample size the maximum power is achieved by having equal numbers of subjects in the two groups. Often, however, in observational studies an equal number is not expected in each group since the incidence of a particular factor may be higher in one group than in another. In clinical trials, also, the numbers of subjects taking one treatment may have to be limited, so to achieve the necessary power one has to allocate more patients to the other treatment. In this case the sample sizes should be adjusted by a factor dependent on the allocation ratio,12 given as equation 1 in the Appendix.

If one were to maintain the same total sample size as calculated for a 1:1 ratio but allocate in the ratio 2:1, the loss in power would be quite small (around 5%). However, if the allocation ratio exceeds 2:1 with the same total sample size the power falls very quickly (a loss of around 25% in power for a ratio of 5:1), and consequently a considerably larger total sample size is required to maintain a fixed power with an unbalanced study than with a balanced one.
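As an illustration, here is a minimal Python sketch of this adjustment (the function name unequal_allocation is ours, not from the paper), implementing equation 1:

```python
import math

def unequal_allocation(m, r):
    """Group sizes for an r:1 allocation, given the equal-groups size m.

    Implements equation 1, m' = m(r + 1) / (2r), rounding up; the first
    group gets m' subjects and the second r * m'.
    """
    m_prime = math.ceil(m * (r + 1) / (2 * r))
    return m_prime, r * m_prime

# With 176 per group at 1:1 (see the continuous data example below),
# a 2:1 allocation needs 132 and 264 subjects, 396 in total.
print(unequal_allocation(176, 2))  # (132, 264)
```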

Continuous data

In a two group comparative study where the outcome measure is a continuous variable which is plausibly normally distributed, such as blood pressure, a two sample t test would be the statistical test used in the final analysis.

To calculate a sample size, in addition to the parameters discussed above, an estimate of the population standard deviation (σ) must be given. The sample size formula7 is given as equation 2 in the Appendix, and table I gives the sample size required for different values of the standardised difference d, defined as d=δ/σ, at various levels of power at the two sided 5% significance level.

Alternatively, Lehr gives a quick formula for calculating these sample sizes.13 For a two sided significance level of 5% and power of 80%, the number required in each group is given approximately by m=16/d². This formula overestimates the sample sizes a little for small values of d; otherwise it gives close approximations to the sample size.
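Both calculations are easy to program. A minimal Python sketch, with function names of our own choosing and defaults corresponding to a two sided 5% significance level and 80% power:

```python
import math

def sample_size_continuous(d, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size to detect a standardised difference d (equation 2).

    Defaults give a two sided 5% significance level and 80% power; the final
    term is the small correction for using Normal rather than t tables.
    """
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 4)

def lehr_continuous(d):
    """Lehr's quick formula (5% two sided significance, 80% power)."""
    return math.ceil(16 / d ** 2)

print(sample_size_continuous(0.3), lehr_continuous(0.3))  # 176 178
```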

WORKED EXAMPLE

In a recent paper, Godfrey et al14 found that 46 people who had no whorls on their fingers had a mean systolic blood pressure of 136 mm Hg compared with 93 patients with at least one whorl for whom the mean blood pressure was 144 mm Hg.

Suppose an experimenter wished to confirm these findings but suspected that the mean difference would be less than that observed, with 5 mm Hg being the minimum clinically important difference. The overall standard deviation of blood pressure in each group is assumed to be 17 mm Hg, the same as that published. We find d=5/17=0.294, which is about 0.3, and so from table I the sample size required to detect this difference with a two sided significance level of 5% and with 80% power would be 176 subjects in each group and so 352 subjects in total. Alternatively, from Lehr's quick formula we get m=16/0.294²=185 patients per group. Suppose, like Godfrey et al, we would expect to recruit two people with whorls for every one person with no whorls. With r=2, from equation 1 we find that m'=3×176/4=132 and so rm'=264, giving a modified total sample size of 396. The overall sample size is larger when the groups are unequal because such a design has less power than a design of the same total size with equal numbers in the two groups.
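These figures can be checked with a few lines of Python (a self-contained sketch of our own, using equations 1 and 2 and Lehr's formula):

```python
import math

# Equation 2 with delta = 5 mm Hg and sigma = 17 mm Hg (two sided 5%, 80% power):
d = 5 / 17                                  # standardised difference, about 0.3
m = 2 * (1.96 + 0.84) ** 2 / d ** 2 + 1.96 ** 2 / 4
print(math.ceil(m))                         # 183; rounding d to 0.3 reproduces table I's 176

# Lehr's quick formula, then the 2:1 allocation adjustment of equation 1:
print(math.ceil(16 / d ** 2))               # 185 per group
m_prime = math.ceil(176 * (2 + 1) / (2 * 2))
print(m_prime, 2 * m_prime, 3 * m_prime)    # 132 264 396 (total)
```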

TABLE I

Sample sizes required per group at the two sided 5% significance level for different values of d and power (d=expected mean difference/standard deviation)


Binary data

A binary outcome is a response that has just two categories. These categories may be of the form yes/no or presence/absence of a given factor, for example alive/dead. Often an experimenter may wish to compare treatments by testing whether the difference in the proportions responding on each treatment could be due to chance. In this case the effect size can be formulated as δ=pA−pB, where pA and pB are the proportions expected in the two treatment groups. The statistical test used to test for the association between two binary variables is the Pearson χ² test.

To calculate the number of patients required in each arm of a trial with a binary outcome, use equation 3 in the Appendix. For proportions greater than about 0.1 this simplifies to equation 4. Table II gives the sample sizes required for various values of pA and pB for two sided significance level α and power 1−β. Note, however, that only values of pA up to 0.5 are given in the table. This is because a success rate of 65%, say, is identical to a failure rate of 35%, and so the sample sizes for comparing pA with pB are the same as those for comparing 1−pA with 1−pB.

An approximate result similar to Lehr's formula13 for 80% power and a two sided 5% significance level is m=16p̄(1−p̄)/(pA−pB)², where p̄=(pA+pB)/2. Like Lehr's equation given earlier, this overestimates the sample size a little.
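A minimal Python sketch of equation 3 and this approximation (the function names are ours):

```python
import math

def sample_size_binary(pA, pB, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for comparing two proportions (equation 3).

    Defaults give a two sided 5% significance level and 80% power.
    """
    p_bar = (pA + pB) / 2
    m = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(pA * (1 - pA) + pB * (1 - pB))) ** 2 / (pA - pB) ** 2
    return math.ceil(m)

def quick_binary(pA, pB):
    """The Lehr-style approximation m = 16 p(1 - p) / (pA - pB)^2."""
    p_bar = (pA + pB) / 2
    return math.ceil(16 * p_bar * (1 - p_bar) / (pA - pB) ** 2)

print(sample_size_binary(0.5, 0.25), quick_binary(0.5, 0.25))  # 58 60
```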

Observational surveys such as case control studies are often summarised by an odds ratio or relative risk, rather than a difference in proportions. If pA is the proportion of cases exposed to a risk factor and pB is the proportion of controls exposed to the same risk factor, then the odds ratio of being a case given the risk factor is OR=pA(1−pB)/{pB(1−pA)}. An approximate sample size formula using the odds ratio (OR) is given by equation 5 in the Appendix.

WORKED EXAMPLE

Tovey and Bonell stated that 52 (19%) of 281 men found condoms too tight.15 Of these men, 68% had experienced their condom splitting, compared with only 26% of men whose condoms were not too tight. Suppose from anecdotal evidence a researcher suspected that the prevalence of reported splitting was nearer 50% in the group finding condoms too tight and wished to conduct a study to show that this prevalence was still significantly higher than in the other group.

The expectation is that the observed ratio of the frequencies of “not tight” to “too tight” would be 4:1. Here pA=0.5 (too tight), pB=0.25 (not tight), and r=4. From table II the sample size required with equal allocation in each group would be 58, and using equation 1 one derives a modified sample size of just 37 subjects in the group who found condoms too tight and 148 in the other group, giving a total of 185. In the unlikely event of equal group sizes a total of 116 subjects would suffice, a saving of 69 subjects; again, this arises because the equal groups design is more efficient. Note that Lehr's formula for equal sized groups gives approximately 60 per group, or a total of 120 subjects. If we specified the effect size as an odds ratio, then the postulated odds of splitting when the condom is too tight are three times those when it is not. From equation 5, we find in this case that for equal allocation we require 55 subjects per group.
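The arithmetic can again be verified with a short self-contained Python sketch (our own, using equations 1, 3, and 5):

```python
import math

# Equation 3 with pA = 0.5 (too tight) and pB = 0.25 (not tight):
pA, pB = 0.5, 0.25
p_bar, delta = (pA + pB) / 2, pA - pB
m = (1.96 * math.sqrt(2 * p_bar * (1 - p_bar))
     + 0.84 * math.sqrt(pA * (1 - pA) + pB * (1 - pB))) ** 2 / delta ** 2
print(math.ceil(m))                        # 58 per group with equal allocation

# Equation 1 with r = 4 ("not tight" outnumber "too tight" four to one):
m_prime = math.ceil(58 * (4 + 1) / (2 * 4))
print(m_prime, 4 * m_prime)                # 37 148, total 185

# Equation 5, specifying the effect size as an odds ratio instead:
OR = pA * (1 - pB) / (pB * (1 - pA))       # = 3.0
m_or = 2 * (1.96 + 0.84) ** 2 / (math.log(OR) ** 2 * p_bar * (1 - p_bar))
print(round(m_or))                         # about 55 per group
```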

TABLE II

Sample sizes to detect a difference in two proportions, pA and pB, at a 5% significance level with 80% power


Ordered categorical data

A study may be undertaken where the outcome measure of interest is an ordered scale, such as a Likert scale (strongly disagree, disagree, agree and strongly agree) or a rating scale (better, same, worse). The statistical test used in this instance is the Mann-Whitney U test, with allowance for ties.16 The calculation of sample sizes when the data are ordered is not immediately straightforward. The problem becomes considerably easier, however, if one considers a number of pragmatic steps which will be described later in this section.

As before, we need to specify an effect size, and here it turns out to be easier to use the odds ratio. We must also specify the proportion of subjects expected in each category of the scale for one of the groups. Suppose we have t categories, with the higher categories indicating worse prognosis, and the proportions expected in group A are pA1, pA2, …, pAt (where pA1+pA2+ … +pAt=1), with similar notation for group B. Let cA1, cA2, …, cAt be the cumulative proportions, so cA1=pA1, cA2=pA1+pA2, and so on. The odds ratio is the odds of a subject being in a given category or lower in one group compared with the other. For category 1 it is given by OR1={cA1/(1−cA1)}/{cB1/(1−cB1)}, and similarly OR2 for category 2, up to ORt−1 for category t−1. As will be shown later, the odds ratio may not be too difficult to estimate, as the proportions expected for one group may already be known from a pilot study or from previous research. The experimenter may postulate that on the new treatment a patient is only half as likely to have a score above a given level as on the old treatment, so that the odds ratio would be estimated as 0.5. Alternatively, an experimenter may know the expected proportions in each category for one group and speculate that, if a proportion p were in a particular category or better, a clinically important difference would be for the corresponding proportion to be about 20% higher in the other group. From this information an odds ratio can be calculated, and hence the other expected proportions and the sample size.

Equation 6 in the Appendix gives the formula for sample size calculations for ordered categorical data. It assumes that the odds ratio is constant for each pair of adjacent categories, that is, OR1=OR2= … =ORt−1, and this assumption means that the Mann-Whitney U test is the best test to use. It also means that one can estimate the odds ratio from any cumulative proportion from each group. To aid the calculations, table III gives values of the numerator of equation 6 for different values of the odds ratio and power.
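A minimal Python sketch of equation 6 (the function name and argument layout are ours):

```python
import math

def sample_size_ordinal(OR, p_bar, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for an ordered categorical outcome (equation 6).

    OR:    common odds ratio assumed between adjacent cumulative categories
    p_bar: list of mean proportions expected per category, (pAi + pBi) / 2
    Defaults give a two sided 5% significance level and 80% power.
    """
    numerator = 6 * (z_alpha + z_beta) ** 2 / math.log(OR) ** 2
    denominator = 1 - sum(p ** 3 for p in p_bar)
    return math.ceil(numerator / denominator)
```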

TABLE III

For ordered categorical data, values of 6(z1−α/2+z1−β)²/(log OR)² for various values of the odds ratio (OR) and power (1−β) at two sided 5% significance


If the number of categories is large it is difficult to postulate the proportion of people who would fall in a given category. However, Whitehead has shown that there is little increase in power (and hence saving in the number of subjects recruited) to be gained by increasing the number of categories beyond five.17

WORKED EXAMPLE

In a randomised controlled trial of paracetamol for the treatment of feverish children, Kinmonth et al categorised playfulness as normal or slightly, moderately, or very listless.18 The results for the 43 replies are given in table V, together with the proportions and the cumulative proportions. The first odds ratio in the table is calculated as {0.14/(1−0.14)}/{0.27/(1−0.27)}=0.44, and in a similar way we get 0.287 and 0.1625 for the other two pairs. The average is about 0.3.

Suppose a new study was planned in which we wished to replicate these results. The distribution of children in the control group (group A) was expected to be about the same as that found previously and is used in the calculation of the sample sizes. If an odds ratio of 0.33 in favour of paracetamol (or equivalently an odds ratio of about 3 against the control) was expected, then from the definition of the odds ratio we can calculate the expected cumulative proportions in the treatment group (group B) from the formula cBi=cAi/{cAi+OR(1−cAi)}. Thus the proportion expected in the first category of group B is 0.14/{0.14+0.33(1−0.14)}=0.33 and so on. The cumulative proportions expected in group B are 0.33, 0.65, 0.83, and 1.00, and so the actual proportions expected are 0.33, 0.32 (=0.65−0.33), 0.18 (=0.83−0.65), and 0.17 (=1.00−0.83). The average proportions p̄i are 0.235, 0.280, 0.210, and 0.275, and thus 1−Σp̄i³=0.935. For 80% power and a 5% significance level the numerator, from table III, is 39.02, and so the sample size is 39.02/0.935=41.7, or about 42 patients per group.
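The whole calculation can be reproduced in a few lines of Python (a self-contained sketch of our own; the value 0.62 below is back-calculated from the example rather than taken from table V):

```python
import math

# Cumulative proportions expected in the control group: 0.14 and 0.38 appear
# in the text; 0.62 is back-calculated because table V is not reproduced here.
cA = [0.14, 0.38, 0.62, 1.00]
OR = 1 / 3        # about 3 against the control, 0.33 in favour of paracetamol

# Expected cumulative proportions in group B, cBi = cAi/{cAi + OR(1 - cAi)}:
cB = [c / (c + OR * (1 - c)) for c in cA]         # 0.33, 0.65, 0.83, 1.00

# Per-category proportions and the group averages:
pA = [b - a for a, b in zip([0] + cA[:-1], cA)]
pB = [b - a for a, b in zip([0] + cB[:-1], cB)]
p_bar = [(a + b) / 2 for a, b in zip(pA, pB)]     # 0.235, 0.28, 0.21, 0.275

# Equation 6:
m = (6 * (1.96 + 0.84) ** 2 / math.log(OR) ** 2) / (1 - sum(p ** 3 for p in p_bar))
print(math.ceil(m))                               # 42 patients per group
```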

The formula is quite complicated and we have a number of suggestions to simplify matters. If the mean proportions (p̄i) in each category are roughly equal then the denominator in equation 6 is constant for a given number of categories, and if the number of categories exceeds five it is approximately unity. Thus for 80% power and a two sided significance level of 5%, an estimate of the sample size can be obtained from m=47/(log OR)². If the number of categories is five or fewer then multiply this sample size estimate by the correction factor given in table IV. From this table, in the situation of approximately equal proportions, it is evident that having only two categories in your data for analysis may require you to recruit a third more patients than if the data were kept continuous. For our example, the correction factor from table IV is 1.067 and so m=1.067×47/(log 0.33)²=40.8, or 41 patients per group.
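As a quick check of this shortcut (values as in the example):

```python
import math

# Quick estimate, multiplied by the table IV correction factor for four categories:
print(1.067 * 47 / math.log(0.33) ** 2)   # 40.8, or about 41 patients per group
```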

TABLE IV

Correction factor to be used with table III when the number of categories is ≤5

TABLE V

Playfulness in children


Another simplification occurs if the proportion of subjects in one category for both groups is expected to be large. We can combine categories until there are only two left and use the formula and table given previously for binary data. Combining categories reduces the amount of information available, so one would expect the required sample size to increase.

In the worked example, if we had pooled those scoring 1-2 and those scoring 3-4, we would compare the proportion pA=0.38 with pB=0.65. Formula 4 shows that this study would require 49.9, or about 50, patients per group. Thus use of all four categories, rather than just two, yields a reduction of 16% in the study size, which might outweigh the benefit of an easier sample size calculation.
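A short check of this figure using formula 4 (a sketch of our own):

```python
# Formula 4 after pooling the four categories into two (pA = 0.38, pB = 0.65):
pA, pB = 0.38, 0.65
m = (1.96 + 0.84) ** 2 * (pA * (1 - pA) + pB * (1 - pB)) / (pA - pB) ** 2
print(round(m, 1))   # 49.8 with z values of 1.96 and 0.84 (the paper quotes 49.9)
```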

Comment

From the equations in the Appendix it is clear that the sample size, significance level, power, and effect size are all interlinked. Given any three parameters, in principle the equations can be solved for the fourth. Thus, if the sample size were limited by resources, and the significance level fixed in advance, one could arbitrarily increase the power of the study by postulating larger effect sizes. In practice, however, the estimate of the effect of an intervention often proves too optimistic, resulting in many trials which are too small. The need for sample size calculations provides an excellent opportunity to involve a statistician early in the planning of a study and not just when the analysis is required. This paper has covered only a limited range of designs, and a statistician could advise on other designs. These include comparison of more than two groups,19 comparison of survival curves,7 20 21 and studies to demonstrate bioequivalence.22 Computer software is available for some of the sample size calculations discussed here,23 24 25 26 and other reviews have been given.27 28

Acknowledgments

We thank Dr D Machin for comments on an earlier manuscript.

Appendix

In each of the following, m is the number of subjects required in each group for a two sided significance level α and power 1−β, and z1−α/2 and z1−β are the appropriate values from the standard Normal distribution for the 100(1−α/2) and 100(1−β) percentiles respectively. Some useful values are: for two sided α=0.05, z1−α/2=1.96; for two sided α=0.01, z1−α/2=2.58; for β=0.2, z1−β=0.84; and for β=0.1, z1−β=1.28.

UNEQUAL ALLOCATION

Given m, calculated assuming equal sized groups, let m' be the sample size in the first group and rm' the sample size in the second group, where r is the allocation ratio. Then m' is given by

$$ m' = \frac{r+1}{2r}\,m \qquad (1) $$

CONTINUOUS DATA

To detect a difference δ we require7:

$$ m = \frac{2\,(z_{1-\alpha/2}+z_{1-\beta})^2}{d^2} + \frac{z_{1-\alpha/2}^2}{4} \qquad (2) $$

where d=δ/σ and σ is the standard deviation of the measurements. The last term in the equation is a correction factor to enable Normal tables rather than t tables to be used and can be ignored except for very small sample sizes. For a 5% two sided significance level it increases the sample size by 1. Table I gives the sample size required for different values of d and power from 50% to 99%.

BINARY OUTCOME

Suppose the expected proportions in groups A and B are pA and pB. Then

$$ m = \frac{\left[z_{1-\alpha/2}\sqrt{2\bar p(1-\bar p)} + z_{1-\beta}\sqrt{p_A(1-p_A)+p_B(1-p_B)}\right]^2}{\delta^2} \qquad (3) $$

where δ=pA−pB and p̄=(pA+pB)/2. An approximate, simpler formula is:

$$ m = \frac{(z_{1-\alpha/2}+z_{1-\beta})^2\,\left[p_A(1-p_A)+p_B(1-p_B)\right]}{\delta^2} \qquad (4) $$

which is sufficiently accurate except when pA and pB are small (say <0.05). Table II gives the sample size required per group at the 5% significance level and 80% power for values of pA between 0 and 0.45 and pB between 0.05 and 1.00.

If the effect size is specified as an odds ratio, OR=pA(1−pB)/{pB(1−pA)}, then an approximate formula is:

$$ m = \frac{2\,(z_{1-\alpha/2}+z_{1-\beta})^2}{(\log \mathrm{OR})^2\,\bar p(1-\bar p)} \qquad (5) $$

ORDERED CATEGORICAL DATA

$$ m = \frac{6\,(z_{1-\alpha/2}+z_{1-\beta})^2/(\log \mathrm{OR})^2}{1-\sum_{i=1}^{k}\bar p_i^{3}} \qquad (6) $$

where OR is the odds ratio of a patient being in category i or less on one treatment compared with the other, k is the number of categories, and p̄i is the mean proportion expected in category i, that is, p̄i=(pAi+pBi)/2, where pAi and pBi are the proportions expected in category i for groups A and B respectively.

Footnotes

  • Funding MJC and SAJ are funded by the Higher Education Funding Council and DGA by the Imperial Cancer Research Fund

  • Conflict of interest None.

References