Sample size calculation for prevalence studies using Scalex and ScalaR calculators | BMC Medical Research Methodology

Sample size calculation method

For a study whose objective is to estimate a prevalence, the sample size formula is simple and appears in many textbooks.

The following formula [2] is used:

$$ n = \frac{Z^{2} P (1 - P)}{d^{2}} $$

where n = Sample size,

Z = Z statistic for a confidence level (1.96 for a 95% confidence level),

P = Expected prevalence or proportion, and

d = Precision (margin of error).

However, we do not encourage researchers to use formulas as this may lead to human error in manual calculation. We can use available software and concentrate on carefully choosing the appropriate parameters for the calculation.
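As a minimal sketch of what such software computes, the function below evaluates the standard formula n = Z²P(1 − P)/d² and rounds up to a whole participant. The function name and the input checks are our own assumptions for illustration; this is not the code of the Scalex or ScalaR calculators.

```python
import math


def prevalence_sample_size(p, d, z=1.96):
    """Sample size for estimating a prevalence.

    p: expected prevalence as a proportion (0 < p < 1)
    d: absolute precision (margin of error) as a proportion
    z: Z statistic for the confidence level (1.96 for 95%)
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if d <= 0:
        raise ValueError("d must be positive")
    n = z ** 2 * p * (1 - p) / d ** 2
    # Round up: a fraction of a participant cannot be recruited.
    return math.ceil(n)


# Expected prevalence 30%, precision ±5%, 95% confidence level.
print(prevalence_sample_size(0.30, 0.05))
```

Proportions are passed as fractions (0.30, not 30), which is a design choice of this sketch; mixing percentages and fractions is exactly the kind of manual error the text warns about.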

Choosing the appropriate parameters

The above formula indicates three parameters to be determined.

Parameter 1: confidence level

When we take a sample but want to learn about the population from which it is drawn (for example, its smoking prevalence), we cannot know the exact population prevalence because we are not studying every member of the population. The sample does, however, give us an estimate with lower and upper bounds (informally "a range", called an "interval" in statistics) for the prevalence in the population. We normally calculate this interval with a certain level of confidence. In the medical and health fields, the confidence level used almost universally is 95%, giving the 95% confidence interval (CI). Most data analysis software also reports 95% CIs by default. For these reasons, and to minimize user error by non-statisticians, we have fixed the confidence level at 95% in the calculators presented here, without giving users a choice.

Parameter 2: precision

As mentioned above, we cannot know the exact population prevalence because we are not studying every member of the population. The prevalence calculated from the sample may therefore deviate from the prevalence in the population. We call this deviation sampling error. The larger the sample size, the smaller the sampling error. The size of this error is expressed as the precision, also called the "margin of error".

In practice, the precision determines the width of the 95% confidence interval. If one chooses an absolute precision of ±2% for the estimate of a prevalence, one should expect a 95% CI width of 4% in the result (example: 95% CI: 23%, 27%). If the absolute precision is ±5%, one should expect a 95% CI width of 10% (example: 95% CI: 20%, 30%). The width of the CI is twice the precision. The details are shown in Table 1.

Table 1 Relationship between precision and the width of the confidence interval (CI)

This is an opportunity for researchers to decide on the precision (margin of error) and the CI width they want to see in their results. Researchers normally want a narrower CI, but the narrower it is, the larger the sample size and the higher the cost. Researchers who choose a smaller sample size can at least anticipate how wide the CI in their results will be. The choice of precision is therefore an informed decision that researchers must make.
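The link between the chosen precision and the resulting CI width can be checked numerically. The sketch below assumes a simple normal-approximation (Wald) interval, which is the interval underlying the standard sample size formula; the text does not state which CI method the analysis would use, so the function names and the choice of interval are assumptions for illustration.

```python
import math

Z_95 = 1.96  # Z statistic for a 95% confidence level


def sample_size(p, d, z=Z_95):
    """Sample size for expected prevalence p and absolute precision d."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)


def wald_ci(p_hat, n, z=Z_95):
    """Normal-approximation CI for an observed prevalence p_hat from n subjects."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Plan for a prevalence of 25% with precision ±2%.
n = sample_size(0.25, 0.02)
low, high = wald_ci(0.25, n)
print(n, round(low, 3), round(high, 3), round(high - low, 3))
```

If the observed prevalence equals the expected one, the interval is about 23% to 27%, a width of roughly twice the precision. An observed prevalence closer to 50% than expected would give a somewhat wider interval.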

In practice, we give some recommendations for the choice of a precision value (Table 2). In general, well-funded or large-scale studies that aim to get the attention of policy makers should aim for a precision of 2-3%, whereas small-scale (or poorly funded) studies, such as undergraduate or master's research projects, can consider a precision of 4-5%. If limited resources force a precision greater than 5% (e.g., 10%), researchers should treat the study as a preliminary study.

Table 2 Recommended precision for expected prevalence

However, the above recommendation applies to an expected prevalence of 10-90%. When the expected prevalence is very small (less than 10%) or very large (more than 90%), we must apply a much smaller precision value. A precision of 5% is workable for an expected prevalence of 50%, but it is entirely inappropriate for an expected prevalence of 2%, where the resulting interval (-3% to 7%) would extend below zero.

We present the details of the recommended precision for each expected prevalence, with examples, in Table 2.

Parameter 3: data variability

The greater the variation in the data, the larger the sample size needed. This relationship can be explained by a simple analogy. When we cook soup and it is nearly done, we stir it well before tasting it. A very small amount (a small sample) is always enough to taste, because stirring has reduced the variation to almost zero.

Practically, in prevalence estimation, the prevalence has an effect on this variation and therefore on the sample size required. The relationship between prevalence and sample size is shown in Fig. 1.

Fig. 1 Prevalence and effect on sample size

Of course, estimating the prevalence is the goal of the research, so researchers do not know it in advance. To calculate the sample size, we therefore normally take the expected prevalence from the most recent published studies in a similar study population. If we cannot find suitable studies in the literature, we may consider conducting a pilot study.

When we find several suitable prevalences in the literature, for example ranging from 15 to 30%, we should use the prevalence that gives the largest sample size (in this case, 30%): Fig. 1 shows that 30% requires the largest sample size within the range of 15-30%. Similarly, if the prevalence ranges from 60 to 80% in recent literature, we should use 60%, as this requires the largest sample size in that range.
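The selection rule above amounts to picking the value in the reported range that lies closest to 50%. The helper below is a minimal sketch of that rule; its name is hypothetical, and it assumes a fixed precision and a range that lies entirely within 10-90%, where the sample size peaks at 50%.

```python
def conservative_prevalence(low, high):
    """Return the prevalence in [low, high] that needs the largest sample.

    Assumes proportions (0.15 for 15%), a fixed precision, and a range
    inside 10-90%, so the sample size is largest at the value nearest 0.5.
    """
    if not 0.10 <= low <= high <= 0.90:
        raise ValueError("range must satisfy 0.10 <= low <= high <= 0.90")
    # Clamp 0.5 into the interval: this is the point of the range nearest 50%.
    return min(max(low, 0.5), high)


print(conservative_prevalence(0.15, 0.30))  # literature range 15-30%
print(conservative_prevalence(0.60, 0.80))  # literature range 60-80%
```

When the range straddles 50% (for example 40-70%), the helper returns 0.5 itself, since that point requires the largest sample size under these assumptions.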

We would like to warn that some books and guidelines suggest using an expected prevalence of 50% when no prevalence can be found at all [2, 14, 15]. We discourage this practice. Fig. 1 shows that a 50% prevalence produces the largest sample size only within the range of 10 to 90% prevalence. Below 10% and above 90%, where a smaller precision value is needed, the required sample size can be much higher. The 50% prevalence shortcut should therefore not be used. It is best to calculate the sample size with an appropriate expected prevalence. Researchers can find a possible range of expected prevalence and apply the recommendation in the previous paragraph.

For this illustration, we have drawn Fig. 1 using the precision for a small-scale study (Table 2). This means that we use a fixed precision of 5% for an expected prevalence between 10 and 90%, half of the expected prevalence when it is less than 10%, and half of (100% minus the expected prevalence) when it is greater than 90%.
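The precision rule described for Fig. 1 can be written out directly. The sketch below follows that description for a small-scale study and tabulates the resulting sample sizes at a few prevalences; the function names and the chosen prevalence values are our own, and the printed numbers are for illustration rather than a reproduction of the published figure.

```python
import math


def small_scale_precision(p):
    """Precision rule described for Fig. 1 (p as a proportion)."""
    if p < 0.10:
        return p / 2          # half of the expected prevalence
    if p > 0.90:
        return (1 - p) / 2    # half of (100% minus the expected prevalence)
    return 0.05               # fixed 5% between 10% and 90%


def sample_size(p, z=1.96):
    """Sample size using the standard formula with the rule above."""
    d = small_scale_precision(p)
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)


for p in (0.02, 0.10, 0.30, 0.50, 0.70, 0.90, 0.98):
    print(f"prevalence {p:.0%}: precision {small_scale_precision(p):.1%}, "
          f"n = {sample_size(p)}")
```

Under this rule the sample size is symmetric around 50% and rises again at the extremes: an expected prevalence of 2% needs more participants than one of 50%, which is the pattern the text attributes to Fig. 1.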

Parameter 4: anticipated loss

Some of the sample is always lost during the research process for several reasons, such as non-response, incomplete data and loss to follow-up. Researchers should estimate this loss from their past experience and inflate the calculated sample size accordingly. These losses depend heavily on the field of research (for example, the non-response rate could be higher in studies of sexual behaviour or other sensitive issues) and on the population that researchers intend to study. We therefore recommend that researchers use non-response rates from previous studies in similar research areas and in similar populations.

Although we can enter any percentage of potential loss and inflate the sample size accordingly, this does not guarantee that the final sample remains representative. In general, we consider a loss of less than 10% acceptable. However, opinions differ on the acceptable percentage of loss or attrition [16], depending on the type of study. At the very least, it is important to note that the higher the loss or attrition, the more the validity of the results is compromised.
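The text does not state how the sample size is inflated. A common convention, assumed here, is to divide the calculated sample size by (1 minus the anticipated loss) so that the required number remains after the loss; the calculators may use a different rule. The function name is hypothetical.

```python
import math


def inflate_for_loss(n, loss):
    """Inflate sample size n for an anticipated loss (proportion, 0 <= loss < 1).

    Assumption: n / (1 - loss), rounded up, so that about n participants
    remain after the anticipated loss.
    """
    if not 0 <= loss < 1:
        raise ValueError("loss must be in [0, 1)")
    return math.ceil(n / (1 - loss))


# 323 participants required, 10% anticipated non-response.
print(inflate_for_loss(323, 0.10))
```

A simpler alternative sometimes seen is to multiply by (1 + loss), which gives a slightly smaller number; whichever rule is used should be stated in the report.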

Sample size calculation report

The sample size report should be reproducible. This means that all parameters used must be reported. There are four parameters used in the calculation: the confidence level (usually 95%), the expected prevalence (usually taken from the literature or from a pilot study), the precision or margin of error of the estimate (the researchers' decision) and the expected loss (based on the researchers' experience). We must also include the name of the software or calculator with an appropriate reference. The Scalex SP calculator includes a draft report that the user can copy and use, which ensures that all necessary parameters are included in the report.
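To illustrate what a reproducible statement might contain, the sketch below assembles the four parameters and the tool name into one sentence. The wording is our own hypothetical template, not the draft report produced by the Scalex SP calculator, and the loss adjustment n / (1 - loss) is an assumption.

```python
import math


def draft_report(prevalence, precision, loss, software,
                 z=1.96, confidence="95%"):
    """Build a hypothetical sample size statement listing every parameter."""
    n = math.ceil(z ** 2 * prevalence * (1 - prevalence) / precision ** 2)
    n_final = math.ceil(n / (1 - loss))  # assumed inflation rule
    return (
        f"The sample size was calculated using {software}. "
        f"With a {confidence} confidence level, an expected prevalence of "
        f"{prevalence:.0%}, a precision of ±{precision:.0%} and an "
        f"anticipated loss of {loss:.0%}, the required sample size is "
        f"{n_final} ({n} before allowing for loss)."
    )


report = draft_report(0.30, 0.05, 0.10, "an in-house script")
print(report)
```

In a real report, the software name would be replaced by the calculator actually used, with its reference, and the source of the expected prevalence would be cited.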
