Understanding the Bootstrap
In modern research, one of the most fundamental challenges is uncertainty. Whenever we collect data, whether from surveys, experiments, or observational studies, we want to make claims not only about the specific sample we observe but about the broader population it represents. Doing this requires tools for statistical inference, and central to inference is the concept of a sampling distribution. Traditionally, researchers have relied on parametric approaches, but these often invoke strong assumptions that may not hold in practice. Over the last half-century, however, the bootstrap has emerged as one of the most influential methods for estimating sampling variability in a way that requires far fewer assumptions, and advances in high-speed computing have made it widely accessible. Although there are many “flavors” of bootstrapping, our focus here is on the non-parametric bootstrap, which involves resampling directly from the original observed data values.
Parametric Inference and the Sampling Distribution
In classical statistics, inference depends on specifying a model for the data-generating process and the idea of repeated sampling. For instance, suppose we take a sample of size N and calculate the sample mean. Ultimately, we’d like to make a conclusion about the mean in the population from which we drew our sample. However, we need to account for sampling error. That is, had we drawn a different sample of the same size, we would have obtained a different sample mean. In fact, there are infinitely many different samples of size N that we might have drawn from our population, each of which would yield a somewhat different mean, simply depending on who happened to get into the sample. The collection of means across all of these hypothetical samples has its own distribution, known as a sampling distribution. The sampling distribution reflects our uncertainty in the estimation of the population mean (i.e., variation due to sampling error) and it underpins confidence intervals, hypothesis tests, and other inferential procedures.
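The notion of a sampling distribution is easy to demonstrate by simulation. The following minimal sketch (in Python, using an entirely made-up population and an arbitrary sample size) draws many samples of the same size from a known population and summarizes how the sample means vary from draw to draw.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical population: 1,000,000 scores with mean 100 and SD 15
population = rng.normal(loc=100, scale=15, size=1_000_000)
N = 50  # size of each sample we draw

# Draw 5,000 samples of size N and record each sample mean
sample_means = np.array([rng.choice(population, size=N).mean() for _ in range(5000)])

# The spread of these means approximates the sampling distribution of the mean
print(f"mean of sample means: {sample_means.mean():.2f}  (population mean = 100)")
print(f"SD of sample means:   {sample_means.std(ddof=1):.2f}  "
      f"(theory: 15/sqrt({N}) = {15 / np.sqrt(N):.2f})")
```

The standard deviation of those simulated means is, by definition, the standard error of the mean, and it closely tracks the theoretical value of the population standard deviation divided by the square root of N.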
For instance, under the assumption that we draw independent and identically distributed observations from a normal distribution (or by appealing to the Central Limit Theorem at large sample sizes), the sample mean follows a normal distribution with mean equal to the population mean and standard error equal to the population standard deviation divided by the square root of N. If the population standard deviation is known, we can simply take the difference between our sample mean and the hypothesized population mean, divide by the standard error, and reference the result to the standard normal distribution (z-distribution) to make inferences. More typically, however, we aren’t privy to this knowledge and must plug in our sample standard deviation to get an estimated standard error. The resulting test statistic will then deviate from the normal curve due to this added uncertainty, particularly in small samples. Thanks to our favorite Guinness brewer William Gosset (who published under “Student”), we know that in such cases the sample mean minus the population mean, divided by the estimated standard error, follows a t-distribution, a bell-shaped distribution whose tail thickness is determined by the degrees of freedom.
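For concreteness, here is a minimal sketch of the t-based inference just described, assuming Python with numpy and scipy available; the data are simulated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
x = rng.normal(loc=50, scale=10, size=25)   # hypothetical sample of N = 25 scores

n = len(x)
xbar = x.mean()
se = x.std(ddof=1) / np.sqrt(n)             # estimated standard error: s / sqrt(N)

# 95% confidence interval based on the t-distribution with N - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (xbar - t_crit * se, xbar + t_crit * se)
print(f"mean = {xbar:.2f}, SE = {se:.2f}, 95% t-based CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```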
So far, we have focused on the sample mean, but we can imagine a sampling distribution for any parameter we wish to estimate, whether it be a regression coefficient, variance, factor loading, or any other value of interest. The challenge, however, is that in real research we often find ourselves making uncertain parametric assumptions in order to obtain a known sampling distribution. For example, when using a t-distribution to make inferences about a regression coefficient, we assume that the errors are normally distributed, independent, and homoscedastic (or we appeal to large-sample approximations), conditions that may or may not hold in practice. When assumptions are met, such parametric procedures work exceedingly well; when not met, the resulting inferences can be both biased and misleading, sometimes markedly so.
What Is the Bootstrap?
The non-parametric bootstrap, first formally proposed by Bradley Efron in 1979, is a computational technique for empirically approximating the sampling distribution without the requirement of strong parametric assumptions. Instead of relying on mathematical formulas, the bootstrap uses the observed data itself as a stand-in for the population. (Hence the term “bootstrap,” drawn from the phrase “pull yourself up by your own bootstraps,” meaning you make do with the resources you already have available to you.) The basic procedure is conceptually simple. First, you draw your sample of size N from the population in the usual way. Next, you draw a “bootstrap sample,” also of size N, from the original data with replacement. Then you compute and retain your statistic of interest on the bootstrap sample, whatever that statistic may be (e.g., mean, regression coefficient, mediated effect). Finally, you repeat this process many times (often 1,000 bootstrap samples or more) to create an empirical distribution for the statistic from which to make inferences back to the population.
The critical step to understand is that we are randomly drawing each bootstrap sample from our original sample data with replacement. Say we have a sample of N=100 observations; we might draw 1,000 bootstrap samples of size 100, where in each one a given observation may appear more than once or not at all (hence sampling “with replacement”). Under very general conditions, the resulting empirical distribution of the estimate (whatever that might be) approximates the sampling distribution. As such, we can use it to compute standard errors, confidence intervals, and bias estimates in much the same way as with a parametric sampling distribution, but with far fewer assumptions about the population. For instance, the bootstrapped standard error is simply the standard deviation of the bootstrapped sample estimates. And a bootstrapped 95% confidence interval can be computed by simply locating the 2.5th and 97.5th percentiles of the bootstrapped sample estimates, values that may or may not be symmetric around the estimate (in contrast to the symmetry assumed by parametric z- or t-type confidence intervals). The beauty of the bootstrap lies in how it transforms a theoretical problem (deriving the sampling distribution) into a computational one.
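To make the resampling procedure concrete, the sketch below (Python with numpy; the skewed data and the choice of B = 1,000 are purely illustrative) draws bootstrap samples with replacement, retains the mean of each, and then computes the bootstrapped standard error and a 95% percentile interval exactly as described above.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=100)    # hypothetical skewed sample, N = 100

B = 1000                                    # number of bootstrap samples
n = len(x)
boot_means = np.empty(B)
for b in range(B):
    # draw n observations from the original data WITH replacement
    resample = rng.choice(x, size=n, replace=True)
    boot_means[b] = resample.mean()

boot_se = boot_means.std(ddof=1)            # bootstrap SE = SD of the bootstrap estimates
lo, hi = np.percentile(boot_means, [2.5, 97.5])   # 95% percentile interval
print(f"sample mean = {x.mean():.3f}, bootstrap SE = {boot_se:.3f}, "
      f"95% percentile CI = ({lo:.3f}, {hi:.3f})")
```

Because the sample here is skewed, the percentile interval need not be symmetric around the sample mean, which is exactly the flexibility the parametric z- or t-interval cannot offer.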
Typical Applications
The bootstrap has found applications across nearly every domain of research. A classic example in the social sciences relates to the testing of indirect effects in mediation models, path analysis, and structural equation modeling. Such an effect arises in a causal chain in which one variable affects another, which in turn affects a third (with more complex chains also being possible). Each effect within the causal chain is represented by its own regression coefficient. Under standard assumptions, and in large samples, each regression coefficient estimate will have a normal sampling distribution, allowing for the usual inferences. However, we don’t want to test each link in the chain individually; we want to test the chain as a whole. That is, we want to test the indirect effect of the initial predictor on the final outcome as transmitted through the intervening variables (or mediators). The sample estimate of an indirect effect is obtained by computing the product of the regression coefficients involved in the chain. Easy enough. To test the indirect effect, however, we need to know its sampling distribution, and that’s where things get tricky. Each link in the chain has a normal sampling distribution, but a product of normal variates is generally not itself normally distributed. Using a normal sampling distribution as an approximation (the delta-method or “Sobel method” for testing indirect effects) is convenient but often leads to biased inferences. This is a well-known problem that has sparked a variety of solutions, one of which is to derive the correct parametric distribution for the indirect effect (known as the “distribution of the product” method). More commonly, however, investigators have turned to the non-parametric bootstrap to obtain empirically based inferential tests that do not rely on parametric assumptions at all. Indeed, bootstrapping is now the gold standard for testing mediated effects in practice.
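As an illustration, the following sketch bootstraps the indirect effect in a simple single-mediator model. The data, the coefficient values, the helper function name, and the choice of 5,000 resamples are all hypothetical; the indirect effect is the product of the X-to-M coefficient (a) and the M-to-Y coefficient controlling for X (b), and whole cases (rows) are resampled with replacement.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Hypothetical single-mediator data: X -> M -> Y (all values simulated for illustration)
X = rng.normal(size=n)
M = 0.5 * X + rng.normal(size=n)
Y = 0.4 * M + 0.2 * X + rng.normal(size=n)

def indirect_effect(X, M, Y):
    """Product a*b from two OLS fits: M ~ X gives a; Y ~ X + M gives b."""
    ones = np.ones_like(X)
    a = np.linalg.lstsq(np.column_stack([ones, X]), M, rcond=None)[0][1]
    b = np.linalg.lstsq(np.column_stack([ones, X, M]), Y, rcond=None)[0][2]
    return a * b

B = 5000
boot_ab = np.empty(B)
for i in range(B):
    idx = rng.integers(0, n, size=n)        # resample whole cases (rows) with replacement
    boot_ab[i] = indirect_effect(X[idx], M[idx], Y[idx])

lo, hi = np.percentile(boot_ab, [2.5, 97.5])
print(f"indirect effect = {indirect_effect(X, M, Y):.3f}, "
      f"95% percentile CI = ({lo:.3f}, {hi:.3f})")
```

Because the percentile interval comes straight from the empirical distribution of the product a*b, it naturally reflects the asymmetry of that distribution, which is precisely what the normal (Sobel) approximation misses.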
Regardless of whether one is evaluating an indirect effect, a variance estimate, or any sample estimate of interest, there are many potential uses of the bootstrap results. For example, we can estimate standard errors, particularly when no simple formula exists (e.g., in complex nonlinear models). Similarly, we can compute confidence intervals using several different bootstrap methods (percentile, bias-corrected, and bias-corrected and accelerated, or BCa) that provide intervals with better coverage properties than parametric ones in certain settings, as sketched below. Further, in machine learning and predictive modeling, bootstrap samples can be used for model validation and for estimating prediction error, complementing techniques such as cross-validation. Finally, in fields where data collection is difficult or expensive, such as clinical trials, educational experiments, or niche social science surveys, the bootstrap offers a way to make inference with small samples and limited data. These are just a few examples of how the bootstrap can be used in practice, and many additional options are available.
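For readers who prefer ready-made implementations, the sketch below uses scipy.stats.bootstrap (available in SciPy 1.7 and later) to compare percentile and BCa intervals for a sample mean; the skewed sample is simulated for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=80)   # hypothetical skewed sample

for method in ("percentile", "BCa"):
    res = stats.bootstrap((x,), np.mean, n_resamples=2000, confidence_level=0.95,
                          method=method, random_state=rng)
    ci = res.confidence_interval
    print(f"{method:>10}: 95% CI = ({ci.low:.3f}, {ci.high:.3f}), "
          f"bootstrap SE = {res.standard_error:.3f}")
```

The BCa method adjusts the percentile endpoints for bias and skewness in the bootstrap distribution, which is why it often achieves better coverage than the simple percentile interval.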
Advantages of the Bootstrap
There are many advantages to the bootstrap. Unlike traditional parametric methods, the non-parametric bootstrap does not require specifying a functional form for the population distribution. This makes it attractive when normality or homoscedasticity is questionable. The method also works for a wide range of statistics, including means, medians, regression coefficients, correlation coefficients, and even more complex estimands such as Gini coefficients. At its core, the bootstrap is easy to explain and implement. With modern software (R, Stata, SAS, Python), the procedure often requires just a few lines of code. The bootstrap can be extended for clustered data, time series, or hierarchical designs (e.g., students nested within classrooms), making it useful in applied social science and education research. As a general method, the bootstrap is remarkably flexible and can be applied in many interesting and challenging research scenarios.
Disadvantages and Limitations
As with any procedure, there are also disadvantages that must be considered. Because the bootstrap treats the sample as a proxy for the population, any biases in the sample will propagate through the bootstrap distribution. In small or unrepresentative samples, the bootstrap may give misleading results. Although less of an issue today, the bootstrap can be computationally intensive, especially for large datasets or complex models. Thousands of resamples are often needed for stable estimates. The bootstrap may also perform poorly for statistics that depend heavily on the tails of the distribution (e.g., extreme quantiles, maximum values), because the resampled datasets cannot create values outside the observed range. In clustered or dependent data structures, naïve bootstrapping can underestimate variability unless modified (e.g., block bootstrap, cluster bootstrap). This is particularly relevant in education research, where students are not independent observations, or repeated measures applications, where observations are correlated over time within persons. Care must be taken when evaluating the potential use of the bootstrap procedure in practice given the associated limitations.
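To illustrate the kind of modification required for nested data, here is a minimal sketch of a cluster bootstrap in which whole classrooms, rather than individual students, are resampled with replacement; the data structure, the number of classrooms, and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical nested data: 20 classrooms with 15 students each, plus a shared
# classroom-level shift so students within a class are correlated
n_class, n_per = 20, 15
classroom = np.repeat(np.arange(n_class), n_per)
class_effect = rng.normal(scale=0.8, size=n_class)
score = 50 + class_effect[classroom] + rng.normal(scale=5.0, size=n_class * n_per)

B = 2000
boot_means = np.empty(B)
ids = np.unique(classroom)
for b in range(B):
    # resample whole classrooms (not individual students) with replacement
    sampled = rng.choice(ids, size=len(ids), replace=True)
    rows = np.concatenate([np.where(classroom == c)[0] for c in sampled])
    boot_means[b] = score[rows].mean()

print(f"cluster-bootstrap SE of the mean score: {boot_means.std(ddof=1):.3f}")
```

Resampling at the classroom level preserves the within-class dependence in each bootstrap sample, which is what keeps the resulting standard error from being understated relative to a naïve observation-level bootstrap.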
Conclusion
The bootstrap represents one of the great innovations in modern statistics: a method that converts inference from an algebraic to a computational problem. By resampling from the observed data, researchers can approximate the sampling distribution of almost any statistic, gaining access to standard errors, confidence intervals, and bias estimates without heavy reliance on parametric formulas. Its strengths (flexibility, fewer assumptions, and ease of implementation) make it a powerful tool, especially in social sciences and education where data are often messy, distributions non-normal, and sample sizes modest. Yet, it is not a panacea: bootstrap inference depends on sample representativeness, can be computationally costly, and struggles with extreme statistics or dependent data if applied naïvely. For applied researchers, the bootstrap is best viewed as one tool in the inferential toolbox. When combined with sound research design and thoughtful modeling, it provides a robust way to grapple with uncertainty and to extract credible insights from limited data.
Suggested Readings
Alfons, A., Ateş, N. Y., & Groenen, P. J. (2022). A robust bootstrap test for mediation analysis. Organizational Research Methods, 25, 591-617.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7, 1-26.
Efron, B. (2000). The bootstrap and modern statistics. Journal of the American Statistical Association, 95, 1293-1296.
Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54-75.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman & Hall.
McLachlan, G. J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Journal of the Royal Statistical Society, Series C, 36, 318-324.
Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40, 879-891.
Stine, R. (1989). An introduction to bootstrap methods: Examples and ideas. Sociological Methods & Research, 18, 243-291.
