# I’m reporting within- and between-group effects from a multilevel model, and my reviewer says I need to address “sampling error” in the group means. What does this mean, and what can I do to address it?

This is a long-neglected topic, and one that is receiving increasing attention in the methodological literature. The problem that the reviewer is referring to is that the usual ways we obtain within- and between-group effects for lower-level predictors within the multilevel model (MLM) oftentimes generate biased estimates. The same issue arises with any form of clustering, such as when trying to separate within- and between-*person* effects in repeated measures data, but just to make things simpler, we’ll use “group” throughout to refer to the upper-level sampling units. The crux of the problem is that the way we typically separate within- and between-group effects involves manually computing group means (i.e., the mean values for the predictor within each specific group) that are then included in the model as a level-2 predictor, and these sample means are nearly always estimated with some degree of sampling error. One can think of this sampling error like measurement error, introducing unreliability into the group means, and producing a kind of endogeneity bias. The upshot is that our between-group estimates can be biased, sometimes substantially so, and we need to be aware of when this can occur and ways to fix the problem when it does.

In our prior post, we talked about how it can be important to separate within- and between-group effects for lower-level predictors in MLM. To recap, this is usually done in one of two ways. The first way is to add the predictor, say *x*, to the model (perhaps after grand-mean centering) along with the group means of *x*. With this specification, the obtained coefficient for *x* will be the estimated within-group effect and the coefficient for the group means of *x* will be the estimated *contextual effect*, capturing the extent to which the between-group effect of *x* differs from its within-group effect. The second way is to center *x* with respect to its group mean, and then fit the model with this group-mean-centered *x* along with the group means of *x*. The group-mean-centered *x* generates an estimate of the within-group effect and the group means generate an estimate of the between-group effect. Regardless of which approach is used, the observed group means of *x* are included in the model as a predictor, and this is where we can run into problems.
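
To make the two specifications concrete, here is a minimal sketch in plain Python with made-up data (the classroom labels and values are ours, purely for illustration); in practice the model itself would of course be fit with your MLM software:

```python
from statistics import mean

# Hypothetical toy data: x measured on students nested in three classrooms
groups = {
    "class_A": [2.0, 3.0, 4.0],
    "class_B": [5.0, 6.0, 7.0],
    "class_C": [1.0, 2.0, 3.0],
}

# Observed group means of x (computed from the sample, so subject to sampling error)
group_means = {g: mean(xs) for g, xs in groups.items()}

# Grand mean of x across all observations
all_x = [x for xs in groups.values() for x in xs]
grand_mean = mean(all_x)

# Way 1: grand-mean-centered x plus the group means.
#   coefficient on centered x      -> within-group effect
#   coefficient on the group mean  -> contextual effect
way1 = [(g, x - grand_mean, group_means[g])
        for g, xs in groups.items() for x in xs]

# Way 2: group-mean-centered x plus the group means.
#   coefficient on centered x      -> within-group effect
#   coefficient on the group mean  -> between-group effect
way2 = [(g, x - group_means[g], group_means[g])
        for g, xs in groups.items() for x in xs]

print(group_means)  # {'class_A': 3.0, 'class_B': 6.0, 'class_C': 2.0}
```

Note that under Way 2 the centered values sum to zero within each group, which is what makes the two parts of *x* orthogonal.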

To understand this, think back to when you took intro stats and began to learn about inference and sampling variability. Chances are the lecture went something like this… Imagine that in the population, variable *x* has mean *μ* and variance *σ*^{2}. We want to estimate *μ* based on a sample of *n* observations on *x*. The estimate we obtain, the sample mean, is not going to be exactly equal to *μ* because it’s calculated from a sample rather than the entire population; thus we would obtain different estimates from different samples, and these will tend to vary more from one another in small samples. The variance of the sample mean across repeated samples is *σ*^{2} / *n*, and taking the square root of this yields the familiar formula for the standard error of the mean, *σ* / sqrt(*n*).
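
This *σ* / sqrt(*n*) result is easy to verify by simulation. The sketch below (our own, with arbitrary values for *μ*, *σ*, and *n*) draws many samples and compares the empirical spread of the sample means against the analytic standard error:

```python
import random
from statistics import mean, stdev

random.seed(2024)

# Arbitrary population values, chosen purely for illustration
mu, sigma, n = 50.0, 10.0, 5
reps = 20_000

# Draw many samples of size n from the population; record each sample mean
sample_means = [mean(random.gauss(mu, sigma) for _ in range(n))
                for _ in range(reps)]

analytic_se = sigma / n ** 0.5     # sigma / sqrt(n) ≈ 4.47
empirical_se = stdev(sample_means)  # should land very close to analytic_se
print(analytic_se, empirical_se)
```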

Now let’s return to the MLM context. Each group mean that we calculate is subject to this same sampling error. When the number of observations sampled for a given group is small, the sampling error in the group mean will be large. This makes perfect sense: Imagine you have a classroom in which you have sampled just five of 40 students and then use the mean of these five students to estimate some characteristic of the entire class; naturally, this mean might vary substantially if computed on some other random five students in the class. Across groups, these sampling errors add “error variance” to the group means, and this causes the between-group effect estimate to be biased. In the case of a single predictor, the bias is predictably in the direction of the within-group effect (leading the contextual effect to be under-estimated). With multiple predictors, the pattern of bias also depends on the correlations among the predictors and can be harder to predict a priori. Further, this bias can propagate to the estimated effects of true (non-aggregated) Level 2 predictors (that is, Level 2 predictors that are not computed from the Level 1 observations), even though these predictors do not themselves contain sampling error. Interestingly, because the within- and between-group effects are orthogonal, this bias does not extend to the within-group effect estimates, which remain unbiased.
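
To see the added “error variance” directly: if the true group means have variance *τ*^{2} and individuals vary around their group mean with variance *σ*^{2}, then observed group means based on *n* sampled members will have variance near *τ*^{2} + *σ*^{2}/*n* rather than *τ*^{2}. A quick simulation (arbitrary values; our own sketch):

```python
import random
from statistics import variance

random.seed(7)

tau2, sigma2, n = 4.0, 9.0, 5   # arbitrary: between var, within var, group size
n_groups = 20_000

observed_means = []
for _ in range(n_groups):
    true_mean = random.gauss(0.0, tau2 ** 0.5)   # the group's true mean
    members = [random.gauss(true_mean, sigma2 ** 0.5) for _ in range(n)]
    observed_means.append(sum(members) / n)       # noisy observed group mean

# Variance of the observed means ≈ tau2 + sigma2 / n = 5.8, not tau2 = 4.0
print(variance(observed_means))
```

That extra *σ*^{2}/*n* is the noise that attenuates the between-group estimate, exactly as unreliability in a predictor does in ordinary regression.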

Bias due to sampling error in the group means seems like a big problem, except that sometimes it’s not. One consideration is the group sizes in your sample: the larger they are, the less sampling error there will be in the group means. With large enough group sizes, you don’t need to worry much about bias. Likewise, when the true between-group differences are large (there is a high intra-class correlation for the predictor), the sampling error will make up a smaller part of the observed group mean differences, producing less bias. Another mitigating circumstance is if you sampled most or all of the individuals in the group population. The usual formula for the standard error of the mean assumes an infinite population, that is, you sampled *n* people from an infinitely large pool. However, often, and especially with hierarchical data, there may be a limited population size for each group (e.g., one is sampling from a classroom of 20 students). In a finite population of *N* individuals, the sampling variance of the mean can be considerably smaller. In other words, there will be less bias to the between-group effects if the sampling ratio (units in sample to potential units in finite population) is large (e.g., if you sampled 15 students in a classroom of 20). In some cases, you may even have *all* of the available units for each group, such as when studying siblings nested within families. Then there’s no bias whatsoever. Sometimes it is also possible to obtain group-level information from the population rather than calculating it from your sample. For instance, administrative records might provide information on the average family income of all the students in a class, even if only some of them are in the sample. Again, using this, rather than the sample mean, would remove the bias. Finally, if the between-group effect is not really very different from the within-group effect, then the bias in the estimate will be small.
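
For readers who want the formula, a common form of the finite population correction multiplies the usual sampling variance *σ*^{2}/*n* by (*N* − *n*)/(*N* − 1). A small illustration (the numbers are ours):

```python
def mean_sampling_variance(sigma2, n, N=None):
    """Sampling variance of a group mean.

    If the finite population size N is given, apply the finite
    population correction (N - n) / (N - 1); otherwise assume an
    effectively infinite pool of potential group members."""
    v = sigma2 / n
    if N is not None:
        v *= (N - n) / (N - 1)
    return v

sigma2 = 9.0
print(mean_sampling_variance(sigma2, 5))          # infinite population: 1.8
print(mean_sampling_variance(sigma2, 15, N=20))   # 15 of 20 sampled: ≈ 0.158
print(mean_sampling_variance(sigma2, 20, N=20))   # whole group observed: 0.0
```

When the whole group is observed, the sampling variance is exactly zero, which is why there is no bias at all in the siblings-within-families example.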

But let’s say your situation doesn’t fit with any of these exceptions, then what? Well, some very clever methodologists have been working on ways to fix the bias. Three primary approaches have been suggested, each paralleling an approach for handling measurement error in standard regression models. One way to handle measurement error is with a latent variable model. Following this strategy, Lüdtke et al. (2008) proposed the multilevel latent covariate (MLC) model to handle sampling error in the group means. In this model, the observed scores for the sampled group members are viewed as indicators of the true underlying latent group mean. Shin and Raudenbush (2010) implemented the same idea within a multivariate MLM framework. A second strategy is to generate scores for the latent variable that produce consistent estimates when used as predictors in an observed-variables model. Consistent with this approach, Croon and van Veldhoven (2007) and Shin and Raudenbush (2010) showed that accurate estimates of between-group effects can be obtained by using empirical Bayes (EB) estimates of the group means of *x* rather than the observed sample means. Finally, a third way to handle measurement error is to fit a standard regression model but then implement a post-estimation correction to the estimates based on prior knowledge about the reliability of the predictor. In this case, because the unreliability is due to sampling error, we can infer the reliability of the predictor from the group size. Grilli and Rampichini (2011) and Gottfredson (2019) describe the appropriate corrections to implement this approach. As we describe next, all three of these general approaches can yield accurate estimates of the between-group effects, but which to choose may depend on the specific characteristics of your application.

The MLC model is widely recognized, theoretically elegant, makes most efficient use of the data, and is conducted in one step, requiring no pre-treatment of the data or post-transformation of the estimates. On the flip side, the MLC is a complex latent variable model and estimation can go awry when the number of groups is small (e.g., fewer than 50). The MLC is also based on a reflective measurement model that assumes that the people in the group are interchangeable and the latent group mean is a characteristic of the group that affects the individual scores (people could come and go but the latent group mean would stay the same). This is in contrast to a formative model, in which the scores of the group members are not necessarily interchangeable and collectively determine the population mean of the group (as people come and go the true group mean changes too). A reflective measurement model can be difficult to justify at times, but the MLC can still be profitably used with a formative process as long as the sampling ratio is low (e.g., only 5% of the population group members were sampled).

The EB approach has the advantage that it is straightforward to implement within a standard multilevel model. You can generate EB estimates for *x* in most MLM software programs (sometimes these are referred to instead as empirical best linear unbiased predictors, or EBLUPs). Then, following Shin and Raudenbush (2010), you simply use these rather than the usual group means of *x* when fitting the model to *y* (both at Level 2 and when centering the predictor at Level 1). However, this approach too has drawbacks. First, computing the EB estimates gets increasingly complicated as the number of predictors increases. The estimates must be computed simultaneously for all of the Level 1 predictors, while accounting for any other Level 2 predictors that will be in the model for *y*. Ultimately, for a sufficiently complex model, you may need to program in the matrix equations yourself (see Croon & van Veldhoven, 2007, pp. 51-52). Second, like the MLC, the EB estimates implicitly assume a reflective measurement model (though they too could still be used with a formative measurement process if the sampling ratio was sufficiently low). Third, although this approach generates consistent estimates of the fixed effects, it does not correct the variance component estimates, which may remain biased. In turn, this may bias the standard errors of the fixed effects.
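
For intuition about what the EB estimates are doing in the simplest case (one predictor, no other covariates), each observed group mean is shrunk toward the grand mean in proportion to its reliability, *λ*_j = *τ*^{2} / (*τ*^{2} + *σ*^{2}/*n*_j). A hand-rolled sketch (our own; in practice you would extract these estimates from your MLM software):

```python
def eb_group_means(observed_means, group_sizes, grand_mean, tau2, sigma2):
    """Empirical Bayes estimates of the group means of x (simplest case).

    Each observed mean is shrunk toward the grand mean by its reliability
    lambda_j = tau2 / (tau2 + sigma2 / n_j): small groups, whose observed
    means are noisy, get shrunk more than large groups."""
    eb = []
    for xbar, n in zip(observed_means, group_sizes):
        lam = tau2 / (tau2 + sigma2 / n)
        eb.append(lam * xbar + (1 - lam) * grand_mean)
    return eb

# Hypothetical estimates: tau2 and sigma2 as if taken from a null model for x
means = eb_group_means([10.0, 2.0], [50, 2], grand_mean=5.0, tau2=4.0, sigma2=8.0)
print(means)  # the group of 50 is barely shrunk; the group of 2 moves toward 5.0
```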

Like the EB approach, the reliability-correction approach has the advantage that it can be implemented within a standard multilevel model. Further, no changes are required to the traditional procedures for separating within- and between-group effects. One simply needs to correct the estimates after fitting the model to counteract the expected bias due to sampling error. Corrections can be applied to both fixed effects and variance components and can be computed for either infinite or finite group populations, irrespective of reflective or formative measurement. Adjustments can also be made to the standard errors. But there are downsides to this approach too. First, the reliability-corrected estimates can show excessive sampling variability, making this approach most useful when working in large sample contexts (many groups). Second, the corrections are derived for balanced groups and aren’t fully accurate when group sizes vary. Third, correction formulas focus on the case of a single predictor, whereas it is more common for models to have multiple predictors.
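
To give a flavor of the logic, in the balanced, single-predictor case the naive between-group estimate is expected to equal *λβ*_between + (1 − *λ*)*β*_within, where *λ* = *τ*^{2} / (*τ*^{2} + *σ*^{2}/*n*) is the reliability of the observed group means (Lüdtke et al., 2008). Inverting that relationship yields a simple post-hoc correction; the sketch and numbers below are our own, for illustration only (see Grilli & Rampichini, 2011, and Gottfredson, 2019, for the full set of corrections):

```python
def corrected_between(beta_between_hat, beta_within_hat, tau2, sigma2, n):
    """Reliability-corrected between-group effect (balanced groups, one
    predictor). The naive between estimate is expected to equal
    lambda * beta_between + (1 - lambda) * beta_within, so solve for
    beta_between."""
    lam = tau2 / (tau2 + sigma2 / n)  # reliability of the observed group means
    return (beta_between_hat - (1 - lam) * beta_within_hat) / lam

# Hypothetical estimates: naive between = 0.8, within = 0.3,
# tau2 = 1.0, sigma2 = 4.0, and n = 10 members sampled per group
print(corrected_between(0.8, 0.3, 1.0, 4.0, 10))
```

Note that the corrected point estimate also requires a correspondingly adjusted standard error, as discussed in the cited papers.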

Thus, as with so many things in statistics, there is no one right answer for how to address this problem. In your response to the reviewer, we would recommend the following. First, assess whether sampling error is truly a problem for your particular analysis. Might your research fall into one of the exceptions where bias is not expected to be a problem (e.g., large cluster sizes or a high sampling ratio)? If so, you simply need to explain this to your reviewer. Second, if it is a problem, think about which of the possible alternative modeling approaches will best suit your needs by considering the advantages and disadvantages discussed above. If you have many predictors, and are fortunate to have a large number of groups in your sample, the MLC model may be your best bet, provided you can reasonably assume a reflective measurement model or low sampling ratio. If your model is small, the EB or reliability-correction approaches might be easier to implement, and one or the other could be used to provide a sensitivity analysis for the original results (i.e., does the story change when accounting for sampling error?). These too perform best with a large number of groups. Last, if the group populations are finite, you recorded the total sizes of the groups from which you sampled, and you are sampling more than a small fraction of the available group populations, the reliability-correction approach is the only one of the three that will take this into account to produce accurate estimates.

Research on this topic is ongoing and expanding, but we hope this post will help to orient you to the relevant literature and give you some ideas for how to move forward with your manuscript.

Croon, M.A., & van Veldhoven, M.J.P.M. (2007). Predicting group-level outcome variables from variables measured at the individual level: a latent variable multilevel model. *Psychological Methods, 12*, 45–57.

Gottfredson, N.C. (2019). A straightforward approach for coping with unreliability of person means when parsing within-person and between-person effects in longitudinal studies. *Addictive Behaviors, 94*, 156–161. https://doi.org/10.1016/j.addbeh.2018.09.031

Grilli, L., & Rampichini, C. (2011). The role of sample cluster means in multilevel models: A view on endogeneity and measurement error issues. *Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 7,* 121–133. https://doi.org/10.1027/1614-2241/a000030

Lüdtke, O., Marsh, H.W., Robitzsch, A., Trautwein, U., Asparouhov, T., & Muthén, B. (2008). The multilevel latent covariate model: a new, more reliable approach to group-level effects in contextual studies. *Psychological Methods, 13*, 203–229. https://doi.org/10.1037/a0012869

Shin, Y., & Raudenbush, S. W. (2010). A latent cluster-mean approach to the contextual effects model with missing data. *Journal of Educational and Behavioral Statistics, 35*, 26–53. https://doi.org/10.3102/1076998609345252