The post What exactly qualifies as intensive longitudinal data and why am I not able to use more traditional growth models to study stability and change over time? appeared first on CenterStat.

To start, there is often much confusion over what constitutes intensive longitudinal data (or ILD), in large part because there exists no formal definition that separates ILD from other types of longitudinal data. That said, ILD tends to fall between two traditional data structures obtained from alternative designs: panel data and time series data. It’s useful to first consider these traditional structures to see how several of their features will combine within ILD.

Historically, the most common method for gathering longitudinal data in psychology and the social and health sciences has been the *panel design*. Typically, a panel design involves assessing a large sample of subjects (say 200 or more) at a much smaller number of time points (say three to six) that tend to be widely spaced in time (say six or 12 months or more). Panel data are often used to empirically examine long-term trajectories of change that might span multiple years, and common analytic methods include the standard latent curve model or a multilevel growth model. (See our prior Help Desk entry on the relation between the LCM and MLM).

A second type of longitudinal design, commonly used in economics among other areas, is the *time series design*, which resides at the opposite end of the continuum from the panel design. More specifically, a time series design is often based on just a single unit that is repeatedly assessed a very large number of times (say 100 to 200 or more) at intervals that tend to be close together in time (say daily or even hourly). Time series data are often used to empirically examine short-term dynamic processes that might unfold hour-by-hour or day-by-day (e.g., the daily closing value of the S&P 500), and many specialized analytic methods exist to fit models to these highly dense data.

ILD tends to fall between the two extremes of panel data on one end and time series on the other. More specifically, ILD tends to have fewer subjects than panel data but more than time series (say 50 or 100 subjects) and more time points than panel data but possibly fewer than time series (say 30 or 40 assessments). Data might be captured using wearable technology (e.g., heart rate or blood pressure monitors) or by sending random prompts throughout the day via smart phones or other electronic devices (e.g., a tone sounds on a smart phone three times throughout the day and an individual is prompted to respond to a brief feelings survey). As a hypothetical example, a study might be designed to randomly measure nicotine cravings and cigarette use in a sample of 50 individuals four times per day for a two-week period, resulting in 56 assessments on each individual and thus falling between traditional panel and time series designs in structure.

In the spirit of *be careful what you ask for*, once you obtain intensive longitudinal data you must then select an optimal modeling strategy to test your motivating hypotheses, and this is not always an easy task. To begin, some longitudinal models that we are familiar with from panel data simply will not work with ILD. Consider the latent curve model (LCM): because the LCM is embedded within the structural equation model, each observed time point is represented by a manifest variable in the model. This works well if the model is fit to annual assessments of some outcome (say antisocial behavior at age 6, 7, 8, 9 and 10) where each age-specific measure serves as an indicator on the underlying latent curve factor. However, the LCM rapidly breaks down with higher numbers of repeated measures in which only one observation may have been obtained at any given assessment (e.g., 9:15am, 9:52am, and so on). For our prior example with 56 repeated assessments taken on 50 subjects, the LCM is simply not an option.

We can next consider the multilevel model (MLM), and it turns out that this option works quite well for many ILD research applications. (See our Office Hours channel on YouTube for a lecture on the MLM with repeated measures data.) The MLM approaches the complex ILD structure as nested data in which repeated assessments are nested within individuals. Interestingly, unlike the standard LCM, the MLM can be applied both to more traditional panel data and to ILD. The reason is that, whereas the LCM incorporates the passage of time into the factor loading matrix and requires an observed variable at each assessment, the MLM incorporates the passage of time as a numerical predictor in the regression model. As such, the MLM can easily allow for highly dense data (meaning many time points) and highly sparse data (meaning that few individuals, or even just one, are assessed at any given time point) without problem. (The LCM can under certain circumstances be contorted to accommodate some of these features as well, but the MLM does this seamlessly.) However, there are several complications that must be addressed when fitting an MLM to intensive longitudinal data that do not commonly arise in panel data.
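To make the contrast concrete, here is a minimal sketch of the two specifications for linear growth. The notation is ours and chosen only for illustration, and the five-row loading matrix assumes five equally spaced annual assessments like the age 6 to 10 example above.

```latex
% LCM: time enters through fixed factor loadings, one row per measurement occasion
\mathbf{y}_i = \boldsymbol{\Lambda}\boldsymbol{\eta}_i + \boldsymbol{\varepsilon}_i,
\qquad
\boldsymbol{\Lambda} =
\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}

% MLM: time enters as an observed predictor that may take any value for any person
y_{ti} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\,\mathrm{time}_{ti} + r_{ti}
```

In the first form every distinct assessment time needs its own row of the loading matrix (and its own observed variable); in the second, an assessment at 9:15am or 9:52am is simply another value of the time predictor.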

The first issue is what is called *serial correlation* of the residuals for the repeatedly measured outcome. With apologies for the technical terminology, this means that for a given person, when there is a “bump” at one time point, that bump tends to carry over to the next time point too. For instance, say a person’s average heart rate is 72 BPM. I measured this person at 9:10am and 9:26am. What I don’t know is that this person was late for their 9:00am job, which led them to move faster and increased their stress, and they had only just arrived at 9:10am. This manifested in a heart rate of 91 BPM at 9:10 and 83 BPM at 9:26. The initial bump has thus not entirely dissipated by the second assessment.

Serial correlation is often not of importance in panel data because these perturbations have long since washed out (the residual correlation goes to zero over the long lags). A person’s heart rate might be higher than usual when I assess them at age 26 because they had a second shot of espresso or got in an argument with a colleague at work, but the effect of the espresso or argument has long since worn off by the time I reassess them at age 27. Of course, even with panel data the repeated measures are correlated, not because of serial correlation of within-person *residuals* but because of individual differences in level and change over time. For instance, some people have consistently higher heart rates and others have consistently lower heart rates, and this stability will lead to across-person positive correlations in repeated measures. We typically model these individual differences in level and change via latent growth factors / random effects when fitting LCMs / MLMs. Such individual differences may be an important source of correlation in ILD too, but we also have to contend with the serially correlated residuals. Although an added complexity, the MLM is quite well suited to incorporating serial correlations such as these. Complex error structures can be defined among the time-specific residuals, such as auto-regressive, Gaussian decay, spatial power, or Toeplitz structures. It is very important that these serial correlations be represented in the model when they are present, both to gain insights into the phenomenon under study and to ensure that other parameter estimates of interest are not biased.
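As a small illustration of the simplest of these structures, the sketch below builds a first-order auto-regressive (AR1) residual correlation matrix for equally spaced assessments. The function name and the value of .5 are ours and purely hypothetical; real software estimates the correlation from the data.

```python
def ar1_correlation(n_times, rho):
    """Return the n_times x n_times AR(1) residual correlation matrix.

    Entry (s, t) equals rho ** |s - t|, so the correlation between two
    residuals decays as the assessments move further apart in time.
    """
    return [[rho ** abs(s - t) for t in range(n_times)] for s in range(n_times)]


# Hypothetical value: residuals one assessment apart correlate .5
R = ar1_correlation(4, 0.5)
for row in R:
    print([round(v, 3) for v in row])
```

Adjacent assessments correlate .5, assessments two apart correlate .25, and so on. This is the same logic by which the residual correlation effectively vanishes across the long lags of a panel design.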

A second issue that often arises in ILD is the presence of cycles or transition points that might occur during the assessment period. For example, daily measures taken over several weeks may vary as a function of weekday vs. weekend (e.g., if studying college drinking) or might cycle regularly throughout a day (e.g., hourly heart rate data varying as a function of waking to sleeping and back to waking). Although such cycles and transition points might be present in panel data as well, these are less likely to occur because there are typically fewer time-linked assessments and these tend to aggregate over longer durations (e.g., if we ask “over the past 30 days” to obtain monthly alcohol use levels, these ratings will implicitly smooth over weekday-weekend differences in daily alcohol use). In contrast, multiple cycles might be observed in ILD spanning a 50 or 60 time point series.
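One common way to bring such cycles into a model is to code them as time-varying predictors. The sketch below is only an illustration under our own assumptions: a 0/1 weekend indicator for the weekly pattern and a sine/cosine pair with a 24-hour period for a smooth daily rhythm. The function name and the example date are hypothetical.

```python
import math
from datetime import datetime


def cycle_predictors(timestamp):
    """Build time-varying predictors for weekly and daily cycles.

    Returns a weekend indicator (1 = Saturday or Sunday) and a sine/cosine
    pair with a 24-hour period that can capture a smooth daily rhythm.
    """
    hour = timestamp.hour + timestamp.minute / 60
    angle = 2 * math.pi * hour / 24
    return {
        "weekend": 1 if timestamp.weekday() >= 5 else 0,
        "sin_24h": math.sin(angle),
        "cos_24h": math.cos(angle),
    }


# Hypothetical assessment taken on Saturday, 4 May 2024, at 6:00am
row = cycle_predictors(datetime(2024, 5, 4, 6, 0))
print(row)
```

These columns would then enter the level-1 model alongside any other time-varying predictors.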

Finally, a third issue is the distinction between within- versus between-person effects. Often ILD is collected with the idea of assessing processes as they unfold in real time for individual participants (“life as lived”). For instance, we might be interested in using ILD to test a negative reinforcement hypothesis for alcohol use. That is, we wish to test the proposition that people drink more than they typically do when they are experiencing increased negative affect under the expectation that this will reduce their negative affect. Using a daily diary study, we measure negative affect each day and alcohol use each night and we build a model to predict alcohol use from negative affect. To fully assess the negative reinforcement hypothesis, we must differentiate the within-person effect (e.g., when my negative affect is higher than usual I drink more than is typical for me) from any between-person correlation that may also exist (e.g., that people who have higher negative affect in general tend to drink more in general). Fortunately, with the MLM we have well-developed methods for separating within- and between-person effects, although there are some complications to consider (see our prior help desk post specifically on this issue).
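One widely used way to make this separation is person-mean centering: each person’s average on the predictor carries the between-person information, and each day’s deviation from that average carries the within-person information. The sketch below uses made-up diary data and a hypothetical function name, and it sets aside the complications mentioned above.

```python
from collections import defaultdict


def split_within_between(records):
    """Decompose a time-varying predictor into between- and within-person parts.

    records: list of (person_id, value) pairs, one per assessment.
    Returns a list of (person_id, person_mean, deviation) tuples, where
    person_mean is the between-person part and
    deviation = value - person_mean is the within-person part.
    """
    by_person = defaultdict(list)
    for pid, value in records:
        by_person[pid].append(value)
    means = {pid: sum(vals) / len(vals) for pid, vals in by_person.items()}
    return [(pid, means[pid], value - means[pid]) for pid, value in records]


# Hypothetical daily negative affect ratings for two people
diary = [("A", 2.0), ("A", 4.0), ("A", 3.0), ("B", 6.0), ("B", 8.0)]
parts = split_within_between(diary)
for part in parts:
    print(part)
```

Person B is higher than person A on average (a between-person difference), yet both have days that fall one point below their own typical level (a within-person deviation). Entering the two pieces as separate predictors lets the model estimate the two effects separately.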

The MLM is thus well suited to address all of these complexities that commonly arise in intensive longitudinal data. Once incorporated, the MLM offers many of the very same advantages as when applied to panel data: time-varying predictors can be incorporated at level-1 with either fixed or random effects, time invariant predictors can be incorporated at level-2, and interactions can be estimated within or across levels of analysis. However, there are two key limitations of the MLM that may or may not arise in a given application. The first is that, similar to the traditional general linear model, the MLM assumes all measures are error-free and all observed variance is “true” variance. This is often (if not always) an unrealistic assumption and violation of this assumption can lead to significant biases in the estimated results. The second is that the MLM only allows for one dependent variable at a time and is thus limited to the estimation of unidirectional effects. Say that you are interested in testing the reciprocal relations between depression during the day and substance use that evening, and you obtain multiple daily measures spanning a week of time. The MLM allows for the estimation of the prediction of substance use from depression, but not the simultaneous estimation of the reciprocal prediction of depression from substance use. As such, the MLM is only evaluating one part of the research hypotheses at hand.

However, recent developments have introduced a new analytic procedure that combines elements of the MLM, the SEM, and time series models called the dynamic structural equation model (or DSEM). The DSEM functionally picks up where the MLM leaves off, but expands the model to potentially include latent factors (to estimate and remove measurement error) and multiple dependent variables (to estimate reciprocal effects between two or more variables over time). DSEM is a recent development and much has yet to be learned about best practices in applied research settings, but it represents a significant development in our ability to fit complex models to ILD.
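As a rough illustration of the reciprocal piece only, a lag-1 bivariate model for the within-person parts of depression (d) and substance use (s) might be written as follows. The symbols are ours, the lag-1 timing is a simplifying assumption, and the sketch omits the measurement model and random effects that a full DSEM could include.

```latex
\begin{aligned}
s^{(w)}_{ti} &= \phi_{ss}\, s^{(w)}_{t-1,i} + \phi_{sd}\, d^{(w)}_{t-1,i} + \zeta_{s,ti} \\
d^{(w)}_{ti} &= \phi_{dd}\, d^{(w)}_{t-1,i} + \phi_{ds}\, s^{(w)}_{t-1,i} + \zeta_{d,ti}
\end{aligned}
```

The two cross-lagged coefficients (the prediction of substance use from earlier depression, and of depression from earlier substance use) are estimated simultaneously, which is the part a single-outcome MLM cannot do.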

Want to learn more? We recently had the honor of being invited to provide a series of three lectures on intensive longitudinal data analysis for the American Psychological Association and we have posted our lecture materials in the resources section of the CenterStat home page (https://centerstat.org/apa-ild/). The first session discusses the challenges and opportunities of ILD; the second focuses on the analysis of ILD using the multilevel model; and the third focuses on the analysis of ILD using the dynamic structural equation model. In addition to those resources, below are several suggested readings on the design, collection, and analysis of intensive longitudinal data. Asynchronous access to CenterStat workshops on *Multilevel Modeling* and *Analyzing Intensive Longitudinal Data* is also available to those who might wish to register for additional training. You can also check our workshop schedule for upcoming live offerings.

Good luck with your work!

Asparouhov, T., Hamaker, E. L., & Muthén, B. (2018). Dynamic structural equation models. *Structural Equation Modeling: A Multidisciplinary Journal, 25*, 359-388.

Asparouhov, T., & Muthén, B. (2020). Comparison of models for the analysis of intensive longitudinal data. *Structural Equation Modeling: A Multidisciplinary Journal, 27*, 275-297.

Bolger, N., & Laurenceau, J. P. (2013). *Intensive longitudinal methods: An introduction to diary and experience sampling research*. Guilford Press.

Hamaker, E. L., Asparouhov, T., Brose, A., Schmiedek, F., & Muthén, B. (2018). At the frontiers of modeling intensive longitudinal data: Dynamic structural equation models for the affective measurements from the COGITO study. *Multivariate Behavioral Research, 53*, 820-841.

Hoffman, L. (2015). *Longitudinal analysis: Modeling within-person fluctuation and change*. Routledge.

McNeish, D., & Hamaker, E. L. (2020). A primer on two-level dynamic structural equation models for intensive longitudinal data in Mplus. *Psychological Methods, 25*, 610-635.

McNeish, D., Mackinnon, D. P., Marsch, L. A., & Poldrack, R. A. (2021). Measurement in intensive longitudinal data. *Structural Equation Modeling: A Multidisciplinary Journal, 28*, 807-822.

Walls, T. A., & Schafer, J. L. (Eds.). (2006). *Models for intensive longitudinal data*. Oxford University Press.

The post CenterStat Partners with APA to Offer Free Training on Intensive Longitudinal Data appeared first on CenterStat.

The full program is available for free (to both APA members and non-members), and links to individual sessions are provided below. Sessions will be livestreamed and recordings will be made available to all those who registered approximately two weeks after each live seminar. The lecture materials for the last three sessions, delivered by Dan Bauer & Patrick Curran, are available here.

**August 31:** Training for the Collection of Real-World Biobehavioral Data Using Wearable Devices

Presenter: Benjamin Nelson, PhD

**September 15:** Introduction to Intensive Longitudinal Methods

Presenter: Jean-Philippe Laurenceau, PhD

**October 4:** Intensive Longitudinal Data: Methodological Challenges and Opportunities

Presenters: Daniel Bauer, PhD, and Patrick Curran, PhD

**October 6:** Intensive Longitudinal Data: A Multilevel Modeling Perspective

Presenters: Daniel Bauer, PhD, and Patrick Curran, PhD

**October 11:** Intensive Longitudinal Data: A Dynamic Structural Equation Modeling Perspective

Presenters: Daniel Bauer, PhD, and Patrick Curran, PhD

Participants who decide to pursue more in-depth training on ILD after attending these seminars may wish to consider enrolling in our full workshops on Multilevel Modeling (by Dan Bauer and Patrick Curran) and Analyzing Intensive Longitudinal Data (by Jean-Philippe Laurenceau and Niall Bolger).

The post What’s the best way to determine the number of latent classes in a finite mixture analysis? appeared first on CenterStat.

One of the single most difficult tasks in finite mixture modeling is to determine the number of classes within the population, a process sometimes referred to as *class enumeration*. Typically, one will fit a finite mixture model using maximum likelihood estimation, in which the number of classes must be declared as part of the model specification. Thus, the analyst will fit a model with 1 class, then 2 classes, then 3, etc., and then compare the fit of these models to try to determine the optimal number of classes. Various approaches to determining the optimal number of classes can be considered, but they generally fall into three primary categories: likelihood ratio tests, information criteria, and entropy statistics. Let’s consider each in turn. (And, yes, there are Bayesian approaches to this problem too, but they aren’t widely used in practice so we won’t be addressing those.)

One approach for evaluating the number of classes is to use a likelihood ratio test (LRT). LRTs represent a general procedure for testing between nested models, i.e., where one model consists of parameters that are a restricted subset of the parameters of the other model. The LRT is computed as –2 times the difference in the log-likelihoods of the two models (simpler minus more complex) and, under certain regularity conditions (essentially *assumptions*), it is distributed as a central chi-square with degrees of freedom equal to the difference in number of estimated parameters. From the chi-square, we obtain a *p*-value under the null hypothesis that the simpler model is the right one. Effectively we are saying: look, we know that if we throw more parameters at the model it will fit the sample data better (i.e., the log-likelihood improves), but is this improvement greater than we would expect by chance alone given the number of parameters added (the degrees of freedom of the LRT)? If the *p*-value is significant, then we conclude that it is a greater improvement than we would expect by chance, rejecting the simpler model in favor of the more complex model. If it’s not significant, then we conclude there is not a meaningful difference between the two models and we retain the simpler model. In other words, we conclude that the extra parameters may just be overfitting, picking up random variation or noise in the sample that doesn’t reflect the true underlying structure in the population.
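For ordinary nested models where the regularity conditions hold, the calculation is short enough to sketch. The log-likelihoods and parameter counts below are made up, and `chi2_sf` is our own small helper (a series approximation to the chi-square upper tail) rather than a library function.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a central chi-square distribution.

    Uses the power series for the regularized lower incomplete gamma
    function, which is adequate for the moderate statistics typical of an LRT.
    """
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def likelihood_ratio_test(loglik_simple, loglik_complex, n_par_simple, n_par_complex):
    """Return (LRT statistic, degrees of freedom, p-value) for two nested models."""
    stat = -2.0 * (loglik_simple - loglik_complex)
    df = n_par_complex - n_par_simple
    return stat, df, chi2_sf(stat, df)


# Hypothetical log-likelihoods and parameter counts for two nested models
stat, df, p = likelihood_ratio_test(-1250.0, -1245.0, 6, 8)
print(stat, df, round(p, 4))
```

Here the more complex model improves the log-likelihood by 5 points at the cost of 2 extra parameters, giving an LRT of 10 on 2 degrees of freedom.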

That is how we typically use LRTs in a traditional modeling framework, but let’s think about how we would apply this general testing approach to determine the number of classes in a finite mixture. First, we can establish that a *K*-class model is nested within a *K*+1-class model. For instance, one could set the mixing probability (prevalence rate) of one class in the *K*+1-class model to zero. Presto, this deletes one of the classes to produce a *K*-class model. So far so good. Now we fit models with 1 v. 2 classes, calculate the LRT, and if the *p*-value is significant we say 2 classes are better than 1. Then we test 2 v. 3 classes, 3 v. 4 classes, etc., and stop when we get to the point that adding another class no longer results in a significant improvement in model fit. But where things get complicated is in the fine print to the likelihood ratio test. The regularity conditions required for the test distribution to be a central chi-square aren’t met when testing a *K* versus a *K*+1-class model. So while it still makes sense to conduct likelihood ratio tests, we no longer have the familiar chi-square with which to obtain *p*-values. We need to somehow modify how we conduct LRTs for use in this context.
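Written out in our own notation (with mixing probabilities π and class-specific parameters θ), the nesting looks like this:

```latex
f(y) = \sum_{k=1}^{K+1} \pi_k\, f_k(y \mid \theta_k),
\qquad \sum_{k=1}^{K+1} \pi_k = 1,
\qquad
\pi_{K+1} = 0 \;\Rightarrow\; f(y) = \sum_{k=1}^{K} \pi_k\, f_k(y \mid \theta_k)
```

The restriction also hints at why the fine print bites: a mixing probability of zero sits on the boundary of the parameter space, and once it is zero the parameters of the deleted class no longer affect the likelihood at all.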

One option is to bootstrap the test distribution. McLachlan (1987) proposed a parametric bootstrapping procedure that involves (1) simulating data sets from the *K*-class model estimates that were obtained from the real data; (2) fitting *K* and *K*+1-class models to the simulated data sets; (3) computing the likelihood ratio test statistic for each simulated data set; and (4) using the distribution of bootstrapped LRT values to obtain the *p*-value for the likelihood ratio test statistic obtained with the real data. It’s a clever approach, but somewhat computationally intense, especially if one wants a precise *p*-value. The other option is to derive the correct theoretical test distribution for the LRT. Lo, Mendell & Rubin (2001) performed these derivations, determining it (appropriately enough) to be a mixture of chi-squares. They also provided an ad-hoc adjusted version of the test with somewhat better performance at realistic sample sizes. Simulation studies, however, have shown the Lo-Mendell-Rubin LRT (original and adjusted versions) to have elevated Type I error rates for some models, whereas the bootstrapped LRT consistently works well. We thus tend to prefer the bootstrapped LRT, despite its greater computational demands (which is an increasingly less relevant concern given ever-improving computational speeds of even the lowliest desktop computers).
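The four bootstrap steps can be sketched as a short loop. This is only a skeleton under our own assumptions: `simulate` and `lrt_statistic` are placeholder callables that the analyst would have to supply (they stand in for simulating from the fitted *K*-class model and for fitting the *K* and *K*+1-class models), and the +1 in the *p*-value is one common convention rather than something prescribed by the original paper. The toy stand-ins at the bottom exist only so the sketch runs; they do not fit mixture models.

```python
import random


def bootstrap_lrt(observed_stat, simulate, lrt_statistic, n_boot=100, seed=1):
    """Parametric bootstrap p-value for a K vs. K+1 class comparison.

    simulate(rng)       -> one data set drawn from the fitted K-class model
    lrt_statistic(data) -> LRT statistic from fitting K and K+1 classes to data
    Both callables are placeholders that the analyst must supply.
    """
    rng = random.Random(seed)
    # Steps 1-3: simulate, refit both models, and store each LRT statistic
    boot_stats = [lrt_statistic(simulate(rng)) for _ in range(n_boot)]
    # Step 4: share of simulated statistics at least as large as the observed one
    # (the +1 terms keep the estimate away from exactly zero)
    exceed = sum(1 for s in boot_stats if s >= observed_stat)
    return (exceed + 1) / (n_boot + 1), boot_stats


# Toy stand-ins so the sketch runs; real use would fit mixture models here
def toy_simulate(rng):
    return [rng.gauss(0, 1) for _ in range(20)]


def toy_lrt(data):
    return sum(v * v for v in data) / len(data)


p_value, boot = bootstrap_lrt(50.0, toy_simulate, toy_lrt)
print(p_value)
```

The precision of the *p*-value depends on the number of bootstrap draws, which is why a precise value can be computationally demanding.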

A second approach to evaluating the number of classes is to use information criteria (IC). Two well-known information criteria are Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC), but there are many others. What ICs generally try to do is balance the *fit* of the model against the *complexity* of the model. Fit is measured by –2 times the log-likelihood and a penalty is then applied for complexity, usually some function of the number of parameters and/or sample size. Often, but not always, ICs are scaled so that smaller values are better. So one would fit models with 1, 2, 3, etc. classes and then select the model with the lowest IC value as providing the best balance of fit against complexity. Different ICs were motivated in different ways and implement different penalties. Some penalties are stiffer than others, so for instance the BIC penalty usually exceeds the AIC penalty. When choosing the number of classes, simulation studies have shown AIC to be too liberal (it tends to support taking too many classes), whereas BIC generally does well as long as the classes are reasonably well separated. For less distinct classes (that is, classes that may reside closer together and are thus harder to discern), a sample size adjusted version of the BIC, which ratchets down the penalty a bit, sometimes performs better. While there are many different ICs to choose from, we generally find the BIC to be a reasonable choice.
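For readers who want to see the penalties side by side, here is a minimal sketch using the usual formulas for the AIC, the BIC, and the sample size adjusted BIC. The log-likelihoods, parameter counts, and sample size are invented for illustration.

```python
import math


def information_criteria(loglik, n_params, n_obs):
    """Return AIC, BIC, and sample size adjusted BIC (smaller is better)."""
    deviance = -2.0 * loglik
    return {
        "AIC": deviance + 2.0 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        # The adjusted BIC replaces N with (N + 2) / 24 in the penalty
        "SABIC": deviance + n_params * math.log((n_obs + 2) / 24.0),
    }


# Hypothetical results: number of classes -> (log-likelihood, number of parameters)
fits = {1: (-1300.0, 4), 2: (-1250.0, 7), 3: (-1245.0, 10)}
n_obs = 500

table = {k: information_criteria(ll, p, n_obs) for k, (ll, p) in fits.items()}
best_by_bic = min(table, key=lambda k: table[k]["BIC"])
best_by_aic = min(table, key=lambda k: table[k]["AIC"])
print(best_by_bic, best_by_aic)
```

With these made-up numbers the BIC favors 2 classes while the AIC favors 3, which mirrors the tendency of the AIC to support more classes.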

A third common approach is to consider the *entropy* of the model. Entropy is a measure of how accurately one could assign cases to classes. Finite mixture models are probabilistic classification models in the sense that there is not a hard partition of the sample into non-overlapping clusters but instead there is a probability that each person belongs to each class; further, these probabilities sum to 1.0 for each individual reflecting there is a 100% chance they belong to one of the classes. However, sometimes one is interested in producing such a hard partition based on the probabilities, for instance by assigning a case to the class to which they most likely belong, a technique called *modal assignment*. If the probabilities of class membership tend toward zero and one, then this implies that there should be few errors of assignment. But as the probabilities move away from zero and one this reflects greater uncertainty about how to assign cases and an increased rate of assignment errors. For instance, if my probabilities for belonging to Classes 1 and 2 are .9 and .1, there’s a 90% chance I would be correctly assigned to Class 1. That’s pretty good. But if my probabilities are .6 and .4, there is only a 60% chance that placing me into Class 1 would be the right decision. Entropy summarizes the uncertainty of class membership across all individuals, providing a sense of how accurately one can classify based on the model.
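The two cases above can be worked through in a few lines. This sketch (function name ours) performs modal assignment and averages each person’s largest class probability, which by the reasoning in the paragraph above is the expected proportion of correct assignments.

```python
def modal_assignment(posteriors):
    """Assign each case to its most likely class.

    posteriors: list of per-person class probabilities, each summing to 1.
    Returns (assigned class indices, expected proportion correctly assigned).
    """
    assigned = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    expected_correct = sum(max(p) for p in posteriors) / len(posteriors)
    return assigned, expected_correct


# The two cases from the text: probabilities of (.9, .1) and (.6, .4)
classes, accuracy = modal_assignment([[0.9, 0.1], [0.6, 0.4]])
print(classes, accuracy)
```

Both people are placed in the first class (index 0), but we would expect only about 75% of such assignments to be correct.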

There are several different types of entropy-based statistics. Some are of the same form as the ICs described above, in which the fit of the model is balanced against a penalty that is now a function of entropy (e.g., the classification likelihood criterion). Others are transformations of entropy to make interpretation easier (e.g., normalized entropy criterion). The *E* entropy statistic developed by Ramaswamy et al. (1992) is particularly popular: it has a nice scale, ranging from 0 to 1, with 1 indicating perfect accuracy, and is standard output in some software (e.g., Mplus). One might thus calculate *E* values (or some other entropy-based statistic) for models with different numbers of latent classes and then select the model with the greatest classification accuracy. But this presupposes that one wants to select a model that consists of well-separated classes. Sometimes, classes aren’t well separated. Consider that there is a well-recognized height difference between adult men and women, yet men are only about 7% taller than women on average, so there is a lot of overlap between the height distributions. It seems reasonable to assume that latent classes will overlap at least as much as natural groups do, so entropy may be a poor guide to the number of classes in many realistic scenarios. Thus, in most cases, it is probably best not to use entropy to guide class enumeration, but instead to consider it a property of the model that is ultimately selected. That is, determine the number of classes using the BIC and/or bootstrapped likelihood ratio test, then examine the entropy as a descriptive statistic of the selected model.
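As a descriptive statistic, *E* is easy to compute from the class probabilities. The sketch below uses the formula as it is commonly given (one minus the summed entropy divided by N times the log of the number of classes); the function name and the example probabilities are ours.

```python
import math


def relative_entropy(posteriors):
    """Normalized entropy statistic E in [0, 1]; 1 means perfectly crisp classification.

    posteriors: list of per-person class probabilities (at least 2 classes).
    """
    n = len(posteriors)
    k = len(posteriors[0])
    # Sum of -p * ln(p) over all people and classes, treating 0 * ln(0) as 0
    total = sum(-p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1.0 - total / (n * math.log(k))


crisp = relative_entropy([[1.0, 0.0], [0.0, 1.0]])
fuzzy = relative_entropy([[0.5, 0.5], [0.5, 0.5]])
mixed = relative_entropy([[0.9, 0.1], [0.6, 0.4]])
print(crisp, fuzzy, round(mixed, 3))
```

Probabilities of exactly zero and one give an *E* of 1, a coin flip for everyone gives an *E* of 0, and the two cases discussed earlier land in between.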

So we seem to have arrived at a straightforward set of recommendations. First, fit models with 1, 2, 3, etc. latent classes (until estimation breaks or we reach some practically useful / theoretically plausible upper bound like say 10 classes). Second, compare the fit of these models using your preferred information criterion (perhaps BIC, perhaps sample-size adjusted BIC). Also use the bootstrapped likelihood ratio test to get formal *p*-values. Hope your IC of choice and the bootstrapped LRT arrive at the same answer. Third, write your paper. How hard can all of that possibly be? Well, sometimes (maybe even oftentimes) this process doesn’t work, occasionally in small ways and occasionally in blow-up-in-your-face ways. You might end up selecting a model that is problematic, like having a very small class that is impractical and which you suspect may just reflect outliers or over-fitting to the data. Or you might select a model where, substantively, some of the classes seem similar enough that it isn’t worth distinguishing them. In such cases, you might use your content area knowledge (expert opinion) to decide that maybe the quantitatively “best” model isn’t as useful as the next-best model. Of course, this introduces subjectivity to the model selection process, and people may disagree about these decisions, so you want to justify your choice.

Other times, IC values just keep getting better as classes are added to the model and bootstrapped LRTs just keep giving significant results. This seems to happen a lot when analyzing especially large samples. What this reveals is a problem in our logic so far. To this point, we’ve assumed that the finite mixture model is *literally correct*: that is, there is some number of latent groups mixed together in the population and our job is to go find that number. But what if the model isn’t literally correct? Arguably, all models represent imperfect approximations to the true data generating process. We hope these models recover important features of the underlying structure, but we don’t necessarily regard them as correct. From this perspective, there isn’t some number of true classes to find. But, if that is the case, then what are we doing when we conduct class enumeration? We would argue that we are evaluating different possible approximations to the data, trying to discern how many classes it takes to recover the primary structure without taking so many that we are starting to capture noise or nuisance variation.

At small sample sizes, we can only afford a gross approximation with few classes, but with higher sample sizes, we can start to recover finer structure with more classes. That finer structure may not always be of substantive interest, but it’s there, and traditional class enumeration procedures (BIC, etc.) will reward models that recover it. For example, with a modest amount of data we might be able to identify differences in attitudes, behavior, fashion, and speech between individuals living in broad regions of the United States, like the Northeast and Southwest. With more data, we might be able to see more nuanced differences, separating into smaller regions like mid-Atlantic states, upper Midwest, etc. In reality, the states (aside from Alaska and Hawaii) are contiguous, and attitudes, behavior, fashion, and speech patterns vary continuously over complex cultural and geographic gradients. Nevertheless, regional classifications capture important differences in local conditions. There’s no right number of classifications, just differences in fineness. With enough data, we can make our classes extremely local, but this might not always be useful to do.

Ultimately, then, there is an inconsistency between the perspective motivating the development and evaluation of traditional class enumeration procedures (that there is a true number of classes to find) and the context within which these are applied in practice (where the model is an approximation). This can lead to problems like seeing support for more and more classes at larger and larger sample sizes. In such cases, the number selected may again be determined more by subjective considerations such as the size, distinctiveness, and practical utility of the classes.

In sum, standard practice in determining the number of classes for a finite mixture model is to fit models with 1, 2, 3, etc. classes using maximum likelihood estimation, then compare fit using specialized likelihood ratio tests (bootstrapped LRT or Lo-Mendell-Rubin LRT), information criteria (BIC, AIC, etc.), or entropy, and to try to objectively triangulate on an optimal number. Simulation studies suggest bootstrapped LRTs and BIC generally work well. However, these presuppose that there is some true number of classes to find. In most instances, a more realistic perspective is that the model is instead providing an approximation to the underlying structure and there may not be a true number of classes to find. Even the archetypal concept of species undergirding our example with the finches is a bit more muddled than we learned in high school biology. On this view, the goal of our analysis is to select a number of classes that recovers the important features of the data without capturing noise or nuisance variation. Traditional class enumeration procedures can still serve as a useful guide, balancing fit and parsimony in quantifiable ways, but content area knowledge also plays an important role in determining how fine to make the approximation before it becomes impractical and unwieldy.

**References**

Henson, J.M., Reise, S.P., & Kim, K.H. (2007). Detecting mixtures from structural model differences using latent variable mixture modeling: A comparison of relative model fit statistics. *Structural Equation Modeling, 14*, 202-226.

Kim, S.-Y. (2014). Determining the number of latent classes in single- and multiphase growth mixture models. *Structural Equation Modeling, 21*, 263-279.

Liu, M. & Hancock, G.R. (2014). Unrestricted mixture models for class identification in growth mixture modeling. *Educational and Psychological Measurement, Online First*.

Lo, Y., Mendell, N.R., & Rubin, D.B. (2001). Testing the number of components in a normal mixture. *Biometrika, 88*, 767–778.

McLachlan, G.J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. *Journal of the Royal Statistical Society, Series C, 36*, 318-324.

McLachlan, G., & Peel, D. (2000). *Finite mixture models*. New York: Wiley.

Nylund, K.L., Asparouhov, T. & Muthen, B.O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. *Structural Equation Modeling, 14*, 535-569.

Ramaswamy, V., DeSarbo, W.S., Reibstein, D.J., & Robinson, W.T. (1992). An empirical pooling approach for estimating marketing mix elasticities with PIMS data. *Marketing Science, 12*, 241-254.

Tofighi, D. & Enders, C.K. (2008). Identifying the correct number of classes in growth mixture models. In G.R. Hancock & K.M. Samuelsen (Eds.), *Advances in Latent Variable Mixture Models* (pp. 317-341). Greenwich, CT: Information Age.

The post My advisor told me to use principal components analysis to examine the structure of my items and compute scale scores, but I was taught not to use it because it is not a “true” factor analysis. Help! appeared first on CenterStat.

Help, indeed. This issue has been a source of both confusion and contention for more than 75 years, and papers have been published on this topic as recently as just a few years ago. A thorough discussion of principal components analysis (PCA) and the closely related methods of exploratory factor analysis (EFA) would require pages of text and dozens of equations; here we will attempt to present a more succinct and admittedly colloquial description of the key issues at hand. We can begin by considering the nature of *composites*.

Say that you were interested in obtaining scores on negative affect (e.g., sadness, depression, anxiety) and you collected data from a sample of individuals who responded to 12 items assessing various types of mood and behavior (e.g., sometimes I feel lonely, I often have trouble sleeping, I feel nervous for no apparent reason, etc.). The simplest way to obtain a composite scale score would be to compute a mean of the 12 items for each person to represent their overall level of negative affect. This is often called an *unweighted* linear composite because all items contribute equally and additively to the scale score: that is, you simply add them all up and divide by 12. This approach is widely used in nearly all areas of social science research.

However, now imagine that you could compute *more* than one composite from the set of 12 items. For example, you might not believe a single overall composite of negative affect exists, but that there is one composite that primarily reflects *depression* and another that primarily reflects *anxiety*. This is initially very strange to think about because you want to obtain *different* composites from the *same* 12 items. The key is to *differentially* weight the items for each composite you compute. You might use larger weights for the first six items and smaller weights for the second six items to obtain the first composite, and then use smaller weights for the first six items and larger weights for the second six items to obtain the second composite. Now instead of having a single overall composite of the 12 items assessing negative affect, you have one composite that you might choose to label *depression* and a second composite that you might choose to label *anxiety*, and both were based on differential weighting of the same 12 items. This is the core of PCA.
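To make the idea of unweighted versus differentially weighted composites concrete, here is a minimal sketch for one person's responses. The responses and the weights are made up for illustration (a real PCA would estimate the weights from the data), and the function name `weighted_composite` is our own.

```python
# Hypothetical responses of one person to 12 items (1-5 scale); items 1-6 are
# imagined to tap depression and items 7-12 anxiety.
responses = [4, 5, 4, 3, 4, 5, 2, 1, 2, 2, 1, 2]

def weighted_composite(items, weights):
    """Weighted linear composite: the sum of weight * item response."""
    if len(items) != len(weights):
        raise ValueError("need one weight per item")
    return sum(w * x for w, x in zip(weights, items))

# Unweighted composite: every item gets the same weight, 1/12 (i.e., the mean).
equal_weights = [1 / 12] * 12
negative_affect = weighted_composite(responses, equal_weights)

# Differential weights (made up for illustration, not estimated from data).
depression_weights = [0.15] * 6 + [0.02] * 6
anxiety_weights = [0.02] * 6 + [0.15] * 6
depression = weighted_composite(responses, depression_weights)
anxiety = weighted_composite(responses, anxiety_weights)

print(round(negative_affect, 3), round(depression, 3), round(anxiety, 3))
```

The same 12 responses yield three different scores; only the weights change.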

PCA dates back to the 1930s and was first proposed by Harold Hotelling as a *data reduction method*. His primary motivation was to take a larger amount of information and reduce it to a smaller amount of information by computing a set of weighted linear composites. The goal was for the composites to reflect *most*, though not *all*, of the original information. He accomplished this through the use of the eigenvalues and eigenvectors associated with the correlation matrix of the full set of items. Eigenvalues represent the variance associated with each composite, and eigenvectors represent the weights used to compute each composite. In our example, the first two eigenvalues would represent the *variances* of the depression and anxiety composites, and the eigenvectors or *weights* would tell us how much each item contributes to each composite. It is possible to compute as many composites as items (so we could compute 12 composites based on our 12 items) but this would accomplish nothing in terms of data reduction because we would simply be exchanging 12 items for 12 composites. Instead, we want to compute a much smaller number of composites than items that represent *most* but not *all* of the observed variance (so we might exchange 12 items for two or three composites). The cost of this reduction is some loss of information, but the gain is being able to work with a smaller number of composites relative to the original set of items.
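The role of the eigenvalues and eigenvectors is easiest to see in the smallest possible case. As a toy illustration (not the 12-item example), take just two standardized items with correlation *r* > 0; the symbols below are our own notation.

```latex
\mathbf{R} = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
\lambda_1 = 1 + r, \quad \mathbf{w}_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
\lambda_2 = 1 - r, \quad \mathbf{w}_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}
```

The first composite weights the two items equally and has variance 1 + *r*; the second contrasts them and has variance 1 - *r*. The two eigenvalues sum to 2, the total variance of the two standardized items. With *r* = 0.6, for example, keeping only the first composite retains 1.6 / 2 = 80% of the variance while replacing two items with one score.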

There are many heuristics used to determine the “optimal” number of composites to extract from a set of items. Methods include the Kaiser-Guttman rule, looking for the “bend” in a scree plot of eigenvalues, parallel analysis, and evaluating the incremental variance associated with each extracted component. There are also many methods of “rotation” that allow us to rescale the item weights in particular ways to make the underlying components more interpretable (helping us “name” the factors). For example, if the first six items assessed things like sadness and loneliness and had large weights on the first component but smaller weights on the second, we might choose to name the first component “depression”, and so on. Often, the end goal is to obtain conceptually meaningful weighted composite scores for later analyses.
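Two of these heuristics are simple enough to sketch directly. The eigenvalues below are hypothetical (chosen only so that they sum to 12, as the eigenvalues of a 12-item correlation matrix must); the code applies the Kaiser-Guttman rule and tabulates the incremental and cumulative variance for each component.

```python
# Hypothetical eigenvalues from a 12-item correlation matrix. Eigenvalues of a
# correlation matrix sum to the number of items, here 12.
eigenvalues = [4.2, 2.4, 0.9, 0.8, 0.7, 0.6, 0.6, 0.5, 0.4, 0.4, 0.3, 0.2]

total = sum(eigenvalues)

# Kaiser-Guttman rule: retain components whose eigenvalue exceeds 1.
n_retained = sum(1 for ev in eigenvalues if ev > 1.0)

# Proportion of total variance carried by each component, and cumulatively.
proportions = [ev / total for ev in eigenvalues]
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

print("Components retained by Kaiser-Guttman:", n_retained)
for k, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"component {k}: {p:.1%} of variance, {c:.1%} cumulative")
```

In this made-up example the rule retains two components, which together carry a little over half of the total variance. Parallel analysis and scree plots require more machinery and are not shown.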

Although Hotelling developed PCA strictly as a method of data reduction and composite scoring (indeed, he never even discussed rotation because he was not interested in interpreting individual items), over time this method came to be associated with a broader class of models called exploratory factor analysis, or EFA. The goals of EFA are often very similar to those of PCA and might include scale development, understanding the psychometric structure underlying a set of items, obtaining scale scores for later analysis, or all three. There are many steps in EFA that overlap with those of PCA, including identifying the optimal number of factors to extract; how to rescale (or “rotate”) the factor loadings to enhance interpretation; how to “name” the factors based on what items are weighted more vs. less; and how to compute optimal scores. Given these similarities, there has long been contention about whether PCA is a formal member of the EFA family, or if PCA is not a “true” factor model but instead something distinctly different.

Contention on this point centers on a key defining feature of PCA: it assumes that all items are measured *without error* and all observed variance is available for potential factoring. When fewer composites are taken than the number of items, some residual variance in the items will be left over, but this is still considered “true” variance and not measurement error. In contrast, EFA explicitly assumes that the item responses may be, and indeed very likely are, characterized by measurement error. As such, whereas PCA expresses the components as a direct function of the items (that is, the items *induce* the components), EFA conceptually reverses this relation and instead expresses the items as a function of the underlying latent factors. The factors are “latent” in the sense that we believe them to exist but they are not directly observed, and our motivating goal is to infer their existence based on what we did observe: namely, the items.

Of critical importance is that, unlike the PCA, the EFA assumes that only *part* of the observed item variance is true score variance and the remaining part is explicitly defined as measurement error. Although this assumption allows the model to more accurately reflect what we believe to exist in the population (we nearly always recognize there is the potential for measurement error in our obtained items), this also creates a significant challenge in model estimation because the measurement errors are additional parameters that must be estimated from the data. Whereas PCA can be computed directly from our observed sample data, EFA requires us to move to more advanced methods that allow us to obtain optimal estimates of population parameters via iterative estimation. There are many methods of estimation that can be used in the EFA (e.g., unweighted least squares, generalized least squares, maximum likelihood), each of which has certain advantages and disadvantages. In general, maximum likelihood is often viewed as the “gold standard” method of estimation in most research applications.
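For readers who find a compact statement helpful, the reversal of direction can be written in two lines. This is a sketch in conventional notation that we are introducing here (it does not appear above): **x** is the vector of *p* mean-centered items, **w** a vector of component weights, **η** the latent factors, **Λ** the factor loadings, **Φ** the factor covariance matrix, and **Θ** a diagonal matrix of unique (error) variances.

```latex
% PCA: each component is a weighted sum of the observed items
c_k = \mathbf{w}_k^{\top} \mathbf{x} = w_{k1} x_1 + w_{k2} x_2 + \dots + w_{kp} x_p

% EFA: each item is a function of the latent factors plus a unique term
\mathbf{x} = \boldsymbol{\Lambda} \boldsymbol{\eta} + \boldsymbol{\varepsilon},
\qquad
\boldsymbol{\Sigma} = \boldsymbol{\Lambda} \boldsymbol{\Phi} \boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta}
```

The diagonal of **Θ** is where the extra parameters come from: PCA has no such term, whereas EFA must estimate one unique variance per item along with the loadings.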

We can think about four key issues that ultimately distinguish PCA from EFA:

- The theoretical model is *formative* in PCA and *reflective* in EFA. In other words, the composites are viewed as a function of the items in PCA, but the items are viewed as a function of the latent factors in EFA.
- PCA assumes all observed variance among a set of items is available for factoring, whereas EFA assumes only a subset of the observed variance among a set of items is available for factoring. This implies that PCA assumes no measurement error while EFA explicitly incorporates measurement error into the model.
- Although both PCA and EFA allow for the creation of weighted composites of items, in PCA these are direct linear combinations of items whereas in EFA these are model-implied estimates (or predicted values) of the underlying latent factors. As such, in PCA there is only a single method for computing composites, but in EFA there are many (e.g., regression, Bartlett, constrained covariance, etc.), all of which can differ slightly from one to the other.
- Finally, the confusion between PCA and EFA is exacerbated by the fact that in nearly all major software packages PCA is available as part of the “factor analysis” estimation procedures (e.g., in SAS PROC FACTOR a PCA is defined using “method=principal” but an EFA is defined using “method=ML”).

It is difficult to draw firm guidelines for when and if to use PCA in practice. It depends on the underlying theory, the characteristics of the sample, and the goals of the analysis. In most social science applications, particularly those focused on the measurement of psychological constructs, it is often best to use EFA because this better represents what we believe to hold in the population. However, if EFA is not possible due to estimation problems, or if there is an exceedingly large number of items under study, then PCA is a viable alternative. Interestingly, PCA has begun to make a recent comeback in usage within psychology given increased interest in machine learning. It is not uncommon for PCA to be applied to 50 or 100 variables in order to distill them down to a smaller number of composites to be used in subsequent analysis.

Our general recommendation is to initially consider EFA estimated using ML as your first best option, both for model fitting and score estimation. This is because, far more often than not, the EFA model better represents the mechanism we believe to have given rise to the observed data; namely, a process that combines both true underlying construct variation and random measurement error. However, if the EFA is not viable for some reason, then PCA is a perfectly defensible option as long as the omission of measurement error is clearly recognized. Finally, all of the above relates to the exploratory factor analysis model in which all items load on all underlying factors. In contrast, the confirmatory factor analysis (CFA) model allows for *a priori* tests of measurement structure based on theory. If there is a stronger underlying theoretical model under consideration, then CFA is often a better option. We discuss the CFA model in detail in our free three-day workshop, *Introduction to Structural Equation Modeling*.

Below are a few readings that might be of use.

Brown, T. A. (2015). *Confirmatory factor analysis for applied research* (2nd ed.). New York: Guilford Press.

Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. *Psychological Assessment, 7*, 286-299.

Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. *Psychological Methods, 4*, 272-299.

Widaman, K. F. (1993). Common factor analysis versus principal component analysis: Differential bias in representing model parameters? *Multivariate Behavioral Research, 28*, 263-311.

Widaman, K. F. (2018). On common factor and principal component representations of data: Implications for theory and for confirmatory replications. *Structural Equation Modeling: A Multidisciplinary Journal, 25*, 829-847.

The post Second Annual Winter Institute appeared first on CenterStat.

And don’t forget that you can also obtain asynchronous access to any of our great workshops from last spring, now including our ***Free*** *Introduction to Structural Equation Modeling* class. Videos are now available for six months from registration and all other materials can be downloaded and retained indefinitely.

The post I fit a multilevel model and got the warning message “G Matrix is Non-Positive Definite.” What does this mean and what should I do about it? appeared first on CenterStat.

First, let’s translate the technical jargon. Following Laird & Ware (1982), many software programs used to fit multilevel models use the label **G** to reference the covariance matrix of the random effects. For instance, for a linear growth model, we might include both a random intercept and a random slope for time to capture (unexplained) individual differences in starting level and rate of change. In fitting the model, we don’t estimate the individual values of the random effects directly. Instead, we estimate the variances and covariance of the random effects, i.e., a variance for the intercepts, a variance for the slopes, and a covariance between intercepts and slopes. These variances and covariances are contained in the matrix **G**. Similarly, with hierarchically nested data (e.g., children nested within classrooms or patients nested within physicians), we use random effects to capture (unexplained) cluster-level differences. Random intercepts capture between-cluster differences in outcome levels whereas random slopes capture between-cluster differences in the effects of predictors. Again, as part of fitting the model, we need to estimate the variances and covariances of these random effects and, again, these variance and covariance parameters are contained within the **G** matrix. Note that some software programs may use a different label for the covariance matrix of the random effects, but for this post we will use the common notation of **G** throughout.

When the **G** matrix is non-positive definite (NPD) this means that there are fewer dimensions of variation in the matrix than the expected number (i.e., the number of rows or columns of the matrix, corresponding to the number of random effects in the model). For instance, in our linear growth model example, there are two potentially correlated dimensions of variation specified in the **G** matrix, one corresponding to the random intercepts and one corresponding to the random slopes for time. This is no different than what we would expect for any two variables. If we measured height and weight, for instance, there would be variation in height, variation in weight, and some covariation between height and weight, and this would be captured in the 2 x 2 covariance matrix for the two variables. Here we are simply considering random effects rather than measured variables, but the principle remains the same. Now, imagine what would happen if there were no variation for one variable or random effect. For instance, suppose there were no individual differences in rate of change, making the variance of the slopes equal to zero. Then there would be only one remaining dimension of variation in the matrix (reflecting the random intercepts) and **G** would be NPD (having fewer actual dimensions of variation than its specified number of rows/columns). Thus, one way an NPD **G** matrix can arise is if one (or more) of the random effects in the fitted model has a variance of zero.

However, this is not the only possible way to obtain an NPD **G** matrix. For example, what happens if the intercepts and slopes of our growth model are perfectly correlated (e.g., *r* = 1.0 or -1.0)? Then the two random effects are redundant with one another and actually represent just one dimension of variation. Again, this would lead the **G** matrix to be NPD. More technically, any time one random effect can be expressed as a perfect linear function of the other random effects, the **G** matrix will be NPD. Note that, depending on whether your software program implements boundary constraints on the variance and covariance parameters or not, you can even get negative variances for random effects or correlations exceeding ±1 (known as improper estimates).
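The scenarios just described can be checked numerically. The sketch below tests positive definiteness by attempting a Cholesky factorization, which succeeds only when every pivot is strictly positive. This is one standard way to run the check and not necessarily what any particular multilevel program does internally; the example matrices are hypothetical.

```python
import math

def is_positive_definite(matrix, tol=1e-10):
    """Return True if a symmetric matrix is positive definite.

    Attempts a Cholesky factorization; it succeeds only when every pivot is
    strictly positive, which is equivalent to positive definiteness.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True

# Hypothetical G matrices for a random intercept (row 1) and random slope (row 2).
g_proper = [[1.00, 0.10], [0.10, 0.25]]          # both vary, r = 0.2
g_zero_slope_var = [[1.00, 0.00], [0.00, 0.00]]  # slope variance of zero
g_perfect_corr = [[1.00, 0.50], [0.50, 0.25]]    # r = 0.5 / (1.0 * 0.5) = 1.0
g_negative_var = [[1.00, 0.00], [0.00, -0.05]]   # improper (negative) variance

for name, g in [("proper", g_proper), ("zero slope variance", g_zero_slope_var),
                ("perfect correlation", g_perfect_corr),
                ("negative variance", g_negative_var)]:
    print(name, "->", "positive definite" if is_positive_definite(g) else "NPD")
```

Only the first matrix passes; the other three each have fewer than two usable dimensions of variation.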

Now let’s consider why an NPD covariance matrix for the random effects is usually a problem. Typically, when one includes random effects in a multilevel model, the assumption is that they “exist” as distinguishable components of variation. For instance, our growth model states that people differ in their starting points and rates of change, differences captured by the random intercepts and slopes included in the model specification. When we include random effects like these in our models, we expect them to have variance and, while they might be correlated with one another, none is thought to be fully redundant with the others. When we receive the “G matrix is non-positive definite” warning, it tells us our expectations were wrong. The estimated model found fewer dimensions of variation than the number of random effects that were specified.

Sometimes the problem is just that estimation went awry. For instance, when predictors with random slopes have very different scales, the variances of the random slopes may be numerically quite different, and this can impede proper model estimation. A second possible reason for **G** to be NPD is that we included random effects in the model that simply aren’t there. Sure, people differ in their starting levels but everyone is actually changing at the same rate, so the random intercepts are good but the random slopes are superfluous. A third possibility is that the data simply aren’t sufficient to support estimating the model (even if the model accurately describes the process under study). This often occurs with smaller sample sizes, more complex models, or some combination of the two.

To illustrate this last possibility, let’s say we fit our growth model to data consisting of two repeated measures per person, and these were collected at the same two points in time for everyone in the sample (a common scenario sometimes referred to as “time structured data”). With only these two time points, there simply is not enough information to be able to obtain unique estimates for all of our model parameters. That is, our model is “under-identified”. To intuit why this is the case, imagine a time plot for a set of individuals. If we allow ourselves to draw a different line for each person, each with its own starting level and rate of change, then we will connect the dots perfectly for every case. Yet our model assumes there will be some residual variability around the line as well, i.e., variation around the individual trajectory. Since each line connects the dots, we have no remaining variability with which to capture the residual. Conversely, were we to try to introduce residuals by drawing lines that didn’t perfectly connect the dots, we couldn’t do so without using arbitrary intercepts and slopes. Thus, a typical linear growth model that includes both a residual and random intercepts and slopes cannot be estimated using just two time points of data without producing an NPD covariance matrix for the random effects. That doesn’t mean that there aren’t truly differences in where people start and the rate at which they are changing. It just means that the data are insufficient to tell us about those differences. To be able to identify the model, we would need a third time point (for at least some sizable portion of the sample) to be able to draw a line for each person that doesn’t simply connect the dots and that allows for individual differences in intercepts and slopes as well as residual variability.

A general but imperfect rule of thumb is that, for many of the units in the sample, you want at least one more observation than the number of random effects (e.g., to include two random effects in our growth model, a good number of people in the sample should have three or more repeated measures). If you have fewer observations per unit than indicated by this rule, that may be the cause of your NPD **G** matrix. The warning is telling you that you are trying to do too much with the data at hand. Although we illustrated this rule with longitudinal data, it applies equally to hierarchical data applications. For instance, with dyadic data, there are two partners per dyad, allowing for the inclusion of a random intercept to account for the between-dyad differences; however, no further random effects can be included in the model because their variance/covariance parameters would not be identified given the uniform cluster size of two.
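The identification argument and the rule of thumb can be sketched as a simple counting exercise. The sketch assumes time structured (or equal cluster size) data, an unstructured **G** matrix, and a single residual variance; the function names are our own. It compares the number of unique variances and covariances among the repeated measures with the number of variance parameters the model asks for. Passing this count is necessary but not sufficient for identification.

```python
def covariance_moments(n_observations):
    """Unique variances and covariances among the observations within a unit."""
    return n_observations * (n_observations + 1) // 2

def variance_parameters(n_random_effects):
    """Unstructured G (variances and covariances) plus one residual variance."""
    return n_random_effects * (n_random_effects + 1) // 2 + 1

def passes_counting_rule(n_observations, n_random_effects):
    """Necessary (not sufficient) counting condition for identification."""
    return covariance_moments(n_observations) >= variance_parameters(n_random_effects)

# Linear growth model (random intercept and slope) with two vs. three time points.
print(passes_counting_rule(2, 2))  # 3 moments vs. 4 parameters
print(passes_counting_rule(3, 2))  # 6 moments vs. 4 parameters
# Dyadic data: a random intercept alone vs. one additional random effect.
print(passes_counting_rule(2, 1))  # 3 moments vs. 2 parameters
print(passes_counting_rule(2, 2))  # 3 moments vs. 4 parameters
```

Under these assumptions the count passes exactly when there are at least *p* + 1 observations per unit for *p* random effects, which is where the rule of thumb comes from.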

Complicating matters, however, is that even when the number of observations per sampling unit is theoretically sufficient, one may still obtain an NPD covariance matrix. That is, the model is in principle mathematically identified but the data still aren’t able to support the full dimensionality of the random effects. Such a scenario is most likely to arise in small samples and when the number of random effects in the model is either large (e.g., five or more) or approaches the maximum number that can possibly be identified by the data. For instance, let’s say we have time structured data with four repeated measures per person. In principle, we can fit a quadratic growth model with a random intercept, random linear effect of time, and random quadratic effect of time. Four observations per person should be enough to be able to obtain unique variance and covariance estimates for three random effects. Yet when we fit the model, we might still obtain the warning, “G matrix is non-positive definite.” In such a case, inspecting the variance-covariance parameter estimates will likely reveal that the quadratic random effect has an estimated variance of zero (or negative variance) or that our random effects have excessively high correlations with one another (in practice, these very high correlations are often negative). Empirically, we cannot distinguish all the components of variability that we specified for the individual trajectories.
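Inspecting the estimates for these two symptoms is easy to automate. The sketch below takes an estimated **G** matrix (hypothetical values here, for intercept, linear, and quadratic random effects), flags variances that are zero or negative, and converts covariances to correlations to flag excessive ones. The 0.95 threshold is an arbitrary choice for this sketch rather than an established cutoff, and `diagnose_g` is our own helper.

```python
import math

def diagnose_g(g, corr_threshold=0.95):
    """Return a list of human-readable warnings about an estimated G matrix."""
    warnings = []
    n = len(g)
    # Symptom 1: a variance that collapsed to zero or went negative.
    for i in range(n):
        if g[i][i] <= 0:
            warnings.append(
                f"random effect {i + 1}: variance {g[i][i]:.3f} is zero or negative"
            )
    # Symptom 2: a correlation at or beyond the (arbitrary) threshold.
    for i in range(n):
        for j in range(i):
            if g[i][i] > 0 and g[j][j] > 0:
                r = g[i][j] / math.sqrt(g[i][i] * g[j][j])
                if abs(r) >= corr_threshold:
                    warnings.append(
                        f"random effects {j + 1} and {i + 1}: correlation {r:.2f}"
                    )
    return warnings

# Hypothetical estimates for intercept, linear, and quadratic random effects.
g_hat = [
    [2.00, -0.58, 0.00],
    [-0.58, 0.18, 0.00],
    [0.00, 0.00, 0.00],
]
problems = diagnose_g(g_hat)
for line in problems:
    print(line)
```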

Now that we understand when and why NPD **G** matrices occur, let’s consider what to do about them. What to do depends, of course, on what prompted the NPD solution. First, do your best to determine whether your model is identified. Model identification can be tricky with multilevel models, but drawing on our rule of thumb, consider whether with *p* random effects in your model, your sampling units have at least *p* + 1 observations. If not, you probably need to simplify the model. Even if your model is mathematically identified, model simplification might still be in order. Remember, a non-positive definite **G** matrix signals a lack of empirical support for each random effect to represent a non-redundant component of variation. A logical remedy is then to remove random effects until the warning message goes away. Typically, one should remove higher-order terms before lower-order terms (e.g., remove the quadratic random effect before the linear one, and the linear before the intercept). One pattern of results that is particularly amenable to this strategy is when the variance estimate for a random effect collapses to zero (or goes negative), suggesting it should be removed. We caution, however, that non-significance of a variance estimate should not be taken to imply that the random effect can be sacrificed without worry. Non-significance might simply be a result of low power. Trimming terms based on p-values thus runs the risk of over-simplification, with consequences for the validity of the inferences made from the model.

Additionally, we want to emphasize that reducing the number of random effects is not always defensible, desirable, or necessary. For instance, suppose our theory suggested the inclusion of two random slopes in the model. Each is estimated with some non-zero variance but the slopes are excessively correlated with one another, producing an NPD **G** matrix. Which should we remove? Both were hypothesized to exist and there is no empirical information to prompt the exclusion of one versus the other. Fortunately, we may not have to remove either. Sometimes, re-parameterizing the random effects covariance matrix is sufficient to resolve the problem. Specifically, McNeish & Bauer (2020) showed that using a factor analytic (FA) decomposition of the random effects covariance matrix can greatly aid convergence and reduce the incidence of NPD solutions. When necessary, the FA decomposition can also be used to facilitate a dimension reduction to the random effects covariance matrix that doesn’t require any of the random effects to be omitted entirely. In that case, you are effectively acknowledging that an NPD **G** matrix is just something you have to live with given the complexity of your model, but you are choosing to do so in as graceful (and empirically useful) a manner as possible.
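To give a feel for why this kind of re-parameterization helps, the sketch below builds a covariance matrix as the product of a loading matrix with its own transpose. This illustrates the general idea only and is not McNeish & Bauer’s implementation; the loading values and the helper `g_from_loadings` are hypothetical. Because the matrix is constructed as a product of this form, any real-valued loadings imply non-negative variances and correlations within ±1, so the optimizer cannot wander into improper territory.

```python
def g_from_loadings(loadings):
    """Build G = L L^T from a matrix of loadings L (rows = random effects)."""
    n = len(loadings)
    return [
        [sum(a * b for a, b in zip(loadings[i], loadings[j])) for j in range(n)]
        for i in range(n)
    ]

# Full-rank version: a 2 x 2 lower-triangular loading matrix (hypothetical values).
full_rank = [[1.0, 0.0],
             [0.3, 0.4]]
g_full = g_from_loadings(full_rank)

# Reduced-rank version: both random slopes load on a single dimension, so both
# keep a non-zero variance but share one underlying component of variation.
reduced_rank = [[0.8],
                [0.5]]
g_reduced = g_from_loadings(reduced_rank)

print(g_full)
print(g_reduced)
```

The reduced-rank matrix is NPD by construction (its implied correlation is 1), yet both random slopes remain in the model with non-zero variances. That is the “graceful” dimension reduction described above.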

One other strategy is to abandon the random effects entirely and move to a marginal or “population average” model (Fitzmaurice, Laird, & Ware, 2011; McNeish, Stapleton & Silverman, 2016). In a marginal model, one captures dependence among observations using a covariance structure for the residuals (e.g., compound symmetric, autoregressive, etc.) rather than through the introduction of random effects. Generalized estimating equations (GEE) are one popular algorithm for fitting marginal models, particularly when working with longitudinal data and discrete outcome variables. The obvious downside to a marginal modeling approach is the inability to quantify individual differences between units. For instance, applied in a longitudinal setting, a marginal model would provide estimates of how the mean of the outcome variable changes over time but would not provide estimates of how individuals vary from one another in their trajectories. With hierarchical data, a related approach is to assume independence of observations (despite knowing this assumption to be incorrect), but then implement “cluster corrected” or “robust” standard errors to obtain valid inferences. This latter option is commonly used in survey research where the nesting of units is a by-product of the sampling design (e.g., cluster sampling) but of little substantive interest. In general, these marginal modeling approaches obviate the possibility of an NPD **G** matrix by omitting random effects from the model, but they are typically only useful if clustering is a nuisance and between-cluster differences are not of theoretical interest (see McNeish et al., 2016).
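As a small illustration of what “a covariance structure for the residuals” means, the sketch below constructs the two structures named above for four equally spaced occasions. The variance and correlation values are hypothetical, and in a real analysis these parameters would be estimated by the software rather than fixed by hand.

```python
def compound_symmetric(n_times, variance, rho):
    """Equal variances; every pair of occasions shares the same correlation."""
    return [
        [variance if i == j else variance * rho for j in range(n_times)]
        for i in range(n_times)
    ]

def autoregressive(n_times, variance, rho):
    """AR(1): correlation decays as rho ** lag for equally spaced occasions."""
    return [
        [variance * rho ** abs(i - j) for j in range(n_times)]
        for i in range(n_times)
    ]

cs = compound_symmetric(4, variance=2.0, rho=0.5)
ar1 = autoregressive(4, variance=2.0, rho=0.5)

for row in cs:
    print(row)
for row in ar1:
    print(row)
```

Under compound symmetry the first and fourth occasions covary just as strongly as adjacent occasions, whereas under AR(1) the covariance shrinks with each additional lag. Neither structure involves a **G** matrix at all.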

In sum, the warning “G matrix is non-positive definite” tells you that there are fewer unique components of variation in your estimated random effects covariance matrix than the number of random effects in the model. This can be a consequence of fitting an under-identified model, in which case one must simplify the random effects structure. Alternatively, it may reflect sparse empirical information to support the random effects in the model (especially in small samples or with more complex models). Removing random effects is then a common solution. Often, however, a better solution is to re-parameterize the random effects covariance matrix to facilitate optimization to a proper solution, for instance by using a factor analytic decomposition. If the random effects are not of substantive interest, then you might also consider moving to a marginal model to avoid the issue entirely.

**References**

Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2011). *Applied longitudinal analysis* (2nd ed.). Philadelphia, PA: Wiley.

Laird, N.M. & Ware, J.H. (1982). Random-effects models for longitudinal data. *Biometrics*, *38*, 963-974. https://doi.org/10.2307/2529876

McNeish, D. & Bauer, D.J. (2020). Reducing incidence of nonpositive definite covariance matrices in mixed effect models. *Multivariate Behavioral Research*. Advance online publication. https://doi.org/10.1080/00273171.2020.1830019

McNeish, D., Stapleton, L.M. & Silverman, R.D. (2016). On the unnecessary ubiquity of hierarchical linear modeling. *Psychological Methods, 22*, 114-140. https://doi.org/10.1037/met0000078

The post Introduction to Latent Curve Modeling at Society for Psychotherapy Research appeared first on CenterStat.

The post I’m reporting within- and between-group effects from a multilevel model, and my reviewer says I need to address “sampling error” in the group means. What does this mean, and what can I do to address this? appeared first on CenterStat.

In our prior post, we talked about how it can be important to separate within- and between-group effects for lower level predictors in MLM. To recap, this is usually done in one of two ways. The first way is to add the predictor, say *x*, to the model (perhaps after grand-mean centering) along with the group means of *x*. With this specification, the obtained coefficient for *x* will be the estimated within-group effect and the coefficient for the group means of *x* will be the estimated *contextual effect*, capturing the extent to which the between-group effect of *x* differs from its within-group effect. The second way is to center *x* with respect to its group mean, and then fit the model with this group-mean-centered *x* along with the group means of *x*. The group-mean-centered *x* generates an estimate of the within-group effect and the group means generate an estimate of the between-group effect. Regardless of which approach is used, the observed group means of *x* are included in the model as a predictor, and this is where we can run into problems.
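The two specifications differ only in how the lower level predictor is centered, which the sketch below makes explicit by constructing both sets of predictors for a tiny hypothetical data set (group labels and values are made up). It only prepares the predictors and does not fit a multilevel model.

```python
from collections import defaultdict

# Hypothetical data: (group id, x) for students nested within classrooms.
records = [("a", 2.0), ("a", 4.0), ("a", 6.0), ("b", 7.0), ("b", 9.0)]

# Observed group means of x, which enter the model under either specification.
sums = defaultdict(float)
counts = defaultdict(int)
for group, x in records:
    sums[group] += x
    counts[group] += 1
group_means = {g: sums[g] / counts[g] for g in sums}
grand_mean = sum(x for _, x in records) / len(records)

# Specification 1: grand-mean-centered x plus the group mean
# (coefficients: within-group effect and contextual effect).
spec1 = [(x - grand_mean, group_means[g]) for g, x in records]

# Specification 2: group-mean-centered x plus the group mean
# (coefficients: within-group effect and between-group effect).
spec2 = [(x - group_means[g], group_means[g]) for g, x in records]

print(group_means)
print(spec2)
```

In both cases the second predictor is the *observed* group mean, computed from whichever members of each group happened to be sampled.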

To understand this, think back to when you took intro stats and began to learn about inference and sampling variability. Chances are the lecture went something like this: Imagine that in the population, variable *x* has mean *μ* and variance *σ*². We want to estimate *μ* based on a sample of *n* observations on *x*. The estimate we obtain, the sample mean, is not going to be exactly equal to *μ* because it’s calculated from a sample rather than the entire population. We would therefore obtain different estimates from different samples, and these will tend to vary more from one another in small samples. The variance of the sample mean across repeated samples is *σ*² / *n*, and taking the square root of this yields the familiar formula for the standard error of the mean, *σ* / √*n*.
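A quick numerical sketch shows how fast the standard error shrinks as the sample grows. The population standard deviation of 10 is a hypothetical value chosen for illustration.

```python
import math

def standard_error_of_mean(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population standard deviation
for n in (5, 20, 100):
    print(f"n = {n:3d}: standard error = {standard_error_of_mean(sigma, n):.2f}")
```

Because of the square root, quadrupling the sample size only halves the standard error.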

Now let’s return to the MLM context. Each group mean that we calculate is subject to this same sampling error. When the number of observations sampled for a given group is small, the sampling error in the group mean will be large. This makes perfect sense: Imagine you have a classroom in which you have sampled just five of 40 students and then use the mean of these five students to estimate some characteristic of the entire class; naturally, this mean might vary substantially if computed on some other random five students in the class. Across groups, these sampling errors add “error variance” to the group means, which causes the estimate of the between-group effect to be biased. In the case of a single predictor, the bias is predictably in the direction of the within-group effect (leading the contextual effect to be under-estimated). With multiple predictors, the pattern of bias also depends on the correlations among the predictors and can be harder to predict a priori. Further, this bias can propagate to the estimates of true (non-aggregated) Level 2 predictors (that is, Level 2 predictors that are not obtained as a function of Level 1 observations), even though these predictors do not contain sampling error. Interestingly, because the within and between effects are orthogonal, this bias does not extend to the within-group effect estimates, which remain unbiased.
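The single-predictor case can be sketched with a simulation. This is a simplified illustration rather than a fitted multilevel model: it assumes balanced groups and uses the regression of the group means of *y* on the observed group means of *x* as a stand-in for the between-group coefficient. All parameter values and function names are hypothetical. Under these assumptions the observed slope should land near a reliability-weighted blend of the true between- and within-group effects.

```python
import random
import statistics


def ols_slope(x, y):
    """Simple-regression slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def observed_between_slope(n_groups=4000, n_per_group=5, beta_w=0.2, beta_b=1.0,
                           tau2=1.0, sigma2=4.0, seed=7):
    """Slope of group-mean y on the *observed* (error-laden) group mean of x."""
    rng = random.Random(seed)
    xbars, ybars = [], []
    for _ in range(n_groups):
        mu_j = rng.gauss(0.0, tau2 ** 0.5)   # true (latent) group mean of x
        u_j = rng.gauss(0.0, 0.5)            # group-level residual for y
        xs, ys = [], []
        for _ in range(n_per_group):
            x = mu_j + rng.gauss(0.0, sigma2 ** 0.5)
            y = beta_b * mu_j + beta_w * (x - mu_j) + u_j + rng.gauss(0.0, 1.0)
            xs.append(x)
            ys.append(y)
        xbars.append(statistics.fmean(xs))
        ybars.append(statistics.fmean(ys))
    return ols_slope(xbars, ybars)


def expected_between_slope(n_per_group=5, beta_w=0.2, beta_b=1.0,
                           tau2=1.0, sigma2=4.0):
    """Reliability-weighted blend of the true between and within effects."""
    reliability = tau2 / (tau2 + sigma2 / n_per_group)
    return reliability * beta_b + (1.0 - reliability) * beta_w


if __name__ == "__main__":
    print("true between effect:", 1.0)
    print("observed slope:     ", round(observed_between_slope(), 3))
    print("expected with bias: ", round(expected_between_slope(), 3))
```

With only five observations per group, the observed slope falls well short of the true between-group effect and is pulled toward the within-group effect, as described above.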

Bias due to sampling error in the group means seems like a big problem, except that sometimes it’s not. One consideration is the group sizes in your sample: the larger they are, the less sampling error there will be in the group means. With large enough group sizes, you don’t need to worry much about bias. Likewise, when the true between-group differences are large (there is a high intra-class correlation for the predictor), the sampling error will make up a smaller part of the observed group mean differences, producing less bias. Another mitigating circumstance is if you sampled most or all of the individuals in the group population. The usual formula for the standard error of the mean assumes an infinite population, that is, you sampled *n* people from an infinitely large pool. However, often, and especially with hierarchical data, there may be a limited population size for each group (e.g., one is sampling from a classroom of 20 students). In a finite population of *N* individuals, the sampling variance of the mean can be considerably smaller. In other words, there will be less bias to the between-group effects if the sampling ratio (units in sample to potential units in finite population) is large (e.g., if you sampled 15 students in a classroom of 20). In some cases, you may even have *all* of the available units for each group, such as when studying siblings nested within families. Then there’s no bias whatsoever. Sometimes it is also possible to obtain group-level information from the population rather than calculating it from your sample. For instance, administrative records might provide information on the average family income of all the students in a class, even if only some of them are in the sample. Again, using this, rather than the sample mean, would remove the bias. Finally, if the between-group effect is not really very different from the within-group effect, then the bias in the estimate will be small.
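The roles of group size, intra-class correlation, and sampling ratio can be seen in a short sketch. It uses the textbook finite-population correction for simple random sampling without replacement and defines the reliability of a group mean as true between-group variance over true-plus-error variance. The function names and numeric inputs are assumptions made for illustration.

```python
def sampling_variance_of_mean(within_var, n, pop_size=None):
    """Sampling variance of a group mean.

    pop_size=None treats the group population as infinite (within_var / n).
    Otherwise the finite-population correction (1 - n / pop_size) is applied,
    assuming simple random sampling without replacement within the group.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if pop_size is None:
        return within_var / n
    if n > pop_size:
        raise ValueError("n cannot exceed pop_size")
    return (within_var / n) * (1.0 - n / pop_size)


def group_mean_reliability(between_var, within_var, n, pop_size=None):
    """Share of observed group-mean variance that reflects true group differences."""
    error_var = sampling_variance_of_mean(within_var, n, pop_size)
    return between_var / (between_var + error_var)


if __name__ == "__main__":
    # Hypothetical variances: between-group 1.0, within-group 4.0
    print(group_mean_reliability(1.0, 4.0, n=5))                # small sample, infinite pool
    print(group_mean_reliability(1.0, 4.0, n=15, pop_size=20))  # high sampling ratio
    print(group_mean_reliability(1.0, 4.0, n=20, pop_size=20))  # whole group sampled
```

Reliability moves toward 1 (and the bias toward zero) as the group size grows, as the between-group variance grows relative to the within-group variance, or as the sampling ratio approaches 1.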

But let’s say your situation doesn’t fit with any of these exceptions, then what? Well, some very clever methodologists have been working on ways to fix the bias. Three primary approaches have been suggested, each paralleling an approach for handling measurement error in standard regression models. One way to handle measurement error is with a latent variable model. Following this strategy, Lüdtke et al. (2008) proposed the multilevel latent covariate (MLC) model to handle sampling error in the group means. In this model, the observed scores for the sampled group members are viewed as indicators of the true underlying latent group mean. Shin and Raudenbush (2010) implemented the same idea within a multivariate MLM framework. A second strategy is to generate scores for the latent variable that produce consistent estimates when used as predictors in an observed-variables model. Consistent with this approach, Croon and van Veldhoven (2007) and Shin and Raudenbush (2010) showed that accurate estimates of between-group effects can be obtained by using empirical Bayes (EB) estimates of the group means of *x* rather than the observed sample means. Finally, a third way to handle measurement error is to fit a standard regression model but then implement a post-estimation correction to the estimates based on prior knowledge about the reliability of the predictor. In this case, we can infer the reliability of the predictor from the group size, since the unreliability is due to sampling error. Grilli and Rampichini (2011) and Gottfredson (2019) describe the appropriate corrections to implement this approach. As we describe next, all three of these general approaches can yield accurate estimates of the between-group effects, but which to choose may depend on the specific characteristics of your application.

The MLC model is widely recognized, theoretically elegant, makes the most efficient use of the data, and is conducted in one step, requiring no pre-treatment of the data or post-transformation of the estimates. On the flip side, the MLC is a complex latent variable model and estimation can go awry when the number of groups is small (e.g., fewer than 50). The MLC is also based on a reflective measurement model that assumes that the people in the group are interchangeable and the latent group mean is a characteristic of the group that affects the individual scores (people could come and go but the latent group mean would stay the same). This is in contrast to a formative model, in which the scores of the group members are not necessarily interchangeable and collectively determine the population mean of the group (as people come and go the true group mean changes too). A reflective measurement model can be difficult to justify at times, but the MLC can still be profitably used with a formative process as long as the sampling ratio is low (e.g., only 5% of the population group members were sampled).

The EB approach has the advantage that it is straightforward to implement within a standard multilevel model. You can generate EB estimates for *x* in most MLM software programs (sometimes these are referred to instead as empirical best linear unbiased predictors, or EBLUPs). Then, following Shin and Raudenbush (2010), you simply use these rather than the usual group means of *x* when fitting the model to *y* (both at Level 2 and when centering the predictor at Level 1). However, this approach too has drawbacks. First, computing the EB estimates gets increasingly complicated as the number of predictors increases. These must be computed simultaneously for all of the Level 1 predictors and accounting for any other Level 2 predictors that will be in the model for *y*. Ultimately, for a sufficiently complex model, you may need to program in the matrix equations yourself (see Croon & van Veldhoven, 2007, pp. 51-52). Second, like the MLC, the EB estimates implicitly assume a reflective measurement model (though they too could still be used with a formative measurement process if the sampling ratio was sufficiently low). Third, although this approach generates consistent estimates of the fixed effects, it does not correct the variance component estimates, which may remain biased. In turn, this may bias the standard errors of the fixed effects.
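For the simplest case of a single predictor and no other Level 2 covariates, the shrinkage idea behind EB estimates can be sketched as follows. This is a rough illustration only: it uses simple moment-based variance estimates and an unweighted grand mean, whereas MLM software would typically estimate these by ML or REML, and it is not the multivariate procedure of Croon and van Veldhoven (2007) or Shin and Raudenbush (2010). The function name and the toy data are hypothetical.

```python
import statistics


def eb_group_means(groups):
    """Shrunken (empirical-Bayes-style) group means for one predictor.

    groups: dict mapping group id -> list of x scores.
    Variance components come from simple moment estimators (an assumption
    made for brevity), not from a fitted multilevel model.
    """
    means = {g: statistics.fmean(v) for g, v in groups.items()}
    n_total = sum(len(v) for v in groups.values())
    n_groups = len(groups)

    # Pooled within-group variance
    within_ss = sum(sum((x - means[g]) ** 2 for x in v) for g, v in groups.items())
    sigma2 = within_ss / (n_total - n_groups)

    # Between-group variance: variance of observed means minus average error variance
    mean_list = list(means.values())
    grand = statistics.fmean(mean_list)
    mean_inv_n = statistics.fmean([1.0 / len(v) for v in groups.values()])
    tau2 = max(0.0, statistics.variance(mean_list) - sigma2 * mean_inv_n)

    eb = {}
    for g, v in groups.items():
        denom = tau2 + sigma2 / len(v)
        weight = tau2 / denom if denom > 0 else 0.0  # reliability of this group's mean
        eb[g] = grand + weight * (means[g] - grand)
    return eb


if __name__ == "__main__":
    toy = {"a": [1, 2, 3], "b": [4, 5, 6, 7, 8, 9], "c": [10, 12]}
    print(eb_group_means(toy))
```

Each estimate is pulled from the observed group mean toward the grand mean, and groups with fewer observations are pulled proportionally more.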

Like the EB approach, the reliability-correction approach has the advantage that it can be implemented within a standard multilevel model. Further, no changes are required to the traditional procedures for separating within- and between-group effects. One simply needs to correct the estimates after fitting the model to counteract the expected bias due to sampling error. Corrections can be applied to both fixed effects and variance components and can be computed for either infinite or finite group populations, irrespective of reflective or formative measurement. Adjustments can also be made to the standard errors. But there are downsides to this approach too. First, the reliability-corrected estimates can show excessive sampling variability, making this approach most useful when working in large sample contexts (many groups). Second, the corrections are derived for balanced groups and aren’t fully accurate when group sizes vary. Third, correction formulas focus on the case of a single predictor, whereas it is more common for models to have multiple predictors.
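As a rough sketch of what a post-estimation correction looks like in the simplest setting (one predictor, balanced groups, infinite group populations), the function below inverts the attenuation of the between-group estimate toward the within-group estimate. It is an illustration of the general idea only and should not be read as the exact formulas in Grilli and Rampichini (2011) or Gottfredson (2019); the function name and inputs are assumptions, and no standard-error adjustment is attempted.

```python
def correct_between_effect(observed_between, within, between_var, within_var, n):
    """Reliability-corrected between-group effect (single predictor, balanced groups).

    Assumes the observed between-group coefficient equals
        reliability * true_between + (1 - reliability) * within,
    where reliability = between_var / (between_var + within_var / n).
    """
    reliability = between_var / (between_var + within_var / n)
    return (observed_between - (1.0 - reliability) * within) / reliability


if __name__ == "__main__":
    # Hypothetical estimates from a fitted model
    corrected = correct_between_effect(
        observed_between=0.6444, within=0.2, between_var=1.0, within_var=4.0, n=5
    )
    print("corrected between effect:", round(corrected, 3))
    print("corrected contextual effect:", round(corrected - 0.2, 3))
```

Because the correction divides by the reliability, small reliabilities amplify any noise in the observed estimates, which is one way to see why the corrected estimates can be highly variable unless the number of groups is large.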

Thus, as with so many things in statistics, there is no one right answer for how to address this problem. In your response to the reviewer, we would recommend the following. First, assess if sampling error is truly a problem for your particular analysis. Might your research fall into one of the exceptions where bias is not expected to be a problem (e.g., large cluster sizes or a high sampling ratio)? If so, you simply need to explain this to your reviewer. Second, if it is a problem, think about which of the possible alternative modeling approaches will best suit your needs by considering the advantages and disadvantages discussed above. If you have many predictors, and are fortunate to have a large number of groups in your sample, the MLC model may be your best bet, provided you can reasonably assume a reflective measurement model or low sampling ratio. If your model is small, the EB or reliability correction approaches might be easier to implement, and one or the other could be used to provide a sensitivity analysis for the original results (i.e., does the story change when accounting for sampling error?). These too perform best with a large number of groups. Last, if you have finite group sizes in the population, you recorded the total sizes of the groups from which you sampled, and you are sampling more than a small fraction of the available group populations, the reliability correction approach is the only one of the three that will take this into account to produce accurate estimates.

Research on this topic is ongoing and expanding, but we hope this post will help to orient you to the relevant literature and give you some ideas for how to move forward with your manuscript.

Croon, M.A. & van Veldhoven, M.J.P.M. (2007). Predicting group-level outcome variables from variables measured at the individual level: a latent variable multilevel model. *Psychological Methods, 12*, 45-57.

Gottfredson, N.C. (2019). A straightforward approach for coping with unreliability of person means when parsing within-person and between-person effects in longitudinal studies. *Addictive Behaviors, 94*, 156-161. DOI: 10.1016/j.addbeh.2018.09.031

Grilli, L., & Rampichini, C. (2011). The role of sample cluster means in multilevel models: A view on endogeneity and measurement error issues. *Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 7,* 121–133. https://doi.org/10.1027/1614-2241/a000030

Lüdtke, O., Marsh, H.W., Robitzsch, A., Trautwein, U., Asparouhov, T., & Muthén, B. (2008). The multilevel latent covariate model: a new, more reliable approach to group-level effects in contextual studies. *Psychological Methods, 13*, 203-229. DOI: 10.1037/a0012869

Shin, Y., & Raudenbush, S. W. (2010). A latent cluster-mean approach to the contextual effects model with missing data. *Journal of Educational and Behavioral Statistics, 35*, 26–53. https://doi.org/10.3102/1076998609345252


]]>The post My advisor told me I should group-mean center my predictors in my multilevel model because it might “make my effects significant” but this doesn’t seem right to me. What exactly is involved in centering predictors within the multilevel model? appeared first on CenterStat.

]]>As a simple example, imagine your sample consists of multiple classrooms, and each classroom contains multiple students. Further, you obtained a student-level predictor reflecting *locus-of-control* and a student-level outcome reflecting *math achievement*. Your goal is to examine if students who report higher levels of control also tend to perform better on a math exam. Given the hierarchical structure of your data (the nesting of students within classrooms), there are actually three possible relations that can exist between your predictor and outcome: the total effect, the within-group effect, and the between-group effect (where, here, group is classroom). Let’s consider these in turn.

First is the total effect (or marginal effect), which represents the regression of math achievement on locus-of-control pooling over all students and classrooms. This total effect actually represents a weighted composite of the within-group and between-group components of the relation. While it is perfectly fine to estimate and interpret total effects from the standpoint of prediction (e.g., pooling over students and classrooms, a 1-unit change in the predictor leads to a so-many-points change in the outcome), it is much more difficult to draw theoretically meaningful conclusions from these effects, as the location of the effect is ambiguous – the total effect is a mish mosh of the within- and between-group effects. For this reason, when working with multilevel data, it is often preferable to estimate and interpret the within- and between-group effects directly instead.

The within-group effect is the relation between student locus-of-control and math achievement within a given classroom; this evaluates whether, on average, students who are higher (or lower) on control with respect to the *other students in their class* tend to score higher (or lower) on the math assessment. One way to think about this effect is to imagine that you had only sampled students from a single classroom, say Class A. If you ran a simple regression analysis on the data, you would obtain an effect that tells you about how differences in locus of control are predictive of differences in math scores for students in Class A. You might assume that there’s nothing particularly special about this class and you would have observed the same effect had you sampled from Class B, or Class C, etc. With the multilevel data, we can leverage the data from all of the classrooms in our sample to estimate this common within-group effect with greater precision (and, if we don’t want to assume the within-group effect is the same in each classroom, we can allow for that too, but that’s a story for another day). The within-group effect continues to tell us how, within a given group, differences in the predictor relate to differences in the outcome.

In contrast, the between-group effect is the relation between the *classroom mean* of student locus-of-control and math achievement; this evaluates whether, on average, *classes* characterized by higher (or lower) control tend to score higher (or lower) on math achievement. Here, we can imagine that instead of collecting the individual data, we were only provided with summary data for each classroom. Again, we could run a simple regression on these data, obtaining an estimate of how differences in the average value of locus of control between classrooms relate to differences in the average value of math. With access to the individual, student-level data, we can estimate this effect more optimally (accounting for differences in classroom sizes, for instance), but the interpretation remains the same. If we were to compare two classrooms that differed by 1 unit in their average locus of control values, we would expect the students within these classrooms to differ in their average math scores by the magnitude of the between-group effect.

It is often quite important (if not required) to properly disaggregate the total effect into the within-group component and the between-group component within an MLM, and centering the predictors allows us to accomplish this. To see this, let’s consider a very simple one-predictor MLM for students nested within classrooms in which our predictor is locus-of-control and our outcome is math achievement.

Broadly, centering refers to the process of subtracting the mean from a variable (usually a predictor). Unlike in ordinary regression, centering becomes complicated with multilevel data because there are two possible means around which lower-level predictors can be centered. The first is the *grand mean* that represents the mean of the predictor pooling over all observations and all groups. The second is the *group mean* that represents the mean of the predictor within the group to which the observations belong.

There are thus two primary choices when centering lower-level predictors: we can *grand mean center* the predictor, where we deviate each individual score from the overall mean (literally subtracting the grand mean from each person’s score), or we can *group mean center* the predictor, where we instead deviate each individual score from their own group mean. The former reflects the individual’s relative standing on the predictor with respect to *everyone* in the sample and the latter reflects the individual’s relative standing on the predictor with respect to everyone in their *own group*. Either of these rescaled (or “centered”) predictors can then be used in the Level 1 model, as can the raw (or *uncentered*) version of the predictor. Which is used influences the interpretation of the obtained effects. Further, because the group mean is a characteristic of the group, this itself can be used as an upper-level predictor in the Level 2 equation, regardless of which form of centering is used for the predictor at Level 1 (or even if it is left in the raw scale).
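The two centering choices amount to a few lines of data preparation. The sketch below uses only the Python standard library; the function name, the output keys, and the toy scores are hypothetical and chosen purely for illustration.

```python
import statistics
from collections import defaultdict


def center_predictor(rows):
    """Compute grand-mean-centered and group-mean-centered versions of x.

    rows: list of (group_id, x) pairs, one per Level 1 unit.
    Returns one dict per row, including the group mean for use at Level 2.
    """
    grand_mean = statistics.fmean(x for _, x in rows)

    by_group = defaultdict(list)
    for group, x in rows:
        by_group[group].append(x)
    group_means = {g: statistics.fmean(xs) for g, xs in by_group.items()}

    return [
        {
            "group": group,
            "x": x,
            "x_grand_centered": x - grand_mean,          # standing relative to everyone
            "x_group_centered": x - group_means[group],  # standing relative to own group
            "x_group_mean": group_means[group],          # candidate Level 2 predictor
        }
        for group, x in rows
    ]


if __name__ == "__main__":
    students = [("A", 1.0), ("A", 3.0), ("B", 6.0), ("B", 10.0)]
    for record in center_predictor(students):
        print(record)
```

Note that the group-mean-centered scores sum to zero within every group, so they carry no between-group information; that information lives entirely in the group means.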

When using the predictor in the raw scale or with grand-mean centering, it is critical to include the group means of the predictor at Level 2 to properly disaggregate the effects. The effect obtained for the predictor at Level 1 will then be the within-group effect and the effect obtained at Level 2 will then be the *difference* between the between- and within-group effects, sometimes called the *contextual effect*.

Problems, however, arise if you fail to include the group means in the model when using the raw scale or grand-mean centered predictor. If you do that, you will get a mish mosh effect estimate for the Level 1 predictor that represents neither the between-group nor the within-group effect. Instead, it confounds these two effects together into a single value that may not resemble either. To make matters worse, this mish mosh also doesn’t represent the total effect, as it weights the within- and between-group effects differently. The obtained estimate is difficult to interpret, outside of a few special cases.^{1}

In contrast, when using the predictor with group-mean centering, the effect obtained for the predictor at Level 1 will always be the within-group effect, regardless of whether the group means are included at Level 2 or not. If the group means are included at Level 2, the effect obtained for them will be the between-group effect. Importantly, MLMs fit using raw, grand-mean centered, or group-mean centered predictors all fit precisely the same, provided the group means are entered as predictors at Level 2 and there are no random slopes in the model (again, a story for another day).
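The two specifications can be written side by side for a random-intercept model. The notation here is an assumption introduced for illustration (student *i* in classroom *j*, group mean of the predictor written as x̄ with subscript *j*, within- and between-group effects written as β with subscripts *W* and *B*); it is a sketch of the standard algebra rather than a reproduction of equations from the original post.

```latex
% Raw (or grand-mean centered) predictor plus the group mean at Level 2
y_{ij} = \gamma_{00} + \gamma_{10}\, x_{ij} + \gamma_{01}\, \bar{x}_{j} + u_{0j} + r_{ij},
\qquad \gamma_{10} = \beta_W, \quad \gamma_{01} = \beta_B - \beta_W \ \text{(contextual effect)}

% Group-mean centered predictor plus the group mean at Level 2
y_{ij} = \gamma^{*}_{00} + \gamma^{*}_{10}\,(x_{ij} - \bar{x}_{j}) + \gamma^{*}_{01}\, \bar{x}_{j} + u_{0j} + r_{ij},
\qquad \gamma^{*}_{10} = \beta_W, \quad \gamma^{*}_{01} = \beta_B

% Why the two models fit identically: one is a re-expression of the other
\gamma_{10}\, x_{ij} + \gamma_{01}\, \bar{x}_{j}
  = \gamma_{10}\,(x_{ij} - \bar{x}_{j}) + (\gamma_{10} + \gamma_{01})\, \bar{x}_{j}
\;\;\Longrightarrow\;\; \gamma^{*}_{01} = \gamma_{10} + \gamma_{01}
```

The last line shows that the between-group effect equals the within-group effect plus the contextual effect, so either specification can be converted into the other after fitting.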

With this as context, we can now return to your question, the answer to which depends on how you specified your initial model. If you included the group means in your model at Level 2, then you will obtain exactly the same within-group effect estimate (and p-value) for your Level 1 predictor regardless of which method of centering you use. In that case, your advisor would be wrong: group-mean centering won’t change a thing. On the other hand, if you haven’t included the group means in the model at Level 2, then group-mean centering will generate an estimate of the within-group effect that will differ from the mish mosh estimate you previously obtained with the raw scale or grand-mean centering. The significance of the within-group effect might well differ from the mish mosh estimate you had before, in which case your advisor would be right. And then there’s the effect of the group means at Level 2 to consider. Remember that when these are included at Level 2 the obtained estimates differ in interpretation depending on whether group-mean centering is used or not at Level 1. When using the predictor in raw scale or with grand-mean centering, the estimate represents the contextual effect, whereas with group-mean centering, the estimate represents the between-group effect. These will typically differ from one another and may differ in significance as well, since they test different null hypotheses.

The bottom line is that your advisor might or might not be right, depending on which aspect of the relationship between the predictor and outcome you are estimating in your models (e.g., total, within, or between-group effects). Different forms of centering and model specification can lead to important interpretational differences in the model results that are critical to consider when drawing substantive inferences. It is critical to be aware of exactly what effects you wish to estimate and to ensure that you are specifying the model in such a way that you will obtain tests of those effects.

We can thus draw the following general conclusions:

- If either the raw or grand mean centered predictor is entered at Level 1 without the group mean entered at Level 2, the obtained regression coefficient will confound the within- and between-group components of the relation into a single estimate that is difficult to interpret, outside of special circumstances (e.g., where the within- and between-group effects are the same).
- If either the raw or grand mean centered predictor is entered at Level 1 and the group mean is entered at Level 2, then the regression coefficient associated with the Level 1 predictor represents an unambiguous estimate of the **within-group** effect, and the regression coefficient associated with the Level 2 group mean represents the *difference* between the between-group and within-group effect; this latter effect is sometimes called the **contextual effect**.
- If the group mean centered predictor is entered at Level 1 with or without the group mean entered at Level 2, the regression coefficient represents an unambiguous estimate of the **within-group** effect.
- If the group mean is entered at Level 2 with or without the group mean centered predictor at Level 1, the regression coefficient represents an unambiguous estimate of the **between-group effect**.
- Finally, generalizing from points #3 and #4, if the group mean centered predictor is entered at Level 1 and the group mean is entered at Level 2, this provides simultaneous and unambiguous estimates of both the within-group and between-group effects of the predictor on the outcome.
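These conclusions can be checked numerically. The sketch below is a simplification: it uses ordinary least squares on simulated data as a stand-in for the fixed-effects part of a multilevel model (no random intercept is estimated), and every name and parameter value in it is hypothetical. The relations among the coefficients are algebraic, so they hold exactly in the sample. One caveat: the pooled OLS slope computed here is the total effect, whereas the text notes that the corresponding estimate from a multilevel model weights the two components differently.

```python
import random
import statistics


def ols_two_predictors(x1, x2, y):
    """OLS slopes for y ~ x1 + x2 (intercept handled by mean-centering)."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def ols_slope(x, y):
    """Simple-regression slope of y on x (the pooled, or total, effect)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return (sum((a - mx) * (c - my) for a, c in zip(x, y))
            / sum((a - mx) ** 2 for a in x))


def simulate(n_groups=300, n_per_group=10, beta_w=0.2, beta_b=1.0, seed=3):
    """Students in classrooms with distinct within and between effects."""
    rng = random.Random(seed)
    x, xbar, y = [], [], []
    for _ in range(n_groups):
        mu_j = rng.gauss(0.0, 1.0)
        xs = [mu_j + rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_mean = statistics.fmean(xs)
        for xi in xs:
            x.append(xi)
            xbar.append(group_mean)
            y.append(beta_b * mu_j + beta_w * (xi - mu_j) + rng.gauss(0.0, 1.0))
    return x, xbar, y


x, xbar, y = simulate()
x_centered = [a - b for a, b in zip(x, xbar)]

raw_within, contextual = ols_two_predictors(x, xbar, y)       # raw x + group mean
cwc_within, between = ols_two_predictors(x_centered, xbar, y)  # centered x + group mean
total = ols_slope(x, y)                                        # raw x alone

print("within (raw spec):     ", round(raw_within, 3))
print("within (centered spec):", round(cwc_within, 3))
print("contextual:            ", round(contextual, 3))
print("between:               ", round(between, 3))
print("raw x alone:           ", round(total, 3))
```

The within-group estimate is identical under both specifications, the between-group estimate equals the within-group estimate plus the contextual estimate, and the slope for raw *x* entered alone falls somewhere between the within- and between-group estimates.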

Given the above, it is quite easy to see how confusion can arise about different options for centering, and how individual choices can impact subsequent interpretations of model results. Here we have only offered a brief review, and there are many clear and cogent descriptions of these issues as they arise both in hierarchically clustered data (as described above) and in longitudinal data (where we instead talk about within-person and between-person effects). For more detailed discussions of these issues see Raudenbush and Bryk (2002, pages 31-35, 134-149, and 181-183), Enders and Tofighi (2007), Kreft, de Leeuw, and Aiken (1995), and (if we may) Curran and Bauer (2011).

In conclusion, simply know that there is no “right” or “wrong” choice about centering, but there is most definitely an *optimal* choice based on the theoretical questions under study.

————————————

^{1} One special case is where the within- and between-group effects are the same. Then the value obtained for the raw-scale or grand-mean centered predictor at Level 1 is an unbiased estimate of these effects. But there is seldom cause to assume these effects to be identical *a priori*. Another special case is where there is no between-group variance in the predictor, due to balancing across clusters by design, in which case the estimate will resolve to the within-group effect. An example would be in a longitudinal study where the time scores are the same for all people because the assessment schedule is identical across participants and there is no missing data.

————————————

Curran, P. J., & Bauer, D. J. (2011). The disaggregation of within-person and between-person effects in longitudinal models of change. *Annual Review of Psychology, 62*, 583-619.

Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models: a new look at an old issue. *Psychological Methods*, *12*, 121-138.

Kreft, I. G. G., de Leeuw, J., & Aiken, L. S. (1995). The effect of different forms of centering in hierarchical linear models. *Multivariate Behavioral Research, 30*, 1–21.

Raudenbush, S. W., & Bryk, A. S. (2002). *Hierarchical linear models: Applications and data analysis methods* (Vol. 1). Sage Publications.


]]>The post A reviewer recently asked me to comment on the issue of equivalent models in my structural equation model. What is the difference between alternative models and equivalent models within an SEM? appeared first on CenterStat.

]]>To begin, one of the greatest strengths of the SEM is the ability to estimate models in very specific ways to closely correspond to theory. Sometimes we can think of this as the “whiteboard” problem: we draw out our measured variables on the board and then connect them with single- and double-headed arrows and circles in a way that best reflects our theoretically-derived research hypotheses. We often build one model that is most consistent with our theory, but there are *alternative* models we might consider. Alternative models represent different path diagrams that make different statements about the underlying theory. A key strength of the SEM is that we can make formal comparisons of the fit of alternative models based on sample data: one model might attain superior fit when compared to another, providing empirical support for favoring the better fitting model versus the alternative.

In contrast, whereas *alternative* models almost always lead to *differences* in model fit, *equivalent* models are different representations of model structure that result in precisely the *same* model fit. That is, the models are *equivalent representations* of the sample data and cannot be distinguished from one another based on empirical fit. An equivalent model can be thought of as a re-parameterization of the original model. In other words, it is just a different way of “packaging” the same information in the data and no equivalent model can be distinguished from another based on fit alone. If you were to fit a series of equivalent models to the same sample data, you would obtain exactly the same chi-square test statistic, RMSEA, CFI, TLI, and any other omnibus measure of fit. (Side note: One thing that may be confusing is that, depending on how the models are estimated, their log-likelihoods might differ, but these differences will cancel out when computing measures of fit relative to the corresponding saturated or baseline models, thus their fit remains the same.)

Take a very simple example: a three-variable mediation model might state that the predictor leads to the mediator that in turn leads to the outcome; diagrammatically, this is portrayed as predictor → mediator → outcome.

This model has one degree-of-freedom and will obtain some degree of fit to the data (chi-square, RMSEA, CFI, etc.). However, there are two equivalent models that obtain __precisely__ the same model fit when estimated using sample data. The first is:

and the second is:

All three of these models make fundamentally different statements about the underlying model that gave rise to the observed data, yet all three fit *precisely* the same. As such, the three models are numerically equivalent and can only be adjudicated based on theory.
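A small numerical check makes the equivalence tangible. The sketch rests on several assumptions: the variables are labeled X (predictor), M (mediator), and Y (outcome); the two equivalent models are taken to be the reversed chain (Y → M → X) and the common-cause model (M → X and M → Y), which is a guess at what the original diagrams showed; the sample covariance matrix is made up; and the path estimates are computed by equation-by-equation regression, which coincides with maximum likelihood for these simple recursive models. Fit is summarized with the usual ML discrepancy function.

```python
import math


def det3(a):
    """Determinant of a 3x3 matrix."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def inv3(a):
    """Inverse of a 3x3 matrix via the adjugate."""
    d = det3(a)
    cof = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            r = [k for k in range(3) if k != i]
            c = [k for k in range(3) if k != j]
            minor = a[r[0]][c[0]] * a[r[1]][c[1]] - a[r[0]][c[1]] * a[r[1]][c[0]]
            cof[i][j] = (-1) ** (i + j) * minor
    return [[cof[j][i] / d for j in range(3)] for i in range(3)]


def ml_discrepancy(S, Sigma):
    """F_ML = ln|Sigma| - ln|S| + tr(S Sigma^-1) - p, with p = 3."""
    Sigma_inv = inv3(Sigma)
    trace = sum(S[i][k] * Sigma_inv[k][i] for i in range(3) for k in range(3))
    return math.log(det3(Sigma)) - math.log(det3(S)) + trace - 3


def implied_chain(S, a, b, c):
    """Model-implied covariances for the chain a -> b -> c."""
    path_ba = S[a][b] / S[a][a]                  # regression of b on a
    path_cb = S[b][c] / S[b][b]                  # regression of c on b
    resid_b = S[b][b] - path_ba ** 2 * S[a][a]
    Sigma = [[0.0] * 3 for _ in range(3)]
    Sigma[a][a] = S[a][a]
    Sigma[b][b] = path_ba ** 2 * S[a][a] + resid_b
    resid_c = S[c][c] - path_cb ** 2 * Sigma[b][b]
    Sigma[c][c] = path_cb ** 2 * Sigma[b][b] + resid_c
    Sigma[a][b] = Sigma[b][a] = path_ba * S[a][a]
    Sigma[b][c] = Sigma[c][b] = path_cb * Sigma[b][b]
    Sigma[a][c] = Sigma[c][a] = path_cb * Sigma[a][b]   # indirect path only
    return Sigma


def implied_fork(S, m, x, y):
    """Model-implied covariances for the common-cause model m -> x, m -> y."""
    path_x = S[m][x] / S[m][m]
    path_y = S[m][y] / S[m][m]
    Sigma = [[0.0] * 3 for _ in range(3)]
    for v in (m, x, y):
        Sigma[v][v] = S[v][v]                    # variances are reproduced exactly
    Sigma[m][x] = Sigma[x][m] = path_x * S[m][m]
    Sigma[m][y] = Sigma[y][m] = path_y * S[m][m]
    Sigma[x][y] = Sigma[y][x] = path_x * path_y * S[m][m]
    return Sigma


X, M, Y = 0, 1, 2
S = [[1.0, 0.5, 0.4],      # hypothetical sample covariance matrix
     [0.5, 1.0, 0.6],
     [0.4, 0.6, 1.0]]

fit_chain = ml_discrepancy(S, implied_chain(S, X, M, Y))     # X -> M -> Y
fit_reversed = ml_discrepancy(S, implied_chain(S, Y, M, X))  # Y -> M -> X
fit_fork = ml_discrepancy(S, implied_fork(S, M, X, Y))       # M -> X, M -> Y

print(fit_chain, fit_reversed, fit_fork)
```

All three models imply the same covariance between X and Y (the product of the two sample covariances with M divided by the variance of M), so the discrepancy values agree up to rounding error, and any fit index built from them will agree as well.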

The above example only considers three measured variables with two regression coefficients. Imagine how this problem scales up with many more measures and many more parameters, particularly if a model includes one or more latent factors. Fortunately, much research has been conducted to help identify the set of equivalent models that accompany any given hypothesized model. Although several important papers have been written on this topic (e.g., Stelzl, 1986; MacCallum et al., 1993), a key contribution was made by Lee and Hershberger (1990), who developed what are sometimes called “replacement rules” or just “Lee-Hershberger rules”. Briefly, Lee and Hershberger describe a very clever approach where variables in a given model can be organized into three blocks: a preceding block, a focal block, and a succeeding block. Then, within the focal block, a large number of modifications can be made to how the variables relate to one another (e.g., reversing pathways, changing regression coefficients to covariances), all of which will achieve identical model fit. A model of even modest complexity might have 50 corresponding equivalent expressions, and more complex models can result in hundreds if not thousands of equivalent counterparts.

There are several core takeaway points here. First, it is important to realize that this is simply a characteristic of the SEM and is part of the price we pay for having the flexibility to parameterize models in precisely the way we desire. Second, very little can be done to empirically distinguish among equivalent models (given that traditional measures of fit will be identical). Some specific suggestions have been offered (e.g., Raykov & Penev, 1999) but none are able to fully resolve the issue. Indeed, even replication with an independent sample does not resolve the issue because two equivalent models will attain identical fit within *any* given sample.

As such, it is important that a researcher be unambiguously aware that this issue exists and to realize that any given hypothesized model is just one of an entire *family* of models, all of which are numerically indistinguishable in terms of model fit. Of course some of these models may not be theoretically plausible (e.g., a mediator predicting biological sex or a prediction back in time) but many dozens of options may remain. It is often best to treat this as a limitation of any given study and to potentially present one or a small number of equivalent model options to the reader so that these too might be considered as plausible representations of the data. Further, it might be beneficial to consider these issues when engineering future studies in which certain design elements might be incorporated to help reduce the universe of possible equivalent models.

**References**

Lee, S., & Hershberger, S. (1990). A simple rule for generating equivalent models in covariance structure modeling. *Multivariate Behavioral Research, 25*, 313-334.

MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. *Psychological Bulletin, 114*, 185-199.

Raykov, T., & Penev, S. (1999). On structural equation model equivalence. *Multivariate Behavioral Research, 34*, 199-244.

Stelzl, I. (1986). Changing a causal hypothesis without changing the fit: Some rules for generating equivalent path models. *Multivariate Behavioral Research, 21*, 309-331.


]]>