
This same approach to decision making appears routinely in daily life. For example, we might want to know if a person referred to a clinic is likely to be diagnosed with major depression or not: they either are truly depressed or not (which is unknown to us) and we must decide based on some brief screening test if we believe it is likely that they suffer from depression. Or we might want to know if it is likely someone has a medical diagnosis that requires a more invasive biopsy procedure. Or, one that is near and dear to us all, we may want to know if we do or do not have COVID. There is a “true” condition (you really do or really do not have COVID) and we obtain a positive or negative result on a rapid test we bought for ten dollars from Walgreens. We have precisely the same four possible outcomes as shown above, two that are correct (the test says you have COVID when you really do, or that you do not have COVID when you really do not) and two that are incorrect (it says you have COVID when you really don’t, or that you do not have COVID when you really do).

This is such an important concept that there are specific terms that capture these possible outcomes. *Sensitivity* is the probability that you will receive a positive rapid test result if you truly have COVID (the probability of a *true positive* for those with the disease). *Specificity* is the probability that you will receive a negative rapid test result if you truly do not have COVID (the probability of a *true negative* for those without the disease). One minus sensitivity thus represents the error of obtaining a false negative result, and one minus specificity represents the error of obtaining a false positive result. However, all of the above assumes that the rapid test has one of two outcomes: the little window on the rapid test either indicates a negative or a positive result. But this raises a very important question, *how does the test know?* That is, the test is based on an underlying continuous measure (such as the concentration of antigen it detects), and some cut-off on that continuum must be chosen to convert it into a simple positive or negative result. How to choose that cut-off is exactly the problem that ROC analysis was developed to solve.
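
These quantities are simple to compute once the four outcomes have been counted. A minimal sketch in Python, with hypothetical counts chosen purely for illustration:

```python
def sensitivity(true_pos, false_neg):
    """P(test positive | truly has condition) = TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """P(test negative | truly lacks condition) = TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts: 90 true positives, 10 false negatives,
# 85 true negatives, 15 false positives.
sens = sensitivity(90, 10)
spec = specificity(85, 15)
print(sens, 1 - sens)  # sensitivity and the false-negative rate
print(spec, 1 - spec)  # specificity and the false-positive rate
```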

ROC stands for *receiver operating characteristic*, the history of which can be traced back to the development of radar during World War II. Radar was in its infancy and engineers were struggling to determine how it could best be calibrated to maximize the probability of identifying a real threat (an enemy bomber, or a *true positive*) while minimizing the probability of a false alarm (a bird or a rain squall, or a *false positive*). The challenge was where to best set the continuous sensitivity of the receiver (called *gain*) to optimally balance these two outcomes. In other words, there was an infinite continuum of possible gain settings and they needed to determine a specific value that would balance true versus false readings. This is precisely the situation in which we find ourselves when using a brief screening instrument to identify depression or blood antigen levels to identify COVID.

To be more concrete, say that we had a 20-item screening instrument for major depression designed to assess whether an individual should be referred for treatment or not, but we don’t know at what specific score a referral should be made. We thus want to examine the ability of the continuous measure to optimally discriminate between true and false positive decisions across all possible cut-offs on the continuum. To accomplish this, we gather a sample of individuals with whom we conduct a comprehensive diagnostic workup to determine “true” depression, and we give the same individuals our brief 20-item screener and obtain a person-specific scale score that is continuously distributed. We can now construct what is commonly called a *ROC curve* that plots the true positive rate (or sensitivity) against the false positive rate (or one minus specificity) across all possible cut-points on a continuous measure. That is, we can determine how every possible cut-point on the screener discriminates between those who did or did not receive a comprehensive depression diagnosis.

To construct a ROC curve, we begin by creating a bivariate plot in which the *y-*axis represents *sensitivity* (or *true positives*) and the *x-*axis represents one minus *specificity* (or *false positives*). Because we are working in the metric of probabilities, each axis is scaled between zero and one. We are thus plotting the true positive rate against the false positive rate across the continuum of possible cut points on the screener. Next, a 45-degree line is fixed from the origin (or 0,0 point) to the upper right corner (or 1,1 point) to indicate random discrimination; that is, for a given cut-point on the continuous measure, you are as likely to make a true positive as you are a false positive. However, the key information comes from superimposing the sample-based curve associated with your continuous screener; this reflects the actual true-vs-false positive rate across all possible cut-offs of your screener. An idealized ROC curve (drawn from Wikipedia) is presented below.
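
The construction above can be sketched in a few lines of code: sweep every candidate cut-point and record the resulting (false positive rate, true positive rate) pair. The screener scores and diagnoses here are made up for illustration, not output from any real instrument:

```python
def roc_points(scores, labels):
    """For every candidate cut-point, classify score >= cut as positive
    and return (false_positive_rate, true_positive_rate) pairs."""
    pos = sum(labels)            # truly depressed (label 1)
    neg = len(labels) - pos      # truly not depressed (label 0)
    points = [(0.0, 0.0)]        # cut above the max: nobody is flagged
    # Sweep cut-points from the highest observed score down to the
    # lowest, ending at the (1,1) point where everybody is flagged.
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Tiny made-up screener scores and "true" diagnoses (1 = depressed).
scores = [3, 7, 9, 12, 14, 16, 18, 19]
labels = [0, 0, 1, 0, 1, 1, 0, 1]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs, connected in order, traces the sample-based curve over the unit square.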

If the screener has no ability to discriminate between the two groups, the sample-based ROC curve will fall on the 45-degree line. However, that rarely happens in practice; instead, the curve capturing the true-to-false positive rates across all possible cut-points will lie above the 45-degree line indicating that the test is performing better than chance alone. The further the ROC curve deviates from the 45-degree line, the better able the screener is to correctly assign individuals to groups. At the extreme, a perfect screener will fall in the upper left corner (the 0,1 point) indicating all decisions are true positives and none are false positives. This too rarely if ever occurs in practice, and a screener will nearly always fall somewhere in the upper-left area of the plot.

But how do we know if our sample-based curve is meaningfully higher than the 45-degree line? There are many ways that have been proposed to evaluate this, but the most common is computing the *area under the curve*, or AUC. Because the plot defines a unit square (that is, it is one unit wide and one unit tall), 50% of the area of the square falls below the 45-degree line. Because we are working with probabilities, we can literally interpret this to mean that there is a 50-50 chance a randomly drawn person from the depressed group has a higher score on the screener than a randomly drawn person from the non-depressed group. This of course reflects that the screener has no better than random chance of correctly classifying an individual. But what if the AUC for the screener was say .80? This would reflect that there is a probability of .8 that a randomly drawn person from the depressed group will have a higher score on the screener than a randomly drawn person from the non-depressed group. In other words, the screener is able to *discriminate* between the two groups at a higher rate than chance alone. But how high is high enough? There is not really a “right” answer, but conventional benchmarks are that AUCs over .90 are “excellent”, values between .70 and .90 are “acceptable”, and values below .70 are “poor”. Like most general benchmarks in statistics, these are subjective, and much will ultimately depend on the specific theoretical question, measures, and sample at hand. (We could also plot multiple curves to compare two or more screeners, but we don’t detail this here.)
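
The probabilistic reading of the AUC suggests a direct way to compute it: compare every depressed-group score with every non-depressed-group score and count how often the former is higher, with ties counting one half (the Mann-Whitney interpretation). A small illustrative sketch with made-up data:

```python
def auc(scores, labels):
    """AUC as P(random positive scores higher than random negative),
    with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Same made-up screener scores and diagnoses as before.
scores = [3, 7, 9, 12, 14, 16, 18, 19]
labels = [0, 0, 1, 0, 1, 1, 0, 1]
print(auc(scores, labels))  # 0.75 for these data
```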

Note, however, that the AUC is a characteristic of the screener itself and we have not yet determined the optimal cut-point to use to classify individual cases. For example, say we wanted to determine the value on our 20-item depression screener that would maximize the true positives and minimize the false positives when referring individuals for a comprehensive diagnostic evaluation. Imagine that individual scores could range in value from zero to 50 and we could in principle set the cut-off value at any point on the scale. The ROC curve allows us to compare the true positive to false positive rate across the entire range of the screener and estimate the true vs. false positive classification rates at each and every value of the screener. We can then select the value that best balances true positives against false positives, and that value becomes our cut-off point demarcating who is referred for a comprehensive diagnostic evaluation and who is not. There are a variety of methods for accomplishing this goal, including computing the point at which the curve comes closest to the upper-left corner, the point at which a certain ratio of true-to-false positives is reached, and using more recent methods drawn from Bayesian estimation and machine learning. Some of these methods become quite complex, and we do not detail these here.
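
As one concrete example of these selection rules, the "closest to the upper-left corner" criterion can be sketched as follows, again with invented data:

```python
def closest_to_corner(scores, labels):
    """Pick the cut-point whose (FPR, TPR) pair lies closest to the
    perfect-classification corner (0, 1) in Euclidean distance."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_cut, best_dist = None, float("inf")
    for cut in sorted(set(scores)):
        tpr = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1) / pos
        fpr = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0) / neg
        dist = (fpr ** 2 + (1 - tpr) ** 2) ** 0.5
        if dist < best_dist:
            best_cut, best_dist = cut, dist
    return best_cut

# Same made-up screener data: a score at or above the returned
# cut-point would trigger a referral.
scores = [3, 7, 9, 12, 14, 16, 18, 19]
labels = [0, 0, 1, 0, 1, 1, 0, 1]
print(closest_to_corner(scores, labels))
```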

Regardless of the method used, it is important to realize that the optimal cut-point may not be universal but may vary by one or more moderators (e.g., biological sex or age) such that one cut-point is ideal for children and another for adolescents. Further, the ideal cut-point might be informed by the relative cost of making a false positive vs. a false negative decision. For example, compare the relatively innocuous determination of whether a child might benefit from additional tutoring in mathematics with the much more severe determination of whether an individual suffers from severe depression and is at risk for self-harm; different criteria might well be used in determining the optimal cut-point for the former vs. the latter. Importantly, this statistical architecture is quite general, can be applied across a wide array of settings within the social sciences, and offers a rigorous and principled method to help guide optimal decision making. We offer several suggested readings below.

Fan, J., Upadhye, S., & Worster, A. (2006). Understanding receiver operating characteristic (ROC) curves. *Canadian Journal of Emergency Medicine, 8*, 19-20.

Hart, P. D. (2016). Receiver operating characteristic (ROC) curve analysis: A tutorial using body mass index (BMI) as a measure of obesity. *J Phys Act Res, 1*, 5-8.

Janssens, A. C. J., & Martens, F. K. (2020). Reflection on modern methods: Revisiting the area under the ROC curve. *International Journal of Epidemiology, 49*, 1397-1403.

Mandrekar, J. N. (2010). Receiver operating characteristic curve in diagnostic test assessment. *Journal of Thoracic Oncology, 5*, 1315-1316.

Petscher, Y. M., Schatschneider, C., & Compton, D. L. (Eds.). (2013). *Applied quantitative analysis in education and the social sciences*. Routledge.

Youngstrom, E. A. (2014). A primer on receiver operating characteristic analysis and diagnostic efficiency statistics for pediatric psychology: we are ready to ROC. *Journal of Pediatric Psychology, 39*, 204-221.

The post What are ROC curves and how are these used to aid decision making? appeared first on CenterStat.


To start, there is often much confusion over what constitutes intensive longitudinal data (or ILD), in large part because there exists no formal definition that separates ILD from other types of longitudinal data. That said, ILD tends to fall between two traditional data structures obtained from alternative designs: panel data and time series data. It’s useful to first consider these traditional structures to see how several of their features will combine within ILD.

Historically, the most common method for gathering longitudinal data in psychology and the social and health sciences has been the *panel design*. Typically, a panel design involves assessing a large sample of subjects (say 200 or more) at a much smaller number of time points (say three to six) that tend to be widely spaced in time (say six or 12 months or more). Panel data are often used to empirically examine long-term trajectories of change that might span multiple years, and common analytic methods include the standard latent curve model or a multilevel growth model. (See our prior Help Desk entry on the relation between the LCM and MLM).

A second type of longitudinal design, commonly used in economics among other areas, is the *time series* *design*, which resides at the opposite end of the continuum from the panel design. More specifically, a time series design is often based on just a single unit that is repeatedly assessed a very large number of times (say 100 to 200 or more) at intervals that tend to be close together in time (say daily or even hourly). Time series data are often used to empirically examine short-term dynamic processes that might unfold hour-by-hour or day-by-day (e.g., the daily closing cost of the S&P500) and many specialized analytic methods exist to fit models to these highly dense data.

ILD tends to fall between the two extremes of panel data on one end and time series on the other. More specifically, ILD tends to have fewer subjects than panel data but more than time series (say 50 or 100 subjects) and more time points than panel data but possibly fewer than time series (say 30 or 40 assessments). Data might be captured using wearable technology (e.g., heart rate or blood pressure monitors) or by sending random prompts throughout the day via smart phones or other electronic devices (e.g., a tone sounds on a smart phone three times throughout the day and an individual is prompted to respond to a brief feelings survey). As a hypothetical example, a study might be designed to randomly measure nicotine cravings and cigarette use in a sample of 50 individuals four times per day for a two-week period, resulting in 56 assessments on each individual, thus falling between traditional panel and time series designs in structure.

In the spirit of *be careful what you ask for*, once you obtain intensive longitudinal data you must then select an optimal modeling strategy to test your motivating hypotheses, and this is not always an easy task. To begin, some longitudinal models that we are familiar with from panel data simply will not work with ILD. Consider the latent curve model (LCM): because the LCM is embedded within the structural equation model, each observed time point is represented by a manifest variable in the model. This works well if the model is fit to annual assessments of some outcome (say antisocial behavior at age 6, 7, 8, 9 and 10) where each age-specific measure serves as an indicator on the underlying latent curve factor. However, the LCM rapidly breaks down with higher numbers of repeated measures in which only one observation may have been obtained at any given assessment (e.g., 9:15am, 9:52am, and so on). For our prior example with 56 repeated assessments taken on 50 subjects, the LCM is simply not an option.

We can next consider the multilevel model (MLM) and it turns out that this option works quite well for many ILD research applications. (See our Office Hours channel on YouTube for a lecture on the MLM with repeated measures data). The MLM approaches the complex ILD structure as nested data in which repeated assessments are nested within individual. Interestingly, unlike the standard LCM, the MLM can be applied to both more traditional panel data and to ILD. The reason is that, whereas the LCM incorporates the passage of time into the factor loading matrix and requires an observed variable at each assessment, the MLM incorporates the passage of time as a numerical predictor in the regression model. As such, the MLM can easily allow for highly dense (meaning many time points) and highly sparse (meaning few or even one assessment is shared by any individual at any given time point) data without problem. (The LCM can under certain circumstances be contorted to accommodate some of these features as well, but the MLM does this seamlessly). However, there are several complications that must be addressed when fitting an MLM to intensive longitudinal data that do not commonly arise in panel data.

The first issue is what is called *serial correlation* of the residuals for the repeatedly measured outcome. With apologies for the technical terminology, this means that for a given person, when there is a “bump” at one timepoint, that bump tends to carry over to the next time point too. For instance, say a person’s average heart rate is 72 BPM. I measured this person at 9:10am and 9:26am. What I don’t know is that this person was late for their 9:00am job, which led them to move faster and increased their stress, and they had only just arrived at 9:10am. This manifested in a heart rate of 91 BPM at 9:10 and 83 BPM at 9:26. The initial bump had thus not entirely dissipated by the second assessment.

Serial correlation is often not of importance in panel data because these perturbations have long since washed out (the residual correlation goes to zero over the long lags). A person’s heart rate might be higher than usual when I assess them at age 26 because they had a second shot of espresso or got in an argument with a colleague at work, but the effect of the espresso or argument has long since worn off by the time I reassess them at age 27. Of course, even with panel data the repeated measures are correlated, but not because of serial correlation of within-person *residuals* but because of individual differences in level and change over time. For instance, some people have consistently higher heart rates and others have consistently lower heart rates, and this stability will lead to across-person positive correlations in repeated measures. We typically model these individual differences in level and change via latent growth factors / random effects when fitting LCMs / MLMs. Such individual differences may be an important source of correlation in ILD too, but we also have to contend with the serially correlated residuals. Although an added complexity, the MLM is quite well suited to incorporating serial correlations such as these. Complex covariance structures can be defined among the time-specific residuals, such as auto-regressive, Gaussian decay, spatial power, or Toeplitz forms. It is very important that these serial correlations be represented in the model if needed, both to gain insights into the phenomenon under study and to ensure that other parameter estimates of interest are not biased.
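
A quick simulated illustration of what these error structures are absorbing: residuals generated with an autoregressive carry-over retain a sizable lag-1 correlation. The AR(1) form and the parameter values here are chosen purely for illustration:

```python
import random

def simulate_ar1(n, phi, sd=1.0, seed=1):
    """Simulate n residuals where each one carries over a fraction phi
    of the previous 'bump' plus fresh noise (an AR(1) process)."""
    rng = random.Random(seed)
    e = [rng.gauss(0, sd)]
    for _ in range(n - 1):
        e.append(phi * e[-1] + rng.gauss(0, sd))
    return e

def lag1_autocorr(x):
    """Sample correlation between the series and itself shifted by one."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in x)
    return num / den

resid = simulate_ar1(2000, phi=0.5)
print(round(lag1_autocorr(resid), 2))  # close to the generating phi of 0.5
```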

A second issue that often arises in ILD is the presence of cycles or transition points that might occur during the assessment period. For example, daily measures taken over several weeks may vary as a function of weekday vs. weekend (e.g., if studying college drinking) or might cycle regularly throughout a day (e.g., hourly heart rate data varying as a function of waking to sleeping and back to waking). Although such cycles and transition points might be present in panel data as well, these are less likely to occur because there are typically fewer time-linked assessments and these tend to aggregate over longer durations (e.g., if we ask “over the past 30 days” to obtain monthly alcohol use levels, these ratings will implicitly smooth over weekday-weekend differences in daily alcohol use). In contrast, multiple cycles might be observed in ILD spanning a 50 or 60 time point series.

Finally, a third issue is the distinction between within- versus between-person effects. Often ILD is collected with the idea of assessing processes as they unfold in real time for individual participants (“life as lived”). For instance, we might be interested in using ILD to test a negative reinforcement hypothesis for alcohol use. That is, we wish to test the proposition that people drink more than they typically do when they are experiencing increased negative affect under the expectation that this will reduce their negative affect. Using a daily diary study, we measure negative affect each day and alcohol use each night and we build a model to predict alcohol use from negative affect. To fully assess the negative reinforcement hypothesis, we must differentiate the within-person effect (e.g., when my negative affect is higher than usual I drink more than is typical for me) from any between-person correlation that may also exist (e.g., that people who have higher negative affect in general tend to drink more in general). Fortunately, with the MLM we have well developed methods for separating within- and between-person effects, although there are some complications to consider (see our prior help desk post specifically on this issue).
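
The standard device for this separation is person-mean centering: the person's mean carries the between-person information and the deviations from that mean carry the within-person information. A minimal sketch, with hypothetical variable names and values:

```python
def within_between(data):
    """Split a time-varying predictor into a between-person part (the
    person's mean) and a within-person part (deviations from that mean).

    data: dict mapping person id -> list of repeated measures.
    Returns (person_means, deviations) with the same keys.
    """
    means = {pid: sum(xs) / len(xs) for pid, xs in data.items()}
    devs = {pid: [x - means[pid] for x in xs] for pid, xs in data.items()}
    return means, devs

# Hypothetical daily negative-affect ratings for two people.
neg_affect = {"p1": [2, 4, 3, 5], "p2": [7, 8, 6, 7]}
between, within = within_between(neg_affect)
print(between)  # person means: the between-person predictor
print(within)   # deviations sum to zero within each person
```

In the MLM, the deviations enter at level 1 and the person means at level 2, so each part can carry its own regression coefficient.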

The MLM is thus well suited to address all of these complexities that commonly arise in intensive longitudinal data. Once incorporated, the MLM offers many of the very same advantages as when applied to panel data: time-varying predictors can be incorporated at level-1 with either fixed or random effects, time-invariant predictors can be incorporated at level-2, and interactions can be estimated within or across levels of analysis. However, there are two key limitations of the MLM that may or may not arise in a given application. The first is that, similar to the traditional general linear model, the MLM assumes all measures are error-free and all observed variance is “true” variance. This is often (if not always) an unrealistic assumption and violation of this assumption can lead to significant biases in the estimated results. The second is that the MLM only allows for one dependent variable at a time and is thus limited to the estimation of unidirectional effects. Say that you are interested in testing the reciprocal relations between depression during the day and substance use that evening, and you obtain multiple daily measures spanning a week of time. The MLM allows for the estimation of the prediction of substance use from depression, but not the simultaneous estimation of the reciprocal prediction of depression from substance use. As such, the MLM is only evaluating one part of the research hypotheses at hand.

However, recent developments have introduced a new analytic procedure that combines elements of the MLM, the SEM, and time series models called the dynamic structural equation model (or DSEM). The DSEM functionally picks up where the MLM leaves off, but expands the model to potentially include latent factors (to estimate and remove measurement error) and multiple dependent variables (to estimate reciprocal effects between two or more variables over time). DSEM is a recent development and much has yet to be learned about best practices in applied research settings, but it represents a significant development in our ability to fit complex models to ILD.

Want to learn more? We recently had the honor of being invited to provide a series of three lectures on intensive longitudinal data analysis for the American Psychological Association and we have posted our lecture materials in the resources section of the CenterStat home page (https://centerstat.org/apa-ild/). The first session discusses the challenges and opportunities of ILD; the second focuses on the analysis of ILD using the multilevel model; and the third focuses on the analysis of ILD using the dynamic structural equation model. In addition to those resources, below are several suggested readings on the design, collection, and analysis of intensive longitudinal data. Asynchronous access to CenterStat workshops on *Multilevel Modeling* and *Analyzing Intensive Longitudinal Data* is also available to those who might wish to register for additional training. You can also check our workshop schedule for upcoming live offerings.

Good luck with your work!

Asparouhov, T., Hamaker, E. L., & Muthén, B. (2018). Dynamic structural equation models. *Structural Equation Modeling: A Multidisciplinary Journal, 25*, 359-388.

Asparouhov, T., & Muthén, B. (2020). Comparison of models for the analysis of intensive longitudinal data. *Structural Equation Modeling: A Multidisciplinary Journal, 27*, 275-297.

Bolger, N., & Laurenceau, J. P. (2013). *Intensive longitudinal methods: An introduction to diary and experience sampling research*. Guilford Press.

Hamaker, E. L., Asparouhov, T., Brose, A., Schmiedek, F., & Muthén, B. (2018). At the frontiers of modeling intensive longitudinal data: Dynamic structural equation models for the affective measurements from the COGITO study. *Multivariate Behavioral Research, 53*, 820-841.

Hoffman, L. (2015). *Longitudinal analysis: Modeling within-person fluctuation and change*. Routledge.

McNeish, D., & Hamaker, E. L. (2020). A primer on two-level dynamic structural equation models for intensive longitudinal data in Mplus. *Psychological Methods, 25*, 610-635.

McNeish, D., Mackinnon, D. P., Marsch, L. A., & Poldrack, R. A. (2021). Measurement in intensive longitudinal data. *Structural Equation Modeling: A Multidisciplinary Journal, 28*, 807-822.

Walls, T. A., & Schafer, J. L. (Eds.). (2006). *Models for intensive longitudinal data*. Oxford University Press.

The post What exactly qualifies as intensive longitudinal data and why am I not able to use more traditional growth models to study stability and change over time? appeared first on CenterStat.

The post What’s the best way to determine the number of latent classes in a finite mixture analysis? appeared first on CenterStat.

One of the single most difficult tasks in finite mixture modeling is to determine the number of classes within the population, a process sometimes referred to as *class enumeration*. Typically, one will fit a finite mixture model using maximum likelihood estimation, in which the number of classes must be declared as part of the model specification. Thus, the analyst will fit a model with 1 class, then 2 classes, then 3, etc., and then compare the fit of these models to try to determine the optimal number of classes. Various approaches to determining the optimal number of classes can be considered, but they generally fall into three primary categories: likelihood ratio tests, information criteria, and entropy statistics. Let’s consider each in turn. (And, yes, there are Bayesian approaches to this problem too, but they aren’t widely used in practice so we won’t be addressing those.)

One approach for evaluating the number of classes is to use a likelihood ratio test (LRT). LRTs represent a general procedure for testing between nested models, i.e., where one model consists of parameters that are a restricted subset of the parameters of the other model. The LRT is computed as –2 times the difference in the log-likelihoods of the two models and, under certain regularity conditions (essentially *assumptions*), it is distributed as a central chi-square with degrees of freedom equal to the difference in the number of estimated parameters. From the chi-square, we obtain a *p*-value under the null hypothesis that the simpler model is the right one. Effectively we are saying: look, we know that if we throw more parameters at the model it will fit the sample data better (i.e., the log-likelihood improves), but is this improvement greater than we would expect by chance alone given the number of parameters added (the degrees of freedom of the LRT)? If the *p*-value is significant, then we conclude that it is a greater improvement than we would expect by chance, rejecting the simpler model in favor of the more complex model. If it’s not significant, then we conclude there is not a meaningful difference between the two models and we retain the simpler model. In other words, we conclude that the extra parameters may just be overfitting, picking up random variation or noise in the sample that doesn’t reflect the true underlying structure in the population.
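
The mechanics are easy to sketch. The helper below computes the LRT statistic and a central chi-square *p*-value; note that this reference distribution applies only under the standard regularity conditions and, as the next paragraph explains, is *not* valid for comparing *K* vs. *K*+1 class models. The log-likelihood values are made up for illustration:

```python
import math

def chi2_sf(x, df):
    """Survival function (p-value) of a central chi-square.
    Handles df = 1 and even df, covering common LRT comparisons."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df % 2 == 0:
        # Exact closed form for even degrees of freedom.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    raise NotImplementedError("odd df > 1 not handled in this sketch")

def lrt(loglik_simple, loglik_complex, df):
    """Likelihood ratio test of nested models under standard
    regularity conditions (NOT valid for K vs K+1 class tests)."""
    stat = -2.0 * (loglik_simple - loglik_complex)
    return stat, chi2_sf(stat, df)

# Hypothetical log-likelihoods: the complex model adds 2 parameters.
stat, p = lrt(-1050.3, -1046.1, df=2)
print(round(stat, 2), round(p, 4))
```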

That is how we typically use LRTs in a traditional modeling framework, but let’s think about how we would apply this general testing approach to determine the number of classes in a finite mixture. First, we can establish that a *K*-class model is nested within a *K*+1-class model. For instance, one could set the mixing probability (prevalence rate) of one class in the *K*+1-class model to zero. Presto, this deletes one of the classes to produce a *K*-class model. So far so good. Now we fit models with 1 v. 2 classes, calculate the LRT, and if the *p*-value is significant we say 2 is better than 1. Then we test 2 v. 3 classes, 3 v. 4 classes, etc., and stop when we get to the point that adding another class no longer results in a significant improvement in model fit. But where things get complicated is in the fine print to the likelihood ratio test. The regularity conditions required for the test distribution to be a central chi-square aren’t met when testing a *K* versus a *K*+1-class model. So while it still makes sense to conduct likelihood ratio tests, we no longer have the familiar chi-square with which to obtain *p*-values. We need to somehow modify how we conduct LRTs for use in this context.

One option is to bootstrap the test distribution. McLachlan (1987) proposed a parametric bootstrapping procedure that involves (1) simulating data sets from the *K*-class model estimates that were obtained from the real data; (2) fitting *K* and *K*+1-class models to the simulated data sets; (3) computing the likelihood ratio test statistic for each simulated data set; and (4) using the distribution of bootstrapped LRT values to obtain the *p*-value for the likelihood ratio test statistic obtained with the real data. It’s a clever approach, but somewhat computationally intense, especially if one wants a precise *p*-value. The other option is to derive the correct theoretical test distribution for the LRT. Lo, Mendell & Rubin (2001) performed these derivations, determining it (appropriately enough) to be a mixture of chi-squares. They also provided an ad-hoc adjusted version of the test with a bit better performance at realistic sample sizes. Simulation studies, however, have shown the Lo-Mendell-Rubin LRT (original and adjusted versions) to have elevated Type I error rates for some models, whereas the bootstrapped LRT consistently works well. We thus tend to prefer the bootstrapped LRT, despite its greater computational demands (which is an increasingly less relevant concern given ever-improving computational speeds of even the lowliest desktop computers).
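
Step (4) of the parametric bootstrap reduces to a simple proportion once the bootstrapped LRT values are in hand. A sketch (the simulated values below are arbitrary stand-ins, not draws from a real mixture model):

```python
import random

def bootstrap_pvalue(observed_lrt, bootstrapped_lrts):
    """p-value as the share of simulated LRT values at least as large
    as the observed one. The +1 terms keep the p-value away from an
    impossible exact zero."""
    b = len(bootstrapped_lrts)
    exceed = sum(1 for v in bootstrapped_lrts if v >= observed_lrt)
    return (exceed + 1) / (b + 1)

# Hypothetical: an observed LRT of 10.2 against 99 simulated values
# (arbitrary placeholder draws standing in for steps 1-3).
rng = random.Random(7)
sim = [rng.expovariate(0.5) for _ in range(99)]
print(bootstrap_pvalue(10.2, sim))
```

The precision of the *p*-value is limited by the number of bootstrap replications, which is why a precise *p*-value is computationally expensive.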

A second approach to evaluating the number of classes is to use information criteria (IC). Two well-known information criteria are Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC), but there are many others. What ICs generally try to do is balance the *fit* of the model against the *complexity* of the model. Fit is measured by –2 times the log-likelihood and a penalty is then applied for complexity, usually some function of the number of parameters and/or sample size. Often, but not always, ICs are scaled so that smaller values are better. So one would fit models with 1, 2, 3, etc. classes and then select the model with the lowest IC value as providing the best balance of fit against complexity. Different ICs were motivated in different ways and implement different penalties. Some penalties are stiffer than others, so for instance the BIC penalty usually exceeds the AIC penalty. When choosing the number of classes, simulation studies have shown AIC to be too liberal (tends to support taking too many classes), whereas BIC generally does well as long as the classes are reasonably well separated. For less distinct classes (that is, classes that may reside closer together and are thus harder to discern), a sample-size adjusted version of the BIC, which ratchets down the penalty a bit, sometimes performs better. While there are many different ICs to choose from, we generally find the BIC to be a reasonable choice.
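
The AIC and BIC formulas are simple enough to compute directly from each model's log-likelihood. The fit values below are invented to illustrate a typical pattern in which AIC's lighter penalty favors more classes than BIC:

```python
import math

def aic(loglik, n_params):
    """AIC: -2 log-likelihood plus 2 per parameter."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n):
    """BIC: penalty of log(n) per parameter, which exceeds
    AIC's penalty whenever n > 7 or so."""
    return -2.0 * loglik + math.log(n) * n_params

# Hypothetical (log-likelihood, parameter count) for 1-4 class models
# fit to n = 500 cases.
fits = [(-2100.0, 4), (-2040.0, 9), (-2025.0, 14), (-2021.0, 19)]
n = 500
for k, (ll, p) in enumerate(fits, start=1):
    print(k, round(aic(ll, p), 1), round(bic(ll, p, n), 1))

# Smaller is better: pick the class count minimizing the chosen IC.
best_k = min(range(len(fits)),
             key=lambda i: bic(fits[i][0], fits[i][1], n)) + 1
print("BIC selects", best_k, "classes")
```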

A third common approach is to consider the *entropy* of the model. Entropy is a measure of how accurately one could assign cases to classes. Finite mixture models are probabilistic classification models in the sense that there is not a hard partition of the sample into non-overlapping clusters but instead there is a probability that each person belongs to each class; further, these probabilities sum to 1.0 for each individual reflecting there is a 100% chance they belong to one of the classes. However, sometimes one is interested in producing such a hard partition based on the probabilities, for instance by assigning a case to the class to which they most likely belong, a technique called *modal assignment*. If the probabilities of class membership tend toward zero and one, then this implies that there should be few errors of assignment. But as the probabilities move away from zero and one this reflects greater uncertainty about how to assign cases and an increased rate of assignment errors. For instance, if my probabilities for belonging to Classes 1 and 2 are .9 and .1, there’s a 90% chance I would be correctly assigned to Class 1. That’s pretty good. But if my probabilities are .6 and .4, there is only a 60% chance that placing me into Class 1 would be the right decision. Entropy summarizes the uncertainty of class membership across all individuals, providing a sense of how accurately one can classify based on the model.

There are several different types of entropy-based statistics. Some are of the same form as the ICs described above, in which the fit of the model is balanced against a penalty that is now a function of entropy (e.g., the classification likelihood criterion). Others are transformations of entropy to make interpretation easier (e.g., normalized entropy criterion). The *E* entropy statistic developed by Ramaswamy et al. (1992) is particularly popular – it has a nice scale, ranging from 0 to 1, with 1 indicating perfect accuracy, and is standard output in some software (e.g., Mplus). One might thus calculate *E* values (or some other entropy-based statistic) for models with different numbers of latent classes and then select the model with the greatest classification accuracy. But this presupposes that one wants to select a model that consists of well-separated classes. Sometimes, classes aren’t well separated. Consider that there is a well-recognized height difference between adult men and women, yet men are only about 7% taller than women on average, so there is a lot of overlap between the height distributions. It seems reasonable to assume that latent classes will overlap at least as much as natural groups do, so entropy may be a poor guide to the number of classes in many realistic scenarios. Thus, in most cases, it is probably best not to use entropy to guide class enumeration, but instead to consider it a property of the model that is ultimately selected. That is, determine the number of classes using the BIC and/or bootstrapped likelihood ratio test, then examine the entropy as a descriptive statistic of the selected model.
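The relative entropy statistic is easy to compute from the matrix of posterior class probabilities. A minimal sketch (the two posterior matrices are artificial, chosen purely to show the two extremes):

```python
# Relative entropy (Ramaswamy et al., 1992): 1 = essentially perfect
# assignment, values near 0 = maximal uncertainty about class membership.
import numpy as np

def relative_entropy(post):
    """post: (n, K) matrix of posterior class probabilities (rows sum to 1)."""
    n, K = post.shape
    p = np.clip(post, 1e-12, 1.0)          # guard against log(0)
    raw = -np.sum(p * np.log(p))           # total classification entropy
    return 1.0 - raw / (n * np.log(K))     # rescale to the [0, 1] metric

# Sharp posteriors -> E near 1; flat posteriors -> E near 0.
sharp = np.array([[0.99, 0.01]] * 100)
flat = np.array([[0.55, 0.45]] * 100)
print(relative_entropy(sharp))  # close to 1
print(relative_entropy(flat))   # close to 0
```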

So we seem to have arrived at a straightforward set of recommendations. First, fit models with 1, 2, 3, etc. latent classes (until estimation breaks or we reach some practically useful / theoretically plausible upper bound of, say, 10 classes). Second, compare the fit of these models using your preferred information criterion (perhaps BIC, perhaps sample-size-adjusted BIC). Also use the bootstrapped likelihood ratio test to get formal *p*-values. Hope your IC of choice and the bootstrapped LRT arrive at the same answer. Third, write your paper. How hard can all of that possibly be? Well, sometimes (maybe even oftentimes) this process doesn’t work, occasionally in small ways and occasionally in blow-up-in-your-face ways. You might end up selecting a model that is problematic, like having a very small class that is impractical and which you suspect may just reflect outliers or over-fitting to the data. Or you might select a model where, substantively, some of the classes seem similar enough that it isn’t worth distinguishing them. In such cases, you might use your content area knowledge (expert opinion) to decide that maybe the quantitatively “best” model isn’t as useful as the next-best model. Of course, this introduces subjectivity to the model selection process, and people may disagree about these decisions, so you want to justify your choice.
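The bootstrapped LRT itself is conceptually simple: fit the k- and (k+1)-class models to the observed data, then build the null distribution of the likelihood ratio by repeatedly simulating data from the fitted k-class model and refitting both models. A rough sketch with scikit-learn (B is far smaller than one would use in practice, and software such as Mplus automates this):

```python
# A minimal parametric bootstrap LRT (McLachlan, 1987) comparing a
# k-class to a (k+1)-class Gaussian mixture. B is kept tiny for speed;
# in practice use B of at least 100.
import numpy as np
from sklearn.mixture import GaussianMixture

def boot_lrt(X, k, B=19, seed=0):
    fit_k = GaussianMixture(k, n_init=3, random_state=seed).fit(X)
    fit_k1 = GaussianMixture(k + 1, n_init=3, random_state=seed).fit(X)
    obs = 2 * (fit_k1.score(X) - fit_k.score(X)) * len(X)  # observed LR statistic
    exceed = 0
    for b in range(B):
        Xb, _ = fit_k.sample(len(X))       # data generated under H0: k classes
        null_k = GaussianMixture(k, n_init=3, random_state=b).fit(Xb)
        null_k1 = GaussianMixture(k + 1, n_init=3, random_state=b).fit(Xb)
        lr_b = 2 * (null_k1.score(Xb) - null_k.score(Xb)) * len(Xb)
        exceed += lr_b >= obs
    return (exceed + 1) / (B + 1)          # bootstrap p-value

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (150, 1)), rng.normal(5, 1, (150, 1))])
p_val = boot_lrt(X, 1)
print("p for 1 vs 2 classes:", p_val)      # small: a second class is supported
```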

Other times, IC values just keep getting better as classes are added to the model and bootstrapped LRTs just keep giving significant results. This seems to happen a lot when analyzing especially large samples. What this reveals is a problem in our logic so far. To this point, we’ve assumed that the finite mixture model is *literally correct*: that is, there is some number of latent groups mixed together in the population and our job is to go find that number. But what if the model isn’t literally correct? Arguably, all models represent imperfect approximations to the true data generating process. We hope these models recover important features of the underlying structure, but we don’t necessarily regard them as correct. From this perspective, there isn’t some number of true classes to find. But, if that is the case, then what are we doing when we conduct class enumeration? We would argue that we are evaluating different possible approximations to the data, trying to discern how many classes it takes to recover the primary structure without taking so many that we are starting to capture noise or nuisance variation.

At small sample sizes, we can only afford a gross approximation with few classes, but with higher sample sizes, we can start to recover finer structure with more classes. That finer structure may not always be of substantive interest, but it’s there, and traditional class enumeration procedures (BIC, etc.) will reward models that recover it. For example, with a modest amount of data we might be able to identify differences in attitudes, behavior, fashion, and speech between individuals living in broad regions of the United States, like the Northeast and Southwest. With more data, we might be able to see more nuanced differences, separating into smaller regions like mid-Atlantic states, upper Midwest, etc. In reality, the states (aside from Alaska and Hawaii) are contiguous, and attitudes, behavior, fashion, and speech patterns vary continuously over complex cultural and geographic gradients. Nevertheless, regional classifications capture important differences in local conditions. There’s no right number of classifications, just differences in fineness. With enough data, we can make our classes extremely local, but this might not always be useful to do.

Ultimately, then, there is an inconsistency between the perspective motivating the development and evaluation of traditional class enumeration procedures (that there is a true number of classes to find) and the context within which these are applied in practice (where the model is an approximation). This can lead to problems like seeing support for more and more classes at larger and larger sample sizes. In such cases, the number selected may again be determined more by subjective considerations such as the size, distinctiveness, and practical utility of the classes.

In sum, standard practice in determining the number of classes for a finite mixture model is to fit models with 1, 2, 3, etc. classes using maximum likelihood estimation, then compare fit using specialized likelihood ratio tests (bootstrapped LRT or Lo-Mendell-Rubin LRT), information criteria (BIC, AIC, etc.), or entropy, and to try to objectively triangulate on an optimal number. Simulation studies suggest bootstrapped LRTs and BIC generally work well. However, these presuppose that there is some true number of classes to find. In most instances, a more realistic perspective is that the model is instead providing an approximation to the underlying structure and there may not be a true number of classes to find. Even the archetypal concept of species undergirding our example with the finches is a bit more muddled than we learned in high school biology. On this view, the goal of our analysis is to select a number of classes that recovers the important features of the data without capturing noise or nuisance variation. Traditional class enumeration procedures can still serve as a useful guide, balancing fit and parsimony in quantifiable ways, but content area knowledge also plays an important role in determining how fine to make the approximation before it becomes impractical and unwieldy.

**References**

Henson, J.M., Reise, S.P., & Kim, K.H. (2007). Detecting mixtures from structural model differences using latent variable mixture modeling: A comparison of relative model fit statistics. *Structural Equation Modeling, 14*, 202-226.

Kim, S.-Y. (2014). Determining the number of latent classes in single- and multiphase growth mixture models. *Structural Equation Modeling, 21*, 263-279.

Liu, M., & Hancock, G.R. (2014). Unrestricted mixture models for class identification in growth mixture modeling. *Educational and Psychological Measurement*, Online First.

Lo, Y., Mendell, N.R., & Rubin, D.B. (2001). Testing the number of components in a normal mixture. *Biometrika, 88*, 767–778.

McLachlan, G.J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. *Journal of the Royal Statistical Society, Series C, 36*, 318-324.

McLachlan, G., & Peel, D. (2000). *Finite mixture models*. New York: Wiley.

Nylund, K.L., Asparouhov, T. & Muthen, B.O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. *Structural Equation Modeling, 14*, 535-569.

Ramaswamy, V., DeSarbo, W.S., Reibstein, D.J., & Robinson, W.T. (1992). An empirical pooling approach for estimating marketing mix elasticities with PIMS data. *Marketing Science, 12*, 241-254.

Tofighi, D., & Enders, C.K. (2008). Identifying the correct number of classes in growth mixture models. In G.R. Hancock & K.M. Samuelsen (Eds.), *Advances in Latent Variable Mixture Models* (pp. 317-341). Greenwich, CT: Information Age.

The post What’s the best way to determine the number of latent classes in a finite mixture analysis? appeared first on CenterStat.

The post My advisor told me to use principal components analysis to examine the structure of my items and compute scale scores, but I was taught not to use it because it is not a “true” factor analysis. Help! appeared first on CenterStat.

Help, indeed. This issue has been a source of both confusion and contention for more than 75 years, and papers have been published on this topic as recently as just a few years ago. A thorough discussion of principal components analysis (PCA) and the closely related methods of exploratory factor analysis (EFA) would require pages of text and dozens of equations; here we will attempt to present a more succinct and admittedly colloquial description of the key issues at hand. We can begin by considering the nature of *composites*.

Say that you were interested in obtaining scores on negative affect (e.g., sadness, depression, anxiety) and you collected data from a sample of individuals who responded to 12 items assessing various types of mood and behavior (e.g., sometimes I feel lonely, I often have trouble sleeping, I feel nervous for no apparent reason, etc.). The simplest way to obtain a composite scale score would be to compute a mean of the 12 items for each person to represent their overall level of negative affect. This is often called an *unweighted* linear composite because all items contribute equally and additively to the scale score: that is, you simply add them all up and divide by 12. This approach is widely used in nearly all areas of social science research.

However, now imagine that you could compute *more* than one composite from the set of 12 items. For example, you might not believe a single overall composite of negative affect exists, but that there is one composite that primarily reflects *depression* and another that primarily reflects *anxiety*. This is initially very strange to think about because you want to obtain *different* composites from the *same* 12 items. The key is to *differentially* weight the items for each composite you compute. You might use larger weights for the first six items and smaller weights for the second six items to obtain the first composite, and then use smaller weights for the first six items and larger weights for the second six items to obtain the second composite. Now instead of having a single overall composite of the 12 items assessing negative affect, you have one composite that you might choose to label *depression* and a second composite that you might choose to label *anxiety*, and both were based on differential weighting of the same 12 items. This is the core of PCA.
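As a toy illustration of differential weighting (the data are random and the weights are made up purely to show the mechanics, not estimated from anything):

```python
# Two differentially weighted composites from the same 12 items:
# heavier weights on items 1-6 for "depression", items 7-12 for "anxiety".
# The weights here are illustrative, not estimated.
import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(100, 12))           # 100 respondents x 12 items

w_dep = np.array([.8] * 6 + [.2] * 6)        # larger weights on first six items
w_anx = np.array([.2] * 6 + [.8] * 6)        # larger weights on last six items

depression = items @ w_dep                   # one composite per person
anxiety = items @ w_anx                      # a second composite, same 12 items
unweighted = items.mean(axis=1)              # the simple scale-score mean
print(depression.shape, anxiety.shape)
```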

PCA dates back to the 1930s and was first proposed by Harold Hotelling as a *data reduction method*. His primary motivation was to take a larger amount of information and reduce it to a smaller amount of information by computing a set of weighted linear composites. The goal was for the composites to reflect *most*, though not *all*, of the original information. He accomplished this through the use of the eigenvalues and eigenvectors associated with the correlation matrix of the full set of items. Eigenvalues represent the variance associated with each composite, and eigenvectors represent the weights used to compute each composite. In our example, the first two eigenvalues would represent the *variances* of the depression and anxiety composites, and the eigenvectors or *weights* would tell us how much each item contributes to each composite. It is possible to compute as many composites as items (so we could compute 12 composites based on our 12 items) but this would accomplish nothing in terms of data reduction because we would simply be exchanging 12 items for 12 composites. Instead, we want to compute a much smaller number of composites than items that represent *most* but not *all* of the observed variance (so we might exchange 12 items for two or three composites). The cost of this reduction is some loss of information, but the gain is being able to work with a smaller number of composites relative to the original set of items.
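The whole procedure can be carried out with an eigendecomposition of the item correlation matrix. A sketch on simulated items with two correlated blocks of six (the loadings and noise level are arbitrary illustrative values):

```python
# PCA by hand: eigenvalues of the item correlation matrix give component
# variances; eigenvectors give the weights used to form each composite.
import numpy as np

rng = np.random.default_rng(0)
# Simulate 12 items organized into two correlated blocks of six.
f = rng.normal(size=(500, 2))
load = np.zeros((12, 2)); load[:6, 0] = .7; load[6:, 1] = .7
items = f @ load.T + rng.normal(scale=.5, size=(500, 12))

R = np.corrcoef(items, rowvar=False)         # 12 x 12 correlation matrix
evals, evecs = np.linalg.eigh(R)             # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder: largest first

print("variance explained by first two components:",
      evals[:2].sum() / evals.sum())
scores = (items - items.mean(0)) / items.std(0) @ evecs[:, :2]  # component scores
```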

There are many heuristics used to determine the “optimal” number of composites to extract from a set of items. Methods include the Kaiser-Guttman rule, looking for the “bend” in a scree plot of eigenvalues, parallel analysis, and evaluating the incremental variance associated with each extracted component. There are also many methods of “rotation” that allow us to rescale the item weights in particular ways to make the underlying components more interpretable (helping us “name” the factors). For example, if the first six items assessed things like sadness and loneliness and had large weights on the first component but smaller weights on the second, we might choose to name the first component “depression”, and so on. Often, the end goal is to obtain conceptually meaningful weighted composite scores for later analyses.
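Of these heuristics, parallel analysis is straightforward to sketch: retain components whose observed eigenvalues exceed those obtained from pure-noise data of the same dimensions. A minimal version (the 95th-percentile threshold and the simulated two-block items are our illustrative choices):

```python
# Horn's parallel analysis, sketched: compare observed eigenvalues to
# eigenvalues from random data of the same size. Not production code.
import numpy as np

def parallel_analysis(items, n_reps=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = items.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(items, rowvar=False)))[::-1]
    sims = np.empty((n_reps, p))
    for r in range(n_reps):
        Z = rng.normal(size=(n, p))          # pure-noise data, same shape
        sims[r] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    thresh = np.percentile(sims, 95, axis=0) # 95th percentile of noise eigenvalues
    return int(np.sum(obs > thresh))         # number of components to retain

rng = np.random.default_rng(1)
f = rng.normal(size=(300, 2))
load = np.zeros((12, 2)); load[:6, 0] = .7; load[6:, 1] = .7
items = f @ load.T + rng.normal(scale=.5, size=(300, 12))
print("components retained:", parallel_analysis(items))
```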

Although Hotelling developed PCA strictly as a method of data reduction and composite scoring (indeed, he never even discussed rotation because he was not interested in interpreting individual items), over time this method came to be associated with a broader class of models called exploratory factor analysis, or EFA. The goals of EFA are often very similar to those of PCA and might include scale development, understanding the psychometric structure underlying a set of items, obtaining scale scores for later analysis, or all three. There are many steps in EFA that overlap with those of PCA, including identifying the optimal number of factors to extract; how to rescale (or “rotate”) the factor loadings to enhance interpretation; how to “name” the factors based on what items are weighted more vs. less; and how to compute optimal scores. Given these similarities, there has long been contention about whether PCA is a formal member of the EFA family, or if PCA is not a “true” factor model but instead something distinctly different.

Contention on this point centers on a key defining feature of PCA: it assumes that all items are measured *without error* and all observed variance is available for potential factoring. When fewer composites are taken than the number of items, some residual variance in the items will be left over, but this is still considered “true” variance and not measurement error. In contrast, EFA explicitly assumes that the item responses may be, and indeed very likely are, characterized by measurement error. As such, whereas PCA expresses the components as a direct function of the items (that is, the items *induce* the components), EFA conceptually reverses this relation and instead expresses the items as a function of the underlying latent factors. The factors are “latent” in the sense that we believe them to exist but they are not directly observed, and our motivating goal is to infer their existence based on what we did observe: namely, the items.

Of critical importance is that, unlike PCA, EFA assumes that only *part* of the observed item variance is true score variance and the remaining part is explicitly defined as measurement error. Although this assumption allows the model to more accurately reflect what we believe to exist in the population (we nearly always recognize there is the potential for measurement error in our obtained items), this also creates a significant challenge in model estimation because the measurement errors are additional parameters that must be estimated from the data. Whereas PCA can be computed directly from our observed sample data, EFA requires us to move to more advanced methods that allow us to obtain optimal estimates of population parameters via iterative estimation. There are many methods of estimation that can be used in the EFA (e.g., unweighted least squares, generalized least squares, maximum likelihood), each of which has certain advantages and disadvantages. In general, maximum likelihood is often viewed as the “gold standard” method of estimation in most research applications.
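As a concrete sketch of an ML-flavored EFA, scikit-learn’s FactorAnalysis estimates factor loadings plus a per-item error (uniqueness) variance via iterative estimation; dedicated software (e.g., SAS PROC FACTOR with method=ML, or Mplus) would typically be used in practice. The simulated items are illustrative:

```python
# ML-style exploratory factor analysis: unlike PCA, the model estimates
# a uniqueness (measurement-error variance) for every item alongside
# the factor loadings.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
f = rng.normal(size=(500, 2))
load = np.zeros((12, 2)); load[:6, 0] = .7; load[6:, 1] = .7
items = f @ load.T + rng.normal(scale=.5, size=(500, 12))

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
loadings = fa.components_.T        # 12 x 2 estimated (rotated) loadings
uniqueness = fa.noise_variance_    # per-item measurement-error variance
scores = fa.transform(items)       # regression-method factor score estimates
print(loadings.round(2))
```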

We can think about four key issues that ultimately distinguish PCA from EFA:

- The theoretical model is *formative* in PCA and *reflective* in EFA. In other words, the composites are viewed as a function of the items in PCA, but the items are viewed as a function of the latent factors in EFA.
- PCA assumes all observed variance among a set of items is available for factoring, whereas EFA assumes only a subset of the observed variance among a set of items is available for factoring. This implies that PCA assumes no measurement error while EFA explicitly incorporates measurement error into the model.
- Although both PCA and EFA allow for the creation of weighted composites of items, in PCA these are direct linear combinations of items whereas in EFA these are model-implied estimates (or predicted values) of the underlying latent factors. As such, in PCA there is only a single method for computing composites, but in EFA there are many (e.g., regression, Bartlett, constrained covariance, etc.), all of which can differ slightly from one to the other.
- Finally, the confusion between PCA and EFA is exacerbated by the fact that in nearly all major software packages PCA is available as part of the “factor analysis” estimation procedures (e.g., in SAS PROC FACTOR a PCA is defined using “method=principal” but an EFA is defined using “method=ML”).

It is difficult to draw firm guidelines for when and if to use PCA in practice. It depends on the underlying theory, the characteristics of the sample, and the goals of the analysis. In most social science applications, particularly those focused on the measurement of psychological constructs, it is often best to use EFA because this better represents what we believe to hold in the population. However, if EFA is not possible due to estimation problems, or if there is an exceedingly large number of items under study, then PCA is a viable alternative. Interestingly, PCA has recently made a comeback within psychology given increased interest in machine learning. It is not uncommon for PCA to be applied to 50 or 100 variables in order to distill them down to a smaller number of composites to be used in subsequent analysis.

Our general recommendation is to initially consider EFA estimated using ML as your first best option, both for model fitting and score estimation. This is because, far more often than not, the EFA model better represents the mechanism we believe to have given rise to the observed data; namely, a process that combines both true underlying construct variation and random measurement error. However, if the EFA is not viable for some reason, then PCA is a perfectly defensible option as long as the omission of measurement error is clearly recognized. Finally, all of the above relates to the exploratory factor analysis model in which all items load on all underlying factors. In contrast, the confirmatory factor analysis (CFA) model allows for *a priori* tests of measurement structure based on theory. If there is a stronger underlying theoretical model under consideration, then CFA is often a better option. We discuss the CFA model in detail in our free three-day workshop, *Introduction to Structural Equation Modeling*.

Below are a few readings that might be of use.

Brown, T. A. (2015). *Confirmatory factor analysis for applied research*. New York: Guilford Publications.

Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. *Psychological Assessment, 7*, 286-299.

Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. *Psychological Methods, 4*, 272-299.

Widaman, K. F. (1993). Common factor analysis versus principal component analysis: Differential bias in representing model parameters? *Multivariate Behavioral Research, 28*, 263-311.

Widaman, K. F. (2018). On common factor and principal component representations of data: Implications for theory and for confirmatory replications. *Structural Equation Modeling: A Multidisciplinary Journal, 25*, 829-847.


The post I fit a multilevel model and got the warning message “G Matrix is Non-Positive Definite.” What does this mean and what should I do about it? appeared first on CenterStat.

First, let’s translate the technical jargon. Following Laird & Ware (1982), many software programs used to fit multilevel models use the label **G** to reference the covariance matrix of the random effects. For instance, for a linear growth model, we might include both a random intercept and a random slope for time to capture (unexplained) individual differences in starting level and rate of change. In fitting the model, we don’t estimate the individual values of the random effects directly. Instead, we estimate the variances and covariance of the random effects, i.e., a variance for the intercepts, a variance for the slopes, and a covariance between intercepts and slopes. These variances and covariances are contained in the matrix **G**. Similarly, with hierarchically nested data (e.g., children nested within classrooms or patients nested within physicians), we use random effects to capture (unexplained) cluster-level differences. Random intercepts capture between-cluster differences in outcome levels whereas random slopes capture between-cluster differences in the effects of predictors. Again, as part of fitting the model, we need to estimate the variances and covariances of these random effects and, again, these variance and covariance parameters are contained within the **G** matrix. Note that some software programs may use a different label for the covariance matrix of the random effects, but for this post we will use the common notation of **G** throughout.

When the **G** matrix is non-positive definite (NPD) this means that there are fewer dimensions of variation in the matrix than the expected number (i.e., the number of rows or columns of the matrix, corresponding to the number of random effects in the model). For instance, in our linear growth model example, there are two potentially correlated dimensions of variation specified in the **G** matrix, one corresponding to the random intercepts and one corresponding to the random slopes for time. This is no different than what we would expect for any two variables. If we measured height and weight, for instance, there would be variation in height, variation in weight, and some covariation between height and weight, and this would be captured in the 2 x 2 covariance matrix for the two variables. Here we are simply considering random effects rather than measured variables, but the principle remains the same. Now, imagine what would happen if there was no variation for one variable or random effect. For instance, what if there were no individual differences in rate of change, making the variance of the slopes equal to zero? Then there would be only one remaining dimension of variation in the matrix (reflecting the random intercepts) and **G** would be NPD (having fewer actual dimensions of variation than its specified number of rows/columns). Thus, one way an NPD **G** matrix can arise is if one (or more) of the random effects in the fitted model has a variance of zero.
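As a sketch of where **G** lives in software output, here is a simulated random intercept and slope growth model fit with statsmodels’ MixedLM (the sample size and variance values are arbitrary illustrative choices):

```python
# Inspecting the estimated G matrix for a random intercept + slope
# growth model, using statsmodels MixedLM on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n, t = 200, 5
subj = np.repeat(np.arange(n), t)
time = np.tile(np.arange(t), n)
b0 = rng.normal(0, 1.0, n)[subj]            # random intercepts
b1 = rng.normal(0, 0.5, n)[subj]            # random slopes for time
y = 2 + 0.5 * time + b0 + b1 * time + rng.normal(0, 1, n * t)
df = pd.DataFrame({"y": y, "time": time, "subj": subj})

m = smf.mixedlm("y ~ time", df, groups=df["subj"], re_formula="~time").fit()
G = m.cov_re                                 # 2 x 2 random-effects covariance
print(G)
print("eigenvalues of G:", np.linalg.eigvalsh(G.values))  # all > 0 if PD
```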

However, this is not the only possible way to obtain an NPD **G** matrix. For example, what happens if the intercepts and slopes of our growth model are perfectly correlated (e.g., *r = *1.0 or -1.0)? Then the two random effects are redundant with one another and actually represent just one dimension of variation. Again, this would lead the **G** matrix to be NPD. More technically, any time one random effect can be expressed as a perfect linear function of the other random effects, the **G** matrix will be NPD. Note that, depending on whether your software program implements boundary constraints on the variance and covariance parameters or not, you can even get negative variances for random effects or correlations exceeding ±1 (known as improper estimates).
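A quick way to see (or check) non-positive definiteness is through the eigenvalues of **G**: an NPD matrix has at least one eigenvalue at or below zero. A small numeric illustration with a perfect intercept-slope correlation:

```python
# An NPD G matrix has at least one zero (or negative) eigenvalue.
# Here the intercept-slope correlation is exactly 1.0, so the 2 x 2
# matrix contains only one real dimension of variation.
import numpy as np

var_int, var_slope = 1.0, 0.25
r = 1.0                                     # perfect correlation -> redundancy
cov = r * np.sqrt(var_int * var_slope)
G = np.array([[var_int, cov], [cov, var_slope]])

eigvals = np.linalg.eigvalsh(G)
print("eigenvalues:", eigvals)              # one is (numerically) zero
print("positive definite:", bool((eigvals > 1e-10).all()))
```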

Now let’s consider why an NPD covariance matrix for the random effects is usually a problem. Typically, when one includes random effects in a multilevel model, the assumption is that they “exist” as distinguishable components of variation. For instance, our growth model states that people differ in their starting points and rates of change, differences captured by the random intercepts and slopes included in the model specification. When we include random effects like these in our models, we expect them to have variance and, while they might be correlated with one another, none is thought to be fully redundant with the others. When we receive the “G matrix is non-positive definite” warning, it tells us our expectations were wrong. The estimated model found fewer dimensions of variation than the number of random effects that were specified.

Sometimes the problem is just that estimation went awry. For instance, when predictors with random slopes have very different scales, the variances of the random slopes may be numerically quite different, and this can impede proper model estimation. A second possible reason for **G** to be NPD is that we included random effects in the model that simply aren’t there. Sure, people differ in their starting levels but everyone is actually changing at the same rate, so the random intercepts are good but the random slopes are superfluous. A third possibility is that the data simply aren’t sufficient to support estimating the model (even if the model accurately describes the process under study). This often occurs with smaller sample sizes, more complex models, or some combination of the two.

To illustrate this last possibility, let’s say we fit our growth model to data comprised of two repeated measures per person, and these were collected at the same two points in time for everyone in the sample (a common scenario sometimes referred to as “time structured data”). With only these two time points, there simply is not enough information to be able to obtain unique estimates for all of our model parameters. That is, our model is “under identified”. To intuit why this is the case, imagine a time plot for a set of individuals. If we allow ourselves to draw a different line for each person, each with its own starting level and rate of change, then we will connect the dots perfectly for every case. Yet our model assumes there will be some residual variability around the line as well, i.e., variation around the individual trajectory. Since each line connects the dots, we have no remaining variability with which to capture the residual. Conversely, were we to try to introduce residuals by drawing lines that didn’t perfectly connect the dots, we couldn’t do so without using arbitrary intercepts and slopes. Thus, a typical linear growth model that includes both a residual and random intercepts and slopes cannot be estimated using just two time points of data without producing an NPD covariance matrix for the random effects. That doesn’t mean that there aren’t truly differences in where people start and the rate at which they are changing. It just means that the data are insufficient to tell us about those differences. To be able to identify the model, we would need a third time point (for at least some sizable portion of the sample) to be able to draw a line for each person that doesn’t simply connect the dots and that allows for individual differences in intercepts and slopes as well as residual variability.

A general but imperfect rule of thumb is that, for many of the units in the sample, you want at least one more observation than the number of random effects (e.g., to include two random effects in our growth model, a good number of people in the sample should have three or more repeated measures). If you have fewer observations per unit than indicated by this rule, that may be the cause of your NPD **G** matrix. The warning is telling you that you are trying to do too much with the data at hand. Although we illustrated this rule with longitudinal data, it applies equally to hierarchical data applications. For instance, with dyadic data, there are two partners per dyad, allowing for the inclusion of a random intercept to account for the between-dyad differences; however, no further random effects can be included in the model because their variance/covariance parameters would not be identified given the uniform cluster size of two.

Complicating matters, however, is that even when the number of observations per sampling unit is theoretically sufficient, one may still obtain an NPD covariance matrix. That is, the model is in principle mathematically identified but the data still aren’t able to support the full dimensionality of the random effects. Such a scenario is most likely to arise in small samples and when the number of random effects in the model is either large (i.e., 5 or more) or approaches the maximum number that can possibly be identified by the data. For instance, let’s say we have time structured data with four repeated measures per person. In principle, we can fit a quadratic growth model with a random intercept, random linear effect of time, and random quadratic effect of time. Four observations per person should be enough to be able to obtain unique variance and covariance estimates for three random effects. Yet when we fit the model, we might still obtain the warning, “G matrix is non-positive definite.” In such a case, inspecting the variance-covariance parameter estimates will likely reveal that the quadratic random effect has an estimated variance of zero (or negative variance) or that our random effects have excessively high correlations with one another (in practice, these very high correlations are commonly negative). Empirically, we cannot distinguish all the components of variability that we specified for the individual trajectories.

Now that we understand when and why NPD **G** matrices occur, let’s consider what to do about them. What to do depends, of course, on what prompted the NPD solution. First, do your best to determine whether your model is identified. Model identification can be tricky with multilevel models, but drawing on our rule of thumb, consider whether with *p* random effects in your model, your sampling units have at least *p* + 1 observations. If not, you probably need to simplify the model. Even if your model is mathematically identified, model simplification might still be in order. Remember, a non-positive definite **G** matrix signals a lack of empirical support for each random effect to represent a non-redundant component of variation. A logical remedy is then to remove random effects until the warning message goes away. Typically, one should remove higher-order terms before lower-order terms (e.g., remove the quadratic random effect before the linear one, and the linear before the intercept). One pattern of results that is particularly amenable to this strategy is when the variance estimate for a random effect collapses to zero (or goes negative), suggesting it should be removed. We caution, however, that non-significance of a variance estimate should not be taken to imply that the random effect can be sacrificed without worry. Non-significance might simply be a result of low power. Trimming terms based on *p*-values thus runs the risk of over-simplification, with consequences for the validity of the inferences made from the model.

Additionally, we want to emphasize that reducing the number of random effects is not always defensible, desirable, or necessary. For instance, suppose our theory suggested the inclusion of two random slopes in the model. Each is estimated with some non-zero variance but the slopes are excessively correlated with one another, producing an NPD **G** matrix. Which should we remove? Both were hypothesized to exist and there is no empirical information to prompt the exclusion of one versus the other. Fortunately, we may not have to remove either. Sometimes, re-parameterizing the random effects covariance matrix is sufficient to resolve the problem. Specifically, McNeish & Bauer (2020) showed that using a factor analytic (FA) decomposition of the random effects covariance matrix can greatly aid convergence and reduce the incidence of NPD solutions. When necessary, the FA decomposition can also be used to facilitate a dimension reduction of the random effects covariance matrix that doesn’t require any of the random effects to be omitted entirely. In that case, you are effectively acknowledging that an NPD **G** matrix is just something you have to live with given the complexity of your model, but you are choosing to do so in as graceful (and empirically useful) a manner as possible.

One other strategy is to abandon the random effects entirely and move to a marginal or “population average” model (Fitzmaurice, Laird, & Ware, 2011; McNeish, Stapleton & Silverman, 2016). In a marginal model, one captures dependence among observations using a covariance structure for the residuals (e.g., compound symmetric, autoregressive, etc.) rather than through the introduction of random effects. Generalized estimating equations (GEE) are one popular algorithm for fitting marginal models, particularly when working with longitudinal data and discrete outcome variables. The obvious downside to a marginal modeling approach is the inability to quantify individual differences between units. For instance, applied in a longitudinal setting, a marginal model would provide estimates of how the mean of the outcome variable changes over time but would not provide estimates of how individuals vary from one another in their trajectories. With hierarchical data, a related approach is to assume independence of observations (despite knowing this assumption to be incorrect), but then implement “cluster corrected” or “robust” standard errors to obtain valid inferences. This latter option is commonly used in survey research where the nesting of units is a by-product of the sampling design (e.g., cluster sampling) but of little substantive interest. In general, these marginal modeling approaches obviate the possibility of an NPD **G** matrix by omitting random effects from the model, but they are typically only useful if clustering is a nuisance and between-cluster differences are not of theoretical interest (see McNeish et al., 2016).

In sum, the warning “G matrix is non-positive definite” tells you that there are fewer unique components of variation in your estimated random effects covariance matrix than the number of random effects in the model. This can be a consequence of fitting an under-identified model, in which case one must simplify the random effects structure. Alternatively, it may reflect sparse empirical information to support the random effects in the model (especially in small samples or with more complex models). Removing random effects is then a common solution. Often, however, a better solution is to re-parameterize the random effects covariance matrix to facilitate optimization to a proper solution, for instance by using a factor analytic decomposition. If the random effects are not of substantive interest, then you might also consider moving to a marginal model to avoid the issue entirely.

**References**

Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2011). *Applied longitudinal analysis* (2nd ed.). Hoboken, NJ: Wiley.

Laird, N.M. & Ware, J.H. (1982). Random-effects models for longitudinal data. *Biometrics*, *38*, 963-974. https://doi.org/10.2307/2529876

McNeish, D. & Bauer, D.J. (2020). Reducing incidence of nonpositive definite covariance matrices in mixed effect models. *Multivariate Behavioral Research*. https://doi.org/10.1080/00273171.2020.1830019

McNeish, D., Stapleton, L.M. & Silverman, R.D. (2016). On the unnecessary ubiquity of hierarchical linear modeling. *Psychological Methods, 22*, 114-140. https://doi.org/10.1037/met0000078

The post I’m reporting within- and between-group effects from a multilevel model, and my reviewer says I need to address “sampling error” in the group means. What does this mean, and what can I do to address this? appeared first on CenterStat.

In our prior post, we talked about how it can be important to separate within- and between-group effects for lower level predictors in MLM. To recap, this is usually done in one of two ways. The first way is to add the predictor, say *x*, to the model (perhaps after grand-mean centering) along with the group means of *x*. With this specification, the obtained coefficient for *x* will be the estimated within-group effect and the coefficient for the group means of *x* will be the estimated *contextual effect*, capturing the extent to which the between-group effect of *x* differs from its within-group effect. The second way is to center *x* with respect to its group mean, and then fit the model with this group-mean centered *x* along with the group means of *x*. The group-mean-centered *x* generates an estimate of the within-group effect and the group means generate an estimate of the between-group effect. Regardless of which approach is used, the observed group means of *x* are included in the model as a predictor, and this is where we can run into problems.

To understand this, think back to when you took intro stats and began to learn about inference and sampling variability. Chances are the lecture went something like this… Imagine that in the population, variable *x* has mean *μ* and variance *σ*^{2}. We want to estimate *μ* based on a sample of *n* observations on *x*. The estimate we obtain, the sample mean, is not going to be exactly equal to *μ* because it’s calculated from a sample rather than the entire population; thus, we would obtain different estimates from different samples, and these estimates will tend to vary more from one another in small samples. The variance of the sample mean across repeated samples is *σ*^{2} / *n*, and taking the square root of this yields the familiar formula for the standard error of the mean, *σ* / sqrt(*n*).
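
This sampling variability is easy to verify by simulation. The sketch below (illustrative values) draws many samples of size *n* and checks that the variance of the resulting sample means is close to *σ*^{2} / *n*:

```python
import random
import statistics

random.seed(1)

mu, sigma, n = 50.0, 10.0, 25
reps = 20000

# Draw many samples of size n and record each sample mean.
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

# The sample means scatter around mu with variance close to sigma^2 / n.
empirical_var = statistics.pvariance(sample_means)
theoretical_var = sigma**2 / n  # 100 / 25 = 4.0, so the SE is 2.0

print(round(empirical_var, 2), theoretical_var)
```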

Now let’s return to the MLM context. Each group mean that we calculate is subject to this same sampling error. When the number of observations sampled for a given group is small, the sampling error in the group mean will be large. This makes perfect sense: Imagine you have a classroom in which you have sampled just five of 40 students and then use the mean of these five students to estimate some characteristic of the entire class; naturally, this mean might vary substantially if computed on a different random five students in the class. Across groups, these sampling errors add “error variance” to the group means, which biases the estimate of the between-group effect. In the case of a single predictor, the bias is predictably in the direction of the within-group effect (leading the contextual effect to be under-estimated). With multiple predictors, the pattern of bias also depends on the correlations among the predictors and can be harder to predict a priori. Further, this bias can propagate to the estimates for true (non-aggregated) Level 2 predictors (that is, Level 2 predictors that are not obtained as a function of Level 1 observations), even though these predictors do not contain sampling error. Interestingly, because the within and between effects are orthogonal, this bias does not extend to the within-group effect estimates, which remain unbiased.
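
A small simulation (hypothetical values) shows this added “error variance” directly: even though the true classroom means have variance *τ* = 4, the observed means of five sampled students per classroom vary by roughly *τ* + *σ*^{2}/*n*:

```python
import random
import statistics

random.seed(2)

tau = 4.0        # true between-group variance of x
sigma_w = 9.0    # within-group variance of x
n_per_group = 5  # students sampled per classroom
n_groups = 5000

observed_means = []
for _ in range(n_groups):
    true_mean = random.gauss(0.0, tau ** 0.5)
    scores = [random.gauss(true_mean, sigma_w ** 0.5) for _ in range(n_per_group)]
    observed_means.append(statistics.fmean(scores))

# Observed group means vary MORE than the true group means:
# Var(observed mean) ~= tau + sigma_w / n = 4 + 9/5 = 5.8, not 4.
var_obs = statistics.pvariance(observed_means)
print(round(var_obs, 2))
```

That extra 1.8 units of variance is pure sampling error, and it is what attenuates the estimated contextual effect.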

Bias due to sampling error in the group means seems like a big problem, except that sometimes it’s not. One consideration is the group sizes in your sample: the larger they are, the less sampling error there will be in the group means. With large enough group sizes, you don’t need to worry much about bias. Likewise, when the true between-group differences are large (there is a high intra-class correlation for the predictor), the sampling error will make up a smaller part of the observed group mean differences, producing less bias. Another mitigating circumstance is if you sampled most or all of the individuals in the group population. The usual formula for the standard error of the mean assumes an infinite population, that is, you sampled *n* people from an infinitely large pool. Oftentimes, however, and especially with hierarchical data, there may be a limited population size for each group (e.g., one is sampling from a classroom of 20 students). In a finite population of *N* individuals, the sampling variance of the mean can be considerably smaller. In other words, there will be less bias to the between-group effects if the sampling ratio (units in sample to potential units in the finite population) is large (e.g., if you sampled 15 students in a classroom of 20). In some cases, you may even have *all* of the available units for each group, such as when studying siblings nested within families. Then there’s no bias whatsoever. Sometimes it is also possible to obtain group-level information from the population rather than calculating it from your sample. For instance, administrative records might provide information on the average family income of all the students in a class, even if only some of them are in the sample. Again, using this, rather than the sample mean, would remove the bias. Finally, if the between-group effect is not really very different from the within-group effect, then the bias in the estimate will be small.
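
For intuition, here is a small sketch (hypothetical numbers) of the standard finite population correction, which shrinks the standard error of a group mean as the sampling ratio grows; with the whole group in hand, the sampling error vanishes entirely:

```python
def se_of_mean(sigma, n, N=None):
    """Standard error of a group mean. If the finite group population
    size N is given, apply the finite population correction (FPC):
    Var(mean) = (sigma^2 / n) * (N - n) / (N - 1)."""
    se = sigma / n ** 0.5
    if N is not None:
        se *= ((N - n) / (N - 1)) ** 0.5  # FPC shrinks the SE
    return se

sigma = 6.0
print(round(se_of_mean(sigma, 5), 3))         # 5 students, infinite population
print(round(se_of_mean(sigma, 15, N=20), 3))  # 15 of 20 students sampled
print(se_of_mean(sigma, 20, N=20))            # whole class sampled: SE = 0.0
```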

But let’s say your situation doesn’t fit with any of these exceptions, then what? Well, some very clever methodologists have been working on ways to fix the bias. Three primary approaches have been suggested, each paralleling an approach for handling measurement error in standard regression models. One way to handle measurement error is with a latent variable model. Following this strategy, Lüdtke et al. (2008) proposed the multilevel latent covariate (MLC) model to handle sampling error in the group means. In this model, the observed scores for the sampled group members are viewed as indicators of the true underlying latent group mean. Shin and Raudenbush (2010) implemented the same idea within a multivariate MLM framework. A second strategy is to generate scores for the latent variable that produce consistent estimates when used as predictors in an observed-variables model. Consistent with this approach, Croon and van Veldhoven (2007) and Shin and Raudenbush (2010) showed that accurate estimates of between-group effects can be obtained by using empirical Bayes (EB) estimates of the group means of *x* rather than the observed sample means. Finally, a third way to handle measurement error is to fit a standard regression model but then implement a post-estimation correction to the estimates based on prior knowledge about the reliability of the predictor. In this case, we can infer the reliability of the predictor from the group size, since the unreliability is due to sampling error. Grilli and Rampichini (2011) and Gottfredson (2019) describe the appropriate corrections to implement this approach. As we describe next, all three of these general approaches can yield accurate estimates of the between-group effects, but which to choose may depend on the specific characteristics of your application.

The MLC model is widely recognized, theoretically elegant, makes most efficient use of the data, and is conducted in one step, requiring no pre-treatment of the data or post-transformation of the estimates. On the flip side, the MLC is a complex latent variable model and estimation can go awry when the number of groups is small (e.g., less than 50). The MLC is also based on a reflective measurement model that assumes that the people in the group are interchangeable and the latent group mean is a characteristic of the group that affects the individual scores (people could come and go but the latent group mean would stay the same). This is in contrast to a formative model, in which the scores of the group members are not necessarily interchangeable and collectively determine the population mean of the group (as people come and go the true group mean changes too). A reflective measurement model can be difficult to justify at times, but the MLC can still be profitably used with a formative process as long as the sampling ratio is low (e.g., only 5% of the population group members were sampled).

The EB approach has the advantage that it is straightforward to implement within a standard multilevel model. You can generate EB estimates for *x* in most MLM software programs (sometimes these are referred to instead as empirical best linear unbiased predictors, or EBLUPs). Then, following Shin and Raudenbush (2010), you simply use these rather than the usual group means of *x* when fitting the model to *y* (both at Level 2 and when centering the predictor at Level 1). However, this approach too has drawbacks. First, computing the EB estimates gets increasingly complicated as the number of predictors increases. They must be computed simultaneously for all of the Level 1 predictors, accounting for any other Level 2 predictors that will be in the model for *y*. Ultimately, for a sufficiently complex model, you may need to program in the matrix equations yourself (see Croon & van Veldhoven, 2007, pp. 51-52). Second, like the MLC, the EB estimates implicitly assume a reflective measurement model (though they too could still be used with a formative measurement process if the sampling ratio was sufficiently low). Third, although this approach generates consistent estimates of the fixed effects, it does not correct the variance component estimates, which may remain biased. In turn, this may bias the standard errors of the fixed effects.
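
In the simplest case (a single predictor and no other covariates), the EB estimate of a group’s mean is a reliability-weighted average of the observed group mean and the grand mean, with reliability λ_j = τ / (τ + σ² / n_j). A minimal sketch with hypothetical numbers:

```python
def eb_group_mean(xbar_j, n_j, grand_mean, tau, sigma2):
    """Empirical Bayes (shrinkage) estimate of a group's mean.
    lam is the reliability of the observed group mean: the share of
    its variance reflecting true between-group differences."""
    lam = tau / (tau + sigma2 / n_j)
    return lam * xbar_j + (1.0 - lam) * grand_mean

# Hypothetical values: between-group variance tau = 4, within-group
# variance sigma2 = 16, grand mean 50, observed group mean 56.
small = eb_group_mean(56.0, 4, 50.0, 4.0, 16.0)    # small group
large = eb_group_mean(56.0, 100, 50.0, 4.0, 16.0)  # large group

print(small)  # 53.0  (lam = 0.5: strong shrinkage toward the grand mean)
print(large)  # ~55.77 (lam ~ 0.96: little shrinkage)
```

Small groups, whose observed means carry the most sampling error, are pulled most strongly toward the grand mean, which is precisely why using EB estimates in place of raw group means counteracts the bias.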

Like the EB approach, the reliability-correction approach has the advantage that it can be implemented within a standard multilevel model. Further, no changes are required to the traditional procedures for separating within- and between-group effects. One simply needs to correct the estimates after fitting the model to counteract the expected bias due to sampling error. Corrections can be applied to both fixed effects and variance components and can be computed for either infinite or finite group populations, irrespective of reflective or formative measurement. Adjustments can also be made to the standard errors. But there are downsides to this approach too. First, the reliability-corrected estimates can show excessive sampling variability, making this approach most useful when working in large sample contexts (many groups). Second, the corrections are derived for balanced groups and aren’t fully accurate when group sizes vary. Third, correction formulas focus on the case of a single predictor, whereas it is more common for models to have multiple predictors.

Thus, as with so many things in statistics, there is no one right answer for how to address this problem. In your response to the reviewer, we would recommend the following. First, assess if sampling error is truly a problem for your particular analysis. Might your research fall into one of the exceptions where bias is not expected to be a problem (e.g., large cluster sizes or a high sampling ratio)? If so, you simply need to explain this to your reviewer. Second, if it is a problem, think about which of the possible alternative modeling approaches will best suit your needs by considering the advantages and disadvantages discussed above. If you have many predictors, and are fortunate to have a large number of groups in your sample, the MLC model may be your best bet, provided you can reasonably assume a reflective measurement model or low sampling ratio. If your model is small, the EB or reliability correction approaches might be easier to implement, and one or the other could be used to provide a sensitivity analysis for the original results (i.e., does the story change when accounting for sampling error?). These too perform best with a large number of groups. Last, if you have finite group sizes in the population, you recorded the total sizes of the groups from which you sampled, and you are sampling more than a small fraction of the available group populations, the reliability correction approach is the only one of the three that will take this into account to produce accurate estimates.

Research on this topic is ongoing and expanding, but we hope this post will help to orient you to the relevant literature and give you some ideas for how to move forward with your manuscript.

Croon, M.A. & van Veldhoven, M.J.P.M. (2007). Predicting group-level outcome variables from variables measured at the individual level: a latent variable multilevel model. *Psychological Methods, 12*, 45-57.

Gottfredson, N.C. (2019). A straightforward approach for coping with unreliability of person means when parsing within-person and between-person effects in longitudinal studies. *Addictive Behaviors, 94*, 156-161. DOI: 10.1016/j.addbeh.2018.09.031

Grilli, L., & Rampichini, C. (2011). The role of sample cluster means in multilevel models: A view on endogeneity and measurement error issues. *Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 7,* 121–133. https://doi.org/10.1027/1614-2241/a000030

Lüdtke, O., Marsh, H.W., Robitzsch, A., Trautwein, U., Asparouhov, T., & Muthén, B. (2008). The multilevel latent covariate model: a new, more reliable approach to group-level effects in contextual studies. *Psychological methods, 13*, 203-229. DOI: 10.1037/a0012869

Shin, Y., & Raudenbush, S. W. (2010). A latent cluster-mean approach to the contextual effects model with missing data. *Journal of Educational and Behavioral Statistics, 35*, 26–53. https://doi.org/10.3102/1076998609345252

The post My advisor told me I should group-mean center my predictors in my multilevel model because it might “make my effects significant” but this doesn’t seem right to me. What exactly is involved in centering predictors within the multilevel model? appeared first on CenterStat.

As a simple example, imagine your sample consists of multiple classrooms, and each classroom contains multiple students. Further, you obtained a student-level predictor reflecting *locus-of-control* and a student-level outcome reflecting *math achievement*. Your goal is to examine if students who report higher levels of control also tend to perform better on a math exam. Given the hierarchical structure of your data (the nesting of students within classrooms), there are actually three possible relations that can exist between your predictor and outcome: the total effect, the within-group effect, and the between-group effect (where, here, group is classroom). Let’s consider these in turn.

First is the total effect (or marginal effect), which represents the regression of math achievement on locus-of-control pooling over all students and classrooms. This total effect actually represents a weighted composite of the within-group and between-group components of the relation. While it is perfectly fine to estimate and interpret total effects from the standpoint of prediction (e.g., pooling over students and classrooms, a 1-unit change in the predictor leads to a so-many-points change in the outcome), it is much more difficult to draw theoretically meaningful conclusions from these effects, as the location of the effect is ambiguous – the total effect is a mish mosh of the within- and between-group effects. For this reason, when working with multilevel data, it is often preferable to estimate and interpret the within- and between-group effects directly instead.

The within-group effect is the relation between student locus-of-control and math achievement within a given classroom; this evaluates whether, on average, students who are higher (or lower) on control with respect to the *other students in their class* tend to score higher (or lower) on the math assessment. One way to think about this effect is to imagine that you had only sampled students from a single classroom, say Class A. If you ran a simple regression analysis on the data, you would obtain an effect that tells you how differences in locus of control are predictive of differences in math scores for students in Class A. You might assume that there’s nothing particularly special about this class and that you would have observed the same effect had you sampled from Class B, or Class C, etc. With the multilevel data, we can leverage the data from all of the classrooms in our sample to estimate this common within-group effect with greater precision (and, if we don’t want to assume the within-group effect is the same in each classroom, we can allow for that too, but that’s a story for another day). The within-group effect thus tells us how, within a given group, differences in the predictor relate to differences in the outcome.

In contrast, the between-group effect is the relation between the *classroom mean* of student locus-of-control and math achievement; this evaluates whether, on average, *classes* characterized by higher (or lower) control tend to score higher (or lower) on math achievement. Here, we can imagine that instead of collecting the individual data, we were only provided with summary data for each classroom. Again, we could run a simple regression on these data, obtaining an estimate of how differences in the average value of locus of control between classrooms relate to differences in the average value of math achievement. With access to the individual, student-level data, we can estimate this effect more optimally (accounting for differences in classroom sizes, for instance), but the interpretation remains the same. If we were to compare two classrooms that differed by 1 unit in their average locus of control values, we would expect the students within these classrooms to differ in their average math scores by the magnitude of the between-group effect.

It is often quite important (if not required) to properly disaggregate the total effect into the within-group component and the between-group component within an MLM, and centering the predictors allows us to accomplish this. To see this, let’s consider a very simple one-predictor MLM for students nested within classrooms in which our predictor is locus-of-control and our outcome is math achievement.

Broadly, centering refers to the process of subtracting the mean from a variable (usually a predictor). Unlike in ordinary regression, centering becomes complicated with multilevel data because there are two possible means around which lower-level predictors can be centered. The first is the *grand mean* that represents the mean of the predictor pooling over all observations and all groups. The second is the *group mean* that represents the mean of the predictor within the group to which the observations belong.

There are thus two primary choices when centering lower-level predictors: we can *grand mean center* the predictor, where we deviate each individual score from the overall mean (literally subtracting the grand mean from each person’s score), or we can *group mean center* the predictor, where we instead deviate each individual score from their own group mean. The former reflects the individual’s relative standing on the predictor with respect to *everyone* in the sample and the latter reflects the individual’s relative standing on the predictor with respect to everyone in their *own group*. Either of these rescaled (or “centered”) predictors can then be used in the Level 1 model, as can the raw (or *uncentered*) version of the predictor. Which is used influences the interpretation of the obtained effects. Further, because the group mean is a characteristic of the group, it can itself be used as an upper-level predictor in the Level 2 equation, regardless of which form of centering is used for the predictor at Level 1 (or even if it is left in the raw scale).
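
The two centering choices are simple to compute. A quick sketch with toy data (hypothetical locus-of-control scores for three classrooms):

```python
from statistics import fmean

# Toy data: locus-of-control scores for students in three classrooms
# (hypothetical values for illustration).
classrooms = {
    "A": [3.0, 4.0, 5.0],
    "B": [5.0, 6.0, 7.0],
    "C": [7.0, 8.0, 9.0],
}

all_scores = [x for scores in classrooms.values() for x in scores]
grand_mean = fmean(all_scores)                    # 6.0
group_means = {g: fmean(s) for g, s in classrooms.items()}

# Grand-mean centering: deviate every score from the overall mean.
grand_centered = {g: [x - grand_mean for x in s] for g, s in classrooms.items()}

# Group-mean centering: deviate every score from its own classroom mean.
group_centered = {g: [x - group_means[g] for x in s] for g, s in classrooms.items()}

print(group_means)          # {'A': 4.0, 'B': 6.0, 'C': 8.0}
print(grand_centered["A"])  # standing relative to everyone: [-3.0, -2.0, -1.0]
print(group_centered["A"])  # standing relative to Class A:  [-1.0, 0.0, 1.0]
```

Note that the group means themselves are retained; they are what would be carried up to Level 2.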

When using the predictor in the raw scale or with grand-mean centering, it is critical to include the group means of the predictor at Level 2 to properly disaggregate the effects. The effect obtained for the predictor at Level 1 will then be the within-group effect and the effect obtained at Level 2 will be the *difference* between the between- and within-group effects, sometimes called the *contextual effect*.

Problems, however, arise if you fail to include the group means in the model when using the raw scale or grand-mean centered predictor. If you do that, you will get a mish mosh effect estimate for the Level 1 predictor that represents neither the between-group nor the within-group effect. Instead, it confounds these two effects into a single value that may not resemble either. To make matters worse, this mish mosh also doesn’t represent the total effect, as it weights the within- and between-group effects differently. The obtained estimate is difficult to interpret, outside of a few special cases.^{1}

In contrast, when using the predictor with group-mean centering, the effect obtained for the predictor at Level 1 will always be the within-group effect, regardless of whether the group means are included at Level 2 or not. If the group means are included at Level 2, the effect obtained will be the between-group effect. Importantly, MLMs fit using raw, grand-mean centered, or group-mean centered predictors all produce identical model fit, provided the group means are entered as predictors at Level 2 and there are no random slopes in the model (again, a story for another day).
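
To make these distinctions concrete, here is a small noise-free sketch. It uses simple OLS slopes (covariance over variance) in place of a full MLM, with toy data constructed so that the true within-group effect is 1 and the true between-group effect is −2 (all values hypothetical):

```python
from statistics import fmean

def slope(xs, ys):
    """OLS slope of y on x: cov(x, y) / var(x)."""
    mx, my = fmean(xs), fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Three groups with means 2, 4, 6; three scores per group.
groups = {g: [g - 1.0, g, g + 1.0] for g in (2.0, 4.0, 6.0)}

xs, ys, x_centered, x_means = [], [], [], []
for g_mean, scores in groups.items():
    for x in scores:
        xs.append(x)
        # Outcome built with within effect = 1, between effect = -2:
        ys.append(1.0 * (x - g_mean) + (-2.0) * g_mean)
        x_centered.append(x - g_mean)  # group-mean centered predictor
        x_means.append(g_mean)         # group mean (Level 2 predictor)

print(slope(x_centered, ys))  # within-group effect: 1.0
print(slope(x_means, ys))     # between-group effect: -2.0
print(slope(xs, ys))          # raw x alone: -1.4, a blend of the two
```

Regressing on the raw predictor alone recovers neither effect, only a weighted blend of the two, which is exactly the mish mosh problem: the group-mean centered predictor and the group means must be used to isolate the within- and between-group effects.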

With this as context, we can now return to your question, the answer to which depends on how you specified your initial model. If you included the group means in your model at Level 2, then you will obtain exactly the same within-group effect estimate (and p-value) for your Level 1 predictor regardless of which method of centering you use. In that case, your advisor would be wrong: group-mean centering won’t change a thing. On the other hand, if you haven’t included the group means in the model at Level 2, then group-mean centering will generate an estimate of the within-group effect that will differ from the mish mosh estimate you previously obtained with the raw scale or grand-mean centering. The significance of the within-group effect might well differ from the mish mosh estimate you had before, in which case your advisor would be right. And then there’s the effect of the group means at Level 2 to consider. Remember that when these are included at Level 2 the obtained estimates differ in interpretation depending on whether group-mean centering is used or not at Level 1. When using the predictor in raw scale or with grand-mean centering, the estimate represents the contextual effect, whereas with group-mean centering, the estimate represents the between-group effect. These will typically differ from one another and may differ in significance as well, since they test different null hypotheses.

The bottom line is that your advisor might or might not be right, depending on which aspect of the relationship between the predictor and outcome you are estimating in your models (e.g., total, within-, or between-group effects). Different forms of centering and model specification lead to important interpretational differences in the model results that are critical to consider when drawing substantive inferences. You must know exactly which effects you wish to estimate and ensure that you specify the model in such a way that you will obtain tests of those effects.

We can thus draw the following general conclusions:

- If either the raw or grand-mean centered predictor is entered at Level 1 without the group mean entered at Level 2, the obtained regression coefficient will confound the within- and between-group components of the relation into a single estimate that is difficult to interpret, outside of special circumstances (e.g., where the within- and between-group effects are the same).
- If either the raw or grand-mean centered predictor is entered at Level 1 and the group mean is entered at Level 2, then the regression coefficient associated with the Level 1 predictor represents an unambiguous estimate of the **within-group** effect, and the regression coefficient associated with the Level 2 group mean represents the *difference* between the between-group and within-group effects; this latter effect is sometimes called the **contextual effect**.
- If the group-mean centered predictor is entered at Level 1, with or without the group mean entered at Level 2, the regression coefficient represents an unambiguous estimate of the **within-group** effect.
- If the group mean is entered at Level 2, with or without the group-mean centered predictor at Level 1, the regression coefficient represents an unambiguous estimate of the **between-group** effect.
- Finally, generalizing from points #3 and #4, if the group-mean centered predictor is entered at Level 1 and the group mean is entered at Level 2, this provides simultaneous and unambiguous estimates of both the within-group and between-group effects of the predictor on the outcome.

Given the above, it is quite easy to see how confusion can arise about different options for centering, and how individual choices can impact subsequent interpretations of model results. Here we have only offered a brief review, and there are many clear and cogent descriptions of these issues as they arise both in hierarchically clustered data (as described above) and in longitudinal data (where we instead talk about within-person and between-person effects). For more detailed discussions of these issues see Raudenbush and Bryk (2002, pages 31-35, 134-149, and 181-183), Enders and Tofighi (2007), Kreft, de Leeuw, and Aiken (1995), and (if we may) Curran and Bauer (2011).

In conclusion, simply know that there is no “right” or “wrong” choice about centering, but there is most definitely an *optimal* choice based on the theoretical questions under study.

————————————

^{1} One special case is where the within- and between-group effects are the same. Then the value obtained for the raw-scale or grand-mean centered predictor at Level 1 is an unbiased estimate of these effects. But there is seldom cause to assume these effects to be identical *a priori*. Another special case is where there is no between-group variance in the predictor, due to balancing across clusters by design, in which case the estimate will resolve to the within-group effect. An example would be in a longitudinal study where the time scores are the same for all people because the assessment schedule is identical across participants and there is no missing data.

————————————

Curran, P. J., & Bauer, D. J. (2011). The disaggregation of within-person and between-person effects in longitudinal models of change. *Annual Review of Psychology, 62*, 583-619.

Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models: a new look at an old issue. *Psychological Methods*, *12*, 121-138.

Kreft, I. G. G., de Leeuw, J., & Aiken, L. S. (1995). The effect of different forms of centering in hierarchical linear models. *Multivariate Behavioral Research, 30*, 1-21.

Raudenbush, S. W., & Bryk, A. S. (2002). *Hierarchical linear models: Applications and data analysis methods* (Vol. 1). Sage Publications.


**A reviewer recently asked me to comment on the issue of equivalent models in my structural equation model. What is the difference between alternative models and equivalent models within an SEM?**

To begin, one of the greatest strengths of the SEM is the ability to estimate models in very specific ways to closely correspond to theory. Sometimes we can think of this as the “whiteboard” problem: we draw out our measured variables on the board and then connect them with single- and double-headed arrows and circles in a way that best reflects our theoretically-derived research hypotheses. We often build one model that is most consistent with our theory, but there are *alternative* models we might consider. Alternative models represent different path diagrams that make different statements about the underlying theory. A key strength of the SEM is that we can make formal comparisons of the fit of alternative models based on sample data: one model might attain superior fit when compared to another, providing empirical support for favoring the better fitting model versus the alternative.

In contrast, whereas *alternative* models almost always lead to *differences* in model fit, *equivalent* models are different representations of model structure that result in precisely the *same* model fit. That is, the models are *equivalent representations* of the sample data and cannot be distinguished from one another based on empirical fit. An equivalent model can be thought of as a re-parameterization of the original model: it is just a different way of “packaging” the same information in the data, and no equivalent model can be distinguished from another based on fit alone. If you were to fit a series of equivalent models to the same sample data, you would obtain exactly the same chi-square test statistic, RMSEA, CFI, TLI, and any other omnibus measure of fit. (Side note: one potentially confusing point is that, depending on how the models are estimated, their log-likelihoods might differ; these differences cancel out when computing measures of fit relative to the corresponding saturated or baseline models, so the fit remains the same.)

Take a very simple example: a three-variable mediation model might state that the predictor (X) leads to the mediator (M) that in turn leads to the outcome (Y); diagrammatically, this is portrayed as:

X → M → Y

This model has one degree of freedom and will obtain some degree of fit to the data (chi-square, RMSEA, CFI, etc.). However, there are two equivalent models that obtain *precisely* the same model fit when estimated using sample data. The first reverses the direction of the causal chain:

Y → M → X

and the second specifies the mediator as a common cause of both the predictor and the outcome:

X ← M → Y

All three of these models make fundamentally different statements about the underlying model that gave rise to the observed data, yet all three fit *precisely* the same. As such, the three models are numerically equivalent and can only be adjudicated based on theory.
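This equivalence is easy to verify numerically. For recursive path models such as these, maximum likelihood estimation reduces to equation-by-equation regression, and each model reproduces every observed moment except the covariance of the predictor and outcome, whose implied value follows from path tracing. A numpy sketch (simulating data from the mediation chain, with assumed path values):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
m = 0.6 * x + rng.normal(size=n)   # mediator generated from the chain model
y = 0.5 * m + rng.normal(size=n)

S = np.cov(np.vstack([x, m, y]))   # sample covariance matrix (3 x 3)
vx, vm = S[0, 0], S[1, 1]
cxm, cmy, cxy = S[0, 1], S[1, 2], S[0, 2]

# Each model's ML solution reproduces all moments except cov(x, y);
# the implied value of cov(x, y) is obtained by tracing the paths:
imp_chain   = (cxm / vx) * (cmy / vm) * vx   # x -> m -> y
imp_reverse = (cmy / vm) * (cxm / vm) * vm   # y -> m -> x
imp_common  = (cxm / vm) * (cmy / vm) * vm   # x <- m -> y

print(imp_chain, imp_reverse, imp_common, cxy)
```

All three implied covariances reduce algebraically to cov(x,m)·cov(m,y)/var(m), so every model implies the same covariance matrix and therefore the same chi-square, RMSEA, CFI, and so on; only the single implied value can disagree with the observed cov(x,y), and that shared discrepancy is the one-degree-of-freedom misfit common to all three.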

The above example considers only three measured variables and two regression coefficients. Imagine how this problem scales up with many more measures and many more parameters, particularly if a model includes one or more latent factors. Fortunately, much research has been conducted to help identify the set of equivalent models that accompanies any given hypothesized model. Although several important papers have been written on this topic (e.g., Stelzl, 1986; MacCallum et al., 1993), a key contribution was made by Lee and Hershberger (1990), who developed what are sometimes called “replacement rules” or simply the “Lee-Hershberger rules.” Briefly, Lee and Hershberger describe a clever approach in which the variables in a given model are organized into three blocks: a preceding block, a focal block, and a succeeding block. Then, within the focal block, a large number of modifications can be made to how the variables relate to one another (e.g., reversing pathways, or replacing regression paths with covariances), all of which achieve identical model fit. A model of even modest complexity might have 50 such alternative expressions, and more complex models can have hundreds if not thousands of equivalent counterparts.

There are several core takeaway points here. First, it is important to realize that this is simply a characteristic of the SEM and is part of the price we pay for having the flexibility to parameterize models in precisely the way we desire. Second, very little can be done to empirically distinguish among equivalent models (given that traditional measures of fit will be identical). Some specific suggestions have been offered (e.g., Raykov & Penev, 1999) but none are able to fully resolve the issue. Indeed, even replication with an independent sample does not resolve the issue because two equivalent models will attain identical fit within *any* given sample.

As such, it is important that researchers be fully aware that this issue exists and realize that any given hypothesized model is just one member of an entire *family* of models, all of which are numerically indistinguishable in terms of model fit. Of course, some of these models may not be theoretically plausible (e.g., a mediator predicting biological sex, or a path running backward in time), but many dozens of options may remain. It is often best to treat this as a limitation of any given study and to present one or a small number of equivalent model options to the reader so that these too might be considered as plausible representations of the data. Further, it can be beneficial to consider these issues when designing future studies, in which certain design elements might be incorporated to help reduce the universe of possible equivalent models.

**References**

Lee, S., & Hershberger, S. (1990). A simple rule for generating equivalent models in covariance structure modeling. *Multivariate Behavioral Research, 25*, 313-334.

MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. *Psychological Bulletin, 114*, 185-199.

Raykov, T., & Penev, S. (1999). On structural equation model equivalence. *Multivariate Behavioral Research, 34*, 199-244.

Stelzl, I. (1986). Changing a causal hypothesis without changing the fit: Some rules for generating equivalent path models. *Multivariate Behavioral Research, 21*, 309-331.


**I have a fair amount of missing data that I don’t want to delete prior to my analysis. What are the best options available for me to retain these partially missing cases?**

Missing data are a common problem faced by nearly all data analysts, particularly with the increasing emphasis on the collection of repeated assessments over time. Data values can be missing for a variety of reasons. A common situation is when a subject provides data at one time point but fails to provide data at a later time point; this is sometimes called attrition. However, data can also be missing within a single administration. For example, a subject might find a question objectionable and not want to provide a response; a subject might be fatigued or not invested in the study and skip an entire section; or there might be some mechanical failure where data are not recorded or items are inadvertently not presented. Regardless of source, it is very common for assessments to be missing for a portion of the sample under study. Fortunately, there are several excellent options available that allow us to retain cases that only provide partial data.

Historically, any case that was missing any data was simply dropped from the analysis, an approach called listwise deletion. Listwise deletion was widely used primarily because no alternative methods existed that would allow for the inclusion of partially missing cases. However, listwise deletion results in lower power and often produces biased estimates with limited generalizability. Other traditional approaches included pairwise deletion (where correlations were computed using only the cases available on each pair of variables), mean imputation (where a single value was imputed for the missing case and treated as if it had actually been observed), and last-value-carried-forward (where the last observation among a set of repeated measures replaced the subsequent missing values). Like listwise deletion, these approaches have significant limitations. Fortunately, there are now several modern approaches to missing data analysis that perform markedly better than these traditional methods. Before discussing these modern methods, it helps to first consider the underlying mechanisms that lead to the data being missing in the first place. The terminology and associated acronyms for these mechanisms are a bit labyrinthine, but once understood they bring clarity to the issues at hand.

The first missing data mechanism is called missing completely at random, or MCAR, and reflects a process in which data are missing in a purely random fashion that is unrelated to either the missing value itself or to other observed variables in the data set. An example is when values are missing because of a programming error that randomly governs the presentation of items to subjects. The second mechanism is called missing at random, or MAR. This mechanism reflects a process in which data are not missing as a function of the missing value itself, but can be missing in relation to other variables in the data file. For example, men might be twice as likely to be missing as women, but if biological sex were a measured variable in the data file then this could be used to establish MAR. The final mechanism is missing not at random, or MNAR. This kind of missing data (also known as informatively missing or non-ignorably missing) is the most serious mechanism and defines a process in which data are missing due to the missing value itself. For example, an individual may choose not to respond to a question about drug use precisely because that person’s drug use is high. These three mechanisms are important to delineate because the primary methods for addressing missing data in an analysis require the underlying mechanism to be MCAR or MAR, but not MNAR.
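The consequences of these mechanisms for a naive complete-case analysis can be seen in a small simulation (a sketch with assumed values; the population mean of y is 0, and roughly 30% of the y values are deleted under each mechanism):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)   # y correlates with x; true mean of y is 0

mcar = rng.random(n) < 0.3               # MCAR: missingness is pure chance
mar = x > 0.5                            # MAR: missingness depends on observed x only
mnar = y > 0.5                           # MNAR: missingness depends on y itself

# Complete-case mean of y under each mechanism
mean_mcar = y[~mcar].mean()
mean_mar = y[~mar].mean()
mean_mnar = y[~mnar].mean()
print(mean_mcar, mean_mar, mean_mnar)
```

Only the MCAR mean stays near the true value of 0; under MAR and MNAR, deleting the incomplete cases systematically understates the mean because the deleted cases are not a random subset.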

There are two general approaches to fitting models using partially missing cases. The first is multiple imputation, or MI. Under MI, an imputation model is defined that uses variables that were observed in the data file to generate (or impute) numerical values for the observations that were missing. However, to reflect that these values are imputed with uncertainty, this process is repeated multiple times resulting in a different imputed value for each repetition. Thus, 10 or 20 imputed data sets are created, the model of interest is fitted to each data set, and the results from all the estimated models are pooled for subsequent inference. When using MI, the missing data must be MAR given the variables included in the imputation model (e.g., if missingness varies by sex, then sex must be included in the imputation model), but not all of these variables need to be in the analysis model (e.g., sex might be included in the imputation model but not in the fitted model and is thus an “auxiliary” variable).
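A deliberately simplified sketch may help fix the MI ideas. The setup assumes a MAR mechanism in which y is deleted whenever the fully observed x is large; this is “improper” imputation in that it ignores uncertainty in the imputation-model parameters (real implementations draw those parameters from their posterior), but it shows the impute-analyze-pool cycle and Rubin’s rules for the mean of y:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m_imp = 20_000, 20
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)   # true mean of y is 0
miss = x > 0.5                           # MAR: depends only on observed x
y_obs = np.where(miss, np.nan, y)

# Imputation model: regression of y on x among the complete cases
X = np.column_stack([np.ones(n), x])
obs = ~miss
beta, *_ = np.linalg.lstsq(X[obs], y_obs[obs], rcond=None)
resid_sd = (y_obs[obs] - X[obs] @ beta).std()

est, wvar = [], []
for _ in range(m_imp):
    y_imp = y_obs.copy()
    # Impute predicted value plus residual noise (different draw each repetition)
    y_imp[miss] = X[miss] @ beta + rng.normal(scale=resid_sd, size=miss.sum())
    est.append(y_imp.mean())                 # analyze each completed data set
    wvar.append(y_imp.var(ddof=1) / n)       # within-imputation variance of the mean

# Rubin's rules: pooled estimate, and total variance = within + inflated between
qbar = np.mean(est)
b = np.var(est, ddof=1)
t = np.mean(wvar) + (1 + 1 / m_imp) * b
print(qbar, t ** 0.5)
```

The complete-case mean is noticeably biased here, whereas the pooled MI estimate recovers the true mean because the imputation model conditions on x, the variable driving the missingness.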

The second primary approach for accommodating partial missingness is called full information maximum likelihood, or FIML. Here, no raw data are imputed; instead, the model is fit to the complete and the partially missing cases together, and each individual observation contributes whatever data are available to the overall likelihood function. Implicitly, cases with complete data contribute more to the analysis, but cases with partial data also contribute what information they have. Because multiple data files are not generated, FIML requires that only a single model be estimated, from which all inferences are drawn. Since there is no distinction between an imputation model and the model of interest in FIML, MAR must be satisfied by including all variables predictive of missingness in the fitted model (either as a structural part of the fitted model or innocuously included as auxiliary variables).
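FIML need not feel like a black box. For a simple bivariate-normal example in which x is always observed and y is missing for some cases (a single monotone pattern, assumed here with a MAR mechanism), the maximum likelihood estimates based on all available data have a closed form via the factored likelihood: estimate the x marginal from all cases and the regression of y on x from the complete cases, then combine. A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)   # true mean of y is 0
obs = x <= 0.5                            # MAR: y observed only when x is small

# Listwise (complete-case) estimate of the mean of y: biased under MAR
mean_cc = y[obs].mean()

# Factored-likelihood (FIML-equivalent) estimate:
# the x marginal uses ALL cases; the y|x regression uses complete cases only
mu_x = x.mean()
X = np.column_stack([np.ones(obs.sum()), x[obs]])
a, b = np.linalg.lstsq(X, y[obs], rcond=None)[0]
mean_fiml = a + b * mu_x                  # regression prediction at the full-sample mean

print(mean_cc, mean_fiml)
```

The complete cases alone understate the mean of y, while the likelihood-based estimate borrows the information in the partially missing cases (their x values) and recovers it, which is exactly the sense in which each observation "contributes whatever data are available."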

There are many situations in which MI and FIML operate almost identically, and it is common for the two approaches to produce comparable results when based on the same model and data. However, there are situations in which each has specific advantages and disadvantages relative to the other. For instance, FIML does not require the estimation and pooling of results across multiple data sets, making it simpler for the user. In contrast, with MI it is often possible to include many more variables in the imputation model (to help satisfy MAR) than are ultimately of interest in the fitted model. Further, different software packages implement these options to differing degrees, so the user must be fully aware of what each program is doing when fitting models to partially missing cases.

The missingness mechanism we have not yet addressed is MNAR. Unfortunately, MNAR data are distinctly harder to handle. Part of the problem is that one can never establish whether the data are MAR versus MNAR, because doing so would require actually observing the data that are missing. Thus, approaches for accommodating MNAR, which include selection models and pattern mixture models, are best implemented as sensitivity analyses. These approaches build in assumptions about the non-random missing data process so that we can observe how much the substantive conclusions drawn from the analysis change relative to when MAR is assumed. Fortunately, in many applications this is unnecessary, as MI and FIML (though they assume MAR) often perform well with MNAR data so long as the informativeness of the missing data process is not strong and the fraction of missing data is not large.

In sum, the existence of MI and FIML makes listwise deletion of missing cases almost indefensible except in a narrow band of specific situations. Indeed, the default estimator in most SEM software packages is now FIML, and the issue becomes nearly invisible to the user. But be certain that you know precisely how the software is handling missing cases, and verify that cases are not being dropped without your knowledge. A non-exhaustive sampling of recommended readings is given below.

**References**

Enders, C. K. (2010). *Applied missing data analysis*. Guilford Press.

Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation models. *Structural Equation Modeling, 10*, 80-100.

Graham, J. W. (2009). Missing data analysis: Making it work in the real world. *Annual Review of Psychology, 60*, 549-576.

Graham, J. W. (2012). *Missing data: Analysis and design*. Springer.

Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data designs in psychological research. *Psychological Methods, 11*, 323-343.

Harel, O., & Schafer, J. L. (2009). Partial and latent ignorability in missing-data problems. *Biometrika, 96*, 37-50.

Little, R. J., & Rubin, D. B. (2019). *Statistical analysis with missing data* (Vol. 793). John Wiley & Sons.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. *Psychological Methods, 7*, 147-177.

Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst’s perspective. *Multivariate Behavioral Research, 33*, 545-571.


**The Cronbach’s Alphas for all the scales in my path analysis are in the .7s, so why is a reviewer criticizing me for not paying sufficient attention to reliability?**


Reliability addresses the consistency of measurement and is centered on the belief that an observed score is some combination of unobserved true score and error (e.g., DeVellis, 2016). Reliability can then be defined in terms of the relative magnitudes of these two components. Many conceptual definitions of reliability have been proposed, but the most widely used is the ratio of true score variance to total observed variance. A reliability of 1.0 thus reflects that all of the observed variability is true score variability and there is no error of measurement; as reliability falls below 1.0, more and more of the observed variability is due to measurement error. The posed question notes reliability values in the .7s, indicating that as much as 30% of the observed variability in the measured variables is due to error. This is a non-trivial amount of error that can have potentially profound implications for subsequent model fitting.

For example, all members of the general linear model (ANOVA, multiple regression, path analysis, etc.) assume that predictors are measured with perfect reliability (the precise assumption is that the distributions of predictors are “fixed and known,” but this in turn implies perfect reliability). Violation of this assumption leads to biased regression coefficients. With just one predictor, the biasing effect of measurement error is always to attenuate coefficient estimates; that is, sample estimates become systematically smaller in magnitude than the actual population values as the degree of unreliability of the predictor increases (Bollen, 1989, pp. 167-168). With two or more predictors, the direction and magnitude of the bias is harder to predict (Bollen & Schwing, 1987). Further, although unreliability in an outcome measure does not bias the raw regression coefficients, it does distort standardized effect estimates and leads to inflated standard errors that reduce power. Taken together, violation of the assumption of perfect reliability when using manifest scale scores can result in substantially biased parameter estimates and markedly lower statistical power, concerns that are implied by the reviewer’s critique.
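The single-predictor attenuation result can be illustrated with a brief simulation (a sketch with assumed values: true slope 1.0, predictor reliability 0.7). With one predictor, the expected observed slope equals the true slope multiplied by the predictor’s reliability:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
t = rng.normal(size=n)                    # true score, variance 1
y = 1.0 * t + rng.normal(size=n)          # true slope on t is 1.0

rel = 0.7                                 # assumed reliability of the observed score
# observed = true + error, with error variance chosen so var(t)/var(x) = rel
x = t + rng.normal(scale=np.sqrt(1 / rel - 1), size=n)

def slope(u, v):
    # Simple regression slope of v on u
    return np.cov(u, v)[0, 1] / np.var(u)

b_true = slope(t, y)   # close to the true slope of 1.0
b_obs = slope(x, y)    # attenuated toward rel * 1.0 = 0.7
print(b_true, b_obs)
```

With reliability in the .7s, the estimated effect of a single fallible predictor is pulled roughly 30% toward zero, which is exactly the scale of bias the reviewer is worried about.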

Two challenges thus arise: how best to empirically estimate reliability, and how best to account for unreliability in analyses. McNeish (2018) offers a recent comprehensive review of the first question. Historically, coefficient alpha (or “Cronbach’s Alpha”) has been the standard empirical measure of reliability in the social and health sciences. However, alpha is based on several strict and typically untenable assumptions that tend to drive down the estimated value, which is why coefficient alpha is often viewed as a “lower bound” reliability estimate. Other estimation methods exist (e.g., coefficient omega, coefficient H, the Greatest Lower Bound), but each is associated with certain limitations. Further, all of these arise from a classical test theory (CTT) approach in which there is a single reliability estimate across the range of scale scores. In contrast, model-based approaches such as item response theory (IRT) expand this conceptualization such that reliability depends in part on the underlying score itself (e.g., Thissen & Wainer, 2001). For example, reliability decreases for extreme scores at the lowest and highest parts of the latent trait distribution, and these differences are not reflected within the CTT framework.
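One quick way to see the “lower bound” behavior of alpha is to simulate items that violate tau-equivalence (unequal factor loadings, with values assumed here purely for illustration) and compare the sample alpha against the population omega implied by the generating loadings:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
lam = np.array([0.9, 0.7, 0.5, 0.3])      # unequal loadings: tau-equivalence fails
f = rng.normal(size=n)                    # common factor
# Congeneric items with unit variance: loading * factor + scaled unique error
items = lam * f[:, None] + rng.normal(size=(n, len(lam))) * np.sqrt(1 - lam ** 2)

def cronbach_alpha(data):
    # k/(k-1) * (1 - sum of item variances / variance of the sum score)
    k = data.shape[1]
    total_var = data.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - data.var(axis=0, ddof=1).sum() / total_var)

alpha = cronbach_alpha(items)

# Population omega from the known loadings: true-score variance / total variance
omega = lam.sum() ** 2 / (lam.sum() ** 2 + (1 - lam ** 2).sum())
print(alpha, omega)
```

With these loadings the true reliability (omega) is about .71 while alpha lands around .68; the more unequal the loadings, the larger the gap, which is one of the reasons alpha alone is a shaky basis for declaring reliability “adequate.”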

Regardless of how it is computed, the second issue remains: how to account for imperfect reliability in subsequent analysis. The ideal option is to move from a manifest variable model to a latent variable model (e.g., Bollen, 2002). Here, multiple-indicator latent factors are defined in place of a mean or sum score. This allows for the estimation of measurement error and the separation of true score variability from error variability. Latent variables add complexity to any model, and thus an informed decision is needed as to the gains achieved relative to the loss of parsimony. When it is not possible to estimate a full latent variable model, an alternative is to adjust for unreliability in the scale scores when conducting the path analysis. Various options for correcting for unreliability have been proposed in the literature, and this remains an active area of research (e.g., Devlieger, Talloen, & Rosseel, 2019).

Returning to the question about the reviewer’s criticism, it is true that simply reporting coefficient alpha values with an unsupported subjective judgment that the obtained values are “adequate” does not address the complexity of the issues at hand. First, it is helpful to consider whether coefficient alpha is the optimal method of reliability estimation or whether better options are available. Second, if reliability estimates are less than 1.0, the implications for subsequent modeling and inferential tests should be communicated to the reader. Finally, if scales are determined to have meaningful levels of unreliability (however that is judged), then expanding the modeling framework to include multiple-indicator latent factors should be closely considered.

**References**

Bandalos, D. L. (2018). *Measurement theory and applications for the social sciences*. Guilford Publications.

Bollen, K. A. (1989). *Structural Equations with Latent Variables*. John Wiley New York.

Bollen, K. A. (2002). Latent variables in psychology and the social sciences. *Annual Review of Psychology, 53*, 605-634.

Bollen, K. A., & Schwing, R. C. (1987). Air pollution-mortality models: A demonstration of the effects of random measurement error. *Quality and Quantity, 21*, 37-48.

DeVellis, R. F. (2016). *Scale development: Theory and applications* (Vol. 26). Sage Publications.

Devlieger, I., Talloen, W., & Rosseel, Y. (2019). New developments in factor score regression: Fit indices and a model comparison test. *Educational and Psychological Measurement, 79*, 1017-1037.

McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. *Psychological Methods, 23*, 412-433.

Thissen, D. & Wainer, H. (Eds.). (2001). *Test scoring*. L. Erlbaum Associates, Publishers.

