The post What are ROC curves and how are these used to aid decision making? appeared first on CenterStat.


This same approach to decision making appears routinely in daily life. For example, we might want to know if a person referred to a clinic is likely to be diagnosed with major depression or not: they either are truly depressed or not (which is unknown to us) and we must decide based on some brief screening test if we believe it is likely that they suffer from depression. Or we might want to know if it is likely someone has a medical diagnosis that requires a more invasive biopsy procedure. Or, one that is near and dear to us all, we may want to know if we do or do not have COVID. There is a “true” condition (you really do or really do not have COVID) and we obtain a positive or negative result on a rapid test we bought for ten dollars from Walgreens. We have precisely the same four possible outcomes as shown above, two that are correct (it says you have COVID if you really do, or you do not have COVID if you really do not) and two that are incorrect (it says you have COVID when you really don’t, or you do not have COVID if you really do).

This is such an important concept that there are specific terms that capture these possible outcomes. *Sensitivity* is the probability that you will receive a positive rapid test result if you truly have COVID (the probability of a *true positive* for those with the disease). *Specificity* is the probability that you will receive a negative rapid test result if you truly do not have COVID (the probability of a *true negative* for those without the disease). One minus sensitivity thus represents the error of obtaining a false negative result, and one minus specificity represents the error of obtaining a false positive result. However, all of the above assumes that the rapid test has one of two outcomes: the little window on the rapid test either indicates a negative or a positive result. But this raises the very important question, *how does the test know?* That is, the test is based on an underlying continuous measure that must be dichotomized at some cut-off to yield a positive or negative result.
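
These four quantities are simple functions of the counts in a two-by-two decision table. As a minimal sketch (the counts below are hypothetical, not from any real test):

```python
# Hypothetical counts from testing people whose true status is known.
true_positives = 90    # test positive, truly has the condition
false_negatives = 10   # test negative, truly has the condition
true_negatives = 85    # test negative, truly does not have it
false_positives = 15   # test positive, truly does not have it

sensitivity = true_positives / (true_positives + false_negatives)  # P(test + | disease)
specificity = true_negatives / (true_negatives + false_positives)  # P(test - | no disease)

false_negative_rate = 1 - sensitivity  # missed cases among those with the disease
false_positive_rate = 1 - specificity  # false alarms among those without it
```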

ROC stands for *receiver operating characteristic*, the history of which can be traced back to the development of radar during World War II. Radar was in its infancy and engineers were struggling to determine how it could best be calibrated to maximize the probability of identifying a real threat (an enemy bomber, or a *true positive*) while minimizing the probability of a false alarm (a bird or a rain squall, or a *false positive*). The challenge was where to best set the continuous sensitivity of the receiver (called *gain*) to optimally balance these two outcomes. In other words, there was an infinite continuum of possible gain settings and they needed to determine a specific value that would balance true versus false readings. This is precisely the situation in which we find ourselves when using a brief screening instrument to identify depression or blood antigen levels to identify COVID.

To be more concrete, say that we had a 20-item screening instrument for major depression designed to assess whether an individual should be referred for treatment or not, but we don’t know at what specific score a referral should be made. We thus want to examine the ability of the continuous measure to optimally discriminate between true and false positive decisions across all possible cut-offs on the continuum. To accomplish this, we gather a sample of individuals with whom we conduct a comprehensive diagnostic workup to determine “true” depression, and we give the same individuals our brief 20-item screener and obtain a person-specific scale score that is continuously distributed. We can now construct what is commonly called a *ROC curve* that plots the true positive rate (or sensitivity) against the false positive rate (or one minus specificity) across all possible cut-points on a continuous measure. That is, we can determine how every possible cut-point on the screener discriminates between those who did or did not receive a comprehensive depression diagnosis.

To construct a ROC curve, we begin by creating a bivariate plot in which the *y*-axis represents *sensitivity* (or *true positives*) and the *x*-axis represents one minus *specificity* (or *false positives*). Because we are working in the metric of probabilities, each axis is scaled between zero and one. We are thus plotting the true positive rate against the false positive rate across the continuum of possible cut-points on the screener. Next, a 45-degree line is drawn from the origin (or 0,0 point) to the upper-right corner (or 1,1 point) to indicate random discrimination; that is, for a given cut-point on the continuous measure, you are as likely to make a true positive as a false positive. The key information in the ROC curve, however, comes from superimposing the sample-based curve associated with your continuous screener; this reflects the actual true-vs-false positive rate across all possible cut-offs of your screener. An idealized ROC curve (drawn from Wikipedia) is presented below.
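
The construction just described can be sketched in a few lines of code. This is an illustrative implementation using made-up data, not production software:

```python
import numpy as np

def roc_points(scores, truth):
    """True/false positive rates at every possible cut-off on a screener.

    scores: continuous screener scores; truth: 1 = diagnosed, 0 = not.
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth)
    # Sweep a cut-off across every distinct observed score, with sentinels
    # so the curve runs from (0, 0) to (1, 1).
    cuts = np.concatenate(([np.inf], np.unique(scores)[::-1], [-np.inf]))
    tpr, fpr = [], []
    for c in cuts:
        positive = scores >= c                   # "screen positive" at this cut-off
        tpr.append(positive[truth == 1].mean())  # sensitivity
        fpr.append(positive[truth == 0].mean())  # one minus specificity
    return np.array(fpr), np.array(tpr)

# Hypothetical screener scores and diagnostic outcomes.
fpr, tpr = roc_points([2, 4, 6, 8, 10, 12, 14, 16],
                      [0, 0, 0, 1, 0, 1, 1, 1])
```

Plotting `fpr` against `tpr` traces the sample-based ROC curve from the (0,0) point to the (1,1) point.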

If the screener has no ability to discriminate between the two groups, the sample-based ROC curve will fall on the 45-degree line. However, that rarely happens in practice; instead, the curve capturing the true-to-false positive rates across all possible cut-points will lie above the 45-degree line indicating that the test is performing better than chance alone. The further the ROC curve deviates from the 45-degree line, the better able the screener is to correctly assign individuals to groups. At the extreme, a perfect screener will fall in the upper left corner (the 0,1 point) indicating all decisions are true positives and none are false positives. This too rarely if ever occurs in practice, and a screener will nearly always fall somewhere in the upper-left area of the plot.

But how do we know if our sample-based curve is meaningfully higher than the 45-degree line? Many ways of evaluating this have been proposed, but the most common is computing the *area under the curve*, or AUC. Because the plot defines a unit square (that is, it is one unit wide and one unit tall), 50% of the area of the square falls below the 45-degree line. Because we are working with probabilities, we can literally interpret this to mean that there is a 50-50 chance that a randomly drawn person from the depressed group has a higher score on the screener than a randomly drawn person from the non-depressed group. This of course reflects that the screener does no better than random chance at correctly classifying an individual. But what if the AUC for the screener were, say, .80? This would reflect that there is a probability of .8 that a randomly drawn person from the depressed group will have a higher score on the screener than a randomly drawn person from the non-depressed group. In other words, the screener is able to *discriminate* between the two groups at a higher rate than chance alone. But how high is high enough? There is not really a “right” answer, but conventional benchmarks are that AUCs over .90 are “excellent”, values between .70 and .90 are “acceptable”, and values below .70 are “poor”. Like most general benchmarks in statistics, these are subjective, and much will ultimately depend on the specific theoretical question, measures, and sample at hand. (We could also plot multiple curves to compare two or more screeners, but we don’t detail this here.)
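
The probabilistic interpretation of the AUC can be computed directly by comparing every case-control pair of scores. A hedged sketch with hypothetical data (ties on the screener count as half):

```python
import numpy as np

def auc_concordance(scores, truth):
    """AUC as the probability that a randomly drawn case (truth == 1)
    outscores a randomly drawn non-case (truth == 0); ties count half.
    Illustrative sketch, not production code."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth)
    cases = scores[truth == 1]
    controls = scores[truth == 0]
    # Compare every case-control pair of screener scores.
    greater = (cases[:, None] > controls[None, :]).sum()
    ties = (cases[:, None] == controls[None, :]).sum()
    return (greater + 0.5 * ties) / (cases.size * controls.size)

# Hypothetical screener scores and diagnostic outcomes.
auc = auc_concordance([2, 4, 6, 8, 10, 12, 14, 16],
                      [0, 0, 0, 1, 0, 1, 1, 1])
```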

Note, however, that the AUC is a characteristic of the screener itself, and we have not yet determined the optimal cut-point to use to classify individual cases. For example, say we wanted to determine the optimal value on our 20-item depression screener that would maximize the true positives and minimize the false positives in our referrals of individuals for a comprehensive diagnostic evaluation. Imagine that individual scores could range in value from zero to 50 and we could in principle set the cut-off value at any point on the scale. The ROC curve allows us to compare the true positive to false positive rate across the entire range of the screener and estimate the true vs. false positive classification rates at each and every value of the screener. We can then select the optimal value that best balances true positives against false positives, and that value becomes our cut-off point demarcating who is referred for a comprehensive diagnostic evaluation and who is not. There are a variety of methods for accomplishing this goal, including computing the point at which the curve comes closest to the upper-left corner, the point at which a certain ratio of true-to-false positives is reached, and more recent methods drawn from Bayesian estimation and machine learning. Some of these methods become quite complex, and we do not detail these here.
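
The closest-to-corner rule mentioned above is easy to sketch. For comparison, the sketch also computes Youden's J (sensitivity plus specificity minus one), another simple criterion not named in the text; all cut-offs and rates below are hypothetical:

```python
import numpy as np

# Hypothetical cut-offs on the 0-50 screener with their estimated rates.
cuts = np.array([5, 10, 15, 20, 25])
tpr = np.array([0.99, 0.95, 0.85, 0.60, 0.30])   # sensitivity at each cut-off
fpr = np.array([0.80, 0.45, 0.15, 0.05, 0.01])   # one minus specificity

# Criterion named in the text: the cut-off closest to the upper-left corner (0, 1).
distance_to_corner = np.sqrt(fpr**2 + (1 - tpr)**2)
best_by_distance = int(cuts[np.argmin(distance_to_corner)])

# Youden's J, an alternative simple criterion (maximize tpr - fpr).
youden_j = tpr - fpr
best_by_youden = int(cuts[np.argmax(youden_j)])
```

With these particular rates both criteria select the same cut-off, but that need not happen in general.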

Regardless of the method used, it is important to realize that the optimal cut-point may not be universal but may vary by one or more moderators (e.g., biological sex or age), such that one cut-point is ideal for children and another for adolescents. Further, the ideal cut-point might be informed by the relative costs of false positives and false negatives. For example, consider the more innocuous determination of whether a child might benefit from additional tutoring in mathematics compared to the much more severe determination of whether an individual might suffer from severe depression and be at risk for self-harm. Different criteria might be used in determining the optimal cut-point for the former vs. the latter. Importantly, this statistical architecture is quite general, can be applied across a wide array of settings within the social sciences, and offers a rigorous and principled method to help guide optimal decision making. We offer several suggested readings below.

Fan, J., Upadhye, S., & Worster, A. (2006). Understanding receiver operating characteristic (ROC) curves. *Canadian Journal of Emergency Medicine, 8*, 19-20.

Hart, P. D. (2016). Receiver operating characteristic (ROC) curve analysis: A tutorial using body mass index (BMI) as a measure of obesity. *Journal of Physical Activity Research, 1*, 5-8.

Janssens, A. C. J., & Martens, F. K. (2020). Reflection on modern methods: Revisiting the area under the ROC curve. *International Journal of Epidemiology, 49*, 1397-1403.

Mandrekar, J. N. (2010). Receiver operating characteristic curve in diagnostic test assessment. *Journal of Thoracic Oncology, 5*, 1315-1316.

Petscher, Y. M., Schatschneider, C., & Compton, D. L. (Eds.). (2013). *Applied quantitative analysis in education and the social sciences*. Routledge.

Youngstrom, E. A. (2014). A primer on receiver operating characteristic analysis and diagnostic efficiency statistics for pediatric psychology: we are ready to ROC. *Journal of Pediatric Psychology, 39*, 204-221.


The post This Week on CenterStat Unscripted: Exploratory & Confirmatory Factor Analysis appeared first on CenterStat.

Please join in on the fun at https://www.youtube.com/@centerstat/streams, or by following the link on the *Unscripted* webpage at https://centerstat.org/unscripted/.

For those who can’t catch the live broadcasts, a recording of each episode is posted to the CenterStat YouTube Channel. We can’t promise the recordings will be any more entertaining, but at least you can watch them at double-speed.


The post This Week on CenterStat Unscripted: An Introduction to the Latent Curve Model appeared first on CenterStat.

Please join in on the fun at https://www.youtube.com/@centerstat/streams, or by following the link on the *Unscripted* webpage at https://centerstat.org/unscripted/.

For those who can’t catch the live broadcasts, a recording of each episode is posted to the CenterStat YouTube Channel. We can’t promise the recordings will be any more entertaining, but at least you can watch them at double-speed.


The post Announcing CenterStat Unscripted: A New Weekly YouTube Livestream on Quantitative Methods appeared first on CenterStat.

Why live? It’s a bit like playing with a rusty knife: you just can’t help yourself and you know it will probably end disastrously, but that’s part of the fun. Plus, being live also means Dan and Patrick can interact in real-time with viewers through live chat. This allows anyone to make fun of them from anywhere in the world for all to see. How can you possibly miss out on that?

Each episode will be broadcast Thursdays at noon (U.S. Eastern time zone) starting on February 9th. For those who can’t catch the live broadcasts, a recording of each episode will be posted to the CenterStat YouTube Channel. They can’t promise the recordings will be any more entertaining, but at least you can watch them at double-speed.

Topics will vary from introductory to advanced, and Patrick and Dan welcome any and all suggestions. The inaugural episode (on February 9th) will explore the advantages of moving from multiple regression to structural equation modeling, and the second (on February 16th) will similarly explore the advantages of moving from multiple regression to the multilevel model. Then who the heck knows where they will go after that.

Please join in on the fun at https://www.youtube.com/@centerstat/streams, or you can go directly to the *Unscripted* webpage at https://centerstat.org/unscripted/.


The post What exactly qualifies as intensive longitudinal data and why am I not able to use more traditional growth models to study stability and change over time? appeared first on CenterStat.

To start, there is often much confusion over what constitutes intensive longitudinal data (or ILD), in large part because there exists no formal definition that separates ILD from other types of longitudinal data. That said, ILD tends to fall between two traditional data structures obtained from alternative designs: panel data and time series data. It’s useful to first consider these traditional structures to see how several of their features will combine within ILD.

Historically, the most common method for gathering longitudinal data in psychology and the social and health sciences has been the *panel design*. Typically, a panel design involves assessing a large sample of subjects (say 200 or more) at a much smaller number of time points (say three to six) that tend to be widely spaced in time (say six or 12 months or more). Panel data are often used to empirically examine long-term trajectories of change that might span multiple years, and common analytic methods include the standard latent curve model or a multilevel growth model. (See our prior Help Desk entry on the relation between the LCM and MLM).

A second type of longitudinal design, commonly used in economics among other areas, is the *time series design*, which resides at the opposite end of the continuum from the panel design. More specifically, a time series design is often based on just a single unit that is repeatedly assessed a very large number of times (say 100 to 200 or more) at intervals that tend to be close together in time (say daily or even hourly). Time series data are often used to empirically examine short-term dynamic processes that might unfold hour-by-hour or day-by-day (e.g., the daily closing price of the S&P 500), and many specialized analytic methods exist to fit models to these highly dense data.

ILD tends to fall between the two extremes of panel data on one end and time series on the other. More specifically, ILD tends to have fewer subjects than panel data but more than time series (say 50 or 100 subjects) and more time points than panel data but possibly fewer than time series (say 30 or 40 assessments). Data might be captured using wearable technology (e.g., heart rate or blood pressure monitors) or by sending random prompts throughout the day via smart phones or other electronic devices (e.g., a tone sounds on a smart phone three times throughout the day and an individual is prompted to respond to a brief feelings survey). As a hypothetical example, a study might be designed to randomly measure nicotine cravings and cigarette use in a sample of 50 individuals four times per day for a two-week period, resulting in 56 assessments on each individual, thus falling between traditional panel and time series designs in structure.

In the spirit of *be careful what you ask for*, once you obtain intensive longitudinal data you must then select an optimal modeling strategy to test your motivating hypotheses, and this is not always an easy task. To begin, some longitudinal models that we are familiar with from panel data simply will not work with ILD. Consider the latent curve model (LCM): because the LCM is embedded within the structural equation model, each observed time point is represented by a manifest variable in the model. This works well if the model is fit to annual assessments of some outcome (say antisocial behavior at age 6, 7, 8, 9 and 10) where each age-specific measure serves as an indicator on the underlying latent curve factor. However, the LCM rapidly breaks down with higher numbers of repeated measures in which only one observation may have been obtained at any given assessment (e.g., 9:15am, 9:52am, and so on). For our prior example with 56 repeated assessments taken on 50 subjects, the LCM is simply not an option.

We can next consider the multilevel model (MLM) and it turns out that this option works quite well for many ILD research applications. (See our Office Hours channel on YouTube for a lecture on the MLM with repeated measures data). The MLM approaches the complex ILD structure as nested data in which repeated assessments are nested within individual. Interestingly, unlike the standard LCM, the MLM can be applied to both more traditional panel data and to ILD. The reason is that, whereas the LCM incorporates the passage of time into the factor loading matrix and requires an observed variable at each assessment, the MLM incorporates the passage of time as a numerical predictor in the regression model. As such, the MLM can easily allow for highly dense (meaning many time points) and highly sparse (meaning few or even one assessment is shared by any individual at any given time point) data without problem. (The LCM can under certain circumstances be contorted to accommodate some of these features as well, but the MLM does this seamlessly). However, there are several complications that must be addressed when fitting an MLM to intensive longitudinal data that do not commonly arise in panel data.
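
To make the contrast concrete, here is a minimal sketch of ILD in long format, with time entering as a numeric predictor and completely unaligned assessment times across people. All values are hypothetical, and an ordinary least squares fit stands in for the full mixed model:

```python
import numpy as np

# Hypothetical ILD in long format: (person, hours since start, outcome).
# No two people share an assessment time, which a latent curve model
# (one observed variable per shared occasion) cannot accommodate.
rows = [(1, 0.00, 72.0), (1, 0.27, 75.0), (1, 1.90, 71.0),
        (2, 0.10, 68.0), (2, 2.45, 70.0), (3, 0.80, 74.0)]

time = np.array([t for _, t, _ in rows])
y = np.array([v for _, _, v in rows])

# Time is simply a numeric predictor in the level-1 model. OLS is used here
# as a stand-in for the full mixed model (random effects omitted for brevity).
X = np.column_stack([np.ones_like(time), time])
intercept_and_slope, *_ = np.linalg.lstsq(X, y, rcond=None)
```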

The first issue is what is called *serial correlation* of the residuals for the repeatedly measured outcome. With apologies for the technical terminology, this means that for a given person, when there is a “bump” at one timepoint, that bump tends to carry over to the next time point too. For instance, say a person’s average heart rate is 72 BPM. I measured this person at 9:10am and 9:26am. What I don’t know is that this person was late for their 9:00am job, which led them to move faster and increased their stress, and they had only just arrived at 9:10am. This manifested in a heart rate of 91 BPM at 9:10 and 83 BPM at 9:26. The initial bump has thus not entirely dissipated by the second assessment.

Serial correlation is often not of importance in panel data because these perturbations have long since washed out (the residual correlation goes to zero over the long lags). A person’s heart rate might be higher than usual when I assess them at age 26 because they had a second shot of espresso or got in an argument with a colleague at work, but the effect of the espresso or argument has long since worn off by the time I reassess them at age 27. Of course, even with panel data the repeated measures are correlated, but not because of serial correlation of within-person *residuals*; rather, the correlation arises from individual differences in level and change over time. For instance, some people have consistently higher heart rates and others have consistently lower heart rates, and this stability will lead to across-person positive correlations in repeated measures. We typically model these individual differences in level and change via latent growth factors / random effects when fitting LCMs / MLMs. Such individual differences may be an important source of correlation in ILD too, but we also have to contend with the serially correlated residuals. Although an added complexity, the MLM is quite well suited to incorporating serial correlations such as these. Complex error structures can be defined among the time-specific residuals, such as auto-regressive, Gaussian decay, spatial power, or Toeplitz structures. It is very important that these serial correlations be represented in the model when needed, both to gain insights into the phenomenon under study and to ensure that other parameter estimates of interest are not biased.
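
A small simulation illustrates what an auto-regressive (AR(1)) residual structure, the simplest of those named above, looks like; the carry-over parameter is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = 0.6   # carry-over: how much of a "bump" persists to the next assessment
n = 5000    # a long series of one person's residuals, for illustration

# Today's bump is a decayed copy of the previous one plus a fresh random
# shock: an AR(1) process.
e = np.empty(n)
e[0] = rng.normal()
for t in range(1, n):
    e[t] = phi * e[t - 1] + rng.normal()

# Adjacent residuals are correlated at roughly phi, which is exactly the
# dependence a model with independent residuals would ignore.
lag1_corr = np.corrcoef(e[:-1], e[1:])[0, 1]
```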

A second issue that often arises in ILD is the presence of cycles or transition points that might occur during the assessment period. For example, daily measures taken over several weeks may vary as a function of weekday vs. weekend (e.g., if studying college drinking) or might cycle regularly throughout a day (e.g., hourly heart rate data varying as a function of waking to sleeping and back to waking). Although such cycles and transition points might be present in panel data as well, these are less likely to occur because there are typically fewer time-linked assessments and these tend to aggregate over longer durations (e.g., if we ask “over the past 30 days” to obtain monthly alcohol use levels, these ratings will implicitly smooth over weekday-weekend differences in daily alcohol use). In contrast, multiple cycles might be observed in ILD spanning a 50 or 60 time point series.

Finally, a third issue is the distinction between within- versus between-person effects. Often ILD is collected with the idea of assessing processes as they unfold in real time for individual participants (“life as lived”). For instance, we might be interested in using ILD to test a negative reinforcement hypothesis for alcohol use. That is, we wish to test the proposition that people drink more than they typically do when they are experiencing increased negative affect, under the expectation that this will reduce their negative affect. Using a daily diary study, we measure negative affect each day and alcohol use each night, and we build a model to predict alcohol use from negative affect. To fully assess the negative reinforcement hypothesis, we must differentiate the within-person effect (e.g., when my negative affect is higher than usual I drink more than is typical for me) from any between-person correlation that may also exist (e.g., that people who have higher negative affect in general tend to drink more in general). Fortunately, with the MLM we have well-developed methods for separating within- and between-person effects, although there are some complications to consider (see our prior help desk post specifically on this issue).
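
Person-mean centering is the standard device for this decomposition. A minimal sketch with hypothetical daily negative affect scores:

```python
import numpy as np

# Hypothetical daily negative affect for two people over three days each.
person = np.array([1, 1, 1, 2, 2, 2])
neg_affect = np.array([3.0, 5.0, 4.0, 1.0, 2.0, 3.0])

# Each observation's person mean (the between-person component) and the
# deviation from one's own mean (the within-person component).
between = np.array([neg_affect[person == p].mean() for p in person])
within = neg_affect - between

# `within` carries "more negative affect than usual today"; `between`
# carries "higher negative affect than other people in general". Entering
# both as predictors separates the two effects.
```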

The MLM is thus well suited to address all of these complexities that commonly arise in intensive longitudinal data. Once incorporated, the MLM offers many of the very same advantages as when applied to panel data: time-varying predictors can be incorporated at level-1 with either fixed or random effects, time invariant predictors can be incorporated at level-2, and interactions can be estimated within or across levels of analysis. However, there are two key limitations of the MLM that may or may not arise in a given application. The first is that, similar to the traditional general linear model, the MLM assumes all measures are error-free and all observed variance is “true” variance. This is often (if not always) an unrealistic assumption and violation of this assumption can lead to significant biases in the estimated results. The second is that the MLM only allows for one dependent variable at a time and is thus limited to the estimation of unidirectional effects. Say that you are interested in testing the reciprocal relations between depression during the day and substance use that evening, and you obtain multiple daily measures spanning a week of time. The MLM allows for the estimation of the prediction of substance use from depression, but not the simultaneous estimation of the reciprocal prediction of depression from substance use. As such, the MLM is only evaluating one part of the research hypotheses at hand.

However, recent developments have introduced a new analytic procedure that combines elements of the MLM, the SEM, and time series models called the dynamic structural equation model (or DSEM). The DSEM functionally picks up where the MLM leaves off, but expands the model to potentially include latent factors (to estimate and remove measurement error) and multiple dependent variables (to estimate reciprocal effects between two or more variables over time). DSEM is a recent development and much has yet to be learned about best practices in applied research settings, but it represents a significant development in our ability to fit complex models to ILD.

Want to learn more? We recently had the honor of being invited to provide a series of three lectures on intensive longitudinal data analysis for the American Psychological Association and we have posted our lecture materials in the resources section of the CenterStat home page (https://centerstat.org/apa-ild/). The first session discusses the challenges and opportunities of ILD; the second focuses on the analysis of ILD using the multilevel model; and the third focuses on the analysis of ILD using the dynamic structural equation model. In addition to those resources, below are several suggested readings on the design, collection, and analysis of intensive longitudinal data. Asynchronous access to CenterStat workshops on *Multilevel Modeling* and *Analyzing Intensive Longitudinal Data* is also available to those who might wish to register for additional training. You can also check our workshop schedule for upcoming live offerings.

Good luck with your work!

Asparouhov, T., Hamaker, E. L., & Muthén, B. (2018). Dynamic structural equation models. *Structural Equation Modeling: A Multidisciplinary Journal, 25*, 359-388.

Asparouhov, T., & Muthén, B. (2020). Comparison of models for the analysis of intensive longitudinal data. *Structural Equation Modeling: A Multidisciplinary Journal, 27*, 275-297.

Bolger, N., & Laurenceau, J. P. (2013). *Intensive longitudinal methods: An introduction to diary and experience sampling research*. Guilford Press.

Hamaker, E. L., Asparouhov, T., Brose, A., Schmiedek, F., & Muthén, B. (2018). At the frontiers of modeling intensive longitudinal data: Dynamic structural equation models for the affective measurements from the COGITO study. *Multivariate Behavioral Research, 53*, 820-841.

Hoffman, L. (2015). *Longitudinal analysis: Modeling within-person fluctuation and change*. Routledge.

McNeish, D., & Hamaker, E. L. (2020). A primer on two-level dynamic structural equation models for intensive longitudinal data in Mplus. *Psychological Methods, 25*, 610-635.

McNeish, D., Mackinnon, D. P., Marsch, L. A., & Poldrack, R. A. (2021). Measurement in intensive longitudinal data. *Structural Equation Modeling: A Multidisciplinary Journal, 28*, 807-822.

Walls, T. A., & Schafer, J. L. (Eds.). (2006). *Models for intensive longitudinal data*. Oxford University Press.


The post CenterStat Partners with APA to Offer Free Training on Intensive Longitudinal Data appeared first on CenterStat.

The full program is available for free (to both APA members and non-members), and links to individual sessions are provided below. Sessions will be livestreamed, and recordings will be made available approximately two weeks after each live seminar to all those who registered. The lecture materials for the last three sessions, delivered by Dan Bauer & Patrick Curran, are available here.

**August 31:** Training for the Collection of Real-World Biobehavioral Data Using Wearable Devices
Presenter: Benjamin Nelson, PhD

**September 15:** Introduction to Intensive Longitudinal Methods
Presenter: Jean-Philippe Laurenceau, PhD

**October 4:** Intensive Longitudinal Data: Methodological Challenges and Opportunities
Presenters: Daniel Bauer, PhD, and Patrick Curran, PhD

**October 6:** Intensive Longitudinal Data: A Multilevel Modeling Perspective
Presenters: Daniel Bauer, PhD, and Patrick Curran, PhD

**October 11:** Intensive Longitudinal Data: A Dynamic Structural Equation Modeling Perspective
Presenters: Daniel Bauer, PhD, and Patrick Curran, PhD

Participants who decide to pursue more in-depth training on ILD after attending these seminars may wish to consider enrolling in our full workshops on Multilevel Modeling (by Dan Bauer and Patrick Curran) and Analyzing Intensive Longitudinal Data (by Jean-Philippe Laurenceau and Niall Bolger).


The post What’s the best way to determine the number of latent classes in a finite mixture analysis? appeared first on CenterStat.

One of the single most difficult tasks in finite mixture modeling is to determine the number of classes within the population, a process sometimes referred to as *class enumeration*. Typically, one will fit a finite mixture model using maximum likelihood estimation, in which the number of classes must be declared as part of the model specification. Thus, the analyst will fit a model with 1 class, then 2 classes, then 3, etc., and then compare the fit of these models to try to determine the optimal number of classes. Various approaches to determining the optimal number of classes can be considered, but they generally fall into three primary categories: likelihood ratio tests, information criteria, and entropy statistics. Let’s consider each in turn. (And, yes, there are Bayesian approaches to this problem too, but they aren’t widely used in practice so we won’t be addressing those.)

One approach for evaluating the number of classes is to use a likelihood ratio test (LRT). LRTs represent a general procedure for testing between nested models, i.e., where one model consists of parameters that are a restricted subset of the parameters of the other model. The LRT is computed as –2 times the difference in the log-likelihoods of the two models and, under certain regularity conditions (essentially *assumptions*), it is distributed as a central chi-square with degrees of freedom equal to the difference in the number of estimated parameters. From the chi-square, we obtain a *p*-value under the null hypothesis that the simpler model is the right one. Effectively we are saying: look, we know that if we throw more parameters at the model it will fit the sample data better (i.e., the log-likelihood improves), but is this improvement greater than we would expect by chance alone given the number of parameters added (the degrees of freedom of the LRT)? If the *p*-value is significant, then we conclude that it is a greater improvement than we would expect by chance, rejecting the simpler model in favor of the more complex model. If it’s not significant, then we conclude there is not a meaningful difference between the two models and we retain the simpler model. In other words, we conclude that the extra parameters may just be overfitting, picking up random variation or noise in the sample that doesn’t reflect the true underlying structure in the population.

That is how we typically use LRTs in a traditional modeling framework, but let’s think about how we would apply this general testing approach to determine the number of classes in a finite mixture. First, we can establish that a *K*-class model is nested within a *K*+1-class model. For instance, one could set the mixing probability (prevalence rate) of one class in the *K*+1-class model to zero. Presto, this deletes one of the classes to produce a *K*-class model. So far so good. Now we fit models with 1 vs. 2 classes, calculate the LRT, and if the *p*-value is significant we say 2 classes are better than 1. Then we test 2 vs. 3 classes, 3 vs. 4 classes, etc., and stop when we get to the point that adding another class no longer results in a significant improvement in model fit. But where things get complicated is in the fine print of the likelihood ratio test. The regularity conditions required for the test distribution to be a central chi-square aren’t met when testing a *K*- versus a *K*+1-class model. So while it still makes sense to conduct likelihood ratio tests, we no longer have the familiar chi-square with which to obtain *p*-values. We need to somehow modify how we conduct LRTs for use in this context.

One option is to bootstrap the test distribution. McLachlan (1987) proposed a parametric bootstrapping procedure that involves (1) simulating data sets from the *K*-class model estimates that were obtained from the real data; (2) fitting *K*- and *K*+1-class models to the simulated data sets; (3) computing the likelihood ratio test statistic for each simulated data set; and (4) using the distribution of bootstrapped LRT values to obtain the *p*-value for the likelihood ratio test statistic obtained with the real data. It’s a clever approach, but somewhat computationally intense, especially if one wants a precise *p*-value. The other option is to derive the correct theoretical test distribution for the LRT. Lo, Mendell, and Rubin (2001) performed these derivations, determining it (appropriately enough) to be a mixture of chi-squares. They also provided an ad-hoc adjusted version of the test with somewhat better performance at realistic sample sizes. Simulation studies, however, have shown the Lo-Mendell-Rubin LRT (original and adjusted versions) to have elevated Type I error rates for some models, whereas the bootstrapped LRT consistently works well. We thus tend to prefer the bootstrapped LRT, despite its greater computational demands (which is an increasingly less relevant concern given the ever-improving computational speed of even the lowliest desktop computers).
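
The four bootstrapping steps above can be sketched with scikit-learn’s `GaussianMixture`. This is a hypothetical one-dimensional example with two well-separated classes; the number of bootstrap replicates is kept unrealistically small for speed, and in practice one would use hundreds:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# toy data: two well-separated latent classes
X = np.vstack([rng.normal(0, 1, (150, 1)), rng.normal(4, 1, (150, 1))])

def loglik(model, data):
    return model.score(data) * len(data)   # score() is the mean log-likelihood

def fit(k, data):
    return GaussianMixture(n_components=k, n_init=3, random_state=0).fit(data)

K = 1
mK, mK1 = fit(K, X), fit(K + 1, X)
lrt_obs = 2 * (loglik(mK1, X) - loglik(mK, X))   # observed LRT statistic

B, exceed = 20, 0                                # B far too small in practice
for b in range(B):
    Xb, _ = mK.sample(len(X))                    # (1) simulate from K-class fit
    lrt_b = 2 * (loglik(fit(K + 1, Xb), Xb)      # (2)-(3) refit and compute LRT
                 - loglik(fit(K, Xb), Xb))
    exceed += lrt_b >= lrt_obs
p_boot = (exceed + 1) / (B + 1)                  # (4) bootstrap p-value
```

With these clearly separated classes, the observed LRT dwarfs the bootstrap distribution and the 2-class model is supported.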

A second approach to evaluating the number of classes is to use information criteria (IC). Two well-known information criteria are Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC), but there are many others. What ICs generally try to do is balance the *fit* of the model against the *complexity* of the model. Fit is measured by –2 times the log-likelihood and a penalty is then applied for complexity, usually some function of the number of parameters and/or sample size. Often, but not always, ICs are scaled so that smaller values are better. So one would fit models with 1, 2, 3, etc. classes and then select the model with the lowest IC value as providing the best balance of fit against complexity. Different ICs were motivated in different ways and implement different penalties. Some penalties are stiffer than others; for instance, the BIC penalty usually exceeds the AIC penalty. When choosing the number of classes, simulation studies have shown AIC to be too liberal (it tends to support taking too many classes), whereas BIC generally does well as long as the classes are reasonably well separated. For less distinct classes (that is, classes that may reside closer together and are thus harder to discern), a sample size-adjusted version of the BIC, which ratchets down the penalty a bit, sometimes performs better. While there are many different ICs to choose from, we generally find the BIC to be a reasonable choice.
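
A minimal class-enumeration loop based on ICs might look like the following sketch, assuming scikit-learn’s `GaussianMixture` and simulated data with two well-separated classes:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# simulated data: two well-separated classes
X = np.vstack([rng.normal(-3, 1, (200, 1)), rng.normal(3, 1, (200, 1))])

results = {}
for k in range(1, 5):                     # fit 1- through 4-class models
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    results[k] = {"aic": gm.aic(X), "bic": gm.bic(X)}

best_k = min(results, key=lambda k: results[k]["bic"])
# with two well-separated components, BIC should bottom out at k = 2
```

The same loop with `"aic"` in place of `"bic"` illustrates AIC’s more liberal penalty: it will sometimes select more classes on the same data.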

A third common approach is to consider the *entropy* of the model. Entropy is a measure of how accurately one could assign cases to classes. Finite mixture models are probabilistic classification models in the sense that there is not a hard partition of the sample into non-overlapping clusters; instead, there is a probability that each person belongs to each class, and these probabilities sum to 1.0 for each individual, reflecting that there is a 100% chance they belong to one of the classes. However, sometimes one is interested in producing such a hard partition based on the probabilities, for instance by assigning a case to the class to which they most likely belong, a technique called *modal assignment*. If the probabilities of class membership tend toward zero and one, then this implies that there should be few errors of assignment. But as the probabilities move away from zero and one this reflects greater uncertainty about how to assign cases and an increased rate of assignment errors. For instance, if my probabilities for belonging to Classes 1 and 2 are .9 and .1, there’s a 90% chance I would be correctly assigned to Class 1. That’s pretty good. But if my probabilities are .6 and .4, there is only a 60% chance that placing me into Class 1 would be the right decision. Entropy summarizes the uncertainty of class membership across all individuals, providing a sense of how accurately one can classify based on the model.

There are several different types of entropy-based statistics. Some are of the same form as the ICs described above, in which the fit of the model is balanced against a penalty that is now a function of entropy (e.g., the classification likelihood criterion). Others are transformations of entropy to make interpretation easier (e.g., the normalized entropy criterion). The *E* entropy statistic developed by Ramaswamy et al. (1992) is particularly popular – it has a nice scale, ranging from 0 to 1, with 1 indicating perfect accuracy, and is standard output in some software (e.g., Mplus). One might thus calculate *E* values (or some other entropy-based statistic) for models with different numbers of latent classes and then select the model with the greatest classification accuracy. But this presupposes that one wants to select a model that consists of well-separated classes. Sometimes, classes aren’t well separated. Consider that there is a well-recognized height difference between adult men and women, yet men are only about 7% taller than women on average, so there is a lot of overlap between the height distributions. It seems reasonable to assume that latent classes will overlap at least as much as natural groups do, so entropy may be a poor guide to the number of classes in many realistic scenarios. Thus, in most cases, it is probably best not to use entropy to guide class enumeration, but instead to consider it a property of the model that is ultimately selected. That is, determine the number of classes using the BIC and/or bootstrapped likelihood ratio test, then examine the entropy as a descriptive statistic of the selected model.
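
As an illustration, the relative entropy statistic of Ramaswamy et al. (1992) can be computed directly from the matrix of posterior class probabilities: E = 1 − Σᵢ Σₖ (−pᵢₖ ln pᵢₖ) / (n ln K). The sketch below uses scikit-learn’s `GaussianMixture` on simulated, very well-separated classes, so the value should land near 1:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def relative_entropy(post):
    """Ramaswamy et al. (1992) relative entropy from an n x K matrix of
    posterior class membership probabilities."""
    n, K = post.shape
    p = np.clip(post, 1e-12, 1.0)          # guard against log(0)
    return 1.0 - np.sum(-p * np.log(p)) / (n * np.log(K))

rng = np.random.default_rng(3)
# simulated data: two classes separated by 8 standard deviations
X = np.vstack([rng.normal(-4, 1, (200, 1)), rng.normal(4, 1, (200, 1))])
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
E = relative_entropy(gm.predict_proba(X))
# posteriors are nearly 0/1 here, so E is close to 1
```

With overlapping classes (say, means of −0.5 and 0.5), the same code yields a much lower E even when the 2-class model is correct, which is exactly why entropy is a poor enumeration guide.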

So we seem to have arrived at a straightforward set of recommendations. First, fit models with 1, 2, 3, etc. latent classes (until estimation breaks or we reach some practically useful / theoretically plausible upper bound of, say, 10 classes). Second, compare the fit of these models using your preferred information criterion (perhaps the BIC, perhaps the sample-size-adjusted BIC). Also use the bootstrapped likelihood ratio test to get formal *p*-values. Hope your IC of choice and the bootstrapped LRT arrive at the same answer. Third, write your paper. How hard can all of that possibly be? Well, sometimes (maybe even oftentimes) this process doesn’t work, occasionally in small ways and occasionally in blow-up-in-your-face ways. You might end up selecting a model that is problematic, like having a very small class that is impractical and which you suspect may just reflect outliers or over-fitting to the data. Or you might select a model where, substantively, some of the classes seem similar enough that it isn’t worth distinguishing them. In such cases, you might use your content-area knowledge (expert opinion) to decide that maybe the quantitatively “best” model isn’t as useful as the next-best model. Of course, this introduces subjectivity to the model selection process, and people may disagree about these decisions, so you want to justify your choice.

Other times, IC values just keep getting better as classes are added to the model and bootstrapped LRTs just keep giving significant results. This seems to happen a lot when analyzing especially large samples. What this reveals is a problem in our logic so far. To this point, we’ve assumed that the finite mixture model is *literally correct*: that is, there is some number of latent groups mixed together in the population and our job is to go find that number. But what if the model isn’t literally correct? Arguably, all models represent imperfect approximations to the true data generating process. We hope these models recover important features of the underlying structure, but we don’t necessarily regard them as correct. From this perspective, there isn’t some number of true classes to find. But, if that is the case, then what are we doing when we conduct class enumeration? We would argue that we are evaluating different possible approximations to the data, trying to discern how many classes it takes to recover the primary structure without taking so many that we are starting to capture noise or nuisance variation.

At small sample sizes, we can only afford a gross approximation with few classes, but with higher sample sizes, we can start to recover finer structure with more classes. That finer structure may not always be of substantive interest, but it’s there, and traditional class enumeration procedures (BIC, etc.) will reward models that recover it. For example, with a modest amount of data we might be able to identify differences in attitudes, behavior, fashion, and speech between individuals living in broad regions of the United States, like the Northeast and Southwest. With more data, we might be able to see more nuanced differences, separating into smaller regions like mid-Atlantic states, upper Midwest, etc. In reality, the states (aside from Alaska and Hawaii) are contiguous, and attitudes, behavior, fashion, and speech patterns vary continuously over complex cultural and geographic gradients. Nevertheless, regional classifications capture important differences in local conditions. There’s no right number of classifications, just differences in fineness. With enough data, we can make our classes extremely local, but this might not always be useful to do.

Ultimately, then, there is an inconsistency between the perspective motivating the development and evaluation of traditional class enumeration procedures (that there is a true number of classes to find) and the context within which these are applied in practice (where the model is an approximation). This can lead to problems like seeing support for more and more classes at larger and larger sample sizes. In such cases, the number selected may again be determined more by subjective considerations such as the size, distinctiveness, and practical utility of the classes.

In sum, standard practice in determining the number of classes for a finite mixture model is to fit models with 1, 2, 3, etc. classes using maximum likelihood estimation, then compare fit using specialized likelihood ratio tests (the bootstrapped LRT or the Lo-Mendell-Rubin LRT), information criteria (BIC, AIC, etc.), or entropy, and to try to objectively triangulate on an optimal number. Simulation studies suggest the bootstrapped LRT and BIC generally work well. However, these presuppose that there is some true number of classes to find. In most instances, a more realistic perspective is that the model is instead providing an approximation to the underlying structure and there may not be a true number of classes to find. Even the archetypal concept of species undergirding our example with the finches is a bit more muddled than we learned in high school biology. On this view, the goal of our analysis is to select a number of classes that recovers the important features of the data without capturing noise or nuisance variation. Traditional class enumeration procedures can still serve as a useful guide, balancing fit and parsimony in quantifiable ways, but content-area knowledge also plays an important role in determining how fine to make the approximation before it becomes impractical and unwieldy.

**References**

Henson, J.M., Reise, S.P., & Kim, K.H. (2007). Detecting mixtures from structural model differences using latent variable mixture modeling: A comparison of relative model fit statistics. *Structural Equation Modeling, 14*, 202-226.

Kim, S.-Y. (2014). Determining the number of latent classes in single- and multiphase growth mixture models. *Structural Equation Modeling, 21*, 263-279.

Liu, M., & Hancock, G.R. (2014). Unrestricted mixture models for class identification in growth mixture modeling. *Educational and Psychological Measurement*, Online First.

Lo, Y., Mendell, N.R., & Rubin, D.B. (2001). Testing the number of components in a normal mixture. *Biometrika, 88*, 767–778.

McLachlan, G.J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. *Journal of the Royal Statistical Society, Series C, 36*, 318-324.

McLachlan, G., & Peel, D. (2000). *Finite mixture models*. New York: Wiley.

Nylund, K.L., Asparouhov, T. & Muthen, B.O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. *Structural Equation Modeling, 14*, 535-569.

Ramaswamy, V., DeSarbo, W.S., Reibstein, D.J., & Robinson, W.T. (1992). An empirical pooling approach for estimating marketing mix elasticities with PIMS data. *Marketing Science, 12*, 241-254.

Tofighi, D., & Enders, C.K. (2008). Identifying the correct number of classes in growth mixture models. In G.R. Hancock & K.M. Samuelsen (Eds.), *Advances in Latent Variable Mixture Models* (pp. 317-341). Greenwich, CT: Information Age.

The post What’s the best way to determine the number of latent classes in a finite mixture analysis? appeared first on CenterStat.

The post My advisor told me to use principal components analysis to examine the structure of my items and compute scale scores, but I was taught not to use it because it is not a “true” factor analysis. Help! appeared first on CenterStat.

Help, indeed. This issue has been a source of both confusion and contention for more than 75 years, and papers have been published on this topic as recently as just a few years ago. A thorough discussion of principal components analysis (PCA) and the closely related methods of exploratory factor analysis (EFA) would require pages of text and dozens of equations; here we will attempt to present a more succinct and admittedly colloquial description of the key issues at hand. We can begin by considering the nature of *composites*.

Say that you were interested in obtaining scores on negative affect (e.g., sadness, depression, anxiety) and you collected data from a sample of individuals who responded to 12 items assessing various types of mood and behavior (e.g., sometimes I feel lonely, I often have trouble sleeping, I feel nervous for no apparent reason, etc.). The simplest way to obtain a composite scale score would be to compute a mean of the 12 items for each person to represent their overall level of negative affect. This is often called an *unweighted* linear composite because all items contribute equally and additively to the scale score: that is, you simply add them all up and divide by 12. This approach is widely used in nearly all areas of social science research.

However, now imagine that you could compute *more* than one composite from the set of 12 items. For example, you might not believe a single overall composite of negative affect exists, but that there is one composite that primarily reflects *depression* and another that primarily reflects *anxiety*. This is initially very strange to think about because you want to obtain *different *composites from the *same* 12 items. The key is to *differentially* weight the items for each composite you compute. You might use larger weights for the first six items and smaller weights for the second six items to obtain the first composite, and then use smaller weights for the first six items and larger weights for the second six items to obtain the second composite. Now instead of having a single overall composite of the 12 items assessing negative affect, you have one composite that you might choose to label *depression* and a second composite that you might choose to label *anxiety*, and both were based on differential weighting of the same 12 items. This is the core of PCA.
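
The idea of differential weighting can be made concrete with a toy sketch (the data and weights here are purely illustrative, not estimated from anything):

```python
import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(100, 12))   # 100 hypothetical respondents, 12 items

# unweighted composite: simple mean of all 12 items
negative_affect = items.mean(axis=1)

# differentially weighted composites from the SAME 12 items
# (illustrative weights only)
w_dep = np.array([.8] * 6 + [.2] * 6)   # items 1-6 weighted heavily
w_anx = np.array([.2] * 6 + [.8] * 6)   # items 7-12 weighted heavily
depression = items @ w_dep
anxiety = items @ w_anx
```

Each respondent thus receives one score per composite; only the weights change across composites, which is the core move in PCA.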

PCA dates back to the 1930s and was first proposed by Harold Hotelling as a *data reduction method*. His primary motivation was to take a larger amount of information and reduce it to a smaller amount of information by computing a set of weighted linear composites. The goal was for the composites to reflect *most*, though not *all*, of the original information. He accomplished this through the use of the eigenvalues and eigenvectors associated with the correlation matrix of the full set of items. Eigenvalues represent the variance associated with each composite, and eigenvectors represent the weights used to compute each composite. In our example, the first two eigenvalues would represent the *variances* of the depression and anxiety composites, and the eigenvectors or *weights* would tell us how much each item contributes to each composite. It is possible to compute as many composites as items (so we could compute 12 composites based on our 12 items) but this would accomplish nothing in terms of data reduction because we would simply be exchanging 12 items for 12 composites. Instead, we want to compute a much smaller number of composites than items that represent *most* but not *all* of the observed variance (so we might exchange 12 items for two or three composites). The cost of this reduction is some loss of information, but the gain is being able to work with a smaller number of composites relative to the original set of items.
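
Hotelling’s construction can be reproduced in a few lines: eigen-decompose the item correlation matrix, treat the eigenvalues as component variances and the eigenvectors as weights, and retain only the first few components. The data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
items = rng.normal(size=(500, 12))          # 500 respondents, 12 items
R = np.corrcoef(items, rowvar=False)        # 12 x 12 correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)        # eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]           # sort largest-first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# retain the first two components; score via standardized items
Z = (items - items.mean(0)) / items.std(0)
scores = Z @ eigvecs[:, :2]
```

A useful check: the eigenvalues of a correlation matrix always sum to the number of items (here 12), so the proportion of variance retained by the first two components is `eigvals[:2].sum() / 12`.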

There are many heuristics used to determine the “optimal” number of composites to extract from a set of items. Methods include the Kaiser-Guttman rule, looking for the “bend” in a scree plot of eigenvalues, parallel analysis, and evaluating the incremental variance associated with each extracted component. There are also many methods of “rotation” that allow us to rescale the item weights in particular ways to make the underlying components more interpretable (helping us “name” the factors). For example, if the first six items assessed things like sadness and loneliness and had large weights on the first component but smaller weights on the second, we might choose to name the first component “depression”, and so on. Often, the end goal is to obtain conceptually meaningful weighted composite scores for later analyses.
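
One of these heuristics, parallel analysis, is straightforward to sketch: retain components whose observed eigenvalues exceed the average eigenvalues obtained from random data of the same dimensions. The two-factor data below are simulated purely for illustration:

```python
import numpy as np

def parallel_analysis(items, n_sims=50, seed=0):
    """Retain components whose observed eigenvalues exceed the mean
    eigenvalues of correlation matrices of random-normal data."""
    rng = np.random.default_rng(seed)
    n, p = items.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(items, rowvar=False)))[::-1]
    sims = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.normal(size=(n, p))
        sims[s] = np.sort(
            np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    return int(np.sum(obs > sims.mean(axis=0)))

# hypothetical two-factor data: 12 items, 6 loading on each factor
rng = np.random.default_rng(1)
F = rng.normal(size=(500, 2))
L = np.zeros((12, 2)); L[:6, 0] = .8; L[6:, 1] = .8
items = F @ L.T + rng.normal(scale=.6, size=(500, 12))
n_keep = parallel_analysis(items)   # expect 2 with this structure
```

Simulation work generally finds parallel analysis more trustworthy than the Kaiser-Guttman eigenvalue-greater-than-one rule, which tends to over-extract.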

Although Hotelling developed PCA strictly as a method of data reduction and composite scoring (indeed, he never even discussed rotation because he was not interested in interpreting individual items), over time this method came to be associated with a broader class of models called exploratory factor analysis, or EFA. The goals of EFA are often very similar to those of PCA and might include scale development, understanding the psychometric structure underlying a set of items, obtaining scale scores for later analysis, or all three. There are many steps in EFA that overlap with those of PCA, including identifying the optimal number of factors to extract; how to rescale (or “rotate”) the factor loadings to enhance interpretation; how to “name” the factors based on what items are weighted more vs. less; and how to compute optimal scores. Given these similarities, there has long been contention about whether PCA is a formal member of the EFA family, or if PCA is not a “true” factor model but instead something distinctly different.

Contention on this point centers on a key defining feature of PCA: it assumes that all items are measured *without error* and all observed variance is available for potential factoring. When fewer composites are taken than the number of items, some residual variance in the items will be left over, but this is still considered “true” variance and not measurement error. In contrast, EFA explicitly assumes that the item responses may be, and indeed very likely are, characterized by measurement error. As such, whereas PCA expresses the components as a direct function of the items (that is, the items *induce* the components), EFA conceptually reverses this relation and instead expresses the items as a function of the underlying latent factors. The factors are “latent” in the sense that we believe them to exist but they are not directly observed, and our motivating goal is to infer their existence based on what we did observe: namely, the items.

Of critical importance is that, unlike the PCA, the EFA assumes that only *part* of the observed item variance is true score variance and the remaining part is explicitly defined as measurement error. Although this assumption allows the model to more accurately reflect what we believe to exist in the population (we nearly always recognize there is the potential for measurement error in our obtained items), this also creates a significant challenge in model estimation because the measurement errors are additional parameters that must be estimated from the data. Whereas PCA can be computed directly from our observed sample data, EFA requires us to move to more advanced methods that allow us to obtain optimal estimates of population parameters via iterative estimation. There are many methods of estimation that can be used in the EFA (e.g., unweighted least squares, generalized least squares, maximum likelihood), each of which has certain advantages and disadvantages. In general, maximum likelihood is often viewed as the “gold standard” method of estimation in most research applications.
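
The contrast can be seen directly in software. The sketch below uses scikit-learn, whose `FactorAnalysis` fits the common factor model by maximum likelihood and estimates a unique (error) variance for every item, something PCA does not do. The data are simulated, with a true error variance of 0.36 per item:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

# simulated two-factor data with known measurement error
rng = np.random.default_rng(5)
F = rng.normal(size=(500, 2))
L = np.zeros((12, 2)); L[:6, 0] = .8; L[6:, 1] = .8
items = F @ L.T + rng.normal(scale=.6, size=(500, 12))   # error var = .36

efa = FactorAnalysis(n_components=2, random_state=0).fit(items)
pca = PCA(n_components=2).fit(items)

unique_vars = efa.noise_variance_      # one estimated error variance per item
factor_scores = efa.transform(items)   # model-implied factor score estimates
# PCA has no analogue of noise_variance_: all observed variance is factored
```

The recovered `unique_vars` should hover near the true 0.36, illustrating how EFA partitions observed variance into common and error components.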

We can think about four key issues that ultimately distinguish PCA from EFA:

- The theoretical model is *formative* in PCA and *reflective* in EFA. In other words, the composites are viewed as a function of the items in PCA, but the items are viewed as a function of the latent factors in EFA.
- PCA assumes all observed variance among a set of items is available for factoring, whereas EFA assumes only a subset of the observed variance among a set of items is available for factoring. This implies that PCA assumes no measurement error while EFA explicitly incorporates measurement error into the model.
- Although both PCA and EFA allow for the creation of weighted composites of items, in PCA these are direct linear combinations of items whereas in EFA these are model-implied estimates (or predicted values) of the underlying latent factors. As such, in PCA there is only a single method for computing composites, but in EFA there are many (e.g., regression, Bartlett, constrained covariance, etc.), all of which can differ slightly from one to the other.
- Finally, the confusion between PCA and EFA is exacerbated by the fact that in nearly all major software packages PCA is available as part of the “factor analysis” estimation procedures (e.g., in SAS PROC FACTOR a PCA is defined using “method=principal” but an EFA is defined using “method=ML”).

It is difficult to draw firm guidelines for when and if to use PCA in practice. It depends on the underlying theory, the characteristics of the sample, and the goals of the analysis. In most social science applications, particularly those focused on the measurement of psychological constructs, it is often best to use EFA because this better represents what we believe to hold in the population. However, if EFA is not possible due to estimation problems, or if there is an exceedingly large number of items under study, then PCA is a viable alternative. Interestingly, PCA has begun to make a recent comeback in usage within psychology given increased interest in machine learning. It is not uncommon for PCA to be applied to 50 or 100 variables in order to distill them down to a smaller number of composites to be used in subsequent analysis.

Our general recommendation is to initially consider EFA estimated using ML as your first best option, both for model fitting and score estimation. This is because, far more often than not, the EFA model better represents the mechanism we believe to have given rise to the observed data; namely, a process that combines both true underlying construct variation and random measurement error. However, if the EFA is not viable for some reason, then PCA is a perfectly defensible option as long as the omission of measurement error is clearly recognized. Finally, all of the above relates to the exploratory factor analysis model in which all items load on all underlying factors. In contrast, the confirmatory factor analysis (CFA) model allows for *a priori* tests of measurement structure based on theory. If there is a stronger underlying theoretical model under consideration, then CFA is often a better option. We discuss the CFA model in detail in our free three-day workshop, *Introduction to Structural Equation Modeling*.

Below are a few readings that might be of use.

Brown, T. A. (2015). *Confirmatory factor analysis for applied research*. New York: Guilford Press.

Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. *Psychological Assessment, 7*, 286-299.

Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. *Psychological Methods, 4*, 272-299.

Widaman, K. F. (1993). Common factor analysis versus principal component analysis: Differential bias in representing model parameters? *Multivariate Behavioral Research, 28*, 263-311.

Widaman, K. F. (2018). On common factor and principal component representations of data: Implications for theory and for confirmatory replications. *Structural Equation Modeling: A Multidisciplinary Journal, 25*, 829-847.

The post My advisor told me to use principal components analysis to examine the structure of my items and compute scale scores, but I was taught not to use it because it is not a “true” factor analysis. Help! appeared first on CenterStat.

The post Second Annual Winter Institute appeared first on CenterStat.

And don’t forget that you can also obtain asynchronous access to any of our great workshops from last Spring, now including our **free** *Introduction to Structural Equation Modeling* class. Videos are now available for six months from registration and all other materials can be downloaded and retained indefinitely.

The post Second Annual Winter Institute appeared first on CenterStat.

The post I fit a multilevel model and got the warning message “G Matrix is Non-Positive Definite.” What does this mean and what should I do about it? appeared first on CenterStat.

First, let’s translate the technical jargon. Following Laird & Ware (1982), many software programs used to fit multilevel models use the label **G** to reference the covariance matrix of the random effects. For instance, for a linear growth model, we might include both a random intercept and a random slope for time to capture (unexplained) individual differences in starting level and rate of change. In fitting the model, we don’t estimate the individual values of the random effects directly. Instead, we estimate the variances and covariance of the random effects, i.e., a variance for the intercepts, a variance for the slopes, and a covariance between intercepts and slopes. These variances and covariances are contained in the matrix **G**. Similarly, with hierarchically nested data (e.g., children nested within classrooms or patients nested within physicians), we use random effects to capture (unexplained) cluster-level differences. Random intercepts capture between-cluster differences in outcome levels whereas random slopes capture between-cluster differences in the effects of predictors. Again, as part of fitting the model, we need to estimate the variances and covariances of these random effects and, again, these variance and covariance parameters are contained within the **G** matrix. Note that some software programs may use a different label for the covariance matrix of the random effects, but for this post we will use the common notation of **G** throughout.
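
As a hedged illustration, here is how one might fit a random intercept-and-slope growth model with statsmodels’ `MixedLM` and inspect the resulting **G** matrix (the data are simulated for the example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate a linear growth process: 200 people, 5 occasions each
rng = np.random.default_rng(2)
n, t = 200, 5
person = np.repeat(np.arange(n), t)
time = np.tile(np.arange(t), n)
b0 = rng.normal(10, 2, n)                    # person-specific intercepts
b1 = rng.normal(1, 0.5, n)                   # person-specific slopes
y = b0[person] + b1[person] * time + rng.normal(0, 1, n * t)
df = pd.DataFrame({"y": y, "time": time, "id": person})

# random intercept and random slope for time
m = smf.mixedlm("y ~ time", df, groups="id", re_formula="~time").fit()
G = m.cov_re   # 2 x 2 covariance matrix of the random effects (the G matrix)
```

With this healthy simulated data, both diagonal elements of `G` (the intercept and slope variances) are comfortably positive, so no NPD warning arises.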

When the **G** matrix is non-positive definite (NPD) this means that there are fewer dimensions of variation in the matrix than the expected number (i.e., the number of rows or columns of the matrix, corresponding to the number of random effects in the model). For instance, in our linear growth model example, there are two potentially correlated dimensions of variation specified in the **G** matrix, one corresponding to the random intercepts and one corresponding to the random slopes for time. This is no different than what we would expect for any two variables. If we measured height and weight, for instance, there would be variation in height, variation in weight, and some covariation between height and weight, and this would be captured in the 2 x 2 covariance matrix for the two variables. Here we are simply considering random effects rather than measured variables, but the principle remains the same. Now, imagine what would happen if there was no variation for one variable or random effect. For instance, suppose there were no individual differences in rate of change, making the variance of the slopes equal to zero? Then there would be only one remaining dimension of variation in the matrix (reflecting the random intercepts) and **G** would be NPD (having fewer actual dimensions of variation than its specified number of rows/columns). Thus, one way an NPD **G** matrix can arise is if one (or more) of the random effects in the fitted model has a variance of zero.

However, this is not the only possible way to obtain an NPD **G** matrix. For example, what happens if the intercepts and slopes of our growth model are perfectly correlated (e.g., *r* = 1.0 or –1.0)? Then the two random effects are redundant with one another and actually represent just one dimension of variation. Again, this would lead the **G** matrix to be NPD. More technically, any time one random effect can be expressed as a perfect linear function of the other random effects, the **G** matrix will be NPD. Note that, depending on whether or not your software program implements boundary constraints on the variance and covariance parameters, you can even get negative variance estimates for random effects or correlations exceeding ±1 (known as improper estimates).
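
Both failure modes can be detected numerically: a covariance matrix is non-positive definite when its smallest eigenvalue is at or below zero. A small sketch with hypothetical matrix values:

```python
import numpy as np

def is_npd(G, tol=1e-10):
    """True when the smallest eigenvalue of G is (numerically) <= 0."""
    return np.min(np.linalg.eigvalsh(G)) <= tol

G_ok = np.array([[4.0, 0.5],     # healthy G: intercept var 4, slope var .25,
                 [0.5, 0.25]])   # modest intercept-slope covariance

G_zero = np.array([[4.0, 0.0],   # slope variance of exactly zero
                   [0.0, 0.0]])

r = 1.0                          # perfect intercept-slope correlation
G_corr = np.array([[4.0, r * 2 * 0.5],
                   [r * 2 * 0.5, 0.25]])
# is_npd flags G_zero and G_corr but not G_ok
```

This is the same check software performs internally before issuing the “G Matrix is Non-Positive Definite” warning.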

Now let’s consider why an NPD covariance matrix for the random effects is usually a problem. Typically, when one includes random effects in a multilevel model, the assumption is that they “exist” as distinguishable components of variation. For instance, our growth model states that people differ in their starting points and rates of change, differences captured by the random intercepts and slopes included in the model specification. When we include random effects like these in our models, we expect them to have variance and, while they might be correlated with one another, none is thought to be fully redundant with the others. When we receive the “G matrix is non-positive definite” warning, it tells us our expectations were wrong. The estimated model found fewer dimensions of variation than the number of random effects that were specified.

Sometimes the problem is just that estimation went awry. For instance, when predictors with random slopes have very different scales, the variances of the random slopes may be numerically quite different, and this can impede proper model estimation. A second possible reason for **G** to be NPD is that we included random effects in the model that simply aren’t there. Sure, people differ in their starting levels but everyone is actually changing at the same rate, so the random intercepts are good but the random slopes are superfluous. A third possibility is that the data simply aren’t sufficient to support estimating the model (even if the model accurately describes the process under study). This often occurs with smaller sample sizes, more complex models, or some combination of the two.
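The scaling issue is easy to see outside of any model-fitting software. In this hypothetical sketch, the same person-specific slopes are expressed per year and per day; the variance shrinks by a factor of 365 squared, which can put two random-slope variances on wildly different numerical scales:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical person-specific slopes for time measured in years
slopes_per_year = rng.normal(loc=2.0, scale=1.0, size=10_000)

# The same slopes re-expressed per day: rescaling a predictor by a
# factor c rescales its slope variance by 1 / c**2.
slopes_per_day = slopes_per_year / 365.0

ratio = np.var(slopes_per_year) / np.var(slopes_per_day)
print(ratio)  # 365**2 = 133225 (up to floating point)
```

Rescaling predictors so that the random-effect variances are of similar magnitude is often enough to get estimation back on track.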

To illustrate this last possibility, let’s say we fit our growth model to data comprised of two repeated measures per person, and these were collected at the same two points in time for everyone in the sample (a common scenario sometimes referred to as “time structured data”). With only these two time points, there simply is not enough information to be able to obtain unique estimates for all of our model parameters. That is, our model is “under identified”. To intuit why this is the case, imagine a time plot for a set of individuals. If we allow ourselves to draw a different line for each person, each with its own starting level and rate of change, then we will connect the dots perfectly for every case. Yet our model assumes there will be some residual variability around the line as well, i.e., variation around the individual trajectory. Since each line connects the dots, we have no remaining variability with which to capture the residual. Conversely, were we to try to introduce residuals by drawing lines that didn’t perfectly connect the dots, we couldn’t do so without using arbitrary intercepts and slopes. Thus, a typical linear growth model that includes both a residual and random intercepts and slopes cannot be estimated using just two time points of data without producing an NPD covariance matrix for the random effects. That doesn’t mean that there aren’t truly differences in where people start and the rate at which they are changing. It just means that the data are insufficient to tell us about those differences. To be able to identify the model, we would need a third time point (for at least some sizable portion of the sample) to be able to draw a line for each person that doesn’t simply connect the dots and that allows for individual differences in intercepts and slopes as well as residual variability.
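The "connect the dots" intuition can be checked directly. In this hypothetical sketch, each person has exactly two observations; a person-specific line fits them perfectly, leaving (numerically) zero residual variability:

```python
import numpy as np

rng = np.random.default_rng(1)
times = np.array([0.0, 1.0])  # the same two occasions for everyone

# Hypothetical outcomes for 5 people at the two occasions
y = rng.normal(size=(5, 2))

# Fit a separate line (own intercept and slope) to each person
residual_ss = 0.0
for person in y:
    slope = person[1] - person[0]  # the line through two points
    intercept = person[0]
    fitted = intercept + slope * times
    residual_ss += np.sum((person - fitted) ** 2)

# Every line connects the dots, so nothing is left for a residual
print(residual_ss)  # (numerically) zero
```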

A general but imperfect rule of thumb is that, for many of the units in the sample, you want at least one more observation than the number of random effects (e.g., to include two random effects in our growth model, a good number of people in the sample should have three or more repeated measures). If you have fewer observations per unit than indicated by this rule, that may be the cause of your NPD **G** matrix. The warning is telling you that you are trying to do too much with the data at hand. Although we illustrated this rule with longitudinal data, it applies equally to hierarchical data applications. For instance, with dyadic data, there are two partners per dyad, allowing for the inclusion of a random intercept to account for the between-dyad differences; however, no further random effects can be included in the model because their variance/covariance parameters would not be identified given the uniform cluster size of two.

Complicating matters, however, is that even when the number of observations per sampling unit is theoretically sufficient, one may still obtain an NPD covariance matrix. That is, the model is in principle mathematically identified but the data still aren’t able to support the full dimensionality of the random effects. Such a scenario is most likely to arise in small samples and when the number of random effects in the model is either large (i.e., 5 or more) or approaches the maximum number that can possibly be identified by the data. For instance, let’s say we have time structured data with four repeated measures per person. In principle, we can fit a quadratic growth model with a random intercept, random linear effect of time, and random quadratic effect of time. Four observations per person should be enough to obtain unique variance and covariance estimates for three random effects. Yet when we fit the model, we might still obtain the warning, “G matrix is non-positive definite.” In such a case, inspecting the variance-covariance parameter estimates will likely reveal that the quadratic random effect has an estimated variance of zero (or negative variance) or that our random effects have excessively high correlations with one another (in practice, these very high correlations are commonly negative). Empirically, we cannot distinguish all the components of variability that we specified for the individual trajectories.
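When this warning appears, converting the estimated covariance matrix to a correlation matrix makes the redundancy easy to spot. Here is a sketch with a hypothetical 3 x 3 **G** (intercept, linear, quadratic) in which the linear and quadratic effects correlate at -.99:

```python
import numpy as np

# Hypothetical estimated G for intercept, linear, and quadratic random
# effects; the linear and quadratic effects correlate at -0.99.
G = np.array([[ 2.00,  0.30, -0.20],
              [ 0.30,  1.00, -0.99],
              [-0.20, -0.99,  1.00]])

# Convert the covariances to correlations
sd = np.sqrt(np.diag(G))
corr = G / np.outer(sd, sd)
print(np.round(corr, 2))

# The smallest eigenvalue is barely above zero: the quadratic random
# effect is nearly redundant with the linear one.
print(np.linalg.eigvalsh(G).min())
```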

Now that we understand when and why NPD **G** matrices occur, let’s consider what to do about them. What to do depends, of course, on what prompted the NPD solution. First, do your best to determine whether your model is identified. Model identification can be tricky with multilevel models, but drawing on our rule of thumb, consider whether with *p* random effects in your model, your sampling units have at least *p* + 1 observations. If not, you probably need to simplify the model. Even if your model is mathematically identified, model simplification might still be in order. Remember, a non-positive definite **G** matrix signals a lack of empirical support for each random effect to represent a non-redundant component of variation. A logical remedy is then to remove random effects until the warning message goes away. Typically, one should remove higher-order terms before lower-order terms (e.g., remove the quadratic random effect before the linear one, and the linear before the intercept). One pattern of results that is particularly amenable to this strategy is when the variance estimate for a random effect collapses to zero (or goes negative), suggesting it should be removed. We caution, however, that non-significance of a variance estimate should not be taken to imply that the random effect can be sacrificed without worry. Non-significance might simply be a result of low power. Trimming terms based on *p*-values thus runs the risk of over-simplification, with consequences for the validity of the inferences made from the model.

Additionally, we want to emphasize that reducing the number of random effects is not always defensible, desirable, or necessary. For instance, suppose our theory suggested the inclusion of two random slopes in the model. Each is estimated with some non-zero variance but the slopes are excessively correlated with one another, producing an NPD **G** matrix. Which should we remove? Both were hypothesized to exist and there is no empirical information to prompt the exclusion of one versus the other. Fortunately, we may not have to remove either. Sometimes, re-parameterizing the random effects covariance matrix is sufficient to resolve the problem. Specifically, McNeish & Bauer (2020) showed that using a factor analytic (FA) decomposition of the random effects covariance matrix can greatly aid convergence and reduce the incidence of NPD solutions. When necessary, the FA decomposition can also be used to facilitate a dimension reduction to the random effects covariance matrix that doesn’t require any of the random effects to be omitted entirely. In that case, you are effectively acknowledging that an NPD G matrix is just something you have to live with given the complexity of your model, but you are choosing to do so in as graceful (and empirically useful) a manner as possible.
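The core idea of the FA decomposition can be sketched in a few lines (the loading values below are hypothetical, not taken from McNeish & Bauer): parameterize **G** as the product of a loading matrix and its transpose, so that **G** is positive semi-definite by construction, and use fewer factors than random effects to reduce the dimension without dropping any random effect.

```python
import numpy as np

# Hypothetical 3 x 2 loading matrix: three random effects expressed as
# linear combinations of two underlying factors.
Lam = np.array([[1.5,  0.0],
                [0.8,  0.6],
                [0.4, -0.3]])

# G = Lam @ Lam.T is positive semi-definite by construction: rank 2,
# so no random effect is omitted, but the dimension is reduced.
G = Lam @ Lam.T

print(np.linalg.matrix_rank(G))               # 2
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # True: no improper estimates
```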

One other strategy is to abandon the random effects entirely and move to a marginal or “population average” model (Fitzmaurice, Laird, & Ware, 2011; McNeish, Stapleton & Silverman, 2016). In a marginal model, one captures dependence among observations using a covariance structure for the residuals (e.g., compound symmetric, autoregressive, etc.) rather than through the introduction of random effects. Generalized estimating equations (GEE) are one popular algorithm for fitting marginal models, particularly when working with longitudinal data and discrete outcome variables. The obvious downside to a marginal modeling approach is the inability to quantify individual differences between units. For instance, applied in a longitudinal setting, a marginal model would provide estimates of how the mean of the outcome variable changes over time but would not provide estimates of how individuals vary from one another in their trajectories. With hierarchical data, a related approach is to assume independence of observations (despite knowing this assumption to be incorrect), but then implement “cluster corrected” or “robust” standard errors to obtain valid inferences. This latter option is commonly used in survey research where the nesting of units is a by-product of the sampling design (e.g., cluster sampling) but of little substantive interest. In general, these marginal modeling approaches obviate the possibility of an NPD **G** matrix by omitting random effects from the model, but they are typically only useful if clustering is a nuisance and between-cluster differences are not of theoretical interest (see McNeish et al., 2016).
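As a concrete illustration of the marginal approach, here is a compound-symmetric residual covariance matrix for four repeated measures, built from hypothetical values of the variance and within-person correlation; in a marginal model, a structure like this replaces the random effects entirely:

```python
import numpy as np

# Hypothetical values: residual variance and within-person correlation
sigma2, rho, n_times = 2.0, 0.4, 4

# Compound symmetry: constant variance on the diagonal (sigma2) and a
# constant covariance (sigma2 * rho) everywhere off the diagonal.
R = sigma2 * (rho * np.ones((n_times, n_times)) + (1 - rho) * np.eye(n_times))
print(R)
```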

In sum, the warning “G matrix is non-positive definite” tells you that there are fewer unique components of variation in your estimated random effects covariance matrix than the number of random effects in the model. This can be a consequence of fitting an under-identified model, in which case one must simplify the random effects structure. Alternatively, it may reflect sparse empirical information to support the random effects in the model (especially in small samples or with more complex models). Removing random effects is then a common solution. Often, however, a better solution is to re-parameterize the random effects covariance matrix to facilitate optimization to a proper solution, for instance by using a factor analytic decomposition. If the random effects are not of substantive interest, then you might also consider moving to a marginal model to avoid the issue entirely.

**References**

Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2011). *Applied longitudinal analysis* (2nd ed.). Hoboken, NJ: Wiley.

Laird, N.M. & Ware, J.H. (1982). Random-effects models for longitudinal data. *Biometrics*, *38*, 963-974. https://doi.org/10.2307/2529876

McNeish, D. & Bauer, D.J. (2020). Reducing incidence of nonpositive definite covariance matrices in mixed effect models. *Multivariate Behavioral Research*. https://doi.org/10.1080/00273171.2020.1830019

McNeish, D., Stapleton, L.M. & Silverman, R.D. (2016). On the unnecessary ubiquity of hierarchical linear modeling. *Psychological Methods, 22*, 114-140. https://doi.org/10.1037/met0000078

The post I fit a multilevel model and got the warning message “G Matrix is Non-Positive Definite.” What does this mean and what should I do about it? appeared first on CenterStat.

]]>