I fit a multilevel model and got the warning message “G Matrix is Non-Positive Definite.” What does this mean and what should I do about it?

Anyone who uses multilevel models will eventually encounter the dreaded “G matrix is non-positive definite” message (or “Tau Matrix” or “Psi Matrix”, depending on the labeling used by your software). This anxiety-inducing warning is often a signal that something is wrong in your model that needs fixing. With an NPD G matrix, the obtained estimates usually reflect an “improper” solution and the results should not be interpreted until this issue has been addressed. In this post, we will unpack what the “G matrix is non-positive definite” message means, why it is a problem, and how to fix it.

First, let’s translate the technical jargon. Following Laird & Ware (1982), many software programs used to fit multilevel models use the label G to reference the covariance matrix of the random effects.  For instance, for a linear growth model, we might include both a random intercept and a random slope for time to capture (unexplained) individual differences in starting level and rate of change. In fitting the model, we don’t estimate the individual values of the random effects directly. Instead, we estimate the variances and covariance of the random effects, i.e., a variance for the intercepts, a variance for the slopes, and a covariance between intercepts and slopes. These variances and covariances are contained in the matrix G. Similarly, with hierarchically nested data (e.g., children nested within classrooms or patients nested within physician), we use random effects to capture (unexplained) cluster-level differences. Random intercepts capture between-cluster differences in outcome levels whereas random slopes capture between-cluster differences in the effects of predictors. Again, as part of fitting the model, we need to estimate the variances and covariances of these random effects and, again, these variance and covariance parameters are contained within the G matrix. Note that some software programs may use a different label for the covariance matrix of the random effects, but for this post we will use the common notation of G throughout.

When the G matrix is non-positive definite (NPD) this means that there are fewer dimensions of variation in the matrix than the expected number (i.e., the number of rows or columns of the matrix, corresponding to the number of random effects in the model).  For instance, in our linear growth model example, there are two potentially correlated dimensions of variation specified in the G matrix, one corresponding to the random intercepts and one corresponding to the random slopes for time. This is no different than what we would expect for any two variables. If we measured height and weight, for instance, there would be variation in height, variation in weight, and some covariation between height and weight, and this would be captured in the 2 x 2 covariance matrix for the two variables. Here we are simply considering random effects rather than measured variables, but the principle remains the same.  Now, imagine what would happen if there was no variation for one variable or random effect.  For instance, suppose there were no individual differences in rate of change, making the variance of the slopes equal to zero?  Then there would be only one remaining dimension of variation in the matrix (reflecting the random intercepts) and G would be NPD (having fewer actual dimensions of variation than its specified number of rows/columns).  Thus, one way an NPD G matrix can arise is if one (or more) of the random effects in the fitted model has a variance of zero.

However, this is not the only possible way to obtain an NPD G matrix. For example, what happens if the intercepts and slopes of our growth model are perfectly correlated (e.g., r = 1.0 or -1.0)?  Then the two random effects are redundant with one another and actually represent just one dimension of variation. Again, this would lead the G matrix to be NPD.  More technically, any time one random effect can be expressed as a perfect linear function of the other random effects, the G matrix will be NPD.  Note that, depending on whether your software program implements boundary constraints on the variance and covariance parameters or not, you can even get negative variances for random effects or correlations exceeding ±1 (known as improper estimates).

Now let’s consider why an NPD covariance matrix for the random effects is usually a problem. Typically, when one includes random effects in a multilevel model, the assumption is that they “exist” as distinguishable components of variation. For instance, our growth model states that people differ in their starting points and rates of change, differences captured by the random intercepts and slopes included in the model specification. When we include random effects like these in our models, we expect them to have variance and, while they might be correlated with one another, none is thought to be fully redundant with the others. When we receive the “G matrix is non-positive definite” warning, it tells us our expectations were wrong. The estimated model found fewer dimensions of variation than the number of random effects that were specified.

Sometimes the problem is just that estimation went awry. For instance, when predictors with random slopes have very different scales, the variances of the random slopes may be numerically quite different, and this can impede proper model estimation. A second possible reason for G to be NPD is that we included random effects in the model that simply aren’t there. Sure, people differ in their starting levels but everyone is actually changing at the same rate, so the random intercepts are good but the random slopes are superfluous. A third possibility is that the data simply aren’t sufficient to support estimating the model (even if the model accurately describes the process under study). This often occurs with smaller sample sizes, more complex models, or some combination of the two.

To illustrate this last possibility, let’s say we fit our growth model to data comprised of two repeated measures per person, and these were collected at the same two points in time for everyone in the sample (a common scenario sometimes referred to as “time structured data”). With only these two time points, there simply is not enough information to be able to obtain unique estimates for all of our model parameters. That is, our model is “under identified”. To intuit why this is the case, imagine a time plot for a set of individuals. If we allow ourselves to draw a different line for each person, each with its own starting level and rate of change, then we will connect the dots perfectly for every case.  Yet our model assumes there will be some residual variability around the line as well, i.e., variation around the individual trajectory. Since each line connects the dots, we have no remaining variability with which to capture the residual. Conversely, were we to try to introduce residuals by drawing lines that didn’t perfectly connect the dots, we couldn’t do so without using arbitrary intercepts and slopes. Thus, a typical linear growth model that includes both a residual and random intercepts and slopes cannot be estimated using just two time points of data without producing an NPD covariance matrix for the random effects. That doesn’t mean that there aren’t truly differences in where people start and the rate at which they are changing. It just means that the data are insufficient to tell us about those differences. To be able to identify the model, we would need a third time point (for at least some sizable portion of the sample) to be able to draw a line for each person that doesn’t simply connect the dots and that allows for individual differences in intercepts and slopes as well as residual variability.

A general but imperfect rule of thumb is that, for many of the units in the sample, you want at least one more observation than the number of random effects (e.g., to include two random effects in our growth model, a good number of people in the sample should have three or more repeated measures). If you have fewer observations per unit than indicated by this rule, that may be the cause of your NPD G matrix. The warning is telling you that you are trying to do too much with the data at hand. Although we illustrated this rule with longitudinal data, it applies equally to hierarchical data applications.  For instance, with dyadic data, there are two partners per dyad, allowing for the inclusion of a random intercept to account for the between-dyad differences; however, no further random effects can be included in the model because their variance/covariance parameters would not be identified given the uniform cluster size of two.

Complicating matters, however, is that even when the number of observations per sampling unit are theoretically sufficient, one may still obtain an NPD covariance matrix. That is, the model is in principle mathematically identified but the data still aren’t able to support the full dimensionality of the random effects. Such a scenario is most likely to arise in small samples and when the number of random effects in the model is either large (i.e., 5 or more) or approaches the maximum number that can possibly be identified by the data. For instance, let’s say we have time structured data with four repeated measures per person. In principle, we can fit a quadratic growth model with a random intercept, random linear effect of time, and random quadratic effect of time. Four observations per person should be enough to be able to obtain unique variance and covariance estimates for three random effects. Yet when we fit the model, we might still obtain the warning, “G matrix is non-positive definite.” In such a case, inspecting the variance-covariance parameter estimates will likely reveal that the quadratic random effect has an estimated variance of zero (or negative variance) or that our random effects have excessively high correlations with one another (in practice, these very high correlations are very commonly negative). Empirically, we cannot distinguish all the components of variability that we specified for the individual trajectories.

Now that we understand when and why NPD G matrices occur, let’s consider what to do about them.  What to do depends, of course, on what prompted the NPD solution. First, do your best to determine whether your model is identified. Model identification can be tricky with multilevel models, but drawing on our rule of thumb, consider whether with p random effects in your model, your sampling units have at least p + 1 observations. If not, you probably need to simplify the model.  Even if your model is mathematically identified, model simplification might still be in order. Remember, a non-positive definite G matrix signals a lack of empirical support for each random effect to represent a non-redundant component of variation. A logical remedy is then to remove random effects until the warning message goes away. Typically, one should remove higher-order terms before lower-order terms (e.g., remove the quadratic random effect before the linear one, and the linear before the intercept). One pattern of results that is particularly amenable to this strategy is when the variance estimate for a random effect collapses to zero (or goes negative), suggesting it should be removed. We caution, however, that non-significance of a variance estimate should not be taken to imply that the random effect can be sacrificed without worry. Non-significance might simply be a result of low power.  Trimming terms based on p-values thus runs the risk of over-simplification, with consequences for the validity of the inferences made from the model.

Additionally, we want to emphasize that reducing the number of random effects is not always defensible, desirable, or necessary. For instance, suppose our theory suggested the inclusion of two random slopes in the model. Each is estimated with some non-zero variance but the slopes are excessively correlated with one another, producing an NPD G matrix. Which should we remove? Both were hypothesized to exist and there is no empirical information to prompt the exclusion of one versus the other. Fortunately, we may not have to remove either. Sometimes, re-parameterizing the random effects covariance matrix is sufficient to resolve the problem. Specifically, McNeish & Bauer (2020) showed that using a factor analytic (FA) decomposition of the random effects covariance matrix can greatly aid convergence and reduce the incidence of NPD solutions. When necessary, the FA decomposition can also be used to facilitate a dimension reduction to the random effects covariance matrix that doesn’t require any of the random effects to be omitted entirely. In that case, you are effectively acknowledging that an NPD G matrix is just something you have to live with given the complexity of your model, but you are choosing to do so in as graceful (and empirically useful) a manner as possible.

One other strategy is to abandon the random effects entirely and move to a marginal or “population average” model (Fitzmaurice, Laird, & Ware, 2011; McNeish, Stapleton & Silverman, 2016). In a marginal model, one captures dependence among observations using a covariance structure for the residuals (e.g., compound symmetric, autoregressive, etc.) rather than through the introduction of random effects. Generalized estimating equations (GEE) are one popular algorithm for fitting marginal models, particularly when working with longitudinal data and discrete outcome variables. The obvious downside to a marginal modeling approach is the inability to quantify individual differences between units. For instance, applied in a longitudinal setting, a marginal model would provide estimates of how the mean of the outcome variable changes over time but would not provide estimates of how individuals vary from one another in their trajectories. With hierarchical data, a related approach is to assume independence of observations (despite knowing this assumption to be incorrect), but then implement “cluster corrected” or “robust” standard errors to obtain valid inferences. This latter option is commonly used in survey research where the nesting of units is a by-product of the sampling design (e.g., cluster sampling) but of little substantive interest. In general, these marginal modeling approaches obviate the possibility of an NPD G matrix by omitting random effects from the model, but they are typically only useful if clustering is a nuisance and between-cluster differences are not of theoretical interest (see McNeish et al., 2016).

In sum, the warning “G matrix is non-positive definite” tells you that there are fewer unique components of variation in your estimated random effects covariance matrix than the number of random effects in the model. This can be a consequence of fitting an under-identified model, in which case one must simplify the random effects structure. Alternatively, it may reflect sparse empirical information to support the random effects in the model (especially in small samples or with more complex models). Removing random effects is then a common solution.  Often, however, a better solution is to re-parameterize the random effects covariance matrix to facilitate optimization to a proper solution, for instance by using a factor analytic decomposition. If the random effects are not of substantive interest, then you might also consider moving to a marginal model to avoid the issue entirely.

References

Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2011) Applied longitudinal analysis (2nd). Philadelphia, PA: Wiley

Laird, N.M. & Ware, J.H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974. https://doi.org/10.2307/2529876

McNeish, D. & Bauer, D.J. (in press). Reducing incidence of nonpositive definite covariance matrices in mixed effect models. Multivariate Behavioral Research. https://doi.org/10.1080/00273171.2020.1830019

McNeish, D., Stapleton, L.M. & Silverman, R.D. (2016). On the unnecessary ubiquity of hierarchical linear modeling. Psychological Methods, 22, 114-140. https://doi.org/10.1037/met0000078

Related Articles

A reviewer recently asked me to comment on the issue of equivalent models in my structural equation model. What is the difference between alternative models and equivalent models within an SEM?

An equivalent model can be thought of as a re-parameterization of the original model. In other words, it is just a different way of “packaging” the same information in the data and no equivalent model can be distinguished from another based on fit alone. If you were to fit a series of equivalent models to the same sample data you obtain exactly the same chi-square test statistic, RMSEA, CFI, TLI, and any other omnibus measure of fit. It is often best to treat this as a limitation of any given study and to potentially present one or a small number of equivalent model options to the reader so that these too might be considered as plausible representations of the data.