What’s the best way to determine the number of latent classes in a finite mixture analysis?

Finite mixture models, which include latent class analysis, latent profile analysis, and growth mixture models, have grown greatly in popularity over the past decade or so.  Most statistical models assume a unitary (or homogeneous) population wherein all observations are governed by the same basic process.  In contrast, finite mixture models aim to identify latent subgroups within the population, sometimes referred to as “hidden heterogeneity.”  The idea is that distinct classes of people are mixed together in the population and we aren’t privy to the class labels. Imagine, for instance, that you are an amateur naturalist visiting the Galapagos, and you need to decide how many species of finch live on a given island.  The differences between species might not be overtly obvious, but it might still be possible to separate one species from another through careful statistical analysis of measurements of beaks, feathers, and other morphology. Through the application of finite mixture models, it becomes possible to empirically evaluate how many species are present (more than one?), the prevalence of each species on the island, and what features most differentiate the species. Similarly, a clinician might want to determine if there are subtypes of Attention Deficit Hyperactivity Disorder, and an educational researcher might want to know if different subgroups of students use different cognitive strategies to solve math problems.  These are the kinds of research questions for which finite mixture models were designed: they provide a model-based method for resolving a population into classes when class membership is unknown ex ante (before the event).  We can think of this as trying to recover a categorical variable that is entirely missing in our data set, but whose value (what group an individual is in) dictates the distributions and relationships among the variables that we do observe (like a multiple-groups structural equation model where the grouping variable is 100% missing).  Since the data are entirely unlabeled, finite mixture models constitute one approach to unsupervised learning (other approaches include heuristic clustering algorithms like K-means).

One of the single most difficult tasks in finite mixture modeling is to determine the number of classes within the population, a process sometimes referred to as class enumeration. Typically, one will fit a finite mixture model using maximum likelihood estimation, in which the number of classes must be declared as part of the model specification. Thus, the analyst will fit a model with 1 class, then 2 classes, then 3, etc., and then compare the fit of these models to try to determine the optimal number of classes.  Various approaches to determining the optimal number of classes can be considered, but they generally fall into three primary categories: likelihood ratio tests, information criteria, and entropy statistics.  Let’s consider each in turn. (And, yes, there are Bayesian approaches to this problem too, but they aren’t widely used in practice, so we won’t be addressing those.)
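
To make the basic enumeration loop concrete, here is a minimal sketch in Python using scikit-learn’s GaussianMixture as a stand-in for a latent profile analysis; the data matrix X is a hypothetical placeholder, and real applications would use specialized software (e.g., Mplus) together with the fit statistics discussed below.

```python
# Minimal class-enumeration loop (sketch). GaussianMixture stands in for a
# latent profile analysis; X is placeholder data, not a real example.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))     # hypothetical data: 500 people, 4 measures

models = {}
for k in range(1, 7):             # fit 1-class, 2-class, ..., 6-class models
    models[k] = GaussianMixture(n_components=k, n_init=20,
                                random_state=1).fit(X)
    total_loglik = models[k].score(X) * X.shape[0]   # score() is the mean per case
    print(f"{k} classes: log-likelihood = {total_loglik:.1f}")
```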

One approach for evaluating the number of classes is to use a likelihood ratio test (LRT).  LRTs represent a general procedure for testing between nested models, i.e., where one model consists of parameters that are a restricted subset of the parameters of the other model.  The LRT statistic is computed as –2 times the difference in the log-likelihoods of the two models and, under certain regularity conditions (essentially assumptions), it is distributed as a central chi-square with degrees of freedom equal to the difference in the number of estimated parameters.  From the chi-square, we obtain a p-value under the null hypothesis that the simpler model is the right one. Effectively we are saying: look, we know that if we throw more parameters at the model it will fit the sample data better (i.e., the log-likelihood improves), but is this improvement greater than we would expect by chance alone given the number of parameters added (the degrees of freedom of the LRT)?  If the p-value is significant, then we conclude that it is a greater improvement than we would expect by chance, rejecting the simpler model in favor of the more complex model.  If it’s not significant, then we conclude there is not a meaningful difference between the two models and we retain the simpler model. In other words, we conclude that the extra parameters may just be overfitting, picking up random variation or noise in the sample that doesn’t reflect the true underlying structure in the population.
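
For readers who want the arithmetic spelled out, here is a hedged sketch of the generic nested-model LRT with made-up log-likelihood values; as the next paragraph explains, the central chi-square reference used here is not valid when comparing mixture models with different numbers of classes.

```python
# Generic LRT arithmetic for two nested models (sketch with made-up numbers).
# NOTE: the central chi-square reference below does NOT apply when comparing
# K- and (K+1)-class mixture models (see the next paragraph).
from scipy.stats import chi2

ll_simple, ll_complex = -1250.3, -1238.8   # hypothetical total log-likelihoods
df = 5                                     # difference in number of parameters
lrt = -2 * (ll_simple - ll_complex)        # -2 * (LL_restricted - LL_full)
p_value = chi2.sf(lrt, df)                 # upper-tail p-value
print(f"LRT = {lrt:.1f}, df = {df}, p = {p_value:.4f}")
```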

That is how we typically use LRTs in a traditional modeling framework, but let’s think about how we would apply this general testing approach to determine the number of classes in a finite mixture.  First, we can establish that a K-class model is nested within a (K+1)-class model.  For instance, one could set the mixing probability (prevalence rate) of one class in the (K+1)-class model to zero.  Presto, this deletes one of the classes to produce a K-class model.  So far so good.  Now we fit models with 1 versus 2 classes, calculate the LRT, and if the p-value is significant we say 2 classes are better than 1.  Then we test 2 versus 3 classes, 3 versus 4 classes, etc., and stop when we get to the point that adding another class no longer results in a significant improvement in model fit.  But where things get complicated is in the fine print of the likelihood ratio test.  The regularity conditions required for the test statistic to follow a central chi-square distribution aren’t met when testing a K-class versus a (K+1)-class model.  So while it still makes sense to conduct likelihood ratio tests, we no longer have the familiar chi-square with which to obtain p-values. We need to somehow modify how we conduct LRTs for use in this context.

One option is to bootstrap the test distribution.  McLachlan (1987) proposed a parametric bootstrapping procedure that involves (1) simulating data sets from the K-class model estimates obtained from the real data; (2) fitting K- and (K+1)-class models to the simulated data sets; (3) computing the likelihood ratio test statistic for each simulated data set; and (4) using the distribution of bootstrapped LRT values to obtain the p-value for the likelihood ratio test statistic obtained with the real data.  It’s a clever approach, but somewhat computationally intense, especially if one wants a precise p-value.  The other option is to derive the correct theoretical test distribution for the LRT. Lo, Mendell, and Rubin (2001) performed these derivations, determining it (appropriately enough) to be a mixture of chi-squares. They also provided an ad hoc adjusted version of the test with somewhat better performance at realistic sample sizes.  Simulation studies, however, have shown the Lo-Mendell-Rubin LRT (in both its original and adjusted versions) to have elevated Type I error rates for some models, whereas the bootstrapped LRT consistently works well.  We thus tend to prefer the bootstrapped LRT, despite its greater computational demands (an increasingly less relevant concern given the ever-improving computational speed of even the lowliest desktop computers).
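
To show the logic of the parametric bootstrap, here is a rough Python sketch built on the GaussianMixture stand-in from above; the function names and the small number of bootstrap draws are our own choices for illustration, and production implementations (e.g., the BLRT in Mplus) handle the details far more carefully.

```python
# Parametric bootstrap LRT in the spirit of McLachlan (1987); a sketch only.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture(data, n_classes):
    return GaussianMixture(n_components=n_classes, n_init=20,
                           random_state=1).fit(data)

def total_loglik(model, data):
    return model.score(data) * data.shape[0]   # score() is mean log-lik per case

def bootstrap_lrt(X, k, n_boot=99):
    m_k, m_k1 = fit_mixture(X, k), fit_mixture(X, k + 1)
    observed = -2 * (total_loglik(m_k, X) - total_loglik(m_k1, X))
    boot = []
    for _ in range(n_boot):
        Xb, _ = m_k.sample(X.shape[0])                            # (1) simulate from K-class fit
        b_k, b_k1 = fit_mixture(Xb, k), fit_mixture(Xb, k + 1)    # (2) refit both models
        boot.append(-2 * (total_loglik(b_k, Xb) - total_loglik(b_k1, Xb)))  # (3) bootstrap LRT
    # (4) p-value: proportion of bootstrap LRTs at least as large as the observed LRT
    p = (1 + sum(b >= observed for b in boot)) / (n_boot + 1)
    return observed, p
```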

A second approach to evaluating the number of classes is to use information criteria (ICs).  Two well-known information criteria are Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC), but there are many others. What ICs generally try to do is balance the fit of the model against the complexity of the model.  Fit is measured by –2 times the log-likelihood, and a penalty is then applied for complexity, usually some function of the number of parameters and/or the sample size.  ICs are usually (though not always) scaled so that smaller values are better.  So one would fit models with 1, 2, 3, etc. classes and then select the model with the lowest IC value as providing the best balance of fit against complexity.  Different ICs were motivated in different ways and implement different penalties. Some penalties are stiffer than others; for instance, the BIC penalty usually exceeds the AIC penalty. When choosing the number of classes, simulation studies have shown the AIC to be too liberal (it tends to support extracting too many classes), whereas the BIC generally does well as long as the classes are reasonably well separated.  For less distinct classes (that is, classes that reside closer together and are thus harder to discern), a sample-size-adjusted version of the BIC, which ratchets down the penalty a bit, sometimes performs better.  While there are many different ICs to choose from, we generally find the BIC to be a reasonable choice.
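
As a concrete illustration of the penalty idea, here is a short sketch of the usual formulas computed from a model’s total log-likelihood, number of free parameters, and sample size; the sample-size-adjusted BIC shown uses the common ln((n + 2)/24) replacement for ln(n), but treat that exact form as an assumption and check what your software actually reports.

```python
# IC formulas from total log-likelihood LL, free parameters p, sample size n
# (sketch; the sample-size-adjusted BIC form shown is one common variant).
import numpy as np

def information_criteria(LL, p, n):
    return {
        "AIC":   -2 * LL + 2 * p,                      # fixed penalty per parameter
        "BIC":   -2 * LL + p * np.log(n),              # penalty grows with sample size
        "saBIC": -2 * LL + p * np.log((n + 2) / 24),   # softened ("ratcheted down") penalty
    }

print(information_criteria(LL=-1238.8, p=17, n=500))   # hypothetical values
```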

A third common approach is to consider the entropy of the model.  Entropy is a measure of how accurately one could assign cases to classes.  Finite mixture models are probabilistic classification models in the sense that there is not a hard partition of the sample into non-overlapping clusters; instead, there is a probability that each person belongs to each class, and these probabilities sum to 1.0 for each individual, reflecting that there is a 100% chance they belong to one of the classes.  However, sometimes one is interested in producing such a hard partition based on the probabilities, for instance by assigning each case to the class to which it most likely belongs, a technique called modal assignment.  If the probabilities of class membership tend toward zero and one, then this implies that there should be few errors of assignment.  But as the probabilities move away from zero and one, this reflects greater uncertainty about how to assign cases and an increased rate of assignment errors.  For instance, if my probabilities for belonging to Classes 1 and 2 are .9 and .1, there’s a 90% chance I would be correctly assigned to Class 1. That’s pretty good.  But if my probabilities are .6 and .4, there is only a 60% chance that placing me into Class 1 would be the right decision.  Entropy summarizes the uncertainty of class membership across all individuals, providing a sense of how accurately one can classify cases based on the model.
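
To make the probabilistic-versus-modal distinction concrete, here is a short sketch that reuses the hypothetical X and the `models` dictionary from the enumeration loop above; `predict_proba` returns the matrix of class-membership probabilities for each case.

```python
# Posterior class probabilities and modal assignment (sketch), reusing the
# hypothetical X and `models` dictionary from the enumeration loop above.
import numpy as np

gm = models[3]                      # e.g., the 3-class solution fit earlier
post = gm.predict_proba(X)          # n-by-K matrix of membership probabilities
print(post.sum(axis=1)[:5])         # each row sums to 1.0
modal = post.argmax(axis=1)         # assign each person to their most likely class
print(np.mean(post.max(axis=1)))    # average chance a modal assignment is correct
```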

There are several different types of entropy-based statistics.  Some take the same form as the ICs described above, in which the fit of the model is balanced against a penalty that is now a function of entropy (e.g., the classification likelihood criterion).  Others are transformations of entropy intended to make interpretation easier (e.g., the normalized entropy criterion). The relative entropy statistic E developed by Ramaswamy et al. (1992) is particularly popular: it has a nice scale, ranging from 0 to 1, with 1 indicating perfect classification accuracy, and it is standard output in some software (e.g., Mplus).  One might thus calculate E values (or some other entropy-based statistic) for models with different numbers of latent classes and then select the model with the greatest classification accuracy.  But this presupposes that one wants to select a model that consists of well-separated classes.  Sometimes, classes aren’t well separated.  Consider that there is a well-recognized height difference between adult men and women, yet men are only about 7% taller than women on average, so there is a lot of overlap between the two height distributions. It seems reasonable to assume that latent classes will overlap at least as much as natural groups do, so entropy may be a poor guide to the number of classes in many realistic scenarios. Thus, in most cases, it is probably best not to use entropy to guide class enumeration, but instead to consider it a property of the model that is ultimately selected.  That is, determine the number of classes using the BIC and/or the bootstrapped likelihood ratio test, then examine the entropy as a descriptive statistic of the selected model.
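
For completeness, here is a sketch of the relative entropy statistic in the spirit of Ramaswamy et al. (1992), computed from the posterior probability matrix from the previous sketch; it is only defined for models with two or more classes.

```python
# Relative entropy statistic E (sketch): 1 minus average classification
# uncertainty, scaled so that 1 indicates perfectly separated classes.
import numpy as np

def relative_entropy(post):
    n, K = post.shape               # requires K >= 2 (log(1) = 0 otherwise)
    p = np.clip(post, 1e-12, 1.0)   # guard against log(0)
    uncertainty = -np.sum(p * np.log(p))
    return 1.0 - uncertainty / (n * np.log(K))

print(relative_entropy(post))       # `post` from the previous sketch
```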

So we seem to have arrived at a straightforward set of recommendations. First, fit models with 1, 2, 3, etc. latent classes (until estimation breaks down or we reach some practically useful or theoretically plausible upper bound, say 10 classes). Second, compare the fit of these models using your preferred information criterion (perhaps the BIC, perhaps the sample-size-adjusted BIC). Also use the bootstrapped likelihood ratio test to get formal p-values. Hope your IC of choice and the bootstrapped LRT arrive at the same answer.  Third, write your paper.  How hard can all of that possibly be?  Well, sometimes (maybe even oftentimes) this process doesn’t work, occasionally in small ways and occasionally in blow-up-in-your-face ways. You might end up selecting a model that is problematic, such as one with a very small class that is impractical and that you suspect may just reflect outliers or overfitting to the data.  Or you might select a model in which, substantively, some of the classes seem similar enough that it isn’t worth distinguishing them.  In such cases, you might use your content-area knowledge (expert opinion) to decide that the quantitatively “best” model isn’t as useful as the next-best model.  Of course, this introduces subjectivity into the model selection process, and people may disagree about these decisions, so you will want to justify your choices clearly.

Other times, the IC values just keep getting better as classes are added to the model and the bootstrapped LRTs just keep giving significant results.  This seems to happen a lot when analyzing especially large samples. What this reveals is a problem in our logic so far.  To this point, we’ve assumed that the finite mixture model is literally correct: that is, there is some number of latent groups mixed together in the population and our job is to go find that number. But what if the model isn’t literally correct?  Arguably, all models represent imperfect approximations to the true data-generating process.  We hope these models recover important features of the underlying structure, but we don’t necessarily regard them as correct.  From this perspective, there isn’t some number of true classes to find.  But if that is the case, then what are we doing when we conduct class enumeration?  We would argue that we are evaluating different possible approximations to the data, trying to discern how many classes it takes to recover the primary structure without taking so many that we start to capture noise or nuisance variation.

At small sample sizes, we can only afford a gross approximation with few classes, but with larger sample sizes, we can start to recover finer structure with more classes. That finer structure may not always be of substantive interest, but it’s there, and traditional class enumeration procedures (the BIC, etc.) will reward models that recover it. For example, with a modest amount of data we might be able to identify differences in attitudes, behavior, fashion, and speech between individuals living in broad regions of the United States, like the Northeast and Southwest.  With more data, we might be able to see more nuanced differences, separating these into smaller regions like the mid-Atlantic states, the upper Midwest, etc.  In reality, the states (aside from Alaska and Hawaii) are contiguous, and attitudes, behavior, fashion, and speech patterns vary continuously over complex cultural and geographic gradients. Nevertheless, regional classifications capture important differences in local conditions. There is no single right number of regions, just differences in fineness. With enough data, we can make our classes extremely local, but this might not always be useful to do.

Ultimately, then, there is an inconsistency between the perspective motivating the development and evaluation of traditional class enumeration procedures (that there is a true number of classes to find) and the context within which these procedures are applied in practice (where the model is an approximation).  This can lead to problems like seeing support for more and more classes at larger and larger sample sizes. In such cases, the number of classes selected may be determined more by subjective considerations, such as the size, distinctiveness, and practical utility of the classes, than by the fit statistics themselves.

In sum, standard practice in determining the number of classes for a finite mixture model is to fit models with 1, 2, 3, etc. classes using maximum likelihood estimation, then compare their fit using specialized likelihood ratio tests (the bootstrapped LRT or Lo-Mendell-Rubin LRT), information criteria (BIC, AIC, etc.), or entropy, and to try to objectively triangulate on an optimal number. Simulation studies suggest the bootstrapped LRT and the BIC generally work well. However, these procedures presuppose that there is some true number of classes to find. In most instances, a more realistic perspective is that the model is instead providing an approximation to the underlying structure and there may not be a true number of classes to find. Even the archetypal concept of species undergirding our example with the finches is a bit more muddled than we learned in high school biology. On this view, the goal of our analysis is to select a number of classes that recovers the important features of the data without capturing noise or nuisance variation. Traditional class enumeration procedures can still serve as a useful guide, balancing fit and parsimony in quantifiable ways, but content-area knowledge also plays an important role in determining how fine to make the approximation before it becomes impractical and unwieldy.


References

Henson, J.M., Reise, S.P., & Kim, K.H. (2007). Detecting mixtures from structural model differences using latent variable mixture modeling: A comparison of relative model fit statistics. Structural Equation Modeling, 14, 202-226.

Kim, S.-Y. (2014). Determining the number of latent classes in single- and multiphase growth mixture models. Structural Equation Modeling, 21, 263-279.

Liu, M. & Hancock, G.R. (2014). Unrestricted mixture models for class identification in growth mixture modeling. Educational and Psychological Measurement, Online First.

Lo, Y., Mendell, N.R., & Rubin, D.B. (2001). Testing the number of components in a normal mixture. Biometrika, 88, 767–778.

McLachlan, G.J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture.  Journal of the Royal Statistical Society, Series C, 36, 318-324.

McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.

Nylund, K.L., Asparouhov, T., & Muthén, B.O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569.

Ramaswamy, V., DeSarbo, W.S., Reibstein, D.J., & Robinson, W.T. (1992). An empirical pooling approach for estimating marketing mix elasticities with PIMS data. Marketing Science, 12, 241-254.

Tofighi, D. & Enders, C.K. (2008). Identifying the correct number of classes in growth mixture models. In G.R. Hancock & K.M. Samuelsen (Eds.), Advances in Latent Variable Mixture Models (pp. 317-341). Greenwich, CT: Information Age.
