Why Measurement Matters
In the social, behavioral, and health sciences, we rarely observe the constructs we care about directly. Critically important constructs such as depression, quality of life, belonging, reading ability, self-efficacy, stress, prejudice, executive functioning, political trust, and family climate do not come with convenient rulers attached to them. Instead, we infer their existence based upon observed item responses, task performance, ratings, or behavior. That basic fact makes measurement central to the entire enterprise: if our measures do not represent the constructs we think they represent, then even highly sophisticated models can yield results that are precise and elegant yet deeply misleading. Good measurement practices are therefore essential for making defensible and reproducible inferences.
Measurement is far from a new problem. Indeed, psychology, education, and related fields have developed a rich tradition of psychometric thinking about construct validity, dimensionality, reliability, score precision, item functioning, measurement invariance, and test bias that spans more than a century. Yet in much contemporary substantive research, measurement is treated at most as a brief preliminary hurdle rather than an ongoing scientific responsibility. It is either ignored entirely or addressed merely through the routine reporting of Cronbach’s alpha (which hardly counts). Structural validity evidence in particular is often underreported, and researchers frequently rely on prior use of a scale as justification for current use, which is far from sufficient. Despite the extraordinary knowledge that exists about measurement, researchers rarely make full use of rigorous psychometric tools in their day-to-day modeling practice, and this in turn can drastically limit what can be learned from our data.
Validity and Reliability
Two key issues at hand are validity, the extent to which our measures actually reflect their intended constructs, and reliability, the extent to which our measures capture true construct variance relative to noise, or measurement error. Nearly all statistical models proceed under the assumption that the observed measures are both valid and perfectly reliable (error free). The consequence of using invalid measures is relatively intuitive to work out. If a variable doesn’t actually represent the construct you think it does (e.g., a measure thought to represent impulsivity actually captures risk tolerance), then the obtained results won’t provide accurate information about the intended construct. The consequences of unreliability are equally troubling: estimates obtained from a model assuming perfect reliability will be biased in the presence of measurement error, sometimes quite badly. In ordinary least squares regression, for example, predictors are typically treated as fixed and error-free for purposes of estimation. When the predictors instead contain measurement error, the coefficient estimates will be biased.
The assumption of perfect reliability extends to a host of other statistical models ranging from mixture models to growth curve models to machine learning and beyond. But our predictors are often scale scores derived from a set of items, or even a single item, that can contain substantial measurement error. If measurement error is present but ignored, the analysis is no longer operating on the construct itself; it is operating on an imperfect proxy for the construct. The simple textbook story is that unreliability causes the coefficient estimates to shrink, that is, we obtain downwardly biased estimates pulled toward zero. Sometimes, this problem is dismissed with the justification that results are therefore just “conservative” because the true effects are actually even bigger. However, this textbook story only holds under the simplified scenario of a model with just one predictor. With multiple predictors (that is, virtually every model fit in the real world), bias due to unreliability can be much more complicated and can propagate throughout the entire model, not only attenuating but sometimes inflating estimates and standard errors.
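To make the contrast between one and two predictors concrete, here is a minimal sketch that works out the population regression slopes implied by classical measurement error. It is an illustration under stated assumptions rather than a general result: the two true predictors are standardized, the error is random and independent of everything else, and the slopes, correlation, and reliabilities are hypothetical values chosen for the example. The function names are ours.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b using Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]


def observed_slopes(true_slopes, rho, reliabilities):
    """Population OLS slopes when y is regressed on error-laden predictors.

    Assumes two standardized true predictors correlated at rho, where each
    observed score equals the true score plus independent random error.
    """
    # Observed variance = true variance / reliability (true variance is 1).
    cov_ww = [[1.0 / reliabilities[0], rho],
              [rho, 1.0 / reliabilities[1]]]
    # Random error leaves the covariances with the outcome unchanged.
    cov_wy = [true_slopes[0] + rho * true_slopes[1],
              rho * true_slopes[0] + true_slopes[1]]
    return solve_2x2(cov_ww, cov_wy)


# Textbook case: with uncorrelated predictors, the slope for x1 simply
# shrinks by its reliability (0.5 * 0.7).
single_b1, _ = observed_slopes([0.5, 0.0], rho=0.0, reliabilities=[0.7, 1.0])

# Correlated predictors: x1 has a true slope of 0.5 and reliability 0.7;
# x2 has a true slope of 0.0 and is measured perfectly.
b1, b2 = observed_slopes([0.5, 0.0], rho=0.6, reliabilities=[0.7, 1.0])
print(round(single_b1, 3), round(b1, 3), round(b2, 3))
```

With these made-up values, the slope for the unreliable predictor falls from 0.5 to roughly 0.30, while the slope for the perfectly measured predictor, which has no true effect at all, rises from 0 to roughly 0.12. The error in one predictor has leaked into the estimate for the other, which is hardly a "conservative" result.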
Thus, the credibility of the entire scientific enterprise hinges on rigorous construct measurement. Fortunately, we have two extremely powerful and well-developed psychometric frameworks within which to assess measurement: factor analysis and item response theory (IRT). Both are latent variable models, but they tend to differ in their emphasis.
Factor Analysis
Factor analysis is principally used to address questions of structural validity by evaluating how the observed responses, typically continuously scaled, reflect the underlying constructs of interest. For instance, early applications of factor analysis considered whether scores obtained across a range of cognitive tests might reflect three dominant factors: visual, verbal, and speed (see diagram below). More broadly, we might ask, do the observed responses reflect a unidimensional construct, or are multiple related processes being averaged together? Do responses load on a single factor, or are they meaningfully influenced by multiple factors? Are the factor loadings strong enough to support interpretation? Is the solution stable and replicable? These are not minor technical questions; they offer insight into the latent structure that underlies the set of observed responses. Additionally, once a satisfactory structure is identified, we can ask whether the factor loadings are sufficiently large to support reliable score estimation for subsequent modeling. In the simplest case, scale scores might be computed as a sum or average of the subset of observed responses with high loadings on a factor. An often better approach, however, is to use the final parameter estimates to compute factor score estimates, as these account for the possibility that some responses are more indicative of the underlying latent construct than others.
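The difference between a sum score and a factor score estimate can be sketched in a few lines. The example below computes regression-method scoring weights for a single factor, which is only one of several scoring methods. It assumes standardized items and a factor variance of 1, so that each unique variance is one minus the squared loading. The loadings are hypothetical and the function names are ours.

```python
def regression_score_weights(loadings):
    """Regression-method factor score weights for a one-factor model.

    Assumes standardized items and a factor variance of 1, so each
    unique variance equals 1 - loading**2.
    """
    # Loading-to-unique-variance ratio for each item.
    ratios = [lam / (1.0 - lam ** 2) for lam in loadings]
    # Sum of squared loadings, each divided by its unique variance.
    total = sum(lam * r for lam, r in zip(loadings, ratios))
    return [r / (1.0 + total) for r in ratios]


def factor_score(z_scores, weights):
    """Weighted sum of standardized item responses."""
    return sum(w * z for w, z in zip(weights, z_scores))


loadings = [0.9, 0.7, 0.5, 0.3]  # hypothetical standardized loadings
weights = regression_score_weights(loadings)
print([round(w, 2) for w in weights])
```

A sum score would weight all four items equally. Here the weights come out to roughly 0.71, 0.21, 0.10, and 0.05, so the item with the strongest loading dominates the score estimate and the weakest item contributes very little.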

Item Response Theory
Whereas factor analysis is commonly applied with continuous response variables (frequently scale-level data), item response theory (IRT) adds another layer of insight by focusing on the creation of scales from individual items. With item-level data, the response options are typically categorical (e.g., yes/no, correct/incorrect, never/sometimes/often, or strongly disagree to strongly agree), necessitating models that consider how the probability of each response varies over the range of the latent trait (see below). For instance, the probability of answering an algebra item correctly should increase with math ability. Whereas factor analysis prioritizes defining the construct space by considering multiple factors, IRT is designed for the process of item selection and scale construction for a single underlying factor. IRT models thus allow researchers to evaluate which items discriminate most sharply, whether guessing is present, where along the latent continuum items are most informative, and how much measurement precision the test provides at different trait levels. IRT is a powerful yet often underused approach for measurement, especially because it yields information about both item functioning and score precision. This can be a major advantage over simply adding responses together and treating the result as if every point on the scale were equally reliable.
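As a rough illustration of these ideas, the sketch below uses the three-parameter logistic (3PL) model, one common IRT model for binary items, in which each item has a discrimination (a), a difficulty (b), and a lower asymptote for guessing (c). The three items and their parameter values are hypothetical, the function names are ours, and no scaling constant is applied to the logistic function.

```python
import math


def prob_correct(theta, a, b, c=0.0):
    """3PL probability of a correct response at trait level theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c=0.0):
    """Item information under the 3PL; equals a^2 * P * (1 - P) when c = 0."""
    p = prob_correct(theta, a, b, c)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p


def standard_error(theta, items):
    """Standard error of the trait estimate: 1 / sqrt(test information)."""
    info = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / math.sqrt(info)


# Hypothetical (a, b, c) values for three items.
items = [(2.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.5, 1.0, 0.2)]

for theta in (-1.0, 0.0, 1.0, 3.0):
    print(theta, round(standard_error(theta, items), 2))
```

Because information peaks near each item's difficulty, the standard error here is smallest where the items are concentrated and grows sharply at high trait levels that none of the items target. A simple sum score hides exactly this variation in precision.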

Connections Between Factor Analysis and IRT
Though these two psychometric frameworks developed side-by-side for somewhat different purposes, they are in fact highly related. A conventional factor analysis model can be considered an item response theory model for continuous items, and can likewise be used principally for scale construction purposes. Conversely, the IRT model can be considered a factor analysis model generalized to discrete items, with multidimensional IRT models offering a similar focus on structural validity. Knowledge of both factor analysis and IRT and their interconnections therefore offers both a rich understanding and a wealth of tools for rigorously evaluating construct measurement.
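One way to see the connection is that, for a binary item, the parameters of a one-factor model for categorical data can be re-expressed as IRT parameters. The sketch below uses the standard normal-ogive conversion. It assumes a single factor with mean 0 and variance 1 and a standardized underlying latent response variable; other parameterizations require different formulas. The function name and input values are hypothetical, and the default scaling constant of 1.702 is the usual approximation for moving from the normal-ogive to the logistic metric.

```python
import math


def loading_to_irt(loading, threshold, scaling=1.702):
    """Convert a standardized loading and threshold to IRT parameters.

    Assumes a one-factor model for a binary item with a standardized
    latent response variable and a factor with mean 0 and variance 1.
    """
    # Discrimination rises steeply as the loading approaches 1.
    a = scaling * loading / math.sqrt(1.0 - loading ** 2)
    # Difficulty is the threshold rescaled onto the latent trait metric.
    b = threshold / loading
    return a, b


# Hypothetical item: loading 0.6, threshold 0.3, normal-ogive metric.
a, b = loading_to_irt(0.6, 0.3, scaling=1.0)
print(round(a, 2), round(b, 2))
```

A stronger loading translates into a more discriminating item, and the threshold translates into the item's difficulty, so the two frameworks describe the same item in different terms.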
Measurement in Practice
None of this means every study needs to begin with a full-scale psychometric redevelopment project. Researchers should, however, be much more cautious about treating common scales as if they were transparent windows onto the constructs they are intended to measure. A scale that worked well in one sample, at one time, for one purpose is not automatically adequate in another setting. Good measurement practice is therefore not a one-time citation to an old validation study. It is an ongoing process of testing whether the present data support the interpretation we want to make. For social, behavioral, and health scientists, that is the larger message: measurement is not a nuisance to be dispatched before the “real” modeling begins. It is a fundamental part of the “real” modeling.
CenterStat is Here to Help
Here at CenterStat, we are deeply committed to measurement in all forms; indeed, both Dan and Patrick hold faculty positions in the L.L. Thurstone Psychometric Lab at the University of North Carolina, and Thurstone was arguably one of the greatest measurement experts to have ever lived. Measurement plays a key role within many of our workshops, but we offer two training opportunities that explicitly focus on these topics. The first is taught by Wes Bonifay (University of Missouri) and is titled Foundations of Item Response Theory. The second is co-taught by Patrick Curran (University of North Carolina) and Greg Hancock (University of Maryland) and is titled Exploratory and Confirmatory Factor Analysis. Each of these classes provides a broad treatment of IRT and FA starting from a basic introduction and moving to powerful contemporary applications of these models in practice. Each can be accessed separately, but we also offer a deeply discounted tuition on a measurement bundle that includes both classes.
Regardless of whether you learn measurement modeling from us or from someone else, it is absolutely critical that measurement be treated with the deep respect it deserves in every published study. This in turn increases the reliability, validity, and reproducibility of findings that we all desire.
Suggested Readings
Anastasi, A. (1950). The concept of validity in the interpretation of test scores. Educational and Psychological Measurement, 10, 67-78.
Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Guilford Publications.
Bock, R. D. (1997). A brief history of item response theory. Educational Measurement: Issues and Practice, 16, 21-33.
Bollen, K. A. (2002). Latent variables in psychology and the social sciences. Annual Review of Psychology, 53, 605-634.
Bollen, K. A., & Bauldry, S. (2011). Three Cs in measurement models: Causal indicators, composite indicators, and covariates. Psychological Methods, 16, 265-284.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-301.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. New York: Psychology Press.
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272-299.
Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 147-200).
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183-202.
Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.
Reise, S. P., & Waller, N. G. (2003). How many IRT parameters does it take to model psychopathology items? Psychological Methods, 8, 164-184.
Sellbom, M., & Tellegen, A. (2019). Factor analysis in psychological assessment research: Common pitfalls and recommendations. Psychological Assessment, 31, 1428-1441.
Smith, G. T. (2005). On construct validity: Issues of method and measurement. Psychological Assessment, 17, 396-408.
Spearman, C. (1904). “General intelligence”, objectively determined and measured. American Journal of Psychology, 15, 201-293.
Thissen, D., & Wainer, H. (Eds.). (2001). Test scoring. Lawrence Erlbaum Associates Publishers.
Yong, A. G., & Pearce, S. (2013). A beginner’s guide to factor analysis: Focusing on exploratory factor analysis. Tutorials in Quantitative Methods for Psychology, 9, 79-94.
