Research Examples

Key papers

Core IRW resource
Domingue B, Braginsky M, Caffrey-Maffei L, Gilbert JB, Kanopka K, Kapoor R, Lee H, Liu Y, Nadela S, Pan G, Zhang L, Zhang S, Frank MC · 2025
Behavior Research Methods (Open Access)
The Item Response Warehouse (IRW) is a collection and standardization of a large volume of item response datasets in a free and open-source platform for researchers. We describe key elements of the data standardization process and provide a brief description of the over 900 datasets in the current iteration of the IRW (version 28.2). We describe how to access the data through both the website and an API, and offer a brief tutorial with example R code illustrating how to download data from the IRW and use it in standard psychometric analyses. While we are continuing to develop the IRW, this presentation may help researchers utilize data from this resource for work in psychometrics and related fields.
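
The paper's R tutorial is not reproduced here; as a minimal stand-in, the sketch below (in Python) assumes a table has already been downloaded as a CSV in the IRW long format the paper describes (one row per response, with id, item, and resp columns) and shows the two steps most analyses start from. The filename is hypothetical.

    import pandas as pd

    # Hypothetical local copy of one IRW table in the long format
    # (columns: id, item, resp).
    long_df = pd.read_csv("irw_table.csv")

    # Classical item statistics: proportion correct per item.
    print(long_df.groupby("item")["resp"].mean().sort_values().head())

    # Persons-by-items wide matrix, as most IRT software expects; duplicate
    # person-item pairs, if any, are averaged, and NaN marks unobserved pairs.
    wide = long_df.pivot_table(index="id", columns="item", values="resp")
    print(wide.shape)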

Additional resources & item text
Nadela S, Lee H, Jain N, Gupta A, Zhang X, Domingue B · 2026
Chinese/English Journal of Educational Measurement and Evaluation (Open Access)
The Item Response Warehouse (IRW) is a repository of harmonized item response datasets designed to support secondary analysis and methodological research in psychological and educational measurement. This paper serves as a practical guide for researchers interested in using the IRW. We describe the structure of IRW datasets and the quantitative and qualitative metadata available for dataset selection, and we demonstrate how researchers can navigate the IRW website to explore and compare available tables. We further show how the IRW R and Python packages can be used to filter datasets programmatically, download response-level data, and generate standardized citations for reproducible research workflows. Finally, we describe additional IRW features, including access to item text and ongoing development efforts, that are designed to both support immediate use of the IRW and allow for community input related to its continued expansion.
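
The packages' actual function names are not reproduced here; as a neutral sketch of the same filter-then-download workflow, assume the IRW metadata table has been exported to a local CSV (the filename and column names below are assumptions, not the real schema) and narrow it with pandas before fetching the tables you keep.

    import pandas as pd

    # Hypothetical local export of the IRW metadata table; the filename and
    # every column name below are placeholders, not the package's schema.
    meta = pd.read_csv("irw_metadata.csv")

    # Keep moderately sized tables (criteria invented for illustration).
    keep = meta[meta["n_responses"].between(10_000, 1_000_000)]
    print(keep["table_name"].head())  # candidate tables to download and cite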

Psychometrics research using IRW data

Liu Y, Zhang L, Domingue B · 2026
PsyArXiv
A central choice in multidimensional item response theory (MIRT) concerns how multiple latent skills combine to produce success: compensatory models allow trade-offs across dimensions, whereas fully noncompensatory models impose a conjunctive constraint in which limited proficiency on any required dimension can cap success. Despite longstanding conceptual interest, there is limited evidence on how these skill-combination assumptions affect out-of-sample predictive performance across large-scale assessments. Using the Item Response Warehouse (IRW) and a common missing-responses cross-validation framework, we compare a unidimensional 2PL baseline with compensatory and fully noncompensatory two-dimensional 2PL specifications and summarize predictive differences with the InterModel Vigorish (IMV). Across datasets, the compensatory specification yields small but consistently positive predictive improvements over the unidimensional baseline, whereas the fully noncompensatory specification rarely improves prediction. Item-level analyses further reveal within-test heterogeneity, suggesting that conjunctive structure tends to be localized to subsets of items rather than test-wide.
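
The contrast between the two skill-combination rules is easy to compute for one hypothetical person and item; all parameter values below are invented for illustration, and the noncompensatory form shown (a product of per-dimension 2PL curves, without a guessing floor) is one common simplification.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    theta = np.array([1.5, -1.0])   # person: strong on dim 1, weak on dim 2
    a = np.array([1.2, 1.0])        # discriminations
    b = np.array([0.0, 0.0])        # per-dimension difficulties

    # Compensatory 2D 2PL: dimensions add, so strength on one dimension
    # can offset weakness on the other (intercept 0.5 is made up).
    p_comp = sigmoid(a @ theta - 0.5)

    # Fully noncompensatory 2D 2PL: per-dimension success probabilities
    # multiply, so the weak dimension caps overall success.
    p_noncomp = np.prod(sigmoid(a * (theta - b)))

    print(p_comp, p_noncomp)  # the conjunctive rule gives the lower value
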
Nalbandyan R, Gilbert JB, Franco VR, Domingue BW · 2026
Educational and Psychological Measurement
Polytomous item response data are typically classified as either nominal or ordinal, but this binary distinction may oversimplify their true structure. In this paper, we reframe the nominal–ordinal distinction as a continuum and introduce six empirical indices to quantify the degree of category ordering in item response data. Through extensive simulations with various IRT models and applications to 245 empirical datasets, we evaluate the indices’ sensitivity, computational efficiency, and interpretability across diverse measurement contexts. Our findings show that two parametric indices are particularly robust and informative, even with low-frequency categories. These indices offer a practical tool for assessing whether and how item categories align with ordinal assumptions, supporting more accurate measurement and model selection. We conclude that treating ordering as a continuum, rather than a binary property, provides deeper insights for psychometric practice.
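
The paper's six indices are not reproduced here. As a deliberately simplified stand-in, the sketch below scores one polytomous item by how monotonically the mean rest score (the total on the other items) rises across its response categories, using fabricated data.

    import numpy as np
    import pandas as pd
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.integers(0, 4, size=(500, 10)))  # fake 0-3 responses

    item = 0
    rest = X.drop(columns=item).sum(axis=1)   # total on the other items
    by_cat = rest.groupby(X[item]).mean()     # mean rest score per category
    rho, _ = spearmanr(by_cat.index, by_cat.values)
    print(rho)  # near +1 suggests ordinal behavior; near 0, nominal
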
Gilbert J, Himmelsbach Z, Soland J, Joshi M, Domingue B · 2025
Journal of Policy Analysis and Management
Analyses of heterogeneous treatment effects (HTE) are common in applied causal inference research. However, when outcomes are latent variables assessed via psychometric instruments such as educational tests, standard methods ignore the potential HTE that may exist among the individual items of the outcome measure. Failing to account for “item-level” HTE can lead to both underestimated standard errors and identification challenges in the estimation of treatment-by-covariate interaction effects. We demonstrate how Item Response Theory (IRT) models that estimate a treatment effect for each assessment item can both address these challenges and provide new insights into HTE generally. This study uses 75 datasets from 48 randomized controlled trials containing 5.8 million item responses in economics, education, and health research. Our results show that the item-level HTE model reveals item-level variation masked by single-number scores, provides more meaningful standard errors, allows for estimates of the generalizability of causal effects to untested items, and provides estimates of standardized treatment effect sizes corrected for attenuation due to measurement error.
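
A toy version of the core idea, with made-up parameters: simulate responses whose treatment effect varies by item, then recover the item-level spread. The per-item logistic regressions below are an illustrative shortcut, not the IRT models the paper fits (omitting ability attenuates each coefficient, but the cross-item variation still shows through).

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, J = 2000, 20
    T = rng.integers(0, 2, size=n)            # treatment indicator
    theta = rng.normal(size=n)                # latent ability
    b = rng.normal(size=J)                    # item difficulties
    delta = rng.normal(0.3, 0.2, size=J)      # item-specific treatment effects

    logits = theta[:, None] - b[None, :] + T[:, None] * delta[None, :]
    Y = (rng.random((n, J)) < 1 / (1 + np.exp(-logits))).astype(int)

    # Per-item logistic regressions of correctness on treatment.
    X = sm.add_constant(T.astype(float))
    est = [sm.Logit(Y[:, j], X).fit(disp=0).params[1] for j in range(J)]
    print(np.std(est))  # spread across items reflects item-level HTE
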
Gilbert J, Domingue B, Kim J · 2025
Psychological Methods
Network models in which each variable interacts with the others in a complex system have emerged as an important alternative to latent variable models in psychometric research. However, confirmatory methods for group network comparison can be limited by practical constraints, such as the computational intractability of the Ising model in large networks. In this study, we demonstrate how to estimate causal effects on network state and strength when direct network estimation is not feasible by leveraging the mathematical equivalencies between the Ising model and item response theory (IRT) models. We demonstrate through simulation that a two-parameter logistic (2PL) explanatory IRT model can simultaneously recover causal effects on network state and strength. We then replicate our approach with 72 empirical datasets from randomized controlled trials in education, economics, health, and related fields. Our results show that causal effects on network strength are both common and uncorrelated with effects on network state, suggesting that causal network models can provide new insight into the impact of interventions in the social and behavioral sciences.
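
On our reading of the equivalence the paper leverages, a causal effect on network state behaves like a shift in the latent trait's mean, and an effect on strength like a change in its spread. The simulation below is illustrative only, with invented values, and stops at summary statistics rather than a full explanatory IRT fit.

    import numpy as np

    rng = np.random.default_rng(2)
    n, J = 5000, 15
    b = rng.normal(size=J)   # item difficulties

    def simulate(mu, sd):
        # Sum scores under a Rasch-like model with trait ~ N(mu, sd).
        theta = rng.normal(mu, sd, size=n)
        p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
        return (rng.random((n, J)) < p).astype(int).sum(axis=1)

    control = simulate(0.0, 1.0)
    treated = simulate(0.3, 1.4)   # effects on both mean (state) and spread (strength)
    print(treated.mean() - control.mean(), treated.std() / control.std())
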
Gilbert J · 2025
Methodology Open Access
When analyzing treatment effects on outcome variables constructed from psychometric instruments (e.g., educational test scores, psychological surveys, or patient-reported outcomes), researchers face many choices and competing guidance for scoring the measures and modeling results. This study examines the impact of outcome measure scoring and modeling approaches through simulation and an empirical application. Results show that estimates from multiple methods applied to the same data will vary because two-step models using sum or factor scores provide attenuated standardized treatment effects compared to latent variable models. This bias dominates any other differences between models or features of the data generating process, such as the use of scoring weights. An errors-in-variables correction removes the bias from two-step models. An empirical application to 10 datasets from randomized controlled trials demonstrates the sensitivity of the results to model selection. This study shows that the psychometric principles most consequential in causal inference are related to attenuation bias rather than optimal outcome scoring weights.
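
The attenuation argument admits a short worked example: standardizing a treatment effect by the observed-score SD, which includes measurement error, shrinks it by roughly the square root of the reliability, and the errors-in-variables correction divides that factor back out. All numbers below are made up.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000
    T = rng.integers(0, 2, size=n)
    true_score = rng.normal(size=n) + 0.4 * T          # true standardized effect 0.4
    observed = true_score + rng.normal(0, 1, size=n)   # error variance 1 -> reliability 0.5

    naive = (observed[T == 1].mean() - observed[T == 0].mean()) / observed[T == 0].std()
    reliability = 0.5
    print(naive, naive / np.sqrt(reliability))  # ~0.28 attenuated vs ~0.40 corrected
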
Gilbert J, Young W, Himmelsbach Z, Ulitzsch E, Domingue B · 2025
Educational and Psychological Measurement
The use of process data such as response time (RT) in psychometrics has generally focused on the relationship between speed and accuracy. The potential relationships between RT and item discrimination remain less explored. In this study, we propose a model for simultaneously estimating the relationships between RT and item discrimination at the person, item, and person-by-item (residual) levels and illustrate our approach through an item-level meta-analysis of 40 empirical datasets comprising 1.84 million item responses. We find no evidence of average differences in item discrimination between items of different time intensity or persons of different average RT, while residual RT strongly and negatively predicts item discrimination. While heterogeneity is high, we find little evidence of moderation by overall dataset characteristics. Our results suggest that RT data can provide insights into the measurement properties of educational and psychological assessments, but that the relationships between RT and item discrimination are highly variable.
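
The residual-RT construction can be sketched directly: double-center log response time so that what remains is the person-by-item residual the paper relates to discrimination. The data below are simulated placeholders.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(4)
    n, J = 300, 12
    # person speed + item time intensity + noise, on the log scale
    log_rt = rng.normal(size=(n, 1)) + rng.normal(size=(1, J)) + rng.normal(size=(n, J))
    df = pd.DataFrame(log_rt).stack().rename("log_rt").reset_index()
    df.columns = ["id", "item", "log_rt"]

    # Double-centering: remove person and item means, add back the grand mean.
    df["resid_rt"] = (df["log_rt"]
                      - df.groupby("id")["log_rt"].transform("mean")
                      - df.groupby("item")["log_rt"].transform("mean")
                      + df["log_rt"].mean())
    print(df["resid_rt"].abs().mean())
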
Gilbert J, Himmelsbach Z, Miratrix L, Ho AD, Domingue B · 2025
Journal of Educational and Behavioral Statistics
Value-added models (VAMs) attempt to estimate the causal effects of teachers and schools on student test scores. We apply Generalizability Theory to show how estimated VA effects depend upon the selection of test items. Standard VAMs estimate causal effects on the items that are included on the test. Generalizability demands consideration of how estimates would differ had the test included alternative items. We use item-level data from the IRW to estimate the item-level heterogeneity in VA effects and explore implications for reliability, cross-study comparability, and effect sizes.
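
A quick way to see the generalizability point is to re-estimate a crude group contrast on repeated random item subsets and watch the estimate move with the item draw. The sketch below uses simulated data and a simple mean-score contrast rather than a full VAM.

    import numpy as np

    rng = np.random.default_rng(5)
    n, J = 400, 40
    group = rng.integers(0, 2, size=n)
    effect = rng.normal(0.2, 0.3, size=J)   # group effect varies by item
    p = 1 / (1 + np.exp(-(rng.normal(size=(n, 1)) + group[:, None] * effect)))
    Y = (rng.random((n, J)) < p).astype(int)

    draws = []
    for _ in range(200):
        items = rng.choice(J, size=20, replace=False)   # an alternative "test form"
        s = Y[:, items].mean(axis=1)
        draws.append(s[group == 1].mean() - s[group == 0].mean())
    print(np.std(draws))  # variability attributable to the choice of items
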
Gilbert J, Soland J, Domingue B · 2025
Educational Measurement: Issues and Practice
Value-Added Models (VAMs) are both common and controversial in education policy and accountability research. While the sensitivity of VAMs to model specification and covariate selection is well documented, the extent to which test scoring methods (e.g., mean scores vs. IRT-based scores) may affect VA estimates is less studied. We examine the sensitivity of VA estimates to scoring method using empirical item response data from 18 education datasets. We show that VA estimates are frequently highly sensitive to scoring method, holding constant students and items.
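
The sensitivity claim can be illustrated in miniature: score the same responses with unit weights and with discrimination-style weights (invented here) and compare the resulting group rankings. This is a toy comparison, not the paper's IRT scoring.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(6)
    n, J, G = 1200, 30, 12
    group = rng.integers(0, G, size=n)            # e.g., classrooms
    Y = (rng.random((n, J)) < 0.6).astype(int)    # placeholder responses
    w = rng.uniform(0.3, 2.0, size=J)             # stand-in discrimination weights

    score_unit = Y.mean(axis=1)
    score_wtd = Y @ w / w.sum()

    va_unit = np.array([score_unit[group == g].mean() for g in range(G)])
    va_wtd = np.array([score_wtd[group == g].mean() for g in range(G)])
    rho, _ = spearmanr(va_unit, va_wtd)
    print(rho)   # rank agreement of group contrasts across the two scorings
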
Domingue B, Kanopka K, Kapoor R, Pohl S, Chalmers P, Rahal C, Rhemtulla M · 2024
Psychometrika
The deployment of statistical models, such as those used in item response theory (IRT), necessitates the use of indices that are informative about the degree to which a given model is appropriate for a specific data context. We introduce the InterModel Vigorish (IMV) as an index that can be used to quantify accuracy for models of dichotomous item responses based on the improvement across two sets of predictions. This index has a range of desirable features: it can be used for the comparison of non-nested models and its values are highly portable and generalizable. We use this fact to compare predictive performance across a variety of simulated data contexts and also demonstrate qualitative differences in behavior between the IMV and other common indices (e.g., the AIC and RMSEA). We also illustrate the utility of the IMV in empirical applications with data from 89 dichotomous item response datasets, helping illustrate how the IMV can be used in practice and substantiating claims regarding various aspects of model performance.
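
As we read the paper, the IMV converts each model's geometric-mean likelihood for binary outcomes into the prescience w of an equivalent weighted coin and reports the relative gain in w between a baseline and an enhanced model. The sketch below implements that reading and should be checked against the paper (or the authors' code) before serious use.

    import numpy as np
    from scipy.optimize import brentq

    def geo_mean_lik(y, p):
        # Geometric mean of the Bernoulli likelihood of predictions p.
        return np.exp(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    def prescience(a):
        # w in [0.5, 1) whose coin entropy matches likelihood level a
        # (requires 0.5 <= a < 1).
        f = lambda w: w * np.log(w) + (1 - w) * np.log(1 - w) - np.log(a)
        return brentq(f, 0.5, 1 - 1e-12)

    def imv(y, p_baseline, p_enhanced):
        w0 = prescience(geo_mean_lik(y, p_baseline))
        w1 = prescience(geo_mean_lik(y, p_enhanced))
        return (w1 - w0) / w0

    rng = np.random.default_rng(7)
    p_true = rng.uniform(0.2, 0.8, size=5000)
    y = (rng.random(5000) < p_true).astype(int)
    base = np.full(5000, y.mean())           # prevalence-only baseline
    print(imv(y, base, p_true))              # gain from knowing p_true
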
Ahmed I, Bertling M, Zhang L, Ho A, Loyalka P, Xue H, Rozelle S, Domingue B · 2024
Journal of Research on Educational Effectiveness
Researchers use test outcomes to evaluate the effectiveness of education interventions across numerous randomized controlled trials (RCTs). Aggregate test data—for example, simple measures like the sum of correct responses—are compared across treatment and control groups to determine whether an intervention has had a positive impact on student achievement. We show that item-level data and psychometric analyses can provide information about treatment heterogeneity and improve design of future experiments. We demonstrate heterogeneity of item-treatment interactions in empirical data and discuss implications for the complexity and generalizability of RCT findings.
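
A purely descriptive version of the item-treatment interaction idea: compute the per-item difference in proportion correct between arms and inspect its spread. The data below are simulated with invented values.

    import numpy as np

    rng = np.random.default_rng(8)
    n, J = 3000, 25
    T = rng.integers(0, 2, size=n)
    delta = rng.normal(0.05, 0.05, size=J)          # item-varying impact on p(correct)
    p = 0.55 + T[:, None] * delta[None, :]
    Y = (rng.random((n, J)) < p).astype(int)

    item_effects = Y[T == 1].mean(axis=0) - Y[T == 0].mean(axis=0)
    print(item_effects.mean(), item_effects.std())  # average effect and its spread
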
Ma WA, Liu Y, Kanopka K, Ma W, Domingue B · 2025
PsyArXiv
The ability of a student can be conceptualized as either a continuously varying entity (e.g., conventional IRT models) or a bundle of latent classes (e.g., cognitive diagnostic models, CDMs). This paper examines the degree to which these approaches—which utilize quite distinctive notions regarding the nature of ability—produce different predictions of response behavior. We first present simulation studies in which CDM-based predictions uniformly outperform those of IRT models when data are generated from CDMs. We then compare CDM- and IRT-based approaches across nine empirical datasets previously analyzed using CDMs. Our findings indicate that overfitting is a pervasive issue across CDM-based predictions, and only a minority of datasets show improved model fit for CDMs over the 2PL model. Researchers and practitioners may need to balance the diagnostic appeal of CDMs with the fact that their complexity can come at the cost of predictive accuracy.
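
The two conceptions of ability are easy to contrast for a single item: a DINA-type CDM maps a discrete mastery profile through guess and slip parameters, while the 2PL maps a continuous trait through a smooth curve. All values below are invented.

    import numpy as np

    # DINA: an examinee succeeds at rate 1-slip if they master every
    # required skill, and at the guessing rate otherwise.
    q = np.array([1, 1, 0])          # item requires skills 1 and 2
    alpha = np.array([1, 0, 1])      # examinee masters skills 1 and 3
    eta = np.all(alpha >= q)         # has every required skill?
    guess, slip = 0.2, 0.1
    p_dina = (1 - slip) if eta else guess

    # 2PL: the same item under a continuous latent trait.
    a_disc, b_diff, theta = 1.3, 0.2, 0.5
    p_2pl = 1 / (1 + np.exp(-a_disc * (theta - b_diff)))

    print(p_dina, p_2pl)  # class-based step vs. smooth curve
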
Zhang L, Liu Y, Molenaar D, Domingue B · 2025
PsyArXiv
Simulation studies are commonly used to improve understanding of psychometric models. For many common models, an essential feature of the simulation is the relative variation in item difficulties. A common practice has been to generate both item difficulty parameters and person parameters directly from a standard normal distribution—an assumption that warrants careful examination. In this paper, leveraging 73 datasets from the Item Response Warehouse, we examine the variability of item difficulty distributions in real-world datasets and investigate how this variation influences estimation and simulation. We identify key distributional characteristics (e.g., variance and skewness) and propose a new method for simulating realistic item difficulties based on empirical data. This method enhances the realism and applicability of simulation results, making them more reflective of real-world measurement conditions and improving the robustness of psychometric model evaluation.
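
A minimal sketch of the proposal's spirit: draw item difficulties from a distribution tuned to empirical variance and skewness, for instance a skew normal, rather than a standard normal. The parameter values below are invented, not taken from the paper.

    import numpy as np
    from scipy.stats import skewnorm

    rng = np.random.default_rng(9)
    shape, loc, scale = 4.0, -1.0, 1.5   # skew-normal parameters (made up)
    b = skewnorm.rvs(shape, loc=loc, scale=scale, size=200, random_state=rng)

    # Check the simulated difficulties against the target skewness.
    print(b.mean(), b.std(), skewnorm.stats(shape, loc=loc, scale=scale, moments="s"))
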
Domingue B, Kanopka K, Stenhaug B, Sulik MJ, Beverly T, Brinkhuis M, Circi R, Faul J, Liao D, McCandliss B, Obradović J, Piech C, Porter T, Soland J, Weeks J, Wise S, Yeatman J · 2022
Journal of Educational and Behavioral Statistics
The speed-accuracy tradeoff suggests that responses generated under time constraints will be less accurate. While it has undergone extensive experimental verification, it is less clear whether it applies in settings where time pressures are not being experimentally manipulated. Using a large corpus of 29 response time datasets containing data from cognitive tasks without experimental manipulation of time pressure, we probe whether the speed-accuracy tradeoff holds across a variety of tasks using idiosyncratic within-person variation in speed. We find inconsistent relationships between marginal increases in time spent responding and accuracy; in many cases, marginal increases in time do not predict increases in accuracy. However, we do observe that time pressures (in the form of time limits) consistently reduce accuracy and that rapid responses typically show the anticipated relationship. We find substantial variation in the item-level associations between speed and accuracy, and on the person side, respondents who exhibit more within-person variation in response speed are typically of lower ability. Collectively, our findings suggest the speed-accuracy tradeoff may be limited as a conceptual model in non-experimental settings.
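
Two quantities the paper works with are simple to construct: person-centered log response time, and each respondent's within-person RT variability, which can then be related to accuracy. The sketch below uses simulated placeholder data, so the correlation is null by construction.

    import numpy as np

    rng = np.random.default_rng(10)
    n, J = 500, 30
    # person speed + item-level noise, on the log scale
    log_rt = rng.normal(size=(n, 1)) + 0.5 * rng.normal(size=(n, J))
    acc = (rng.random((n, J)) < 0.7).astype(int)   # placeholder accuracy

    within = log_rt - log_rt.mean(axis=1, keepdims=True)   # person-centered log RT
    rt_sd = within.std(axis=1)                             # within-person RT variability
    print(np.corrcoef(rt_sd, acc.mean(axis=1))[0, 1])      # ~0 here by construction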