Conducting CFA using IRW datasets with lavaan

This tutorial demonstrates how to conduct confirmatory factor analysis (CFA) using datasets from the Item Response Warehouse (IRW) with the lavaan package in R. We walk through a complete workflow using a big five personality dataset, starting from data retrieval and preprocessing, followed by basic exploratory checks, and ending with model specification and estimation. In particular, we show how to efficiently construct a multi-factor CFA model by leveraging item naming conventions, which is especially helpful when working with large-scale datasets.

Although the example focuses on a five-factor personality dataset, the same workflow generalizes to many other IRW datasets and CFA applications.

Load the dataset

We begin by loading the required packages and retrieving a dataset from the Item Response Warehouse (IRW). The dataset used here contains responses to items measuring the Big Five personality traits.

Note. Code chunks are not evaluated (eval = FALSE) to keep the tutorial lightweight for web rendering; they can be run locally to reproduce the results.

Code
library(irw)
library(lavaan)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(psych)

df <- irw_fetch("5personalityfactors")

The IRW stores data in long format, where each row corresponds to a single response to a specific item. Before fitting a confirmatory factor analysis (CFA) model, we perform a small amount of preprocessing to ensure the data are well structured. In particular, we remove respondents who have duplicate responses to the same item, so that each person contributes exactly one response per item.

Next, we reshape the data into wide format, which is what lavaan expects: each row corresponds to a respondent and each column corresponds to an item.

Code
# Data cleaning: remove duplicate responses and keep one response per item per respondent.

# Identify IDs with duplicate responses (same id-item appears more than once)
duplicate_id <- df |>
  summarise(n = n(), .by = c(id, item)) |>
  filter(n > 1) |>
  distinct(id)

# Keep only IDs with no duplicate responses
df_clean <- df |>
  filter(!id %in% duplicate_id$id)

# Reshape from long to wide format
df_wide <- df_clean |>
  select(id, item, resp) |>
  pivot_wider(names_from = item, values_from = resp)
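As a quick, self-contained illustration of the reshape step (the toy ids and items below are hypothetical, not from the IRW data):

```r
library(tidyr)

# Hypothetical long-format responses: two respondents, two items
toy <- data.frame(
  id   = c(1, 1, 2, 2),
  item = c("a", "b", "a", "b"),
  resp = c(4, 2, 5, 3)
)

# One row per respondent, one column per item
toy_wide <- pivot_wider(toy, names_from = item, values_from = resp)
toy_wide
```

Each respondent now occupies a single row with columns id, a, and b; any id-item pair absent from the long data would show up as NA in the wide table.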

Inspect the data

Before fitting a CFA model, it is useful to inspect the data. We examine the distribution of responses for each item, the proportion of missing responses per item, and descriptive statistics (skewness and kurtosis) as a rough check on univariate normality.

Code
# Plot the response distribution
df_clean |>
  filter(!is.na(resp)) |>  
  mutate(resp = factor(resp)) |>
  ggplot(aes(x = resp)) +
  geom_bar() +
  facet_wrap(~ item, scales = "free_y") +
  labs(x = "Response", y = "Count") +
  theme_classic()

# Plot the missing rates for each item
missing_item <- df_wide |>
  summarise(across(-id, ~ mean(is.na(.)))) |>
  tidyr::pivot_longer(everything(),
                      names_to = "item",
                      values_to = "missing_rate")

missing_item |>
  arrange(desc(missing_rate)) |>
  mutate(item = factor(item, levels = item)) |>
  ggplot(aes(x = item, y = missing_rate)) +
  geom_col() +
  coord_flip() +
  labs(x = "Item", y = "Missing rate") +
  theme_classic()

# Normality check
desc <- psych::describe(df_wide[,-1])
head(desc[, c("mean", "sd", "skew", "kurtosis")])

desc_df <- as.data.frame(desc) |>
  tibble::rownames_to_column("item")

# Identify items with extreme skewness or kurtosis (potential normality issues)
desc_df |>
  filter(abs(skew) > 2 | abs(kurtosis) > 7)
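The |skew| > 2 and |kurtosis| > 7 cut-offs used above are common rules of thumb for flagging severe non-normality. The toy check below uses hand-rolled moment-based formulas (illustrative only; psych::describe() uses closely related but not identical estimators):

```r
# Moment-based skewness and excess kurtosis (illustrative formulas)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
ex_kurtosis <- function(x) mean((x - mean(x))^4) / sd(x)^4 - 3

set.seed(42)
x_normal <- rnorm(1000)   # roughly symmetric
x_skewed <- rlnorm(1000)  # strongly right-skewed

# Only the lognormal variable should trip the screening rule
abs(skewness(x_normal)) > 2
abs(skewness(x_skewed)) > 2
```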

Prepare the CFA model

The dataset uses a naming convention in which each item name encodes its associated trait (for example, 10_agreeableness). We can take advantage of this structure to build the five-factor lavaan model automatically: rename the items so they are syntactically valid R names, extract the trait from each item name, group the items by trait, and paste together one measurement equation per factor.

Code
# Rename item columns so they do not start with numbers
df_wide_lav <- df_wide |>
  rename_with(~ paste0("i", .x), -id)

# Get item names
items <- colnames(df_wide_lav)[-1]

# Extract trait names
traits <- str_extract(items, "(?<=_).*")

# Split items by trait
item_list <- split(items, traits)

# Build lavaan model string
model_5f <- paste(
  sapply(names(item_list), function(trait) {
    paste0(trait, " =~ ", paste(item_list[[trait]], collapse = " + "))
  }),
  collapse = "\n"
)

cat(model_5f)
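To make the string-building step concrete, here is a self-contained toy version with hypothetical item names (base R sub() stands in for str_extract() so the snippet needs no packages):

```r
# Hypothetical item names following the "i<number>_<trait>" convention
items_toy <- c("i1_agree", "i2_agree", "i3_extra", "i4_extra")

# Drop everything up to and including the first underscore to get the trait
traits_toy <- sub("^[^_]*_", "", items_toy)

# Group items by trait, then write one "trait =~ item + item" line per factor
item_list_toy <- split(items_toy, traits_toy)
model_toy <- paste(
  sapply(names(item_list_toy), function(trait) {
    paste0(trait, " =~ ", paste(item_list_toy[[trait]], collapse = " + "))
  }),
  collapse = "\n"
)
cat(model_toy)
#> agree =~ i1_agree + i2_agree
#> extra =~ i3_extra + i4_extra
```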

Run the CFA model

With the model syntax constructed, we fit the model using lavaan's cfa() function. We use the robust maximum likelihood estimator (MLR), which provides standard errors and a scaled test statistic that are robust to non-normality, a reasonable choice given the distributional checks above.

Code
fit <- cfa(
  model_5f,
  data = df_wide_lav,
  estimator = "MLR"
)

summary(fit, fit.measures = TRUE)
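Beyond summary(), specific fit indices can be extracted with fitMeasures() and standardized loadings with standardizedSolution(). The sketch below fits a simulated one-factor model so that it runs stand-alone; the simulated data and the factor name F are purely illustrative:

```r
library(lavaan)

set.seed(123)
# Simulate 300 respondents on four indicators of one latent factor
f <- rnorm(300)
sim <- data.frame(
  x1 = 0.7 * f + rnorm(300),
  x2 = 0.6 * f + rnorm(300),
  x3 = 0.8 * f + rnorm(300),
  x4 = 0.5 * f + rnorm(300)
)

fit_toy <- cfa("F =~ x1 + x2 + x3 + x4", data = sim, estimator = "MLR")

# Robust versions of the usual indices, matching the MLR estimator
fitMeasures(fit_toy, c("cfi.robust", "tli.robust", "rmsea.robust", "srmr"))

# Standardized loadings (est.std column) for a quick item-quality check
standardizedSolution(fit_toy)
```

The same calls apply to the personality fit above, e.g. fitMeasures(fit, c("cfi.robust", "rmsea.robust", "srmr")).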