2PL Discrimination Parameters Across Cognitive Datasets

A Tier 2 IRW vignette demonstrating multi-dataset IRT analysis. We use irw_filter() to select dichotomous cognitive/educational datasets, fit a 2PL model to each, and examine what discrimination parameters look like in the wild.

Research question

Item discrimination, the a parameter in a 2PL model, captures how sharply an item separates high- and low-ability respondents. The Rasch model takes a = 1 for every item as its starting assumption, but real data are messier. How variable is discrimination across cognitive and educational instruments in practice? And how consequential is the Rasch constraint of fixing a = 1?
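
To make the role of a concrete, the 2PL item response function is easy to write down directly. The snippet below is purely illustrative; the parameter values are arbitrary and not drawn from any IRW dataset:

# 2PL item response function: P(correct | theta) with discrimination a and difficulty b
p_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

# Same difficulty, different discrimination: the steeper curve (a = 2)
# separates respondents near theta = 0 far more sharply than a = 0.5
curve(p_2pl(x, a = 0.5, b = 0), from = -4, to = 4,
      xlab = "Ability (theta)", ylab = "P(correct)")
curve(p_2pl(x, a = 2.0, b = 0), add = TRUE, lty = 2)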

The IRW makes this question tractable: we can filter for dichotomous cognitive datasets, fit a 2PL to each, and examine the resulting distribution of a estimates across instruments. What would otherwise require assembling dozens of datasets by hand is a few lines of code.


Step 1: Select datasets using metadata

We use irw_filter() to find dichotomous datasets (n_categories == 2) tagged as Cognitive/educational, with between 10 and 60 items, at least 500 respondents, and a response density of at least 0.8 to avoid sparse matrices that destabilise 2PL estimation.

Code
cognitive_tables <- irw_filter(
  construct_type = "Cognitive/educational",
  n_categories   = 2,            # dichotomous only
  n_items        = c(10, 60),    # at least 10 items, at most 60
  n_participants = c(500, Inf),  # need enough respondents for stable estimation
  density        = c(0.8, 1.0)  # avoid sparse response matrices
)

This selected 66 datasets from the IRW (as of April 2026).

Note

To reproduce with current IRW holdings, re-run 2pl_across_datasets_compute.R and commit the updated 2pldata/2pl_across_datasets_results.rds.
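
If you only want to explore the cached results without re-fitting anything, the precomputed object can be loaded directly. This assumes the .rds file stores the combined per-item results tibble used in the steps below:

# Load the cached per-item parameter estimates (assumed columns: table, item, a, b)
all_results <- readRDS("2pldata/2pl_across_datasets_results.rds")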


Step 2: Fit a 2PL to each dataset

We write a function that (1) fetches a table, (2) downsamples to at most 10,000 respondents to keep memory use manageable, (3) reshapes from long to wide format, (4) fits a 2PL in mirt, and (5) extracts item parameter estimates. We then map() over all selected tables.

The lognormal prior on a (lnorm, 0.0, 1.0) keeps discrimination estimates positive and prevents runaway values for poorly behaved items, while remaining weak enough to let the data speak.
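
To see just how weak this prior is, you can plot its implied density over a plausible range of a. This is a quick base-R sketch, assuming mirt's lnorm prior is parameterised on the log scale (meanlog = 0, sdlog = 1):

# lnorm(0, 1): median at a = 1, long right tail, most mass roughly between 0.3 and 3
curve(dlnorm(x, meanlog = 0, sdlog = 1), from = 0, to = 5,
      xlab = "Discrimination (a)", ylab = "Prior density")
abline(v = 1, lty = 2)  # Rasch value for reference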

Code
fit_2pl <- function(table_name) {
  mirt::mirt.options(cores = 1)  # prevent mirt from spawning its own threads

  df <- tryCatch(irw_fetch(table_name), error = function(e) NULL)
  if (is.null(df)) return(NULL)

  # Downsample to at most 10,000 respondents before reshaping
  unique_ids <- unique(df$id)
  if (length(unique_ids) > 10000) {
    df <- df[df$id %in% sample(unique_ids, 10000), ]
  }

  resp <- irw_long2resp(df)
  resp$id <- NULL

  # Drop zero-variance items
  resp <- resp[, sapply(resp, function(x) length(unique(na.omit(x))) > 1), drop = FALSE]
  if (ncol(resp) < 5) return(NULL)

  ni         <- ncol(resp)
  model_spec <- mirt.model(paste0(
    "F = 1-", ni, "\n",
    "PRIOR = (1-", ni, ", a1, lnorm, 0.0, 1.0)"
  ))

  fit <- tryCatch(
    mirt(resp, model_spec, itemtype = rep("2PL", ni),
         method = "EM", technical = list(NCYCLES = 500), verbose = FALSE),
    error = function(e) NULL
  )
  if (is.null(fit)) return(NULL)

  params <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items
  tibble(
    table = table_name,
    item  = rownames(params),
    a     = params[, "a"],
    b     = params[, "b"]
  )
}

# Shown sequentially here for clarity; the compute script runs this in
# parallel and writes each result to disk as it completes
# (see 2pl_across_datasets_compute.R for full details)
all_results <- map(cognitive_tables, fit_2pl) |>
  compact() |>
  bind_rows()
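
If you want to parallelise the loop itself rather than rely on the compute script, one option is furrr, assuming it is installed (this sketch omits the incremental disk writes the compute script performs):

library(furrr)
plan(multisession, workers = 4)  # one background R session per worker

all_results <- future_map(cognitive_tables, fit_2pl) |>
  compact() |>
  bind_rows()

plan(sequential)  # release the workers when done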

Step 3: Examine discrimination distributions

Distribution of a across all items

Code
all_results |>
  mutate(a_win = pmin(a, 4)) |>
  ggplot(aes(x = a_win)) +
  geom_density(fill = "#2166ac", colour = "#2166ac", alpha = 0.4, linewidth = 0.8) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey40") +
  labs(
    x       = "Discrimination (a)",
    y       = "Density",
    caption = "Dashed line at a = 1 (Rasch assumption). Values winsorised at 4."
  ) +
  theme_minimal(base_size = 13)
Figure 1: Distribution of 2PL discrimination parameters (a) across cognitive datasets. Estimates > 4 are winsorised for display. Dashed line marks a = 1 (the Rasch assumption).

Dataset-level medians

Each point below is one IRW table. This shows whether the pattern is consistent across datasets or driven by a few outliers.

Code
dataset_summary <- all_results |>
  group_by(table) |>
  summarise(
    median_a = median(a, na.rm = TRUE),
    n_items  = n(),
    .groups  = "drop"
  ) |>
  arrange(median_a) |>
  mutate(rank = row_number())

ggplot(dataset_summary, aes(x = rank, y = median_a)) +
  geom_point(aes(size = n_items), colour = "#2166ac", alpha = 0.7) +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "grey40") +
  scale_size_continuous(range = c(2, 7), name = "# items") +
  labs(
    x = "Dataset (ranked by median a)",
    y = "Median discrimination (a)"
  ) +
  theme_minimal(base_size = 13)
Figure 2: Per-dataset median discrimination, ordered by value. Point size reflects number of items.

Summary statistics

Code
all_results |>
  summarise(
    Datasets      = n_distinct(table),
    Items         = n(),
    `Mean a`      = round(mean(a, na.rm = TRUE), 3),
    `Median a`    = round(median(a, na.rm = TRUE), 3),
    `SD a`        = round(sd(a, na.rm = TRUE), 3),
    `% above 1.0` = round(mean(a > 1, na.rm = TRUE) * 100, 1)
  )
Table 1: Summary of 2PL discrimination estimates across cognitive datasets
Datasets    Items    Mean a    Median a    SD a    % above 1.0
      63     1860     1.355       1.076   1.622           54.5

What to notice

The Rasch assumption. The dashed line at a = 1 marks where the Rasch model parks every item. If the distribution is centred well above 1, or has a long right tail, that suggests the equal-discrimination constraint is doing real work — and possibly distorting person ability estimates for items at the extremes.

Between-dataset variance. The per-dataset median plot is often more informative than the pooled density. If median a varies widely across datasets, that points to instrument design or sample homogeneity as a driver, not just item-level noise.
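
One way to put a number on this is a variance decomposition of log(a) into between- and within-dataset components, for instance via a random-intercept model. A minimal sketch, assuming lme4 is installed (not part of the compute script):

library(lme4)

# Intercept-only model: datasets as random intercepts for log discrimination
vc_fit <- lmer(log(a) ~ 1 + (1 | table), data = all_results)
vc     <- as.data.frame(VarCorr(vc_fit))

# Share of variance in log(a) attributable to between-dataset differences
vc$vcov[vc$grp == "table"] / sum(vc$vcov)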

Limitations. We downsample to 10,000 respondents per dataset, so estimates for large-scale assessments are based on a random subset. For the purpose of characterising the distribution of a this is acceptable — 10,000 respondents is more than enough for stable 2PL estimation — but it means we are not using the full IRW data. Some tables may also come from the same instrument administered in different populations; irw_info() can help identify these.


Extending this vignette

  • Model fit comparison. For each dataset, compare Rasch vs. 2PL using M2() fit statistics or IMV (see the IMV tutorial). Where does freeing discrimination actually improve fit? A minimal starting sketch follows this list.

  • Difficulty coverage. The all_results object also contains b parameters. How does the range of item difficulties vary across instruments?

  • Construct type granularity. Re-run with other construct_type values to compare discrimination patterns across domains.
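
As a starting point for the first extension, here is a minimal sketch comparing Rasch and 2PL fits on a single table. It reuses cognitive_tables from Step 1 and abbreviates the preprocessing in fit_2pl():

resp <- irw_long2resp(irw_fetch(cognitive_tables[1]))
resp$id <- NULL

m_rasch <- mirt(resp, 1, itemtype = "Rasch", verbose = FALSE)
m_2pl   <- mirt(resp, 1, itemtype = "2PL",   verbose = FALSE)

anova(m_rasch, m_2pl)  # likelihood-ratio test plus AIC/BIC
M2(m_rasch)            # limited-information fit statistics, Rasch
M2(m_2pl)              # limited-information fit statistics, 2PL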


Reproducibility

Results on this page were computed on April 24, 2026 using the 66 IRW tables selected in Step 1 (63 of which appear in the final results; see Table 1). To reproduce:

# 1. Re-run the compute script (from project root)
source("2pl_across_datasets_compute.R")

# 2. Re-render this page
quarto::quarto_render("2pl_across_datasets.qmd")
Tip

To cite the datasets used, run:

for (tbl in cognitive_tables) {
  irw_save_bibtex(tbl, output_file = "2pldata/irw_references.bib", append = TRUE)
}