Item-Level Heterogeneous Treatment Effects

When an RCT uses a psychometric instrument as its outcome, different items may respond differently to treatment. We fit item-level HTE models across Item Response Warehouse (IRW) datasets and show how ignoring this variation can bias treatment-by-covariate interaction estimates.

Published

May 21, 2026

Overview

Many RCTs in education, economics, and health research use psychometric instruments — tests, surveys, symptom checklists — as outcome measures. A standard analysis sums responses into a single score and estimates the average treatment effect (ATE) on that score. But the individual items of the instrument may not respond equally to treatment. An item that directly targets the intervention’s content may show a large positive effect while an unrelated item shows none.

This item-level variation is called item-level heterogeneous treatment effects (IL-HTE) (Gilbert, Himmelsbach, et al. 2025). Formally, if the log-odds of a correct response for item \(i\) by person \(j\) follows

\[ \begin{align} \text{logit}(\Pr(Y_{ij} = 1)) = \eta_{ij} &= \theta_j + b_{i} + \zeta_i T_j \\ \theta_j &= \beta_0 + \beta_1 T_j + \varepsilon_j \\ \begin{bmatrix} b_i \\ \zeta_{i} \end{bmatrix} &\sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},\begin{bmatrix} \sigma^2_b & \rho\sigma_b\sigma_\zeta \\ \rho\sigma_b\sigma_\zeta & \sigma^2_\zeta \end{bmatrix}\right) \\ \varepsilon_j &\sim N(0, \sigma^2_\theta) \end{align} \]

then \(\zeta_i\) is the residual treatment effect on item \(i\) above and beyond the overall ATE \(\beta_1\). When the variance \(\sigma_\zeta^2 > 0\), items respond heterogeneously to treatment. In lme4 notation this is:

glmer(resp ~ treat + (1|id) + (treat|item), family = binomial)
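To make the data-generating process concrete, the sketch below simulates responses from the model above and, if lme4 is installed, fits the same formula. All parameter values, the seed, and the object names (`dat`, `item_fx`) are assumptions chosen for illustration; none come from an IRW dataset.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n_persons <- 200; n_items <- 12
# Assumed parameter values (illustrative, not estimates)
beta0 <- 0; beta1 <- 0.3
sigma_theta <- 1; sigma_b <- 1; sigma_zeta <- 0.4; rho <- 0.5

# Person side: random assignment and ability theta_j
treat <- rbinom(n_persons, 1, 0.5)
theta <- beta0 + beta1 * treat + rnorm(n_persons, 0, sigma_theta)

# Item side: draw (b_i, zeta_i) from the bivariate normal via a Cholesky factor
Sigma <- matrix(c(sigma_b^2, rho * sigma_b * sigma_zeta,
                  rho * sigma_b * sigma_zeta, sigma_zeta^2), 2, 2)
item_fx <- matrix(rnorm(2 * n_items), n_items, 2) %*% chol(Sigma)
b <- item_fx[, 1]; zeta <- item_fx[, 2]

# Long format: one row per person-item response
dat <- expand.grid(id = seq_len(n_persons), item = seq_len(n_items))
dat$treat <- treat[dat$id]
eta <- theta[dat$id] + b[dat$item] + zeta[dat$item] * dat$treat
dat$resp <- rbinom(nrow(dat), 1, plogis(eta))

# Fit the IL-HTE model only if lme4 is available
if (requireNamespace("lme4", quietly = TRUE)) {
  fit <- lme4::glmer(resp ~ treat + (1 | id) + (treat | item),
                     data = dat, family = binomial)
  print(lme4::VarCorr(fit))  # item-level SDs and their correlation
}
```

With only 12 items, the estimated item-level SDs and correlation will be noisy; the point is the mapping between the equations and the formula, not parameter recovery.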

Beyond masking item-level variation, IL-HTE has a less obvious consequence for causal identification. When \(\zeta_i\) is correlated with item easiness \(b_i\) (that is, \(\rho \ne 0\)), a conventional treatment-by-covariate interaction model, the standard tool for asking “who benefits most?”, can produce biased estimates even when no true person-level heterogeneity exists. Gilbert, Miratrix, et al. (2025) prove this analytically and demonstrate it through simulation. We illustrate both insights below using RCT datasets from the IRW.

Item-level treatment effects

We fit the IL-HTE model to each dataset, extract empirical Bayes estimates of the item-specific effects \(\hat\beta_1 + \hat\zeta_i\), and standardize by \(\hat\sigma_\theta\) (the person-level SD from the baseline random-intercepts model) so that effects are comparable across datasets. The standardized ATE \(\hat\beta_1/\hat\sigma_\theta\) is shown as a large point; individual item effects as small points. Datasets are ordered by ATE. Color encodes whether the IL-HTE variance \(\sigma^2_\zeta\) is statistically significant (likelihood ratio test, \(p < 0.05\), with the boundary correction of dividing the reported \(p\)-value by 2).

library(dplyr)   # arrange(), pull(), left_join(), select(), mutate(), if_else()
library(plotly)

tab_order <- summary_df |> arrange(ate_std) |> pull(tab)

plot_df <- item_df |>
  left_join(summary_df |> select(tab, ate_std, sig_il_hte), by = "tab") |>
  mutate(
    y_pos     = as.integer(factor(tab, levels = tab_order)),
    sig_label = if_else(sig_il_hte, "Significant", "Not significant"),
    color_val = if_else(sig_il_hte, irw_blue, irw_red)
  )

ate_df <- summary_df |>
  mutate(
    y_pos     = as.integer(factor(tab, levels = tab_order)),
    sig_label = if_else(sig_il_hte, "Significant", "Not significant"),
    color_val = if_else(sig_il_hte, irw_blue, irw_red),
    hover_text = paste0(
      "<b>", tab, "</b><br>",
      "ATE: ", round(ate_std, 3), " SD<br>",
      "σζ: ", round(sigma_zeta_std, 3), " SD<br>",
      "ρ: ", round(rho, 2), "<br>",
      "Items: ", n_items, "<br>",
      "N: ", n_persons, "<br>",
      "Sig. IL-HTE: ", sig_il_hte
    )
  )

# Height is set in plot_ly(); passing it to layout() is deprecated in plotly.
fig <- plot_ly(height = max(400, nrow(summary_df) * 14)) |>
  # Small points: one per item, colored by the dataset's IL-HTE test result
  add_trace(
    data       = plot_df,
    x          = ~total_std,
    y          = ~y_pos,
    type       = "scatter",
    mode       = "markers",
    marker     = list(size = 4, opacity = 0.4),
    color      = ~sig_label,
    colors     = c("Significant" = irw_blue, "Not significant" = irw_red),
    hoverinfo  = "none",
    showlegend = TRUE
  ) |>
  # Large points: one per dataset (standardized ATE), carrying the hover text
  add_trace(
    data       = ate_df,
    x          = ~ate_std,
    y          = ~y_pos,
    type       = "scatter",
    mode       = "markers",
    marker     = list(size = 10, line = list(width = 1, color = "white")),
    color      = ~sig_label,
    colors     = c("Significant" = irw_blue, "Not significant" = irw_red),
    text       = ~hover_text,
    hoverinfo  = "text",
    showlegend = FALSE
  ) |>
  layout(
    xaxis = list(title = "Item-specific treatment effect (SDs)",
                 zeroline = TRUE, zerolinecolor = "#aaa", zerolinewidth = 1),
    yaxis = list(title = "", showticklabels = FALSE, showgrid = FALSE),
    legend = list(orientation = "h", x = 0, y = -0.05),
    margin = list(l = 20)
  )

fig

Figure 1: Empirical Bayes estimates of item-specific treatment effects, standardized by σ̂_θ. Small points: individual item effects (β̂₁ + ζ̂ᵢ)/σ̂_θ. Large points: each dataset's standardized ATE, β̂₁/σ̂_θ. Hover over a large point to see dataset details.

Across 70 datasets, 40 show statistically significant IL-HTE (\(p < 0.05\)). In datasets with wide spreads of item effects, a single-number ATE obscures meaningful variation in which aspects of the outcome respond to the intervention — information that would be invisible in a sum-score analysis.
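The two computations behind Figure 1, the boundary-corrected likelihood ratio test and the standardization of item effects, can be sketched as small helpers. The function names and the input values are hypothetical, and `df = 2` is an assumption: it reflects that `(treat|item)` adds one variance and one covariance relative to `(1|item)`, which is what a standard model comparison would report before the p-value is halved.

```r
# Boundary-corrected LRT p-value from two log-likelihoods.
# ll_null: model with (1 | item); ll_alt: model with (treat | item).
lrt_boundary_p <- function(ll_null, ll_alt, df = 2) {
  stat <- max(0, 2 * (ll_alt - ll_null))           # LRT statistic, floored at 0
  p_naive <- pchisq(stat, df = df, lower.tail = FALSE)
  p_naive / 2                                      # halve the reported p-value
}

# Standardized item-specific effects: (beta1 + zeta_i) / sigma_theta
standardize_item_effects <- function(beta1, zeta, sigma_theta) {
  (beta1 + zeta) / sigma_theta
}

# Hypothetical inputs, for illustration only
lrt_boundary_p(ll_null = -1000, ll_alt = -997)
standardize_item_effects(beta1 = 0.3, zeta = c(-0.2, 0, 0.2), sigma_theta = 1.5)
```

In practice the log-likelihoods would come from `logLik()` on the two fitted models, and `zeta` from the item-level random effects of the IL-HTE fit.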

The identification problem

A more subtle consequence of IL-HTE concerns the estimation of treatment-by-covariate interactions. Gilbert, Miratrix, et al. (2025) show analytically that when \(\rho \ne 0\), two distinct data-generating processes produce identical patterns in sum-score outcomes: (1) true person-level HTE via a treatment × pretest interaction, and (2) item-level HTE correlated with item easiness. A model that omits IL-HTE therefore cannot distinguish between these processes and will produce biased estimates of the treatment-by-covariate interaction.

Specifically, when \(\rho > 0\) (easier items respond more strongly to treatment), treatment “stretches” the item difficulty distribution — the ratio \(\sigma_b^\text{treat}/\sigma_b^\text{ctrl}\) exceeds 1 — and the constant-item model attributes this variation to a spurious negative treatment-by-pretest interaction.
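The "stretching" follows directly from the model: control-group easiness is \(b_i\) and treatment-group easiness is \(b_i + \zeta_i\), so \(\text{Var}(b_i + \zeta_i) = \sigma_b^2 + \sigma_\zeta^2 + 2\rho\sigma_b\sigma_\zeta\). A minimal sketch of the model-implied ratio, with a hypothetical helper name and assumed parameter values:

```r
# Model-implied ratio of item-easiness SDs, treatment vs. control.
# Numerator: SD of b_i + zeta_i; denominator: SD of b_i.
sd_ratio <- function(sigma_b, sigma_zeta, rho) {
  sqrt(sigma_b^2 + sigma_zeta^2 + 2 * rho * sigma_b * sigma_zeta) / sigma_b
}

# Assumed values, for illustration only
sd_ratio(sigma_b = 1, sigma_zeta = 0.5, rho =  0.5)  # sqrt(1.75), above 1
sd_ratio(sigma_b = 1, sigma_zeta = 0.5, rho = -0.5)  # sqrt(0.75), below 1
```

Under this formula the ratio exceeds 1 whenever \(\rho > -\sigma_\zeta / (2\sigma_b)\), so \(\rho > 0\) is sufficient but not necessary. Ratios estimated from data may differ from this model-implied value.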

We illustrate this empirically by fitting two additional models per dataset:

M_B (constant item effects + person HTE): resp ~ treat + cov + treat:cov + (1|item) + (1|id)
M_C (flexible EIRM — IL-HTE + person HTE): resp ~ treat + cov + treat:cov + (treat|item) + (1|id)

where cov is the standardized pretest score. The \(\hat\beta_3\) coefficient on treat:cov represents the treatment-by-covariate interaction; M_B estimates it without allowing for IL-HTE, M_C estimates it alongside IL-HTE.
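A minimal sketch of the comparison for a single dataset follows. It assumes a long-format data frame with columns `resp`, `treat` (numeric 0/1), `cov`, `item`, and `id`; the helper name `compare_beta3` is hypothetical and this is not necessarily how the compute script is organized.

```r
# The two model formulas, differing only in the item-level random effects
f_b <- resp ~ treat + cov + treat:cov + (1 | item) + (1 | id)      # M_B
f_c <- resp ~ treat + cov + treat:cov + (treat | item) + (1 | id)  # M_C

# Fit both models and return beta3(M_B) - beta3(M_C).
# Requires lme4; assumes treat is numeric so the coefficient is named "treat:cov".
compare_beta3 <- function(dat) {
  fit_b <- lme4::glmer(f_b, data = dat, family = binomial)
  fit_c <- lme4::glmer(f_c, data = dat, family = binomial)
  unname(lme4::fixef(fit_b)["treat:cov"] - lme4::fixef(fit_c)["treat:cov"])
}
```

Calling `compare_beta3(dat)` on one dataset returns a single number, the quantity plotted on the vertical axis of Figure 2. If `treat` is stored as a factor, the coefficient name changes and the lookup would need adjusting.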

id_df <- summary_df |>
  filter(!is.na(delta_beta3)) |>
  mutate(
    hover_text = paste0(
      "<b>", tab, "</b><br>",
      "Δβ₃: ", round(delta_beta3, 3), "<br>",
      "SD ratio: ", round(sd_ratio, 3), "<br>",
      "ρ: ", round(rho, 2), "<br>",
      "Items: ", n_items, "<br>",
      "N: ", n_persons
    )
  )

plot_ly(id_df,
  x         = ~sd_ratio,
  y         = ~delta_beta3,
  type      = "scatter",
  mode      = "markers",
  marker    = list(
    size        = ~sqrt(n_items) * 2,
    color       = ~rho,
    colorscale  = list(c(0, irw_blue), c(0.5, "#d3d3d3"), c(1, irw_red)),
    cmin        = -1, cmax = 1,
    colorbar    = list(title = "ρ"),
    line        = list(width = 0.5, color = "white")
  ),
  text      = ~hover_text,
  hoverinfo = "text"
) |>
layout(
  xaxis = list(title = "σ_b(treat) / σ_b(ctrl)",
               zeroline = FALSE,
               zerolinecolor = "#aaa",
               showline = TRUE),
  yaxis = list(title = "β₃(M_B) − β₃(M_C)",
               zeroline = TRUE, zerolinecolor = "#aaa", zerolinewidth = 1),
  shapes = list(
    list(type = "line", x0 = 1, x1 = 1,
         y0 = ~min(delta_beta3, na.rm=TRUE), y1 = ~max(delta_beta3, na.rm=TRUE),
         line = list(dash = "dash", color = "#aaa"))
  )
)

Figure 2: Difference in treatment-by-covariate interaction estimates between M_B (constant item effects) and M_C (with IL-HTE), as a function of the ratio of item-difficulty SDs in treatment vs. control groups. Hover over a point for dataset details.

Datasets to the right of the vertical reference line (ratio > 1) tend to show negative \(\hat\beta_3(M_B) - \hat\beta_3(M_C)\): the conventional model produces a more negative interaction estimate than the flexible model, consistent with the theoretical predictions of Gilbert, Miratrix, et al. (2025). This pattern arises because the correlation between item easiness and item-specific treatment effects (\(\rho\)) creates the appearance of heterogeneity by pretest score in a sum-score analysis, even if none exists at the person level.
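The mechanism can be seen without any sampling noise by computing expected sum scores directly. The sketch below is a deliberately stylized case with assumed values: item effects are an exact multiple of easiness (so \(\rho = 1\)), the average item effect is zero, and no person-level heterogeneity is built in. The names `expected_score` and `gain` are hypothetical.

```r
# Stylized, deterministic illustration (assumed values)
b     <- seq(-2, 2, length.out = 9)  # item easiness, symmetric around 0
zeta  <- 0.5 * b                     # item effects proportional to easiness (rho = 1)
theta <- seq(-2, 2, by = 0.5)        # grid of person abilities (pretest proxy)

# Expected sum score at ability th under control (tr = 0) or treatment (tr = 1)
expected_score <- function(th, tr) sum(plogis(th + b + zeta * tr))

# Treatment effect on the expected sum score at each ability level
gain <- sapply(theta, function(th) expected_score(th, 1) - expected_score(th, 0))

# Slope of the sum-score treatment effect on ability: the apparent interaction
slope <- unname(coef(lm(gain ~ theta))["theta"])
data.frame(theta, gain = round(gain, 3))
slope
```

Here low-ability persons gain on the sum score and high-ability persons lose, so the fitted slope is negative even though every person shares the same item-level effects. That is the spurious treatment-by-pretest interaction a constant-item model would report.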

Reproducibility

Results computed May 20, 2026. To regenerate, run the compute script from a shell, then render the vignette from an R session:

Rscript vignettes/il_hte_compute.R
quarto::quarto_render("vignettes/il_hte.qmd")

References

Gilbert, Joshua B, Zachary Himmelsbach, James Soland, and Benjamin W Domingue. 2025. “Estimating Heterogeneous Treatment Effects with Item-Level Outcome Data: Insights from Item Response Theory.” Journal of Policy Analysis and Management 44 (4): 1417–49. https://doi.org/10.1002/pam.70025.

Gilbert, Joshua B, Luke W Miratrix, Mridul Joshi, and Benjamin W Domingue. 2025. “Disentangling Person-Dependent and Item-Dependent Causal Effects: Applications of Item Response Theory to the Estimation of Treatment Effect Heterogeneity.” Journal of Educational and Behavioral Statistics 50 (1): 72–101. https://doi.org/10.3102/10769986241240085.