Estimating IRT Models in Python

A step-by-step introduction to importing IRW data in Python and fitting standard IRT models (1PL, 2PL, 3PL, and graded response) using the mirt and girth packages.

This vignette walks through a complete Python workflow for working with IRW data: fetching a dataset, preparing it for IRT estimation, fitting the 1PL (Rasch), 2PL, and 3PL models for dichotomous responses, and the graded response model (GRM) for polytomous data. We use the irw package for data access, mirt for dichotomous estimation, and girth for the GRM.


Setup

Install required packages

Note

The mirt package requires Python ≥ 3.11.

# pip install --upgrade setuptools
# pip install "git+https://github.com/itemresponsewarehouse/Python-pkg.git"
# pip install mirt girth scipy matplotlib

Load libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import irw
from mirt import fit_mirt
from girth import grm_mml, tag_missing_data, INVALID_RESPONSE

Part 1: Dichotomous IRT (1PL, 2PL, 3PL)

Fetch a dataset

We start with 4thgrade_math_sirt, a dichotomous cognitive dataset from the IRW containing math item responses from fourth-grade students. On first use, irw.fetch() will open a browser window prompting you to authenticate with your Redivis account.

df = irw.fetch("4thgrade_math_sirt")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
Warning: Setting the REDIVIS_API_TOKEN for interactive sessions is deprecated and highly discouraged.
Please delete the token on Redivis and remove it from your code, and follow the authentication prompts here instead.

This environment variable should only ever be set in a non-interactive environment, such as in an automated script or service.

  item   id  resp testlet      domain subdomain
0  MA1  125     0      MA  arithmetic  addition
1  MA1  653     0      MA  arithmetic  addition
2  MA1  483     0      MA  arithmetic  addition
3  MA1  609     0      MA  arithmetic  addition
4  MA1  412     0      MA  arithmetic  addition

Shape: (19920, 6)
Columns: ['item', 'id', 'resp', 'testlet', 'domain', 'subdomain']

IRW data is in long format: one row per person–item response. The key columns are:

  • id — respondent identifier
  • item — item identifier
  • resp — response value (0/1 for dichotomous items)
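
Because each row is one person–item pair, per-item summaries fall out of a single groupby. A minimal sketch on a tiny hand-made long-format table (the values are made up; real data would come from irw.fetch as above):

```python
import pandas as pd

# Tiny synthetic long-format table with the three key IRW columns
long_df = pd.DataFrame({
    "id":   [1, 1, 2, 2, 3, 3],
    "item": ["MA1", "MA2", "MA1", "MA2", "MA1", "MA2"],
    "resp": [1, 0, 1, 1, 0, 0],
})

# One row per person-item response makes per-item summaries a one-liner
p_correct = long_df.groupby("item")["resp"].mean()
print(p_correct)
```

The same pattern works on the fetched df, e.g. `df.groupby("item")["resp"].mean()` for classical item difficulties.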

Prepare the response matrix

mirt expects a wide integer array of shape (n_persons, n_items) with missing responses coded as -1. We use irw.long2resp() to reshape from long to wide, then convert.

# Reshape from long to wide (persons × items)
resp_wide = irw.long2resp(df)
resp_wide = resp_wide.drop(columns="id", errors="ignore")

# Convert to numeric; encode missing as -1
resp_arr = resp_wide.apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float, na_value=np.nan)

n_persons, n_items = resp_arr.shape
print(f"Persons: {n_persons}, Items: {n_items}")
print(f"Overall proportion correct: {np.nanmean(resp_arr):.3f}")

resp_matrix = np.where(np.isnan(resp_arr), -1, resp_arr).astype(int)
Persons: 664, Items: 30
Overall proportion correct: 0.510

Fit the 1PL (Rasch) model

The 1PL constrains all items to share a common discrimination, estimating only a difficulty (b) parameter per item. fit_mirt estimates parameters by marginal maximum likelihood via the expectation-maximization (EM) algorithm.

item_names = resp_wide.columns.tolist()

fit_1pl = fit_mirt(resp_matrix, model="1PL", item_names=item_names)
print(fit_1pl.summary())
================================================================================
                               IRT Model Results                                
================================================================================
Model:              1PL                  Log-Likelihood:     -12319.6172
No. Items:          30                   AIC:                 24759.2345
No. Factors:        1                    BIC:                 25029.1314
No. Persons:        664                  No. Parameters:              60
Converged:          True                 Iterations:                  17
--------------------------------------------------------------------------------

discrimination:
Item              Estimate    Std.Err    z-value      P>|z|     [95%      CI]
--------------------------------------------------------------------------------
MA1                 1.0000     0.0000        nan        nan      nan      nan
MA2                 1.0000     0.0000        nan        nan      nan      nan
MA3                 1.0000     0.0000        nan        nan      nan      nan
MA4                 1.0000     0.0000        nan        nan      nan      nan
MB1                 1.0000     0.0000        nan        nan      nan      nan
MB2                 1.0000     0.0000        nan        nan      nan      nan
MB3                 1.0000     0.0000        nan        nan      nan      nan
MB4                 1.0000     0.0000        nan        nan      nan      nan
MC1                 1.0000     0.0000        nan        nan      nan      nan
MC2                 1.0000     0.0000        nan        nan      nan      nan
MC3                 1.0000     0.0000        nan        nan      nan      nan
MC4                 1.0000     0.0000        nan        nan      nan      nan
MD1                 1.0000     0.0000        nan        nan      nan      nan
MD2                 1.0000     0.0000        nan        nan      nan      nan
MD3                 1.0000     0.0000        nan        nan      nan      nan
MD4                 1.0000     0.0000        nan        nan      nan      nan
ME1                 1.0000     0.0000        nan        nan      nan      nan
ME2                 1.0000     0.0000        nan        nan      nan      nan
MF1                 1.0000     0.0000        nan        nan      nan      nan
MG1                 1.0000     0.0000        nan        nan      nan      nan
MG2                 1.0000     0.0000        nan        nan      nan      nan
MG3                 1.0000     0.0000        nan        nan      nan      nan
MG4                 1.0000     0.0000        nan        nan      nan      nan
MH1                 1.0000     0.0000        nan        nan      nan      nan
MH2                 1.0000     0.0000        nan        nan      nan      nan
MH3                 1.0000     0.0000        nan        nan      nan      nan
MH4                 1.0000     0.0000        nan        nan      nan      nan
MI1                 1.0000     0.0000        nan        nan      nan      nan
MI2                 1.0000     0.0000        nan        nan      nan      nan
MI3                 1.0000     0.0000        nan        nan      nan      nan

difficulty:
Item              Estimate    Std.Err    z-value      P>|z|     [95%      CI]
--------------------------------------------------------------------------------
MA1                -2.3541     0.1277    -18.441     0.0000  -2.6043  -2.1039
MA2                -0.2902     0.0851     -3.410     0.0006  -0.4571  -0.1234
MA3                -0.1176     0.0846     -1.390     0.1646  -0.2834   0.0482
MA4                 0.4653     0.0858      5.421     0.0000   0.2971   0.6336
MB1                -0.6463     0.0874     -7.391     0.0000  -0.8177  -0.4749
MB2                -0.5553     0.0867     -6.407     0.0000  -0.7252  -0.3855
MB3                 0.7367     0.0880      8.371     0.0000   0.5642   0.9092
MB4                 0.7213     0.0879      8.209     0.0000   0.5491   0.8935
MC1                -1.3914     0.0984    -14.147     0.0000  -1.5842  -1.1987
MC2                 0.0967     0.0845      1.144     0.2526  -0.0690   0.2623
MC3                -0.7940     0.0889     -8.928     0.0000  -0.9683  -0.6197
MC4                -0.2541     0.0850     -2.990     0.0028  -0.4206  -0.0875
MD1                -0.8982     0.0902     -9.960     0.0000  -1.0749  -0.7215
MD2                -1.2873     0.0963    -13.368     0.0000  -1.4760  -1.0985
MD3                -0.4955     0.0862     -5.746     0.0000  -0.6646  -0.3265
MD4                -0.5030     0.0863     -5.829     0.0000  -0.6721  -0.3339
ME1                -0.9805     0.0913    -10.742     0.0000  -1.1594  -0.8016
ME2                 0.5171     0.0862      6.001     0.0000   0.3482   0.6860
MF1                -0.3193     0.0852     -3.746     0.0002  -0.4863  -0.1522
MG1                 0.3847     0.0854      4.506     0.0000   0.2174   0.5521
MG2                 0.7756     0.0884      8.773     0.0000   0.6024   0.9489
MG3                 0.3121     0.0851      3.669     0.0002   0.1454   0.4788
MG4                 0.4653     0.0858      5.421     0.0000   0.2971   0.6336
MH1                 0.0182     0.0845      0.215     0.8298  -0.1474   0.1837
MH2                 0.6295     0.0870      7.234     0.0000   0.4590   0.8001
MH3                 0.3629     0.0853      4.255     0.0000   0.1957   0.5300
MH4                 0.8228     0.0889      9.253     0.0000   0.6485   0.9971
MI1                 0.8387     0.0891      9.412     0.0000   0.6640   1.0133
MI2                 0.1825     0.0847      2.156     0.0311   0.0166   0.3485
MI3                 1.6350     0.1035     15.794     0.0000   1.4321   1.8379
================================================================================
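
One way to read the difficulty column above: under the 1PL, b is the ability level at which P(correct) is exactly 0.5, so a respondent well above an item's difficulty almost always answers it correctly. A minimal numeric sketch (the response function only, not the fitted model object):

```python
import numpy as np

def p_rasch(theta, b):
    # 1PL/Rasch response function: P(correct) = logistic(theta - b),
    # with discrimination fixed at 1 for every item
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# b is the ability at which P(correct) crosses 0.5
print(p_rasch(theta=-2.35, b=-2.35))  # 0.5

# An easy item (b = -2.35, like MA1 above) is answered correctly
# by an average respondent (theta = 0) most of the time
print(round(p_rasch(theta=0.0, b=-2.35), 3))
```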

Fit the 2PL model

The 2PL additionally estimates a discrimination parameter (a) for each item, allowing items to differ in how sharply they separate high- and low-ability respondents.

fit_2pl = fit_mirt(resp_matrix, model="2PL", item_names=item_names)

summary_2pl = pd.DataFrame({
    "discrimination": fit_2pl.model.discrimination.round(3),
    "difficulty":     fit_2pl.model.difficulty.round(3),
}, index=item_names)
print(summary_2pl.to_string())
     discrimination  difficulty
MA1           1.073      -2.088
MA2           1.389      -0.243
MA3           2.049      -0.099
MA4           2.085       0.196
MB1           2.377      -0.338
MB2           1.932      -0.338
MB3           2.845       0.268
MB4           2.989       0.254
MC1           1.988      -0.778
MC2           2.097       0.010
MC3           1.951      -0.464
MC4           2.304      -0.157
MD1           2.428      -0.451
MD2           2.090      -0.699
MD3           2.459      -0.262
MD4           2.022      -0.300
ME1           2.154      -0.529
ME2           2.145       0.217
MF1           2.147      -0.197
MG1           4.383       0.096
MG2           4.998       0.207
MG3           4.578       0.071
MG4           5.000       0.113
MH1           1.026      -0.031
MH2           1.477       0.369
MH3           1.361       0.209
MH4           1.348       0.536
MI1           0.828       0.855
MI2           1.290       0.089
MI3           1.438       1.055
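
The discrimination column can be read as a slope: the 2PL ICC rises at rate a/4 at θ = b, so an item like MG4 (a ≈ 5) separates respondents near its difficulty far more sharply than MI1 (a ≈ 0.83). A quick numerical check of the a/4 rule (a sketch, independent of the fitted model):

```python
import numpy as np

def p_2pl(theta, a, b):
    # 2PL response function: P(correct) = logistic(a * (theta - b))
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# The ICC slope at theta == b equals a / 4: higher discrimination
# means a steeper rise in P(correct) around the item's difficulty
eps = 1e-6
for a in (0.8, 2.0, 5.0):
    slope = (p_2pl(eps, a, 0.0) - p_2pl(-eps, a, 0.0)) / (2 * eps)
    print(f"a = {a}: numerical slope at b = {slope:.4f}, a/4 = {a / 4}")
```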

Fit the 3PL model

The 3PL adds a lower-asymptote parameter per item (c, the “guessing” parameter). This is appropriate when chance-level correct responding is plausible (e.g., on multiple-choice exams).

fit_3pl = fit_mirt(resp_matrix, model="3PL", item_names=item_names)

summary_3pl = pd.DataFrame({
    "discrimination": fit_3pl.model.discrimination.round(3),
    "difficulty":     fit_3pl.model.difficulty.round(3),
    "guessing":       fit_3pl.model.guessing.round(3),
}, index=item_names)
print(summary_3pl.to_string())
     discrimination  difficulty  guessing
MA1           1.168      -1.149     0.500
MA2           1.133      -0.210     0.000
MA3           1.521      -0.037     0.000
MA4           1.558       0.360     0.000
MB1           1.785      -0.358     0.000
MB2           1.436      -0.358     0.000
MB3           2.069       0.470     0.000
MB4           2.190       0.450     0.000
MC1           1.510      -0.932     0.000
MC2           1.590       0.110     0.000
MC3           1.521      -0.508     0.000
MC4           1.731      -0.115     0.000
MD1           1.904      -0.495     0.000
MD2           1.710      -0.787     0.000
MD3           1.860      -0.254     0.000
MD4           1.557      -0.299     0.000
ME1           1.590      -0.618     0.000
ME2           1.684       0.380     0.000
MF1           1.567      -0.172     0.000
MG1           2.793       0.256     0.006
MG2           2.987       0.430     0.000
MG3           3.275       0.237     0.024
MG4           3.022       0.282     0.000
MH1           3.176       0.743     0.357
MH2           2.869       0.804     0.206
MH3           2.531       0.719     0.236
MH4           2.404       0.905     0.171
MI1           2.592       1.226     0.247
MI2           1.123       0.395     0.082
MI3           2.586       1.255     0.103
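
The nonzero guessing estimates for items MH1–MI3 change how their curves bottom out: under the 3PL, P(correct) = c + (1 − c)·logistic(a(θ − b)), so even a very low-ability respondent answers correctly with probability about c. A minimal sketch (the parameter values here are illustrative, not the fitted ones):

```python
import numpy as np

def p_3pl(theta, a, b, c):
    # 3PL response function: the lower asymptote c lifts the floor
    # of the curve, so P(correct) never drops below c
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# With c = 0.25 the floor is ~0.25, no matter how low theta is
print(round(p_3pl(-10.0, a=1.5, b=0.0, c=0.25), 3))

# At theta == b the probability is c + (1 - c)/2, not 0.5
print(p_3pl(0.0, a=1.5, b=0.0, c=0.25))  # 0.625
```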

Plot item characteristic curves (ICCs)

Visualizing ICCs gives an intuitive view of how each item functions. Here we compare a selection of items across the 1PL and 2PL.

theta_range = np.linspace(-4, 4, 200)

def icc_1pl(theta, b):
    return 1 / (1 + np.exp(-(theta - b)))

def icc_2pl(theta, a, b):
    return 1 / (1 + np.exp(-a * (theta - b)))

difficulty_1pl     = fit_1pl.model.difficulty
discrimination_2pl = fit_2pl.model.discrimination
difficulty_2pl     = fit_2pl.model.difficulty

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

for idx, name in enumerate(item_names[:5]):
    axes[0].plot(theta_range, icc_1pl(theta_range, difficulty_1pl[idx]), label=name)
    axes[1].plot(theta_range, icc_2pl(theta_range, discrimination_2pl[idx], difficulty_2pl[idx]), label=name)

for ax, title in zip(axes, ["1PL (Rasch)", "2PL"]):
    ax.set_title(title)
    ax.set_xlabel("Ability (θ)")
    ax.set_ylabel("P(correct)")
    ax.axhline(0.5, color="grey", linestyle="--", linewidth=0.8)
    ax.legend(fontsize=8)
    ax.set_ylim(0, 1)

fig.suptitle("Item Characteristic Curves", fontsize=14)
plt.tight_layout()
plt.show()


Part 2: Polytomous IRT (Graded Response Model)

The Graded Response Model (GRM) is appropriate for ordered polytomous items (e.g., Likert-scale survey responses). We use girth for the GRM, which has a stable MML implementation for polytomous models.
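
Before fitting, it helps to see what the GRM predicts: each item gets one discrimination a and K − 1 ordered thresholds, and category probabilities come from differencing cumulative logistic curves. A small sketch with made-up parameters for a hypothetical 4-category item:

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    # Cumulative P(X >= k) = logistic(a * (theta - b_k)); category
    # probabilities are adjacent differences of the cumulative curve
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(thresholds))))
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]

# a and thresholds are illustrative, not estimated from data
probs = grm_category_probs(theta=0.0, a=1.2, thresholds=[-1.0, 0.0, 1.0])
print(probs.round(3), probs.sum())  # four probabilities summing to 1
```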

Fetch a polytomous dataset

df_poly = irw.fetch("5personalityfactors")
print(df_poly.head())
print(f"\nUnique response categories: {sorted(df_poly['resp'].dropna().unique())}")
Warning: Setting the REDIVIS_API_TOKEN for interactive sessions is deprecated and highly discouraged.
Please delete the token on Redivis and remove it from your code, and follow the authentication prompts here instead.

This environment variable should only ever be set in a non-interactive environment, such as in an automated script or service.

      id              item  resp  age
0  15111  10_agreeableness   1.0   18
1  21348  10_agreeableness   1.0   18
2  22571  10_agreeableness   1.0   18
3  30530  10_agreeableness   1.0   18
4  31443  10_agreeableness   1.0   18

Unique response categories: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, nan]

Prepare the response matrix

girth expects a (n_items, n_persons) integer array with categories coded starting from 0 and missing values encoded as INVALID_RESPONSE (girth’s sentinel, -99999). We shift the responses so the lowest observed category becomes 0, round any values produced by duplicate-response averaging, then transpose.

resp_wide_poly = irw.long2resp(df_poly)
resp_wide_poly = resp_wide_poly.drop(columns="id", errors="ignore")

# Convert to numeric; shift so minimum category = 0
resp_arr_poly = resp_wide_poly.apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float, na_value=np.nan)
min_resp = int(np.nanmin(resp_arr_poly))
resp_arr_poly = resp_arr_poly - min_resp

n_categories = int(np.nanmax(resp_arr_poly)) + 1
print(f"Persons: {resp_arr_poly.shape[0]}, Items: {resp_arr_poly.shape[1]}, Categories: {n_categories}")

# Round averaged duplicates, encode missing as INVALID_RESPONSE, transpose to (items × persons)
resp_matrix_poly = np.where(np.isnan(resp_arr_poly),
                            INVALID_RESPONSE,
                            np.round(resp_arr_poly)).astype(int).T
resp_matrix_poly = tag_missing_data(resp_matrix_poly, list(range(n_categories)))
Found 1260 responses across 1260 unique id-item pairs (0.2% of total).
Average responses per pair: 1.0.
Aggregating responses based on agg_method='mean'.
Persons: 8936, Items: 70, Categories: 7

Fit the Graded Response Model

grm_results = grm_mml(resp_matrix_poly)

discrimination_grm = grm_results["Discrimination"]
thresholds_grm     = grm_results["Difficulty"]   # shape: (n_items, n_categories - 1)

# Display results for first 10 items
item_names_poly = resp_wide_poly.columns.tolist()
for i in range(min(10, len(item_names_poly))):
    thresh_str = ", ".join([f"{t:.2f}" for t in thresholds_grm[i]])
    print(
        f"{item_names_poly[i]:30s}  "
        f"a = {discrimination_grm[i]:.3f}  "
        f"b = [{thresh_str}]"
    )
Warning: invalid value encountered in log
Warning: divide by zero encountered in log
10_agreeableness                a = 0.769  b = [-5.95, -3.71, -1.70, -0.22, 1.28, 4.12]
10_conscientiousness            a = 0.818  b = [-3.52, -1.11, 0.44, 1.68, 3.60, 5.91]
10_extraversion                 a = 1.044  b = [-4.44, -2.79, -1.31, -0.29, 0.90, 2.80]
10_neuroticism                  a = 0.778  b = [-4.01, -1.31, 0.24, 1.52, 3.55, 5.92]
10_openness                     a = 1.044  b = [-5.89, -3.18, -1.36, 0.17, 1.99, 4.41]
11_agreeableness                a = 1.122  b = [-4.49, -2.78, -1.65, -0.66, 0.84, 2.88]
11_conscientiousness            a = 1.152  b = [-4.91, -2.97, -1.57, -0.48, 0.92, 2.59]
11_extraversion                 a = 1.476  b = [-3.38, -2.13, -1.28, -0.52, 0.69, 2.26]
11_neuroticism                  a = 0.602  b = [-5.84, -2.51, -0.56, 1.10, 3.23, 5.90]
11_openness                     a = 1.024  b = [-5.94, -3.66, -1.94, -0.50, 1.19, 3.08]

Plot category response curves (CRCs)

Category response curves show the probability of each response category as a function of ability. We plot CRCs for a single example item.

from scipy.special import expit

def grm_crc(theta, a, thresholds):
    """
    GRM category response curves.
    P(X=k|θ) = P*(X≥k|θ) − P*(X≥k+1|θ), where P*(X≥k|θ) = logistic(a*(θ−b_k)).
    """
    # Cumulative curves P*(X≥k|θ), padded with 1 (lowest category is
    # always reached) and 0 (no category beyond the highest)
    cum_probs = np.array([expit(a * (theta - t)) for t in thresholds])
    cum_probs = np.vstack([np.ones_like(theta), cum_probs, np.zeros_like(theta)])
    return -np.diff(cum_probs, axis=0)

item_idx  = 0
item_name = item_names_poly[item_idx]
a_item    = discrimination_grm[item_idx]
b_item    = thresholds_grm[item_idx]

crc = grm_crc(theta_range, a_item, b_item)

fig, ax = plt.subplots(figsize=(8, 5))
for k in range(crc.shape[0]):
    ax.plot(theta_range, crc[k], label=f"Category {k + 1}")

ax.set_title(f"GRM Category Response Curves\n{item_name}")
ax.set_xlabel("Trait level (θ)")
ax.set_ylabel("Response probability")
ax.legend()
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()


Summary

The table below recaps the models covered and when to use each one.

Model        Package  Response type         Parameters per item  Use when
-----------  -------  --------------------  -------------------  -----------------------------------------------
1PL (Rasch)  mirt     Dichotomous           b                    Equal discrimination is theoretically warranted
2PL          mirt     Dichotomous           a, b                 Items vary in how well they discriminate
3PL          mirt     Dichotomous           a, b, c              Multiple-choice; guessing is plausible
GRM          girth    Polytomous (ordered)  a, b₁…bₖ₋₁           Likert or partial-credit items

For a broader exploration of 2PL discrimination parameters across many IRW datasets, see the 2PL Across Datasets vignette. To compare model fit between the 1PL and 2PL using cross-validated predictions, see the IMV vignette.


Further reading