A simulator for generating realistic mock longitudinal questionnaire data, designed to represent annual survey responses from schoolchildren.
Built in Julia using:
- ArgParse.jl — CLI argument parsing
- Distributions.jl — statistical distributions for counts, latent effects, and item sampling
- StatsBase.jl — weighted categorical sampling for demographics
- DataFrames.jl — tabular output representation
- CSV.jl — CSV serialisation
- JSON3.jl — JSON serialisation
The simulator uses an agent-based approach driven by latent variables.
Each simulated student has latent mental-health scores (e.g. depression, anxiety) that
evolve across waves and drive their questionnaire responses. The latent model is a simple
structural equation model composed of two building blocks:
- `LinearEffect(target, inputs, value)`: a fixed linear effect.
  Adds `value × ∏(inputs)` to the named latent for every row.
  Example: an age effect `LinearEffect("depression", ["d_age"], 0.02)` increases depression by 0.02 per year of age.
- `RandomEffect(target, numericalInputs, categoricalInputs, value)`: a random effect.
  One value is drawn from `value` per unique combination of `categoricalInputs` (e.g. one draw per school, one per student), optionally scaled by `numericalInputs`.
  Empty `categoricalInputs` produces a fresh draw per observation (residual error).
  Example: `RandomEffect("depression", [], ["uid"], truncated(Normal(0, 0.2), 0, Inf))` gives each student a stable half-normal baseline.
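The two building blocks can be sketched in plain Julia as follows. This is a standalone illustration, not the package's implementation: the helper names `apply_linear!` and `apply_random!`, the one-`Dict`-per-observation row layout, and the use of `abs(0.2 * randn(rng))` as a half-normal draw are all assumptions made for the example.

```julia
using Random

# Hypothetical row layout: one Dict per student-wave observation.
rows = [
    Dict{String,Any}("uid" => "a", "wave" => 1, "d_age" => 11.0, "depression" => 0.0),
    Dict{String,Any}("uid" => "a", "wave" => 2, "d_age" => 12.0, "depression" => 0.0),
    Dict{String,Any}("uid" => "b", "wave" => 1, "d_age" => 11.0, "depression" => 0.0),
]

# Fixed effect: add value × ∏(inputs) to the target latent on every row.
function apply_linear!(rows, target, inputs, value)
    for row in rows
        row[target] += value * prod(Float64[row[i] for i in inputs])
    end
    return rows
end

# Random effect: one draw per unique combination of categorical inputs,
# scaled by the product of the numerical inputs (1.0 when there are none).
# An empty `categorical` list gives a fresh draw for every row.
function apply_random!(rng, rows, target, numerical, categorical, draw)
    cache = Dict{Any,Float64}()
    for row in rows
        u = if isempty(categorical)
            draw(rng)
        else
            key = Tuple(row[c] for c in categorical)
            get!(() -> draw(rng), cache, key)
        end
        row[target] += u * prod(Float64[row[i] for i in numerical])
    end
    return rows
end

rng = MersenneTwister(42)
apply_linear!(rows, "depression", ["d_age"], 0.02)
after_linear = [row["depression"] for row in rows]

# Half-normal baseline shared by all rows of the same student.
apply_random!(rng, rows, "depression", String[], ["uid"], r -> abs(0.2 * randn(r)))
baseline = [row["depression"] for row in rows] .- after_linear
```

In this sketch the two rows for student `"a"` receive the same baseline, while student `"b"` gets an independent draw.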
Default model (used when you call simulate() with no custom configuration):
| Component | Affects | Distribution |
|---|---|---|
| Fixed age effect | depression, anxiety | +0.02 / +0.015 per year |
| Fixed sex effect (F=+1, M=−1) | depression, anxiety | ×0.05 |
| Fixed age × sex interaction | depression, anxiety | ×0.005 / ×0.004 |
| Random cohort (yearGroup) cluster | depression, anxiety | Normal(0, 0.05) |
| Random ethnicity × class × school cluster | depression, anxiety | Normal(0, 0.03) |
| Random individual × wave trajectory | depression, anxiety | Normal(0, 0.15) / Normal(0, 0.12) |
| Random individual baseline | depression, anxiety | half-Normal(0, 0.2) / half-Normal(0, 0.15) |
| Major depressive episode (1% per wave) | depression | Bernoulli(0.01) × Normal(0.75, 0.1) |
| Residual error | depression, anxiety | Normal(0, 0.1) |
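Read additively, the default model for the depression latent of student $i$ in wave $t$ can be written out as below. This is an interpretation of the table rather than a formula taken from the package. It assumes that all components are summed, that the second parameter of each Normal is a standard deviation (as in Distributions.jl), and that sex is coded $s_i = +1$ for F and $-1$ for M as in the table, which does not say how other categories are coded. The symbols and subscripts are introduced here for illustration only. The anxiety latent would have the same shape with its own coefficients and no episode term.

```latex
\begin{aligned}
\text{depression}_{it} ={}& 0.02\,\text{age}_{it} + 0.05\,s_i + 0.005\,\text{age}_{it}\,s_i \\
 &+ u^{\text{cohort}} + u^{\text{cluster}} + u^{\text{traj}}_{it} + u^{\text{base}}_{i}
   + b_{it}\,m_{it} + \varepsilon_{it}, \\
u^{\text{cohort}} &\sim \mathcal{N}(0,\,0.05), \quad
u^{\text{cluster}} \sim \mathcal{N}(0,\,0.03), \quad
u^{\text{traj}}_{it} \sim \mathcal{N}(0,\,0.15), \\
u^{\text{base}}_{i} &\sim \text{half-}\mathcal{N}(0,\,0.2), \quad
b_{it} \sim \text{Bernoulli}(0.01), \quad
m_{it} \sim \mathcal{N}(0.75,\,0.1), \quad
\varepsilon_{it} \sim \mathcal{N}(0,\,0.1).
\end{aligned}
```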
Questionnaire items are then generated by mapping the student's latent values (clamped to [0, 1]) to a Likert
scale mean via LatentLoading(latentName, scale), sampling truncated Normal noise, and rounding.
Longitudinal continuity is preserved by blending 75% of the latent-derived mean with 25% of
the previous wave's score.
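The item-generation step can be sketched as follows, using only the standard library. The function names, the rejection-sampling approach to truncation, the choice to blend with the previous wave before adding noise, and the clamp of the mean to the scale range are assumptions for illustration. The package may order or implement these steps differently.

```julia
using Random

# Draw from Normal(0, sd) truncated to [lo, hi] by rejection sampling.
function truncated_noise(rng, sd, lo, hi)
    while true
        x = sd * randn(rng)
        lo <= x <= hi && return x
    end
end

# One Likert item: latent -> mean -> blend with previous wave -> noise -> round.
function sample_item(rng, latent, scale, n_levels, noise_sd; previous = nothing)
    top = n_levels - 1                                   # e.g. 3 for a 0-3 scale
    mean_score = clamp(scale * clamp(latent, 0.0, 1.0), 0.0, top)
    if previous !== nothing
        # 75% latent-derived mean, 25% previous wave's score
        mean_score = 0.75 * mean_score + 0.25 * previous
    end
    # Truncate the noise so the noisy score stays inside [0, top].
    noise = truncated_noise(rng, noise_sd, -mean_score, top - mean_score)
    return round(Int, mean_score + noise)
end

rng = MersenneTwister(42)
wave1 = sample_item(rng, 0.4, 2.5, 4, 0.6)                    # PHQ-9-like item
wave2 = sample_item(rng, 0.4, 2.5, 4, 0.6; previous = wave1)  # carries over 25%
```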
Demographics show realistic inter-school variation: each school's demographic weight vectors are independently perturbed with Gaussian noise before students are generated.
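A minimal sketch of per-school perturbation is shown below. It assumes that noisy weights are clamped at zero and renormalised to sum to one, which the description above does not specify. `perturb_weights` and `sample_category` are hypothetical helper names, and the SD of 0.05 mirrors the documented `demographicPerturbationSD` default.

```julia
using Random

# Perturb a categorical weight vector with Gaussian noise, then renormalise.
function perturb_weights(rng, weights, sd)
    noisy = [max(w + sd * randn(rng), 0.0) for w in weights]
    total = sum(noisy)
    total == 0 && return float.(weights)   # degenerate case: keep the originals
    return noisy ./ total
end

# Weighted categorical draw via the cumulative distribution.
function sample_category(rng, categories, weights)
    r = rand(rng) * sum(weights)
    acc = 0.0
    for (c, w) in zip(categories, weights)
        acc += w
        r <= acc && return c
    end
    return last(categories)
end

rng = MersenneTwister(7)
categories = ["M", "F", "I"]
base = [0.49, 0.49, 0.02]

# One independently perturbed weight vector per school.
school_weights = [perturb_weights(rng, base, 0.05) for _ in 1:3]
students = [sample_category(rng, categories, school_weights[1]) for _ in 1:5]
```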
- Julia ≥ 1.0
Clone the repository and activate the package:
git clone https://github.com/OxfordRSE/ib-ox-dummies.git
cd ib-ox-dummies
julia --project=. -e 'import Pkg; Pkg.instantiate()'

To run the CLI directly:

julia bin/ib_ox_dummies --help

Or add the bin/ directory to your PATH:

export PATH="$PATH:/path/to/ib-ox-dummies/bin"
ib_ox_dummies --help

usage: ib_ox_dummies [--config CONFIG]
[--nWaves NWAVES] [--nSchools NSCHOOLS]
[--nYeargroupsPerSchool SPEC]
[--nClassesPerSchoolYeargroup SPEC]
[--nStudentsPerClass SPEC]
[--latentVariables VARS]
[--linearEffect EFFECT]... [--randomEffect EFFECT]...
[--ethnicity WEIGHTS] [--sex WEIGHTS]
[--genderIdentity WEIGHTS] [--sexualOrientation WEIGHTS]
[--customField NAME=VALUE]...
[--seed SEED] [--output OUTPUT] [--schema]
[--version] [-h]
Generate mock longitudinal questionnaire data for schoolchildren.
Use --config to load a TOML file specifying the full model (questionnaires,
latent variable loadings, effects, demographics). CLI arguments override TOML values.
SPEC formats: integer ('5'), range ('1:5'), norm(μ,σ), halfnorm(μ,σ),
poisson(λ), negbinom(r,p), lognorm(μ,σ), uniform(a,b), exponential(rate),
gamma(shape,scale), mde(μ,σ[,p]) (Bernoulli spike: probability p [default 0.01]
of Normal(μ,σ) draw, else 0).
LinearEffect format: "target:inputs:value"
e.g. "depression:d_age:0.02" or "anxiety:d_age,_sex_fm:0.004"
RandomEffect format: "target:numInputs:catInputs:spec"
e.g. "depression::uid,wave:norm(0,0.15)" or "anxiety:::norm(0,0.1)"
Demographics weight format: "Category1:weight1,Category2:weight2,..."
e.g. "M:0.49,F:0.49,I:0.02" or "White British:0.75,Asian:0.15,Other:0.10"
optional arguments:
--config CONFIG Path to TOML configuration file; CLI args override
--nWaves NWAVES Number of waves (type: Int, default: 3)
--nSchools NSCHOOLS Number of schools (type: Int, default: 10)
--nYeargroupsPerSchool SPEC Yeargroups per school (default: "5")
--nClassesPerSchoolYeargroup SPEC Classes per yeargroup (default: "1:5")
--nStudentsPerClass SPEC Students per class (default: "norm(30,7)")
--latentVariables VARS Comma-separated latent names (default: "depression,anxiety")
--linearEffect EFFECT Add a LinearEffect (repeatable; replaces TOML/defaults)
--randomEffect EFFECT Add a RandomEffect (repeatable; replaces TOML/defaults)
--ethnicity WEIGHTS Ethnicity distribution (overrides TOML/UK 2021 Census)
--sex WEIGHTS Sex distribution (overrides TOML/UK 2021 Census)
--genderIdentity WEIGHTS Gender identity distribution (overrides TOML/UK 2021 Census)
--sexualOrientation WEIGHTS Sexual orientation distribution (overrides TOML/UK 2021 Census)
--customField NAME=VALUE Custom demographic column: value is a Faker method name
(e.g. "faker.city") or a constant string. Repeatable.
Overrides matching TOML [demographics.customFields] entries.
--seed SEED Random seed (type: Int; overrides TOML)
--output OUTPUT Output format: csv | json | schema (default: "csv")
--schema Print JSON Schema and exit
--version Show version and exit
-h, --help Show this help message and exit
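The effect string formats in the help text can be split into their fields along these lines. The function names and the returned named tuples are hypothetical, and the sketch leaves the SPEC part as a string rather than turning it into a distribution. It is not the CLI's actual parser.

```julia
# Parse "target:inputs:value", e.g. "anxiety:d_age,_sex_fm:0.004".
function parse_linear_effect(s::AbstractString)
    parts = split(s, ':'; limit = 3)
    length(parts) == 3 || throw(ArgumentError("expected target:inputs:value, got \"$s\""))
    target, inputs, value = parts
    return (target = String(target),
            inputs = String.(split(inputs, ','; keepempty = false)),
            value = parse(Float64, value))
end

# Parse "target:numInputs:catInputs:spec", e.g. "depression::uid,wave:norm(0,0.15)".
# `limit = 4` keeps any ':' inside the spec (such as a range "1:5") intact.
function parse_random_effect(s::AbstractString)
    parts = split(s, ':'; limit = 4)
    length(parts) == 4 || throw(ArgumentError("expected target:numInputs:catInputs:spec, got \"$s\""))
    target, nums, cats, spec = parts
    return (target = String(target),
            numericalInputs = String.(split(nums, ','; keepempty = false)),
            categoricalInputs = String.(split(cats, ','; keepempty = false)),
            spec = String(spec))
end

age_sex = parse_linear_effect("anxiety:d_age,_sex_fm:0.004")
trajectory = parse_random_effect("depression::uid,wave:norm(0,0.15)")
residual = parse_random_effect("anxiety:::norm(0,0.1)")
```

Empty input fields, as in the residual-error example, come back as empty lists.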
The --config flag accepts a TOML file that can specify the complete model — questionnaires,
latent variable loadings, linear/random effects, and demographics — all in one place.
Any CLI argument overrides the corresponding TOML value.
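The precedence order (defaults, then TOML, then CLI) can be illustrated with a plain dictionary merge. This is a sketch of the idea only and not how the package resolves its configuration. It uses the `TOML` standard library, which ships with Julia 1.6 and later, and the dictionary names are hypothetical.

```julia
using TOML   # standard library since Julia 1.6

defaults = Dict{String,Any}("nWaves" => 3, "nSchools" => 10, "seed" => nothing)

toml_text = """
[simulation]
nWaves = 4
nSchools = 2
"""
from_toml = get(TOML.parse(toml_text), "simulation", Dict{String,Any}())

# Only the flags the user actually passed on the command line.
from_cli = Dict{String,Any}("nWaves" => 5, "seed" => 42)

# Later arguments win on key collisions: defaults < TOML < CLI.
settings = merge(defaults, from_toml, from_cli)
```

Here `nWaves` comes from the CLI, `nSchools` from the TOML file, and `seed` from the CLI. Per the help text, `--linearEffect` and `--randomEffect` replace the TOML or default effect lists rather than appending to them.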
# examples/default_model.toml — abbreviated excerpt
[simulation]
nWaves = 3
nSchools = 10
nStudentsPerClass = "norm(30,7)"
latentVariables = ["depression", "anxiety"]
[demographics]
sex = "M:0.490,F:0.490,I:0.020"
ethnicity = "White British:0.812,Asian:0.083,Black:0.040,Mixed:0.030,Other:0.035"
# Custom demographic fields: value is a Faker method name or a constant string.
[demographics.customFields]
d_city = "faker.city" # Faker-generated city name per student
d_country = "United Kingdom" # constant string on every row
[[linearEffect]]
target = "depression"
inputs = ["d_age"]
value = 0.02
[[randomEffect]]
target = "depression"
categoricalInputs = ["uid"]
value = "halfnorm(0,0.2)"
[[questionnaire]]
name = "PHQ_9"
prefix = "phq9"
nItems = 9
nLevels = 4
noiseSD = 0.6
spoilRate = 0.01
# Uniform loading — all items get the same scale factor:
loadings = [{latentName = "depression", scale = 2.5}]
# Per-item loading example — each item gets its own scale:
# loadings = [{latentName = "depression", itemScales = {"1" = 3.0, "2" = 2.5, "3" = 2.0}}]

See examples/default_model.toml for the full default model expressed as TOML.
# Default run (3 waves, 10 schools, ~30 students/class)
ib_ox_dummies
# Small reproducible run → CSV
ib_ox_dummies --nWaves 2 --nSchools 3 --seed 42
# JSON output with Poisson-distributed class sizes
ib_ox_dummies --nStudentsPerClass "poisson(25)" --output json
# Full model from TOML config file (see examples/default_model.toml)
ib_ox_dummies --config examples/default_model.toml
# TOML config with CLI override: use TOML model but change wave count and seed
ib_ox_dummies --config examples/default_model.toml --nWaves 5 --seed 42
# #BeeWell GM Survey — full 136-item questionnaire (see examples/beewell_model.toml)
ib_ox_dummies --config examples/beewell_model.toml --seed 42
# Custom demographics: equal sex split, simplified ethnicity distribution, city field
ib_ox_dummies \
--sex "M:0.50,F:0.50" \
--ethnicity "White British:0.70,Asian:0.20,Black:0.05,Other:0.05" \
--customField "d_city=faker.city" \
--customField "d_country=United Kingdom" \
--nSchools 3 --nWaves 2 --seed 1
# Custom latent model via CLI: depression driven by age + individual baseline + residual
ib_ox_dummies \
--latentVariables "depression" \
--linearEffect "depression:d_age:0.02" \
--randomEffect "depression::uid:halfnorm(0,0.2)" \
--randomEffect "depression:::norm(0,0.1)" \
--nSchools 3 --nWaves 2 --seed 1
# Print JSON Schema describing the output columns
ib_ox_dummies --output schema

Long-format tabular data with one row per student per wave:
| wave | uid | name | school | yearGroup | schoolYear | class | d_age | d_sex | d_ethnicity | d_sexualOrientation | d_genderIdentity | phq9_1 | … | gad7_1 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | s5agas99p | Michelle Roux | Islington Academy | 2 | 2 | 2b | 11 | F | White British | Heterosexual/Straight | Cis | 1 | … | 0 | … |
- Demographics (`d_*`) are generated once and updated each wave (age increments by 1).
- Questionnaire items (PHQ-9, GAD-7) are scored on a Likert 0–3 scale, derived from latent variables.
- A configurable naughty monkey randomly removes ≈0.25 % of questionnaire cells and ≈5 % of demographics cells to simulate real-world data quality.
- `missing` values appear as empty strings in CSV and `null` in JSON.
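A corruption step of this kind might look like the sketch below. The function name, its signature, and the row layout are hypothetical. Only the two default rates (0.25 % and 5 %) and the `d_` column prefix come from the description above.

```julia
using Random

# Blank out cells at random: `q_rate` for questionnaire columns,
# `d_rate` for demographic (d_*) columns. Other columns are left alone.
function naughty_monkey!(rng, rows, questionnaire_cols; q_rate = 0.0025, d_rate = 0.05)
    for row in rows, col in collect(keys(row))
        rate = if startswith(col, "d_")
            d_rate
        elseif col in questionnaire_cols
            q_rate
        else
            0.0
        end
        if rand(rng) < rate
            row[col] = missing
        end
    end
    return rows
end

rows = [Dict{String,Any}("uid" => "s$i", "d_age" => 11, "phq9_1" => 1) for i in 1:1000]
naughty_monkey!(MersenneTwister(42), rows, ["phq9_1"])
n_missing_age = count(r -> ismissing(r["d_age"]), rows)   # around 5% of 1000
```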
When includeLatents = true, additional l_* columns are appended (e.g. l_depression, l_anxiety)
containing the continuous latent values used to generate the questionnaire responses.
The --output schema flag (or --schema) prints a
JSON Schema (Draft 7) document describing the
output row type, suitable for validation or code generation.
The package can also be used programmatically from Julia:
using IbOxDummies
using Distributions
# Run with all defaults — returns a DataFrame
data, schema = simulate(SimulationConfig(seed = 42))
# Write CSV to stdout (uses CSV.jl)
to_csv(data, schema)
# Write JSON to stdout (uses JSON3.jl)
to_json(data, schema)
# Include latent variable values in output for ground-truth comparison
data, schema = simulate(SimulationConfig(seed = 42, includeLatents = true))
# Custom configuration
config = SimulationConfig(
nWaves = 2,
nSchools = 5,
nYeargroupsPerSchool = Range(3, 6),
nClassesPerSchoolYeargroup = Range(1, 4),
nStudentsPerClass = Normal(28.0, 5.0),
seed = 123,
output = "csv",
)
data, schema = simulate(config)
# Custom latent model
data, schema = simulate(SimulationConfig(
seed = 42,
latentVariables = ["depression"],
linearEffects = [LinearEffect("depression", ["d_age"], 0.03)],
randomEffects = [
RandomEffect("depression", [], ["uid"], truncated(Normal(0, 0.2), 0, Inf)),
RandomEffect("depression", [], [], Normal(0, 0.1)), # residual error
],
questionnaires = [make_phq9()],
includeLatents = true,
))
# Custom demographics: override sex distribution and add a Faker-based city field
using Faker
data, schema = simulate(SimulationConfig(
seed = 42,
demographicsSpec = DemographicsSpec(
sex = [("M", 0.45), ("F", 0.45), ("I", 0.10)],
customFields = Dict{String,Function}("d_city" => Faker.city),
),
))

| Type | Description |
|---|---|
| `Response` | `Union{Int, Float64, String, Missing}`: a single answer |
| `DataRow` | `Dict{String, Response}`: internal per-student-wave record |
| `Schema` | Column metadata (demographics, questionnaire, and latent columns) |
| `Range` | Inclusive integer range `[min, max]` |
| `SamplerSpec` | `Union{Int, Range, UnivariateDistribution, Function}`: flexible sampler used for counts and random effect values |
| `LinearEffect` | Fixed linear effect on a latent variable |
| `RandomEffect` | Random effect on a latent variable; `value::SamplerSpec` (distribution or callable) |
| `LatentLoading` | Maps a latent variable to questionnaire item means via a scale factor: either a uniform `Float64` for all items or a `Dict{String,Float64}` of per-item scales |
| `QuestionnaireSpec` | Declarative Likert-scale questionnaire specification |
| `DemographicsSpec` | Categorical weight distributions + optional Faker-based custom fields (ethnicity, sex, gender, orientation, and arbitrary `customFields`) |
| `SimulationConfig` | All simulation parameters with sensible defaults |
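The idea behind a flexible sampler spec can be sketched with multiple dispatch. `CountRange` and `sample_count` are hypothetical stand-ins, not the package's `Range` or its internal sampling function. The rounding and the floor at zero for callables are assumptions, and the sketch leaves out the `UnivariateDistribution` case to stay within the standard library.

```julia
using Random

struct CountRange          # hypothetical stand-in for the package's `Range`
    lo::Int
    hi::Int
end

sample_count(rng, spec::Int) = spec                                 # constant
sample_count(rng, spec::CountRange) = rand(rng, spec.lo:spec.hi)    # inclusive range
sample_count(rng, spec::Function) = max(0, round(Int, spec(rng)))   # callable, rounded

rng = MersenneTwister(3)
n_yeargroups = sample_count(rng, 5)
n_classes = sample_count(rng, CountRange(1, 5))
n_students = sample_count(rng, r -> 30 + 7 * randn(r))   # roughly Normal(30, 7)
```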
| Field | Default | Description |
|---|---|---|
| `nWaves` | `3` | Number of data-collection waves |
| `nSchools` | `10` | Number of schools |
| `nYeargroupsPerSchool` | `5` | Count spec for yeargroups per school |
| `nClassesPerSchoolYeargroup` | `Range(1,5)` | Count spec for classes per yeargroup |
| `nStudentsPerClass` | `Normal(30,7)` | Count spec for students per class |
| `questionnaires` | `[]` → default PHQ-9 + GAD-7 | Vector of `QuestionnaireSpec` |
| `latentVariables` | `[]` → `["depression","anxiety"]` | Latent variable names |
| `linearEffects` | `[]` → defaults | Fixed linear effects |
| `randomEffects` | `[]` → defaults | Random effects |
| `includeLatents` | `false` | Append `l_*` latent columns to output |
| `demographicPerturbationSD` | `0.05` | Per-school demographic weight perturbation SD |
| `demographicsSpec` | `nothing` → UK census defaults | Custom `DemographicsSpec` for demographic distributions |
| `demographicsUpdateFn` | age +1 | Function updating demographics between waves |
| `naughtyMonkey` | 0.25%/5% deletion | Function applying data-quality corruption |
| `output` | `"csv"` | Output format (`"csv"`, `"json"`, `"schema"`, or custom `Function`) |
| `seed` | `nothing` | Random seed for reproducibility |
Questionnaires are specified declaratively with QuestionnaireSpec, which defines the
number of items, number of Likert levels, latent variable loadings, noise SD, and spoil rate.
Nine items scored 0–3, measuring depression severity.
Loads on the "depression" latent variable (scale 2.5).
Seven items scored 0–3, measuring anxiety severity.
Loads on "anxiety" (scale 2.5) and secondarily "depression" (scale 0.8).
Full implementation of the Greater Manchester #BeeWell Survey
(updated August 2025). Covers all 124+ survey items across the Domains of Wellbeing and
Drivers of Wellbeing, grouped into 49 QuestionnaireSpecs that together produce
136 bw_* columns.
Questions modelled:
| Range | Scale | Description |
|---|---|---|
| Q4–5 | 3-pt / 16-pt | Migration background (Q1–3 are demographic columns) |
| Q6 | 0–10 | Life satisfaction (ONS) |
| Q7–13 | 0–4 | Psychological wellbeing (SWEMWBS, 7 items) |
| Q14–18 | 0–3 | Self-esteem (Rosenberg, 5 items) |
| Q19–21 | 0–2 | Emotion regulation (CWMS coping subscale) |
| Q22 | 0–10 | Appearance happiness |
| Q23–26 | 0–4 | Stress & coping (PSS-4; not Year 7) |
| Q27–36 | 0–2 | Emotional difficulties (Me & My Feelings, 10 items) |
| Q37–42 | 0–2 | Behavioural difficulties (Me & My Feelings, 6 items; item 4 reverse-scored) |
| Q43–51 | mixed | Physical health, sleep, activity, nutrition |
| Q52–67 | mixed | Free time, social media, volunteering, 11 activities |
| Q68–77 | mixed | School connection, attainment, staff relationships, isolation, pressure |
| Q78–86 | mixed | Home/local environment, food security, material deprivation |
| Q87–98 | mixed | Future readiness, careers education (Year 10), GMACS |
| Q99–107 | 0–4 / 0–4 | Parent relationships, friendships, loneliness |
| Q108–116 | mixed | Discrimination (5 characteristics + 7 locations), bullying |
| Q117–124 | mixed | Wellbeing support access, mental health contacts, Kooth |
Latent variables (13 total, all clamped to [0, 1]):
| Name | Construct |
|---|---|
| `wellbeing` | Positive psychological wellbeing |
| `depression` | Depressive affect |
| `anxiety` | Anxious / worried affect |
| `behaviour` | Behavioural difficulties |
| `physical_health` | Physical health and activity level |
| `unhealthy_diet` | Frequency of unhealthy food/drink |
| `social_connection` | Quality of relationships and belonging |
| `future_optimism` | Hope and readiness for the future |
| `socioeconomic` | Socioeconomic advantage |
| `migration` | Migrant family background (spike: ~15 % non-zero) |
| `discrimination` | Exposure to discrimination (spike: ~10 %) |
| `victimization` | Exposure to bullying (spike: ~10 %) |
| `screen_time` | Daily screen / social-media engagement |
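A spike latent of this kind can be sketched with the Bernoulli-spike rule that the CLI help gives for `mde(μ,σ[,p])`: with probability p draw from Normal(μ, σ), otherwise return 0. Whether the #BeeWell model uses exactly this spec for its spike latents is an assumption. The values μ = 0.75 and σ = 0.1 are borrowed from the default episode effect purely for illustration, and the one-line `mde` function is a hypothetical re-implementation.

```julia
using Random

# mde(μ, σ, p): with probability p return a Normal(μ, σ) draw, otherwise 0.
mde(rng, μ, σ, p = 0.01) = rand(rng) < p ? μ + σ * randn(rng) : 0.0

rng = MersenneTwister(11)
draws = [mde(rng, 0.75, 0.1, 0.15) for _ in 1:10_000]
share_nonzero = count(!iszero, draws) / length(draws)   # close to 0.15
```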
Usage via TOML (CLI):
ib_ox_dummies --config examples/beewell_model.toml --seed 42

Usage via Julia API:
using IbOxDummies
data, schema = simulate(SimulationConfig(
seed = 42,
latentVariables = beewell_latent_variables(),
linearEffects = beewell_linear_effects(),
randomEffects = beewell_random_effects(),
questionnaires = beewell_questionnaires(),
))

Sample output (first 5 rows, selected columns; seed = 42, 1 wave, 2 schools):
| wave | uid | school | d_age | d_sex | bw_life_sat_1 | bw_wbeing_1 | bw_selfest_1 | bw_emodies_1 | bw_behav_1 | bw_physh_1 | bw_sleep_1 | bw_physact_1 | bw_future_1 | bw_lonely_1 | bw_bullying_1 | bw_kooth_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | bu061oj01 | Cleliamouth High School | 10 | M | 2 | 3 | 1 | 1 | 0 | 3 | 1 | 3 | 0 | 2 | 0 | 2 |
| 1 | vdpqocpmo | Cleliamouth High School | 10 | F | 0 | 1 | 1 | 1 | 0 | 4 | 0 | 3 | 2 | 2 | 0 | 2 |
| 1 | as8wsya9x | Cleliamouth High School | 10 | M | 3 | 1 | 2 | 0 | 0 | 3 | 1 | 4 | 2 | 2 | 0 | 0 |
| 1 | 54dc8alga | Cleliamouth High School | 10 | — | 2 | 1 | 2 | 2 | 0 | 2 | 0 | 0 | 3 | 4 | 0 | 1 |
| 1 | 5uufz0qy3 | Cleliamouth High School | 10 | F | 6 | 2 | 0 | 0 | 0 | 2 | 1 | 6 | 2 | 1 | 0 | 1 |
The full output has 148 columns (12 demographics + 136 questionnaire items).
See examples/beewell_model.toml for the complete model.
Run the test suite:
julia --project=. test/runtests.jl

See LICENSE.