ib-ox-dummies

A simulator for generating realistic mock longitudinal questionnaire data, designed to represent annual survey responses from schoolchildren.

Built in Julia using:

ArgParse.jl — CLI argument parsing
Distributions.jl — statistical distributions for counts, latent effects, and item sampling
StatsBase.jl — weighted categorical sampling for demographics
DataFrames.jl — tabular output representation
CSV.jl — CSV serialisation
JSON3.jl — JSON serialisation

How it works

The simulator uses an agent-based approach driven by latent variables.

Each simulated student has latent mental-health scores (e.g. depression, anxiety) that evolve across waves and drive their questionnaire responses. The latent model is a simple structural equation model composed of two building blocks:

LinearEffect(target, inputs, value) — a fixed linear effect.
Adds value × ∏(inputs) to the named latent for every row.
Example: an age effect LinearEffect("depression", ["d_age"], 0.02) increases depression by 0.02 per year of age.
RandomEffect(target, numericalInputs, categoricalInputs, value) — a random effect.
One draw is taken from the value sampler for each unique combination of categoricalInputs (e.g. one draw per school, or one per student), optionally scaled by numericalInputs.
Empty categoricalInputs produces a fresh draw per observation (residual error).
Example: RandomEffect("depression", [], ["uid"], truncated(Normal(0, 0.2), 0, Inf)) gives each student a stable half-normal baseline.
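As a rough illustration of how the two building blocks could accumulate into a latent score, here is a minimal sketch using only the Julia standard library. `LinearEffectSketch`, `RandomEffectSketch`, and `apply!` are hypothetical names, not the package's types. The sampler is a plain closure rather than a Distributions.jl object, and scaling by the product of the numerical inputs is an assumption.

```julia
using Random

# Hypothetical stand-ins for LinearEffect / RandomEffect (not the package's types).
struct LinearEffectSketch
    target::String
    inputs::Vector{String}
    value::Float64
end

struct RandomEffectSketch
    target::String
    numericalInputs::Vector{String}
    categoricalInputs::Vector{String}
    draw::Function   # rng -> Float64; stands in for the `value` sampler
end

# Fixed effect: add value × ∏(inputs) to the target latent on every row.
function apply!(latents, rows, e::LinearEffectSketch)
    for (i, row) in enumerate(rows)
        latents[e.target][i] += e.value * prod(Float64[row[k] for k in e.inputs])
    end
    return latents
end

# Random effect: one draw per unique combination of categorical inputs,
# or a fresh draw per row when there are none (residual error).
function apply!(latents, rows, e::RandomEffectSketch, rng)
    draws = Dict{Any,Float64}()
    for (i, row) in enumerate(rows)
        x = if isempty(e.categoricalInputs)
            e.draw(rng)
        else
            key = Tuple(row[k] for k in e.categoricalInputs)
            get!(() -> e.draw(rng), draws, key)
        end
        # Assumed: numerical inputs scale the draw by their product.
        latents[e.target][i] += x * prod(Float64[row[k] for k in e.numericalInputs])
    end
    return latents
end

# Two students observed over two waves.
rows = [Dict{String,Any}("uid" => u, "wave" => w, "d_age" => 10 + w)
        for u in ["a", "b"] for w in 1:2]
latents = Dict("depression" => zeros(length(rows)))
rng = MersenneTwister(1)

apply!(latents, rows, LinearEffectSketch("depression", ["d_age"], 0.02))
halfnormal = r -> abs(0.2 * randn(r))   # half-Normal(0, 0.2) baseline
apply!(latents, rows, RandomEffectSketch("depression", String[], ["uid"], halfnormal), rng)
```

Rows that share a uid receive the same baseline draw, so within one student the two waves differ only by the age term.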

Default model (used when you call simulate() with no custom configuration):

Component	Affects	Value / distribution
Fixed age effect	depression, anxiety	+0.02 / +0.015 per year
Fixed sex effect (F=+1, M=−1)	depression, anxiety	×0.05
Fixed age × sex interaction	depression, anxiety	×0.005 / ×0.004
Random cohort (yearGroup) cluster	depression, anxiety	Normal(0, 0.05)
Random ethnicity × class × school cluster	depression, anxiety	Normal(0, 0.03)
Random individual × wave trajectory	depression, anxiety	Normal(0, 0.15) / Normal(0, 0.12)
Random individual baseline	depression, anxiety	half-Normal(0, 0.2) / half-Normal(0, 0.15)
Major depressive episode (1% per wave)	depression	Bernoulli(0.01) × Normal(0.75, 0.1)
Residual error	depression, anxiety	Normal(0, 0.1)
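The major depressive episode row is a Bernoulli spike: most draws are zero and a small fraction take a Normal value. A minimal sketch of that idea, using only the standard library, might look like the following. `mde_draw` is a hypothetical helper, not a package function.

```julia
using Random

# Hypothetical spike sampler: with probability p return a Normal(μ, σ) draw, otherwise 0.
mde_draw(rng, μ, σ, p = 0.01) = rand(rng) < p ? μ + σ * randn(rng) : 0.0

rng = MersenneTwister(2024)
n = 100_000
episodes = count(_ -> mde_draw(rng, 0.75, 0.1) != 0.0, 1:n)
share = episodes / n   # fraction of draws with an episode; should sit near p
```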

Questionnaire items are then generated by mapping the student's latent values (clamped to [0, 1]) to a Likert scale mean via LatentLoading(latentName, scale), sampling truncated Normal noise, and rounding. Longitudinal continuity is preserved by blending 75% of the latent-derived mean with 25% of the previous wave's score.
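The item-generation steps above can be sketched for a single item with a single loading. `sketch_item` is a hypothetical helper. The truncation bounds, the clamping of the mean to the scale, and the use of rejection sampling are assumptions made for illustration, not the package's confirmed behaviour.

```julia
using Random

# Hypothetical helper: one Likert item for one student in one wave.
# nLevels = 4 gives scores 0..3; prev is the previous wave's score, if any.
function sketch_item(rng, latent, scale, nLevels, noiseSD; prev = nothing)
    top = nLevels - 1
    μ = clamp(clamp(latent, 0.0, 1.0) * scale, 0.0, top)   # latent-derived mean
    if prev !== nothing
        μ = 0.75 * μ + 0.25 * prev                          # blend with previous wave
    end
    # Truncated Normal noise by rejection: redraw until the value rounds into 0..top.
    x = μ + noiseSD * randn(rng)
    while x < -0.5 || x > top + 0.5
        x = μ + noiseSD * randn(rng)
    end
    return clamp(round(Int, x), 0, top)
end

rng = MersenneTwister(42)
wave1 = sketch_item(rng, 0.4, 2.5, 4, 0.6)                 # first wave: no history
wave2 = sketch_item(rng, 0.4, 2.5, 4, 0.6; prev = wave1)   # later wave: blended mean
```

With noiseSD set to 0 the result is the rounded mean, which makes the 75/25 blend easy to check by hand.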

Demographics show realistic inter-school variation: each school's demographic weight vectors are independently perturbed with Gaussian noise before students are generated.
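One plausible way to perturb a school's weight vector is to add Gaussian noise to each weight, floor at zero, and renormalise. The sketch below assumes additive noise. `perturb_weights` and `sample_category` are hypothetical names, and the package may do this differently.

```julia
using Random

# Hypothetical helper: jitter category weights for one school, then renormalise.
function perturb_weights(rng, weights::Vector{Float64}, sd::Float64)
    noisy = [max(w + sd * randn(rng), 0.0) for w in weights]   # assumed: floor at 0
    total = sum(noisy)
    # Fall back to the unperturbed weights if every weight was floored to 0.
    return total > 0 ? noisy ./ total : weights ./ sum(weights)
end

# Draw a category index from normalised weights via the cumulative sum.
function sample_category(rng, weights::Vector{Float64})
    u = rand(rng)
    c = 0.0
    for (i, w) in enumerate(weights)
        c += w
        u < c && return i
    end
    return length(weights)   # guard against floating-point round-off
end

rng = MersenneTwister(7)
labels = ["M", "F", "I"]
school_weights = perturb_weights(rng, [0.49, 0.49, 0.02], 0.05)
student_sex = labels[sample_category(rng, school_weights)]
```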

Requirements

Julia ≥ 1.0

Installation

Clone the repository and activate the package:

git clone https://github.com/OxfordRSE/ib-ox-dummies.git
cd ib-ox-dummies
julia --project=. -e 'import Pkg; Pkg.instantiate()'

To run the CLI directly:

julia bin/ib_ox_dummies --help

Or add the bin/ directory to your PATH:

export PATH="$PATH:/path/to/ib-ox-dummies/bin"
ib_ox_dummies --help

Usage

usage: ib_ox_dummies [--config CONFIG]
                     [--nWaves NWAVES] [--nSchools NSCHOOLS]
                     [--nYeargroupsPerSchool SPEC]
                     [--nClassesPerSchoolYeargroup SPEC]
                     [--nStudentsPerClass SPEC]
                     [--latentVariables VARS]
                     [--linearEffect EFFECT]... [--randomEffect EFFECT]...
                     [--ethnicity WEIGHTS] [--sex WEIGHTS]
                     [--genderIdentity WEIGHTS] [--sexualOrientation WEIGHTS]
                     [--customField NAME=VALUE]...
                     [--seed SEED] [--output OUTPUT] [--schema]
                     [--version] [-h]

Generate mock longitudinal questionnaire data for schoolchildren.

Use --config to load a TOML file specifying the full model (questionnaires,
latent variable loadings, effects, demographics). CLI arguments override TOML values.

SPEC formats: integer ('5'), range ('1:5'), norm(μ,σ), halfnorm(μ,σ),
poisson(λ), negbinom(r,p), lognorm(μ,σ), uniform(a,b), exponential(rate),
gamma(shape,scale), mde(μ,σ[,p]) (Bernoulli spike: probability p [default 0.01]
of Normal(μ,σ) draw, else 0).

LinearEffect format:  "target:inputs:value"
  e.g. "depression:d_age:0.02"  or  "anxiety:d_age,_sex_fm:0.004"

RandomEffect format:  "target:numInputs:catInputs:spec"
  e.g. "depression::uid,wave:norm(0,0.15)"  or  "anxiety:::norm(0,0.1)"

Demographics weight format: "Category1:weight1,Category2:weight2,..."
  e.g. "M:0.49,F:0.49,I:0.02"  or  "White British:0.75,Asian:0.15,Other:0.10"

optional arguments:
  --config CONFIG                          Path to TOML configuration file; CLI args override
  --nWaves NWAVES                          Number of waves (type: Int, default: 3)
  --nSchools NSCHOOLS                      Number of schools (type: Int, default: 10)
  --nYeargroupsPerSchool SPEC              Yeargroups per school (default: "5")
  --nClassesPerSchoolYeargroup SPEC        Classes per yeargroup (default: "1:5")
  --nStudentsPerClass SPEC                 Students per class (default: "norm(30,7)")
  --latentVariables VARS                   Comma-separated latent names (default: "depression,anxiety")
  --linearEffect EFFECT                    Add a LinearEffect (repeatable; replaces TOML/defaults)
  --randomEffect EFFECT                    Add a RandomEffect (repeatable; replaces TOML/defaults)
  --ethnicity WEIGHTS                      Ethnicity distribution (overrides TOML/UK 2021 Census)
  --sex WEIGHTS                            Sex distribution (overrides TOML/UK 2021 Census)
  --genderIdentity WEIGHTS                 Gender identity distribution (overrides TOML/UK 2021 Census)
  --sexualOrientation WEIGHTS              Sexual orientation distribution (overrides TOML/UK 2021 Census)
  --customField NAME=VALUE                 Custom demographic column: value is a Faker method name
                                           (e.g. "faker.city") or a constant string. Repeatable.
                                           Overrides matching TOML [demographics.customFields] entries.
  --seed SEED                              Random seed (type: Int; overrides TOML)
  --output OUTPUT                          Output format: csv | json | schema (default: "csv")
  --schema                                 Print JSON Schema and exit
  --version                                Show version and exit
  -h, --help                               Show this help message and exit
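To make the colon-separated effect formats concrete, here is a minimal parsing sketch. `parse_linear_effect` and `parse_random_effect` are hypothetical helpers, not the CLI's own parser. The spec field of a random effect is returned as an unparsed string.

```julia
# Hypothetical parser for the "target:inputs:value" LinearEffect string.
function parse_linear_effect(s::AbstractString)
    parts = split(s, ':')
    length(parts) == 3 || error("expected target:inputs:value, got: $s")
    inputs = String[String(x) for x in split(parts[2], ','; keepempty = false)]
    return (target = String(parts[1]), inputs = inputs, value = parse(Float64, parts[3]))
end

# Hypothetical parser for the "target:numInputs:catInputs:spec" RandomEffect string.
# limit = 4 keeps any further colons inside the spec field.
function parse_random_effect(s::AbstractString)
    parts = split(s, ':'; limit = 4)
    length(parts) == 4 || error("expected target:numInputs:catInputs:spec, got: $s")
    names(field) = String[String(x) for x in split(field, ','; keepempty = false)]
    return (target = String(parts[1]), numInputs = names(parts[2]),
            catInputs = names(parts[3]), spec = String(parts[4]))
end

le = parse_linear_effect("anxiety:d_age,_sex_fm:0.004")
re = parse_random_effect("depression::uid,wave:norm(0,0.15)")
residual = parse_random_effect("anxiety:::norm(0,0.1)")   # no inputs at all
```

Empty fields become empty lists, which matches the convention that an empty categorical list means a fresh draw per observation.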

TOML configuration file

The --config flag accepts a TOML file that can specify the complete model — questionnaires, latent variable loadings, linear/random effects, and demographics — all in one place. Any CLI argument overrides the corresponding TOML value.

# examples/default_model.toml — abbreviated excerpt
[simulation]
nWaves   = 3
nSchools = 10
nStudentsPerClass = "norm(30,7)"
latentVariables   = ["depression", "anxiety"]

[demographics]
sex       = "M:0.490,F:0.490,I:0.020"
ethnicity = "White British:0.812,Asian:0.083,Black:0.040,Mixed:0.030,Other:0.035"

# Custom demographic fields: value is a Faker method name or a constant string.
[demographics.customFields]
d_city    = "faker.city"           # Faker-generated city name per student
d_country = "United Kingdom"       # constant string on every row

[[linearEffect]]
target = "depression"
inputs = ["d_age"]
value  = 0.02

[[randomEffect]]
target            = "depression"
categoricalInputs = ["uid"]
value             = "halfnorm(0,0.2)"

[[questionnaire]]
name     = "PHQ_9"
prefix   = "phq9"
nItems   = 9
nLevels  = 4
noiseSD  = 0.6
spoilRate = 0.01
# Uniform loading — all items get the same scale factor:
loadings = [{latentName = "depression", scale = 2.5}]

# Per-item loading example — each item gets its own scale:
# loadings = [{latentName = "depression", itemScales = {"1" = 3.0, "2" = 2.5, "3" = 2.0}}]

See examples/default_model.toml for the full default model expressed as TOML.
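The precedence rule (CLI over TOML) can be pictured as a simple merge. The sketch below uses Julia's `TOML` standard library (available from Julia 1.6). The `cli` dictionary and the use of `nothing` for flags that were not passed are assumptions for illustration, not the tool's internals.

```julia
using TOML

toml_text = """
[simulation]
nWaves   = 3
nSchools = 10
seed     = 7
"""

# Hypothetical CLI values; `nothing` marks a flag the user did not pass.
cli = Dict{String,Any}("nWaves" => 5, "nSchools" => nothing, "seed" => 42)

settings = TOML.parse(toml_text)["simulation"]
for (key, value) in cli
    value === nothing || (settings[key] = value)   # CLI wins only when supplied
end
```

Here nWaves and seed come from the command line, while nSchools keeps its TOML value.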

Examples

# Default run (3 waves, 10 schools, ~30 students/class)
ib_ox_dummies

# Small reproducible run → CSV
ib_ox_dummies --nWaves 2 --nSchools 3 --seed 42

# JSON output with Poisson-distributed class sizes
ib_ox_dummies --nStudentsPerClass "poisson(25)" --output json

# Full model from TOML config file (see examples/default_model.toml)
ib_ox_dummies --config examples/default_model.toml

# TOML config with CLI override: use TOML model but change wave count and seed
ib_ox_dummies --config examples/default_model.toml --nWaves 5 --seed 42

# #BeeWell GM Survey — full 136-item questionnaire (see examples/beewell_model.toml)
ib_ox_dummies --config examples/beewell_model.toml --seed 42

# Custom demographics: equal sex split, simplified ethnicity distribution, city field
ib_ox_dummies \
  --sex "M:0.50,F:0.50" \
  --ethnicity "White British:0.70,Asian:0.20,Black:0.05,Other:0.05" \
  --customField "d_city=faker.city" \
  --customField "d_country=United Kingdom" \
  --nSchools 3 --nWaves 2 --seed 1

# Custom latent model via CLI: depression driven by age + individual baseline + residual
ib_ox_dummies \
  --latentVariables "depression" \
  --linearEffect "depression:d_age:0.02" \
  --randomEffect "depression::uid:halfnorm(0,0.2)" \
  --randomEffect "depression:::norm(0,0.1)" \
  --nSchools 3 --nWaves 2 --seed 1

# Print JSON Schema describing the output columns
ib_ox_dummies --output schema

Output

Long-format tabular data with one row per student per wave:

wave	uid	name	school	yearGroup	schoolYear	class	d_age	d_sex	d_ethnicity	d_sexualOrientation	d_genderIdentity	phq9_1	…	gad7_1	…
1	s5agas99p	Michelle Roux	Islington Academy	2	2	2b	11	F	White British	Heterosexual/Straight	Cis	1	…	0	…

Demographics (d_*) are generated once per student and then updated between waves (by default, age increments by 1).
Questionnaire items (PHQ-9, GAD-7) are scored on a Likert 0–3 scale, derived from latent variables.
A configurable naughty monkey randomly removes ≈0.25 % of questionnaire cells and ≈5 % of demographics cells to simulate real-world data quality.
Missing values (Julia `missing`) appear as empty strings in CSV and as null in JSON.
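A corruption pass of this kind could be sketched as below. `sketch_monkey!` is a hypothetical function, not the package's default naughtyMonkey. Treating every `d_` column as demographic and matching items by prefix are assumptions.

```julia
using Random

# Hypothetical corruption pass over rows stored as Dict{String,Any}.
function sketch_monkey!(rng, rows; itemRate = 0.0025, demoRate = 0.05,
                        itemPrefixes = ["phq9_", "gad7_"])
    for row in rows, col in collect(keys(row))
        rate = if startswith(col, "d_")
            demoRate                       # demographics cells
        elseif any(p -> startswith(col, p), itemPrefixes)
            itemRate                       # questionnaire cells
        else
            0.0                            # identifiers such as wave/uid stay intact
        end
        rand(rng) < rate && (row[col] = missing)
    end
    return rows
end

rows = [Dict{String,Any}("uid" => "s$i", "d_age" => 11, "phq9_1" => 1) for i in 1:1000]
sketch_monkey!(MersenneTwister(3), rows)
n_missing_age = count(r -> ismissing(r["d_age"]), rows)   # expected to be near 5 % of rows
```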

When includeLatents = true, additional l_* columns are appended (e.g. l_depression, l_anxiety) containing the continuous latent values used to generate the questionnaire responses.

JSON Schema

The --output schema flag (or --schema) prints a JSON Schema (Draft 7) document describing the output row type, suitable for validation or code generation.

Package API

The package can also be used programmatically from Julia:

using IbOxDummies
using Distributions

# Run with all defaults; returns the simulated data and its schema
data, schema = simulate(SimulationConfig(seed = 42))

# Write CSV to stdout (uses CSV.jl)
to_csv(data, schema)

# Write JSON to stdout (uses JSON3.jl)
to_json(data, schema)

# Include latent variable values in output for ground-truth comparison
data, schema = simulate(SimulationConfig(seed = 42, includeLatents = true))

# Custom configuration
config = SimulationConfig(
    nWaves                     = 2,
    nSchools                   = 5,
    nYeargroupsPerSchool       = Range(3, 6),
    nClassesPerSchoolYeargroup = Range(1, 4),
    nStudentsPerClass          = Normal(28.0, 5.0),
    seed                       = 123,
    output                     = "csv",
)
data, schema = simulate(config)

# Custom latent model
data, schema = simulate(SimulationConfig(
    seed            = 42,
    latentVariables = ["depression"],
    linearEffects = [LinearEffect("depression", ["d_age"], 0.03)],
    randomEffects = [
        RandomEffect("depression", [], ["uid"],  truncated(Normal(0, 0.2), 0, Inf)),
        RandomEffect("depression", [], [],       Normal(0, 0.1)),  # residual error
    ],
    questionnaires  = [make_phq9()],
    includeLatents  = true,
))

# Custom demographics: override sex distribution and add a Faker-based city field
using Faker
data, schema = simulate(SimulationConfig(
    seed = 42,
    demographicsSpec = DemographicsSpec(
        sex = [("M", 0.45), ("F", 0.45), ("I", 0.10)],
        customFields = Dict{String,Function}("d_city" => Faker.city),
    ),
))

Key types

Type	Description
`Response`	`Union{Int, Float64, String, Missing}` — a single answer
`DataRow`	`Dict{String, Response}` — internal per-student-wave record
`Schema`	Column metadata (demographics, questionnaire, and latent columns)
`Range`	Inclusive integer range `[min, max]`
`SamplerSpec`	`Union{Int, Range, UnivariateDistribution, Function}` — flexible sampler used for counts and random effect values
`LinearEffect`	Fixed linear effect on a latent variable
`RandomEffect`	Random effect on a latent variable; `value::SamplerSpec` (distribution or callable)
`LatentLoading`	Maps a latent variable to questionnaire item means via a scale factor — either a uniform `Float64` for all items or a `Dict{String,Float64}` of per-item scales
`QuestionnaireSpec`	Declarative Likert-scale questionnaire specification
`DemographicsSpec`	Categorical weight distributions (ethnicity, sex, gender identity, sexual orientation) plus optional arbitrary `customFields` (for example Faker-based generators)
`SimulationConfig`	All simulation parameters with sensible defaults
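The idea behind a flexible sampler type such as `SamplerSpec` can be illustrated with multiple dispatch. This is a minimal standard-library sketch. `RangeSketch` and `sample_count` are hypothetical names, a closure stands in for a distribution, and flooring negative draws at zero is an assumption.

```julia
using Random

# Hypothetical stand-in for the inclusive Range type.
struct RangeSketch
    lo::Int
    hi::Int
end

sample_count(rng, n::Integer) = n                                # fixed count
sample_count(rng, r::RangeSketch) = rand(rng, r.lo:r.hi)         # uniform over [lo, hi]
sample_count(rng, f::Function) = max(0, round(Int, f(rng)))      # callable, rounded, floored at 0

rng = MersenneTwister(3)
n_yeargroups = sample_count(rng, 5)
n_classes    = sample_count(rng, RangeSketch(1, 5))
n_students   = sample_count(rng, r -> 30 + 7 * randn(r))         # roughly Normal(30, 7)
```

A method for `UnivariateDistribution` would follow the same pattern with Distributions.jl loaded, calling `rand(rng, d)` before rounding.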

`SimulationConfig` fields

Field	Default	Description
`nWaves`	`3`	Number of data-collection waves
`nSchools`	`10`	Number of schools
`nYeargroupsPerSchool`	`5`	Count spec for yeargroups per school
`nClassesPerSchoolYeargroup`	`Range(1,5)`	Count spec for classes per yeargroup
`nStudentsPerClass`	`Normal(30,7)`	Count spec for students per class
`questionnaires`	`[]` → default PHQ-9 + GAD-7	Vector of `QuestionnaireSpec`
`latentVariables`	`[]` → `["depression","anxiety"]`	Latent variable names
`linearEffects`	`[]` → defaults	Fixed linear effects
`randomEffects`	`[]` → defaults	Random effects
`includeLatents`	`false`	Append `l_*` latent columns to output
`demographicPerturbationSD`	`0.05`	Per-school demographic weight perturbation SD
`demographicsSpec`	`nothing` → UK census defaults	Custom `DemographicsSpec` for demographic distributions
`demographicsUpdateFn`	age +1	Function updating demographics between waves
`naughtyMonkey`	0.25%/5% deletion	Function applying data-quality corruption
`output`	`"csv"`	Output format (`"csv"`, `"json"`, `"schema"`, or custom `Function`)
`seed`	`nothing`	Random seed for reproducibility

Questionnaires

Questionnaires are specified declaratively with QuestionnaireSpec, which defines the number of items, number of Likert levels, latent variable loadings, noise SD, and spoil rate.

PHQ-9 (Patient Health Questionnaire-9)

Nine items scored 0–3, measuring depression severity. Loads on the "depression" latent variable (scale 2.5).

GAD-7 (Generalised Anxiety Disorder-7)

Seven items scored 0–3, measuring anxiety severity. Loads on "anxiety" (scale 2.5) and secondarily "depression" (scale 0.8).

#BeeWell GM Survey (2025)

Full implementation of the Greater Manchester #BeeWell Survey (updated August 2025). Covers all 124+ survey items across the Domains of Wellbeing and Drivers of Wellbeing, grouped into 49 QuestionnaireSpecs that together produce 136 bw_* columns.

Questions modelled:

Range	Scale	Description
Q4–5	3-pt / 16-pt	Migration background (Q1–3 are demographic columns)
Q6	0–10	Life satisfaction (ONS)
Q7–13	0–4	Psychological wellbeing (SWEMWBS, 7 items)
Q14–18	0–3	Self-esteem (Rosenberg, 5 items)
Q19–21	0–2	Emotion regulation (CWMS coping subscale)
Q22	0–10	Appearance happiness
Q23–26	0–4	Stress & coping (PSS-4; not Year 7)
Q27–36	0–2	Emotional difficulties (Me & My Feelings, 10 items)
Q37–42	0–2	Behavioural difficulties (Me & My Feelings, 6 items; item 4 reverse-scored)
Q43–51	mixed	Physical health, sleep, activity, nutrition
Q52–67	mixed	Free time, social media, volunteering, 11 activities
Q68–77	mixed	School connection, attainment, staff relationships, isolation, pressure
Q78–86	mixed	Home/local environment, food security, material deprivation
Q87–98	mixed	Future readiness, careers education (Year 10), GMACS
Q99–107	0–4 / 0–4	Parent relationships, friendships, loneliness
Q108–116	mixed	Discrimination (5 characteristics + 7 locations), bullying
Q117–124	mixed	Wellbeing support access, mental health contacts, Kooth

Latent variables (13 total, all clamped to [0, 1]):

Name	Construct
`wellbeing`	Positive psychological wellbeing
`depression`	Depressive affect
`anxiety`	Anxious / worried affect
`behaviour`	Behavioural difficulties
`physical_health`	Physical health and activity level
`unhealthy_diet`	Frequency of unhealthy food/drink
`social_connection`	Quality of relationships and belonging
`future_optimism`	Hope and readiness for the future
`socioeconomic`	Socioeconomic advantage
`migration`	Migrant family background (spike: ~15 % non-zero)
`discrimination`	Exposure to discrimination (spike: ~10 %)
`victimization`	Exposure to bullying (spike: ~10 %)
`screen_time`	Daily screen / social-media engagement

Usage via TOML (CLI):

ib_ox_dummies --config examples/beewell_model.toml --seed 42

Usage via Julia API:

using IbOxDummies

data, schema = simulate(SimulationConfig(
    seed            = 42,
    latentVariables = beewell_latent_variables(),
    linearEffects   = beewell_linear_effects(),
    randomEffects   = beewell_random_effects(),
    questionnaires  = beewell_questionnaires(),
))

Sample output (first 5 rows, selected columns; seed = 42, 1 wave, 2 schools):

wave	uid	school	d_age	d_sex	bw_life_sat_1	bw_wbeing_1	bw_selfest_1	bw_emodies_1	bw_physh_1	bw_sleep_1	bw_physact_1	bw_future_1	bw_lonely_1	bw_kooth_1
1	bu061oj01	Cleliamouth High School	10	M	2	3	1	1	3	1	3	0	2	2
1	vdpqocpmo	Cleliamouth High School	10	F	0	1	1	1	4	0	3	2	2	2
1	as8wsya9x	Cleliamouth High School	10	M	3	1	2	0	3	1	4	2	2	0
1	54dc8alga	Cleliamouth High School	10	—	2	1	2	2	2	0	0	3	4	1
1	5uufz0qy3	Cleliamouth High School	10	F	6	2	0	0	2	1	6	2	1	1

The full output has 148 columns (12 demographics + 136 questionnaire items). See examples/beewell_model.toml for the complete model.

Development

Run the test suite:

julia --project=. test/runtests.jl

License

See LICENSE.
