ib-ox-dummies

A simulator for generating realistic mock longitudinal questionnaire data, designed to represent annual survey responses from schoolchildren.

Built in Julia.

How it works

The simulator uses an agent-based approach driven by latent variables.

Each simulated student has latent mental-health scores (e.g. depression, anxiety) that evolve across waves and drive their questionnaire responses. The latent model is a simple structural equation model composed of two building blocks:

  • LinearEffect(target, inputs, value) — a fixed linear effect.
    Adds value × ∏(inputs) to the named latent for every row.
    Example: an age effect LinearEffect("depression", ["d_age"], 0.02) increases depression by 0.02 per year of age.

  • RandomEffect(target, numericalInputs, categoricalInputs, value) — a random effect.
    One value is drawn from value per unique combination of categoricalInputs (e.g. one draw per school, one per student), optionally scaled by numericalInputs.
    Empty categoricalInputs produces a fresh draw per observation (residual error).
    Example: RandomEffect("depression", [], ["uid"], truncated(Normal(0, 0.2), 0, Inf)) gives each student a stable half-normal baseline.
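
The two building blocks above can be sketched in a few lines of plain Julia. This is an illustrative sketch only, not the package's implementation: `ToyLinearEffect`, `ToyRandomEffect`, and `apply!` are hypothetical stand-ins, rows are plain dictionaries, and the half-normal draw is written as `abs(0.2 * randn(rng))` rather than through Distributions.jl.

```julia
using Random

# Hypothetical stand-ins for the package's LinearEffect / RandomEffect.
struct ToyLinearEffect
    target::String
    inputs::Vector{String}
    value::Float64
end

struct ToyRandomEffect
    target::String
    numericalInputs::Vector{String}
    categoricalInputs::Vector{String}
    draw::Function                     # rng -> Float64
end

# Fixed effect: add value × ∏(inputs) to the target latent on every row.
function apply!(rows, e::ToyLinearEffect)
    for r in rows
        r[e.target] = get(r, e.target, 0.0) + e.value * prod(Float64[r[k] for k in e.inputs])
    end
    return rows
end

# Random effect: one draw per unique combination of categoricalInputs,
# or a fresh draw per row when categoricalInputs is empty (residual error).
function apply!(rows, e::ToyRandomEffect, rng::AbstractRNG)
    draws = Dict{Any,Float64}()
    for r in rows
        key = Tuple(r[k] for k in e.categoricalInputs)
        d = isempty(e.categoricalInputs) ? e.draw(rng) : get!(() -> e.draw(rng), draws, key)
        scale = prod(Float64[r[k] for k in e.numericalInputs])   # 1.0 when there are no inputs
        r[e.target] = get(r, e.target, 0.0) + d * scale
    end
    return rows
end

rng = MersenneTwister(1)
rows = [Dict{String,Any}("uid" => u, "wave" => w, "d_age" => 11 + w) for u in ["a", "b"] for w in 1:2]
apply!(rows, ToyLinearEffect("depression", ["d_age"], 0.02))
apply!(rows, ToyRandomEffect("depression", String[], ["uid"], rng -> abs(0.2 * randn(rng))), rng)
```

After both calls, the two rows of a given student share the same baseline offset on top of their age effect, which is what makes the baseline "stable" across waves.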

Default model (used when you call simulate() with no custom configuration):

| Component | Affects | Distribution |
| --- | --- | --- |
| Fixed age effect | depression, anxiety | +0.02 / +0.015 per year |
| Fixed sex effect (F=+1, M=−1) | depression, anxiety | ×0.05 |
| Fixed age × sex interaction | depression, anxiety | ×0.005 / ×0.004 |
| Random cohort (yearGroup) cluster | depression, anxiety | Normal(0, 0.05) |
| Random ethnicity × class × school cluster | depression, anxiety | Normal(0, 0.03) |
| Random individual × wave trajectory | depression, anxiety | Normal(0, 0.15) / Normal(0, 0.12) |
| Random individual baseline | depression, anxiety | half-Normal(0, 0.2) / half-Normal(0, 0.15) |
| Major depressive episode (1% per wave) | depression | Bernoulli(0.01) × Normal(0.75, 0.1) |
| Residual error | depression, anxiety | Normal(0, 0.1) |
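
The "major depressive episode" row is a spike distribution: most draws are zero, and about 1% are a Normal(0.75, 0.1) value. A minimal sketch of such a draw is given here; `mde_draw` is a hypothetical helper name, not a function exported by the package.

```julia
using Random

# Spike draw: with probability p return a Normal(mu, sigma) value, otherwise 0.
function mde_draw(rng::AbstractRNG, mu::Real, sigma::Real; p::Real = 0.01)
    return rand(rng) < p ? mu + sigma * randn(rng) : 0.0
end

rng = MersenneTwister(2024)
draws = [mde_draw(rng, 0.75, 0.1) for _ in 1:100_000]
share = count(!iszero, draws) / length(draws)   # fraction of non-zero draws, near p
```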

Questionnaire items are then generated by mapping the student's latent values (clamped to [0, 1]) to a Likert scale mean via LatentLoading(latentName, scale), sampling truncated Normal noise, and rounding. Longitudinal continuity is preserved by blending 75% of the latent-derived mean with 25% of the previous wave's score.
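
The item-generation step might look roughly like the sketch below. Several details are assumptions rather than facts about the package: the helper names `rand_truncnormal` and `likert_item`, truncating the noisy value to the Likert range before rounding, and implementing the truncated Normal by rejection sampling.

```julia
using Random

# Draw from Normal(mu, sd) truncated to [lo, hi] by simple rejection sampling.
function rand_truncnormal(rng::AbstractRNG, mu, sd, lo, hi)
    while true
        x = mu + sd * randn(rng)
        lo <= x <= hi && return x
    end
end

# One Likert response on 0:(nLevels - 1) from a latent value and a loading scale.
function likert_item(rng::AbstractRNG, latent, scale, nLevels, noiseSD; prev = nothing)
    top = Float64(nLevels - 1)
    mu = clamp(clamp(latent, 0.0, 1.0) * scale, 0.0, top)   # latent-derived mean
    if prev !== nothing
        mu = 0.75 * mu + 0.25 * prev                        # blend with previous wave
    end
    return round(Int, rand_truncnormal(rng, mu, noiseSD, 0.0, top))
end

rng = MersenneTwister(7)
wave1 = likert_item(rng, 0.40, 2.5, 4, 0.6)                 # PHQ-9 style item
wave2 = likert_item(rng, 0.45, 2.5, 4, 0.6; prev = wave1)   # second wave, blended
```

Blending with the previous score pulls each wave toward the student's last answer, so item trajectories change more smoothly than independent draws would.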

Demographics show realistic inter-school variation: each school's demographic weight vectors are independently perturbed with Gaussian noise before students are generated.
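
As an illustration of per-school perturbation, the sketch below adds Gaussian noise to a weight vector and then samples a category from the result. Clamping negative weights to zero and renormalising are assumptions made for this sketch, and `perturb_weights` and `sample_category` are hypothetical names.

```julia
using Random

# Add Normal(0, sd) noise to each weight, clamp at zero, and renormalise to sum to 1.
function perturb_weights(rng::AbstractRNG, weights::Vector{Float64}, sd::Float64)
    noisy = max.(weights .+ sd .* randn(rng, length(weights)), 0.0)
    total = sum(noisy)
    return total > 0 ? noisy ./ total : weights ./ sum(weights)
end

# Draw one category with probability proportional to its weight.
function sample_category(rng::AbstractRNG, cats, weights)
    u = rand(rng) * sum(weights)
    acc = 0.0
    for (c, w) in zip(cats, weights)
        acc += w
        u <= acc && return c
    end
    return last(cats)
end

rng = MersenneTwister(11)
cats = ["M", "F", "I"]
base = [0.49, 0.49, 0.02]
school_weights = perturb_weights(rng, base, 0.05)   # one perturbed vector per school
student_sex = sample_category(rng, cats, school_weights)
```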

Requirements

  • Julia ≥ 1.0

Installation

Clone the repository and activate the package:

git clone https://github.com/OxfordRSE/ib-ox-dummies.git
cd ib-ox-dummies
julia --project=. -e 'import Pkg; Pkg.instantiate()'

To run the CLI directly:

julia bin/ib_ox_dummies --help

Or add the bin/ directory to your PATH:

export PATH="$PATH:/path/to/ib-ox-dummies/bin"
ib_ox_dummies --help

Usage

usage: ib_ox_dummies [--config CONFIG]
                     [--nWaves NWAVES] [--nSchools NSCHOOLS]
                     [--nYeargroupsPerSchool SPEC]
                     [--nClassesPerSchoolYeargroup SPEC]
                     [--nStudentsPerClass SPEC]
                     [--latentVariables VARS]
                     [--linearEffect EFFECT]... [--randomEffect EFFECT]...
                     [--ethnicity WEIGHTS] [--sex WEIGHTS]
                     [--genderIdentity WEIGHTS] [--sexualOrientation WEIGHTS]
                     [--customField NAME=VALUE]...
                     [--seed SEED] [--output OUTPUT] [--schema]
                     [--version] [-h]

Generate mock longitudinal questionnaire data for schoolchildren.

Use --config to load a TOML file specifying the full model (questionnaires,
latent variable loadings, effects, demographics). CLI arguments override TOML values.

SPEC formats: integer ('5'), range ('1:5'), norm(μ,σ), halfnorm(μ,σ),
poisson(λ), negbinom(r,p), lognorm(μ,σ), uniform(a,b), exponential(rate),
gamma(shape,scale), mde(μ,σ[,p]) (Bernoulli spike: probability p [default 0.01]
of Normal(μ,σ) draw, else 0).

LinearEffect format:  "target:inputs:value"
  e.g. "depression:d_age:0.02"  or  "anxiety:d_age,_sex_fm:0.004"

RandomEffect format:  "target:numInputs:catInputs:spec"
  e.g. "depression::uid,wave:norm(0,0.15)"  or  "anxiety:::norm(0,0.1)"

Demographics weight format: "Category1:weight1,Category2:weight2,..."
  e.g. "M:0.49,F:0.49,I:0.02"  or  "White British:0.75,Asian:0.15,Other:0.10"

optional arguments:
  --config CONFIG                          Path to TOML configuration file; CLI args override
  --nWaves NWAVES                          Number of waves (type: Int, default: 3)
  --nSchools NSCHOOLS                      Number of schools (type: Int, default: 10)
  --nYeargroupsPerSchool SPEC              Yeargroups per school (default: "5")
  --nClassesPerSchoolYeargroup SPEC        Classes per yeargroup (default: "1:5")
  --nStudentsPerClass SPEC                 Students per class (default: "norm(30,7)")
  --latentVariables VARS                   Comma-separated latent names (default: "depression,anxiety")
  --linearEffect EFFECT                    Add a LinearEffect (repeatable; replaces TOML/defaults)
  --randomEffect EFFECT                    Add a RandomEffect (repeatable; replaces TOML/defaults)
  --ethnicity WEIGHTS                      Ethnicity distribution (overrides TOML/UK 2021 Census)
  --sex WEIGHTS                            Sex distribution (overrides TOML/UK 2021 Census)
  --genderIdentity WEIGHTS                 Gender identity distribution (overrides TOML/UK 2021 Census)
  --sexualOrientation WEIGHTS              Sexual orientation distribution (overrides TOML/UK 2021 Census)
  --customField NAME=VALUE                 Custom demographic column: value is a Faker method name
                                           (e.g. "faker.city") or a constant string. Repeatable.
                                           Overrides matching TOML [demographics.customFields] entries.
  --seed SEED                              Random seed (type: Int; overrides TOML)
  --output OUTPUT                          Output format: csv | json | schema (default: "csv")
  --schema                                 Print JSON Schema and exit
  --version                                Show version and exit
  -h, --help                               Show this help message and exit
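
To make the colon-separated formats concrete, here is a small parsing sketch. It is not the CLI's actual parser; the function names are hypothetical, and the SPEC part of a RandomEffect is kept as an unparsed string.

```julia
# "target:inputs:value", e.g. "anxiety:d_age,_sex_fm:0.004"
function parse_linear_effect(s::AbstractString)
    parts = split(s, ':')
    length(parts) == 3 || throw(ArgumentError("expected target:inputs:value, got \"$s\""))
    return (target = String(parts[1]),
            inputs = String.(split(parts[2], ','; keepempty = false)),
            value  = parse(Float64, parts[3]))
end

# "target:numInputs:catInputs:spec"; limit = 4 keeps a spec such as "1:5" intact.
function parse_random_effect(s::AbstractString)
    parts = split(s, ':'; limit = 4)
    length(parts) == 4 || throw(ArgumentError("expected target:numInputs:catInputs:spec, got \"$s\""))
    return (target            = String(parts[1]),
            numericalInputs   = String.(split(parts[2], ','; keepempty = false)),
            categoricalInputs = String.(split(parts[3], ','; keepempty = false)),
            spec              = String(parts[4]))
end

# "Category1:weight1,Category2:weight2,..."
function parse_weights(s::AbstractString)
    out = Pair{String,Float64}[]
    for item in split(s, ',')
        name, w = rsplit(item, ':'; limit = 2)
        push!(out, String(strip(name)) => parse(Float64, w))
    end
    return out
end

lin = parse_linear_effect("anxiety:d_age,_sex_fm:0.004")
ran = parse_random_effect("depression::uid,wave:norm(0,0.15)")
sex = parse_weights("M:0.49,F:0.49,I:0.02")
```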

TOML configuration file

The --config flag accepts a TOML file that can specify the complete model — questionnaires, latent variable loadings, linear/random effects, and demographics — all in one place. Any CLI argument overrides the corresponding TOML value.

# examples/default_model.toml — abbreviated excerpt
[simulation]
nWaves   = 3
nSchools = 10
nStudentsPerClass = "norm(30,7)"
latentVariables   = ["depression", "anxiety"]

[demographics]
sex       = "M:0.490,F:0.490,I:0.020"
ethnicity = "White British:0.812,Asian:0.083,Black:0.040,Mixed:0.030,Other:0.035"

# Custom demographic fields: value is a Faker method name or a constant string.
[demographics.customFields]
d_city    = "faker.city"           # Faker-generated city name per student
d_country = "United Kingdom"       # constant string on every row

[[linearEffect]]
target = "depression"
inputs = ["d_age"]
value  = 0.02

[[randomEffect]]
target            = "depression"
categoricalInputs = ["uid"]
value             = "halfnorm(0,0.2)"

[[questionnaire]]
name     = "PHQ_9"
prefix   = "phq9"
nItems   = 9
nLevels  = 4
noiseSD  = 0.6
spoilRate = 0.01
# Uniform loading — all items get the same scale factor:
loadings = [{latentName = "depression", scale = 2.5}]

# Per-item loading example — each item gets its own scale:
# loadings = [{latentName = "depression", itemScales = {"1" = 3.0, "2" = 2.5, "3" = 2.0}}]

See examples/default_model.toml for the full default model expressed as TOML.
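
For readers who want to inspect a config file from Julia, the sketch below parses a shortened version of the excerpt with the TOML standard library (available from Julia 1.6) and applies a CLI-style override. The `cli_overrides` dictionary and the merge step are illustrative assumptions about how "CLI overrides TOML" could be implemented, not the package's code.

```julia
using TOML

config_text = """
[simulation]
nWaves   = 3
nSchools = 10
nStudentsPerClass = "norm(30,7)"

[[linearEffect]]
target = "depression"
inputs = ["d_age"]
value  = 0.02

[[questionnaire]]
name     = "PHQ_9"
nItems   = 9
loadings = [{latentName = "depression", scale = 2.5}]
"""

cfg = TOML.parse(config_text)

# Hypothetical CLI overrides: keys present here win over the TOML values.
cli_overrides = Dict{String,Any}("nWaves" => 5, "seed" => 42)
simulation = merge(cfg["simulation"], cli_overrides)

# Each [[linearEffect]] table becomes one dictionary in an array.
for effect in cfg["linearEffect"]
    println(effect["target"], " <- ", effect["value"], " * ", join(effect["inputs"], " * "))
end
```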

Examples

# Default run (3 waves, 10 schools, ~30 students/class)
ib_ox_dummies

# Small reproducible run → CSV
ib_ox_dummies --nWaves 2 --nSchools 3 --seed 42

# JSON output with Poisson-distributed class sizes
ib_ox_dummies --nStudentsPerClass "poisson(25)" --output json

# Full model from TOML config file (see examples/default_model.toml)
ib_ox_dummies --config examples/default_model.toml

# TOML config with CLI override: use TOML model but change wave count and seed
ib_ox_dummies --config examples/default_model.toml --nWaves 5 --seed 42

# #BeeWell GM Survey — full 136-item questionnaire (see examples/beewell_model.toml)
ib_ox_dummies --config examples/beewell_model.toml --seed 42

# Custom demographics: equal sex split, simplified ethnicity distribution, city field
ib_ox_dummies \
  --sex "M:0.50,F:0.50" \
  --ethnicity "White British:0.70,Asian:0.20,Black:0.05,Other:0.05" \
  --customField "d_city=faker.city" \
  --customField "d_country=United Kingdom" \
  --nSchools 3 --nWaves 2 --seed 1

# Custom latent model via CLI: depression driven by age + individual baseline + residual
ib_ox_dummies \
  --latentVariables "depression" \
  --linearEffect "depression:d_age:0.02" \
  --randomEffect "depression::uid:halfnorm(0,0.2)" \
  --randomEffect "depression:::norm(0,0.1)" \
  --nSchools 3 --nWaves 2 --seed 1

# Print JSON Schema describing the output columns
ib_ox_dummies --output schema

Output

Long-format tabular data with one row per student per wave:

| wave | uid | name | school | yearGroup | schoolYear | class | d_age | d_sex | d_ethnicity | d_sexualOrientation | d_genderIdentity | phq9_1 | gad7_1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | s5agas99p | Michelle Roux | Islington Academy | 2 | 2 | 2b | 11 | F | White British | Heterosexual/Straight | Cis | 1 | 0 |

  • Demographics (d_*) are generated once and updated each wave (age increments by 1).
  • Questionnaire items (PHQ-9, GAD-7) are scored on a Likert 0–3 scale, derived from latent variables.
  • A configurable naughty monkey randomly removes ≈0.25 % of questionnaire cells and ≈5 % of demographics cells to simulate real-world data quality.
  • missing values appear as empty strings in CSV and null in JSON.
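
A deletion pass of this kind can be sketched as follows. This is a simplified illustration with hypothetical helper names (`drop_cells!`, `csv_cell`); the package's own naughtyMonkey function may work differently.

```julia
using Random

# Replace each listed cell with `missing` independently with probability p.
function drop_cells!(rng::AbstractRNG, rows, cols, p::Real)
    for r in rows, c in cols
        if haskey(r, c) && rand(rng) < p
            r[c] = missing
        end
    end
    return rows
end

# Render a cell for CSV: missing becomes an empty string.
csv_cell(v) = ismissing(v) ? "" : string(v)

rng = MersenneTwister(3)
rows = [Dict{String,Any}("d_sex" => "F", "phq9_1" => 1, "gad7_1" => 0) for _ in 1:1000]
drop_cells!(rng, rows, ["phq9_1", "gad7_1"], 0.0025)   # questionnaire cells
drop_cells!(rng, rows, ["d_sex"], 0.05)                # demographic cells
line = join((csv_cell(rows[1][c]) for c in ["d_sex", "phq9_1", "gad7_1"]), ",")
```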

When includeLatents = true, additional l_* columns are appended (e.g. l_depression, l_anxiety) containing the continuous latent values used to generate the questionnaire responses.

JSON Schema

The --output schema flag (or --schema) prints a JSON Schema (Draft 7) document describing the output row type, suitable for validation or code generation.

Package API

The package can also be used programmatically from Julia:

using IbOxDummies
using Distributions

# Run with default settings and a fixed seed; returns the data (a DataFrame) and its schema
data, schema = simulate(SimulationConfig(seed = 42))

# Write CSV to stdout (uses CSV.jl)
to_csv(data, schema)

# Write JSON to stdout (uses JSON3.jl)
to_json(data, schema)

# Include latent variable values in output for ground-truth comparison
data, schema = simulate(SimulationConfig(seed = 42, includeLatents = true))

# Custom configuration
config = SimulationConfig(
    nWaves                     = 2,
    nSchools                   = 5,
    nYeargroupsPerSchool       = Range(3, 6),
    nClassesPerSchoolYeargroup = Range(1, 4),
    nStudentsPerClass          = Normal(28.0, 5.0),
    seed                       = 123,
    output                     = "csv",
)
data, schema = simulate(config)

# Custom latent model
data, schema = simulate(SimulationConfig(
    seed            = 42,
    latentVariables = ["depression"],
    linearEffects = [LinearEffect("depression", ["d_age"], 0.03)],
    randomEffects = [
        RandomEffect("depression", [], ["uid"],  truncated(Normal(0, 0.2), 0, Inf)),
        RandomEffect("depression", [], [],       Normal(0, 0.1)),  # residual error
    ],
    questionnaires  = [make_phq9()],
    includeLatents  = true,
))

# Custom demographics: override sex distribution and add a Faker-based city field
using Faker
data, schema = simulate(SimulationConfig(
    seed = 42,
    demographicsSpec = DemographicsSpec(
        sex = [("M", 0.45), ("F", 0.45), ("I", 0.10)],
        customFields = Dict{String,Function}("d_city" => Faker.city),
    ),
))

Key types

| Type | Description |
| --- | --- |
| Response | Union{Int, Float64, String, Missing}: a single answer |
| DataRow | Dict{String, Response}: internal per-student-wave record |
| Schema | Column metadata (demographics, questionnaire, and latent columns) |
| Range | Inclusive integer range [min, max] |
| SamplerSpec | Union{Int, Range, UnivariateDistribution, Function}: flexible sampler used for counts and random effect values |
| LinearEffect | Fixed linear effect on a latent variable |
| RandomEffect | Random effect on a latent variable; value::SamplerSpec (distribution or callable) |
| LatentLoading | Maps a latent variable to questionnaire item means via a scale factor: either a uniform Float64 for all items or a Dict{String,Float64} of per-item scales |
| QuestionnaireSpec | Declarative Likert-scale questionnaire specification |
| DemographicsSpec | Categorical weight distributions (ethnicity, sex, gender identity, sexual orientation) plus optional customFields (Faker-based or constant) |
| SimulationConfig | All simulation parameters with sensible defaults |
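
The idea behind a union-typed sampler such as SamplerSpec maps naturally onto Julia's multiple dispatch. The sketch below is an assumption-laden illustration: `draw_count` and `ToyNormal` are hypothetical, a `UnitRange` stands in for the package's Range type, and rounding with a floor of 1 is a choice made here, not documented behaviour.

```julia
using Random

# Toy stand-in for a Normal distribution (the package itself uses Distributions.jl).
struct ToyNormal
    mu::Float64
    sigma::Float64
end

# One method per kind of spec; dispatch picks the right one.
draw_count(rng::AbstractRNG, n::Integer) = n                              # fixed count
draw_count(rng::AbstractRNG, r::UnitRange{<:Integer}) = rand(rng, r)      # inclusive range
draw_count(rng::AbstractRNG, d::ToyNormal) =
    max(1, round(Int, d.mu + d.sigma * randn(rng)))                       # rounded, at least 1
draw_count(rng::AbstractRNG, f::Function) = f(rng)                        # custom callable

rng = MersenneTwister(5)
n_yeargroups = draw_count(rng, 5)
n_classes    = draw_count(rng, 1:5)
n_students   = draw_count(rng, ToyNormal(30.0, 7.0))
n_custom     = draw_count(rng, rng -> rand(rng, 20:25))
```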

SimulationConfig fields

| Field | Default | Description |
| --- | --- | --- |
| nWaves | 3 | Number of data-collection waves |
| nSchools | 10 | Number of schools |
| nYeargroupsPerSchool | 5 | Count spec for yeargroups per school |
| nClassesPerSchoolYeargroup | Range(1,5) | Count spec for classes per yeargroup |
| nStudentsPerClass | Normal(30,7) | Count spec for students per class |
| questionnaires | [] → default PHQ-9 + GAD-7 | Vector of QuestionnaireSpec |
| latentVariables | [] → ["depression","anxiety"] | Latent variable names |
| linearEffects | [] → defaults | Fixed linear effects |
| randomEffects | [] → defaults | Random effects |
| includeLatents | false | Append l_* latent columns to output |
| demographicPerturbationSD | 0.05 | Per-school demographic weight perturbation SD |
| demographicsSpec | nothing → UK census defaults | Custom DemographicsSpec for demographic distributions |
| demographicsUpdateFn | age +1 | Function updating demographics between waves |
| naughtyMonkey | 0.25%/5% deletion | Function applying data-quality corruption |
| output | "csv" | Output format ("csv", "json", "schema", or custom Function) |
| seed | nothing | Random seed for reproducibility |

Questionnaires

Questionnaires are specified declaratively with QuestionnaireSpec, which defines the number of items, number of Likert levels, latent variable loadings, noise SD, and spoil rate.

PHQ-9 (Patient Health Questionnaire-9)

Nine items scored 0–3, measuring depression severity. Loads on the "depression" latent variable (scale 2.5).

GAD-7 (Generalised Anxiety Disorder-7)

Seven items scored 0–3, measuring anxiety severity. Loads on "anxiety" (scale 2.5) and secondarily "depression" (scale 0.8).
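
To show how a questionnaire with more than one loading could arrive at an item mean, here is a small sketch. Summing the loaded latents and clamping to the Likert range is an assumption made for illustration; the README does not state how multiple loadings are combined. `item_scale` and `item_mean` are hypothetical names.

```julia
# A loading scale is either one number for all items or a per-item table
# keyed by the item number as a string (items without an entry get 0 here).
item_scale(scale::Real, item::Int) = Float64(scale)
item_scale(scales::AbstractDict, item::Int) = Float64(get(scales, string(item), 0.0))

# Assumed combination rule: sum of clamped latent × scale, clamped to 0:(nLevels - 1).
function item_mean(latents::AbstractDict, loadings, item::Int, nLevels::Int)
    total = 0.0
    for (name, scale) in loadings
        total += clamp(latents[name], 0.0, 1.0) * item_scale(scale, item)
    end
    return clamp(total, 0.0, Float64(nLevels - 1))
end

latents = Dict("depression" => 0.4, "anxiety" => 0.6)
phq9 = ["depression" => 2.5]
gad7 = ["anxiety" => 2.5, "depression" => 0.8]
per_item = ["depression" => Dict("1" => 3.0, "2" => 2.5, "3" => 2.0)]

phq9_mean = item_mean(latents, phq9, 1, 4)        # 0.4 × 2.5
gad7_mean = item_mean(latents, gad7, 1, 4)        # 0.6 × 2.5 + 0.4 × 0.8
item2_mean = item_mean(latents, per_item, 2, 4)   # 0.4 × 2.5, using the per-item scale
```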

#BeeWell GM Survey (2025)

Full implementation of the Greater Manchester #BeeWell Survey (updated August 2025). Covers all 124+ survey items across the Domains of Wellbeing and Drivers of Wellbeing, grouped into 49 QuestionnaireSpecs that together produce 136 bw_* columns.

Questions modelled:

| Questions | Scale | Description |
| --- | --- | --- |
| Q4–5 | 3-pt / 16-pt | Migration background (Q1–3 are demographic columns) |
| Q6 | 0–10 | Life satisfaction (ONS) |
| Q7–13 | 0–4 | Psychological wellbeing (SWEMWBS, 7 items) |
| Q14–18 | 0–3 | Self-esteem (Rosenberg, 5 items) |
| Q19–21 | 0–2 | Emotion regulation (CWMS coping subscale) |
| Q22 | 0–10 | Appearance happiness |
| Q23–26 | 0–4 | Stress & coping (PSS-4; not Year 7) |
| Q27–36 | 0–2 | Emotional difficulties (Me & My Feelings, 10 items) |
| Q37–42 | 0–2 | Behavioural difficulties (Me & My Feelings, 6 items; item 4 reverse-scored) |
| Q43–51 | mixed | Physical health, sleep, activity, nutrition |
| Q52–67 | mixed | Free time, social media, volunteering, 11 activities |
| Q68–77 | mixed | School connection, attainment, staff relationships, isolation, pressure |
| Q78–86 | mixed | Home/local environment, food security, material deprivation |
| Q87–98 | mixed | Future readiness, careers education (Year 10), GMACS |
| Q99–107 | 0–4 / 0–4 | Parent relationships, friendships, loneliness |
| Q108–116 | mixed | Discrimination (5 characteristics + 7 locations), bullying |
| Q117–124 | mixed | Wellbeing support access, mental health contacts, Kooth |

Latent variables (13 total, all clamped to [0, 1]):

| Name | Construct |
| --- | --- |
| wellbeing | Positive psychological wellbeing |
| depression | Depressive affect |
| anxiety | Anxious / worried affect |
| behaviour | Behavioural difficulties |
| physical_health | Physical health and activity level |
| unhealthy_diet | Frequency of unhealthy food/drink |
| social_connection | Quality of relationships and belonging |
| future_optimism | Hope and readiness for the future |
| socioeconomic | Socioeconomic advantage |
| migration | Migrant family background (spike: ~15 % non-zero) |
| discrimination | Exposure to discrimination (spike: ~10 %) |
| victimization | Exposure to bullying (spike: ~10 %) |
| screen_time | Daily screen / social-media engagement |

Usage via TOML (CLI):

ib_ox_dummies --config examples/beewell_model.toml --seed 42

Usage via Julia API:

using IbOxDummies

data, schema = simulate(SimulationConfig(
    seed            = 42,
    latentVariables = beewell_latent_variables(),
    linearEffects   = beewell_linear_effects(),
    randomEffects   = beewell_random_effects(),
    questionnaires  = beewell_questionnaires(),
))

Sample output (first 5 rows, selected columns; seed = 42, 1 wave, 2 schools):

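| wave | uid | school | d_age | d_sex | bw_life_sat_1 | bw_wbeing_1 | bw_selfest_1 | bw_emodies_1 | bw_behav_1 | bw_physh_1 | bw_sleep_1 | bw_physact_1 | bw_future_1 | bw_lonely_1 | bw_bullying_1 | bw_kooth_1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | bu061oj01 | Cleliamouth High School | 10 | M | 2 | 3 | 1 | 1 | 0 | 3 | 1 | 3 | 0 | 2 | 0 | 2 |
| 1 | vdpqocpmo | Cleliamouth High School | 10 | F | 0 | 1 | 1 | 1 | 0 | 4 | 0 | 3 | 2 | 2 | 0 | 2 |
| 1 | as8wsya9x | Cleliamouth High School | 10 | M | 3 | 1 | 2 | 0 | 0 | 3 | 1 | 4 | 2 | 2 | 0 | 0 |
| 1 | 54dc8alga | Cleliamouth High School | 10 |  | 2 | 1 | 2 | 2 | 0 | 2 | 0 | 0 | 3 | 4 | 0 | 1 |
| 1 | 5uufz0qy3 | Cleliamouth High School | 10 | F | 6 | 2 | 0 | 0 | 0 | 2 | 1 | 6 | 2 | 1 | 0 | 1 |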

The full output has 148 columns (12 demographics + 136 questionnaire items). See examples/beewell_model.toml for the complete model.

Development

Run the test suite:

julia --project=. test/runtests.jl

License

See LICENSE.

About

Utility to generate mock data for IB-Oxford dashboard project.
