name: Data Generator
about: Generating data to test our algo etc.
title: "[FEATURE] Synthetic Food & Symptom Log Generator"
labels: feature, backlog
assignees: @keshavgoel787 @TKDPenguin
Summary
Build a notebook that generates realistic synthetic food and symptom logs to test different algorithms for our backend.
Motivation
We need realistic mock data before we have real users. Purely random data would give false confidence; we need logs with realistic behavioral patterns and noise so the algorithm has to actually work to find the signal.
Requirements
Acceptance Criteria
- Notebook runs with configurable parameters (N users, N days, noise level, etc.)
- Generates `food_logs.json`, `symptom_logs.json`, and `ground_truth.json` matching the schema below (or just output in the notebook, but JSON would be nice as well)
- Supports multiple user archetypes, including a healthy control group
- Symptoms are delayed relative to food exposure (not immediate)
- Exposure accumulates and decays over time
- Data contains imperfect logging (missed meals, missed symptoms)
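As a sketch, a single configuration cell could hold the knobs above (all names here are hypothetical, not an agreed spec):

```python
# Hypothetical configuration cell -- parameter names are illustrative only.
CONFIG = {
    "n_users": 50,            # number of synthetic users
    "n_days": 90,             # days of history per user
    "meal_dropout": 0.25,     # fraction of meals that go unlogged
    "symptom_dropout": 0.30,  # fraction of symptom events that go unlogged
    "noise_level": 0.1,       # generic noise knob for preferences/timing
    "seed": 42,               # fixed seed for reproducible runs
}
```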
Out of Scope
- Don't connect to or seed the real DB
- No need for biological accuracy, just plausible structure
Technical Approach
Background
Watch first: https://www.youtube.com/watch?v=1GKtfgwf3ig (if doing markov chains)
The core idea is three layers:
- Eating behavior — users have routines and preferences, not random meals every day
- Latent exposure — trigger ingredients accumulate in the body and decay over time
- Symptom emission — symptoms fire probabilistically once exposure crosses a threshold, after a delay
Schema
Food log:
```json
{
  "user_id": "user_001",
  "timestamp": "2024-01-03T12:35:00",
  "food_name": "pasta",
  "ingredients": ["wheat flour", "egg", "olive oil"],
  "tags": ["gluten"]
}
```

Symptom log:

```json
{
  "user_id": "user_001",
  "timestamp": "2024-01-04T02:15:00",
  "symptom": "bloating",
  "severity": 7
}
```

Ground truth:

```json
{
  "user_archetypes": {
    "user_001": "gluten_sensitive"
  },
  "archetype_definitions": { "..." }
}
```

Step 1 — Food generation (habit-biased weighted sampling)
Each user gets baseline food preference weights, slightly randomized per user. Meal timing is semi-regular (breakfast 7–10am, lunch 11am–2pm, dinner 6–9pm). If a user ate a food recently, slightly increase its probability of appearing again — this simulates routine without needing complex modeling.
Add logging dropout — ~25% of meals go unlogged, higher on weekends.
If you have extra time: look into Markov chains as an alternative approach to modeling food routines — instead of weight boosting, you model day-to-day cluster transitions (e.g. grain-heavy day → likely grain-heavy tomorrow). Good rabbit hole if you want to go deeper.
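The weight-boosting approach above could look roughly like this — a minimal sketch where the food list, habit boost factor, and memory length are made-up placeholder values, not part of any agreed design:

```python
import random

# Illustrative food list and weights; only the mechanism (recency-boosted
# weighted sampling with logging dropout) matches the issue description.
FOODS = ["pasta", "salad", "rice bowl", "sandwich", "yogurt"]

def make_user_weights(rng):
    # Baseline preferences, slightly randomized per user.
    return {f: 1.0 + rng.uniform(-0.3, 0.3) for f in FOODS}

def sample_meal(weights, recent, rng, habit_boost=1.5):
    # Recently eaten foods get a multiplicative boost -> routines emerge.
    adjusted = {f: w * (habit_boost if f in recent else 1.0)
                for f, w in weights.items()}
    foods, w = zip(*adjusted.items())
    return rng.choices(foods, weights=w, k=1)[0]

rng = random.Random(0)
weights = make_user_weights(rng)
recent, meals = [], []
for _ in range(30):
    if rng.random() < 0.25:   # ~25% logging dropout
        continue
    meal = sample_meal(weights, recent, rng)
    meals.append(meal)
    recent = (recent + [meal])[-3:]  # remember the last 3 meals
```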
Step 2 — Trigger exposure model
Each food has trigger flags. Maintain a running exposure score per trigger per user:
```
exposure[t] = previous_exposure * decay_rate + trigger_intensity
```
When a user eats a trigger food, the score goes up. Every hour that passes, it decays. Plot this for a single user to sanity check it looks right before running the full simulation.
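A minimal sketch of that update rule, with placeholder decay and intensity values, producing the trace you would plot for the sanity check:

```python
# Running exposure score per the formula above; 0.97 decay and 0.3
# intensity are placeholder values, not calibrated constants.
def step_exposure(prev, decay_rate, intensity=0.0):
    return prev * decay_rate + intensity

trace = []
exposure = 0.0
for hour in range(48):
    # Pretend the user eats a trigger food at hours 12, 13, and 36.
    intensity = 0.3 if hour in (12, 13, 36) else 0.0
    exposure = step_exposure(exposure, decay_rate=0.97, intensity=intensity)
    trace.append(exposure)
# trace rises at the trigger meals and decays in between;
# plotting it (e.g. plt.plot(trace)) gives the sanity-check picture.
```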
Step 3 — Symptom generation
When exposure crosses a user's threshold, sample a lag window and fire a symptom event. Severity scales with how high the exposure got. Apply ~30% logging dropout to symptom events too.
Stretch goal: flare window — when a symptom fires, elevated probability of follow-on symptoms for the next 6–24hrs.
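One possible shape for the emission step — threshold check, sampled lag, severity scaled by overshoot, then dropout. The parameter values mirror the gluten_sensitive archetype below, but the scaling formula is an illustrative choice, not a spec:

```python
import random

def maybe_emit_symptom(exposure, threshold, lag_hours, severity_range, rng):
    if exposure < threshold:
        return None
    lag = rng.uniform(*lag_hours)        # delay before the symptom fires
    lo, hi = severity_range
    # Severity scales with how far exposure exceeded the threshold.
    overshoot = min(1.0, (exposure - threshold) / threshold)
    severity = round(lo + overshoot * (hi - lo))
    if rng.random() < 0.30:              # ~30% logging dropout
        return None
    return {"lag_hours": lag, "severity": severity}

rng = random.Random(1)
event = maybe_emit_symptom(0.9, threshold=0.6, lag_hours=(12, 36),
                           severity_range=(4, 9), rng=rng)
# event is either None (dropout / below threshold) or a dict with
# lag_hours and severity.
```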
Step 4 — Archetypes + control group
```python
ARCHETYPES = [
    {
        "name": "gluten_sensitive",
        "triggers": ["gluten"],
        "threshold": 0.6,
        "decay_rate_per_hour": 0.97,
        "lag_hours": (12, 36),
        "symptoms": ["bloating", "cramping"],
        "severity_range": (4, 9),
    },
    {
        "name": "high_fodmap",
        "triggers": ["fodmap"],
        "threshold": 0.4,
        "decay_rate_per_hour": 0.90,
        "lag_hours": (4, 12),
        "symptoms": ["bloating", "gas"],
        "severity_range": (3, 7),
    },
    {
        "name": "healthy",  # control group — no sensitivities
        "triggers": [],
        "threshold": None,
        "decay_rate_per_hour": None,
        "lag_hours": None,
        "symptoms": [],
        "severity_range": (1, 3),  # random low-severity noise only
    },
]
```

The healthy archetype is important: the algorithm shouldn't find fake correlations in users with no real sensitivities. Feel free to add your own archetypes.
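One way to produce the ground-truth file from these archetypes could be a simple weighted assignment — the even weights and `user_XXX` ID format here are arbitrary illustrative choices:

```python
import json
import random

# Sketch: assign each synthetic user an archetype and emit the ground-truth
# mapping. Tune the weights to control the healthy-control share.
ARCHETYPE_NAMES = ["gluten_sensitive", "high_fodmap", "healthy"]

def assign_archetypes(n_users, rng):
    return {
        f"user_{i:03d}": rng.choices(ARCHETYPE_NAMES, weights=[1, 1, 1])[0]
        for i in range(1, n_users + 1)
    }

rng = random.Random(0)
ground_truth = {"user_archetypes": assign_archetypes(10, rng)}
print(json.dumps(ground_truth, indent=2))
```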
Step 5 — Validate visually
- Visualizations of the data: make a variety of graphs to present to everyone, showing the data and why it's better than randomly generated data
Affected Areas
data_generator/ (new folder)
Dependencies
- None