
[FEAT] - Data Generator #55

@Jrodrigo06

Description


name: Data Generator
about: Generate synthetic data to test our algorithms.
title: "[FEATURE] Synthetic Food & Symptom Log Generator"
labels: feature, backlog
assignees: @keshavgoel787 @TKDPenguin

Summary

Build a notebook that generates realistic synthetic food and symptom logs for testing our backend algorithms.

Motivation

We need realistic mock data before we have real users. Purely random data would give false confidence; we need logs with realistic behavioral patterns and noise so the algorithm has to actually work to find the signal.

Requirements

Acceptance Criteria

  • Notebook runs with configurable parameters (N users, N days, noise level, etc.)
  • Generates food_logs.json, symptom_logs.json, and ground_truth.json matching the schema below (output inside the notebook is acceptable, but JSON files would be nicer)
  • Supports multiple user archetypes including a healthy control group
  • Symptoms are delayed relative to food exposure (not immediate)
  • Exposure accumulates and decays over time
  • Data contains imperfect logging (missed meals, missed symptoms)

Out of Scope

  • Don't connect to or seed the real DB
  • No need for biological accuracy, just plausible structure

Technical Approach

Background

Watch first: https://www.youtube.com/watch?v=1GKtfgwf3ig (if doing markov chains)

The core idea is three layers:

  1. Eating behavior — users have routines and preferences, not random meals every day
  2. Latent exposure — trigger ingredients accumulate in the body and decay over time
  3. Symptom emission — symptoms fire probabilistically once exposure crosses a threshold, after a delay

Schema

Food log:

{
  "user_id": "user_001",
  "timestamp": "2024-01-03T12:35:00",
  "food_name": "pasta",
  "ingredients": ["wheat flour", "egg", "olive oil"],
  "tags": ["gluten"]
}

Symptom log:

{
  "user_id": "user_001",
  "timestamp": "2024-01-04T02:15:00",
  "symptom": "bloating",
  "severity": 7
}

Ground truth:

{
  "user_archetypes": {
    "user_001": "gluten_sensitive"
  },
  "archetype_definitions": { "..." }
}

Step 1 — Food generation (habit-biased weighted sampling)

Each user gets baseline food preference weights, slightly randomized per user. Meal timing is semi-regular (breakfast 7–10am, lunch 11am–2pm, dinner 6–9pm). If a user ate a food recently, slightly increase its probability of appearing again — this simulates routine without needing complex modeling.

Add logging dropout — ~25% of meals go unlogged, higher on weekends.

If you have extra time: look into Markov chains as an alternative approach to modeling food routines — instead of weight boosting, you model day-to-day cluster transitions (e.g. grain-heavy day → likely grain-heavy tomorrow). Good rabbit hole if you want to go deeper.
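The habit-biased sampling above can be sketched in a few lines. This is a minimal illustration, not the implementation: the food names, baseline weights, recency boost factor, and memory of 3 meals are all made-up assumptions.

```python
import random

# Illustrative food catalog and baseline preference weights (assumptions).
FOODS = {"pasta": 3.0, "salad": 2.0, "rice bowl": 2.0, "sandwich": 1.5, "tacos": 1.0}
RECENCY_BOOST = 1.3   # assumed bump for foods eaten in the last few meals
LOG_DROPOUT = 0.25    # ~25% of meals go unlogged, per the issue

def pick_meal(recent, rng):
    """Weighted sample over foods, biased toward recently eaten ones."""
    names = list(FOODS)
    weights = [FOODS[f] * (RECENCY_BOOST if f in recent else 1.0) for f in names]
    return rng.choices(names, weights=weights, k=1)[0]

def simulate_meals(n_meals, rng):
    recent, logged = [], []
    for _ in range(n_meals):
        meal = pick_meal(recent, rng)
        recent = (recent + [meal])[-3:]   # remember the last 3 meals (assumed window)
        if rng.random() > LOG_DROPOUT:    # dropout: some meals never get logged
            logged.append(meal)
    return logged

logged = simulate_meals(30, random.Random(42))
```

Weekend-specific dropout and per-user weight randomization would layer on top of this without changing the structure.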

Step 2 — Trigger exposure model

Each food has trigger flags. Maintain a running exposure score per trigger per user:

exposure[t] = previous_exposure * decay_rate + trigger_intensity

When a user eats a trigger food, the score goes up. Every hour that passes, it decays. Plot this for a single user to sanity check it looks right before running the full simulation.
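A sketch of that update rule, using the gluten_sensitive archetype's decay rate; the meal times and trigger intensity of 0.3 are invented for illustration.

```python
DECAY_RATE = 0.97  # per hour, from the gluten_sensitive archetype above

def update_exposure(exposure, hours_elapsed, trigger_intensity=0.0):
    """Decay the score for the elapsed hours, then add any new trigger intake."""
    return exposure * DECAY_RATE ** hours_elapsed + trigger_intensity

exposure, trace = 0.0, []
for hour in range(48):                              # hourly steps for one user
    intensity = 0.3 if hour in (0, 24) else 0.0    # trigger meals at hours 0 and 24
    exposure = update_exposure(exposure, 1, intensity)
    trace.append(exposure)
```

Plotting `trace` (e.g. with matplotlib) is the sanity check the step describes: you should see a spike at each trigger meal, exponential decay between them, and a higher second peak because the first dose hasn't fully decayed.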

Step 3 — Symptom generation

When exposure crosses a user's threshold, sample a lag window and fire a symptom event. Severity scales with how high the exposure got. Apply ~30% logging dropout to symptom events too.

Stretch goal: flare window — when a symptom fires, elevated probability of follow-on symptoms for the next 6–24hrs.
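A possible shape for the emission step, under stated assumptions: the symptom name, the linear severity-scaling rule, and the overshoot-based formula are all illustrative choices, not prescribed by the issue.

```python
import random

THRESHOLD = 0.6
LAG_HOURS = (12, 36)        # from the gluten_sensitive archetype above
SEVERITY_RANGE = (4, 9)
SYMPTOM_DROPOUT = 0.30      # ~30% of symptoms go unlogged

def maybe_fire_symptom(exposure, hour, rng):
    """If exposure crosses the threshold, schedule a delayed symptom event."""
    if exposure < THRESHOLD:
        return None
    lag = rng.uniform(*LAG_HOURS)              # sample a lag window
    lo, hi = SEVERITY_RANGE
    # assumed rule: severity scales with how far exposure overshot the threshold
    severity = min(hi, lo + round((exposure - THRESHOLD) * (hi - lo) / THRESHOLD))
    if rng.random() < SYMPTOM_DROPOUT:         # imperfect logging
        return None
    return {"hour": hour + lag, "symptom": "bloating", "severity": severity}
```

The flare-window stretch goal would wrap this with a temporary probability boost for 6–24 hours after any returned event.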

Step 4 — Archetypes + control group

ARCHETYPES = [
    {
        "name": "gluten_sensitive",
        "triggers": ["gluten"],
        "threshold": 0.6,
        "decay_rate_per_hour": 0.97,
        "lag_hours": (12, 36),
        "symptoms": ["bloating", "cramping"],
        "severity_range": (4, 9),
    },
    {
        "name": "high_fodmap",
        "triggers": ["fodmap"],
        "threshold": 0.4,
        "decay_rate_per_hour": 0.90,
        "lag_hours": (4, 12),
        "symptoms": ["bloating", "gas"],
        "severity_range": (3, 7),
    },
    {
        "name": "healthy",  # control group — no sensitivities
        "triggers": [],
        "threshold": None,
        "decay_rate_per_hour": None,
        "lag_hours": None,
        "symptoms": [],
        "severity_range": (1, 3),  # random low-severity noise only
    },
]

The healthy archetype is important: the algorithm shouldn't find fake correlations in users with no real sensitivities. Feel free to add your own archetypes.
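One way to wire the archetypes into ground_truth.json, with a reserved control share. The `control_fraction` value is an assumption, and `ARCHETYPES` here is an abbreviated copy of the full list above (only the names matter for assignment).

```python
import random

ARCHETYPES = [  # abbreviated; use the full list defined above
    {"name": "gluten_sensitive"},
    {"name": "high_fodmap"},
    {"name": "healthy"},
]

def assign_archetypes(n_users, archetypes, rng, control_fraction=0.3):
    """Assign each user an archetype, reserving ~control_fraction for controls."""
    healthy = [a for a in archetypes if a["name"] == "healthy"]
    sensitive = [a for a in archetypes if a["name"] != "healthy"]
    assignment = {}
    for i in range(n_users):
        pool = healthy if rng.random() < control_fraction else sensitive
        assignment[f"user_{i:03d}"] = rng.choice(pool)["name"]
    return assignment

# ground_truth.json payload matching the schema above
users = assign_archetypes(10, ARCHETYPES, random.Random(7))
ground_truth = {
    "user_archetypes": users,
    "archetype_definitions": {a["name"]: a for a in ARCHETYPES},
}
```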

Step 5 — Validate visually

  • Visualize the data (produce a variety of plots to present to the team, showing what the generated data looks like and why it beats randomly generated data)

Affected Areas

  • data_generator/ (new folder)

Dependencies

  • None
