
[FEAT] - Data Generator #55

@Jrodrigo06

Description


name: Data Generator
about: Generate synthetic data to test our algorithms.
title: "[FEATURE] Synthetic Food & Symptom Log Generator"
labels: feature, backlog
assignees: @keshavgoel787 @TKDPenguin

Summary

Build a notebook that generates realistic synthetic food and symptom logs for testing our backend algorithms.

Motivation

We need realistic mock data before we have real users. Purely random data would give false confidence; we need logs with realistic behavioral patterns and noise so the algorithm has to actually work to find the signal.

Requirements

Acceptance Criteria

  • Notebook runs with configurable parameters (N users, N days, noise level, etc.)
  • Generates food_logs.json, symptom_logs.json, and ground_truth.json matching the schema below (output inside the notebook is acceptable, but JSON files would be nicer)
  • Supports multiple user archetypes including a healthy control group
  • Symptoms are delayed relative to food exposure (not immediate)
  • Exposure accumulates and decays over time
  • Data contains imperfect logging (missed meals, missed symptoms)

Out of Scope

  • Don't connect to or seed the real DB
  • No need for biological accuracy, just plausible structure

Technical Approach

Background

Watch first: https://www.youtube.com/watch?v=1GKtfgwf3ig (if doing markov chains)

The core idea is three layers:

  1. Eating behavior — users have routines and preferences, not random meals every day
  2. Latent exposure — trigger ingredients accumulate in the body and decay over time
  3. Symptom emission — symptoms fire probabilistically once exposure crosses a threshold, after a delay

Schema

Food log:

{
  "user_id": "user_001",
  "timestamp": "2024-01-03T12:35:00",
  "food_name": "pasta",
  "ingredients": ["wheat flour", "egg", "olive oil"],
  "tags": ["gluten"]
}

Symptom log:

{
  "user_id": "user_001",
  "timestamp": "2024-01-04T02:15:00",
  "symptom": "bloating",
  "severity": 7
}

Ground truth:

{
  "user_archetypes": {
    "user_001": "gluten_sensitive"
  },
  "archetype_definitions": { "..." }
}

Step 1 — Food generation (habit-biased weighted sampling)

Each user gets baseline food preference weights, slightly randomized per user. Meal timing is semi-regular (breakfast 7–10am, lunch 11am–2pm, dinner 6–9pm). If a user ate a food recently, slightly increase its probability of appearing again — this simulates routine without needing complex modeling.

Add logging dropout — ~25% of meals go unlogged, higher on weekends.

If you have extra time: look into Markov chains as an alternative approach to modeling food routines — instead of weight boosting, you model day-to-day cluster transitions (e.g. grain-heavy day → likely grain-heavy tomorrow). Good rabbit hole if you want to go deeper.
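The habit-biased sampling above can be sketched in a few lines. This is a minimal illustration, not the implementation: the food names, baseline weights, recency boost factor, and memory of 3 meals are all made-up assumptions.

```python
import random

# Illustrative food catalog and baseline preference weights (assumptions).
FOODS = {"pasta": 3.0, "salad": 2.0, "rice bowl": 2.0, "sandwich": 1.5, "tacos": 1.0}
RECENCY_BOOST = 1.3   # assumed bump for foods eaten in the last few meals
LOG_DROPOUT = 0.25    # ~25% of meals go unlogged, per the issue

def pick_meal(recent, rng):
    """Weighted sample over foods, biased toward recently eaten ones."""
    names = list(FOODS)
    weights = [FOODS[f] * (RECENCY_BOOST if f in recent else 1.0) for f in names]
    return rng.choices(names, weights=weights, k=1)[0]

def simulate_meals(n_meals, rng):
    recent, logged = [], []
    for _ in range(n_meals):
        meal = pick_meal(recent, rng)
        recent = (recent + [meal])[-3:]   # remember the last 3 meals (assumed window)
        if rng.random() > LOG_DROPOUT:    # dropout: some meals never get logged
            logged.append(meal)
    return logged

logged = simulate_meals(30, random.Random(42))
```

Weekend-specific dropout and per-user weight randomization would layer on top of this without changing the structure.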

Step 2 — Trigger exposure model

Each food has trigger flags. Maintain a running exposure score per trigger per user:

exposure[t] = previous_exposure * decay_rate + trigger_intensity

When a user eats a trigger food, the score goes up. Every hour that passes, it decays. Plot this for a single user to sanity check it looks right before running the full simulation.
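A sketch of that update rule, using the gluten_sensitive archetype's decay rate; the meal times and trigger intensity of 0.3 are invented for illustration.

```python
DECAY_RATE = 0.97  # per hour, from the gluten_sensitive archetype above

def update_exposure(exposure, hours_elapsed, trigger_intensity=0.0):
    """Decay the score for the elapsed hours, then add any new trigger intake."""
    return exposure * DECAY_RATE ** hours_elapsed + trigger_intensity

exposure, trace = 0.0, []
for hour in range(48):                              # hourly steps for one user
    intensity = 0.3 if hour in (0, 24) else 0.0    # trigger meals at hours 0 and 24
    exposure = update_exposure(exposure, 1, intensity)
    trace.append(exposure)
```

Plotting `trace` (e.g. with matplotlib) is the sanity check the step describes: you should see a spike at each trigger meal, exponential decay between them, and a higher second peak because the first dose hasn't fully decayed.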

Step 3 — Symptom generation

When exposure crosses a user's threshold, sample a lag window and fire a symptom event. Severity scales with how high the exposure got. Apply ~30% logging dropout to symptom events too.

Stretch goal: flare window — when a symptom fires, elevated probability of follow-on symptoms for the next 6–24hrs.
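A possible shape for the emission step, under stated assumptions: the symptom name, the linear severity-scaling rule, and the overshoot-based formula are all illustrative choices, not prescribed by the issue.

```python
import random

THRESHOLD = 0.6
LAG_HOURS = (12, 36)        # from the gluten_sensitive archetype above
SEVERITY_RANGE = (4, 9)
SYMPTOM_DROPOUT = 0.30      # ~30% of symptoms go unlogged

def maybe_fire_symptom(exposure, hour, rng):
    """If exposure crosses the threshold, schedule a delayed symptom event."""
    if exposure < THRESHOLD:
        return None
    lag = rng.uniform(*LAG_HOURS)              # sample a lag window
    lo, hi = SEVERITY_RANGE
    # assumed rule: severity scales with how far exposure overshot the threshold
    severity = min(hi, lo + round((exposure - THRESHOLD) * (hi - lo) / THRESHOLD))
    if rng.random() < SYMPTOM_DROPOUT:         # imperfect logging
        return None
    return {"hour": hour + lag, "symptom": "bloating", "severity": severity}
```

The flare-window stretch goal would wrap this with a temporary probability boost for 6–24 hours after any returned event.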

Step 4 — Archetypes + control group

ARCHETYPES = [
    {
        "name": "gluten_sensitive",
        "triggers": ["gluten"],
        "threshold": 0.6,
        "decay_rate_per_hour": 0.97,
        "lag_hours": (12, 36),
        "symptoms": ["bloating", "cramping"],
        "severity_range": (4, 9),
    },
    {
        "name": "high_fodmap",
        "triggers": ["fodmap"],
        "threshold": 0.4,
        "decay_rate_per_hour": 0.90,
        "lag_hours": (4, 12),
        "symptoms": ["bloating", "gas"],
        "severity_range": (3, 7),
    },
    {
        "name": "healthy",  # control group — no sensitivities
        "triggers": [],
        "threshold": None,
        "decay_rate_per_hour": None,
        "lag_hours": None,
        "symptoms": [],
        "severity_range": (1, 3),  # random low-severity noise only
    },
]

The healthy archetype is important: the algorithm shouldn't find fake correlations in users with no real sensitivities. Feel free to add your own archetypes.
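One way to wire the archetypes into ground_truth.json, with a reserved control share. The `control_fraction` value is an assumption, and `ARCHETYPES` here is an abbreviated copy of the full list above (only the names matter for assignment).

```python
import random

ARCHETYPES = [  # abbreviated; use the full list defined above
    {"name": "gluten_sensitive"},
    {"name": "high_fodmap"},
    {"name": "healthy"},
]

def assign_archetypes(n_users, archetypes, rng, control_fraction=0.3):
    """Assign each user an archetype, reserving ~control_fraction for controls."""
    healthy = [a for a in archetypes if a["name"] == "healthy"]
    sensitive = [a for a in archetypes if a["name"] != "healthy"]
    assignment = {}
    for i in range(n_users):
        pool = healthy if rng.random() < control_fraction else sensitive
        assignment[f"user_{i:03d}"] = rng.choice(pool)["name"]
    return assignment

# ground_truth.json payload matching the schema above
users = assign_archetypes(10, ARCHETYPES, random.Random(7))
ground_truth = {
    "user_archetypes": users,
    "archetype_definitions": {a["name"]: a for a in ARCHETYPES},
}
```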

Step 5 — Validate visually

  • Visualize the data (produce a variety of plots to present to the team, showing what the generated data looks like and why it beats randomly generated data)

Affected Areas

  • data_generator/ (new folder)

Dependencies

  • None
