
Crisis Response Data Pipeline

A production-ready pipeline for generating synthetic crisis scenario data for fine-tuning large language models. Each sample contains structured crisis response data from both civilian and first responder perspectives.

Quickstart

Get up and running in 3 steps:

# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure API keys
# Copy example.env to .env and add your API keys:
# cp example.env .env
# Then edit .env with your actual API keys

# 3. Generate your first samples
python main.py generate --n 10

That's it! Your data will be saved to data/dataset.jsonl. Convert to training format with:

python main.py convert --role both

Quality Checks

The pipeline includes three levels of quality assurance, letting you trade cost against data quality:

🟢 Level 1: Structure Validation (Always On - Free)

  • What it does: Validates JSON structure, required fields, and data types
  • Cost: $0 (included automatically)
  • How it works: Uses Pydantic validation + automatic retry on failures
  • Checks: JSON syntax, required fields exist, correct data types, value ranges

🟡 Level 2: Quality Check (Optional - ~$2-5 for 2000 samples)

  • What it does: Validates content quality using AI
  • Cost: ~$2-5 for 2000 samples
  • How it works: Uses a lightweight LLM (gpt-4o-mini) to review content
  • Checks:
    • ✅ Content relevance to the scenario
    • ✅ Role appropriateness (civilian vs first responder)
    • ✅ Meaningful, non-empty content
    • ✅ No duplicate entries
    • ✅ Score consistency (confidence/quality_score make sense)
  • Enable it:
    # Via CLI flag
    python main.py generate --n 100 --quality-check
    
    # Or via environment variable
    ENABLE_QUALITY_CHECK=true

🔴 Level 3: Full Critique (Optional - ~$870 for 2000 samples)

  • What it does: Fixes invalid JSON before validation
  • Cost: ~$870 for 2000 samples
  • How it works: Uses a powerful LLM to review and correct JSON structure
  • Best for: When you need guaranteed JSON validity regardless of cost
  • Enable it: Add ENABLE_CRITIQUE=true to your .env file

Recommendation: Start with Level 1 (default). Enable Level 2 if you want content quality validation without breaking the bank.

Features

  • Multi-Provider LLM Support: OpenAI, Anthropic, and Google Gemini
  • Structured Data Generation: Validated crisis scenarios with facts, uncertainties, analysis, and guidance
  • Quality Assurance Options: Multiple validation levels (structure-only, content quality check, or full critique)
  • Parallel Processing: Optimized for speed with concurrent role generation
  • Progress Tracking: Real-time progress bar with ETA
  • Resume Capability: Automatically resumes from last saved position
  • Training-Ready Output: Built-in conversion to fine-tuning formats

Use Cases

1. Fine-Tuning Crisis Response Models

Generate large-scale training datasets (1000+ samples) to fine-tune LLMs for crisis response applications. The pipeline creates structured, validated data with multiple perspectives (civilian and first responder roles) ready for fine-tuning.

Example Workflow:

# Generate 1000 samples
python generate_1k.py

# Or generate 2000 samples
python generate_2k.py

# Convert to training format
python main.py convert --input-file data/dataset_1k.jsonl --role both --output data/training_1k.jsonl
# Or for 2k dataset:
python main.py convert --input-file data/dataset_2k.jsonl --role both --output data/training_2k.jsonl

# Result: 2000 training examples (1k) or 4000 training examples (2k) ready for fine-tuning

2. Emergency Response Training Systems

Create synthetic crisis scenarios for training emergency response personnel or AI assistants. The structured format (facts, uncertainties, analysis, guidance) provides comprehensive training material.

Use Case:

  • Training chatbots for emergency services
  • Educational simulations for first responders
  • Decision-support systems for crisis management

3. Research and Development

Generate diverse crisis scenarios for research purposes:

  • Testing response strategies across different crisis categories
  • Analyzing decision-making patterns
  • Developing crisis management protocols

Categories Supported:

  • Common Day-to-Day Emergencies: Medical emergencies, structure fires, building collapses, gas leaks, motor vehicle accidents, power outages, hazardous conditions
  • Hydrological & Meteorological: Floods, hurricanes, tropical storms, tornadoes, severe storms, thunderstorms, drought, extreme heat, winter storms, snowstorms, ice storms, wildfires
  • Geological: Earthquakes, landslides, mudslides, volcanic eruptions, tsunamis
  • Technological/Industrial: Chemical spills, nuclear accidents, radiological incidents, industrial accidents, transportation accidents, dam failures, infrastructure failures
  • Biological: Infectious disease outbreaks, biological hazards, food contamination, water contamination
  • Societal: Conflicts, cybersecurity incidents, public health emergencies

All categories are configurable in config/categories.py and based on authoritative sources (FEMA, WHO, UNDRR, Red Cross).

4. Multi-Perspective Analysis

Each generated sample includes responses from both civilian and first responder perspectives, enabling:

  • Comparative analysis of different viewpoints
  • Training models to understand context-specific responses
  • Building systems that adapt to user roles

5. Custom Model Training

Fine-tune models for specific use cases:

  • Civilian-focused models: Train on civilian role responses for public-facing applications
  • Professional models: Train on first responder responses for emergency services
  • Dual-role models: Train on both roles for comprehensive crisis response systems

6. Data Augmentation

Augment existing crisis response datasets with synthetic data:

  • Increase dataset diversity
  • Fill gaps in underrepresented crisis categories
  • Generate edge cases and rare scenarios

7. Quality Assurance Testing

Use generated scenarios to test crisis response systems:

  • Validate system responses against structured ground truth
  • Test system robustness across different crisis types
  • Benchmark performance improvements

Setup

1. Installation

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate  # Windows
# source .venv/bin/activate  # Linux/Mac

# Install dependencies
pip install -r requirements.txt

2. Configuration

Copy the example environment file and configure your API keys:

# Copy example.env to .env
cp example.env .env
# On Windows: copy example.env .env

Then edit .env and add your API keys:

OPEN_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GEMINI_API_KEY=your_gemini_key_here

Required: at least one API key, depending on which providers you use. See example.env for all available options and detailed descriptions.

Optional settings (uncomment in .env to use):

  • MODEL_TEMPERATURE=0.7 - Control model randomness
  • MODEL_TIMEOUT=30 - API response timeout
  • ENABLE_CRITIQUE=true - Enable JSON fixing (~$870 for 2000 samples)
  • ENABLE_QUALITY_CHECK=true - Enable content validation (~$2-5 for 2000 samples)
  • PARALLEL_SAMPLES=2 - Generate multiple samples in parallel

3. Generate Data

# Generate 100 samples
python main.py generate --n 100

# Check progress
python main.py status

# Resume from last saved point
python main.py resume

Commands

Generate Samples

# Basic generation
python main.py generate --n 100

# With quality check enabled (validates content quality)
python main.py generate --n 100 --quality-check

Generates N crisis scenario samples with a progress bar.

Check Status

python main.py status

Shows current dataset count.

Resume Generation

# Basic resume
python main.py resume

# Resume with quality check enabled
python main.py resume --quality-check

Continues generating from the last saved cursor position.
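
Resume works off the cursor file at data/cursor.json. A minimal sketch of the idea; the real logic lives in pipeline/persistence.py, and the field name used here is illustrative:

import json
from pathlib import Path

CURSOR_PATH = Path("data/cursor.json")

def load_cursor() -> int:
    """Return the index of the next sample to generate (0 on a fresh run)."""
    if CURSOR_PATH.exists():
        return json.loads(CURSOR_PATH.read_text()).get("next_index", 0)
    return 0

def save_cursor(next_index: int) -> None:
    """Persist progress after each sample so an interrupted run can resume."""
    CURSOR_PATH.write_text(json.dumps({"next_index": next_index}))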

Convert for Training

# Convert to instruction format (default)
python main.py convert --role both

# Convert to conversational format (for chat models)
python main.py convert --format-type conversational --role both

# Convert to completion format (for base models)
python main.py convert --format-type completion --role both

# Convert single role
python main.py convert --role civilian

Model Configuration

The pipeline's default models are chosen for speed and cost:

  • Scenario Generation: gpt-4o-mini (OpenAI) - Fast, cost-effective
  • Response Generation: claude-3-5-haiku (Anthropic) - Fastest Claude, great for structured outputs
  • Quality Check: gpt-4o-mini (OpenAI) - DISABLED BY DEFAULT (optional content validation)
  • Critique/Validation: gemini-2.0-flash-exp (Google) - DISABLED BY DEFAULT (saves ~98% cost)

Quality Assurance Options

The pipeline offers three levels of quality assurance:

1. Structure Validation Only (Default - Free)

  • Cost: $0 extra (included in base generation)
  • What it checks: JSON structure, required fields, data types
  • How it works: Pydantic validation + retry mechanism
  • Best for: Cost-conscious users, when LLM JSON generation is reliable

2. Quality Check (Optional - ~$2-5 for 2000 samples)

  • Cost: ~$2-5 for 2000 samples
  • What it checks: Content relevance, role appropriateness, meaningful content, duplicates, score consistency
  • How it works: Lightweight LLM-based content validation using gpt-4o-mini
  • Best for: Balancing cost and quality assurance
  • Enable via:
    # CLI flag
    python main.py generate --n 100 --quality-check
    
    # Or environment variable
    ENABLE_QUALITY_CHECK=true

3. Full Critique (Optional - ~$870 for 2000 samples)

  • Cost: ~$870 for 2000 samples
  • What it checks: JSON structure + fixes invalid JSON before validation
  • How it works: LLM reviews and corrects JSON structure
  • Best for: When you need guaranteed JSON validity and cost is not a concern
  • Enable via:
    # Environment variable
    ENABLE_CRITIQUE=true

⚠️ Default Configuration

By default, both critique and quality check are DISABLED to save costs:

  • Without quality checks: ~$11.18 for 2000 samples
  • With quality check: ~$13-16 for 2000 samples
  • With critique: ~$881.18 for 2000 samples

Note: Without critique/quality check, the pipeline relies on:

  • Strict JSON prompts (already enforced)
  • extract_json() function (cleans code fences; a sketch follows this list)
  • Pydantic validator + retry mechanism (handles failures automatically)
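
A sketch of the fence-cleaning step; the repository's actual extract_json() may differ in detail:

import json
import re

def extract_json(text: str):
    """Strip the Markdown code fences LLMs often wrap around JSON, then parse."""
    # Match ```json ... ``` or bare ``` ... ``` wrappers; fall back to the raw text.
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)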

Models can be changed in config/settings.py:

scenario_provider: str = "openai"
response_provider: str = "anthropic"
quality_check_provider: str = "openai"  # Model for quality checks
critique_provider: str = "gemini"
enable_critique: bool = False  # Set to True to enable (costs ~$870 for 2K samples)
enable_quality_check: bool = False  # Set to True to enable (costs ~$2-5 for 2K samples)
parallel_samples: int = 1  # Number of samples to generate in parallel
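
As a rough illustration of what parallel_samples controls, samples can be generated concurrently with a thread pool. This is a sketch only; generate_sample is a hypothetical stand-in for the pipeline's real per-sample generation call:

from concurrent.futures import ThreadPoolExecutor

def generate_sample(index: int) -> dict:
    # Stand-in for one full scenario + responses generation round trip.
    return {"index": index}

def generate_batch(n: int, parallel_samples: int = 2) -> list[dict]:
    # Each worker runs one sample end-to-end; the I/O-bound LLM calls overlap.
    with ThreadPoolExecutor(max_workers=parallel_samples) as pool:
        return list(pool.map(generate_sample, range(n)))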

Data Format

Raw Dataset (data/dataset.jsonl)

Each line contains a complete crisis scenario:

{
  "category": "road accidents",
  "scenario": "A delivery van has collided with a school bus...",
  "responses": {
    "civilian": {
      "facts": ["..."],
      "uncertainties": ["..."],
      "analysis": ["..."],
      "guidance": ["..."],
      "confidence": 0.85,
      "quality_score": 0.88
    },
    "first responder": { ... }
  }
}
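
Because the file is JSONL, each line parses independently, which makes ad-hoc inspection easy:

import json

with open("data/dataset.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

# e.g. inspect the civilian guidance of the first sample
print(samples[0]["responses"]["civilian"]["guidance"])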

Training Formats

The pipeline supports three fine-tuning formats:

1. Instruction Format (Default)

Best for: Llama-2-Instruct, Mistral-Instruct, etc.

{
  "instruction": "You are a crisis response expert...",
  "output": "FACTS:\n  • ...",
  "category": "road accidents",
  "role": "civilian"
}

2. Conversational Format

Best for: GPT-3.5/4, Claude, etc.

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "category": "road accidents",
  "role": "civilian"
}

3. Completion Format

Best for: Base models (GPT-3 base, etc.)

{
  "prompt": "Category: road accidents\n\nScenario: ...",
  "completion": "FACTS:\n  • ...",
  "category": "road accidents",
  "role": "civilian"
}
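
To make the mapping concrete, here is a sketch of how one raw sample could be flattened into the instruction format. It mirrors the output shape shown above, not the pipeline's actual converter, and it serializes only the facts for brevity:

def to_instruction_example(sample: dict, role: str) -> dict:
    """Flatten one raw sample into the instruction format shown above."""
    resp = sample["responses"][role]
    output = "FACTS:\n" + "\n".join(f"  • {fact}" for fact in resp["facts"])
    # A full converter would also serialize uncertainties, analysis, and guidance.
    return {
        "instruction": f"You are a crisis response expert. Scenario: {sample['scenario']}",
        "output": output,
        "category": sample["category"],
        "role": role,
    }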

Production Workflow

1. Generate Large Dataset

# Generate 1000+ samples (recommended for fine-tuning)
python generate_1k.py

# Generate 1000+ samples with quality check
python generate_1k.py --quality-check

# Or generate 2000+ samples
python generate_2k.py

# Generate 2000+ samples with quality check
python generate_2k.py --quality-check

# Or use main.py directly
python main.py generate --n 1000

# With quality check
python main.py generate --n 1000 --quality-check

The pipeline will:

  • Show real-time progress with ETA
  • Automatically save after each sample
  • Resume from last position if interrupted

2. Convert to Training Format

# Convert all samples with both roles
python main.py convert --role both --output data/training_instruction.jsonl

This creates training-ready examples:

  • 1 sample = 2 training examples (one per role)
  • 1000 samples = 2000 training examples

3. Data Splits

Split your converted data (a minimal split script is sketched after the list):

  • Training: 80%
  • Validation: 10%
  • Test: 10%
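
A minimal sketch of the 80/10/10 split over a converted JSONL file; the output file names are illustrative, and the shuffle is something you do yourself, not a pipeline feature:

import json
import random

with open("data/training_instruction.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

random.seed(42)  # reproducible shuffle
random.shuffle(examples)

n = len(examples)
train = examples[: int(0.8 * n)]
val = examples[int(0.8 * n) : int(0.9 * n)]
test = examples[int(0.9 * n) :]

for name, split in (("train", train), ("val", val), ("test", test)):
    with open(f"data/{name}.jsonl", "w", encoding="utf-8") as out:
        out.writelines(json.dumps(ex) + "\n" for ex in split)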

4. Fine-Tuning

Use the converted format with:

  • OpenAI Fine-tuning API: Use conversational format
  • Anthropic Fine-tuning: Use conversational format
  • Hugging Face: Use instruction or conversational format
  • Local (Llama, Mistral): Use instruction format

Performance

  • Speed: ~9-13 seconds per sample (critique and quality check disabled, roles generated in parallel)
  • Throughput: ~250-400 samples/hour (depends on parallel_samples setting)
  • Optimizations:
    • Critique disabled by default (saves ~98% cost, ~$11 vs $881 for 2000 samples)
    • Quality check optional (adds ~$2-5 for 2000 samples if enabled)
    • Parallel role generation (2x speedup within each sample)
    • Parallel sample generation (configurable via PARALLEL_SAMPLES env var)
    • Optimized model selection (faster, cheaper models)
    • Reduced retry wait times
    • Progress tracking with ETA

Cost Comparison (2000 samples)

Configuration          Cost        Quality Assurance
Default (no checks)    ~$11.18     Structure validation only
With Quality Check     ~$13-16     Structure + content quality validation
With Critique          ~$881.18    Structure validation + JSON fixing

Project Structure

crisis_pipeline/
├── main.py                # CLI entry point
├── config/
│   ├── settings.py        # Configuration and API keys
│   └── categories.py      # Crisis categories
├── pipeline/
│   ├── generator.py       # Sample generation logic
│   ├── validator.py       # Data validation
│   ├── persistence.py     # File I/O and cursor tracking
│   ├── runner.py          # Main pipeline loop
│   └── models.py          # Pydantic data models
├── llm/
│   ├── client.py          # LLM provider abstraction
│   ├── chains.py          # LangChain prompt chains
│   └── prompts/           # Prompt templates
├── data/
│   ├── dataset.jsonl      # Generated samples
│   └── cursor.json        # Progress tracking
└── tests/                 # Unit tests

Testing

Run unit tests:

python run_tests.py

All tests use Python's built-in unittest (no external dependencies).

Troubleshooting

Proxy Issues

The pipeline automatically disables proxy settings. If you encounter connection issues, ensure your .env file has valid API keys.
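
One common way to neutralize proxy settings from inside Python, similar in spirit to what the pipeline does (the exact mechanism in this repository may differ):

import os

# Clear common proxy variables so HTTP clients talk to the APIs directly.
for var in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy", "ALL_PROXY"):
    os.environ.pop(var, None)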

Rate Limits

The pipeline includes:

  • Exponential backoff retry logic (sketched after this list)
  • Reduced retry wait times (0.5-5s)
  • Parallel processing to maximize throughput
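
A minimal version of the exponential-backoff pattern, with waits capped in the 0.5-5 s range mentioned above; illustrative only, the pipeline's own retry code may differ:

import time

def call_with_backoff(fn, max_attempts: int = 3):
    """Retry fn with exponentially growing waits, capped at 5 seconds."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(min(0.5 * (2 ** attempt), 5.0))  # 0.5s, 1s, 2s, ... capped at 5s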

Slow Generation

  • Check API key validity
  • Verify network connectivity
  • Consider using faster models (already optimized)
  • Reduce temperature for faster responses

Documentation

Additional detailed guides live in the docs/ folder; the quality assurance guide is summarized below.

Quality Assurance (Detailed)

The pipeline provides multiple quality assurance options to balance cost and data quality:

Structure Validation (Always Enabled)

  • Technology: Pydantic models + JSON parsing (a model sketch follows this list)
  • What it validates:
    • JSON syntax is valid
    • All required fields are present (facts, uncertainties, analysis, guidance, confidence, quality_score)
    • Data types are correct (lists are lists, numbers are numbers)
    • Value constraints (confidence/quality_score between 0-1, lists have minimum length)
  • Failure handling: Automatically retries up to 3 times if validation fails
  • Cost: Free (included in base generation)
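
A minimal sketch of what such a Pydantic model could look like, enforcing the constraints listed above. Field names follow the data format shown earlier; the repository's actual models live in pipeline/models.py:

from pydantic import BaseModel, Field

class RoleResponse(BaseModel):
    facts: list[str] = Field(min_length=1)        # lists must be non-empty
    uncertainties: list[str] = Field(min_length=1)
    analysis: list[str] = Field(min_length=1)
    guidance: list[str] = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)     # scores constrained to 0-1
    quality_score: float = Field(ge=0.0, le=1.0)

# RoleResponse.model_validate(parsed_json) raises on any structural problem,
# which is what triggers the automatic retry.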

Quality Check (Optional)

  • Technology: LLM-based content validation using gpt-4o-mini (a sketch follows this list)
  • What it validates:
    • Content Relevance: Does the response actually relate to the crisis scenario?
    • Role Appropriateness: Does a civilian response sound like a civilian? Does a first responder response sound professional?
    • Content Quality: Are the facts, uncertainties, analysis, and guidance meaningful and non-empty?
    • No Duplicates: Are there duplicate or near-duplicate entries in the lists?
    • Score Consistency: Do the confidence and quality_score values make sense given the content?
  • Failure handling: Samples that fail quality check are rejected and regenerated
  • Cost: ~$2-5 for 2000 samples (much cheaper than critique)
  • Enable: Use --quality-check flag or set ENABLE_QUALITY_CHECK=true in .env
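
A heavily hedged sketch of what a gpt-4o-mini review call can look like with the OpenAI Python SDK; the prompt and the PASS/FAIL protocol here are illustrative, not the pipeline's actual ones:

from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

def passes_quality_check(scenario: str, response_json: str) -> bool:
    """Ask a lightweight model for a verdict on one generated sample."""
    review = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You review synthetic crisis-response data. Reply PASS or FAIL."},
            {"role": "user", "content": f"Scenario:\n{scenario}\n\nResponse:\n{response_json}"},
        ],
    )
    return review.choices[0].message.content.strip().upper().startswith("PASS")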

Full Critique (Optional)

  • Technology: LLM-based JSON fixing using gemini-2.0-flash-exp
  • What it does: Reviews and fixes invalid JSON structure before validation
  • Failure handling: Corrects JSON issues automatically
  • Cost: ~$870 for 2000 samples (98% of total cost when enabled)
  • Enable: Set ENABLE_CRITIQUE=true in .env

When to Use Each Level

Use Case                              Recommended Level         Cost (2000 samples)
Budget-conscious, reliable LLM        Level 1 (Default)         ~$11.18
Balance cost and quality              Level 2 (Quality Check)   ~$13-16
Maximum quality, cost not a concern   Level 3 (Critique)        ~$881.18

Default recommendation: Start with Level 1. If you notice quality issues or need extra assurance, enable Level 2. Only use Level 3 if you have specific JSON reliability concerns and budget allows.

License

This project is licensed under the MIT License with an attribution requirement.

See LICENSE file for details.

Attribution Requirement: When using this software or derivative works, please include attribution to the original author.
