A production-ready pipeline for generating synthetic crisis scenario data for fine-tuning large language models. Generates structured crisis response data with multiple perspectives (civilian and first responder roles).
Get up and running in 3 steps:
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure API keys
# Copy example.env to .env and add your API keys:
# cp example.env .env
# Then edit .env with your actual API keys

# 3. Generate your first samples
python main.py generate --n 10
```

That's it! Your data will be saved to `data/dataset.jsonl`. Convert to training format with:

```bash
python main.py convert --role both
```

The pipeline includes three levels of quality assurance:
**Level 1: Structure Validation (default)**

- What it does: Validates JSON structure, required fields, and data types
- Cost: $0 (included automatically)
- How it works: Uses Pydantic validation + automatic retry on failures
- Checks: JSON syntax, required fields exist, correct data types, value ranges
**Level 2: Content Quality Check (optional)**

- What it does: Validates content quality using AI
- Cost: ~$2-5 for 2000 samples
- How it works: Uses a lightweight LLM (`gpt-4o-mini`) to review content
- Checks:
- ✅ Content relevance to the scenario
- ✅ Role appropriateness (civilian vs first responder)
- ✅ Meaningful, non-empty content
- ✅ No duplicate entries
- ✅ Score consistency (confidence/quality_score make sense)
- Enable it:

```bash
# Via CLI flag
python main.py generate --n 100 --quality-check

# Or via environment variable
ENABLE_QUALITY_CHECK=true
```
**Level 3: Critique (optional)**

- What it does: Fixes invalid JSON before validation
- Cost: ~$870 for 2000 samples
- How it works: Uses a powerful LLM to review and correct JSON structure
- Best for: When you need guaranteed JSON validity regardless of cost
- Enable it: Add `ENABLE_CRITIQUE=true` to your `.env` file
Recommendation: Start with Level 1 (default). Enable Level 2 if you want content quality validation without breaking the bank.
- Multi-Provider LLM Support: OpenAI, Anthropic, and Google Gemini
- Structured Data Generation: Validated crisis scenarios with facts, uncertainties, analysis, and guidance
- Quality Assurance Options: Multiple validation levels (structure-only, content quality check, or full critique)
- Parallel Processing: Optimized for speed with concurrent role generation
- Progress Tracking: Real-time progress bar with ETA
- Resume Capability: Automatically resumes from last saved position
- Training-Ready Output: Built-in conversion to fine-tuning formats
Generate large-scale training datasets (1000+ samples) to fine-tune LLMs for crisis response applications. The pipeline creates structured, validated data with multiple perspectives (civilian and first responder roles) ready for fine-tuning.
Example Workflow:

```bash
# Generate 1000 samples
python generate_1k.py

# Or generate 2000 samples
python generate_2k.py

# Convert to training format
python main.py convert --input-file data/dataset_1k.jsonl --role both --output data/training_1k.jsonl

# Or for 2k dataset:
python main.py convert --input-file data/dataset_2k.jsonl --role both --output data/training_2k.jsonl

# Result: 2000 training examples (1k) or 4000 training examples (2k) ready for fine-tuning
```

Create synthetic crisis scenarios for training emergency response personnel or AI assistants. The structured format (facts, uncertainties, analysis, guidance) provides comprehensive training material.
Use Cases:
- Training chatbots for emergency services
- Educational simulations for first responders
- Decision-support systems for crisis management
Generate diverse crisis scenarios for research purposes:
- Testing response strategies across different crisis categories
- Analyzing decision-making patterns
- Developing crisis management protocols
Categories Supported:
- Common Day-to-Day Emergencies: Medical emergencies, structure fires, building collapses, gas leaks, motor vehicle accidents, power outages, hazardous conditions
- Hydrological & Meteorological: Floods, hurricanes, tropical storms, tornadoes, severe storms, thunderstorms, drought, extreme heat, winter storms, snowstorms, ice storms, wildfires
- Geological: Earthquakes, landslides, mudslides, volcanic eruptions, tsunamis
- Technological/Industrial: Chemical spills, nuclear accidents, radiological incidents, industrial accidents, transportation accidents, dam failures, infrastructure failures
- Biological: Infectious disease outbreaks, biological hazards, food contamination, water contamination
- Societal: Conflicts, cybersecurity incidents, public health emergencies
All categories are configurable in `config/categories.py` and are based on authoritative sources (FEMA, WHO, UNDRR, Red Cross).
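As a rough illustration, `config/categories.py` might map category groups to scenario types along these lines (a hypothetical sketch; the actual structure in the repo may differ):

```python
# Hypothetical sketch of config/categories.py; the real module may differ.
CATEGORIES: dict[str, list[str]] = {
    "Common Day-to-Day Emergencies": [
        "medical emergency", "structure fire", "gas leak", "motor vehicle accident",
    ],
    "Hydrological & Meteorological": ["flood", "hurricane", "tornado", "wildfire"],
    "Geological": ["earthquake", "landslide", "volcanic eruption", "tsunami"],
    "Technological/Industrial": ["chemical spill", "dam failure", "infrastructure failure"],
    "Biological": ["infectious disease outbreak", "water contamination"],
    "Societal": ["conflict", "cybersecurity incident", "public health emergency"],
}
```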
Each generated sample includes responses from both civilian and first responder perspectives, enabling:
- Comparative analysis of different viewpoints
- Training models to understand context-specific responses
- Building systems that adapt to user roles
Fine-tune models for specific use cases:
- Civilian-focused models: Train on civilian role responses for public-facing applications
- Professional models: Train on first responder responses for emergency services
- Dual-role models: Train on both roles for comprehensive crisis response systems
Augment existing crisis response datasets with synthetic data:
- Increase dataset diversity
- Fill gaps in underrepresented crisis categories
- Generate edge cases and rare scenarios
Use generated scenarios to test crisis response systems:
- Validate system responses against structured ground truth
- Test system robustness across different crisis types
- Benchmark performance improvements
```bash
# Create virtual environment
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/Mac

# Install dependencies
pip install -r requirements.txt
```

Copy the example environment file and configure your API keys:
```bash
# Copy example.env to .env
cp example.env .env
# On Windows: copy example.env .env
```

Then edit `.env` and add your API keys:
```bash
OPEN_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GEMINI_API_KEY=your_gemini_key_here
```

Required: At least one API key, depending on which providers you use. See `example.env` for all available options and detailed descriptions.
Optional settings (uncomment in `.env` to use):

- `MODEL_TEMPERATURE=0.7` - Control model randomness
- `MODEL_TIMEOUT=30` - API response timeout
- `ENABLE_CRITIQUE=true` - Enable JSON fixing (~$870 for 2000 samples)
- `ENABLE_QUALITY_CHECK=true` - Enable content validation (~$2-5 for 2000 samples)
- `PARALLEL_SAMPLES=2` - Generate multiple samples in parallel
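These values are read from `.env` at startup; a minimal sketch of how such settings are typically loaded with `python-dotenv` (illustrative, not necessarily the repo's exact code):

```python
# Illustrative only: how .env-driven settings are commonly loaded.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

temperature = float(os.getenv("MODEL_TEMPERATURE", "0.7"))
enable_quality_check = os.getenv("ENABLE_QUALITY_CHECK", "false").lower() == "true"
parallel_samples = int(os.getenv("PARALLEL_SAMPLES", "1"))
```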
```bash
# Generate 100 samples
python main.py generate --n 100

# Check progress
python main.py status

# Resume from last saved point
python main.py resume
```

```bash
# Basic generation
python main.py generate --n 100

# With quality check enabled (validates content quality)
python main.py generate --n 100 --quality-check
```

Generates N crisis scenario samples with a progress bar.
```bash
python main.py status
```

Shows the current dataset count.
```bash
# Basic resume
python main.py resume

# Resume with quality check enabled
python main.py resume --quality-check
```

Continues generating from the last saved cursor position.
```bash
# Convert to instruction format (default)
python main.py convert --role both

# Convert to conversational format (for chat models)
python main.py convert --format-type conversational --role both

# Convert to completion format (for base models)
python main.py convert --format-type completion --role both

# Convert single role
python main.py convert --role civilian
```

The pipeline uses optimized models for speed and cost:
- Scenario Generation: `gpt-4o-mini` (OpenAI) - Fast, cost-effective
- Response Generation: `claude-3-5-haiku` (Anthropic) - Fastest Claude, great for structured outputs
- Quality Check: `gpt-4o-mini` (OpenAI) - DISABLED BY DEFAULT (optional content validation)
- Critique/Validation: `gemini-2.0-flash-exp` (Google) - DISABLED BY DEFAULT (saves ~98% cost)
The pipeline offers three levels of quality assurance:
**Level 1: Structure Validation (default)**

- Cost: $0 extra (included in base generation)
- What it checks: JSON structure, required fields, data types
- How it works: Pydantic validation + retry mechanism
- Best for: Cost-conscious users, when LLM JSON generation is reliable
**Level 2: Content Quality Check (optional)**

- Cost: ~$2-5 for 2000 samples
- What it checks: Content relevance, role appropriateness, meaningful content, duplicates, score consistency
- How it works: Lightweight LLM-based content validation using `gpt-4o-mini`
- Best for: Balancing cost and quality assurance
- Enable via:

```bash
# CLI flag
python main.py generate --n 100 --quality-check

# Or environment variable
ENABLE_QUALITY_CHECK=true
```
**Level 3: Critique (optional)**

- Cost: ~$870 for 2000 samples
- What it checks: JSON structure + fixes invalid JSON before validation
- How it works: LLM reviews and corrects JSON structure
- Best for: When you need guaranteed JSON validity and cost is not a concern
- Enable via:

```bash
# Environment variable
ENABLE_CRITIQUE=true
```
By default, both critique and quality check are DISABLED to save costs:
- Without quality checks: ~$11.18 for 2000 samples
- With quality check: ~$13-16 for 2000 samples
- With critique: ~$881.18 for 2000 samples
Note: Without critique/quality check, the pipeline relies on:
- Strict JSON prompts (already enforced)
- `extract_json()` function (cleans code fences)
- Pydantic validator + retry mechanism (handles failures automatically)
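For illustration, a helper in the spirit of `extract_json()` might look like this (a sketch; the repo's actual implementation may differ):

```python
# Illustrative sketch of a fence-stripping JSON extractor; see the repo's
# extract_json() for the real implementation.
import json
import re

def extract_json(text: str) -> dict:
    """Strip Markdown code fences the model may have added, then parse JSON."""
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)\s*`{3}", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)
```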
Models can be changed in `config/settings.py`:

```python
scenario_provider: str = "openai"
response_provider: str = "anthropic"
quality_check_provider: str = "openai"  # Model for quality checks
critique_provider: str = "gemini"
enable_critique: bool = False       # Set to True to enable (costs ~$870 for 2K samples)
enable_quality_check: bool = False  # Set to True to enable (costs ~$2-5 for 2K samples)
parallel_samples: int = 1           # Number of samples to generate in parallel
```

Each line of `data/dataset.jsonl` contains a complete crisis scenario:
```json
{
  "category": "road accidents",
  "scenario": "A delivery van has collided with a school bus...",
  "responses": {
    "civilian": {
      "facts": ["..."],
      "uncertainties": ["..."],
      "analysis": ["..."],
      "guidance": ["..."],
      "confidence": 0.85,
      "quality_score": 0.88
    },
    "first responder": { ... }
  }
}
```

The pipeline supports three fine-tuning formats:
**Instruction Format**

Best for: Llama-2-Instruct, Mistral-Instruct, etc.
```json
{
  "instruction": "You are a crisis response expert...",
  "output": "FACTS:\n • ...",
  "category": "road accidents",
  "role": "civilian"
}
```

**Conversational Format**

Best for: GPT-3.5/4, Claude, etc.
```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "category": "road accidents",
  "role": "civilian"
}
```

**Completion Format**

Best for: Base models (GPT-3 base, etc.)
```json
{
  "prompt": "Category: road accidents\n\nScenario: ...",
  "completion": "FACTS:\n • ...",
  "category": "road accidents",
  "role": "civilian"
}
```

```bash
# Generate 1000+ samples (recommended for fine-tuning)
python generate_1k.py
# Generate 1000+ samples with quality check
python generate_1k.py --quality-check
# Or generate 2000+ samples
python generate_2k.py
# Generate 2000+ samples with quality check
python generate_2k.py --quality-check
# Or use main.py directly
python main.py generate --n 1000
# With quality check
python main.py generate --n 1000 --quality-check
```

The pipeline will:
- Show real-time progress with ETA
- Automatically save after each sample
- Resume from last position if interrupted
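Resumption is driven by the cursor file. Conceptually (a hypothetical sketch; the real logic lives in `pipeline/persistence.py` and may differ), it works like this:

```python
# Hypothetical sketch of cursor-based resume; see pipeline/persistence.py
# for the actual implementation.
import json
from pathlib import Path

CURSOR = Path("data/cursor.json")
DATASET = Path("data/dataset.jsonl")

def load_cursor() -> int:
    """Return the index of the next sample to generate (0 on a fresh run)."""
    if CURSOR.exists():
        return json.loads(CURSOR.read_text())["next_index"]
    return 0

def save_sample(index: int, sample: dict) -> None:
    """Append one sample to the dataset, then advance the cursor."""
    with DATASET.open("a", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
    CURSOR.write_text(json.dumps({"next_index": index + 1}))
```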
```bash
# Convert all samples with both roles
python main.py convert --role both --output data/training_instruction.jsonl
```

This creates training-ready examples:
- 1 sample = 2 training examples (one per role)
- 1000 samples = 2000 training examples
Split your converted data:
- Training: 80%
- Validation: 10%
- Test: 10%
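A minimal way to produce such a split (illustrative; assumes the converted file from the previous step):

```python
# Illustrative 80/10/10 split of a converted training file.
import json
import random

with open("data/training_instruction.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

random.seed(42)  # reproducible shuffle
random.shuffle(examples)

n = len(examples)
splits = {
    "train": examples[: int(0.8 * n)],
    "valid": examples[int(0.8 * n) : int(0.9 * n)],
    "test": examples[int(0.9 * n) :],
}
for name, rows in splits.items():
    with open(f"data/{name}.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```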
Use the converted format with:
- OpenAI Fine-tuning API: Use conversational format
- Anthropic Fine-tuning: Use conversational format
- Hugging Face: Use instruction or conversational format
- Local (Llama, Mistral): Use instruction format
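For example, the converted instruction-format files can be loaded with the Hugging Face `datasets` library (assuming the split files produced above):

```python
# Load converted JSONL splits with Hugging Face datasets (illustrative).
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files={"train": "data/train.jsonl", "validation": "data/valid.jsonl"},
)
print(dataset["train"][0]["instruction"])
```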
- Speed: ~9-13 seconds per sample (with critique/quality check disabled, parallel role generation)
- Throughput: ~250-400 samples/hour (depends on parallel_samples setting)
- Optimizations:
- Critique disabled by default (saves ~98% cost, ~$11 vs $881 for 2000 samples)
- Quality check optional (adds ~$2-5 for 2000 samples if enabled)
- Parallel role generation (2x speedup within each sample; see the sketch after this list)
- Parallel sample generation (configurable via `PARALLEL_SAMPLES` env var)
- Optimized model selection (faster, cheaper models)
- Reduced retry wait times
- Progress tracking with ETA
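The parallel role generation mentioned above can be pictured like this (hypothetical function names; the actual code lives in `pipeline/generator.py`):

```python
# Hypothetical sketch: produce both role responses for one scenario concurrently.
from concurrent.futures import ThreadPoolExecutor

def generate_response(scenario: str, role: str) -> dict:
    """Placeholder for the LLM call that produces one role's response."""
    raise NotImplementedError

def generate_both_roles(scenario: str) -> dict:
    with ThreadPoolExecutor(max_workers=2) as pool:
        civilian = pool.submit(generate_response, scenario, "civilian")
        responder = pool.submit(generate_response, scenario, "first responder")
        return {"civilian": civilian.result(), "first responder": responder.result()}
```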
| Configuration | Cost | Quality Assurance |
|---|---|---|
| Default (no checks) | ~$11.18 | Structure validation only |
| With Quality Check | ~$13-16 | Structure + content quality validation |
| With Critique | ~$881.18 | Structure validation + JSON fixing |
```
crisis_pipeline/
├── main.py # CLI entry point
├── config/
│ ├── settings.py # Configuration and API keys
│ └── categories.py # Crisis categories
├── pipeline/
│ ├── generator.py # Sample generation logic
│ ├── validator.py # Data validation
│ ├── persistence.py # File I/O and cursor tracking
│ ├── runner.py # Main pipeline loop
│ └── models.py # Pydantic data models
├── llm/
│ ├── client.py # LLM provider abstraction
│ ├── chains.py # LangChain prompt chains
│ └── prompts/ # Prompt templates
├── data/
│ ├── dataset.jsonl # Generated samples
│ └── cursor.json # Progress tracking
└── tests/                    # Unit tests
```
Run unit tests:
```bash
python run_tests.py
```

All tests use Python's built-in unittest (no external dependencies).
The pipeline automatically disables proxy settings. If you encounter connection issues, ensure your `.env` file has valid API keys.
The pipeline includes:
- Exponential backoff retry logic
- Reduced retry wait times (0.5-5s)
- Parallel processing to maximize throughput
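The backoff behavior can be pictured with the `tenacity` library (a sketch under the assumption that tenacity or similar is used; the wait bounds mirror the 0.5-5s range above):

```python
# Sketch of exponential-backoff retries in the 0.5-5s range using tenacity.
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    wait=wait_exponential(multiplier=0.5, min=0.5, max=5),  # 0.5s, 1s, 2s, ... capped at 5s
    stop=stop_after_attempt(3),
)
def call_llm(prompt: str) -> str:
    """Placeholder for a provider API call that may transiently fail."""
    raise NotImplementedError
```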
- Check API key validity
- Verify network connectivity
- Consider using faster models (already optimized)
- Reduce temperature for faster responses
Additional detailed guides are available in the docs/ folder:
- Command Prompt Guide - How to run commands in Windows Command Prompt
- Generate 1000+ Samples Guide - Complete guide for generating 1000 sample datasets
- Generate 2000+ Samples Guide - Complete guide for generating 2000 sample datasets
- Cost Estimate Guide - Detailed API cost breakdown for each provider
- Critique Disabled Notice - Why critique is disabled by default and how to enable it
The pipeline provides multiple quality assurance options to balance cost and data quality:
**Level 1: Structure Validation (default)**

- Technology: Pydantic models + JSON parsing
- What it validates:
  - JSON syntax is valid
  - All required fields are present (`facts`, `uncertainties`, `analysis`, `guidance`, `confidence`, `quality_score`)
  - Data types are correct (lists are lists, numbers are numbers)
  - Value constraints (confidence/quality_score between 0-1, lists have minimum length)
- Failure handling: Automatically retries up to 3 times if validation fails
- Cost: Free (included in base generation)
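A minimal sketch of what such a Pydantic model might look like (field names from the list above; the pipeline's actual models live in `pipeline/models.py` and may differ):

```python
# Minimal sketch of the validated response shape; see pipeline/models.py
# for the pipeline's actual Pydantic models.
from pydantic import BaseModel, Field

class RoleResponse(BaseModel):
    facts: list[str] = Field(min_length=1)           # lists must be non-empty
    uncertainties: list[str] = Field(min_length=1)
    analysis: list[str] = Field(min_length=1)
    guidance: list[str] = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)        # scores constrained to 0-1
    quality_score: float = Field(ge=0.0, le=1.0)
```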
**Level 2: Content Quality Check (optional)**

- Technology: LLM-based content validation using `gpt-4o-mini`
- What it validates:
  - Content Relevance: Does the response actually relate to the crisis scenario?
  - Role Appropriateness: Does a civilian response sound like a civilian? Does a first responder response sound professional?
  - Content Quality: Are the facts, uncertainties, analysis, and guidance meaningful and non-empty?
  - No Duplicates: Are there duplicate or near-duplicate entries in the lists?
  - Score Consistency: Do the confidence and quality_score values make sense given the content?
- Failure handling: Samples that fail the quality check are rejected and regenerated
- Cost: ~$2-5 for 2000 samples (much cheaper than critique)
- Enable: Use the `--quality-check` flag or set `ENABLE_QUALITY_CHECK=true` in `.env`
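As a rough sketch of how such an LLM-based check could be phrased (illustrative only; the pipeline's actual prompt and client code may differ):

```python
# Illustrative sketch of an LLM-based quality check; not the pipeline's exact code.
import json
from openai import OpenAI

client = OpenAI()  # picks up the OpenAI key from the environment

def quality_check(scenario: str, role: str, response: dict) -> bool:
    prompt = (
        "You are reviewing synthetic crisis-response data.\n"
        f"Scenario: {scenario}\nRole: {role}\nResponse: {json.dumps(response)}\n"
        "Check relevance, role appropriateness, non-empty content, duplicates, "
        'and score consistency. Answer with bare JSON: {"pass": true} or {"pass": false}.'
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON as instructed.
    return json.loads(result.choices[0].message.content)["pass"]
```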
**Level 3: Critique (optional)**

- Technology: LLM-based JSON fixing using `gemini-2.0-flash-exp`
- What it does: Reviews and fixes invalid JSON structure before validation
- Failure handling: Corrects JSON issues automatically
- Cost: ~$870 for 2000 samples (98% of total cost when enabled)
- Enable: Set `ENABLE_CRITIQUE=true` in `.env`
| Use Case | Recommended Level | Cost (2000 samples) |
|---|---|---|
| Budget-conscious, reliable LLM | Level 1 (Default) | ~$11.18 |
| Balance cost and quality | Level 2 (Quality Check) | ~$13-16 |
| Maximum quality, cost not a concern | Level 3 (Critique) | ~$881.18 |
Default recommendation: Start with Level 1. If you notice quality issues or need extra assurance, enable Level 2. Only use Level 3 if you have specific JSON reliability concerns and budget allows.
This project is licensed under the MIT License with Attribution requirement.
See LICENSE file for details.
Attribution Requirement: When using this software or derivative works, please include attribution to the original author.