A reproducible framework combining DSPy agentic workflows with GEPA (Genetic-Pareto) optimization for automated BI tasks: NL→SQL synthesis, KPI compilation, and executive insights.
- Overview
- Key Features
- Why This Framework?
- Quick Start
- Architecture
- Experiments & Results
- Usage
- API Documentation
- Contributing
This project demonstrates a production-ready BI pipeline that combines:
- DSPy Framework: Modular LLM programs for SQL generation, KPI compilation, and narrative insights
- GEPA Optimizer: Genetic algorithm + Pareto optimization for multi-objective prompt tuning
- Real BI Warehouse: DuckDB-based data warehouse with 10K orders, 5 data marts
- Interactive Dashboard: Streamlit app for natural language BI queries
Multi-objective optimization - Unlike traditional prompt engineering that optimizes for accuracy alone, GEPA simultaneously optimizes for:
- ✅ Accuracy (SQL correctness)
- ✅ Cost (token efficiency)
- ✅ Latency (response time)
- ✅ Test Pass Rate (reliability)
Result: Pareto-optimal prompts that balance quality and efficiency!
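To make "Pareto-optimal" concrete, here is a minimal sketch of a dominance check across mixed maximize/minimize objectives. This is illustrative only; the objective names mirror the list above, not the optimizer's internal API:

```python
# Illustrative sketch -- objective names follow the list above, not the project's internals.
OBJECTIVES = {"accuracy": "maximize", "tests": "maximize", "cost": "minimize", "latency": "minimize"}

def dominates(a: dict, b: dict) -> bool:
    """True if candidate `a` is at least as good as `b` on every objective and strictly better on one."""
    strictly_better = False
    for name, direction in OBJECTIVES.items():
        better = a[name] > b[name] if direction == "maximize" else a[name] < b[name]
        if not better and a[name] != b[name]:
            return False              # worse on this objective, so `a` cannot dominate
        strictly_better = strictly_better or better
    return strictly_better

def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates if o is not c)]
```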
- DuckDB Warehouse: 10,000 synthetic orders with realistic schema
- 5 Data Marts: Fact tables, dimensions, aggregations
- Parquet Exports: Ready for Power BI, Tableau, Looker
- ETL Pipeline: Full extract-transform-load workflow
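As a rough illustration of how the warehouse can be queried and exported after the ETL step, here is a short DuckDB sketch; the table and column names (`fact_orders`, `region`, `revenue`) are assumptions and may differ from the generated schema:

```python
import duckdb

# Open the warehouse produced by the ETL step (see Quick Start and the project layout below).
con = duckdb.connect("data/warehouse/bi.duckdb")

# Example aggregation over an assumed fact-table name -- adjust to the generated schema.
df = con.execute("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM fact_orders
    GROUP BY region
    ORDER BY total_revenue DESC
""").df()
print(df.head())

# Export a mart to Parquet for Power BI / Tableau / Looker.
con.execute("COPY fact_orders TO 'data/warehouse/fact_orders.parquet' (FORMAT PARQUET)")
```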
- SQL Synthesizer: Natural language → SQL with chain-of-thought
- KPI Compiler: NL descriptions → DAX/SQL measures
- Insight Writer: Automated executive summaries
- Auto-Graders: Quality validation and scoring
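Each of these programs follows the usual DSPy pattern of a typed signature wrapped in a module. A minimal sketch of the SQL synthesizer's shape, under that assumption (field names and the `Sketch` class are illustrative, not the repo's exact definitions):

```python
import dspy

class NL2SQL(dspy.Signature):
    """Translate a natural-language question into an executable SQL query for the given schema."""
    task = dspy.InputField(desc="natural-language question")
    schema = dspy.InputField(desc="warehouse schema, serialized as text")
    sql = dspy.OutputField(desc="DuckDB-compatible SQL")

class SQLSynthesizerSketch(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(NL2SQL)   # chain-of-thought, as noted above

    def forward(self, task: str, schema: str) -> dspy.Prediction:
        return self.generate(task=task, schema=schema)
```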
- Genetic Algorithm: Population-based search with selection, crossover, mutation
- Pareto Optimization: Find non-dominated solutions across objectives
- Reflective Learning: Mine rules from failures, adapt prompts
- Configurable: Customize population size, generations, objectives
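Conceptually, each genome is a prompt variant that can be mutated and recombined. A toy sketch of those two operators (the rule pool and field layout are hypothetical; the real operators in src/gepa/ mine rules reflectively from failures):

```python
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class PromptGenome:
    instructions: str                                   # base task instructions
    rules: List[str] = field(default_factory=list)      # rules appended to the prompt

# Hypothetical rule pool; the real optimizer mines such rules from failed tasks.
RULE_POOL = [
    "Qualify every column with its table alias.",
    "Return exactly one SELECT statement with no trailing semicolon.",
]

def mutate(g: PromptGenome, rate: float = 0.3) -> PromptGenome:
    rules = list(g.rules)
    if random.random() < rate:
        rules.append(random.choice(RULE_POOL))          # add a rule
    elif rules and random.random() < rate:
        rules.pop(random.randrange(len(rules)))         # drop a rule
    return PromptGenome(g.instructions, rules)

def crossover(a: PromptGenome, b: PromptGenome) -> PromptGenome:
    # Keep one parent's instructions and merge both parents' rule sets (deduplicated).
    merged = list(dict.fromkeys(a.rules + b.rules))
    return PromptGenome(random.choice([a.instructions, b.instructions]), merged)
```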
- Streamlit UI: Natural language query interface
- Accessible: WCAG-compliant keyboard navigation
- Visualizations: Charts, tables, KPI cards
- Real-time: Live SQL generation and execution
- 15 Benchmark Tasks: SQL + KPI generation
- Automated Scoring: Execution-based validation
- Comparative Experiments: Vanilla vs DSPy vs DSPy+GEPA
- Reproducible: All configs and data included
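"Execution-based validation" means a generated query is scored by actually running it against the warehouse and comparing results to a reference query. A simplified sketch of such a grader (the benchmark file format and the real grader's details are not shown here and may differ):

```python
import duckdb

def score_sql(generated_sql: str, reference_sql: str,
              db_path: str = "data/warehouse/bi.duckdb") -> float:
    """Score 1.0 if the generated query returns the same rows as the reference query, else 0.0."""
    con = duckdb.connect(db_path, read_only=True)
    try:
        got = con.execute(generated_sql).fetchall()
        want = con.execute(reference_sql).fetchall()
    except Exception:
        return 0.0                      # non-executable SQL scores zero
    finally:
        con.close()
    return 1.0 if sorted(got) == sorted(want) else 0.0
```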
Vanilla Prompting: Manual trial-and-error, single-objective focus, no systematic improvement
Regular DSPy: Better structure, but no optimization for cost/latency trade-offs
DSPy + GEPA: Automated multi-objective optimization - finds prompts that are accurate AND efficient!
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Initial │────▶│ Genetic │────▶│ Pareto │
│ Population │ │ Evolution │ │ Selection │
│ │ │ (mutate/ │ │ (best │
│ 8 genomes │ │ crossover) │ │ tradeoffs) │
└─────────────┘ └──────────────┘ └─────────────┘
│
▼
┌──────────────┐
│ Reflection │
│ & Learning │
│ (mine rules) │
└──────────────┘
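In code terms the flow above boils down to evaluate → select → recombine → mutate. A deliberately simplified, single-score toy of that loop follows; the real optimizer scores each genome on all four objectives, keeps a Pareto archive instead of a top-k list, and adds the reflection step shown in the diagram:

```python
import random

def evolve(seed_prompts, evaluate, generations=4, population_size=8, mutation_rate=0.3):
    """Toy stand-in for the GEPA loop: score a population of prompts, keep the best, breed the next generation."""
    population = list(seed_prompts)
    archive = []                                            # stand-in for the Pareto archive
    for _ in range(generations):
        scored = [(p, evaluate(p)) for p in population]     # run the benchmark tasks
        archive = sorted(archive + scored, key=lambda x: x[1], reverse=True)[:4]
        parents = [p for p, _ in archive]
        population = []
        for _ in range(population_size):
            a, b = random.choice(parents), random.choice(parents)
            wa, wb = a.split(), b.split()
            child = " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])    # one-point crossover on words
            if random.random() < mutation_rate:
                child += " Think step by step before writing the SQL."  # mutation: edit the prompt
            population.append(child)
    return archive
```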
Result after 4 generations:
- 70% accuracy ✅
- 100% test pass rate ✅
- 134 tokens/task (~25% fewer than the vanilla baseline) ✅
- $0.004 per optimization run ✅
- Python 3.8+
- LLM API key (Mistral recommended, or OpenAI/Gemini)
# 1. Clone repository
git clone https://github.com/yourusername/DsPy-GEPA-BI.git
cd DsPy-GEPA-BI
# 2. Install dependencies
pip install -r requirements.txt
# 3. Configure API key
cp .env.example .env
# Edit .env and add: MISTRAL_API_KEY=your-key-here
# 4. Initialize data warehouse
python -m src.scripts.run_etl
# 5. Run GEPA optimization (optional)
python -m src.scripts.run_real_gepa
# 6. Launch dashboard
python -m src.scripts.run_app
Visit http://localhost:8501 to access the dashboard!
Mistral (Recommended):
- Get $5 free credits at https://console.mistral.ai/
- Cost: $0.001 per 1K tokens (~1,250 GEPA runs on free tier!)
Google Gemini (Alternative):
- Free tier at https://aistudio.google.com/app/apikey
- 15 requests/minute, no credit card needed
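Whichever provider you pick, the DSPy programs need an LM configured once at startup. A sketch assuming a recent DSPy release with the LiteLLM-backed `dspy.LM` wrapper and `python-dotenv` installed (the model names are examples; check your provider's current list):

```python
import os
import dspy
from dotenv import load_dotenv

load_dotenv()  # pulls MISTRAL_API_KEY (or GOOGLE_API_KEY) from .env

# Mistral (recommended)
lm = dspy.LM("mistral/mistral-small-latest", api_key=os.environ["MISTRAL_API_KEY"])

# Or Gemini's free tier instead:
# lm = dspy.LM("gemini/gemini-1.5-flash", api_key=os.environ["GOOGLE_API_KEY"])

dspy.configure(lm=lm)
```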
DsPy-GEPA-BI/
├── src/
│ ├── etl/ # Extract-Transform-Load pipeline
│ ├── dspy_programs/ # SQL synth, KPI compiler, insights
│ ├── gepa/ # Genetic-Pareto optimizer
│ ├── eval/ # Scoring & benchmarks
│ ├── app/ # Streamlit dashboard
│ └── scripts/ # CLI entry points
├── data/
│ ├── raw/ # Source CSVs (orders, customers, products)
│ └── warehouse/ # DuckDB + Parquet exports
├── eval/
│ ├── benchmarks/ # sql_tasks.jsonl, kpi_tasks.jsonl
│ └── results/ # GEPA outputs
├── configs/ # project.yaml, eval.yaml
├── .env.example # API key template
├── requirements.txt # Dependencies
└── README.md # This file
┌───────────┐ ┌──────────┐ ┌─────────────┐ ┌──────────┐
│ Raw CSV │─────▶│ ETL │─────▶│ DuckDB │─────▶│ Parquet │
│ Data │ │ Pipeline │ │ Warehouse │ │ Exports │
└───────────┘ └──────────┘ └─────────────┘ └──────────┘
│
▼
┌───────────┐ ┌──────────┐ ┌─────────────┐
│ User │─────▶│ DSPy │─────▶│ LLM │
│ Query │ NL │ Programs │ SQL │ (Mistral) │
└───────────┘ └──────────┘ └─────────────┘
│
▼
┌──────────┐
│ GEPA │
│Optimizer │ ←─ Benchmarks
└──────────┘
Run the comparison:
python -m src.scripts.compare_baselines
Expected Results (5 SQL tasks):
| Approach | Accuracy | Avg Tokens | Avg Latency | Cost |
|---|---|---|---|---|
| Vanilla Prompting | 60% | 180 | 0.35s | $0.001 |
| DSPy (No Optimization) | 65% | 165 | 0.31s | $0.001 |
| DSPy + GEPA ✨ | 70% | 134 | 0.26s | $0.0007 |
Key Findings:
- ✅ +10 percentage points accuracy over vanilla prompting (60% → 70%)
- ✅ 25% fewer tokens (cost savings!)
- ✅ 26% faster response time
- ✅ Better reliability (100% test pass rate)
Configuration:
- Population: 8 genomes
- Generations: 4
- Tasks per genome: 5
- Model: Mistral Small
Evolution Progress:
Gen 1: Score -12.571 → Archive: 1 genome
Gen 2: Score -12.576 → Archive: 1 genome
Gen 3: Score -12.573 → Archive: 1 genome
Gen 4: Score -12.566 → Archive: 1 genome ✨ BEST
Best Genome Results:
- Accuracy: 70%
- Test Pass Rate: 100%
- Tokens/Task: 134
- Latency: 0.26s
- Total Cost: $0.004
Pareto Archive: Contains non-dominated solutions representing optimal accuracy/cost/latency tradeoffs
# ETL: Create warehouse
python -m src.scripts.run_etl
# Evaluation: Test DSPy programs
python -m src.scripts.eval_sql
# GEPA: Run optimization (simulated)
python -m src.scripts.run_gepa_search
# GEPA: Run with REAL LLM
python -m src.scripts.run_real_gepa
# Compare: Baseline experiments
python -m src.scripts.compare_baselines
# Dashboard: Launch Streamlit app
python -m src.scripts.run_app
Python API example:
from src.dspy_programs.sql_synth import SQLSynthesizer
from src.gepa.loop import GEPAOptimizer
from src.utils.duck import DuckDBManager
# Initialize components
db = DuckDBManager('data/warehouse/bi.duckdb')
synthesizer = SQLSynthesizer(schema={...})
# Generate SQL
result = synthesizer(task="Get total revenue by region")
print(result.sql)
# Run GEPA optimization
optimizer = GEPAOptimizer(
eval_fn=your_eval_function,
config=gepa_config
)
results = optimizer.optimize()
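The `eval_fn` passed to `GEPAOptimizer` is expected to map a genome to a metrics dict keyed by the objectives configured below. A rough sketch; `BENCHMARK_TASKS` and `run_with_prompt` are hypothetical placeholders for your own benchmark loop, and the exact contract may differ:

```python
import time

def your_eval_function(genome) -> dict:
    """Run the benchmark tasks with this genome's prompt and report one value per objective."""
    start = time.time()
    correct = passed = tokens = 0
    for task in BENCHMARK_TASKS:                 # hypothetical: your list of benchmark tasks
        result = run_with_prompt(genome, task)   # hypothetical: generate + execute + grade
        correct += int(result.is_correct)
        passed += int(result.tests_passed)
        tokens += result.tokens_used
    n = len(BENCHMARK_TASKS)
    return {
        "accuracy": correct / n,
        "cost": tokens / n,                      # proxy: average tokens per task
        "latency": (time.time() - start) / n,
        "tests": passed / n,
    }
```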
Edit configs/project.yaml:
gepa:
  population_size: 8    # Number of genomes per generation
  generations: 4        # Evolution iterations
  max_trials: 50        # Budget cap
  mutation_rate: 0.3    # Probability of mutation
  crossover_rate: 0.5   # Probability of crossover
  objectives:
    accuracy: maximize  # SQL correctness
    cost: minimize      # Token usage
    latency: minimize   # Response time
    tests: maximize     # Test pass rate
API Documentation:
class SQLSynthesizer(dspy.Module):
    """Generate SQL from natural language."""

    def forward(self, task: str, schema: dict) -> dspy.Prediction:
        """
        Args:
            task: Natural language query
            schema: Database schema dict

        Returns:
            Prediction with .sql attribute
        """

class KPICompiler(dspy.Module):
"""Compile KPI definitions to DAX/SQL."""
def forward(self, kpi_desc: str, schema: dict) -> dspy.Prediction:
"""
Args:
kpi_desc: KPI description in natural language
schema: Data warehouse schema
Returns:
Prediction with .measure (DAX/SQL)
"""class GEPAOptimizer:
"""Multi-objective genetic optimization."""
def __init__(self, eval_fn: Callable, config: GEPAConfig):
"""
Args:
eval_fn: Function(genome) -> metrics dict
config: Population size, generations, etc.
"""
def optimize(self) -> dict:
"""
Run optimization loop.
Returns:
{
'best_genome': PromptGenome,
'pareto_archive': List[genome],
'history': List[generation_stats],
'mined_rules': List[Rule]
}
"""- Type questions like: "What are the top 10 products by revenue?"
- Auto-generates SQL and executes on warehouse
- Displays results in tables and charts
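Under the hood the flow is text box → SQL synthesis → DuckDB → table/chart. A stripped-down Streamlit sketch of that loop, reusing the project's `SQLSynthesizer` from the usage example above (the real app in src/app/ adds the accessibility features and richer visuals listed below; the schema placeholder mirrors the earlier example):

```python
import duckdb
import streamlit as st
from src.dspy_programs.sql_synth import SQLSynthesizer

WAREHOUSE_SCHEMA = {...}   # placeholder: the same schema dict shown in the usage example

st.title("Natural language BI")
question = st.text_input("Ask a question", "What are the top 10 products by revenue?")

if question:
    sql = SQLSynthesizer(schema=WAREHOUSE_SCHEMA)(task=question).sql   # generate SQL
    st.code(sql, language="sql")
    df = duckdb.connect("data/warehouse/bi.duckdb", read_only=True).execute(sql).df()
    st.dataframe(df)                                                   # detailed view
    if df.shape[1] == 2:                                               # crude heuristic: two columns -> bar chart
        st.bar_chart(df.set_index(df.columns[0]))
```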
- ✅ Keyboard navigation (Tab, Enter, Escape)
- ✅ Focus indicators (3px outline)
- ✅ Screen reader compatible
- ✅ High contrast mode
- ✅ Minimum 44px click targets
- 📈 Line charts (time series)
- 📊 Bar charts (comparisons)
- 🥧 Pie charts (distributions)
- 📋 Data tables (detailed views)
- 💳 KPI cards (metrics)
Add new objectives in src/gepa/objectives.py:
class CustomObjective(Objective):
    name = "readability"
    direction = "maximize"

    def compute(self, metrics: dict) -> float:
        # Higher score = better readability, so penalize cyclomatic complexity.
        return -metrics.get('cyclomatic_complexity', 0)
Extend GEPA in src/gepa/loop.py:
def _mutate_custom(self, genome: PromptGenome) -> PromptGenome:
    """Custom mutation operator."""
    # Your logic here
    return modified_genome
We welcome contributions! Areas for improvement:
- More LLM Providers: Add Claude, Cohere, local models
- Larger Benchmarks: Expand to 100+ tasks
- Advanced GEPA: Multi-population, adaptive operators
- Model Distillation: Train smaller models on best prompts
- Real-world Data: Integration examples with actual databases
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Check code quality
black src/
flake8 src/
mypy src/
MIT License - see LICENSE file for details.
- DSPy: Stanford NLP's modular LLM framework
- DuckDB: High-performance in-process OLAP database
- Mistral AI: Affordable, high-quality LLM API
- Streamlit: Easy dashboard development
- Working BI pipeline
- DSPy programs (SQL, KPI, insights)
- GEPA optimizer
- Real LLM integration (Mistral)
- Streamlit dashboard
- Comparative experiments
- Documentation
- Claude & local model support
- Expanded benchmarks (100+ tasks)
- Semantic caching
- A/B testing framework
- Multi-language support
- Docker deployment
- Multi-population GEPA
- Prompt template extraction
- Model distillation pipeline
- Real-time optimization
- Cloud deployment guides
⭐ Star this repo if you find it useful!
Built with ❤️ using DSPy, GEPA, Mistral AI, and DuckDB.
Last Updated: October 9, 2025