NL2SQL - Multi-Agent Natural Language to SQL System


A sophisticated multi-agent orchestration system that converts natural language queries into safe, optimized SQL using LangGraph and OpenRouter LLMs. Built with security, performance, and explainability at its core.

NL2SQL Demo

Key Features

  • Multi-Agent Architecture: Specialized agents for table discovery, security filtering, schema formatting, query generation, validation, and explanation
  • Security-First: Built-in SQL injection prevention, dangerous operation filtering, and security scoring
  • Smart Table Discovery: Multi-signal ranking (keyword matching, semantic search, foreign key graph analysis)
  • Performance Validation: Query optimization checks and performance score reporting
  • Explainability: Human-readable explanations for every generated query
  • Async Architecture: Non-blocking operations throughout the pipeline
  • Schema Caching: Intelligent caching to reduce database introspection overhead
  • Comprehensive Logging: Detailed file-based logging for debugging and monitoring

Architecture

NL2SQL Multi-Agent System Architecture

Quick Start

Prerequisites

  • Python 3.13 or higher
  • OpenRouter API key (or compatible LLM provider)
  • SQLite databases (Spider dataset recommended)

Installation & Usage (PyPI)

Install the package from PyPI:

pip install nl2sql-agents

1. Configure Environment Variables

After installing, create a .env file in your working directory (or export the variables in your shell):

# LLM Provider Configuration (required)
OPENAI_API_KEY=sk-or-v1-your-openrouter-api-key
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_MODEL=openai/gpt-4o-mini
OPENAI_EMBEDDING_MODEL=openai/text-embedding-3-small

# Database Configuration (required)
DB_TYPE=sqlite
DB_PATH=/absolute/path/to/your/database.sqlite

Note: Any OpenAI-compatible API provider works — just set the base URL and API key accordingly.

2. Point to Your SQLite Database

Set DB_PATH to the absolute path of any SQLite database you want to query:

# Example: your own database
DB_PATH=/home/user/data/sales.sqlite

# Example: Spider benchmark dataset
DB_PATH=/home/user/spider/database/concert_singer/concert_singer.sqlite

3. Run the CLI

The nl2sql command is available on your PATH after installation:

# Interactive REPL mode
nl2sql

# One-shot query
nl2sql "Show me all singers from France"

# Override the database path on the fly
nl2sql --db /path/to/other.sqlite "List all employees"

# Check version
nl2sql --version

Minimal Example

pip install nl2sql-agents

export OPENAI_API_KEY="sk-or-v1-your-key"
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_MODEL="openai/gpt-4o-mini"
export OPENAI_EMBEDDING_MODEL="openai/text-embedding-3-small"
export DB_TYPE="sqlite"
export DB_PATH="/path/to/your/database.sqlite"

nl2sql "What are the top 5 products by revenue?"

Install from Source (Development)

  1. Clone the repository

    git clone https://github.com/ToheedAsghar/NL2SQL.git
    cd NL2SQL
  2. Create virtual environment

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download Spider dataset (optional — for benchmarking)

    The Spider dataset (877MB) is not included in the repository. Download it from the official source:

    # Visit https://yale-lily.github.io/spider
    # Download and extract to ./spider/ directory
    
    # Verify structure
    ls spider/database/
    # Should show: academic, concert_singer, car_1, etc.
  5. Configure environment

    Create a .env file in the project root:

    # LLM Provider Configuration
    OPENAI_API_KEY=sk-or-v1-your-openrouter-api-key
    OPENAI_BASE_URL=https://openrouter.ai/api/v1
    OPENAI_MODEL=openai/gpt-4o-mini
    OPENAI_EMBEDDING_MODEL=openai/text-embedding-3-small
    
    # Database Configuration
    DB_TYPE=sqlite
    DB_PATH=./spider/database/concert_singer/concert_singer.sqlite
    
    # Optional: Adjust discovery agent weights
    # KEYWORD_WEIGHT=0.35
    # SEMANTIC_WEIGHT=0.45
    # FK_GRAPH_WEIGHT=0.20
  6. Run the application

    python main.py

Usage Examples

Interactive Mode (CLI)

nl2sql
Enter your question (or 'quit'/'exit'/'q'): 
> Show me all singers from France

══════════════════════════════════════════════════════════════
 SQL QUERY
══════════════════════════════════════════════════════════════
SELECT s.*
FROM singer AS s
WHERE s.Country = 'France'

══════════════════════════════════════════════════════════════
 EXPLANATION
══════════════════════════════════════════════════════════════
This query retrieves all singers from France by:
1. Selecting all columns from the singer table
2. Filtering records where Country equals 'France'
3. Using table alias 's' for readability

Safety Score: 3.5/4.0
✓ Security: Safe (no SQL injection risks)
✓ Syntax: Valid SQL
✓ Logic: Correct table and column usage
⚠ Performance: Consider adding index on Country column for large datasets

══════════════════════════════════════════════════════════════

Sample Questions (Concert Singer Database)

- Show me all singers from France
- List all concerts held in 2024
- What are the top 5 stadiums by capacity?
- How many concerts did each singer perform?
- Which stadium hosted the most concerts?
- Find singers who have never performed in a concert
- Show me concerts with ticket prices above $100

Configuration

LLM Provider

The system uses OpenRouter by default, but you can configure any OpenAI-compatible API:

OPENAI_API_KEY=your-api-key
OPENAI_BASE_URL=https://api.openai.com/v1  # or your provider
OPENAI_MODEL=gpt-4-turbo
OPENAI_EMBEDDING_MODEL=text-embedding-3-small

Database Configuration

Switch between different databases by updating DB_PATH:

# Concert Singer (music industry)
DB_PATH=./spider/database/concert_singer/concert_singer.sqlite

# Academic (papers and conferences)
DB_PATH=./spider/database/academic/academic.sqlite

# Car Sales
DB_PATH=./spider/database/car_1/car_1.sqlite

Discovery Agent Tuning

Adjust table discovery signal weights in .env:

KEYWORD_WEIGHT=0.35    # Exact keyword matching
SEMANTIC_WEIGHT=0.45   # Semantic similarity
FK_GRAPH_WEIGHT=0.20   # Foreign key relationships

Project Structure

NL2SQL/
├── nl2sql_agents/                     # Installable Python package
│   ├── __init__.py                    # Package version & metadata
│   ├── cli.py                         # Rich CLI entry point
│   ├── py.typed                       # PEP 561 typed marker
│   ├── agents/                        # Agent modules
│   │   ├── __init__.py
│   │   ├── base_agent.py              # Base agent with LLM calling
│   │   ├── query_generator.py         # SQL generation agent
│   │   ├── schema_formatter.py        # Schema formatting agent
│   │   ├── discovery/                 # Table discovery agents
│   │   │   ├── __init__.py
│   │   │   ├── discovery_agent.py     # Multi-signal orchestrator
│   │   │   ├── keyword_agent.py       # Keyword matching
│   │   │   ├── semantic_agent.py      # Embedding-based search
│   │   │   └── fk_graph_agent.py      # Foreign key analysis
│   │   ├── validator/                 # Validation agents
│   │   │   ├── __init__.py
│   │   │   ├── validator_agent.py     # Orchestrator
│   │   │   ├── syntax_validator.py    # SQL syntax check
│   │   │   ├── logic_validator.py     # Logical correctness
│   │   │   ├── security_validator.py  # SQL injection check
│   │   │   └── performance_validator.py # Performance analysis
│   │   └── explainer/                 # Explanation agents
│   │       ├── __init__.py
│   │       ├── explainer_agent.py     # Orchestrator
│   │       ├── explanation_agent.py   # Query explanation
│   │       ├── safety_report_agent.py # Safety scoring
│   │       └── optimization_agent.py  # Optimization tips
│   ├── orchestrator/                  # LangGraph pipeline
│   │   ├── __init__.py
│   │   ├── pipeline.py                # Main graph definition
│   │   └── nodes.py                   # Node implementations
│   ├── filters/                       # Pre-processing filters
│   │   ├── __init__.py
│   │   ├── gate.py                    # Table gating
│   │   └── security_filter.py         # Query security filter
│   ├── db/                            # Database layer
│   │   ├── __init__.py
│   │   └── connector.py               # Async SQLite connector
│   ├── cache/                         # Caching layer
│   │   ├── __init__.py
│   │   └── schema_cache.py            # Schema caching
│   ├── config/                        # Configuration
│   │   ├── __init__.py
│   │   └── settings.py                # LLM provider settings
│   └── models/                        # Data models
│       ├── __init__.py
│       └── schemas.py                 # GraphState and schemas
├── data/                              # Local data assets
│   └── spider/
│       └── database/                  # Spider dataset databases
├── Assets/                            # Static assets
│   ├── architecture.svg
│   ├── nl2sql.gif                     # Demo GIF
│   └── nl2sql-architecture.svg        # Architecture diagram
├── logs/                              # Application logs (auto-created)
│   └── *.log                          # Timestamped log files
├── spider/                            # Spider benchmark files
│   ├── dev.json
│   ├── dev_gold.sql
│   ├── tables.json
│   ├── train_gold.sql
│   ├── train_others.json
│   ├── train_spider.json
│   ├── README.txt
│   └── database/                      # 200+ domain databases
├── main.py                            # Legacy CLI entry point
├── pyproject.toml                     # Package metadata & build config
├── requirements.txt                   # Python dependencies
├── FAME_PLAN.md                       # FAME methodology plan
├── MEMORY_IMPLEMENTATION.md           # Memory system design doc
├── .env                               # Environment configuration
├── .env.example                       # Example env template
├── .gitignore                         # Git ignore rules
├── LICENSE                            # MIT License
└── README.md                          # This file

How It Works

1. Table Discovery

The Discovery Agent uses three signals to rank tables:

  • Keyword Matching (35%): Direct token overlap with query
  • Semantic Search (45%): Embedding similarity using LLM
  • Foreign Key Graph (20%): Relationship centrality analysis

Tables are ranked by weighted score, and top-K are passed forward.
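The weighted ranking can be sketched in a few lines of Python. This is a hypothetical illustration (table names and signal scores are invented, and the package's actual ranking code may differ); it only shows how the three weights combine:

```python
# Hypothetical sketch of the discovery agent's weighted-score ranking.
# Each signal score is assumed to be normalized to [0, 1].
KEYWORD_WEIGHT, SEMANTIC_WEIGHT, FK_GRAPH_WEIGHT = 0.35, 0.45, 0.20

def rank_tables(signals: dict, top_k: int = 3) -> list:
    """Rank tables by the weighted sum of (keyword, semantic, fk_graph) scores."""
    scored = {
        table: KEYWORD_WEIGHT * kw + SEMANTIC_WEIGHT * sem + FK_GRAPH_WEIGHT * fk
        for table, (kw, sem, fk) in signals.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# Invented example scores for the concert_singer database:
signals = {
    "singer":  (0.9, 0.8, 0.5),
    "concert": (0.2, 0.6, 0.7),
    "stadium": (0.1, 0.2, 0.4),
}
print(rank_tables(signals, top_k=2))  # → ['singer', 'concert']
```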

2. Security Filtering

Before processing, queries are checked for:

  • DROP, DELETE, TRUNCATE operations
  • System table access attempts
  • SQL injection patterns
  • Dangerous function calls

Dangerous queries are blocked immediately.
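A minimal regex-based gate illustrates the idea (hypothetical; the actual filter in nl2sql_agents/filters/security_filter.py may use different patterns):

```python
import re

# Hypothetical patterns matching the four check categories above.
BLOCKED_PATTERNS = [
    r"\bDROP\b", r"\bDELETE\b", r"\bTRUNCATE\b",   # destructive operations
    r"\bsqlite_master\b",                          # system table access
    r";\s*\S",                                     # stacked statements (injection)
    r"\bload_extension\s*\(",                      # dangerous function call
]

def is_safe(query: str) -> bool:
    """Return False if any blocked pattern appears in the query."""
    return not any(re.search(p, query, re.IGNORECASE) for p in BLOCKED_PATTERNS)

print(is_safe("Show all singers from France"))   # → True
print(is_safe("drop table singer"))              # → False
```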

3. Query Generation

The generator uses:

  • One-shot retry: If first generation fails validation, retry with error context
  • Schema-aware prompting: Full context of relevant tables, columns, types
  • Example-based learning: Few-shot examples in system prompt
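The one-shot retry loop can be sketched as follows (a hypothetical outline with stub generate/validate functions, not the package's actual implementation):

```python
# Hypothetical sketch: generate once, and on validation failure retry
# exactly once with the validator's error message as extra context.
def generate_with_retry(generate, validate, question: str) -> str:
    sql = generate(question, error=None)
    ok, error = validate(sql)
    if ok:
        return sql
    sql = generate(question, error=error)  # single retry with error context
    ok, error = validate(sql)
    if not ok:
        raise ValueError(f"Generation failed after retry: {error}")
    return sql

# Stubs that simulate a failing first attempt and a successful retry:
def fake_generate(question, error):
    return "SELECT * FROM singer" if error else "SELEC * FROM singer"

def fake_validate(sql):
    return (True, None) if sql.startswith("SELECT") else (False, "syntax error near 'SELEC'")

print(generate_with_retry(fake_generate, fake_validate, "all singers"))
```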

4. Multi-Stage Validation

Four validators run in parallel:

  • Syntax: SQL parsing and syntax validation
  • Logic: Table/column existence, join correctness
  • Security: SQL injection and dangerous operation check
  • Performance: Query complexity, missing indexes, optimization tips
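Running the four validators concurrently maps naturally onto asyncio.gather. The sketch below is hypothetical (the real node implementations live in nl2sql_agents/orchestrator/, and these validator stubs just return fixed results):

```python
import asyncio

# Hypothetical stub validators; each returns (dimension, passed).
async def check_syntax(sql):      return ("syntax", True)
async def check_logic(sql):       return ("logic", True)
async def check_security(sql):    return ("security", True)
async def check_performance(sql): return ("performance", False)  # e.g. missing index

async def validate_all(sql: str) -> dict:
    """Run all four validators concurrently and collect their verdicts."""
    results = await asyncio.gather(
        check_syntax(sql), check_logic(sql),
        check_security(sql), check_performance(sql),
    )
    return dict(results)

print(asyncio.run(validate_all("SELECT * FROM singer")))
```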

5. Explainability

The Explainer generates:

  • Step-by-step query breakdown
  • Safety score (0-4 scale across 4 dimensions)
  • Optimization recommendations
  • Human-readable summary
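As an illustration of how a score like the 3.5/4.0 in the usage example could arise, here is a minimal sketch in which each of the four dimensions contributes up to 1.0 point (hypothetical; the actual scorer in safety_report_agent.py may aggregate differently):

```python
# Hypothetical sketch: safety score = sum of four per-dimension scores,
# each clamped to [0, 1], giving a 0-4 scale.
def safety_score(dimension_scores: dict) -> float:
    assert set(dimension_scores) == {"security", "syntax", "logic", "performance"}
    return round(sum(min(max(s, 0.0), 1.0) for s in dimension_scores.values()), 2)

# A query that passes everything but loses half a point on performance:
print(safety_score({"security": 1.0, "syntax": 1.0,
                    "logic": 1.0, "performance": 0.5}))  # → 3.5
```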

Database Setup

Spider Dataset

The Spider dataset contains 200+ databases across diverse domains:

  • Academic papers and conferences
  • Music (concert_singer)
  • Sports (baseball, football, soccer)
  • Business (car sales, employee management, HR)
  • E-commerce, healthcare, real estate, and more

Download: Visit https://yale-lily.github.io/spider

  1. Download the database ZIP file
  2. Extract to ./spider/ directory in project root
  3. Verify: ls spider/database/ should show ~200 subdirectories
  4. Update .env with desired database path

Custom Databases

To use your own SQLite database:

  1. Place your .sqlite or .db file anywhere accessible
  2. Update .env:
    DB_PATH=/path/to/your/database.sqlite
  3. Run the application

The system will automatically:

  • Introspect schema (tables, columns, types, foreign keys)
  • Cache schema for performance
  • Build embedding index for semantic search
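Schema introspection on SQLite boils down to a few PRAGMA calls. The sketch below is a hypothetical, synchronous analogue of what the async connector does (function name and return shape are invented):

```python
import sqlite3

def introspect(db_path: str) -> dict:
    """Read tables, columns, types, and foreign keys from a SQLite file."""
    conn = sqlite3.connect(db_path)
    schema = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' "
        "AND name NOT LIKE 'sqlite_%'")]
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [(r[1], r[2]) for r in conn.execute(f"PRAGMA table_info({table})")]
        # foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = [(r[3], r[2], r[4]) for r in conn.execute(f"PRAGMA foreign_key_list({table})")]
        schema[table] = {"columns": columns, "foreign_keys": fks}
    conn.close()
    return schema
```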

Logging

Logs are automatically written to logs/ directory with timestamps:

logs/
├── nl2sql_20260301_143022.log
├── nl2sql_20260301_151134.log
└── nl2sql_20260301_163045.log

Log format includes:

  • Timestamp
  • Logger name (agent identifier)
  • Log level
  • Message (including token usage, scores, timings)

View logs:

tail -f logs/nl2sql_*.log
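A timestamped file logger in this format can be set up with the standard library. This is a hypothetical sketch of the pattern, not the package's actual logging module:

```python
import datetime
import logging
import pathlib

def setup_logging(log_dir: str = "logs") -> logging.Logger:
    """Write logs to logs/nl2sql_YYYYMMDD_HHMMSS.log with the format above."""
    pathlib.Path(log_dir).mkdir(exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    logging.basicConfig(
        filename=f"{log_dir}/nl2sql_{stamp}.log",
        level=logging.INFO,
        # timestamp | logger name | level | message
        format="%(asctime)s | %(name)s | %(levelname)s | %(message)s",
        force=True,  # replace any handlers configured earlier
    )
    return logging.getLogger("nl2sql.discovery")

log = setup_logging()
log.info("ranked 3 tables, tokens=512, elapsed=0.8s")
```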

Development

Running Tests

# Unit tests
pytest tests/

# Integration tests
pytest tests/integration/

# Coverage report
pytest --cov=nl2sql_agents --cov-report=html

Code Style

This project follows PEP 8 with Black formatting:

# Format code
black .

# Lint
ruff check .

# Type checking
mypy nl2sql_agents/

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass (pytest)
  6. Format code (black .)
  7. Commit your changes (git commit -m 'Add amazing feature')
  8. Push to the branch (git push origin feature/amazing-feature)
  9. Open a Pull Request

Development Setup

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Roadmap

  • Multi-database support: PostgreSQL, MySQL, SQL Server
  • Query execution: Safely execute generated queries and return results
  • Result visualization: Charts and graphs for query results
  • Query history: Track and reuse previous queries
  • Fine-tuned models: Domain-specific LLM fine-tuning
  • Web UI: Browser-based interface with real-time feedback
  • API server: REST/GraphQL API for integration
  • Streaming responses: Real-time query generation feedback
  • Multi-turn conversations: Context-aware follow-up queries
  • Natural language results: Convert SQL results back to natural language

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Spider Dataset: Yale University's semantic parsing and text-to-SQL benchmark
  • LangGraph: LangChain's graph-based agent orchestration framework
  • OpenRouter: LLM API aggregation service
  • LangChain: Building blocks for LLM applications

Contact

Toheed Asghar - Project Maintainer


Star History

If you find this project useful, please consider giving it a star! It helps others discover it.


Built with ❤️ using LangGraph and OpenRouter

Report Bug · Request Feature · Documentation

About

Multi-agent NL→SQL pipeline built on LangGraph. Converts natural language questions into validated SQL queries using parallel discovery, generation, and validation agents. Supports multi-model mixing (GPT-4o + Gemini) via OpenRouter. Tested on Spider dataset.
