
Multi-agent system to generate SQLite queries using LLMs and RAG


HendEmad/NL2SQL


Introduction

This project implements an execution-grounded NL→SQL assistant for the Spider dataset using:

  • a multi-agent pipeline
  • RAG over database schemas (BM25 + tuned dense retrieval)
  • SQLite execution + automatic repair loops
  • an E5 embedding model fine-tuned on Spider for improved schema retrieval

Target:

Given an input example {question, db_id} from Spider:

  • retrieve the most relevant schema pieces.
  • generate SQL with an LLM.
  • validate SQL shape (audit / optional reprompt).
  • execute in SQLite.
  • auto-repair on errors (deterministic + LLM repair).
  • output final SQL + detailed metadata.

Install / setup (Kaggle)

In a Kaggle Notebook (GPU recommended for LLM generation):

!pip -q install transformers accelerate bitsandbytes sentencepiece sqlparse sqlglot
!pip -q install sentence-transformers faiss-cpu rank_bm25

Required inputs (Kaggle datasets)

Set paths as in the notebook:

  • Spider dataset:

    • spider_root = /kaggle/input/yale-universitys-spider-10-nlp-dataset/spider
    • contains tables.json and database/
  • Dev split:

    • dev1_path = /kaggle/input/dev-json-splitted-v1/.../dev_1.json
  • Tuned embedder:

    • EMBED_MODEL = /kaggle/input/e5model-spider-tuning/transformers/default/1

Model Config

  • Generator LLM:

    • MODEL_NAME = XGenerationLab/XiYanSQL-QwenCoder-7B-2502 (loaded in 4-bit)
  • Embedder:

    • tuned E5 (SentenceTransformers format)

Quick Run

data_agent = DataAgent(dev1_path, tables_path)
dev1, schema_by_db, used_db_ids = data_agent.load()

llm = LLMGeneratorAgent(MODEL_NAME)
llm.load()

schema_agent = SchemaStoreAgent(schema_by_db, db_root)
for db_id in used_db_ids:
    schema_agent.ensure(db_id)
    schema_agent.build_db_lexicon(db_id)

retrieval_agent = RetrievalIndexAgent(schema_agent, embed_model_name=EMBED_MODEL)
retrieval_agent.build_all(used_db_ids)

fk_agent = FKBridgeAgent(schema_agent)
join_agent = JoinPlannerAgent()
router_agent = RetrievalRouterAgent(schema_agent, retrieval_agent, fk_agent)

shape_agent = ShapeCheckerAgent()
exec_agent = ExecutionAgent()
det_repair = DeterministicRepairAgent(schema_by_db)
validator = ValidatorRepairAgent(llm, schema_agent, exec_agent, det_repair)

prompt_builder = PromptBuilderAgent(
    llm=llm,
    router=router_agent,
    schema_agent=schema_agent,
    value_linker=None,          # not used
    join_hint_agent=join_agent,
)

runner = PipelineRunner(
    llm=llm,
    prompt_builder=prompt_builder,
    shape_agent=shape_agent,
    exec_agent=exec_agent,
    det_repair=det_repair,
    validator=validator
)

pred_path = "/kaggle/working/pred_rag_baseline_agents.sql"
preds, meta_df = runner.run_config(
    dev1, name="rag_baseline_agents",
    use_value_linker=False, gated=False,
    out_sql_path=pred_path
)

Evaluation (Spider Official Script)

gold_path = "/kaggle/working/dev1_gold.sql"
with open(gold_path, "w", encoding="utf-8") as f:
    for ex in dev1:
        f.write(ex["query"].strip() + "\t" + ex["db_id"].strip() + "\n")

!git clone -q https://github.com/taoyds/spider

!python spider/evaluation.py \
  --gold  "/kaggle/working/dev1_gold.sql" \
  --pred  "/kaggle/working/pred_rag_baseline_agents.sql" \
  --db    "/kaggle/input/yale-universitys-spider-10-nlp-dataset/spider/database" \
  --table "/kaggle/input/yale-universitys-spider-10-nlp-dataset/spider/tables.json" \
  --etype all

System overview (components)

A) Multi-agent pipeline (high level)

Input: (question, db_id)

Output: final executable SQL (+ metadata)

Pipeline stages:

  1. DataAgent → loads dev set and schema metadata (tables.json), builds schema_by_db.
  2. SchemaStoreAgent → converts schemas into retrievable documents + builds FK join graph.
  3. RetrievalIndexAgent → builds:
    • BM25 index (lexical retrieval)
    • FAISS index using tuned E5 (dense retrieval)
  4. RetrievalRouterAgent → chooses retrieval mode and builds schema context:
    • BM25-based, dense-based, or fallback to full DDL
  5. FKBridgeAgent → expands retrieved docs via FK shortest paths (adds bridge tables/edges).
  6. JoinPlannerAgent → optionally formats a compact list of FK join edges as hints (disabled by default)
  7. PromptBuilderAgent → composes final NL→SQL prompt using retrieved schema context.
  8. LLMGeneratorAgent → loads XiYanSQL-QwenCoder (4-bit) and generates SQL, then cleans output.
  9. ShapeCheckerAgent → parses SQL with sqlglot and audits expected query structure (agg/group/order/limit/distinct)
  10. ExecutionAgent → runs SQL in SQLite read-only mode
  11. DeterministicRepairAgent → fixes common runtime errors:
    • unknown table / unknown column via fuzzy matching + alias mapping
  12. ValidatorRepairAgent → LLM repair loop using (DDL + bad SQL + SQLite error), then re-executes
  13. PipelineRunner → runs on the full dev set, saves predictions and meta logs
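Stages 10 and 11 can be sketched as follows. This is a minimal, hypothetical version of read-only execution and fuzzy-match column repair; `run_readonly` and `fuzzy_fix_column` are illustrative names, not the actual agent API:

```python
import re
import sqlite3
from difflib import get_close_matches

def run_readonly(db_path: str, sql: str):
    """Execute SQL against SQLite in read-only mode; return (rows, error)."""
    try:
        # mode=ro guarantees the dev databases are never modified
        con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            return con.execute(sql).fetchall(), None
        finally:
            con.close()
    except sqlite3.Error as e:
        return None, str(e)

def fuzzy_fix_column(sql: str, error: str, known_columns: list) -> str:
    """If SQLite reports 'no such column: X', swap X for the closest known column."""
    m = re.search(r"no such column: (\S+)", error or "")
    if not m:
        return sql
    bad = m.group(1)
    match = get_close_matches(bad.split(".")[-1], known_columns, n=1, cutoff=0.6)
    return sql.replace(bad, match[0]) if match else sql
```

The read-only URI connection is what lets the pipeline execute arbitrary generated SQL safely; any write attempt fails with an SQLite error instead of corrupting the dev databases.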

B) RAG over schemas

Retrieval is done over schema documents, not natural language docs.

Document types

  • TABLE docs: table name + columns (+ types) + PK + FK lines relevant to that table
  • EDGE docs: one doc per FK relation like A.col = B.col
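As a rough illustration, such documents could be derived from a Spider `tables.json` entry like this (the field names follow the Spider schema format; `build_schema_docs` is a hypothetical helper, not the project's actual code):

```python
def build_schema_docs(schema: dict) -> list:
    """Turn one Spider schema entry into retrievable TABLE and EDGE documents."""
    tables = schema["table_names_original"]
    cols = schema["column_names_original"]   # [(table_idx, col_name), ...]; idx -1 is "*"
    types = schema["column_types"]
    pks = set(schema["primary_keys"])
    fks = schema["foreign_keys"]             # [(col_idx, referenced_col_idx), ...]

    docs = []
    # One TABLE doc per table: name, typed columns, PK markers
    for ti, tname in enumerate(tables):
        parts = [f"{c} {types[ci]}" + (" PK" if ci in pks else "")
                 for ci, (t, c) in enumerate(cols) if t == ti]
        docs.append({"kind": "TABLE", "table": tname,
                     "text": f"table {tname}: " + ", ".join(parts)})
    # One EDGE doc per FK relation, e.g. "concert.singer_id = singer.id"
    for a, b in fks:
        ta, ca = tables[cols[a][0]], cols[a][1]
        tb, cb = tables[cols[b][0]], cols[b][1]
        docs.append({"kind": "EDGE", "text": f"{ta}.{ca} = {tb}.{cb}"})
    return docs
```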

Retrieval methods:

  • BM25: strong when question contains exact schema tokens
  • Dense (tuned E5): strong for paraphrases / semantic match
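To make the lexical side concrete, here is a dependency-free BM25 sketch (the pipeline itself uses the `rank_bm25` package for this, and the tuned E5 model for the dense side, which is omitted here):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with standard BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter(t for d in docs_tokens for t in set(d))  # document frequency
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Because BM25 only rewards exact token overlap, a question phrased with schema words ("singer", "stadium") scores well, while paraphrases ("vocalist", "venue") score zero; that asymmetry is exactly what the dense E5 retriever compensates for.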

Routing

The router selects a retrieval strategy based on lexical hits and score gaps; otherwise it falls back to the full DDL.
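A hypothetical version of such a routing heuristic (the threshold values and the function name are invented for illustration; the actual router's logic is not shown in this README):

```python
def choose_route(question_tokens, db_lexicon, dense_scores, gap_threshold=0.15):
    """Pick a retrieval mode from lexical overlap and dense score gaps."""
    # Lexical hits: question tokens that are literal schema tokens
    lexical_hits = sum(1 for t in question_tokens if t.lower() in db_lexicon)
    if lexical_hits >= 2:
        return "bm25"
    # A large gap between top-1 and top-2 signals a confident dense match
    if len(dense_scores) >= 2 and dense_scores[0] - dense_scores[1] >= gap_threshold:
        return "dense"
    return "full_ddl"  # fallback: paste the complete schema DDL into the prompt
```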

FK expansion

Even if retrieval returns correct tables, it might miss connecting tables. FKBridgeAgent adds minimal join paths so the LLM sees a joinable subgraph.
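Finding a minimal bridge between two retrieved tables amounts to a shortest path in the FK graph, which a plain BFS can sketch (`fk_bridge` and the adjacency-dict format are illustrative assumptions, not the agent's actual interface):

```python
from collections import deque

def fk_bridge(fk_adj, start, goal):
    """Shortest FK path between two tables via BFS.

    fk_adj maps each table to the tables it shares a foreign key with.
    Returns the full path (including any bridge tables), or None.
    """
    prev = {start: None}
    q = deque([start])
    while q:
        t = q.popleft()
        if t == goal:
            path = []
            while t is not None:          # walk predecessors back to start
                path.append(t)
                t = prev[t]
            return path[::-1]
        for nxt in fk_adj.get(t, ()):
            if nxt not in prev:
                prev[nxt] = t
                q.append(nxt)
    return None
```

For example, if retrieval returns only `singer` and `concert` but they are joinable only through `singer_in_concert`, the path surfaces that bridge table so the LLM sees a joinable subgraph.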

C) Fine-tuning: tuned E5 embedder (Spider-specific)

The embedder is intfloat/e5-small-v2, fine-tuned on Spider training examples to improve schema retrieval.

Training idea:

  • Build “COLUMN docs” for each database (table.col + hints).
  • Parse gold SQL to extract used columns.
  • Train contrastively: query: <question> <-> passage: <column_doc>
  • Loss: MultipleNegativesRankingLoss (in-batch negatives)
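MultipleNegativesRankingLoss treats each (question, column-doc) pair in a batch as a positive and every other passage in the same batch as a negative; numerically it is a cross-entropy over the batch similarity matrix. A pure-Python sketch of that computation (not the sentence-transformers implementation):

```python
import math

def mnr_loss(sim):
    """In-batch-negatives ranking loss.

    sim[i][j] is the similarity of query_i with passage_j; the diagonal
    entries are the gold pairs. Each row is scored with cross-entropy
    against the target "correct column = i".
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        log_z = math.log(sum(math.exp(s) for s in sim[i]))  # log-partition of row i
        total += log_z - sim[i][i]                          # -log softmax at gold index
    return total / n
```

The loss is minimized when each question is far more similar to its own column doc than to every other column doc in the batch, which is precisely the retrieval behavior the tuned E5 needs.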

Result:

An embedding model that better aligns Spider questions with the correct schema elements.

Output files

  • pred_*.sql -> one predicted SQL query per line
  • pred_*_meta.csv -> per-example metadata:
    • router mode, retrieval stats, shape issues.
    • execution status, repair traces, validator steps.
  • dev1_gold.sql -> gold SQL + db_id file for Spider evaluation

Recommended defaults

  • Keep USE_VALIDATOR_REPAIR=True for best executable-SQL rate.
  • Use USE_SHAPE_AUDIT=True to analyze failures; enable reprompt only when debugging.
  • Start with USE_JOIN_HINTS=False and enable it only as an ablation.

Evaluation notes (from the official Spider GitHub)

Spider evaluation reports:

  • structural matching (exact / partial), and
  • execution accuracy.

Execution accuracy can have occasional false positives (two semantically different queries that return the same result by coincidence), so the meta logs and partial-match metrics are important for diagnosing errors.
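A toy demonstration of such a false positive: on this particular two-row database, a semantically wrong query happens to return the same result as the gold query (hypothetical example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE singer(id INTEGER, age INTEGER)")
con.executemany("INSERT INTO singer VALUES (?, ?)", [(1, 30), (2, 40)])

# Gold query: the maximum age
gold = con.execute("SELECT MAX(age) FROM singer").fetchall()
# Wrong query: age of the singer with the highest id; equal only by coincidence
pred = con.execute("SELECT age FROM singer ORDER BY id DESC LIMIT 1").fetchall()

assert gold == pred == [(40,)]  # execution match despite different semantics
```

On a database where the oldest singer does not have the highest id, the two queries diverge, which is why structural (exact/partial) matching is reported alongside execution accuracy.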
