-
-
Notifications
You must be signed in to change notification settings - Fork 22
Open
Description
Feature: Semantic SQL Source (Governed Parameterized Queries)
The Problem
Right now, when users ask the SQLAgent something like “show me top sales reps in the North region”, it generates SQL from scratch. This works, but:
- It can hallucinate – The LLM might guess wrong column names, join conditions, or business logic
- No guardrails – Users can accidentally (or intentionally) query sensitive data
- Inconsistent results – Ask the same question twice, get different SQL
- Complex queries are fragile – Multi-table joins, CTEs, window functions… the LLM often gets these wrong
For teams with well-defined metrics and reports, the goal is for the LLM to fill in parameters for pre-approved queries, not write SQL from scratch, like a lightweight dbt semantic layer.
The Idea
Allow tables to define parameterized SQL templates.
The LLM selects which query to use and fills in the blanks.
source = SemanticSQLSource(
uri="analytics.db",
tables={
"customers": "SELECT * FROM customers",
"sales_by_region": {
"sql": "
SELECT region, product, SUM(amount) as revenue
FROM sales
WHERE region IN :regions
AND order_date BETWEEN :start_date AND :end_date
GROUP BY region, product
ORDER BY revenue DESC
LIMIT :limit
",
"description": "Sales breakdown by region and product",
"parameters": {
"regions": {
"type": "list[str]",
"enum": ["North", "South", "East", "West"],
},
"start_date": {"type": "date"},
"end_date": {"type": "date"},
"limit": {"type": "int", "default": 100},
},
},
},
)User asks:
How did North and South regions do last quarter?
LLM responds with parameters:
{
"table": "sales_by_region",
"regions": ["North", "South"],
"start_date": "2024-10-01",
"end_date": "2024-12-31",
"limit": 100
}Database drivers can parameterize values:
WHERE region = :region AND amount > :thresholdBut not identifiers:
SELECT :columns FROM sales GROUP BY :columnsSo two parameter types are needed:
- Values – passed to DB driver
- Identifiers – quoted and injected by Lumen
Example:
"parameters": {
"dimensions": {
"type": "identifier_list",
"enum": ["region", "product", "quarter"],
},
"order_by": {
"type": "identifier",
"enum": ["revenue", "order_count"],
},
"direction": {
"type": "keyword",
"enum": ["ASC", "DESC"],
}
}Identifiers are validated against the enum list before injection.
Full Example
from lumen.sources import SemanticSQLSource
source = SemanticSQLSource(
uri="postgres://localhost/analytics",
tables={
"sales_analysis": {
"sql": "
SELECT :dimensions, :metrics
FROM sales s
JOIN products p ON s.product_id = p.id
JOIN regions r ON s.region_id = r.id
WHERE r.name IN :regions
AND s.order_date BETWEEN :start_date AND :end_date
GROUP BY :dimensions
ORDER BY :order_by :direction
LIMIT :limit
",
"parameters": {
"dimensions": {
"type": "identifier_list",
"enum": ["r.name", "p.category"],
},
"regions": {
"type": "list[str]",
"enum": ["North", "South"],
},
},
},
},
)YAML Config
sources:
analytics:
type: semantic
uri: analytics.db
tables:
customers: SELECT * FROM customers
sales_analysis:
sql: |
SELECT :dimensions
FROM sales
WHERE region IN :regions
GROUP BY :dimensions
parameters:
dimensions:
type: identifier_list
enum: [region, product]
regions:
type: list[str]
enum: [North, South, East, West]Dynamic Enums from Database
"parameters": {
"regions": {
"type": "list[str]",
"enum_from": "regions.name",
"enum_cache": "1h",
}
}Cache options:
- 1h, 24h, 7d
- session
- startup
- none
Parameter Types
- str, int, date – WHERE values
- list[str] – IN clauses
- identifier – column or table
- identifier_list – SELECT or GROUP BY
- keyword – ASC or DESC
Open Questions
- Naming: SemanticSource vs ParameterizedSource
- Large enums handling
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels