Skip to content

SemanticSQLSource and Agent proposal #1673

@ahuang11

Description

@ahuang11

Feature: Semantic SQL Source (Governed Parameterized Queries)

The Problem

Right now, when users ask the SQLAgent something like “show me top sales reps in the North region”, it generates SQL from scratch. This works, but:

  1. It can hallucinate – The LLM might guess wrong column names, join conditions, or business logic
  2. No guardrails – Users can accidentally (or intentionally) query sensitive data
  3. Inconsistent results – Ask the same question twice, get different SQL
  4. Complex queries are fragile – Multi-table joins, CTEs, window functions… the LLM often gets these wrong

For teams with well-defined metrics and reports, the goal is for the LLM to fill in parameters for pre-approved queries, not write SQL from scratch, like a lightweight dbt semantic layer.


The Idea

Allow tables to define parameterized SQL templates.
The LLM selects which query to use and fills in the blanks.

source = SemanticSQLSource(
    uri="analytics.db",
    tables={
        "customers": "SELECT * FROM customers",

        "sales_by_region": {
            "sql": "
                SELECT region, product, SUM(amount) as revenue
                FROM sales
                WHERE region IN :regions
                  AND order_date BETWEEN :start_date AND :end_date
                GROUP BY region, product
                ORDER BY revenue DESC
                LIMIT :limit
            ",
            "description": "Sales breakdown by region and product",
            "parameters": {
                "regions": {
                    "type": "list[str]",
                    "enum": ["North", "South", "East", "West"],
                },
                "start_date": {"type": "date"},
                "end_date": {"type": "date"},
                "limit": {"type": "int", "default": 100},
            },
        },
    },
)

User asks:
How did North and South regions do last quarter?

LLM responds with parameters:

{
    "table": "sales_by_region",
    "regions": ["North", "South"],
    "start_date": "2024-10-01",
    "end_date": "2024-12-31",
    "limit": 100
}

Database drivers can parameterize values:

WHERE region = :region AND amount > :threshold

But not identifiers:

SELECT :columns FROM sales GROUP BY :columns

So two parameter types are needed:

  • Values – passed to DB driver
  • Identifiers – quoted and injected by Lumen

Example:

"parameters": {
    "dimensions": {
        "type": "identifier_list",
        "enum": ["region", "product", "quarter"],
    },
    "order_by": {
        "type": "identifier",
        "enum": ["revenue", "order_count"],
    },
    "direction": {
        "type": "keyword",
        "enum": ["ASC", "DESC"],
    }
}

Identifiers are validated against the enum list before injection.


Full Example

from lumen.sources import SemanticSQLSource

source = SemanticSQLSource(
    uri="postgres://localhost/analytics",

    tables={
        "sales_analysis": {
            "sql": "
                SELECT :dimensions, :metrics
                FROM sales s
                JOIN products p ON s.product_id = p.id
                JOIN regions r ON s.region_id = r.id
                WHERE r.name IN :regions
                  AND s.order_date BETWEEN :start_date AND :end_date
                GROUP BY :dimensions
                ORDER BY :order_by :direction
                LIMIT :limit
            ",

            "parameters": {
                "dimensions": {
                    "type": "identifier_list",
                    "enum": ["r.name", "p.category"],
                },
                "regions": {
                    "type": "list[str]",
                    "enum": ["North", "South"],
                },
            },
        },
    },
)

YAML Config

sources:
  analytics:
    type: semantic
    uri: analytics.db

    tables:
      customers: SELECT * FROM customers

      sales_analysis:
        sql: |
          SELECT :dimensions
          FROM sales
          WHERE region IN :regions
          GROUP BY :dimensions

        parameters:
          dimensions:
            type: identifier_list
            enum: [region, product]

          regions:
            type: list[str]
            enum: [North, South, East, West]

Dynamic Enums from Database

"parameters": {
    "regions": {
        "type": "list[str]",
        "enum_from": "regions.name",
        "enum_cache": "1h",
    }
}

Cache options:

  • 1h, 24h, 7d
  • session
  • startup
  • none

Parameter Types

  • str, int, date – WHERE values
  • list[str] – IN clauses
  • identifier – column or table
  • identifier_list – SELECT or GROUP BY
  • keyword – ASC or DESC

Open Questions

  1. Naming: SemanticSource vs ParameterizedSource
  2. Large enums handling

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions