New requirement: Semantic Overlap detection for agent workloads #1

@sfc-gh-enewlands

Description

Proposal

Add a new requirement — semantic_overlap — to the Contextual factor. This requirement should include a check, diagnostic, and remediation path, and should be included in the agents profile.

Problem

The current Contextual stage checks whether semantic metadata exists (comments, semantic views, relationships, glossary tags) but never evaluates whether the metadata is unambiguous. For agent workloads, this is a critical gap.

When a data model contains semantic overlap — parallel entity hierarchies, synonym columns, competing join paths, overloaded names — an LLM-based agent has to guess which interpretation is correct. The practical consequences are:

  1. Accidental semantic misinterpretation — the agent picks the wrong table, the wrong column, or the wrong join path, and returns a confidently wrong answer.
  2. Token bloat in the semantic layer — disambiguation has to be explained away in documentation (semantic view descriptions, verified queries, column comments). The more overlap, the more tokens required, and the more fragile the system becomes.
  3. Reduced eval pass rates — overlapping models produce inconsistent SQL generation across runs, making evaluation sets unreliable.

The ideal state is a data model that is structurally explicit and unambiguous, requiring minimal natural-language explanation for an agent to navigate correctly.

Evidence from real-world agent deployments

Review of custom instruction sets across three distinct production agent deployments revealed that 60–70% of instruction tokens exist solely to disambiguate semantic overlap in the underlying data model. The remaining tokens cover genuinely complex business logic (calculation formulas, contribution analysis, domain-specific rules). The disambiguation instructions all follow the same pattern: "when the user says X, they mean Y not Z."

Recurring problem patterns observed:

| Pattern | Example | Why it exists for BI/human analysts | Impact on agent instructions |
| --- | --- | --- | --- |
| Multiple measures for the same business concept | "Revenue" resolves to either an adjusted or reported metric depending on context. Instructions must specify which is the default and offer the alternative. | Analysts need both views simultaneously for side-by-side comparison in reports. Having both as named columns makes spreadsheet-style analysis natural. The analyst knows which is which because they built the report. | Every overlapping measure requires a default-preference rule + fallback |
| Entity tables multiplied by dimension | The same entity replicated across three currency-specific tables rather than a single table with a currency column | BI tools often perform better with pre-filtered tables. An analyst working in one currency connects to that table and never sees other currency data — faster queries, simpler dashboards, no risk of mixing currencies in a pivot table. | Forces an "always ask which variant" rule on every query |
| Derived period lengths stored as facts | Monthly, quarterly, and rolling metrics stored as separate values rather than computed at query time from the base granularity | Finance reporting decks have fixed layouts — LTM in one section, monthly in another. Pre-computing each period length avoids runtime calculation errors and makes report building drag-and-drop. The analyst picks the column that matches their report template. | Requires a "prefer X period unless specified" rule |
| Multiple data contexts mixed into a single table | Running totals (MTD, YTD) stored as rows alongside detail records in the same table, or a compound key encoding scenario + currency + data type into a single column requiring a mandatory join to decode | Pre-aggregation is a deliberate performance optimisation — BI dashboards read a single row instead of aggregating millions of records. Compound keys inherited from ERP systems are configured once in the BI tool's join layer and invisible thereafter. Analysts know to filter by aggregation type because they built or inherited the report. | Repeated warnings: "never sum these values," "never mix aggregation levels." In one deployment, ~50% of the instruction set is dedicated to this single problem. |
| Parallel classification systems for the same entity | Two independent product hierarchies ("global" and "local"), each with 3 levels, maintained by different parts of the organisation. "Product category" exists in both but they are distinct, unmappable classification systems. | Different organisational functions (global vs regional, commercial vs supply chain) evolved independent ways of categorising the same entities. Analysts know which hierarchy to use because it's pre-configured in their dashboard or mandated by their reporting line. | Instructions must mandate that the agent reject the question if the user doesn't specify which classification system, or hard-code a default that may be wrong for some users |
| Overloaded column names | "Country" means four different things (owner, customer, sold-to, market) across columns in the same schema | In a star schema, each dimension table owns its own country column and the context is clear from the table name. BI tools resolve this through the join path: the analyst drags "country" from the customer dimension, not the shipment dimension. The column name is intentionally generic because the table provides the context. | Each deployment has a "default to X meaning of country" rule |
| Default column not obvious from schema | Multiple revenue columns exist but only one is correct for standard reporting | Analysts need access to all variants for reconciliation, audit, and edge-case analysis. A finance team might routinely use one revenue variant for operational reporting but need another for statutory reporting. Having all columns available means one data source supports multiple report types. The "default" is tribal knowledge — everyone on the team knows which column goes in the standard deck. | Rules that say "always use this specific column for revenue" |
| Business time vs system time | "Current month" means a value from a business-calendar table, not system date | Finance and operations teams work on closed-period cycles — the books close on a specific date, and "current" means the last closed period, not today. BI dashboards look up the current billing period from a control table. The concept of "now" being different from the system clock is fundamental to period-based financial reporting. | A rule mapping the concept of "now" to a specific table and column |

Every one of these patterns is a reasonable design choice for human-driven BI. The data models aren't wrong — they were built for users who navigate context visually (drill-through hierarchies), carry tribal knowledge (which revenue column is the default), and configure joins once in a BI tool's semantic layer. The problem is that agents don't have any of those affordances. An agent has column names, types, comments, and whatever instructions it's given. Every pattern above forces the agent to rely on natural-language instructions to recover context that a BI tool provides structurally.

Implementation

As proposed: a new semantic_overlap requirement under the Contextual factor, with a check, diagnostic, and remediation path, included in the agents profile.

Detection approach: two tiers

Semantic overlap detection naturally splits into two tiers based on cost and depth.

Tier 1: Structural heuristics (SQL-only, zero Cortex cost)

Pure INFORMATION_SCHEMA / ACCOUNT_USAGE queries that detect structural signals of overlap. These are lightweight, free to run, and suitable for the scan profile.

| Signal | Detection method |
| --- | --- |
| Column name collisions | Columns sharing the same name across multiple tables but with different types, comments, or tags |
| Duplicate entity tables | Table pairs with high column-name overlap (>70% shared columns), suggesting redundant representations of the same entity |
| Competing join paths | Multiple foreign-key or relationship paths between the same pair of entities (detectable from constraints or semantic view relationship declarations) |
| Overloaded column names | High-frequency column names (e.g., status, type, name) appearing across many tables without distinguishing comments or tags |
| Overlapping semantic view coverage | Multiple semantic views referencing the same base tables, creating competing semantic definitions |
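As an illustration of how a tier-1 signal can be expressed in plain SQL, the following sketch flags column name collisions. It is a hypothetical example, not the shipped check.sql, and assumes the standard Snowflake INFORMATION_SCHEMA.COLUMNS view:

```sql
-- Sketch only: columns sharing a name across tables in the current schema
-- but carrying different types or comments — a structural overlap signal.
SELECT
    column_name,
    COUNT(DISTINCT table_name)            AS table_count,
    COUNT(DISTINCT data_type)             AS distinct_types,
    COUNT(DISTINCT COALESCE(comment, '')) AS distinct_comments
FROM information_schema.columns
WHERE table_schema = CURRENT_SCHEMA()
GROUP BY column_name
HAVING COUNT(DISTINCT table_name) > 1
   AND (   COUNT(DISTINCT data_type) > 1
        OR COUNT(DISTINCT COALESCE(comment, '')) > 1)
ORDER BY table_count DESC;
```

The other tier-1 signals follow the same shape: pure metadata queries, no Cortex credits consumed.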

Tier 2: Semantic classification (Cortex AI functions, opt-in)

Cortex AI functions can detect overlap that structural heuristics miss — cases where different column names refer to the same business concept, or where comments describe overlapping intent in different words.

AI_CLASSIFY for the check — classify column names and comments into semantic categories, then detect when multiple columns across different tables are classified into the same category:

  • Surfaces "multiple measures for the same business concept" (the most common pattern observed across deployments)
  • Detects when columns like total_sales, net_revenue, and amount across different tables all classify as "revenue"
  • Catches overloaded concepts that structural name-matching misses because the column names are different
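A minimal sketch of the AI_CLASSIFY approach is below. The category list, text construction, and result parsing are illustrative assumptions, not the shipped check.cortex.sql; consult current Snowflake Cortex documentation for the exact AI_CLASSIFY output shape before relying on it:

```sql
-- Hypothetical sketch: classify each column into a business-concept bucket,
-- then flag concepts claimed by columns in more than one table.
WITH classified AS (
    SELECT
        table_name,
        column_name,
        AI_CLASSIFY(
            column_name || ': ' || COALESCE(comment, ''),
            ['revenue', 'cost', 'quantity', 'customer', 'product', 'date', 'other']
        ):labels[0]::STRING AS concept          -- assumed result shape
    FROM information_schema.columns
    WHERE table_schema = CURRENT_SCHEMA()
)
SELECT
    concept,
    COUNT(DISTINCT table_name)                      AS tables_claiming_concept,
    ARRAY_AGG(table_name || '.' || column_name)     AS overlapping_columns
FROM classified
WHERE concept <> 'other'
GROUP BY concept
HAVING COUNT(DISTINCT table_name) > 1;
```

This is how total_sales, net_revenue, and amount can all surface under a single "revenue" bucket even though no structural name match exists.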

AI_EXTRACT for the diagnostic — extract structured semantic metadata from column comments to identify overlapping intent:

  • Extracts business concept, granularity, currency, and aggregation level from free-text comments
  • Surfaces cases where two columns in different tables describe the same business concept at different granularities
  • Provides the detail needed for remediation: which specific columns overlap and how
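A sketch of the diagnostic direction follows. The AI_EXTRACT signature shown (text plus a list of fields) is an assumption — verify it against current Snowflake Cortex documentation, as the function's argument names and response format may differ:

```sql
-- Hypothetical diagnostic sketch: pull structured semantics out of free-text
-- column comments so overlapping intent can be compared across tables.
SELECT
    table_name,
    column_name,
    AI_EXTRACT(
        text => COALESCE(comment, ''),
        responseFormat => ['business_concept', 'granularity', 'currency', 'aggregation_level']
    ) AS extracted_semantics
FROM information_schema.columns
WHERE table_schema = CURRENT_SCHEMA()
  AND comment IS NOT NULL;
```

Comparing the extracted business_concept values across tables then reveals pairs that describe the same concept at different granularities.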

Tier 2 is implemented as a check variant (check.cortex.sql / diagnostic.cortex.sql) alongside the base SQL-only files. The framework already supports variants, so this fits naturally. The Cortex-powered variant is opt-in — users choose it when they want deeper analysis and accept the credit cost.

What remains out of reach

Even with Cortex functions, some overlap patterns are not reliably detectable:

  • Whether two hierarchies represent the same business concept at different grains (requires domain knowledge beyond what column metadata provides)
  • Whether pre-aggregated values are mixed with detail rows in the same table (requires understanding of the aggregation model, not just schema)
  • Whether business time semantics differ from system time (requires knowing the business calendar conventions)

These are acknowledged limitations. The check is a heuristic — structural at tier 1, semantic at tier 2 — not an oracle.

Scoring approach

An inverted score: 1.0 = no overlap signals detected, 0.0 = every table has at least one overlap signal. The score represents the fraction of tables free from detectable semantic overlap.
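The scoring arithmetic can be sketched as a single query. The overlap_findings source is hypothetical (standing in for whatever the check's detection CTEs produce, one row per flagged table):

```sql
-- Sketch: score = fraction of base tables free of any overlap signal.
WITH flagged AS (
    SELECT DISTINCT table_name
    FROM overlap_findings            -- hypothetical findings set
)
SELECT
    1.0 - COUNT(f.table_name) / NULLIF(COUNT(t.table_name), 0)
        AS semantic_overlap_score    -- 1.0 = clean, 0.0 = all tables flagged
FROM information_schema.tables t
LEFT JOIN flagged f USING (table_name)
WHERE t.table_schema = CURRENT_SCHEMA()
  AND t.table_type   = 'BASE TABLE';
```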

Remediation path

Semantic overlap in data models is rarely a technical accident — it reflects organisational history. Different teams, functions, or regions evolved independent representations of the same business concepts over time. Remediation is therefore an organisational process as much as a technical one.

For each overlap pattern, the organisation should follow the same decision framework:

Step 1: Audit why the overlap exists

Identify which teams, functions, or systems own each overlapping representation. Understand the historical reason for the duplication — was it a performance optimisation, an organisational boundary, a source system constraint, or organic drift?

Step 2: Decide — converge, deprecate, or make explicitly distinct

Three outcomes, in order of preference:

  1. Converge — if one representation is a superset or refinement of the other, retire the weaker one and map historical data to the surviving representation. This eliminates the overlap entirely. Highest effort, highest value.
  2. Deprecate — if one representation serves a shrinking user base or a legacy process, sunset it with a migration timeline. Keep it accessible in archive tables but remove it from the agent-facing model.
  3. Make explicitly distinct — if both representations are genuinely needed and serve different audiences, the data model must make them structurally impossible to confuse. Different table names, different column names, different foreign keys. The two systems should look as different in the schema as they are in the business.

Step 3: Assign ownership

Each representation should have a named owner (team or role) responsible for its definition, maintenance, and the decision about whether it belongs in the agent-facing model. Unowned representations are how semantic overlap accumulates over time.

Step 4: Define the agent-facing scope

Not everything in the data warehouse needs to be visible to the agent. The organisation should explicitly decide which measures, which hierarchies, which dimension variants are in scope for agentic consumption — and curate an agent-facing layer that contains only those. This isn't a semantic view problem — it's a data governance decision about what the agent is allowed to see.
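One way to materialise that governance decision is a dedicated agent-facing schema whose views expose only the sanctioned columns, granted to the agent's role and nothing else. All object names below are hypothetical:

```sql
-- Illustrative sketch of an agent-facing layer.
CREATE SCHEMA IF NOT EXISTS analytics.agent_facing;

-- Expose only the sanctioned measure, under its canonical name.
CREATE OR REPLACE VIEW analytics.agent_facing.fact_sales AS
SELECT order_id, order_date, customer_id,
       revenue_reported AS revenue
FROM analytics.core.fact_sales;

-- The agent's role sees this schema and nothing else.
GRANT USAGE  ON SCHEMA analytics.agent_facing TO ROLE agent_role;
GRANT SELECT ON ALL VIEWS IN SCHEMA analytics.agent_facing TO ROLE agent_role;
```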

Pattern-specific guidance

| Pattern | Converge | Deprecate | Make explicitly distinct |
| --- | --- | --- | --- |
| Multiple measures for the same business concept | Standardise on a single definition of each metric. If "adjusted" and "reported" revenue both exist, determine which is the organisational standard and make the other available only through an explicit opt-in column or separate table. | If one metric variant is only used by a legacy reporting process, retire it and redirect consumers to the standard metric. | If both variants are actively needed, ensure the column names are unambiguous without context (e.g., never just revenue — always revenue_adjusted and revenue_reported). |
| Entity tables multiplied by dimension | Merge into a single table with a dimension column. Query performance concerns that originally motivated table splitting are better addressed through clustering keys, search optimisation, and partition pruning on the dimension column — these approaches scale well and avoid the semantic overlap entirely. | If one dimension variant serves a decommissioned reporting process, remove it from the model. | N/A — converge is the correct path. |
| Derived period lengths stored as facts | Store only the base granularity and compute derived periods at query time. Pre-computation was a BI performance optimisation — agents generate SQL and can include window functions. Where query performance is a concern, materialised views with query rewrite provide the best of both worlds: the agent queries the base table with simple SQL, and the query planner transparently rewrites to pre-computed results at runtime. | If a specific period length (e.g., L3M) is only used by one team, remove it from the shared model and let that team maintain it in their own layer. | If pre-computed period lengths are retained for other reasons, materialise as separate tables with unambiguous names (fact_monthly, fact_ltm, fact_ytd) — never mix granularities in the same table. |
| Multiple data contexts mixed into a single table | Split into separate tables per context — one grain, one meaning per table. Each table should be safe to aggregate without filtering. This often means separating what an ERP delivered as a single extract into the distinct business views it actually represents. | If one context (e.g., forecast scenario) is used by a small audience, move it to a separate schema or restrict it from the agent-facing model. | If mixing is retained for storage efficiency, add explicit dimension columns and create agent-facing views that pre-filter to a single context. |
| Parallel classification systems for the same entity | Determine whether one classification system can serve as the organisational standard. If global and local product hierarchies both exist, evaluate whether one can be retired or whether a mapping can be established. This is a cross-functional conversation, not a DDL change. | If one hierarchy serves a shrinking user base (e.g., a regional classification being superseded by a global standard), sunset it with a migration timeline. | If both hierarchies are genuinely needed, make them structurally impossible to confuse: separate dimension tables, unambiguous names at every level (never just category — always global_category and local_category), separate foreign keys on the fact table. |
| Overloaded column names | Establish a naming convention where column names are self-describing regardless of which table they appear in. This is a data modelling standards decision that affects all future development. | N/A — column names typically can't be deprecated, only renamed. | Rename to be unambiguous at the column level (owner_country, customer_country, sold_to_country). If renaming source tables isn't feasible, create agent-facing views that alias them. |
| Default column not obvious from schema | Determine the organisational standard metric for each business concept and document it as a data governance decision. The standard column should be the one with the simplest, most intuitive name. | If non-standard variants are only used for reconciliation or audit, move them to a separate audit schema or table. Keep them accessible but out of the agent's default scope. | Create an agent-facing view exposing only the standard columns with canonical names. Other variants remain in the base table for advanced users. |
| Business time vs system time | Establish a single, authoritative source for "current period" and make it structurally accessible — an is_current_period boolean on the calendar dimension, maintained automatically. The concept of "now" should be resolvable from the schema without instructions. | N/A — business time is always needed alongside system time. | Ensure the calendar dimension makes business time explicit and unambiguous. Column names like is_current_billing_period are self-describing; relying on a separate control table that the agent must know to join is not. |
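For the overloaded-column-names row, the "make explicitly distinct" path when source tables can't be renamed looks roughly like this. Table and column names are hypothetical:

```sql
-- Sketch: an agent-facing view that aliases each "country" column to a
-- self-describing name, recovering structurally the context a BI tool's
-- join layer would otherwise provide.
CREATE OR REPLACE VIEW agent_facing.shipments AS
SELECT
    s.shipment_id,
    o.country  AS owner_country,
    c.country  AS customer_country,
    st.country AS sold_to_country,
    m.country  AS market_country
FROM core.shipments  s
JOIN core.owners     o  ON s.owner_id   = o.owner_id
JOIN core.customers  c  ON s.customer_id = c.customer_id
JOIN core.sold_to    st ON s.sold_to_id = st.sold_to_id
JOIN core.markets    m  ON s.market_id  = m.market_id;
```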

Relationship to the Cortex Code bundled semantic-view skill

The Cortex Code bundled semantic-view skill operates at the layer above the data model — it creates and optimizes semantic views on top of whatever physical model exists. It can paper over overlap with more tokens (longer descriptions, more verified queries to constrain interpretation), but it does not evaluate or flag the root cause.

semantic_overlap is complementary: it detects that the problem exists at the physical layer, giving organisations the information needed to make structural decisions about their data model rather than compensate in the semantic layer.

Files to create

requirements/
  semantic_overlap/
    snowflake/
      check.sql                          — tier 1: structural overlap score (SQL-only)
      check.cortex.sql                   — tier 2: semantic overlap score (Cortex AI functions)
      diagnostic.sql                     — tier 1: list structural overlap findings
      diagnostic.cortex.sql              — tier 2: list semantic overlap findings
      fix.add-disambiguation-comments.sql — add distinguishing comments to overlapping columns

Update requirements.yaml with the new entry and add to profiles/agents.yaml under the Contextual stage.
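To make the remediation file concrete, fix.add-disambiguation-comments.sql might emit statements along these lines (object names and comment text are illustrative, not generated output):

```sql
-- Sketch: distinguishing comments for a pair of overlapping revenue columns.
COMMENT ON COLUMN sales.fact_orders.revenue_adjusted IS
    'Adjusted revenue: organisational standard for management reporting. See revenue_reported for the statutory figure.';
COMMENT ON COLUMN sales.fact_orders.revenue_reported IS
    'Reported (statutory) revenue. Use revenue_adjusted for standard management reporting.';
```

Comments like these are exactly the metadata the tier-1 check reads, so applying the fix directly improves the score on a re-scan.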

Context

The agents profile already sets a high bar for Contextual metadata (relationship_declaration: 1.00, semantic_documentation: 0.95). Adding overlap detection closes the gap between "metadata exists" and "metadata is useful to an agent." Without it, a schema can score 100% on Contextual checks while still being practically unusable for Text-to-SQL, because the agent cannot distinguish between competing interpretations of the same business concept.

The tiered approach (SQL-only base + Cortex-powered variant) ensures the requirement is accessible without Cortex credits while offering deeper analysis for users who want it.

Labels

enhancement (New feature or request)