
[Part 2 of 3]: Add in-memory artifact store and extraction layer #780

Closed
jairus-m wants to merge 4 commits into main from jairus/get-artifacts-in-memory

jairus-m (Collaborator) commented May 14, 2026

Summary

Part 2 of 3 to replace get_job_run_artifacts with an in-memory DuckDB store that lets
LLMs query and run full-text search over dbt job run artifacts.

PR sequence:

  1. Artifact parsing infrastructure ([Part 1 of 3]: Replace hand-rolled Pydantic schemas with dbt-artifacts-parser #745)
  2. ArtifactStore + extraction layer (this PR)
  3. Tools + MCP wiring

What Changed

  • Added duckdb>=1.5.2 dependency for the in-memory analytical store
  • Created ArtifactStore class (store.py) — manages an in-memory DuckDB database with:
    • load_artifact() — parse, extract, and bulk-insert a single artifact
    • query() — read-only SQL with keyword-level mutation guard and 500-row cap
    • search() — BM25 full-text search via DuckDB's FTS extension
    • reset() / close() lifecycle methods
    • Deferred indexing support (reindex=False + build_all_indexes()) for batch loads
  • Created extraction layer (extractors.py) — converts parsed artifact dicts into DuckDB row tuples:
    • extract_from_manifest → nodes, node_columns, edges, test_metadata, exposures, metrics, groups, macros
    • extract_from_catalog → catalog_tables, catalog_stats, + column merge into node_columns
    • extract_from_run_results → invocations, run_results
    • extract_from_sources → source_freshness
  • Created table definitions (tables.py) — TableConfig dataclass with DDL, FTS columns, and index columns for all 13 tables
  • Added error hierarchy (artifact_search.py) — ArtifactSearchError (server) and ArtifactNotLoadedError (client)
  • Extracted ClientToolCallError / ServerToolCallError type unions into classification.py
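The "keyword-level mutation guard and 500-row cap" described for query() can be pictured roughly as follows. This is a minimal sketch, not the code in store.py: the names `assert_read_only`, `cap_rows`, and `MAX_ROWS`, the keyword list, and the choice of `ValueError` are all assumptions made for illustration.

```python
import re

# Hypothetical names and keyword list; the PR's real guard may differ.
MAX_ROWS = 500  # row cap mentioned in the PR description
_MUTATING_KEYWORDS = (
    "insert", "update", "delete", "drop", "alter", "create",
    "copy", "attach", "detach", "install", "load",
)
_KEYWORD_RE = re.compile(
    r"\b(" + "|".join(_MUTATING_KEYWORDS) + r")\b", re.IGNORECASE
)
# Matches single-quoted SQL string literals, including '' escapes.
_STRING_LITERAL_RE = re.compile(r"'(?:[^']|'')*'")


def assert_read_only(sql: str) -> None:
    """Raise ValueError if the statement contains a mutating keyword."""
    # Blank out string literals so text such as 'drop me' inside a WHERE
    # clause is not mistaken for a keyword.
    without_literals = _STRING_LITERAL_RE.sub("''", sql)
    match = _KEYWORD_RE.search(without_literals)
    if match:
        raise ValueError(
            f"read-only queries only; found keyword {match.group(1)!r}"
        )


def cap_rows(sql: str, limit: int = MAX_ROWS) -> str:
    """Wrap the query in a subquery so at most `limit` rows come back."""
    inner = sql.rstrip().rstrip(";")
    return f"SELECT * FROM ({inner}) AS capped LIMIT {limit}"
```

A guard of this kind matches whole words only, so a column such as max_loaded_at passes, but it can still reject a legitimate query that uses one of the keywords as an identifier. That trade-off is inherent to keyword matching as opposed to parsing the statement.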

Related Issues

Related to #413

Checklist

  • I have performed a self-review of my code
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have made corresponding changes to the documentation (in https://github.com/dbt-labs/docs.getdbt.com) if required

Mermaid ERD

erDiagram
    INVOCATIONS {
        int id PK
        int run_id
        varchar invocation_id
        varchar command
        varchar dbt_version
        float elapsed_time
    }
    RUN_RESULTS {
        int id PK
        int run_id
        varchar unique_id FK
        varchar invocation_id FK
        varchar status
        float execution_time
        text message
    }
    SOURCE_FRESHNESS {
        int id PK
        int run_id
        varchar unique_id FK
        varchar invocation_id FK
        varchar status
        varchar max_loaded_at
    }
    NODES {
        int id PK
        int run_id
        varchar unique_id
        varchar name
        varchar resource_type
        text description
        text raw_code
        text compiled_code
    }
    NODE_COLUMNS {
        int id PK
        int run_id
        varchar unique_id FK
        varchar column_name
        varchar declared_type
        varchar catalog_type
        varchar data_type
    }
    EDGES {
        int id PK
        int run_id
        varchar parent_unique_id FK
        varchar child_unique_id FK
        varchar edge_type
    }
    TEST_METADATA {
        int id PK
        int run_id
        varchar unique_id FK
        varchar test_name
        varchar attached_node FK
    }
    CATALOG_TABLES {
        int id PK
        int run_id
        varchar unique_id FK
        varchar table_type
        varchar database_name
        varchar schema_name
    }
    CATALOG_STATS {
        int id PK
        int run_id
        varchar unique_id FK
        varchar stat_id
        varchar stat_value
    }
    EXPOSURES {
        int id PK
        int run_id
        varchar unique_id
        varchar name
        varchar exposure_type
    }
    METRICS {
        int id PK
        int run_id
        varchar unique_id
        varchar name
        varchar metric_type
    }
    GROUPS {
        int id PK
        int run_id
        varchar unique_id
        varchar name
    }
    MACROS {
        int id PK
        int run_id
        varchar unique_id
        varchar name
        text macro_sql
    }

    INVOCATIONS ||--o{ RUN_RESULTS : "invocation_id"
    INVOCATIONS ||--o{ SOURCE_FRESHNESS : "invocation_id"
    NODES ||--o{ NODE_COLUMNS : "unique_id"
    NODES ||--o{ EDGES : "parent / child"
    NODES ||--o{ TEST_METADATA : "attached_node"
    NODES ||--o| CATALOG_TABLES : "unique_id"
    NODES ||--o{ CATALOG_STATS : "unique_id"
    NODES ||--o{ RUN_RESULTS : "unique_id"
    NODES ||--o{ SOURCE_FRESHNESS : "unique_id"
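To make the NODES to RUN_RESULTS relationship in the diagram concrete, here is a small sketch of the kind of join an LLM might issue through query(). It uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies; the SQL itself is plain and is expected, though not verified here, to carry over. The sample rows are made up, the tables carry only a subset of the columns in the diagram, and joining on run_id in addition to unique_id is an assumption about how multiple loaded runs would be kept apart.

```python
import sqlite3  # stand-in for DuckDB; the PR itself uses an in-memory DuckDB database

con = sqlite3.connect(":memory:")

# Column names follow the ERD; types are simplified for sqlite3.
con.executescript(
    """
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY, run_id INTEGER, unique_id TEXT,
        name TEXT, resource_type TEXT
    );
    CREATE TABLE run_results (
        id INTEGER PRIMARY KEY, run_id INTEGER, unique_id TEXT,
        invocation_id TEXT, status TEXT, execution_time REAL, message TEXT
    );
    """
)

# Made-up sample rows, for illustration only.
con.executemany(
    "INSERT INTO nodes (run_id, unique_id, name, resource_type) VALUES (?, ?, ?, ?)",
    [
        (1, "model.shop.orders", "orders", "model"),
        (1, "model.shop.customers", "customers", "model"),
    ],
)
con.executemany(
    "INSERT INTO run_results (run_id, unique_id, invocation_id, status, "
    "execution_time, message) VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "model.shop.orders", "inv-1", "error", 2.5, "Database Error"),
        (1, "model.shop.customers", "inv-1", "success", 1.0, None),
    ],
)

# Which nodes failed in a run, and why?
FAILED_NODES_SQL = """
SELECT n.name, n.resource_type, r.status, r.message
FROM run_results AS r
JOIN nodes AS n
  ON n.unique_id = r.unique_id AND n.run_id = r.run_id
WHERE r.status IN ('error', 'fail')
ORDER BY r.execution_time DESC
"""
failed = con.execute(FAILED_NODES_SQL).fetchall()
```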

jairus-m force-pushed the jairus/get-artifacts-in-memory branch from 86e5db5 to b50d4bf May 14, 2026 22:20
jairus-m force-pushed the jairus/get-artifacts-in-memory branch from b50d4bf to ed6d9c3 May 20, 2026 02:21
jairus-m marked this pull request as ready for review May 20, 2026 04:36
jairus-m requested review from a team, b-per and jasnonaz as code owners May 20, 2026 04:36
jairus-m requested a review from Copilot May 20, 2026 18:17
Copilot AI (Contributor) left a comment

Pull request overview

Introduces an in-memory DuckDB-backed ArtifactStore and an extraction layer to load parsed dbt artifacts into normalized tables, enabling SQL querying and DuckDB FTS (BM25) search as groundwork for replacing get_job_run_artifacts.

Changes:

  • Added DuckDB dependency and lockfile updates.
  • Implemented ArtifactStore (load/reset/query/search/indexing) plus table schemas for artifact-derived tables.
  • Added artifact extractors (manifest/catalog/run_results/sources) and new artifact-store error types; refactored tool-call error unions into errors/classification.py.
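As a rough picture of what "extracting row tuples from parsed artifact dicts" involves, the sketch below walks a manifest dict and emits rows for the nodes and edges tables. It is not the code in extractors.py: the function name `extract_nodes_and_edges`, the column order, and the `"depends_on"` edge_type value are hypothetical. It relies only on the documented manifest layout, where `nodes` is keyed by unique_id and each node lists its parents under `depends_on.nodes`.

```python
from typing import Any

# Hypothetical row shapes: (run_id, unique_id, name, resource_type, description)
NodeRow = tuple[int, str, str, str, str]
# (run_id, parent_unique_id, child_unique_id, edge_type)
EdgeRow = tuple[int, str, str, str]


def extract_nodes_and_edges(
    manifest: dict[str, Any], run_id: int
) -> tuple[list[NodeRow], list[EdgeRow]]:
    """Flatten manifest nodes into row tuples ready for bulk insert."""
    node_rows: list[NodeRow] = []
    edge_rows: list[EdgeRow] = []
    for unique_id, node in manifest.get("nodes", {}).items():
        node_rows.append(
            (
                run_id,
                unique_id,
                node.get("name", ""),
                node.get("resource_type", ""),
                node.get("description") or "",
            )
        )
        # Each parent listed under depends_on.nodes becomes one edge row.
        for parent_id in node.get("depends_on", {}).get("nodes", []):
            edge_rows.append((run_id, parent_id, unique_id, "depends_on"))
    return node_rows, edge_rows
```

Returning plain tuples keeps the extraction step independent of the database, so the same rows can be handed to a bulk insert such as `executemany` without further conversion.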

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| --- | --- |
| uv.lock | Locks the added DuckDB dependency. |
| pyproject.toml | Adds duckdb>=1.5.2 runtime dependency. |
| .changes/unreleased/Under the Hood-20260514-151517.yaml | Changelog entry for the new artifact store/extraction layer. |
| src/dbt_mcp/dbt_admin/run_artifacts/tables.py | Defines DuckDB table DDL + FTS/index configuration for artifact tables. |
| src/dbt_mcp/dbt_admin/run_artifacts/extractors.py | Extracts DuckDB-ready row tuples from parsed artifact dicts. |
| src/dbt_mcp/dbt_admin/run_artifacts/store.py | Implements in-memory DuckDB store with loading, query guard, FTS search, and indexing. |
| src/dbt_mcp/errors/artifact_search.py | Adds artifact-store specific error hierarchy. |
| src/dbt_mcp/errors/classification.py | Centralizes client/server tool-call error union types. |
| src/dbt_mcp/errors/__init__.py | Re-exports new errors and the new classification unions. |
| tests/unit/dbt_admin/run_artifacts/test_store.py | Adds unit/integration-style tests for store behavior (query/search/load/merge). |
| tests/unit/dbt_admin/run_artifacts/__init__.py | Adds test package marker. |
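The description of tables.py (a TableConfig dataclass carrying DDL, FTS columns, and index columns) together with the deferred-indexing option suggests a shape like the one below. This is only a sketch under assumptions: the field names, the `index_statements` helper, the index naming scheme, and the specific FTS and index column choices for nodes are invented for illustration and may not match the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableConfig:
    """Hypothetical shape: one config object per artifact table."""

    name: str
    ddl: str
    fts_columns: tuple[str, ...] = ()    # columns fed to full-text search
    index_columns: tuple[str, ...] = ()  # columns that get a regular index

    def index_statements(self) -> list[str]:
        """Build one CREATE INDEX statement per configured column."""
        return [
            f"CREATE INDEX IF NOT EXISTS idx_{self.name}_{col} "
            f"ON {self.name} ({col})"
            for col in self.index_columns
        ]


# Columns follow the NODES entity in the ERD; the FTS/index choices are guesses.
NODES = TableConfig(
    name="nodes",
    ddl=(
        "CREATE TABLE IF NOT EXISTS nodes ("
        "id INTEGER PRIMARY KEY, run_id INTEGER, unique_id VARCHAR, "
        "name VARCHAR, resource_type VARCHAR, description TEXT, "
        "raw_code TEXT, compiled_code TEXT)"
    ),
    fts_columns=("name", "description", "raw_code", "compiled_code"),
    index_columns=("run_id", "unique_id"),
)


def build_all_index_statements(tables: list[TableConfig]) -> list[str]:
    """Collect index statements for every table, e.g. after a batch load."""
    return [stmt for table in tables for stmt in table.index_statements()]
```

Keeping index creation as a separate step is what makes a deferred mode possible: a batch loader can insert every artifact first and run the collected statements once at the end, instead of rebuilding indexes after each load.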


Comment thread src/dbt_mcp/dbt_admin/run_artifacts/store.py
Comment thread src/dbt_mcp/dbt_admin/run_artifacts/store.py
Comment thread src/dbt_mcp/dbt_admin/run_artifacts/extractors.py Outdated
Comment thread src/dbt_mcp/dbt_admin/run_artifacts/store.py Outdated
jairus-m added 2 commits May 20, 2026 20:38
Introduces DuckDB-backed store, row extractors for all 4 artifact types
(manifest, catalog, run_results, sources), table DDL definitions, and
error hierarchy for the ARTIFACT_SEARCH toolset (PR 2 of 3).
Copilot AI review requested due to automatic review settings May 21, 2026 03:38
jairus-m force-pushed the jairus/get-artifacts-in-memory branch from 7ba40e4 to a8c801c May 21, 2026 03:38
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.

Comment thread src/dbt_mcp/dbt_admin/run_artifacts/store.py Outdated
Comment thread src/dbt_mcp/dbt_admin/run_artifacts/store.py Outdated
Comment thread src/dbt_mcp/dbt_admin/run_artifacts/extractors.py
Comment thread src/dbt_mcp/errors/artifact_search.py Outdated


2 participants