Commit 73757b0

jairus-m and claude authored
[Part 1 of 3]: Replace hand-rolled Pydantic schemas with dbt-artifacts-parser (#745)
## Summary

Part 1 of 3 to replace `get_job_run_artifacts` with an in-memory DuckDB store that lets LLMs query and run full-text search over dbt job run artifacts.

PR sequence:
1. **Artifact parsing infrastructure** (this PR)
2. ArtifactStore + extraction layer
3. Tools + MCP wiring

## What Changed

- Added `dbt-artifacts-parser~=0.13.2` dependency — handles dbt schema version differences internally
- Replaced hand-rolled schemas in `config.py` with a new `artifacts/` subpackage:
  - Per-artifact `parse()` functions (`manifest.py`, `catalog.py`, `run_results.py`, `sources.py`)
  - `ARTIFACT_PARSERS` dispatch dict for the upcoming extraction layer
  - `LenientXxx` Pydantic fallback models for Fusion/preview builds that deviate from published schemas
- Split `config.py` into `schemas/job_run.py` (Admin API shapes) and `schemas/output.py` (tool output contract)
- Refactored `ErrorFetcher` / `WarningFetcher` to use the new artifact modules — no behavior change

## Related Issues

Related to #413

## Checklist

- [x] I have performed a self-review of my code
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [ ] I have made corresponding changes to the documentation (in https://github.com/dbt-labs/docs.getdbt.com) if required

## Additional Notes

`dbt-artifacts-parser` has type annotations on its `parse_*()` functions but lacks a [py.typed PEP 561 marker file](https://peps.python.org/pep-0561/). Without it, mypy treats the package as untyped, requiring `# type: ignore[import-untyped]` at every import site.

NOTE: Addressed the above in yu-iskw/dbt-artifacts-parser#228 which was included in [release 0.13.2](https://github.com/yu-iskw/dbt-artifacts-parser/releases/tag/v0.13.2)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent e7c716a commit 73757b0

21 files changed

Lines changed: 984 additions & 257 deletions

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+kind: Under the Hood
+body: '[Part 1 of 3]: Use dbt-artifacts-parser schemas for artifact schema parsing'
+time: 2026-04-30T12:52:34.933451-10:00

pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -44,10 +44,11 @@ dependencies = [
     "httpx~=0.28.1",
     "filelock~=3.20.3",
     "starlette~=0.50.0",
+    "dbt-artifacts-parser>=0.13.2",
 ]
 [tool.uv]
 exclude-newer = "7 days"
-exclude-newer-package = { dbt-protos = false, dbt-sl-sdk = false, dbtlabs-vortex = false }
+exclude-newer-package = { dbt-protos = false, dbt-sl-sdk = false, dbtlabs-vortex = false, dbt-artifacts-parser = false}

 [dependency-groups]
 dev = [
Lines changed: 0 additions & 3 deletions
@@ -1,3 +0,0 @@
-from .parser import ErrorFetcher, WarningFetcher
-
-__all__ = ["ErrorFetcher", "WarningFetcher"]
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"""Artifact parsing modules for dbt Cloud job run artifacts."""
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+"""Parsing and mapping for catalog.json artifacts."""
+
+from __future__ import annotations
+
+import logging
+from typing import Any
+
+from dbt_artifacts_parser.parser import parse_catalog
+from dbt_artifacts_parser.parsers.catalog.catalog_v1 import CatalogV1
+
+from dbt_mcp.dbt_admin.run_artifacts.artifacts.lenient import LenientCatalog
+
+logger = logging.getLogger(__name__)
+
+CatalogParsed = CatalogV1 | LenientCatalog
+
+
+def parse(raw: dict[str, Any]) -> CatalogParsed:
+    """Parse catalog.json using dbt-artifacts-parser (version-aware).
+
+    Falls back to ``LenientCatalog`` when strict Pydantic validation fails.
+    """
+    try:
+        return parse_catalog(catalog=raw)
+    except Exception as e:
+        logger.warning(
+            "Strict catalog parsing failed (%s: %s); "
+            "falling back to lenient dict-based parsing.",
+            type(e).__name__,
+            str(e)[:200],
+        )
+        return LenientCatalog.model_validate(raw)
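The strict-then-lenient control flow in `parse()` can be tried in isolation without installing `dbt-artifacts-parser`. The sketch below is an illustration only: `strict_parse`, `lenient_parse`, and `parse_with_fallback` are hypothetical stand-ins (not names from this commit) for `parse_catalog()`, `LenientCatalog.model_validate()`, and the module's `parse()` respectively, and they return plain dicts rather than Pydantic models.

```python
from __future__ import annotations

import logging
from typing import Any, Callable

logger = logging.getLogger("artifact_fallback_sketch")

Parser = Callable[[dict[str, Any]], dict[str, Any]]


def strict_parse(raw: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for a strict parser such as parse_catalog()."""
    if not isinstance(raw.get("nodes"), dict):
        raise ValueError("nodes: expected a mapping")
    return {"kind": "strict", "nodes": raw["nodes"]}


def lenient_parse(raw: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for LenientCatalog.model_validate()."""
    nodes = raw.get("nodes")
    # Mirrors the coerce_dict idea: anything that is not a dict becomes {}.
    return {"kind": "lenient", "nodes": nodes if isinstance(nodes, dict) else {}}


def parse_with_fallback(
    raw: dict[str, Any], strict: Parser, lenient: Parser
) -> dict[str, Any]:
    """Try the strict parser first; on any error, log and use the lenient one."""
    try:
        return strict(raw)
    except Exception as exc:
        logger.warning(
            "Strict parsing failed (%s: %s); falling back to lenient parsing.",
            type(exc).__name__,
            str(exc)[:200],
        )
        return lenient(raw)


ok = parse_with_fallback({"nodes": {"model.a": {}}}, strict_parse, lenient_parse)
odd = parse_with_fallback({"nodes": None}, strict_parse, lenient_parse)
print(ok["kind"], odd["kind"])
```

Because the `except` clause catches `Exception` broadly, any failure in the strict path (not only validation errors) takes the lenient route; the warning log is the only signal that this happened.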
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
+"""Lenient Pydantic schemas used as fallbacks when dbt-artifacts-parser fails.
+
+dbt-artifacts-parser uses Pydantic internally with strict enum validation.
+It fails on real-world artifacts that deviate from the published schema —
+e.g. a ``"reused"`` status from incremental builds, or preview dbt versions
+that emit extra fields.
+
+These schemas are maximally permissive:
+- ``extra="allow"`` — unknown fields don't cause failures
+- All non-essential fields are optional with safe defaults
+- ``status`` is ``str | None`` (not an enum) — accepts any value dbt may emit
+
+The ``parse()`` functions in each artifact module always return a Pydantic
+``BaseModel`` — either the strict dbt-artifacts-parser model (happy path) or
+one of these lenient models (fallback). Downstream extractors receive a typed
+object in both cases.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+from pydantic import BaseModel, ConfigDict, Field, field_validator
+
+
+class LenientRunResultsResult(BaseModel):
+    model_config = ConfigDict(extra="allow")
+
+    status: str | None = None
+    unique_id: str | None = None
+    relation_name: str | None = None
+    message: str | None = None
+    compiled_code: str | None = None
+    compiled_sql: str | None = None  # older dbt versions used compiled_sql
+
+
+class LenientRunResultsArgs(BaseModel):
+    model_config = ConfigDict(extra="allow")
+
+    target: str | None = None
+
+
+class LenientRunResults(BaseModel):
+    model_config = ConfigDict(extra="allow")
+
+    results: list[LenientRunResultsResult] = Field(default_factory=list)
+    args: LenientRunResultsArgs | None = None
+
+    @field_validator("results", mode="before")
+    @classmethod
+    def coerce_results(cls, v: Any) -> list[Any]:
+        return v if isinstance(v, list) else []
+
+
+class LenientSourceResult(BaseModel):
+    model_config = ConfigDict(extra="allow")
+
+    status: str | None = None
+    unique_id: str | None = None
+    max_loaded_at_time_ago_in_s: float | None = None
+
+
+class LenientSources(BaseModel):
+    model_config = ConfigDict(extra="allow")
+
+    results: list[LenientSourceResult] = Field(default_factory=list)
+
+    @field_validator("results", mode="before")
+    @classmethod
+    def coerce_results(cls, v: Any) -> list[Any]:
+        return v if isinstance(v, list) else []
+
+
+class LenientCatalog(BaseModel):
+    """Minimal lenient catalog schema — nodes/sources dicts for PR 2/3 extraction."""
+
+    model_config = ConfigDict(extra="allow")
+
+    nodes: dict[str, Any] = Field(default_factory=dict)
+    sources: dict[str, Any] = Field(default_factory=dict)
+
+    @field_validator("nodes", "sources", mode="before")
+    @classmethod
+    def coerce_dict(cls, v: Any) -> dict[str, Any]:
+        return v if isinstance(v, dict) else {}
+
+
+class LenientManifest(BaseModel):
+    """Minimal lenient manifest schema — nodes/sources dicts for PR 2/3 extraction."""
+
+    model_config = ConfigDict(extra="allow")
+
+    nodes: dict[str, Any] = Field(default_factory=dict)
+    sources: dict[str, Any] = Field(default_factory=dict)
+
+    @field_validator("nodes", "sources", mode="before")
+    @classmethod
+    def coerce_dict(cls, v: Any) -> dict[str, Any]:
+        return v if isinstance(v, dict) else {}
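The module docstring gives a `"reused"` status as a case where strict enum validation rejects a real artifact. A standard-library sketch of that failure mode, and of the coercion rule the `coerce_results` validators apply, might look like the following. Everything here is hypothetical illustration: `StrictStatus` is an invented closed enum (not the actual enum in `dbt-artifacts-parser`), and `LenientResult` is a plain dataclass analogue of the Pydantic model, so it does not reproduce `extra="allow"`.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class StrictStatus(str, Enum):
    """Hypothetical closed set of statuses standing in for a schema enum."""

    SUCCESS = "success"
    ERROR = "error"
    SKIPPED = "skipped"


@dataclass
class LenientResult:
    """Plain-Python analogue of a lenient result: status is any str or None."""

    status: str | None = None
    unique_id: str | None = None


def coerce_results(value: object) -> list:
    """Same rule as the coerce_results validators: non-lists become []."""
    return value if isinstance(value, list) else []


# A value outside the enum is rejected by the strict type...
try:
    StrictStatus("reused")
    strict_accepted = True
except ValueError:
    strict_accepted = False

# ...but is carried through unchanged by the lenient one.
lenient = LenientResult(status="reused", unique_id="model.pkg.orders")

print(strict_accepted, lenient.status, coerce_results(None))
```

The trade-off is that the lenient path no longer tells downstream code which status values are possible, so consumers have to treat `status` as free-form text.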
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+"""Parsing and mapping for manifest.json artifacts."""
+
+from __future__ import annotations
+
+import logging
+from typing import Any
+
+from dbt_artifacts_parser.parser import parse_manifest
+from dbt_artifacts_parser.parsers.manifest.manifest_v1 import ManifestV1
+from dbt_artifacts_parser.parsers.manifest.manifest_v2 import ManifestV2
+from dbt_artifacts_parser.parsers.manifest.manifest_v3 import ManifestV3
+from dbt_artifacts_parser.parsers.manifest.manifest_v4 import ManifestV4
+from dbt_artifacts_parser.parsers.manifest.manifest_v5 import ManifestV5
+from dbt_artifacts_parser.parsers.manifest.manifest_v6 import ManifestV6
+from dbt_artifacts_parser.parsers.manifest.manifest_v7 import ManifestV7
+from dbt_artifacts_parser.parsers.manifest.manifest_v8 import ManifestV8
+from dbt_artifacts_parser.parsers.manifest.manifest_v9 import ManifestV9
+from dbt_artifacts_parser.parsers.manifest.manifest_v10 import ManifestV10
+from dbt_artifacts_parser.parsers.manifest.manifest_v11 import ManifestV11
+from dbt_artifacts_parser.parsers.manifest.manifest_v12 import ManifestV12
+
+from dbt_mcp.dbt_admin.run_artifacts.artifacts.lenient import LenientManifest
+
+logger = logging.getLogger(__name__)
+
+ManifestParsed = (
+    ManifestV1
+    | ManifestV2
+    | ManifestV3
+    | ManifestV4
+    | ManifestV5
+    | ManifestV6
+    | ManifestV7
+    | ManifestV8
+    | ManifestV9
+    | ManifestV10
+    | ManifestV11
+    | ManifestV12
+    | LenientManifest
+)
+
+
+def parse(raw: dict[str, Any]) -> ManifestParsed:
+    """Parse manifest.json using dbt-artifacts-parser (version-aware).
+
+    Falls back to ``LenientManifest`` when strict Pydantic validation fails.
+    This covers preview / unreleased dbt versions that
+    emit a manifest claiming a published schema version (e.g. v12) but
+    containing additional fields not yet in that schema.
+    """
+    try:
+        return parse_manifest(manifest=raw)
+    except Exception as e:
+        logger.warning(
+            "Strict manifest parsing failed (%s: %s); "
+            "falling back to lenient dict-based parsing. "
+            "This typically occurs with dbt preview builds.",
+            type(e).__name__,
+            str(e)[:200],
+        )
+        return LenientManifest.model_validate(raw)
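The warning in `parse()` passes `type(e).__name__` and `str(e)[:200]` as lazy `%s` arguments, which caps how much of a validation error reaches the log. A small standard-library sketch of that behaviour follows; `ListHandler`, `log_strict_failure`, and the logger name are hypothetical helpers introduced only to capture and inspect the formatted message.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical test helper: collects formatted log messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("manifest_log_sketch")
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the sketch's output out of the root logger
handler = ListHandler()
logger.addHandler(handler)


def log_strict_failure(exc: Exception) -> None:
    """Log the exception type and at most 200 characters of its message."""
    logger.warning(
        "Strict manifest parsing failed (%s: %s); "
        "falling back to lenient dict-based parsing.",
        type(exc).__name__,
        str(exc)[:200],
    )


# A validation error on a large manifest can be very long; simulate one.
log_strict_failure(ValueError("x" * 5000))
message = handler.messages[0]
print(len(message))
```

The cap keeps log lines bounded, at the cost of possibly cutting off the part of a long error that names the offending field.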
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+"""ArtifactType enum and ARTIFACT_PARSERS dispatch table.
+
+ARTIFACT_PARSERS always returns a plain ``dict[str, Any]``:
+- Happy path: strict dbt-artifacts-parser Pydantic model → ``.model_dump(mode="json")``
+  which normalises enums to strings and aliases (e.g. ``schema_``) to their JSON
+  keys (e.g. ``"schema"``).
+- Fallback: raw dict passed through as-is — same JSON shape, just unvalidated.
+
+Downstream extractors can therefore use ``.get()`` uniformly on every path.
+
+Note: the ``parse()`` helpers in the sibling artifact modules (manifest.py, catalog.py,
+run_results.py, sources.py) are a separate API used by the job error/warning fetcher in
+parser.py and are intentionally left unchanged.
+"""
+
+from __future__ import annotations
+
+import logging
+from collections.abc import Callable
+from enum import Enum
+from typing import Any
+
+from dbt_artifacts_parser.parser import (
+    parse_catalog,
+    parse_manifest,
+    parse_run_results,
+    parse_sources,
+)
+
+logger = logging.getLogger(__name__)
+
+
+class ArtifactType(str, Enum):
+    RUN_RESULTS = "run_results.json"
+    SOURCES = "sources.json"
+    MANIFEST = "manifest.json"
+    CATALOG = "catalog.json"
+
+
+def _to_dict(raw: dict[str, Any], strict_parse_fn: Callable[[], Any]) -> dict[str, Any]:
+    """Try strict parsing and dump to a plain dict; fall back to raw on any error."""
+    try:
+        return strict_parse_fn().model_dump(mode="json")
+    except Exception as exc:
+        logger.warning(
+            "Strict artifact parsing failed (%s: %s); falling back to raw dict. "
+            "This is expected for dbt Fusion or preview builds.",
+            type(exc).__name__,
+            str(exc)[:200],
+        )
+        return raw
+
+
+ARTIFACT_PARSERS: dict[ArtifactType, Callable[[dict[str, Any]], dict[str, Any]]] = {
+    ArtifactType.MANIFEST: lambda raw: _to_dict(
+        raw, lambda: parse_manifest(manifest=raw)
+    ),
+    ArtifactType.CATALOG: lambda raw: _to_dict(raw, lambda: parse_catalog(catalog=raw)),
+    ArtifactType.RUN_RESULTS: lambda raw: _to_dict(
+        raw, lambda: parse_run_results(run_results=raw)
+    ),
+    ArtifactType.SOURCES: lambda raw: _to_dict(raw, lambda: parse_sources(sources=raw)),
+}
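How a caller might use a dispatch table of this shape can be sketched with the standard library alone. The example below is an assumption-laden illustration rather than the commit's code: `FakeModel` and `fake_strict_parse` are hypothetical stand-ins for a strict Pydantic model and the `parse_*()` functions, the enum is trimmed to two members, and `to_dict` omits the warning log that `_to_dict` emits.

```python
from __future__ import annotations

from collections.abc import Callable
from enum import Enum
from typing import Any


class ArtifactType(str, Enum):
    RUN_RESULTS = "run_results.json"
    MANIFEST = "manifest.json"


class FakeModel:
    """Hypothetical stand-in for a strict Pydantic model with model_dump()."""

    def __init__(self, data: dict[str, Any]) -> None:
        self._data = data

    def model_dump(self, mode: str = "python") -> dict[str, Any]:
        # A real model would normalise enums and aliases here.
        return {**self._data, "validated": True}


def fake_strict_parse(raw: dict[str, Any]) -> FakeModel:
    """Hypothetical strict parser: rejects input without a metadata key."""
    if "metadata" not in raw:
        raise ValueError("metadata: field required")
    return FakeModel(raw)


def to_dict(raw: dict[str, Any], strict_parse_fn: Callable[[], Any]) -> dict[str, Any]:
    """Dump the strict model to a dict; on any error hand back raw unchanged."""
    try:
        return strict_parse_fn().model_dump(mode="json")
    except Exception:
        return raw


PARSERS: dict[ArtifactType, Callable[[dict[str, Any]], dict[str, Any]]] = {
    ArtifactType.MANIFEST: lambda raw: to_dict(raw, lambda: fake_strict_parse(raw)),
    ArtifactType.RUN_RESULTS: lambda raw: to_dict(raw, lambda: fake_strict_parse(raw)),
}

# Because ArtifactType subclasses str, a file name maps straight to a member.
parse = PARSERS[ArtifactType("manifest.json")]
good = parse({"metadata": {}, "nodes": {}})
bad_raw = {"nodes": {}}
bad = parse(bad_raw)
print(good.get("validated"), bad.get("validated"))
```

One detail worth noting from `_to_dict` itself: the fallback returns the caller's `raw` object rather than a copy, so a downstream extractor that mutates the result would also mutate the original input on that path.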
