This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
- Run full pipeline: `make run-dbt` (Runs `deps` and `build`)
- Debug dbt connection: `make run-dbt-debug`
- Clean dbt artifacts: `make clean-dbt` (Removes `target/`, `logs/`, and `dbt_packages/`)
- Generate documentation: `make generate-docs`
- Lint SQL: `make lint` (Uses SQLFluff) or `make lint-fix` (Auto-fix)
- Start services: `make compose-up` (waits for SQL Server readiness)
- Stop services: `make compose-down` (keeps named volumes)
- Service status: `make compose-ps`
- Tail logs: `make compose-logs`
- Full init (compose + DB + Flyway migrate): `make init-db`
- Verify data in SQL Server: `make check-mssql`
- Inspect SCD2 table: `make check-scb-scd2`
- Nuke everything (containers + volumes + venv + certs): `make nuke`
- Upload seedcsv/ to MinIO (hive-partitioned): `make upload-minio`
- DuckDB staging + SQL Server landing: `make load-scb-bulkfil`
- SCD2 snapshot (sqlserver target): `make snapshot-scb-bulkfil`
- End-to-end: `make run-scb-scd2`
- Run pipeline tests: `make test-scb-bulkfil` (or `make dbt-container-test-scb-bulkfil`)
- Where dupes go:
  - The `unique` test on `bronze_scb_bulkfil_parquet.peorgnr` is `severity: warn`, `store_failures: true`. Failing rows persist to `main_dbt_test__audit.unique_bronze_scb_bulkfil_parquet_peorgnr` in the local DuckDB (`fidemo/my_db.duckdb`). Refreshed every run.
  - Full-context rejected rows live in `fidemo.finance.scb_bulkfil_dedup_rejects` in SQL Server, with a `_dedup_rank` column showing which "version" each row was (2 = second-newest, etc.).
  - `tests/dedup_reconciliation.sql` asserts `bronze = winners + rejects` on every test run.
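The dedup bookkeeping described above can be illustrated outside the warehouse. The following is a minimal Python sketch, not the project's SQL: it ranks rows per `peorgnr` by a date field, keeps rank 1 as the winner, tags everything else as a reject with a `_dedup_rank`, and checks the `bronze = winners + rejects` invariant. The field name `effective_date`, the sample rows, and the function name `split_winners_rejects` are assumptions for illustration.

```python
from collections import defaultdict

def split_winners_rejects(rows, key="peorgnr", order_by="effective_date"):
    """Rank rows per key, newest first; rank 1 wins, the rest are rejects."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    winners, rejects = [], []
    for group in groups.values():
        # ISO-8601 date strings sort chronologically as plain strings.
        ranked = sorted(group, key=lambda r: r[order_by], reverse=True)
        for rank, row in enumerate(ranked, start=1):
            tagged = {**row, "_dedup_rank": rank}
            (winners if rank == 1 else rejects).append(tagged)
    return winners, rejects

# Hypothetical bronze rows: one company delivered twice, one delivered once.
bronze = [
    {"peorgnr": "5560000001", "effective_date": "2024-01-01", "name": "Old AB"},
    {"peorgnr": "5560000001", "effective_date": "2024-02-01", "name": "New AB"},
    {"peorgnr": "5560000002", "effective_date": "2024-01-01", "name": "Solo AB"},
]

winners, rejects = split_winners_rejects(bronze)

# Reconciliation: every bronze row is either a winner or a reject.
assert len(bronze) == len(winners) + len(rejects)
```

If the reconciliation ever fails in a sketch like this, rows were either dropped or counted twice during ranking, which is the same class of error the SQL test is meant to catch.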
Skips all the macOS-specific system setup (brew, Python 3.12 pin, ODBC, Flyway arch). Everything runs inside a Linux container that joins the compose network.
- Build image: `make dbt-container-build`
- Start (brings up mssql + minio + runner): `make dbt-container-up`
- Enter shell: `make dbt-shell`
- Run full pipeline inside the container: `make dbt-container-run-scb-scd2`
- VS Code: "Reopen in Container" with the provided `.devcontainer/devcontainer.json`.
- Setup Python environment: `make setup-python` (Uses `uv` to create venv and install dependencies)
- Update dependencies: `make force-update-requirements` (Re-compiles `requirements.in` to `requirements.txt`)
- The project venv is `uv`-managed (host and container). Any ad-hoc install goes through `uv pip install --python /opt/venv/bin/python <pkg>` inside the container, or `uv pip install --python venv_dbt_duckdb/bin/python <pkg>` on the host.
- For tools with conflicting adapter pins (e.g. `recce` pulls its own dbt versions), create an isolated interpreter: `uv venv /tmp/<toolname>-env --python 3.12` + `uv pip install --python /tmp/<toolname>-env/bin/python <pkg>`. Never let them rewrite `/opt/venv` or `venv_dbt_duckdb/`.
- `pip install` in raw form should not appear anywhere I write: it bypasses the lockfile discipline and makes drift hard to reproduce.
This project implements a hybrid ELT pipeline (DuckDB for transformation, SQL Server as the load target):
- Extract (Zero-copy): dbt reads raw Parquet files directly from an external location using DuckDB's `external_location` capability.
- Transform (DuckDB/dbt):
  - Data is processed in an EAV (Entity-Attribute-Value) format.
  - Parsing: Regex extracts business keys (e.g., `household_id`) from unstructured `row_id` strings.
  - Pivoting: The `PIVOT` operator transforms EAV rows into wide, columnar tables (e.g., `stg_households`).
  - Surrogate Keys: Deterministic `MD5` hashes are used for `_sk` columns (e.g., `household_sk`) to ensure idempotency.
- Load (BCP):
  - Transformed data is exported from DuckDB to a temporary CSV.
  - The BCP (Bulk Copy Program) utility is used to bulk-insert the CSV into a Microsoft SQL Server container.
- Schema Management: Flyway manages the SQL Server schema and migrations (found in `/migrations`).
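To make the Transform step concrete, here is a small Python sketch of the same three ideas: regex key extraction, an EAV-to-wide pivot, and a deterministic MD5 surrogate key. It is an illustration only. The real pipeline does this in dbt SQL on DuckDB, and the `row_id` layout (`household:<id>:...`), the attribute names, and the helper names are assumptions, not taken from the project.

```python
import hashlib
import re

# Assumed row_id shape for illustration only: "household:<id>:<anything>".
ROW_ID_PATTERN = re.compile(r"household:(\d+)")

def extract_household_id(row_id):
    """Pull the business key out of an unstructured row_id, or None."""
    match = ROW_ID_PATTERN.search(row_id)
    return match.group(1) if match else None

def surrogate_key(*parts):
    """Deterministic MD5 over the key parts: same input, same key."""
    joined = "|".join(str(p) for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def pivot_eav(eav_rows):
    """Turn (row_id, attribute, value) triples into one wide dict per household."""
    wide = {}
    for row_id, attribute, value in eav_rows:
        household_id = extract_household_id(row_id)
        if household_id is None:
            continue  # skip rows whose key cannot be parsed
        record = wide.setdefault(
            household_id,
            {"household_id": household_id,
             "household_sk": surrogate_key(household_id)},
        )
        record[attribute] = value
    return list(wide.values())

# Hypothetical EAV input.
eav = [
    ("household:42:a", "city", "Uppsala"),
    ("household:42:b", "members", "3"),
    ("household:7:a", "city", "Lund"),
]
households = pivot_eav(eav)
```

Because the surrogate key depends only on the business key, re-running the pivot on the same input yields the same `household_sk`, which is the idempotency property the list above refers to.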
- `fidemo/models/staging/`: dbt models performing the core parsing and pivoting logic (`stg_scb_bulkfil` reads the MinIO source).
- `fidemo/models/bronze/`: typed bronze models: Parquet (`bronze_scb_bulkfil_parquet`) + DuckLake (`bronze_scb_bulkfil_ducklake`) variants in MinIO.
- `fidemo/models/exports/`: models that push into SQL Server (via the `mssql` community extension; `scb_bulkfil_landing_from_parquet` and `scb_bulkfil_landing_from_ducklake`).
- `fidemo/snapshots/`: dbt SCD2 snapshots (run with `--target sqlserver`).
- `fidemo/macros/`: `common_columns.sql`, plus the custom materializations `materialization_mssql_native.sql` and `materialization_ducklake.sql`.
- `migrations/`: SQL Server schema migrations managed by Flyway.
- `cert/`: Custom/corporate certificates for secure connections.
- `Makefile`: The primary orchestrator for all pipeline and infra tasks.
- CSV encoding must be `'latin-1'`, not any ICU name. The DuckDB CSV reader lists 300+ ICU encodings but most of them (including `ISO8859_15`, `windows-1252`, `8859_15`) apply Unicode compatibility normalization that mangles ASCII `F` (U+0046) → fullwidth `Ｆ` (U+FF26). DuckDB's three known-good CSV encodings are `utf-8`, `utf-16`, `latin-1`. For Swedish data, `latin-1` is effectively equivalent to ISO-8859-15: the two differ in only eight code points (€, Š, š, Ž, ž, Œ, œ, Ÿ), and none appear in names/addresses.
- DuckDB is pinned `==1.5.2` in `requirements.txt`. That's the newest version where the `mssql` community extension (`hugr-lab/mssql-extension`) is published for `osx_arm64`, `linux_amd64`, and `linux_arm64` on community-extensions.duckdb.org. 1.5.3+ and 1.6.x return 404 across all three platforms (verified via HEAD probe 2026-05-11). Upstream's latest release is v0.1.18 (built against DuckDB v1.5). Before bumping, re-run the probe: `for v in 1.5.3 1.6.0 1.7.0; do for p in osx_arm64 linux_amd64 linux_arm64; do curl -sI "https://community-extensions.duckdb.org/v${v}/${p}/mssql.duckdb_extension.gz" | head -1; done; done`
- Python must be ≥3.10 (`PYTHON_VERSION ?= 3.12` in Makefile). The system Python on macOS Command Line Tools is 3.9.6, which can't satisfy `black>=25.12` in requirements. `setup-python` auto-heals by detecting an existing venv with the wrong Python version and rebuilding.
- Flyway archive is platform-specific. The Makefile auto-detects OS+arch via `uname` and picks `macosx-arm64`, `linux-x64`, etc. A Linux tarball on a macOS host produces `cannot execute binary file: Exec format error`.
- The DuckLake-fed landing model uses `-- depends_on:` + hardcoded FQN, not `ref()`. The DuckLake table lives at `lake.bronze.<name>` but `lake` is only ATTACHed mid-run inside the upstream ducklake materialization, so declaring `database='lake'` in config trips dbt's pre-run relation checks. The Parquet-fed landing uses normal `ref()`.
- Staging → bronze → landing must run in a single `dbt run` invocation. The `lake` attach established by the ducklake materialization only lives within one DuckDB session; splitting into two `dbt run` calls drops it and breaks the DuckLake-path landing.
- `external` materialization's `partition_by` option is a comma-separated string, not a Python list (`'year, month, day'`, not `['year','month','day']`). A list Jinja-renders to `['year','month','day']` which dbt-duckdb wraps into invalid SQL.
- `external` materialization's `plugin=` is only for third-party plugins (sqlalchemy, excel, iceberg). For plain Parquet writes through DuckDB's own `COPY TO`, omit `plugin` entirely; specifying `plugin='native'` raises "Plugin native not found".
- Column-reference case collision: DuckDB resolves identifiers case-insensitively. `cast(PeOrgNr as varchar) as peorgnr` errors with "referenced before defined" because the source column and the alias collide. Fix: qualify with the CTE alias, i.e. `cast(src.PeOrgNr as varchar) as peorgnr`.
- `setup-mssql-db` has no `setup-mssql-driver` dep. The original dep did `apt-get install msodbcsql18` which only works on Ubuntu; `setup-mssql-db` itself only needs docker exec + sqlcmd inside the container, so no host ODBC is required.
- 10a. SCD2 source must be one-row-per-key. When multiple hive partitions cover overlapping `peorgnr` values (different delivery dates of the same companies), `stg_scb_bulkfil` and the bronze layer carry both rows. The SCD2 snapshot's MERGE on SQL Server then errors: "MERGE attempted to UPDATE/DELETE the same row more than once". Fix: deduplicate at the silver landing with `qualify row_number() over (partition by peorgnr order by effective_date desc) = 1`. Bronze keeps full history; silver is "latest current state per key".
- 11a. DuckLake catalog version drift across DuckDB versions. The DuckLake extension is bundled with DuckDB; bumping DuckDB (e.g. 1.4.x → 1.5.x) bumps the DuckLake schema (v0.3 → v0.4). A catalog file written by the older version errors with "DuckLake catalog version mismatch" when opened by the newer one. Fix is the `AUTOMATIC_MIGRATION true` parameter on `ATTACH`, which `materialization_ducklake.sql` now sets unconditionally; once migrated, the catalog stays at the new version (one-way, irreversible).
- `disable_transactions: true` is required in `profiles.yml` for DuckLake writes. dbt-duckdb's default `BEGIN…COMMIT` wrapping does not propagate to attached catalogs: the CTAS against `lake.bronze.<table>` reports "OK created" but `ducklake_snapshot_changes` never records it, so the table silently doesn't persist. Empirically verified by inspecting the SQLite catalog between runs. With transactions disabled, DuckLake manages its own commits. Our materializations are single-statement CTAS so the loss of dbt-managed atomicity is irrelevant.
- macOS needs Homebrew-installed ODBC for `dbt-sqlserver`. `pyodbc` in the venv is linked against `/opt/homebrew/opt/unixodbc/lib/libodbc.2.dylib` which Apple doesn't ship. Prerequisites for `make snapshot-scb-bulkfil` and anything touching `--target sqlserver`, in order: `brew install unixodbc`, then `brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release`, then `HOMEBREW_ACCEPT_EULA=Y brew install msodbcsql18 mssql-tools18`. The DuckDB-side pipeline (`load-scb-bulkfil`) doesn't need either; the `mssql` community extension uses native TDS, no ODBC layer.
- 11b. SCD2 MERGE failure after an earlier crashed run. A snapshot whose previous MERGE crashed mid-flight can leave `dbt_valid_to IS NULL` for multiple rows with the same `peorgnr`. Every subsequent snapshot invocation then matches one source row to N target rows and errors with SQL Server 42000/8672, even though current landing tables are clean. Diagnose with `SELECT peorgnr, COUNT(*) FROM finance.snap_scb_bulkfil_scd2 WHERE dbt_valid_to IS NULL GROUP BY peorgnr HAVING COUNT(*)>1`; recover with `make clean-pipeline-state` (drops all pipeline output tables + bronze MinIO prefixes + local DuckDB+DuckLake; does NOT touch the `fidemo` database or Flyway history) then re-run the pipeline.
- 12a. `dbt show --target sqlserver --select <mssql_native model>` fails. The `mssql_native` materialization is DuckDB-only (INSTALL + ATTACH + CTAS). Running `dbt show` against sqlserver asks that adapter to re-execute the model's SELECT, which reads from `ref('bronze_…')` (Parquet in MinIO), and the sqlserver adapter can't find that. Workaround: use `--target dev` to preview the source query, or `--target sqlserver --inline "select … from {{ source('finance_landing','…') }}"` to query what was actually written. All v2 silver tables are declared as sources in `_sources.yml` for exactly this purpose.
- `dbt-sqlserver` must be pinned `==1.9.0`. Without a pin, uv resolves `dbt-sqlserver==1.3.1`, which still imports the removed `dbt.clients.agate_helper.empty_table` and crashes at module-load under dbt-core 1.10+ (current locked dbt-core is 1.11.8). `dbt-sqlserver==1.9.0` is the highest published on PyPI and is compatible with dbt-core 1.11. It also pulls in `dbt-fabric==1.9.3` + `pyodbc==5.1.0` as transitive deps, which are reflected in the lockfile.
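The CSV encoding note in the list above rests on how little `latin-1` and ISO-8859-15 differ. One way to see the exact divergence is to decode every byte value with both codecs from Python's standard library. This is a side check that does not involve DuckDB, and the helper name `codec_differences` is hypothetical.

```python
def codec_differences(codec_a="latin-1", codec_b="iso8859_15"):
    """Return {byte: (char_a, char_b)} for bytes the two codecs decode differently."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        char_a = raw.decode(codec_a)
        char_b = raw.decode(codec_b)
        if char_a != char_b:
            diffs[value] = (char_a, char_b)
    return diffs

diffs = codec_differences()
for value, (latin1_char, iso15_char) in sorted(diffs.items()):
    # Print code points rather than glyphs so the output is terminal-safe.
    print(f"0x{value:02X}: latin-1 U+{ord(latin1_char):04X} "
          f"vs ISO-8859-15 U+{ord(iso15_char):04X}")

# The Swedish letters a-ring, a-umlaut, o-umlaut (lower and upper case)
# decode identically under both codecs.
swedish = "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6".encode("latin-1")
assert swedish.decode("latin-1") == swedish.decode("iso8859_15")
```

If a source file ever does contain one of the divergent bytes (for example a euro sign), reading it as `latin-1` would silently produce a different character, so a check along these lines may be worth running against new deliveries.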
Authoritative references kept inline so Claude can build materializations without re-fetching docs.
Source: https://duckdb.org/community_extensions/extensions/mssql · https://github.com/hugr-lab/mssql-extension
- Install / load (requires DuckDB ≥ 1.4.1): `INSTALL mssql FROM community; LOAD mssql;`
- Attach (preferred, via secret): `CREATE SECRET ms (TYPE mssql, host 'localhost', port 1433, database 'fidemo', user 'sa', password 'MySecretPassword123!'); ATTACH '' AS ms (TYPE mssql, SECRET ms);` Alternate forms: ADO.NET string (`'Server=host,port;Database=...;User Id=...;Password=...'`) or URI (`'mssql://user:pass@host:port/db?encrypt=true'`).
- Identifier syntax: three-part, `attached_catalog.schema.table` (e.g., `ms.finance.scb_bulkfil_landing`).
- Supported DDL/DML:
  - `CREATE TABLE`, `CREATE TABLE AS SELECT` (CTAS uses BCP by default; setting: `mssql_ctas_use_bcp = true`).
  - `CREATE OR REPLACE TABLE`: yes, non-atomic (DROP then CREATE).
  - `DROP TABLE`: yes. `DROP TABLE IF EXISTS` via DuckDB syntax is not supported; use `SELECT mssql_exec('ms', 'DROP TABLE IF EXISTS ...')`.
  - `INSERT INTO ms.schema.table SELECT ...`: yes, auto-batched (1000 rows default).
  - `UPDATE` / `DELETE`: require PK on the target; no `RETURNING`.
- Fastest bulk path: `COPY duckdb_view TO 'ms.finance.scb_bulkfil_landing' (FORMAT 'bcp', REPLACE true);` ~300K rows/s for simple rows (per docs).
- Type mapping (CTAS): `VARCHAR`→`NVARCHAR(MAX)`, `BOOLEAN`→`BIT`, `DOUBLE`→`FLOAT`, `TIMESTAMP`→`DATETIME2(7)`, `UUID`→`UNIQUEIDENTIFIER`. Unsupported: `HUGEINT`, `INTERVAL`, `LIST`, `STRUCT`, `MAP`, `ARRAY`.
- Column mapping on existing tables: by name, case-insensitive (not position). Missing target cols get NULL (must be nullable); extra source cols ignored.
- Identity columns: auto-excluded from INSERT. Indexes/constraints/IDENTITY must be created via `mssql_exec()`.
- Auth: SQL auth + Azure Entra ID. Not supported: Windows auth, named instances. TLS on by default; `TrustServerCertificate` is an alias for `Encrypt` in ADO.NET strings (not ODBC-style independent flag).
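As a small aid when writing CTAS-based materializations against the type mapping above, the following Python sketch flags DuckDB column types that the list marks as unsupported. It only encodes what is stated in this section. The function names and the simple string-based type parsing are assumptions, and it does not query DuckDB or the extension.

```python
# Mapping and unsupported list copied from the notes above (CTAS only).
CTAS_TYPE_MAP = {
    "VARCHAR": "NVARCHAR(MAX)",
    "BOOLEAN": "BIT",
    "DOUBLE": "FLOAT",
    "TIMESTAMP": "DATETIME2(7)",
    "UUID": "UNIQUEIDENTIFIER",
}
UNSUPPORTED = {"HUGEINT", "INTERVAL", "LIST", "STRUCT", "MAP", "ARRAY"}

def base_type(duckdb_type):
    """Reduce e.g. 'STRUCT(a INTEGER)' or 'VARCHAR[]' to a comparable base name."""
    name = duckdb_type.strip().upper()
    if name.endswith("]"):
        # Assumption: treat any bracketed type (T[] or T[n]) as a list/array.
        return "LIST"
    return name.split("(", 1)[0].strip()

def check_columns(columns):
    """Split {column: duckdb_type} into (mapped, unsupported, unknown)."""
    mapped, unsupported, unknown = {}, [], []
    for column, duckdb_type in columns.items():
        base = base_type(duckdb_type)
        if base in UNSUPPORTED:
            unsupported.append(column)
        elif base in CTAS_TYPE_MAP:
            mapped[column] = CTAS_TYPE_MAP[base]
        else:
            unknown.append(column)  # not covered by the notes above
    return mapped, unsupported, unknown

# Hypothetical model columns.
mapped, unsupported, unknown = check_columns({
    "peorgnr": "VARCHAR",
    "is_active": "BOOLEAN",
    "tags": "VARCHAR[]",
    "payload": "STRUCT(a INTEGER)",
    "row_count": "INTEGER",
})
```

Columns landing in `unknown` are simply types this section does not document; they would need to be checked against the extension's own documentation before relying on them.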
Source: https://ducklake.select/docs/stable/duckdb/usage/choosing_a_catalog_database#sqlite
- Install / load: `INSTALL ducklake; INSTALL sqlite; LOAD ducklake;`
- Attach (SQLite catalog + S3 data): `ATTACH 'ducklake:sqlite:fidemo/ducklake_catalog.sqlite' AS lake (DATA_PATH 's3://informat/bronze-ducklake/'); USE lake;`
- Key ATTACH parameters: `DATA_PATH` (required for non-DuckDB catalogs), `CREATE_IF_NOT_EXISTS` (default `true`), `METADATA_SCHEMA`, `ENCRYPTED`, `SNAPSHOT_TIME` / `SNAPSHOT_VERSION` (time-travel), `OVERRIDE_DATA_PATH`.
- Operations: standard `CREATE TABLE`, `INSERT`, `UPDATE`, `DELETE`, `MERGE` against `lake.schema.table`.
- Concurrency note (SQLite): DuckLake compensates for SQLite's single-writer model by attach/detach-per-query + retry timeouts. This is usable for single-host demo/dev; swap the catalog to PostgreSQL for multi-writer prod.
- S3 credentials for the data layer are configured separately (DuckDB `s3_access_key_id` settings or a `TYPE s3` secret); DuckLake does not manage them.