fix: emit Delta liquid clustering for clustered_by (#187) #188

Merged: mdrakiburrahman merged 1 commit into main from dev/mdrrahman/cluster-by on May 17, 2026
@mdrakiburrahman

Collaborator

Closes #187.

Problem

`{{ config(clustered_by=[...]) }}` is currently a silent no-op on Delta tables: `fabricspark__clustered_cols` only emits a clause when `buckets` is also set (Hive bucketing). For Delta, the right behavior is to emit `CLUSTER BY (cols)`, which enables liquid clustering on Fabric Spark.

Fix

Drop in the reference implementation from #187: override `fabricspark__clustered_cols` and `fabricspark__file_format_clause` so that the existing `clustered_by` config opts a Delta table into liquid clustering. No new config name is introduced.

Semantics matrix:

| clustered_by | buckets | file_format | partition_by | Result |
|---|---|---|---|---|
| set | set | any | | unchanged Hive bucketing: `clustered by (cols) into N buckets` |
| set | unset | delta/unset | unset | new: `cluster by (cols)` + `using delta` emitted by `file_format_clause` |
| set | unset | delta/unset | set | new: compile-time `exceptions.raise_compiler_error` (mutually exclusive) |
| set | unset | non-delta | | unchanged: no clustering clause, `using <fmt>` emitted |
| unset | any | | | unchanged |
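
The matrix can also be read as a small decision function. The following is a minimal Python sketch of that logic, not the adapter's Jinja macro: the function name `clustering_clause` and the `CompilationError` class are hypothetical stand-ins (the PR itself raises through `exceptions.raise_compiler_error`), and treating an unset `file_format` as Delta simply follows the "delta/unset" rows.

```python
from typing import Optional, Sequence


class CompilationError(Exception):
    """Hypothetical stand-in for dbt's compile-time error."""


def clustering_clause(
    clustered_by: Optional[Sequence[str]] = None,
    buckets: Optional[int] = None,
    file_format: Optional[str] = None,
    partition_by: Optional[Sequence[str]] = None,
) -> str:
    """Return the clustering clause described by the matrix, or '' for none."""
    if not clustered_by:
        return ""  # clustered_by unset: behavior unchanged
    cols = ", ".join(clustered_by)
    if buckets is not None:
        # clustered_by + buckets: unchanged Hive bucketing
        return f"clustered by ({cols}) into {buckets} buckets"
    # Assumption taken from the matrix: unset file_format behaves like delta.
    is_delta = file_format is None or file_format.lower() == "delta"
    if not is_delta:
        return ""  # non-delta format: no clustering clause
    if partition_by:
        # Liquid clustering and partitioning are mutually exclusive on Delta.
        raise CompilationError(
            "clustered_by and partition_by are mutually exclusive on Delta"
        )
    return f"cluster by ({cols})"
```

Keeping the bucketing check first means existing Hive-bucketed models keep their current DDL regardless of format.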

Seeds inherit the new behavior automatically through adapter dispatch, because `seed.sql` calls the same `clustered_cols` and `file_format_clause` macros.
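
To illustrate why no seed-specific change is needed, here is a toy Python model of prefix-based dispatch. It is a simplified sketch: the `MACROS` registry, `dispatch`, and the two `create_*_sql` helpers are hypothetical, and the DDL strings are abbreviated rather than what the adapter actually renders.

```python
from typing import Callable, Dict

# Hypothetical registry; keys mirror dbt's "<adapter>__<macro>" naming convention.
MACROS: Dict[str, Callable[[dict], str]] = {
    "default__clustered_cols": lambda cfg: "",
    "fabricspark__clustered_cols": lambda cfg: (
        f"cluster by ({', '.join(cfg['clustered_by'])})"
        if cfg.get("clustered_by")
        else ""
    ),
}


def dispatch(macro_name: str, adapter_type: str = "fabricspark") -> Callable[[dict], str]:
    """Pick the adapter-specific implementation, falling back to default__."""
    for prefix in (adapter_type, "default"):
        impl = MACROS.get(f"{prefix}__{macro_name}")
        if impl is not None:
            return impl
    raise KeyError(macro_name)


def create_table_sql(relation: str, cfg: dict) -> str:
    """Abbreviated table DDL; the clause comes from the dispatched macro."""
    clause = dispatch("clustered_cols")(cfg)
    return f"create or replace table {relation} using delta {clause}".strip()


def create_seed_sql(relation: str, cfg: dict) -> str:
    """Abbreviated seed DDL; it resolves the very same macro."""
    clause = dispatch("clustered_cols")(cfg)
    return f"create table {relation} using delta {clause}".strip()
```

Because both helpers resolve `clustered_cols` through the same lookup, overriding the adapter-prefixed implementation changes tables and seeds together.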

Expected DDL (per #187 acceptance criteria)

CREATE OR REPLACE TABLE schema.t
USING DELTA
CLUSTER BY (col_a, col_b)
AS SELECT ...
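
As a quick way to see the clause order in that statement, the snippet below assembles the same DDL from its parts. It is only an illustrative sketch: `render_ctas` is a hypothetical helper, not part of the adapter, and it does no identifier quoting.

```python
from typing import Sequence


def render_ctas(relation: str, cluster_cols: Sequence[str], select_sql: str) -> str:
    """Assemble a Delta CTAS with USING DELTA placed before CLUSTER BY."""
    lines = [
        f"CREATE OR REPLACE TABLE {relation}",
        "USING DELTA",
        f"CLUSTER BY ({', '.join(cluster_cols)})",
        f"AS {select_sql}",
    ]
    return "\n".join(lines)


print(render_ctas("schema.t", ["col_a", "col_b"], "SELECT ..."))
```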

Validation

`npx nx run dbt-fabricspark:test` ran end-to-end on this branch, with all targets green:

| Step | Duration | Status |
|---|---|---|
| Unit tests (incl. 10 new cluster-by branches) | < 1s | green |
| Local-e2e (jaffle_shop, Spark 3.5.1 + Delta 3.2.0) | ~6 min | green |
| Functional no_schema | 14m 7s | green |
| Functional with_schema (incl. cross-workspace) | 19m 52s | green |
| Lint | < 1s | green |
| Build + twine | < 5s | green |

End-to-end smoke models added per user direction:

  • `tests/functional/adapter/persist_docs/fixtures.py::_MODELS__TABLE_DELTA_MODEL` → `clustered_by=['id']` (Fabric).
  • `tests/fixtures/dbt-jaffle-shop/models/orders.sql` → `clustered_by=['order_id']` (local Livy).

A successful dbt run against both Fabric Spark and local Spark+Delta with the new clause is the live confirmation that the macros emit valid liquid-clustering SQL.

Version

Bumps 1.12.0 → 1.12.1; CHANGELOG updated.

`{{ config(clustered_by=[...]) }}` was silently a no-op on Delta tables
because `fabricspark__clustered_cols` only emitted the clause when
`buckets` was also set (Hive bucketing). Fabric Spark accepts
`CLUSTER BY (cols)` on Delta CTAS for liquid clustering — this commit
wires the existing `clustered_by` config through to that DDL without
introducing a new config name.

Override `fabricspark__clustered_cols` and `fabricspark__file_format_clause`
per the reference impl in #187:

- `clustered_by` + `buckets` → unchanged Hive bucketing
- `clustered_by` alone on Delta → `cluster by (cols)` + `using delta`
- `clustered_by` + `partition_by` on Delta → compile-time error
  (mutually exclusive on Delta)
- Non-`delta` `file_format` → unchanged `using <fmt>`, no clustering

Seeds inherit the new behavior automatically through adapter dispatch
(`seed.sql` calls the same `clustered_cols` and `file_format_clause`
macros).

Adds 10 unit tests covering the four acceptance-criteria branches and
adds `clustered_by` to two existing end-to-end test models — the
`persist_docs` Fabric delta table model and the local-e2e jaffle_shop
`orders.sql` — so both the Fabric Spark and local Spark 3.5.1 + Delta
3.2.0 paths exercise the new clause on every CI run.

Closes #187.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman merged commit 8d9ed35 into main on May 17, 2026 (2 checks passed)
mdrakiburrahman deleted the dev/mdrrahman/cluster-by branch on May 17, 2026, 17:15
Development

Successfully merging this pull request may close these issues.

[Feature] Support Delta liquid clustering via existing clustered_by config