[v2-rebuild] databricks-streaming-guardian: Delta / Liquid Clustering / Structured Streaming / DLT operations skill #792

Description

@jeremylongshore

What this skill does

databricks-streaming-guardian is the data-operations skill of the pack — focused on Delta Lake, Liquid Clustering, Structured Streaming, and DLT. It is the largest skill in the rebuild because these four surfaces each ship with their own set of sharp edges that show up most visibly when production data flows through them at scale. Nine of the twelve failure modes here are not bugs; they are documented platform behaviors that surprise engineers — so the response pattern is friction-at-trigger-time (PreToolUse hooks on destructive operations), not bug reports.
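As a rough illustration of friction-at-trigger-time, the sketch below classifies a SQL statement and returns a verdict before it runs. Everything in it is an assumption rather than the skill's actual code. The guarded statement list and the two-hour idle threshold are borrowed from the design questions further down. The `last_progress` mapping is a hypothetical stand-in for whatever source reports consumer liveness. The `Verdict`, `guarded_target`, and `decide` names are invented for this example.

```python
import re
from datetime import datetime, timedelta
from enum import Enum
from typing import Mapping, Optional


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"


# Assumed guard list (OPTIMIZE / VACUUM / CREATE OR REPLACE / DROP TABLE); not final.
_GUARDED = re.compile(
    r"^\s*(?:OPTIMIZE|VACUUM|CREATE\s+OR\s+REPLACE\s+TABLE"
    r"|DROP\s+TABLE(?:\s+IF\s+EXISTS)?)\s+([`\w.]+)",
    re.IGNORECASE,
)


def guarded_target(sql: str) -> Optional[str]:
    """Return the lower-cased table name if sql is a guarded statement, else None."""
    match = _GUARDED.match(sql)
    return match.group(1).replace("`", "").lower() if match else None


def decide(
    sql: str,
    last_progress: Mapping[str, datetime],  # hypothetical: table -> last consumer progress
    now: datetime,
    idle_after: timedelta = timedelta(hours=2),  # assumed threshold
    force: bool = False,
) -> Verdict:
    """Block on active consumers, warn on idle ones, allow everything else."""
    target = guarded_target(sql)
    if target is None or target not in last_progress:
        return Verdict.ALLOW
    if now - last_progress[target] < idle_after:
        return Verdict.BLOCK  # an active consumer is never overridden here
    return Verdict.ALLOW if force else Verdict.WARN
```

In this sketch `force` only downgrades a warning on an idle consumer; whether it should also override a hard block is one of the open questions, not something the code settles.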

What it catches

  • ConcurrentDeleteDeleteException on OPTIMIZE collisions — manual OPTIMIZE colliding with AUTO OPTIMIZE, which is silently enabled on every table touched by MERGE/UPDATE/DELETE. (003-RL-RSRC D01)
  • ConcurrentAppendException on Liquid Clustering — LC eliminates folder-based partition pruning but not writer conflicts; fan-out MERGEs break unless the MERGE predicate is narrowed to clustering keys. (003-RL-RSRC D02)
  • DELTA_FILE_NOT_FOUND_DETAILED after VACUUM — streaming checkpoint pins file paths; OPTIMIZE rewrites them; VACUUM 7 days later deletes the originals. (003-RL-RSRC D03)
  • Silent checkpoint corruption — months of healthy streaming then silent reset to batch 0, no documented root cause. (003-RL-RSRC D04)
  • RocksDB off-heap memory pinning — multi-GB beyond JVM GC reach kills the driver via OOM with heap looking fine. (003-RL-RSRC D05)
  • Liquid Clustering migration costs and downstream breakage — hidden full-rewrite cost plus consumer code that expected partition predicates that no longer exist. (003-RL-RSRC D06)
  • Time travel breaking after VACUUM crosses retention boundary — engineers learning during audit that time travel is not backup. (003-RL-RSRC D07)
  • DLT @dlt.table thread race — ThreadPoolExecutor registrations completing out of order, intermittent "table missing" failures. (003-RL-RSRC D08)
  • DLT full refresh dropping data silently — when source is non-replayable (Kafka past retention, truncate-and-load Delta source). (003-RL-RSRC D09)
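  • Auto Loader UnknownFieldException on schema evolution — addNewColumns mode (the default when no schema is supplied) stops the stream on every new column until it is restarted; rescue mode keeps the stream running but silently routes new columns into the rescued data column instead of evolving the schema. (003-RL-RSRC D10)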
  • DLT predictive optimization cost on idle pipelines — the maintenance cluster running 24x7 on Advanced tier. (003-RL-RSRC D11 — primary mention in cost-leak-hunter, secondary here)
  • DIFFERENT_DELTA_TABLE_READ_BY_STREAMING_SOURCE — checkpoints pin to source table UUID; CREATE OR REPLACE generates a new UUID; every active streaming consumer dies the moment the producer team runs the migration script. (003-RL-RSRC D12)
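The VACUUM-related entries above (D03, D07) come down to one comparison: how far a reader lags behind versus how long removed files are retained. The sketch below is a deliberately simplified check, not a description of how Delta or the skill decides anything. The function name and its inputs are hypothetical. The seven-day default is taken from the D03 bullet. A lag larger than the retention window is treated as "at risk", not as proof that files are gone.

```python
from datetime import datetime, timedelta


def stream_at_risk_from_vacuum(
    last_processed_commit_at: datetime,  # hypothetical: commit time of the last batch read
    vacuum_run_at: datetime,
    retention: timedelta = timedelta(days=7),  # assumed default, per the D03 bullet
) -> bool:
    """Return True when a VACUUM at vacuum_run_at could delete unread files.

    Files rewritten by OPTIMIZE become eligible for deletion once they have
    been out of the table for longer than `retention`. A stream that lags by
    no more than `retention` cannot be pointing at such files; a stream that
    lags by more might be.
    """
    return vacuum_run_at - last_processed_commit_at > retention
```

A hook could run this for each known consumer of the target table and turn any True result into a warning that names the lagging stream.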

Design questions I want pushback on

  1. PreToolUse hook scope. Plan is to block: OPTIMIZE / VACUUM / CREATE OR REPLACE / DROP TABLE when system.streaming.query_progress shows any active consumer on the target table. Is this the right list, or should it be wider (TRUNCATE, ALTER TABLE schema changes) or narrower?
  2. False-positive tolerance for hooks. A consumer that has been idle for 2 hours but has not been formally stopped — block or warn? My current plan is warn + offer a --force escape hatch. Right call?
  3. Liquid Clustering migration triage. D06 includes a full-rewrite cost that can be very large. Should the skill estimate the rewrite cost up front (requires DESCRIBE DETAIL + table-size math) before recommending migration, or only after the user opts in?
  4. DLT @dlt.table thread race detection. D08 fires intermittently. Should the skill detect-and-warn at code-review time (static analysis of ThreadPoolExecutor.submit(register_table, ...) patterns), or only diagnose after a failure has occurred?
  5. Auto Loader schema-evolution recommendation. The setting behind D10, cloudFiles.schemaEvolutionMode, has four modes (addNewColumns, rescue, failOnNewColumns, none). My instinct is "recommend rescue by default, document the silent-schema-drift risk loudly." Is that right, or do production teams hate rescue for reasons I am missing?
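For question 4, a code-review-time check could be as small as the following sketch, which uses Python's standard `ast` module to find `submit` or `map` calls on a name bound to a `ThreadPoolExecutor`. This is an assumption about what such a detector might look like, not the skill's implementation. The function names are hypothetical, and `SAMPLE` is made-up pipeline code. The check is purely syntactic: it does not verify that the submitted callable actually registers a `@dlt.table`, so it will over-report.

```python
import ast
from typing import List, Set

# Made-up pipeline code; it is only parsed, never executed.
SAMPLE = '''\
from concurrent.futures import ThreadPoolExecutor
import dlt

def register_table(name):
    @dlt.table(name=name)
    def _table():
        return spark.read.table(name)

with ThreadPoolExecutor(max_workers=4) as pool:
    for name in ["bronze_a", "bronze_b"]:
        pool.submit(register_table, name)
'''


def _is_executor_call(node: ast.AST) -> bool:
    """True for ThreadPoolExecutor(...) or <module>.ThreadPoolExecutor(...)."""
    if not isinstance(node, ast.Call):
        return False
    func = node.func
    name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
    return name == "ThreadPoolExecutor"


def find_threaded_submits(source: str) -> List[int]:
    """Return line numbers of submit/map calls on a ThreadPoolExecutor variable."""
    tree = ast.parse(source)
    pools: Set[str] = set()
    # Pass 1: collect names bound to an executor via `with ... as x` or `x = ...`.
    for node in ast.walk(tree):
        if isinstance(node, ast.With):
            for item in node.items:
                if _is_executor_call(item.context_expr) and isinstance(
                    item.optional_vars, ast.Name
                ):
                    pools.add(item.optional_vars.id)
        elif isinstance(node, ast.Assign) and _is_executor_call(node.value):
            pools.update(t.id for t in node.targets if isinstance(t, ast.Name))
    # Pass 2: flag method calls that hand work to one of those executors.
    hits = [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr in {"submit", "map"}
        and isinstance(node.func.value, ast.Name)
        and node.func.value.id in pools
    ]
    return sorted(hits)


print(find_threaded_submits(SAMPLE))
```

Because the check runs on source text alone, it fits the detect-and-warn option; the diagnose-after-failure option would need runtime evidence that this sketch does not look at.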

What I am not asking about right now

  • Whether to split this into multiple skills — merging the delta-conflict-resolver into this one is decided per 007-AT-ADEC § Decision 2.
  • Whether to add Kafka-side or Kinesis-side diagnostics — out of scope; those are source-system concerns.
  • Whether to use Delta Live Tables features that have not GA'd yet — DLT Direct Publishing Mode is on the watch list but not in scope until it stabilizes.

How to respond

Comment below with any thoughts, leave thumbs-up / thumbs-down on individual bullets in the design questions, or send a voice memo on WhatsApp and I will transcribe it into the issue with attribution. English is not required for voice memos — Portuguese is fine.

Source bead: claude-vjaw in the local beads workspace.


Reference material

Most relevant for this skill:

Doc | What it covers
003-RL-RSRC | Delta Lake / Liquid Clustering / Structured Streaming / DLT pain catalog
007-AT-ADEC | CTO decision — Databricks Pack v2 rebuild

Full reference set + cross-skill context: see umbrella issue #795 § Reference material.

  • Jeremy Longshore
    intentsolutions.io

    Labels

    community-design-review: Practitioner input is the load-bearing input for this issue
    databricks-pack: Scope label for Databricks plugin pack work
    feedback-wanted: Issue is soliciting community feedback before code lands
    saas-packs: SaaS integration packs under plugins/saas-packs/
    v2-rebuild: Temporal marker for v2 rebuild initiatives (sunsetable)