What this skill does
databricks-streaming-guardian is the data-operations skill of the pack — focused on Delta Lake, Liquid Clustering, Structured Streaming, and DLT. It is the largest skill in the rebuild because these four surfaces each ship with their own set of sharp edges that show up most visibly when production data flows through them at scale. Nine of the twelve failure modes here are not bugs; they are documented platform behaviors that surprise engineers — so the response pattern is friction-at-trigger-time (PreToolUse hooks on destructive operations), not bug reports.
What it catches
- ConcurrentDeleteDeleteException on OPTIMIZE collisions — manual OPTIMIZE colliding with AUTO OPTIMIZE, which is silently enabled on every table touched by MERGE/UPDATE/DELETE. (003-RL-RSRC D01)
- ConcurrentAppendException on Liquid Clustering — LC eliminates folder-based partition pruning but not writer conflicts; fan-out MERGEs break unless the MERGE predicate is narrowed to clustering keys. (003-RL-RSRC D02)
- DELTA_FILE_NOT_FOUND_DETAILED after VACUUM — streaming checkpoint pins file paths; OPTIMIZE rewrites them; VACUUM 7 days later deletes the originals. (003-RL-RSRC D03)
- Silent checkpoint corruption — months of healthy streaming then silent reset to batch 0, no documented root cause. (003-RL-RSRC D04)
- RocksDB off-heap memory pinning — multi-GB beyond JVM GC reach kills the driver via OOM with heap looking fine. (003-RL-RSRC D05)
- Liquid Clustering migration costs and downstream breakage — hidden full-rewrite cost plus consumer code that expected partition predicates that no longer exist. (003-RL-RSRC D06)
- Time travel breaking after VACUUM crosses retention boundary — engineers learning during audit that time travel is not backup. (003-RL-RSRC D07)
- DLT @dlt.table thread race — ThreadPoolExecutor registrations completing out of order, intermittent "table missing" failures. (003-RL-RSRC D08)
- DLT full refresh dropping data silently — when source is non-replayable (Kafka past retention, truncate-and-load Delta source). (003-RL-RSRC D09)
- Auto Loader UnknownFieldException on schema evolution — default mode stops the stream on every new column; rescue mode silently widens schema. (003-RL-RSRC D10)
- DLT predictive optimization cost on idle pipelines — the maintenance cluster running 24x7 on Advanced tier. (003-RL-RSRC D11 — primary mention in cost-leak-hunter, secondary here)
- DIFFERENT_DELTA_TABLE_READ_BY_STREAMING_SOURCE — checkpoints pin to source table UUID; CREATE OR REPLACE generates a new UUID; every active streaming consumer dies the moment the producer team runs the migration script. (003-RL-RSRC D12)
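Several of the entries above name a concrete error identifier, so a first triage step could be a plain string match from error text to catalog entry. The following is a minimal sketch under that assumption; the `SIGNATURES` table and the `triage` function are hypothetical names for illustration, not part of the skill as specified, and entries without a named error (D04 through D09, D11) are deliberately left out.

```python
from typing import List

# Hypothetical lookup table: error identifier -> catalog entry in 003-RL-RSRC.
# Only bullets that name a concrete error identifier are represented here.
SIGNATURES = {
    "ConcurrentDeleteDeleteException": "D01",
    "ConcurrentAppendException": "D02",
    "DELTA_FILE_NOT_FOUND_DETAILED": "D03",
    "UnknownFieldException": "D10",
    "DIFFERENT_DELTA_TABLE_READ_BY_STREAMING_SOURCE": "D12",
}


def triage(error_text: str) -> List[str]:
    """Return the catalog IDs whose identifier appears in the error text.

    Results follow the order of SIGNATURES (dicts preserve insertion order).
    """
    return [entry for signature, entry in SIGNATURES.items() if signature in error_text]
```

An empty result would mean the failure needs one of the non-signature entries (for example D04 or D05) to be considered by other means, such as metrics or checkpoint inspection.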
Design questions I want pushback on
- PreToolUse hook scope. The plan is to block OPTIMIZE / VACUUM / CREATE OR REPLACE / DROP TABLE when system.streaming.query_progress shows any active consumer on the target table. Is this the right list, or should it be wider (TRUNCATE, ALTER TABLE schema changes) or narrower?
- False-positive tolerance for hooks. A consumer that has been idle for 2 hours but has not been formally stopped — block or warn? My current plan is warn + offer a --force escape hatch. Right call?
- Liquid Clustering migration triage. D06 includes a full-rewrite cost that can be very large. Should the skill estimate the rewrite cost up front (requires DESCRIBE DETAIL + table-size math) before recommending migration, or only after the user opts in?
- DLT @dlt.table thread race detection. D08 fires intermittently. Should the skill detect-and-warn at code-review time (static analysis of ThreadPoolExecutor.submit(register_table, ...) patterns), or only diagnose after a failure has occurred?
- Auto Loader schema-evolution recommendation. D10 covers three modes (addNewColumns, failOnNewColumns, rescue). My instinct is "recommend rescue by default, document the silent-widening risk loudly." Is that right, or do production teams hate rescue for reasons I am missing?
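To make the first two questions concrete, here is a minimal sketch of the hook decision as a pure function. It assumes the guarded-statement list and the 2-hour idle window from the bullets above; the names `decide`, `GUARDED`, and `IDLE_AFTER` are hypothetical, and the lookup that would turn system.streaming.query_progress rows into per-consumer timestamps is not shown.

```python
import re
from datetime import datetime, timedelta
from typing import List

# Statement prefixes the proposal would guard (taken from the first question).
GUARDED = re.compile(
    r"^\s*(OPTIMIZE|VACUUM|CREATE\s+OR\s+REPLACE|DROP\s+TABLE)\b",
    re.IGNORECASE,
)

# Assumption: a consumer counts as idle once it has reported no progress
# for this long (the 2-hour figure from the second question).
IDLE_AFTER = timedelta(hours=2)


def decide(sql: str, last_progress: List[datetime], now: datetime, force: bool = False) -> str:
    """Return "allow", "warn", or "block" for a single SQL statement.

    last_progress holds the most recent progress timestamp of each consumer
    that reads the target table and has not been formally stopped.
    """
    if not GUARDED.match(sql) or not last_progress:
        return "allow"  # not a guarded statement, or no known consumers
    if any(now - ts < IDLE_AFTER for ts in last_progress):
        return "block"  # at least one consumer is still making progress
    # Only idle consumers remain: warn, unless the user passed --force.
    return "allow" if force else "warn"
```

In this sketch `--force` only lifts the warning for idle consumers and never overrides a block on an active one. Whether that is the right asymmetry is part of the question above.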
What I am not asking about right now
- Whether to split this into multiple skills — merging the delta-conflict-resolver into this one is decided per 007-AT-ADEC § Decision 2.
- Whether to add Kafka-side or Kinesis-side diagnostics — out of scope, source-system focus.
- Whether to use Delta Live Tables features that have not GA'd yet — DLT Direct Publishing Mode is on the watch list but not in scope until it stabilizes.
How to respond
Comment below with any thoughts, leave thumbs-up / thumbs-down on individual bullets in the design questions, or send a voice memo on WhatsApp and I will transcribe it into the issue with attribution. English is not required for voice memos — Portuguese is fine.
Source bead: claude-vjaw in the local beads workspace.
Reference material
Most relevant for this skill:
| Doc | What it covers |
| --- | --- |
| 003-RL-RSRC | Delta Lake / Liquid Clustering / Structured Streaming / DLT pain catalog |
| 007-AT-ADEC | CTO decision — Databricks Pack v2 rebuild |
Full reference set + cross-skill context: see umbrella issue #795 § Reference material.
- Jeremy Longshore
intentsolutions.io