
add replay command: reconstruct the observed schedule #3

Merged
trouze merged 1 commit into pypi-v0.1.0 from feat/replay on Apr 24, 2026

@trouze commented Apr 24, 2026

Summary

  • Adds `dbt-dag-opt replay`, a command complementary to `analyze`. Where `analyze` is theoretical (critical path from DAG topology), `replay` reads the observed schedule out of `run_results.json`, in which every result carries a `thread_id` and per-phase timing with start/end timestamps, and joins it against `manifest.json`'s `parent_map` to attribute blocking.
  • Surfaces three things the tool couldn't say before:
    • Per-thread utilization: busy vs. idle across the run.
    • Observed critical path: walked backwards from the last-completing node by picking, at each step, the parent whose completion time was closest to (but no later than) the current node's start.
    • Idle-gap attribution: every idle stretch is tagged with the parent node the thread was waiting on. Gaps with no blocker are flagged as scheduler overhead rather than DAG-structural blocking. The distinction matters because only DAG-structural lag can be addressed by changing model code.
  • `text` (default, rich tables) and `json` output formats, with the same file/cloud input modes as `analyze` and the same token environment variable.
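A minimal sketch of the input side and the per-thread utilization figure, assuming dbt's `run_results.json` layout: each entry in `results` has `unique_id`, `thread_id`, and a `timing` list of phases with `started_at`/`completed_at` ISO timestamps. It collapses each node's phases into one span and computes each thread's busy fraction over the run window. `NodeRun`, `load_node_runs`, and `thread_utilization` are hypothetical names, not the module's actual API, and defining the window as "earliest start to latest completion" is an assumption of this sketch.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NodeRun:
    """Hypothetical record: one executed node, the thread that ran it, and its span."""

    unique_id: str
    thread_id: str
    start: datetime
    end: datetime


def _parse_ts(ts: str) -> datetime:
    # dbt writes UTC timestamps with a trailing "Z"; fromisoformat only accepts
    # that suffix from Python 3.11 on, so normalize it to an explicit offset.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def load_node_runs(run_results: dict) -> list[NodeRun]:
    """Collapse each result's timing phases (e.g. compile, execute) into one span."""
    runs: list[NodeRun] = []
    for result in run_results["results"]:
        phases = [
            p for p in result.get("timing", [])
            if p.get("started_at") and p.get("completed_at")
        ]
        if not phases:
            continue  # e.g. skipped nodes carry no timing entries
        start = min(_parse_ts(p["started_at"]) for p in phases)
        end = max(_parse_ts(p["completed_at"]) for p in phases)
        runs.append(NodeRun(result["unique_id"], result["thread_id"], start, end))
    return runs


def thread_utilization(runs: list[NodeRun]) -> dict[str, float]:
    """Busy fraction per thread over the run window (earliest start to latest end).

    Assumes a thread never runs two nodes at once, so summing its spans is safe.
    """
    if not runs:
        return {}
    window = (max(r.end for r in runs) - min(r.start for r in runs)).total_seconds()
    busy: dict[str, float] = defaultdict(float)
    for r in runs:
        busy[r.thread_id] += (r.end - r.start).total_seconds()
    return {tid: secs / window if window else 0.0 for tid, secs in busy.items()}
```

To use it on a real artifact, pass `json.load(f)` of a `run_results.json` file to `load_node_runs` and hand the result to `thread_utilization`; idle time per thread is then `1 - busy` over the same window.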

Stacking

This PR targets pypi-v0.1.0 (#1) because the package scaffolding lives there. Merge order: #1 first, rebase onto main, then merge this.

Test plan

  • `pytest`: 42 passed (13 new for `replay`, 29 pre-existing)
  • `ruff check .` and `mypy src`: clean
  • Manual smoke test via `dbt-dag-opt replay --manifest tests/fixtures/dbt_dugout/manifest.json --run-results tests/fixtures/dbt_dugout/run_results.json`: renders the summary, thread utilization, critical path, and attributed idle gaps against a real 57-node, 4-thread Snowflake run.
  • Synthetic 4-node, 2-thread fixture in `conftest.py` asserts the exact expected critical path, thread utilization, and idle-gap attribution.
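The backwards walk from the Summary can be sketched as follows. This is a hedged illustration, not the PR's code: `observed_critical_path` is a hypothetical name, node spans are assumed to be `(start, end)` seconds since the run began, `parent_map` is assumed to have the same shape as `manifest.json`'s (node id to list of parent ids), and ties between equally late parents go to whichever appears first in the parent list. The 4-node schedule at the bottom is invented; it is not the `conftest.py` fixture.

```python
from __future__ import annotations


def observed_critical_path(
    spans: dict[str, tuple[float, float]],
    parent_map: dict[str, list[str]],
) -> list[str]:
    """Walk back from the last-completing node, always stepping to the parent
    that finished closest to (but no later than) the current node's start."""
    if not spans:
        return []
    current = max(spans, key=lambda n: spans[n][1])
    path = [current]
    while True:
        start = spans[current][0]
        # Only parents that appear in the run and had finished when this node started.
        candidates = [
            p for p in parent_map.get(current, [])
            if p in spans and spans[p][1] <= start
        ]
        if not candidates:
            break
        # The latest finish among eligible parents is the one closest to this start.
        current = max(candidates, key=lambda p: spans[p][1])
        path.append(current)
    path.reverse()  # report root-first
    return path


# Invented schedule: a and b start together on two threads; c needs a and b; d needs c.
spans = {"a": (0.0, 3.0), "b": (0.0, 1.0), "c": (3.0, 6.0), "d": (6.0, 8.0)}
parent_map = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
print(observed_critical_path(spans, parent_map))  # ['a', 'c', 'd']
```

Choosing the latest-finishing eligible parent is what distinguishes the observed path from the topological one: `b` is a parent of `c` but finished two seconds earlier, so it was not what `c` actually waited on.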

🤖 Generated with Claude Code

`analyze` predicts a theoretical lower bound on wall-clock from the DAG's
critical path. `replay` does the complementary thing: reads the observed
schedule out of run_results.json (thread_id + per-phase timing) and joins
against manifest.json's parent_map to report:

- per-thread utilization (busy vs. idle, events executed)
- observed critical path, walked backwards from the last-completing node
  by picking, at each step, the parent whose completion time was closest
  to (but no later than) this node's start
- top idle gaps, each attributed to the parent node the thread was
  waiting on (or flagged as scheduler overhead if all parents were
  already done)
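One way to implement the idle-gap attribution, as a sketch under stated assumptions: gaps are measured only between consecutive nodes on the same thread; a gap is attributed to the next node's latest-finishing parent if that parent was still running when the thread went idle, and is otherwise flagged as scheduler overhead (`blocked_by=None`). `IdleGap` and `attribute_idle_gaps` are hypothetical names, spans are assumed to be `(thread_id, start, end)` in seconds, and the schedule in the example is invented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IdleGap:
    thread_id: str
    start: float
    end: float
    blocked_by: str | None  # None means scheduler overhead, not DAG blocking

    @property
    def duration(self) -> float:
        return self.end - self.start


def attribute_idle_gaps(
    spans: dict[str, tuple[str, float, float]],
    parent_map: dict[str, list[str]],
    top_n: int = 10,
) -> list[IdleGap]:
    """Return the top_n longest idle gaps, each tagged with its blocking parent."""
    by_thread: dict[str, list[str]] = {}
    for node, (thread_id, _, _) in spans.items():
        by_thread.setdefault(thread_id, []).append(node)

    gaps: list[IdleGap] = []
    for thread_id, nodes in by_thread.items():
        nodes.sort(key=lambda n: spans[n][1])  # order by start time
        for prev, nxt in zip(nodes, nodes[1:]):
            gap_start, gap_end = spans[prev][2], spans[nxt][1]
            if gap_end <= gap_start:
                continue  # back-to-back: no idle time
            # Parents of the next node still running when this thread went idle.
            pending = [
                p for p in parent_map.get(nxt, [])
                if p in spans and spans[p][2] > gap_start
            ]
            blocker = max(pending, key=lambda p: spans[p][2]) if pending else None
            gaps.append(IdleGap(thread_id, gap_start, gap_end, blocker))
    gaps.sort(key=lambda g: g.duration, reverse=True)
    return gaps[:top_n]


spans = {
    "a": ("T1", 0.0, 3.0),
    "b": ("T2", 0.0, 1.0),
    "c": ("T2", 3.0, 6.0),  # needs a and b
    "d": ("T1", 7.0, 8.0),  # needs c
    "e": ("T2", 6.5, 7.0),  # needs c, which finished as T2 went idle
}
parent_map = {"c": ["a", "b"], "d": ["c"], "e": ["c"]}
for gap in attribute_idle_gaps(spans, parent_map):
    print(gap.thread_id, gap.start, gap.end, gap.blocked_by or "scheduler overhead")
# T1 3.0 7.0 c
# T2 1.0 3.0 a
# T2 6.0 6.5 scheduler overhead
```

This attribution is deliberately simple: it blames the whole gap on one parent, so a gap that is partly DAG wait and partly scheduler lag (T1 waits until 7.0 although `c` finished at 6.0) is reported entirely as DAG-structural.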

CLI: `dbt-dag-opt replay --manifest ... --run-results ...` with `text`
(default) and `json` formats. Same file/cloud input modes as `analyze`.

Tests: synthetic 4-node 2-thread fixture with known expected chain +
integration fixture at tests/fixtures/dbt_dugout/ (real 57-node Snowflake
run) smoke-tested through CLI. Suite: 42 passed (+13 replay tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@trouze merged commit 51ab842 into pypi-v0.1.0 on Apr 24, 2026
6 checks passed