Commit 765b969

Merge pull request #1 from trouze/pypi-v0.1.0
Rewrite as production-ready Python package (v0.1.0 on PyPI)
2 parents 8aa362e + 24e95a6 commit 765b969

32 files changed

Lines changed: 2182 additions & 427 deletions

.github/workflows/ci.yml

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+        with:
+          enable-cache: true
+
+      - name: Set up Python ${{ matrix.python-version }}
+        run: uv python install ${{ matrix.python-version }}
+
+      - name: Install project
+        run: uv sync --all-extras --python ${{ matrix.python-version }}
+
+      - name: Lint
+        run: uv run ruff check .
+
+      - name: Typecheck
+        run: uv run mypy src
+
+      - name: Test
+        run: uv run pytest -q --cov=dbt_dag_opt --cov-report=term-missing

.github/workflows/publish.yml

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+name: Publish to PyPI
+
+on:
+  push:
+    tags: ["v*"]
+
+jobs:
+  build-and-publish:
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/project/dbt-dag-opt/
+    permissions:
+      id-token: write # required for PyPI Trusted Publishing (OIDC)
+      contents: read
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+        with:
+          enable-cache: true
+
+      - name: Set up Python
+        run: uv python install 3.12
+
+      - name: Build sdist + wheel
+        run: uv build
+
+      - name: Verify built distributions
+        run: uv run --with twine twine check dist/*
+
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -152,6 +152,10 @@ venv.bak/

 # mypy
 .mypy_cache/
+.ruff_cache/
+
+# Claude Code local state
+.claude/
 .dmypy.json
 dmypy.json

CHANGELOG.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+# Changelog
+
+All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.1.0] - 2026-04-24
+
+Initial PyPI release. Complete rewrite of the pre-release prototype.
+
+### Added
+
+- `dbt-dag-opt analyze` CLI (Typer) with two input modes:
+  - File mode: `--manifest` and `--run-results` point at local dbt artifacts.
+  - Cloud mode: `--account-id`, `--job-id`, optional `--run-id`, and `DBT_CLOUD_TOKEN` env var (or `--token`) pull artifacts from the dbt Cloud Admin API.
+- Output formats: `table` (rich terminal), `json` (valid, `jq`-friendly), `jsonl`.
+- `--top N` to limit results; `--output` to write to a file.
+- Typed exceptions (`ArtifactLoadError`, `DbtCloudAPIError`, `InvalidArtifactError`, `GraphError`).
+- Package ships with `py.typed` (PEP 561).
+- CI matrix across Python 3.10 / 3.11 / 3.12.
+- PyPI publishing via Trusted Publishers (OIDC) on tag push.
+
+### Changed (vs. prototype)
+
+- Replaced per-source recursive DFS + ProcessPoolExecutor with a single iterative DP over topological order. O(V + E) across all sources, no recursion-limit risk, no 20s per-task timeout.
+- Node weights are now attached to the *target* node of each path hop (fixes a bug where parent weights were assigned to outgoing edges).
+- Adjacency list replaces full-edge-list rescan on every DFS step.
+- Output is valid JSON by default (prototype's `longest_paths.json` was a stream of comma-separated fragments opened in append mode — not parseable).
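
The adjacency-list and node-weight items above can be illustrated with a minimal sketch. It assumes the usual dbt artifact layout (`child_map` in `manifest.json`; `results[].unique_id` and `results[].execution_time` in `run_results.json`). The helper name `build_graph` is hypothetical and is not taken from the package.

```python
def build_graph(manifest: dict, run_results: dict):
    """Return (adjacency, weights) from parsed dbt artifacts.

    Hypothetical helper for illustration only; not the package's API.
    """
    # Each node carries its own execution time in seconds.
    # Nodes with no result entry (for example sources) are simply absent
    # and can be treated as weight 0.0 by the caller.
    weights = {}
    for result in run_results.get("results", []):
        weights[result["unique_id"]] = float(result.get("execution_time") or 0.0)

    # child_map already maps a node id to the ids that depend on it,
    # so following a hop is a single dict lookup instead of an edge-list scan.
    adjacency = {
        node: list(children)
        for node, children in manifest.get("child_map", {}).items()
    }
    return adjacency, weights
```

Keeping the weight on the node rather than on its outgoing edges means a model's runtime is counted once, when a path arrives at it.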
+
+### Notes for PyPI Trusted Publishing
+
+Before the first `v*` tag is pushed, configure PyPI: Project settings → Publishing → Add GitHub publisher with `trouze/dbt-dag-opt` / workflow `publish.yml` / environment `pypi`.

README.md

Lines changed: 96 additions & 8 deletions
@@ -1,17 +1,105 @@
 # dbt-dag-opt
-Struggling with long running dbt pipelines? Use this utility to determine the most troublesome paths through your dbt DAG by total execution time. Just because a model is long running, doesn't mean improving it's run time will materially speed up your dbt jobs. Long chained models with comparatively faster runtimes can add up and slow down total pipeline execution time. This utility uses a longest path algorithm to determine your longest running paths through your DAG, starting with each of your sources in dbt.

-This package uses [Fire](https://python-fire.readthedocs.io/en/latest/) to run like a CLI. To get started, you can either run using the dbt Cloud Admin API, or pass file paths for your `manifest.json` and `run_results.json` files.
+[![CI](https://github.com/trouze/dbt-dag-opt/actions/workflows/ci.yml/badge.svg)](https://github.com/trouze/dbt-dag-opt/actions/workflows/ci.yml)
+[![PyPI](https://img.shields.io/pypi/v/dbt-dag-opt.svg)](https://pypi.org/project/dbt-dag-opt/)
+[![Python](https://img.shields.io/pypi/pyversions/dbt-dag-opt.svg)](https://pypi.org/project/dbt-dag-opt/)
+
+**Find the longest-running paths through your dbt DAG — the models that actually make your pipeline slow.**
+
+When you pay for compute by the second (Snowflake, Databricks, Redshift), your dbt job's wall-clock cost is bounded by the *critical path* through the DAG: the longest cumulative chain of model execution times. Optimizing a slow model on a short branch saves you nothing if a longer branch was already the bottleneck. `dbt-dag-opt` tells you which paths to cut first.
+
+## Install
+
+```bash
+pip install dbt-dag-opt
+```
+
+## Quickstart
+
+### From local artifacts
+
+```bash
+dbt-dag-opt analyze \
+  --manifest target/manifest.json \
+  --run-results target/run_results.json \
+  --format table \
+  --top 10
+```
+
+### From dbt Cloud
+
+```bash
+export DBT_CLOUD_TOKEN=dbtu_...
+dbt-dag-opt analyze \
+  --account-id 12345 \
+  --job-id 67890 \
+  --base-url https://cloud.getdbt.com \
+  --format table
+```
+
+Add `--run-id <id>` to pull artifacts from a specific historical run instead of the job's latest.
+
+## Sample output

-File path method
 ```
-python3 entrypoint.py --file_method=True --manifest_path='artifacts/manifest.json' --run_results_path='artifacts/run_results.json'
+Longest paths by total execution time
+┏━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┓
+┃ # ┃ Source                    ┃ End of path            ┃ Length ┃ Total time (s) ┃
+┡━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━┩
+│ 1 │ source.demo.raw.orders    │ model.demo.fact_orders │      4 │          35.00 │
+│ 2 │ source.demo.raw.customers │ model.demo.fact_orders │      4 │          32.00 │
+└───┴───────────────────────────┴────────────────────────┴────────┴────────────────┘
 ```

-dbt Cloud API
+## CLI reference
+
+```
+dbt-dag-opt analyze [OPTIONS]
+
+  --manifest PATH                 Path to manifest.json (file mode)
+  --run-results PATH              Path to run_results.json (file mode)
+  --account-id TEXT               dbt Cloud account id (cloud mode)
+  --job-id TEXT                   dbt Cloud job id (cloud mode)
+  --run-id TEXT                   dbt Cloud run id; omit for the job's latest run
+  --base-url TEXT                 dbt Cloud base URL [default: https://cloud.getdbt.com]
+  --token TEXT                    dbt Cloud API token [env: DBT_CLOUD_TOKEN]
+  -f, --format [json|jsonl|table] Output format [default: table]
+  -n, --top INTEGER               Show only top N paths (0 = all) [default: 10]
+  -o, --output PATH               Write output to a file instead of stdout
 ```
-python3 entrypoint.py --account_id='<my_id>' --job_id='<job_id>' --token='<api_token>'
-python3 entrypoint.py --base-url='https://cu288.us1.dbt.com' --account_id='70437463654419' --job_id='70437463655408' --token='dbtu_hayC4-EeNKK-lNbu5xYspNEhbLFeQK1ojfNXAC58J_qr2lRBwA'
+
+### Output formats
+
+- `table` — rich terminal table (default; what you want in a shell).
+- `json` — one object keyed by source: `{source_id: {path, distance, length}}`. Valid JSON, safe to pipe through `jq`.
+- `jsonl` — one JSON object per line. Nice for streaming into a log aggregator.
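
As a rough sketch of consuming the `json` format, the snippet below ranks sources by `distance`. It relies only on the shape stated above (`{source_id: {path, distance, length}}`). The helper name `top_paths` and the sample payload are made up for illustration, and the field types (a list of ids for `path`, a number for `distance`) are assumptions.

```python
import json


def top_paths(payload: str, n: int = 3):
    """Return (source_id, distance) pairs, heaviest path first."""
    data = json.loads(payload)
    ranked = sorted(data.items(), key=lambda kv: kv[1]["distance"], reverse=True)
    return [(source, entry["distance"]) for source, entry in ranked[:n]]


# Made-up payload in the documented shape, for illustration only.
SAMPLE = json.dumps({
    "source.demo.raw.customers": {
        "path": ["source.demo.raw.customers", "model.demo.fact_orders"],
        "distance": 32.0,
        "length": 2,
    },
    "source.demo.raw.orders": {
        "path": ["source.demo.raw.orders", "model.demo.fact_orders"],
        "distance": 35.0,
        "length": 2,
    },
})

for source, distance in top_paths(SAMPLE):
    print(f"{source}: {distance:.2f}s")
```

In practice the payload would come from a file written with `--format json --output <path>` rather than from an inline string.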
+
+## How it works
+
+1. **Load** `manifest.json` and `run_results.json` (from disk or dbt Cloud's Admin API).
+2. **Build** a weighted DAG: nodes are `model.*` / `source.*` / `seed.*` / `snapshot.*` ids; each node's weight is its `execution_time` in seconds.
+3. **Compute** the longest path from each source using an iterative DP over topological order (O(V + E)).
+4. **Sort** paths by total distance and surface the heaviest ones.
+
+Distances sum the execution time of every node along the path — that's the warehouse-seconds you'd save by zeroing out that chain.
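
Steps 3 and 4 can be sketched as follows. This is a minimal illustration of a longest-path DP over a topological order, not the package's actual implementation. The function name `longest_paths`, the input shapes (a dict adjacency list and a dict of node weights), the sample graph, and the reading of `length` as the number of nodes on the path are all assumptions.

```python
from collections import deque


def longest_paths(adjacency, weights, sources):
    """Heaviest path starting at each source, with weights on nodes."""
    nodes = set(adjacency) | set(sources)
    for children in adjacency.values():
        nodes.update(children)

    # Kahn's algorithm: an iterative topological sort (no recursion).
    indegree = dict.fromkeys(nodes, 0)
    for children in adjacency.values():
        for child in children:
            indegree[child] += 1
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in adjacency.get(node, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")

    # One reverse pass: best[n] = (heaviest distance starting at n, next hop).
    best = {}
    for node in reversed(order):
        nxt = max(adjacency.get(node, ()), key=lambda c: best[c][0], default=None)
        tail = best[nxt][0] if nxt is not None else 0.0
        best[node] = (weights.get(node, 0.0) + tail, nxt)

    # Rebuild each source's path by following the stored next hops.
    result = {}
    for source in sources:
        path, node = [], source
        while node is not None:
            path.append(node)
            node = best[node][1]
        result[source] = {"path": path, "distance": best[source][0], "length": len(path)}
    return result


# Made-up graph and timings, for illustration only.
ADJACENCY = {
    "source.raw.orders": ["model.stg_orders"],
    "model.stg_orders": ["model.int_orders"],
    "model.int_orders": ["model.fact_orders"],
    "source.raw.customers": ["model.stg_customers"],
    "model.stg_customers": ["model.dim_customers"],
    "model.dim_customers": ["model.fact_orders"],
}
WEIGHTS = {
    "model.stg_orders": 5.0, "model.int_orders": 10.0, "model.fact_orders": 20.0,
    "model.stg_customers": 2.0, "model.dim_customers": 10.0,
}
SOURCES = ["source.raw.orders", "source.raw.customers"]

paths = longest_paths(ADJACENCY, WEIGHTS, SOURCES)
for source, info in sorted(paths.items(), key=lambda kv: kv[1]["distance"], reverse=True):
    print(source, info["length"], info["distance"])
```

The single reverse pass visits every node and edge once, which is where the O(V + E) bound comes from; rebuilding a path adds work proportional to that path's length.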
+
+## What this is / isn't
+
+It **is** a CLI tool that points at the slowest chains in your DAG.
+
+It **isn't** (yet):
+- A scheduler simulator. If your dbt `threads` setting is low, total wall-clock is bounded by parallelism *and* the critical path; v0.2 will surface both. For now, treat the critical-path distance as a lower bound.
+- A cost model. Multiplying distance × your warehouse rate is on you — a `--warehouse-size` flag is planned for v0.3.
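
To make the two caveats above concrete, here is a small arithmetic sketch. The `total_work / threads` term is the standard work bound from scheduling theory and is not something the tool reports today. Both function names, the per-second rate, and the numbers are hypothetical.

```python
def wall_clock_lower_bound(critical_path_s: float, total_work_s: float, threads: int) -> float:
    """Lower bound on job wall-clock: no schedule beats either term."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return max(critical_path_s, total_work_s / threads)


def estimated_cost(distance_s: float, rate_per_second: float) -> float:
    """Distance multiplied by a warehouse rate you supply yourself."""
    return distance_s * rate_per_second


# 35 s critical path, 200 s of total model time, 4 threads:
# parallelism is the tighter limit here (200 / 4 = 50 s).
print(wall_clock_lower_bound(35.0, 200.0, 4))
print(estimated_cost(35.0, 0.5))
```

With enough threads the first term dominates, which is the situation the critical-path distance describes.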
+
+## Development
+
+```bash
+uv sync --all-extras
+uv run ruff check .
+uv run mypy src
+uv run pytest
 ```

-The utility will save a json file to your working directory that has information on the longest path in your DAG for each starting node (usually sources). It's recommended to use this information to divide and conquer what models you should seek to optimize in order to shorten your pipeline runtimes.
+## License
+
+Apache 2.0 — see [LICENSE](LICENSE).

entrypoint.py

Lines changed: 0 additions & 56 deletions
This file was deleted.
