
Commit 17b64ba

ryan-williams and claude committed
Add dvx audit — blob classification, lineage, and cache analysis
New command classifies workspace blobs as input/generated/foreign, checks local cache status, detects orphaned cache entries, and outputs colored DOT dependency graphs. Adds `reproducible` field to `DVCFileInfo` (read from `meta.reproducible`); blobs with `computation` default to reproducible unless explicitly opted out. Renames `evictable` → `reproducible` in specs (positive default, opt-out via `meta.reproducible: false`).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 28c6b9a commit 17b64ba

File tree

9 files changed: +875 −0 lines changed

README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -276,6 +276,7 @@ with Repo() as repo:
 | `add` | Track file(s) with optional provenance |
 | `status` | Show freshness of tracked files (data & deps) |
 | `diff` | Content diff with preprocessing support |
+| `audit` | Blob classification, lineage, and cache analysis |
 | `cache` | Inspect cache (path, md5, dir) |
 | `cat` | View cached file contents |
 | `push` | Upload data to remote storage |
@@ -299,6 +300,7 @@ with Repo() as repo:
 ### Added in DVX
 - `dvx run` - Parallel pipeline execution with per-file provenance
 - `dvx diff` preprocessing - Pipe through commands before diffing (with `{}` placeholder)
+- `dvx audit` - Blob classification (input/generated/foreign), lineage, orphan detection
 - `dvx cache path/md5` - Cache introspection
 - `dvx cat` - View cached files directly
 - `dvx status --yaml` - Detailed status with hashes
```

specs/blob-audit-lineage.md

Lines changed: 158 additions & 0 deletions
# Blob Audit and Lineage Tracking

## Problem

DVX stores provenance per-artifact (`meta.computation` in `.dvc` files), but there's no way to query across artifacts: which blobs are used where, which are still necessary, which are orphaned, and what the full dependency graph looks like. DVC's `gc` operates at the hash level (keep referenced hashes, delete unreferenced) but doesn't reason about the *relationships* between blobs across commits.

## Use Cases

### 1. "What blobs does this commit need?"

Given a commit SHA, enumerate all `.dvc` files, their output hashes, and transitively all input hashes. Answer: "to fully reproduce this commit's state, you need these N blobs totaling X MB."
### 2. "Where is this blob used?"

Given a blob hash (or path), find all commits/branches/tags that reference it — either as a direct output or as a transitive dependency. Answer: "this blob is referenced by 3 commits on `main` and 1 tag."

### 3. "Which blobs are generated vs. input?"

Classify all blobs in the cache (or remote) by their provenance:

- **Input**: no computation, added directly via `dvx add` or `dvx import-url`
- **Generated**: has computation, output of `dvx run`
- **Foreign**: imported via `--no-download`, tracked by ETag but not cached locally
- **Orphaned**: in cache but not referenced by any `.dvc` file in any branch/tag/commit

### 4. "What's the minimal cache for this branch?"

Given a branch, compute the minimal set of blobs needed:

- All input blobs (not reproducible)
- Generated blobs only if their inputs are unavailable
- Total size of irreducible inputs

### 5. "Can I safely delete this remote blob?"

Before deleting a blob from S3, check:

- Is it an input blob? (If so, it's irreplaceable — don't delete unless another copy exists)
- Is it generated? (Safe to delete if inputs are available)
- Is it referenced by any commit? (If not, it's orphaned — safe to delete)
## Implementation

### Module structure

```
src/dvx/audit/
  __init__.py        # Exports: scan_workspace, audit_artifact, find_orphans
  model.py           # BlobKind, Reproducibility, BlobInfo, AuditSummary
  scan.py            # Scanning and classification logic
src/dvx/cli/audit.py # Click command
```
### Data model (`model.py`)

- **`BlobKind`** enum: `INPUT`, `GENERATED`, `FOREIGN`, `ORPHANED`
- **`Reproducibility`** enum: `REPRODUCIBLE`, `NOT_REPRODUCIBLE`, `UNKNOWN`
- **`BlobInfo`** dataclass: path, md5, size, kind, reproducible, cmd, deps, git_deps, in_local_cache, in_remote_cache, is_dir, nfiles
- **`AuditSummary`** dataclass: blobs list + computed aggregates (counts/sizes by kind, cache stats)
- `to_dict()` for JSON serialization — this is the data contract for the web UI
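A minimal sketch of these dataclasses — hypothetical and abbreviated (the real `BlobInfo` also carries the cmd/deps/git_deps and cache/dir fields listed above; defaults and helper names here are assumptions):

```python
# Hypothetical sketch of model.py; field set abbreviated, defaults assumed.
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class BlobKind(Enum):
    INPUT = "input"
    GENERATED = "generated"
    FOREIGN = "foreign"
    ORPHANED = "orphaned"


class Reproducibility(Enum):
    REPRODUCIBLE = "reproducible"
    NOT_REPRODUCIBLE = "not_reproducible"
    UNKNOWN = "unknown"


@dataclass
class BlobInfo:
    path: str
    md5: str | None
    size: int
    kind: BlobKind
    reproducible: Reproducibility
    in_local_cache: bool = False

    def to_dict(self) -> dict:
        # Serialize enums to strings so the result is JSON-ready.
        return {
            "path": self.path,
            "md5": self.md5,
            "size": self.size,
            "kind": self.kind.value,
            "reproducible": self.reproducible.value,
            "in_local_cache": self.in_local_cache,
        }


@dataclass
class AuditSummary:
    blobs: list[BlobInfo] = field(default_factory=list)

    def size_by_kind(self, kind: BlobKind) -> int:
        # Aggregate total bytes for one classification bucket.
        return sum(b.size for b in self.blobs if b.kind == kind)

    def to_dict(self) -> dict:
        return {"blobs": [b.to_dict() for b in self.blobs]}
```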
### Classification logic (`scan.py`)

`classify_blob(DVCFileInfo) → (BlobKind, Reproducibility)`:

- No md5 on outs → `FOREIGN`
- Has `cmd` → `GENERATED`
  - `meta.reproducible: false` → `NOT_REPRODUCIBLE`
  - `meta.reproducible: true` or absent → `REPRODUCIBLE` (positive default)
- No `cmd` → `INPUT`
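These rules reduce to a small pure function. The sketch below is illustrative: `DVCFileInfo` is a simplified stand-in for the real dataclass in `dvc_files.py`, and plain strings stand in for the enums:

```python
# Illustrative sketch of classify_blob; the real DVCFileInfo has more fields.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DVCFileInfo:
    md5: str | None            # output hash (None for --no-download imports)
    cmd: str | None            # command from meta.computation, if any
    reproducible: bool | None  # meta.reproducible; None when absent


def classify_blob(info: DVCFileInfo) -> tuple[str, str]:
    # No output hash: a foreign import tracked by ETag only.
    if info.md5 is None:
        return ("FOREIGN", "UNKNOWN")
    # A recorded command means the blob was generated by dvx run.
    if info.cmd is not None:
        # Positive default: reproducible unless explicitly opted out.
        repro = "NOT_REPRODUCIBLE" if info.reproducible is False else "REPRODUCIBLE"
        return ("GENERATED", repro)
    # Plain dvx add'd data: an input, not regenerable from anything.
    return ("INPUT", "NOT_REPRODUCIBLE")
```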
Reuses from existing code:

- `find_dvc_files()` from `cache.py` — discovers `.dvc` files
- `read_dvc_file()` from `dvc_files.py` — parses `.dvc` YAML
- `check_local_cache()` from `cache.py` — checks local cache existence
- `read_dir_manifest()` from `dvc_files.py` — reads directory manifest entries for orphan detection

### Orphan detection (`find_orphans`)

1. Collect all md5 hashes referenced by `.dvc` files (output hashes + dep hashes + dir manifest entries)
2. Walk `.dvc/cache/files/md5/` to enumerate all cached blobs
3. Return blobs not in the referenced set, with their sizes
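The cache walk (steps 2–3) can be sketched as follows, assuming DVC's two-level cache layout where an md5 is split into a 2-hex-char directory plus a 30-char filename:

```python
# Minimal sketch of the orphan walk over .dvc/cache/files/md5/.
from __future__ import annotations

from pathlib import Path


def find_orphans(cache_dir: Path, referenced: set[str]) -> list[tuple[str, int]]:
    """Return (md5, size) pairs for cached blobs not in the referenced set."""
    orphans = []
    for prefix_dir in sorted(cache_dir.iterdir()):
        if not prefix_dir.is_dir():
            continue
        for blob in sorted(prefix_dir.iterdir()):
            md5 = prefix_dir.name + blob.name  # rejoin the split hash
            if md5 not in referenced:
                orphans.append((md5, blob.stat().st_size))
    return orphans
```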
### CLI (`cli/audit.py`)

```
dvx audit                    # workspace summary
dvx audit <path>             # per-artifact lineage
dvx audit -o/--orphans       # list unreferenced cache blobs
dvx audit -g/--graph         # DOT dependency graph (colored by kind)
dvx audit --json             # machine-readable (any mode)
dvx audit -r/--remote <name> # also check remote cache
dvx audit -j/--jobs N        # parallel workers for remote checks
```
### DVCFileInfo extension

Added `reproducible: bool | None = None` field to `DVCFileInfo` in `dvc_files.py`, read from `meta.reproducible` in `read_dvc_file()`.
### Output formats

**Summary** (no args):

```
Blobs in workspace: 34
  Input:     18 (810 MB)
  Generated: 16 (80 MB, 14 reproducible)
  Foreign:    0

Local cache: 30 of 34 (870 MB)
  Missing: 4 (20 MB)
```

**Per-artifact** (`dvx audit <path>`):

```
Path: www/public/taxes-2025-lots.geojson
MD5:  abc123...
Size: 22.8 MB
Type: Generated (reproducible)
Command: python -m jc_taxes.geojson_yearly --year 2025 --agg lot

Dependencies (2 data + 2 code):
  [data] data/taxrecords_enriched.parquet (def456...)
  [code] src/jc_taxes/geojson_yearly.py (git: aabbcc)

Cache: local=yes remote=not checked
```

**Orphans** (`dvx audit --orphans`):

```
4 orphaned blob(s) (12 MB):
  a433cf78... (8.2 MB)
  b782ee41... (3.8 MB)
```

**JSON** (`dvx audit --json`): full `AuditSummary.to_dict()` serialized.

**Graph** (`dvx audit --graph`): Graphviz DOT with kind-based node coloring:

- Input = palegreen, Generated = lightblue (lighter if reproducible), Foreign = gray dashed
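A sketch of the DOT emission with that color scheme. This is a simplified rendering: the exact shade names (`lightblue1`/`lightblue3`) are assumptions, and the real command also emits dependency edges, omitted here:

```python
# Hypothetical DOT emitter; nodes only, edge emission omitted for brevity.
from __future__ import annotations


def to_dot(blobs: list[tuple[str, str, bool]]) -> str:
    """blobs: (path, kind, reproducible) triples -> Graphviz DOT source."""
    lines = ["digraph audit {"]
    for path, kind, reproducible in blobs:
        if kind == "input":
            attrs = "fillcolor=palegreen, style=filled"
        elif kind == "generated":
            # Lighter shade when the blob can be regenerated.
            color = "lightblue1" if reproducible else "lightblue3"
            attrs = f"fillcolor={color}, style=filled"
        else:  # foreign
            attrs = "color=gray, style=dashed"
        lines.append(f'  "{path}" [{attrs}];')
    lines.append("}")
    return "\n".join(lines)
```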
### Integration with `dvx dag`

The graph output reuses the same conceptual DAG structure as `dvx dag` but colors nodes by `BlobKind` rather than by position (root/leaf/middle). For full graph features (clustering, Mermaid, HTML), use `dvx dag`.

## ML Pipeline Considerations

Large-scale ML training pipelines amplify the audit problem. A single training run may produce:

- **Checkpoints**: dozens of multi-GB model snapshots at different training steps
- **Evaluation artifacts**: metrics, predictions, confusion matrices at each checkpoint
- **Intermediate data**: preprocessed datasets, tokenized corpora, embedding caches

The `meta.reproducible: false` opt-out is particularly important here — expensive training outputs should be explicitly marked non-reproducible to prevent accidental eviction.

## Future (not this PR)

- Cross-commit scanning (which commits reference which blobs)
- `dvx gc --evict-reproducible` (uses audit classification) — see `evictable-generated-blobs.md`
- Remote cache size analysis
- SQLite index for large repos
- UI extension: audit view tab in `ui/` (Vite + React + @xyflow/react)

## Open Questions

- How expensive is scanning `.dvc` files across all commits? For repos with thousands of commits and hundreds of `.dvc` files, this could be slow. The SQLite index amortizes this but adds maintenance burden.
- Should `dvx audit` also check dep *availability* (can this blob actually be regenerated right now)? This requires checking that all transitive inputs exist in cache or remote.
- Should lineage be queryable in the other direction? ("What outputs does this input produce?" — useful for impact analysis when an input changes.)
- For ML pipelines: should DVX integrate with experiment trackers (W&B, MLflow) to correlate blob lineage with training metrics? Or should it stay purely at the data layer and let users join the two?
- Should `dvx audit` support a `--cost` flag that estimates regeneration cost from historical run times logged in `meta.computation`?

specs/evictable-generated-blobs.md

Lines changed: 141 additions & 0 deletions
# Reproducible Generated Blobs

## Problem

DVX tracks both **generated** artifacts (have `meta.computation` with `cmd` + `deps`) and **input** artifacts (no computation, or foreign imports). Currently GC treats them identically — keep or evict based on git scope. But generated artifacts are fundamentally different: they can be **regenerated** from their inputs + code at any commit, making them safe to evict from cache even when referenced.

This matters for projects like jc-taxes, where ~450MB of GeoJSON is regenerated from deterministic code + parquet inputs. Every regen creates 16 new cache blobs; old versions accumulate even though they're reproducible from the corresponding code.
## Concepts

### Blob taxonomy

| | **Has computation** | **No computation** |
|--|--------------------|--------------------|
| **DVX-tracked** (`.dvc` file) | Generated: `dvx run` output | Input: `dvx add`'d raw data |
| **Foreign** (`import-url --no-download`) | n/a | External: tracked by ETag, not cached |

Generated blobs are the only ones that can be safely evicted without data loss, because they satisfy:

1. Their `.dvc` file records the exact `cmd` + `deps` + `git_deps`
2. At any git commit, the code (`git_deps`) and data inputs (`deps`) are pinned
3. Re-running the command with those inputs reproduces the output

### Reproducibility spectrum

Not all generated blobs are equally reproducible. The `reproducible` flag encodes a **confidence level about reproducibility**:

| Level | Example | Reproducible? | Notes |
|-------|---------|---------------|-------|
| **Bit-reproducible** | jc-taxes GeoJSON from deterministic Python | Yes (default) | Same inputs + code → byte-identical output |
| **Semantically reproducible** | ML inference, deterministic but float-sensitive | Mostly | May differ in the last bits due to hardware/library versions |
| **Reproducible with seed** | Single-GPU training with fixed seed | Cautiously | Reproducible given the same hardware + library versions |
| **Non-reproducible** | Distributed training with weight-update races | No | Even with same inputs, output differs per run |
Note: even "non-reproducible" training is becoming tractable — Stanford's Marin 8B (trained with JAX on TPUs) achieved bit-reproducibility for large-scale training, specifically to enable debugging loss spikes in expensive runs. But this is the exception; most distributed training has inherent nondeterminism from collective-ops ordering.

**Key insight**: reproducibility is the positive default. Blobs with `computation` are assumed reproducible unless explicitly marked `meta.reproducible: false`. A `dvx run` output that took 10,000 GPU-hours to produce should be explicitly marked non-reproducible by the user.

### Reproducibility criteria

A generated blob is **reproducible** (and thus safe to evict) when:

- It has a `computation` block in its `.dvc` file
- It is NOT marked `meta.reproducible: false` (opt-out)
- All its `deps` are available (either cached or themselves reproducible)
- All its `git_deps` are available (always true — they're in git)

An input blob is **never evictable** unless backed up to a remote (the existing `--safe` GC behavior).
## Proposed Changes

### 1. `.dvc` file: `reproducible` flag

The `meta.reproducible` field is opt-out — present only when a generated blob is NOT reproducible:

```yaml
outs:
- md5: abc123
  size: 22851069
  path: model-checkpoint.pt

meta:
  reproducible: false
  computation:
    cmd: python train.py --steps=50000 --seed=42
    deps:
      data/training_set.parquet: def456
    git_deps:
      train.py: aabbcc
```

- Default (absent): blobs with `computation` are assumed reproducible
- `reproducible: false` — blob cannot be reliably regenerated, keep in cache
- Classification is surfaced by `dvx audit` (see blob-audit-lineage spec)
### 2. `dvx gc --evict-reproducible`

New GC mode: evict blobs that are reproducible (have `computation` and are not marked `reproducible: false`), keeping only input blobs and non-reproducible outputs.

```bash
# Evict reproducible blobs not used in current workspace
dvx gc -w --evict-reproducible

# Evict reproducible blobs from all commits (keep only inputs in cache)
dvx gc -A --evict-reproducible

# Dry run
dvx gc -w --evict-reproducible --dry
```

Logic:

1. Identify all `.dvc` files in scope
2. For each, check if it has `computation` and is NOT `meta.reproducible: false`
3. If reproducible: remove from cache (local and/or remote per `--cloud`)
4. If not reproducible: keep (standard GC behavior)
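The per-file check (steps 2–4) can be sketched as a predicate over parsed `.dvc` metadata. The dict shape mirrors the YAML example earlier in this spec and is an assumption about what the real parser returns:

```python
# Sketch of the eviction filter; dvc_meta is the parsed .dvc YAML as a dict.
def should_evict(dvc_meta: dict) -> bool:
    """True when a blob is reproducible and therefore safe to drop from cache."""
    meta = dvc_meta.get("meta") or {}
    # Only generated blobs (those with a recorded computation) are candidates.
    if "computation" not in meta:
        return False
    # Opt-out: blobs explicitly marked non-reproducible are kept.
    return meta.get("reproducible") is not False
```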
### 3. Non-reproducible marking

For expensive or non-deterministic outputs, mark them explicitly:

```yaml
# In the .dvc file, set meta.reproducible: false
meta:
  reproducible: false
  computation:
    cmd: ...
```

This can be done manually in the YAML, or via a future `dvx mark --not-reproducible <path>` command.

### 4. `dvx regen` (optional, future)

Regenerate evicted blobs on demand:

```bash
# Regenerate a specific evicted blob
dvx regen www/public/taxes-2025-lots.geojson

# Regenerate all evicted blobs needed for current workspace
dvx regen --workspace
```

Uses `meta.computation.cmd` with the pinned deps. This is essentially `dvx run`, but specifically targeting blobs that were evicted.
## Interaction with `dvx push`

When pushing, reproducible blobs could optionally be skipped:

```bash
# Push only non-reproducible (input) blobs
dvx push --skip-reproducible

# Push everything (default, current behavior)
dvx push
```

This saves remote storage for blobs that are reproducible. The tradeoff: regenerating is slower than pulling from cache, but for deterministic pipelines the remote copy is pure redundancy.

## Open Questions

- Should there be a project-level default in `.dvc/config`? E.g., `core.assume_reproducible = true` (already the default behavior) or `false` to require explicit opt-in.
- Should there be a "confidence" level? E.g., `reproducible: true` (fully reproducible) vs `reproducible: expensive` (reproducible but costly)?
- How does this interact with remote storage billing? If reproducible blobs are skipped on push, the remote is smaller but regen requires local compute.

src/dvx/audit/__init__.py

Lines changed: 14 additions & 0 deletions
```python
"""DVX blob audit — classification, lineage, and cache analysis."""

from dvx.audit.model import AuditSummary, BlobInfo, BlobKind, Reproducibility
from dvx.audit.scan import audit_artifact, find_orphans, scan_workspace

__all__ = [
    "AuditSummary",
    "BlobInfo",
    "BlobKind",
    "Reproducibility",
    "audit_artifact",
    "find_orphans",
    "scan_workspace",
]
```
