# Blob Audit and Lineage Tracking

## Problem

DVX stores provenance per-artifact (`meta.computation` in `.dvc` files), but there's no way to query across artifacts: which blobs are used where, which are still necessary, which are orphaned, and what the full dependency graph looks like. DVC's `gc` operates at the hash level (keep referenced hashes, delete unreferenced), but doesn't reason about the *relationships* between blobs across commits.

## Use Cases

### 1. "What blobs does this commit need?"
Given a commit SHA, enumerate all `.dvc` files, their output hashes, and transitively all input hashes. Answer: "to fully reproduce this commit's state, you need these N blobs totaling X MB."
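The closure computation can be sketched as follows. The helper name `needed_blobs` and the dict shape are illustrative, not DVX's actual API; it assumes each `.dvc` file at the commit has already been parsed into a record of output hashes/sizes and dependency hashes:

```python
def needed_blobs(artifacts: list[dict]) -> tuple[set[str], int]:
    """Return (blob hashes, total size) needed to reproduce a commit.

    `artifacts` holds one entry per .dvc file at the commit:
      {"outs": [{"md5": ..., "size": ...}], "deps": ["<md5>", ...]}
    """
    hashes: set[str] = set()
    total = 0
    for art in artifacts:
        for out in art.get("outs", []):
            if out["md5"] not in hashes:
                hashes.add(out["md5"])
                total += out.get("size", 0)
        # Deps produced by another artifact are sized when that artifact's
        # own .dvc file is visited; dep-only hashes (no producing .dvc
        # file) are included without a size.
        hashes.update(art.get("deps", []))
    return hashes, total
```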

### 2. "Where is this blob used?"
Given a blob hash (or path), find all commits/branches/tags that reference it — either as a direct output or as a transitive dependency. Answer: "this blob is referenced by 3 commits on `main` and 1 tag."

### 3. "Which blobs are generated vs. input?"
Classify all blobs in the cache (or remote) by their provenance:
- **Input**: no computation, added directly via `dvx add` or `dvx import-url`
- **Generated**: has computation, output of `dvx run`
- **Foreign**: imported via `--no-download`, tracked by ETag but not cached locally
- **Orphaned**: in cache but not referenced by any `.dvc` file in any branch/tag/commit

### 4. "What's the minimal cache for this branch?"
Given a branch, compute the minimal set of blobs needed:
- All input blobs (not reproducible)
- Generated blobs only if their inputs are unavailable
- Total size of irreducible inputs
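The selection rule above can be sketched as a single pass. The flat dict records, and the rule that a generated blob is kept whenever any of its dependency hashes is missing from the scanned set, are simplifying assumptions:

```python
def minimal_cache(blobs: list[dict]) -> tuple[set[str], int]:
    """Return (blob hashes to keep, total size) for the minimal cache.

    A blob is irreducible if it is an input, or if it is generated but
    some dependency hash is not among the scanned blobs (so it could
    not be regenerated from what we keep).
    """
    known = {b["md5"] for b in blobs}
    keep: set[str] = set()
    size = 0
    for b in blobs:
        irreducible = b["kind"] == "input" or any(
            d not in known for d in b.get("deps", [])
        )
        if irreducible:
            keep.add(b["md5"])
            size += b["size"]
    return keep, size
```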

### 5. "Can I safely delete this remote blob?"
Before deleting a blob from S3, check:
- Is it an input blob? (If so, it's irreplaceable — don't delete unless another copy exists)
- Is it generated? (Safe to delete if inputs are available)
- Is it referenced by any commit? (If not, it's orphaned — safe to delete)
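This checklist maps onto a small decision function. The sketch below uses hypothetical parameter names and a deliberately conservative default:

```python
def safe_to_delete(kind: str, referenced: bool,
                   inputs_available: bool = False,
                   other_copy_exists: bool = False) -> bool:
    """Conservative delete policy for a remote blob (sketch)."""
    if not referenced:
        return True                   # orphaned: no commit points at it
    if kind == "input":
        return other_copy_exists      # irreplaceable otherwise
    if kind == "generated":
        return inputs_available       # can be regenerated on demand
    return False                      # foreign/unknown: keep it
```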

## Implementation

### Module structure

```
src/dvx/audit/
  __init__.py          # Exports: scan_workspace, audit_artifact, find_orphans
  model.py             # BlobKind, Reproducibility, BlobInfo, AuditSummary
  scan.py              # Scanning and classification logic
src/dvx/cli/audit.py   # Click command
```

### Data model (`model.py`)

- **`BlobKind`** enum: `INPUT`, `GENERATED`, `FOREIGN`, `ORPHANED`
- **`Reproducibility`** enum: `REPRODUCIBLE`, `NOT_REPRODUCIBLE`, `UNKNOWN`
- **`BlobInfo`** dataclass: path, md5, size, kind, reproducible, cmd, deps, git_deps, in_local_cache, in_remote_cache, is_dir, nfiles
- **`AuditSummary`** dataclass: blobs list + computed aggregates (counts/sizes by kind, cache stats)
  - `to_dict()` for JSON serialization — this is the data contract for the web UI
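A minimal sketch of `model.py` under the field lists above (abridged: the real `BlobInfo` also carries `cmd`, `deps`, `git_deps`, remote-cache and directory fields, and `AuditSummary` computes more aggregates):

```python
from dataclasses import asdict, dataclass, field
from enum import Enum

class BlobKind(Enum):
    INPUT = "input"
    GENERATED = "generated"
    FOREIGN = "foreign"
    ORPHANED = "orphaned"

class Reproducibility(Enum):
    REPRODUCIBLE = "reproducible"
    NOT_REPRODUCIBLE = "not_reproducible"
    UNKNOWN = "unknown"

@dataclass
class BlobInfo:
    path: str
    md5: str
    size: int
    kind: BlobKind
    reproducible: Reproducibility = Reproducibility.UNKNOWN
    in_local_cache: bool = False

@dataclass
class AuditSummary:
    blobs: list[BlobInfo] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Data contract for the web UI: enums serialized to their values.
        return {
            "blobs": [
                {**asdict(b), "kind": b.kind.value,
                 "reproducible": b.reproducible.value}
                for b in self.blobs
            ],
            "total_size": sum(b.size for b in self.blobs),
        }
```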

### Classification logic (`scan.py`)

`classify_blob(DVCFileInfo) → (BlobKind, Reproducibility)`:
- No md5 on outs → `FOREIGN`
- Has `cmd` → `GENERATED`
  - `meta.reproducible: false` → `NOT_REPRODUCIBLE`
  - `meta.reproducible: true` or absent → `REPRODUCIBLE` (positive default)
- No cmd → `INPUT`
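These rules collapse to a small function. This sketch takes a plain dict rather than the real `DVCFileInfo`, and returns strings in place of the enums:

```python
def classify_blob(info: dict) -> tuple[str, str]:
    """Classify one artifact record into (kind, reproducibility)."""
    if info.get("md5") is None:
        return "foreign", "unknown"        # tracked by ETag only, no hash
    if info.get("cmd"):
        if info.get("reproducible") is False:
            return "generated", "not_reproducible"   # explicit opt-out
        return "generated", "reproducible"           # positive default
    return "input", "not_reproducible"     # added directly, no recipe
```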

Reused from existing code:
- `find_dvc_files()` from `cache.py` — discovers `.dvc` files
- `read_dvc_file()` from `dvc_files.py` — parses `.dvc` YAML
- `check_local_cache()` from `cache.py` — checks local cache existence
- `read_dir_manifest()` from `dvc_files.py` — reads directory manifest entries for orphan detection

### Orphan detection (`find_orphans`)

1. Collect all md5 hashes referenced by `.dvc` files (output hashes + dep hashes + dir manifest entries)
2. Walk `.dvc/cache/files/md5/` to enumerate all cached blobs
3. Return blobs not in the referenced set, with their sizes

### CLI (`cli/audit.py`)

```
dvx audit                      # workspace summary
dvx audit <path>               # per-artifact lineage
dvx audit -o/--orphans         # list unreferenced cache blobs
dvx audit -g/--graph           # DOT dependency graph (colored by kind)
dvx audit --json               # machine-readable (any mode)
dvx audit -r/--remote <name>   # also check remote cache
dvx audit -j/--jobs N          # parallel workers for remote checks
```

### DVCFileInfo extension

Added a `reproducible: bool | None = None` field to `DVCFileInfo` in `dvc_files.py`, read from `meta.reproducible` in `read_dvc_file()`.

### Output formats

**Summary** (no args):
```
Blobs in workspace: 34
  Input: 18 (810 MB)
  Generated: 16 (80 MB, 14 reproducible)
  Foreign: 0

Local cache: 30 of 34 (870 MB)
  Missing: 4 (20 MB)
```

**Per-artifact** (`dvx audit <path>`):
```
Path: www/public/taxes-2025-lots.geojson
MD5: abc123...
Size: 22.8 MB
Type: Generated (reproducible)
Command: python -m jc_taxes.geojson_yearly --year 2025 --agg lot

Dependencies (2 data + 2 code):
  [data] data/taxrecords_enriched.parquet (def456...)
  [code] src/jc_taxes/geojson_yearly.py (git: aabbcc)

Cache: local=yes remote=not checked
```

**Orphans** (`dvx audit --orphans`):
```
4 orphaned blob(s) (12 MB):
  a433cf78... (8.2 MB)
  b782ee41... (3.8 MB)
```

**JSON** (`dvx audit --json`): full `AuditSummary.to_dict()` serialized.

**Graph** (`dvx audit --graph`): Graphviz DOT with kind-based node coloring:
- Input = palegreen, Generated = lightblue (lighter if reproducible), Foreign = gray dashed
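A sketch of the DOT emission with this coloring (`to_dot` is a hypothetical helper; the real command would work from `BlobInfo` records rather than dicts, and the lighter reproducible shade is omitted here):

```python
COLORS = {"input": "palegreen", "generated": "lightblue", "foreign": "gray"}

def to_dot(blobs: list[dict]) -> str:
    """Emit a Graphviz DOT graph with nodes colored by blob kind."""
    lines = ["digraph audit {", "  node [style=filled]"]
    for b in blobs:
        color = COLORS.get(b["kind"], "white")
        # Foreign blobs get a dashed outline on top of the fill.
        extra = ', style="filled,dashed"' if b["kind"] == "foreign" else ""
        lines.append(f'  "{b["path"]}" [fillcolor={color}{extra}]')
        for dep in b.get("deps", []):
            lines.append(f'  "{dep}" -> "{b["path"]}"')
    lines.append("}")
    return "\n".join(lines)
```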

### Integration with `dvx dag`

The graph output reuses the same conceptual DAG structure as `dvx dag` but colors nodes by `BlobKind` rather than by position (root/leaf/middle). For full graph features (clustering, Mermaid, HTML), use `dvx dag`.

## ML Pipeline Considerations

Large-scale ML training pipelines amplify the audit problem. A single training run may produce:
- **Checkpoints**: dozens of multi-GB model snapshots at different training steps
- **Evaluation artifacts**: metrics, predictions, confusion matrices at each checkpoint
- **Intermediate data**: preprocessed datasets, tokenized corpora, embedding caches

The `meta.reproducible: false` opt-out is particularly important here — expensive training outputs should be explicitly marked non-reproducible to prevent accidental eviction.

## Future (not this PR)

- Cross-commit scanning (which commits reference which blobs)
- `dvx gc --evict-reproducible` (uses audit classification) — see `evictable-generated-blobs.md`
- Remote cache size analysis
- SQLite index for large repos
- UI extension: audit view tab in `ui/` (Vite + React + @xyflow/react)

## Open Questions

- How expensive is scanning `.dvc` files across all commits? For repos with thousands of commits and hundreds of `.dvc` files, this could be slow. The SQLite index amortizes this but adds maintenance burden.
- Should `dvx audit` also check dep *availability* (can this blob actually be regenerated right now)? This requires checking that all transitive inputs exist in cache or remote.
- Should lineage be queryable in the other direction? ("What outputs does this input produce?" — useful for impact analysis when an input changes.)
- For ML pipelines: should DVX integrate with experiment trackers (W&B, MLflow) to correlate blob lineage with training metrics? Or should it stay purely at the data layer and let users join the two?
- Should `dvx audit` support a `--cost` flag that estimates regeneration cost from historical run times logged in `meta.computation`?