Add persistent dependency inference cache for incremental --changed-dependents#23228
jasonwbarnett wants to merge 6 commits into pantsbuild:main
Implement IncrementalDependents subsystem that persists the forward dependency graph to disk. When enabled via --incremental-dependents-enabled, only targets whose BUILD files or source files have changed (based on mtime+size fingerprinting) need their dependencies re-resolved. This dramatically reduces wall time for --changed-dependents=transitive in large monorepos by avoiding redundant dependency inference on unchanged targets across pantsd restarts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arse

Address.parse() fails on bare spec strings like "src/python/foo.py:bar" because it expects a "//" prefix. Instead, build a spec→Address lookup dict from AllUnexpandedTargets for O(1) resolution of cached dep specs. Also simplify CachedEntry to store deps as spec strings directly rather than structured JSON tuples, and remove now-unused serialization helpers.

Results (52,927-target monorepo):
- Cold cache: 3m12s (same as before, writes 29MB cache)
- Warm cache: 38s (dep graph in 1.6s, 52,927 targets from cache)
- 5x speedup on warm cache, 100% identical output

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mtime-based fingerprinting fails across machines because git clone sets all file mtimes to the checkout timestamp, making the cache useless on CI agents. SHA-256 content hashing costs only ~5 seconds more for 18K files but makes the cache fully portable.

Benchmark (52,927 targets):
- Cold cache: 3m22s (writes cache)
- Warm cache: 43s (sha256 fingerprints, 100% cache hits)
- Cross-machine: cache is portable via S3 (1.3MB compressed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These .claude/worktrees/ entries were accidentally staged by git add -A and are not part of the persistent dep cache changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
159893c to 18fdfb0
- Unit tests for CachedEntry, save/load roundtrip, JSON edge cases
- Unit tests for SHA-256 file hashing
- Unit tests for compute_source_fingerprint (BUILD changes, source changes, stability)
- Integration tests verifying incremental mode matches standard mode for direct deps, transitive deps, empty inputs, and special-cased deps
- Fix missing Address import in incremental_dependents.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Could you elaborate (here or in an issue) on your setup? Of the 53k targets, what is the rough breakdown by types? I've most often seen --changed-since used with a filter to select on an "uncommon" type, such as "deploy all the helm stuff" or "publish all the docker images". From #23224 I take it you are filtering on a common type (like python_sources), is that correct? I know you have looked at this from a few different angles, does performance get worse with:
If I wanted to make a case like yours -- or even more pathological! -- what would I need?
- Replace IncrementalDependents subsystem with PANTS_INCREMENTAL_DEPENDENTS env var to avoid "No such options scope" errors in tests that use dependents rules without registering the subsystem
- Add release notes entry to docs/notes/2.32.x.md
- Fix unused import (textwrap) and formatting issues caught by CI linters
- All tests pass: dependents_test, incremental_dependents_test, py_constraints_test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I'm going to write up a comprehensive explanation of how I arrived at the conclusion and all of the supporting evidence. Give me a couple of hours to get it done.
Performance Investigation:
| Target Type | Count | % of Total |
|---|---|---|
| python_source | 20,850 | 39.4% |
| file | 15,422 | 29.1% |
| resource | 8,620 | 16.3% |
| python_test | 2,945 | 5.6% |
| python_sources (generators) | 1,767 | 3.3% |
| python_requirement | 1,263 | 2.4% |
| python_tests (generators) | 693 | 1.3% |
| shell_source | 303 | 0.6% |
| docker_image | 92 | 0.2% |
| Other (resources, distributions, etc.) | 972 | 1.8% |
| Total | 52,927 | 100% |
Benchmark Results
All times are wall-clock elapsed seconds. Pants version 2.32.0.dev7 from source unless noted.
Test 1: The Core Bottleneck — --changed-dependents vs not
| Command | Time | Output |
|---|---|---|
| filter (no dependents) | 37s | 0 targets |
| filter --changed-dependents=direct | 2m53s | 0 targets |
| filter --changed-dependents=transitive | 2m40s | 440 targets |
Key finding: Adding --changed-dependents=direct jumps from 37s to 2m53s — a 4.7x increase — even when the result is 0 additional targets. The entire cost is building the full reverse dependency graph via map_addresses_to_dependents().
Test 2: The --changed-since Depth Does NOT Matter
| Range | Changed Files | Time | Output |
|---|---|---|---|
| HEAD~1 | 9 files | 2m49s | 0 targets |
| HEAD~3 | 53 files | 2m40s | 440 targets |
| HEAD~10 | 79 files | 2m56s | 811 targets |
Times are within noise. The depth of --changed-since is irrelevant — the bottleneck is always map_addresses_to_dependents() which processes all 53K targets regardless of how many files changed.
Test 3: The Filter Type Does NOT Matter
| Filter | Time | Output |
|---|---|---|
| --filter-target-type=+python_test | 2m40s | 440 targets |
| --filter-target-type=+docker_image | 2m38s | 20 targets |
Same cost whether finding 440 test targets or 20 Docker targets. The filter is applied AFTER the full dependency graph is built.
Test 4: dependents Goal Shows the Same Bottleneck
| Command | Time | Output |
|---|---|---|
| dependents --transitive <single-file> | 2m50s | 1,601 dependents |
| dependencies <single-file> | 38s | 10 dependencies |
| list :: | 27s | 52,927 targets |
Computing forward dependencies for a single target: 38 seconds.
Computing reverse dependents for a single target: 2m50s (requires building the full reverse graph for ALL 53K targets).
Test 5: Warm pantsd Does NOT Help
| Run | Time |
|---|---|
| Cold pantsd | 2m44s |
| Warm pantsd (identical command) | 2m39s |
| Warm pantsd (different range) | 2m51s |
Warm pantsd provides essentially zero benefit for this operation. The map_addresses_to_dependents rule is recomputed on every invocation because it depends on AllUnexpandedTargets, which the Pants source describes as "relatively expensive to compute and frequently invalidated".
Test 6: Pre-built Binary vs From-Source
| Version | Time |
|---|---|
| Pants 2.30.0 (pre-built binary) | 2m56s |
| Pants 2.32.0.dev7 (from source) | 2m40s |
No meaningful difference. The bottleneck is the same in both versions.
Test 7: Work Unit Timing (from -linfo logs)
"Map all targets to their dependents" — reported as a long-running task at:
60.2s elapsed
90.1s elapsed
120.0s elapsed
This single rule (map_addresses_to_dependents) accounts for ~120 seconds out of ~160 seconds of total execution (75% of wall time).
Root Cause Analysis
What map_addresses_to_dependents() Does

    @rule(desc="Map all targets to their dependents")
    async def map_addresses_to_dependents(all_targets: AllUnexpandedTargets) -> AddressToDependents:
        dependencies_per_target = await concurrently(
            resolve_dependencies(
                DependenciesRequest(tgt.get(Dependencies), ...)
            )
            for tgt in all_targets  # ALL 52,927 targets
        )
        # Invert the forward deps to build the reverse map
        address_to_dependents = defaultdict(set)
        for tgt, dependencies in zip(all_targets, dependencies_per_target):
            for dependency in dependencies:
                address_to_dependents[dependency].add(tgt.address)
        return AddressToDependents(...)

This rule:
1. Resolves AllUnexpandedTargets: every target in the repository (52,927)
2. For each target, calls resolve_dependencies(), which includes:
   - Parsing the target's BUILD file for explicit dependencies
   - Running dependency inference (Python import parsing, Docker COPY analysis, Shell source detection)
   - Resolving inferred module names to target addresses
3. Inverts the forward dependency graph into a reverse mapping
Step 2 is the expensive part. Python import inference uses a Rust-based tree-sitter parser (fast per-file), but the per-target overhead of the rule engine — resolving imports to target addresses via the module mapper, handling ambiguity, validating results — adds up at 53K scale.
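The inversion in step 3 is cheap next to step 2. A minimal, engine-free sketch of it, assuming plain string addresses and a hypothetical three-target repo (the names `invert_forward_graph` and `transitive_dependents` are illustrative and not part of Pants):

```python
from collections import defaultdict, deque


def invert_forward_graph(forward: dict[str, list[str]]) -> dict[str, set[str]]:
    """Turn {target: [dependencies]} into {dependency: {dependents}}."""
    dependents: dict[str, set[str]] = defaultdict(set)
    for target, dependencies in forward.items():
        for dependency in dependencies:
            dependents[dependency].add(target)
    return dict(dependents)


def transitive_dependents(reverse: dict[str, set[str]], roots: set[str]) -> set[str]:
    """Breadth-first walk over the reverse map, excluding the roots themselves."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - roots


# Hypothetical repo: app -> lib -> util
forward = {"app": ["lib"], "lib": ["util"], "util": []}
reverse = invert_forward_graph(forward)
print(sorted(transitive_dependents(reverse, {"util"})))  # ['app', 'lib']
```

The walk only touches targets reachable from the changed roots, but the reverse map it walks over needs every forward edge in the repo, which is why the forward resolution for all targets has to happen first.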
Why Warm pantsd Doesn't Help
map_addresses_to_dependents takes AllUnexpandedTargets as its sole input. AllUnexpandedTargets is a rule that scans the entire filesystem for BUILD files and resolves all targets. The Pants engine's InvalidationWatcher (inotify-based) detects any filesystem change and invalidates AllUnexpandedTargets, which cascades to invalidate AddressToDependents.
Even without actual file changes, the engine must re-verify that all BUILD files are unchanged, re-hash target definitions, and confirm the cached result is still valid. At 53K targets, this verification itself is non-trivial.
Why the Filter Doesn't Help
The filter (--filter-target-type=+python_test, --tag="-integration") is applied after map_addresses_to_dependents completes. The full reverse graph for all 53K targets is built first, then the result is filtered down.
This is a deliberate design choice (see pantsbuild/pants#15544): filtering before building the graph would cause missed dependents when a filtered-out target is an intermediate link in the dependency chain.
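The intermediate-link problem can be shown with a toy graph. This is only a sketch under assumed data: the three targets, their types, and both helper functions are hypothetical and do not mirror the Pants implementation.

```python
from collections import deque

# Hypothetical chain: a test depends on a source file, which depends on a data file.
TARGET_TYPES = {
    "tests/test_app.py": "python_test",
    "src/app.py": "python_source",
    "data/config.json": "file",
}
FORWARD_DEPS = {
    "tests/test_app.py": ["src/app.py"],
    "src/app.py": ["data/config.json"],
    "data/config.json": [],
}


def transitive_dependents(forward, changed):
    """Invert the forward graph, then walk it from the changed targets."""
    reverse = {}
    for target, dependencies in forward.items():
        for dependency in dependencies:
            reverse.setdefault(dependency, set()).add(target)
    seen, queue = set(changed), deque(changed)
    while queue:
        for dependent in reverse.get(queue.popleft(), ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - set(changed)


def filter_after_graph(changed, wanted_type):
    """Build the full graph first, then keep only the wanted target type."""
    found = transitive_dependents(FORWARD_DEPS, changed)
    return {t for t in found if TARGET_TYPES[t] == wanted_type}


def filter_before_graph(changed, wanted_type):
    """Drop unwanted targets first; the src/app.py link disappears with them."""
    pruned = {t: deps for t, deps in FORWARD_DEPS.items() if TARGET_TYPES[t] == wanted_type}
    return transitive_dependents(pruned, changed)


print(filter_after_graph({"data/config.json"}, "python_test"))   # {'tests/test_app.py'}
print(filter_before_graph({"data/config.json"}, "python_test"))  # set()
```

Filtering first removes `src/app.py`, the only path from the changed data file to the test, so the test is silently missed.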
Conditions to Reproduce
To reproduce this performance characteristic, you need:
- Many targets (>30K, ideally >50K). The cost scales roughly linearly with target count.
- --changed-dependents=direct or --changed-dependents=transitive. Without this flag, the operation is fast (~30-40s) because it only finds owners of changed files, not their dependents.
- Any amount of changed files: even 0 changed files triggers the full graph build if --changed-dependents is set.
The target type distribution doesn't matter much. file and resource targets (which make up 45% of our targets) have trivial dependency inference, but they still contribute to the 53K targets that map_addresses_to_dependents must process.
Synthetic Reproduction
To create a synthetic test case:

    # Create 50K targets in a fresh repo
    mkdir big-repo && cd big-repo
    pants init

    # Add Python backend and interpreter constraints
    cat > pants.toml <<'EOF'
    [GLOBAL]
    pants_version = "2.31.0"
    backend_packages = ["pants.backend.python"]

    [python]
    interpreter_constraints = ["==3.11.*"]
    EOF

    # Ignore pants cache so git diff doesn't explode
    echo '/.pants.*' > .gitignore

    for i in $(seq 1 500); do
      mkdir -p "pkg${i}"
      for j in $(seq 1 100); do
        echo "x = $j" > "pkg${i}/file${j}.py"
      done
      echo 'python_sources()' > "pkg${i}/BUILD.pants"
    done

    git init && git add . && git commit -m "init"

    # Modify an existing file
    sed -i 's/x = 1/x = 999/' pkg1/file1.py
    git add . && git commit -m "change"

    time pants --changed-since=HEAD~ --changed-dependents=transitive list

Summary
The performance issue is real, reproducible, and caused by map_addresses_to_dependents() resolving dependencies for ALL targets in the repo whenever --changed-dependents is used. The cost is O(N) where N = total target count, regardless of:
- How many files changed
- What target type is being filtered for
- Whether pantsd is warm or cold
- The depth of the git history
At 53K targets, this costs ~2m40s. The rule engine's in-memory caching doesn't help because AllUnexpandedTargets is invalidated on every invocation.
Thanks, this is very helpful. I'll do some analysis, but I'm out for the rest of the week and may not be able to post anything before then. A few clarifying questions:
sureshjoshi left a comment
Thanks for the contribution. We've just branched for 2.32.x, so merging this pull request now will come out in 2.33.x. Please move the release notes updates to docs/notes/2.33.x.md if that's appropriate.
Summary
Adds an opt-in persistent disk cache for the dependency graph computed by map_addresses_to_dependents(). When enabled via --incremental-dependents-enabled, the forward dependency graph is serialized to ~/.cache/pants/incremental_dep_graph_v2.json after each run and loaded on the next run. Only targets whose source files have changed (by SHA-256 content hash) need their dependencies re-resolved.

This dramatically reduces wall time for --changed-dependents=transitive in large repos with many targets.

Motivation
In a monorepo with ~53K targets, pants --changed-since=HEAD~3 --changed-dependents=transitive filter takes ~3.5 minutes because map_addresses_to_dependents() calls resolve_dependencies() for every target, even when pantsd is warm. The rule engine's in-memory memoization is invalidated by any filesystem change, and the AllUnexpandedTargets → AddressToDependents cascade forces full recomputation each time.

The persistent cache breaks this cycle: even on a cold pantsd start (fresh CI agent), previously computed dependency edges are reused for unchanged targets.
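The reuse idea can be sketched outside the rule engine. The snippet below is an illustration under assumptions, not the PR's code: `load_entries`, `save_entries`, and `resolve_incrementally` are hypothetical names, the `resolve` callback stands in for resolve_dependencies(), and only the "version", "entries", "fingerprint", and "deps" keys are taken from the cache format in this description.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable

CACHE_VERSION = 2  # mirrors the "version" field in the described cache format


def load_entries(cache_path: Path) -> dict[str, dict]:
    """Return cached entries, or {} if the file is missing, corrupt, or a different version."""
    try:
        data = json.loads(cache_path.read_text())
    except (OSError, ValueError):
        return {}
    if not isinstance(data, dict) or data.get("version") != CACHE_VERSION:
        return {}
    entries = data.get("entries", {})
    return entries if isinstance(entries, dict) else {}


def save_entries(cache_path: Path, entries: dict[str, dict]) -> None:
    """Write to a .tmp file in the same directory, then swap it in with os.replace."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(cache_path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"version": CACHE_VERSION, "entries": entries}, tmp)
    os.replace(tmp_name, cache_path)


def resolve_incrementally(
    fingerprints: dict[str, str],
    resolve: Callable[[str], list[str]],
    cache_path: Path,
) -> tuple[dict[str, list[str]], list[str]]:
    """Reuse cached deps when the fingerprint matches; call resolve() otherwise."""
    cached = load_entries(cache_path)
    fresh: dict[str, dict] = {}
    recomputed: list[str] = []
    for spec, fingerprint in fingerprints.items():
        entry = cached.get(spec)
        if entry is not None and entry.get("fingerprint") == fingerprint:
            fresh[spec] = entry
        else:
            fresh[spec] = {"fingerprint": fingerprint, "deps": resolve(spec)}
            recomputed.append(spec)
    save_entries(cache_path, fresh)  # targets absent from `fingerprints` are dropped
    return {spec: entry["deps"] for spec, entry in fresh.items()}, recomputed
```

On a second call with mostly unchanged fingerprints, `recomputed` contains only the targets whose fingerprint differs, which is where the warm-cache savings would come from.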
Results
Tested on a monorepo with 52,927 targets:
Design
New subsystem: --incremental-dependents-enabled
Opt-in flag. When disabled (default), behavior is completely unchanged.
Cache format
JSON file at ~/.cache/pants/incremental_dep_graph_v2.json:

    {
      "version": 2,
      "buildroot": "/path/to/repo",
      "entries": {
        "src/python/foo/bar.py:lib": {
          "fingerprint": "<sha256>",
          "deps": ["src/python/baz/qux.py:lib", "3rdparty/python:requests"]
        }
      }
    }

Fingerprinting
Each target's cache key is SHA-256 of:
This is ~1 second for 18K files and is fully portable across machines.
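A content-based fingerprint of this kind might look like the following sketch. The exact inputs to the PR's key are not listed here, so hashing the BUILD file plus each source file is an assumption (suggested by the compute_source_fingerprint tests, which cover BUILD changes and source changes); `sha256_file` and `target_fingerprint` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file's contents in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def target_fingerprint(build_file: Path, sources: list[Path]) -> str:
    """Combine the BUILD file hash and every source hash into one key.

    Assumption: only file names and contents feed the key, never mtimes or
    absolute paths, so the same checkout yields the same key on any machine.
    """
    digest = hashlib.sha256()
    digest.update(sha256_file(build_file).encode())
    for source in sorted(sources):  # sorted so input order does not matter
        digest.update(source.name.encode())
        digest.update(sha256_file(source).encode())
    return digest.hexdigest()
```

Because nothing here reads timestamps, a fresh git clone on a CI agent produces the same fingerprints as the machine that wrote the cache.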
Safety
- … resolve_dependencies() as normal
- … .tmp, then os.replace)

Files changed
- src/python/pants/backend/project_info/dependents.py: Modified map_addresses_to_dependents() to use incremental mode when enabled
- src/python/pants/backend/project_info/incremental_dependents.py: New: cache persistence, fingerprinting, IncrementalDependents subsystem

CI usage
The cache can be shared across ephemeral CI agents via S3:
🤖 Generated with Claude Code