## Problem
The Bazel CI workflow (`.github/workflows/bazel.yml`) recompiles large transitive dependencies (notably protobuf) on every run despite a two-layer cache: GitHub Actions disk cache via `actions/cache@v5` and BuildBuddy remote cache. The recompilation is consistent across runs on `main` and appears to be much worse on at least some PR runs.
The repository depends on a single-digit number of direct Bazel modules (`fuzztest`, `googletest`, `rules_cc`, `rules_shell`, `z3`); the protobuf rebuild comes in transitively, almost certainly through `fuzztest`. Our own source tree is small, so most of the wall-clock time on every CI run is spent on dependency code that hasn't changed.
## Evidence
### Main runs are consistently ~2 minutes but show high cache miss rates
Most recent successful Bazel CI runs on `main` (from `gh run list --workflow=bazel.yml --branch main`):
| Run | Date | Duration |
| --- | --- | --- |
| Add exact-match table symbolic execution example (#25) | 2026-05-07 | 1m56s |
| docs: restate naming rule positively | 2026-05-07 | 1m42s |
| Update actions/upload-pages-artifact action to v5 (#18) | | |

Inspecting the most recent (run id 25479661712) shows the disk cache does restore successfully, but the Bazel summary lines report only partial cache hits.
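The summary line below is reconstructed from the cache stats recorded under "Analyses already done", following Bazel's usual summary format; treat it as an approximation, not a verbatim log copy:

```
INFO: 2432 processes: 1271 internal, 248 disk cache hit, 913 remote cache hit.
```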
So of the 2432 build-phase actions, only 1161 (~48%) come from cache. The other 1271 run fresh. Searching that run's log for `external/protobuf+/src/google/protobuf/` returns hundreds of lines of compile warnings, confirming protobuf is recompiled despite the partial cache hits.
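One way to reproduce that search, assuming an authenticated `gh` CLI:

```sh
# Count protobuf compile-warning lines in the May 7 main run's log
gh run view 25479661712 --log | grep -c 'external/protobuf+/src/google/protobuf/'
```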
### Our PR sees a much worse cache hit rate
In PR #27, the Bazel CI build job is currently running at ~23 minutes elapsed and counting (started 2026-05-11T00:52:19Z, still pending at time of issue creation). Compared to `main`'s ~2-minute baseline, that's roughly a 10x slowdown.
Our PR does not modify any of the cache-key inputs (`.bazelversion`, `MODULE.bazel`, `MODULE.bazel.lock`, `.bazelrc`).
So the disk cache key (`bazel-Linux-${{ hashFiles('.bazelversion', 'MODULE.bazel.lock', '.bazelrc') }}`) is identical to `main`'s. The `restore-keys: bazel-Linux-` fallback should also match. Despite that, the build is dramatically slower than `main`'s, suggesting the cache lookup is missing far more often than on `main`.
## Cache mechanics
Workflow excerpt (`.github/workflows/bazel.yml:14-26`): the relevant step is an `actions/cache@v5` block using the key and fallback quoted above; a sketch follows the next paragraph.
Two cache layers exist (disk + BuildBuddy remote). Both restored successfully on the May 7 run, yet half the actions ran fresh. That implies action hashes differ run-to-run for ~half the build graph, OR cache content is missing under the right hash for that half.
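A minimal sketch of that step: the `key` and `restore-keys` values are quoted from the workflow, while the step name and `path` entries are assumptions.

```yaml
# Sketch of the cache step in .github/workflows/bazel.yml (paths are assumed)
- name: Cache Bazel
  uses: actions/cache@v5
  with:
    path: |
      ~/.cache/bazel-disk   # assumed --disk_cache location
      ~/.cache/bazel-repo   # assumed --repository_cache location
    key: bazel-Linux-${{ hashFiles('.bazelversion', 'MODULE.bazel.lock', '.bazelrc') }}
    restore-keys: |
      bazel-Linux-
```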
## Proposals
Three angles worth investigating, ordered by leverage:
### 1. Root-cause the cache miss pattern (highest leverage)
BuildBuddy's UI has a "Compare invocations" view that shows, action-by-action, exactly which inputs differ between two runs and why. The CI is already wired to BuildBuddy (`--remote_header=x-buildbuddy-api-key=...`), so the data is there. Pick two recent `main` runs and diff them; the typical culprits are:
- Absolute paths embedded in action inputs or outputs (`-MD`/`.d` files, debug info).
- Volatile environment variables captured via `--action_env`.
- Platform/`exec_properties` differences.
The fix is usually a small `.bazelrc` change (pin compiler features, scrub `action_env`, set hermetic toolchain options); a sketch of the kind of change this tends to produce is below. Investigation time: probably 30-90 minutes if you have BuildBuddy access. Payoff: potentially fixes the whole graph at once.
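For illustration only, since the right flags depend on what the invocation diff actually shows; both lines below are common hermeticity fixes, assumed rather than verified for this repo.

```
# .bazelrc sketch (assumed fixes; confirm against the BuildBuddy diff first)
build --incompatible_strict_action_env                   # stop leaking the runner's env/PATH into action keys
build --copt=-fdebug-prefix-map=/home/runner=/tmp/scrub  # strip absolute runner paths from debug info (paths assumed)
```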
### 2. Stop building fuzz tests by default (cleanest sidestep)
protobuf is transitive through `fuzztest`, which is only needed by `tests/fuzz/*` targets. If those targets are tagged `manual` in `tests/fuzz/BUILD.bazel` (sketch below), they drop out of `bazel build //...` expansion. Bazel only executes actions needed by the requested target set, so protobuf compile actions never run — only the source fetch happens during analysis, which is seconds rather than minutes.
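A sketch of the BUILD edit; the target name, source file, and fuzztest dep label are hypothetical, and the actual fuzz targets may use a different rule or macro:

```python
# tests/fuzz/BUILD.bazel (sketch; names are hypothetical)
cc_test(
    name = "example_fuzz_test",                          # hypothetical target
    srcs = ["example_fuzz_test.cc"],                     # hypothetical source
    deps = ["@fuzztest//fuzztest:fuzztest_gtest_main"],  # assumed fuzztest dep label
    tags = ["manual"],  # drops the target from `bazel build //...` expansion
)
```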
To preserve fuzz coverage, add a dedicated CI step (or a separate workflow; see the sketch below) that runs `bazel test //tests/fuzz/...` explicitly — that one job still pays the cost, but only when fuzz code or its deps change, or on a schedule. Implementation cost: ~30 minutes (one BUILD edit per fuzz target, one CI step). Risk: low; the fuzz tests are already isolated under `tests/fuzz/`.
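One possible shape for the dedicated workflow, assuming a path filter on `tests/fuzz/**` plus a weekly scheduled run; the file name, cron, and runner details are placeholders.

```yaml
# .github/workflows/fuzz.yml (sketch; triggers and runner are assumptions)
on:
  pull_request:
    paths: ['tests/fuzz/**', 'MODULE.bazel.lock']
  schedule:
    - cron: '0 6 * * 0'   # weekly, Sunday 06:00 UTC
jobs:
  fuzz:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: bazel test //tests/fuzz/...
```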
### 3. Pre-warm the cache nightly (backstop)
Add a `schedule:` trigger to the Bazel workflow so a `main` build runs nightly. The existing cache config already saves with a key that PR runs restore from, so a nightly success leaves a warm cache for the next day's PRs.
```yaml
on:
  push: { branches: [main] }
  pull_request: { branches: [main] }
  schedule:
    - cron: '0 5 * * *'   # 05:00 UTC daily
```
Implementation cost: ~15 minutes plus a day to verify. Important caveat: this only helps if the cache misses are about content absence (the right hash was never saved). If misses are about action hash drift (option 1's territory), pre-warming saves under hash A and the PR run looks up hash B and still misses. The 10% disk-cache hit rate today says at least some hashes are stable, so pre-warming would help that fraction — but to know how much, option 1's diagnosis is the prerequisite.
## Analyses already done
- Identified the workflow and cache config at `.github/workflows/bazel.yml`. Two cache layers (disk + BuildBuddy remote), key derived from `hashFiles('.bazelversion', 'MODULE.bazel.lock', '.bazelrc')` with a fallback `restore-keys: bazel-Linux-`.
- Verified PR #27 (Strict comparison operators + `math_*` family) does not touch cache-key inputs. Cache key should match `main`'s; the `restore-keys` fallback should match regardless. Yet the PR's Bazel job is currently at ~23 minutes and counting vs. `main`'s ~2-minute baseline.
- Confirmed the protobuf rebuild is pre-existing on `main`. The May 7 run of `main` (`gh run view 25479661712 --log`) emits hundreds of `external/protobuf+/src/google/protobuf/...` compile warnings. The PR is not introducing the rebuild — it's just much worse at exhibiting the underlying cache miss problem.
- Confirmed the dep path. None of `MODULE.bazel`'s direct deps (`fuzztest`, `googletest`, `rules_cc`, `rules_shell`, `z3`) are protobuf. Protobuf is transitive; `fuzztest` is the most likely path (it uses protobuf for test-input serialization).
- Cache stats from a representative `main` run (run 25479661712, build summary):
  - Build phase: 2432 processes — 248 disk cache hit + 913 remote cache hit + 1271 internal = ~48% cached.
  - Test phase: 560 processes — 510 action cache hit + 18 disk cache hit + 32 remote cache hit.
  - Disk cache did restore ("Cache restored from key: bazel-Linux-deb7395ec..."). The cache is working — just not as well as we'd want.
## What would close this issue
- A measurable improvement in Bazel CI wall-clock time on `main`, ideally back under 1 minute for incremental changes.
- Diagnosis of why roughly half of Bazel actions miss cache on `main` today (and why PRs sometimes miss far more than that).
- A documented fix, or a documented tradeoff if the root cause is hard to address.
## Useful tools
- BuildBuddy "Compare invocations" view (link to your BuildBuddy org from a recent invocation URL in the CI logs).
- `bazel build //... --execution_log_json_file=exec.json` for a local-vs-CI action diff if BuildBuddy isn't sufficient.
- `bazel cquery 'deps(//tests/fuzz/...)' --output=label` to confirm the protobuf dep path before option 2.
🤖 Generated with Claude Code