Bazel CI build is slow: investigate cache miss patterns #28

@qobilidop

Problem

The Bazel CI workflow (.github/workflows/bazel.yml) recompiles large transitive dependencies (notably protobuf) on every run, despite a two-layer cache: a GitHub Actions disk cache via actions/cache@v5 and a BuildBuddy remote cache. The recompilation happens consistently on main and appears much worse on at least some PR runs.

The repository has only five direct Bazel modules (fuzztest, googletest, rules_cc, rules_shell, z3); protobuf comes in transitively, almost certainly through fuzztest. Our own source tree is small, so most of the wall-clock time on every CI run is spent on dependency code that hasn't changed.

Evidence

Main runs are consistently ~2 minutes but show high cache miss rates

Most recent successful Bazel CI runs on main (from gh run list --workflow=bazel.yml --branch main):

Run                                                        Date        Duration
Add exact-match table symbolic execution example (#25)     2026-05-07  1m56s
docs: restate naming rule positively                       2026-05-07  1m42s
Update actions/upload-pages-artifact action to v5 (#18)    2026-05-07  1m45s
Update actions/deploy-pages action to v5 (#17)             2026-05-07  1m58s
Update dependency rules_shell to v0.8.0 (#16)              2026-05-07  1m52s

Inspecting the most recent run (id 25479661712) shows the disk cache does restore successfully, but the Bazel summary lines are:

INFO: 2432 processes: 248 disk cache hit, 913 remote cache hit, 1271 internal.
INFO: 31 processes:   510 action cache hit,  18 disk cache hit, 32 remote cache hit.

So of the 2432 build-phase actions, only 1161 (~48%) are cache hits; the other 1271 run fresh. Searching that run's log for external/protobuf+/src/google/protobuf/ returns hundreds of compile warnings, confirming protobuf is recompiled despite the partial cache hits.
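
For reproducibility, that search is a one-liner (the exact warning count will vary run to run):

$ gh run view 25479661712 --log | grep -c 'external/protobuf+/src/google/protobuf/'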

Our PR sees a much worse cache hit rate

In PR #27, the Bazel CI build job is currently running at ~23 minutes elapsed and counting (started 2026-05-11T00:52:19Z, still pending at time of issue creation). Compared to main's ~2-minute baseline, that's roughly a 10x slowdown.

Our PR does not modify any of the cache key inputs (.bazelversion, MODULE.bazel, MODULE.bazel.lock, .bazelrc):

$ git diff 87ccdc99..HEAD --stat -- .bazelversion MODULE.bazel.lock .bazelrc MODULE.bazel
(empty - no changes)

So the disk cache key (bazel-Linux-${{ hashFiles('.bazelversion', 'MODULE.bazel.lock', '.bazelrc') }}) is identical to main's, and the restore-keys: bazel-Linux- fallback should match regardless. Despite that, the PR build is dramatically slower, suggesting its cache lookups miss far more often than main's do.
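
One quick check is which key a given PR run actually restored; the actions/cache step logs the exact line quoted later in this issue (the run id here is a placeholder):

$ gh run view <RUN_ID> --log | grep -i 'cache restored from key'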

Cache mechanics

Workflow excerpt (.github/workflows/bazel.yml:14-26):

- name: Restore Bazel disk cache
  uses: actions/cache@v5
  with:
    path: .bazel-disk-cache
    key: bazel-${{ runner.os }}-${{ hashFiles('.bazelversion', 'MODULE.bazel.lock', '.bazelrc') }}
    restore-keys: |
      bazel-${{ runner.os }}-

- name: Activate CI-specific Bazel config
  run: |
    echo "build --config=ci" > ci.bazelrc
    echo "build:ci --remote_header=x-buildbuddy-api-key=${{ secrets.BUILDBUDDY_API_KEY }}" >> ci.bazelrc

Two cache layers exist (disk + BuildBuddy remote). Both restored successfully on the May 7 run, yet roughly half the actions ran fresh. That implies that, for about half the build graph, either action hashes differ run to run or the cache is missing content under the right hash.

Proposals

Three angles worth investigating, ordered by leverage:

1. Root-cause the cache miss pattern (highest leverage)

BuildBuddy's UI has a "Compare invocations" view that shows, action-by-action, exactly which inputs differ between two runs and why. The CI is already wired to BuildBuddy (--remote_header=x-buildbuddy-api-key=...), so the data is there. Pick two recent main runs and diff them; the typical culprits are:

  • Absolute paths leaking into compile outputs (-MD / .d files, debug info).
  • Compiler / sysroot / libc fingerprint drift across runner image revisions.
  • Volatile environment variables captured via --action_env.
  • Platform/exec_properties differences.

The fix is usually a small .bazelrc change (pin compiler features, scrub action_env, set hermetic toolchain options). Investigation time: probably 30-90 minutes with BuildBuddy access. Payoff: potentially fixes the whole graph at once.
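
If the diff points at environment capture or stamping, the fix can ride the existing ci.bazelrc step. A minimal sketch; whether these particular flags address the drift observed here is an assumption until the BuildBuddy comparison confirms the cause:

echo "build:ci --incompatible_strict_action_env" >> ci.bazelrc  # stop inheriting the runner's env into action keys
echo "build:ci --nostamp" >> ci.bazelrc                         # keep volatile workspace status out of outputs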

2. Stop building fuzz tests by default (cleanest sidestep)

protobuf is transitive through fuzztest, which is only needed by tests/fuzz/* targets. If those targets are tagged manual in tests/fuzz/BUILD.bazel, they drop out of bazel build //... expansion. Bazel only executes actions needed by the requested target set, so protobuf compile actions never run; only the repository fetch still happens, and that takes seconds rather than minutes.

# tests/fuzz/BUILD.bazel
cc_test(
    name = "sym_bit_vec_cast_test",
    srcs = ["sym_bit_vec_cast_test.cc"],
    tags = ["manual"],   # excluded from //...
    deps = [...],
)

To preserve fuzz coverage, add a dedicated CI step (or a separate workflow) that runs the fuzz tests explicitly. One caveat: manual targets are excluded from every wildcard pattern, including bazel test //tests/fuzz/..., so the dedicated step must name the targets or use a test_suite that lists them (tests listed explicitly in a suite ignore the manual tag); see the sketch below. That one job still pays the protobuf cost, but only when fuzz code or its deps change, or on a schedule. Implementation cost: ~30 minutes (one BUILD edit per fuzz target, one CI step). Risk: low; the fuzz tests are already isolated under tests/fuzz/.
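
A sketch of that suite plus the dedicated invocation (the fuzz_suite name is illustrative; list every fuzz target):

# tests/fuzz/BUILD.bazel
test_suite(
    name = "fuzz_suite",
    tests = [":sym_bit_vec_cast_test"],  # explicit list, so the manual tag doesn't hide the tests
)

$ bazel test //tests/fuzz:fuzz_suite   # dedicated CI step; the only place protobuf still builds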

3. Pre-warm the cache nightly (backstop)

Add a schedule: trigger to the Bazel workflow so a main build runs nightly. The existing cache config already saves with a key that PR runs restore from, so a nightly success leaves a warm cache for the next day's PRs.

on:
  push: { branches: [main] }
  pull_request: { branches: [main] }
  schedule:
    - cron: '0 5 * * *'   # 05:00 UTC daily

Implementation cost: ~15 minutes plus a day to verify. Important caveat: this only helps if the cache misses are about content absence (the right hash was never saved). If misses are about action hash drift (option 1's territory), pre-warming saves under hash A and the PR run looks up hash B and still misses. The 10% disk-cache hit rate today says at least some hashes are stable, so pre-warming would help that fraction — but to know how much, option 1's diagnosis is the prerequisite.
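
Whether content absence is even plausible can be checked directly by listing the saved Actions caches under the key prefix (requires a gh version that ships the cache subcommand, 2.32+):

$ gh cache list --key bazel-Linux-   # saved entries, sizes, and last-used times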

Analyses already done

  1. Identified the workflow and cache config at .github/workflows/bazel.yml. Two cache layers (disk + BuildBuddy remote), key derived from hashFiles('.bazelversion', 'MODULE.bazel.lock', '.bazelrc') with a fallback restore-keys: bazel-Linux-.

  2. Verified PR #27 (Strict comparison operators + math_* family) does not touch cache-key inputs. Its cache key should match main's; the restore-keys fallback should match regardless. Yet the PR's Bazel job is currently at ~23 minutes and counting vs. main's ~2-minute baseline.

  3. Confirmed the protobuf rebuild is pre-existing on main. The May 7 run of main (gh run view 25479661712 --log) emits hundreds of external/protobuf+/src/google/protobuf/... compile warnings. The PR is not introducing the rebuild — it just exhibits the underlying cache-miss problem far more severely.

  4. Confirmed the dep path. None of MODULE.bazel's direct deps (fuzztest, googletest, rules_cc, rules_shell, z3) is protobuf. Protobuf is transitive; fuzztest is the most likely path (it uses protobuf for test-input serialization). A query to confirm this follows the list.

  5. Cache stats from a representative main run (run 25479661712, build summary):

    • Build phase: 2432 processes — 248 disk cache hits + 913 remote cache hits = 1161 cached (~48%), plus 1271 internal.
    • Test phase: 31 processes — 510 action cache hit + 18 disk cache hit + 32 remote cache hit.
    • Disk cache did restore ("Cache restored from key: bazel-Linux-deb7395ec..."). The cache is working — just not as well as we'd want.
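
The dep path in item 4 can be confirmed with a path query. The canonical repo name protobuf+ is taken from the log paths above; the //:protobuf target label is an assumption about protobuf's BUILD file:

$ bazel query 'somepath(//tests/fuzz/..., @@protobuf+//:protobuf)'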

What would close this issue

  • A measurable improvement in Bazel CI wall-clock time on main, ideally back under 1 minute for incremental changes.
  • Diagnosis of why roughly half of Bazel actions miss cache on main today (and why PRs sometimes miss far more than that).
  • A documented fix or a documented tradeoff if the root cause is hard to address.

Useful tools

  • BuildBuddy "Compare invocations" view (link to your BuildBuddy org from a recent invocation URL in the CI logs).
  • `bazel build //... --execution_log_json_file=exec.json` for a local-vs-CI action diff if BuildBuddy isn't sufficient (see the sketch below).
  • `bazel cquery 'deps(//tests/fuzz/...)' --output=label` to confirm the protobuf dep path before option 2.
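
For the execution-log route, a rough diff recipe; the field names follow the JSON rendering of Bazel's SpawnExec proto, and the two log filenames are placeholders for logs captured locally and in CI:

$ jq -r 'select(.listedOutputs) | "\(.listedOutputs[0]) \(.digest.hash)"' local.json | sort > local.txt
$ jq -r 'select(.listedOutputs) | "\(.listedOutputs[0]) \(.digest.hash)"' ci.json | sort > ci.txt
$ diff local.txt ci.txt | head   # outputs whose action digests differ between the two environments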
