Bazel CI build is slow: investigate cache miss patterns #28

@qobilidop

Problem

The Bazel CI workflow (.github/workflows/bazel.yml) recompiles large transitive dependencies (notably protobuf) on every run, despite a two-layer cache: a GitHub Actions disk cache via actions/cache@v5 and a BuildBuddy remote cache. The recompilation happens consistently on main and appears much worse on at least some PR runs.

The repository has only five direct Bazel modules (fuzztest, googletest, rules_cc, rules_shell, z3); protobuf comes in transitively, almost certainly through fuzztest. Our own source tree is small, so most of the wall-clock time on every CI run is spent on dependency code that hasn't changed.

Evidence

Main runs are consistently ~2 minutes but show high cache miss rates

Most recent successful Bazel CI runs on main (from gh run list --workflow=bazel.yml --branch main):

Run                                                        Date        Duration
Add exact-match table symbolic execution example (#25)     2026-05-07  1m56s
docs: restate naming rule positively                       2026-05-07  1m42s
Update actions/upload-pages-artifact action to v5 (#18)    2026-05-07  1m45s
Update actions/deploy-pages action to v5 (#17)             2026-05-07  1m58s
Update dependency rules_shell to v0.8.0 (#16)              2026-05-07  1m52s

Inspecting the most recent run (id 25479661712) shows the disk cache does restore successfully, but the Bazel summary lines are:

INFO: 2432 processes: 248 disk cache hit, 913 remote cache hit, 1271 internal.
INFO: 31 processes:   510 action cache hit,  18 disk cache hit, 32 remote cache hit.

So of the 2432 build-phase actions, only 1161 (~48%) are cache hits; the other 1271 run fresh. Searching that run's log for external/protobuf+/src/google/protobuf/ returns hundreds of compile warnings, confirming protobuf is recompiled despite the partial cache hits.
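
For reproducibility, that search is a one-liner (the exact warning count will vary run to run):

$ gh run view 25479661712 --log | grep -c 'external/protobuf+/src/google/protobuf/'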

Our PR sees a much worse cache hit rate

In PR #27, the Bazel CI build job is currently running at ~23 minutes elapsed and counting (started 2026-05-11T00:52:19Z, still pending at time of issue creation). Compared to main's ~2-minute baseline, that's roughly a 10x slowdown.

Our PR does not modify any of the cache key inputs (.bazelversion, MODULE.bazel, MODULE.bazel.lock, .bazelrc):

$ git diff 87ccdc99..HEAD --stat -- .bazelversion MODULE.bazel.lock .bazelrc MODULE.bazel
(empty - no changes)

So the disk cache key (bazel-Linux-${{ hashFiles('.bazelversion', 'MODULE.bazel.lock', '.bazelrc') }}) is identical to main's, and the restore-keys: bazel-Linux- fallback should match regardless. Despite that, the PR build is dramatically slower, suggesting its cache lookups miss far more often than main's do.
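
One quick check is which key a given PR run actually restored; the actions/cache step logs the exact line quoted later in this issue (the run id here is a placeholder):

$ gh run view <RUN_ID> --log | grep -i 'cache restored from key'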

Cache mechanics

Workflow excerpt (.github/workflows/bazel.yml:14-26):

- name: Restore Bazel disk cache
  uses: actions/cache@v5
  with:
    path: .bazel-disk-cache
    key: bazel-${{ runner.os }}-${{ hashFiles('.bazelversion', 'MODULE.bazel.lock', '.bazelrc') }}
    restore-keys: |
      bazel-${{ runner.os }}-

- name: Activate CI-specific Bazel config
  run: |
    echo "build --config=ci" > ci.bazelrc
    echo "build:ci --remote_header=x-buildbuddy-api-key=${{ secrets.BUILDBUDDY_API_KEY }}" >> ci.bazelrc

Two cache layers exist (disk + BuildBuddy remote). Both restored successfully on the May 7 run, yet roughly half the actions ran fresh. That implies that, for about half the build graph, either action hashes differ run to run or the cache is missing content under the right hash.

Proposals

Three angles worth investigating, ordered by leverage:

1. Root-cause the cache miss pattern (highest leverage)

BuildBuddy's UI has a "Compare invocations" view that shows, action-by-action, exactly which inputs differ between two runs and why. The CI is already wired to BuildBuddy (--remote_header=x-buildbuddy-api-key=...), so the data is there. Pick two recent main runs and diff them; the typical culprits are:

  • Absolute paths leaking into compile outputs (-MD / .d files, debug info).
  • Compiler / sysroot / libc fingerprint drift across runner image revisions.
  • Volatile environment variables captured via --action_env.
  • Platform/exec_properties differences.

The fix is usually a small .bazelrc change (pin compiler features, scrub action_env, set hermetic toolchain options). Investigation time: probably 30-90 minutes with BuildBuddy access. Payoff: potentially fixes the whole graph at once.
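
If the diff points at environment capture or stamping, the fix can ride the existing ci.bazelrc step. A minimal sketch; whether these particular flags address the drift observed here is an assumption until the BuildBuddy comparison confirms the cause:

echo "build:ci --incompatible_strict_action_env" >> ci.bazelrc  # stop inheriting the runner's env into action keys
echo "build:ci --nostamp" >> ci.bazelrc                         # keep volatile workspace status out of outputs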

2. Stop building fuzz tests by default (cleanest sidestep)

protobuf is transitive through fuzztest, which is only needed by tests/fuzz/* targets. If those targets are tagged manual in tests/fuzz/BUILD.bazel, they drop out of bazel build //... expansion. Bazel only executes actions needed by the requested target set, so protobuf compile actions never run; only the repository fetch still happens, and that takes seconds rather than minutes.

# tests/fuzz/BUILD.bazel
cc_test(
    name = "sym_bit_vec_cast_test",
    srcs = ["sym_bit_vec_cast_test.cc"],
    tags = ["manual"],   # excluded from //...
    deps = [...],
)

To preserve fuzz coverage, add a dedicated CI step (or a separate workflow) that runs the fuzz tests explicitly. One caveat: manual targets are excluded from every wildcard pattern, including bazel test //tests/fuzz/..., so the dedicated step must name the targets or use a test_suite that lists them (tests listed explicitly in a suite ignore the manual tag); see the sketch below. That one job still pays the protobuf cost, but only when fuzz code or its deps change, or on a schedule. Implementation cost: ~30 minutes (one BUILD edit per fuzz target, one CI step). Risk: low; the fuzz tests are already isolated under tests/fuzz/.
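
A sketch of that suite plus the dedicated invocation (the fuzz_suite name is illustrative; list every fuzz target):

# tests/fuzz/BUILD.bazel
test_suite(
    name = "fuzz_suite",
    tests = [":sym_bit_vec_cast_test"],  # explicit list, so the manual tag doesn't hide the tests
)

$ bazel test //tests/fuzz:fuzz_suite   # dedicated CI step; the only place protobuf still builds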

3. Pre-warm the cache nightly (backstop)

Add a schedule: trigger to the Bazel workflow so a main build runs nightly. The existing cache config already saves with a key that PR runs restore from, so a nightly success leaves a warm cache for the next day's PRs.

on:
  push: { branches: [main] }
  pull_request: { branches: [main] }
  schedule:
    - cron: '0 5 * * *'   # 05:00 UTC daily

Implementation cost: ~15 minutes plus a day to verify. Important caveat: this only helps if the cache misses are about content absence (the right hash was never saved). If misses are about action hash drift (option 1's territory), pre-warming saves under hash A and the PR run looks up hash B and still misses. The 10% disk-cache hit rate today says at least some hashes are stable, so pre-warming would help that fraction — but to know how much, option 1's diagnosis is the prerequisite.
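
Whether content absence is even plausible can be checked directly by listing the saved Actions caches under the key prefix (requires a gh version that ships the cache subcommand, 2.32+):

$ gh cache list --key bazel-Linux-   # saved entries, sizes, and last-used times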

Analyses already done

  1. Identified the workflow and cache config at .github/workflows/bazel.yml. Two cache layers (disk + BuildBuddy remote), key derived from hashFiles('.bazelversion', 'MODULE.bazel.lock', '.bazelrc') with a fallback restore-keys: bazel-Linux-.

  2. Verified PR #27 (Strict comparison operators + math_* family) does not touch cache-key inputs. Its cache key should match main's; the restore-keys fallback should match regardless. Yet the PR's Bazel job is currently at ~23 minutes and counting vs. main's ~2-minute baseline.

  3. Confirmed the protobuf rebuild is pre-existing on main. The May 7 run of main (gh run view 25479661712 --log) emits hundreds of external/protobuf+/src/google/protobuf/... compile warnings. The PR is not introducing the rebuild — it just exhibits the underlying cache-miss problem far more severely.

  4. Confirmed the dep path. None of MODULE.bazel's direct deps (fuzztest, googletest, rules_cc, rules_shell, z3) is protobuf. Protobuf is transitive; fuzztest is the most likely path (it uses protobuf for test-input serialization). A query to confirm this follows the list.

  5. Cache stats from a representative main run (run 25479661712, build summary):

    • Build phase: 2432 processes — 248 disk cache hits + 913 remote cache hits = 1161 cached (~48%), plus 1271 internal.
    • Test phase: 31 processes — 510 action cache hit + 18 disk cache hit + 32 remote cache hit.
    • Disk cache did restore ("Cache restored from key: bazel-Linux-deb7395ec..."). The cache is working — just not as well as we'd want.
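
The dep path in item 4 can be confirmed with a path query. The canonical repo name protobuf+ is taken from the log paths above; the //:protobuf target label is an assumption about protobuf's BUILD file:

$ bazel query 'somepath(//tests/fuzz/..., @@protobuf+//:protobuf)'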

What would close this issue

  • A measurable improvement in Bazel CI wall-clock time on main, ideally back under 1 minute for incremental changes.
  • Diagnosis of why roughly half of Bazel actions miss cache on main today (and why PRs sometimes miss far more than that).
  • A documented fix or a documented tradeoff if the root cause is hard to address.

Useful tools

  • BuildBuddy "Compare invocations" view (link to your BuildBuddy org from a recent invocation URL in the CI logs).
  • `bazel build //... --execution_log_json_file=exec.json` for a local-vs-CI action diff if BuildBuddy isn't sufficient (see the sketch below).
  • `bazel cquery 'deps(//tests/fuzz/...)' --output=label` to confirm the protobuf dep path before option 2.
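
For the execution-log route, a rough diff recipe; the field names follow the JSON rendering of Bazel's SpawnExec proto, and the two log filenames are placeholders for logs captured locally and in CI:

$ jq -r 'select(.listedOutputs) | "\(.listedOutputs[0]) \(.digest.hash)"' local.json | sort > local.txt
$ jq -r 'select(.listedOutputs) | "\(.listedOutputs[0]) \(.digest.hash)"' ci.json | sort > ci.txt
$ diff local.txt ci.txt | head   # outputs whose action digests differ between the two environments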
