Trunk-compiler failure records persist forever in latest table (sticky 'failed before' across nightly rebuilds) #2105

Spun out from #1342 since the perf work there is done and this is a separate correctness/staleness concern.

## Problem

When a trunk compiler (e.g. `clang_barry`, `gcc-trunk`, `clang_pawel`) fails to build a stable library release (e.g. `boost 1.85.0`, `fmt 10.0.0`), the failure record in the proxy's `latest` table persists indefinitely across nightly compiler rebuilds. The library-builder treats the combination as "still failing" and skips it forever, even after the underlying compiler bug is fixed.

## Why the existing freshness guard doesn't help

`has_failed_before` in `library_builder.py` and `fortran_library_builder.py` filters the result by comparing the stored `commithash` against `self.get_commit_hash()`:

```python
# library_builder.py
def has_failed_before(self):
    return match_failed_build(self._get_failed_builds(), self.current_buildparameters_obj, self.get_commit_hash())
```

`match_failed_build` requires `commithash == current_commit_hash` for the failure to count. The intent is "stale records from a different commit don't poison new builds."

The flaw: `get_commit_hash()` returns the **library's** commit hash, not the compiler's. For:

- **Stable library releases** (boost 1.85.0, fmt 10.0.0, ...): the source tag commit is fixed. `get_commit_hash()` returns the same value forever.
- **Trunk compilers** (clang_barry, gcc-trunk): `compiler_version` is a stable string (`clang_barry`) that doesn't change when the binary is rebuilt nightly.

Result: `(library, library_version, compiler, compiler_version, arch, libcxx, compiler_flags)` is constant across compiler rebuilds, the `commithash` is constant for stable libraries, so `commithash == current_commit_hash` always passes and the failure record sticks.
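
To make the stickiness concrete, here is a minimal sketch of the comparison described above. It is not the real `match_failed_build`: `FailureRecord`, `failure_applies`, the abbreviated `BUILD_KEY`, and the placeholder commit hash are all hypothetical names chosen for illustration.

```python
from typing import NamedTuple, Optional


class FailureRecord(NamedTuple):
    # Hypothetical shape of one stored failure; the real row has more columns.
    build_key: tuple
    commithash: str


def failure_applies(record: FailureRecord, build_key: tuple,
                    current_commit_hash: Optional[str]) -> bool:
    """Mimic the described guard: same combo, and same commit hash if one is given."""
    if record.build_key != build_key:
        return False
    if current_commit_hash is None:
        # Go/Rust-style call: no commit filter, so any failure counts as current.
        return True
    return record.commithash == current_commit_hash


# Placeholder values: a stable library tag and a trunk compiler ID.
LIBRARY_TAG_COMMIT = "0123abcd"
BUILD_KEY = ("boost", "1.85.0", "clang_barry")  # abbreviated key

recorded = FailureRecord(BUILD_KEY, LIBRARY_TAG_COMMIT)

# Three consecutive nightly compiler rebuilds: nothing in the inputs changes,
# because neither the key nor the library commit reflects the new binary.
nightly_results = [
    failure_applies(recorded, BUILD_KEY, LIBRARY_TAG_COMMIT) for _night in range(3)
]
print(nightly_results)
```

Every night evaluates to `True`, because no input to the comparison carries any information about which compiler binary was actually used.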

Go and Rust have it worse: `has_failed_before` there passes `None` for the commit filter (the original `/hasfailedbefore` endpoint had no commit-hash check), so any failure is treated as current regardless of commit.

## Concrete incident

On 2026-05-04, during the perf-work session, `clang_barry`'s failure to build `boost_bin/1.85.0` from 2026-05-03 was still poisoning the 2026-05-04 run, even though barry had been rebuilt in between. It surfaced via #2090's `clear-for-compiler` row-count printout; until that PR landed, a mistyped compiler ID was a silent no-op.

## Code references

- Proxy `latest` schema: `migrations/001-initial.sql` in compiler-explorer/conanproxy. Composite PK `(library, compiler, library_version, compiler_version, arch, libcxx, compiler_flags)`. `commithash` added in `002-commithash.sql`.
- Failure write path: `BuildLogging.setBuildFailed` (`build-logging.js:93`).
- Failure clear path: `BuildLogging.setBuildFixed` (line 30) — only fires on a *successful* build of the same combo. If the combo never succeeds because we never retry, this never fires.
- Client-side filter: `match_failed_build` in `bin/lib/library_builder.py`, called from each builder's `has_failed_before`.

## Possible fixes (cheap → invasive)

1. **Server-side TTL on read.** Add a `build_dt >= cutoff` filter (cutoff = now minus N days) to the `getFailedBuildsForLibrary` query in `build-logging.js`. Failures older than the cutoff no longer appear in `/failedbuilds` responses, so the next builder run treats them as a clean retry.

   - Pros: smallest change, no schema migration, no client work.
   - Cons: doesn't reclaim disk; the rows are still there. The cutoff is server policy and hard for a client to override.

2. **Client-side TTL.** Have `/failedbuilds` include `build_dt` in the response, and `match_failed_build` ignore failures older than N days.

   - Pros: lets policy live on the client; response shape is a small additive change.
   - Cons: still doesn't reclaim disk. There are two call paths to cover: Go and Rust currently bypass the commithash filter entirely, so they would need the same TTL check.

3. **Periodic server-side cleanup.** Cron job (or proxy-startup task) DELETEs `latest` rows where `success=0 AND build_dt < cutoff`. Plus a periodic `VACUUM`. Reclaims disk over time.

   - Pros: simple, addresses the symptom and the disk bloat.
   - Cons: doesn't help with rapid recovery — a failure recorded yesterday still blocks today's retry until cutoff.

4. **Include compiler binary identity in the failure key.** The real bug is that `compiler_version` (`clang_barry`) doesn't actually identify a unique compiler binary. Add the binary hash or build date to the key. A failure recorded for one binary doesn't apply to a freshly-rebuilt one.

   - Pros: addresses the root cause.
   - Cons: schema change in the `latest` table; every existing row's identity shifts; client + server need coordinated update; way bigger blast radius.
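
As a rough illustration of option (2), the sketch below applies an age check on the client. It assumes `/failedbuilds` would return a `build_dt` per record; the dict shape, the `params` field, and the function name `has_fresh_failure` are hypothetical and do not reflect the current `match_failed_build` signature.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def has_fresh_failure(failed_builds: Iterable[dict], current_params: dict,
                      current_commit_hash: Optional[str],
                      now: datetime, max_age: timedelta) -> bool:
    """Return True only if a matching failure exists and is younger than max_age."""
    for build in failed_builds:
        if build["params"] != current_params:
            continue
        # Go/Rust-style callers pass None and skip the commit comparison.
        if current_commit_hash is not None and build["commithash"] != current_commit_hash:
            continue
        if now - build["build_dt"] > max_age:
            continue  # stale failure: ignore it so the combo gets retried
        return True
    return False


params = {"compiler_version": "clang_barry", "library_version": "1.85.0"}
now = datetime(2026, 5, 4, tzinfo=timezone.utc)

old_failure = {"params": params, "commithash": "0123abcd",
               "build_dt": datetime(2026, 4, 1, tzinfo=timezone.utc)}
new_failure = {"params": params, "commithash": "0123abcd",
               "build_dt": datetime(2026, 5, 3, tzinfo=timezone.utc)}

week = timedelta(days=7)
stale_blocks = has_fresh_failure([old_failure], params, "0123abcd", now, week)
recent_blocks = has_fresh_failure([new_failure], params, "0123abcd", now, week)
```

With a 7-day window, the April failure no longer blocks a retry while the one from the previous day still does. Because the age check is independent of the commit comparison, the same helper would cover callers that pass `None` for the commit hash.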

## Recommendation

Start with **(1)**: add `build_dt >= cutoff` to the SELECT in `getFailedBuildsForLibrary`, so that only failures newer than the cutoff are returned. With the covering index from conanproxy#75, the filter is effectively free. A 7-day cutoff would let nightly trunk compiler rebuilds retry stale failures without flooding the table with retry attempts mid-day.
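
The read-side filter can be sketched against an in-memory SQLite database. This is only an illustration in Python, not the proxy's JavaScript: the table below is a cut-down stand-in for `latest`, the `build_dt` storage format (`YYYY-MM-DD HH:MM:SS` text) is an assumption, and `failed_builds_since` is a hypothetical helper name.

```python
import sqlite3


def failed_builds_since(conn: sqlite3.Connection, library: str, cutoff: str):
    """Return failures for `library` recorded at or after `cutoff`."""
    rows = conn.execute(
        "SELECT compiler_version, build_dt FROM latest "
        "WHERE library = ? AND success = 0 AND build_dt >= ? "
        "ORDER BY build_dt",
        (library, cutoff),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
# Cut-down stand-in for the real schema; only the columns the filter needs.
conn.execute(
    "CREATE TABLE latest (library TEXT, compiler_version TEXT, "
    "success INTEGER, build_dt TEXT)"
)
conn.executemany(
    "INSERT INTO latest VALUES (?, ?, ?, ?)",
    [
        ("boost", "clang_barry", 0, "2026-04-01 03:00:00"),  # stale failure
        ("boost", "gcc-trunk", 0, "2026-05-03 03:00:00"),    # recent failure
        ("boost", "clang_pawel", 1, "2026-05-03 03:00:00"),  # success, never listed
    ],
)

# Cutoff is passed in explicitly here; a server would compute "now minus 7 days".
recent = failed_builds_since(conn, "boost", "2026-04-27 00:00:00")
print(recent)  # [('gcc-trunk', '2026-05-03 03:00:00')]
```

The comparison relies on the timestamps being stored in a format that sorts lexicographically; if `build_dt` is stored differently, the predicate would need adjusting.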

If we want to also reclaim the disk over time, layer **(3)** on top as a separate maintenance migration / cron job. (1) handles the staleness for the read path; (3) handles long-term disk hygiene.
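
A cleanup pass along the lines of (3) might look like the following sketch, again using Python's `sqlite3` against a simplified stand-in table rather than the proxy's actual code. `purge_stale_failures` is a hypothetical name, and the text timestamp format is an assumption.

```python
import sqlite3


def purge_stale_failures(conn: sqlite3.Connection, cutoff: str) -> int:
    """Delete failure rows older than `cutoff`, then compact the database file."""
    with conn:  # commits the DELETE (and any pending writes) on exit
        cur = conn.execute(
            "DELETE FROM latest WHERE success = 0 AND build_dt < ?",
            (cutoff,),
        )
    deleted = cur.rowcount
    # VACUUM cannot run inside a transaction, so it goes after the commit.
    conn.execute("VACUUM")
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE latest (library TEXT, compiler_version TEXT, "
    "success INTEGER, build_dt TEXT)"
)
conn.executemany(
    "INSERT INTO latest VALUES (?, ?, ?, ?)",
    [
        ("boost", "clang_barry", 0, "2026-04-01 03:00:00"),  # old failure
        ("boost", "gcc-trunk", 0, "2026-05-03 03:00:00"),    # recent failure
        ("fmt", "clang_barry", 1, "2026-03-01 03:00:00"),    # old success, kept
    ],
)

deleted = purge_stale_failures(conn, "2026-04-27 00:00:00")
remaining = conn.execute("SELECT COUNT(*) FROM latest").fetchone()[0]
print(deleted, remaining)  # 1 2
```

Returning the row count mirrors the kind of printout that exposed the original incident: a cleanup that deletes nothing is visible rather than silent.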

**(4)** is the right answer for correctness but is genuinely a bigger piece of work and probably not worth it unless the staleness symptom keeps biting after (1)+(3).

## Out of scope here

- The covering index, log-truncation, and bulk-fetch performance work on `/failedbuilds` are already done (conanproxy#74, #75, #76; infra#2100, #2104).
- The four-builder duplication (#1832 has a draft) is orthogonal.
