Trunk-compiler failure records persist forever in latest table (sticky 'failed before' across nightly rebuilds) #2105

@mattgodbolt

Spun out from #1342 since the perf work there is done and this is a separate correctness/staleness concern.

Problem

When a trunk compiler (e.g. clang_barry, gcc-trunk, clang_pawel) fails to build a stable library release (e.g. boost 1.85.0, fmt 10.0.0), the failure record in the proxy's latest table persists indefinitely across nightly compiler rebuilds. The library-builder treats it as "still failing" and skips that combination forever — even after the underlying compiler bug is fixed.

Why the existing freshness guard doesn't help

has_failed_before in library_builder.py and fortran_library_builder.py filters the result by comparing the stored commithash against self.get_commit_hash():

# library_builder.py
def has_failed_before(self):
    # A stored failure only counts if its commithash equals the current one.
    return match_failed_build(
        self._get_failed_builds(),
        self.current_buildparameters_obj,
        self.get_commit_hash(),
    )

match_failed_build requires commithash == current_commit_hash for the failure to count. The intent is "stale records from a different commit don't poison new builds."

The flaw: get_commit_hash() returns the library's commit hash, not the compiler's. For:

  • Stable library releases (boost 1.85.0, fmt 10.0.0, ...): the source tag commit is fixed. get_commit_hash() returns the same value forever.
  • Trunk compilers (clang_barry, gcc-trunk): compiler_version is a stable string (clang_barry) that doesn't change when the binary is rebuilt nightly.

Result: the key (library, library_version, compiler, compiler_version, arch, libcxx, compiler_flags) is constant across compiler rebuilds, and the commithash is constant for stable libraries, so commithash == current_commit_hash always passes and the failure record sticks.

Go and Rust have it worse — has_failed_before there passes None for the commit filter (the original /hasfailedbefore endpoint had no commit-hash check), so any failure is treated as current.
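
To make the stickiness concrete, here is a minimal sketch of the comparison described above. It is not the real match_failed_build: the dict-based record shape, the function name match_failed_build_sketch, and the sample values for compiler, arch, and libcxx are assumptions for illustration. Only the rule itself (the stored commithash must equal the current hash, and None disables the check) comes from the description in this issue.

```python
from typing import Optional

# Key fields as listed in this issue; the dict representation is an assumption.
KEY_FIELDS = ("library", "library_version", "compiler", "compiler_version",
              "arch", "libcxx", "compiler_flags")


def match_failed_build_sketch(failed_builds: list[dict], params: dict,
                              current_commit_hash: Optional[str]) -> bool:
    """Return True if a stored failure matches params (and the commit, if given)."""
    for record in failed_builds:
        if any(record.get(f) != params.get(f) for f in KEY_FIELDS):
            continue
        # None means "no commit filter", as in the Go and Rust builders.
        if current_commit_hash is None or record.get("commithash") == current_commit_hash:
            return True
    return False


# Made-up sample values; only clang_barry and boost 1.85.0 come from the issue.
params = {
    "library": "boost", "library_version": "1.85.0",
    "compiler": "clang", "compiler_version": "clang_barry",
    "arch": "x86_64", "libcxx": "libstdc++", "compiler_flags": "",
}
# Failure recorded against yesterday's clang_barry binary.
failure = dict(params, commithash="abc123")

# Today the compiler binary was rebuilt, but neither the key nor the library
# commit hash changed, so the old failure still matches.
still_blocked = match_failed_build_sketch([failure], params, "abc123")
# A library whose commit moved on is not blocked by the old record.
other_commit = match_failed_build_sketch([failure], params, "def456")
# With no commit filter, any failure with a matching key counts.
no_filter = match_failed_build_sketch([failure], params, None)
```

Nothing in the inputs encodes which compiler binary produced the failure, which is why a nightly rebuild cannot change the outcome.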

Concrete incident

On 2026-05-04, during the perf-work session, clang_barry's failure to build boost_bin/1.85.0 on 2026-05-03 was still poisoning the 2026-05-04 run, even though clang_barry had been rebuilt nightly in between. It surfaced via #2090's clear-for-compiler row-count printout; a typoed compiler ID had been silently no-opping until that PR landed.

Code references

  • Proxy latest schema: migrations/001-initial.sql in compiler-explorer/conanproxy. Composite PK (library, compiler, library_version, compiler_version, arch, libcxx, compiler_flags). commithash added in 002-commithash.sql.
  • Failure write path: BuildLogging.setBuildFailed (build-logging.js:93).
  • Failure clear path: BuildLogging.setBuildFixed (line 30) — only fires on a successful build of the same combo. If the combo never succeeds because we never retry, this never fires.
  • Client-side filter: match_failed_build in bin/lib/library_builder.py, called from each builder's has_failed_before.

Possible fixes (cheap → invasive)

  1. Server-side TTL on read. Add a build_dt filter to the getFailedBuildsForLibrary query in build-logging.js so that only failures recorded within the last N days are returned. Failures older than the cutoff don't appear in /failedbuilds responses, so the next builder run treats them as a clean retry.

    • Pros: smallest change, no schema migration, no client work.
    • Cons: doesn't reclaim disk; the rows are still there. Cutoff is server-policy, hard for client to override.
  2. Client-side TTL. Have /failedbuilds include build_dt in the response, and match_failed_build ignore failures older than N days.

    • Pros: lets policy live on the client; response shape is a small additive change.
    • Cons: still doesn't reclaim disk. The Go and Rust callers currently bypass the commithash filter entirely, so they would need the TTL check as well.
  3. Periodic server-side cleanup. Cron job (or proxy-startup task) DELETEs latest rows where success=0 AND build_dt < cutoff. Plus a periodic VACUUM. Reclaims disk over time.

    • Pros: simple, addresses the symptom and the disk bloat.
    • Cons: doesn't help with rapid recovery — a failure recorded yesterday still blocks today's retry until cutoff.
  4. Include compiler binary identity in the failure key. The real bug is that compiler_version (clang_barry) doesn't actually identify a unique compiler binary. Add the binary hash or build date to the key. A failure recorded for one binary doesn't apply to a freshly-rebuilt one.

    • Pros: addresses the root cause.
    • Cons: schema change in the latest table; every existing row's identity shifts; client + server need coordinated update; way bigger blast radius.
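
As a rough illustration of option 2, the client-side age check could look like the sketch below. It rests on assumptions: that /failedbuilds would return build_dt as an ISO 8601 timestamp, that naive timestamps are UTC, and that a helper named is_failure_fresh (hypothetical) is consulted before a stored failure is allowed to block a build. The 7-day default is likewise a placeholder.

```python
from datetime import datetime, timedelta, timezone


def is_failure_fresh(build_dt: str, now: datetime, max_age_days: int = 7) -> bool:
    """True if a failure record is recent enough to still block a build."""
    recorded = datetime.fromisoformat(build_dt)
    if recorded.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        recorded = recorded.replace(tzinfo=timezone.utc)
    return now - recorded < timedelta(days=max_age_days)


now = datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)
recent = is_failure_fresh("2026-05-03T02:00:00+00:00", now)  # about a day old
stale = is_failure_fresh("2026-04-20T02:00:00+00:00", now)   # two weeks old
naive = is_failure_fresh("2026-05-03 02:00:00", now)         # no offset given
```

Passing now in explicitly keeps the check deterministic in tests; a real caller would presumably pass datetime.now(timezone.utc).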

Recommendation

Start with (1): add build_dt >= cutoff to the SELECT in getFailedBuildsForLibrary, so that only failures newer than the cutoff are returned. With the covering index from conanproxy#75, the filter is effectively free. A 7-day cutoff would let nightly trunk compiler rebuilds retry stale failures without flooding the table with retry attempts mid-day.
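
A sketch of what the read-side filter might look like, written in Python against an in-memory SQLite database so it can be run standalone. The real query lives in build-logging.js and is not reproduced here. The cut-down table layout, the 'YYYY-MM-DD HH:MM:SS' text format for build_dt, and the function name get_failed_builds are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: a cut-down stand-in for the proxy's `latest` table.
conn.execute("""
    CREATE TABLE latest (
        library TEXT, library_version TEXT, compiler_version TEXT,
        success INTEGER, build_dt TEXT
    )
""")
conn.executemany(
    "INSERT INTO latest VALUES (?, ?, ?, ?, ?)",
    [
        ("boost_bin", "1.85.0", "clang_barry", 0, "2026-05-03 02:00:00"),
        ("boost_bin", "1.85.0", "gcc-trunk", 0, "2026-04-01 02:00:00"),
    ],
)
conn.commit()


def get_failed_builds(conn, library, library_version, now, max_age_days=7):
    """Return compiler_version values whose failures are newer than the cutoff."""
    rows = conn.execute(
        """
        SELECT compiler_version FROM latest
        WHERE library = ? AND library_version = ? AND success = 0
          AND build_dt >= datetime(?, ?)
        ORDER BY compiler_version
        """,
        (library, library_version, now, f"-{max_age_days} days"),
    ).fetchall()
    return [row[0] for row in rows]


# The month-old gcc-trunk failure is filtered out; yesterday's is kept.
failed = get_failed_builds(conn, "boost_bin", "1.85.0", "2026-05-04 12:00:00")
```

Note that with a multi-day cutoff, yesterday's failure is still returned, so this bounds how long a record can stick but does not give next-day recovery.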

If we want to also reclaim the disk over time, layer (3) on top as a separate maintenance migration / cron job. (1) handles the staleness for the read path; (3) handles long-term disk hygiene.
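
The cleanup in (3) could be sketched as follows, again in Python with SQLite for the sake of a runnable example. The function name purge_stale_failures, the three-column table, and the 30-day default are assumptions; only the DELETE condition (success=0 AND build_dt < cutoff) and the follow-up VACUUM come from the option as described.

```python
import sqlite3


def purge_stale_failures(conn: sqlite3.Connection, now: str,
                         max_age_days: int = 30) -> int:
    """Delete failed rows older than the cutoff; return how many were removed."""
    with conn:  # commits the DELETE on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM latest WHERE success = 0 AND build_dt < datetime(?, ?)",
            (now, f"-{max_age_days} days"),
        )
    # VACUUM cannot run inside a transaction, so it runs after the commit.
    conn.execute("VACUUM")
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
# Assumption: a minimal stand-in for the proxy's `latest` table.
conn.execute("CREATE TABLE latest (compiler_version TEXT, success INTEGER, build_dt TEXT)")
conn.executemany("INSERT INTO latest VALUES (?, ?, ?)", [
    ("clang_barry", 0, "2026-05-03 02:00:00"),  # recent failure: kept
    ("gcc-trunk", 0, "2026-03-01 02:00:00"),    # old failure: purged
    ("clang_pawel", 1, "2026-01-01 02:00:00"),  # old success: kept
])
conn.commit()

removed = purge_stale_failures(conn, "2026-05-04 12:00:00")
remaining = [row[0] for row in conn.execute(
    "SELECT compiler_version FROM latest ORDER BY compiler_version")]
```

Returning the row count mirrors the row-count printout that made the typoed compiler ID visible in #2090: a cleanup that silently removes zero rows is easy to miss.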

(4) is the right answer for correctness but is genuinely a bigger piece of work and probably not worth it unless the staleness symptom keeps biting after (1)+(3).
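
For comparison, the idea behind (4) can be sketched with a key type that carries the compiler binary's identity. The FailureKey class, the compiler_build_id field, and the sample values are hypothetical; the issue only says that a binary hash or build date would be added to the key.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FailureKey:
    library: str
    library_version: str
    compiler: str
    compiler_version: str
    arch: str
    libcxx: str
    compiler_flags: str
    # Hypothetical extra field: a hash or build date of the compiler binary.
    compiler_build_id: str


# Made-up sample values; only clang_barry and boost 1.85.0 come from the issue.
yesterday = FailureKey("boost", "1.85.0", "clang", "clang_barry",
                       "x86_64", "libstdc++", "", "2026-05-03")
# Same combination, but built with the freshly rebuilt compiler binary.
today = replace(yesterday, compiler_build_id="2026-05-04")

failed = {yesterday}
blocked_same_binary = yesterday in failed  # the old binary is still known-bad
blocked_new_binary = today in failed       # the new binary gets a clean retry
```

With the binary identity in the key, no TTL is needed for correctness, though old rows would still accumulate without some cleanup.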

Out of scope here
