Spun out from #1342 since the perf work there is done and this is a separate correctness/staleness concern.
Problem
When a trunk compiler (e.g. clang_barry, gcc-trunk, clang_pawel) fails to build a stable library release (e.g. boost 1.85.0, fmt 10.0.0), the failure record in the proxy's latest table persists indefinitely across nightly compiler rebuilds. The library-builder treats it as "still failing" and skips that combination forever — even after the underlying compiler bug is fixed.
Why the existing freshness guard doesn't help
has_failed_before in library_builder.py and fortran_library_builder.py filters the result by comparing the stored commithash against self.get_commit_hash():
    # library_builder.py
    def has_failed_before(self):
        return match_failed_build(self._get_failed_builds(), self.current_buildparameters_obj, self.get_commit_hash())
match_failed_build requires commithash == current_commit_hash for the failure to count. The intent is "stale records from a different commit don't poison new builds."
The flaw: get_commit_hash() returns the library's commit hash, not the compiler's. For:
- Stable library releases (boost 1.85.0, fmt 10.0.0, ...): the source tag commit is fixed. get_commit_hash() returns the same value forever.
- Trunk compilers (clang_barry, gcc-trunk): compiler_version is a stable string (clang_barry) that doesn't change when the binary is rebuilt nightly.
Result: the key (library, library_version, compiler, compiler_version, arch, libcxx, compiler_flags) is constant across compiler rebuilds and the commithash is constant for stable libraries, so commithash == current_commit_hash always passes and the failure record sticks.
Go and Rust have it worse — has_failed_before there passes None for the commit filter (the original /hasfailedbefore endpoint had no commit-hash check), so any failure is treated as current.
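To make the failure mode concrete, here is a minimal sketch of the matching logic as described above. The record shape, the field values, and the body of match_failed_build_sketch are assumptions for illustration; the real match_failed_build in bin/lib/library_builder.py may differ. Only the commithash comparison and the None bypass are taken from the description.

```python
from typing import Optional

# Hypothetical record shape; the real /failedbuilds payload may differ.
KEY_FIELDS = ("library", "library_version", "compiler", "compiler_version",
              "arch", "libcxx", "compiler_flags")


def match_failed_build_sketch(failed_builds: list, params: dict,
                              current_commit_hash: Optional[str]) -> bool:
    """Return True if a stored failure matches this build combination."""
    for record in failed_builds:
        if any(record[field] != params[field] for field in KEY_FIELDS):
            continue
        # None disables the commit filter (the Go/Rust behaviour described above).
        if current_commit_hash is None or record["commithash"] == current_commit_hash:
            return True
    return False


# Illustrative values only.
params = {
    "library": "boost", "library_version": "1.85.0",
    "compiler": "clang", "compiler_version": "clang_barry",
    "arch": "x86_64", "libcxx": "libstdc++", "compiler_flags": "",
}
# Failure recorded against an earlier clang_barry binary.
stored = [dict(params, commithash="abc123")]

# After a nightly rebuild nothing in the key or the library commit hash has
# changed, so the old failure still matches.
still_blocked = match_failed_build_sketch(stored, params, "abc123")
blocked_without_filter = match_failed_build_sketch(stored, params, None)
# Only a different library commit makes the record stale.
unblocked_on_new_library_commit = not match_failed_build_sketch(stored, params, "def456")
```

Nothing the compiler rebuild changes is an input to the match, which is why the record can only be cleared by a successful build or by hand.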
Concrete incident
On 2026-05-04 during the perf-work session: clang_barry's failure to build boost_bin/1.85.0 from 2026-05-03 was still poisoning the 2026-05-04 run despite barry rebuilding nightly in between. Surfaced via #2090's clear-for-compiler row-count printout — a typoed compiler ID was silently no-opping until that PR landed.
Code references
- Proxy latest schema: migrations/001-initial.sql in compiler-explorer/conanproxy. Composite PK (library, compiler, library_version, compiler_version, arch, libcxx, compiler_flags). commithash added in 002-commithash.sql.
- Failure write path: BuildLogging.setBuildFailed (build-logging.js:93).
- Failure clear path: BuildLogging.setBuildFixed (line 30), which only fires on a successful build of the same combo. If the combo never succeeds because we never retry, this never fires.
- Client-side filter: match_failed_build in bin/lib/library_builder.py, called from each builder's has_failed_before.
Possible fixes (cheap → invasive)
1. Server-side TTL on read. Add a build_dt cutoff to the getFailedBuildsForLibrary query in build-logging.js so that it returns only failures recorded within the last N days. Failures older than the cutoff don't appear in /failedbuilds responses; the next builder run treats them as a clean retry.
   - Pros: smallest change, no schema migration, no client work.
   - Cons: doesn't reclaim disk; the rows are still there. The cutoff is server policy, hard for a client to override.
2. Client-side TTL. Have /failedbuilds include build_dt in the response, and have match_failed_build ignore failures older than N days.
   - Pros: lets policy live on the client; response shape is a small additive change.
   - Cons: still doesn't reclaim disk. Go and Rust currently bypass the commithash filter entirely, so their has_failed_before paths would need the same TTL check.
3. Periodic server-side cleanup. A cron job (or proxy-startup task) DELETEs latest rows where success=0 AND build_dt < cutoff, plus a periodic VACUUM. Reclaims disk over time.
   - Pros: simple, addresses the symptom and the disk bloat.
   - Cons: doesn't help with rapid recovery: a failure recorded yesterday still blocks today's retry until the cutoff.
4. Include compiler binary identity in the failure key. The real bug is that compiler_version (clang_barry) doesn't actually identify a unique compiler binary. Add the binary hash or build date to the key. A failure recorded for one binary doesn't apply to a freshly-rebuilt one.
   - Pros: addresses the root cause.
   - Cons: schema change in the latest table; every existing row's identity shifts; client and server need a coordinated update; much bigger blast radius.
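As a rough illustration of the client-side TTL option, the sketch below drops failures older than N days before any key or commit matching happens. It assumes /failedbuilds would return build_dt as an ISO-8601 string in UTC; that format, the helper names, and the sample records are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_recent_failure(build_dt: str, max_age_days: int,
                      now: Optional[datetime] = None) -> bool:
    """True if the failure is newer than the TTL and should still block a build."""
    now = now or datetime.now(timezone.utc)
    recorded = datetime.fromisoformat(build_dt)
    if recorded.tzinfo is None:
        # Assumption: naive timestamps from the proxy are UTC.
        recorded = recorded.replace(tzinfo=timezone.utc)
    return now - recorded < timedelta(days=max_age_days)


def filter_failures(failed_builds: list, max_age_days: int,
                    now: Optional[datetime] = None) -> list:
    """Drop failures older than the TTL before any key/commit matching."""
    return [f for f in failed_builds
            if is_recent_failure(f["build_dt"], max_age_days, now)]


now = datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)
failures = [
    {"compiler_version": "clang_barry", "build_dt": "2026-05-03 02:00:00"},
    {"compiler_version": "gcc-trunk", "build_dt": "2026-04-20 02:00:00"},
]
kept = filter_failures(failures, max_age_days=7, now=now)
```

Because the check does not depend on the commit hash, the same filter could sit in front of the Go and Rust paths. Note that with a 7-day TTL a failure recorded the previous day is still kept, so a next-night retry would need a much shorter TTL.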
Recommendation
Start with (1): add a build_dt >= cutoff condition to the SELECT in getFailedBuildsForLibrary, so that only failures newer than the cutoff are returned. With the covering index from conanproxy#75, the filter is effectively free. A 7-day cutoff would let nightly trunk compiler rebuilds retry stale failures without flooding the table with retry attempts mid-day.
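The read-side filter could look roughly like the following. The real query lives in build-logging.js; this sketch uses Python's sqlite3 module against a reduced, hypothetical version of the latest table, and assumes build_dt is stored as ISO-8601 text. The key point is the direction of the comparison: rows are kept only when build_dt is at or after the cutoff, so older failures drop out of the response.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of the `latest` table.
conn.execute("""
    CREATE TABLE latest (
        library TEXT, library_version TEXT, compiler_version TEXT,
        success INTEGER, build_dt TEXT
    )
""")
conn.executemany(
    "INSERT INTO latest VALUES (?, ?, ?, ?, ?)",
    [
        ("boost", "1.85.0", "clang_barry", 0, "2026-04-20 02:00:00"),
        ("boost", "1.85.0", "gcc-trunk", 0, "2026-05-03 02:00:00"),
        ("boost", "1.85.0", "g141", 1, "2026-05-03 02:00:00"),
    ],
)
conn.commit()


def failed_builds_for_library(conn, library, library_version, now, ttl_days=7):
    """Return failures newer than the cutoff; older ones are hidden, not deleted."""
    return conn.execute(
        """
        SELECT compiler_version, build_dt FROM latest
        WHERE library = ? AND library_version = ? AND success = 0
          AND build_dt >= datetime(?, ?)
        """,
        (library, library_version, now, f"-{ttl_days} days"),
    ).fetchall()


rows = failed_builds_for_library(conn, "boost", "1.85.0", "2026-05-04 12:00:00")
```

Here the two-week-old clang_barry failure is no longer returned, so the next builder run would retry it, while the recent gcc-trunk failure still blocks.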
If we want to also reclaim the disk over time, layer (3) on top as a separate maintenance migration / cron job. (1) handles the staleness for the read path; (3) handles long-term disk hygiene.
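A cleanup pass along the lines of (3) might be sketched as below, again with Python's sqlite3 module and a reduced, hypothetical latest table with build_dt stored as ISO-8601 text. The function name and the 30-day default are illustrative, not a proposal for the actual policy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of the `latest` table.
conn.execute(
    "CREATE TABLE latest (compiler_version TEXT, success INTEGER, build_dt TEXT)"
)
conn.executemany(
    "INSERT INTO latest VALUES (?, ?, ?)",
    [
        ("clang_barry", 0, "2026-03-01 02:00:00"),  # old failure: purged
        ("gcc-trunk", 0, "2026-05-03 02:00:00"),    # recent failure: kept
        ("g141", 1, "2026-01-01 02:00:00"),         # old success: kept
    ],
)
conn.commit()


def purge_stale_failures(conn, now, ttl_days=30):
    """Delete failure rows older than the cutoff, then compact the database."""
    with conn:  # commits the DELETE on success
        cur = conn.execute(
            "DELETE FROM latest WHERE success = 0 AND build_dt < datetime(?, ?)",
            (now, f"-{ttl_days} days"),
        )
        deleted = cur.rowcount
    # VACUUM cannot run inside a transaction, so it follows the commit.
    conn.execute("VACUUM")
    return deleted


deleted = purge_stale_failures(conn, "2026-05-04 12:00:00")
remaining = conn.execute("SELECT COUNT(*) FROM latest").fetchone()[0]
```

Successful rows are never touched, so the purge only removes records that the read-side cutoff in (1) would already be hiding.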
(4) is the right answer for correctness but is genuinely a bigger piece of work and probably not worth it unless the staleness symptom keeps biting after (1)+(3).
Out of scope here
/failedbuilds are already done (conanproxy#74, #75, #76; infra#2100, #2104).