Enable GPR regalloc for F32/F64 IR values on 64-bit targets #89
Draft
TholeG wants to merge 2 commits into anthropics:main from
Review: APPROVE (draft, experimental)
No linked issue; standalone optimization improvement.
What it does
Allows F32/F64 values to participate in GPR register allocation on 64-bit targets. Previously all float types were excluded from regalloc. Since 64-bit backends already represent F32/F64 as raw bits in a single GPR (rax/x0/t0), this is safe and should improve code quality by reducing unnecessary spills.
Risk Assessment
This is a correctness-sensitive change: incorrect regalloc can silently produce wrong code. The fact that it is still a draft suggests it needs more testing. It would benefit from end-to-end codegen tests that verify floating-point values survive regalloc correctly.
PR Draft: Enable Register Allocation For F32/F64 On 64-bit Targets
Authorship note: This PR draft (and the proposed code change) was prepared by Codex (OpenAI) in this workspace.
Title
Enable GPR regalloc for `F32`/`F64` IR values on 64-bit targets
Problem / Motivation
CCC's linear-scan register allocator currently treats all floating-point IR
values as "non-GPR" and excludes them from register allocation. In practice,
the 64-bit backends (x86-64, AArch64, RISC-V 64) represent `F32`/`F64` values
as raw bit patterns in a single general-purpose register (accumulator paths:
`rax`/`x0`/`t0`) and only move them into FP regs (`xmm*`/`d*`/`ft*`) at the
actual FP instruction boundary.
As a result, float-heavy code ends up with excessive stack spilling and can be
an order of magnitude slower than clang/gcc for hot FP loops (e.g. n-body).
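To make the "raw bit pattern in a GPR" representation concrete, here is a minimal sketch that models a 64-bit GPR as a `u64`. The function names are hypothetical and are not taken from CCC; zero-extending the `F32` pattern into the upper bits is likewise an assumption of the sketch.

```rust
// Illustrative only: a 64-bit GPR is modelled as a u64.
// All function names here are hypothetical, not CCC APIs.

/// Place an f64 in a "GPR" as its raw IEEE-754 bit pattern.
fn f64_to_gpr(x: f64) -> u64 {
    x.to_bits()
}

/// Reinterpret a GPR's contents as an f64 (the bridge into an FP register).
fn gpr_to_f64(bits: u64) -> f64 {
    f64::from_bits(bits)
}

/// Place an f32 in a GPR. Zero-extending the 32-bit pattern is an
/// assumption of this sketch, not a documented CCC convention.
fn f32_to_gpr(x: f32) -> u64 {
    u64::from(x.to_bits())
}

/// Reinterpret the low 32 bits of a GPR as an f32.
fn gpr_to_f32(bits: u64) -> f32 {
    f32::from_bits(bits as u32)
}

/// An FP add whose operands and result live in GPRs: the values cross into
/// the FP domain only for the arithmetic itself.
fn fadd_f64_via_gpr(lhs: u64, rhs: u64) -> u64 {
    f64_to_gpr(gpr_to_f64(lhs) + gpr_to_f64(rhs))
}
```

Because the value is just 64 (or 32) bits between FP operations, an allocator that only knows about GPRs can in principle keep it in a register instead of a stack slot.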
What This PR Changes
- Treat `IrType::F32` and `IrType::F64` values as GPR-eligible on 64-bit targets.
- Continue to exclude `IrType::F128` (long double) and `I128`/`U128` from the GPR allocator (these still require special codegen paths).
- On 32-bit targets, floats and `I64`/`U64` remain excluded because they do not fit cleanly into a single GPR without additional special handling.
- Ensure `cargo test --release` passes: the `if_convert` module docs contained indented pseudo-code that Rust treats as doctests; the PR wraps those snippets in `text` code fences.
Implementation Details
Files changed:
- `src/backend/regalloc.rs`
- `src/passes/if_convert.rs` (doc-only change; fixes failing doctests)
Key logic changes:
- `is_non_gpr_type` now:
  - excludes only `F128` + `I128`/`U128` on 64-bit targets.
  - still excludes floats and `I64`/`U64` on 32-bit targets.
- `collect_non_gpr_values` and Copy-chain propagation were updated accordingly (float constants are only treated as non-GPR on 32-bit targets).
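The classification described above can be sketched roughly as follows. This is not the code from `src/backend/regalloc.rs`: the `IrType` enum is a reduced, hypothetical stand-in, the `target_is_64bit` parameter is an assumed way of passing the target width, and treating the 128-bit types as excluded on 32-bit targets as well is an assumption.

```rust
// Hypothetical, simplified stand-in for CCC's IR type enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum IrType {
    I32,
    I64,
    U64,
    I128,
    U128,
    F32,
    F64,
    F128,
}

/// Returns true when a value of type `ty` must stay out of GPR allocation.
/// The `target_is_64bit` parameter is an assumption of this sketch.
fn is_non_gpr_type(ty: IrType, target_is_64bit: bool) -> bool {
    match ty {
        // Wider than a single GPR; assumed excluded on every target.
        IrType::F128 | IrType::I128 | IrType::U128 => true,
        // Raw bit patterns fit in one 64-bit GPR, so these are excluded
        // only on 32-bit targets.
        IrType::F32 | IrType::F64 => !target_is_64bit,
        // 64-bit integers need special handling on 32-bit targets.
        IrType::I64 | IrType::U64 => !target_is_64bit,
        // Fits in a GPR on both 32-bit and 64-bit targets.
        IrType::I32 => false,
    }
}
```

With this shape, `is_non_gpr_type(IrType::F64, true)` is `false` (the value may be allocated to a GPR), while `is_non_gpr_type(IrType::F64, false)` and `is_non_gpr_type(IrType::F128, true)` are both `true`.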
No backend-specific codegen changes are required because the 64-bit backends
already load/store `F32`/`F64` values via the normal accumulator + bitpattern
representation (with `fmov`/`movd`/`fmv` bridges at FP op boundaries).
Benchmarks (Runtime)
Environment:
- `ccc-arm`
- `-O3 -DNDEBUG`
- `nbody` (benchmarksgame style), `steps=20_000_000`
Results:
Note: There was a single contradictory 3-run measurement. With more samples
the speedup is stable and matches the earlier ~23% improvement. I'd still
recommend running benchmarks on an otherwise-idle machine (or with more runs)
when presenting results.
Spill proxy (assembly text stats for CCC output of `nbody.c`):
- `ldr=219`, `str=203`, `lines=1742`
- `ldr=189`, `str=166`, `lines=1804`
Repro command (example):
- `cargo test --release`
- Build `nbody` and time it (AArch64 host or VM):
  - `./target/release/ccc-arm -O3 -DNDEBUG -o nbody nbody.c -lm`
  - `/usr/bin/time -p ./nbody 20000000`
Testing
Ran `cargo test --release` in `docker.io/library/rust:1.91-bookworm`:
- `493 passed; 0 failed; 6 ignored`
Notes / Follow-ups
- CCC is still slower than clang/gcc on `nbody`. Further work likely needs:
  - regalloc improvements (interval splitting / reducing reg pressure), and/or
  - fewer `fmov`/bitpattern moves.