Enable GPR regalloc for F32/F64 IR values on 64-bit targets#89

Draft
TholeG wants to merge 2 commits into anthropics:main from TholeG:pr-float-regalloc

Conversation

@TholeG TholeG commented Feb 6, 2026

PR Draft: Enable Register Allocation For F32/F64 On 64-bit Targets

Authorship note: This PR draft (and the proposed code change) was prepared by Codex (OpenAI) in this workspace.

Title

Enable GPR regalloc for F32/F64 IR values on 64-bit targets

Problem / Motivation

CCC's linear-scan register allocator currently treats all floating-point IR
values as "non-GPR" and excludes them from register allocation. In practice,
the 64-bit backends (x86-64, AArch64, RISC-V 64) represent F32/F64 values
as raw bit patterns in a single general-purpose register (accumulator paths:
rax/x0/t0) and only move them into FP regs (xmm*/d*/ft*) at the
actual FP instruction boundary.

As a result, float-heavy code ends up with excessive stack spilling and can be
an order of magnitude slower than clang/gcc for hot FP loops (e.g. n-body).

What This PR Changes

  • Treat IrType::F32 and IrType::F64 values as GPR-eligible on 64-bit targets.
  • Keep excluding IrType::F128 (long double) and I128/U128 from the GPR
    allocator (these still require special codegen paths).
  • Keep the current conservative behavior on 32-bit targets (i686): floats remain
    excluded because they do not fit cleanly into a single GPR without additional
    special handling.
  • Also includes a small doc/comment fix so cargo test --release passes:
    the if_convert module docs contained indented pseudo-code that Rust treats
    as doctests; the PR wraps those snippets in ```text fences.
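
As a concrete illustration of that doc fix (the doc text and select_i32 helper below are invented for illustration; the real change is in src/passes/if_convert.rs): rustdoc treats unannotated code blocks in doc comments as Rust doctests, so indented pseudo-code fails cargo test, while a fenced block tagged text is rendered as plain prose and skipped.

````rust
/// If-conversion rewrites a branchy diamond into a select:
///
/// ```text
/// if (c) x = a; else x = b;   // before
/// x = select c, a, b          // after
/// ```
///
/// The `text` tag above opts this snippet out of doctest compilation;
/// without it, rustdoc would try to build the pseudo-code as Rust.
fn select_i32(c: bool, a: i32, b: i32) -> i32 {
    if c { a } else { b }
}
````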

Implementation Details

Files changed:

  • src/backend/regalloc.rs
  • src/passes/if_convert.rs (doc-only change; fixes failing doctests)

Key logic changes:

  • is_non_gpr_type now:
    • Excludes only F128 + I128/U128 on 64-bit targets.
    • Excludes floats + I64/U64 on 32-bit targets.
  • collect_non_gpr_values and Copy-chain propagation were updated accordingly
    (float constants are only treated as non-GPR on 32-bit targets).
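
The predicate change can be sketched as follows. The enum variants and the is_64bit_target flag are simplified stand-ins for the real definitions in src/backend/regalloc.rs, not the actual code:

```rust
// Simplified stand-in for the compiler's IR type enum.
#[derive(Clone, Copy)]
enum IrType {
    I32, I64, U64, F32, F64, F128, I128, U128,
}

/// Returns true when a value of this type must stay out of the
/// GPR linear-scan allocator.
fn is_non_gpr_type(ty: IrType, is_64bit_target: bool) -> bool {
    if is_64bit_target {
        // 64-bit targets: F32/F64 fit in one GPR as raw bit patterns,
        // so only the 128-bit types remain excluded.
        matches!(ty, IrType::F128 | IrType::I128 | IrType::U128)
    } else {
        // 32-bit targets (i686): keep the conservative behavior;
        // floats and 64-bit integers don't fit in a single GPR.
        matches!(
            ty,
            IrType::F32 | IrType::F64 | IrType::F128
                | IrType::I64 | IrType::U64
                | IrType::I128 | IrType::U128
        )
    }
}
```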

No backend-specific codegen changes are required because the 64-bit backends
already load/store F32/F64 values via the normal accumulator + bitpattern
representation (with fmov/movd/fmv bridges at FP op boundaries).
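
The reason this is safe: an F64 is exactly 64 bits, so parking it in a GPR and bridging back at an FP instruction is bit-exact. In Rust terms, the invariant looks like this (a sketch of the property, not compiler code):

```rust
// FP <-> GPR bridging is a pure bit move, so no value is altered.
fn fp_to_gpr(x: f64) -> u64 {
    x.to_bits() // analogous to `fmov x0, d0` / `movq rax, xmm0`
}

fn gpr_to_fp(bits: u64) -> f64 {
    f64::from_bits(bits) // analogous to `fmov d0, x0` at the FP op boundary
}
```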

Benchmarks (Runtime)

Environment:

  • Linux (Debian bookworm) in a Podman container on an Apple Silicon host
  • Target: AArch64 (ccc-arm)
  • Build flags: -O3 -DNDEBUG
  • Benchmark: scalar/non-SIMD nbody (benchmarksgame style), steps=20_000_000

Results:

  • Earlier run (avg of 3): CCC before ~6.95s, CCC after ~5.37s (about 23% faster)
  • One re-run (avg of 3, fresh builds in container): CCC before 7.166667s, CCC after 7.453333s (about 4.00% slower)
  • Latest re-run (avg of 10, interleaved base/patched): CCC before 6.840000s (sd=0.017321), CCC after 5.296000s (sd=0.041037) (about 22.57% faster)

Note: There was a single contradictory 3-run measurement. With more samples
the speedup is stable and matches the earlier ~23% improvement. I'd still
recommend running benchmarks on an otherwise-idle machine (or with more runs)
when presenting results.
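
For reference, the avg/sd figures above follow the usual mean and sample standard deviation; a generic sketch (not the actual benchmark harness):

```rust
// Mean of a run-time sample.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

// Sample standard deviation (Bessel-corrected, n - 1 denominator).
fn sample_sd(xs: &[f64]) -> f64 {
    let m = mean(xs);
    let var = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>()
        / (xs.len() - 1) as f64;
    var.sqrt()
}
```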

Spill proxy (assembly text stats for CCC output of nbody.c):

  • before: ldr=219, str=203, lines=1742
  • after: ldr=189, str=166, lines=1804

Repro command (example):

  1. Build and test CCC on Linux:
    • cargo test --release
  2. Build nbody and time it (AArch64 host or VM):
    • ./target/release/ccc-arm -O3 -DNDEBUG -o nbody nbody.c -lm
    • /usr/bin/time -p ./nbody 20000000

Testing

Ran cargo test --release in docker.io/library/rust:1.91-bookworm:

  • 493 passed; 0 failed; 6 ignored

Notes / Follow-ups

  • This is an incremental improvement; CCC is still significantly slower than
    clang/gcc on nbody. Further work likely needs:
    • Better float value handling (e.g., real FP-reg allocation or smarter
      interval splitting / reducing reg pressure), and/or
    • Additional backend peephole optimizations around fmov/bitpattern moves.

@ChaseWNorton

Review: APPROVE (draft — experimental)

No linked issue — standalone optimization improvement
Reviewed: Commit dff07e7a

What it does

Allows F32/F64 values to participate in GPR register allocation on 64-bit targets. Previously all float types were excluded from regalloc. Since 64-bit backends already represent F32/F64 as raw bits in a single GPR (rax/x0/t0), this is safe and should improve code quality by reducing unnecessary spills.

Changes

  • is_non_gpr_type: On 64-bit, only excludes F128/I128/U128 (not F32/F64)
  • collect_non_gpr_values: Updated consistently — F32/F64 constants only non-GPR on 32-bit
  • Comments and doc strings updated

Risk Assessment

This is a correctness-sensitive change — incorrect regalloc can produce wrong code silently. The fact that it's still a draft suggests it needs more testing. Would benefit from end-to-end codegen tests that verify floating-point values survive through regalloc correctly.
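
One possible shape for such a test (everything here is hypothetical; a real end-to-end test would compile an equivalent C kernel with ccc and compare its printed output against a reference like this): a loop that keeps many f64 values live simultaneously, so the allocator must either hold them in registers or spill and reload them correctly.

```rust
// Register-pressure kernel: eight f64 values all live across the loop
// body, exercising the new F64-in-GPR allocation paths.
fn pressure_kernel(seed: f64) -> f64 {
    let (mut a, mut b, mut c, mut d) =
        (seed, seed + 1.0, seed + 2.0, seed + 3.0);
    let (mut e, mut f, mut g, mut h) =
        (seed + 4.0, seed + 5.0, seed + 6.0, seed + 7.0);
    for _ in 0..100 {
        a = a * 1.0001 + b;
        b = b * 0.9999 + c;
        c = c * 1.0002 + d;
        d = d * 0.9998 + e;
        e = e * 1.0003 + f;
        f = f * 0.9997 + g;
        g = g * 1.0004 + h;
        h = h * 0.9996 + a;
    }
    a + b + c + d + e + f + g + h
}
```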
