
fastalloc

A specialized game-engine memory allocator. Three tiers. Sub-nanosecond access for lifetime-known resources.



(Diagram: lifetime-annotated transient access.)

Why this exists

General-purpose allocators are tuned for server workloads — unpredictable sizes, arbitrary threading, throughput over latency. Games do not match that shape. fastalloc exploits three properties that generic allocators ignore:

| Property | What it means | fastalloc's answer |
| --- | --- | --- |
| Predictable lifetimes | Frame-graph transient resources live for a handful of passes known at declaration time | Tier 0 — offline-planned offset table |
| Typed allocations | A component pool already knows its own object size | Tier 1 — bucketed proxy, pagemap-owned |
| Producer-consumer threads | Asset streamer allocates, render thread frees | Tier 2 — lock-free MPSC inbox with batching |

Quick start

```cpp
#include "fastalloc.hpp"

fastalloc::FastAlloc alloc;

void* p = alloc.allocate(256);
// ... use p ...
alloc.deallocate(p, 256);   // sized free — fast path
```

For the full tripartite router with planned transients:

```cpp
#include <array>
#include <cstdlib>   // std::malloc
#include "sfta.hpp"

std::array<fastalloc::BufferDecl, 2> decls{{
    {.id = 0, .size = 4096, .align = 256, .birth = 0, .death = 2},
    {.id = 1, .size = 4096, .align = 256, .birth = 3, .death = 5},
}};

fastalloc::StaticPlan plan;
auto footprint = plan.plan(decls);

fastalloc::StaticArena arena(std::malloc(footprint), footprint, plan.placements());
void* rt = arena.get(0);     // one add instruction — no allocator invocation
```
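Note what the plan above buys you: buffer 0 dies at pass 2 and buffer 1 is not born until pass 3, so their lifetimes never overlap and the planner can alias both to the same offset, shrinking the footprint to a single 4096 B slot. A minimal sketch of that overlap test (illustrative only, not the library's planner; `Decl`, `lifetimes_overlap`, and `footprint2` are made-up names):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the declaration fields that matter here.
struct Decl { uint32_t id; size_t size; int birth; int death; };

// Closed intervals [birth, death] overlap iff neither ends before the
// other begins.
inline bool lifetimes_overlap(const Decl& a, const Decl& b) {
    return a.birth <= b.death && b.birth <= a.death;
}

// Footprint for exactly two buffers: they may share one slot when their
// lifetimes are disjoint, otherwise they need the sum of their sizes.
inline size_t footprint2(const Decl& a, const Decl& b) {
    return lifetimes_overlap(a, b) ? a.size + b.size
                                   : (a.size > b.size ? a.size : b.size);
}
```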

Architecture

| Tier | Role | Per-op cost | Backing |
| --- | --- | --- | --- |
| 0 — Static DSA | Lifetime-known resources (frame graph, scene epochs) | ~0 ns (one pointer add) | Offline-planned offset table |
| 1 — Dynamic | Bounded-size typed allocations | ~4–20 ns | Proxy slabs + TLSF fallback |
| 2 — Cross-thread | Producer-consumer asset streaming | lock-free (remote free via CAS) | Proxy + MPSC inbox with BatchIt batching |

The free path routes by virtual-address range: at most two range checks plus a pagemap lookup identify the owning tier. No branching on type, no runtime dispatch cost.
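The range-check routing can be sketched like this; the base addresses, sizes, and names below are invented for illustration and are not fastalloc's actual layout:

```cpp
#include <cassert>
#include <cstdint>

enum class Tier { Static, Dynamic, CrossThread, Foreign };

// Made-up, non-overlapping virtual ranges for the sketch.
constexpr uintptr_t kStaticBase  = 0x1000'0000;
constexpr uintptr_t kStaticSize  = 0x0100'0000;
constexpr uintptr_t kDynamicBase = 0x2000'0000;
constexpr uintptr_t kDynamicSize = 0x2000'0000;

inline Tier route(const void* p) {
    auto a = reinterpret_cast<uintptr_t>(p);
    // Unsigned subtraction folds "base <= a < base + size" into one compare.
    if (a - kStaticBase  < kStaticSize)  return Tier::Static;   // check 1
    if (a - kDynamicBase < kDynamicSize) return Tier::Dynamic;  // check 2
    // A real router would now consult the pagemap to distinguish Tier 2
    // chunks from foreign pointers; elided here.
    return Tier::Foreign;
}
```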

Benchmarks

Measured against five game-focused allocators — mimalloc, rpmalloc, snmalloc, smmalloc, ltalloc — plus glibc malloc as a baseline. 5-iteration median per cell, averaged across 5 runs, core-pinned, -O3 -march=native -flto, unsized-free API across every adapter. Full methodology and all adapters live in the companion repo memalloc-microbenchmarks.

Latency

(Chart: mean latency per workload.)

Relative throughput

(Chart: fastalloc vs next-fastest per workload.)

Hardware: Intel Core i5-7300U @ 2.60 GHz (2C/4T, 3 MiB L3), 16 GiB DDR4. OS: Ubuntu 24.04 LTS, kernel 6.8.0, transparent_hugepage/enabled=[madvise]. Numbers scale down on newer silicon; relative ordering between allocators is stable across hardware in the range we tested.

Scope of the claim

Note

fastalloc leads 9 of the 10 measured workloads. ltalloc takes fixed 64 B churn by a sliver (8.3 ns vs 9.0 ns — within ~1 ns). The measurement is warm steady-state: each allocator is constructed once, warmed up, then driven through the bench. This is the profile an engine actually runs with. Cold-start-heavy workloads that rebuild allocators per frame are outside the scope of these numbers.

Building

```shell
CC=clang CXX=clang++ cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
```

Header-only under src/. Drop the headers into your tree, include one, done.

Tip

Requires a C++20 compiler (tested with clang 18) and CMake 3.25+.

API surface

```cpp
// fastalloc.hpp        — unified Tier 1 entry point
class FastAlloc {
    void* allocate(std::size_t size) noexcept;
    void  deallocate(void* p, std::size_t size) noexcept;   // sized, fast
    void  deallocate(void* p) noexcept;                     // unsized
};

// static_arena.hpp     — static DSA planner + runtime arena
struct BufferDecl { uint32_t id; size_t size; size_t align; int birth; int death; };
class StaticPlan  { size_t plan(std::span<const BufferDecl>); };
class StaticArena { void* get(uint32_t id) const noexcept; };   // one instruction

// cross_thread.hpp     — lock-free producer-consumer
class CrossThreadArena { void* allocate(size_t); void local_deallocate(void*); };
class RemoteFreer      { void defer_free(CrossThreadArena*, void*); };

// sfta.hpp             — full tripartite router
class SFTA { /* resolve_static, allocate_dynamic, allocate_cross, deallocate */ };
```

Design

  • Tier 0 solves Dynamic Storage Allocation offline with a best-fit-by-address heuristic — place largest-first, pick the lowest offset that avoids collision with any simultaneously-live buffer. O(n²) in buffer count; expected to run at level load. Output is a flat id → offset table.
  • Tier 1a (proxy) holds per-size-class free lists as raw Node* arrays packed into one cache line on the hot path. A pagemap over 16 KiB chunks gives O(1) owner identification when the size is not known at free time. A size-dispatch table maps request size to bucket index via one shift.
  • Tier 1b (TLSF) implements the classical Masmano two-level bitmap. Small constants (~200 instructions worst case) and bounded internal fragmentation (2^(1/SLI) − 1).
  • Tier 2 pairs the proxy with a lock-free MPSC inbox per owner. Remote threads push frees via a single CAS on the queue head. Owner drains on its next allocate. A per-target batcher (BatchIt) amortises CAS cost across 4–32 frees.
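The Tier 0 heuristic can be sketched as follows. This is an illustrative reimplementation of the description above (largest-first, lowest offset that avoids every simultaneously-live neighbour), not the library's code; alignment handling is elided and `Buf`, `live_together`, and `plan` are made-up names:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Buf { uint32_t id; size_t size; int birth; int death; size_t offset = 0; };

inline bool live_together(const Buf& a, const Buf& b) {
    return a.birth <= b.death && b.birth <= a.death;
}

// Returns the total footprint; writes each buffer's offset in place.
inline size_t plan(std::vector<Buf>& bufs) {
    // Largest-first placement order.
    std::sort(bufs.begin(), bufs.end(),
              [](const Buf& a, const Buf& b) { return a.size > b.size; });
    size_t footprint = 0;
    for (size_t i = 0; i < bufs.size(); ++i) {
        size_t off = 0;
        bool moved = true;
        while (moved) {               // O(n^2) overall — fine at level load
            moved = false;
            for (size_t j = 0; j < i; ++j) {
                const Buf& p = bufs[j];
                // Collision = overlapping lifetime AND overlapping bytes.
                if (live_together(bufs[i], p) &&
                    off < p.offset + p.size && p.offset < off + bufs[i].size) {
                    off = p.offset + p.size;   // bump past the collision
                    moved = true;
                }
            }
        }
        bufs[i].offset = off;
        footprint = std::max(footprint, off + bufs[i].size);
    }
    return footprint;
}
```

Disjoint-lifetime buffers alias to the same offset, so the footprint can be far smaller than the sum of declared sizes.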

All three tiers live in non-overlapping virtual ranges, so the free path routes with at most two range checks plus one pagemap lookup.
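The Tier 2 remote-free mechanism boils down to an intrusive lock-free list used as an inbox. A minimal single-push sketch, not fastalloc's implementation (which would batch 4–32 frees per CAS via BatchIt; `FreeNode` and `Inbox` are invented names):

```cpp
#include <atomic>
#include <cstddef>

// Freed blocks are reused as list nodes — no extra allocation.
struct FreeNode { FreeNode* next; };

class Inbox {
    std::atomic<FreeNode*> head_{nullptr};
public:
    // Called by any remote thread: one CAS per push.
    void push(FreeNode* n) {
        FreeNode* old = head_.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!head_.compare_exchange_weak(old, n,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }
    // Called by the owner on its next allocate: detach the whole list
    // with a single exchange, then recycle the nodes locally.
    FreeNode* drain() {
        return head_.exchange(nullptr, std::memory_order_acquire);
    }
};
```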

Limitations

Warning

  • Not thread-safe. Tier 1 is single-threaded. Use Tier 2 for cross-thread traffic, or run one instance per thread.
  • Not a C-ABI malloc. allocate / deallocate / reallocate / allocate_aligned are member functions on FastAlloc; there is no LD_PRELOAD-drop-in entry point. Wrap if you need one.
  • Pre-reserves ~768 MiB VMA for the default config (512 MiB proxy + 256 MiB TLSF). Physical RSS grows only on touch. Tunable.
  • Tier 0 requires lifetime declarations. Engines without a frame graph cannot use it; the benefits then fall to Tier 1.
  • Adversarial workloads (long-lived + unbounded size range) push work into the TLSF fallback; performance there matches mimalloc / rpmalloc rather than beating them.

Code size

| Project | Purpose | SLOC |
| --- | --- | --- |
| fastalloc | SFTA tripartite allocator | ~700 |
| mimalloc | General-purpose SOTA | ~12 700 |
| rpmalloc | Game-focused general-purpose | ~8 000 |

References

  1. TLSF — Masmano, Ripoll, Crespo, Real, TLSF: A New Dynamic Memory Allocator for Real-Time Systems, ECRTS 2004.
  2. Static planning — Lamprakos, Xanthopoulos, Katsaragakis, Xydis, Soudris, Catthoor, Futureproof Static Memory Planning, ACM TOPLAS 2025. arXiv:2504.04874.
  3. snmalloc / BatchIt — Liétar et al., ISMM 2019; Filardo & Parkinson, ISMM 2024. Inform the Tier 2 message-passing design.
  4. smmalloc — Sergey Makeev, https://github.com/SergeyMakeev/smmalloc.

License

Apache License 2.0. Copyright 2026 z3rco. See LICENSE and NOTICE. Includes the patent grant — adopters are protected from contributor-filed patents covering the code they use.
