A specialized game-engine memory allocator. Three tiers. Sub-nanosecond access for lifetime-known resources.
General-purpose allocators are tuned for server workloads — unpredictable
sizes, arbitrary threading, throughput over latency. Games do not match that
shape. fastalloc exploits three properties that generic allocators ignore:
| Property | What it means | fastalloc's answer |
|---|---|---|
| Predictable lifetimes | Frame-graph transient resources live for a handful of passes known at declaration time | Tier 0 — offline-planned offset table |
| Typed allocations | A component pool already knows its own object size | Tier 1 — bucketed proxy, pagemap-owned |
| Producer-consumer threads | Asset streamer allocates, render thread frees | Tier 2 — lock-free MPSC inbox with batching |
#include "fastalloc.hpp"
fastalloc::FastAlloc alloc;
void* p = alloc.allocate(256);
// ... use p ...
alloc.deallocate(p, 256); // sized free — fast pathFor the full tripartite router with planned transients:
#include "sfta.hpp"
std::array<fastalloc::BufferDecl, 2> decls{{
{.id = 0, .size = 4096, .align = 256, .birth = 0, .death = 2},
{.id = 1, .size = 4096, .align = 256, .birth = 3, .death = 5},
}};
fastalloc::StaticPlan plan;
auto footprint = plan.plan(decls);
fastalloc::StaticArena arena(std::malloc(footprint), footprint, plan.placements());
void* rt = arena.get(0); // one add instruction — no allocator invocation| Tier | Role | Per-op cost | Backing |
|---|---|---|---|
| 0 — Static DSA | Lifetime-known resources (frame graph, scene epochs) | ~0 ns one pointer add |
Offline-planned offset table |
| 1 — Dynamic | Bounded-size typed allocations | ~4–20 ns |
Proxy slabs + TLSF fallback |
| 2 — Cross-thread | Producer-consumer asset streaming | lock-free remote free via CAS |
Proxy + MPSC inbox with BatchIt batching |
The free path routes by virtual-address range: at most two range checks plus a pagemap lookup identify the owning tier. No branching on type, no runtime dispatch cost.
Measured against six game-focused allocators — mimalloc, rpmalloc,
snmalloc, smmalloc, ltalloc — plus glibc malloc as a baseline.
5-iteration median per cell, averaged across 5 runs, core-pinned,
-O3 -march=native -flto, unsized-free API across every adapter.
Full methodology and all adapters live in the companion repo
memalloc-microbenchmarks.
Hardware: Intel Core i5-7300U @ 2.60 GHz (2C/4T, 3 MiB L3), 16 GiB DDR4.
OS: Ubuntu 24.04 LTS, kernel 6.8.0, transparent_hugepage/enabled=[madvise].
Numbers scale down on newer silicon; relative ordering between allocators
is stable across hardware in the range we tested.
Note
fastalloc leads 9 of the 10 measured workloads. ltalloc takes
fixed 64 B churn by a sliver (8.3 ns vs 9.0 ns — within ~1 ns). The
measurement is warm steady-state: each allocator is constructed once,
warmed up, then driven through the bench. This is the profile an engine
actually runs with. Cold-start-heavy workloads that rebuild allocators
per frame are outside the scope of these numbers.
CC=clang CXX=clang++ cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build buildHeader-only under src/. Drop the headers into your tree, include one,
done.
Tip
Requires a C++20 compiler (tested with clang 18) and CMake 3.25+.
// fastalloc.hpp — unified Tier 1 entry point
class FastAlloc {
void* allocate(std::size_t size) noexcept;
void deallocate(void* p, std::size_t size) noexcept; // sized, fast
void deallocate(void* p) noexcept; // unsized
};
// static_arena.hpp — static DSA planner + runtime arena
struct BufferDecl { uint32_t id; size_t size; size_t align; int birth; int death; };
class StaticPlan { size_t plan(std::span<const BufferDecl>); };
class StaticArena { void* get(uint32_t id) const noexcept; }; // one instruction
// cross_thread.hpp — lock-free producer-consumer
class CrossThreadArena { void* allocate(size_t); void local_deallocate(void*); };
class RemoteFreer { void defer_free(CrossThreadArena*, void*); };
// sfta.hpp — full tripartite router
class SFTA { /* resolve_static, allocate_dynamic, allocate_cross, deallocate */ };- Tier 0 solves Dynamic Storage Allocation offline with a best-fit-by-
address heuristic — place largest-first, pick the lowest offset that
avoids collision with any simultaneously-live buffer. O(n²) in buffer
count; expected to run at level load. Output is a flat
id → offsettable. - Tier 1a (proxy) holds per-size-class free lists as raw
Node*arrays packed into one cache line on the hot path. A pagemap over 16 KiB chunks gives O(1) owner identification when the size is not known at free time. A size-dispatch table maps request size to bucket index via one shift. - Tier 1b (TLSF) implements the classical Masmano two-level bitmap.
Small constants (~200 instructions worst case) and bounded internal
fragmentation (
2^(1/SLI) − 1). - Tier 2 pairs the proxy with a lock-free MPSC inbox per owner. Remote threads push frees via a single CAS on the queue head. Owner drains on its next allocate. A per-target batcher (BatchIt) amortises CAS cost across 4–32 frees.
All three tiers live in non-overlapping virtual ranges, so the free path routes with at most two range checks plus one pagemap lookup.
Warning
- Not thread-safe. Tier 1 is single-threaded. Use Tier 2 for cross-thread traffic, or run one instance per thread.
- Not a C-ABI
malloc.allocate/deallocate/reallocate/allocate_alignedare member functions onFastAlloc; there is noLD_PRELOAD-drop-in entry point. Wrap if you need one. - Pre-reserves ~768 MiB VMA for the default config (512 MiB proxy + 256 MiB TLSF). Physical RSS grows only on touch. Tunable.
- Tier 0 requires lifetime declarations. Engines without a frame graph cannot use it; the benefits then fall to Tier 1.
- Adversarial workloads (long-lived + unbounded size range) push work
into the TLSF fallback; performance there matches
mimalloc/rpmallocrather than beating them.
| Project | Purpose | SLOC |
|---|---|---|
| fastalloc | SFTA tripartite allocator | ~700 |
| mimalloc | General-purpose SOTA | ~12 700 |
| rpmalloc | Game-focused general | ~8 000 |
- TLSF — Masmano, Ripoll, Crespo, Real, TLSF: A New Dynamic Memory Allocator for Real-Time Systems, ECRTS 2004.
- Static planning — Lamprakos, Xanthopoulos, Katsaragakis, Xydis, Soudris, Catthoor, Futureproof Static Memory Planning, ACM TOPLAS 2025. arXiv:2504.04874.
- snmalloc / BatchIt — Liétar et al., ISMM 2019; Filardo & Parkinson, ISMM 2024. Inform the Tier 2 message-passing design.
- smmalloc — Sergey Makeev, https://github.com/SergeyMakeev/smmalloc.
Apache License 2.0. Copyright 2026 z3rco. See LICENSE and NOTICE. Includes the patent grant — adopters are protected from contributor-filed patents covering the code they use.

