
fastalloc

A specialized game-engine memory allocator. Three tiers. Sub-nanosecond access for lifetime-known resources.



(Diagram: lifetime-annotated transient access.)

Why this exists

General-purpose allocators are tuned for server workloads — unpredictable sizes, arbitrary threading, throughput over latency. Games do not match that shape. fastalloc exploits three properties that generic allocators ignore:

| Property | What it means | fastalloc's answer |
| --- | --- | --- |
| Predictable lifetimes | Frame-graph transient resources live for a handful of passes known at declaration time | Tier 0 — offline-planned offset table |
| Typed allocations | A component pool already knows its own object size | Tier 1 — bucketed proxy, pagemap-owned |
| Producer-consumer threads | Asset streamer allocates, render thread frees | Tier 2 — lock-free MPSC inbox with batching |

Quick start

```cpp
#include "fastalloc.hpp"

fastalloc::FastAlloc alloc;

void* p = alloc.allocate(256);
// ... use p ...
alloc.deallocate(p, 256);   // sized free — fast path
```

For the full tripartite router with planned transients:

```cpp
#include <array>
#include <cstdlib>   // std::malloc
#include "sfta.hpp"

std::array<fastalloc::BufferDecl, 2> decls{{
    {.id = 0, .size = 4096, .align = 256, .birth = 0, .death = 2},
    {.id = 1, .size = 4096, .align = 256, .birth = 3, .death = 5},
}};

fastalloc::StaticPlan plan;
auto footprint = plan.plan(decls);

fastalloc::StaticArena arena(std::malloc(footprint), footprint, plan.placements());
void* rt = arena.get(0);     // one add instruction — no allocator invocation
```
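Note what the plan above buys you: buffer 0 dies at pass 2 and buffer 1 is not born until pass 3, so their lifetimes never overlap and the planner can alias both to the same offset, shrinking the footprint to a single 4096 B slot. A minimal sketch of that overlap test (illustrative only, not the library's planner; `Decl`, `lifetimes_overlap`, and `footprint2` are made-up names):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the declaration fields that matter here.
struct Decl { uint32_t id; size_t size; int birth; int death; };

// Closed intervals [birth, death] overlap iff neither ends before the
// other begins.
inline bool lifetimes_overlap(const Decl& a, const Decl& b) {
    return a.birth <= b.death && b.birth <= a.death;
}

// Footprint for exactly two buffers: they may share one slot when their
// lifetimes are disjoint, otherwise they need the sum of their sizes.
inline size_t footprint2(const Decl& a, const Decl& b) {
    return lifetimes_overlap(a, b) ? a.size + b.size
                                   : (a.size > b.size ? a.size : b.size);
}
```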

Architecture

| Tier | Role | Per-op cost | Backing |
| --- | --- | --- | --- |
| 0 — Static DSA | Lifetime-known resources (frame graph, scene epochs) | ~0 ns (one pointer add) | Offline-planned offset table |
| 1 — Dynamic | Bounded-size typed allocations | ~4–20 ns | Proxy slabs + TLSF fallback |
| 2 — Cross-thread | Producer-consumer asset streaming | lock-free (remote free via CAS) | Proxy + MPSC inbox with BatchIt batching |

The free path routes by virtual-address range: at most two range checks plus a pagemap lookup identify the owning tier. No branching on type, no runtime dispatch cost.
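The range-check routing can be sketched like this; the base addresses, sizes, and names below are invented for illustration and are not fastalloc's actual layout:

```cpp
#include <cassert>
#include <cstdint>

enum class Tier { Static, Dynamic, CrossThread, Foreign };

// Made-up, non-overlapping virtual ranges for the sketch.
constexpr uintptr_t kStaticBase  = 0x1000'0000;
constexpr uintptr_t kStaticSize  = 0x0100'0000;
constexpr uintptr_t kDynamicBase = 0x2000'0000;
constexpr uintptr_t kDynamicSize = 0x2000'0000;

inline Tier route(const void* p) {
    auto a = reinterpret_cast<uintptr_t>(p);
    // Unsigned subtraction folds "base <= a < base + size" into one compare.
    if (a - kStaticBase  < kStaticSize)  return Tier::Static;   // check 1
    if (a - kDynamicBase < kDynamicSize) return Tier::Dynamic;  // check 2
    // A real router would now consult the pagemap to distinguish Tier 2
    // chunks from foreign pointers; elided here.
    return Tier::Foreign;
}
```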

Benchmarks

Measured against five game-focused allocators — mimalloc, rpmalloc, snmalloc, smmalloc, ltalloc — plus glibc malloc as a baseline. 5-iteration median per cell, averaged across 5 runs, core-pinned, -O3 -march=native -flto, unsized-free API across every adapter. Full methodology and all adapters live in the companion repo memalloc-microbenchmarks.

Latency

(Chart: mean latency per workload.)

Relative throughput

(Chart: fastalloc vs next-fastest per workload.)

Hardware: Intel Core i5-7300U @ 2.60 GHz (2C/4T, 3 MiB L3), 16 GiB DDR4. OS: Ubuntu 24.04 LTS, kernel 6.8.0, transparent_hugepage/enabled=[madvise]. Numbers scale down on newer silicon; relative ordering between allocators is stable across hardware in the range we tested.

Scope of the claim

Note

fastalloc leads 9 of the 10 measured workloads. ltalloc takes fixed 64 B churn by a sliver (8.3 ns vs 9.0 ns — within ~1 ns). The measurement is warm steady-state: each allocator is constructed once, warmed up, then driven through the bench. This is the profile an engine actually runs with. Cold-start-heavy workloads that rebuild allocators per frame are outside the scope of these numbers.

Building

```shell
CC=clang CXX=clang++ cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
```

Header-only under src/. Drop the headers into your tree, include one, done.

Tip

Requires a C++20 compiler (tested with clang 18) and CMake 3.25+.

API surface

```cpp
// fastalloc.hpp        — unified Tier 1 entry point
class FastAlloc {
    void* allocate(std::size_t size) noexcept;
    void  deallocate(void* p, std::size_t size) noexcept;   // sized, fast
    void  deallocate(void* p) noexcept;                     // unsized
};

// static_arena.hpp     — static DSA planner + runtime arena
struct BufferDecl { uint32_t id; size_t size; size_t align; int birth; int death; };
class StaticPlan  { size_t plan(std::span<const BufferDecl>); };
class StaticArena { void* get(uint32_t id) const noexcept; };   // one instruction

// cross_thread.hpp     — lock-free producer-consumer
class CrossThreadArena { void* allocate(size_t); void local_deallocate(void*); };
class RemoteFreer      { void defer_free(CrossThreadArena*, void*); };

// sfta.hpp             — full tripartite router
class SFTA { /* resolve_static, allocate_dynamic, allocate_cross, deallocate */ };
```

Design

  • Tier 0 solves Dynamic Storage Allocation offline with a best-fit-by-address heuristic — place largest-first, pick the lowest offset that avoids collision with any simultaneously-live buffer. O(n²) in buffer count; expected to run at level load. Output is a flat id → offset table.
  • Tier 1a (proxy) holds per-size-class free lists as raw Node* arrays packed into one cache line on the hot path. A pagemap over 16 KiB chunks gives O(1) owner identification when the size is not known at free time. A size-dispatch table maps request size to bucket index via one shift.
  • Tier 1b (TLSF) implements the classical Masmano two-level bitmap. Small constants (~200 instructions worst case) and bounded internal fragmentation (2^(1/SLI) − 1).
  • Tier 2 pairs the proxy with a lock-free MPSC inbox per owner. Remote threads push frees via a single CAS on the queue head. Owner drains on its next allocate. A per-target batcher (BatchIt) amortises CAS cost across 4–32 frees.
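The Tier 0 heuristic can be sketched as follows. This is an illustrative reimplementation of the description above (largest-first, lowest offset that avoids every simultaneously-live neighbour), not the library's code; alignment handling is elided and `Buf`, `live_together`, and `plan` are made-up names:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Buf { uint32_t id; size_t size; int birth; int death; size_t offset = 0; };

inline bool live_together(const Buf& a, const Buf& b) {
    return a.birth <= b.death && b.birth <= a.death;
}

// Returns the total footprint; writes each buffer's offset in place.
inline size_t plan(std::vector<Buf>& bufs) {
    // Largest-first placement order.
    std::sort(bufs.begin(), bufs.end(),
              [](const Buf& a, const Buf& b) { return a.size > b.size; });
    size_t footprint = 0;
    for (size_t i = 0; i < bufs.size(); ++i) {
        size_t off = 0;
        bool moved = true;
        while (moved) {               // O(n^2) overall — fine at level load
            moved = false;
            for (size_t j = 0; j < i; ++j) {
                const Buf& p = bufs[j];
                // Collision = overlapping lifetime AND overlapping bytes.
                if (live_together(bufs[i], p) &&
                    off < p.offset + p.size && p.offset < off + bufs[i].size) {
                    off = p.offset + p.size;   // bump past the collision
                    moved = true;
                }
            }
        }
        bufs[i].offset = off;
        footprint = std::max(footprint, off + bufs[i].size);
    }
    return footprint;
}
```

Disjoint-lifetime buffers alias to the same offset, so the footprint can be far smaller than the sum of declared sizes.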

All three tiers live in non-overlapping virtual ranges, so the free path routes with at most two range checks plus one pagemap lookup.
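The Tier 2 remote-free mechanism boils down to an intrusive lock-free list used as an inbox. A minimal single-push sketch, not fastalloc's implementation (which would batch 4–32 frees per CAS via BatchIt; `FreeNode` and `Inbox` are invented names):

```cpp
#include <atomic>
#include <cstddef>

// Freed blocks are reused as list nodes — no extra allocation.
struct FreeNode { FreeNode* next; };

class Inbox {
    std::atomic<FreeNode*> head_{nullptr};
public:
    // Called by any remote thread: one CAS per push.
    void push(FreeNode* n) {
        FreeNode* old = head_.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!head_.compare_exchange_weak(old, n,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }
    // Called by the owner on its next allocate: detach the whole list
    // with a single exchange, then recycle the nodes locally.
    FreeNode* drain() {
        return head_.exchange(nullptr, std::memory_order_acquire);
    }
};
```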

Limitations

Warning

  • Not thread-safe. Tier 1 is single-threaded. Use Tier 2 for cross-thread traffic, or run one instance per thread.
  • Not a C-ABI malloc. allocate / deallocate / reallocate / allocate_aligned are member functions on FastAlloc; there is no LD_PRELOAD-drop-in entry point. Wrap if you need one.
  • Pre-reserves ~768 MiB VMA for the default config (512 MiB proxy + 256 MiB TLSF). Physical RSS grows only on touch. Tunable.
  • Tier 0 requires lifetime declarations. Engines without a frame graph cannot use it; the benefits then fall to Tier 1.
  • Adversarial workloads (long-lived + unbounded size range) push work into the TLSF fallback; performance there matches mimalloc / rpmalloc rather than beating them.

Code size

| Project | Purpose | SLOC |
| --- | --- | --- |
| fastalloc | SFTA tripartite allocator | ~700 |
| mimalloc | General-purpose SOTA | ~12 700 |
| rpmalloc | Game-focused general-purpose | ~8 000 |

References

  1. TLSF — Masmano, Ripoll, Crespo, Real, TLSF: A New Dynamic Memory Allocator for Real-Time Systems, ECRTS 2004.
  2. Static planning — Lamprakos, Xanthopoulos, Katsaragakis, Xydis, Soudris, Catthoor, Futureproof Static Memory Planning, ACM TOPLAS 2025. arXiv:2504.04874.
  3. snmalloc / BatchIt — Liétar et al., ISMM 2019; Filardo & Parkinson, ISMM 2024. Inform the Tier 2 message-passing design.
  4. smmalloc — Sergey Makeev, https://github.com/SergeyMakeev/smmalloc.

License

Apache License 2.0. Copyright 2026 z3rco. See LICENSE and NOTICE. Includes the patent grant — adopters are protected from contributor-filed patents covering the code they use.
