Skip to content

Commit 82fbcc0

Browse files
committed
Merge branch 'release/1.2.2'
2 parents 0f87f54 + d69fb71 commit 82fbcc0

25 files changed

+1555
-481
lines changed

.gitattributes

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@
2121
*.rtf diff=astextplain
2222
*.RTF diff=astextplain
2323

24-
# Ignore build scripts in language stats
24+
# Ignore build scripts and test cases in language stats
2525
build/* linguist-vendored
2626
configure.py linguist-vendored=true
27+
test/* linguist-vendored

BENCHMARKS.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -5,9 +5,9 @@ https://github.com/rampantpixels/rpmalloc-benchmark
55

66
Benchmarks are run with parameters `benchmark <num threads> <mode> <distribution> <cross-thread rate> <loop count> <block count> <op count> <min size> <max size>`. It runs the given number of threads allocating randomly or fixed sized blocks in `[<min size>, <max size>]` bytes range. The `<mode>` parameter controls if it is random or fixed size. If random size, the `<distribution>` parameter controls if sizes are evenly distributed, have a linear falloff rate with size or an exponential falloff rate with size. In each thread, `<loop count>` number of loops are performed, allocating up to `<block count>` blocks in each thread. Every loop iteration `<op count>` number of blocks are deallocated, scattered across the entire set of blocks, then another set of `<op count>` number of blocks are allocated, also scattered across the entire set of slots in the block array (also deallocating any previous block in that slot). Every `<cross-thread rate>` loop iteration an `<op count>` number of blocks are allocated and handed off to another thread for cross-thread deallocation (memory allocated in one thread is freed in another thread).
77

8-
The benchmark also measures the maximum requested allocated size and the used virtual memory by the process to calculate a overhead percentage. This is done at the end of the loop iterations, once all cross-thread deallocations are processed. This will naturally introduce overhead in allocator implementations that have some form of caching of free blocks, which is intented.
8+
The benchmark also measures the maximum requested allocated size and the used virtual memory by the process to calculate a overhead percentage. This is done at the end of the loop iterations, once all cross-thread deallocations are processed. This will naturally introduce overhead in allocator implementations that have some form of caching of free blocks, which is intended.
99

10-
The loop sequence is run four times in all threads, first two times in each thread with a full free of all blocks inbetween. The threads are then terminated and restarted, running another set of similar two loop sequences. The idea is to measure both how well allocators perform at starting up fresh threads, as well as reusing memory within a thread and when warm starting another thread.
10+
The loop sequence is run four times in all threads, first two times in each thread with a full free of all blocks in between. The threads are then terminated and restarted, running another set of similar two loop sequences. The idea is to measure both how well allocators perform at starting up fresh threads, as well as reusing memory within a thread and when warm starting another thread.
1111

1212
Below is a collection of benchmark results for various allocation sizes. The machines running the Windows 10 and Linux benchmarks have 8 cores (4 physical cores with HT) and 12GiB RAM. The macOS machine is a MacBook, 2 cores (1 physical with HT) and 8GiB RAM. (Windows and macOS results coming soon)
1313

@@ -30,7 +30,7 @@ Linear falloff distributed sizes in `[16, 8000]` range, 20000 loops with 50000 b
3030
![Ubuntu 16.10 random [16, 8000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=301017877&format=image)
3131
![Ubuntu 16.10 random [16, 8000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=1224595675&format=image)
3232

33-
Ignoring the lockfree-malloc and jemalloc memory overhead range and focusing on the other allocators we can see they are pretty close in memory overhead factors, with most of the multihreaded cases hovering around the 10-15% overhead mark.
33+
Ignoring the lockfree-malloc and jemalloc memory overhead range and focusing on the other allocators we can see they are pretty close in memory overhead factors, with most of the multithreaded cases hovering around the 10-15% overhead mark.
3434

3535
![Ubuntu 16.10 random [16, 8000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=812830245&format=image)
3636

CACHE.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
# Thread caches
22
rpmalloc has a thread cache of free memory blocks which can be used in allocations without interfering with other threads or going to system to map more memory, as well as a global cache shared by all threads to let pages flow between threads. Configuring the size of these caches can be crucial to obtaining good performance while minimizing memory overhead blowup. Below is a simple case study using the benchmark tool to compare different thread cache configurations for rpmalloc.
33

4-
The rpmalloc thread cache is configured to be unlimited, performance oriented as meaning default values, size oriennted where both thread cache and global cache is reduced significantly, or disabled where both thread and global caches are disabled and completely free pages are directly unmapped.
4+
The rpmalloc thread cache is configured to be unlimited, performance oriented as meaning default values, size oriented where both thread cache and global cache is reduced significantly, or disabled where both thread and global caches are disabled and completely free pages are directly unmapped.
55

66
The benchmark is configured to run threads allocating 150000 blocks distributed in the `[16, 16000]` bytes range with a linear falloff probability. It runs 1000 loops, and every iteration 75000 blocks (50%) are freed and allocated in a scattered pattern. There are no cross thread allocations/deallocations. Parameters: `benchmark n 0 0 0 1000 150000 75000 16 16000`. The benchmarks are run on an Ubuntu 16.10 machine with 8 cores (4 physical, HT) and 12GiB RAM.
77

CHANGELOG

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,18 @@
1+
1.2.2
2+
3+
Add configurable memory mapper providing map/unmap of memory pages. Default to VirtualAlloc/mmap if none provided. This allows rpmalloc to be used in contexts where memory is provided by internal means.
4+
5+
Avoid using explicit memory map addresses to mmap on POSIX systems. Instead use overallocation of virtual memory space to gain 64KiB alignment of spans. Since extra pages are never touched this should have no impact on real memory usage and remove the possibility of contention in virtual address space with other uses of mmap.
6+
7+
Detect system memory page size at initialization, and allow page size to be set explicitly in initialization. This allows the allocator to be used as a sub-allocator where the page granularity should be lower to reduce risk of wasting unused memory ranges, and adds support for modern iOS devices where page size is 16KiB.
8+
9+
Add build time option to use memory guards, surrounding each allocated block with a dead zone which is checked for consistency when block is freed.
10+
11+
Always finalize thread on allocator finalization, fixing issue when re-initializing allocator in the same thread.
12+
13+
Add basic allocator test cases
14+
15+
116
1.2.1
217

318
Split library into rpmalloc only base library and preloadable malloc wrapper library.
@@ -10,6 +25,7 @@ Improve preload compatibility on Apple platforms by using pthread key for TLS in
1025

1126
Fix ABA issue in orphaned heap linked list
1227

28+
1329
1.2
1430

1531
Dual license under MIT

README.md

Lines changed: 21 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ Platforms currently supported:
99
- Linux
1010
- Android
1111

12-
The code should be easily portable to any platform with atomic operations and an mmap-style virtual memory management API.
12+
The code should be easily portable to any platform with atomic operations and an mmap-style virtual memory management API. The API used to map/unmap memory pages can be configured in runtime to a custom implementation.
1313

1414
This library is put in the public domain; you can redistribute it and/or modify it without any restrictions. Or, if you choose, you can use it under the MIT license.
1515

@@ -18,7 +18,7 @@ Please consider our Patreon to support our work - https://www.patreon.com/rampan
1818
Created by Mattias Jansson ([@maniccoder](https://twitter.com/maniccoder)) / Rampant Pixels - http://www.rampantpixels.com
1919

2020
# Performance
21-
We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard, ptmalloc3 and others without causing extra allocated memory overhead in the thread caches. We also believe the implementation to be easier to read and modify compared to these allocators, as it is a single source file of ~1700 lines of C code.
21+
We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard, ptmalloc3 and others without causing extra allocated memory overhead in the thread caches. We also believe the implementation to be easier to read and modify compared to these allocators, as it is a single source file of ~1800 lines of C code.
2222

2323
Contained in a parallel repository is a benchmark utility that performs interleaved allocations (both aligned to 8 or 16 bytes, and unaligned) and deallocations (both in-thread and cross-thread) in multiple threads. It measures number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The setup of number of thread, cross-thread deallocation rate and allocation size limits is configured by command line arguments.
2424

@@ -37,6 +37,8 @@ The easiest way to use the library is simply adding rpmalloc.[h|c] to your proje
3737

3838
__rpmalloc_initialize__ : Call at process start to initialize the allocator
3939

40+
__rpmalloc_initialize_config__ : Optional entry point to call at process start to initialize the allocator with a custom memory mapping backend and/or memory page size
41+
4042
__rpmalloc_finalize__: Call at process exit to finalize the allocator
4143

4244
__rpmalloc_thread_initialize__: Call at each thread start to initialize the thread local data for the allocator
@@ -70,11 +72,13 @@ Integer safety checks on all calls are enabled if __ENABLE_VALIDATE_ARGS__ is de
7072

7173
Asserts are enabled if __ENABLE_ASSERTS__ is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in `rpmalloc.c`.
7274

75+
Overwrite and underwrite guards are enabled if __ENABLE_GUARDS__ is defined to 1 (default is 0, or disabled), either on compile command line or by settings the value in `rpmalloc.c`. This will introduce up to 32 byte overhead on each allocation to store magic numbers, which will be verified when freeing the memory block. The actual overhead is dependent on the requested size compared to size class limits.
76+
7377
# Quick overview
7478
The allocator is similar in spirit to tcmalloc from the [Google Performance Toolkit](https://github.com/gperftools/gperftools). It uses separate heaps for each thread and partitions memory blocks according to a preconfigured set of size classes, up to 2MiB. Larger blocks are mapped and unmapped directly. Allocations for different size classes will be served from different set of memory pages, each "span" of pages is dedicated to one size class. Spans of pages can flow between threads when the thread cache overflows and are released to a global cache, or when the thread ends. Unlike tcmalloc, single blocks do not flow between threads, only entire spans of pages.
7579

7680
# Implementation details
77-
The allocator is based on 64KiB page alignment, where all runs of memory pages are mapped to 64KiB boundaries. On Windows this is automatically guaranteed by the VirtualAlloc granularity, and on mmap systems it is achieved by atomically incrementing the address where pages are mapped to. By aligning to 64KiB boundaries the free operation can locate the header of the memory block without having to do a table lookup (as tcmalloc does) by simply masking out the low 16 bits of the address.
81+
The allocator is based on 64KiB page alignment and 16 byte block alignment, where all runs of memory pages are mapped to 64KiB boundaries. On Windows this is automatically guaranteed by the VirtualAlloc granularity, and on mmap systems it is achieved by atomically incrementing the address where pages are mapped to. By aligning to 64KiB boundaries the free operation can locate the header of the memory block without having to do a table lookup (as tcmalloc does) by simply masking out the low 16 bits of the address.
7882

7983
Memory blocks are divided into three categories. Small blocks are [16, 2032] bytes, medium blocks (2032, 32720] bytes, and large blocks (32720, 2097120] bytes. The three categories are further divided in size classes.
8084

@@ -86,6 +90,18 @@ Each span for a small and medium size class keeps track of how many blocks are a
8690

8791
Large blocks, or super spans, are cached in two levels. The first level is a per thread list of free super spans. The second level is a global list of free super spans.
8892

93+
# Memory mapping
94+
By default the allocator uses OS APIs to map virtual memory pages as needed, either `VirtualAlloc` on Windows or `mmap` on POSIX systems. If you want to use your own custom memory mapping provider you can use __rpmalloc_initialize_config__ and pass function pointers to map and unmap virtual memory. These function should reserve and free the requested number of bytes.
95+
96+
The functions do not need to deal with alignment, this is done by rpmalloc internally. However, ideally the map function should return pages aligned to 64KiB boundaries in order to avoid extra mapping requests (see caveats section below). What will happen is that if the first map call returns an address that is not 64KiB aligned, rpmalloc will immediately unmap that block and call a new mapping request with the size increased by 64KiB, then perform the alignment internally. To avoid this double mapping, always return blocks aligned to 64KiB.
97+
98+
Memory mapping requests are always done in multiples of the memory page size. You can specify a custom page size when initializing rpmalloc with __rpmalloc_initialize_config__, or pass 0 to let rpmalloc determine the system memory page size using OS APIs. The page size MUST be a power of two in [512, 16384] range.
99+
100+
# Memory guards
101+
If you define the __ENABLE_GUARDS__ to 1, all memory allocations will be padded with extra guard areas before and after the memory block (while still honoring the requested alignment). These dead zones will be filled with a pattern and checked when the block is freed. If the patterns are not intact the callback set in initialization config is called, or if not set an assert is fired.
102+
103+
Note that the end of the memory block in this case is defined by the total usable size of the block as returned by `rpmalloc_usable_size`, which can be larger than the size passed to allocation request due to size class buckets.
104+
89105
# Memory fragmentation
90106
There is no memory fragmentation by the allocator in the sense that it will not leave unallocated and unusable "holes" in the memory pages by calls to allocate and free blocks of different sizes. This is due to the fact that the memory pages allocated for each size class is split up in perfectly aligned blocks which are not reused for a request of a different size. The block freed by a call to `rpfree` will always be immediately available for an allocation request within the same size class.
91107

@@ -106,12 +122,12 @@ Since each thread cache maps spans of memory pages per size class, a thread that
106122

107123
An application that has a producer-consumer scheme between threads where one thread performs all allocations and another frees all memory will have a sub-optimal performance due to blocks crossing thread boundaries will be freed in a two step process - first deferred to the allocating thread, then freed when that thread has need for more memory pages for the requested size. However, depending on the use case the performance overhead might be small.
108124

109-
Threads that perform a lot of allocations and deallocations in a pattern that have a large difference in high and low water marks, and that difference is larger than the thread cache size, will put a lot of contention on the global cache. What will happen is the thread cache will overflow on each low water mark causing pages to be released to the global cache, then underflow on high water mark causing pages to be re-aqcuired from the global cache.
125+
Threads that perform a lot of allocations and deallocations in a pattern that have a large difference in high and low water marks, and that difference is larger than the thread cache size, will put a lot of contention on the global cache. What will happen is the thread cache will overflow on each low water mark causing pages to be released to the global cache, then underflow on high water mark causing pages to be re-acquired from the global cache.
110126

111127
# Caveats
112128
Cross-thread deallocations are more costly than in-thread deallocations, since the spans are completely owned by the allocating thread. The free operation will be deferred using an atomic list operation and the actual free operation will be performed when the owner thread requires a new block of the corresponding size class.
113129

114-
VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, and the implementation instead keeps track and atomically increases a running address counter of where memory pages should be mapped to in the virtual address range. If some other code in the process uses mmap to reserve a part of virtual memory spance this counter needs to catch up and resync in order to keep the 64KiB granularity of span addresses, which could potentially be time consuming.
130+
VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, and the implementation instead oversizes the memory mapping with 64KiB to be able to always return a memory area with this alignment. Since the extra memory pages are never touched this will not result in extra committed physical memory pages, but rather only increase virtual memory address space.
115131

116132
The free, realloc and usable size functions all require the passed pointer to be within the first 64KiB page block of the start of the memory block. You cannot pass in any pointer from the memory block address range.
117133

build/ninja/android.py

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,6 @@
33
"""Ninja toolchain abstraction for Android platform"""
44

55
import os
6-
import urlparse
76
import subprocess
87

98
import toolchain
@@ -87,7 +86,7 @@ def initialize_toolchain(self):
8786
self.hostarchname = 'linux-x86_64'
8887
else:
8988
self.hostarchname = 'linux-x86'
90-
elif self.host.is_macosx():
89+
elif self.host.is_macos():
9190
self.hostarchname = 'darwin-x86_64'
9291

9392
def build_toolchain(self):

0 commit comments

Comments
 (0)