Merge to master for 1.4.2

mjansson · web-flow · commit b55d218cfb05 · 2021-04-25T20:36:29.000+02:00
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
@@ -1,5 +1,5 @@
 # Benchmarks
-Contained in a parallell repository is a benchmark utility that performs interleaved allocations (both aligned to 8 or 16 bytes, and unaligned) and deallocations (both in-thread and cross-thread) in multiple threads. It measures number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The setup of number of thread, cross-thread deallocation rate and allocation size limits is configured by command line arguments.
+Contained in a parallel repository is a benchmark utility that performs interleaved allocations (both aligned to 8 or 16 bytes, and unaligned) and deallocations (both in-thread and cross-thread) in multiple threads. It measures the number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The number of threads, cross-thread deallocation rate and allocation size limits are configured by command line arguments.
 
 https://github.com/mjansson/rpmalloc-benchmark
 
diff --git a/CHANGELOG b/CHANGELOG
@@ -1,3 +1,22 @@
+1.4.2
+
+Fixed an issue where calling _exit might hang the main thread cleanup in rpmalloc if another
+worker thread was terminated while holding exclusive access to the global cache.
+
+Improved caches to prioritize main spans in a chunk to avoid leaving main spans mapped due to
+remaining subspans in caches.
+
+Improved cache reuse by allowing large blocks to use caches from slightly larger cache classes.
+
+Fixed an issue where thread heap statistics would go out of sync when a free span was deferred
+to another thread heap.
+
+API breaking change - added flag to rpmalloc_thread_finalize to avoid releasing thread caches.
+Pass nonzero value to retain old behaviour of releasing thread caches to global cache.
+
+Add option to config to set a custom error callback for assert failures (if ENABLE_ASSERT)
+
+
 1.4.1
 
 Dual license as both released to public domain or under MIT license
diff --git a/README.md b/README.md
@@ -33,7 +33,7 @@ Configuration of the thread and global caches can be important depending on your
 
 # Required functions
 
-Before calling any other function in the API, you __MUST__ call the initization function, either __rpmalloc_initialize__ or __pmalloc_initialize_config__, or you will get undefined behaviour when calling other rpmalloc entry point.
+Before calling any other function in the API, you __MUST__ call the initialization function, either __rpmalloc_initialize__ or __rpmalloc_initialize_config__, or you will get undefined behaviour when calling any other rpmalloc entry point.
 
 Before terminating your use of the allocator, you __SHOULD__ call __rpmalloc_finalize__ in order to release caches and unmap virtual memory, as well as prepare the allocator for global scope cleanup at process exit or dynamic library unload depending on your use case.
 
@@ -104,7 +104,7 @@ The allocator is based on a fixed but configurable page alignment (defaults to 6
 
 Memory blocks are divided into three categories. For 64KiB span size/alignment the small blocks are [16, 1024] bytes, medium blocks (1024, 32256] bytes, and large blocks (32256, 2097120] bytes. The three categories are further divided in size classes. If the span size is changed, the small block classes remain but medium blocks go from (1024, span size] bytes.
 
-Small blocks have a size class granularity of 16 bytes each in 64 buckets. Medium blocks have a granularity of 512 bytes, 61 buckets (default). Large blocks have a the same granularity as the configured span size (default 64KiB). All allocations are fitted to these size class boundaries (an allocation of 36 bytes will allocate a block of 48 bytes). Each small and medium size class has an associated span (meaning a contiguous set of memory pages) configuration describing how many pages the size class will allocate each time the cache is empty and a new allocation is requested.
+Small blocks have a size class granularity of 16 bytes each in 64 buckets. Medium blocks have a granularity of 512 bytes, 61 buckets (default). Large blocks have the same granularity as the configured span size (default 64KiB). All allocations are fitted to these size class boundaries (an allocation of 36 bytes will allocate a block of 48 bytes). Each small and medium size class has an associated span (meaning a contiguous set of memory pages) configuration describing how many pages the size class will allocate each time the cache is empty and a new allocation is requested.
 
 Spans for small and medium blocks are cached in four levels to avoid calls to map/unmap memory pages. The first level is a per thread single active span for each size class. The second level is a per thread list of partially free spans for each size class. The third level is a per thread list of free spans. The fourth level is a global list of free spans.
 
@@ -113,7 +113,7 @@ Each span for a small and medium size class keeps track of how many blocks are a
 Large blocks, or super spans, are cached in two levels. The first level is a per thread list of free super spans. The second level is a global list of free super spans.
 
 # Memory mapping
-By default the allocator uses OS APIs to map virtual memory pages as needed, either `VirtualAlloc` on Windows or `mmap` on POSIX systems. If you want to use your own custom memory mapping provider you can use __rpmalloc_initialize_config__ and pass function pointers to map and unmap virtual memory. These function should reserve and free the requested number of bytes. 
+By default the allocator uses OS APIs to map virtual memory pages as needed, either `VirtualAlloc` on Windows or `mmap` on POSIX systems. If you want to use your own custom memory mapping provider you can use __rpmalloc_initialize_config__ and pass function pointers to map and unmap virtual memory. These functions should reserve and free the requested number of bytes.
 
 The returned memory address from the memory map function MUST be aligned to the memory page size and the memory span size (which ever is larger), both of which is configurable. Either provide the page and span sizes during initialization using __rpmalloc_initialize_config__, or use __rpmalloc_config__ to find the required alignment which is equal to the maximum of page and span size. The span size MUST be a power of two in [4096, 262144] range, and be a multiple or divisor of the memory page size.
 
@@ -128,7 +128,7 @@ Super spans (spans a multiple > 1 of the span size) can be subdivided into small
 
 A span that is a subspan of a larger super span can be individually decommitted to reduce physical memory pressure when the span is evicted from caches and scheduled to be unmapped. The entire original super span will keep track of the subspans it is broken up into, and when the entire range is decommitted tha super span will be unmapped. This allows platforms like Windows that require the entire virtual memory range that was mapped in a call to VirtualAlloc to be unmapped in one call to VirtualFree, while still decommitting individual pages in subspans (if the page size is smaller than the span size).
 
-If you use a custom memory map/unmap function you need to take this into account by looking at the `release` parameter given to the `memory_unmap` function. It is set to 0 for decommitting invididual pages and the total super span byte size for finally releasing the entire super span memory range.
+If you use a custom memory map/unmap function you need to take this into account by looking at the `release` parameter given to the `memory_unmap` function. It is set to 0 for decommitting individual pages and the total super span byte size for finally releasing the entire super span memory range.
 
 # Memory fragmentation
 There is no memory fragmentation by the allocator in the sense that it will not leave unallocated and unusable "holes" in the memory pages by calls to allocate and free blocks of different sizes. This is due to the fact that the memory pages allocated for each size class is split up in perfectly aligned blocks which are not reused for a request of a different size. The block freed by a call to `rpfree` will always be immediately available for an allocation request within the same size class.
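The size class rounding described in the README hunks above (16 byte steps for small blocks, 512 byte steps for medium blocks, default 64KiB span size) can be illustrated with a small arithmetic sketch. This is not code from rpmalloc: the helper name `fitted_block_size` and the constants are assumptions taken only from the numbers quoted in the README, and large blocks are deliberately left out.

```python
# Illustrative sketch only; names are hypothetical, values come from the README text.
SMALL_GRANULARITY = 16     # small size classes step by 16 bytes (64 buckets)
SMALL_LIMIT = 1024         # upper bound of the small block range
MEDIUM_GRANULARITY = 512   # medium size classes step by 512 bytes (61 buckets)
MEDIUM_LIMIT = 32256       # upper bound of the medium range for 64KiB spans


def fitted_block_size(size):
    """Round a requested size up to its size class boundary."""
    if size < 1:
        raise ValueError("size must be positive")
    if size <= SMALL_LIMIT:
        step = SMALL_GRANULARITY
    elif size <= MEDIUM_LIMIT:
        step = MEDIUM_GRANULARITY
    else:
        raise ValueError("large blocks are not modelled in this sketch")
    # Ceiling division, then scale back up to a multiple of the step.
    return -(-size // step) * step


print(fitted_block_size(36))    # the README example: 36 bytes fits a 48 byte block
print(fitted_block_size(1025))  # first medium class above the small range
```

The bucket counts are consistent with the limits: 64 steps of 16 bytes reach 1024, and 61 further steps of 512 bytes reach 32256.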
diff --git a/build/ninja/clang.py b/build/ninja/clang.py
@@ -38,7 +38,7 @@ def initialize(self, project, archs, configs, includepaths, dependlibs, libpaths
     self.cxxcmd = '$toolchain$cxx -MMD -MT $out -MF $out.d $includepaths $moreincludepaths $cxxflags $carchflags $cconfigflags $cmoreflags $cxxenvflags -c $in -o $out'
     self.ccdeps = 'gcc'
     self.ccdepfile = '$out.d'
-    self.arcmd = self.rmcmd('$out') + ' && $toolchain$ar crsD $ararchflags $arflags $arenvflags $out $in'
+    self.arcmd = self.rmcmd('$out') + ' && $toolchain$ar crs $ararchflags $arflags $arenvflags $out $in'
     if self.target.is_windows():
       self.linkcmd = '$toolchain$link $libpaths $configlibpaths $linkflags $linkarchflags $linkconfigflags $linkenvflags /debug /nologo /subsystem:console /dynamicbase /nxcompat /manifest /manifestuac:\"level=\'asInvoker\' uiAccess=\'false\'\" /tlbid:1 /pdb:$pdbpath /out:$out $in $libs $archlibs $oslibs $frameworks'
       self.dllcmd = self.linkcmd + ' /dll'
@@ -52,7 +52,7 @@ def initialize(self, project, archs, configs, includepaths, dependlibs, libpaths
                    '-fno-trapping-math', '-ffast-math']
     self.cwarnflags = ['-W', '-Werror', '-pedantic', '-Wall', '-Weverything',
                        '-Wno-c++98-compat', '-Wno-padded', '-Wno-documentation-unknown-command',
-                       '-Wno-implicit-fallthrough', '-Wno-static-in-inline', '-Wno-reserved-id-macro']
+                       '-Wno-implicit-fallthrough', '-Wno-static-in-inline', '-Wno-reserved-id-macro', '-Wno-disabled-macro-expansion']
     self.cmoreflags = []
     self.mflags = []
     self.arflags = []
@@ -76,8 +76,14 @@ def initialize(self, project, archs, configs, includepaths, dependlibs, libpaths
       self.oslibs += ['m']
     if self.target.is_linux() or self.target.is_raspberrypi():
       self.oslibs += ['dl']
+    if self.target.is_raspberrypi():
+      self.linkflags += ['-latomic']
     if self.target.is_bsd():
       self.oslibs += ['execinfo']
+    if self.target.is_haiku():
+      self.cflags += ['-D_GNU_SOURCE=1']
+      self.linkflags += ['-lpthread']
+      self.oslibs += ['m']
     if not self.target.is_windows():
       self.linkflags += ['-fomit-frame-pointer']
 
@@ -391,7 +397,7 @@ def make_linkconfigflags(self, config, targettype, variables):
       if targettype == 'sharedlib':
         flags += ['-shared', '-fPIC']
     if config != 'debug':
-      if targettype == 'bin' or targettype == 'sharedlib':
+      if (targettype == 'bin' or targettype == 'sharedlib') and self.use_lto():
         flags += ['-flto']
     return flags
 
diff --git a/build/ninja/gcc.py b/build/ninja/gcc.py
@@ -24,7 +24,7 @@ def initialize(self, project, archs, configs, includepaths, dependlibs, libpaths
     self.cxxcmd = '$toolchain$cxx -MMD -MT $out -MF $out.d $includepaths $moreincludepaths $cxxflags $carchflags $cconfigflags $cmoreflags $cxxenvflags -c $in -o $out'
     self.ccdeps = 'gcc'
     self.ccdepfile = '$out.d'
-    self.arcmd = self.rmcmd('$out') + ' && $toolchain$ar crsD $ararchflags $arflags $arenvflags $out $in'
+    self.arcmd = self.rmcmd('$out') + ' && $toolchain$ar crs $ararchflags $arflags $arenvflags $out $in'
     self.linkcmd = '$toolchain$link $libpaths $configlibpaths $linkflags $linkarchflags $linkconfigflags $linkenvflags -o $out $in $libs $archlibs $oslibs'
 
     #Base flags
@@ -54,8 +54,13 @@ def initialize(self, project, archs, configs, includepaths, dependlibs, libpaths
       self.linkflags += ['-pthread']
     if self.target.is_linux() or self.target.is_raspberrypi():
       self.oslibs += ['dl']
+    if self.target.is_raspberrypi():
+      self.linkflags += ['-latomic']
     if self.target.is_bsd():
       self.oslibs += ['execinfo']
+    if self.target.is_haiku():
+      self.cflags += ['-D_GNU_SOURCE=1']
+      self.linkflags += ['-lpthread']
 
     self.includepaths = self.prefix_includepaths((includepaths or []) + ['.'])
 
diff --git a/build/ninja/generator.py b/build/ninja/generator.py
@@ -49,6 +49,9 @@ def __init__(self, project, includepaths = [], dependlibs = [], libpaths = [], v
     parser.add_argument('--updatebuild', action='store_true',
                         help = 'Update submodule build scripts',
                         default = '')
+    parser.add_argument('--lto', action='store_true',
+                        help = 'Build with Link Time Optimization',
+                        default = False)
     options = parser.parse_args()
 
     self.project = project
@@ -91,6 +94,8 @@ def __init__(self, project, includepaths = [], dependlibs = [], libpaths = [], v
       variables['monolithic'] = True
     if options.coverage:
       variables['coverage'] = True
+    if options.lto:
+      variables['lto'] = True
     if self.subninja != '':
       variables['internal_deps'] = True
 
diff --git a/build/ninja/platform.py b/build/ninja/platform.py
@@ -5,7 +5,7 @@
 import sys
 
 def supported_platforms():
-  return [ 'windows', 'linux', 'macos', 'bsd', 'ios', 'android', 'raspberrypi', 'tizen', 'sunos' ]
+  return [ 'windows', 'linux', 'macos', 'bsd', 'ios', 'android', 'raspberrypi', 'tizen', 'sunos', 'haiku' ]
 
 class Platform(object):
   def __init__(self, platform):
@@ -20,7 +20,7 @@ def __init__(self, platform):
       self.platform = 'macos'
     elif self.platform.startswith('win'):
       self.platform = 'windows'
-    elif 'bsd' in self.platform:
+    elif 'bsd' in self.platform or self.platform.startswith('dragonfly'):
       self.platform = 'bsd'
     elif self.platform.startswith('ios'):
       self.platform = 'ios'
@@ -32,6 +32,8 @@ def __init__(self, platform):
       self.platform = 'tizen'
     elif self.platform.startswith('sunos'):
       self.platform = 'sunos'
+    elif self.platform.startswith('haiku'):
+      self.platform = 'haiku'
 
   def platform(self):
     return self.platform
@@ -63,5 +65,8 @@ def is_tizen(self):
   def is_sunos(self):
     return self.platform == 'sunos'
 
+  def is_haiku(self):
+    return self.platform == 'haiku'
+
   def get(self):
     return self.platform
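The effect of the platform.py hunks above can be sketched as a standalone function. This is a simplified assumption, not the actual `Platform.__init__`: the name `normalize_platform` is hypothetical, only the branches visible in this diff are reproduced, and the input strings are illustrative.

```python
def normalize_platform(name):
    """Map a raw platform string to a build platform name (hypothetical helper)."""
    if name.startswith('win'):
        return 'windows'
    # DragonFly does not contain 'bsd' in its name, so it needs its own check.
    if 'bsd' in name or name.startswith('dragonfly'):
        return 'bsd'
    if name.startswith('sunos'):
        return 'sunos'
    if name.startswith('haiku'):
        return 'haiku'
    # Branches not visible in the diff are omitted; pass the name through.
    return name


for raw in ('dragonfly6', 'freebsd13', 'haiku1'):
    print(raw, '->', normalize_platform(raw))
```

With this change a DragonFly host is built with the same settings as the other BSD targets, while Haiku gets its own `is_haiku()` predicate for the toolchain scripts.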
diff --git a/build/ninja/toolchain.py b/build/ninja/toolchain.py
@@ -54,6 +54,7 @@ def __init__(self, host, target, toolchain):
     #Set default values
     self.build_monolithic = False
     self.build_coverage = False
+    self.build_lto = False
     self.support_lua = False
     self.internal_deps = False
     self.python = 'python'
@@ -132,7 +133,7 @@ def initialize_archs(self, archs):
   def initialize_default_archs(self):
     if self.target.is_windows():
       self.archs = ['x86-64']
-    elif self.target.is_linux() or self.target.is_bsd() or self.target.is_sunos():
+    elif self.target.is_linux() or self.target.is_bsd() or self.target.is_sunos() or self.target.is_haiku():
       localarch = subprocess.check_output(['uname', '-m']).decode().strip()
       if localarch == 'x86_64' or localarch == 'amd64':
         self.archs = ['x86-64']
@@ -208,6 +209,8 @@ def parse_default_variables(self, variables):
         self.build_monolithic = get_boolean_flag(val)
       elif key == 'coverage':
         self.build_coverage = get_boolean_flag(val)
+      elif key == 'lto':
+        self.build_lto = get_boolean_flag(val)
       elif key == 'support_lua':
         self.support_lua = get_boolean_flag(val)
       elif key == 'internal_deps':
@@ -234,6 +237,8 @@ def parse_prefs(self, prefs):
       self.build_monolithic = get_boolean_flag(prefs['monolithic'])
     if 'coverage' in prefs:
       self.build_coverage = get_boolean_flag( prefs['coverage'] )
+    if 'lto' in prefs:
+      self.build_lto = get_boolean_flag( prefs['lto'] )
     if 'support_lua' in prefs:
       self.support_lua = get_boolean_flag(prefs['support_lua'])
     if 'python' in prefs:
@@ -258,6 +263,9 @@ def is_monolithic(self):
   def use_coverage(self):
     return self.build_coverage
 
+  def use_lto(self):
+    return self.build_lto
+
   def write_variables(self, writer):
     writer.variable('buildpath', self.buildpath)
     writer.variable('target', self.target.platform)
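Taken together, the generator.py, toolchain.py and clang.py hunks make `-flto` opt-in through a new `--lto` option. The flow can be sketched in isolation as below. This is a condensed assumption of how the pieces connect, not the build scripts themselves: `parse_lto` is a hypothetical helper, and the toolchain object is replaced by a plain boolean argument.

```python
import argparse


def parse_lto(argv):
    """Parse only the --lto switch, mirroring the option added in generator.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--lto', action='store_true',
                        help='Build with Link Time Optimization',
                        default=False)
    return parser.parse_args(argv).lto


def make_linkconfigflags(config, targettype, use_lto):
    """Sketch of the gated -flto logic from clang.py (simplified signature)."""
    flags = []
    if config != 'debug':
        if (targettype == 'bin' or targettype == 'sharedlib') and use_lto:
            flags += ['-flto']
    return flags


print(make_linkconfigflags('release', 'bin', parse_lto(['--lto'])))
print(make_linkconfigflags('release', 'bin', parse_lto([])))
```

Before this change, release builds of binaries and shared libraries always received `-flto`; after it, the flag is added only when LTO is requested, either on the command line or through the `lto` entry in the build preferences.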
diff --git a/rpmalloc/malloc.c b/rpmalloc/malloc.c
@@ -292,26 +292,55 @@ DllMain(HINSTANCE instance, DWORD reason, LPVOID reserved) {
 	else if (reason == DLL_THREAD_ATTACH)
 		rpmalloc_thread_initialize();
 	else if (reason == DLL_THREAD_DETACH)
-		rpmalloc_thread_finalize();
+		rpmalloc_thread_finalize(1);
 	return TRUE;
 }
 
+//end BUILD_DYNAMIC_LINK
+#else
+
+extern void
+_global_rpmalloc_init(void) {
+	rpmalloc_set_main_thread();
+	rpmalloc_initialize();
+}
+
+#if defined(__clang__) || defined(__GNUC__)
+
+static void __attribute__((constructor))
+initializer(void) {
+	_global_rpmalloc_init();
+}
+
+#elif defined(_MSC_VER)
+
+#pragma section(".CRT$XIB",read)
+__declspec(allocate(".CRT$XIB")) void (*_rpmalloc_module_init)(void) = _global_rpmalloc_init;
+#pragma comment(linker, "/include:_rpmalloc_module_init")
+
 #endif
 
+//end !BUILD_DYNAMIC_LINK
+#endif
+
 #else
 
 #include <pthread.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <unistd.h>
 
+extern void
+rpmalloc_set_main_thread(void);
+
 static pthread_key_t destructor_key;
 
 static void
 thread_destructor(void*);
 
 static void __attribute__((constructor))
 initializer(void) {
+	rpmalloc_set_main_thread();
 	rpmalloc_initialize();
 	pthread_key_create(&destructor_key, thread_destructor);
 }
@@ -340,7 +369,7 @@ thread_starter(void* argptr) {
 static void
 thread_destructor(void* value) {
 	(void)sizeof(value);
-	rpmalloc_thread_finalize();
+	rpmalloc_thread_finalize(1);
 }
 
 #ifdef __APPLE__
@@ -368,7 +397,8 @@ pthread_create(pthread_t* thread,
                const pthread_attr_t* attr,
                void* (*start_routine)(void*),
                void* arg) {
-#if defined(__linux__) || defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__APPLE__) || defined(__HAIKU__)
+#if defined(__linux__) || defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) || defined(__DragonFly__) || \
+    defined(__APPLE__) || defined(__HAIKU__)
 	char fname[] = "pthread_create";
 #else
 	char fname[] = "_pthread_create";
diff --git a/rpmalloc/rpmalloc.c b/rpmalloc/rpmalloc.c
diff --git a/rpmalloc/rpmalloc.h b/rpmalloc/rpmalloc.h
diff --git a/test/main.c b/test/main.c
diff --git a/test/thread.c b/test/thread.c