- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 2.3k
Thread-local arenas #8692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thread-local arenas #8692
Conversation
| @kddnewton Can we say "environment" instead of "arena" here? Otherwise, thank you for the PR! Oh, or alternatively please explain what "arena" is in this context, haven't heard that one before. At a glance, it looks like either "arena" is another word for "project" or it's an imaging term I'm not familiar with 😄 | 
| @aclark4life No problem! Arenas in this context are memory arenas, which are already in use inside Pillow. The general idea is that they represent large contiguous blocks of memory, that you can then go and manually allocate memory from but avoid the cost of calling malloc/free. Below is a super simplified example: extern struct my_struct;
int main() {
  struct my_struct s1 = malloc(sizeof(struct my_struct));
  struct my_struct s2 = malloc(sizeof(struct my_struct));
  struct my_struct s3 = malloc(sizeof(struct my_struct));
  /* do something */
  free(s1);
  free(s2);
  free(s3);
  return EXIT_SUCCESS;
}In this example we manually allocate memory for all three structs, and then manually free them. This can cause heap fragmentation and results in a lot of sys calls. Instead: extern struct my_struct;
struct my_arena {
  uint8_t *memory;
  size_t size;
};
void* my_malloc(struct my_arena *arena, size_t size) {
  void* result = arena->memory + arena->size + size;
  arena->size += size;
  return result;
}
int main() {
  struct my_arena arena = { .memory = malloc(1024), .size = 0 };
  struct my_struct s1 = my_malloc(&arena, sizeof(struct my_struct));
  struct my_struct s2 = my_malloc(&arena, sizeof(struct my_struct));
  struct my_struct s3 = my_malloc(&arena, sizeof(struct my_struct));
  /* do something */
  free(arena.memory);
  return EXIT_SUCCESS;
}In this example we make a single memory allocation and then a single free, which means all of the memory is contiguous (helping with locality) and only 2 sys calls are made (more efficient). I'm omitting a couple of details here about bookkeeping, but that's the general gist. This is already in place in Pillow. There is a single global arena that is used for all memory allocations. This is great, and helps a lot in terms of performance. However the downside is that in a multi-threaded environment, the mutex that wraps access to the arena (be it the GIL or a Python mutex in free-threaded Python) falls under a lot of contention because everyone is trying to use the same arena. This commit instead makes a separate arena for each thread, so that each thread manages its own memory. This means it never falls under contention, and you can see in the benchmarks that it drastically speeds up free-threaded Python because it never has to lock anything. I hope I explained that sufficiently, let me know if there's anything I can clear up! | 
| cc @lysnikolaou who's been helping with the free-threaded work. | 
e799ace    to
    8445d50      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @kddnewton! Only suggested a change to setup.py, so that it's a bit clearer.
350283e    to
    e76b4d4      
    Compare
  
    51a476e    to
    20f2f4c      
    Compare
  
    20f2f4c    to
    f751960      
    Compare
  
    | @lysnikolaou are those test failures related to my changes? Doesn't seem like it but since setup.py infects everything I'm not so sure. | 
| 
 Just merged, please update this PR from  | 
f751960    to
    fa6a6b0      
    Compare
  
    Currently, all threads use the same arena for imaging. This can result in a lot of contention when there are enough workers and the mutex is constantly being checked. This commit instead introduces lockless thread-local arenas for environments that support it.
fa6a6b0    to
    cfb2dcd      
    Compare
  
    | @hugovk done! | 
| The ImagingMemoryArena is an implicit default for the image -- it's not recorded anywhere that I see. What happens if an image is passed from thread to thread? This is the image struct: Pillow/src/libImaging/Imaging.h Line 80 in cfb2dcd 
 And this is where the memory is released back into the pool: Pillow/src/libImaging/Storage.c Line 368 in cfb2dcd 
 | 
| @wiredfool do you have an example of passing it from thread to thread? I'm not sure if I know how that would happen. | 
| (sorry, managed to edit rather than comment) I've done it in the past where I had an app where all of the processing was offloaded to worker threads, using queuing. Scanner -> initial processing -> thumbnailing -> uploading were all done off the main thread. Anything were you're doing something with a UI main thread and processing elsewhere -- there are a bunch of operations that will create a new image. If you then have a reference on the main thread you won't be able to release it. I'm also thinking that it's going to interfere with the lifetimes for arrow support, because that could potentially be freed from a thread that's not even part of our process. Actually -- is memory in thread local storage actually available outside of the thread? | 
| The honest answer is I'm not sure. I think we should test this out. Just so that I can properly replicate what you're saying, are you describing: create images in parent thread, child threads pick them off queue and process them, child threads exit, parent thread resumes? As for TLS being visible outside of the thread, I think the answer is it depends on the implementation. Linux has actual instructions for TLS, whereas macOS implements it as a library from what I understand. I imagine this would impact the answer. | 
| I was going to raise "does this help #1888?" so I'm curious to know the answer too … thanks all! | 
| I think something like: 
 would probably be enough to do it. Actually, looking at gcc's tls: 
 That concerns me on a couple of fronts -- The image memory is probably accessible outside the thread, since it's a malloc, and it's just the original struct that's going to be in the TLS storage. However, if we have a oneshot thread that passes the image off, the arena will be deallocated before the image is freed. So not only will the mutex and the arena likely be wrong in the child, there's going to be a pretty good memory leaks there because we won't necessarily get a chance to clean up the malloc'd items. | 
| @wiredfool Okay that helps a lot. I'll put together some example code and see what I can see. Maybe we'll need to add some logic on moving between threads to ensure everything is working properly. In the meantime let's put a pause on this PR until I can answer your questions. | 
| Ok, Some thoughts here -- 
 | 
| The original version of this PR I saw used pthread and windows key/value TLS. While it would require a pthreads dependency on Unix-like platforms, I wonder if that would avoid the problems with the  Thanks for pointing out that these arenas can be shared between threads! An important point. The uses of  | 
| Had some back of the mind thinking about this -- I think under TLS the mutex is essentially a no-op, because it's never going to have contention. How could it when the only things that lock it are on the same thread. So while it's not (IMO) a valid approach to speeding this up due to object lifetime, I think it's a good indication of the speedup that could happen from eliminating the bottleneck. If there were a static array of N memory arenas, and each thread took the next one in order, eg, (m++ % N) there would only need to be a single int stored in TLS. There's a bit of subtlety about pool sizes vs number of pools, but it's going to be a bit of a second order tuning thing. We'd definitely need to tag the image with it's source pool and read that on destroy. API wise, I'm wondering if the Python TLS support would be a good place to look rather than native TLS, as it 's in the stable ABI. https://docs.python.org/3/c-api/init.html#thread-local-storage-support I'm not sure if that's going to work on pypy though. 
 | 
| @wiredfool I like that approach a lot. Sorry I got caught up with other work stuff. I'll take a look at this first thing next week (starting Tuesday). | 
| Closing this one in favor of @wiredfool's approach in #8714 | 
Summary
Currently, all threads use the same arena for imaging. When there are enough workers, in regular Python the GIL will be under lots of contention and in free-threaded Python the mutex will be under lots of contention.
This commit instead introduces lockless thread-local arenas for environments that support it. For environments that do not support thread-locals (or for environments where we couldn't determine if they do or not) we fall back to either the GIL or a mutex if there is no GIL.
This has some implications for statistics, as statistics are now thread-specific. This could be solved in a couple of ways (in C or in Python), or left unsolved and just documented. I think either way is fine.
Code
Most of the code doesn't actually need to change. The bulk of the changes were getting
setup.pyto emit the proper compilation definitions so that we could check which kind of thread-local declarations were supported at compile-time. Other than that, the declaration of the default arena now has the thread-local declaration and the places where we previously locked the mutex the macro name has changed to reflect that it is specific to the thread-local arena.Benchmarks
For regular Python, this didn't make much of a difference. (The difference in the samples wasn't statistically significant, 95% CI). For free-threaded Python, however, the difference was fairly massive (about a 70% increase).
v3.13.0 on main
v3.13.0t on main
v3.13.0 on branch
v3.13.0t on branch
Script
Below is the script that I used to run these benchmarks.
bench.py