On macOS, the recommended way to implement a JIT system is by creating the memory map with PROT_WRITE | PROT_EXEC and the MAP_JIT flag, then using pthread_jit_write_protect_np to switch between writing and executing the buffer.
(This is kinda weird, because the W^X behavior is tracked on a per-thread basis, rather than per-region; I found it easiest to only enable W right before copying into the region, then disable it afterwards)
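To make the per-thread toggling concrete, here is a minimal sketch of the "enable W right before copying, disable it afterwards" pattern as an RAII guard. The names `JitWriteGuard`, `set_jit_write_protect`, `jit_writable`, and the thread-local `WRITABLE` flag are hypothetical helpers invented for illustration; only `pthread_jit_write_protect_np` is the real macOS call. On anything other than aarch64 macOS the toggle is stubbed out as a no-op so the sketch still compiles.

```rust
use std::cell::Cell;

#[cfg(all(target_os = "macos", target_arch = "aarch64"))]
extern "C" {
    // From <pthread.h> on macOS 11+: 0 makes MAP_JIT regions writable for the
    // calling thread, 1 makes them executable again.
    fn pthread_jit_write_protect_np(enabled: i32);
}

#[cfg(all(target_os = "macos", target_arch = "aarch64"))]
fn set_jit_write_protect(enabled: bool) {
    unsafe { pthread_jit_write_protect_np(enabled as i32) }
}

// Assumption: on other targets there is nothing to toggle, so this is a no-op.
#[cfg(not(all(target_os = "macos", target_arch = "aarch64")))]
fn set_jit_write_protect(_enabled: bool) {}

thread_local! {
    // Hypothetical bookkeeping: mirrors the per-thread W^X state so that
    // nested guards do not re-protect too early.
    static WRITABLE: Cell<bool> = Cell::new(false);
}

/// While this guard is alive, the current thread may write to MAP_JIT regions.
struct JitWriteGuard {
    was_writable: bool,
}

impl JitWriteGuard {
    fn new() -> Self {
        let was_writable = WRITABLE.with(|w| w.replace(true));
        if !was_writable {
            set_jit_write_protect(false); // enable W for this thread only
        }
        JitWriteGuard { was_writable }
    }
}

impl Drop for JitWriteGuard {
    fn drop(&mut self) {
        // Only the outermost guard switches the thread back to executable.
        if !self.was_writable {
            set_jit_write_protect(true);
            WRITABLE.with(|w| w.set(false));
        }
    }
}

/// Reports whether the current thread is in "write" mode.
fn jit_writable() -> bool {
    WRITABLE.with(|w| w.get())
}
```

The idea is to create the guard immediately before the copy and let it drop right after, so the thread spends as little time as possible in write mode. Because the state is per-thread, a guard held on one thread does nothing for any other thread.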
Anyways, it turns out that this is much faster than using mprotect to swap mmap'd regions between PROT_WRITE and PROT_EXEC!
Here's a flamegraph using mmap

(note the calls to mprotect and memmove taking up a good chunk of time)
Here's what it looks like with pthread_jit_write_protect_np

(those calls are gone, and pthread_jit_write_protect_np doesn't even show up)
I see one benchmark go from 112 ms down to 62 ms, a roughly 45% reduction in runtime (about 1.8x faster)!
(My benchmarks are admittedly weird, in that they compile a lot of very small functions 😆)
This requires ditching / forking the memmap2 crate, which doesn't support this behavior. Here's how I did it.
Right now, it's easy for users to do this on their own: I'm using a VecAssembler then copying into this custom struct Mmap, which works fine. Still, this would be a decent optimization for the stock Assembler.
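For readers who want to try this themselves, here is a rough sketch of what such a custom region type could look like. It is not the code linked above: `JitRegion`, `jit_write_protect`, and the `Vec`/slice input standing in for the assembler's output are all hypothetical. It assumes a 64-bit Unix target, declares the libc functions by hand instead of using a crate, and hard-codes flag values (`MAP_ANON` and `MAP_JIT` are the macOS values; elsewhere it falls back to the Linux anonymous-map flag and a plain RWX mapping purely so the sketch runs).

```rust
use std::ffi::c_void;
use std::ptr;

extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: i32, flags: i32, fd: i32, offset: i64)
        -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> i32;
    #[cfg(all(target_os = "macos", target_arch = "aarch64"))]
    fn pthread_jit_write_protect_np(enabled: i32);
}

const PROT_READ: i32 = 1;
const PROT_WRITE: i32 = 2;
const PROT_EXEC: i32 = 4;
const MAP_PRIVATE: i32 = 0x0002;
#[cfg(target_os = "macos")]
const MAP_ANON: i32 = 0x1000;
#[cfg(target_os = "macos")]
const MAP_JIT: i32 = 0x0800;
// Assumption: non-macOS values are the Linux ones, and MAP_JIT does not exist.
#[cfg(not(target_os = "macos"))]
const MAP_ANON: i32 = 0x20;
#[cfg(not(target_os = "macos"))]
const MAP_JIT: i32 = 0;

#[cfg(all(target_os = "macos", target_arch = "aarch64"))]
fn jit_write_protect(enabled: bool) {
    unsafe { pthread_jit_write_protect_np(enabled as i32) }
}
#[cfg(not(all(target_os = "macos", target_arch = "aarch64")))]
fn jit_write_protect(_enabled: bool) {}

/// A single anonymous mapping created once with write and execute permission.
pub struct JitRegion {
    ptr: *mut u8,
    len: usize,
}

impl JitRegion {
    pub fn new(len: usize) -> std::io::Result<Self> {
        let prot = PROT_READ | PROT_WRITE | PROT_EXEC;
        let flags = MAP_PRIVATE | MAP_ANON | MAP_JIT;
        let p = unsafe { mmap(ptr::null_mut(), len, prot, flags, -1, 0) };
        if p as isize == -1 {
            // MAP_FAILED
            return Err(std::io::Error::last_os_error());
        }
        Ok(JitRegion { ptr: p as *mut u8, len })
    }

    /// Copies assembled bytes into the region, enabling W only for the copy.
    pub fn write(&mut self, code: &[u8]) {
        assert!(code.len() <= self.len, "code does not fit in region");
        jit_write_protect(false); // this thread may now write
        unsafe { ptr::copy_nonoverlapping(code.as_ptr(), self.ptr, code.len()) };
        jit_write_protect(true); // back to executable
        // Not shown: on aarch64 the instruction cache must be invalidated
        // (e.g. sys_icache_invalidate) before jumping into the new code.
    }

    pub fn as_slice(&self) -> &[u8] {
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}

impl Drop for JitRegion {
    fn drop(&mut self) {
        unsafe { munmap(self.ptr as *mut c_void, self.len) };
    }
}
```

Usage would be along the lines of `let mut region = JitRegion::new(4096)?; region.write(&bytes);`, where `bytes` is whatever the assembler produced. The point of the design is that the mapping's permissions never change after creation, so no `mprotect` call is needed per compiled function.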
As always, dynasm-rs is great, and I really appreciate the work that went into it!