Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCRE2: optimize memory allocations #15395

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Jan 31, 2025

We noticed in #15088 that we don't need Crystal::ThreadLocalValue in the Regex PCRE2 engine.

We can reuse the JIT stack but also the match data for every Regex (no need for a specific match data per Regex). We can allocate one and make sure it's not used by other threads... hence the thread locals: no more spinlock (thread contention) nor hash.

It's simpler and faster. Here's the benchmark from #13144 for example:

$ crystal run --release bench/regex.cr
starts_with?  29.46M ( 33.94ns) (± 0.27%)
    matches?  40.34M ( 24.79ns) (± 0.74%)
$ bin/crystal run --release bench/regex.cr
starts_with?  30.70M ( 32.58ns) (± 0.26%)
    matches?  46.28M ( 21.61ns) (± 0.72%)

Enabling MT also no longer has any impact on performance:

$ crystal run --release -Dpreview_mt bench/regex.cr
starts_with?  26.50M ( 37.73ns) (± 0.09%)
    matches?  41.75M ( 23.95ns) (± 0.56%)
$ bin/crystal run --release -Dpreview_mt bench/regex.cr
starts_with?  30.61M ( 32.67ns) (± 0.36%)
    matches?  48.45M ( 20.64ns) (± 1.51%)

The drawback is that we must allocate each matchdata with a maximum number of ovectors (65535). That might increase memory usage, though I failed to notice it in practice. Maybe not allocating memory for every regular expression is helping?

Note: this PR will be separated into a couple PRs to introduce Crystal::System::ThreadLocal(T). The point of this new type is for this patch, so I want approval on the overall approach before the split.

I could have used @[ThreadLocal] but some targets don't support it (namely: Android, MinGW and OpenBSD) and we can't register destructors either (on thread shutdown). But using pthread_key_create or FlsAlloc we can 😍

@ysbaddaden ysbaddaden self-assigned this Jan 31, 2025
@ysbaddaden ysbaddaden changed the title Refactor: PCRE2 memory allocations PCRE2: optimize memory allocations Jan 31, 2025
@ysbaddaden
Copy link
Contributor Author

Damn, that was promising, and it's working well on Linux, but both Darwin and Windows completely broke 😭

Windows could be explained by using FlsAlloc instead of TlsAlloc, but Darwin?

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Feb 4, 2025

WTH: I can't reproduce the errors in my Windows VM (MSVC) 😕

@ysbaddaden
Copy link
Contributor Author

I can reproduce when running the compiler specs. Looking at the memory graph, the committed memory keeps growing rapidly, and it starts failing exactly when reaching 32GB (the VM has 16GB of RAM). That can't be a coincidence.

When running the compiler specs from master, the committed memory is rather stable or grows slowly (around 3.5GB). Let's note that a linker command still fails, but I get a bunch of weird spec errors in the VM.

Maybe FlsAlloc ain't exactly compatible with our fiber-context switch (argh)?

@ysbaddaden ysbaddaden force-pushed the refactor/pcre2-memory-allocations branch from 7ceca47 to 7487cb5 Compare March 11, 2025 07:43
@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Mar 11, 2025

Here's a new attempt.

From CI runs in my fork, the darwin and win32-msvc targets got fixed, but win32-gnu is still broken on dual heredocs such as call <<-CRYSTAL, <<-CRYSTAL which I don't understand 😞

Note: I believe FLS should work on Windows, which would avoid the custom destructors. The stdlib specs pass, save for these failures in the compiler specs.

@HertzDevil
Copy link
Contributor

Yeah that's the LLVM 20 failure, ignore it for the moment

@ysbaddaden
Copy link
Contributor Author

I tried again, and FLS is apparently working on Windows. Nice.

Thread.threads.delete(self)
Fiber.inactive(fiber)
detach { system_close }
end
end

protected def run_destructors : Nil
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: cleanup before merge (not needed anymore).

@ysbaddaden ysbaddaden marked this pull request as ready for review March 11, 2025 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants