Regex caching strategy for RegexRunner turns allocation-free paths into allocating under contention #115082

Open
@neon-sunset

Description

Regex has a few paths which are expected to be allocation-free, in particular, IsMatch.
Unfortunately, internally IsMatch goes through a RunSingleMatch call, which relies on a single cached instance of RegexRunner. That looks good in microbenchmarks, but not when a Regex instance holding that RegexRunner ends up on a hot path shared by multiple threads.
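
To picture why a single cached instance stops being allocation-free once callers overlap, here is a minimal sketch. It assumes the cache behaves like a one-slot rent/return scheme. `Runner`, `SingleSlotCache`, and `Demo` are hypothetical names used only for illustration; this is not the runtime's actual code, and the overlap is simulated deterministically on one thread rather than with real contention.

```csharp
#nullable enable
using System.Threading;

// Hypothetical stand-in for the per-match state object (RegexRunner in this issue).
sealed class Runner { }

// Hypothetical one-slot cache: at most one Runner is retained between calls.
sealed class SingleSlotCache
{
    private Runner? _cached;
    private int _allocations;

    // Number of Runner instances created so far.
    public int Allocations => Volatile.Read(ref _allocations);

    // Take the cached instance if one is available; otherwise allocate a new one.
    public Runner Rent()
    {
        Runner? runner = Interlocked.Exchange(ref _cached, null);
        if (runner is null)
        {
            Interlocked.Increment(ref _allocations);
            runner = new Runner();
        }
        return runner;
    }

    // Put the instance back. If the slot is already occupied, the previous occupant is dropped.
    public void Return(Runner runner) => Volatile.Write(ref _cached, runner);
}

static class Demo
{
    // One caller at a time: only the very first Rent allocates.
    public static int SequentialAllocations(int calls)
    {
        var cache = new SingleSlotCache();
        for (int i = 0; i < calls; i++)
        {
            Runner r = cache.Rent();
            cache.Return(r);
        }
        return cache.Allocations;
    }

    // Two callers overlap on every round: the second Rent finds the slot empty and allocates.
    public static int OverlappingAllocations(int rounds)
    {
        var cache = new SingleSlotCache();
        for (int i = 0; i < rounds; i++)
        {
            Runner a = cache.Rent();
            Runner b = cache.Rent();
            cache.Return(a);
            cache.Return(b);
        }
        return cache.Allocations;
    }
}
```

In this model, `Demo.SequentialAllocations(1000)` creates a single `Runner`, while `Demo.OverlappingAllocations(1000)` creates 1001: one extra instance for every round in which a second caller arrives before the first has returned its instance.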

Data

Given the example benchmark (note that the actual pattern is different and can't (easily) be replaced with IndexOfAny):

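using System;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

static partial class RegexHolder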
{
    [GeneratedRegex("[%]|[+]")]
    public static partial Regex GeneratedPattern();
}

[ShortRunJob, MemoryDiagnoser]
public abstract class IssueRunner : IDisposable
{
    [Params(1, 2, 4)]
    public int Concurrency;

    public class SourceGeneratedRegex : IssueRunner
    {
        protected override void Operation() => RegexHolder.GeneratedPattern().IsMatch(text);
    }

    public class ThreadStaticCompiledRegex : IssueRunner
    {
        [ThreadStatic]
        static Regex? regex;

        protected override void Operation() => (regex ??= InitPattern()).IsMatch(text);

        static Regex InitPattern() => new("[%]|[+]", RegexOptions.Compiled);
    }

#nullable disable
    protected string text;
    CancellationTokenSource cts;
    Barrier start;
    Barrier finish;
    Task[] workers;
#nullable restore

    [GlobalSetup]
    public void Setup()
    {
        cts = new();
        text = new('a', 4096);
        start = new(Concurrency + 1);
        finish = new(Concurrency + 1);
        workers = new Task[Concurrency];

        for (var i = 0; i < workers.Length; i++)
            workers[i] = Task.Run(OperationWorker);
    }

    [Benchmark]
    public void Execute()
    {
        start.SignalAndWait();
        finish.SignalAndWait();
    }

    protected abstract void Operation();

    void OperationWorker()
    {
        while (true)
        {
            start.SignalAndWait(cts.Token);
            Operation();
            finish.SignalAndWait(cts.Token);
        }
    }

    public void Dispose()
    {
        GC.SuppressFinalize(this);

        cts.Cancel();
        Task.WhenAll(workers)
            .ConfigureAwait(ConfigureAwaitOptions.SuppressThrowing)
            .GetAwaiter()
            .GetResult();

        cts.Dispose();
        start.Dispose();
        finish.Dispose();
    }
}

The benchmark results are as follows:

| Method                       | Concurrency | Mean       | Error    | StdDev   | Allocated |
|----------------------------- |------------:|-----------:|---------:|---------:|----------:|
| ThreadStaticRegex.Execute    | 1           |   441.3 ns | 118.0 ns |  6.47 ns |         - |
| ThreadStaticRegex.Execute    | 2           |   640.9 ns | 623.9 ns | 34.20 ns |         - |
| ThreadStaticRegex.Execute    | 4           | 1,038.5 ns | 349.7 ns | 19.17 ns |         - |
|                              |             |            |          |          |           |
| SourceGeneratedRegex.Execute | 1           |   422.8 ns | 133.8 ns |  7.34 ns |         - |
| SourceGeneratedRegex.Execute | 2           |   736.0 ns | 109.5 ns |  6.00 ns |     715 B |
| SourceGeneratedRegex.Execute | 4           | 1,170.7 ns | 875.4 ns | 47.98 ns |    2024 B |

Analysis

As you can see, it is possible to work around the issue by falling back to JIT-compiled regexes stored in a thread-local.

But this is more hands-on and goes against the analyzer's recommended action of using a source-generated method/property, which sets the expectation that the generated code does the right thing.

This caught me off guard when tuning an application that relies on the .NET regex engine's assurance of being fast and efficient.

It would be great if Regex had dedicated path(s) for methods like IsMatch and RunSingleMatch (but for the EnumerateMatches ref struct iterator), which only need to locate the presence and/or range of a particular match and could avoid using the state carried by RegexRunner.

Configuration

.NET SDK:
 Version:           10.0.100-preview.4.25219.3
 Commit:            e5f35a28bb
 Workload version:  10.0.100-manifests.bb4a133c
 MSBuild version:   17.15.0-preview-25217-10+a9d68ab58

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  15.4
 OS Platform: Darwin
 RID:         osx-arm64
 Base Path:   /usr/local/share/dotnet/sdk/10.0.100-preview.4.25219.3/

Regression?

No
