perf(parse_regex, parse_regex_all): pre-compute capture names by thomasqueirozb · Pull Request #1772 · vectordotdev/vrl

thomasqueirozb · 2026-05-20T18:18:36Z

Summary

parse_regex does two things per event that can be eliminated. First, it reconstructs the map keys by calling regex.capture_names() on every event and allocating a fresh KeyString for each. Since the regex is compiled once at VRL compile time when the pattern is static, the names are fixed and can be stored then instead. Second, each captured substring is heap-copied into an owned Bytes. The input value is already a refcounted Bytes buffer, so captures can be zero-copy slices of it. We now get an Arc increment rather than a memcpy.
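The two ideas in this summary can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `SharedBytes` is a hypothetical stand-in for a refcounted buffer such as `bytes::Bytes`, and `CompiledPattern` and its `captures` method are assumed names. The match spans are passed in directly so that no regex engine is needed.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for a refcounted byte buffer with cheap sub-slicing.
#[derive(Clone, Debug)]
pub struct SharedBytes {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedBytes {
    pub fn new(data: &[u8]) -> Self {
        SharedBytes { buf: Arc::from(data), range: 0..data.len() }
    }

    /// Zero-copy view: bumps the reference count and copies no bytes.
    pub fn slice(&self, r: Range<usize>) -> Self {
        let len = self.range.end - self.range.start;
        assert!(r.start <= r.end && r.end <= len, "range out of bounds");
        let start = self.range.start + r.start;
        SharedBytes { buf: Arc::clone(&self.buf), range: start..start + (r.end - r.start) }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }

    /// Number of handles currently sharing the underlying buffer.
    pub fn ref_count(&self) -> usize {
        Arc::strong_count(&self.buf)
    }
}

/// Built once per pattern (the "compile" step) and reused for every event.
pub struct CompiledPattern {
    /// Capture names, stored up front so the per-event path builds no keys.
    pub names: Vec<String>,
}

impl CompiledPattern {
    pub fn new(names: &[&str]) -> Self {
        CompiledPattern { names: names.iter().map(|n| n.to_string()).collect() }
    }

    /// Per-event step: pair each stored name with a zero-copy view of the input.
    /// `spans[i]` is the matched byte range for `names[i]`, or `None` if that
    /// group did not participate in the match.
    pub fn captures<'a>(
        &'a self,
        input: &SharedBytes,
        spans: &[Option<Range<usize>>],
    ) -> Vec<(&'a str, Option<SharedBytes>)> {
        self.names
            .iter()
            .zip(spans)
            .map(|(name, span)| (name.as_str(), span.clone().map(|r| input.slice(r))))
            .collect()
    }
}
```

Calling `captures` on the input `user=alice id=42` with spans `5..10` and `14..16` yields views over `alice` and `42`, and the buffer's reference count rises from 1 to 3 without any byte being copied.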

Benchmarks (`cargo bench -- parse_regex`)

| Benchmark | `origin/main` | this branch | delta |
|---|---|---|---|
| `parse_regex/matches` | 3.061 µs | 2.660 µs | −13% |
| `parse_regex/single_match` | 260.9 ns | 233.2 ns | −11% |
| `parse_regex_all/matches` | 27.76 µs | 28.51 µs | +3% (noise) |

parse_regex_all shows no regression — the benchmark only has two short matches, so per-event overhead savings are small relative to total time.

Change Type

Bug fix
New feature
Non-functional (chore, refactoring, docs)
Performance

Is this a breaking change?

Yes
No

How did you test this PR?

Existing unit tests for parse_regex and parse_regex_all pass unchanged.

cargo bench --bench stdlib --features="default test" -- --baseline main parse_regex

Does this PR include user facing changes?

Yes. Please add a changelog fragment based on our guidelines.
No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

Our CONTRIBUTING.md is a good starting place.
If this PR introduces changes to LICENSE-3rdparty.csv, please run `dd-rust-license-tool write` and commit the changes. More details here.

References

NA


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 533f5d1850


chatgpt-codex-connector · 2026-05-20T18:33:57Z

-        )
+    let names = capture_names.iter().map(|name| {
+        let value: Value = match capture.name(name.as_str()) {
+            Some(m) => utf8_bytes.slice(m.start()..m.end()).into(),


Avoid retaining full input buffer in capture values

Using utf8_bytes.slice(...) here makes every extracted field a view into the original value buffer instead of an owned substring. When users parse large messages and keep only small captures (especially with parse_regex_all), each retained capture now keeps the full original message allocation alive, which can cause substantial memory growth compared with the previous copy-based behavior. Consider materializing owned bytes for returned captures (or conditionally copying large-source slices) so dropping the original field actually releases that memory.
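The retention concern and the suggested "conditionally copy" mitigation can be illustrated with a small std-only sketch. Everything here is hypothetical: the `Capture` enum, the `extract` function, and the `MAX_RETAIN_RATIO` threshold are illustrative names and values, and `Arc<[u8]>` stands in for the refcounted input buffer.

```rust
use std::sync::Arc;

/// A capture either shares the source buffer or owns a private copy.
pub enum Capture {
    Shared { buf: Arc<[u8]>, start: usize, end: usize },
    Owned(Vec<u8>),
}

impl Capture {
    pub fn as_bytes(&self) -> &[u8] {
        match self {
            Capture::Shared { buf, start, end } => &buf[*start..*end],
            Capture::Owned(v) => v,
        }
    }
}

/// Hypothetical threshold: share the buffer only when the capture covers at
/// least 1/MAX_RETAIN_RATIO of the source; otherwise copy the bytes out.
pub const MAX_RETAIN_RATIO: usize = 4;

pub fn extract(source: &Arc<[u8]>, start: usize, end: usize) -> Capture {
    let len = end - start;
    if len.saturating_mul(MAX_RETAIN_RATIO) >= source.len() {
        // Large capture: a refcount bump is cheaper than a copy, and the
        // retained buffer is not much bigger than the capture itself.
        Capture::Shared { buf: Arc::clone(source), start, end }
    } else {
        // Small capture of a large source: copy so that dropping the source
        // actually frees it.
        Capture::Owned(source[start..end].to_vec())
    }
}
```

With a 1024-byte source, an 8-byte capture is copied and does not pin the source, while a 512-byte capture shares it and keeps the whole allocation alive until the capture itself is dropped. Where the threshold should sit is a tuning question this sketch does not answer.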

~~Events are short lived and memory pressure shouldn't be an issue here. Deallocation would happen shortly after this runs.~~

pront

The PR description doesn't describe the optimization idea.

pront · 2026-05-20T18:36:44Z

+pub(crate) fn with_utf8_bytes<F, T>(bytes: &bytes::Bytes, f: F) -> T
+where
+    F: FnOnce(&str, &bytes::Bytes) -> T,
+{
+    let owned;
+    let (s, utf8_bytes): (&str, &bytes::Bytes) = if let Ok(s) = std::str::from_utf8(bytes) {
+        (s, bytes)
+    } else {
+        owned = bytes::Bytes::from(String::from_utf8_lossy(bytes).into_owned());
+        (
+            std::str::from_utf8(&owned).expect("lossy string is valid UTF-8"),
+            &owned,
+        )
+    };
+    f(s, utf8_bytes)
+}


This is an awkward implementation. Can we do a `match std::str::from_utf8(bytes) {` and take it from there?

It is unfortunately awkward, but using match gives me a clippy warning: https://rust-lang.github.io/rust-clippy/rust-1.92.0/index.html#single_match_else

This pattern looks off because there is no way to get the inner bytes from String::from_utf8_lossy and also keep a reference to the underlying &str. I will keep it as is, but I am open to suggestions, as I'm not the biggest fan of this either.

pront

Claude findings (second pass after the latest changes).

pront · 2026-05-21T13:38:14Z

+where
+    F: FnOnce(&str, &bytes::Bytes) -> T,
+{
+    let owned;


Claude finding (1): the let owned; late-init is only needed to make both arms of the if/else produce the same (&str, &Bytes) tuple. A match with each arm calling f(..) directly would drop the late binding and read more straightforwardly:

    match std::str::from_utf8(bytes) {
        Ok(s) => f(s, bytes),
        Err(_) => {
            let owned = bytes::Bytes::from(String::from_utf8_lossy(bytes).into_owned());
            let s = std::str::from_utf8(&owned).expect("from_utf8_lossy yields valid UTF-8");
            f(s, &owned)
        }
    }

Both shapes are correct — current one is just slightly unusual.

pront · 2026-05-21T13:38:14Z

+    } else {
+        owned = bytes::Bytes::from(String::from_utf8_lossy(bytes).into_owned());
+        (
+            std::str::from_utf8(&owned).expect("lossy string is valid UTF-8"),


Claude finding (5): no test exercises the lossy fallback path — the entire else branch (including this expect) is uncovered. Worth adding at least one case feeding invalid UTF-8 (e.g. Bytes::from_static(b"\xff foo bar")) through parse_regex/parse_regex_all so the offset-alignment invariant this helper exists for is regression-tested.
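The offset-alignment invariant mentioned in this finding can be shown with the standard library alone. In the sketch below, `find_in_utf8_view` is a hypothetical helper and a plain substring search stands in for the regex match. The point is only that offsets found in the lossy string are valid for the lossy buffer, not for the original bytes.

```rust
use std::borrow::Cow;
use std::ops::Range;

/// Searches the UTF-8 view of `bytes` for `needle` and returns that view
/// together with the byte range of the first occurrence. The range indexes
/// into the returned view, which differs from `bytes` when the input is not
/// valid UTF-8 (each invalid sequence becomes the 3-byte U+FFFD).
pub fn find_in_utf8_view<'a>(
    bytes: &'a [u8],
    needle: &str,
) -> (Cow<'a, str>, Option<Range<usize>>) {
    let view = String::from_utf8_lossy(bytes);
    let range = view.find(needle).map(|start| start..start + needle.len());
    (view, range)
}
```

For the input `b"\xff foo bar"`, the single invalid byte expands to three bytes in the lossy view, so `foo` is found at `4..7`. Slicing the original input with that range yields `o b`, which is why captures must be sliced from the same buffer the regex ran over.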

pront · 2026-05-21T13:38:14Z

 ///
+/// `capture_names` must be the pre-computed slice of named-group
+/// [`KeyString`]s for the regex (computed once at VRL compile time via
+/// `regex.capture_names().flatten().map(KeyString::from)`).


Claude finding (4): the "computed once at VRL compile time" claim is true for parse_regex but not for parse_regex_all, which recomputes per resolve() when the pattern is not a compile-time constant. Suggest softening to something like: "computed at compile time when the regex is a constant, otherwise once per resolve call."

pront · 2026-05-21T13:38:14Z

+        let capture_names: &[KeyString] = if let Some(names) = &self.capture_names {
+            names.as_slice()
+        } else {
+            &pattern


Claude finding (2): this relies on temporary lifetime extension in a way that looks like a dangling reference. &pattern.capture_names()...collect::<Vec<_>>() borrows a Vec that has no let binding; it stays alive only because temporary lifetime extension on let keeps it around for the rest of resolve(). Compiles and works, but most readers will pause.

Either hoist explicitly:

    let computed;
    let capture_names: &[KeyString] = match &self.capture_names {
        Some(names) => names.as_slice(),
        None => {
            computed = pattern
                .capture_names()
                .flatten()
                .map(KeyString::from)
                .collect::<Vec<_>>();
            &computed
        }
    };

…or push the parse_regex_all(..) call into each branch so no shared binding is needed.
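The second option, pushing the call into each branch, might look like the following std-only sketch. The function names are placeholders: `compute_names` stands in for deriving capture names from a runtime pattern (here it just splits on `|`), and `join_names` stands in for the work that consumes them.

```rust
/// Placeholder for the per-call work that needs the capture names.
fn join_names(names: &[String]) -> String {
    names.join(",")
}

/// Placeholder for deriving names from a pattern only known at runtime.
fn compute_names(pattern: &str) -> Vec<String> {
    pattern.split('|').map(str::to_owned).collect()
}

/// Each arm calls the consumer directly, so no binding has to outlive the
/// `match` and no temporary lifetime extension is involved.
pub fn resolve(cached: Option<&Vec<String>>, pattern: &str) -> String {
    match cached {
        // Names were pre-computed: borrow them as-is.
        Some(names) => join_names(names),
        // Names must be computed now: the Vec lives in this arm only.
        None => {
            let computed = compute_names(pattern);
            join_names(&computed)
        }
    }
}
```

The trade-off is a duplicated call site in exchange for a locally scoped `computed`, which may or may not read better than the hoisted `let computed;` form.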

pront · 2026-05-21T13:38:14Z

+            .capture_names()
+            .flatten()
+            .map(KeyString::from)
+            .collect();


Claude finding (3): regex.capture_names().flatten().map(KeyString::from).collect() is duplicated across three call sites — here, parse_regex_all.rs compile() (lines 102-109), and parse_regex_all.rs resolve() (lines 200-204). Extract a small helper in util.rs:

    pub(crate) fn capture_names(regex: &Regex) -> Vec<KeyString> {
        regex.capture_names().flatten().map(KeyString::from).collect()
    }

thomasqueirozb added 3 commits May 20, 2026 13:32

perf(parse_regex): pre-compute capture names and use zero-copy Bytes slicing

32f74f9

perf(parse_regex): make original_bytes required, add with_utf8_bytes helper

9471d38

rename bytes parameter to utf8_bytes in capture_regex_to_map

349f86a

thomasqueirozb added the no-changelog label (changes in this PR do not need user-facing explanations in the release changelog) May 20, 2026

fmt

406076f

thomasqueirozb marked this pull request as ready for review May 20, 2026 18:26

thomasqueirozb requested a review from a team as a code owner May 20, 2026 18:26

thomasqueirozb changed the title ~~perf(parse_regex): pre-compute capture names and use zero-copy Bytes slicing~~ perf(parse_regex, parse_regex_all): pre-compute capture names May 20, 2026

Move parse_regex functions to original places

533f5d1

chatgpt-codex-connector Bot reviewed May 20, 2026

pront previously approved these changes May 20, 2026

thomasqueirozb dismissed pront’s stale review via 7b72f3f May 20, 2026 18:56

thomasqueirozb force-pushed the perf/parse-regex-capture-names branch from f9c9411 to 533f5d1 Compare May 20, 2026 19:28

thomasqueirozb requested a review from pront May 20, 2026 20:17

thomasqueirozb added 3 commits May 20, 2026 16:54

perf(parse_regex): zero-copy slice captures, copy when retained bytes would be disproportionate

8d6277e

bench: add large_input_small_captures benchmark for parse_regex

3acb840

perf(parse_regex): use capture group indices for O(1) access, zero-copy for high capture density

c964372

pront reviewed May 21, 2026

thomasqueirozb marked this pull request as draft May 22, 2026 17:19

perf(parse_regex): pre-build capture template and split copy/zero-copy map paths

bbf2c56

pront added the stdlib: parse_regex (VRL stdlib function: parse_regex) and stdlib: parse_regex_all (VRL stdlib function: parse_regex_all) labels Jun 16, 2026

thomasqueirozb wants to merge 9 commits into main from perf/parse-regex-capture-names