
perf(parse_regex, parse_regex_all): pre-compute capture names#1772

Draft
thomasqueirozb wants to merge 9 commits into main from perf/parse-regex-capture-names


@thomasqueirozb thomasqueirozb commented May 20, 2026

Member

Summary

parse_regex does two things per event that can be eliminated. First, it reconstructs the map keys by calling regex.capture_names() on every event and allocating a fresh KeyString for each. Since the regex is compiled once at VRL compile time when the pattern is static, the names are fixed and can be stored then instead. Second, each captured substring is heap-copied into an owned Bytes. The input value is already a refcounted Bytes buffer, so captures can be zero-copy slices of it. We now get an Arc increment rather than a memcpy.
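The two ideas above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `SharedBytes` is a hypothetical stand-in for `bytes::Bytes`, `CompiledParser` is a hypothetical stand-in for the compiled function state, and `spans` stands in for the match offsets a regex would produce. None of these names come from the PR.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for `bytes::Bytes`: a shared buffer plus a view range.
#[derive(Clone, Debug)]
struct SharedBytes {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedBytes {
    /// Copies the payload once, when the buffer is first created.
    fn new(data: &[u8]) -> Self {
        Self { buf: Arc::from(data), range: 0..data.len() }
    }

    /// Zero-copy sub-view: bumps the refcount and copies no payload bytes.
    fn slice(&self, r: Range<usize>) -> Self {
        assert!(r.start <= r.end && r.end <= self.range.len());
        let start = self.range.start + r.start;
        Self { buf: Arc::clone(&self.buf), range: start..start + (r.end - r.start) }
    }

    fn as_slice(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}

/// Hypothetical compiled state: capture names are stored once, up front.
struct CompiledParser {
    capture_names: Vec<String>,
}

impl CompiledParser {
    /// Per-event work: pair each stored name with a view into `input`.
    /// No name is rebuilt and no captured substring is copied.
    fn resolve(&self, input: &SharedBytes, spans: &[Range<usize>]) -> Vec<(&str, SharedBytes)> {
        self.capture_names
            .iter()
            .map(String::as_str)
            .zip(spans.iter().map(|s| input.slice(s.clone())))
            .collect()
    }
}

fn main() {
    let parser = CompiledParser {
        capture_names: vec!["method".to_string(), "path".to_string()],
    };
    let input = SharedBytes::new(b"GET /index.html");
    let fields = parser.resolve(&input, &[0..3, 4..15]);

    assert_eq!(fields[0].1.as_slice(), b"GET");
    assert_eq!(fields[1].1.as_slice(), b"/index.html");
    // The input handle plus two captures share one allocation.
    assert_eq!(Arc::strong_count(&input.buf), 3);
}
```

The sketch borrows the stored names instead of cloning them; how cheap a clone of the real key type is depends on its implementation, which is not shown here.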

Benchmarks (cargo bench -- parse_regex)

| Benchmark | origin/main | this branch | delta |
| --- | --- | --- | --- |
| parse_regex/matches | 3.061 µs | 2.660 µs | −13% |
| parse_regex/single_match | 260.9 ns | 233.2 ns | −11% |
| parse_regex_all/matches | 27.76 µs | 28.51 µs | +3% (noise) |

The +3% on parse_regex_all is within noise rather than a regression: the benchmark only has two short matches, so the per-event savings are small relative to total time.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Existing unit tests for parse_regex and parse_regex_all pass unchanged.

cargo bench --bench stdlib --features="default test" -- --baseline main parse_regex

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on
    our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

References

NA

@thomasqueirozb thomasqueirozb added the no-changelog label (Changes in this PR do not need user-facing explanations in the release changelog) May 20, 2026
@thomasqueirozb thomasqueirozb marked this pull request as ready for review May 20, 2026 18:26
@thomasqueirozb thomasqueirozb requested a review from a team as a code owner May 20, 2026 18:26
@thomasqueirozb thomasqueirozb changed the title from "perf(parse_regex): pre-compute capture names and use zero-copy Bytes slicing" to "perf(parse_regex, parse_regex_all): pre-compute capture names" May 20, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 533f5d1850


Comment thread src/stdlib/util.rs
)
let names = capture_names.iter().map(|name| {
    let value: Value = match capture.name(name.as_str()) {
        Some(m) => utf8_bytes.slice(m.start()..m.end()).into(),


P1: Avoid retaining full input buffer in capture values

Using utf8_bytes.slice(...) here makes every extracted field a view into the original value buffer instead of an owned substring. When users parse large messages and keep only small captures (especially with parse_regex_all), each retained capture now keeps the full original message allocation alive, which can cause substantial memory growth compared with the previous copy-based behavior. Consider materializing owned bytes for returned captures (or conditionally copying large-source slices) so dropping the original field actually releases that memory.


@thomasqueirozb thomasqueirozb May 20, 2026

Member Author

Events are short-lived, so memory pressure shouldn't be an issue here. Deallocation would happen shortly after this runs.
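The trade-off discussed in this thread can be observed in a small std-only sketch. It assumes `Arc<[u8]>` as a stand-in for the refcounted buffer, and `Capture` and `buffer_outlives_original` are hypothetical names used only for illustration. It shows that a retained zero-copy view keeps the whole source allocation alive, while a copied capture does not. Whether that matters in practice depends on how long captures live, which is the point made in the reply above.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical capture representation: a view into the source buffer
/// (zero-copy) or an independent copy of the matched bytes.
enum Capture {
    View(Arc<[u8]>, Range<usize>),
    Owned(Vec<u8>),
}

impl Capture {
    fn bytes(&self) -> &[u8] {
        match self {
            Capture::View(buf, range) => &buf[range.clone()],
            Capture::Owned(v) => v,
        }
    }
}

/// Returns whether the 1 MiB source buffer is still allocated after the
/// original handle is dropped while one 3-byte capture is retained.
fn buffer_outlives_original(zero_copy: bool) -> bool {
    let message: Arc<[u8]> = Arc::from(vec![b'x'; 1 << 20]);
    // A weak handle lets us observe the allocation without keeping it alive.
    let observer = Arc::downgrade(&message);

    let capture = if zero_copy {
        Capture::View(Arc::clone(&message), 0..3)
    } else {
        Capture::Owned(message[0..3].to_vec())
    };

    drop(message);
    assert_eq!(capture.bytes(), b"xxx");
    observer.upgrade().is_some()
}

fn main() {
    // The 3-byte view pins the full 1 MiB buffer.
    assert!(buffer_outlives_original(true));
    // The copy lets the buffer be freed as soon as the original is dropped.
    assert!(!buffer_outlives_original(false));
}
```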

pront previously approved these changes May 20, 2026

@pront pront left a comment

Member

The PR description doesn't describe the optimization idea.

Comment thread src/stdlib/util.rs Outdated
Comment on lines +36 to +51
pub(crate) fn with_utf8_bytes<F, T>(bytes: &bytes::Bytes, f: F) -> T
where
    F: FnOnce(&str, &bytes::Bytes) -> T,
{
    let owned;
    let (s, utf8_bytes): (&str, &bytes::Bytes) = if let Ok(s) = std::str::from_utf8(bytes) {
        (s, bytes)
    } else {
        owned = bytes::Bytes::from(String::from_utf8_lossy(bytes).into_owned());
        (
            std::str::from_utf8(&owned).expect("lossy string is valid UTF-8"),
            &owned,
        )
    };
    f(s, utf8_bytes)
}

Member

This is an awkward implementation. Can you do a match std::str::from_utf8(bytes) { and take it from there?

Member Author

It is unfortunately awkward, but using match gives me a clippy warning: https://rust-lang.github.io/rust-clippy/rust-1.92.0/index.html#single_match_else

This pattern looks off because there is no way to get the inner bytes from String::from_utf8_lossy and also keep a reference to the underlying &str. I will keep it as is, but I'm open to suggestions, as I'm not the biggest fan of this either.

@pront pront left a comment

Member

Claude findings (second pass after the latest changes).

Comment thread src/stdlib/util.rs Outdated
where
    F: FnOnce(&str, &bytes::Bytes) -> T,
{
    let owned;

Member

Claude finding (1): the let owned; late-init is only needed to make both arms of the if/else produce the same (&str, &Bytes) tuple. A match with each arm calling f(..) directly would drop the late binding and read more straightforwardly:

match std::str::from_utf8(bytes) {
    Ok(s) => f(s, bytes),
    Err(_) => {
        let owned = bytes::Bytes::from(String::from_utf8_lossy(bytes).into_owned());
        let s = std::str::from_utf8(&owned).expect("from_utf8_lossy yields valid UTF-8");
        f(s, &owned)
    }
}

Both shapes are correct — current one is just slightly unusual.

Comment thread src/stdlib/util.rs Outdated
    } else {
        owned = bytes::Bytes::from(String::from_utf8_lossy(bytes).into_owned());
        (
            std::str::from_utf8(&owned).expect("lossy string is valid UTF-8"),

Member

Claude finding (5): no test exercises the lossy fallback path — the entire else branch (including this expect) is uncovered. Worth adding at least one case feeding invalid UTF-8 (e.g. Bytes::from_static(b"\xff foo bar")) through parse_regex/parse_regex_all so the offset-alignment invariant this helper exists for is regression-tested.
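To make the offset-alignment invariant concrete, here is a minimal std-only sketch using the same input suggested above. It does not call `parse_regex`; `find_in_lossy` is a hypothetical helper that stands in for a regex match over the lossy string. The point it illustrates is that offsets found in the lossy string are only valid for the lossy buffer, not for the original bytes.

```rust
use std::borrow::Cow;

/// Hypothetical helper: finds `needle` in the lossy UTF-8 rendering of `raw`
/// and returns the match offsets, which index into the lossy string.
fn find_in_lossy(raw: &[u8], needle: &str) -> Option<(usize, usize)> {
    let lossy: Cow<'_, str> = String::from_utf8_lossy(raw);
    let start = lossy.find(needle)?;
    Some((start, start + needle.len()))
}

fn main() {
    let raw: &[u8] = b"\xff foo bar";
    // The invalid byte 0xFF becomes U+FFFD, which takes 3 bytes in UTF-8,
    // so everything after it shifts by 2 bytes in the lossy string.
    let lossy = String::from_utf8_lossy(raw);
    assert_eq!(lossy.len(), raw.len() + 2);

    let (start, end) = find_in_lossy(raw, "foo").expect("needle is present");
    assert_eq!((start, end), (4, 7));

    // Slicing the buffer the offsets came from yields the right capture.
    assert_eq!(&lossy.as_bytes()[start..end], b"foo");
    // Slicing the original buffer with the same offsets does not.
    assert_eq!(&raw[start..end], b"o b");
}
```

A regression test for the real helper would presumably assert the first behaviour, namely that captures are sliced from the same buffer the regex ran against.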

Comment thread src/stdlib/util.rs Outdated
///
/// `capture_names` must be the pre-computed slice of named-group
/// [`KeyString`]s for the regex (computed once at VRL compile time via
/// `regex.capture_names().flatten().map(KeyString::from)`).

Member

Claude finding (4): the "computed once at VRL compile time" claim is true for parse_regex but not for parse_regex_all, which recomputes per resolve() when the pattern is not a compile-time constant. Suggest softening to something like: "computed at compile time when the regex is a constant, otherwise once per resolve call."

Comment thread src/stdlib/parse_regex_all.rs Outdated
let capture_names: &[KeyString] = if let Some(names) = &self.capture_names {
    names.as_slice()
} else {
    &pattern

Member

Claude finding (2): this relies on temporary lifetime extension in a way that looks like a dangling reference. &pattern.capture_names()...collect::<Vec<_>>() borrows a Vec that has no let binding; it stays alive only because temporary lifetime extension on let keeps it around for the rest of resolve(). Compiles and works, but most readers will pause.

Either hoist explicitly:

let computed;
let capture_names: &[KeyString] = match &self.capture_names {
    Some(names) => names.as_slice(),
    None => {
        computed = pattern.capture_names().flatten().map(KeyString::from).collect::<Vec<_>>();
        &computed
    }
};

…or push the parse_regex_all(..) call into each branch so no shared binding is needed.
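For readers who want to try the hoisted form in isolation, here is a std-only sketch of the same late-initialized binding. `compute_names` and `joined_names` are hypothetical names, and `String` stands in for `KeyString`. The binding is declared before the borrow so that the fallback `Vec` has an explicit owner for the rest of the function.

```rust
/// Hypothetical stand-in for deriving capture names from a runtime pattern.
fn compute_names(pattern: &str) -> Vec<String> {
    pattern.split(',').map(str::to_owned).collect()
}

/// Uses the cached names when present, otherwise computes them on the spot.
fn joined_names(cached: &Option<Vec<String>>, pattern: &str) -> String {
    // Declared first so the fallback value outlives the borrow below.
    let computed;
    let names: &[String] = match cached {
        Some(names) => names.as_slice(),
        None => {
            computed = compute_names(pattern);
            &computed
        }
    };
    names.join("|")
}

fn main() {
    let cached = Some(vec!["host".to_string(), "port".to_string()]);
    // Cached path: the pattern argument is not consulted.
    assert_eq!(joined_names(&cached, "ignored"), "host|port");
    // Fallback path: names are computed for this call only.
    assert_eq!(joined_names(&None, "a,b"), "a|b");
}
```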

Comment thread src/stdlib/parse_regex.rs
    .capture_names()
    .flatten()
    .map(KeyString::from)
    .collect();

Member

Claude finding (3): regex.capture_names().flatten().map(KeyString::from).collect() is duplicated across three call sites — here, parse_regex_all.rs compile() (lines 102-109), and parse_regex_all.rs resolve() (lines 200-204). Extract a small helper in util.rs:

pub(crate) fn capture_names(regex: &Regex) -> Vec<KeyString> {
    regex.capture_names().flatten().map(KeyString::from).collect()
}

@thomasqueirozb thomasqueirozb marked this pull request as draft May 22, 2026 17:19
@pront pront added the stdlib: parse_regex (VRL stdlib function: parse_regex) and stdlib: parse_regex_all (VRL stdlib function: parse_regex_all) labels Jun 16, 2026

Labels

  • no-changelog (Changes in this PR do not need user-facing explanations in the release changelog)
  • stdlib: parse_regex_all (VRL stdlib function: parse_regex_all)
  • stdlib: parse_regex (VRL stdlib function: parse_regex)


2 participants