perf(parse_regex, parse_regex_all): pre-compute capture names#1772
perf(parse_regex, parse_regex_all): pre-compute capture names#1772thomasqueirozb wants to merge 9 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 533f5d1850
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ) | ||
| let names = capture_names.iter().map(|name| { | ||
| let value: Value = match capture.name(name.as_str()) { | ||
| Some(m) => utf8_bytes.slice(m.start()..m.end()).into(), |
There was a problem hiding this comment.
Avoid retaining full input buffer in capture values
Using utf8_bytes.slice(...) here makes every extracted field a view into the original value buffer instead of an owned substring. When users parse large messages and keep only small captures (especially with parse_regex_all), each retained capture now keeps the full original message allocation alive, which can cause substantial memory growth compared with the previous copy-based behavior. Consider materializing owned bytes for returned captures (or conditionally copying large-source slices) so dropping the original field actually releases that memory.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Events are short lived and memory pressure shouldn't be an issue here. Deallocation would happen shortly after this runs.
pront
left a comment
There was a problem hiding this comment.
The PR description doesn't describe the optimization idea.
| pub(crate) fn with_utf8_bytes<F, T>(bytes: &bytes::Bytes, f: F) -> T | ||
| where | ||
| F: FnOnce(&str, &bytes::Bytes) -> T, | ||
| { | ||
| let owned; | ||
| let (s, utf8_bytes): (&str, &bytes::Bytes) = if let Ok(s) = std::str::from_utf8(bytes) { | ||
| (s, bytes) | ||
| } else { | ||
| owned = bytes::Bytes::from(String::from_utf8_lossy(bytes).into_owned()); | ||
| ( | ||
| std::str::from_utf8(&owned).expect("lossy string is valid UTF-8"), | ||
| &owned, | ||
| ) | ||
| }; | ||
| f(s, utf8_bytes) | ||
| } |
There was a problem hiding this comment.
This is a awkward implementation, can do a match std::str::from_utf8(bytes) { and take it from there?
There was a problem hiding this comment.
It is unfortunately awkward but using match give me a clippy warning: https://rust-lang.github.io/rust-clippy/rust-1.92.0/index.html#single_match_else
This pattern looks off because there is no way to get the inner bytes from String::from_utf8_lossy and also keep a reference to the underlying &str. Will keep as is but open to suggestions as I'm not the biggest fan of this either
f9c9411 to
533f5d1
Compare
… would be disproportionate
…py for high capture density
pront
left a comment
There was a problem hiding this comment.
Claude findings (second pass after the latest changes).
| where | ||
| F: FnOnce(&str, &bytes::Bytes) -> T, | ||
| { | ||
| let owned; |
There was a problem hiding this comment.
Claude finding (1): the let owned; late-init is only needed to make both arms of the if/else produce the same (&str, &Bytes) tuple. A match with each arm calling f(..) directly would drop the late binding and read more straightforwardly:
match std::str::from_utf8(bytes) {
Ok(s) => f(s, bytes),
Err(_) => {
let owned = bytes::Bytes::from(String::from_utf8_lossy(bytes).into_owned());
let s = std::str::from_utf8(&owned).expect("from_utf8_lossy yields valid UTF-8");
f(s, &owned)
}
}Both shapes are correct — current one is just slightly unusual.
| } else { | ||
| owned = bytes::Bytes::from(String::from_utf8_lossy(bytes).into_owned()); | ||
| ( | ||
| std::str::from_utf8(&owned).expect("lossy string is valid UTF-8"), |
There was a problem hiding this comment.
Claude finding (5): no test exercises the lossy fallback path — the entire else branch (including this expect) is uncovered. Worth adding at least one case feeding invalid UTF-8 (e.g. Bytes::from_static(b"\xff foo bar")) through parse_regex/parse_regex_all so the offset-alignment invariant this helper exists for is regression-tested.
| /// | ||
| /// `capture_names` must be the pre-computed slice of named-group | ||
| /// [`KeyString`]s for the regex (computed once at VRL compile time via | ||
| /// `regex.capture_names().flatten().map(KeyString::from)`). |
There was a problem hiding this comment.
Claude finding (4): the "computed once at VRL compile time" claim is true for parse_regex but not for parse_regex_all, which recomputes per resolve() when the pattern is not a compile-time constant. Suggest softening to something like: "computed at compile time when the regex is a constant, otherwise once per resolve call."
| let capture_names: &[KeyString] = if let Some(names) = &self.capture_names { | ||
| names.as_slice() | ||
| } else { | ||
| &pattern |
There was a problem hiding this comment.
Claude finding (2): this relies on temporary lifetime extension in a way that looks like a dangling reference. &pattern.capture_names()...collect::<Vec<_>>() borrows a Vec that has no let binding; it stays alive only because temporary lifetime extension on let keeps it around for the rest of resolve(). Compiles and works, but most readers will pause.
Either hoist explicitly:
let computed;
let capture_names: &[KeyString] = match &self.capture_names {
Some(names) => names.as_slice(),
None => {
computed = pattern.capture_names().flatten().map(KeyString::from).collect::<Vec<_>>();
&computed
}
};…or push the parse_regex_all(..) call into each branch so no shared binding is needed.
| .capture_names() | ||
| .flatten() | ||
| .map(KeyString::from) | ||
| .collect(); |
There was a problem hiding this comment.
Claude finding (3): regex.capture_names().flatten().map(KeyString::from).collect() is duplicated across three call sites — here, parse_regex_all.rs compile() (lines 102-109), and parse_regex_all.rs resolve() (lines 200-204). Extract a small helper in util.rs:
pub(crate) fn capture_names(regex: &Regex) -> Vec<KeyString> {
regex.capture_names().flatten().map(KeyString::from).collect()
}
Summary
parse_regexdoes two things per event that can be eliminated. First, it reconstructs the map keys by callingregex.capture_names()on every event and allocating a freshKeyStringfor each. Since the regex is compiled once at VRL compile time when the pattern is static, the names are fixed and can be stored then instead. Second, each captured substring is heap-copied into an ownedBytes. The input value is already a refcountedBytesbuffer, so captures can be zero-copy slices of it. We now get anArcincrement rather than amemcpy.Benchmarks (
cargo bench -- parse_regex)origin/mainparse_regex/matchesparse_regex/single_matchparse_regex_all/matchesparse_regex_allshows no regression — the benchmark only has two short matches, so per-event overhead savings are small relative to total time.Change Type
Is this a breaking change?
How did you test this PR?
Existing unit tests for
parse_regexandparse_regex_allpass unchanged.cargo bench --bench stdlib --features="default test" -- --baseline main parse_regexDoes this PR include user facing changes?
our guidelines.
Checklist
run
dd-rust-license-tool writeand commit the changes. More details here.References
NA