feat(enrichment tables): add cuckoo filter to memory table #25143
esensar wants to merge 38 commits into
Conversation
This adds support for cuckoo filters in memory enrichment tables, for use cases where only the presence of a key needs to be checked and false positives are acceptable. This greatly reduces memory usage compared to regular memory tables.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c4e16f8ff3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4fe95ea743
@vectordev/vector any thoughts on this?
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90ff070563
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f43d0b75d
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99297103b3
FYI we are waiting for all Codex comments to be resolved before taking another manual look at this PR.
Oh, I completely missed the ones above - I thought we were just waiting for this one: #25143 (comment). Anyways, I will resolve the ones I missed. Is the one about losing state on reload a blocker? That was the behavior with the regular memory table before that other PR - I can fix that in a separate PR, if that is alright.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 171ede70e8
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9a18e49722
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a938de7d19
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e0f47a5c8e
@pront
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f9fedfc536
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ae9694939
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 120e3343b8
    /// Configuration of cuckoo filter for memory table.
    #[configurable_component]
    #[derive(Clone, Debug, PartialEq, Eq)]
    pub struct CuckooMemoryConfig {
Reject unknown cuckoo filter fields
Because this nested config does not deny unknown fields, serde will accept and ignore misspelled optional settings inside `filter` (for example `persist_path` instead of `persistence_path`). In that configuration Vector starts successfully but never imports or exports the cuckoo state, so a simple typo can silently disable persistence; add `#[serde(deny_unknown_fields)]` here like the surrounding memory configs use.
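To make the failure mode concrete, here is a minimal std-only sketch of lenient versus strict parsing. It does not use serde; in the PR itself the fix would be the `#[serde(deny_unknown_fields)]` attribute. `FilterSettings`, `KNOWN_FIELDS`, `parse_lenient` and `parse_strict` are hypothetical names made up for this illustration; only the `persistence_path` / `persist_path` pair comes from the comment above.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the nested `filter` settings.
#[derive(Debug, Default, PartialEq)]
struct FilterSettings {
    persistence_path: Option<String>,
}

/// Field names this sketch accepts inside `filter` (assumption).
const KNOWN_FIELDS: &[&str] = &["persistence_path"];

/// Lenient parsing: keys that are not recognized are silently ignored.
fn parse_lenient(raw: &BTreeMap<&str, &str>) -> FilterSettings {
    FilterSettings {
        persistence_path: raw.get("persistence_path").map(|v| v.to_string()),
    }
}

/// Strict parsing: any key outside `KNOWN_FIELDS` is reported as an error.
fn parse_strict(raw: &BTreeMap<&str, &str>) -> Result<FilterSettings, String> {
    if let Some(unknown) = raw.keys().find(|k| !KNOWN_FIELDS.contains(*k)) {
        return Err(format!("unknown field `{}`", unknown));
    }
    Ok(parse_lenient(raw))
}
```

With a map containing only the misspelled `persist_path`, `parse_lenient` returns a settings value with no persistence path at all, while `parse_strict` fails loudly at startup, which is the behavior the review asks for.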
        self.filter.insert_if_not_present(k)
    };

    if res.is_some_and(|r| r.matches_key(k, &self.filter)) {
Report evictions of existing filter entries
When the cuckoo filter is full and `insert_if_not_present*` returns a fingerprint for some other entry, this condition treats the write as successful and emits only `Inserted`. The returned fingerprint still means an existing item was lost from the filter, so a previously accepted key can start missing before TTL/LRU would remove it while the sink acknowledges the batch as delivered; treat any `Some` result as an eviction/capacity signal instead of silently accepting it.
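A rough sketch of the suggested handling, under stated assumptions: `ToyFilter` is a toy fixed-capacity stand-in (not a real cuckoo filter and not the cuckoo-clock API), and `WriteOutcome` and `write` are hypothetical names. The only thing taken from the comment above is the idea that a returned fingerprint means some entry was dropped.

```rust
/// Outcome of a write, as this sketch would report it (hypothetical type).
#[derive(Debug, PartialEq)]
enum WriteOutcome {
    Inserted,
    /// The filter was full and dropped an entry; carries its fingerprint.
    Evicted(u16),
}

/// Toy stand-in for a filter that can fill up: a fixed number of
/// fingerprint slots, oldest entry dropped first.
struct ToyFilter {
    slots: Vec<u16>,
    capacity: usize,
}

impl ToyFilter {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        ToyFilter { slots: Vec::new(), capacity }
    }

    /// Returns `Some(fingerprint)` when an existing entry had to be dropped
    /// to make room, and `None` otherwise.
    fn insert_if_not_present(&mut self, fp: u16) -> Option<u16> {
        if self.slots.contains(&fp) {
            return None;
        }
        let evicted = if self.slots.len() == self.capacity {
            Some(self.slots.remove(0))
        } else {
            None
        };
        self.slots.push(fp);
        evicted
    }
}

/// Any `Some` result is surfaced as an eviction instead of a plain insert.
fn write(filter: &mut ToyFilter, fp: u16) -> WriteOutcome {
    match filter.insert_if_not_present(fp) {
        Some(lost) => WriteOutcome::Evicted(lost),
        None => WriteOutcome::Inserted,
    }
}
```

The point of the separate `Evicted` variant is that the caller can emit a capacity metric or event for it, rather than reporting the write as a clean insert.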
    if !self.cuckoo_config.concurrent_scanning {
        handles.spawn(task);
    } else {
        tokio::spawn(task);
Prevent overlapping concurrent scans
When `filter.concurrent_scanning = true` and a full scan takes longer than `scan_interval`, these detached tasks can still be running when the next timer tick starts another scan over every partition. That ages TTL/LRU counters more than once per configured interval on large filters, so entries can expire or become eviction candidates too early; keep an in-progress handle/flag or skip the next concurrent scan until the previous one finishes.
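One possible shape for the "in-progress flag" option is sketched below with only the standard library. `ScanGuard`, `try_begin_scan` and `on_tick` are hypothetical names, and the scan runs inline here for simplicity; with detached tasks the guard would presumably be moved into the spawned task so the flag clears when that task finishes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Clears the in-progress flag when dropped, including on panic unwinding.
struct ScanGuard(Arc<AtomicBool>);

impl Drop for ScanGuard {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Release);
    }
}

/// Tries to claim the scan slot; returns `None` if a scan is still running.
fn try_begin_scan(in_progress: &Arc<AtomicBool>) -> Option<ScanGuard> {
    in_progress
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .ok()
        .map(|_| ScanGuard(Arc::clone(in_progress)))
}

/// Called on every timer tick; runs `scan` only when no scan is in flight.
/// Returns whether the scan actually ran.
fn on_tick<F: FnOnce()>(in_progress: &Arc<AtomicBool>, scan: F) -> bool {
    match try_begin_scan(in_progress) {
        Some(_guard) => {
            scan(); // `_guard` stays alive until the end of this arm
            true
        }
        None => false, // previous scan still running: skip this tick
    }
}
```

Skipping a tick means counters age at most once per completed scan, at the cost of a longer effective interval on very large filters.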
Summary
This adds support for cuckoo filters in memory enrichment tables, for use cases where only the presence of a key needs to be checked and false positives are acceptable. This greatly reduces memory usage compared to regular memory tables.
Bloom filters should be fairly easy to add as well (and will be done in a separate PR), but a cuckoo filter is the better fit because it supports deletion. The cuckoo-clock lib in particular extends the cuckoo filter with TTL and more, which suits memory enrichment tables.
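For readers unfamiliar with the data structure, a minimal sketch of a plain cuckoo filter follows, showing why deletion is possible: each key is stored as a short fingerprint in one of two candidate buckets, so removing a key just clears that fingerprint. This is not the cuckoo-clock implementation and has no TTL; the bucket size, 8-bit fingerprints, kick limit and deterministic victim choice are all assumptions picked for brevity (real filters usually pick the victim at random).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BUCKET_SIZE: usize = 4;
const MAX_KICKS: usize = 500;

struct CuckooFilter {
    buckets: Vec<[u8; BUCKET_SIZE]>, // fingerprint 0 marks an empty slot
    mask: usize,
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

impl CuckooFilter {
    /// `num_buckets` must be a power of two so the XOR step stays in range.
    fn new(num_buckets: usize) -> Self {
        assert!(num_buckets.is_power_of_two());
        CuckooFilter { buckets: vec![[0; BUCKET_SIZE]; num_buckets], mask: num_buckets - 1 }
    }

    /// Non-zero fingerprint and first bucket index for a key.
    fn locate(&self, key: &str) -> (u8, usize) {
        let h = hash_of(&key);
        (((h >> 32) as u8).max(1), (h as usize) & self.mask)
    }

    /// The other candidate bucket; applying this twice returns `index`.
    fn alt_index(&self, index: usize, fp: u8) -> usize {
        (index ^ hash_of(&fp) as usize) & self.mask
    }

    fn try_put(&mut self, index: usize, fp: u8) -> bool {
        for slot in self.buckets[index].iter_mut() {
            if *slot == 0 {
                *slot = fp;
                return true;
            }
        }
        false
    }

    /// Returns false when the filter is too full; one fingerprint is lost then.
    fn insert(&mut self, key: &str) -> bool {
        let (mut fp, i1) = self.locate(key);
        let i2 = self.alt_index(i1, fp);
        if self.try_put(i1, fp) || self.try_put(i2, fp) {
            return true;
        }
        let mut index = i2;
        for kick in 0..MAX_KICKS {
            // Swap our fingerprint with a resident one, then rehome the victim.
            std::mem::swap(&mut fp, &mut self.buckets[index][kick % BUCKET_SIZE]);
            index = self.alt_index(index, fp);
            if self.try_put(index, fp) {
                return true;
            }
        }
        false
    }

    /// May return true for a key that was never inserted (false positive).
    fn contains(&self, key: &str) -> bool {
        let (fp, i1) = self.locate(key);
        let i2 = self.alt_index(i1, fp);
        self.buckets[i1].contains(&fp) || self.buckets[i2].contains(&fp)
    }

    /// Clears one matching fingerprint; only safe for keys that were inserted.
    fn remove(&mut self, key: &str) -> bool {
        let (fp, i1) = self.locate(key);
        let i2 = self.alt_index(i1, fp);
        for &index in &[i1, i2] {
            if let Some(slot) = self.buckets[index].iter_mut().find(|s| **s == fp) {
                *slot = 0;
                return true;
            }
        }
        false
    }
}
```

Memory per entry is one byte here regardless of key length, which is where the savings over a regular memory table come from; the trade-off is that values cannot be stored and lookups can produce false positives.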
Vector configuration
How did you test this PR?
Ran the above configuration and looked up the keys using the `stdin` source, by entering the keys to look up. Some unit tests were added as well.
Change Type
Is this a breaking change?
Does this PR include user facing changes?