feat(enrichment tables): add cuckoo filter to memory table #25143

Open
esensar wants to merge 38 commits into vectordotdev:master from esensar:feature/memory-enrichment-table-cuckoo-filter

Conversation

esensar commented Apr 8, 2026

Contributor

Summary

This adds support for cuckoo filters in memory enrichment tables. It targets use cases where only the presence of a key needs to be checked and false positives are acceptable, and it uses far less memory than a regular memory table.

Bloom filters should be fairly easy to add as well (and will be done in a separate PR), but a cuckoo filter is a better fit because it supports deletion. The cuckoo-clock library in particular extends the cuckoo filter with TTL and other features, which matches how memory enrichment tables work.
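
As a rough illustration of the trade-off described above, here is a minimal toy cuckoo filter in Rust. It is a sketch, not the cuckoo-clock implementation: `ToyCuckooFilter`, the 8-bit fingerprints, the four slots per bucket, and the kick limit are all assumptions made for this example, and it has no TTL or counters. It shows why memory use stays small (only fingerprints are stored), why false positives can occur (two keys may share a fingerprint and a bucket), and why deletion works (a fingerprint can be cleared from its slot).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SLOTS_PER_BUCKET: usize = 4; // assumption for this sketch
const MAX_KICKS: usize = 500; // assumption for this sketch

/// Toy cuckoo filter: stores 8-bit fingerprints only, never the keys themselves.
pub struct ToyCuckooFilter {
    buckets: Vec<[u8; SLOTS_PER_BUCKET]>, // 0 marks an empty slot
}

fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

impl ToyCuckooFilter {
    /// `bucket_count` must be a power of two so the XOR step stays in range.
    pub fn new(bucket_count: usize) -> Self {
        assert!(bucket_count.is_power_of_two());
        Self { buckets: vec![[0; SLOTS_PER_BUCKET]; bucket_count] }
    }

    /// Returns (fingerprint, first bucket, alternate bucket) for a key.
    fn locate(&self, key: &str) -> (u8, usize, usize) {
        let h = hash_of(key);
        let fp = ((h >> 32) as u8).max(1); // never 0, which means "empty"
        let i1 = (h as usize) & (self.buckets.len() - 1);
        (fp, i1, self.alt_index(i1, fp))
    }

    /// The alternate bucket depends only on the current bucket and the fingerprint,
    /// so a fingerprint can be relocated without knowing the original key.
    fn alt_index(&self, index: usize, fp: u8) -> usize {
        (index ^ hash_of(&fp) as usize) & (self.buckets.len() - 1)
    }

    fn try_place(&mut self, index: usize, fp: u8) -> bool {
        if let Some(slot) = self.buckets[index].iter_mut().find(|s| **s == 0) {
            *slot = fp;
            return true;
        }
        false
    }

    /// Inserts a key; returns false when the filter is too full to place it.
    pub fn insert(&mut self, key: &str) -> bool {
        let (mut fp, i1, i2) = self.locate(key);
        if self.try_place(i1, fp) || self.try_place(i2, fp) {
            return true;
        }
        // Both buckets are full: displace a resident fingerprint and relocate it.
        // The victim slot is chosen deterministically here to keep the sketch simple.
        let mut index = i2;
        for kick in 0..MAX_KICKS {
            let slot = kick % SLOTS_PER_BUCKET;
            std::mem::swap(&mut fp, &mut self.buckets[index][slot]);
            index = self.alt_index(index, fp);
            if self.try_place(index, fp) {
                return true;
            }
        }
        // The last displaced fingerprint is dropped in this sketch.
        false
    }

    /// May return true for a key that was never inserted (false positive).
    pub fn contains(&self, key: &str) -> bool {
        let (fp, i1, i2) = self.locate(key);
        self.buckets[i1].contains(&fp) || self.buckets[i2].contains(&fp)
    }

    /// Clears one matching fingerprint; returns false if none was found.
    pub fn remove(&mut self, key: &str) -> bool {
        let (fp, i1, i2) = self.locate(key);
        for index in [i1, i2] {
            if let Some(slot) = self.buckets[index].iter_mut().find(|s| **s == fp) {
                *slot = 0;
                return true;
            }
        }
        false
    }
}
```

Inserting a key, checking it with `contains`, and then calling `remove` leaves this filter empty again. A plain Bloom filter cannot undo an insert this way without adding per-bit counters.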

Vector configuration

enrichment_tables:
  cuckoo_table:
    type: memory
    ttl: 60
    flush_interval: 5
    scan_interval: 10
    inputs: ["cuckoo_generator"]
    filter:
      type: cuckoo
      max_entries: 100000
      ttl_enabled: true
      counter_enabled: true
      counter_bits: 4
      # Since keys cannot be exported back out of a cuckoo filter,
      # this component can persist the whole filter state and reload it on restart.
      persistence_path: /etc/vector/persisted-cuckoo-filter
      export_interval: 1

sources:
  data_for_table:
    type: file
    # Just a list of JSON lines in format {"key":"something"}
    include: ["/etc/vector/vector-cuckoo-memory-example-input.jsonl"]

  stdin_data:
    type: stdin

transforms:
  cuckoo_reader:
    type: "remap"
    inputs: ["stdin_data"]
    source: |
      key = .message

      existing, err = get_enrichment_table_record("cuckoo_table", { "key": key })

      if err == null {
        . = existing
      } else {
        .message = "Key not found"
      }

  cuckoo_generator:
    type: "remap"
    inputs: ["data_for_table"]
    source: |
      data = parse_json!(.message)
      . = set!(value: {}, path: [get!(data, path: ["key"])], data: { "ttl": 500 })

sinks:
  console:
    inputs: ["cuckoo_reader"]
    target: "stdout"
    type: "console"
    encoding:
      codec: "json"

How did you test this PR?

Ran the above configuration and looked up keys by entering them through the stdin source. Some unit tests were added as well.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

Sponsored by Quad9

@esensar esensar requested review from a team as code owners April 8, 2026 15:38
@github-actions github-actions Bot added the domain: external docs Anything related to Vector's external, public documentation label Apr 8, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c4e16f8ff3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/enrichment_tables/memory/config.rs
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 4fe95ea743

Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated

esensar commented May 7, 2026

Contributor Author

@vectordotdev/vector any thoughts on this?

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 90ff070563

Comment thread src/enrichment_tables/memory/config.rs
Comment thread src/enrichment_tables/memory/cuckoo_table.rs
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
@github-actions github-actions Bot added the docs review on hold The documentation team reviews PRs only after a PR is approved by the COSE team. label May 26, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 1f43d0b75d

Comment thread src/enrichment_tables/memory/cuckoo_table.rs
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 99297103b3

Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs

pront commented Jun 9, 2026

Member

FYI we are waiting for all Codex comments to be resolved before taking another manual look at this PR.

@pront pront added the meta: awaiting author Pull requests that are awaiting their author. label Jun 9, 2026

esensar commented Jun 10, 2026

Contributor Author

FYI we are waiting for all Codex comments to be resolved before taking another manual look at this PR.

Oh I completely missed the ones above - I thought we were just waiting for this one: #25143 (comment)

Anyways, I will resolve the ones I missed. Is the one about losing state on reload a blocker? Because that was the behavior with regular memory table before that other PR - I can fix that in a separate PR, if that is alright.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 171ede70e8

Comment thread src/enrichment_tables/memory/cuckoo_table.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 9a18e49722

Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs
Comment thread src/enrichment_tables/memory/cuckoo_table.rs
Comment thread src/enrichment_tables/memory/cuckoo_table.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: a938de7d19

Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: e0f47a5c8e

Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread Cargo.toml Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated

esensar commented Jun 17, 2026

Contributor Author

@pront
We have tried this out in prod, and with large cuckoo filters we hit an issue with configuration validation. Since a cuckoo filter has a fixed size, all of its memory is allocated upfront, which can be a lot for large filters. Is there a way to detect that validation is being run, so that we can skip the full allocation? If no such mechanism exists, would adding one be acceptable?

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f9fedfc536

Comment thread src/enrichment_tables/memory/cuckoo_table.rs
Comment thread src/enrichment_tables/memory/cuckoo_table.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 5ae9694939

Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs Outdated
Comment thread src/enrichment_tables/memory/cuckoo_table.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 120e3343b8

/// Configuration of cuckoo filter for memory table.
#[configurable_component]
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct CuckooMemoryConfig {


P2: Reject unknown cuckoo filter fields

Because this nested config does not deny unknown fields, serde will accept and ignore misspelled optional settings inside filter (for example persist_path instead of persistence_path). In that configuration Vector starts successfully but never imports or exports the cuckoo state, so a simple typo can silently disable persistence; add #[serde(deny_unknown_fields)] here like the surrounding memory configs use.
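
To make the failure mode concrete, the following is a hand-rolled, standard-library-only stand-in for the check that `#[serde(deny_unknown_fields)]` would enforce. It is only a sketch: `reject_unknown_fields` and `KNOWN_FILTER_FIELDS` are hypothetical names, the field list is taken from the example configuration in this PR and may be incomplete, and the actual fix suggested above is the serde attribute rather than code like this.

```rust
use std::collections::BTreeMap;

/// Field names from the example configuration in this PR (assumed, possibly incomplete).
const KNOWN_FILTER_FIELDS: &[&str] = &[
    "type",
    "max_entries",
    "ttl_enabled",
    "counter_enabled",
    "counter_bits",
    "persistence_path",
    "export_interval",
];

/// Hypothetical stand-in for strict deserialization: returns an error naming
/// the first key under `filter` that is not a known field.
pub fn reject_unknown_fields(raw: &BTreeMap<String, String>) -> Result<(), String> {
    match raw
        .keys()
        .find(|k| !KNOWN_FILTER_FIELDS.contains(&k.as_str()))
    {
        Some(unknown) => Err(format!("unknown field `{}` in `filter`", unknown)),
        None => Ok(()),
    }
}
```

With a check of this kind, a configuration containing `persist_path` fails at load time instead of starting with persistence silently disabled.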

self.filter.insert_if_not_present(k)
};

if res.is_some_and(|r| r.matches_key(k, &self.filter)) {


P2: Report evictions of existing filter entries

When the cuckoo filter is full and insert_if_not_present* returns a fingerprint for some other entry, this condition treats the write as successful and emits only Inserted. The returned fingerprint still means an existing item was lost from the filter, so a previously accepted key can start missing before TTL/LRU would remove it while the sink acknowledges the batch as delivered; treat any Some result as an eviction/capacity signal instead of silently accepting it.
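
One possible shape for the suggested handling is sketched below. It assumes, as the comment above describes, that the insert call returns `Some(fingerprint)` whenever a resident entry was displaced. `Fingerprint`, `WriteOutcome`, and `classify_insert` are hypothetical names for illustration and are not types from the PR or from the filter library.

```rust
/// Hypothetical stand-in for the fingerprint type returned by the filter library.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Fingerprint(pub u16);

/// What the table reports for a single write.
#[derive(Debug, PartialEq, Eq)]
pub enum WriteOutcome {
    /// The key was stored and nothing was displaced.
    Inserted,
    /// A fingerprint came back from the insert call, so some entry was lost
    /// and the caller should surface a capacity signal.
    Evicted(Fingerprint),
}

/// Treats every `Some` result as an eviction, without checking whether the
/// returned fingerprint happens to match the key being written.
pub fn classify_insert(result: Option<Fingerprint>) -> WriteOutcome {
    match result {
        None => WriteOutcome::Inserted,
        Some(lost) => WriteOutcome::Evicted(lost),
    }
}
```

The caller could then emit an eviction event or metric for `Evicted` instead of reporting a plain insert.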

if !self.cuckoo_config.concurrent_scanning {
    handles.spawn(task);
} else {
    tokio::spawn(task);


P2: Prevent overlapping concurrent scans

When filter.concurrent_scanning = true and a full scan takes longer than scan_interval, these detached tasks can still be running when the next timer tick starts another scan over every partition. That ages TTL/LRU counters more than once per configured interval on large filters, so entries can expire or become eviction candidates too early; keep an in-progress handle/flag or skip the next concurrent scan until the previous one finishes.
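
The "in-progress flag" option mentioned above could look roughly like the sketch below, using only the standard library. `ScanGate` and `ScanPermit` are hypothetical names and nothing here is taken from the PR's code. The idea is that each timer tick asks for a permit and skips the scan when one is already held.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Guard that allows at most one scan to run at a time.
#[derive(Clone, Default)]
pub struct ScanGate {
    in_progress: Arc<AtomicBool>,
}

/// Held for the duration of a scan; clears the flag when dropped,
/// including when the scan task panics or is cancelled.
pub struct ScanPermit {
    in_progress: Arc<AtomicBool>,
}

impl ScanGate {
    /// Returns a permit if no scan is running, or `None` to skip this tick.
    pub fn try_begin(&self) -> Option<ScanPermit> {
        self.in_progress
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| ScanPermit {
                in_progress: Arc::clone(&self.in_progress),
            })
    }
}

impl Drop for ScanPermit {
    fn drop(&mut self) {
        self.in_progress.store(false, Ordering::Release);
    }
}
```

In a timer loop, the permit returned by `try_begin()` would be moved into the spawned scan task, so the next tick only starts a new scan after the previous task has finished and dropped it.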

Labels

  • docs review on hold: The documentation team reviews PRs only after a PR is approved by the COSE team.
  • domain: external docs: Anything related to Vector's external, public documentation
  • under_review


3 participants