feat(tag_cardinality_limit transform): Add exact_fingerprint mode for lower memory usage#25640

Open
ArunPiduguDD wants to merge 4 commits into vectordotdev:master from ArunPiduguDD:feat/tag-cardinality-exact-fingerprint


@ArunPiduguDD ArunPiduguDD commented Jun 16, 2026

Contributor

Summary

Adds a new optional mode to the tag_cardinality_limit transform called exact_fingerprint.

It behaves just like the existing exact mode, except that instead of keeping a full copy of every tag value it has seen, it stores a small 8-byte fingerprint of each value. This reduces memory usage in most scenarios (assuming the average tag value is longer than 8 bytes).
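The storage idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FingerprintSet` and `try_accept` are hypothetical names, and the sketch uses the standard library's `DefaultHasher` as a stand-in so that it needs no external crates (the PR itself uses SeaHasher).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{BuildHasher, BuildHasherDefault};

/// Hypothetical stand-in for a fingerprint-based store:
/// keeps one u64 per accepted tag value instead of the full string.
struct FingerprintSet {
    seen: HashSet<u64>,
    value_limit: usize,
}

impl FingerprintSet {
    fn new(value_limit: usize) -> Self {
        Self { seen: HashSet::new(), value_limit }
    }

    /// 64-bit fingerprint of a tag value.
    /// Assumption: DefaultHasher is used here only to stay std-only.
    fn fingerprint(value: &str) -> u64 {
        BuildHasherDefault::<DefaultHasher>::default().hash_one(value)
    }

    /// Returns true if the value was seen before or still fits under the limit.
    fn try_accept(&mut self, value: &str) -> bool {
        let fp = Self::fingerprint(value);
        if self.seen.contains(&fp) {
            return true; // already counted (or, very rarely, a colliding value)
        }
        if self.seen.len() >= self.value_limit {
            return false; // limit reached: the caller applies the drop action
        }
        self.seen.insert(fp);
        true
    }
}
```

With a limit of 2, accepting "a", "b", and "a" again succeeds, while a third distinct value is rejected. The set never holds more than `value_limit` fingerprints, regardless of how long the tag values are.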

Trade-offs:

  • Slightly reduced throughput due to extra hashing operations
  • There is an extremely small chance of a hash collision, which could cause an
    undercount. For workloads that need perfectly exact counting, exact mode is unchanged and
    remains the default.
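To put "extremely small" in numbers, the usual birthday-bound approximation can be applied. The sketch below assumes fingerprints are uniformly distributed over 64 bits, which is an idealization of any real hash function. `collision_probability` is a hypothetical helper, not part of this PR.

```rust
/// Birthday-bound approximation: probability that `n` uniformly random
/// 64-bit fingerprints contain at least one collision,
/// p ≈ (n * (n - 1) / 2) / 2^64.
fn collision_probability(n: u64) -> f64 {
    // Number of distinct pairs among n fingerprints.
    let pairs = (n as f64) * (n.saturating_sub(1) as f64) / 2.0;
    pairs / 2f64.powi(64)
}
```

For a set of 500 values (the `value_limit` used in the configuration example in this PR), this gives a probability on the order of 7e-15, in line with the figure quoted in the commit message.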

Uses SeaHasher instead of DefaultHasher for faster hashing.

Memory savings

Measured memory use of exact vs. exact_fingerprint across workloads that vary the number of metrics, the number of tags per metric, and the number of distinct values per tag. Each tag value was a randomly generated 20-byte string:

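| Metrics | Tags/metric | Values/tag | Exact mode | Fingerprint mode | Memory saved |
|---------|-------------|------------|------------|------------------|--------------|
| 50,000  | 10 | 1   | 424 MiB   | 271 MiB  | 36% |
| 50,000  | 10 | 10  | 1.30 GiB  | 463 MiB  | 65% |
| 50,000  | 10 | 100 | 6.81 GiB  | 1016 MiB | 85% |
| 50,000  | 50 | 1   | 1.63 GiB  | 934 MiB  | 44% |
| 50,000  | 50 | 10  | 5.46 GiB  | 1.40 GiB | 74% |
| 50,000  | 50 | 100 | 33.30 GiB | 4.12 GiB | 88% |
| 100,000 | 10 | 1   | 814 MiB   | 500 MiB  | 39% |
| 100,000 | 10 | 10  | 2.47 GiB  | 769 MiB  | 70% |
| 100,000 | 10 | 100 | 13.47 GiB | 1.78 GiB | 87% |
| 100,000 | 50 | 1   | 3.19 GiB  | 1.74 GiB | 46% |
| 100,000 | 50 | 10  | 10.71 GiB | 2.65 GiB | 75% |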

Vector configuration

transforms:
  my_tcl:
    type: tag_cardinality_limit
    mode: exact_fingerprint
    value_limit: 500

How did you test this PR?

  • Unit and integration tests for the new mode (value-limit enforcement under both drop
    actions, excluded-tag handling, config parsing).
  • Memory sweeps on a release build across the combinations in the table above, run locally
    by sending generated metrics and measuring resident memory.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

@github-actions github-actions Bot added the docs review on hold, domain: transforms, and domain: external docs labels Jun 16, 2026
@ArunPiduguDD ArunPiduguDD force-pushed the feat/tag-cardinality-exact-fingerprint branch from 1980f6e to 637d90a Compare June 16, 2026 19:01
… lower memory

Introduces Mode::ExactFingerprint (YAML: mode: exact_fingerprint), an opt-in storage
mode that reduces per-accepted-value memory from ~128 B to ~9 B by storing 64-bit hash
fingerprints of tag values instead of the full strings.

Design choices:
- Stores only u64 fingerprints; accepts a vanishingly small collision risk
  (≈ 7e-15 per set at the default value_limit=500), which can cause a minor cardinality
  undercount. Mode::Exact remains byte-exact for users who need it.
- Fingerprints are computed with the std DefaultHasher (stateless, fixed keys, no
  per-set hasher state) — the same hasher TagValueSet's own Hash impl uses internally.
- Fingerprint table uses HashBuildHasher (identity/pass-through hasher) to avoid
  double-hashing an already-uniformly-distributed u64.
- Mode::ExactFingerprint and OverrideMode::ExactFingerprint are new, user-visible config
  variants. Existing Mode::Exact semantics are completely unchanged.
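The pass-through hasher mentioned in the design choices can be illustrated as below. This is a sketch under assumptions: `IdentityHasher` and `FingerprintTable` are hypothetical names, not the types used in the commit. It only shows why a table keyed by already-hashed u64 values does not need to hash them a second time.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Hypothetical pass-through hasher: the u64 key is already a
/// uniformly distributed fingerprint, so it is reused as the hash.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        // Fallback for keys that are not u64: fold the bytes in so the
        // hasher stays well defined. Not used for u64 keys.
        for &b in bytes {
            self.0 = self.0.rotate_left(8) ^ u64::from(b);
        }
    }

    fn write_u64(&mut self, n: u64) {
        // u64's Hash impl calls write_u64, so the fingerprint passes through unchanged.
        self.0 = n;
    }
}

/// A set of fingerprints that skips the second round of hashing.
type FingerprintTable = HashSet<u64, BuildHasherDefault<IdentityHasher>>;
```

A `FingerprintTable::default()` behaves like any other `HashSet<u64>`; the only difference is that bucket selection uses the fingerprint bits directly, which is reasonable only when the keys are already well mixed.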

Also fixes test_accepted_tag_value_set_probabilistic in tag_value_set.rs, which was
erroneously constructing Mode::Exact and therefore not testing the Bloom path at all.

Benchmarked on a local release binary across M=50K/100K, T=10/50, V=1/10/100. Memory
reduction vs exact mode: 36-46% at V=1, 65-75% at V=10, 85-88% at V=100.
See tcl_memtest/SESSION_NOTES_2026-06-12.md for full results.

Co-authored-by: ArunPiduguDD <arun.pidugu@datadoghq.com>
@ArunPiduguDD ArunPiduguDD force-pushed the feat/tag-cardinality-exact-fingerprint branch from 637d90a to 317f3e5 Compare June 16, 2026 19:04
@ArunPiduguDD ArunPiduguDD marked this pull request as ready for review June 16, 2026 19:04
@ArunPiduguDD ArunPiduguDD requested review from a team as code owners June 16, 2026 19:04
@ArunPiduguDD ArunPiduguDD changed the title feat(tag_cardinality_limit transform): add exact_fingerprint mode for lower memory feat(tag_cardinality_limit transform): Add exact_fingerprint mode for lower memory usage Jun 16, 2026

@bruceg bruceg left a comment

Member

This LGTM as-is but I had a question about pre-creating the hasher that might be worth pursuing.

Comment thread src/transforms/tag_cardinality_limit/tag_value_set.rs Outdated
Comment thread src/transforms/tag_cardinality_limit/tests.rs Outdated

/// Compute a 64-bit fingerprint of a tag value
fn fingerprint(value: &TagValueSet) -> u64 {
    BuildHasherDefault::<SeaHasher>::default().hash_one(value)

Member

Would there be any value in pre-computing BuildHasherDefault::<SeaHasher>::default() when FingerprintStorage is created? You might want to seed it differently for each storage unit to avoid known-seed collision attacks.

Contributor Author

Hm, what is a scenario where we would need to guard against known-seed collision attacks in Vector?

Member

The attack path would be synthetically causing a collision with attacker-controlled data. In this case that would cause an undercount, allowing through a higher cardinality than otherwise IIUC, which amounts to a DoS attack due to service costs. Having an unknown seed makes that much harder, particularly if Vector is used in a cluster where every node has a different seed. The question then is whether this is concerning.

FWIW microbenchmarks confirm that the setup of the hasher is free for this use.
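For illustration, the per-storage seeding idea raised in this thread could look roughly like the sketch below. It is not the PR's code: `SeededFingerprinter` is a hypothetical type, and it uses the standard library's `RandomState` (SipHash with random keys) rather than SeaHasher, purely so the example has no external dependencies.

```rust
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

/// Hypothetical storage helper that builds its hasher state once,
/// with keys that are not known to whoever controls the input.
struct SeededFingerprinter {
    build_hasher: RandomState,
}

impl SeededFingerprinter {
    fn new() -> Self {
        // RandomState::new() is initialized with random keys, so
        // separate instances produce different fingerprints for the same value.
        Self { build_hasher: RandomState::new() }
    }

    /// Fingerprint a tag value with this instance's keys.
    fn fingerprint(&self, value: &str) -> u64 {
        self.build_hasher.hash_one(value)
    }
}
```

Within one instance, the same value always maps to the same fingerprint, which is all the cardinality check needs. The trade-off is that fingerprints are no longer comparable across instances or across restarts.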

@github-actions github-actions Bot removed the docs review on hold label Jun 17, 2026

@pront pront left a comment

Member

Nice addition, thanks.

@buraizu buraizu self-assigned this Jun 17, 2026

@buraizu buraizu left a comment

Contributor

Approving with a minor suggestion for punctuation consistency.

of tag values instead of the original strings. This leads to lower memory requirements in most
scenarios (assuming average tag value size is greater than 8 bytes) at the cost of slightly
reduced throughput due to extra hashing operations and a very small chance of collisions at
very high cardinalities

Contributor

Suggested change
- very high cardinalities
+ very high cardinalities.

