feat(tag_cardinality_limit transform): Add exact_fingerprint mode for lower memory usage by ArunPiduguDD · Pull Request #25640 · vectordotdev/vector

ArunPiduguDD · 2026-06-16T18:25:49Z

Summary

Adds a new optional mode to the tag_cardinality_limit transform called exact_fingerprint.

It behaves just like the existing exact mode, except that instead of keeping a full copy of every tag value it has seen, it stores a small 8-byte fingerprint of each value. This lowers memory usage in most scenarios (assuming the average tag value is longer than 8 bytes).

Trade-offs:

Slightly reduced throughput due to extra hashing operations
There is an extremely small chance of a hash collision, which could cause an undercount. For workloads that need perfectly exact counting, exact mode is unchanged and remains the default.

Uses SeaHasher over DefaultHasher for improved hashing speed
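To make the storage idea concrete, here is a minimal sketch of a fingerprint-based value set, not the PR's actual code. `FingerprintSet` and `try_accept` are hypothetical names, and the sketch swaps in the standard library's `DefaultHasher` so it compiles without the `seahash` crate; the PR itself hashes with `SeaHasher`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{BuildHasher, BuildHasherDefault, Hash};

/// Hypothetical stand-in for the PR's fingerprint storage: it keeps one
/// u64 per accepted tag value instead of the value itself.
struct FingerprintSet {
    fingerprints: HashSet<u64>,
    value_limit: usize,
}

impl FingerprintSet {
    fn new(value_limit: usize) -> Self {
        Self { fingerprints: HashSet::new(), value_limit }
    }

    /// Compute a 64-bit fingerprint with a fixed-key hasher.
    /// Assumption: std's DefaultHasher stands in for SeaHasher here.
    fn fingerprint<T: Hash + ?Sized>(value: &T) -> u64 {
        BuildHasherDefault::<DefaultHasher>::default().hash_one(value)
    }

    /// Returns true if the value was already seen or fits under the limit,
    /// false if accepting it would exceed `value_limit`.
    fn try_accept(&mut self, value: &str) -> bool {
        let fp = Self::fingerprint(value);
        if self.fingerprints.contains(&fp) {
            // Already seen, or a colliding value (the undercount case).
            return true;
        }
        if self.fingerprints.len() >= self.value_limit {
            return false;
        }
        self.fingerprints.insert(fp);
        true
    }
}
```

With a limit of 2, the first two distinct values are accepted, a repeat of either is accepted again, and a third distinct value is rejected unless its fingerprint happens to collide with a stored one.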

Memory savings

Measured memory use of exact vs exact_fingerprint across different workloads (M = number of metrics, T = tags per metric, V = distinct values per tag). Each tag value was a randomly generated 20-byte string:

Metrics	Tags/metric	Values/tag	Exact mode	Fingerprint mode	Memory saved
50,000	10	1	424 MiB	271 MiB	36%
50,000	10	10	1.30 GiB	463 MiB	65%
50,000	10	100	6.81 GiB	1016 MiB	85%
50,000	50	1	1.63 GiB	934 MiB	44%
50,000	50	10	5.46 GiB	1.40 GiB	74%
50,000	50	100	33.30 GiB	4.12 GiB	88%
100,000	10	1	814 MiB	500 MiB	39%
100,000	10	10	2.47 GiB	769 MiB	70%
100,000	10	100	13.47 GiB	1.78 GiB	87%
100,000	50	1	3.19 GiB	1.74 GiB	46%
100,000	50	10	10.71 GiB	2.65 GiB	75%
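The "Memory saved" column appears to be the relative reduction, 1 - fingerprint / exact, rounded to a whole percent. The sketch below reproduces a few rows under that assumption; the function names are illustrative and not part of the PR.

```rust
/// Relative memory reduction as a percentage, assuming the column is
/// computed as (1 - fingerprint / exact) * 100.
fn memory_saved_percent(exact_mib: f64, fingerprint_mib: f64) -> f64 {
    (1.0 - fingerprint_mib / exact_mib) * 100.0
}

/// Convert GiB to MiB so rows with mixed units can be compared.
fn gib_to_mib(gib: f64) -> f64 {
    gib * 1024.0
}
```

For example, the first row (424 MiB vs 271 MiB) gives about 36%, and the 50,000 / 10 / 100 row (6.81 GiB vs 1016 MiB) gives about 85%.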

Vector configuration

transforms:
  my_tcl:
    type: tag_cardinality_limit
    mode: exact_fingerprint
    value_limit: 500

How did you test this PR?

Unit and integration tests for the new mode (value-limit enforcement under both drop actions, excluded-tag handling, config parsing).
Memory sweeps on a release build across the combinations in the table above, run locally by sending generated metrics and measuring resident memory.

Change Type

Is this a breaking change?

Yes
No

Does this PR include user facing changes?

Yes. Please add a changelog fragment based on our guidelines.
No. A maintainer will apply the no-changelog label to this PR.

… lower memory

Introduces Mode::ExactFingerprint (YAML: mode: exact_fingerprint), an opt-in storage mode that reduces per-accepted-value memory from ~128 B to ~9 B by storing 64-bit hash fingerprints of tag values instead of the full strings.

Design choices:

- Stores only u64 fingerprints; accepts a vanishingly small collision risk (≈ 7e-15 per set at the default value_limit=500), which can cause a minor cardinality undercount. Mode::Exact remains byte-exact for users who need it.
- Fingerprints are computed with the std DefaultHasher (stateless, fixed keys, no per-set hasher state), the same hasher TagValueSet's own Hash impl uses internally.
- Fingerprint table uses HashBuildHasher (identity/pass-through hasher) to avoid double-hashing an already-uniformly-distributed u64.
- Mode::ExactFingerprint and OverrideMode::ExactFingerprint are new, user-visible config variants. Existing Mode::Exact semantics are completely unchanged.

Also fixes test_accepted_tag_value_set_probabilistic in tag_value_set.rs, which was erroneously constructing Mode::Exact and therefore not testing the Bloom path at all.

Benchmarked on a local release binary across M=50K/100K, T=10/50, V=1/10/100. Memory reduction vs exact mode: 36-46% at V=1, 65-75% at V=10, 85-88% at V=100. See tcl_memtest/SESSION_NOTES_2026-06-12.md for full results.

Co-authored-by: ArunPiduguDD <arun.pidugu@datadoghq.com>
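Two of the design choices in this commit message can be illustrated with a short sketch. The first part shows what a pass-through hasher for a table of u64 fingerprints might look like; `IdentityHasher` and `FingerprintTable` are hypothetical names, and this is not Vector's `HashBuildHasher`. The second part estimates the collision risk with the standard birthday-bound approximation n(n-1)/2 / 2^64, which comes out close to the quoted ≈ 7e-15 for n = 500.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Hypothetical pass-through hasher: a u64 key is used as its own hash,
/// which avoids hashing an already uniformly distributed fingerprint twice.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        // Fallback for non-u64 keys; u64 keys go through write_u64 instead.
        for &b in bytes {
            self.0 = (self.0 << 8) | u64::from(b);
        }
    }

    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

/// A set of fingerprints that skips the second round of hashing.
type FingerprintTable = HashSet<u64, BuildHasherDefault<IdentityHasher>>;

/// Birthday-bound approximation of the chance that any two of `n`
/// uniformly random 64-bit fingerprints collide.
fn collision_probability(n: u64) -> f64 {
    let pairs = (n as f64) * (n.saturating_sub(1) as f64) / 2.0;
    pairs / 2f64.powi(64)
}
```

A pass-through hasher is only reasonable when the keys are already well distributed, which is the case for hash fingerprints but not for arbitrary integers.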

bruceg

This LGTM as-is but I had a question about pre-creating the hasher that might be worth pursuing.

bruceg · 2026-06-17T03:06:45Z

+
+    /// Compute a 64-bit fingerprint of a tag value
+    fn fingerprint(value: &TagValueSet) -> u64 {
+        BuildHasherDefault::<SeaHasher>::default().hash_one(value)


Would there be any value in pre-computing BuildHasherDefault::<SeaHasher>::default() when FingerprintStorage is created? You might want to seed it differently for each storage unit to avoid known-seed collision attacks.

Hm, what is a scenario where we would need to guard against known-seed collision attacks in Vector?

The attack path would be an attacker synthetically causing collisions with data they control. In this case that would cause an undercount, letting through a higher cardinality than the limit would otherwise allow (IIUC), which amounts to a DoS attack via service costs. Having an unknown seed makes that much harder, particularly if Vector is used in a cluster where every node has different seeds. The question then is whether this is concerning.

FWIW microbenchmarks confirm that the setup of the hasher is free for this use.
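The per-storage seeding idea from this thread could look roughly like the sketch below. It is an illustration only: `SeededFingerprintStorage` is a hypothetical type, and it uses the standard library's `RandomState` (which picks random keys when created) in place of a seeded `SeaHasher`, so that the hasher is built once per storage and its keys are not known in advance.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashSet;
use std::hash::BuildHasher;

/// Hypothetical storage that creates its hasher state once, at construction,
/// with keys that differ between storages and between processes.
struct SeededFingerprintStorage {
    build_hasher: RandomState,
    fingerprints: HashSet<u64>,
}

impl SeededFingerprintStorage {
    fn new() -> Self {
        Self {
            // RandomState::new() chooses random keys, so an attacker cannot
            // precompute colliding inputs for this particular storage.
            build_hasher: RandomState::new(),
            fingerprints: HashSet::new(),
        }
    }

    /// Fingerprints are stable within one storage but not across storages.
    fn fingerprint(&self, value: &str) -> u64 {
        self.build_hasher.hash_one(value)
    }

    /// Returns true if the fingerprint was not stored before.
    fn insert(&mut self, value: &str) -> bool {
        let fp = self.fingerprint(value);
        self.fingerprints.insert(fp)
    }
}
```

One consequence of this design is that fingerprints are only meaningful inside the storage that produced them, which is fine as long as they are never persisted or compared across instances. Whether the keyed std hasher is fast enough compared with `SeaHasher` would need to be measured.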

pront

Nice addition, thanks.

buraizu

Approving with a minor suggestion for punctuation consistency

buraizu · 2026-06-17T18:24:37Z

+				of tag values instead of the original strings. This leads to lower memory requirements in most
+				scenarios (assuming average tag value size is greater than 8 bytes) at the cost of slightly
+				reduced throughput due to extra hashing operations and a very small chance of collisions at
+				very high cardinalities


Suggested change

very high cardinalities

very high cardinalities.

github-actions Bot added the docs review on hold, domain: transforms, and domain: external docs labels Jun 16, 2026

ArunPiduguDD force-pushed the feat/tag-cardinality-exact-fingerprint branch from 1980f6e to 637d90a Compare June 16, 2026 19:01

ArunPiduguDD force-pushed the feat/tag-cardinality-exact-fingerprint branch from 637d90a to 317f3e5 Compare June 16, 2026 19:04

ArunPiduguDD marked this pull request as ready for review June 16, 2026 19:04

ArunPiduguDD requested review from a team as code owners June 16, 2026 19:04

ArunPiduguDD changed the title ~~feat(tag_cardinality_limit transform): add exact_fingerprint mode for lower memory~~ feat(tag_cardinality_limit transform): Add exact_fingerprint mode for lower memory usage Jun 16, 2026

bruceg approved these changes Jun 17, 2026

github-actions Bot removed the docs review on hold label Jun 17, 2026

ArunPiduguDD and others added 3 commits June 17, 2026 15:41

Add suggested changes (test refactor + fps rename + use derive)

a3756c1

cargo fmt

430f037

Merge branch 'master' into feat/tag-cardinality-exact-fingerprint

c6c12d6

pront approved these changes Jun 17, 2026

buraizu self-assigned this Jun 17, 2026

buraizu approved these changes Jun 17, 2026

ArunPiduguDD wants to merge 4 commits into vectordotdev:master from ArunPiduguDD:feat/tag-cardinality-exact-fingerprint
