
Key hashing in const, remove atomics #669

Draft
hoxxep wants to merge 3 commits into metrics-rs:main from hoxxep:keyhasher-const

Conversation

hoxxep (Contributor) commented Jan 29, 2026

This is a work in progress, but thought I'd share an update for comments. Apologies it's become a bit of an essay, but I think the direction is promising.

This PR:

  • Implements hashing Key in const, always caching the hash and removing the need for atomics.
  • Replaces the impl Hash for Key with an implementation that simply hashes the cached hash value via write_u64(hash).
  • Converts KeyHasher into a NoHashHasher-style hasher for the new impl Hash for Key implementation, while DefaultHashable instead uses RapidHasher<'static>, as nohash isn't suitable for generic hashing.
  • Changes recency.rs to use hashbrown with KeyHasher (nohash) instead of std HashMap with SipHash. I need to benchmark the hashbrown change here; if there are no performance gains, I'll revert some of my changes and instead use std HashMap with KeyHasher (nohash).
  • Adds a const_cow module inside cow.rs; this is a copy of the Deref implementation, but it allows us to Deref in const.
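
The overall shape of the first two bullets can be sketched as follows. This is illustrative only: the struct fields, function names, and the FNV-1a stand-in for rapidhash are assumptions, not the PR's actual code.

```rust
use std::hash::{Hash, Hasher};

// Illustrative sketch: the hash is computed once (in const for static keys)
// and stored inline, so `impl Hash` simply forwards the cached value.
struct Key {
    name: &'static str,
    hash: u64, // replaces the old AtomicBool + AtomicU64 pair
}

// Toy const-evaluable FNV-1a as a stand-in for the real rapidhash routine.
const fn generate_key_hash(name: &str) -> u64 {
    let bytes = name.as_bytes();
    let mut h = 0xcbf29ce484222325u64;
    let mut i = 0;
    while i < bytes.len() {
        h = (h ^ bytes[i] as u64).wrapping_mul(0x100000001b3);
        i += 1;
    }
    h
}

impl Key {
    const fn from_static_name(name: &'static str) -> Self {
        Key { name, hash: generate_key_hash(name) }
    }
}

impl Hash for Key {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // No re-hashing of the name/labels: write the cached hash directly.
        state.write_u64(self.hash);
    }
}

fn main() {
    const KEY: Key = Key::from_static_name("requests_total");
    assert_eq!(KEY.name, "requests_total");
    assert_eq!(KEY.hash, generate_key_hash("requests_total"));
    println!("ok");
}
```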

Key hashing implementation

Continuing on from our chat on #651, I've completely overhauled the hashing of Key to remove the atomics, pursuing absolute speed for the 99% case and assuming inputs are trusted (HashDoS is not a concern). I've opted for rapidhash_v3_nano_inline to hash each individual part of the key, assuming names and labels are most often under 48 bytes in length.

The hash value of the name is reused as the seed value for the labels, and the hash of the label key as the seed for the label values. This disambiguates the name/key/label, so that swapping values between them produces different hash values, but it also creates a data dependency, and we could go faster still if we didn't disambiguate (but then a user swapping the label key/value around would produce the same hash value). Let me know whether you care about the disambiguation here.

Label hashes are combined by summing them. This ensures hashing labels is order-agnostic as per the previous implementation, but avoids having to sort them. Addition is used instead of XOR, as XOR would mean two identical labels would cancel each other out.
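
The seed chaining and additive combining described above can be sketched like this, with mix as a toy stand-in for rapidhash_v3_nano_inline (the function names and structure are assumptions for illustration, not the PR's code):

```rust
// Toy FNV-style mixer standing in for rapidhash_v3_nano_inline.
fn mix(bytes: &[u8], seed: u64) -> u64 {
    let mut h = seed ^ 0xcbf29ce484222325;
    for &b in bytes {
        h = (h ^ b as u64).wrapping_mul(0x100000001b3);
    }
    h
}

fn key_hash(name: &str, labels: &[(&str, &str)]) -> u64 {
    let name_hash = mix(name.as_bytes(), 0);
    let mut hash = name_hash;
    for (k, v) in labels {
        let label_key_hash = mix(k.as_bytes(), name_hash); // seeded by the name hash
        let label_value_hash = mix(v.as_bytes(), label_key_hash); // seeded by the key hash
        // wrapping_add (not XOR) so duplicate labels don't cancel out,
        // while addition keeps the combination order-agnostic.
        hash = hash.wrapping_add(label_value_hash);
    }
    hash
}

fn main() {
    let a = key_hash("requests", &[("region", "eu"), ("host", "a1")]);
    let b = key_hash("requests", &[("host", "a1"), ("region", "eu")]);
    assert_eq!(a, b); // label order doesn't change the hash
    // Swapping a label key and value changes the hash (the disambiguation).
    let c = key_hash("requests", &[("eu", "region"), ("host", "a1")]);
    assert_ne!(a, c);
    println!("ok");
}
```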

TODO

  • Benchmark the recency.rs hashbrown change.
  • Double check the const key benchmark assembly without const { ... } wrappers.
  • Review generate_key_hash, we could improve performance further if we're willing to sacrifice disambiguating between names, label keys, and label values.
  • Confirm we're happy changing KeyHasher to a no-hash hasher. Any users using KeyHasher on other objects might face panics at runtime.
  • Review use of #[inline] and #[inline(always)].
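
For the KeyHasher point above, a no-hash hasher in the style described would look roughly like this (names are illustrative, not the PR's code). It passes a pre-computed u64 through unchanged and panics on any other input, which is the runtime-panic risk flagged for users hashing other objects:

```rust
use std::hash::Hasher;

/// Sketch of a NoHashHasher-style hasher: `finish()` returns whatever
/// was passed to `write_u64`, with no mixing.
#[derive(Default)]
struct NoHashKeyHasher(u64);

impl Hasher for NoHashKeyHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, _bytes: &[u8]) {
        // A no-hash hasher only supports u64 input; hashing arbitrary bytes
        // would silently produce wrong results, so fail loudly instead.
        panic!("NoHashKeyHasher only supports write_u64");
    }

    // Must be overridden: the default write_u64 delegates to write().
    fn write_u64(&mut self, hash: u64) {
        self.0 = hash;
    }
}

fn main() {
    let mut h = NoHashKeyHasher::default();
    h.write_u64(0xdead_beef);
    assert_eq!(h.finish(), 0xdead_beef);
    println!("ok");
}
```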

Bench profile

I've changed the default bench profile to ensure consistency between benchmarking runs.

[profile.bench]
codegen-units = 1
lto = true

Decrease in performance for const key benchmarks

I seem to have to wrap the benchmarks in const { ... } to convince the compiler to make the Key a simple load operation. I haven't checked what assembly it otherwise generates.

This applies more to the old bench profile so doesn't appear in the full benchmarks below, but I was seeing a slight decrease in performance (2ns instead of 1ns) in the const key benchmarks (the new bench profile is always around 2ns). The main difference comes from the new code loading the compile-time-generated Key from static memory, while the old code constructed the Key inline with immediate zeros. It doesn't appear to affect the performance of other benchmarks.
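
The const { ... } wrapping mentioned above looks roughly like this (inline const blocks require Rust 1.79+; the hash function is a toy stand-in, not the PR's code):

```rust
// Toy const-evaluable FNV-1a stand-in for the real key hashing.
const fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h = 0xcbf29ce484222325u64;
    let mut i = 0;
    while i < bytes.len() {
        h = (h ^ bytes[i] as u64).wrapping_mul(0x100000001b3);
        i += 1;
    }
    h
}

fn main() {
    // Without the `const { ... }` wrapper the compiler may still emit the
    // hashing code at runtime; with it, evaluation is guaranteed at compile
    // time and the benchmark body becomes a plain load of the constant.
    let hot = const { fnv1a(b"requests_total") };
    assert_eq!(hot, fnv1a(b"requests_total"));
    println!("ok");
}
```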

Old code (keyhasher-rapidhash branch):

  • from_static_parts creates a Key with hashed: AtomicBool::new(false), hash: AtomicU64::new(0)
  • The compiler generates stp xzr, xzr (store pair of zero register) - immediate zeros with no memory load required
  • The hash is computed lazily on first get_hash() call (slow, but not captured by this benchmark)

New code (keyhasher-const branch):

  • from_static_parts calls generate_key_hash() at compile time, storing the result in the hash field
  • The compiler generates ldp q0, q1 (load pair of quadwords) - must read the pre-computed Key from static memory
  • The hash is immediately available

Assembly comparison

Old inner loop:

stp   xzr, xzr, [x22, #32]  ; immediate zeros - no memory load!
str   x26, [sp, #32]        ; pointer already in register

New inner loop:

ldp   q0, q1, [sp, #16]     ; load 32 bytes from memory
stp   q1, q0, [sp, #64]     ; store to stack

Relevant benchmark results

Run on an M1 Max with the new bench profile (codegen-units = 1 and lto = true). This compares my keyhasher-rapidhash branch (PR #651) to keyhasher-const (this PR), so the improvements from switching to rapidhash are already accounted for.

macros/uninitialized/no_labels
                        time:   [935.57 ps 936.14 ps 936.83 ps]
                        change: [−3.0413% −1.8853% −1.1901%] (p = 0.00 < 0.05)
                        Performance has improved.
macros/uninitialized/with_static_labels
                        time:   [935.62 ps 936.07 ps 936.53 ps]
                        change: [−0.6715% −0.4912% −0.2798%] (p = 0.00 < 0.05)
                        Change within noise threshold.
macros/global_initialized/no_labels
                        time:   [1.2477 ns 1.2484 ns 1.2490 ns]
                        change: [−0.9261% −0.7751% −0.6280%] (p = 0.00 < 0.05)
                        Change within noise threshold.
macros/global_initialized/with_static_labels
                        time:   [1.2478 ns 1.2494 ns 1.2521 ns]
                        change: [−0.6763% −0.4884% −0.3052%] (p = 0.00 < 0.05)
                        Change within noise threshold.
macros/global_initialized/with_dynamic_labels
                        time:   [45.189 ns 45.300 ns 45.409 ns]
                        change: [−18.412% −18.175% −17.898%] (p = 0.00 < 0.05)
                        Performance has improved.
macros/local_initialized/no_labels
                        time:   [1.2488 ns 1.2511 ns 1.2538 ns]
                        change: [−0.8947% −0.6161% −0.3291%] (p = 0.00 < 0.05)
                        Change within noise threshold.
macros/local_initialized/with_static_labels
                        time:   [1.2475 ns 1.2482 ns 1.2489 ns]
                        change: [−0.1981% −0.0340% +0.1203%] (p = 0.69 > 0.05)
                        No change in performance detected.
macros/local_initialized/with_dynamic_labels
                        time:   [45.447 ns 45.590 ns 45.707 ns]
                        change: [−16.667% −16.452% −16.207%] (p = 0.00 < 0.05)
                        Performance has improved.

layer/base case         time:   [312.04 ps 312.20 ps 312.37 ps]
                        change: [−0.0322% +0.1329% +0.2776%] (p = 0.09 > 0.05)
                        No change in performance detected.
layer/no integration    time:   [311.90 ps 312.15 ps 312.48 ps]
                        change: [−0.0130% +0.2943% +0.7786%] (p = 0.15 > 0.05)
                        No change in performance detected.
layer/tracing layer only
                        time:   [312.17 ps 312.54 ps 313.00 ps]
                        change: [−0.1780% −0.0039% +0.1414%] (p = 0.97 > 0.05)
                        No change in performance detected.
layer/metrics layer only
                        time:   [16.567 ns 16.598 ns 16.656 ns]
                        change: [+0.1855% +0.3920% +0.5975%] (p = 0.00 < 0.05)
                        Change within noise threshold.
layer/full integration  time:   [265.82 ns 267.37 ns 268.92 ns]
                        change: [−4.4166% −3.8710% −3.3346%] (p = 0.00 < 0.05)
                        Performance has improved.

prefix/basic            time:   [40.380 ns 40.399 ns 40.420 ns]
                        change: [−5.4505% −5.2640% −5.0950%] (p = 0.00 < 0.05)
                        Performance has improved.
prefix/noop recorder overhead (increment_counter)
                        time:   [311.87 ps 312.19 ps 312.60 ps]
                        change: [−0.3941% −0.1707% +0.0444%] (p = 0.13 > 0.05)
                        No change in performance detected.

registry/cached op (basic)
                        time:   [13.500 ns 13.533 ns 13.574 ns]
                        change: [+0.0617% +0.4376% +0.8980%] (p = 0.03 < 0.05)
                        Change within noise threshold.
registry/cached op (labels)
                        time:   [13.475 ns 13.484 ns 13.493 ns]
                        change: [−2.5266% −2.2544% −1.9942%] (p = 0.00 < 0.05)
                        Performance has improved.
registry/uncached op (basic)
                        time:   [65.576 ns 66.866 ns 67.996 ns]
                        change: [−21.222% −18.540% −15.867%] (p = 0.00 < 0.05)
                        Performance has improved.
registry/uncached op (labels)
                        time:   [98.083 ns 99.035 ns 99.902 ns]
                        change: [−15.621% −13.753% −11.997%] (p = 0.00 < 0.05)
                        Performance has improved.
registry/creation overhead
                        time:   [466.12 ns 466.74 ns 467.40 ns]
                        change: [−0.9672% −0.7127% −0.4716%] (p = 0.00 < 0.05)
                        Change within noise threshold.
registry/const key overhead (basic)
                        time:   [1.8244 ns 1.8254 ns 1.8264 ns]
                        change: [−8.9958% −8.6051% −8.2095%] (p = 0.00 < 0.05)
                        Performance has improved.
registry/const key data overhead (labels)
                        time:   [1.8253 ns 1.8267 ns 1.8287 ns]
                        change: [−12.560% −12.377% −12.151%] (p = 0.00 < 0.05)
                        Performance has improved.
registry/owned key overhead (basic)
                        time:   [2.8546 ns 2.8566 ns 2.8586 ns]
                        change: [−50.697% −50.604% −50.514%] (p = 0.00 < 0.05)
                        Performance has improved.
registry/owned key overhead (labels)
                        time:   [21.269 ns 21.303 ns 21.350 ns]
                        change: [−32.313% −31.749% −31.326%] (p = 0.00 < 0.05)
                        Performance has improved.

hoxxep changed the title from "Const Key hashing" to "Key hashing in const, remove atomics" on Jan 29, 2026
tobz (Member) commented Jan 30, 2026

@hoxxep Just wanted to say that I haven't gone over this entirely yet, but from the surface level description, this sounds like a nice win. 🎉

hoxxep (Contributor, Author) commented Jan 31, 2026

@tobz seems to be! I want to do some more benchmarking to confirm it when I find the time, and then the two big decisions that will need your input are:

  • In generate_key_hash, we could improve performance further if we're willing to sacrifice disambiguating between the key name, label keys, and label values.
  • Confirm we're happy changing KeyHasher to a no-hash hasher. Any users using KeyHasher on other objects will likely face panics at runtime. This might make it a breaking change for the API, requiring metrics to be bumped to 0.25, even though it technically wouldn't break compilation?

I can also optimise RapidSecrets by using a reference for the secrets array to avoid a copy, but this will cause a major version bump in rapidhash as RapidSecrets would need a lifetime parameter. I'll have a think about implementing this separately.
