@benbroadaway
Collaborator

Main Issue

Noderoster processors can lag and never catch up.

Findings

#1206 helps speed up the query to list events, but processing them is still an issue. Specifically, we end up making a large number of queries to look up existing host IDs in the DB (or create them).

Evicting cache too aggressively

Currently, the host id cache is hard-coded to evict entries 1 minute after last access.

Non-singleton cache

The cache gets created per processor, so currently that's 3 copies. This may end up repeating host lookups multiple times for the same event, and it makes each cache less effective.

Unbounded cache size

Very unlikely to actually be an issue since the cache evicts so aggressively, but technically it's possible to put an unlimited number of entries in the cache, which could lead to OOM in the most extreme cases. This becomes more important once the other issues are fixed, to keep memory usage predictable.

Per-lookup (non-singleton) cache loaders

This doesn't really impact lookup performance, but it's generally not great hygiene if it can be avoided. It does have an impact on garbage collection. Not critical, but it might as well be cleaned up, e.g. create one loader instance for the life of the server rather than millions per hour under normal usage.

Fixes

  • Make noderoster host id cache @Singleton
  • Increase default cache eviction duration (1 hour after last access)
  • Set max size for noderoster host id cache (default 10,000)
  • Make cache settings configurable in server config
  • Use singleton cache loader
    • Get-or-create logic remains in HostManager, but moves out of the cache loader
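
A minimal sketch of what the fixes above add up to: one shared cache with a size cap, expire-after-access, and a single loader. Everything here is an assumption for illustration. `HostIdCache`, its constructor parameters, and the placeholder loader are hypothetical names, and the sketch uses only the JDK (`LinkedHashMap` in access order) rather than whatever cache library the server actually uses. In the real server the single instance would come from the DI container via @Singleton; here it is just constructed once.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical stand-in for the noderoster host id cache (not Concord's actual class).
final class HostIdCache {

    // Cached host id plus the time it was last read.
    private static final class Slot {
        final long hostId;
        final long lastAccessNanos;

        Slot(long hostId, long lastAccessNanos) {
            this.hostId = hostId;
            this.lastAccessNanos = lastAccessNanos;
        }
    }

    private final int maxSize;
    private final long expireAfterAccessNanos;
    private final Function<String, Long> loader; // single loader for the cache's lifetime
    private final LongSupplier clock;            // injectable so expiry can be tested
    private final LinkedHashMap<String, Slot> slots;

    HostIdCache(int maxSize, Duration expireAfterAccess,
                Function<String, Long> loader, LongSupplier clock) {
        this.maxSize = maxSize;
        this.expireAfterAccessNanos = expireAfterAccess.toNanos();
        this.loader = loader;
        this.clock = clock;
        // accessOrder=true keeps the least recently used entry first,
        // so removeEldestEntry drops it once the size cap is exceeded.
        this.slots = new LinkedHashMap<String, Slot>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Slot> eldest) {
                return size() > HostIdCache.this.maxSize;
            }
        };
    }

    // Returns the cached id, calling the loader on a miss or after expiry.
    // Loads run under the lock, which is simple but serializes misses.
    synchronized long get(String host) {
        long now = clock.getAsLong();
        Slot slot = slots.get(host);
        boolean expired = slot != null && now - slot.lastAccessNanos >= expireAfterAccessNanos;
        long hostId = (slot == null || expired) ? loader.apply(host) : slot.hostId;
        slots.put(host, new Slot(hostId, now)); // refresh last-access time
        return hostId;
    }

    synchronized int size() {
        return slots.size();
    }
}

final class HostIdCacheDemo {
    public static void main(String[] args) {
        // Defaults from the list above: 10,000 entries, 1 hour after last access.
        HostIdCache cache = new HostIdCache(10_000, Duration.ofHours(1),
                host -> (long) host.hashCode(), // placeholder for the DB get-or-create
                System::nanoTime);
        System.out.println(cache.get("host-a") == cache.get("host-a"));
    }
}
```

The loader is passed in once and reused for every miss, which is the "singleton cache loader" point; the get-or-create logic itself stays outside the cache, behind that function.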

Benchmarks

Method

  1. Pre-load DB with unprocessed events.
  2. Start server and wait for all events to be processed.
  3. Collect task run time.

Synthetic data

  • Unique Hosts: 200,090 (re-use 90 hosts and generate a unique hostname for 10% of the events)
  • Total Events: 2,000,000
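
For illustration, one way a data set with this shape could be generated. The actual generator is not shown in this PR, so the naming scheme and the every-tenth-event rule below are assumptions; the sketch only demonstrates that 90 shared hosts plus a unique hostname for 10% of 2,000,000 events yields 200,090 unique hosts.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical generator; names and distribution are assumptions, not the PR's code.
final class SyntheticHosts {
    static final int REUSED_HOSTS = 90;

    // Every 10th event gets a hostname of its own (10% of events);
    // the rest cycle through the 90 shared hostnames, one per block of ten events.
    static String hostFor(int eventIndex) {
        if (eventIndex % 10 == 0) {
            return "unique-" + eventIndex;
        }
        return "shared-" + ((eventIndex / 10) % REUSED_HOSTS);
    }

    // Counts distinct hostnames across the first totalEvents events.
    static int countUniqueHosts(int totalEvents) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < totalEvents; i++) {
            seen.add(hostFor(i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countUniqueHosts(2_000_000));
    }
}
```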

Short story without wasting time with charts: the new implementation was ~71% faster. Just skip to the next section because it's more valid.

Real Data

Modeled after 1 hour of production data.

  • Unique Hosts: 34,080
  • Total Events: 792,715

| Metric | Current | New |
|---|---|---|
| Total Time (ms) | 237,586 | 292,870 |
| Cache Hit Rate | 0.953497789 | 0.957008509 |
| GC Count | 131 | 107 |
| GC Time (ms) | 507 | 315 |

That's not an improvement; it's 23% slower! BUT this is with the default settings and <1ms latency. Raising the max cache size from 10,000 entries to 50,000 (expect ~15MB of heap usage at full capacity) changes the picture.

| Metric | Current | New |
|---|---|---|
| Total Time (ms) | 237,586 | 229,884 |
| Cache Hit Rate | 0.953497789 | 0.957008509 |
| GC Count | 131 | 107 |
| GC Time (ms) | 507 | 315 |

OK, that is an improvement, but not much (about 3%). Since latency is so low, the current-implementation cache stays busy enough to keep warm for the most part.

Now, let's introduce 10ms of latency, which is less than some real-world latency depending on server and DB locations.

| Metric | Current | New |
|---|---|---|
| Total Time (ms) | 3,475,388 | 884,653 |
| Cache Hit Rate | 0.776712942 | 0.957008509 |
| GC Count | 161 | 115 |
| GC Time (ms) | 1,402 | 533 |

Aaaah, now the new fixes get to shine. That's a 74% reduction in processing time.
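
The percentages quoted for the three runs follow directly from the Total Time rows, and can be checked with a small sketch. `BenchmarkMath` and its method names are made up for this example. The second helper is an added observation rather than something stated in the PR: if every unique host has to be loaded at least once, the best possible hit rate for the real data set is 1 - 34,080 / 792,715, which matches the reported New hit rate.

```java
// Hypothetical helper for checking the benchmark arithmetic; not part of the PR.
final class BenchmarkMath {

    // Fractional change in run time from 'before' to 'after'; negative means faster.
    static double relativeChange(long beforeMs, long afterMs) {
        return (afterMs - beforeMs) / (double) beforeMs;
    }

    // Upper bound on hit rate when each unique host costs at least one miss.
    static double hitRateCeiling(long uniqueHosts, long totalEvents) {
        return 1.0 - (double) uniqueHosts / totalEvents;
    }

    public static void main(String[] args) {
        System.out.printf("default settings:   %+.1f%%%n", 100 * relativeChange(237_586, 292_870));
        System.out.printf("50,000-entry cache: %+.1f%%%n", 100 * relativeChange(237_586, 229_884));
        System.out.printf("10ms latency:       %+.1f%%%n", 100 * relativeChange(3_475_388, 884_653));
        System.out.printf("hit rate ceiling:   %.9f%n", hitRateCeiling(34_080, 792_715));
    }
}
```

The three changes come out to roughly +23%, -3%, and -75% (the 74% quoted above, before rounding).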

The real-world execution of the Noderoster event processors is intermittent. A production multi-server Concord instance (say, 5 servers) will see the event processing switch between servers regularly, so the likelihood of the current-implementation host cache getting cold is much higher. Couple that with cross-region latency for re-loading the cache and it just can't keep up with a busy environment (with regard to ANSIBLE process events). The current-implementation cache never held more than ~5,000 entries while running.

@benbroadaway benbroadaway requested review from a team, brig and ibodrov December 18, 2025 20:18
@benbroadaway benbroadaway changed the title from "noderoster: configurable cache size and eviction duration" to "noderoster: configurable host cache size and eviction duration" Dec 22, 2025
@benbroadaway benbroadaway merged commit 4923686 into master Dec 22, 2025
4 checks passed
@benbroadaway benbroadaway deleted the bb/noderoster-cache-settings branch December 22, 2025 20:43