noderoster: configurable host cache size and eviction duration #1246
Main Issue
Noderoster processors can lag and never catch up.

Findings
#1206 helps speed up the query that lists events, but processing them is still an issue. Specifically, we end up making a large number of queries to look up existing host IDs in the DB (or create them).
Evicting cache too aggressively
Currently, the host ID cache is hard-coded to evict entries 1 minute after last access.
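As a minimal sketch of what "configurable eviction duration" means here (the class and method names below are hypothetical, not Concord's actual API): the expire-after-access window becomes a constructor parameter instead of a hard-coded minute.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a host ID cache whose eviction duration is passed in
// via configuration rather than hard-coded to 1 minute.
class ExpiringHostCache {

    private static final class Entry {
        final long hostId;
        volatile long lastAccessNanos;

        Entry(long hostId, long now) {
            this.hostId = hostId;
            this.lastAccessNanos = now;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long evictAfterNanos;

    ExpiringHostCache(Duration evictAfter) {
        this.evictAfterNanos = evictAfter.toNanos();
    }

    /** Returns the cached host ID, or null if absent or expired. */
    Long get(String host) {
        long now = System.nanoTime();
        Entry e = entries.get(host);
        if (e == null) {
            return null;
        }
        if (now - e.lastAccessNanos > evictAfterNanos) {
            entries.remove(host); // expired since last access
            return null;
        }
        e.lastAccessNanos = now; // expire-after-access, not expire-after-write
        return e.hostId;
    }

    void put(String host, long hostId) {
        entries.put(host, new Entry(hostId, System.nanoTime()));
    }
}
```

A real implementation would more likely pass the configured duration to the cache library's builder (e.g. Guava's `CacheBuilder.expireAfterAccess`) rather than hand-roll the bookkeeping.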
Non-singleton cache
The cache gets created per processor -- so, currently, that's 3 copies. This can repeat host lookups multiple times for the same event and makes the cache less effective.
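The fix direction is to share one cache-owning manager across all processors. A rough stdlib-only sketch (names are illustrative; in practice this would be a DI-managed `@Singleton` rather than manual wiring):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a single shared manager owns the host ID cache...
class SharedHostManager {
    private final Map<String, Long> hostIds = new ConcurrentHashMap<>();

    Long cachedId(String host) {
        return hostIds.get(host);
    }

    void cacheId(String host, long id) {
        hostIds.put(host, id);
    }
}

// ...and each processor receives the SAME instance instead of
// constructing its own copy of the cache.
class EventProcessor {
    private final SharedHostManager hosts;

    EventProcessor(SharedHostManager hosts) {
        this.hosts = hosts; // injected, not created per processor
    }

    Long lookup(String host) {
        return hosts.cachedId(host);
    }
}
```

With three processors built from one `SharedHostManager`, an entry cached by any of them is visible to the others, so the same host isn't looked up three times.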
Unbounded cache size
Very unlikely to actually be an issue today since the cache evicts so aggressively, but it's technically possible to put an unlimited number of entries in the cache, which could lead to OOM in the most extreme cases. Bounding the size becomes more important once the other issues are fixed, to keep memory usage predictable.
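A size bound is cheap to add. Cache libraries expose it directly (e.g. Guava's `maximumSize(...)`); the same idea in plain JDK terms, as an illustrative sketch, is an access-ordered `LinkedHashMap` that drops the least-recently-used entry past a configurable cap:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a size-bounded, LRU-evicting map, so the cache can
// never grow without limit regardless of how many distinct hosts show up.
class BoundedHostCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    BoundedHostCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true -> LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every put; evicts the least-recently-used entry
        // once the configured cap is exceeded.
        return size() > maxEntries;
    }
}
```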
Per-lookup (non-singleton) cache loaders
This doesn't really impact lookup performance, but it's generally not great hygiene if it can be avoided. It does have an impact in the realm of garbage collection -- not critical, but might as well be cleaned up, e.g. create one loader instance for the life of the server rather than millions per hour under normal usage.
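To make the allocation point concrete, here is a small hypothetical sketch (the counter exists only to make the behavior observable; the hash-based loader body is a stand-in for the real DB query). The loader is built once per object and reused for every lookup, instead of allocating a fresh loader per call:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: one loader instance for the life of the service.
class HostLookup {

    // Visible counter so the single-allocation behavior can be verified.
    static final AtomicInteger LOADERS_CREATED = new AtomicInteger();

    // Created once, when HostLookup is constructed...
    private final Function<String, Long> loader = newLoader();

    private static Function<String, Long> newLoader() {
        LOADERS_CREATED.incrementAndGet();
        // Stand-in for the real "select or insert host ID" DB query.
        return host -> (long) host.hashCode();
    }

    Long load(String host) {
        return loader.apply(host); // ...and reused on every call.
    }
}
```

The anti-pattern being fixed would be the equivalent of calling `newLoader()` inside `load(...)`: correct output, but one short-lived object per lookup for the GC to chew through.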
Fixes
Makes `HostManager` a `@Singleton`, but moves the lookup out of the cache loader.

Benchmarks
Method
Synthetic data
Short story, without wasting time on charts: the new implementation was ~71% faster. Skip to the next section, though, since it's more representative.
Real Data
Modeled after 1 hour of production data.
That's not an improvement, it's a 23% drop! BUT this is with the default settings and <1ms DB latency. Tweaking the cache size from 10,000 entries to 50,000 (expect ~15MB of heap usage at full capacity) changes the picture.
Ok, that is an improvement, but not by much. Since latency is so low, the current-implementation cache stays busy enough to keep warm for the most part.
Now, let's introduce 10ms of latency, which is less than some real-world latencies depending on server and DB locations.
Aaaah, now the new fixes get to shine. That's a 74% reduction in processing time.
The real-world execution of the Noderoster event processors is intermittent. A production multi-server Concord instance (say, 5 servers) will see event processing switch between servers regularly, so the likelihood of the current-implementation host cache going cold is much higher. Couple that with cross-region latency for re-loading the cache, and it just can't keep up with a busy environment (with regard to `ANSIBLE` process events). The current-implementation cache never held more than ~5,000 entries while running.