Reuse MetricName objects because they are expensive to make. #1
This change removes the MetricName object allocation that happens on every metering or timing call into AbstractMetrics. The allocation rate can reach multiple GB/sec when Pinot pushes a QPS of 2000 queries/sec or more. Creating the MetricName objects also carries a locking cost: the underlying Yammer MetricName construction calls the JDK's getClassName method, which takes a synchronized lock on the class loader object, effectively creating a false contention problem.
The allocation and contention issues are resolved by caching up to 1000 MetricName objects in the high-performance Caffeine cache. It's important to use a capped cache here to avoid a memory leak in case there's a bug in the metric naming.
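A minimal sketch of the capped-cache idea described above. The actual change uses Caffeine (roughly `Caffeine.newBuilder().maximumSize(1000).build()`), but to keep this snippet dependency-free it substitutes an access-order LinkedHashMap as the bounded LRU; the class name, `getOrCreate`, and `buildMetricName` are hypothetical, and `Object` stands in for the Yammer MetricName type:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class MetricNameCache {
  private static final int MAX_ENTRIES = 1000; // capped to avoid leaks on metric-naming bugs

  // Access-order LinkedHashMap evicts the least recently used entry once the cap is hit.
  // The real change uses Caffeine, which avoids the single lock this synchronized
  // wrapper introduces.
  private final Map<String, Object> cache = Collections.synchronizedMap(
      new LinkedHashMap<String, Object>(MAX_ENTRIES, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
          return size() > MAX_ENTRIES;
        }
      });

  // In Pinot this would return a Yammer MetricName; Object keeps the sketch
  // self-contained.
  public Object getOrCreate(String fullName) {
    return cache.computeIfAbsent(fullName, this::buildMetricName);
  }

  // Hypothetical stand-in for the expensive MetricName construction
  // (the part that triggers the getClassName class-loader lock).
  private Object buildMetricName(String name) {
    return new Object();
  }
}
```

Repeated calls with the same metric name now return the cached object instead of allocating and locking each time.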
We tested various LRU cache implementations with a quick benchmark we put together here:
Results on a 4-core/8-thread Linux machine with JDK 8 (elapsed time, lower is better):

| Cache                   | 8 worker threads | 16 worker threads |
|-------------------------|------------------|-------------------|
| Guava                   | 42205 ms         | 90137 ms          |
| Map (ConcurrentHashMap) | 22649 ms         | 34901 ms          |
| Caffeine                | 11888 ms         | 27998 ms          |
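The benchmark link is omitted above, but a harness of this kind can be sketched as follows. This is an illustrative reconstruction, not the actual benchmark: it times only the plain ConcurrentHashMap variant (Guava and Caffeine are external dependencies), and the thread count, operation count, and key space are assumptions:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class CacheBench {
  static final int THREADS = 8;
  static final int OPS_PER_THREAD = 100_000; // illustrative workload size
  static final int KEY_SPACE = 1_000;        // matches the 1000-entry cap

  // Times THREADS workers hammering computeIfAbsent on a shared cache,
  // the same access pattern the metric lookup path would have.
  public static long timeMapCacheMillis() throws InterruptedException {
    ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();
    CountDownLatch done = new CountDownLatch(THREADS);
    long start = System.nanoTime();
    for (int t = 0; t < THREADS; t++) {
      new Thread(() -> {
        for (int i = 0; i < OPS_PER_THREAD; i++) {
          String key = "metric-" + (i % KEY_SPACE);
          cache.computeIfAbsent(key, k -> new Object()); // stands in for MetricName creation
        }
        done.countDown();
      }).start();
    }
    done.await();
    return (System.nanoTime() - start) / 1_000_000; // elapsed ms
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("Map cache: " + timeMapCacheMillis() + "ms");
  }
}
```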
Caffeine is as fast as or faster than a plain ConcurrentHashMap, and is the most suitable implementation for use as an LRU cache here.
Todo:
Performance measurement results:
Machine configuration:
- 4 cores (8 threads), Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz
- 32GB of RAM
- Linux x86-64, kernel 5.0.0-37-generic

Benchmark configuration:
- TPC-H (optimal index)
- 20 clients
- 180s runtime
Results (QPS higher is better, response time lower is better):

Baseline (fix-benchmark-client):
Time passed: 180.014s, queries executed: 513267, QPS: 2851.26, avg response time: 7.00ms

Improved (this branch):
Time passed: 180.014s, queries executed: 528540, QPS: 2936.10, avg response time: 6.80ms
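From the two runs above, the change works out to roughly a 3% throughput gain with a similar drop in average response time. A quick check using the full-precision figures from the benchmark output:

```java
public class ImprovementCheck {
  public static void main(String[] args) {
    // Full-precision figures as reported by the benchmark runs above.
    double baseQps = 2851.2615685446685;
    double newQps = 2936.104969613474;
    double baseRt = 7.001299518574154;
    double newRt = 6.800626253452908;

    double qpsGainPct = (newQps / baseQps - 1) * 100; // relative throughput gain
    double rtDropPct = (1 - newRt / baseRt) * 100;    // relative latency reduction

    System.out.printf("QPS gain: %.2f%%, response time drop: %.2f%%%n",
        qpsGainPct, rtDropPct);
  }
}
```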
Co-authored-by: @grcevski @charliegracie @macarte @adityamandaleeka