Reuse MetricName objects because they are expensive to make. #1
This change removes the MetricName object allocation that happens on every metering or timing call into AbstractMetrics. The allocation rate can reach multiple GB/sec when Pinot pushes a QPS of 2000 queries/sec or more. Creating the MetricName objects also carries a locking cost: the underlying Yammer MetricName construction calls the JDK's getClassName method, which takes a synchronized lock on the class loader object, effectively creating a false contention problem.
The allocation and contention issues are resolved by caching up to 1000 MetricName objects in the high-performance Caffeine cache. It's important to use a capped cache here to avoid a memory leak in case there's a bug in the metric naming.
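A minimal sketch of the capped-cache idea described above. The actual change uses Caffeine (roughly `Caffeine.newBuilder().maximumSize(1000).build()`), but to keep this snippet dependency-free it substitutes an access-order LinkedHashMap as the bounded LRU; the class name, `getOrCreate`, and `buildMetricName` are hypothetical, and `Object` stands in for the Yammer MetricName type:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class MetricNameCache {
  private static final int MAX_ENTRIES = 1000; // capped to avoid leaks on metric-naming bugs

  // Access-order LinkedHashMap evicts the least recently used entry once the cap is hit.
  // The real change uses Caffeine, which avoids the single lock this synchronized
  // wrapper introduces.
  private final Map<String, Object> cache = Collections.synchronizedMap(
      new LinkedHashMap<String, Object>(MAX_ENTRIES, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
          return size() > MAX_ENTRIES;
        }
      });

  // In Pinot this would return a Yammer MetricName; Object keeps the sketch
  // self-contained.
  public Object getOrCreate(String fullName) {
    return cache.computeIfAbsent(fullName, this::buildMetricName);
  }

  // Hypothetical stand-in for the expensive MetricName construction
  // (the part that triggers the getClassName class-loader lock).
  private Object buildMetricName(String name) {
    return new Object();
  }
}
```

Repeated calls with the same metric name now return the cached object instead of allocating and locking each time.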
We tested various LRU cache implementations with a quick benchmark we put together here:
Results on a 4-core/8-thread Linux machine with JDK 8 (elapsed time, lower is better):

| Cache                   | 8 worker threads | 16 worker threads |
|-------------------------|------------------|-------------------|
| Guava                   | 42205 ms         | 90137 ms          |
| Map (ConcurrentHashMap) | 22649 ms         | 34901 ms          |
| Caffeine                | 11888 ms         | 27998 ms          |
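The benchmark link is omitted above, but a harness of this kind can be sketched as follows. This is an illustrative reconstruction, not the actual benchmark: it times only the plain ConcurrentHashMap variant (Guava and Caffeine are external dependencies), and the thread count, operation count, and key space are assumptions:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class CacheBench {
  static final int THREADS = 8;
  static final int OPS_PER_THREAD = 100_000; // illustrative workload size
  static final int KEY_SPACE = 1_000;        // matches the 1000-entry cap

  // Times THREADS workers hammering computeIfAbsent on a shared cache,
  // the same access pattern the metric lookup path would have.
  public static long timeMapCacheMillis() throws InterruptedException {
    ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();
    CountDownLatch done = new CountDownLatch(THREADS);
    long start = System.nanoTime();
    for (int t = 0; t < THREADS; t++) {
      new Thread(() -> {
        for (int i = 0; i < OPS_PER_THREAD; i++) {
          String key = "metric-" + (i % KEY_SPACE);
          cache.computeIfAbsent(key, k -> new Object()); // stands in for MetricName creation
        }
        done.countDown();
      }).start();
    }
    done.await();
    return (System.nanoTime() - start) / 1_000_000; // elapsed ms
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("Map cache: " + timeMapCacheMillis() + "ms");
  }
}
```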
Caffeine is as fast as or faster than a plain ConcurrentHashMap, and is the most suitable implementation for use as an LRU cache here.
Todo:
Performance measurement results:
Machine configuration:
- 4 cores (8 threads), Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz
- 32GB of RAM
- Linux x86-64, kernel 5.0.0-37-generic

Benchmark configuration:
- TPC-H (optimal index)
- 20 clients
- 180s runtime
Results (QPS higher is better, response time lower is better):

Baseline (fix-benchmark-client):
Time passed: 180.014s, queries executed: 513267, QPS: 2851.26, avg response time: 7.00ms

Improved (this branch):
Time passed: 180.014s, queries executed: 528540, QPS: 2936.10, avg response time: 6.80ms
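From the two runs above, the change works out to roughly a 3% throughput gain with a similar drop in average response time. A quick check using the full-precision figures from the benchmark output:

```java
public class ImprovementCheck {
  public static void main(String[] args) {
    // Full-precision figures as reported by the benchmark runs above.
    double baseQps = 2851.2615685446685;
    double newQps = 2936.104969613474;
    double baseRt = 7.001299518574154;
    double newRt = 6.800626253452908;

    double qpsGainPct = (newQps / baseQps - 1) * 100; // relative throughput gain
    double rtDropPct = (1 - newRt / baseRt) * 100;    // relative latency reduction

    System.out.printf("QPS gain: %.2f%%, response time drop: %.2f%%%n",
        qpsGainPct, rtDropPct);
  }
}
```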
Co-authored-by: @grcevski @charliegracie @macarte @adityamandaleeka