Description
Elasticsearch Version
8.14.1
Installed Plugins
analysis-icu, analysis-kuromoji, analysis-nori, analysis-smartcn, analysis-stempel, analysis-ukrainian, ltr, mapper-size, repository-hdfs
Java Version
bundled
OS Version
Debian GNU/Linux 11 (bullseye)
Problem Description
We upgraded our main cluster this past week from 8.6.1 to 8.14.1 and it was a lot rockier than we expected.
The cluster has:
- 3 master nodes (one per data center)
- 96 nodes (32 per data center): 8 physical servers in each DC, each running 4 ES instances (not in VMs)
- ~8 billion documents across 4820 primary shards, each with 2 replicas so that every DC holds a copy
Our upgrade procedure (which we've used many times in the past) is to do a rolling restart of 4-8 nodes at a time, waiting for the cluster to go back to green in between. The last time we did this (last year) it took about 30 minutes.
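For context, a stripped-down sketch of what one restart batch looks like (the standard allocation-disable and health-check steps; localhost:9200 and plain curl are just placeholders for our actual orchestration tooling):

# Disable replica allocation before stopping a batch of nodes
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'

# ... stop the nodes, install the new version, start them again ...

# Re-enable allocation and wait for the cluster to return to green
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": null}}'
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m'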
This time the nodes we restarted immediately went to 100% CPU and became unresponsive. That broke our stats/Grafana monitoring of ES and made applying cluster updates or debugging quite difficult. The CPU usage came entirely from a huge spike of merge activity. We were eventually able to complete the upgrade over the course of 12 hours by going much more slowly, restarting only one node per server at a time rather than doing batches. That node is effectively dead while the merges run, but we have enough redundancy to handle it. The load graph across all nodes tells the story pretty well:
We are pretty certain that the cause of this behavior was the Lucene change that lowered the default allowed percentage of deleted docs from 33% to 20%: apache/lucene#11761. That change was then picked up in ES in https://github.com/elastic/elasticsearch/pull/93188/files#diff-4d10666b14a73c580bcbf7f20ec482e7c661878caac568c720391e0fde8efe6aR113
I think those are good changes: we saw disk usage drop from 48% to 42%, and our overall search latency decreased (median 24.6ms -> 16.3ms; 99th percentile 64.2ms -> 52.9ms).
Given the decrease in disk usage, and the fact that our applications constantly delete and update documents, the deleted-docs percentage change makes sense to me as the root cause. Even after the upgrade the cluster has a high number of deleted docs:
"docs": {
"count": 8192053102,
"deleted": 1220829287,
"total_size_in_bytes": 49435305941551
},
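In case it is useful, the per-index breakdown of those deleted documents can be pulled from the cat indices API (host is a placeholder):

# List indices with their deleted-document counts, biggest offenders first
curl -s 'http://localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size&s=docs.deleted:desc'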
We attempted to set "index.merge.scheduler.max_thread_count": 1 to reduce the CPU consumed by merging, but it made no difference. And once things are this broken, it is hard to apply these kinds of settings changes at all.
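For reference, this is roughly how we applied that setting (dynamically, across all indices). A related idea we have not tried would be to raise index.merge.policy.deletes_pct_allowed back toward the old 33% default before upgrading, assuming that value is still accepted by this expert setting in 8.14:

# What we tried: cap merge threads per shard (no observable effect)
curl -s -X PUT 'http://localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.merge.scheduler.max_thread_count": 1}'

# Untested idea: keep the old deleted-docs threshold so the upgrade does not
# kick off a wave of catch-up merges
curl -s -X PUT 'http://localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.merge.policy.deletes_pct_allowed": 33}'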
I think the bug here is that there is no way to control the amount of CPU consumed by this post-upgrade merge activity. Ideally the merges would have been spread out over time, but even failing that, there doesn't seem to be any way to throttle these kinds of merges.
Steps to Reproduce
- Create an index on ES 8.6.1
- Trigger lots of update operations on that index so that the number of deleted documents approaches 33% of the index (see the sketch after this list)
- Upgrade to ES 8.14.1
- Watch the CPU spike.
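A rough sketch of step 2 (index name, document count, and payload are made up; the point is just to overwrite the same document IDs repeatedly so that the old versions pile up as deleted docs):

# Index 100k documents, then overwrite each of them once; every overwrite
# leaves the previous version behind as a deleted document until merges reclaim it
for round in 1 2; do
  for i in $(seq 1 100000); do
    curl -s -X PUT "http://localhost:9200/test-index/_doc/$i" \
      -H 'Content-Type: application/json' \
      -d "{\"round\": $round}" > /dev/null
  done
done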
Logs (if relevant)
No response