OAK-11545 - Use a single long-lived bulk processor for Elastic reindexing #2183

nfsantos · 2025-03-14T16:42:10Z

When writing to Elastic indexes, we use a BulkIngester (part of the Elastic Client Java API) to buffer operations before sending them to Elastic in a bulk request and to manage concurrent outgoing requests. This is essential for good performance. In the current implementation, each instance of ElasticIndexWriter creates a new ElasticBulkProcessorHandler. Since index writers are created per index, indexing more than one Elastic index creates several BulkIngesters. This is wasteful because a BulkIngester can write to several indexes, so a single instance can be shared by all ElasticIndexWriters.

This PR moves the creation of the ElasticBulkProcessor into the ElasticIndexerProvider class so this single instance can be shared by all the index writers created by that indexer provider. This has the following advantages:

- Decreases memory usage. Each bulk ingester contains buffers that hold the operations before they are sent to Elastic. These buffers can take up significant space, especially if they are enlarged to optimize the size of the bulk requests.
- Potentially improves performance, because the single BulkIngester fills up faster with operations from several indexes, leading to fewer but bigger bulk requests.
- In the incremental indexer, eliminates the cost of creating and destroying several BulkIngesters on every incremental cycle (every 5 seconds).

Technical details

Bulk ingester configuration is now global

The following properties used to configure the Bulk Ingester could be set per-index, in the index definition. Now they are global configuration properties:

| Bulk Ingester property | New system property | Default |
| --- | --- | --- |
| `maxOperations` | `oak.indexer.elastic.bulkProcessor.maxBulkOperations` | 8192 |
| `maxSize` | `oak.indexer.elastic.bulkProcessor.maxBulkSizeBytes` | 8MB |
| `flushInterval` | `oak.indexer.elastic.bulkProcessor.bulkFlushIntervalMs` | 2000 (2s) |
| `maxConcurrentRequests` | `oak.indexer.elastic.bulkProcessor.maxConcurrentRequests` | 1 |
| `failOnError` | `oak.indexer.elastic.bulkProcessor.failOnError` | true |
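As a rough illustration of how these global properties might be read, here is a minimal sketch that uses only the JDK. The class `BulkProcessorSettings` and its field names are hypothetical and are not taken from the Oak code base; the property names and defaults come from the table above, and "8MB" is assumed to mean 8 * 1024 * 1024 bytes.

```java
import java.time.Duration;

/** Hypothetical holder for the global bulk ingester settings; not the actual Oak class. */
final class BulkProcessorSettings {
    static final String PREFIX = "oak.indexer.elastic.bulkProcessor.";

    final int maxBulkOperations;
    final long maxBulkSizeBytes;
    final Duration bulkFlushInterval;
    final int maxConcurrentRequests;
    final boolean failOnError;

    private BulkProcessorSettings(int maxBulkOperations, long maxBulkSizeBytes,
                                  Duration bulkFlushInterval, int maxConcurrentRequests,
                                  boolean failOnError) {
        this.maxBulkOperations = maxBulkOperations;
        this.maxBulkSizeBytes = maxBulkSizeBytes;
        this.bulkFlushInterval = bulkFlushInterval;
        this.maxConcurrentRequests = maxConcurrentRequests;
        this.failOnError = failOnError;
    }

    /** Reads each setting from a system property, falling back to the defaults in the table. */
    static BulkProcessorSettings fromSystemProperties() {
        return new BulkProcessorSettings(
                Integer.getInteger(PREFIX + "maxBulkOperations", 8192),
                // Assumption: "8MB" is interpreted as 8 MiB.
                Long.getLong(PREFIX + "maxBulkSizeBytes", 8L * 1024 * 1024),
                Duration.ofMillis(Long.getLong(PREFIX + "bulkFlushIntervalMs", 2000L)),
                Integer.getInteger(PREFIX + "maxConcurrentRequests", 1),
                Boolean.parseBoolean(System.getProperty(PREFIX + "failOnError", "true")));
    }
}
```

With this sketch, starting the JVM with `-Doak.indexer.elastic.bulkProcessor.maxBulkOperations=500` would override only that one value, and the change would apply to every index because the settings are no longer read from the index definition.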

Changes in default values

The current defaults for the maximum number of operations per bulk and maximum bulk size are too small for good performance. This PR increases them to the following values:

- `oak.indexer.elastic.bulkProcessor.maxBulkOperations`: 250 -> 8192
- `oak.indexer.elastic.bulkProcessor.maxBulkSizeBytes`: 1MB -> 8MB

Closing writers without closing bulk ingester

The main complexity of this PR is closing an index writer without closing the underlying bulk ingester. Until now, each index writer contained its own bulk ingester, so the two had the same lifecycle and could be closed together. With a single long-lived bulk ingester shared by several writers, closing an index writer becomes more complex.

When closing a writer, we want to flush all the operations for that index. However, the bulk ingester's buffers hold operations for several indexes, and it cannot easily tell them apart. The solution is to force a flush of the bulk ingester whenever an index writer is closed, and then wait until every bulk request with an id lower than or equal to the one created by the flush has been processed. This also flushes operations for other indexes that are not being closed, but the only cost is a slightly longer time to close an index; it has no other drawbacks.
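The flush-then-wait idea can be sketched with plain JDK synchronization. This is not the PR's implementation and does not use the Elastic `BulkIngester` API: `SharedBulkBuffer`, its methods, and the sequential bulk ids are hypothetical stand-ins, and the sketch assumes bulk requests complete in id order (as they would with a single concurrent request).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a shared bulk ingester; not the Elastic BulkIngester API. */
final class SharedBulkBuffer {
    private final List<String> buffer = new ArrayList<>();
    private long lastCreatedBulkId = 0;   // id of the most recently created bulk request
    private long lastCompletedBulkId = 0; // assumption: requests complete in id order

    /** Buffers one operation; operations of all indexes share the same buffer. */
    synchronized void add(String indexName, String operation) {
        buffer.add(indexName + ":" + operation);
    }

    /** Turns everything buffered into one bulk request and returns the id to wait for. */
    synchronized long flush() {
        if (buffer.isEmpty()) {
            return lastCreatedBulkId; // nothing new to send; only earlier bulks matter
        }
        buffer.clear(); // a real ingester would send the operations here
        return ++lastCreatedBulkId;
    }

    /** Called by the (simulated) transport when a bulk request has been processed. */
    synchronized void onBulkCompleted(long bulkId) {
        lastCompletedBulkId = Math.max(lastCompletedBulkId, bulkId);
        notifyAll();
    }

    /** Waits until all bulks with id <= bulkId are processed; false on timeout or interrupt. */
    synchronized boolean awaitCompleted(long bulkId, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        try {
            while (lastCompletedBulkId < bulkId) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return false;
                }
                wait(remainingMs);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Closing a writer: force a flush, then wait for that bulk and every earlier one. */
    boolean closeIndex(String indexName, long timeoutMs) {
        long flushId = flush();
        return awaitCompleted(flushId, timeoutMs);
    }
}
```

Note that `closeIndex` ignores `indexName` when deciding what to flush: as described above, the flush covers operations of every index in the buffer, which is the accepted trade-off of this approach.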

fabriziofortino · 2025-03-21T14:47:16Z

.../java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticBulkProcessorHandler.java

-        if (totalOperations == 0) { // no need to invoke phaser await if we already know there were no operations
-            LOG.debug("No operations executed in this processor. Close immediately");
-            return false;
+    public boolean closeIndex(String indexName) throws IOException {


I think this method name is a bit misleading. No indexes are actually closed here. Should we call it something like flushIndex?

fabriziofortino · 2025-03-21T14:48:11Z

.../java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticBulkProcessorHandler.java

-            LOG.debug("No operations executed in this processor. Close immediately");
-            return false;
+    public boolean closeIndex(String indexName) throws IOException {
+        LOG.info("Closing index: {}", indexName);


I guess this is called at every async lane run. Should we lower this to debug/trace?

fabriziofortino · 2025-03-25T16:07:12Z

.../java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticBulkProcessorHandler.java

            try {
-                LOG.debug("Bulk with id {} processed in {} ms", executionId, response.took());
+                LOG.debug("Bulk with id {} processed in {} ms", executionId, response.took() / 1_000_000);


response.took() returns milliseconds. Dividing it by 1M would result in 0 most of the time.
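A small worked example of the arithmetic behind this comment, using only the JDK. The helper `TookUnits` and the sample value of 350 ms are hypothetical; the point is only that integer division by 1,000,000 is the right conversion for nanoseconds, not for a value already in milliseconds.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper illustrating the unit mix-up; not part of the PR. */
final class TookUnits {
    /** Treats the input as nanoseconds, which is what dividing by 1_000_000 implies. */
    static long divideAsIfNanos(long took) {
        return took / 1_000_000;
    }

    /** Correct conversion when the value really is in nanoseconds. */
    static long nanosToMillis(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }
}
```

For a bulk that took 350 ms, `TookUnits.divideAsIfNanos(350)` evaluates to 0 because of integer division, whereas a genuine nanosecond reading of 350,000,000 converts to 350 with either method.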

nfsantos added 10 commits March 14, 2025 17:40

Initial working implementation with new tests. Needs more testing and polish

3288baf

Merge remote-tracking branch 'upstream/trunk' into OAK-11545

59495d9

Add statistics logging.

15cd2de

Merge remote-tracking branch 'upstream/trunk' into OAK-11545

650e279

Fix tests.

8adcc05

Improve logging of statistics. Make names of Oak configuration properties consistent with names used by Bulk ingester.

Improve closing of indexers.

dd7545e

Minor refactor

341e6ff

Merge remote-tracking branch 'upstream/trunk' into OAK-11545

21ac02f

Add more documentation and reduce logging when closing index.

137f49f

Fixes

a4fadf2

nfsantos marked this pull request as ready for review March 18, 2025 17:04

nfsantos added 2 commits March 19, 2025 09:58

Merge remote-tracking branch 'upstream/trunk' into OAK-11545

2c51852

Change names of instance-level constants from all upper case to camel case, to be consistent with naming conventions.

0bd665c

fabriziofortino reviewed Mar 19, 2025

nfsantos added 5 commits March 19, 2025 15:33

Merge remote-tracking branch 'upstream/trunk' into OAK-11545

05cd2d6

Fix typo

36d8a88

Merge remote-tracking branch 'upstream/trunk' into OAK-11545

a8cdaa5

Reduce logging level

5938a09

Merge remote-tracking branch 'upstream/trunk' into OAK-11545

6cc5ee3

fabriziofortino approved these changes Mar 21, 2025

fabriziofortino reviewed Mar 25, 2025
