OAK-11545 - Use a single long-lived bulk processor for Elastic reindexing #2183
base: trunk
Improve logging of statistics. Make names of Oak configuration properties consistent with names used by Bulk ingester.
… case, to be consistent with naming conventions.
.../java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticBulkProcessorHandler.java
if (totalOperations == 0) { // no need to invoke phaser await if we already know there were no operations
LOG.debug("No operations executed in this processor. Close immediately");
return false;
public boolean closeIndex(String indexName) throws IOException {
I think this method name is a bit misleading. No indexes are actually closed here. Should we call it something like flushIndex?
LOG.debug("No operations executed in this processor. Close immediately");
return false;
public boolean closeIndex(String indexName) throws IOException {
LOG.info("Closing index: {}", indexName);
I guess this is called at every async lane run. Should we lower this to debug/trace?
try {
LOG.debug("Bulk with id {} processed in {} ms", executionId, response.took());
LOG.debug("Bulk with id {} processed in {} ms", executionId, response.took() / 1_000_000);
response.took() returns milliseconds. Dividing it by 1M would result in 0 most of the time.
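To illustrate the point above, here is a minimal sketch of why the division truncates to 0. It assumes, as the comment states, that the value is already in milliseconds; the class and method names are hypothetical and not part of the PR.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only, not code from the PR.
final class TookConversion {

    // Treats a millisecond value as if it were nanoseconds.
    // Integer division truncates, so anything below 1,000,000 ms becomes 0.
    static long wronglyScaled(long tookMillis) {
        return tookMillis / 1_000_000;
    }

    // The division is only appropriate when the input really is in nanoseconds.
    static long nanosToMillis(long tookNanos) {
        return TimeUnit.NANOSECONDS.toMillis(tookNanos);
    }
}
```

With these definitions, a bulk that took 350 ms would be logged as 0 ms by the first method, while the second method turns 350,000,000 ns into 350 ms.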
When writing to Elastic indexes, we use a BulkIngester (part of the Elastic Client Java API) to buffer operations before sending them to Elastic in a bulk request and to manage concurrent outgoing requests. This is essential to have good performance. In the current implementation, each instance of ElasticIndexWriter creates a new ElasticBulkProcessorHandler. Since index writers are created per-index, when indexing more than one Elastic index we are creating several BulkIngesters. This is wasteful because a BulkIngester can be used to write to several indexes, so a single instance can be shared by ElasticIndexWriters.
This PR moves the creation of the ElasticBulkProcessor into the ElasticIndexerProvider class so this single instance can be shared by all the index writers created by that indexer provider. This has the following advantages:
- BulkIngester would fill up faster with operations from several indexes, leading to fewer but bigger bulk requests.
Technical details
Bulk ingester configuration is now global
The following properties used to configure the Bulk Ingester could be set per-index, in the index definition. Now they are global configuration properties:
maxOperations            oak.indexer.elastic.bulkProcessor.maxBulkOperations
maxSize                  oak.indexer.elastic.bulkProcessor.maxBulkSizeBytes
flushInterval            oak.indexer.elastic.bulkProcessor.bulkFlushIntervalMs
maxConcurrentRequests    oak.indexer.elastic.bulkProcessor.maxConcurrentRequests
failOnError              oak.indexer.elastic.bulkProcessor.failOnError
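As a rough sketch of what a global lookup of two of these properties could look like: the property names come from the list above and the fallback values are the new defaults proposed in this PR. Everything else is an assumption made for illustration, including the class name, reading from a java.util.Properties object, and interpreting 8MB as 8 * 1024 * 1024 bytes. The PR itself may read these settings through a different mechanism.

```java
import java.util.Properties;

// Hypothetical settings holder, not the class used by the PR.
final class BulkProcessorSettings {
    static final String MAX_BULK_OPERATIONS = "oak.indexer.elastic.bulkProcessor.maxBulkOperations";
    static final String MAX_BULK_SIZE_BYTES = "oak.indexer.elastic.bulkProcessor.maxBulkSizeBytes";

    // Defaults proposed by the PR; 8MB is assumed to mean 8 * 1024 * 1024 bytes.
    static final int DEFAULT_MAX_BULK_OPERATIONS = 8192;
    static final long DEFAULT_MAX_BULK_SIZE_BYTES = 8L * 1024 * 1024;

    final int maxBulkOperations;
    final long maxBulkSizeBytes;

    private BulkProcessorSettings(int maxBulkOperations, long maxBulkSizeBytes) {
        this.maxBulkOperations = maxBulkOperations;
        this.maxBulkSizeBytes = maxBulkSizeBytes;
    }

    // Reads both values once, falling back to the defaults when a key is absent.
    static BulkProcessorSettings from(Properties props) {
        int ops = Integer.parseInt(
                props.getProperty(MAX_BULK_OPERATIONS, String.valueOf(DEFAULT_MAX_BULK_OPERATIONS)));
        long size = Long.parseLong(
                props.getProperty(MAX_BULK_SIZE_BYTES, String.valueOf(DEFAULT_MAX_BULK_SIZE_BYTES)));
        return new BulkProcessorSettings(ops, size);
    }
}
```

Calling BulkProcessorSettings.from(System.getProperties()) would pick up values passed with -D on the command line. Because the settings are read once and shared, every index written through the same ingester sees the same limits.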
Changes in default values
The current defaults for the maximum number of operations per bulk and maximum bulk size are too small for good performance. This PR increases them to the following values:
oak.indexer.elastic.bulkProcessor.maxBulkOperations: 250 -> 8192
oak.indexer.elastic.bulkProcessor.maxBulkSizeBytes: 1MB -> 8MB
Closing writers without closing bulk ingester
The main complexity of this PR is closing an index writer without closing the underlying bulk ingester. Until now, each index writer contained its own bulk ingester, so the two had the same lifecycle and could be closed together. With a single long-lived bulk ingester shared by several writers, closing an index writer becomes more complex.
When closing a writer, we want to flush all the operations for that index. However, the bulk ingester holds operations for several indexes in its buffers and cannot easily tell which operations belong to which index. The solution used is to force a flush of the bulk ingester whenever an index writer is closed, and then wait until all bulk requests with an id lower than or equal to that of the request created by the flush have been processed. This also flushes operations for other indexes that are not being closed, but the only cost is that closing an index takes slightly longer; it causes no other problems.
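The wait described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: the class BulkFlushTracker and its methods are hypothetical, bulk request ids are assumed to increase monotonically, and the two notification methods are assumed to be called from whatever callbacks report that a bulk request was created and completed.

```java
import java.util.TreeSet;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not the PR's actual class: tracks in-flight bulk requests by id.
final class BulkFlushTracker {
    private final Object lock = new Object();
    private final TreeSet<Long> inFlight = new TreeSet<>();
    private long lastId = 0;

    // Registers a new bulk request and returns its id (ids only ever grow).
    long onBulkCreated() {
        synchronized (lock) {
            long id = ++lastId;
            inFlight.add(id);
            return id;
        }
    }

    // Marks a bulk request as processed and wakes up any thread waiting in awaitUpTo.
    void onBulkCompleted(long id) {
        synchronized (lock) {
            inFlight.remove(id);
            lock.notifyAll();
        }
    }

    // Blocks until no in-flight request has an id <= flushId, or the timeout expires.
    // Returns true if everything up to flushId was processed, false on timeout or interrupt.
    boolean awaitUpTo(long flushId, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            // The smallest in-flight id tells us whether anything at or below flushId is pending.
            while (!inFlight.isEmpty() && inFlight.first() <= flushId) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    return false;
                }
                try {
                    lock.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return false;
                }
            }
            return true;
        }
    }
}
```

In this sketch, closing a writer would trigger a flush, record the id of the resulting bulk request, and call awaitUpTo with that id. Requests created afterwards by other writers have higher ids and do not delay the close.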