Background
Reindex allows users to create new indexes from data that is already in Elasticsearch. This is especially useful for moving to semantic search, because users often have already implemented text search and want to embed their existing data into a new index. Unfortunately, reindex has flaws that make it difficult or impossible to use for larger datasets and when using machine learning models to produce embeddings.
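For orientation, the flow this issue is about: reindex from an existing text index into a new index through an ingest pipeline that runs an inference processor. A minimal sketch, assuming hypothetical index names (my-text-index, my-semantic-index), pipeline name (embeddings-pipeline), and model id (my-embedding-model):

```
// Hypothetical ingest pipeline that adds embeddings via the inference processor
PUT _ingest/pipeline/embeddings-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my-embedding-model"
      }
    }
  ]
}

// Reindex the existing text data through that pipeline
POST _reindex
{
  "source": { "index": "my-text-index" },
  "dest": { "index": "my-semantic-index", "pipeline": "embeddings-pipeline" }
}
```

Every problem listed below occurs somewhere in this flow.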
Problems
Resiliency - Issues with failures and errors
- Reindex is not resilient to failures or node restarts (meta issue with lots of details: Reindex resiliency #42612)
- Unless the inference processor is explicitly configured to record failures (configuration which isn't there by default), any errors will not be recorded anywhere; see the sketch after this list
- Bulk request retries silently drop the ingest pipeline when retrying an individual failed response (Bulk request retry silently drops pipeline when retrying individual response that failed #78792)
- A single error can cause the reindex to stop completely (Reindex API : improve robustness in case of error #22471)
- Different types of failures are handled in the same way (Split reindex and update by query failures exposed on the REST interface #17539)
- Failure responses from the API call can be very large (in some cases reindex can produce very large failures in API response #20199)
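To make the second bullet concrete: inference errors are only captured if the pipeline author adds an on_failure handler themselves. A sketch of that extra configuration, reusing the hypothetical pipeline above (the inference_failure field name is also hypothetical; _ingest.on_failure_message is standard ingest metadata):

```
PUT _ingest/pipeline/embeddings-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my-embedding-model",
        // without this handler, inference errors are not recorded anywhere
        "on_failure": [
          {
            "set": {
              "field": "inference_failure",
              "value": "{{_ingest.on_failure_message}}"
            }
          }
        ]
      }
    }
  ]
}
```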
Issues with size
- Batch sizes that are incompatible with the inference queue size can cause the reindex to fail with inference processor errors when the queue is exceeded ([ML] Handling the NLP model 'the inference process queue is full' error #85319, Potential issues with rejections and starvation in the Inference Processor #103665)
- Large documents can cause out-of-memory errors (Reindex - option for batch memory size not just document count #57535, Can not Reindex from remote even with 1 batch size - Remote responded with a chunk that was too large. Use a smaller batch size. #73261, Reindex - provide option to specify batch size in bytes #90195)
- Due to automatic chunking with semantic_text, it is now possible to exceed the inference queue even with small reindex batch sizes (see the sketch after this list)
- Using the scroll API with large documents can cause excessive time spent on GC on ML nodes and can make those nodes unresponsive
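The only knob reindex exposes for any of this today is the per-batch document count (source.size); there is no byte-based limit, and with semantic_text chunking even a small document count can fan out into many inference requests. A sketch of the current workaround, assuming the hypothetical names above:

```
POST _reindex
{
  "source": {
    "index": "my-text-index",
    "size": 10   // batch size in documents; no option to cap batch size in bytes
  },
  "dest": {
    "index": "my-semantic-index",
    "pipeline": "embeddings-pipeline"
  }
}
```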
Issues with performance
- Reindexing is slow (Speed up reindex and friends #76978)
- Throttling doesn't work under some circumstances (fix reindex throttle wait time #67932); see the sketch after this list
- Per-batch latency is also high because each batch makes multiple sequential API calls on a single thread, which causes performance issues for ML
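For reference, throttling is requested through the requests_per_second query parameter and can be changed on a running task via _rethrottle; #67932 is about this mechanism misbehaving. A sketch (the task id is hypothetical):

```
// Start an asynchronous, throttled reindex
POST _reindex?requests_per_second=100&wait_for_completion=false
{
  "source": { "index": "my-text-index" },
  "dest": { "index": "my-semantic-index", "pipeline": "embeddings-pipeline" }
}

// Adjust the throttle on the running task (-1 disables throttling)
POST _reindex/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1
```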
Issues with scroll
- It's possible to hit the scroll limit if you have a lot of shards (Empty scroll contexts don't count #86407)
- Scroll keeps results in memory for a fixed keep-alive period that isn't tied to the completion of the reindex (see the sketch after this list)
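Both points trace back to settings that are independent of reindex progress: the scroll keep-alive (the scroll query parameter) and the cluster-wide limit on open scroll contexts (search.max_open_scroll_context). A sketch of the two knobs, with hypothetical values:

```
// Extend the keep-alive for a long-running reindex; contexts are held in
// memory for this long regardless of whether the reindex has finished
POST _reindex?scroll=30m
{
  "source": { "index": "my-text-index" },
  "dest": { "index": "my-semantic-index" }
}

// Cluster-wide cap on open scroll contexts; easy to hit with many shards
PUT _cluster/settings
{
  "persistent": { "search.max_open_scroll_context": 1000 }
}
```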