Background
Reindex allows users to create new indexes from data that is already in Elasticsearch. This is especially useful for moving to semantic search, because users often have already implemented text search and want to embed their existing data into a new index. Unfortunately, reindex has flaws that make it difficult or impossible to use for larger datasets and when using machine learning models to produce embeddings.
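For orientation, the flow this issue is about: reindex from an existing text index into a new index through an ingest pipeline that runs an inference processor. A minimal sketch, assuming hypothetical index names (my-text-index, my-semantic-index), pipeline name (embeddings-pipeline), and model id (my-embedding-model):

```
// Hypothetical ingest pipeline that adds embeddings via the inference processor
PUT _ingest/pipeline/embeddings-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my-embedding-model"
      }
    }
  ]
}

// Reindex the existing text data through that pipeline
POST _reindex
{
  "source": { "index": "my-text-index" },
  "dest": { "index": "my-semantic-index", "pipeline": "embeddings-pipeline" }
}
```

Every problem listed below occurs somewhere in this flow.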
Problems
Resiliency - Issues with failures and errors
- Reindex is not resilient to failures or node restarts (meta issue with lots of details: Reindex resiliency #42612)
- Unless the inference processor is explicitly configured to record failures (configuration which isn't there by default), any errors will not be recorded anywhere; see the sketch after this list
- Bulk request retries silently drop the ingest pipeline when retrying an individual failed response (Bulk request retry silently drops pipeline when retrying individual response that failed #78792)
- A single error can cause the reindex to stop completely (Reindex API : improve robustness in case of error #22471)
- Different types of failures are handled in the same way (Split reindex and update by query failures exposed on the REST interface #17539)
- Failure responses from the API call can be very large (in some cases reindex can produce very large failures in API response #20199)
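To make the second bullet concrete: inference errors are only captured if the pipeline author adds an on_failure handler themselves. A sketch of that extra configuration, reusing the hypothetical pipeline above (the inference_failure field name is also hypothetical; _ingest.on_failure_message is standard ingest metadata):

```
PUT _ingest/pipeline/embeddings-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my-embedding-model",
        // without this handler, inference errors are not recorded anywhere
        "on_failure": [
          {
            "set": {
              "field": "inference_failure",
              "value": "{{_ingest.on_failure_message}}"
            }
          }
        ]
      }
    }
  ]
}
```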
Issues with size
- Batch sizes that are incompatible with the inference queue size can cause the reindex to fail with inference processor errors when the queue is exceeded ([ML] Handling the NLP model 'the inference process queue is full' error #85319, Potential issues with rejections and starvation in the Inference Processor #103665)
- Large documents can cause out-of-memory errors (Reindex - option for batch memory size not just document count #57535, Can not Reindex from remote even with 1 batch size - Remote responded with a chunk that was too large. Use a smaller batch size. #73261, Reindex - provide option to specify batch size in bytes #90195)
- Due to automatic chunking with semantic_text, it is now possible to exceed the inference queue even with small reindex batch sizes (see the sketch after this list)
- Using the scroll API with large documents can cause excessive time spent on GC on ML nodes and can make those nodes unresponsive
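The only knob reindex exposes for any of this today is the per-batch document count (source.size); there is no byte-based limit, and with semantic_text chunking even a small document count can fan out into many inference requests. A sketch of the current workaround, assuming the hypothetical names above:

```
POST _reindex
{
  "source": {
    "index": "my-text-index",
    "size": 10   // batch size in documents; no option to cap batch size in bytes
  },
  "dest": {
    "index": "my-semantic-index",
    "pipeline": "embeddings-pipeline"
  }
}
```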
Issues with performance
- Reindexing is slow (Speed up reindex and friends #76978)
- Throttling doesn't work under some circumstances (fix reindex throttle wait time #67932); see the sketch after this list
- Per-batch latency is also high because each batch makes multiple sequential API calls on a single thread, which causes performance issues for ML
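For reference, throttling is requested through the requests_per_second query parameter and can be changed on a running task via _rethrottle; #67932 is about this mechanism misbehaving. A sketch (the task id is hypothetical):

```
// Start an asynchronous, throttled reindex
POST _reindex?requests_per_second=100&wait_for_completion=false
{
  "source": { "index": "my-text-index" },
  "dest": { "index": "my-semantic-index", "pipeline": "embeddings-pipeline" }
}

// Adjust the throttle on the running task (-1 disables throttling)
POST _reindex/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1
```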
Issues with scroll
- It's possible to hit the scroll limit if you have a lot of shards (Empty scroll contexts don't count #86407)
- Scroll keeps results in memory for a fixed keep-alive period that isn't tied to the completion of the reindex (see the sketch after this list)
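Both points trace back to settings that are independent of reindex progress: the scroll keep-alive (the scroll query parameter) and the cluster-wide limit on open scroll contexts (search.max_open_scroll_context). A sketch of the two knobs, with hypothetical values:

```
// Extend the keep-alive for a long-running reindex; contexts are held in
// memory for this long regardless of whether the reindex has finished
POST _reindex?scroll=30m
{
  "source": { "index": "my-text-index" },
  "dest": { "index": "my-semantic-index" }
}

// Cluster-wide cap on open scroll contexts; easy to hit with many shards
PUT _cluster/settings
{
  "persistent": { "search.max_open_scroll_context": 1000 }
}
```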