Description
Is your feature request related to a problem?
With raise_on_error=False, the bulk (and streaming_bulk) helpers yield only the response for each indexing operation. When a document fails to index, that response doesn't carry enough information to tell which document failed.
An example of what is currently yielded:
{
    'index': {
        '_index': 'logs-2025.09.10-000001',
        '_id': 'JQn6MpkBL_dyks7LFLJw',
        'status': 400,
        'error': {
            'type': 'mapper_parsing_exception',
            'reason': "failed to parse field [resource] of type [keyword] in document with id 'JQn6MpkBL_dyks7LFLJw'. Preview of field's value: '{resourceId=..., resourceType=AWS::EC2::Instance}'",
            'caused_by': {
                'type': 'illegal_state_exception',
                'reason': "Can't get text on a START_OBJECT at 1:387"
            }
        }
    }
}
The only identifying information here is the document id that OpenSearch generated.
We want to track the documents that fail so we can retry certain errors or patch the data or mappings. We would also like to keep using these bulk helpers.
What solution would you like?
We would like an option in streaming_bulk (for example, yield_data) that also yields the submitted data back with each result.
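To make the request concrete, here is a rough sketch of the behaviour from the caller's side, written as a wrapper rather than a change to the helper itself. The name streaming_bulk_with_data and the bulk_fn parameter are hypothetical and not part of opensearch-py. It assumes the wrapped helper yields exactly one (ok, item) result per action, in the order the actions were consumed, which should be checked against the helper's retry behaviour before relying on it.

```python
from collections import deque
from typing import Any, Callable, Iterable, Iterator, Tuple

Result = Tuple[bool, dict]


def streaming_bulk_with_data(
    actions: Iterable[Any],
    bulk_fn: Callable[[Iterable[Any]], Iterable[Result]],
) -> Iterator[Tuple[bool, dict, Any]]:
    """Yield (ok, item, action) instead of (ok, item).

    Hypothetical wrapper. Assumes bulk_fn yields one result per action,
    in the same order in which it consumed the actions.
    """
    pending = deque()  # actions handed to bulk_fn but not yet reported on

    def recording(source: Iterable[Any]) -> Iterator[Any]:
        # Remember each action at the moment the helper pulls it.
        for action in source:
            pending.append(action)
            yield action

    for ok, item in bulk_fn(recording(actions)):
        # Pair the oldest unreported action with this result.
        yield ok, item, pending.popleft()


# Intended use (assumes an existing opensearch-py client named `client`):
#
#   from opensearchpy import helpers
#   results = streaming_bulk_with_data(
#       docs,
#       lambda acts: helpers.streaming_bulk(client, acts, raise_on_error=False),
#   )
#   failed = [(doc, item) for ok, item, doc in results if not ok]
```

A built-in option would avoid the ordering assumption, since the helper already knows which action produced which response.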
What alternatives have you considered?
One workaround is providing my own document ids in the bulk payload, but this incurs a performance hit while indexing. The other workaround is handling the exceptions raised when raise_on_error=True, but that has trade-offs: retries don't seem to happen, and from what I can tell, any chunks after the one that raised are never sent.
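For reference, the first workaround looks roughly like the sketch below. The helper names with_client_ids and failed_documents and the registry dict are hypothetical, used here only for illustration. It assumes each yielded item has the single-key shape shown in the example above (operation type mapping to details that include _id).

```python
import uuid
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def with_client_ids(
    docs: Iterable[dict], index: str, registry: Dict[str, dict]
) -> Iterator[dict]:
    """Wrap each document in a bulk action with a client-generated _id."""
    for doc in docs:
        doc_id = uuid.uuid4().hex
        registry[doc_id] = doc  # remember the document under its id
        yield {"_index": index, "_id": doc_id, "_source": doc}


def failed_documents(
    results: Iterable[Tuple[bool, dict]], registry: Dict[str, dict]
) -> Iterator[Tuple[Optional[dict], Any]]:
    """Yield (document, error) for every failed operation."""
    for ok, item in results:
        # item looks like {"index": {"_id": ..., "status": ..., "error": ...}}
        ((_op_type, details),) = item.items()
        # Drop the entry either way so the registry does not keep growing.
        doc = registry.pop(details["_id"], None)
        if not ok:
            yield doc, details.get("error")
```

This works, but it is the approach that carries the indexing cost described above, which is why an option in the helper would be preferable.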
Do you have any additional context?
Open to submitting a PR with some kind of solution here.