Description
Describe the bug
When using gpt-3.5 in the extract_properties transform, the resulting 'entity' object doesn't always have the same schema: the types of the fields within 'entity' can differ across documents. This causes an error when indexing into OpenSearch, which expects each field to have a consistent datatype. The inconsistency also occurs in extract_schema, but it doesn't cause issues there since it's a single task whose results are applied to all documents.
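To make the mismatch concrete, here is a minimal sketch of how one could check a batch of 'entity' dicts for fields that show up with more than one type. The helper name `find_type_conflicts` and the sample field names are hypothetical and not part of Sycamore; the sketch only assumes that each entity is a plain Python dict.

```python
from collections import defaultdict


def find_type_conflicts(entities):
    """Return {field: sorted type names} for fields seen with more than one type.

    Hypothetical helper, not a Sycamore API.
    """
    seen = defaultdict(set)
    for entity in entities:
        for field, value in entity.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items() if len(types) > 1}


# Made-up entities standing in for LLM output on two documents.
entities = [
    {"title": "Report A", "count": 0},
    {"title": "Report B", "count": "none"},
]
conflicts = find_type_conflicts(entities)
print(conflicts)
```

Any field reported here is a candidate for the kind of datatype error described above when the documents are indexed.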
To Reproduce
Steps to reproduce the behavior:
- Run the metadata-extraction notebook (https://github.com/aryn-ai/sycamore/blob/a39eced00c884e0f50f33eefa7b009b5f9923249/notebooks/metadata-extraction.ipynb) a few times
- Some runs fail intermittently with a datatype error when indexing into OpenSearch
Expected behavior
extract_properties should produce an 'entity' object in each record, with every entity object having the same set of fields with the same types.
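One possible way to get this behavior, sketched here as a post-processing step rather than a fix inside the transform itself, is to coerce every entity to a fixed field-to-type mapping before indexing. The names `coerce_entity` and `SCHEMA` and the example fields are assumptions for illustration only. In this sketch, values that cannot be cast become None, and fields outside the schema are dropped.

```python
def coerce_entity(entity, schema):
    """Return a new dict with exactly the schema's fields, each cast to its target type.

    Hypothetical helper, not a Sycamore API. Missing or uncastable values become None.
    """
    result = {}
    for field, target in schema.items():
        value = entity.get(field)
        if value is None:
            result[field] = None
            continue
        try:
            result[field] = target(value)
        except (TypeError, ValueError):
            result[field] = None
    return result


# Assumed target schema: field name -> Python type used as a cast.
SCHEMA = {"title": str, "count": int}

clean = coerce_entity({"title": "Report A", "count": "3", "extra": 1}, SCHEMA)
partial = coerce_entity({"title": 5}, SCHEMA)
invalid = coerce_entity({"title": "Report B", "count": "many"}, SCHEMA)
```

With this approach every record carries the same keys, and each non-null value has the type the index expects.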
Screenshots
n/a
Desktop (please complete the following information):
- Unrelated
Smartphone (please complete the following information):
- Unrelated
Additional context
Stack trace attached
datatype_error.txt