[Bug] Inconsistent data types from extract_properties #234

@baitsguy

Describe the bug
When using gpt-3.5 in the extract_properties transform, the resulting 'entity' object doesn't always have the same schema: the types of some fields within 'entity' differ across documents. Indexing into OpenSearch then fails, because OpenSearch expects each field to have a consistent data type. The same inconsistency occurs in extract_schema, but it causes no issues there because extract_schema is a single task whose results are applied to all documents.
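
To illustrate the kind of mismatch described above, here is a minimal sketch that scans a list of entity dicts and reports fields whose Python type varies between documents. The helper name `find_inconsistent_fields` and the sample field names are hypothetical and made up for illustration; they are not part of sycamore or taken from the notebook.

```python
from collections import defaultdict


def find_inconsistent_fields(entities):
    """Return {field: sorted type names} for fields whose type varies across entities."""
    seen = defaultdict(set)
    for entity in entities:
        for field, value in entity.items():
            seen[field].add(type(value).__name__)
    # Keep only fields that showed up with more than one type.
    return {field: sorted(types) for field, types in seen.items() if len(types) > 1}


# Made-up sample data: the same field comes back as int in one
# document and as str in another.
entities = [
    {"location": "Texas", "injuries": 2},
    {"location": "Ohio", "injuries": "2 serious"},
]
print(find_inconsistent_fields(entities))  # {'injuries': ['int', 'str']}
```

Running a check like this on the extracted entities before indexing may help pinpoint which fields trigger the OpenSearch error.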

To Reproduce
Steps to reproduce the behavior:

  1. Execute the metadata-extraction notebook (https://github.com/aryn-ai/sycamore/blob/a39eced00c884e0f50f33eefa7b009b5f9923249/notebooks/metadata-extraction.ipynb) a few times
  2. Some runs fail intermittently (see the attached stack trace)

Expected behavior
extract_properties should produce an 'entity' object in each record, and every entity object should have the same set of fields with the same types.
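
One possible way to get there, sketched under assumptions: coerce each entity to a fixed target schema before indexing. The `coerce_entity` helper, the `schema` mapping, and the field names below are all hypothetical; this is not how sycamore currently behaves, and a real fix might instead constrain the LLM output.

```python
def coerce_entity(entity, schema):
    """Return a new dict with exactly the fields in schema, cast to the target types.

    Missing fields and values that cannot be cast become None, so every
    entity ends up with the same field set and compatible types.
    """
    result = {}
    for field, target_type in schema.items():
        value = entity.get(field)
        if value is None:
            result[field] = None
            continue
        try:
            result[field] = target_type(value)
        except (TypeError, ValueError):
            result[field] = None
    return result


# Hypothetical target schema: field name -> Python type used for casting.
schema = {"location": str, "injuries": int}

print(coerce_entity({"location": "Texas", "injuries": "2"}, schema))
# {'location': 'Texas', 'injuries': 2}
print(coerce_entity({"injuries": "two", "extra": 1}, schema))
# {'location': None, 'injuries': None}
```

Dropping uncastable values to None loses information, so logging those cases would probably be worthwhile in practice.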

Screenshots
n/a

Desktop (please complete the following information):

  • Unrelated

Smartphone (please complete the following information):

  • Unrelated

Additional context
Stack trace attached
datatype_error.txt
