
bug(json): partition_json() does not preserve original element_id or metadata #3365

Open
@scanny


Summary
The contract of partition_json() is to "rehydrate" elements that were previously serialized to a JSON array of element objects. However, it changes element_id and certain metadata fields (file_directory, filename, filetype, and last_modified in the example below) from their original values.

To Reproduce

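# Import paths as laid out in the unstructured repo; example_doc_path is a test helper.
from test_unstructured.unit_utils import example_doc_path
from unstructured.partition.json import partition_json
from unstructured.staging.base import elements_from_json, elements_to_dicts

file_path = example_doc_path("simple.json")
original_elements = elements_from_json(file_path)
partitioned_elements = partition_json(file_path)

assert elements_to_dicts(partitioned_elements) == elements_to_dicts(original_elements)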

produces:

    [
        {
  -         'element_id': 'a06d2d9e65212d4aa955c3ab32950ffa',
  +         'element_id': 'dbc05298f7937a62027af643bd1c3c87',
            'metadata': {
                'category_depth': 0,
  -             'file_directory': 'unstructured/example-docs',
  +             'file_directory': '/Users/scanny/src/unstructured/example-docs',
  -             'filename': 'simple.docx',
  ?                                 ^ ^^
  +             'filename': 'simple.json',
  ?                                 ^^ ^
  -             'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  +             'filetype': 'application/json',
                'languages': [
                    'eng',
                ],
  -             'last_modified': '2024-07-06T16:44:51',
  ?                                        ^ ^  ^^^^^
  +             'last_modified': '2024-07-08T23:06:02',
  ?                                        ^ ^^^^  ^^
            },
            'text': 'These are a few of my favorite things:',
            'type': 'Title',
        },

Expected behavior
partition_json() is the mechanism for step-wise processing of documents through the API, just as elements_from_json() is when using the open-source unstructured library directly. Identifiers and metadata should therefore be unchanged from their serialized state. Note that step-wise processing includes chunking as a separate step, perhaps after filtering elements from the original payload or enhancing their metadata.
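The expected invariant can be stated as "no field differs after a round trip". The following is only a sketch of how that could be checked on plain element dicts; changed_fields() is a hypothetical helper, not part of unstructured, and the sample values are copied from the diff above.

```python
from typing import Any


def changed_fields(
    original: list[dict[str, Any]], rehydrated: list[dict[str, Any]]
) -> list[str]:
    """Return dotted paths of fields that differ between two element-dict lists."""
    changes: list[str] = []
    if len(original) != len(rehydrated):
        changes.append("<element count>")
    for i, (before, after) in enumerate(zip(original, rehydrated)):
        # Top-level fields such as element_id, text, and type.
        for key in sorted(set(before) | set(after)):
            if key != "metadata" and before.get(key) != after.get(key):
                changes.append(f"[{i}].{key}")
        # Metadata is compared field by field so each change is named.
        meta_before = before.get("metadata", {})
        meta_after = after.get("metadata", {})
        for key in sorted(set(meta_before) | set(meta_after)):
            if meta_before.get(key) != meta_after.get(key):
                changes.append(f"[{i}].metadata.{key}")
    return changes


# Values taken from the first element of the diff reported in this issue.
serialized = [{
    "element_id": "a06d2d9e65212d4aa955c3ab32950ffa",
    "metadata": {
        "category_depth": 0,
        "file_directory": "unstructured/example-docs",
        "filename": "simple.docx",
        "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "languages": ["eng"],
        "last_modified": "2024-07-06T16:44:51",
    },
    "text": "These are a few of my favorite things:",
    "type": "Title",
}]
observed = [{
    "element_id": "dbc05298f7937a62027af643bd1c3c87",
    "metadata": {
        "category_depth": 0,
        "file_directory": "/Users/scanny/src/unstructured/example-docs",
        "filename": "simple.json",
        "filetype": "application/json",
        "languages": ["eng"],
        "last_modified": "2024-07-08T23:06:02",
    },
    "text": "These are a few of my favorite things:",
    "type": "Title",
}]

print(changed_fields(serialized, observed))
```

With the behavior requested here, the same check against the output of partition_json() would return an empty list.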

Additional context

  • Much of the problem appears to be caused by double post-processing. partition_json() uses the same @process_metadata() and @add_metadata_with_filetype() decorators that other partitioners do, but since it is not actually a partitioner, those metadata and id post-processing steps are not needed and cause these and perhaps other unwelcome behaviors.
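To illustrate the double post-processing described above, here is a toy sketch using plain dicts. It does not reproduce the library's decorators: stamp_file_metadata() is a hypothetical stand-in, and its id-hashing scheme is an assumption made purely for illustration.

```python
import functools
import hashlib
import json
import os
import tempfile
from typing import Any, Callable

Elements = list[dict[str, Any]]


def stamp_file_metadata(filetype: str) -> Callable:
    """Hypothetical stand-in for a partitioner's post-processing decorator."""

    def decorator(func: Callable[[str], Elements]) -> Callable[[str], Elements]:
        @functools.wraps(func)
        def wrapper(file_path: str) -> Elements:
            elements = func(file_path)
            for index, element in enumerate(elements):
                meta = element.setdefault("metadata", {})
                # Overwrites values that described the ORIGINAL source document.
                meta["filename"] = os.path.basename(file_path)
                meta["filetype"] = filetype
                # Recomputes the id (illustrative scheme, not the library's).
                seed = f"{meta['filename']}{index}{element['text']}"
                element["element_id"] = hashlib.sha256(seed.encode()).hexdigest()[:32]
            return elements

        return wrapper

    return decorator


def load_elements(file_path: str) -> Elements:
    """Read a JSON array of element dicts from disk."""
    with open(file_path, encoding="utf-8") as f:
        return json.load(f)


@stamp_file_metadata("application/json")
def rehydrate_with_post_processing(file_path: str) -> Elements:
    return load_elements(file_path)


def rehydrate_unchanged(file_path: str) -> Elements:
    # No decorator: the serialized state is returned as-is.
    return load_elements(file_path)


def demo() -> tuple[Elements, Elements, Elements]:
    """Serialize one element to a temp file and rehydrate it both ways."""
    serialized = [{
        "element_id": "a06d2d9e65212d4aa955c3ab32950ffa",
        "metadata": {"filename": "simple.docx", "filetype": "docx"},
        "text": "These are a few of my favorite things:",
        "type": "Title",
    }]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "simple.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(serialized, f)
        return serialized, rehydrate_with_post_processing(path), rehydrate_unchanged(path)
```

In this sketch the undecorated loader round-trips the payload exactly, while the decorated one replaces filename, filetype, and element_id with values derived from the JSON file itself.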

Labels: bug (Something isn't working), json (Related to partitioning JSON)
