Description
Summary
The contract of partition_json() is to "rehydrate" the elements serialized to a JSON array of element objects. However, it changes the element_id and certain metadata fields from their original values.
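For reference, the round trip in question looks roughly like the sketch below (elements_to_json() is the serialization helper in unstructured.staging.base; simple.docx is just a stand-in source document):

from unstructured.partition.docx import partition_docx
from unstructured.partition.json import partition_json
from unstructured.staging.base import elements_to_json

elements = partition_docx("simple.docx")            # original partitioning
elements_to_json(elements, filename="simple.json")  # serialize to a JSON array of element objects
rehydrated = partition_json("simple.json")          # expected to reproduce `elements` unchanged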
To Reproduce
from unstructured.partition.json import partition_json
from unstructured.staging.base import elements_from_json, elements_to_dicts

file_path = example_doc_path("simple.json")  # test helper resolving a path under example-docs/
original_elements = elements_from_json(file_path)
partitioned_elements = partition_json(file_path)
assert elements_to_dicts(partitioned_elements) == elements_to_dicts(original_elements)
produces:
[
{
- 'element_id': 'a06d2d9e65212d4aa955c3ab32950ffa',
+ 'element_id': 'dbc05298f7937a62027af643bd1c3c87',
'metadata': {
'category_depth': 0,
- 'file_directory': 'unstructured/example-docs',
+ 'file_directory': '/Users/scanny/src/unstructured/example-docs',
- 'filename': 'simple.docx',
? ^ ^^
+ 'filename': 'simple.json',
? ^^ ^
- 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+ 'filetype': 'application/json',
'languages': [
'eng',
],
- 'last_modified': '2024-07-06T16:44:51',
? ^ ^ ^^^^^
+ 'last_modified': '2024-07-08T23:06:02',
? ^ ^^^^ ^^
},
'text': 'These are a few of my favorite things:',
'type': 'Title',
},
Expected behavior
Because partition_json() is the mechanism for step-wise processing of documents when using the API (just as elements_from_json() is when using the unstructured open-source library directly), identifiers and metadata should be unchanged from their serialized state. Note that step-wise processing includes chunking as a separate step, perhaps after filtering elements from the original payload or enhancing metadata; see the sketch below.
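For illustration, a step-wise flow under this expectation might look like the following sketch; chunk_by_title() is the chunker in unstructured.chunking.title, while the page-break filter and payload.json are hypothetical stand-ins:

from unstructured.chunking.title import chunk_by_title
from unstructured.partition.json import partition_json

elements = partition_json("payload.json")                  # rehydrate the serialized elements
kept = [e for e in elements if e.category != "PageBreak"]  # e.g. filter elements before chunking
chunks = chunk_by_title(kept)                              # chunking as a separate, later step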
Additional context
- Much of the problem appears to be caused by double post-processing. partition_json() uses the same @process_metadata() and @add_metadata_with_filetype() decorators that other partitioners do, but since it is not actually a partitioner, those metadata and id post-processing steps are not needed and cause these and perhaps other unwelcome behaviors.
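Purely as an illustration of the effect (not the actual decorator code), the post-processing re-derives metadata from the file being partitioned, which for partition_json() is the .json file itself:

import functools
import os
from datetime import datetime

def add_metadata_from_file(partitioner):  # hypothetical decorator, for illustration only
    @functools.wraps(partitioner)
    def wrapper(filename, **kwargs):
        elements = partitioner(filename, **kwargs)
        for element in elements:
            # values are recomputed from the file being read, clobbering the serialized ones
            element.metadata.filename = os.path.basename(filename)
            element.metadata.file_directory = os.path.dirname(os.path.abspath(filename))
            element.metadata.last_modified = datetime.fromtimestamp(
                os.path.getmtime(filename)
            ).isoformat()
        return elements
    return wrapper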