Description
Summary
The contract of partition_json() is to "rehydrate" the elements serialized to a JSON array of element objects. However, it changes the element_id and certain metadata fields from their original values.
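For reference, the round trip in question looks roughly like the sketch below (elements_to_json() is the serialization helper in unstructured.staging.base; simple.docx is just a stand-in source document):

from unstructured.partition.docx import partition_docx
from unstructured.partition.json import partition_json
from unstructured.staging.base import elements_to_json

elements = partition_docx("simple.docx")            # original partitioning
elements_to_json(elements, filename="simple.json")  # serialize to a JSON array of element objects
rehydrated = partition_json("simple.json")          # expected to reproduce `elements` unchanged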
To Reproduce
from unstructured.partition.json import partition_json
from unstructured.staging.base import elements_from_json, elements_to_dicts

file_path = example_doc_path("simple.json")  # test helper resolving a path under example-docs/
original_elements = elements_from_json(file_path)
partitioned_elements = partition_json(file_path)
assert elements_to_dicts(partitioned_elements) == elements_to_dicts(original_elements)
produces:
[
{
- 'element_id': 'a06d2d9e65212d4aa955c3ab32950ffa',
+ 'element_id': 'dbc05298f7937a62027af643bd1c3c87',
'metadata': {
'category_depth': 0,
- 'file_directory': 'unstructured/example-docs',
+ 'file_directory': '/Users/scanny/src/unstructured/example-docs',
- 'filename': 'simple.docx',
? ^ ^^
+ 'filename': 'simple.json',
? ^^ ^
- 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+ 'filetype': 'application/json',
'languages': [
'eng',
],
- 'last_modified': '2024-07-06T16:44:51',
? ^ ^ ^^^^^
+ 'last_modified': '2024-07-08T23:06:02',
? ^ ^^^^ ^^
},
'text': 'These are a few of my favorite things:',
'type': 'Title',
},
Expected behavior
Because partition_json() is the mechanism for step-wise processing of documents when using the API (just as elements_from_json() is when using the unstructured open-source library directly), identifiers and metadata should be unchanged from their serialized state. Note that step-wise processing includes chunking as a separate step, perhaps after filtering elements from the original payload or enhancing metadata; see the sketch below.
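For illustration, a step-wise flow under this expectation might look like the following sketch; chunk_by_title() is the chunker in unstructured.chunking.title, while the page-break filter and payload.json are hypothetical stand-ins:

from unstructured.chunking.title import chunk_by_title
from unstructured.partition.json import partition_json

elements = partition_json("payload.json")                  # rehydrate the serialized elements
kept = [e for e in elements if e.category != "PageBreak"]  # e.g. filter elements before chunking
chunks = chunk_by_title(kept)                              # chunking as a separate, later step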
Additional context
- Much of the problem appears to be caused by double post-processing. partition_json() uses the same @process_metadata() and @add_metadata_with_filetype() decorators that other partitioners do, but since it is not actually a partitioner, those metadata and id post-processing steps are not needed and cause these and perhaps other unwelcome behaviors.
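Purely as an illustration of the effect (not the actual decorator code), the post-processing re-derives metadata from the file being partitioned, which for partition_json() is the .json file itself:

import functools
import os
from datetime import datetime

def add_metadata_from_file(partitioner):  # hypothetical decorator, for illustration only
    @functools.wraps(partitioner)
    def wrapper(filename, **kwargs):
        elements = partitioner(filename, **kwargs)
        for element in elements:
            # values are recomputed from the file being read, clobbering the serialized ones
            element.metadata.filename = os.path.basename(filename)
            element.metadata.file_directory = os.path.dirname(os.path.abspath(filename))
            element.metadata.last_modified = datetime.fromtimestamp(
                os.path.getmtime(filename)
            ).isoformat()
        return elements
    return wrapper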