Commit 0d229f0

fix: preserve all elements when serialized; feat: helper functions for serialization (#273)
* added type to text element map
* add element_id and coordinates
* added test for serialization
* added serialization for check boxes
* add dict_to_elements and convert_to_dict aliases
* helpers for serializing and deserializing elements
* bump version; changelog
* add Text to tests
* aliases for isd functions
* remove test elements json
* changelog updates
* make indent a kwarg
* update expected structured output
* docs update
* use new function in ingest code
* pop coordinates due to floating point differences
* pop coordinates
1 parent 354eff1 commit 0d229f0

11 files changed: +635 -42 lines changed

Diff for: CHANGELOG.md

+12
@@ -1,3 +1,15 @@
+## 0.4.15
+
+### Enhancements
+
+* Added `elements_to_json` and `elements_from_json` for easier serialization/deserialization
+* `convert_to_dict`, `dict_to_elements` and `convert_to_csv` are now aliases for functions
+  that use the ISD terminology.
+
+### Fixes
+
+* Update to ensure all elements are preserved during serialization/deserialization
+
 ## 0.4.14
 
 * Automatically install `nltk` models in the `tokenize` module.
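
The changelog entry above describes the new names as aliases for the ISD-named functions. As a minimal sketch of one common way to provide such an alias (the commit may equally well use thin wrapper functions), the example below binds a second name to a stand-in function. The body of `convert_to_isd` here is a placeholder written for illustration and is not the library's implementation.

```python
from typing import Dict, List


def convert_to_isd(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Stand-in for the ISD-named function; it only copies each record."""
    return [dict(record) for record in records]


# A plain assignment binds a second name to the same function object,
# so the alias cannot drift from the behavior of the original.
convert_to_dict = convert_to_isd

records = [{"type": "Title", "text": "My Title"}]
same_object = convert_to_dict is convert_to_isd
same_output = convert_to_dict(records) == convert_to_isd(records)
```

With this approach callers can migrate to the new name gradually, since both names resolve to the same code.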

Diff for: docs/source/bricks.rst

+14-14
@@ -887,33 +887,33 @@ Staging
 Staging bricks in ``unstructured`` prepare extracted text for downstream tasks such
 as machine learning inference and data labeling.
 
-``convert_to_isd``
-------------------
+``convert_to_dict``
+--------------------
 
-Converts outputs to the initial structured data (ISD) format. This is the default format
-for returning data in Unstructured pipeline APIs.
+Converts a list of ``Element`` objects to a dictionary. This is the default format
+for representing documents in ``unstructured``.
 
 Examples:
 
 .. code:: python
 
   from unstructured.documents.elements import Title, NarrativeText
-  from unstructured.staging.base import convert_to_isd
+  from unstructured.staging.base import convert_to_dict
 
   elements = [Title(text="Title"), NarrativeText(text="Narrative")]
-  isd = convert_to_isd(elements)
+  isd = convert_to_dict(elements)
 
 
-``isd_to_elements``
--------------------
+``dict_to_elements``
+---------------------
 
-Converts outputs from initial structured data (ISD) format back to a list of ``Text`` elements.
+Converts a dictionary of the format produced by ``convert_to_dict`` back to a list of ``Element`` objects.
 
 Examples:
 
 .. code:: python
 
-  from unstructured.staging.base import isd_to_elements
+  from unstructured.staging.base import dict_to_elements
 
   isd = [
     {"text": "My Title", "type": "Title"},
@@ -922,10 +922,10 @@ Examples:
 
   # elements will look like:
   # [ Title(text="My Title"), NarrativeText(text="My Narrative")]
-  elements = isd_to_elements(isd)
+  elements = dict_to_elements(isd)
 
 
-``convert_to_isd_csv``
+``convert_to_csv``
 ----------------------
 
 Converts outputs to the initial structured data (ISD) format as a CSV string.
@@ -935,10 +935,10 @@ Examples:
 .. code:: python
 
   from unstructured.documents.elements import Title, NarrativeText
-  from unstructured.staging.base import convert_to_isd_csv
+  from unstructured.staging.base import convert_to_csv
 
   elements = [Title(text="Title"), NarrativeText(text="Narrative")]
-  isd_csv = convert_to_isd_csv(elements)
+  isd_csv = convert_to_csv(elements)
 
 
 ``convert_to_dataframe``
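
The documentation hunk above says ``convert_to_csv`` returns the ISD records as a CSV string. The following is a rough standard-library sketch of that idea, not the library's code: the function name `records_to_csv` and the choice of `type` and `text` as the only columns are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, List


def records_to_csv(records: List[Dict[str, str]]) -> str:
    """Render element records as a CSV string with a header row (hypothetical helper)."""
    fieldnames = ["type", "text"]  # assumed column set, chosen for this sketch
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys such as metadata that are not in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [
    {"type": "Title", "text": "Title"},
    {"type": "NarrativeText", "text": "Narrative"},
]
csv_string = records_to_csv(records)

# Parsing the string back recovers the same rows.
parsed = list(csv.DictReader(io.StringIO(csv_string)))
```

Building the string in an in-memory buffer keeps the helper free of file handling, so the caller decides where the CSV ends up, as the test in this commit does when it writes the string to a file.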

Diff for: docs/source/elements.rst

+28-1
@@ -44,4 +44,31 @@ Examples:
   item.apply(*cleaners)
 
   # The output will be: Учебник по крокодильным средам обитания
-  print(item)
+  print(item)
+
+####################
+Serializing Elements
+####################
+
+The ``unstructured`` library includes helper functions for
+reading and writing a list of ``Element`` objects to and
+from JSON. You can use the following workflow for
+serializing and deserializing an ``Element`` list.
+
+
+.. code:: python
+
+  from unstructured.documents.elements import ElementMetadata, Text, Title, FigureCaption
+  from unstructured.staging.base import elements_to_json, elements_from_json
+
+  filename = "my-elements.json"
+  metadata = ElementMetadata(filename="fake-file.txt")
+  elements = [
+    FigureCaption(text="caption", metadata=metadata, element_id="1"),
+    Title(text="title", metadata=metadata, element_id="2"),
+    Text(text="title", metadata=metadata, element_id="3"),
+
+  ]
+
+  elements_to_json(elements, filename=filename)
+  new_elements = elements_from_json(filename=filename)
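
To make the write-then-read workflow above concrete without requiring the library, here is a minimal sketch using only the standard library. The `Element` dataclass and the helpers `write_elements` and `read_elements` are hypothetical stand-ins, not the signatures of `elements_to_json` and `elements_from_json`; the `indent` keyword mirrors the "make indent a kwarg" note in the commit message.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for an element: a type name, text, and an id."""
    type: str
    text: str
    element_id: str


def write_elements(elements: List[Element], filename: str, indent: int = 4) -> None:
    """Serialize the elements to a JSON file as a list of dictionaries."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump([asdict(element) for element in elements], f, indent=indent)


def read_elements(filename: str) -> List[Element]:
    """Rebuild the elements from a file produced by write_elements."""
    with open(filename, encoding="utf-8") as f:
        return [Element(**record) for record in json.load(f)]


elements = [
    Element(type="FigureCaption", text="caption", element_id="1"),
    Element(type="Title", text="title", element_id="2"),
]

# Round trip through a temporary file so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my-elements.json")
    write_elements(elements, path)
    restored = read_elements(path)
```

Because each record carries its `type` and `element_id`, the restored list compares equal to the original, which is the property the new tests in this commit check for the real helpers.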

Diff for: test_unstructured/staging/test_base_staging.py

+52-3
@@ -8,7 +8,18 @@
 
 import unstructured.staging.base as base
 
-from unstructured.documents.elements import ElementMetadata, Title, NarrativeText, ListItem
+from unstructured.documents.elements import (
+    Address,
+    CheckBox,
+    ElementMetadata,
+    FigureCaption,
+    Title,
+    Text,
+    NarrativeText,
+    ListItem,
+    Image,
+    PageBreak,
+)
 
 
 @pytest.fixture
@@ -44,10 +55,10 @@ def test_isd_to_elements():
     ]
 
 
-def test_convert_to_isd_csv(output_csv_file):
+def test_convert_to_csv(output_csv_file):
     elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]
     with open(output_csv_file, "w+") as csv_file:
-        isd_csv_string = base.convert_to_isd_csv(elements)
+        isd_csv_string = base.convert_to_csv(elements)
         csv_file.write(isd_csv_string)
 
     with open(output_csv_file, "r") as csv_file:
@@ -77,3 +88,41 @@ def test_convert_to_isd_serializes_with_posix_paths():
     output = base.convert_to_isd(elements)
     # NOTE(robinson) - json.dumps should run without raising an exception
     json.dumps(output)
+
+
+def test_all_elements_preserved_when_serialized():
+    metadata = ElementMetadata(filename="fake-file.txt")
+    elements = [
+        Address(text="address", metadata=metadata, element_id="1"),
+        CheckBox(checked=True, metadata=metadata, element_id="2"),
+        FigureCaption(text="caption", metadata=metadata, element_id="3"),
+        Title(text="title", metadata=metadata, element_id="4"),
+        NarrativeText(text="narrative", metadata=metadata, element_id="5"),
+        ListItem(text="list", metadata=metadata, element_id="6"),
+        Image(text="image", metadata=metadata, element_id="7"),
+        Text(text="text", metadata=metadata, element_id="8"),
+        PageBreak(),
+    ]
+
+    isd = base.convert_to_isd(elements)
+    assert base.convert_to_isd(base.isd_to_elements(isd)) == isd
+
+
+def test_serialized_deserialize_elements_to_json(tmpdir):
+    filename = os.path.join(tmpdir, "fake-elements.json")
+    metadata = ElementMetadata(filename="fake-file.txt")
+    elements = [
+        Address(text="address", metadata=metadata, element_id="1"),
+        CheckBox(checked=True, metadata=metadata, element_id="2"),
+        FigureCaption(text="caption", metadata=metadata, element_id="3"),
+        Title(text="title", metadata=metadata, element_id="4"),
+        NarrativeText(text="narrative", metadata=metadata, element_id="5"),
+        ListItem(text="list", metadata=metadata, element_id="6"),
+        Image(text="image", metadata=metadata, element_id="7"),
+        Text(text="text", metadata=metadata, element_id="8"),
+        PageBreak(),
+    ]
+
+    base.elements_to_json(elements, filename=filename)
+    new_elements = base.elements_from_json(filename=filename)
+    assert elements == new_elements
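
The tests above check that a mixed list of element types survives a round trip. A small sketch of how such preservation can be achieved is a lookup from the serialized type name back to a class, with a separate branch for elements such as check boxes that carry no text. Everything below (the classes, `TEXT_TYPES`, `to_records`, `from_records`) is a hypothetical stand-in and not the code in `unstructured.staging.base`; raising on an unknown type is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Text:
    """Hypothetical text-bearing element."""
    text: str
    element_id: str = ""

    def to_dict(self) -> Dict[str, Any]:
        # The class name is stored so the right class can be rebuilt later.
        return {"type": type(self).__name__, "element_id": self.element_id, "text": self.text}


@dataclass
class Title(Text):
    """Hypothetical subclass; serialized under its own class name."""


@dataclass
class CheckBox:
    """Hypothetical element with a boolean state instead of text."""
    checked: bool = False
    element_id: str = ""

    def to_dict(self) -> Dict[str, Any]:
        return {"type": "CheckBox", "element_id": self.element_id, "checked": self.checked}


# Map from serialized type name to the class that rebuilds it.
TEXT_TYPES = {cls.__name__: cls for cls in (Text, Title)}


def to_records(elements: List[Any]) -> List[Dict[str, Any]]:
    return [element.to_dict() for element in elements]


def from_records(records: List[Dict[str, Any]]) -> List[Any]:
    elements: List[Any] = []
    for record in records:
        kind = record["type"]
        if kind in TEXT_TYPES:
            elements.append(TEXT_TYPES[kind](text=record["text"], element_id=record["element_id"]))
        elif kind == "CheckBox":
            elements.append(CheckBox(checked=record["checked"], element_id=record["element_id"]))
        else:
            raise ValueError(f"Unknown element type: {kind}")
    return elements


elements = [
    Title(text="title", element_id="1"),
    CheckBox(checked=True, element_id="2"),
    Text(text="text", element_id="3"),
]
records = to_records(elements)
restored = from_records(records)
```

Keeping the plain `Text` class in the lookup matters in this sketch: without it, generic text records would have no class to map back to and would be lost on deserialization.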
