Commit f12240c

feat: add support for .txt files in partition (#150)

* added partition_text for auto
* rename partition_text tests
* bump version and update docs
1 parent eba4c80 commit f12240c

7 files changed, +55 -26 lines changed

Diff for: CHANGELOG.md

+4
@@ -1,3 +1,7 @@
+## 0.4.1-dev0
+
+* Added support for text files in the `partition` function
+
 ## 0.4.0

 * Added generic `partition` brick that detects the file type and routes a file to the appropriate

Diff for: README.md

+2-2
@@ -62,7 +62,7 @@ To install the library, run `pip install unstructured`.
 You can run this [Colab notebook](https://colab.research.google.com/drive/1RnXEiSTUaru8vZSGbh1U2T2P9aUa5tQD#scrollTo=E_WN7p3JGcLJ) to run the examples below.

 The following examples show how to get started with the `unstructured` library.
-You can parse **HTML**, **PDF**, **EML** and **DOCX** documents with one line of code!
+You can parse **TXT**, **HTML**, **PDF**, **EML** and **DOCX** documents with one line of code!
 <br></br>
 See our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
 of the features in the library.
@@ -76,7 +76,7 @@ If you are using the `partition` brick, ensure you first install `libmagic` usin
 instructions outlined [here](https://unstructured-io.github.io/unstructured/installing.html#filetype-detection)
 `partition` will always apply the default arguments. If you need
 advanced features, use a document-specific brick. The `partition` brick currently works for
-`.docx`, `eml`, `.html`, and `.pdf` documents.
+`.txt`, `.docx`, `eml`, `.html`, and `.pdf` documents.

 ```python
 from unstructured.partition.auto import partition

Diff for: docs/source/bricks.rst

+18-17
@@ -22,6 +22,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
 file type and route it to the appropriate partitioning brick. All partitioning bricks
 called within ``partition`` are called using the defualt kwargs. Use the document-type
 specific bricks if you need to apply non-default settings.
+``partition`` currently supports ``.docx``, ``.eml``, ``.html``, ``.pdf``, and ``.txt`` files.


 .. code:: python
@@ -104,7 +105,7 @@ Examples:
 ``partition_pdf``
 ---------------------

-The ``partition_pdf`` function segments a PDF document by calling the document image analysis API.
+The ``partition_pdf`` function segments a PDF document by calling the document image analysis API.
 The intent of the parameters ``url`` and ``token`` is to allow users to self host an inference API,
 if desired.

@@ -122,7 +123,7 @@ Examples:
 ---------------------

 The ``partition_email`` function partitions ``.eml`` documents and works with exports
-from email clients such as Microsoft Outlook and Gmail. The ``partition_email``
+from email clients such as Microsoft Outlook and Gmail. The ``partition_email``
 takes a filename, file-like object, or raw text as input and produces a list of
 document ``Element`` objects as output. Also ``content_source`` can be set to ``text/html``
 (default) or ``text/plain`` to process the html or plain text version of the email, respectively.
@@ -157,7 +158,7 @@ Examples:
 ``partition_text``
 ---------------------

-The ``partition_text`` function partitions text files. The ``partition_text``
+The ``partition_text`` function partitions text files. The ``partition_text``
 takes a filename, file-like object, and raw text as input and produces ``Element`` objects as output.

 Examples:
@@ -629,7 +630,7 @@ addresses in the input string.

 from unstructured.cleaners.extract import extract_email_address

-text = """Me [email protected] and You <[email protected]>
+text = """Me [email protected] and You <[email protected]>
 ([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""


@@ -646,7 +647,7 @@ returns a list of all IP address in input string.

 from unstructured.cleaners.extract import extract_ip_address

-text = """Me [email protected] and You <[email protected]>
+text = """Me [email protected] and You <[email protected]>
 ([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""

 # Returns "['ba23::58b5:2236:45g2:88h2', '10.0.2.01']"
@@ -656,7 +657,7 @@ returns a list of all IP address in input string.
 ``extract_ip_address_name``
 ----------------------------

-Extracts the names of each IP address in the ``Received`` field(s) from an ``.eml``
+Extracts the names of each IP address in the ``Received`` field(s) from an ``.eml``
 file. ``extract_ip_address_name`` takes in a string and returns a list of all
 IP addresses in the input string.

@@ -675,7 +676,7 @@ IP addresses in the input string.
 ``extract_mapi_id``
 ----------------------

-Extracts the ``mapi id`` in the ``Received`` field(s) from an ``.eml``
+Extracts the ``mapi id`` in the ``Received`` field(s) from an ``.eml``
 file. ``extract_mapi_id`` takes in a string and returns a list of a string
 containing the ``mapi id`` in the input string.

@@ -694,7 +695,7 @@ containing the ``mapi id`` in the input string.
 ``extract_datetimetz``
 ----------------------

-Extracts the date, time, and timezone in the ``Received`` field(s) from an ``.eml``
+Extracts the date, time, and timezone in the ``Received`` field(s) from an ``.eml``
 file. ``extract_datetimetz`` takes in a string and returns a datetime.datetime
 object from the input string.

@@ -754,7 +755,7 @@ other languages.
 Parameters:

 * ``text``: the input string to translate.
-* ``source_lang``: the two letter language code for the source language of the text.
+* ``source_lang``: the two letter language code for the source language of the text.
 If ``source_lang`` is not specified,
 the language will be detected using ``langdetect``.
 * ``target_lang``: the two letter language code for the target language for translation.
@@ -857,7 +858,7 @@ Examples:
 --------------------------

 Prepares ``Text`` elements for processing in ``transformers`` pipelines
-by splitting the elements into chunks that fit into the model's attention window.
+by splitting the elements into chunks that fit into the model's attention window.

 Examples:

@@ -960,7 +961,7 @@ Examples:
 json.dump(label_studio_data, f, indent=4)


-You can also include pre-annotations and predictions as part of your LabelStudio upload.
+You can also include pre-annotations and predictions as part of your LabelStudio upload.

 The ``annotations`` kwarg is a list of lists. If ``annotations`` is specified, there must be a list of
 annotations for each element in the ``elements`` list. If an element does not have any annotations,
@@ -1009,7 +1010,7 @@ task in LabelStudio:

 Similar to annotations, the ``predictions`` kwarg is also a list of lists. A ``prediction`` is an annotation with
 the addition of a ``score`` value. If ``predictions`` is specified, there must be a list of
-predictions for each element in the ``elements`` list. If an element does not have any predictions, use an empty list.
+predictions for each element in the ``elements`` list. If an element does not have any predictions, use an empty list.
 The following shows an example of how to upload predictions for the "Text Classification"
 task in LabelStudio:

@@ -1167,13 +1168,13 @@ Examples:
 ``stage_for_label_box``
 --------------------------

-Formats outputs for use with `LabelBox <https://docs.labelbox.com/docs/overview>`_. LabelBox accepts cloud-hosted data
+Formats outputs for use with `LabelBox <https://docs.labelbox.com/docs/overview>`_. LabelBox accepts cloud-hosted data
 and does not support importing text directly. The ``stage_for_label_box`` does the following:

 * Stages the data files in the ``output_directory`` specified in function arguments to be uploaded to a cloud storage service.
 * Returns a config of type ``List[Dict[str, Any]]`` that can be written to a ``json`` file and imported into LabelBox.

-**Note:** ``stage_for_label_box`` does not upload the data to remote storage such as S3. Users can upload the data to S3
+**Note:** ``stage_for_label_box`` does not upload the data to remote storage such as S3. Users can upload the data to S3
 using ``aws s3 sync ${output_directory} ${url_prefix}`` after running the ``stage_for_label_box`` staging brick.

 Examples:
@@ -1197,7 +1198,7 @@ files to an S3 bucket.

 # The URL prefix where the data files will be accessed.
 S3_URL_PREFIX = f"https://{S3_BUCKET_NAME}.s3.amazonaws.com/{S3_BUCKET_KEY_PREFIX}"
-
+
 # The local output directory where the data files will be staged for uploading to a Cloud Storage service.
 LOCAL_OUTPUT_DIRECTORY = "/tmp/labelbox-staging"

@@ -1232,7 +1233,7 @@ files to an S3 bucket.
 --------------------------
 Formats a list of ``Text`` elements as input to token based tasks in Datasaur.

-Example:
+Example:

 .. code:: python

@@ -1243,7 +1244,7 @@ Example:
 datasaur_data = stage_for_datasaur(elements)

 The output is a list of dictionaries, each one with two keys:
-"text" with the content of the element and
+"text" with the content of the element and
 "entities" with an empty list.

 You can also specify specify entities in the ``stage_for_datasaur`` brick. Entities

Diff for: test_unstructured/partition/test_auto.py

+22-1
@@ -113,7 +113,28 @@ def test_auto_partition_html_from_file_rb():
     assert len(elements) > 0


-def test_auto_partition_pdf():
+EXPECTED_TEXT_OUTPUT = [
+    NarrativeText(text="This is a test document to use for unit tests."),
+    Title(text="Important points:"),
+    ListItem(text="Hamburgers are delicious"),
+    ListItem(text="Dogs are the best"),
+    ListItem(text="I love fuzzy blankets"),
+]
+
+
+def test_auto_partition_text_from_filename():
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
+    elements = partition(filename=filename)
+    assert len(elements) > 0
+    assert elements == EXPECTED_TEXT_OUTPUT
+
+
+def test_auto_partition_text_from_file():
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
+    with open(filename, "r") as f:
+        elements = partition(file=f)
+    assert len(elements) > 0
+    assert elements == EXPECTED_TEXT_OUTPUT
     filename = os.path.join(
         EXAMPLE_DOCS_DIRECTORY, "..", "..", "example-docs", "layout-parser-paper-fast.pdf"
     )
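
The new tests assert `elements == EXPECTED_TEXT_OUTPUT`, which only works if element objects compare by value and by type. The sketch below illustrates that property with stand-in dataclasses. These classes are assumptions made for illustration and reuse only the names seen in the test; the real `unstructured` element classes may implement equality differently.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """Stand-in base element (assumption); dataclass generates __eq__ from `text`."""

    text: str


@dataclass
class NarrativeText(Text):
    """Stand-in for a paragraph-like element."""


@dataclass
class Title(Text):
    """Stand-in for a heading-like element."""


@dataclass
class ListItem(Text):
    """Stand-in for a bullet-like element."""


expected = [Title(text="Important points:"), ListItem(text="Dogs are the best")]
actual = [Title(text="Important points:"), ListItem(text="Dogs are the best")]

# Lists compare element by element, so equal text and equal class are both required.
same_content = actual == expected
same_text_other_type = Title(text="Dogs are the best") == ListItem(text="Dogs are the best")
```

Here `same_content` is true and `same_text_other_type` is false, because dataclass equality also requires both operands to be of the same class.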

Diff for: test_unstructured/partition/test_text.py

+5-5
@@ -16,22 +16,22 @@
 ]


-def test_partition_email_from_filename():
+def test_partition_text_from_filename():
     filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
     elements = partition_text(filename=filename)
     assert len(elements) > 0
     assert elements == EXPECTED_OUTPUT


-def test_partition_email_from_file():
+def test_partition_text_from_file():
     filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
     with open(filename, "r") as f:
         elements = partition_text(file=f)
     assert len(elements) > 0
     assert elements == EXPECTED_OUTPUT


-def test_partition_email_from_text():
+def test_partition_text_from_text():
     filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
     with open(filename, "r") as f:
         text = f.read()
@@ -40,12 +40,12 @@ def test_partition_email_from_text():
     assert elements == EXPECTED_OUTPUT


-def test_partition_email_raises_with_none_specified():
+def test_partition_text_raises_with_none_specified():
     with pytest.raises(ValueError):
         partition_text()


-def test_partition_email_raises_with_too_many_specified():
+def test_partition_text_raises_with_too_many_specified():
     filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-text.txt")
     with open(filename, "r") as f:
         text = f.read()
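
The two `raises` tests suggest that `partition_text` expects exactly one of `filename`, `file`, or `text`. A minimal sketch of that kind of check follows. The helpers `exactly_one` and `read_text_input` are hypothetical names, not part of `unstructured`, and raising `ValueError` in the too-many case is an assumption, since that test is truncated in this diff.

```python
from typing import IO, Any, Optional


def exactly_one(**kwargs: Any) -> None:
    """Raise ValueError unless exactly one keyword argument is not None."""
    provided = [name for name, value in kwargs.items() if value is not None]
    if len(provided) != 1:
        names = ", ".join(kwargs)
        raise ValueError(f"Exactly one of {names} must be specified; got {len(provided)}.")


def read_text_input(
    filename: Optional[str] = None,
    file: Optional[IO] = None,
    text: Optional[str] = None,
) -> str:
    """Hypothetical helper: return raw text from whichever single input was given."""
    exactly_one(filename=filename, file=file, text=text)
    if filename is not None:
        with open(filename, "r") as f:
            return f.read()
    if file is not None:
        return file.read()
    assert text is not None  # guaranteed by exactly_one
    return text
```

Calling `read_text_input()` with no arguments, or with both `filename` and `text`, raises `ValueError` in this sketch.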

Diff for: unstructured/__version__.py

+1-1
@@ -1 +1 @@
-__version__ = "0.4.0" # pragma: no cover
+__version__ = "0.4.1-dev0" # pragma: no cover

Diff for: unstructured/partition/auto.py

+3
@@ -5,6 +5,7 @@
 from unstructured.partition.email import partition_email
 from unstructured.partition.html import partition_html
 from unstructured.partition.pdf import partition_pdf
+from unstructured.partition.text import partition_text


 def partition(filename: Optional[str] = None, file: Optional[IO] = None):
@@ -33,6 +34,8 @@ def partition(filename: Optional[str] = None, file: Optional[IO] = None):
         return partition_html(filename=filename, file=file)
     elif filetype == FileType.PDF:
         return partition_pdf(filename=filename, file=file, url=None)  # type: ignore
+    elif filetype == FileType.TXT:
+        return partition_text(filename=filename, file=file)
     else:
         msg = "Invalid file" if not filename else f"Invalid file {filename}"
         raise ValueError(f"{msg}. File type not support in partition.")
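
The `auto.py` change adds one more branch to a dispatch keyed on the detected file type. The sketch below shows the same routing pattern in self-contained form, using a lookup table instead of an `if`/`elif` chain. It is not the library's code: `FileType` is a local stand-in enum, `detect_filetype_by_extension` is a hypothetical helper that looks only at the extension (the README says the real `partition` depends on `libmagic`), and the placeholder partitioners return labels rather than `Element` objects.

```python
import os
from enum import Enum
from typing import Callable, Dict, List, Optional


class FileType(Enum):
    # Local stand-in for the library's FileType enum (assumption).
    DOCX = "docx"
    EML = "eml"
    HTML = "html"
    PDF = "pdf"
    TXT = "txt"


# Hypothetical extension map; used here only to keep the sketch self-contained.
_EXTENSION_TO_FILETYPE: Dict[str, FileType] = {
    ".docx": FileType.DOCX,
    ".eml": FileType.EML,
    ".html": FileType.HTML,
    ".pdf": FileType.PDF,
    ".txt": FileType.TXT,
}


def detect_filetype_by_extension(filename: str) -> Optional[FileType]:
    """Return the FileType for a filename, or None if the extension is unknown."""
    _, extension = os.path.splitext(filename)
    return _EXTENSION_TO_FILETYPE.get(extension.lower())


def _placeholder(filetype: FileType) -> Callable[[str], List[str]]:
    """Build a placeholder partitioner that only reports which route was taken."""
    return lambda filename: [f"{filetype.value}:{filename}"]


# One entry per supported type; supporting a new type means adding one entry.
_PARTITIONERS: Dict[FileType, Callable[[str], List[str]]] = {
    filetype: _placeholder(filetype) for filetype in FileType
}


def partition_sketch(filename: str) -> List[str]:
    """Route a file to the placeholder partitioner for its detected type."""
    filetype = detect_filetype_by_extension(filename)
    if filetype is None:
        raise ValueError(f"Invalid file {filename}. File type not supported.")
    return _PARTITIONERS[filetype](filename)
```

With a table, the `.txt` support added in this commit would be a single new mapping entry. The commit's `elif` branch achieves the same routing and keeps per-type arguments, such as `url=None` for PDFs, visible at the call site.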
