@@ -22,6 +22,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
file type and route it to the appropriate partitioning brick. All partitioning bricks
called within ``partition`` are called using the default kwargs. Use the document-type
specific bricks if you need to apply non-default settings.
+ ``partition`` currently supports ``.docx``, ``.eml``, ``.html``, ``.pdf``, and ``.txt`` files.

.. code:: python
@@ -104,7 +105,7 @@ Examples:
``partition_pdf``
---------------------

- The ``partition_pdf`` function segments a PDF document by calling the document image analysis API.
+ The ``partition_pdf`` function segments a PDF document by calling the document image analysis API.
The intent of the parameters ``url`` and ``token`` is to allow users to self-host an inference API,
if desired.
@@ -122,7 +123,7 @@ Examples:
---------------------

The ``partition_email`` function partitions ``.eml`` documents and works with exports
- from email clients such as Microsoft Outlook and Gmail. The ``partition_email``
+ from email clients such as Microsoft Outlook and Gmail. The ``partition_email``
takes a filename, file-like object, or raw text as input and produces a list of
document ``Element`` objects as output. Also, ``content_source`` can be set to ``text/html``
(default) or ``text/plain`` to process the HTML or plain text version of the email, respectively.
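
To make the ``content_source`` choice concrete, the following is a minimal sketch that uses only Python's standard ``email`` package. It is not how ``partition_email`` is implemented; the ``select_body`` helper and the sample message are assumptions made for illustration. It shows what choosing between the ``text/html`` and ``text/plain`` versions of a multipart email means.

```python
from email import message_from_string

# Hypothetical sample message with both a plain text and an HTML version.
RAW_EML = """\
From: sender@example.com
To: receiver@example.com
Subject: Demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="utf-8"

Hello in plain text.
--BOUNDARY
Content-Type: text/html; charset="utf-8"

<p>Hello in HTML.</p>
--BOUNDARY--
"""


def select_body(raw: str, content_source: str = "text/html") -> str:
    """Hypothetical helper: return the first MIME part matching content_source."""
    message = message_from_string(raw)
    for part in message.walk():
        if part.get_content_type() == content_source:
            payload = part.get_payload(decode=True)  # bytes for leaf parts
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset).strip()
    raise ValueError(f"no {content_source} part found")


html_body = select_body(RAW_EML)  # default mirrors text/html
plain_body = select_body(RAW_EML, content_source="text/plain")
```

The default argument mirrors the documented default of ``text/html``; passing ``text/plain`` selects the other version of the same message.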
@@ -157,7 +158,7 @@ Examples:
``partition_text``
---------------------

- The ``partition_text`` function partitions text files. The ``partition_text``
+ The ``partition_text`` function partitions text files. The ``partition_text``
takes a filename, file-like object, or raw text as input and produces ``Element`` objects as output.

Examples:
@@ -629,7 +630,7 @@ addresses in the input string.
from unstructured.cleaners.extract import extract_email_address

-
+
([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""
@@ -646,7 +647,7 @@ returns a list of all IP address in input string.
from unstructured.cleaners.extract import extract_ip_address

-
+
([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""

# Returns "['ba23::58b5:2236:45g2:88h2', '10.0.2.01']"
@@ -656,7 +657,7 @@ returns a list of all IP address in input string.
``extract_ip_address_name``
----------------------------

- Extracts the names of each IP address in the ``Received`` field(s) from an ``.eml``
+ Extracts the names of each IP address in the ``Received`` field(s) from an ``.eml``
file. ``extract_ip_address_name`` takes in a string and returns a list of all
IP addresses in the input string.
@@ -675,7 +676,7 @@ IP addresses in the input string.
``extract_mapi_id``
----------------------

- Extracts the ``mapi id`` in the ``Received`` field(s) from an ``.eml``
+ Extracts the ``mapi id`` in the ``Received`` field(s) from an ``.eml``
file. ``extract_mapi_id`` takes in a string and returns a list of a string
containing the ``mapi id`` in the input string.
@@ -694,7 +695,7 @@ containing the ``mapi id`` in the input string.
``extract_datetimetz``
----------------------

- Extracts the date, time, and timezone in the ``Received`` field(s) from an ``.eml``
+ Extracts the date, time, and timezone in the ``Received`` field(s) from an ``.eml``
file. ``extract_datetimetz`` takes in a string and returns a ``datetime.datetime``
object from the input string.
@@ -754,7 +755,7 @@ other languages.
Parameters:

* ``text``: the input string to translate.
- * ``source_lang``: the two letter language code for the source language of the text.
+ * ``source_lang``: the two letter language code for the source language of the text.
  If ``source_lang`` is not specified,
  the language will be detected using ``langdetect``.
* ``target_lang``: the two letter language code for the target language for translation.
@@ -857,7 +858,7 @@ Examples:
--------------------------

Prepares ``Text`` elements for processing in ``transformers`` pipelines
- by splitting the elements into chunks that fit into the model's attention window.
+ by splitting the elements into chunks that fit into the model's attention window.
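
The idea of splitting text so that each chunk fits a fixed window can be sketched as follows. This is only an illustration under simplifying assumptions: it counts whitespace-delimited words instead of using a real ``transformers`` tokenizer, and the ``chunk_by_window`` helper is hypothetical, not the brick's implementation.

```python
from typing import List


def chunk_by_window(text: str, max_tokens: int) -> List[str]:
    """Hypothetical helper: split text into chunks of at most max_tokens words.

    Whitespace-delimited words stand in for model tokens here; a real
    tokenizer would usually produce a different token count.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


chunks = chunk_by_window("one two three four five six seven", max_tokens=3)
```

Each chunk stays within the window, and the final chunk holds whatever words remain.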

Examples:
@@ -960,7 +961,7 @@ Examples:
json.dump(label_studio_data, f, indent=4)

- You can also include pre-annotations and predictions as part of your LabelStudio upload.
+ You can also include pre-annotations and predictions as part of your LabelStudio upload.

The ``annotations`` kwarg is a list of lists. If ``annotations`` is specified, there must be a list of
annotations for each element in the ``elements`` list. If an element does not have any annotations,
@@ -1009,7 +1010,7 @@ task in LabelStudio:
Similar to annotations, the ``predictions`` kwarg is also a list of lists. A ``prediction`` is an annotation with
the addition of a ``score`` value. If ``predictions`` is specified, there must be a list of
- predictions for each element in the ``elements`` list. If an element does not have any predictions, use an empty list.
+ predictions for each element in the ``elements`` list. If an element does not have any predictions, use an empty list.
The following shows an example of how to upload predictions for the "Text Classification"
task in LabelStudio:
@@ -1167,13 +1168,13 @@ Examples:
``stage_for_label_box``
--------------------------

- Formats outputs for use with `LabelBox <https://docs.labelbox.com/docs/overview>`_. LabelBox accepts cloud-hosted data
+ Formats outputs for use with `LabelBox <https://docs.labelbox.com/docs/overview>`_. LabelBox accepts cloud-hosted data
and does not support importing text directly. The ``stage_for_label_box`` brick does the following:

* Stages the data files in the ``output_directory`` specified in function arguments to be uploaded to a cloud storage service.
* Returns a config of type ``List[Dict[str, Any]]`` that can be written to a ``json`` file and imported into LabelBox.

- **Note:** ``stage_for_label_box`` does not upload the data to remote storage such as S3. Users can upload the data to S3
+ **Note:** ``stage_for_label_box`` does not upload the data to remote storage such as S3. Users can upload the data to S3
using ``aws s3 sync ${output_directory} ${url_prefix}`` after running the ``stage_for_label_box`` staging brick.

Examples:
@@ -1197,7 +1198,7 @@ files to an S3 bucket.
# The URL prefix where the data files will be accessed.
S3_URL_PREFIX = f"https://{S3_BUCKET_NAME}.s3.amazonaws.com/{S3_BUCKET_KEY_PREFIX}"
-
+
# The local output directory where the data files will be staged for uploading to a Cloud Storage service.
LOCAL_OUTPUT_DIRECTORY = "/tmp/labelbox-staging"
@@ -1232,7 +1233,7 @@ files to an S3 bucket.
--------------------------
Formats a list of ``Text`` elements as input to token based tasks in Datasaur.

- Example:
+ Example:

.. code:: python
@@ -1243,7 +1244,7 @@ Example:
datasaur_data = stage_for_datasaur(elements)

The output is a list of dictionaries, each one with two keys:
- "text" with the content of the element and
+ "text" with the content of the element and
"entities" with an empty list.
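
The output shape described above can be sketched in plain Python. This is a hedged illustration of the data structure only; ``to_datasaur_shape`` is a hypothetical helper that works on plain strings, not the ``stage_for_datasaur`` brick itself.

```python
from typing import Any, Dict, List


def to_datasaur_shape(texts: List[str]) -> List[Dict[str, Any]]:
    """Hypothetical helper: pair each text with an empty list of entities."""
    return [{"text": text, "entities": []} for text in texts]


records = to_datasaur_shape(["Text 1", "Text 2"])
```

Each record carries the element's content under ``"text"`` and starts with no entities under ``"entities"``.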
You can also specify entities in the ``stage_for_datasaur`` brick. Entities