snippets/general-shared-text/get-started-single-file-ui-part-2.mdx (18 additions, 22 deletions)
@@ -180,8 +180,7 @@ In this step, you will test the **High Res** partitioning strategy on the "Chine
 - You can scroll through the original file on the left or, where supported for a given file type, click the up and down arrows to page through the file one page at a time.
 - You can scroll through Unstructured's JSON output on the right, and you can click **Search JSON** to search for specific text in the JSON output. You will do this next.
 - **Download Full JSON** allows you to download the full output to your local machine as a JSON file.
-- **View JSON at this step** allows you to view the JSON output at each step in the workflow as it was further processed. There's only one step right now (the **Partitioner** step),
-  but as you add more nodes to the workflow DAG, this can be a useful tool to see how the JSON output changes along the way.
+- **View JSON at this step** allows you to view the JSON output at each step in the workflow as it is further processed.
 - The close (**X**) button returns you to the workflow designer.
 </Tip>
 
@@ -193,23 +192,20 @@ In this step, you will test the **High Res** partitioning strategy on the "Chine
 
 
 
-- The Chinese characters on page 1. Search for the text `all have the meaning of acting`. Notice how the Chinese characters are captures correctly.
-- The HTML representations of the seven tables on pages 6-9 and 12. Search for the text `"text_as_html":`.
-- The descriptions of the four diagrams on page 3. Search for the text `\"diagram\",\n \"description\"`.
-- The descriptions of the three graphs on pages 7-8. Search for the text `\"graph\",\n \"description\"`.
-- The Base64-encoded, full-fidelity representations of the 14 tables, diagrams, and graphs on pages 3, 6-9, and 12.
-  Search for the text `"image_base64":`. You can use a web-based tool such as [base64.guru](https://base64.guru/converter/decode/image)
-  to experiment with decoding these representations back into their original visual representations.
+- The Chinese characters on page 1. Search for the text `verbs. The characters`. Notice how the Chinese characters are captured correctly.
+- The HTML representations of the seven tables on pages 6-9 and 12. Search for the text `text_as_html`.
+- The descriptions of the four diagrams on page 3. Search for the text `"diagram` (including the opening quotation mark).
+- The descriptions of the three graphs on pages 7-8. Search for the text `"graph` (including the opening quotation mark).
 
-8. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
+7. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
    the workflow designer for the next step.
 
 ## Step 4: Add more enrichments
 
 Your existing workflow already has three **Enrichment** nodes. Recall that these nodes perform the following enrichments:
 
 - An [image description](ui/enriching/image-descriptions) enrichment, which uses a vision language model (VLM) to provide a text-based summary of the contents of each detected image.
-- A [generative OCR](/ui/enriching/generative-ocr) enrichment, which uses a VLM to improve the accuracy of each block of initially-processed text.
+- A [generative OCR](/ui/enriching/generative-ocr) enrichment, which uses a VLM to improve the accuracy of each block of initially-processed text, as needed.
 - A [table to HTML](/ui/enriching/table-to-html) enrichment, which uses a VLM to provide an HTML-structured representation of each detected table.
 
 In this step, you add a few more [enrichments](/ui/enriching/overview) to your workflow, such as generating summary descriptions of detected tables,
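The searches described in the bullets above can also be reproduced offline against the file saved via **Download Full JSON**. Here is a minimal Python sketch, assuming a downloaded file named `partitioned-output.json` and the element layout implied by the search strings above (`text_as_html` and `image_base64` keys under each element's `metadata`); the filename and the `.png` extension are illustrative assumptions only.

```python
import base64
import json

# Hypothetical filename for the file saved via "Download Full JSON".
with open("partitioned-output.json", encoding="utf-8") as f:
    elements = json.load(f)  # assumed: a JSON array of element objects

for i, element in enumerate(elements):
    metadata = element.get("metadata", {})

    # Tables carry an HTML rendering under metadata["text_as_html"].
    if "text_as_html" in metadata:
        print(f"Element {i} ({element.get('type')}):")
        print(metadata["text_as_html"][:200])

    # Tables, diagrams, and graphs can carry a Base64-encoded snapshot
    # under metadata["image_base64"]. The .png extension is a guess;
    # check the element's MIME type metadata if you need to be exact.
    if "image_base64" in metadata:
        with open(f"element-{i}.png", "wb") as out:
            out.write(base64.b64decode(metadata["image_base64"]))
```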
@@ -222,9 +218,9 @@ and generating detected entities (such as people and organizations) and the infe
 2. In the node's settings pane's **Details** tab, click:
 
    - **Table** under **Input Type**.
-   - **Anthropic** under **Provider**.
-   - **Claude Sonnet 4** under **Model**.
-   - **Table Description** under **Task**.
+   - Any available choice under **Provider**.
+   - Any available choice under **Model**.
+   - If not already selected, **Table Description** under **Task**.
 
 <Tip>
 The table description enrichment generates a summary description of each detected table. This can help you to more quickly and easily understand
@@ -236,23 +232,23 @@ and generating detected entities (such as people and organizations) and the infe
    In the node's settings pane's **Details** tab, click:
 
    - **Text** under **Input Type**.
-   - **Anthropic** under **Provider**.
-   - **Claude Sonnet 4** under **Model**.
+   - Any available choice under **Provider**.
+   - Any available choice under **Model**.
 
 <Tip>
 The named entity recognition (NER) enrichment generates a list of detected entities (such as people and organizations) and the inferred relationships among these entities. This provides additional context about these entities' types and their relationships for your graph databases, RAG apps, agents, and models. [Learn more](/ui/enriching/ner).
 </Tip>
 
-The workflow designer should now look like this:
+The workflow designer should now look similar to this:
 
 
 
 4. Immediately above the **Source** node, click **Test**.
 5. In the **Test output** pane, make sure that **Enrichment (6 of 6)** is showing. If not, click the right arrow (**>**) until **Enrichment (6 of 6)** appears, which will show the output from the last node in the workflow.
 6. Some interesting portions of the output include the following:
 
-   - The descriptions of the seven tables on pages 6-9 and 12. Search for the text `## Table Structure Analysis\n\n###`.
-   - The identified entities and inferred relationships among them. For example, search for the text `Zhijun Wang`. Of the eight instances of this name, notice
+   - The descriptions of the seven tables on pages 6-9 and 12. Search for the text `"Table"` (including the quotation marks).
+   - The identified entities and inferred relationships among them. For example, search for the text `Zhijun Wang`. Of the nine instances of this name, notice
      the author's identification as a `PERSON` three times, the author's `published` relationship twice, and the author's `affiliated_with` relationship twice.
 
 7. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
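To double-check a count like the nine `Zhijun Wang` hits outside the UI, a schema-agnostic sketch that searches the raw downloaded JSON text, mirroring what **Search JSON** does (the filename is a hypothetical stand-in):

```python
SEARCH = "Zhijun Wang"

# Hypothetical filename for the downloaded enrichment output. Searching
# the raw text keeps the count aligned with the UI's "Search JSON",
# regardless of where the NER enrichment nests entities in metadata.
with open("enriched-output.json", encoding="utf-8") as f:
    raw = f.read()

print(f"{raw.count(SEARCH)} instance(s) of {SEARCH!r} found")

# Print a little surrounding context for each hit, which should surface
# labels such as PERSON, published, and affiliated_with near the name.
start = 0
while (index := raw.find(SEARCH, start)) != -1:
    snippet = raw[max(0, index - 60) : index + 60].replace("\n", " ")
    print("...", snippet, "...")
    start = index + len(SEARCH)
```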
@@ -303,11 +299,11 @@ the resulting document elements' `text` content into manageable "chunks" to stay
 _What does each of these chunking settings do?_
 
 - **Contextual Chunking** prepends chunk-specific explanatory context to each chunk, which has been shown to yield significant improvements in downstream retrieval accuracy. [Learn more](/ui/chunking#contextual-chunking).
-- **Include Original Elements** outputs into each chunk's `metadata` field's `orig_elements` value the elements that were used to form that particular chunk. [Learn more](/ui/chunking#include-original-elements-setting).
+- **Include Original Elements** outputs into each chunk's `metadata` field's `orig_elements` value the elements that were used to form that particular chunk. These elements are output in gzip compressed, Base64 encoded format. To get back to the original content, Base64 decode and then gzip decompress the bytes as UTF-8. [Learn more](/ui/chunking#include-original-elements-setting).
 - **Max Characters** is the "hard" or maximum number of characters that any one chunk can contain. Unstructured cannot exceed this number when forming chunks. [Learn more](/ui/chunking#max-characters-setting).
 - **New After N Characters** is the "soft" or approximate number of characters that any one chunk can contain. Unstructured can exceed this number if needed when forming chunks (but still cannot exceed the **Max Characters** setting). [Learn more](/ui/chunking#new-after-n-characters-setting).
 - **Overlap**, when applied (see **Overlap All**), prepends to the current chunk the specified number of characters from the previous chunk, which can help provide additional context about this chunk relative to the previous chunk. [Learn more](/ui/chunking#overlap-setting).
-- **Overlap All** applies the **Overlap** setting (if greater than zero) to all chunks. Otherwise, unchecking this box means that the **Overlap** setting (if greater than zero)is applied only in edge cases where "normal" chunks cannot be formed by combining whole elements. Check this box with caution as it can introduce noise into otherwise clean semantic units. [Learn more](/ui/chunking#overlap-all-setting).
+- **Overlap All** applies the **Overlap** setting (if greater than zero) to all chunks. Otherwise, unchecking this box means that the **Overlap** setting (if greater than zero) is applied only in edge cases where "normal" chunks cannot be formed by combining whole elements. Check this box with caution as it can introduce noise into otherwise clean semantic units. [Learn more](/ui/chunking#overlap-all-setting).
 </Tip>
 
 4. Immediately above the **Source** node, click **Test**.
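The hard-versus-soft distinction above is easy to see in the open-source `unstructured` Python library, whose `chunk_by_title` function exposes similarly named parameters. A sketch under that assumption, using made-up toy elements rather than the walkthrough's PDF:

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import NarrativeText, Title

# Toy elements standing in for real partitioner output.
elements = [
    Title("Verbs"),
    NarrativeText("A sentence about verbs. " * 20),
    NarrativeText("Another sentence about verbs. " * 20),
]

chunks = chunk_by_title(
    elements,
    max_characters=300,     # "hard" cap: no chunk may exceed this
    new_after_n_chars=200,  # "soft" cap: start a new chunk past this
    overlap=30,             # characters carried over from the previous chunk
    overlap_all=True,       # apply the overlap to every chunk
)

for chunk in chunks:
    print(len(chunk.text), repr(chunk.text[:50]))
```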
@@ -409,7 +405,7 @@ embedding model that is provided by an embedding provider. For the best embeddin
 2. In the node's settings pane's **Details** tab, under **Select Embedding Model**, for **Azure OpenAI**, select **Text Embedding 3 Small [dim 1536]**.
 3. Immediately above the **Source** node, click **Test**.
 4. In the **Test output** pane, make sure that **Embedder (8 of 8)** is showing. If not, click the right arrow (**>**) until **Embedder (8 of 8)** appears, which will show the output from the last node in the workflow.
-5. To explore the embeddings, search for the text `"embeddings"`.
+5. To explore the embeddings, search for the text `"embeddings"` (including the quotation marks).
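Beyond searching for `"embeddings"` in the UI, the vectors in the downloaded output can be inspected directly. A sketch, assuming a downloaded file named `embedded-output.json` in which each element carries its vector under an `embeddings` key (per the search string above):

```python
import json
import math

# Hypothetical filename for the downloaded Embedder-node output.
with open("embedded-output.json", encoding="utf-8") as f:
    elements = json.load(f)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vectors = [e["embeddings"] for e in elements if e.get("embeddings")]
print(f"{len(vectors)} embedded elements; dimensions = {len(vectors[0])}")

# Text Embedding 3 Small should yield 1536 dimensions per element.
if len(vectors) >= 2:
    print("cosine(first, second) =", round(cosine(vectors[0], vectors[1]), 4))
```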
ui/chunking.mdx (1 addition, 1 deletion)
@@ -103,7 +103,7 @@ Here are a few examples:
 
 If the option to include original elements is specified, during chunking the `orig_elements` field is added to the `metadata` field of each chunked element.
 The `orig_elements` field is a list of the original elements that were used to create the current chunked element. This list is output in
-compressed Base64 gzipped format. To get back to the original content for this list, Base64-decode the list's bytes, decompress them, and then decode them using UTF-8.
+gzip compressed, Base64-encoded format. To get back to the original content for this list, Base64-decode the list's bytes, and then gzip decompress them as UTF-8.
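The decode procedure in the changed line above takes only a few lines of Python. A sketch; the filename is hypothetical, and the assumption (consistent with the text above) is that the decompressed bytes are UTF-8 JSON listing the original elements:

```python
import base64
import gzip
import json

# Hypothetical filename for chunked output produced with the
# "Include Original Elements" option enabled.
with open("chunked-output.json", encoding="utf-8") as f:
    chunk = json.load(f)[0]  # first chunked element

encoded = chunk["metadata"]["orig_elements"]

# Base64-decode, then gzip-decompress, then decode as UTF-8,
# exactly as described above.
original_json = gzip.decompress(base64.b64decode(encoded)).decode("utf-8")

for element in json.loads(original_json):
    print(element.get("type"), "-", (element.get("text") or "")[:60])
```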