snippets/general-shared-text/get-started-single-file-ui-part-2.mdx (+8 −8)
@@ -218,8 +218,8 @@ and generating detected entities (such as people and organizations) and the infe
 2. In the node's settings pane's **Details** tab, click:

    - **Table** under **Input Type**.
-   - Any available choice under **Provider**.
-   - Any available choice under **Model**.
+   - Any available choice under **Provider** (for example, **Anthropic**).
+   - Any available choice under **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
    - If not already selected, **Table Description** under **Task**.

 <Tip>
@@ -232,8 +232,8 @@ and generating detected entities (such as people and organizations) and the infe
 In the node's settings pane's **Details** tab, click:

    - **Text** under **Input Type**.
-   - Any available choice under **Provider**.
-   - Any available choice under **Model**.
+   - Any available choice under **Provider** (for example, **Anthropic**).
+   - Any available choice under **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).

 <Tip>
 The named entity recognition (NER) enrichment generates a list of detected entities (such as people and organizations) and the inferred relationships among these entities. This provides additional context about these entities' types and their relationships for your graph databases, RAG apps, agents, and models. [Learn more](/ui/enriching/ner).
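If you want to poke at the NER results outside the UI, here is a minimal sketch, assuming you have saved the workflow's JSON output locally (the `output.json` file name is hypothetical) and that the enrichment records entities under each element's `metadata`. The key names shown are an assumption; check your actual output for the exact schema.

```python
# A minimal sketch for browsing NER results in saved JSON output.
# Assumption: entities appear under each element's metadata as a list of
# {"entity": ..., "type": ...} objects. Verify against your actual output.
import json

with open("output.json", encoding="utf-8") as f:  # hypothetical file name
    elements = json.load(f)

for element in elements:
    metadata = element.get("metadata", {})
    for item in metadata.get("entities", []):  # key name is an assumption
        print(f"{item.get('entity')}: {item.get('type')}")
```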
@@ -320,7 +320,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
 these chunks were derived from by putting them into each chunk's `metadata`. To have Unstructured do this, use the **Include Original Elements** setting, as described in the preceding tip.
 </Tip>

-7. Try running this workflow again with the **Chunk by Title** strategy, as follows:
+7. Optionally, you can try running this workflow again with the **Chunk by Title** strategy, as follows:

    a. Click the close (**X**) button above the output on the right side of the screen.<br/>
    b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Title**.<br/>
@@ -344,7 +344,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
 f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that the lengths of some of the chunks that immediately
 precede titles might be shortened due to the presence of the title impacting the chunk's size.

-8. Try running this workflow again with the **Chunk by Page** strategy, as follows:
+8. Optionally, you can try running this workflow again with the **Chunk by Page** strategy, as follows:

    a. Click the close (**X**) button above the output on the right side of the screen.<br/>
    b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Page**.<br/>
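To compare chunking strategies more precisely than eyeballing the output pane allows, a short sketch like the following can help. It assumes you have downloaded each run's JSON output locally; the `output.json` file name is hypothetical, while the `type` and `text` fields are standard in Unstructured's JSON output.

```python
# Compare chunk sizes across runs: load a saved JSON output file and list
# the length of every CompositeElement's text. File name is hypothetical.
import json

with open("output.json", encoding="utf-8") as f:
    elements = json.load(f)

chunks = [e for e in elements if e.get("type") == "CompositeElement"]
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk['text'])} characters")
print(f"{len(chunks)} chunks total")
```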
@@ -361,7 +361,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
 f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that the lengths of some of the chunks that immediately
 precede page breaks might be shortened due to the presence of the page break impacting the chunk's size.<br/>

-9. Try running this workflow again with the **Chunk by Similarity** strategy, as follows:
+9. Optionally, you can try running this workflow again with the **Chunk by Similarity** strategy, as follows:

    a. Click the close (**X**) button above the output on the right side of the screen.<br/>
    b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Similarity**.<br/>
@@ -391,7 +391,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
 10. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
 the workflow designer for the next step.

-## Step 6: Experiment with embedding
+## Step 6 (Optional): Experiment with embedding

 In this step, you generate [embeddings](/ui/embedding) for your workflow. Embeddings are vectors of numbers that represent various aspects of the text that is extracted by Unstructured.
 These vectors are stored or "embedded" next to the text itself in a vector store or vector database. Chatbots, agents, and other AI solutions can use
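As a rough illustration of what "nearby" means for embedding vectors, the sketch below computes cosine similarity between small made-up vectors. The values and labels are invented for the example; real embeddings typically have hundreds or thousands of dimensions.

```python
# Illustration only: cosine similarity is a common way to measure how close
# two embedding vectors are. These tiny vectors are made up for the example.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

query = [0.1, 0.9, 0.2]
stored = {
    "chunk about verbs": [0.2, 0.8, 0.1],
    "chunk about tables": [0.9, 0.1, 0.3],
}
for text, vector in stored.items():
    print(f"{text}: {cosine_similarity(query, vector):.3f}")
```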
ui/walkthrough.mdx (+20 −19)
@@ -160,7 +160,7 @@ shows how well Unstructured's **VLM** partitioning strategy handles challenging
 - **VLM** is great for any file, but it is best when you know for certain that some of your files have a combination of tables (especially complex ones), images, and multilanguage, scanned, or handwritten content. It's the highest quality but slowest of all the strategies.
 </Tip>

-4. Under **Select VLM Model**, under **Anthropic**, select **Claude Sonnet 4**.<br/>
+4. Under **Select VLM Model**, under **Vertex AI**, select **Gemini 2.0 Flash**.<br/>

 

@@ -208,7 +208,7 @@ shows how well Unstructured's **VLM** partitioning strategy handles challenging

 

-   - The Chinese characters on page 1. Search for the text `verbs. The characters`. Notice that the Chinese characters are intepreted correctly.
+   - The Chinese characters on page 1. Search for the text `verbs. The characters`. Notice how the Chinese characters are output. We'll see accuracy improvements to this output later in Step 4 in the enrichments portion of this walkthrough.
    - The tables on pages 1, 6, 7, 8, 9, and 12. Search for the text `"Table"` (including the quotation marks) to see how the VLM interprets the various tables. We'll see changes to these elements' `text` and `metadata.text_as_html` contents later in Step 4 in the enrichments portion of this walkthrough.
    - The images on pages 3, 7, and 8. Search for the text `"Image"` (including the quotation marks) to see how the VLM interprets the various images. We'll see changes to these elements' `text` contents later in Step 4 in the enrichments portion of this walkthrough.

@@ -222,7 +222,7 @@ shows how well Unstructured's **VLM** partitioning strategy handles challenging
 9. Notice the following in the JSON output:

    - The handwriting on page 3. Search for the text `I have written RAND`. Notice how well the handwriting is recognized.
-   - The mimeograph on page 11. Search for the text `Technicians at this Agency`. Notice how well the mimeographed content is recognized.
+   - The mimeograph on page 18. Search for the text `The system which`. Notice how well the mimeographed content is recognized.

 10. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
 the workflow designer for the next step.
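Rather than searching the JSON output pane by hand, you can tally element types in a saved copy of the output. This is a sketch under the assumption that you have downloaded the output locally; the file name is hypothetical, but the `type`, `text`, and `metadata.page_number` fields are standard in Unstructured's JSON output.

```python
# Tally element types in saved JSON output, then pull one page's images.
# The file name is hypothetical; the fields used are standard output fields.
import json
from collections import Counter

with open("output.json", encoding="utf-8") as f:
    elements = json.load(f)

print(Counter(e.get("type") for e in elements))

# For example, list the text of every Image element on page 3.
for e in elements:
    if e.get("type") == "Image" and e.get("metadata", {}).get("page_number") == 3:
        print(e.get("text"))
```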
@@ -240,8 +240,8 @@ HTML representations of detected tables, and detected entities (such as people a
 3. In the node's settings pane's **Details** tab, click:

    - **Image** under **Input Type**.
-   - Any available choice for **Provider**.
-   - Any available choice for **Model**.
+   - Any available choice for **Provider** (for example, **Anthropic**).
+   - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
    - If not already selected, **Image Description** under **Task**.

 <Tip>
@@ -257,8 +257,8 @@ HTML representations of detected tables, and detected entities (such as people a
 In the node's settings pane's **Details** tab, click:

    - **Table** under **Input Type**.
-   - Any available choice for **Provider**.
-   - Any available choice for **Model**.
+   - Any available choice for **Provider** (for example, **Anthropic**).
+   - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
    - If not already selected, **Table Description** under **Task**.

 <Tip>
@@ -271,8 +271,8 @@ HTML representations of detected tables, and detected entities (such as people a
 In the node's settings pane's **Details** tab, click:

    - **Table** under **Input Type**.
-   - **OpenAI** under **Provider**.
-   - Any available choice under **Model**.
+   - Any available choice for **Provider** (for example, **Anthropic**).
+   - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
    - **Table to HTML** under **Task**.

 <Tip>
@@ -284,8 +284,8 @@ HTML representations of detected tables, and detected entities (such as people a
 In the node's settings pane's **Details** tab, click:

    - **Text** under **Input Type**.
-   - Any available choice under **Provider**.
-   - Any available choice under **Model**.
+   - Any available choice for **Provider** (for example, **Anthropic**).
+   - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).

 <Tip>
 The named entity recognition (NER) enrichment generates a list of detected entities (such as people and organizations) and the inferred relationships among these entities. This provides additional context about these entities' types and their relationships for your graph databases, RAG apps, agents, and models. [Learn more](/ui/enriching/ner).
@@ -296,8 +296,8 @@ HTML representations of detected tables, and detected entities (such as people a
 In the node's settings pane's **Details** tab, click:

    - **Image** under **Input Type**.
-   - **Anthropic** or **Amazon Bedrock** under **Provider**.
-   - Any available choice under **Model**.
+   - Any available choice for **Provider** (for example, **Anthropic**).
+   - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
    - **Generative OCR** under **Task**.

 <Tip>
@@ -320,8 +320,9 @@ HTML representations of detected tables, and detected entities (such as people a
 7. Some interesting portions of the output include the following:

-   - The images on pages 3, 7, and 8. Search for the text `"Image"` (including the quotation marks). Notice the summary description for each image.
-   - The tables on pages 1, 6, 7, 8, 9, and 12. Search for the text `"Table"` (including the quotation marks). Notice the summary description for each of these tables.
+   - The Chinese characters on page 1. Search again for the text `verbs. The characters`. Notice how the accuracy of the Chinese character output is improved.
+   - The images on pages 3, 7, and 8. Search again for the text `"Image"` (including the quotation marks). Notice the summary description for each image.
+   - The tables on pages 1, 6, 7, 8, 9, and 12. Search again for the text `"Table"` (including the quotation marks). Notice the summary description for each of these tables.
      Also notice the `text_as_html` field for each of these tables.
    - The identified entities and inferred relationships among them. Search for the text `Zhijun Wang`. Of the eight instances of this name, notice
 the author's identification as a `PERSON` three times, the author's `published` relationship twice, and the author's `affiliated_with` relationship twice.
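To locate those instances programmatically instead of searching by hand, a schema-agnostic sketch like the following works on a saved copy of the output (the file name is hypothetical): it serializes each element back to JSON and scans for the name, so it catches mentions in `text` and in entity metadata alike.

```python
# Find every element that mentions the author anywhere (text or metadata)
# by serializing each element back to JSON. File name is hypothetical.
import json

with open("output.json", encoding="utf-8") as f:
    elements = json.load(f)

hits = [e for e in elements if "Zhijun Wang" in json.dumps(e)]
print(f"{len(hits)} elements mention Zhijun Wang")
for e in hits:
    print(e.get("type"), (e.get("text") or "")[:80])
```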
@@ -395,7 +396,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
 these chunks were derived from by putting them into each chunk's `metadata`. To have Unstructured do this, use the **Include Original Elements** setting, as described in the preceding tip.
 </Tip>

-7. Try running this workflow again with the **Chunk by Title** strategy, as follows:
+7. Optionally, you can try running this workflow again with the **Chunk by Title** strategy, as follows:

    a. Click the close (**X**) button above the output on the right side of the screen.<br/>
    b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Title**.<br/>
@@ -419,7 +420,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
 f. To explore the chunker's results, search for the text `"CompositeElement"` (including the quotation marks). Notice that the lengths of some of the chunks that immediately
 precede titles might be shortened due to the presence of the title impacting the chunk's size.

-8. Try running this workflow again with the **Chunk by Page** strategy, as follows:
+8. Optionally, you can try running this workflow again with the **Chunk by Page** strategy, as follows:

    a. Click the close (**X**) button above the output on the right side of the screen.<br/>
    b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Page**.<br/>
@@ -436,7 +437,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
 f. To explore the chunker's results, search for the text `"CompositeElement"` (including the quotation marks). Notice that the lengths of some of the chunks that immediately
 precede page breaks might be shortened due to the presence of the page break impacting the chunk's size.<br/>

-9. Try running this workflow again with the **Chunk by Similarity** strategy, as follows:
+9. Optionally, you can try running this workflow again with the **Chunk by Similarity** strategy, as follows:

    a. Click the close (**X**) button above the output on the right side of the screen.<br/>
    b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Similarity**.<br/>
@@ -466,7 +467,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
 10. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
 the workflow designer for the next step.

-## Step 6: Experiment with embedding
+## Step 6 (Optional): Experiment with embedding

 In this step, you generate [embeddings](/ui/embedding) for your workflow. Embeddings are vectors of numbers that represent various aspects of the text that is extracted by Unstructured.
 These vectors are stored or "embedded" next to the text itself in a vector store or vector database. Chatbots, agents, and other AI solutions can use