Commit cbf9b20

authored
Walkthrough updates based on customer session feedback (#793)
1 parent 077e3d7 commit cbf9b20

File tree

3 files changed

+28
-27
lines changed

-5.93 KB

snippets/general-shared-text/get-started-single-file-ui-part-2.mdx

Lines changed: 8 additions & 8 deletions
@@ -218,8 +218,8 @@ and generating detected entities (such as people and organizations) and the infe
  2. In the node's settings pane's **Details** tab, click:
  
  - **Table** under **Input Type**.
- - Any available choice under **Provider**.
- - Any available choice under **Model**.
+ - Any available choice under **Provider** (for example, **Anthropic**).
+ - Any available choice under **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
  - If not already selected, **Table Description** under **Task**.
  
  <Tip>
@@ -232,8 +232,8 @@ and generating detected entities (such as people and organizations) and the infe
  In the node's settings pane's **Details** tab, click:
  
  - **Text** under **Input Type**.
- - Any available choice under **Provider**.
- - Any available choice under **Model**.
+ - Any available choice under **Provider** (for example, **Anthropic**).
+ - Any available choice under **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
  
  <Tip>
  The named entity recognition (NER) enrichment generates a list of detected entities (such as people and organizations) and the inferred relationships among these entities. This provides additional context about these entities' types and their relationships for your graph databases, RAG apps, agents, and models. [Learn more](/ui/enriching/ner).
@@ -320,7 +320,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
  these chunks were derived from by putting them into each chunk's `metadata`. To have Unstructured do this, use the **Include Original Elements** setting, as described in the preceding tip.
  </Tip>
  
- 7. Try running this workflow again with the **Chunk by Title** strategy, as follows:
+ 7. Optionally, you can try running this workflow again with the **Chunk by Title** strategy, as follows:
  
  a. Click the close (**X**) button above the output on the right side of the screen.<br/>
  b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Title**.<br/>
@@ -344,7 +344,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
  f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that the lengths of some of the chunks that immediately
  precede titles might be shortened due to the presence of the title impacting the chunk's size.
  
- 8. Try running this workflow again with the **Chunk by Page** strategy, as follows:
+ 8. Optionally, you can try running this workflow again with the **Chunk by Page** strategy, as follows:
  
  a. Click the close (**X**) button above the output on the right side of the screen.<br/>
  b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Page**.<br/>
@@ -361,7 +361,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
  f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that the lengths of some of the chunks that immediately
  precede page breaks might be shortened due to the presence of the page break impacting the chunk's size.<br/>
  
- 9. Try running this workflow again with the **Chunk by Similarity** strategy, as follows:
+ 9. Optionally, you can try running this workflow again with the **Chunk by Similarity** strategy, as follows:
  
  a. Click the close (**X**) button above the output on the right side of the screen.<br/>
  b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Similarity**.<br/>
@@ -391,7 +391,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
  10. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
  the workflow designer for the next step.
  
- ## Step 6: Experiment with embedding
+ ## Step 6 (Optional): Experiment with embedding
  
  In this step, you generate [embeddings](/ui/embedding) for your workflow. Embeddings are vectors of numbers that represent various aspects of the text that is extracted by Unstructured.
  These vectors are stored or "embedded" next to the text itself in a vector store or vector database. Chatbots, agents, and other AI solutions can use
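The chunker steps in the diff above find chunks by searching the JSON output for `"type": "CompositeElement"`. The same check can be scripted. Here is a minimal sketch in Python, using an invented four-element excerpt of chunked output; only the `type` and `text` fields, which match the walkthrough's search strings, are assumed:

```python
import json

# Hypothetical excerpt of Unstructured JSON output after chunking.
output = json.loads("""
[
  {"type": "Title", "text": "Introduction"},
  {"type": "CompositeElement", "text": "First chunk of body text..."},
  {"type": "CompositeElement", "text": "Short chunk before a title"},
  {"type": "Title", "text": "Methods"}
]
""")

# Equivalent of searching the output for '"type": "CompositeElement"':
chunks = [el for el in output if el["type"] == "CompositeElement"]

# Chunks that immediately precede a title or page break may be shorter,
# as the walkthrough notes.
chunk_lengths = [len(el["text"]) for el in chunks]
print(chunk_lengths)
```

Comparing `chunk_lengths` across runs with the **Chunk by Title**, **Chunk by Page**, and **Chunk by Similarity** strategies makes the shortened chunks the walkthrough points out easy to spot.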

ui/walkthrough.mdx

Lines changed: 20 additions & 19 deletions
@@ -160,7 +160,7 @@ shows how well Unstructured's **VLM** partitioning strategy handles challenging
  - **VLM** is great for any file, but it is best when you know for certain that some of your files have a combination of tables (especially complex ones), images, and multilanguage, scanned, or handwritten content. It's the highest quality but slowest of all the strategies.
  </Tip>
  
- 4. Under **Select VLM Model**, under **Anthropic**, select **Claude Sonnet 4**.<br/>
+ 4. Under **Select VLM Model**, under **Vertex AI**, select **Gemini 2.0 Flash**.<br/>
  
  ![Selecting the VLM for partitioning](/img/ui/walkthrough/VLMPartitioner.png)
  
@@ -208,7 +208,7 @@ shows how well Unstructured's **VLM** partitioning strategy handles challenging
  
  ![Searching the JSON output](/img/ui/walkthrough/SearchJSON.png)
  
- - The Chinese characters on page 1. Search for the text `verbs. The characters`. Notice that the Chinese characters are intepreted correctly.
+ - The Chinese characters on page 1. Search for the text `verbs. The characters`. Notice how the Chinese characters are output. We'll see accuracy improvements to this output later in Step 4 in the enrichments portion of this walkthrough.
  - The tables on pages 1, 6, 7, 8, 9, and 12. Search for the text `"Table"` (including the quotation marks) to see how the VLM interprets the various tables. We'll see changes to these elements' `text` and `metadata.text_as_html` contents later in Step 4 in the enrichments portion of this walkthrough.
  - The images on pages 3, 7, and 8. Search for the text `"Image"` (including the quotation marks) to see how the VLM interprets the various images. We'll see changes to these elements' `text` contents later in Step 4 in the enrichments portion of this walkthrough.
  
@@ -222,7 +222,7 @@ shows how well Unstructured's **VLM** partitioning strategy handles challenging
  9. Notice the following in the JSON output:
  
  - The handwriting on page 3. Search for the text `I have written RAND`. Notice how well the handwriting is recognized.
- - The mimeograph on page 11. Search for the text `Technicians at this Agency`. Notice how well the mimeographed content is recognized.
+ - The mimeograph on page 18. Search for the text `The system which`. Notice how well the mimeographed content is recognized.
  
  10. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
  the workflow designer for the next step.
@@ -240,8 +240,8 @@ HTML representations of detected tables, and detected entities (such as people a
  3. In the node's settings pane's **Details** tab, click:
  
  - **Image** under **Input Type**.
- - Any available choice for **Provider**.
- - Any available choice for **Model**.
+ - Any available choice for **Provider** (for example, **Anthropic**).
+ - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
  - If not already selected, **Image Description** under **Task**.
  
  <Tip>
@@ -257,8 +257,8 @@ HTML representations of detected tables, and detected entities (such as people a
  In the node's settings pane's **Details** tab, click:
  
  - **Table** under **Input Type**.
- - Any available choice for **Provider**.
- - Any available choice for **Model**.
+ - Any available choice for **Provider** (for example, **Anthropic**).
+ - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
  - If not already selected, **Table Description** under **Task**.
  
  <Tip>
@@ -271,8 +271,8 @@ HTML representations of detected tables, and detected entities (such as people a
  In the node's settings pane's **Details** tab, click:
  
  - **Table** under **Input Type**.
- - **OpenAI** under **Provider**.
- - Any available choice under **Model**.
+ - Any available choice for **Provider** (for example, **Anthropic**).
+ - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
  - **Table to HTML** under **Task**.
  
  <Tip>
@@ -284,8 +284,8 @@ HTML representations of detected tables, and detected entities (such as people a
  In the node's settings pane's **Details** tab, click:
  
  - **Text** under **Input Type**.
- - Any available choice under **Provider**.
- - Any available choice under **Model**.
+ - Any available choice for **Provider** (for example, **Anthropic**).
+ - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
  
  <Tip>
  The named entity recognition (NER) enrichment generates a list of detected entities (such as people and organizations) and the inferred relationships among these entities. This provides additional context about these entities' types and their relationships for your graph databases, RAG apps, agents, and models. [Learn more](/ui/enriching/ner).
@@ -296,8 +296,8 @@ HTML representations of detected tables, and detected entities (such as people a
  In the node's settings pane's **Details** tab, click:
  
  - **Image** under **Input Type**.
- - **Anthropic** or **Amazon Bedrock** under **Provider**.
- - Any available choice under **Model**.
+ - Any available choice for **Provider** (for example, **Anthropic**).
+ - Any available choice for **Model** (for example, **Claude Sonnet 4.5** if you chose **Anthropic** for **Provider**).
  - **Generative OCR** under **Task**.
  
  <Tip>
@@ -320,8 +320,9 @@ HTML representations of detected tables, and detected entities (such as people a
  
  7. Some interesting portions of the output include the following:
  
- - The images on pages 3, 7, and 8. Search for the text `"Image"` (including the quotation marks). Notice the summary description for each image.
- - The tables on pages 1, 6, 7, 8, 9, and 12. Search for the text `"Table"` (including the quotation marks). Notice the summary description for each of these tables.
+ - The Chinese characters on page 1. Search again for the text `verbs. The characters`. Notice how the accuracy of the Chinese character output is improved.
+ - The images on pages 3, 7, and 8. Search again for the text `"Image"` (including the quotation marks). Notice the summary description for each image.
+ - The tables on pages 1, 6, 7, 8, 9, and 12. Search again for the text `"Table"` (including the quotation marks). Notice the summary description for each of these tables.
  Also notice the `text_as_html` field for each of these tables.
  - The identified entities and inferred relationships among them. Search for the text `Zhijun Wang`. Of the eight instances of this name, notice
  the author's identification as a `PERSON` three times, the author's `published` relationship twice, and the author's `affiliated_with` relationship twice.
@@ -395,7 +396,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
  these chunks were derived from by putting them into each chunk's `metadata`. To have Unstructured do this, use the **Include Original Elements** setting, as described in the preceding tip.
  </Tip>
  
- 7. Try running this workflow again with the **Chunk by Title** strategy, as follows:
+ 7. Optionally, you can try running this workflow again with the **Chunk by Title** strategy, as follows:
  
  a. Click the close (**X**) button above the output on the right side of the screen.<br/>
  b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Title**.<br/>
@@ -419,7 +420,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
  f. To explore the chunker's results, search for the text `"CompositeElement"` (including the quotation marks). Notice that the lengths of some of the chunks that immediately
  precede titles might be shortened due to the presence of the title impacting the chunk's size.
  
- 8. Try running this workflow again with the **Chunk by Page** strategy, as follows:
+ 8. Optionally, you can try running this workflow again with the **Chunk by Page** strategy, as follows:
  
  a. Click the close (**X**) button above the output on the right side of the screen.<br/>
  b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Page**.<br/>
@@ -436,7 +437,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
  f. To explore the chunker's results, search for the text `"CompositeElement"` (including the quotation marks). Notice that the lengths of some of the chunks that immediately
  precede page breaks might be shortened due to the presence of the page break impacting the chunk's size.<br/>
  
- 9. Try running this workflow again with the **Chunk by Similarity** strategy, as follows:
+ 9. Optionally, you can try running this workflow again with the **Chunk by Similarity** strategy, as follows:
  
  a. Click the close (**X**) button above the output on the right side of the screen.<br/>
  b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Similarity**.<br/>
@@ -466,7 +467,7 @@ the resulting document elements' `text` content into manageable "chunks" to stay
  10. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
  the workflow designer for the next step.
  
- ## Step 6: Experiment with embedding
+ ## Step 6 (Optional): Experiment with embedding
  
  In this step, you generate [embeddings](/ui/embedding) for your workflow. Embeddings are vectors of numbers that represent various aspects of the text that is extracted by Unstructured.
  These vectors are stored or "embedded" next to the text itself in a vector store or vector database. Chatbots, agents, and other AI solutions can use
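Step 6 in both files describes embeddings as vectors of numbers stored next to the text they represent, which chatbots and agents then compare by similarity. A minimal sketch of that retrieval idea follows; the record texts and three-number vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions):

```python
import math

# Toy records: each text is stored alongside its embedding vector,
# as in a vector store or vector database. Vectors are made up.
records = [
    {"text": "Chinese grammar uses particles.", "embedding": [0.9, 0.1, 0.3]},
    {"text": "Tables on page 6 list results.",  "embedding": [0.2, 0.8, 0.5]},
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A query is embedded the same way, then the most similar text wins.
query = [0.85, 0.15, 0.25]
best = max(records, key=lambda r: cosine(query, r["embedding"]))
print(best["text"])
```

This is the core of what a RAG app does with the embeddings this workflow generates: embed the user's question, rank stored chunks by similarity, and hand the best matches to the model.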
