
Commit 425961e

Cleanup of walkthroughs based on recent customer testing and feedback (#791)
1 parent dd058a4 commit 425961e

5 files changed: +80 −124 lines changed


snippets/general-shared-text/get-started-single-file-ui-part-2.mdx

Lines changed: 18 additions & 22 deletions
@@ -180,8 +180,7 @@ In this step, you will test the **High Res** partitioning strategy on the "Chine
  - You can scroll through the original file on the left or, where supported for a given file type, click the up and down arrows to page through the file one page at a time.
  - You can scroll through Unstructured's JSON output on the right, and you can click **Search JSON** to search for specific text in the JSON output. You will do this next.
  - **Download Full JSON** allows you to download the full output to your local machine as a JSON file.
- - **View JSON at this step** allows you to view the JSON output at each step in the workflow as it was further processed. There's only one step right now (the **Partitioner** step),
-   but as you add more nodes to the workflow DAG, this can be a useful tool to see how the JSON output changes along the way.
+ - **View JSON at this step** allows you to view the JSON output at each step in the workflow as it is further processed.
  - The close (**X**) button returns you to the workflow designer.
  </Tip>

@@ -193,23 +192,20 @@ In this step, you will test the **High Res** partitioning strategy on the "Chine

  ![Searching the JSON output](/img/ui/walkthrough/SearchJSON.png)

- - The Chinese characters on page 1. Search for the text `all have the meaning of acting`. Notice how the Chinese characters are captures correctly.
- - The HTML representations of the seven tables on pages 6-9 and 12. Search for the text `"text_as_html":`.
- - The descriptions of the four diagrams on page 3. Search for the text `\"diagram\",\n \"description\"`.
- - The descriptions of the three graphs on pages 7-8. Search for the text `\"graph\",\n \"description\"`.
- - The Base64-encoded, full-fidelity representations of the 14 tables, diagrams, and graphs on pages 3, 6-9, and 12.
-   Search for the text `"image_base64":`. You can use a web-based tool such as [base64.guru](https://base64.guru/converter/decode/image)
-   to experiment with decoding these representations back into their original visual representations.
+ - The Chinese characters on page 1. Search for the text `verbs. The characters`. Notice how the Chinese characters are captured correctly.
+ - The HTML representations of the seven tables on pages 6-9 and 12. Search for the text `text_as_html`.
+ - The descriptions of the four diagrams on page 3. Search for the text `"diagram` (including the opening quotation mark).
+ - The descriptions of the three graphs on pages 7-8. Search for the text `"graph` (including the opening quotation mark).

- 8. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
+ 7. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
  the workflow designer for the next step.

  ## Step 4: Add more enrichments

  Your existing workflow already has three **Enrichment** nodes. Recall that these nodes perform the following enrichments:

  - An [image description](ui/enriching/image-descriptions) enrichment, which uses a vision language model (VLM) to provide a text-based summary of the contents of the each detected image.
- - A [generative OCR](/ui/enriching/generative-ocr) enrichment, which uses a VLM to improve the accuracy of each block of initially-processed text.
+ - A [generative OCR](/ui/enriching/generative-ocr) enrichment, which uses a VLM to improve the accuracy of each block of initially-processed text, as needed.
  - A [table to HTML](/ui/enriching/table-to-html) enrichment, which uses a VLM to provide an HTML-structured representation of each detected table.

  In this step, you add a few more [enrichments](/ui/enriching/overview) to your workflow, such as generating summary descriptions of detected tables,
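The removed bullets above point to [base64.guru](https://base64.guru/converter/decode/image) for decoding `image_base64` values; the same decoding can be scripted. A minimal sketch, assuming the output was saved locally via **Download Full JSON** (the filename `partitioned-output.json` is hypothetical) and that image-bearing elements carry `image_base64` and `image_mime_type` in their `metadata`:

```python
import base64
import json

# "partitioned-output.json" is a hypothetical local filename for the output
# downloaded via "Download Full JSON".
with open("partitioned-output.json", encoding="utf-8") as f:
    elements = json.load(f)

for i, element in enumerate(elements):
    metadata = element.get("metadata", {})
    b64 = metadata.get("image_base64")
    if b64:
        # image_mime_type is typically "image/jpeg" or "image/png".
        ext = metadata.get("image_mime_type", "image/jpeg").split("/")[-1]
        with open(f"element-{i}.{ext}", "wb") as out:
            out.write(base64.b64decode(b64))
        print(f"Wrote element-{i}.{ext} ({element.get('type')})")
```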
@@ -222,9 +218,9 @@ and generating detected entities (such as people and organizations) and the infe
  2. In the node's settings pane's **Details** tab, click:

  - **Table** under **Input Type**.
- - **Anthropic** under **Provider**.
- - **Claude Sonnet 4** under **Model**.
- - **Table Description** under **Task**.
+ - Any available choice under **Provider**.
+ - Any available choice under **Model**.
+ - If not already selected, **Table Description** under **Task**.

  <Tip>
  The table description enrichment generates a summary description of each detected table. This can help you to more quickly and easily understand
@@ -236,23 +232,23 @@ and generating detected entities (such as people and organizations) and the infe
  In the node's settings pane's **Details** tab, click:

  - **Text** under **Input Type**.
- - **Anthropic** under **Provider**.
- - **Claude Sonnet 4** under **Model**.
+ - Any available choice under **Provider**.
+ - Any available choice under **Model**.

  <Tip>
  The named entity recognition (NER) enrichment generates a list of detected entities (such as people and organizations) and the inferred relationships among these entities. This provides additional context about these entities' types and their relationships for your graph databases, RAG apps, agents, and models. [Learn more](/ui/enriching/ner).
  </Tip>

- The workflow designer should now look like this:
+ The workflow designer should now look similar to this:

  ![The workflow with enrichments added](/img/ui/walkthrough/EnrichedWorkflow.png)

  4. Immediately above the **Source** node, click **Test**.
  5. In the **Test output** pane, make sure that **Enrichment (6 of 6)** is showing. If not, click the right arrow (**>**) until **Enrichment (6 of 6)** appears, which will show the output from the last node in the workflow.
  6. Some interesting portions of the output include the following:

- - The descriptions of the seven tables on pages 6-9 and 12. Search for the text `## Table Structure Analysis\n\n###`.
- - The identified entities and inferred relationships among them. For example, search for the text `Zhijun Wang`. Of the eight instances of this name, notice
+ - The descriptions of the seven tables on pages 6-9 and 12. Search for the text `"Table"` (including the quotation marks).
+ - The identified entities and inferred relationships among them. For example, search for the text `Zhijun Wang`. Of the nine instances of this name, notice
  the author's identification as a `PERSON` three times, the author's `published` relationship twice, and the author's `affiliated_with` relationship twice.

  7. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
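A downloaded copy of the enriched output makes it easy to double-check counts like the nine `Zhijun Wang` instances outside the UI. A minimal sketch, assuming the output was saved locally as `enriched-output.json` (a hypothetical filename):

```python
import json

# "enriched-output.json" is a hypothetical local filename for the output
# downloaded via "Download Full JSON".
with open("enriched-output.json", encoding="utf-8") as f:
    elements = json.load(f)

needle = "Zhijun Wang"
print(f"{needle!r} appears {json.dumps(elements).count(needle)} times in the output")

# Which elements mention the name, and of what element type?
for element in elements:
    if needle in json.dumps(element):
        print(element.get("type"), element.get("element_id"))
```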
@@ -303,11 +299,11 @@ the resulting document elements' `text` content into manageable "chunks" to stay
  _What do each of these chunking settings do?_

  - **Contextual Chunking** prepends chunk-specific explanatory context to each chunk, which has been shown to yield significant improvements in downstream retrieval accuracy. [Learn more](/ui/chunking#contextual-chunking).
- - **Include Original Elements** outputs into each chunk's `metadata` field's `orig_elements` value the elements that were used to form that particular chunk. [Learn more](/ui/chunking#include-original-elements-setting).
+ - **Include Original Elements** outputs into each chunk's `metadata` field's `orig_elements` value the elements that were used to form that particular chunk. These elements are output in gzip compressed, Base64 encoded format. To get back to the original content, Base64 decode and then gzip decompress the bytes as UTF-8. [Learn more](/ui/chunking#include-original-elements-setting).
  - **Max Characters** is the "hard" or maximum number of characters that any one chunk can contain. Unstructured cannot exceed this number when forming chunks. [Learn more](/ui/chunking#max-characters-setting).
  - **New After N Characters**: is the "soft" or approximate number of characters that any one chunk can contain. Unstructured can exceed this number if needed when forming chunks (but still cannot exceed the **Max Characters** setting). [Learn more](/ui/chunking#new-after-n-characters-setting).
  - **Overlap**, when applied (see **Overlap All**), prepends to the current chunk the specified number of characters from the previous chunk, which can help provide additional context about this chunk relative to the previous chunk. [Learn more](/ui/chunking#overlap-setting)
- - **Overlap All** applies the **Overlap** setting (if greater than zero) to all chunks. Otherwise, unchecking this box means that the **Overlap** setting (if greater than zero)is applied only in edge cases where "normal" chunks cannot be formed by combining whole elements. Check this box with caution as it can introduce noise into otherwise clean semantic units. [Learn more](/ui/chunking#overlap-all-setting).
+ - **Overlap All** applies the **Overlap** setting (if greater than zero) to all chunks. Otherwise, unchecking this box means that the **Overlap** setting (if greater than zero) is applied only in edge cases where "normal" chunks cannot be formed by combining whole elements. Check this box with caution as it can introduce noise into otherwise clean semantic units. [Learn more](/ui/chunking#overlap-all-setting).
  </Tip>

  4. Immediately above the **Source** node, click **Test**.
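For readers who want to experiment with these settings in code, the open-source `unstructured` Python library exposes similarly named parameters on its by-title chunker. A minimal sketch of that analogue, not the UI's exact implementation; `paper.pdf` is a hypothetical local copy of the sample file:

```python
# Requires: pip install "unstructured[pdf]"
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="paper.pdf")
chunks = chunk_by_title(
    elements,
    max_characters=1000,    # "hard" maximum characters per chunk
    new_after_n_chars=800,  # "soft" target; may be exceeded up to max_characters
    overlap=100,            # characters prepended from the previous chunk
    overlap_all=False,      # True applies overlap to all chunks, not just edge cases
)
print(len(chunks), "chunks; longest is", max(len(c.text) for c in chunks), "characters")
```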
@@ -409,7 +405,7 @@ embedding model that is provided by an embedding provider. For the best embeddin
  2. In the node's settings pane's **Details** tab, under **Select Embedding Model**, for **Azure OpenAI**, select **Text Embedding 3 Small [dim 1536]**.
  3. Immediately above the **Source** node, click **Test**.
  4. In the **Test output** pane, make sure that **Embedder (8 of 8)** is showing. If not, click the right arrow (**>**) until **Embedder (8 of 8)** appears, which will show the output from the last node in the workflow.
- 5. To explore the embeddings, search for the text `"embeddings"`.
+ 5. To explore the embeddings, search for the text `"embeddings"` (including the quotation marks).

  <Tip>
  _What do all of these numbers mean?_
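One way to build intuition for these numbers is to compare two vectors directly. A minimal sketch, assuming the output was downloaded locally as `embedded-output.json` (a hypothetical filename) and that each element stores its vector under the `embeddings` key that the search text above targets:

```python
import json
import math

# "embedded-output.json" is a hypothetical local filename for the downloaded output.
with open("embedded-output.json", encoding="utf-8") as f:
    elements = json.load(f)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vectors = [el["embeddings"] for el in elements if el.get("embeddings")]
print(f"Similarity of the first two elements: {cosine(vectors[0], vectors[1]):.3f}")
```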

ui/chunking.mdx

Lines changed: 1 addition & 1 deletion
@@ -103,7 +103,7 @@ Here are a few examples:

  If the option to include original elements is specified, during chunking the `orig_elements` field is added to the `metadata` field of each chunked element.
  The `orig_elements` field is a list of the original elements that were used to create the current chunked element. This list is output in
- compressed Base64 gzipped format. To get back to the original content for this list, Base64-decode the list's bytes, decompress them, and then decode them using UTF-8.
+ gzip compressed, Base64-encoded format. To get back to the original content for this list, Base64-decode the list's bytes, and then gzip decompress them as UTF-8.
  [Learn how](/api-reference/partition/get-chunked-elements).

  After chunking, `Image` elements are not preserved in the output. However,
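The corrected decode order (Base64-decode first, then gzip-decompress, then read as UTF-8) can be verified in a few lines of Python. A minimal sketch, assuming a downloaded chunked output saved as `chunked-output.json` (a hypothetical filename) and that the decompressed payload is itself JSON:

```python
import base64
import gzip
import json

# "chunked-output.json" is a hypothetical local filename for the downloaded output.
with open("chunked-output.json", encoding="utf-8") as f:
    chunks = json.load(f)

payload = chunks[0]["metadata"]["orig_elements"]
# Base64-decode, then gzip-decompress, then read the bytes as UTF-8.
decoded = gzip.decompress(base64.b64decode(payload)).decode("utf-8")
# Depending on the serialization, this typically parses to a list of element objects.
original_elements = json.loads(decoded)
print(f"The first chunk was formed from {len(original_elements)} original elements")
```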
