40 changes: 20 additions & 20 deletions docs/docs/extraction/chunking.md
@@ -46,15 +46,11 @@ If you want chunks smaller than `page`, use token-based splitting as described i

The `split` task uses a tokenizer to count the number of tokens in the document,
and splits the document based on the desired maximum chunk size and chunk overlap.
We recommend that you use the `meta-llama/Llama-3.2-1B` tokenizer,
because it's the same tokenizer as the llama-3.2 embedding model that we use for embedding.
However, you can use any tokenizer from any HuggingFace model that includes a tokenizer file.

Use the `split` method to chunk large documents as shown in the following code.

!!! note
We recommend the default tokenizer for token-based splitting. For more information, refer to [Llama tokenizer (default)](#llama-tokenizer-default).
You can also use any tokenizer from any HuggingFace model that includes a tokenizer file.

The default tokenizer (`meta-llama/Llama-3.2-1B`) requires a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens). You must set `"hf_access_token": "hf_***"` to authenticate.
Use the `split` method to chunk large documents as shown in the following code.

```python
ingestor = ingestor.split(
@@ -76,6 +72,23 @@ ingestor = ingestor.split(
)
```
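Independent of the NV Ingest API, the underlying token-window logic can be sketched in plain Python. This is an illustration only, not the library's implementation; the function name and exact behavior are assumptions based on the chunk-size and chunk-overlap description above.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token sequence into overlapping chunks.

    Consecutive chunks share `chunk_overlap` tokens, so the window
    advances by `chunk_size - chunk_overlap` tokens each step.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

In the real `split` task the tokens come from the configured tokenizer rather than a plain list, but the windowing idea is the same.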

### Llama tokenizer (default) {#llama-tokenizer-default}
> **Collaborator:** was this "{#llama-tokenizer-default}" added intentionally?
>
> **Collaborator (author):** yes
>
> **Collaborator:** If you intend that to be a section anchor, I don't think it will work across both GitHub and docs.nvidia.com

The default tokenizer for token-based splitting is **`meta-llama/Llama-3.2-1B`**. It matches the tokenizer used by the Llama 3.2 embedding model, which helps keep chunk boundaries aligned with the embedding model.

!!! note

This tokenizer is gated on Hugging Face and requires an access token. For more information, refer to [User access tokens](https://huggingface.co/docs/hub/en/security-tokens). You must set `hf_access_token` in your `split` params (for example, `"hf_***"`) to authenticate.
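One way to avoid hard-coding the token is to read it from the environment before building the `split` params. This is a hypothetical wiring sketch; only the environment lookup and the params dictionary are shown, and the `HF_ACCESS_TOKEN` variable name follows the environment-variable table referenced below.

```python
import os

# Hypothetical wiring: read the Hugging Face token from the environment
# instead of embedding it in source code.
hf_token = os.environ.get("HF_ACCESS_TOKEN", "")

# Dictionary intended for the split task's params.
split_params = {"hf_access_token": hf_token}
```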

By default, the NV Ingest container includes this tokenizer pre-downloaded at build time, so it does not need to be fetched at runtime. If you build the container yourself and want to pre-download it, do the following:

- Review the [license agreement](https://huggingface.co/meta-llama/Llama-3.2-1B).
- [Request access](https://huggingface.co/meta-llama/Llama-3.2-1B).
- Set the `DOWNLOAD_LLAMA_TOKENIZER` environment variable to `True`.
- Set the `HF_ACCESS_TOKEN` environment variable to your Hugging Face access token.

For details on how to set environment variables, refer to [Environment Variables](environment-config.md).
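For example, the two variables from the steps above could be set in a shell session before the build. The values here are illustrative only; substitute your real Hugging Face token.

```shell
# Illustrative values; replace the token placeholder with your own.
export DOWNLOAD_LLAMA_TOKENIZER=True
export HF_ACCESS_TOKEN=hf_your_token_here
```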

### Split Parameters

The following table contains the `split` parameters.
@@ -91,19 +104,6 @@ The following table contains the `split` parameters.



### Pre-download the Tokenizer

By default, the NV Ingest container comes with the `meta-llama/Llama-3.2-1B` tokenizer pre-downloaded
so that it doesn't have to download a tokenizer at runtime.
If you are building the container yourself and want to pre-download this model, do the following:

- Review the [license agreement](https://huggingface.co/meta-llama/Llama-3.2-1B).
- [Request access](https://huggingface.co/meta-llama/Llama-3.2-1B).
- Set the `DOWNLOAD_LLAMA_TOKENIZER` environment variable to `True`
- Set the `HF_ACCESS_TOKEN` environment variable to your HuggingFace access token.



## Related Topics

- [Use the Python API](nv-ingest-python-api.md)
4 changes: 2 additions & 2 deletions docs/docs/extraction/environment-config.md
@@ -12,8 +12,8 @@ You can specify these in your .env file or directly in your environment.

| Name | Example | Description |
|----------------------------------|--------------------------------|-----------------------------------------------------------------------|
| `DOWNLOAD_LLAMA_TOKENIZER` | - | The Llama tokenizer is now pre-downloaded at build time. For details, refer to [Token-Based Splitting](chunking.md#token-based-splitting). |
| `HF_ACCESS_TOKEN` | - | A token to access HuggingFace models. For details, refer to [Token-Based Splitting](chunking.md#token-based-splitting). |
| `DOWNLOAD_LLAMA_TOKENIZER` | - | Pre-download the default tokenizer at build time. For details, refer to [Llama tokenizer](chunking.md#llama-tokenizer-default). |
| `HF_ACCESS_TOKEN` | - | A token to access HuggingFace models. For details, refer to [Llama tokenizer](chunking.md#llama-tokenizer-default). |
| `INGEST_LOG_LEVEL` | - `DEBUG` <br/> - `INFO` <br/> - `WARNING` <br/> - `ERROR` <br/> - `CRITICAL` <br/> | The log level for the ingest service, which controls the verbosity of the logging output. |
| `MESSAGE_CLIENT_HOST` | - `redis` <br/> - `localhost` <br/> - `192.168.1.10` <br/> | Specifies the hostname or IP address of the message broker used for communication between services. |
| `MESSAGE_CLIENT_PORT` | - `7670` <br/> - `6379` <br/> | Specifies the port number on which the message broker is listening. |
1 change: 1 addition & 0 deletions docs/docs/extraction/nv-ingest-python-api.md
@@ -534,6 +534,7 @@ For more information on environment variables, refer to [Environment Variables](
## Extract Audio

Use the following code to extract mp3 audio content.
The example uses the default tokenizer for token-based splitting; see [Llama tokenizer (default)](chunking.md#llama-tokenizer-default).

```python
from nv_ingest_client.client import Ingestor
2 changes: 1 addition & 1 deletion docs/docs/extraction/releasenotes-nv-ingest.md
@@ -26,7 +26,7 @@ This release contains the following key changes:
- Added VLM caption prompt customization parameters, including reasoning control. For details, refer to [Caption Images and Control Reasoning](nv-ingest-python-api.md#caption-images-and-control-reasoning).
- Added support for the [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse/modelcard) model which replaces the [nemoretriever-parse](https://build.nvidia.com/nvidia/nemoretriever-parse/modelcard) model. For details, refer to [Advanced Visual Parsing](nemoretriever-parse.md).
- Support is now deprecated for [paddleocr](https://build.nvidia.com/baidu/paddleocr/modelcard).
- The `meta-llama/Llama-3.2-1B` tokenizer is now pre-downloaded so that you can run token-based splitting without making a network request. For details, refer to [Split Documents](chunking.md).
- The default tokenizer for token-based splitting is now pre-downloaded at build time so you can run splitting without a network request. For details, refer to [Llama tokenizer (default)](chunking.md#llama-tokenizer-default).
- For scanned PDFs, added specialized extraction strategies. For details, refer to [PDF Extraction Strategies](nv-ingest-python-api.md#pdf-extraction-strategies).
- Added support for [LanceDB](https://lancedb.com/). For details, refer to [Upload to a Custom Data Store](data-store.md).
- The V2 API is now available and is the default processing pipeline. The response format remains backwards-compatible. You can enable the v2 API by using `message_client_kwargs={"api_version": "v2"}`. For details, refer to [API Reference](api-docs).