diff --git a/docs/docs/extraction/chunking.md b/docs/docs/extraction/chunking.md index 0bddfa4a2..d331e3e7b 100644 --- a/docs/docs/extraction/chunking.md +++ b/docs/docs/extraction/chunking.md @@ -46,15 +46,11 @@ If you want chunks smaller than `page`, use token-based splitting as described i The `split` task uses a tokenizer to count the number of tokens in the document, and splits the document based on the desired maximum chunk size and chunk overlap. -We recommend that you use the `meta-llama/Llama-3.2-1B` tokenizer, -because it's the same tokenizer as the llama-3.2 embedding model that we use for embedding. -However, you can use any tokenizer from any HuggingFace model that includes a tokenizer file. -Use the `split` method to chunk large documents as shown in the following code. - -!!! note +We recommend the default tokenizer for token-based splitting. For more information, refer to [Llama tokenizer (default)](#llama-tokenizer-default). +You can also use any tokenizer from any HuggingFace model that includes a tokenizer file. - The default tokenizer (`meta-llama/Llama-3.2-1B`) requires a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens). You must set `hf_access_token": "hf_***` to authenticate. +Use the `split` method to chunk large documents as shown in the following code. ```python ingestor = ingestor.split( @@ -76,6 +72,23 @@ ingestor = ingestor.split( ) ``` +### Llama tokenizer (default) {#llama-tokenizer-default} + +The default tokenizer for token-based splitting is **`meta-llama/Llama-3.2-1B`**. It matches the tokenizer used by the Llama 3.2 embedding model, which helps keep chunk boundaries aligned with the embedding model. + +!!! note + + This tokenizer is gated on Hugging Face and requires an access token. For more information, refer to [User access tokens](https://huggingface.co/docs/hub/en/security-tokens). You must set `hf_access_token` in your `split` params (for example, `"hf_***"`) to authenticate. 
+ +By default, the NV Ingest container includes this tokenizer pre-downloaded at build time, so it does not need to be fetched at runtime. If you build the container yourself and want to pre-download it, do the following: + +- Review the [license agreement](https://huggingface.co/meta-llama/Llama-3.2-1B). +- [Request access](https://huggingface.co/meta-llama/Llama-3.2-1B). +- Set the `DOWNLOAD_LLAMA_TOKENIZER` environment variable to `True`. +- Set the `HF_ACCESS_TOKEN` environment variable to your HuggingFace access token. + +For details on how to set environment variables, refer to [Environment Variables](environment-config.md). + ### Split Parameters The following table contains the `split` parameters. @@ -91,19 +104,6 @@ The following table contains the `split` parameters. -### Pre-download the Tokenizer - -By default, the NV Ingest container comes with the `meta-llama/Llama-3.2-1B` tokenizer pre-downloaded -so that it doesn't have to download a tokenizer at runtime. -If you are building the container yourself and want to pre-download this model, do the following: - -- Review the [license agreement](https://huggingface.co/meta-llama/Llama-3.2-1B). -- [Request access](https://huggingface.co/meta-llama/Llama-3.2-1B). -- Set the `DOWNLOAD_LLAMA_TOKENIZER` environment variable to `True` -- Set the `HF_ACCESS_TOKEN` environment variable to your HuggingFace access token. - - - ## Related Topics - [Use the Python API](nv-ingest-python-api.md) diff --git a/docs/docs/extraction/environment-config.md b/docs/docs/extraction/environment-config.md index 6843d9b9c..150f6f174 100644 --- a/docs/docs/extraction/environment-config.md +++ b/docs/docs/extraction/environment-config.md @@ -12,8 +12,8 @@ You can specify these in your .env file or directly in your environment. 
| Name | Example | Description | |----------------------------------|--------------------------------|-----------------------------------------------------------------------| -| `DOWNLOAD_LLAMA_TOKENIZER` | - | The Llama tokenizer is now pre-downloaded at build time. For details, refer to [Token-Based Splitting](chunking.md#token-based-splitting). | -| `HF_ACCESS_TOKEN` | - | A token to access HuggingFace models. For details, refer to [Token-Based Splitting](chunking.md#token-based-splitting). | +| `DOWNLOAD_LLAMA_TOKENIZER` | - | Pre-download the default tokenizer at build time. For details, refer to [Llama tokenizer](chunking.md#llama-tokenizer-default). | +| `HF_ACCESS_TOKEN` | - | A token to access HuggingFace models. For details, refer to [Llama tokenizer](chunking.md#llama-tokenizer-default). | | `INGEST_LOG_LEVEL` | - `DEBUG`
- `INFO`
- `WARNING`
- `ERROR`
- `CRITICAL`
| The log level for the ingest service, which controls the verbosity of the logging output. | | `MESSAGE_CLIENT_HOST` | - `redis`
- `localhost`
- `192.168.1.10`
| Specifies the hostname or IP address of the message broker used for communication between services. | | `MESSAGE_CLIENT_PORT` | - `7670`
- `6379`
| Specifies the port number on which the message broker is listening. | diff --git a/docs/docs/extraction/nv-ingest-python-api.md b/docs/docs/extraction/nv-ingest-python-api.md index 541130ef2..8a6720f85 100644 --- a/docs/docs/extraction/nv-ingest-python-api.md +++ b/docs/docs/extraction/nv-ingest-python-api.md @@ -534,6 +534,7 @@ For more information on environment variables, refer to [Environment Variables]( ## Extract Audio Use the following code to extract mp3 audio content. +The example uses the default tokenizer for token-based splitting; see [Llama tokenizer (default)](chunking.md#llama-tokenizer-default). ```python from nv_ingest_client.client import Ingestor diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md index ebb57f19e..5c4a060b6 100644 --- a/docs/docs/extraction/releasenotes-nv-ingest.md +++ b/docs/docs/extraction/releasenotes-nv-ingest.md @@ -26,7 +26,7 @@ This release contains the following key changes: - Added VLM caption prompt customization parameters, including reasoning control. For details, refer to [Caption Images and Control Reasoning](nv-ingest-python-api.md#caption-images-and-control-reasoning). - Added support for the [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse/modelcard) model which replaces the [nemoretriever-parse](https://build.nvidia.com/nvidia/nemoretriever-parse/modelcard) model. For details, refer to [Advanced Visual Parsing](nemoretriever-parse.md). - Support is now deprecated for [paddleocr](https://build.nvidia.com/baidu/paddleocr/modelcard). -- The `meta-llama/Llama-3.2-1B` tokenizer is now pre-downloaded so that you can run token-based splitting without making a network request. For details, refer to [Split Documents](chunking.md). +- The default tokenizer for token-based splitting is now pre-downloaded at build time so you can run splitting without a network request. 
For details, refer to [Llama tokenizer (default)](chunking.md#llama-tokenizer-default). - For scanned PDFs, added specialized extraction strategies. For details, refer to [PDF Extraction Strategies](nv-ingest-python-api.md#pdf-extraction-strategies). - Added support for [LanceDB](https://lancedb.com/). For details, refer to [Upload to a Custom Data Store](data-store.md). - The V2 API is now available and is the default processing pipeline. The response format remains backwards-compatible. You can enable the v2 API by using `message_client_kwargs={"api_version": "v2"}`. For details, refer to [API Reference](api-docs).
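As the revised chunking.md text above explains, the `split` task counts tokens with a tokenizer and then cuts the document into chunks of at most the maximum chunk size, with a fixed token overlap between consecutive chunks. The following is a minimal, self-contained sketch of that windowing logic only; it uses a whitespace "tokenizer" as a stand-in for the real Hugging Face tokenizer, and `split_tokens`, `chunk_size`, and `chunk_overlap` here are illustrative names, not the NV Ingest API.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Return windows of at most chunk_size tokens, each sharing
    chunk_overlap tokens with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# Whitespace splitting stands in for a real tokenizer such as
# meta-llama/Llama-3.2-1B; only the windowing behavior is illustrated.
tokens = "one two three four five six seven".split()
for chunk in split_tokens(tokens, chunk_size=4, chunk_overlap=2):
    print(chunk)
```

With a chunk size of 4 and an overlap of 2, each chunk after the first repeats the last two tokens of the previous chunk, which is why overlap must stay strictly smaller than the chunk size.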