api-reference/workflow/workflows.mdx (10 additions & 1 deletion)

@@ -1923,11 +1923,20 @@ Allowed values for `subtype` and `model_name` include:

- `"subtype": "voyageai"`

- `"model_name": "voyage-context-3"`
- `"model_name": "voyage-3.5"`
- `"model_name": "voyage-3.5-lite"`
- `"model_name": "voyage-3"`
- `"model_name": "voyage-3-large"`
- `"model_name": "voyage-3-lite"`
- `"model_name": "voyage-3-m-exp"`
- `"model_name": "voyage-2"`
- `"model_name": "voyage-02"`
- `"model_name": "voyage-large-2"`
- `"model_name": "voyage-large-2-instruct"`
- `"model_name": "voyage-code-3"`
- `"model_name": "voyage-code-2"`
- `"model_name": "voyage-finance-2"`
- `"model_name": "voyage-law-2"`
- `"model_name": "voyage-code-2"`
- `"model_name": "voyage-multilingual-2"`
- `"model_name": "voyage-multimodal-3"`
open-source/how-to/embedding.mdx (80 additions & 2 deletions)

@@ -57,7 +57,17 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the following:
- `openai` for [OpenAI](https://openai.com/). [Learn more](https://python.langchain.com/v0.2/docs/integrations/text_embedding/openai/).
- `togetherai` for [Together.ai](https://www.together.ai/). [Learn more](https://docs.together.ai/docs/embedding-models).
- `vertexai` for [Google Vertex AI PaLM](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview). [Learn more](https://python.langchain.com/v0.2/docs/integrations/text_embedding/google_vertex_ai_palm/).
- `voyageai` for [Voyage AI](https://www.voyageai.com/). [Learn more](https://docs.voyageai.com/docs/embeddings).

<Note>
Voyage AI offers multiple embedding models optimized for different use cases:
- **voyage-3.5** and **voyage-3.5-lite**: Latest general-purpose models with high per-request token limits (320k and 1M tokens, respectively)
- **voyage-context-3**: Specialized model for contextualized chunk embeddings, where each chunk's embedding also reflects the surrounding document
- **voyage-code-3** and **voyage-code-2**: Optimized for code embeddings
- **voyage-finance-2**, **voyage-law-2**, **voyage-multilingual-2**: Domain-specific models
- **voyage-multimodal-3**: Supports multimodal embeddings
- Additional models, including previous-generation releases, are also supported; see the full list in step 3 below.
</Note>

2. Run the following command to install the required Python package for the embedding provider:

@@ -86,7 +96,15 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the following:
- `openai`. [Choose a model](https://platform.openai.com/docs/guides/embeddings/embedding-models), or use the default model `text-embedding-ada-002`.
- `togetherai`. [Choose a model](https://docs.together.ai/docs/embedding-models), or use the default model `togethercomputer/m2-bert-80M-32k-retrieval`.
- `vertexai`. [Choose a model](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api), or use the default model `text-embedding-05`.
- `voyageai`. [Choose a model](https://docs.voyageai.com/docs/embeddings). No default model is provided. Available models include:
- **voyage-3.5**: High-performance model with 320k token limit and 1024 dimensions
- **voyage-3.5-lite**: Lightweight model with 1M token limit and 512 dimensions
- **voyage-context-3**: Contextualized embedding model with 32k token limit
- **voyage-3**, **voyage-3-large**, **voyage-3-lite**: General-purpose models
- **voyage-2**, **voyage-02**: Previous generation models
- **voyage-code-3**, **voyage-code-2**: Code-specialized models
- **voyage-finance-2**, **voyage-law-2**, **voyage-multilingual-2**: Domain-specific models
- **voyage-multimodal-3**: Multimodal embedding support

4. Note the special settings to connect to the provider:

@@ -157,3 +175,63 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the following:
- Set `embedding_aws_region` to the corresponding AWS Region identifier.
</Accordion>
</AccordionGroup>

## VoyageAI Advanced Features

VoyageAI embeddings offer several advanced capabilities beyond standard embedding generation.

### Contextualized Embeddings

The `voyage-context-3` model produces contextualized chunk embeddings: each chunk's embedding also encodes context from the rest of the document it belongs to. This is particularly useful for RAG applications, where document-level context improves retrieval accuracy.
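To make the idea concrete, here is a minimal sketch that calls the Voyage AI Python client directly rather than going through the Ingest CLI. The `contextualized_embed` call and the response fields shown follow the current Voyage AI Python client documentation and may change, so treat this as illustrative rather than authoritative.

```python
# Illustrative only: contextualized chunk embeddings with the voyageai client.
# Assumes the voyageai package is installed and VOYAGE_API_KEY is set.
import voyageai

vo = voyageai.Client()  # picks up VOYAGE_API_KEY from the environment

# Each inner list is one document, pre-split into chunks. Every chunk's
# embedding is computed with the other chunks of that document as context.
documents = [
    ["Chunk 1 of document A.", "Chunk 2 of document A."],
    ["Chunk 1 of document B."],
]

result = vo.contextualized_embed(
    inputs=documents,
    model="voyage-context-3",
    input_type="document",
)

for doc_result in result.results:
    print(f"{len(doc_result.embeddings)} chunk embeddings for this document")
```

When you use the Ingest CLI instead, you only need to select `voyage-context-3` as the `--embedding-model-name`, as shown in the second CLI example later in this section.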

### Automatic Batching

The VoyageAI integration automatically handles batching (sketched below) based on:
- Model-specific token limits (ranging from 32k to 1M tokens depending on the model)
- Maximum batch size of 1000 documents per request
- Efficient token counting to optimize API usage
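
The exact logic lives inside the integration, but conceptually the batching looks something like the following sketch. The constants and the whitespace-based token estimate are simplifications introduced here for illustration; the real integration counts tokens properly and uses the limits of the selected model.

```python
# Illustrative sketch only: group texts into batches that respect a
# per-request token budget and a 1,000-document cap.
from typing import Iterable, List

MAX_DOCS_PER_BATCH = 1000        # per-request document cap
MAX_TOKENS_PER_BATCH = 320_000   # e.g. voyage-3.5; varies by model


def approximate_tokens(text: str) -> int:
    # Crude stand-in for real token counting.
    return len(text.split())


def batch_texts(texts: Iterable[str]) -> List[List[str]]:
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for text in texts:
        tokens = approximate_tokens(text)
        # Start a new batch if adding this text would exceed either limit.
        if current and (
            len(current) >= MAX_DOCS_PER_BATCH
            or current_tokens + tokens > MAX_TOKENS_PER_BATCH
        ):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```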

### Output Dimension Control

You can set a custom `output_dimension` parameter to reduce the dimensionality of the returned embeddings (see the sketch after this list), which can:
- Reduce storage requirements
- Speed up similarity search
- Maintain embedding quality for many use cases
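
How the parameter is surfaced depends on how you invoke the embedder; as a reference point, the sketch below passes `output_dimension` to the Voyage AI Python client directly. Support for reduced dimensions is model-dependent (for example, voyage-3.5 and voyage-3.5-lite), so check the Voyage AI documentation for the values your model accepts.

```python
# Illustrative only: requesting 512-dimensional embeddings instead of the
# model's default. Assumes voyageai is installed and VOYAGE_API_KEY is set.
import voyageai

vo = voyageai.Client()

result = vo.embed(
    ["Some text to embed."],
    model="voyage-3.5",
    input_type="document",
    output_dimension=512,  # smaller vectors: less storage, faster search
)

print(len(result.embeddings[0]))  # 512
```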

### Progress Tracking

Enable `show_progress_bar` to monitor embedding progress for large document collections. This requires installing `tqdm`: `pip install tqdm`.

### Example: Using VoyageAI with Ingest CLI

```bash
unstructured-ingest \
local \
--input-path /path/to/documents \
--output-dir /path/to/output \
--embedding-provider voyageai \
--embedding-api-key $VOYAGE_API_KEY \
--embedding-model-name voyage-3.5 \
--num-processes 2
```

### Example: Using VoyageAI with Contextualized Embeddings

```bash
unstructured-ingest \
local \
--input-path /path/to/documents \
--output-dir /path/to/output \
--embedding-provider voyageai \
--embedding-api-key $VOYAGE_API_KEY \
--embedding-model-name voyage-context-3 \
--num-processes 2
```

### Choosing the Right VoyageAI Model

- **voyage-3.5**: Best for general-purpose embeddings with high per-request token limits
- **voyage-3.5-lite**: Lighter-weight option with the largest per-request token capacity (1M tokens); a good fit for very large ingestion jobs
- **voyage-context-3**: Use when chunk-level retrieval benefits from document-wide context
- **voyage-code-3**: Specifically optimized for code and technical documentation
- **Domain-specific models**: Choose voyage-finance-2, voyage-law-2, or voyage-multilingual-2 for specialized domains
snippets/general-shared-text/chunk-limits-embedding-models.mdx (17 additions & 8 deletions)

@@ -19,13 +19,22 @@ as listed in the following table's last column.
| _Together AI_ | | | |
| M2-Bert 80M 32K Retrieval | 768 | 8192 | 28672 |
| _Voyage AI_ | | | |
| Voyage Context 3 | 1024 | 32000 | 112000 |
| Voyage 3.5 | 1024 | 320000 | 1120000 |
| Voyage 3.5 Lite | 512 | 1000000 | 3500000 |
| Voyage 3 | 1024 | 120000 | 420000 |
| Voyage 3 Large | 1024 | 120000 | 420000 |
| Voyage 3 Lite | 512 | 120000 | 420000 |
| Voyage 3 M Exp | 1024 | 120000 | 420000 |
| Voyage 2 | 1024 | 320000 | 1120000 |
| Voyage 02 | 1024 | 320000 | 1120000 |
| Voyage Large 2 | 1024 | 120000 | 420000 |
| Voyage Large 2 Instruct | 1024 | 120000 | 420000 |
| Voyage Code 3 | 1024 | 120000 | 420000 |
| Voyage Code 2 | 1536 | 120000 | 420000 |
| Voyage Finance 2 | 1024 | 120000 | 420000 |
| Voyage Law 2 | 1024 | 120000 | 420000 |
| Voyage Multilingual 2 | 1024 | 120000 | 420000 |
| Voyage Multimodal 3 | 1024 | 120000 | 420000 |

<sup>*</sup> This is an approximate value, determined by multiplying the embedding model's token limit by 3.5.
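For example, Voyage 3.5's token limit of 320,000 yields an approximate character limit of 320,000 × 3.5 = 1,120,000.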