File: `open-source/how-to/embedding.mdx`
To use the Ingest CLI or Ingest Python library to generate embeddings, do the following:
- `openai` for [OpenAI](https://openai.com/). [Learn more](https://python.langchain.com/v0.2/docs/integrations/text_embedding/openai/).
- `togetherai` for [Together.ai](https://www.together.ai/). [Learn more](https://docs.together.ai/docs/embedding-models).
- `vertexai` for [Google Vertex AI PaLM](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview). [Learn more](https://python.langchain.com/v0.2/docs/integrations/text_embedding/google_vertex_ai_palm/).
- `voyageai` for [Voyage AI](https://www.voyageai.com/). [Learn more](https://docs.voyageai.com/docs/embeddings).

<Note>
  Voyage AI offers multiple embedding models optimized for different use cases:

  - **voyage-3.5** and **voyage-3.5-lite**: the latest models, with high token limits (320k and 1M tokens per batch, respectively)
  - **voyage-context-3**: a specialized model for contextualized embeddings that capture relationships between documents
  - **voyage-code-3** and **voyage-code-2**: optimized for code embeddings
  - Additional models are available for other use cases
</Note>

2. Run the following command to install the required Python package for the embedding provider:
- `openai`. [Choose a model](https://platform.openai.com/docs/guides/embeddings/embedding-models), or use the default model `text-embedding-ada-002`.
- `togetherai`. [Choose a model](https://docs.together.ai/docs/embedding-models), or use the default model `togethercomputer/m2-bert-80M-32k-retrieval`.
- `vertexai`. [Choose a model](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api), or use the default model `text-embedding-005`.
- `voyageai`. [Choose a model](https://docs.voyageai.com/docs/embeddings). No default model is provided. Available models include:
  - **voyage-3.5**: high-performance model with a 320k-token batch limit and 1024 dimensions
  - **voyage-3.5-lite**: lightweight model with a 1M-token batch limit and 512 dimensions
  - **voyage-context-3**: contextualized embedding model with a 32k-token limit
  - **voyage-multimodal-3**: multimodal embedding support
4. Note the special settings to connect to the provider:
- Set `embedding_aws_region` to the corresponding AWS Region identifier.
</Accordion>
</AccordionGroup>

## VoyageAI Advanced Features
VoyageAI embeddings offer several advanced capabilities beyond standard embedding generation:
### Contextualized Embeddings
The `voyage-context-3` model provides contextualized embeddings that capture relationships between documents in a batch. This is particularly useful for RAG applications where understanding document relationships improves retrieval accuracy.
### Automatic Batching
VoyageAI integration automatically handles batching based on:
- Model-specific token limits (ranging from 32k to 1M tokens depending on the model)
- Maximum batch size of 1000 documents per request
- Efficient token counting to optimize API usage
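These batching rules can be sketched in plain Python. This is an illustration only: the `MAX_BATCH_SIZE` and `TOKEN_LIMIT` constants and the whitespace-based token counter are stand-ins, not the integration's actual implementation, which uses the provider's own tokenizer and per-model limits.

```python
# Illustrative sketch of token- and size-limited batching.
MAX_BATCH_SIZE = 1000    # assumed: max documents per request
TOKEN_LIMIT = 320_000    # assumed: e.g. a voyage-3.5-style per-batch token limit


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly one token per word.
    return len(text.split())


def make_batches(docs: list[str]) -> list[list[str]]:
    batches: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for doc in docs:
        tokens = count_tokens(doc)
        # Start a new batch if adding this document would exceed either limit.
        if current and (len(current) >= MAX_BATCH_SIZE
                        or current_tokens + tokens > TOKEN_LIMIT):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

For example, 2,500 short documents would be split into batches of 1,000, 1,000, and 500 by the document-count limit alone.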
### Output Dimension Control
You can specify a custom `output_dimension` parameter to reduce the dimensionality of embeddings, which can:
- Reduce storage requirements
- Speed up similarity search
- Maintain embedding quality for many use cases
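Conceptually, a smaller output dimension behaves like a truncated, renormalized vector, assuming a Matryoshka-style model whose leading components remain meaningful. The sketch below illustrates that idea only; it is not Voyage's server-side implementation, which you access simply by requesting a smaller `output_dimension` from the API.

```python
import math


def truncate_embedding(vec: list[float], output_dimension: int) -> list[float]:
    # Keep the first `output_dimension` components, then renormalize to
    # unit length so cosine similarity remains meaningful.
    head = vec[:output_dimension]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

Halving the dimension halves storage and roughly halves the cost of each dot product during similarity search.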
### Progress Tracking
Enable `show_progress_bar` to monitor embedding progress for large document collections. This requires installing `tqdm`: `pip install tqdm`.
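One way such a flag can be wired up is sketched below. This is a hypothetical illustration, not the library's internals: `embed_fn` is a placeholder for the real embedding call, and the import fallback exists only so the example runs even where `tqdm` is not installed.

```python
try:
    from tqdm import tqdm  # pip install tqdm
except ImportError:
    # Fall back to a no-op wrapper so the code still runs without tqdm.
    def tqdm(iterable, **kwargs):
        return iterable


def embed_all(docs, embed_fn, show_progress_bar=False):
    # Wrap the iteration in a progress bar only when requested.
    iterator = tqdm(docs, desc="embedding") if show_progress_bar else docs
    return [embed_fn(doc) for doc in iterator]
```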
### Example: Using VoyageAI with Ingest CLI
```bash
unstructured-ingest \
  local \
    --input-path /path/to/documents \
    --output-dir /path/to/output \
    --embedding-provider voyageai \
    --embedding-api-key $VOYAGE_API_KEY \
    --embedding-model-name voyage-3.5 \
    --num-processes 2
```
### Example: Using VoyageAI with Contextualized Embeddings
```bash
unstructured-ingest \
  local \
    --input-path /path/to/documents \
    --output-dir /path/to/output \
    --embedding-provider voyageai \
    --embedding-api-key $VOYAGE_API_KEY \
    --embedding-model-name voyage-context-3 \
    --num-processes 2
```
### Choosing the Right VoyageAI Model
- **voyage-3.5**: Best for general-purpose embeddings with high token limits
- **voyage-3.5-lite**: Optimal for very large documents or when you need maximum token capacity
- **voyage-context-3**: Use when document relationships matter for your retrieval task
- **voyage-code-3**: Specifically optimized for code and technical documentation
- **Domain-specific models**: Choose `voyage-finance-2`, `voyage-law-2`, or `voyage-multilingual-2` for specialized domains