bug/VOYAGE embedding models supported but not available in PIPELINE #54

Open
@jeremydiba

Describe the bug
In Pipeline -> EmbedderConfig, every embedding provider documented at
https://docs.unstructured.io/open-source/core-functionality/embedding#voyageaiembeddingencoder
is accepted except Voyage, which raises an error saying it is not a recognized encoder.

To Reproduce

from unstructured.ingest.v2.pipeline.pipeline import Pipeline
from unstructured.ingest.v2.interfaces import ProcessorConfig
from unstructured.ingest.v2.processes.connectors.fsspec.s3 import (
    S3IndexerConfig,
    S3DownloaderConfig,
    S3ConnectionConfig,
    S3AccessConfig,
    S3UploaderConfig
)
from unstructured.ingest.v2.processes.partitioner import PartitionerConfig
from unstructured.ingest.v2.processes.chunker import ChunkerConfig
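from unstructured.ingest.v2.processes.embedder import EmbedderConfig

# Placeholder S3 locations; substitute real values before running.
INPUT_S3_FILE = "s3://<bucket>/<input-path>"
OUTPUT_S3_FILEPATH = "s3://<bucket>/<output-path>"

pipeline = Pipeline.from_configs(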
    context=ProcessorConfig(),
    indexer_config=S3IndexerConfig(remote_url=INPUT_S3_FILE),
    downloader_config=S3DownloaderConfig(download_dir="s3-ingest-download"),
    source_connection_config=S3ConnectionConfig(
        access_config=S3AccessConfig(
            key="AWS_ACCESS_KEY_ID",
            secret="AWS_SECRET_ACCESS_KEY",
            token="AWS_SESSION_TOKEN"
        )
    ),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key="UNSTRUCTURED_API_KEY_AUTH",
        partition_endpoint="UNSTRUCTURED_SERVER_URL",
        strategy="auto"
    ),
    chunker_config=ChunkerConfig(chunking_strategy="by_title",
                                chunk_combine_text_under_n_chars=100,
                                chunk_include_orig_elements=False,
                                chunk_max_characters=4000),
    embedder_config=EmbedderConfig(embedding_provider="Voyage",
                                   embedding_api_key="VOYAGE_API_KEY",
                                   embedding_model_name="voyage-law-2"),
    destination_connection_config=S3ConnectionConfig(
        access_config=S3AccessConfig(
            key="AWS_ACCESS_KEY_ID",
            secret="AWS_SECRET_ACCESS_KEY",
            token="AWS_SESSION_TOKEN"
        )
    ),
    uploader_config=S3UploaderConfig(remote_url=OUTPUT_S3_FILEPATH)
)

Expected behavior
"Voyage" (VoyageAIEmbeddingEncoder) should be accepted as a valid embedding_provider value in EmbedderConfig.
If support is not intended, the documentation should state that this encoder is only available when run outside the pipeline.

Environment Info
Python 3.11
ValueError: Voyage not a recognized encoder
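
For triage, here is a guess at the failure mode. The error text reads like the result of a provider-name lookup that has no entry for "Voyage". The sketch below is hypothetical and is not the unstructured source code: `KNOWN_ENCODERS`, `get_encoder`, and the provider names "provider-a" and "provider-b" are made up for illustration. It only shows how a name-to-encoder mapping with a missing key could produce this exact message.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a provider name to an encoder factory.
# The real set of provider names lives in the library and may differ.
KNOWN_ENCODERS: Dict[str, Callable[[str], str]] = {
    "provider-a": lambda model: f"provider-a:{model}",
    "provider-b": lambda model: f"provider-b:{model}",
}


def get_encoder(provider: str, model: str) -> str:
    """Return a stand-in encoder description for a known provider name."""
    try:
        factory = KNOWN_ENCODERS[provider]
    except KeyError:
        # A name missing from the registry surfaces as a ValueError,
        # matching the wording of the message reported in this issue.
        raise ValueError(f"{provider} not a recognized encoder") from None
    return factory(model)


if __name__ == "__main__":
    try:
        get_encoder("Voyage", "voyage-law-2")
    except ValueError as exc:
        print(f"ValueError: {exc}")
```

If the pipeline resolves providers this way, the fix would likely be a matter of registering the Voyage encoder under a provider name, or documenting the accepted name if it differs from "Voyage".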
