[Update][Accuracy] Default Embedding Function #2284

Open
@atroyn

Default Embedding Function

Embedding functions strongly influence retrieval accuracy, especially recall. We currently use a fairly basic sentence transformer model, but better open-source models with comparable memory and compute requirements have been released recently.

Additionally, it’s easy for users to trip up with embedding functions: they typically have a fixed and relatively short context window, so longer documents are truncated and important information is lost. They are also usually trained for a single distance metric, which we currently rely on the user to set themselves.
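To make the truncation problem concrete, here is a minimal sketch. It assumes whitespace-separated words as a stand-in for real model tokens, and `truncate_to_window` is a hypothetical helper, not part of Chroma or any embedding library.

```python
def truncate_to_window(text: str, max_tokens: int) -> tuple[str, int]:
    """Keep only the first `max_tokens` tokens, as a fixed context window would.

    Assumption: whitespace-separated words stand in for model tokens.
    Returns the kept text and the number of tokens that were dropped.
    """
    tokens = text.split()
    kept = tokens[:max_tokens]
    return " ".join(kept), len(tokens) - len(kept)


# A document whose important content sits at the end.
doc = "intro " * 6 + "the key fact is at the end"
kept, dropped = truncate_to_window(doc, max_tokens=8)
print(f"kept: {kept!r}")
print(f"dropped {dropped} tokens without any error")
```

Nothing fails here: the embedding is simply computed on the shortened text, which is why the loss is easy to miss.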

[Complexity] Subtask

  • [Low] Evaluate and swap to a new default EF. [Snowflake’s models](https://huggingface.co/Snowflake/snowflake-arctic-embed-s) look particularly promising. This is already available as ONNX.

  • [Low] Attach dimensionality and distance metric to EF class. This would allow us to set both of these automatically on collection creation without the user having to think about passing HNSW params.

  • [Low] Attach context window length metadata to the EF class. This would allow Smart Chunker to auto-parametrize to the right settings, and we could warn or raise an error when a document exceeds the window.

    • There is some complexity here because tokenization varies between models, but we can be conservative.

  • [Med] Expose the tokenizer when it’s available. This would help Smart Chunker count tokens correctly.
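One possible shape for the metadata subtasks above is sketched below. Everything in it is an assumption for illustration: the `EmbeddingFunctionInfo` class, its field names, the keys returned by `collection_defaults`, and the numeric values are hypothetical and do not reflect Chroma's actual API or any specific model's published specs.

```python
import warnings
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class EmbeddingFunctionInfo:
    """Hypothetical metadata an embedding function (EF) could carry."""

    name: str
    dimensionality: int
    distance_metric: str  # e.g. "cosine", "l2", or "ip"
    max_tokens: int  # context window length, in tokens
    tokenizer: Optional[Callable[[str], Sequence[str]]] = None

    def count_tokens(self, text: str) -> int:
        """Count tokens with the real tokenizer when one is exposed."""
        if self.tokenizer is not None:
            return len(self.tokenizer(text))
        # Conservative fallback (assumption): treat every whitespace word
        # as two tokens, so we over-estimate rather than under-estimate.
        return 2 * len(text.split())

    def collection_defaults(self) -> dict:
        """Settings a collection could derive automatically from the EF."""
        return {"space": self.distance_metric, "dimension": self.dimensionality}

    def fits_context(self, text: str) -> bool:
        """Warn and return False when the text would be truncated."""
        n = self.count_tokens(text)
        if n > self.max_tokens:
            warnings.warn(
                f"{self.name}: ~{n} tokens exceeds the context window of "
                f"{self.max_tokens}; the document would be truncated."
            )
            return False
        return True


# Illustrative placeholder values, not verified model specifications.
info = EmbeddingFunctionInfo(
    name="example-model", dimensionality=384, distance_metric="cosine", max_tokens=512
)
print(info.collection_defaults())
print(info.fits_context("a short document"))
```

With this shape, collection creation could read `collection_defaults()` instead of asking the user for HNSW parameters, and a chunker could call `count_tokens` to size chunks, using the exact tokenizer when available and the conservative estimate otherwise.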

Labels: Local Chroma (An improvement to Local (single node) Chroma), by-chroma, in-progress (Currently working on this)