[Update][Accuracy] Default Embedding Function #2284

# Default Embedding Function

Embedding functions have a significant influence on retrieval accuracy, especially recall. We currently use a fairly basic sentence transformer model, but better open-source models have recently been released in the same weight class in terms of memory and compute.

Additionally, it’s easy for users to trip up with embedding functions, because they typically have a fixed and relatively short context window. Documents longer than that window are truncated, so important information is lost. They’re also usually trained for only one distance metric, which we currently rely on the user to set themselves.
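To make the truncation problem concrete, here is a minimal sketch of how a fixed context window drops the tail of a document. Everything in it is illustrative: `truncate_to_window` is a hypothetical helper, and whitespace splitting only stands in for a model's real tokenizer, which would count tokens differently.

```python
from typing import Tuple


def truncate_to_window(text: str, max_tokens: int) -> Tuple[str, int]:
    """Keep only the first `max_tokens` tokens, as a fixed context window would.

    Assumption: whitespace-separated words stand in for real model tokens.
    Returns the kept text and the number of tokens that were dropped.
    """
    tokens = text.split()
    kept = tokens[:max_tokens]
    return " ".join(kept), len(tokens) - len(kept)


# The important sentence sits at the end of the document, past the window.
doc = "setup notes " * 4 + "the API key must be rotated weekly"
kept, dropped = truncate_to_window(doc, max_tokens=8)

print(f"dropped {dropped} tokens; kept text: {kept!r}")
```

Nothing signals to the user that the final sentence never reached the model, which is the failure mode described above.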

## [Complexity] Subtask 

- [Low] Evaluate and swap to a new default EF. [Snowflake’s models](https://huggingface.co/Snowflake/snowflake-arctic-embed-s) look particularly promising. This is already available as ONNX.

- [Low] Attach dimensionality and distance metric to EF class. This would allow us to set both of these automatically on collection creation without the user having to think about passing HNSW params.

- [Low] Attach context window length metadata to the EF class. This would allow [Smart Chunker](https://github.com/chroma-core/chroma/issues/2281) to auto-parametrize to the right settings, and we could warn or error when a document exceeds the window.
    - Some complexity here because tokenization varies between models, but we can be conservative.
- [Med] Expose the tokenizer when it’s available. This would help [Smart Chunker](https://github.com/chroma-core/chroma/issues/2281) count tokens correctly.
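The metadata subtasks above could be sketched roughly as follows. This is not Chroma's API: `EmbeddingFunctionInfo`, `collection_settings`, `estimate_tokens`, and `find_oversized` are hypothetical names, the settings keys are placeholders, the 1.5 tokens-per-word fallback is an arbitrary conservative guess, and the example values are illustrative rather than taken from any model card.

```python
import math
import warnings
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass(frozen=True)
class EmbeddingFunctionInfo:
    """Hypothetical metadata an EF class could carry."""
    name: str
    dimensionality: int
    distance_metric: str  # the metric the model was trained for, e.g. "cosine"
    max_tokens: int       # context window length
    tokenizer: Optional[Callable[[str], List[str]]] = None  # exposed when available


def collection_settings(info: EmbeddingFunctionInfo) -> Dict[str, Union[str, int]]:
    """Derive collection settings so the user does not pass them by hand.

    The key names here are placeholders, not real configuration keys.
    """
    return {"space": info.distance_metric, "dimension": info.dimensionality}


def estimate_tokens(info: EmbeddingFunctionInfo, text: str) -> int:
    """Count tokens exactly if a tokenizer is exposed, else overestimate."""
    if info.tokenizer is not None:
        return len(info.tokenizer(text))
    # Assumption: 1.5 tokens per whitespace word, rounded up, as a
    # deliberately conservative stand-in for the real tokenizer.
    return math.ceil(len(text.split()) * 1.5)


def find_oversized(info: EmbeddingFunctionInfo, docs: List[str]) -> List[int]:
    """Return indices of documents that would exceed the context window."""
    oversized = []
    for i, doc in enumerate(docs):
        n = estimate_tokens(info, doc)
        if n > info.max_tokens:
            warnings.warn(
                f"document {i}: ~{n} tokens exceeds {info.name}'s "
                f"window of {info.max_tokens}; it would be truncated"
            )
            oversized.append(i)
    return oversized


# Illustrative values only.
info = EmbeddingFunctionInfo(
    name="example-model", dimensionality=384,
    distance_metric="cosine", max_tokens=512,
)
too_long = find_oversized(info, ["short doc", "word " * 600])
```

Whether an oversized document should produce a warning or a hard error is left open here, as in the subtask; the same check could raise instead of calling `warnings.warn`.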
