Description
Default Embedding Function
Embedding functions (EFs) have a significant influence on retrieval accuracy, especially recall. We currently use a fairly basic sentence transformer model, but better open-source models have recently been released in the same weight class in terms of memory and compute.
Additionally, embedding functions are easy for users to trip over: they typically have a fixed and relatively short context window, so longer documents are truncated and important information is lost. They are also usually trained for a single distance metric, which we currently rely on the user to set themselves.
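To make the truncation problem concrete, here is a minimal sketch of what a fixed context window does to a long document. The helper `truncate_to_window` is hypothetical, and whitespace splitting stands in for a real tokenizer purely to keep the example dependency-free; real models use subword tokenizers and will count differently.

```python
def truncate_to_window(text: str, max_tokens: int) -> tuple[str, int]:
    """Mimic a fixed-context embedding model silently cutting off long input.

    Whitespace splitting is an assumption standing in for a real tokenizer.
    Returns the text that would actually be embedded and the number of
    tokens that were dropped.
    """
    tokens = text.split()
    kept = tokens[:max_tokens]
    return " ".join(kept), len(tokens) - len(kept)


# The important part of this document sits at the end, past the window.
doc = "intro " * 6 + "the key fact is at the end"
kept_text, dropped = truncate_to_window(doc, max_tokens=8)
print(f"embedded: {kept_text!r}, dropped {dropped} tokens")
```

Nothing in the output of an embedding call signals that tokens were dropped, which is why the loss is easy to miss.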
[Complexity] Subtask
- [Low] Evaluate and swap to a new default EF. [Snowflake’s models](https://huggingface.co/Snowflake/snowflake-arctic-embed-s) look particularly promising, and this one is already available as ONNX.
- [Low] Attach dimensionality and distance metric to the EF class. This would allow us to set both automatically on collection creation, without the user having to think about passing HNSW params.
- [Low] Attach context window length metadata to the EF class. This would allow Smart Chunker to auto-parametrize to the right settings, and we could warn or error when a document exceeds the window.
  - Some complexity here because tokenization varies between models, but we can be conservative.
- [Med] Expose the tokenizer when it’s available. This would help Smart Chunker count tokens correctly.
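
One possible shape for the metadata described in the subtasks above, as a rough sketch rather than a proposed API. Every name here (`EmbeddingFunctionInfo`, `index_params`, `count_tokens`, `check_fits`, and the `"space"`/`"dimension"` keys) is hypothetical, and the numeric values are illustrative, not taken from any model card.

```python
import math
import warnings
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class EmbeddingFunctionInfo:
    """Hypothetical metadata an EF class could carry."""
    dimensionality: int
    distance_metric: str  # the metric the model was trained for
    max_tokens: int       # context window length
    tokenizer: Optional[Callable[[str], List[str]]] = None  # set when available


def index_params(info: EmbeddingFunctionInfo) -> dict:
    """Derive index settings from the EF so the user does not pass them."""
    return {"space": info.distance_metric, "dimension": info.dimensionality}


def count_tokens(info: EmbeddingFunctionInfo, text: str,
                 safety_factor: float = 1.3) -> int:
    """Count tokens exactly if a tokenizer is exposed, else overestimate.

    The fallback inflates a whitespace word count by an assumed safety
    factor, which is the "be conservative" option when tokenization varies.
    """
    if info.tokenizer is not None:
        return len(info.tokenizer(text))
    return math.ceil(len(text.split()) * safety_factor)


def check_fits(info: EmbeddingFunctionInfo, text: str) -> bool:
    """Warn and return False when the text would exceed the context window."""
    n = count_tokens(info, text)
    if n > info.max_tokens:
        warnings.warn(f"document has ~{n} tokens, window is {info.max_tokens}")
        return False
    return True


# Illustrative values only.
info = EmbeddingFunctionInfo(dimensionality=4, distance_metric="cosine", max_tokens=8)
print(index_params(info))
print(check_fits(info, "short document"))
```

A chunker could read `max_tokens` and call `count_tokens` to size its chunks; with a real tokenizer attached, the safety factor is no longer needed.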