DefaultEmbeddingFunction.__call__ constructs a new ONNXMiniLM_L6_V2 on every call (10× slowdown on repeated embeds) #6941

Description

@xelauvas

What happens

chromadb/api/types.py ships (chromadb 1.5.8):

class DefaultEmbeddingFunction(EmbeddingFunction[Documents]):
    def __call__(self, input: Documents) -> Embeddings:
        from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import (
            ONNXMiniLM_L6_V2,
        )
        return ONNXMiniLM_L6_V2()(input)

DefaultEmbeddingFunction.__call__ constructs a fresh ONNXMiniLM_L6_V2 every time it runs, triggering cold lazy-init of the tokenizer (~5ms) and the ONNX InferenceSession (~180ms) per invocation. Users whose workload hits embedding on a hot path (per-request retrieval, streaming ingest, RAG loops) pay ~200ms of avoidable tokenizer+model setup per call.

Secondary issue — thread contention under concurrency

Even with a cached instance, the default intra_op_num_threads=0 ("use all cores") causes severe context-switch thrashing when multiple concurrent queries hit the same session: each embed call fans out across all CPUs, producing worse-than-serial scaling.
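For reference, the thread counts discussed here are fields on onnxruntime's SessionOptions. The following is only a sketch of pinning them to 1 on a standalone session; the function name and `model_path` are hypothetical, and it does not show how chromadb builds its own session internally.

```python
from typing import Any


def build_single_threaded_session(model_path: str) -> Any:
    """Create an ONNX Runtime CPU session that uses one thread per run() call.

    `model_path` is a hypothetical path to a local .onnx file.
    """
    # Imported lazily so this module still loads where onnxruntime is absent.
    import onnxruntime as ort

    options = ort.SessionOptions()
    # 0 (the default) lets the runtime choose the thread count itself.
    options.intra_op_num_threads = 1
    options.inter_op_num_threads = 1
    return ort.InferenceSession(
        model_path,
        sess_options=options,
        providers=["CPUExecutionProvider"],
    )
```

With one thread per call, parallelism comes from the caller issuing several run() calls at once rather than from each call fanning out.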

Measurement

AMD EPYC-Rome VPS, 16 cores, 300-drawer palace, chromadb 1.5.8:

| Scenario | Pre-fix | Post-fix (singleton + intra_op=1) | Speedup |
| --- | --- | --- | --- |
| semantic_search p95 (single-user; per /search call, one embed inside) | 412 ms | 95 ms | 4.3× |
| Composite 4-layer wake-up p95 | 768 ms | 106 ms | 7.2× |
| 4-concurrent scaling ratio vs single-call | 4.36× | 1.35× | 3.2× better |
| Ingest per drawer | 299 ms | 104 ms | 2.9× |

The 4-concurrent ratio is the one worth dwelling on: running four calls back to back would cost 4× the single-call time, so a ratio of 4.36× means four parallel embed calls take longer than running them in series would. The default thread fan-out produces negative scaling.
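One way to reproduce a ratio of this kind is sketched below. It assumes the ratio is the wall time of n concurrent calls divided by the wall time of one call; the benchmark behind the table may differ (it reports p95 values), and `fake_embed` is a hypothetical stand-in for a real embed call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def timed(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def concurrent_scaling_ratio(work: Callable[[], object], n: int = 4) -> float:
    """Wall time for n concurrent calls divided by wall time for one call."""
    single = timed(work)

    def run_batch() -> None:
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [pool.submit(work) for _ in range(n)]
            for future in futures:
                future.result()  # re-raise any worker exception

    return timed(run_batch) / single


def fake_embed() -> None:
    """Hypothetical stand-in for an embed call; sleeping releases the GIL."""
    time.sleep(0.05)
```

A ratio near 1 means the calls overlap almost perfectly, a ratio near n means they are effectively serialized, and a ratio above n means concurrency is making things worse. Calling `concurrent_scaling_ratio(fake_embed, 4)` should land close to 1 because the stand-in only sleeps.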

Proposed fix

Two parts:

  1. Cache a single ONNXMiniLM_L6_V2 instance at the class level (or via module-level singleton) so DefaultEmbeddingFunction.__call__ routes every invocation through one instance.
  2. Construct that instance with intra_op_num_threads=1, inter_op_num_threads=1 so concurrent embeds parallelize across separate cores rather than contending for the same cores.

InferenceSession.run() is documented as thread-safe; the Rust-backed tokenizers.Tokenizer is thread-safe for encode. Vectors are byte-identical pre- and post-patch for identical input — no index rebuild required by users applying the change.
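The caching in part 1 could look roughly like the sketch below. It is not the proposed patch itself: `ExpensiveModel` is a hypothetical stand-in for ONNXMiniLM_L6_V2 that just counts constructions, and `CachedEmbeddingFunction` is an illustrative name. The lock only guards first-time construction; calls afterwards share the instance without locking, which relies on the thread-safety noted above.

```python
import threading
from typing import ClassVar, List, Optional


class ExpensiveModel:
    """Hypothetical stand-in for ONNXMiniLM_L6_V2; counts constructions."""

    constructions: ClassVar[int] = 0

    def __init__(self) -> None:
        type(self).constructions += 1

    def __call__(self, docs: List[str]) -> List[List[float]]:
        # Placeholder "embedding": one float per document (its length).
        return [[float(len(doc))] for doc in docs]


class CachedEmbeddingFunction:
    """Routes every call through one lazily built, class-level model."""

    _model: ClassVar[Optional[ExpensiveModel]] = None
    _lock: ClassVar[threading.Lock] = threading.Lock()

    @classmethod
    def _get_model(cls) -> ExpensiveModel:
        # Double-checked locking: skip the lock once the model exists.
        if cls._model is None:
            with cls._lock:
                if cls._model is None:
                    cls._model = ExpensiveModel()
        return cls._model

    def __call__(self, docs: List[str]) -> List[List[float]]:
        return self._get_model()(docs)
```

However many `CachedEmbeddingFunction` objects are created and called, the stand-in model is constructed once per process.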

Workaround downstream

A monkey-patch of DefaultEmbeddingFunction.__call__ that applies the singleton and intra_op=1 settings produces the numbers above. It is version-pinned to chromadb 1.5.8 because the patch depends on the exact call site. Discovered during MemPalace-sidecar integration work (xelauvas/xelasphere, April 2026); happy to share the patch module and full benchmark artifacts if useful.

Version

chromadb 1.5.8 (reproducible back to at least 1.4.x based on the identical __call__ body). Python 3.12.3, onnxruntime 1.24.4.
