Performance & scaling

How Engrava behaves as data grows, where the limits are, and the two levers that matter most: the vector backend and batched writes. The numbers that matter for your workload depend on corpus size, embedding dimension, query mix, and hardware — measure on your own data rather than trusting a single headline figure. This page explains what drives cost so you know what to measure.

For the dreaming quality benchmark (does consolidation help retrieval), see Benchmarks. For the hard platform constraints, see Known Limitations.

Where the cost is

A query touches up to five signals; each scales differently:

| Signal | Cost driver | Scaling |
| --- | --- | --- |
| FTS5 / BM25 | SQLite's FTS5 inverted index | Sub-linear; scales well into large corpora. |
| Vector | The vector backend (see below) | Linear in #embeddings for both backends; sqlite-vec scans a compact `vec0` table with a much smaller constant factor than the Python path. |
| Recency | A cheap per-candidate arithmetic decay | Negligible. |
| Priority | A per-candidate enum→multiplier lookup | Negligible. |
| Graph | 1-hop neighbour expansion over edges | Proportional to the fusion-pool size × average degree; opt-in (`graph_weight=0.0` makes zero graph queries). |

The dominant term at scale is almost always the vector signal, because both backends compare the query against every stored embedding — the difference is how efficiently they do it (see below).

The brute-force ceiling (and how to pass it)

Without the vec extra, vector search is brute-force cosine similarity in Python: every search_similar / search_hybrid query scans all embeddings. This is simple and dependency-free, and works well up to roughly 100k embeddings. Latency grows linearly with the number of embeddings at every size, so past that point the vector scan typically becomes the bottleneck.
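To make the linear cost concrete, here is a minimal pure-Python sketch of an exhaustive cosine-similarity search. It is an illustration only, not Engrava's implementation: the function names and the dict-of-vectors layout are assumptions chosen for the example.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score the query against every stored vector and keep the best top_k."""
    # One similarity computation per stored embedding: this loop is the
    # linear term that grows with corpus size.
    scored = (
        (cosine_similarity(query, vector), key)
        for key, vector in embeddings.items()
    )
    return [(key, score) for score, key in heapq.nlargest(top_k, scored)]
```

Doubling the number of stored vectors doubles the work per query, whatever `top_k` is, because every vector is scored before the best ones are selected.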

The fix is the sqlite-vec backend, which stores vectors in a dedicated, compact vec0 virtual table. In the pinned sqlite-vec 0.1.x line a vec0 query is still an exhaustive k-nearest-neighbour scan — not an approximate or sub-linear index — but over a tightly packed, chunked columnar store, so it runs with a far smaller constant factor (and lower memory overhead) than the Python brute-force path. The practical effect is that the same corpus stays well under your latency budget for much longer. FTS5 scales independently and usually needs no special handling.

The ~100k figure is a rule of thumb, not a cliff — see Known Limitations → sqlite-vec. Measure your own p95 query latency and switch when it stops meeting your budget.
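One way to measure p95 is a small timing harness around whatever query you care about. The sketch below uses only the standard library; `percentile` and `measure_latencies` are hypothetical helper names, and the nearest-rank percentile definition is one common choice among several.

```python
import math
import time
from typing import Awaitable, Callable, List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


async def measure_latencies(
    run_query: Callable[[], Awaitable[object]],
    runs: int = 200,
    warmup: int = 10,
) -> List[float]:
    """Time repeated calls to run_query and return per-call latencies in milliseconds."""
    # Warm-up calls are executed but not recorded, so cold caches do not skew results.
    for _ in range(warmup):
        await run_query()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        await run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

With an open store, `latencies = await measure_latencies(lambda: store.search_similar(query_vector, top_k=5))` followed by `percentile(latencies, 95)` gives a figure to compare against your budget. Use query vectors and a corpus size that resemble production, since a single repeated query can flatter the result.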

Switching to sqlite-vec (incl. migrating an existing database)

The migration is designed to be turnkey: your embeddings already live in the embedding table, so switching backends only builds and backfills the vec0 vector table — you do not re-embed anything.

1. Install the extra.

pip install 'engrava[vec]'

2. Set the backend in your config.

extensions:
  vector:
    backend: sqlite-vec      # default is "numpy"
    dimension: 384           # must match your embedding model

3. Open the store with from_config. On open, Engrava creates the vec0 virtual table and backfills every existing embedding into it automatically (idempotent — safe to run repeatedly). From then on, new writes keep the index in sync.

from engrava import SqliteEngravaCore


async def search(query_vector):
    # from_config wires the vector backend; the index is created and back-filled
    # on open. A plain SqliteEngravaCore(conn) constructor stays on numpy.
    async with await SqliteEngravaCore.from_config("engrava.yaml") as store:
        return await store.search_similar(query_vector, top_k=5)

That's the whole migration — no manual re-index step, and no re-embedding, because the vectors are reused from the existing embedding table.

Important caveats.

Use from_config. Only the from_config path configures the vector backend. If you build the store directly with SqliteEngravaCore(conn), it stays on the numpy backend regardless of the YAML.
Graceful fallback, not a hard error. If the sqlite-vec package is missing or the extension can't load, Engrava logs a warning and falls back to numpy rather than crashing — so a "switch" that silently kept numpy usually means the extension didn't load.
macOS system SQLite blocks extensions. The most common load failure is macOS's bundled SQLite, which disables extension loading. Install Python via Homebrew or pyenv (a full-featured SQLite build). See Known Limitations → macOS.
Dimension must match. The index is created for a fixed dimension; it must equal your embedding model's output. Mixing dimensions corrupts results (see Embedding Dimension Consistency).
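Because the fallback is silent apart from a log warning, it can help to check up front whether your Python's sqlite3 module allows extension loading at all. The following is a standard-library sketch; `can_load_sqlite_extensions` is a hypothetical helper, not part of Engrava, and a positive result only shows that loading is permitted, not that sqlite-vec itself is installed.

```python
import sqlite3


def can_load_sqlite_extensions() -> bool:
    """Return True if this Python's sqlite3 build permits loading extensions."""
    conn = sqlite3.connect(":memory:")
    try:
        # Builds without loadable-extension support lack this method entirely
        # (AttributeError); others may refuse with a sqlite3 error.
        conn.enable_load_extension(True)
    except (AttributeError, sqlite3.Error):
        return False
    else:
        conn.enable_load_extension(False)
        return True
    finally:
        conn.close()


if __name__ == "__main__":
    print("SQLite library version:", sqlite3.sqlite_version)
    print("Extension loading available:", can_load_sqlite_extensions())
```

If this reports `False`, the interpreter is the problem rather than the Engrava config, and switching to a Python build with a full-featured SQLite is the likely fix.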

Write throughput and bulk ingest

By default each mutating call commits its own transaction. For a bulk load that is the wrong granularity: the per-row commit dominates wall-clock time. Wrap the batch in suspend_auto_commit(), which defers to a single commit on success and rolls the whole batch back on any error:

async def bulk_load(store, items):
    async with store.suspend_auto_commit():
        for item in items:
            await store.create_thought(item, deduplicate=True)
    return await store.count_thoughts()

deduplicate=True collapses identical content into one thought (bumping confirmation_count) instead of inserting duplicate rows — cheaper storage and fewer embeddings to compute. (Note the persistence default is deduplicate=False; opt in per call.)
Keep each transaction short. A long-running transaction blocks aiosqlite's background thread (see Known Limitations → aiosqlite), so for very large imports, batch in chunks (e.g. a few thousand rows per suspend_auto_commit() block) rather than one giant transaction.
Embedding cost dominates a bulk load when a provider is configured with auto_embed=True: each new thought is embedded on write. Pre-compute vectors and store them with store_embedding(...), use a batching local provider, or import in chunks so the encoder isn't the bottleneck. See the Embeddings guide.
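The chunked variant can be sketched as follows. It relies only on the `suspend_auto_commit()` and `create_thought(..., deduplicate=True)` calls shown above; `chunked`, `bulk_load_chunked`, and the default chunk size of 2000 are assumptions for illustration, so tune the size against your own data.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


async def bulk_load_chunked(store, items: Iterable, chunk_size: int = 2000) -> int:
    """Load items in bounded transactions; returns the number of items submitted."""
    loaded = 0
    for chunk in chunked(items, chunk_size):
        # One commit per chunk: far fewer commits than per-row, but each
        # transaction stays short.
        async with store.suspend_auto_commit():
            for item in chunk:
                await store.create_thought(item, deduplicate=True)
        loaded += len(chunk)
    return loaded
```

The trade-off against a single transaction is atomicity: an error rolls back only the chunk in flight, and earlier chunks stay committed, so a large import should record how far it got if it needs to resume.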

A runnable end-to-end bulk-import example lives in the migration guide.

Dreaming cost at scale

Dreaming runs off the hot path — you invoke run_consolidation() on your own cadence, so it never adds latency to CRUD or search. Its own cost scales with the number of candidate thoughts and the clustering algorithm:

Run it periodically, not every turn (every N cycles, a cron job, or manually).
candidates_limit caps how many thoughts are evaluated per pass — keep it bounded on large stores.
Clustering has two backends via extensions.dreaming.clustering_backend ("numpy" default, or "python"); numpy is faster for the similarity math on larger candidate sets.
The LPA clustering algorithm is O(edges × iterations); the agglomerative algorithm operates over active thoughts — see Dreaming for the algorithm tradeoffs.
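To show where the O(edges × iterations) cost comes from, here is a generic label-propagation sketch in pure Python. It is not Engrava's implementation: the edge-list input, the in-place update order, and the smallest-label tie-break are assumptions made to keep the example deterministic.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def label_propagation(
    edges: Iterable[Tuple[Hashable, Hashable]],
    max_iterations: int = 10,
) -> Dict[Hashable, Hashable]:
    """Cluster an undirected graph: each node adopts its neighbours' most common label."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Every node starts in its own cluster, labelled by its own id.
    labels = {node: node for node in neighbours}
    order = sorted(neighbours)  # fixed visiting order keeps runs reproducible

    for _ in range(max_iterations):
        changed = False
        for node in order:
            # Visiting every node's neighbours touches each edge twice,
            # so one pass costs O(edges).
            counts = Counter(labels[n] for n in neighbours[node])
            best = max(counts.values())
            new_label = min(label for label, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # converged early; no need to spend remaining iterations
    return labels
```

Each pass is bounded by the number of edges, so a bounded candidate set keeps the edge count, and therefore the cost of a pass, under control.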

Checklist: scaling Engrava

Past ~100k embeddings or missing your latency budget? Switch to sqlite-vec (above).
Bulk loading? Batch writes with suspend_auto_commit() and consider deduplicate=True.
Embedding is the bottleneck? Use a batching provider or pre-compute vectors.
Multi-tenant? One database file per tenant via EngravaManager keeps each store smaller and independently lockable (see the scoping section).
Dreaming heavy? Cap candidates_limit, run it on a schedule, pick the right clustering_backend.
