feat: blended search and indexing across multiple scopes (closes #337) by 1TommyCheung · Pull Request #525 · zilliztech/memsearch

1TommyCheung · 2026-05-03T10:50:24Z

Closes #337. Builds on #408 and #514, which enabled scope-switching via MEMSEARCH_DIR at the storage layer (one scope at a time). This PR adds first-class multi-scope support. A single memsearch search call fans out across N configured scopes, dedupes results by chunk_hash, applies per-scope quotas, and tags each result with its source scope. memsearch index routes each file to the right collection by longest path-prefix match.

Example uses

1. Solo developer — project history + cross-project personal preferences

The motivating use case from #337. Project notes stay isolated per repo; personal style and habit notes follow you everywhere.

# ~/.memsearch/config.toml (global)
[[scopes]]
name       = "personal"
collection = "ms_personal"
paths      = ["~/.memsearch/personal"]
quota      = 2

# project-root/.memsearch.toml (per-project)
[milvus]
collection = "ms_project_acme"

[paths]
include = ["./.memsearch/memory"]

$ memsearch search "how do I deploy"
--- Result 1 (score: 0.91, scope: project) ---   ← from this project's memory
--- Result 2 (score: 0.88, scope: personal) ---  ← your stored deploy preferences

Open a different project and memsearch search "python style" still surfaces your personal preferences alongside whatever the new project knows — no env var swapping, no session restart.

2. Team subscribing to a shared knowledge base (read-only scope)

A team has a curated knowledge collection (architecture docs, runbooks) indexed once by a CI job. Individual contributors attach to it as a read-only scope — they search it but never index into it.

[milvus]
collection = "ms_my_project"

[[scopes]]
name       = "team-knowledge"
collection = "team_chunks"
quota      = 2
# no paths → read-only, never indexed by this user

When the contributor runs memsearch index, only their own files are written. When they run memsearch search, results blend their project + team-knowledge with quotas.

3. Multi-agent system — shared facts + per-agent private context

A system runs N agents, each with its own private memory but all reading from a shared "facts" collection populated by an upstream registrar.

# Registrar indexes shared facts once (separate process / cron / CI):
registrar = MemSearch(
    paths=["./shared_facts/"],
    collection="ms_shared_facts",
)
await registrar.index()

# Each agent attaches: private (writable) + shared (read-only)
agent = MemSearch(
    paths=[f"./agent_{agent_id}_private/"],
    collection=f"ms_agent_{agent_id}_private",
    default_scope_name=f"agent_{agent_id}",
    extra_scopes=[
        Scope(name="shared", collection="ms_shared_facts", paths=[], quota=2),
    ],
)

Agents read shared facts AND their own private notes in one search call. Agents cannot see each other's private collections — privacy is preserved by the per-collection isolation Milvus already provides; this PR just adds the orchestration layer above it.

4. Mixed local + cloud Milvus

Per-scope uri / token lets a personal Milvus Lite live alongside a cloud-hosted team scope without proxying or duplicating storage:

[milvus]
uri = "~/.memsearch/local.db"   # personal — Milvus Lite

[[scopes]]
name       = "team-cloud"
collection = "team_chunks"
uri        = "https://xxx.zillizcloud.com"   # team — Zilliz Cloud
token      = "env:ZILLIZ_TOKEN"
quota      = 3
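
The `env:ZILLIZ_TOKEN` value above keeps the secret out of the config file. As a rough illustration of what such a reference implies, here is a minimal sketch that expands an `env:NAME` string from the environment. The helper name `resolve_env_ref` and its error behavior are assumptions for illustration, not memsearch's actual implementation.

```python
import os


def resolve_env_ref(value, environ=None):
    """Hypothetical helper: expand 'env:NAME' references, pass other strings through."""
    env = os.environ if environ is None else environ
    prefix = "env:"
    if not value.startswith(prefix):
        # Plain literal (for example a URI or a token written inline).
        return value
    name = value[len(prefix):]
    if name not in env:
        # Assumption: fail loudly rather than connect with an empty token.
        raise KeyError(f"environment variable {name!r} is not set")
    return env[name]
```

With `ZILLIZ_TOKEN` exported in the shell, `resolve_env_ref("env:ZILLIZ_TOKEN")` would return its value, while a literal such as `"~/.memsearch/local.db"` passes through unchanged.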

Backward compatibility

Single-scope behavior is byte-identical to today. No [default_scope] or [[scopes]] config and no extra_scopes= kwarg → identical output, no new fields on results, no behavior changes anywhere. The new APIs are all opt-in.

Surface

Python: MemSearch(extra_scopes=[Scope(...)]) plus optional default_scope_name and default_scope_quota kwargs. Search adds only_scope=[...] for ad-hoc query restriction.

TOML: new [default_scope] table and repeatable [[scopes]] array-of-tables with fields name, collection, paths, quota, uri, token. Existing env:VAR interpolation works inside scope entries.

CLI:

memsearch search "..." \
  --extra-scope name:collection[:quota] \
  --only-scope name1,name2
# --extra-scope is repeatable; --only-scope takes a comma-separated list of scope names

memsearch index    # indexes every scope's paths into its collection
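
The `name:collection[:quota]` and comma-separated formats above are simple enough to parse by hand. The following is a minimal sketch of such parsing, not the PR's `_parse_extra_scope`. The names `ExtraScopeSpec`, `parse_extra_scope`, and `parse_only_scope` are hypothetical, and rejecting quotas below 1 is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtraScopeSpec:
    """Hypothetical parsed form of one --extra-scope value."""
    name: str
    collection: str
    quota: Optional[int] = None


def parse_extra_scope(spec):
    """Parse 'name:collection[:quota]' into an ExtraScopeSpec."""
    parts = spec.split(":")
    if len(parts) not in (2, 3) or not parts[0] or not parts[1]:
        raise ValueError(f"expected name:collection[:quota], got {spec!r}")
    quota = None
    if len(parts) == 3:
        quota = int(parts[2])  # int() raises ValueError on non-numeric input
        if quota < 1:
            # Assumption: a quota must be a positive integer.
            raise ValueError(f"quota must be >= 1, got {quota}")
    return ExtraScopeSpec(parts[0], parts[1], quota)


def parse_only_scope(value):
    """Split 'name1,name2' into a list of names, ignoring stray whitespace."""
    return [name.strip() for name in value.split(",") if name.strip()]
```

For example, `parse_extra_scope("team-knowledge:team_chunks:2")` yields a spec with quota 2, and omitting the third field leaves the quota unset.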

Generality

Writable (with paths) and read-only (no paths) scopes.
Three quota modes: all-quota (every scope has a hard cap, unused slots are not redistributed), no-quota (globally top-K by score), mixed (scopes with a quota are capped first, scopes without one share the remaining slots).
Tie-break on equal score: default scope wins, then config order. This matches #337's "project memory higher priority" ask.
Path-overlap rejected at config-resolve time with a clear error.
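
The quota modes and tie-break rule listed above can be sketched as a pure function. This is an illustrative sketch under stated assumptions, not the PR's `_blend_scope_results`: the name `blend_scopes`, the hit shape (dicts with `chunk_hash` and `score`), and the exact treatment of the mixed mode are all hypothetical.

```python
def blend_scopes(results_by_scope, quotas, top_k):
    """Blend per-scope hits into one ranked list.

    results_by_scope: dict of scope name -> list of hits, in config order
                      with the default scope first (assumed convention).
    quotas:           dict of scope name -> int cap, or None for no cap.
    """
    order = {name: i for i, name in enumerate(results_by_scope)}

    def rank(hit):
        # Higher score first; on ties, earlier scope (default first) wins.
        return (-hit["score"], order[hit["scope"]])

    # Tag every hit with its source scope without mutating the inputs.
    tagged = [
        {**hit, "scope": name}
        for name, hits in results_by_scope.items()
        for hit in hits
    ]
    tagged.sort(key=rank)

    seen = set()
    taken = {name: 0 for name in results_by_scope}
    capped, uncapped = [], []
    for hit in tagged:
        if hit["chunk_hash"] in seen:
            continue  # duplicate chunk: the better-ranked copy was kept
        quota = quotas.get(hit["scope"])
        if quota is None:
            uncapped.append(hit)
        elif taken[hit["scope"]] < quota:
            taken[hit["scope"]] += 1
            capped.append(hit)
        else:
            continue  # over quota: dropped, slot is not redistributed
        seen.add(hit["chunk_hash"])

    # Quota'd hits claim slots first; unquota'd hits share what is left.
    capped = capped[:top_k]
    final = capped + uncapped[: top_k - len(capped)]
    final.sort(key=rank)
    return final
```

When every scope has a quota the result can underfill (fewer than `top_k` hits), and when no scope has one the function reduces to a global top-K with dedup.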

Test coverage

32 new tests, baseline 113 → 136 passed + 15 skipped (skips are existing OPENAI_API_KEY-gated integration tests, unchanged):

tests/test_core_unit.py (new) — Scope dataclass + the dedup+quota algorithm covering all 3 quota modes, tie-break, underfill (pure unit, no Milvus, no API key — runs in CI on every push).
tests/test_core.py — multi-scope search, only_scope, index path routing, read-only-scope skip.
tests/test_watcher_multi_scope.py (new) — longest-prefix scope resolution + watcher event routing.
tests/test_config.py — [[scopes]] round-trip, path-overlap.
tests/test_cli_*.py — flag parsing, help text, scope-tagged output.

Scenario validation

scripts/scenario_validation.py (new) runs three real-world workflows end-to-end with the ONNX local embedder (no API key, no cost):

Solo dev with project + personal scopes (use case 1 above).
Multi-agent shared memory with privacy isolation (use case 3 above).
Per-user single-scope isolation (back-compat).

uv run python scripts/scenario_validation.py

All three pass.

Untouched code

store.py, chunker.py, embeddings/, scanner.py, watcher.py (the file watcher itself is generic — only the dispatch in core.py changed), compact.py, reranker.py. No plugin hooks touched — Claude Code, Codex, OpenClaw, and OpenCode plugins continue to work unchanged. They can opt into multi-scope in follow-up PRs.

Out of scope (separate PRs welcomed)

Per-scope embedders (currently all scopes share the parent embedder; raises clearly on dimension mismatch)
Scope-aware compact
Plugin integration to expose multi-scope to plugin end users

Add pointer comments above MemSearchConfig and _SECTION_CLASSES to clarify that ScopeConfig/DefaultScopeConfig are intentionally unwired and will be integrated in Task 2 of the multi-scope plan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add frozen `Scope` dataclass with name, collection, paths, quota, uri, and token fields — first building block for multi-scope blended search.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add `default_scope_name`, `default_scope_quota`, and `extra_scopes` kwargs to `MemSearch.__init__`; build `self._stores: dict[str, MilvusStore]` with one entry per scope; keep `self._store` as a back-compat alias pointing at the default scope's store. Update `close()` to iterate all stores, with a `__new__`-safe fallback for test fixtures that bypass `__init__`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace MemSearch.search body with single-scope fast path (no scope tag, backwards-compatible) and multi-scope path using asyncio.gather fan-out, _blend_scope_results dedup+quota logic, and only_scope restriction with ValueError on unknown names. Add _seed_scope helper, two_scope_mem fixture, and four integration tests covering: no-scope-field on single-scope, scope tagging on multi-scope, only_scope restriction, and ValueError on unknown scope names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
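
The fan-out and only_scope behavior described in this commit can be illustrated with a small standalone sketch. It assumes each store exposes an async `search(query, top_k)` method, which may not match the real `MilvusStore` interface. `fan_out_search` and `FakeStore` are hypothetical names used only for this example.

```python
import asyncio


class FakeStore:
    """Stand-in for a per-scope store; returns canned hits (illustration only)."""

    def __init__(self, hits):
        self._hits = hits

    async def search(self, query, top_k):
        return self._hits[:top_k]


async def fan_out_search(stores, query, top_k, only_scope=None):
    """Query every selected scope concurrently and tag hits with their scope."""
    if only_scope is not None:
        unknown = [name for name in only_scope if name not in stores]
        if unknown:
            raise ValueError(f"unknown scope(s): {', '.join(unknown)}")
        # Keep config order regardless of the order given in only_scope.
        names = [name for name in stores if name in only_scope]
    else:
        names = list(stores)

    # One search coroutine per scope, awaited together.
    per_scope = await asyncio.gather(
        *(stores[name].search(query, top_k) for name in names)
    )
    return {
        name: [{**hit, "scope": name} for hit in hits]
        for name, hits in zip(names, per_scope)
    }
```

Validating scope names before issuing any query means a typo in only_scope fails fast instead of silently returning fewer results.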

MemSearch.index() now builds a plan from the default scope's _paths plus any extra_scopes with non-empty paths. Each file is indexed into the per-scope store via _index_file(scope_name=…). Read-only scopes (empty paths) are skipped entirely. _embed_and_store() also accepts an optional scope_name so it writes to the correct store. Backward-compat is preserved: objects constructed via __new__ without _default_scope_name / _stores fall back to the old _store attr; when scope_name is None the helpers use self._store as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
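
The plan-building step described here (writable scopes contribute paths, read-only scopes are skipped) can be sketched in a few lines. `ScopeSketch` and `build_index_plan` are hypothetical stand-ins, not the PR's `Scope` dataclass or its internal planner.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScopeSketch:
    """Hypothetical, reduced version of a scope for illustration."""
    name: str
    collection: str
    paths: Tuple[str, ...] = ()
    quota: Optional[int] = None


def build_index_plan(default, extras):
    """Return (scope_name, path) pairs for every scope that owns paths."""
    plan = [(default.name, path) for path in default.paths]
    for scope in extras:
        if not scope.paths:
            # No paths means read-only: searched, but never indexed here.
            continue
        plan.extend((scope.name, path) for path in scope.paths)
    return plan
```

With a project default scope, a personal scope that has paths, and a team scope that has none, the plan contains entries for the first two only.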

Add _resolve_scope_for_path() (longest-prefix match across all scopes) and index_file_for_scope() (scope-aware single-file indexer); update watch() to build a unified path list and route _on_change to the correct store via the resolver instead of hardcoding the default scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
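
Longest-prefix matching, as named in this commit, can be sketched with `pathlib` alone. This is an assumption-laden illustration rather than the PR's `_resolve_scope_for_path`: the function name, the `scope_paths` mapping, and returning `None` for unmatched files are all hypothetical choices.

```python
from pathlib import Path


def resolve_scope_for_path(file_path, scope_paths):
    """Return the scope whose root is the deepest ancestor of file_path.

    scope_paths: dict of scope name -> list of root directories.
    Returns None when no scope root contains the file.
    """
    target = Path(file_path).expanduser().resolve()
    best_name, best_depth = None, -1
    for name, roots in scope_paths.items():
        for root in roots:
            root_path = Path(root).expanduser().resolve()
            # Compare whole path components, not raw string prefixes.
            if target == root_path or root_path in target.parents:
                depth = len(root_path.parts)
                if depth > best_depth:
                    best_name, best_depth = name, depth
    return best_name
```

Comparing path components instead of string prefixes avoids treating a sibling directory such as `/srv/ws2` as if it were inside `/srv/ws`.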

Add _parse_extra_scope helper, two new Click options on the search command, and wire extra_scopes/only_scope through to MemSearch.search().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ings

Three scenario-driven workflows that exercise multi-scope routing end-to-end without requiring any API key (uses the ONNX local embedding provider):

1. Solo dev (closes zilliztech#337): project + global personal scopes, blended retrieval with quota enforcement and only_scope restriction.
2. Chat agents shared memory: a "registrar" indexes shared canon once; multiple agents (Alice, Bob) attach to it as a read-only scope (empty paths) while each writes to their own private scope. Verifies cross-agent privacy.
3. Individual isolation: two independent MemSearch instances on separate Milvus DBs cannot cross-leak. Single-scope behavior unchanged.

Run via: uv run python scripts/scenario_validation.py

1TommyCheung and others added 16 commits May 3, 2026 16:59

feat(config): add ScopeConfig and DefaultScopeConfig dataclasses

9aa430a

style: apply ruff format to config.py and test_config.py

5f750ee

feat(config): wire scopes and default_scope into MemSearchConfig

dc03449

style(config): apply ruff format to Task 2 changes

32409b8

feat(config): validate scope paths don't overlap

6876b35

feat(core): add Scope dataclass

2671b02

feat(core): pure dedup+quota helper for multi-scope blending

1edea02

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(cli): --extra-scope and --only-scope flags on search

7fcc9b1

feat(cli): include scope in search text output when multi-scope

451e6d4

feat(cli): pass config-loaded scopes to MemSearch; CLI flags append

a7306e9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1TommyCheung mentioned this pull request May 3, 2026

Feature request: recommended personal/global memory alongside project-scoped memory #337

Closed

1TommyCheung wants to merge 16 commits into zilliztech:main from 1TommyCheung:feat/multi-scope-blended-search
