feat: blended search and indexing across multiple scopes (closes #337) #525

Open
1TommyCheung wants to merge 16 commits into zilliztech:main from 1TommyCheung:feat/multi-scope-blended-search

Closes #337. Builds on #408 and #514, which enabled scope-switching via MEMSEARCH_DIR at the storage layer (one scope at a time). This PR adds first-class multi-scope support: one memsearch search call fans out across N configured scopes (dedupes by chunk_hash, applies per-scope quotas, tags each result with its source scope); memsearch index routes files to the right collection by path-prefix match.
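The fan-out step can be pictured with a minimal sketch. It assumes each scope's store is just an async callable returning dict hits; `fake_store`, `search_one_scope`, and `fan_out` are hypothetical names for illustration, not the PR's internals (the commit log only states that the real multi-scope path uses `asyncio.gather`).

```python
import asyncio


def fake_store(rows):
    """Return an async callable standing in for one scope's vector store."""
    async def search(query, top_k):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return rows[:top_k]
    return search


async def search_one_scope(name, store, query, top_k):
    hits = await store(query, top_k)
    # Tag every hit with the scope it came from.
    return [{**hit, "scope": name} for hit in hits]


async def fan_out(stores, query, top_k=5):
    """Query every scope concurrently and return hits keyed by scope name."""
    names = list(stores)
    per_scope = await asyncio.gather(
        *(search_one_scope(n, stores[n], query, top_k) for n in names)
    )
    return dict(zip(names, per_scope))


# Hypothetical data: two scopes, one hit each.
stores = {
    "project": fake_store([{"chunk_hash": "a1", "score": 0.91}]),
    "personal": fake_store([{"chunk_hash": "b2", "score": 0.88}]),
}
results = asyncio.run(fan_out(stores, "how do I deploy"))
```

Because `asyncio.gather` returns results in the order the awaitables were passed, zipping them back onto the scope names is safe, so scope order (and with it tie-break priority) survives the concurrent step.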

Example uses

1. Solo developer — project history + cross-project personal preferences

The motivating use case from #337. Project notes stay isolated per repo; personal style and habit notes follow you everywhere.

# ~/.memsearch/config.toml (global)
[[scopes]]
name       = "personal"
collection = "ms_personal"
paths      = ["~/.memsearch/personal"]
quota      = 2

# project-root/.memsearch.toml (per-project)
[milvus]
collection = "ms_project_acme"

[paths]
include = ["./.memsearch/memory"]

$ memsearch search "how do I deploy"
--- Result 1 (score: 0.91, scope: project) ---   ← from this project's memory
--- Result 2 (score: 0.88, scope: personal) ---  ← your stored deploy preferences

Open a different project and memsearch search "python style" still surfaces your personal preferences alongside whatever the new project knows — no env var swapping, no session restart.

2. Team subscribing to a shared knowledge base (read-only scope)

A team has a curated knowledge collection (architecture docs, runbooks) indexed once by a CI job. Individual contributors attach to it as a read-only scope — they search it but never index into it.

[milvus]
collection = "ms_my_project"

[[scopes]]
name       = "team-knowledge"
collection = "team_chunks"
quota      = 2
# no paths → read-only, never indexed by this user

When the contributor runs memsearch index, only their own files are written. When they run memsearch search, results blend their project + team-knowledge with quotas.
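The writable versus read-only distinction comes down to whether a scope declares any paths. A minimal sketch of that rule follows; `ScopeSketch` and `build_index_plan` are hypothetical stand-ins rather than the PR's `Scope` class or its index planner, and the project path is an assumed example value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScopeSketch:
    """Illustrative stand-in for a scope entry (not the PR's Scope class)."""
    name: str
    collection: str
    paths: Tuple[str, ...] = ()
    quota: Optional[int] = None


def build_index_plan(scopes):
    """Map collection -> paths, skipping scopes that declare no paths."""
    return {s.collection: list(s.paths) for s in scopes if s.paths}


scopes = [
    # Writable: has paths, so it is indexed (path is an assumed example).
    ScopeSketch("project", "ms_my_project", ("./.memsearch/memory",)),
    # Read-only: no paths, so indexing never writes to it.
    ScopeSketch("team-knowledge", "team_chunks", (), quota=2),
]
plan = build_index_plan(scopes)
```

Here `plan` contains only the project collection; the team collection stays searchable but is never a write target.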

3. Multi-agent system — shared facts + per-agent private context

A system runs N agents, each with its own private memory but all reading from a shared "facts" collection populated by an upstream registrar.

# Registrar indexes shared facts once (separate process / cron / CI):
registrar = MemSearch(
    paths=["./shared_facts/"],
    collection="ms_shared_facts",
)
await registrar.index()

# Each agent attaches: private (writable) + shared (read-only)
agent = MemSearch(
    paths=[f"./agent_{agent_id}_private/"],
    collection=f"ms_agent_{agent_id}_private",
    default_scope_name=f"agent_{agent_id}",
    extra_scopes=[
        Scope(name="shared", collection="ms_shared_facts", paths=[], quota=2),
    ],
)

Agents read shared facts AND their own private notes in one search call. Agents cannot see each other's private collections — privacy is preserved by the per-collection isolation Milvus already provides; this PR just adds the orchestration layer above it.

4. Mixed local + cloud Milvus

Per-scope uri / token lets a personal Milvus Lite live alongside a cloud-hosted team scope without proxying or duplicating storage:

[milvus]
uri = "~/.memsearch/local.db"   # personal — Milvus Lite

[[scopes]]
name       = "team-cloud"
collection = "team_chunks"
uri        = "https://xxx.zillizcloud.com"   # team — Zilliz Cloud
token      = "env:ZILLIZ_TOKEN"
quota      = 3

Backward compatibility

Single-scope behavior is byte-identical to today. With no [default_scope] or [[scopes]] config and no extra_scopes= kwarg, output is identical: no new fields on results and no behavior changes anywhere. The new APIs are all opt-in.

Surface

Python: MemSearch(extra_scopes=[Scope(...)]) plus optional default_scope_name and default_scope_quota kwargs. Search adds only_scope=[...] for ad-hoc query restriction.

TOML: new [default_scope] table and repeatable [[scopes]] array-of-tables with fields name, collection, paths, quota, uri, token. Existing env:VAR interpolation works inside scope entries.

CLI:

memsearch search "..." \
  --extra-scope name:collection[:quota]   # repeatable
  --only-scope name1,name2                 # comma-separated

memsearch index    # indexes every scope's paths into its collection
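The `name:collection[:quota]` and comma-separated formats above are simple to parse. The sketch below is one plausible reading of those formats; `parse_extra_scope_spec` and `parse_only_scope` are hypothetical names, not the PR's `_parse_extra_scope` helper, and rejecting quotas below 1 is an assumption.

```python
from typing import List, Optional, Tuple


def parse_extra_scope_spec(spec: str) -> Tuple[str, str, Optional[int]]:
    """Parse 'name:collection[:quota]' into (name, collection, quota)."""
    parts = spec.split(":")
    if len(parts) not in (2, 3) or not parts[0] or not parts[1]:
        raise ValueError(f"expected name:collection[:quota], got {spec!r}")
    quota = None
    if len(parts) == 3:
        quota = int(parts[2])  # int() raises ValueError on non-numeric input
        if quota < 1:
            raise ValueError(f"quota must be >= 1, got {quota}")
    return parts[0], parts[1], quota


def parse_only_scope(value: str) -> List[str]:
    """Split 'name1,name2' into a list, ignoring blanks and whitespace."""
    return [name.strip() for name in value.split(",") if name.strip()]
```

Splitting on `:` means this short form cannot carry a per-scope `uri` such as `https://...`; in this sketch those would have to come from the TOML config.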

Generality

  • Writable (with paths) and read-only (no paths) scopes.
  • Three quota modes: all-quota (hard caps, no redistribution), no-quota (globally top-K by score), mixed (quota'd capped first, unquota'd share remainder).
  • Tie-break on equal score: default scope wins, then config order. This matches #337's "project memory higher priority" ask.
  • Path-overlap rejected at config-resolve time with a clear error.
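The quota modes and tie-break can be sketched as a small pure function. This is one interpretation of the bullets above, not the PR's `_blend_scope_results`; `Hit` and `blend` are hypothetical, and the order of `hits_by_scope` is assumed to encode priority (default scope first, then config order).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Hit:
    chunk_hash: str
    score: float
    scope: str


def blend(hits_by_scope: Dict[str, List[Hit]],
          quotas: Dict[str, Optional[int]],
          top_k: int) -> List[Hit]:
    # Dict order is priority order: default scope first, then config order.
    rank = {name: i for i, name in enumerate(hits_by_scope)}
    sort_key = lambda h: (-h.score, rank[h.scope])
    ordered = sorted(
        (h for hits in hits_by_scope.values() for h in hits), key=sort_key
    )

    # Dedupe by chunk_hash, keeping the best-scoring (then highest-priority) copy.
    seen, unique = set(), []
    for h in ordered:
        if h.chunk_hash not in seen:
            seen.add(h.chunk_hash)
            unique.append(h)

    # Quota'd scopes are hard-capped; unquota'd scopes share what is left.
    taken = {name: 0 for name in hits_by_scope}
    capped, uncapped = [], []
    for h in unique:
        quota = quotas.get(h.scope)
        if quota is None:
            uncapped.append(h)
        elif taken[h.scope] < quota:
            taken[h.scope] += 1
            capped.append(h)

    capped = capped[:top_k]
    picked = capped + uncapped[: top_k - len(capped)]
    return sorted(picked, key=sort_key)


hits = {
    "project": [Hit("a", 0.9, "project"), Hit("b", 0.8, "project"),
                Hit("c", 0.7, "project")],
    "personal": [Hit("a", 0.9, "personal"), Hit("d", 0.85, "personal"),
                 Hit("e", 0.75, "personal")],
}
no_quota = blend(hits, {}, top_k=3)
all_quota = blend(hits, {"project": 1, "personal": 1}, top_k=5)
mixed = blend(hits, {"personal": 1}, top_k=4)
```

In this sketch the all-quota call returns two hits even though `top_k` is 5, which is the "no redistribution" underfill case; the duplicate chunk `a` is credited to `project` because the default scope wins ties.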

Test coverage

32 new tests, baseline 113 → 136 passed + 15 skipped (skips are existing OPENAI_API_KEY-gated integration tests, unchanged):

  • tests/test_core_unit.py (new) — Scope dataclass + the dedup+quota algorithm covering all 3 quota modes, tie-break, underfill (pure unit, no Milvus, no API key — runs in CI on every push).
  • tests/test_core.py — multi-scope search, only_scope, index path routing, read-only-scope skip.
  • tests/test_watcher_multi_scope.py (new) — longest-prefix scope resolution + watcher event routing.
  • tests/test_config.py: [[scopes]] round-trip, path-overlap.
  • tests/test_cli_*.py — flag parsing, help text, scope-tagged output.
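The longest-prefix scope resolution exercised by the watcher tests can be illustrated as follows. This is a sketch under assumptions: `resolve_scope_for_path` here is a standalone function taking a plain dict of scope name to root paths, not the PR's `_resolve_scope_for_path()` method, and the nested roots in the example exist only to show the longest-prefix rule (whether nested roots pass the path-overlap check is not stated above).

```python
from pathlib import Path
from typing import Dict, List, Optional


def resolve_scope_for_path(path: str,
                           scope_paths: Dict[str, List[str]]) -> Optional[str]:
    """Return the scope whose root is the deepest ancestor of `path`."""
    target = Path(path).expanduser().resolve()
    best_name, best_depth = None, -1
    for name, roots in scope_paths.items():
        for root in roots:
            root_path = Path(root).expanduser().resolve()
            # Component-wise check, so /work/acme does not match /work/acme2.
            if target.is_relative_to(root_path):
                depth = len(root_path.parts)
                if depth > best_depth:
                    best_name, best_depth = name, depth
    return best_name


# Hypothetical layout used only for the demonstration.
scope_paths = {
    "project": ["/work/acme"],
    "personal": ["/work/acme/personal"],
}
```

`Path.is_relative_to` (Python 3.9+) compares whole path components, which avoids the false positives a raw string `startswith` would give for sibling directories sharing a name prefix.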

Scenario validation

scripts/scenario_validation.py (new) runs three real-world workflows end-to-end with the ONNX local embedder (no API key, no cost):

  1. Solo dev with project + personal scopes (use case 1 above).
  2. Multi-agent shared memory with privacy isolation (use case 3 above).
  3. Per-user single-scope isolation (back-compat).

uv run python scripts/scenario_validation.py

All three pass.

Untouched code

store.py, chunker.py, embeddings/, scanner.py, watcher.py (the file watcher itself is generic — only the dispatch in core.py changed), compact.py, reranker.py. No plugin hooks touched — Claude Code, Codex, OpenClaw, and OpenCode plugins continue to work unchanged. They can opt into multi-scope in follow-up PRs.

Out of scope (separate PRs welcomed)

  • Per-scope embedders (currently all scopes share the parent embedder; raises clearly on dimension mismatch)
  • Scope-aware compact
  • Plugin integration to expose multi-scope to plugin end users

1TommyCheung and others added 16 commits May 3, 2026 16:59
Add pointer comments above MemSearchConfig and _SECTION_CLASSES to
clarify that ScopeConfig/DefaultScopeConfig are intentionally unwired
and will be integrated in Task 2 of the multi-scope plan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add frozen `Scope` dataclass with name, collection, paths, quota, uri,
and token fields, the first building block for multi-scope blended search.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add `default_scope_name`, `default_scope_quota`, and `extra_scopes` kwargs
to `MemSearch.__init__`; build `self._stores: dict[str, MilvusStore]` with
one entry per scope; keep `self._store` as a back-compat alias pointing at
the default scope's store. Update `close()` to iterate all stores, with a
`__new__`-safe fallback for test fixtures that bypass `__init__`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace MemSearch.search body with single-scope fast path (no scope tag,
backwards-compatible) and multi-scope path using asyncio.gather fan-out,
_blend_scope_results dedup+quota logic, and only_scope restriction with
ValueError on unknown names.  Add _seed_scope helper, two_scope_mem
fixture, and four integration tests covering: no-scope-field on
single-scope, scope tagging on multi-scope, only_scope restriction, and
ValueError on unknown scope names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

MemSearch.index() now builds a plan from the default scope's _paths
plus any extra_scopes with non-empty paths.  Each file is indexed into
the per-scope store via _index_file(scope_name=…).  Read-only scopes
(empty paths) are skipped entirely.  _embed_and_store() also accepts an
optional scope_name so it writes to the correct store.

Backward-compat is preserved: objects constructed via __new__ without
_default_scope_name / _stores fall back to the old _store attr; when
scope_name is None the helpers use self._store as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add _resolve_scope_for_path() (longest-prefix match across all scopes)
and index_file_for_scope() (scope-aware single-file indexer); update
watch() to build a unified path list and route _on_change to the
correct store via the resolver instead of hardcoding the default scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add _parse_extra_scope helper, two new Click options on the search
command, and wire extra_scopes/only_scope through to MemSearch.search().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ings

Three scenario-driven workflows that exercise multi-scope routing end-to-end
without requiring any API key (uses the ONNX local embedding provider):

1. Solo dev (closes zilliztech#337): project + global personal scopes, blended retrieval
   with quota enforcement and only_scope restriction.
2. Chat agents shared memory: a "registrar" indexes shared canon once; multiple
   agents (Alice, Bob) attach to it as a read-only scope (empty paths) while
   each writes to their own private scope. Verifies cross-agent privacy.
3. Individual isolation: two independent MemSearch instances on separate
   Milvus DBs cannot cross-leak. Single-scope behavior unchanged.

Run via: uv run python scripts/scenario_validation.py
Development

Successfully merging this pull request may close these issues.

Feature request: recommended personal/global memory alongside project-scoped memory
