The repo-aware, team-aware, token-efficient memory layer for Claude Code.
Claude Code doesn't fail because it lacks intelligence.
It fails because it has zero awareness of your repo and your team.
mcp-brain is a Model Context Protocol (MCP) server that gives Claude Code persistent, structured awareness of your project — without burning tokens on context rebuilding.
| 🧠 | Compressed awareness in ~100 tokens instead of ~2000 |
|---|---|
| 🎯 | 63.4% Hit@10 on SWE-bench Full (2294 real GitHub issues) — zero LLM cost |
| ⚡ | Sub-100ms file prediction (BM25 + code graph + optional semantic reranker) |
| 👥 | Team-aware: soft claims, conflict detection, ownership tracking |
| 🔄 | Self-healing: decision lifecycle, automatic staleness, feedback loop |
| 🛡️ | Local-first: SQLite, no cloud, no embeddings required, GDPR-friendly |
- The Problem
- What mcp-brain Changes
- In 60 seconds
- How It Works
- Memory Hierarchy
- Prediction Pipeline
- Decision Lifecycle
- Architecture
- Benchmark Results
- Token Efficiency
- Quick Start
- MCP Tools
- Use Cases
- FAQ
- Trade-offs
- Roadmap
- License
Without persistent awareness, Claude Code operates blindly at the start of every session:
| Without mcp-brain | With mcp-brain |
|---|---|
| ❌ No idea which files matter | ✅ Predicted files in top-K |
| ❌ Re-explores the repo every session | ✅ Compressed context in ~100 tokens |
| ❌ No visibility into teammates' WIP | ✅ Soft claims + conflict detection |
| ❌ Acts on outdated decisions | ✅ Decision lifecycle (active → stale) |
| ❌ Burns 2000–5000 tokens just to "orient" | ✅ One YAML block, ready to act |
Result without mcp-brain: wrong file exploration → outdated suggestions → merge conflicts → massive token waste.
┌──────────────────────────────────────────────────────┐
│ │
│ Without: Claude → explores → guesses → retries │
│ → conflicts → high token usage │
│ │
│ With: Claude → predicts → verifies → acts │
│ → aligned → low token usage │
│ │
└──────────────────────────────────────────────────────┘
Instead of giving Claude more context, we give it structured awareness of reality.
We track:
- 📌 what changed (signal extraction from git)
- 🎯 what matters (scoring + lifecycle)
- 👥 who's working on what (team claims)
- 🧭 where to act (issue → file prediction)
…and we deliver it in ~100 tokens.
You drop a one-line ticket into Claude Code:
> work on ticket #42 — JWT login broken
Without mcp-brain, Claude starts grep-walking the repo, reading directory listings, opening README, sampling files — burning 2000+ tokens before producing the first useful sentence.
With mcp-brain, in <100ms Claude receives:
predictions:
  - file: src/auth.py
    confidence: high
    why: "path + symbol match: login, jwt"
  - file: src/middleware.py
    confidence: medium
    why: "imports auth (hop 1)"
  - file: src/jwt_utils.py
    confidence: medium
    why: "called_by auth.login"
team_claims:
  - { ticket: 39, author: dev-B, files: [middleware.py] }   # ⚠️ overlap
avoid:
  - "HS256 — vulnerable to key confusion. Migrated to RS256 in commit a1b2c3."
decisions:
  - "tokens stored in httpOnly cookie, never localStorage"
It's structured reality, not regenerated context. Claude can act on the first turn.
flowchart TD
subgraph Capture[Capture signals]
A[Git commit] -->|filtered signals| B[mcp-brain memory]
C[Session end] -->|structured snapshot| B
end
subgraph Predict[Predict where to act]
E[Ticket opened] --> F[File predictor]
F -->|top-K files + confidence + why| D[Claude Code]
end
subgraph Coordinate[Coordinate team work]
F -->|overlap check| G[Team claims]
G -->|conflict warnings| D
end
subgraph Learn[Learn from outcomes]
H[Outcome recorded] -->|precision / recall| I[Feedback loop]
I -->|demote noisy memories| B
I -->|supersede stale decisions| B
end
B -->|~100-token YAML context| D
- Capture — git hooks promote only high-signal events (decisions, patterns, things to avoid). Ignored: docs, chore, tests, CI noise.
- Compress — three-level memory (L1/L2/L3) auto-assigned by a scoring function (recency 35% + frequency 30% + impact 20% + explicit 15%).
- Predict — issue title/body → ranked file list via BM25 + code graph expansion + optional semantic reranker.
- Coordinate — soft claims warn before two devs touch the same files.
- Self-correct — every closed ticket feeds precision/recall stats; noisy memories are auto-demoted.
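The Coordinate step can be pictured as a set intersection between the predicted files and each teammate's claimed files. The sketch below is a minimal illustration, assuming claims are matched by file name only. The `Claim` class and `find_conflicts` function are hypothetical names, not mcp-brain's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """A soft claim: a teammate has declared work in progress on some files."""
    ticket: int
    author: str
    files: frozenset[str]


def basename(path: str) -> str:
    """Reduce 'src/middleware.py' to 'middleware.py' (a simplifying assumption)."""
    return path.rsplit("/", 1)[-1]


def find_conflicts(predicted: list[str], claims: list[Claim]) -> list[tuple[Claim, list[str]]]:
    """Return every claim that overlaps the predicted files, with the shared names."""
    predicted_names = {basename(path) for path in predicted}
    conflicts = []
    for claim in claims:
        shared = sorted(predicted_names & {basename(path) for path in claim.files})
        if shared:
            conflicts.append((claim, shared))
    return conflicts


claims = [Claim(ticket=39, author="dev-B", files=frozenset({"middleware.py"}))]
predicted = ["src/auth.py", "src/middleware.py", "src/jwt_utils.py"]

for claim, shared in find_conflicts(predicted, claims):
    print(f"warning: ticket #{claim.ticket} ({claim.author}) overlaps on {shared}")
```

A soft claim only warns; nothing in this sketch blocks the second developer from editing the file.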
Memories aren't dumped into one bag. They're scored and tiered, so the high-token slot in your prompt only carries what's signal-dense for this moment:
- L1 — hot context loads automatically every session. Stack, conventions, current branch, recent commits, team claims, active high-confidence decisions. Capped at ~70 tokens.
- L2 — warm context loads only on demand (brain_get_decisions). Historical reasoning, superseded patterns, the why behind a past trade-off.
- L3 — cold archive is never sent to the model. Kept for audit, transparency, and the lifecycle's "undo" path.
The score is a transparent linear formula — no black-box embedding similarity. Every memory's level is reproducible and explainable.
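To make the linear formula concrete, here is a small sketch using the four weights quoted earlier (recency 35%, frequency 30%, impact 20%, explicit 15%). The tier cut-offs, the signal names as dictionary keys, and the assumption that each signal is normalized to [0, 1] are illustrative guesses, not values taken from mcp-brain.

```python
# Weights as stated in the README.
WEIGHTS = {"recency": 0.35, "frequency": 0.30, "impact": 0.20, "explicit": 0.15}

# Hypothetical tier cut-offs, chosen only for illustration.
L1_THRESHOLD = 0.60
L2_THRESHOLD = 0.30


def memory_score(signals: dict[str, float]) -> float:
    """Weighted sum of the four signals; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def assign_level(score: float) -> str:
    """Map a score to a memory tier using the assumed cut-offs."""
    if score >= L1_THRESHOLD:
        return "L1"
    if score >= L2_THRESHOLD:
        return "L2"
    return "L3"


fresh_decision = {"recency": 1.0, "frequency": 0.5, "impact": 0.8, "explicit": 1.0}
old_note = {"recency": 0.1, "frequency": 0.1, "impact": 0.2, "explicit": 0.0}

for label, signals in [("fresh decision", fresh_decision), ("old note", old_note)]:
    score = memory_score(signals)
    print(f"{label}: score={score:.2f} level={assign_level(score)}")
```

Because the score is a plain weighted sum, the contribution of each signal can be read directly from the inputs, which is what makes a memory's level explainable.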
The predictor is three deterministic stages:
| Stage | What it does | Cost |
|---|---|---|
| 1. BM25 + IDF | Tokenize issue, match against symbols / identifiers / paths in an inverted index | ~5 ms |
| 2. Graph expansion | Walk imports / imported_by / called_by from seeds. Score decays per hop (×0.5, ×0.25) | ~10 ms |
| 3. Semantic rerank (optional) | MiniLM (80 MB, CPU/GPU) embeds query + candidates, blends 30% cosine sim with 70% BM25 | ~50 ms |
Every prediction comes back with a why field and a full breakdown, so you can audit why a file was suggested — no opaque ranking.
💡 The semantic rerank stage is ON by default. To run lean (CI / containers without PyTorch), set MCP_BRAIN_SEMANTIC=0 and the pipeline degrades gracefully to BM25 + graph.
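Stages 2 and 3 can be sketched in a few lines. The example below assumes the seed scores from stage 1 are already normalized to [0, 1] and that the code graph is a plain adjacency dictionary. The function names `expand` and `blend` are hypothetical; only the decay factor and the 30/70 blend come from the table above.

```python
from collections import deque

HOP_DECAY = 0.5          # from the table: x0.5 at hop 1, x0.25 at hop 2
SEMANTIC_WEIGHT = 0.30   # from the table: 30% cosine, 70% BM25


def expand(seeds: dict[str, float], graph: dict[str, list[str]], max_hops: int = 2) -> dict[str, float]:
    """Breadth-first walk from the seed files; each hop halves the inherited score."""
    scores = dict(seeds)
    queue = deque((path, score, 0) for path, score in seeds.items())
    while queue:
        path, score, hop = queue.popleft()
        if hop == max_hops:
            continue
        for neighbour in graph.get(path, []):
            candidate = score * HOP_DECAY
            # Keep the best score seen for a file and only then keep walking.
            if candidate > scores.get(neighbour, 0.0):
                scores[neighbour] = candidate
                queue.append((neighbour, candidate, hop + 1))
    return scores


def blend(lexical: float, cosine: float) -> float:
    """Combine a lexical score with a cosine similarity, both assumed in [0, 1]."""
    return (1 - SEMANTIC_WEIGHT) * lexical + SEMANTIC_WEIGHT * cosine


graph = {
    "src/auth.py": ["src/middleware.py"],
    "src/middleware.py": ["src/app.py"],
}
scores = expand({"src/auth.py": 1.0}, graph)
for path, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(path, score)
```

With the semantic stage disabled, the ranking would simply use the scores returned by `expand`, which matches the graceful degradation described above.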
Memories aren't immortal. mcp-brain assumes you'll change your mind and bakes the lifecycle in:
- Age-based decay — after SUSPECT_DAYS a memory gets flagged for re-verification. After STALE_DAYS it's hidden from prompts.
- Semantic supersession — write a new memory similar (cosine ≥ 0.85) to an old one and the old one is auto-marked superseded.
- Feedback loop — when a memory is shown 3+ times before a reverted ticket, it gets demoted automatically. Noisy memories die fast.
This is what makes mcp-brain safe to leave running for months without manual cleanup. The L1 stays small and trustworthy; the L3 archives the audit trail.
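The three lifecycle rules reduce to three small predicates. This is a sketch under stated assumptions: the day counts for SUSPECT_DAYS and STALE_DAYS are placeholders (the README names the settings but not their defaults), embeddings are plain lists of floats, and all function names are hypothetical.

```python
import math
from datetime import date

SUSPECT_DAYS = 30         # placeholder value, not the project's default
STALE_DAYS = 90           # placeholder value, not the project's default
SUPERSEDE_COSINE = 0.85   # threshold stated in the README
DEMOTE_AFTER_SHOWN = 3    # "shown 3+ times before a reverted ticket"


def age_status(created: date, today: date) -> str:
    """Age-based decay: active, then suspect, then stale."""
    age = (today - created).days
    if age >= STALE_DAYS:
        return "stale"
    if age >= SUSPECT_DAYS:
        return "suspect"
    return "active"


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def supersedes(new_vec: list[float], old_vec: list[float]) -> bool:
    """Semantic supersession: a sufficiently similar new memory replaces the old one."""
    return cosine(new_vec, old_vec) >= SUPERSEDE_COSINE


def should_demote(times_shown_before_revert: int) -> bool:
    """Feedback loop: demote a memory repeatedly shown ahead of reverted tickets."""
    return times_shown_before_revert >= DEMOTE_AFTER_SHOWN


print(age_status(date(2024, 1, 1), date(2024, 2, 15)))
print(supersedes([1.0, 0.0], [1.0, 0.1]))
print(should_demote(3))
```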
flowchart TB
subgraph Client
CC[Claude Code]
end
subgraph Server[mcp-brain server]
T[MCP Tools layer<br/>brain_init, brain_get_context,<br/>brain_predict_files, ...]
R[Retriever<br/>+ Compressor]
P[File Predictor<br/>BM25 + Graph + Semantic]
F[Feedback Reconciler]
O[Observability<br/>p50/p95/p99]
end
subgraph Storage[Local storage ~/.mcp-brain/]
DB[(SQLite<br/>memories, sessions,<br/>projects, feedback)]
IDX[Inverted Index<br/>BM25]
G[Code Graph<br/>imports/calls]
Y[YAML claims]
end
CC <-->|MCP/stdio| T
T --> R
T --> P
T --> F
T --> O
R --> DB
P --> IDX
P --> G
F --> DB
O --> DB
mcp-brain/
├── src/
│ ├── brain/ # core logic: retriever, compressor, scorer, predictor
│ │ # code_graph, file_indexer, semantic_reranker,
│ │ # staleness, similarity, feedback loop, observability
│ ├── capture/ # git hook signal extraction
│ ├── storage/ # SQLite layer
│ └── tools/ # MCP tool definitions (FastMCP)
├── benchmark/ # SWE-bench Lite/Full, Bench4BL, BugLocator harness
├── tests/ # pytest suite (predictor, feedback, observability, ...)
└── assets/ # SVG diagrams used in this README
We benchmark file localization — given a real GitHub issue, can mcp-brain rank the production files the accepted patch actually modified?
- 2294 real Python bug-fix tasks from major OSS projects (astropy, django, flask, matplotlib, pandas, pytest, requests, scikit-learn, sphinx, sympy, xarray)
- Ground truth = files modified in the accepted reference patch (test files excluded by default — strict production-file evaluation)
| Metric | @1 | @3 | @5 | @10 |
|---|---|---|---|---|
| Hit | 24.5% | 43.4% | 53.7% | 63.4% |
| Recall | 20.1% | 36.6% | 46.1% | 55.8% |
| MAP | 24.5% | 28.4% | 30.4% | 31.8% |
- Instances evaluated: 2294
- Errors: 5 (0.2% failure rate)
- Avg gold files per issue: 1.66
- Avg predicted files: 9.98 (top-10)
| System | Hit@10 (file loc.) | Cost per query | Notes |
|---|---|---|---|
| BM25 baseline (vanilla) | ~45–55% | free | symbol search only |
| mcp-brain v1.4.0 | 63.4% | free | BM25 + graph + semantic, zero LLM |
| Agentless / SWE-agent | ~70–85% | $0.10–$2 | LLM-based, multi-step |
Reading the numbers:
- Hit@5 = 53.7% → in more than half of real issues, the right production file is in top-5 before Claude reads a single byte.
- Hit@10 = 63.4% → expanded to top-10, almost 2 issues out of 3 have the right file ranked.
- MAP@1 = 24.5% → the very first prediction is dead-on for 1 issue out of 4.
- 0.2% error rate over 2294 runs → robust pipeline.
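For readers unfamiliar with the metrics, the sketch below computes Hit@k, Recall@k, and average precision at k for a single issue; averaging each over all issues gives the table rows. The function names are hypothetical and the harness may differ. In particular, normalizing average precision by min(|gold|, k) is an assumption, chosen because it makes MAP@1 equal Hit@1, as it does in the table.

```python
def hit_at_k(ranked: list[str], gold: set[str], k: int) -> bool:
    """True if at least one gold file appears in the top k predictions."""
    return any(path in gold for path in ranked[:k])


def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold files that appear in the top k predictions."""
    return len(set(ranked[:k]) & gold) / len(gold)


def average_precision_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Mean of precision values at each rank where a gold file is found."""
    hits = 0
    total = 0.0
    for rank, path in enumerate(ranked[:k], start=1):
        if path in gold:
            hits += 1
            total += hits / rank
    return total / min(len(gold), k)


# Toy issue: two gold files, found at ranks 2 and 4.
ranked = ["a.py", "b.py", "c.py", "d.py"]
gold = {"b.py", "d.py"}

print(hit_at_k(ranked, gold, 1), hit_at_k(ranked, gold, 3))
print(recall_at_k(ranked, gold, 3))
print(average_precision_at_k(ranked, gold, 4))
```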
# One-time online setup
pip install -e .
pip install -r benchmark/requirements-benchmark.txt
python -m benchmark.adapters.swebench --dataset-name princeton-nlp/SWE-bench \
--output benchmark/datasets/cache/swebench_full.jsonl
python -m benchmark.prepare_repos \
--dataset benchmark/datasets/cache/swebench_full.jsonl \
--repo-cache benchmark/repos
# Offline evaluation (full)
python -m benchmark.run_eval \
--dataset benchmark/datasets/cache/swebench_full.jsonl \
--repo-cache benchmark/repos \
--out benchmark/results/swebench_full.json \
--report-dir benchmark/reports \
--top-k 10 --max-hops 2 --use-semantic
Reports are emitted as Markdown + HTML in benchmark/reports/.
The harness also supports SWE-bench Lite (300 instances), SWE-bench Verified, Bench4BL, and BugLocator — see benchmark/README.md.
A typical Claude Code session without mcp-brain spends thousands of tokens just to orient itself:
| Phase (no mcp-brain) | Action | ~Tokens |
|---|---|---|
| Session start | List directory, read README, sample files | 800–2000 |
| Issue handling | Grep symbols, follow imports, retry wrong files | 1000–3000 |
| Context restore | Re-explain project conventions | 200–500 |
| Total per session | | 2000–5500 |
A session with mcp-brain:
| Phase (with mcp-brain) | Action | ~Tokens |
|---|---|---|
| Session start | brain_get_context returns compressed L1 YAML | ~100 |
| Issue handling | brain_predict_files returns ranked top-K + why | ~250 |
| Decision recall | brain_get_decisions (only when needed) | ~300 |
| Total per session | | ~650 |
Without With mcp-brain Saving
Session start: 2000 ─────────► 100 tokens ~95%
Per session: 2000–5500 ──► 450–950 tokens 40–80%
Per developer*: ~1.2M/month ──► ~400k/month ~65%
*assuming 100 sessions/month/dev
- ✅ No embeddings required for retrieval (BM25 + code graph)
- ✅ No vector DB to query (zero round-trip cost)
- ✅ No history replay — context is reconstructed, not re-scrolled
- ✅ YAML compression with default_flow_style=True and empty-key stripping
- ✅ L1/L2 split — heavy memory only loaded on demand
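The YAML compression item can be illustrated as follows. This is a sketch, not the project's compressor: `strip_empty` is a hypothetical helper, and the JSON fallback exists only so the example runs when PyYAML is not installed. `yaml.safe_dump` and its `default_flow_style` parameter are part of PyYAML.

```python
import json


def strip_empty(value):
    """Recursively drop dictionary keys whose value is None, '', [] or {}."""
    if isinstance(value, dict):
        cleaned = {key: strip_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in (None, "", [], {})}
    if isinstance(value, list):
        return [strip_empty(item) for item in value]
    return value


context = {
    "p": {"name": "my-api", "stack": ["FastAPI", "PostgreSQL"]},
    "s": {"branch": "feat/auth", "wip": "JWT refactor", "next": ""},
    "team_claims": [],
    "avoid": ["HS256"],
}

compact_context = strip_empty(context)

try:
    import yaml  # PyYAML, optional for this sketch

    # Flow style writes mappings and lists inline, which saves tokens.
    text = yaml.safe_dump(compact_context, default_flow_style=True, sort_keys=False)
except ImportError:
    # Fallback used only to keep the example runnable without PyYAML.
    text = json.dumps(compact_context, separators=(",", ":"))

print(text)
```

Dropping empty keys before serializing means the model never pays tokens for fields that carry no information.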
💡 The semantic reranker (use_semantic=True) is on by default and runs locally on CPU/GPU. It does not add LLM cost. Disable with MCP_BRAIN_SEMANTIC=0 for lean CI.
git clone https://github.com/PierfrancescoLijoi/mcp-brain.git
cd mcp-brain
pip install -e ".[all]"
The [all] extra installs:
- language parsers (Python, JS, TS, Go, Rust, Java, C#) for the code graph
- semantic reranker (sentence-transformers + numpy)
- dev tooling (pytest, pytest-cov)
If you want a smaller footprint, you can pick exactly what you need:
pip install -e . # core only — BM25 + graph (no semantic, no parsers)
pip install -e ".[parsers]" # + multi-language parsers
pip install -e ".[semantic]" # + semantic reranker (~700 MB w/ PyTorch)
pip install -e ".[dev]"        # + dev tooling
Register the server with Claude Code:
claude mcp add mcp-brain python /absolute/path/to/run.py
On Windows PowerShell:
claude mcp add mcp-brain python "C:\path\to\mcp-brain\run.py"
Then run the init command:
mcp-brain init
That's it. Open Claude Code in your repo and the L1 context is automatically available via brain_get_context.
| Tool | Purpose | When Claude calls it |
|---|---|---|
| brain_init | Register project, stack, conventions | Once per repo |
| brain_get_context | Load L1 context (~70 tokens) | Every session start |
| brain_get_decisions | Load L2 decisions on demand | When historical context needed |
| brain_remember | Store a memory; level auto-assigned | When user makes a decision |
| brain_save_session | Save end-of-session snapshot | At session end |
| brain_predict_files | Issue → ranked file list with why | When opening a ticket |
| brain_start_ticket | Start ticket workflow + conflict check | Workflow orchestration |
| brain_record_outcome | Log ticket outcome (completed/reverted/...) | After ticket closed |
| brain_feedback_stats | Precision/recall window | Health checks |
| brain_memory_health | Surface noisy memories | Debugging |
| brain_observability | Full unified dashboard (YAML) | Ops / CI |
p: {name: my-api, stack: [FastAPI, PostgreSQL]}
s: {branch: feat/auth, wip: "JWT refactor", next: "add refresh token"}
git:
  recent: ["refactor: JWT moved to RS256"]
  changed: [auth.py, middleware.py]
team_claims:
  - {ticket: 42, author: dev-B, files: [middleware.py]}
avoid:
  - "avoid: HS256 — vulnerable to key confusion"
decisions:
  - "decision: tokens stored httpOnly cookie, never localStorage"
👉 Claude already knows where to act before reading a single source file.
- Cuts session-start exploration: −90% tokens on the first turn
- Remembers your "I always do it this way" patterns
- Auto-supersedes decisions when you change your mind
- Conflict detection before two devs touch the same files
- Shared decision log with lifecycle (no more "wait, didn't we decide…?")
- File ownership inference from git history
- Local-first, no data leaves the machine → GDPR / SOC2-friendly
- Compatible with Managed Identity / on-prem deployments (no cloud calls)
- Token saving compounds: 65% × 100 devs × 100 sessions/month → measurable infra savings
Is this a RAG system or a vector DB?
No, and on purpose. mcp-brain is a structured awareness layer, not a retrieval-over-embeddings layer. The core retrieval is BM25 + code graph expansion — fully deterministic, sub-100ms, no vector DB to maintain. The semantic reranker is an optional 30% blend on top, used only as a tiebreaker. This is why token cost stays predictable and infra is local-first.
Why not just use Claude's native context window? It's huge now.
A long context window doesn't fix the problem — it makes it cheaper to waste. The bottleneck isn't capacity, it's signal density. Pasting your whole repo into the context still leaves Claude searching for the right file linearly. mcp-brain pre-ranks reality so the model spends its attention on the right 3 files, not the wrong 30.
Will it leak my code or memories anywhere?
No. Storage is SQLite under ~/.mcp-brain/ (local) and <repo>/.brain/shared/ (versioned with git if you choose). No outbound network calls, no telemetry, no cloud component. The semantic model runs on your CPU/GPU. This makes mcp-brain compatible with GDPR-restricted and air-gapped environments.
What if I disagree with a decision mcp-brain remembers?
Write a new memory that contradicts it. Semantic supersession (cosine ≥ 0.85) will auto-mark the old one as superseded. You can also manually demote via brain_memory_health or wait for age-based decay (SUSPECT_DAYS / STALE_DAYS). The lifecycle assumes you'll change your mind.
Does it work with languages other than Python?
Yes for indexing/predicting (BM25 is language-agnostic). The code graph currently supports Python, JavaScript, TypeScript, Go, Rust, Java, C# via tree-sitter parsers. Adding a new language is a single registry entry — see src/brain/parsers.py.
How does it compare to SWE-agent / Aider / Cursor?
Different layer of the stack. SWE-agent and similar tools are autonomous coders — they read, plan, and patch via LLM calls. mcp-brain is the awareness layer underneath them. You could pair it with Aider or any MCP-compatible client; it makes whatever LLM you use start from a smarter zero.
What's the catch?
Honest answer: file prediction is heuristic. Hit@1 = 24.5% means 3 issues out of 4 still need Claude to validate the prediction before acting. mcp-brain orients, it doesn't replace exploration. That's also why it's free — it's a force multiplier, not an oracle.
I'm honest about what this is and isn't.
| Strength | Limitation |
|---|---|
| ✅ Zero LLM cost for retrieval | |
| ✅ Sub-100ms predictions | |
| ✅ Local-first, no cloud | .brain/shared/) |
| ✅ Deterministic (replays produce same output) | |
| ✅ Works on any size repo | |
This is NOT:
- ❌ a vector DB memory
- ❌ a RAG system
- ❌ an SWE-agent / autonomous coder
- ❌ a checkpoint / replay tool
This IS:
- ✅ a repo-aware, team-aware, token-efficient awareness layer
- ✅ a force multiplier for Claude Code, not a replacement
- BM25 + code graph + semantic reranker
- Decision lifecycle with semantic supersession
- Feedback loop with precision/recall reconciliation
- Observability dashboard
- SWE-bench Full benchmark (2294 instances)
- Multi-language code graph (Python, JS, TS, Go, Rust, Java, C#)
- Cross-repo memory federation (opt-in)
- Real-time conflict push (currently pull-based)
- VS Code extension companion
- Hosted shared .brain/ for distributed teams (still local-first per dev)
pip install -e ".[dev]"
pytest tests/ -v
Expected: full pass on Python 3.10, 3.11, 3.12.
PRs welcome. Before opening one:
- pytest tests/ -v must pass
- New behavior needs new tests
- New MCP tools must be wrapped with @observed("brain_<name>")
- Avoid heavy dependencies for the default install path — anything ML-flavored goes behind an optional extra
MIT — see LICENSE.
Built for Claude Code — but the architecture is MCP-standard, so any MCP-compatible client works.
If mcp-brain saved you tokens, ⭐ the repo. That's the only payment I ask for.