mcp-brain

SWE-bench Hit@10 · Token saving · Zero LLM cost · Local-first · MIT License · Python 3.10+

The repo-aware, team-aware, token-efficient memory layer for Claude Code.

Claude Code doesn't fail because it lacks intelligence.
It fails because it has zero awareness of your repo and your team.


🚀 TL;DR

mcp-brain is a Model Context Protocol (MCP) server that gives Claude Code persistent, structured awareness of your project — without burning tokens on context rebuilding.

🧠 Compressed awareness in ~100 tokens instead of ~2000
🎯 63.4% Hit@10 on SWE-bench Full (2294 real GitHub issues) — zero LLM cost
Sub-100ms file prediction (BM25 + code graph + optional semantic reranker)
👥 Team-aware: soft claims, conflict detection, ownership tracking
🔄 Self-healing: decision lifecycle, automatic staleness, feedback loop
🛡️ Local-first: SQLite, no cloud, no embeddings required, GDPR-friendly


🚨 The Problem

Workflow comparison: without mcp-brain Claude explores blindly; with mcp-brain Claude starts from structured repo and team awareness

Without persistent awareness, Claude Code operates blindly at the start of every session:

| Without mcp-brain | With mcp-brain |
|---|---|
| ❌ No idea which files matter | ✅ Predicted files in top-K |
| ❌ Re-explores the repo every session | ✅ Compressed context in ~100 tokens |
| ❌ No visibility into teammates' WIP | ✅ Soft claims + conflict detection |
| ❌ Acts on outdated decisions | ✅ Decision lifecycle (active → stale) |
| ❌ Burns 2000–5000 tokens just to "orient" | ✅ One YAML block, ready to act |

Result without mcp-brain: wrong file exploration → outdated suggestions → merge conflicts → massive token waste.


⚡ What mcp-brain Changes

┌──────────────────────────────────────────────────────┐
│                                                      │
│   Without:  Claude → explores → guesses → retries    │
│             → conflicts → high token usage           │
│                                                      │
│   With:     Claude → predicts → verifies → acts      │
│             → aligned → low token usage              │
│                                                      │
└──────────────────────────────────────────────────────┘

🧬 Core idea

Instead of giving Claude more context, we give it structured awareness of reality.

We track:

  • 📌 what changed (signal extraction from git)
  • 🎯 what matters (scoring + lifecycle)
  • 👥 who's working on what (team claims)
  • 🧭 where to act (issue → file prediction)

…and we deliver it in ~100 tokens.


⏱️ In 60 seconds

You drop a one-line ticket into Claude Code:

> work on ticket #42 — JWT login broken

Without mcp-brain, Claude starts grep-walking the repo, reading directory listings, opening README, sampling files — burning 2000+ tokens before producing the first useful sentence.

With mcp-brain, in <100ms Claude receives:

predictions:
  - file: src/auth.py
    confidence: high
    why: "path + symbol match: login, jwt"
  - file: src/middleware.py
    confidence: medium
    why: "imports auth (hop 1)"
  - file: src/jwt_utils.py
    confidence: medium
    why: "called_by auth.login"
team_claims:
  - { ticket: 39, author: dev-B, files: [middleware.py] }   # ⚠️ overlap
avoid:
  - "HS256 — vulnerable to key confusion. Migrated to RS256 in commit a1b2c3."
decisions:
  - "tokens stored in httpOnly cookie, never localStorage"

It's structured reality, not regenerated context. Claude can act on the first turn.
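The `⚠️ overlap` flag in the YAML above can be read as a set intersection between predicted files and files claimed by teammates. The sketch below is a minimal illustration of that idea, not mcp-brain's actual code: the `Claim` class and `find_overlaps` function are hypothetical names, and matching on the file name alone is an assumption made because the example claim lists `middleware.py` while the prediction lists `src/middleware.py`.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Claim:
    """Hypothetical record for one soft claim made by a teammate."""
    ticket: int
    author: str
    files: tuple[str, ...]


def find_overlaps(predicted: list[str], claims: list[Claim]) -> list[tuple[Claim, list[str]]]:
    """Return (claim, overlapping predicted files) for every claim that collides."""
    overlaps = []
    for claim in claims:
        # Assumption: claims may store bare file names while predictions carry
        # repo-relative paths, so only the final path component is compared.
        claimed_names = {PurePosixPath(f).name for f in claim.files}
        hits = [p for p in predicted if PurePosixPath(p).name in claimed_names]
        if hits:
            overlaps.append((claim, hits))
    return overlaps


predicted = ["src/auth.py", "src/middleware.py", "src/jwt_utils.py"]
claims = [Claim(ticket=39, author="dev-B", files=("middleware.py",))]

for claim, files in find_overlaps(predicted, claims):
    print(f"overlap with ticket {claim.ticket} ({claim.author}): {files}")
```

Comparing by file name keeps the sketch short but would produce false positives in repos with many files of the same name; comparing normalized repo-relative paths would be stricter.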


🔑 How It Works

flowchart TD
    subgraph Capture[Capture signals]
        A[Git commit] -->|filtered signals| B[mcp-brain memory]
        C[Session end] -->|structured snapshot| B
    end

    subgraph Predict[Predict where to act]
        E[Ticket opened] --> F[File predictor]
        F -->|top-K files + confidence + why| D[Claude Code]
    end

    subgraph Coordinate[Coordinate team work]
        F -->|overlap check| G[Team claims]
        G -->|conflict warnings| D
    end

    subgraph Learn[Learn from outcomes]
        H[Outcome recorded] -->|precision / recall| I[Feedback loop]
        I -->|demote noisy memories| B
        I -->|supersede stale decisions| B
    end

    B -->|~100-token YAML context| D
  1. Capture — git hooks promote only high-signal events (decisions, patterns, things to avoid). Ignored: docs, chore, tests, CI noise.
  2. Compress — three-level memory (L1/L2/L3) auto-assigned by a scoring function (recency 35% + frequency 30% + impact 20% + explicit 15%).
  3. Predict — issue title/body → ranked file list via BM25 + code graph expansion + optional semantic reranker.
  4. Coordinate — soft claims warn before two devs touch the same files.
  5. Self-correct — every closed ticket feeds precision/recall stats; noisy memories are auto-demoted.
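As a rough illustration of the capture step, the filter below drops the commit types listed as noise in step 1. It assumes commits follow a Conventional Commits style prefix (`type(scope): subject`); the function name `is_high_signal`, the regex, and the handling of unprefixed commits are all assumptions, and the real hook may use different rules.

```python
import re

# Assumption: commit messages look like "type(scope): subject".
# The ignored types mirror the noise list in step 1 above.
IGNORED_TYPES = {"docs", "chore", "test", "tests", "ci"}
COMMIT_RE = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?:\s*(?P<subject>.+)$")


def is_high_signal(message: str) -> bool:
    """Return True when the first line of a commit message is worth remembering."""
    lines = message.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = COMMIT_RE.match(first_line)
    if match is None:
        # Unprefixed commits are kept in this sketch; a stricter hook could drop them.
        return bool(first_line)
    return match.group("type") not in IGNORED_TYPES


for msg in ["refactor: JWT moved to RS256", "docs: fix typo", "ci(actions): bump runner"]:
    print(f"{msg!r:40} -> {is_high_signal(msg)}")
```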

🧠 Memory Hierarchy

Three-level memory hierarchy: L1 hot context, L2 warm context, L3 cold archive

Memories aren't dumped into one bag. They're scored and tiered, so the high-token slot in your prompt only carries what's signal-dense for this moment:

  • L1 — hot context loads automatically every session. Stack, conventions, current branch, recent commits, team claims, active high-confidence decisions. Capped at ~70 tokens.
  • L2 — warm context loads only on demand (brain_get_decisions). Historical reasoning, superseded patterns, the why behind a past trade-off.
  • L3 — cold archive is never sent to the model. Kept for audit, transparency, and the lifecycle's "undo" path.

The score is a transparent linear formula — no black-box embedding similarity. Every memory's level is reproducible and explainable.
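A linear score of this kind could look like the sketch below. Only the four weights come from this README; the `Signals` class, the assumption that each signal is already normalized to the range 0 to 1, and the two tier thresholds are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass

# Weights quoted in "How It Works"; everything else below is an assumption.
WEIGHTS = {"recency": 0.35, "frequency": 0.30, "impact": 0.20, "explicit": 0.15}

# Hypothetical cut-offs: the README does not say where the tier boundaries sit.
L1_THRESHOLD = 0.70
L2_THRESHOLD = 0.40


@dataclass
class Signals:
    recency: float    # 1.0 = touched today, 0.0 = very old
    frequency: float  # 1.0 = referenced constantly
    impact: float     # 1.0 = affects core files
    explicit: float   # 1.0 = the user explicitly asked to remember it


def score(signals: Signals) -> float:
    """Weighted linear sum of the four signals, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, getattr(signals, name)))
        total += weight * value
    return total


def assign_level(signals: Signals) -> str:
    """Map a score to a memory tier using the hypothetical thresholds."""
    s = score(signals)
    if s >= L1_THRESHOLD:
        return "L1"
    if s >= L2_THRESHOLD:
        return "L2"
    return "L3"


fresh_decision = Signals(recency=1.0, frequency=0.6, impact=0.8, explicit=1.0)
print(round(score(fresh_decision), 2), assign_level(fresh_decision))
```

Because the score is a plain weighted sum, the contribution of each signal can be read off directly, which is what makes a level assignment explainable.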


🔍 Prediction Pipeline

Prediction pipeline: BM25 and IDF scoring, graph expansion, optional semantic rerank

The predictor is three deterministic stages:

| Stage | What it does | Cost |
|---|---|---|
| 1. BM25 + IDF | Tokenize issue, match against symbols / identifiers / paths in an inverted index | ~5 ms |
| 2. Graph expansion | Walk imports / imported_by / called_by from seeds. Score decays per hop (×0.5, ×0.25) | ~10 ms |
| 3. Semantic rerank (optional) | MiniLM (80 MB, CPU/GPU) embeds query + candidates, blends 30% cosine sim with 70% BM25 | ~50 ms |
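For readers unfamiliar with stage 1, here is a compact Okapi BM25 ranker. It is a generic textbook sketch, not the project's indexer: it scans every document instead of using an inverted index, the `k1` and `b` values are common defaults rather than values taken from mcp-brain, and the tokenizer and the toy `docs` mapping are assumptions.

```python
import math
import re
from collections import Counter

K1, B = 1.5, 0.75  # common BM25 defaults, not values taken from mcp-brain


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def bm25_rank(query: str, docs: dict[str, str], top_k: int = 10) -> list[tuple[str, float]]:
    """Rank documents (path -> indexed text) against the query with Okapi BM25."""
    tokens = {path: tokenize(text) for path, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / max(n_docs, 1)
    # Document frequency: in how many files each term appears.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    query_terms = set(tokenize(query))

    scores = {}
    for path, toks in tokens.items():
        tf = Counter(toks)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(toks) / avg_len)
            total += idf * tf[term] * (K1 + 1) / norm
        if total > 0:
            scores[path] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy index: each path maps to the symbols and path parts extracted from it.
docs = {
    "src/auth.py": "src auth login jwt verify_token",
    "src/middleware.py": "src middleware import auth request",
    "src/billing.py": "src billing invoice stripe",
}
print(bm25_rank("JWT login broken", docs))
```

Only `src/auth.py` contains any of the query terms here, so it is the single seed handed to the next stage.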

Every prediction comes back with a why field and a full breakdown, so you can audit why a file was suggested — no opaque ranking.
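Stage 2 and the `why` field can be sketched together as a breadth-first walk in which each hop halves the inherited score. This is an illustration under assumptions: the `expand` function, the shape of the `graph` mapping, and the wording of the `why` strings are hypothetical, and only the decay factors (×0.5, ×0.25) come from the table above.

```python
from collections import deque

HOP_DECAY = 0.5  # per-hop multiplier: x0.5 at hop 1, x0.25 at hop 2


def expand(seeds: dict[str, float],
           graph: dict[str, list[tuple[str, str]]],
           max_hops: int = 2) -> dict[str, dict]:
    """Breadth-first walk from BM25 seeds; neighbours inherit a decayed score.

    graph maps a file to (neighbour, relation) edges,
    for example "src/auth.py" -> [("src/middleware.py", "imported_by")].
    """
    results = {f: {"score": s, "why": "BM25 seed"} for f, s in seeds.items()}
    queue = deque((f, s, 0) for f, s in seeds.items())
    while queue:
        node, seed_score, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbour, relation in graph.get(node, []):
            candidate = seed_score * HOP_DECAY ** (hops + 1)
            best = results.get(neighbour)
            # Keep only the strongest path to each file; this also stops cycles.
            if best is None or candidate > best["score"]:
                results[neighbour] = {
                    "score": candidate,
                    "why": f"{relation} edge from {node} (hop {hops + 1})",
                }
                queue.append((neighbour, seed_score, hops + 1))
    return results


graph = {
    "src/auth.py": [("src/middleware.py", "imported_by"), ("src/jwt_utils.py", "calls")],
    "src/middleware.py": [("src/app.py", "imported_by"), ("src/auth.py", "imports")],
}
for path, info in expand({"src/auth.py": 8.0}, graph).items():
    print(path, info)
```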

💡 The semantic rerank stage is on by default. To run lean (CI / containers without PyTorch), set MCP_BRAIN_SEMANTIC=0 and the pipeline degrades gracefully to BM25 + graph.
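The 70/30 blend and the `MCP_BRAIN_SEMANTIC=0` fallback can be sketched as follows. The weights and the environment variable come from this README; the function names, the max-normalization of BM25 scores, and the toy vectors standing in for MiniLM embeddings are assumptions.

```python
import math
import os


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_enabled() -> bool:
    # MCP_BRAIN_SEMANTIC=0 switches the reranker off; anything else leaves it on.
    return os.environ.get("MCP_BRAIN_SEMANTIC", "1") != "0"


def blend(bm25_scores: dict[str, float],
          query_vec: list[float],
          doc_vecs: dict[str, list[float]]) -> dict[str, float]:
    """Blend 70% normalised BM25 with 30% cosine similarity."""
    # Assumption: BM25 is unbounded, so scale it by the top score before blending.
    top = max(bm25_scores.values(), default=0.0) or 1.0
    normalised = {f: s / top for f, s in bm25_scores.items()}
    if not semantic_enabled():
        return normalised  # lean mode: lexical + graph scores only
    return {f: 0.7 * normalised[f] + 0.3 * cosine(query_vec, doc_vecs[f])
            for f in normalised}


# Toy 2-d vectors stand in for real sentence embeddings.
scores = {"src/auth.py": 2.0, "src/middleware.py": 1.0}
vectors = {"src/auth.py": [0.0, 1.0], "src/middleware.py": [1.0, 0.0]}
print(blend(scores, [1.0, 0.0], vectors))
```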


🔄 Decision Lifecycle

Decision lifecycle: active memories become suspect, stale, or superseded over time and through feedback

Memories aren't immortal. mcp-brain assumes you'll change your mind and bakes the lifecycle in:

  • Age-based decay — after SUSPECT_DAYS a memory gets flagged for re-verification. After STALE_DAYS it's hidden from prompts.
  • Semantic supersession — write a new memory similar (cosine ≥ 0.85) to an old one and the old one is auto-marked superseded.
  • Feedback loop — when a memory is shown 3+ times before a reverted ticket, it gets demoted automatically. Noisy memories die fast.

This is what makes mcp-brain safe to leave running for months without manual cleanup. The L1 stays small and trustworthy; the L3 archives the audit trail.
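The three lifecycle rules reduce to a few threshold checks, sketched below. The names `SUSPECT_DAYS` and `STALE_DAYS`, the 0.85 cosine threshold, and the "3+ times" rule come from this README; the numeric day values and the helper function names are hypothetical, since the defaults are not stated here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values: the README names these settings but not their defaults.
SUSPECT_DAYS = 30
STALE_DAYS = 90
SUPERSEDE_THRESHOLD = 0.85  # cosine threshold quoted above
DEMOTE_AFTER_SHOWN = 3      # "shown 3+ times before a reverted ticket"


def age_status(created_at: datetime, now: datetime) -> str:
    """Age-based decay: active -> suspect -> stale."""
    age = now - created_at
    if age >= timedelta(days=STALE_DAYS):
        return "stale"    # hidden from prompts
    if age >= timedelta(days=SUSPECT_DAYS):
        return "suspect"  # flagged for re-verification
    return "active"


def is_superseded(similarity_to_newer: float) -> bool:
    """Semantic supersession: a newer, very similar memory replaces this one."""
    return similarity_to_newer >= SUPERSEDE_THRESHOLD


def should_demote(times_shown_before_revert: int) -> bool:
    """Feedback loop: memories repeatedly shown before reverted tickets get demoted."""
    return times_shown_before_revert >= DEMOTE_AFTER_SHOWN


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
for days in (10, 45, 120):
    print(days, "days old ->", age_status(now - timedelta(days=days), now))
```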


🏗️ Architecture

mcp-brain architecture: Claude Code talks to the MCP tools layer, which uses memory retrieval, file prediction, feedback, observability, and local SQLite storage

flowchart TB
    subgraph Client
        CC[Claude Code]
    end
    subgraph Server[mcp-brain server]
        T[MCP Tools layer<br/>brain_init, brain_get_context,<br/>brain_predict_files, ...]
        R[Retriever<br/>+ Compressor]
        P[File Predictor<br/>BM25 + Graph + Semantic]
        F[Feedback Reconciler]
        O[Observability<br/>p50/p95/p99]
    end
    subgraph Storage[Local storage ~/.mcp-brain/]
        DB[(SQLite<br/>memories, sessions,<br/>projects, feedback)]
        IDX[Inverted Index<br/>BM25]
        G[Code Graph<br/>imports/calls]
        Y[YAML claims]
    end
    CC <-->|MCP/stdio| T
    T --> R
    T --> P
    T --> F
    T --> O
    R --> DB
    P --> IDX
    P --> G
    F --> DB
    O --> DB

Repo layout

mcp-brain/
├── src/
│   ├── brain/         # core logic: retriever, compressor, scorer, predictor
│   │                  # code_graph, file_indexer, semantic_reranker,
│   │                  # staleness, similarity, feedback loop, observability
│   ├── capture/       # git hook signal extraction
│   ├── storage/       # SQLite layer
│   └── tools/         # MCP tool definitions (FastMCP)
├── benchmark/         # SWE-bench Lite/Full, Bench4BL, BugLocator harness
├── tests/             # pytest suite (predictor, feedback, observability, ...)
└── assets/            # SVG diagrams used in this README

📊 Benchmark Results

SWE-bench Full benchmark results: Hit@K, Recall@K, MAP@K, and comparison vs literature

We benchmark file localization: given a real GitHub issue, can mcp-brain rank the production files the accepted patch actually modified?

Dataset: SWE-bench Full

  • 2294 real Python bug-fix tasks from major OSS projects (astropy, django, flask, matplotlib, pandas, pytest, requests, scikit-learn, sphinx, sympy, xarray)
  • Ground truth = files modified in the accepted reference patch (test files excluded by default — strict production-file evaluation)

Results — mcp-brain v1.4.0 (BM25 + graph + semantic)

| Metric | @1 | @3 | @5 | @10 |
|---|---|---|---|---|
| Hit | 24.5% | 43.4% | 53.7% | 63.4% |
| Recall | 20.1% | 36.6% | 46.1% | 55.8% |
| MAP | 24.5% | 28.4% | 30.4% | 31.8% |
  • Instances evaluated: 2294
  • Errors: 5 (0.2% failure rate)
  • Avg gold files per issue: 1.66
  • Avg predicted files: 9.98 (top-10)
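For reference, the three metrics can be computed per issue as in the sketch below and then averaged over all instances. These are standard information-retrieval definitions, not code taken from the benchmark harness. In particular, normalizing average precision by `min(len(gold), k)` is an assumption; it is chosen because it makes AP@1 equal Hit@1, which is consistent with the identical @1 values for Hit and MAP in the results above. The sketch also assumes a non-empty gold set and no duplicate predictions.

```python
def hit_at_k(predicted: list[str], gold: set[str], k: int) -> float:
    """1.0 if at least one gold file appears in the top-k predictions."""
    return 1.0 if any(p in gold for p in predicted[:k]) else 0.0


def recall_at_k(predicted: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold files that appear in the top-k predictions."""
    return len(set(predicted[:k]) & gold) / len(gold)


def average_precision_at_k(predicted: list[str], gold: set[str], k: int) -> float:
    """Precision averaged over the ranks where a gold file is found."""
    hits, total = 0, 0.0
    for rank, path in enumerate(predicted[:k], start=1):
        if path in gold:
            hits += 1
            total += hits / rank
    # Assumption: normalise by min(|gold|, k) so that AP@1 equals Hit@1.
    return total / min(len(gold), k)


predicted = ["src/a.py", "src/b.py", "src/c.py"]
gold = {"src/b.py", "src/d.py"}
print(hit_at_k(predicted, gold, 3),
      recall_at_k(predicted, gold, 3),
      average_precision_at_k(predicted, gold, 3))
```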

Honest comparison vs. literature

| System | Hit@10 (file loc.) | Cost per query | Notes |
|---|---|---|---|
| BM25 baseline (vanilla) | ~45–55% | free | symbol search only |
| mcp-brain v1.4.0 | 63.4% | free | BM25 + graph + semantic, zero LLM |
| Agentless / SWE-agent | ~70–85% | $0.10–$2 | LLM-based, multi-step |

Reading the numbers:

  • Hit@5 = 53.7% → in more than half of real issues, the right production file is in top-5 before Claude reads a single byte.
  • Hit@10 = 63.4% → expanded to top-10, almost 2 issues out of 3 have the right file ranked.
  • MAP@1 = 24.5% → the very first prediction is dead-on for 1 issue out of 4.
  • 0.2% error rate over 2294 runs → robust pipeline.

Reproduce it yourself

# One-time online setup
pip install -e .
pip install -r benchmark/requirements-benchmark.txt
python -m benchmark.adapters.swebench --dataset-name princeton-nlp/SWE-bench \
  --output benchmark/datasets/cache/swebench_full.jsonl
python -m benchmark.prepare_repos \
  --dataset benchmark/datasets/cache/swebench_full.jsonl \
  --repo-cache benchmark/repos

# Offline evaluation (full)
python -m benchmark.run_eval \
  --dataset benchmark/datasets/cache/swebench_full.jsonl \
  --repo-cache benchmark/repos \
  --out benchmark/results/swebench_full.json \
  --report-dir benchmark/reports \
  --top-k 10 --max-hops 2 --use-semantic

Reports are emitted as Markdown + HTML in benchmark/reports/.

The harness also supports SWE-bench Lite (300 instances), SWE-bench Verified, Bench4BL, and BugLocator — see benchmark/README.md.


💰 Token Efficiency

Cost optimization: from 2000-5500 orientation tokens per session to roughly 650 tokens with mcp-brain

The math

A typical Claude Code session without mcp-brain spends thousands of tokens just to orient itself:

| Phase (no mcp-brain) | Action | ~Tokens |
|---|---|---|
| Session start | List directory, read README, sample files | 800–2000 |
| Issue handling | Grep symbols, follow imports, retry wrong files | 1000–3000 |
| Context restore | Re-explain project conventions | 200–500 |
| Total per session | | 2000–5500 |

A session with mcp-brain:

| Phase (with mcp-brain) | Action | ~Tokens |
|---|---|---|
| Session start | brain_get_context returns compressed L1 YAML | ~100 |
| Issue handling | brain_predict_files returns ranked top-K + why | ~250 |
| Decision recall | brain_get_decisions (only when needed) | ~300 |
| Total per session | | ~650 |

Estimated saving

                        Without          With mcp-brain     Saving
  Session start:    2000 ─────────►       100 tokens        ~95%
  Per session:      2000–5500 ──►       450–950 tokens      40–80%
  Per developer*:   ~1.2M/month ──►    ~400k/month          ~65%

*assuming 100 sessions/month/dev

Why this works

  • No embeddings required for retrieval (BM25 + code graph)
  • No vector DB to query (zero round-trip cost)
  • No history replay — context is reconstructed, not re-scrolled
  • YAML compression with default_flow_style=True and empty-key stripping
  • L1/L2 split — heavy memory only loaded on demand
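The YAML compression point can be illustrated with PyYAML's flow style plus a small recursive filter. This is a sketch under assumptions: `strip_empty` and `to_compact_yaml` are hypothetical names, and the exact rule for what counts as an "empty" key in mcp-brain is not documented here. `default_flow_style` and `sort_keys` are real `yaml.safe_dump` parameters.

```python
def strip_empty(value):
    """Recursively drop dict keys whose values are None, "", [] or {}."""
    if isinstance(value, dict):
        cleaned = {k: strip_empty(v) for k, v in value.items()}
        # 0 and False are kept: only truly empty values are removed.
        return {k: v for k, v in cleaned.items() if v not in (None, "", [], {})}
    if isinstance(value, list):
        return [strip_empty(v) for v in value]
    return value


def to_compact_yaml(context: dict) -> str:
    """Serialise the context in YAML flow style after removing empty keys."""
    import yaml  # PyYAML (third-party), imported lazily so strip_empty works without it

    return yaml.safe_dump(strip_empty(context), default_flow_style=True, sort_keys=False)


context = {
    "p": {"name": "my-api", "stack": ["FastAPI", "PostgreSQL"]},
    "s": {"branch": "feat/auth", "wip": "", "next": None},
    "team_claims": [],
}
print(strip_empty(context))
```

Flow style keeps the whole structure on few lines with minimal indentation, and stripping empty keys avoids spending tokens on fields that carry no information.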

💡 The semantic reranker (use_semantic=True) is on by default and runs locally on CPU/GPU. It does not add LLM cost. Disable with MCP_BRAIN_SEMANTIC=0 for lean CI.


🚀 Quick Start

Install — one command, batteries included

git clone https://github.com/PierfrancescoLijoi/mcp-brain.git
cd mcp-brain
pip install -e ".[all]"

The [all] extra installs:

  • language parsers (Python, JS, TS, Go, Rust, Java, C#) for the code graph
  • semantic reranker (sentence-transformers + numpy)
  • dev tooling (pytest, pytest-cov)

Lean install paths

If you want a smaller footprint, you can pick exactly what you need:

pip install -e .                      # core only — BM25 + graph (no semantic, no parsers)
pip install -e ".[parsers]"           # + multi-language parsers
pip install -e ".[semantic]"          # + semantic reranker (~700 MB w/ PyTorch)
pip install -e ".[dev]"               # + dev tooling

Register with Claude Code

claude mcp add mcp-brain python /absolute/path/to/run.py

On Windows PowerShell:

claude mcp add mcp-brain python "C:\path\to\mcp-brain\run.py"

Initialize your project

mcp-brain init

That's it. Open Claude Code in your repo and the L1 context is automatically available via brain_get_context.


🧠 MCP Tools

| Tool | Purpose | When Claude calls it |
|---|---|---|
| brain_init | Register project, stack, conventions | Once per repo |
| brain_get_context | Load L1 context (~70 tokens) | Every session start |
| brain_get_decisions | Load L2 decisions on demand | When historical context needed |
| brain_remember | Store a memory; level auto-assigned | When user makes a decision |
| brain_save_session | Save end-of-session snapshot | At session end |
| brain_predict_files | Issue → ranked file list with why | When opening a ticket |
| brain_start_ticket | Start ticket workflow + conflict check | Workflow orchestration |
| brain_record_outcome | Log ticket outcome (completed/reverted/...) | After ticket closed |
| brain_feedback_stats | Precision/recall window | Health checks |
| brain_memory_health | Surface noisy memories | Debugging |
| brain_observability | Full unified dashboard (YAML) | Ops / CI |

Example L1 context output (~100 tokens)

p: {name: my-api, stack: [FastAPI, PostgreSQL]}
s: {branch: feat/auth, wip: "JWT refactor", next: "add refresh token"}

git:
  recent: ["refactor: JWT moved to RS256"]
  changed: [auth.py, middleware.py]

team_claims:
  - {ticket: 42, author: dev-B, files: [middleware.py]}

avoid:
  - "avoid: HS256 — vulnerable to key confusion"

decisions:
  - "decision: tokens stored httpOnly cookie, never localStorage"

👉 Claude already knows where to act before reading a single source file.


💼 Use Cases

🎯 Solo developer

  • Cuts session-start exploration: −90% tokens on the first turn
  • Remembers your "I always do it this way" patterns
  • Auto-supersedes decisions when you change your mind

👥 Small team (3–10 devs)

  • Conflict detection before two devs touch the same files
  • Shared decision log with lifecycle (no more "wait, didn't we decide…?")
  • File ownership inference from git history

🏢 Enterprise (with caveats)

  • Local-first, no data leaves the machine → GDPR / SOC2-friendly
  • Compatible with Managed Identity / on-prem deployments (no cloud calls)
  • Token saving compounds: 65% × 100 devs × 100 sessions/month → measurable infra savings

❓ FAQ

Is this a RAG system or a vector DB?

No, and on purpose. mcp-brain is a structured awareness layer, not a retrieval-over-embeddings layer. The core retrieval is BM25 + code graph expansion — fully deterministic, sub-100ms, no vector DB to maintain. The semantic reranker is an optional 30% blend on top, used only as a tiebreaker. This is why token cost stays predictable and infra is local-first.

Why not just use Claude's native context window? It's huge now.

A long context window doesn't fix the problem — it makes it cheaper to waste. The bottleneck isn't capacity, it's signal density. Pasting your whole repo into the context still leaves Claude searching for the right file linearly. mcp-brain pre-ranks reality so the model spends its attention on the right 3 files, not the wrong 30.

Will it leak my code or memories anywhere?

No. Storage is SQLite under ~/.mcp-brain/ (local) and <repo>/.brain/shared/ (versioned with git if you choose). No outbound network calls, no telemetry, no cloud component. The semantic model runs on your CPU/GPU. This makes mcp-brain compatible with GDPR-restricted and air-gapped environments.

What if I disagree with a decision mcp-brain remembers?

Write a new memory that contradicts it. Semantic supersession (cosine ≥ 0.85) will auto-mark the old one as superseded. You can also manually demote via brain_memory_health or wait for age-based decay (SUSPECT_DAYS / STALE_DAYS). The lifecycle assumes you'll change your mind.

Does it work with languages other than Python?

Yes for indexing/predicting (BM25 is language-agnostic). The code graph currently supports Python, JavaScript, TypeScript, Go, Rust, Java, C# via tree-sitter parsers. Adding a new language is a single registry entry — see src/brain/parsers.py.

How does it compare to SWE-agent / Aider / Cursor?

Different layer of the stack. SWE-agent and similar tools are autonomous coders — they read, plan, and patch via LLM calls. mcp-brain is the awareness layer underneath them. You could pair it with Aider or any MCP-compatible client; it makes whatever LLM you use start from a smarter zero.

What's the catch?

Honest answer: file prediction is heuristic. Hit@1 = 24.5% means 3 issues out of 4 still need Claude to validate the prediction before acting. mcp-brain orients, it doesn't replace exploration. That's also why it's free — it's a force multiplier, not an oracle.


⚠️ Trade-offs

I'm honest about what this is and isn't.

| Strength | Limitation |
|---|---|
| ✅ Zero LLM cost for retrieval | ⚠️ Heuristic-based: edge cases with no symbol/path overlap can miss |
| ✅ Sub-100ms predictions | ⚠️ Requires good commit hygiene (semantic commit messages help) |
| ✅ Local-first, no cloud | ⚠️ No cross-machine sync out of the box (use git for .brain/shared/) |
| ✅ Deterministic (replays produce same output) | ⚠️ Hit@1 = 24.5% → orients, doesn't replace exploration |
| ✅ Works on any size repo | ⚠️ Best on medium/large repos (small repos don't benefit much) |

This is NOT:

  • ❌ a vector DB memory
  • ❌ a RAG system
  • ❌ an SWE-agent / autonomous coder
  • ❌ a checkpoint / replay tool

This IS:

  • ✅ a repo-aware, team-aware, token-efficient awareness layer
  • ✅ a force multiplier for Claude Code, not a replacement

🛣️ Roadmap

  • BM25 + code graph + semantic reranker
  • Decision lifecycle with semantic supersession
  • Feedback loop with precision/recall reconciliation
  • Observability dashboard
  • SWE-bench Full benchmark (2294 instances)
  • Multi-language code graph (Python, JS, TS, Go, Rust, Java, C#)
  • Cross-repo memory federation (opt-in)
  • Real-time conflict push (currently pull-based)
  • VS Code extension companion
  • Hosted shared .brain/ for distributed teams (still local-first per dev)

🧪 Run the test suite

pip install -e ".[dev]"
pytest tests/ -v

Expected: full pass on Python 3.10, 3.11, 3.12.


🤝 Contributing

PRs welcome. Before opening one:

  1. pytest tests/ -v must pass
  2. New behavior needs new tests
  3. New MCP tools must be wrapped with @observed("brain_<name>")
  4. Avoid heavy dependencies for the default install path — anything ML-flavored goes behind an optional extra

📄 License

MIT — see LICENSE.


Built for Claude Code — but the architecture is MCP-standard, so any MCP-compatible client works.

If mcp-brain saved you tokens, ⭐ the repo. That's the only payment I ask for.
