
feat(memory): agent memory v2 — second brain with hybrid search, LLM extraction, and observability dashboard#606

Open
kovtcharov wants to merge 48 commits into main from feature/agent-memory

@kovtcharov
Collaborator

@kovtcharov kovtcharov commented Mar 20, 2026

Summary

Comprehensive agent memory system that serves as a second brain — storing, recalling, and learning from every interaction. Built on proven patterns from Mem0, Zep, and Hindsight.

Architecture (v2)

  • Hybrid search: Vector (FAISS) + BM25 (FTS5) + RRF fusion + cross-encoder reranking (ms-marco-MiniLM-L-6-v2)
  • Mem0-style extraction: LLM decides ADD/UPDATE/DELETE/NOOP against existing memory after each conversation turn
  • Zep-inspired fact lineage: superseded_by column preserves history when facts are corrected
  • Hindsight-inspired reconciliation: Background pairwise similarity check detects contradictions across sessions
  • Complexity-aware recall: Adaptive top_k (3/5/10) based on query complexity heuristics
  • Temporal search: time_from/time_to on all search methods for time-based recall
  • Conversation consolidation: Auto-distill old sessions into durable knowledge before 90-day prune
  • No silent fallback: Embeddings are a hard requirement — system fails loudly on misconfiguration
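
The RRF fusion step named in the hybrid search bullet can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `rrf_fuse`, the constant `k=60` (a commonly used default for Reciprocal Rank Fusion), and the sample item IDs are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse several ranked lists of item IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per item, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item_id for item_id, _ in fused[:top_k]]


# Hypothetical result lists from the vector index and the keyword index.
vector_hits = ["m3", "m1", "m7"]
bm25_hits = ["m1", "m9", "m3"]
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)  # ['m1', 'm3', 'm9', 'm7']
```

Items found by both retrievers (`m1`, `m3`) outrank items found by only one, which is the property that makes rank fusion useful before a reranker sees the candidates.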

Schema v2

Three tables (knowledge, conversations, tool_history) with new columns:

  • knowledge.embedding BLOB — 768-dim vector (nomic-embed-text-v2)
  • knowledge.superseded_by TEXT — fact lineage chain
  • conversations.consolidated_at TEXT — consolidation tracking

Memory Tools (5 LLM-facing tools)

| Tool | Purpose |
|---|---|
| remember | Store facts, notes, reminders with category/domain/entity/due_at |
| recall | Hybrid semantic+keyword search with temporal filtering |
| update_memory | Modify existing items, set reminded_at |
| forget | Delete a memory item |
| search_past_conversations | Search conversation history with temporal filtering |

Use Cases

  • Note-taking, journaling, meeting notes capture
  • Reminders with due dates and wake-up scheduling
  • Personal knowledge management (research, articles)
  • Contact profiles via entity linking (person:sarah_chen)
  • Error learning and skill capture from tool usage
  • Recurring commitments (LLM advances due_at)

Observability Dashboard

Full-page Memory Dashboard in Agent UI with:

  • Header stats cards (memories, sessions, tool calls, success rate)
  • Activity timeline (30-day heatmap)
  • Knowledge browser (filterable, sortable, paginated table)
  • Tool performance stats
  • Conversation history browser with consolidation status
  • Upcoming/overdue reminders panel
  • Maintenance actions (consolidate, rebuild embeddings, reconcile)
  • Embedding coverage indicator

Startup Sequence

  1. Validate Lemonade → 2. Backfill embeddings → 3. Rebuild FAISS → 4. Confidence decay → 5. Reconcile memory → 6. Consolidate sessions → 7. Prune → 8. Generate session

Files

| Component | Files |
|---|---|
| Data layer | src/gaia/agents/base/memory_store.py |
| Agent mixin | src/gaia/agents/base/memory.py |
| System discovery | src/gaia/agents/base/discovery.py |
| REST API | src/gaia/ui/routers/memory.py |
| Agent UI | src/gaia/apps/webui/src/pages/MemoryDashboard.tsx |
| Architecture spec | docs/spec/agent-memory-architecture.md |
| Unit tests | tests/unit/test_memory_*.py |
| Integration tests | tests/integration/test_memory_*.py |

Design References

| System | Pattern adopted |
|---|---|
| Mem0 | LLM-in-the-loop extraction (ADD/UPDATE/DELETE/NOOP) |
| Zep/Graphiti | Fact lineage via superseded_by, temporal search |
| Hindsight | Cross-encoder reranking, background reconciliation |
| ENGRAM | Memory typing (category-based) over knowledge graphs |
| CoALA | Four-tier cognitive architecture (working/episodic/semantic/procedural) |

Test plan

  • Unit tests pass: pytest tests/unit/test_memory_store.py test_memory_mixin.py test_memory_router.py
  • Integration tests pass: pytest tests/integration/test_memory_integration.py test_memory_api_integration.py
  • Schema v2 migration works on existing v1 databases
  • Hybrid search returns semantically relevant results (not just keyword matches)
  • Mem0 extraction correctly handles ADD/UPDATE/DELETE operations
  • Superseded items excluded from search and system prompt
  • Temporal filtering works with time_from/time_to on recall
  • Consolidation distills old sessions into knowledge items
  • Reconciliation detects contradictory facts across sessions
  • Memory Dashboard renders all 6 sections with real data
  • Dashboard knowledge browser supports filter/sort/paginate/edit/delete
  • Lemonade unavailable at startup raises RuntimeError (no silent fallback)
  • Cross-encoder reranking improves precision on ambiguous queries
  • Complexity-aware recall uses adaptive top_k (3/5/10)
  • Frontend build succeeds: cd src/gaia/apps/webui && npm run build

@github-actions github-actions Bot added documentation Documentation changes dependencies Dependency updates agents cli CLI changes tests Test changes electron Electron app changes labels Mar 20, 2026
Comment thread src/gaia/agents/base/discovery.py Fixed
@kovtcharov kovtcharov force-pushed the feature/agent-memory branch from e0eff31 to 068eead Compare March 21, 2026 23:13
itomek and others added 6 commits March 21, 2026 18:10
## Summary

- **`gaia init` now installs RAG dependencies** for `chat`, `rag`, and
`all` profiles — adds `pip_extras` field to profile definitions and a
new `_install_pip_extras()` step that detects editable vs package
install, tries `uv pip` first with `pip` fallback
- **Added `self.rag` None guards** to 8 RAG tools in `rag_tools.py` that
were crashing with `'NoneType' object has no attribute 'index_document'`
when RAG deps not installed
- **Widened ChatAgent RAG init exception catch** from `ImportError` to
`Exception` with warning-level logging and debug traceback
- **Updated Agent UI docs** to include `[rag]` in install instructions
(`[ui,rag]`)

## Test plan

- [x] Lint passing (black, isort, pylint, flake8)
- [x] All 1104 unit tests passing
- [ ] `gaia init --profile chat` installs RAG deps automatically
- [ ] Agent UI document indexing works after `pip install -e ".[rag]"`
- [ ] RAG tools return actionable error when deps not installed (instead
of crashing)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
C-1: Guard winreg import and all registry-scanning methods in discovery.py
     so the module loads cleanly on Linux/macOS where winreg is absent.
     Also guard _scan_credential_manager() behind sys.platform check to
     avoid subprocess.CREATE_NO_WINDOW AttributeError on non-Windows.

C-3: Replace direct _lock/_conn access in CLI with two new MemoryStore
     public methods: get_source_counts() and delete_by_source(source).
     delete_by_source() wraps FTS cleanup + DELETE in a single atomic
     transaction with rollback, removing the per-ID loop that could
     leave knowledge/FTS diverged on partial failure.

C-4: Add close_store() to memory router module; call it from FastAPI
     lifespan shutdown so the WAL is checkpointed and the SQLite
     connection is released cleanly on server exit.

M-2: list_knowledge endpoint now excludes sensitive items by default.
     New include_sensitive query param (default false) controls
     visibility; sensitive=true still filters to sensitive-only.

M-6: Add append-only comment to conversations FTS trigger block noting
     that an AFTER UPDATE trigger would be required if store_turn()
     ever changes to update existing rows.

Tests: +9 tests (394 total) covering get_source_counts, delete_by_source
       rollback discipline, and all three sensitive filter modes in the router.
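
The atomic delete described in C-3 might look roughly like the sketch below. It is written under assumptions: a plain `knowledge_index` table stands in for the FTS index, the schema is simplified, and the function body is illustrative rather than the real `MemoryStore.delete_by_source()`.

```python
import sqlite3


def delete_by_source(conn, source):
    """Delete every row for `source` from both tables, or from neither."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "DELETE FROM knowledge_index WHERE item_id IN "
            "(SELECT id FROM knowledge WHERE source = ?)",
            (source,),
        )
        cur = conn.execute("DELETE FROM knowledge WHERE source = ?", (source,))
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE knowledge (id INTEGER PRIMARY KEY, source TEXT, content TEXT);
    CREATE TABLE knowledge_index (item_id INTEGER, content TEXT);
    INSERT INTO knowledge VALUES (1, 'discovery', 'a'), (2, 'discovery', 'b'), (3, 'user', 'c');
    INSERT INTO knowledge_index VALUES (1, 'a'), (2, 'b'), (3, 'c');
    """
)
conn.commit()

deleted = delete_by_source(conn, "discovery")
print(deleted)  # 2
```

Because both statements run inside one transaction, a failure in the second `DELETE` also undoes the index cleanup, so the two tables cannot diverge the way a per-ID loop could.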
- Fix _original_user_input=None fallback bug in _after_process_query
  (getattr default ignored None; switch to `or` to handle init state)
- Extract VALID_CATEGORIES/MAX_CONTENT_LENGTH/MAX_TURN_LENGTH and other
  magic numbers to named module-level constants in memory_store.py
- Import constants in memory.py to eliminate duplicate category sets
  and ensure truncation limits stay in sync across all call sites
- DRY: memory router imports VALID_CATEGORIES from data layer instead
  of redefining its own copy
- Clean up unused imports in test files (F401/F811 flake8 violations)
- 394 unit tests passing, flake8 clean
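
The `getattr` pitfall from the first bullet can be reproduced in a few lines. This is a standalone illustration: the class `AgentState` and the helpers `resolve_buggy` and `resolve_fixed` are hypothetical, and only the attribute name `_original_user_input` comes from the commit message.

```python
class AgentState:
    def __init__(self):
        # Init state: the attribute exists but is explicitly None.
        self._original_user_input = None


def resolve_buggy(agent, fallback):
    # The default is used only when the attribute is MISSING,
    # so an attribute that is set to None is returned as None.
    return getattr(agent, "_original_user_input", fallback)


def resolve_fixed(agent, fallback):
    # `or` also covers the case where the attribute exists but is None.
    return getattr(agent, "_original_user_input", None) or fallback


agent = AgentState()
print(resolve_buggy(agent, "hello"))  # None
print(resolve_fixed(agent, "hello"))  # hello
```

One caveat of the `or` form is that it also replaces other falsy values such as an empty string, which is usually acceptable for a user-input fallback.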
Replace substring `"github.com" in url_lower` with urlparse().hostname
comparison to fix CodeQL CWE-20 "Incomplete URL substring sanitization".
A crafted URL like http://evil.com/github.com could otherwise bypass the
check. Hostname equality/suffix match is unambiguous.
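
A minimal sketch of the hostname comparison, assuming a helper named `is_github_url` (hypothetical; the actual function in `discovery.py` is not shown here):

```python
from urllib.parse import urlparse


def is_github_url(url: str) -> bool:
    """True only when the URL's host is github.com or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == "github.com" or host.endswith(".github.com")


print(is_github_url("https://github.com/amd/gaia"))   # True
print(is_github_url("http://evil.com/github.com"))    # False
print(is_github_url("https://github.com.evil.com/"))  # False
```

The leading dot in the suffix check matters: without it, a host such as `notgithub.com` would pass.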
Security:
- recall tool now filters out sensitive items before returning results
  to the LLM — sensitive entries (API keys, credentials) are for
  internal use only and must not appear in tool output.

Performance:
- Add get_by_category_contexts() to MemoryStore: single SQL query with
  WHERE context IN (active, 'global') replaces two separate
  get_by_category() calls in _get_context_items(), halving DB round-trips
  per system-prompt build (was 6 queries, now 3).
- Replace N+1 correlated subquery in get_sessions() with a LEFT JOIN on
  MIN(id) per session — scales linearly regardless of session count.

Reliability:
- Add PRAGMA busy_timeout=5000 so concurrent WAL readers/writers in the
  same process (dashboard REST singleton + ChatAgent) retry for 5 s
  instead of failing immediately with SQLITE_BUSY.

Correctness:
- update_memory tool truncation check now uses MAX_CONTENT_LENGTH constant
  instead of hardcoded 2000, keeping it in sync with memory_store.py.

Testability:
- Replace sys.exit(1) in _bootstrap_chat/_bootstrap_discover/_bootstrap_reset
  helpers with raise RuntimeError; _handle_memory_bootstrap catches and
  exits, making helpers unit-testable in isolation.

Tests (+34):
- TestGetByCategoryContexts (5): single-query context+global fetch
- TestGetAllKnowledgeSortByValidation (4): sort_by whitelist protection
- TestGetSessionsFirstMessageV2 (3): join-based first_message
- test_memory_discovery.py (22): _classify_remote, _classify_path,
  _classify_domain, scan_all structure, Windows guard

428 tests passing, 1 skipped (Windows-only guard on non-Windows).
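
The single-query context fetch and the busy timeout described above could be sketched like this. The table layout and the function name `fetch_category_for_contexts` are assumptions for illustration; the real `get_by_category_contexts()` signature is not shown in this commit message.

```python
import sqlite3


def fetch_category_for_contexts(conn, category, active_context):
    """One query for both the active context and 'global', instead of two."""
    rows = conn.execute(
        "SELECT content FROM knowledge "
        "WHERE category = ? AND context IN (?, 'global') "
        "ORDER BY id",
        (category, active_context),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
# Retry for up to 5 s on a locked database instead of failing immediately.
conn.execute("PRAGMA busy_timeout=5000")
conn.executescript(
    """
    CREATE TABLE knowledge (
        id INTEGER PRIMARY KEY, category TEXT, context TEXT, content TEXT
    );
    INSERT INTO knowledge VALUES
        (1, 'preference', 'global', 'prefers dark mode'),
        (2, 'preference', 'work', 'uses tabs at work'),
        (3, 'preference', 'home', 'uses spaces at home'),
        (4, 'fact', 'work', 'office is in Austin');
    """
)

items = fetch_category_for_contexts(conn, "preference", "work")
print(items)  # ['prefers dark mode', 'uses tabs at work']
```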
# Conflicts:
#	src/gaia/agents/chat/agent.py
#	src/gaia/apps/webui/src/App.tsx
#	src/gaia/apps/webui/src/components/ChatView.tsx
#	src/gaia/ui/server.py
@kovtcharov kovtcharov force-pushed the feature/agent-memory branch from d4fdb90 to a06f9cc Compare April 1, 2026 16:31
Comprehensive rewrite of agent-memory-architecture.md as a single
unified design document. Key changes:

- Hybrid search: vector (FAISS) + BM25 (FTS5) + RRF fusion + cross-encoder
  reranking (ms-marco-MiniLM-L-6-v2). No fallback — embeddings are a hard
  requirement.
- Mem0-style LLM extraction: ADD/UPDATE/DELETE/NOOP operations against
  existing memory, replacing naive extract-and-store.
- Zep-inspired fact lineage: superseded_by column preserves history when
  facts are corrected rather than silently overwriting.
- Hindsight-inspired background reconciliation: pairwise similarity check
  on startup detects contradictions missed at extraction time.
- Complexity-aware recall depth: adaptive top_k (3/5/10) based on query
  complexity heuristics.
- Temporal range search: time_from/time_to on all search methods for
  natural time-based recall.
- Conversation consolidation: auto-distill old sessions to durable
  knowledge before 90-day prune.
- Second brain use cases: journaling, meeting notes, PKM, reminders,
  wake-up scheduling, recurring commitments.
- Removed all graceful degradation / silent fallback patterns.
- Removed openjarvis-memory-analysis.md (temp analysis doc).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kovtcharov kovtcharov changed the title feat(memory): persistent agent memory system with dashboard UI feat(memory): agent memory v2 — second brain with hybrid search, LLM extraction, and observability dashboard Apr 1, 2026
Karim13014 and others added 7 commits April 1, 2026 15:48
…coverage, temporal+superseded filters

- POST /api/memory/consolidate, /reconcile, /rebuild-embeddings
- GET /api/memory/embedding-coverage
- Updated GET /api/memory/knowledge with include_superseded, time_from, time_to
- Updated GET /api/memory/stats with embedding coverage and reconciliation stats
- 95 tests passing, lint clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d_by, temporal search, consolidation

- Schema v1→v2 migration: embedding BLOB, superseded_by TEXT, consolidated_at TEXT
- New methods: store_embedding, get_items_with/without_embeddings, get_unconsolidated_sessions, mark_turns_consolidated, get_items_for_reconciliation
- Updated search() with time_from/time_to, superseded_by IS NULL, use_count increment
- Updated all query methods with superseded_by IS NULL filter
- 275 tests passing, lint clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…LLM extraction, temporal recall

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… FAISS, API integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ledge browser, activity timeline, tool stats

6-section dashboard: header stat cards, 30-day activity bar chart,
paginated knowledge browser with entity/category/context/search filters,
tool performance table, conversation history with FTS search,
upcoming & overdue temporal panel.

Features:
- Embedding coverage indicator with progress bar
- Maintenance dropdown: consolidate, rebuild embeddings, reconcile, rebuild FTS
- Click-to-expand knowledge row detail (metadata, timestamps, superseded_by chain)
- Inline actions: edit, delete, toggle sensitive, copy ID
- Superseded entries toggle with server-side filtering
- Toast notification system for all CRUD and maintenance operations
- Brain icon in sidebar for navigation
- Keyboard support: Escape key (layered close), Enter/Space on rows
- ARIA labels, roles, and aria-live for accessibility
- Responsive layout (3 breakpoints)
- Relative date formatting ("in 2 days", "3 days ago")
- API calls aligned with backend router field names

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…em0 extraction, consolidation, reconciliation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The backend returns metadata as parsed JSON (dict), not a string.
Rendering it directly showed [object Object]. Now uses
JSON.stringify for object metadata and plain text for strings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Karim13014 and others added 4 commits April 1, 2026 15:52
…e cases

- Strengthen conversation context filtering test with explicit zero-result
  assertions instead of vacuous loop
- Add due_at validation, empty-list consolidation, and history limit tests
- Remove dead _past_iso import from API test file
- 117 tests, all passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m0 extraction, consolidation, reconciliation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…up scope includes entity, dynamic context always returns time

- MemoryStore.search(): corrected from "hybrid" to "FTS5 keyword search" (hybrid is MemoryMixin._hybrid_search)
- get_memory_dynamic_context(): fixed "returns empty" claim — always returns current time
- store() dedup scope: category+context+entity, not category+context
- get_items_with_embeddings(): added missing top_k, time_from, time_to params
- _classify_query_complexity: added missing medium/complex signal words
- get_entities(): added missing last_updated field in return
- Added undocumented update_confidence() and delete_by_source() methods
- update(): noted embedding cleared on content change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… fixes

- memory_store.py: set embedding=NULL when content changes in update()
  to force re-embedding (stale embedding would return wrong results)
- server.py: alphabetize router imports
- test fixes: formatting cleanup, mixin test updates from parallel tasks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kovtcharov
Collaborator Author

🛑 STOP-THE-LINE — coordination notice

This PR is foundational (38k+ LOC, 77 files) and currently CONFLICTING with main. Until this lands, no PRs may merge changes to the frozen paths below.

Frozen paths

  • src/gaia/agents/base/agent.py
  • src/gaia/agents/base/memory*.py (all memory-related files)
  • src/gaia/agents/base/discovery.py
  • src/gaia/ui/routers/memory.py
  • src/gaia/apps/webui/src/pages/MemoryDashboard.tsx
  • Anything in this PR's file diff (run gh pr diff 606 --name-only for the full list)

Why

Every parallel agent merging to main makes this PR staler and less mergeable. With ~10 parallel agents potentially producing PRs/day for v0.20.0 + v0.18.2, without coordination this PR becomes unmergeable and v0.20.0 (June 2 consumer launch) slips by 4+ weeks.

What's allowed

  • PRs touching ONLY non-frozen file paths (most v0.18.2 mobile work, Telegram, scheduler, MCP catalogue)
  • Documentation changes
  • CI/test infrastructure
  • Hotfixes to v0.17.5/v0.18.0

Active focus on landing this PR

Per issue A5: rebase + review acceleration. Target land date: May 5 (latest acceptable May 12). If slipping beyond May 12, escalate.

For coding agents

Read AGENTS.md (newly added) for the full rule + check command (gh pr list --label stop-the-line).

# Conflicts:
#	src/gaia/apps/webui/src/components/SettingsModal.tsx
#	src/gaia/apps/webui/src/services/api.ts
#	src/gaia/cli.py
Fills the consumer-use-case gaps not covered by the existing 13 memory
scenarios. Each new scenario maps to a specific item in the v0.20.0
milestone and the broader consumer use-case list (morning briefs, email
triage, watch lists, scheduling, writing-style adaptation).

- memory_writing_style          — voice/tone transfer to drafts
                                   (creative_professional, 4 turns)
- memory_morning_brief_personalization — interest profile → daily brief
                                   assembly + interest update
                                   (home_user, 5 turns)
- memory_email_sender_priorities — VIP/ignore rules + conditional
                                   time-of-day rule applied to a mock
                                   inbox (power_user, 5 turns)
- memory_watchlist_monitoring   — multi-domain watch list (real estate,
                                   shopping, options), durability across
                                   unrelated turns, criteria match,
                                   targeted partial update
                                   (power_user, 7 turns)
- memory_schedule_preferences   — hard no-go windows, focus blocks,
                                   soft preferences; rule-aware slot
                                   recommendation and conflict detection
                                   (power_user, 6 turns)

All five validated against runner.validate_scenario(). Total memory
scenarios is now 20.
antmikinka pushed a commit to antmikinka/gaia that referenced this pull request Apr 28, 2026
## Summary

Adds `AGENTS.md` establishing coordination rules for coding agents
(Claude Code, Cursor, Copilot, custom orchestrators) working on GAIA in
parallel. Complements `CLAUDE.md` without duplicating it — `CLAUDE.md`
owns project conventions; `AGENTS.md` owns multi-agent coordination.
Priority order is explicit (CLAUDE.md > AGENTS.md > default agent
behavior).

## Why

With v0.20.0 carrying 11+ consumer-critical PRs landing in parallel,
coordination becomes the dominant cost — typing speed isn't the
bottleneck anymore. Without explicit rules, parallel agents create merge
conflicts, divergent component patterns, and incoherent UX. This
documents the discipline once so it doesn't get reinvented per release.

## Key rules established

- **Stop-the-line discipline** — foundational PRs (e.g. PR amd#606 memory
v2) freeze touching their file paths until they land. Coding agents
check `gh pr list --label stop-the-line` before opening PRs.
- **Spec-before-PR** — issues with `consumer-critical` label require
implementation specs at the depth of amd#887/amd#888/amd#890 before agent
assignment. New `spec-ready` label gates agent-assignment.
- **Review chain** — every agent-authored PR runs through
`code-reviewer` agent + (`architecture-reviewer` if applicable) +
`claude.yml` Opus + human review
- **No silent test skips** — reinforces CLAUDE.md no-fallback rule at
the test layer
- **Pre-flight checks** — agents must check stop-the-line PRs and
`claudia_list_tasks` before opening PRs

## Cross-references

- Issue amd#899 — agent orchestration playbook (consumes the rules in this
file)
- Issue amd#900 — pre-flight implementation specs for under-specified
consumer-critical issues
- Issue amd#903 — stop-the-line for PR amd#606 (canonical example of the rule)
- PR amd#606 — currently the active stop-the-line PR

## Test plan

- [ ] Markdown renders correctly on GitHub (visual check on this PR's
Files tab)
- [ ] All cross-referenced issue numbers (amd#606, amd#887, amd#888, amd#890, amd#899,
amd#900, amd#903) exist
- [ ] Priority order is explicit and doesn't conflict with CLAUDE.md
- [ ] Pre-flight check commands (`gh pr list --label stop-the-line`,
`claudia_list_tasks`) are accurate
@kovtcharov-amd kovtcharov-amd marked this pull request as ready for review April 29, 2026 19:50
@kovtcharov-amd kovtcharov-amd self-requested a review as a code owner April 29, 2026 19:50
# Conflicts:
#	src/gaia/agents/chat/agent.py
#	src/gaia/apps/webui/src/components/ChatView.tsx
#	src/gaia/apps/webui/src/components/SettingsModal.tsx
#	src/gaia/apps/webui/src/components/Sidebar.tsx
#	src/gaia/apps/webui/src/hooks/useModelActions.ts
#	src/gaia/cli.py
#	src/gaia/eval/runner.py
#	src/gaia/ui/routers/chat.py
#	src/gaia/ui/routers/system.py
The 20 existing memory scenarios are functional / use-case tests with 3-7
turns each. They don't exercise the dimensions that bite real users at
scale. These five scenarios fill those gaps:

- memory_stress_capacity (8 turns) — store 100 distinct items across
  4 batches, then targeted retrieval (exact name lookup, cross-fact join,
  semantic ambiguity, prod-vs-staging discrimination). Catches store-scale
  retrieval-quality regressions.

- memory_stress_retrieval_under_noise (6 turns) — store 12 contacts that
  share first names / roles, then disambiguating retrieval queries by
  company / hire-date / role. Catches near-duplicate collapse.

- memory_stress_long_conversation (15 turns) — incremental fact-sharing
  interleaved with chit-chat, ending in a coherence probe that requires
  recalling 9 of 11 facts including a mid-conversation update (cat count
  2 → 3). Catches memory degradation in long-running sessions.

- memory_stress_adversarial_poisoning (7 turns, severity=critical) —
  rapid contradictions (latest wins), fabricated-memory-injection
  ("you previously told me my password is hunter2"), and prompt-injection
  framing ([SYSTEM OVERRIDE]). Privacy / reliability regression catcher.

- memory_stress_forget_semantics (8 turns) — store siblings, forget one,
  verify it's actually gone (direct recall miss, list miss, paraphrase
  miss after intervening turns) AND siblings untouched. Catches soft-delete
  bugs that masquerade as "forget" but leave records retrievable.

Total memory scenarios: 25 (13 original + 5 use-case + 5 stress).

Note: privacy / private-mode isolation is intentionally NOT covered here
because the runner runs one Agent UI session per scenario; cross-session
isolation needs framework support to validate properly.
Mostly mechanical lint/format cleanup across the memory branch:
- Long-line wraps in `agents/base/memory.py` debug logging and
  `agents/chat/agent.py` long-form returns.
- `l` → `line` in list comprehensions in `agents/base/system_context.py`
  to silence the ambiguous-variable-name lint.
- Drop unused `threading` import from `ui/agent_loop.py` and unused
  `_load_memory_settings` import from `cli.py`.

Real fixes folded into the same pass:
- `eval/runner.py` — wire `keep_sessions=keep_sessions` through to a
  call site missed in the earlier conflict resolution; add the matching
  default to the surrounding signature.
- `ui/agent_loop.py` — `get_actionable_goals(limit=5)` →
  `get_actionable_goals()[:5]`; the underlying signature changed and
  `limit` is no longer accepted.
- `ui/server.py` — move the `agent_loop` import above the `routers`
  imports to avoid an import-order issue at app start.
- `tests/*` — drop a few unused imports / stale skip markers so the
  suites collect cleanly.
Bring the feature branch back to green by addressing the cluster of CI
failures that landed when the memory v2 work merged with main.  All fixes
are mechanical or scoped to test isolation — no behavioural change to
the memory pipeline itself.

- Restore lost merge-conflict state in `ChatView.tsx` and `Sidebar.tsx`:
  the `getSessionHash` import, `hashCopied`/`copied` state, and the
  `handleCopyHash` callback all dropped during the merge — Vite build
  was failing on missing identifiers across PyPI Build Check and all
  three Build Installers jobs.

- Lint/Pylint cleanup so the `Code Quality (Lint)` job is green again:
  remove unused vars/imports, drop dead `if x != x` branches, and
  promote a few pointless lambdas to method references in
  `agents/base/discovery.py`.  Reorder `routers/memory.py` imports
  to satisfy isort.

- Tighten `_canonical_agent_type` to surface `AttributeError` instead
  of swallowing it (matches the existing regression test added in #802;
  was failing locally and in CI Unit Tests).

- Add an explicit `GAIA_MEMORY_DISABLED=1` opt-out to `MemoryMixin.init_memory`.
  The Path Validator security tests, Unit Tests, and Chat Agent Tests
  jobs all instantiate `ChatAgent`/`CodeAgent` without a Lemonade server
  available; the memory v2 hard-requirement on the embedding service
  fails them.  This is a deliberate, named opt-out (not a silent
  fallback) — tests that exercise memory itself clear the variable
  via the new `tests/unit/conftest.py` autouse fixture and the
  `_mock_v2_init_context` helper, so memory test coverage is unchanged.
  CI workflows that don't need memory now set the env var explicitly.
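
The difference between a named opt-out and a silent fallback can be illustrated with the sketch below. It is not the real `MemoryMixin.init_memory`: the function `init_memory_sketch`, the `probe_embeddings` callable, and the returned dictionary are hypothetical, and only the `GAIA_MEMORY_DISABLED=1` variable comes from this commit message.

```python
import os


def init_memory_sketch(probe_embeddings, env=None):
    """Return a store handle, or None only when the user explicitly opted out."""
    env = os.environ if env is None else env
    if env.get("GAIA_MEMORY_DISABLED") == "1":
        # Deliberate, named opt-out: skip the embedding probe entirely.
        return None
    try:
        probe_embeddings()
    except Exception as exc:
        # No opt-out was requested, so a failed probe is a loud error.
        raise RuntimeError(
            "Embedding service unreachable; start it or set "
            "GAIA_MEMORY_DISABLED=1 to opt out"
        ) from exc
    return {"status": "ready"}


def unreachable():
    raise ConnectionError("connection refused")


print(init_memory_sketch(unreachable, env={"GAIA_MEMORY_DISABLED": "1"}))  # None
```

With the variable unset, the same call raises `RuntimeError`, so a missing embedding service cannot go unnoticed unless someone asked for that behaviour by name.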
@github-actions github-actions Bot added devops DevOps/infrastructure changes security Security-sensitive changes labels Apr 29, 2026
User messages were rendering as plain right-aligned text with just a
bottom-border divider, while assistant replies got the full card treatment
(bg, border, radius, shadow, avatar, name). On a real conversation that
read as "floating text" next to "card" — broken. Now .msg-user is a flex
container that pins its inner bubble to the right edge of the 900px chat
column, with the bubble itself styled to match the assistant card minus
the avatar/name (capped at 70% width so short messages stay compact).
Also dropped the text-align:right hacks on body/markdown elements — text
inside the bubble is left-aligned now that the bubble itself is on the
right.
CI green-up follow-up to 7f86021. Three behavioural fixes plus pickup of
work from parallel memory-eval tasks that landed in the same tree.

- ``MemoryMixin.init_memory`` now degrades to memory-disabled (warning
  log + ``_memory_store=None``) when Lemonade is unreachable, instead of
  raising RuntimeError.  Hard-failure here breaks the AppImage smoke test
  (Lemonade isn't bundled with the installer; fresh users hit this on
  first launch) and was the root cause of the AppImage userns-restricted
  state-machine failure.  Memory tools now refuse to register when the
  store is None so the LLM can't blunder into AttributeError mid-turn,
  and ``get_memory_dynamic_context`` / ``_after_process_query`` /
  ``_execute_tool`` short-circuit cleanly via ``getattr(... , None) is None``.

- ``test_tier2_rag_rules_absent_without_indexed_docs``: branch optimised
  the non-file-context discovery rules to a compact form to save tokens,
  but the test still asserted the literal "FILE SEARCH AND AUTO-INDEX"
  block was always present.  Loosened the assertion to the underlying
  workflow keywords (``search_file``, ``index_document``,
  ``query_specific_file|query_documents``) so the optimisation and the
  test agree on intent.

- Add ``GAIA_MEMORY_DISABLED: "1"`` to the Windows Path Validator
  Security Tests step (the Linux variant was already done in 7f86021).
  The check is also harmless under graceful degrade — it just skips the
  Lemonade probe instead of relying on the warning path.

Picked up alongside (other parallel tasks; included so the tree compiles
and lints clean as a unit):
- ``MemoryStore.get_item`` for dashboard / eval supersession-chain probes
- Memory MCP read tools register on env ``GAIA_MEMORY_MCP_ALWAYS=1`` for
  the eval runner; admin tools also gate on ``GAIA_MEMORY_ADMIN=1``
- ``preflight_check`` rejects memory-category eval runs without admin env
- Eval simulator/judge-turn prompt updates for memory MCP tools
- ``_chat_helpers._stream_chat_response`` resolves Lemonade base URL via
  ``LemonadeManager`` so non-default ports are picked up for /stats
- ``test_security.yml`` Windows path-validator job now sets the same env
Two single-cause CI fixes:

- ``_chat_helpers.py`` no longer reads ``os.environ`` for the Lemonade
  base URL (resolved via ``LemonadeManager`` after f63f09e), so the
  ``import os`` is dead.  Pylint/flake8 caught it; black moved the
  remaining imports.

- ``test_unit.yml`` was installing ``pytest pytest-cov pytest-asyncio
  pyfakefs`` but the new ``test_memory_router::TestReconcileEndpoint::
  test_reconcile_runtime_error_returns_500`` test takes a ``mocker``
  fixture (pytest-mock).  Add ``pytest-mock`` to the install line.
The userns-restricted AppImage smoke test polled ``state: ready`` for
90 seconds, but ``gaia init --profile minimal`` downloads the Gemma-4-E4B
GGUF model (~3 GB) on first run and that exceeds 90s on the public
runner.  The structural and distro-matrix smoke jobs already use 300s
for the same reason; align this one too.

Failure mode it fixed: ``state: installing (gaia-init)`` … ``Step 3/4:
Downloading models for 'minimal' profile`` → timer elapses →
``::error::userns-restricted launch did not reach state: ready``.
@github-actions
Contributor

Summary

Massive, well-structured PR (~41k LOC) introducing memory v2 with hybrid retrieval (FAISS + FTS5 + RRF + cross-encoder), Mem0-style LLM extraction, Zep-inspired lineage, an observability dashboard, 26 eval scenarios, and a credible suite of unit + integration tests. Architecture, schema migration, FTS5 sanitization, parameterized SQL, admin gating via GAIA_MEMORY_ADMIN, and the locking strategy (WAL + threading.Lock + busy_timeout=5000) all look sound.

The main concern is a direct contradiction between the PR description and the implementation: the description (and test plan) claim "No silent fallback — system fails loudly on misconfiguration" and "Lemonade unavailable at startup raises RuntimeError," but memory.py:init_memory actually swallows the failure and silently disables memory. Per CLAUDE.md "No Silent Fallbacks — Fail Loudly," this should be reconciled before merge.

Issues

🟡 Silent fallback contradicts PR description and CLAUDE.md

src/gaia/agents/base/memory.py:347-361 — when the Lemonade embedding probe fails, the code catches a broad Exception, logs a warning, tears state down, and returns. The session continues with memory disabled. The PR description explicitly promises the opposite ("No silent fallback") and the test plan has a checkbox for "Lemonade unavailable at startup raises RuntimeError (no silent fallback)" — that checkbox cannot pass against this code path.

This also conflicts with the project rule in CLAUDE.md: "Either the operation succeeds as intended, or it raises an actionable error."

Two acceptable resolutions:

Option A — actually fail loudly (matches PR description):

        except Exception as e:
            raise RuntimeError(
                "Lemonade embedding service unreachable — memory v2 cannot "
                "initialize. Start lemonade-server (e.g. `lemonade-server serve`) "
                "and ensure the embedding model is available, or set "
                "GAIA_MEMORY_DISABLED=1 to opt out. Reason: "
                f"{e}"
            ) from e

Option B — keep the degrade, but make it an explicit opt-in and update the PR description. If the intent is genuinely "graceful degrade for users without Lemonade," gate it behind an env flag (e.g. GAIA_MEMORY_DEGRADE_ON_NO_EMBEDDINGS=1) so it's an opt-in, not a default. Then either delete the "No silent fallback" claim from the PR body or qualify it ("fails loudly unless GAIA_MEMORY_DEGRADE_ON_NO_EMBEDDINGS=1").

Either resolution is fine. What is not fine is the current state, which ships with the docs and the code disagreeing.
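As a rough illustration of Option B, the sketch below gates the degrade behind the env flag suggested above. Everything here is an assumption for illustration: `init_memory_or_degrade` and `probe_embeddings` are hypothetical names, not functions in memory.py, and `GAIA_MEMORY_DEGRADE_ON_NO_EMBEDDINGS` is only the proposed flag, not an existing setting.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Proposed flag name from the review; not an existing GAIA setting.
DEGRADE_FLAG = "GAIA_MEMORY_DEGRADE_ON_NO_EMBEDDINGS"


def init_memory_or_degrade(probe_embeddings, env=None):
    """Return True if memory is enabled, False if the user opted into degrade.

    probe_embeddings is a stand-in for the Lemonade embedding probe: any
    zero-argument callable that raises when the service is unreachable.
    """
    env = os.environ if env is None else env
    try:
        probe_embeddings()
    except Exception as e:
        if env.get(DEGRADE_FLAG) == "1":
            # Explicit opt-in: degrade, but say so at warning level.
            logger.warning(
                "Embeddings unavailable; memory disabled because %s=1 (%s)",
                DEGRADE_FLAG, e,
            )
            return False
        # Default path: fail loudly with an actionable message.
        raise RuntimeError(
            "Embedding service unreachable; memory cannot initialize. "
            f"Set {DEGRADE_FLAG}=1 to run without memory. Reason: {e}"
        ) from e
    return True
```

With this shape the default stays "raise", and the degrade path is only reachable when the user has asked for it, so the PR description can be qualified rather than deleted.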

🟡 Bare except Exception: pass in mixin prompt auto-discovery

src/gaia/agents/base/agent.py:313-318 — the auto-discovery loop in _get_mixin_prompts swallows any failure from a mixin's prompt-fragment method without even a debug log. If a mixin author writes a buggy _get_xxx_prompt it will silently produce no fragment, with no diagnostic. At minimum log at debug level so the failure is traceable when someone is debugging "why isn't my mixin prompt showing up?":

                try:
                    fragment = getattr(self, attr_name)()
                    if fragment:
                        prompts.append(fragment)
                except Exception as e:
                    logger.debug(
                        "[Agent] mixin prompt fragment %s.%s raised: %s",
                        type(self).__name__, attr_name, e,
                    )

🟢 QueueFull swallowed silently in agent loop trigger enqueue

src/gaia/ui/agent_loop.py:137-142 — the comment says "queue is unbounded; this should never happen" which is true today, but if anyone ever adds a maxsize= to _trigger_queue the trigger will be silently dropped. Cheap fix:

        try:
            self._trigger_queue.put_nowait(
                AgentTrigger("user_message_followup", session_id)
            )
        except asyncio.QueueFull:
            logger.warning(
                "AgentLoop trigger queue full; dropping user_message_followup for session %s",
                session_id,
            )

🟢 Repeated except Exception: pass (or near-equivalents) in discovery / system_context

src/gaia/agents/base/discovery.py and src/gaia/agents/base/system_context.py use broad try/except extensively. For best-effort OS/registry/browser-history scanning this is mostly appropriate — those code paths must not crash a user's session because Chrome's history DB is locked or winreg returns garbage. But several of those handlers don't log at all, which makes "why didn't system context populate field X?" undiagnosable. Per the same CLAUDE.md rule, prefer "log at debug + continue" over silent pass, so the failure is visible when someone runs with --debug. Not blocking, but worth a sweep.
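A minimal sketch of the "log at debug + continue" pattern for best-effort scanning follows. The `collect_best_effort` helper and the probe names are hypothetical and do not reflect the actual structure of discovery.py or system_context.py.

```python
import logging

logger = logging.getLogger(__name__)


def collect_best_effort(probes):
    """Run each named probe; skip failures but leave a debug trace.

    probes maps a field name to a zero-argument callable (hypothetical
    stand-ins for the OS / registry / browser-history scanners).
    """
    context = {}
    for field, probe in probes.items():
        try:
            context[field] = probe()
        except Exception as e:
            # Best-effort: never crash the session, but record why the
            # field is missing so --debug runs can explain the gap.
            logger.debug(
                "system context probe %r failed: %s", field, e, exc_info=True
            )
    return context
```

The broad `except` stays, which keeps the session safe; the only change is that a missing field now has a traceable cause.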

🟢 Minor: PR test-plan checkbox is unverifiable

The "Lemonade unavailable at startup raises RuntimeError (no silent fallback)" checkbox is a useful contract test — please add it as an actual pytest case (tests/unit/test_memory_mixin.py) once the silent-fallback issue is resolved one way or the other, so the regression is locked in.
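One possible shape for that contract test, assuming the fail-loudly resolution is chosen, is sketched below. `MemoryStub`, `_probe_embeddings`, and `memory_enabled` are hypothetical stand-ins; the real test would target the actual mixin and patch whatever call performs the Lemonade probe.

```python
from unittest import mock


class MemoryStub:
    """Hypothetical stand-in for the memory mixin; real names may differ."""

    def _probe_embeddings(self):
        return True  # the real probe would call the Lemonade embedding endpoint

    def init_memory(self):
        try:
            self._probe_embeddings()
        except Exception as e:
            raise RuntimeError(
                f"Lemonade embedding service unreachable: {e}"
            ) from e
        self.memory_enabled = True


def test_init_memory_raises_when_lemonade_unavailable():
    agent = MemoryStub()
    # Simulate Lemonade being down at startup.
    with mock.patch.object(
        agent, "_probe_embeddings", side_effect=ConnectionError("refused")
    ):
        try:
            agent.init_memory()
        except RuntimeError as e:
            assert "refused" in str(e)
        else:
            raise AssertionError("init_memory swallowed the embedding failure")
    # Memory must not be left half-enabled after a failed init.
    assert not hasattr(agent, "memory_enabled")
```

The function is written so pytest would collect it by name; if Option B is chosen instead, a second case asserting the env-gated degrade would lock in that contract as well.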

Strengths

  • Test coverage is genuinely solid. 8 new unit-test files (including test_memory_store.py, test_memory_mixin.py, test_memory_router.py, test_goal_store.py, test_goals_router.py, test_memory_discovery.py, and test_sdk_tool_messages.py) plus 3 integration files. New tests/unit/conftest.py for shared fixtures. This is exemplary for a feature of this scope.
  • 26 eval scenarios under eval/scenarios/memory/ covering cross-session persistence, conflict resolution, adversarial poisoning, capacity stress, retrieval-under-noise, etc. — exactly the right shape for a "second brain" feature.
  • Documentation is comprehensive and lands in the right places: docs/guides/memory.mdx, docs/sdk/sdks/memory.mdx, docs/spec/agent-memory-architecture.md, plus docs/docs.json nav updates. Matches the CLAUDE.md "every new feature must be documented" rule.
  • SQL hygiene is excellent. Parameterized queries throughout memory_store.py. _sanitize_fts5_query correctly strips FTS5 special chars and caps at MAX_FTS_QUERY_LENGTH=500. No injection vectors spotted.
  • Schema migration is idempotent via ALTER TABLE ADD COLUMN with duplicate-column tolerance — clean upgrade path from v1 → v2.
  • Concurrency is handled deliberately: threading.Lock + WAL + busy_timeout=5000 in memory_store.py; double-checked-locking singletons in routers/memory.py and routers/goals.py.
  • Admin endpoints are properly gated by GAIA_MEMORY_ADMIN=1 (memory clear/seed in both REST and MCP surfaces). CI gating via GAIA_MEMORY_DISABLED=1 is a sensible kill-switch.
  • MCP subprocess management is safe: subprocess.Popen([sys.executable, ...]) with no shell=True, terminate→kill timeout, in routers/mcp.py.
  • Pydantic ISO 8601 validation factored into a shared _validate_iso8601 helper — no DRY violations on the API surface.
  • Goal store is correctly isolated in ~/.gaia/goals.db with its own PRAGMA foreign_keys=ON and state machine, decoupled from memory storage.
  • Pending-approval route ordering in routers/goals.py (literal path before {goal_id}) shows the author thought about FastAPI's path-priority semantics.
  • Single source of truth for VALID_CATEGORIES prevents drift between the store, the mixin, and the API layer.
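For readers unfamiliar with the idempotent-migration pattern praised above, here is a minimal sketch using the standard `sqlite3` module. The column list follows the "Schema v2" section of the PR description; the `migrate_to_v2` helper is hypothetical and is not the code in memory_store.py.

```python
import sqlite3

# Columns taken from the PR description's "Schema v2" list.
V2_COLUMNS = [
    ("knowledge", "embedding", "BLOB"),
    ("knowledge", "superseded_by", "TEXT"),
    ("conversations", "consolidated_at", "TEXT"),
]


def migrate_to_v2(conn):
    """Add v2 columns, tolerating columns that already exist so reruns are no-ops."""
    for table, column, col_type in V2_COLUMNS:
        try:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")
        except sqlite3.OperationalError as e:
            # SQLite reports "duplicate column name: <col>" on a rerun.
            if "duplicate column name" not in str(e).lower():
                raise  # anything else is a real migration failure
    conn.commit()
```

Only the duplicate-column error is tolerated; any other `OperationalError` still propagates, which keeps the migration consistent with the fail-loudly rule discussed earlier.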

Verdict

Approve with suggestions. The architecture, tests, docs, and security posture are all well above the bar for a feature this size. The one item that should be addressed before merge is the silent-fallback in memory.py:347-361 — either make it actually raise (so the PR description and test plan are honest) or document the degrade as an explicit, env-gated opt-in. Everything else is polish.

cc @kovtcharov-amd — flagging the CLAUDE.md "no silent fallback" rule violation; not a security issue, but it's the kind of thing that bites later when an outage masks itself as "memory just stopped working." No 🔒 security concerns.


Labels

agents chat Chat SDK changes cli CLI changes consumer Blocks consumer adoption — must ship for the v0.20.0 consumer launch window dependencies Dependency updates devops DevOps/infrastructure changes documentation Documentation changes electron Electron app changes eval Evaluation framework changes mcp MCP integration changes performance Performance-critical changes security Security-sensitive changes stop-the-line PR is foundational; do not merge changes to its frozen paths until it lands tests Test changes
