PhyslibSearch is split into two largely independent parts: an indexing pipeline (run offline, once per corpus update) and a query serving layer (the live API + frontend).
┌──────────────────────────────────────────────────────────────────────┐
│ INDEXING PIPELINE (offline, run once) │
│ │
│ PhysLib Lean source │
│ │ │
│ ▼ │
│ ┌─────────┐ jixia parses .olean files │
│ │ jixia │──► declarations, modules, symbols, dependencies │
│ └─────────┘ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ PostgreSQL │ modules, declarations, symbols, │
│ │ (raw data) │ dependency graph, topological levels │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Gemini (informalize) │ Herald-style formal→natural │
│ │ GEMINI_MODEL │ translation, level by level │
│ └──────────┬────────────┘ │
│ │ informal names + descriptions │
│ ▼ │
│ ┌──────────────┐ │
│ │ PostgreSQL │ informal table │
│ │ (enriched) │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Gemini Embedding │ 3072-dim cosine vectors │
│ │ GEMINI_EMBEDDING_MODEL │ (RETRIEVAL_DOCUMENT task) │
│ └──────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ ChromaDB │ persistent vector store │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ QUERY SERVING (live) │
│ │
│ User query (plain English) │
│ │ │
│ ▼ (optional) │
│ ┌──────────────────────────────┐ │
│ │ QueryExpander (query_expansion.py) │
│ │ HyDE: Gemini GEMINI_FAST_MODEL generates a plausible │
│ │ hypothetical Lean declaration → used as the search text │
│ └──────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Gemini Embedding │ (RETRIEVAL_QUERY task) │
│ └──────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ PhyslibSearchEngine │ engine.py │
│ │ ChromaDB .query() │ nearest-neighbor by cosine dist │
│ │ PostgreSQL record fetch │ joins declaration + informal │
│ └──────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ List[QueryResult] → FastAPI (server.py) → Frontend / CLI │
└──────────────────────────────────────────────────────────────────────┘
The pipeline looks the way it does because the indexed corpus is not text — it is formal Lean code. Generic RAGs (like gbrain) assume the corpus is already prose, so they chunk by paragraph, embed once, and retrieve. That collapses here: a user's plain-English query lives in a completely different distribution than a terse Lean signature, so direct nearest-neighbour search in a single embedding space returns noise. Every non-obvious choice in PhyslibSearch exists to close that gap.
The unit of retrieval is a single Lean declaration (theorem, definition, instance), not a file, module, or fixed-size window. This is because Lean's semantic unit is the declaration — splitting mid-file would break signatures and proofs, while grouping by file would dilute the embedding across unrelated lemmas. jixia already exposes declaration-level structure (name, signature, value, dependencies), so symbol-level chunking comes for free without a custom splitter. See the declaration table below and database/jixia_db.py.
Rather than embedding the raw Lean signature, PhyslibSearch uses Gemini (GEMINI_MODEL, quality-tuned) to generate a natural-language name and description for every declaration, following the Herald "formalization reversal" idea. The final text that gets embedded is a concatenation of the formal and informal forms:
{kind} {name} {signature}
{informal_name}: {informal_description}
(see database/vector_db.py, line 43). This hybrid embedding lets the vector store match both exact-Lean queries (mem_cons_self) and natural-language queries ("an element is in the list it was prepended to"). A pure-formal embedding would miss the second; a pure-informal embedding would miss the first.
The informalization pass is not independent per declaration. Declarations are translated in topological order over the dependency graph (the level table; see database/informalize.py) so that when Gemini translates a theorem, it has already-translated descriptions of every symbol that theorem references, plus the ±2 neighbouring declarations in the same module (find_neighbor). This replicates how a mathematician reads a library: you understand a lemma in terms of things you already understand. A generic RAG has no analogue because prose has no typed dependency graph.
Even with hybrid embeddings, a short user query ("Schrödinger equation") is much shorter and less structured than a full indexed document. HyDE closes this final gap: Gemini (GEMINI_FAST_MODEL, latency-tuned) rewrites the query into a plausible hypothetical Lean declaration, and that is what gets embedded (query_expansion.py). The hypothetical declaration lives in the same distribution as the indexed documents, so nearest-neighbour search becomes meaningful. HyDE is toggleable — for short formal queries where the distribution gap is already small, users can skip it.
The pipeline is graph-heavy: dependency edges, topological levels, neighbour lookups, joins between declarations and their informalizations. That's what relational stores are good at, and the indexing pipeline relies on SQL throughout. ChromaDB, meanwhile, only needs to store vectors and return nearest IDs — it would be wasteful to replicate the full record schema inside a vector DB. At query time Chroma returns IDs, and Postgres assembles the human-readable records via the record view (see engine.py).
Three different Gemini knobs serve three different jobs:
GEMINI_MODEL(defaultgemini-2.5-pro) — offline informalization. Quality matters, latency doesn't; this is a one-time cost per corpus.GEMINI_FAST_MODEL(defaultgemini-2.5-flash) — real-time HyDE expansion on every query. Latency matters.GEMINI_EMBEDDING_MODEL— invoked withtask_type="RETRIEVAL_DOCUMENT"during indexing and"RETRIEVAL_QUERY"at query time (database/embedding.py), letting Gemini tune each embedding for its role in the asymmetric retrieval setup.
| Aspect | Generic notes RAG | PhyslibSearch |
|---|---|---|
| Corpus | Prose the user already wrote | Formal Lean declarations |
| Chunking | By paragraph / section | By declaration (jixia-provided) |
| Embedding input | Raw text | Hybrid (formal + Herald-style informal) |
| Indexing pass | Single embedding pass | Dependency-ordered informalization, then embedding |
| Query expansion | Often unnecessary | HyDE — queries rewritten into Lean syntax |
| Stores | Vector store only | Postgres (graph + records) + Chroma (vectors) |
In short: a generic RAG can treat the corpus as a bag of text because its users already speak the corpus's language. For PhyslibSearch the whole point is that users don't — they speak physics, not Lean — so the pipeline's job is to translate both sides (documents → informal via Herald, queries → formal via HyDE) until they meet in a shared embedding space.
All tables live in the default PostgreSQL schema. PhyslibSearch adds its own physlibsearch schema for operational tables.
| Table | Key columns | Description |
|---|---|---|
module |
name JSONB PK, content BYTEA, docstring TEXT |
One row per Lean module |
symbol |
name JSONB PK, module_name, kind, type, is_prop |
Every symbol in the corpus |
declaration |
(module_name, index) PK, name, visible, kind, signature, value |
Syntactic declarations |
dependency |
(source, target, on_type) |
Symbol dependency edges |
level |
symbol_name PK, level INT |
Topological order (used for batch informalization) |
| Table | Key columns | Description |
|---|---|---|
informal |
symbol_name PK, name TEXT, description TEXT |
Natural-language name + description per symbol |
| Table | Key columns | Description |
|---|---|---|
physlibsearch.query |
id UUID PK, query TEXT, time |
Search query log |
physlibsearch.feedback |
(query_id, declaration_name) PK, action TEXT |
User feedback |
| View | Description |
|---|---|
record |
Joins declaration + informal + symbol — the shape returned by the API |
The database/informalize.py module translates formal Lean declarations into natural language using a level-by-level strategy:
- Declarations are processed in topological order (
leveltable), so dependencies are always translated before dependants. - For each declaration, Gemini receives:
- The formal signature and value
- Neighbouring declarations in the same module (±2 positions)
- Already-translated dependency descriptions
- Gemini returns an
informal_name(a short, human-readable label) and aninformal_description(a LaTeX-rendered explanation).
This follows the Herald approach to formalization reversal.
Hypothetical Document Embeddings (HyDE) is a retrieval technique that closes the distribution gap between short user queries and longer indexed documents:
- Gemini (
GEMINI_FAST_MODEL) is prompted to generate a plausible hypothetical Lean declaration that would answer the user's query. - The hypothetical declaration is embedded instead of the raw query.
- Because the hypothetical text is in the same format as the indexed documents, the embedding is much closer to relevant results in vector space.
HyDE is optional — users can toggle it off in the UI or skip it in the CLI.
| Method | Path | Rate limit | Description |
|---|---|---|---|
POST |
/search |
1/s (default) | Search declarations; accepts a list of queries |
POST |
/expand |
15/min | HyDE query expansion |
POST |
/fetch |
10/s | Fetch specific declarations by Lean name |
GET |
/modules |
30/min | List all indexed modules with declaration counts |
POST |
/modules/declarations |
30/min | All declarations in a module |
POST |
/feedback |
— | Log click/thumbs feedback |
Request/response shapes are defined by Pydantic models in server.py and TypeScript types in frontend/src/types/index.ts.
| Layer | Technology | Why |
|---|---|---|
| Lean parser | jixia | Purpose-built for extracting structured data from Lean 4 .olean files |
| Relational store | PostgreSQL + psycopg 3 | Rich querying for the indexing pipeline; async-capable |
| Vector store | ChromaDB | Simple persistent embedding store; no separate service required |
| Embeddings + LLM | Google Gemini | Single API key for embeddings, informalization, and query expansion |
| Backend | FastAPI | Async, Pydantic-native, minimal boilerplate |
| Frontend | Next.js 16 + HeroUI v3 | App Router, React 19, Tailwind CSS 4 |
| Rate limiting | slowapi | Per-endpoint limits to protect the Gemini quota |