This repository contains Luci, an open-source foundation for a controlled AI assistant focused on UK accounting, tax, and finance regulations.
Luci is not a general chatbot. The core objective is a system that enforces a clear source hierarchy, stays within scope, and stops when there is not enough evidence.
This project started as a case-study MVP. It is now positioned as a reusable reference implementation that others can fork, extend, and adapt for grounded domain-specific assistants.
- Start from the home screen and open a conversation focused on a narrow regulatory question.
- Ask for a grounded answer and verify that the response stays within the configured domain boundaries.
- Inspect the cited sources and confidence signals instead of relying on free-form model output alone.
- Manage approved domains and path prefixes in the link-tree screen before allowing crawl-backed retrieval.
- Answers only within accounting, tax, finance, and company-obligation topics.
- Enforces the source hierarchy: `seed docs > user link tree > official gov > LLM`.
- Standardizes source attribution through citation lists and URLs.
- Avoids overclaiming when evidence is weak and asks for missing information.
- Analyzes user-uploaded files (PDF/DOCX/DOC/TXT) in a regulatory context.
| Use case | What Luci already provides |
|---|---|
| Controlled LLM behavior | Scope guard + deterministic policy layer + prompt-injection sanitization |
| Source-aware answering | SQL priority order, source-authority labels, domain/path allowlist |
| Hallucination reduction | Fail-closed scope, no unofficial web access, insufficient-evidence behavior |
| Retrieval and filtering | pgvector cosine search, topic_area, is_stale/is_active, domain/path-prefix filters |
| Domain-bounded UX | Polite but clear out-of-scope handling |
| Extensible structure | Modular Django apps (chat, knowledge, domains, crawler) |
This repository is intentionally opinionated. It favors grounded answers, constrained retrieval, and explicit policy checks over broad open-ended chat behavior.
If you want a starting point for a domain-specific assistant, Luci gives you:
- a Django backend with clear separation between chat, knowledge, crawling, and domain controls
- a React frontend for chat, citations, uploads, and source management
- a grounded RAG pipeline with layered source priority
- deterministic policy enforcement on top of LLM outputs
- a structure that can be generalized beyond UK accounting if you replace the seed corpus, scope logic, and allowlisted domains
- Framework: Django + Django REST Framework
- Retrieval: PostgreSQL + `pgvector` (HNSW index)
- LLM orchestration: DSPy (`gpt-4.1-mini`, low temperature)
- Application modules:
  - `apps/chat`: conversation flow, guardrails, RAG orchestration
  - `apps/knowledge`: ingestion, chunking, embeddings, retrieval
  - `apps/domains`: user link tree management (allowed domains + paths)
  - `apps/crawler`: domain-restricted crawling and caching
- React + Vite + TypeScript + MUI
- Features:
  - Chat screen and conversation history
  - Citation rendering
  - Document upload
  - User Link Tree management at `/links`
- The request passes through the scope guard (`in_scope` / `out_of_scope`).
- If in scope, pgvector retrieval runs with `topic_area` and source filters.
- If needed, controlled web search is triggered only on allowed sources.
- DSPy generates the answer.
- The policy engine performs the final check for tone, disclaimer cleanup, confidence, citation coverage, and related rules.
- The message, citations, and policy flags are written to the conversation record.
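The request flow above can be sketched as a single orchestration function. This is a minimal illustration, not the code in `apps/chat`: the `Answer` class, the `handle_message` function, its callable parameters, and the `min_chunks` threshold are all hypothetical names introduced here, and the real policy engine does more than collect citations.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Answer:
    text: str
    answer_mode: str
    citations: list[str] = field(default_factory=list)


def handle_message(
    question: str,
    classify_scope: Callable[[str], str],
    retrieve: Callable[[str], list[dict]],
    web_search: Callable[[str], list[dict]],
    generate: Callable[[str, list[dict]], str],
    min_chunks: int = 1,  # hypothetical threshold for "retrieval is too thin"
) -> Answer:
    # Step 1: scope guard. Anything other than "in_scope" stops the flow.
    if classify_scope(question) != "in_scope":
        return Answer("This question is outside the supported topics.", "out_of_scope")

    # Step 2: vector retrieval (stands in for the pgvector query).
    chunks = retrieve(question)

    # Step 3: fall back to controlled web search only when retrieval is thin.
    if len(chunks) < min_chunks:
        chunks = chunks + web_search(question)

    # No evidence at all: stop instead of letting the model guess.
    if not chunks:
        return Answer("There is not enough evidence to answer.", "insufficient_evidence")

    # Step 4: generation (stands in for the DSPy call).
    text = generate(question, chunks)

    # Step 5: a final policy pass would run here; this sketch only collects citations.
    citations = [chunk["url"] for chunk in chunks]
    return Answer(text, "grounded", citations)
```

Passing the stages in as callables keeps the sketch runnable with simple stubs and makes the early-exit points (out of scope, no evidence) easy to see.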
When a document is uploaded, Luci uses a separate document-analysis pipeline instead of the normal RAG path.
In `apps/knowledge/services/retriever.py`, SQL uses `ORDER BY priority ASC, similarity DESC`:
1. `seed`
2. `user_link_tree`
3. `official_gov`
4. `user_upload`
5. everything else

This ensures the model sees context in the project's intended source priority order.
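The effect of that ordering can be reproduced in plain Python. The sketch below is an illustration only: the numeric priority values, the `Chunk` class, and `order_chunks` are assumptions made for this example, and the actual retriever performs the ordering in SQL.

```python
from dataclasses import dataclass

# Lower number = higher priority, mirroring ORDER BY priority ASC.
# The concrete numbers are assumed; only the relative order comes from the list above.
SOURCE_PRIORITY = {"seed": 1, "user_link_tree": 2, "official_gov": 3, "user_upload": 4}
DEFAULT_PRIORITY = 5  # everything else


@dataclass
class Chunk:
    source_type: str
    similarity: float
    text: str


def order_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """Sort by source priority first, then by similarity (highest first)."""
    return sorted(
        chunks,
        key=lambda c: (SOURCE_PRIORITY.get(c.source_type, DEFAULT_PRIORITY), -c.similarity),
    )
```

Because priority is the first sort key, a seed chunk with lower similarity still ranks ahead of a more similar official-gov chunk; similarity only breaks ties within one source tier.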
In `apps/chat/services/rag_pipeline.py`, context is passed to the LLM with these labels:
- `[TIER-1 SEED]`
- `[TIER-2 USER LINK TREE]`
- `[TIER-3 OFFICIAL GOV]`
- `[USER DOCUMENT]`
- `[TIER-4 LLM CONTEXT]`

The DSPy signatures in `dspy_modules/signatures.py` also enforce this hierarchy explicitly.
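As a rough illustration of how such labels might be attached, the sketch below prefixes each retrieved chunk with a tier label before joining the chunks into one context string. The mapping from `source_type` values to labels, the fallback label choice, and the `build_context` helper are assumptions for this example and may differ from what `rag_pipeline.py` actually does.

```python
# Assumed mapping from source_type to the context labels listed above.
LABELS = {
    "seed": "[TIER-1 SEED]",
    "user_link_tree": "[TIER-2 USER LINK TREE]",
    "official_gov": "[TIER-3 OFFICIAL GOV]",
    "user_upload": "[USER DOCUMENT]",
}
FALLBACK_LABEL = "[TIER-4 LLM CONTEXT]"  # assumed label for anything unmapped


def build_context(chunks: list[tuple[str, str]]) -> str:
    """Turn (source_type, text) pairs into one labelled context string."""
    blocks = []
    for source_type, text in chunks:
        label = LABELS.get(source_type, FALLBACK_LABEL)
        blocks.append(f"{label}\n{text.strip()}")
    # A blank line between blocks keeps each labelled chunk visually separate.
    return "\n\n".join(blocks)
```

Keeping the label on its own line above each chunk lets the prompt refer to tiers by name without parsing the chunk text.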
- General LLM knowledge is not treated as a primary correctness source.
- If seed, user-link-tree, or official-gov data exists, answers must rely on those sources.
- If evidence is insufficient, the policy layer can switch the response into `insufficient_evidence` mode.
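The rules above amount to a small deterministic decision. A minimal sketch follows, assuming a hypothetical `choose_answer_mode` helper and an illustrative `min_coverage` threshold of 0.5; the real policy layer may use different signals and thresholds.

```python
def choose_answer_mode(
    in_scope: bool,
    grounded_chunks: int,
    citation_coverage: float,
    min_coverage: float = 0.5,  # illustrative threshold, not taken from the project
) -> str:
    """Pick an answer mode without consulting the LLM."""
    if not in_scope:
        return "out_of_scope"
    # No seed, link-tree, or official-gov evidence, or too little of the
    # answer backed by citations: refuse to present it as grounded.
    if grounded_chunks == 0 or citation_coverage < min_coverage:
        return "insufficient_evidence"
    return "grounded"
```

For example, `choose_answer_mode(True, 3, 0.2)` returns `"insufficient_evidence"` because coverage falls below the assumed threshold even though chunks were retrieved.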
- `KnowledgeDocument`: source metadata (`source_type`, `source_authority`, `topic_area`, `is_stale`)
- `DocumentChunk`: chunk text plus embedding
- `is_active=True`
- `is_stale=False`
- `topic_area` match based on the scope result
- Optional source filtering
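To make the filter-then-rank behavior concrete, here is an in-memory sketch in plain Python. It is not the project's query: the real retrieval runs in PostgreSQL with pgvector, and the flattened `Doc` class (which merges document metadata and chunk embedding into one record), `cosine_similarity`, `search`, and `top_k` are names assumed for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    topic_area: str
    is_active: bool
    is_stale: bool
    source_type: str
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # zero vectors have no direction


def search(docs, query_embedding, topic_area, source_types=None, top_k=5):
    """Apply the metadata filters first, then rank survivors by cosine similarity."""
    candidates = [
        d for d in docs
        if d.is_active
        and not d.is_stale
        and d.topic_area == topic_area
        and (source_types is None or d.source_type in source_types)
    ]
    scored = [(cosine_similarity(d.embedding, query_embedding), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Filtering before scoring means stale or inactive documents can never appear in results, no matter how similar their embeddings are to the query.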
- Default allowed sources: `gov.uk`, `hmrc.gov.uk`, `companieshouse.gov.uk`
- User-approved link tree: `AllowedDomain` + `allowed_path_prefixes`
- Path-level validation enforced by the domain validator
- Forbidden sources such as blogs, forums, and social platforms are blocked via `SEARCH_FORBIDDEN_DOMAINS`

This design enforces an allowlist-first retrieval and browsing model.
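An allowlist-first check of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the project's domain validator: `host_matches`, `path_allowed`, and `is_url_allowed` are hypothetical helpers, subdomain matching is assumed, and the forbidden set is passed in as a parameter instead of being read from `SEARCH_FORBIDDEN_DOMAINS`.

```python
from urllib.parse import urlsplit

DEFAULT_ALLOWED = {"gov.uk", "hmrc.gov.uk", "companieshouse.gov.uk"}


def host_matches(host: str, domain: str) -> bool:
    # Assumption: subdomains of an approved domain are also approved.
    return host == domain or host.endswith("." + domain)


def path_allowed(path: str, prefixes: list[str]) -> bool:
    # Match whole path segments so "/tax" does not approve "/taxonomy".
    for prefix in prefixes:
        base = prefix.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return True
    return False


def is_url_allowed(url: str, user_domains: dict[str, list[str]], forbidden: set[str]) -> bool:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme not in ("http", "https") or not host:
        return False
    # Forbidden domains win over every allow rule.
    if any(host_matches(host, d) for d in forbidden):
        return False
    if any(host_matches(host, d) for d in DEFAULT_ALLOWED):
        return True
    # User link tree: both the domain and a path prefix must match.
    path = parts.path or "/"
    return any(
        host_matches(host, domain) and path_allowed(path, prefixes)
        for domain, prefixes in user_domains.items()
    )
```

Anything that matches no rule is rejected, which is what makes the model allowlist-first: a new source has to be approved explicitly before it can be crawled or cited.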
- Fail-closed scope: if scope evaluation fails, the query is treated as out of scope.
- Prompt-injection cleanup: patterns such as `ignore instructions` and `as an ai` are removed.
- Deterministic policy layer:
  - removes AI-disclaimer lines
  - softens overconfident final-decision language
  - sets `answer_mode` to `grounded`, `out_of_scope`, or `insufficient_evidence`
  - calibrates confidence and computes citation coverage
- Source follow-up shortcut: when the user only asks for the source, previous citations are returned directly instead of making an unnecessary new LLM call.
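Two of the guardrails above, fail-closed scope and prompt-injection cleanup, are simple enough to sketch directly. The regular expressions, `sanitize`, and `safe_scope` below are illustrative assumptions; the project's actual pattern list and scope classifier are likely broader.

```python
import re
from typing import Callable

# Illustrative patterns only; a real list would be longer.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous |prior )?instructions", re.IGNORECASE),
    re.compile(r"as an ai\b", re.IGNORECASE),
]


def sanitize(text: str) -> str:
    """Strip known injection phrases, then collapse leftover whitespace."""
    for pattern in INJECTION_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def safe_scope(question: str, classifier: Callable[[str], str]) -> str:
    """Fail closed: any error or unexpected label counts as out of scope."""
    try:
        result = classifier(question)
    except Exception:
        return "out_of_scope"
    return result if result in ("in_scope", "out_of_scope") else "out_of_scope"
```

The important property of `safe_scope` is that the only path to `"in_scope"` is an explicit, well-formed classifier result; timeouts, exceptions, and malformed labels all end in refusal.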
Core endpoints:
- `GET/POST /api/v1/chat/conversations/`
- `GET/DELETE /api/v1/chat/conversations/{uuid}/`
- `GET/POST /api/v1/chat/conversations/{uuid}/messages/`
- `POST /api/v1/chat/conversations/{uuid}/upload/`
- `GET /api/v1/chat/health/`
- `GET/POST /api/v1/domains/`
- `PUT/DELETE /api/v1/domains/{id}/`
- `GET /api/v1/knowledge/documents/`
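As a hedged example of calling the messages endpoint from Python, the sketch below builds a request with the standard library only. The JSON field name `content` is an assumption, since the request schema is not documented here, and `build_message_request` is a hypothetical helper; check the serializer in `apps/chat` for the real payload shape.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000/api/v1"  # default backend URL in local development


def build_message_request(conversation_id: str, content: str, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a POST to the conversation messages endpoint."""
    url = f"{base_url}/chat/conversations/{conversation_id}/messages/"
    # "content" is an assumed field name, not confirmed by the project docs.
    body = json.dumps({"content": content}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` sends it to a running backend; building the request separately makes the URL and payload easy to inspect first.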
- Python `3.12+`
- `uv`
- Node `18+`
- `pnpm`
- PostgreSQL `16+` + `pgvector`

```bash
cp .env.example .env
```

At minimum, verify these values in `.env`:
- `OPENAI_API_KEY` (required)
- `JINA_API_KEY` (optional, but recommended for web search)
- Database settings

```bash
docker compose up -d db
```

In this compose setup, the DB values are:
- `POSTGRES_DB=luci_db`
- `POSTGRES_USER=luci_user`
- `POSTGRES_PASSWORD=luci_pass`
- `POSTGRES_HOST=localhost` (for a backend running on the host)
- `POSTGRES_PORT=5432`
```bash
uv sync
uv run python manage.py migrate
uv run python manage.py import_source_map --file train_data/target_data_source.csv
uv run python manage.py ingest_seeds --dir train_data/
uv run python manage.py runserver 0.0.0.0:8000
```

```bash
cd frontend
pnpm install
pnpm dev
```

Default backend URL: `http://localhost:8000/api/v1`

Frontend URL (Vite): `http://localhost:5173`
If you want to run the backend inside a container, set the DB host in `.env` to `db`:
- `POSTGRES_HOST=db`
- `POSTGRES_USER=luci_user`
- `POSTGRES_PASSWORD=luci_pass`

Then run:

```bash
docker compose up --build web
```

For seed ingestion:

```bash
docker compose --profile ingest up worker
```

Backend tests:

```bash
uv run pytest
```

Note: if you use PostgreSQL for tests, make sure the `vector` extension is available in the test database as well.
- The MVP prioritizes correctness and control over creativity.
- External search is not open-ended; only allowlisted sources are used.
- PostgreSQL + pgvector was chosen instead of a separate vector database for simpler setup and sufficient performance.
- `gpt-4.1-mini` was selected as the model for cost-quality balance and faster iteration.
- Upload analysis is kept in a separate pipeline so document review does not get mixed into the general RAG flow.
- No authentication or role-based access control.
- Citation coverage is heuristic-based and does not guarantee legal certainty.
- Regulatory freshness depends on the seed and crawl refresh processes.
If you want to reuse Luci as a foundation for another domain, the main pieces to replace are:
- the seed documents and ingestion inputs in `train_data/` and `seed_documents/`
- the scope detection and prompt signatures in `dspy_modules/`
- the allowed-domain and crawl rules in `apps/domains/` and `apps/crawler/`
- the frontend copy and UX constraints for your target use case

The overall control pattern remains the same: scope first, retrieve from approved sources, answer with citations, then enforce policy.


