Luci AI Chatbot (MVP)

This repository contains an open-source foundation for a controlled AI assistant focused on UK accounting, tax, and finance regulations.

Luci is not a general chatbot. The core objective is a system that enforces a clear source hierarchy, stays within scope, and stops when there is not enough evidence.

Originally, this project started as a case-study MVP. It is now positioned as a reusable reference implementation that others can fork, extend, and adapt for grounded domain-specific assistants.

UI Preview

Home

Grounded Conversation

Allowed Links Management

Quick Demo Flow

Start from the home screen and open a conversation focused on a narrow regulatory question.
Ask for a grounded answer and verify that the response stays within the configured domain boundaries.
Inspect the cited sources and confidence signals instead of relying on free-form model output alone.
Manage approved domains and path prefixes in the link-tree screen before allowing crawl-backed retrieval.

1) Project Goal

Answers only within accounting, tax, finance, and company-obligation topics.
Enforces the source hierarchy: seed docs > user link tree > official gov > LLM.
Standardizes source attribution through citation lists and URLs.
Avoids overclaiming when evidence is weak and asks for missing information.
Analyzes user-uploaded files (PDF/DOCX/DOC/TXT) in a regulatory context.

2) What This Repository Is Good For

Use case	What Luci already provides
Controlled LLM behavior	Scope guard + deterministic policy layer + prompt-injection sanitization
Source-aware answering	SQL priority order, source-authority labels, domain/path allowlist
Hallucination reduction	Fail-closed scope, no unofficial web access, insufficient-evidence behavior
Retrieval and filtering	`pgvector` cosine search, `topic_area`, `is_stale/is_active`, domain/path-prefix filters
Domain-bounded UX	Polite but clear out-of-scope handling
Extensible structure	Modular Django apps (`chat`, `knowledge`, `domains`, `crawler`)

3) Open-Source Positioning

This repository is intentionally opinionated. It favors grounded answers, constrained retrieval, and explicit policy checks over broad open-ended chat behavior.

If you want a starting point for a domain-specific assistant, Luci gives you:

a Django backend with clear separation between chat, knowledge, crawling, and domain controls
a React frontend for chat, citations, uploads, and source management
a grounded RAG pipeline with layered source priority
deterministic policy enforcement on top of LLM outputs
a structure that can be generalized beyond UK accounting if you replace the seed corpus, scope logic, and allowlisted domains

4) Architecture

Backend

Framework: Django + Django REST Framework
Retrieval: PostgreSQL + pgvector (HNSW index)
LLM orchestration: DSPy (gpt-4.1-mini, low temperature)
Application modules:
- apps/chat: conversation flow, guardrails, RAG orchestration
- apps/knowledge: ingestion, chunking, embeddings, retrieval
- apps/domains: user link tree management (allowed domains + paths)
- apps/crawler: domain-restricted crawling and caching

Frontend

React + Vite + TypeScript + MUI
Features:
- Chat screen and conversation history
- Citation rendering
- Document upload
- User Link Tree management at /links

Request Flow Summary

The request passes through the scope guard (in_scope / out_of_scope).
If in scope, pgvector retrieval runs with topic_area and source filters.
If needed, controlled web search is triggered only on allowed sources.
DSPy generates the answer.
The policy engine performs the final check for tone, disclaimer cleanup, confidence, citation coverage, and related rules.
The message, citations, and policy flags are written to the conversation record.

When a document is uploaded, Luci uses a separate document-analysis pipeline instead of the normal RAG path.

5) How Source Prioritization Is Enforced

A. Deterministic retrieval ordering

In apps/knowledge/services/retriever.py, SQL uses ORDER BY priority ASC, similarity DESC:

seed
user_link_tree
official_gov
user_upload
everything else

This ensures the model sees context in the project's intended source priority order.

B. Context labeling

In apps/chat/services/rag_pipeline.py, context is passed to the LLM with these labels:

[TIER-1 SEED]
[TIER-2 USER LINK TREE]
[TIER-3 OFFICIAL GOV]
[USER DOCUMENT]
[TIER-4 LLM CONTEXT]

The DSPy signatures in dspy_modules/signatures.py also enforce this hierarchy explicitly.

C. Conflict handling

General LLM knowledge is not treated as a primary correctness source.
If seed, user-link-tree, or official-gov data exists, answers must rely on those sources.
If evidence is insufficient, the policy layer can switch the response into insufficient_evidence mode.

6) Data Access and Filtering Model

Knowledge storage

KnowledgeDocument: source metadata (source_type, source_authority, topic_area, is_stale)
DocumentChunk: chunk text plus embedding

Retrieval filters

is_active=True
is_stale=False
topic_area match based on the scope result
Optional source filtering

Web and crawl access rules

Default allowed sources: gov.uk, hmrc.gov.uk, companieshouse.gov.uk
User-approved link tree: AllowedDomain + allowed_path_prefixes
Path-level validation enforced by the domain validator
Forbidden sources such as blogs, forums, and social platforms are blocked via SEARCH_FORBIDDEN_DOMAINS

This design enforces an allowlist-first retrieval and browsing model.

7) Hallucination-Reduction Design

Fail-closed scope: if scope evaluation fails, the query is treated as out of scope.
Prompt-injection cleanup: patterns such as ignore instructions and as an ai are removed.
Deterministic policy layer:
- removes AI-disclaimer lines
- softens overconfident final-decision language
- sets answer_mode to grounded, out_of_scope, or insufficient_evidence
- calibrates confidence and computes citation coverage
Source follow-up shortcut: when the user only asks for the source, previous citations are returned directly instead of making an unnecessary new LLM call.

8) API Surface

Core endpoints:

GET/POST /api/v1/chat/conversations/
GET/DELETE /api/v1/chat/conversations/{uuid}/
GET/POST /api/v1/chat/conversations/{uuid}/messages/
POST /api/v1/chat/conversations/{uuid}/upload/
GET /api/v1/chat/health/
GET/POST /api/v1/domains/
PUT/DELETE /api/v1/domains/{id}/
GET /api/v1/knowledge/documents/

9) Local Setup and Run

Prerequisites

Python 3.12+
uv
Node 18+
pnpm
PostgreSQL 16+ + pgvector

8.1 Environment variables

cp .env.example .env

At minimum, verify these values in .env:

OPENAI_API_KEY (required)
JINA_API_KEY (optional, but recommended for web search)
Database settings

8.2 Start the database (recommended quick path)

docker compose up -d db

In this compose setup, the DB values are:

POSTGRES_DB=luci_db
POSTGRES_USER=luci_user
POSTGRES_PASSWORD=luci_pass
POSTGRES_HOST=localhost (for a backend running on the host)
POSTGRES_PORT=5432

8.3 Backend

uv sync
uv run python manage.py migrate
uv run python manage.py import_source_map --file train_data/target_data_source.csv
uv run python manage.py ingest_seeds --dir train_data/
uv run python manage.py runserver 0.0.0.0:8000

8.4 Frontend

cd frontend
pnpm install
pnpm dev

Default backend URL: http://localhost:8000/api/v1
Frontend URL (Vite): http://localhost:5173

10) Optional Docker Flow

If you want to run the backend inside a container, set the DB host in .env to db:

POSTGRES_HOST=db
POSTGRES_USER=luci_user
POSTGRES_PASSWORD=luci_pass

Then run:

docker compose up --build web

For seed ingestion:

docker compose --profile ingest up worker

11) Testing

Backend tests:

uv run pytest

Note: if you use PostgreSQL for tests, make sure the vector extension is available in the test database as well.

12) Assumptions and Technical Rationale

The MVP prioritizes correctness and control over creativity.
External search is not open-ended; only allowlisted sources are used.
PostgreSQL + pgvector was chosen instead of a separate vector database for simpler setup and sufficient performance.
gpt-4.1-mini was selected as the model for cost-quality balance and faster iteration.
Upload analysis is kept in a separate pipeline so document review does not get mixed into the general RAG flow.

13) Known Limits

No authentication or role-based access control.
Citation coverage is heuristic-based and does not guarantee legal certainty.
Regulatory freshness depends on the seed and crawl refresh processes.

14) Adapting This Repository

If you want to reuse Luci as a foundation for another domain, the main pieces to replace are:

the seed documents and ingestion inputs in train_data/ and seed_documents/
the scope detection and prompt signatures in dspy_modules/
the allowed-domain and crawl rules in apps/domains/ and apps/crawler/
the frontend copy and UX constraints for your target use case

The overall control pattern remains the same: scope first, retrieve from approved sources, answer with citations, then enforce policy.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.vscode		.vscode
apps		apps
dspy_modules		dspy_modules
frontend		frontend
luci		luci
seed_documents		seed_documents
train_data		train_data
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
luci-chat.png		luci-chat.png
luci-home.png		luci-home.png
luci-links.png		luci-links.png
manage.py		manage.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Luci AI Chatbot (MVP)

UI Preview

Home

Grounded Conversation

Allowed Links Management

Quick Demo Flow

1) Project Goal

2) What This Repository Is Good For

3) Open-Source Positioning

4) Architecture

Backend

Frontend

Request Flow Summary

5) How Source Prioritization Is Enforced

A. Deterministic retrieval ordering

B. Context labeling

C. Conflict handling

6) Data Access and Filtering Model

Knowledge storage

Retrieval filters

Web and crawl access rules

7) Hallucination-Reduction Design

8) API Surface

9) Local Setup and Run

Prerequisites

8.1 Environment variables

8.2 Start the database (recommended quick path)

8.3 Backend

8.4 Frontend

10) Optional Docker Flow

11) Testing

12) Assumptions and Technical Rationale

13) Known Limits

14) Adapting This Repository

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages