Status: Proposed
Date: 2026-03-03
Deciders: Ian (Project Lead)
We are building an open-source, fully local assistant for querying SEBI (Securities and Exchange Board of India) circulars and master circulars. The Month-1 milestone focuses on:
- A focused crawler for SEBI's circular listing pages (HTML + PDF artifacts).
- A parser/normalizer pipeline that produces structured JSON from raw HTML and PDFs.
- A chunker tuned for legal/regulatory text (clause-aware, metadata-enriched).
- A hybrid retrieval layer (BM25 sparse + dense embeddings via FAISS).
- A baseline RAG pipeline wiring retrieval to a generic open-source SLM.
- An evaluation suite of 30-50 hand-crafted compliance questions.
The system must run entirely locally (no cloud dependencies beyond an optional LLM API endpoint), respect SEBI's robots.txt and usage policies, and be modular enough to later port to a browser-based SLM.
┌─────────────────────────────────────────────────────────────────┐
│                          CLI / Web UI                           │
│        cli/ingest.py    cli/query.py    cli/eval_cli.py         │
└──────────┬──────────────────┬──────────────────┬────────────────┘
           │                  │                  │
           ▼                  ▼                  ▼
┌──────────────────┐  ┌────────────────┐  ┌────────────────────────┐
│ Ingest Pipeline  │  │ Query Pipeline │  │  Evaluation Pipeline   │
│                  │  │                │  │                        │
│ 1. Crawl         │  │ 1. Retrieve    │  │ 1. Load questions      │
│ 2. Parse         │  │    (hybrid)    │  │ 2. Run RAG pipeline    │
│ 3. Chunk         │  │ 2. Build prompt│  │ 3. Score recall@k      │
│ 4. Index         │  │ 3. Call SLM    │  │ 4. Report              │
└──────┬───────────┘  └───────┬────────┘  └───────┬────────────────┘
       │                      │                   │
       ▼                      ▼                   │
┌──────────────────────────────────────────┐      │
│                Data Layer                │      │
│                                          │      │
│ data/raw/            Raw HTML + PDFs     │      │
│ data/raw/index.jsonl Crawl manifest      │      │
│ data/processed/      Structured JSON     │      │
│ data/chunks/         chunks.jsonl        │      │
│ data/indexes/        BM25 + FAISS indexes│      │
│ eval/questions.jsonl                     │◄─────┘
│ eval/results/        Per-run results     │
└──────────────────────────────────────────┘
SEBI Website
     │
     ▼  [crawler/sebi_crawler.py]
data/raw/index.jsonl           ← one line per circular (metadata + paths)
data/raw/{year}/{sebi_id}/     ← page.html, circular.pdf
     │
     ▼  [parser/sebi_parser.py]
data/processed/{sebi_id}.json  ← structured JSON with sections
     │
     ▼  [chunker/sebi_chunker.py]
data/chunks/chunks.jsonl       ← one chunk per line, with metadata
     │
     ▼  [index/build_index.py]
data/indexes/bm25.pkl          ← serialized BM25 index
data/indexes/faiss.index       ← FAISS vector index
data/indexes/chunk_meta.pkl    ← chunk metadata for post-filtering
     │
     ▼  [rag/qa_pipeline.py + index/retriever.py]
User Query → Hybrid Retrieval → Prompt Assembly → SLM → Answer + Citations
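One line of chunks.jsonl might look like the sketch below. The field names and values are illustrative assumptions, not a schema fixed by this ADR; the actual record shape is decided in chunker/sebi_chunker.py.

```python
import json

# Hypothetical shape of one line in data/chunks/chunks.jsonl.
# All field names and values here are invented for illustration.
chunk = {
    "chunk_id": "SEBI-HO-MIRSD-2023-001_s2_c0",  # circular ID + section + chunk ordinal
    "sebi_id": "SEBI/HO/MIRSD/2023/001",
    "section": "2. Applicability",
    "domain_tags": ["intermediaries", "kyc"],
    "source_path": "data/raw/2023/SEBI-HO-MIRSD-2023-001/circular.pdf",
    "text": "All registered intermediaries shall ...",
    "n_tokens": 312,
}

# One JSON object per line keeps the file streamable and appendable.
line = json.dumps(chunk, ensure_ascii=False)
print(line)
```

Keeping SEBI ID, section, and domain tags on every chunk is what makes post-filtering in the retriever possible without a database.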
| Module | File(s) | Responsibility |
|---|---|---|
| crawler/ | sebi_crawler.py, robots_check.py | Crawl SEBI listing pages, download HTML/PDF, produce index.jsonl. Respects robots.txt, rate-limits, idempotent. |
| parser/ | sebi_parser.py, pdf_extractor.py, html_extractor.py | Extract structured text from raw HTML/PDF. Handle encoding, clause numbering. Output: data/processed/{id}.json. |
| chunker/ | sebi_chunker.py | Split parsed circulars into retrieval-sized chunks (~200-400 tokens). Attach metadata (SEBI ID, section, domain tags). Output: chunks.jsonl. |
| index/ | build_index.py, retriever.py | Build BM25 + FAISS indexes. Implement hybrid fusion retrieval with post-filtering. |
| rag/ | qa_pipeline.py, llm_client.py, prompts.py | Wire retrieval to the SLM. Prompt engineering for the SEBI domain. Abstract LLM client for swappability. |
| eval/ | run_eval.py, questions.jsonl | Load eval questions, run the pipeline, compute recall@k, log results. |
| cli/ | ingest.py, query.py, eval_cli.py | CLI entry points for the ingest, query, and eval commands. |
| config/ | settings.yaml | Central configuration for all tunables (model name, chunk size, fusion weights, rate limits, etc.). |
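As a sketch of the robots_check.py responsibility, the standard-library RobotFileParser can enforce robots.txt rules, combined with a simple rate limit. The robots.txt content below is invented for illustration; the real module would fetch it once from the site root.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "SEBICircularBot/1.0"

# Invented robots.txt rules, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def polite_fetch_allowed(url: str, last_fetch: float, rate_limit: float = 2.0) -> bool:
    """True if `url` may be fetched now, honoring robots.txt and the rate limit."""
    if not rp.can_fetch(USER_AGENT, url):
        return False
    return (time.monotonic() - last_fetch) >= rate_limit

print(rp.can_fetch(USER_AGENT, "https://www.sebi.gov.in/legal/circulars"))  # allowed
print(rp.can_fetch(USER_AGENT, "https://www.sebi.gov.in/private/x"))        # disallowed
```

Idempotency (skipping already-downloaded circulars via index.jsonl) would sit on top of this check in sebi_crawler.py.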
Option A: rank_bm25 (in-memory BM25) + FAISS

| Dimension | Assessment |
|---|---|
| Complexity | Low — pure Python BM25, well-known embedding library |
| Cost | Zero — all local, no API fees for retrieval |
| Scalability | Good for ~10K circulars; FAISS scales to millions |
| Team familiarity | High — widely documented, large community |
| Legal-RAG fit | Good — BM25 excels at exact legal term matching |
Pros: Simple to implement; fast iteration; no external services; BM25 handles legal jargon well (exact matches on clause numbers, SEBI IDs); FAISS provides fast ANN search.
Cons: rank_bm25 is in-memory only (fine for the Month-1 corpus of ~5K-10K chunks), has no built-in persistence (we serialize with pickle), and has no built-in filtering (we post-filter).
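The hybrid fusion this option gives us control over can be sketched in pure Python. The scores below are toy values and the function names are illustrative; the real implementation lives in index/retriever.py.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(bm25: dict[str, float], dense: dict[str, float],
         w_bm25: float = 0.4, w_dense: float = 0.6, top_k: int = 10) -> list[str]:
    """Weighted-sum fusion over the union of candidate chunk IDs, best first."""
    b, d = min_max_normalize(bm25), min_max_normalize(dense)
    fused = {cid: w_bm25 * b.get(cid, 0.0) + w_dense * d.get(cid, 0.0)
             for cid in set(b) | set(d)}
    return sorted(fused, key=fused.get, reverse=True)[:top_k]

# Toy scores: "c2" appears in both candidate lists, so it wins after fusion.
print(fuse({"c1": 7.1, "c2": 6.9}, {"c2": 0.83, "c3": 0.80}))  # → ['c2', 'c1', 'c3']
```

Because the fusion is a dozen lines of our own code, swapping weighted_sum for reciprocal rank fusion is a local change rather than a service reconfiguration.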
Option B: Elasticsearch

| Dimension | Assessment |
|---|---|
| Complexity | Medium — requires running a JVM service |
| Cost | Low but heavier (RAM/CPU for ES) |
| Scalability | Excellent — production-grade |
| Team familiarity | Medium |
Pros: Built-in hybrid search, filtering, persistence, pagination.
Cons: Overkill for Month 1 (< 10K docs), adds operational burden (JVM, config, Docker), slower iteration cycle, harder to port to browser later.
Option C: ChromaDB

| Dimension | Assessment |
|---|---|
| Complexity | Low |
| Cost | Zero |
| Scalability | Moderate |
| Team familiarity | Medium |
Pros: Integrated metadata filtering, simpler API than raw FAISS.
Cons: No built-in BM25 (would need separate sparse index anyway), less control over fusion strategy, smaller community for legal-RAG tuning.
The key trade-off is simplicity + control vs. features:
- Option A gives us maximum control over the fusion strategy and chunking pipeline, which matters for legal-RAG where exact clause references and BM25 term matching are critical. The in-memory limitation is acceptable for Month 1's corpus size (~5K-10K chunks).
- Option B would be the right choice at scale but adds unnecessary complexity for a Month 1 prototype.
- Option C is a reasonable middle ground but the lack of BM25 integration means we'd still need two retrieval paths.
We choose Option A with a clear migration path: the retriever.py interface is abstract enough that swapping to Elasticsearch or ChromaDB later requires changing only the index builder and retriever implementations, not the RAG pipeline or evaluation code.
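The swappable interface described in the decision could be as small as the sketch below (a typing.Protocol is one way to express it; the actual contract in retriever.py may differ):

```python
from typing import Protocol

class Retriever(Protocol):
    """Minimal contract the RAG pipeline depends on. Backends (rank_bm25 + FAISS
    now, Elasticsearch or ChromaDB later) implement this; nothing else leaks out."""
    def retrieve(self, query: str, top_k: int = 10) -> list[dict]:
        """Return up to top_k chunk records, best first."""
        ...

class InMemoryRetriever:
    """Toy backend: exact substring match, standing in for hybrid BM25 + FAISS."""
    def __init__(self, chunks: list[dict]):
        self.chunks = chunks

    def retrieve(self, query: str, top_k: int = 10) -> list[dict]:
        hits = [c for c in self.chunks if query.lower() in c["text"].lower()]
        return hits[:top_k]

# The RAG pipeline only sees the Retriever protocol, never the backend.
r: Retriever = InMemoryRetriever([{"chunk_id": "c1", "text": "KYC norms for intermediaries"}])
print(r.retrieve("kyc"))
```

Keeping the return type as plain chunk records (rather than backend-specific hit objects) is what lets the eval suite score any backend unchanged.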
Positive:
- Fast iteration on chunking strategy and fusion weights (all in Python, no services).
- Complete local execution (no Docker, no JVM, no cloud).
- Clear separation of concerns — each module can be tested and improved independently.
- Evaluation-driven development — the eval suite provides immediate feedback on retrieval changes.
Negative:
- Scaling beyond ~50K chunks will require moving off in-memory BM25 (a Month-2+ concern).
- No built-in metadata filtering in the index layer (we post-filter, which is fine for Month 1 candidate set sizes).
- Persistence is pickle-based (fragile across Python versions, but acceptable for Month 1).
Deferred decisions:
- Embedding model choice once we have eval baselines (we may want BGE-small or a legal-tuned model).
- Chunk size tuning based on recall@k results.
- BM25 backend (move to Pyserini or ES) if corpus grows significantly.
- LLM client implementation once we pick the local SLM runtime.
All tunables live in config/settings.yaml with environment variable overrides:
```yaml
# config/settings.yaml
crawler:
  obey_robots: true
  rate_limit_seconds: 2.0
  max_concurrent: 3
  max_retries: 3
  user_agent: "SEBICircularBot/1.0 (+https://github.com/yourname/sebi-rag)"

parser:
  prefer_html: true        # Use HTML over PDF when both available
  fallback_to_pdf: true
  normalize_unicode: true

chunker:
  max_tokens: 400
  min_tokens: 50
  overlap_tokens: 50
  use_clause_boundaries: true

index:
  embedding_model: "sentence-transformers/all-MiniLM-L6-v2"
  embedding_dim: 384
  faiss_index_type: "FlatIP"    # Inner product (for normalized embeddings = cosine)
  fallback_to_annoy: false
  bm25_tokenizer: "simple"      # or "spacy"

retrieval:
  top_k: 10
  bm25_weight: 0.4
  dense_weight: 0.6
  fusion_method: "weighted_sum"  # or "rrf" (reciprocal rank fusion)

rag:
  llm_backend: "openai_compatible"  # or "local_llamacpp", "ollama"
  llm_model: "mistral-7b-instruct"
  llm_base_url: "http://localhost:11434/v1"
  max_context_chunks: 5
  temperature: 0.1

eval:
  questions_file: "eval/questions.jsonl"
  recall_k_values: [3, 5, 10]
```

- Implement crawler with robots.txt checking and rate limiting
- Implement HTML + PDF parsers with encoding normalization
- Implement clause-aware chunker with domain tag inference
- Build hybrid index (BM25 + FAISS) with serialization
- Wire RAG pipeline with abstract LLM client
- Create 30-50 evaluation questions across SEBI domains
- Build evaluation runner with recall@k metrics
- End-to-end integration test
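The recall@k metric behind the evaluation runner can be sketched as follows. The question record shape is an illustrative assumption; the real schema lives in eval/questions.jsonl.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# One hypothetical eval record (field names are invented for illustration):
question = {
    "question": "What is the timeline for KYC updation by intermediaries?",
    "relevant_chunks": {"c7", "c9"},
}
# Hypothetical ranked output of the hybrid retriever for this question:
retrieved = ["c9", "c3", "c1", "c7", "c5"]

for k in (3, 5, 10):
    print(k, recall_at_k(retrieved, question["relevant_chunks"], k))
# k=3 finds only c9 (recall 0.5); k=5 and k=10 find both (recall 1.0)
```

Logging per-question recall rather than a single aggregate makes it easy to see which SEBI domains the chunker or retriever is failing on.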