
ADR-001: SEBI Circulars Local RAG Assistant — Month 1 Architecture

Status: Proposed
Date: 2026-03-03
Deciders: Ian (Project Lead)


Context

We are building an open-source, fully local assistant for querying SEBI (Securities and Exchange Board of India) circulars and master circulars. The Month-1 milestone focuses on:

  1. A focused crawler for SEBI's circular listing pages (HTML + PDF artifacts).
  2. A parser/normalizer pipeline that produces structured JSON from raw HTML and PDFs.
  3. A chunker tuned for legal/regulatory text (clause-aware, metadata-enriched).
  4. A hybrid retrieval layer (BM25 sparse + dense embeddings via FAISS).
  5. A baseline RAG pipeline wiring retrieval to a generic open-source SLM.
  6. An evaluation suite of 30-50 hand-crafted compliance questions.

The system must be entirely local (no cloud dependencies beyond optional LLM API), respect SEBI's robots.txt and usage policies, and be modular enough to later port to a browser-based SLM.


Decision

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        CLI / Web UI                             │
│                  cli/ingest.py  cli/query.py  cli/eval_cli.py   │
└──────────┬──────────────────┬──────────────────┬────────────────┘
           │                  │                  │
           ▼                  ▼                  ▼
┌──────────────────┐ ┌─────────────────┐ ┌────────────────────────┐
│  Ingest Pipeline │ │  Query Pipeline │ │  Evaluation Pipeline   │
│                  │ │                 │ │                        │
│  1. Crawl        │ │  1. Retrieve    │ │  1. Load questions     │
│  2. Parse        │ │     (hybrid)    │ │  2. Run RAG pipeline   │
│  3. Chunk        │ │  2. Build prompt│ │  3. Score recall@k     │
│  4. Index        │ │  3. Call SLM    │ │  4. Report             │
└──────┬───────────┘ └───────┬─────────┘ └──────┬─────────────────┘
       │                     │                  │
       ▼                     ▼                  │
┌──────────────────────────────────────────┐    │
│              Data Layer                  │    │
│                                          │    │
│  data/raw/         Raw HTML + PDFs       │    │
│  data/raw/index.jsonl  Crawl manifest    │    │
│  data/processed/   Structured JSON       │    │
│  data/chunks/      chunks.jsonl          │    │
│  data/indexes/     BM25 + FAISS indexes  │    │
│  eval/questions.jsonl                    │◄───┘
│  eval/results/     Per-run results       │
└──────────────────────────────────────────┘

Data Flow (End-to-End)

SEBI Website
    │
    ▼  [crawler/sebi_crawler.py]
data/raw/index.jsonl          ← one line per circular (metadata + paths)
data/raw/{year}/{sebi_id}/    ← page.html, circular.pdf
    │
    ▼  [parser/sebi_parser.py]
data/processed/{sebi_id}.json ← structured JSON with sections
    │
    ▼  [chunker/sebi_chunker.py]
data/chunks/chunks.jsonl      ← one chunk per line, with metadata
    │
    ▼  [index/build_index.py]
data/indexes/bm25.pkl         ← serialized BM25 index
data/indexes/faiss.index      ← FAISS vector index
data/indexes/chunk_meta.pkl   ← chunk metadata for post-filtering
    │
    ▼  [rag/qa_pipeline.py + index/retriever.py]
User Query → Hybrid Retrieval → Prompt Assembly → SLM → Answer + Citations
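For concreteness, the two JSONL artifacts in the flow above might carry records shaped like the following. The field names and the sample circular ID are illustrative, not prescribed by this ADR:

```python
import json

# Hypothetical record shapes; every field name and value below is an example.
manifest_line = {
    "sebi_id": "SEBI/HO/MIRSD/2024/001",   # hypothetical circular ID
    "title": "Master Circular for Stock Brokers",
    "date": "2024-01-15",
    "year": 2024,
    "html_path": "data/raw/2024/SEBI_HO_MIRSD_2024_001/page.html",
    "pdf_path": "data/raw/2024/SEBI_HO_MIRSD_2024_001/circular.pdf",
}

chunk_line = {
    "chunk_id": "SEBI/HO/MIRSD/2024/001#003",
    "sebi_id": "SEBI/HO/MIRSD/2024/001",
    "section": "4.2",
    "domain_tags": ["broker", "kyc"],
    "text": "4.2 Every stock broker shall ...",
}

# Each record is one line of its JSONL file.
print(json.dumps(manifest_line))
print(json.dumps(chunk_line))
```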

Module Responsibilities

| Module | File(s) | Responsibility |
| --- | --- | --- |
| crawler/ | sebi_crawler.py, robots_check.py | Crawl SEBI listing pages, download HTML/PDF, produce index.jsonl. Respects robots.txt, rate-limits requests, and is idempotent. |
| parser/ | sebi_parser.py, pdf_extractor.py, html_extractor.py | Extract structured text from raw HTML/PDF. Handle encoding and clause numbering. Output: data/processed/{id}.json. |
| chunker/ | sebi_chunker.py | Split parsed circulars into retrieval-sized chunks (~200-400 tokens). Attach metadata (SEBI ID, section, domain tags). Output: chunks.jsonl. |
| index/ | build_index.py, retriever.py | Build BM25 + FAISS indexes. Implement hybrid fusion retrieval with post-filtering. |
| rag/ | qa_pipeline.py, llm_client.py, prompts.py | Wire retrieval to the SLM. Prompt engineering for the SEBI domain. Abstract LLM client for swappability. |
| eval/ | run_eval.py, questions.jsonl | Load eval questions, run the pipeline, compute recall@k, log results. |
| cli/ | ingest.py, query.py, eval_cli.py | CLI entry points for the ingest, query, and eval commands. |
| config/ | settings.yaml | Central configuration for all tunables (model name, chunk size, fusion weights, rate limits, etc.). |
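A minimal sketch of the clause-aware splitting the chunker is responsible for, assuming clause numbers like `1.` or `4.2` at line starts. The regex, the whitespace-based token count, and the merge policy are all illustrative, not the final implementation:

```python
import re

# Clause boundaries: lines beginning with "1.", "4.2", "4.2.1", etc.
# (illustrative pattern; real SEBI circulars may need more cases)
CLAUSE_RE = re.compile(r"(?m)^(?=\d+(?:\.\d+)*\.?\s)")

def chunk_circular(text: str, max_tokens: int = 400, min_tokens: int = 50) -> list[str]:
    """Split on clause-number boundaries, then greedily pack clauses into
    chunks within the token budget (tokens approximated by whitespace split)."""
    clauses = [c.strip() for c in CLAUSE_RE.split(text) if c.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for clause in clauses:
        n = len(clause.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(clause)
        current_len += n
    if current:
        if chunks and current_len < min_tokens:
            # fold a too-small trailing fragment into the previous chunk
            chunks[-1] += "\n" + "\n".join(current)
        else:
            chunks.append("\n".join(current))
    return chunks
```

Splitting on a zero-width lookahead keeps each clause number at the start of its chunk, which matters for BM25 matches on clause references.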

Options Considered

Option A: rank_bm25 + FAISS + Sentence-Transformers (Chosen)

| Dimension | Assessment |
| --- | --- |
| Complexity | Low — pure-Python BM25, well-known embedding library |
| Cost | Zero — all local, no API fees for retrieval |
| Scalability | Good for ~10K circulars; FAISS scales to millions |
| Team familiarity | High — widely documented, large community |
| Legal-RAG fit | Good — BM25 excels at exact legal term matching |

Pros: Simple to implement; fast iteration; no external services; BM25 handles legal jargon well (exact matches on clause numbers and SEBI IDs); FAISS provides fast ANN search.

Cons: rank_bm25 is in-memory only (fine for Month 1's corpus of ~5K-10K chunks); no built-in persistence (we serialize with pickle); no built-in filtering (we post-filter).
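The fusion step itself is small enough to sketch in pure Python. The actual retriever would wrap rank_bm25 and a FAISS index to produce the two score maps, which is omitted here; min-max normalization is one reasonable choice (not mandated by this ADR), and the 0.4/0.6 weights mirror the defaults in config/settings.yaml:

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize to [0, 1] so BM25 and cosine scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {cid: (s - lo) / span for cid, s in scores.items()}

def weighted_sum_fuse(bm25: dict[str, float], dense: dict[str, float],
                      w_bm25: float = 0.4, w_dense: float = 0.6,
                      top_k: int = 10) -> list[tuple[str, float]]:
    """Fuse two {chunk_id: score} maps from the sparse and dense legs."""
    b, d = minmax(bm25), minmax(dense)
    fused = {cid: w_bm25 * b.get(cid, 0.0) + w_dense * d.get(cid, 0.0)
             for cid in set(b) | set(d)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Swapping this function for reciprocal rank fusion (the config's "rrf" option) would leave the rest of the retriever untouched.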

Option B: Elasticsearch/OpenSearch for both sparse and dense

| Dimension | Assessment |
| --- | --- |
| Complexity | Medium — requires running a JVM service |
| Cost | Low, but heavier footprint (RAM/CPU for ES) |
| Scalability | Excellent — production-grade |
| Team familiarity | Medium |

Pros: Built-in hybrid search, filtering, persistence, pagination.

Cons: Overkill for Month 1 (< 10K docs), adds operational burden (JVM, config, Docker), slower iteration cycle, harder to port to browser later.

Option C: ChromaDB or LanceDB for vector + metadata

| Dimension | Assessment |
| --- | --- |
| Complexity | Low |
| Cost | Zero |
| Scalability | Moderate |
| Team familiarity | Medium |

Pros: Integrated metadata filtering, simpler API than raw FAISS.

Cons: No built-in BM25 (would need separate sparse index anyway), less control over fusion strategy, smaller community for legal-RAG tuning.


Trade-off Analysis

The key trade-off is simplicity + control vs. features:

  • Option A gives us maximum control over the fusion strategy and chunking pipeline, which matters for legal-RAG where exact clause references and BM25 term matching are critical. The in-memory limitation is acceptable for Month 1's corpus size (~5K-10K chunks).
  • Option B would be the right choice at scale but adds unnecessary complexity for a Month 1 prototype.
  • Option C is a reasonable middle ground but the lack of BM25 integration means we'd still need two retrieval paths.

We choose Option A with a clear migration path: the retriever.py interface is abstract enough that swapping to Elasticsearch or ChromaDB later requires changing only the index builder and retriever implementations, not the RAG pipeline or evaluation code.


Consequences

What becomes easier

  • Fast iteration on chunking strategy and fusion weights (all in Python, no services).
  • Complete local execution (no Docker, no JVM, no cloud).
  • Clear separation of concerns — each module can be tested and improved independently.
  • Evaluation-driven development — the eval suite provides immediate feedback on retrieval changes.

What becomes harder

  • Scaling beyond ~50K chunks will require moving off in-memory BM25 (Month 2+ concern).
  • No built-in metadata filtering in the index layer (we post-filter, which is fine for Month 1 candidate set sizes).
  • Persistence is pickle-based (fragile across Python versions, but acceptable for Month 1).

What we'll need to revisit

  • Embedding model choice once we have eval baselines (may want BGE-small or legal-tuned model).
  • Chunk size tuning based on recall@k results.
  • BM25 backend (move to Pyserini or ES) if corpus grows significantly.
  • LLM client implementation once we pick the local SLM runtime.

Configuration Strategy

All tunables live in config/settings.yaml with environment variable overrides:

# config/settings.yaml
crawler:
  obey_robots: true
  rate_limit_seconds: 2.0
  max_concurrent: 3
  max_retries: 3
  user_agent: "SEBICircularBot/1.0 (+https://github.com/yourname/sebi-rag)"

parser:
  prefer_html: true          # Use HTML over PDF when both available
  fallback_to_pdf: true
  normalize_unicode: true

chunker:
  max_tokens: 400
  min_tokens: 50
  overlap_tokens: 50
  use_clause_boundaries: true

index:
  embedding_model: "sentence-transformers/all-MiniLM-L6-v2"
  embedding_dim: 384
  faiss_index_type: "FlatIP"  # Inner product (for normalized embeddings = cosine)
  fallback_to_annoy: false
  bm25_tokenizer: "simple"    # or "spacy"

retrieval:
  top_k: 10
  bm25_weight: 0.4
  dense_weight: 0.6
  fusion_method: "weighted_sum"  # or "rrf" (reciprocal rank fusion)

rag:
  llm_backend: "openai_compatible"  # or "local_llamacpp", "ollama"
  llm_model: "mistral-7b-instruct"
  llm_base_url: "http://localhost:11434/v1"
  max_context_chunks: 5
  temperature: 0.1

eval:
  questions_file: "eval/questions.jsonl"
  recall_k_values: [3, 5, 10]
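One way the environment-variable overrides could work, assuming a double-underscore naming convention (e.g. SEBI_RAG__RETRIEVAL__TOP_K=20 overriding retrieval.top_k); both the prefix and the convention are illustrative, and in practice the base dict would come from yaml.safe_load on settings.yaml:

```python
import os

def apply_env_overrides(settings: dict, prefix: str = "SEBI_RAG") -> dict:
    """Mutate nested settings in place from environment variables.

    SEBI_RAG__RETRIEVAL__TOP_K=20 -> settings["retrieval"]["top_k"] = 20.
    Values are coerced to the type of the existing setting where one exists.
    """
    for key, raw in os.environ.items():
        if not key.startswith(prefix + "__"):
            continue
        path = [p.lower() for p in key[len(prefix) + 2:].split("__")]
        node = settings
        for part in path[:-1]:
            node = node.setdefault(part, {})
        old = node.get(path[-1])
        if isinstance(old, bool):  # bool before int: isinstance(True, int) is True
            node[path[-1]] = raw.lower() in ("1", "true", "yes")
        elif old is not None:
            node[path[-1]] = type(old)(raw)
        else:
            node[path[-1]] = raw
    return settings
```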

Action Items

  1. Implement crawler with robots.txt checking and rate limiting
  2. Implement HTML + PDF parsers with encoding normalization
  3. Implement clause-aware chunker with domain tag inference
  4. Build hybrid index (BM25 + FAISS) with serialization
  5. Wire RAG pipeline with abstract LLM client
  6. Create 30-50 evaluation questions across SEBI domains
  7. Build evaluation runner with recall@k metrics
  8. End-to-end integration test
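The metric behind item 7 reduces to a few lines; a sketch, where relevant_ids would come from each question's hand-labelled gold chunks (a hypothetical field in questions.jsonl):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the gold chunks that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids[:k] if cid in relevant_ids)
    return hits / len(relevant_ids)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over all (retrieved, relevant) pairs in an eval run."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)
```

The eval runner would compute this for each k in recall_k_values and log one result file per run under eval/results/.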