
ADR-001: SEBI Circulars Local RAG Assistant — Month 1 Architecture

Status: Proposed
Date: 2026-03-03
Deciders: Ian (Project Lead)


Context

We are building an open-source, fully local assistant for querying SEBI (Securities and Exchange Board of India) circulars and master circulars. The Month-1 milestone focuses on:

  1. A focused crawler for SEBI's circular listing pages (HTML + PDF artifacts).
  2. A parser/normalizer pipeline that produces structured JSON from raw HTML and PDFs.
  3. A chunker tuned for legal/regulatory text (clause-aware, metadata-enriched).
  4. A hybrid retrieval layer (BM25 sparse + dense embeddings via FAISS).
  5. A baseline RAG pipeline wiring retrieval to a generic open-source SLM.
  6. An evaluation suite of 30-50 hand-crafted compliance questions.

The system must be entirely local (no cloud dependencies beyond optional LLM API), respect SEBI's robots.txt and usage policies, and be modular enough to later port to a browser-based SLM.


Decision

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        CLI / Web UI                             │
│                  cli/ingest.py  cli/query.py  cli/eval_cli.py   │
└──────────┬──────────────────┬──────────────────┬────────────────┘
           │                  │                  │
           ▼                  ▼                  ▼
┌──────────────────┐ ┌─────────────────┐ ┌────────────────────────┐
│  Ingest Pipeline │ │  Query Pipeline │ │  Evaluation Pipeline   │
│                  │ │                 │ │                        │
│  1. Crawl        │ │  1. Retrieve    │ │  1. Load questions     │
│  2. Parse        │ │     (hybrid)    │ │  2. Run RAG pipeline   │
│  3. Chunk        │ │  2. Build prompt│ │  3. Score recall@k     │
│  4. Index        │ │  3. Call SLM    │ │  4. Report             │
└──────┬───────────┘ └───────┬─────────┘ └──────┬─────────────────┘
       │                     │                  │
       ▼                     ▼                  │
┌──────────────────────────────────────────┐    │
│              Data Layer                  │    │
│                                          │    │
│  data/raw/         Raw HTML + PDFs       │    │
│  data/raw/index.jsonl  Crawl manifest    │    │
│  data/processed/   Structured JSON       │    │
│  data/chunks/      chunks.jsonl          │    │
│  data/indexes/     BM25 + FAISS indexes  │    │
│  eval/questions.jsonl                    │◄───┘
│  eval/results/     Per-run results       │
└──────────────────────────────────────────┘

Data Flow (End-to-End)

SEBI Website
    │
    ▼  [crawler/sebi_crawler.py]
data/raw/index.jsonl          ← one line per circular (metadata + paths)
data/raw/{year}/{sebi_id}/    ← page.html, circular.pdf
    │
    ▼  [parser/sebi_parser.py]
data/processed/{sebi_id}.json ← structured JSON with sections
    │
    ▼  [chunker/sebi_chunker.py]
data/chunks/chunks.jsonl      ← one chunk per line, with metadata
    │
    ▼  [index/build_index.py]
data/indexes/bm25.pkl         ← serialized BM25 index
data/indexes/faiss.index      ← FAISS vector index
data/indexes/chunk_meta.pkl   ← chunk metadata for post-filtering
    │
    ▼  [rag/qa_pipeline.py + index/retriever.py]
User Query → Hybrid Retrieval → Prompt Assembly → SLM → Answer + Citations
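For concreteness, the two JSONL artifacts in the flow above might carry records shaped like the following. The field names and the sample circular ID are illustrative, not prescribed by this ADR:

```python
import json

# Hypothetical record shapes; every field name and value below is an example.
manifest_line = {
    "sebi_id": "SEBI/HO/MIRSD/2024/001",   # hypothetical circular ID
    "title": "Master Circular for Stock Brokers",
    "date": "2024-01-15",
    "year": 2024,
    "html_path": "data/raw/2024/SEBI_HO_MIRSD_2024_001/page.html",
    "pdf_path": "data/raw/2024/SEBI_HO_MIRSD_2024_001/circular.pdf",
}

chunk_line = {
    "chunk_id": "SEBI/HO/MIRSD/2024/001#003",
    "sebi_id": "SEBI/HO/MIRSD/2024/001",
    "section": "4.2",
    "domain_tags": ["broker", "kyc"],
    "text": "4.2 Every stock broker shall ...",
}

# Each record is one line of its JSONL file.
print(json.dumps(manifest_line))
print(json.dumps(chunk_line))
```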

Module Responsibilities

| Module | File(s) | Responsibility |
| --- | --- | --- |
| crawler/ | sebi_crawler.py, robots_check.py | Crawl SEBI listing pages, download HTML/PDF, produce index.jsonl. Respects robots.txt, rate-limits requests, and is idempotent. |
| parser/ | sebi_parser.py, pdf_extractor.py, html_extractor.py | Extract structured text from raw HTML/PDF. Handle encoding and clause numbering. Output: data/processed/{id}.json. |
| chunker/ | sebi_chunker.py | Split parsed circulars into retrieval-sized chunks (~200-400 tokens). Attach metadata (SEBI ID, section, domain tags). Output: chunks.jsonl. |
| index/ | build_index.py, retriever.py | Build BM25 + FAISS indexes. Implement hybrid fusion retrieval with post-filtering. |
| rag/ | qa_pipeline.py, llm_client.py, prompts.py | Wire retrieval to the SLM. Prompt engineering for the SEBI domain. Abstract LLM client for swappability. |
| eval/ | run_eval.py, questions.jsonl | Load eval questions, run the pipeline, compute recall@k, log results. |
| cli/ | ingest.py, query.py, eval_cli.py | CLI entry points for the ingest, query, and eval commands. |
| config/ | settings.yaml | Central configuration for all tunables (model name, chunk size, fusion weights, rate limits, etc.). |
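A minimal sketch of the clause-aware splitting the chunker is responsible for, assuming clause numbers like `1.` or `4.2` at line starts. The regex, the whitespace-based token count, and the merge policy are all illustrative, not the final implementation:

```python
import re

# Clause boundaries: lines beginning with "1.", "4.2", "4.2.1", etc.
# (illustrative pattern; real SEBI circulars may need more cases)
CLAUSE_RE = re.compile(r"(?m)^(?=\d+(?:\.\d+)*\.?\s)")

def chunk_circular(text: str, max_tokens: int = 400, min_tokens: int = 50) -> list[str]:
    """Split on clause-number boundaries, then greedily pack clauses into
    chunks within the token budget (tokens approximated by whitespace split)."""
    clauses = [c.strip() for c in CLAUSE_RE.split(text) if c.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for clause in clauses:
        n = len(clause.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(clause)
        current_len += n
    if current:
        if chunks and current_len < min_tokens:
            # fold a too-small trailing fragment into the previous chunk
            chunks[-1] += "\n" + "\n".join(current)
        else:
            chunks.append("\n".join(current))
    return chunks
```

Splitting on a zero-width lookahead keeps each clause number at the start of its chunk, which matters for BM25 matches on clause references.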

Options Considered

Option A: rank_bm25 + FAISS + Sentence-Transformers (Chosen)

| Dimension | Assessment |
| --- | --- |
| Complexity | Low — pure-Python BM25, well-known embedding library |
| Cost | Zero — all local, no API fees for retrieval |
| Scalability | Good for ~10K circulars; FAISS scales to millions |
| Team familiarity | High — widely documented, large community |
| Legal-RAG fit | Good — BM25 excels at exact legal term matching |

Pros: Simple to implement; fast iteration; no external services; BM25 handles legal jargon well (exact matches on clause numbers and SEBI IDs); FAISS provides fast ANN search.

Cons: rank_bm25 is in-memory only (fine for Month 1's corpus of ~5K-10K chunks); no built-in persistence (we serialize with pickle); no built-in filtering (we post-filter).
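The fusion step itself is small enough to sketch in pure Python. The actual retriever would wrap rank_bm25 and a FAISS index to produce the two score maps, which is omitted here; min-max normalization is one reasonable choice (not mandated by this ADR), and the 0.4/0.6 weights mirror the defaults in config/settings.yaml:

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize to [0, 1] so BM25 and cosine scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {cid: (s - lo) / span for cid, s in scores.items()}

def weighted_sum_fuse(bm25: dict[str, float], dense: dict[str, float],
                      w_bm25: float = 0.4, w_dense: float = 0.6,
                      top_k: int = 10) -> list[tuple[str, float]]:
    """Fuse two {chunk_id: score} maps from the sparse and dense legs."""
    b, d = minmax(bm25), minmax(dense)
    fused = {cid: w_bm25 * b.get(cid, 0.0) + w_dense * d.get(cid, 0.0)
             for cid in set(b) | set(d)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Swapping this function for reciprocal rank fusion (the config's "rrf" option) would leave the rest of the retriever untouched.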

Option B: Elasticsearch/OpenSearch for both sparse and dense

| Dimension | Assessment |
| --- | --- |
| Complexity | Medium — requires running a JVM service |
| Cost | Low, but heavier footprint (RAM/CPU for ES) |
| Scalability | Excellent — production-grade |
| Team familiarity | Medium |

Pros: Built-in hybrid search, filtering, persistence, pagination.

Cons: Overkill for Month 1 (< 10K docs), adds operational burden (JVM, config, Docker), slower iteration cycle, harder to port to browser later.

Option C: ChromaDB or LanceDB for vector + metadata

| Dimension | Assessment |
| --- | --- |
| Complexity | Low |
| Cost | Zero |
| Scalability | Moderate |
| Team familiarity | Medium |

Pros: Integrated metadata filtering, simpler API than raw FAISS.

Cons: No built-in BM25 (would need separate sparse index anyway), less control over fusion strategy, smaller community for legal-RAG tuning.


Trade-off Analysis

The key trade-off is simplicity + control vs. features:

  • Option A gives us maximum control over the fusion strategy and chunking pipeline, which matters for legal-RAG where exact clause references and BM25 term matching are critical. The in-memory limitation is acceptable for Month 1's corpus size (~5K-10K chunks).
  • Option B would be the right choice at scale but adds unnecessary complexity for a Month 1 prototype.
  • Option C is a reasonable middle ground but the lack of BM25 integration means we'd still need two retrieval paths.

We choose Option A with a clear migration path: the retriever.py interface is abstract enough that swapping to Elasticsearch or ChromaDB later requires changing only the index builder and retriever implementations, not the RAG pipeline or evaluation code.


Consequences

What becomes easier

  • Fast iteration on chunking strategy and fusion weights (all in Python, no services).
  • Complete local execution (no Docker, no JVM, no cloud).
  • Clear separation of concerns — each module can be tested and improved independently.
  • Evaluation-driven development — the eval suite provides immediate feedback on retrieval changes.

What becomes harder

  • Scaling beyond ~50K chunks will require moving off in-memory BM25 (Month 2+ concern).
  • No built-in metadata filtering in the index layer (we post-filter, which is fine for Month 1 candidate set sizes).
  • Persistence is pickle-based (fragile across Python versions, but acceptable for Month 1).

What we'll need to revisit

  • Embedding model choice once we have eval baselines (may want BGE-small or legal-tuned model).
  • Chunk size tuning based on recall@k results.
  • BM25 backend (move to Pyserini or ES) if corpus grows significantly.
  • LLM client implementation once we pick the local SLM runtime.

Configuration Strategy

All tunables live in config/settings.yaml with environment variable overrides:

# config/settings.yaml
crawler:
  obey_robots: true
  rate_limit_seconds: 2.0
  max_concurrent: 3
  max_retries: 3
  user_agent: "SEBICircularBot/1.0 (+https://github.com/yourname/sebi-rag)"

parser:
  prefer_html: true          # Use HTML over PDF when both available
  fallback_to_pdf: true
  normalize_unicode: true

chunker:
  max_tokens: 400
  min_tokens: 50
  overlap_tokens: 50
  use_clause_boundaries: true

index:
  embedding_model: "sentence-transformers/all-MiniLM-L6-v2"
  embedding_dim: 384
  faiss_index_type: "FlatIP"  # Inner product (for normalized embeddings = cosine)
  fallback_to_annoy: false
  bm25_tokenizer: "simple"    # or "spacy"

retrieval:
  top_k: 10
  bm25_weight: 0.4
  dense_weight: 0.6
  fusion_method: "weighted_sum"  # or "rrf" (reciprocal rank fusion)

rag:
  llm_backend: "openai_compatible"  # or "local_llamacpp", "ollama"
  llm_model: "mistral-7b-instruct"
  llm_base_url: "http://localhost:11434/v1"
  max_context_chunks: 5
  temperature: 0.1

eval:
  questions_file: "eval/questions.jsonl"
  recall_k_values: [3, 5, 10]
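One way the environment-variable overrides could work, assuming a double-underscore naming convention (e.g. SEBI_RAG__RETRIEVAL__TOP_K=20 overriding retrieval.top_k); both the prefix and the convention are illustrative, and in practice the base dict would come from yaml.safe_load on settings.yaml:

```python
import os

def apply_env_overrides(settings: dict, prefix: str = "SEBI_RAG") -> dict:
    """Mutate nested settings in place from environment variables.

    SEBI_RAG__RETRIEVAL__TOP_K=20 -> settings["retrieval"]["top_k"] = 20.
    Values are coerced to the type of the existing setting where one exists.
    """
    for key, raw in os.environ.items():
        if not key.startswith(prefix + "__"):
            continue
        path = [p.lower() for p in key[len(prefix) + 2:].split("__")]
        node = settings
        for part in path[:-1]:
            node = node.setdefault(part, {})
        old = node.get(path[-1])
        if isinstance(old, bool):  # bool before int: isinstance(True, int) is True
            node[path[-1]] = raw.lower() in ("1", "true", "yes")
        elif old is not None:
            node[path[-1]] = type(old)(raw)
        else:
            node[path[-1]] = raw
    return settings
```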

Action Items

  1. Implement crawler with robots.txt checking and rate limiting
  2. Implement HTML + PDF parsers with encoding normalization
  3. Implement clause-aware chunker with domain tag inference
  4. Build hybrid index (BM25 + FAISS) with serialization
  5. Wire RAG pipeline with abstract LLM client
  6. Create 30-50 evaluation questions across SEBI domains
  7. Build evaluation runner with recall@k metrics
  8. End-to-end integration test
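The metric behind item 7 reduces to a few lines; a sketch, where relevant_ids would come from each question's hand-labelled gold chunks (a hypothetical field in questions.jsonl):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the gold chunks that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids[:k] if cid in relevant_ids)
    return hits / len(relevant_ids)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over all (retrieved, relevant) pairs in an eval run."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)
```

The eval runner would compute this for each k in recall_k_values and log one result file per run under eval/results/.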