SEBI Circulars RAG Assistant

A fully local, open-source hybrid RAG system for querying SEBI (Securities and Exchange Board of India) circulars and master circulars.

Architecture

  User Query
      │
      ▼
┌──────────────┐
│   Query      │  Intent classification · query expansion · HyDE (opt-in)
│ Understanding│
└──────┬───────┘
       ▼
┌──────────────┐
│   Hybrid     │  BM25 sparse + MiniLM-L6-v2 dense → weighted sum / RRF fusion
│  Retrieval   │
└──────┬───────┘
       ▼
┌──────────────┐
│ Cross-Encoder│  ms-marco-MiniLM-L-6-v2 reranking (top-20 → top-5)
│   Reranker   │
└──────┬───────┘
       ▼
┌──────────────┐
│   Context    │  Extractive sentence selection to reduce noise (opt-in)
│ Compression  │
└──────┬───────┘
       ▼
┌──────────────┐
│  LLM (local) │  gemma3:12b via Ollama · streaming + timeout · grounded prompts
│  Generation  │
└──────┬───────┘
       ▼
┌──────────────┐
│  Guardrails  │  Citation verification · n-gram grounding check
│  & Tracing   │  JSONL pipeline traces for observability
└──────┬───────┘
       ▼
    Answer + Citations
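
The RRF option in the hybrid retrieval stage can be illustrated with a minimal sketch. It assumes each retriever returns an ordered list of chunk IDs; the function name `rrf_fuse`, the constant `k=60`, and the sample IDs are illustrative assumptions, not the project's actual API.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=20):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores 1 / (k + rank) per list it appears in; k=60 is a
    commonly used default and is an assumption here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]


# Hypothetical rankings from the sparse and dense retrievers.
bm25_ranking = ["c3", "c1", "c7"]
dense_ranking = ["c1", "c9", "c3"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only ranks, it avoids having to normalize BM25 scores and cosine similarities onto a common scale, which the weighted-sum option would need.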

Features

  • Hybrid retrieval: BM25 sparse + dense embeddings with configurable fusion (weighted sum or RRF)
  • Cross-encoder reranking: Two-phase retrieval for high precision
  • Query understanding: Intent classification, LLM-based expansion, HyDE support
  • Post-generation guardrails: Citation verification and grounding checks
  • Agentic multi-hop (opt-in): Plan → Route → Act → Verify → Synthesize for complex queries
  • Context compression (opt-in): Extractive sentence selection to reduce noise
  • Structured observability: JSONL pipeline traces with full latency breakdown
  • Streaming output: Token-by-token CLI output for interactive use
  • Offline-first: All models load from local cache when network is unavailable
  • RAGAS-style evaluation: recall@k, MRR, citation accuracy, faithfulness proxy
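
As a rough illustration of the n-gram grounding check named above, the sketch below measures what fraction of an answer's word trigrams also appear in the retrieved context. The function names, the trigram size, and the sample sentences are assumptions made for this example; the project's guardrails may work differently.

```python
import re


def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def grounding_score(answer, contexts, n=3):
    """Fraction of answer n-grams that also occur in any context chunk."""
    answer_grams = ngrams(answer, n)
    if not answer_grams:
        return 0.0
    context_grams = set()
    for context in contexts:
        context_grams |= ngrams(context, n)
    return len(answer_grams & context_grams) / len(answer_grams)


# Hypothetical context chunk and two candidate answers.
contexts = ["Listed entities shall disclose related party transactions every six months."]
grounded = grounding_score(
    "Listed entities shall disclose related party transactions.", contexts
)
ungrounded = grounding_score(
    "Brokers must file weekly margin reports with exchanges.", contexts
)
print(grounded, ungrounded)
```

A pipeline could flag answers whose score falls below a chosen threshold; the threshold itself would need tuning on real data.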

Developer Setup

Prerequisites

  • Python 3.11+
  • A local LLM runtime (Ollama recommended), or the mock backend for testing

Installation

git clone https://github.com/iAn-P1nt0/sebi_circular_rag.git
cd sebi_circular_rag

# Install dependencies
pip install -e ".[dev]"

# Or using uv (faster)
uv pip install -e ".[dev]"

Configuration

Edit config/settings.yaml or use environment variables:

# Use Ollama with gemma3
export SEBI_RAG_RAG_LLM_MODEL=gemma3:12b
export SEBI_RAG_RAG_LLM_BASE_URL=http://localhost:11434/v1

# Use mock LLM (no real LLM needed)
export SEBI_RAG_RAG_LLM_BACKEND=mock
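
The precedence of environment variables over file settings can be pictured with the sketch below. It is only an illustration: the flat key layout, the `apply_env_overrides` helper, and the default values are hypothetical, and the real loader in config/__init__.py may map variable names to settings differently.

```python
import os

ENV_PREFIX = "SEBI_RAG_"  # prefix taken from the variables shown above


def apply_env_overrides(settings, environ=None):
    """Return a copy of flat settings with SEBI_RAG_* variables layered on top.

    Assumption: SEBI_RAG_RAG_LLM_MODEL maps to the flat key "rag_llm_model".
    """
    environ = os.environ if environ is None else environ
    merged = dict(settings)
    for name, value in environ.items():
        if name.startswith(ENV_PREFIX):
            merged[name[len(ENV_PREFIX):].lower()] = value
    return merged


# Hypothetical defaults, standing in for values read from settings.yaml.
defaults = {"rag_llm_backend": "openai", "rag_llm_model": "gemma3:12b"}
merged = apply_env_overrides(
    defaults, {"SEBI_RAG_RAG_LLM_BACKEND": "mock", "PATH": "/usr/bin"}
)
print(merged)
```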

Running the Pipeline

Step 1: Ingest (crawl → parse → chunk → index)

python -m cli.ingest
python -m cli.ingest --max-pages 10 --verbose    # Limit crawl for testing
python -m cli.ingest --skip-crawl                 # Use existing raw data

Step 2: Query

# Full RAG answer
python -m cli.query "What are the disclosure requirements for RPTs under LODR?"

# With streaming output
python -m cli.query "What are margin requirements?" --stream

# With filters and chunk details
python -m cli.query "REIT distribution requirements" --domain Intermediaries --show-chunks

# Retrieval only (no LLM call)
python -m cli.query "Insider trading regulations" --retrieval-only

Step 3: Evaluate

python -m cli.eval_cli
python -m cli.eval_cli --questions path/to/my_questions.jsonl --verbose
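
For readers unfamiliar with the retrieval metrics reported by the evaluation step, here is a minimal sketch of recall@k and MRR. The function names and sample chunk IDs are assumptions for illustration and do not reflect the schema of eval/questions.jsonl or the code in eval/run_eval.py.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of relevant chunk IDs found in the top-k retrieved IDs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Hypothetical results for two questions.
runs = [
    (["c4", "c2", "c9"], ["c2"]),        # first relevant chunk at rank 2
    (["c1", "c5", "c8"], ["c1", "c8"]),  # first relevant chunk at rank 1
]
print(recall_at_k(runs[1][0], runs[1][1], k=2))
print(mean_reciprocal_rank(runs))
```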

Project Structure

sebi_circular_rag/
├── config/
│   ├── __init__.py              # Config loader (YAML + env overrides)
│   └── settings.yaml            # Central configuration
├── crawler/
│   ├── sebi_crawler.py          # Async rate-limited SEBI crawler
│   └── robots_check.py          # robots.txt compliance
├── parser/
│   ├── sebi_parser.py           # HTML/PDF → structured JSON
│   ├── html_extractor.py        # HTML content extraction
│   └── pdf_extractor.py         # PDF text extraction (PyMuPDF)
├── chunker/
│   └── sebi_chunker.py          # Clause-aware chunking (200-400 tokens)
├── index/
│   ├── build_index.py           # BM25 + FAISS/Annoy index builder
│   ├── retriever.py             # Hybrid retrieval with score fusion
│   └── reranker.py              # Cross-encoder reranking
├── query/
│   └── understanding.py         # Intent classification + query expansion + HyDE
├── rag/
│   ├── qa_pipeline.py           # Enhanced end-to-end RAG pipeline
│   ├── llm_client.py            # LLM client (OpenAI-compat, streaming, mock)
│   ├── prompts.py               # Prompt templates
│   ├── guardrails.py            # Citation verification + grounding checks
│   └── compressor.py            # Extractive context compression
├── agent/
│   └── orchestrator.py          # Multi-hop agentic decomposition
├── observability/
│   └── tracer.py                # JSONL pipeline tracing
├── utils/
│   └── model_loader.py          # Offline-resilient model loading
├── eval/
│   ├── questions.jsonl          # 40 evaluation questions
│   └── run_eval.py              # Evaluation (recall@k, MRR, faithfulness, citation accuracy)
├── cli/
│   ├── ingest.py                # Full ingest pipeline CLI
│   ├── query.py                 # Query CLI (with streaming)
│   └── eval_cli.py              # Evaluation CLI
├── tests/
│   ├── test_parser.py           # Parser tests
│   ├── test_chunker.py          # Chunker tests
│   ├── test_retriever.py        # Retriever tests
│   ├── test_qa_pipeline.py      # Pipeline integration tests
│   ├── test_guardrails.py       # Guardrails tests
│   ├── test_query_understanding.py  # Query understanding tests
│   ├── test_compressor.py       # Context compression tests
│   └── test_eval_metrics.py     # Evaluation metrics tests
├── data/                        # Generated data (gitignored)
│   ├── raw/                     # Crawled HTML + PDFs
│   ├── processed/               # Parsed JSON per circular
│   ├── chunks/                  # chunks.jsonl
│   ├── indexes/                 # BM25 + FAISS serialized indexes
│   └── traces/                  # Pipeline execution traces
├── pyproject.toml
├── README.md
├── ADR-001-SEBI-RAG-ARCHITECTURE.md
└── ADR-002-ARCHITECTURE-ENHANCEMENT-REVIEW.md

Running Tests

pytest tests/ -v

Key Design Decisions

See ADR-001-SEBI-RAG-ARCHITECTURE.md for the original architecture and ADR-002-ARCHITECTURE-ENHANCEMENT-REVIEW.md for the enhancement review.

  • BM25 + dense hybrid: Legal text benefits from exact term matching combined with semantic understanding
  • Cross-encoder reranking: Two-phase retrieval (fast hybrid recall of the top 20 candidates, then cross-encoder rescoring down to the top 5) improves precision over single-phase retrieval
  • Small chunks (200-400 tokens): LegalBench-RAG findings show smaller chunks improve regulatory text retrieval
  • Clause-aware splitting: Respects legal document structure with abbreviation-safe sentence handling
  • Post-generation guardrails: Catch hallucinated citations and ungrounded claims before returning to user
  • Pluggable LLM: Abstract client interface supporting Ollama, LM Studio, vLLM, or any OpenAI-compatible API
  • Offline-first: Models load from local cache when network is unavailable (try online → fallback to local)
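
To make "abbreviation-safe sentence handling" concrete, the sketch below splits text on sentence-ending punctuation but skips periods that follow a known abbreviation. The abbreviation list, the `split_sentences` name, and the sample text are assumptions for illustration; the chunker in chunker/sebi_chunker.py may use different rules.

```python
import re

# Illustrative abbreviations common in regulatory text (an assumed list).
ABBREVIATIONS = {"no", "reg", "sec", "ltd", "vs", "viz", "e.g", "i.e", "w.e.f"}


def split_sentences(text):
    """Split on ., ! or ? followed by whitespace, except after abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower().lstrip("([") if words else ""
        if match.group()[0] == "." and last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation such as "No."
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = (
    "Refer to Circular No. 12 dated 1 April. "
    "Reg. 23 of LODR applies w.e.f. 1 July. Entities must comply."
)
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

A naive split on every period would cut "Circular No. 12" in two and could separate a clause number from the text it governs, which is the failure this kind of handling is meant to avoid.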