TechHub CodeSprint Challenge 2026 — ArXiv RAG Pipeline

Overview

This project is a Retrieval-Augmented Generation (RAG) pipeline built over a sampled subset of the arXiv academic paper metadata. It was developed as a submission for the TechHub CodeSprint Challenge 2026.

The system ingests arXiv paper metadata from a CSV snapshot, stores it in SQLite, builds a vector index for semantic search, and exposes a REST API to query papers by natural language questions. It also includes visualization utilities and a batch question-answering runner.


Project Structure

.
├── ingest.py                  # Ingest arXiv CSV into SQLite + JSON
├── clean.sql                  # SQL script to clean/normalize raw data into analytical tables
├── rag_pipeline.py            # Core RAG pipeline (ChromaDB vector store + embedding support)
├── server.py                  # FastAPI REST API server
├── query_runner.py            # Batch question answering runner
├── visualize.py               # Generate plots from the SQLite database
├── questions.json             # Sample questions with grading criteria
├── answer.json                # Generated answers from the pipeline
├── requirements.txt           # Python dependencies
├── data/
│   ├── arxiv.db               # SQLite database (ingested papers + cleaned tables)
│   ├── papers_raw.json        # Raw papers JSON export
│   └── plots/
│       ├── 01_papers_per_category.png
│       ├── 02_submission_trend_over_time.png
│       ├── 03_publication_status_breakdown.png
│       └── 04_abstract_length_distribution.png
└── vector_store/              # ChromaDB vector index (auto-built on first run)

Features

  • Data ingestion: Load arXiv paper metadata from CSV, filter by category, and persist to SQLite and JSON.
  • Data cleaning: Deduplicate papers, extract submission year, count authors, compute abstract word counts, and classify publication status via a single SQL script.
  • RAG pipeline: Chunk paper abstracts and embed them into a ChromaDB vector store. Supports two embedding backends: a local sentence-transformers model (offline) or the OpenRouter API; see the sketch after this list.
  • REST API: FastAPI server with endpoints to list papers, run semantic queries, and check system health.
  • Batch Q&A: Run a list of natural-language questions against the pipeline and save structured answers to answer.json.
  • Visualizations: Generate four matplotlib charts summarizing the dataset.
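
A minimal sketch of the embed-and-index step, using the default local backend (illustrative only; the actual chunking strategy and names live in rag_pipeline.py):

# Illustrative chunk-and-embed sketch; rag_pipeline.py may chunk and
# name things differently.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="vector_store")
collection = client.get_or_create_collection("papers")

def chunk(text, size=512):
    # Naive fixed-size character chunking, for illustration only.
    return [text[i:i + size] for i in range(0, len(text), size)]

abstract = "Transformers have become the dominant architecture for NLP ..."
pieces = chunk(abstract)
collection.add(
    ids=[f"paper-0001-{i}" for i in range(len(pieces))],
    documents=pieces,
    embeddings=model.encode(pieces).tolist(),
)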

Setup

Prerequisites

  • Python 3.10+
  • pip

Install dependencies

pip install -r requirements.txt

For local embeddings, no additional setup is needed — the model is downloaded automatically on first use.
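
For example, instantiating the default local model triggers the download:

# First instantiation downloads and caches the model weights;
# later runs reuse the local cache.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.encode(["hello world"]).shape)  # (1, 384)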

For OpenRouter embeddings, create a .env file in the project root with your API key:

OPENROUTER_API_KEY="your-key-here"

Or set the environment variable directly:

export OPENROUTER_API_KEY="your-key-here"

Usage

1 — Ingest data

Place the arXiv CSV snapshot (sampled-arxiv-metadata-oai-snapshot.csv) in the project root, then run:

python ingest.py --input sampled-arxiv-metadata-oai-snapshot.csv --output-dir data

Options:

Flag             Default                                     Description
--input          sampled-arxiv-metadata-oai-snapshot.csv     Path to the input CSV
--output-dir     data                                        Directory for outputs (papers_raw.json, arxiv.db)
--categories     cs.AI cs.LG cs.CL stat.ML cs.CV             Space-separated list of arXiv categories to keep
--sample-size    (none)                                      Fixed number of rows to sample
--sample-frac    (none)                                      Fraction of rows to sample, in (0, 1]
--random-state   42                                          Random seed for reproducibility
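
For example, to ingest a reproducible 10% sample restricted to two categories:

python ingest.py --input sampled-arxiv-metadata-oai-snapshot.csv --output-dir data \
  --categories cs.AI cs.LG --sample-frac 0.1 --random-state 42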

2 — Clean the database

sqlite3 data/arxiv.db < clean.sql

This creates five tables: papers, category_stats, yearly_trends, publication_status, and author_stats.
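
A quick sanity check with Python's built-in sqlite3 module confirms the tables exist:

# Verify that clean.sql produced the expected tables.
import sqlite3

conn = sqlite3.connect("data/arxiv.db")
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
expected = {"papers", "category_stats", "yearly_trends",
            "publication_status", "author_stats"}
print("missing tables:", expected - tables or "none")
conn.close()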

3 — Start the API server

uvicorn server:app --host 0.0.0.0 --port 8000

Set AUTO_BUILD_INDEX=true to build the vector index automatically on startup:

AUTO_BUILD_INDEX=true uvicorn server:app --host 0.0.0.0 --port 8000
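
Once the server is up, the /health endpoint (described below) makes a quick smoke test:

curl http://localhost:8000/health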

API Endpoints

Method   Path                           Description
GET      /health                        Service health, index size, and config
GET      /papers?limit=20&offset=0      Paginated list of papers
POST     /query                         Semantic search over abstracts

Query example:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "transformer models for text classification", "top_k": 5}'

4 — Run batch question answering

python query_runner.py

This reads questions.json, queries the pipeline for each question, applies optional category and year filters, and writes structured answers to answer.json.

Set OPENROUTER_API_KEY or place the key in a temp.txt file in the project root.
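
In outline, the runner does something like the following (a sketch with a stubbed answer function and an assumed questions.json structure; see query_runner.py for the real schema and filter handling):

# Illustrative outline only: answer_question() is a stand-in for the
# pipeline call, and the questions.json structure is assumed.
import json

def answer_question(question):
    # Stub for the retrieval + answering step in rag_pipeline.py.
    return "..."

with open("questions.json") as f:
    questions = json.load(f)

answers = [{"question": str(q), "answer": answer_question(str(q))}
           for q in questions]

with open("answer.json", "w") as f:
    json.dump(answers, f, indent=2)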

5 — Generate visualizations

python visualize.py

Plots are saved to data/plots/:

File                                  Description
01_papers_per_category.png            Paper counts by category and publication status
02_submission_trend_over_time.png     Submission volume over time by category
03_publication_status_breakdown.png   Published vs. preprint pie chart
04_abstract_length_distribution.png   Abstract word-count distribution by category

Use --rebuild-clean to re-run clean.sql automatically if the cleaned tables are missing.
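
As an illustration, the first chart could be produced along these lines (the category_stats column names here are assumptions; visualize.py defines the real queries):

# Sketch of the papers-per-category chart; the column names
# "category" and "paper_count" are assumed, not taken from clean.sql.
import sqlite3
import matplotlib.pyplot as plt

conn = sqlite3.connect("data/arxiv.db")
rows = conn.execute(
    "SELECT category, paper_count FROM category_stats ORDER BY paper_count DESC"
).fetchall()
conn.close()

categories, counts = zip(*rows)
plt.bar(categories, counts)
plt.xlabel("Category")
plt.ylabel("Papers")
plt.title("Papers per category")
plt.tight_layout()
plt.savefig("data/plots/01_papers_per_category.png")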


Configuration Reference

Environment Variable     Default                                     Description
OPENROUTER_API_KEY       (none)                                      API key from openrouter.ai; required for the openrouter backend
EMBEDDING_BACKEND        openrouter                                  Embedding backend: local (offline) or openrouter (API-based)
LOCAL_EMBED_MODEL        sentence-transformers/all-minilm-l6-v2      Hugging Face model ID for local embeddings
OPENROUTER_EMBED_MODEL   sentence-transformers/all-minilm-l12-v2     OpenRouter embedding model ID
AUTO_BUILD_INDEX         false                                       Build the vector index on API startup (can be slow)

Using .env File

Create a .env file in the project root (see .env.example):

OPENROUTER_API_KEY=sk-...
EMBEDDING_BACKEND=openrouter
OPENROUTER_EMBED_MODEL=sentence-transformers/all-minilm-l12-v2
AUTO_BUILD_INDEX=false

The .env file is loaded automatically by the application via python-dotenv.
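
Loading follows the standard python-dotenv pattern:

# Read .env into the process environment, then pull values as usual.
import os
from dotenv import load_dotenv

load_dotenv()
backend = os.getenv("EMBEDDING_BACKEND", "openrouter")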


Dependencies

Package                 Purpose
fastapi                 REST API framework
uvicorn                 ASGI server
chromadb                Vector store for embeddings
sentence-transformers   Local embedding model
requests                HTTP client for the OpenRouter API
python-dotenv           Load environment variables from .env
pandas                  Data processing and CSV loading
matplotlib              Visualization and charting

License

This project was created as a competition submission for the TechHub CodeSprint Challenge 2026.
