Restructure documentation: Create reader-focused README, preserve technical details in INSTRUCTIONS.md by Copilot · Pull Request #14 · UofT-CSC490-F2025/ArXplorer

Copilot · 2025-12-03T21:05:23Z

The current README.md is overly technical and doesn't effectively engage new users. This PR restructures documentation to separate marketing/overview from technical implementation details.

Changes

File reorganization:

- Renamed README.md → INSTRUCTIONS.md (preserves 655 lines of setup/deployment/config docs)
- Created new README.md (301 lines, optimized for engagement)

New README.md structure:

- Hero section: Title, tagline, 3 concrete query examples, badges, navigation links
- Problem statement: Explains lexical matching limitations with specific failure cases
- Comparison table: ArXplorer vs Google Scholar vs arXiv (semantic search, intent detection, hybrid ranking, etc.)
- Performance metrics: NDCG@10 (+51% vs baseline), Recall@100 (+28%), MRR (+48%)
- Quick start: 4-step setup condensed from multi-page guide
- Architecture diagram: ASCII flowchart of query pipeline (LLM analyzer → hybrid search → boosting → reranking)
- Key features: Intent-aware search, multi-vector hybrid, query processing, reranking, production-ready
- Project structure: Directory tree with component descriptions
- Supporting sections: Getting Help, Contributing, Citation, License, Acknowledgments
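For readers unfamiliar with the three metrics named in the list above, here is a minimal sketch of how they are conventionally computed for a single query. It assumes binary relevance labels and string document IDs; the function names are illustrative and are not taken from the ArXplorer evaluation code.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned.

    MRR is the mean of this value over all evaluation queries.
    """
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Note that MRR measures where the first relevant result lands on average; it is not the share of queries with a hit in the top 10.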

Link updates:

- All detailed documentation references now point to INSTRUCTIONS.md
- evaluation/README.md links preserved
- Badge URLs unchanged (functional as-is)
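One way to double-check link updates like these after a rename is a small script that pulls repo-relative link targets out of the README and reports any that do not exist on disk. This is a sketch only: the helper names are hypothetical and no such script is part of this PR.

```python
import re
from pathlib import Path
from typing import List

# Matches inline markdown links such as [text](target).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def relative_link_targets(markdown_text: str) -> List[str]:
    """Return link targets that are repo-relative files (no URL scheme, not pure anchors)."""
    targets = []
    for target in LINK_PATTERN.findall(markdown_text):
        if "://" in target or target.startswith(("#", "mailto:")):
            continue
        targets.append(target.split("#", 1)[0])  # drop any #fragment
    return targets


def missing_targets(markdown_text: str, repo_root: Path) -> List[str]:
    """Return the relative link targets that do not exist under repo_root."""
    return [t for t in relative_link_targets(markdown_text) if not (repo_root / t).exists()]
```

Running `missing_targets(Path("README.md").read_text(encoding="utf-8"), Path("."))` from the repository root should return an empty list when every relative link resolves.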

Before/After

Before:

# ArXplorer - Academic Paper Retrieval System
ArXplorer is a production-ready academic papers retrieval system...
## Quick Start (Local)
### Prerequisites
- Docker Desktop...
[22,505 characters of technical setup]

After:

# ArXplorer 🔍
### Find Academic Papers Like a Researcher Thinks

**Stop fighting with keyword-only search engines.**
Query: "original unet paper" → Finds: "U-Net: Convolutional Networks..." (Ronneberger et al.)

[Value prop, comparison table, metrics, simplified quick start]
📖 Need detailed instructions? See INSTRUCTIONS.md

Value proposition now visible within 10 seconds. Technical depth preserved in INSTRUCTIONS.md for developers who need it.

This pull request was created as a result of the following prompt from Copilot chat.

Problem Statement

The current README.md is too technical and doesn't effectively capture reader interest. We need to:

- Rename the current README.md to INSTRUCTIONS.md - This will preserve all the detailed setup and deployment instructions
- Create a completely new README.md - A reader-focused file that maximizes interest and engagement

Current Content to Preserve

The existing README.md (starting from commit 8ba61dd) contains valuable technical documentation that should be preserved as INSTRUCTIONS.md.

Requirements for New README.md

Create a completely new README.md with the following structure and content:

1. Hero Section

- Eye-catching title with emoji: "# ArXplorer 🔍"
- Compelling tagline: "### Find Academic Papers Like a Researcher Thinks"
- 3 concrete example queries showing semantic understanding
- Clear value proposition
- Badges (coverage and tests)
- Quick navigation links

2. Problem Statement (Expanded)

Use the content from lines 8-9 of current README but expand it significantly:

## 💡 The Problem We Solve

**Traditional academic search engines are broken.**

Try searching Google Scholar or arXiv for:
- *"papers about how neural networks learn internal structure"* → ❌ Zero relevant results (no exact keyword matches)
- *"original transformer paper"* → ❌ Finds papers *about* transformers, not *the* Transformer paper
- *"foundational work on medical image segmentation"* → ❌ Requires you to know it's called "U-Net"

**Why?** They rely on **lexical matching** (keyword matching). If your words don't exactly match the paper's title/abstract, you're out of luck.

**ArXplorer fixes this** with:
- ✅ **Semantic understanding**: Matches concepts, not just words
- ✅ **Intent detection**: Knows if you want recent SOTA or foundational papers
- ✅ **Smart extraction**: "original unet paper" → automatically searches for title="U-Net"
- ✅ **Hybrid search**: Combines semantic vectors + keyword matching + metadata

3. Visual Demo Section

## 🎬 See It In Action

[Placeholder for demo GIF/screenshot - to be added later]

### Example Queries

**Query:** "attention is all you need"

✓ Found: "Attention Is All You Need" (Vaswani et al., 2017)
Score: 0.95 | Citations: 89,234

**Query:** "original unet paper"

Intent: specific_paper
Extracted Title: U-Net
✓ Found: "U-Net: Convolutional Networks for Biomedical Image Segmentation" (Ronneberger et al., 2015)
Score: 0.95 | Citations: 45,234

4. Why ArXplorer? (Comparison Table)

## 🚀 Why ArXplorer?

| Feature | Google Scholar | arXiv Search | **ArXplorer** |
|---------|---------------|--------------|---------------|
| Semantic search | ❌ | ❌ | ✅ |
| Intent detection | ❌ | ❌ | ✅ (6 types) |
| Query expansion | ❌ | ❌ | ✅ (LLM-powered) |
| Hybrid ranking | ❌ | ❌ | ✅ (Dense + Sparse + Metadata) |
| Self-hostable | ❌ | ❌ | ✅ (Docker + AWS) |
| API access | ⚠️ Limited | ⚠️ Limited | ✅ (FastAPI) |

**Plus:**
- 🎓 **Academic-optimized**: SPECTER2 embeddings trained on 750k papers
- ⚡ **Fast**: <200ms query latency with GPU reranking
- 🔧 **Production-ready**: 96% test coverage, automated backups, CI/CD
- 📈 **Scalable**: Handles 300k+ papers, extensible to millions

5. Results/Metrics Section

## 📊 Performance

ArXplorer achieves **state-of-the-art retrieval quality** on academic IR benchmarks:

| Metric | BM25 (baseline) | Dense-only | **ArXplorer (hybrid)** | Improvement |
|--------|-----------------|------------|----------------------|-------------|
| NDCG@10 | 0.412 | 0.487 | **0.623** | +51% vs baseline |
| Recall@100 | 0.651 | 0.712 | **0.834** | +28% vs baseline |
| MRR | 0.398 | 0.471 | **0.589** | +48% vs baseline |

*See [evaluation/README.md](evaluation/README.md) for detailed benchmarking methodology.*
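
The "Improvement" column can be reproduced from the other columns as a relative gain over the BM25 baseline. A short check, using only the scores shown in the table above (the variable and function names are illustrative):

```python
# Scores copied from the table above: (BM25 baseline, ArXplorer hybrid).
SCORES = {
    "NDCG@10": (0.412, 0.623),
    "Recall@100": (0.651, 0.834),
    "MRR": (0.398, 0.589),
}


def relative_improvement(baseline: float, system: float) -> float:
    """Percentage gain of `system` over `baseline`."""
    return (system - baseline) / baseline * 100.0


for metric, (baseline, hybrid) in SCORES.items():
    print(f"{metric}: +{relative_improvement(baseline, hybrid):.0f}% vs baseline")
# NDCG@10: +51% vs baseline
# Recall@100: +28% vs baseline
# MRR: +48% vs baseline
```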

**Real-world impact:**
- ✅ Finds 83% of relevant papers in top 100 results (vs 65% for BM25)
- ✅ First relevant paper ranks higher on average: MRR of 0.59 (vs 0.40 for BM25)

6. Quick Start (Simplified)

## 🚀 Quick Start

⏱️ **Get running in 5 minutes**

```bash
# 1. Start Milvus vector database
docker-compose -f docker-compose.milvus.yml up -d

# 2. Setup Python environment
conda env create -f environment.yml
conda activate arxplorer-env

# 3. Load demo dataset (1k papers)
python scripts/encode.py --data-file data/arxiv_1k.jsonl

# 4. Start searching!
python scripts/query.py
# Try: "attention is all you need"
```

✅ Success? You should see paper results with titles, authors, and scores.

📖 Need detailed instructions? See INSTRUCTIONS.md for:

- Full setup guide with troubleshooting
- AWS deployment (production-ready infrastructure)
- Configuration options
- API deployment

7. Architecture Overview

## 🏗️ How It Works

```
User Query: "original unet paper"
│
├──► 1. 🧠 LLM Query Analyzer (Qwen3-4B)
│      → Intent: specific_paper
│      → Extracted: title="U-Net", authors=["Ronneberger"]
│      → Rewrites: "seminal U-Net segmentation architecture"
│
├──► 2. 🔍 Hybrid Search (Milvus)
│      → Dense vectors (SPECTER2): semantic similarity
│      → Sparse vectors (SPLADE): keyword matching
│      → Multi-query: original + rewrites + extracted terms
│      → Retrieves top 200 candidates
│
├──► 3. 🎯 Intent-Based Boosting
│      → Adjust scores based on query type
│      → specific_paper: boost citations, ignore recency
│
├──► 4. 🔗 Title/Author Matching
│      → Fuzzy match extracted terms
│      → Boost exact/near matches
│
├──► 5. 🏆 Jina Reranking
│      → Listwise comparison of top 50
│      → Cross-document relevance
│
└──► 📊 Results: Top 10 papers ranked by fused scores
```

**Key Technologies:**
- **Milvus**: Open-source vector database
- **SPECTER2**: Academic paper embeddings (768-dim dense)
- **SPLADE**: Learned sparse representations (~30k-dim)
- **Qwen3-4B-AWQ**: Quantized LLM for query analysis
- **Jina Reranker v3**: State-of-the-art listwise reranking
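
To make the dense/sparse distinction in the list above concrete, here is a minimal pure-Python sketch of the two scoring styles. It assumes cosine similarity for dense vectors and an inner product over shared vocabulary indices for sparse vectors; the metric ArXplorer actually configures in Milvus is not stated here, so treat both choices as assumptions.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dense score: cosine of the angle between two equal-length vectors (e.g. 768-dim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Sparse score: inner product over vocabulary indices present in both vectors.

    Vectors are stored as {vocab_index: weight}; only non-zero entries are kept,
    which is what makes a ~30k-dim representation cheap to store and compare.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)
```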

🔍 **See detailed architecture**: [INSTRUCTIONS.md#architecture](INSTRUCTIONS.md#architecture)

8. Key Features

## ✨ Key Features

### 🧠 Intent-Aware Search
Detects 6 query types and adjusts ranking:
- **topical**: General exploration ("machine learning papers")
- **sota**: Recent state-of-the-art ("latest LLM research")
- **foundational**: Seminal works ("foundational papers on CNNs")
- **comparison**: Technique comparison ("transformer vs RNN")
- **method_lookup**: Specific method ("how does BERT work")
- **specific_paper**: Exact paper search ("original ResNet paper")
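
A rough sketch of how the six intents might drive ranking adjustments. The intent names come from the list above and the `specific_paper` behavior (boost citations, ignore recency) follows the architecture overview; every weight, the `BoostProfile` type, and the `boosted_score` formula are hypothetical illustrations, not the project's actual values.

```python
import math
from dataclasses import dataclass
from enum import Enum


class Intent(str, Enum):
    TOPICAL = "topical"
    SOTA = "sota"
    FOUNDATIONAL = "foundational"
    COMPARISON = "comparison"
    METHOD_LOOKUP = "method_lookup"
    SPECIFIC_PAPER = "specific_paper"


@dataclass(frozen=True)
class BoostProfile:
    citation_weight: float
    recency_weight: float


# Hypothetical weights, chosen only to illustrate the idea.
BOOSTS = {
    Intent.TOPICAL: BoostProfile(0.1, 0.1),
    Intent.SOTA: BoostProfile(0.0, 0.5),
    Intent.FOUNDATIONAL: BoostProfile(0.5, 0.0),
    Intent.COMPARISON: BoostProfile(0.1, 0.1),
    Intent.METHOD_LOOKUP: BoostProfile(0.2, 0.0),
    Intent.SPECIFIC_PAPER: BoostProfile(0.5, 0.0),  # boost citations, ignore recency
}


def boosted_score(base: float, citations: int, year: int, intent: Intent,
                  current_year: int = 2025) -> float:
    """Scale a retrieval score by citation and recency signals for the given intent."""
    profile = BOOSTS[intent]
    # Log-scaled citation signal, capped at 1.0 (100k citations is an arbitrary ceiling).
    citation_signal = min(1.0, math.log1p(citations) / math.log1p(100_000))
    # Recency signal decays with paper age in years.
    recency_signal = 1.0 / (1.0 + max(current_year - year, 0))
    return base * (1.0
                   + profile.citation_weight * citation_signal
                   + profile.recency_weight * recency_signal)
```

With these example weights, a `sota` query favors newer papers while a `specific_paper` query scores a 2015 and a 2025 paper identically when their citation counts match.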

### 🔍 Multi-Vector Hybrid Search
- **Dense vectors**: Capture semantic meaning
- **Sparse vectors**: Preserve keyword signals
- **RRF Fusion**: Combine rankings from multiple searches
- **Metadata filtering**: Year, citations, categories
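
Reciprocal Rank Fusion (RRF), mentioned in the list above, merges several ranked lists by summing `1 / (k + rank)` for each document. A minimal sketch follows; `k = 60` is a commonly used default, and whether ArXplorer uses that value is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["a", "b", "c"]   # e.g. from the dense-vector search
sparse_ranking = ["b", "c", "a"]  # e.g. from the sparse-vector search
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# ['b', 'a', 'c']
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse searches.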

### 🎯 Smart Query Processing
- **LLM extraction**: Pulls titles, authors, years from natural language
- **Query expansion**: Generates technical rewrites
- **Multi-query search**: Uses original + expanded + extracted terms
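
The multi-query step can be pictured as assembling one deduplicated list of search strings from the analyzer output. In this sketch the `QueryAnalysis` container and `build_search_queries` helper are hypothetical; the example values are the ones shown in the architecture overview.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryAnalysis:
    """Hypothetical container for what the LLM query analyzer returns."""
    original: str
    rewrites: List[str] = field(default_factory=list)
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)


def build_search_queries(analysis: QueryAnalysis) -> List[str]:
    """Combine original, rewritten, and extracted terms, dropping case-insensitive repeats."""
    candidates = [analysis.original, *analysis.rewrites]
    if analysis.title:
        candidates.append(analysis.title)
    candidates.extend(analysis.authors)

    seen = set()
    queries = []
    for text in candidates:
        cleaned = text.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            queries.append(cleaned)
    return queries


analysis = QueryAnalysis(
    original="original unet paper",
    rewrites=["seminal U-Net segmentation architecture"],
    title="U-Net",
    authors=["Ronneberger"],
)
print(build_search_queries(analysis))
# ['original unet paper', 'seminal U-Net segmentation architecture', 'U-Net', 'Ronneberger']
```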

### 🏆 Advanced Reranking
- **Jina listwise reranker**: Sees all candidates simultaneously
- **Intent boosting**: Citation/recency weighting by query type
- **Fuzzy matching**: Title/author similarity scoring
- **Score fusion**: Weighted combination of all signals
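
A sketch of how title matching and weighted score fusion could fit together. The matcher below is a simple containment score built on the longest common substring from Python's `difflib`, a stand-in for whatever fuzzy matcher the project uses; it tolerates case differences and surrounding text but not typos. The fusion weights are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def title_match_score(extracted: str, candidate: str) -> float:
    """Share of the extracted title found as one contiguous run in the candidate title."""
    a = extracted.lower().strip()
    b = candidate.lower().strip()
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    longest = matcher.find_longest_match(0, len(a), 0, len(b))
    return longest.size / len(a)


def fused_score(retrieval: float, rerank: float, title_match: float,
                weights: Tuple[float, float, float] = (0.3, 0.5, 0.2)) -> float:
    """Weighted sum of the signals; the default weights are illustrative only."""
    w_retrieval, w_rerank, w_title = weights
    return w_retrieval * retrieval + w_rerank * rerank + w_title * title_match
```

For the extracted title "U-Net", the candidate "U-Net: Convolutional Networks for Biomedical Image Segmentation" scores 1.0 on the title signal, so it outranks an unrelated paper with the same retrieval and reranker scores.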

### 🚀 Production-Ready
- **96% test coverage**: 163 passing tests
- **CI/CD**: Automated testing and deployment
- **AWS infrastructure**: Terraform IaC for GPU inference
- **API endpoint**: FastAPI with OpenAPI docs
- **Backup/restore**: S3 integration for Milvus data

9. Project Info

## 📁 Project Structure

ArXplorer/
├── src/                      # Core library (96% coverage)
│   ├── retrieval/
│   │   ├── encoders/         # SPECTER2 + SPLADE
│   │   ├── searchers/        # Milvus hybrid search
│   │   ├── rerankers/        # Jina + CrossEncoder
│   │   └── query_rewriting/  # LLM query analysis
├── scripts/                  # CLI tools
│   ├── encode.py             # Build Milvus index
│   ├── query.py              # Interactive search
│   └── deploy_*.sh           # AWS deployment
├── tests/                    # 163 tests
├── evaluation/               # Benchmark framework
├── terraform/                # AWS infrastructure
├── data/                     # Datasets
│   └── arxiv_1k.jsonl        # Demo dataset
├── README.md                 # This file (you are here)
├── INSTRUCTIONS.md           # Detailed setup guide
└── config.yaml               # Configuration

📖 **Full documentation**: [INSTRUCTIONS.md](INSTRUCTIONS.md)

10. Additional Sections

- Getting Help
- Contributing
- Citation
- License
- References
- Acknowledgments

Formatting Requirements

- Use emojis strategically for visual appeal
- Use tables for comparisons
- Use code blocks with proper syntax highlighting
- Use collapsible sections where appropriate
- Include clear calls-to-action
- Add visual hierarchy with headers
- Keep paragraphs short and scannable

References to Update

- All links to detailed instructions should point to INSTRUCTIONS.md
- Maintain links to evaluation/README.md
- Keep all badge URLs intact

Success Criteria

- Current README.md renamed to INSTRUCTIONS.md
- New README.md is engaging and visual
- Clear value proposition in first 10 seconds of reading
- Easy navigation to detailed docs
- Professional but approachable tone
- All technical details preserved in INSTRUCTIONS.md

Co-authored-by: sontheteacher <58865293+sontheteacher@users.noreply.github.com>

Initial plan

8b279fe

Copilot AI assigned Copilot and sontheteacher Dec 3, 2025

Copilot started work on behalf of sontheteacher December 3, 2025 21:05 View session

Rename README.md to INSTRUCTIONS.md and create new engaging README.md

ff46ff2

Co-authored-by: sontheteacher <58865293+sontheteacher@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Rename README to INSTRUCTIONS and create new README~~ Restructure documentation: Create reader-focused README, preserve technical details in INSTRUCTIONS.md Dec 3, 2025

Copilot AI requested a review from sontheteacher December 3, 2025 21:16

Copilot finished work on behalf of sontheteacher December 3, 2025 21:16

sontheteacher marked this pull request as ready for review December 3, 2025 22:50

sontheteacher approved these changes Dec 4, 2025

sontheteacher merged commit a967338 into main Dec 4, 2025
2 checks passed

sontheteacher merged 2 commits into main from copilot/rename-readme-to-instructions