- 1. Project Overview
- 2. Folder Structure
- 3. Architecture Overview
- 4. Scraping Pipeline
- 5. Chunking, Embedding & FAISS Storage
- 6. Streamlit UI Chatbot
- 7. Evaluation Strategy
- 8. Improvements and Enhancements
- 9. How to Run
- 10. Future Work
- 11. License
- 12. Acknowledgements
This project implements an end-to-end Retrieval-Augmented Generation (RAG) system designed to generate high-quality answers from scraped Wikipedia pages under the "Outline of computer science" category. The solution leverages state-of-the-art tools and models to build a production-ready pipeline, supporting everything from data scraping to question-answer evaluation.
- Pages under Wikipedia's "Outline of Computer Science"
- Categories: Algorithms, Data Structures, Programming Languages, Databases, Networking, AI, etc.
- Scraping Wikipedia pages and storing them as clean JSON
- Chunking the content with `RecursiveCharacterTextSplitter`
- Embedding using the `BAAI/bge-large-en-v1.5` model
- Storing embeddings in a FAISS vector database
- Reranking with `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Answer generation with `llama3:8b` via Ollama
- UI built in Streamlit
- Evaluation using both LLM-based (RAGAS) and non-LLM-based (BERTScore, ROUGE, BLEU) methods
- Q&A Generation: Ground-truth question-answer pairs were generated by manually pasting the scraped content into ChatGPT (OpenAI).
- Answer and Context Generation: Generated from the RAG pipeline using LLaMA3 via Ollama.
- Evaluation: Conducted with Gemini-2.0-Flash, so the judging model differs from the one that generated the answers and self-evaluation bias is avoided.
RAG Project/
│
├── app/ # Streamlit UI
│ ├── ui_app.py # Main chatbot UI
│ ├── test_chatbot_ui.py # UI test cases
│ └── logs/ # Chat interaction logs
│
├── scrapers/ # Scraping Wikipedia content
│ ├── wiki_scraper.py # Core scraper module
│ ├── test_scraper.py # Unit tests
│ ├── test_result_scraper/ # Logs and results
│ └── data/ # Raw JSON output from scraping
│
├── src/ # Core RAG pipeline
│ ├── rag_pipeline.py # Chunking + embedding + saving
│ ├── mini_rag.py # Minimal version for testing
│ ├── test_ragPipeline.py # Pipeline unit tests
│ └── embeddings/ # FAISS index stored here
│
├── GroundTruth_Context_Generation/
│ ├── eval_data.csv # Ground-truth Q&A
│ ├── eval_data_enriched.csv # Enriched with generated answers/context
│ ├── groundTruth_context.py # Script to enrich with generation
│
├── bert_evaluation/ # Traditional metric-based evaluation
│ ├── bert_score_eval.py # Main script (BERTScore, ROUGE, BLEU)
│ ├── input.csv # Evaluation input
│ └── bert_eval_output/
│ ├── logs/ # Logs
│ └── outputs/ # Evaluation results
│
├── ragas_eval_custom/ # Gemini/LLM-based custom evaluator (RAGAS mimic)
│ ├── evaluate.py # Main evaluation script
│ ├── gemini_client.py # Gemini API client
│ ├── prompts.py # Evaluation prompts
│ ├── score/ # Metric scores are saved here
│ ├── logs_error/ # Logs and errors are saved here
│ └── input/ # Input data for evaluation
│
├── requirements.txt # Dependencies
├── .gitignore # git ignore file
├── .env # for keeping environment variables
├── venv/ # Virtual environment folder (add it to .gitignore)
└── README.md
- Uses `wikipediaapi` and `BeautifulSoup` to extract clean text
- Targets pages from the "Outline of Computer Science"
- Saves JSON content in `scrapers/data/raw_scraped/`
- Logs every scraped URL
- Deletes old scraped files to save space
- Recursively navigates linked pages
- Extracts paragraphs and skips unrelated content
- Saves clean, structured output as JSON
- `scrape_outline()`: Starts the full scraping session
- `parse_page()`: Extracts clean text from one page
- `save_json()`: Dumps structured content to disk
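A minimal sketch of the scraping flow, assuming the `wikipedia-api` package and the folder layout above; the BeautifulSoup cleanup step is omitted, and the function signatures are illustrative rather than the exact ones in `wiki_scraper.py`:

```python
import json
from pathlib import Path

import wikipediaapi  # pip install wikipedia-api

# A descriptive user agent is required by Wikipedia's API policy.
wiki = wikipediaapi.Wikipedia(user_agent="rag-project-scraper/0.1", language="en")

OUT_DIR = Path("scrapers/data/raw_scraped")
OUT_DIR.mkdir(parents=True, exist_ok=True)


def save_json(record: dict) -> None:
    """Dump one scraped page to disk as structured JSON."""
    path = OUT_DIR / f"{record['title'].replace('/', '_')}.json"
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")


def scrape_outline(root_title: str = "Outline of computer science", limit: int = 50) -> None:
    """Walk pages linked from the outline page and save each one as JSON."""
    root = wiki.page(root_title)
    for i, (title, page) in enumerate(root.links.items()):
        if i >= limit:
            break
        if not page.exists():
            continue
        print(f"Scraping: {page.fullurl}")  # log every scraped URL
        save_json({"title": title, "url": page.fullurl, "text": page.text})


if __name__ == "__main__":
    scrape_outline()
```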
- Tool: `RecursiveCharacterTextSplitter`
- Config: chunk size = 512, overlap = 50
| Method | Pros | Cons |
|---|---|---|
| Fixed-size (naïve) | Simple, fast | Breaks context mid-sentence |
| RecursiveCharacterTextSplitter | Preserves structure (headings etc.) | Slightly slower |
| Token-based (e.g., tiktoken) | More accurate with LLMs | More complex |
➡️ Selected `RecursiveCharacterTextSplitter` because it preserves sentence boundaries and headings.
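A minimal sketch of that configuration with LangChain's splitter (the variable names are illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # maximum characters per chunk
    chunk_overlap=50,  # characters shared between consecutive chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # prefer paragraph, line, then sentence breaks
)

document_text = "..."  # cleaned text loaded from scrapers/data/raw_scraped/
chunks = splitter.split_text(document_text)
```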
- Model Used: `BAAI/bge-large-en-v1.5`
- Why Chosen:
  - Optimized for English
  - Trained for retrieval tasks
  - Better semantic separation
  - Parameters: 300M+
  - Dimensionality: 1024
  - Pretrained on English web corpora
| Model | Dim | Retrieval Quality | Speed | Comments |
|---|---|---|---|---|
| `sentence-transformers/all-MiniLM-L6-v2` | 384 | Medium | Fast | Lightweight, baseline |
| `BAAI/bge-small-en-v1.5` | 384 | Medium-High | Fast | Good for compact systems |
| `BAAI/bge-large-en-v1.5` | 1024 | High | Medium | Best semantic matching so far |
| `multi-qa-mpnet-base-dot-v1` | 768 | High | Medium | Multilingual support |
- Load JSON data
- Clean and deduplicate chunks
- Apply recursive chunking
- Generate embeddings via `bge-large-en-v1.5`
- Save to a FAISS index in `src/embeddings/full_faiss_index/`
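A sketch of the embed-and-index step, assuming `sentence-transformers` and `faiss` are called directly; the actual `rag_pipeline.py` may instead use LangChain's FAISS wrapper, and the index file name below is illustrative:

```python
import os

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# `chunks` is the list of cleaned, deduplicated text chunks from the previous steps.
chunks = ["example chunk one", "example chunk two"]
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, 1024)

# With normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

os.makedirs("src/embeddings/full_faiss_index", exist_ok=True)
faiss.write_index(index, "src/embeddings/full_faiss_index/index.faiss")
```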
- Loads FAISS index
- Retrieves top-k chunks
- Reranks using `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Sends the best-ranked chunks to LLaMA 3 (8B) via Ollama
- Displays chat interface and logs interactions
- Clear historical context display
- Real-time generation
- Logging user queries and responses
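A condensed sketch of the retrieve → rerank → generate path behind the UI, assuming the `sentence-transformers` cross-encoder and the `ollama` Python client; the top-k values, prompt wording, and index file name are illustrative:

```python
import faiss
import numpy as np
import ollama  # pip install ollama; requires a local Ollama server with llama3:8b pulled
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
index = faiss.read_index("src/embeddings/full_faiss_index/index.faiss")
chunks: list[str] = []  # chunk texts stored alongside the index (loading omitted)


def answer(query: str, k: int = 20, top_n: int = 5) -> str:
    # 1. Dense retrieval: top-k nearest chunks from FAISS.
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    candidates = [chunks[i] for i in ids[0] if i != -1]

    # 2. Rerank candidates with the cross-encoder and keep the best few.
    scores = reranker.predict([(query, c) for c in candidates])
    best = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:top_n]]

    # 3. Generate the answer with llama3:8b via Ollama.
    context = "\n\n".join(best)
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    reply = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```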
- Input: `input.csv` with the question, ground-truth answer, and generated answer
- Output: ROUGE, BLEU, BERTScore
| Metric | Compares | Score Range | Meaning |
|---|---|---|---|
| BLEU | n-gram overlap | 0–1 | Precision-oriented score |
| ROUGE | recall-oriented overlap | 0–1 | Measures recall of n-grams |
| BERTScore | contextual embeddings (BERT) | 0–1 | Semantic similarity (better than n-grams) |
- BLEU: Measures how much the generated answer overlaps with ground truth using n-gram matching.
- ROUGE: Measures how much of the ground truth is captured in the generated answer.
- BERTScore: Uses BERT embeddings to compare semantic similarity of sentences.
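A compact sketch of how the three scores can be computed for a single pair, assuming the `bert-score`, `rouge-score`, and `nltk` packages; `bert_score_eval.py` itself iterates over `input.csv` and writes results to `bert_eval_output/`:

```python
from bert_score import score as bert_score
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "A binary search tree keeps keys in sorted order for fast lookup."
candidate = "A binary search tree stores keys in sorted order, enabling fast lookups."

# BLEU: n-gram precision of the candidate against the reference.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence precision/recall/F1.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, candidate)

# BERTScore: semantic similarity from contextual embeddings.
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU={bleu:.3f}  ROUGE-L F1={rouge['rougeL'].fmeasure:.3f}  BERTScore F1={f1.item():.3f}")
```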
- Uses Gemini to score the following metrics (a prompt sketch follows the table below):
  - Faithfulness
  - Context relevance
  - Answer correctness
  - Context precision
  - Context recall
| Metric | Score Range | Meaning |
|---|---|---|
| Faithfulness | 0–1 | How well the answer stays true to context |
| Relevance | 0–1 | How relevant the answer is to the question |
| Correctness | 0–1 | How correct the answer is compared to ground truth |
| Precision | 0–1 | How well the context supports the answer |
| Recall | 0–1 | How complete the answer is given the context |
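As referenced above, a rough sketch of how one of these metrics can be requested from Gemini, assuming the `google-generativeai` SDK; the prompt wording and environment variable name are illustrative and not the exact ones in `prompts.py` or `gemini_client.py`:

```python
import os

import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # key name is illustrative; see .env
model = genai.GenerativeModel("gemini-2.0-flash")


def score_faithfulness(question: str, context: str, answer: str) -> float:
    """Ask Gemini for a 0-1 faithfulness score and parse the number it returns."""
    prompt = (
        "Rate how faithful the answer is to the context on a scale from 0 to 1. "
        "Reply with only the number.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = model.generate_content(prompt)
    return float(response.text.strip())
```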
- Scores Obtained:
  - Faithfulness: 0.9851
  - Relevance: 0.9902
  - Correctness: 0.9480
  - Precision: 0.9644
  - Recall: 0.9030
- Interpretation: Higher scores indicate better performance. The scores obtained are high, indicating strong performance in staying faithful to the retrieved context and relevant to the question.
- Reliability: The method is reliable for contextual and semantic evaluation, but it may be less precise on isolated metrics than traditional, reference-based methods.
- Simple Average: Calculated as the mean of all metric scores.
- Weighted Average: Calculated using the following weights:
  - Faithfulness: 0.30
  - Relevance: 0.25
  - Correctness: 0.25
  - Precision: 0.10
  - Recall: 0.10
- Final Scores:
  - Simple Average Score: 0.9581
  - Weighted Average Score: 0.9668
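A quick sanity check of the two aggregates, using the per-metric scores and weights listed above:

```python
scores = {"faithfulness": 0.9851, "relevance": 0.9902, "correctness": 0.9480,
          "precision": 0.9644, "recall": 0.9030}
weights = {"faithfulness": 0.30, "relevance": 0.25, "correctness": 0.25,
           "precision": 0.10, "recall": 0.10}

simple = sum(scores.values()) / len(scores)              # -> 0.9581
weighted = sum(scores[m] * weights[m] for m in scores)   # -> 0.9668
print(f"simple={simple:.4f}  weighted={weighted:.4f}")
```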
| Metric Type | Tool Used | Key Insight |
|---|---|---|
| Non-LLM | BERTScore | Precise but surface level |
| LLM-based | Gemini | Deeper, contextual judgment |
- Pros:
  - Provides a deeper, more contextual evaluation.
  - Can capture nuances missed by traditional metrics.
- Cons:
  - May not be as precise in isolated metric evaluation.
  - Relies on the quality of the LLM and the prompt design.
- Rewrote scraper to log and clean more effectively
- Replaced `bge-small` with `bge-large-en-v1.5` for better semantic quality
- Integrated a reranker to boost answer relevance
- Improved chunking strategy (overlap + recursion)
- Streamlit UI now shows previous questions and logs interactions
- Enhanced evaluation: both LLM and traditional metrics now included
- Clone the Repository:
  - `git clone [repository-url]`
  - `cd RAG-Project`
- Set Up Environment:
  - Install dependencies: `pip install -r requirements.txt`
  - Set up Ollama for LLaMA 3: install Ollama (https://ollama.com) and pull the model with `ollama pull llama3:8b`
- Run the Pipeline:
  - Scrape Wikipedia: `python scrapers/wiki_scraper.py`
  - Run chunking and embedding: `python src/rag_pipeline.py`
  - Launch the chatbot UI: `streamlit run app/ui_app.py`
- Evaluate:
  - With BERTScore/ROUGE/BLEU: `python bert_evaluation/bert_score_eval.py`
  - With Gemini: `python ragas_eval_custom/evaluate.py`
- Complete RAGAS-based testing
- Add support for multilingual scraping and embedding
- Experiment with hybrid reranking (BM25 + cross-encoder)
- Extend evaluation framework with Hallucination detection (e.g., TruthfulQA)
- Integrate feedback loop to improve QA generation over time
MIT License
- BAAI for BGE model
- HuggingFace for Transformers & evaluation
- Langchain for inspiration on RAG design
- Google (Gemini) and Meta (LLaMA 3) for the LLMs
