- 1. Project Overview
- 2. Folder Structure
- 3. Architecture Overview
- 4. Scraping Pipeline
- 5. Chunking, Embedding & FAISS Storage
- 6. Streamlit UI Chatbot
- 7. Evaluation Strategy
- 8. Improvements and Enhancements
- 9. How to Run
- 10. Future Work
- 11. License
- 12. Acknowledgements
This project implements an end-to-end Retrieval-Augmented Generation (RAG) system designed to generate high-quality answers from scraped Wikipedia pages under the "Outline of computer science" category. The solution leverages state-of-the-art tools and models to build a production-ready pipeline, supporting everything from data scraping to question-answer evaluation.
- Pages under Wikipedia's "Outline of Computer Science"
- Categories: Algorithms, Data Structures, Programming Languages, Databases, Networking, AI, etc.
- Scraping Wikipedia pages and storing them as clean JSON
- Chunking the content with `RecursiveCharacterTextSplitter`
- Embedding using the `BAAI/bge-large-en-v1.5` model
- Storing embeddings in a FAISS vector database
- Reranking with `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Answer generation with `llama3:8b` via Ollama
- UI built in Streamlit
- Evaluation using both LLM-based (RAGAS) and non-LLM-based (BERTScore, ROUGE, BLEU) methods
- Q&A Generation: Ground-truth question-answer pairs were generated by manually pasting the scraped content into ChatGPT (OpenAI).
- Answer and Context Generation: Generated from the RAG pipeline using LLaMA3 via Ollama.
- Evaluation: Conducted with Gemini-2.0-Flash, so the judging model differs from the one that generated the answers and self-evaluation bias is avoided.
RAG Project/
│
├── app/ # Streamlit UI
│ ├── ui_app.py # Main chatbot UI
│ ├── test_chatbot_ui.py # UI test cases
│ └── logs/ # Chat interaction logs
│
├── scrapers/ # Scraping Wikipedia content
│ ├── wiki_scraper.py # Core scraper module
│ ├── test_scraper.py # Unit tests
│ ├── test_result_scraper/ # Logs and results
│ └── data/ # Raw JSON output from scraping
│
├── src/ # Core RAG pipeline
│ ├── rag_pipeline.py # Chunking + embedding + saving
│ ├── mini_rag.py # Minimal version for testing
│ ├── test_ragPipeline.py # Pipeline unit tests
│ └── embeddings/ # FAISS index stored here
│
├── GroundTruth_Context_Generation/
│ ├── eval_data.csv # Ground-truth Q&A
│ ├── eval_data_enriched.csv # Enriched with generated answers/context
│ ├── groundTruth_context.py # Script to enrich with generation
│
├── bert_evaluation/ # Traditional metric-based evaluation
│ ├── bert_score_eval.py # Main script (BERTScore, ROUGE, BLEU)
│ ├── input.csv # Evaluation input
│ └── bert_eval_output/
│ ├── logs/ # Logs
│ └── outputs/ # Evaluation results
│
├── ragas_eval_custom/ # Gemini/LLM-based custom evaluator (RAGAS mimic)
│ ├── evaluate.py # Main evaluation script
│ ├── gemini_client.py # Gemini API client
│ ├── prompts.py # Evaluation prompts
│ ├── score/ # Metric scores are saved here
│ ├── logs_error/ # Logs and errors are saved here
│ └── input/ # Input data for evaluation
│
├── requirements.txt # Dependencies
├── .gitignore # git ignore file
├── .env # for keeping environment variables
├── venv/ # Virtual environment folder (add it to .gitignore)
└── README.md
- Uses `wikipediaapi` and `BeautifulSoup` to extract clean text
- Targets pages from the "Outline of Computer Science"
- Saves JSON content in `scrapers/data/raw_scraped/`
- Logs every scraped URL
- Deletes old scraped files to save space
- Recursively navigates linked pages
- Extracts paragraphs and skips unrelated content
- Saves clean, structured output as JSON
- `scrape_outline()`: Starts the full scraping session
- `parse_page()`: Extracts clean text from one page
- `save_json()`: Dumps structured content to disk
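A minimal sketch of the scraping flow, assuming the `wikipedia-api` package and the folder layout above; the BeautifulSoup cleanup step is omitted, and the function signatures are illustrative rather than the exact ones in `wiki_scraper.py`:

```python
import json
from pathlib import Path

import wikipediaapi  # pip install wikipedia-api

# A descriptive user agent is required by Wikipedia's API policy.
wiki = wikipediaapi.Wikipedia(user_agent="rag-project-scraper/0.1", language="en")

OUT_DIR = Path("scrapers/data/raw_scraped")
OUT_DIR.mkdir(parents=True, exist_ok=True)


def save_json(record: dict) -> None:
    """Dump one scraped page to disk as structured JSON."""
    path = OUT_DIR / f"{record['title'].replace('/', '_')}.json"
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")


def scrape_outline(root_title: str = "Outline of computer science", limit: int = 50) -> None:
    """Walk pages linked from the outline page and save each one as JSON."""
    root = wiki.page(root_title)
    for i, (title, page) in enumerate(root.links.items()):
        if i >= limit:
            break
        if not page.exists():
            continue
        print(f"Scraping: {page.fullurl}")  # log every scraped URL
        save_json({"title": title, "url": page.fullurl, "text": page.text})


if __name__ == "__main__":
    scrape_outline()
```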
- Tool: `RecursiveCharacterTextSplitter`
- Config: chunk size = 512, overlap = 50
| Method | Pros | Cons |
|---|---|---|
| Fixed-size (naïve) | Simple, fast | Breaks context mid-sentence |
| RecursiveCharacterTextSplitter | Preserves structure (headings etc.) | Slightly slower |
| Token-based (e.g., tiktoken) | More accurate with LLMs | More complex |
➡️ Selected `RecursiveCharacterTextSplitter` because it preserves sentence boundaries and headings.
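A minimal sketch of that configuration with LangChain's splitter (the variable names are illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # maximum characters per chunk
    chunk_overlap=50,  # characters shared between consecutive chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # prefer paragraph, line, then sentence breaks
)

document_text = "..."  # cleaned text loaded from scrapers/data/raw_scraped/
chunks = splitter.split_text(document_text)
```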
- Model Used: `BAAI/bge-large-en-v1.5`
- Why Chosen:
  - Optimized for English
  - Trained for retrieval tasks
  - Better semantic separation
  - Parameters: 300M+
  - Dimensionality: 1024
  - Pretrained on English web corpora
| Model | Dim | Retrieval Quality | Speed | Comments |
|---|---|---|---|---|
| `sentence-transformers/all-MiniLM-L6-v2` | 384 | Medium | Fast | Lightweight, baseline |
| `BAAI/bge-small-en-v1.5` | 384 | Medium-High | Fast | Good for compact systems |
| `BAAI/bge-large-en-v1.5` | 1024 | High | Medium | Best semantic matching so far |
| `multi-qa-mpnet-base-dot-v1` | 768 | High | Medium | Multilingual support |
- Load JSON data
- Clean and deduplicate chunks
- Apply recursive chunking
- Generate embeddings via `bge-large-en-v1.5`
- Save to a FAISS index in `src/embeddings/full_faiss_index/`
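A sketch of the embed-and-index step, assuming `sentence-transformers` and `faiss` are called directly; the actual `rag_pipeline.py` may instead use LangChain's FAISS wrapper, and the index file name below is illustrative:

```python
import os

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# `chunks` is the list of cleaned, deduplicated text chunks from the previous steps.
chunks = ["example chunk one", "example chunk two"]
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, 1024)

# With normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

os.makedirs("src/embeddings/full_faiss_index", exist_ok=True)
faiss.write_index(index, "src/embeddings/full_faiss_index/index.faiss")
```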
- Loads FAISS index
- Retrieves top-k chunks
- Reranks using `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Sends the best-ranked chunks to LLaMA 3 (8B) via Ollama
- Displays chat interface and logs interactions
- Clear historical context display
- Real-time generation
- Logging user queries and responses
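A condensed sketch of the retrieve → rerank → generate path behind the UI, assuming the `sentence-transformers` cross-encoder and the `ollama` Python client; the top-k values, prompt wording, and index file name are illustrative:

```python
import faiss
import numpy as np
import ollama  # pip install ollama; requires a local Ollama server with llama3:8b pulled
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
index = faiss.read_index("src/embeddings/full_faiss_index/index.faiss")
chunks: list[str] = []  # chunk texts stored alongside the index (loading omitted)


def answer(query: str, k: int = 20, top_n: int = 5) -> str:
    # 1. Dense retrieval: top-k nearest chunks from FAISS.
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    candidates = [chunks[i] for i in ids[0] if i != -1]

    # 2. Rerank candidates with the cross-encoder and keep the best few.
    scores = reranker.predict([(query, c) for c in candidates])
    best = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:top_n]]

    # 3. Generate the answer with llama3:8b via Ollama.
    context = "\n\n".join(best)
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    reply = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```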
- Input: `input.csv` with the question, ground-truth answer, and generated answer
- Output: ROUGE, BLEU, BERTScore
| Metric | Compares | Score Range | Meaning |
|---|---|---|---|
| BLEU | n-gram overlap | 0–1 | Precision-oriented score |
| ROUGE | recall-oriented overlap | 0–1 | Measures recall of n-grams |
| BERTScore | contextual embeddings (BERT) | 0–1 | Semantic similarity (better than n-grams) |
- BLEU: Measures how much the generated answer overlaps with ground truth using n-gram matching.
- ROUGE: Measures how much of the ground truth is captured in the generated answer.
- BERTScore: Uses BERT embeddings to compare semantic similarity of sentences.
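A compact sketch of how the three scores can be computed for a single pair, assuming the `bert-score`, `rouge-score`, and `nltk` packages; `bert_score_eval.py` itself iterates over `input.csv` and writes results to `bert_eval_output/`:

```python
from bert_score import score as bert_score
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "A binary search tree keeps keys in sorted order for fast lookup."
candidate = "A binary search tree stores keys in sorted order, enabling fast lookups."

# BLEU: n-gram precision of the candidate against the reference.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence precision/recall/F1.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, candidate)

# BERTScore: semantic similarity from contextual embeddings.
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU={bleu:.3f}  ROUGE-L F1={rouge['rougeL'].fmeasure:.3f}  BERTScore F1={f1.item():.3f}")
```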
- Uses Gemini to score the following metrics (a prompt sketch follows the table below):
  - Faithfulness
  - Context relevance
  - Answer correctness
  - Context precision
  - Context recall
| Metric | Score Range | Meaning |
|---|---|---|
| Faithfulness | 0–1 | How well the answer stays true to context |
| Relevance | 0–1 | How relevant the answer is to the question |
| Correctness | 0–1 | How correct the answer is compared to ground truth |
| Precision | 0–1 | How well the context supports the answer |
| Recall | 0–1 | How complete the answer is given the context |
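As referenced above, a rough sketch of how one of these metrics can be requested from Gemini, assuming the `google-generativeai` SDK; the prompt wording and environment variable name are illustrative and not the exact ones in `prompts.py` or `gemini_client.py`:

```python
import os

import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # key name is illustrative; see .env
model = genai.GenerativeModel("gemini-2.0-flash")


def score_faithfulness(question: str, context: str, answer: str) -> float:
    """Ask Gemini for a 0-1 faithfulness score and parse the number it returns."""
    prompt = (
        "Rate how faithful the answer is to the context on a scale from 0 to 1. "
        "Reply with only the number.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = model.generate_content(prompt)
    return float(response.text.strip())
```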
- Scores Obtained:
  - Faithfulness: 0.9851
  - Relevance: 0.9902
  - Correctness: 0.9480
  - Precision: 0.9644
  - Recall: 0.9030
- Interpretation: Higher scores indicate better performance. The scores obtained are high, indicating strong performance in staying faithful to the retrieved context and relevant to the question.
- Reliability: The method is reliable for contextual and semantic evaluation, but it may be less precise on isolated metrics than traditional, reference-based methods.
- Simple Average: Calculated as the mean of all metric scores.
- Weighted Average: Calculated using the following weights:
  - Faithfulness: 0.30
  - Relevance: 0.25
  - Correctness: 0.25
  - Precision: 0.10
  - Recall: 0.10
- Final Scores:
  - Simple Average Score: 0.9581
  - Weighted Average Score: 0.9668
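A quick sanity check of the two aggregates, using the per-metric scores and weights listed above:

```python
scores = {"faithfulness": 0.9851, "relevance": 0.9902, "correctness": 0.9480,
          "precision": 0.9644, "recall": 0.9030}
weights = {"faithfulness": 0.30, "relevance": 0.25, "correctness": 0.25,
           "precision": 0.10, "recall": 0.10}

simple = sum(scores.values()) / len(scores)              # -> 0.9581
weighted = sum(scores[m] * weights[m] for m in scores)   # -> 0.9668
print(f"simple={simple:.4f}  weighted={weighted:.4f}")
```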
| Metric Type | Tool Used | Key Insight |
|---|---|---|
| Non-LLM | BERTScore | Precise but surface level |
| LLM-based | Gemini | Deeper, contextual judgment |
- Pros:
  - Provides a deeper, more contextual evaluation.
  - Can capture nuances missed by traditional metrics.
- Cons:
  - May not be as precise in isolated metric evaluation.
  - Relies on the quality of the LLM and the prompt design.
- Rewrote scraper to log and clean more effectively
- Replaced `bge-small` with `bge-large-en-v1.5` for better semantic quality
- Integrated a reranker to boost answer relevance
- Improved chunking strategy (overlap + recursion)
- Streamlit UI now shows previous questions and logs interactions
- Enhanced evaluation: both LLM and traditional metrics now included
- Clone the Repository:
  - `git clone [repository-url]`
  - `cd RAG-Project`
- Set Up Environment:
  - Install dependencies: `pip install -r requirements.txt`
  - Set up Ollama for LLaMA 3: install Ollama (https://ollama.com) and pull the model with `ollama pull llama3:8b`
- Run the Pipeline:
  - Scrape Wikipedia: `python scrapers/wiki_scraper.py`
  - Run chunking and embedding: `python src/rag_pipeline.py`
  - Launch the chatbot UI: `streamlit run app/ui_app.py`
- Evaluate:
  - With BERTScore/ROUGE/BLEU: `python bert_evaluation/bert_score_eval.py`
  - With Gemini: `python ragas_eval_custom/evaluate.py`
- Complete RAGAS-based testing
- Add support for multilingual scraping and embedding
- Experiment with hybrid reranking (BM25 + cross-encoder)
- Extend evaluation framework with Hallucination detection (e.g., TruthfulQA)
- Integrate feedback loop to improve QA generation over time
MIT License
- BAAI for BGE model
- HuggingFace for Transformers & evaluation
- Langchain for inspiration on RAG design
- Google (Gemini) and Meta (LLaMA 3) for the LLMs
