19 changes: 19 additions & 0 deletions MyRAGProject/.env.example
@@ -0,0 +1,19 @@
# Example .env file for RAG project
# Copy this to .env and fill in your actual values.
# Do NOT commit your .env file to version control.

# --- Data Paths ---
# RAW_DATA_DIR="data/raw/"
# PROCESSED_DATA_DIR="data/processed/"
# VECTOR_DB_PATH="models/vector_db.faiss"

# --- Model Configurations ---
# LLM_MODEL_NAME="gpt2" # Or another model like "google/flan-t5-base"
# EMBEDDING_MODEL_NAME="sentence-transformers/all-MiniLM-L6-v2"

# --- Search Parameters ---
# TOP_K_RESULTS=5

# --- API Keys (if applicable) ---
# OPENAI_API_KEY="YOUR_OPENAI_API_KEY_HERE"
# HUGGINGFACE_HUB_TOKEN="YOUR_HUGGINGFACE_HUB_TOKEN_HERE"
134 changes: 134 additions & 0 deletions MyRAGProject/README.md
@@ -0,0 +1,134 @@
# LocalRAG: A RAG Pipeline with Local Models

## Overview

LocalRAG is a Python-based Retrieval-Augmented Generation (RAG) system designed to run entirely with locally hosted models. Inspired by projects like MiniRAG, it provides a foundational RAG pipeline using local sentence-transformer models for embeddings and local Large Language Models (LLMs) from the Hugging Face `transformers` library for text generation. This approach allows for greater privacy, control, and offline usability.

The project demonstrates loading text data, chunking it, generating embeddings, storing/retrieving document chunks (currently placeholder retrieval), and generating answers to queries using a local LLM based on provided context.
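The chunking step can be sketched with simple paragraph splitting. Note that `chunk_text` below is an illustrative stand-in, not the project's actual `DataProcessor` API:

```python
# Illustrative paragraph-based chunker; the real DataProcessor in src/core.py
# may differ. Splits on blank lines and drops empty segments.
def chunk_text(text: str) -> list[str]:
    """Split raw text into stripped, non-empty paragraph chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

sample = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
chunks = chunk_text(sample)
```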

## Features

- **Local Embedding Generation**: Utilizes `sentence-transformers` library to generate dense vector embeddings for text data locally.
- **Local LLM for Generation**: Employs Hugging Face `transformers` library to load and use local LLMs for generating responses.
- **Basic RAG Pipeline**: Implements a simple pipeline involving data processing, (placeholder) retrieval, prompt construction, and LLM-based generation.
- **Configurable Models**: Allows easy configuration of embedding and LLM models through `src/config.py`.
- **Modular Design**: Core components like data processing, embedding, vector database interaction (placeholder), and LLM interface are separated for clarity.
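The placeholder retrieval mentioned above can be sketched as a naive keyword match. The function name and scoring below are assumptions for illustration; the real `VectorDatabase` in `src/core.py` may score differently:

```python
# Sketch of placeholder keyword retrieval: rank documents by how many
# query words appear in them (naive substring matching, no stemming).
def keyword_search(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    query_words = set(query.lower().split())
    scored = [(sum(w in doc.lower() for w in query_words), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

docs = [
    "Proper chunking strategy is crucial for retrieval accuracy.",
    "The weather is nice today.",
]
hits = keyword_search("What is crucial for retrieval accuracy?", docs, top_k=1)
```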

## Directory Structure

- `MyRAGProject/`: Root directory of the project.
- `data/`: Intended for storing input data files (e.g., `.txt` files). Contains `sample.txt` for demonstration.
- `models/`: Intended for storing model-related files, such as FAISS indexes or other local model artifacts (currently used for placeholder vector DB path).
- `src/`: Contains the main source code for the RAG application.
- `__init__.py`: Makes `src` a Python package.
- `config.py`: Handles configuration settings (e.g., model names, paths).
- `core.py`: Defines core components like `DataProcessor`, `EmbeddingModel`, `VectorDatabase`, `LLMInterface`, and `RAGSystem`.
- `main.py`: Main script to run the RAG application.
- `utils.py`: For utility functions (currently basic).
- `tests/`: Contains all Pytest test files for the project.
- `__init__.py`: Makes `tests` a Python package.
- `test_data_processing.py`: Tests for data loading and chunking.
- `test_embedding.py`: Tests for the local embedding model.
- `test_llm.py`: Tests for the local LLM interface.
- `test_rag_pipeline.py`: Integration tests for the RAG pipeline.
- `requirements.txt`: Lists project dependencies.
- `.env.example`: Example environment file template.
- `README.md`: This file.

## Setup Instructions

1. **Clone the Repository**:
```bash
git clone <repository-url> # Replace <repository-url> with the actual URL
cd MyRAGProject
```

2. **Create a Virtual Environment** (Recommended):
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. **Install Dependencies**:
```bash
pip install -r requirements.txt
```

4. **Model Downloads**:
The Hugging Face `transformers` and `sentence-transformers` libraries will automatically download the specified pre-trained models (e.g., for embeddings and LLM) on their first use. These models are typically stored in the Hugging Face cache directory (e.g., `~/.cache/huggingface/hub/` or `~/.cache/huggingface/sentence_transformers/`). Ensure you have an internet connection for the initial download.

5. **Environment Variables** (Optional):
If you plan to use specific configurations not suitable for direct inclusion in `config.py` (e.g., API keys for future extensions, or overriding default paths via environment variables), you can:
- Copy `.env.example` to a new file named `.env`:
```bash
cp .env.example .env
```
- Edit the `.env` file to set your desired variables. `src/config.py` is set up to load variables from this file. For the current fully local setup, this might not be strictly necessary unless you override default model names or paths.

## How to Run

1. **Place Data**:
- Input text files (e.g., `.txt`) should be placed in the `MyRAGProject/data/` directory.
- A `sample.txt` file is already provided for demonstration.

2. **Run the Main Script**:
Execute the main application script from the `MyRAGProject` root directory:
```bash
python src/main.py
```

3. **Expected Output/Behavior**:
- The script will initialize the RAG components (DataProcessor, EmbeddingModel, VectorDatabase, LLMInterface).
- It will load and process the data from `MyRAGProject/data/sample.txt`.
- It will "build" an index using the processed documents (currently, this involves generating embeddings if possible and storing documents for placeholder search).
- It will then process a sample query defined in `src/main.py` (e.g., "What is crucial for retrieval accuracy?").
- The RAG system will attempt to retrieve relevant context (using placeholder keyword search) and generate a response using the local LLM.
- You will see print statements indicating these steps, including model loading attempts, data processing, and the final query and response.
   - **Note**: If the local models (embedding or LLM) fail to load due to environment issues (e.g., insufficient disk space for PyTorch), the script will print error messages and skip the query-processing step.
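Before generation, the retrieved chunks are folded into a prompt for the local LLM. A minimal sketch of that prompt construction follows; the exact template in `src/core.py` is assumed here, not verbatim:

```python
# Hypothetical prompt template; the actual wording in src/core.py may differ.
def build_prompt(context_chunks: list[str], question: str) -> str:
    """Join retrieved chunks into a context block and append the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    ["Proper chunking strategy is crucial for retrieval accuracy."],
    "What is crucial for retrieval accuracy?",
)
```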

## Configuration

- Core configurations are managed in `MyRAGProject/src/config.py`.
- You can change the default local models by modifying the following variables in `src/config.py` or by setting them as environment variables (which `config.py` will load via `python-dotenv` if a `.env` file is present):
- `EMBEDDING_MODEL_NAME`: Specifies the sentence transformer model for embeddings (default: `"sentence-transformers/all-MiniLM-L6-v2"`).
- `LLM_MODEL_NAME`: Specifies the Hugging Face model for the LLM (default: `"distilgpt2"`).
- Other paths, like `VECTOR_DB_PATH`, `RAW_DATA_DIR`, etc., can also be configured there.
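The override mechanics follow the standard `os.getenv` fallback pattern used in `src/config.py`: a variable set in the environment (exported in the shell, or loaded from `.env` by `python-dotenv`) wins over the hard-coded default. A small self-contained illustration:

```python
import os

# Simulate an override, e.g. exported in the shell or loaded from .env
os.environ["LLM_MODEL_NAME"] = "gpt2"
os.environ.pop("TOP_K_RESULTS", None)  # ensure this one is unset for the demo

# The same fallback pattern src/config.py uses: environment value, else default
LLM_MODEL_NAME = os.getenv("LLM_MODEL_NAME", "distilgpt2")   # -> override applies
TOP_K_RESULTS = int(os.getenv("TOP_K_RESULTS", 5))           # -> default applies
```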

## Testing

- To run the test suite (requires `pytest`), from the directory containing `MyRAGProject`:
```bash
pytest MyRAGProject/tests/
```
Or, from within the `MyRAGProject` directory:
```bash
python -m pytest tests/
```
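A pytest test in the spirit of `test_data_processing.py` is just a function whose assertions pytest collects and runs. The chunker below is a stand-in, since the real tests exercise `DataProcessor` from `src/core.py`:

```python
# Stand-in for the chunking logic under test; the real tests import
# DataProcessor from src.core instead.
def split_paragraphs(text: str) -> list[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def test_split_paragraphs_drops_empty_segments():
    chunks = split_paragraphs("one\n\n\n\ntwo")
    assert chunks == ["one", "two"]

test_split_paragraphs_drops_empty_segments()  # pytest would collect this automatically
```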

- **Important Note on Test Execution**:
The project's tests rely on libraries like `torch`, `sentence-transformers`, and `transformers`. These libraries, especially `torch`, can be very large. In constrained environments (like some sandboxed CI/CD runners or low-resource machines), installation of these dependencies might fail due to insufficient disk space. This can lead to `ImportError` (e.g., `ImportError: cannot import name 'Tensor' from 'torch'`) during test collection or execution, causing tests to fail or not run at all. If you encounter such issues, it's likely an environmental limitation rather than a bug in the project code itself.

## Future Improvements

- **Support for More Data Types**: Extend `DataProcessor` to handle PDFs, DOCX, URLs, etc.
- **Advanced Vector Search**: Replace the placeholder keyword search with a proper vector database implementation (e.g., using FAISS for efficient similarity search).
- **Improved Chunking Strategies**: Implement more sophisticated text chunking methods (e.g., recursive character splitting, token-based chunking).
- **UI/API Interface**: Develop a simple web interface (e.g., using Flask/Streamlit) or an API for easier interaction with the RAG system.
- **Batch Processing**: Add capabilities for processing multiple queries or documents in batch.
- **Evaluation Framework**: Integrate an evaluation framework to measure retrieval and generation quality.
- **More Robust Model Error Handling**: Enhance error handling and fallbacks for model loading and generation.
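As a sketch of the recursive character splitting mentioned above (modeled loosely on LangChain's splitter; `recursive_split` is a hypothetical helper, not part of this project):

```python
# Hypothetical recursive splitter: fall back through progressively finer
# separators until every chunk fits max_len. If all separators are exhausted,
# an oversized chunk is returned as-is.
def recursive_split(text: str, max_len: int,
                    separators=("\n\n", "\n", " ")) -> list[str]:
    text = text.strip()
    if len(text) <= max_len or not separators:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    pieces = [p.strip() for p in text.split(sep) if p.strip()]
    if len(pieces) == 1:
        # This separator did not split anything; try the next, finer one.
        return recursive_split(pieces[0], max_len, rest)
    chunks: list[str] = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, max_len, separators))
    return chunks
```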

This README should provide a good overview and guide for users of the LocalRAG project.
1 change: 1 addition & 0 deletions MyRAGProject/data/.gitkeep
@@ -0,0 +1 @@
# This file keeps the data directory in git, even if it's empty.
6 changes: 6 additions & 0 deletions MyRAGProject/data/sample.txt
@@ -0,0 +1,6 @@
This is the first paragraph of our sample text file. It contains a few sentences to demonstrate the loading and processing capabilities of the RAG system. We aim to chunk this text into meaningful segments.

The second paragraph provides more content. RAG systems often benefit from well-defined document chunks. These chunks are then vectorized and stored in a database for efficient retrieval. Proper chunking strategy is crucial for retrieval accuracy.

Finally, the third paragraph concludes this sample document. It's a short document, but sufficient for initial testing of the data processing pipeline. Future enhancements could include handling various file formats like PDF, DOCX, or even web URLs.
The RAG model will use these chunks to find relevant information.
1 change: 1 addition & 0 deletions MyRAGProject/models/.gitkeep
@@ -0,0 +1 @@
# This file keeps the models directory in git, even if it's empty.
14 changes: 14 additions & 0 deletions MyRAGProject/requirements.txt
@@ -0,0 +1,14 @@
# Placeholder for project dependencies
# Add libraries like:
# pandas
# scikit-learn
torch
transformers
# faiss-cpu # or faiss-gpu if you have a CUDA-enabled GPU
sentence-transformers
# PyPDF2
python-dotenv
# langchain
# beautifulsoup4
# requests
pytest
1 change: 1 addition & 0 deletions MyRAGProject/src/__init__.py
@@ -0,0 +1 @@
# This file makes src a Python package
46 changes: 46 additions & 0 deletions MyRAGProject/src/config.py
@@ -0,0 +1,46 @@
# config.py
# Configuration settings for the RAG application

import os
from dotenv import load_dotenv

load_dotenv() # Load environment variables from .env file found in the current working directory or parent directories.

# --- Project Root ---
# It's often useful to define the project root for easier path management.
# This assumes config.py is in MyRAGProject/src/
PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# --- Data Paths ---
# Construct paths relative to PROJECT_ROOT to make them more robust.
RAW_DATA_DIR = os.getenv("RAW_DATA_DIR", os.path.join(PROJECT_ROOT, "data/raw/"))
PROCESSED_DATA_DIR = os.getenv("PROCESSED_DATA_DIR", os.path.join(PROJECT_ROOT, "data/processed/"))
VECTOR_DB_PATH = os.getenv("VECTOR_DB_PATH", os.path.join(PROJECT_ROOT, "models/vector_db.faiss"))

# --- Model Configurations ---
LLM_MODEL_NAME = os.getenv("LLM_MODEL_NAME", "distilgpt2") # Using a smaller model for local LLM
EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "sentence-transformers/all-MiniLM-L6-v2")

# --- Search Parameters ---
TOP_K_RESULTS = int(os.getenv("TOP_K_RESULTS", 5))

# --- API Keys (if applicable, loaded from .env) ---
# OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# HUGGINGFACE_HUB_TOKEN = os.getenv("HUGGINGFACE_HUB_TOKEN")

def print_config():
    """Prints the current configuration."""
    print("Configuration loaded:")
    print(f" Project Root: {PROJECT_ROOT}")
    print(f" Raw Data Directory: {RAW_DATA_DIR}")
    print(f" Processed Data Directory: {PROCESSED_DATA_DIR}")
    print(f" Vector DB Path: {VECTOR_DB_PATH}")
    print(f" LLM Model Name: {LLM_MODEL_NAME}")
    print(f" Embedding Model Name: {EMBEDDING_MODEL_NAME}")
    print(f" Top K Results: {TOP_K_RESULTS}")

if __name__ == "__main__":
    # This allows you to run `python src/config.py` to check paths.
    # Keep .env in MyRAGProject so it is found whether you run from
    # MyRAGProject or from MyRAGProject/src.
    print_config()