This project offers a lightweight implementation (no llama-index or langchain) of a Retrieval-Augmented Generation (RAG) pipeline for question-answering on PDF documents with Large Language Models (LLMs) on local hardware. It leverages the llama.cpp Python bindings for CPU support and showcases a demo in a Jupyter Notebook. Key features include:
- A preprocessing and chunking pipeline for PDF documents
- PDF extraction implemented with AllenAI's Papermage (https://github.com/allenai/papermage)
- A visualization plugin to compare generated responses with source documents, enabling model answer verification
The implementation demonstrates how to perform document-based question-answering without relying on cloud services, making it suitable for scenarios requiring data privacy or offline processing.
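The retrieval step at the heart of such a pipeline can be sketched as follows. This is a minimal illustration, not the project's actual code: chunk scoring here uses simple word overlap as a stand-in for real embedding similarity, and all names are hypothetical.

```python
def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the question (stand-in for embedding similarity)."""
    q_words = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return scored[:top_k]

def build_prompt(question: str, context: list[str]) -> str:
    """Assemble a RAG prompt: retrieved chunks become grounding context for the LLM."""
    return "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"

# Toy example: two document chunks, one relevant to the question.
chunks = ["The rover landed in Jezero crater.", "Budget figures for 2021 are attached."]
prompt = build_prompt("Where did the rover land?", retrieve("Where did the rover land?", chunks))
```

The assembled prompt is then passed to the local LLM, which answers using only the retrieved context.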
- At least 10GB of random access memory (RAM) available for CPU inference
- (Nvidia GPU recommended for faster inference with at least 18GB VRAM)
- Python 3.11
Clone the repository:
git clone https://github.com/strath-ace-labs/local_rag
cd local_rag

Install the required Python packages into a new environment:
pip install -r requirements.txt

- Insert your documents into "./datasets/<your_dataset>/"
- Run a script similar to "preprocess_nasa_teaching_documents.sh" to preprocess the PDFs and chunk the texts accordingly
- Open "qa_notebook.ipynb" and follow the instructions there
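The chunking performed by the preprocessing script can be sketched as a sliding window of overlapping word spans. This is illustrative only; the actual script's chunk size, overlap, and tokenization may differ.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows so context is not lost at chunk boundaries."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Overlapping windows ensure that a sentence straddling a chunk boundary is fully contained in at least one chunk, which matters for retrieval quality.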
- Device Selection: Specify `device` as either 'cpu', 'cuda' (for GPU), or `None` for automatic selection.
- Model Specification:
- To specify a specific model, use the link to the Hugging Face repository:
- Example for CPU model:
model = rag_llm_classes.load_inference_model(model_name='QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF', filename="Meta-Llama-3.1-8B-Instruct.Q5_K_M.gguf", device='cpu')
    - Example for GPU model:
model = rag_llm_classes.load_inference_model(model_name='Nexusflow/Starling-LM-7B-beta', device='cuda')
- Check for new models at LMSYS Arena.
- Add Tokenizer: Change the `tokenizer_dict` variable in `rag_llm_classes.load_cpu_model` to include the tokenizer for your model (falls back to the Llama-3 tokenizer)
- Default Model: If none is specified, Llama-3 is loaded by default.
- License Agreement: Note that some models may require a license agreement before use.
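The tokenizer fallback described above can be sketched as a simple dictionary lookup. The mapping and names below are hypothetical; the real dictionary lives in `rag_llm_classes.load_cpu_model`.

```python
# Hypothetical mapping from Hugging Face model repo to its chat tokenizer;
# the actual tokenizer_dict is defined in rag_llm_classes.load_cpu_model.
tokenizer_dict = {
    "QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF": "meta-llama/Meta-Llama-3.1-8B-Instruct",
}
DEFAULT_TOKENIZER = "meta-llama/Meta-Llama-3-8B-Instruct"  # Llama-3 fallback

def resolve_tokenizer(model_name: str) -> str:
    """Return the registered tokenizer for a model, falling back to Llama-3."""
    return tokenizer_dict.get(model_name, DEFAULT_TOKENIZER)
```

Models absent from the dictionary are served with the Llama-3 tokenizer, so adding an entry is only required when your model uses a different chat template.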

