ArchiveBot

ArchiveBot is a Django-based RAG (Retrieval-Augmented Generation) system that enables intuitive, conversational access to archival material. Researchers working with historical documents today often have to sift manually through thousands of pages. Even with digitization and search functionality, this process is both time-consuming and error-prone.

This project aims to alleviate this issue by letting users query the archive in natural language and receive a summary of the most relevant information.

Setup Instructions

Install the required packages:
```
pip install -r requirements.txt
```
Set up the Django app:
```
python manage.py setup_app
```
Run the development server:
```
python manage.py runserver
```
Access the application at http://127.0.0.1:8000/

Features

Web scraping of archive materials
OCR processing of PDF documents
Text chunking (semantic or fixed-size)
Embedding generation
Interactive chat interface with RAG capabilities
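
As an illustration of the fixed-size chunking option listed above, here is a minimal sketch in plain Python. The function name `chunk_fixed_size`, the character-based sizes, and the overlap parameter are assumptions made for this example; the actual implementation in `pipeline.py` may differ.

```python
def chunk_fixed_size(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper: consecutive chunks share `overlap` characters so that
    a sentence cut at a boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Semantic chunking would replace the fixed window with boundaries chosen from the text itself (for example, paragraph or sentence breaks), at the cost of uneven chunk lengths.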

Project Structure

rag_app/: The main Django application
- models.py: Database models for pipeline state and chat history
- views.py: API endpoints and view functions
- pipeline.py: Core pipeline functionality
- urls.py: URL routing
- templates/: HTML templates

Usage

Start by scraping archive materials for specific years
Process the downloaded PDFs with OCR
Chunk the extracted text
Generate embeddings for the chunks
Load a language model
Chat with the system to query the archived materials
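
The final chat step relies on retrieving the chunks most relevant to a question and passing them to the language model. The sketch below shows one common way to do this, ranking chunks by cosine similarity between embedding vectors. The names `cosine_similarity`, `retrieve_top_k`, and `build_prompt`, the prompt wording, and the toy vectors are all assumptions for illustration, not the project's actual code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that grounds the model in the retrieved chunks."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy two-dimensional embeddings; a real embedding model produces much longer vectors.
chunks = ["chunk about sports", "chunk about theater", "chunk about both"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = retrieve_top_k([1.0, 0.0], vectors, chunks, k=2)
prompt = build_prompt("What happened in sports?", top)
```

A production system would typically store the vectors in an index rather than scoring every chunk on each query, but the ranking idea is the same.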

Known Issues

Embeddings occasionally need to be reset and the model reloaded after a page refresh before the app runs correctly.
Progress updating has issues, and PDF retrieval sometimes hangs.
Loading takes an excessive amount of time (this may be unavoidable, but there may be a way to speed it up).
The database sometimes resets when it should not, or fails to reset when it should.
Years outside the user-specified range are sometimes processed.

To-dos and Future Steps

Filtering based on article type (excluding Eighth Page articles, etc.)
Linking to the original article PDF for viewing
UI testing to decide which parameters users should be able to set (the current panel works more like an admin panel; users would probably see only the chat)
Additional weighting based on recency of source material

Developed in the Computer Science 600 Research and Development Class at Phillips Academy.

tianyi-gu/archivebot
