This script implements, end‑to‑end, the system we designed:
- Chunk documents using character-based splitting (512 characters per chunk, 50 characters of overlap) for a fair comparison with other systems, then compress the chunks into seekable Zstandard frames (64 KiB raw, dictionary-trained, level 19) stored in a single file, `corpus.zs`.
- Build a minimal self-index mapping chunk-id → (frame-id, offset-in-frame, length), so we can jump straight to a chunk after decompressing a single frame.
- Generate embeddings with MiniLM (384-d) and write them into a FAISS `IndexIVFPQ` (`N_LIST=1024`, `M_PQ=48`, `N_BITS=8`) for efficient similarity search; the index stores compressed vectors.
- Query: embed the query, search FAISS, decompress only the frame(s) needed, and return the top-k chunks (see the sketches after this list).
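A minimal sketch of the ingest path under these design choices. The helper names (`pack_corpus`, `build_faiss`), the self-index file `chunk_index.json`, the dictionary file and its size, and the frame-packing details are illustrative assumptions rather than the actual `rag_compressed_store.py` code; only the 64 KiB raw frame target, level-19 dictionary-trained compression, MiniLM embeddings, and the `IndexIVFPQ` parameters come from the list above.

```python
import json
import numpy as np
import zstandard as zstd
import faiss
from sentence_transformers import SentenceTransformer

FRAME_RAW_BYTES = 64 * 1024          # ~64 KiB of raw text per compressed frame
N_LIST, M_PQ, N_BITS = 1024, 48, 8   # IVF lists, PQ sub-quantizers, bits per code


def pack_corpus(chunks, data_dir="data"):
    """Compress chunks into independent zstd frames and record a self-index
    mapping chunk-id -> (frame byte range, offset-in-frame, length)."""
    samples = [c.encode("utf-8") for c in chunks]
    zdict = zstd.train_dictionary(110 * 1024, samples)        # dictionary-trained compression
    cctx = zstd.ZstdCompressor(level=19, dict_data=zdict)

    frames, buf, members = [], b"", []

    def flush():
        nonlocal buf, members
        if buf:
            frames.append((cctx.compress(buf), members))
            buf, members = b"", []

    for cid, chunk in enumerate(chunks):
        raw = chunk.encode("utf-8")
        if buf and len(buf) + len(raw) > FRAME_RAW_BYTES:      # keep ~64 KiB of raw text per frame
            flush()
        members.append((cid, len(buf), len(raw)))              # (chunk-id, offset-in-frame, length)
        buf += raw
    flush()

    # Concatenate all frames into corpus.zs, remembering each frame's byte range
    # so a single frame can be read and decompressed at query time.
    index, pos = [], 0
    with open(f"{data_dir}/corpus.zs", "wb") as f:
        for frame_id, (frame_bytes, frame_members) in enumerate(frames):
            f.write(frame_bytes)
            for cid, off, length in frame_members:
                index.append({"chunk_id": cid, "frame_id": frame_id,
                              "frame_start": pos, "frame_len": len(frame_bytes),
                              "offset": off, "length": length})
            pos += len(frame_bytes)
    with open(f"{data_dir}/chunk_index.json", "w") as f:
        json.dump(index, f)
    with open(f"{data_dir}/dict.bin", "wb") as f:
        f.write(zdict.as_bytes())


def build_faiss(chunks, data_dir="data"):
    """Embed chunks with MiniLM (384-d) and store PQ-compressed vectors in an IVF index."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(chunks, convert_to_numpy=True,
                       normalize_embeddings=True).astype(np.float32)
    quantizer = faiss.IndexFlatL2(emb.shape[1])                # coarse quantizer over the 384-d space;
    index = faiss.IndexIVFPQ(quantizer, emb.shape[1],          # with normalized vectors, L2 ranking
                             N_LIST, M_PQ, N_BITS)             # matches cosine similarity
    index.train(emb)                                           # IVF/PQ training needs enough chunks for 1024 lists
    index.add(emb)
    faiss.write_index(index, f"{data_dir}/index.faiss")
```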
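And a matching sketch of the query path, under the same assumptions about the self-index layout and dictionary file; `nprobe=16` is an arbitrary illustrative setting.

```python
import json
import zstandard as zstd
import faiss
from sentence_transformers import SentenceTransformer


def search(query, data_dir="data", top_k=5):
    """Embed the query, search FAISS, then decompress only the frame(s) holding the hits."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    index = faiss.read_index(f"{data_dir}/index.faiss")
    index.nprobe = 16                                          # IVF lists scanned per query (tunable)
    q = model.encode([query], convert_to_numpy=True,
                     normalize_embeddings=True).astype("float32")
    _, ids = index.search(q, top_k)

    with open(f"{data_dir}/chunk_index.json") as f:
        chunk_index = {entry["chunk_id"]: entry for entry in json.load(f)}
    with open(f"{data_dir}/dict.bin", "rb") as f:
        zdict = zstd.ZstdCompressionDict(f.read())
    dctx = zstd.ZstdDecompressor(dict_data=zdict)

    results, frame_cache = [], {}
    with open(f"{data_dir}/corpus.zs", "rb") as corpus:
        for cid in ids[0]:
            if cid < 0:                                        # FAISS pads missing results with -1
                continue
            entry = chunk_index[int(cid)]
            if entry["frame_id"] not in frame_cache:           # decompress each needed frame only once
                corpus.seek(entry["frame_start"])
                frame_cache[entry["frame_id"]] = dctx.decompress(
                    corpus.read(entry["frame_len"]))
            raw = frame_cache[entry["frame_id"]]
            results.append(raw[entry["offset"]:entry["offset"] + entry["length"]].decode("utf-8"))
    return results
```

The point of the self-index is that a query touches at most `top_k` frames, so only a few 64 KiB blocks are ever decompressed rather than the whole corpus.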
Everything is wrapped in a simple CLI built with `typer` so you can run:

```bash
python rag_compressed_store.py ingest --input docs.txt --data_dir data/
python rag_compressed_store.py search --query "What is foo?" --data_dir data/ --top_k 5
```

The `download_and_index.py` script automates fetching source documents and preparing them for ingestion by the `rag_compressed_store.py` system. It performs the following main steps:
- Downloads Data:
  - Fetches a configurable number of top Wikipedia articles (default: 1000 via `--num_wiki`) from Wikipedia:Vital articles/Level/4.
  - Fetches a configurable number of top Project Gutenberg books (default: 10 via `--num_gutenberg`) from the Gutenberg Top 100.
  - Saves the downloaded raw text files individually into the `input_data/` directory (configurable via `--input_source_dir`).
- Triggers Indexing:
  - Automatically invokes `python rag_compressed_store.py ingest` to process the downloaded files from the `input_data/` directory (see the sketch after this list).
  - The `rag_compressed_store.py` script then builds the compressed corpus (`corpus.zs`), the FAISS index (`index.faiss`), and other necessary files in the `data/` directory (configurable via `--output_data_dir` for `download_and_index.py`, which passes it on as `--data_dir` to `rag_compressed_store.py`).
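A minimal sketch of that hand-off, assuming a `subprocess` call; the structure and the exact `--input` argument are illustrative, and only the flag forwarding described above is taken from this section.

```python
import subprocess
from pathlib import Path


def run_pipeline(input_source_dir="input_data", output_data_dir="data",
                 skip_download=False, skip_ingest=False):
    # The script ensures both directories exist before doing any work.
    Path(input_source_dir).mkdir(parents=True, exist_ok=True)
    Path(output_data_dir).mkdir(parents=True, exist_ok=True)

    if not skip_download:
        # Download stages omitted in this sketch: the real script fetches the
        # Wikipedia vital articles and Gutenberg Top 100 texts and writes one
        # raw .txt file per document into input_source_dir.
        pass

    if not skip_ingest:
        # --output_data_dir is forwarded to the ingest step as --data_dir;
        # the exact --input argument (file vs. directory) may differ in the real script.
        subprocess.run(
            ["python", "rag_compressed_store.py", "ingest",
             "--input", input_source_dir, "--data_dir", output_data_dir],
            check=True,
        )
```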
To download the default number of documents and then trigger indexing:
```bash
python download_and_index.py
```

You can customize the number of articles and books, or skip certain stages, using command-line arguments:
```bash
# Download 500 Wikipedia articles and 5 Gutenberg books, then index
python download_and_index.py --num_wiki 500 --num_gutenberg 5

# Skip downloading and only run indexing (assumes data is already in input_data/)
python download_and_index.py --skip_download

# Only download data, skip the indexing step
python download_and_index.py --skip_ingest

# Specify custom directories
python download_and_index.py --input_source_dir custom_input/ --output_data_dir custom_output/
```

The script ensures the necessary directories (`input_data/` and `data/` by default) are created.
This project also includes scripts for ingesting data into and benchmarking the Memvid RAG system. This allows for a comparative analysis of memvid against the FAISS-based approach described above. Key scripts for memvid interaction include:
- `ingest_memvid.py`: Ingests text data into `memvid`'s video-based storage.
- `benchmark_memvid.py`: Runs benchmark queries against a `memvid` instance.
To ensure a fair comparison, the chunking strategy (512 characters per chunk, 50 characters of overlap) and the embedding model (`all-MiniLM-L6-v2`) are kept consistent across both systems.
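For reference, this shared chunking rule amounts to a sliding character window. The sketch below is illustrative (a plain, non-sentence-aware splitter); the scripts' actual implementation may differ in edge-case handling.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    step = chunk_size - overlap                 # 462-character stride between chunk starts
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```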
A detailed critique and performance comparison can be found in `memvid_critique.md`.
Dependencies (install via `pip install`): `zstandard`, `numpy`, `sentence_transformers`, `faiss-cpu`, `typer`, `tqdm`, `memvid`.
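For example:

```bash
pip install zstandard numpy sentence-transformers faiss-cpu typer tqdm memvid
```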