aRAG, or `arag`, is a command-line interface (CLI) tool for creating, managing, and querying a custom file type called `.arag`. This tool enables users to package content into a structured format, process it into a searchable corpus, generate embeddings for vector-based querying, and retrieve information efficiently. It currently supports both local and OpenAI-based embedding methods and includes features for content management, packaging, and interactive usage.
The goal of the `arag` file type is to provide a simple, self-contained way to create localized vector databases that can be easily used with RAG and LLMs. Imagine, for example, that you could download the entire documentation for some coding language or package, generate an `arag` with a couple of clicks, and then drag and drop this file into your AI chats, giving the model the information without compromising your context window. The current plan is to add support for `arag` files to popular LLM chats (like chatgpt.com) via a browser extension to further increase their usefulness.
- Create `.arag` Files: Generate `.arag` directories or packaged archives with a custom structure.
- Content Management: Add, delete, list, and clean content within `.arag` files.
- Corpus Processing: Convert content into a SQLite-based searchable corpus with chunking support.
- Embedding Generation: Index content using OpenAI or local SentenceTransformer models for vector search.
- Vector Querying: Perform similarity searches on indexed content using a query string.
- Packaging: Compress `.arag` directories into `.arag` archives and unpackage them as needed.
- Interactive Mode: Open an `.arag` file and manage it interactively.
- Custom VFS: Query packaged `.arag` files directly using a SQLite Virtual File System (VFS).
- Python (3.11 was used for development)
- `pip` package manager
- Clone the Repository:
    git clone https://github.com/jmelovich/arag-cli.git
    cd arag-cli
- Install the Package:
    pip install .
  To include optional support for local embeddings (SentenceTransformer), use:
    pip install ".[local_embeddings]"
- Verify Installation:
    arag --help
  This should display the CLI help message with available commands.
The `arag` CLI provides a variety of subcommands to manage `arag` files. Below is an overview of the commands and their usage.
Create a new `arag` directory, spec file, or packaged `.arag` from a spec.
- Create a Directory:
    arag create dir myarag /path/to/directory
  Creates a `myarag-arag` directory at the specified path. An `arag` directory works the same as an `.arag` file, but is not read-only (until packaged into a file). This is the principal way to create an `arag`.
- Create a Spec File:
    arag create spec /path/to/example.arag-json
  Generates a template `.arag-json` file. You can modify this template to configure all the settings needed to create an `.arag` file.
- Create from Spec:
    arag create from-spec /path/to/spec.arag-json
  Builds a packaged `.arag` file based on the spec file. This is the easiest way to create an `.arag` file.
Manage content within an `.arag` directory (not supported for packaged files). Content is whatever you want indexed: for now, any text files, PDFs, or DOCX files.
- Add Content:
    arag content add myfile.txt --arag /path/to/myarag-arag
  Adds `myfile.txt` to the `content` folder. Directories are also supported; all files in the given directory are added recursively.
- Delete Content:
    arag content del myfile.txt --arag /path/to/myarag-arag
  Removes `myfile.txt` from the `content` folder. Also works with directories.
- List Contents:
    arag content ls --arag /path/to/myarag-arag
  Lists all files in the `content` folder.
- Corpify Content:
    arag content corpify --arag /path/to/myarag-arag --chunk-size 8192 --force
  Processes content into `corpus.db` with the specified chunk size. The `--force` flag overwrites any existing corpus. The `--chunk-size` argument sets the size, in bytes, at which each file added to the corpus is split into separate rows (the default is typically fine).
- Clean Content:
    arag content clean --arag /path/to/myarag-arag
  Removes files from `content` that are not present in `corpus.db`. This is recommended to avoid wasting space.
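To make the `--chunk-size` behaviour more concrete, here is a minimal Python sketch of byte-bounded chunking into SQLite rows. It is not the implementation used by `arag`: the `corpus` table layout, its column names, and the helpers `chunk_text` and `store_chunks` are all hypothetical, and the real tool may split files differently (for example, on raw bytes rather than whole characters).

```python
import sqlite3
from typing import List


def chunk_text(text: str, chunk_size: int = 8192) -> List[str]:
    """Split text into pieces whose UTF-8 encoding is at most chunk_size bytes.

    Characters are never split, so a multi-byte character always stays whole.
    """
    if chunk_size < 4:
        # A single Unicode code point can need up to 4 bytes in UTF-8.
        raise ValueError("chunk_size must be at least 4 bytes")
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if current and current_len + n > chunk_size:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(ch)
        current_len += n
    if current:
        chunks.append("".join(current))
    return chunks


def store_chunks(conn: sqlite3.Connection, path: str, text: str,
                 chunk_size: int = 8192) -> int:
    """Store each chunk of a file as its own row (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS corpus "
        "(path TEXT, chunk_index INTEGER, content TEXT)"
    )
    chunks = chunk_text(text, chunk_size)
    conn.executemany(
        "INSERT INTO corpus VALUES (?, ?, ?)",
        [(path, i, c) for i, c in enumerate(chunks)],
    )
    conn.commit()
    return len(chunks)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    rows = store_chunks(conn, "myfile.txt", "a" * 20000)
    print(f"stored {rows} rows")
```

A smaller chunk size produces more, shorter rows, which generally gives finer-grained matches at the cost of more embeddings to compute.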
Generate embeddings for the corpus.
- Index with OpenAI:
    arag index --arag /path/to/myarag-arag --method openai --api-key YOUR_API_KEY
  Indexes using OpenAI embeddings. The `--api-key` flag is optional if you have an API key set in an environment variable called `OPENAI_API_KEY`.
- Index Locally:
    arag index --arag /path/to/myarag-arag --method local
  Uses the default SentenceTransformer model. Pass the `--model` argument to choose the model, given as a Hugging Face name such as `sentence-transformers/all-MiniLM-L6-v2`.
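The precedence between `--api-key` and `OPENAI_API_KEY` can be pictured with a short Python sketch. This is an illustration only, assuming the explicit flag wins over the environment variable; `resolve_api_key` is a hypothetical helper, not a function exposed by `arag`.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the key from --api-key if given, else from OPENAI_API_KEY.

    Assumption: an explicit flag takes precedence over the environment.
    """
    env = os.environ if env is None else env
    key = cli_value or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "No OpenAI API key: pass --api-key or set OPENAI_API_KEY"
        )
    return key


if __name__ == "__main__":
    try:
        resolve_api_key()
        print("OpenAI API key found")
    except RuntimeError as exc:
        print(exc)
```

Setting the environment variable keeps the key out of your shell history, which is usually preferable to passing it on the command line.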
Search the corpus with a query string.
- Query with Results:
    arag query "search term" --arag /path/to/myarag.arag --topk 3
  Returns the top 3 matching chunks with content. `--topk` defaults to 1.
- Query with File Paths:
    arag query "search term" --arag /path/to/myarag.arag --get-file
  Returns just file paths instead of content.
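As a rough picture of what a top-k vector query involves, the sketch below ranks stored chunk embeddings against a query embedding. The similarity metric that `arag` actually uses is not stated here, so cosine similarity is an assumption, and `cosine_similarity` and `top_k` are hypothetical helper names. The vectors are tiny hand-written examples rather than real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunks: Sequence[Tuple[str, Sequence[float]]],
          k: int = 1) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings", for illustration only.
    chunks = [
        ("chunk about cats", [1.0, 0.0]),
        ("chunk about dogs", [0.0, 1.0]),
        ("chunk about pets", [1.0, 1.0]),
    ]
    for score, text in top_k([1.0, 0.0], chunks, k=2):
        print(f"{score:.3f}  {text}")
```

The default of `k=1` mirrors the documented `--topk` default; raising it returns more chunks at the cost of a longer result.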
Package an `.arag` directory into an `.arag` file.
- Package Directory:
    arag package /path/to/myarag-arag --remove-original
  Creates `myarag.arag` and removes the original directory.
Unpackage an `.arag` file into a directory.
- Unpackage File:
    arag unpackage /path/to/myarag.arag --remove-original
  Extracts to `myarag-arag` and removes the original file.
Enter interactive mode with an `.arag` file or directory.
- Open a File:
    arag open /path/to/myarag.arag
  Starts an interactive shell for managing the `.arag`.
Run `arag open <path>` to interact with an `.arag` file or directory. Commands can be entered without the `arag` prefix or the `--arag` argument:
> content ls
> content add myfile.txt
> query "find this" --topk 2
> close
Type `quit` or `close` to exit.
Use `arag create spec <destination>` to generate a template `.arag-json` file at the destination, then modify it:
{
  "arag_name": "myarag",
  "arag_dest": "./myarag.arag",
  "content_include": ["file1.txt", "dir/docs"],
  "clean_content": true,
  "chunk_size": 8192,
  "index_method": "openai",
  "index_model": "text-embedding-3-small",
  "api_key": "YOUR_API_KEY",
  "openai_endpoint": "https://api.openai.com/v1",
  "arag_version": "0.1.0",
  "should_package": true,
  "remove_arag_dir": true
}
Run `arag create from-spec <.arag-json-path>` to build the `.arag` file.
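If you would rather generate spec files from a script than edit the template by hand, something like the following Python sketch could work. It simply writes the keys shown in the template above; `write_spec` is a hypothetical helper, and whether `arag create from-spec` accepts a file produced this way (or requires every key) is an assumption based only on that template.

```python
import json
from pathlib import Path
from typing import Iterable, Union


def write_spec(spec_path: Union[str, Path], arag_name: str,
               content_include: Iterable[str],
               chunk_size: int = 8192) -> Path:
    """Write an .arag-json spec using the keys from the template above."""
    spec = {
        "arag_name": arag_name,
        "arag_dest": f"./{arag_name}.arag",
        "content_include": list(content_include),
        "clean_content": True,
        "chunk_size": chunk_size,
        "index_method": "openai",
        "index_model": "text-embedding-3-small",
        "api_key": "YOUR_API_KEY",  # placeholder, replace before building
        "openai_endpoint": "https://api.openai.com/v1",
        "arag_version": "0.1.0",
        "should_package": True,
        "remove_arag_dir": True,
    }
    path = Path(spec_path)
    path.write_text(json.dumps(spec, indent=2) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    out = write_spec("myarag.arag-json", "myarag", ["file1.txt", "dir/docs"])
    print(f"wrote {out}; next: arag create from-spec {out}")
```

Avoid committing a spec file that contains a real API key.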
- Full Workflow:
    # Create an arag directory
    arag create dir myarag ./data
    # Open the arag directory in interactive mode
    arag open ./data/myarag-arag
    # Add content
    content add document.pdf
    # Corpify
    content corpify --clean
    # Index locally
    index --method local
    # Package the file and remove this directory
    package --remove-original
    # Open the .arag file
    arag open ./data/myarag.arag
    # Query
    query "important info"
- Using a Spec File:
    arag create spec myarag.arag-json
    # Edit myarag.arag-json as needed
    nano myarag.arag-json
    arag create from-spec myarag.arag-json
An `arag` directory/file has the following structure:
- `content/`: Stores raw files and directories.
- `content_list.txt`: Lists all files in `content/`.
- `corpus.db`: SQLite database with chunked content and vector embeddings.
- `index.json`: Metadata about embeddings (method, model, etc.).
A packaged `.arag` file is a special ZIP archive containing these components. (In an `.arag` file, only the `content` folder is compressed. The rest is stored uncompressed so it can be accessed directly.)
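The mixed-compression idea can be illustrated with Python's standard `zipfile` module, which allows a compression method per entry. This is a sketch of the general technique only: `package_dir` is a hypothetical helper, and the archive it writes is not guaranteed to match the layout `arag package` produces or to be readable by `arag`.

```python
import zipfile
from pathlib import Path
from typing import Union


def package_dir(src_dir: Union[str, Path], dest_file: Union[str, Path]) -> Path:
    """Zip a directory, compressing only files under content/.

    Everything else (e.g. corpus.db, index.json) is stored uncompressed,
    so its bytes sit in the archive exactly as they are on disk.
    """
    src = Path(src_dir)
    with zipfile.ZipFile(dest_file, "w") as zf:
        for path in sorted(src.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(src)
            in_content = rel.parts[0] == "content"
            method = zipfile.ZIP_DEFLATED if in_content else zipfile.ZIP_STORED
            zf.write(path, rel.as_posix(), compress_type=method)
    return Path(dest_file)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "myarag-arag"
        (src / "content").mkdir(parents=True)
        (src / "content" / "notes.txt").write_text("hello " * 1000)
        (src / "index.json").write_text("{}")
        out = package_dir(src, Path(tmp) / "myarag.arag")
        with zipfile.ZipFile(out) as zf:
            for info in zf.infolist():
                print(info.filename, info.compress_type)
```

Storing the database uncompressed is what makes it plausible to read it in place from inside the archive, which is presumably how the custom SQLite VFS can query packaged files without extracting them.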
- Required:
  - `apsw`: SQLite with custom VFS support.
  - `numpy`: For vector operations.
  - `openai`: For OpenAI embeddings.
  - `pypdf`: For PDF processing.
  - `spire.doc`: For DOCX processing.
- Optional:
  - `sentence-transformers`: For local embeddings (`pip install ".[local_embeddings]"`).
Install additional dependencies as needed for specific file types.
- OpenAI API Key: Set via `--api-key` or the `OPENAI_API_KEY` environment variable.
- Embedding Models: Default models are `sentence-transformers/all-MiniLM-L6-v2` (local) and `text-embedding-3-small` (OpenAI). Override with `--model`.
- Chunk Size: Default is 8192 bytes; adjust with `--chunk-size`.
Contributions are welcome! Please:
- Fork the repository.
- Create a feature branch (`git checkout -b feature/yourfeature`).
- Commit changes (`git commit -m "Add your feature"`).
- Push to the branch (`git push origin feature/yourfeature`).
- Open a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.