aRAG CLI Tool

aRAG, or arag, is a command-line interface (CLI) tool for creating, managing, and querying a custom file type called .arag. This tool enables users to package content into a structured format, process it into a searchable corpus, generate embeddings for vector-based querying, and retrieve information efficiently. It currently supports both local and OpenAI-based embedding methods and includes features for content management, packaging, and interactive usage.

The goal of the arag file type is to provide a simple, self-contained way to create localized vector databases that are easy to use with RAG and LLMs. Imagine, for example, downloading the entire documentation for a programming language or package, generating an arag with a couple of clicks, and then dragging and dropping that file into your AI chats, giving the model the information it needs without filling up your context window. The current plan is to add support for arag files to popular LLM chats (like chatgpt.com) via a browser extension, to further increase their usefulness.

Features

- Create .arag Files: Generate .arag directories or packaged archives with a custom structure.
- Content Management: Add, delete, list, and clean content within .arag files.
- Corpus Processing: Convert content into a SQLite-based searchable corpus with chunking support.
- Embedding Generation: Index content using OpenAI or local SentenceTransformer models for vector search.
- Vector Querying: Perform similarity searches on indexed content using a query string.
- Packaging: Compress .arag directories into .arag archives and unpackage them as needed.
- Interactive Mode: Open an .arag file and manage it interactively.
- Custom VFS: Query packaged .arag files directly using a SQLite Virtual File System (VFS).

Installation

Prerequisites

Python (3.11 was used for development)
pip package manager

Steps

Clone the Repository:

```
git clone https://github.com/jmelovich/arag-cli.git
cd arag-cli
```

Install the Package:
```
pip install .
```
To include optional support for local embeddings (SentenceTransformer), use:
```
pip install ".[local_embeddings]"
```
Verify Installation:
```
arag --help
```
This should display the CLI help message with available commands.

Usage

The arag CLI provides a variety of subcommands to manage arag files. Below is an overview of the commands and their usage.

Commands

`create`

Create a new arag directory, spec file, or packaged .arag from a spec.

Create a Directory:
```
arag create dir myarag /path/to/directory
```
Creates a myarag-arag directory at the specified path. An arag directory works the same as an .arag file, but is not read-only (until packaged into a file). This is the principal way to create an arag.
Create a Spec File:
```
arag create spec /path/to/example.arag-json
```
Generates a template .arag-json file. You can modify this template to set all the settings needed to create a .arag file.
Create from Spec:
```
arag create from-spec /path/to/spec.arag-json
```
Builds a packaged .arag file based on the spec file. This is the easiest way to create an arag file.

`content`

Manage content within an .arag directory (not supported for packaged files). Content is whatever you want to be indexed, which (for now) means any sort of text file, PDFs, or DOCX files.

Add Content:
```
arag content add myfile.txt --arag /path/to/myarag-arag
```
Adds myfile.txt to the content folder. Directories are also supported: all files in the given directory are added recursively.
Delete Content:
```
arag content del myfile.txt --arag /path/to/myarag-arag
```
Removes myfile.txt from the content folder. Also works with directories.
List Contents:
```
arag content ls --arag /path/to/myarag-arag
```
Lists all files in the content folder.
Corpify Content:
```
arag content corpify --arag /path/to/myarag-arag --chunk-size 8192 --force
```
Processes content into corpus.db with the specified chunk size. The --force flag overwrites any existing corpus. The --chunk-size argument sets the size, in bytes, at which each entry (file) added to the corpus is split into separate rows (the default is typically fine).
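
To make the chunking step concrete, here is a minimal Python sketch of byte-bounded chunking into a SQLite table. The `chunks` table, its columns, and the `chunk_bytes` and `corpify` helpers are hypothetical names chosen for illustration; they are not taken from arag's source, and the real corpus.db schema may differ.

```python
import sqlite3
from typing import List


def chunk_bytes(data: bytes, chunk_size: int = 8192) -> List[bytes]:
    """Split data into consecutive pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def corpify(conn: sqlite3.Connection, path: str, data: bytes, chunk_size: int = 8192) -> int:
    """Store each chunk of one file as its own row and return the row count."""
    # Hypothetical schema: one row per (file, chunk index).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks "
        "(file_path TEXT, chunk_index INTEGER, content BLOB)"
    )
    pieces = chunk_bytes(data, chunk_size)
    conn.executemany(
        "INSERT INTO chunks VALUES (?, ?, ?)",
        [(path, i, piece) for i, piece in enumerate(pieces)],
    )
    conn.commit()
    return len(pieces)


conn = sqlite3.connect(":memory:")
rows_written = corpify(conn, "myfile.txt", b"x" * 20000, chunk_size=8192)
print(rows_written)  # 20000 bytes at 8192 bytes per chunk -> 3 rows
```

Slicing raw bytes can cut a multi-byte UTF-8 character in half, so a real implementation would likely split on character or sentence boundaries instead.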
Clean Content:
```
arag content clean --arag /path/to/myarag-arag
```
Removes files from the content folder that are not present in corpus.db. Running this is recommended to avoid wasting space.
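
As a rough illustration of what cleaning involves, the Python sketch below lists files in a content folder that have no rows in the corpus. The `chunks` table with a `file_path` column and the `find_unindexed_files` helper are assumptions made for this example, not arag's actual schema or code.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import List


def find_unindexed_files(content_dir: Path, conn: sqlite3.Connection) -> List[Path]:
    """Return files under content_dir whose relative path has no row in the corpus."""
    indexed = {row[0] for row in conn.execute("SELECT DISTINCT file_path FROM chunks")}
    return sorted(
        p for p in content_dir.rglob("*")
        if p.is_file() and p.relative_to(content_dir).as_posix() not in indexed
    )


with tempfile.TemporaryDirectory() as tmp:
    content_dir = Path(tmp)
    (content_dir / "kept.txt").write_text("indexed")
    (content_dir / "stale.txt").write_text("not indexed")

    conn = sqlite3.connect(":memory:")
    # Hypothetical corpus table containing a single indexed file.
    conn.execute("CREATE TABLE chunks (file_path TEXT, chunk_index INTEGER, content BLOB)")
    conn.execute("INSERT INTO chunks VALUES ('kept.txt', 0, x'00')")

    stale = [p.name for p in find_unindexed_files(content_dir, conn)]
    conn.close()

print(stale)  # files a clean step would delete
```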

`index`

Generate embeddings for the corpus.

Index with OpenAI:
```
arag index --arag /path/to/myarag-arag --method openai --api-key YOUR_API_KEY
```
Indexes using OpenAI embeddings. The --api-key flag is optional if you have an API key set in an environment variable called OPENAI_API_KEY.
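
The flag-or-environment lookup described above could look like the following Python sketch. The `resolve_api_key` helper is hypothetical, and the precedence shown (an explicit flag wins over the environment variable) is an assumption rather than documented arag behavior.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the key from --api-key if given, else from OPENAI_API_KEY."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit flag first, environment variable second.
    key = cli_value or env.get("OPENAI_API_KEY")
    if not key:
        raise ValueError("No API key: pass --api-key or set OPENAI_API_KEY")
    return key


# A plain dict stands in for the real environment in this example.
key = resolve_api_key(None, {"OPENAI_API_KEY": "sk-example"})
print(key)
```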
Index Locally:
```
arag index --arag /path/to/myarag-arag --method local
```
Uses the default SentenceTransformer model. Pass the --model argument to choose a different model, given as a Hugging Face name such as sentence-transformers/all-MiniLM-L6-v2.

`query`

Search the corpus with a query string.

Query with Results:
```
arag query "search term" --arag /path/to/myarag.arag --topk 3
```
Returns top 3 matching chunks with content. --topk defaults to 1.

Query with File Paths:

```
arag query "search term" --arag /path/to/myarag.arag --get-file
```

Returns just file paths instead of content.
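
For readers new to vector search, the Python sketch below shows the general idea behind a top-k similarity query. It assumes cosine similarity and uses tiny hand-written vectors in place of real embeddings; the `cosine_similarity` and `top_k` helpers are illustrative, and arag's actual scoring method is not stated here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunks: List[Tuple[str, str, Sequence[float]]],
          k: int = 1) -> List[Tuple[float, str, str]]:
    """Score every (file_path, text, embedding) chunk and keep the k best."""
    scored = [(cosine_similarity(query_vec, emb), path, text)
              for path, text, emb in chunks]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
chunks = [
    ("docs/install.txt", "Run pip install .", [1.0, 0.0, 0.0]),
    ("docs/query.txt", "Use arag query", [0.0, 1.0, 0.0]),
    ("docs/package.txt", "Use arag package", [0.6, 0.8, 0.0]),
]
results = top_k([0.0, 1.0, 0.0], chunks, k=2)
paths = [path for _, path, _ in results]
print(paths)
```

Returning only `paths` corresponds loosely to what --get-file does, while returning the chunk text corresponds to the default output.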

`package`

Package an .arag directory into a .arag file.

Package Directory:
```
arag package /path/to/myarag-arag --remove-original
```
Creates myarag.arag and removes the original directory.

`unpackage`

Unpackage a .arag file into a directory.

Unpackage File:
```
arag unpackage /path/to/myarag.arag --remove-original
```
Extracts to myarag-arag and removes the original file.

`open`

Enter interactive mode with an .arag file or directory.

Open a File:
```
arag open /path/to/myarag.arag
```
Starts an interactive shell for managing the .arag.

Interactive Mode

Run arag open <path> to interact with an .arag file or directory. Commands can be entered without the arag prefix or the --arag argument:

```
> content ls
> content add myfile.txt
> query "find this" --topk 2
> close
```

Type quit or close to exit.

Spec File Creation

Use arag create spec <destination> to generate a template .arag-json file at the destination, then modify it:

```
{
    "arag_name": "myarag",
    "arag_dest": "./myarag.arag",
    "content_include": ["file1.txt", "dir/docs"],
    "clean_content": true,
    "chunk_size": 8192,
    "index_method": "openai",
    "index_model": "text-embedding-3-small",
    "api_key": "YOUR_API_KEY",
    "openai_endpoint": "https://api.openai.com/v1",
    "arag_version": "0.1.0",
    "should_package": true,
    "remove_arag_dir": true
}
```

Run arag create from-spec <.arag-json-path> to build the .arag file.
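
If you build arags from a script, the spec file can also be generated programmatically. The Python sketch below writes a spec with the same fields as the template above; the `build_spec` helper is hypothetical, and which fields arag treats as required or optional is not confirmed here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def build_spec(name: str, content_paths: List[str], dest_dir: str = ".") -> Dict[str, object]:
    """Return a spec dictionary mirroring the fields of the template."""
    return {
        "arag_name": name,
        "arag_dest": f"{dest_dir}/{name}.arag",
        "content_include": content_paths,
        "clean_content": True,
        "chunk_size": 8192,
        "index_method": "openai",
        "index_model": "text-embedding-3-small",
        "api_key": "YOUR_API_KEY",  # placeholder, replace before use
        "openai_endpoint": "https://api.openai.com/v1",
        "arag_version": "0.1.0",
        "should_package": True,
        "remove_arag_dir": True,
    }


spec = build_spec("myarag", ["file1.txt", "dir/docs"])

with tempfile.TemporaryDirectory() as tmp:
    spec_path = Path(tmp) / "myarag.arag-json"
    spec_path.write_text(json.dumps(spec, indent=4))
    # Read the file back to confirm it is valid JSON with the same content.
    loaded = json.loads(spec_path.read_text())

print(loaded["arag_dest"])
```

The resulting file can then be passed to arag create from-spec as usual.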

Examples

Full Workflow:

```
# Create an arag directory
arag create dir myarag ./data
# Open the arag directory in interactive mode
arag open ./data/myarag-arag
# Add content
content add document.pdf
# Corpify
content corpify --clean
# Index locally
index --method local
# Package the file and remove this directory
package --remove-original
# Open the .arag file
arag open ./data/myarag.arag
# Query
query "important info"
```

Using a Spec File:

```
arag create spec myarag.arag-json

# Edit myarag.arag-json as needed
nano myarag.arag-json

arag create from-spec myarag.arag-json
```
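
The non-interactive commands can also be driven from a script. The Python sketch below only assembles an arag query command line from the flags documented above; the `build_query_command` helper is hypothetical, and whether --topk and --get-file can be combined in one call is an assumption.

```python
import shlex
from typing import List


def build_query_command(query: str, arag_path: str,
                        topk: int = 1, get_file: bool = False) -> List[str]:
    """Assemble the argument list for an `arag query` invocation."""
    cmd = ["arag", "query", query, "--arag", arag_path, "--topk", str(topk)]
    if get_file:
        cmd.append("--get-file")
    return cmd


cmd = build_query_command("search term", "/path/to/myarag.arag", topk=3)
print(shlex.join(cmd))

# To actually run it (requires arag to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the query contains spaces or quotes.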

File Structure

An arag directory/file has the following structure:

- content/: Stores raw files and directories.
- content_list.txt: Lists all files in content/.
- corpus.db: SQLite database with chunked content & vector embeddings.
- index.json: Metadata about embeddings (method, model, etc.).

A packaged .arag file is a special ZIP archive containing these components. Only the content folder is compressed; the other files are stored uncompressed so they can be accessed directly.
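
The mixed layout can be illustrated with Python's standard zipfile module, which allows a compression method per entry. This is a sketch of the general technique under assumed file names, not arag's actual packaging code; `package_arag` is a hypothetical helper.

```python
import io
import zipfile
from typing import Dict


def package_arag(files: Dict[str, bytes]) -> bytes:
    """Build a ZIP where only entries under content/ are compressed."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            # Deflate the raw content; store everything else byte-for-byte.
            if name.startswith("content/"):
                method = zipfile.ZIP_DEFLATED
            else:
                method = zipfile.ZIP_STORED
            archive.writestr(name, data, compress_type=method)
    return buffer.getvalue()


archive_bytes = package_arag({
    "content/notes.txt": b"hello " * 1000,
    "content_list.txt": b"notes.txt\n",
    "corpus.db": b"placeholder for SQLite bytes",
    "index.json": b'{"method": "local"}',
})

with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    methods = {info.filename: info.compress_type for info in archive.infolist()}

print(methods["corpus.db"] == zipfile.ZIP_STORED)
```

A stored entry occupies one contiguous, unmodified byte range inside the archive, which is what makes it possible for a reader such as a custom SQLite VFS to read the database in place without extracting it first.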

Dependencies

Required:
- apsw: SQLite with custom VFS support.
- numpy: For vector operations.
- openai: For OpenAI embeddings.
- pypdf: For PDF processing.
- spire.doc: For DOCX processing.
Optional:
- sentence-transformers: For local embeddings (pip install ".[local_embeddings]").

Install additional dependencies as needed for specific file types.

Configuration

- OpenAI API Key: Set via --api-key or the OPENAI_API_KEY environment variable.
- Embedding Models: Default models are sentence-transformers/all-MiniLM-L6-v2 (local) and text-embedding-3-small (OpenAI). Override with --model.
- Chunk Size: Default is 8192 bytes; adjust with --chunk-size.

Contributing

Contributions are welcome! Please:

- Fork the repository.
- Create a feature branch (git checkout -b feature/yourfeature).
- Commit changes (git commit -m "Add your feature").
- Push to the branch (git push origin feature/yourfeature).
- Open a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
