SemTools

Semantic search and document parsing tools for the command line

A high-performance CLI tool for document processing and semantic search, built with Rust for speed and reliability.

semtools parse - Parse documents (PDF, DOCX, etc.) into markdown, using the LlamaParse API by default
semtools search - Local semantic keyword search using multilingual embeddings with cosine similarity matching and per-line context matching
semtools ask - AI agent with search and read tools for answering questions over document collections (defaults to OpenAI; see the Configuration section for connecting to any OpenAI-compatible API)
semtools workspace - Workspace management for accelerating search over large collections
semtools jgrep - Fork-local Rust-native Jina semantic grep, classification, stdin reranking, and workspace-backed profile search

NOTE: By default, parse uses LlamaParse as a backend. Get your API key today for free at https://cloud.llamaindex.ai. search and workspace remain local-only. ask requires an OpenAI API key.

Key Features

Fast semantic search using model2vec embeddings from minishlab/potion-multilingual-128M
Local Jina semantic grep in this fork through semtools jgrep, including recursive code/text search, classification labels, stdin reranking, JSON output, workspace profiles, and optional daemon-backed embedding
Reliable document parsing with caching and error handling
Unix-friendly design with proper stdin/stdout handling
Configurable distance thresholds and returned chunk sizes
Multi-format support for parsing documents (PDF, DOCX, PPTX, etc.)
Concurrent processing for better parsing performance
Workspace management for efficient document retrieval over large collections
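
The search features above come down to ranking lines by embedding distance and then cutting the list either by count or by threshold. The Python sketch below illustrates that selection logic only; it is not the semtools implementation. It assumes distance means 1 - cosine similarity, uses hand-written toy vectors in place of real model2vec embeddings, and the names cosine_distance and select_matches are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat zero vectors as unrelated
    return 1.0 - dot / (norm_a * norm_b)

def select_matches(query_vec, line_vecs, top_k=3, max_distance=None):
    """Rank lines by distance to the query.

    If max_distance is given, keep every line below the threshold and
    ignore top_k; otherwise keep the top_k closest lines.
    """
    scored = sorted(
        (cosine_distance(query_vec, vec), idx) for idx, vec in enumerate(line_vecs)
    )
    if max_distance is not None:
        return [(idx, dist) for dist, idx in scored if dist < max_distance]
    return [(idx, dist) for dist, idx in scored[:top_k]]

# Toy 3-dimensional "embeddings" for four lines of text (illustrative only).
lines = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]

by_top_k = select_matches(query, lines, top_k=2)
by_threshold = select_matches(query, lines, max_distance=0.3)
```

Both calls keep lines 0 and 1 here. The threshold form mirrors the documented behavior that --top-k is ignored when --max-distance is set.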

Installation

Prerequisites:

For the parse subcommand: LlamaIndex Cloud API key
For the ask subcommand: OpenAI API key

Install:

You can install semtools via npm:

npm i -g @llamaindex/semtools

Or via cargo:

# install entire crate
cargo install semtools

# install only select features
cargo install semtools --no-default-features --features=parse

Note: Installing from npm builds the Rust binaries locally during install if a prebuilt binary is not available, which requires Rust and Cargo to be available in your environment. Install from rustup if needed: https://www.rust-lang.org/tools/install.

Local Fork Installation

This repository is also maintained as a local enhanced fork. On this workstation, the active semtools command is expected to be installed from this source tree:

cargo install --path . --force

That local-source install is distinct from the upstream npm or crates.io package. The fork adds the semtools jgrep subcommand, which provides the local jina-grep / jina-semsearch style surface as a semtools subcommand rather than as separate PATH-default binaries.

Verify the active enhanced binary with:

which semtools
cargo install --list | grep -A1 '^semtools '
semtools jgrep --help
semtools jgrep --models-status --json

Supported Jina Models

This fork carries the primary local model information from the Jina grep CLI model matrix and the Hugging Face model cards over into semtools jgrep. The current runtime targets the MLX checkpoint repositories listed below but executes them through the Rust-native local runtime in this crate.

Model	Params	Dims	Max Seq	Matryoshka dims	Tasks
`jina-embeddings-v5-small`	677M	1024	32768	32, 64, 128, 256, 512, 768, 1024	`retrieval`, `text-matching`, `clustering`, `classification`
`jina-code-embeddings-1.5b`	1.54B	1536	32768	128, 256, 512, 1024, 1536	`nl2code`, `code2code`, `code2nl`, `code2completion`, `qa`

HF model-card details for the MLX repos:

jina-embeddings-v5-small resolves to jinaai/jina-embeddings-v5-text-small-mlx. The card describes it as an MLX multi-task checkpoint for jina-embeddings-v5-text-small, using a Qwen3-0.6B base with task-specific LoRA adapters (r=32, alpha=32) for retrieval, text matching, clustering, and classification. The repo stores the shared base weights plus task adapters under adapters/<task>/; the card lists about 1.1GB of base weights plus four 38MB adapters, or about 1.3GB total for all tasks.
jina-code-embeddings-1.5b resolves to jinaai/jina-code-embeddings-1.5b-mlx. The card describes it as an MLX port of the Jina code embedding model, based on Qwen2.5-Coder-1.5B, optimized for code retrieval across 15+ programming languages. Code tasks use task-specific query and passage instruction prefixes. The repo ships model.safetensors, config.json, tokenizer.json, tokenizer_config.json, vocab.json, and merges.txt.
Both model cards use pipeline_tag: feature-extraction, library_name: mlx, safetensors weights, and license: cc-by-nc-4.0. Treat the license as non-commercial unless you have separate rights.
semtools jgrep --models-status --json verifies the required local files config.json, tokenizer.json, and model.safetensors; v5 task adapters are loaded from the MLX repo layout when present.
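
The Matryoshka dims column in the table above (and the --truncate-dim option) refers to keeping only a leading prefix of each embedding. The Python sketch below shows that idea under an assumption this document does not confirm for jgrep: truncate to the first dim components, then re-normalize to unit length. truncate_embedding is a hypothetical helper, and the input vector is fake data rather than real model output.

```python
import math

# Matryoshka dimensions listed for jina-embeddings-v5-small in the table above.
V5_SMALL_DIMS = (32, 64, 128, 256, 512, 768, 1024)

def truncate_embedding(vec, dim, allowed=V5_SMALL_DIMS):
    """Keep the first `dim` components and rescale to unit length."""
    if dim not in allowed:
        raise ValueError(f"unsupported Matryoshka dimension: {dim}")
    if dim > len(vec):
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to normalize
    return [x / norm for x in head]

# A fake 1024-dimensional vector stands in for a real model output.
full = [1.0 / (i + 1) for i in range(1024)]
small = truncate_embedding(full, 128)
small_norm = math.sqrt(sum(x * x for x in small))
```

Smaller dimensions trade some retrieval quality for less memory and faster comparisons.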

semtools jgrep defaults to jina-embeddings-v5-small with the retrieval task for text search and reranking. Use the code model explicitly for code search:

semtools jgrep --model jina-code-embeddings-1.5b --task nl2code "HTTP retry with backoff" src --recursive --include '*.rs'

Quick Start

Basic Usage:

# Parse some files
semtools parse my_dir/*.pdf

# Search some (text-based) files
semtools search "some keywords" *.txt --max-distance 0.3 --n-lines 5

# Ask questions about your documents using an AI agent
semtools ask "What are the main findings?" papers/*.txt

# Combine parsing and search
semtools parse my_docs/*.pdf | xargs semtools search "API endpoints"

# Ask a question to a set of files
semtools ask "Some question?" *.txt 

# Combine parsing with the ask agent
semtools parse research_papers/*.pdf | xargs semtools ask "Summarize the key methodologies"

# Ask based on stdin content
cat README.md | semtools ask "How do I install SemTools?"

Local fork Jina usage:

# Semantic grep over Rust code
semtools jgrep "retry backoff timeout" src --recursive --include '*.rs' --top-k 8 --json

# Rerank piped candidates
printf '%s\n' "token refresh logic" "database migration runner" | semtools jgrep "OAuth token refresh" --top-k 1 --json

# Classify files against labels
semtools jgrep --classify -e bug -e feature -e docs ./issues/*.txt --json

# Build and query a Jina workspace profile
semtools jgrep --workspace semtools --profile code --sync README.md src --recursive --include '*.rs' --json
semtools jgrep --workspace semtools --profile code "Jina workspace profile search" --top-k 8 --json
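
For intuition about the classification command above, the Python sketch below shows one common way embedding-based classification works: embed the text and each label, then pick the label with the highest cosine similarity. Whether jgrep scores labels exactly this way is an assumption. toy_embed is a hypothetical stand-in that counts vocabulary words, so each label is given a short description here, whereas a real embedding model would embed the bare label.

```python
import math

VOCAB = ["crash", "error", "add", "support", "typo", "readme"]

def toy_embed(text):
    """Stand-in for a real embedding model: count vocabulary words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    """Cosine similarity, with 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(text, label_descriptions):
    """Return the best label and the per-label similarity scores."""
    text_vec = toy_embed(text)
    scores = {
        label: cosine_similarity(text_vec, toy_embed(desc))
        for label, desc in label_descriptions.items()
    }
    return max(scores, key=scores.get), scores

labels = {
    "bug": "crash error",
    "feature": "add support",
    "docs": "typo readme",
}
label, scores = classify("Please add support for PPTX export", labels)
```

Reranking piped candidates follows the same pattern with the roles reversed: one query is scored against many candidate lines, and the best-scoring candidates are kept.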

Advanced Usage:

# Combine with grep for exact-match pre-filtering and distance thresholding
semtools parse *.pdf | xargs cat | grep -i "error" | semtools search "network error" --max-distance 0.3

# Pipeline with content search (note the 'xargs' on search to search files instead of stdin)
find . -name "*.md" | xargs semtools parse | xargs semtools search "installation"

# Combine with grep for filtering (grep could be before or after parse/search!)
semtools parse docs/*.pdf | xargs semtools search "API" | grep -A5 "authentication"

# Save search results from stdin search
semtools parse report.pdf | xargs cat | semtools search "summary" > results.txt

Using Workspaces:

# Create or select a workspace
# Workspaces are stored in ~/.semtools/workspaces/
semtools workspace use my-workspace
> Workspace 'my-workspace' configured.
> To activate it, run:
>   export SEMTOOLS_WORKSPACE=my-workspace
> 
> Or add this to your shell profile (.bashrc, .zshrc, etc.)

# Activate the workspace
export SEMTOOLS_WORKSPACE=my-workspace

# All search commands will now use the workspace for caching embeddings
# The initial command is used to initialize the workspace
semtools search "some keywords" ./some_large_dir/*.txt --n-lines 5 --top-k 10

# If documents change, they are automatically re-embedded and cached
echo "some new content" > ./some_large_dir/some_file.txt
semtools search "some keywords" ./some_large_dir/*.txt --n-lines 5 --top-k 10

# If documents are removed, you can run prune to clean up stale files
semtools workspace prune

# You can see the stats of a workspace at any time
semtools workspace status
> Active workspace: arxiv
> Root: ~/.semtools/workspaces/arxiv
> Documents: 3000
> Index: Yes (IVF_PQ)

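The workspace steps above say that changed documents are re-embedded and that prune drops entries for removed files. The Python sketch below shows one way such a cache could behave. It assumes staleness is detected by content hash; semtools may use modification times or another mechanism. EmbeddingCache and its methods are hypothetical, and the embedding function is a toy.

```python
import hashlib

class EmbeddingCache:
    """Hypothetical in-memory cache keyed by path, invalidated by content hash."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.entries = {}  # path -> (content_hash, embedding)
        self.embed_calls = 0

    def get(self, path, content):
        """Return the embedding for a file, re-embedding only if content changed."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        cached = self.entries.get(path)
        if cached is not None and cached[0] == digest:
            return cached[1]  # unchanged: reuse the cached embedding
        self.embed_calls += 1
        embedding = self.embed_fn(content)
        self.entries[path] = (digest, embedding)
        return embedding

    def prune(self, existing_paths):
        """Drop entries for files that no longer exist; return how many were removed."""
        stale = [p for p in self.entries if p not in existing_paths]
        for p in stale:
            del self.entries[p]
        return len(stale)

cache = EmbeddingCache(embed_fn=lambda text: [float(len(text))])  # toy embedding
cache.get("a.txt", "hello")
cache.get("a.txt", "hello")             # cache hit, no re-embedding
cache.get("a.txt", "some new content")  # content changed, re-embedded
cache.get("b.txt", "other")
removed = cache.prune(existing_paths={"a.txt"})
```
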
CLI Help

$ semtools parse --help
A CLI tool for parsing documents using various backends

Usage: semtools parse [OPTIONS] <FILES>...

Arguments:
  <FILES>...  Files to parse

Options:
  -c, --config <CONFIG>    Path to the config file. Defaults to ~/.semtools_config.json
  -b, --backend <BACKEND>  The backend type to use for parsing. Defaults to `llama-parse` [default: llama-parse]
  -v, --verbose            Verbose output while parsing
  -h, --help               Print help

$ semtools search --help
A CLI tool for fast semantic keyword search

Usage: semtools search [OPTIONS] <QUERY> [FILES]...

Arguments:
  <QUERY>     Query to search for (positional argument)
  [FILES]...  Files to search, optional if using stdin

Options:
  -n, --n-lines <N_LINES>            How many lines before/after to return as context [default: 3]
      --top-k <TOP_K>                The top-k files or texts to return (ignored if max_distance is set) [default: 3]
  -m, --max-distance <MAX_DISTANCE>  Return all results with distance below this threshold (0.0+)
  -i, --ignore-case                  Perform case-insensitive search (default is false)
  -j, --json                         Output results in JSON format
  -h, --help                         Print help

$ semtools workspace --help
Manage semtools workspaces

Usage: semtools workspace [OPTIONS] <COMMAND>

Commands:
  use     Use or create a workspace (prints export command to run)
  status  Show active workspace and basic stats
  prune   Remove stale or missing files from store
  help    Print this message or the help of the given subcommand(s)

Options:
  -j, --json  Output results in JSON format
  -h, --help  Print help

$ semtools jgrep --help
High-quality local Jina semantic grep and code search

Usage: semtools jgrep [OPTIONS] [PATTERN] [FILES]...

Common options:
  -r, --recursive             Recursive directory search
      --include <GLOB>        Search only files matching GLOB
      --exclude <GLOB>        Skip files matching GLOB
      --exclude-dir <GLOB>    Skip directories matching GLOB
  -A, --after-context <N>     Lines after match
  -B, --before-context <N>    Lines before match
  -C, --context <N>           Lines before and after match
      --threshold <D>         Similarity threshold
      --top-k <K>             Max results
  -e, --regexp <LABEL>        Classification label
      --classify              Force classification mode
  -f, --file <LABEL_FILE>     Read classification labels from file
      --model <MODEL>         Jina model name
      --task <TASK>           Embedding task
      --truncate-dim <DIM>    Matryoshka output dimension
      --fast                  Use a lower Matryoshka dimension
      --model-dir <DIR>       Local model directory
  -w, --workspace <NAME>      Use a semtools workspace
      --profile <ID>          Jina workspace profile id
      --sync                  Sync/index files into the profile
  -j, --json                  Output JSON
      --models-status         Print local model status
      --daemon-start          Start the Rust-native jgrep daemon
      --daemon-status         Print daemon status
      --daemon-stop           Stop the daemon
      --daemon                Use daemon-backed embedding calls

$ semtools ask --help
A CLI tool for document-based question-answering

Usage: semtools ask [OPTIONS] <QUERY> [FILES]...

Arguments:
  <QUERY>     Query to prompt the agent with
  [FILES]...  Files to search, optional if using stdin

Options:
  -c, --config <CONFIG>      Path to the config file. Defaults to ~/.semtools_config.json
      --api-key <API_KEY>    OpenAI API key (overrides config file and env var)
      --base-url <BASE_URL>  OpenAI base URL (overrides config file)
  -m, --model <MODEL>        Model to use for the agent (overrides config file)
      --api-mode <API_MODE>  API mode to use: 'chat' or 'responses' (overrides config file)
  -j, --json                 Output results in JSON or text format
  -h, --help                 Print help

Configuration

SemTools uses a unified configuration file at ~/.semtools_config.json that contains settings for all CLI tools. You can also specify a custom config file path using the -c or --config flag on any command.

Unified Configuration File

Create a ~/.semtools_config.json file with settings for the tools you use. All sections are optional - if not specified, sensible defaults will be used. The parse_kwargs section is passed directly to LlamaParse (see the LlamaParse docs for available options). In the ask section, api_mode can be responses or chat.

{
  "parse": {
    "api_key": "your_llama_cloud_api_key_here",
    "num_ongoing_requests": 10,
    "base_url": "https://api.cloud.llamaindex.ai",
    "parse_kwargs": {
      "tier": "agentic",
      "version": "latest",
      "disable_cache": false
    },
    "check_interval": 5,
    "max_timeout": 3600,
    "max_retries": 10,
    "retry_delay_ms": 1000,
    "backoff_multiplier": 2.0
  },
  "ask": {
    "api_key": "your_openai_api_key_here",
    "base_url": null,
    "model": "gpt-4o-mini",
    "max_iterations": 20,
    "api_mode": "responses"
  }
}

Find out more about parsing configuration on the dedicated documentation page.

See example_semtools_config.json in the repository for a complete example.

Environment Variables

As an alternative or supplement to the config file, you can set API keys via environment variables:

# For parse tool
export LLAMA_CLOUD_API_KEY="your_llama_cloud_api_key_here"

# For ask tool
export OPENAI_API_KEY="your_openai_api_key_here"

Configuration Priority

Configuration values are resolved in the following priority order (highest to lowest):

CLI arguments (e.g., --api-key, --model, --base-url)
Config file (~/.semtools_config.json or custom path via -c)
Environment variables (LLAMA_CLOUD_API_KEY, OPENAI_API_KEY)
Built-in defaults

This allows you to set common defaults in the config file while overriding them on a per-command basis when needed.
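
The priority order above can be read as "first value that is set wins". The Python sketch below illustrates that rule for two ask settings. It is not how semtools resolves settings internally; resolve_setting and resolve_ask_settings are hypothetical helpers, and the inputs are plain dictionaries standing in for parsed CLI arguments, the config file, and the environment.

```python
def resolve_setting(cli_value, config_value, env_value, default):
    """Return the first value that is set, in priority order."""
    for candidate in (cli_value, config_value, env_value):
        if candidate is not None:
            return candidate
    return default

def resolve_ask_settings(cli_args, config, environ):
    """Resolve a few `ask` settings from the three sources plus defaults."""
    ask_cfg = config.get("ask", {})
    return {
        "api_key": resolve_setting(
            cli_args.get("api_key"), ask_cfg.get("api_key"),
            environ.get("OPENAI_API_KEY"), None,
        ),
        "model": resolve_setting(
            cli_args.get("model"), ask_cfg.get("model"), None, "gpt-4o-mini",
        ),
    }

settings = resolve_ask_settings(
    cli_args={"model": "gpt-4o"},
    config={"ask": {"model": "gpt-4o-mini", "api_key": None}},
    environ={"OPENAI_API_KEY": "sk-from-env"},
)
```

Here the model comes from the CLI argument, while the API key falls through to the environment variable because neither the CLI nor the config file sets it.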

Subcommand-Specific Configuration

Parse Subcommand

The parse subcommand requires a LlamaParse API key. Get your free API key at https://cloud.llamaindex.ai.

Configuration options:

api_key: Your LlamaParse API key
base_url: API endpoint (default: "https://api.cloud.llamaindex.ai")
num_ongoing_requests: Number of concurrent requests (default: 10)
parse_kwargs: Additional parsing parameters
check_interval, max_timeout, max_retries, retry_delay_ms, backoff_multiplier: Retry and timeout settings
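
To get a feel for the retry settings above, the Python sketch below computes a wait schedule from max_retries, retry_delay_ms, and backoff_multiplier. It assumes plain exponential backoff, where each wait is the previous one multiplied by backoff_multiplier; the exact formula the parse backend uses is not stated here, and backoff_delays_ms is a hypothetical helper.

```python
def backoff_delays_ms(max_retries, retry_delay_ms, backoff_multiplier):
    """Return the wait before each retry, growing geometrically (assumed formula)."""
    return [
        int(retry_delay_ms * backoff_multiplier ** attempt)
        for attempt in range(max_retries)
    ]

# Five retries starting at one second and doubling each time.
delays = backoff_delays_ms(max_retries=5, retry_delay_ms=1000, backoff_multiplier=2.0)
total_wait_s = sum(delays) / 1000
```

Under this assumption, large max_retries values grow the total wait quickly, so max_timeout is the setting that bounds the worst case.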

Ask Subcommand

The ask subcommand requires an OpenAI API key for the agent's LLM.

Configuration options:

api_key: Your OpenAI API key
base_url: Custom OpenAI-compatible API endpoint (optional, for using other providers)
model: LLM model to use (default: "gpt-4o-mini")
max_iterations: Maximum agent loop iterations (default: 10)
api_mode: API mode to use, either 'responses' or 'chat'

You can also override these per-command:

semtools ask "What is this about?" docs/*.txt --model gpt-4o --api-key sk-...

Agent Use Case Examples

Future Work

More parsing backends (something local-only would be great!)
Improved search algorithms
Built-in agentic search
Persistence for speedups on repeat searches on the same files

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

LlamaIndex/LlamaParse for document parsing capabilities
model2vec-rs for fast embedding generation
minishlab/potion-multilingual-128M for an amazing default static embedding model
simsimd for efficient similarity computation
