
NextPlaid API

A REST API for multi-vector search with built-in text encoding.
Async batching, metadata filtering, optional rate limiting, Swagger UI. Powers the NextPlaid ecosystem.

Quick Start · API Reference · Python SDK · Docker · Architecture


Quick Start

Run with Docker (recommended):

# CPU with built-in model
docker run -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cpu-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/answerai-colbert-small-v1-onnx --int8

# GPU with CUDA
docker run --gpus all -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cuda-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/GTE-ModernColBERT-v1 --cuda

Use from Python:

pip install next-plaid-client

from next_plaid_client import NextPlaidClient, IndexConfig

client = NextPlaidClient("http://localhost:8080")

# Create index and add documents
client.create_index("docs", IndexConfig(nbits=4))
client.add("docs",
    documents=["NextPlaid is a multi-vector database", "ColGREP searches code semantically"],
    metadata=[{"id": "doc_1"}, {"id": "doc_2"}],
)

# Search
results = client.search("docs", ["vector database"])

# Search with metadata filtering
results = client.search("docs", ["coding tool"],
    filter_condition="id = ?", filter_parameters=["doc_1"],
)

# Delete by predicate
client.delete("docs", "id = ?", ["doc_1"])

Or call the API directly:

# Create index
curl -X POST http://localhost:8080/indices \
  -H 'Content-Type: application/json' \
  -d '{"name": "docs", "config": {"nbits": 4}}'

# Add documents (text encoded server-side)
curl -X POST http://localhost:8080/indices/docs/update_with_encoding \
  -H 'Content-Type: application/json' \
  -d '{"documents": ["hello world"], "metadata": [{"title": "test"}]}'

# Search
curl -X POST http://localhost:8080/indices/docs/search_with_encoding \
  -H 'Content-Type: application/json' \
  -d '{"queries": ["hello"], "params": {"top_k": 5}}'

Interactive docs at http://localhost:8080/swagger-ui.


Two Modes

NextPlaid API runs in two modes depending on whether you pass --model:

|           | With --model | Without --model |
| --------- | ------------ | --------------- |
| Encoding  | Pass text, get results; the server encodes via ONNX Runtime. | You encode externally and pass embedding arrays. |
| Endpoints | All endpoints available, including *_with_encoding. | Core endpoints only; encoding endpoints return 400. |
| Use case  | Production deployments, Python SDK | Custom models, external encoding pipelines |
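
In embeddings-only mode you bring your own encoder and post token-level vectors straight to /update. A minimal sketch using requests, where encode_tokens is a hypothetical stand-in for your external encoding pipeline:

import requests

def encode_tokens(text: str) -> list[list[float]]:
    # Placeholder for your external encoder: must return one vector per
    # token, with a fixed dimension matching the index.
    raise NotImplementedError

payload = {
    "documents": [{"embeddings": encode_tokens("NextPlaid is a multi-vector database")}],
    "metadata": [{"id": "doc_1"}],
}
resp = requests.post("http://localhost:8080/indices/docs/update", json=payload, timeout=30)
resp.raise_for_status()  # 202 Accepted; the update is applied asynchronously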

API Reference

Health & Documentation

| Method | Path | Description |
| ------ | ---- | ----------- |
| GET | /health | Health check with system info, model config, all index summaries |
| GET | / | Alias for /health |
| GET | /swagger-ui | Interactive Swagger UI |
| GET | /api-docs/openapi.json | OpenAPI 3.0 specification |

Index Management

| Method | Path | Description |
| ------ | ---- | ----------- |
| GET | /indices | List all indices |
| POST | /indices | Declare a new index (config only, no data) |
| GET | /indices/{name} | Get index info (docs, partitions, dimension) |
| DELETE | /indices/{name} | Delete an index and all its data |
| PUT | /indices/{name}/config | Update config (e.g. max_documents) |

Documents

| Method | Path | Returns | Description |
| ------ | ---- | ------- | ----------- |
| POST | /indices/{name}/update | 202 | Add documents with pre-computed embeddings |
| POST | /indices/{name}/update_with_encoding | 202 | Add documents as text (server encodes) |
| POST | /indices/{name}/documents | 202 | Add to existing index (legacy) |
| DELETE | /indices/{name}/documents | 202 | Delete by SQL WHERE condition |

All document mutations return 202 Accepted and process asynchronously. Concurrent requests to the same index are batched automatically.
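
If you need read-after-write behavior, poll the index info endpoint until the document count catches up. A minimal sketch, assuming GET /indices/{name} exposes num_documents as in the /health example below:

import time
import requests

def wait_for_count(base: str, index: str, expected: int, timeout_s: float = 60.0) -> dict:
    # Poll until the index reports at least `expected` documents.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        info = requests.get(f"{base}/indices/{index}", timeout=5).json()
        if info.get("num_documents", 0) >= expected:
            return info
        time.sleep(0.2)
    raise TimeoutError(f"{index} did not reach {expected} documents in time")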

Search

| Method | Path | Description |
| ------ | ---- | ----------- |
| POST | /indices/{name}/search | Search with embedding arrays |
| POST | /indices/{name}/search/filtered | Search + SQL metadata filter |
| POST | /indices/{name}/search_with_encoding | Search with text queries |
| POST | /indices/{name}/search/filtered_with_encoding | Text search + metadata filter |

Metadata

| Method | Path | Description |
| ------ | ---- | ----------- |
| GET | /indices/{name}/metadata | Get all metadata entries |
| GET | /indices/{name}/metadata/count | Count metadata entries |
| POST | /indices/{name}/metadata/check | Check which doc IDs have metadata |
| POST | /indices/{name}/metadata/query | Get doc IDs matching SQL condition |
| POST | /indices/{name}/metadata/get | Get metadata by IDs or SQL condition |
| POST | /indices/{name}/metadata/update | Update metadata rows matching condition |

Encoding & Reranking

| Method | Path | Description |
| ------ | ---- | ----------- |
| POST | /encode | Encode texts to ColBERT embeddings |
| POST | /rerank | Rerank with pre-computed embeddings (MaxSim) |
| POST | /rerank_with_encoding | Rerank with text (server encodes + MaxSim) |

Request & Response Examples

Create Index

POST /indices
{
  "name": "my_index",
  "config": {
    "nbits": 4,
    "batch_size": 50000,
    "seed": 42,
    "start_from_scratch": 999,
    "max_documents": 10000
  }
}

| Field | Default | Description |
| ----- | ------- | ----------- |
| nbits | 4 | Quantization bits (2 or 4) |
| batch_size | 50000 | Documents per indexing chunk |
| seed | null | Random seed for K-means |
| start_from_scratch | 999 | Below this doc count, full rebuild on update |
| max_documents | null | Evict oldest when exceeded (null = unlimited) |
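
Through the SDK, the same fields can be set on IndexConfig; a sketch, assuming its keyword arguments mirror the JSON field names:

from next_plaid_client import NextPlaidClient, IndexConfig

client = NextPlaidClient("http://localhost:8080")
client.create_index("my_index", IndexConfig(
    nbits=4,                 # quantization bits (2 or 4)
    batch_size=50_000,       # documents per indexing chunk
    seed=42,                 # reproducible K-means
    start_from_scratch=999,  # full rebuild below this doc count
    max_documents=10_000,    # evict oldest documents beyond this
))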

Add Documents (text)

POST /indices/my_index/update_with_encoding
{
  "documents": [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany."
  ],
  "metadata": [{ "country": "France" }, { "country": "Germany" }],
  "pool_factor": 2
}

Returns 202 Accepted. The pool_factor reduces token count via hierarchical clustering (e.g. 2 = ~50% fewer embeddings per document).
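
You can observe the effect of pool_factor directly via encoding; a sketch, assuming client.encode returns the JSON shape shown in the Encode section below:

text = "Paris is the capital of France."
full = client.encode([text], input_type="document")
pooled = client.encode([text], input_type="document", pool_factor=2)
# pooled keeps roughly half as many vectors for the same document
print(len(full["embeddings"][0]), len(pooled["embeddings"][0]))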

Add Documents (embeddings)

POST /indices/my_index/update
{
  "documents": [
    {
      "embeddings": [
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]
      ]
    },
    {
      "embeddings": [
        [0.7, 0.8, 0.9],
        [0.1, 0.2, 0.3]
      ]
    }
  ],
  "metadata": [{ "title": "Doc A" }, { "title": "Doc B" }]
}

Search (text)

POST /indices/my_index/search_with_encoding
{
  "queries": ["What is the capital of France?"],
  "params": { "top_k": 10 }
}

Response:

{
  "results": [
    {
      "query_id": 0,
      "document_ids": [0, 1],
      "scores": [18.42, 12.67],
      "metadata": [{ "country": "France" }, { "country": "Germany" }]
    }
  ],
  "num_queries": 1
}

Search with Filter

POST /indices/my_index/search/filtered_with_encoding
{
  "queries": ["capital city"],
  "params": { "top_k": 5 },
  "filter_condition": "country = ?",
  "filter_parameters": ["France"]
}

Search Parameters

| Parameter | Default | Description |
| --------- | ------- | ----------- |
| top_k | 10 | Results to return per query |
| n_ivf_probe | 8 | IVF cells to probe per query token |
| n_full_scores | 4096 | Candidates for exact re-ranking |
| centroid_score_threshold | null | Prune low-scoring centroids (e.g. 0.4) |
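
Through the SDK these map onto SearchParams; a sketch, assuming its keywords mirror the JSON parameters:

from next_plaid_client import SearchParams

# Trade latency for recall: probe more IVF cells and rerank more candidates.
params = SearchParams(top_k=10, n_ivf_probe=16, n_full_scores=8192)
results = client.search("my_index", ["capital city"], params=params)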

Delete Documents

DELETE /indices/my_index/documents
{
  "condition": "country = ? AND year < ?",
  "parameters": ["outdated", 2020]
}

Returns 202 Accepted. Deletes are batched: multiple delete requests within a short window are processed together.

Encode

POST /encode
{
  "texts": ["Paris is the capital of France."],
  "input_type": "document",
  "pool_factor": 2
}

Response:

{
  "embeddings": [[[0.1, 0.2, ...], [0.3, 0.4, ...]]],
  "num_texts": 1
}

input_type is "query" or "document". Queries use MASK token expansion. Documents filter padding tokens.
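
The asymmetry shows up directly in the output: query embeddings are expanded to a fixed token count, while document embeddings track the actual text length. A sketch, under the same response-shape assumption as above:

q = client.encode(["What is the capital of France?"], input_type="query")
d = client.encode(["Paris is the capital of France."], input_type="document")
# q has a fixed number of vectors (MASK-expanded to query_length);
# d varies with the document's token count.
print(len(q["embeddings"][0]), len(d["embeddings"][0]))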

Rerank

POST /rerank_with_encoding
{
  "query": "What is the capital of France?",
  "documents": [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany."
  ],
  "pool_factor": null
}

Response:

{
  "results": [
    { "index": 0, "score": 15.23 },
    { "index": 1, "score": 8.12 }
  ],
  "num_documents": 2
}
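
The MaxSim score itself is simple: for every query token, take its best similarity over all document tokens, then sum. A NumPy sketch of the scoring rule (assuming L2-normalized embeddings, so dot product equals cosine similarity):

import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim)
    sim = query_emb @ doc_emb.T          # all pairwise token similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed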

Health

GET /health
{
  "status": "healthy",
  "version": "1.0.1",
  "loaded_indices": 1,
  "index_dir": "/data/indices",
  "memory_usage_bytes": 104857600,
  "indices": [
    {
      "name": "my_index",
      "num_documents": 1000,
      "num_embeddings": 50000,
      "num_partitions": 512,
      "dimension": 128,
      "nbits": 4,
      "avg_doclen": 50.0,
      "has_metadata": true
    }
  ],
  "model": {
    "name": "GTE-ModernColBERT-v1",
    "path": "/models/GTE-ModernColBERT-v1",
    "quantized": false,
    "embedding_dim": 128,
    "batch_size": 128,
    "num_sessions": 1,
    "query_prefix": "[Q] ",
    "document_prefix": "[D] ",
    "query_length": 48,
    "document_length": 300,
    "do_query_expansion": true
  }
}

Error Codes

All errors return JSON:

{
  "code": "ERROR_CODE",
  "message": "Human-readable description",
  "details": null
}

| Code | HTTP | When |
| ---- | ---- | ---- |
| INDEX_NOT_FOUND | 404 | Index does not exist |
| INDEX_ALREADY_EXISTS | 409 | Index name already taken |
| INDEX_NOT_DECLARED | 404 | Must POST /indices before updating |
| BAD_REQUEST | 400 | Invalid parameters |
| DIMENSION_MISMATCH | 400 | Embedding dim doesn't match index |
| METADATA_NOT_FOUND | 404 | No metadata database for this index |
| MODEL_NOT_LOADED | 400 | Encoding endpoint requires --model |
| MODEL_ERROR | 500 | ONNX inference failed |
| SERVICE_UNAVAILABLE | 503 | Queue full, retry later |
| RATE_LIMITED | 429 | Too many requests (requires RATE_LIMIT_ENABLED; retry after 2s) |
| INTERNAL_ERROR | 500 | Unexpected server error |
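
RATE_LIMITED and SERVICE_UNAVAILABLE are both retryable. A minimal backoff sketch with requests:

import time
import requests

def post_with_retry(url: str, body: dict, attempts: int = 5) -> dict:
    delay = 2.0  # RATE_LIMITED suggests retrying after 2s
    for _ in range(attempts):
        resp = requests.post(url, json=body, timeout=30)
        if resp.status_code not in (429, 503):
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)
        delay *= 2  # back off while the queue drains
    raise RuntimeError(f"gave up after {attempts} attempts ({resp.status_code})")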

Python SDK

pip install next-plaid-client

Both sync and async clients:

from next_plaid_client import NextPlaidClient, AsyncNextPlaidClient
from next_plaid_client import IndexConfig, SearchParams

# Sync
client = NextPlaidClient("http://localhost:8080")

# Async
client = AsyncNextPlaidClient("http://localhost:8080")
await client.search("docs", ["query"])

SDK Methods

| Method | Description |
| ------ | ----------- |
| client.health() | Health check |
| client.create_index(name, config) | Create index |
| client.delete_index(name) | Delete index |
| client.get_index(name) | Get index info |
| client.list_indices() | List all indices |
| client.add(name, documents, metadata) | Add documents (text or embeddings) |
| client.search(name, queries, params, filter_condition, filter_parameters) | Search |
| client.delete(name, condition, parameters) | Delete by filter |
| client.encode(texts, input_type, pool_factor) | Encode texts |
| client.rerank(query, documents) | Rerank documents |
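
The async client pairs naturally with asyncio.gather for fanning out queries; a sketch, assuming the client needs no explicit close:

import asyncio
from next_plaid_client import AsyncNextPlaidClient

async def main():
    client = AsyncNextPlaidClient("http://localhost:8080")
    # Both searches are in flight concurrently; the server's encode
    # queue groups them into a single batch where possible.
    results = await asyncio.gather(
        client.search("docs", ["vector database"]),
        client.search("docs", ["semantic code search"]),
    )
    print(results)

asyncio.run(main())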

Docker

Images

# CPU (amd64 + arm64)
docker pull ghcr.io/lightonai/next-plaid:cpu-1.0.6

# CUDA (amd64, requires NVIDIA GPU)
docker pull ghcr.io/lightonai/next-plaid:cuda-1.0.6

The Docker entrypoint auto-downloads HuggingFace models. Pass org/model as --model and it handles the rest. Set HF_TOKEN for private models.

Docker Compose (CPU)

services:
  next-plaid-api:
    image: ghcr.io/lightonai/next-plaid:cpu-1.0.6
    ports:
      - "8080:8080"
    volumes:
      - ${NEXT_PLAID_DATA:-~/.local/share/next-plaid}:/data/indices
      - ${NEXT_PLAID_MODELS:-~/.cache/huggingface/next-plaid}:/models
    environment:
      - RUST_LOG=info
    command:
      - --host
      - "0.0.0.0"
      - --port
      - "8080"
      - --index-dir
      - /data/indices
      - --model
      - lightonai/answerai-colbert-small-v1-onnx
      - --int8
      - --parallel
      - "16"
      - --batch-size
      - "4"
    healthcheck:
      test:
        ["CMD", "curl", "-f", "--max-time", "5", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 2
      start_period: 120s
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 16G

Docker Compose (CUDA)

services:
  next-plaid-api:
    image: ghcr.io/lightonai/next-plaid:cuda-1.0.6
    ports:
      - "8080:8080"
    volumes:
      - ${NEXT_PLAID_DATA:-~/.local/share/next-plaid}:/data/indices
      - ${NEXT_PLAID_MODELS:-~/.cache/huggingface/next-plaid}:/models
    environment:
      - RUST_LOG=info
      - NVIDIA_VISIBLE_DEVICES=all
    command:
      - --host
      - "0.0.0.0"
      - --port
      - "8080"
      - --index-dir
      - /data/indices
      - --model
      - lightonai/GTE-ModernColBERT-v1
      - --cuda
      - --batch-size
      - "128"
    healthcheck:
      test:
        ["CMD", "curl", "-f", "--max-time", "5", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 2
      start_period: 120s
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 16G
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Volume Mounts

| Host Path | Container Path | Purpose |
| --------- | -------------- | ------- |
| ~/.local/share/next-plaid | /data/indices | Persistent index storage |
| ~/.cache/huggingface/next-plaid | /models | HuggingFace model cache |

CLI Reference

next-plaid-api [OPTIONS]

| Flag | Default | Description |
| ---- | ------- | ----------- |
| -h, --host | 0.0.0.0 | Bind address |
| -p, --port | 8080 | Bind port |
| -d, --index-dir | ./indices | Index storage directory |
| -m, --model | (none) | ONNX model path or HuggingFace ID |
| --cuda | off | Use CUDA for model inference |
| --int8 | off | Use the INT8-quantized model (~2x faster on CPU) |
| --parallel | 1 | Parallel ONNX sessions (8-25 recommended for throughput) |
| --batch-size | auto | Batch size per session (32 on CPU, 64 on GPU, 2 when parallel) |
| --threads | auto | Threads per ONNX session (auto: 1 when parallel) |
| --query-length | 48 | Max query length in tokens |
| --document-length | 300 | Max document length in tokens |
| --model-pool-size | 1 | Model worker instances for concurrent encoding |

# Embeddings-only (no model)
next-plaid-api -p 3000 -d /data/indices

# CPU with model
next-plaid-api --model lightonai/answerai-colbert-small-v1-onnx --int8 --parallel 16

# GPU
next-plaid-api --model lightonai/GTE-ModernColBERT-v1 --cuda --batch-size 128

# Debug logging
RUST_LOG=debug next-plaid-api --model ./models/colbert

Architecture

flowchart TD
    subgraph API["REST API (Axum)"]
        H["/health"]
        I["/indices/*"]
        S["/search"]
        E["/encode"]
        R["/rerank"]
    end

    subgraph Middleware
        RL["Rate Limiter<br/>token bucket · optional"]
        CL["Concurrency Limiter"]
        TR["Tracing<br/>X-Request-ID"]
        TO["Timeout<br/>30s health · 300s ops"]
    end

    subgraph Workers["Background Workers"]
        UQ["Update Batch Queue<br/>per index"]
        DQ["Delete Batch Queue<br/>per index"]
        EQ["Encode Batch Queue<br/>global"]
    end

    subgraph Core["Core (next-plaid)"]
        NP["MmapIndex<br/>IVF + PQ + MaxSim"]
        SQ["SQLite<br/>Metadata Filtering"]
    end

    subgraph Model["Model (next-plaid-onnx)"]
        OX["ONNX Runtime<br/>ColBERT Encoder"]
    end

    API --> Middleware
    I --> UQ
    I --> DQ
    E --> EQ
    UQ --> NP
    DQ --> NP
    UQ --> SQ
    DQ --> SQ
    S --> NP
    S --> SQ
    EQ --> OX
    R --> OX

    style API fill:#4a90d9,stroke:#357abd,color:#fff
    style Middleware fill:#50b86c,stroke:#3d9956,color:#fff
    style Workers fill:#e8913a,stroke:#d07a2e,color:#fff
    style Core fill:#9b59b6,stroke:#8445a0,color:#fff
    style Model fill:#e74c3c,stroke:#c0392b,color:#fff

Concurrency Design

The API uses lock-free reads and batched writes for high throughput (a client-side sketch follows the diagram below):

  • Reads (search, metadata queries): Lock-free via ArcSwap. Readers never block, even during writes.
  • Index updates: Per-index batch queue collects requests, processes up to 300 documents (or 100ms timeout) in a single atomic operation.
  • Deletes: Per-index delete queue batches conditions, resolves IDs inside the lock to handle ID shifting correctly.
  • Encoding: Global worker pool with N model instances. Requests are grouped by input_type and pool_factor, then encoded in a single batch.
  • Auto-repair: Before every update/delete, the API checks if the vector index and SQLite metadata are in sync. If not, it repairs automatically.
flowchart LR
    R1["Request 1"] --> BQ["Batch Queue"]
    R2["Request 2"] --> BQ
    R3["Request 3"] --> BQ
    BQ -->|"collect until<br/>300 docs or 100ms"| BW["Batch Worker"]
    BW -->|"acquire lock"| IDX["Index Update"]
    IDX --> META["Metadata Update"]
    META --> EVICT["Eviction Check"]
    EVICT --> RELOAD["Atomic Reload<br/>(ArcSwap)"]

    style BQ fill:#e8913a,stroke:#d07a2e,color:#fff
    style BW fill:#e8913a,stroke:#d07a2e,color:#fff
    style IDX fill:#9b59b6,stroke:#8445a0,color:#fff
    style META fill:#9b59b6,stroke:#8445a0,color:#fff
    style EVICT fill:#9b59b6,stroke:#8445a0,color:#fff
    style RELOAD fill:#50b86c,stroke:#3d9956,color:#fff
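From the client side, batching is invisible except as throughput: concurrent writers to one index each get a 202, and the server coalesces their payloads. A sketch using threads, assuming metadata is optional in client.add:

from concurrent.futures import ThreadPoolExecutor
from next_plaid_client import NextPlaidClient

client = NextPlaidClient("http://localhost:8080")
chunks = [[f"document {i}"] for i in range(12)]

# Requests landing within the 100ms / 300-document window are merged
# into a single atomic index update.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda docs: client.add("docs", documents=docs), chunks))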

Rate Limiting

Rate limiting is optional and disabled by default. Enable it by setting RATE_LIMIT_ENABLED=true. When enabled, the API applies a token bucket algorithm to a subset of routes:

| Scope | Rate limited? | Why exempt |
| ----- | ------------- | ---------- |
| /health, / | No | Monitoring must always work |
| GET /indices, GET /indices/{name} | No | Clients poll during async operations |
| POST /indices/{name}/update* | No | Protected by a per-index semaphore |
| DELETE /indices/{name}, DELETE /indices/{name}/documents | No | Internally batched |
| /encode, /rerank* | No | Internal backpressure via queue |
| Everything else | Yes | Standard rate limiting applies |

Environment Variables

Rate Limiting & Concurrency

| Variable | Default | Description |
| -------- | ------- | ----------- |
| RATE_LIMIT_ENABLED | false | Enable rate limiting (accepts true, 1, or yes) |
| RATE_LIMIT_PER_SECOND | 50 | Sustained requests/second (when enabled) |
| RATE_LIMIT_BURST_SIZE | 100 | Max burst size (when enabled) |
| CONCURRENCY_LIMIT | 100 | Max concurrent in-flight requests |

Document Batching

| Variable | Default | Description |
| -------- | ------- | ----------- |
| MAX_QUEUED_TASKS_PER_INDEX | 10 | Max pending updates per index (503 when full) |
| MAX_BATCH_DOCUMENTS | 300 | Documents per batch before processing |
| BATCH_CHANNEL_SIZE | 100 | Buffer for document batch queue |

Encode Batching

| Variable | Default | Description |
| -------- | ------- | ----------- |
| MAX_BATCH_TEXTS | 64 | Texts per encoding batch |
| ENCODE_BATCH_CHANNEL_SIZE | 256 | Buffer for encode batch queue |

Delete Batching

| Variable | Default | Description |
| -------- | ------- | ----------- |
| DELETE_BATCH_MIN_WAIT | 500 | Min wait (ms) after first delete before processing |
| DELETE_BATCH_MAX_WAIT | 2000 | Max wait (ms) for accumulating deletes |
| MAX_DELETE_BATCH_CONDITIONS | 200 | Max conditions per delete batch |

Logging

| Variable | Default | Description |
| -------- | ------- | ----------- |
| RUST_LOG | info | Log level (debug, info, warn, error) |
| HF_TOKEN | (none) | HuggingFace token for private model downloads |

Feature Flags

| Feature | Description |
| ------- | ----------- |
| (default) | Core API, no BLAS, no model support |
| openblas | OpenBLAS for matrix operations (Linux) |
| accelerate | Apple Accelerate (macOS) |
| model | ONNX model encoding (/encode, *_with_encoding) |
| cuda | CUDA acceleration (implies model) |

# Embeddings-only API
cargo build --release -p next-plaid-api

# With model support (CPU, Linux)
cargo build --release -p next-plaid-api --features "openblas,model"

# With CUDA
cargo build --release -p next-plaid-api --features "cuda"

Modules

| Module | Lines | Description |
| ------ | ----- | ----------- |
| handlers/documents | 1,638 | Index CRUD, update batching, delete batching, eviction, auto-repair |
| models | 759 | All request/response JSON schemas with OpenAPI annotations |
| handlers/encode | 549 | Encode worker pool, batch grouping by input type, ONNX inference |
| state | 488 | AppState, IndexSlot (ArcSwap), model pool, config caching |
| handlers/search | 449 | Search + filtered search, metadata enrichment, text-to-search pipeline |
| handlers/metadata | 484 | Metadata CRUD: check, query, get, count, update |
| handlers/rerank | 292 | ColBERT MaxSim scoring, text and embedding reranking |
| error | 138 | Error types with HTTP status code mapping |
| tracing_middleware | 115 | Request tracing via X-Request-ID header |
| main | 887 | CLI argument parsing, router construction, Swagger UI, server startup |
| lib | 44 | PrettyJson response type, module re-exports |

Dependencies

| Crate | Purpose |
| ----- | ------- |
| next-plaid | Core PLAID index (IVF + PQ + MaxSim) |
| next-plaid-onnx | ColBERT ONNX encoding (optional) |
| axum 0.8 | Web framework |
| tokio | Async runtime |
| tower / tower-http | Middleware (CORS, tracing, timeout, concurrency) |
| tower_governor | Rate limiting (token bucket) |
| utoipa / utoipa-swagger-ui | OpenAPI generation + Swagger UI |
| arc-swap | Lock-free index swapping |
| parking_lot | Fast read-write locks |
| sysinfo | Process memory usage for /health |
| uuid | Request trace IDs |
| ndarray | N-dimensional arrays |
| serde / serde_json | Serialization |

License

Apache-2.0