openclaw-strix-embed

Local, GPU-accelerated, OpenAI-compatible /v1/embeddings API on AMD Strix Halo. BAAI/bge-m3 by default, no API keys, no fees, no data leaving your network.

What this is

Local, GPU-accelerated, OpenAI-compatible embeddings API. A drop-in replacement for OpenAI's /v1/embeddings endpoint: no API keys, no usage fees, no data leaving your network.

Built for AMD Strix Halo (RDNA 3.5 / gfx1151) with ROCm, but falls back to CPU if no GPU is available.
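
Because the endpoint follows the OpenAI contract, existing SDKs only need their base URL overridden. A minimal sketch with the official openai Python client, assuming the default port from .env.template; the api_key value is an arbitrary placeholder since the server does no authentication:

from openai import OpenAI

# Point the stock OpenAI client at the local server; no real key needed.
client = OpenAI(base_url="http://localhost:8484/v1", api_key="unused")

resp = client.embeddings.create(model="BAAI/bge-m3", input="Hello world")
print(len(resp.data[0].embedding))  # 1024 dimensions for bge-m3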

Why

Every vector database and RAG pipeline needs an embeddings API. The standard options (OpenAI text-embedding-3-small, Google text-embedding-004) charge per token and send your data to external servers. This project serves the same API contract locally, for free, on your own hardware.

Project Structure

.
├── .gitignore              # Ignores .env and data/
├── .env.template           # Template, copy to .env
├── README.md               # This file
├── llm.txt                 # Complete technical reference
├── Dockerfile              # Ubuntu Rolling + ROCm PyTorch + FastAPI
├── docker-compose.yml      # Service definition with GPU passthrough
├── entrypoint.sh           # GPU check + uvicorn start
├── server.py               # OpenAI-compatible FastAPI server
└── data/                   # Persistent model cache (git-ignored)
    └── models/             # HuggingFace model files

Prerequisites

  • Docker with compose plugin
  • AMD Strix Halo (or any RDNA 3.5 GPU) for GPU mode
  • ~7 GB disk for the model + Docker image

Quick Start

cp .env.template .env
# Edit .env if you want to change the model or port
docker compose up -d --build

First start downloads the model (~4.3 GB). Subsequent starts load from cache in seconds.

Verify:

# Check GPU detection
docker logs openclaw-embeddings

# Test the API
curl http://localhost:8484/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"BAAI/bge-m3","input":"Hello world"}'

API

POST /v1/embeddings

OpenAI-compatible. Works with any client that speaks the OpenAI embeddings format.

Request:

{
  "model": "BAAI/bge-m3",
  "input": "text to embed"
}

input accepts a single string or an array of strings for batch embedding.

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0123, -0.044, ...]
    }
  ],
  "model": "BAAI/bge-m3",
  "usage": {"prompt_tokens": 3, "total_tokens": 3}
}
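
Since input also accepts an array, a batch call needs no SDK at all. A sketch using Python's requests library (the example texts are placeholders):

import requests

# One POST embeds several texts; the response carries one item per input.
payload = {
    "model": "BAAI/bge-m3",
    "input": ["first document", "second document", "third document"],
}
r = requests.post("http://localhost:8484/v1/embeddings", json=payload, timeout=60)
r.raise_for_status()
for item in r.json()["data"]:
    # "index" matches the position of the corresponding input string.
    print(item["index"], len(item["embedding"]))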

GET /v1/models

Lists available models and their dimensions.

GET /health

Returns {"status": "ok", "model_loaded": true} when ready.
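
Together these make a simple readiness probe before wiring the service into a pipeline. A sketch, assuming the default port:

import time
import requests

BASE = "http://localhost:8484"

# Poll /health until the model is loaded (a first start may still be
# downloading weights), then list the available models.
for _ in range(30):
    try:
        if requests.get(f"{BASE}/health", timeout=5).json().get("model_loaded"):
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)

print(requests.get(f"{BASE}/v1/models", timeout=5).json())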

Configuration

All configurable via .env:

Variable          Default       Description
EMBEDDING_MODEL   BAAI/bge-m3   HuggingFace model ID
EMBEDDING_PORT    8484          Host port for the API
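
A .env matching the defaults above (assuming .env.template mirrors this table):

EMBEDDING_MODEL=BAAI/bge-m3
EMBEDDING_PORT=8484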

Model

Default: BAAI/bge-m3

Property     Value
Dimensions   1024
Languages    100+ (excellent EN + ES)
Max tokens   8192
Size         ~2.2 GB (weights)

You can swap it for any sentence-transformers compatible model by changing EMBEDDING_MODEL in .env.
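
Before switching, it is worth confirming the candidate loads and noting its output dimension, since a different dimension means rebuilding any existing vector indexes. A sketch using the sentence-transformers library the server runs on; the alternative model named here is purely illustrative:

from sentence_transformers import SentenceTransformer

# Hypothetical replacement model -- substitute your own candidate.
candidate = "intfloat/multilingual-e5-large"
model = SentenceTransformer(candidate)
print(candidate, "->", model.get_sentence_embedding_dimension(), "dims")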

Using with Multipass VMs

If OpenClaw runs inside a Multipass VM and this embeddings service runs on the host, localhost won't work: inside the VM it points at the VM itself, not the host.

Find the host IP on the Multipass bridge:

# On the host
ip addr show mpqemubr0 | grep 'inet '
# Example output: inet 10.5.162.1/24 ...

Test from inside the VM:

multipass exec <vm-name> -- curl http://10.5.162.1:8484/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"BAAI/bge-m3","input":"connectivity test"}'

Configure OpenClaw with:

Setting    Value
Base URL   http://<host-bridge-ip>:8484/v1
Model      BAAI/bge-m3
Auth       None

Performance

Tested on AMD Ryzen AI Max (Strix Halo) with Radeon 8060S iGPU:

Metric                          Value
Latency (single text)           ~47 ms
Latency (first request, cold)   ~270 ms
GPU VRAM used                   ~1.5 GB
Model load time (from cache)    ~10 s
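
Numbers vary with text length and batch size. The warm single-text figure can be reproduced with a sketch like this (one throwaway request first, so the cold-start cost is excluded):

import time
import requests

URL = "http://localhost:8484/v1/embeddings"
body = {"model": "BAAI/bge-m3", "input": "latency probe"}

requests.post(URL, json=body, timeout=60)  # warm-up: absorbs the cold first request
t0 = time.perf_counter()
requests.post(URL, json=body, timeout=60)
print(f"{(time.perf_counter() - t0) * 1000:.1f} ms")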

GPU Details

This container uses the same ROCm setup as rocm-strix-docker:

  • HSA_OVERRIDE_GFX_VERSION=11.5.1 (required for ROCm to recognize Strix Halo)
  • privileged: true (grants /dev/kfd and /dev/dri access for GPU compute)
  • ipc: host (shared memory for PyTorch)
  • PyTorch wheels from https://rocm.prereleases.amd.com/whl/gfx1151/ (ROCm 7.11 prerelease)
  • UV manages Python 3.12 and all packages (no pip)
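
The entrypoint's GPU check reduces to asking PyTorch whether a HIP device is visible. A minimal sketch; note that ROCm builds of PyTorch expose the GPU through the CUDA API, which is why the log below reports Device: cuda:

import torch

# ROCm devices surface through torch.cuda; torch.version.hip is set on ROCm builds.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("ROCm/HIP:", torch.version.hip)
    device = "cuda"
else:
    device = "cpu"  # CPU fallback when no GPU is available
print("Device:", device)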

License

MIT for original code in this repository (FastAPI server, Dockerfile, Compose configs, scripts). Third-party model weights (BAAI/bge-m3) and runtimes (sentence-transformers, transformers, PyTorch ROCm) retain their own upstream licenses; this repository does not redistribute them.

Verified Output

[embeddings] ========================================
[embeddings] Model: BAAI/bge-m3
[embeddings] ========================================
[embeddings] Checking GPU...
[embeddings] GPU: Radeon 8060S Graphics
[embeddings] VRAM: 124.6 GB
[embeddings] ROCm/HIP: 7.2.53150-7b886380f9
[embeddings] Device: cuda
[embeddings] Starting server on port 80...
[embeddings] Loading BAAI/bge-m3 on cuda
[embeddings] GPU: Radeon 8060S Graphics
[embeddings] Model loaded. Dimension: 1024
