Local, GPU-accelerated, OpenAI-compatible embeddings API on AMD Strix Halo. A drop-in replacement for OpenAI's /v1/embeddings endpoint: BAAI/bge-m3 by default, no API keys, no usage fees, no data leaving your network.
Built for AMD Strix Halo (RDNA 3.5 / gfx1151) with ROCm, but falls back to CPU if no GPU is available.
Every vector database and RAG pipeline needs an embeddings API. The standard options (OpenAI text-embedding-3-small, Google text-embedding-004) charge per token and send your data to external servers. This project serves the same API contract locally, for free, on your own hardware.
```
.
├── .gitignore          # Ignores .env and data/
├── .env.template       # Template, copy to .env
├── README.md           # This file
├── llm.txt             # Complete technical reference
├── Dockerfile          # Ubuntu Rolling + ROCm PyTorch + FastAPI
├── docker-compose.yml  # Service definition with GPU passthrough
├── entrypoint.sh       # GPU check + uvicorn start
├── server.py           # OpenAI-compatible FastAPI server
└── data/               # Persistent model cache (git-ignored)
    └── models/         # HuggingFace model files
```
- Docker with compose plugin
- AMD Strix Halo (or any RDNA 3.5 GPU) for GPU mode
- ~7 GB disk for the model + Docker image
```bash
cp .env.template .env
# Edit .env if you want to change the model or port
docker compose up -d --build
```

The first start downloads the model (~4.3 GB); subsequent starts load from cache in seconds.
Verify:
```bash
# Check GPU detection
docker logs openclaw-embeddings

# Test the API
curl http://localhost:8484/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"BAAI/bge-m3","input":"Hello world"}'
```

The API is OpenAI-compatible and works with any client that speaks the OpenAI embeddings format.
Request:
```json
{
  "model": "BAAI/bge-m3",
  "input": "text to embed"
}
```

`input` accepts a single string or an array of strings for batch embedding.
Response:
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0123, -0.044, ...]
    }
  ],
  "model": "BAAI/bge-m3",
  "usage": {"prompt_tokens": 3, "total_tokens": 3}
}
```

Lists available models and their dimensions.
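The request/response contract above can be exercised from the standard library alone. The sketch below assumes the default port (8484) and defines, but does not call, a batch `embed()` helper plus a cosine-similarity function for comparing the returned vectors; the function names are illustrative, not part of this repo.

```python
import json
import math
import urllib.request

API_URL = "http://localhost:8484/v1/embeddings"  # default port from .env.template

def embed(texts):
    """POST a batch of texts and return their embedding vectors, ordered by index."""
    payload = json.dumps({"model": "BAAI/bge-m3", "input": texts}).encode()
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The "data" list mirrors the input order via each item's "index" field
    return [item["embedding"] for item in sorted(body["data"], key=lambda d: d["index"])]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Usage, with the server running:
#   vecs = embed(["Hello world", "Hola mundo"])
#   print(cosine(vecs[0], vecs[1]))
```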
Returns `{"status": "ok", "model_loaded": true}` when ready.
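Since the first start downloads the model, scripts may want to block until that status payload reports ready. A minimal polling sketch, assuming the health route lives at `/health` (the README shows the JSON but not the path; check server.py for the actual route):

```python
import json
import time
import urllib.request

def wait_ready(base_url="http://localhost:8484", path="/health", timeout=60.0):
    """Poll until the service reports model_loaded; False if the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + path, timeout=2) as resp:
                status = json.load(resp)
            if status.get("status") == "ok" and status.get("model_loaded"):
                return True
        except OSError:
            pass  # server not up yet (connection refused, timeout, ...)
        time.sleep(1.0)
    return False
```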
All configurable via .env:
| Variable | Default | Description |
|---|---|---|
| EMBEDDING_MODEL | BAAI/bge-m3 | HuggingFace model ID |
| EMBEDDING_PORT | 8484 | Host port for the API |
Default: BAAI/bge-m3
| Property | Value |
|---|---|
| Dimensions | 1024 |
| Languages | 100+ (excellent EN + ES) |
| Max tokens | 8192 |
| Size | ~2.2 GB (weights) |
You can swap it for any sentence-transformers compatible model by changing EMBEDDING_MODEL in .env.
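For example, a `.env` that swaps in a hypothetical alternative model (intfloat/multilingual-e5-large is one sentence-transformers-compatible option; its dimensions and size differ from bge-m3, so verify before relying on it) looks like:

```
# .env
EMBEDDING_MODEL=intfloat/multilingual-e5-large
EMBEDDING_PORT=8484
```

Then `docker compose up -d --build`; the new model downloads on its first start.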
If OpenClaw runs inside a Multipass VM and this embeddings service runs on the host, localhost won't work: inside the VM it points to the VM itself, not the host.
Find the host IP on the Multipass bridge:
```bash
# On the host
ip addr show mpqemubr0 | grep 'inet '
# Example output: inet 10.5.162.1/24 ...
```

Test from inside the VM:
```bash
multipass exec <vm-name> -- curl http://10.5.162.1:8484/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"BAAI/bge-m3","input":"connectivity test"}'
```

Configure OpenClaw with:
| Setting | Value |
|---|---|
| Base URL | http://<host-bridge-ip>:8484/v1 |
| Model | BAAI/bge-m3 |
| Auth | None |
Tested on AMD Ryzen AI Max (Strix Halo) with Radeon 8060S iGPU:
| Metric | Value |
|---|---|
| Latency (single text) | ~47 ms |
| Latency (first request, cold) | ~270 ms |
| GPU VRAM used | ~1.5 GB |
| Model load time (from cache) | ~10 s |
This container uses the same ROCm setup from rocm-strix-docker:
- `HSA_OVERRIDE_GFX_VERSION=11.5.1`: required for ROCm to recognize Strix Halo
- `privileged: true`: grants `/dev/kfd` and `/dev/dri` access for GPU compute
- `ipc: host`: shared memory for PyTorch
- PyTorch wheels from https://rocm.prereleases.amd.com/whl/gfx1151/ (ROCm 7.11 prerelease)
- UV manages Python 3.12 + all packages (no pip)
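The settings above translate into a compose service roughly like the following sketch. The port mapping and container paths are inferred from the startup log and repo layout, not copied from the actual docker-compose.yml, which remains authoritative:

```yaml
services:
  embeddings:
    build: .
    container_name: openclaw-embeddings
    privileged: true   # grants /dev/kfd and /dev/dri for GPU compute
    ipc: host          # shared memory for PyTorch
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.5.1  # required for Strix Halo (gfx1151)
    ports:
      - "8484:80"      # startup log shows the server listening on port 80
    volumes:
      - ./data/models:/models  # container-side path is an assumption
```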
MIT for original code in this repository (FastAPI server, Dockerfile, Compose configs, scripts). Third-party model weights (BAAI/bge-m3) and runtimes (sentence-transformers, transformers, PyTorch ROCm) retain their own upstream licenses; this repository does not redistribute them.
```
[embeddings] ========================================
[embeddings] Model: BAAI/bge-m3
[embeddings] ========================================
[embeddings] Checking GPU...
[embeddings] GPU: Radeon 8060S Graphics
[embeddings] VRAM: 124.6 GB
[embeddings] ROCm/HIP: 7.2.53150-7b886380f9
[embeddings] Device: cuda
[embeddings] Starting server on port 80...
[embeddings] Loading BAAI/bge-m3 on cuda
[embeddings] GPU: Radeon 8060S Graphics
[embeddings] Model loaded. Dimension: 1024
```