MLX TurboQuant Service — Supervised, Local Gemma 4 26B on Apple Silicon

Runs Gemma 4 26B-A4B locally on Apple Silicon via MLX and exposes it as an OpenAI-compatible provider boundary for OpenClaw-style agent stacks. A lightweight HTTP supervisor manages a separate worker process so the model stays up, restarts cleanly, and behaves predictably under agent workloads — single-target on purpose, not a generic multi-model surface.

Current 26B setup note: the main service currently runs the 8-bit MLX weights at majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-8bit on port 4017. In this repository, “TurboQuant” refers to the runtime/service path and KV-cache experimentation around that model family, not to a separate published 26B TQPlus weight artifact.

Current E4B setup note: the sibling test service runs mlx-community/gemma-4-e4b-it-8bit on port 4018 for small-model agent usage.

Why this exists

Getting a model to produce tokens is not enough for real agent use. This project exists to make local Gemma 4 inference with TurboQuant operationally usable by adding:

a stable OpenAI-style chat endpoint with streaming
supervised worker lifecycle management
local health and admin endpoints
smoke, recovery, and timeout testing
a cleaner path for evaluating MLX as a serious OpenClaw lane

Features

OpenAI-style API
- POST /v1/chat/completions (streaming and non-streaming)
- tool-call passthrough with hallucinated-tool containment
- configurable sampling (temperature, top-p)
- strips model-internal reasoning markers so only the final answer reaches clients
Supervisor + worker design
- keeps inference isolated from the control plane over a JSON-framed subprocess pipe
Bounded request queue
- allows one active worker request plus a small configurable FIFO queue instead of dropping the first overlap as worker_busy
Lifecycle controls
- lazy load, idle unload, explicit unload, restart, readiness, and health checks
Shared memory governor
- optional file-lock/state-file admission control for sibling services so smaller lanes do not casually crowd out the protected 26B lane
Operational visibility
- structured request and state-transition logs
- timing metrics for load, prefill, generation, and total request time
Local testing tools
- smoke, recovery, timeout, fixture, soak, reclaim, and lane-comparison scripts
Local-first security posture
- designed for loopback/private host use with enforceable local-only admin endpoints

API Endpoints

Method	Endpoint	Description
GET	`/health`	Basic health check
GET	`/ready`	Readiness and worker-state view
GET	`/v1/models`	Exposed model list
GET	`/admin/stats`	Worker metrics and config snapshot
POST	`/v1/chat/completions`	OpenAI-style chat completions (SSE streaming supported)
POST	`/admin/worker/unload`	Unload the worker
POST	`/admin/worker/restart`	Restart the worker

Project Layout

mlx-turbo-gemma-service/
├── src/
│   ├── supervisor/      # HTTP control plane
│   ├── worker/          # inference worker process
│   └── shared/          # shared config, models, constants
├── config/              # default + example local override
├── scripts/             # start/stop/smoke/fixture/soak helpers
├── benchmarks/          # prompt fixtures and shared prompt text
├── runtime/             # dedicated MLX Python virtualenv (gitignored)
├── logs/                # runtime logs (gitignored)
└── tmp/                 # scratch runtime files (gitignored)

Running

Start the service:

./scripts/start

Check status:

./scripts/state

Stop or restart:

./scripts/stop
./scripts/restart

Configuration

Configuration is split between:

config/default.json for baseline defaults
config/local.example.json for example overrides
config/local.json for machine-specific model/runtime settings (gitignored)

Typical local settings include model path, model id, Python runtime path, startup/request/probe timeouts, lazy-load behavior, idle-unload behavior, governor behavior, and sampling (temperature, top-p).

Note: model.maxOutputTokens defaults to 8192 (raised from the previous 1024) so longer agent turns and tool-call sequences fit without per-request overrides. Lower it in config/local.json if you need to cap output for memory or latency reasons.

Request Queue

The supervisor intentionally keeps inference single-worker and local-first. It can now absorb a bounded amount of overlap with worker.queue.maxDepth:

0: no queue; overlapping requests are rejected as worker_busy.
1: one active request plus one queued request. This is the recommended local default for the 26B, E4B, and voice-helper lanes.

Queued requests wait for the active request to finish, then run through the same worker path. If the queue is already full, POST /v1/chat/completions returns 409 queue_full with the current queue depth. Queue depth, max depth, and queue metrics are visible in /admin/stats.

This is not continuous batching. It is a conservative FIFO guard for agent traffic so short overlaps do not fail immediately while the service remains predictable. Continuous batching is a separate future scheduler change.

Shared memory governor

The optional governor reserves estimated loaded-worker RSS in a JSON state file under governor.stateDir, protected by an advisory file lock. It is designed for sibling single-model services, not dynamic routing inside one supervisor.

Recommended local shape:

26B: governor.instanceId: "mlx-26b", priority: 1, rssEstimateLoadedGb: 20.0
E4B sibling: governor.instanceId: "mlx-e4b", priority: 2, rssEstimateLoadedGb: 12.0
Shared ceiling: ceilingGb: 32.0
Keep allowLowerPriorityToPreemptHigher: false so the E4B lane cannot preempt 26B by default

When a cold load would exceed the ceiling, the governor refuses admission with governor_refused unless a configured preemption path can safely unload lower-priority rows first. Set governor.enabled: false to return to independent service behavior.

KV-cache recommendation

For TurboQuant / KV-cache experiments, the current recommendation is asymmetric compression:

keep K high precision by default
compress V first if memory pressure requires it
avoid symmetric low-bit K/V compression as the default
validate any KV change with tiny factual, long-context retrieval, tool-call, and reclaim tests before using it for agent traffic. Keep heavier private stress fixtures outside the public repo unless they are intentionally curated for release

The service does not currently expose stable KV tuning environment variables. Add documented knobs only when the underlying MLX cache path is wired and tested; until then, prefer known-good baseline cache behavior.

Helper Scripts

scripts/start / scripts/stop / scripts/restart
scripts/state
scripts/smoke-test
scripts/smoke-ready-state.sh
scripts/recovery-test
scripts/timeout-failure-test
scripts/list-fixtures
scripts/run-fixture
scripts/soak-profile
scripts/memory-profile
scripts/reclaim-profile
scripts/compare-lanes

Current Status

This project has moved beyond scaffold-only bring-up and into real local MLX inference testing. It currently supports:

real Gemma completions through MLX with configurable sampling
streaming responses with channel-markup and tool-call containment
supervised worker startup, idle unload, and recovery
readiness and stats inspection that stays responsive during active generation
cold/warm request validation and fixture-based cleanliness checks
optional shared memory-governor admission for 26B/E4B sibling services
early hardening for OpenClaw compatibility

Security

local-first by design
intended for loopback/private use
admin endpoints enforced local-only when server.adminLocalOnly is enabled
no built-in public auth layer

Requirements

Apple Silicon Mac
Python 3.11+ with an MLX-capable virtualenv (see runtime/)
local Gemma model files (for example, the 26B service can use majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-8bit, while a smaller sibling service can use mlx-community/gemma-4-e4b-it-8bit; set model.path in config/default.json or a local override)
OpenClaw-compatible workflow if used as a lane

Development note

Built AI-assisted, using my personal OpenClaw agents as my coding collaborators.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
benchmarks		benchmarks
config		config
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
INSTALL.md		INSTALL.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MLX TurboQuant Service — Supervised, Local Gemma 4 26B on Apple Silicon

Why this exists

Features

API Endpoints

Project Layout

Running

Configuration

Request Queue

Shared memory governor

KV-cache recommendation

Helper Scripts

Current Status

Security

Requirements

Development note

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MLX TurboQuant Service — Supervised, Local Gemma 4 26B on Apple Silicon

Why this exists

Features

API Endpoints

Project Layout

Running

Configuration

Request Queue

Shared memory governor

KV-cache recommendation

Helper Scripts

Current Status

Security

Requirements

Development note

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages