Skip to content

PetoVeritas/MLX-TurboQuant-Service

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MLX TurboQuant Service — Supervised, Local Gemma 4 26B on Apple Silicon

Runs Gemma 4 26B-A4B locally on Apple Silicon via MLX and exposes it as an OpenAI-compatible provider boundary for OpenClaw-style agent stacks. A lightweight HTTP supervisor manages a separate worker process so the model stays up, restarts cleanly, and behaves predictably under agent workloads — single-target on purpose, not a generic multi-model surface.

Current 26B setup note: the main service currently runs the 8-bit MLX weights at majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-8bit on port 4017. In this repository, “TurboQuant” refers to the runtime/service path and KV-cache experimentation around that model family, not to a separate published 26B TQPlus weight artifact.

Current E4B setup note: the sibling test service runs mlx-community/gemma-4-e4b-it-8bit on port 4018 for small-model agent usage.

Why this exists

Getting a model to produce tokens is not enough for real agent use. This project exists to make local Gemma 4 inference with TurboQuant operationally usable by adding:

  • a stable OpenAI-style chat endpoint with streaming
  • supervised worker lifecycle management
  • local health and admin endpoints
  • smoke, recovery, and timeout testing
  • a cleaner path for evaluating MLX as a serious OpenClaw lane

Features

  • OpenAI-style API
    • POST /v1/chat/completions (streaming and non-streaming)
    • tool-call passthrough with hallucinated-tool containment
    • configurable sampling (temperature, top-p)
    • strips model-internal reasoning markers so only the final answer reaches clients
  • Supervisor + worker design
    • keeps inference isolated from the control plane over a JSON-framed subprocess pipe
  • Bounded request queue
    • allows one active worker request plus a small configurable FIFO queue instead of dropping the first overlap as worker_busy
  • Lifecycle controls
    • lazy load, idle unload, explicit unload, restart, readiness, and health checks
  • Shared memory governor
    • optional file-lock/state-file admission control for sibling services so smaller lanes do not casually crowd out the protected 26B lane
  • Operational visibility
    • structured request and state-transition logs
    • timing metrics for load, prefill, generation, and total request time
  • Local testing tools
    • smoke, recovery, timeout, fixture, soak, reclaim, and lane-comparison scripts
  • Local-first security posture
    • designed for loopback/private host use with enforceable local-only admin endpoints

API Endpoints

Method Endpoint Description
GET /health Basic health check
GET /ready Readiness and worker-state view
GET /v1/models Exposed model list
GET /admin/stats Worker metrics and config snapshot
POST /v1/chat/completions OpenAI-style chat completions (SSE streaming supported)
POST /admin/worker/unload Unload the worker
POST /admin/worker/restart Restart the worker

Project Layout

mlx-turbo-gemma-service/
├── src/
│   ├── supervisor/      # HTTP control plane
│   ├── worker/          # inference worker process
│   └── shared/          # shared config, models, constants
├── config/              # default + example local override
├── scripts/             # start/stop/smoke/fixture/soak helpers
├── benchmarks/          # prompt fixtures and shared prompt text
├── runtime/             # dedicated MLX Python virtualenv (gitignored)
├── logs/                # runtime logs (gitignored)
└── tmp/                 # scratch runtime files (gitignored)

Running

Start the service:

./scripts/start

Check status:

./scripts/state

Stop or restart:

./scripts/stop
./scripts/restart

Configuration

Configuration is split between:

  • config/default.json for baseline defaults
  • config/local.example.json for example overrides
  • config/local.json for machine-specific model/runtime settings (gitignored)

Typical local settings include model path, model id, Python runtime path, startup/request/probe timeouts, lazy-load behavior, idle-unload behavior, governor behavior, and sampling (temperature, top-p).

Note: model.maxOutputTokens defaults to 8192 (raised from the previous 1024) so longer agent turns and tool-call sequences fit without per-request overrides. Lower it in config/local.json if you need to cap output for memory or latency reasons.

Request Queue

The supervisor intentionally keeps inference single-worker and local-first. It can now absorb a bounded amount of overlap with worker.queue.maxDepth:

  • 0: no queue; overlapping requests are rejected as worker_busy.
  • 1: one active request plus one queued request. This is the recommended local default for the 26B, E4B, and voice-helper lanes.

Queued requests wait for the active request to finish, then run through the same worker path. If the queue is already full, POST /v1/chat/completions returns 409 queue_full with the current queue depth. Queue depth, max depth, and queue metrics are visible in /admin/stats.

This is not continuous batching. It is a conservative FIFO guard for agent traffic so short overlaps do not fail immediately while the service remains predictable. Continuous batching is a separate future scheduler change.

Shared memory governor

The optional governor reserves estimated loaded-worker RSS in a JSON state file under governor.stateDir, protected by an advisory file lock. It is designed for sibling single-model services, not dynamic routing inside one supervisor.

Recommended local shape:

  • 26B: governor.instanceId: "mlx-26b", priority: 1, rssEstimateLoadedGb: 20.0
  • E4B sibling: governor.instanceId: "mlx-e4b", priority: 2, rssEstimateLoadedGb: 12.0
  • Shared ceiling: ceilingGb: 32.0
  • Keep allowLowerPriorityToPreemptHigher: false so the E4B lane cannot preempt 26B by default

When a cold load would exceed the ceiling, the governor refuses admission with governor_refused unless a configured preemption path can safely unload lower-priority rows first. Set governor.enabled: false to return to independent service behavior.

KV-cache recommendation

For TurboQuant / KV-cache experiments, the current recommendation is asymmetric compression:

  • keep K high precision by default
  • compress V first if memory pressure requires it
  • avoid symmetric low-bit K/V compression as the default
  • validate any KV change with tiny factual, long-context retrieval, tool-call, and reclaim tests before using it for agent traffic. Keep heavier private stress fixtures outside the public repo unless they are intentionally curated for release

The service does not currently expose stable KV tuning environment variables. Add documented knobs only when the underlying MLX cache path is wired and tested; until then, prefer known-good baseline cache behavior.

Helper Scripts

  • scripts/start / scripts/stop / scripts/restart
  • scripts/state
  • scripts/smoke-test
  • scripts/smoke-ready-state.sh
  • scripts/recovery-test
  • scripts/timeout-failure-test
  • scripts/list-fixtures
  • scripts/run-fixture
  • scripts/soak-profile
  • scripts/memory-profile
  • scripts/reclaim-profile
  • scripts/compare-lanes

Current Status

This project has moved beyond scaffold-only bring-up and into real local MLX inference testing. It currently supports:

  • real Gemma completions through MLX with configurable sampling
  • streaming responses with channel-markup and tool-call containment
  • supervised worker startup, idle unload, and recovery
  • readiness and stats inspection that stays responsive during active generation
  • cold/warm request validation and fixture-based cleanliness checks
  • optional shared memory-governor admission for 26B/E4B sibling services
  • early hardening for OpenClaw compatibility

Security

  • local-first by design
  • intended for loopback/private use
  • admin endpoints enforced local-only when server.adminLocalOnly is enabled
  • no built-in public auth layer

Requirements

Development note

Built AI-assisted, using my personal OpenClaw agents as my coding collaborators.


Copyright: © 2026 PetoVeritas
License: Apache-2.0

About

Local-first MLX inference service for Gemma 4 on Apple Silicon — OpenAI-compatible, supervised, streamable, tuned for Gemma 4 26B

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors