Environment variables

This page documents the MLXCEL_* environment variables that affect mlxcel runtime, server, downloader, build, and diagnostic behavior.

Prefer CLI flags for settings that have a flag equivalent. Environment variables are useful for containers, service units, and repeatable benchmark runs, but they are process-wide, and several of the low-level knobs are read once and cached on first use. Set them before starting `mlxcel` or `mlxcel-server`; changing a cached knob in a running process may have no effect.
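
The read-once behavior can be illustrated with a minimal Python sketch. This is not mlxcel's code: `read_knob_once` and `MLXCEL_EXAMPLE_KNOB` are hypothetical names used only to show why a value changed after first use is not picked up.

```python
import os
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=None)
def read_knob_once(name: str) -> Optional[str]:
    """Read an environment variable on first use and cache the result."""
    return os.environ.get(name)


# Hypothetical variable name, used only for this demonstration.
os.environ["MLXCEL_EXAMPLE_KNOB"] = "1"
first = read_knob_once("MLXCEL_EXAMPLE_KNOB")

# A later change in the same process is not seen by the cached reader.
os.environ["MLXCEL_EXAMPLE_KNOB"] = "0"
second = read_knob_once("MLXCEL_EXAMPLE_KNOB")

print(first, second)  # prints "1 1"
```

The practical consequence is the same as stated above: export the variable in the shell, unit file, or container spec that launches the process rather than mutating it afterwards.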

Precedence and value conventions

If a CLI flag and an environment variable control the same option, the CLI flag wins unless the flag help states otherwise.
`LLAMA_ARG_*` aliases exist for a subset of llama-server-compatible flags. This page focuses on `MLXCEL_*`; use `--help` for the full flag/env surface.
Boolean parsing is not completely uniform across all internal knobs:
- documented server options generally accept true/false, 1/0, yes/no, and on/off;
- many diagnostic switches are presence-based, so any set value enables the behavior;
- variables whose row says "falsy disables" treat 0, false, off, or no as disabled.
Variables marked advanced or diagnostic are not a stable public API. They exist for benchmarking, rollback, or kernel-development work and may change between releases.
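
The three boolean conventions above can be sketched in Python as follows. This is an illustration, not the parser mlxcel uses: the function names are hypothetical, and case-insensitive matching, whitespace trimming, and treating an empty value as "set" for presence-based switches are assumptions.

```python
from typing import Mapping, Optional

TRUTHY = {"true", "1", "yes", "on"}
FALSY = {"false", "0", "no", "off"}


def parse_documented_bool(raw: Optional[str]) -> Optional[bool]:
    """Documented server options: true/false, 1/0, yes/no, on/off."""
    if raw is None:
        return None
    text = raw.strip().lower()  # assumption: case-insensitive, trimmed
    if text in TRUTHY:
        return True
    if text in FALSY:
        return False
    return None  # unrecognized; the caller decides what to do


def presence_enabled(env: Mapping[str, str], name: str) -> bool:
    """Presence-based switches: any set value enables the behavior."""
    return name in env  # assumption: even an empty value counts as set


def falsy_disables(env: Mapping[str, str], name: str, default: bool = True) -> bool:
    """'Falsy disables' rows: only 0, false, off, or no turn the feature off."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in FALSY
```

Note the trap this makes visible: under the presence-based convention, setting a diagnostic switch to `0` still enables it, so unset the variable instead.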

Common runtime variables

Variable	Values	Default	Notes
`MLXCEL_DEVICE`	`gpu`, `metal`, `cpu`	`gpu` hint	`cpu` requests CPU execution. Invalid values are ignored with a warning and treated as `gpu`; if no GPU backend is available, the runtime falls back to CPU.
`MLXCEL_WIRED_LIMIT`	`max`, `0`, `none`, bytes, `NGB`, `NMB`	`max`	Apple Silicon GPU wired-memory limit. Unset, empty, or `max` sets the limit to the maximum GPU memory size that MLX reports; `0` or `none` disables the limit; numeric values set an explicit limit.
`MLXCEL_CACHE_DIR`	directory path	`$HOME/.cache/mlxcel`	Root for the tokenizer language-analysis disk cache used by language-bias features. Files live under `tokenizer-scripts/`.
`MLXCEL_SERVER_DECODE_STORAGE`	`auto`, `dense`, `paged`	`auto`	Server continuous-batching decode storage. `--decode-storage-backend` takes precedence. Invalid values warn and fall back to `auto`.
`MLXCEL_SURGERY`	YAML file path	unset	Feature-gated weight-load surgery configuration. `--surgery` takes precedence when the `surgery` feature is built.
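
As a rough guide to the `MLXCEL_WIRED_LIMIT` value forms in the table, here is a minimal Python sketch. It is not mlxcel's parser: `parse_wired_limit` is a hypothetical name, and the treatment of `GB`/`MB` as binary multiples, integer-only `N`, case-insensitive suffixes, and raising on invalid input are all assumptions.

```python
from typing import Optional, Union


def parse_wired_limit(raw: Optional[str]) -> Union[str, int, None]:
    """Return "max", None (limit disabled), or an explicit byte count."""
    text = (raw or "").strip().lower()
    if text in ("", "max"):
        return "max"  # use the maximum size reported by MLX
    if text in ("0", "none"):
        return None  # no wired-memory limit
    # Assumption: suffixes are binary multiples (1 GB = 1024**3 bytes).
    for suffix, scale in (("gb", 1024**3), ("mb", 1024**2)):
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * scale
    return int(text)  # plain byte count; raises ValueError if malformed


print(parse_wired_limit("8GB"))
```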

Build-time variables

These are read by the mlxcel-core build script.

Variable	Values	Default	Notes
`MLXCEL_BUILD_METAL`	`1/0`, `on/off`, `true/false`, `yes/no`	`on` on macOS	Overrides the CMake `MLX_BUILD_METAL` setting for local builds. Invalid values fail the build.
`MLXCEL_BUILD_ACCELERATE`	`1/0`, `on/off`, `true/false`, `yes/no`	`on` on macOS	Overrides the CMake `MLX_BUILD_ACCELERATE` setting for local builds. Invalid values fail the build.

CUDA builds also use non-`MLXCEL_*` variables such as `CUDA_HOME` and `MLX_CUDA_ARCHITECTURES`; see Installation.

Downloader variables

Variable	Values	Default	Notes
`MLXCEL_NO_PROGRESS`	any non-empty value	unset	Suppresses interactive download progress bars. `NO_COLOR` and `CI=true` also suppress bars.
`MLXCEL_ALLOW_INSECURE_ENDPOINT`	any non-empty value	unset	Allows sending a Hugging Face token to a non-HTTPS `HF_ENDPOINT`. Leave unset outside audited internal mirrors.
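
The progress-bar rule in the first row can be sketched as a single predicate. This is an illustration under stated assumptions, not the downloader's code: `progress_bars_enabled` is a hypothetical name, `NO_COLOR` is assumed to count when set to any non-empty value, and terminal detection is not modeled.

```python
from typing import Mapping


def progress_bars_enabled(env: Mapping[str, str]) -> bool:
    """Return False when any of the documented suppressors is active."""
    if env.get("MLXCEL_NO_PROGRESS"):  # any non-empty value
        return False
    if env.get("NO_COLOR"):  # assumption: any non-empty value
        return False
    if env.get("CI") == "true":
        return False
    return True


print(progress_bars_enabled({"CI": "true"}))  # prints "False"
```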

Server prompt-cache variables

These variables are applied when the corresponding CLI flag is absent.

Variable	Values	Default	Flag equivalent
`MLXCEL_PROMPT_CACHE_ENABLED`	boolean	`true`	`--prompt-cache-enabled`
`MLXCEL_PROMPT_CACHE_CAPACITY_BYTES`	unsigned integer bytes	`2147483648` (2 GiB)	`--prompt-cache-capacity-bytes`
`MLXCEL_PROMPT_CACHE_MAX_ENTRIES`	unsigned integer	`1024`	`--prompt-cache-max-entries`
`MLXCEL_PROMPT_CACHE_TTL`	unsigned integer seconds	`3600`	`--prompt-cache-ttl`
`MLXCEL_PROMPT_CACHE_MIN_PREFIX`	unsigned integer tokens	`32`	`--prompt-cache-min-prefix`

`MLXCEL_PROMPT_CACHE_ENABLED` takes precedence over the llama.cpp compatibility alias `LLAMA_ARG_CACHE_REUSE` when both are set and no CLI flag is provided.
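
The resulting order for the enabled switch (CLI flag, then the `MLXCEL_*` variable, then the llama.cpp alias, then the default) can be sketched as below. `resolve_prompt_cache_enabled` is a hypothetical helper, and each argument is assumed to be already parsed into `True`, `False`, or `None` for "not provided"; how the alias value is interpreted is outside this sketch.

```python
from typing import Optional


def resolve_prompt_cache_enabled(
    cli_flag: Optional[bool],
    mlxcel_env: Optional[bool],
    llama_alias: Optional[bool],
    default: bool = True,
) -> bool:
    """Pick the first provided source in documented precedence order."""
    for candidate in (cli_flag, mlxcel_env, llama_alias):
        if candidate is not None:
            return candidate
    return default


# The MLXCEL_* value wins over the alias when no CLI flag is given.
print(resolve_prompt_cache_enabled(None, False, True))  # prints "False"
```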

Speculative-decoding variables

Variable	Values	Default	Notes
`MLXCEL_DRAFT_KIND`	`dflash`, `mtp`	auto/none	Alias for `--draft-kind` when the CLI flag and `LLAMA_ARG_DRAFT_KIND` are absent.
`MLXCEL_DRAFT_BLOCK_SIZE`	unsigned integer	per drafter (`4` for MTP, `16` for DFlash)	Alias for `--draft-block-size` when the CLI flag and `LLAMA_ARG_DRAFT_BLOCK_SIZE` are absent.
`MLXCEL_ENABLE_MTP_B1`	truthy value	off	Advanced. Forces the singleton Gemma 4 MTP burst path for parity/debug testing.
`MLXCEL_ENABLE_MTP_BATCH`	truthy value	off	Advanced. Forces the batched Gemma 4 MTP burst path for parity/debug testing.
`MLXCEL_ENABLE_MTP_DEFERRED`	`1`	off	Advanced. Enables the deferred greedy verifier path for Gemma 4 MTP when sampling settings allow it.
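
The lookup order for the draft block size can be sketched as follows, using the per-drafter defaults from the table. This is a sketch, not the server's code: `resolve_draft_block_size` is a hypothetical name, an empty value is assumed to count as absent, and malformed values simply raise here.

```python
from typing import Mapping, Optional

# Per-drafter defaults taken from the table above.
DEFAULT_BLOCK_SIZE = {"mtp": 4, "dflash": 16}


def resolve_draft_block_size(
    draft_kind: str, cli_value: Optional[int], env: Mapping[str, str]
) -> int:
    """CLI flag first, then LLAMA_ARG_*, then MLXCEL_*, then the default."""
    if cli_value is not None:
        return cli_value
    for name in ("LLAMA_ARG_DRAFT_BLOCK_SIZE", "MLXCEL_DRAFT_BLOCK_SIZE"):
        raw = env.get(name)
        if raw:  # assumption: an empty value counts as absent
            return int(raw)
    return DEFAULT_BLOCK_SIZE[draft_kind]


print(resolve_draft_block_size("dflash", None, {}))  # prints "16"
```

Note that the order of the two environment names here differs from the prompt-cache switch: for the draft settings, the `MLXCEL_*` name is consulted only when the `LLAMA_ARG_*` name is absent.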

KV cache and TurboQuant variables

Use CLI flags such as `--cache-type-k`, `--cache-type-v`, `--kv-cache-mode`, `--turbo-boundary-v`, and the batch KV quantization flags when possible. The variables below are useful for service-level defaults and A/B experiments. See TurboQuant KV cache for the user-facing mode descriptions.

Variable	Values	Default	Notes
`MLXCEL_KV_BOUNDARY_V_LAYERS`	integer count	`2`	Number of first/last layers kept at higher precision for Turbo4-family modes. `0` disables. `--turbo-boundary-v` writes this value before cache construction and takes precedence.
`MLXCEL_TURBO_BOUNDARY_V`	integer count	fallback alias	Compatibility alias for `MLXCEL_KV_BOUNDARY_V_LAYERS`; the primary name wins when both are set.
`MLXCEL_KV_SKIP_LAST_LAYER`	boolean	`true`	Fallback for `--kv-skip-last-layer` in continuous-batching KV quantization.
`MLXCEL_SPARSE_V_THRESHOLD`	non-negative float	`1e-6`	Sparse-V alive threshold. `0` disables sparse-V; invalid values warn and use the default.
`MLXCEL_SPARSE_V_KERNEL`	falsy disables	enabled on macOS	Allows the fused Sparse-V/dequant Metal kernels. Set `0`, `false`, `off`, or `no` to force graph fallback.
`MLXCEL_TURBO4_DEQUANT_SDPA`	falsy disables	on	Controls the dequant-first SDPA path for symmetric `Turbo4`.
`MLXCEL_TURBO4_DELEGATED_DEQUANT_SDPA`	falsy disables	on	Controls the default dequant-first SDPA path for `Turbo4Delegated`.
`MLXCEL_TURBO4_DELEGATED_FUSED`	truthy enables	off	Advanced. Enables the older custom fused delegated-kernel route, mainly for comparison when dequant-first SDPA is disabled.
`MLXCEL_TURBO4_DELEGATED_FP16_FAST_PATH`	truthy enables	off	Advanced. Keeps a unified FP16 V working set in delegated mode for speed experiments while maintaining packed sidecars.
`MLXCEL_TURBO4_DELEGATED_FP16_SIDECARS`	`predecode`, `eager`, `lazy`, `on-demand`	`predecode`	Sidecar maintenance policy for the delegated FP16 fast path.
`MLXCEL_ENABLE_DIRECT_PREFILL_CACHE_STORE`	presence enables	off	Advanced. Installs the incoming prefill tensor directly as the initial KV cache buffer when applicable.
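
Two of the rows above describe fallback rules that are easy to misread: the boundary-layer alias and the sparse-V threshold. A minimal Python sketch of both, with hypothetical function names and the assumptions that blank values count as unset and that non-finite thresholds are treated as invalid:

```python
import math
import warnings
from typing import Mapping


def resolve_boundary_v_layers(env: Mapping[str, str], default: int = 2) -> int:
    """Primary name wins over the compatibility alias; 0 disables."""
    for name in ("MLXCEL_KV_BOUNDARY_V_LAYERS", "MLXCEL_TURBO_BOUNDARY_V"):
        raw = env.get(name)
        if raw is not None and raw.strip():  # assumption: blank means unset
            return int(raw)
    return default


def resolve_sparse_v_threshold(env: Mapping[str, str], default: float = 1e-6) -> float:
    """Invalid values warn and use the default; 0 disables sparse-V."""
    raw = env.get("MLXCEL_SPARSE_V_THRESHOLD")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        value = float("nan")
    if not math.isfinite(value) or value < 0:
        warnings.warn(f"invalid MLXCEL_SPARSE_V_THRESHOLD {raw!r}; using {default}")
        return default
    return value
```

The sketch leaves out the `--turbo-boundary-v` flag, which the table says overrides both variable names.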

Video and local-media variables

These apply to video-capable VLM request handling.

Variable	Values	Default	Notes
`MLXCEL_VIDEO_DIR_ALLOWLIST`	comma-separated directories	unset	Local `video_url` file paths are rejected unless they resolve under one of these canonicalized directories. Keep directories owner-writable only; group/world-writable entries warn at startup.
`MLXCEL_VIDEO_MAX_PIXELS`	unsigned integer	`16777216`	Rejects source videos whose `width × height` exceeds the cap.
`MLXCEL_VIDEO_MAX_DURATION_SEC`	float seconds	`600`	Rejects source videos longer than the cap.
`MLXCEL_VIDEO_MAX_PNG_FRAME_BYTES`	unsigned integer bytes	`268435456` (256 MiB)	Per-frame cap for the ffmpeg PNG stream splitter.
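
The allowlist and pixel-cap checks can be sketched in Python as below. This illustrates the general technique (canonicalize, then test containment), not mlxcel's request handler; the function names are hypothetical, and the directory used in the usage lines is an example path.

```python
import os
from typing import List


def parse_allowlist(raw: str) -> List[str]:
    """Split the comma-separated value and canonicalize each directory."""
    return [os.path.realpath(d.strip()) for d in raw.split(",") if d.strip()]


def is_allowed(path: str, allowlist: List[str]) -> bool:
    """Accept a path only if it resolves under an allowlisted directory."""
    resolved = os.path.realpath(path)  # resolves symlinks and ".." segments
    for root in allowlist:
        try:
            if os.path.commonpath([resolved, root]) == root:
                return True
        except ValueError:
            continue  # e.g. paths on different drives on Windows
    return False  # an empty allowlist rejects every local path


def exceeds_pixel_cap(width: int, height: int, cap: int = 16_777_216) -> bool:
    """The default cap equals 4096 x 4096 pixels."""
    return width * height > cap


allow = parse_allowlist("/srv/mlxcel-videos")  # example directory
print(is_allowed("/srv/mlxcel-videos/../secrets/a.mp4", allow))  # prints "False"
```

Comparing whole path components with `os.path.commonpath` avoids the prefix mistake where `/srv/mlxcel-videos-other` would match a plain string-prefix test against `/srv/mlxcel-videos`.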

Hardware and kernel diagnostic variables

These variables are for profiling, rollback, or experiments. They are not recommended as normal deployment settings.

Variable	Values	Default	Purpose
`MLXCEL_NO_PADDED_PREFILL`	presence disables	auto	Disables M5+/Neural-Accelerator prefill tile alignment.
`MLXCEL_FORCE_PADDED_PREFILL_MASK`	presence enables	off	Forces an explicit padded prefill mask path for debugging.
`MLXCEL_LOG_NA_ATTENTION`	`sampled`, `all`, truthy	off	Logs Neural Accelerator attention dispatch decisions.
`MLXCEL_ENABLE_FUSED_CAUSAL_PREFILL_ATTENTION`	presence enables	off	Enables an experimental Llama-family fused causal prefill path when supported.
`MLXCEL_ENABLE_FUSED_QKV_SPLIT_ROPE`	presence enables	off	Enables an experimental fused QKV projection/split/RoPE path.
`MLXCEL_GEMMA4_ENABLE_FUSED_QKV`	presence enables	off	Enables a Gemma 4 fused-QKV projection experiment.
`MLXCEL_DISABLE_COMPILED_SWITCH_QGEGLU`	presence disables	compiled path on when supported	Rolls back the Gemma 4 compiled Switch-QGeGLU decode path.
`MLXCEL_ENABLE_SOFTCAP_GQA_DECODE_GROUPED`	any value except `0` enables	off	Enables grouped softcap-GQA decode optimization.
`MLXCEL_DISABLE_SOFTCAP_GQA_DECODE_GROUPED`	`1` disables, `0` enables	unset	Legacy rollback/override for grouped softcap-GQA decode.
`MLXCEL_DISABLE_SINGLE_QUERY_MASKLESS`	truthy disables	maskless path on	Disables the single-query maskless attention path.
`MLXCEL_EXPERIMENTAL_BOOL_CAUSAL_MASK`	truthy enables	off	Enables an experimental boolean causal-mask path.
`MLXCEL_PIPELINE_GRANULARITY`	`off`, `layer`, `block:N`	`off`	Inserts layer-boundary async-eval hints for pipeline experiments.
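
The three value forms of `MLXCEL_PIPELINE_GRANULARITY` can be parsed along these lines. This is a sketch with a hypothetical function name; treating unset or empty as `off`, requiring a positive `N`, and raising on malformed input are assumptions, since the table does not say how invalid values are handled.

```python
from typing import Optional, Tuple


def parse_pipeline_granularity(raw: Optional[str]) -> Tuple[str, Optional[int]]:
    """Return ("off", None), ("layer", None), or ("block", N)."""
    text = (raw or "off").strip().lower()  # assumption: unset/empty means off
    if text in ("off", "layer"):
        return (text, None)
    if text.startswith("block:"):
        n = int(text[len("block:"):])  # raises ValueError if N is not an integer
        if n > 0:  # assumption: N must be positive
            return ("block", n)
    raise ValueError(f"unrecognized pipeline granularity: {raw!r}")


print(parse_pipeline_granularity("block:4"))  # prints "('block', 4)"
```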

Logging, profiling, and capture variables

Most of these switches force synchronization or extra graph work and will change throughput measurements. Use them for diagnosis, not capacity planning.

Variable	Values	Default	Purpose
`MLXCEL_TRACE_DTYPE`	presence enables	off	Prints selected tensor dtypes/shapes during generation.
`MLXCEL_FORCE_SYNC`	presence enables	off	Forces synchronous decode evaluation.
`MLXCEL_PROFILE_PIPELINE`	presence enables	off	Emits high-level generation pipeline timing.
`MLXCEL_PROFILE_PIPELINE_DETAIL`	presence enables	off	Adds per-step pipeline timing detail.
`MLXCEL_PROFILE_BLOCKS`	presence enables	off	Emits per-block/model-family timing where implemented.
`MLXCEL_PROFILE_FORWARD`	presence enables	off	Enables model-specific forward profiling where implemented.
`MLXCEL_PROFILE_QWEN3_MOE_DETAIL`	presence enables	off	Profiles Qwen3 MoE internals.
`MLXCEL_PROFILE_MOE_INNER`	presence enables	off	Profiles Gemma 4 MoE sub-operations.
`MLXCEL_PROFILE_PER_LAYER`	presence enables	off	Prints per-layer Gemma 4 timing.
`MLXCEL_PROFILE_LAYER_BUILD`	presence enables	off	Adds Gemma 4 layer-build timing.
`MLXCEL_PROFILE_LAYER_SUBOPS`	presence enables	off	Adds Gemma 4 per-suboperation timing.
`MLXCEL_EXPORT_DECODE_DOT`	file path	unset	Exports the first decode graph pair to DOT.
`MLXCEL_METAL_CAPTURE_PATH`	file path	unset	Starts a Metal capture around steady-state generation; requires `MTL_CAPTURE_ENABLED=1`.
`MLXCEL_DEBUG_GEMMA4_LOAD`	presence enables	off	Emits Gemma 4 safetensors loading diagnostics.
`MLXCEL_NO_PRECISION_WARNING`	presence suppresses	warning on	Suppresses the bf16-on-Apple-Silicon precision/performance note.

Test and CI variables

These are intended for the repository's own tests and automation rather than normal end-user operation.

Variable	Purpose
`MLXCEL_CI_PP_MODEL`	Model path used by the pipeline-parallel CI integration test.
`MLXCEL_SKIP_HEAVY_TESTS`	Skips selected heavy tests.
`MLXCEL_BENCH_DATE`	Metadata override for Turbo KV benchmark tests.
`MLXCEL_BENCH_MACHINE`	Metadata override for Turbo KV benchmark tests.
