Environment variables

This page documents the MLXCEL_* environment variables that affect mlxcel runtime, server, downloader, build, and diagnostic behavior.

Prefer CLI flags for settings that have a flag equivalent. Environment variables are useful for containers, service units, and repeatable benchmark runs, but they are process-wide and several of the low-level knobs are read once and cached on first use. Set them before starting mlxcel or mlxcel-server.
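
Because several knobs are read once and cached, a wrapper that launches mlxcel has to place the values in the child environment before the process starts. The following is a minimal sketch of that preparation step, assuming a Python launcher; `launch_env` is a hypothetical helper, not part of mlxcel, and it only builds the mapping without starting anything.

```python
import os


def launch_env(overrides, base=None):
    """Return a copy of the environment with MLXCEL_* overrides applied.

    Hypothetical helper: it prepares the mapping a launcher would hand to
    the child process; it does not start mlxcel or mlxcel-server itself.
    """
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if not name.startswith("MLXCEL_"):
            raise ValueError(f"not an MLXCEL_* variable: {name}")
        # Environment values are always strings.
        env[name] = str(value)
    return env


env = launch_env(
    {"MLXCEL_DEVICE": "cpu", "MLXCEL_PROMPT_CACHE_TTL": 600},
    base={"PATH": "/usr/bin"},
)
```

A launcher would then pass `env=env` to `subprocess.run` together with the server command line, so the variables are in place before any cached read happens.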

Precedence and value conventions

  • If a CLI flag and an environment variable control the same option, the CLI flag wins unless the flag help states otherwise.
  • LLAMA_ARG_* aliases exist for a subset of llama-server-compatible flags. This page focuses on MLXCEL_*; use --help for the full flag/env surface.
  • Boolean parsing is not completely uniform across all internal knobs:
    • documented server options generally accept true/false, 1/0, yes/no, and on/off;
    • many diagnostic switches are presence-based, so any set value enables the behavior;
    • variables whose row says "falsy disables" treat 0, false, off, or no as disabled.
  • Variables marked advanced or diagnostic are not a stable public API. They exist for benchmarking, rollback, or kernel-development work and may change between releases.
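
The three boolean conventions above can be sketched as follows. This is an illustration, not mlxcel's parser: the function names are hypothetical, and case-insensitive matching, whitespace trimming, and the fallback for unrecognized text are assumptions.

```python
TRUTHY = {"true", "1", "yes", "on"}
FALSY = {"false", "0", "no", "off"}


def parse_server_bool(raw, default):
    """Documented server options: true/false, 1/0, yes/no, on/off."""
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in TRUTHY:
        return True
    if value in FALSY:
        return False
    # Assumption: unrecognized text falls back to the default.
    return default


def presence_enabled(env, name):
    """Presence-based switch: any set value enables the behavior."""
    return name in env


def falsy_disables(env, name, default=True):
    """'falsy disables' rows: 0, false, off, or no turn the feature off."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in FALSY


env = {"MLXCEL_FORCE_SYNC": "0", "MLXCEL_SPARSE_V_KERNEL": "off"}
forced_sync = presence_enabled(env, "MLXCEL_FORCE_SYNC")
sparse_kernel = falsy_disables(env, "MLXCEL_SPARSE_V_KERNEL")
```

The practical consequence is visible in the example: setting a presence-based switch to `0` still enables it, so such a variable has to be unset to turn the behavior off.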

Common runtime variables

| Variable | Values | Default | Notes |
| --- | --- | --- | --- |
| MLXCEL_DEVICE | gpu, metal, cpu | gpu hint | cpu requests CPU execution. Invalid values are ignored with a warning and treated as gpu; if no GPU backend is available, runtime falls back to CPU. |
| MLXCEL_WIRED_LIMIT | max, 0, none, bytes, NGB, NMB | max | Apple Silicon GPU wired-memory limit. Unset/empty/max sets MLX's reported GPU max memory size; 0/none disables the limit; numeric values set an explicit limit. |
| MLXCEL_CACHE_DIR | directory path | $HOME/.cache/mlxcel | Root for the tokenizer language-analysis disk cache used by language-bias features. Files live under tokenizer-scripts/. |
| MLXCEL_SERVER_DECODE_STORAGE | auto, dense, paged | auto | Server continuous-batching decode storage. --decode-storage-backend takes precedence. Invalid values warn and fall back to auto. |
| MLXCEL_SURGERY | YAML file path | unset | Feature-gated weight-load surgery configuration. --surgery takes precedence when the surgery feature is built. |
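
The MLXCEL_WIRED_LIMIT value forms can be read roughly as in this sketch. It is not the mlxcel implementation: `parse_wired_limit` is a hypothetical name, binary units for the GB and MB suffixes are an assumption, and so is raising on malformed input, since this page does not say how invalid values are handled.

```python
GIB = 1024 ** 3  # Assumption: NGB means N * 2**30 bytes.
MIB = 1024 ** 2  # Assumption: NMB means N * 2**20 bytes.


def parse_wired_limit(raw, device_max_bytes):
    """Return the wired limit in bytes, or None when the limit is disabled.

    device_max_bytes stands in for MLX's reported GPU max memory size.
    """
    value = (raw or "").strip().lower()
    if value in ("", "max"):
        return device_max_bytes
    if value in ("0", "none"):
        return None
    for suffix, scale in (("gb", GIB), ("mb", MIB)):
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * scale
    # Plain byte count; int() raises ValueError on malformed input.
    return int(value)


limit = parse_wired_limit("8GB", device_max_bytes=16 * GIB)
```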

Build-time variables

These are read by the mlxcel-core build script.

| Variable | Values | Default | Notes |
| --- | --- | --- | --- |
| MLXCEL_BUILD_METAL | 1/0, on/off, true/false, yes/no | on on macOS | Overrides the CMake MLX_BUILD_METAL setting for local builds. Invalid values fail the build. |
| MLXCEL_BUILD_ACCELERATE | 1/0, on/off, true/false, yes/no | on on macOS | Overrides the CMake MLX_BUILD_ACCELERATE setting for local builds. Invalid values fail the build. |

CUDA builds also use non-MLXCEL_* variables such as CUDA_HOME and MLX_CUDA_ARCHITECTURES; see Installation.

Downloader variables

| Variable | Values | Default | Notes |
| --- | --- | --- | --- |
| MLXCEL_NO_PROGRESS | any non-empty value | unset | Suppresses interactive download progress bars. NO_COLOR and CI=true also suppress bars. |
| MLXCEL_ALLOW_INSECURE_ENDPOINT | any non-empty value | unset | Allows sending a Hugging Face token to a non-HTTPS HF_ENDPOINT. Leave unset outside audited internal mirrors. |
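
The progress-bar rule for MLXCEL_NO_PROGRESS can be sketched as a single predicate. This is an illustration under assumptions: `progress_bars_enabled` is a hypothetical name, NO_COLOR is treated as active when non-empty, CI is matched against the exact string `true`, and terminal detection is not modeled.

```python
def progress_bars_enabled(env):
    """Return False when any documented suppressor is present in env."""
    if env.get("MLXCEL_NO_PROGRESS"):
        return False  # any non-empty value suppresses bars
    if env.get("NO_COLOR"):
        return False  # assumption: non-empty NO_COLOR counts as set
    if env.get("CI") == "true":
        return False  # assumption: exact match on "true"
    return True


quiet = progress_bars_enabled({"MLXCEL_NO_PROGRESS": "1"})
```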

Server prompt-cache variables

These variables are applied when the corresponding CLI flag is absent.

| Variable | Values | Default | Flag equivalent |
| --- | --- | --- | --- |
| MLXCEL_PROMPT_CACHE_ENABLED | boolean | true | --prompt-cache-enabled |
| MLXCEL_PROMPT_CACHE_CAPACITY_BYTES | unsigned integer bytes | 2147483648 | --prompt-cache-capacity-bytes |
| MLXCEL_PROMPT_CACHE_MAX_ENTRIES | unsigned integer | 1024 | --prompt-cache-max-entries |
| MLXCEL_PROMPT_CACHE_TTL | unsigned integer seconds | 3600 | --prompt-cache-ttl |
| MLXCEL_PROMPT_CACHE_MIN_PREFIX | unsigned integer tokens | 32 | --prompt-cache-min-prefix |

MLXCEL_PROMPT_CACHE_ENABLED has higher precedence than the llama.cpp compatibility alias LLAMA_ARG_CACHE_REUSE when both are set and no CLI flag is provided.
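
The resulting lookup order (CLI flag, then MLXCEL_PROMPT_CACHE_ENABLED, then LLAMA_ARG_CACHE_REUSE, then the default) can be sketched as below. `resolve_setting` is a hypothetical helper; it returns the raw string without interpreting it, and treating an empty variable as unset is an assumption.

```python
def resolve_setting(cli_value, env, env_names, default):
    """Return (source, value) from the first source that provides a value."""
    if cli_value is not None:
        return ("cli", cli_value)
    for name in env_names:
        raw = env.get(name)
        if raw:  # assumption: empty strings count as unset
            return (name, raw)
    return ("default", default)


# Order taken from the paragraph above: the MLXCEL_* name outranks the alias.
PROMPT_CACHE_ORDER = ("MLXCEL_PROMPT_CACHE_ENABLED", "LLAMA_ARG_CACHE_REUSE")

env = {"MLXCEL_PROMPT_CACHE_ENABLED": "false", "LLAMA_ARG_CACHE_REUSE": "1"}
from_env = resolve_setting(None, env, PROMPT_CACHE_ORDER, "true")
from_cli = resolve_setting("true", env, PROMPT_CACHE_ORDER, "true")
```

Returning the source alongside the value makes it easy to log which layer supplied a setting when debugging a service unit.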

Speculative-decoding variables

| Variable | Values | Default | Notes |
| --- | --- | --- | --- |
| MLXCEL_DRAFT_KIND | dflash, mtp | auto/none | Alias for --draft-kind when the CLI flag and LLAMA_ARG_DRAFT_KIND are absent. |
| MLXCEL_DRAFT_BLOCK_SIZE | unsigned integer | per drafter (4 for MTP, 16 for DFlash) | Alias for --draft-block-size when the CLI flag and LLAMA_ARG_DRAFT_BLOCK_SIZE are absent. |
| MLXCEL_ENABLE_MTP_B1 | truthy value | off | Advanced. Forces the singleton Gemma 4 MTP burst path for parity/debug testing. |
| MLXCEL_ENABLE_MTP_BATCH | truthy value | off | Advanced. Forces the batched Gemma 4 MTP burst path for parity/debug testing. |
| MLXCEL_ENABLE_MTP_DEFERRED | 1 | off | Advanced. Enables the deferred greedy verifier path for Gemma 4 MTP when sampling settings allow it. |
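
Unlike the prompt-cache switch, the draft variables sit below their LLAMA_ARG_* counterparts. A minimal sketch of the block-size lookup under that reading follows; `resolve_draft_block_size` is a hypothetical name, and the error behavior for non-integer text is an assumption.

```python
# Per-drafter defaults as listed in the table above.
DEFAULT_BLOCK_SIZE = {"mtp": 4, "dflash": 16}


def resolve_draft_block_size(draft_kind, cli_value=None, env=None):
    """Pick the block size: CLI flag, LLAMA_ARG_*, MLXCEL_*, then default."""
    env = env or {}
    if cli_value is not None:
        return cli_value
    for name in ("LLAMA_ARG_DRAFT_BLOCK_SIZE", "MLXCEL_DRAFT_BLOCK_SIZE"):
        raw = env.get(name)
        if raw:
            return int(raw)  # assumption: malformed text raises ValueError
    return DEFAULT_BLOCK_SIZE[draft_kind]


size = resolve_draft_block_size("mtp", env={"MLXCEL_DRAFT_BLOCK_SIZE": "8"})
```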

KV cache and TurboQuant variables

Use CLI flags such as --cache-type-k, --cache-type-v, --kv-cache-mode, --turbo-boundary-v, and the batch KV quantization flags when possible. The variables below are useful for service-level defaults and A/B experiments. See TurboQuant KV cache for the user-facing mode descriptions.

| Variable | Values | Default | Notes |
| --- | --- | --- | --- |
| MLXCEL_KV_BOUNDARY_V_LAYERS | integer count | 2 | Number of first/last layers kept at higher precision for Turbo4-family modes. 0 disables. --turbo-boundary-v writes this value before cache construction and takes precedence. |
| MLXCEL_TURBO_BOUNDARY_V | integer count | fallback alias | Compatibility alias for MLXCEL_KV_BOUNDARY_V_LAYERS; the primary name wins when both are set. |
| MLXCEL_KV_SKIP_LAST_LAYER | boolean | true | Fallback for --kv-skip-last-layer in continuous-batching KV quantization. |
| MLXCEL_SPARSE_V_THRESHOLD | non-negative float | 1e-6 | Sparse-V alive threshold. 0 disables sparse-V; invalid values warn and use the default. |
| MLXCEL_SPARSE_V_KERNEL | falsy disables | enabled on macOS | Allows the fused Sparse-V/dequant Metal kernels. Set 0, false, off, or no to force graph fallback. |
| MLXCEL_TURBO4_DEQUANT_SDPA | falsy disables | on | Controls the dequant-first SDPA path for symmetric Turbo4. |
| MLXCEL_TURBO4_DELEGATED_DEQUANT_SDPA | falsy disables | on | Controls the default dequant-first SDPA path for Turbo4Delegated. |
| MLXCEL_TURBO4_DELEGATED_FUSED | truthy enables | off | Advanced. Enables the older custom fused delegated-kernel route, mainly for comparison when dequant-first SDPA is disabled. |
| MLXCEL_TURBO4_DELEGATED_FP16_FAST_PATH | truthy enables | off | Advanced. Keeps a unified FP16 V working set in delegated mode for speed experiments while maintaining packed sidecars. |
| MLXCEL_TURBO4_DELEGATED_FP16_SIDECARS | predecode, eager, lazy, on-demand | predecode | Sidecar maintenance policy for the delegated FP16 fast path. |
| MLXCEL_ENABLE_DIRECT_PREFILL_CACHE_STORE | presence enables | off | Advanced. Installs the incoming prefill tensor directly as the initial KV cache buffer when applicable. |
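
The boundary-layer setting can be illustrated with a short sketch. Both function names are hypothetical, and the layer selection assumes that a count of N keeps N layers at each end of the stack, which is one plausible reading of "first/last layers" rather than a confirmed detail.

```python
def boundary_v_layers(env, cli_value=None, default=2):
    """Resolve the count: --turbo-boundary-v, primary name, alias, default."""
    if cli_value is not None:
        return cli_value
    for name in ("MLXCEL_KV_BOUNDARY_V_LAYERS", "MLXCEL_TURBO_BOUNDARY_V"):
        raw = env.get(name)
        if raw is not None:
            return int(raw)
    return default


def boundary_layer_indices(num_layers, boundary):
    """Indices kept at higher precision (assumed: `boundary` per end)."""
    if boundary <= 0:
        return set()  # 0 disables the boundary treatment
    head = range(0, min(boundary, num_layers))
    tail = range(max(num_layers - boundary, 0), num_layers)
    return set(head) | set(tail)


env = {"MLXCEL_KV_BOUNDARY_V_LAYERS": "1", "MLXCEL_TURBO_BOUNDARY_V": "3"}
count = boundary_v_layers(env)
kept = boundary_layer_indices(8, count)
```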

Video and local-media variables

These apply to video-capable VLM request handling.

| Variable | Values | Default | Notes |
| --- | --- | --- | --- |
| MLXCEL_VIDEO_DIR_ALLOWLIST | comma-separated directories | unset | Local video_url file paths are rejected unless they resolve under one of these canonicalized directories. Keep directories owner-writable only; group/world-writable entries warn at startup. |
| MLXCEL_VIDEO_MAX_PIXELS | unsigned integer | 16777216 | Rejects source videos whose width × height exceeds the cap. |
| MLXCEL_VIDEO_MAX_DURATION_SEC | float seconds | 600 | Rejects source videos longer than the cap. |
| MLXCEL_VIDEO_MAX_PNG_FRAME_BYTES | unsigned integer bytes | 268435456 | Per-frame cap for the ffmpeg PNG stream splitter. |
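
The allowlist rule for MLXCEL_VIDEO_DIR_ALLOWLIST amounts to a containment check on canonicalized paths. The sketch below shows the general technique with the Python standard library; `is_allowed_video_path` is a hypothetical name and the code is not taken from mlxcel.

```python
import os


def is_allowed_video_path(path, allowlist):
    """True when the resolved path sits under a canonicalized allowlist dir."""
    resolved = os.path.realpath(path)  # collapses ".." and follows symlinks
    for entry in allowlist.split(","):
        entry = entry.strip()
        if not entry:
            continue
        root = os.path.realpath(entry)
        try:
            # commonpath compares whole components, so "/srv/videos-evil"
            # does not count as being under "/srv/videos".
            if os.path.commonpath([resolved, root]) == root:
                return True
        except ValueError:
            continue  # e.g. paths on different drives
    return False


ok = is_allowed_video_path("/srv/videos/clip.mp4", "/srv/videos,/data/media")
```

Comparing resolved paths component by component, rather than with a string prefix test, is what keeps traversal sequences and sibling directories with a shared prefix outside the allowlist.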

Hardware and kernel diagnostic variables

These variables are for profiling, rollback, or experiments. They are not recommended as normal deployment settings.

| Variable | Values | Default | Purpose |
| --- | --- | --- | --- |
| MLXCEL_NO_PADDED_PREFILL | presence disables | auto | Disables M5+/Neural-Accelerator prefill tile alignment. |
| MLXCEL_FORCE_PADDED_PREFILL_MASK | presence enables | off | Forces an explicit padded prefill mask path for debugging. |
| MLXCEL_LOG_NA_ATTENTION | sampled, all, truthy | off | Logs Neural Accelerator attention dispatch decisions. |
| MLXCEL_ENABLE_FUSED_CAUSAL_PREFILL_ATTENTION | presence enables | off | Enables an experimental Llama-family fused causal prefill path when supported. |
| MLXCEL_ENABLE_FUSED_QKV_SPLIT_ROPE | presence enables | off | Enables an experimental fused QKV projection/split/RoPE path. |
| MLXCEL_GEMMA4_ENABLE_FUSED_QKV | presence enables | off | Enables a Gemma 4 fused-QKV projection experiment. |
| MLXCEL_DISABLE_COMPILED_SWITCH_QGEGLU | presence disables compiled path | on when supported | Rolls back Gemma 4 compiled Switch-QGeGLU decode path. |
| MLXCEL_ENABLE_SOFTCAP_GQA_DECODE_GROUPED | any value except 0 enables | off | Enables grouped softcap-GQA decode optimization. |
| MLXCEL_DISABLE_SOFTCAP_GQA_DECODE_GROUPED | 1 disables, 0 enables | unset | Legacy rollback/override for grouped softcap-GQA decode. |
| MLXCEL_DISABLE_SINGLE_QUERY_MASKLESS | truthy disables | maskless path on | Disables the single-query maskless attention path. |
| MLXCEL_EXPERIMENTAL_BOOL_CAUSAL_MASK | truthy enables | off | Enables an experimental boolean causal-mask path. |
| MLXCEL_PIPELINE_GRANULARITY | off, layer, block:N | off | Inserts layer-boundary async-eval hints for pipeline experiments. |
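
The MLXCEL_PIPELINE_GRANULARITY value forms can be parsed as in this sketch. Everything beyond the three documented spellings is an assumption: the function names are hypothetical, N is taken to be a positive integer, malformed input raises, and `hint_boundaries` assumes that block:N places a hint after every N layers.

```python
def parse_pipeline_granularity(raw):
    """Return ('off', None), ('layer', None), or ('block', N)."""
    value = (raw or "off").strip().lower()
    if value in ("off", "layer"):
        return (value, None)
    kind, sep, count = value.partition(":")
    if kind == "block" and sep and count.isdigit() and int(count) > 0:
        return ("block", int(count))
    raise ValueError(f"unrecognized MLXCEL_PIPELINE_GRANULARITY: {raw!r}")


def hint_boundaries(num_layers, granularity):
    """Layer indices after which a hint would go (assumed semantics)."""
    kind, count = granularity
    if kind == "off":
        return []
    step = 1 if kind == "layer" else count
    return [i for i in range(num_layers) if (i + 1) % step == 0]


mode = parse_pipeline_granularity("block:2")
boundaries = hint_boundaries(6, mode)
```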

Logging, profiling, and capture variables

Most of these switches force synchronization or extra graph work and will change throughput measurements. Use them for diagnosis, not capacity planning.

| Variable | Values | Default | Purpose |
| --- | --- | --- | --- |
| MLXCEL_TRACE_DTYPE | presence enables | off | Prints selected tensor dtypes/shapes during generation. |
| MLXCEL_FORCE_SYNC | presence enables | off | Forces synchronous decode evaluation. |
| MLXCEL_PROFILE_PIPELINE | presence enables | off | Emits high-level generation pipeline timing. |
| MLXCEL_PROFILE_PIPELINE_DETAIL | presence enables | off | Adds per-step pipeline timing detail. |
| MLXCEL_PROFILE_BLOCKS | presence enables | off | Emits per-block/model-family timing where implemented. |
| MLXCEL_PROFILE_FORWARD | presence enables | off | Enables model-specific forward profiling where implemented. |
| MLXCEL_PROFILE_QWEN3_MOE_DETAIL | presence enables | off | Profiles Qwen3 MoE internals. |
| MLXCEL_PROFILE_MOE_INNER | presence enables | off | Profiles Gemma 4 MoE sub-operations. |
| MLXCEL_PROFILE_PER_LAYER | presence enables | off | Prints per-layer Gemma 4 timing. |
| MLXCEL_PROFILE_LAYER_BUILD | presence enables | off | Adds Gemma 4 layer-build timing. |
| MLXCEL_PROFILE_LAYER_SUBOPS | presence enables | off | Adds Gemma 4 per-suboperation timing. |
| MLXCEL_EXPORT_DECODE_DOT | file path | unset | Exports the first decode graph pair to DOT. |
| MLXCEL_METAL_CAPTURE_PATH | file path | unset | Starts a Metal capture around steady-state generation; requires MTL_CAPTURE_ENABLED=1. |
| MLXCEL_DEBUG_GEMMA4_LOAD | presence enables | off | Emits Gemma 4 safetensors loading diagnostics. |
| MLXCEL_NO_PRECISION_WARNING | presence suppresses | warning on | Suppresses the bf16-on-Apple-Silicon precision/performance note. |

Test and CI variables

These are intended for the repository's own tests and automation rather than normal end-user operation.

| Variable | Purpose |
| --- | --- |
| MLXCEL_CI_PP_MODEL | Model path used by the pipeline-parallel CI integration test. |
| MLXCEL_SKIP_HEAVY_TESTS | Skips selected heavy tests. |
| MLXCEL_BENCH_DATE | Metadata override for Turbo KV benchmark tests. |
| MLXCEL_BENCH_MACHINE | Metadata override for Turbo KV benchmark tests. |