mlxcel is a Rust inference runtime that calls MLX through a C++ bridge. The
public entry points are intentionally thin: CLI parsing happens at the edge, and
model loading, request preparation, scheduling, and MLX operations live in
focused modules.
src/
├── main.rs # `mlxcel` CLI schema and subcommand routing
├── bin/mlx_server.rs # standalone `mlxcel-server` binary
├── commands/ # CLI subcommand handlers
├── execution/ # runtime/device and sampling helpers
├── model_metadata.rs # model-kind and loading-policy descriptors
├── loading/ # model loading routers and family registries
├── loaded_model.rs # LoadedModel enum and LanguageModel dispatch
├── loaded_model_capabilities.rs # multimodal capability routing
├── models/ # text model implementations and detection
├── multimodal/ # shared multimodal prompt/runtime helpers
├── vision/ # vision encoders, processors, connectors
├── audio/ # audio encoder support
├── server/ # HTTP server, request translation, scheduler
├── distributed/ # TP/PP/DI config, transports, registries
├── tokenizer/ # tokenizer loading helpers
├── lora/ # LoRA adapter loading
└── lib/mlxcel-core/ # MLX C++ FFI crate and low-level generation primitives
`src/lib/mlxcel-core/` owns the direct MLX bridge and low-level runtime pieces:

- `src/lib/mlxcel-core/src/lib.rs`: `cxx::bridge` definitions and crate exports.
- `src/lib/mlxcel-core/src/cache.rs` and `src/lib/mlxcel-core/src/cache/`: FP16/INT8/TurboQuant KV cache variants, paged cache layout, detach/adopt helpers, and cache tests.
- `src/lib/mlxcel-core/src/ops.rs`, `src/lib/mlxcel-core/src/dtype.rs`, `src/lib/mlxcel-core/src/streams.rs`: wrappers around common MLX operations and runtime concepts.
- `src/lib/mlxcel-core/src/sampling.rs`: penalties and token sampling shared by CLI/server paths.
- `src/lib/mlxcel-core/src/generate.rs`: `LanguageModel` trait and generation loops.
- `src/lib/mlxcel-core/src/drafter/` and `src/lib/mlxcel-core/src/speculative/`: speculative decoding support.
- `src/lib/mlxcel-core/src/layers.rs`, `src/lib/mlxcel-core/src/weights.rs`, `src/lib/mlxcel-core/src/utils.rs`: model building blocks, SafeTensors loading, masks, and helper operations.

The in-tree MLX source is under `src/lib/mlx-cpp/`; `src/lib/mlxcel-core/build.rs` builds the pinned
MLX commit and compiles the bridge code.
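To make "penalties and token sampling" concrete, here is a minimal sketch of a repetition penalty followed by greedy selection. It works on plain `f32` slices rather than MLX arrays, and the names `apply_repetition_penalty` and `argmax` are hypothetical; the real `sampling.rs` may expose a different API and other penalty types.

```rust
use std::collections::HashSet;

/// Hypothetical helper: lower the score of tokens that were already generated.
/// Positive logits are divided by `penalty` and non-positive logits are
/// multiplied by it, so a penalty above 1.0 always reduces the logit.
pub fn apply_repetition_penalty(logits: &mut [f32], seen: &[usize], penalty: f32) {
    let mut handled = HashSet::new();
    for &token in seen {
        // Skip out-of-range ids and ids that were already penalized once.
        if token >= logits.len() || !handled.insert(token) {
            continue;
        }
        let logit = logits[token];
        logits[token] = if logit > 0.0 { logit / penalty } else { logit * penalty };
    }
}

/// Greedy sampling: index of the largest logit, or `None` for an empty slice.
/// Assumes finite logits; ties resolve to the lowest index.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (index, &value) in logits.iter().enumerate() {
        if best.map_or(true, |(_, top)| value > top) {
            best = Some((index, value));
        }
    }
    best.map(|(index, _)| index)
}
```

With logits `[1.0, 4.0, -2.0, 3.0]`, seen tokens `[1, 2, 1]`, and a penalty of `2.0`, token 1 drops from 4.0 to 2.0 and token 2 from -2.0 to -4.0, so greedy selection moves from token 1 to token 3.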
A normal text generation request follows this path:
model path
→ src/models/detection.rs reads config.json and returns ModelType
→ src/model_metadata.rs selects loading policy
→ src/loading/ dispatches to config-backed, non-standard, special, or VLM loader
→ tokenizer is loaded
→ LoadedModel + tokenizer are returned to CLI/server
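The first two steps of this path can be sketched as a pair of pure functions: one maps the `model_type` string from `config.json` to a `ModelType`, and one maps that type to a loading policy. The variants, the `LoaderKind` and `LoadingPolicy` types, and the adapter flags below are illustrative assumptions, not the contents of `detection.rs` or `model_metadata.rs`. JSON parsing is left out to keep the example dependency-free.

```rust
/// Hypothetical subset of model families; the real enum is much larger.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelType {
    Llama,
    Qwen2,
    Llava,
}

/// Loader categories, named after the four loader kinds in the path above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoaderKind {
    ConfigBacked,
    NonStandard,
    Special,
    Vlm,
}

/// Hypothetical policy record: which loader to use and whether adapters apply.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LoadingPolicy {
    pub loader: LoaderKind,
    pub supports_adapters: bool,
}

/// Map the `model_type` value read from `config.json` to a `ModelType`.
pub fn detect_model_type(model_type: &str) -> Option<ModelType> {
    match model_type.trim().to_ascii_lowercase().as_str() {
        "llama" => Some(ModelType::Llama),
        "qwen2" => Some(ModelType::Qwen2),
        "llava" => Some(ModelType::Llava),
        _ => None,
    }
}

/// Select a loading policy for a detected model type (illustrative values).
pub fn loading_policy(kind: ModelType) -> LoadingPolicy {
    match kind {
        ModelType::Llama | ModelType::Qwen2 => LoadingPolicy {
            loader: LoaderKind::ConfigBacked,
            supports_adapters: true,
        },
        ModelType::Llava => LoadingPolicy {
            loader: LoaderKind::Vlm,
            supports_adapters: false,
        },
    }
}
```

Keeping detection and policy separate means a new family needs one new match arm in each function, and the loader dispatch can stay a match on `LoaderKind`.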
Important control surfaces:
- `src/models/detection.rs` maps `config.json::model_type` and related config hints to `ModelType`.
- `src/model_metadata.rs` records whether a family is text or VLM, how it is loaded, and whether adapters are supported.
- `src/loading/config_backed.rs`, `src/loading/nonstandard.rs`, `src/loading/special.rs`, and `src/loading/vlm*.rs` contain the loading implementation.
- `src/loaded_model.rs` and `src/loaded_model_capabilities.rs` keep downstream CLI/server code from matching on every concrete model type.
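The last point describes an enum-dispatch pattern. A minimal sketch, assuming a much smaller `LanguageModel` trait than the real one and two made-up model types (`TinyText`, `TinyVlm`), could look like this:

```rust
/// Simplified stand-in for the `LanguageModel` trait (assumed methods).
pub trait LanguageModel {
    fn name(&self) -> &'static str;
    fn vocab_size(&self) -> usize;
}

/// Hypothetical concrete models.
pub struct TinyText;
pub struct TinyVlm;

impl LanguageModel for TinyText {
    fn name(&self) -> &'static str { "tiny-text" }
    fn vocab_size(&self) -> usize { 16 }
}

impl LanguageModel for TinyVlm {
    fn name(&self) -> &'static str { "tiny-vlm" }
    fn vocab_size(&self) -> usize { 32 }
}

/// One variant per loaded model family.
pub enum LoadedModel {
    Text(TinyText),
    Vlm(TinyVlm),
}

impl LoadedModel {
    /// The only place that matches on every concrete model type.
    fn inner(&self) -> &dyn LanguageModel {
        match self {
            LoadedModel::Text(model) => model,
            LoadedModel::Vlm(model) => model,
        }
    }

    /// Capability query, kept separate from the trait dispatch.
    pub fn supports_images(&self) -> bool {
        matches!(self, LoadedModel::Vlm(_))
    }
}

impl LanguageModel for LoadedModel {
    fn name(&self) -> &'static str { self.inner().name() }
    fn vocab_size(&self) -> usize { self.inner().vocab_size() }
}

/// Downstream code depends only on the trait, not on concrete types.
pub fn describe(model: &dyn LanguageModel) -> String {
    format!("{} (vocab {})", model.name(), model.vocab_size())
}
```

Callers such as `describe` keep compiling unchanged when a new variant is added; only `inner` and the capability helpers need a new arm.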
The CLI generation path:

- `src/main.rs` parses CLI arguments.
- `src/commands/generate.rs` prepares prompt/media inputs and sampling options.
- The loading pipeline constructs a `LoadedModel`.
- `mlxcel-core` runs the decode loop and writes output to stdout.
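The decode loop in the last step can be illustrated with a toy greedy loop. Everything here is an assumption made for the sketch: the `NextTokenLogits` trait, the `greedy_decode` function, and the `CountingModel` stand-in are not `mlxcel-core` APIs. The sketch also returns the generated token ids instead of detokenizing and writing to stdout.

```rust
/// Hypothetical minimal model interface: logits for the next token.
pub trait NextTokenLogits {
    fn next_logits(&mut self, tokens: &[u32]) -> Vec<f32>;
}

/// Greedy decode: repeatedly pick the highest-scoring token until the
/// end-of-sequence id appears or `max_new_tokens` tokens were produced.
pub fn greedy_decode<M: NextTokenLogits>(
    model: &mut M,
    prompt: &[u32],
    eos: u32,
    max_new_tokens: usize,
) -> Vec<u32> {
    let mut context = prompt.to_vec();
    let mut generated = Vec::new();
    for _ in 0..max_new_tokens {
        let logits = model.next_logits(&context);
        // Argmax over the logits; ties resolve to the lowest index.
        let mut best: Option<(usize, f32)> = None;
        for (index, &value) in logits.iter().enumerate() {
            if best.map_or(true, |(_, top)| value > top) {
                best = Some((index, value));
            }
        }
        let token = match best {
            Some((index, _)) => index as u32,
            None => break, // empty vocabulary: nothing to sample
        };
        if token == eos {
            break;
        }
        context.push(token);
        generated.push(token);
    }
    generated
}

/// Toy model that always predicts `(last token + 1) % vocab_size`.
pub struct CountingModel {
    pub vocab_size: usize,
}

impl NextTokenLogits for CountingModel {
    fn next_logits(&mut self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = vec![0.0; self.vocab_size];
        if self.vocab_size > 0 {
            let next = tokens.last().map_or(0, |&last| (last as usize + 1) % self.vocab_size);
            logits[next] = 1.0;
        }
        logits
    }
}
```

A real loop would also carry a KV cache between steps and apply the configured sampling options instead of a plain argmax.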
The server path:

- `src/main.rs` or `src/bin/mlx_server.rs` parses CLI flags and `LLAMA_ARG_*` environment-backed options.
- `src/server/startup.rs` resolves startup configuration, loads the model, and builds the Axum application.
- `src/server/app.rs` mounts routes such as `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, `/health`, and `/v1/models`.
- Route handlers translate requests into internal generation work.
- `src/server/batch/` schedules batched decode when enabled.
- Streaming responses are emitted as SSE frames.
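For readers unfamiliar with the wire format in the last step, the following sketch shows how a payload becomes a Server-Sent Events frame. The helper names `sse_frame` and `sse_done` are hypothetical, and the `[DONE]` sentinel is an assumption based on the OpenAI-style streaming convention, not something confirmed for this server. In an Axum application the framing would normally be handled by the framework's SSE response type rather than by hand.

```rust
/// Format one SSE frame: every payload line becomes a `data:` field, and a
/// blank line terminates the event.
pub fn sse_frame(payload: &str) -> String {
    let mut frame = String::new();
    for line in payload.split('\n') {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    // The empty line marks the end of the event.
    frame.push('\n');
    frame
}

/// Assumed end-of-stream marker in the OpenAI streaming style.
pub fn sse_done() -> String {
    sse_frame("[DONE]")
}
```

Splitting on newlines matters because a raw newline inside a single `data:` field would otherwise end the field early; JSON chunks are usually single-line, so they map to exactly one `data:` line.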
Platform notes:

- macOS/Metal and Linux/CUDA behavior is primarily determined by the pinned MLX build under `src/lib/mlx-cpp/` and the feature flags passed to Cargo.
- Apple Silicon runtime/device helpers live in `src/lib/mlxcel-core/src/hardware.rs` and `src/execution/runtime.rs`.
- Custom TurboQuant Metal kernels live under `src/lib/mlx-cpp/turbo/` and are called through the C++ bridge.
- CUDA kernel behavior is mostly inherited from MLX; `mlxcel` passes the CUDA architecture list through `MLX_CUDA_ARCHITECTURES` at build time.
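As a rough illustration of the last point, a build script could turn the architecture list into a CMake define. This is a sketch under assumptions: the helper `cuda_arch_define` is hypothetical, and both the use of an environment variable as the input and the `-D` define as the output are guesses about how `build.rs` forwards the value.

```rust
/// Build a CMake cache define from an optional architecture list.
/// Returns `None` when the value is missing or blank, so the MLX build
/// can fall back to its own default.
pub fn cuda_arch_define(raw: Option<&str>) -> Option<String> {
    let value = raw?.trim();
    if value.is_empty() {
        return None;
    }
    // The value is forwarded verbatim; no validation of architecture names.
    Some(format!("-DMLX_CUDA_ARCHITECTURES={value}"))
}
```

Inside a build script this might be called as `cuda_arch_define(std::env::var("MLX_CUDA_ARCHITECTURES").ok().as_deref())`, with the result appended to the CMake arguments when it is `Some`.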
`src/distributed/` contains the shared cluster configuration, transport,
registry, metrics, and scheduler infrastructure used by tensor parallelism,
pipeline parallelism, and disaggregated inference experiments. See
distributed inference for the operator-facing summary.