Architecture overview

mlxcel is a Rust inference runtime that calls MLX through a C++ bridge. The public entry points are intentionally thin: CLI parsing happens at the edge, and model loading, request preparation, scheduling, and MLX operations live in focused modules.

Top-level layout

src/
├── main.rs                      # `mlxcel` CLI schema and subcommand routing
├── bin/mlx_server.rs            # standalone `mlxcel-server` binary
├── commands/                    # CLI subcommand handlers
├── execution/                   # runtime/device and sampling helpers
├── model_metadata.rs            # model-kind and loading-policy descriptors
├── loading/                     # model loading routers and family registries
├── loaded_model.rs              # LoadedModel enum and LanguageModel dispatch
├── loaded_model_capabilities.rs # multimodal capability routing
├── models/                      # text model implementations and detection
├── multimodal/                  # shared multimodal prompt/runtime helpers
├── vision/                      # vision encoders, processors, connectors
├── audio/                       # audio encoder support
├── server/                      # HTTP server, request translation, scheduler
├── distributed/                 # TP/PP/DI config, transports, registries
├── tokenizer/                   # tokenizer loading helpers
├── lora/                        # LoRA adapter loading
└── lib/mlxcel-core/             # MLX C++ FFI crate and low-level generation primitives

mlxcel-core

src/lib/mlxcel-core/ owns the direct MLX bridge and low-level runtime pieces:

  • src/lib/mlxcel-core/src/lib.rs — cxx::bridge definitions and crate exports.
  • src/lib/mlxcel-core/src/cache.rs and src/lib/mlxcel-core/src/cache/ — FP16/INT8/TurboQuant KV cache variants, paged cache layout, detach/adopt helpers, and cache tests.
  • src/lib/mlxcel-core/src/ops.rs, src/lib/mlxcel-core/src/dtype.rs, src/lib/mlxcel-core/src/streams.rs — wrappers around common MLX operations and runtime concepts.
  • src/lib/mlxcel-core/src/sampling.rs — penalties and token sampling shared by CLI/server paths.
  • src/lib/mlxcel-core/src/generate.rs — LanguageModel trait and generation loops.
  • src/lib/mlxcel-core/src/drafter/ and src/lib/mlxcel-core/src/speculative/ — speculative decoding support.
  • src/lib/mlxcel-core/src/layers.rs, src/lib/mlxcel-core/src/weights.rs, src/lib/mlxcel-core/src/utils.rs — model building blocks, SafeTensors loading, masks, and helper operations.
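
To make the "penalties and token sampling" role of sampling.rs concrete, here is a minimal sketch of a repetition penalty followed by greedy selection. It is not the mlxcel-core API: the function names are hypothetical, and it operates on a plain f32 slice rather than MLX arrays. The penalty rule (divide positive logits, multiply negative ones) is the common convention and is assumed here, not taken from the source.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: penalize every token that already appeared.
/// Each distinct token is penalized once, however often it was seen.
fn apply_repetition_penalty(logits: &mut [f32], seen: &[u32], penalty: f32) {
    let unique: BTreeSet<u32> = seen.iter().copied().collect();
    for tok in unique {
        // Token ids outside the vocabulary are ignored.
        if let Some(l) = logits.get_mut(tok as usize) {
            *l = if *l > 0.0 { *l / penalty } else { *l * penalty };
        }
    }
}

/// Hypothetical helper: greedy sampling, i.e. the index of the largest logit.
/// Returns None for an empty slice; NaN entries are skipped.
fn greedy_sample(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &l) in logits.iter().enumerate() {
        if l.is_nan() {
            continue;
        }
        match best {
            Some((_, b)) if l <= b => {}
            _ => best = Some((i, l)),
        }
    }
    best.map(|(i, _)| i as u32)
}
```

Keeping both steps in one shared module means the CLI and server paths cannot drift apart in how they penalize or pick tokens.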

The in-tree MLX source is under src/lib/mlx-cpp/; src/lib/mlxcel-core/build.rs builds the pinned MLX commit and compiles the bridge code.

Loading pipeline

A normal text generation request follows this path:

model path
  → src/models/detection.rs reads config.json and returns ModelType
  → src/model_metadata.rs selects loading policy
  → src/loading/ dispatches to config-backed, non-standard, special, or VLM loader
  → tokenizer is loaded
  → LoadedModel + tokenizer are returned to CLI/server
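
The first three arrows of this path can be sketched as two small lookups. This is an illustration only: the variant names, the model_type strings, and the function names are hypothetical, and the real detection step reads config.json, whereas the sketch takes the model_type string directly.

```rust
/// Hypothetical model kinds; the real ModelType has its own variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelType {
    ExampleText,
    ExampleNonStandard,
    ExampleSpecial,
    ExampleVlm,
}

/// The four loader families named in the pipeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LoaderKind {
    ConfigBacked,
    NonStandard,
    Special,
    Vlm,
}

/// Stand-in for detection: map a model_type string to a ModelType.
fn detect(model_type: &str) -> Option<ModelType> {
    match model_type {
        "example_text" => Some(ModelType::ExampleText),
        "example_nonstandard" => Some(ModelType::ExampleNonStandard),
        "example_special" => Some(ModelType::ExampleSpecial),
        "example_vlm" => Some(ModelType::ExampleVlm),
        _ => None,
    }
}

/// Stand-in for the metadata step: pick a loading policy per model kind.
fn loading_policy(kind: ModelType) -> LoaderKind {
    match kind {
        ModelType::ExampleText => LoaderKind::ConfigBacked,
        ModelType::ExampleNonStandard => LoaderKind::NonStandard,
        ModelType::ExampleSpecial => LoaderKind::Special,
        ModelType::ExampleVlm => LoaderKind::Vlm,
    }
}

/// Detection followed by policy selection; unknown types become an error.
fn route(model_type: &str) -> Result<LoaderKind, String> {
    let kind = detect(model_type)
        .ok_or_else(|| format!("unsupported model_type: {model_type}"))?;
    Ok(loading_policy(kind))
}
```

Separating detection from policy means a new family needs one detection entry and one policy entry, and the router itself stays unchanged.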

Important control surfaces:

  • src/models/detection.rs maps config.json::model_type and related config hints to ModelType.
  • src/model_metadata.rs records whether a family is text or VLM, how it is loaded, and whether adapters are supported.
  • src/loading/config_backed.rs, src/loading/nonstandard.rs, src/loading/special.rs, and src/loading/vlm*.rs contain the loading implementation.
  • src/loaded_model.rs and src/loaded_model_capabilities.rs keep downstream CLI/server code from matching on every concrete model type.
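
The last point, an enum that implements the model trait so callers never match on concrete types, might look roughly like the sketch below. Only the names LoadedModel and LanguageModel come from this document. The trait methods, the toy model structs, and supports_images are assumptions made for illustration.

```rust
/// Assumed shape of the trait; the real LanguageModel lives in mlxcel-core.
trait LanguageModel {
    fn vocab_size(&self) -> usize;
    /// Returns one logit per vocabulary entry (all zeros in this toy).
    fn forward(&mut self, tokens: &[u32]) -> Vec<f32>;
}

/// Hypothetical concrete models standing in for real families.
struct ToyTextModel {
    vocab: usize,
}
struct ToyVisionModel {
    vocab: usize,
}

impl LanguageModel for ToyTextModel {
    fn vocab_size(&self) -> usize {
        self.vocab
    }
    fn forward(&mut self, _tokens: &[u32]) -> Vec<f32> {
        vec![0.0; self.vocab]
    }
}

impl LanguageModel for ToyVisionModel {
    fn vocab_size(&self) -> usize {
        self.vocab
    }
    fn forward(&mut self, _tokens: &[u32]) -> Vec<f32> {
        vec![0.0; self.vocab]
    }
}

/// One enum wraps every concrete model.
enum LoadedModel {
    Text(ToyTextModel),
    Vision(ToyVisionModel),
}

/// The only place that matches on variants for generation.
impl LanguageModel for LoadedModel {
    fn vocab_size(&self) -> usize {
        match self {
            LoadedModel::Text(m) => m.vocab_size(),
            LoadedModel::Vision(m) => m.vocab_size(),
        }
    }
    fn forward(&mut self, tokens: &[u32]) -> Vec<f32> {
        match self {
            LoadedModel::Text(m) => m.forward(tokens),
            LoadedModel::Vision(m) => m.forward(tokens),
        }
    }
}

/// Capability query, kept separate from generation dispatch.
impl LoadedModel {
    fn supports_images(&self) -> bool {
        matches!(self, LoadedModel::Vision(_))
    }
}
```

Downstream code can then call forward or supports_images on a LoadedModel without knowing which family it holds.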

Request paths

mlxcel generate

  1. src/main.rs parses CLI arguments.
  2. src/commands/generate.rs prepares prompt/media inputs and sampling options.
  3. The loading pipeline constructs a LoadedModel.
  4. mlxcel-core runs the decode loop and writes output to stdout.
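
Step 4 can be pictured with a minimal greedy decode loop. This is a sketch under stated assumptions, not the mlxcel-core implementation: the trait method next_logits, the generate signature, and the Counter toy model are all hypothetical, and the loop writes raw token ids where a real runtime would detokenize to text.

```rust
use std::io::{self, Write};

/// Assumed trait shape: next-token logits for the sequence so far.
trait LanguageModel {
    fn next_logits(&mut self, tokens: &[u32]) -> Vec<f32>;
}

/// Index of the largest logit, or None for an empty slice.
fn argmax(logits: &[f32]) -> Option<u32> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
}

/// Greedy decode: stop at `eos` or after `max_new` tokens, streaming
/// each accepted token id to `out` as it is produced.
fn generate<M: LanguageModel, W: Write>(
    model: &mut M,
    prompt: &[u32],
    max_new: usize,
    eos: u32,
    out: &mut W,
) -> io::Result<Vec<u32>> {
    let mut tokens = prompt.to_vec();
    let mut generated = Vec::new();
    for _ in 0..max_new {
        let logits = model.next_logits(&tokens);
        let Some(next) = argmax(&logits) else { break };
        if next == eos {
            break;
        }
        write!(out, "{next} ")?;
        tokens.push(next);
        generated.push(next);
    }
    out.flush()?;
    Ok(generated)
}

/// Toy model over a vocabulary of 8: always predicts (last token + 1) % 8.
struct Counter;

impl LanguageModel for Counter {
    fn next_logits(&mut self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = vec![0.0; 8];
        let next = tokens.last().map_or(0, |t| (t + 1) % 8);
        logits[next as usize] = 1.0;
        logits
    }
}
```

Passing the writer as a parameter lets the same loop target stdout for the CLI or an in-memory buffer in tests.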

mlxcel serve / mlxcel-server

  1. src/main.rs or src/bin/mlx_server.rs parses CLI flags and LLAMA_ARG_* environment-backed options.
  2. src/server/startup.rs resolves startup configuration, loads the model, and builds the Axum application.
  3. src/server/app.rs mounts routes such as /v1/chat/completions, /v1/completions, /v1/responses, /health, and /v1/models.
  4. Route handlers translate requests into internal generation work.
  5. src/server/batch/ schedules batched decode when enabled.
  6. Streaming responses are emitted as SSE frames.
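
For step 6, the wire format of a Server-Sent Events frame is simple enough to show directly. The sketch below follows the SSE format (each payload line prefixed with "data: ", frame terminated by a blank line). The function names are hypothetical, and the "[DONE]" sentinel is the convention used by OpenAI-style streaming APIs; whether mlxcel emits it is an assumption.

```rust
/// Hypothetical helper: wrap a payload in one SSE frame.
/// Multi-line payloads get one "data: " line per payload line.
fn sse_frame(data: &str) -> String {
    let mut frame = String::new();
    for line in data.split('\n') {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    // A blank line terminates the event.
    frame.push('\n');
    frame
}

/// Assumed end-of-stream marker in the OpenAI-compatible style.
fn sse_done() -> String {
    sse_frame("[DONE]")
}
```

A streaming handler would emit one such frame per generated chunk and close the stream after the final marker.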

Platform-specific behavior

  • macOS/Metal and Linux/CUDA behavior is primarily determined by the pinned MLX build under src/lib/mlx-cpp/ and the feature flags passed to Cargo.
  • Apple Silicon runtime/device helpers live in src/lib/mlxcel-core/src/hardware.rs and src/execution/runtime.rs.
  • Custom TurboQuant Metal kernels live under src/lib/mlx-cpp/turbo/ and are called through the C++ bridge.
  • CUDA kernel behavior is mostly inherited from MLX; mlxcel passes the CUDA architecture list through MLX_CUDA_ARCHITECTURES at build time.
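
As a rough illustration of the last point, a build script could forward the architecture list to CMake as a cache define. This is a sketch, not the contents of build.rs: the helper name is hypothetical, and it assumes the list arrives in an environment variable with the same name as the CMake variable.

```rust
/// Hypothetical build-script helper: turn an optional architecture list
/// (for example "80;90") into a CMake `-D` argument.
/// Returns None when the value is unset or blank, so that the MLX
/// default applies.
fn cuda_arch_define(env_value: Option<&str>) -> Option<String> {
    let value = env_value?.trim();
    if value.is_empty() {
        return None;
    }
    Some(format!("-DMLX_CUDA_ARCHITECTURES={value}"))
}
```

A caller would typically pass `std::env::var("MLX_CUDA_ARCHITECTURES").ok().as_deref()` and append the result to the CMake configure arguments when it is present.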

Distributed and multi-device

src/distributed/ contains the shared cluster configuration, transport, registry, metrics, and scheduler infrastructure used by tensor parallelism, pipeline parallelism, and disaggregated inference experiments. See distributed inference for the operator-facing summary.

Further reading