MLX Ring Pipeline Memory Investigation
The dnet ring pipeline fails to run Llama-3.3 70B Q4 on 32 GB + 24 GB (56 GB total) with k=1 (one ring round per token), while distributed-llama (tensor parallel + Q80 pipes) fits on 2×24 GB (48 GB). This task instruments, tests, and narrows down the root causes (inter-stage dtype, KV cache, embedding/LM-head placement, lack of tensor parallelism, pools) and proposes concrete fixes.
Background
Distributed Llama (DLlama)
- Tensor-parallel sharding for all large matrices per layer (Q/K/V, W1/W3 row-split; W2/logits column-split)
- Inter-node pipes carry Q80 activations; matmuls are Q80×Q40→F32; KV heads are sharded across nodes
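For orientation, here is a minimal numpy sketch of the kind of split described above (Megatron-style; tiny illustrative shapes, not DLlama's actual code). Each node holds a slice of every large matrix, so no single node ever materializes a full layer:
```python
# Minimal sketch of the MLP split: W1/W3 are split along the intermediate
# (d_ff) dimension, W2 along its input dimension, and the partial outputs
# are summed (stands in for an all-reduce). Shapes are illustrative only.
import numpy as np

d_model, d_ff, n_nodes = 64, 256, 2
rng = np.random.default_rng(0)

x = rng.standard_normal((1, d_model))
W1 = rng.standard_normal((d_model, d_ff))   # gate projection
W3 = rng.standard_normal((d_model, d_ff))   # up projection
W2 = rng.standard_normal((d_ff, d_model))   # down projection

def silu(v):
    return v / (1.0 + np.exp(-v))

ref = (silu(x @ W1) * (x @ W3)) @ W2        # unsharded reference

partials = []
for W1_s, W3_s, W2_s in zip(np.split(W1, n_nodes, axis=1),
                            np.split(W3, n_nodes, axis=1),
                            np.split(W2, n_nodes, axis=0)):
    h = silu(x @ W1_s) * (x @ W3_s)         # each node's slice of d_ff
    partials.append(h @ W2_s)               # partial contribution to the output
out = sum(partials)                          # stands in for an all-reduce

assert np.allclose(ref, out)
```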
DNET
| Component | Current Behavior | References |
|---|---|---|
| Inter-stage dtype | fp16 by default (TransportConfig.wire_dtype; the runtime forces casts). Compression exists (qsparse8_v1) but the ring path doesn't use it. | dnet/shard/config.py, dnet/shard/runtime.py, dnet/shard/codec.py, dnet/core/tensor.py, dnet/compression/wire.py |
| KV cache | Defaults to fp16 unless the API explicitly sets 8-bit/4-bit | dnet/shard/runtime.py (load_model_core kv_bits handling) |
| Parallelism | No tensor sharding; pure pipeline parallelism via assigned_layers | dnet/api/strategies/ring.py, dnet/shard/runtime.py |
| Embedding/LM head | Embedding and LM head are split across the edge stages; dtype follows the checkpoint (can be dense) | dnet/utils/model.py (load_embeddings, load_lm_head) |
| Pools | Input/output pools default to 512 MB each per shard | dnet/shard/config.py |
Hypotheses
| ID | Hypothesis |
|---|---|
| H1 | Inter-stage fp16 doubles activation traffic and buffer footprint vs Q8, pushing peak stage memory over 24 GB |
| H2 | Embedding/LM head reside unsharded on edge stages; if dense, they spike stage memory |
| H3 | Pools and stream buffers add nontrivial overhead on tight 24 GB stages |
| H4 | Absence of TP concentrates full layer weights on a stage; DLlama's TP flattens peaks |
Goals
- Produce a stage-wise memory and comm budget that explains the OOM (see the rough budget sketch after this list)
- Validate which of the hypotheses (H1–H4) materially impact peak stage memory
- Deliver minimal changes to get closer to DLlama's footprint (or document why TP is required)
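As a starting point before instrumentation, a back-of-the-envelope budget shows where the obvious memory goes and how much is left unexplained. Every constant below is an assumption (Llama-3 70B class: 80 layers, 8 KV heads, head_dim 128, ~4.5 effective bits/weight for Q4 with group scales, layer split roughly proportional to device RAM); the real numbers must come from the experiments.
```python
# Back-of-the-envelope per-stage budget for a 2-stage ring split of a 70B Q4
# model. Every constant here is an assumption for illustration, not a measurement.
GiB = 1024**3

n_layers, n_kv_heads, head_dim, seq_len = 80, 8, 128, 4096
params_total = 70e9
bits_per_weight = 4.5                          # 4-bit weights + group scales (assumed)

weights_total = params_total * bits_per_weight / 8 / GiB             # ~36.7 GiB
kv_total = n_layers * 2 * n_kv_heads * head_dim * seq_len * 2 / GiB  # fp16 K+V, ~1.25 GiB
pools_per_stage = (512 + 512) / 1024           # input + output pools, 512 MB each

for name, share in [("stage0 (32 GB)", 32 / 56), ("stage1 (24 GB)", 24 / 56)]:
    stage = (weights_total + kv_total) * share + pools_per_stage
    print(f"{name}: ~{stage:.1f} GiB in weights+KV+pools, before activations, "
          f"embedding/LM head, wire buffers, and MLX/Metal overhead")
```
On paper the quantized weights plus fp16 KV fit under both limits, so the budget needs to pin down what pushes the 24 GB stage over the top (dense embedding/LM head, wire buffers, pools, allocator/wired-memory overhead).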
Experiments
E1: Inter-stage dtype
| Variant | Description |
|---|---|
| A | Baseline fp16 wire (current) |
| B | Enable qsparse8_v1 by default; compare peak memory + bytes/token |
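For scale, the per-hop activation payload can be estimated up front (d_model 8192 and a group size of 32 are assumptions; qsparse8_v1's actual layout may differ):
```python
# Approximate bytes per token per inter-stage hop: fp16 wire vs an 8-bit wire.
# d_model and the 8-bit group size are assumptions; qsparse8_v1 may pack differently.
d_model, group_size = 8192, 32

fp16_bytes = d_model * 2
q8_bytes = d_model + (d_model // group_size) * 2   # int8 values + one fp16 scale per group

print(f"fp16 wire : {fp16_bytes} B/token/hop")
print(f"8-bit wire: {q8_bytes} B/token/hop ({fp16_bytes / q8_bytes:.2f}x smaller)")
```
Per H1, the number to watch is the matching reduction in send/receive buffer footprint, not just bytes on the wire.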
E2: Pools
Reduce pools from 512 MB → 128 MB → 64 MB; measure headroom.
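A minimal way to capture headroom at each pool setting is to log MLX's allocator counters around a decode step. A sketch, with caveats: older MLX exposes these counters under mx.metal, newer releases also expose mx.get_peak_memory()/mx.get_active_memory() at the top level, and run_one_token() below is a hypothetical stand-in for one pipeline step on the shard.
```python
# Sketch: log MLX peak/active memory per stage while sweeping pool sizes (E2).
import mlx.core as mx

def log_memory(tag: str) -> None:
    # Counters live under mx.metal in older MLX releases; newer releases also
    # expose mx.get_peak_memory()/mx.get_active_memory() at the top level.
    peak = mx.metal.get_peak_memory() / 1024**3
    active = mx.metal.get_active_memory() / 1024**3
    print(f"[{tag}] peak={peak:.2f} GiB  active={active:.2f} GiB")

# Intended placement inside the shard's decode loop (run_one_token is hypothetical):
# mx.metal.reset_peak_memory()
# run_one_token()
# log_memory("pools=128MB")
```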
E3: Embedding/LM head
- Verify dtype (dense vs quantized)
- If dense, attempt quantized tensors (if present) or relocate embedding to larger device
- Shard LM head by vocab for sampling
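A minimal numpy sketch of vocab-sharded greedy sampling (tiny illustrative shapes, not dnet's current layout): each shard scores its slice of the vocabulary and only a per-shard (token, logit) pair has to leave the device.
```python
# Sketch: shard the LM head along the vocab dimension so no stage holds the
# full head, and sampling needs only one (token, logit) pair per shard.
import numpy as np

d_model, vocab, n_shards = 256, 4096, 2
rng = np.random.default_rng(0)

h = rng.standard_normal(d_model)                  # final hidden state
W_head = rng.standard_normal((vocab, d_model))    # full LM head (reference only)

ref_token = int(np.argmax(W_head @ h))            # unsharded greedy choice

candidates, offset = [], 0
for W_slice in np.array_split(W_head, n_shards, axis=0):
    logits = W_slice @ h                          # logits for this vocab slice
    local = int(np.argmax(logits))
    candidates.append((offset + local, float(logits[local])))
    offset += W_slice.shape[0]

token = max(candidates, key=lambda c: c[1])[0]    # global argmax from shard candidates
assert token == ref_token
```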
Test Matrix
| Parameter | Values |
|---|---|
| Hardware | MacBook Pro M1 Max 32 GB + Mac mini M4 Pro 24 GB (or 2×24 GB) |
| Model | Llama-3.1 70B Q4, seq_len=4096, batch=1 |
| Wire | fp16 vs qsparse8_v1 |
| KV | fp16 vs 8-bit |
| Pools | 512 vs 128 vs 64 MB |
| Embed/Head | Dense vs quantized (if available), placement |