
Investigate Why Ringed Pipeline favors k>1 at 70B Q4 #73

@andthattoo

Description


MLX Ring Pipeline Memory Investigation

The dnet ring pipeline fails to run Llama-3.3 70B Q4 on 32 GB + 24 GB (56 GB total) with k=1 (one ring round per token), while distributed-llama (tensor parallel + Q80 pipes) fits the same class of model on 2×24 GB (48 GB). This task instruments and tests the pipeline to narrow down the root causes (inter-stage dtype, KV cache, embedding/LM-head placement, lack of tensor parallelism, pools) and proposes concrete fixes.
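For orientation, a back-of-envelope per-stage budget (a sketch assuming standard Llama-70B dimensions and ~4.5 effective bits/weight for Q4 with group scales; the exact figures should be read off the real checkpoint):

```python
# Back-of-envelope budget for a 2-stage ring split of Llama-70B Q4.
# Dimensions are standard Llama-70B values; 4.5 bits/weight (4-bit weights +
# group scales) and fp16 KV are assumptions to confirm against the checkpoint.
GiB = 1024 ** 3

n_layers, hidden = 80, 8192
n_kv_heads, head_dim = 8, 128
ffn_dim, seq_len = 28_672, 4096

params_per_layer = (
    hidden * hidden                          # wq
    + 2 * hidden * n_kv_heads * head_dim     # wk, wv (GQA)
    + hidden * hidden                        # wo
    + 3 * hidden * ffn_dim                   # w1, w2, w3
)
layer_q4_gib = params_per_layer * 4.5 / 8 / GiB
kv_fp16_gib  = 2 * n_layers * n_kv_heads * head_dim * seq_len * 2 / GiB

print(f"one decoder layer (Q4):      {layer_q4_gib:.2f} GiB")
print(f"e.g. 40 layers on one stage: {40 * layer_q4_gib:.1f} GiB")
print(f"fp16 KV cache at 4k context: {kv_fp16_gib:.2f} GiB across all 80 layers")
```

Roughly 0.45 GiB per Q4 decoder layer plus ~1.3 GiB of fp16 KV at 4k context already leaves little slack on a 24 GB stage once pools, buffers, and a possibly dense embedding/LM head are added on top.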

Background

Distributed Llama (DLlama)

  • Tensor-parallel sharding of all large matrices in every layer (Q/K/V and W1/W3 row-sharded; W2 and the logits projection column-sharded)
  • Inter-node pipes carry Q80 activations; matmuls are Q80×Q40→F32; KV heads are sharded across nodes
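For reference, a toy numpy sketch of the row/column split (illustrative only; ReLU stands in for the gated SiLU and the W3 gate is omitted, so this is not DLlama's actual kernel layout):

```python
# Toy 2-way tensor-parallel MLP. W1 is split along its output (ffn) dimension,
# W2 along its input (ffn) dimension, so each node stores half of each matrix
# and only the partial outputs need an all-reduce.
import numpy as np

hidden, ffn, world = 512, 2048, 2
x  = np.random.randn(1, hidden)
W1 = np.random.randn(hidden, ffn)
W2 = np.random.randn(ffn, hidden)

def mlp_shard(rank):
    cols = ffn // world
    w1 = W1[:, rank * cols:(rank + 1) * cols]   # this node's slice of W1
    w2 = W2[rank * cols:(rank + 1) * cols, :]   # matching slice of W2
    return np.maximum(x @ w1, 0.0) @ w2         # partial sum of the full output

# summing (all-reducing) the partials reproduces the unsharded result
assert np.allclose(sum(mlp_shard(r) for r in range(world)),
                   np.maximum(x @ W1, 0.0) @ W2)
print("2-way TP matches the unsharded MLP")
```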

DNET

| Component | Current behavior | References |
| --- | --- | --- |
| Inter-stage dtype | fp16 by default (TransportConfig.wire_dtype; the runtime forces casts). Compression exists (qsparse8_v1) but the ring path doesn't use it. | dnet/shard/config.py, dnet/shard/runtime.py, dnet/shard/codec.py, dnet/core/tensor.py, dnet/compression/wire.py |
| KV cache | Defaults to fp16 unless the API explicitly sets 8-bit/4-bit. | dnet/shard/runtime.py (load_model_core kv_bits handling) |
| Parallelism | No tensor sharding; pure pipeline parallelism via assigned_layers. | dnet/api/strategies/ring.py, dnet/shard/runtime.py |
| Embedding/LM head | Embedding and LM head are split across the edge stages; dtype follows the checkpoint (can be dense). | dnet/utils/model.py (load_embeddings, load_lm_head) |
| Pools | Input/output pools default to 512 MB each per shard. | dnet/shard/config.py |
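To check the "dtype follows checkpoint" row (and E3's first bullet) without loading any weights, the safetensors header can be read directly. The path is a placeholder and the "U32 plus companion .scales/.biases tensors means quantized" heuristic follows MLX conventions but is an assumption to double-check:

```python
# List tensor dtypes/shapes in a shard checkpoint without loading the data:
# a safetensors file starts with a little-endian u64 header length followed
# by a JSON header describing every tensor.
import json, struct, sys

DTYPE_BYTES = {"F32": 4, "F16": 2, "BF16": 2, "U32": 4, "I32": 4, "U8": 1}

def list_tensors(path: str) -> None:
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    for name, meta in header.items():
        if name == "__metadata__":
            continue
        n = 1
        for d in meta["shape"]:
            n *= d
        mib = n * DTYPE_BYTES.get(meta["dtype"], 2) / 2**20
        # F16/BF16 on the embedding or lm_head tensors means they are still dense
        print(f"{name:60s} {meta['dtype']:5s} {str(meta['shape']):>20s} ~{mib:8.1f} MiB")

if __name__ == "__main__":
    list_tensors(sys.argv[1])   # path/to/model-shard.safetensors
```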

Hypotheses

| ID | Hypothesis |
| --- | --- |
| H1 | Inter-stage fp16 doubles activation traffic and buffer footprint vs Q8, pushing peak stage memory over 24 GB. |
| H2 | Embedding/LM head reside unsharded on the edge stages; if dense, they spike stage memory. |
| H3 | Pools and stream buffers add nontrivial overhead on tight 24 GB stages. |
| H4 | Absence of TP concentrates full layer weights on a single stage; DLlama's TP flattens the peaks. |
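Rough numbers behind H2/H3, assuming Llama-70B dimensions (vocab 128256, hidden 8192), an untied LM head, and the 512 MB pool default listed above:

```python
# Quick sizing for H2 (dense embedding/LM head) and H3 (pool overhead).
GiB = 1024 ** 3
vocab, hidden = 128_256, 8192

dense_embed = vocab * hidden * 2 / GiB     # fp16 embedding table
dense_head  = vocab * hidden * 2 / GiB     # fp16 LM head, if untied
pools       = 2 * 512 * 1024**2 / GiB      # input + output pool per shard

print(f"H2: dense embedding + LM head ≈ {dense_embed + dense_head:.1f} GiB on the edge stages")
print(f"H3: default pools reserve ≈ {pools:.1f} GiB per shard before any weights load")
```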

Goals

  • Produce a stage-wise memory and comm budget that explains the OOM
  • Validate which factors (H1–H4) materially impact peak stage memory
  • Deliver minimal changes to get closer to DLlama's footprint (or document why TP is required)
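A minimal probe for the stage-wise budget could look like the sketch below (not dnet's existing instrumentation). mx.metal.get_peak_memory/reset_peak_memory are available in recent MLX releases, though the exact names may differ across versions; ru_maxrss is kept as a version-independent fallback.

```python
# Sketch of a per-stage memory probe for each shard process.
import resource
import mlx.core as mx

GiB = 1024 ** 3

def snapshot(tag: str) -> None:
    """Log MLX active/peak GPU memory plus the process RSS high-water mark."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / GiB  # bytes on macOS
    print(f"[{tag}] mlx_active={mx.metal.get_active_memory() / GiB:.2f} GiB "
          f"mlx_peak={mx.metal.get_peak_memory() / GiB:.2f} GiB "
          f"rss_peak={rss:.2f} GiB")

# intended use inside a shard:
snapshot("after model load")
mx.metal.reset_peak_memory()
# ... run prefill / decode for one request ...
snapshot("after prefill")
```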

Experiments

E1: Inter-stage dtype

| Variant | Description |
| --- | --- |
| A | Baseline fp16 wire (current). |
| B | Enable qsparse8_v1 by default; compare peak memory and bytes/token. |
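Expected wire traffic per hop under the two variants (hidden size 8192 assumed; the ~6% overhead for q8 group scales/metadata is a guess and should be checked against the framing in dnet/compression/wire.py):

```python
# Activation traffic per pipeline hop: decode vs full-prompt prefill.
hidden, prefill_tokens = 8192, 4096
KiB, MiB = 1024, 1024**2

fp16_decode  = hidden * 2 / KiB                    # one hidden vector per new token
q8_decode    = hidden * 1.06 / KiB                 # ~6% assumed scale/metadata overhead
fp16_prefill = prefill_tokens * hidden * 2 / MiB   # whole prompt in one shot
q8_prefill   = prefill_tokens * hidden * 1.06 / MiB

print(f"decode:  {fp16_decode:.0f} KiB/token fp16  vs ~{q8_decode:.1f} KiB/token q8")
print(f"prefill: {fp16_prefill:.0f} MiB/hop  fp16  vs ~{q8_prefill:.0f} MiB/hop  q8")
```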

E2: Pools

Reduce pools from 512 MB → 128 MB → 64 MB; measure headroom.

E3: Embedding/LM head

  • Verify the embedding/LM-head dtype (dense vs quantized)
  • If dense, use quantized tensors if the checkpoint provides them, or relocate the embedding to the larger device
  • Shard the LM head by vocab for sampling (see the sketch below)
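For the last bullet, a toy sketch of greedy sampling with a vocab-sharded LM head (numpy, reduced sizes; names are illustrative and not dnet APIs):

```python
# Greedy sampling with a vocab-sharded LM head (2 shards). Each shard holds
# rows [lo:hi) of the head, scores only its vocab slice, and the per-shard
# winners are reduced, so no device ever materializes the full logit row.
import numpy as np

vocab, hidden, shards = 32_000, 1024, 2
W_head = np.random.randn(vocab, hidden) * 0.02
h = np.random.randn(hidden)

def local_top1(rank):
    lo, hi = rank * vocab // shards, (rank + 1) * vocab // shards
    logits = W_head[lo:hi] @ h
    j = int(np.argmax(logits))
    return logits[j], lo + j          # (best local logit, global token id)

best_logit, token = max(local_top1(r) for r in range(shards))
assert token == int(np.argmax(W_head @ h))   # agrees with the unsharded argmax
print("next token id:", token)
```

Temperature sampling needs one extra exchange (each shard's max and sum of exponentials) to normalize the softmax, but the full logit row still never has to live on a single device.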

Test Matrix

| Parameter | Values |
| --- | --- |
| Hardware | MacBook Pro M1 Max 32 GB + Mac mini M4 Pro 24 GB (or 2×24 GB) |
| Model | Llama-3.1 70B Q4, seq_len=4096, batch=1 |
| Wire | fp16 vs qsparse8_v1 |
| KV | fp16 vs 8-bit |
| Pools | 512 vs 128 vs 64 MB |
| Embed/Head | Dense vs quantized (if available), placement |

Metadata

Labels: performance (Performance optimizations and resource efficiency)
