Description
I followed the script below to convert the Kimi-K2-Thinking vanilla model to MXFP4:
```shell
LOCAL_QUARK_DIR=/data/users/kadvani/gitrepos/Quark
MODEL_INPUT_DIR=/data/users/kadvani/models/Kimi-K2-Thinking-BF16/
MODEL_OUTPUT_DIR=/data/users/kadvani/models/Kimi-K2-Thinking-MXFP4/
cd $LOCAL_QUARK_DIR/examples/torch/language_modeling/llm_ptq/

exclude_layers="*self_attn* *mlp.gate.* *lm_head"

# Speed up safetensors loading with parallel decompression
# export SAFETENSORS_FAST_GPU=1 # DISABLED: Causes HIP segfault during export
# export OMP_NUM_THREADS=16 # Adjust based on your CPU cores
export LD_PRELOAD=/usr/lib64/libzstd.so.1

# Use the detected ROCm version path to match amdsmi
export ROCM_HOME=/opt/rocm-7.0.2
export ROCM_PATH=/opt/rocm-7.0.2
export HIP_PATH=/opt/rocm-7.0.2
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
export HSA_FORCE_FINE_GRAIN_PCIE=1

python3 quantize_quark.py --model_dir $MODEL_INPUT_DIR \
    --quant_scheme w_mxfp4_a_mxfp4 \
    --device cpu \
    --group_size 32 \
    --num_calib_data 128 \
    --exclude_layers $exclude_layers \
    --skip_evaluation \
    --model_export hf_format \
    --output_dir $MODEL_OUTPUT_DIR 2>&1 | tee /tmp/kadvani/quantize_quark.log
```
Quantization completed successfully and produced an MXFP4 model.

However, upon starting a vLLM server with the model, I ran into an error triggered by this check: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/quark/quark.py#L155.
On inspection, I realized that the exported `config.json` has a non-empty `"kv_cache_group"`:

```json
"kv_cache_group": [
    "*kv_b_proj"
],
```

but the corresponding quantization configs are empty:

```json
"kv_cache_quant_config": {},
"layer_quant_config": {},
"layer_type_quant_config": {},
```
I then removed the value from `"kv_cache_group"` to get rid of the ValueError, but the model ran into another issue: expert weight names carry an extra suffix, e.g. `model.layers.4.mlp.experts.w2_weight_packed` instead of `model.layers.4.mlp.experts.w2_weight`. I added a quick fix and was able to start the server, but the eval scored 0. I am not sure whether the changes I made to start the server were intended or not.
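My quick fix amounted to renaming checkpoint keys along these lines (a sketch of my workaround, not a supported fix; the suffix and key pattern come from the error above):

```python
# Hypothetical workaround: strip the unexpected "_packed" suffix from
# weight names in a checkpoint's state dict so they match what the model
# loader expects (e.g. ...w2_weight_packed -> ...w2_weight).
def strip_packed_suffix(state_dict: dict) -> dict:
    fixed = {}
    for name, tensor in state_dict.items():
        if name.endswith("_weight_packed"):
            # Drop only the "_packed" part, keeping "_weight".
            name = name[: -len("_packed")]
        fixed[name] = tensor
    return fixed
```

Since the eval still scores 0 after this rename, I suspect the packed tensors are not in the layout the loader expects, and a plain rename is not sufficient.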