Commit 34bba0b

Add kimi-k2 (#560)

Authored-by: Gao016gaochang
Co-authored-by: gaochang <gaochang@U-19PX2WQ1-0350.local>

1 parent 2f39bea

File tree

3 files changed: +65 −0 lines

docs/en/get_started/quick_start.md
Lines changed: 1 addition & 0 deletions

@@ -93,6 +93,7 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
 ```
 
 For larger models, you can use `torchrun` to launch the conversion script and convert with multiple GPUs or even multiple nodes.
+Note: When converting the kimi-k2 model weights, you need to open config.json in the model path and change "model_type": "kimi_k2" to "model_type": "deepseek_v3".
 
 ### Convert from Megatron Format to Hugging Face Format
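The config.json edit described in the note above can be scripted rather than done by hand. A minimal sketch, assuming the standard Hugging Face layout where config.json sits at the top of the model directory; `patch_model_type` is a hypothetical helper name, not part of the repository:

```python
import json
from pathlib import Path

def patch_model_type(model_path: str) -> None:
    """Rewrite "model_type": "kimi_k2" to "deepseek_v3" in the model's config.json."""
    config_file = Path(model_path) / "config.json"
    config = json.loads(config_file.read_text())
    # Only touch the field when it actually holds the kimi_k2 identifier.
    if config.get("model_type") == "kimi_k2":
        config["model_type"] = "deepseek_v3"
        config_file.write_text(json.dumps(config, indent=2))
```

Run it once against the downloaded checkpoint directory before starting the conversion; it leaves every other field of config.json untouched.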

docs/zh/get_started/quick_start.md
Lines changed: 1 addition & 0 deletions

@@ -93,6 +93,7 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
 ```
 
 For larger models, you can use `torchrun` to launch the conversion script and convert with multiple GPUs or even multiple machines.
+Note: When converting the kimi-k2 model weights, you need to open config.json in the model path and change "model_type": "kimi_k2" to "model_type": "deepseek_v3".
 
 ### Convert from Megatron Format to Hugging Face Format

scripts/models/kimi-k2.sh
Lines changed: 63 additions & 0 deletions

NLAYERS=61
FIRST_K_DENSE_REPLACE=1

# Build the MoE layer pattern: the first FIRST_K_DENSE_REPLACE layers
# are dense (0), the remaining layers are MoE (1).
arr=()
for ((i=0; i<NLAYERS; i++)); do
    if (( i < FIRST_K_DENSE_REPLACE )); then
        arr+=(0)
    else
        arr+=(1)
    fi
done

# Render the pattern as a bracketed list for --moe-layer-freq.
printf -v MOE_LAYER_FREQ "[%s]" "$(IFS=', '; echo "${arr[*]}")"

# kimi-k2
MODEL_ARGS=(
    --disable-bias-linear
    --num-layers 61
    --hidden-size 7168
    --ffn-hidden-size 18432
    --num-attention-heads 64
    --kv-channels 64
    --normalization RMSNorm
    --position-embedding-type rope
    --norm-epsilon 1e-6
    --swiglu
    --untie-embeddings-and-output-weights
    --vocab-size 163840

    --multi-latent-attention
    --q-lora-rank 1536
    --kv-lora-rank 512
    --qk-head-dim 128
    --qk-pos-emb-head-dim 64
    --v-head-dim 128
    --qk-layernorm
    --rotary-scaling-factor 32.0
    --rotary-base 50000
    --mscale 1.0
    --mscale-all-dim 1.0
    --attention-softmax-in-fp32
    --no-rope-fusion

    # moe
    --num-experts 384
    --moe-layer-freq $MOE_LAYER_FREQ
    --moe-ffn-hidden-size 2048
    --moe-router-topk 8
    --moe-shared-expert-intermediate-size 2048
    --moe-router-pre-softmax
    --moe-router-score-function sigmoid
    --moe-router-enable-expert-bias
    --moe-router-load-balancing-type seq_aux_loss
    --moe-token-dispatcher-type alltoall
    --moe-aux-loss-coeff 0
    --moe-router-bias-update-rate 0
    --moe-router-group-topk 1
    --moe-router-num-groups 1
    --moe-grouped-gemm
    --moe-router-topk-scaling-factor 2.827
    --moe-router-dtype fp32
    --moe-permute-fusion
)
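The MOE_LAYER_FREQ construction in the script above can be mirrored in Python to see exactly what string the shell loop produces. A small sketch for checking; note that bash's `"${arr[*]}"` joins elements using only the first character of IFS, so with `IFS=', '` the separator is a bare comma:

```python
NLAYERS = 61
FIRST_K_DENSE_REPLACE = 1

# 0 = dense FFN layer, 1 = MoE layer; only the first layer stays dense here.
arr = [0 if i < FIRST_K_DENSE_REPLACE else 1 for i in range(NLAYERS)]

# bash "${arr[*]}" with IFS=', ' joins on "," (first IFS character only)
moe_layer_freq = "[" + ",".join(map(str, arr)) + "]"
```

For NLAYERS=61 and FIRST_K_DENSE_REPLACE=1 this yields a 61-element pattern starting `[0,1,1,…` with 60 MoE layers, matching kimi-k2's single leading dense layer.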
