This discussion was converted from issue #7589 on April 03, 2025 08:22.
Reminder
System Info

llamafactory version: 0.9.2.dev0

Reproduction

```
[INFO|2025-04-03 15:09:41] llamafactory.model.model_utils.checkpointing:157 >> Gradient checkpointing enabled.
[INFO|2025-04-03 15:09:41] llamafactory.model.model_utils.attention:157 >> Using torch SDPA for faster training and inference.
[INFO|2025-04-03 15:09:41] llamafactory.model.adapter:157 >> Upcasting trainable params to float32.
[INFO|2025-04-03 15:09:41] llamafactory.model.adapter:157 >> Fine-tuning method: LoRA
[INFO|2025-04-03 15:09:41] llamafactory.model.model_utils.misc:157 >> Found linear modules: v_proj,o_proj,q_proj,k_proj,down_proj,gate_proj,up_proj
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
[INFO|2025-04-03 15:09:42] llamafactory.model.loader:157 >> trainable params: 67,108,864 || all params: 32,830,985,216 || trainable%: 0.2044
[INFO|trainer.py:746] 2025-04-03 15:09:42,189 >> Using auto half precision backend
[WARNING|trainer.py:781] 2025-04-03 15:09:42,190 >> No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
[INFO|2025-04-03 15:09:42] llamafactory.train.trainer_utils:157 >> Using LoRA+ optimizer with loraplus lr ratio 16.00.
[INFO|trainer.py:2405] 2025-04-03 15:09:42,679 >> ***** Running training *****
[INFO|trainer.py:2406] 2025-04-03 15:09:42,679 >> Num examples = 12,619
[INFO|trainer.py:2407] 2025-04-03 15:09:42,679 >> Num Epochs = 30
[INFO|trainer.py:2408] 2025-04-03 15:09:42,679 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2411] 2025-04-03 15:09:42,679 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2412] 2025-04-03 15:09:42,679 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2413] 2025-04-03 15:09:42,679 >> Total optimization steps = 47,310
[INFO|trainer.py:2414] 2025-04-03 15:09:42,686 >> Number of trainable parameters = 67,108,864
0%| | 5/47310 [01:13<191:09:20, 14.55s/it][INFO|2025-04-03 15:10:57] llamafactory.train.callbacks:157 >> {'loss': 1.1686, 'learning_rate': 5.0000e-05, 'epoch': 0.00, 'throughput': 1020.13}
{'loss': 1.1686, 'grad_norm': 0.14585863053798676, 'learning_rate': 4.999999862201698e-05, 'epoch': 0.0, 'num_input_tokens_seen': 75232}
0%| | 10/47310 [02:34<206:30:10, 15.72s/it][INFO|2025-04-03 15:12:18] llamafactory.train.callbacks:157 >> {'loss': 0.9774, 'learning_rate': 5.0000e-05, 'epoch': 0.01, 'throughput': 1011.31}
{'loss': 0.9774, 'grad_norm': 0.11302416771650314, 'learning_rate': 4.999999448806806e-05, 'epoch': 0.01, 'num_input_tokens_seen': 156368}
0%| | 15/47310 [03:50<204:48:43, 15.59s/it][INFO|2025-04-03 15:13:34] llamafactory.train.callbacks:157 >> {'loss': 0.9439, 'learning_rate': 5.0000e-05, 'epoch': 0.01, 'throughput': 1001.57}
{'loss': 0.9439, 'grad_norm': 0.1326381415128708, 'learning_rate': 4.999998759815371e-05, 'epoch': 0.01, 'num_input_tokens_seen': 230736}
0%| | 20/47310 [05:05<196:26:14, 14.95s/it][INFO|2025-04-03 15:14:49] llamafactory.train.callbacks:157 >> {'loss': 0.9269, 'learning_rate': 5.0000e-05, 'epoch': 0.01, 'throughput': 997.06}
{'loss': 0.9269, 'grad_norm': 0.09542257338762283, 'learning_rate': 4.9999977952274676e-05, 'epoch': 0.01, 'num_input_tokens_seen': 304624}
0%| | 25/47310 [06:19<197:02:47, 15.00s/it][INFO|2025-04-03 15:16:03] llamafactory.train.callbacks:157 >> {'loss': 0.8354, 'learning_rate': 5.0000e-05, 'epoch': 0.02, 'throughput': 1000.50}
{'loss': 0.8354, 'grad_norm': 0.11180797964334488, 'learning_rate': 4.999996555043203e-05, 'epoch': 0.02, 'num_input_tokens_seen': 380112}
0%| | 30/47310 [07:33<192:09:52, 14.63s/it][INFO|2025-04-03 15:17:17] llamafactory.train.callbacks:157 >> {'loss': 0.7925, 'learning_rate': 5.0000e-05, 'epoch': 0.02, 'throughput': 1007.71}
{'loss': 0.7925, 'grad_norm': 0.10915672779083252, 'learning_rate': 4.9999950392627126e-05, 'epoch': 0.02, 'num_input_tokens_seen': 457424}
0%| | 35/47310 [08:51<205:51:53, 15.68s/it][INFO|2025-04-03 15:18:35] llamafactory.train.callbacks:157 >> {'loss': 0.7684, 'learning_rate': 5.0000e-05, 'epoch': 0.02, 'throughput': 1009.29}
{'loss': 0.7684, 'grad_norm': 0.10244904458522797, 'learning_rate': 4.999993247886166e-05, 'epoch': 0.02, 'num_input_tokens_seen': 536736}
0%| | 40/47310 [10:08<207:55:55, 15.84s/it][INFO|2025-04-03 15:19:51] llamafactory.train.callbacks:157 >> {'loss': 0.8594, 'learning_rate': 5.0000e-05, 'epoch': 0.03, 'throughput': 1008.51}
{'loss': 0.8594, 'grad_norm': 0.10987041145563126, 'learning_rate': 4.999991180913758e-05, 'epoch': 0.03, 'num_input_tokens_seen': 613408}
0%| | 45/47310 [11:27<211:56:37, 16.14s/it][INFO|2025-04-03 15:21:11] llamafactory.train.callbacks:157 >> {'loss': 0.7329, 'learning_rate': 5.0000e-05, 'epoch': 0.03, 'throughput': 1008.89}
{'loss': 0.7329, 'grad_norm': 0.11014138162136078, 'learning_rate': 4.999988838345718e-05, 'epoch': 0.03, 'num_input_tokens_seen': 693888}
0%| | 50/47310 [12:39<188:59:24, 14.40s/it][INFO|2025-04-03 15:22:23] llamafactory.train.callbacks:157 >> {'loss': 0.8117, 'learning_rate': 5.0000e-05, 'epoch': 0.03, 'throughput': 1006.75}
{'loss': 0.8117, 'grad_norm': 0.145054891705513, 'learning_rate': 4.999986220182303e-05, 'epoch': 0.03, 'num_input_tokens_seen': 764720}
0%| | 55/47310 [13:53<190:34:08, 14.52s/it][INFO|2025-04-03 15:23:37] llamafactory.train.callbacks:157 >> {'loss': 0.7202, 'learning_rate': 5.0000e-05, 'epoch': 0.03, 'throughput': 1008.15}
{'loss': 0.7202, 'grad_norm': 0.128993958234787, 'learning_rate': 4.999983326423804e-05, 'epoch': 0.03, 'num_input_tokens_seen': 840288}
0%| | 60/47310 [15:16<210:34:10, 16.04s/it][INFO|2025-04-03 15:25:00] llamafactory.train.callbacks:157 >> {'loss': 0.7561, 'learning_rate': 5.0000e-05, 'epoch': 0.04, 'throughput': 1006.23}
{'loss': 0.7561, 'grad_norm': 0.1089465394616127, 'learning_rate': 4.999980157070538e-05, 'epoch': 0.04, 'num_input_tokens_seen': 922000}
0%| | 65/47310 [16:36<209:34:52, 15.97s/it][INFO|2025-04-03 15:26:19] llamafactory.train.callbacks:157 >> {'loss': 0.7396, 'learning_rate': 5.0000e-05, 'epoch': 0.04, 'throughput': 1006.27}
{'loss': 0.7396, 'grad_norm': 0.12486272305250168, 'learning_rate': 4.9999767121228546e-05, 'epoch': 0.04, 'num_input_tokens_seen': 1002352}
  0%|          | 69/47310 [17:35<200:29:28, 15.28s/it]
```
I'm training on 12,619 examples here, and it's estimated to take about 200 hours? Does it really need that long? Did I misconfigure something, and is there any way to optimize this?
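For reference, the ~200-hour ETA follows directly from the numbers in the log above: 30 epochs over 12,619 examples at an effective batch size of 8 gives 47,310 optimizer steps, and at roughly 15 s/step that is about 197 hours. Below is a minimal back-of-the-envelope sketch of that arithmetic; the 15 s/step figure is only read off the tqdm progress bar (14–16 s/it), not measured precisely.

```python
# Rough time estimate using the values printed in the training log above.
num_examples = 12_619          # "Num examples"
num_epochs = 30                # "Num Epochs"
total_batch_size = 8           # "Total train batch size (w. parallel, distributed & accumulation)"
seconds_per_step = 15.0        # approximate, taken from the progress bar (14-16 s/it)

steps_per_epoch = num_examples // total_batch_size    # 1577
total_steps = steps_per_epoch * num_epochs            # 47310, matches "Total optimization steps"
total_hours = total_steps * seconds_per_step / 3600   # ~197 hours

print(total_steps, round(total_hours, 1))
```

So if the 30-epoch setting is intentional, the ETA is expected; reducing the number of epochs or raising the effective batch size (per-device batch size or gradient accumulation) would shorten the wall-clock time roughly proportionally.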