Commit bef8226

Record: SP8192 + Depth Recurrence x2 + GPTQ INT6 + Score-First TTT -- val_bpb 1.07974 (3-seed mean)
Seeds 1337, 42, 2024 on 8xH100 SXM with fused-softcap-ce kernel integration.
1 parent 78fac92 commit bef8226
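
The title reports a mean over three seeds, while the per-seed logs reproduced on this page cover seeds 1337 and 2024 only. As a rough consistency check, the sketch below works out what the remaining run would need to score for the headline to hold. It assumes the headline averages the final `quantized_ttt` val_bpb of each run, which the commit message does not state explicitly. The helper name `implied_remaining_score` is hypothetical, and the result is only as precise as the five-decimal rounding of the headline.

```python
def implied_remaining_score(mean, known, n_total):
    """Return the value the single missing run must have for the mean to hold."""
    if len(known) != n_total - 1:
        raise ValueError("exactly one run must be missing")
    return n_total * mean - sum(known)


headline_mean = 1.07974   # commit title, rounded to 5 decimals
seed_1337 = 1.07907426    # final quantized_ttt val_bpb, seed 1337 log
seed_2024 = 1.08001458    # final quantized_ttt val_bpb, seed 2024 log

implied = implied_remaining_score(headline_mean, [seed_1337, seed_2024], 3)

# The headline is rounded to 5 decimals, so the implied value carries an
# uncertainty of up to 3 * 0.5e-5 in either direction.
half_width = 3 * 0.5e-5
print(f"implied third-run val_bpb: {implied:.5f} (+/- {half_width:.1e})")
```

This is an inference from the two logged values and the rounded mean, not a logged measurement.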

3 files changed

Lines changed: 444 additions & 0 deletions

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
+W0412 17:41:11.842000 48239 torch/distributed/run.py:803]
+W0412 17:41:11.842000 48239 torch/distributed/run.py:803] *****************************************
+W0412 17:41:11.842000 48239 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+W0412 17:41:11.842000 48239 torch/distributed/run.py:803] *****************************************
+Hyperparameters:
+adam_eps: 1e-08
+adam_wd: 0.02
+beta1: 0.9
+beta2: 0.95
+compressor: brotli
+data_dir: /workspace/data
+datasets_dir: /workspace/data/datasets/fineweb10B_sp8192
+distributed: True
+ema_decay: 0.9965
+embed_bits: 8
+embed_clip_sigmas: 20.0
+embed_lr: 0.6
+embed_wd: 0.085
+embedding_dim: 512
+enable_looping_at: 0.35
+etlb_clip: 3.0
+etlb_enabled: False
+etlb_lr: 0.05
+etlb_steps: 5
+eval_seq_len: 2048
+eval_stride: 64
+gptq_calibration_batches: 64
+gptq_reserve_seconds: 12.0
+grad_accum_steps: 1
+grad_clip_norm: 0.3
+head_lr: 0.008
+is_main_process: True
+iterations: 20000
+ln_scale: True
+local_rank: 0
+logfile: logs/frontier_seed1337.txt
+logit_softcap: 30.0
+loop_end: 5
+loop_start: 3
+matrix_bits: 6
+matrix_clip_sigmas: 12.85
+matrix_lr: 0.022
+max_wallclock_seconds: 600.0
+min_lr: 0.0
+mlp_mult: 4.0
+model_dim: 512
+model_path: final_model.pt
+muon_backend_steps: 5
+muon_beta2: 0.95
+muon_momentum: 0.99
+muon_momentum_warmup_start: 0.92
+muon_momentum_warmup_steps: 1500
+muon_row_normalize: True
+muon_wd: 0.095
+num_heads: 8
+num_kv_heads: 4
+num_layers: 11
+num_loops: 2
+parallel_residual_start: 7
+qk_gain_init: 5.25
+quantized_model_path: final_model.int6.ptz
+rank: 0
+rope_base: 10000.0
+rope_dims: 16
+rope_train_seq_len: 2048
+run_id: frontier_seed1337
+scalar_lr: 0.02
+seed: 1337
+skip_gates_enabled: True
+sliding_window_enabled: True
+tie_embeddings: True
+tied_embed_init_std: 0.005
+tied_embed_lr: 0.03
+tokenizer_path: /workspace/data/tokenizers/fineweb_8192_bpe.model
+train_batch_tokens: 786432
+train_files: /workspace/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
+train_log_every: 500
+train_seq_len: 2048
+ttt_chunk_tokens: 32768
+ttt_enabled: True
+ttt_epochs: 3
+ttt_lr: 0.005
+ttt_momentum: 0.9
+val_batch_tokens: 524288
+val_files: /workspace/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
+val_loss_every: 4000
+vocab_size: 8192
+warmdown_frac: 0.72
+warmup_steps: 20
+world_size: 8
+xsa_last_n: 11
+train_shards: 80
+val_tokens: 40548352
+model_params:35944536
+gptq:reserving 12s, effective=588000ms
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
+loop_warmup_step: 1/20
+loop_warmup_step: 2/20
+loop_warmup_step: 3/20
+loop_warmup_step: 4/20
+loop_warmup_step: 5/20
+loop_warmup_step: 6/20
+loop_warmup_step: 10/20
+loop_warmup_step: 20/20
+0/20000 val_loss: 9.0047 val_bpb: 3.4867
+1/20000 train_loss: 9.0080 train_time: 0.0m tok/s: 8336072
+2/20000 train_loss: 12.2992 train_time: 0.0m tok/s: 8184327
+3/20000 train_loss: 11.0456 train_time: 0.0m tok/s: 8084574
+4/20000 train_loss: 9.4139 train_time: 0.0m tok/s: 8030457
+5/20000 train_loss: 8.3296 train_time: 0.0m tok/s: 7997738
+500/20000 train_loss: 3.3332 train_time: 0.8m tok/s: 7731821
+1000/20000 train_loss: 3.2115 train_time: 1.7m tok/s: 7728010
+1500/20000 train_loss: 3.0985 train_time: 2.5m tok/s: 7736121
+2000/20000 train_loss: 3.0193 train_time: 3.4m tok/s: 7741721
+layer_loop:enabled step:2026 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
+2500/20000 train_loss: 2.9987 train_time: 4.6m tok/s: 7114884
+3000/20000 train_loss: 3.0367 train_time: 5.8m tok/s: 6727898
+3500/20000 train_loss: 2.9188 train_time: 7.1m tok/s: 6476757
+4000/20000 train_loss: 2.9547 train_time: 8.3m tok/s: 6299690
+4000/20000 val_loss: 2.8728 val_bpb: 1.1124
+4500/20000 train_loss: 2.7579 train_time: 9.6m tok/s: 6170374
+4598/20000 val_loss: 2.8075 val_bpb: 1.0871
+stopping_early: wallclock_cap train_time: 588092ms step: 4598/20000
+peak memory allocated: 39046 MiB reserved: 39070 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.80424019 val_bpb:1.08583141 eval_time:6825ms
+Serialized model: 135431033 bytes
+Code size: 16791 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 67 Hessians in 12.7s
+Quantized weights:
+gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+gptq (int8): tok_emb.weight
+passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
+Serialized model quantized+brotli: 15975659 bytes
+Total submission size quantized+brotli: 15992450 bytes
+quantized val_loss:2.83421669 val_bpb:1.09743862 eval_time:8477ms
+quantized_sliding_window val_loss:2.79040941 val_bpb:1.08047598 eval_time:88503ms
+ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3
+quantized_ttt val_loss:2.78678937 val_bpb:1.07907426 eval_time:334602ms
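
The tail of the seed 1337 log packs several quantities that are easier to compare once parsed. The sketch below is a reading aid rather than part of the submission. It extracts the evaluation and size lines with two regular expressions written for this log's format (an assumption, since the training script is not shown here), checks that the total submission size equals the compressed model plus the code size, and reports each stage's val_bpb relative to the pre-quantization score. It also derives bytes per token under the assumption that `val_loss` is mean cross-entropy per token in nats and that `val_bpb = val_loss / ln 2 * tokens / bytes`. The names `parse_tail`, `EVAL_RE`, and `SIZE_RE` are hypothetical.

```python
import math
import re

# Lines copied from the end of the seed 1337 log.
LOG_TAIL = """\
pre-quantization post-ema val_loss:2.80424019 val_bpb:1.08583141 eval_time:6825ms
Code size: 16791 bytes
Serialized model quantized+brotli: 15975659 bytes
Total submission size quantized+brotli: 15992450 bytes
quantized val_loss:2.83421669 val_bpb:1.09743862 eval_time:8477ms
quantized_sliding_window val_loss:2.79040941 val_bpb:1.08047598 eval_time:88503ms
quantized_ttt val_loss:2.78678937 val_bpb:1.07907426 eval_time:334602ms
"""

EVAL_RE = re.compile(r"^(?P<stage>.+?) val_loss:(?P<loss>[\d.]+) val_bpb:(?P<bpb>[\d.]+)")
SIZE_RE = re.compile(r"^(?P<what>.+?): (?P<n>\d+) bytes$")


def parse_tail(text):
    """Collect {stage: (val_loss, val_bpb)} and {label: bytes} from log lines."""
    evals, sizes = {}, {}
    for line in text.splitlines():
        m = EVAL_RE.match(line)
        if m:
            evals[m["stage"]] = (float(m["loss"]), float(m["bpb"]))
            continue
        m = SIZE_RE.match(line)
        if m:
            sizes[m["what"]] = int(m["n"])
    return evals, sizes


evals, sizes = parse_tail(LOG_TAIL)

# Size check: compressed weights plus code should equal the reported total.
total = sizes["Serialized model quantized+brotli"] + sizes["Code size"]
size_ok = total == sizes["Total submission size quantized+brotli"]

# val_bpb change of each stage relative to the pre-quantization EMA model.
base_bpb = evals["pre-quantization post-ema"][1]
deltas = {stage: bpb - base_bpb for stage, (_, bpb) in evals.items()}

# Under the stated assumption, loss / (ln 2 * bpb) is bytes per token.
bytes_per_token = {
    stage: loss / (math.log(2) * bpb) for stage, (loss, bpb) in evals.items()
}

print("size check passes:", size_ok)
for stage, delta in deltas.items():
    print(f"{stage:28s} delta_bpb={delta:+.8f} bytes/token={bytes_per_token[stage]:.4f}")
```

On these figures, plain quantization raises val_bpb by about 0.0116 over the pre-quantization model, while sliding-window evaluation and test-time training bring the quantized model below the pre-quantization score.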
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
+W0412 18:26:02.275000 60284 torch/distributed/run.py:803]
+W0412 18:26:02.275000 60284 torch/distributed/run.py:803] *****************************************
+W0412 18:26:02.275000 60284 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+W0412 18:26:02.275000 60284 torch/distributed/run.py:803] *****************************************
+Hyperparameters:
+adam_eps: 1e-08
+adam_wd: 0.02
+beta1: 0.9
+beta2: 0.95
+compressor: brotli
+data_dir: /workspace/data
+datasets_dir: /workspace/data/datasets/fineweb10B_sp8192
+distributed: True
+ema_decay: 0.9965
+embed_bits: 8
+embed_clip_sigmas: 20.0
+embed_lr: 0.6
+embed_wd: 0.085
+embedding_dim: 512
+enable_looping_at: 0.35
+etlb_clip: 3.0
+etlb_enabled: False
+etlb_lr: 0.05
+etlb_steps: 5
+eval_seq_len: 2048
+eval_stride: 64
+gptq_calibration_batches: 64
+gptq_reserve_seconds: 12.0
+grad_accum_steps: 1
+grad_clip_norm: 0.3
+head_lr: 0.008
+is_main_process: True
+iterations: 20000
+ln_scale: True
+local_rank: 0
+logfile: logs/frontier_seed2024.txt
+logit_softcap: 30.0
+loop_end: 5
+loop_start: 3
+matrix_bits: 6
+matrix_clip_sigmas: 12.85
+matrix_lr: 0.022
+max_wallclock_seconds: 600.0
+min_lr: 0.0
+mlp_mult: 4.0
+model_dim: 512
+model_path: final_model.pt
+muon_backend_steps: 5
+muon_beta2: 0.95
+muon_momentum: 0.99
+muon_momentum_warmup_start: 0.92
+muon_momentum_warmup_steps: 1500
+muon_row_normalize: True
+muon_wd: 0.095
+num_heads: 8
+num_kv_heads: 4
+num_layers: 11
+num_loops: 2
+parallel_residual_start: 7
+qk_gain_init: 5.25
+quantized_model_path: final_model.int6.ptz
+rank: 0
+rope_base: 10000.0
+rope_dims: 16
+rope_train_seq_len: 2048
+run_id: frontier_seed2024
+scalar_lr: 0.02
+seed: 2024
+skip_gates_enabled: True
+sliding_window_enabled: True
+tie_embeddings: True
+tied_embed_init_std: 0.005
+tied_embed_lr: 0.03
+tokenizer_path: /workspace/data/tokenizers/fineweb_8192_bpe.model
+train_batch_tokens: 786432
+train_files: /workspace/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
+train_log_every: 500
+train_seq_len: 2048
+ttt_chunk_tokens: 32768
+ttt_enabled: True
+ttt_epochs: 3
+ttt_lr: 0.005
+ttt_momentum: 0.9
+val_batch_tokens: 524288
+val_files: /workspace/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
+val_loss_every: 4000
+vocab_size: 8192
+warmdown_frac: 0.72
+warmup_steps: 20
+world_size: 8
+xsa_last_n: 11
+train_shards: 80
+val_tokens: 40548352
+model_params:35944536
+gptq:reserving 12s, effective=588000ms
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
+loop_warmup_step: 1/20
+loop_warmup_step: 2/20
+loop_warmup_step: 3/20
+loop_warmup_step: 4/20
+loop_warmup_step: 5/20
+loop_warmup_step: 6/20
+loop_warmup_step: 10/20
+loop_warmup_step: 20/20
+0/20000 val_loss: 9.0072 val_bpb: 3.4877
+1/20000 train_loss: 9.0094 train_time: 0.0m tok/s: 8317541
+2/20000 train_loss: 12.2867 train_time: 0.0m tok/s: 8171456
+3/20000 train_loss: 11.0810 train_time: 0.0m tok/s: 8077182
+4/20000 train_loss: 9.4616 train_time: 0.0m tok/s: 8025433
+5/20000 train_loss: 8.3776 train_time: 0.0m tok/s: 7991921
+500/20000 train_loss: 3.3317 train_time: 0.8m tok/s: 7750924
+1000/20000 train_loss: 3.2122 train_time: 1.7m tok/s: 7739976
+1500/20000 train_loss: 3.0989 train_time: 2.5m tok/s: 7739565
+2000/20000 train_loss: 3.0174 train_time: 3.4m tok/s: 7741289
+layer_loop:enabled step:2026 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
+2500/20000 train_loss: 2.9978 train_time: 4.6m tok/s: 7114500
+3000/20000 train_loss: 3.0376 train_time: 5.8m tok/s: 6727692
+3500/20000 train_loss: 2.9237 train_time: 7.1m tok/s: 6476233
+4000/20000 train_loss: 2.9570 train_time: 8.3m tok/s: 6300234
+4000/20000 val_loss: 2.8752 val_bpb: 1.1133
+4500/20000 train_loss: 2.7565 train_time: 9.6m tok/s: 6170620
+4598/20000 val_loss: 2.8097 val_bpb: 1.0879
+stopping_early: wallclock_cap train_time: 588079ms step: 4598/20000
+peak memory allocated: 39046 MiB reserved: 39070 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.80642814 val_bpb:1.08667860 eval_time:6803ms
+Serialized model: 135431033 bytes
+Code size: 16791 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 67 Hessians in 12.8s
+Quantized weights:
+gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+gptq (int8): tok_emb.weight
+passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
+Serialized model quantized+brotli: 15972913 bytes
+Total submission size quantized+brotli: 15989704 bytes
+quantized val_loss:2.83555862 val_bpb:1.09795824 eval_time:8518ms
+quantized_sliding_window val_loss:2.79237236 val_bpb:1.08123606 eval_time:88486ms
+ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3
+quantized_ttt val_loss:2.78921781 val_bpb:1.08001458 eval_time:333698ms
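
Both logs print the same `layer_loop` schedule: `encoder:[0, 1, 2, 3, 4, 5, 3, 4]` and `decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]`, alongside `loop_start: 3`, `loop_end: 5`, `num_loops: 2` and `num_layers: 11`. The training script is not part of this view, so the rule that produces those lists is not shown. The sketch below is one hypothetical reconstruction that happens to reproduce them: run the layers before the loop once, run the looped block `num_loops + 1` times in total, run the remaining layers once, then split the resulting sequence at its midpoint. The function name `looped_layer_order` and the midpoint split are assumptions.

```python
def looped_layer_order(num_layers, loop_start, loop_end, num_loops):
    """Hypothetical reconstruction of the logged depth-recurrence schedule.

    Layers loop_start..loop_end (inclusive) are visited num_loops extra
    times; the flattened order is then split into encoder/decoder halves.
    """
    loop_block = list(range(loop_start, loop_end + 1))
    order = (
        list(range(loop_start))                    # layers before the loop
        + loop_block * (num_loops + 1)             # first pass plus repeats
        + list(range(loop_end + 1, num_layers))    # layers after the loop
    )
    split = len(order) // 2  # assumed: encoder takes the shorter first half
    return order[:split], order[split:]


encoder, decoder = looped_layer_order(
    num_layers=11, loop_start=3, loop_end=5, num_loops=2
)
print("encoder:", encoder)
print("decoder:", decoder)
```

Under this reading, 11 distinct layers are visited 17 times per forward pass once looping is enabled, which fits the drop in logged tok/s after step 2026.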
