
Performance is worse than benchmark on RTX 5090 (PyTorch 2.9.0, CUDA 12.8) #123

@Wingtail

I'm having trouble reproducing the reported forecasting MSE / MAE results on my machine. I'm running PyTorch '2.9.0+cu128' on an RTX 5090.

These are the details of the supervised training results:

Args in experiment:
Namespace(is_training=1, train_only=False, model_id='ETTh1_336_96', model='DLinear', data='ETTh1', root_path='./dataset/', data_path='ETTh1.csv', features='M', target='OT', freq='h', checkpoints='./checkpoints/', seq_len=336, label_len=48, pred_len=96, individual=False, embed_type=0, enc_in=7, dec_in=7, c_out=7, d_model=512, n_heads=8, e_layers=2, d_layers=1, d_ff=2048, moving_avg=25, factor=1, distil=True, dropout=0.05, embed='timeF', activation='gelu', output_attention=False, do_predict=False, num_workers=10, itr=1, train_epochs=10, batch_size=32, patience=3, learning_rate=0.005, des='Exp', loss='mse', lradj='type1', use_amp=False, use_gpu=True, gpu=0, use_multi_gpu=False, devices='0,1,2,3', test_flop=False)
Use GPU: cuda:0
>>>>>>>start training : ETTh1_336_96_DLinear_ETTh1_ftM_sl336_ll48_pl96_dm512_nh8_el2_dl1_df2048_fc1_ebtimeF_dtTrue_Exp_0>>>>>>>>>>>>>>>>>>>>>>>>>>
train 8209
val 2785
test 2785
	iters: 100, epoch: 1 | loss: 0.3399413
	speed: 0.0067s/iter; left time: 16.6099s
	iters: 200, epoch: 1 | loss: 0.3660883
	speed: 0.0011s/iter; left time: 2.5808s
Epoch: 1 cost time: 0.4904322624206543
Epoch: 1, Steps: 256 | Train Loss: 0.4029245 Vali Loss: 0.7159958 Test Loss: 0.4223829
Validation loss decreased (inf --> 0.715996).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.4710605
	speed: 0.0072s/iter; left time: 15.9437s
	iters: 200, epoch: 2 | loss: 0.4261848
	speed: 0.0011s/iter; left time: 2.3564s
Epoch: 2 cost time: 0.4115111827850342
Epoch: 2, Steps: 256 | Train Loss: 0.3777461 Vali Loss: 0.7026909 Test Loss: 0.4617932
Validation loss decreased (0.715996 --> 0.702691).  Saving model ...
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.4502370
	speed: 0.0079s/iter; left time: 15.3003s
	iters: 200, epoch: 3 | loss: 0.3483346
	speed: 0.0011s/iter; left time: 1.9859s
Epoch: 3 cost time: 0.429640531539917
Epoch: 3, Steps: 256 | Train Loss: 0.3555608 Vali Loss: 0.7140683 Test Loss: 0.3948340
EarlyStopping counter: 1 out of 3
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.3529106
	speed: 0.0079s/iter; left time: 13.3660s
	iters: 200, epoch: 4 | loss: 0.4078414
	speed: 0.0011s/iter; left time: 1.7850s
Epoch: 4 cost time: 0.44702672958374023
Epoch: 4, Steps: 256 | Train Loss: 0.3421529 Vali Loss: 0.6486187 Test Loss: 0.3856722
Validation loss decreased (0.702691 --> 0.648619).  Saving model ...
Updating learning rate to 0.000625
	iters: 100, epoch: 5 | loss: 0.3692797
	speed: 0.0075s/iter; left time: 10.8258s
	iters: 200, epoch: 5 | loss: 0.3427099
	speed: 0.0011s/iter; left time: 1.5080s
Epoch: 5 cost time: 0.41738319396972656
Epoch: 5, Steps: 256 | Train Loss: 0.3369668 Vali Loss: 0.6703700 Test Loss: 0.3762412
EarlyStopping counter: 1 out of 3
Updating learning rate to 0.0003125
	iters: 100, epoch: 6 | loss: 0.3457166
	speed: 0.0075s/iter; left time: 8.8783s
	iters: 200, epoch: 6 | loss: 0.3241579
	speed: 0.0011s/iter; left time: 1.2278s
Epoch: 6 cost time: 0.42697834968566895
Epoch: 6, Steps: 256 | Train Loss: 0.3338273 Vali Loss: 0.6666238 Test Loss: 0.3734278
EarlyStopping counter: 2 out of 3
Updating learning rate to 0.00015625
	iters: 100, epoch: 7 | loss: 0.3143586
	speed: 0.0077s/iter; left time: 7.1401s
	iters: 200, epoch: 7 | loss: 0.3235159
	speed: 0.0011s/iter; left time: 0.9038s
Epoch: 7 cost time: 0.43926358222961426
Epoch: 7, Steps: 256 | Train Loss: 0.3320066 Vali Loss: 0.6627674 Test Loss: 0.3719860
EarlyStopping counter: 3 out of 3
Early stopping
>>>>>>>testing : ETTh1_336_96_DLinear_ETTh1_ftM_sl336_ll48_pl96_dm512_nh8_el2_dl1_df2048_fc1_ebtimeF_dtTrue_Exp_0<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
test 2785
mse:0.3841443955898285, mae:0.40471312403678894
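
For anyone reading the run above: the logged "Updating learning rate" values are consistent with a schedule that starts at `learning_rate=0.005` and halves after every epoch. Below is a minimal sketch of that pattern, assuming this is what `lradj='type1'` does. The function name `halved_lr` is hypothetical and is not taken from the repository.

```python
def halved_lr(base_lr: float, epoch: int) -> float:
    """Learning rate set after `epoch` (1-indexed), halving once per epoch.

    Assumption: this mirrors the behaviour suggested by the log, not
    necessarily the exact code path used by lradj='type1'.
    """
    return base_lr * 0.5 ** (epoch - 1)


# Replay the schedule for the seven epochs of the first run.
schedule = [halved_lr(0.005, epoch) for epoch in range(1, 8)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr}")
```

By epoch 7 the rate has dropped to roughly 1.5% of its initial value, so later epochs change the weights very little.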

Args in experiment:
Namespace(is_training=1, train_only=False, model_id='ETTh1_336_192', model='DLinear', data='ETTh1', root_path='./dataset/', data_path='ETTh1.csv', features='M', target='OT', freq='h', checkpoints='./checkpoints/', seq_len=336, label_len=48, pred_len=192, individual=False, embed_type=0, enc_in=7, dec_in=7, c_out=7, d_model=512, n_heads=8, e_layers=2, d_layers=1, d_ff=2048, moving_avg=25, factor=1, distil=True, dropout=0.05, embed='timeF', activation='gelu', output_attention=False, do_predict=False, num_workers=10, itr=1, train_epochs=10, batch_size=32, patience=3, learning_rate=0.005, des='Exp', loss='mse', lradj='type1', use_amp=False, use_gpu=True, gpu=0, use_multi_gpu=False, devices='0,1,2,3', test_flop=False)
Use GPU: cuda:0
>>>>>>>start training : ETTh1_336_192_DLinear_ETTh1_ftM_sl336_ll48_pl192_dm512_nh8_el2_dl1_df2048_fc1_ebtimeF_dtTrue_Exp_0>>>>>>>>>>>>>>>>>>>>>>>>>>
train 8113
val 2689
test 2689
	iters: 100, epoch: 1 | loss: 0.4030632
	speed: 0.0069s/iter; left time: 16.7170s
	iters: 200, epoch: 1 | loss: 0.3989787
	speed: 0.0011s/iter; left time: 2.5751s
Epoch: 1 cost time: 0.4998776912689209
Epoch: 1, Steps: 253 | Train Loss: 0.4550423 Vali Loss: 0.9690023 Test Loss: 0.4644403
Validation loss decreased (inf --> 0.969002).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.3955494
	speed: 0.0076s/iter; left time: 16.5265s
	iters: 200, epoch: 2 | loss: 0.4618148
	speed: 0.0011s/iter; left time: 2.3185s
Epoch: 2 cost time: 0.44080448150634766
Epoch: 2, Steps: 253 | Train Loss: 0.4368211 Vali Loss: 0.9542814 Test Loss: 0.4633306
Validation loss decreased (0.969002 --> 0.954281).  Saving model ...
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.3942771
	speed: 0.0081s/iter; left time: 15.6748s
	iters: 200, epoch: 3 | loss: 0.3918642
	speed: 0.0012s/iter; left time: 2.1555s
Epoch: 3 cost time: 0.44551682472229004
Epoch: 3, Steps: 253 | Train Loss: 0.4110306 Vali Loss: 0.8765706 Test Loss: 0.4461168
Validation loss decreased (0.954281 --> 0.876571).  Saving model ...
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.4308404
	speed: 0.0079s/iter; left time: 13.1649s
	iters: 200, epoch: 4 | loss: 0.4194396
	speed: 0.0012s/iter; left time: 1.8090s
Epoch: 4 cost time: 0.43269872665405273
Epoch: 4, Steps: 253 | Train Loss: 0.3971279 Vali Loss: 0.9650614 Test Loss: 0.4192918
EarlyStopping counter: 1 out of 3
Updating learning rate to 0.000625
	iters: 100, epoch: 5 | loss: 0.3640448
	speed: 0.0083s/iter; left time: 11.7415s
	iters: 200, epoch: 5 | loss: 0.3823794
	speed: 0.0011s/iter; left time: 1.5164s
Epoch: 5 cost time: 0.4396634101867676
Epoch: 5, Steps: 253 | Train Loss: 0.3900709 Vali Loss: 0.8844976 Test Loss: 0.4103249
EarlyStopping counter: 2 out of 3
Updating learning rate to 0.0003125
	iters: 100, epoch: 6 | loss: 0.3405749
	speed: 0.0078s/iter; left time: 9.1243s
	iters: 200, epoch: 6 | loss: 0.3576510
	speed: 0.0011s/iter; left time: 1.1756s
Epoch: 6 cost time: 0.4277379512786865
Epoch: 6, Steps: 253 | Train Loss: 0.3865888 Vali Loss: 0.8965758 Test Loss: 0.4061241
EarlyStopping counter: 3 out of 3
Early stopping
>>>>>>>testing : ETTh1_336_192_DLinear_ETTh1_ftM_sl336_ll48_pl192_dm512_nh8_el2_dl1_df2048_fc1_ebtimeF_dtTrue_Exp_0<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
test 2689
mse:0.4434865117073059, mae:0.449925035238266
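
The early-stopping messages in the run above can be replayed from the logged validation losses. This is a simplified sketch, assuming the stopper resets its counter on any strict improvement and stops once the counter reaches `patience=3`. The helper `replay_early_stopping` is hypothetical and may differ in detail from the repository's implementation.

```python
def replay_early_stopping(vali_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    stop_epoch is None if the patience limit is never reached.
    Epochs are 1-indexed to match the log.
    """
    best = float("inf")
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(vali_losses, start=1):
        if loss < best:
            # Improvement: remember this epoch and reset the counter.
            best, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation losses copied from the pred_len=192 run above.
vali_192 = [0.9690023, 0.9542814, 0.8765706, 0.9650614, 0.8844976, 0.8965758]
stop_epoch, best_epoch = replay_early_stopping(vali_192, patience=3)
print(f"stopped after epoch {stop_epoch}, best validation loss at epoch {best_epoch}")
```

Under these assumptions, the checkpoint saved for testing comes from the epoch with the best validation loss, not from the last epoch trained.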

Args in experiment:
Namespace(is_training=1, train_only=False, model_id='ETTh1_336_336', model='DLinear', data='ETTh1', root_path='./dataset/', data_path='ETTh1.csv', features='M', target='OT', freq='h', checkpoints='./checkpoints/', seq_len=336, label_len=48, pred_len=336, individual=False, embed_type=0, enc_in=7, dec_in=7, c_out=7, d_model=512, n_heads=8, e_layers=2, d_layers=1, d_ff=2048, moving_avg=25, factor=1, distil=True, dropout=0.05, embed='timeF', activation='gelu', output_attention=False, do_predict=False, num_workers=10, itr=1, train_epochs=10, batch_size=32, patience=3, learning_rate=0.005, des='Exp', loss='mse', lradj='type1', use_amp=False, use_gpu=True, gpu=0, use_multi_gpu=False, devices='0,1,2,3', test_flop=False)
Use GPU: cuda:0
>>>>>>>start training : ETTh1_336_336_DLinear_ETTh1_ftM_sl336_ll48_pl336_dm512_nh8_el2_dl1_df2048_fc1_ebtimeF_dtTrue_Exp_0>>>>>>>>>>>>>>>>>>>>>>>>>>
train 7969
val 2545
test 2545
	iters: 100, epoch: 1 | loss: 0.5185504
	speed: 0.0070s/iter; left time: 16.7609s
	iters: 200, epoch: 1 | loss: 0.4845584
	speed: 0.0012s/iter; left time: 2.7602s
Epoch: 1 cost time: 0.5235846042633057
Epoch: 1, Steps: 249 | Train Loss: 0.5078838 Vali Loss: 1.1275409 Test Loss: 0.4939544
Validation loss decreased (inf --> 1.127541).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.4619045
	speed: 0.0074s/iter; left time: 15.9251s
	iters: 200, epoch: 2 | loss: 0.5337794
	speed: 0.0012s/iter; left time: 2.3671s
Epoch: 2 cost time: 0.4245939254760742
Epoch: 2, Steps: 249 | Train Loss: 0.4874056 Vali Loss: 1.0683818 Test Loss: 0.5289848
Validation loss decreased (1.127541 --> 1.068382).  Saving model ...
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.4296724
	speed: 0.0076s/iter; left time: 14.4112s
	iters: 200, epoch: 3 | loss: 0.4573922
	speed: 0.0012s/iter; left time: 2.1116s
Epoch: 3 cost time: 0.43151092529296875
Epoch: 3, Steps: 249 | Train Loss: 0.4611091 Vali Loss: 1.2806047 Test Loss: 0.4659030
EarlyStopping counter: 1 out of 3
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.4402729
	speed: 0.0079s/iter; left time: 12.9769s
	iters: 200, epoch: 4 | loss: 0.5046263
	speed: 0.0012s/iter; left time: 1.8594s
Epoch: 4 cost time: 0.43999457359313965
Epoch: 4, Steps: 249 | Train Loss: 0.4484481 Vali Loss: 1.0534259 Test Loss: 0.4497196
Validation loss decreased (1.068382 --> 1.053426).  Saving model ...
Updating learning rate to 0.000625
	iters: 100, epoch: 5 | loss: 0.4836571
	speed: 0.0079s/iter; left time: 11.0167s
	iters: 200, epoch: 5 | loss: 0.3926441
	speed: 0.0012s/iter; left time: 1.6136s
Epoch: 5 cost time: 0.4381873607635498
Epoch: 5, Steps: 249 | Train Loss: 0.4405276 Vali Loss: 1.0723035 Test Loss: 0.4386732
EarlyStopping counter: 1 out of 3
Updating learning rate to 0.0003125
	iters: 100, epoch: 6 | loss: 0.4520393
	speed: 0.0081s/iter; left time: 9.2720s
	iters: 200, epoch: 6 | loss: 0.4383180
	speed: 0.0012s/iter; left time: 1.2263s
Epoch: 6 cost time: 0.44292569160461426
Epoch: 6, Steps: 249 | Train Loss: 0.4367764 Vali Loss: 1.0381850 Test Loss: 0.4469434
Validation loss decreased (1.053426 --> 1.038185).  Saving model ...
Updating learning rate to 0.00015625
	iters: 100, epoch: 7 | loss: 0.5690084
	speed: 0.0075s/iter; left time: 6.7392s
	iters: 200, epoch: 7 | loss: 0.4597495
	speed: 0.0012s/iter; left time: 0.9326s
Epoch: 7 cost time: 0.4305448532104492
Epoch: 7, Steps: 249 | Train Loss: 0.4349577 Vali Loss: 1.0773293 Test Loss: 0.4334371
EarlyStopping counter: 1 out of 3
Updating learning rate to 7.8125e-05
	iters: 100, epoch: 8 | loss: 0.4200609
	speed: 0.0078s/iter; left time: 5.0615s
	iters: 200, epoch: 8 | loss: 0.4240857
	speed: 0.0012s/iter; left time: 0.6316s
Epoch: 8 cost time: 0.4108271598815918
Epoch: 8, Steps: 249 | Train Loss: 0.4337557 Vali Loss: 1.0759537 Test Loss: 0.4318466
EarlyStopping counter: 2 out of 3
Updating learning rate to 3.90625e-05
	iters: 100, epoch: 9 | loss: 0.4291583
	speed: 0.0079s/iter; left time: 3.1685s
	iters: 200, epoch: 9 | loss: 0.4364204
	speed: 0.0012s/iter; left time: 0.3636s
Epoch: 9 cost time: 0.4447150230407715
Epoch: 9, Steps: 249 | Train Loss: 0.4331892 Vali Loss: 1.0747591 Test Loss: 0.4318236
EarlyStopping counter: 3 out of 3
Early stopping
>>>>>>>testing : ETTh1_336_336_DLinear_ETTh1_ftM_sl336_ll48_pl336_dm512_nh8_el2_dl1_df2048_fc1_ebtimeF_dtTrue_Exp_0<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
test 2545
mse:0.4469250440597534, mae:0.4483712911605835

Args in experiment:
Namespace(is_training=1, train_only=False, model_id='ETTh1_336_720', model='DLinear', data='ETTh1', root_path='./dataset/', data_path='ETTh1.csv', features='M', target='OT', freq='h', checkpoints='./checkpoints/', seq_len=336, label_len=48, pred_len=720, individual=False, embed_type=0, enc_in=7, dec_in=7, c_out=7, d_model=512, n_heads=8, e_layers=2, d_layers=1, d_ff=2048, moving_avg=25, factor=1, distil=True, dropout=0.05, embed='timeF', activation='gelu', output_attention=False, do_predict=False, num_workers=10, itr=1, train_epochs=10, batch_size=32, patience=3, learning_rate=0.005, des='Exp', loss='mse', lradj='type1', use_amp=False, use_gpu=True, gpu=0, use_multi_gpu=False, devices='0,1,2,3', test_flop=False)
Use GPU: cuda:0
>>>>>>>start training : ETTh1_336_720_DLinear_ETTh1_ftM_sl336_ll48_pl720_dm512_nh8_el2_dl1_df2048_fc1_ebtimeF_dtTrue_Exp_0>>>>>>>>>>>>>>>>>>>>>>>>>>
train 7585
val 2161
test 2161
	iters: 100, epoch: 1 | loss: 0.6062934
	speed: 0.0071s/iter; left time: 16.1237s
	iters: 200, epoch: 1 | loss: 0.6095265
	speed: 0.0013s/iter; left time: 2.8664s
Epoch: 1 cost time: 0.5387582778930664
Epoch: 1, Steps: 237 | Train Loss: 0.5910378 Vali Loss: 1.2456728 Test Loss: 0.6004225
Validation loss decreased (inf --> 1.245673).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.6238616
	speed: 0.0079s/iter; left time: 16.1546s
	iters: 200, epoch: 2 | loss: 0.5666766
	speed: 0.0013s/iter; left time: 2.5445s
Epoch: 2 cost time: 0.4538290500640869
Epoch: 2, Steps: 237 | Train Loss: 0.5705364 Vali Loss: 1.2542794 Test Loss: 0.5699026
EarlyStopping counter: 1 out of 3
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.5516502
	speed: 0.0079s/iter; left time: 14.1166s
	iters: 200, epoch: 3 | loss: 0.5464033
	speed: 0.0014s/iter; left time: 2.3400s
Epoch: 3 cost time: 0.4545300006866455
Epoch: 3, Steps: 237 | Train Loss: 0.5410811 Vali Loss: 1.2246164 Test Loss: 0.5051882
Validation loss decreased (1.245673 --> 1.224616).  Saving model ...
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.5571614
	speed: 0.0076s/iter; left time: 11.7834s
	iters: 200, epoch: 4 | loss: 0.5702764
	speed: 0.0014s/iter; left time: 2.0257s
Epoch: 4 cost time: 0.4412670135498047
Epoch: 4, Steps: 237 | Train Loss: 0.5253573 Vali Loss: 1.2611449 Test Loss: 0.4813100
EarlyStopping counter: 1 out of 3
Updating learning rate to 0.000625
	iters: 100, epoch: 5 | loss: 0.5861177
	speed: 0.0080s/iter; left time: 10.5206s
	iters: 200, epoch: 5 | loss: 0.5146908
	speed: 0.0013s/iter; left time: 1.6404s
Epoch: 5 cost time: 0.464583158493042
Epoch: 5, Steps: 237 | Train Loss: 0.5185215 Vali Loss: 1.2298431 Test Loss: 0.4759401
EarlyStopping counter: 2 out of 3
Updating learning rate to 0.0003125
	iters: 100, epoch: 6 | loss: 0.4870183
	speed: 0.0076s/iter; left time: 8.2522s
	iters: 200, epoch: 6 | loss: 0.5146449
	speed: 0.0013s/iter; left time: 1.3303s
Epoch: 6 cost time: 0.4479823112487793
Epoch: 6, Steps: 237 | Train Loss: 0.5148075 Vali Loss: 1.2609578 Test Loss: 0.4583052
EarlyStopping counter: 3 out of 3
Early stopping
>>>>>>>testing : ETTh1_336_720_DLinear_ETTh1_ftM_sl336_ll48_pl720_dm512_nh8_el2_dl1_df2048_fc1_ebtimeF_dtTrue_Exp_0<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
test 2161
mse:0.5042446851730347, mae:0.5145779848098755

Across all four prediction lengths (96, 192, 336, and 720), the final test MSE and MAE are consistently worse than the reported benchmark values.
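
To make the comparison easier, here is a small sketch that collects the final `mse` / `mae` lines from the four runs into one summary. The metric lines are copied from the logs above. The regular expression and the helper name `parse_metrics` are my own and are not part of the repository.

```python
import re

# Final-metric lines copied from the four runs above, in order.
LOG_TAIL = """\
mse:0.3841443955898285, mae:0.40471312403678894
mse:0.4434865117073059, mae:0.449925035238266
mse:0.4469250440597534, mae:0.4483712911605835
mse:0.5042446851730347, mae:0.5145779848098755
"""
PRED_LENS = [96, 192, 336, 720]

# Matches lines of the form "mse:<float>, mae:<float>".
METRIC_RE = re.compile(r"mse:([0-9.eE+-]+),\s*mae:([0-9.eE+-]+)")


def parse_metrics(text):
    """Return a list of (mse, mae) float pairs found in `text`."""
    return [(float(mse), float(mae)) for mse, mae in METRIC_RE.findall(text)]


results = dict(zip(PRED_LENS, parse_metrics(LOG_TAIL)))
for pred_len, (mse, mae) in results.items():
    print(f"pred_len={pred_len:>3}  mse={mse:.3f}  mae={mae:.3f}")
```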

Any feedback is appreciated!
