
show L2 norm of parameters during training. #3925


Merged: 6 commits merged into kaldi-asr:pybind11 on Feb 19, 2020

Conversation

csukuangfj
Contributor

In addition, set affine to false for batchnorm layers and switch to SGD optimizer.

The training is still running and
a screenshot of the L2-norms of the training parameters is as follows:

[Screenshot: L2 norms of the training parameters, 2020-02-12 09:05]

I will post the decoding results once it is done.
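
A minimal sketch of how such per-parameter L2 norms can be logged during training (assuming a standard PyTorch nn.Module; the helper name below is illustrative, not necessarily the code in this PR):

import torch


def log_param_l2_norms(model: torch.nn.Module) -> str:
    """Return the L2 (Frobenius) norm of every trainable weight matrix."""
    parts = []
    for name, param in model.named_parameters():
        if param.requires_grad and name.endswith('.weight'):
            # torch.norm defaults to the Frobenius norm for a matrix.
            parts.append('{}: {:.4f}'.format(name, torch.norm(param).item()))
    return ' '.join(parts)


# Example usage inside the training loop, e.g. once per epoch:
# print(log_param_l2_norms(model))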

In addition, set affine to false for batchnorm layers and switch to SGD optimizer.
@danpovey
Contributor

danpovey commented Feb 12, 2020 via email

@csukuangfj
Contributor Author

I kept affine==True for input batchnorm since I wanted it to simulate LDA, which is an affine transform.

I will set affine==False now.
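
For reference, a minimal sketch of the two settings being discussed, assuming the input batchnorm is a plain nn.BatchNorm1d over the feature dimension (the 40-dim feature size is illustrative):

import torch.nn as nn

feat_dim = 40  # e.g. 40-dim MFCC/fbank features

# affine=True: batchnorm also learns a per-dimension scale and offset,
# so it can loosely play the role of an affine (LDA-like) input transform.
input_bn_affine = nn.BatchNorm1d(num_features=feat_dim, affine=True)

# affine=False: normalization only, no learned scale/offset; the setting
# adopted for the input layer in this PR.
input_bn_plain = nn.BatchNorm1d(num_features=feat_dim, affine=False)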

* change stride kernel(3,1) to stride kernel(2,2)

* make subsampling readable

* make model trainable
@danpovey
Contributor

Should I merge this?

@csukuangfj
Contributor Author

I am running training with affine == False for the input batch norm.

Please wait a moment.

@csukuangfj
Contributor Author

I am comparing

  • adam + affine == false/true
  • sgd + affine == false/true

where affine == false/true is for the input batch norm layer.


I will add DistributedDataParallel to use multiple GPUs to reduce training time.
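
A minimal sketch of wrapping the model in DistributedDataParallel (assuming one process per GPU and the usual NCCL backend; the launcher and process-group environment variables are omitted):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(model, local_rank):
    # One process per GPU; NCCL is the usual backend for multi-GPU training.
    # Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the launcher.
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])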

@fanlu

fanlu commented Feb 13, 2020

|          | PyTorch TDNN-F, stride kernel (2,2) | + shuffle=True in egs dataset DataLoader |
|----------|-------------------------------------|------------------------------------------|
| test_cer | 8.45                                | 8.33                                     |
| test_wer | 17.37                               | 17.16                                    |
| dev_cer  | 7.03                                | 6.84                                     |
| dev_wer  | 15.22                               | 15.00                                    |

This result is based on this recipe: data generated by the #3868 tdnn_1c recipe up to stage 16, with run_chain.sh starting at stage 3.
The LDA mat is at exp/chain_cleaned_1c/tdnn1c_sp/lda.mat.
Adam optimizer with a multi-step LR schedule, milestones [1, 3, 5].
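
A minimal sketch of this setup (shuffled egs DataLoader, Adam, MultiStepLR with milestones [1, 3, 5]); the tiny model and dataset are dummy stand-ins for the real chain model and egs, and the 0.1 decay factor is an assumption since only the milestones are stated above:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(40, 10)                                         # dummy model
egs_dataset = TensorDataset(torch.randn(64, 40), torch.randn(64, 10))   # dummy egs

loader = DataLoader(egs_dataset, batch_size=8, shuffle=True)            # shuffle=True as above

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = MultiStepLR(optimizer, milestones=[1, 3, 5], gamma=0.1)

for epoch in range(6):
    for x, y in loader:
        loss = torch.nn.functional.mse_loss(model(x), y)                # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()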

@csukuangfj
Contributor Author

csukuangfj commented Feb 14, 2020

I have performed 4 experiments with the following settings:

  • adam + affine == false
  • adam + affine == true
  • sgd + affine == false
  • sgd + affine == true

where affine == false means the input batchnorm has affine == false.

The results are as follows

|          | adam (false) | adam (true) | sgd (false) | sgd (true) |
|----------|--------------|-------------|-------------|------------|
| test cer | 9.08         | 9.10        | 10.34       | 10.37      |
| test wer | 17.90        | 17.84       | 19.41       | 19.45      |
| dev cer  | 7.28         | 7.35        | 8.27        | 8.43       |
| dev wer  | 15.44        | 15.51       | 16.68       | 16.90      |

For the AIShell dataset:

  • Adam performs better than SGD
  • we should set affine==false for the input batchnorm.

@csukuangfj
Contributor Author

csukuangfj commented Feb 14, 2020

@fanlu's results show that LDA gives a lower CER/WER than an input batchnorm.

The problem is that, in the current Kaldi implementation, we have to build an nnet3
network explicitly to get lda.mat.

@csukuangfj
Contributor Author

Enabling shuffle in the egs DataLoader improves CER/WER a little.

|          | adam (affine==false), without shuffle | adam (affine==false), with shuffle |
|----------|---------------------------------------|------------------------------------|
| test cer | 9.17                                  | 9.05                               |
| test wer | 17.91                                 | 17.79                              |
| dev cer  | 7.34                                  | 7.15                               |
| dev wer  | 15.52                                 | 15.29                              |

@danpovey
Contributor

danpovey commented Feb 14, 2020 via email

@qindazhu
Contributor

BTW, it seems the current PyTorch model (TDNN_1c, that is @csukuangfj's model in #3892, not @fanlu's new model) is underfitting to some extent; the result gets better (the best so far) with a lower l2_regularize.

|          | TDNN-F (PyTorch) | tdnn_1c_rd_rmc_rng |
|----------|------------------|--------------------|
| dev_cer  | 6.70             | 5.99               |
| dev_wer  | 14.75            | 13.86              |
| test_cer | 8.13             | 7.08               |
| test_wer | 16.83            | 15.72              |

My config is

-    opts.l2_regularize = 5e-4
+    opts.l2_regularize = 1e-4
     opts.leaky_hmm_coefficient = 0.1
+    opts.xent_regularize = 0.1
+    opts.out_of_range_regularize = 0.01

I used Adam with affine=true.

Besides l2_regularize, I guess the threshold of clip_grad_value_ matters as well. I tried clip_grad_value_(model.parameters(), 4.8) instead of 5.0 in the current script, but it leads to warnings about nnet outputs being out of range.
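
For reference, a sketch of where that clipping call sits in a training step (the loss computation, model, batch and optimizer here are placeholders, not the code in this repo):

import torch

optimizer.zero_grad()
loss = compute_chain_loss(model, batch)   # hypothetical helper for the chain objective
loss.backward()
# Clip every gradient element to [-5.0, 5.0] before the parameter update.
torch.nn.utils.clip_grad_value_(model.parameters(), 5.0)
optimizer.step()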

A lower l2_regularize should be tried. For comparison, this is the current global average objf:

2020-02-14 14:39:04,638 INFO [train.py:102] Process 9100/9444(96.357476%) global average objf: -0.039630 over 51210880.0 frames, current batch average objf: -0.036466 over 6400 frames, epoch 5
2020-02-14 14:39:24,135 INFO [train.py:102] Process 9200/9444(97.416349%) global average objf: -0.039632 over 51779328.0 frames, current batch average objf: -0.033343 over 6400 frames, epoch 5
2020-02-14 14:39:43,281 INFO [train.py:102] Process 9300/9444(98.475222%) global average objf: -0.039640 over 52335744.0 frames, current batch average objf: -0.031165 over 6400 frames, epoch 5
2020-02-14 14:40:02,384 INFO [train.py:102] Process 9400/9444(99.534096%) global average objf: -0.039642 over 52895360.0 frames, current batch average objf: -0.050952 over 3840 frames, epoch 5
2020-02-14 14:40:10,763 INFO [common.py:61] Save checkpoint to exp/chain_cleaned_pybind/tdnn1c_sp/best_model.pt: epoch=5, learning_rate=1.5625e-05, objf=-0.039643600787482226

@danpovey
Contributor

danpovey commented Feb 14, 2020 via email

@csukuangfj
Contributor Author

global average objf: -0.039642 is quite high.

Mine is about -0.06. I'll try what you have suggested.

I think the default value for opts.out_of_range_regularize is 0.01.

ChainTrainingOptions(): l2_regularize(0.0), out_of_range_regularize(0.01),

I will

  • set l2_regularize to 1e-4
  • set xent_regularize to 0.1
  • disable gradient clipping

and see what happens. Then I will use conv1d to compute delta-delta features (see the sketch below).
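
A sketch of the conv1d-based delta idea (the 5-tap filter follows the usual 2-frame regression window; this is an illustration, not the code planned for this PR):

import torch
import torch.nn.functional as F


def add_deltas(feat):
    # feat: (N, T, C) -> (N, T, 3*C) with delta and delta-delta appended.
    N, T, C = feat.shape
    x = feat.permute(0, 2, 1).reshape(N * C, 1, T)   # one conv channel per feature dim
    kernel = torch.tensor([[[-0.2, -0.1, 0.0, 0.1, 0.2]]],
                          dtype=feat.dtype, device=feat.device)
    d = F.conv1d(F.pad(x, (2, 2), mode='replicate'), kernel)    # delta
    dd = F.conv1d(F.pad(d, (2, 2), mode='replicate'), kernel)   # delta-delta
    d = d.reshape(N, C, T).permute(0, 2, 1)
    dd = dd.reshape(N, C, T).permute(0, 2, 1)
    return torch.cat([feat, d, dd], dim=2)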

@qindazhu
Contributor

The above result was obtained with Adam rather than SGD. I tried SGD, but it seems to need more tricks with the learning rate.

@danpovey
Contributor

danpovey commented Feb 14, 2020 via email

@qindazhu
Contributor

global average objf: -0.039642 is quite high.

Mine is about -0.06. I'll try what you have suggested.

I think the default value for opts.out_of_range_regularize is 0.01.

ChainTrainingOptions(): l2_regularize(0.0), out_of_range_regularize(0.01),

I will

  • set l2_regularize to 1e-4
  • set xent_regularize to 0.1
  • disable gradient clipping

and see what happens. Then I will use conv1d to compute delta-delta features.

@csukuangfj, the higher the objf, the better... see your code and logs.

@danpovey
Contributor

danpovey commented Feb 14, 2020 via email

@csukuangfj
Contributor Author

@danpovey
thanks.

I will compute the objective function value of
train_diagnostic.cegs and valid_diagnostic.cegs
generated by get_egs.sh.

@qindazhu
Contributor

I tried different learning rate values and different decay rates with SGD, but they all give worse results than Adam. Do you have any idea how to figure out the reason?

BTW, in my last try with Adam (above), it seems the objf went up and down at this learning rate:

 learning_rate=1.5625e-05
2020-02-14 14:36:32,193 INFO [train.py:102] Process 8300/9444(87.886489%) global average objf: -0.039626 over 46693888.0 frames, current batch average objf: -0.037104 over 6400 frames, epoch 5
2020-02-14 14:36:51,451 INFO [train.py:102] Process 8400/9444(88.945362%) global average objf: -0.039618 over 47272320.0 frames, current batch average objf: -0.045614 over 3840 frames, epoch 5
2020-02-14 14:37:10,245 INFO [train.py:102] Process 8500/9444(90.004235%) global average objf: -0.039622 over 47832064.0 frames, current batch average objf: -0.040992 over 6400 frames, epoch 5
2020-02-14 14:37:29,049 INFO [train.py:102] Process 8600/9444(91.063109%) global average objf: -0.039637 over 48383360.0 frames, current batch average objf: -0.043111 over 6400 frames, epoch 5
2020-02-14 14:37:47,949 INFO [train.py:102] Process 8700/9444(92.121982%) global average objf: -0.039640 over 48935296.0 frames, current batch average objf: -0.052075 over 3840 frames, epoch 5
2020-02-14 14:38:07,174 INFO [train.py:102] Process 8800/9444(93.180856%) global average objf: -0.039634 over 49503360.0 frames, current batch average objf: -0.046243 over 4736 frames, epoch 5
2020-02-14 14:38:26,456 INFO [train.py:102] Process 8900/9444(94.239729%) global average objf: -0.039627 over 50077312.0 frames, current batch average objf: -0.038564 over 6400 frames, epoch 5
2020-02-14 14:38:45,522 INFO [train.py:102] Process 9000/9444(95.298602%) global average objf: -0.039628 over 50638592.0 frames, current batch average objf: -0.043415 over 3840 frames, epoch 5
2020-02-14 14:39:04,638 INFO [train.py:102] Process 9100/9444(96.357476%) global average objf: -0.039630 over 51210880.0 frames, current batch average objf: -0.036466 over 6400 frames, epoch 5

@danpovey
Contributor

With the SGD run, compare the norms of the parameter matrices with those of Kaldi's model and see if any are significantly different. If those printed values are for minibatches, the variation is probably normal; sometimes you'll get easier or harder examples.

@qindazhu
Contributor

With the SGD run, compare the norms of the parameter matrices with those of Kaldi's model and see if any are significantly different. If those printed values are for minibatches, the variation is probably normal; sometimes you'll get easier or harder examples.

Ok, thanks Dan. I'll try a lower l2 first, and then tune it with SGD following your suggestion.

Also, I suggest we do not change the model structure, at least for now; it seems we can get close results with the same configuration as tdnn_1c_rd_rmc_rng, and updating the model config may make things more complex.

@danpovey
Contributor

danpovey commented Feb 14, 2020 via email

@csukuangfj
Contributor Author

@qindazhu
I agree with you.

For my previous pull request, I used [-1, 0, 1] for the first linear layer, which
has left context == 1 and right context == 1.

But the current model structure has kernel size 2 for the first linear layer
and kernel size 2 for the second affine layer. I think the combined TDNN-F layer
has only one-sided context, i.e., either left context == 2 and right context == 0,
or left context == 0 and right context == 2, depending on how PyTorch implements it
(see the small check below).

I merged it since fanlu said it has a better CER/WER.

I am wondering whether the left/right context used in generating egs is still appropriate
for the current model structure.
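
A small toy check of that context argument (not the model code in this PR): two stacked Conv1d layers with kernel_size=2 and no padding consume two frames in total, i.e. output frame t depends on input frames t .. t+2, so the context is indeed one-sided unless the implementation shifts or pads the input.

import torch
import torch.nn as nn

conv1 = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=2, bias=False)
conv2 = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=2, bias=False)

x = torch.arange(10.0).view(1, 1, 10)   # 10 input frames
y = conv2(conv1(x))
print(y.shape)                          # torch.Size([1, 1, 8]): two frames of context consumed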

@qindazhu
Contributor

@csukuangfj I just kept all the other configuration from your first PR, which is almost the same as tdnn_1c_rd_rmc_rng:

model_left_context=28
model_right_context=28
egs_left_context=$[$model_left_context + 1]
egs_right_context=$[$model_right_context + 1]
frames_per_eg=150,110,90
frames_per_iter=1500000
minibatch_size=128

num_epochs=6
lr=1e-3

hidden_dim=1024
bottleneck_dim=128
time_stride_list="1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1" # comma separated list
conv_stride_list="1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1" # comma separated list

The only difference is that I re-used the data/features from tdnn_1c_rd_rmc_rng in get_egs, but that is only because I wanted to avoid extracting features again in the PyTorch setup, and I don't think it makes a difference.

@qindazhu
Contributor

Anyway, shall we keep the same model configuration (your first PR, tdnn_1c_rd_rmc_rng) for now, so that results are easier to compare?

@danpovey
Contributor

danpovey commented Feb 18, 2020 via email

@qindazhu
Contributor

Ok, thanks! Let me just try more epochs first, since the l2 and initial learning rate seem right according to my previous experiments (a larger or smaller l2/lr did not improve the objf in epoch 0). Then I'll try decreasing the rate in the last epochs.

@csukuangfj
Contributor Author

$ steps/info/chain_dir_info.pl exp-haowen/chain_cleaned_1c/tdnn1c_sp/
exp-haowen/chain_cleaned_1c/tdnn1c_sp/: num-iters=79 nj=3..12 
num-params=9.3M dim=40->3456 combine=-0.032->-0.032 (over 2) 
xent:train/valid[51,78,final]=(-0.726,-0.550,-0.542/-0.735,-0.584,-0.576) 
logprob:train/valid[51,78,final]=(-0.050,-0.033,-0.033/-0.052,-0.042,-0.041)

Yes, I am wondering why I get worse results than haowen's and fanlu's.

I think haowen used the same network architecture as mine for PyTorch, i.e.,
[-1, 0, 1] for the first linear layer.

@danpovey
Contributor

danpovey commented Feb 18, 2020 via email

@qindazhu
Contributor

Do you follow the scripts in this PR #3868? I guess maybe the alignment before nnet training matters, which differs in your PyTorch scripts?

@csukuangfj
Contributor Author

    opts.l2_regularize = 5e-5
    opts.leaky_hmm_coefficient = 0.1
    opts.xent_regularize = 0.1
    opts.out_of_range_regularize = 0.01

    optimizer = optim.Adam(model.parameters(),
                           lr=learning_rate,
                           weight_decay=5e-4)

    learning_rate = 1e-3 * (0.4 ** epoch)  # epoch in [0, 1, 2, 3, 4, 5]

I used the above settings suggested by haowen and replaced the multistep learning rate scheduler with

learning_rate = 1e-3 * (0.4 ^ epoch),  where epoch <- [0,1,2,3,4,5] 
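
That schedule is a plain exponential decay with gamma = 0.4, so it can be expressed with a standard scheduler (a sketch; `optimizer` stands for the Adam optimizer shown above):

from torch.optim.lr_scheduler import ExponentialLR

# lr(epoch) = 1e-3 * 0.4 ** epoch for epoch in [0, 1, 2, 3, 4, 5]
scheduler = ExponentialLR(optimizer, gamma=0.4)

for epoch in range(6):
    ...  # train for one epoch here
    scheduler.step()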

I guess maybe alignment before nnet training matters

My alignment generated by run.sh is different from your run_kaldi.sh.

Should I replace run.sh with your run_kaldi.sh ?


I will double the number of epochs, i.e., from 6 to 12.

@qindazhu
Contributor

I assume you were already using a learning rate schedule where it decreases in later epochs (?) That is definitely necessary. Aim for a factor of 10 or 20, maybe.


Yes, the current learning rate schedule is

    optimizer = optim.SGD(model.parameters(),
                          lr=learning_rate,
                          momentum=0.9)

    learning_rate = 1e-3 * (0.5 ** epoch)  # epoch in [0, 1, 2, 3, 4, 5]

The initial lr is 1e-3 and the final lr is 3.125e-5, which is initial_lr * 0.03125.

@qindazhu
Contributor

Should I replace run.sh with your run_kaldi.sh?

I guess so; maybe @fanlu can help confirm this, as I have never run run.sh before. I just re-use the data from run_kaldi.sh to run the experiments in PyTorch.

The other difference is that I use the files from your first PR #3892; here is my local git log:

commit 154e36680bf401bc3eeff3403f899b4be68667c1
Author: Fangjun Kuang <[email protected]>
Date:   Fri Jan 31 12:22:48 2020 +0800

    update training scripts.

diff --git a/egs/aishell/s10/chain/model.py b/egs/aishell/s10/chain/model.py
index 99ab1a0..39d7acb 100644
--- a/egs/aishell/s10/chain/model.py
+++ b/egs/aishell/s10/chain/model.py
@@ -201,6 +201,14 @@ class ChainModel(nn.Module):

         return nnet_output, xent_output

+    def constrain_orthonormal(self):
+        for i in range(len(self.tdnnfs)):
+            self.tdnnfs[i].constrain_orthonormal()
+
+        self.prefinal_l.constrain_orthonormal()
+        self.prefinal_chain.constrain_orthonormal()
+        self.prefinal_xent.constrain_orthonormal()
+

 if __name__ == '__main__':
     feat_dim = 43
@@ -212,3 +220,4 @@ if __name__ == '__main__':
     x = torch.arange(N * T * C).reshape(N, T, C).float()
     nnet_output, xent_output = model(x)
     print(x.shape, nnet_output.shape, xent_output.shape)
+    model.constrain_orthonormal()

commit 7c7dda3bd4bb94071364797a97647073816f08b4
Author: Fangjun Kuang <[email protected]>
Date:   Thu Jan 30 21:10:28 2020 +0800

    update model to use TDNNF.
     
     ......

+                    bottleneck_dim,
+                    time_stride_list,
+                    conv_stride_list,
                     lda_mat_filename=None):
     model = ChainModel(feat_dim=feat_dim,
                        output_dim=output_dim,
                        lda_mat_filename=lda_mat_filename,
                        hidden_dim=hidden_dim,
-                       kernel_size_list=kernel_size_list,
-                       stride_list=stride_list)
+                       time_stride_list=time_stride_list,
+                       conv_stride_list=conv_stride_list)
     return model
    
    ......

@fanlu

fanlu commented Feb 18, 2020

@csukuangfj please follow haowen's PR #3868 and rerun egs/aishell/s10/local/run_tdnn_1c.sh up to stage 16, then run your recipe egs/aishell/s10/local/run_chain.sh starting from stage 3.
The differences are speed perturbation and the number of tree leaves.

@csukuangfj
Contributor Author

The model in this pull request is equivalent to the one in my first pull request.

I will switch to run_kaldi.sh now.

@qindazhu
Contributor

For SGD, I just found that the L2 norms of the parameters are too big compared with Kaldi's:

kaldi:  [tdnn1.affine:13.066 tdnnf2.linear:9.77364 tdnnf2.affine:13.2666 tdnnf3.linear:9.99066 tdnnf3.affine:12.3348 tdnnf4.linear:9.14051 tdnnf4.affine:12.1615 tdnnf5.linear:7.58906 tdnnf5.affine:11.355 tdnnf6.linear:9.04031 tdnnf6.affine:12.423 tdnnf7.linear:8.99408 tdnnf7.affine:12.4773 tdnnf8.linear:8.75175 tdnnf8.affine:12.3263 tdnnf9.linear:8.6019 tdnnf9.affine:12.028 tdnnf10.linear:8.49032 tdnnf10.affine:11.9081 tdnnf11.linear:8.28274 tdnnf11.affine:11.9213 tdnnf12.linear:8.25842 tdnnf12.affine:11.7487 tdnnf13.linear:8.19186 tdnnf13.affine:11.6624 prefinal-l:13.9765 prefinal-chain.affine:11.6501 prefinal-chain.linear:12.3917 output.affine:22.0577 prefinal-xent.affine:10.6057 prefinal-xent.linear:9.92937 output-xent.affine:50.9147 ]
pytorch:[tdnn1.affine:112.8931,tdnnf2.linear:102.4325,tdnnf2.affine:75.7402,tdnnf3.linear:86.0951,tdnnf3.affine:62.5637,tdnnf4.linear:78.2900,tdnnf4.affine:56.5584,tdnnf5.linear:48.1624,tdnnf5.affine:46.7811,tdnnf6.linear:89.7642,tdnnf6.affine:65.7036,tdnnf7.linear:90.6984,tdnnf7.affine:69.4275,tdnnf8.linear:89.6295,tdnnf8.affine:66.2355,tdnnf9.linear:87.3545,tdnnf9.affine:63.2102,tdnnf10.linear:83.4296,tdnnf10.affine:60.1824,tdnnf11.linear:83.5425,tdnnf11.affine:58.1394,tdnnf12.linear:82.4880,tdnnf12.affine:57.8074,tdnnf13.linear:85.2806,tdnnf13.affine:62.5397,prefinal_l:95.0566,prefinal_chain.affine:78.2842,prefinal_chain.linear:95.9710,output_affine:110.7365,prefinal_xent.affine:60.7356,prefinal_xent.linear:75.7621,output_xent_affine:121.6411]

How about yours? Just to make sure I am not computing it incorrectly:

import torch

norm_str = ''
for name, param in model.named_parameters():
    if param.requires_grad and name.endswith('.weight'):
        name = change_name_to_align_with_kaldi(name)  # helper that maps PyTorch names to Kaldi's
        norm_str = norm_str + '{}:{:.4f},'.format(name, torch.norm(param, 2))

@csukuangfj
Contributor Author

csukuangfj commented Feb 18, 2020 via email

@qindazhu
Contributor

I think norm(param, 2) or norm(param) is equivalent to norm(param, p='fro'), and why should I take the square root again after calling norm, when norm already takes the square root internally?
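
A quick toy check of that equivalence:

import torch

w = torch.randn(128, 256)
print(torch.norm(w, 2))          # 2-norm of the flattened matrix
print(torch.norm(w, p='fro'))    # Frobenius norm: the same value
print(w.pow(2).sum().sqrt())     # written out explicitly, also the same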

@csukuangfj
Contributor Author

@qindazhu

oh, yes, you're right.


Now I have changed the data processing pipeline to match yours
and the script is still running. I will post the results tomorrow.

@fanlu

fanlu commented Feb 18, 2020

This is my SGD model's L2 norms:

tdnn1_affine: 145.0203 tdnnfs.0.linear: 143.8808 tdnnfs.0.affine: 113.1185 tdnnfs.1.linear: 127.6596 tdnnfs.1.affine: 92.0355 tdnnfs.2.linear: 110.5569 tdnnfs.2.affine: 86.0150 tdnnfs.3.linear: 82.6030 tdnnfs.3.affine: 67.5434 tdnnfs.4.linear: 115.3349 tdnnfs.4.affine: 101.5720 tdnnfs.5.linear: 116.2821 tdnnfs.5.affine: 99.2401 tdnnfs.6.linear: 116.0093 tdnnfs.6.affine: 98.3118 tdnnfs.7.linear: 119.6492 tdnnfs.7.affine: 103.3342 tdnnfs.8.linear: 118.5094 tdnnfs.8.affine: 104.8650 tdnnfs.9.linear: 125.1008 tdnnfs.9.affine: 109.7343 tdnnfs.10.linear: 126.0465 tdnnfs.10.affine: 111.6607 tdnnfs.11.linear: 141.6897 tdnnfs.11.affine: 129.4766 prefinal_chain.affine: 277.1615 prefinal_chain.linear: 466.6338 output_affine: 147.2012 prefinal_xent.affine: 82.3923 prefinal_xent.linear: 115.3727 output_xent_affine: 135.4102

@csukuangfj
Contributor Author

@qindazhu

It turns out that your run_kaldi.sh generates more batches than my run.sh,
i.e., 9447 vs. 3161.

There is more data generated in run_kaldi.sh, and this may be the reason for the
better performance.

In addition, the training time is nearly tripled.

@qindazhu
Contributor

Yes, it is because of the speed-perturbed data, as in Kaldi's latest scripts. I just thought you had already done this, since fanlu and I have posted logs in this conversation thread. Anyway, let's keep this setting, as it makes the model more invariant to the test data.

@qindazhu
Contributor

This is my SGD model's L2 norms:

tdnn1_affine: 145.0203 tdnnfs.0.linear: 143.8808 tdnnfs.0.affine: 113.1185 tdnnfs.1.linear: 127.6596 tdnnfs.1.affine: 92.0355 tdnnfs.2.linear: 110.5569 tdnnfs.2.affine: 86.0150 tdnnfs.3.linear: 82.6030 tdnnfs.3.affine: 67.5434 tdnnfs.4.linear: 115.3349 tdnnfs.4.affine: 101.5720 tdnnfs.5.linear: 116.2821 tdnnfs.5.affine: 99.2401 tdnnfs.6.linear: 116.0093 tdnnfs.6.affine: 98.3118 tdnnfs.7.linear: 119.6492 tdnnfs.7.affine: 103.3342 tdnnfs.8.linear: 118.5094 tdnnfs.8.affine: 104.8650 tdnnfs.9.linear: 125.1008 tdnnfs.9.affine: 109.7343 tdnnfs.10.linear: 126.0465 tdnnfs.10.affine: 111.6607 tdnnfs.11.linear: 141.6897 tdnnfs.11.affine: 129.4766 prefinal_chain.affine: 277.1615 prefinal_chain.linear: 466.6338 output_affine: 147.2012 prefinal_xent.affine: 82.3923 prefinal_xent.linear: 115.3727 output_xent_affine: 135.4102

Thanks @fanlu. It seems the L2 norm is really too big. Maybe I should try larger l2_regularize values and tune the learning rate again.

@qindazhu
Contributor

In addition, the training time is nearly tripled.

BTW, regarding training time, it seems that PyTorch takes about the same time as Kaldi. Maybe we should do more to make it faster, but I suggest we focus on the WER first and put speed aside for now.

@danpovey
Contributor

danpovey commented Feb 19, 2020 via email

@qindazhu
Contributor

sure, thanks Dan.

@csukuangfj
Contributor Author

Here are the results for the current pull request.

|          | this pull request | haowen's PyTorch | fanlu's PyTorch (kernel (2, 2)) | haowen's Kaldi |
|----------|-------------------|------------------|---------------------------------|----------------|
| test cer | 7.91              | 7.86             | 8.33                            | 7.08           |
| test wer | 16.49             | 16.56            | 17.16                           | 15.72          |
| dev cer  | 6.48              | 6.47             | 6.84                            | 5.99           |
| dev wer  | 14.48             | 14.45            | 15.00                           | 13.86          |

  • training time for 6 epochs in total: 3 hours, 0 minutes, 54 seconds
  • time per epoch: about 36 minutes

@danpovey

Please merge this if it looks good.


Part of the training log is as follows. It seems that the objf value stays at about -0.04
and stops improving. Possibly a lower learning rate should be used.

2020-02-19 11:06:56,382 INFO [train.py:156] Process 8100/9447(85.741505%) global average objf: -0.040148 over 45543040.0 frames, current batch average objf: -0.053073 over 3840 frames, epoch 5
2020-02-19 11:07:18,288 INFO [train.py:156] Process 8200/9447(86.800042%) global average objf: -0.040145 over 46109568.0 frames, current batch average objf: -0.040103 over 6400 frames, epoch 5
2020-02-19 11:07:40,595 INFO [train.py:156] Process 8300/9447(87.858579%) global average objf: -0.040147 over 46676992.0 frames, current batch average objf: -0.040909 over 6400 frames, epoch 5
2020-02-19 11:08:02,810 INFO [train.py:156] Process 8400/9447(88.917117%) global average objf: -0.040150 over 47242496.0 frames, current batch average objf: -0.055194 over 3840 frames, epoch 5
2020-02-19 11:08:25,285 INFO [train.py:156] Process 8500/9447(89.975654%) global average objf: -0.040137 over 47827456.0 frames, current batch average objf: -0.046951 over 3840 frames, epoch 5
2020-02-19 11:08:29,973 INFO [train.py:171] Validation average objf: -0.052603 over 17181.0 frames
2020-02-19 11:08:52,365 INFO [train.py:156] Process 8600/9447(91.034191%) global average objf: -0.040139 over 48395904.0 frames, current batch average objf: -0.034007 over 6400 frames, epoch 5
2020-02-19 11:09:14,369 INFO [train.py:156] Process 8700/9447(92.092728%) global average objf: -0.040137 over 48954752.0 frames, current batch average objf: -0.052936 over 3840 frames, epoch 5
2020-02-19 11:09:36,613 INFO [train.py:156] Process 8800/9447(93.151265%) global average objf: -0.040130 over 49527168.0 frames, current batch average objf: -0.050902 over 3840 frames, epoch 5
2020-02-19 11:09:59,323 INFO [train.py:156] Process 8900/9447(94.209802%) global average objf: -0.040127 over 50103040.0 frames, current batch average objf: -0.046969 over 4736 frames, epoch 5
2020-02-19 11:10:21,018 INFO [train.py:156] Process 9000/9447(95.268339%) global average objf: -0.040128 over 50650752.0 frames, current batch average objf: -0.034439 over 6400 frames, epoch 5
2020-02-19 11:10:25,709 INFO [train.py:171] Validation average objf: -0.053590 over 17181.0 frames
2020-02-19 11:10:47,536 INFO [train.py:156] Process 9100/9447(96.326876%) global average objf: -0.040129 over 51209856.0 frames, current batch average objf: -0.035524 over 6400 frames, epoch 5
2020-02-19 11:11:09,646 INFO [train.py:156] Process 9200/9447(97.385413%) global average objf: -0.040129 over 51772160.0 frames, current batch average objf: -0.041880 over 4736 frames, epoch 5
2020-02-19 11:11:31,622 INFO [train.py:156] Process 9300/9447(98.443950%) global average objf: -0.040129 over 52336128.0 frames, current batch average objf: -0.036277 over 6400 frames, epoch 5
2020-02-19 11:11:53,590 INFO [train.py:156] Process 9400/9447(99.502488%) global average objf: -0.040132 over 52897280.0 frames, current batch average objf: -0.040822 over 6400 frames, epoch 5
2020-02-19 11:12:03,591 INFO [common.py:61] Save checkpoint to exp/chain/train/best_model.pt: epoch=5, learning_rate=1.0240000000000004e-05, objf=-0.040137736733864254

@danpovey
Contributor

Fantastic! And of course thanks to all the others who provided results for comparison: those are valuable too, even if not directly merged.

danpovey merged commit f5875be into kaldi-asr:pybind11 on Feb 19, 2020
@fanlu

fanlu commented Feb 19, 2020

Syncing the latest kernel (2, 2) result:

|          | kernel (2, 2) |
|----------|---------------|
| test cer | 7.92          |
| test wer | 16.58         |
| dev cer  | 6.53          |
| dev wer  | 14.59         |

@danpovey
Contributor

cool...
