show L2 norm of parameters during training. #3925
Conversation
In addition, set affine to false for batchnorm layers and switch to SGD optimizer.
The training is still running and a screenshot of the L2-norms of the training parameters is as follows:
[image: Screen Shot 2020-02-12 at 09 05 51]
I will post the decoding results once it is done.
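Concretely, those two changes look roughly like the following in PyTorch (a minimal sketch; layer sizes and hyperparameters are made up, not the values from egs/aishell/s10):
```python
import torch.nn as nn
import torch.optim as optim

feat_dim, hidden_dim = 40, 512

# affine=False: batchnorm keeps only running mean/var, no learnable scale/shift
input_batch_norm = nn.BatchNorm1d(feat_dim, affine=False)
hidden = nn.Sequential(
    nn.Conv1d(feat_dim, hidden_dim, kernel_size=3),
    nn.BatchNorm1d(hidden_dim, affine=False),
    nn.ReLU(),
)

# switch from Adam to plain SGD
optimizer = optim.SGD(hidden.parameters(), lr=1e-3, momentum=0.0)
```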
Make sure that the batch-norm for the input has affine=False as well (I see input_batch_norm.weight etc.).
Or at least try that; I'm not 100% sure what effect it will have.
…On Wed, Feb 12, 2020 at 9:07 AM Fangjun Kuang ***@***.***> wrote:
In addition, set affine to false for batchnorm layers and switch to SGD
optimizer.
The training is still running and
a screenshot of the L2-norms of the training parameters is as follows:
[image: Screen Shot 2020-02-12 at 09 05 51]
<https://user-images.githubusercontent.com/5284924/74293834-fc253d00-4d76-11ea-9b37-a04953891ee1.png>
I will post the decoding results once it is done.
You can view, comment on, or merge this pull request online at:
#3925
Commit Summary
- show L2 norm of parameters during training.
File Changes
- *M* egs/aishell/s10/chain/inference.py (16)
- *M* egs/aishell/s10/chain/model.py (22)
- *M* egs/aishell/s10/chain/options.py (7)
- *M* egs/aishell/s10/chain/tdnnf_layer.py (6)
- *M* egs/aishell/s10/chain/train.py (40)
- *M* egs/aishell/s10/local/run_chain.sh (3)
Patch Links:
- https://github.com/kaldi-asr/kaldi/pull/3925.patch
- https://github.com/kaldi-asr/kaldi/pull/3925.diff
|
I kept I will set |
* change stride kernel(3,1) to stride kernel(2,2)
* make subsampling readable
* make model trainable
Should I merge this? |
I am running training with affine == False for the input batch norm. Please wait for a moment. |
I am comparing the two runs where affine == false/true for the input batch norm layer. I will add DistributedDataParallel to use multiple GPUs to reduce training time. |
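A minimal sketch of what the DistributedDataParallel wrapping could look like (assumes one process per GPU launched via torch.distributed; `model`, `device_id`, and `egs_dataset` are placeholders, not names from this recipe):
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend='nccl')   # one process per GPU
torch.cuda.set_device(device_id)
model = model.to(device_id)
model = DDP(model, device_ids=[device_id])

# each rank should see a different shard of the egs
sampler = DistributedSampler(egs_dataset)
```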
this result is based on this recipe |
I have performed 4 experiments with the following settings, where ... The results for the AIShell dataset are as follows:
|
@fanlu's results show that ... The problem is that we have to build a nnet3 network explicitly to get ... |
Enabling shuffle in the egs dataloader improves the CER/WER a little bit.
|
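In PyTorch terms that change is just the shuffle flag on the DataLoader (dataset, batch size and collate function below are placeholders):
```python
from torch.utils.data import DataLoader

egs_loader = DataLoader(egs_dataset,
                        batch_size=64,
                        shuffle=True,      # the change being compared above
                        collate_fn=collate_fn,
                        num_workers=2)
```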
Guys, for now the aim is to reproduce the Kaldi system, so please use SGD
not Adam. Anyway you can't really compare properly without doing a
parameter sweep e.g. on learning rate or l2; you'd have to look at the
train/valid difference to see whether the SGD one is overfitting or
underfitting.
RE the LDA: please use deltas, which are as good, and easy to implement.
You'll have to pad the input with 2 more frames.
Look at steps/libs/nnet3/xconfig/trivial_layers.py
```
class XconfigDeltaLayer(XconfigLayerBase):
"""This class is for parsing lines like
'delta-layer name=delta input=idct'
which appends the central frame with the delta features
(i.e. -1,0,1 since scale equals 1) and delta-delta features
(i.e. 1,0,-2,0,1), and then applies batchnorm to it.
```
So the numbers above are scaling factors, e.g. -1,0,1 means -1*(frame t-1) + 1*(frame t+1).
Do batchnorm after that.
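A rough PyTorch sketch of that delta / delta-delta computation using fixed conv1d kernels (the exact scaling and padding used in the Kaldi recipe may differ; the input layout and function name here are assumptions):
```python
import torch
import torch.nn.functional as F

def add_deltas(x):
    """x: (batch, feat_dim, time). Returns (batch, 3 * feat_dim, time)."""
    feat_dim = x.size(1)
    # depthwise kernels: -1,0,1 for delta and 1,0,-2,0,1 for delta-delta
    d1 = x.new_tensor([-1., 0., 1.]).repeat(feat_dim, 1, 1)
    d2 = x.new_tensor([1., 0., -2., 0., 1.]).repeat(feat_dim, 1, 1)
    delta = F.conv1d(x, d1, padding=1, groups=feat_dim)
    delta2 = F.conv1d(x, d2, padding=2, groups=feat_dim)
    return torch.cat([x, delta, delta2], dim=1)
```
Padding keeps the time axis aligned in this sketch; in the recipe the input would instead be padded with 2 extra context frames, and batchnorm applied to the concatenated output as described above.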
…On Fri, Feb 14, 2020 at 10:03 AM Fangjun Kuang ***@***.***> wrote:
enabling shuffle in egs dataloader improves a little bit of cer/wer.
|          | adam (affine==false), without shuffle | adam (affine==false), with shuffle |
|----------|---------------------------------------|------------------------------------|
| test cer | 9.17                                  | 9.05                               |
| test wer | 17.91                                 | 17.79                              |
| dev cer  | 7.34                                  | 7.15                               |
| dev wer  | 15.52                                 | 15.29                              |
|
BTW, it seems the current PyTorch model (TDNN_1c, that is @csukuangfj's model in #3892, not @fanlu's new model) is underfitting somewhat; the result gets better (best so far) with lower l2_regularize.
My config is:
opts.l2_regularize = 1e-4 (changed from 5e-4)
opts.leaky_hmm_coefficient = 0.1
opts.xent_regularize = 0.1
opts.out_of_range_regularize = 0.01
I used Adam with affine=true. Besides l2_regularize, I guess the threshold of clip_grad_value_ matters as well; I tried clip_grad_value_(model.parameters(), 4.8) instead of 5.0 in the current script, but it led to warnings of nnet outputs outside of range. Lower l2_regularize should be tried; for comparison, the current global average objf is in the log quoted below.
|
Great!
Was that with SGD?
I don't think gradient clipping should be necessary (or even helpful) for a
non-recurrent network.
I suspect, also, that that failure was random and not connected to such a
tiny change in the gradient clipping (anything less than a factor of 2 is
no change at all, IMO).
…On Fri, Feb 14, 2020 at 3:32 PM Haowen Qiu ***@***.***> wrote:
BTW, it seems the current PyTorch model (TDNN_1c, that is @csukuangfj
<https://github.com/csukuangfj>' model in #3892
<#3892>, not @fanlu
<https://github.com/fanlu>'s new model) is some sort of underfitting,
result will get better (*best* until now) with lower l2_regularize.
|          | TDNN-F(Pytorch) | tdnn_1c_rd_rmc_rng |
|----------|-----------------|--------------------|
| dev_cer  | 6.70            | 5.99               |
| dev_wer  | 14.75           | 13.86              |
| test_cer | 8.13            | 7.08               |
| test_wer | 16.83           | 15.72              |
the config of mine is
- opts.l2_regularize = 5e-4
+ opts.l2_regularize = 1e-4
opts.leaky_hmm_coefficient = 0.1
+ opts.xent_regularize = 0.1
+ opts.out_of_range_regularize = 0.01
I used Adam with affine=true.
Besides l2_regularize, I guess the threshold of clip_grad_value_ will
matter as well, I tried to set clip_grad_value_ (model.parameters(), 4.8)
instead of 5.0 in the current script, but it will lead warning of nnet
outputs outside of range.
Lower l2_regularize should be tried, for comparing, this is current global
average objf:
2020-02-14 14:39:04,638 INFO [train.py:102] Process 9100/9444(96.357476%) global average objf: -0.039630 over 51210880.0 frames, current batch average objf: -0.036466 over 6400 frames, epoch 5
2020-02-14 14:39:24,135 INFO [train.py:102] Process 9200/9444(97.416349%) global average objf: -0.039632 over 51779328.0 frames, current batch average objf: -0.033343 over 6400 frames, epoch 5
2020-02-14 14:39:43,281 INFO [train.py:102] Process 9300/9444(98.475222%) global average objf: -0.039640 over 52335744.0 frames, current batch average objf: -0.031165 over 6400 frames, epoch 5
2020-02-14 14:40:02,384 INFO [train.py:102] Process 9400/9444(99.534096%) global average objf: -0.039642 over 52895360.0 frames, current batch average objf: -0.050952 over 3840 frames, epoch 5
2020-02-14 14:40:10,763 INFO [common.py:61] Save checkpoint to exp/chain_cleaned_pybind/tdnn1c_sp/best_model.pt: epoch=5, learning_rate=1.5625e-05, objf=-0.039643600787482226
|
Mine is about -0.06. I'll try what you have suggested. I think the default value for opts.out_of_range_regularize is 0.01:
kaldi/src/chain/chain-training.h Line 74 in 793191b
I will
- set l2_regularize to 1e-4
- set xent_regularize to 0.1
- disable gradient clipping
and see what happens. Then I will use conv1d to compute the delta-delta features. |
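For reference, "disable gradient clipping" only touches one call in the training step; a stand-in sketch with a hypothetical flag (clip_value=5.0 is the value mentioned above):
```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_value_

model = nn.Linear(40, 40)                        # stand-in for the TDNN-F model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
use_grad_clip = False                            # False == gradient clipping disabled

loss = model(torch.randn(8, 40)).pow(2).mean()   # dummy objective
loss.backward()
if use_grad_clip:
    clip_grad_value_(model.parameters(), clip_value=5.0)
optimizer.step()
```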
The above result was obtained with Adam instead of SGD. I tried SGD, but it seems to require more tricks with the learning rate. |
Why do you say that SGD may require more tricks-- what happened? And did
you tune the learning rate?
opts.out_of_range_regularize is a newly added, relatively obscure feature
designed to stop the outputs getting very large and overflowing the
denominator computation (which is not done in log space). You can ignore
it for now, unless it's already implemented.
…On Fri, Feb 14, 2020 at 3:48 PM Haowen Qiu ***@***.***> wrote:
The above result was gotten with Adam instead of SGD, I tried SGD, but it
seems it may require more tricks on the learning rate.
|
@csukuangfj, the higher the objf, the better... see your code and logs. |
BTW, guys, you should monitor the train and valid objective functions
separately.
Too-big difference means the model is overfitting, which will normally mean
the
(learning rate * l2) is too small, so one of those should be increased.
And vice versa if the difference is quite small.
…On Fri, Feb 14, 2020 at 3:52 PM Haowen Qiu ***@***.***> wrote:
global average objf: -0.039642 is quite high.
Mine is about -0.06. I'll try what you have suggested.
I think the default value for opts.out_of_range_regularize is 0.01.
https://github.com/kaldi-asr/kaldi/blob/793191be209357cd22dd20f4958a570098ac0cf8/src/chain/chain-training.h#L74
I will
- set l2_regularize to 1e-4
- set xent_regularize to 0.1
- disable gradient clipping
and to see what will happen. And then to use conv1d to compute delta-delta
features.
@csukuangfj <https://github.com/csukuangfj> , the higher of objf, the
better...see your code and logs.
|
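The separate train/valid monitoring suggested above only needs a small loop around the existing objective computation; a sketch where compute_chain_objf is a placeholder for whatever the training script already calls:
```python
import torch

def average_objf(model, data_loader, compute_chain_objf):
    """Average objf over a held-out loader, for comparison with the train objf."""
    model.eval()
    total_objf, total_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in data_loader:
            objf, num_frames = compute_chain_objf(model, batch)  # placeholder
            total_objf += objf * num_frames
            total_frames += num_frames
    model.train()
    return total_objf / total_frames
```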
@danpovey I will compute the objective function value of |
I tried different learning rate values and different rate decays with SGD, but they all give worse results than Adam. Do you have any idea how to figure out the reason? BTW, in my last try with Adam (above), it seems the objf went up and down at the corresponding learning rate?
|
With the SGD run, compare the norms of the parameter matrices with those of Kaldi's model and see if any are significantly different. If those printed values are for minibatches, the variation is probably normal; sometimes you'll get easier or harder examples. |
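One way to dump the per-parameter L2 (Frobenius) norms on the PyTorch side for that comparison (sketch only; the Kaldi-side numbers come from its progress logs):
```python
import torch

def log_parameter_norms(model):
    for name, param in model.named_parameters():
        if param.requires_grad:
            print('{}: {:.3f}'.format(name, torch.norm(param, p='fro').item()))
```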
Ok, thanks Dan. I'll try lower l2 first, and then try to tune it with SGD following your suggestion. And I suggest we should not (at least for now) change the model structure; it seems that we may get a close result with the same configuration as tdnn_1c_rd_rmc_rng, and updating the model config may make things complex. |
If you are referring to the delta thing,
I think you should make a version of that baseline script that uses deltas,
since that's easier to implement,
and anyway that's the recommended pattern right now. (Assuming the
features are MFCCs).
You can refer to the mini_librispeech setup.
…On Fri, Feb 14, 2020 at 4:18 PM Haowen Qiu ***@***.***> wrote:
With the SGD run, compare the norms of the parameter matrices with those
of Kaldi's model and see if any are significantly different. If those
printed values are for minibatches, the variation is probably normal;
sometimes you'll get easier or harder examples.
Ok, thanks Dan. I'll try lower l2 first, and then try to tune it with SGD
with your suggestion.
And I suggest we should not (at least for now) change the model struct, it
seems that we may get close result with the same configuration of
tdnn_1c_rd_rmc_rng. Update model config may make things complex.
|
@qindazhu For my previous pull request, I used [-1, 0, 1] for the first linear layer. But the current model structure has kernel size 2 for the first linear layer; I merged it since fanlu said it has a better cer/wer. I am wondering whether the left/right context used in generating egs is still relevant. |
@csukuangfj I just kept all other configuration in your first PR, that is almost the same with
The only difference is that I re-used data/features for |
Anyway, let's keep the same model configuration (your first pr, |
That probably indicates the learning rate needs to be reduced.
…On Tue, Feb 18, 2020 at 4:59 PM Haowen Qiu ***@***.***> wrote:
BTW, for training set, it seems that the objf is frozen after epoch 2?
|
Ok, thanks! Let me just try more epochs first, as the l2 and initial learning rate seem right according to my previous experiments (larger or smaller l2/lr did not make the objf better in epoch 0). And then try to decrease the rate in the last epochs. |
Yes, I am wondering why I get worse results than haowen's and fanlu's. I think haowen used the same network architecture as me for the PyTorch model, i.e., [-1, 0, 1] for the first linear layer. |
I assume you were already using a learning rate schedule where it decreases
in later epochs (?)
That is definitely necessary. Aim for a factor of 10 or 20, maybe.
…On Tue, Feb 18, 2020 at 5:11 PM Fangjun Kuang ***@***.***> wrote:
$ steps/info/chain_dir_info.pl exp-haowen/chain_cleaned_1c/tdnn1c_sp/
exp-haowen/chain_cleaned_1c/tdnn1c_sp/: num-iters=79 nj=3..12
num-params=9.3M dim=40->3456 combine=-0.032->-0.032 (over 2)
xent:train/valid[51,78,final]=(-0.726,-0.550,-0.542/-0.735,-0.584,-0.576)
logprob:train/valid[51,78,final]=(-0.050,-0.033,-0.033/-0.052,-0.042,-0.041)
yes, I am wondering why I get worse results than haowen's and fanlu's.
I think haowen used the same network architecture as me for the PyTorch,
i.e.,
[-1, 0, 1] for the first linear layer.
|
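A schedule in the spirit of "decrease by a factor of 10 or 20 over training" can be as simple as an exponential decay per epoch; the numbers below are illustrative (gamma=0.6 over 6 epochs gives roughly a 1/21 decay), not the recipe's actual values:
```python
import torch
import torch.nn as nn

model = nn.Linear(40, 40)                                   # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.6)

for epoch in range(6):
    # ... one epoch of training with optimizer.step() calls ...
    scheduler.step()                                        # lr *= 0.6 after each epoch
```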
Did you follow the scripts in this PR #3868? I guess maybe the alignment before nnet training matters, which differs from your PyTorch scripts? |
I used the above settings suggested by haowen and replaced the multistep learning rate scheduler with ...
My alignment was generated by ... Should I replace ...? I will double the number of epochs, i.e., from 6 to 12. |
Yes, current learning rate is
the initial lr is |
I guess so, maybe @fanlu can help to confirm this as I never ran ... The other difference is that I use files from your first PR #3892; here is my local git log:
|
@csukuangfj please follow haowen's PR #3868 and rerun egs/aishell/s10/local/run_tdnn_1c.sh before stage 16, and then run your recipe egs/aishell/s10/local/run_chain.sh starting with stage 3. |
The model in this pull request is equivalent to my first pull request. I will switch to |
For SGD, I just find the L2-norms of the parameters are too big compared with Kaldi's (the numbers are quoted in the reply below).
How about yours? Just to make sure I don't compute it incorrectly.
|
I would suggest using
```
torch.norm(param, p='fro')
```
to compute the Frobenius norm.
Refer to https://pytorch.org/docs/stable/torch.html#torch.norm
From your results, it seems that you forgot to take the square root of the
PyTorch values.
…On Tue, Feb 18, 2020 at 7:54 PM Haowen Qiu ***@***.***> wrote:
For SGD, I just find the l2-norm of the parameters are too big compared
with Kaldi's:
kaldi: [tdnn1.affine:13.066 tdnnf2.linear:9.77364 tdnnf2.affine:13.2666 tdnnf3.linear:9.99066 tdnnf3.affine:12.3348 tdnnf4.linear:9.14051 tdnnf4.affine:12.1615 tdnnf5.linear:7.58906 tdnnf5.affine:11.355 tdnnf6.linear:9.04031 tdnnf6.affine:12.423 tdnnf7.linear:8.99408 tdnnf7.affine:12.4773 tdnnf8.linear:8.75175 tdnnf8.affine:12.3263 tdnnf9.linear:8.6019 tdnnf9.affine:12.028 tdnnf10.linear:8.49032 tdnnf10.affine:11.9081 tdnnf11.linear:8.28274 tdnnf11.affine:11.9213 tdnnf12.linear:8.25842 tdnnf12.affine:11.7487 tdnnf13.linear:8.19186 tdnnf13.affine:11.6624 prefinal-l:13.9765 prefinal-chain.affine:11.6501 prefinal-chain.linear:12.3917 output.affine:22.0577 prefinal-xent.affine:10.6057 prefinal-xent.linear:9.92937 output-xent.affine:50.9147 ]
pytorch:[tdnn1.affine:112.8931,tdnnf2.linear:102.4325,tdnnf2.affine:75.7402,tdnnf3.linear:86.0951,tdnnf3.affine:62.5637,tdnnf4.linear:78.2900,tdnnf4.affine:56.5584,tdnnf5.linear:48.1624,tdnnf5.affine:46.7811,tdnnf6.linear:89.7642,tdnnf6.affine:65.7036,tdnnf7.linear:90.6984,tdnnf7.affine:69.4275,tdnnf8.linear:89.6295,tdnnf8.affine:66.2355,tdnnf9.linear:87.3545,tdnnf9.affine:63.2102,tdnnf10.linear:83.4296,tdnnf10.affine:60.1824,tdnnf11.linear:83.5425,tdnnf11.affine:58.1394,tdnnf12.linear:82.4880,tdnnf12.affine:57.8074,tdnnf13.linear:85.2806,tdnnf13.affine:62.5397,prefinal_l:95.0566,prefinal_chain.affine:78.2842,prefinal_chain.linear:95.9710,output_affine:110.7365,prefinal_xent.affine:60.7356,prefinal_xent.linear:75.7621,output_xent_affine:121.6411]
How about yours? just make sure I don't compute it incorrectly
for name, param in model.named_parameters():
    if param.requires_grad and name.endswith('.weight'):
        change_name_to_align_with_kaldi()
        norm_str = norm_str + '{}:{:.4f},'.format(name, torch.norm(param, 2))
|
I think |
oh, yes, you're right. Now I have changed the data processing pipeline to match yours |
this is my sgd model's l2-norm
|
9a07339 to 523f9a4 (Compare)
It turns out that your ... There is more data generated in ... In addition, the training time is nearly tripled. |
Yes, it is because of |
Thanks @fanlu. Seems the L2-norm is really too big. Maybe I should try larger l2_regularize values and tune the learning rate again. |
BTW, for training time, it seems that PyTorch takes about the same time as Kaldi. Maybe we should do more things to make it faster. But I suggest we focus on the WER first and put speed aside just for now. |
You could try, say, having 4 times the l2 and one quarter the learning rate.
…On Wed, Feb 19, 2020 at 10:03 AM Haowen Qiu ***@***.***> wrote:
In addition, the training time is nearly tripled.
BTW, for training time, it seems that PyTorch takes the same time with
Kaldi. Maybe we should do more things to make it faster. But I suggest we
should focus on the WER first and put SPEED aside just for now.
|
sure, thanks Dan. |
Here are the results for the current pull request.
Please merge this if necessary. Part of the training log is as follows. It seems that the objf value keeps at ...
|
Fantastic! And of course thanks to all the others who provided results for comparison: those are valuable too, even if not directly merged. |
sync kernel(2,2) latest result
|
cool... |