Unigrams probs and add_unigrams_arpa.pl #4933

It looks like the script

https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/lang/add_unigrams_arpa.pl

doesn't make any attempt to ensure that the unigram probabilities sum to 1.0. I don't know whether this is a problem.

My suggestion would be to treat the "scale" parameter as the probability of OOV, P(OOV), as suggested in the script. Then the following normalizations could be done:
1. Normalize non-OOV unigrams so they sum to 1 - P(OOV)
2. Normalize OOV unigrams so they sum to P(OOV)
That should ensure that the set of specified OOV words has a collective probability of P(OOV) and the remainder of the lexicon picks up the remaining probability mass, 1 - P(OOV).
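As a minimal sketch of the two normalization steps, assuming the unigrams have already been parsed into plain dictionaries: the function name `renormalize_unigrams` and its parameters are hypothetical and are not part of add_unigrams_arpa.pl. ARPA files store log10 probabilities, so the sketch converts to the linear domain before rescaling. It does not touch backoff weights or higher-order N-grams.

```python
import math


def renormalize_unigrams(in_vocab_log10, oov_weights, p_oov):
    """Return word -> log10 probability, summing to 1.0 in the linear domain.

    Hypothetical helper for illustration only.
    in_vocab_log10: word -> log10 probability of existing (non-OOV) unigrams
    oov_weights:    word -> positive relative weight of each added OOV word
    p_oov:          total probability mass reserved for the added words
    """
    if not 0.0 < p_oov < 1.0:
        raise ValueError("p_oov must be strictly between 0 and 1")
    if not in_vocab_log10 or not oov_weights:
        raise ValueError("both word sets must be non-empty")
    if any(weight <= 0.0 for weight in oov_weights.values()):
        raise ValueError("OOV weights must be positive")

    # Convert the existing unigrams from log10 to linear probabilities.
    in_vocab = {word: 10.0 ** logp for word, logp in in_vocab_log10.items()}
    in_total = sum(in_vocab.values())
    oov_total = sum(oov_weights.values())

    result = {}
    # Step 1: non-OOV unigrams are rescaled to sum to 1 - P(OOV).
    for word, prob in in_vocab.items():
        result[word] = math.log10(prob * (1.0 - p_oov) / in_total)
    # Step 2: OOV unigrams are rescaled to sum to P(OOV).
    for word, weight in oov_weights.items():
        result[word] = math.log10(weight * p_oov / oov_total)
    return result
```

With two existing words at probability 0.5 each, added words weighted 3:1, and P(OOV) = 0.1, the existing words end up at 0.45 each and the added words at 0.075 and 0.025.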

It may also make sense to move any user-specified word that already exists in the lexicon into the OOV set, so that it takes the probability specified by the user. I don't actually know if that's a good idea, as it will affect all backoff N-grams. So perhaps it is better to print a warning and skip these words, or to provide an option that applies the user-specified probabilities to in-vocabulary words if that's really what the user wants to do.
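The warn-and-skip behaviour with an opt-in override could look roughly like this. The function `split_user_words` and the `override_in_vocab` flag are hypothetical names used for illustration; the existing script has no such option as far as this issue describes.

```python
import sys


def split_user_words(vocab, user_probs, override_in_vocab=False):
    """Split user-specified words into those to add and those skipped.

    Hypothetical helper for illustration only.
    vocab:             set of words already present in the language model
    user_probs:        word -> probability (or weight) given by the user
    override_in_vocab: if True, in-vocabulary words are kept in the
                       returned set instead of being skipped
    """
    to_add = {}
    skipped = []
    for word, prob in user_probs.items():
        if word in vocab and not override_in_vocab:
            # Default: leave the existing unigram alone and tell the user.
            print(f"warning: '{word}' is already in the vocabulary; skipping",
                  file=sys.stderr)
            skipped.append(word)
        else:
            to_add[word] = prob
    return to_add, skipped
```

By default the words in `skipped` keep their existing unigram probabilities. With `override_in_vocab=True` they are returned in `to_add`, and the caller would then have to replace the existing unigram entries rather than append duplicates.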
