Unigrams probs and add_unigrams_arpa.pl #4933

Open
@FredSRichardson

It looks like the script

https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/lang/add_unigrams_arpa.pl

doesn't make any attempt to ensure that the unigram probabilities sum to 1.0. I don't know if this is a problem or not.

My suggestion would be to treat the "scale" parameter as the probability of OOV, P(OOV), as suggested in the script. Then the following normalizations could be done:

  1. Normalize non-OOV unigrams so they sum to 1 - P(OOV)
  2. Normalize OOV unigrams so they sum to P(OOV)

That should ensure that the set of specified OOV words is treated as having a collective probability of P(OOV), and the remainder of the lexicon picks up the remaining probability mass.
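To make the proposal concrete, here is a minimal sketch of the two normalizations in Python. It is not taken from add_unigrams_arpa.pl; the function names `renormalize_unigrams` and `to_log10` are hypothetical, and it assumes the unigrams are already available as plain (linear) probabilities or weights in two dictionaries. ARPA files store log10 probabilities, so a conversion step is included.

```python
import math


def renormalize_unigrams(in_vocab, oov, p_oov):
    """Rescale two unigram sets (hypothetical helper, not part of Kaldi).

    in_vocab: dict word -> linear probability for words already in the LM
    oov:      dict word -> linear weight for the user-supplied OOV words
    p_oov:    total probability mass to give to the OOV set, i.e. P(OOV)

    Returns (scaled_in_vocab, scaled_oov) where the first sums to
    1 - p_oov and the second sums to p_oov.
    """
    if not 0.0 < p_oov < 1.0:
        raise ValueError("p_oov must be strictly between 0 and 1")

    in_total = sum(in_vocab.values())
    oov_total = sum(oov.values())
    if in_total <= 0.0 or oov_total <= 0.0:
        raise ValueError("both unigram sets need positive total mass")

    # Step 1: non-OOV unigrams share the mass 1 - P(OOV).
    scaled_in = {w: p * (1.0 - p_oov) / in_total for w, p in in_vocab.items()}
    # Step 2: OOV unigrams share the mass P(OOV).
    scaled_oov = {w: p * p_oov / oov_total for w, p in oov.items()}
    return scaled_in, scaled_oov


def to_log10(probs):
    """Convert linear probabilities to the log10 values used in ARPA files."""
    return {w: math.log10(p) for w, p in probs.items()}
```

Note that this sketch only touches the unigram probabilities. It does not recompute backoff weights or higher-order N-grams, so whether the resulting model is still properly normalized at every order would need to be checked separately.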

It may also make sense to ensure that any word specified by the user that already exists in the lexicon is moved to the OOV set, so that it inherits the probability specified by the user. I actually don't know if that's a good idea, as it will impact all backoff N-grams. So perhaps a warning is better and these words are skipped, or an option could exist to apply the user-specified probabilities to in-vocabulary words if that's really what the user wants to do.
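The warn-and-skip behaviour with an opt-in override could look roughly like the sketch below. Again, this is only an illustration of the idea, not existing Kaldi code: `split_user_words` and the `override_in_vocab` flag are hypothetical names, and the LM vocabulary is assumed to be available as a Python set.

```python
import warnings


def split_user_words(lm_vocab, user_probs, override_in_vocab=False):
    """Partition user-supplied words (hypothetical helper, not part of Kaldi).

    lm_vocab:          set of words that already have a unigram in the LM
    user_probs:        dict word -> probability supplied by the user
    override_in_vocab: if True, collect in-vocabulary words so the caller
                       can apply the user's probability to them; if False,
                       warn and skip them.

    Returns (new_words, overrides), both dicts word -> probability.
    """
    new_words = {}
    overrides = {}
    for word, prob in user_probs.items():
        if word not in lm_vocab:
            # Genuinely new word: goes into the OOV set.
            new_words[word] = prob
        elif override_in_vocab:
            # The user explicitly asked to replace the existing unigram.
            overrides[word] = prob
        else:
            warnings.warn(
                f"'{word}' is already in the LM vocabulary; skipping it"
            )
    return new_words, overrides
```

Keeping the default as warn-and-skip leaves existing unigrams (and therefore the backoff behaviour of higher-order N-grams) untouched unless the user opts in.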

Labels: bug, stale
