
Traditional n-Gram Language Models

In this notebook, several traditional n-gram language models are trained on WikiText-2, a corpus of high-quality Wikipedia articles. The dataset was originally introduced in Merity et al. (2016), "Pointer Sentinel Mixture Models". A raw version of the data can be viewed online.
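One convenient way to obtain WikiText-2 is through the Hugging Face `datasets` library. This is an assumption about how the data might be loaded, not necessarily how this notebook does it (e.g., it may read a local copy instead):

```python
# Sketch: load the raw WikiText-2 corpus via Hugging Face `datasets`.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-2-raw-v1")
train_lines = wikitext["train"]["text"]  # list of raw text lines
print(train_lines[:3])
```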


Four types of language models are implemented: two unigram models and two bigram models, where each pair consists of an unsmoothed and a smoothed variant. Each model class implements the three methods described below; a minimal sketch of these methods follows the list.

  • Sentence generation (i.e., generateSentence(self)): Returns a sentence generated by the model as a list of tokens drawn from the vocab set (the vocab includes the UNK token but excludes the START and END tokens). Each sentence begins with the START token (emitted with probability 1); its length is not fixed, since generation terminates whenever END is produced. The probability of an n-gram is estimated by $$\hat{P}(n_i)$$, where $n_i$ is the $i$-th n-gram.

  • Log-likelihood of a sentence (i.e., getSentenceLogLikelihood(self, sentence)): Returns the log-likelihood of sentence, a list of words that exist in the vocab set. The log-likelihood of a sentence is estimated by $$\sum_{i=1}^{n} \log \hat{P}(n_i),$$ where $n$ is the length of the sentence.

  • Corpus perplexity (i.e., getCorpusPerplexity(self, test_data)): Computes the perplexity of test_data, i.e., the exponentiated negative average log-probability of the corpus. The corpus perplexity is estimated by $$\exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log \hat{P}(n_i)\right),$$ where $N$ is the total number of tokens in test_data.
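As a concrete illustration, here is a minimal sketch of a smoothed unigram model implementing the three methods above. The method names match those listed here; the corpus representation, Laplace (add-one) smoothing, and sampling details are assumptions about the notebook's implementation, not a copy of it:

```python
import math
import random
from collections import Counter

START, END, UNK = "<START>", "<END>", "<UNK>"

class SmoothedUnigramModel:
    def __init__(self, corpus):
        # corpus: list of sentences, each a list of tokens, assumed to
        # start with START, end with END, and have UNK already substituted
        # for out-of-vocabulary words.
        self.counts = Counter(w for sent in corpus for w in sent if w != START)
        self.vocab = set(self.counts)
        # Laplace (add-one) smoothing denominator: token count + vocab size
        self.total = sum(self.counts.values()) + len(self.vocab)

    def prob(self, word):
        return (self.counts[word] + 1) / self.total

    def generateSentence(self):
        # Emit START with probability 1, then sample tokens until END.
        sentence = [START]
        words = list(self.vocab)
        weights = [self.prob(w) for w in words]
        while True:
            w = random.choices(words, weights=weights)[0]
            sentence.append(w)
            if w == END:
                return sentence

    def getSentenceLogLikelihood(self, sentence):
        # Sum log-probabilities of all tokens except START,
        # which has probability 1 and contributes nothing.
        return sum(math.log(self.prob(w)) for w in sentence if w != START)

    def getCorpusPerplexity(self, test_data):
        # exp(-1/N * total log-probability), N = number of scored tokens
        N = sum(1 for sent in test_data for w in sent if w != START)
        total = sum(self.getSentenceLogLikelihood(sent) for sent in test_data)
        return math.exp(-total / N)
```

A bigram variant would condition `prob` on the previous token while keeping the same three method signatures.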


Here are some sample sentences generated by the models.

Unigram Model

<START> 7 bringing wearing formed , is , 1632 of again might was Brand He <UNK> as 's of . this million Aquila Gray , and for the The cyclone opera MLs her a is specimens was in southeastward with to a Fall on Angelo the addition on <UNK> thought In ; mile stock 's . an . starring , Commission the ) <UNK> released the ( " of cute 18 an , <END>

Smoothed Unigram Model

<START> end always in Internet , City he closely . ALL in , cause jokes the " pleas 3 and age . his , point , bass dose as and and beast ... so what and show creator , ) early Slow level <UNK> to <END>

Bigram Model

<START> Internal Revenue Service Byway is a Norwegian Trunk Railway Museum and a <UNK> International Airport to several American Music from Orsogna , left arm for two paths are due to be possible from the game victories against the key idea incomprehensible to expose the Toronto Students who signed , another . <END>

Smoothed Bigram Model

<START> After a reaction has been suspected by the right and forbs . " Underneath the Song and Adelaide remained in an additional pressure . <END>
