A simple model for identifying the language of a text via likelihood maximization over Markov models trained on phoneme encodings.

joecomerisnotavailable/phoneme_based_language_id

Language Identification via Phoneme Encoding

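We propose to distinguish languages by maximum likelihood: raw text is first converted into a phoneme encoding, a Markov model is trained on the encoded text of each language of interest, and a new text is assigned to the language whose model gives it the highest likelihood. Phonemes are grouped into vowels and several consonant types.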
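The approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in predict.py: the letter-to-class table, the first-order (bigram) Markov model, the add-one smoothing, and the names `encode`, `train`, `log_likelihood`, and `predict` are all assumptions made for the example. In particular, mapping single letters to classes is only a crude stand-in for a real phoneme encoding.

```python
import math

# Hypothetical coarse classes (an assumption, not the repository's table):
# V = vowel, S = stop, F = fricative, N = nasal, L = liquid/glide, _ = other.
CLASSES = {
    "V": "aeiou",
    "S": "pbtdkgcq",
    "F": "fvszhxj",
    "N": "mn",
    "L": "lrwy",
}
LETTER_TO_CLASS = {ch: cls for cls, letters in CLASSES.items() for ch in letters}
SYMBOLS = list(CLASSES) + ["_"]


def encode(text):
    """Replace every character with its class symbol ('_' if unknown)."""
    return "".join(LETTER_TO_CLASS.get(ch, "_") for ch in text.lower())


def train(text):
    """Estimate log transition probabilities of a first-order Markov model."""
    # Start every count at 1 (add-one smoothing) so no transition has
    # probability zero.
    counts = {a: {b: 1 for b in SYMBOLS} for a in SYMBOLS}
    enc = encode(text)
    for a, b in zip(enc, enc[1:]):
        counts[a][b] += 1
    model = {}
    for a, row in counts.items():
        total = sum(row.values())
        model[a] = {b: math.log(c / total) for b, c in row.items()}
    return model


def log_likelihood(model, text):
    """Sum the log transition probabilities along the encoded text."""
    enc = encode(text)
    return sum(model[a][b] for a, b in zip(enc, enc[1:]))


def predict(models, text):
    """Return the language whose model gives the text the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))


if __name__ == "__main__":
    # Toy strings only; real models need substantial text per language.
    models = {
        "english": train("the quick brown fox jumps over the lazy dog"),
        "italian": train("la volpe veloce salta sopra il cane pigro"),
    }
    print(predict(models, "a lazy brown dog"))
```

Working with log probabilities avoids numerical underflow on long texts, and the smoothing keeps a single unseen transition from ruling out a language entirely.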

To run:

Run predict.py. New files can be added to either the train or the test folder, but the model is currently limited to languages written in variants of the Latin alphabet.

Requirements:

pandas, numpy, unicodedata (unicodedata is part of the Python standard library and needs no separate install)
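The README does not say how unicodedata is used. Given that the model targets variants of the Latin alphabet, one plausible use is folding accented letters onto their base letters before encoding. The sketch below shows that idea only; the function name `strip_diacritics` is hypothetical and the repository may do something different.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks so that, e.g., 'é' becomes 'e'."""
    # NFD splits a precomposed character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("café, niño, Ångström"))
```

Letters with no canonical decomposition, such as "ß" or "ø", pass through unchanged and would need explicit handling.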
