================
Forced Alignment
================

This tutorial covers how to use abkhazia to perform phone-level forced
alignment on your own corpus of annotated audio files.
Prerequisites
=============

Here is what you need before following this tutorial:

- A set of audio files on which to run the alignment, encoded as 16 kHz 16-bit PCM WAV
- For these audio files, a set of segments corresponding to utterances. For each
  utterance, you'll need a phonemic transcription (an easy way to get these is
  with `Phonemizer <https://github.com/bootphon/phonemizer>`_)

It is also recommended (though optional) to have some kind of reference file
from which you can identify the speaker of each phonemized utterance.
| 19 | + |
Corpus format
=============

The corpus format is the same as the one specified in :ref:`abkhazia_format`,
except that two corpus files have a more specific format, namely ``text.txt``
and ``lexicon.txt``. Here, ``text.txt`` contains the phonemic transcription of
each utterance::

    <utterance-id> <pho1> <pho2> ... <phoN>

and ``lexicon.txt`` is just a "phony" file containing each phoneme mapped to
itself::

    <pho1> <pho1>
    <pho2> <pho2>
    <pho3> <pho3>
    ...

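Since ``lexicon.txt`` is entirely determined by the phoneme inventory of
``text.txt``, you can generate it rather than write it by hand. A minimal
sketch (the ``build_lexicon`` helper is illustrative, not part of abkhazia):

```python
def build_lexicon(text_lines):
    """Build "phony" lexicon entries mapping each phoneme to itself.

    text_lines: lines from text.txt, "<utterance-id> <pho1> ... <phoN>"
    """
    phones = set()
    for line in text_lines:
        # first field is the utterance id, the remaining fields are phonemes
        phones.update(line.split()[1:])
    return ["{0} {0}".format(p) for p in sorted(phones)]


# Example: two utterances sharing the phoneme "b"
print(build_lexicon(["utt1 a b", "utt2 b c"]))  # ['a a', 'b b', 'c c']
```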
| 37 | + |
Doing the Forced Alignment
==========================

Once you've gathered all the required files (listed above) in a ``corpus/``
folder (the name is obviously arbitrary), you'll want to validate the corpus to
check that it conforms to Kaldi's input format. Luckily, abkhazia does that for
us::

    abkhazia validate corpus/

| 47 | + |
Then, we'll compute the language model (actually, here, a phonetic model) for
your dataset. Note that even though we set the model level (option ``-l``) to
"word", this still works fine since all the words are phonemes::

    abkhazia language corpus/ -l word -n 3 -v

| 54 | + |
We'll now extract the MFCC features from the wav files::

    abkhazia features mfcc corpus/ --cmvn

| 59 | + |
Then, using the language model and the extracted MFCCs, compute a triphone
HMM-GMM acoustic model::

    abkhazia acoustic monophone -v corpus/ --force --recipe
    abkhazia acoustic triphone -v corpus/
| 64 | + |
If you specified the speaker for each utterance, you can adapt your model per
speaker::

    abkhazia acoustic triphone-sa -v corpus/
And then, at last, we can compute the forced phonetic alignments::

    abkhazia align corpus/ -a corpus/triphone-sa  # if you computed the speaker-adapted triphones
    abkhazia align corpus/ -a corpus/triphone     # if you didn't

| 74 | + |
If everything went right, you should find your alignment in
``corpus/align/alignments.txt``. The file has the following row structure::

    <utt_id> <pho_start> <pho_end> <pho_name> <pho_symbol>
    ...

**Note that the phonemes' start and end time markers (in seconds) are relative
to the utterance containing them, not to the entire audio file.**
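If you need phone times relative to the whole wav file, you can add each
utterance's onset within the file to the utterance-relative times. A minimal
sketch, assuming the row structure above and a ``utt_starts`` mapping (from
utterance id to onset in seconds, e.g. built from your segments file; both the
helper name and that mapping are illustrative, not part of abkhazia):

```python
def shift_to_file_times(align_lines, utt_starts):
    """Convert utterance-relative phone times to file-relative times.

    align_lines: rows "<utt_id> <pho_start> <pho_end> ..." (times in seconds,
                 relative to the utterance)
    utt_starts:  dict mapping utterance id -> utterance onset in the wav file
    """
    out = []
    for line in align_lines:
        utt_id, start, stop, *rest = line.split()
        offset = utt_starts[utt_id]
        # shift both time markers by the utterance onset, keep other columns
        out.append((utt_id, float(start) + offset, float(stop) + offset, *rest))
    return out


# Example: a phone at 0.0-0.25 s in an utterance starting at 3.5 s in the file
print(shift_to_file_times(["utt1 0.0 0.25 AH"], {"utt1": 3.5}))
```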