================
Forced Alignment
================

This tutorial covers how to use abkhazia to perform phone-level forced
alignment on your own corpus of annotated audio files.
Prerequisites
=============

Here is what you need before following this tutorial:

- A set of audio files on which to run the alignment, encoded as 16 kHz 16-bit PCM WAV
- For these audio files, a set of segments corresponding to utterances. For each
  utterance, you'll need a phonemic transcription (an easy way to get these is
  with `Phonemizer <https://github.com/bootphon/phonemizer>`_)

It is also recommended (though optional) to have some kind of reference file
from which you can identify the speaker of each phonemized utterance.
| 19 | + |
Corpus format
=============

The corpus format is the same as the one specified in :ref:`abkhazia_format`,
except that two corpus files have a more specific format, namely ``text.txt``
and ``lexicon.txt``. Here, ``text.txt`` contains the phonemic transcription of
each utterance::

    <utterance-id> <pho1> <pho2> ... <phoN>

and ``lexicon.txt`` is just a "phony" file containing each phoneme mapped to
itself::

    <pho1> <pho1>
    <pho2> <pho2>
    <pho3> <pho3>
    ...

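Since ``lexicon.txt`` is entirely determined by the phoneme inventory of
``text.txt``, you can generate it rather than write it by hand. A minimal
sketch (the ``build_lexicon`` helper is illustrative, not part of abkhazia):

```python
def build_lexicon(text_lines):
    """Build "phony" lexicon entries mapping each phoneme to itself.

    text_lines: lines from text.txt, "<utterance-id> <pho1> ... <phoN>"
    """
    phones = set()
    for line in text_lines:
        # first field is the utterance id, the remaining fields are phonemes
        phones.update(line.split()[1:])
    return ["{0} {0}".format(p) for p in sorted(phones)]


# Example: two utterances sharing the phoneme "b"
print(build_lexicon(["utt1 a b", "utt2 b c"]))  # ['a a', 'b b', 'c c']
```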
| 37 | + |
Doing the Forced Alignment
==========================

Once you've gathered all the required files (listed above) in a ``corpus/``
folder (the name is obviously arbitrary), you'll want to validate the corpus to
check that it conforms to Kaldi's input format. Luckily, abkhazia does that for
us::

    abkhazia validate corpus/

| 47 | + |
Then, we'll compute the language model (actually, here, a phonetic model) for
your dataset. Note that even though we set the model level (option ``-l``) to
"word", this still works fine since all the words are phonemes::

    abkhazia language corpus/ -l word -n 3 -v

| 54 | + |
We'll now extract the MFCC features from the wav files::

    abkhazia features mfcc corpus/ --cmvn

| 59 | + |
Then, using the language model and the extracted MFCCs, compute a triphone
HMM-GMM acoustic model::

    abkhazia acoustic monophone -v corpus/ --force --recipe
    abkhazia acoustic triphone -v corpus/
| 64 | + |
If you specified the speaker for each utterance, you can adapt your model per
speaker::

    abkhazia acoustic triphone-sa -v corpus/
And then, at last, we can compute the forced phonetic alignments::

    abkhazia align corpus/ -a corpus/triphone-sa  # if you computed the speaker-adapted triphones
    abkhazia align corpus/ -a corpus/triphone     # if you didn't

| 74 | + |
If everything went right, you should find your alignment in
``corpus/align/alignments.txt``. The file has the following row structure::

    <utt_id> <pho_start> <pho_end> <pho_name> <pho_symbol>
    ...

**Note that the phonemes' start and end time markers (in seconds) are relative
to the utterance containing them, not to the entire audio file.**
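If you need phone times relative to the whole wav file, you can add each
utterance's onset within the file to the utterance-relative times. A minimal
sketch, assuming the row structure above and a ``utt_starts`` mapping (from
utterance id to onset in seconds, e.g. built from your segments file; both the
helper name and that mapping are illustrative, not part of abkhazia):

```python
def shift_to_file_times(align_lines, utt_starts):
    """Convert utterance-relative phone times to file-relative times.

    align_lines: rows "<utt_id> <pho_start> <pho_end> ..." (times in seconds,
                 relative to the utterance)
    utt_starts:  dict mapping utterance id -> utterance onset in the wav file
    """
    out = []
    for line in align_lines:
        utt_id, start, stop, *rest = line.split()
        offset = utt_starts[utt_id]
        # shift both time markers by the utterance onset, keep other columns
        out.append((utt_id, float(start) + offset, float(stop) + offset, *rest))
    return out


# Example: a phone at 0.0-0.25 s in an utterance starting at 3.5 s in the file
print(shift_to_file_times(["utt1 0.0 0.25 AH"], {"utt1": 3.5}))
```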