Syntax error in generated pipeline specification stops first normalisation FST from being used #35

@snomos

For lang-sma, the following TTS pipeline is specified:

export default function smaTextTTS(entry: StringEntry): Command {
  let x = hfst.tokenize("tokenise", entry, { model_path: "tokeniser-tts-cggt-desc.pmhfst" });
  x = divvun.blanktag("whitespace",     x, { model_path: "analyser-gt-whitespace.hfst" });
  x = cg3.vislcg3("remove-lexicalised", x, { model_path: "generated-remove-lexicalised-compounds.bin" });
  x = cg3.vislcg3("valency",            x, { model_path: "valency.bin" });
  x = cg3.vislcg3("mwe-dis",            x, { model_path: "mwe-dis.bin" });
  x = cg3.mwesplit("mwesplit",          x);
  x = cg3.vislcg3("disamb",             x, { model_path: "disambiguator.bin" });
  x = cg3.vislcg3("functions",          x, { model_path: "functions.bin" });
  x = cg3.vislcg3("deps",               x, { model_path: "dependency.bin" });
  x = speech.normalize(
    "normaliser", x,
    {
      generator: "generator-tts-gt-norm.hfstol",
      analyzer:  "analyser-gt-norm.hfstol",
      normalizers: {
        "Sem/Time-clock": "transcriptor-clock-digit2text.filtered.lookup.hfstol",
        "Sem/Date":       "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
        "Sem/Year":       "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
        "Arab":           "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
        "Roman":          "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
        "ABBR":           "transcriptor-abbrevs2text.filtered.lookup.hfstol",
        "ACR":            "transcriptor-abbrevs2text.filtered.lookup.hfstol",
        "Symbol":         "transcriptor-symbols2text.filtered.lookup.hfstol",
        "Emoji":          "transcriptor-emoji2text.filtered.lookup.hfstol"
      }
    }
  );
  x = speech.phon("text2phon", x, { model: "text2phontext.hfstol", tag_models: { "ACR": "acro2text.hfstol" } });
  x = cg3.sentences("phon",    x, { mode: "phonological" });
  return x;
}

But after bundling, the normalisation step looks like this in the Divvun Runtime Playground (line breaks added for readability):

speech::normalize(analyzer = <path>"analyser-gt-norm.hfstol",
   generator = <path>"generator-tts-gt-norm.hfstol",
   normalizers = <{path}>{Sem/Time-clock: "transcriptor-clock-digit2text.filtered.lookup.hfstol",
   Sem/Date: "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
   Sem/Year: "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
   Arab: "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
   Roman: "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
   ABBR: "transcriptor-abbrevs2text.filtered.lookup.hfstol",
   ACR: "transcriptor-abbrevs2text.filtered.lookup.hfstol",
   Symbol: "transcriptor-symbols2text.filtered.lookup.hfstol",
   Emoji: "transcriptor-emoji2text.filtered.lookup.hfstol"}) -> string

Notice the curly braces {} for the first tag-specific normalisation (Sem/Time-clock). The effect of this can be seen by comparing the output for the following two sentences:

  1. with error:

Joekoen guhkiem, jis edtjebe jaehkedh dam 25 jaepien båeries nyjsenæjjam Kloemegistie.

Output:

"joekoen guhkiem, jis edtjebe jaehkedh dam 25 jaepien båeries nyjsenæjjam kloemegistie"

  2. without error:

Joekoen guhkiem, jis edtjebe jaehkedh dam 35 jaepien båeries nyjsenæjjam Kloemegistie.

Output:

"joekoen guhkiem, jis edtjebe jaehkedh dam golmeluhkievïjhte jaepien båeries nyjsenæjjam kloemegistie"

The difference is that 25 gets a Sem/Time-clock reading after disambiguation:

"<25>"
	"25" Num Arab Sem/Time-clock Sg Nom <W:0.0> <sma>

whereas 35 does not (because 35 can't be an hour in our system):

"<35>"
	"35" Num Arab Sg Gen Attr <W:0.0> <sma>

This difference in disambiguated analysis affects which FST is used in the normalisation process: either the first one with the curly braces, or another one. The one with the curly braces is not applied.
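
To make the selection step concrete, here is a minimal sketch of how a tag-keyed normaliser lookup could behave. It is not the actual divvun-runtime implementation: the helper names `readingTags` and `pickNormalizer` are hypothetical, and the rule that the first matching key in declaration order wins is an assumption. Only the tag-to-FST mapping and the two readings are taken from this issue.

```typescript
// Hypothetical sketch only: this is NOT how divvun-runtime is known to work.
// The mapping below is copied from the pipeline specification in this issue.
const normalizers: Record<string, string> = {
  "Sem/Time-clock": "transcriptor-clock-digit2text.filtered.lookup.hfstol",
  "Sem/Date": "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
  "Sem/Year": "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
  "Arab": "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
  "Roman": "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
  "ABBR": "transcriptor-abbrevs2text.filtered.lookup.hfstol",
  "ACR": "transcriptor-abbrevs2text.filtered.lookup.hfstol",
  "Symbol": "transcriptor-symbols2text.filtered.lookup.hfstol",
  "Emoji": "transcriptor-emoji2text.filtered.lookup.hfstol",
};

// Split a CG reading line into its plain tags, dropping the quoted lemma
// and angle-bracket tags such as <W:0.0> and <sma>.
function readingTags(reading: string): string[] {
  return reading
    .trim()
    .split(/\s+/)
    .slice(1)
    .filter((tag: string) => !(tag.startsWith("<") && tag.endsWith(">")));
}

// Assumption: the first key (in declaration order) found among the tags wins.
function pickNormalizer(
  reading: string,
  table: Record<string, string>
): string | undefined {
  const tags = new Set(readingTags(reading));
  for (const [tag, fst] of Object.entries(table)) {
    if (tags.has(tag)) return fst;
  }
  return undefined;
}

const fstFor25 = pickNormalizer('"25" Num Arab Sem/Time-clock Sg Nom <W:0.0> <sma>', normalizers);
const fstFor35 = pickNormalizer('"35" Num Arab Sg Gen Attr <W:0.0> <sma>', normalizers);
console.log(fstFor25); // transcriptor-clock-digit2text.filtered.lookup.hfstol
console.log(fstFor35); // transcriptor-numbers-digit2text.filtered.lookup.hfstol
```

Under that assumption, 25 is routed to the Sem/Time-clock entry and 35 falls through to the Arab entry, which matches the observation that only the 25 sentence is affected.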

When the digit2text conversion is run outside the divvun-runtime environment, the Sem/Time-clock FST delivers exactly the same output as the other one. Had it been applied, the output would therefore have been correct even though the analysis is wrong (25 is an age in years here, not a clock hour):

echo 25 | hfst-lookup -q tools/tts/transcriptor-ttsdate-digit2text.filtered.lookup.hfstol 
25	göökteluhkievïjhte	0.000000

echo 25 | hfst-lookup -q tools/tts/transcriptor-numbers-digit2text.filtered.lookup.hfstol 
25	göökteluhkievïjhte	0.000000

The output is the same for both the Playground (above) and the CLI:

echo 'Joekoen guhkiem, jis edtjebe jaehkedh dam 25 jaepien båeries nyjsenæjjam Kloemegistie.' |\
    divvun-runtime run -p tools/tts/bundle.drb

Output:

[
    "joekoen guhkiem, jis edtjebe jaehkedh dam 25 jaepien båeries nyjsenæjjam kloemegistie",
    "\\n",
]

And the working version:

echo 'Joekoen guhkiem, jis edtjebe jaehkedh dam 35 jaepien båeries nyjsenæjjam Kloemegistie.' |\
    divvun-runtime run -p tools/tts/bundle.drb

Output:

[
    "joekoen guhkiem, jis edtjebe jaehkedh dam golmeluhkievïjhte jaepien båeries nyjsenæjjam kloemegistie",
    "\\n",
]
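
A small helper along the following lines could flag this failure in a regression check. It is only a sketch: `leftoverDigits` is a hypothetical name, and it rests on the assumption that a fully normalised TTS string should contain no ASCII digits. The two strings are the outputs quoted above.

```typescript
// Hypothetical helper: returns every digit sequence that survived normalisation.
// Assumption: a fully normalised output contains no ASCII digits at all.
function leftoverDigits(output: string): string[] {
  return output.match(/\d+/g) || [];
}

const failing =
  "joekoen guhkiem, jis edtjebe jaehkedh dam 25 jaepien båeries nyjsenæjjam kloemegistie";
const working =
  "joekoen guhkiem, jis edtjebe jaehkedh dam golmeluhkievïjhte jaepien båeries nyjsenæjjam kloemegistie";

console.log(leftoverDigits(failing)); // [ '25' ]
console.log(leftoverDigits(working)); // []
```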

Labels: bug (Something isn't working)
