Syntax error in generated pipeline specification stops first normalisation FST from being used

For [lang-sma](https://github.com/giellalt/lang-sma), the following TTS pipeline is specified:

```typescript
export default function smaTextTTS(entry: StringEntry): Command {
  let x = hfst.tokenize("tokenise", entry, { model_path: "tokeniser-tts-cggt-desc.pmhfst" });
  x = divvun.blanktag("whitespace",     x, { model_path: "analyser-gt-whitespace.hfst" });
  x = cg3.vislcg3("remove-lexicalised", x, { model_path: "generated-remove-lexicalised-compounds.bin" });
  x = cg3.vislcg3("valency",            x, { model_path: "valency.bin" });
  x = cg3.vislcg3("mwe-dis",            x, { model_path: "mwe-dis.bin" });
  x = cg3.mwesplit("mwesplit",          x);
  x = cg3.vislcg3("disamb",             x, { model_path: "disambiguator.bin" });
  x = cg3.vislcg3("functions",          x, { model_path: "functions.bin" });
  x = cg3.vislcg3("deps",               x, { model_path: "dependency.bin" });
  x = speech.normalize(
    "normaliser", x,
    {
      generator: "generator-tts-gt-norm.hfstol",
      analyzer:  "analyser-gt-norm.hfstol",
      normalizers: {
        "Sem/Time-clock": "transcriptor-clock-digit2text.filtered.lookup.hfstol",
        "Sem/Date":       "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
        "Sem/Year":       "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
        "Arab":           "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
        "Roman":          "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
        "ABBR":           "transcriptor-abbrevs2text.filtered.lookup.hfstol",
        "ACR":            "transcriptor-abbrevs2text.filtered.lookup.hfstol",
        "Symbol":         "transcriptor-symbols2text.filtered.lookup.hfstol",
        "Emoji":          "transcriptor-emoji2text.filtered.lookup.hfstol"
      }
    }
  );
  x = speech.phon("text2phon", x, { model: "text2phontext.hfstol", tag_models: { "ACR": "acro2text.hfstol" } });
  x = cg3.sentences("phon",    x, { mode: "phonological" });
  return x;
}
```

But after bundling, the normalisation step looks like this in the Divvun Runtime Playground (line breaks added for readability):

```
speech::normalize(analyzer = <path>"analyser-gt-norm.hfstol",
   generator = <path>"generator-tts-gt-norm.hfstol",
   normalizers = <{path}>{Sem/Time-clock: "transcriptor-clock-digit2text.filtered.lookup.hfstol",
   Sem/Date: "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
   Sem/Year: "transcriptor-ttsdate-digit2text.filtered.lookup.hfstol",
   Arab: "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
   Roman: "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
   ABBR: "transcriptor-abbrevs2text.filtered.lookup.hfstol",
   ACR: "transcriptor-abbrevs2text.filtered.lookup.hfstol",
   Symbol: "transcriptor-symbols2text.filtered.lookup.hfstol",
   Emoji: "transcriptor-emoji2text.filtered.lookup.hfstol"}) -> string
```

Notice the curly braces `{}` at the first tag-specific normalisation entry (`Sem/Time-clock`). Their effect can be seen by comparing the output for the following two sentences:

1. with error:

> Joekoen guhkiem, jis edtjebe jaehkedh dam 25 jaepien båeries nyjsenæjjam Kloemegistie.

Output:

> "joekoen guhkiem, jis edtjebe jaehkedh dam 25 jaepien båeries nyjsenæjjam kloemegistie"

2. without error:

> Joekoen guhkiem, jis edtjebe jaehkedh dam 35 jaepien båeries nyjsenæjjam Kloemegistie.

Output:

> "joekoen guhkiem, jis edtjebe jaehkedh dam golmeluhkievïjhte jaepien båeries nyjsenæjjam kloemegistie"

The difference is that `25` gets a `Sem/Time-clock` reading after disambiguation:

```
"<25>"
	"25" Num Arab Sem/Time-clock Sg Nom <W:0.0> <sma>
```

whereas `35` does not (because 35 can't be an hour in our system):

```
"<35>"
	"35" Num Arab Sg Gen Attr <W:0.0> <sma>
```

This difference in the disambiguated analysis determines which FST is used in the normalisation step: either the first one (`Sem/Time-clock`, the entry with the curly braces) or another one. The one with the curly braces is not applied, so `25` is left as digits.
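To make the suspected mechanism concrete, here is a minimal sketch of tag-based FST selection. It is not the divvun-runtime implementation: `readingTags`, `pickNormalizer`, and the rule that the first matching key in declaration order wins are all assumptions made for illustration. Only the tag names, file names, and reading lines are taken from this issue.

```typescript
// Hypothetical sketch, NOT the divvun-runtime implementation.
type NormalizerMap = { [tag: string]: string };

// Two entries copied from the `normalizers` option in the pipeline above.
const normalizers: NormalizerMap = {
  "Sem/Time-clock": "transcriptor-clock-digit2text.filtered.lookup.hfstol",
  "Arab": "transcriptor-numbers-digit2text.filtered.lookup.hfstol",
};

// Split a CG reading line into its plain tags, dropping the quoted
// lemma ("25") and the angle-bracket annotations (<W:0.0>, <sma>).
function readingTags(reading: string): string[] {
  return reading
    .trim()
    .split(/\s+/)
    .filter((tok) => tok.charAt(0) !== '"' && tok.charAt(0) !== "<");
}

// Assumption: the first key in declaration order that occurs among the
// reading's tags selects the FST. The real selection rule may differ.
function pickNormalizer(reading: string, map: NormalizerMap): string | undefined {
  const tags = readingTags(reading);
  for (const tag of Object.keys(map)) {
    if (tags.indexOf(tag) !== -1) {
      return map[tag];
    }
  }
  return undefined;
}

const fst25 = pickNormalizer('"25" Num Arab Sem/Time-clock Sg Nom <W:0.0> <sma>', normalizers);
const fst35 = pickNormalizer('"35" Num Arab Sg Gen Attr <W:0.0> <sma>', normalizers);

console.log(fst25);
console.log(fst35);
```

Under these assumptions, the reading for `25` selects the `Sem/Time-clock` entry while the reading for `35` falls through to `Arab`, which matches the observation that only the `25` sentence is affected.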

When the digit2text conversion is run outside the divvun-runtime environment, the `Sem/Time-clock` FST delivers exactly the same output as the other one. Had it been applied, the output would therefore have been correct even though the analysis is wrong (`25` is an age in years here, not a clock hour):

```sh
echo 25 | hfst-lookup -q tools/tts/transcriptor-ttsdate-digit2text.filtered.lookup.hfstol 
25	göökteluhkievïjhte	0.000000

echo 25 | hfst-lookup -q tools/tts/transcriptor-numbers-digit2text.filtered.lookup.hfstol 
25	göökteluhkievïjhte	0.000000
```

The output is the same for both the Playground (above) and the CLI:

```sh
echo 'Joekoen guhkiem, jis edtjebe jaehkedh dam 25 jaepien båeries nyjsenæjjam Kloemegistie.' |\
    divvun-runtime run -p tools/tts/bundle.drb
```

Output:

```json
[
    "joekoen guhkiem, jis edtjebe jaehkedh dam 25 jaepien båeries nyjsenæjjam kloemegistie",
    "\\n",
]
```

And the working version:

```sh
echo 'Joekoen guhkiem, jis edtjebe jaehkedh dam 35 jaepien båeries nyjsenæjjam Kloemegistie.' |\
    divvun-runtime run -p tools/tts/bundle.drb
```

Output:

```json
[
    "joekoen guhkiem, jis edtjebe jaehkedh dam golmeluhkievïjhte jaepien båeries nyjsenæjjam kloemegistie",
    "\\n",
]
```
