Description
Hello!
I stumbled upon this error while training the tagger on a part of the Taiga corpus of Russian (~1 GB of text): "An error occurred during model training: Should encode value 65536 in one byte!"
The quick question is: does udpipe have any vocabulary size limitation?
The full story is:
I know about issue #53 and I tried everything suggested there: I have no tokens longer than 255 bytes, no dubious lemmas (the maximum number of forms for one lemma in my corpus is 158, due to the rich morphology of the language), and I set guesser_enrich_dictionary to 1. I also removed all sentences longer than 255 tokens. But I still get this error.
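For reference, this is roughly how I re-checked those conditions. A minimal sketch, assuming a UTF-8 CoNLL-U training file ("train.conllu" is a placeholder path): it reports tokens over 255 bytes, sentences over 255 tokens, and the lemma with the most distinct forms.

```python
from collections import defaultdict

forms_per_lemma = defaultdict(set)
long_tokens, long_sentences = [], []
sent_len = 0

with open("train.conllu", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:                      # blank line ends a sentence
            if sent_len > 255:
                long_sentences.append(sent_len)
            sent_len = 0
            continue
        if line.startswith("#"):          # comment lines (sent_id, text, ...)
            continue
        cols = line.split("\t")
        idx, form, lemma = cols[0], cols[1], cols[2]
        if "-" in idx or "." in idx:      # skip multiword ranges and empty nodes
            continue
        sent_len += 1
        if len(form.encode("utf-8")) > 255:
            long_tokens.append(form)
        forms_per_lemma[lemma].add(form)

if sent_len > 255:                        # last sentence if file lacks a final blank line
    long_sentences.append(sent_len)

max_lemma, max_forms = max(forms_per_lemma.items(), key=lambda kv: len(kv[1]))
print("tokens longer than 255 bytes:", len(long_tokens))
print("sentences longer than 255 tokens:", len(long_sentences))
print("lemma with most forms:", max_lemma, len(max_forms))
```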
The only thing that helped was reducing the corpus size to ~750 MB; at ~800 MB the error still occurs. I guessed the problem was in some specific sentences (the difference between these two corpora), so I tried training the tagger on that diff alone and did not get the error. So, does udpipe have some vocabulary size limitation? Or maybe there is a less obvious cause of the problem?
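In case it helps narrow this down, here is the kind of check I can run. A minimal sketch (same placeholder "train.conllu" path and CoNLL-U assumptions as above): it counts the distinct forms, lemmas, and tag combinations in the corpus, to see which of them crosses 65536 (2^16), the value mentioned in the error, when going from the ~750 MB corpus that trains fine to the ~800 MB one that fails.

```python
forms, lemmas, tags = set(), set(), set()

with open("train.conllu", encoding="utf-8") as f:
    for line in f:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword ranges and empty nodes
            continue
        forms.add(cols[1])
        lemmas.add(cols[2])
        tags.add((cols[3], cols[4], cols[5]))  # UPOS, XPOS, FEATS

print("distinct forms :", len(forms))
print("distinct lemmas:", len(lemmas))
print("distinct tags  :", len(tags))
```

I can post these counts for both corpus sizes if that would be useful.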