The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.
The initial issues include:
- Some properties don't exist, like the patronymic and informal grammemes. Either such functionality needs to be removed, or such data needs to be added to Wikidata. I vote for adding data instead of removing functionality. Just one entry needs to be added to keep the tests from failing so badly.
- Review the short-form usage. It may be intended for short adjectives, but that needs to be confirmed.
- Double check that negative-prefixed forms of words, like verbs, are not included and are filtered out. Including them can increase the size of the inflection tables a lot.
- The dictionary-parser output needs to be addressed.
- The unit tests need to be fixed.
Tool output that needs to be addressed:
Line 163464: Q1552433 is not a known part of speech grammeme for L1337828(со)
Line 170922: Q904896 is not a known part of speech grammeme for L1398984(рассказывая)
Line 172378: Q904896 is not a known part of speech grammeme for L2122(делая)
Line 242130: Q2006180 is not a known part of speech grammeme for L582437(она)
Line 242131: Q2006180 is not a known part of speech grammeme for L582438(её)
Line 332924: Q430255 is not a known grammeme for L1317646(давать)
Line 520068: Q1299049 is not a known grammeme for L38291(трудиться)
Line 687296: Q904896 is not a known grammeme for L4744(идти)
Line 756518: Q2006180 is not a known part of speech grammeme for L582524(они)
Line 760869: Q1930668 is not a known part of speech grammeme for L618162(однако)
Line 907179: Q904896 is not a known part of speech grammeme for L409446(для)
Line 947137: Q29485 is not a known grammeme for L740234(с)
Line 997003: Q55074511 is not a known grammeme for L1144304(корма)
Line 1034314: Q1250335 is not a known grammeme for L37632(много)
Line 1116981: Q89522629 is not a known grammeme for L726217(крыло)
Line 1178044: Q54944750 is not a known grammeme for L1223574(Аллах)
Line 1200971: Q1930668 is not a known part of speech grammeme for L2121(авось)
Line 1201298: Q904896 is not a known grammeme for L4852(пойти)
Line 1249669: Q904896 is not a known grammeme for L409448(длить)
Line 1269667: Q2112896 is not a known part of speech grammeme for L578639(весь)
Line 1269802: Q2006180 is not a known part of speech grammeme for L579902(этот)
Line 1271440: Q2006180 is not a known part of speech grammeme for L592806(который)
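To help triage the output above, the messages can be grouped by Wikidata item so that each missing grammeme is handled once instead of per lexeme. This is only a rough sketch: the regular expression is inferred from the log lines pasted in this issue, and `group_unknown_grammemes` is a hypothetical helper, not part of the dictionary-parser.

```python
import re
from collections import defaultdict

# Pattern inferred from the log lines in this issue (an assumption, not a
# documented format): "Line N: Q123 is not a known [part of speech] grammeme
# for L456(lemma)".
LOG_LINE = re.compile(
    r"^Line (?P<line>\d+): (?P<qid>Q\d+) is not a known "
    r"(?P<kind>part of speech grammeme|grammeme) "
    r"for (?P<lexeme>L\d+)\((?P<lemma>.+)\)$"
)


def group_unknown_grammemes(log_text):
    """Group (lexeme ID, lemma) pairs by (QID, message kind).

    Lines that do not match the assumed format are skipped.
    """
    groups = defaultdict(list)
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue
        key = (match["qid"], match["kind"])
        groups[key].append((match["lexeme"], match["lemma"]))
    return dict(groups)


# A few lines copied from the tool output in this issue.
SAMPLE = """\
Line 163464: Q1552433 is not a known part of speech grammeme for L1337828(со)
Line 170922: Q904896 is not a known part of speech grammeme for L1398984(рассказывая)
Line 687296: Q904896 is not a known grammeme for L4744(идти)
Line 242130: Q2006180 is not a known part of speech grammeme for L582437(она)
"""

grouped = group_unknown_grammemes(SAMPLE)
for (qid, kind), lexemes in sorted(grouped.items()):
    lemmas = ", ".join(lemma for _, lemma in lexemes)
    print(f"{qid} ({kind}): {len(lexemes)} lexeme(s): {lemmas}")
```

Keeping the "part of speech grammeme" and plain "grammeme" messages in separate groups matters here, because the same item (for example Q904896) shows up in both roles and the fix may differ for each.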
Here are the currently generated lexical dictionary files for debugging the test failures.
The generated files are 4,130,513 bytes as a zip and 3,414,735 bytes as the sdict file. Uncompressed, they are 59,726,225 bytes. This language alone will force the use of git-lfs.