Integrate ru Wikidata into Unicode Inflection #54

Open
@grhoten

Description

The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.

The initial issues include:

  • Some properties don't exist in Wikidata, like the patronymic and informal grammemes. Either that functionality needs to be removed, or the data needs to be added to Wikidata. I vote for adding data instead of removing functionality. Just one entry needs to be added for each so that the tests don't fail so badly.
  • Review the short-form usage. It may be intended for short adjectives, but that needs to be confirmed.
  • Double-check that negative-prefixed forms of words, such as verbs, are filtered out and not included. Including them can greatly increase the size of the inflection tables.
  • The dictionary-parser output needs to be addressed.
  • The unit tests need to be fixed.

Tool output that needs to be addressed:

Line 163464: Q1552433 is not a known part of speech grammeme for L1337828(со)
Line 170922: Q904896 is not a known part of speech grammeme for L1398984(рассказывая)
Line 172378: Q904896 is not a known part of speech grammeme for L2122(делая)
Line 242130: Q2006180 is not a known part of speech grammeme for L582437(она)
Line 242131: Q2006180 is not a known part of speech grammeme for L582438(её)
Line 332924: Q430255 is not a known grammeme for L1317646(давать)
Line 520068: Q1299049 is not a known grammeme for L38291(трудиться)
Line 687296: Q904896 is not a known grammeme for L4744(идти)
Line 756518: Q2006180 is not a known part of speech grammeme for L582524(они)
Line 760869: Q1930668 is not a known part of speech grammeme for L618162(однако)
Line 907179: Q904896 is not a known part of speech grammeme for L409446(для)
Line 947137: Q29485 is not a known grammeme for L740234(с)
Line 997003: Q55074511 is not a known grammeme for L1144304(корма)
Line 1034314: Q1250335 is not a known grammeme for L37632(много)
Line 1116981: Q89522629 is not a known grammeme for L726217(крыло)
Line 1178044: Q54944750 is not a known grammeme for L1223574(Аллах)
Line 1200971: Q1930668 is not a known part of speech grammeme for L2121(авось)
Line 1201298: Q904896 is not a known grammeme for L4852(пойти)
Line 1249669: Q904896 is not a known grammeme for L409448(длить)
Line 1269667: Q2112896 is not a known part of speech grammeme for L578639(весь)
Line 1269802: Q2006180 is not a known part of speech grammeme for L579902(этот)
Line 1271440: Q2006180 is not a known part of speech grammeme for L592806(который)
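
To triage the output above, it may help to group the diagnostics by the Wikidata item that was not recognized, since several QIDs (for example Q904896 and Q2006180) recur across many lexemes. The following is a minimal sketch and is not part of the dictionary-parser; the function name `group_unknown_grammemes` and the regular expression are assumptions based only on the line shape shown above.

```python
import re
from collections import defaultdict

# Assumed line shape, inferred from the tool output quoted in this issue.
DIAGNOSTIC = re.compile(
    r"Line (?P<line>\d+): (?P<qid>Q\d+) is not a known "
    r"(?P<kind>part of speech grammeme|grammeme) for "
    r"(?P<lexeme>L\d+)\((?P<lemma>[^)]*)\)"
)


def group_unknown_grammemes(lines):
    """Map (kind, QID) to the list of (lexeme ID, lemma) pairs that reported it."""
    grouped = defaultdict(list)
    for text in lines:
        match = DIAGNOSTIC.search(text)
        if match is None:
            continue  # Skip anything that is not a diagnostic line.
        key = (match["kind"], match["qid"])
        grouped[key].append((match["lexeme"], match["lemma"]))
    return dict(grouped)


# A few lines copied from the output above.
sample = [
    "Line 170922: Q904896 is not a known part of speech grammeme for L1398984(рассказывая)",
    "Line 172378: Q904896 is not a known part of speech grammeme for L2122(делая)",
    "Line 332924: Q430255 is not a known grammeme for L1317646(давать)",
]

for (kind, qid), lexemes in sorted(group_unknown_grammemes(sample).items()):
    print(qid, kind, lexemes)
```

Grouping this way shows how many distinct mappings are actually missing, as opposed to how many lexemes are affected.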

Here are the currently generated lexical dictionary files for debugging the test failures.

ru.zip

The generated files are 4,130,513 bytes as a zip. The size is 3,414,735 bytes as the sdict file. Uncompressed they're 59,726,225 bytes in size. This language alone will force usage of git-lfs.
