Context
Issue #47 requires "natural languages (including actual parsing of their
grammar with no semantic checks ... grammatical correctness, syntax
correctness and so on - should be fully supported)". Current support is
identification, segmentation, normalization, and script annotations
(src/natural_language.rs) - nothing parses grammar, so correctness cannot be
checked. See requirements.md R-5 and
solution-plans.md S-8.
Research (competitors-natural-language.md):
Grammatical Framework (GF) with its Resource Grammar Library (RGL) is the only
surveyed system doing exactly this job (parse-or-reject, no semantics, covers
the ten target languages, LGPL/BSD); Universal Dependencies (UD) supplies the
morphosyntax vocabulary and corpora;
Wikidata lexemes (CC0) and UniMorph (CC BY-SA) supply word forms; LanguageTool
(LGPL, proven portable to Rust by nlprule) supplies explainable negative
checks.
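To make the UD vocabulary concrete, here is a minimal sketch of reading one token row of CoNLL-U (the UD treebank file format) by hand with the standard library only. It does not use rs-conllu or any other crate API; `UdToken` and `parse_token_line` are hypothetical names for illustration, and only the fields that would feed morphosyntax links (UPOS, FEATS, HEAD, DEPREL) are kept.

```rust
/// One token row of a CoNLL-U file, reduced to the fields the link
/// vocabulary would draw on. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
struct UdToken {
    form: String,
    upos: String,
    feats: Vec<(String, String)>,
    head: usize,
    deprel: String,
}

/// Parses one tab-separated CoNLL-U token line. The format has ten columns:
/// ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
/// Returns None for comments, blank lines, multiword-token ranges ("1-2"),
/// empty nodes ("1.1"), and rows that are otherwise malformed.
fn parse_token_line(line: &str) -> Option<UdToken> {
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let cols: Vec<&str> = line.split('\t').collect();
    if cols.len() != 10 {
        return None;
    }
    // Only plain integer IDs are accepted; ranges and empty nodes fail here.
    cols[0].parse::<usize>().ok()?;
    // FEATS is "_" when empty, otherwise "Key=Value" pairs joined by '|'.
    let feats = if cols[5] == "_" {
        Vec::new()
    } else {
        cols[5]
            .split('|')
            .filter_map(|kv| kv.split_once('='))
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect()
    };
    Some(UdToken {
        form: cols[1].to_string(),
        upos: cols[3].to_string(),
        feats,
        // HEAD is 0 for the sentence root.
        head: cols[6].parse().ok()?,
        deprel: cols[7].to_string(),
    })
}
```

For a row such as `2	dogs	dog	NOUN	NNS	Number=Plur	3	nsubj	_	_`, this yields the UPOS tag `NOUN`, the feature pair `Number=Plur`, and an `nsubj` relation to token 3, which are the three inventories a link-type vocabulary would need to name.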
Scope (staged)
- Adopt UD's UPOS/UFeats/deprel inventory as link-type vocabulary for
morphosyntax links; import CoNLL-U fixtures (e.g. via rs-conllu) with
per-treebank license provenance.
- Word-level correctness: morphological lexica seeded from Wikidata lexeme
Forms (CC0; UniMorph optional second source); unknown/ill-formed tokens
become recoverable is_error links, reusing the parse-recovery contract.
- Sentence-level correctness: integrate GF RGL grammars compiled to PGF
(start with the gf-core crate; evaluate a native PMCFG reader over the
links network as the long-term path); grammatical sentences parse clean,
ungrammatical ones carry error links - verify_full_match() thereby
answers "is this grammatical?".
- Explainable negatives: port a starter set of LanguageTool-style pass/fail
rule sentences per target language as fixtures (DELPH-IN "mal-rule" model
for explanations).
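The word-level stage could look roughly like the sketch below. Everything here is an assumption made for illustration: `Link`, `check_words`, `make_link`, and `reconstruct` are hypothetical, the `verify_full_match` function is a stand-in whose signature is not taken from the codebase, and the alphabetic/non-alphabetic split is a toy replacement for the segmentation that already exists in src/natural_language.rs. The point is the contract: unknown words become error links instead of aborting, and the spans still cover every byte of the input.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a link over a span of the input. Only the
/// `is_error` flag is named in this issue; the rest is assumed.
#[derive(Debug, PartialEq)]
struct Link<'a> {
    text: &'a str,
    is_error: bool,
}

/// Splits `input` into alternating word / non-word spans that together cover
/// every byte, flagging words missing from `lexicon` as recoverable errors.
fn check_words<'a>(input: &'a str, lexicon: &HashSet<&str>) -> Vec<Link<'a>> {
    let mut links = Vec::new();
    let mut start = 0;
    let mut prev_is_word: Option<bool> = None;
    for (i, ch) in input.char_indices() {
        let is_word = ch.is_alphabetic();
        if let Some(prev) = prev_is_word {
            // A change of character class closes the current span.
            if prev != is_word {
                links.push(make_link(&input[start..i], prev, lexicon));
                start = i;
            }
        }
        prev_is_word = Some(is_word);
    }
    if let Some(prev) = prev_is_word {
        links.push(make_link(&input[start..], prev, lexicon));
    }
    links
}

fn make_link<'a>(text: &'a str, is_word: bool, lexicon: &HashSet<&str>) -> Link<'a> {
    // Lowercased lookup for simplicity; a lexicon seeded from Wikidata lexeme
    // Forms would more likely key on attested surface forms.
    let known = !is_word || lexicon.contains(text.to_lowercase().as_str());
    Link { text, is_error: !known }
}

/// Stand-in for verify_full_match(): true when no link carries an error.
fn verify_full_match(links: &[Link]) -> bool {
    links.iter().all(|l| !l.is_error)
}

/// Concatenating the span texts restores the original input, errors included.
fn reconstruct(links: &[Link]) -> String {
    links.iter().map(|l| l.text).collect()
}
```

With a lexicon containing `dogs`, `bark`, and `loudly`, the input `Dogs barkz loudly.` produces one error link (`barkz`) and fails the stand-in `verify_full_match`, while `reconstruct` still returns the input unchanged. Sentence-level checks would add error links through the same mechanism rather than introducing a second reporting channel.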
Acceptance criteria
- [...] NATURAL_LANGUAGE_TARGETS: at least one grammatical fixture parses with a
clean verify_full_match() and one ungrammatical fixture surfaces error links -
while both reconstruct byte-for-byte.
- [...] fully gated; remaining ones are tracked in the roadmap.
- [...] bump: minor).
References
- requirements.md R-5
- solution-plans.md S-8
- issue-47-76af108c0f24 (PR Finish issue #47 parity feature set #48).
Filed from docs/case-studies/issue-47/proposed-issues/08-natural-language-grammar-parsing.md. Part of the implementation plan for #47.