[pt] Major enhancement in disambiguation.xml #11673
Walkthrough
Adds a new Portuguese disambiguation rule, NP_UNKNOWN_VMIP3S0_VMM02S0_20251127, to pt/disambiguation.xml (inserted twice). The rule matches NP/UNKNOWN tokens with VMIP3S0 + VMM02S0 sequences and removes the VMM02S0 postag when VMIP3S0 is present.
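The effect described in the walkthrough can be modeled with a small Python sketch. This is not LanguageTool code and not the rule's actual XML: the `Token` class, the helper names, the example tags `NP00000` and `PP3MS00`, and the assumption that the NP/UNKNOWN token directly precedes the ambiguous verb are all hypothetical, chosen only to illustrate "drop VMM02S0 when VMIP3S0 is also present".

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A word plus its candidate POS tags; an empty set stands in for UNKNOWN."""
    text: str
    readings: set = field(default_factory=set)


def is_np_or_unknown(token: Token) -> bool:
    # Assumption: "UNKNOWN" means the tagger produced no reading at all.
    return not token.readings or any(tag.startswith("NP") for tag in token.readings)


def remove_imperative_reading(tokens: list) -> list:
    """Drop VMM02S0 from a token that also has VMIP3S0 when it follows NP/UNKNOWN."""
    for previous, current in zip(tokens, tokens[1:]):
        both_present = {"VMIP3S0", "VMM02S0"} <= current.readings
        if is_np_or_unknown(previous) and both_present:
            current.readings.discard("VMM02S0")
    return tokens


# "fala" is both 3rd-person present indicative and 2nd-person imperative.
sentence = [Token("Marco", {"NP00000"}), Token("fala", {"VMIP3S0", "VMM02S0"})]
remove_imperative_reading(sentence)
print(sentence[1].readings)
```

In this model a proper noun or an untagged word in front of the verb is taken as evidence for the indicative reading, so only the imperative tag is removed; tokens in any other context keep both readings.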
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml(1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11433
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3716-3716
Timestamp: 2025-07-09T06:30:58.965Z
Learning: marcoagpinto uses temporary placeholder values like "temp_off" in LanguageTool rule attributes while waiting for nightly test results before enabling rules, as part of his testing methodology to ensure rules don't require minor adjustments.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11557
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:4068-4078
Timestamp: 2025-10-08T06:41:55.119Z
Learning: In Portuguese disambiguation rules, when a pattern targets tokens with multiple verb readings (e.g., VMIP3S0 and VMM02S0), including an exception like `<exception postag_regexp='yes' postag='AQ.+'/>` on subsequent participle tokens is necessary to prevent false positives, even though it reduces the number of matches. The rule can still fire successfully for cases where the participle doesn't have an adjective reading, as confirmed by testing showing ~4683 matches across ~950k sentences.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11648
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:4198-4199
Timestamp: 2025-11-18T06:43:24.213Z
Learning: In Portuguese disambiguation rules, when removing verb readings from words that have both verb and noun tags (e.g., "testes" with VMSP2S0 and NCMP000, "pacientes" with VMSP2S0 and NCCP000), combining similar verb form tags like VMP00PM and VMSP2S0 in a single pattern is appropriate because they create similar noun/verb ambiguities in the same grammatical contexts (after verbs/adjectives/pronouns). Adding AQ exceptions to such patterns would break functionality by filtering out valid cases where past participles should be treated as nouns (e.g., "enviados" in "devem ser enviados").
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11196
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:3884-3892
Timestamp: 2025-01-17T08:46:06.456Z
Learning: In Portuguese disambiguation rules, when handling multiple verb forms for the same pattern, use separate rules for each verb form tag instead of combining them with multiple <wd> tags in a single rule.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11490
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/grammar.xml:40665-40665
Timestamp: 2025-08-25T03:54:09.419Z
Learning: In Portuguese LanguageTool rules, the word "porta" appears as both a verb and a noun, creating inherent POS tagging ambiguity. Rules targeting "porta" in compound constructions should not exclude verb tags as exceptions because this would break the rule's functionality when "porta" functions as a noun in valid compound word patterns.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11345
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3801-3804
Timestamp: 2025-04-26T11:44:57.044Z
Learning: In Portuguese rules for LanguageTool, using `<match no='X' postag='V.+' postag_regexp='yes'>verb</match>` pattern with infinitive verbs (like "zangar") is preferred over direct adjective forms because it allows proper handling of gender and number inflections, while common gender adjectives like "descontente" can use the `postag_replace='AQ0C$10'` pattern.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11515
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3714-3717
Timestamp: 2025-09-19T05:58:04.682Z
Learning: In Portuguese LanguageTool rules, using `<match no='X' postag='AQ.+' postag_regexp='yes'>adjective</match>` automatically preserves gender and number inflection from the matched token without requiring postag_replace, allowing adjectives like "qualificado" to properly inflect to "qualificado/qualificada/qualificados/qualificadas" based on the original matched form.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11415
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3721-3723
Timestamp: 2025-06-28T05:00:46.342Z
Learning: In Portuguese LanguageTool rules, capture group references $1 through $9 work correctly in postag_replace patterns. The parsing issue only occurs when a single-digit group reference is immediately followed by digits (like $1000), creating ambiguity between "group 1 + literal 000" vs "group 1000". Using braces ${1}000 disambiguates this case.
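The same kind of ambiguity, a group reference immediately followed by digits, exists in other regex engines. The sketch below uses Python's `re` module purely as an analogy; it says nothing about how LanguageTool parses `postag_replace`, and the tag shapes and function names are made up for illustration. In Python the explicit form is `\g<1>`, which plays the role that braces play in the learning above.

```python
import re

# Hypothetical tag shape: "AQ0C" + number marker + "0", e.g. "AQ0CS0".
TAG_PATTERN = r"AQ0C(.)0"


def with_bare_reference(tag: str) -> str:
    # "\10" is read as a reference to group 10, not "group 1 then a literal 0".
    # The pattern has only one group, so Python raises re.error here.
    return re.sub(TAG_PATTERN, r"NCC\10", tag)


def with_explicit_reference(tag: str) -> str:
    # "\g<1>" ends the group reference explicitly, so the digits after it
    # are plain literal text.
    return re.sub(TAG_PATTERN, r"NCC\g<1>000", tag)


print(with_explicit_reference("AQ0CS0"))

try:
    with_bare_reference("AQ0CS0")
except re.error as exc:
    print("ambiguous reference rejected:", exc)
```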
📚 Learning: 2025-10-08T06:41:55.119Z
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11557
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:4068-4078
Timestamp: 2025-10-08T06:41:55.119Z
Learning: In Portuguese disambiguation rules, when a pattern targets tokens with multiple verb readings (e.g., VMIP3S0 and VMM02S0), including an exception like `<exception postag_regexp='yes' postag='AQ.+'/>` on subsequent participle tokens is necessary to prevent false positives, even though it reduces the number of matches. The rule can still fire successfully for cases where the participle doesn't have an adjective reading, as confirmed by testing showing ~4683 matches across ~950k sentences.
Applied to files:
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml
📚 Learning: 2025-11-18T06:43:24.213Z
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11648
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:4198-4199
Timestamp: 2025-11-18T06:43:24.213Z
Learning: In Portuguese disambiguation rules, when removing verb readings from words that have both verb and noun tags (e.g., "testes" with VMSP2S0 and NCMP000, "pacientes" with VMSP2S0 and NCCP000), combining similar verb form tags like VMP00PM and VMSP2S0 in a single pattern is appropriate because they create similar noun/verb ambiguities in the same grammatical contexts (after verbs/adjectives/pronouns). Adding AQ exceptions to such patterns would break functionality by filtering out valid cases where past participles should be treated as nouns (e.g., "enviados" in "devem ser enviados").
Applied to files:
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml
📚 Learning: 2025-01-17T08:46:06.456Z
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11196
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:3884-3892
Timestamp: 2025-01-17T08:46:06.456Z
Learning: In Portuguese disambiguation rules, when handling multiple verb forms for the same pattern, use separate rules for each verb form tag instead of combining them with multiple <wd> tags in a single rule.
Applied to files:
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml
📚 Learning: 2025-08-25T03:54:09.419Z
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11490
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/grammar.xml:40665-40665
Timestamp: 2025-08-25T03:54:09.419Z
Learning: In Portuguese LanguageTool rules, the word "porta" appears as both a verb and a noun, creating inherent POS tagging ambiguity. Rules targeting "porta" in compound constructions should not exclude verb tags as exceptions because this would break the rule's functionality when "porta" functions as a noun in valid compound word patterns.
Applied to files:
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml
📚 Learning: 2025-07-09T06:30:58.965Z
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11433
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3716-3716
Timestamp: 2025-07-09T06:30:58.965Z
Learning: marcoagpinto uses temporary placeholder values like "temp_off" in LanguageTool rule attributes while waiting for nightly test results before enabling rules, as part of his testing methodology to ensure rules don't require minor adjustments.
Applied to files:
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml
📚 Learning: 2025-04-26T11:44:57.044Z
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11345
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3801-3804
Timestamp: 2025-04-26T11:44:57.044Z
Learning: In Portuguese rules for LanguageTool, using `<match no='X' postag='V.+' postag_regexp='yes'>verb</match>` pattern with infinitive verbs (like "zangar") is preferred over direct adjective forms because it allows proper handling of gender and number inflections, while common gender adjectives like "descontente" can use the `postag_replace='AQ0C$10'` pattern.
Applied to files:
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (java-kotlin)
This is a major enhancement in the disambiguator.
@jaumeortola
If this breaks Premium, I will need your help. It is a major enhancement.