marcoagpinto commented Nov 27, 2025

This is a major enhancement in the disambiguator.

@jaumeortola
If this breaks Premium, I will need your help. It is a major enhancement.

Summary by CodeRabbit

  • Bug Fixes
    • Improved Portuguese grammar checking by adding a targeted disambiguation for certain verb sequences, reducing false positives and improving verb-form recognition.

coderabbitai bot commented Nov 27, 2025

Walkthrough

Adds a new Portuguese disambiguation rule NP_UNKNOWN_VMIP3S0_VMM02S0_20251127 to pt/disambiguation.xml (inserted twice), matching NP/UNKNOWN tokens with VMIP3S0 + VMM02S0 sequences and removing the VMM02S0 postag when VMIP3S0 is present.
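
The effect described in the walkthrough can be sketched roughly in Python. This is only an illustration of the logic, not the XML rule itself: the `Token` class, the `disambiguate` function, the assumed token order (proper noun or unknown word, then at most one optional RM/RN/RG adverb, then the verb), the sample sentence and the `NPFS000` tag are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    postags: frozenset  # all POS readings the tagger assigned to this word


# Context tags named in the walkthrough; how they are ordered is an assumption.
PROPER_OR_UNKNOWN = re.compile(r"NP.+|UNKNOWN")
OPTIONAL_ADVERBS = {"RM", "RN", "RG"}


def has_tag(token, pattern):
    """True if any reading of the token fully matches the tag pattern."""
    return any(pattern.fullmatch(tag) for tag in token.postags)


def disambiguate(tokens):
    """Drop the VMM02S0 reading where VMIP3S0 is also present and the
    preceding context is a proper noun or unknown word."""
    result = list(tokens)
    for i, tok in enumerate(tokens):
        if not {"VMIP3S0", "VMM02S0"} <= tok.postags:
            continue
        j = i - 1
        # Assumed: at most one adverb may sit between the noun and the verb.
        if j >= 0 and tokens[j].postags & OPTIONAL_ADVERBS:
            j -= 1
        if j >= 0 and has_tag(tokens[j], PROPER_OR_UNKNOWN):
            result[i] = Token(tok.text, tok.postags - {"VMM02S0"})
    return result


# Made-up sentence for illustration: "Maria não canta".
sentence = [
    Token("Maria", frozenset({"NPFS000"})),
    Token("não", frozenset({"RN"})),
    Token("canta", frozenset({"VMIP3S0", "VMM02S0"})),
]
print(sorted(disambiguate(sentence)[2].postags))  # ['VMIP3S0']
```

A verb with both readings but without the proper-noun or unknown-word context is left unchanged in this sketch.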

Changes

Cohort / File(s): Portuguese disambiguation file, new rule (duplicate)
  languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml
Change Summary: Adds NP_UNKNOWN_VMIP3S0_VMM02S0_20251127 disambiguation rule (appears twice). Rule targets NP.+ or UNKNOWN (optionally RM/RN/RG) with VMIP3S0 and VMM02S0 and removes the VMM02S0 postag when VMIP3S0 is present; includes example comments.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Confirm duplicate insertion is intentional or remove the unintended duplicate.
  • Verify rule pattern and postag removal logic align with existing disambiguation conventions.
  • Check example comments for correctness.
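
One way to check the first item is to count rule IDs in the file. The sketch below is a rough illustration using Python's standard `xml.etree.ElementTree`; the `duplicate_rule_ids` helper and the simplified sample document are hypothetical, and the real disambiguation.xml may use entities or rule groups that this minimal version does not handle.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_rule_ids(xml_text):
    """Return the sorted list of rule ids that occur more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        rule.get("id") for rule in root.iter("rule") if rule.get("id")
    )
    return sorted(rule_id for rule_id, n in counts.items() if n > 1)


# Simplified stand-in for the real file; element layout is an assumption.
sample = """<rules>
  <rule id="NP_UNKNOWN_VMIP3S0_VMM02S0_20251127" name="first copy"/>
  <rule id="NP_UNKNOWN_VMIP3S0_VMM02S0_20251127" name="second copy"/>
  <rule id="SOME_OTHER_RULE" name="unrelated"/>
</rules>"""

print(duplicate_rule_ids(sample))
# ['NP_UNKNOWN_VMIP3S0_VMM02S0_20251127']
```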

Suggested reviewers

  • jaumeortola
  • p-goulart
  • susanaboatto

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Title check: ❓ Inconclusive
    Explanation: The title '[pt] Major enhancement in disambiguation.xml' is vague and overly broad, using the generic term 'enhancement' without specifying what was actually changed.
    Resolution: Replace with a more specific title that describes the actual change, such as '[pt] Add VMIP3S0/VMM02S0 disambiguation rule' or '[pt] Add Portuguese verb disambiguation rule'.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed
    Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage: ✅ Passed
    Explanation: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 261c202 and 85e3da0.

📒 Files selected for processing (1)
  • languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (java-kotlin)

coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 777b82c and 261c202.

📒 Files selected for processing (1)
  • languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11433
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3716-3716
Timestamp: 2025-07-09T06:30:58.965Z
Learning: marcoagpinto uses temporary placeholder values like "temp_off" in LanguageTool rule attributes while waiting for nightly test results before enabling rules, as part of his testing methodology to ensure rules don't require minor adjustments.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11557
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:4068-4078
Timestamp: 2025-10-08T06:41:55.119Z
Learning: In Portuguese disambiguation rules, when a pattern targets tokens with multiple verb readings (e.g., VMIP3S0 and VMM02S0), including an exception like `<exception postag_regexp='yes' postag='AQ.+'/>` on subsequent participle tokens is necessary to prevent false positives, even though it reduces the number of matches. The rule can still fire successfully for cases where the participle doesn't have an adjective reading, as confirmed by testing showing ~4683 matches across ~950k sentences.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11648
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:4198-4199
Timestamp: 2025-11-18T06:43:24.213Z
Learning: In Portuguese disambiguation rules, when removing verb readings from words that have both verb and noun tags (e.g., "testes" with VMSP2S0 and NCMP000, "pacientes" with VMSP2S0 and NCCP000), combining similar verb form tags like VMP00PM and VMSP2S0 in a single pattern is appropriate because they create similar noun/verb ambiguities in the same grammatical contexts (after verbs/adjectives/pronouns). Adding AQ exceptions to such patterns would break functionality by filtering out valid cases where past participles should be treated as nouns (e.g., "enviados" in "devem ser enviados").
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11196
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:3884-3892
Timestamp: 2025-01-17T08:46:06.456Z
Learning: In Portuguese disambiguation rules, when handling multiple verb forms for the same pattern, use separate rules for each verb form tag instead of combining them with multiple <wd> tags in a single rule.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11490
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/grammar.xml:40665-40665
Timestamp: 2025-08-25T03:54:09.419Z
Learning: In Portuguese LanguageTool rules, the word "porta" appears as both a verb and a noun, creating inherent POS tagging ambiguity. Rules targeting "porta" in compound constructions should not exclude verb tags as exceptions because this would break the rule's functionality when "porta" functions as a noun in valid compound word patterns.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11345
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3801-3804
Timestamp: 2025-04-26T11:44:57.044Z
Learning: In Portuguese rules for LanguageTool, using `<match no='X' postag='V.+' postag_regexp='yes'>verb</match>` pattern with infinitive verbs (like "zangar") is preferred over direct adjective forms because it allows proper handling of gender and number inflections, while common gender adjectives like "descontente" can use the `postag_replace='AQ0C$10'` pattern.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11515
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3714-3717
Timestamp: 2025-09-19T05:58:04.682Z
Learning: In Portuguese LanguageTool rules, using `<match no='X' postag='AQ.+' postag_regexp='yes'>adjective</match>` automatically preserves gender and number inflection from the matched token without requiring postag_replace, allowing adjectives like "qualificado" to properly inflect to "qualificado/qualificada/qualificados/qualificadas" based on the original matched form.
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11415
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/rules/pt/style.xml:3721-3723
Timestamp: 2025-06-28T05:00:46.342Z
Learning: In Portuguese LanguageTool rules, capture group references $1 through $9 work correctly in postag_replace patterns. The parsing issue only occurs when a single-digit group reference is immediately followed by digits (like $1000), creating ambiguity between "group 1 + literal 000" vs "group 1000". Using braces ${1}000 disambiguates this case.

marcoagpinto merged commit c9ca477 into master Nov 27, 2025
6 checks passed