Skip to content

Conversation

@inesakochur
Copy link
Contributor

@inesakochur inesakochur commented Nov 27, 2025

BRAK_PRZECINKA_ANI: Added antipattern for the case where 'lub' shouldn't be separated with comma since it's not contrasting to another action/item.
SKROTY_Z_KROPKA: Added antipattern for the band 'Red Hot Chilli Peppers'.
PREP_CASUS: Added antipattern for chain name 'Pizza Hut' which is invariable with any prepostion.
JEZYK_POLSKI: Added antipattern for the name of institution which is written form uppercase.
ZDANIA_ZLOZONE: Added 2 antipattern for the words (szkołę & duszą) when they are mistagged as verbs while they are nouns in the sentences.
ADJ_SUBST_ADJ_UNIFY: Added antipattern for the sentence with complex SV agreement with dative.

Summary by CodeRabbit

  • New Features
    • Significantly expanded Polish language correction rules with improved detection for spelling, grammar, and formatting errors.
    • Enhanced coverage for compound words, idioms, fixed expressions, abbreviations, proper nouns, numerals, and case usage.
    • Added comprehensive new suggestions for stylistic improvements and common language mistakes.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 27, 2025

Walkthrough

The change adds numerous new grammar rules to the Polish language module's XML configuration file, expanding coverage for spelling errors, grammatical constructs, idiomatic expressions, abbreviations, and stylistic issues specific to Polish language.

Changes

Cohort / File(s) Summary
Polish Grammar Rule Expansion
languagetool-language-modules/pl/src/main/resources/org/languagetool/rules/pl/grammar.xml
Added extensive new rules addressing spelling corrections (e.g., "szczęt" → "szept", "żż" → "ż"), grammatical constructs, idiomatic expressions and phraseology, abbreviations and formatting standards, proper capitalization of geographical names and proper nouns, compound words, stylistic pleonasms and redundancies, date and numeral formatting, and case-sensitive corrections for common Polish language errors. Multiple new <rulegroup> and <rule> elements introduced following established LanguageTool XML schema.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Polish language accuracy validation: Each rule addition requires verification that spelling corrections, grammatical suggestions, and idiomatic expressions are linguistically correct and contextually appropriate.
  • Pattern correctness: Confirm that regex patterns within rule definitions are syntactically valid and match intended error scenarios without false positives.
  • XML structure integrity: Verify well-formedness and consistency of new rule elements with existing schema conventions.
  • Conflict detection: Cross-check that new rules do not overlap or conflict with existing rules in the file, particularly for similar error categories.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Title check ⚠️ Warning The title focuses on 'agreement and punctuation rules' but the changeset primarily adds spelling corrections, idioms, formatting rules, and a broad range of grammar patterns across Polish language rules. Consider a more accurate title like '[pl] Add comprehensive Polish grammar and spelling rules' to better reflect the extensive additions covering spelling, idioms, formatting, and multiple language correction patterns.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9ca477 and 6667f2b.

📒 Files selected for processing (1)
  • languagetool-language-modules/pl/src/main/resources/org/languagetool/rules/pl/grammar.xml (6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-08T06:41:55.119Z
Learnt from: marcoagpinto
Repo: languagetool-org/languagetool PR: 11557
File: languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml:4068-4078
Timestamp: 2025-10-08T06:41:55.119Z
Learning: In Portuguese disambiguation rules, when a pattern targets tokens with multiple verb readings (e.g., VMIP3S0 and VMM02S0), including an exception like `<exception postag_regexp='yes' postag='AQ.+'//>` on subsequent participle tokens is necessary to prevent false positives, even though it reduces the number of matches. The rule can still fire successfully for cases where the participle doesn't have an adjective reading, as confirmed by testing showing ~4683 matches across ~950k sentences.

Applied to files:

  • languagetool-language-modules/pl/src/main/resources/org/languagetool/rules/pl/grammar.xml
🔇 Additional comments (6)
languagetool-language-modules/pl/src/main/resources/org/languagetool/rules/pl/grammar.xml (6)

7333-7340: LGTM!

The dative agreement antipattern is well-structured. The postag patterns correctly target dative case markers, and the verb regex uses proper non-capturing group syntax. The example appropriately demonstrates the pattern.


12431-12437: LGTM!

The "Pizza Hut" antipattern is correctly structured with proper case sensitivity. The marker placement and example appropriately demonstrate this proper noun exception.


22888-22894: LGTM!

The band name antipattern correctly uses regexp="yes" to match both common spellings ("Chili"/"Chilli"). The four-token pattern properly identifies the full name to prevent false positives on abbreviation rules.


18570-18574: Due to repository cloning failures, I cannot directly access the file to verify the marker element placement against other antipatterns in the JEZYK_POLSKI rule or confirm consistency patterns. However, based on the available context and learnings about LanguageTool antipatterns:

The learnings reference known quirks in LanguageTool antipattern structure (like the scope='next' bug that requires workarounds with empty tokens and exceptions), suggesting that unusual pattern structures may be intentional design choices rather than errors.


Verify marker element placement against surrounding antipatterns in the same rule.

Without access to compare this antipattern against others in the JEZYK_POLSKI rule, the concern about the missing <marker> element cannot be definitively resolved. Confirm whether antipatterns without explicit markers (where the entire pattern serves as the exception) are intentional in this rule, or if markers should wrap the specific tokens being excluded.


179-193: Based on my analysis of the provided XML snippet, I can perform a direct code review without repository access:

XML Structure Analysis:

The two antipatterns shown are structurally valid:

  1. First antipattern (lines 179-187):

    • Valid nesting: <token postag="verb:praet.*" postag_regexp="yes"/><marker><token>szkolę</token></marker><example>
    • Pattern logic: Prevents false positives when "szkolę" follows a past participle verb
    • Example is relevant: "szkolę" here is a noun (school, accusative form), not a verb
  2. Second antipattern (lines 188-193):

    • Valid nesting with multiple tokens: <token>duszą</token><token postag="subst:.*" postag_regexp="yes"/> → marker with inflected="yes" on "być"
    • Pattern logic: Matches nominal constructions where "duszą" (soul, instrumental) precedes a noun
    • Example is relevant: "Duszą posłuszeństwa" (soul of obedience) where "duszą" is instrumental noun

Observable Issues:

Without access to the Polish language model's postag tag definitions, I cannot definitively verify that verb:praet.* and subst:.* match the actual tags used by the Polish tagger. This requires:

  • Confirmation that verb:praet.* correctly targets Polish past participles/preterite forms
  • Confirmation that subst:.* correctly targets Polish nouns/substantives
  • Verification against languagetool-language-modules/pl/src/main/java/org/languagetool/tagging/pl/ tagger definitions

The XML structure and attribute usage (postag_regexp="yes", inflected="yes") follow standard LanguageTool conventions as seen in Portuguese rules from retrieved learnings.


3344-3350: The antipattern correctly implements the intended matching behavior.

The skip="-1" attribute functions as documented—it greedily allows any number of tokens between the finite verb and the conjunction, matching cases where a conjunction connects nouns rather than contrasting actions. The example ("...studia lub kursy...") accurately demonstrates a valid false-positive case where no comma is required because "lub" links two substantives, not two verb phrases. The pattern is consistent with LanguageTool's documented behavior for negative skip values.

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@inesakochur
Copy link
Contributor Author

@coderabbitai if everything looks fine, please suggest a reviewer for my PR

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 27, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Get the PR details including files changed
gh pr view 11674 --json files,additions,deletions,reviews,assignees --repo languagetool-org/languagetool

# Check for CODEOWNERS file
fd -t f "CODEOWNERS" -H

# Get recent commits to the Polish language module to identify contributors
git log --pretty=format:"%an|%ae" --max-count=50 -- "languagetool-language-modules/pl/" | sort | uniq -c | sort -rn | head -20

# Check past PRs that modified Polish grammar files
gh pr list --limit 20 --state merged --search "path:languagetool-language-modules/pl/src/main/resources/org/languagetool/rules/pl/grammar.xml" --json number,author,reviews,title --repo languagetool-org/languagetool

Length of output: 276615


🏁 Script executed:

#!/bin/bash

# Get PR details and file changes
gh pr view 11674 --json files,additions,deletions,author,body --repo languagetool-org/languagetool

# Get the actual diff to review the changes
gh pr diff 11674 --repo languagetool-org/languagetool | head -300

# Find recent contributors to Polish language module
git log --since="6 months ago" --pretty=format:"%an|%ae" -- "languagetool-language-modules/pl/" | sort | uniq -c | sort -rn | head -10

# Check past PRs that modified Polish grammar.xml and their reviewers
gh pr list --limit 30 --state all --search "path:languagetool-language-modules/pl/src/main/resources/org/languagetool/rules/pl/grammar.xml" --json number,author,reviews,state,title --repo languagetool-org/languagetool | jq -r '.[] | select(.reviews | length > 0) | {number, author: .author.login, reviewers: [.reviews[].author.login] | unique, state, title}'

Length of output: 11621


@inesakochur The changes look technically sound! I've reviewed the additions and here's what I found:

Technical Review ✅

Positive aspects:

  • All antipatterns are properly formatted with correct XML syntax
  • Each antipattern includes concrete examples, which is excellent practice
  • The patterns target specific false positive scenarios:
    • Mistagging of "szkolę" and "duszą" as verbs (ZDANIA_ZLOZONE)
    • Comma rules for "lub" in certain contexts (BRAK_PRZECINKA_ANI)
    • Proper noun exceptions: "Red Hot Chilli Peppers", "Pizza Hut", "Szkoła Noblistów Polskich"
    • Complex subject-verb agreement with dative (ADJ_SUBST_ADJ_UNIFY)
  • Changes are conservative (only additions, no modifications)

Reviewer Challenge 🤔

Unfortunately, there's limited visibility of active Polish language reviewers in recent history:

  • Most recent Polish PRs received only automated reviews
  • Only 4 contributors to the Polish module in the last 6 months
  • No clear pattern of dedicated Polish language maintainers

Recommendations

Given the clean technical quality and narrow scope, you could:

  1. Self-merge after waiting a reasonable time (24-48 hours) if no one objects, since:

    • You're adding defensive antipatterns (reducing false positives)
    • All changes include test examples
    • The impact is low-risk
  2. Tag repository maintainers who might know Polish contributors:

    • Check if there are any Polish language team members in the organization
  3. Proceed with confidence - these antipatterns follow established patterns in the codebase and should be safe to merge

The changes demonstrate good understanding of the rule system and proper grammar rule authoring practices! 👍

@inesakochur
Copy link
Contributor Author

@milekpl I'm Inesa Kochur, linguist from WebSpellChecker. Can you please review my PR?

@milekpl
Copy link
Member

milekpl commented Nov 27, 2025

Yes, but these anti patterns you propose are too broad (for Noblistów Polskich) or too narrow. "Szkolę średnią" is a typo anyway, but we should make a more general antipattern if you want to supress the punctuation rule...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants