fix(core): preserve camelCase proper nouns in title case #3438

Open
johndecker3 wants to merge 2 commits into Automattic:master from johndecker3:fix-title-case-camel-case-proper-nouns

@johndecker3

Issues

Related: #831 (closed) — the symptom was reported then, and #834 routed ProperNounCapitalizationLinter away from make_title_case to dodge it. The underlying bug in make_title_case remains and still affects the UseTitleCase heading linter; this PR fixes it at the source.

Description

The title-case linter (UseTitleCase) renders camelCase brand names like iCloud, iOS, macOS, and eBay as ICloud, IOS, MacOS, and EBay in headings — overwriting the intentional lowercase first letter with an uppercase one. Reproduction:

### apple launched icloud   →   ### Apple Launched ICloud   (today)
### apple launched icloud   →   ### Apple Launched iCloud   (with this fix)

Root cause

try_make_title_case (harper-core/src/title_case.rs) runs two passes per word:

  1. If the word is a proper noun, write the dictionary's canonical capitalization into the output buffer character-by-character.
  2. Unconditionally apply the title-case rules (uppercase first letter unless the word is a special article / conjunction / short preposition).

The two passes don't coordinate. For a canonical like iCloud, pass 1 writes i,C,l,o,u,d and pass 2 immediately overwrites position 0 with I, producing ICloud. Words whose canonical form already starts with uppercase (e.g., JavaScript, MacBook, VideoPress, NASA) escape the bug because pass 2's first-letter uppercase is a no-op on them.
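
The interaction between the two passes can be reproduced in a toy model. This is only a sketch: `buggy_title_case_word` and its `canonical` parameter are hypothetical stand-ins for the per-word logic, not the actual code in harper-core/src/title_case.rs, and the special-word rules (articles, conjunctions, short prepositions) are left out.

```rust
// Toy model of the two uncoordinated passes; not the real harper-core code.
// `canonical` stands in for the dictionary's proper-noun spelling, if any.
fn buggy_title_case_word(word: &str, canonical: Option<&str>) -> String {
    // Pass 1: copy the canonical capitalization (or the word itself).
    let mut out: Vec<char> = canonical.unwrap_or(word).chars().collect();

    // Pass 2: uppercase the first letter without checking what pass 1 wrote.
    if let Some(first) = out.first_mut() {
        *first = first.to_ascii_uppercase();
    }

    out.into_iter().collect()
}

fn main() {
    // Lowercase-initial canonical: pass 2 clobbers the `i`.
    assert_eq!(buggy_title_case_word("icloud", Some("iCloud")), "ICloud");
    // Uppercase-initial canonical: pass 2 is a no-op, so the bug stays hidden.
    assert_eq!(
        buggy_title_case_word("javascript", Some("JavaScript")),
        "JavaScript"
    );
    // Ordinary word with no canonical form.
    assert_eq!(buggy_title_case_word("launched", None), "Launched");
}
```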

Relationship to PR #834

PR #834 ("refactor(core): proper noun linters use canonical casing and JSON file") closed #831 by changing ProperNounCapitalizationLinter to read canonical capitalization directly from proper_noun_rules.json instead of calling make_title_case. That routed one caller around the buggy function but left make_title_case itself broken — and UseTitleCase still calls it. This PR fixes the function so all callers benefit, present and future.

Fix

A targeted heuristic in pass 1 detects when the canonical form is intentionally camelCase — the canonical's first alphabetic character is lowercase AND at least one other character is uppercase. When that condition holds, pass 2's first-letter rule is skipped (pass 1 has already written exactly what's wanted).
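
As a rough illustration, the condition described above could be written as a standalone predicate. The function name `is_intentional_camel_case` is an assumption made for this sketch; the PR's actual code may be structured differently.

```rust
// Hypothetical predicate: true when the first alphabetic character is
// lowercase AND at least one later character is uppercase.
fn is_intentional_camel_case(canonical: &str) -> bool {
    let mut letters = canonical.chars().filter(|c| c.is_alphabetic());
    match letters.next() {
        Some(first) if first.is_lowercase() => letters.any(|c| c.is_uppercase()),
        _ => false,
    }
}

fn main() {
    // Lowercase initial plus an uppercase letter elsewhere: skip pass 2.
    for word in ["iCloud", "iOS", "macOS", "eBay"] {
        assert!(is_intentional_camel_case(word));
    }
    // Uppercase initial, or no uppercase at all: pass 2 still runs.
    for word in ["JavaScript", "NASA", "Apple", "apple"] {
        assert!(!is_intentional_camel_case(word));
    }
}
```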

Why the heuristic isn't just "skip if proper noun"

The dictionary has entries like Apple/ONg (the company) and apple/~NwgS (the fruit). Looking up apple returns canonical "apple" (lowercase) due to the dual-entry ambiguity. If the proper-noun pass unconditionally won, apple would title-case to apple instead of Apple. The heuristic requires both a lowercase first AND an uppercase elsewhere, which keeps apple → Apple working while fixing icloud → iCloud.
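
The difference between the two rules can be seen side by side in a small sketch. All function names here are hypothetical, and the `canonical` argument simply stands in for whatever the dictionary lookup returns; this is not the code from the PR.

```rust
// Hypothetical predicate: lowercase first letter AND an uppercase letter later.
fn is_intentional_camel_case(canonical: &str) -> bool {
    let mut letters = canonical.chars().filter(|c| c.is_alphabetic());
    match letters.next() {
        Some(first) if first.is_lowercase() => letters.any(|c| c.is_uppercase()),
        _ => false,
    }
}

// Uppercase only the first character, leaving the rest untouched.
fn uppercase_first(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

// Rejected rule: any canonical spelling wins outright.
fn naive_title_case_word(word: &str, canonical: Option<&str>) -> String {
    match canonical {
        Some(c) => c.to_string(),
        None => uppercase_first(word),
    }
}

// Heuristic rule: only an intentionally camelCase canonical skips pass 2.
fn heuristic_title_case_word(word: &str, canonical: Option<&str>) -> String {
    match canonical {
        Some(c) if is_intentional_camel_case(c) => c.to_string(),
        Some(c) => uppercase_first(c),
        None => uppercase_first(word),
    }
}

fn main() {
    // The lookup for "apple" yields the lowercase canonical "apple".
    assert_eq!(naive_title_case_word("apple", Some("apple")), "apple");
    assert_eq!(heuristic_title_case_word("apple", Some("apple")), "Apple");
    // The camelCase canonical is preserved under the heuristic.
    assert_eq!(heuristic_title_case_word("icloud", Some("iCloud")), "iCloud");
}
```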

Demo

How Has This Been Tested?

Three new tests added:

  • title_case::tests::preserves_icloud_camel_case_mid_sentence: direct unit test for the title-case function with iCloud mid-sentence ("she backs up photos to icloud" → "She Backs up Photos to iCloud").
  • title_case::tests::preserves_icloud_camel_case_as_first_word: direct unit test for the "even the first word should keep its lowercase letter" behavior ("icloud syncs your files" → "iCloud Syncs Your Files").
  • linting::use_title_case::tests::preserves_camel_case_proper_nouns_in_heading: full-linter regression test for the original symptom ("### apple launched icloud" → "### Apple Launched iCloud").

cargo test -p harper-core --lib title_case and cargo test -p harper-core --lib use_title_case both green (49/49 and 7/7 respectively). The existing fixes_video_press test confirms that proper nouns with uppercase-first canonical forms (VideoPress) continue to work. The existing tests for special articles, conjunctions, and short prepositions also still pass.

I used iCloud rather than iPhone in tests because iPhone, iPad, iPod, iMac, and iTunes don't yet have the proper-noun flag on master — that's coming in a separate PR (#[fill in PR 1 number]). Once that PR merges, this fix automatically benefits those words too with no further test changes needed.

AI Disclosure

  • I am a human and didn't use any AI.
  • I used LLM features of my editor, but not an agent.
  • I used an AI agent interactively.
  • I am an agent or I got an agent to do the work autonomously.

If Your PR Implements or Enhances a Linter

  • I made up the sentences in the unit tests.
  • The sentences in the unit tests were generated by an AI.
  • I'm using examples from the bug report / feature request.
  • I collected real-world sentences for the unit tests.

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes
  • I have considered splitting this into smaller pull requests.

@johndecker3
Author

I tried to rerun the checks because it looked like a timeout issue rather than a code issue, but I cannot due to admin limitations. Please let me know if there is anything I need to do to address the failed checks.

@hippietrail
Collaborator

There is OrthFlags in dict_word_metadata_orthography.rs to help with this. It provides info about the spellings (plural) in the dictionary, independent of the spelling in the document.

But note that it's fuzzy at the moment because the dictionary is case-folded.

It can tell you that at least one spelling that went into the case-folded entry was mixed case or started with a lowercase letter. This is independent from the .is_proper_noun() property. So things like "iPad" and "nvm" can be proper nouns and title-case-making code can make a good guess about not uppercasing their first letter.
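
To make the "fuzzy because case-folded" point concrete, here is a minimal sketch of how flags from several spellings might accumulate on one case-folded entry. `ToyOrthFlags` is a hand-rolled stand-in, not the real `OrthFlags` type; only the name `LOWER_CAMEL` comes from this thread, and the other flag names and the classification rules are assumptions.

```rust
use std::collections::HashMap;

// Hand-rolled toy bit flags; a stand-in for, not a copy of, harper's OrthFlags.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct ToyOrthFlags(u8);

impl ToyOrthFlags {
    const LOWERCASE: Self = Self(1 << 0);
    const TITLECASE: Self = Self(1 << 1);
    const LOWER_CAMEL: Self = Self(1 << 2);

    fn contains(self, other: Self) -> bool {
        (self.0 & other.0) == other.0
    }

    fn insert(&mut self, other: Self) {
        self.0 |= other.0;
    }

    // Classify a single spelling; shapes this sketch doesn't track get no flag.
    fn of_spelling(spelling: &str) -> Self {
        let letters: Vec<char> = spelling.chars().filter(|c| c.is_alphabetic()).collect();
        let Some((first, rest)) = letters.split_first() else {
            return Self::default();
        };
        let rest_has_upper = rest.iter().any(|c| c.is_uppercase());
        if first.is_lowercase() && rest_has_upper {
            Self::LOWER_CAMEL
        } else if first.is_lowercase() {
            Self::LOWERCASE
        } else if first.is_uppercase() && !rest_has_upper {
            Self::TITLECASE
        } else {
            Self::default()
        }
    }
}

// Case-fold the keys, OR-ing together the flags of every spelling that lands there.
fn build_case_folded(spellings: &[&str]) -> HashMap<String, ToyOrthFlags> {
    let mut dict: HashMap<String, ToyOrthFlags> = HashMap::new();
    for s in spellings {
        dict.entry(s.to_lowercase())
            .or_default()
            .insert(ToyOrthFlags::of_spelling(s));
    }
    dict
}

fn main() {
    let dict = build_case_folded(&["Apple", "apple", "iPad", "nvm"]);

    // "Apple" and "apple" share one entry, so both flags are set on it.
    let apple = dict["apple"];
    assert!(apple.contains(ToyOrthFlags::TITLECASE));
    assert!(apple.contains(ToyOrthFlags::LOWERCASE));
    assert!(!apple.contains(ToyOrthFlags::LOWER_CAMEL));

    // "iPad" contributes LOWER_CAMEL; "nvm" only LOWERCASE.
    assert!(dict["ipad"].contains(ToyOrthFlags::LOWER_CAMEL));
    assert!(dict["nvm"].contains(ToyOrthFlags::LOWERCASE));
}
```

The entry can say that some spelling had a given shape, but not which one, which is why a query on it is a guess rather than a guarantee.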

I fixed one place this happened about six months ago but don't remember where. Knowing the names of the structs should help you find it.

For the underlying case-folding problem itself, there is #2630 which will address it and related issues around letter case, acronyms & initialisms, etc.

@johndecker3
Author

Thanks — that's a much cleaner path than my string-pattern heuristic. I'll switch the check to metadata.is_lower_camel() (equivalent to metadata.orth_info.contains(OrthFlags::LOWER_CAMEL)). Confirming the equivalence:

iCloud / iPad / iPhone / iPod / iMac / iTunes / iOS / macOS / eBay → LOWER_CAMEL set, first-letter rule skipped (fix applies).
JavaScript / MacBook / VideoPress / NASA → not LOWER_CAMEL, first-letter rule runs as a no-op on the already-uppercase initial (existing behavior preserved).
Apple / apple (the dual-entry ambiguity I flagged in the PR description) → neither spelling is LOWER_CAMEL, first-letter rule still produces Apple (still correct).

One question on the "place I fixed about six months ago" — was that the OrthographicConsistency rule added in PR #2107? That one uses OrthFlags::LOWER_CAMEL at orthographic_consistency.rs:89 for canonical-spelling suggestions. If yes, I'll model my fix on the same pattern; if you were thinking of somewhere else, a pointer would be helpful.

I read through #2630 — that refactor will sharpen these queries (per-spelling vs case-folded) and adds more convenience around OrthFlags, but my fix shouldn't conflict with it: switching to is_lower_camel() puts me on the same API surface you're expanding there. Happy to revisit once #2630 lands if you want, but I'd suggest this PR can stay independent.

@hippietrail
Collaborator

> One question on the "place I fixed about six months ago": was that the OrthographicConsistency rule added in PR #2107? That one uses OrthFlags::LOWER_CAMEL at orthographic_consistency.rs:89 for canonical-spelling suggestions. If yes, I'll model my fix on the same pattern; if you were thinking of somewhere else, a pointer would be helpful.

I believe Elijah made the OrthographicConsistency rule, but I may well have added that logic to it. Let me check the blame. Nope, looks like that's all his work from about 3 months ago, and apparently I contributed OrthFlags 9 months ago already. Time flies!

It sounds like you're equipped with enough for the job either way (-:

@johndecker3
Author

Thanks! Used metadata.is_lower_camel(). Worked great as expected.

Development

Successfully merging this pull request may close these issues.

Correcting Apple product names is broken
