Chinese search returns drastically fewer results on Android Chromium (Chrome/Edge) due to Intl.Segmenter ICU data inconsistency #1176

@ronggeshi

Note: I am a Chinese-speaking user. This issue was investigated through a debugging session with Claude (Anthropic's AI assistant), which helped me systematically isolate the root cause. The findings below reflect what we discovered together.


Summary

Searching for Chinese terms on Android Chrome/Edge returns far fewer results than the same search on desktop Chrome, desktop Firefox, or Android Firefox. Through systematic debugging, the root cause was traced to Intl.Segmenter producing different word boundaries on Android Chromium, which causes pagefind to query different index shards and return an incomplete result set.


Environment

| Item | Value |
| --- | --- |
| Pagefind version | 1.5.2 |
| Build command | pagefind_extended --site all --force-language zh |
| Affected browsers | Android Chrome 148, Android Edge (EdgA) 148 |
| Unaffected browsers | Desktop Chrome, Desktop Firefox, Android Firefox |
| Index language | zh (155,922 pages) |

Steps to reproduce

  1. Build a pagefind index with --force-language zh over a large Chinese corpus.
  2. Open the search page on Android Chrome or Android Edge.
  3. Search for a common two-character Chinese word, e.g. 小说 (novel).
  4. Compare the result count with the same search on desktop Chrome or Android Firefox.

Observed vs expected behaviour

| Browser | Search term | Results returned |
| --- | --- | --- |
| Desktop Chrome | 小说 | 3,262 |
| Android Firefox | 小说 | 3,262 |
| Android Chrome 148 | 小说 | 44 |
| Android Edge 148 | 小说 | 44 |

Expected: all browsers return the same result count for the same query.


Root cause analysis

1. Intl.Segmenter produces different word boundaries

The search logic inside pagefind-worker.js uses Intl.Segmenter with granularity: "word" to split the query term before looking up index shards. The same call produces different output across platforms:

Desktop Chrome / Android Firefox (correct):

const seg = new Intl.Segmenter('zh', { granularity: 'word' });
[...seg.segment('小说')]
// → [{ segment: '小说', index: 0, isWordLike: false }]
// One word-chunk → queries one index shard

Android Chrome / Android Edge (incorrect):

const seg = new Intl.Segmenter('zh', { granularity: 'word' });
[...seg.segment('小说')]
// → [{ segment: '小', index: 0, isWordLike: true },
//    { segment: '说', index: 1, isWordLike: true }]
// Two word-chunks → queries two different index shards
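
To check a given browser for this symptom without Pagefind involved, a small probe along the following lines may help. This is a minimal sketch: the helper names `segmentWords` and `isSplitPerCharacter` are hypothetical and not part of Pagefind, and the check only detects that a multi-character string came back as one segment per character. It does not prove why.

```javascript
// Returns the segments Intl.Segmenter produces for `text` at word granularity.
// (Hypothetical helper, not part of Pagefind.)
function segmentWords(text, locale = 'zh') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// True when a multi-character string was split into exactly one segment
// per code point, which is the symptom described above.
// (Hypothetical helper, not part of Pagefind.)
function isSplitPerCharacter(text, segments) {
  const chars = Array.from(text);
  return (
    chars.length > 1 &&
    segments.length === chars.length &&
    segments.every((seg, i) => seg === chars[i])
  );
}

const probe = '小说';
const probeSegments = segmentWords(probe);
console.log(
  probeSegments,
  isSplitPerCharacter(probe, probeSegments)
    ? 'split per character'
    : 'not split per character'
);
```

Running the same probe in the page console and inside a worker would show whether both contexts segment the query the same way.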

2. Different word-chunks → different index shards loaded

The two word-chunks each map to a different .pf_index shard. This was confirmed by inspecting the Network panel:

| Browser | Shards loaded | Results |
| --- | --- | --- |
| Desktop Chrome | zh_c9d4d46.pf_index (1 shard) | 3,262 |
| Android Chrome | zh_768561a.pf_index + zh_de3f9ff.pf_index (2 shards) | 44 |

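Each of the two single-character chunks resolves to its own shard, and neither is the shard that holds the entry for 小说. The AND-intersection of the results for 小 and 说 therefore contains only 44 pages, against 3,262 for the correctly segmented query.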
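
The effect of the intersection can be illustrated with a toy model. This is a simplified sketch and not Pagefind's actual implementation: the index contents, the page ids, and the names `toyIndex`, `intersect`, and `searchAll` are all hypothetical, and the only assumption carried over from the report is that multiple query terms are combined with AND.

```javascript
// Hypothetical toy index: indexed word -> set of page ids containing it.
// The index was built with whole-word tokens, so '小' and '说' only match
// pages where they were indexed as standalone words.
const toyIndex = new Map([
  ['小说', new Set([1, 2, 3, 4, 5])],
  ['小', new Set([2, 6])],
  ['说', new Set([2, 7, 8])],
]);

// Set intersection: ids present in both a and b.
function intersect(a, b) {
  return new Set([...a].filter((id) => b.has(id)));
}

// AND-search: a page matches only if it matches every query term.
function searchAll(index, terms) {
  if (terms.length === 0) return new Set();
  return terms
    .map((term) => index.get(term) ?? new Set())
    .reduce((acc, pages) => intersect(acc, pages));
}

// Query segmented as one word: looks up the '小说' entry directly.
console.log(searchAll(toyIndex, ['小说']).size); // → 5

// Query segmented per character: intersects two unrelated entries.
console.log(searchAll(toyIndex, ['小', '说']).size); // → 1
```

In this model the per-character query never touches the 小说 entry at all, which matches the pattern seen in the Network panel.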

3. Why Android Chromium behaves differently

Android Chromium ships a trimmed ICU (International Components for Unicode) dataset to reduce APK size. This dataset lacks the Chinese lexicon data required for word-boundary detection, so Intl.Segmenter falls back to character-level splitting for most Chinese words. Desktop Chromium and all Firefox builds ship the full ICU dataset.

This has been reported as a Chromium-level limitation:

4. The segmentation happens inside a Web Worker

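Pagefind runs its search logic (including the Intl.Segmenter call) inside pagefind-worker.js as a Web Worker, isolated from the main thread. Because a worker has its own global scope, replacing or polyfilling Intl.Segmenter on the page does not change the segmentation performed inside the worker.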


Additional observations

  • The pagefind-entry.json and pf_meta files are identical across browsers (same Content-Length, same content).
  • WebAssembly is supported on both affected and unaffected browsers (the support check returns true on each), so WASM is not the issue.
  • The problem is confirmed at the JS API level, not just the UI layer:
// Run directly in browser console
const pagefind = await import('/pagefind/pagefind.js');
await pagefind.init();
const results = await pagefind.search('小说');
console.log(results.unfilteredResultCount);
// → 3262 on desktop Chrome / Android Firefox
// → 44   on Android Chrome / Android Edge

Impact

Any Pagefind deployment serving Chinese content is affected for all users on Android Chrome or Android Edge, which together account for a substantial share of mobile traffic in Chinese-speaking regions.
