Note: I am a Chinese-speaking user. This issue was investigated through a debugging session with Claude (Anthropic's AI assistant), which helped me systematically isolate the root cause. The findings below reflect what we discovered together.
Chinese search returns drastically fewer results on Android Chromium (Chrome/Edge) due to Intl.Segmenter ICU data inconsistency
Summary
Searching for Chinese terms on Android Chrome/Edge returns far fewer results than the same search on desktop Chrome, desktop Firefox, or Android Firefox. Through systematic debugging, the root cause was traced to Intl.Segmenter producing different word boundaries on Android Chromium, which causes pagefind to query different index shards and return an incomplete result set.
Environment
| | Value |
|---|---|
| Pagefind version | 1.5.2 |
| Build command | pagefind_extended --site all --force-language zh |
| Affected browsers | Android Chrome 148, Android Edge (EdgA) 148 |
| Unaffected browsers | Desktop Chrome, Desktop Firefox, Android Firefox |
| Index language | zh (155,922 pages) |
Steps to reproduce
- Build a pagefind index with --force-language zh over a large Chinese corpus.
- Open the search page on Android Chrome or Android Edge.
- Search for a common two-character Chinese word, e.g. 小说 (novel).
- Compare the result count with the same search on desktop Chrome or Android Firefox.
Observed vs expected behaviour
| Browser | Search term | Results returned |
|---|---|---|
| Desktop Chrome | 小说 | 3,262 |
| Android Firefox | 小说 | 3,262 |
| Android Chrome 148 | 小说 | 44 |
| Android Edge 148 | 小说 | 44 |
Expected: all browsers return the same result count for the same query.
Root cause analysis
1. Intl.Segmenter produces different word boundaries
The search logic inside pagefind-worker.js uses Intl.Segmenter with granularity: "word" to split the query term before looking up index shards. The same call produces different output across platforms:
Desktop Chrome / Android Firefox (correct):
const seg = new Intl.Segmenter('zh', { granularity: 'word' });
[...seg.segment('小说')]
// → [{ segment: '小说', index: 0, isWordLike: false }]
// One word-chunk → queries one index shard
Android Chrome / Android Edge (incorrect):
const seg = new Intl.Segmenter('zh', { granularity: 'word' });
[...seg.segment('小说')]
// → [{ segment: '小', index: 0, isWordLike: true },
// { segment: '说', index: 1, isWordLike: true }]
// Two word-chunks → queries two different index shards
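To make the comparison above repeatable across devices, a small probe can report whether a multi-character word comes back split into single characters. This is a minimal sketch: `probeSegmentation` is a hypothetical helper, not part of Pagefind, and it only wraps the standard `Intl.Segmenter` API.

```javascript
// Hypothetical helper (not part of Pagefind): reports how a word segmenter
// splits a probe string. A different segmenter can be injected for comparison.
function probeSegmentation(
  text,
  segmenter = new Intl.Segmenter('zh', { granularity: 'word' })
) {
  const segments = Array.from(segmenter.segment(text), (s) => s.segment);
  // Count code points rather than UTF-16 units so that supplementary-plane
  // Han characters are counted once.
  const codePoints = Array.from(text).length;
  return {
    segments,
    // True when the text was split into exactly one segment per character.
    perCharacter: codePoints > 1 && segments.length === codePoints,
  };
}

// Per the report above, perCharacter would be expected to be true on the
// affected Android Chromium builds and false elsewhere.
console.log(probeSegmentation('小说'));
```

Running the same probe in each browser's console gives a single boolean to compare, which is easier to collect from testers than raw segment dumps.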
2. Different word-chunks → different index shards loaded
The two word-chunks each map to a different .pf_index shard. This was confirmed by inspecting the Network panel:
| Browser | Shards loaded | Results |
|---|---|---|
| Desktop Chrome | zh_c9d4d46.pf_index (1 shard) | 3,262 |
| Android Chrome | zh_768561a.pf_index + zh_de3f9ff.pf_index (2 shards) | 44 |
Because the results for the two single-character chunks are AND-intersected, the combined set is tiny: 44 results, against 3,262 from the single correct shard.
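The shrinking effect of the intersection can be illustrated with a toy model. This is only a sketch under stated assumptions: `toyIndex`, its page IDs, and `searchAll` are invented for illustration and do not reflect Pagefind's real index format or internals.

```javascript
// Toy model only: hypothetical page IDs, not Pagefind's real index format.
const toyIndex = new Map([
  ['小说', new Set([1, 2, 3, 4, 5])], // pages indexed under the whole word
  ['小', new Set([2, 6])],            // pages where 小 was indexed as its own token
  ['说', new Set([2, 7])],            // pages where 说 was indexed as its own token
]);

// Return the pages that contain every query term (AND semantics).
function searchAll(index, terms) {
  const sets = terms.map((term) => index.get(term) ?? new Set());
  if (sets.length === 0) return new Set();
  return sets.reduce(
    (acc, set) => new Set([...acc].filter((id) => set.has(id)))
  );
}

console.log([...searchAll(toyIndex, ['小说'])]);     // [ 1, 2, 3, 4, 5 ]
console.log([...searchAll(toyIndex, ['小', '说'])]); // [ 2 ]
```

In this model, the whole-word query hits every page indexed under 小说, while the split query keeps only pages that happen to contain both characters as separate tokens.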
3. Why Android Chromium behaves differently
Android Chromium ships a trimmed ICU (International Components for Unicode) dataset to reduce APK size. This dataset lacks the Chinese lexicon data required for word-boundary detection, so Intl.Segmenter falls back to character-level splitting for most Chinese words. Desktop Chromium and all Firefox builds ship the full ICU dataset.
This has been reported as a Chromium-level limitation:
4. The segmentation happens inside a Web Worker
Pagefind runs its search logic (including the Intl.Segmenter call) inside pagefind-worker.js as a Web Worker, isolated from the main thread.
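To check that the worker context shows the same segmentation as the main thread, a throwaway worker can be created from a Blob URL. This is a hedged sketch: `workerSource` and `segmentInWorker` are hypothetical names, the code is independent of pagefind-worker.js, and it assumes the page's Content Security Policy allows blob: workers.

```javascript
// Source for a throwaway worker that segments whatever text it receives.
const workerSource = `
  self.onmessage = (event) => {
    const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
    const segments = Array.from(segmenter.segment(event.data), (s) => s.segment);
    self.postMessage(segments);
  };
`;

// Browser-only: start the worker from a Blob URL and resolve with its segments.
function segmentInWorker(text) {
  return new Promise((resolve, reject) => {
    const url = URL.createObjectURL(
      new Blob([workerSource], { type: 'text/javascript' })
    );
    const worker = new Worker(url);
    const cleanup = () => {
      worker.terminate();
      URL.revokeObjectURL(url);
    };
    worker.onmessage = (event) => { cleanup(); resolve(event.data); };
    worker.onerror = (event) => { cleanup(); reject(new Error(event.message)); };
    worker.postMessage(text);
  });
}

// In a browser console: segmentInWorker('小说').then(console.log);
```

Comparing this output with the same `Intl.Segmenter` call on the main thread shows whether the two contexts agree on a given device.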
Additional observations
- The pagefind-entry.json and pf_meta files are identical across browsers (same Content-Length, same content).
- WebAssembly is supported and returns true on both affected and unaffected browsers, so WASM is not the issue.
- The problem is confirmed at the JS API level, not just the UI layer:
// Run directly in browser console
const pagefind = await import('/pagefind/pagefind.js');
await pagefind.init();
const results = await pagefind.search('小说');
console.log(results.unfilteredResultCount);
// → 3262 on desktop Chrome / Android Firefox
// → 44 on Android Chrome / Android Edge
Impact
Any Pagefind deployment serving Chinese content is affected for every user on Android Chrome or Android Edge, browsers that account for a substantial share of mobile traffic in Chinese-speaking regions.