Chinese search returns drastically fewer results on Android Chromium (Chrome/Edge) due to `Intl.Segmenter` ICU data inconsistency

> **Note:** I am a Chinese-speaking user. This issue was investigated through a debugging session with Claude (Anthropic's AI assistant), which helped me systematically isolate the root cause. The findings below reflect what we discovered together.

---

## Summary

Searching for Chinese terms on Android Chrome or Android Edge returns far fewer results than the same search on desktop Chrome, desktop Firefox, or Android Firefox. The root cause was traced to `Intl.Segmenter` producing different word boundaries on Android Chromium, which causes Pagefind to load different index shards and return an incomplete result set.

---

## Environment

| | Value |
|---|---|
| Pagefind version | 1.5.2 |
| Build command | `pagefind_extended --site all --force-language zh` |
| Affected browsers | Android Chrome 148, Android Edge (EdgA) 148 |
| Unaffected browsers | Desktop Chrome, Desktop Firefox, Android Firefox |
| Index language | `zh` (155,922 pages) |

---

## Steps to reproduce

1. Build a pagefind index with `--force-language zh` over a large Chinese corpus.
2. Open the search page on **Android Chrome or Android Edge**.
3. Search for a common two-character Chinese word, e.g. `小说` (novel).
4. Compare the result count with the same search on desktop Chrome or Android Firefox.

---

## Observed vs expected behaviour

| Browser | Search term | Results returned |
|---|---|---|
| Desktop Chrome | 小说 | 3,262 |
| Android Firefox | 小说 | 3,262 |
| Android Chrome 148 | 小说 | **44** |
| Android Edge 148 | 小说 | **44** |

Expected: all browsers return the same result count for the same query. Observed: Android Chromium returns 44 of 3,262 results, roughly 1.3% of the count seen elsewhere.

---

## Root cause analysis

### 1. `Intl.Segmenter` produces different word boundaries

The search logic inside `pagefind-worker.js` uses `Intl.Segmenter` with `granularity: "word"` to split the query term before looking up index shards. The same call produces different output across platforms:

**Desktop Chrome / Android Firefox (correct):**
```js
const seg = new Intl.Segmenter('zh', { granularity: 'word' });
[...seg.segment('小说')]
// → [{ segment: '小说', index: 0, isWordLike: false }]
// One word-chunk → queries one index shard
```

**Android Chrome / Android Edge (incorrect):**
```js
const seg = new Intl.Segmenter('zh', { granularity: 'word' });
[...seg.segment('小说')]
// → [{ segment: '小', index: 0, isWordLike: true },
//    { segment: '说', index: 1, isWordLike: true }]
// Two word-chunks → queries two different index shards
```
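To check a given browser quickly, the comparison above can be wrapped in a small diagnostic. This is only a sketch: `segmentQuery` and `isCharacterLevelSplit` are hypothetical helper names, not part of Pagefind, and the heuristic assumes that a multi-character term returned as one segment per character indicates missing dictionary data.

```js
// Hypothetical helper (not part of Pagefind): segment a term the same way the
// examples above do and return plain objects that are easy to log or compare.
function segmentQuery(term, locale = 'zh') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(segmenter.segment(term), ({ segment, index, isWordLike }) => ({
    segment,
    index,
    isWordLike,
  }));
}

// Hypothetical heuristic: true when a term was split into more than one
// segment and every segment is exactly one code point long.
function isCharacterLevelSplit(segments) {
  return (
    segments.length > 1 &&
    segments.every(({ segment }) => Array.from(segment).length === 1)
  );
}

const segments = segmentQuery('小说');
console.log(segments, isCharacterLevelSplit(segments));
```

Running this in the console of each browser gives a one-line answer that can be pasted into a report alongside the user agent string.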

### 2. Different word-chunks → different index shards loaded

The two word-chunks each map to a different `.pf_index` shard. This was confirmed by inspecting the Network panel:

| Browser | Shards loaded | Results |
|---|---|---|
| Desktop Chrome | `zh_c9d4d46.pf_index` (1 shard) | 3,262 |
| Android Chrome | `zh_768561a.pf_index` + `zh_de3f9ff.pf_index` (2 shards) | 44 |

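Pagefind intersects (ANDs) the matches for the two single-character terms. Because the index was presumably built with `小说` segmented as one word, the entries for `小` and `说` only cover pages where those characters were indexed as separate words, so the intersection (44 pages) is a small fraction of what the single correct shard returns (3,262 pages).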
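The effect of intersecting per-token matches can be illustrated with a toy inverted index. This is a simplified model under stated assumptions, not Pagefind's actual data structures or shard format: `toyIndex`, `intersect`, and `searchTokens` are hypothetical names, and the page IDs are made-up illustrative data.

```js
// Toy inverted index: token -> set of page IDs (illustrative data only).
// The two-character word was indexed as one token; the single characters
// only appear where they were indexed as standalone words.
const toyIndex = new Map([
  ['小说', new Set([1, 2, 3, 4, 5])],
  ['小', new Set([2, 9])],
  ['说', new Set([2, 7])],
]);

// Return the page IDs present in both sets.
function intersect(a, b) {
  return new Set([...a].filter((id) => b.has(id)));
}

// Look up every token and AND the resulting page sets together.
function searchTokens(index, tokens) {
  if (tokens.length === 0) return new Set();
  const postings = tokens.map((token) => index.get(token) ?? new Set());
  return postings.reduce((acc, set) => intersect(acc, set));
}

console.log(searchTokens(toyIndex, ['小说']).size); // 5
console.log(searchTokens(toyIndex, ['小', '说']).size); // 1
```

The same query text yields a much smaller result set once it is split into tokens that the index never stored for most pages.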

### 3. Why Android Chromium behaves differently

Android Chromium ships a trimmed ICU (International Components for Unicode) dataset to reduce APK size. This dataset lacks the Chinese dictionary data required for word-boundary detection, so `Intl.Segmenter` falls back to character-level splitting for most Chinese words. Desktop Chromium and the Firefox builds tested here (desktop and Android) do have segmentation data for Chinese and return whole words.

This has been reported as a Chromium-level limitation:
- https://bugs.chromium.org/p/chromium/issues/detail?id=1264577

### 4. The segmentation happens inside a Web Worker

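Pagefind runs its search logic (including the `Intl.Segmenter` call) inside `pagefind-worker.js` as a Web Worker, isolated from the main thread. A worker has its own global scope, so replacing or patching `Intl.Segmenter` on the page itself would presumably not change what the worker sees.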

---

## Additional observations

- The `pagefind-entry.json` and `pf_meta` files are identical across browsers (same `Content-Length`, same content).
- `WebAssembly` is available on both affected and unaffected browsers (the support check returns `true`), so WASM is not the issue.
- The problem is confirmed at the JS API level, not just the UI layer:

```js
// Run directly in browser console
const pagefind = await import('/pagefind/pagefind.js');
await pagefind.init();
const results = await pagefind.search('小说');
console.log(results.unfilteredResultCount);
// → 3262 on desktop Chrome / Android Firefox
// → 44   on Android Chrome / Android Edge
```

---

## Impact

Any Pagefind deployment serving Chinese content is affected for all users on Android Chrome or Android Edge, which likely account for a substantial share of mobile traffic in Chinese-speaking regions.

