pg_pinyin includes:

- SQL baseline (`sql/pinyin.sql`)
- Rust extension (`src/lib.rs`)
Two API families are exposed for romanization, each with overloads:

- `pinyin_char_romanize(text)`
- `pinyin_char_romanize(text, suffix text)`
- `pinyin_word_romanize(text)`
- `pinyin_word_romanize(text, suffix text)`
- `pinyin_word_romanize(tokenizer_input anyelement)` (overload; use pdb tokenizer input such as `name::pdb.icu::text[]`)
- `pinyin_word_romanize(tokenizer_input anyelement, suffix text)` (overload with a user-table suffix)
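As a quick sketch of the plain-text forms (the exact romanized output depends on the active dictionaries, so no result is shown):

```sql
-- Character-by-character romanization of mixed Chinese/ASCII text
SELECT pinyin_char_romanize('郑爽ABC');

-- Word-level romanization of the same text
SELECT pinyin_word_romanize('郑爽ABC');
```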
Recommended usage:

- char romanization + `pg_trgm`
- word romanization + `pg_search`
```sql
CREATE EXTENSION IF NOT EXISTS pg_pinyin;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE voice (
  id bigserial PRIMARY KEY,
  description text NOT NULL,
  pinyin text GENERATED ALWAYS AS (public.pinyin_char_romanize(description)) STORED
);

CREATE INDEX voice_pinyin_trgm_idx ON voice USING gin (pinyin gin_trgm_ops);
```
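Once rows are inserted (next step), the trigram index can serve substring matches on the romanized column. A sketch, assuming the default romanization of `郑` contains the syllable `zheng`:

```sql
-- pg_trgm's gin_trgm_ops index accelerates LIKE/ILIKE substring searches
SELECT id, description
FROM voice
WHERE pinyin ILIKE '%zheng%';
```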
```sql
INSERT INTO voice (description) VALUES ('郑爽ABC');
SELECT id, description, pinyin FROM voice;
```

You can provide custom dictionary tables in schema `pinyin`, selected by suffix:
- `pinyin.pinyin_mapping_suffix1`
- `pinyin.pinyin_words_suffix1`

When calling `...(..., '_suffix1')`, romanization uses a merged dictionary:

- base tables (`pinyin_mapping` / `pinyin_words`)
- suffix tables (`pinyin_mapping_suffix1` / `pinyin_words_suffix1`), which take higher priority
Example:

```sql
CREATE TABLE IF NOT EXISTS pinyin.pinyin_mapping_suffix1 (
  character text PRIMARY KEY,
  pinyin text NOT NULL
);

CREATE TABLE IF NOT EXISTS pinyin.pinyin_words_suffix1 (
  word text PRIMARY KEY,
  pinyin text NOT NULL
);

INSERT INTO pinyin.pinyin_mapping_suffix1 (character, pinyin)
VALUES ('郑', '|zhengx|')
ON CONFLICT (character) DO UPDATE SET pinyin = EXCLUDED.pinyin;

INSERT INTO pinyin.pinyin_words_suffix1 (word, pinyin)
VALUES ('郑爽', '|zhengx| |shuangx|')
ON CONFLICT (word) DO UPDATE SET pinyin = EXCLUDED.pinyin;

SELECT public.pinyin_char_romanize('郑爽ABC', '_suffix1');
SELECT public.pinyin_word_romanize('郑爽ABC'::pdb.icu::text[], '_suffix1');
```

The Rust extension now embeds these dictionaries at build time:
- `sql/data/pinyin_mapping.csv`
- `sql/data/pinyin_token.csv`
- `sql/data/pinyin_words.csv`
During `CREATE EXTENSION pg_pinyin`, it seeds the dictionary tables under schema `pinyin` using PostgreSQL `COPY` from the embedded CSV payloads (with a SQL `INSERT` fallback). No separate `sql/load_data.sql` step is required for extension usage.
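To sanity-check the seeding after installation, you can count the rows in the seeded tables (a sketch; the exact counts depend on the bundled CSV data):

```sql
SELECT count(*) AS mapping_rows FROM pinyin.pinyin_mapping;
SELECT count(*) AS token_rows   FROM pinyin.pinyin_token;
SELECT count(*) AS word_rows    FROM pinyin.pinyin_words;
```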
Data prep logic is in this repo:

- `scripts/data/generate_extension_data.py` (optimized pipeline)
- `scripts/generate_data.sh` (one-shot entrypoint)
The project includes mozillazg/pinyin-data as a submodule at `third_party/pinyin-data`.

Initialize the submodule:

```shell
git submodule update --init third_party/pinyin-data
```

Generate all extension data in one command:

```shell
./scripts/generate_data.sh
```

Notes:
- char/token data is generated from `third_party/pinyin-data`.
- word data uses `hanzi_pinyin_words.csv` when available; otherwise an empty `pinyin_words.csv` is created.
Generated outputs:

- `sql/data/pinyin_mapping.csv`
- `sql/data/pinyin_token.csv`
- `sql/data/pinyin_words.csv`
If needed, override the source repo:

```shell
PINYIN_DATA_DIR=/path/to/pinyin-data ./scripts/generate_data.sh
```

To use the SQL baseline directly, load the script and its data manually:

```shell
psql "$PGURL" -f sql/pinyin.sql
psql "$PGURL" \
  -v mapping_file='/absolute/path/sql/data/pinyin_mapping.csv' \
  -v token_file='/absolute/path/sql/data/pinyin_token.csv' \
  -v words_file='/absolute/path/sql/data/pinyin_words.csv' \
  -f sql/load_data.sql
```

pgTAP:
```shell
./test/pgtap/run.sh
```

Rust extension tests:

```shell
cargo pgrx test pg18 --features pg18
```

Dockerfiles:

- `docker/Dockerfile.test-trixie`
- `docker/Dockerfile.release-trixie`
Defaults now use upstream addresses (no mirror rewrite):

- base image: `postgres:18.3-trixie`
- apt source: base image defaults
- rustup/cargo source: upstream defaults
Build the test image:

```shell
docker build -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie .

# optional: pin the pg_search version at build time
# docker build --build-arg PG_SEARCH_VERSION=0.21.10 -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie .
```

Build the release image:

```shell
docker build -f docker/Dockerfile.release-trixie -t pg_pinyin/release:trixie .
```

The Dockerfiles use BuildKit cache mounts for Rust download/index caches. If needed, ensure BuildKit is enabled:

```shell
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie .
```

Tokenization-only benchmark script: `scripts/benchmark_pg18.sh`
It measures:

- SQL char tokenizer: `characters2romanize(name)` (cold + warm)
- Rust char tokenizer: `pinyin_char_romanize(name)` (cold + warm)
- Rust char tokenizer (user suffix overlay): `pinyin_char_romanize(name, '_<suffix>')` (cold + warm)
- SQL word tokenizer: `icu_romanize(name::pdb.icu::text[])` (cold + warm, if `pg_search` exists)
- Rust word tokenizer with tokenizer input: `pinyin_word_romanize(name::pdb.icu::text[])` (cold + warm)
- Rust word tokenizer with suffix overlay: `pinyin_word_romanize(name::pdb.icu::text[], '_<suffix>')` (cold + warm)
- Rust word tokenizer with plain text input: `pinyin_word_romanize(name)` (cold + warm)
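Each scenario above boils down to a timed query of roughly this shape (a sketch; `bench_names` and its `name` column are hypothetical stand-ins for the table the script generates):

```sql
-- One benchmark probe: romanize every row and capture timing/buffer stats
EXPLAIN (ANALYZE, BUFFERS, MEMORY, SUMMARY)
SELECT pinyin_char_romanize(name) FROM bench_names;
```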
All benchmark queries use `EXPLAIN (ANALYZE, BUFFERS, MEMORY, SUMMARY)`.

Run:

```shell
ROWS=2000 USER_TABLE_SUFFIX=_bench PGURL=postgres://localhost/postgres ./scripts/benchmark_pg18.sh
```

Session command:

```shell
ROWS=2000 USER_TABLE_SUFFIX=_bench PGURL=postgres://postgres@localhost:5432/postgres ./scripts/benchmark_pg18.sh
```

Latest run (PG18, ROWS=2000, 2026-03-01):
Character mode:

| Scenario | Cold (ms) | Warm (ms) | Speedup vs SQL (cold / warm) |
|---|---|---|---|
| SQL baseline (`characters2romanize`) | 9159.522 | 9253.374 | 1.0x / 1.0x |
| Rust (`pinyin_char_romanize`) | 80.719 | 28.094 | 113.5x / 329.4x |
| Rust + suffix (`pinyin_char_romanize(name, '_bench')`) | 162.319 | 30.233 | 56.4x / 306.1x |
Word mode (`pg_search` tokenizer input):

| Scenario | Cold (ms) | Warm (ms) | Speedup vs SQL (cold / warm) |
|---|---|---|---|
| SQL baseline (`icu_romanize(name::pdb.icu::text[])`) | 242.889 | 237.337 | 1.0x / 1.0x |
| Rust (`pinyin_word_romanize(name::pdb.icu::text[])`) | 331.327 | 72.444 | 0.7x / 3.3x |
| Rust + suffix (`pinyin_word_romanize(name::pdb.icu::text[], '_bench')`) | 760.339 | 77.731 | 0.3x / 3.1x |
| Rust plain text (`pinyin_word_romanize(name)`) | 336.215 | 35.460 | 0.7x / 6.7x |
Times above are `Execution Time` in milliseconds from `EXPLAIN (ANALYZE, BUFFERS, MEMORY, SUMMARY)`. Cold runs for the Rust base paths force a dictionary version bump before execution to simulate a first-use cache load.

Suffix dictionaries are cached on first use and reused across statements. If suffix tables are updated, clear the cache with `public.pinyin_clear_suffix_cache('_suffix')` (or `public.pinyin_clear_suffix_cache()` for all suffixes).
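For example, after modifying the `_suffix1` overlay tables from the earlier example, the cached overlay can be dropped so the next call re-reads the tables:

```sql
-- Drop the cached '_suffix1' overlay only
SELECT public.pinyin_clear_suffix_cache('_suffix1');

-- Or drop every cached suffix overlay
SELECT public.pinyin_clear_suffix_cache();
```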
- Tidy up the data generation pipeline and expand the word dictionary coverage.
- Support user-provided dictionaries and allow romanization against a specific dictionary set.
- Provide a smooth upgrade path for extension dictionaries and user dictionaries.
- Improve English handling (including stemming).
- Provide better examples without `pg_search`.
- Improve the performance/memory balance (for example, evaluate frozen hash structures vs. table lookups).
All dictionaries remain runtime-editable:

- `pinyin.pinyin_mapping`
- `pinyin.pinyin_words`
- `pinyin.pinyin_token`

No extension rebuild is required after table updates.
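A sketch of such a runtime edit, assuming the base `pinyin.pinyin_mapping` table shares the `(character, pinyin)` shape of the suffix tables shown earlier (the replacement value is illustrative, not a real dictionary entry):

```sql
-- Override one character's reading in the base dictionary at runtime
UPDATE pinyin.pinyin_mapping
SET pinyin = '|zheng|'
WHERE character = '郑';
```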
If you use the SQL-based romanization method (`sql/pinyin.sql`), cite:

- CN115905297A: 一种支持拼音检索和排序的方法及系统 (A Method and System Supporting Pinyin Retrieval and Sorting)
BibTeX:

```bibtex
@patent{CN115905297A,
  author  = {Liang Zhanzhao},
  title   = {一种支持拼音检索和排序的方法及系统},
  number  = {CN115905297A},
  country = {CN},
  year    = {2023},
  url     = {https://patents.google.com/patent/CN115905297A/zh}
}
```

- Hanzi word-to-pinyin TSV source: tsroten/dragonmapper (`hanzi_pinyin_words.tsv`)