aiyou178/pg_pinyin

pg_pinyin

Chinese documentation (中文说明)

pg_pinyin includes:

  1. SQL baseline (sql/pinyin.sql)
  2. Rust extension (src/lib.rs)

Extension API (Reduced)

Only two functions are exposed for romanization, each with several overloads:

  • pinyin_char_romanize(text)
  • pinyin_char_romanize(text, suffix text)
  • pinyin_word_romanize(text)
  • pinyin_word_romanize(text, suffix text)
  • pinyin_word_romanize(tokenizer_input anyelement) (overload; use pdb tokenizer input such as name::pdb.icu::text[])
  • pinyin_word_romanize(tokenizer_input anyelement, suffix text) (overload with user-table suffix)

Recommended usage:

  1. char romanization + pg_trgm
  2. word romanization + pg_search

Generated Column Example (Raw SQL)

CREATE EXTENSION IF NOT EXISTS pg_pinyin;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE voice (
  id bigserial PRIMARY KEY,
  description text NOT NULL,
  pinyin text GENERATED ALWAYS AS (public.pinyin_char_romanize(description)) STORED
);

CREATE INDEX voice_pinyin_trgm_idx ON voice USING gin (pinyin gin_trgm_ops);

INSERT INTO voice (description) VALUES ('郑爽ABC');
SELECT id, description, pinyin FROM voice;
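
With the generated column and GIN index in place, lookups can go through pg_trgm. The queries below are a minimal sketch, not taken from the project: the search term 'zhengshuang' is hypothetical, and the sketch assumes the romanized output contains the pinyin syllables as lowercase Latin letters (the exact output format is not shown in this README).

```sql
-- Fuzzy match on the stored romanization.
-- % is pg_trgm's similarity operator (default threshold 0.3).
SELECT id, description, pinyin,
       similarity(pinyin, 'zhengshuang') AS score
FROM voice
WHERE pinyin % 'zhengshuang'
ORDER BY score DESC
LIMIT 10;

-- Substring match; a gin_trgm_ops index can also serve LIKE / ILIKE.
SELECT id, description
FROM voice
WHERE pinyin ILIKE '%zheng%';
```

Check the plan with EXPLAIN to confirm that voice_pinyin_trgm_idx is used; on very small tables the planner may prefer a sequential scan.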

User Dictionary Suffix Tables

You can provide custom dictionary tables in schema pinyin by suffix:

  • pinyin.pinyin_mapping_suffix1
  • pinyin.pinyin_words_suffix1

When a romanization function is called with the suffix argument '_suffix1' (for example, pinyin_char_romanize(text, '_suffix1')), it uses a merged dictionary:

  1. base tables (pinyin_mapping / pinyin_words)
  2. suffix tables (pinyin_mapping_suffix1 / pinyin_words_suffix1) with higher priority

Example:

CREATE TABLE IF NOT EXISTS pinyin.pinyin_mapping_suffix1 (
  character text PRIMARY KEY,
  pinyin text NOT NULL
);

CREATE TABLE IF NOT EXISTS pinyin.pinyin_words_suffix1 (
  word text PRIMARY KEY,
  pinyin text NOT NULL
);

INSERT INTO pinyin.pinyin_mapping_suffix1 (character, pinyin)
VALUES ('郑', '|zhengx|')
ON CONFLICT (character) DO UPDATE SET pinyin = EXCLUDED.pinyin;

INSERT INTO pinyin.pinyin_words_suffix1 (word, pinyin)
VALUES ('郑爽', '|zhengx| |shuangx|')
ON CONFLICT (word) DO UPDATE SET pinyin = EXCLUDED.pinyin;

SELECT public.pinyin_char_romanize('郑爽ABC', '_suffix1');
SELECT public.pinyin_word_romanize('郑爽ABC'::pdb.icu::text[], '_suffix1');
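
The priority rule for suffix tables can be pictured as a lookup that prefers the suffix row and falls back to the base row. The query below only illustrates that rule; the extension performs the merge internally. It assumes that the base table pinyin.pinyin_mapping has the same (character, pinyin) columns as the suffix table, which this README does not state.

```sql
-- Illustration of "suffix overrides base" for two sample characters.
-- Assumption: pinyin.pinyin_mapping has columns (character, pinyin).
SELECT c.ch,
       b.pinyin                     AS base_pinyin,
       s.pinyin                     AS suffix_pinyin,
       COALESCE(s.pinyin, b.pinyin) AS effective_pinyin
FROM (VALUES ('郑'), ('爽')) AS c(ch)
LEFT JOIN pinyin.pinyin_mapping         AS b ON b.character = c.ch
LEFT JOIN pinyin.pinyin_mapping_suffix1 AS s ON s.character = c.ch;
```

A character with no row in the suffix table keeps its base reading, because COALESCE falls through to b.pinyin.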

Extension-Bundled Dictionary Data

The Rust extension now embeds these dictionaries at build time:

  • sql/data/pinyin_mapping.csv
  • sql/data/pinyin_token.csv
  • sql/data/pinyin_words.csv

During CREATE EXTENSION pg_pinyin, it seeds dictionary tables under schema pinyin using PostgreSQL COPY from embedded CSV payloads (with SQL INSERT fallback). No separate sql/load_data.sql step is required for extension usage.
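
One way to confirm that the dictionaries were seeded is to count rows in the three tables after creating the extension. This is a sketch that relies only on the table names listed in this README; the expected row counts are not documented here, so no numbers are given.

```sql
CREATE EXTENSION IF NOT EXISTS pg_pinyin;

-- Row counts per seeded dictionary table
SELECT 'pinyin_mapping' AS table_name, count(*) AS row_count FROM pinyin.pinyin_mapping
UNION ALL
SELECT 'pinyin_token', count(*) FROM pinyin.pinyin_token
UNION ALL
SELECT 'pinyin_words', count(*) FROM pinyin.pinyin_words;
```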

Data Prep (Moved + One-Shot)

Data prep logic is in this repo:

  • scripts/data/generate_extension_data.py (optimized pipeline)
  • scripts/generate_data.sh (one-shot entrypoint)

The project includes mozillazg/pinyin-data as a submodule at:

  • third_party/pinyin-data

Initialize submodule:

git submodule update --init third_party/pinyin-data

Generate all extension data in one command:

./scripts/generate_data.sh

Notes:

  • char/token data is generated from third_party/pinyin-data.
  • word data uses hanzi_pinyin_words.csv when available; otherwise an empty pinyin_words.csv is created.

Generated outputs:

  • sql/data/pinyin_mapping.csv
  • sql/data/pinyin_token.csv
  • sql/data/pinyin_words.csv

If needed, override source repo:

PINYIN_DATA_DIR=/path/to/pinyin-data ./scripts/generate_data.sh

Load SQL Baseline Data

psql "$PGURL" -f sql/pinyin.sql

psql "$PGURL" \
  -v mapping_file='/absolute/path/sql/data/pinyin_mapping.csv' \
  -v token_file='/absolute/path/sql/data/pinyin_token.csv' \
  -v words_file='/absolute/path/sql/data/pinyin_words.csv' \
  -f sql/load_data.sql

Tests

pgTAP:

./test/pgtap/run.sh

Rust extension tests:

cargo pgrx test pg18 --features pg18

Docker (General Upstream)

Dockerfiles:

  • docker/Dockerfile.test-trixie
  • docker/Dockerfile.release-trixie

Defaults now use upstream addresses (no mirror rewrite):

  • base image: postgres:18.3-trixie
  • apt source: base image defaults
  • rustup/cargo source: upstream defaults

Build test image:

docker build -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie .

# optional: pin pg_search version at build time
# docker build --build-arg PG_SEARCH_VERSION=0.21.10 -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie .

Build release image:

docker build -f docker/Dockerfile.release-trixie -t pg_pinyin/release:trixie .

The Dockerfiles use BuildKit cache mounts for Rust download/index caches. If needed, ensure BuildKit is enabled:

DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.test-trixie -t pg_pinyin/test:trixie .

Benchmark

Tokenization-only benchmark script:

  • scripts/benchmark_pg18.sh

It measures:

  • SQL char tokenizer: characters2romanize(name) (cold + warm)
  • Rust char tokenizer: pinyin_char_romanize(name) (cold + warm)
  • Rust char tokenizer (user suffix overlay): pinyin_char_romanize(name, '_<suffix>') (cold + warm)
  • SQL word tokenizer: icu_romanize(name::pdb.icu::text[]) (cold + warm, if pg_search exists)
  • Rust word tokenizer with tokenizer input: pinyin_word_romanize(name::pdb.icu::text[]) (cold + warm)
  • Rust word tokenizer with suffix overlay: pinyin_word_romanize(name::pdb.icu::text[], '_<suffix>') (cold + warm)
  • Rust word tokenizer with plain text input: pinyin_word_romanize(name) (cold + warm)

All benchmark queries use EXPLAIN (ANALYZE, BUFFERS, MEMORY, SUMMARY).
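
A single measurement of this kind might look like the sketch below. The bench_names table and its contents are hypothetical stand-ins; the script builds its own data set. The MEMORY option of EXPLAIN requires PostgreSQL 17 or later.

```sql
-- Hypothetical sample data: 2000 short strings mixing Chinese and ASCII
CREATE TEMP TABLE bench_names AS
SELECT '郑爽ABC' || g::text AS name
FROM generate_series(1, 2000) AS g;

-- Execution Time in the output is the figure a benchmark would record
EXPLAIN (ANALYZE, BUFFERS, MEMORY, SUMMARY)
SELECT public.pinyin_char_romanize(name)
FROM bench_names;
```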

Run:

ROWS=2000 USER_TABLE_SUFFIX=_bench PGURL=postgres://localhost/postgres ./scripts/benchmark_pg18.sh

Benchmark Session (PG18)

Session command:

ROWS=2000 USER_TABLE_SUFFIX=_bench PGURL=postgres://postgres@localhost:5432/postgres ./scripts/benchmark_pg18.sh

Latest run (PG18, ROWS=2000, 2026-03-01):

Character mode:

| Scenario | Cold (ms) | Warm (ms) | Speedup vs SQL (cold / warm) |
|---|---|---|---|
| SQL baseline (characters2romanize) | 9159.522 | 9253.374 | 1.0x / 1.0x |
| Rust (pinyin_char_romanize) | 80.719 | 28.094 | 113.5x / 329.4x |
| Rust + suffix (pinyin_char_romanize(name, '_bench')) | 162.319 | 30.233 | 56.4x / 306.1x |

Word mode (pg_search tokenizer input):

| Scenario | Cold (ms) | Warm (ms) | Speedup vs SQL (cold / warm) |
|---|---|---|---|
| SQL baseline (icu_romanize(name::pdb.icu::text[])) | 242.889 | 237.337 | 1.0x / 1.0x |
| Rust (pinyin_word_romanize(name::pdb.icu::text[])) | 331.327 | 72.444 | 0.7x / 3.3x |
| Rust + suffix (pinyin_word_romanize(name::pdb.icu::text[], '_bench')) | 760.339 | 77.731 | 0.3x / 3.1x |
| Rust plain text (pinyin_word_romanize(name)) | 336.215 | 35.460 | 0.7x / 6.7x |
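
The speedup figures are consistent with dividing the SQL baseline time by the Rust time in the same column. As a quick check, the arithmetic for two of the rows can be reproduced directly from the reported times:

```sql
SELECT round(9159.522 / 80.719, 1)  AS char_cold_speedup,  -- 113.5
       round(9253.374 / 28.094, 1)  AS char_warm_speedup,  -- 329.4
       round(242.889 / 331.327, 1)  AS word_cold_speedup,  -- 0.7
       round(237.337 / 72.444, 1)   AS word_warm_speedup;  -- 3.3
```

A value below 1.0x means the Rust path was slower than the SQL baseline in that run, which happens only for cold word-mode runs.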

Times above are the Execution Time values, in milliseconds, reported by EXPLAIN (ANALYZE, BUFFERS, MEMORY, SUMMARY). Cold runs for the Rust base paths force a dictionary version bump before execution to simulate a first-use cache load. Suffix dictionaries are cached on first use and reused across statements. If suffix tables are updated, clear the cache with public.pinyin_clear_suffix_cache('_suffix') (or public.pinyin_clear_suffix_cache() to clear all suffixes).
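
Because suffix dictionaries are cached, an edit to a suffix table is followed by a cache clear before the change is expected to show up. The sequence below is a sketch that reuses the _suffix1 tables from the user dictionary example; the pinyin value is illustrative, and it assumes the cache function takes the same suffix string that is passed to the romanization functions.

```sql
-- Change an override in the suffix table (value is illustrative)
INSERT INTO pinyin.pinyin_words_suffix1 (word, pinyin)
VALUES ('郑爽', '|zheng| |shuang|')
ON CONFLICT (word) DO UPDATE SET pinyin = EXCLUDED.pinyin;

-- Drop the cached copy of this suffix dictionary so the next call reloads it
SELECT public.pinyin_clear_suffix_cache('_suffix1');

-- Later calls with this suffix should now see the updated entry
SELECT public.pinyin_word_romanize('郑爽ABC'::text, '_suffix1');
```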

Roadmap

  1. Tidy up the data generation pipeline and expand the word dictionary coverage.
  2. Support user-provided dictionaries and allow romanization against a specific dictionary set.
  3. Provide a smooth upgrade path for extension dictionaries and user dictionaries.
  4. Improve English handling (including stemming).
  5. Provide better examples without pg_search.
  6. Improve performance and memory balance (for example, evaluate frozen hash structures vs table lookups).

User-Updatable Tables

All dictionaries remain runtime-editable:

  • pinyin.pinyin_mapping
  • pinyin.pinyin_words
  • pinyin.pinyin_token

No extension rebuild is required after table updates.

SQL Baseline Patent Citation

If you use the SQL-based romanization method (sql/pinyin.sql), cite:

BibTeX:

@patent{CN115905297A,
  author  = {Liang Zhanzhao},
  title   = {一种支持拼音检索和排序的方法及系统},
  number  = {CN115905297A},
  country = {CN},
  year    = {2023},
  url     = {https://patents.google.com/patent/CN115905297A/zh}
}

Acknowledgements

About

PostgreSQL extension that transforms Chinese text into pinyin, by character or by word.
