A Two-Level Character Validation Framework for Legacy CJK Encodings and Short-Texts
This repo implements structural encoding validation for CJK (StruVal-C) in Rust. SemVal-S can be found at LGBench.
Run the benchmarks with `cargo bench`. To evaluate a specific encoding, use `cargo bench -- <name>`, where `<name>` is one of `gb2312`, `gb18030`, `jp`, or `kr`.
The three real-world datasets are public-domain novels from CJK literature.
| Dataset | Novel | Original Title | File | Size | Encoding |
|---|---|---|---|---|---|
| Dream-C | Dream of the Red Chamber | 红楼梦 | dream_gb2312.txt | 1.75 MB | GB2312 |
| Light-J | Light and Darkness | 明暗 | meian_eucjp.txt | 744 KB | EUC-JP |
| Heart-K | The Heartless | 무정 | heartless_euckr.txt | 555 KB | EUC-KR |
Note
Since the files are NOT encoded in UTF-8, your editor may display them as mojibake.
Taking GB2312 as an example, the results on a Linux machine with 64 GB of memory and an Intel i9-12900K CPU are summarized below:
- iconv: 0.65 GiB/s
- encoding_rs: 0.58 GiB/s
- range: 1.04 GiB/s
- lookup: 3.36 GiB/s
- simd: 9.01 GiB/s
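
To give a feel for what a range-based structural check might look like, here is a minimal sketch for GB2312 in its EUC-CN form. It is not the code benchmarked above: the function name `is_valid_gb2312_range` is hypothetical, and the byte ranges (lead byte `0xA1..=0xF7`, trail byte `0xA1..=0xFE`, ASCII below `0x80`) are an assumption about what "structural" validity means here. Unassigned code points inside those ranges are not rejected.

```rust
/// Hypothetical range-based structural validator for GB2312 (EUC-CN).
/// Assumed rules (not taken from this repo's implementation):
///   - 0x00..=0x7F        : single-byte ASCII
///   - 0xA1..=0xF7 (lead) : must be followed by a trail byte in 0xA1..=0xFE
///   - anything else      : invalid
pub fn is_valid_gb2312_range(bytes: &[u8]) -> bool {
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b < 0x80 {
            // ASCII byte: consume one byte.
            i += 1;
        } else if (0xA1..=0xF7).contains(&b) {
            // Lead byte: the next byte must exist and be a valid trail byte.
            match bytes.get(i + 1) {
                Some(&t) if (0xA1..=0xFE).contains(&t) => i += 2,
                _ => return false,
            }
        } else {
            // Byte outside both the ASCII and lead-byte ranges.
            return false;
        }
    }
    true
}
```

For instance, `is_valid_gb2312_range(&[0xBA, 0xEC])` accepts a well-formed two-byte sequence, while a truncated input such as `&[0xBA]` is rejected. A lookup-table or SIMD variant would apply the same rules to many bytes at once, which is presumably where the throughput differences come from.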
First, generate the random data with `python3 random-data-generate.py`. By default, the benchmarks over random inputs are not enabled, and currently you have to add them manually. For example,
criterion_group!(benches, random_0_benchmark);