Merged
52 changes: 52 additions & 0 deletions .pytest_failures.txt
@@ -0,0 +1,52 @@
ACL Chinese name tests 'Tong Zhang' True Tong Zhang True Zhang Tong
ACL Chinese name tests 'Bei Yu' True Bei Yu True Yu Bei
ACL Chinese name tests 'Fei Yu' True Fei Yu True Yu Fei
ACL order preservation tests 'Hao Fei' True Hao Fei True Fei Hao
ACL order preservation tests 'Hao-Ran Wei' True Hao-Ran Wei True Wei Hao-Ran
ACL order preservation tests 'Haoran Jin' True Hao-Ran Jin True Jin Haoran
ACL order preservation tests 'Haoran Que' True Hao-Ran Que True Que Haoran
ACL order preservation tests 'Haoran Ye' True Hao-Ran Ye True Ye Haoran
ACL order preservation tests 'Junjie Fang' True Jun-Jie Fang True Fang Junjie
ACL order preservation tests 'Junjie Peng' True Jun-Jie Peng True Peng Junjie
ACL order preservation tests 'Junjie Ye' True Jun-Jie Ye True Ye Junjie
ACL order preservation tests 'Kun Kuang' True Kun Kuang True Kuang Kun
ACL order preservation tests 'Lecheng Zheng' True Lecheng Zheng True Zheng Lecheng
ACL order preservation tests 'Qianlong Du' True Qian-Long Du True Du Qianlong
ACL order preservation tests 'Qianlong Wang' True Qian-Long Wang True Wang Qianlong
ACL order preservation tests 'Xinlei Chen' True Xin-Lei Chen True Chen Xinlei
ACL order preservation tests 'Xinlei He' True Xin-Lei He True He Xinlei
ACL order preservation tests 'Yao Shu' True Yao Shu True Shu Yao
ACL order preservation tests 'Yuwen Wang' True Yuwen Wang True Wang Yuwen
ACL order preservation tests 'Yuxuan Dong' True Yuxuan Dong True Dong Yuxuan
ACL order preservation tests 'Yuxuan Gu' True Yuxuan Gu True Gu Yuxuan
Basic Chinese name tests 'Feng Cha' True Cha Feng True Feng Cha
Basic Chinese name tests 'He Cha' True Cha He True He Cha
Basic Chinese name tests 'Hu Cha' True Cha Hu True Hu Cha
Basic Chinese name tests 'Li Gong' True Gong Li True Li Gong
Basic Chinese name tests 'Gao Wei' True Wei Gao True Gao Wei
Basic Chinese name tests 'Kong Kung' True Kung Kong True Kong Kung
Basic Chinese name tests 'Lu Xun' True Xun Lu True Lu Xun
Basic Chinese name tests 'Qin Shi' True Shi Qin True Qin Shi
Basic Chinese name tests 'Xun Zhou' True Xun Zhou True Zhou Xun
Basic Chinese name tests 'Zhou Xun' True Xun Zhou True Zhou Xun
Compound name tests 'Leung Ka Fai' True Ka-Fai Leung True Leung-Ka Fai
Miscellaneous tests 'Jin Hua' True Hua Jin True Jin Hua
Miscellaneous tests 'Miao Yu' True Miao Yu True Yu Miao
Miscellaneous tests 'Yu Miao' True Miao Yu True Yu Miao
Miscellaneous tests 'Wen Jing' True Jing Wen True Wen Jing
Miscellaneous tests 'Jing Wen' True Jing Wen True Wen Jing
ML ranker test data tests 'Gui Rui' True Rui Gui True Gui Rui
ML ranker test data tests 'Shu Yao' True Yao Shu True Shu Yao
ML ranker test data tests 'Huang Yu Chang' True Yu-Chang Huang True Huang-Yu Chang
ML ranker test data tests 'Jia Jian Feng' True Jian-Feng Jia True Jia-Jian Feng
ML ranker test data tests 'Fan Jia Liang' True Jia-Liang Fan True Fan-Jia Liang
ML ranker test data tests 'Wei Wen Xing' True Wen-Xing Wei True Wei-Wen Xing
ML ranker test data tests 'Xi Zhao' True Zhao Xi True Xi Zhao
ML ranker test data tests 'Fu Meng Ting' True Meng-Ting Fu True Fu-Meng Ting
ML ranker test data tests 'ke chen' True Ke Chen True Chen Ke
ML ranker test data tests 'mi zhang' True Mi Zhang True Zhang Mi
ML ranker test data tests 'xu feng' True Feng Xu True Xu Feng
ML ranker test data tests 'yang guang' True Guang Yang True Yang Guang
Mixed scripts tests 'Zhou(Mary)Li' True Li Zhou True Zhou Li
Name formatting tests 'JinHua' True Hua Jin True Jin Hua
Name formatting tests 'LinShu' True Shu Lin True Lin Shu
40 changes: 37 additions & 3 deletions README.md
@@ -111,7 +111,7 @@ Formatted Output
### 5. Performance

* **High-Performance with Caching**
* The library is benchmarked to be very fast, capable of processing over 10,000 diverse names per second, and uses caching to significantly speed up the processing of repeated names.
* The library is benchmarked to be very fast, capable of processing over 3,000 diverse names per second, and uses caching to significantly speed up the processing of repeated names.
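
The effect of caching on repeated names can be illustrated with a stdlib sketch. The `parse_name` function below is a cheap stand-in for the real detector (it is not the sinonym pipeline), wrapped in `functools.lru_cache` to show why a batch with many repeated names processes far faster on the cached path:

```python
# Illustration of why caching helps repeated names: a stand-in
# parse function (NOT the real sinonym pipeline) wrapped in an
# LRU cache, timed on a batch with heavy repetition.
import time
from functools import lru_cache

def parse_name(name: str) -> str:
    time.sleep(0.001)  # stand-in for an expensive normalization step
    return " ".join(reversed(name.split()))

@lru_cache(maxsize=None)
def parse_name_cached(name: str) -> str:
    return parse_name(name)

names = ["Li Wei", "Zhang Ming", "Wang Fang"] * 100  # only 3 unique names

start = time.perf_counter()
uncached = [parse_name(n) for n in names]
t_uncached = time.perf_counter() - start

start = time.perf_counter()
cached = [parse_name_cached(n) for n in names]
t_cached = time.perf_counter() - start

assert uncached == cached  # caching must not change results
print(f"uncached: {t_uncached:.3f}s, cached: {t_cached:.3f}s")
```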

## How It Works

@@ -320,6 +320,36 @@ for result in results:
print(f"Processed: {result.result}")
```

### Persistent Multi-Process Processing

For high-throughput workloads, you can keep a persistent process pool alive and
reuse worker processes across multiple calls. This avoids repeated process
start-up overhead and works on Windows/macOS/Linux via `spawn`.

```python
from sinonym.detector import ChineseNameDetector

def main():
detector = ChineseNameDetector()
names_a = ["Li Wei", "Wang Weiming", "Zhang Ming"]
names_b = ["Xin Liu", "Yang Li", "Chen Huang"]

# Reuse workers across many calls
with detector.create_persistent_multiprocess_pool(max_workers=6, chunk_size=64) as pool:
results_a = pool.normalize_names(names_a)
results_b = pool.normalize_names(names_b)

# One-off convenience wrapper (creates and closes a temporary pool)
single_batch = detector.process_name_batch_multiprocess(names_a, max_workers=6, chunk_size=64)
return results_a, results_b, single_batch

if __name__ == "__main__":
main()
```

Use the `if __name__ == "__main__":` guard in scripts to ensure safe process
spawning on Windows and macOS.
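
The chunk splitting implied by `chunk_size` can be sketched as a pure helper. This is illustrative only, not part of the sinonym API: each worker receives a slice of at most `chunk_size` names, which bounds per-task pickling overhead.

```python
# Hypothetical sketch of how a batch might be split into chunks of
# `chunk_size` before being handed to worker processes; this helper
# is illustrative, not part of the sinonym API.
from collections.abc import Iterator

def chunked(names: list[str], chunk_size: int) -> Iterator[list[str]]:
    for start in range(0, len(names), chunk_size):
        yield names[start:start + chunk_size]

names = [f"name-{i}" for i in range(150)]
chunks = list(chunked(names, 64))
print([len(c) for c in chunks])  # → [64, 64, 22]
```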

### When to Use Batch Processing

* **Academic Papers**: Author lists typically follow consistent formatting
@@ -394,14 +424,14 @@ If you'd like to contribute to Sinonym, here’s how to set up your development
First, clone the repository:

```bash
git clone https://github.com/yourusername/sinonym.git
git clone https://github.com/allenai/sinonym.git
cd sinonym
```

Then, install the development dependencies:

```bash
uv sync --extra dev
uv sync --active --all-extras --dev
```

### Running Tests
@@ -422,6 +452,10 @@ uv run ruff check . --fix
uv run ruff format .
```

### Benchmarking & Profiling

See [scripts/README.md](scripts/README.md) for benchmark, profiling, and test status scripts.

## License

Sinonym is licensed under the Apache 2.0 License. See the `LICENSE` file for more details.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "sinonym"
version = "0.2.2"
version = "0.2.3"
description = "Chinese Name Detection and Normalization Module"
readme = "README.md"
requires-python = ">=3.10"
147 changes: 43 additions & 104 deletions scripts/README.md
@@ -1,126 +1,65 @@
# Sinonym Scripts Directory
# Sinonym Scripts

This directory contains utility scripts for data generation, model training, and testing of the Sinonym library.
Utility scripts for benchmarking, profiling, testing, and model training.

## Scripts Overview
## Active Scripts

### 1. `train_ml_classifier_for_chinese_vs_japanese.py` ✅ ACTIVE
**Purpose**: Train the machine learning classifier that distinguishes Chinese names from Japanese names when written in Chinese characters.
### `check_test_status.py`
Runs the full test suite and reports individual test case failures with detailed diagnostics. Performance tests run separately. Exits 0 when the failure count is at or below the expected baseline (`EXPECTED_FAILURES = 52`) and 1 otherwise, so improvements (fewer failures) pass while regressions fail.
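
The gate logic just described can be sketched in a few lines (a simplified sketch, not the actual script):

```python
# Simplified sketch of the expected-failures gate (not the actual
# check_test_status.py): improvements pass, regressions fail.
EXPECTED_FAILURES = 52

def gate_exit_code(observed_failures: int, baseline: int = EXPECTED_FAILURES) -> int:
    # 0 (pass) when at or below the baseline, 1 (fail) on regression.
    return 0 if observed_failures <= baseline else 1

print(gate_exit_code(52))  # → 0 (matches baseline)
print(gate_exit_code(40))  # → 0 (improvement)
print(gate_exit_code(60))  # → 1 (regression)
```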

**Status**: ✅ **Successfully implemented and integrated**

**What it does**:
- Downloads Chinese (1.2M) and Japanese (180K) name corpora from GitHub
- Filters names to keep only those written in Chinese/Japanese characters (kanji)
- Trains a scikit-learn Pipeline with:
- TF-IDF character n-gram features (1-3 grams, max 5000 features)
- 20 linguistic heuristic features (Japanese markers, character patterns, etc.)
- Logistic Regression classifier with balanced class weights
- Saves the trained model as `data/chinese_japanese_classifier.joblib`
- Achieves 99.5% accuracy on test data

**Dependencies**:
- scikit-learn, numpy, scipy, joblib
- `sinonym.ml_model_components.EnhancedHeuristicFlags` (custom feature extractor)

**Output**:
- `data/chinese_japanese_classifier.skops` - The trained model used in production
- `data/model_features.json` - Feature vocabulary metadata

**Usage**:
```bash
python scripts/train_ml_classifier_for_chinese_vs_japanese.py
uv run python scripts/check_test_status.py
```

---

### 2. `generate_chinese_name_corpus_data.py` ❌ ABANDONED
**Purpose**: Generate training data for an ML-based name parsing disambiguation model.

**Status**: ❌ **Historical - Abandoned effort**
### `benchmark_stable.py`
Median-based performance benchmark gate. Spawns isolated worker subprocesses (fresh process per run) with controlled `PYTHONHASHSEED` and thread environment variables. Reports mean/median/stddev/CV of throughput and supports a `--min-median-names-per-sec` gate that exits non-zero on failure.

**What it was supposed to do**:
- Download 200K Chinese names from the Chinese Names Corpus
- Romanize Chinese names to pinyin (without tones)
- Generate all possible surname/given name parse candidates
- Create ground truth labels based on Chinese name structure rules
- Extract features for each parse (log probabilities, ranks, ratios)
- Save training data for an ML model to choose the best parse

**Why it was abandoned**:
- The ML parsing model "didn't work well" (as noted in code comments)
- The rule-based parsing system in `sinonym.services.parsing` works sufficiently well
- The complexity of training data generation and feature engineering didn't justify the marginal improvements

**Output files (still present but unused)**:
- `data/ml_parsing_training_data.json` - 199K training examples with parse candidates
- `data/ml_parsing_metadata.json` - Statistics about the training data

---

### 3. `generate_acl_data.py` ❌ ABANDONED
**Purpose**: Process ACL 2025 conference authors to create additional training examples for the parsing model.
```bash
uv run python scripts/benchmark_stable.py --runs 5 --names 3000 --warmup 3000
uv run python scripts/benchmark_stable.py --runs 7 --min-median-names-per-sec 5000
```
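
The statistics the gate reports can be sketched with the stdlib `statistics` module (an illustrative sketch, not the actual benchmark harness):

```python
# Sketch of the median-based gate's statistics (illustrative, not
# the actual benchmark harness): mean/median/stdev/CV of per-run
# throughput, gated on the median.
import statistics

def summarize(throughputs: list[float]) -> dict[str, float]:
    mean = statistics.mean(throughputs)
    stdev = statistics.stdev(throughputs)
    return {
        "mean": mean,
        "median": statistics.median(throughputs),
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation
    }

def passes_gate(throughputs: list[float], min_median: float) -> bool:
    # Median is robust to a single slow outlier run.
    return statistics.median(throughputs) >= min_median

runs = [5100.0, 5300.0, 4900.0, 5250.0, 5050.0]  # names/sec per run
stats = summarize(runs)
print(stats["median"], passes_gate(runs, min_median=5000.0))  # → 5100.0 True
```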

**Status**: ❌ **Historical - Part of abandoned ML parsing effort**
### `profile_hotspots.py`
Hotspot time-share profiler. Warms caches on deterministic test names, runs one `cProfile` pass, then reports top functions and modules ranked by internal time (`tottime`) share. Use `--sinonym-only` to filter out third-party/stdlib noise.

**What it does**:
- Loads author names from `data/acl_2025_authors.txt`
- Uses the ChineseNameDetector to identify Chinese names
- Converts ACL format names (Given Surname) to training examples
- Generates parse candidates with features for ML training
```bash
uv run python scripts/profile_hotspots.py --names 3000 --warmup 3000 --sinonym-only
```
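
Ranking functions by internal-time share works as in this minimal `cProfile`/`pstats` sketch; the workload here is a toy stand-in for the detector:

```python
# Minimal sketch of ranking functions by internal time (tottime)
# with cProfile/pstats, as profile_hotspots.py does at larger scale.
# busy() is a toy stand-in workload, not sinonym code.
import cProfile
import io
import pstats

def busy(n: int) -> int:
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(50):
    busy(10_000)
profiler.disable()

out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("tottime").print_stats(5)  # top 5 by internal time
text = out.getvalue()
print("busy" in text)  # → True
```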

**Why it exists**:
- Attempted to augment the ML parsing training data with real academic names
- ACL authors represent a different distribution (romanized, Western ordering)
- Was meant to improve the never-implemented parsing model
### `profile_run.py`
Quick single-process profiling script. Generates deterministic test names, warms caches, takes 5 pure timing measurements (no profiling overhead) for accurate throughput stats, then runs one `cProfile` pass for a top-25 function breakdown. Good for a fast sanity check during development.

**Output**:
- `data/acl_training_examples.json` - Training examples from ACL authors
- Would have updated `ml_parsing_train_split.json` (file doesn't exist)
```bash
uv run python scripts/profile_run.py
```
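
The warm-up-then-measure pattern the script uses can be sketched as follows; the workload is a stand-in, not the real detector:

```python
# Sketch of the measure-without-profiler pattern used by
# profile_run.py: warm up first, then take several pure timing
# passes and report throughput. process() is a stand-in workload.
import statistics
import time

def process(names: list[str]) -> list[str]:
    return [n.title() for n in names]  # stand-in for normalization

names = [f"li wei {i}" for i in range(10_000)]
process(names)  # warm-up pass before measuring

rates = []
for _ in range(5):  # five pure timing measurements, no profiler attached
    start = time.perf_counter()
    process(names)
    elapsed = time.perf_counter() - start
    rates.append(len(names) / elapsed)

print(f"median: {statistics.median(rates):,.0f} names/sec")
```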

---
### `profile_threaded.py`
Multi-threaded performance and thread-safety validation. Tests `normalize_name` throughput across 1/2/4/8 threads using a shared `ChineseNameDetector` instance, verifies that multi-threaded results are identical to single-threaded results, and reports speedup and CV per thread count.

## Summary
```bash
uv run python scripts/profile_threaded.py
```
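
The parity check at the heart of this script can be sketched with the stdlib thread pool; `normalize` below is a deterministic stand-in for the detector:

```python
# Sketch of the parity check profile_threaded.py performs: run the
# same workload single-threaded and across a thread pool, then
# verify the outputs are identical. normalize() is a stand-in.
from concurrent.futures import ThreadPoolExecutor

def normalize(name: str) -> str:
    return " ".join(part.capitalize() for part in name.split())

names = [f"li wei{i}" for i in range(1000)]

single = [normalize(n) for n in names]
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(normalize, names))  # preserves input order

assert threaded == single  # thread-safety parity check
print("parity OK:", len(threaded), "names")
```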

### Active Scripts
- **`train_ml_classifier_for_chinese_vs_japanese.py`** - The only actively used script that trains the Chinese vs Japanese classifier
### `profile_multiprocess.py`
Persistent multi-process throughput and parity check. Compares single-process throughput to a spawn-based persistent process pool, verifies that outputs are identical for a deterministic workload, and reports median speedup.

### Historical/Abandoned Scripts
- **`generate_chinese_name_corpus_data.py`** - Abandoned ML parsing model data generation
- **`generate_acl_data.py`** - Abandoned ACL author data processing for ML parsing
```bash
uv run python scripts/profile_multiprocess.py --names 12000 --warmup 3000 --runs 3 --workers 6 --chunk-size 64
```

## Data Flow
### `train_ml_classifier_for_chinese_vs_japanese.py`
Trains the Chinese-vs-Japanese name classifier used in production. Downloads Chinese (~1.2M) and Japanese (~180K) name corpora, trains a scikit-learn pipeline (TF-IDF character n-grams + 20 linguistic heuristic features + logistic regression), and saves the model to `data/chinese_japanese_classifier.skops`.

```bash
uv run python scripts/train_ml_classifier_for_chinese_vs_japanese.py
```
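
The pipeline's shape can be sketched with scikit-learn alone. This skeleton omits the 20 custom heuristic features and uses a placeholder name list instead of the real corpora, so the fitted model is only a structural demo, not the production classifier:

```python
# Skeleton of the classifier pipeline described above (character
# n-gram TF-IDF + logistic regression). The custom heuristic
# features and the real corpora are omitted; the tiny name lists
# are placeholders, so this is a shape demo only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=5000)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X = ["王伟", "李娜", "张敏", "田中太郎", "佐藤花子", "鈴木一郎"]
y = ["chinese", "chinese", "chinese", "japanese", "japanese", "japanese"]
pipeline.fit(X, y)
print(pipeline.predict(["山田次郎"]))
```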
Chinese/Japanese Corpora (GitHub)
train_ml_classifier_for_chinese_vs_japanese.py
chinese_japanese_classifier.joblib ← [ACTIVELY USED BY LIBRARY]
+
model_features.json


Chinese Names Corpus (GitHub)
generate_chinese_name_corpus_data.py
ml_parsing_training_data.json ← [ABANDONED, NOT USED]
+
ml_parsing_metadata.json


ACL 2025 Authors
generate_acl_data.py
acl_training_examples.json ← [ABANDONED, NOT USED]
```

## Notes
## Abandoned Scripts

These remain for historical reference but are not used by the library. The rule-based parser in `sinonym.services.parsing` replaced the ML approach.

The scripts demonstrate two different ML efforts:
1. **Successful**: Chinese vs Japanese classification for names written in Chinese characters
2. **Abandoned**: ML-based parsing disambiguation to choose between multiple valid name parses
### `generate_chinese_name_corpus_data.py`
Intended to generate training data for an ML-based name parsing disambiguation model: it downloads 200K Chinese names, romanizes them, generates all possible surname/given-name parses, and creates labeled training examples. The resulting ML parsing model did not outperform the rule-based system.

The abandoned parsing model efforts remain in the codebase for historical reference but are not integrated into the library. The rule-based parsing in `sinonym.services.parsing.NameParsingService` handles name parsing instead.
### `generate_acl_data.py`
Supplementary data generator for the abandoned ML parsing effort. Processes ACL 2025 conference author names to create additional training examples in a different distribution (romanized, Western ordering).