A Python library for detecting and reversibly pseudonymizing personal information (PII) in Korean documents. Works with rules + dictionaries + checksums only, without any external ML dependency. Especially strong on public/administrative documents, and usable as a preprocessing layer in front of any ML pipeline.
Public benchmark — the highest-accuracy rule-based Korean PII tool. On human‑labeled KDPII (4,891 docs), ko-pii beats Microsoft Presidio and openai/privacy-filter (F1 0.66 vs 0.27 / 0.26), runs at 0.19 ms/doc (~5,350 docs/s) — 22× faster than Presidio, and processes 1M documents at ~$0, fully on-premise. Deterministic IDs (RRN · card · phone · email) reach F1 ≈ 1.0 via checksum. Every number is measured and reproducible with a single scorer — see benchmark · full comparison.
from ko_pii import Anonymizer, ProcessingMode
result = Anonymizer(mode=ProcessingMode.STRICT, strategy="tokenize").process(
"신청인 홍길동 (880101-1234568) 연락처 010-1234-5678"
)
print(result.text)
# 신청인 <PERSON_1> (<RRN_1>) 연락처 <PHONE_1>
print(result.vault.reveal("<RRN_1>")) # 880101-1234568 (only authorized users can restore)
print(result.combined_risk.combined_risk.name) # CRITICAL

Original:
신청인 홍길동 (880101-1234568) 연락처 010-1234-5678
주소: 서울특별시 강남구 테헤란로 152
tokenize (token substitution + restorable via Vault):
신청인 <PERSON_1> (<RRN_1>) 연락처 <PHONE_1>
주소: <ADDRESS_1>
partial (mask only part — real-world form style):
신청인 홍OO (880101-1******) 연락처 010-****-5678
주소: 서울특별시 강남구 ***
redact (replace with category name):
신청인 [성명] ([주민등록번호]) 연락처 [전화번호]
주소: [주소]
New here? Use mode=ProcessingMode.STRICT + strategy="tokenize". It's the safest default setting (blocks MEDIUM risk and above, with Vault-based restoration).
- Catches names even with attached particles — "홍길동이", "홍길동에게", "홍길동의" → the Korean particle is split off automatically, then PERSON is detected
- Hanja annotation — "홍길동(洪吉童)" → recognizes both the Hangul and the Hanja
- Romanized names — "Hong Gildong" → normalized to Hangul before matching
- Direct HWP/HWPX/DOCX/PDF input: ko-pii report.hwp --strategy tokenize (handles tables, headers, footers, and metadata)
- Automatic CSV/XLSX header recognition: headers like "성명/주민번호/연락처" → automatically mapped to PERSON/RRN/PHONE
- Automatic rejection of administrative dates — "시행일자: 2026-05-21", "감사기간: 3월~4월" → not birthdays (30+ non-birthday keywords)
- Automatic rejection of pseudonymized notations — "박씨", "김모씨", "○○○ 시민" → already pseudonymized (not PII)
- Automatic combined-risk assessment — a name alone may not be PII, but when name + RRN + address appear together → CRITICAL (quasi-identifier combination check per the "Guidelines on De-identification Measures for Personal Information")
- Audit logging — JSONL trace of who restored which token and when (Article 29 of the Personal Information Protection Act)
- Why you need it
- Key features
- Installation
- Usage scenarios
- Evaluation results
- Usage
- 33 PII categories
- Detection policy — which prefixes and anchors work
- Processing modes + substitution strategies
- Additional features
- FAQ
- Development
- License
RAG/LLM pipelines index and retrieve raw, unstructured data and feed it directly into models — PII gets exposed on both the vector DB and the response side.
- Legal obligation — Personal Information Protection Act (PIPA): restrictions on processing sensitive information such as resident registration numbers and health information (also relevant for GDPR/HIPAA abroad)
- Sovereignty / air-gapped networks — public-sector network-separated environments cannot send PII to external APIs → offline deterministic detection is essential
- Trust / reputation — a single leak can destroy trust in a service
- Complement to ML — NER/LLM detection is prone to hallucination and is non-reproducible. For PII that can be checksum-verified (RRN, card, business registration number), ko-pii confirms detection at F1 ≈ 1.0, filling the gaps ML leaves
ko-pii blocks PII at both ends of RAG — ingest (before entering the vector DB) and retrieval (before passing to the LLM). It substitutes the same person with the same token to preserve context (with LlamaIndex/LangChain integrations provided), and the Vault supports authorization-based restoration and audit tracing.
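A minimal sketch of the consistent-token idea, using only the standard library. TokenVault and pseudonymize are hypothetical names for illustration, not ko-pii's Vault API, and the detection spans are supplied by hand rather than detected.

```python
from collections import defaultdict


class TokenVault:
    """Hypothetical in-memory vault: one stable token per distinct (label, value)."""

    def __init__(self):
        self._token_by_value = {}          # (label, value) -> token
        self._value_by_token = {}          # token -> original value
        self._counters = defaultdict(int)  # label -> last issued index

    def tokenize(self, label, value):
        key = (label, value)
        if key not in self._token_by_value:
            self._counters[label] += 1
            token = f"<{label}_{self._counters[label]}>"
            self._token_by_value[key] = token
            self._value_by_token[token] = value
        return self._token_by_value[key]

    def reveal(self, token):
        return self._value_by_token[token]


def pseudonymize(text, detections, vault):
    """Substitute non-overlapping (label, start, end) spans in reading order."""
    pieces, cursor = [], 0
    for label, start, end in sorted(detections, key=lambda d: d[1]):
        pieces.append(text[cursor:start])
        pieces.append(vault.tokenize(label, text[start:end]))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


vault = TokenVault()
doc = "홍길동 신청, 담당 김철수, 홍길동 서명"
spans = [("PERSON", 0, 3), ("PERSON", 11, 14), ("PERSON", 16, 19)]  # hand-written spans
masked = pseudonymize(doc, spans, vault)
print(masked)
print(vault.reveal("<PERSON_2>"))
```

Because both occurrences of the same name map to one token, a downstream LLM can still tell that the applicant and the signer are the same person without seeing the name.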
- Korea-specific — 33 categories of Korean PII (RRN, FRN, passport, business registration number, card, account, phone, email, address, vehicle, person, position, nationality, etc.). Especially strong on public documents
- Deterministic detection — rules + dictionaries + checksums. RRN, card, business registration number, etc. are checksum-verified at F1 ≈ 1.000
- Evasion blocking: neutralizes Unicode bypass tricks such as full-width digits (０１０) and zero-width character insertion via normalization (detection offsets are preserved against the original)
- No external dependencies: uses only the Python standard library. Runs offline / on air-gapped networks, no GPU required
- Preprocessing layer: emits standardized DetectionResult objects (label/start/end/text/confidence). Easy to slot in front of an ML pipeline
- Reversible pseudonymization + Vault: keeps the token ↔ original mapping in a separate store, restorable
- Automatic legal-basis attachment: each detection is automatically tagged with the relevant PIPA article (audit trail)
- Diverse inputs: TXT, CSV, XLSX, HWP, HWPX, DOCX, PDF ([file] extras)
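The evasion-blocking bullet above can be illustrated with a small standard-library sketch: normalize character by character while recording where each normalized character came from, then map match offsets back to the original text. The function names, the zero-width set, and the simplified phone pattern are assumptions for illustration, not ko-pii's actual normalizer.

```python
import re
import unicodedata

# Invisible characters commonly inserted to split a pattern (assumed set)
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize_with_offsets(text):
    """Return (normalized, index_map); index_map[i] is the original index of normalized[i]."""
    chars, index_map = [], []
    for i, ch in enumerate(text):
        if ch in ZERO_WIDTH:
            continue
        # Per-character NFKC folds full-width digits to ASCII; one character may expand to several
        for folded in unicodedata.normalize("NFKC", ch):
            chars.append(folded)
            index_map.append(i)
    return "".join(chars), index_map


PHONE = re.compile(r"01[0-9]-[0-9]{3,4}-[0-9]{4}")  # simplified mobile pattern


def find_phones(text):
    normalized, index_map = normalize_with_offsets(text)
    spans = []
    for m in PHONE.finditer(normalized):
        start = index_map[m.start()]
        end = index_map[m.end() - 1] + 1  # end offset in the original string
        spans.append((start, end, text[start:end]))
    return spans


evasive = "연락처 ０１０\u200b-1234-5678"
print(PHONE.search(evasive))   # the raw text does not match
print(find_phones(evasive))    # the normalized text does, with original offsets
```

Reporting offsets against the original string matters because the masking step has to replace the text the user actually supplied, including the invisible characters.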
| Domain | Recommended setting | Notes |
|---|---|---|
| Public documents (official documents, civil petitions, HR) | STRICT + tokenize | Default. The best-fitting domain |
| LLM training-data preprocessing | PARANOID + tokenize or redact | Prioritize leak prevention |
| Pharma / bio | STRICT + exclude={"AGE","HEIGHT","WEIGHT"} | Avoid false positives on usages like "per 1 kg of body weight" |
| Finance / insurance | STRICT + tokenize | Deterministic detection of RRN/card/account |
| General office (internal documents) | BALANCED + partial | Readable partial masking |
# Pharma domain: prevent PERSON FP + body-attribute false positives
anon = Anonymizer(
    mode=ProcessingMode.STRICT,
    strategy="tokenize",
    exclude={"AGE", "HEIGHT", "WEIGHT"},  # avoid false positives like "per 1 kg of body weight"
)
# If there are many PERSON FPs, inject a domain dictionary:
# add pharma ingredient names / manufacturer names to src/ko_pii/dictionaries/common_words.py
# e.g. "이부프로펜", "한미약품", "메트포르민" → automatically excluded from PERSON

pip install ko-pii

Extras (as needed):
pip install "ko-pii[file]" # HWP/HWPX/DOCX/PDF
pip install "ko-pii[security]" # Vault AES-256-GCM

Python 3.10 or later. The core uses only the standard library.
Scenario 1 — Bulk pseudonymization of approved official documents (before external release / sending to an LLM)
from pathlib import Path
from ko_pii import Anonymizer, ProcessingMode
anon = Anonymizer(mode=ProcessingMode.PARANOID, strategy="tokenize")
for path in Path("./공문서/").glob("*.hwp"):
    # NOTE: read_text() assumes the content is already plain text; binary HWP files need the [file] parser first
    result = anon.process(path.read_text(encoding="utf-8"))
    Path(f"./가명화/{path.name}").write_text(result.text, encoding="utf-8")
    # Keep vault.json stored separately (only authorized users can restore)
    result.vault.save(f"./vault/{path.stem}.json")

- PARANOID mode: blocks everything at LOW risk and above (safe for LLM / external transmission)
- Keep the pseudonymized result externally and the Vault in an internal store, separated
- HWP/HWPX parser: pip install "ko-pii[file]"
from ko_pii import Anonymizer, ProcessingMode, RiskLevel
anon = Anonymizer(mode=ProcessingMode.AUDIT) # no blocking, detection-only reporting
result = anon.process(incoming_petition_text)
# If the combined risk is CRITICAL, notify the handler
if result.combined_risk.combined_risk >= RiskLevel.CRITICAL:
    notify_admin(
        identifiers=result.combined_risk.distinct_identifiers, # ["RRN"]
        quasi=result.combined_risk.distinct_quasi, # ["ADDRESS", "PERSON", "PHONE"]
    )
# Give the responding staff a pseudonymized version
masked = Anonymizer(mode=ProcessingMode.STRICT, strategy="partial").process(
    incoming_petition_text
).text

- AUDIT mode: reports detections only, without blocking (for auditing / statistics)
- Combined-risk auto-assessment: quasi-identifier combination check per the "Guidelines on De-identification Measures for Personal Information"
- Provide responding staff with partial masking via the partial strategy (880101-1******)
# Automatically pseudonymize every record that reaches the root logger's handlers
import logging
from ko_pii import Anonymizer, ProcessingMode
_anon = Anonymizer(mode=ProcessingMode.STRICT, strategy="redact")
class PIIFilter(logging.Filter):
    def filter(self, record):
        # Render %-style arguments first so PII passed via args is masked as well
        record.msg = _anon.process(record.getMessage()).text
        record.args = None
        return True
logging.basicConfig(level=logging.INFO)  # the root logger defaults to WARNING, which would drop info()
for handler in logging.getLogger().handlers:
    handler.addFilter(PIIFilter())  # handler-level filters also see records from child loggers
logging.info("신청인 홍길동 (880101-1234568) 처리 완료")
# → "신청인 [성명] ([주민등록번호]) 처리 완료"

KDPII v1.1 test split: 4,891 human-labeled documents of Korean everyday conversation. All systems are scored with a single canonical matcher (ko_pii.eval.kdpii.match_forms_overlap, substring set matching, position-agnostic), with person_min_length=3 (1–2 character PERSON spans excluded). All three systems run on the same documents with the same matcher.
| System | Type | F1 | Precision | Recall |
|---|---|---|---|---|
| ko-pii | Rules + dictionaries + checksums | 0.660 | 0.699 | 0.624 |
| Presidio (kr_adapt) | spaCy ko + regex | 0.273 | — | — |
| openai/privacy-filter | 660M transformer (ONNX) | 0.264 | — | — |
(TP/FP/FN — ko-pii: TP 813 / FP 350 / FN 489. Presidio: TP 220 / FP 85 / FN 1085. openai/PF: TP 294 / FP 634 / FN 1008.)
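The headline scores follow directly from the TP/FP/FN counts above. The short check below recomputes precision, recall, and F1 from those counts; the helper name prf is only illustrative and is not part of the ko-pii scorer.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw match counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# (TP, FP, FN) as reported for the KDPII run
counts = {
    "ko-pii": (813, 350, 489),
    "Presidio": (220, 85, 1085),
    "openai/PF": (294, 634, 1008),
}

scores = {name: prf(*c) for name, c in counts.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name:10s} P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

This reproduces the F1 column (0.660 / 0.273 / 0.264) and ko-pii's precision and recall. The same counts imply a precision and recall of roughly 0.721 / 0.169 for Presidio and 0.317 / 0.226 for openai/privacy-filter.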
A generic Korean NER model (KoELECTRA NER) was not measured for this run (rough estimate ~0.10–0.15) and is therefore omitted from the headline table.
Fair comparison. The aggregate F1 partly reflects that Presidio and openai/privacy-filter lack many Korean PII categories entirely (they emit 0 on AGE, POSITION, RRN, …). Even restricting to the categories each tool does support, ko-pii still leads — vs openai/privacy-filter 0.61 : 0.37 (its 7 labels), vs Presidio 0.87 : 0.65 (its 9 labels). The gap is not merely missing categories; ko-pii is also more accurate on common ground.
Honest framing. KDPII is everyday conversational text. ko-pii is rule-based: it is strong on structural/deterministic PII and Korean administrative/form text, and weaker on free-form conversation (KDPII PERSON 0.135, ADDRESS 0.241). ko-pii's own generated eval set (below — 540 docs, admin/form-like, validated gold, independent of ko-pii's rules) at 0.790 shows where ko-pii is strong.
Checksum- and regex-verified categories reach near-perfect F1:
| Label | F1 |
|---|---|
| RRN | 1.000 |
| EMAIL | 1.000 |
| IP | 1.000 |
| FRN | 1.000 |
| PHONE | 0.992 |
| VEHICLE | 0.980 |
| WEIGHT | 0.952 |
| HEIGHT | 0.935 |
| PASSPORT | 0.909 |
| AGE | 0.893 |
| ACCOUNT | 0.819 |
1 unit = 1 CPU core.
| System | Latency / doc | Throughput | Hardware |
|---|---|---|---|
| ko-pii | 0.19 ms | ~5,350 docs/s | 1 CPU core |
| Presidio | 4.2 ms | ~238 docs/s | 1 CPU core |
| openai/PF (ONNX, CPU) | 481 ms | ~2 docs/s | 1 CPU core (bulk needs GPU) |
| System | Cost per 1M docs |
|---|---|
| ko-pii | ~$0 (1 CPU core, ~3 min) |
KDPII, 3-system run (ko-pii / openai-PF / Presidio):
python -m ko_pii.eval.model_comparison data/kdpii/test.json \
  --mode kdpii --include-presidio --backend onnx --person-min-length 3

Note. The generated eval set below uses the same matcher as the headline table (directly comparable); KLUE-NER is from an earlier run with a different scorer (context only).
| Domain | ko-pii | openai/PF | Presidio |
|---|---|---|---|
| Generated eval set (540 docs, admin/form-like, validated gold) | 0.790 | 0.451 | 0.483 |
| KLUE NER | 0.419 | 0.155 | 0.000 |
Full details: docs/BENCHMARK.md and docs/EVALUATION_REPORT.md.
Before production use: test 30–100 of your own real documents directly. Performance varies by domain.
- ko-pii is rule-based — strong on structural/deterministic PII and Korean administrative/form text, weak on free-form conversation (KDPII PERSON 0.135, ADDRESS 0.241).
- PERSON false positives (FP): the biggest weakness of rule-based PERSON detection. Domain vocabulary (e.g. pharma ingredient names) can be picked up as a person's name. Mitigation: inject a domain dictionary into common_words.py, or turn it off with exclude={"PERSON"}.
- Unstructured ADDRESS: weak on unstructured addresses like "강남 쪽에 살아" (needs an anchor). Structured addresses ("서울특별시 강남구 테헤란로 152") are fine.
- Deterministic PII (RRN, PHONE, EMAIL, card, business registration number) is checksum/format-verified, so false positives are rare.
Full evaluation: docs/EVALUATION_REPORT.md.
# Basic
ko-pii input.txt --mode STRICT --strategy tokenize \
  --vault vault.json -o output.txt --report report.html
# Batch (whole directory, parallel)
ko-pii ./incoming/ --batch --workers 4 --output-dir ./anonymized/
# Vault encryption + audit log
KPII_VAULT_PASSWORD=secret ko-pii doc.hwp \
  --vault vault.kvault --audit-log audit.jsonl

from ko_pii import Anonymizer, ProcessingMode
anon = Anonymizer(mode=ProcessingMode.STRICT, strategy="tokenize")
result = anon.process(text)
print(result.text) # pseudonymized text
print(result.vault.reveal("<RRN_1>")) # restore original (authorized only)
print(result.summary["by_label"]) # {"RRN": 1, "PHONE": 1, "PERSON": 1}

# Automatic combined-risk assessment of detection results
print(result.combined_risk.combined_risk.name) # CRITICAL
print(result.combined_risk.distinct_identifiers) # ["RRN"]
print(result.combined_risk.distinct_quasi) # ["PERSON", "PHONE"]
# k-anonymity assessment (aggregate data)
from ko_pii.analytics import k_anonymity
report = k_anonymity(records, quasi_keys=["age", "city", "job"], threshold=5)
print(report.k) # minimum group size
print(report.satisfies_threshold) # True/False
print(report.rationale) # ["준식별자 ['age', 'city', 'job'] 기준 N개 그룹", ...]

from ko_pii.tabular import anonymize_records
import csv
# Assumes a UTF-8 file; utf-8-sig also tolerates a leading BOM
with open("employees.csv", newline="", encoding="utf-8-sig") as f:
    rows = list(csv.DictReader(f))
# Headers "성명/주민번호/연락처/주소" → automatically mapped to PERSON/RRN/PHONE/ADDRESS
anon_rows, vault = anonymize_records(rows, strategy="tokenize")
print(anon_rows[0])
# {'성명': '<PERSON_1>', '주민번호': '<RRN_1>', '연락처': '<PHONE_1>', '주소': '<ADDRESS_1>'}

Low-confidence detections → saved to a review queue → a user marks them FP/OK/FN → from accumulated markings, dictionary patch suggestions are generated automatically (they are never applied automatically; a human reviews and applies them).
result = anon.process(text)
# 1. Detections classified as REVIEW due to low confidence (auto-classified per mode)
for record in result.review_items():
    d = record.detection
    print(d.text, d.confidence, d.evidence)
# 2. Save to a separate JSONL queue → user marks verdicts
from ko_pii.review.queue import ReviewQueue
q = ReviewQueue("review.jsonl")
q.enqueue_review_records(result.review_items(), document=text)
# 3. Accumulated markings → generate patch files (common_words candidates / name candidates)
from ko_pii.review.feedback import apply_feedback
apply_feedback(
    queue_path="review.jsonl",
    output_dir="feedback_patches/",
    min_repeat=2, # same token marked FP 2+ times → candidate (prevents dictionary pollution)
)
# → feedback_patches/common_words_additions.txt (PERSON FP candidates)
# → feedback_patches/names_to_add.txt (names marked FN)
# → feedback_patches/summary.json

from ko_pii.patterns.rrn import detect
for r in detect("신청인 880101-1234568"):
    print(r.label, r.text, r.confidence, r.legal_basis)
# RRN 880101-1234568 1.0 개인정보보호법 제24조의2

List all label keys with ko-pii --labels (CLI) or from ko_pii.labels import ALL_LABELS, LABEL_INFO (Python).
| Category | Verification | Risk |
|---|---|---|
| RRN (resident registration number) | 13 digits + date + Korean checksum | CRITICAL |
| FRN (alien registration number) | gender digit 5–8 + checksum | CRITICAL |
| Business registration number | National Tax Service weighted-sum checksum | HIGH |
| Corporate registration number | corporate checksum (RRN takes precedence) | MEDIUM |
| Driver's license number | regional-office code 11–28 whitelist | HIGH |
| Passport number | prefix (M/S/PP/PD etc.) + 8 digits | CRITICAL |
| Credit card | BIN whitelist + Luhn | CRITICAL |
| Parcel number (PNU) | 19 digits + province code | LOW |
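As an illustration of the "13 digits + date + Korean checksum" row, here is a standard-library sketch of the commonly documented RRN check: a valid birth date plus a weighted mod-11 check digit. The function name and the century mapping (gender digits 1–2 for the 1900s, 3–4 for the 2000s) are assumptions of this sketch, and ko-pii's own detector may apply further rules.

```python
from datetime import date

RRN_WEIGHTS = (2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5)
# Assumed mapping for citizen numbers only; digits 5-8 belong to FRN per the table above
CENTURY_BY_GENDER_DIGIT = {"1": 1900, "2": 1900, "3": 2000, "4": 2000}


def is_valid_rrn(value):
    digits = value.replace("-", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    century = CENTURY_BY_GENDER_DIGIT.get(digits[6])
    if century is None:
        return False
    try:
        # Rejects impossible dates such as month 13 or February 30
        date(century + int(digits[0:2]), int(digits[2:4]), int(digits[4:6]))
    except ValueError:
        return False
    total = sum(int(d) * w for d, w in zip(digits[:12], RRN_WEIGHTS))
    return (11 - total % 11) % 10 == int(digits[12])


print(is_valid_rrn("880101-1234568"))  # sample number used throughout this README
print(is_valid_rrn("881301-1000004"))  # month 13
```

The date gate and the check digit together are what keep false positives rare: a random 13-digit string has to pass both.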
| Category | Keywords |
|---|---|
| Health insurance card | 건강보험 / 의료보험 / 보험증 |
| Prescription number | 처방번호 / Rx / 교부번호 |
| Drug code | 약품코드 / KD코드 + Korean GS1 |
| Fax number | 팩스 / FAX |
| Account number | 계좌 / 60+ bank names (3-way anchor) |
| Employee number | 사번 / 공무원번호 / 직원번호 / 임용번호 |
| Civil-petition number | 민원 / 청구 / 정보공개 / 행정심판 |
| Case number | case type (가합 / 고합 / 구합 / 헌가 etc.) |
| Category | Verification |
|---|---|
| Phone number | mobile 010–019 / Seoul 02 / regional 031–064 / VoIP 070 / representative 15xx–18xx / +82 international |
| Email | RFC 5322 |
| IP | IPv4 octets + IPv6 RFC 4291 |
| URL | http(s) / ftp |
| Postal code | province first-digit mapping |
| Vehicle number | new-format NN[가-힣]NNNN + purpose-Hangul whitelist |
| Official document number | ministry name + format |
| Category | Dictionary size |
|---|---|
| Person (PERSON) | 286 surnames + adjacent position + 17 rejection rules |
| Address (ADDRESS) | 17 provinces + 226 districts + 240 frequent dong + 10K legal dong (anchor-conditional) + 38 building suffixes + dong/ho/floor bridge expansion |
| Nationality (NATIONALITY) | 70+ country names (대한민국, 미국, 일본, etc.) |
| Education (EDUCATION) | ~330 universities + abbreviations |
| Major (MAJOR) | ~400 departments (KEDI classification) |
| Position (POSITION) | 250+ titles (government, police, fire, military, prosecutor, judge, private sector) |
| Category | Verification | Risk |
|---|---|---|
| Date of birth | date + keyword/full-name/birth-year marker | HIGH |
| Age | "32세 / 32살 / 환갑 / 12개월 아기 / 30대" | INFO |
| Height | "175cm / 1.75m", range 50–250 | INFO |
| Weight | "70kg / 70킬로", range 1–300 | INFO |
Quasi-identifier: not identifying on its own, but carries re-identification risk when combined with other information. analytics/combined_risk assesses this automatically.
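For tabular data, the risk from combining quasi-identifiers is often expressed as k-anonymity: the size of the smallest group of records that share the same quasi-identifier values. The sketch below shows the idea with the standard library only. k_anonymity_sketch and the sample records are hypothetical and are not the ko_pii.analytics implementation.

```python
from collections import Counter


def k_anonymity_sketch(records, quasi_keys):
    """Return (k, groups): k is the size of the smallest quasi-identifier group."""
    groups = Counter(tuple(r[key] for key in quasi_keys) for r in records)
    k = min(groups.values()) if groups else 0
    return k, groups


records = [
    {"age": "30s", "city": "서울", "job": "교사"},
    {"age": "30s", "city": "서울", "job": "교사"},
    {"age": "30s", "city": "서울", "job": "교사"},
    {"age": "40s", "city": "부산", "job": "의사"},  # unique combination
]

k, groups = k_anonymity_sketch(records, ["age", "city", "job"])
print(k)       # smallest group size
print(k >= 5)  # compare against a chosen threshold, e.g. 5
```

Here a single record with a unique age/city/job combination drives k down to 1, even though none of the three fields identifies anyone on its own.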
Each PII detection is not a simple regex match but a multi-gate process: a combination of prefix label / keyword anchor / contextual dictionary / format verification.
1–4 Hangul characters immediately after a label → strong PERSON candidate.
| Domain | Labels |
|---|---|
| Basic | 성명 이름 성함 이 름 |
| Petition / administrative | 신청인 신청자 민원인 청구인 보호자 대리인 당사자 |
| Approval | 기안자 결재자 검토자 보고자 수신자 발신자 참조 |
| Judicial | 원고 피고 고소인 피고소인 증인 감정인 |
| Police / fire | 피의자 피해자 용의자 참고인 신고자 수사관 출동대장 |
| HR | 평가자 피평가자 면담자 추천인 |
| Medical | 환자 |
Recognizes 7 label variants: 성명: 홍길동 / [성명] 홍길동 / (성명) 홍길동 / <성명> 홍길동, etc.
- Single surname + rank/region/school/bank: 김부장, 김포시, 이화여대 → rejected
- 16 Korean sentence-ending morphemes: ending in ~은데, ~는데, ~라서, ~까지 → rejected
- Ministry / institution names: 보건복지부, 행정안전부 → rejected
- Already-pseudonymized notations: 박씨, 김모씨, ○○○ 시민 → rejected
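A rough sketch of how a label anchor and a rejection rule can combine, using only a regular expression. The label subset, the patterns, and the function name are simplifications assumed for illustration. ko-pii's real gates (surname dictionary, particle splitting, the full rejection list) are not reproduced here.

```python
import re

LABELS = ("성명", "이름", "신청인", "민원인", "환자")  # small subset of the label table

# Optional bracket, label, optional closing bracket, optional colon, then 1-4 characters
LABEL_PERSON = re.compile(
    r"[\[(<]?(?:" + "|".join(LABELS) + r")[\])>]?\s*[:：]?\s*([가-힣○]{1,4})(?![가-힣])"
)
# Already-pseudonymized notations such as 박씨, 김모씨, ○○○
PSEUDONYMIZED = re.compile(r"[가-힣]모?씨|○+")


def person_candidates(text):
    """Return (start, end, name) for label-anchored candidates that survive rejection."""
    found = []
    for m in LABEL_PERSON.finditer(text):
        name = m.group(1)
        if PSEUDONYMIZED.fullmatch(name):
            continue  # already pseudonymized, not PII
        found.append((m.start(1), m.end(1), name))
    return found


print(person_candidates("성명: 김도윤"))
print(person_candidates("[성명] 홍길동 / (성명) 홍길동 / <성명> 홍길동"))
print(person_candidates("신청인 박씨, 이름: 김모씨, 성명: ○○○"))
```

The point of the multi-gate design is visible even at this scale: the anchor proposes a candidate, and a separate rule vetoes it.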
✓ 성명: 김도윤 (field label)
✓ 박지훈 과장님께 (adjacent position)
✓ 홍길동(洪吉童) (Hanja annotation)
✓ 880101-1234568 (RRN — checksum)
✓ 120-81-47521 (business reg. — NTS checksum)
✓ 4242-4242-4242-4242 (card — Luhn)
✓ M12345678 (passport)
✗ 김부장이 협조 안 함 (single surname + rank = rejected)
✗ 보건복지부는 검토 후 (ministry name)
✗ 시행일자: 2026-05-20 (non-birthday rejection)
✗ 881301-1000004 (RRN — month 13 invalid)
✗ A12345678 (passport — A prefix rejected)
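Two of the checksum examples above can be verified with a few lines of standard-library Python. This sketch uses the widely published business-registration-number weights and the Luhn algorithm. The function names are illustrative, and the BIN whitelist that ko-pii also applies to cards is not covered.

```python
BRN_WEIGHTS = (1, 3, 7, 1, 3, 7, 1, 3, 5)


def _digits(value):
    # Keeps ASCII digits only; separators such as "-" are ignored (a deliberately loose parser)
    return [int(c) for c in value if c.isascii() and c.isdigit()]


def is_valid_brn(value):
    """Business registration number: 10 digits, weighted sum with a carry on the 9th digit."""
    d = _digits(value)
    if len(d) != 10:
        return False
    total = sum(x * w for x, w in zip(d[:9], BRN_WEIGHTS))
    total += (d[8] * 5) // 10
    return (10 - total % 10) % 10 == d[9]


def luhn_ok(value):
    """Luhn check: double every second digit from the right, sum must be divisible by 10."""
    d = _digits(value)
    if not 13 <= len(d) <= 19:
        return False
    total = 0
    for i, x in enumerate(reversed(d)):
        if i % 2 == 1:
            x *= 2
            if x > 9:
                x -= 9
        total += x
    return total % 10 == 0


print(is_valid_brn("120-81-47521"))
print(luhn_ok("4242-4242-4242-4242"))
```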
| Mode | Blocking threshold | Use |
|---|---|---|
| PARANOID | block LOW and above | before external release / sending to an LLM |
| STRICT | block MEDIUM and above | practical standard (default) |
| BALANCED | block HIGH and above | internal collaboration |
| PERMISSIVE | block CRITICAL only | analyst work |
| AUDIT | no blocking, detection-only reporting | auditing / statistics |
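The mode table amounts to a threshold comparison on an ordered risk scale. The following is a minimal sketch of that logic. Risk, BLOCK_FROM, and should_block are hypothetical names (not ko-pii's RiskLevel or ProcessingMode), and placing INFO below LOW is an assumption based on the risk values used in the category tables.

```python
from enum import IntEnum


class Risk(IntEnum):
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Lowest risk level each mode blocks; None means the mode never blocks
BLOCK_FROM = {
    "PARANOID": Risk.LOW,
    "STRICT": Risk.MEDIUM,
    "BALANCED": Risk.HIGH,
    "PERMISSIVE": Risk.CRITICAL,
    "AUDIT": None,
}


def should_block(mode, risk):
    threshold = BLOCK_FROM[mode]
    return threshold is not None and risk >= threshold


for mode in BLOCK_FROM:
    blocked = [r.name for r in Risk if should_block(mode, r)]
    print(f"{mode:10s} blocks {blocked}")
```

Read this way, moving from STRICT to BALANCED simply stops masking MEDIUM-risk categories while leaving HIGH and CRITICAL untouched.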
| Strategy | 880101-1234568 → | Reversible | Description |
|---|---|---|---|
| tokenize | <RRN_1> | ✓ | token substitution, original kept in the Vault |
| redact | [주민등록번호] | ✗ | replace with the category name |
| partial | 880101-1****** | ✗ | mask only part (practical standard) |
| asterisk | ************** | ✗ | asterisk masking |
| hashed | <RRN:abc123> | ✗ | hash (same value → same token) |
| fpe | 771202-2345671 | ✗ | format-preserving encryption (FPE) |
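To make the partial strategy concrete, here is a regex-only sketch that reproduces the masking shapes shown for RRNs and mobile numbers. The patterns and the mask_partial name are assumptions for illustration. There is no checksum verification here, and name masking (홍OO) is left out.

```python
import re

# Keep the birth date and gender digit, hide the remaining six digits
RRN = re.compile(r"\b([0-9]{6})-([0-9])[0-9]{6}\b")
# Keep the prefix and the last four digits, hide the middle block
MOBILE = re.compile(r"\b(01[0-9])-[0-9]{3,4}-([0-9]{4})\b")


def mask_partial(text):
    text = RRN.sub(lambda m: f"{m.group(1)}-{m.group(2)}******", text)
    text = MOBILE.sub(lambda m: f"{m.group(1)}-****-{m.group(2)}", text)
    return text


print(mask_partial("신청인 홍길동 (880101-1234568) 연락처 010-1234-5678"))
```

Partial masking is not reversible, which is why the table marks it ✗: the hidden digits are discarded rather than stored.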
| Feature | Description | Install |
|---|---|---|
| HWP/HWPX/DOCX/PDF parser | Automatic parsing of Hancom Office / MS Word / PDF (body + tables + headers + metadata). See parser details below | [file] |
| Vault encryption | AES-256-GCM + PBKDF2 with 480k iterations | [security] |
| Audit log (JSONL) | records every reveal() call (Article 29 of PIPA) | core |
| Batch processing | whole-directory + parallel workers | core |
| Review queue | low-confidence detections → human review → FP-vocabulary patch suggestions (not auto-applied) | core |
| HTML report | visualization of true positives (green) / false positives (red) / misses (yellow) | core |
| Hanja/Romanization variants | 洪吉童 → 홍길동, Hong Gildong → 홍길동 | core |
| RAG integration | PII masking of LlamaIndex/LangChain retrieval results (retrieve → mask → LLM) | [llamaindex] / [langchain] |
| Rule+ML hybrid | 4 rule+ML combination modes + threshold-based detection-sensitivity tuning (opt-in) | [classifier] |
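For the Vault encryption row, the key-derivation half can be sketched with the standard library: PBKDF2 stretches a password into the 32-byte key that AES-256 requires. The hash function (SHA-256), the 16-byte salt, and the function name are assumptions of this sketch. The AES-GCM step itself needs a third-party library and is omitted.

```python
import hashlib
import os


def derive_vault_key(password, salt, iterations=480_000):
    """Derive a 32-byte key (AES-256 size) from a password with PBKDF2-HMAC."""
    return hashlib.pbkdf2_hmac(
        "sha256",                  # assumed hash; the README only names PBKDF2
        password.encode("utf-8"),
        salt,
        iterations,
        dklen=32,
    )


salt = os.urandom(16)  # random per-vault salt, stored alongside the ciphertext
key = derive_vault_key("secret", salt)
print(len(key))
```

The high iteration count makes each password guess expensive, which matters because a vault file may be copied and attacked offline.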
| Format | Library used | Notes |
|---|---|---|
| HWP 5.x | olefile | parses OLE binary records directly, extracts body text |
| HWPX | stdlib (zipfile + xml) | ZIP+XML structure, no external dependency |
| DOCX | stdlib (zipfile + xml) | ZIP+XML structure, no external dependency |
| XLSX | stdlib (zipfile + xml) | sharedStrings + sheet XML |
| PDF | pdfplumber (preferred) / pypdf (fallback) | extracts the text layer only (scanned PDFs need OCR) |

PDF note: because PDFs are coordinate-based, spurious spaces and line breaks are commonly inserted per cell. ko-pii automatically corrects unnecessary spaces/line breaks in the middle of PII patterns using its built-in normalization engine (text_normalizer). Since pdfplumber performs layout analysis better than pypdf, installing pdfplumber is recommended.
Mask PII in retrieved documents before feeding them to the LLM. Within a single retrieval result, the same person is substituted with the same token (<PERSON_1>) so context is preserved; pass the vault and you can restore via vault.reveal() after answer generation.
# LlamaIndex: node postprocessor (retrieve → mask → LLM)
from ko_pii.integrations.llamaindex import KoPiiNodePostprocessor
qe = index.as_query_engine(
    node_postprocessors=[KoPiiNodePostprocessor(mode="STRICT")]
)
# LangChain: drop directly into a Runnable chain
from ko_pii.integrations.langchain import KoPiiRedactor
chain = retriever | KoPiiRedactor(mode="STRICT") | prompt | llm

The core works without ML, but you can layer ML on top. There are two distinct hybrids, and they are different features:
| | ① Token-NER hybrid (span replacement) | ② Document-classifier hybrid (confidence blend) |
|---|---|---|
| What | rules = deterministic IDs (checksums), ML = fuzzy categories; replaces detections | document-level "has PII" classifier reinforcing rule results (adds no spans) |
| Use | Anonymizer(secondary_detector=..., merge_mode="role_split") | ko_pii.classifier.HybridAnonymizer |
| Performance | external validation F1 0.97 (docs/HYBRID_NER.md) | review triggers / sensitivity tuning |
① Token-NER hybrid — plug in an NER model you trained with the docs/HYBRID_NER.md recipe:
from ko_pii import Anonymizer
from ko_pii.integrations.hf_token_ner import HFTokenNERAdapter
ml = HFTokenNERAdapter("out/ner_fuzzy/final") # model you trained (see recipe)
anon = Anonymizer(secondary_detector=ml, merge_mode="role_split")
result = anon.process(text) # all strategies (tokenize/partial/redact) and the Vault just work

② Document-classifier hybrid (requires the [classifier] extra): configure the combination method and sensitivity yourself:
| Combination mode | Behavior | Use |
|---|---|---|
| SCORE | rule detection + classifier-confidence reinforcement | default |
| GATED | skip rules if the classifier score < threshold | speed first |
| REVIEW_FLAG | if rules find 0 but the classifier is high, "recommend review" | human-review trigger |
| UNION_BLOCK | block if either side considers it PII | conservative masking |
Use classifier_threshold and gate_threshold to finely tune the rule ↔ ML combination ratio (sensitivity).
from ko_pii import Anonymizer, ProcessingMode
from ko_pii.classifier import PIIClassifier, HybridAnonymizer, HybridMode
clf = PIIClassifier.from_pretrained("models/...")
hybrid = HybridAnonymizer(
    Anonymizer(mode=ProcessingMode.BALANCED), clf,
    mode=HybridMode.REVIEW_FLAG, # combination method
    classifier_threshold=0.5, # sensitivity (ratio) tuning
)

pip install "ko-pii[classifier]" # torch + transformers + scikit-learn
python -m ko_pii.classifier.train ... # train the model yourself

Pretrained weights are not distributed (training-data licensing); the code and a training recipe are provided. The tuned NER model is planned for a later release (after licensing review and generalization evaluation).

Hybrid evaluation (rules-only vs hybrid, base vs tuned matrix): docs/HYBRID_NER.md reports +0.14–0.19 over rules-only (KLUE); the gains are consistent only with a properly tuned Korean NER (base zero-shot is counterproductive). The full experiment trail (training/eval code, run logs, raw results) is kept in a private internal NER experiment archive.
Q1. Does rules-only, without ML, really work well? Korea's core PII (RRN, passport, card, business registration number, etc.) is checksum-verified at F1 ≈ 1.000 — an area that ML cannot replace. Context-dependent PII like PERSON may be better handled by ML, but on public/administrative documents ko-pii is practical at F1 0.790 (generated eval set, 540 docs, validated gold; see docs/BENCHMARK.md §3b).
Q2. What if there are too many false positives?
Inject a domain dictionary into common_words.py, turn off a specific category with exclude={"PERSON"}, or change the mode (STRICT → BALANCED).
Q3. What if I lose the Vault?
Restoration is impossible (by security design). Store it encrypted with the [security] extras, or use strategy="redact" (category-name substitution, no Vault needed).
Q4. Are HWP tables and headers all captured?
Yes. With the [file] extras installed, body + tables + headers + footers + metadata are all extracted.
Q5. How does it differ from other tools (Presidio / openai)?
- Presidio — English-centric. Lacks Korea-specific PII (RRN/FRN/passport, etc.)
- openai/privacy-filter — general multilingual PII. No labels for Korea's 14 core categories
- ko-pii — Korea-specific 33 categories, checksum verification, automatic legal-basis attachment
Q6. Name-only detection has too many false positives. Can I block only when a name + phone number appear together?
Use PERMISSIVE mode (block CRITICAL only) + conditional reprocessing with combined_risk:
from ko_pii import Anonymizer, ProcessingMode, RiskLevel
result = Anonymizer(mode=ProcessingMode.PERMISSIVE).process(text)
if result.combined_risk.combined_risk >= RiskLevel.HIGH:
    result = Anonymizer(mode=ProcessingMode.STRICT).process(text)

git clone https://github.com/Marker-Inc-Korea/ko-pii
cd ko-pii
pip install -e ".[dev]"
pytest # full test suite passes

Detailed docs: see the docs/ directory.
MIT License
- Personal Information Protection Act (Articles 2, 23, 24, 24-2, 28-2 to 28-5, 29)
- Personal Information Protection Commission, "Guidelines on the Processing of Pseudonymized Information" and "Guidelines on De-identification Measures for Personal Information"
- Commercial Act Article 40, Immigration Act Article 31, National Health Insurance Act Article 96, Act on Real Name Financial Transactions