Commit 338b7ae

Yoojin-nam and claude committed
chore(privacy): close gaps — TODO files, CSL upstream emails, CI enforcement
Additional findings caught after the first scrub pass:
- skills/find-journal/TODO_neurointervention_profiles.md leaked a real professor name + hospital reference (in a section that was itself explaining how to keep such names OUT of the public repo, ironic). Replaced with parameterized examples.
- CSL maintainer emails (skills/manage-refs/citation_styles/*.csl) are upstream open-source attribution; explicitly whitelisted with provenance comment so a future swap-in does not silently pass.
- skills/deidentify/tests/test_phi_korean.csv contains synthetic Korean PHI for de-identifier testing. Added tests/README.md asserting all values are placeholder/constructed.

Linter strengthening (validate_skills.sh):
- TODO_*.md files at skill top-level are now scanned by rules 6/7/7b (PII checks). Previously excluded entirely. Verified with negative test (re-add prof name → FAIL → revert → PASS).
- precedent_patterns extended: 임현철, 남유진, 삼성서울, 삼성창원, 서울아산.

GitHub Actions (.github/workflows/validate.yml):
- Server-side enforcement of validate_skills.sh on every push to main and every PR. Closes the gap where a local commit with --no-verify (or a commit from a different machine without the pre-commit hook) could reach the public repo unchecked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 5c2935f commit 338b7ae

4 files changed

Lines changed: 78 additions & 13 deletions

.github/workflows/validate.yml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+name: Validate skills (PII + structure)
+
+# Server-side enforcement of validate_skills.sh.
+# This catches commits that bypassed the local pre-commit hook
+# (--no-verify, different machine, different user) before they reach main.
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  validate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install Python deps for contract validator
+        run: pip install pyyaml
+
+      - name: Run validate_skills.sh (PII + structure)
+        run: bash scripts/validate_skills.sh
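
The gap this workflow closes can be reproduced locally. The following is a minimal sketch, not part of the commit: it creates a throwaway repository with a pre-commit hook that always fails (a stand-in for a hook that would call `scripts/validate_skills.sh`) and shows that `git commit --no-verify` still creates the commit. The variable names `demo_repo`, `hooked`, and `bypassed` and the helper `commit` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: a failing pre-commit hook blocks a normal commit but not --no-verify.
set -euo pipefail

demo_repo=$(mktemp -d)                  # throwaway repository (hypothetical)
trap 'rm -rf "$demo_repo"' EXIT
git init -q "$demo_repo"
mkdir -p "$demo_repo/.git/hooks"

# Stand-in for the real hook; it always rejects the commit.
cat > "$demo_repo/.git/hooks/pre-commit" <<'EOF'
#!/bin/sh
echo "pre-commit: validation failed" >&2
exit 1
EOF
chmod +x "$demo_repo/.git/hooks/pre-commit"

echo "placeholder" > "$demo_repo/file.txt"
git -C "$demo_repo" add file.txt

# Helper that commits with a fixed demo identity; extra flags are passed through.
commit() {
  git -C "$demo_repo" -c user.name=demo -c user.email=demo@example.com \
      -c commit.gpgsign=false commit -q -m "demo" "$@"
}

if commit 2>/dev/null; then hooked=committed; else hooked=blocked; fi
if commit --no-verify 2>/dev/null; then bypassed=committed; else bypassed=blocked; fi

echo "with hook: $hooked, with --no-verify: $bypassed"
```

Because the hook only runs on the committer's machine, a check that runs on every push and pull request is the only place where the bypass has no effect.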

scripts/validate_skills.sh

Lines changed: 20 additions & 9 deletions
@@ -111,9 +111,12 @@ for skill_dir in "$SKILLS_DIR"/*/; do
 
   # ---------------- Content Integrity (v2 lints) ----------------
   # Scope: SKILL.md + references/**/*.md only (shipped prose).
-  # Excluded (meta-docs): TODO_*.md, HANDOFF.md, and scripts/yaml files.
+  # Excluded from style checks (rules 8, 9): TODO_*.md, HANDOFF.md.
+  # PII checks (rules 6, 7, 7b) ALSO scan top-level TODO_*.md files because
+  # those still ship publicly even though they are explicitly meta-docs.
 
   integrity_files=()
+  pii_only_files=()  # Scanned by rules 6/7/7b only, not 8/9.
   [ -f "$skill_file" ] && integrity_files+=("$skill_file")
   if [ -d "${skill_dir}references" ]; then
     while IFS= read -r -d '' f; do
@@ -125,15 +128,19 @@ for skill_dir in "$SKILLS_DIR"/*/; do
       integrity_files+=("$f")
     done < <(find "${skill_dir}references" -type f -name "*.md" -print0 2>/dev/null)
   fi
+  # Pick up TODO_*.md at the skill top level for PII-only scanning
+  while IFS= read -r -d '' f; do
+    pii_only_files+=("$f")
+  done < <(find "${skill_dir%/}" -maxdepth 1 -name "TODO_*.md" -type f -print0 2>/dev/null)
 
   # 6. Personal precedent leak (blocklist of project-specific identifiers)
   # Covers: legacy project IDs (CK-N, MA-N, RFA-Adjunct, MeducAI, CBCT, etc.),
   # institution / mentor identifiers, numbered workspace folders, and the
   # historical prefix patterns (Paper ①②③). Keep additions in alphabetical
   # blocks so future maintainers can spot what is being filtered.
   precedent_hits=0
-  precedent_patterns='\bCK-[0-9]+\b|\bMA-[0-9]+\b|\b0_MI2RL\b|\b1_Samsung_Changwon\b|\b5_Personal_Research\b|\b6_Aperivue\b|\b10_Meta_Analysis\b|\b11_CheckUP\b|\b21_Aneurysm\b|01_RFA_Adjunct|02_CBCT_Biopsy|03_CBCT_Ablation|RFA-Adjunct|RFA_Adjunct|CBCT Ablation MA|CBCT Biopsy MA|Du 2023|FD Occlusion AI SR|FD Occlusion|Paper ①|Paper ②|Paper ③|MeducAI|CXRscoliosis|SkullFx|Samsung Changwon|Asan/UoU|\bKKW\b|\bLHC\b|\bKDY\b|\bLWJ\b|김경원|이덕희|김남국|Hyunchul Rhim|Pa Hong|Taein An|Hye Ree Cho|Yoojin Nam|Dong Yeong Kim|Kyung Won Kim|Jeong Min Song|Jaeyoon Kim'
-  for f in "${integrity_files[@]}"; do
+  precedent_patterns='\bCK-[0-9]+\b|\bMA-[0-9]+\b|\b0_MI2RL\b|\b1_Samsung_Changwon\b|\b5_Personal_Research\b|\b6_Aperivue\b|\b10_Meta_Analysis\b|\b11_CheckUP\b|\b21_Aneurysm\b|01_RFA_Adjunct|02_CBCT_Biopsy|03_CBCT_Ablation|RFA-Adjunct|RFA_Adjunct|CBCT Ablation MA|CBCT Biopsy MA|Du 2023|FD Occlusion AI SR|FD Occlusion|Paper ①|Paper ②|Paper ③|MeducAI|CXRscoliosis|SkullFx|Samsung Changwon|삼성서울|삼성창원|서울아산|Asan/UoU|\bKKW\b|\bLHC\b|\bKDY\b|\bLWJ\b|김경원|이덕희|김남국|임현철|남유진|Hyunchul Rhim|Pa Hong|Taein An|Hye Ree Cho|Yoojin Nam|Dong Yeong Kim|Kyung Won Kim|Jeong Min Song|Jaeyoon Kim'
+  for f in "${integrity_files[@]}" ${pii_only_files[@]+"${pii_only_files[@]}"}; do
     if grep -qE "$precedent_patterns" "$f"; then
       hit=$(grep -nE "$precedent_patterns" "$f" | head -1)
       rel="${f#$REPO_ROOT/}"
@@ -145,7 +152,7 @@ for skill_dir in "$SKILLS_DIR"/*/; do
 
   # 7. Absolute path leak (/Users/eugene/ or /home/<user>/)
   path_hits=0
-  for f in "${integrity_files[@]}"; do
+  for f in "${integrity_files[@]}" ${pii_only_files[@]+"${pii_only_files[@]}"}; do
     if grep -qE '/Users/eugene/|/home/eugene/' "$f"; then
       hit=$(grep -nE '/Users/eugene/|/home/eugene/' "$f" | head -1)
       rel="${f#$REPO_ROOT/}"
@@ -155,18 +162,22 @@ for skill_dir in "$SKILLS_DIR"/*/; do
   done
   [ "$path_hits" -eq 0 ] && pass "Absolute paths (no personal home-dir leak)"
 
-  # 7b. Real personal email leak. Whitelist: example.com / example.org /
-  # known journal editorial-office domains (sciencedirect, lancet, ahajournals,
-  # wjgnet, kams, wiley, aasld) + `your@email.com` style placeholders.
+  # 7b. Real personal email leak. Whitelist categories:
+  # - placeholder/example domains
+  # - known journal editorial-office domains
+  # - upstream open-source CSL maintainer addresses (vendored from
+  # citationstyles.org; these are publicly registered style maintainers,
+  # not user PII). Keep the explicit list here so a typo or rebase that
+  # swaps in a different email does not silently pass.
   email_hits=0
   email_pattern='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
-  email_whitelist='example\.com|example\.org|your@email\.com|user@host|name@|placeholder|noreply@|@lancet\.com|@strokeahajournal\.org|@aasld\.org|@wjgnet\.com|@wiley\.com|@kams\.or\.kr|@journal\.|aim-aicro\.com'
+  email_whitelist='example\.com|example\.org|your@email\.com|user@host|name@|placeholder|noreply@|@lancet\.com|@strokeahajournal\.org|@aasld\.org|@wjgnet\.com|@wiley\.com|@kams\.or\.kr|@journal\.|aim-aicro\.com|francis\.deng@gmail\.com|obrienpat86@gmail\.com|atunis@gmail\.com|charles\.parnot@gmail\.com|citationstyler@gmail\.com|mberkowi@gmu\.edu'
   # Note: `aim-aicro.com` is a corporate domain that historically appeared in a
   # personal author roster. We allow the bare domain here only because the
   # precedent blocklist already catches the full `kyungwon.kim@aim-aicro.com`
   # string by way of the personal-name patterns above; remove from this
   # whitelist if the bare domain ever surfaces on its own.
-  for f in "${integrity_files[@]}"; do
+  for f in "${integrity_files[@]}" ${pii_only_files[@]+"${pii_only_files[@]}"}; do
     matches=$(grep -nE "$email_pattern" "$f" | grep -vE "$email_whitelist" || true)
     if [ -n "$matches" ]; then
       rel="${f#$REPO_ROOT/}"
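
Two idioms in this diff are easier to see in isolation. The sketch below is illustrative only and makes several assumptions: the whitelist is shortened, the fixture files and the `scan` helper are hypothetical, and the addresses are made up. It shows the `${arr[@]+"${arr[@]}"}` expansion, which appears to guard against an empty `pii_only_files` array being treated as unbound under `set -u` on Bash older than 4.4, together with the match-then-filter pipeline used by rule 7b.

```shell
#!/usr/bin/env bash
# Sketch: empty-array-safe expansion plus the email match/whitelist filter.
set -euo pipefail

email_pattern='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
email_whitelist='example\.com|noreply@'   # shortened, illustrative whitelist

work=$(mktemp -d)                         # hypothetical fixture directory
trap 'rm -rf "$work"' EXIT
printf 'contact: someone@example.com\n' > "$work/SKILL.md"
printf 'ask real.person@hospital.test\n' > "$work/TODO_demo.md"

integrity_files=("$work/SKILL.md")
pii_only_files=()                         # empty on purpose

# scan: print how many of the given files contain a non-whitelisted address.
scan() {
  local f matches hits=0
  for f in "$@"; do
    # The trailing "|| true" keeps a no-match grep from aborting under set -e.
    matches=$(grep -nE "$email_pattern" "$f" | grep -vE "$email_whitelist" || true)
    if [ -n "$matches" ]; then hits=$((hits + 1)); fi
  done
  echo "$hits"
}

# With an empty pii_only_files the guarded expansion contributes no words.
before=$(scan "${integrity_files[@]}" ${pii_only_files[@]+"${pii_only_files[@]}"})

pii_only_files+=("$work/TODO_demo.md")
after=$(scan "${integrity_files[@]}" ${pii_only_files[@]+"${pii_only_files[@]}"})

echo "hits before: $before, after: $after"
```

The whitelisted address in `SKILL.md` never counts as a hit; the count only changes once the TODO file joins the scan list, which is the behavior the commit adds.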

skills/deidentify/tests/README.md

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Deidentify Test Fixtures
+
+All CSV files in this directory contain **synthetic test data only**.
+No real patient or person is represented.
+
+- **Names** (`김철수`, `이영희`, `박민수`, etc.) are common Korean placeholder
+  names equivalent to "John Doe" / "Jane Doe" in English. They were chosen
+  precisely because they are generic enough to be unattributable to any
+  real individual.
+- **RRN (주민번호)** values follow the public format specification but the
+  digits are arbitrary and do not validate against the official checksum
+  algorithm used by the Korean civil registry.
+- **Phone numbers**, **addresses**, **emails**, **chart numbers**, and
+  **diagnoses** are all constructed for the purpose of exercising the
+  PHI detector regexes shipped with `/deidentify`.
+
+These fixtures exist to verify that the de-identifier:
+1. Detects the PHI patterns the skill claims to detect.
+2. Leaves non-PHI fields (clinical measurements, dates of routine
+   nature) untouched.
+3. Handles edge cases (mixed date formats, half-width vs full-width
+   digits, comma vs newline separators, missing fields).
+
+If you need to add a new fixture, follow the same rule: every value must
+be either a published format example or a constructed synthetic string.
+Never copy real EMR data into this directory, even for one-off debugging.
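
The README's claim that fixture RRNs "do not validate against the official checksum" can be checked mechanically. The sketch below assumes the commonly documented mod-11 check-digit scheme (weights 2 through 9, then 2 through 5, over the first 12 digits); it is not taken from the repository, and the function names `rrn_check_digit` and `rrn_is_valid` are hypothetical. The sample value has an impossible birth month, so it cannot belong to a real person.

```shell
#!/usr/bin/env bash
# Sketch: mod-11 check digit as commonly described for Korean RRNs (assumption).
set -euo pipefail

# rrn_check_digit: print the check digit computed from the first 12 digits.
rrn_check_digit() {
  local digits=$1 sum=0 i
  local weights=(2 3 4 5 6 7 8 9 2 3 4 5)
  for i in "${!weights[@]}"; do
    sum=$(( sum + ${digits:i:1} * weights[i] ))
  done
  echo $(( (11 - sum % 11) % 10 ))
}

# rrn_is_valid: succeed only if the 13th digit equals the computed check digit.
rrn_is_valid() {
  local rrn=${1//-/}                      # drop the hyphen separator
  [[ $rrn =~ ^[0-9]{13}$ ]] || return 1
  [ "$(rrn_check_digit "${rrn:0:12}")" = "${rrn:12:1}" ]
}

# Synthetic input: month "34" does not exist, so this is not a real number.
for candidate in 123456-1234560 123456-1234563; do
  if rrn_is_valid "$candidate"; then
    echo "$candidate: checksum matches"
  else
    echo "$candidate: checksum does not match"
  fi
done
```

A fixture reviewer could run a check like this over new CSV rows and reject any value whose checksum matches, which keeps constructed numbers from accidentally passing as well-formed.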

skills/find-journal/TODO_neurointervention_profiles.md

Lines changed: 4 additions & 4 deletions
@@ -94,12 +94,12 @@ Priority order:
 
 ### Key distinction
 
-The 15 Tier 1~3 journals in this TODO are **effectively shared assets** (Stroke, JNS, Neurosurgery, and others are useful to any neurointervention researcher). The user perceives them as "personal" only because they were collected for their own FD Occlusion project, but the profile content itself is universal. Therefore, **committing all 15 publicly is recommended**.
+The 15 Tier 1~3 journals in this TODO are **effectively shared assets** (Stroke, JNS, Neurosurgery, and others are useful to any neurointervention researcher). Even though they were added in the context of a specific project, the profile content itself is universal. Therefore, **committing all 15 publicly is recommended**.
 
 However, **genuinely personal profiles** may appear in the future:
-- "SMC_internal_radiology_only.md" (internal preferred-journal list of 삼성서울병원)
-- "HRP_Rhim_preferred.md" (set of journals preferred by Prof. 임현철)
-- "_submission_blacklist.md" (journals with a rejection history)
+- Institution-internal preferred-journal list (e.g., "<Institution>_internal_only.md")
+- Set of journals preferred by a specific mentor (e.g., "<MentorInitials>_preferred.md")
+- Submission blacklist (journals with a rejection history)
 
 There is no reason to put these in the public repo.
 