Commit a332db3

Yoojin-nam and claude committed
feat(present-paper): word-boundary parser + pronunciation augment + sharing-ready notes strip
Adds four patterns learned across nine iterations of a graduate-level academic lecture build cycle, all anonymized:

1. Word-boundary aware markdown parser: italic/bold regex with lookahead/lookbehind so HLA alleles, SNP IDs, and other alphanumeric-adjacent asterisks survive add_styled() unchanged. Documented in SKILL.md as a mandatory regex for any Nature/Lancet build script that handles allele-rich slide bodies.
2. inject_pronunciation_notes.py: CLI that appends a per-slide "[ Pronunciation ]" section to speaker notes from a YAML/JSON PRON_DICT. Word-boundary regex avoids false positives on short acronyms; separate allele regex synthesizes readings on the fly. Idempotent. Audience view unaffected; Presenter View only.
3. Speaker notes statistics density rule: when slide body already shows exact OR/CI/p-value, the notes should be narrative anchors + "see slide body" reminders, not numeric listings. Quick QC measurement (chars + stat-token count) documented.
4. strip_notes_for_sharing.py: clears every slide's notes_text_frame and rewrites docProps/app.xml so the PPTX can be circulated to the audience without leaking presenter-only narrative or pronunciation hints. Recommended 3-file sharing package (.pptx + .pdf + refs.zip) documented with naming convention <topic>_<initials>.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent b7e5d76 commit a332db3

3 files changed

Lines changed: 431 additions & 0 deletions

skills/present-paper/SKILL.md

Lines changed: 113 additions & 0 deletions
@@ -274,6 +274,119 @@ headers all removed). When the deck slot expects only the figure body
(default for `build_pptx_nature_lancet.py`), point `FIG_DIR` at the cropped
output dir.

### Word-boundary aware markdown parser (mandatory for HLA-rich decks)

When the build script parses inline `**bold**` / `*italic*` markers in slide
body or speaker notes, the italic rule must use **word-boundary lookahead /
lookbehind** so asterisk-bearing scientific tokens (HLA alleles like
`DRB1*07:01`, `HLA-A*02:01`, SNP IDs, footnote markers) are not eaten as
italic delimiters:

```python
import re

pattern = re.compile(
    r"(\*\*(?:(?!\*\*).)+?\*\*"                        # bold; inner single * allowed
    r"|(?<![A-Za-z0-9])\*[^*\n]+?\*(?![A-Za-z0-9]))"  # italic (word-boundary)
)
```

Two regex tricks together:
1. **Italic with boundary**: `(?<![A-Za-z0-9])` rejects an opening `*` that
   directly follows an alphanumeric, and `(?![A-Za-z0-9])` rejects a closing
   `*` that directly precedes one, so `DRB1*07:01` is left intact.
2. **Bold tolerates inner single `*`**: `(?:(?!\*\*).)+?` allows
   `**DRB1*04:02**` (HLA allele inside bold) to match as a single bold span.

Without these, a naive `\*[^*]+\*` italic pattern pairs up the asterisks of
any two alleles in the same text run and silently corrupts both. Add the
regex to `add_styled()` (or equivalent) in every Nature/Lancet-style build
script.

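The boundary behavior described above can be checked in isolation. The sketch below is not the skill's `add_styled()` (its implementation is not shown here); `split_spans` and `NAIVE_ITALIC` are hypothetical names introduced only for illustration, and the sample sentence is made up.

```python
import re

# Same pattern as in the section above.
PATTERN = re.compile(
    r"(\*\*(?:(?!\*\*).)+?\*\*"
    r"|(?<![A-Za-z0-9])\*[^*\n]+?\*(?![A-Za-z0-9]))"
)
# The naive italic rule the section warns about.
NAIVE_ITALIC = re.compile(r"\*[^*]+\*")


def split_spans(text):
    """Split text into (style, content) pairs: plain, bold, or italic."""
    spans = []
    pos = 0
    for m in PATTERN.finditer(text):
        if m.start() > pos:
            spans.append(("plain", text[pos:m.start()]))
        token = m.group(0)
        if token.startswith("**"):
            spans.append(("bold", token[2:-2]))    # strip the ** delimiters
        else:
            spans.append(("italic", token[1:-1]))  # strip the * delimiters
        pos = m.end()
    if pos < len(text):
        spans.append(("plain", text[pos:]))
    return spans


if __name__ == "__main__":
    sample = "Risk: **DRB1*04:02** and DRB1*07:01 vs HLA-A*02:01, *in vitro* only"
    for style, content in split_spans(sample):
        print(f"{style:6} | {content}")
    # The naive rule treats the two allele asterisks as one italic span.
    print(NAIVE_ITALIC.search("DRB1*07:01 vs HLA-A*02:01").group(0))
```

In a real build script each `(style, content)` pair would presumably become one run with the matching font attribute; that wiring depends on the script and is left out here.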
### Pronunciation auto-augment for non-native presenters

For decks where the presenter is uncomfortable with English pronunciation of
acronyms, author names, drug names, or gene symbols, append a per-slide
`[ Pronunciation ]` section to the speaker notes. The section is visible only
in Presenter View; the audience sees nothing. Use
`${CLAUDE_SKILL_DIR}/scripts/inject_pronunciation_notes.py`:

```bash
python3 "${CLAUDE_SKILL_DIR}/scripts/inject_pronunciation_notes.py" \
    input.pptx output.pptx \
    --dict pron_dict.yaml \
    --header "[ 발음 ]"   # Korean for "[ Pronunciation ]"; any header works
```

The script:
- Loads a YAML/JSON `PRON_DICT` (term → [reading, full_name]) supplied by
  the caller. The dict is domain-specific: assemble it for your audience
  (Korean readings, French readings, Spanish readings, etc.).
- Uses **word-boundary regex** `(?<![A-Za-z0-9_]) … (?![A-Za-z0-9_])` so
  short acronyms (e.g. `AE`, `OR`) only match when standalone, never inside
  other words.
- Recognizes allele-style tokens via a separate regex
  (`\b(?:HLA-)?[A-Z]{1,5}[0-9]?\*[0-9]{2}:[0-9]{2}(?:N|L|S|Q)?\b` by
  default) and synthesizes their reading from the dict entry for the base
  gene name (e.g. `DRB1`), falling back to the lowercased gene name when no
  entry exists.
- Skips slides whose notes already contain the header (idempotent, safe to
  re-run).

Realistic yield on a 47-slide academic deck: ~38 slides receive a section,
~300 total term entries, 5–10 per annotated slide. Transition and divider
slides have empty notes and are auto-skipped.

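The two matching rules in the list above can be tried without a PPTX file. This is a minimal sketch that mirrors the logic of `find_terms()` in the script; `standalone`, `allele_reading`, the sample notes, and the reading `"dee-ar-bee-one"` are hypothetical and exist only for the illustration.

```python
import re

# Default allele pattern from inject_pronunciation_notes.py.
ALLELE_RE = re.compile(
    r"\b(?:HLA-)?[A-Z]{1,5}[0-9]?\*[0-9]{2}:[0-9]{2}(?:N|L|S|Q)?\b"
)

NOTES = "The OR for DRB1*07:01 carriers was higher; MORTALITY did not differ."


def standalone(term, text):
    """True if term occurs in text without an adjacent letter, digit, or '_'."""
    pat = r"(?<![A-Za-z0-9_])" + re.escape(term) + r"(?![A-Za-z0-9_])"
    return re.search(pat, text) is not None


def allele_reading(allele, base_readings):
    """Build 'gene star NN colon NN' from a gene -> reading mapping."""
    gene, _, fields = allele.replace("HLA-", "").partition("*")
    base = base_readings.get(gene, gene.lower())  # fall back to the gene name
    return f"{base} star {fields.replace(':', ' colon ')}"


if __name__ == "__main__":
    print(standalone("OR", NOTES))                       # standalone "OR" is found
    print(standalone("OR", "MORTALITY did not differ"))  # "OR" inside a word is not
    for allele in ALLELE_RE.findall(NOTES):
        print(allele, "->", allele_reading(allele, {"DRB1": "dee-ar-bee-one"}))
```

Matching is case-sensitive, so a dict key `OR` does not fire on the lowercase word "or" or on "for".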
### Speaker notes statistics density

When the slide body already shows exact OR / 95% CI / p-value, the notes
should NOT repeat the same numbers: the presenter ends up reading
statistics aloud and the audience cannot keep up. Notes should be a
**narrative** (key anchors + one-line "see the slide body for the exact
numbers" reminder), not a numeric listing.

Quick measurement to spot dense slides during QC:

```python
import re

# `slide` is a python-pptx slide; `pron_header` is the pronunciation header.
# Only the narrative before the header is measured.
text = slide.notes_slide.notes_text_frame.text.split(pron_header)[0]
n_char = len(text)
# Stat tokens: "OR 2", "p < 0", "CI 1", any decimal, any percentage, "×10".
n_stat = len(re.findall(r"\b(?:OR|p|CI)\s*[=<>]?\s*\d|\d+\.\d+|\d+%|×10", text))
needs_compression = n_char > 1000 and n_stat >= 5
```

Rule of thumb: 700–1,000 chars + 0–2 stat tokens is fine (30–60-second
narrative). >1,000 chars + ≥5 stat tokens → compress to narrative tone and
point at the slide body. Exact numbers belong in the slide body and
footnotes, which serve as the single source of truth (SSOT), not in the
notes.

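To run the same measurement on plain strings (for example on notes exported to text), the snippet above can be wrapped in a function. This is a sketch under that assumption; `notes_density` and `STAT_TOKEN` are hypothetical names, and the thresholds are the ones from the rule of thumb.

```python
import re

# Same stat-token pattern as the QC snippet above.
STAT_TOKEN = re.compile(r"\b(?:OR|p|CI)\s*[=<>]?\s*\d|\d+\.\d+|\d+%|×10")


def notes_density(notes_text, pron_header="[ Pronunciation ]"):
    """Return (n_char, n_stat, needs_compression) for one slide's notes."""
    # Ignore the pronunciation section; it is not part of the narrative.
    narrative = notes_text.split(pron_header)[0]
    n_char = len(narrative)
    n_stat = len(STAT_TOKEN.findall(narrative))
    return n_char, n_stat, (n_char > 1000 and n_stat >= 5)


if __name__ == "__main__":
    sentence = "OR 2.31 (95% CI 1.20–4.45), p < 0.001"
    # One sentence of this kind already yields five stat tokens.
    print(STAT_TOKEN.findall(sentence))
    print(notes_density((sentence + ". ") * 30))
```

Note that the counter is deliberately coarse: a labelled value such as `OR 2.31` counts once, because the label and the leading digit are consumed together.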
### Sharing-ready notes-stripped variant

Decks are often shared with the audience after the presentation (e.g. a
professor asks for the slides), but the speaker notes typically contain
presenter-only material: second-language narrative, pronunciation hints,
self-referential reminders ("Prof. ○○ will likely ask about …"). Stripping
notes is mandatory before circulation. Use
`${CLAUDE_SKILL_DIR}/scripts/strip_notes_for_sharing.py`:

```bash
python3 "${CLAUDE_SKILL_DIR}/scripts/strip_notes_for_sharing.py" \
    presenter_v9.pptx "share/<topic>_<initials>.pptx"
```

The script:
- Clears every slide's `notes_text_frame` (idempotent, slide body and
  figures untouched).
- Re-writes `docProps/app.xml` with the correct `<Slides>` and `<Notes>`
  counts so PowerPoint for Mac does not show its repair dialog (see also the
  app.xml canonical fix in `pptx-mac-compatibility.md` §5).
- Verifies that zero notes characters remain.

Recommended 3-file sharing package (filename pattern `<topic>_<initials>`):
- `<topic>_<initials>.pptx`: notes-stripped variant for slide reuse
- `<topic>_<initials>.pdf`: same deck, PDF for environment-agnostic
  preview (LibreOffice `--convert-to pdf` automatically drops the cleared
  notes pages)
- `<topic>_<initials>_references.zip`: optional bundle of the reference
  PDFs; if it exceeds the email attachment limit, send a Google Drive link.

In the cover email, mention the PPTX is included specifically so the
recipient can reuse individual slides if useful.

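`strip_notes_for_sharing.py` itself is not reproduced here, so the following is only a sketch of the clear-and-verify steps from the list above, assuming python-pptx. The function names are hypothetical, and the `docProps/app.xml` rewrite is deliberately omitted.

```python
def strip_notes(prs):
    """Clear the notes text of every slide; return how many slides changed."""
    cleared = 0
    for slide in prs.slides:
        if not slide.has_notes_slide:
            continue  # accessing notes_slide would create one as a side effect
        tf = slide.notes_slide.notes_text_frame
        if tf is not None and tf.text:
            tf.clear()  # python-pptx: leaves a single empty paragraph
            cleared += 1
    return cleared


def remaining_note_chars(prs):
    """Total characters still present in speaker notes; 0 means clean."""
    total = 0
    for slide in prs.slides:
        if slide.has_notes_slide:
            tf = slide.notes_slide.notes_text_frame
            if tf is not None:
                total += len(tf.text)
    return total


def strip_file(src, dst):
    """Write a notes-free copy of src to dst (app.xml is not touched here)."""
    from pptx import Presentation  # third-party: pip install python-pptx

    prs = Presentation(src)
    strip_notes(prs)
    if remaining_note_chars(prs) != 0:
        raise RuntimeError("speaker notes were not fully cleared")
    prs.save(dst)
```

Because a second call to `strip_notes` finds nothing left to clear, the operation is idempotent in the same sense as the script's first bullet.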
### Architecture

```
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""Append a pronunciation guide section to every slide's speaker notes.

Designed for non-native presenters who want a per-slide reading reference
without affecting the audience-facing view. Scans each slide's notes text
for tokens defined in a YAML/JSON ``PRON_DICT`` file (term → reading +
full-name), uses word-boundary regex to avoid false positives, and writes
a "[ Pronunciation ]" (or user-defined) section at the bottom of the notes.

Also auto-matches allele-style tokens (HLA-DRB1*07:01, HLA-A*02:01, …)
that match a configurable regex and synthesizes a reading by combining
the base reading from PRON_DICT with "star NN colon NN".

This script is invocation-agnostic: it modifies a PPTX in place (or to a
new path) and does not assume any specific lecture topic. The PRON_DICT
file is supplied by the caller and is the only language/domain config.

Usage
-----

PRON_DICT is YAML or JSON. Each key is the term as it appears in the
notes text. Each value is a 2-item list [reading, full_name]; a bare
string is treated as the reading alone.

```yaml
# pron_dict.yaml (readings are in Hangul here; use any target language)
GWAS: ["지와스", "Genome-wide association study"]
HLA: ["에이치-엘-에이", "Human leukocyte antigen"]
LGI1: ["엘-지-아이-원", "Leucine-rich glioma-inactivated 1"]
Perriot: ["페리오", "French name, silent t"]
```

```bash
# "[ 발음 ]" is Korean for "[ Pronunciation ]"
python3 inject_pronunciation_notes.py input.pptx output.pptx \
    --dict pron_dict.yaml --header "[ 발음 ]"
```
"""
from __future__ import annotations

import argparse
import json
import re
import shutil
import sys
from pathlib import Path

from pptx import Presentation
from pptx.util import Pt

DEFAULT_ALLELE_RE = r"\b(?:HLA-)?[A-Z]{1,5}[0-9]?\*[0-9]{2}:[0-9]{2}(?:N|L|S|Q)?\b"
DEFAULT_HEADER = "[ Pronunciation ]"


def load_dict(path: Path) -> dict:
    """Load PRON_DICT and normalize each value to a (reading, fullname) tuple."""
    suffix = path.suffix.lower()
    if suffix in (".yaml", ".yml"):
        try:
            import yaml  # type: ignore
        except ImportError:
            raise SystemExit("PyYAML required for YAML dict; pip install pyyaml")
        # Explicit UTF-8: readings are usually non-ASCII and the platform
        # default encoding is not UTF-8 everywhere.
        with path.open(encoding="utf-8") as f:
            raw = yaml.safe_load(f)
    elif suffix == ".json":
        with path.open(encoding="utf-8") as f:
            raw = json.load(f)
    else:
        raise SystemExit(f"unsupported dict format: {path.suffix}")
    if not isinstance(raw, dict):
        # e.g. an empty YAML file loads as None
        raise SystemExit(f"{path}: expected a mapping of term -> [reading, full_name]")
    out = {}
    for term, value in raw.items():
        if isinstance(value, (list, tuple)) and len(value) >= 1:
            reading = value[0]
            fullname = value[1] if len(value) > 1 else ""
        elif isinstance(value, str):
            reading, fullname = value, ""
        else:
            continue  # skip malformed entries
        out[str(term)] = (reading, fullname)
    return out


def find_terms(text: str, pron_dict: dict, allele_re: str | None):
    """Return [(term, reading, fullname)] for every dict key appearing in *text*."""
    if not text:
        return []
    hits = []
    seen = set()
    for term, (reading, fullname) in pron_dict.items():
        # Standalone match only: "OR" must not fire inside a longer token.
        pat = r"(?<![A-Za-z0-9_])" + re.escape(term) + r"(?![A-Za-z0-9_])"
        if re.search(pat, text):
            hits.append((term, reading, fullname))
            seen.add(term)
    if allele_re:
        # finditer + group(0) yields the whole token even when a custom
        # --allele-regex contains capturing groups (findall would not).
        alleles = sorted({m.group(0) for m in re.finditer(allele_re, text)})
        for a in alleles:
            if a in seen:
                continue  # an explicit dict entry wins over a synthesized one
            base = a.replace("HLA-", "").split("*")[0]
            tail = a.split("*", 1)[1] if "*" in a else ""
            base_reading = pron_dict.get(base, (base.lower(), ""))[0]
            reading = f"{base_reading} star {tail.replace(':', ' colon ')}"
            hits.append((a, reading, "HLA allele"))
            seen.add(a)
    return hits


def _add_run(paragraph, text: str, size: int = 11, bold: bool = False,
             italic: bool = False):
    """Append a run with the given text and font settings to *paragraph*."""
    run = paragraph.add_run()
    run.text = text
    run.font.size = Pt(size)
    if bold:
        run.font.bold = True
    if italic:
        run.font.italic = True
    return run


def inject_notes(slide, terms, header: str):
    """Append the header and one line per term; return True if anything was added."""
    if not terms:
        return False
    tf = slide.notes_slide.notes_text_frame

    _add_run(tf.add_paragraph(), " ")  # blank separator
    _add_run(tf.add_paragraph(), header, size=12, bold=True)

    # each term: "▪ term — reading · fullname"
    for term, reading, fullname in terms:
        p = tf.add_paragraph()
        _add_run(p, f"▪ {term}", bold=True)
        _add_run(p, f" — {reading}")
        if fullname:
            _add_run(p, f" · {fullname}", italic=True)
    return True


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("src", type=Path)
    ap.add_argument("dst", type=Path)
    ap.add_argument("--dict", required=True, type=Path,
                    help="path to pron_dict.yaml or pron_dict.json")
    ap.add_argument("--header", default=DEFAULT_HEADER,
                    help="section header text (default: %(default)s)")
    ap.add_argument("--allele-regex", default=DEFAULT_ALLELE_RE,
                    help="regex for allele-style tokens (set to empty string to disable)")
    args = ap.parse_args()

    if not args.src.exists():
        print(f"source not found: {args.src}", file=sys.stderr)
        sys.exit(1)
    if args.src != args.dst:
        shutil.copy(args.src, args.dst)

    pron_dict = load_dict(args.dict)
    print(f"loaded {len(pron_dict)} terms from {args.dict}")

    allele_re = args.allele_regex or None
    prs = Presentation(args.dst)
    n_injected = 0
    n_terms_total = 0
    for slide in prs.slides:
        if not slide.has_notes_slide:
            continue
        tf = slide.notes_slide.notes_text_frame
        if tf is None:
            continue  # notes slide without a notes placeholder
        body = tf.text
        if args.header in body:
            continue  # already injected on a previous run
        terms = find_terms(body, pron_dict, allele_re)
        if inject_notes(slide, terms, args.header):
            n_injected += 1
            n_terms_total += len(terms)
    prs.save(args.dst)
    print(f"injected pronunciation on {n_injected} slides ({n_terms_total} term entries)")
    print(f"OK: {args.dst}")


if __name__ == "__main__":
    main()
