feat(present-paper): word-boundary parser + pronunciation augment + sharing-ready notes strip

Yoojin-nam · claude · Yoojin-nam · commit a332db37a464 · 2026-05-19T19:48:48.000+09:00
Adds four patterns learned across nine iterations of a graduate-level
academic lecture build cycle, all anonymized:

1. Word-boundary aware markdown parser — italic/bold regex with
   lookahead/lookbehind so HLA alleles, SNP IDs, and other
   alphanumeric-adjacent asterisks survive add_styled() unchanged.
   Documented in SKILL.md as a mandatory regex for any Nature/Lancet
   build script that handles allele-rich slide bodies.

2. inject_pronunciation_notes.py — CLI that appends a per-slide
   "[ Pronunciation ]" section to speaker notes from a YAML/JSON
   PRON_DICT. Word-boundary regex avoids false positives on short
   acronyms; separate allele regex synthesizes readings on the fly.
   Idempotent. Audience view unaffected — Presenter View only.

3. Speaker notes statistics density rule — when slide body already
   shows exact OR/CI/p-value, the notes should be narrative anchors
   + "see slide body" reminders, not numeric listings. Quick QC
   measurement (chars + stat-token count) documented.

4. strip_notes_for_sharing.py — clears every slide's notes_text_frame
   and rewrites docProps/app.xml so the PPTX can be circulated to the
   audience without leaking presenter-only narrative or pronunciation
   hints. Recommended 3-file sharing package (.pptx + .pdf + refs.zip)
   documented with naming convention <topic>_<initials>.
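
As a quick check of pattern 1, here is a minimal sketch of how the regex
documented in SKILL.md could drive a span splitter. `split_styled` is a
hypothetical helper written for illustration only; it is not the
`add_styled()` of the build script, which this commit does not show.

```python
import re

# Pattern copied from the SKILL.md section added in this commit.
PATTERN = re.compile(
    r"(\*\*(?:(?!\*\*).)+?\*\*"                        # bold; inner single * allowed
    r"|(?<![A-Za-z0-9])\*[^*\n]+?\*(?![A-Za-z0-9]))"   # italic (word-boundary)
)


def split_styled(text):
    """Split text into (segment, style) pairs; style is 'bold', 'italic' or 'plain'.

    Hypothetical helper: a real build script would create one run per
    segment instead of returning tuples.
    """
    segments = []
    pos = 0
    for m in PATTERN.finditer(text):
        if m.start() > pos:
            segments.append((text[pos:m.start()], "plain"))
        token = m.group(0)
        if token.startswith("**"):
            segments.append((token[2:-2], "bold"))
        else:
            segments.append((token[1:-1], "italic"))
        pos = m.end()
    if pos < len(text):
        segments.append((text[pos:], "plain"))
    return segments


if __name__ == "__main__":
    sample = "DRB1*07:01 is a *risk* allele; **DRB1*04:02** is not."
    for segment, style in split_styled(sample):
        print(f"{style:>6}: {segment!r}")
```

On the same sample sentence, the naive `\*[^*]+\*` rule matches the span
`*07:01 is a *`, which is the corruption the boundary checks are meant to
prevent.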

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/skills/present-paper/SKILL.md b/skills/present-paper/SKILL.md
@@ -274,6 +274,119 @@ headers all removed). When the deck slot expects only the figure body
 (default for `build_pptx_nature_lancet.py`), point `FIG_DIR` at the cropped
 output dir.
 
+### Word-boundary aware markdown parser (mandatory for HLA-rich decks)
+
+When the build script parses inline `**bold**` / `*italic*` markers in slide
+body or speaker notes, the italic rule must use **word-boundary lookahead /
+lookbehind** so asterisk-bearing scientific tokens (HLA alleles like
+`DRB1*07:01`, `HLA-A*02:01`, SNP IDs, footnote markers) are not eaten as
+italic delimiters:
+
+```python
+import re
+pattern = re.compile(
+    r"(\*\*(?:(?!\*\*).)+?\*\*"                           # bold; inner single * allowed
+    r"|(?<![A-Za-z0-9])\*[^*\n]+?\*(?![A-Za-z0-9]))"      # italic (word-boundary)
+)
+```
+
+Two regex tricks together:
+1. **Italic with boundary**: `(?<![A-Za-z0-9])` and `(?![A-Za-z0-9])` reject
+   `*` adjacent to alphanumerics, so `DRB1*07:01` is left intact.
+2. **Bold tolerates inner single `*`**: `(?:(?!\*\*).)+?` allows
+   `**DRB1*04:02**` (HLA allele inside bold) to match as a single bold span.
+
+Without these, a naive `\*[^*]+\*` italic pattern pairs up the asterisks of
+neighboring alleles and italicizes the text between them. Add the regex to
+`add_styled()` (or equivalent) in every Nature/Lancet-style build script.
+
+### Pronunciation auto-augment for non-native presenters
+
+For decks where the presenter is uncomfortable with English pronunciation of
+acronyms, author names, drug names, or gene symbols, append a per-slide
+`[ Pronunciation ]` section to the speaker notes (audience sees nothing —
+only Presenter View). Use
+`${CLAUDE_SKILL_DIR}/scripts/inject_pronunciation_notes.py`:
+
+```bash
+python3 "${CLAUDE_SKILL_DIR}/scripts/inject_pronunciation_notes.py" \
+  input.pptx output.pptx \
+  --dict pron_dict.yaml \
+  --header "[ 발음 ]"            # or any header you like
+```
+
+The script:
+- Loads a YAML/JSON `PRON_DICT` (term → [reading, full_name]) supplied by
+  the caller. The dict is domain-specific — assemble it for your audience
+  (Korean readings, French readings, Spanish readings, etc.).
+- Uses **word-boundary regex** `(?<![A-Za-z0-9_]) … (?![A-Za-z0-9_])` so
+  short acronyms (e.g. `AE`, `OR`) only match when standalone, never inside
+  other words.
+- Recognizes allele-style tokens via a separate regex (default
+  `\b(?:HLA-)?[A-Z]{1,5}[0-9]?\*[0-9]{2}:[0-9]{2}(?:N|L|S|Q)?\b`) and
+  synthesizes their reading from the base allele entry in the dict.
+- Skips slides that already contain the header (idempotent — safe to re-run).
+
+Realistic yield on a 47-slide academic deck: ~38 slides receive a section,
+~300 total term entries, 5–10 per annotated slide. Transition and divider
+slides have empty notes and are auto-skipped.
+
+### Speaker notes statistics density
+
+When the slide body already shows exact OR / 95% CI / p-value, the notes
+should NOT repeat the same numbers — the presenter ends up reading
+statistics aloud and the audience cannot keep up. Notes should be a
+**narrative** (key anchors + one-line "see the slide body for the exact
+numbers" reminder), not a numeric listing.
+
+Quick measurement to spot dense slides during QC:
+
+```python
+import re
+text = slide.notes_slide.notes_text_frame.text.split(pron_header)[0]
+n_char = len(text)
+n_stat = len(re.findall(r"\b(?:OR|p|CI)\s*[=<>]?\s*\d|\d+\.\d+|\d+%|×10", text))
+needs_compression = n_char > 1000 and n_stat >= 5
+```
+
+Rule of thumb: 700–1,000 chars + 0–2 stat tokens is fine (30–60-second
+narrative). >1,000 chars + ≥5 stat tokens → compress to narrative tone and
+point at the slide body. Exact numbers belong in the slide body and
+footnotes (the single source of truth), not the notes.
+
+### Sharing-ready notes-stripped variant
+
+After the presentation, when the deck is shared with the audience (e.g. a
+professor asking for the slides), the speaker notes typically contain
+presenter-only material — second-language narrative, pronunciation hints,
+self-referential reminders ("Prof. ○○ will likely ask about …"). Stripping
+notes is mandatory before circulation. Use
+`${CLAUDE_SKILL_DIR}/scripts/strip_notes_for_sharing.py`:
+
+```bash
+python3 "${CLAUDE_SKILL_DIR}/scripts/strip_notes_for_sharing.py" \
+  presenter_v9.pptx share/<topic>_<initials>.pptx
+```
+
+The script:
+- Clears every slide's `notes_text_frame` (idempotent, slide body and
+  figures untouched).
+- Re-writes `docProps/app.xml` with the correct `<Slides>` and `<Notes>`
+  counts so PowerPoint Mac does not show its repair dialog (see also the
+  app.xml canonical fix in `pptx-mac-compatibility.md` §5).
+- Verifies that zero notes characters remain.
+
+Recommended 3-file sharing package (filename pattern `<topic>_<initials>`):
+- `<topic>_<initials>.pptx` — notes-stripped variant for slide reuse
+- `<topic>_<initials>.pdf` — same deck, PDF for environment-agnostic
+  preview (LibreOffice `--convert-to pdf` automatically drops the cleared
+  notes pages)
+- `<topic>_<initials>_references.zip` — optional bundle of the reference
+  PDFs; if it exceeds the email attachment limit, send a Google Drive link.
+
+In the cover email, mention the PPTX is included specifically so the
+recipient can reuse individual slides if useful.
+
 ### Architecture
 
 ```
diff --git a/skills/present-paper/scripts/inject_pronunciation_notes.py b/skills/present-paper/scripts/inject_pronunciation_notes.py
@@ -0,0 +1,178 @@
+#!/usr/bin/env python3
+"""Append a pronunciation guide section to every slide's speaker notes.
+
+Designed for non-native presenters who want a per-slide reading reference
+without affecting the audience-facing view. Scans each slide's notes text
+for tokens defined in a YAML/JSON ``PRON_DICT`` file (term → reading +
+full-name), uses word-boundary regex to avoid false positives, and writes
+a "[ Pronunciation ]" (or user-defined) section at the bottom of the notes.
+
+Also auto-matches allele-style tokens (HLA-DRB1*07:01, HLA-A*02:01, …)
+that match a configurable regex and synthesizes a reading by combining
+the base reading from PRON_DICT with "star NN colon NN".
+
+This script is invocation-agnostic: it modifies a PPTX in place (or to a
+new path) and does not assume any specific lecture topic. The PRON_DICT
+file is supplied by the caller and is the only language/domain config.
+
+Usage
+-----
+
+PRON_DICT is YAML or JSON. Keys are terms as they appear in the notes text;
+values are [reading, full_name] pairs (a bare string means reading only).
+
+```yaml
+# pron_dict.yaml
+GWAS: ["지와스", "Genome-wide association study"]
+HLA: ["에이치-엘-에이", "Human leukocyte antigen"]
+LGI1: ["엘-지-아이-원", "Leucine-rich glioma-inactivated 1"]
+Perriot: ["페리오", "French name, silent t"]
+```
+
+```bash
+python3 inject_pronunciation_notes.py input.pptx output.pptx \
+    --dict pron_dict.yaml --header "[ 발음 ]"
+```
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import shutil
+import sys
+from pathlib import Path
+
+from pptx import Presentation
+from pptx.util import Pt
+
+DEFAULT_ALLELE_RE = r"\b(?:HLA-)?[A-Z]{1,5}[0-9]?\*[0-9]{2}:[0-9]{2}(?:N|L|S|Q)?\b"
+DEFAULT_HEADER = "[ Pronunciation ]"
+
+
+def load_dict(path: Path) -> dict:
+    if path.suffix.lower() in (".yaml", ".yml"):
+        try:
+            import yaml  # type: ignore
+        except ImportError:
+            raise SystemExit("PyYAML required for YAML dict; pip install pyyaml")
+        with path.open(encoding="utf-8") as f:
+            raw = yaml.safe_load(f)
+    elif path.suffix.lower() == ".json":
+        with path.open(encoding="utf-8") as f:
+            raw = json.load(f)
+    else:
+        raise SystemExit(f"unsupported dict format: {path.suffix}")
+    out = {}
+    for term, value in (raw or {}).items():  # an empty file parses to None
+        if isinstance(value, (list, tuple)) and len(value) >= 1:
+            reading = value[0]
+            fullname = value[1] if len(value) > 1 else ""
+        elif isinstance(value, str):
+            reading, fullname = value, ""
+        else:
+            continue
+        out[term] = (reading, fullname)
+    return out
+
+
+def find_terms(text: str, pron_dict: dict, allele_re: str | None):
+    """Return [(term, reading, fullname)] for every dict key appearing in *text*."""
+    if not text:
+        return []
+    hits = []
+    seen = set()
+    for term, (reading, fullname) in pron_dict.items():
+        pat = r"(?<![A-Za-z0-9_])" + re.escape(term) + r"(?![A-Za-z0-9_])"
+        if re.search(pat, text):
+            if term not in seen:
+                hits.append((term, reading, fullname))
+                seen.add(term)
+    if allele_re:
+        # group(0) keeps whole matches even if a custom regex has capture groups
+        alleles = sorted({m.group(0) for m in re.finditer(allele_re, text)})
+        for a in alleles:
+            if a in seen:
+                continue
+            base = a.replace("HLA-", "").split("*")[0]
+            tail = a.split("*", 1)[1] if "*" in a else ""
+            base_reading = pron_dict.get(base, (base.lower(), ""))[0]
+            reading = f"{base_reading} star {tail.replace(':', ' colon ')}"
+            hits.append((a, reading, "HLA allele"))
+            seen.add(a)
+    return hits
+
+
+def inject_notes(slide, terms, header: str):
+    if not terms:
+        return False
+    tf = slide.notes_slide.notes_text_frame
+
+    # blank separator
+    p = tf.add_paragraph()
+    r = p.add_run(); r.text = " "
+    r.font.size = Pt(11)
+
+    # header
+    p = tf.add_paragraph()
+    r = p.add_run(); r.text = header
+    r.font.bold = True
+    r.font.size = Pt(12)
+
+    # each term: "▪ term  —  reading  ·  fullname"
+    for term, reading, fullname in terms:
+        p = tf.add_paragraph()
+        r = p.add_run(); r.text = f"▪ {term}"
+        r.font.bold = True
+        r.font.size = Pt(11)
+        r2 = p.add_run(); r2.text = f"  —  {reading}"
+        r2.font.size = Pt(11)
+        if fullname:
+            r3 = p.add_run(); r3.text = f"  ·  {fullname}"
+            r3.font.italic = True
+            r3.font.size = Pt(11)
+    return True
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("src", type=Path)
+    ap.add_argument("dst", type=Path)
+    ap.add_argument("--dict", required=True, type=Path,
+                    help="path to pron_dict.yaml or pron_dict.json")
+    ap.add_argument("--header", default=DEFAULT_HEADER,
+                    help="section header text (default: %(default)s)")
+    ap.add_argument("--allele-regex", default=DEFAULT_ALLELE_RE,
+                    help="regex for allele-style tokens (set to empty string to disable)")
+    args = ap.parse_args()
+
+    if not args.src.exists():
+        print(f"source not found: {args.src}", file=sys.stderr)
+        sys.exit(1)
+    if args.src != args.dst:
+        shutil.copy(args.src, args.dst)
+
+    pron_dict = load_dict(args.dict)
+    print(f"loaded {len(pron_dict)} terms from {args.dict}")
+
+    allele_re = args.allele_regex if args.allele_regex else None
+    prs = Presentation(args.dst)
+    n_injected = 0
+    n_terms_total = 0
+    for slide in prs.slides:
+        if not slide.has_notes_slide:
+            continue
+        body = getattr(slide.notes_slide.notes_text_frame, "text", "")  # frame may be None
+        if args.header in body:
+            continue  # already injected on a previous run
+        terms = find_terms(body, pron_dict, allele_re)
+        if inject_notes(slide, terms, args.header):
+            n_injected += 1
+            n_terms_total += len(terms)
+    prs.save(args.dst)
+    print(f"injected pronunciation on {n_injected} slides ({n_terms_total} term entries)")
+    print(f"OK: {args.dst}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/skills/present-paper/scripts/strip_notes_for_sharing.py b/skills/present-paper/scripts/strip_notes_for_sharing.py