
feat(search-lit): add East Asian name + CollectiveName heuristics to parse_pubmed #35

Merged
Yoojin-nam merged 1 commit into main from feat/tier2-PR-B-parse-pubmed-east-asian
May 23, 2026


PR T2-B: parse_pubmed East Asian + CollectiveName handling

Summary

Adds two anti-hallucination heuristics to
skills/search-lit/references/parse_pubmed.py:

  1. East Asian name reverse encoding — detects <LastName> /
    <ForeName> swap (LastName ≥3 alpha chars + ForeName 1-2 alpha
    chars), emits % [VERIFY] comment above the BibTeX entry and a ⚠
    note in efetch markdown. Author order preserved verbatim; the
    script never silently swaps.
  2. CollectiveName (corporate / consortium guideline) handling:
    emits a {{Group Name}} double-brace author and switches the BibTeX
    entry type from @article to @misc when the AuthorList contains only
    <CollectiveName> entries.
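
A minimal sketch of the first heuristic, assuming the rule is "at least 3 alphabetic characters in LastName, and a ForeName of 1-2 letters with no period". The function names and the exact comment wording are illustrative and not taken from parse_pubmed.py; the name pair in the usage note is synthetic.

```python
from typing import Optional


def looks_east_asian_reversed(last: str, fore: str) -> bool:
    """Return True when LastName/ForeName look swapped (heuristic only)."""
    last = last.strip()
    fore = fore.strip()
    # Count alphabetic characters so hyphenated given names still qualify.
    last_alpha = sum(ch.isalpha() for ch in last)
    # A 1-2 letter ForeName with no period, space, or hyphen.
    fore_is_short_word = fore.isalpha() and len(fore) <= 2
    return last_alpha >= 3 and fore_is_short_word


def verify_comment(last: str, fore: str) -> Optional[str]:
    """Build a BibTeX comment line for a suspicious pair, else None.

    The fields are reported verbatim; nothing is swapped.
    """
    if not looks_east_asian_reversed(last, fore):
        return None
    return (
        "% [VERIFY] East Asian name order suspected: "
        f"LastName={last!r}, ForeName={fore!r}"
    )
```

For example, `verify_comment("Jiawei", "Fu")` yields a comment line, while `verify_comment("Smith", "John A")` yields `None`. A pair such as `("Smith", "JA")` would also be flagged under this sketch, which is one reason to flag for review rather than swap automatically.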

Motivation

Cross-project observations:

  • A first-author "Fu 2024" PubMed XML entry encoded the given name in
    LastName and the family name in ForeName. Naive parsers produced a
    BibTeX entry with the wrong first-author surname; downstream
    /verify-refs first-author cross-check then flagged a mismatch
    against the authoritative PubMed efetch source. The pre-detection
    heuristic in this PR raises a % [VERIFY] flag at bib-generation
    time so the user is aware before the downstream audit catches it.
  • KDIGO 2024 CKD guideline citations: the AuthorList contains
    <CollectiveName>KDIGO Working Group</CollectiveName>. Previously
    these authors were silently dropped, leaving the BibTeX entry with
    an empty author field. The new handling emits the corporate name as
    a double-braced author and switches to @misc, matching the
    manuscript-references corporate-author convention.

Both heuristics align with the v2.10 INTAKE Phase 4 deferred backlog
items from the East Asian name / corporate author E2E feedback cycle.

Changes

  • skills/search-lit/references/parse_pubmed.py:
    • new _looks_east_asian_reversed(last, fore) helper
    • new _extract_authors(author_list_el) helper (returns bib /
      display authors, first author surname, suspicions list, and a
      has_collective_only flag)
    • parse_efetch uses the helper and prints ⚠ notes for each suspicion
    • generate_bibtex uses the helper, prepends % [VERIFY] comments,
      and switches entry type to @misc when corporate-only

Test plan

  • validate_skills.sh ALL CHECKS PASSED
  • validate_skill_contracts.py 0 failures
  • Smoke test against synthetic XML containing Fu 2024 reverse case
    + KDIGO Working Group CollectiveName case. Both produce correct
    output with appropriate ⚠ / % [VERIFY] notes.
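
For reference, a synthetic fixture of the kind the smoke test describes might look like the sketch below. The PMIDs, the given name "Jiawei", and the `summarize_authors` helper are placeholders invented for illustration; only the `<AuthorList>` / `<Author>` / `<LastName>` / `<ForeName>` / `<CollectiveName>` element names come from PubMed's XML format.

```python
import xml.etree.ElementTree as ET

# Synthetic efetch-style XML: one reverse-encoded personal author,
# one corporate-only author list. All values are placeholders.
SYNTHETIC_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>Jiawei</LastName><ForeName>Fu</ForeName></Author>
          <Author><LastName>Smith</LastName><ForeName>John A</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000002</PMID>
      <Article>
        <AuthorList>
          <Author><CollectiveName>KDIGO Working Group</CollectiveName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def summarize_authors(xml_text: str) -> list:
    """Return, per article, a list of (LastName, ForeName, CollectiveName)."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for article in root.iter("PubmedArticle"):
        authors = article.findall(".//AuthorList/Author")
        rows.append([
            (a.findtext("LastName"), a.findtext("ForeName"),
             a.findtext("CollectiveName"))
            for a in authors
        ])
    return rows
```

Missing child elements come back as `None` from `findtext`, so the corporate-only article is easy to tell apart from the personal-author one.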

🤖 Generated with Claude Code

…parse_pubmed

Adds two anti-hallucination heuristics to
skills/search-lit/references/parse_pubmed.py:

1. East Asian name reverse encoding (LastName / ForeName swapped)
   PubMed XML occasionally encodes East Asian author names with the
   given name in <LastName> and the family name in <ForeName>. Naive
   parsers then emit BibTeX entries with the wrong first-author surname,
   which downstream /verify-refs first-author cross-check flags as a
   mismatch. The parser now detects this pattern (LastName ≥3 alpha
   chars + ForeName 1-2 alpha chars with no period) and prints a
   "% [VERIFY] East Asian name order suspected" comment above the
   BibTeX entry plus an inline ⚠ note in efetch markdown output. The
   author order is preserved verbatim — the script never silently
   swaps fields it isn't certain about.

2. CollectiveName (corporate / consortium guideline) handling
   AuthorList elements may contain <CollectiveName> instead of
   LastName / ForeName (KDIGO, AHA/ACC, WHO guideline patterns).
   Previously these authors were silently dropped, leaving the BibTeX
   entry with an empty author field. The parser now:
   - Emits the corporate name as {{Group Name}} (double-brace) so
     BibTeX styles do not try to split on commas/spaces.
   - Switches the BibTeX entry type from @article to @misc when the
     AuthorList contains only CollectiveName entries (matches the
     /manuscript-references corporate-author convention).
   - Includes the corporate name in the cite-key surname slot.
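
The three points above can be sketched as follows. This is an illustration under assumptions, not the code in parse_pubmed.py: the `render_entry` function, its parameters, and the cite-key format (name stripped to alphanumerics plus year) are all hypothetical.

```python
import re


def render_entry(personal, collective, year, title):
    """Render a minimal BibTeX entry from personal and collective authors.

    personal: list of "Last, Fore" strings; collective: list of group names.
    """
    # Extra braces keep BibTeX from splitting the group name into parts.
    braced = ["{" + name + "}" for name in collective]
    author_field = " and ".join(list(personal) + braced)

    # Corporate-only author lists switch from @article to @misc.
    corporate_only = bool(collective) and not personal
    entry_type = "misc" if corporate_only else "article"

    # Hypothetical cite-key format: first author "surname" slot + year.
    first = personal[0].split(",")[0] if personal else collective[0]
    cite_key = re.sub(r"[^A-Za-z0-9]", "", first) + str(year)

    return "\n".join([
        "@" + entry_type + "{" + cite_key + ",",
        "  author = {" + author_field + "},",
        "  title = {" + title + "},",
        "  year = {" + str(year) + "}",
        "}",
    ])
```

Calling `render_entry([], ["KDIGO Working Group"], 2024, "Synthetic title")` produces an `@misc` entry whose author field reads `{{KDIGO Working Group}}`; with any personal author present the entry type stays `@article`.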

Both heuristics share an _extract_authors() helper that returns
bib_authors, display_authors, first_author_last, suspicions, and a
has_collective_only flag, used by both parse_efetch and generate_bibtex.
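
A rough sketch of what such a shared helper could look like, assuming an `xml.etree.ElementTree` element as input. The `AuthorInfo` named tuple, the name formatting, and the suspicion message are assumptions made for illustration; only the five returned field names come from the description above.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class AuthorInfo(NamedTuple):
    bib_authors: list
    display_authors: list
    first_author_last: Optional[str]
    suspicions: list
    has_collective_only: bool


def extract_authors(author_list_el) -> AuthorInfo:
    """Walk an <AuthorList> element once and collect everything callers need."""
    bib, display, suspicions = [], [], []
    first_last = None
    n_personal = n_collective = 0
    if author_list_el is None:
        return AuthorInfo(bib, display, first_last, suspicions, False)

    for author in author_list_el.findall("Author"):
        collective = (author.findtext("CollectiveName") or "").strip()
        last = (author.findtext("LastName") or "").strip()
        fore = (author.findtext("ForeName") or "").strip()
        if collective:
            n_collective += 1
            bib.append("{" + collective + "}")  # extra brace pair for BibTeX
            display.append(collective)
            surname = collective
        elif last:
            n_personal += 1
            bib.append(f"{last}, {fore}" if fore else last)
            display.append(f"{last} {fore}".strip())
            surname = last
            # Heuristic: long LastName plus 1-2 letter ForeName, no period.
            if (sum(c.isalpha() for c in last) >= 3
                    and fore.isalpha() and len(fore) <= 2):
                suspicions.append(
                    f"East Asian name order suspected: {last} / {fore}")
        else:
            continue  # neither a personal nor a collective name
        if first_last is None:
            first_last = surname

    collective_only = n_collective > 0 and n_personal == 0
    return AuthorInfo(bib, display, first_last, suspicions, collective_only)
```

Returning one record lets both the markdown path and the BibTeX path read the same suspicions list, so the two outputs cannot disagree about which authors were flagged.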

Smoke-tested against synthetic XML with the Fu 2024 reverse-encoded
case and a KDIGO Working Group CollectiveName case. Both produce
correct output with the appropriate ⚠ / % [VERIFY] notes.

Yoojin-nam marked this pull request as ready for review May 23, 2026 07:30
Yoojin-nam merged commit 63ae163 into main May 23, 2026
1 check passed
Yoojin-nam deleted the feat/tier2-PR-B-parse-pubmed-east-asian branch May 23, 2026 07:30