feat(search-lit): add East Asian name + CollectiveName heuristics to parse_pubmed by Yoojin-nam · Pull Request #35 · Aperivue/medsci-skills

Yoojin-nam · 2026-05-23T07:12:27Z

PR T2-B: parse_pubmed East Asian + CollectiveName handling

Summary

Adds two anti-hallucination heuristics to
skills/search-lit/references/parse_pubmed.py:

1. East Asian name reverse encoding: detects a <LastName> /
   <ForeName> swap (LastName ≥3 alpha chars + ForeName 1-2 alpha
   chars), emits a % [VERIFY] comment above the BibTeX entry and a ⚠
   note in efetch markdown. Author order is preserved verbatim; the
   script never silently swaps.
2. CollectiveName (corporate / consortium guideline) handling:
   emits a {{Group Name}} double-brace author and switches the BibTeX
   entry type from @article to @misc when AuthorList contains only
   <CollectiveName> entries.
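
As a rough illustration of the first heuristic, the check could be sketched as below. This is not the code from parse_pubmed.py: the function body is an assumption built only from the stated rule (LastName ≥3 alpha chars, ForeName 1-2 alpha chars, no period), and the author names in the comments are invented placeholders.

```python
def looks_east_asian_reversed(last: str, fore: str) -> bool:
    """Return True when a LastName/ForeName pair matches the suspected swap pattern.

    Assumed rule: LastName has 3+ alphabetic characters and ForeName has
    1-2 alphabetic characters. A True result only means "flag for review".
    """
    last = last.strip()
    fore = fore.strip()
    # LastName: at least 3 characters, all alphabetic.
    long_last = len(last) >= 3 and last.isalpha()
    # ForeName: 1-2 characters, all alphabetic. isalpha() also rejects
    # values containing a period or space, such as "J." or "J A".
    short_fore = 1 <= len(fore) <= 2 and fore.isalpha()
    return long_last and short_fore


# Placeholder names, not taken from any real record.
print(looks_east_asian_reversed("Xiaoming", "Fu"))  # given name sitting in LastName
print(looks_east_asian_reversed("Fu", "Xiaoming"))  # conventional encoding
```

Because the rule looks only at lengths and character classes, it can also fire on a Western author whose ForeName holds bare initials such as "JA". That is why the result drives a review flag rather than an automatic swap.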

Motivation

Cross-project observations:

- In a PubMed XML record cited as "Fu 2024", the first author's given
  name was encoded in LastName and the family name in ForeName. Naive
  parsers produced a BibTeX entry with the wrong first-author surname,
  and the downstream /verify-refs first-author cross-check then flagged
  a mismatch against the authoritative PubMed efetch source. The
  pre-detection heuristic in this PR raises a % [VERIFY] flag at
  bib-generation time, so the user is aware of the problem before the
  downstream audit catches it.
- KDIGO 2024 CKD guideline citations: the AuthorList contains
  <CollectiveName>KDIGO Working Group</CollectiveName>. Previously
  these authors were silently dropped, leaving the BibTeX entry with
  an empty author field. The new handling emits the corporate name as
  a double-braced author and switches to @misc, matching the
  manuscript-references corporate-author convention.
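
For illustration, the entry-type switch and the % [VERIFY] prefix could look roughly like the sketch below. It is not the PR's generate_bibtex: render_bibtex, its parameters, the cite keys, and the field values are all hypothetical.

```python
def render_bibtex(key, authors, fields, collective_only=False, suspicions=()):
    """Render one BibTeX entry as a string (illustrative sketch only).

    authors: names already formatted for BibTeX; a corporate name is passed
             wrapped in one pair of braces, e.g. "{KDIGO Working Group}".
    suspicions: review notes emitted as % [VERIFY] comments above the entry.
    """
    # Comment lines go first so they sit directly above the entry.
    lines = [f"% [VERIFY] {note}" for note in suspicions]
    # Corporate-only author lists use @misc instead of @article.
    entry_type = "misc" if collective_only else "article"
    lines.append("@" + entry_type + "{" + key + ",")
    # The field's own delimiters add the outer brace pair.
    lines.append("  author = {" + " and ".join(authors) + "},")
    for name, value in fields.items():
        lines.append("  " + name + " = {" + value + "},")
    lines.append("}")
    return "\n".join(lines)


guideline = render_bibtex(
    "kdigo2024",                      # hypothetical cite key
    ["{KDIGO Working Group}"],
    {"year": "2024"},
    collective_only=True,
)
print(guideline)
```

In the rendered field, author = {{KDIGO Working Group}}, the outer braces delimit the field value and the inner braces make BibTeX treat the group name as a single unit instead of parsing it as given name and surname parts.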

Both heuristics align with the v2.10 INTAKE Phase 4 deferred backlog
items from the East Asian name / corporate author E2E feedback cycle.

Changes

skills/search-lit/references/parse_pubmed.py:
- new _looks_east_asian_reversed(last, fore) helper
- new _extract_authors(author_list_el) helper (returns bib /
  display authors, first author surname, suspicions list, and a
  has_collective_only flag)
- parse_efetch uses the helper and prints ⚠ notes for each suspicion
- generate_bibtex uses the helper, prepends % [VERIFY] comments,
  and switches entry type to @misc when corporate-only
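
A minimal sketch of what a shared author-extraction helper might look like, assuming the standard PubMed layout of <Author> children under <AuthorList>. The result field names follow the ones listed in this PR; everything else (the AuthorInfo tuple, the function body, the suspicion wording, the inlined swap check) is an assumption and may differ from the merged code.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class AuthorInfo(NamedTuple):
    bib_authors: list          # names formatted for the BibTeX author field
    display_authors: list      # names formatted for markdown output
    first_author_last: str     # surname (or group name) of the first author
    suspicions: list           # human-readable notes for % [VERIFY] / ⚠ output
    has_collective_only: bool  # True when every author is a CollectiveName


def extract_authors(author_list_el):
    bib, display, suspicions = [], [], []
    first_last = ""
    personal = collective = 0
    for author in author_list_el.findall("Author"):
        group = (author.findtext("CollectiveName") or "").strip()
        if group:
            collective += 1
            bib.append("{" + group + "}")  # brace-protect the group name
            display.append(group)
            surname = group
        else:
            last = (author.findtext("LastName") or "").strip()
            fore = (author.findtext("ForeName") or "").strip()
            if not last:
                continue  # nothing usable in this <Author> element
            personal += 1
            # Fields are kept in their original order, never swapped.
            bib.append(f"{last}, {fore}" if fore else last)
            display.append(f"{last} {fore}".strip())
            # Swap check inlined so this block runs on its own.
            if (len(last) >= 3 and last.isalpha()
                    and 1 <= len(fore) <= 2 and fore.isalpha()):
                suspicions.append(
                    f"East Asian name order suspected: {last} / {fore}")
            surname = last
        if not first_last:
            first_last = surname
    collective_only = collective > 0 and personal == 0
    return AuthorInfo(bib, display, first_last, suspicions, collective_only)


SYNTHETIC = """<AuthorList>
  <Author><CollectiveName>KDIGO Working Group</CollectiveName></Author>
</AuthorList>"""
info = extract_authors(ET.fromstring(SYNTHETIC))
print(info.bib_authors, info.has_collective_only)
```

Returning one structure lets both the markdown path and the BibTeX path read the same suspicions list, so the ⚠ notes and the % [VERIFY] comments cannot drift apart.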

Test plan

- validate_skills.sh: ALL CHECKS PASSED
- validate_skill_contracts.py: 0 failures
- Smoke test against synthetic XML containing the Fu 2024 reverse case
  and the KDIGO Working Group CollectiveName case. Both produce correct
  output with the appropriate ⚠ / % [VERIFY] notes.

🤖 Generated with Claude Code

…parse_pubmed

Adds two anti-hallucination heuristics to
skills/search-lit/references/parse_pubmed.py:

1. East Asian name reverse encoding (LastName / ForeName swapped)

PubMed XML occasionally encodes East Asian author names with the given
name in <LastName> and the family name in <ForeName>. Naive parsers then
emit BibTeX entries with the wrong first-author surname, which the
downstream /verify-refs first-author cross-check flags as a mismatch.

The parser now detects this pattern (LastName ≥3 alpha chars + ForeName
1-2 alpha chars with no period) and prints a
"% [VERIFY] East Asian name order suspected" comment above the BibTeX
entry plus an inline ⚠ note in efetch markdown output. The author order
is preserved verbatim: the script never silently swaps fields it isn't
certain about.

2. CollectiveName (corporate / consortium guideline) handling

AuthorList elements may contain <CollectiveName> instead of LastName /
ForeName (KDIGO, AHA/ACC, WHO guideline patterns). Previously these
authors were silently dropped, leaving the BibTeX entry with an empty
author field. The parser now:

- Emits the corporate name as {{Group Name}} (double-brace) so BibTeX
  styles do not try to split on commas/spaces.
- Switches the BibTeX entry type from @article to @misc when the
  AuthorList contains only CollectiveName entries (matches the
  /manuscript-references corporate-author convention).
- Includes the corporate name in the cite-key surname slot.

Both heuristics share an _extract_authors() helper that returns
bib_authors, display_authors, first_author_last, suspicions, and a
has_collective_only flag, used by both parse_efetch and generate_bibtex.

Smoke-tested against synthetic XML with the Fu 2024 reverse-encoded case
and a KDIGO Working Group CollectiveName case. Both produce correct
output with the appropriate ⚠ / % [VERIFY] notes.

Yoojin-nam marked this pull request as ready for review May 23, 2026 07:30

Yoojin-nam merged commit 63ae163 into main May 23, 2026
1 check passed

Yoojin-nam deleted the feat/tier2-PR-B-parse-pubmed-east-asian branch May 23, 2026 07:30
