feat(search-lit): add East Asian name + CollectiveName heuristics to parse_pubmed #35
Merged
…parse_pubmed
Adds two anti-hallucination heuristics to
skills/search-lit/references/parse_pubmed.py:
1. East Asian name reverse encoding (LastName / ForeName swapped)
PubMed XML occasionally encodes East Asian author names with the
given name in <LastName> and the family name in <ForeName>. Naive
parsers then emit BibTeX entries with the wrong first-author surname,
which the downstream /verify-refs first-author cross-check flags as a
mismatch. The parser now detects this pattern (LastName ≥3 alpha
chars + ForeName 1-2 alpha chars with no period) and prints a
"% [VERIFY] East Asian name order suspected" comment above the
BibTeX entry plus an inline ⚠ note in efetch markdown output. The
author order is preserved verbatim — the script never silently
swaps fields it isn't certain about.
2. CollectiveName (corporate / consortium guideline) handling
AuthorList elements may contain <CollectiveName> instead of
LastName / ForeName (KDIGO, AHA/ACC, WHO guideline patterns).
Previously these authors were silently dropped, leaving the BibTeX
entry with an empty author field. The parser now:
- Emits the corporate name as {{Group Name}} (double-brace) so
BibTeX styles do not try to split on commas/spaces.
- Switches the BibTeX entry type from @article to @misc when the
AuthorList contains only CollectiveName entries (matches the
/manuscript-references corporate-author convention).
- Includes the corporate name in the cite-key surname slot.
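The detection rule from item 1 can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the function names are hypothetical, "alpha chars" is assumed to mean str.isalpha(), and the example names are invented.

```python
from typing import Optional


def looks_east_asian_reversed(last: str, fore: str) -> bool:
    """Return True when a LastName/ForeName pair matches the suspicious pattern.

    Assumed reading of the rule: LastName is at least 3 alphabetic characters,
    ForeName is 1-2 alphabetic characters and contains no period.
    """
    last, fore = last.strip(), fore.strip()
    long_last = len(last) >= 3 and last.isalpha()
    # isalpha() is False for "J." or "J A", so initials never match.
    short_fore = 1 <= len(fore) <= 2 and fore.isalpha()
    return long_last and short_fore


def verify_comment(last: str, fore: str) -> Optional[str]:
    """Return the BibTeX comment to print above the entry, or None.

    The name fields themselves are never modified here: the heuristic only
    flags the pair for a human to check.
    """
    if looks_east_asian_reversed(last, fore):
        return "% [VERIFY] East Asian name order suspected"
    return None


# Invented example: a long "LastName" paired with a two-letter "ForeName".
print(looks_east_asian_reversed("Xiaoming", "Li"))
print(verify_comment("Smith", "John A"))
```

Because the rule is purely length-based, it can also fire on short Western given names, which is one reason to flag rather than swap.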
Both heuristics share an _extract_authors() helper that returns
bib_authors, display_authors, first_author_last, suspicions, and a
has_collective_only flag, used by both parse_efetch and generate_bibtex.
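A rough sketch of what a helper with this return shape might look like, using xml.etree.ElementTree. Everything beyond the five returned names is an assumption: the function name extract_authors_sketch, the "Last, Fore" BibTeX form, the display form, and the choice to store each corporate name with one pair of braces so that wrapping the author field adds the second pair.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Optional


class Authors(NamedTuple):
    bib_authors: List[str]
    display_authors: List[str]
    first_author_last: Optional[str]
    suspicions: List[str]
    has_collective_only: bool


def _suspicious(last: str, fore: str) -> bool:
    # Same assumed rule as in the commit text: long LastName, 1-2 letter ForeName.
    return len(last) >= 3 and last.isalpha() and 1 <= len(fore) <= 2 and fore.isalpha()


def extract_authors_sketch(author_list_el: ET.Element) -> Authors:
    bib, display, suspicions = [], [], []
    first_last = None
    personal = collective = 0
    for author in author_list_el.findall("Author"):
        group = (author.findtext("CollectiveName") or "").strip()
        last = (author.findtext("LastName") or "").strip()
        fore = (author.findtext("ForeName") or "").strip()
        if group:
            collective += 1
            bib.append("{" + group + "}")  # braces stop BibTeX from splitting the name
            display.append(group)
            surname = group
        elif last:
            personal += 1
            bib.append(f"{last}, {fore}" if fore else last)  # order kept verbatim
            display.append(f"{last} {fore}".strip())
            surname = last
            if _suspicious(last, fore):
                suspicions.append(f"East Asian name order suspected: {last} / {fore}")
        else:
            continue  # neither a personal nor a collective name
        if first_last is None:
            first_last = surname
    return Authors(bib, display, first_last, suspicions,
                   collective > 0 and personal == 0)


# Synthetic XML; the personal name is invented for illustration.
group_only = ET.fromstring(
    "<AuthorList><Author><CollectiveName>KDIGO Working Group"
    "</CollectiveName></Author></AuthorList>")
reversed_case = ET.fromstring(
    "<AuthorList><Author><LastName>Xiaoming</LastName>"
    "<ForeName>Li</ForeName></Author></AuthorList>")
print(extract_authors_sketch(group_only))
print(extract_authors_sketch(reversed_case))
```

Returning one structure to both callers keeps the markdown and BibTeX outputs from disagreeing about which authors were flagged.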
Smoke-tested against synthetic XML with the Fu 2024 reverse-encoded
case and a KDIGO Working Group CollectiveName case. Both produce
correct output with the appropriate ⚠ / % [VERIFY] notes.
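For illustration, the BibTeX side of that output might be rendered roughly as below. This is a sketch under assumptions, not the merged generate_bibtex: the function name, the field set, the cite keys, and the titles are all hypothetical.

```python
from typing import Sequence


def render_bibtex_sketch(cite_key: str,
                         bib_authors: Sequence[str],
                         title: str,
                         year: str,
                         suspicions: Sequence[str] = (),
                         has_collective_only: bool = False) -> str:
    """Render one entry, with % [VERIFY] comments above it when flagged."""
    # Corporate-only author lists switch the entry type to @misc.
    entry_type = "misc" if has_collective_only else "article"
    author_field = " and ".join(bib_authors)
    lines = ["% [VERIFY] " + note for note in suspicions]
    lines.append("@" + entry_type + "{" + cite_key + ",")
    # A corporate author stored as "{Name}" becomes "{{Name}}" once wrapped.
    lines.append("  author = {" + author_field + "},")
    lines.append("  title = {" + title + "},")
    lines.append("  year = {" + year + "}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical inputs mirroring the two smoke-test cases.
print(render_bibtex_sketch("kdigo2024", ["{KDIGO Working Group}"],
                           "Example guideline", "2024",
                           has_collective_only=True))
print(render_bibtex_sketch("li2024", ["Xiaoming, Li"], "Example article", "2024",
                           suspicions=["East Asian name order suspected"]))
```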
PR T2-B: parse_pubmed East Asian + CollectiveName handling
Summary
Adds two anti-hallucination heuristics to
skills/search-lit/references/parse_pubmed.py:
1. East Asian name reverse encoding: detects a suspected <LastName> / <ForeName> swap (LastName ≥3 alpha chars + ForeName 1-2 alpha chars), emits a % [VERIFY] comment above the BibTeX entry and a ⚠ note in efetch markdown. Author order is preserved verbatim; the script never silently swaps.
2. CollectiveName handling: emits a {{Group Name}} double-brace author and switches the BibTeX entry type from @article to @misc when the AuthorList contains only <CollectiveName> entries.
Motivation
Cross-project observations:
- PubMed XML encoded an East Asian author name with the given name in LastName and the family name in ForeName. Naive parsers produced a BibTeX entry with the wrong first-author surname; the downstream /verify-refs first-author cross-check then flagged a mismatch against the authoritative PubMed efetch source. The pre-detection heuristic in this PR raises a % [VERIFY] flag at bib-generation time so the user is aware before the downstream audit catches it.
- Guideline records can list a corporate author, e.g. <CollectiveName>KDIGO Working Group</CollectiveName>. Previously these authors were silently dropped, leaving the BibTeX entry with an empty author field. The new handling emits the corporate name as a double-braced author and switches to @misc, matching the manuscript-references corporate-author convention.
Both heuristics align with the v2.10 INTAKE Phase 4 deferred backlog
items from the East Asian name / corporate author E2E feedback cycle.
Changes
skills/search-lit/references/parse_pubmed.py:
- _looks_east_asian_reversed(last, fore) helper
- _extract_authors(author_list_el) helper (returns bib / display authors, first author surname, suspicions list, and a has_collective_only flag)
- parse_efetch uses the helper and prints ⚠ notes for each suspicion
- generate_bibtex uses the helper, prepends % [VERIFY] comments, and switches entry type to @misc when corporate-only
Test plan
- validate_skills.sh: ALL CHECKS PASSED
- validate_skill_contracts.py: 0 failures
- Smoke test against synthetic XML with the Fu 2024 reverse-encoded case + KDIGO Working Group CollectiveName case. Both produce correct output with appropriate ⚠ / % [VERIFY] notes.
🤖 Generated with Claude Code