Commit c41b70c

mkassaf and claude committed
v0.3.2: PyMuPDF extractor, DBLP always-on for low-scoring refs
- Switch the primary PDF text extractor to PyMuPDF (fitz); this fixes multi-column layout garbling (Cinkusz DOI now found, Kostka/Tran titles correct)
- Fall back to pdfminer when PyMuPDF is not installed
- Add pymupdf as a core dependency
- Fix DBLP trigger: query domain sources whenever the best candidate score is below the pass threshold, not only when the candidates list is empty; Leviathan (ICML) and ReAct (ICLR) are now reliably found via DBLP

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent ddcc096 commit c41b70c

7 files changed

Lines changed: 36 additions & 5 deletions

citesentry/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 """citesentry — citation verification tool."""
 
-__version__ = "0.3.1"
+__version__ = "0.3.2"
0 Bytes
Binary file not shown.
366 Bytes
Binary file not shown.

citesentry/checks/existence.py

Lines changed: 4 additions & 1 deletion
@@ -163,7 +163,10 @@ async def check_existence(
         except Exception as e:
             evidence[f"{src.name}_error"] = str(e)
 
-    if not candidates and domain_sources:
+    # Query domain sources when no good candidate found yet — this handles papers
+    # (e.g. ICML/ICLR proceedings) that Semantic Scholar misses but DBLP covers well.
+    best_score_so_far = max((c[0] for c in candidates), default=0.0)
+    if (not candidates or best_score_so_far < _TITLE_PASS_THRESHOLD / 100.0) and domain_sources:
         for src in domain_sources:
             try:
                 if effective_doi:
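
The new trigger condition in the hunk above can be isolated into a small predicate. The following is a minimal sketch, not the project's code: the function name `should_query_domain_sources`, the `(score, title)` candidate shape, and the threshold value `85` are assumptions for illustration. The diff only shows that candidates are indexed with `c[0]` for the score and that `_TITLE_PASS_THRESHOLD` is divided by 100.

```python
from __future__ import annotations

from typing import Sequence

# Hypothetical value: the diff shows only that the real constant is a
# percentage that gets divided by 100 before comparison.
TITLE_PASS_THRESHOLD = 85


def should_query_domain_sources(
    candidates: Sequence[tuple[float, str]],
    domain_sources: Sequence[str],
    threshold_percent: float = TITLE_PASS_THRESHOLD,
) -> bool:
    """Return True when domain sources (e.g. DBLP) should be consulted."""
    # With default=0.0 an empty candidate list yields a best score of 0.0.
    best_score = max((score for score, _ in candidates), default=0.0)
    return bool(domain_sources) and best_score < threshold_percent / 100.0


if __name__ == "__main__":
    print(should_query_domain_sources([], ["dblp"]))
    print(should_query_domain_sources([(0.40, "weak match")], ["dblp"]))
    print(should_query_domain_sources([(0.95, "strong match")], ["dblp"]))
```

Because `max(..., default=0.0)` already returns `0.0` for an empty list, the explicit `not candidates` clause in the committed condition is redundant whenever the threshold is positive; it is harmless and arguably documents the intent.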

citesentry/parse/pdf_refs.py

Lines changed: 10 additions & 1 deletion
@@ -17,11 +17,20 @@
 
 
 def _extract_text(path: Path) -> str:
+    # PyMuPDF handles multi-column layouts and line order far better than pdfminer
+    try:
+        import fitz  # PyMuPDF
+        doc = fitz.open(str(path))
+        text = "\n".join(page.get_text() for page in doc)
+        doc.close()
+        return text
+    except ImportError:
+        pass
     try:
         from pdfminer.high_level import extract_text
         return extract_text(str(path))
     except ImportError as e:
-        raise ImportError("pdfminer.six is required: pip install pdfminer.six") from e
+        raise ImportError("Install pymupdf or pdfminer.six: pip install pymupdf") from e
 
 
 def _find_ref_section(text: str) -> str | None:
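
The try-one-backend-then-the-next pattern in `_extract_text` can also be written as a loop over extractor callables, which makes the fallback order explicit and testable without either library installed. This is a sketch under assumptions, not the committed implementation: `extract_with_pymupdf`, `extract_with_pdfminer`, and `extract_text_with_fallback` are hypothetical names. The PyMuPDF variant uses the document as a context manager so that it is closed even if `get_text()` raises, which the committed version does not guarantee.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Sequence

Extractor = Callable[[Path], str]


def extract_with_pymupdf(path: Path) -> str:
    # Lazy import: a missing install surfaces as ImportError at call time.
    import fitz  # PyMuPDF

    with fitz.open(str(path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdfminer(path: Path) -> str:
    from pdfminer.high_level import extract_text

    return extract_text(str(path))


def extract_text_with_fallback(
    path: Path,
    extractors: Sequence[Extractor] = (extract_with_pymupdf, extract_with_pdfminer),
) -> str:
    """Try each extractor in order, skipping those whose library is missing."""
    last_error: ImportError | None = None
    for extractor in extractors:
        try:
            return extractor(path)
        except ImportError as e:
            last_error = e
    raise ImportError(
        "Install pymupdf or pdfminer.six: pip install pymupdf"
    ) from last_error
```

Only `ImportError` is treated as a reason to fall back, matching the committed behavior: a PDF that PyMuPDF fails to parse still raises rather than silently switching to pdfminer.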

pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "citesentry"
-version = "0.3.1"
+version = "0.3.2"
 description = "Citation verification tool: existence, URL liveness, and content relevance checks"
 readme = "README.md"
 requires-python = ">=3.10"
@@ -20,6 +20,7 @@ dependencies = [
     "pdfminer.six>=20221105",
     "mcp[cli]>=1.0",
     "platformdirs>=4",
+    "pymupdf>=1.27.2.3",
 ]
 
 [project.optional-dependencies]

uv.lock

Lines changed: 19 additions & 1 deletion
Some generated files are not rendered by default.