Commit 4505c8c
Integrate venue-acronyms-2025 pipeline: acronym-level lookup with ISSN support (#1025)
* Remove obsolete acronym/abbreviation handling, adapt import for venue-acronyms-2025
- CLI: remove add, add-bibtex, list, export, stats, abbreviation-stats,
  list-ambiguous commands from acronym group; keep only status, import, clear
- CLI: simplify import to support venue-acronyms-2025 consensus format
  (original_name, confidence_score, null-entry filtering, auto-detect source)
- CLI: remove --include-abbreviations flag from clear command
- acronym_cache.py: remove obsolete methods:
  export_all_variants, export_all_abbreviations, import_abbreviations,
  list_all_acronyms, clear_learned_abbreviations, check_acronym_conflict,
  list_ambiguous_acronyms, bulk_store_acronyms, increment_usage_count,
  get_learned_abbreviations, store_learned_abbreviation,
  get_abbreviation_confidence
- acronym_cache.py: add import_variants with venue-acronyms-2025 support,
  _post_import_update, _detect_and_mark_ambiguous (Jaccard < 0.3),
  confidence_score x 100 as usage_count proxy
- normalizer.py: disable abbreviation expansion (LLM handles it); add
  _strip_parenthetical_acronym to clean trailing (ACRONYM) from names
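
The Jaccard threshold and the trailing-acronym cleanup named in the list above can be illustrated with a minimal sketch. This is not the project's code: the word-set tokenisation, the function names without leading underscores, and the regex are assumptions chosen for illustration only.

```python
import re
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets of two names (assumed tokenisation)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty names are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_ambiguous(names: list[str], threshold: float = 0.3) -> bool:
    """True if any pair of expansions for one acronym is less similar than the threshold."""
    return any(jaccard(a, b) < threshold for a, b in combinations(names, 2))


# Hypothetical pattern: a trailing "(ACRONYM)" of 2 to 15 characters that
# starts with an uppercase letter, e.g. "(ICML)" or "(NeurIPS)".
_TRAILING_ACRONYM = re.compile(r"\s*\([A-Z][A-Za-z0-9&-]{1,14}\)\s*$")


def strip_parenthetical_acronym(name: str) -> str:
    """Remove a trailing parenthetical acronym, leaving other parentheses alone."""
    return _TRAILING_ACRONYM.sub("", name).strip()
```

Under these assumptions, two expansions that share no words score 0.0 and are flagged as ambiguous, while near-identical names stay well above 0.3.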
* Fix mypy errors: remove dead abbreviation code in bibtex_parser and normalizer
Remove extract_acronyms_from_entries() from BibtexParser — this method was the
only caller of the now-deleted get_learned_abbreviations() and
store_learned_abbreviation() on AcronymCache, and is itself not called from
any active code path.
Remove _expand_abbreviations() from InputNormalizer — the method was already
disabled at the call site and called the deleted get_learned_abbreviations().
* Apply linter formatting
* feat: add variant-based lookup to AcronymCache and extend dispatcher fallback [AI-assisted]
The venue_acronyms table stores all observed name forms (abbreviated and
expanded) in a JSON variants array, but there was no way to look up the
canonical name from an abbreviated variant such as
"acm trans. sens. networks".
Add AcronymCache.get_canonical_for_variant() that searches the variants
JSON array using SQLite json_each(), enabling reverse lookups from
abbreviated forms to their canonical names.
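
A rough sketch of such a json_each() reverse lookup follows. The column names (canonical, variants), the free-function signature, and the sample row are assumptions, not the project's actual schema or method.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed shape of the JSON-era table: one row per acronym, variants as a JSON array.
conn.execute(
    "CREATE TABLE venue_acronyms ("
    "acronym TEXT, entity_type TEXT, canonical TEXT, variants TEXT)"
)
conn.execute(
    "INSERT INTO venue_acronyms VALUES (?, ?, ?, ?)",
    (
        "TOSN",
        "journal",
        "ACM Transactions on Sensor Networks",
        json.dumps(["acm trans. sens. networks", "acm transactions on sensor networks"]),
    ),
)


def get_canonical_for_variant(conn: sqlite3.Connection, variant: str):
    """Return the canonical name whose variants array contains `variant`, else None."""
    row = conn.execute(
        """
        SELECT va.canonical
        FROM venue_acronyms AS va, json_each(va.variants) AS je
        WHERE lower(je.value) = lower(?)
        LIMIT 1
        """,
        (variant,),
    ).fetchone()
    return row[0] if row else None
```

Because json_each() expands the array for every row, this form scans the whole table on each lookup.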
Extend QueryDispatcher._try_acronym_fallback() with a second lookup path:
after the existing standalone-acronym check (Path 1), also try
get_canonical_for_variant() for abbreviated multi-word forms (Path 2).
Both paths feed into the same retry-with-expanded-name flow.
Also update the CLI import docstring to reflect the v2.0 pipeline format
and add venue_acronyms table documentation to NORMALIZED_DATABASE_DESIGN.md.
* refactor: normalize venue acronym storage — replace JSON columns with relational tables [AI-assisted]
Storing variants and ISSNs as JSON TEXT columns prevents direct SQL
queries (no index, no JOIN, full table scan with json_each).
Replace the single JSON-heavy venue_acronyms table with three normalized
tables:
- venue_acronyms: one row per (acronym, entity_type), no JSON columns
- venue_acronym_variants: one row per name variant (FK + COLLATE NOCASE index)
- venue_acronym_issns: one row per ISSN (FK + index)
ON DELETE CASCADE keeps child rows in sync when a parent is deleted.
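
One possible shape for the three tables, shown as a sketch: the table names come from this commit, while the column names, index names, and sample values (including the made-up ISSN) are assumptions. Note that SQLite only enforces ON DELETE CASCADE when foreign keys are switched on for the connection.

```python
import sqlite3

SCHEMA = """
CREATE TABLE venue_acronyms (
    id INTEGER PRIMARY KEY,
    acronym TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    canonical TEXT NOT NULL,
    UNIQUE (acronym, entity_type)
);
CREATE TABLE venue_acronym_variants (
    id INTEGER PRIMARY KEY,
    venue_acronym_id INTEGER NOT NULL
        REFERENCES venue_acronyms(id) ON DELETE CASCADE,
    variant TEXT NOT NULL COLLATE NOCASE
);
CREATE INDEX idx_variant ON venue_acronym_variants (variant COLLATE NOCASE);
CREATE TABLE venue_acronym_issns (
    id INTEGER PRIMARY KEY,
    venue_acronym_id INTEGER NOT NULL
        REFERENCES venue_acronyms(id) ON DELETE CASCADE,
    issn TEXT NOT NULL
);
CREATE INDEX idx_issn ON venue_acronym_issns (issn);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for the cascade to fire
conn.executescript(SCHEMA)


def child_counts(conn: sqlite3.Connection) -> tuple:
    """Return (variant rows, ISSN rows)."""
    variants = conn.execute("SELECT COUNT(*) FROM venue_acronym_variants").fetchone()[0]
    issns = conn.execute("SELECT COUNT(*) FROM venue_acronym_issns").fetchone()[0]
    return variants, issns


conn.execute(
    "INSERT INTO venue_acronyms (id, acronym, entity_type, canonical) "
    "VALUES (1, 'TOSN', 'journal', 'ACM Transactions on Sensor Networks')"
)
conn.executemany(
    "INSERT INTO venue_acronym_variants (venue_acronym_id, variant) VALUES (1, ?)",
    [("acm trans. sens. networks",), ("acm transactions on sensor networks",)],
)
conn.execute(
    "INSERT INTO venue_acronym_issns (venue_acronym_id, issn) VALUES (1, '1234-5678')"
)  # made-up ISSN for illustration

before = child_counts(conn)
conn.execute("DELETE FROM venue_acronyms WHERE id = 1")  # children are removed too
after = child_counts(conn)
```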
Migration: init_database() detects old JSON columns via PRAGMA table_info
and drops the cluster before re-creating it with the new schema.
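
The detection step might look roughly like the sketch below. The legacy column names and both helper names are hypothetical; only PRAGMA table_info and the three table names come from the commit text.

```python
import sqlite3

# Assumed names of the old JSON TEXT columns on venue_acronyms.
LEGACY_JSON_COLUMNS = {"variants", "issn"}
# Child tables first, then the parent.
CLUSTER = ("venue_acronym_variants", "venue_acronym_issns", "venue_acronyms")


def has_legacy_json_columns(conn: sqlite3.Connection) -> bool:
    """PRAGMA table_info yields one row per column; index 1 is the column name."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(venue_acronyms)")}
    return bool(columns & LEGACY_JSON_COLUMNS)


def drop_legacy_cluster(conn: sqlite3.Connection) -> bool:
    """Drop the whole table cluster if the old layout is found; report whether it was."""
    if not has_legacy_json_columns(conn):
        return False
    for table in CLUSTER:
        conn.execute(f"DROP TABLE IF EXISTS {table}")
    return True
```

For a table that does not exist, PRAGMA table_info returns no rows, so a fresh database is left untouched.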
All cache methods rewritten to use JOIN queries. get_canonical_for_variant
now uses a straight JOIN instead of json_each(). get_issns() added.
import_acronyms() upserts the parent row then does DELETE+INSERT for
child rows to ensure a full replace on re-import.
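
A minimal sketch of the upsert-then-replace import and the JOIN-based lookup, assuming a reduced two-table schema. Column names, the single-entry import_acronym helper, and the sample variants are illustrative and not taken from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE venue_acronyms (
        id INTEGER PRIMARY KEY, acronym TEXT NOT NULL, entity_type TEXT NOT NULL,
        canonical TEXT NOT NULL, UNIQUE (acronym, entity_type));
    CREATE TABLE venue_acronym_variants (
        venue_acronym_id INTEGER NOT NULL
            REFERENCES venue_acronyms(id) ON DELETE CASCADE,
        variant TEXT NOT NULL COLLATE NOCASE);
    """
)


def import_acronym(conn, acronym, entity_type, canonical, variants):
    """Upsert the parent row, then fully replace its variant rows."""
    with conn:  # one transaction per entry
        conn.execute(
            """
            INSERT INTO venue_acronyms (acronym, entity_type, canonical)
            VALUES (?, ?, ?)
            ON CONFLICT (acronym, entity_type)
            DO UPDATE SET canonical = excluded.canonical
            """,
            (acronym, entity_type, canonical),
        )
        (parent_id,) = conn.execute(
            "SELECT id FROM venue_acronyms WHERE acronym = ? AND entity_type = ?",
            (acronym, entity_type),
        ).fetchone()
        conn.execute(
            "DELETE FROM venue_acronym_variants WHERE venue_acronym_id = ?",
            (parent_id,),
        )
        conn.executemany(
            "INSERT INTO venue_acronym_variants (venue_acronym_id, variant) VALUES (?, ?)",
            [(parent_id, v) for v in variants],
        )


def get_canonical_for_variant(conn, variant):
    """Plain JOIN; the NOCASE collation on the column makes the match case-insensitive."""
    row = conn.execute(
        """
        SELECT va.canonical
        FROM venue_acronym_variants AS v
        JOIN venue_acronyms AS va ON va.id = v.venue_acronym_id
        WHERE v.variant = ?
        LIMIT 1
        """,
        (variant,),
    ).fetchone()
    return row[0] if row else None
```

Re-importing the same (acronym, entity_type) pair with a shorter variant list leaves no stale variants behind, which is the point of the DELETE+INSERT step.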
* feat: add ISSN-based canonical lookup and wire as dispatcher fallback path 3 [AI-assisted]
The normalized venue_acronym_issns table is now queryable but was not yet
used at runtime.
Add AcronymCache.get_canonical_for_issn(issn) — a JOIN against
venue_acronym_issns that returns the canonical name for a venue given any
of its known ISSNs. No entity_type filter needed since ISSNs are globally
unique across venue types.
Extend QueryDispatcher._try_acronym_fallback() with Path 3: if an ISSN is
present in query_input.identifiers (extracted from BibTeX or user input),
try get_canonical_for_issn() before giving up. This enables correct
expansion even when the journal name is heavily abbreviated or missing but
the ISSN is present.
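
The ordering of the three fallback paths can be sketched as below. This is an illustration of the control flow only: the function signature, the "no whitespace means standalone acronym" test, and passing ISSNs as a plain list are assumptions, and the three lookups are injected as callables rather than real AcronymCache methods.

```python
from typing import Callable, Iterable, Optional

Lookup = Callable[[str], Optional[str]]


def try_acronym_fallback(
    name: Optional[str],
    issns: Iterable[str],
    by_acronym: Lookup,
    by_variant: Lookup,
    by_issn: Lookup,
) -> Optional[str]:
    """Return an expanded venue name from the first path that hits, else None."""
    name = (name or "").strip()

    # Path 1: the input looks like a standalone acronym (assumed: a single token).
    if name and " " not in name:
        hit = by_acronym(name)
        if hit:
            return hit

    # Path 2: the input may be an abbreviated multi-word variant.
    if name:
        hit = by_variant(name)
        if hit:
            return hit

    # Path 3: any ISSN attached to the query.
    for issn in issns:
        hit = by_issn(issn)
        if hit:
            return hit

    return None
```

Whichever path hits, the caller would retry the original query with the returned name, so the paths only differ in how the canonical name is found.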
* feat: add get_full_stats() and improve acronym status output [AI-assisted]
The acronym status command showed only a single total count, giving no
insight into how the data is distributed or whether variants/ISSNs were
imported.
Add AcronymCache.get_full_stats() that returns total_acronyms,
total_variants, total_issns, and a per-entity_type breakdown using a
single LEFT JOIN query across all three normalized tables.
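
A sketch of how one LEFT JOIN query could produce these numbers, assuming hypothetical column names and sample rows. Joining both child tables multiplies rows per acronym, so the counts use COUNT(DISTINCT ...) to stay correct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE venue_acronyms (
        id INTEGER PRIMARY KEY, acronym TEXT, entity_type TEXT);
    CREATE TABLE venue_acronym_variants (
        id INTEGER PRIMARY KEY, venue_acronym_id INTEGER, variant TEXT);
    CREATE TABLE venue_acronym_issns (
        id INTEGER PRIMARY KEY, venue_acronym_id INTEGER, issn TEXT);
    INSERT INTO venue_acronyms VALUES (1, 'TOSN', 'journal'), (2, 'ICML', 'conference');
    INSERT INTO venue_acronym_variants (venue_acronym_id, variant) VALUES
        (1, 'acm trans. sens. networks'),
        (1, 'acm transactions on sensor networks'),
        (2, 'intl. conf. on machine learning');
    INSERT INTO venue_acronym_issns (venue_acronym_id, issn) VALUES (1, '1234-5678');
    """
)

STATS_QUERY = """
SELECT va.entity_type,
       COUNT(DISTINCT va.id) AS acronyms,
       COUNT(DISTINCT v.id)  AS variants,
       COUNT(DISTINCT i.id)  AS issns
FROM venue_acronyms AS va
LEFT JOIN venue_acronym_variants AS v ON v.venue_acronym_id = va.id
LEFT JOIN venue_acronym_issns    AS i ON i.venue_acronym_id = va.id
GROUP BY va.entity_type
ORDER BY va.entity_type
"""


def get_full_stats(conn: sqlite3.Connection) -> dict:
    """Totals plus a per-entity_type breakdown, computed from a single query."""
    by_type = {
        entity_type: {"acronyms": a, "variants": v, "issns": i}
        for entity_type, a, v, i in conn.execute(STATS_QUERY)
    }
    return {
        "total_acronyms": sum(t["acronyms"] for t in by_type.values()),
        "total_variants": sum(t["variants"] for t in by_type.values()),
        "total_issns": sum(t["issns"] for t in by_type.values()),
        "by_entity_type": by_type,
    }
```

The LEFT JOINs keep acronyms that have no variants or ISSNs in the result, where they contribute a count of zero.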
Update the 'acronym status' CLI command to display the full breakdown:
totals at the top, then a table by entity type with acronym, variant, and
ISSN counts. Update existing CLI tests to mock get_full_stats().
* refactor: align VenueAcronym model with venue-acronyms-2025 pipeline output format [AI-assisted]
Replace VariantRecord/LearnedAbbreviation with VenueAcronym matching the new
acronym-level JSON structure (canonical, variants, issn, confidence_score).
* fix: update bibtex parser tests to expect full macro expansion [AI-assisted]
\pasp now correctly expands to the full journal name via _expand_latex_journal_macros.
Update assertions to reflect the correct expanded form instead of the old acronym.
* style: remove extra blank lines from conflict resolution (ruff)
* feat: bump schema to v3, reject old databases with clear delete+sync message [AI-assisted]
- SCHEMA_VERSION 2 → 3 (acronym-level relational schema with canonical/variants/ISSN)
- init_database() always checks version for existing DBs; no bypass, no migration
- check_schema_compatibility() now tells users to delete the DB and run sync (pre-1.0)
- migrate_database() returns False for legacy DBs (no migration path)
- Fix test assertions to match new behavior
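
The reject-instead-of-migrate behaviour can be sketched as follows. How the project actually stores its schema version is not stated here, so this sketch assumes SQLite's PRAGMA user_version; the exception class, the connection-based signatures, and the message wording are likewise hypothetical.

```python
import sqlite3

SCHEMA_VERSION = 3


class IncompatibleSchemaError(RuntimeError):
    """Raised when an existing database was created with another schema version."""


def check_schema_compatibility(conn: sqlite3.Connection) -> None:
    # Assumption: the version lives in PRAGMA user_version.
    (found,) = conn.execute("PRAGMA user_version").fetchone()
    if found != SCHEMA_VERSION:
        raise IncompatibleSchemaError(
            f"Database schema v{found} is not supported (expected v{SCHEMA_VERSION}). "
            "Delete the database file and run sync again."
        )


def init_database(conn: sqlite3.Connection) -> None:
    """Stamp a fresh database; always verify an existing one, with no bypass."""
    has_tables = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
    ).fetchone()[0] > 0
    if has_tables:
        check_schema_compatibility(conn)
    else:
        # PRAGMA values cannot be bound as parameters; the constant is an int.
        conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")


def migrate_database(conn: sqlite3.Connection) -> bool:
    """No migration path exists for legacy databases in this sketch."""
    return False
```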
---------
Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent 5cf3b82
11 files changed, +881 -841 lines changed
File tree: dev-notes, src/aletheia_probe, cache, tests/unit