Commit 4505c8c

Integrate venue-acronyms-2025 pipeline: acronym-level lookup with ISSN support (#1025)
* Remove obsolete acronym/abbreviation handling, adapt import for venue-acronyms-2025

  - CLI: remove add, add-bibtex, list, export, stats, abbreviation-stats, list-ambiguous commands from acronym group; keep only status, import, clear
  - CLI: simplify import to support venue-acronyms-2025 consensus format (original_name, confidence_score, null-entry filtering, auto-detect source)
  - CLI: remove --include-abbreviations flag from clear command
  - acronym_cache.py: remove obsolete methods: export_all_variants, export_all_abbreviations, import_abbreviations, list_all_acronyms, clear_learned_abbreviations, check_acronym_conflict, list_ambiguous_acronyms, bulk_store_acronyms, increment_usage_count, get_learned_abbreviations, store_learned_abbreviation, get_abbreviation_confidence
  - acronym_cache.py: add import_variants with venue-acronyms-2025 support, _post_import_update, _detect_and_mark_ambiguous (Jaccard < 0.3), confidence_score x 100 as usage_count proxy
  - normalizer.py: disable abbreviation expansion (LLM handles it); add _strip_parenthetical_acronym to clean trailing (ACRONYM) from names

* Fix mypy errors: remove dead abbreviation code in bibtex_parser and normalizer

  Remove extract_acronyms_from_entries() from BibtexParser — this method was the only caller of the now-deleted get_learned_abbreviations() and store_learned_abbreviation() on AcronymCache, and is itself not called from any active code path.

  Remove _expand_abbreviations() from InputNormalizer — the method was already disabled at the call site and called the deleted get_learned_abbreviations().

* Apply linter formatting

* feat: add variant-based lookup to AcronymCache and extend dispatcher fallback [AI-assisted]

  The venue_acronyms table stores all observed name forms (abbreviated and expanded) in a JSON variants array, but there was no way to look up the canonical name from an abbreviated variant such as "acm trans. sens. networks".

  Add AcronymCache.get_canonical_for_variant() that searches the variants JSON array using SQLite json_each(), enabling reverse lookups from abbreviated forms to their canonical names.

  Extend QueryDispatcher._try_acronym_fallback() with a second lookup path: after the existing standalone-acronym check (Path 1), also try get_canonical_for_variant() for abbreviated multi-word forms (Path 2). Both paths feed into the same retry-with-expanded-name flow.

  Also update the CLI import docstring to reflect the v2.0 pipeline format and add venue_acronyms table documentation to NORMALIZED_DATABASE_DESIGN.md.

* refactor: normalize venue acronym storage — replace JSON columns with relational tables [AI-assisted]

  Storing variants and ISSNs as JSON TEXT columns prevents direct SQL queries (no index, no JOIN, full table scan with json_each). Replace the single JSON-heavy venue_acronyms table with three normalized tables:

  - venue_acronyms: one row per (acronym, entity_type), no JSON columns
  - venue_acronym_variants: one row per name variant (FK + COLLATE NOCASE index)
  - venue_acronym_issns: one row per ISSN (FK + index)

  ON DELETE CASCADE keeps child rows in sync when a parent is deleted.

  Migration: init_database() detects old JSON columns via PRAGMA table_info and drops the cluster before re-creating it with the new schema.

  All cache methods rewritten to use JOIN queries. get_canonical_for_variant now uses a straight JOIN instead of json_each(). get_issns() added. import_acronyms() upserts the parent row then does DELETE+INSERT for child rows to ensure a full replace on re-import.

* feat: add ISSN-based canonical lookup and wire as dispatcher fallback path 3 [AI-assisted]

  The normalized venue_acronym_issns table is now queryable but was not yet used at runtime.

  Add AcronymCache.get_canonical_for_issn(issn) — a JOIN against venue_acronym_issns that returns the canonical name for a venue given any of its known ISSNs. No entity_type filter needed since ISSNs are globally unique across venue types.

  Extend QueryDispatcher._try_acronym_fallback() with Path 3: if an ISSN is present in query_input.identifiers (extracted from BibTeX or user input), try get_canonical_for_issn() before giving up. This enables correct expansion even when the journal name is heavily abbreviated or missing but the ISSN is present.

* feat: add get_full_stats() and improve acronym status output [AI-assisted]

  The acronym status command showed only a single total count, giving no insight into how the data is distributed or whether variants/ISSNs were imported.

  Add AcronymCache.get_full_stats() that returns total_acronyms, total_variants, total_issns, and a per-entity_type breakdown using a single LEFT JOIN query across all three normalized tables.

  Update the 'acronym status' CLI command to display the full breakdown: totals at the top, then a table by entity type with acronym, variant, and ISSN counts. Update existing CLI tests to mock get_full_stats().

* refactor: align VenueAcronym model with venue-acronyms-2025 pipeline output format [AI-assisted]

  Replace VariantRecord/LearnedAbbreviation with VenueAcronym matching the new acronym-level JSON structure (canonical, variants, issn, confidence_score).

* fix: update bibtex parser tests to expect full macro expansion [AI-assisted]

  \pasp now correctly expands to the full journal name via _expand_latex_journal_macros. Update assertions to reflect the correct expanded form instead of the old acronym.

* style: remove extra blank lines from conflict resolution (ruff)

* feat: bump schema to v3, reject old databases with clear delete+sync message [AI-assisted]

  - SCHEMA_VERSION 2 → 3 (acronym-level relational schema with canonical/variants/ISSN)
  - init_database() always checks version for existing DBs; no bypass, no migration
  - check_schema_compatibility() now tells users to delete the DB and run sync (pre-1.0)
  - migrate_database() returns False for legacy DBs (no migration path)
  - Fix test assertions to match new behavior

---------

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
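
The three fallback paths named in the commit message (standalone acronym, abbreviated variant, ISSN) can be pictured as an ordered chain. The following is only a minimal sketch of that ordering, not the actual `QueryDispatcher._try_acronym_fallback()` code: the function `expand_venue_name`, the three dictionaries, and the sample data (including the placeholder ISSN `0000-0000`) are hypothetical stand-ins for the `AcronymCache` lookups.

```python
from typing import Optional

# Hypothetical in-memory stand-ins for the AcronymCache lookups (illustrative data only).
CANONICAL = "acm transactions on sensor networks"
ACRONYMS = {("tosn", "journal"): CANONICAL}
VARIANTS = {("acm trans. sens. networks", "journal"): CANONICAL}
ISSNS = {"0000-0000": CANONICAL}  # placeholder ISSN, not a real identifier


def expand_venue_name(
    name: str, entity_type: str, issn: Optional[str] = None
) -> Optional[str]:
    """Return a canonical venue name via the three fallback paths, or None."""
    key = (name.strip().lower(), entity_type)
    # Path 1: the input is a standalone acronym.
    if key in ACRONYMS:
        return ACRONYMS[key]
    # Path 2: the input is an abbreviated multi-word variant.
    if key in VARIANTS:
        return VARIANTS[key]
    # Path 3: an ISSN was supplied; no entity_type filter is applied here.
    if issn is not None and issn in ISSNS:
        return ISSNS[issn]
    return None
```

In this sketch the ISSN is consulted last, which mirrors the "before giving up" wording of the commit message. Whether the real dispatcher short-circuits in exactly this way is an assumption.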
1 parent 5cf3b82 commit 4505c8c
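
The commit message mentions `_detect_and_mark_ambiguous (Jaccard < 0.3)` without showing how the similarity is computed. As a rough illustration, assuming the comparison is made on lowercase word sets of candidate names for the same acronym (the commit does not confirm the tokenization), a Jaccard check could look like the sketch below. The helper names `jaccard` and `is_ambiguous` are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets of two names."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty names are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def is_ambiguous(names: list[str], threshold: float = 0.3) -> bool:
    """True if any pair of candidate names overlaps less than the threshold."""
    return any(jaccard(a, b) < threshold for a, b in combinations(names, 2))
```

Under this reading, two expansions that share most of their words are treated as the same venue, while expansions with almost no words in common mark the acronym as ambiguous.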

11 files changed: +881 -841 lines changed

dev-notes/NORMALIZED_DATABASE_DESIGN.md

Lines changed: 45 additions & 1 deletion
@@ -118,7 +118,51 @@ CREATE TABLE retraction_statistics (
 );
 ```
 
-### 8. Assessment Cache Table
+### 8. Venue Acronyms Cluster (three tables)
+
+**Purpose**: Pre-compiled venue name lookup, imported from the venue-acronyms-2025 pipeline.
+Normalized into three tables so every value is directly queryable.
+
+```sql
+-- One row per (acronym, entity_type) pair
+CREATE TABLE venue_acronyms (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    acronym TEXT NOT NULL COLLATE NOCASE,
+    entity_type TEXT NOT NULL, -- VenueType: 'journal', 'conference', ...
+    canonical TEXT NOT NULL, -- Fully-expanded lowercase authoritative name
+    confidence_score REAL DEFAULT 0.0, -- LLM consensus confidence (0.0–1.0)
+    source_file TEXT, -- Source acronyms-YYYY-MM.json filename
+    imported_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    UNIQUE(acronym, entity_type)
+);
+
+-- All observed name forms (expanded and abbreviated) — one row per variant
+CREATE TABLE venue_acronym_variants (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    venue_acronym_id INTEGER NOT NULL,
+    variant TEXT NOT NULL COLLATE NOCASE,
+    FOREIGN KEY (venue_acronym_id) REFERENCES venue_acronyms(id) ON DELETE CASCADE,
+    UNIQUE(venue_acronym_id, variant)
+);
+
+-- Known ISSNs — one row per ISSN
+CREATE TABLE venue_acronym_issns (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    venue_acronym_id INTEGER NOT NULL,
+    issn TEXT NOT NULL,
+    FOREIGN KEY (venue_acronym_id) REFERENCES venue_acronyms(id) ON DELETE CASCADE,
+    UNIQUE(venue_acronym_id, issn)
+);
+```
+
+**Lookup patterns**:
+- Acronym → canonical: `SELECT canonical FROM venue_acronyms WHERE acronym = ? AND entity_type = ?`
+- Variant → canonical: `SELECT va.canonical FROM venue_acronyms va JOIN venue_acronym_variants vav ON va.id = vav.venue_acronym_id WHERE vav.variant = ? AND va.entity_type = ?`
+- ISSN → acronym entries: `SELECT va.* FROM venue_acronyms va JOIN venue_acronym_issns vai ON va.id = vai.venue_acronym_id WHERE vai.issn = ?`
+
+**Import source**: `aletheia-probe acronym import <acronyms-YYYY-MM.json>`
+
+### 9. Assessment Cache Table
 
 **Purpose**: Domain-specific caching for structured journal/conference assessment results
 
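
To see how the three tables and the documented lookup patterns behave together, the schema from the diff can be loaded into an in-memory SQLite database. This is a sketch under stated assumptions rather than the project's `AcronymCache` code: the helper `one`, the sample row, and the placeholder ISSN `0000-0000` are hypothetical, and the ISSN query selects only `canonical` instead of `va.*` to keep the output simple.

```python
import sqlite3

SCHEMA = """
CREATE TABLE venue_acronyms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    acronym TEXT NOT NULL COLLATE NOCASE,
    entity_type TEXT NOT NULL,
    canonical TEXT NOT NULL,
    confidence_score REAL DEFAULT 0.0,
    source_file TEXT,
    imported_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(acronym, entity_type)
);
CREATE TABLE venue_acronym_variants (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    venue_acronym_id INTEGER NOT NULL,
    variant TEXT NOT NULL COLLATE NOCASE,
    FOREIGN KEY (venue_acronym_id) REFERENCES venue_acronyms(id) ON DELETE CASCADE,
    UNIQUE(venue_acronym_id, variant)
);
CREATE TABLE venue_acronym_issns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    venue_acronym_id INTEGER NOT NULL,
    issn TEXT NOT NULL,
    FOREIGN KEY (venue_acronym_id) REFERENCES venue_acronyms(id) ON DELETE CASCADE,
    UNIQUE(venue_acronym_id, issn)
);
"""

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys (and thus ON DELETE CASCADE) only with this pragma on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# Illustrative sample data; the ISSN is a placeholder, not a real identifier.
cur = conn.execute(
    "INSERT INTO venue_acronyms (acronym, entity_type, canonical) VALUES (?, ?, ?)",
    ("TOSN", "journal", "acm transactions on sensor networks"),
)
parent_id = cur.lastrowid
conn.execute(
    "INSERT INTO venue_acronym_variants (venue_acronym_id, variant) VALUES (?, ?)",
    (parent_id, "acm trans. sens. networks"),
)
conn.execute(
    "INSERT INTO venue_acronym_issns (venue_acronym_id, issn) VALUES (?, ?)",
    (parent_id, "0000-0000"),
)
conn.commit()


def one(sql, params=()):
    """Return the first column of the first row, or None if there is no row."""
    row = conn.execute(sql, params).fetchone()
    return row[0] if row else None


# COLLATE NOCASE on the columns makes both lookups case-insensitive for ASCII.
by_acronym = one(
    "SELECT canonical FROM venue_acronyms WHERE acronym = ? AND entity_type = ?",
    ("tosn", "journal"),
)
by_variant = one(
    "SELECT va.canonical FROM venue_acronyms va "
    "JOIN venue_acronym_variants vav ON va.id = vav.venue_acronym_id "
    "WHERE vav.variant = ? AND va.entity_type = ?",
    ("ACM Trans. Sens. Networks", "journal"),
)
by_issn = one(
    "SELECT va.canonical FROM venue_acronyms va "
    "JOIN venue_acronym_issns vai ON va.id = vai.venue_acronym_id "
    "WHERE vai.issn = ?",
    ("0000-0000",),
)

# Deleting the parent row removes its variants and ISSNs through the cascade.
conn.execute("DELETE FROM venue_acronyms WHERE id = ?", (parent_id,))
conn.commit()
remaining_variants = one("SELECT COUNT(*) FROM venue_acronym_variants")
remaining_issns = one("SELECT COUNT(*) FROM venue_acronym_issns")
```

Note that the cascade depends on `PRAGMA foreign_keys = ON` being set on each connection. Whether the project's connection setup does this is not shown in the diff.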
