Commit 675eb37

feat: add DOAJ local mode backed by user-provided CSV snapshot (#1094)
* feat: add DOAJ local mode backed by user-provided CSV snapshot

  When DOAJ_MODE=local, DOAJLocalBackend queries a SQLite cache populated from a DOAJ CSV file placed in .aletheia-probe/doaj/. Remote mode (default) is unchanged: the HTTP API is used as before.

  Key changes:
  - New DOAJSource(DataSource) reads journalcsv__doaj_*.csv and feeds the updater framework; 'aletheia-probe sync doaj' always works regardless of DOAJ_MODE
  - DOAJLocalBackend(ConfiguredCachedBackend) is always registered so sync capability is always present; query() delegates to DOAJBackend (HTTP) in remote mode and to the SQLite journal cache in local mode
  - Add "doaj": "DOAJ_MODE" to RUNTIME_MODE_ENV_BY_BACKEND so 'aletheia-probe status' shows mode=local/remote
  - docs/local-doaj-backend.md: setup guide + note on matching differences vs remote mode

* fix: cast super().query() return to BackendResult to satisfy mypy

---------

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent 4efe57e commit 675eb37

5 files changed, 373 additions and 3 deletions

docs/local-doaj-backend.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
# Local DOAJ Backend

By default, `aletheia-probe` checks journal legitimacy against the
[DOAJ (Directory of Open Access Journals)](https://doaj.org/) REST API.
**Local mode** lets you run entirely offline using a CSV snapshot
downloaded directly from DOAJ, with no network calls and no rate limiting.

---

## Step 1: Download the DOAJ CSV

1. Go to <https://doaj.org/docs/public-data-dump/>
2. Click **"Download journals CSV"**. No login is required.
3. The downloaded file will be named something like:

   ```
   journalcsv__doaj_20260314_1626_utf8.csv
   ```

> **Note:** Use the CSV export (not the JSON bulk-download).

---

## Step 2: Place the file

Create the directory `.aletheia-probe/doaj/` **inside your working directory**
(the directory from which you run `aletheia-probe`) and copy the file there:

```bash
mkdir -p .aletheia-probe/doaj/
cp ~/Downloads/journalcsv__doaj_*.csv .aletheia-probe/doaj/
```

If multiple files matching `journalcsv__doaj_*.csv` are present, the most
recently modified one is used.
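
To make the "most recently modified" rule concrete, here is a minimal Python sketch of how such a selection could work. The helper name `newest_doaj_csv` is hypothetical and this is not the code `DOAJSource` actually runs; it only illustrates picking the matching file with the latest modification time.

```python
from __future__ import annotations

from pathlib import Path


def newest_doaj_csv(directory: Path) -> Path | None:
    """Return the most recently modified journalcsv__doaj_*.csv, or None.

    Hypothetical helper for illustration; not part of aletheia-probe.
    """
    candidates = list(directory.glob("journalcsv__doaj_*.csv"))
    if not candidates:
        return None
    # Compare by filesystem modification time, not by the date in the name.
    return max(candidates, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    print(newest_doaj_csv(Path(".aletheia-probe/doaj")))
```

Note that under this rule a freshly copied older snapshot would win over a newer one copied earlier, so keeping a single CSV in the directory avoids surprises.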

---

## Step 3: Sync the local cache

```bash
aletheia-probe sync doaj
```

This reads the CSV and writes the journal records into the local SQLite
database. Re-running sync within 30 days of the last update is a no-op
unless `--force` is passed.
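
The 30-day rule can be pictured as a simple freshness check. The sketch below is an assumption about the logic, not the updater's real code: `should_sync` and `SYNC_INTERVAL` are hypothetical names, and treating an age of exactly 30 days as stale is a guess.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# 30 days, matching the interval described above.
SYNC_INTERVAL = timedelta(days=30)


def should_sync(
    last_updated: datetime | None, now: datetime, force: bool = False
) -> bool:
    """Decide whether a sync should do any work (hypothetical helper)."""
    if force or last_updated is None:
        # --force always re-imports; an empty cache always needs a first import.
        return True
    return now - last_updated >= SYNC_INTERVAL


if __name__ == "__main__":
    now = datetime(2026, 4, 1)
    print(should_sync(datetime(2026, 3, 14), now))              # False: 18 days old
    print(should_sync(datetime(2026, 3, 14), now, force=True))  # True: forced
```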

---

## Step 4: Enable local mode

Set the environment variable `DOAJ_MODE=local` before running any
`aletheia-probe` command:

```bash
export DOAJ_MODE=local
aletheia-probe assess "Nature"
```

Or inline for a single run:

```bash
DOAJ_MODE=local aletheia-probe mass-eval --input papers.bib
```

---

## Verifying the setup

```bash
DOAJ_MODE=local aletheia-probe status
```

The DOAJ line should show `mode=local` together with the entry count and
last-updated date:

```
✅ doaj (enabled, cached, mode=local) 📊 has data (22,672 entries) (updated: 2026-03-14)
```

---

## Keeping the data fresh

DOAJ publishes updated snapshots regularly. To refresh:

1. Download the latest CSV from <https://doaj.org/docs/public-data-dump/>.
2. Replace the file in `.aletheia-probe/doaj/`.
3. Run `aletheia-probe sync doaj --force`.

---

## Differences from remote mode

Local mode is **not** a 1:1 replacement for the remote DOAJ API. There are
two notable differences:

**Coverage:** The CSV contains only journals that DOAJ has accepted as fully
open access. The remote API may return results for journals that are in
DOAJ's index but not yet reflected in the most recently downloaded CSV
snapshot. Conversely, journals removed from DOAJ since the snapshot was
taken will still appear in the local cache until the next sync.

**Matching strategy:** The remote API performs server-side fuzzy/full-text
search and may return approximate matches for journal names it cannot find
exactly (for example, matching "Nature" against "Nature-Nurture Journal of
Psychology"). Local mode uses exact name and ISSN matching only, so
it will not return such approximate hits. This makes local mode
**more precise** but means it may return `not_found` for queries where the
remote API would have returned a low-confidence fuzzy match.

In practice, for well-formed journal names or ISSNs the two modes produce
identical results. Discrepancies are a sign that the remote API result was
a false positive, or that the local snapshot is out of date.
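
The matching difference is easiest to see in code. The following is a rough sketch of exact name and ISSN matching, not the cache's actual lookup: the `Journal` record, the normalization rules (case-folding, whitespace collapsing, hyphen removal) and the placeholder ISSN are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Journal:
    title: str
    issn: str


def normalize_title(title: str) -> str:
    """Case-fold and collapse whitespace (assumed normalization)."""
    return " ".join(title.casefold().split())


def normalize_issn(issn: str) -> str:
    """Drop the hyphen and uppercase the check character (assumed)."""
    return issn.strip().replace("-", "").upper()


def exact_lookup(
    journals: list[Journal], title: str | None = None, issn: str | None = None
) -> Journal | None:
    """Return the first journal whose ISSN or full title matches exactly."""
    for journal in journals:
        if issn is not None and normalize_issn(journal.issn) == normalize_issn(issn):
            return journal
        if title is not None and normalize_title(journal.title) == normalize_title(title):
            return journal
    return None


# Title taken from the example above; the ISSN is a placeholder, not a real one.
SNAPSHOT = [Journal("Nature-Nurture Journal of Psychology", "0000-000X")]

if __name__ == "__main__":
    # None: "Nature" is only a prefix of the stored title, not an exact match.
    print(exact_lookup(SNAPSHOT, title="Nature"))
```

A fuzzy search could rank the stored title as a hit for "Nature"; an exact comparison of normalized strings cannot, which is why local mode reports `not_found` in that case.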

---

## Switching back to remote mode

Unset `DOAJ_MODE` (or set it to `remote`) to use the live DOAJ API:

```bash
unset DOAJ_MODE
aletheia-probe assess "Nature"
```

src/aletheia_probe/backends/doaj.py

Lines changed: 44 additions & 3 deletions
@@ -1,7 +1,8 @@
 # SPDX-License-Identifier: MIT
 """DOAJ (Directory of Open Access Journals) backend for legitimate journal verification."""

-from typing import Any
+import os
+from typing import Any, cast
 from urllib.parse import quote

 import aiohttp
@@ -25,8 +26,9 @@
     VenueType,
 )
 from ..retry_utils import async_retry_with_backoff
+from ..updater.sources.doaj import DOAJSource
 from ..utils.dead_code import code_is_used
-from .base import ApiBackendWithCache, get_backend_registry
+from .base import ApiBackendWithCache, ConfiguredCachedBackend, get_backend_registry
 from .fallback_mixin import FallbackStrategyMixin


@@ -358,9 +360,48 @@ def _build_not_found_result_with_chain(
         )


+class DOAJLocalBackend(ConfiguredCachedBackend):
+    """DOAJ backend registered in the backend registry.
+
+    Always registered so that ``aletheia-probe sync doaj`` works regardless of
+    ``DOAJ_MODE``. At query time the mode is checked:
+
+    - ``DOAJ_MODE=local`` → query the local SQLite cache (populated by sync)
+    - ``DOAJ_MODE=remote`` (default) → delegate to :class:`DOAJBackend` (HTTP API)
+
+    The CSV file for local mode must be placed in ``.aletheia-probe/doaj/`` in
+    the current working directory and imported via ``aletheia-probe sync doaj``.
+    """
+
+    def __init__(self, remote_cache_ttl_hours: int = 24) -> None:
+        super().__init__(
+            backend_name="doaj",
+            list_type=AssessmentType.LEGITIMATE,
+            evidence_type=EvidenceType.LEGITIMATE_LIST,
+            cache_ttl_hours=24 * 30,  # Monthly cache for static file
+            data_source_factory=lambda: DOAJSource(),
+        )
+        self._remote_cache_ttl_hours = remote_cache_ttl_hours
+        self._remote_backend: DOAJBackend | None = None
+
+    def _get_remote_backend(self) -> DOAJBackend:
+        if self._remote_backend is None:
+            self._remote_backend = DOAJBackend(
+                cache_ttl_hours=self._remote_cache_ttl_hours
+            )
+        return self._remote_backend
+
+    async def query(self, query_input: QueryInput) -> BackendResult:
+        """Query local cache when DOAJ_MODE=local, otherwise use HTTP API."""
+        mode = os.environ.get("DOAJ_MODE", "remote").strip().lower()
+        if mode == "local":
+            return cast(BackendResult, await super().query(query_input))
+        return await self._get_remote_backend().query(query_input)
+
+
 # Register the backend with factory for configuration support
 get_backend_registry().register_factory(
     "doaj",
-    lambda cache_ttl_hours=24: DOAJBackend(cache_ttl_hours=cache_ttl_hours),
+    lambda cache_ttl_hours=24: DOAJLocalBackend(remote_cache_ttl_hours=cache_ttl_hours),
     default_config={"cache_ttl_hours": 24},
 )
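
The mode check in `query()` can be exercised in isolation. The snippet below is a small stand-alone sketch that mirrors the `os.environ.get("DOAJ_MODE", "remote").strip().lower()` expression from the diff; the function name `uses_local_cache` is hypothetical and exists only for this illustration. It shows one consequence of the comparison: any value other than `local`, including a typo, falls through to the remote HTTP path.

```python
import os
from collections.abc import Mapping


def uses_local_cache(env: Mapping[str, str] = os.environ) -> bool:
    """Mirror the mode check from DOAJLocalBackend.query (illustrative only)."""
    mode = env.get("DOAJ_MODE", "remote").strip().lower()
    return mode == "local"


if __name__ == "__main__":
    print(uses_local_cache({"DOAJ_MODE": " Local "}))  # True: trimmed and lowercased
    print(uses_local_cache({}))                        # False: default is remote
    print(uses_local_cache({"DOAJ_MODE": "offline"}))  # False: unknown value, remote path
```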

src/aletheia_probe/cli_commands/system_commands.py

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@


 RUNTIME_MODE_ENV_BY_BACKEND: dict[str, str] = {
+    "doaj": "DOAJ_MODE",
     "openalex_analyzer": "OPENALEX_MODE",
     "opencitations_analyzer": "OPENCITATIONS_MODE",
 }
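
The status command's own code is not part of this diff, so the following is only a guess at how a mapping like this could be turned into the `mode=local` / `mode=remote` label mentioned in the commit message. `runtime_mode_label` is a hypothetical helper, and defaulting every backend to `remote` when its variable is unset is an assumption.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Copied from the diff above.
RUNTIME_MODE_ENV_BY_BACKEND: dict[str, str] = {
    "doaj": "DOAJ_MODE",
    "openalex_analyzer": "OPENALEX_MODE",
    "opencitations_analyzer": "OPENCITATIONS_MODE",
}


def runtime_mode_label(
    backend_name: str, env: Mapping[str, str] = os.environ
) -> str | None:
    """Build a 'mode=...' label for backends with a runtime mode (hypothetical)."""
    env_var = RUNTIME_MODE_ENV_BY_BACKEND.get(backend_name)
    if env_var is None:
        return None  # backend has no runtime mode switch
    # Assumption: an unset variable means remote mode for every backend.
    mode = env.get(env_var, "remote").strip().lower()
    return f"mode={mode}"


if __name__ == "__main__":
    print(runtime_mode_label("doaj", {"DOAJ_MODE": "local"}))  # mode=local
```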

src/aletheia_probe/updater/sources/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@
 from .core import CoreConferenceSource, CoreJournalSource
 from .custom import CustomListSource
 from .dblp import DblpVenueSource
+from .doaj import DOAJSource
 from .kscien_generic import KscienGenericSource
 from .kscien_hijacked_journals import KscienHijackedJournalsSource
 from .kscien_publishers import KscienPublishersSource
@@ -30,6 +31,7 @@
     "CoreConferenceSource",
     "CoreJournalSource",
     "DblpVenueSource",
+    "DOAJSource",
     "KscienGenericSource",
     "KscienHijackedJournalsSource",
     "KscienPublishersSource",
