Commit b366f0d

feat: add DBLP venues backend with local dump support (#1021)
* feat: add DBLP venues backend with local dump parsing

  Integrate curated DBLP venue metadata as an additional legitimacy signal for journals and conferences, while supporting reproducible local-cache syncs and large-dump processing with bounded memory and progress logging.

* fix: use defusedxml for DBLP XML parsing

  Replace stdlib ElementTree iterparse with defusedxml to satisfy security scanning and mitigate XML parser attack risk. Add explicit dependency and align tests with the new parse error type.

* fix: ignore missing defusedxml stubs in mypy

  The parser security fix depends on defusedxml, but stub package availability differs across CI environments. Scope ignore_missing_imports to defusedxml modules so strict mypy checks remain unchanged elsewhere.

---------

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent eff3e70 commit b366f0d

12 files changed: +952 -2 lines changed

README.md

Lines changed: 3 additions & 1 deletion
@@ -86,10 +86,11 @@ This tool acts as a **data aggregator** - it doesn't provide data itself, but co
 - **Kscien Publishers** - Known predatory publishers
 - **Kscien Hijacked Journals** - Legitimate journals that have been hijacked by predatory actors
 - **Kscien Predatory Conferences** - Database of predatory conferences
+- **DBLP Venues** - Curated computer science journals and conference series from DBLP XML
 
 The tool analyzes publication patterns, citation metrics, and metadata quality to provide comprehensive coverage beyond traditional blacklist/whitelist approaches.
 
-**Note on Conference Assessment**: Conference checking is currently limited compared to journal assessment. The primary source for conference evaluation is the Kscien Predatory Conferences database. Most other data sources focus exclusively on journals, so conference assessments may have less comprehensive coverage and fewer cross-validation opportunities.
+**Note on Conference Assessment**: Conference checking now combines curated predatory conference signals (Kscien) with curated legitimacy signals from DBLP venue series. Coverage is stronger for computer science venues than for other domains.
 
 ## Quick Start
 
@@ -120,6 +121,7 @@ These provide authoritative yes/no decisions for journals they cover:
 | **Kscien Publishers** | Predatory publishers | 1,200+ entries | Known predatory publishers |
 | **Kscien Hijacked Journals** | Hijacked journals | ~200 entries | Legitimate journals compromised by predatory actors |
 | **Kscien Predatory Conferences** | Predatory conferences | ~450 entries | Identified predatory conference venues |
+| **DBLP Venues** | Legitimate venues (CS) | dump-derived | Curated DBLP journals and conference series from local XML cache |
 | **Retraction Watch** | Quality indicator | ~27,000 journals | Retraction rates and patterns for quality assessment |
 | **Institutional Lists** | Custom whitelist/blacklist | Organization-specific | Local policy enforcement |
 

docs/CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -7,6 +7,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+
+- **DBLP Venues Backend**: Added `dblp_venues` cached backend with local XML dump synchronization
+  - Downloads and caches `dblp.xml.gz` locally under `.aletheia-probe/dblp/`
+  - Parses DBLP conference (`conf/*`) entries via streaming XML
+  - Adds curated DBLP conference series as legitimate conference evidence
+
 ## [0.8.0] - 2026-01-08
 
 ### Added
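
The changelog entry above mentions streaming XML parsing, and the commit message mentions bounded memory. The parser added by this commit is not shown in this diff, so the following is only a minimal sketch of the general technique: iterate over parse events and discard each record once it has been handled. The sample document and the function name `iter_record_keys` are assumptions made for illustration. The sketch uses the stdlib parser so that it runs without extra dependencies; the commit itself switches to defusedxml, whose `defusedxml.ElementTree.iterparse` is intended to be called the same way.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: a tiny stand-in for the DBLP dump. The real file is far larger.
SAMPLE = b"""<dblp>
<inproceedings key="conf/sigmod/Doe20"><title>A</title></inproceedings>
<article key="journals/tods/Roe21"><title>B</title></article>
<inproceedings key="conf/sigmod/Poe22"><title>C</title></inproceedings>
<www key="homepages/d/Doe"><title>Home Page</title></www>
</dblp>"""


def iter_record_keys(stream):
    """Yield the key attribute of each top-level record, freeing memory as we go."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    depth = 0  # nesting depth below the root element
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:  # a direct child of the root has just been closed
            key = elem.get("key")
            if key:
                yield key
            # Drop already-processed records so the tree does not keep growing.
            root.clear()


for record_key in iter_record_keys(io.BytesIO(SAMPLE)):
    print(record_key)
```

Clearing the root after each record is what keeps memory use roughly constant: without it, the parser would retain every record element until the end of the file.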

docs/api-reference/backends.md

Lines changed: 1 addition & 0 deletions
@@ -120,6 +120,7 @@ YAML structure:
 | Backend | Purpose | Source |
 |---------|---------|--------|
 | **doaj** | Directory of Open Access Journals | `src/aletheia_probe/backends/doaj.py` |
+| **dblp_venues** | DBLP curated venue series (CS journals + conferences) | `src/aletheia_probe/backends/dblp_venues.py` |
 
 ### Quality Indicators
 

docs/configuration.md

Lines changed: 19 additions & 0 deletions
@@ -307,6 +307,25 @@ backends:
 
 See backend implementations in `src/aletheia_probe/backends/kscien_*.py`
 
+### DBLP Venues Backend
+
+```yaml
+backends:
+  dblp_venues:
+    enabled: true
+    weight: 0.8
+    timeout: 10
+    config: {}
+```
+
+**Behavior**:
+- Downloads and caches the complete DBLP XML dump (`dblp.xml.gz`) locally in `.aletheia-probe/dblp/`
+- Extracts venue series from DBLP `conf/*` and `journals/*` entries
+- Sync cadence is monthly by default due to dump size (~1 GB compressed)
+
+**Related URL setting**:
+- `data_source_urls.dblp_xml_dump_url`: Defaults to `https://dblp.org/xml/dblp.xml.gz`
+
 ### Scopus Backend
 
 ```yaml
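
The documentation added in this hunk says venue series are extracted from `conf/*` and `journals/*` entries. The extraction code is not part of this diff, so the sketch below only illustrates one plausible way to do it, assuming DBLP record keys of the form `conf/<series>/<id>` and `journals/<series>/<id>`. The function name `venue_series_from_key` and the sample keys are hypothetical.

```python
from collections import Counter


def venue_series_from_key(key):
    """Map a DBLP record key to (venue_type, series_id), or None for other records."""
    parts = key.split("/")
    if len(parts) < 3 or not parts[1]:
        return None  # not in the assumed "<prefix>/<series>/<id>" form
    prefix, series = parts[0], parts[1]
    if prefix == "conf":
        return ("conference", series)
    if prefix == "journals":
        return ("journal", series)
    return None  # e.g. homepages/*, books/*, phd/*


# Hypothetical sample keys, for illustration only.
sample_keys = [
    "conf/sigmod/Doe20",
    "conf/sigmod/Poe22",
    "journals/tods/Roe21",
    "homepages/d/Doe",
]

# Count how many records were seen per venue series.
series_counts = Counter(
    venue for venue in map(venue_series_from_key, sample_keys) if venue is not None
)
print(series_counts)
```

Grouping by series rather than by individual record is what turns millions of publication entries into a comparatively short list of venues that a backend can match against.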

pyproject.toml

Lines changed: 6 additions & 1 deletion
@@ -40,6 +40,7 @@ dependencies = [
     "pydantic>=2.0.0",
     "aiohttp>=3.8.0",
     "aiofiles>=23.0.0",
+    "defusedxml>=0.7.1",
     "click>=8.0.0",
     "pyyaml>=6.0",
     "rarfile>=4.1",
@@ -139,6 +140,10 @@ plugins = ["pydantic.mypy"]
 module = "tests.*"
 disallow_untyped_defs = false
 
+[[tool.mypy.overrides]]
+module = ["defusedxml", "defusedxml.*"]
+ignore_missing_imports = true
+
 [tool.ruff]
 target-version = "py310"
 line-length = 88
@@ -165,4 +170,4 @@ ignore = [
 known-first-party = ["aletheia_probe"]
 section-order = ["future", "standard-library", "third-party", "first-party", "local-folder"]
 split-on-trailing-comma = true
-lines-after-imports = 2
+lines-after-imports = 2

src/aletheia_probe/backends/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@
     bealls,
     crossref_analyzer,
     custom_list,
+    dblp_venues,
     doaj,
     kscien_hijacked_journals,
     kscien_predatory_conferences,
@@ -28,6 +29,7 @@
     "algerian_ministry",
     "bealls",
     "crossref_analyzer",
+    "dblp_venues",
     "custom_list",
     "doaj",
     "kscien_hijacked_journals",

src/aletheia_probe/backends/dblp_venues.py

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: MIT
+"""DBLP conference backend for legitimate conference verification."""
+
+from typing import TYPE_CHECKING
+
+from ..enums import AssessmentType, EvidenceType
+from .base import CachedBackend, get_backend_registry
+
+
+if TYPE_CHECKING:
+    from ..updater.core import DataSource
+    from ..updater.sources.dblp import DblpVenueSource
+
+
+_CACHE_TTL_HOURS = 24 * 30  # Monthly cache for large DBLP dump refreshes
+
+
+class DblpVenuesBackend(CachedBackend):
+    """Backend that checks venue legitimacy using DBLP venue series data."""
+
+    def __init__(self) -> None:
+        """Initialize DBLP venues backend."""
+        super().__init__(
+            source_name="dblp_venues",
+            list_type=AssessmentType.LEGITIMATE,
+            cache_ttl_hours=_CACHE_TTL_HOURS,
+        )
+        self._data_source: DblpVenueSource | None = None
+
+    def get_name(self) -> str:
+        """Return backend identifier."""
+        return "dblp_venues"
+
+    def get_evidence_type(self) -> EvidenceType:
+        """Return evidence type for DBLP venues backend."""
+        return EvidenceType.LEGITIMATE_LIST
+
+    def get_data_source(self) -> "DataSource | None":
+        """Get DBLP venue source for cache synchronization."""
+        if self._data_source is None:
+            from ..updater.sources.dblp import DblpVenueSource
+
+            self._data_source = DblpVenueSource()
+        return self._data_source
+
+
+get_backend_registry().register_factory(
+    "dblp_venues",
+    lambda: DblpVenuesBackend(),
+    default_config={},
+)
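
The backend above passes `cache_ttl_hours=24 * 30` to `CachedBackend`. How the base class uses that value is not shown in this diff, so the following is only a sketch of the kind of freshness check a monthly TTL implies. The function name `needs_sync` and its parameters are hypothetical and are not part of the project's API.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL_HOURS = 24 * 30  # same value as _CACHE_TTL_HOURS in the backend


def needs_sync(last_sync, now, ttl_hours=CACHE_TTL_HOURS):
    """Return True when there was no previous sync or the cache has reached its TTL."""
    if last_sync is None:
        return True  # nothing cached yet, so a first download is required
    return now - last_sync >= timedelta(hours=ttl_hours)


# Usage with fixed timestamps so the outcome does not depend on the current date.
now = datetime(2026, 2, 1, tzinfo=timezone.utc)
print(needs_sync(None, now))
print(needs_sync(now - timedelta(days=10), now))
print(needs_sync(now - timedelta(days=45), now))
```

Passing `now` explicitly, instead of reading the clock inside the function, keeps the check deterministic and easy to test.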

src/aletheia_probe/config.py

Lines changed: 4 additions & 0 deletions
@@ -108,6 +108,10 @@ class DataSourceUrlConfig(BaseModel):
         "https://ugccare.unipune.ac.in/Apps1/User/Web/ScopusDelisted",
         description="URL for UGC-CARE Group-II delisted journals page",
     )
+    dblp_xml_dump_url: str = Field(
+        "https://dblp.org/xml/dblp.xml.gz",
+        description="URL for DBLP full XML dump",
+    )
 
 
 class DataSourceProcessingConfig(BaseModel):

src/aletheia_probe/updater/sources/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,7 @@
 from .algerian import AlgerianMinistrySource
 from .bealls import BeallsListSource
 from .custom import CustomListSource
+from .dblp import DblpVenueSource
 from .kscien_generic import KscienGenericSource
 from .kscien_hijacked_journals import KscienHijackedJournalsSource
 from .kscien_publishers import KscienPublishersSource
@@ -24,6 +25,7 @@
     "AlgerianMinistrySource",
     "BeallsListSource",
     "CustomListSource",
+    "DblpVenueSource",
     "KscienGenericSource",
     "KscienHijackedJournalsSource",
     "KscienPublishersSource",
