Commit 5cf3b82

feat: add CORE conference and journal backends (#1026)
* feat: add CORE conference and journal backends [AI-assisted]

  Introduce CORE/ICORE conference and CORE journal integrations so ranked venues can be used as first-class legitimate-list evidence in assessments. This adds paginated portal sources, cached backends, config defaults, and focused unit tests to keep behavior reliable as portal data evolves.

* fix: use urllib for CORE source fetching [AI-assisted]

  Replace aiohttp/curl-specific fetch path with urllib-based page retrieval for CORE conference and journal sources. This avoids environment-specific aiohttp timeout behavior and keeps cross-platform compatibility while preserving retry/backoff logic. Also add a small connectivity diagnostics script for CORE endpoints.

* docs: add CORE integration documentation [AI-assisted]

  Document CORE conference and journal integrations in README and integration notes, including backend coverage, usage, and limitations.

* fix: resolve CORE lint and typing checks [AI-assisted]

  Add SPDX header to CORE requests diagnostic script and make urllib read type explicit so mypy no-any-return passes consistently.

* fix: harden CORE fetch against unsafe URL schemes [AI-assisted]

  Replace urllib urlopen with explicit HTTPSConnection requests and enforce https/host validation to satisfy Bandit B310 while keeping CORE sync behavior unchanged.

* docs: keep README backend docs generic [AI-assisted]

  Remove the CORE-specific link from the main README and keep only the shared backend integration documentation entry. Also remove the temporary CORE connectivity test script no longer needed.

---------

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent b3b4320 commit 5cf3b82
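
The hardening step described in the commit message (explicit `HTTPSConnection` requests plus https/host validation) might look roughly like the sketch below. This is not the code from the commit: the names `validate_core_url`, `fetch_page`, and `ALLOWED_HOSTS` are hypothetical, and only the `portal.core.edu.au` host is taken from the configured portal URLs. Retry/backoff is omitted.

```python
from http.client import HTTPSConnection
from urllib.parse import urlsplit

# Hypothetical allow-list; the host comes from the default CORE portal URLs.
ALLOWED_HOSTS = frozenset({"portal.core.edu.au"})


def validate_core_url(url: str) -> tuple[str, str]:
    """Return (host, request_target) if the URL is https and allow-listed."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        # Rejects file://, ftp://, http:// and similar (the concern behind Bandit B310).
        raise ValueError(f"unsupported URL scheme: {parts.scheme!r}")
    host = parts.hostname or ""
    if host not in ALLOWED_HOSTS:
        raise ValueError(f"unexpected host: {host!r}")
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return host, target


def fetch_page(url: str, timeout: float = 30.0) -> str:
    """Fetch one page over HTTPS after validating scheme and host."""
    host, target = validate_core_url(url)
    conn = HTTPSConnection(host, timeout=timeout)
    try:
        conn.request("GET", target)
        response = conn.getresponse()
        if response.status != 200:
            raise RuntimeError(f"HTTP {response.status} for {url}")
        return response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()
```

Validating before opening the connection keeps the scheme check independent of the HTTP client, so it can be unit-tested without network access.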

File tree

11 files changed

+789
-1
lines changed

README.md

Lines changed: 7 additions & 1 deletion
@@ -87,10 +87,12 @@ This tool acts as a **data aggregator** - it doesn't provide data itself, but co
 - **Kscien Hijacked Journals** - Legitimate journals that have been hijacked by predatory actors
 - **Kscien Predatory Conferences** - Database of predatory conferences
 - **DBLP Venues** - Curated computer science journals and conference series from DBLP XML
+- **CORE Conferences (ICORE/CORE)** - Ranked conference venues from the CORE portal
+- **CORE Journals (legacy)** - Ranked journals from the CORE journal portal (latest list: CORE2020)
 
 The tool analyzes publication patterns, citation metrics, and metadata quality to provide comprehensive coverage beyond traditional blacklist/whitelist approaches.
 
-**Note on Conference Assessment**: Conference checking now combines curated predatory conference signals (Kscien) with curated legitimacy signals from DBLP venue series. Coverage is stronger for computer science venues than for other domains.
+**Note on Conference Assessment**: Conference checking now combines curated predatory conference signals (Kscien) with curated legitimacy signals from DBLP and CORE/ICORE ranked venues. Coverage is strongest for computer science venues.
 
 ## Quick Start
 
@@ -122,6 +124,8 @@ These provide authoritative yes/no decisions for journals they cover:
 | **Kscien Hijacked Journals** | Hijacked journals | ~200 entries | Legitimate journals compromised by predatory actors |
 | **Kscien Predatory Conferences** | Predatory conferences | ~450 entries | Identified predatory conference venues |
 | **DBLP Venues** | Legitimate venues (CS) | dump-derived | Curated DBLP journals and conference series from local XML cache |
+| **CORE Conferences** | Legitimate ranked conferences | ~825 entries (ICORE2026 ranked) | CORE/ICORE conference rankings portal |
+| **CORE Journals (legacy)** | Legitimate ranked journals | ~582 entries (CORE2020 ranked) | CORE journal rankings portal (discontinued, no post-2020 updates) |
 | **Retraction Watch** | Quality indicator | ~27,000 journals | Retraction rates and patterns for quality assessment |
 | **Institutional Lists** | Custom whitelist/blacklist | Organization-specific | Local policy enforcement |
 
@@ -148,6 +152,7 @@ Journal Query → [Curated Databases + Pattern Analyzers] → Combined Assessmen
 ├─ UGC-CARE discontinued lists
 ├─ PredatoryJournals.org
 ├─ Kscien databases
+├─ CORE Conferences / CORE Journals
 ├─ Retraction Watch (quality)
 ├─ OpenAlex Analyzer (patterns)
 ├─ Crossref Analyzer (metadata)
@@ -271,6 +276,7 @@ To enhance coverage with Scopus data:
 - [Backend API](docs/api-reference/backends.md) - Creating custom backends
 - [Data Models](docs/api-reference/models.md) - Core data structures
 - [Extending Guide](docs/api-reference/extending-guide.md) - Extension patterns
+- [Backend Integration Docs](dev-notes/integration/README.md) - Source-specific backend documentation
 - [Contributing Guide](.github/community/CONTRIBUTING.md) - Development setup and guidelines
 - [Coding Standards](dev-notes/CODING_STANDARDS.md) - Code quality requirements
 

dev-notes/integration/README.md

Lines changed: 1 addition & 0 deletions
@@ -70,6 +70,7 @@ Each integration document should include:
 - **[DEPENDENCIES.md](../DEPENDENCIES.md)** - System and Python dependencies
 - **[CODING_STANDARDS.md](../CODING_STANDARDS.md)** - Code style and patterns
 - **[LOGGING_USAGE.md](../LOGGING_USAGE.md)** - Logging conventions
+- **[core.md](./core.md)** - CORE/ICORE conference and journal integration
 
 ## See Also
 

dev-notes/integration/core.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+# CORE Integration
+
+## Overview
+
+The CORE integration adds ranked conference and journal venue signals from the CORE/ICORE portal to assessment decisions.
+
+- **Conference source**: ICORE/CORE conference rankings portal
+- **Journal source**: CORE journal rankings portal (legacy/discontinued dataset)
+- **Classification**: Legitimate-list evidence for ranked entries
+
+## Data Sources
+
+### Conferences
+- **Portal**: `https://portal.core.edu.au/conf-ranks/`
+- **Default source filter**: `ICORE2026`
+- **Format**: HTML (paginated table)
+- **Fields used**: title, acronym, source, rank
+
+### Journals
+- **Portal**: `https://portal.core.edu.au/jnl-ranks/`
+- **Default source filter**: `CORE2020`
+- **Format**: HTML (paginated table)
+- **Fields used**: title, source, rank
+
+## Architecture
+
+### Data Source Components
+- **Module**: `src/aletheia_probe/updater/sources/core.py`
+- **Classes**:
+  - `CoreConferenceSource`
+  - `CoreJournalSource`
+- **Base class**: `DataSource`
+- **Update cadence**: Monthly (`30` days)
+
+### Backend Components
+- **Conference backend**: `src/aletheia_probe/backends/core_conferences.py`
+- **Journal backend**: `src/aletheia_probe/backends/core_journals.py`
+- **Base class**: `CachedBackend`
+- **Evidence type**: `LEGITIMATE_LIST`
+- **Cache TTL**: `24 * 30` hours
+
+### Configuration
+Defined in `src/aletheia_probe/config.py` (`DataSourceUrlConfig`):
+- `core_conference_rankings_url`
+- `core_journal_rankings_url`
+- `core_conference_default_source`
+- `core_journal_default_source`
+
+## Data Processing Rules
+
+1. Fetch paginated portal pages (`50` rows per page).
+2. Parse table rows into normalized venue entries.
+3. Keep only ranked entries with these rank labels:
+   - `A*`, `A`, `B`, `C`, `Australasian B`, `Australasian C`
+4. Exclude non-ranked/non-classification statuses (for example `Unranked`, `National`, `Journal Published`, `Not ranked`, `not primarily CS`).
+5. Deduplicate by normalized name before writing to cache.
+
+## Metadata Stored
+
+Each CORE entry stores source metadata in `metadata`:
+- `source_url`
+- `core_entity_type` (`conference` or `journal`)
+- `core_source` (for example `ICORE2026`, `CORE2020`)
+- `core_rank`
+- `core_acronym` (conference source only)
+
+## Usage
+
+```bash
+# Sync only CORE conferences
+aletheia-probe sync core_conferences
+
+# Sync only CORE journals
+aletheia-probe sync core_journals
+
+# Normal sync includes CORE backends when enabled
+aletheia-probe sync
+```
+
+## Limitations
+
+1. CORE journal rankings are legacy/discontinued and should be interpreted accordingly.
+2. Parsing depends on portal table structure and rank label conventions.
+3. Matching is normalized name based; source-side aliases/variants may still miss edge cases.
+
+## References
+
+- `src/aletheia_probe/updater/sources/core.py`
+- `src/aletheia_probe/backends/core_conferences.py`
+- `src/aletheia_probe/backends/core_journals.py`
+- `tests/unit/updater/test_core_source.py`
+- `tests/unit/backends/test_core_backends.py`
+- https://portal.core.edu.au/conf-ranks/
+- https://portal.core.edu.au/jnl-ranks/
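
The "Data Processing Rules" and "Metadata Stored" sections of `core.md` can be illustrated with a small sketch of steps 3 to 5 (rank filtering, exclusion, deduplication). It is not the code in `src/aletheia_probe/updater/sources/core.py`, which this page does not show: `normalize_name`, `filter_ranked`, and the row and entry dict shapes are assumptions made for illustration. Only the rank labels and metadata keys come from the integration notes.

```python
import re

# Rank labels kept as legitimate-list evidence (from the integration notes).
RANKED_LABELS = frozenset({"A*", "A", "B", "C", "Australasian B", "Australasian C"})


def normalize_name(name: str) -> str:
    """Hypothetical normalizer: lowercase, collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def filter_ranked(rows: list[dict[str, str]], source_url: str, entity_type: str) -> list[dict]:
    """Keep ranked rows only and deduplicate by normalized name (first wins)."""
    seen: set[str] = set()
    entries: list[dict] = []
    for row in rows:
        rank = row.get("rank", "").strip()
        if rank not in RANKED_LABELS:
            continue  # drops Unranked, National, Journal Published, ...
        key = normalize_name(row["title"])
        if not key or key in seen:
            continue
        seen.add(key)
        metadata = {
            "source_url": source_url,
            "core_entity_type": entity_type,
            "core_source": row.get("source", ""),
            "core_rank": rank,
        }
        # The acronym is stored for the conference source only.
        if entity_type == "conference" and row.get("acronym"):
            metadata["core_acronym"] = row["acronym"]
        entries.append({"name": row["title"], "normalized_name": key, "metadata": metadata})
    return entries
```

Using an allow-list of rank labels rather than a deny-list of statuses means an unfamiliar portal status is excluded by default instead of being treated as legitimacy evidence.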

src/aletheia_probe/backends/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,8 @@
 from . import (
     algerian_ministry,
     bealls,
+    core_conferences,
+    core_journals,
     crossref_analyzer,
     custom_list,
     dblp_venues,
@@ -28,6 +30,8 @@
 __all__ = [
     "algerian_ministry",
     "bealls",
+    "core_conferences",
+    "core_journals",
     "crossref_analyzer",
     "dblp_venues",
     "custom_list",

src/aletheia_probe/backends/core_conferences.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: MIT
+"""CORE/ICORE conference rankings backend for legitimate conference verification."""
+
+from typing import TYPE_CHECKING
+
+from ..enums import AssessmentType, EvidenceType
+from .base import CachedBackend, get_backend_registry
+
+
+if TYPE_CHECKING:
+    from ..updater.core import DataSource
+    from ..updater.sources.core import CoreConferenceSource
+
+
+class CoreConferencesBackend(CachedBackend):
+    """Backend that checks conferences against CORE/ICORE ranked venues."""
+
+    def __init__(self) -> None:
+        super().__init__(
+            source_name="core_conferences",
+            list_type=AssessmentType.LEGITIMATE,
+            cache_ttl_hours=24 * 30,
+        )
+        self._data_source: CoreConferenceSource | None = None
+
+    def get_name(self) -> str:
+        return "core_conferences"
+
+    def get_evidence_type(self) -> EvidenceType:
+        return EvidenceType.LEGITIMATE_LIST
+
+    def get_data_source(self) -> "DataSource | None":
+        if self._data_source is None:
+            from ..updater.sources.core import CoreConferenceSource
+
+            self._data_source = CoreConferenceSource()
+        return self._data_source
+
+
+get_backend_registry().register_factory(
+    "core_conferences",
+    lambda: CoreConferencesBackend(),
+    default_config={},
+)

src/aletheia_probe/backends/core_journals.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: MIT
+"""CORE journal rankings backend for legitimate journal verification."""
+
+from typing import TYPE_CHECKING
+
+from ..enums import AssessmentType, EvidenceType
+from .base import CachedBackend, get_backend_registry
+
+
+if TYPE_CHECKING:
+    from ..updater.core import DataSource
+    from ..updater.sources.core import CoreJournalSource
+
+
+class CoreJournalsBackend(CachedBackend):
+    """Backend that checks journals against CORE ranked journals."""
+
+    def __init__(self) -> None:
+        super().__init__(
+            source_name="core_journals",
+            list_type=AssessmentType.LEGITIMATE,
+            cache_ttl_hours=24 * 30,
+        )
+        self._data_source: CoreJournalSource | None = None
+
+    def get_name(self) -> str:
+        return "core_journals"
+
+    def get_evidence_type(self) -> EvidenceType:
+        return EvidenceType.LEGITIMATE_LIST
+
+    def get_data_source(self) -> "DataSource | None":
+        if self._data_source is None:
+            from ..updater.sources.core import CoreJournalSource
+
+            self._data_source = CoreJournalSource()
+        return self._data_source
+
+
+get_backend_registry().register_factory(
+    "core_journals",
+    lambda: CoreJournalsBackend(),
+    default_config={},
+)

src/aletheia_probe/config.py

Lines changed: 16 additions & 0 deletions
@@ -112,6 +112,22 @@ class DataSourceUrlConfig(BaseModel):
         "https://dblp.org/xml/dblp.xml.gz",
         description="URL for DBLP full XML dump",
     )
+    core_conference_rankings_url: str = Field(
+        "https://portal.core.edu.au/conf-ranks/",
+        description="URL for CORE/ICORE conference rankings portal",
+    )
+    core_journal_rankings_url: str = Field(
+        "https://portal.core.edu.au/jnl-ranks/",
+        description="URL for CORE journal rankings portal",
+    )
+    core_conference_default_source: str = Field(
+        "ICORE2026",
+        description="Default source filter for CORE conference rankings",
+    )
+    core_journal_default_source: str = Field(
+        "CORE2020",
+        description="Default source filter for CORE journal rankings",
+    )
 
 
 class DataSourceProcessingConfig(BaseModel):

src/aletheia_probe/updater/sources/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,7 @@
 
 from .algerian import AlgerianMinistrySource
 from .bealls import BeallsListSource
+from .core import CoreConferenceSource, CoreJournalSource
 from .custom import CustomListSource
 from .dblp import DblpVenueSource
 from .kscien_generic import KscienGenericSource
@@ -25,6 +26,8 @@
     "AlgerianMinistrySource",
     "BeallsListSource",
     "CustomListSource",
+    "CoreConferenceSource",
+    "CoreJournalSource",
     "DblpVenueSource",
     "KscienGenericSource",
     "KscienHijackedJournalsSource",
