Commit 4b10972

feat: add UGC-CARE discontinued list backends (#1019)
* feat: add UGC-CARE discontinued list backends [AI-assisted]

  Add dedicated sync-backed sources and cached backends for UGC-CARE cloned Group I, cloned Group II, and delisted Group II lists.

  Update backend/source registration, config URL defaults, output formatter list presence, and docs (README + API/config references) so the new backends are discoverable and usable.

  Include unit tests for source parsing, registration, and backend metadata to reduce regression risk for future list-format changes.

* feat: add included-side UGC clone backends [AI-assisted]

  Add legitimate-list backends for left-side included journals from UGC clone correction pages (Group I and Group II) to prevent ISSN-only false impressions.

  Fix clone parser side mapping so cloned records use cloned-side ISSN/eISSN instead of original-side identifiers.

  Extend tests and documentation to cover both included and cloned sides of UGC clone pages.

* fix: satisfy mypy title typing in UGC parser [AI-assisted]

  Guard clone-side title extraction with explicit type narrowing before calling _build_entry so mypy can prove non-optional str arguments.

  This preserves runtime behavior while eliminating strict type-check failures in ugc_care source parsing.

---------

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent 9f9d934 commit 4b10972
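
The parser changes described in the commit message (cloned-side identifier mapping and the mypy type narrowing) are not shown on this page. As a rough illustration only, the following sketch shows the general idea. `CloneRow`, `build_entry`, and `cloned_entries` are hypothetical names invented for this example; `build_entry` merely stands in for the `_build_entry` helper mentioned in the message, whose real signature is not visible here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CloneRow:
    """Hypothetical parsed row of a clone-correction page (not the project's type)."""

    included_title: str | None  # left side: the genuine, included journal
    included_issn: str | None
    cloned_title: str | None  # right side: the clone
    cloned_issn: str | None


def build_entry(title: str, issn: str | None) -> dict[str, str | None]:
    """Stand-in for `_build_entry`; it requires a non-optional title."""
    return {"title": title, "issn": issn}


def cloned_entries(rows: list[CloneRow]) -> list[dict[str, str | None]]:
    """Build entries for the cloned side only."""
    entries: list[dict[str, str | None]] = []
    for row in rows:
        title = row.cloned_title
        # Explicit narrowing: past this check a type checker treats `title` as `str`.
        if title is None or not title.strip():
            continue
        # Use the cloned-side ISSN, never the included (original) side.
        entries.append(build_entry(title.strip(), row.cloned_issn))
    return entries
```

Rows without a usable cloned-side title are skipped rather than passed on as `None`, which is one way to keep runtime behavior unchanged while satisfying a strict type check.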

16 files changed: 1058 additions, 7 deletions

README.md

Lines changed: 17 additions & 6 deletions
@@ -63,14 +63,18 @@ aletheia-probe journal --format json "Nature Reviews Drug Discovery"

 **Output**: Combines data from multiple authoritative sources and advanced pattern analysis to provide confidence-scored assessments of journal legitimacy.

-**Note**: The first sync downloads and processes data from multiple sources (DOAJ, Beall's List, etc.), which takes a few minutes. After that, queries typically complete in under 5 seconds.
+**Note**: The first sync downloads and processes data from multiple sources (DOAJ, Beall's List, UGC-CARE discontinued lists, etc.), which takes a few minutes. After that, queries typically complete in under 5 seconds.

 ## Data Sources

 This tool acts as a **data aggregator** - it doesn't provide data itself, but combines information from multiple authoritative sources:

 - **DOAJ** - Directory of Open Access Journals
 - **Beall's List** - Historical predatory journal archives
+- **UGC-CARE Cloned (Group I)** - UGC-CARE discontinued cloned journal list
+- **UGC-CARE Cloned (Group II)** - UGC-CARE discontinued cloned journal list
+- **UGC-CARE Delisted (Group II)** - UGC-CARE discontinued delisted journal list
+- **UGC-CARE Included from Clone Page (Group I/II)** - Left-side included journals from public clone-correction pages
 - **Algerian Ministry** - Algerian Ministry of Higher Education predatory journals list
 - **OpenAlex** - Publication pattern analysis
 - **Crossref** - Metadata quality assessment
@@ -105,6 +109,11 @@ These provide authoritative yes/no decisions for journals they cover:
 | **DOAJ** | Legitimate OA journals | 22,000+ journals | Gold standard for open access legitimacy |
 | **Scopus** (optional) | Legitimate indexed journals | 30,000+ journals | Major subscription and OA journals |
 | **Beall's List** | Predatory journal archives | ~2,900 entries | Historically identified predatory publishers |
+| **UGC-CARE Cloned (Group I)** | Cloned journals | ~80 entries | Public UGC-CARE discontinued Group I clone list |
+| **UGC-CARE Cloned (Group II)** | Cloned journals | ~114 entries | Public UGC-CARE discontinued Group II clone list |
+| **UGC-CARE Delisted (Group II)** | Delisted journals | ~12 entries | Public UGC-CARE discontinued Group II delisted list |
+| **UGC-CARE Included from Clone Page (Group I)** | Included journals | ~80 entries | Left-side included journals from Group I clone-correction page |
+| **UGC-CARE Included from Clone Page (Group II)** | Included journals | ~114 entries | Left-side included journals from Group II clone-correction page |
 | **PredatoryJournals.org** | Predatory journals/publishers | 15,000+ entries | Curated lists from predatoryjournals.org |
 | **Algerian Ministry** | Predatory journal list | ~3,300 entries | Ministry of Higher Education predatory journals |
 | **Kscien Standalone Journals** | Predatory journals | 1,400+ entries | Individual predatory journals identified by Kscien |
@@ -134,6 +143,7 @@ Journal Query → [Curated Databases + Pattern Analyzers] → Combined Assessmen
 ├─ DOAJ (legitimate OA)
 ├─ Scopus (indexed journals)
 ├─ Beall's List (predatory)
+├─ UGC-CARE discontinued lists
 ├─ PredatoryJournals.org
 ├─ Kscien databases
 ├─ Retraction Watch (quality)
@@ -145,6 +155,7 @@ Journal Query → [Curated Databases + Pattern Analyzers] → Combined Assessmen
 **Note**: Not all backends will find every journal. A journal may be:
 - Found in DOAJ → strong legitimate evidence
 - Found in Beall's → strong predatory evidence
+- Found in UGC-CARE cloned/delisted lists → strong predatory evidence
 - Not found in any curated database → rely on pattern analysis
 - Found in contradictory sources → cross-validation resolves conflicts

@@ -206,16 +217,16 @@ Reasoning: "Found in Scopus with excellent publication patterns and metadata qua

 #### **Scenario 2: Known Predatory Journal**
 ```
-Input: "International Journal of Advanced Computer Science and Applications"
+Input: "Journal Appearing in UGC-CARE Cloned Group II"

 ├─ DOAJ: ✗ Not found
-├─ Predatory Lists: ✓ Found in Kscien database → "predatory"
+├─ Predatory Lists: ✓ Found in UGC-CARE Cloned (Group II) → "predatory"
 ├─ Retraction Watch: ✗ Not found
-├─ OpenAlex: ✓ Found → High volume (>800/year), low citations
-├─ Crossref: ✓ Found → Poor metadata quality
+├─ OpenAlex: ✓ Found → suspicious pattern indicators
+├─ Crossref: ✓ Found → weak metadata quality

 Result: PREDATORY (confidence: 0.90)
-Reasoning: "Listed in Kscien predatory database, confirmed by publication patterns"
+Reasoning: "Listed in UGC-CARE cloned journal list, corroborated by pattern analysis"
 ```

 #### **Scenario 3: Unknown Journal (Pattern Analysis)**

docs/api-reference/backends.md

Lines changed: 5 additions & 0 deletions
@@ -103,6 +103,11 @@ YAML structure:
 | Backend | Purpose | Source |
 |---------|---------|--------|
 | **bealls** | Beall's List archive | `src/aletheia_probe/backends/bealls.py` |
+| **ugc_care_cloned** | UGC-CARE Group I cloned journals | `src/aletheia_probe/backends/ugc_care_cloned.py` |
+| **ugc_care_cloned_group2** | UGC-CARE Group II cloned journals | `src/aletheia_probe/backends/ugc_care_cloned_group2.py` |
+| **ugc_care_delisted_group2** | UGC-CARE Group II delisted journals | `src/aletheia_probe/backends/ugc_care_delisted_group2.py` |
+| **ugc_care_included_from_clone_group1** | UGC-CARE Group I included journals from clone page | `src/aletheia_probe/backends/ugc_care_included_from_clone_group1.py` |
+| **ugc_care_included_from_clone_group2** | UGC-CARE Group II included journals from clone page | `src/aletheia_probe/backends/ugc_care_included_from_clone_group2.py` |
 | **predatoryjournals** | PredatoryJournals.com database | `src/aletheia_probe/backends/predatoryjournals.py` |
 | **algerian_ministry** | Algerian Ministry predatory list | `src/aletheia_probe/backends/algerian_ministry.py` |
 | **kscien_standalone_journals** | Kscien standalone journals | `src/aletheia_probe/backends/kscien_standalone_journals.py` |

docs/configuration.md

Lines changed: 58 additions & 1 deletion
@@ -204,6 +204,63 @@ backends:

 See `src/aletheia_probe/backends/predatoryjournals.py`

+### UGC-CARE Discontinued Lists Backends
+
+```yaml
+backends:
+  ugc_care_cloned:
+    enabled: true
+    weight: 0.9
+    timeout: 5
+    config:
+      cache_ttl_hours: 720 # 30 days - discontinued static list
+
+  ugc_care_cloned_group2:
+    enabled: true
+    weight: 0.9
+    timeout: 5
+    config:
+      cache_ttl_hours: 720 # 30 days - discontinued static list
+
+  ugc_care_delisted_group2:
+    enabled: true
+    weight: 0.9
+    timeout: 5
+    config:
+      cache_ttl_hours: 720 # 30 days - discontinued static list
+
+  ugc_care_included_from_clone_group1:
+    enabled: true
+    weight: 1.0
+    timeout: 5
+    config:
+      cache_ttl_hours: 720 # 30 days - discontinued static list
+
+  ugc_care_included_from_clone_group2:
+    enabled: true
+    weight: 1.0
+    timeout: 5
+    config:
+      cache_ttl_hours: 720 # 30 days - discontinued static list
+```
+
+**Configuration**:
+- `cache_ttl_hours`: How long cached UGC-CARE list data remains valid before requiring re-sync. Default is 720 hours (30 days), appropriate for discontinued/frozen sources.
+
+**Backend Descriptions**:
+- `ugc_care_cloned`: UGC-CARE Group I cloned journals list
+- `ugc_care_cloned_group2`: UGC-CARE Group II cloned journals list
+- `ugc_care_delisted_group2`: UGC-CARE Group II delisted journals list
+- `ugc_care_included_from_clone_group1`: UGC-CARE Group I included journals from clone correction page (left side)
+- `ugc_care_included_from_clone_group2`: UGC-CARE Group II included journals from clone correction page (left side)
+
+See implementations in:
+- `src/aletheia_probe/backends/ugc_care_cloned.py`
+- `src/aletheia_probe/backends/ugc_care_cloned_group2.py`
+- `src/aletheia_probe/backends/ugc_care_delisted_group2.py`
+- `src/aletheia_probe/backends/ugc_care_included_from_clone_group1.py`
+- `src/aletheia_probe/backends/ugc_care_included_from_clone_group2.py`
+
 ### Kscien Backends

 The Kscien suite provides curated lists of predatory journals, publishers, hijacked journals, and conferences. All Kscien backends share the same configuration pattern.
@@ -613,4 +670,4 @@ aletheia-probe status

 # Test with verbose output
 aletheia-probe journal --verbose "Test Journal"
-```
+```
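
The `cache_ttl_hours` option documented in docs/configuration.md says cached list data stays valid for 720 hours (30 days) before a re-sync is required. The commit does not show how the project evaluates this, so the following is only a minimal sketch of such a freshness check. `is_cache_stale` is a hypothetical helper, not part of aletheia-probe.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL_HOURS = 720  # 24 * 30, the default documented for the UGC-CARE backends


def is_cache_stale(
    last_sync: datetime, now: datetime, ttl_hours: int = CACHE_TTL_HOURS
) -> bool:
    """Return True when the cached list is older than the TTL (hypothetical helper)."""
    return now - last_sync > timedelta(hours=ttl_hours)


if __name__ == "__main__":
    now = datetime(2025, 1, 31, tzinfo=timezone.utc)
    # Synced exactly 30 days ago: still within the TTL under this strict comparison.
    print(is_cache_stale(datetime(2025, 1, 1, tzinfo=timezone.utc), now))
    # Synced 31 days ago: stale, so a re-sync would be due.
    print(is_cache_stale(datetime(2024, 12, 31, tzinfo=timezone.utc), now))
```

Whether the boundary case counts as stale is a design choice; this sketch treats data aged exactly 720 hours as still valid.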

src/aletheia_probe/backends/__init__.py

Lines changed: 10 additions & 0 deletions
@@ -16,6 +16,11 @@
     predatoryjournals,
     retraction_watch,
     scopus,
+    ugc_care_cloned,
+    ugc_care_cloned_group2,
+    ugc_care_delisted_group2,
+    ugc_care_included_from_clone_group1,
+    ugc_care_included_from_clone_group2,
 )


@@ -33,4 +38,9 @@
     "predatoryjournals",
     "retraction_watch",
     "scopus",
+    "ugc_care_cloned",
+    "ugc_care_cloned_group2",
+    "ugc_care_delisted_group2",
+    "ugc_care_included_from_clone_group1",
+    "ugc_care_included_from_clone_group2",
 ]

src/aletheia_probe/backends/ugc_care_cloned.py

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# SPDX-License-Identifier: MIT
+"""UGC-CARE cloned journals backend."""
+
+from typing import TYPE_CHECKING
+
+from ..enums import AssessmentType, EvidenceType
+from .base import CachedBackend, get_backend_registry
+
+
+if TYPE_CHECKING:
+    from ..updater.core import DataSource
+    from ..updater.sources.ugc_care import UgcCareClonedSource
+
+
+class UgcCareClonedBackend(CachedBackend):
+    """Backend for UGC-CARE Group-I cloned journals list."""
+
+    def __init__(self) -> None:
+        super().__init__(
+            source_name="ugc_care_cloned",
+            list_type=AssessmentType.PREDATORY,
+            cache_ttl_hours=24 * 30,
+        )
+        self._data_source: UgcCareClonedSource | None = None
+
+    def get_name(self) -> str:
+        return "ugc_care_cloned"
+
+    def get_evidence_type(self) -> EvidenceType:
+        return EvidenceType.PREDATORY_LIST
+
+    def get_data_source(self) -> "DataSource | None":
+        if self._data_source is None:
+            from ..updater.sources.ugc_care import UgcCareClonedSource
+
+            self._data_source = UgcCareClonedSource()
+        return self._data_source
+
+
+get_backend_registry().register_factory(
+    "ugc_care_cloned", lambda: UgcCareClonedBackend(), default_config={}
+)
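
Each new backend module follows the same shape: a class that creates its data source lazily, plus a module-level `register_factory` call. The registry implementation itself is not part of this page, so the sketch below only illustrates the general pattern under stated assumptions. `BackendRegistry`, `Backend`, and `DemoClonedBackend` are hypothetical stand-ins, and the real `get_backend_registry()` may behave differently.

```python
from __future__ import annotations

from typing import Any, Callable


class Backend:
    """Minimal stand-in for a backend base class (hypothetical)."""

    def get_name(self) -> str:
        raise NotImplementedError


class BackendRegistry:
    """Toy registry mapping backend names to zero-argument factories."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[[], Backend]] = {}
        self._defaults: dict[str, dict[str, Any]] = {}

    def register_factory(
        self,
        name: str,
        factory: Callable[[], Backend],
        default_config: dict[str, Any] | None = None,
    ) -> None:
        self._factories[name] = factory
        self._defaults[name] = dict(default_config or {})

    def create(self, name: str) -> Backend:
        # A fresh instance is built on every call; nothing is built at import time.
        return self._factories[name]()

    def names(self) -> list[str]:
        return sorted(self._factories)


class DemoClonedBackend(Backend):
    def __init__(self) -> None:
        self._data_source: object | None = None  # created on first use

    def get_name(self) -> str:
        return "demo_cloned"

    def get_data_source(self) -> object:
        if self._data_source is None:
            # In the real modules a deferred import happens here.
            self._data_source = object()
        return self._data_source


registry = BackendRegistry()
registry.register_factory(
    "demo_cloned", lambda: DemoClonedBackend(), default_config={}
)
```

Deferring both the backend construction and the data-source creation keeps module import cheap, which is presumably why the modules in this commit import the source class inside `get_data_source` and only under `TYPE_CHECKING` at module level.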

src/aletheia_probe/backends/ugc_care_cloned_group2.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: MIT
+"""UGC-CARE Group-II cloned journals backend."""
+
+from typing import TYPE_CHECKING
+
+from ..enums import AssessmentType, EvidenceType
+from .base import CachedBackend, get_backend_registry
+
+
+if TYPE_CHECKING:
+    from ..updater.core import DataSource
+    from ..updater.sources.ugc_care import UgcCareClonedGroup2Source
+
+
+class UgcCareClonedGroup2Backend(CachedBackend):
+    """Backend for UGC-CARE Group-II cloned journals list."""
+
+    def __init__(self) -> None:
+        super().__init__(
+            source_name="ugc_care_cloned_group2",
+            list_type=AssessmentType.PREDATORY,
+            cache_ttl_hours=24 * 30,
+        )
+        self._data_source: UgcCareClonedGroup2Source | None = None
+
+    def get_name(self) -> str:
+        return "ugc_care_cloned_group2"
+
+    def get_evidence_type(self) -> EvidenceType:
+        return EvidenceType.PREDATORY_LIST
+
+    def get_data_source(self) -> "DataSource | None":
+        if self._data_source is None:
+            from ..updater.sources.ugc_care import UgcCareClonedGroup2Source
+
+            self._data_source = UgcCareClonedGroup2Source()
+        return self._data_source
+
+
+get_backend_registry().register_factory(
+    "ugc_care_cloned_group2",
+    lambda: UgcCareClonedGroup2Backend(),
+    default_config={},
+)

src/aletheia_probe/backends/ugc_care_delisted_group2.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: MIT
+"""UGC-CARE Group-II delisted journals backend."""
+
+from typing import TYPE_CHECKING
+
+from ..enums import AssessmentType, EvidenceType
+from .base import CachedBackend, get_backend_registry
+
+
+if TYPE_CHECKING:
+    from ..updater.core import DataSource
+    from ..updater.sources.ugc_care import UgcCareDelistedGroup2Source
+
+
+class UgcCareDelistedGroup2Backend(CachedBackend):
+    """Backend for UGC-CARE Group-II delisted journals."""
+
+    def __init__(self) -> None:
+        super().__init__(
+            source_name="ugc_care_delisted_group2",
+            list_type=AssessmentType.PREDATORY,
+            cache_ttl_hours=24 * 30,
+        )
+        self._data_source: UgcCareDelistedGroup2Source | None = None
+
+    def get_name(self) -> str:
+        return "ugc_care_delisted_group2"
+
+    def get_evidence_type(self) -> EvidenceType:
+        return EvidenceType.PREDATORY_LIST
+
+    def get_data_source(self) -> "DataSource | None":
+        if self._data_source is None:
+            from ..updater.sources.ugc_care import UgcCareDelistedGroup2Source
+
+            self._data_source = UgcCareDelistedGroup2Source()
+        return self._data_source
+
+
+get_backend_registry().register_factory(
+    "ugc_care_delisted_group2",
+    lambda: UgcCareDelistedGroup2Backend(),
+    default_config={},
+)

src/aletheia_probe/backends/ugc_care_included_from_clone_group1.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: MIT
+"""UGC-CARE included (Group I clone-page left side) backend."""
+
+from typing import TYPE_CHECKING
+
+from ..enums import AssessmentType, EvidenceType
+from .base import CachedBackend, get_backend_registry
+
+
+if TYPE_CHECKING:
+    from ..updater.core import DataSource
+    from ..updater.sources.ugc_care import UgcCareIncludedFromCloneGroup1Source
+
+
+class UgcCareIncludedFromCloneGroup1Backend(CachedBackend):
+    """Backend for included journals listed on Group-I clone correction page."""
+
+    def __init__(self) -> None:
+        super().__init__(
+            source_name="ugc_care_included_from_clone_group1",
+            list_type=AssessmentType.LEGITIMATE,
+            cache_ttl_hours=24 * 30,
+        )
+        self._data_source: UgcCareIncludedFromCloneGroup1Source | None = None
+
+    def get_name(self) -> str:
+        return "ugc_care_included_from_clone_group1"
+
+    def get_evidence_type(self) -> EvidenceType:
+        return EvidenceType.LEGITIMATE_LIST
+
+    def get_data_source(self) -> "DataSource | None":
+        if self._data_source is None:
+            from ..updater.sources.ugc_care import UgcCareIncludedFromCloneGroup1Source
+
+            self._data_source = UgcCareIncludedFromCloneGroup1Source()
+        return self._data_source
+
+
+get_backend_registry().register_factory(
+    "ugc_care_included_from_clone_group1",
+    lambda: UgcCareIncludedFromCloneGroup1Backend(),
+    default_config={},
+)

src/aletheia_probe/backends/ugc_care_included_from_clone_group2.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: MIT
+"""UGC-CARE included (Group II clone-page left side) backend."""
+
+from typing import TYPE_CHECKING
+
+from ..enums import AssessmentType, EvidenceType
+from .base import CachedBackend, get_backend_registry
+
+
+if TYPE_CHECKING:
+    from ..updater.core import DataSource
+    from ..updater.sources.ugc_care import UgcCareIncludedFromCloneGroup2Source
+
+
+class UgcCareIncludedFromCloneGroup2Backend(CachedBackend):
+    """Backend for included journals listed on Group-II clone correction page."""
+
+    def __init__(self) -> None:
+        super().__init__(
+            source_name="ugc_care_included_from_clone_group2",
+            list_type=AssessmentType.LEGITIMATE,
+            cache_ttl_hours=24 * 30,
+        )
+        self._data_source: UgcCareIncludedFromCloneGroup2Source | None = None
+
+    def get_name(self) -> str:
+        return "ugc_care_included_from_clone_group2"
+
+    def get_evidence_type(self) -> EvidenceType:
+        return EvidenceType.LEGITIMATE_LIST
+
+    def get_data_source(self) -> "DataSource | None":
+        if self._data_source is None:
+            from ..updater.sources.ugc_care import UgcCareIncludedFromCloneGroup2Source
+
+            self._data_source = UgcCareIncludedFromCloneGroup2Source()
+        return self._data_source
+
+
+get_backend_registry().register_factory(
+    "ugc_care_included_from_clone_group2",
+    lambda: UgcCareIncludedFromCloneGroup2Backend(),
+    default_config={},
+)
