* docs: Add missing backend-specific configuration documentation [AI-assisted]
Add comprehensive configuration documentation for backends that were
previously undocumented in configuration.md:
- PredatoryJournals.com backend: Community-maintained predatory lists
with monthly cache (720 hours)
- Kscien backend suite: Four backends (standalone journals, publishers,
hijacked journals, conferences) with weekly cache (168 hours)
- Scopus backend: Legitimate journal verification using user-provided
static files with monthly cache (720 hours)
- Cross-validator backend: Combines OpenAlex and Crossref with
cross-validation, requires email configuration for API access
Each backend section includes:
- YAML configuration examples showing cache_ttl_hours settings
- Conceptual explanations of configuration options
- Guidance on when to adjust cache durations
- References to backend implementation files
Documentation focuses on configuration concepts rather than
implementation details, helping users understand what each option
controls and when to modify settings.
Closes #207
* docs: Remove version-specific counts and fix email format [AI-assisted]
Address review feedback:
- Remove specific entry counts (1476+, 1271+, 234+) from Kscien backend
descriptions as these change over time
- Fix email format from noreply.aletheia-probe.org to
noreply@aletheia-probe.org throughout documentation
---------
Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
docs/configuration.md
113 additions & 1 deletion
@@ -69,7 +69,7 @@ backends:
 
 Several backends (crossref_analyzer, openalex_analyzer, cross_validator) use the `email` parameter for API identification and rate limiting. These APIs follow "polite pool" access patterns and require contact information for higher rate limits.
 
-**Default Behavior**: If no email is configured, backends use `noreply.aletheia-probe.org` as a default contact address.
+**Default Behavior**: If no email is configured, backends use `noreply@aletheia-probe.org` as a default contact address.
 
 **Recommended Configuration**: Configure your own email address to:
 - Comply with API provider policies
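
The default-email behavior described in this hunk can be illustrated with a minimal sketch. The function name `resolve_contact_email` and the shape of the settings dictionary are assumptions made for illustration; only the default address comes from the documentation, and the actual backends may resolve the value differently.

```python
# Default contact address as stated in the documentation.
DEFAULT_CONTACT_EMAIL = "noreply@aletheia-probe.org"


def resolve_contact_email(backend_settings: dict) -> str:
    """Return the configured email, or the documented default if none is set.

    `backend_settings` is a hypothetical dict parsed from one backend's
    YAML entry (for example the `cross_validator` block).
    """
    email = (backend_settings.get("email") or "").strip()
    return email if email else DEFAULT_CONTACT_EMAIL
```

With this sketch, an entry that omits `email` (or leaves it empty) falls back to the default address, while a configured address is passed through with surrounding whitespace removed.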
@@ -187,6 +187,118 @@ backends:
 - `blacklist_file`: CSV file with disapproved journals
 - `category_weights`: Weights for different journal categories
 
+### PredatoryJournals.com Backend
+
+```yaml
+backends:
+  predatoryjournals:
+    enabled: true
+    weight: 0.9
+    timeout: 5
+    config:
+      cache_ttl_hours: 720  # 30 days - monthly cache for community lists
+```
+
+**Configuration**:
+- `cache_ttl_hours`: How long cached predatory journal list data remains valid before requiring re-sync. Default is 720 hours (30 days). The predatoryjournals.org lists are community-maintained and updated monthly, so longer cache periods are appropriate.
+
+See `src/aletheia_probe/backends/predatoryjournals.py`
+
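
To make the meaning of `cache_ttl_hours` concrete, here is a minimal sketch of a freshness check. The function `cache_is_fresh` and its parameters are hypothetical and are not taken from `predatoryjournals.py`; the sketch only assumes that cached data counts as valid while its age is below the configured number of hours.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cache_is_fresh(
    last_sync: datetime,
    cache_ttl_hours: int,
    now: Optional[datetime] = None,
) -> bool:
    """Return True while the cached data is younger than the TTL.

    `last_sync` is the (timezone-aware) time of the last successful sync.
    `now` can be injected for testing; it defaults to the current UTC time.
    """
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_sync < timedelta(hours=cache_ttl_hours)
```

Under this assumption, a list synced 29 days ago is still served from cache with the 720-hour default, while one synced 31 days ago would trigger a re-sync.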
+### Kscien Backends
+
+The Kscien suite provides curated lists of predatory journals, publishers, hijacked journals, and conferences. All Kscien backends share the same configuration pattern.
+
+```yaml
+backends:
+  kscien_standalone_journals:
+    enabled: true
+    weight: 0.9
+    timeout: 5
+    config:
+      cache_ttl_hours: 168  # 7 days - weekly cache
+
+  kscien_publishers:
+    enabled: true
+    weight: 0.9
+    timeout: 5
+    config:
+      cache_ttl_hours: 168  # 7 days - weekly cache
+
+  kscien_hijacked_journals:
+    enabled: true
+    weight: 1.0
+    timeout: 5
+    config:
+      cache_ttl_hours: 168  # 7 days - weekly cache
+
+  kscien_predatory_conferences:
+    enabled: true
+    weight: 0.8
+    timeout: 5
+    config:
+      cache_ttl_hours: 168  # 7 days - weekly cache
+```
+
+**Configuration**:
+- `cache_ttl_hours`: How long cached list data remains valid. Default is 168 hours (7 days). Kscien lists are updated weekly, so weekly cache refresh is recommended. Increase for more stable environments, decrease if you need the latest additions.
+
+**Backend Descriptions**:
+- `kscien_standalone_journals`: Checks against standalone predatory journals
+- `kscien_publishers`: Checks against predatory publishers
+- `kscien_hijacked_journals`: Identifies hijacked journals (clones of legitimate journals)
+- `kscien_predatory_conferences`: Checks against predatory conference lists
+
+See backend implementations in `src/aletheia_probe/backends/kscien_*.py`
+
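
Because the four Kscien entries differ only in their weight, the shared pattern can be sketched as a small generator. The helper `build_kscien_config` is hypothetical and exists only for illustration; the backend names, weights, timeout, and TTL are copied from the YAML example in this section.

```python
# Weights copied from the YAML example; everything else is shared.
KSCIEN_WEIGHTS = {
    "kscien_standalone_journals": 0.9,
    "kscien_publishers": 0.9,
    "kscien_hijacked_journals": 1.0,
    "kscien_predatory_conferences": 0.8,
}


def build_kscien_config(cache_ttl_hours: int = 168, timeout: int = 5) -> dict:
    """Build the `backends` mapping for all four Kscien backends.

    Hypothetical helper: it mirrors the documented YAML structure so that a
    single `cache_ttl_hours` value can be applied to the whole suite.
    """
    return {
        "backends": {
            name: {
                "enabled": True,
                "weight": weight,
                "timeout": timeout,
                "config": {"cache_ttl_hours": cache_ttl_hours},
            }
            for name, weight in KSCIEN_WEIGHTS.items()
        }
    }
```

Calling `build_kscien_config(cache_ttl_hours=336)` would, for instance, produce the same four entries with a two-week cache, which could then be serialized to YAML by whatever tooling is in use.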
+### Scopus Backend
+
+```yaml
+backends:
+  scopus:
+    enabled: true
+    weight: 1.2
+    timeout: 5
+    config:
+      cache_ttl_hours: 720  # 30 days - monthly cache
+```
+
+**Configuration**:
+- `cache_ttl_hours`: How long cached Scopus data remains valid. Default is 720 hours (30 days). Since Scopus uses user-provided static files, longer cache periods are appropriate.
+
+**Important Notes**:
+- Scopus backend requires manual setup - users must download and place Scopus journal list Excel file in `~/.aletheia-probe/scopus/`
+- This backend identifies legitimate journals indexed in Scopus
+- Backend remains inactive until Scopus data file is provided
+
+See `src/aletheia_probe/backends/scopus.py`
+
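
The "inactive until a data file is provided" note can be checked before a run with a short sketch like the one below. The function `find_scopus_file` and the assumption that the Excel file carries an `.xlsx` extension are illustrative only; the actual detection logic in `scopus.py` may differ.

```python
from pathlib import Path
from typing import Optional

# Directory named in the documentation.
DEFAULT_SCOPUS_DIR = Path.home() / ".aletheia-probe" / "scopus"


def find_scopus_file(data_dir: Path = DEFAULT_SCOPUS_DIR) -> Optional[Path]:
    """Return an Excel file found in `data_dir`, or None if there is none.

    Assumption: the journal list is an `.xlsx` file. If several files are
    present, the last one in alphabetical order is returned.
    """
    if not data_dir.is_dir():
        return None
    candidates = sorted(data_dir.glob("*.xlsx"))
    return candidates[-1] if candidates else None
```

A `None` result corresponds to the situation in which the backend would stay inactive; any returned path indicates that a candidate data file is in place.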
+### Cross-Validator Backend
+
+```yaml
+backends:
+  cross_validator:
+    enabled: true
+    weight: 1.3
+    timeout: 20
+    email: "your.email@institution.org"
+    config:
+      cache_ttl_hours: 24  # Cache query results for 24 hours
+```
+
+**Configuration**:
+- `email`: Contact email for API identification (OpenAlex and Crossref). Default is `noreply@aletheia-probe.org`. Configure your own email for better rate limits and API compliance.
+- `cache_ttl_hours`: How long individual query results are cached. Default is 24 hours. Cross-validator performs API queries to both OpenAlex and Crossref, so caching reduces API load.
+
+**Purpose**:
+Cross-validator combines and cross-validates data from OpenAlex and Crossref backends. It performs consistency checks on publisher names, publication volumes, DOAJ listings, and activity timelines across both sources. When backends agree, confidence is boosted; when they disagree, confidence is reduced.
+
+**When to Adjust**:
+- Set shorter `cache_ttl_hours` (1-6 hours) when assessing newly published journals or during active research
+- Set longer `cache_ttl_hours` (48-168 hours) for batch processing or when API rate limits are a concern
+- Configure `email` to comply with API polite pool policies and get better rate limits
+
+See `src/aletheia_probe/backends/cross_validator.py`
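
The boost-on-agreement, reduce-on-disagreement idea from the Purpose paragraph can be sketched as follows. This is not the algorithm in `cross_validator.py`: the function `adjust_confidence`, the fixed step of 0.05, and the check names are all assumptions chosen to illustrate the concept.

```python
def adjust_confidence(base: float, checks: dict, step: float = 0.05) -> float:
    """Raise or lower a confidence score based on per-check agreement.

    `checks` maps a check name (for example "publisher_name") to True when
    OpenAlex and Crossref agree and False when they disagree. The step size
    is a hypothetical value; the result is clamped to the range [0.0, 1.0].
    """
    adjusted = base
    for agrees in checks.values():
        adjusted += step if agrees else -step
    return max(0.0, min(1.0, adjusted))


# Example input with hypothetical check names taken from the documented list
# of consistency checks.
example_checks = {
    "publisher_name": True,
    "publication_volume": True,
    "doaj_listing": False,
    "activity_timeline": True,
}
```

In this sketch three agreements and one disagreement yield a net increase of two steps over the base score, which shows how mixed evidence moves the result less than unanimous agreement.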