Commit f873078

docs: Add missing backend-specific configuration documentation [AI-assisted] (#217)
* docs: Add missing backend-specific configuration documentation [AI-assisted]

  Add comprehensive configuration documentation for backends that were previously undocumented in configuration.md:
  - PredatoryJournals.com backend: Community-maintained predatory lists with monthly cache (720 hours)
  - Kscien backend suite: Four backends (standalone journals, publishers, hijacked journals, conferences) with weekly cache (168 hours)
  - Scopus backend: Legitimate journal verification using user-provided static files with monthly cache (720 hours)
  - Cross-validator backend: Combines OpenAlex and Crossref with cross-validation, requires email configuration for API access

  Each backend section includes:
  - YAML configuration examples showing cache_ttl_hours settings
  - Conceptual explanations of configuration options
  - Guidance on when to adjust cache durations
  - References to backend implementation files

  Documentation focuses on configuration concepts rather than implementation details, helping users understand what each option controls and when to modify settings.

  Closes #207

* docs: Remove version-specific counts and fix email format [AI-assisted]

  Address review feedback:
  - Remove specific entry counts (1476+, 1271+, 234+) from Kscien backend descriptions as these change over time
  - Fix email format from noreply.aletheia-probe.org to noreply@aletheia-probe.org throughout documentation

---------

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent db2c5db commit f873078

1 file changed

+113
-1
lines changed

docs/configuration.md

Lines changed: 113 additions & 1 deletion
@@ -69,7 +69,7 @@ backends:

Several backends (crossref_analyzer, openalex_analyzer, cross_validator) use the `email` parameter for API identification and rate limiting. These APIs follow "polite pool" access patterns and require contact information for higher rate limits.

-**Default Behavior**: If no email is configured, backends use `noreply.aletheia-probe.org` as a default contact address.
+**Default Behavior**: If no email is configured, backends use `noreply@aletheia-probe.org` as a default contact address.

**Recommended Configuration**: Configure your own email address to:
- Comply with API provider policies
@@ -187,6 +187,118 @@ backends:
- `blacklist_file`: CSV file with disapproved journals
- `category_weights`: Weights for different journal categories

### PredatoryJournals.com Backend

```yaml
backends:
  predatoryjournals:
    enabled: true
    weight: 0.9
    timeout: 5
    config:
      cache_ttl_hours: 720  # 30 days - monthly cache for community lists
```

**Configuration**:
- `cache_ttl_hours`: How long cached predatory journal list data remains valid before requiring re-sync. Default is 720 hours (30 days). The predatoryjournals.org lists are community-maintained and updated monthly, so longer cache periods are appropriate.

See `src/aletheia_probe/backends/predatoryjournals.py`
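
As a rough illustration of what `cache_ttl_hours` controls, the sketch below checks whether cached list data is still within its TTL. The function name `is_cache_fresh` and the `last_synced` timestamp are hypothetical and not taken from `predatoryjournals.py`; the real backend may track freshness differently.

```python
from datetime import datetime, timedelta, timezone


def is_cache_fresh(last_synced: datetime, cache_ttl_hours: int, now: datetime) -> bool:
    """Return True while the cached data is younger than the configured TTL."""
    return now - last_synced < timedelta(hours=cache_ttl_hours)


# Hypothetical sync time; 720 hours is the documented default (30 days).
synced = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(is_cache_fresh(synced, 720, synced + timedelta(days=29)))  # True
print(is_cache_fresh(synced, 720, synced + timedelta(days=31)))  # False
```

Under this reading, lowering `cache_ttl_hours` simply makes the second case (a re-sync) happen sooner.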

### Kscien Backends

The Kscien suite provides curated lists of predatory journals, publishers, hijacked journals, and conferences. All Kscien backends share the same configuration pattern.

```yaml
backends:
  kscien_standalone_journals:
    enabled: true
    weight: 0.9
    timeout: 5
    config:
      cache_ttl_hours: 168  # 7 days - weekly cache

  kscien_publishers:
    enabled: true
    weight: 0.9
    timeout: 5
    config:
      cache_ttl_hours: 168  # 7 days - weekly cache

  kscien_hijacked_journals:
    enabled: true
    weight: 1.0
    timeout: 5
    config:
      cache_ttl_hours: 168  # 7 days - weekly cache

  kscien_predatory_conferences:
    enabled: true
    weight: 0.8
    timeout: 5
    config:
      cache_ttl_hours: 168  # 7 days - weekly cache
```

**Configuration**:
- `cache_ttl_hours`: How long cached list data remains valid. Default is 168 hours (7 days). Kscien lists are updated weekly, so weekly cache refresh is recommended. Increase for more stable environments, decrease if you need the latest additions.

**Backend Descriptions**:
- `kscien_standalone_journals`: Checks against standalone predatory journals
- `kscien_publishers`: Checks against predatory publishers
- `kscien_hijacked_journals`: Identifies hijacked journals (clones of legitimate journals)
- `kscien_predatory_conferences`: Checks against predatory conference lists

See backend implementations in `src/aletheia_probe/backends/kscien_*.py`

### Scopus Backend

```yaml
backends:
  scopus:
    enabled: true
    weight: 1.2
    timeout: 5
    config:
      cache_ttl_hours: 720  # 30 days - monthly cache
```

**Configuration**:
- `cache_ttl_hours`: How long cached Scopus data remains valid. Default is 720 hours (30 days). Since Scopus uses user-provided static files, longer cache periods are appropriate.

**Important Notes**:
- Scopus backend requires manual setup - users must download and place Scopus journal list Excel file in `~/.aletheia-probe/scopus/`
- This backend identifies legitimate journals indexed in Scopus
- Backend remains inactive until Scopus data file is provided

See `src/aletheia_probe/backends/scopus.py`
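
The "inactive until a data file is provided" behaviour can be pictured as a simple existence check on the data directory. This is a sketch only: the helper name `find_scopus_file`, the `.xlsx`/`.xls` extensions, and the choice of the most recently modified file are all assumptions (the documentation says only "Excel file"); the actual detection logic lives in `scopus.py`.

```python
from pathlib import Path
from typing import Optional


def find_scopus_file(data_dir: Path) -> Optional[Path]:
    """Return the most recently modified Excel file in data_dir, or None."""
    if not data_dir.is_dir():
        return None
    # Assumed extensions; the documentation does not name a specific one.
    candidates = [p for p in data_dir.iterdir() if p.suffix.lower() in {".xlsx", ".xls"}]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Directory documented above; with no file present the backend would stay inactive.
scopus_dir = Path.home() / ".aletheia-probe" / "scopus"
found = find_scopus_file(scopus_dir)
print("Scopus backend active" if found else "Scopus backend inactive: no data file found")
```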

### Cross-Validator Backend

```yaml
backends:
  cross_validator:
    enabled: true
    weight: 1.3
    timeout: 20
    email: "your.email@institution.org"
    config:
      cache_ttl_hours: 24  # Cache query results for 24 hours
```

**Configuration**:
- `email`: Contact email for API identification (OpenAlex and Crossref). Default is `noreply@aletheia-probe.org`. Configure your own email for better rate limits and API compliance.
- `cache_ttl_hours`: How long individual query results are cached. Default is 24 hours. Cross-validator performs API queries to both OpenAlex and Crossref, so caching reduces API load.
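
Both Crossref and OpenAlex accept a contact address as a `mailto` query parameter for polite-pool access. The sketch below shows how a configured `email`, with the documented default as fallback, might be attached to a request URL. `polite_url` is a hypothetical helper, and the backends may pass the address differently (for example in a User-Agent header).

```python
from urllib.parse import urlencode

DEFAULT_EMAIL = "noreply@aletheia-probe.org"  # documented default contact address


def polite_url(base_url: str, params: dict, email: str = "") -> str:
    """Append a `mailto` query parameter, falling back to the default address."""
    query = dict(params, mailto=email or DEFAULT_EMAIL)
    return f"{base_url}?{urlencode(query)}"


# Builds the URL only; no request is sent.
print(polite_url("https://api.crossref.org/journals", {"query": "nature"},
                 "your.email@institution.org"))
print(polite_url("https://api.openalex.org/sources", {"search": "nature"}))
```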

**Purpose**:
Cross-validator combines and cross-validates data from OpenAlex and Crossref backends. It performs consistency checks on publisher names, publication volumes, DOAJ listings, and activity timelines across both sources. When backends agree, confidence is boosted; when they disagree, confidence is reduced.
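
To make the boost/reduce idea concrete, here is a minimal sketch in which each consistency check nudges a base confidence up or down. The linear rule, the `step` size of 0.05, and the name `adjust_confidence` are illustrative assumptions, not the scoring used in `cross_validator.py`.

```python
def adjust_confidence(base: float, checks: dict[str, bool], step: float = 0.05) -> float:
    """Raise confidence per agreeing check, lower it per disagreeing one, clamp to [0, 1]."""
    agree = sum(checks.values())
    disagree = len(checks) - agree
    return min(1.0, max(0.0, base + step * (agree - disagree)))


# Hypothetical outcome: three checks agree across both sources, one disagrees.
checks = {
    "publisher_name": True,
    "publication_volume": True,
    "doaj_listing": True,
    "activity_timeline": False,
}
print(round(adjust_confidence(0.70, checks), 2))  # 0.8
```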

**When to Adjust**:
- Set shorter `cache_ttl_hours` (1-6 hours) when assessing newly published journals or during active research
- Set longer `cache_ttl_hours` (48-168 hours) for batch processing or when API rate limits are a concern
- Configure `email` to comply with API polite pool policies and get better rate limits

See `src/aletheia_probe/backends/cross_validator.py`

## Assessment Heuristics

Configuration for the assessment algorithm:
