A tool that automatically detects the open access status of journals running Open Journal Systems (OJS) by analyzing their current issue pages.
This project scans OJS journal websites to determine if they provide open access to their articles. It analyzes the current issue page of each journal, looking for PDF download links and toll-access indicators to classify journals as open access, toll access, or unknown.
- ojs_access_scan.py - Main scanning script that processes journal URLs and determines open access status
- journals.csv - Input file containing the ISSN and homepage URL for each journal (52K+ journals)
- ojs_oa.csv - Main results file with ISSN, current issue URL, and open access status
- ojs_oa_diagnostics.csv - Detailed diagnostic information including response times, errors, and detection details
ojs_oa.csv contains:
- issn - Journal ISSN
- current_issue_url - URL of the journal's current issue page
- is_oa - Open access status (true, false, or unknown)
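As a rough illustration of how the results file might be consumed, the sketch below tallies journals by their is_oa value. It assumes the CSV has a header row whose column names match the list above; the sample rows, ISSNs, and URLs are invented placeholders, not real scan output.

```python
import csv
import io
from collections import Counter


def tally_access(rows):
    """Count journals per is_oa value ('true', 'false', 'unknown')."""
    return Counter(row["is_oa"].strip().lower() for row in rows)


# Hypothetical sample standing in for ojs_oa.csv; ISSNs and URLs are made up.
SAMPLE = """issn,current_issue_url,is_oa
0000-0001,https://example.org/j1/issue/current,true
0000-0002,https://example.org/j2/issue/current,false
0000-0003,https://example.org/j3/issue/current,unknown
0000-0004,https://example.org/j4/issue/current,true
"""

counts = tally_access(csv.DictReader(io.StringIO(SAMPLE)))
print(dict(counts))
```

To run it against the real file, pass `csv.DictReader(open("ojs_oa.csv", newline=""))` to `tally_access` instead of the in-memory sample.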
ojs_oa_diagnostics.csv contains additional diagnostic data:
- status - HTTP response status or error type
- final_url - Final URL after redirects
- elapsed_ms - Response time in milliseconds
- bytes_checked - Number of bytes analyzed
- matched - Detection method used
- on_current_page - Whether analysis was performed on the current issue page
- error - Error details if the request failed
The scanner classifies each journal in three steps:
1. Fetch Current Issue Page: Attempts to access the /issue/current endpoint for each journal
2. Find Article Links: Searches for links containing /article/view/ patterns
3. Analyze Access: For each article, checks for:
   - PDF galley links (indicates open access)
   - Toll access indicators (subscription/restricted keywords)
   - Access badges and icons
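The detection steps can be sketched roughly as follows. This is not the code in ojs_access_scan.py: it uses the standard library's html.parser rather than beautifulsoup4 so that it runs without installation, and the keyword list, the PDF galley heuristic, the rule that a toll indicator outranks a PDF link, and the URL-building helper are all assumptions made for illustration. Access badges and icons are not covered.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; the script's real indicators are not listed in this README.
TOLL_KEYWORDS = ("subscription required", "restricted access", "purchase article")


class LinkCollector(HTMLParser):
    """Collect (href, class) pairs for anchors plus the lowercased page text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr = dict(attrs)
            if attr.get("href"):
                self.links.append((attr["href"], attr.get("class") or ""))

    def handle_data(self, data):
        self.text.append(data.lower())


def parse(html):
    """Feed an HTML string through LinkCollector and return the collector."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector


def current_issue_url(homepage):
    """Assumed URL scheme: append /issue/current to the journal homepage."""
    return homepage.rstrip("/") + "/issue/current"


def find_article_links(html):
    """Return hrefs that contain the /article/view/ pattern."""
    return [href for href, _ in parse(html).links if "/article/view/" in href]


def classify_article(html):
    """Return 'true', 'false', or 'unknown' for one article page."""
    page = parse(html)
    text = " ".join(page.text)
    if any(keyword in text for keyword in TOLL_KEYWORDS):
        return "false"  # toll indicator wins (assumed precedence)
    # Assumed galley heuristic: class mentions "pdf" or href ends in .pdf.
    for href, css_class in page.links:
        if "pdf" in css_class.lower().split() or href.lower().endswith(".pdf"):
            return "true"
    return "unknown"
```

A real scan would fetch `current_issue_url(homepage)`, pass the response body to `find_article_links`, and then classify each linked article page. Plain keyword matching like this is prone to false positives, for example a navigation menu that mentions subscriptions, so the keyword list would need tuning against real pages.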
# Install dependencies
pip install aiohttp beautifulsoup4
# Run the scanner
python ojs_access_scan.py journals.csv
Key parameters in the script:
- GLOBAL_CONCURRENCY - Maximum concurrent requests (default: 10)
- PER_HOST_CONCURRENCY - Maximum concurrent requests per host (default: 1)
- MAX_BYTES - Maximum response size to analyze (256KB)
- RETRIES - Number of retry attempts for failed requests (3)
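To show how a global limit and a per-host limit can work together, here is a minimal sketch built on asyncio semaphores. It reuses the two parameter names and defaults from the list above, but the mechanism is an assumption; the script may enforce its limits differently (for example through aiohttp connector settings). `fake_fetch` is a hypothetical stand-in that does no network I/O.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

GLOBAL_CONCURRENCY = 10   # default stated in this README
PER_HOST_CONCURRENCY = 1  # default stated in this README


async def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical; no network I/O)."""
    await asyncio.sleep(0.01)
    return url


async def scan(urls):
    """Fetch all URLs while respecting both limits; also report peak per-host load."""
    global_sem = asyncio.Semaphore(GLOBAL_CONCURRENCY)
    host_sems = defaultdict(lambda: asyncio.Semaphore(PER_HOST_CONCURRENCY))
    active = defaultdict(int)
    peak = defaultdict(int)

    async def worker(url):
        host = urlsplit(url).netloc
        # Take the host slot first so tasks queued behind a busy host
        # do not occupy a global slot while they wait.
        async with host_sems[host]:
            async with global_sem:
                active[host] += 1
                peak[host] = max(peak[host], active[host])
                try:
                    return await fake_fetch(url)
                finally:
                    active[host] -= 1

    results = await asyncio.gather(*(worker(u) for u in urls))
    return results, dict(peak)


# Nine made-up URLs spread over three hosts.
urls = [f"https://host{i % 3}.example/issue/current?n={i}" for i in range(9)]
results, peak = asyncio.run(scan(urls))
print(peak)
```

With PER_HOST_CONCURRENCY set to 1, the recorded peak for every host stays at one in-flight request, even though up to ten requests may run across different hosts at once.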
The scan of 52K+ OJS journals places each journal in one of three categories:
- Open Access: Journals providing free PDF access to articles
- Toll Access: Journals requiring subscriptions or payments
- Unknown: Journals where access status couldn't be determined (offline, errors, etc.)
The diagnostic file provides detailed information about scan performance, response times, and reasons for classification decisions.
The scanner is designed to be respectful of server resources:
- Conservative concurrency limits
- Retry logic with exponential backoff
- Per-host connection limiting
- User-agent headers for identification
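Retry logic with exponential backoff can be sketched as below. The retry count of 3 comes from the RETRIES parameter; the base delay of 0.5 seconds, the doubling schedule, and the helper name `with_retries` are assumptions for illustration, not values taken from the script. The demo swaps in a fake sleep function so it finishes instantly.

```python
import asyncio

RETRIES = 3  # retry attempts stated in this README


async def with_retries(op, retries=RETRIES, base_delay=0.5, sleep=asyncio.sleep):
    """Await op(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(retries + 1):
        try:
            return await op()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            await sleep(base_delay * (2 ** attempt))


# Demo: an operation that fails twice and then succeeds.
delays = []
calls = {"n": 0}


async def fake_sleep(seconds):
    """Record the requested delay instead of actually waiting."""
    delays.append(seconds)


async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


result = asyncio.run(with_retries(flaky, sleep=fake_sleep))
print(result, delays)
```

Doubling the wait after each failure gives a struggling server progressively more time to recover before the next attempt.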
- Python 3.7+
- aiohttp
- beautifulsoup4
Data and code provided for research purposes.