ourresearch/ojs-open-access-scanner

OJS Open Access Scanner

A tool to automatically detect open access status of journals using Open Journal Systems (OJS) by analyzing their current issue pages.

Overview

This project scans OJS journal websites to determine if they provide open access to their articles. It analyzes the current issue page of each journal, looking for PDF download links and toll-access indicators to classify journals as open access, toll access, or unknown.

Files

Scripts

  • ojs_access_scan.py - Main scanning script that processes journal URLs and determines open access status

Data Files

  • journals.csv - Input file containing ISSN and homepage URL for each journal (52K+ journals)
  • ojs_oa.csv - Main results file with ISSN, current issue URL, and open access status
  • ojs_oa_diagnostics.csv - Detailed diagnostic information including response times, errors, and detection details

Results Data Structure

ojs_oa.csv contains:

  • issn - Journal ISSN
  • current_issue_url - URL of the journal's current issue page
  • is_oa - Open access status (true, false, or unknown)

ojs_oa_diagnostics.csv contains additional diagnostic data:

  • status - HTTP response status or error type
  • final_url - Final URL after redirects
  • elapsed_ms - Response time in milliseconds
  • bytes_checked - Number of bytes analyzed
  • matched - Detection method used
  • on_current_page - Whether analysis was performed on current issue page
  • error - Error details if request failed

Detection Method

The scanner classifies each journal in three steps:

  1. Fetch Current Issue Page: Attempts to access /issue/current endpoint for each journal
  2. Find Article Links: Searches for links containing /article/view/ patterns
  3. Analyze Access: For each article, checks for:
    • PDF galley links (indicates open access)
    • Toll access indicators (subscription/restricted keywords)
    • Access badges and icons
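
The steps above can be sketched roughly as follows. This is a minimal illustration, not the logic of ojs_access_scan.py: it uses only the standard library instead of beautifulsoup4, the keyword list `TOLL_KEYWORDS` is hypothetical, and it assumes that a PDF galley link is an `<a>` element whose href contains /article/view/ and whose class list includes `pdf`. It also assumes that a PDF link takes precedence over a toll keyword when both appear.

```python
from html.parser import HTMLParser

# Hypothetical keywords; the real script's toll-access indicators may differ.
TOLL_KEYWORDS = ("subscription", "restricted", "purchase")


class IssuePageParser(HTMLParser):
    """Collects article links, PDF galley links, and visible text."""

    def __init__(self):
        super().__init__()
        self.article_links = []
        self.pdf_links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        if "/article/view/" in href:
            self.article_links.append(href)
            # Assumption: PDF galley links carry a "pdf" CSS class.
            classes = (attr.get("class") or "").lower().split()
            if "pdf" in classes:
                self.pdf_links.append(href)

    def handle_data(self, data):
        self.text.append(data.lower())


def current_issue_url(homepage):
    """Append the /issue/current endpoint to a journal homepage URL."""
    return homepage.rstrip("/") + "/issue/current"


def classify(html):
    """Return 'true', 'false', or 'unknown' for one current issue page."""
    parser = IssuePageParser()
    parser.feed(html)
    if not parser.article_links:
        return "unknown"  # nothing to analyze on this page
    if parser.pdf_links:
        return "true"  # assumed precedence: a PDF galley means open access
    page_text = " ".join(parser.text)
    if any(keyword in page_text for keyword in TOLL_KEYWORDS):
        return "false"
    return "unknown"
```

The three return values mirror the `is_oa` column described under Results Data Structure.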

Usage

# Install dependencies
pip install aiohttp beautifulsoup4

# Run the scanner
python ojs_access_scan.py journals.csv

Configuration

Key parameters in the script:

  • GLOBAL_CONCURRENCY - Maximum concurrent requests (default: 10)
  • PER_HOST_CONCURRENCY - Requests per host (default: 1)
  • MAX_BYTES - Maximum response size to analyze (default: 256 KB)
  • RETRIES - Number of retry attempts for failed requests (default: 3)
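
To show how the two concurrency limits and the size cap might interact, here is a hedged sketch built on asyncio semaphores. The `Limiter` class and the `fetch` callable are hypothetical names introduced for illustration; the actual script may enforce these limits differently (for example through aiohttp connector settings).

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

GLOBAL_CONCURRENCY = 10   # maximum requests in flight overall
PER_HOST_CONCURRENCY = 1  # maximum requests in flight per host
MAX_BYTES = 256 * 1024    # analyze at most 256 KB of each response


class Limiter:
    """Hypothetical helper; create it inside a running event loop."""

    def __init__(self):
        self.global_sem = asyncio.Semaphore(GLOBAL_CONCURRENCY)
        self.host_sems = defaultdict(
            lambda: asyncio.Semaphore(PER_HOST_CONCURRENCY)
        )

    async def run(self, url, fetch):
        """Await fetch(url) under both limits and truncate the body."""
        host = urlsplit(url).netloc
        # Take the per-host slot first so requests queued behind a slow
        # host do not occupy global slots while they wait.
        async with self.host_sems[host]:
            async with self.global_sem:
                body = await fetch(url)
        return body[:MAX_BYTES]
```

With `PER_HOST_CONCURRENCY` at 1, journals hosted on the same server are fetched one at a time, while up to `GLOBAL_CONCURRENCY` different hosts can be contacted in parallel.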

Results Summary

The scan of 52K+ OJS journals places each journal in one of three categories:

  • Open Access: Journals providing free PDF access to articles
  • Toll Access: Journals requiring subscriptions or payments
  • Unknown: Journals where access status couldn't be determined (offline, errors, etc.)

The diagnostic file provides detailed information about scan performance, response times, and reasons for classification decisions.
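
Per-category counts can be tallied from the results file with a few lines of Python. The sketch below assumes ojs_oa.csv has a header row with the column names listed under Results Data Structure; the sample rows are placeholders for illustration only, not real scan output.

```python
import csv
import io
from collections import Counter


def summarize(csv_file):
    """Count journals per is_oa value in an ojs_oa.csv-style file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["is_oa"].strip().lower() for row in reader)


# Placeholder rows (invented ISSNs and URLs), not actual results.
sample = io.StringIO(
    "issn,current_issue_url,is_oa\n"
    "0000-0001,https://a.example/issue/current,true\n"
    "0000-0002,https://b.example/issue/current,false\n"
    "0000-0003,https://c.example/issue/current,unknown\n"
    "0000-0004,https://d.example/issue/current,true\n"
)
counts = summarize(sample)
```

To run it on the real file, pass an open file handle instead, for example `summarize(open("ojs_oa.csv", newline=""))`.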

Network Considerations

The scanner is designed to be respectful of server resources:

  • Conservative concurrency limits
  • Retry logic with exponential backoff
  • Per-host connection limiting
  • User-agent headers for identification
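
Retry logic with exponential backoff can be sketched as below. The function name, the `base_delay` parameter, and the injectable `sleep` are hypothetical, and the sketch assumes `RETRIES` counts attempts made after the first failure; the script's actual delays and error handling are not documented here. A real scanner would likely catch narrower exceptions, such as `aiohttp.ClientError` and `asyncio.TimeoutError`, rather than `Exception`.

```python
import asyncio

RETRIES = 3  # assumed to mean retries after the initial attempt


async def fetch_with_retries(fetch, url, retries=RETRIES,
                             base_delay=1.0, sleep=asyncio.sleep):
    """Await fetch(url), retrying failures with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Delay doubles each time: base, 2 * base, 4 * base, ...
            await sleep(base_delay * (2 ** attempt))
```

Doubling the delay gives a struggling server progressively more time to recover before the next request arrives.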

Requirements

  • Python 3.7+
  • aiohttp
  • beautifulsoup4

License

Data and code are provided for research purposes.
