Skip to content

nf-osi/synapse-miner

Repository files navigation

Synapse ID Miner

A Python package for mining Synapse IDs from scientific articles in Europe PMC's Open Access subset.

Features

  • Downloads and processes XML files from Europe PMC's Open Access subset
  • Extracts Synapse IDs with surrounding context from articles
  • Handles large XML files efficiently using parallel processing
  • Saves results incrementally to prevent data loss
  • Provides progress tracking during download and processing
  • Supports starting from a specific file and limiting the number of files processed
  • Automated weekly workflow to send results to synapse

Installation

pip install git+https://github.com/nf-osi/synapse-miner.git

Usage

Command Line Interface

Process XML files from Europe PMC's Open Access subset:

synapse-miner http -u https://europepmc.org/ftp/oa/ -o results.csv -s PMC3000001_PMC3010000.xml.gz -m 1

Arguments:

  • -u, --url: Base URL of the Europe PMC Open Access subset
  • -o, --output: Path to save results
  • -s, --start-from: Filename to start processing from (optional)
  • -m, --max-files: Maximum number of files to process (optional)

Python API

from synapse_miner import SynapseMiner

# Initialize miner
miner = SynapseMiner()

# Process files from HTTP server
miner.process_http_files(
    base_url="https://europepmc.org/ftp/oa/",
    output_path="results.csv",
    start_from="PMC3000001_PMC3010000.xml.gz",
    max_files=1
)

Output

The package generates two types of output files:

  1. Main results file (results.csv): Contains all findings across all processed files
  2. Batch files (results.csv.{filename}.csv): Contains findings from individual files

Each row in the output contains:

  • pmcid: The PubMed Central ID of the article with bioregistry prefix (e.g., "pmc:PMC1234567")
  • synid: The Synapse ID found in the article
  • context: 25 characters before and after the Synapse ID for context

Notes

  • The package automatically handles retries for failed downloads
  • Downloaded files are cleaned up after processing to save disk space
  • Progress is tracked and displayed during both download and processing
  • Results are saved after each file is processed to prevent data loss

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages