A Python package for mining Synapse IDs from scientific articles in Europe PMC's Open Access subset.
- Downloads and processes XML files from Europe PMC's Open Access subset
- Extracts Synapse IDs with surrounding context from articles
- Handles large XML files efficiently using parallel processing
- Saves results incrementally to prevent data loss
- Provides progress tracking during download and processing
- Supports starting from a specific file and limiting the number of files processed
- Automated weekly workflow to send results to Synapse
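The core extraction step can be sketched with a short regex pass. This is a hypothetical illustration, not the package's actual code: the exact pattern and the `find_synapse_ids` helper are assumptions, though the 25-character context window matches the output format described below.

```python
import re

# Synapse IDs follow the pattern "syn" + digits; the exact regex used
# by the package may differ (hypothetical sketch).
SYNAPSE_ID_RE = re.compile(r"\bsyn\d{4,12}\b", re.IGNORECASE)

def find_synapse_ids(text, window=25):
    """Return (synapse_id, context) pairs, with up to `window`
    characters of surrounding text on each side."""
    hits = []
    for m in SYNAPSE_ID_RE.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        hits.append((m.group(0), text[start:end]))
    return hits
```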
```bash
pip install git+https://github.com/nf-osi/synapse-miner.git
```
Process XML files from Europe PMC's Open Access subset:
```bash
synapse-miner http -u https://europepmc.org/ftp/oa/ -o results.csv -s PMC3000001_PMC3010000.xml.gz -m 1
```

Arguments:

- `-u, --url`: Base URL of the Europe PMC Open Access subset
- `-o, --output`: Path to save results
- `-s, --start-from`: Filename to start processing from (optional)
- `-m, --max-files`: Maximum number of files to process (optional)
```python
from synapse_miner import SynapseMiner

# Initialize miner
miner = SynapseMiner()

# Process files from HTTP server
miner.process_http_files(
    base_url="https://europepmc.org/ftp/oa/",
    output_path="results.csv",
    start_from="PMC3000001_PMC3010000.xml.gz",
    max_files=1
)
```

The package generates two types of output files:
- Main results file (`results.csv`): Contains all findings across all processed files
- Batch files (`results.csv.(unknown).csv`): Contains findings from individual files
Each row in the output contains:
- `pmcid`: The PubMed Central ID of the article with a bioregistry prefix (e.g., "pmc:PMC1234567")
- `synid`: The Synapse ID found in the article
- `context`: 25 characters before and after the Synapse ID, for context
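Because the output is plain CSV with the columns listed above, it can be consumed with the standard library. A minimal sketch (the `count_ids_per_article` helper is illustrative, not part of the package):

```python
import csv
from collections import Counter

def count_ids_per_article(path):
    """Count how many Synapse ID mentions each article contains,
    keyed by the prefixed PMC ID (e.g. "pmc:PMC1234567")."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["pmcid"]] += 1
    return counts
```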
- The package automatically handles retries for failed downloads
- Downloaded files are cleaned up after processing to save disk space
- Progress is tracked and displayed during both download and processing
- Results are saved after each file is processed to prevent data loss
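The retry behavior noted above can be approximated as follows. This is a sketch under stated assumptions, not the package's actual implementation: the function name, `max_retries`, and the exponential backoff are all illustrative.

```python
import time
import urllib.request

def download_with_retries(url, dest, max_retries=3, backoff=2.0):
    """Download `url` to `dest`, retrying failed attempts with
    exponential backoff (hypothetical sketch of the retry logic)."""
    for attempt in range(1, max_retries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return dest
        except OSError:
            if attempt == max_retries:
                raise  # give up after the final attempt
            time.sleep(backoff ** attempt)  # wait before retrying
```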