A Python package for mining Synapse IDs from scientific articles in Europe PMC's Open Access subset.
- Downloads and processes XML files from Europe PMC's Open Access subset
- Extracts Synapse IDs with surrounding context from articles
- Handles large XML files efficiently using parallel processing
- Saves results incrementally to prevent data loss
- Provides progress tracking during download and processing
- Supports starting from a specific file and limiting the number of files processed
- Automated weekly workflow to send results to Synapse
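The core extraction step can be sketched with a short regex pass. This is a hypothetical illustration, not the package's actual code: the exact pattern and the `find_synapse_ids` helper are assumptions, though the 25-character context window matches the output format described below.

```python
import re

# Synapse IDs follow the pattern "syn" + digits; the exact regex used
# by the package may differ (hypothetical sketch).
SYNAPSE_ID_RE = re.compile(r"\bsyn\d{4,12}\b", re.IGNORECASE)

def find_synapse_ids(text, window=25):
    """Return (synapse_id, context) pairs, with up to `window`
    characters of surrounding text on each side."""
    hits = []
    for m in SYNAPSE_ID_RE.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        hits.append((m.group(0), text[start:end]))
    return hits
```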
```bash
pip install git+https://github.com/nf-osi/synapse-miner.git
```
Process XML files from Europe PMC's Open Access subset:
```bash
synapse-miner http -u https://europepmc.org/ftp/oa/ -o results.csv -s PMC3000001_PMC3010000.xml.gz -m 1
```

Arguments:

- `-u, --url`: Base URL of the Europe PMC Open Access subset
- `-o, --output`: Path to save results
- `-s, --start-from`: Filename to start processing from (optional)
- `-m, --max-files`: Maximum number of files to process (optional)
```python
from synapse_miner import SynapseMiner

# Initialize miner
miner = SynapseMiner()

# Process files from HTTP server
miner.process_http_files(
    base_url="https://europepmc.org/ftp/oa/",
    output_path="results.csv",
    start_from="PMC3000001_PMC3010000.xml.gz",
    max_files=1
)
```

The package generates two types of output files:
- Main results file (`results.csv`): Contains all findings across all processed files
- Batch files (`results.csv.(unknown).csv`): Contains findings from individual files
Each row in the output contains:
- `pmcid`: The PubMed Central ID of the article with a bioregistry prefix (e.g., "pmc:PMC1234567")
- `synid`: The Synapse ID found in the article
- `context`: 25 characters before and after the Synapse ID, for context
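Because the output is plain CSV with the columns listed above, it can be consumed with the standard library. A minimal sketch (the `count_ids_per_article` helper is illustrative, not part of the package):

```python
import csv
from collections import Counter

def count_ids_per_article(path):
    """Count how many Synapse ID mentions each article contains,
    keyed by the prefixed PMC ID (e.g. "pmc:PMC1234567")."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["pmcid"]] += 1
    return counts
```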
- The package automatically handles retries for failed downloads
- Downloaded files are cleaned up after processing to save disk space
- Progress is tracked and displayed during both download and processing
- Results are saved after each file is processed to prevent data loss
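The retry behavior noted above can be approximated as follows. This is a sketch under stated assumptions, not the package's actual implementation: the function name, `max_retries`, and the exponential backoff are all illustrative.

```python
import time
import urllib.request

def download_with_retries(url, dest, max_retries=3, backoff=2.0):
    """Download `url` to `dest`, retrying failed attempts with
    exponential backoff (hypothetical sketch of the retry logic)."""
    for attempt in range(1, max_retries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return dest
        except OSError:
            if attempt == max_retries:
                raise  # give up after the final attempt
            time.sleep(backoff ** attempt)  # wait before retrying
```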