jonatasgrosman/findpapers

Findpapers is a Python library that gives researchers unified access to hundreds of millions of academic papers from different databases - all through a single query. Instead of searching the databases one by one, each with its own interface and query language, Findpapers lets you write one boolean expression and run it everywhere at once, automatically merging and deduplicating the results.

Findpapers searches for papers through arXiv, IEEE Xplore, OpenAlex, PubMed, Scopus, and Semantic Scholar - together covering virtually every peer-reviewed paper, preprint, and conference proceeding published across all fields of science. It also supports paper enrichment, PDF downloading, citation graph building (snowballing), and export to multiple formats.

Key Features

  • Massive coverage - access hundreds of millions of papers across six databases that together span every scientific discipline
  • Multi-database search - query all databases in parallel with one boolean search expression, with no need to learn six different query syntaxes
  • Smart deduplication - automatically merges duplicate papers found across different databases
  • Paper enrichment - fetch additional metadata (abstracts, keywords, citations) via CrossRef and web scraping
  • PDF downloading - download PDFs with automatic URL resolution for major publishers
  • Citation snowballing - build citation graphs by traversing references and citations (forward and backward)
  • Flexible export - save results as JSON, BibTeX, or CSV
  • Filter codes - restrict search terms to specific fields (title, abstract, keywords, author, source, affiliation)
  • Parallel execution - speed up searches and downloads using multiple worker threads
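
To make the parallel search and deduplication features more concrete, here is a minimal sketch of the general idea: send one query to several sources from worker threads, then merge records that refer to the same paper. This is not Findpapers' implementation. The helper names (`normalize_title`, `paper_key`, `merge_papers`, `search_all`), the dictionary-based paper records, and the two stand-in sources are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation/whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def paper_key(paper):
    """Prefer the DOI as identity; fall back to the normalized title."""
    doi = (paper.get("doi") or "").strip().lower()
    return ("doi", doi) if doi else ("title", normalize_title(paper["title"]))


def merge_papers(papers):
    """Merge duplicates, keeping the first non-empty value seen for each field."""
    merged = {}
    for paper in papers:
        record = merged.setdefault(paper_key(paper), {"databases": []})
        for field, value in paper.items():
            if field == "database":
                # Remember every source that returned this paper.
                if value not in record["databases"]:
                    record["databases"].append(value)
            elif value and not record.get(field):
                record[field] = value
    return list(merged.values())


def search_all(sources, query, max_workers=4):
    """Call every source function in a worker thread and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = pool.map(lambda source: source(query), sources)
        all_papers = [paper for batch in batches for paper in batch]
    return merge_papers(all_papers)


# Two stand-in sources returning hand-written records (not real API calls).
def fake_source_a(query):
    return [{"title": "Deep Learning for Triage", "doi": "10.1000/xyz1", "database": "A"}]


def fake_source_b(query):
    return [
        {"title": "Deep learning for triage.", "doi": "10.1000/XYZ1",
         "abstract": "An abstract.", "database": "B"},
        {"title": "Another Paper", "doi": None, "database": "B"},
    ]


papers = search_all([fake_source_a, fake_source_b], "[machine learning] AND [healthcare]")
for paper in papers:
    print(paper["title"], paper["databases"])
```

In this sketch the first source's title wins and the second source contributes the missing abstract. A known limitation of this simple key is that a paper returned with a DOI by one source and without one by another is not merged; a real implementation would need a more careful matching strategy.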

Requirements

  • Python 3.11+

Installation

pip install git+https://github.com/jonatasgrosman/findpapers.git

Quick Start

import findpapers
import datetime

engine = findpapers.Engine()

# Search for papers across all databases
result = engine.search(
    "[machine learning] AND [healthcare]",
    since=datetime.date(2022, 1, 1),
)

# Enrich papers with additional metadata (abstracts, keywords, citations)
engine.enrich(result.papers)

# Download PDFs
engine.download(result.papers, "./pdfs")

# Build a citation graph from the top results
graph = engine.snowball(result.papers[:5], max_depth=1, direction="both")

# Save results
findpapers.save_to_json(result, "results.json")
findpapers.save_to_bibtex(result.papers, "references.bib")
findpapers.save_to_json(graph, "citation_graph.json")
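
The snowballing step above can be pictured as a breadth-first traversal of a citation graph. The sketch below is a generic illustration over a toy in-memory graph and is not how `engine.snowball` is implemented. The function `toy_snowball`, its `references` argument, and the accepted direction values `"backward"` and `"forward"` are assumptions; only `max_depth` and `direction="both"` appear in the Quick Start.

```python
from collections import deque


def toy_snowball(seeds, references, max_depth=1, direction="both"):
    """Breadth-first traversal of a toy citation graph.

    `references` maps a paper ID to the list of paper IDs it cites.
    Returns the set of directed edges (citing, cited) that were discovered.
    """
    if direction not in ("backward", "forward", "both"):
        raise ValueError(f"unknown direction: {direction}")

    # Invert the reference map to find which papers cite a given paper.
    cited_by = {}
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by.setdefault(cited, []).append(citing)

    edges = set()
    visited = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth >= max_depth:
            continue  # Do not expand papers at the depth limit.
        neighbours = []
        if direction in ("backward", "both"):
            # Backward snowballing: follow the paper's own reference list.
            neighbours += [(paper, cited) for cited in references.get(paper, [])]
        if direction in ("forward", "both"):
            # Forward snowballing: follow the papers that cite this one.
            neighbours += [(citing, paper) for citing in cited_by.get(paper, [])]
        for citing, cited in neighbours:
            edges.add((citing, cited))
            other = cited if citing == paper else citing
            if other not in visited:
                visited.add(other)
                queue.append((other, depth + 1))
    return edges


# Toy graph: A cites B, B cites C, D cites A.
toy_references = {"A": ["B"], "B": ["C"], "D": ["A"]}
edges_depth_1 = toy_snowball(["A"], toy_references, max_depth=1, direction="both")
edges_backward_2 = toy_snowball(["A"], toy_references, max_depth=2, direction="backward")
print(sorted(edges_depth_1))
print(sorted(edges_backward_2))
```

With a depth of 1 only the seed's direct references and citing papers are collected; each extra level expands the newly found papers in turn, so the graph can grow quickly.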

Supported Databases

The table below summarizes each supported database - for full details on authentication, rate limits, and per-database quirks, see the Databases documentation.

Database         | Size (papers) | API Key      | Coverage
-----------------|---------------|--------------|---------
arXiv            | 3M+ ¹         | Not required | Open-access preprints in physics, math, CS, biology, economics, and more
IEEE Xplore      | 7M+ ²         | Required     | Journals, conferences, and standards in electrical engineering and CS
OpenAlex         | 243M+ ³       | Optional     | The largest open catalog of scholarly works across all disciplines
PubMed           | 40M+          | Optional     | Biomedical and life sciences literature (MEDLINE, PMC, and more)
Scopus           | 100M+         | Required     | Peer-reviewed literature in science, technology, medicine, social sciences, and humanities
Semantic Scholar | 214M+         | Optional     | AI-powered academic graph covering all fields of science

Paper counts are estimates taken from each database's official website in March 2026; the superscripts mark the original sources. These numbers grow continuously.

Every API key from the databases listed above can be obtained at no cost - just create an account on each provider’s website. We strongly recommend getting all of them before using Findpapers, as they unlock additional databases (IEEE, Scopus) and dramatically improve rate limits and reliability on the others (OpenAlex, PubMed, Semantic Scholar). See Databases for more details on how to get these API keys, and Configuration for how to set them up.
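
Before a first search it can help to check which keys are present in the environment. The sketch below does this with the standard library only. The environment variable names are placeholders invented for this example, not the names Findpapers reads; see Configuration for the real ones. The required/optional split follows the table above.

```python
import os

# Placeholder variable names (hypothetical): replace them with the names
# listed in the Configuration documentation.
HYPOTHETICAL_KEY_VARS = {
    "IEEE Xplore": "EXAMPLE_IEEE_API_KEY",
    "Scopus": "EXAMPLE_SCOPUS_API_KEY",
    "OpenAlex": "EXAMPLE_OPENALEX_API_KEY",
    "PubMed": "EXAMPLE_PUBMED_API_KEY",
    "Semantic Scholar": "EXAMPLE_SEMANTIC_SCHOLAR_API_KEY",
}

# Databases that cannot be queried without a key, per the table above.
REQUIRED = {"IEEE Xplore", "Scopus"}


def report_api_keys(environ=os.environ):
    """Return a {database: status} mapping for the configured key variables."""
    report = {}
    for database, variable in HYPOTHETICAL_KEY_VARS.items():
        if environ.get(variable):
            report[database] = "key set"
        elif database in REQUIRED:
            report[database] = "missing (database unavailable without a key)"
        else:
            report[database] = "missing (optional, expect lower rate limits)"
    return report


for database, status in report_api_keys().items():
    print(f"{database}: {status}")
```

arXiv is left out because it does not need a key.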

Documentation

Document        | Description
----------------|------------
Getting Started | Installation, configuration, and first search
Databases       | Supported databases, authentication, and per-database details
Query Syntax    | How to write search queries, boolean operators, wildcards, and filter codes
Configuration   | Environment variables, proxy, SSL, and API keys
Search          | Multi-database search with boolean queries
Enrich          | Enrich papers with additional metadata from CrossRef and web scraping
Download        | Download PDFs for papers
Snowball        | Build citation graphs via forward and backward snowballing
Fetch by DOI    | Look up a single paper by DOI
Save/Load       | JSON, BibTeX, and CSV persistence details
API Reference   | Public classes, functions, enums, and exceptions

Want to help?

See the contribution guidelines if you'd like to contribute to the project, and please follow our Code of Conduct. You don't need to know how to code to contribute: even improving the documentation is a valuable contribution.

If this project has been useful to you, please share it with your friends and give us a star on GitHub to help others discover it. You can also sponsor me to support the development of Findpapers.

Citation

If you use Findpapers in your research, please cite it:

@misc{grosman2020findpapers,
  title={{Findpapers: A tool for helping researchers who are looking for related works}},
  author={Grosman, Jonatas},
  howpublished={\url{https://github.com/jonatasgrosman/findpapers}},
  year={2020}
}