Skip to content

szaghi/mosaic

Repository files navigation

MOSAIC logo

a vivid mosaic of open scientific literature, assembled in seconds

Multi-sOurce Scientific Article Indexer and Collector

Search, discover, and download scientific papers from multiple open databases — with a single command. Send results directly to Google NotebookLM for AI-powered summaries, podcasts, and more.

Version Tests Coverage Python License: GPL v3 License: BSD-2 License: BSD-3 License: MIT

Full documentation


What MOSAIC does

# Search 21 sources at once, deduplicate, download OA PDFs
mosaic search "attention is all you need" --oa-only --download

# Discover related literature from any DOI or arXiv ID — no query needed
mosaic similar 10.48550/arXiv.1706.03762 --sort citations

# Bulk-download your entire Zotero / JabRef library
mosaic get --from refs.bib --oa-only

# Turn results into an AI-powered notebook: podcast, slides, quiz, mind map…
mosaic notebook create "Transformers" --query "transformer architecture" --oa-only --podcast

MOSAIC quick search demo

Authors

Stefano Zaghi · stefano.zaghi@gmail.com

Creator & Maintainer

Andrea Giulianini

Grand Pixel Overlord & Architect of the Sacred Button — world-class web UI designer, responsible for making MOSAIC actually look good

Claude (Anthropic)

Omniscient Code Oracle & Tireless Rubber Duck — AI pair programmer, responsible for writing the boring parts so humans don't have to

Web UI

Launch with mosaic ui (requires [ui] extra — see Web UI docs).

Key features

🌐 21 sources, one command

arXiv · Semantic Scholar · OpenAlex · PubMed · PubMed Central · Europe PMC · DOAJ · Crossref · Springer · IEEE · NASA ADS · Zenodo · BASE · CORE · DBLP · HAL · ScienceDirect · bioRxiv/medRxiv · and more

🔭 Find similar papers

mosaic similar <doi> — discover related literature from any DOI or arXiv ID via OpenAlex graph + Semantic Scholar ML, no query needed

Smart deduplication

Results merged by DOI: best citation count, richest abstract, earliest PDF URL wins

📥 OA PDF downloads

Direct links · Unpaywall fallback · browser-session authenticated access · bulk download from .bib/.csv

🎛️ Sort & filter

Year · author · journal · open-access · citation count — composable, applied at API level where supported

📤 Export anywhere

Markdown · CSV · JSON · BibTeX · Zotero (local & web API)

🤖 NotebookLM integration

Podcast · video · slides · quiz · mind map · flashcards · briefing — queued in one command

Offline-first cache

SQLite — repeated queries are instant, no re-fetching

🧩 Custom sources

Wire any JSON REST API as a new source with a few lines of TOML — no Python needed

📚 Zotero integration

Push results directly into your Zotero library — local API (Zotero running on your machine) or web API (api.zotero.org). Organise into collections, link downloaded PDFs as attachments, and sync across devices — all with a single --zotero flag.

Zotero integration screenshot

Sources

Source Shorthand Coverage Auth OA PDF
arXiv arxiv Physics, CS, Math, Biology… None Always
Semantic Scholar ss 214 M papers, all disciplines Optional key When indexed
ScienceDirect sd Elsevier journals & books API key or browser session OA articles
Springer Nature sp Springer, Nature & affiliated journals (browser) None ([browser] extra) Via Unpaywall
Springer Nature API springer OA articles from Springer, Nature & affiliated journals Free API key Direct PDF link
DOAJ doaj 8 M+ fully open-access articles None Always
Europe PMC epmc 45 M biomedical papers None PMC articles
OpenAlex oa 250 M+ works, all disciplines None When available
BASE base 300 M+ docs from 10 000+ repos None When OA + PDF format
CORE core 200 M+ OA full-text from repos Free API key downloadUrl field
NASA ADS ads 15 M+ astronomy & astrophysics records Free API token OA articles
IEEE Xplore ieee 5 M+ IEEE journals, transactions & conference proceedings Free API key OA articles
Zenodo zenodo 3 M+ OA research outputs (papers, datasets, software) None (token optional) Attached PDF files
Crossref crossref 150 M+ scholarly works (DOI registry) None (email optional) When deposited by publisher
DBLP dblp 6 M+ CS publications (journals, conferences) None Via ee field (arXiv/OA links)
HAL hal 1.5 M+ OA documents, strong for French academic output None Direct PDF when deposited
PubMed pubmed 35 M+ biomedical citations (NCBI) None (API key optional) PMC PDF for OA articles
PubMed Central pmc 5 M+ free full-text biomedical articles None (API key optional) Always — all PMC articles are OA
bioRxiv / medRxiv rxiv Life-science and medical preprints None Always (all preprints are OA)
PEDro pedro Physiotherapy evidence database None (fair-use ack) No (abstracts only)
Scopus scopus 90 M+ abstracts from Elsevier's citation database API key or browser session Via Unpaywall
Unpaywall PDF resolver for any DOI Email only Legal OA copy

Installation

# recommended — isolated install, globally available
pipx install mosaic-search        # or: uv tool install mosaic-search
# pip — must be inside a virtualenv (modern systems enforce PEP 668)
python -m venv ~/.venvs/mosaic && source ~/.venvs/mosaic/bin/activate
pip install mosaic-search
# from source
git clone https://github.com/szaghi/mosaic
cd mosaic
python -m venv .venv && source .venv/bin/activate
pip install -e .

Requires Python 3.11+

Quick Start

# 1. Set your email (enables Unpaywall PDF fallback)
mosaic config --unpaywall-email you@example.com

# 2. Optional: add an Elsevier API key to unlock ScienceDirect
mosaic config --elsevier-key YOUR_KEY

# 3. Search and download
mosaic search "transformer architecture" --oa-only --download

MOSAIC search and download demo

Usage

Search

# Search all enabled sources (10 results per source by default)
mosaic search "protein folding"

# More results, open-access only
mosaic search "deep learning" -n 25 --oa-only

# Single source
mosaic search "RNA velocity" --source epmc

Source shorthands: arxiv · ss · sd · doaj · epmc · oa · base · core · sp · springer · ads · ieee · zenodo · crossref · dblp · hal · pubmed · pmc · rxiv · pedro · scopus

Custom sources defined in config.toml are also queried and addressable by their name.

Filters

# By year — single, range, or list
mosaic search "BERT" --year 2019
mosaic search "diffusion models" -y 2020-2023
mosaic search "GPT" -y 2020,2022,2024

# By author (repeatable, OR logic, case-insensitive substring)
mosaic search "attention" -a Vaswani -a Shazeer

# By journal (case-insensitive substring)
mosaic search "CRISPR" --journal "Nature"

# Combine freely
mosaic search "graph neural" -y 2021-2023 -a Kipf -j "ICLR" --oa-only --download

Find similar papers

# Discover related literature from any DOI or arXiv ID
mosaic similar 10.48550/arXiv.1706.03762

# Sort by citation count, open-access only
mosaic similar arxiv:1706.03762 -n 20 --sort citations --oa-only

# Save to BibTeX
mosaic similar 10.1038/s41586-021-03819-2 --output related.bib

Uses OpenAlex related_works (always) and Semantic Scholar recommendations (when ss-key is configured). Results are deduplicated and merged — the higher citation count and richer metadata always win.

Download by DOI

mosaic get 10.48550/arXiv.1706.03762

Checks the local cache first, then tries Unpaywall if no PDF URL is known.

Bulk download from BibTeX / CSV

# Export your Zotero/JabRef/Mendeley library and download everything
mosaic get --from refs.bib

# CSV with a 'doi' column works too
mosaic get --from references.csv --oa-only

Extracts all DOIs from the file, deduplicates, and downloads with the same fallback chain (direct PDF → Unpaywall → browser session).

Export to Zotero

# Push to your Zotero library (Zotero must be running — no config needed)
mosaic search "CRISPR" --oa-only --zotero

# Push to a named collection (created automatically if missing)
mosaic search "transformers" --zotero --zotero-collection "Deep Learning"

# Download PDFs and link them as Zotero attachments (local mode)
mosaic search "diffusion models" --download --zotero --zotero-collection "Generative AI"

For the web API (Zotero not running locally), configure once:

mosaic config --zotero-key YOUR_API_KEY

MOSAIC Zotero integration demo

Configuration

mosaic config --show                          # print current config
mosaic config --unpaywall-email me@uni.edu
mosaic config --elsevier-key abc123
mosaic config --ss-key xyz789
mosaic config --download-dir ~/papers

Config is stored at ~/.config/mosaic/config.toml. Downloaded PDFs go to ~/mosaic-papers/ by default.

Custom sources

Any number of JSON REST APIs can be added as new sources directly in config.toml — one [[custom_sources]] block per source, no Python required:

[[custom_sources]]
name         = "My Institution Repo"
enabled      = true
url          = "https://repo.myuni.edu/api/search"
method       = "GET"
query_param  = "q"
results_path = "results"

[custom_sources.fields]
title    = "title"
doi      = "doi"
year     = "year"
authors  = "authors"    # flat string array
journal  = "source.title"

See the Custom Sources guide for the full reference.

NotebookLM

Send search results directly to a Google NotebookLM notebook:

# 1. Inject into MOSAIC (--include-apps exposes the notebooklm CLI)
pipx inject --include-apps mosaic-search "notebooklm-py[browser]"

# 2. Install Chromium — playwright lives inside the pipx venv, call it directly
~/.local/share/pipx/venvs/mosaic-search/bin/playwright install chromium

# 3. Authenticate once
notebooklm login

# 4. Search, download, and create a notebook in one command
mosaic notebook create "Transformers" --query "transformer architecture" --oa-only --podcast

# Or import PDFs you already have
mosaic notebook create "My Papers" --from-dir ~/mosaic-papers/

MOSAIC uploads local PDFs when available, falls back to URLs otherwise, and respects NotebookLM's 50-source limit. With --podcast, an Audio Overview is queued automatically.

Architecture

flowchart LR
    CLI -->|query + filters| Search
    Search --> arXiv & SS[Semantic Scholar] & SD[ScienceDirect] & SP[Springer browser] & SPN[Springer API] & DOAJ & EPMC[Europe PMC] & OA[OpenAlex] & BASE & CORE & ADS[NASA ADS] & IEEE[IEEE Xplore] & ZEN[Zenodo] & CR[Crossref] & DBLP[DBLP] & HAL[HAL] & PM[PubMed] & PMC[PubMed Central] & RXV[bioRxiv/medRxiv]
    arXiv & SS & SD & SP & SPN & DOAJ & EPMC & OA & BASE & CORE & ADS & IEEE & ZEN & CR & DBLP & HAL & PM & PMC & RXV -->|Paper list| Dedup{Deduplicate\nby DOI}
    Dedup --> Cache[(SQLite\ncache)]
    Dedup --> Table[Rich table]
    Table -->|--download| DL[Downloader]
    DL -->|no pdf_url| UPW[Unpaywall]
    UPW --> DL
    DL -->|no OA copy| AUTH[Browser session]
    AUTH --> DL
    DL --> Disk[(~/mosaic-papers/)]
    DL -->|mosaic notebook create| NLM[NotebookLM]
Loading

Development

pip install -e ".[dev]"

# with NotebookLM integration (includes Playwright for auth)
pip install -e ".[dev,notebooklm]"
playwright install chromium

# run tests + coverage
pytest

# live docs
cd docs && npm install && npm run docs:dev

Coverage report and badge JSON are written to docs/public/ after every test run.

License

MOSAIC is available under your choice of license:

License SPDX File
GNU General Public License v3 GPL-3.0-or-later LICENSE.gpl3.md
BSD 2-Clause BSD-2-Clause LICENSE.bsd-2.md
BSD 3-Clause BSD-3-Clause LICENSE.bsd-3.md
MIT MIT LICENSE.mit.md

© Stefano Zaghi

About

Multi-source Scientific Article Index and Collector

Resources

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages