A large-scale, historically comprehensive knowledge graph (KG) constructed from the zbMATH Open platform, designed to capture historical and conceptual connections across centuries of mathematical research. The KG spans over 250 years and incorporates curated publications dating back to 1763. This temporal depth makes it particularly suitable for longitudinal analyses and historically grounded scholarly exploration and discovery use cases.
- Temporal Span: 1763–2025 (see src/retrieval-tasks/year-count.tsv for the per-year distribution)
- Triples: 159M+
- Distinct Entities: 36M+
- Publications: 4M+
- Disambiguated Authors/Reviewers: 1M+
- Reviews: 3M+
- Subject Classifications (MSC): 6,500+
- Keywords: 3M+
- Software: 30k+ ... (and more)
- **RDF-Based Semantic Knowledge Graph**: Compliant with RDF and Semantic Web standards, the zbMATH Open KG is built entirely from RDF triples using widely adopted ontologies and vocabularies (e.g., schema:, dcterms:, skos:, cito:), supporting semantic interoperability and adhering to Linked Open Data principles. The full RDF dumps will be published on Zenodo after the anonymous review period concludes. A sample of 200 records is available at data/subset-200.ttl.
- **Expert-Curated, High-Quality Mathematical Metadata**: In addition to standard bibliographic metadata, the KG incorporates mathematical publications annotated with expert-curated reviews and keywords, disambiguated authors, and Mathematics Subject Classification (MSC) codes, a fine-grained ontology for classifying mathematical subjects.
- **Historically-Grounded Scholarly Discovery and Exploration**: Its comprehensive, long-term coverage enables long-range intellectual analysis such as historically grounded retrieval tasks, e.g., identifying overlooked precursors and tracing conceptual lineages across (sub)disciplines.
- **SPARQL Query Interface**: A SPARQL endpoint (temporarily at SPARQL endpoint url) for directly executing queries over the KG.
- **Linked Data Integration**: Cross-links with external URLs and persistent identifiers (e.g., DOI).
- Python 3.12+
- Python libraries: rdflib, SPARQLWrapper, and others (see requirements.txt)
- Java 8 or higher (required only if you run Apache Jena libraries outside Docker)
- Docker (for running RDF triple stores like Apache Jena Fuseki without manual Java setup)
  - We use Apache Jena Fuseki as an example for its simplicity
  - Note: Production SPARQL endpoints use Virtuoso (see the zb-virtuoso directory for the complete Virtuoso setup)
To harvest data by zbMATH ID (e.g., the ID list of the zbMATH open access subset: zbMATH OA subset), run:

```shell
python harvest-by-id.py
```

For bulk download (via Sickle), refer to: zbMATH Open Harvester
Using raw .jsonl zbMATH data obtained from the API (see example: data/subset-200.jsonl), run the following commands to automatically generate the RDF KG:

```shell
# Option 1: Run the Python script
python create-rdf.py data/subset-200.jsonl subset-200

# Option 2: Run the shell script for batch processing
./run-convert.sh
```
We provide an example using Apache Jena Fuseki as the RDF triple store for the KG. Fuseki provides a lightweight SPARQL server to host and query your knowledge graph. The example setup is provided in front/.
We provide a sample subset of the zbMATH Open KG data at data/subset-200.ttl. Before running the example, ensure this initial data file is located in the same folder as the docker-compose.yml file. If not, update the volume mapping in front/docker-compose.yml accordingly:

```yaml
- ./subset-200.ttl:/data.ttl
```

Then, start the service by running:

```shell
docker compose up -d
```

This will launch Fuseki on port 3030 and load the initial data via fuseki-entrypoint.sh.
Your SPARQL endpoint URL will be available at: http://localhost:3030/dataset/sparql
For Virtuoso setup, see the zb-virtuoso directory.
- data/ – .jsonl raw data and .ttl RDF KG (subset), ontology files (.ttl), etc.
- front/ – Fuseki triple store setup for serving the RDF subset (example only; the SPARQL endpoint runs on Virtuoso for scalability)
- src/ – Source code for KG construction (data harvest, statistics calculation, RDF transformation, etc.)
- src/retrieval-tasks/ – Source code and SPARQL queries for historically-grounded scholarly exploration and discovery
- use-case/ – Use case-specific results and visualizations
- run-convert.sh – Shell script to convert raw data into RDF format
- README.md – Project documentation
All content generated by the zbMATH Open KG is distributed under CC-BY-SA 4.0, in accordance with the specification of the zbMATH Open OAI-PMH API:

> Content generated by zbMATH Open, such as reviews, classifications, software, or author disambiguation data, are distributed under CC-BY-SA 4.0.
> This defines the license for the whole dataset, which also contains non-copyrighted bibliographic metadata and reference data derived from I4OC (CC0).
> Note that the API only provides a subset of the data in the zbMATH Open Web interface.
> In several cases, third-party information, such as abstracts, cannot be made available under a suitable license through the API.
> In those cases, we replaced the data with the string "zbMATH Open Web Interface contents unavailable due to conflicting licenses."
📧 Contact: yuni.susanti@fiz-karlsruhe.de