
zbMATH Open Knowledge Graph

A large-scale, historically comprehensive knowledge graph (KG) constructed from the zbMATH Open platform, designed to capture historical and conceptual connections across centuries of mathematical research. The KG spans over 250 years and incorporates curated publications dating back to 1763. This temporal depth makes it particularly suitable for longitudinal analyses and historically grounded scholarly exploration and discovery.


Key Statistics (as of September 2025)

  • Temporal Span: 1763–2025. See src/retrieval-tasks/year-count.tsv for the per-year distribution.
  • Triples: 159M+
  • Distinct Entities: 36M+
  • Publications: 4M+
  • Disambiguated Authors/Reviewers: 1M+
  • Reviews: 3M+
  • Subject Classifications (MSC): 6,500+
  • Keywords: 3M+
  • Software: 30k+ ... (and more)
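Given a harvested .jsonl dump (see Data Harvesting below), a per-year distribution like the one in src/retrieval-tasks/year-count.tsv can be recomputed with a short stdlib-only sketch. The `year` field name is an assumption about the raw record schema and may differ in the actual API output:

```python
import json
from collections import Counter

def year_distribution(jsonl_path: str) -> Counter:
    """Count publications per year in a harvested .jsonl dump.

    The 'year' field name is an assumption; adjust it to match
    the actual zbMATH Open record schema.
    """
    counts: Counter = Counter()
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            counts[record.get("year")] += 1
    return counts
```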

Key Features

  • RDF-Based Semantic Knowledge Graph
    Compliant with RDF and Semantic Web standards, the zbMATH Open KG is built entirely from RDF triples using widely adopted ontologies and vocabularies (e.g., schema:, dcterms:, skos:, cito:), supporting semantic interoperability and adhering to Linked Open Data principles. The full RDF dumps will be published on Zenodo after the anonymous review period concludes. A sample of 200 records is available here: data/subset-200.ttl.

  • Expert-Curated, High-Quality Mathematical Metadata
    In addition to standard bibliographic metadata, it incorporates annotated mathematical publications with expert-curated reviews and keywords, disambiguated authors, and Mathematics Subject Classification (MSC) codes — a fine-grained ontology for math subject classification.

  • Historically-Grounded Scholarly Discovery and Exploration
    Its comprehensive, long-term coverage enables long-range intellectual analysis, such as historically grounded retrieval tasks — e.g., identifying overlooked precursors and tracing conceptual lineages across (sub)disciplines.

  • SPARQL Query Interface
    A SPARQL endpoint (temporarily at SPARQL endpoint url) for directly executing queries over the KG.

  • Linked Data Integration
    Cross-links with external URLs and persistent identifiers (e.g., DOI).
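As a sketch of how the SPARQL endpoint can be queried programmatically using only the standard library: the endpoint URL below assumes the local Fuseki setup described later, and `schema:ScholarlyArticle` as the publication class is an assumption about the ontology, not a confirmed detail of the KG schema.

```python
import urllib.request

COUNT_QUERY = """\
PREFIX schema: <http://schema.org/>
SELECT (COUNT(?pub) AS ?n)
WHERE { ?pub a schema:ScholarlyArticle . }
"""

def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare an HTTP POST carrying a SPARQL query (SPARQL 1.1 Protocol)."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "application/sparql-results+json",
        },
    )

req = build_sparql_request("http://localhost:3030/dataset/sparql", COUNT_QUERY)
# Sending the request requires a running endpoint:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

SPARQLWrapper (listed in requirements.txt) wraps exactly this protocol exchange behind a friendlier API.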

Construction and Setup

Prerequisites

  • Python 3.12+
  • Python libraries: rdflib, SPARQLWrapper, and others (see requirements.txt)
  • Java 8 or higher (required only if you run Apache Jena libraries outside Docker)
  • Docker (for running RDF triple stores like Apache Jena Fuseki without manual Java setup)
    • We use Apache Jena Fuseki as an example for its simplicity
    • Note: Production SPARQL endpoints use Virtuoso (See the zb-virtuoso directory for the complete Virtuoso setup.)

Data Harvesting

To harvest data by zbMATH ID (e.g., the ID list of the zbMATH open access subset: zbMATH OA subset), run:

python harvest-by-id.py 

For bulk download (via sickle), refer to: zbMATHOpen Harvester
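The per-ID harvesting loop can be pictured as follows, with the actual API call abstracted behind a `fetch` callback. This is a minimal sketch: the real harvest-by-id.py handles the concrete zbMATH Open API URL, error handling, and rate limiting, all of which are omitted here.

```python
import json
from typing import Callable, Iterable

def harvest_to_jsonl(ids: Iterable[str],
                     fetch: Callable[[str], dict],
                     out_path: str) -> int:
    """Fetch one record per zbMATH ID and write each as a JSON line.

    `fetch` is a placeholder for the actual API call; it should
    return one publication record as a dict.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for zb_id in ids:
            record = fetch(zb_id)
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```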

RDF Construction

Using raw .jsonl zbMATH data obtained from the API (see example: data/subset-200.jsonl), run the following commands to automatically generate the RDF KG:

# Option 1: Run the Python script
python create-rdf.py data/subset-200.jsonl subset-200

# Option 2: Run the shell script for batch processing
run-convert.sh
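The conversion step can be sketched as a mapping from one raw record to a handful of Turtle triples. The input field names and the zbMATH IRI pattern below are illustrative assumptions, not the script's actual mapping; the real create-rdf.py (built on rdflib) covers far more properties.

```python
import json

def record_to_turtle(record: dict) -> str:
    """Render one harvested record as a few Turtle triples.

    Field names ('id', 'title', 'year') and the IRI pattern are
    illustrative assumptions, not the script's actual mapping.
    """
    subject = f"<https://zbmath.org/?q=an:{record['id']}>"
    return "\n".join([
        "@prefix schema: <http://schema.org/> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"{subject} a schema:ScholarlyArticle ;",
        f'    dcterms:title "{record["title"]}" ;',
        f'    schema:datePublished "{record["year"]}" .',
    ])

sample = json.loads('{"id": "0001.00001", "title": "Example", "year": "1900"}')
turtle = record_to_turtle(sample)
```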

RDF Triple Store Setup

We provide an example using Apache Jena Fuseki as the RDF triple store for the KG. Fuseki provides a lightweight SPARQL server to host and query your knowledge graph. The example setup is provided in front/.

We provide a sample subset of the zbMATH Open KG data here: data/subset-200.ttl. Before running the example, ensure this initial data file is located in the same folder as the docker-compose.yml file. If not, update the volume mapping in front/docker-compose.yml accordingly:

- ./subset-200.ttl:/data.ttl

Then, start the service by running:

docker compose up -d

This will launch Fuseki on port 3030 and load the initial data via fuseki-entrypoint.sh.

Your SPARQL endpoint URL will be available at: http://localhost:3030/dataset/sparql
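Results from this endpoint come back in the standard SPARQL 1.1 JSON results format, which can be flattened into plain rows with the stdlib alone; for example:

```python
import json

def bindings_to_rows(results_json: str) -> list[dict]:
    """Flatten SPARQL 1.1 JSON results into dicts of variable -> value."""
    data = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]

example = """{
  "head": {"vars": ["title", "year"]},
  "results": {"bindings": [
    {"title": {"type": "literal", "value": "Example"},
     "year": {"type": "literal", "value": "1900"}}
  ]}
}"""

rows = bindings_to_rows(example)
# rows == [{"title": "Example", "year": "1900"}]
```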

For Virtuoso setup, see the zb-virtuoso directory.

Repository Structure

  • data/ – .jsonl raw data and .ttl RDF KG (subset), ontology files (.ttl), etc.
  • front/ – Fuseki triple store setup for serving the RDF subset (example only — SPARQL endpoint runs on Virtuoso for scalability)
  • src/ – Source code for KG construction (data harvesting, statistics calculation, RDF transformation, etc.).
  • src/retrieval-tasks/ – Source code and SPARQL queries for historically-grounded scholarly exploration and discovery.
  • use-case/ – Use case-specific results and visualizations
  • run-convert.sh – Shell script to convert raw data into RDF format
  • README.md – Project documentation

License

All content generated by the zbMATH Open KG is distributed under CC-BY-SA 4.0, in accordance with the specification at the zbMATH Open OAI-PMH API:

Content generated by zbMATH Open, such as reviews, classifications, software, or author disambiguation data, are distributed under CC-BY-SA 4.0.
This defines the license for the whole dataset, which also contains non-copyrighted bibliographic metadata and reference data derived from I4OC (CC0).
Note that the API only provides a subset of the data in the zbMATH Open Web interface.
In several cases, third-party information, such as abstracts, cannot be made available under a suitable license through the API.
In those cases, we replaced the data with the string "zbMATH Open Web Interface contents unavailable due to conflicting licenses."

📧 Contact: yuni.susanti@fiz-karlsruhe.de
