NASA SPOKE-GeneLab Knowledge Graph

This repository contains the code and metadata needed to build a Knowledge Graph (KG) for NASA GeneLab omics datasets hosted on the Open Science Data Repository (OSDR).

🚀 Features

Automated graph construction from datasets in the OSDR
Incremental update for new datasets
Statistical filtering of results for significance
Species selection via a configurable whitelist
Versioned metadata for reproducibility (v0.0.3)
Federated query using Neo4j Fabric with the Scalable Precision Medicine Open Knowledge Engine (SPOKE) KG

🧪 Supported Data Types

Measurement	Technology	Property	Selection Criteria
Transcription profiling	RNA Sequencing (RNA‑Seq)	Log2 fold change	Adjusted p-value <= 0.05
Transcription profiling	DNA microarray	Log2 fold change	Adjusted p-value <= 0.05
DNA methylation profiling	Whole Genome Bisulfite Sequencing	Methylation difference %	q-value <= 0.05
DNA methylation profiling	Reduced‑Representation Bisulfite Sequencing (RRBS)	Methylation difference %	q-value <= 0.05
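The selection criteria in the table amount to a threshold test on each result row. The following is a minimal sketch of that test, not the repository's implementation. The column names (`adj_p_value`, `q_value`), the measurement keys, and the sample rows are hypothetical and chosen only for illustration.

```python
import csv
import io

# Cutoffs taken from the table above. The keys and column names are
# hypothetical; the actual GeneLab result files may use different names.
THRESHOLDS = {
    "transcription": ("adj_p_value", 0.05),
    "methylation": ("q_value", 0.05),
}


def filter_significant(rows, measurement):
    """Keep rows whose significance value is at or below the cutoff."""
    column, cutoff = THRESHOLDS[measurement]
    kept = []
    for row in rows:
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value, e.g. "NA"
        if value <= cutoff:  # a NaN value compares False and is dropped
            kept.append(row)
    return kept


# Made-up example rows, for illustration only.
SAMPLE = """gene,log2fc,adj_p_value
GeneA,1.8,0.001
GeneB,-0.4,0.20
GeneC,2.1,0.05
GeneD,0.9,NA
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
significant = filter_significant(rows, "transcription")
print([r["gene"] for r in significant])
```

Because the comparison is inclusive (`<=`), a row exactly at 0.05 is retained, which matches the criteria as stated in the table.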

⚙️ How It Works

Fetch omics study records using the OSDR API
Filter datasets by statistical thresholds and target species
Map model organism genes to human genes
Map cell types to the Cell Ontology (CL) and tissue types to the Uber-anatomy Ontology (UBERON)
Export CSV files for graph database upload
Import CSV files into a Neo4j Graph database
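The species filter, gene mapping, and CSV export steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the code in the notebooks: the whitelist contents, the ortholog lookup table, the record fields, and the CSV column names are all hypothetical.

```python
import csv
import io

# Hypothetical whitelist; the repository's configured species list may differ.
SPECIES_WHITELIST = {"Mus musculus"}

# Hypothetical ortholog lookup: model organism Entrez ID -> human Entrez ID.
# 22059 is mouse Trp53 and 7157 is human TP53.
ORTHOLOGS = {"22059": "7157"}


def build_gene_rows(records):
    """Return (model organism gene ID, human gene ID) pairs for whitelisted species."""
    rows = []
    for rec in records:
        if rec["species"] not in SPECIES_WHITELIST:
            continue  # species is not on the whitelist
        human_id = ORTHOLOGS.get(rec["entrez_id"])
        if human_id is None:
            continue  # no human ortholog in this toy lookup
        rows.append((rec["entrez_id"], human_id))
    return rows


def to_csv(rows, header):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Made-up input records, for illustration only.
records = [
    {"species": "Mus musculus", "entrez_id": "22059"},
    {"species": "Mus musculus", "entrez_id": "0"},   # placeholder ID without a mapping
    {"species": "Other species", "entrez_id": "1"},  # placeholder, not whitelisted
]

csv_text = to_csv(build_gene_rows(records), header=("mgene_id", "human_gene_id"))
print(csv_text)
```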

🕸️ Graph Schema

Figure: Schematic overview of the GeneLab knowledge graph structure, highlighting key node types (circles) and relationships (arrows).

The Assay–MEASURED–MGene relationship encodes Log₂ fold changes derived from transcription profiling assays, while the Assay–MEASURED–MethylationRegion relationship captures methylation differences identified through DNA methylation assays. The MGene–METHYLATED_IN–MethylationRegion relationship links model organism genes (MGene) to 1,000 base pair genomic regions (MethylationRegion) exhibiting differential methylation.

Proxy nodes (shown in gray) represent standardized identifiers for human genes (ENTREZ ID), anatomical structures (UBERON ID), and cell types (CL ID), enabling integration with external Neo4j databases and supporting composite graph database construction.

Diagram generated using arrows.app.

📁 Metadata Directory Structure

The following node and relationship metadata files define the graph schema.

Nodes
kg/v0.0.3/metadata/nodes/
Relationships
kg/v0.0.3/metadata/relationships/

The organization and conventions for defining the metadata and data are described in the kg-import Git repository.

🔗 SPOKE - GeneLab Composite Database

Figure: Integration of the SPOKE and GeneLab knowledge graphs using proxy nodes.
The GeneLab graph (right), a knowledge graph representing spaceflight omics datasets, depicts key experimental entities: Assay, Study, Mission, MGene, and MethylationRegion, along with their relationships. Proxy nodes (gray) represent external identifiers (ENTREZ, UBERON, CL) and enable linkage to the SPOKE graph (left), a rich biomedical knowledge graph comprising biological processes, molecular functions, diseases, compounds, and more. The dashed lines indicate mappings to enable the construction of a composite Neo4j graph database. The composite graph enables federated queries across multiple KGs.

⚙️ Data Import Into Neo4j Knowledge Graph

Set Up Neo4j Desktop

Download the Neo4j Desktop application from the Neo4j Download Center and follow the installation instructions.
When the installation is complete, Neo4j Desktop will launch. Click the New button to create a new project.

Hover the cursor over the created project, click the edit button, and change the project name from Project to spoke-genelab.

Click the ADD button and select Local DBMS. Select Neo4j version 5.23.0.

Enter the password neo4jdemo and click Create.

Select Terminal to open a terminal window.

Type pwd in the terminal window to show the path to the NEO4J_INSTALL_PATH directory. This path is required in the .env file (see the next section).

Set Up the Environment

Prerequisites: Miniconda3 (lightweight, preferred) or Anaconda3, plus Mamba (faster than Conda)

Install Miniconda3
Update an existing miniconda3 installation: conda update conda
Install Mamba: conda install mamba -n base -c conda-forge
Install Git (if not installed): conda install git -n base -c anaconda

Clone this Repository

git clone https://github.com/BaranziniLab/spoke_genelab.git
cd spoke_genelab

Create a Conda environment

The file environment.yml specifies the Python version and all required dependencies.

mamba env create -f environment.yml

Create an account in BioPortal and copy the API key. BioPortal is used to map terms to ontologies.
Copy the file env_template to .env
Edit the file .env and set the following variables

KG version number

KG_VERSION=v0.0.3

Path to the cloned git repository

KG_GIT=/Users/.../spoke_genelab/

Path to the Neo4j instance in Neo4j Desktop. Enclose the path in quotes because it contains spaces.

NEO4J_INSTALL_PATH="/Users/.../Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-3d4b95d1-0219-480b-a3c4-ee5a409cc383"

BioPortal API Key

BIOPORTAL_API_KEY=<bioportal api key>
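Before running the notebooks, it can help to confirm that all four variables are set. The sketch below is a small standalone check, not part of the repository. The helper names `parse_env` and `missing_variables` are hypothetical, and the parser handles only simple KEY=VALUE lines with optional surrounding quotes.

```python
from pathlib import Path

# Variable names as listed in the section above.
REQUIRED = ("KG_VERSION", "KG_GIT", "NEO4J_INSTALL_PATH", "BIOPORTAL_API_KEY")


def parse_env(text):
    """Parse KEY=VALUE lines; blank lines and '#' comment lines are ignored."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_variables(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


env_file = Path(".env")
if env_file.exists():
    missing = missing_variables(parse_env(env_file.read_text()))
    print("Missing variables:", missing or "none")
```

Run it from the root of the cloned repository, where the .env file is located. If no .env file is found, the script prints nothing.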

Download and Process Datasets and Upload to Neo4j Graph Database

Start the spoke-genelab Graph DBMS

Activate the conda environment

conda activate spoke-genelab

Launch Jupyter Lab

jupyter lab

Navigate to the notebooks directory and run the following notebooks

Notebook	Description
1_download_datasets.ipynb	Downloads datasets
2_create_study_mission_nodes.ipynb	Creates Study and Mission nodes and their relationships
3_create_gene_nodes.ipynb	Creates MGene (model organism) nodes and mapped Gene (human) nodes
4_create_assay_nodes.ipynb	Creates Assay nodes and their relationships
5_import_to_neo4j.ipynb	Imports the formatted data into a Neo4j KG
6_query_examples.ipynb	Runs example queries (optional)

When the import is completed, click the Refresh button in Neo4j Desktop. The newly created database spoke-genelab-v0.0.3 will be listed.

Click the Open button to launch the database.

Click on the database icon on the left.

Use the pull-down menu to select the spoke-genelab-v0.0.3 database. Wait 30 seconds or more until the database is loaded and the nodes are listed.

Set the Graph Stylesheet

Drag the file kg/v0.0.3/style.grass onto the Neo4j Browser window to set the node colors, sizes, and labels.

Now you are ready to run Cypher queries on the selected database.
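Queries can also be run from Python instead of the Neo4j Browser. The sketch below assumes that the `neo4j` Python driver (version 5.8 or later) is installed, that the DBMS listens on the default Bolt port 7687, and that the default user name `neo4j` is in use; Neo4j Desktop may assign a different port. The node labels and relationship type come from the Graph Schema section. The relationship direction is left unspecified, and `properties(r)` is returned so that no property names need to be guessed.

```python
# Cypher query using labels from the Graph Schema section.
# The pattern is undirected because the schema direction is not assumed here.
QUERY = """
MATCH (a:Assay)-[r:MEASURED]-(g:MGene)
RETURN a, properties(r) AS measurement, g
LIMIT $limit
"""


def run_query(
    uri="bolt://localhost:7687",      # assumption: default Bolt port
    user="neo4j",                     # assumption: default user name
    password="neo4jdemo",             # password chosen during DBMS setup
    database="spoke-genelab-v0.0.3",  # database created by the import notebook
    limit=5,
):
    """Run the example query and return each record as a dictionary."""
    # Imported here so the module loads even where the driver is not installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        records, summary, keys = driver.execute_query(
            QUERY, limit=limit, database_=database
        )
    return [record.data() for record in records]
```

With the spoke-genelab DBMS running, calling `run_query()` returns up to five Assay, measurement, and MGene records as a list of dictionaries.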
When you are finished, stop the database in Neo4j Desktop.

To deactivate the conda environment, type

conda deactivate

Dump Neo4j Graph Database

Stop the database
Hover the cursor over the spoke-genelab-v0.0.3 database and select Dump from the menu.

When the dump is complete, click the Reveal files in Finder button to open the directory that contains the spoke-genelab-v0.0.3.dump file.

This database dump will be used to create the SPOKE-GeneLab composite database.

📚 Citation

PW Rose, CA Nelson, SG Gebre, K Soman, KA Grigorev, LM Sanders, SV Costes, SE Baranzini, NASA SPOKE-GeneLab Knowledge Graph. Available online: https://github.com/BaranziniLab/spoke_genelab (2025)

CA Nelson, PW Rose, K Soman, LM Sanders, SG Gebre, SV Costes, SE Baranzini, Nasa Genelab-Knowledge Graph Fabric Enables Deep Biomedical Analysis of Multi-Omics Datasets, https://ntrs.nasa.gov/citations/20250000723 (2025)

💰 Funding

NSF Award number 2333819, Proto-OKN Theme 1: Connecting Biomedical information on Earth and in Space via the SPOKE knowledge graph.
