This repository contains the code and metadata needed to build a Knowledge Graph (KG) for NASA GeneLab omics datasets hosted on the Open Science Data Repository (OSDR).
- Automated graph construction from datasets in the OSDR
- Incremental update for new datasets
- Statistical filtering of results for significance
- Species selection via a configurable whitelist
- Versioned metadata for reproducibility (v0.0.3)
- Federated query using Neo4j Fabric with the Scalable Precision Medicine Open Knowledge Engine (SPOKE) KG
Measurement | Technology | Property | Selection Criteria |
---|---|---|---|
Transcription profiling | RNA Sequencing (RNA‑Seq) | Log2 fold change | Adjusted p-value <= 0.05 |
Transcription profiling | DNA microarray | Log2 fold change | Adjusted p-value <= 0.05 |
DNA methylation profiling | Whole Genome Bisulfite Sequencing | Methylation difference % | q-value <= 0.05 |
DNA methylation profiling | Reduced‑Representation Bisulfite Sequencing (RRBS) | Methylation difference % | q-value <= 0.05 |
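The selection criteria in the table amount to a threshold check per measurement type. The following is a minimal sketch, not the repository's implementation: the record layout, the field names (`adj_p_value`, `q_value`), the function `is_significant`, and the sample values are all assumptions made for illustration.

```python
# Illustrative sketch only: field names, record layout, and values are
# assumptions, not the repository's actual data model.

# Significance field per measurement type, following the table above.
SIGNIFICANCE_FIELD = {
    "Transcription profiling": "adj_p_value",
    "DNA methylation profiling": "q_value",
}
THRESHOLD = 0.05  # both criteria in the table use <= 0.05


def is_significant(record: dict) -> bool:
    """Return True if the record passes the criterion for its measurement type."""
    field = SIGNIFICANCE_FIELD[record["measurement"]]
    value = record.get(field)
    # Records without a significance value are treated as not significant.
    return value is not None and value <= THRESHOLD


# Made-up records for demonstration.
records = [
    {"measurement": "Transcription profiling", "gene": "g1", "adj_p_value": 0.01},
    {"measurement": "Transcription profiling", "gene": "g2", "adj_p_value": 0.40},
    {"measurement": "DNA methylation profiling", "gene": "g3", "q_value": 0.05},
]
kept = [r for r in records if is_significant(r)]
print([r["gene"] for r in kept])
```

Because the comparison is inclusive (`<=`), a record whose value equals the threshold is retained, matching the criteria as written in the table.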
- Fetch omics study records using the OSDR API
- Filter datasets by statistical thresholds and target species
- Map model organism genes to human genes
- Map cell types to the Cell Ontology (CL) and tissue types to the Uber-anatomy Ontology (UBERON)
- Export CSV files for graph database upload
- Import CSV files into a Neo4j Graph database
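Two of the steps above, species filtering and CSV export, can be sketched with the standard library alone. This is an illustration under assumptions: the whitelist contents, the record fields (`identifier`, `organism`), and the helper names are hypothetical, and the real CSV column layout is defined by the kg-import conventions rather than by this example.

```python
import csv
import io

# Hypothetical whitelist; the repository's configured species list may differ.
SPECIES_WHITELIST = {"Mus musculus", "Homo sapiens"}


def filter_by_species(studies, whitelist):
    """Keep only studies whose organism is in the whitelist."""
    return [s for s in studies if s["organism"] in whitelist]


def to_csv(rows, fieldnames):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up study records for demonstration.
studies = [
    {"identifier": "study-A", "organism": "Mus musculus"},
    {"identifier": "study-B", "organism": "Danio rerio"},
]
selected = filter_by_species(studies, SPECIES_WHITELIST)
print(to_csv(selected, ["identifier", "organism"]))
```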
All entities and their connections follow this simplified schema:
*Figure: High‑level overview of nodes (circles) and relationships (arrows). Proxy nodes (gray) can be used to link to nodes in the SPOKE KG using the Neo4j Fabric composite database.*
The following node and relationship metadata files define the graph schema.
- Relationships: `kg/v0.0.3/metadata/relationships/`
The organization and syntax for defining the metadata and data are described in the kg-import Git repository.
- Download the Neo4j Desktop application from the Neo4j Download Center and follow the installation instructions.
- When the installation is complete, Neo4j Desktop will launch. Click the `New` button to create a new project.
- Hover the cursor over the created project, click the edit button, and change the project name from `Project` to `spoke-genelab`.
- Click the `ADD` button and select `Local DBMS`.
- Enter the password `neo4jdemo` and click `Create` (use the default Neo4j version).
- Select `Terminal` to open a terminal window.
- Type `pwd` in the terminal window to show the path to the `NEO4J_HOME` directory. This path is required in the `.env` file; see the next section.
Prerequisites: Miniconda3 (lightweight, preferred) or Anaconda3, plus Mamba (faster than Conda)
- Install Miniconda3
- Update an existing Miniconda3 installation:
conda update conda
- Install Mamba:
conda install mamba -n base -c conda-forge
- Install Git (if not installed):
conda install git -n base -c anaconda
- Clone this repository
git clone https://github.com/BaranziniLab/spoke_genelab.git
cd spoke_genelab
- Create a Conda environment. The file `environment.yml` specifies the Python version and all required dependencies.
mamba env create -f environment.yml
- Create an account in BioPortal and copy the API key. BioPortal is used to map terms to ontologies.
- Copy the file `env_template` to `.env`.
- Edit the file `.env` and set the following variables:
KG version number
KG_VERSION=v0.0.3
Path to the cloned git repository
KG_GIT=/Users/.../spoke_genelab/
Path to the Neo4j DBMS instance in Neo4j Desktop, as reported by `pwd` in the Neo4j Desktop terminal. Enclose the path in quotes.
NEO4J_INSTALL_PATH="/Users/.../Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-3d4b95d1-0219-480b-a3c4-ee5a409cc383"
BioPortal API Key
BIOPORTAL_API_KEY=<bioportal api key>
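Before running the notebooks, it can help to confirm that all four variables are set. The sketch below parses `.env`-style text and reports missing keys. It is an illustration only: the parser, the helper names, and the placeholder values are assumptions, and the notebooks may load the file differently.

```python
# Variable names taken from the list above.
REQUIRED_KEYS = ("KG_VERSION", "KG_GIT", "NEO4J_INSTALL_PATH", "BIOPORTAL_API_KEY")


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines; skip blank lines, comments, and lines without '='."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        value = raw.strip()
        # Remove one pair of matching surrounding quotes, keeping inner spaces.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(values: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Placeholder values for illustration only; real paths will differ.
example = '''
# KG version number
KG_VERSION=v0.0.3
KG_GIT=/example/spoke_genelab/
NEO4J_INSTALL_PATH="/example/Neo4j Desktop/dbms-example"
'''
settings = parse_env(example)
print(settings["NEO4J_INSTALL_PATH"])
print(missing_keys(settings))
```

In practice, the text would be read from the `.env` file at the root of the cloned repository and checked with `missing_keys` before starting the first notebook.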
- Start the spoke-genelab Graph DBMS
- Activate the conda environment
conda activate spoke-genelab
- Navigate to the `notebooks` directory and run the following notebooks in order:
Notebook | Description |
---|---|
1_download_datasets.ipynb | Downloads datasets |
2_create_study_mission_nodes.ipynb | Creates Study and Mission nodes and their relationships |
3_create_gene_nodes.ipynb | Creates MGene (model organism) nodes and the mapped Gene (human) nodes |
4_create_assay_nodes.ipynb | Creates Assay nodes and their relationships |
5_import_to_neo4j.ipynb | Imports the formatted data into a Neo4j KG |
6_query_examples.ipynb | Runs example queries (optional) |
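The numeric prefixes give the run order. If the notebooks are to be executed from the command line rather than interactively, the commands could be assembled as sketched below. Whether the notebooks run unattended is an assumption, and the helpers `execution_order` and `nbconvert_command` are hypothetical; the sketch only prints the commands and does not execute them.

```python
import re

# Notebook names from the table above (the optional query notebook is omitted).
NOTEBOOKS = [
    "5_import_to_neo4j.ipynb",
    "1_download_datasets.ipynb",
    "3_create_gene_nodes.ipynb",
    "2_create_study_mission_nodes.ipynb",
    "4_create_assay_nodes.ipynb",
]


def execution_order(names):
    """Sort notebook names by their leading integer prefix."""
    def prefix(name):
        match = re.match(r"(\d+)_", name)
        if match is None:
            raise ValueError(f"no numeric prefix: {name}")
        return int(match.group(1))
    return sorted(names, key=prefix)


def nbconvert_command(name):
    """Build a 'jupyter nbconvert' command that executes a notebook in place."""
    return ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", name]


for name in execution_order(NOTEBOOKS):
    print(" ".join(nbconvert_command(name)))
```

Sorting on the integer prefix rather than the raw string keeps the order correct if a notebook numbered 10 or higher is added later.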
- When the import is completed, click the `Refresh` button in Neo4j Desktop. The newly created database `spoke-genelab-v0.0.3` will be listed.
- Click the `Open` button to launch the database.
- Click on the database icon on the left.
- Use the pull-down menu to select the `spoke-genelab-v0.0.3` database. Wait 30 seconds or more until the database is loaded and the nodes are listed.
- Set the graph stylesheet: drag the file `kg/v0.0.3/style.grass` onto the Neo4j Browser window to set the node colors, sizes, and labels.
- Now you are ready to run Cypher queries on the selected database.
- When you are finished, stop the database in Neo4j Desktop.
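While the database is running, queries can also be sent from Python instead of the Neo4j Browser. The sketch below counts nodes by label. It rests on several assumptions: that the `neo4j` Python driver is installed, that the DBMS listens on the default Bolt address `bolt://localhost:7687` with user `neo4j`, and that the node labels match the names used in the notebook table. The helpers `count_query` and `count_nodes` are hypothetical and not part of the repository.

```python
# Node labels as named in the notebook descriptions; treated here as an assumption.
NODE_LABELS = ("Study", "Mission", "MGene", "Gene", "Assay")


def count_query(label: str) -> str:
    """Build a Cypher query that counts nodes with the given label."""
    # Labels cannot be passed as query parameters, so restrict them to a known set.
    if label not in NODE_LABELS:
        raise ValueError(f"unknown label: {label}")
    return f"MATCH (n:{label}) RETURN count(n) AS total"


def count_nodes(label: str,
                uri: str = "bolt://localhost:7687",   # assumed default Bolt address
                user: str = "neo4j",                  # assumed default user
                password: str = "neo4jdemo",          # password chosen during setup
                database: str = "spoke-genelab-v0.0.3") -> int:
    """Run the count query against a running DBMS (requires the neo4j driver)."""
    from neo4j import GraphDatabase  # imported here so the sketch loads without it

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session(database=database) as session:
            return session.run(count_query(label)).single()["total"]


print(count_query("Study"))
```

Calling `count_nodes("Study")` requires the `spoke-genelab` DBMS to be started in Neo4j Desktop.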
To deactivate the conda environment, type
conda deactivate
PW Rose, CA Nelson, SG Gebre, K Soman, KA Grigorev, LM Sanders, SV Costes, SE Baranzini, NASA SPOKE-GeneLab Knowledge Graph. Available online: https://github.com/BaranziniLab/spoke_genelab (2025)
CA Nelson, PW Rose, K Soman, LM Sanders, SG Gebre, SV Costes, SE Baranzini, NASA GeneLab-Knowledge Graph Fabric Enables Deep Biomedical Analysis of Multi-Omics Datasets. Available online: https://ntrs.nasa.gov/citations/20250000723 (2025)
NSF Award number 2333819, Proto-OKN Theme 1: Connecting Biomedical information on Earth and in Space via the SPOKE knowledge graph.