Biomedical knowledge graph infrastructure. Integrates 67 datasets into a unified graph and generates network embeddings.
Based on Bioteque (IRB Barcelona, MIT license). Citation: Fernandez-Torras et al., Nature Communications (2022). doi:10.1038/s41467-022-33026-0
%%{init:{'theme':'base','themeVariables':{'primaryColor':'#f8f9fa','primaryTextColor':'#1a1a2e','primaryBorderColor':'#adb5bd','lineColor':'#6c757d','fontSize':'13px'}}}%%
graph LR
subgraph Sources[" 67 External Sources "]
direction TB
S1(["DrugBank · STRING · LINCS"])
S2(["GPSAdb · DisGeNET · CCLE"])
S3(["CTD · SIDER · Reactome ..."])
end
subgraph ETL[" datasets/ "]
direction TB
SC["script.py"]
GD["get_data.sh"]
GD --> SC
end
subgraph Meta[" metadata/ "]
direction TB
MAP["mappings/\nGEN · CPD · DIS · CLL · TIS"]
ONT["ontologies/\nDOID · GO · BTO · HPO"]
end
subgraph Processing[" code/kgraph/ "]
direction TB
UTIL["utils/\nmappers · ontology"]
PROC["process_raw_data.py"]
UTIL --> PROC
end
subgraph Embed[" code/embeddings/ "]
direction TB
EDGE["get_edges.py"]
WALK["walks.py"]
SKIP["mp2vec.py"]
VAL["validation/"]
EDGE --> WALK --> SKIP --> VAL
end
RAW[("graph/raw/\nedges by metaedge")]
DONE[("graph/processed/\npropagated + depropagated")]
EMB[("embeddings/\nnode vectors .h5")]
Sources --> ETL
MAP -.-> ETL
ETL --> RAW --> Processing
MAP -.-> Processing
ONT -.-> Processing
Processing --> DONE --> Embed --> EMB
style Sources fill:#fef3e2,stroke:#bc6c25,stroke-width:1.5px,color:#6b4226
style ETL fill:#f0f7e8,stroke:#588157,stroke-width:1.5px,color:#344e41
style Meta fill:#e8f4f8,stroke:#2c7da0,stroke-width:1.5px,color:#184e77
style Processing fill:#f0f7e8,stroke:#588157,stroke-width:1.5px,color:#344e41
style Embed fill:#f0f7e8,stroke:#588157,stroke-width:1.5px,color:#344e41
style RAW fill:#f3e8f9,stroke:#7b2d8e,stroke-width:1.5px,color:#4a1259
style DONE fill:#f3e8f9,stroke:#7b2d8e,stroke-width:1.5px,color:#4a1259
style EMB fill:#f3e8f9,stroke:#7b2d8e,stroke-width:2px,color:#4a1259
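The embedding stage in the diagram (`get_edges.py`, then `walks.py`, then `mp2vec.py`) suggests a metapath2vec-style pipeline: random walks constrained to a sequence of metaedges, later fed to a skip-gram model. The following is a minimal sketch of the walk step only, not the repository's implementation. The toy edges, the tuple representation of a metaedge, and the names `build_adjacency` and `metapath_walk` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy edges, keyed by metaedge (source type, relation, target type).
# Real edge files live under graph/processed/; their format is not shown here.
EDGES = {
    ("CPD", "int", "GEN"): [("cpd1", "gen1"), ("cpd1", "gen2"), ("cpd2", "gen2")],
    ("GEN", "ass", "DIS"): [("gen1", "dis1"), ("gen2", "dis1"), ("gen2", "dis2")],
}


def build_adjacency(edges):
    """Index neighbours per metaedge so each walk step is a dictionary lookup."""
    adjacency = {}
    for metaedge, pairs in edges.items():
        neighbours = defaultdict(list)
        for source, target in pairs:
            neighbours[source].append(target)
        adjacency[metaedge] = neighbours
    return adjacency


def metapath_walk(adjacency, start, metapath, rng):
    """Follow the metaedges in order from `start`; stop early at a dead end."""
    walk = [start]
    for metaedge in metapath:
        candidates = adjacency[metaedge].get(walk[-1], [])
        if not candidates:
            break
        walk.append(rng.choice(candidates))
    return walk


adjacency = build_adjacency(EDGES)
rng = random.Random(0)  # seeded so repeated runs give the same walks
metapath = [("CPD", "int", "GEN"), ("GEN", "ass", "DIS")]
walks = [metapath_walk(adjacency, "cpd1", metapath, rng) for _ in range(5)]

for walk in walks:
    print(" ".join(walk))
```

Each printed line is one walk (compound, gene, disease). In a metapath2vec-style setup, such walks would serve as the "sentences" for skip-gram training, so nodes that co-occur along the same metapath end up with similar vectors.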
cd datasets/gpsadb && python3 script.py # process a dataset
python -m pytest -v # run tests
- GPSAdb 2.0 dataset (7,665 gene perturbation experiments, 2,810 genes)
- Gene mappings regenerated from UniProt 2026_01 (+350 gene names)
- Provenance attributes and cell line links on perturbagen edges
- pytest suite for ETL and mapping generation
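As an illustration of what a pytest check on mapped edges could look like, here is a minimal sketch. It is not taken from the repository's test suite. The `GENE_MAP` table, the edge dictionary layout, the `map_edge` helper, and the choice to drop edges with unmapped genes are all assumptions.

```python
# Hypothetical mapping table: gene name -> UniProt accession.
GENE_MAP = {"TP53": "P04637", "EGFR": "P00533"}


def map_edge(edge, gene_map):
    """Return a copy of the edge with both gene names mapped to accessions.

    Returns None if either gene is missing from the mapping (an assumed
    policy; the real pipeline may handle unmapped genes differently).
    """
    try:
        source = gene_map[edge["source"]]
        target = gene_map[edge["target"]]
    except KeyError:
        return None
    # Keep every other attribute (provenance, cell line) untouched.
    return {**edge, "source": source, "target": target}


def test_known_genes_are_mapped_and_attributes_kept():
    edge = {"source": "TP53", "target": "EGFR", "cell_line": "A549", "dataset": "gpsadb"}
    assert map_edge(edge, GENE_MAP) == {
        "source": "P04637",
        "target": "P00533",
        "cell_line": "A549",
        "dataset": "gpsadb",
    }


def test_unknown_gene_is_dropped():
    edge = {"source": "TP53", "target": "NOT_A_GENE", "cell_line": "A549", "dataset": "gpsadb"}
    assert map_edge(edge, GENE_MAP) is None
```

Saved in a file whose name starts with `test_`, these functions would be collected by `python -m pytest -v` like any other test.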
MIT (inherited from Bioteque, Copyright (c) 2022 SBNB)