ETL pipeline for processing and exporting iAtlas files so that they are ready for cBioPortal ingestion.
These instructions assume that you already have Python (>3.9) and a Python version manager (e.g. pyenv) installed.

Follow the directions to install `uv`. Then, from the project root:

```
uv sync
```

Now you can run commands like:

```
uv run python -V   # run python and get the version
uv run pytest      # run tests, if you have them and pytest installed
```
Prior to testing, developing, or running this locally, you will need to set up the Docker image.
Optional: you can also build your environment via a Python venv and install the dependencies from the `uv.lock` file.

- Create and activate your venv:

  ```
  python3 -m venv <your_env_name>
  source <your_env_name>/bin/activate
  ```

- Export dependencies from `uv.lock`:

  ```
  pip install uv
  uv export > requirements.txt
  ```

- Install into your venv:

  ```
  pip install -r requirements.txt
  ```
However, it is highly recommended that you use the Docker image.

- Build the Dockerfile:

  ```
  cd /orca-recipes/local/iatlas/cbioportal_export
  docker build -f Dockerfile -t <some_docker_image_name> .
  ```

  OR pull the prebuilt image:

  ```
  docker pull ghcr.io/sage-bionetworks/iatlas-cbioportal-export:main
  ```

- Run the Docker image:

  ```
  docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>
  ```

- Follow the How to Run section below.
`maf.py` runs the iAtlas mutations data through Genome Nexus so that it can be ingested by the cBioPortal team for visualization.

The script does the following (a sketch of this flow is shown after the list):

- Reads in and merges all the individual MAFs from a given folder
- Splits the merged MAF into smaller chunks for Genome Nexus annotation
- Annotates the chunks via Genome Nexus
- Concatenates the results
- Creates the required `meta_*` data
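For orientation, here is a minimal sketch of that merge/split/annotate/concatenate flow. The helper names, the chunk size, and the pass-through annotation step are illustrative assumptions; in the real pipeline the annotation is performed by Genome Nexus tooling, not by this snippet.

```python
from pathlib import Path

import pandas as pd

CHUNK_SIZE = 10_000  # assumed chunk size, for illustration only


def merge_mafs(maf_folder: Path) -> pd.DataFrame:
    """Read and concatenate every individual MAF file in a folder."""
    mafs = [
        pd.read_csv(path, sep="\t", comment="#", low_memory=False)
        for path in sorted(maf_folder.glob("*.maf"))
    ]
    return pd.concat(mafs, ignore_index=True)


def split_maf(maf: pd.DataFrame, chunk_size: int = CHUNK_SIZE) -> list[pd.DataFrame]:
    """Split the merged MAF into smaller chunks for annotation."""
    return [maf.iloc[i : i + chunk_size] for i in range(0, len(maf), chunk_size)]


def annotate_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the Genome Nexus annotation call; returns its input unchanged."""
    return chunk


if __name__ == "__main__":
    merged = merge_mafs(Path("input_mafs"))
    annotated = pd.concat(
        [annotate_chunk(chunk) for chunk in split_maf(merged)], ignore_index=True
    )
    annotated.to_csv("data_mutations_annotated.txt", sep="\t", index=False)
```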
`clinical.py` processes and transforms the iAtlas clinical data into a cBioPortal-friendly format so that it can be ingested by the cBioPortal team for visualization.

The script does the following:

- Preprocesses the data and adds required mappings like ONCOTREE or LENS_ID
- Adds clinical headers (see the sketch after this list)
- Creates the required `meta_*` data
- Creates the required case lists
- Validates the files for cBioPortal
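To illustrate the clinical-header step: cBioPortal clinical files carry four `#`-prefixed metadata rows (display names, descriptions, datatypes, priorities) above the attribute-name row. Below is a minimal sketch; the `add_clinical_header()` signature and the hard-coded attribute metadata are assumptions, since the real values come from the mapping files on Synapse.

```python
import pandas as pd

# Illustrative attribute metadata: (display name, description, datatype, priority).
# In the actual script these values come from the cli_to_cbio mapping.
ATTRIBUTES = {
    "PATIENT_ID": ("Patient Identifier", "Patient identifier", "STRING", "1"),
    "SAMPLE_ID": ("Sample Identifier", "Sample identifier", "STRING", "1"),
    "ONCOTREE_CODE": ("Oncotree Code", "Oncotree code", "STRING", "1"),
}


def add_clinical_header(df: pd.DataFrame, output_path: str) -> None:
    """Write a clinical file with the four '#'-prefixed cBioPortal header rows."""
    with open(output_path, "w") as f:
        # zip(*) transposes per-column tuples into the four header rows.
        for row in zip(*(ATTRIBUTES[col] for col in df.columns)):
            f.write("#" + "\t".join(row) + "\n")
        df.to_csv(f, sep="\t", index=False)


samples = pd.DataFrame(
    {"PATIENT_ID": ["P1"], "SAMPLE_ID": ["S1"], "ONCOTREE_CODE": ["SKCM"]}
)
add_clinical_header(samples, "data_clinical_sample.txt")
```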
Getting help:

```
python3 clinical.py --help
python3 maf.py --help
python3 load.py --help
```
This pipeline generates the following key datasets, which eventually get uploaded to Synapse and ingested by cBioPortal.

Unless otherwise stated, all datasets are saved to `<datahub_tools_path>/add-clinical-header/<dataset_name>/`.
- `data_mutations_annotated.txt` – Annotated MAF file from Genome Nexus
  - Generated by: `concatenate_mafs()`
- `data_mutations_error_report.txt` – Error report from Genome Nexus
  - Generated by: `genome_nexus`
- `meta_mutations.txt` – Metadata file for mutations data
  - Generated by: datahub-study-curation-tools' `generate-meta-files` code
- `data_clinical_patient.txt` – Clinical patient data file
  - Generated by: `add_clinical_header()`
- `data_clinical_sample.txt` – Clinical sample data file
  - Generated by: `add_clinical_header()`
- `meta_clinical_patient.txt` – Metadata file for the clinical patient data file
  - Generated by: datahub-study-curation-tools' `generate-meta-files` code
- `meta_clinical_sample.txt` – Metadata file for the clinical sample data file
  - Generated by: datahub-study-curation-tools' `generate-meta-files` code
- `meta_study.txt` – Metadata file for the entire study
  - Generated by: datahub-study-curation-tools' `generate-meta-files` code
- `cases_<cancer_type>.txt` – Case list file for each cancer type available in the clinical data
  - Saved to: `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
  - Generated by: datahub-study-curation-tools' `generate-case-lists` code
- `iatlas_validation_log.txt` – Validator results from our own iAtlas validation for all of the files
  - Generated by: updated by each validation function
- `cbioportal_validator_output.txt` – Validator results from cBioPortal for all of the files, not just clinical
  - Generated by: cBioPortal's validator code
- `cases_all.txt` – Case list file for all the clinical samples in the study
  - Saved to: `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
  - Generated by: datahub-study-curation-tools' `generate-case-lists` code
- `cases_sequenced.txt` – Case list file containing the sequenced samples (mutation) in the study
  - Saved to: `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
  - Generated by: datahub-study-curation-tools' `generate-case-lists` code
Any additional files are intermediate processing files and can be ignored.
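For reference, cBioPortal meta files are small key-value text files. Below is an illustrative `meta_mutations.txt`; the field values shown are assumptions for illustration, since the real values are produced by the `generate-meta-files` code:

```
cancer_study_identifier: <study_id>
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
stable_id: mutations
show_profile_in_analysis_tab: true
profile_name: Mutations
profile_description: Mutation data
data_filename: data_mutations_annotated.txt
```

Case list files such as `cases_all.txt` follow the same key-value layout, with a `case_list_ids` field holding the tab-separated sample IDs.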
How to Run

1. Run processing on the maf datasets via `maf.py`
2. Run processing on the clinical datasets via `clinical.py`
3. Run `load.py` to create case lists
4. Run the general validation + cBioPortal validator on your outputted files via `validate.py`
5. Check your `cbioportal_validator_output.txt`
6. Resolve any `ERROR`s
7. Repeat steps 4-6 until all `ERROR`s are gone
8. Run `load.py`, now with the `upload` flag, to upload to Synapse
Example: Sample workflow

Run clinical processing:

```
python3 clinical.py \
    --input_df_synid syn66314245 \
    --cli_to_cbio_mapping_synid syn66276162 \
    --cli_to_oncotree_mapping_synid syn66313842 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --lens_id_mapping_synid syn68826836 \
    --neoantigen_data_synid syn21841882
```
Run maf processing:

```
python3 maf.py \
    --dataset Riaz \
    --input_folder_synid syn68785881 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --n_workers 3
```
Create the case lists:

```
python3 load.py \
    --dataset Riaz \
    --output_folder_synid syn64136279 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --create_case_lists
```
Run the general iAtlas validation + cBioPortal validator on all files:

```
python3 validate.py \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --neoantigen_data_synid syn69918168 \
    --cbioportal_path /<some_path>/cbioportal/ \
    --dataset Riaz
```
Save into Synapse with version comment "v1":

```
python3 load.py \
    --dataset Riaz \
    --output_folder_synid syn64136279 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --version_comment "v1" \
    --upload
```
Tests are written via pytest.

In your Docker environment or local environment, install pytest via:

```
pip install pytest
```

Then run all tests via:

```
python3 -m pytest tests
```
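If you add tests, here is a minimal example in the pytest style; the test name and the `split_maf` helper under test are illustrative assumptions, not actual project code:

```python
# tests/test_split_maf.py -- illustrative only
import pandas as pd


def split_maf(maf: pd.DataFrame, chunk_size: int) -> list[pd.DataFrame]:
    """Stand-in for the splitting helper sketched earlier in this README."""
    return [maf.iloc[i : i + chunk_size] for i in range(0, len(maf), chunk_size)]


def test_split_maf_preserves_rows():
    maf = pd.DataFrame({"Hugo_Symbol": ["TP53", "BRAF", "KRAS"]})
    chunks = split_maf(maf, chunk_size=2)
    assert len(chunks) == 2
    assert sum(len(chunk) for chunk in chunks) == len(maf)
```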