ETL pipeline for processing and exporting iAtlas files so that they are ready for cBioPortal ingestion.
These instructions assume that you already have Python (>3.9) and a Python version manager (e.g. pyenv) installed.

Follow the directions to install `uv`. Then, from the project root:

```
uv sync
```

Now you can run commands like:

```
uv run python -V   # run python and get the version
uv run pytest      # run tests, if you have them and pytest installed
```
Prior to testing, developing, or running this locally, you will need to set up the Docker image.
Optional: you can also build your environment via a Python venv and install the dependencies from the `uv.lock` file.

- Create and activate your venv:

  ```
  python3 -m venv <your_env_name>
  source <your_env_name>/bin/activate
  ```

- Export dependencies from `uv.lock`:

  ```
  pip install uv
  uv export > requirements.txt
  ```

- Install into your venv:

  ```
  pip install -r requirements.txt
  ```
However, it is highly recommended that you use the Docker image.

- Build the Dockerfile:

  ```
  cd /orca-recipes/local/iatlas/cbioportal_export
  docker build -f Dockerfile -t <some_docker_image_name> .
  ```

  OR pull the prebuilt image:

  ```
  docker pull ghcr.io/sage-bionetworks/iatlas-cbioportal-export:main
  ```

- Run the Docker image:

  ```
  docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>
  ```

- Follow the How to Run section below.
`maf.py` runs the iAtlas mutations data through Genome Nexus so that it can be ingested by the cBioPortal team for visualization.

The script does the following (a sketch of this flow is shown after the list):

- Reads in and merges all the individual MAFs from a given folder
- Splits the merged MAF into smaller chunks for Genome Nexus annotation
- Annotates the chunks via Genome Nexus
- Concatenates the results
- Creates the required `meta_*` data
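For orientation, here is a minimal sketch of that merge/split/annotate/concatenate flow. The helper names, the chunk size, and the pass-through annotation step are illustrative assumptions; in the real pipeline the annotation is performed by Genome Nexus tooling, not by this snippet.

```python
from pathlib import Path

import pandas as pd

CHUNK_SIZE = 10_000  # assumed chunk size, for illustration only


def merge_mafs(maf_folder: Path) -> pd.DataFrame:
    """Read and concatenate every individual MAF file in a folder."""
    mafs = [
        pd.read_csv(path, sep="\t", comment="#", low_memory=False)
        for path in sorted(maf_folder.glob("*.maf"))
    ]
    return pd.concat(mafs, ignore_index=True)


def split_maf(maf: pd.DataFrame, chunk_size: int = CHUNK_SIZE) -> list[pd.DataFrame]:
    """Split the merged MAF into smaller chunks for annotation."""
    return [maf.iloc[i : i + chunk_size] for i in range(0, len(maf), chunk_size)]


def annotate_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the Genome Nexus annotation call; returns its input unchanged."""
    return chunk


if __name__ == "__main__":
    merged = merge_mafs(Path("input_mafs"))
    annotated = pd.concat(
        [annotate_chunk(chunk) for chunk in split_maf(merged)], ignore_index=True
    )
    annotated.to_csv("data_mutations_annotated.txt", sep="\t", index=False)
```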
`clinical.py` processes and transforms the iAtlas clinical data into a cBioPortal-friendly format so that it can be ingested by the cBioPortal team for visualization.

The script does the following:

- Preprocesses the data and adds required mappings like ONCOTREE or LENS_ID
- Adds clinical headers (see the sketch after this list)
- Creates the required `meta_*` data
- Creates the required case lists
- Validates the files for cBioPortal
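To illustrate the clinical-header step: cBioPortal clinical files carry four `#`-prefixed metadata rows (display names, descriptions, datatypes, priorities) above the attribute-name row. Below is a minimal sketch; the `add_clinical_header()` signature and the hard-coded attribute metadata are assumptions, since the real values come from the mapping files on Synapse.

```python
import pandas as pd

# Illustrative attribute metadata: (display name, description, datatype, priority).
# In the actual script these values come from the cli_to_cbio mapping.
ATTRIBUTES = {
    "PATIENT_ID": ("Patient Identifier", "Patient identifier", "STRING", "1"),
    "SAMPLE_ID": ("Sample Identifier", "Sample identifier", "STRING", "1"),
    "ONCOTREE_CODE": ("Oncotree Code", "Oncotree code", "STRING", "1"),
}


def add_clinical_header(df: pd.DataFrame, output_path: str) -> None:
    """Write a clinical file with the four '#'-prefixed cBioPortal header rows."""
    with open(output_path, "w") as f:
        # zip(*) transposes per-column tuples into the four header rows.
        for row in zip(*(ATTRIBUTES[col] for col in df.columns)):
            f.write("#" + "\t".join(row) + "\n")
        df.to_csv(f, sep="\t", index=False)


samples = pd.DataFrame(
    {"PATIENT_ID": ["P1"], "SAMPLE_ID": ["S1"], "ONCOTREE_CODE": ["SKCM"]}
)
add_clinical_header(samples, "data_clinical_sample.txt")
```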
Getting help:

```
python3 clinical.py --help
python3 maf.py --help
python3 load.py --help
```
This pipeline generates the following key datasets, which eventually get uploaded to Synapse and ingested by cBioPortal.

Unless otherwise stated, all datasets are saved to `<datahub_tools_path>/add-clinical-header/<dataset_name>/`.
- `data_mutations_annotated.txt` – Annotated MAF file from Genome Nexus
  - Generated by: `concatenate_mafs()`
- `data_mutations_error_report.txt` – Error report from Genome Nexus
  - Generated by: `genome_nexus`
- `meta_mutations.txt` – Metadata file for mutations data
  - Generated by: datahub-study-curation-tools' `generate-meta-files` code
- `data_clinical_patient.txt` – Clinical patient data file
  - Generated by: `add_clinical_header()`
- `data_clinical_sample.txt` – Clinical sample data file
  - Generated by: `add_clinical_header()`
- `meta_clinical_patient.txt` – Metadata file for the clinical patient data file
  - Generated by: datahub-study-curation-tools' `generate-meta-files` code
- `meta_clinical_sample.txt` – Metadata file for the clinical sample data file
  - Generated by: datahub-study-curation-tools' `generate-meta-files` code
- `meta_study.txt` – Metadata file for the entire study
  - Generated by: datahub-study-curation-tools' `generate-meta-files` code
- `cases_<cancer_type>.txt` – Case list file for each cancer type available in the clinical data
  - Saved to: `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
  - Generated by: datahub-study-curation-tools' `generate-case-lists` code
- `iatlas_validation_log.txt` – Validator results from our own iAtlas validation for all of the files
  - Generated by: updated by each validation function
- `cbioportal_validator_output.txt` – Validator results from cBioPortal for all of the files, not just clinical
  - Generated by: cBioPortal's validator code
- `cases_all.txt` – Case list file for all the clinical samples in the study
  - Saved to: `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
  - Generated by: datahub-study-curation-tools' `generate-case-lists` code
- `cases_sequenced.txt` – Case list file containing the sequenced samples (mutation) in the study
  - Saved to: `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
  - Generated by: datahub-study-curation-tools' `generate-case-lists` code
Any additional files are intermediate processing files and can be ignored.
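For reference, cBioPortal meta files are small key-value text files. Below is an illustrative `meta_mutations.txt`; the field values shown are assumptions for illustration, since the real values are produced by the `generate-meta-files` code:

```
cancer_study_identifier: <study_id>
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
stable_id: mutations
show_profile_in_analysis_tab: true
profile_name: Mutations
profile_description: Mutation data
data_filename: data_mutations_annotated.txt
```

Case list files such as `cases_all.txt` follow the same key-value layout, with a `case_list_ids` field holding the tab-separated sample IDs.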
How to Run

1. Run processing on the maf datasets via `maf.py`
2. Run processing on the clinical datasets via `clinical.py`
3. Run `load.py` to create case lists
4. Run the general validation + cBioPortal validator on your outputted files via `validate.py`
5. Check your `cbioportal_validator_output.txt`
6. Resolve any `ERROR`s
7. Repeat steps 4-6 until all `ERROR`s are gone
8. Run `load.py`, now with the `upload` flag, to upload to Synapse
Example: Sample workflow

Run clinical processing:

```
python3 clinical.py \
    --input_df_synid syn66314245 \
    --cli_to_cbio_mapping_synid syn66276162 \
    --cli_to_oncotree_mapping_synid syn66313842 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --lens_id_mapping_synid syn68826836 \
    --neoantigen_data_synid syn21841882
```
Run maf processing:

```
python3 maf.py \
    --dataset Riaz \
    --input_folder_synid syn68785881 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --n_workers 3
```
Create the case lists:

```
python3 load.py \
    --dataset Riaz \
    --output_folder_synid syn64136279 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --create_case_lists
```
Run the general iAtlas validation + cBioPortal validator on all files:

```
python3 validate.py \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --neoantigen_data_synid syn69918168 \
    --cbioportal_path /<some_path>/cbioportal/ \
    --dataset Riaz
```
Save into Synapse with version comment "v1":

```
python3 load.py \
    --dataset Riaz \
    --output_folder_synid syn64136279 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --version_comment "v1" \
    --upload
```
Tests are written via pytest.

In your Docker environment or local environment, install pytest via:

```
pip install pytest
```

Then run all tests via:

```
python3 -m pytest tests
```
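If you add tests, here is a minimal example in the pytest style; the test name and the `split_maf` helper under test are illustrative assumptions, not actual project code:

```python
# tests/test_split_maf.py -- illustrative only
import pandas as pd


def split_maf(maf: pd.DataFrame, chunk_size: int) -> list[pd.DataFrame]:
    """Stand-in for the splitting helper sketched earlier in this README."""
    return [maf.iloc[i : i + chunk_size] for i in range(0, len(maf), chunk_size)]


def test_split_maf_preserves_rows():
    maf = pd.DataFrame({"Hugo_Symbol": ["TP53", "BRAF", "KRAS"]})
    chunks = split_maf(maf, chunk_size=2)
    assert len(chunks) == 2
    assert sum(len(chunk) for chunk in chunks) == len(maf)
```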