
iatlas-cbioportal-export

ETL pipeline for processing and exporting iAtlas files to be cBioPortal ingestion-ready

Table of Contents

  • Setup
  • Overview
  • How to Run
  • Outputs
  • General Workflow
  • Running tests

↑ Back to top

Setup

These instructions assume that you already have Python (>3.9) and a Python version manager (e.g., pyenv) installed.

Setup locally with uv

Follow the directions to install uv.

From the project root:

uv sync

Now you can run commands like:

uv run python -V                        # run python and get version
uv run pytest                           # run tests, if you have them and pytest installed

Prior to testing/developing/running this locally, you will need to set up the Docker image (see the Docker section below). Optional: you can also build your environment via a Python venv and install from the uv.lock file (see the pip section below).

Setup locally with pip

  1. Create and activate your venv

python3 -m venv <your_env_name>
source <your_env_name>/bin/activate

  2. Export dependencies from uv.lock

pip install uv
uv export > requirements.txt

  3. Install into your venv

pip install -r requirements.txt

Setup locally with Docker (recommended)

It is highly recommended that you use the Docker image.

  1. Build the Docker image

cd /orca-recipes/local/iatlas/cbioportal_export
docker build -f Dockerfile -t <some_docker_image_name> .

OR

  1. Pull the pre-existing Docker image for your branch

docker pull ghcr.io/sage-bionetworks/iatlas-cbioportal-export:main

  2. Run the Docker image

docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>

  3. Follow the How to Run section below

↑ Back to top

Overview

maf.py

This script runs the iAtlas mutations data through Genome Nexus so that it can be ingested by the cBioPortal team for visualization.

The script does the following:

  1. Reads in and merges all the individual MAFs from a given folder
  2. Splits the merged MAF into smaller chunks for Genome Nexus annotation (see the sketch after this list)
  3. Annotates via Genome Nexus
  4. Concatenates the results
  5. Creates the required meta_* data
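
For intuition, here is a minimal sketch of the chunking step (step 2). The function name, chunk size, and file naming are hypothetical and may differ from the script's actual implementation:

import pandas as pd

def split_maf_into_chunks(maf_path: str, chunk_size: int = 50_000) -> list:
    # Hypothetical sketch: split a merged MAF into smaller tab-delimited
    # files so each can be sent to Genome Nexus for annotation.
    maf = pd.read_csv(maf_path, sep="\t", comment="#", low_memory=False)
    chunk_paths = []
    for i, start in enumerate(range(0, len(maf), chunk_size)):
        chunk_path = f"maf_chunk_{i}.txt"
        maf.iloc[start : start + chunk_size].to_csv(chunk_path, sep="\t", index=False)
        chunk_paths.append(chunk_path)
    return chunk_paths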

clinical.py

This script processes and transforms the iAtlas clinical data into a cBioPortal-friendly format so that it can be ingested by the cBioPortal team for visualization.

The script does the following:

  1. Preprocesses the data and adds required mappings like ONCOTREE or LENS_ID
  2. Adds clinical headers (illustrated after this list)
  3. Creates the required meta_* data
  4. Creates the required caselists
  5. Validates the files for cBioPortal
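
For context, cBioPortal clinical files require four '#'-prefixed header rows (display names, descriptions, datatypes, priorities) above the column names. Here is a minimal sketch of what adding clinical headers conceptually involves; the columns and attribute values are illustrative, not the script's actual mapping:

import pandas as pd

def write_with_clinical_header(df: pd.DataFrame, path: str) -> None:
    # Illustrative sketch: prepend the four '#'-prefixed header rows that
    # cBioPortal expects, then write the tab-delimited data beneath them.
    header_lines = [
        "#Patient Identifier\tSex",  # display names
        "#Patient identifier\tSex",  # descriptions
        "#STRING\tSTRING",           # datatypes
        "#1\t1",                     # priorities
    ]
    with open(path, "w") as f:
        f.write("\n".join(header_lines) + "\n")
        df.to_csv(f, sep="\t", index=False)

write_with_clinical_header(
    pd.DataFrame({"PATIENT_ID": ["P1", "P2"], "SEX": ["Female", "Male"]}),
    "data_clinical_patient.txt",
)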

↑ Back to top

How to Run

Getting help

python3 clinical.py --help
python3 maf.py --help
python3 load.py --help

↑ Back to top

Outputs

This pipeline generates the following key datasets that eventually get uploaded to Synapse and ingested by cBioPortal. All datasets will be saved to <datahub_tools_path>/add-clinical-header/<dataset_name>/ unless otherwise stated.

maf.py

  • data_mutations_annotated.txt – Annotated MAF file from Genome Nexus

    • Generated by: concatenate_mafs()
  • data_mutations_error_report.txt – Error report from Genome Nexus

    • Generated by: genome_nexus
  • meta_mutations.txt – Metadata file for mutations data (format example below)

    • Generated by: datahub-study-curation-tools' generate-meta-files code
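
For reference, a cBioPortal meta_mutations.txt generally looks like the following; the study identifier and profile description here are illustrative:

cancer_study_identifier: <study_id>
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
stable_id: mutations
show_profile_in_analysis_tab: true
profile_name: Mutations
profile_description: Mutation data for the study
data_filename: data_mutations_annotated.txt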

clinical.py

  • data_clinical_patient.txt – Clinical patient data file

    • Generated by: add_clinical_header()
  • data_clinical_sample.txt – Clinical sample data file

    • Generated by: add_clinical_header()
  • meta_clinical_patient.txt – Metadata file for clinical patient data file

    • Generated by: datahub-study-curation-tools' generate-meta-files code
  • meta_clinical_sample.txt – Metadata file for clinical sample data file

    • Generated by: datahub-study-curation-tools' generate-meta-files code
  • meta_study.txt – Metadata file for the entire study

    • Generated by: datahub-study-curation-tools' generate-meta-files code
  • cases_<cancer_type>.txt – case list files for each cancer type available in the clinical data

    • <datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/
    • Generated by: datahub-study-curation-tools' generate-case-lists code

validate.py

  • iatlas_validation_log.txt – Validation results from our own iAtlas checks for all of the files

    • Generated by: each validation function (the log is updated incrementally)
  • cbioportal_validator_output.txt – Validation results from cBioPortal for all of the files, not just clinical

    • Generated by: cBioPortal's validator code

load.py

  • cases_all.txt – case list file for all the clinical samples in the study (format example below)

    • <datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/
    • Generated by: datahub-study-curation-tools' generate-case-lists code
  • cases_sequenced.txt – case list file containing the sequenced (mutation) samples in the study

    • <datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/
    • Generated by: datahub-study-curation-tools' generate-case-lists code
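
For reference, a cBioPortal case list file generally looks like the following; the study identifier and sample IDs are illustrative, and case_list_ids is tab-separated:

cancer_study_identifier: <study_id>
stable_id: <study_id>_all
case_list_name: All samples
case_list_description: All samples in the study
case_list_ids: SAMPLE_1  SAMPLE_2  SAMPLE_3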

Any additional files are the intermediate processing files and can be ignored.

↑ Back to top

General Workflow

  1. Run processing on the MAF datasets via maf.py
  2. Run processing on the clinical datasets via clinical.py
  3. Run load.py to create case lists
  4. Run the general validation + cBioPortal validator on your output files via validate.py
  5. Check your cbioportal_validator_output.txt
  6. Resolve any ERRORs
  7. Repeat steps 4-6 until all ERRORs are gone
  8. Run load.py again, now with the upload flag, to upload to Synapse
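
A quick way to check whether any ERRORs remain (steps 5-6), assuming the validator output is in your working directory:

grep "ERROR" cbioportal_validator_output.txt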

Example workflow

Run clinical processing

python3 clinical.py \
    --input_df_synid syn66314245 \
    --cli_to_cbio_mapping_synid syn66276162 \
    --cli_to_oncotree_mapping_synid syn66313842 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --lens_id_mapping_synid syn68826836 \
    --neoantigen_data_synid syn21841882

Run maf processing

python3 maf.py \
    --dataset Riaz \
    --input_folder_synid syn68785881 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --n_workers 3

Create the case lists

python3 load.py \
    --dataset Riaz \
    --output_folder_synid syn64136279 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --create_case_lists

Run the general iAtlas validation + cBioPortal validator on all files

python3 validate.py \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --neoantigen_data_synid syn69918168 \
    --cbioportal_path /<some_path>/cbioportal/ \
    --dataset Riaz

Save into Synapse with version comment v1

python3 load.py \
    --dataset Riaz \
    --output_folder_synid syn64136279 \
    --datahub_tools_path /<some_path>/datahub-study-curation-tools \
    --version_comment "v1" \
    --upload

↑ Back to top

Running tests

Tests are written via pytest.

In your Docker environment or local environment, install pytest via

pip install pytest

Then run all tests via

python3 -m pytest tests
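
If you set up the project with uv instead, the equivalent is:

uv run pytest tests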
