diff --git a/.devcontainer/devcontainer.json b/.devcontainer/devcontainer.json deleted file mode 100644 index 07612b76..00000000 --- a/.devcontainer/devcontainer.json +++ /dev/null @@ -1,6 +0,0 @@ -{ - "image": "ghcr.io/sage-bionetworks/genie:develop", - "mounts": [ - "source=${localEnv:HOME}/.synapseConfig,target=/root/.synapseConfig,type=bind,consistency=cached" - ] -} diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 3f403484..a0be4f51 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -3,6 +3,22 @@ We welcome all contributions! Please head to [issues](https://github.com/Sage-Bionetworks/Genie/issues) to either file any bugs/feature requests or find a task you want to assist with. Make sure to assign yourself the task if you decide to work on it. +## Table of Contents + +- [Coding Style](#coding-style) +- [The Development Life Cycle](#the-development-life-cycle) + - [Fork and clone this repository](#fork-and-clone-this-repository) + - [Install development dependencies](#install-development-dependencies) + - [Developing](#developing) + - [Testing](#testing) + - [Running tests](#running-tests) + - [Tests in Python](#tests-in-python) + - [Tests in R](#tests-in-r) + - [Test Development](#test-development) + - [Mock Testing](#mock-testing) + - [Release Procedure (For Package Maintainers)](#release-procedure-for-package-maintainers) + - [Contributing to the docs](#contributing-to-the-docs) + ## Coding Style This package uses `flake8` - it's settings are described in [setup.cfg](setup.cfg). The code in this package is also automatically formatted by `black` for consistency. @@ -23,22 +39,7 @@ This package uses `flake8` - it's settings are described in [setup.cfg](setup.cf ### Install development dependencies -This will install all the dependencies of the package including the active branch of `Genie`. 
We highly recommend that you leverage some form of python version management like [pyenv](https://github.com/pyenv/pyenv) or [anaconda](https://www.anaconda.com/products/individual). There are two ways you can install the dependencies for this package. - -#### pip -This is the more traditional way of installing dependencies. Follow instructions [here](https://pip.pypa.io/en/stable/installation/) to learn how to install pip. - -``` -pip install -r requirements-dev.txt -pip install -r requirements.txt -``` - -#### pipenv -`pipenv` is a Python package manager. Learn more about [pipenv](https://pipenv.pypa.io/en/latest/) and how to install it. - -``` -# Coming soon -``` +This will install all the dependencies of the package including the active branch of `Genie`. We highly recommend that you leverage some form of python version management like [pyenv](https://github.com/pyenv/pyenv) or [anaconda](https://www.anaconda.com/products/individual). Follow [dependencies installation instruction here](./README.md#running-locally) ### Developing @@ -54,17 +55,17 @@ The GENIE project follows the standard [git flow](https://www.atlassian.com/git/ git pull upstream develop ``` -1. Create a feature branch which off the `develop` branch. If there is a GitHub/JIRA issue that you are addressing, name the branch after the issue with some more detail (like `{GH|JIRA}-123-add-some-new-feature`). +1. Create a feature branch which off the `develop` branch. If there is a GitHub/JIRA issue that you are addressing, name the branch after the issue with some more detail (like `{GH|GEN}-123-add-some-new-feature`). ``` git checkout develop - git checkout -b JIRA-123-new-feature + git checkout -b GEN-123-new-feature ``` -1. At this point, you have only created the branch locally, you need to push this to your fork on GitHub. +1. At this point, you have only created the branch locally, you need to push this remotely to Github. 
``` - git push --set-upstream origin JIRA-123-new-feature + git push ``` You should now be able to see the branch on GitHub. Make commits as you deem necessary. It helps to provide useful commit messages - a commit message saying 'Update' is a lot less helpful than saying 'Remove X parameter because it was unused'. @@ -92,11 +93,8 @@ The GENIE project follows the standard [git flow](https://www.atlassian.com/git/ This package uses [semantic versioning](https://semver.org/) for releasing new versions. The version should be updated on the `develop` branch as changes are reviewed and merged in by a code maintainer. The version for the package is maintained in the [genie/__init__.py](genie/__init__.py) file. A github release should also occur every time `develop` is pushed into `main` and it should match the version for the package. -### Testing - -#### Running test pipeline -Make sure to run each of the [pipeline steps here](README.md#developing-locally) on the test pipeline and verify that your pipeline runs as expected. This is __not__ automatically run by Github Actions and have to be manually run. +### Testing #### Running tests @@ -110,8 +108,6 @@ Here's how to run the test suite: pytest -vs tests/ ``` -Tests in Python are also run automatically by Github Actions on any pull request and are required to pass before merging. - ##### Tests in R This package uses [`testthat`](https://testthat.r-lib.org/) to run tests in R. The test code is located in the [testthat](./R/tests/testthat) subdirectory. @@ -166,18 +162,6 @@ Follow gitflow best practices as linked above. 12. Push changes in `develop`. 13. Wait for the CI/CD to finish. -### Modifying Docker - -Follow this section when modifying the [Dockerfile](https://github.com/Sage-Bionetworks/Genie/blob/main/Dockerfile): - -1. Have your synapse authentication token handy -1. ```docker build -f Dockerfile -t .``` -1. ```docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN ``` -1. 
Run [test code](README.md#developing-locally) relevant to the dockerfile changes to make sure changes are present and working -1. Once changes are tested, follow [genie contributing guidelines](#developing) for adding it to the repo -1. Once deployed to main, make sure the CI/CD build successfully completed (our docker image gets automatically deployed via Github Actions CI/CD) [here](https://github.com/Sage-Bionetworks/Genie/actions/workflows/ci.yml) -1. Check that your docker image got successfully deployed [here](https://github.com/Sage-Bionetworks/Genie/pkgs/container/genie) - ### Contributing to the docs This [documentation](https://sagebionetworks.jira.com/wiki/spaces/APGD/pages/3369631808/Contributing+to+Main+GENIE+repository+docs) is internal to Sage employees. diff --git a/README.md b/README.md index fb3b69a1..df4d1d52 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,33 @@ [![GHCR Docker Package](https://img.shields.io/badge/ghcr.io-sage--bionetworks%2Fgenie-blue?style=for-the-badge&logo=github)](https://github.com/orgs/sage-bionetworks/packages/container/package/genie) [![GitHub CI](https://img.shields.io/github/actions/workflow/status/Sage-Bionetworks/Genie/ci.yml?branch=develop&style=for-the-badge&logo=github)](https://github.com/Sage-Bionetworks/Genie) + +## Table of Contents + +- [Introduction](#introduction) +- [Documentation](#documentation) +- [Dependencies](#dependencies) +- [File Validator](#file-validator) + - [Setting up your environment](#setting-up-your-environment) + - [Running the validator](#running-the-validator) + - [Example commands](#example-commands) +- [Contributing](#contributing) +- [Sage Bionetworks Only](#sage-bionetworks-only) + - [Running locally](#running-locally) + - [Using conda](#using-conda) + - [Using pipenv](#using-pipenv) + - [Using docker (**HIGHLY** Recommended)](#using-docker-highly-recommended) + - [Setting up](#setting-up) + - [Developing](#developing) + - [Developing with Docker](#developing-with-docker) + - 
[Modifying Docker](#modifying-docker) +- [Testing](#testing) + - [Running unit tests](#running-unit-tests) + - [Running integration tests](#running-integration-tests) +- [Production](#production) +- [Github Workflows](#github-workflows) + + ## Introduction This repository documents code used to gather, QC, standardize, and analyze data uploaded by institutes participating in AACR's Project GENIE (Genomics, Evidence, Neoplasia, Information, Exchange). @@ -82,7 +109,6 @@ Running validator on cna file. **Note** that the flag `--nosymbol-check` is **RE genie validate data_cna_SAGE.txt SAGE --nosymbol-check ``` - ## Contributing Please view [contributing guide](CONTRIBUTING.md) to learn how to contribute to the GENIE package. @@ -90,30 +116,84 @@ Please view [contributing guide](CONTRIBUTING.md) to learn how to contribute to # Sage Bionetworks Only -## Developing locally +## Running locally -These are instructions on how you would develop and test the pipeline locally. +These are instructions on how you would setup your environment and run the pipeline locally. 1. Make sure you have read through the [GENIE Onboarding Docs](https://sagebionetworks.jira.com/wiki/spaces/APGD/pages/2163344270/Onboarding) and have access to all of the required repositories, resources and synapse projects for Main GENIE. 1. Be sure you are invited to the Synapse GENIE Admin team. 1. Make sure you are a Synapse certified user: [Certified User - Synapse User Account Types](https://help.synapse.org/docs/Synapse-User-Account-Types.2007072795.html#SynapseUserAccountTypes-CertifiedUser) +1. Be sure to clone the cbioportal repo: https://github.com/cBioPortal/cbioportal and `git checkout` the version of the repo pinned to the [Dockerfile](https://github.com/Sage-Bionetworks/Genie/blob/main/Dockerfile) +1. 
Be sure to clone the annotation-tools repo: https://github.com/Sage-Bionetworks/annotation-tools and `git checkout` the version of the repo pinned to the [Dockerfile](https://github.com/Sage-Bionetworks/Genie/blob/main/Dockerfile). + +### Using `conda` + +Follow instructions to install conda on your computer: + +Install `conda-forge` and [`mamba`](https://github.com/mamba-org/mamba) +``` +conda install -n base -c conda-forge mamba +``` + +Install Python and R versions via `mamba` +``` +mamba create -n genie_dev -c conda-forge python=3.10 r-base=4.3 +``` + +### Using `pipenv` + +Installing via [pipenv](https://pipenv.pypa.io/en/latest/installation.html) + +1. Specify a python version that is supported by this repo: + + ``` + pipenv --python + ``` + +1. [pipenv install from requirements file](https://docs.pipenv.org/en/latest/advanced.html#importing-from-requirements-txt) + +1. Activate your `pipenv`: + + ``` + pipenv shell + ``` + +### Using `docker` (**HIGHLY** Recommended) + +This is the most reproducible method even though it will be the most tedious to develop with. See [CONTRIBUTING docs for how to locally develop with docker.](/CONTRIBUTING.md). This will setup the docker image in your environment. + +1. Pull pre-existing docker image or build from Dockerfile: + Pull pre-existing docker image. You can find the list of images [from here.](https://github.com/Sage-Bionetworks/Genie/pkgs/container/genie) + ``` + docker pull + ``` + + Build from Dockerfile + ``` + docker build -f Dockerfile -t . + ``` + +1. Run docker image: + ``` + docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN + ``` + +### Setting up + 1. Clone this repo and install the package locally. + Install Python packages. This is the more traditional way of installing dependencies. Follow instructions [here](https://pip.pypa.io/en/stable/installation/) to learn how to install pip. + ``` pip install -e . 
pip install -r requirements.txt pip install -r requirements-dev.txt ``` - If you are having trouble with the above, try installing via `pipenv` - - 1. Specify a python version that is supported by this repo: - ```pipenv --python ``` - - 1. [pipenv install from requirements file](https://docs.pipenv.org/en/latest/advanced.html#importing-from-requirements-txt) - - 1. Activate your `pipenv`: - ```pipenv shell``` + Install R packages. Note that the R package setup of this is the most unpredictable so it's likely you have to manually install specific packages first before the rest of it will install. + ``` + Rscript R/install_packages.R + ``` 1. Configure the Synapse client to authenticate to Synapse. 1. Create a Synapse [Personal Access token (PAT)](https://help.synapse.org/docs/Managing-Your-Account.2055405596.html#ManagingYourAccount-PersonalAccessTokens). @@ -131,47 +211,149 @@ These are instructions on how you would develop and test the pipeline locally. synapse login ``` -1. Run the different pipelines on the test project. The `--project_id syn7208886` points to the test project. +1. Run the different steps of the pipeline on the test project. The `--project_id syn7208886` points to the test project. You should always be using the test project when developing, testing and running locally. 1. Validate all the files **excluding vcf files**: ``` - python bin/input_to_database.py main --project_id syn7208886 --onlyValidate + python3 bin/input_to_database.py main --project_id syn7208886 --onlyValidate ``` 1. Validate **all** the files: ``` - python bin/input_to_database.py mutation --project_id syn7208886 --onlyValidate --genie_annotation_pkg ../annotation-tools + python3 bin/input_to_database.py mutation --project_id syn7208886 --onlyValidate --genie_annotation_pkg ../annotation-tools ``` 1. Process all the files aside from the mutation (maf, vcf) files. The mutation processing was split because it takes at least 2 days to process all the production mutation data. 
Ideally, there is a parameter to exclude or include file types to process/validate, but that is not implemented. ``` - python bin/input_to_database.py main --project_id syn7208886 --deleteOld + python3 bin/input_to_database.py main --project_id syn7208886 --deleteOld ``` - 1. Process the mutation data. Be sure to clone this repo: https://github.com/Sage-Bionetworks/annotation-tools and `git checkout` the version of the repo pinned to the [Dockerfile](https://github.com/Sage-Bionetworks/Genie/blob/main/Dockerfile). This repo houses the code that re-annotates the mutation data with genome nexus. The `--createNewMafDatabase` will create a new mutation tables in the test project. This flag is necessary for production data for two main reasons: + 1. Process the mutation data. This command uses the `annotation-tools` repo that you cloned previously which houses the code that standardizes/merges the mutation (both maf and vcf) files and re-annotates the mutation data with genome nexus. The `--createNewMafDatabase` will create a new mutation tables in the test project. This flag is necessary for production data for two main reasons: * During processing of mutation data, the data is appended to the data, so without creating an empty table, there will be duplicated data uploaded. * By design, Synapse Tables were meant to be appended to. When a Synapse Tables is updated, it takes time to index the table and return results. This can cause problems for the pipeline when trying to query the mutation table. It is actually faster to create an entire new table than updating or deleting all rows and appending new rows when dealing with millions of rows. * If you run this more than once on the same day, you'll run into an issue with overwriting the narrow maf table as it already exists. Be sure to rename the current narrow maf database under `Tables` in the test synapse project and try again. 
``` - python bin/input_to_database.py mutation --project_id syn7208886 --deleteOld --genie_annotation_pkg ../annotation-tools --createNewMafDatabase + python3 bin/input_to_database.py mutation --project_id syn7208886 --deleteOld --genie_annotation_pkg ../annotation-tools --createNewMafDatabase ``` - 1. Create a consortium release. Be sure to add the `--test` parameter. Be sure to clone the cbioportal repo: https://github.com/cBioPortal/cbioportal and `git checkout` the version of the repo pinned to the [Dockerfile](https://github.com/Sage-Bionetworks/Genie/blob/main/Dockerfile). For consistency, the processingDate specified here should match the one used for TEST pipeline in [nf-genie.](https://github.com/Sage-Bionetworks-Workflows/nf-genie/blob/main/main.nf) + 1. Create a consortium release. Be sure to add the `--test` parameter. For consistency, the `processingDate` specified here should match the one used in the `consortium_map` for the `TEST` key [nf-genie.](https://github.com/Sage-Bionetworks-Workflows/nf-genie/blob/main/main.nf) ``` - python bin/database_to_staging.py Jul-2022 ../cbioportal TEST --test + python3 bin/database_to_staging.py ../cbioportal TEST --test ``` - 1. Create a public release. Be sure to add the `--test` parameter. Be sure to clone the cbioportal repo: https://github.com/cBioPortal/cbioportal and `git checkout` the version of the repo pinned to the [Dockerfile](https://github.com/Sage-Bionetworks/Genie/blob/main/Dockerfile). For consistency, the processingDate specified here should match the one used for TEST pipeline in [nf-genie.](https://github.com/Sage-Bionetworks-Workflows/nf-genie/blob/main/main.nf) + 1. Create a public release. Be sure to add the `--test` parameter. 
For consistency, the `processingDate` specified here should match the one used in the `public_map` for the `TEST` key in [nf-genie.](https://github.com/Sage-Bionetworks-Workflows/nf-genie/blob/main/main.nf) ``` - python bin/consortium_to_public.py Jul-2022 ../cbioportal TEST --test + python3 bin/consortium_to_public.py ../cbioportal TEST --test ``` +## Developing + +1. Navigate to your cloned repository on your computer/server. +1. Make sure your `develop` branch is up to date with the `Sage-Bionetworks/Genie` `develop` branch. + + ``` + cd Genie + git checkout develop + git pull + ``` + +1. Create a feature branch off the `develop` branch. If there is a GitHub/JIRA issue that you are addressing, name the branch after the issue with some more detail (like `{GH|GEN}-123-add-some-new-feature`). + + ``` + git checkout -b GEN-123-new-feature + ``` + +1. At this point, you have only created the branch locally; you still need to push it to GitHub. + + ``` + git push -u origin GEN-123-new-feature + ``` + +1. Add your code changes, commit them with a useful commit message, and push them + ``` + git add + git commit changed_file.txt -m "Remove X parameter because it was unused" + git push + ``` + +1. Once you have completed all the steps above, create a pull request (PR) in GitHub from your feature branch to the `develop` branch of Sage-Bionetworks/Genie. + + +### Developing with Docker + +See [using `docker`](#using-docker-highly-recommended) for setting up the initial docker environment. + +A docker build will be created for your feature branch every time you have an open PR on GitHub and add the label `run_integration_tests` to it. + +It is recommended to develop with docker. You can either write the code changes locally, push them to your remote and wait for docker to rebuild OR do the following: + +1. Make any code changes. These cannot be dependency changes - those would require a docker rebuild. +1.
Create a running docker container with the image that you pulled down or created earlier + + ``` + docker run -d /bin/bash -c "while true; do sleep 1; done" + ``` + +1. Copy your code changes to the running docker container: + + ``` + docker cp :/root/Genie/ + ``` + +1. Open an interactive shell in the running container: + + ``` + docker exec -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN /bin/bash + ``` + +1. Run any commands or tests you need + +### Modifying Docker + +Follow this section when modifying the [Dockerfile](https://github.com/Sage-Bionetworks/Genie/blob/main/Dockerfile): + +1. Have your synapse authentication token handy +1. ```docker build -f Dockerfile -t .``` +1. ```docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN ``` +1. Run [test code](README.md#running-locally) relevant to the dockerfile changes to make sure changes are present and working +1. Once changes are tested, follow [genie contributing guidelines](#developing) for adding it to the repo +1. Once deployed to main, make sure the CI/CD build successfully completed (our docker image gets automatically deployed via Github Actions CI/CD) [here](https://github.com/Sage-Bionetworks/Genie/actions/workflows/ci.yml) +1. Check that your docker image got successfully deployed [here](https://github.com/Sage-Bionetworks/Genie/pkgs/container/genie) + + +## Testing + +Currently our Github Actions will run unit tests from our test suite `/tests` and run integration tests - each of the [pipeline steps here](README.md#running-locally) on the test pipeline. + +These are all triggered by adding the Github label `run_integration_tests` on your open PR.
+ +To trigger `run_integration_tests`: + +- Add the `run_integration_tests` label when you first open your PR +- Remove the `run_integration_tests` label and re-add it +- Push any new commits while the PR is still open + +If you are developing with docker, a docker image for your feature branch is also built via the `run_integration_tests` trigger, so check that your docker image was successfully deployed [here](https://github.com/Sage-Bionetworks/Genie/pkgs/container/genie). + +### Running unit tests + +Unit tests in Python are also run automatically by Github Actions on any PR and are required to pass before merging. + +Otherwise, if you want to add tests and run tests outside of the CI/CD, see [how to run tests and general test development](./CONTRIBUTING.md#testing) + +### Running integration tests + +See [running pipeline steps here](README.md#running-locally) if you want to run the integration tests locally. + +You can also run them in nextflow via [nf-genie](https://github.com/Sage-Bionetworks-Workflows/nf-genie/blob/main/README.md) + + ## Production The production pipeline is run on Nextflow Tower and the Nextflow workflow is captured in [nf-genie](https://github.com/Sage-Bionetworks-Workflows/nf-genie). It is wise to create an ec2 via the Sage Bionetworks service catalog to work with the production data, because there is limited PHI in GENIE.
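
The test-project steps listed under "Setting up" in the README hunk above can be sketched as a dry-run wrapper. This is a minimal sketch and not part of the repository: `run_step` and the `DRY_RUN` switch are hypothetical names introduced here for illustration, while the script paths, flags, and the `syn7208886` test project id are the ones quoted in the diff. The two release steps are left out because their processing-date argument does not appear in the new commands.

```shell
#!/bin/sh
# Dry-run sketch of the test-project pipeline steps quoted in the README diff.
set -eu

PROJECT_ID="syn7208886"               # test project id from the README
ANNOTATION_PKG="../annotation-tools"  # clone location used in the README examples
DRY_RUN="${DRY_RUN:-1}"               # hypothetical switch: 1 = print only, 0 = also execute

run_step() {
  # Always print the command; execute it only when DRY_RUN is 0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# 1. Validate all files excluding vcf files.
run_step python3 bin/input_to_database.py main --project_id "$PROJECT_ID" --onlyValidate
# 2. Validate all files, including mutation files.
run_step python3 bin/input_to_database.py mutation --project_id "$PROJECT_ID" --onlyValidate --genie_annotation_pkg "$ANNOTATION_PKG"
# 3. Process everything except the mutation files.
run_step python3 bin/input_to_database.py main --project_id "$PROJECT_ID" --deleteOld
# 4. Process the mutation files into a fresh mutation table.
run_step python3 bin/input_to_database.py mutation --project_id "$PROJECT_ID" --deleteOld --genie_annotation_pkg "$ANNOTATION_PKG" --createNewMafDatabase
```

With the default `DRY_RUN=1` the script only prints the four commands, which makes it easy to review the order and flags before running anything against Synapse. Setting `DRY_RUN=0` would execute them from the repository root, assuming the environment and Synapse credentials are already configured as described in the README.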