Commit 7d91b10

Refactor iatlas to cbioportal pipeline (#119)
* refactor order of ops
* adjust code to remove .metadata
* add docker and dep
* move all and sequenced caselist generation to load.py
* remove not needed code
* update to current workflow
* add docker setup and tests to README
* [DPE-1453] Process ANDERS clinical dataset (#122)
* add anders dataset specific filtering, convert lens map to be string vals
* address PR comments
* [DPE-1468] Add neoantigen data into clinical sample data (#124)
* initial commit for incorporating neoantigen data
* rearrange code to have a general validation script
* add tests
* remove unused code
* remove unused code
* add unit tests and docstring
* update docstring order of ops
* add indicator in logs for any error that study failed, address PR comments
1 parent 8ee8f23 commit 7d91b10

File tree: 13 files changed, +1863 / -352 lines


local/iatlas/README.md

Lines changed: 115 additions & 38 deletions
@@ -103,7 +103,7 @@ python3 local/iatlas/lens.py run --dataset_id <yaml-dataset-synapse-id> --s3_pre

### Overview

-#### maf_to_cbioportal.py
+#### maf.py
This script will run the iatlas mutations data through genome nexus so it can be ingested by the cbioportal team for visualization.

The script does the following:
@@ -115,7 +115,7 @@ The script does the following:
5. [Creates the required meta_* data](https://github.com/cBioPortal/datahub-study-curation-tools/tree/master/generate-meta-files)


-#### clinical_to_cbioportal.py
+#### clinical.py
This script will process/transform the iatlas clinical data into a cbioportal-friendly format so it can be ingested by the cbioportal team for visualization.

The script does the following:
@@ -128,18 +128,59 @@ The script does the following:


### Setup
-- `pandas` == 2.0
-- `synapseclient`==4.8.0
+Prior to testing/developing/running this locally, you will need to set up the Docker image.
+Optional: You can also build your environment via a Python venv and install from the `uv.lock` file (an alternative using `uv sync` directly is sketched after these steps).
+
+1. Create and activate your venv
+
+```
+python3 -m venv <your_env_name>
+source <your_env_name>/bin/activate
+```
+
+2. Export dependencies from uv.lock
+
+```
+pip install uv
+uv export > requirements.txt
+```
+
+3. Install into your venv
+
+```
+pip install -r requirements.txt
+```
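If you prefer to let `uv` manage the environment directly rather than exporting a `requirements.txt`, a `uv sync`-based setup should also work. This is a sketch, not a documented path; it assumes `uv` is already installed (e.g. via the `pip install uv` above) and that you run it from `local/iatlas/cbioportal_export`, where `pyproject.toml` and `uv.lock` live (the same lockfile the Dockerfile installs with `uv sync --frozen`):

```
# Assumed alternative to the export/install steps: uv reads uv.lock directly,
# creates .venv next to pyproject.toml, and installs the locked dependencies
uv sync --frozen
source .venv/bin/activate
```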
+
+However, it is highly recommended that you use the Docker image.
+
+1. Build the Docker image
+
+```
+cd /orca-recipes/local/iatlas/cbioportal_export
+docker build -f Dockerfile -t <some_docker_name> .
+```
+
+2. Run the Docker image (an optional volume-mount variant is sketched after this list)
+
+```
+docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>
+```
+
+3. Follow the **How to Run** section below
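The run command above keeps everything inside the container. If you want generated files available on the host afterwards, a bind mount can be added; this is a sketch with an assumed mount path and placeholder image name, not something the README prescribes:

```
# Assumed variant: /root/scratch inside the container maps to ./scratch on the host,
# so anything copied there survives after the container exits
docker run --rm -it \
  -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN \
  -v "$(pwd)/scratch":/root/scratch \
  <some_docker_image_name>
```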

### How to Run

Getting help
```
-python3 clinical_to_cbioportal.py --help
+python3 clinical.py --help
+```
+
+```
+python3 maf.py --help
```

```
-python3 maf_to_cbioportal.py --help
+python3 load.py --help
```

### Outputs
@@ -148,7 +189,7 @@ This pipeline generates the following key datasets that eventually get uploaded
All datasets will be saved to:
`<datahub_tools_path>/add-clinical-header/<dataset_name>/` unless otherwise stated

-#### maf_to_cbioportal.py
+#### maf.py

- `data_mutations_annotated.txt` – Annotated MAF file from genome nexus
- Generated by: `concatenate_mafs()`
@@ -160,7 +201,7 @@ All datasets will be saved to:
- Generated by: `datahub-study-curation-tools`' `generate-meta-files` code


-#### clinical_to_cbioportal.py
+#### clinical.py

- `data_clinical_patient.txt` – Clinical patient data file
- Generated by: `add_clinical_header()`
@@ -181,6 +222,17 @@ All datasets will be saved to:
- `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
- Generated by: `datahub-study-curation-tools`' `generate-case-lists` code

+
+#### validate.py
+
+- `iatlas_validation_log.txt` - Validator results from our own iatlas validation for all of the files
+- Generated by: each validation function (the log is updated as each one runs)
+
+- `cbioportal_validator_output.txt` – Validator results from cbioportal for all of the files, not just clinical
+- Generated by: `cbioportal`'s validator code
+
+#### load.py
+
- `cases_all.txt` – case list file for all the clinical samples in the study
- `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
- Generated by: `datahub-study-curation-tools`' `generate-case-lists` code
@@ -190,59 +242,84 @@ in the study
- `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
- Generated by: `datahub-study-curation-tools`' `generate-case-lists` code

-- `cbioportal_validator_output.txt` – Validator results from cbioportal for all of the files not just clinical
-- Generated by: `cbioportal`' validator code
-

Any additional files are the intermediate processing files and can be ignored.


### General Workflow

-1. Do a dry run on the maf datasets (this won't upload to Synapse).
-2. Do a dry run on the clinical datasets (this won't upload to Synapse, will run the cbioportal validator and output results from there)
-3. Check your `cbioportal_validator_output.txt` from the dry run.
-4. Resolve any `ERROR`s
-5. Repeat steps 1-3 until all `ERROR`s are gone
-6. Run the same command now without the `dry_run` flag (so you upload to Synapse) for both the clinical and maf datasets
+1. Run processing on the maf datasets via `maf.py`
+2. Run processing on the clinical datasets via `clinical.py`
+3. Run `load.py` to create case lists
+4. Run the general validation + cbioportal validator on your output files via `validate.py`
+5. Check your `cbioportal_validator_output.txt`
+6. Resolve any `ERROR`s
+7. Repeat steps 4-6 until all `ERROR`s are gone
+8. Run `load.py` now with the `upload` flag to upload to Synapse

**Example:**
-Doing a dry run on all of the datasets:
+Sample workflow

-For clinical
+Run clinical processing
```
-python3 clinical_to_cbioportal.py
+python3 clinical.py \
--input_df_synid syn66314245 \
--cli_to_cbio_mapping_synid syn66276162 \
--cli_to_oncotree_mapping_synid syn66313842 \
---output_folder_synid syn64136279 \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
---cbioportal_path /<some_path>/cbioportal
--lens_id_mapping_synid syn68826836 \
---dry_run
+--neoantigen-data-synid syn21841882
```

-For mafs
+Run maf processing
```
-python3 maf_to_cbioportal.py
+python3 maf.py \
--dataset Riaz \
--input_folder_synid syn68785881 \
---output_folder_synid syn68633933
---datahub_tools_path /<some_path>/datahub-study-curation-tools --n_workers 3
---dry_run
+--datahub_tools_path /<some_path>/datahub-study-curation-tools \
+--n_workers 3
```

-**Example:**
-Saving clinical files to synapse with comment
+Create the case lists
+```
+python3 load.py \
+--dataset Riaz \
+--output_folder_synid syn64136279 \
+--datahub_tools_path /<some_path>/datahub-study-curation-tools \
+--create_case_lists
+```

+Run the general iatlas validation + cbioportal validator on all files
```
-python3 clinical_to_cbioportal.py
---input_df_synid syn66314245 \
---cli_to_cbio_mapping_synid syn66276162
---cli_to_oncotree_mapping_synid syn66313842 \
---output_folder_synid syn64136279 \
---lens_id_mapping_synid syn68826836 \
---datahub_tools_path /some_path/datahub-study-curation-tools \
---cbioportal_path /<some_path>/cbioportal
+python3 validate.py \
+--datahub_tools_path /<some_path>/datahub-study-curation-tools \
+--neoantigen_data_synid syn69918168 \
+--cbioportal_path /<some_path>/cbioportal/ \
+--dataset Riaz
+```
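Workflow steps 5-6 come down to scanning the validator output for `ERROR` lines. A quick check along these lines may help; the path assumes the default output location described in the Outputs section above:

```
# Assumed location: the validator output sits in the dataset's output folder
grep -n "ERROR" <datahub_tools_path>/add-clinical-header/<dataset_name>/cbioportal_validator_output.txt
```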
+
+Save into Synapse with version comment `v1`
+
+```
+python3 load.py \
+--dataset Riaz \
+--output_folder_synid syn64136279 \
+--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--version_comment "v1" \
+--upload
+```
+
+### Running tests
+
+Tests are written via `pytest`.
+
+In your docker environment or local environment, install `pytest` via
+
+```
+pip install pytest
+```
+
+Then run all tests via
```
+python3 -m pytest tests
+```
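When iterating on one area, `pytest`'s `-k` expression filter can narrow the run; the keyword below is illustrative and should be matched to your actual test names:

```
# Assumed example: run only tests whose names mention "clinical", with verbose output
python3 -m pytest tests -k "clinical" -v
```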

local/iatlas/cbioportal_export/Dockerfile

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# uv + Python 3.10 preinstalled
+FROM ghcr.io/astral-sh/uv:python3.10-bookworm
+WORKDIR /root/cbioportal_export/
+
+RUN uv venv /opt/venv
+# Use the virtual environment automatically
+ENV VIRTUAL_ENV=/opt/venv
+# Place entry points in the environment at the front of the path
+ENV PATH="/opt/venv/bin:$PATH"
+
+# Install dependencies
+COPY pyproject.toml uv.lock* ./
+
+# Install exactly what's locked (fails if lock is out of date)
+RUN uv sync --frozen --no-dev
+
+# Copy application code
+COPY . .
+
+WORKDIR /root/
+
+# Clone dependency repos
+RUN git clone https://github.com/rxu17/datahub-study-curation-tools.git -b upgrade-to-python3
+RUN git clone https://github.com/cBioPortal/cbioportal.git -b v6.3.2
+
+WORKDIR /root/cbioportal_export/
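Since the Dockerfile clones the two helper repos under `/root/`, the path flags from the How to Run examples resolve to `/root/datahub-study-curation-tools` and `/root/cbioportal` inside the container. A sketch of the validation step as it would look in-container (dataset name is illustrative):

```
python3 validate.py \
--datahub_tools_path /root/datahub-study-curation-tools \
--cbioportal_path /root/cbioportal/ \
--neoantigen_data_synid syn69918168 \
--dataset Riaz
```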
