Commit 720edac

Add Carrot test for EHdn WDL (#209)
* Create test and eval WDLs of EHdn for Carrot.
* check in first draft of the Carrot utility script.
* Compare the files and set the exit code.
* Remove test WDL.
* Check-in ehdn carrot eval wdl & inputs.
* Move carrot_helper.py to the wdl_test dir.
* Add models to interact with a carrot server.
* Add creating results & mapping to template.
* Fix lint issues.
* Refactor input json files.
* Refactor .json files to match carrot namings.
* Add method for creating a test.
* Add template path to Test & sort keys when persisted in json.
* Serialize json as objects instead of a dict of dicts.
* Add the Run model & methods to create and persist runs.
* Add a method to update the status of runs & pprint.
* Add run_dir to run status & improve updating run properties.
* Run carrot in quiet mode.
* Use named args & add path option for creating runs.
* Remove unused test files.
* Use dataclass to simplify object initialization.
* Remove a comment block.
* Overhaul args parsing:
  - Simplified `run` to only accept path as input and not pipeline, template & run dir;
  - Removed `list` since this can be achieved using the `tree` tool;
  - Moved all the code used by the args to the CarrotHelper.
* Improve pretty-printing the status.
* Compute & store checksum of test & eval wdls.
* Capture the missing runs.json file exception.
* Compute & store checksum for test & eval default inputs.
* Move the code of creating template & results to sep. methods.
* Update resources if needed while loading them.
* Thorough refactor.
* Default named & positional args of calling carrot to None.
* Bug fix.
* Set dest for [sub-]commands args parser.
* Add prune command.
* Bug fix.
* Refactoring.
* Read/write from/to configs.json.
* Bug fix.
* Remove temp code.
* Add README for wdl_test.
* Rewrite based on EHdn & add ref. to docs.
* Add current limitations to the README.
* Extend description.
* Add a subcommand to [re-]configure.
* Get current working directory from __file__.
* Add the req. of invoking the script from wdl_test dir to description.
* Use urljoin for a safer uri composition.
* Add carrot_helper setup and reusable resources to the README.
* Replace urljoin with url[un]split.
1 parent f01cbb9 commit 720edac

9 files changed

Lines changed: 1213 additions & 1 deletion

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -33,3 +33,5 @@ src/svtk/svtk/utils/utils.pyc
 src/svtk/svtk/vcfcluster.pyc
 /test_inputs
 /inputs/
+/wdl_test/.runs.json
+/wdl_test/.configs.json

wdl/ExpansionHunterDenovo.wdl

Lines changed: 10 additions & 1 deletion
@@ -8,7 +8,16 @@
 
 version 1.0
 
-import "Structs.wdl"
+#import "Structs.wdl"
+# Carrot currently does not support imports.
+struct RuntimeAttr {
+  Float? mem_gb
+  Int? cpu_cores
+  Int? disk_gb
+  Int? boot_disk_gb
+  Int? preemptible_tries
+  Int? max_retries
+}
 
 struct FilenamePostfixes {
   String locus
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+version 1.0
+
+workflow EvalCaseControlLocus {
+  input {
+    File multisample_profile
+    File multisample_profile_expected
+    String docker_image
+  }
+  call RunEHdn {
+    input:
+      multisample_profile = multisample_profile,
+      multisample_profile_expected = multisample_profile_expected,
+      docker_image = docker_image
+  }
+  output {
+    File data_file = RunEHdn.data_file
+  }
+}
+
+task RunEHdn {
+  input {
+    File multisample_profile
+    File multisample_profile_expected
+    String docker_image
+  }
+  command <<<
+    cmp --silent ~{multisample_profile} ~{multisample_profile_expected} && exit 0 || exit 1
+  >>>
+  runtime {
+    docker: docker_image
+  }
+  output {
+    File data_file = stdout()
+  }
+}
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+{
+  "EvalCaseControlLocus.docker_image": "python:3.7-slim"
+}
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+  "EvalCaseControlLocus.multisample_profile": "test_output:EHdnSTRAnalysis.multisample_profile",
+  "EvalCaseControlLocus.multisample_profile_expected": "gs://broad-dsde-methods-vj/ehdn_unit_test_data/case-control/example_dataset.multisample_profile.json"
+}
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+{
+  "EHdnSTRAnalysis.analysis_type": "casecontrol",
+  "EHdnSTRAnalysis.str_comparison_type": "locus",
+  "EHdnSTRAnalysis.sample_bams_or_crams": [
+    "gs://broad-dsde-methods-vj/ehdn_unit_test_data/case-control/bamlets/sample1.bam",
+    "gs://broad-dsde-methods-vj/ehdn_unit_test_data/case-control/bamlets/sample2.bam",
+    "gs://broad-dsde-methods-vj/ehdn_unit_test_data/case-control/bamlets/sample3.bam",
+    "gs://broad-dsde-methods-vj/ehdn_unit_test_data/case-control/bamlets/sample4.bam",
+    "gs://broad-dsde-methods-vj/ehdn_unit_test_data/case-control/bamlets/sample5.bam",
+    "gs://broad-dsde-methods-vj/ehdn_unit_test_data/case-control/bamlets/sample6.bam",
+    "gs://broad-dsde-methods-vj/ehdn_unit_test_data/case-control/bamlets/sample7.bam"
+  ],
+  "EHdnSTRAnalysis.samples_status": [
+    "case",
+    "case",
+    "case",
+    "control",
+    "control",
+    "control",
+    "control"
+  ],
+  "EHdnSTRAnalysis.reference_fasta": "gs://broad-dsde-methods-vj/ehdn_unit_test_data/case-control/reference.fasta",
+  "EHdnSTRAnalysis.min_anchor_mapq": 50,
+  "EHdnSTRAnalysis.max_irr_mapq": 40
+}
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+{
+  "EHdnSTRAnalysis.ehdn_docker": "vjalili/ehdn:01"
+}

wdl_test/README.md

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
This directory contains [carrot](https://github.com/broadinstitute/carrot)
tests for the GATK-SV pipeline's WDLs; the tests are organized in folders
containing `carrot` resources (e.g., evaluation WDL, default/test inputs).
Additionally, a utility script, `carrot_helper.py`, is provided that
automates defining tests to Carrot, running them, and checking their execution
status. Generally, with tests organized in a particular folder hierarchy,
a single call to the utility script
(`python carrot_helper.py test run ./*`) executes every step, from defining
tests to running them, automatically, ideally simplifying
defining/running Carrot tests without requiring domain-specific expertise.

## Organize Tests in Directories

Test cases for a WDL need to be organized in directories, where each
directory contains separate subdirectories for test cohorts, each containing
`carrot` resources. For instance, the `ExpansionHunterDenovo.wdl`
performs case-control analysis and outlier detection on a set
of BAM files based on their short-tandem repeat (STR) profiles.
In order to test this WDL for case-control analysis using a cohort
of simulated data, `carrot` resources need to be organized as
follows for the `carrot_helper` utility to automatically
set up, run, and track `carrot` tests.

```shell
├── wdl
│   └── ExpansionHunterDenovo.wdl
└── wdl_test
    └── ExpansionHunterDenovo
        └── casecontrol
            ├── simulated_data
            │   ├── eval_input.json
            │   └── test_input.json
            ├── eval.wdl
            ├── eval_input_defaults.json
            └── test_input_defaults.json
```

Accordingly:

- Create a folder with the _same name_ as the WDL file containing the
  workflow you want to test (`ExpansionHunterDenovo` for
  `ExpansionHunterDenovo.wdl` in this example).

- Create a separate folder for every evaluation you want to perform (e.g.,
  the `casecontrol` folder to evaluate `ExpansionHunterDenovo.wdl`'s
  case-control analysis on the STR profiles of input BAM files).
  While all assertions can be part of a single evaluation, it is generally
  good practice to break assertions into smaller atomic evaluations.

- Inside every evaluation directory, create three files: `eval.wdl`,
  `eval_input_defaults.json`, and `test_input_defaults.json`.
  The `eval.wdl` WDL receives the outputs of the workflow you're testing and
  asserts their values. The JSON files provide default inputs to the test
  (`ExpansionHunterDenovo.wdl`) and `eval.wdl` WDLs. For instance, if the
  majority of the tests run `eval.wdl` on a common docker image,
  the image name can be set in `eval_input_defaults.json` and
  overridden in the tests that execute `eval.wdl` on a different docker image.

- An evaluation can be performed using different sets of inputs for the
  test and evaluation workflows. For instance, in the STR analysis scenario,
  we pass
  [seven BAM files](https://github.com/VJalili/gatk-sv/blob/89e67350ea7fec8edc687011ac7308e3e1db17ff/wdl_test/ExpansionHunterDenovo/casecontrol/simulated_data/test_input.json#L4-L12)
  to the `ExpansionHunterDenovo.wdl`, run the WDL, and pass
  [its output](https://github.com/VJalili/gatk-sv/blob/89e67350ea7fec8edc687011ac7308e3e1db17ff/wdl_test/ExpansionHunterDenovo/casecontrol/simulated_data/eval_input.json#L2)
  along with the
  [expected output](https://github.com/VJalili/gatk-sv/blob/89e67350ea7fec8edc687011ac7308e3e1db17ff/wdl_test/ExpansionHunterDenovo/casecontrol/simulated_data/eval_input.json#L3)
  to the evaluation WDL. Different combinations of inputs for the test and
  evaluation workflows are grouped under separate subdirectories (e.g.,
  the `simulated_data` subdirectory for the `casecontrol` assertion of
  `ExpansionHunterDenovo.wdl`). The inputs are specified using two JSON
  files, `test_input.json` and `eval_input.json`, containing inputs for
  the test and evaluation WDLs respectively. The files should be located
  in the subdirectory of the test cohort
  (e.g., `casecontrol/simulated_data/test_input.json`).

- In order to pass any file to the WDLs via the JSON files, the files
  need to be stored in a publicly accessible Google Cloud Storage bucket.

- In order to pass an output of the test WDL as input to the evaluation WDL,
  the value of the key should be prefixed with `test_output:` (see `carrot`'s
  [documentation](https://github.com/broadinstitute/carrot/blob/0f616c0a9933a44bb92bc9ddbc90b81b0b532de6/UserGuide.md#-mapping-test-outputs-to-eval-inputs)).
  For instance:

  ```json
  "EvalCaseControlLocus.multisample_profile": "test_output:EHdnSTRAnalysis.multisample_profile",
  ```
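
The layout and input conventions above can be illustrated with a short Python sketch. It is not the code of `carrot_helper.py`; the function names `load_json` and `collect_tests` are hypothetical, and the local merge of defaults with cohort-specific inputs only mirrors the override behavior described above. The sketch walks `<workflow>/<evaluation>/<cohort>`, layers each cohort's inputs over the evaluation-level defaults, and lists the evaluation inputs that are wired to outputs of the test WDL.

```python
import json
from pathlib import Path


def load_json(path):
    """Return the parsed JSON object, or an empty dict if the file is absent."""
    return json.loads(path.read_text()) if path.is_file() else {}


def collect_tests(wdl_test_dir):
    """Walk <workflow>/<evaluation>/<cohort> and resolve the inputs of each test.

    Hypothetical helper for illustration; not part of carrot_helper.py.
    """
    tests = []
    # An evaluation directory is any second-level folder holding an eval.wdl.
    for eval_wdl in sorted(Path(wdl_test_dir).glob("*/*/eval.wdl")):
        eval_dir = eval_wdl.parent
        test_defaults = load_json(eval_dir / "test_input_defaults.json")
        eval_defaults = load_json(eval_dir / "eval_input_defaults.json")
        for cohort in sorted(p for p in eval_dir.iterdir() if p.is_dir()):
            # Cohort-specific values override the evaluation-level defaults.
            test_inputs = {**test_defaults, **load_json(cohort / "test_input.json")}
            eval_inputs = {**eval_defaults, **load_json(cohort / "eval_input.json")}
            tests.append({
                "workflow": eval_dir.parent.name,
                "evaluation": eval_dir.name,
                "cohort": cohort.name,
                "test_inputs": test_inputs,
                "eval_inputs": eval_inputs,
                # Evaluation inputs that take their value from a test WDL output.
                "from_test_output": sorted(
                    key for key, value in eval_inputs.items()
                    if isinstance(value, str) and value.startswith("test_output:")
                ),
            })
    return tests
```

Calling `collect_tests(".")` from the `wdl_test` directory would return one entry per cohort, which can help when checking that a new test folder follows the expected hierarchy before defining it in Carrot.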

## Carrot Helper

The `carrot_helper` utility script automates a few routine tasks for
running and updating Carrot tests. This script is not a replacement
for `carrot_cli` or Carrot's API, which have more expressive power,
wider functionality, and greater generality than `carrot_helper`.

### Setup

1. Install `carrot_cli`:
   - Install the `dev` version of [`carrot_cli`](https://github.com/broadinstitute/carrot_cli)
     as follows. We install the `dev` version because `carrot_helper` relies on
     unreleased features of `carrot_cli`.

     ```shell
     git clone https://github.com/broadinstitute/carrot_cli/
     cd carrot_cli
     pip install -r dev-requirements.txt
     pip install -e .
     ```

   - [Configure `carrot_cli`](https://github.com/broadinstitute/carrot/blob/master/UserGuide.md#-carrot-cli):
     configure it to access a [Carrot server](https://github.com/broadinstitute/carrot)
     and set your email address.

2. Install the latest version of
   [`womtool`](https://github.com/broadinstitute/cromwell/releases).

3. Set up `carrot_helper.py` by executing the following commands and providing
   values for the prompts:

   ```shell
   $ cd gatk-sv/wdl_test
   $ python carrot_helper.py config
   ```

   Carrot fetches the test and evaluation WDLs for every test from
   a publicly accessible GitHub repository. Therefore, in order to define/update
   tests, `carrot_helper` needs to know the GitHub repository and the git
   branch where the test and evaluation WDLs are available. If you want to run
   existing tests, you may use `https://github.com/broadinstitute/gatk-sv` and
   `master` for repository and branch respectively. If you are developing
   a carrot test for a WDL, then you may set the repository to your fork
   of `github.com/broadinstitute/gatk-sv` and set the branch to your feature
   branch.

### Run Carrot Helper

```shell
cd wdl_test
python carrot_helper.py test run ./*
```
_Note that the script should be invoked from the `wdl_test` directory._

The above command defines every test (in the directory structure
discussed above) to Carrot and runs them all. Information about the created
and executed tests is persisted in the `.carrot_pipelines.json` and `.runs.json`
files.

You can specify a single test to run; for instance:

```shell
python carrot_helper.py test run STRAnalyzer/comparative/real_cohort
```

Or you may use wildcards to specify particular tests to run. For instance:

```shell
python carrot_helper.py test run STRAnalyzer/*/real_cohort
```

To check the status of the runs, use the following command.

```shell
python carrot_helper.py test update_status
```
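
In most POSIX shells an unquoted wildcard is expanded by the shell before the script starts, so the script receives the matching paths rather than the pattern itself. To preview which test directories a pattern would select, a small Python sketch along these lines may help; `expand_test_paths` is a hypothetical name and is not part of `carrot_helper.py`.

```python
import glob
import os


def expand_test_paths(pattern):
    """Return the directories matched by a wildcard pattern, sorted for stable output.

    Hypothetical helper for illustration; it mimics the shell's expansion
    using the standard-library glob module.
    """
    return sorted(path for path in glob.glob(pattern) if os.path.isdir(path))
```

For example, `expand_test_paths("STRAnalyzer/*/real_cohort")` lists every `real_cohort` directory one level below `STRAnalyzer`, relative to the current working directory.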

### Reusable Resources

`carrot_helper.py` persists metadata about the carrot resources it
creates (e.g.,
[pipeline](https://github.com/broadinstitute/carrot/blob/master/UserGuide.md#-pipeline),
[template](https://github.com/broadinstitute/carrot/blob/master/UserGuide.md#-template),
[test](https://github.com/broadinstitute/carrot/blob/master/UserGuide.md#-test),
[result](https://github.com/broadinstitute/carrot/blob/master/UserGuide.md#-result)
and any necessary mapping between them) in the `.carrot_pipelines.json` file.

The `.carrot_pipelines.json` file tracked on git contains metadata belonging
to the `carrot` resources defined for tests and WDLs available from the
`master` branch of the
[`github.com/broadinstitute/gatk-sv`](https://github.com/broadinstitute/gatk-sv)
repository on a `carrot` server maintained for internal use at the Broad
Institute. You may use this file to run and update (see below)
tests if you have access to Broad's VPN. Otherwise, you may remove or rename
the `.carrot_pipelines.json` file, **without tracking the changes on git**,
and let `carrot_helper.py` create resources on the `carrot` server for
the repository and branch [you have configured](#setup).

`carrot_helper.py` automatically initializes and updates the
`.carrot_pipelines.json` file. When `carrot_helper.py test run` is invoked,
the script traverses the `wdl_test` directory and initializes/updates `carrot`
resources if any of the test or evaluation WDLs or their inputs have
changed. Carrot reads test and evaluation WDLs from GitHub; therefore,
make sure you commit and push changes to your branch when updating
test and evaluation WDLs.
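
Detecting whether a WDL or an input file has changed can be done by comparing checksums, as the commit message above mentions. The following is a minimal sketch of that idea only, not the implementation in `carrot_helper.py`: the function names, the use of SHA-256, and the layout of the state file are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks.

    SHA-256 is an assumption; the source only says "checksum".
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths, state_file):
    """Return the paths whose checksum differs from the stored one.

    The stored checksums are then refreshed, so a second call without any
    modification reports nothing. The state file is a hypothetical JSON
    mapping of path to checksum.
    """
    state_path = Path(state_file)
    known = json.loads(state_path.read_text()) if state_path.is_file() else {}
    current = {str(path): file_checksum(path) for path in paths}
    changed = [path for path, checksum in current.items() if known.get(path) != checksum]
    known.update(current)
    state_path.write_text(json.dumps(known, indent=2, sort_keys=True))
    return changed
```

A file that was never seen before counts as changed, which matches the "initialize or update" behavior described above: the first run reports every file, and later runs report only those that were edited.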

### Carrot Report

Carrot can pass the output of an evaluation workflow to a Jupyter notebook,
which enables more in-depth evaluations/assertions and visualizations.
In general, this requires defining a template notebook
(ideally separate notebooks for each test to have test-specific visualization),
defining a `report` in carrot, and mapping a template to the report.
Please refer to the [Carrot documentation](https://github.com/broadinstitute/carrot/blob/48c58446d4fb044cabbdafe8962b67ee511b483a/UserGuide.md#-2-define-a-report-in-carrot) for details.
`carrot_helper` does not currently support defining a `report`.

### Current limitations

Carrot is under active development, and new functionality emerges
as new versions are released. A few features that are
under development and not yet released limit the workflows that
can be tested using `carrot`. Specifically, Carrot does not currently
support relative imports in WDL files (i.e., importing a workflow from
a WDL file that is provided via the `--imports` argument of `cromwell`).
A workaround is to host the required imports on a Google Cloud Storage
bucket and import them using the object's URL. However, this would require
modifying all the WDLs of the GATK-SV pipeline. The carrot team is working
on supporting an `--imports`-like functionality in carrot.

Additionally, `carrot` does not currently support `Array` type outputs
(e.g., `Array[File]`). In other words, the array type outputs of a
test WDL cannot be passed to evaluation WDLs for assertions. A workaround
is to encapsulate the array output in a zip archive, so that the test WDL outputs
a single file, and to extract the content of the archive in the eval WDL. This workaround
would require a significant modification to GATK-SV pipeline workflows; hence,
we currently do not assert array type outputs.
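
The zip workaround for array outputs can be sketched in Python as follows. This only illustrates the idea and is not code from the GATK-SV workflows; `pack_outputs` and `unpack_outputs` are hypothetical names, and in practice the packing would run inside the test WDL's task and the unpacking inside the eval WDL's task.

```python
import zipfile
from pathlib import Path


def pack_outputs(files, archive_path):
    """Bundle several output files into a single zip archive.

    Files are stored under their base names, so inputs that share a base
    name would collide; this sketch assumes the names are unique.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_path in files:
            archive.write(file_path, arcname=Path(file_path).name)
    return archive_path


def unpack_outputs(archive_path, target_dir):
    """Extract the archive and return the sorted paths of the extracted files."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return sorted(str(Path(target_dir) / name) for name in archive.namelist())
```

With this approach the test workflow exposes one `File` output (the archive) in place of an `Array[File]`, and the evaluation side recovers the individual files before asserting on them.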
