merge_bams_bulk #6
New file (@@ -0,0 +1,69 @@):
| import "tasks_demux.wdl" as demux | ||
|
|
||
| workflow merge_bams_bulk { | ||
| Array[File]+ in_bams # any order | ||
| File in_bam_out_bam_table # first column: input bam file basename, second column: output bam file basename (one line per INPUT file) | ||
| File? reheader_table | ||
| String? docker="quay.io/broadinstitute/viral-core" | ||
|
|
||
| # removes ".bam"s from ends of filenames in in_bam_out_bam_table | ||
| call clean_in_bam_out_bam_table { | ||
| input: table = in_bam_out_bam_table | ||
| } | ||
|
|
||
| # generates map with key: input bam file name -> value: output bam file basename | ||
| Map[String, String] in_bam_to_out_bam = read_map(clean_in_bam_out_bam_table.clean_table) | ||
|
|
||
| # retrieves unique output bam file basenames (no repeats) | ||
| call unique_values_in_second_column { | ||
| input: table = clean_in_bam_out_bam_table.clean_table | ||
| } | ||
| Array[String] out_bams = unique_values_in_second_column.unique_values | ||
|
|
||
| # collects and merges input bam files for each output bam file | ||
| scatter (out_bam in out_bams) { | ||
| String placeholder = write_map(in_bam_to_out_bam) # need to touch map in outer scatter for it to be seen in inner scatter | ||
|
|
||
| # retrieves the input bam files for this output bam file | ||
| scatter (in_bam in in_bams) { | ||
| String in_bam_basename = basename(in_bam, ".bam") | ||
| if(in_bam_to_out_bam[in_bam_basename] == out_bam) { | ||
| File relevant_in_bam = in_bam | ||
| } | ||
| } | ||
| Array[File] relevant_in_bams = select_all(relevant_in_bam) # gathers input bam files from the scatter | ||
|
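For concreteness, the in_bam_out_bam_table input might look like the following (filenames hypothetical; columns tab-separated; the cleaning task below makes the ".bam" extensions optional):

```
runA.sample1.bam	sample1.merged.bam
runB.sample1.bam	sample1.merged.bam
runA.sample2.bam	sample2.merged.bam
```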
Comment on lines +24 to +33

Member:
Fascinating -- as I stare at this, it appears you're basically trying to pull off two things in these particular lines of code highlighted here: looking up each input bam's destination in the map, and gathering the input bams that belong to each output into one array. Is it possible we could avoid all this if the workflow took, as input, a complex type that encodes the grouping directly (something like a Map from output basename to input bams)? Then, if there's really a UI-based desire to present the input mapping as a table with filenames, perhaps we make another WDL workflow (merge_bams_bulk_from_tsv or something like that?) that reads the input text table and converts it into that structure before merging.
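A sketch of what that suggestion might look like, assuming a newer WDL version (1.1+) where as_pairs() lets you scatter over a map, and leaving aside whether the existing draft-2 tasks file could be imported from a 1.1 workflow; the workflow name and the input name out_bam_to_in_bams are hypothetical:

```wdl
version 1.1

import "tasks_demux.wdl" as demux

workflow merge_bams_bulk_direct {
    input {
        # hypothetical input: output bam basename -> input bams to merge
        Map[String, Array[File]] out_bam_to_in_bams
        File? reheader_table
    }

    # one merge per output bam; no filename matching needed
    scatter (pair in as_pairs(out_bam_to_in_bams)) {
        call demux.merge_and_reheader_bams {
            input:
                out_basename = pair.left,
                in_bams = pair.right,
                reheader_table = reheader_table
        }
    }
}
```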
Contributor (Author):
That would be so glorious. Unfortunately, as far as I can tell, the DNAnexus UI doesn't support complex input types like that. The closest thing I could find to an explanation is at https://github.com/dnanexus/dxWDL/blob/master/doc/Internals_v0_81.md — or at https://libraries.io/github/dnanexus/dxWDL under "Design". I don't actually understand what it's saying, or whether the takeaway is that there is some way to make complex types work with the UI. If you have any insights beyond mine then maybe we could make it work. You can play around with a demo of what this and other complex datatypes are like in the DNAnexus UI.
Member:
Wow, that quote under "Design" sounds like exactly what you're doing in your pipeline: you take an input table of filenames and resolve it into the complex data structure yourself at runtime.
Contributor (Author):
I previously just didn't parse that at all, for no reason except that I couldn't figure out how the actual UI inputs work. For any complex data type (in this case we are looking at a Map), the UI presents a single string field. Somehow, this string should contain the json representation of the whole value, but I couldn't work out the exact format it expects.
Member:
Here's my main issue. Anytime we write a WDL pipeline, I want it to be portable across different execution environments -- otherwise, why are we writing it in WDL in the first place? So it should run just fine on your laptop, on a cluster, on DNAnexus, on Terra, and other environments I've never used before. One of the main principles in making that happen is to try to remove as much of the unix file system concept from the high-level pipeline itself, leaving it only to the code within each task, because most of the time these pipelines run in places where directories don't exist, and so forth. Talking about a bunch of filenames in the text content of an input file starts getting very infrastructure specific, because the way files are organized and referred to is very infrastructure specific. For example, in the case of DNAnexus, the official / unambiguous way to refer to a file object is by its DNAnexus opaque file identifier (a dx://file-xxxx identifier).

As for the json representation, almost every WDL executor uses json anyway for describing inputs to executions (that's what you used in that dx-defaults file you deleted). It'd be something to the effect of the sketch shown after this comment. I mean, maybe those input files could be presented in a directory/filename-like representation instead of opaque identifiers, but maybe not -- the dx-defaults don't accept those forms, but you never know until you try. Anyway, although that json would be annoying to human-write, I'm assuming your tabfile was script-produced anyway, and there's no real reason it couldn't produce this form of the output as well.

This might seem like splitting hairs between two ways that are semantically doing the same thing (providing a serialized mapping of the two-dimensional data structure, one in tsv form, one in json, and the former is certainly simpler), but I think there's actually a significant difference philosophically between the two: in the latter case, we're telling the execution engine (whether it's DNAnexus or something else) the data structure in a way it understands and having it resolve it. In the former case, we're telling our own custom-written code about the mapping and having it resolve it -- and when it comes to resolving files from names, that is exactly the infrastructure-specific territory I'd rather keep out of the WDL.

Also, if the primary purpose of the preceding bits is to solve a UI issue on DNAnexus, it seems that it would be sensible to make this part of the solution very DNAnexus-specific code. Similar to the assembly stats python script (that is DNAnexus specific), we could imagine a "parse-the-tabfile-to-json" script that is also DNAnexus specific and provided separately from all the WDL.
Contributor (Author):
I agree on all counts. My only concern is that I reaaaaally hope it doesn't turn out that we need to use …
```wdl
        # merges the relevant input bam files to produce this output file
        call demux.merge_and_reheader_bams {
            input:
                out_basename = basename(out_bam, ".bam"),
                in_bams = relevant_in_bams,
                reheader_table = reheader_table,
                docker = docker
        }
    }
}

task clean_in_bam_out_bam_table {
    File table

    command {
        cat ${table} | sed 's/[.]bam$//g' | sed $'s/[.]bam\t/\t/g' | tee in_bam_out_bam_table
    }

    output {
        File clean_table = "in_bam_out_bam_table"
    }
}
```
Comment on lines +46 to +56

Member:
This task seems unnecessary -- you're already calling the WDL built-in basename() with a ".bam" suffix argument in the workflow, so why also strip extensions with a separate shell task?
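Presumably the suggestion is to normalize at comparison time instead of rewriting the table, something like this sketch (which still errors if in_bam_basename is not a key of the map -- exactly the problem described in the reply below):

```wdl
if (basename(in_bam_to_out_bam[in_bam_basename], ".bam") == basename(out_bam, ".bam")) {
    File relevant_in_bam = in_bam
}
```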
Contributor (Author):
(As you definitely already can tell,) my thinking here was to deal with the user potentially including the .bam extension in the input and/or output bam filenames listed in in_bam_out_bam_table ("sample1.bam" vs. "sample1", or even a more reasonable mixture of both styles across rows). I ran into problems with this line:

    if(in_bam_to_out_bam[in_bam_basename] == out_bam)

Note that basename(in_bam, ".bam") strips the extension from the actual input file, so if the corresponding table key still ends in ".bam", the map lookup fails at runtime. This would be fine if we could check the map for the presence of a key before indexing into it, but I couldn't find a way to do that in WDL.
```wdl
task unique_values_in_second_column {
    File table

    command {
        cut -f2 ${table} | sort | uniq | tee unique_values
    }

    output {
        Array[String] unique_values = read_lines("unique_values")
    }
}
```
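For reference, a Cromwell-style inputs json for this workflow might look like the following (file paths hypothetical; the input names come from the workflow above):

```json
{
  "merge_bams_bulk.in_bams": [
    "/data/runA.sample1.bam",
    "/data/runB.sample1.bam",
    "/data/runA.sample2.bam"
  ],
  "merge_bams_bulk.in_bam_out_bam_table": "/data/in_bam_out_bam_table.tsv",
  "merge_bams_bulk.reheader_table": "/data/reheader_table.tsv"
}
```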
New file (@@ -0,0 +1,4 @@):
```json
{
  "assemble_denovo_bulk.trim_clip_db":
  "dx://file-BXF0vYQ0QyBF509G9J12g927"
}
```
lakras marked this conversation as resolved. (Outdated)