Consider removing combine hail table step #99
Description
Currently, the four datasets are combined into a single Hail table with one row per gene. Each row has a struct of gene info containing the info for each dataset, and a struct of variants containing each dataset and its respective list of variants.
Then, when writing the results, this entire table is written to a temp .tsv file, and that .tsv file is in turn split into a gene results file and a variant results file per gene, one of each for each dataset.
Unless I am missing something, combining everything into a single table does very little, given that we validate the outputs from each of the individual pipelines. We could just generate the results JSON files directly from each individual dataset.
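As a rough illustration of that proposal, here is a minimal sketch in plain Python that writes the per-gene results files for one dataset at a time, with no combined table or intermediate .tsv. Everything in it is an assumption made for illustration: the function name `write_dataset_results`, the row keys (`gene_id`, `gene_results`, `variants`), and the output layout are hypothetical, and `gene_rows` stands in for whatever iterable of rows each individual pipeline actually produces.

```python
import json
from pathlib import Path


def write_dataset_results(dataset_id, gene_rows, output_dir):
    """Write a gene results and a variant results JSON file per gene.

    Hypothetical row shape (not taken from the real pipeline):
        {"gene_id": str, "gene_results": dict, "variants": list}
    """
    output_dir = Path(output_dir)
    written = []
    # Rows are consumed one at a time, so only a single gene's
    # results need to be held in memory at any point.
    for row in gene_rows:
        gene_dir = output_dir / dataset_id / row["gene_id"]
        gene_dir.mkdir(parents=True, exist_ok=True)

        gene_path = gene_dir / "gene_results.json"
        variant_path = gene_dir / "variant_results.json"
        gene_path.write_text(json.dumps(row["gene_results"]))
        variant_path.write_text(json.dumps(row["variants"]))
        written.extend([gene_path, variant_path])
    return written
```

Calling this once per dataset would produce the same kind of per-gene, per-dataset files as the current write step, while keeping each dataset's output independent of the others.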
I was working on getting Transcript Consequences at the variant level for the IBD dataset. It worked in development with a smaller subset of genes, but it choked in production during the write results files step, which led me to look a bit closer at what's happening and to file this issue.