This set of Snakemake workflows replaces the Atlas operations for new experiment processing, reprocessing and recalculations that were previously run from the Atlas-Prod codebase. Those operations called the LSF CLI directly and could only run on the original cluster.
It contains data analysis rules for:
- RNA-Seq baseline analysis
- Microarray differential analysis
- RNA-Seq differential analysis
- Proteomics baseline analysis
- Proteomics differential analysis
A recalculations run requires that reprocessing has already been performed, and it is currently not available for proteomics experiments. Recalculations are the operations that need to happen on load or after an Ensembl Update (E! Update); they generate a subset of the outputs produced during (re)processing.
- Snakemake (tested with version 7.32.4)
- LSF or SLURM batch schedulers
- Set up configuration variables at `run_sorting_hat_test_data.sample.sh` for goal 'reprocess' or 'recalculations'.
./run_sorting_hat_test_data.sh EXPS_DIR
The experiments path contains one or more directories named after Atlas accessions, E-* (e.g. `E-MTAB-5577`), each holding at least the configuration files in XML format produced by the curation process.
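As a rough illustration of the directory layout described above, the following Python sketch lists the accession directories under an experiments path that contain at least one XML file. The function name `find_accessions` is hypothetical and not part of these workflows; the sorting-hat script may discover experiments differently.

```python
from pathlib import Path


def find_accessions(exps_dir):
    """Return sorted names of E-* directories that hold at least one .xml file.

    Hypothetical helper for illustration only; it is not part of the workflows.
    """
    root = Path(exps_dir)
    accessions = []
    for entry in root.glob("E-*"):
        # Only directories count as experiments; stray files are ignored.
        if entry.is_dir() and any(entry.glob("*.xml")):
            accessions.append(entry.name)
    return sorted(accessions)
```

Calling `find_accessions("EXPS_DIR")` before launching the workflow gives a quick check of which experiments have the curated XML files in place.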
Optionally, workflow execution can be restricted to specific accessions or species by defining the corresponding variables in the sorting-hat script.
Processing by iRAP Single Lib (ISL) must be complete before new experiment processing. New experiment processing then runs all rules available for the experiment type.
Recalculations are necessary for the Ensembl Update (E! Update) part of a Data Release. E! Update brings annotations from Biomart, E! MySQL databases and FTP sites for all the relevant organisms in Expression Atlas, and leaves them in a format that can be consumed by the decoration process and the web applications. Ensembl Update validators (for Biomart attributes, GTF URLs and the existence of the MySQL database) are run before recalculations.
For differential RNA-seq and microarray experiments, the following outputs are generated (which correspond to rules):
- Percentile ranks rule
  - Output: `{accession}-percentile-ranks.tsv`
- Differential tracks rule
  - Outputs: `{accession}.{contrast_id}.genes.pval.bedGraph`
  - Outputs: `{accession}.{contrast_id}.genes.log2foldchange.bedGraph`
- Differential GSEA rule
  - Outputs: `{accession}.{contrast_id}.{ext_db}.gsea.tsv`
  - Outputs: `{accession}.{contrast_id}.{ext_db}.gsea_list.tsv`
- atlas_experiment_summary rule
  - Output: `{accession}-atlasExperimentSummary.Rdata`
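To make the wildcards in these file names concrete, here is a minimal Python sketch that expands the patterns listed above for one accession. The function name, the grouping of patterns into three lists, and the example contrast and database values are assumptions made for illustration; the real values come from each experiment's configuration.

```python
from itertools import product

# Patterns copied from the list above, grouped by the wildcards they use.
PER_EXPERIMENT = [
    "{accession}-percentile-ranks.tsv",
    "{accession}-atlasExperimentSummary.Rdata",
]
PER_CONTRAST = [
    "{accession}.{contrast_id}.genes.pval.bedGraph",
    "{accession}.{contrast_id}.genes.log2foldchange.bedGraph",
]
PER_CONTRAST_AND_DB = [
    "{accession}.{contrast_id}.{ext_db}.gsea.tsv",
    "{accession}.{contrast_id}.{ext_db}.gsea_list.tsv",
]


def expected_differential_outputs(accession, contrast_ids, ext_dbs):
    """List the file names a differential recalculation would be expected to produce."""
    names = [p.format(accession=accession) for p in PER_EXPERIMENT]
    for contrast_id in contrast_ids:
        names += [
            p.format(accession=accession, contrast_id=contrast_id)
            for p in PER_CONTRAST
        ]
    # GSEA files exist once per (contrast, external database) pair.
    for contrast_id, ext_db in product(contrast_ids, ext_dbs):
        names += [
            p.format(accession=accession, contrast_id=contrast_id, ext_db=ext_db)
            for p in PER_CONTRAST_AND_DB
        ]
    return names


# Example values below are illustrative, not taken from a real experiment.
for name in expected_differential_outputs("E-MTAB-5577", ["g1_g2"], ["go"]):
    print(name)
```

A list like this can be compared against the contents of an experiment directory to spot outputs missing after a run.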
For baseline RNA-seq experiments, the following outputs are generated:
- Baseline tracks rule
  - Outputs: `{accession}.{assay_id}.genes.expressions_{metric}.bedGraph` (where `{metric}` is `fpkm` or `tpm`)
- Baseline heatmap rule
  - Outputs: `{accession}-heatmap-{metric}.pdf`
  - Output: `{accession}-heatmap.pdf`
- Baseline coexpression rule
  - Outputs: `{accession}-{metric}-coexpressions.tsv.gz`
  - Output: `{accession}-coexpressions.tsv.gz`
- atlas_experiment_summary rule
  - Output: `{accession}-atlasExperimentSummary.Rdata`
For proteomics experiments (`proteomics_baseline`, `proteomics_baseline_dia`, `proteomics_differential`), recalculations are not implemented.
| Rule | Baseline | Differential RNA-seq | Differential microarray |
|---|---|---|---|
| atlas_experiment_summary | ✓ | ✓ | ✓ |
| check_differential_gsea | | ✓ | ✓ |
| differential_gsea | | ✓ | ✓ |
| differential_tracks | | ✓ | ✓ |
| percentile_ranks | | ✓ | ✓ |
| baseline_coexpression | ✓ | | |
| baseline_heatmap | ✓ | | |
| baseline_tracks | ✓ | | |
| link_baseline_coexpression | ✓ | | |
| link_baseline_heatmap | ✓ | | |
| touch_inputs_baseline | ✓ | | |
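
The rule coverage can also be expressed as a lookup. This is a sketch only: it takes the differential rules to apply to differential RNA-seq and microarray experiments, as the output lists earlier in this document state, and the dictionary keys and the `rules_for` function are hypothetical names that do not appear in the workflows.

```python
DIFFERENTIAL_RULES = [
    "atlas_experiment_summary",
    "check_differential_gsea",
    "differential_gsea",
    "differential_tracks",
    "percentile_ranks",
]

# Keys are illustrative labels for the three table columns.
RECALCULATION_RULES = {
    "baseline": [
        "atlas_experiment_summary",
        "baseline_coexpression",
        "baseline_heatmap",
        "baseline_tracks",
        "link_baseline_coexpression",
        "link_baseline_heatmap",
        "touch_inputs_baseline",
    ],
    "differential_rnaseq": list(DIFFERENTIAL_RULES),
    "differential_microarray": list(DIFFERENTIAL_RULES),
}


def rules_for(kind):
    """Return the recalculation rules for one of the three supported kinds."""
    try:
        return RECALCULATION_RULES[kind]
    except KeyError:
        # Proteomics types land here, since recalculations are not implemented for them.
        raise ValueError(f"no recalculation rules for {kind!r}") from None
```

Raising an error for unknown kinds, rather than returning an empty list, keeps an unsupported experiment type from passing silently with nothing to run.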