I understand the rationale for making all of the inputs (task inputs and runtime configuration) part of the cache key. It makes sense: many of the runtime parameters directly affect computation results (cpu, mem, but mostly docker). But it would be nice if time_minutes were not included, because it doesn't affect the computation in the same way as the others. It only sets a limit; the computation proceeds identically for any value, provided it is at least the required time.
Background
On our SLURM cluster, jobs require a timeout which we set as time_minutes in https://github.com/miniwdl-ext/miniwdl-slurm/
This has been working perfectly, but now I'm running someone else's workflow, which doesn't annotate time_minutes (nor use any formulae to calculate it per job). I've been guessing at time_minutes values, but sometimes I get it wrong, or I encounter new datasets that need longer to run with specific tools.
I've now increased time_minutes in my workflow of 20 datasets, but that has busted the job cache, and now I'm going to burn some weeks of CPU time re-doing all of the previously completed datasets, just because one tool in the middle needs longer to finish.
If time_minutes could be excluded from the cache key, that would fix my issue. I'm sure I'd need to re-run everything end to end at least once to take advantage of it, but that's fine.
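To illustrate what I mean, here is a rough sketch of a cache key that ignores limit-only runtime settings. This is not miniwdl's actual implementation; `cache_key`, `LIMIT_ONLY_RUNTIME_KEYS`, and the shape of the arguments are all hypothetical names I made up for the example.

```python
import hashlib
import json

# Hypothetical: runtime keys that only cap a job and do not change its output.
LIMIT_ONLY_RUNTIME_KEYS = frozenset({"time_minutes"})


def cache_key(task_digest, inputs, runtime):
    """Hash the task, its inputs, and only the result-affecting runtime keys."""
    functional_runtime = {
        key: value
        for key, value in runtime.items()
        if key not in LIMIT_ONLY_RUNTIME_KEYS
    }
    # sort_keys gives a stable serialization regardless of dict ordering.
    payload = json.dumps(
        {"task": task_digest, "inputs": inputs, "runtime": functional_runtime},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With something like this, raising time_minutes from 60 to 240 for one task would leave its key unchanged, while changing docker or cpu would still invalidate the cached result.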
cc @rhpvorderman