Ignore 'time_minutes' when caching jobs #823

@hexylena

I understand the rationale for making all of the inputs (task inputs and runtime configuration) part of the cache key. It makes sense: many of the runtime parameters directly affect computation results (cpu, mem, and above all docker). But it would be nice if time_minutes were not included, because it doesn't affect the computation in the same way as the others. It only sets a limit: the computation proceeds identically for different values, provided each is larger than the required time.

Background

On our SLURM cluster, jobs require a timeout which we set as time_minutes in https://github.com/miniwdl-ext/miniwdl-slurm/

This has been working perfectly, but now I'm running someone else's workflow, which doesn't annotate time_minutes (and doesn't use any formulae to calculate it per job). I've been guessing at time_minutes values, but sometimes I get them wrong, or I encounter new datasets that need longer to run with specific tools.

I've now increased time_minutes in my workflow of 20 datasets, but that has invalidated the job cache, and now I'm going to burn some weeks of CPU time redoing all of the datasets that had already completed successfully, just because one tool in the middle needs longer to finish.

If time_minutes could be excluded from the cache key, that would fix my issue. I expect I'd need to re-run the workflow end to end at least once to take advantage of it, but that's fine.
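
To illustrate the behaviour I'm asking for, here is a minimal sketch. It is not miniwdl's actual cache code: `cache_key` and `CACHE_IGNORED_RUNTIME_KEYS` are hypothetical names, and hashing a JSON dump with SHA-256 is just an assumption made for the example. The only point is that runtime keys which merely limit a job are dropped before the digest is computed.

```python
import hashlib
import json

# Hypothetical: runtime keys assumed to limit a job without changing its outputs.
CACHE_IGNORED_RUNTIME_KEYS = frozenset({"time_minutes"})


def cache_key(inputs, runtime, ignored=CACHE_IGNORED_RUNTIME_KEYS):
    """Return a hex digest over the task inputs and the result-affecting runtime keys."""
    # Drop the ignored keys so that changing them leaves the digest unchanged.
    relevant_runtime = {k: v for k, v in runtime.items() if k not in ignored}
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"inputs": inputs, "runtime": relevant_runtime},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    inputs = {"reads": "sample01.fastq.gz"}
    short = {"docker": "example/tool:1.0", "cpu": 4, "time_minutes": 60}
    longer = {"docker": "example/tool:1.0", "cpu": 4, "time_minutes": 240}
    # Same key: only the time limit differs.
    print(cache_key(inputs, short) == cache_key(inputs, longer))
```

With something like this, raising time_minutes for the one slow tool would still hit the cache for every dataset that already finished, while a change to docker, cpu or the inputs would still produce a new key.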

cc @rhpvorderman
