Running multiple jobs prevents detection of stuck jobs #26

@rkwalters

The motherscripts detect failing/stuck tasks by comparing the current call in send_jobarray to the last task queued (as recorded in the master log file for that motherscript). If the script is about to submit the same task it attempted previously, it errors and stops as intended.

This comparison relies on reading the last line of the master log file to match against. If two or more jobs sharing the same master log file are running simultaneously (e.g. postimp with different phenotypes, pca with and without reference population samples, etc.), a job's previous task submission may no longer be on the last line of the log file, because one of the other jobs may have submitted a task in the interim.
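The failure mode can be reproduced with a small simulation. This is only an illustrative Python sketch, not the actual send_jobarray code: `is_repeat_last_line`, `job_a`, and `job_b` are hypothetical names, and the log is modelled as a plain list of strings without timestamps.

```python
def is_repeat_last_line(log_lines, current):
    """Last-line-only check: True if the newest log entry equals the current submission."""
    return bool(log_lines) and log_lines[-1] == current


# Hypothetical entries for two jobs that share one master log file.
job_a = "postimp --out gwas_w_cov danerlong.000921"
job_b = "postimp --out gwas_wo_cov danerlong.000921"

# A single failing job is caught on its second attempt.
solo_log = [job_a]
solo_detected = is_repeat_last_line(solo_log, job_a)

# Two failing jobs that alternate submissions never see their own entry last.
shared_log = []
detected = []
for _ in range(3):
    for job in (job_a, job_b):
        detected.append(is_repeat_last_line(shared_log, job))
        shared_log.append(job)

print(solo_detected)   # the lone job is detected as stuck
print(any(detected))   # neither interleaved job is ever detected
```

With strictly alternating submissions, the last line always belongs to the other job, so the check never fires for either one.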

In the worst-case scenario, two or more failing jobs can cycle on the same task indefinitely. They never stop, because none of them sees its own previous task on the last line of the log file, and depending on the task they can generate hundreds of thousands of temp files in the process.

Sample log below, based on a real-world example:

/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921    Thu_Dec_17_13:00:22_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921    Thu_Dec_17_13:02:02_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921    Thu_Dec_17_13:03:37_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921    Thu_Dec_17_13:05:05_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921    Thu_Dec_17_13:06:21_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921    Thu_Dec_17_13:07:39_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921    Thu_Dec_17_13:09:01_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921    Thu_Dec_17_13:10:18_2015
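
One possible mitigation, sketched here in Python purely for illustration, is to scan the log backwards for the most recent entry belonging to the same job instead of reading only the last line. This is not the pipeline's actual code. `parse_entry` and `is_stuck` are hypothetical helpers, and the parsing assumes that fields are whitespace-separated, with the task name second to last and the timestamp last, as in the sample log above.

```python
def parse_entry(line):
    """Split a log line into (job_key, task), dropping the trailing timestamp.

    Assumption: whitespace-separated fields, task second to last, timestamp last.
    """
    fields = line.split()
    if len(fields) < 3:
        return None
    return " ".join(fields[:-2]), fields[-2]


def is_stuck(log_lines, job_key, task):
    """True if this job's most recent log entry queued the same task."""
    for line in reversed(log_lines):
        entry = parse_entry(line)
        if entry is not None and entry[0] == job_key:
            return entry[1] == task
    return False  # no earlier entry for this job


# First two lines of the sample log (field spacing simplified).
LOG = [
    "/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 "
    "--out gwas_w_cov --addcov cov.txt danerlong.000921 Thu_Dec_17_13:00:22_2015",
    "/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 "
    "--out gwas_wo_cov danerlong.000921 Thu_Dec_17_13:02:02_2015",
]

# The first job is about to resubmit danerlong.000921.
my_key, my_task = parse_entry(LOG[0])
print(is_stuck(LOG, my_key, my_task))
```

Here the first job is flagged as stuck even though the other job wrote the last line. Separate log files per job, or a retry counter per task, would be alternative designs; which one suits the motherscripts is left open.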
