The motherscripts detect failing/stuck tasks by comparing the current call in send_jobarray to the last task queued, as recorded in the master log file for that motherscript. If a motherscript is about to submit the same job it attempted previously, it errors out and stops, as intended.
This comparison relies on reading the last line of the master log file. If two or more jobs sharing the same master log file are running simultaneously (e.g. postimp with different phenotypes, or pca with and without reference-population samples), the previous task submission may no longer be on the last line of the log, because one of the other jobs may have submitted a task in the interim.
In the worst case, two or more failing jobs can cycle on the same task indefinitely: each job never sees its own previous task on the last line of the log, so none of them ever stops, and depending on the task this loop can generate hundreds of thousands of temp files in the process.
Sample log below, based on a real-world example:
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921 Thu_Dec_17_13:00:22_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921 Thu_Dec_17_13:02:02_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921 Thu_Dec_17_13:03:37_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921 Thu_Dec_17_13:05:05_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921 Thu_Dec_17_13:06:21_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921 Thu_Dec_17_13:07:39_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921 Thu_Dec_17_13:09:01_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921 Thu_Dec_17_13:10:18_2015
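The failure mode above can be sketched in a few lines. This is a hypothetical illustration, not the actual motherscript code: the function names (`last_line`, `would_resubmit_same_task`, `seen_before_anywhere`) and the log format are assumptions based on the excerpt. The first check mirrors the flawed last-line comparison; the second scans the whole log for this job's own previous call, so interleaved entries from other jobs cannot mask it.

```python
def last_line(path):
    """Return the last non-empty line of the log file, or None."""
    last = None
    with open(path) as fh:
        for line in fh:
            if line.strip():
                last = line.strip()
    return last

def would_resubmit_same_task(master_log, current_call):
    """Flawed check (as described above): only the final log line is
    compared, so a concurrent job's entry can hide this job's own
    previous submission and the duplicate goes undetected."""
    prev = last_line(master_log)
    return prev is not None and prev.startswith(current_call)

def seen_before_anywhere(master_log, current_call):
    """Safer variant: scan every line for this job's previous call,
    so entries interleaved by other jobs cannot mask it."""
    with open(master_log) as fh:
        return any(line.strip().startswith(current_call) for line in fh)
```

With the interleaved log shown above, `would_resubmit_same_task` returns False for either job (the other job's entry always occupies the last line), while `seen_before_anywhere` still finds the earlier identical submission.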