The motherscripts detect failing/stuck tasks by comparing the current call in send_jobarray to the last task queued, as recorded in the master log file for that motherscript. If a motherscript is about to submit the same job it attempted previously, it errors out and stops, as intended.
This comparison relies on reading the last line of the master log file. If two or more jobs sharing the same master log file are running simultaneously (e.g. postimp with different phenotypes, or pca with and without reference-population samples), the previous task submission may no longer be on the last line of the log, because one of the other jobs may have submitted a task in the interim.
In the worst case, two or more failing jobs can cycle on the same task indefinitely: each job never sees its own previous task on the last line of the log, so none of them ever stops, and depending on the task this loop can generate hundreds of thousands of temp files in the process.
Sample log below, based on a real-world example:
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921 Thu_Dec_17_13:00:22_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921 Thu_Dec_17_13:02:02_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921 Thu_Dec_17_13:03:37_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921 Thu_Dec_17_13:05:05_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921 Thu_Dec_17_13:06:21_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921 Thu_Dec_17_13:07:39_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921 Thu_Dec_17_13:09:01_2015
/path/to/datadir postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921 Thu_Dec_17_13:10:18_2015
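The failure mode above can be sketched in a few lines. This is a hypothetical illustration, not the actual motherscript code: the function names (`last_line`, `would_resubmit_same_task`, `seen_before_anywhere`) and the log format are assumptions based on the excerpt. The first check mirrors the flawed last-line comparison; the second scans the whole log for this job's own previous call, so interleaved entries from other jobs cannot mask it.

```python
def last_line(path):
    """Return the last non-empty line of the log file, or None."""
    last = None
    with open(path) as fh:
        for line in fh:
            if line.strip():
                last = line.strip()
    return last

def would_resubmit_same_task(master_log, current_call):
    """Flawed check (as described above): only the final log line is
    compared, so a concurrent job's entry can hide this job's own
    previous submission and the duplicate goes undetected."""
    prev = last_line(master_log)
    return prev is not None and prev.startswith(current_call)

def seen_before_anywhere(master_log, current_call):
    """Safer variant: scan every line for this job's previous call,
    so entries interleaved by other jobs cannot mask it."""
    with open(master_log) as fh:
        return any(line.strip().startswith(current_call) for line in fh)
```

With the interleaved log shown above, `would_resubmit_same_task` returns False for either job (the other job's entry always occupies the last line), while `seen_before_anywhere` still finds the earlier identical submission.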