feat: Run detector timelines on SLURM #290
base: main
Changes from 3 commits
3934355
907bbb3
5106d24
ec91e62
e747db4
505cf83
137577d
898b940
eb19e4b
5f43d1e
dcdfae9
1d6c484
e25670c
dc4d422
80a81ca
22508a0
5acfc0e
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -3,6 +3,12 @@ | |||||
| set -e | ||||||
| set -u | ||||||
| source $(dirname $0)/environ.sh | ||||||
| # constants ############################################################ | ||||||
| # slurm settings | ||||||
| SLURM_MEMORY=1500 | ||||||
| SLURM_TIME=10:00:00 | ||||||
| SLURM_LOG=/farm_out/%u/%x-%A_%a | ||||||
| ######################################################################## | ||||||
|
|
||||||
| # default options | ||||||
| match="^" | ||||||
|
|
@@ -12,7 +18,7 @@ outputDir="" | |||||
| numThreads=8 | ||||||
| singleTimeline="" | ||||||
| declare -A modes | ||||||
| for key in list build skip-mya focus-timelines focus-qa debug help; do | ||||||
| for key in list build skip-mya focus-timelines focus-qa run-slurm organize-only single series submit swifjob debug help; do | ||||||
| modes[$key]=false | ||||||
| done | ||||||
|
|
||||||
|
|
@@ -55,6 +61,26 @@ usage() { | |||||
| --focus-timelines only produce the detector timelines, do not run detector QA code | ||||||
| --focus-qa only run the QA code (assumes you have detector timelines already) | ||||||
|
|
||||||
| --run-slurm run timelines on SLURM instead of running multi-threaded locally | ||||||
| --organize-only only organize timelines assuming they have already been run with --run-slurm | ||||||
| if not used, all files from output directories will be removed | ||||||
mfmceneaney marked this conversation as resolved.
|
||||||
|
|
||||||
| *** EXECUTION CONTROL OPTIONS: choose only one, or the default will generate a | ||||||
| Slurm job description and print out the suggested \`sbatch\` command | ||||||
|
|
||||||
| --single run only the first job, locally; useful for | ||||||
| testing before submitting jobs to slurm | ||||||
|
|
||||||
| --series run all jobs locally, one at a time; useful | ||||||
| for testing on systems without slurm | ||||||
mfmceneaney marked this conversation as resolved.
|
||||||
|
|
||||||
| --submit submit the slurm jobs, rather than just | ||||||
| printing the \`sbatch\` command | ||||||
|
|
||||||
| --swifjob run this on a workflow runner, where the input | ||||||
| files are found in ./; overrides some other settings; this is NOT meant | ||||||
| to be used interactively, but rather as a part of a workflow | ||||||
|
|
||||||
mfmceneaney marked this conversation as resolved.
|
||||||
| --debug enable debug mode: run a single timeline with stderr and stdout printed to screen; | ||||||
| it is best to use this with the '-t' option to debug specific timeline issues | ||||||
|
|
||||||
|
|
@@ -177,13 +203,13 @@ detDirs=( | |||||
| trigger | ||||||
| ) | ||||||
|
|
||||||
| # cleanup output directories | ||||||
| if ${modes['focus-all']} || ${modes['focus-timelines']}; then | ||||||
| # cleanup output directories IF you are not just organizing files after running on SLURM | ||||||
| if (${modes['focus-all']} || ${modes['focus-timelines']}) && ! ${modes['organize-only']}; then | ||||||
| if [ -d $finalDirPreQA ]; then | ||||||
| rm -rv $finalDirPreQA | ||||||
| fi | ||||||
| fi | ||||||
| if [ -d $logDir ]; then | ||||||
| if [ -d $logDir ] && ! ${modes['organize-only']}; then | ||||||
| for fail in $(find $logDir -name "*.fail"); do | ||||||
| rm $fail | ||||||
| done | ||||||
|
|
@@ -231,26 +257,167 @@ if ${modes['focus-all']} || ${modes['focus-timelines']}; then | |||||
| done | ||||||
|
|
||||||
| # produce timelines, multithreaded | ||||||
| job_ids=() | ||||||
| job_names=() | ||||||
| for timelineObj in $timelineList; do | ||||||
| logFile=$logDir/$timelineObj | ||||||
| [ -n "$singleTimeline" -a "$timelineObj" != "$singleTimeline" ] && continue | ||||||
| echo ">>> producing timeline '$timelineObj' ..." | ||||||
| if ${modes['debug']}; then | ||||||
| java $TIMELINE_JAVA_OPTS $run_detectors_script $timelineObj $inputDir | ||||||
| echo "PREMATURE EXIT, since --debug option was used" | ||||||
| exit | ||||||
| if ! ${modes['run-slurm']} || ${modes['debug']} && ! ${modes['organize-only']}; then | ||||||
mfmceneaney marked this conversation as resolved.
|
||||||
| job_ids=() | ||||||
| job_names=() | ||||||
| for timelineObj in $timelineList; do | ||||||
| logFile=$logDir/$timelineObj | ||||||
| [ -n "$singleTimeline" -a "$timelineObj" != "$singleTimeline" ] && continue | ||||||
| echo ">>> producing timeline '$timelineObj' ..." | ||||||
| if ${modes['debug']}; then | ||||||
| java $TIMELINE_JAVA_OPTS $run_detectors_script $timelineObj $inputDir | ||||||
| echo "PREMATURE EXIT, since --debug option was used" | ||||||
| exit | ||||||
| else | ||||||
| #sleep 1 | ||||||
| java $TIMELINE_JAVA_OPTS $run_detectors_script $timelineObj $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail & | ||||||
| job_ids+=($!) | ||||||
| job_names+=($timelineObj) | ||||||
| fi | ||||||
| wait_for_jobs $numThreads | ||||||
| done | ||||||
|
|
||||||
| wait_for_jobs 0 | ||||||
|
|
||||||
| fi # condition end: produce timelines, multi-threaded | ||||||
|
|
||||||
| # produce timelines, distributed on SLURM or test singly or sequentially locally | ||||||
mfmceneaney marked this conversation as resolved.
|
||||||
| if ${modes['run-slurm']} && ! ${modes['organize-only']}; then | ||||||
|
|
||||||
| # initial checks and preparations | ||||||
| echo $dataset | grep -q "/" && printError "dataset name must not contain '/' " && echo && exit 100 | ||||||
| [ -z "$dataset" ] && printError "dataset name must not be empty" && echo && exit 100 | ||||||
| slurmJobName=clas12-timeline--$dataset | ||||||
|
|
||||||
| # start job lists | ||||||
| echo """ | ||||||
| Generating job scripts...""" | ||||||
| slurmDir=$TIMELINESRC/slurm | ||||||
|
||||||
| mkdir -p $slurmDir/scripts | ||||||
| jobkeys=() | ||||||
| for timelineObj in $timelineList; do | ||||||
| [ -n "$singleTimeline" -a "$timelineObj" != "$singleTimeline" ] && continue | ||||||
| jobkeys+=($timelineObj) | ||||||
| done | ||||||
| #NOTE: A separate list is created for each key in run-monitoring.sh, | ||||||
| # but here we just want to submit all timelines in the same slurm job array so just create one job list. | ||||||
| joblist=$slurmDir/job.$dataset.detectors.list | ||||||
| > $joblist | ||||||
|
|
||||||
| # get list of input files, and append prefix for SWIF | ||||||
| echo "..... getting input files ....." | ||||||
| inputListFile=$slurmDir/files.$dataset.inputs.list | ||||||
| realpath $inputDir > $inputListFile | ||||||
|
|
||||||
| # generate job scripts | ||||||
| echo "..... generating job scripts ....." | ||||||
| for key in ${jobkeys[@]}; do | ||||||
|
|
||||||
| # set log file | ||||||
| logFile=$logDir/$key | ||||||
|
|
||||||
| # make job scripts for each $key | ||||||
| jobscript=$slurmDir/scripts/$key.$dataset.sh | ||||||
|
|
||||||
| cat > $jobscript << EOF | ||||||
| #!/usr/bin/env bash | ||||||
| set -e | ||||||
| set -u | ||||||
| set -o pipefail | ||||||
| echo "TIMELINE OBJECT $key" | ||||||
|
|
||||||
| # set classpath | ||||||
| export CLASSPATH=$CLASSPATH | ||||||
|
Comment on lines
+320
to
+321
Member
Collaborator
Author
Ok, thanks! Will keep an eye on this.
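One property of the `export CLASSPATH=$CLASSPATH` line under discussion is worth spelling out: because the job script is written with an unquoted `<< EOF` delimiter, variables are expanded when the script is generated, not when the job runs. The sketch below illustrates this with hypothetical values (`/opt/example/lib/*` and `example_timeline` are placeholders, not taken from this PR) and contrasts it with a quoted delimiter, which defers expansion.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for values the generating script would have in scope
CLASSPATH="/opt/example/lib/*"
key="example_timeline"

# Unquoted delimiter: $CLASSPATH and $key are expanded now, at generation time,
# so the generated file contains the literal values
jobscript=$(mktemp)
cat > "$jobscript" << EOF
export CLASSPATH=$CLASSPATH
echo "TIMELINE OBJECT $key"
EOF

# Quoted delimiter: the text is written verbatim and would only be expanded
# later, in the environment where the generated script runs
literal=$(mktemp)
cat > "$literal" << 'EOF'
export CLASSPATH=$CLASSPATH
EOF

generated=$(cat "$jobscript")
deferred=$(cat "$literal")
rm -f "$jobscript" "$literal"

printf 'generated:\n%s\n\ndeferred:\n%s\n' "$generated" "$deferred"
```

With the unquoted form, the job inherits whatever `CLASSPATH` was set in the shell that generated the scripts, which is convenient but means the scripts need regenerating if the environment changes.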
||||||
|
|
||||||
| # produce detector timelines | ||||||
| java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail | ||||||
|
||||||
Suggested change:
- java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail
+ java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir
You may also remove the `logFile=$logDir/$key` definition a few lines above.
Later, in the "error checking" part below, we'll need to figure out how to read the Slurm error logs, or just tell the user to check them for themselves.
If we do end up reading the Slurm error logs, we'll need to use the job ID or something similar, so that if the user runs this script twice, the correct set of log files is used.
For now I just removed what you suggested and in the documentation I just told the user to check for the job errors following the directions in step 1.
And I also removed the extra log file definition.
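For anyone checking the Slurm logs by hand, the `SLURM_LOG=/farm_out/%u/%x-%A_%a` setting in this PR uses standard `sbatch` filename patterns: `%u` is the user name, `%x` the job name, `%A` the job array's master job ID, and `%a` the array task index. The sketch below expands such a pattern manually to predict a log path for a given job ID. The helper name `expand_slurm_log` and the example values are hypothetical, and it assumes the pattern is passed to `sbatch` unchanged, with the job name set from `slurmJobName`.

```shell
#!/usr/bin/env bash
# Expand the sbatch filename patterns %u, %x, %A, %a by hand.
# Sketch only: assumes the substituted values contain no '|' or '&',
# which would confuse the sed replacement.
expand_slurm_log() {
  local pattern=$1 user=$2 job_name=$3 array_job_id=$4 task_id=$5
  printf '%s\n' "$pattern" | sed \
    -e "s|%u|$user|g" \
    -e "s|%x|$job_name|g" \
    -e "s|%A|$array_job_id|g" \
    -e "s|%a|$task_id|g"
}

SLURM_LOG=/farm_out/%u/%x-%A_%a

# Hypothetical user, dataset name, array job ID, and task index
expand_slurm_log "$SLURM_LOG" someuser clas12-timeline--mydataset 123456 7
```

Because `%A` is part of the path, logs from two separate submissions of the same dataset land in different files, which is what makes the job ID a workable key for picking the correct set of logs.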
Could you also update the documentation on how to use these new options?
- `doc/chef_guide.md`: supposed to be as terse as possible
- `doc/procedure.md`: where you don't have to be terse (in fact, `chef_guide.md` was created because `procedure.md` was too verbose...)
Just added a paragraph in both of these files, but let me know if I was too verbose in the chefs' documentation.
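When documenting the new options, it may help to spell out how `--run-slurm`, `--debug`, and `--organize-only` interact. The sketch below copies the two conditions from the diff into a hypothetical helper, `select_path`, which reports which production path a given flag combination selects. It models only the gating logic, under the assumption that the flags are stored as the strings `true` and `false`, as in the `modes` array.

```shell
#!/usr/bin/env bash
# Report which timeline-production path the two conditions in the diff select.
# Arguments are the strings "true" or "false", as stored in the modes array.
select_path() {
  local run_slurm=$1 debug=$2 organize_only=$3

  # bash gives || and && equal precedence and evaluates them left to right,
  # so this reads as: ( !run_slurm || debug ) && !organize_only
  if ! $run_slurm || $debug && ! $organize_only; then
    echo "local"
    # the script in the diff exits after the first timeline in debug mode
    if $debug; then return 0; fi
  fi

  if $run_slurm && ! $organize_only; then
    echo "slurm"
  fi
}

for flags in "false false false" "true false false" "true true false" "true false true"; do
  # word splitting of $flags into three arguments is intentional here
  printf '%-20s -> %s\n' "$flags" "$(select_path $flags | tr '\n' ' ')"
done
```

Under these conditions, `--organize-only` suppresses both production paths, and `--debug` takes the local path even when `--run-slurm` is set.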