209 changes: 188 additions & 21 deletions bin/run-detectors-timelines.sh
@@ -3,6 +3,12 @@
set -e
set -u
source $(dirname $0)/environ.sh
# constants ############################################################
# slurm settings
SLURM_MEMORY=1500
SLURM_TIME=10:00:00
SLURM_LOG=/farm_out/%u/%x-%A_%a
########################################################################
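For reference, the `%`-patterns in `SLURM_LOG` are standard Slurm filename substitutions: `%u` is the user name, `%x` the job name, `%A` the master job allocation ID of the array, and `%a` the array task index. A minimal sketch of how one log path expands (the user name, job name, and IDs below are invented for illustration):

```shell
# Hypothetical values; Slurm substitutes these itself when the job starts:
#   %u = user name, %x = job name, %A = array master job ID, %a = array task index
user="jdoe"; jobname="clas12-timeline--rga_fa18"; arrayid=123456; taskid=7
pattern="/farm_out/%u/%x-%A_%a"
# mimic Slurm's expansion of the stdout log path ($SLURM_LOG.out)
log=$(echo "$pattern.out" | sed "s/%u/$user/; s/%x/$jobname/; s/%A/$arrayid/; s/%a/$taskid/")
echo "$log"   # → /farm_out/jdoe/clas12-timeline--rga_fa18-123456_7.out
```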

# default options
match="^"
@@ -12,7 +18,7 @@ outputDir=""
numThreads=8
singleTimeline=""
declare -A modes
for key in list build skip-mya focus-timelines focus-qa debug help; do
for key in list build skip-mya focus-timelines focus-qa run-slurm organize-only single series submit swifjob debug help; do
modes[$key]=false
done

@@ -55,6 +61,26 @@ usage() {
--focus-timelines only produce the detector timelines, do not run detector QA code
--focus-qa only run the QA code (assumes you have detector timelines already)

--run-slurm run timelines on SLURM instead of running multi-threaded locally
Member:

Could you also update the documentation on how to use these new options?

  • doc/chef_guide.md: supposed to be as terse as possible
  • doc/procedure.md: where you don't have to be terse (in fact, chef_guide.md was created because procedure.md was too verbose...)

Collaborator Author:

Just added a paragraph in both of these files, but let me know if I was too verbose in the chefs' documentation.

--organize-only only organize timelines, assuming they have already been produced with --run-slurm;
if not used, all files in the output directories will be removed

*** EXECUTION CONTROL OPTIONS: choose only one, or the default will generate a
Slurm job description and print out the suggested \`sbatch\` command

--single run only the first job, locally; useful for
testing before submitting jobs to slurm

--series run all jobs locally, one at a time; useful
for testing on systems without slurm

--submit submit the slurm jobs, rather than just
printing the \`sbatch\` command

--swifjob run this on a workflow runner, where the input
files are found in ./; overrides some other settings; this is NOT meant
to be used interactively, but rather as a part of a workflow

--debug enable debug mode: run a single timeline with stderr and stdout printed to screen;
it is best to use this with the '-t' option to debug specific timeline issues

@@ -177,13 +203,13 @@ detDirs=(
trigger
)

# cleanup output directories
if ${modes['focus-all']} || ${modes['focus-timelines']}; then
# cleanup output directories IF you are not just organizing files after running on SLURM
if (${modes['focus-all']} || ${modes['focus-timelines']}) && ! ${modes['organize-only']}; then
if [ -d $finalDirPreQA ]; then
rm -rv $finalDirPreQA
fi
fi
if [ -d $logDir ]; then
if [ -d $logDir ] && ! ${modes['organize-only']}; then
for fail in $(find $logDir -name "*.fail"); do
rm $fail
done
@@ -231,26 +257,167 @@ if ${modes['focus-all']} || ${modes['focus-timelines']}; then
done

# produce timelines, multithreaded
job_ids=()
job_names=()
for timelineObj in $timelineList; do
logFile=$logDir/$timelineObj
[ -n "$singleTimeline" -a "$timelineObj" != "$singleTimeline" ] && continue
echo ">>> producing timeline '$timelineObj' ..."
if ${modes['debug']}; then
java $TIMELINE_JAVA_OPTS $run_detectors_script $timelineObj $inputDir
echo "PREMATURE EXIT, since --debug option was used"
exit
if ! ${modes['run-slurm']} || ${modes['debug']} && ! ${modes['organize-only']}; then
job_ids=()
job_names=()
for timelineObj in $timelineList; do
logFile=$logDir/$timelineObj
[ -n "$singleTimeline" -a "$timelineObj" != "$singleTimeline" ] && continue
echo ">>> producing timeline '$timelineObj' ..."
if ${modes['debug']}; then
java $TIMELINE_JAVA_OPTS $run_detectors_script $timelineObj $inputDir
echo "PREMATURE EXIT, since --debug option was used"
exit
else
#sleep 1
java $TIMELINE_JAVA_OPTS $run_detectors_script $timelineObj $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail &
job_ids+=($!)
job_names+=($timelineObj)
fi
wait_for_jobs $numThreads
done

wait_for_jobs 0

fi # condition end: produce timelines, multi-threaded
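The throttling above relies on a `wait_for_jobs` helper that is defined elsewhere in this repository; the stand-in below is only an illustrative sketch of its apparent contract (block until at most N background jobs remain), not the repo's actual implementation:

```shell
# Illustrative stand-in for wait_for_jobs (assumption: the real helper in
# this repo may also check for failures). Blocks until the number of
# running background jobs is <= $1.
wait_for_jobs() {
  local limit=$1
  while [ "$(jobs -rp | wc -l)" -gt "$limit" ]; do
    sleep 1
  done
}

# usage sketch: launch work in the background, throttled to 2 concurrent jobs
for i in 1 2 3 4; do
  sleep 1 &
  wait_for_jobs 2
done
wait_for_jobs 0   # drain: wait for everything to finish
```

Calling it with `0` at the end mirrors the script's pattern of draining all remaining jobs before moving on.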

# produce timelines, distributed on SLURM or test singly or sequentially locally
if ${modes['run-slurm']} && ! ${modes['organize-only']}; then

# initial checks and preparations
echo $dataset | grep -q "/" && printError "dataset name must not contain '/' " && echo && exit 100
[ -z "$dataset" ] && printError "dataset name must not be empty" && echo && exit 100
slurmJobName=clas12-timeline--$dataset

# start job lists
echo """
Generating job scripts..."""
slurmDir=$TIMELINESRC/slurm
Member:

Chefs won't have write access to $TIMELINESRC, use ./slurm instead (for consistency with run-monitoring.sh)

Also, some files created in this directory will overwrite those from step 1, if Slurm is used there too (e.g., job.$dataset.detectors.list). Perhaps an easy way to avoid this conflict is to set slurmDir to be different for the two steps, such as ./slurm/step1 for run-monitoring.sh and ./slurm/step2 here.

Collaborator Author:

I went with the second option, using ./slurm/step1 and ./slurm/step2, but would still like to check that this runs fine.

Collaborator Author:

I ended up having to set the output slurm directory to $outputDir/slurm/step2 in run-detector-timelines.sh since the script changes directories. It might also be wise to use this full path in run-monitoring.sh too, even though that script does not change directories in a way that would affect whether ./slurm or $outputDir/slurm would be identical.
mkdir -p $slurmDir/scripts
jobkeys=()
for timelineObj in $timelineList; do
[ -n "$singleTimeline" -a "$timelineObj" != "$singleTimeline" ] && continue
jobkeys+=($timelineObj)
done
#NOTE: run-monitoring.sh creates a separate list for each key,
# but here we want to submit all timelines in the same slurm job array, so we create just one job list.
joblist=$slurmDir/job.$dataset.detectors.list
> $joblist

# record the absolute path of the input directory as the input file list
echo "..... getting input files ....."
inputListFile=$slurmDir/files.$dataset.inputs.list
realpath $inputDir > $inputListFile

# generate job scripts
echo "..... generating job scripts ....."
for key in ${jobkeys[@]}; do

# set log file
logFile=$logDir/$key

# make job scripts for each $key
jobscript=$slurmDir/scripts/$key.$dataset.sh

cat > $jobscript << EOF
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
echo "TIMELINE OBJECT $key"

# set classpath
export CLASSPATH=$CLASSPATH
Comment on lines +320 to +321

Member:

$CLASSPATH is necessary for now, but is removed in #293. Depending on whether we merge this PR or #293 first, we'll need to remember to deal with this (though if we forget, the script will just fail, reporting $CLASSPATH as unbound).

Collaborator Author:

Ok, thanks! Will keep an eye on this.


# produce detector timelines
java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail
Member:

Slurm will handle the logging, automatically splitting stdout and stderr.

Suggested change
java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail
java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir

You may also remove the logFile=$logDir/$key from a few lines above.

Later below, in the "error checking" part, we'll need to figure out how to read the Slurm error logs... or just tell the user to check them for themselves...

If we do end up reading the Slurm error logs, we'll need to use the job ID or something, so in the case where the user runs this script twice, the correct set of log files is used.

Collaborator Author:

For now I just removed what you suggested, and in the documentation I told the user to check for job errors following the directions in step 1.

Collaborator Author:

And I also removed the extra log file definition.
EOF

# grant execute permission and add it to `joblist`
chmod u+x $jobscript
echo $jobscript >> $joblist

done # loop over `jobkeys`
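The job scripts above are generated with an unquoted `EOF` heredoc, so variables written plainly (`$key`, `$inputDir`, `$CLASSPATH`) are expanded at generation time and baked into the script, while escaped ones (like `\$SLURM_ARRAY_TASK_ID` in the sbatch file below) survive verbatim and expand only at runtime. A minimal sketch of that distinction (the variable names here are invented for illustration):

```shell
# Unquoted heredoc: $key expands NOW (generation time); \$USER survives as a
# literal $USER in the file and expands only when the script itself runs.
key="ftof"
script=$(mktemp)
cat > "$script" << EOF
#!/usr/bin/env bash
echo "baked-in key: $key"
echo "runtime user: \$USER"
EOF
chmod u+x "$script"
cat "$script"
```

Quoting the delimiter (`<< 'EOF'`) would instead suppress all expansion, which is why the script deliberately leaves it unquoted here.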

# now generate slurm descriptions and/or local scripts
echo """
Generating batch scripts..."""
exelist=()

# check if we have any jobs to run
[ ! -s $joblist ] && printError "there are no timeline jobs to run" && continue
slurm=$(echo $joblist | sed 's;.list$;.slurm;')

# either generate single/sequential run scripts
if ${modes['single']} || ${modes['series']} || ${modes['swifjob']}; then
localScript=$(echo $joblist | sed 's;.list$;.local.sh;')
echo "#!/usr/bin/env bash" > $localScript
echo "set -e" >> $localScript
if ${modes['single']}; then
head -n1 $joblist >> $localScript
else # ${modes['series']} || ${modes['swifjob']}
cat $joblist >> $localScript
fi
chmod u+x $localScript
exelist+=($localScript)

# otherwise generate slurm description
else
cat > $slurm << EOF
#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --job-name=$slurmJobName
#SBATCH --output=$SLURM_LOG.out
#SBATCH --error=$SLURM_LOG.err
#SBATCH --partition=production
#SBATCH --account=clas12

#SBATCH --mem-per-cpu=$SLURM_MEMORY
#SBATCH --time=$SLURM_TIME

#SBATCH --array=1-$(cat $joblist | wc -l)
#SBATCH --ntasks=1

srun \$(head -n\$SLURM_ARRAY_TASK_ID $joblist | tail -n1)
EOF
exelist+=($slurm)
fi
# execution
[ ${#exelist[@]} -eq 0 ] && printError "no jobs were created at all; check errors and warnings above" && exit 100
echo """
$sep
"""
if ${modes['single']} || ${modes['series']} || ${modes['swifjob']}; then
if ${modes['single']}; then
echo "RUNNING ONE SINGLE JOB LOCALLY:"
elif ${modes['series']}; then
echo "RUNNING ALL JOBS SEQUENTIALLY, LOCALLY:"
fi
for exe in ${exelist[@]}; do
echo """
$sep
EXECUTING: $exe
$sep"""
$exe
done
elif ${modes['submit']}; then
echo "SUBMITTING JOBS TO SLURM"
echo $sep
for exe in ${exelist[@]}; do sbatch $exe; done
echo $sep
echo "JOBS SUBMITTED!"
else
echo """ SLURM JOB DESCRIPTIONS GENERATED
- Slurm job name prefix will be: $slurmJobName
- To submit all jobs to slurm, run:
------------------------------------------"""
for exe in ${exelist[@]}; do echo " sbatch $exe"; done
echo """ ------------------------------------------
"""
fi
exit 0
fi # condition end: produce timelines, distributed on SLURM or test singly or sequentially locally
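The generated sbatch file maps each array task to one job script by taking the Nth line of the job list with `head -n$SLURM_ARRAY_TASK_ID | tail -n1`. A minimal sketch of that indexing (the task ID is set by hand here; in a real job Slurm exports `SLURM_ARRAY_TASK_ID` itself, and the script names are invented):

```shell
# Select line N of a job list, as each Slurm array task does via head/tail.
joblist=$(mktemp)
printf '%s\n' scripts/ftof.sh scripts/ec.sh scripts/htcc.sh > "$joblist"
SLURM_ARRAY_TASK_ID=2   # assumption: normally set by Slurm, 1-based
line=$(head -n"$SLURM_ARRAY_TASK_ID" "$joblist" | tail -n1)
echo "$line"   # this array task would 'srun' scripts/ec.sh
rm -f "$joblist"
```

Since `--array=1-N` numbers tasks from 1, the 1-based `head`/`tail` indexing lines up with the job list without any offset arithmetic.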

# organize output timelines
echo ">>> organizing output timelines..."