Hi, it's Jaime again. I managed to compile the model on our machine, as I mentioned in the previous issue. Now I'm struggling to run it: I'm having problems running the model with the existing default settings on our machine, which is what I want to use for the first test runs.
The default configuration needs a large number of nodes/cores, and that many are not always available on our machine. The default setup uses more than 3100 cores (3165 in fact: atmos_npes = 1728, i.e. 1728 cores for the atmospheric model, and ocean_npes = 1437, i.e. 1437 cores for the ocean), as configured in the namelist (./ESM4_rundir/input.nml) and in the run script (folder ./run/) of the model.
In the partition I have access to, each node has 48 cores and there are about 90 nodes, so that many cores do exist. It seems to me that 66 of the 90 nodes would be enough to run the model with the default settings, but the logistics of getting them are quite difficult. For that reason, I'm trying with only a few nodes (about 10) as a test, just to see whether the model runs and how it behaves. However, with a configuration of 10 nodes (I've also tested fewer than 10 and more, namely 30), the run breaks before it even starts. Could someone help me understand why?
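For reference, the 66-node figure follows from a ceiling division of the default core count by the cores per node. This is a minimal sketch that assumes one core per MPI rank and ignores threads:

```bash
#!/bin/bash
# Ceiling division: nodes needed to host a given number of cores.
# Assumption: one core per MPI rank (threads are not counted here).
total_cores=$(( 1728 + 1437 ))   # atmos_npes + ocean_npes from the default input.nml
cores_per_node=48                # cores per node in this partition
nodes_needed=$(( (total_cores + cores_per_node - 1) / cores_per_node ))
echo "total cores: ${total_cores}, nodes needed: ${nodes_needed}"
```

With the default values this gives 3165 cores on 66 nodes (66 x 48 = 3168 cores).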
Below is the bash script used to run the model:
#!/bin/bash
#SBATCH --nodes=10 #Number of nodes ## on the cptec partition the maximum is 30; on cptec_long the maximum is 90 (when available)
#SBATCH --ntasks-per-node=48 #Number of tasks per node
#SBATCH --ntasks=480 ##1440 ##1728##3168 #Total number of MPI tasks
##SBATCH --cpus-per-task=6 #Number of threads per MPI task
#SBATCH --job-name=runTest-ESM4-gfdl #Job name
#SBATCH --mail-user=ja...@gmail.com ##ja...@inpe.br
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --time=1-00:00:00 #Wall-clock time limit (1 day as written)
#SBATCH -p cptec #Queue to be used ## can change to "het_scal" or "cptec_long"; the node limit is 30 for cptec and 90 for cptec_long
#SBATCH --exclusive #Exclusive use of nodes during job execution
##Display the nodes allocated to the Job
echo $SLURM_JOB_NODELIST
nodeset -e $SLURM_JOB_NODELIST
cd $SLURM_SUBMIT_DIR
#Configure the compilers
##-------------------------#
## 1) Using OpenMPI with Intel PSXE (2016, "2017", 2018 or 2019)
source /scratch/app/modulos/intel-psxe-2017.sh
#Clear module cache
module purge
module load openmpi/icc/2.0.4.2
#*******************************************************************************
## GNU Lesser General Public License
#*******************************************************************************
########################################################
## ############## USER INPUT SECTION ################# ##
########################################################
## directory where the model will run
## This directory should contain the input.nml file, *_table, INPUT, and RESTART folders.
workDir=/prj/cptec/ja../Models/ESM4_Model/ESM4/ESM4_rundir
##workDir=../ESM4_rundir
#Path to executable
##executable=${PWD}../exec/esm4.1.x
executable=/prj/cptec/ja../Models/ESM4_Model/ESM4/exec/esm4.1.x
#MPI run program (srun, mpirun, mpiexec, aprun ...)
run_prog=srun
#Option to specify number of cores
ncores=-n
#Option to specify number of nodes
nnodes=-N
#Option to specify number of threads
nthreads=-c
## Set up for run, these are the default values set in the input.nml file
#Number of cores to run the atmosphere
atm_cores=288 ##960 ##1056 ##1728
#Number of threads to use for the atmosphere
atm_threads=4
#Number of nodes to use for the atmosphere
atm_ncores_per_node=6
#Number of cores to run the ocean
ocn_cores=192 ##480 ##1440 ##originally 1437; also changed in ESM4_rundir/input.nml, under &coupler_nml (ocean_npes = 1437)
## Add any additional options here that you need for your
ocn_ncores_per_node=4 ##10
ocn_threads=2
########################################################
## ############# END USER INPUT SECTION ############### ##
########################################################
## Set environment variables
export KMP_STACKSIZE=512m
export NC_BLKSZ=1M
export F_UFMTENDIAN=big
##export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
## Set the stacksize to unlimited
ulimit -s unlimited
ulimit -S -s unlimited
ulimit -S -c unlimited
## Go to the workDir
cd ${workDir}
## Add new
##echo "================================="
##echo "= NTASKS = $SLURM_NTASKS ="
##echo "= NNODES = "$SLURM_NNODES" ="
##echo "= CPUS_PER_TASK ="$SLURM_CPUS_PER_TASK" ="
##echo "================================="
##echo "= nodelist: " $SLURM_JOB_NODELIST" ="
##echo "================================="
## Run the model in the workDir
${run_prog} ${nnodes} ${atm_ncores_per_node} ${ncores} ${atm_cores} ${nthreads} ${atm_threads} ${executable} : ${nnodes} ${ocn_ncores_per_node} ${ncores} ${ocn_cores} ${executable} |& tee stdout.log
Looking at these settings, can anyone see any errors that could be causing the test runs to break?
Another thing: when I change the default values (atm_cores=1728 and ocn_cores=1437), should I also change the values in the namelist (./ESM4_rundir/input.nml)?
I made this change, and I don't know whether it is what is causing the crash.
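One sanity check on the numbers in the script is to compare the cores the srun line asks for against the cores the job allocates. The sketch below assumes that each MPI rank occupies one core per thread requested with -c, and that the values passed to -N (atm_ncores_per_node and ocn_ncores_per_node) are read by srun as node counts. Whether a mismatch here is what stops the run is not confirmed.

```bash
#!/bin/bash
# Values copied from the run script.
nodes=10; tasks_per_node=48
atm_nodes=6; atm_cores=288; atm_threads=4   # atm_nodes is the value passed to -N
ocn_nodes=4; ocn_cores=192                  # no -c is passed for the ocean component

allocated=$(( nodes * tasks_per_node ))
# Assumption: one core per rank per thread requested with -c.
atm_demand=$(( atm_cores * atm_threads ))
ocn_demand=$(( ocn_cores ))
requested=$(( atm_demand + ocn_demand ))

echo "allocated cores:        ${allocated}"
echo "atmosphere: ${atm_demand} cores on ${atm_nodes} nodes ($(( atm_nodes * tasks_per_node )) available)"
echo "ocean:      ${ocn_demand} cores on ${ocn_nodes} nodes ($(( ocn_nodes * tasks_per_node )) available)"
if (( requested > allocated )); then
    echo "request exceeds allocation by $(( requested - allocated )) cores"
fi
```

With the values above, the job allocates 480 cores while the launch line asks for 1344 under these assumptions. The atmosphere and ocean rank counts alone (288 + 192) would fit exactly in the 6 + 4 nodes.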
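To see whether the namelist and the run script agree, the two sets of values can be compared directly. This is only a sketch: it writes a hypothetical stand-in for ./ESM4_rundir/input.nml to a temporary file so it runs anywhere, and get_npes is an illustrative helper, not part of the model. It assumes the rank counts given to srun are meant to match atmos_npes and ocean_npes.

```bash
#!/bin/bash
# Hypothetical stand-in for ./ESM4_rundir/input.nml (default values from the post).
nml=$(mktemp)
cat > "$nml" <<'EOF'
 &coupler_nml
     atmos_npes = 1728,
     ocean_npes = 1437,
/
EOF

# Illustrative helper: print the integer assigned to a namelist key (first match only).
get_npes() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$2" | head -n 1
}

atm_cores=288   # values from the run script
ocn_cores=192
nml_atm=$(get_npes atmos_npes "$nml")
nml_ocn=$(get_npes ocean_npes "$nml")

# Report any difference between the namelist and the run script.
[ "$nml_atm" -eq "$atm_cores" ] || echo "atmos_npes=${nml_atm} differs from atm_cores=${atm_cores}"
[ "$nml_ocn" -eq "$ocn_cores" ] || echo "ocean_npes=${nml_ocn} differs from ocn_cores=${ocn_cores}"
rm -f "$nml"
```

To check the real file, point the helper at ./ESM4_rundir/input.nml instead of the temporary file.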
Thanks.