* No devices specified, falling back to default (0)...
GPUSPH version v5.0+custom
Release version without fastmath for compute capability 7.5
Chrono : enabled
HDF5 : enabled
MPI : disabled
Catalyst : disabled
Compiled for problem "MY_WaveTank"
[Network] rank 0 (1/1), host
tot devs = 1 (1 * 1)
paddle_amplitude (radians): 0.218669
Info stream: GPUSPH-776718
Initializing...
Water level not set, autocomputed: 0.4525
Max particle speed not set, autocomputed from max fall: 2.97136
Expected maximum shear rate: 3076.92 1/s
dt = 5e-05 (CFL conditions from soundspeed: 6.5e-05, from gravity 0.00514816, from viscosity 5.28125)
Using computed max neib list size 128
Using computed neib bound pos 127
Artificial viscosity epsilon is not set, using default value: 4.225000e-07
Problem calling set grid params
Influence radius / neighbor search radius / expected cell side : 0.013 / 0.013 / 0.013
Autocomputed SPS Smagorinsky factor 3.6e-07 from C_s = 0.12, ∆p = 0.005
Autocomputed SPS isotropic factor 1.1e-07 from C_i = 0.0066, ∆p = 0.005
- World origin: 0 , 0 , 0
- World size: 12 x 1.2 x 1
- Cell size: 0.0130011 x 0.0130435 x 0.0131579
- Grid size: 923 x 92 x 76 (6,453,616 cells)
- Cell linearization: y,z,x
- Dp: 0.005
- R0: 0.005
Generating problem particles...
Hot starting from /home/user/nfs_fs02/high_res/data/hot_00082.bin...
VTKWriter will write every 0.1 (simulated) seconds
HotStart checkpoints every 0.1 (simulated) seconds
will keep the last 8 checkpoints
Allocating shared host buffers...
Numbodies : 1
Numforcesbodies : 0
numOpenBoundaries : 0
allocated 1.27 GiB on host for 17,086,280 particles (17,086,279 active)
read buffer header: Position
read buffer header: Velocity
read buffer header: Info
read buffer header: Hash
Restoring body #0 ...
RB First/Last Index:
Preparing the problem...
Body: 0
Cg grid pos: 13 46 25
Cg pos: -0.00144029 -0.00652174 0.00613915
- device at index 0 has 17,086,279 particles assigned and offset 0
Integrator predictor/corrector instantiated.
Starting workers...
number of forces rigid bodies particles = 0
thread 0x2b93acd3c700 device idx 0: CUDA device 0/1, PCI device 0000:1b:00.0: GeForce RTX 2080 Ti
Device idx 0: free memory 10821 MiB, total memory 10989 MiB
Estimated memory consumption: 400B/particle
Device idx 0 (CUDA: 0) allocated 0 B on host, 6.1 GiB on device
assigned particles: 17,086,279; allocated: 17,086,280
GPUSPH: initialized
Performing first write...
Letting threads upload the subdomains...
Thread 0 uploading 17086279 Position items (260.72 MiB) on device 0 from position 0
Thread 0 uploading 17086279 Velocity items (260.72 MiB) on device 0 from position 0
Thread 0 uploading 17086279 Info items (130.36 MiB) on device 0 from position 0
Thread 0 uploading 17086279 Hash items (65.18 MiB) on device 0 from position 0
Entering the main simulation cycle
Simulation time t=8.200351e+00s, iteration=139,290, dt=5.909090e-05s, 17,086,279 parts (0, cum. 0 MIPPS), maxneibs 83+0
Simulation time t=8.300006e+00s, iteration=140,977, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 91+0
Simulation time t=8.400047e+00s, iteration=142,670, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 91+0
Simulation time t=8.500029e+00s, iteration=144,362, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 91+0
Simulation time t=8.600003e+00s, iteration=146,054, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 92+0
Simulation time t=8.700042e+00s, iteration=147,747, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=8.800022e+00s, iteration=149,439, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=8.900055e+00s, iteration=151,134, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.000036e+00s, iteration=152,826, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.100010e+00s, iteration=154,518, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.200050e+00s, iteration=156,211, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.300029e+00s, iteration=157,903, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.400006e+00s, iteration=159,595, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 96+0
Simulation time t=9.500047e+00s, iteration=161,288, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=9.600018e+00s, iteration=162,980, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=9.700022e+00s, iteration=164,674, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=9.800039e+00s, iteration=166,367, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=9.900003e+00s, iteration=168,059, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Simulation time t=1.000004e+01s, iteration=169,752, dt=5.909090e-05s, 17,086,279 parts (14, cum. 14 MIPPS), maxneibs 97+0
Elapsed time of simulation cycle: 3.7e+04s
Peak particle speed was ~2.30357 m/s at 9.50005 s -> can set maximum vel 2.5 for this problem
Simulation end, cleaning up...
Deallocating...
Bug description
During hot start, GPUSPH fails to write output to the specified directory
Summary
I'm currently running GPUSPH on a cluster that uses SLURM scheduling. The cluster scheduling is configured to give priority to certain users, and in one instance my job was killed during execution. I therefore attempted to resume the job using a hotstart file. GPUSPH successfully read the hotstart file and the simulation carried on as expected.
After the job finished, I checked my output directory and noticed that no output had been generated following the hot start. The only output present was that associated with the initial run, prior to the job being killed.
This is the command that I executed in the initial job submission:

./GPUSPH --deltap 0.005 --dir /home/user/nfs_fs02/high_res

This is the command that I executed after the job was killed, to resume:

./GPUSPH --deltap 0.005 --dir /home/user/nfs_fs02/high_res --resume /home/user/nfs_fs02/high_res/data/hot_00082.bin

The simulation is a modified version of the "WaveTank" example test case provided with the GPUSPH source code (GitHub master branch). The only change I made was removing the slope from the experiment. I've run it in the past and it works as intended, so I'm 99.9% sure the issue has nothing to do with the specific application.
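For reference, this is roughly how I script the resubmission so the newest checkpoint is picked up automatically. The hot_*.bin naming matches the files GPUSPH actually writes in my data directory; the SLURM directives are placeholders for my site's settings, not anything GPUSPH requires:

```shell
#!/bin/bash
#SBATCH --job-name=gpusph-resume   # placeholder SLURM settings
#SBATCH --gres=gpu:1

OUTDIR=/home/user/nfs_fs02/high_res

# Pick the most recent hotstart checkpoint, if any exists.
# The zero-padded hot_NNNNN.bin names sort correctly lexicographically.
LATEST=$(ls -1 "$OUTDIR"/data/hot_*.bin 2>/dev/null | sort | tail -n 1)

if [ -n "$LATEST" ]; then
    # Resume from the latest checkpoint
    ./GPUSPH --deltap 0.005 --dir "$OUTDIR" --resume "$LATEST"
else
    # Fresh start
    ./GPUSPH --deltap 0.005 --dir "$OUTDIR"
fi
```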
I suspect that the bug might be related to my specifying a non-default output directory. Somewhere in the hot-start procedure, GPUSPH seems to fail to recognize that output is requested and where it should be written.
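This is how I confirmed that nothing was written after the resume; standard shell commands only, with the paths taken from my run above:

```shell
OUTDIR=/home/user/nfs_fs02/high_res

# List every file in the output tree modified after the checkpoint the
# run resumed from; if GPUSPH had written any post-resume output,
# it would appear here. For me this prints nothing.
find "$OUTDIR" -type f -newer "$OUTDIR/data/hot_00082.bin" -print

# Most recently modified files in the data directory, for a sanity check
ls -lt "$OUTDIR/data" | head
```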
Details
Here is my error log:
And here is my output log:
The "git_branch.txt" output is:
The "make_show.txt" output is:
The "summary.txt" output is: