Description
Sorry again for not being able to provide a reproducer, which definitely limits what you can do. I'm just reporting this to see whether it sounds like something you have seen before.
I have a run using parallel I/O in which an integer variable is written by all ranks. Intermittently, approximately 30% to 50% of the time, the variable has a chunk of contiguous zeroes in the middle. The chunk does not correspond to any particular rank's output, and we have verified that the data being written does not contain any zeroes. The chunks are groups of 661 or 1332 4-byte integers with pnetcdf-1.12.1 and groups of 704 or 1408 4-byte integers with pnetcdf-1.14.0.
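For anyone trying to characterize the same symptom, here is a minimal sketch of the kind of check that can locate such a chunk after reading the variable back into memory. The names `zero_run` and `longest_zero_run` are hypothetical helpers, not part of PnetCDF, and the sketch assumes the variable has already been read into a plain `int32_t` buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive zero-valued elements (hypothetical helper type). */
typedef struct {
    size_t start;   /* index of the first zero in the run */
    size_t length;  /* number of consecutive zeros; 0 means no run found */
} zero_run;

/* Scan data[0..n-1] and return the longest run of contiguous zeros.
 * If several runs share the maximum length, the first one is returned. */
zero_run longest_zero_run(const int32_t *data, size_t n)
{
    zero_run best = {0, 0};
    size_t i = 0;

    assert(data != NULL || n == 0);

    while (i < n) {
        if (data[i] != 0) {
            i++;
            continue;
        }
        /* data[i] is zero: walk to the end of this run. */
        size_t start = i;
        while (i < n && data[i] == 0)
            i++;
        if (i - start > best.length) {
            best.start = start;
            best.length = i - start;
        }
    }
    return best;
}
```

Comparing `start` and `start + length` against each rank's offset and count in the variable shows whether the zeroed region lines up with any rank boundary, and `start * sizeof(int32_t)` gives the byte offset of the run within the variable.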
If I change the variable to 8-byte integers, I don't get the chunks of zeros; it runs correctly...
It happens only with gcc-based openmpi and not with clang. We are using gcc-12.3 and openmpi-4.1.6.
When I run with address sanitizer, there are no memory issues reported, but we also don't get the zeroes.
The same is true with valgrind, although I only did about 4 runs (compared to 50-100 for the others).
I tried setting nc_header_align_size, nc_var_align_size, and nc_record_align_size, but none of those had any effect.
However, setting nc_num_aggrs_per_node to any value greater than zero and less than the number of ranks gives no errors in either debug or release builds. Setting it to num_ranks gives the error.
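For reference, this is roughly how the hint can be varied between runs without rebuilding. It is a sketch under assumptions: it relies on PnetCDF reading the `PNETCDF_HINTS` environment variable when the file is created or opened, `set_aggrs_hint` is a hypothetical helper name, and every rank would need to call it with the same value before `ncmpi_create()`.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: put nc_num_aggrs_per_node=<num_aggrs> into the
 * PNETCDF_HINTS environment variable of the calling process.
 * Note that this overwrites any hints already present in PNETCDF_HINTS.
 * Returns 0 on success, -1 on invalid input or failure. */
int set_aggrs_hint(int num_aggrs)
{
    char buf[64];
    int len;

    if (num_aggrs < 0)
        return -1;

    len = snprintf(buf, sizeof buf, "nc_num_aggrs_per_node=%d", num_aggrs);
    if (len < 0 || (size_t)len >= sizeof buf)
        return -1;

    /* Third argument 1: replace an existing value. */
    return setenv("PNETCDF_HINTS", buf, 1) == 0 ? 0 : -1;
}
```

Calling `set_aggrs_hint(k)` for each `k` from 1 up to the number of ranks per node, one run per value, is the sweep described above. The same hint could instead be passed through the `MPI_Info` object given to `ncmpi_create()`.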
Does the fact that nc_num_aggrs_per_node has an effect give you any idea of where to look for a possible problem?
Again, apologies for no reproducer...