Bad output for integer variable. #192

@gsjaardema

Sorry again for not being able to provide a reproducer, which definitely limits what you can do. I'm just reporting this to see whether it sounds like something you might have seen before.

I have a run using parallel I/O with an integer variable being written by all ranks. Intermittently, in approximately 30% to 50% of runs, the variable has a chunk of contiguous zeroes in the middle of it. The chunk does not correspond to any particular rank's output, and we have verified that the data being written does not contain any zeroes. The chunks come in groups of 661 or 1332 4-byte integers with pnetcdf-1.12.1 and groups of 704 or 1408 4-byte integers with pnetcdf-1.14.0.
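For anyone trying to characterize the zero chunks in their own output, here is a minimal sketch of a scan over the variable after it has been read back into memory. It is not taken from our application; `zero_run` and `find_zero_runs` are hypothetical names, and it assumes the variable is held as a flat array of 4-byte integers in which legitimate values are never zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive zero elements: indices [start, start + length). */
typedef struct {
    size_t start;
    size_t length;
} zero_run;

/*
 * Scan data[0..n) for runs of at least min_len consecutive zeros.
 * Up to max_runs runs are stored in runs (which may be NULL when
 * max_runs is 0). Returns the total number of qualifying runs found,
 * which can exceed max_runs; extra runs are counted but not stored.
 */
size_t find_zero_runs(const int32_t *data, size_t n, size_t min_len,
                      zero_run *runs, size_t max_runs)
{
    size_t found = 0;
    size_t i = 0;

    assert(data != NULL || n == 0);

    while (i < n) {
        if (data[i] != 0) {
            i++;
            continue;
        }
        /* data[i] is zero: advance to the end of this run. */
        size_t start = i;
        while (i < n && data[i] == 0) {
            i++;
        }
        size_t length = i - start;
        if (length >= min_len) {
            if (runs != NULL && found < max_runs) {
                runs[found].start = start;
                runs[found].length = length;
            }
            found++;
        }
    }
    return found;
}
```

Multiplying `start` and `length` by `sizeof(int32_t)` gives the byte offset and size of each chunk within the variable, which can then be compared against alignment or buffer sizes.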

If I change the variable to 8-byte integers, I don't get the chunks of zeros; it runs correctly...

It happens only with gcc-based openmpi and not with clang. We are using gcc-12.3 and openmpi-4.1.6.
When I run with address sanitizer, no memory issues are reported, but we also don't get the zeroes.
The same is true with valgrind, although I only did about 4 runs (compared to 50 to 100 for the others).

I tried setting nc_header_align_size, nc_var_align_size, and nc_record_align_size, but none of those had any effect.
However, setting nc_num_aggrs_per_node to any value greater than zero and less than the number of ranks gives no errors in either debug or release builds. Setting it to num_ranks gives the error.

Does the fact that nc_num_aggrs_per_node has an effect give you any idea of where to look for a possible problem?

Again, apologies for no reproducer...
