adammoody commented Feb 10, 2022

@roblatham00 , I was hacking together a prototype for MPI_File_fence. We talked about defining two more calls: MPI_File_flush and MPI_File_fetch, but I'll just stick with fence to start with. I think fence is likely the call HDF would eventually use in place of:

MPI_File_sync()
MPI_Barrier()
MPI_File_sync()

Recall that fence synchronizes with the file system servers to expose newly written data, but it does not require the data to be flushed to disk. The data only needs to be made visible to reads by other processes. Fence also implies a global barrier across ranks, so on return from fence every rank knows that all other ranks have also flushed their data.

One intrusive change is that it requires adding a new field to the struct of ADIO function pointers. I've not completed that for all ADIO implementations. I just added it in a couple places to bring up the point.


adammoody commented Feb 15, 2022

With the second commit in this PR, this at least compiles for me. It has not yet been added to the function pointer table of every ADIO implementation. So far it is defined for NFS, TESTFS, UFS, Lustre, GPFS, and UNIFY.


adammoody commented Feb 15, 2022

I extended the src/mpi/romio/test/perf.c test as a check and verified that it works on NFS.

adammoody commented

@roblatham00 , for POSIX-compliant file systems like Lustre and GPFS, MPI_File_fence only needs to execute a barrier. That's how I implemented GEN_Fence.

However, for file systems that are not POSIX-compliant, like NFS, ADIO_Fence may need to do more, perhaps even falling back to calling ADIO_Flush for an fsync. I could use your help in adding it to the other backends.
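
The per-file-system dispatch discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: every `sketch_*` name is hypothetical and none of them are the real ROMIO/ADIO types or functions, the barrier and flush are replaced by counters so the example runs without MPI, and "flush, then barrier" is just one possible fallback for a file system with weaker consistency, not a claim about what the NFS backend must do.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real ROMIO/ADIO types. */
typedef struct sketch_file sketch_file;

/* Table of per-file-system operations, with the new fence entry added. */
typedef struct {
    int (*flush)(sketch_file *fd);  /* push data to stable storage (fsync-like) */
    int (*fence)(sketch_file *fd);  /* new entry: make writes visible + barrier */
} sketch_fns;

struct sketch_file {
    const sketch_fns *fns;  /* operations for this file's file system */
    int barriers;           /* counts barrier calls (stands in for MPI_Barrier) */
    int flushes;            /* counts flush calls (stands in for fsync) */
};

int sketch_barrier(sketch_file *fd) { fd->barriers++; return 0; }
int sketch_flush(sketch_file *fd)   { fd->flushes++;  return 0; }

/* POSIX-consistent file systems: a barrier alone is assumed to be enough. */
int sketch_gen_fence(sketch_file *fd)
{
    return sketch_barrier(fd);
}

/* Weaker consistency: flush first, then barrier (one possible fallback). */
int sketch_flush_fence(sketch_file *fd)
{
    int rc = fd->fns->flush(fd);
    if (rc != 0)
        return rc;
    return sketch_barrier(fd);
}

const sketch_fns sketch_posix_fns = { sketch_flush, sketch_gen_fence };
const sketch_fns sketch_nfs_fns   = { sketch_flush, sketch_flush_fence };

/* Entry point: dispatch through the table; -1 if no fence is registered. */
int sketch_file_fence(sketch_file *fd)
{
    if (fd == NULL || fd->fns == NULL || fd->fns->fence == NULL)
        return -1;
    return fd->fns->fence(fd);
}
```

A file opened with `sketch_posix_fns` performs one barrier and no flush per fence, while one opened with `sketch_nfs_fns` performs one flush followed by one barrier. A backend that has not filled in the new field is reported as an error rather than called through a null pointer.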

roblatham00 pushed a commit that referenced this pull request Nov 10, 2025
Even though there can not be a buffer overflow as the string is properly
sized, noncontig_coll2 fails when built with -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 :
----
FAIL: noncontig_coll2
=====================
Thread 1 "noncontig_coll2" received signal SIGABRT, Aborted.
0x00007ffff709c5fc in __pthread_kill_implementation () from /lib64/libc.so.6
(gdb) bt
 #0  0x00007ffff709c5fc in __pthread_kill_implementation ()
    from /lib64/libc.so.6
 #1  0x00007ffff7042106 in raise () from /lib64/libc.so.6
 #2  0x00007ffff702938b in abort () from /lib64/libc.so.6
 #3  0x00007ffff702a3ab in __libc_message_impl.cold () from /lib64/libc.so.6
 #4  0x00007ffff712b4fb in __fortify_fail () from /lib64/libc.so.6
 #5  0x00007ffff712adc6 in __chk_fail () from /lib64/libc.so.6
 #6  0x00007ffff712c8f5 in __snprintf_chk () from /lib64/libc.so.6
 #7  0x000000000040275e in snprintf (__s=0x4aafee "", __n=<optimized out>,
     __fmt=0x404077 "%s,") at /usr/include/bits/stdio2.h:68
 #8  default_str (mynod=<optimized out>, len=61, array=0x59fca0,
     dest=0x4aafd0 "hostname,")
     at src/mpi/romio/test/noncontig_coll2.c:189
 #9  main (argc=<optimized out>, argv=<optimized out>)
     at src/mpi/romio/test/noncontig_coll2.c:330
----
This is due to the len parameter of snprintf not being updated as we
advance in the string.
Fix this issue by introducing a remaining len var that contains the exact amount
of bytes left.

Signed-off-by: Nicolas Morey <[email protected]>
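
The "remaining length" idea from the commit message above can be shown with a minimal sketch. This is not the code from noncontig_coll2.c: `join_with_commas` and its parameters are hypothetical names chosen for illustration. The point is that each `snprintf` call receives the number of bytes still free at the current write position, not the size of the whole buffer, which is the condition a fortified `snprintf` checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not taken from noncontig_coll2.c): appends each item
 * followed by a comma into dest, e.g. "a,b,c,". Returns the number of
 * characters written (excluding the terminating NUL), or -1 if the items
 * do not fit in dest_len bytes. */
int join_with_commas(char *dest, size_t dest_len,
                     const char *const *items, int count)
{
    assert(dest != NULL && dest_len > 0);

    char *cursor = dest;          /* next position to write */
    size_t remaining = dest_len;  /* bytes still free at cursor */
    dest[0] = '\0';

    for (int i = 0; i < count; i++) {
        /* Pass the space left at cursor, not the full buffer size. */
        int n = snprintf(cursor, remaining, "%s,", items[i]);
        if (n < 0 || (size_t) n >= remaining)
            return -1;            /* output error or truncated output */
        cursor += n;
        remaining -= (size_t) n;
    }
    return (int) (cursor - dest);
}
```

Passing the original buffer length on every call, while the destination pointer keeps advancing, claims more space than actually remains. That can trip the fortify check even when the final string would have fit.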