Contributor

@guj guj commented Oct 13, 2025

This adds a simple BTD example. While writing it, I ran into unexpected behavior.

To reproduce:

  1. Compile and run 16_btd_write_parallel.
  2. Use more than one rank (2 is enough).
  3. Pass -s (span) together with -b (MPI barrier before flush).
  4. "mpirun -n 2 16_btd_write_parallel -s" works, but "mpirun -n 2 16_btd_write_parallel -s -b" hangs: it never gets past the storeChunk call.

@franzpoeschel if you have time, please verify.

@guj guj changed the title Added a simple btd example [WIP] Added a simple btd example Oct 13, 2025
@franzpoeschel
Contributor

Yep, I see the hangup, too. Will have a look.

@franzpoeschel
Contributor

Hey Junmin, the issue here is that the Span-based storeChunk API is collective, which has not been properly documented so far. Fixing that with #1794.

For your code, this means that all ranks must call storeChunk() when using the Span API. The following patch should do that:

diff --git a/examples/16_btd_write_parallel.cpp b/examples/16_btd_write_parallel.cpp
index a7b420ba2..65a602d94 100644
--- a/examples/16_btd_write_parallel.cpp
+++ b/examples/16_btd_write_parallel.cpp
@@ -204,6 +204,10 @@ void doWork(
                 input.get(), input.get() + numElements, spanBuffer.data());
         }
     }
+    if (m_span)
+    {
+        mymesh.storeChunk<double>({0, 0, 0}, {0, 0, 0}).currentBuffer();
+    }
 }
 
 int main(int argc, char *argv[])

@guj
Contributor Author

guj commented Oct 14, 2025

Oh, I did not know that.
Then why does the barrier call cause the hang, while without the barrier the Span-based storeChunk succeeds?

    {
        if (i + 1 < argc)
        {
            int value = std::atoi(argv[++i]);

Check notice

Code scanning / CodeQL

For loop variable changed in body (Note)

Loop counters should not be modified in the body of the loop.
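
One way to address the notice above is to parse the arguments with a loop that advances its index in a single place, rather than incrementing the for-loop counter with argv[++i] inside the body. The following is only a sketch under assumptions: the Options struct, the parseArgs helper, and the "-n" flag are hypothetical names for illustration and are not taken from the example in this PR; only -s and -b appear in the thread.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical option set; only -s and -b are mentioned in this thread.
struct Options
{
    bool span = false;    // -s: use the Span-based storeChunk
    bool barrier = false; // -b: MPI barrier before flush
    int value = 0;        // -n <int>: made-up flag that takes a value
};

// Parse the arguments (without the program name).
// The index is advanced in exactly one statement at the end of each
// iteration, by the number of arguments the current option consumed.
Options parseArgs(std::vector<std::string> const &args)
{
    Options opts;
    std::size_t i = 0;
    while (i < args.size())
    {
        std::string const &arg = args[i];
        std::size_t consumed = 1;
        if (arg == "-s")
        {
            opts.span = true;
        }
        else if (arg == "-b")
        {
            opts.barrier = true;
        }
        else if (arg == "-n" && i + 1 < args.size())
        {
            // Read the value from the next argument without touching i here.
            opts.value = std::atoi(args[i + 1].c_str());
            consumed = 2;
        }
        // Unknown arguments are skipped silently in this sketch.
        i += consumed;
    }
    return opts;
}
```

In main, this could be called as parseArgs(std::vector<std::string>(argv + 1, argv + argc)). Because the loop is a while loop with one explicit increment, it should no longer match the "for loop variable changed in body" pattern, though that is worth confirming against the actual code scanning run.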
@franzpoeschel
Contributor

Okay, I take it back: the Span API can be used non-collectively, but there is currently a bug in Iteration::open(), which #1794 fixes.
So you can roll back the last commit and use the fix from #1794 instead.

@guj
Contributor Author

guj commented Oct 15, 2025

I just checked, and it works. Will you merge your fix?

@franzpoeschel
Contributor

I have merged it now; you can rebase this branch.

@guj guj changed the title [WIP] Added a simple btd example Added a simple btd example Oct 15, 2025
@guj
Contributor Author

guj commented Oct 15, 2025

@ax3l @franzpoeschel can we merge this one?
