Skip to content

Fix bug in restarting with different MPI processes#1545

Draft
chongchonghe wants to merge 17 commits intodevelopmentfrom
chong/fix-particle-restart-bug
Draft

Fix bug in restarting with different MPI processes#1545
chongchonghe wants to merge 17 commits intodevelopmentfrom
chong/fix-particle-restart-bug

Conversation

@chongchonghe
Copy link
Contributor

Description

Related issues

#1544

Checklist

Before this pull request can be reviewed, all of these tasks should be completed. Denote completed tasks with an x inside the square brackets [ ] in the Markdown source below:

  • I have added a description (see above).
  • I have added a link to any related issues (if applicable; see above).
  • I have read the Contributing Guide.
  • I have added tests for any new physics that this PR adds to the code.
  • (For quokka-astro org members) I have manually triggered the GPU tests with the magic comment /azp run.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Dec 27, 2025
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @chongchonghe, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug that prevented simulations from correctly restarting when the number of MPI processes differed from the original checkpoint. The changes introduce comprehensive handling for particle data files to ensure consistency across varying MPI ranks during restart. Additionally, the PR includes updates to the ParticleRadiation test problem and its input configuration to validate this new functionality, particularly in scenarios involving adaptive mesh refinement.

Highlights

  • MPI Process Handling on Restart: Implemented robust logic within restartParticleContainerWithRefinement to correctly handle particle container restarts when the number of MPI processes has changed since the checkpoint was written. This includes ensuring all levels have the necessary data files and redistributing particles across the new process count.
  • Adaptive Mesh Refinement (AMR) Test Setup: Modified the ParticleRadiation problem to include a refineGrid specialization that tags all cells for static mesh refinement, enabling testing of the restart functionality with AMR.
  • Test Case Configuration: Adjusted the ParticleRadiation.in input file to enable AMR (max_level = 1), set appropriate checkpoint and plotfile intervals, and specified a restartfile to facilitate testing of the new restart logic.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@dosubot dosubot bot added the bug: wrong answer/failure/crash Something isn't working label Dec 27, 2025
@chongchonghe chongchonghe marked this pull request as draft December 27, 2025 00:50
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix for restarting simulations with a different number of MPI processes, particularly for particle data. The core logic involves creating dummy particle data files for new MPI ranks to prevent crashes on restart. While the approach is sound, I've identified a critical security vulnerability and a potential crash due to a division-by-zero error in the implementation. Additionally, a change in one of the test files effectively disables its ability to report failures. My review provides suggestions to address these issues.

Comment on lines +3982 to +4014
// For each level, ensure it exists and has the correct number of DATA files
for (int lev = 0; lev <= finest_level; ++lev) {
std::string level_path = pc_path + "/Level_" + std::to_string(lev);
if (!amrex::FileSystem::Exists(level_path)) {
// Create the missing level directory by copying from source level
std::string cp_cmd = "cp -r " + source_level_path + " " + level_path;
system(cp_cmd.c_str());
}
// Now ensure this level has the correct number of DATA files
int num_data_files = 0;
for (int i = 0;; ++i) {
std::string data_file = level_path + "/DATA_" + amrex::Concatenate("", i, 5);
if (amrex::FileSystem::Exists(data_file)) {
num_data_files = i + 1;
} else {
break;
}
}
if (num_data_files < num_procs) {
// Copy DATA files from source level
for (int i = num_data_files; i < num_procs; ++i) {
std::string src_file = level_path + "/DATA_" + amrex::Concatenate("", i % num_source_data_files, 5);
std::string dst_file = level_path + "/DATA_" + amrex::Concatenate("", i, 5);
if (!amrex::FileSystem::Exists(dst_file)) {
std::ifstream src(src_file, std::ios::binary);
std::ofstream dst(dst_file, std::ios::binary);
if (src && dst) {
dst << src.rdbuf();
}
}
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This block of code has two critical issues:

  1. Command Injection Vulnerability: The use of system("cp -r ...") with a path constructed from user input (restart_chkfile) is a serious security risk. A malicious checkpoint file name could lead to arbitrary command execution.
  2. Division by Zero: If num_source_data_files is 0 (i.e., no particle data files are found in the source level), the code will crash due to a division by zero in i % num_source_data_files.

I've provided a suggestion that fixes both issues by:

  • Replacing the system call with std::filesystem::copy, which is safer and more portable. The <filesystem> header is already included in this file.
  • Adding a check to ensure num_source_data_files > 0 before attempting to use it in the modulo operation.
if (num_source_data_files > 0) {
					// For each level, ensure it exists and has the correct number of DATA files
					for (int lev = 0; lev <= finest_level; ++lev) {
						std::string level_path = pc_path + "/Level_" + std::to_string(lev);
						if (!amrex::FileSystem::Exists(level_path)) {
							// Create the missing level directory by copying from source level
							std::filesystem::copy(source_level_path, level_path, std::filesystem::copy_options::recursive);
						}
						// Now ensure this level has the correct number of DATA files
						int num_data_files = 0;
						for (int i = 0;; ++i) {
							std::string data_file = level_path + "/DATA_" + amrex::Concatenate("", i, 5);
							if (amrex::FileSystem::Exists(data_file)) {
								num_data_files = i + 1;
							} else {
								break;
							}
						}
						if (num_data_files < num_procs) {
							// Copy DATA files from source level
							for (int i = num_data_files; i < num_procs; ++i) {
								std::string src_file = level_path + "/DATA_" + amrex::Concatenate("", i % num_source_data_files, 5);
								std::string dst_file = level_path + "/DATA_" + amrex::Concatenate("", i, 5);
								if (!amrex::FileSystem::Exists(dst_file)) {
									std::ifstream src(src_file, std::ios::binary);
									std::ofstream dst(dst_file, std::ios::binary);
									if (src && dst) {
										dst << src.rdbuf();
									}
								}
							}
						}
					}
				}

}

return status;
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The function now always returns 0, which means the test will report success even if it fails. The status variable, which correctly tracks the test outcome, should be returned instead.

return status;

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 269 to +270

return status;
return 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate particle radiation test failures

The particle radiation test sets status = 1 when the energy tolerance check fails (lines 259–263), but the function now unconditionally returns 0 at the end. Any restart/energy regression that sets status will no longer fail the test harness, silently masking errors this change is meant to catch.

Useful? React with 👍 / 👎.

if (!amrex::FileSystem::Exists(level_path)) {
// Create the missing level directory by copying from source level
std::string cp_cmd = "cp -r " + source_level_path + " " + level_path;
system(cp_cmd.c_str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not manipulate the on-disk AMReX format. This should really be fixed upstream in AMReX to handle the case where there are levels that don't have particles.

if 'mass' in components:
mass_idx = 3 + components.index('mass')
masses = data[:, mass_idx] / M_SUN # Convert to M_sun
mass_min = np.min(masses)
mass_idx = 3 + components.index('mass')
masses = data[:, mass_idx] / M_SUN # Convert to M_sun
mass_min = np.min(masses)
mass_max = np.max(masses)
masses = data[:, mass_idx] / M_SUN # Convert to M_sun
mass_min = np.min(masses)
mass_max = np.max(masses)
mass_median = np.median(masses)
vy_idx = 3 + components.index('vy')
vz_idx = 3 + components.index('vz')
velocities = data[:, [vx_idx, vy_idx, vz_idx]]
vel_disp = np.std(velocities, axis=0) / KM_S
if 'mass' in components:
mass_idx = 3 + components.index('mass')
masses = data[:, mass_idx] / M_SUN # Convert to M_sun
mass_min = np.min(masses)

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable mass_min is not used.

Copilot Autofix

AI about 2 months ago

To fix the problem, the code should either remove the unused variable or, more appropriately here, use the computed mass statistics in output. The best fix without changing existing functionality is to replace the placeholder print(".1f") with meaningful, formatted printing of mass_min, mass_max, and mass_median. This both removes the unused-variable issue and makes the function behave as its name and surrounding code suggest. Similarly, the velocity statistics branch should be updated to print values instead of print(".1f"), so that the computed vel_disp is also used.

Concretely, in scripts/python/create_particles.py, inside print_particle_info:

  • Replace the single print(".1f") under the mass statistics if block with one or more print statements that include mass_min, mass_max, and mass_median, formatted in solar masses with :.1f as implied.
  • Replace the print(".1f") under the velocity statistics if block with a print that includes vel_disp (e.g., per component) also with :.1f formatting and in km/s, consistent with the units used for DEFAULT_VELOCITY_DISP.

No new imports or additional helper functions are required; all necessary constants (M_SUN, KM_S) and numpy functions are already available in this file.

Suggested changeset 1
scripts/python/create_particles.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/scripts/python/create_particles.py b/scripts/python/create_particles.py
--- a/scripts/python/create_particles.py
+++ b/scripts/python/create_particles.py
@@ -294,7 +294,8 @@
         mass_min = np.min(masses)
         mass_max = np.max(masses)
         mass_median = np.median(masses)
-        print(".1f")
+        print(f"Mass range:   [{mass_min:.1f}, {mass_max:.1f}] M_sun")
+        print(f"Mass median:  {mass_median:.1f} M_sun")
 
     # Velocity statistics (if applicable)
     if 'vx' in components:
@@ -303,7 +304,12 @@
         vz_idx = 3 + components.index('vz')
         velocities = data[:, [vx_idx, vy_idx, vz_idx]]
         vel_disp = np.std(velocities, axis=0) / KM_S
-        print(".1f")
+        print(
+            "Velocity dispersion (km/s): "
+            f"sigma_x={vel_disp[0]:.1f}, "
+            f"sigma_y={vel_disp[1]:.1f}, "
+            f"sigma_z={vel_disp[2]:.1f}"
+        )
 
 def main():
     parser = argparse.ArgumentParser(
EOF
@@ -294,7 +294,8 @@
mass_min = np.min(masses)
mass_max = np.max(masses)
mass_median = np.median(masses)
print(".1f")
print(f"Mass range: [{mass_min:.1f}, {mass_max:.1f}] M_sun")
print(f"Mass median: {mass_median:.1f} M_sun")

# Velocity statistics (if applicable)
if 'vx' in components:
@@ -303,7 +304,12 @@
vz_idx = 3 + components.index('vz')
velocities = data[:, [vx_idx, vy_idx, vz_idx]]
vel_disp = np.std(velocities, axis=0) / KM_S
print(".1f")
print(
"Velocity dispersion (km/s): "
f"sigma_x={vel_disp[0]:.1f}, "
f"sigma_y={vel_disp[1]:.1f}, "
f"sigma_z={vel_disp[2]:.1f}"
)

def main():
parser = argparse.ArgumentParser(
Copilot is powered by AI and may make mistakes. Always verify output.
mass_idx = 3 + components.index('mass')
masses = data[:, mass_idx] / M_SUN # Convert to M_sun
mass_min = np.min(masses)
mass_max = np.max(masses)

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable mass_max is not used.

Copilot Autofix

AI about 2 months ago

In general, to fix an unused local variable you either remove the assignment if the value is not needed, or you use the variable in a meaningful way (e.g., include it in logging or further calculations). If the variable is intentionally unused (for API or unpacking reasons), you rename it to an “unused” pattern like _ or unused_mass_max.

Here, the best fix without changing the script’s overall behavior is to integrate mass_min, mass_max, and mass_median into the printed particle summary instead of leaving a placeholder print(".1f"). This aligns with the function’s stated purpose (“Print summary information about generated particles”), provides useful information to the user, and ensures mass_max (and the other statistics) are actually used.

Concretely, within print_particle_info in scripts/python/create_particles.py, we will:

  • Replace the placeholder print(".1f") line in the mass statistics block with a formatted print that includes mass_min, mass_max, and mass_median (e.g., in solar masses with one decimal place).
  • Similarly, replace the placeholder print(".1f") in the velocity block with a print that uses vel_disp, for consistency and to avoid future unused-variable issues.

No new imports or helper methods are needed; we just change the print statements in that function.

Suggested changeset 1
scripts/python/create_particles.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/scripts/python/create_particles.py b/scripts/python/create_particles.py
--- a/scripts/python/create_particles.py
+++ b/scripts/python/create_particles.py
@@ -294,7 +294,7 @@
         mass_min = np.min(masses)
         mass_max = np.max(masses)
         mass_median = np.median(masses)
-        print(".1f")
+        print(f"Mass statistics (M_sun): min={mass_min:.1f}, max={mass_max:.1f}, median={mass_median:.1f}")
 
     # Velocity statistics (if applicable)
     if 'vx' in components:
@@ -303,7 +303,8 @@
         vz_idx = 3 + components.index('vz')
         velocities = data[:, [vx_idx, vy_idx, vz_idx]]
         vel_disp = np.std(velocities, axis=0) / KM_S
-        print(".1f")
+        print(f"Velocity dispersion (km/s): "
+              f"sigma_x={vel_disp[0]:.1f}, sigma_y={vel_disp[1]:.1f}, sigma_z={vel_disp[2]:.1f}")
 
 def main():
     parser = argparse.ArgumentParser(
EOF
@@ -294,7 +294,7 @@
mass_min = np.min(masses)
mass_max = np.max(masses)
mass_median = np.median(masses)
print(".1f")
print(f"Mass statistics (M_sun): min={mass_min:.1f}, max={mass_max:.1f}, median={mass_median:.1f}")

# Velocity statistics (if applicable)
if 'vx' in components:
@@ -303,7 +303,8 @@
vz_idx = 3 + components.index('vz')
velocities = data[:, [vx_idx, vy_idx, vz_idx]]
vel_disp = np.std(velocities, axis=0) / KM_S
print(".1f")
print(f"Velocity dispersion (km/s): "
f"sigma_x={vel_disp[0]:.1f}, sigma_y={vel_disp[1]:.1f}, sigma_z={vel_disp[2]:.1f}")

def main():
parser = argparse.ArgumentParser(
Copilot is powered by AI and may make mistakes. Always verify output.
masses = data[:, mass_idx] / M_SUN # Convert to M_sun
mass_min = np.min(masses)
mass_max = np.max(masses)
mass_median = np.median(masses)

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable mass_median is not used.

Copilot Autofix

AI about 2 months ago

In general, to fix an unused local variable you either remove the variable (and any associated computation) if the result is not needed, or you actually use it (for example, in a return value or a log/print statement) if the computation is meaningful. Here, mass_min, mass_max, and mass_median are all computed, but none are used; instead, there is a placeholder print(".1f"). Similarly, for velocities, vel_disp is computed but only followed by print(".1f"). The function is documented as printing summary information, so the best fix that preserves and completes the intended functionality is to incorporate these statistics into formatted print statements and remove the placeholder print(".1f") calls.

Concretely, in print_particle_info in scripts/python/create_particles.py, replace the mass block so that it prints the min, max, and median masses (in solar masses) using the computed values and delete the print(".1f") line. Then, in the velocity block, similarly replace the placeholder print(".1f") with a formatted print that uses vel_disp to show the velocity dispersion along each axis, likely in km/s (since we divide by KM_S). No new imports or helper methods are needed; we only add more detailed print statements using the existing variables.

Suggested changeset 1
scripts/python/create_particles.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/scripts/python/create_particles.py b/scripts/python/create_particles.py
--- a/scripts/python/create_particles.py
+++ b/scripts/python/create_particles.py
@@ -294,7 +294,8 @@
         mass_min = np.min(masses)
         mass_max = np.max(masses)
         mass_median = np.median(masses)
-        print(".1f")
+        print(f"Mass range: [{mass_min:.1f}, {mass_max:.1f}] M_sun")
+        print(f"Mass median: {mass_median:.1f} M_sun")
 
     # Velocity statistics (if applicable)
     if 'vx' in components:
@@ -303,7 +304,12 @@
         vz_idx = 3 + components.index('vz')
         velocities = data[:, [vx_idx, vy_idx, vz_idx]]
         vel_disp = np.std(velocities, axis=0) / KM_S
-        print(".1f")
+        print(
+            "Velocity dispersion: "
+            f"sigma_x={vel_disp[0]:.1f} km/s, "
+            f"sigma_y={vel_disp[1]:.1f} km/s, "
+            f"sigma_z={vel_disp[2]:.1f} km/s"
+        )
 
 def main():
     parser = argparse.ArgumentParser(
EOF
@@ -294,7 +294,8 @@
mass_min = np.min(masses)
mass_max = np.max(masses)
mass_median = np.median(masses)
print(".1f")
print(f"Mass range: [{mass_min:.1f}, {mass_max:.1f}] M_sun")
print(f"Mass median: {mass_median:.1f} M_sun")

# Velocity statistics (if applicable)
if 'vx' in components:
@@ -303,7 +304,12 @@
vz_idx = 3 + components.index('vz')
velocities = data[:, [vx_idx, vy_idx, vz_idx]]
vel_disp = np.std(velocities, axis=0) / KM_S
print(".1f")
print(
"Velocity dispersion: "
f"sigma_x={vel_disp[0]:.1f} km/s, "
f"sigma_y={vel_disp[1]:.1f} km/s, "
f"sigma_z={vel_disp[2]:.1f} km/s"
)

def main():
parser = argparse.ArgumentParser(
Copilot is powered by AI and may make mistakes. Always verify output.
vy_idx = 3 + components.index('vy')
vz_idx = 3 + components.index('vz')
velocities = data[:, [vx_idx, vy_idx, vz_idx]]
vel_disp = np.std(velocities, axis=0) / KM_S

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable vel_disp is not used.

Copilot Autofix

AI about 2 months ago

In general, to fix an unused local variable you either (a) remove the assignment if the result isn’t needed and there are no side effects, or (b) use the variable meaningfully (e.g., in logging, return values, or further calculations). If the variable is intentionally unused, you rename it to something like _ or unused_vel_disp.

Here, the best fix that keeps and clarifies existing intended functionality is to actually use vel_disp when printing velocity statistics. The surrounding code in print_particle_info prints detailed position statistics and sets up placeholders in the mass and velocity sections. The variable vel_disp holds the 1D standard deviation of velocity components divided by KM_S, so it represents the velocity dispersion per component in km/s. The minimal, non-breaking improvement is to replace the placeholder print(".1f") with a formatted print that uses vel_disp, similar in spirit to how mass statistics would be printed.

Concretely:

  • In scripts/python/create_particles.py, within print_particle_info, replace the print(".1f") in the velocity statistics block with something like:
    print(f"Velocity dispersion: vx={vel_disp[0]:.1f} km/s, "
          f"vy={vel_disp[1]:.1f} km/s, vz={vel_disp[2]:.1f} km/s")
  • This change requires no new imports or definitions; it only uses existing variables and constants.
  • We leave the calculation of vel_disp intact and simply use it, resolving the CodeQL finding and improving the information printed.
Suggested changeset 1
scripts/python/create_particles.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/scripts/python/create_particles.py b/scripts/python/create_particles.py
--- a/scripts/python/create_particles.py
+++ b/scripts/python/create_particles.py
@@ -303,7 +303,8 @@
         vz_idx = 3 + components.index('vz')
         velocities = data[:, [vx_idx, vy_idx, vz_idx]]
         vel_disp = np.std(velocities, axis=0) / KM_S
-        print(".1f")
+        print(f"Velocity dispersion: vx={vel_disp[0]:.1f} km/s, "
+              f"vy={vel_disp[1]:.1f} km/s, vz={vel_disp[2]:.1f} km/s")
 
 def main():
     parser = argparse.ArgumentParser(
EOF
@@ -303,7 +303,8 @@
vz_idx = 3 + components.index('vz')
velocities = data[:, [vx_idx, vy_idx, vz_idx]]
vel_disp = np.std(velocities, axis=0) / KM_S
print(".1f")
print(f"Velocity dispersion: vx={vel_disp[0]:.1f} km/s, "
f"vy={vel_disp[1]:.1f} km/s, vz={vel_disp[2]:.1f} km/s")

def main():
parser = argparse.ArgumentParser(
Copilot is powered by AI and may make mistakes. Always verify output.
@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug: wrong answer/failure/crash Something isn't working size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants