Fix bug in restarting with different MPI processes by chongchonghe · Pull Request #1545 · quokka-astro/quokka

chongchonghe · 2025-12-27T00:48:12Z

Description

Related issues

Checklist

Before this pull request can be reviewed, all of these tasks should be completed. Denote completed tasks with an x inside the square brackets [ ] in the Markdown source below:

I have added a description (see above).
I have added a link to any related issues (if applicable; see above).
I have read the Contributing Guide.
I have added tests for any new physics that this PR adds to the code.
(For quokka-astro org members) I have manually triggered the GPU tests with the magic comment /azp run.

gemini-code-assist · 2025-12-27T00:48:28Z

Summary of Changes

Hello @chongchonghe, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug that prevented simulations from correctly restarting when the number of MPI processes differed from the original checkpoint. The changes introduce comprehensive handling for particle data files to ensure consistency across varying MPI ranks during restart. Additionally, the PR includes updates to the ParticleRadiation test problem and its input configuration to validate this new functionality, particularly in scenarios involving adaptive mesh refinement.

Highlights

MPI Process Handling on Restart: Implemented robust logic within restartParticleContainerWithRefinement to correctly handle particle container restarts when the number of MPI processes has changed since the checkpoint was written. This includes ensuring all levels have the necessary data files and redistributing particles across the new process count.
Adaptive Mesh Refinement (AMR) Test Setup: Modified the ParticleRadiation problem to include a refineGrid specialization that tags all cells for static mesh refinement, enabling testing of the restart functionality with AMR.
Test Case Configuration: Adjusted the ParticleRadiation.in input file to enable AMR (max_level = 1), set appropriate checkpoint and plotfile intervals, and specified a restartfile to facilitate testing of the new restart logic.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a fix for restarting simulations with a different number of MPI processes, particularly for particle data. The core logic involves creating dummy particle data files for new MPI ranks to prevent crashes on restart. While the approach is sound, I've identified a critical security vulnerability and a potential crash due to a division-by-zero error in the implementation. Additionally, a change in one of the test files effectively disables its ability to report failures. My review provides suggestions to address these issues.

gemini-code-assist · 2025-12-27T00:50:20Z

src/simulation.hpp

+				// For each level, ensure it exists and has the correct number of DATA files
+				for (int lev = 0; lev <= finest_level; ++lev) {
+					std::string level_path = pc_path + "/Level_" + std::to_string(lev);
+					if (!amrex::FileSystem::Exists(level_path)) {
+						// Create the missing level directory by copying from source level
+						std::string cp_cmd = "cp -r " + source_level_path + " " + level_path;
+						system(cp_cmd.c_str());
+					}
+					// Now ensure this level has the correct number of DATA files
+					int num_data_files = 0;
+					for (int i = 0;; ++i) {
+						std::string data_file = level_path + "/DATA_" + amrex::Concatenate("", i, 5);
+						if (amrex::FileSystem::Exists(data_file)) {
+							num_data_files = i + 1;
+						} else {
+							break;
+						}
+					}
+					if (num_data_files < num_procs) {
+						// Copy DATA files from source level
+						for (int i = num_data_files; i < num_procs; ++i) {
+							std::string src_file = level_path + "/DATA_" + amrex::Concatenate("", i % num_source_data_files, 5);
+							std::string dst_file = level_path + "/DATA_" + amrex::Concatenate("", i, 5);
+							if (!amrex::FileSystem::Exists(dst_file)) {
+								std::ifstream src(src_file, std::ios::binary);
+								std::ofstream dst(dst_file, std::ios::binary);
+								if (src && dst) {
+									dst << src.rdbuf();
+								}
+							}
+						}
+					}
+				}


This block of code has two critical issues:

Command Injection Vulnerability: The use of system("cp -r ...") with a path constructed from user input (restart_chkfile) is a serious security risk. A malicious checkpoint file name could lead to arbitrary command execution.

Division by Zero: If num_source_data_files is 0 (i.e., no particle data files are found in the source level), the expression i % num_source_data_files performs an integer modulo by zero, which is undefined behavior in C++ and typically crashes the program.

I've provided a suggestion that fixes both issues by:

Replacing the system call with std::filesystem::copy, which is safer and more portable. The <filesystem> header is already included in this file.

Adding a check to ensure num_source_data_files > 0 before attempting to use it in the modulo operation.

if (num_source_data_files > 0) {
	// For each level, ensure it exists and has the correct number of DATA files
	for (int lev = 0; lev <= finest_level; ++lev) {
		std::string level_path = pc_path + "/Level_" + std::to_string(lev);
		if (!amrex::FileSystem::Exists(level_path)) {
			// Create the missing level directory by copying from source level
			std::filesystem::copy(source_level_path, level_path, std::filesystem::copy_options::recursive);
		}
		// Now ensure this level has the correct number of DATA files
		int num_data_files = 0;
		for (int i = 0;; ++i) {
			std::string data_file = level_path + "/DATA_" + amrex::Concatenate("", i, 5);
			if (amrex::FileSystem::Exists(data_file)) {
				num_data_files = i + 1;
			} else {
				break;
			}
		}
		if (num_data_files < num_procs) {
			// Copy DATA files from source level
			for (int i = num_data_files; i < num_procs; ++i) {
				std::string src_file = level_path + "/DATA_" + amrex::Concatenate("", i % num_source_data_files, 5);
				std::string dst_file = level_path + "/DATA_" + amrex::Concatenate("", i, 5);
				if (!amrex::FileSystem::Exists(dst_file)) {
					std::ifstream src(src_file, std::ios::binary);
					std::ofstream dst(dst_file, std::ios::binary);
					if (src && dst) {
						dst << src.rdbuf();
					}
				}
			}
		}
	}
}
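As a rough, self-contained illustration of the two fixes described in this comment (no shell command, and a guard before the modulo), the sketch below uses only the C++17 standard library. The helper names data_file_name, count_data_files, and pad_data_files are hypothetical and are not part of Quokka or AMReX; data_file_name only imitates the five-digit zero padding of the DATA_ file names in the diff. Unlike the diff, this sketch reads the files to copy from the source directory rather than from the level being padded, which is an assumption about the intended behavior.

```cpp
#include <cassert>
#include <cstdio>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: builds names such as "DATA_00042" (5-digit zero padding).
std::string data_file_name(int i)
{
	char buf[32];
	std::snprintf(buf, sizeof(buf), "DATA_%05d", i);
	return std::string(buf);
}

// Count the consecutive files DATA_00000, DATA_00001, ... present in a directory.
int count_data_files(const fs::path &level_dir)
{
	int n = 0;
	while (fs::exists(level_dir / data_file_name(n))) {
		++n;
	}
	return n;
}

// Hypothetical helper: make sure level_dir holds at least num_procs DATA files,
// reusing the files of source_dir round-robin. Returns false (and does nothing)
// when source_dir has no DATA files, so the modulo below never divides by zero.
bool pad_data_files(const fs::path &source_dir, const fs::path &level_dir, int num_procs)
{
	assert(num_procs >= 0);
	const int num_source = count_data_files(source_dir);
	if (num_source == 0) {
		return false; // guard: i % 0 would be undefined behavior
	}
	if (!fs::exists(level_dir)) {
		// Library call instead of system("cp -r ..."): no shell parses the paths.
		fs::copy(source_dir, level_dir, fs::copy_options::recursive);
	}
	for (int i = count_data_files(level_dir); i < num_procs; ++i) {
		fs::copy_file(source_dir / data_file_name(i % num_source), level_dir / data_file_name(i),
			      fs::copy_options::skip_existing);
	}
	return true;
}
```

Because the paths are passed as std::filesystem::path objects and never concatenated into a command line, characters such as `;` or `$(...)` in a checkpoint name are treated as ordinary file name characters. The sketch says nothing about whether editing the checkpoint directory is desirable in the first place.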

gemini-code-assist · 2025-12-27T00:50:20Z

src/problems/ParticleRadiation/testParticleRadiation.cpp

 	}

-	return status;
+	return 0;


The function now always returns 0, which means the test will report success even if it fails. The status variable, which correctly tracks the test outcome, should be returned instead.

return status;

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-12-27T00:50:45Z

src/problems/ParticleRadiation/testParticleRadiation.cpp


-	return status;
+	return 0;


Propagate particle radiation test failures

The particle radiation test sets status = 1 when the energy tolerance check fails (lines 259–263), but the function now unconditionally returns 0 at the end. Any restart/energy regression that sets status will no longer fail the test harness, silently masking errors this change is meant to catch.
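To make the failure mode concrete, here is a minimal sketch of a tolerance check whose result reaches the caller only if status is returned. The function check_energy and its parameters are hypothetical stand-ins and are not the actual code in testParticleRadiation.cpp.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical check: returns 0 on success and 1 when the relative error
// between the computed and expected energy exceeds rel_tol.
int check_energy(double computed, double expected, double rel_tol)
{
	assert(rel_tol >= 0.0);
	const double rel_err = std::abs(computed - expected) / std::abs(expected);
	int status = 0;
	if (!(rel_err <= rel_tol)) { // written this way so that NaN also counts as a failure
		status = 1;
	}
	// Returning a literal 0 here instead would report success even when status == 1.
	return status;
}
```

A test harness that inspects the process exit code only sees a failure if this value is passed through unchanged, for example by returning it from the problem's main function.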


BenWibking · 2025-12-27T17:10:37Z

src/simulation.hpp

+					if (!amrex::FileSystem::Exists(level_path)) {
+						// Create the missing level directory by copying from source level
+						std::string cp_cmd = "cp -r " + source_level_path + " " + level_path;
+						system(cp_cmd.c_str());


We should not manipulate the on-disk AMReX format. This should really be fixed upstream in AMReX to handle the case where there are levels that don't have particles.
