Skip to content

Clang UB sanitizer CI test: increase coverage #5597

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: development
Choose a base branch
from

Conversation

lucafedeli88
Copy link
Member

@lucafedeli88 lucafedeli88 commented Jan 23, 2025

In CI test based on clang UB sanitizers, most of the time (~ 1h30) is spent in compiling the code, while just a few minutes are spent actually running some simulations. This means that we can increase the coverage of the test by adding some more simulations to the tests with a negligible increase of the total runtime.
This PR does just that: now most of the cases in Examples/Physics_applications are tested with the UB sanitizer.

Note that some cases cannot run in double precision (see below). For this reason, the PR also splits the UB sanitizer test into single precision and double precision (in double precision only the cases that cannot run in single precision are tested).

Updates:

Issue found while running inputs_test_3d_beam_beam_collision --> We need to run this case in double precision

The tool has found an issue while running mpirun -n 2 ./build/bin/warpx.3d Examples/Physics_applications/beam_beam_collision/inputs_test_3d_beam_beam_collision :

STEP 1 starts ...
/home/runner/work/WarpX/WarpX/build/_deps/fetchedpicsar-src/multi_physics/QED/include/picsar_qed/containers/picsar_tables.hpp:310:17: runtime error: -nan is outside the range of representable values of type 'int'
/home/runner/work/WarpX/WarpX/build/_deps/fetchedpicsar-src/multi_physics/QED/include/picsar_qed/containers/picsar_tables.hpp:310:17: runtime error: -nan is outside the range of representable values of type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/runner/work/WarpX/WarpX/build/_deps/fetchedpicsar-src/multi_physics/QED/include/picsar_qed/containers/picsar_tables.hpp:310:17 in 
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/runner/work/WarpX/WarpX/build/_deps/fetchedpicsar-src/multi_physics/QED/include/picsar_qed/containers/picsar_tables.hpp:310:17 in 

I've temporarily commented out this case while I investigate the cause. For the moment, I am not able to reproduce the issue on my local machine. The issue is using single precision for this specific test case! Specifically, momenta end up being NaN and the sanitizer detects the attempt to convert a floating point NaN into an integer.
I have added an ABLASTR_ALWAYS_ASSERT_WITH_MESSAGE in PoissonSolver.H so that a more readable error message is provided to the user.

Issue found while running inputs_test_2d_background_mcc --> We need to run this case in double precision

MLMG does not converge in single precision for this simulation case. We need to run it in double precision.

Issue found while running free_electron_laser --> We need to run this case in double precision

I have observed this issue:

 STEP 444 starts ...
0::1::Assertion `m_current_z_lab[i_buffer] >= m_buffer_domain_lab[i_buffer].lo(m_moving_window_dir) and m_current_z_lab[i_buffer] <= m_buffer_domain_lab[i_buffer].hi(m_moving_window_dir)' failed, file "/home/runner/work/WarpX/WarpX/Source/Diagnostics/BTDiagnostics.cpp", line 870, Msg: 
 ### ERROR   : z-slice in lab-frame (0.299976) is outside the buffer domain
#            physical extent (0.299976 to 0.299988).
 !!!

which seems to be related to using single precision instead of double precision. Therefore, we need to run this case in double precision.

Issue found while running inputs_test_2d_laser_ion_acc --> bugfix in WarpX

inputs_test_2d_laser_ion_acc case has the following issue in single precision:

--- INFO    : Writing openPMD file diags/openPMDbw000000
terminate called after throwing an instance of 'std::runtime_error'
  what():  Datatypes of chunk data (FLOAT) and record component (DOUBLE) do not match.
SIGABRT
See Backtrace.0.0 file for details

This comes from the fact that the datatype of this dataset in ParticleHistogram2D.cpp is hard-coded as double:

    auto dataset = io::Dataset(
            io::determineDatatype<double>(),
            {static_cast<unsigned long>(m_bin_num_ord), static_cast<unsigned long>(m_bin_num_abs)});

this PR modifies these lines as follows:

    auto dataset = io::Dataset(
            io::determineDatatype<amrex::Real>(),
            {static_cast<unsigned long>(m_bin_num_ord), static_cast<unsigned long>(m_bin_num_abs)});

@lucafedeli88 lucafedeli88 added the component: tests Tests and CI label Jan 23, 2025
@lucafedeli88 lucafedeli88 changed the title Clang UB sanitizer CI test: increase coverage [WIP] Clang UB sanitizer CI test: increase coverage Jan 23, 2025
@lucafedeli88 lucafedeli88 changed the title [WIP] Clang UB sanitizer CI test: increase coverage Clang UB sanitizer CI test: increase coverage Jan 24, 2025
@lucafedeli88 lucafedeli88 requested review from EZoni and ax3l January 24, 2025 09:32
@@ -298,7 +298,7 @@ void ParticleHistogram2D::WriteToFile (int step) const
data.setPosition<amrex::Real>({0.5, 0.5});

auto dataset = io::Dataset(
io::determineDatatype<double>(),
io::determineDatatype<amrex::Real>(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the properties be in ParticleReal?

Suggested change
io::determineDatatype<amrex::Real>(),
io::determineDatatype<amrex::ParticleReal>(),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, we also have to modify the datatype of m_h_data_2D to be coherent (I've done that in my latest commit): amrex::TableData<amrex::Real,2> m_h_data_2D -> amrex::TableData<amrex::ParticleReal,2> m_h_data_2D

@lucafedeli88
Copy link
Member Author

ping, @EZoni !

@lucafedeli88 lucafedeli88 requested a review from dpgrote April 8, 2025 13:52
Copy link
Member

@dpgrote dpgrote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thanks!

I will comment that NaNs always make me nervous that there is something bad lurking in the code. Do you know why they appear in that case with single precision?

@lucafedeli88
Copy link
Member Author

This looks good, thanks!

I will comment that NaNs always make me nervous that there is something bad lurking in the code. Do you know why they appear in that case with single precision?

Thanks, @dpgrote . The NaN comes from this line in PoissonSolver.H. beta_solver[2] is 1.00059783 here due to numerical errors, so the square root is negative. We could indeed try to fix that.

geom[lev].CellSize(2)/std::sqrt(1._rt-beta_solver[2]*beta_solver[2]))};

@lucafedeli88
Copy link
Member Author

This looks good, thanks!
I will comment that NaNs always make me nervous that there is something bad lurking in the code. Do you know why they appear in that case with single precision?

Thanks, @dpgrote . The NaN comes from this line in PoissonSolver.H. beta_solver[2] is 1.00059783 here due to numerical errors, so the square root is negative. We could indeed try to fix that.

geom[lev].CellSize(2)/std::sqrt(1._rt-beta_solver[2]*beta_solver[2]))};

Actually, in this case, we have a 250 GeV electron beam. So we can't represent beta reliably in single precision for such high energies. I think that the best we can do is to add an assert to ensure that beta < 1 .

@lucafedeli88
Copy link
Member Author

@dpgrote , I've added this check inside PoissonSolver.H :

    ABLASTR_ALWAYS_ASSERT_WITH_MESSAGE (
        std::all_of(beta_solver.begin(), beta_solver.end(),
            [](const auto b){return (b<1.0_rt);}),
        "Components of beta_solver must be < 1.");

so that at least the user has a more readable error message.

@lucafedeli88 lucafedeli88 enabled auto-merge (squash) April 10, 2025 12:16
auto-merge was automatically disabled April 29, 2025 14:47

Pull request was closed

@lucafedeli88 lucafedeli88 reopened this Apr 29, 2025
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of wondering, instead of running specific input files manually, couldn't we simply run specific tests with ctest?

I would expect this to be equivalent and it is part of what the CTest migration was meant for (I mean, run tests through ctest rather than manually).

For example, replacing the mpirun commands with

ctest --test-dir build --output-on-failure -R "laser_acceleration"
ctest --test-dir build --output-on-failure -R "laser_ion"
...

We could also make the -R filter more complex to filter specific tests and/or only execute the .run tests, if we want to skip analysis, checksums, etc.

What do you think? Would you be okay with trying this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: tests Tests and CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants