Building the fortran model with call_py_fort in debug mode leads to crashes #365

@spencerkclark

I was exploring the possibility of addressing #340 (now that we are removing the serialize tests in #364, we might as well explore eliminating testing in docker entirely). This requires running the tests in debug mode in the nix environment. In doing so, I found that the basic native regression tests crash due to call_py_fort-related code:

call set_state("rank", rank)

A workaround would be to build the model without call_py_fort when running the tests in debug mode, but ideally these tests would not crash in debug mode even when the model is built with call_py_fort active.

A basic way to reproduce this is to copy the configure.fv3.nix file into a new file within FV3/conf, set DEBUG=Y and REPRO= within it, configure/build the model, and run the tests:

$ cp FV3/conf/configure.fv3.nix FV3/conf/configure.fv3.nix_debug
$ # edit FV3/conf/configure.fv3.nix_debug: set DEBUG=Y and REPRO=
$ cd FV3
$ configure nix_debug
$ cd ..
$ make build_native
$ pytest -vv -k default --native tests/pytest/test_regression.py
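The manual edit step in the reproduction above could be scripted. The following is a minimal sketch, not part of the repository: the helper name `make_debug_config` is hypothetical, and it assumes the configure file uses makefile-style `NAME = value` assignments for `DEBUG` and `REPRO`, which I have not verified for every configure file in FV3/conf.

```python
import re

# Hypothetical helper: rewrite makefile-style config text so that
# DEBUG=Y and REPRO= (empty), as described in the reproduction steps.
# Assumption: the settings appear as plain "NAME = value" assignments.
OVERRIDES = {"DEBUG": "Y", "REPRO": ""}
ASSIGNMENT = re.compile(r"^\s*(DEBUG|REPRO)\s*[:?]?=")


def make_debug_config(text: str) -> str:
    """Return config text with DEBUG and REPRO overridden."""
    seen = set()
    out = []
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            # Replace the existing assignment with the debug value.
            name = match.group(1)
            out.append(f"{name}={OVERRIDES[name]}")
            seen.add(name)
        else:
            # Leave every other line (including ifeq blocks) untouched.
            out.append(line)
    # Append any setting that was not present in the original text.
    for name, value in OVERRIDES.items():
        if name not in seen:
            out.append(f"{name}={value}")
    return "\n".join(out) + "\n"
```

To use it, read FV3/conf/configure.fv3.nix with `pathlib.Path.read_text()`, pass the text through `make_debug_config`, and write the result to FV3/conf/configure.fv3.nix_debug before running `configure nix_debug`.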

The traceback for one of the failing tests can be found below:

===================================================================== FAILURES ======================================================================
_____________________________________________________ test_regression_native[Linux-default.yml] _____________________________________________________

run_native = <function run_native.<locals>.run_native at 0x7f080f93c3a0>, config_filename = 'default.yml'
tmpdir = local('/tmp/pytest-of-spencerc/pytest-0/test_regression_native_Linux_d0')
system_regtest = <pytest_regtest.RegTestFixture object at 0x7f07e563ec70>

    @pytest.mark.parametrize(
        "config_filename",
        [
            pytest.param("default.yml", marks=pytest.mark.basic),
            pytest.param("model-level-coarse-graining.yml", marks=pytest.mark.coarse),
            pytest.param("pressure-level-coarse-graining.yml", marks=pytest.mark.coarse),
            "baroclinic.yml",
            "restart.yml",
        ],
    )
    def test_regression_native(run_native, config_filename: str, tmpdir, system_regtest):
        config = get_config(config_filename)
        rundir = tmpdir.join("rundir")
>       run_native(config, str(rundir))

test_regression.py:123:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

config = {'data_table': 'default', 'diag_table': default
2000 1 1 0 0 0

"atmos_static", -1, "hours", 1, "hours", "time"
"atmos... "all", "none", "none", 2
, 'experiment_name': 'default', 'forcing': 'gs://vcm-fv3config/data/base_forcing/v1.1/', ...}
run_dir = '/tmp/pytest-of-spencerc/pytest-0/test_regression_native_Linux_d0/rundir', error_expected = False

    def run_native(config, run_dir: str, error_expected=False):
        fv3config.write_run_directory(config, run_dir)
        completed_process = subprocess.run(
            ["mpirun", "-n", "6", exe.absolute().as_posix()],
            cwd=run_dir,
            capture_output=True,
        )
        if completed_process.returncode != 0 and not error_expected:
            print("Tail of Stderr:")
            print(completed_process.stderr[-2000:].decode())
            print("Tail of Stdout:")
            print(completed_process.stdout[-2000:].decode())
>           pytest.fail()
E           Failed

conftest.py:77: Failed
--------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------
Tail of Stderr:
       0  shoc_cld= F  uni_cld= F  ntot3d=           1  ntot2d=           1  shocaftcnv= F  indcld=          -1  shoc_parm=   7000.0000000000000        1.0000000000000000        4.2857143000000004       0.69999999999999996       -999.00000000000000       ncnvw=        -999  ncnvc=        -999
  resetting Model%frac_grid= F

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f6c45875b90 in ???
#1  0x7f6c45874dc5 in ???
#2  0x7f6c3b54b39f in ???
#3  0x7f6c126490cf in ???
#4  0x7f6c3b20d26a in ???
#5  0x7f6c3b1fdb38 in ???
#6  0x7f6c3b1f7ab2 in ???
#7  0x7f6c3b1f8dac in ???
#8  0x7f6c3b1ffe03 in ???
#9  0x7f6c3b3060be in ???
#10  0x7f6c3b30645d in ???
#11  0x7f6c3b30648a in ???
#12  0x7f6c3b302cc8 in ???
#13  0x7f6c3b26c372 in ???
#14  0x7f6c3b22819e in ???
#15  0x7f6c3b200d67 in ???
#16  0x7f6c3b3060be in ???
#17  0x7f6c3b2260e1 in ???
#18  0x7f6c3b1f8dac in ???
#19  0x7f6c3b1ffe03 in ???
#20  0x7f6c3b1f7ab2 in ???
#21  0x7f6c3b1f8dac in ???
#22  0x7f6c3b1fce8b in ???
#23  0x7f6c3b1f7ab2 in ???
#24  0x7f6c3b1f8dac in ???
#25  0x7f6c3b1fc326 in ???
#26  0x7f6c3b1f7ab2 in ???
#27  0x7f6c3b1f8dac in ???
#28  0x7f6c3b1fc326 in ???
#29  0x7f6c3b1f7ab2 in ???
#30  0x7f6c3b2266d4 in ???
#31  0x7f6c3b226a4b in ???
#32  0x7f6c3b32c20e in ???
#33  0x7f6c3b1ff9d5 in ???
#34  0x7f6c3b3060be in ???
#35  0x7f6c3b30645d in ???
#36  0x7f6c3b30648a in ???
#37  0x7f6c45aebd84 in ???
#38  0x7f6c45aec06f in ???
#39  0x7f6c45aeba8d in ???
#40  0x7f6c45aeb466 in ???
#41  0x43c099 in __atmos_model_mod_MOD_update_atmos_physics
	at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/atmos_model.F90:463
#42  0x4431b6 in __atmos_model_mod_MOD_update_atmos_radiation_physics
	at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/atmos_model.F90:280
#43  0x476877 in coupler_main
	at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/coupler_main.F90:192
#44  0x47964c in main
	at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/coupler_main.F90:35

Tail of Stdout:
de=           0 (0=count only, 1=replace)
 performing qc of albm     mode=           0 (0=count only, 1=replace)
 performing qc of zorm     mode=           0 (0=count only, 1=replace)
 performing qc of stc1m    mode=           0 (0=count only, 1=replace)
 performing qc of stc2m    mode=           0 (0=count only, 1=replace)
 performing qc of stc3m    mode=           0 (0=count only, 1=replace)
 performing qc of stc4m    mode=           0 (0=count only, 1=replace)
 performing qc of smc1m    mode=           0 (0=count only, 1=replace)
 performing qc of smc2m    mode=           0 (0=count only, 1=replace)
 performing qc of smc3m    mode=           0 (0=count only, 1=replace)
 performing qc of smc4m    mode=           0 (0=count only, 1=replace)
 performing qc of vegm     mode=           1 (0=count only, 1=replace)
 performing qc of vetm     mode=           1 (0=count only, 1=replace)
 performing qc of sotm     mode=           1 (0=count only, 1=replace)
 performing qc of sihm     mode=           1 (0=count only, 1=replace)
 performing qc of sicm     mode=           1 (0=count only, 1=replace)
 performing qc of vmnm     mode=           1 (0=count only, 1=replace)
 performing qc of vmxm     mode=           1 (0=count only, 1=replace)
 performing qc of slpm     mode=           1 (0=count only, 1=replace)
 performing qc of absm     mode=           1 (0=count only, 1=replace)
 ==============
 final results
 ==============
 dbgx --fixratio: F F F F

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 5415 RUNNING AT spencer-vm
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
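In the backtrace above, frames #0 through #40 are unresolved (`???`), and the first frame with symbol information is `update_atmos_physics` at atmos_model.F90:463. When comparing several failing tests, it may help to pull out only the resolved frames. The sketch below is a hypothetical helper (`resolved_frames` is not part of the repository) that assumes the gfortran backtrace layout shown in this issue: a `#N  0xADDR in SYMBOL` line, optionally followed by an indented `at FILE:LINE` line.

```python
import re

# Patterns for the gfortran-style backtrace layout seen in this issue.
FRAME = re.compile(r"^#(\d+)\s+(0x[0-9a-f]+) in (\S+)\s*$")
LOCATION = re.compile(r"^\s+at (.+):(\d+)\s*$")


def resolved_frames(backtrace: str):
    """Return frames that carry a symbol name, with file/line when given."""
    frames = []
    current = None
    for line in backtrace.splitlines():
        frame = FRAME.match(line)
        if frame:
            current = None
            if frame.group(3) != "???":
                current = {
                    "index": int(frame.group(1)),
                    "symbol": frame.group(3),
                    "file": None,
                    "line": None,
                }
                frames.append(current)
            continue
        location = LOCATION.match(line)
        if location and current is not None:
            # Attach the source location to the most recent resolved frame.
            current["file"] = location.group(1)
            current["line"] = int(location.group(2))
    return frames
```

Passing the captured stderr text to `resolved_frames` returns one dictionary per resolved frame, in the order they appear, so the first entry is the innermost frame that has source information.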
