
Setting a non-existent task while reloading causes crash #6814

@dwsutherland

Description

Version: 8.4.0

2025-06-24T23:32:55Z INFO - LOADING job data
2025-06-24T23:33:01Z INFO - Reload completed.
2025-06-24T23:33:02Z INFO - RESUMING the workflow now
2025-06-24T23:33:02Z INFO - Command "reload_workflow" actioned. ID=6f53dee1-09da-4e5b-b9c0-d652b7ed35f3
2025-06-24T23:33:02Z INFO - [20250624T1200Z/bop_ocean_nzcsm_sftp_upload/01:failed] => succeeded
2025-06-24T23:33:03Z INFO - Command "set" actioned. ID=f3e6942b-bb5c-49cb-8c88-f1429893a85f
2025-06-24T23:33:03Z INFO - Broadcast set:
    + [20250624T2250Z/himawari_sftp_upload] [environment]poll_himawari_netcdf_1000_workflow=himawari/run1
    + [20250624T2250Z/himawari_sftp_upload] [environment]poll_himawari_netcdf_1000_task=res_1000
    + [20250624T2250Z/himawari_sftp_upload] [environment]poll_himawari_netcdf_1000_point=20250624T2250Z
    + [20250624T2250Z/himawari_sftp_upload] [environment]poll_himawari_netcdf_1000_cylc_run_dir=/niwa/oper/ecox_oper/cylc-run
    + [20250624T2250Z/himawari_sftp_upload] [environment]poll_himawari_netcdf_1000_status=succeeded
2025-06-24T23:33:03Z CRITICAL - An uncaught error caused Cylc to shut down.
    If you think this was an issue in Cylc, please report the following traceback to the developers.
    https://github.com/cylc/cylc-flow/issues/new?assignees=&labels=bug&template=bug.md&title=;
2025-06-24T23:33:03Z ERROR - 'bop_ocean_nzcsm_sftp_upload'
    Traceback (most recent call last):
      File "/opt/niwa/share/cylc/8.4.0/lib/python3.9/site-packages/cylc/flow/scheduler_cli.py", line 695, in cylc_play
        asyncio.get_running_loop()
    RuntimeError: no running event loop
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/opt/niwa/share/cylc/8.4.0/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 707, in run_scheduler
        await self._main_loop()
      File "/opt/niwa/share/cylc/8.4.0/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 1773, in _main_loop
        self.xtrigger_mgr.call_xtriggers_async(itask)
      File "/opt/niwa/share/cylc/8.4.0/lib/python3.9/site-packages/cylc/flow/xtrigger_mgr.py", line 696, in call_xtriggers_async
        self.broadcast_mgr.put_broadcast(
      File "/opt/niwa/share/cylc/8.4.0/lib/python3.9/site-packages/cylc/flow/broadcast_mgr.py", line 332, in put_broadcast
        self.data_store_mgr.delta_broadcast()
      File "/opt/niwa/share/cylc/8.4.0/lib/python3.9/site-packages/cylc/flow/data_store_mgr.py", line 2233, in delta_broadcast
        self._generate_broadcast_node_deltas(
      File "/opt/niwa/share/cylc/8.4.0/lib/python3.9/site-packages/cylc/flow/data_store_mgr.py", line 2261, in _generate_broadcast_node_deltas
        cfg['runtime'][node.name]
      File "/opt/niwa/share/cylc/8.4.0/lib/python3.9/site-packages/cylc/flow/parsec/OrderedDict.py", line 38, in __getitem__
        return OrderedDict.__getitem__(self, key)
    KeyError: 'bop_ocean_nzcsm_sftp_upload'
2025-06-24T23:33:03Z CRITICAL - Workflow shutting down - 'bop_ocean_nzcsm_sftp_upload'
2025-06-24T23:33:03Z WARNING - Orphaned tasks:
    * 20250624T1200Z/ecmwf_sftp_ifs_upload (running)
    * 20250624T1200Z/geotiff_gis_nzcsm_very_low_cloud (running)
    * 20250624T1800Z/bop_nzlam_sftp_upload (running)
    * 20250624T2230Z/gsmapnow_netcdf (submitted)
2025-06-24T23:33:03Z WARNING - 20250624T1200Z/bop_ocean_nzcsm_sftp_upload/01: incomplete task event handler ('event-handler-00', 'succeeded')
2025-06-24T23:33:04Z INFO - DONE

The cause appears to be xtrigger satisfaction of a task that was removed by the reload, relating to this section of the xtrigger manager:

            # General case: potentially slow asynchronous function call.
            if sig in self.sat_xtrig:
                # Already satisfied, just update the task
                if not itask.state.xtriggers[label]:
                    itask.state.xtriggers[label] = True
                    res = {}
                    for key, val in self.sat_xtrig[sig].items():
                        res["%s_%s" % (label, key)] = val
                    if res:
                        xtrigger_env = [{'environment': {key: str(val)}} for
                                        key, val in res.items()]
                        self.broadcast_mgr.put_broadcast(
                            [str(itask.point)],
                            [itask.tdef.name],
                            xtrigger_env
                        )
                    if self.all_task_seq_xtriggers_satisfied(itask):
                        self.sequential_spawn_next.add(itask.identity)
                continue

(put_broadcast generates a delta that looks up the removed task's name in cfg['runtime'], which no longer contains it, hence the KeyError.)

This is pretty hard to reproduce.
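The failure mode above can be illustrated with a minimal sketch. This is not the real Cylc API: the function names and the shape of the runtime dict here are hypothetical stand-ins for data_store_mgr._generate_broadcast_node_deltas, which performs an unguarded dict lookup on the task name.

```python
# Hypothetical sketch of the crash: a broadcast delta does an unguarded
# lookup of a task name in the runtime config, but a reload has removed
# that task from the config, so the lookup raises KeyError.

def generate_broadcast_node_delta(runtime: dict, task_name: str) -> dict:
    # Unguarded lookup, as in the traceback above: crashes if the task
    # was removed from the workflow definition by the reload.
    return runtime[task_name]

def generate_broadcast_node_delta_guarded(runtime: dict, task_name: str):
    # One possible defensive shape (an assumption, not the actual fix):
    # skip delta generation for tasks no longer present in the config.
    return runtime.get(task_name)

# Post-reload config no longer contains the task being broadcast to:
runtime = {"himawari_sftp_upload": {"environment": {}}}

try:
    generate_broadcast_node_delta(runtime, "bop_ocean_nzcsm_sftp_upload")
except KeyError as exc:
    print(f"crash: KeyError {exc}")

# The guarded variant returns None instead of crashing:
result = generate_broadcast_node_delta_guarded(
    runtime, "bop_ocean_nzcsm_sftp_upload"
)
print(f"guarded: {result}")
```

The guarded variant is only meant to show why a membership check (or .get) at the lookup site avoids the shutdown; whether the actual fix skips the delta, the broadcast, or the xtrigger update is a design choice for the PR.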

Reproducible Example

The test added in the fix replicates this bug.

Expected Behaviour

No crash: xtrigger satisfaction of a task removed by reload should be handled gracefully rather than shutting the scheduler down.

Metadata

Labels

bug (Something is wrong :()
