self._remote_func_filename is not defined when a SLURM job hits the walltime #49

@Andrew-S-Rosen

Environment

  • Covalent version: 0.209.1
  • Covalent-Slurm plugin version: Custom branch off of develop, created so that I could log in. The code in question should not be affected by this branch.
  • Python version: 3.9
  • Operating system: Linux

What is happening?

I tried submitting a SLURM job and got the following traceback.

Traceback (most recent call last):
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 452, in execute
    result = await self.run(function, args, kwargs, task_metadata)
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 474, in run
    await self._poll_slurm(slurm_job_id, conn)
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 333, in _poll_slurm
    raise RuntimeError("Job failed with status:\n", status)
RuntimeError: ('Job failed with status:\n', '')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_dispatcher/_core/runner.py", line 293, in _run_task
    output, stdout, stderr, exception_raised = await executor._execute(
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 421, in _execute
    return await self.execute(
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 459, in execute
    await self.teardown(task_metadata=task_metadata)
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 505, in teardown
    remote_func_filename=self._remote_func_filename,
AttributeError: 'SlurmExecutor' object has no attribute '_remote_func_filename'

My guess is that self._remote_func_filename is never set because the RuntimeError is raised in run() before the attribute is assigned, so teardown() fails when it tries to read it.
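To illustrate the suspected mechanism, here is a minimal sketch using a toy class rather than the actual plugin code. The names `ToyExecutor`, `_poll`, and `reproduce`, the file name, and the order in which the attribute is assigned are all assumptions made for illustration. If the attribute is only assigned after polling succeeds, a failure during polling means the cleanup step raises an AttributeError that hides the original RuntimeError.

```python
import asyncio


class ToyExecutor:
    """Toy stand-in for an executor; NOT the real SlurmExecutor."""

    async def _poll(self):
        # Simulate a job that died (e.g. hit the walltime).
        raise RuntimeError("Job failed with status:\n", "")

    async def run(self):
        await self._poll()
        # Assumption: the attribute is only set after polling succeeds.
        self._remote_func_filename = "func.pkl"  # hypothetical file name

    async def teardown(self):
        # Raises AttributeError if run() never reached the assignment above.
        return self._remote_func_filename

    async def execute(self):
        try:
            await self.run()
        except RuntimeError:
            # Cleanup is attempted while the failure is being handled,
            # which matches the chained traceback in this report.
            await self.teardown()
            raise


def reproduce():
    """Return the type name of the exception that escapes execute()."""
    try:
        asyncio.run(ToyExecutor().execute())
    except Exception as err:
        return type(err).__name__
    return None
```

Calling `reproduce()` surfaces the AttributeError from the cleanup step instead of the RuntimeError that describes the failed job.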

How can we reproduce the issue?

import covalent as ct
import time

executor = ct.executor.SlurmExecutor(<redacted>)

@ct.lattice
@ct.electron(executor=executor)
def add(val1, val2):
    time.sleep(10000)  # sleep longer than the job's walltime so that SLURM kills it
    return val1 + val2

dispatch_id = ct.dispatch(add)(1, 2)
result = ct.get_result(dispatch_id, wait=True)
print(result)

What should happen?

The covalent task should abort gracefully.

Any suggestions?

I think this error happens any time the job dies unexpectedly (e.g., it hits the walltime or fails for some other reason). In those cases the task does not terminate gracefully.
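One possible direction, sketched with a toy class and not as a patch to the plugin: define the attribute up front and have the cleanup step skip anything that was never created. All names here (`TolerantExecutor`, `cleaned`, `run_and_report`) are hypothetical, and the "TIMEOUT" status is just an example value.

```python
import asyncio


class TolerantExecutor:
    """Toy executor whose teardown tolerates a run() that failed early."""

    def __init__(self):
        # Define the attribute up front so teardown can always read it.
        self._remote_func_filename = None
        self.cleaned = []

    async def run(self):
        # Simulate the job dying before any remote file name is recorded.
        raise RuntimeError("Job failed with status:\n", "TIMEOUT")

    async def teardown(self):
        # Only clean up files that were actually recorded.
        if self._remote_func_filename is not None:
            self.cleaned.append(self._remote_func_filename)

    async def execute(self):
        try:
            return await self.run()
        finally:
            # Cleanup always runs, but can no longer mask the real error.
            await self.teardown()


def run_and_report():
    """Return the type name of the exception seen by the caller."""
    try:
        asyncio.run(TolerantExecutor().execute())
    except Exception as err:
        return type(err).__name__
    return None
```

With this arrangement the caller sees the original RuntimeError describing the job failure, which is closer to aborting gracefully.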

Addendum

It seems that adding the parsable: "" option fixes the empty status in the RuntimeError, but the same AttributeError still arises otherwise.
