Slurm Job Fails Due to Missing SSL Certificates When Creating Cluster using dask-gateway-server #705

Open
@woestler

I ran into a problem while creating a cluster on an HPC system using Slurm and dask-gateway-server. My understanding of the process is as follows: when dask-gateway-server receives the new_cluster command from the client, it converts it into an sbatch command. I edited dask_gateway_server/backends/jobqueue/slurm.py to print the variables cmd, env, and script in get_submit_cmd_env_stdin. The output is as follows:

cmd


['/usr/bin/sbatch', '--parsable', '--job-name=dask-gateway', '--chdir=/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d', '--output=/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask-scheduler-014af831909a4d8ab6b900b03fc9598d.log', '--cpus-per-task=2', '--mem=4096M', '--export=DASK_DISTRIBUTED__COMM__REQUIRE_ENCRYPTION,DASK_DISTRIBUTED__COMM__TLS__CA_FILE,DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT,DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY,DASK_GATEWAY_API_TOKEN,DASK_GATEWAY_API_URL,DASK_GATEWAY_CLUSTER_NAME']

env


{'DASK_DISTRIBUTED__COMM__REQUIRE_ENCRYPTION': 'True', 'DASK_GATEWAY_API_URL': 'http://local3:8000/api', 'DASK_GATEWAY_API_TOKEN': '3497e6f64a16424eae3b5545f151fb79', 'DASK_GATEWAY_CLUSTER_NAME': '014af831909a4d8ab6b900b03fc9598d', 'DASK_DISTRIBUTED__COMM__TLS__CA_FILE': '/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask.crt', 'DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY': '/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask.pem', 'DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT': '/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask.crt'}

script

#!/bin/sh
source /opt/dask-gateway/anaconda/bin/activate /opt/dask
dask-scheduler --protocol tls --port 0 --host 0.0.0.0 --dashboard-address 0.0.0.0:0 --preload dask_gateway.scheduler_preload --dg-api-address 0.0.0.0:0 --dg-heartbeat-period 15 --dg-adaptive-period 3.0 --dg-idle-timeout 0.0

When Slurm schedules this job on a node other than the edge node, the scheduler looks for the dask.crt and dask.pem files named in the environment variables above, but these files do not exist on that node. The Slurm job then fails with the following error:

2023-05-29 17:09:58,047 - distributed.preloading - INFO - Import preload module: dask_gateway.scheduler_preload
/opt/dask/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py:140: FutureWarning: dask-scheduler is deprecated and will be removed in a future release; use `dask scheduler` instead
  warnings.warn(
2023-05-29 17:09:58,049 - distributed.scheduler - INFO - -----------------------------------------------
2023-05-29 17:09:58,050 - distributed.preloading - INFO - Creating preload: dask_gateway.scheduler_preload
2023-05-29 17:09:58,050 - distributed.preloading - INFO - Import preload module: dask_gateway.scheduler_preload
2023-05-29 17:09:58,050 - distributed.scheduler - INFO - End scheduler
Traceback (most recent call last):
  File "/opt/dask/bin/dask-scheduler", line 8, in <module>
    sys.exit(main())
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/dask/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py", line 249, in main
    asyncio.run(run())
  File "/opt/dask/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/dask/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/dask/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py", line 209, in run
    scheduler = Scheduler(
  File "/opt/dask/lib/python3.10/site-packages/distributed/scheduler.py", line 3464, in __init__
    self.connection_args = self.security.get_connection_args("scheduler")
  File "/opt/dask/lib/python3.10/site-packages/distributed/security.py", line 342, in get_connection_args
    "ssl_context": self._get_tls_context(tls, ssl.Purpose.SERVER_AUTH),
  File "/opt/dask/lib/python3.10/site-packages/distributed/security.py", line 299, in _get_tls_context
    ctx = ssl.create_default_context(purpose=purpose, cafile=ca)
  File "/opt/dask/lib/python3.10/ssl.py", line 766, in create_default_context
    context.load_verify_locations(cafile, capath, cadata)
FileNotFoundError: [Errno 2] No such file or directory
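
To confirm that the failure comes from the certificate files not being visible on the compute node, a small diagnostic along the following lines could be run from the job script just before dask-scheduler starts. This is only a sketch: the function name missing_tls_files and the TLS_PATH_VARS tuple are hypothetical helpers, not part of dask-gateway, and the variable names are copied from the env output above.

```python
import os
import socket

# Environment variable names copied from the env printed above.
# Each one is expected to hold a path to a file on the local filesystem.
TLS_PATH_VARS = (
    "DASK_DISTRIBUTED__COMM__TLS__CA_FILE",
    "DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT",
    "DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY",
)


def missing_tls_files(env=None):
    """Return (variable, path) pairs whose path is unset or not a file on this node.

    Hypothetical helper for debugging; not a dask-gateway API.
    """
    env = os.environ if env is None else env
    missing = []
    for name in TLS_PATH_VARS:
        path = env.get(name)
        if not path or not os.path.isfile(path):
            missing.append((name, path))
    return missing


if __name__ == "__main__":
    host = socket.gethostname()
    problems = missing_tls_files()
    for name, path in problems:
        print(f"{host}: {name} -> {path!r} not found")
    if not problems:
        print(f"{host}: all TLS files are visible")
```

If the check reports missing files on the compute node but not on the node where dask-gateway-server runs, the directory holding dask.crt and dask.pem is probably not on a filesystem shared between the two.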

@jcrist @consideRatio @TomAugspurger @jacobtomlinson @martindurant
