@himanshugoel2797

In some setups, it may not be possible to open a port to the internet for the job agent to communicate with the Sirepo server. Forwarding a port over SSH is a useful fallback in this situation. It could also eventually allow users to connect to clusters while running Sirepo locally.

We use Unix Domain Sockets (UDS) for this instead of port-to-port forwarding. This eliminates the possibility of port conflicts on the cluster, and provides better security, as the UDS can only be accessed by the user that owns it.
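As a rough illustration of the idea (not the code in this PR), the forward can be expressed with OpenSSH's `-R remote_socket:host:hostport` form, which makes sshd on the cluster listen on a socket file and relay connections back to a TCP port on the Sirepo side. The function name, host name, socket path, and port below are all hypothetical.

```python
import shlex


def ssh_uds_forward_cmd(ssh_host, remote_socket, local_host, local_port):
    """Build an ssh argv that exposes a local TCP port as a UDS on ssh_host.

    All names and values are illustrative assumptions, not Sirepo's API.
    """
    return [
        "ssh",
        # No remote command: the connection exists only for forwarding.
        "-N",
        # Exit instead of continuing silently if the forward cannot be set up.
        "-o",
        "ExitOnForwardFailure=yes",
        # Remote forward: listen on remote_socket, relay to local_host:local_port.
        "-R",
        f"{remote_socket}:{local_host}:{local_port}",
        ssh_host,
    ]


cmd = ssh_uds_forward_cmd(
    "cluster.example.org",
    "/home/user/sirepo-agent/supervisor.sock",
    "127.0.0.1",
    8001,
)
print(shlex.join(cmd))
```

A stale socket file left on the cluster by an earlier session can prevent the bind, so the real implementation has to handle cleanup of that path.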

Instead of adding another configuration option, we use a heuristic: if the supervisor URI points to localhost while the SLURM host is not localhost, the supervisor URI is likely inaccessible from the cluster, so we should try to set up SSH forwarding.
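The check can be sketched roughly as follows. This is a minimal illustration under assumptions: the function names are hypothetical, and only the literal names `localhost`, `127.0.0.1`, and `::1` are treated as loopback (the actual code may resolve hosts differently).

```python
from urllib.parse import urlsplit

# Hosts treated as "this machine"; an assumption made for this sketch.
_LOOPBACK = frozenset(("localhost", "127.0.0.1", "::1"))


def is_loopback(host):
    """True if host is one of the literal loopback names above."""
    return host is not None and host.lower() in _LOOPBACK


def needs_ssh_forward(supervisor_uri, slurm_host):
    """True when the supervisor is on loopback but the SLURM host is remote."""
    supervisor_host = urlsplit(supervisor_uri).hostname
    return is_loopback(supervisor_host) and not is_loopback(slurm_host)


print(needs_ssh_forward("http://localhost:8001", "cluster.example.org"))
print(needs_ssh_forward("https://sirepo.example.org", "cluster.example.org"))
```

The trade-off of inferring this from existing settings is that the behavior is implicit: nothing in the configuration states that SSH forwarding is in use.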

@himanshugoel2797 marked this pull request as ready for review July 31, 2025
@robnagler
Member

@himanshugoel2797 I created branch 7640-ssh-job-agent, which has big TODOs in its commit. The code has not been tested at all. Here are some bullets:

  • Connection life cycle was not robust
  • Added want_persistent_ssh config to simplify cascade. The ";" solution was too much code and implicit coupling (see Explicit Coupling)
  • Modularized using functions to make code more maintainable and robust
  • Never put files in /tmp. It's unnecessary with the agent, which has its own start directory -- easier cleanup, too.
  • Refactored some other code to make context management clearer/easier
  • Don't create tasks unless necessary (listener close)
  • Follow DesignHints and CodingStyle

I don't have more time to spend on this right now. sim_db_file is a showstopper. To test, create a custom SRW magnet file and try to use it. It will fail when it tries to connect to the supervisor URI. Adding a test for this case would be very useful. Global resources is not as important, but both use the same code (agent_supervisor_api), so both cases are fixed by changing that to use the resolver. The code will need to be shared with job_agent.
