Node discovery currently relies on one of two workarounds: either the nodes coordinate through a distributed filesystem, or they wait until the jobs are running and then query the K8s API to find out which node is running a particular head node script.
A supported way to launch a job set and label its head and worker nodes, so that they can be easily discovered from within the jobs themselves, would be very useful for supporting distributed ML on Armada.
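
To illustrate what label-based discovery could look like from inside a job, here is a minimal sketch. It assumes pods carry two labels, `example.io/job-set` and `example.io/role`; both names are hypothetical and not something Armada defines today. The input is a list of pod objects in the shape of the K8s API's JSON (`metadata.labels`, `status.phase`, `status.podIP`); how that list is fetched is left out.

```python
from typing import Any, Dict, List, Optional

# Hypothetical label keys, used for illustration only; Armada does not define these.
JOB_SET_LABEL = "example.io/job-set"
ROLE_LABEL = "example.io/role"


def find_head_ip(pods: List[Dict[str, Any]], job_set: str) -> Optional[str]:
    """Return the IP of the first running head pod in the given job set, or None."""
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels") or {}
        status = pod.get("status", {})
        if (
            labels.get(JOB_SET_LABEL) == job_set
            and labels.get(ROLE_LABEL) == "head"
            and status.get("phase") == "Running"
        ):
            return status.get("podIP")
    return None
```

With something like this, a worker would only need to know its own job set ID and could retry until a head pod shows up as running, rather than searching for the node that runs a particular script.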