Example for training on local compute target does not work - run stuck on "Starting" #1554

Open
@ishouldbedany

Description

Environment

  • Ubuntu 20.04
  • conda environment based on Python 3.8
  • Azure ML SDK version 1.32.0
  • AML workspace and associated resources in the West Europe region
  • Azure Free Trial subscription with plenty of credits

Steps

  1. Followed the configuration notebook successfully to configure access to my AML workspace.
  2. Followed the train-on-local notebook and submitted the simplest run possible, using a user-managed environment (section 6.A; the behaviour is similar with system-managed and Docker-based environments).
  3. The experiment starts successfully and no error is reported. The experiment is visible in the web UI.
  4. Upon checking, the experiment stays permanently in a "Starting..." status. No outputs or logs are streamed, but the snapshot of the source directory is uploaded correctly.

  5. When attaching to the experiment using the CLI client in debug mode (az ml job stream --debug etc.), no errors are reported and the output is as shown below:
urllib3.connectionpool: Starting new HTTPS connection (1): westeurope.experiments.azureml.net:443
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: Resetting dropped connection: westeurope.experiments.azureml.net
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: Resetting dropped connection: westeurope.experiments.azureml.net
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None

And it continues like this indefinitely. Every now and then there is a "urllib3.connectionpool: Resetting dropped connection: westeurope.experiments.azureml.net" line in the output. Is this a problem?
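To put numbers on the output above, here is a minimal sketch that counts the run-detail polls by HTTP status and the "Resetting dropped connection" lines in a saved copy of the debug output. The function name `summarize_debug_log` and the two regular expressions are my own, written against the exact line shapes quoted in this issue, so they are assumptions rather than anything provided by the CLI or by urllib3. As far as I understand, urllib3 logs that message when it reopens a pooled keep-alive connection that was closed in the meantime, so a few of them between many 200 responses would usually be harmless, but I have not confirmed that for this case.

```python
import re
from collections import Counter

# Both patterns are assumptions based on the urllib3 debug lines quoted in this issue.
POLL_RE = re.compile(r'"GET \S+/runs/[^/\s]+/details HTTP/1\.1" (?P<status>\d{3})')
RESET_RE = re.compile(r"Resetting dropped connection: \S+")


def summarize_debug_log(lines):
    """Count run-detail polls per HTTP status and dropped-connection resets."""
    polls_by_status = Counter()
    resets = 0
    for line in lines:
        match = POLL_RE.search(line)
        if match:
            polls_by_status[match.group("status")] += 1
        elif RESET_RE.search(line):
            resets += 1
    return {"polls_by_status": dict(polls_by_status), "resets": resets}
```

It can be fed any iterable of lines, for example an open file handle on a saved copy of the stream output. A result dominated by status 200 with only a handful of resets would suggest that the polling itself works and that the run simply never leaves the "Starting" state.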

Additional information

I wonder if there is any connection setting or firewall permission I am missing. I did not find such information in the docs, and I can submit jobs to remote compute targets without problems. The behaviour when submitting jobs defined via a .yml file to a local compute target using the CLI (az ml job -f job.yml etc.) is exactly the same.
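To rule out the firewall theory for a given endpoint, a plain TCP reachability check is a cheap first test. The sketch below uses only the Python standard library; the helper name `can_connect` is hypothetical, and the only host name I can vouch for is the one that appears in the debug output above. A successful connection only shows that the port is reachable from this machine. It says nothing about whether the local run itself can start.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_connect("westeurope.experiments.azureml.net")` checks the host from the log. Since the GET requests to that host already return 200, the check is more useful for any other endpoint that local runs might depend on.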

This seems like a very standard workflow (and a great advantage of AML) but it is completely broken for me.

Thanks for any help or pointers in the right direction.

Metadata

Assignees

No one assigned

    Labels

    ADO (Issue is documented on MSFT ADO for internal tracking)
    Training
    bug (Something isn't working)
