Description
Environment
- Ubuntu 20.04
- conda environment based on Python 3.8
- Azure ML SDK version 1.32.0
- AML workspace and associated resources in the West Europe (westeurope) region
- Azure Free Trial subscription with plenty of credits
Steps
- Followed the configuration notebook successfully to configure access to my AML workspace.
- Followed the train-on-local notebook and submitted the simplest run possible, using a user-managed environment (section 6.A, although the behaviour is similar with system-managed and Docker-based environments).
- The experiment starts successfully and no error is reported. It is visible in the web UI.
- Upon checking, the experiment stays permanently in a "Starting..." status. No outputs or logs are streamed, but the snapshot of the source directory is uploaded correctly.
- When attaching to the experiment using the CLI client in debug mode (az ml job stream --debug etc etc), no errors are reported and the output is as shown below:
urllib3.connectionpool: Starting new HTTPS connection (1): westeurope.experiments.azureml.net:443
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: Resetting dropped connection: westeurope.experiments.azureml.net
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: Resetting dropped connection: westeurope.experiments.azureml.net
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
And it continues like this indefinitely. Every now and then there is a "urllib3.connectionpool: Resetting dropped connection: westeurope.experiments.azureml.net" line in the log. Is this a problem?
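To make the log above easier to reason about, here is a minimal sketch that tallies the HTTP status codes and the "Resetting dropped connection" lines. It assumes the exact line format quoted above; the function name summarize_poll_log and the two regular expressions are illustrative and not part of the Azure ML CLI or urllib3.

```python
import re
from collections import Counter

# Patterns modelled on the urllib3 debug lines quoted in this report (an assumption:
# other urllib3 or CLI versions may format these lines differently).
REQUEST_RE = re.compile(
    r'^urllib3\.connectionpool: https://(?P<host>[^:/\s]+):\d+ '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})\b'
)
RESET_RE = re.compile(
    r'^urllib3\.connectionpool: Resetting dropped connection: (?P<host>\S+)'
)


def summarize_poll_log(lines):
    """Count response status codes and dropped-connection resets in a debug log."""
    statuses = Counter()
    resets = 0
    for raw in lines:
        line = raw.strip()
        match = REQUEST_RE.match(line)
        if match:
            statuses[int(match.group("status"))] += 1
        elif RESET_RE.match(line):
            resets += 1
    return {"statuses": dict(statuses), "resets": resets}
```

Run over the excerpt above, this would show only 200 responses plus two resets, each followed by further successful requests. That suggests the polling itself succeeds and the run simply never leaves its initial state, although it does not explain why.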
Additional information
I wonder if there is any connection setting or firewall permission I am missing. I did not find such information in the docs, and I can easily submit jobs to the remote compute targets. The behaviour when submitting jobs defined via a .yml file to a local compute target using the CLI (az ml job -f job.yml etc etc) is exactly the same.
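Since both submission paths end in the same stuck state, a submission-agnostic check may help put a number on "permanently". The sketch below polls a status callable and gives up after a deadline. The helper wait_until_not_starting and its parameters are hypothetical; get_status is assumed to be any zero-argument function returning the run status as a string (with SDK v1 it could wrap Run.get_status(), which is not shown here).

```python
import time


def wait_until_not_starting(get_status, timeout_s=300.0, interval_s=5.0,
                            clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it stops reporting a 'Starting' state.

    Returns the first status that does not begin with 'Starting', or raises
    TimeoutError once timeout_s seconds have elapsed. clock and sleep are
    injectable so the helper can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if not status.startswith("Starting"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

A run that behaves as described in this report would hit the TimeoutError branch regardless of the chosen timeout, which at least turns the symptom into something reproducible.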
This seems like a very standard workflow (and a great advantage of AML) but it is completely broken for me.
Thanks for any help or pointers in the right direction.