Job submission in the notebook doesn't work and no errors are given. #1969

Open
@WilliamDoman

Question.

I'm trying to learn how to train a vision model using Azure Machine Learning workspace notebooks.

I am trying to create an environment where I can run both the Azure AI ML SDK v2 and PyTorch to train a vision model, with access to data assets from both the notebook and the remote compute.

When I run my environment, I can see that the package versions are all correct.
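A quick way to double-check this from inside a notebook is to print which interpreter the kernel is running and where key modules would be imported from. The following is only a minimal sketch: the helper name `kernel_report` is hypothetical, and the default module list is an assumption based on the packages mentioned in this issue.

```python
import importlib.util
import sys


def kernel_report(modules=("azure.ai.ml", "torch")):
    """Return the interpreter path, Python version, and import location of each module."""
    report = {
        "executable": sys.executable,        # interpreter backing this kernel
        "python": sys.version.split()[0],    # e.g. "3.10.x"
        "modules": {},
    }
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is missing.
            spec = None
        # origin is the file the module would load from; None if not found.
        report["modules"][name] = spec.origin if spec else None
    return report


if __name__ == "__main__":
    print(kernel_report())
```

Running the same cell under both kernels and comparing the two reports may show whether the custom kernel is picking up packages from an unexpected location.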

The problem is that when the notebook uses my environment and kernel, the job is not submitted and no error is shown. If I switch to the built-in "Python 3.10 - SDK V2" kernel, the same code submits the job.

from azure.ai.ml import Input, Output, command
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# ml_client, dataset, environment and compute_cluster_name are defined in earlier cells.

# Define the command job
job = command(
    code="./",  # Folder containing the training script
    command="python trainV2.py",  # Adjust to your script name
    inputs={
        "train_data": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}train_val_list_v2.txt"),
        "test_data": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}test_list_v2.txt"),
        "labels": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}Data_Entry_2017.csv"),
        "images": Input(type=AssetTypes.URI_FOLDER, path=f"{dataset.path}images"),
    },
    outputs={
        "outputFolder": Output(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RW_MOUNT),
    },
    environment=environment,
    compute=compute_cluster_name,
    instance_count=1,
    display_name="exp",
    experiment_name="exp",
)

# Submit the job
results = ml_client.jobs.create_or_update(job)

This is the output I get with my environment:

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Warning: the provided asset name 'ENV-Torch2_2-Cuda12_1_SDK2' will not be used for anonymous registration
Warning: the provided asset name 'ENV-Torch2_2-Cuda12_1_SDK2' will not be used for anonymous registration

But if I run the same code with the default "Python 3.10 - SDK V2" kernel, I get the same output plus one additional line:

Uploading Exp (0.11 MBs): 100%|██████████| 107858/107858 [00:00<00:00, 970196.92it/s]
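Since the only visible difference is the missing upload line, it may help to make the submission call noisier. The sketch below is an assumption-laden diagnostic, not a fix: `submit_with_diagnostics` is a hypothetical helper, the `"azure"` logger name follows the general Azure SDK for Python logging convention, and `name`, `status` and `studio_url` are read defensively with `getattr` in case the returned object lacks them.

```python
import logging
import traceback


def submit_with_diagnostics(submit_fn, job, logger_name="azure"):
    """Call submit_fn(job) with verbose logging and print a summary of the result."""
    # No-op if the notebook already configured root logging handlers.
    logging.basicConfig(level=logging.INFO)
    # Ask the SDK's logger hierarchy for debug-level detail.
    logging.getLogger(logger_name).setLevel(logging.DEBUG)
    try:
        result = submit_fn(job)
    except Exception:
        # Print the full traceback before re-raising, in case the
        # notebook front end hides or truncates it.
        traceback.print_exc()
        raise
    # If the job was really created, these fields are normally populated.
    summary = {k: getattr(result, k, None) for k in ("name", "status", "studio_url")}
    print(summary)
    return result
```

In the notebook this would be called as `results = submit_with_diagnostics(ml_client.jobs.create_or_update, job)`. If the printed summary has no `name` or `status` under the custom kernel, the call most likely never reached the service.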

My environment is built from a standard image, with additional packages installed from requirements.txt. I've done hundreds of versions of this, but this is basically the latest rendition of the Dockerfile:

FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch22x:biweekly.202408.2

# Install pip dependencies
COPY requirements.txt .

#RUN pip install scikit-build==0.16.7 --no-cache-dir
RUN pip install -r requirements.txt --no-cache-dir

# Inference requirements
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=500
EXPOSE 5001 8883 8888

# support Deepspeed launcher requirement of passwordless ssh login
RUN apt-get update
RUN apt-get install -y openssh-server openssh-client

The requirements.txt contains:

# Azure ML SDK v2 packages
azure-ai-ml==1.16.1
azure-core==1.30.2
azure-identity==1.17.1
azure-storage-blob==12.22.0
azure-storage-file-datalake==12.16.0

# PyTorch and related packages
torch==2.2.2  # Match the internal version if necessary
torch-nebula==0.16.13  # If needed, otherwise omit
torch-ort==1.17.0  # If needed, otherwise omit
torchaudio==2.2.2+cu121
torchdata==0.7.1
torchmetrics==1.2.0
torch-tb-profiler==0.4.3
torchvision==0.17.2+cu121

# Core scientific packages
numpy>=1.23.0,<2.0    # ==1.23.0
pandas==1.5.0
#scikit-image>=0.21.0
#SimpleITK==2.1.0
matplotlib==3.5.0
pydicom==2.3.0
pybind11==2.13.4
regex==2024.7.24

# Data handling and serialization
pyarrow==14.0.2  # Match the version in the successful environment
fsspec  # Match the successful environment's version ==2024.10.0

# Additional dependencies
albumentations==1.4.14  # As per your original list
mltable==1.6.1
tqdm==4.66.5
urllib3==2.2.2
cryptography==43.0.0
aiohttp==3.10.1
py-spy==0.3.12
debugpy==1.6.7.post1
ipykernel==6.29.5
tensorboard==2.17.1
psutil==5.8.0
Pillow==10.4.0
plotly==5.23.0
dcmstack==0.9.0
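To check whether the kernel actually ended up with the pinned versions above, the exact pins can be compared against what is installed. This is a rough sketch under stated assumptions: `parse_pins` and `pin_mismatches` are hypothetical helpers, only `==` pins are considered, and versions are compared as plain strings, so an installed `2.2.2+cu121` would be reported against a pin of `2.2.2`.

```python
from importlib import metadata


def parse_pins(text):
    """Extract {package: version} for lines pinned with '==', ignoring comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop inline and full-line comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def pin_mismatches(pins):
    """Return {package: (pinned, installed)} where the installed version differs."""
    out = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this interpreter
        if found != wanted:
            out[name] = (wanted, found)
    return out
```

For example, `pin_mismatches(parse_pins(open("requirements.txt").read()))` run under the custom kernel lists every pin that was not honoured, which can then be compared with the same call under the built-in kernel.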
