ModelTrainer in Mode.LOCAL_CONTAINER throws an error when cleaning up folders upon exiting the container #5542

@pdifranc

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.4)

Describe the bug
When using ModelTrainer with Mode.LOCAL_CONTAINER, ModelTrainer throws an error after the training script completes, while attempting to clean up the folders it created locally.

Same behavior regardless of which container you run.

The problem is that one of the created folders (algo-1/input/data in the listing below) is owned by the root user rather than by the regular user on the host machine.

sagemaker-user@default:~/amazon-sagemaker-immersion-day/mlops/01-tracking$ ls -lah algo-1/input/
total 0
drwxr-xr-x. 4 sagemaker-user users 32 Feb  6 12:25 .
drwxr-xr-x. 3 sagemaker-user users 19 Feb  6 12:25 ..
drwxr-xr-x. 2 sagemaker-user users 89 Feb  5 16:44 config
drwxr-xr-x. 5 root           root  49 Feb  5 16:45 data
sagemaker-user@default:~/amazon-sagemaker-immersion-day/mlops/01-tracking$
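
The same check can be scripted. Below is a minimal sketch, assuming a POSIX host; `find_foreign_owned` is a hypothetical helper name, not part of the SDK. It walks a directory and returns every entry whose owner differs from the current user, which is what the `data` folder above would trigger.

```python
import os


def find_foreign_owned(root):
    """Return paths under `root` (including `root`) not owned by the current user.

    Hypothetical helper for diagnosing the issue; POSIX only (uses os.getuid).
    """
    uid = os.getuid()
    foreign = []
    if os.lstat(root).st_uid != uid:
        foreign.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            # lstat so that symlinks are checked themselves, not their targets
            if os.lstat(full).st_uid != uid:
                foreign.append(full)
    return foreign
```

Calling `find_foreign_owned("algo-1")` from the working directory after a failed run should list the folders that the cleanup step cannot delete.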

To reproduce
Run any ModelTrainer with Mode.LOCAL_CONTAINER:

from sagemaker.train.model_trainer import ModelTrainer, Mode
from sagemaker.core.training.configs import SourceCode, Compute, InputData
from sagemaker.core import image_uris

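# region, role, mlflow_arn, user_profile_name, mlflow_experiment_name and
# train_input must be defined beforehand (not shown here).
sklearn_image = image_uris.retrieve(framework="sklearn", region=region, version="1.4-2")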

sklearn_local = ModelTrainer(
    training_mode=Mode.LOCAL_CONTAINER,
    training_image=sklearn_image,
    source_code=SourceCode(
        source_dir="training_code",
        entry_script="train.py",
        requirements="requirements.txt"
    ),
    compute=Compute(instance_type="ml.c5.xlarge", instance_count=1),
    hyperparameters={"max_leaf_nodes": "30"},
    role=role,
    environment={
        "MLFLOW_TRACKING_URI": mlflow_arn,
        "MODE": "local-mode",
        "LOGNAME": user_profile_name,
        "MLFLOW_EXPERIMENT_NAME": mlflow_experiment_name
    },
)

sklearn_local.train(
    input_data_config=[InputData(channel_name="train", data_source=train_input)],
    wait=True
)

Expected behavior
At the end of training, all temporary folders are removed and no error is thrown.
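
To illustrate the kind of behavior I would expect, here is a minimal sketch of a cleanup step that tolerates permission errors instead of raising. This is not the SDK's implementation; `remove_tree_tolerant` is a hypothetical name, and whether the SDK should warn, retry, or fix ownership is a design decision for the maintainers.

```python
import logging
import os
import shutil

logger = logging.getLogger(__name__)


def remove_tree_tolerant(path):
    """Try to delete `path`; return True on success, False on a permission error.

    Hypothetical helper: a root-owned subfolder produces a warning
    instead of aborting the whole training call.
    """
    if not os.path.lexists(path):
        return True  # nothing to clean up
    try:
        shutil.rmtree(path)
    except PermissionError as err:
        logger.warning("Could not remove %s: %s", err.filename or path, err)
        return False
    return True
```

With something like this in place of the bare `shutil.rmtree` calls, the training call would still return the artifacts even when a folder is left behind.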

Screenshots or logs

algo-1-1  | 
algo-1-1  | [notice] A new release of pip is available: 25.3 -> 26.0.1
algo-1-1  | [notice] To update, run: python3 -m pip install --upgrade pip
algo-1-1  | Running Basic Script driver
algo-1-1  | ++ echo 'Running Basic Script driver'
algo-1-1  | ++ /usr/bin/python3 /opt/ml/input/data/sm_drivers/distributed_drivers/basic_script_driver.py
algo-1-1  | Executing command: /usr/bin/python3 train.py --max_leaf_nodes 30
algo-1-1  | 🏃 View run Local-Training at: https://mlflow.sagemaker.us-west-2.app.aws/#/experiments/1/runs/2785c5fc936c453a8e575d422ae94548
algo-1-1  | 🧪 View experiment at: https://mlflow.sagemaker.us-west-2.app.aws/#/experiments/1
algo-1-1  | ++ echo 'Training Container Execution Completed'
algo-1-1  | Training Container Execution Completed
algo-1-1 exited with code 0
 Compose Stopping Aborting on container exit...
 Container 01-tracking-algo-1-1 Stopping 
 Container 01-tracking-algo-1-1 Stopped 

[02/06/26 12:25:21] INFO     Local training job completed, output artifacts saved to         local_container.py:218
                             file:///home/sagemaker-user/amazon-sagemaker-immersion-day/mlops/01-tracking/compressed_artifacts/model.tar.gz


in <module>:26                                                                                   │
│                                                                                                  │
│   23 │   },                                                                                      │
│   24 )                                                                                           │
│   25                                                                                             │
│ ❱ 26 sklearn_local.train(                                                                        │
│   27 │   input_data_config=[InputData(channel_name="train", data_source=train_input)],           │
│   28 │   wait=True                                                                               │
│   29 )                                                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.12/site-packages/sagemaker/core/telemetry/telemetry_logging.py:176 in     │
│ wrapper                                                                                          │
│                                                                                                  │
│   173 │   │   │   │   │   "sagemaker_session is not provided or not valid.",                     │
│   174 │   │   │   │   │   func_name,                                                             │
│   175 │   │   │   │   )                                                                          │
│ ❱ 176 │   │   │   │   return func(*args, **kwargs)                                               │
│   177 │   │                                                                                      │
│   178 │   │   return wrapper                                                                     │
│   179                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.12/site-packages/sagemaker/core/workflow/pipeline_context.py:346 in       │
│ wrapper                                                                                          │
│                                                                                                  │
│   343 │   │   │                                                                                  │
│   344 │   │   │   return _StepArguments(retrieve_caller_name(self_instance), run_func, *args,    │
│   345 │   │                                                                                      │
│ ❱ 346 │   │   return run_func(*args, **kwargs)                                                   │
│   347 │                                                                                          │
│   348 │   return wrapper                                                                         │
│   349                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.12/site-packages/pydantic/_internal/_validate_call.py:39 in               │
│ wrapper_function                                                                                 │
│                                                                                                  │
│    36 │   │                                                                                      │
│    37 │   │   @functools.wraps(wrapped)                                                          │
│    38 │   │   def wrapper_function(*args, **kwargs):                                             │
│ ❱  39 │   │   │   return wrapper(*args, **kwargs)                                                │
│    40 │                                                                                          │
│    41 │   # We need to manually update this because `partial` object has no `__name__` and `__   │
│    42 │   wrapper_function.__name__ = extract_function_name(wrapped)                             │
│                                                                                                  │
│ /opt/conda/lib/python3.12/site-packages/pydantic/_internal/_validate_call.py:136 in __call__     │
│                                                                                                  │
│   133 │   │   if not self.__pydantic_complete__:                                                 │
│   134 │   │   │   self._create_validators()                                                      │
│   135 │   │                                                                                      │
│ ❱ 136 │   │   res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args,   │
│   137 │   │   if self.__return_pydantic_validator__:                                             │
│   138 │   │   │   return self.__return_pydantic_validator__(res)                                 │
│   139 │   │   else:                                                                              │
│                                                                                                  │
│ /opt/conda/lib/python3.12/site-packages/sagemaker/train/model_trainer.py:808 in train            │
│                                                                                                  │
│    805 │   │   │   │   hyper_parameters=training_request["hyper_parameters"],                    │
│    806 │   │   │   │   environment=training_request["environment"],                              │
│    807 │   │   │   )                                                                             │
│ ❱  808 │   │   │   local_container.train(wait)                                                   │
│    809 │   │   if self._temp_code_dir is not None:                                               │
│    810 │   │   │   self._temp_code_dir.cleanup()                                                 │
│    811                                                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.12/site-packages/sagemaker/train/local/local_container.py:223 in train    │
│                                                                                                  │
│   220 │   │   shutil.rmtree(os.path.join(self.container_root, "input"))                          │
│   221 │   │   shutil.rmtree(os.path.join(self.container_root, "shared"))                         │
│   222 │   │   for host in self.hosts:                                                            │
│ ❱ 223 │   │   │   shutil.rmtree(os.path.join(self.container_root, host))                         │
│   224 │   │   for folder in self._temporary_folders:                                             │
│   225 │   │   │   shutil.rmtree(os.path.join(self.container_root, folder))                       │
│   226 │   │   return artifacts                                                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.12/shutil.py:759 in rmtree                                                │
│                                                                                                  │
│    756 │   │   stack = [(os.lstat, dir_fd, path, None)]                                          │
│    757 │   │   try:                                                                              │
│    758 │   │   │   while stack:                                                                  │
│ ❱  759 │   │   │   │   _rmtree_safe_fd(stack, onexc)                                             │
│    760 │   │   finally:                                                                          │
│    761 │   │   │   # Close any file descriptors still on the stack.                              │
│    762 │   │   │   while stack:                                                                  │
│                                                                                                  │
│ /opt/conda/lib/python3.12/shutil.py:703 in _rmtree_safe_fd                                       │
│                                                                                                  │
│    700 │   │   │   │   onexc(os.unlink, fullname, err)                                           │
│    701 │   except OSError as err:                                                                │
│    702 │   │   err.filename = path                                                               │
│ ❱  703 │   │   onexc(func, path, err)                                                            │
│    704                                                                                           │
│    705 _use_fd_functions = ({os.open, os.stat, os.unlink, os.rmdir} <=                           │
│    706 │   │   │   │   │    os.supports_dir_fd and                                               │
│                                                                                                  │
│ /opt/conda/lib/python3.12/shutil.py:662 in _rmtree_safe_fd                                       │
│                                                                                                  │
│    659 │   │   │   os.close(dirfd)                                                               │
│    660 │   │   │   return                                                                        │
│    661 │   │   if func is os.rmdir:                                                              │
│ ❱  662 │   │   │   os.rmdir(name, dir_fd=dirfd)                                                  │
│    663 │   │   │   return                                                                        │
│    664 │   │                                                                                     │
│    665 │   │   # Note: To guard against symlink races, we use the standard                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
PermissionError: [Errno 13] Permission denied: '/home/sagemaker-user/amazon-sagemaker-immersion-day/mlops/01-tracking/algo-1/input/data/code'

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 3.4.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): SKlearn managed container 1.4-2
  • Framework version: 1.4-2
  • Python version: 3.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N):

Additional context
When running the ModelTrainer on the managed infrastructure, everything runs fine.
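
A possible stopgap, which I have not verified in this environment: since the container runs as root, a throwaway container can hand ownership of the leftover folders back to the host user so they can be deleted normally. The sketch below only builds the command. The `alpine` image, the `/cleanup` mount point, and the helper name `build_chown_command` are all assumptions, and hosts with SELinux or rootless Docker may need extra volume options.

```python
import os


def build_chown_command(host_dir, uid=None, gid=None, image="alpine"):
    """Build a `docker run` argv that chowns `host_dir` back to the given user.

    Hypothetical helper; defaults to the current user's uid/gid (POSIX only).
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    host_dir = os.path.abspath(host_dir)  # bind mounts need an absolute path
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/cleanup",
        image,
        "chown", "-R", f"{uid}:{gid}", "/cleanup",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, after which the leftover `algo-1` folder should be removable by the regular user.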
