Local processing job replaces local output location with S3 URI #5551

@moose-in-australia

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
When using LocalSession with FrameworkProcessor, the SDK ignores file:// URIs specified in ProcessingOutput.s3_output.s3_uri and replaces them with S3 URIs. This prevents local processing jobs from saving outputs to local directories as intended.

The issue is in Processor._normalize_outputs() (line 489 in sagemaker-core/src/sagemaker/core/processing.py), which replaces any non-S3 URI with an S3 URI without checking if the session is a LocalSession that should preserve file:// URIs.
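One possible shape for the missing check is sketched below. This is not the SDK's actual code: `normalize_output_uri`, `is_local_session`, and `fallback_s3_uri` are hypothetical names used only for illustration, and the sketch assumes the session type is known to the caller. It shows the intended rule only: keep `s3://` URIs, keep `file://` URIs when running locally, and fall back to a generated S3 URI otherwise.

```python
from urllib.parse import urlparse


def normalize_output_uri(uri, is_local_session, fallback_s3_uri):
    """Pick the destination URI for one processing output.

    Hypothetical helper, not part of the SageMaker SDK.
    `fallback_s3_uri` stands in for whatever S3 URI the SDK would generate.
    """
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        # Already an S3 destination: leave it untouched.
        return uri
    if is_local_session and scheme == "file":
        # Local mode with an explicit file:// destination: preserve it.
        return uri
    # Anything else is replaced with the generated S3 URI.
    return fallback_s3_uri


# Usage: the file:// URI survives only when the session is local.
print(normalize_output_uri("file:///tmp/processing/train", True, "s3://bucket/job/output/train"))
print(normalize_output_uri("file:///tmp/processing/train", False, "s3://bucket/job/output/train"))
```

With a check of this kind, the reproduction below would keep its `file://` destination instead of having it rewritten.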

To reproduce

from sagemaker.core import FrameworkProcessor
from sagemaker.core.local import LocalSession
from sagemaker.core.shapes import ProcessingOutput, ProcessingS3Output
from sagemaker.core.image_uris import retrieve
import os

# Create local session
local_session = LocalSession()

# Get processor image
processor_image_uri = retrieve(
    framework="sklearn",
    version="1.4-2",
    region=local_session.boto_region_name
)

# Define outputs with file:// URIs
local_processing_dir = os.path.abspath("processing")
os.makedirs(f"{local_processing_dir}/train", exist_ok=True)

local_processing_outputs = [
    ProcessingOutput(
        output_name="train",
        s3_output=ProcessingS3Output(
            s3_uri=f"file://{local_processing_dir}/train",
            local_path="/opt/ml/processing/output/train",
            s3_upload_mode="EndOfJob")
    )
]

# Create processor with LocalSession
processor = FrameworkProcessor(
    image_uri=processor_image_uri,
    role="arn:aws:iam::123456789012:role/DummyRole",
    instance_type="local",
    instance_count=1,
    sagemaker_session=local_session,
    base_job_name='test-local-processing',
    command=["python3"]
)

# Create a simple processing script
os.makedirs("processing_code", exist_ok=True)
with open("processing_code/test.py", "w") as f:
    f.write("""
import os
with open('/opt/ml/processing/output/train/output.txt', 'w') as f:
    f.write('test output')
print('Processing complete')
""")

# Run processor
processor.run(
    code="test.py",
    source_dir="./processing_code",
    outputs=local_processing_outputs,
    wait=True,  # block until the job finishes so the file check below is meaningful
    logs=True
)

# Check where outputs went
print(f"Expected output location: {local_processing_dir}/train/output.txt")
print(f"File exists locally: {os.path.exists(f'{local_processing_dir}/train/output.txt')}")

Expected behavior
When using LocalSession with file:// URIs in ProcessingOutput, the outputs should be saved to the specified local directories, not uploaded to S3.
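The expected behavior can be checked with a small helper like the one below. It is a sketch under assumptions: `file_uri_to_path` and `output_written_locally` are hypothetical names, not SDK functions, and the URI is assumed to have the POSIX form `file:///absolute/path` that the reproduction script builds.

```python
import os
from urllib.parse import unquote, urlparse


def file_uri_to_path(uri):
    """Convert a file:// URI into a local filesystem path (POSIX form assumed)."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"not a file:// URI: {uri}")
    return unquote(parsed.path)


def output_written_locally(uri, filename):
    """Return True if `filename` exists in the directory the file:// URI names."""
    return os.path.isfile(os.path.join(file_uri_to_path(uri), filename))


# Usage: after the job finishes, this should be True once the bug is fixed.
print(output_written_locally("file:///tmp/processing/train", "output.txt"))
```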

Screenshots or logs
N/A

System information

  • SageMaker Python SDK version: 3.4.0 (sagemaker-core 2.4.0)
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): scikit-learn
  • Framework version: 1.4-2
  • Python version: 3.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Additional context
N/A
