Skip to content

Conversation

@vicheey
Copy link
Contributor

@vicheey vicheey commented Nov 3, 2025

Pull Request Summary

Which issue(s) does this change fix?

This change fixes two critical issues in AWS SAM CLI's --warm-containers LAZY mode:

Issue 1: Incomplete Container Cleanup

  • When multiple containers are created for concurrent requests, only one container is cleaned up on process termination (SIGINT/SIGTERM), leaving orphaned containers running
  • Root cause: Container tracking used Dict[str, Container] which overwrites entries, so only the last container per function was tracked

Issue 2: Inconsistent Container Reuse

  • After concurrent requests create multiple containers, subsequent requests only reuse one container instead of distributing load across all available containers
  • Root cause: Root cause 1 with only one container is tracked. There is no container selection logic that account for all available containers. SAM proces always uses the same container reference.

Why is this change necessary?

Resource Management: Orphaned containers consume system resources (memory, CPU, disk) and can accumulate over development sessions.

Performance & Accuracy: Users may notice the difference in invoke speed between the very first invokes (parallel with container scale up) and the subsequent invokes (use only one container and request get queued despite other running untracked containers available)

Developer Experience: The fix improve simulation of Lambda horizontal scaling with concurrent requests. Local testing already reflect this behavior but with inaccurate load distribution due to bug.

How does it address the issue?

Container Tracking (Issue 1)

  • Changed container storage from Dict[str, Container] to Dict[str, List[Container]] using defaultdict(list)
  • Updated cleanup logic to iterate over all containers in the (per function) lists
  • Modified clean_running_containers_and_related_resources() to handle multiple containers per function with error handling

Container Selection (Issue 2)

  • Implemented capacity signaling interface with has_capacity(), _mark_busy(), and _mark_idle() methods with thread safety.
  • Added _find_available_container() to check all containers for capacity using first-available strategy
  • Created _acquire_container() to handle container reuse and creation with thread-safe locking
  • Containers are marked busy BEFORE releasing lock to prevent race conditions

Thread Safety and Error Handling

  • Two-level locking: _container_lock for runtime operations, _busy_lock for container state.
  • Cleanup continues even if individual containers fail to stop
  • Logs summary with success/fail counts at debug level
  • Handles decompressed paths and observer cleanup errors gracefully

What side effects does this change have?

Positive Side Effects

  • Better resource utilization: All idle containers are now available for reuse for both EAGER and LAZY mode.
  • Improved local testing accuracy: Concurrent execution within containers behavior now consistent between invoke batches
  • Cleaner system state: No orphaned containers
  • Better observability: Debug logs show container lifecycle and cleanup status

Potential Considerations

  • Slightly increased memory usage for tracking multiple containers (negligible - just references)
  • Test infrastructure updated to send SIGTERM before SIGKILL for graceful shutdown testing.

Backward Compatibility

  • No breaking changes to public APIs or CLI options
  • COLD mode behavior unchanged
  • All existing tests pass

Mandatory Checklist

PRs will only be reviewed after checklist is complete

  • Review the generative AI contribution guidelines
  • Add input/output type hints to new functions/methods
  • Write design document if needed (Do I need to write a design document?)
  • Write/update unit tests
    • Added 194 lines of unit tests for Container capacity signaling
    • Added 857 lines of unit tests for WarmLambdaRuntime container management
    • Tests cover: cleanup logic, container selection, thread safety, error handling
  • Write/update integration tests
    • Added 299 lines of integration tests for start-api (LAZY and EAGER modes)
    • Added 302 lines of integration tests for start-lambda (LAZY and EAGER modes)
    • Tests verify: concurrent requests, container reuse, cleanup, multiple functions
  • Write/update functional tests if needed
    • Not required - integration tests provide sufficient coverage
  • make pr passes
    • All unit tests pass
    • All LAZY mode integration tests pass
    • EAGER mode scale-up tests fail as expected (not a regression)
  • make update-reproducible-reqs if dependencies were changed
  • Write documentation

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@vicheey vicheey marked this pull request as ready for review November 3, 2025 06:56
@vicheey vicheey requested a review from a team as a code owner November 3, 2025 06:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant