fix: Container clean up and load distribution #8389

vicheey · 2025-11-03T06:18:54Z

Pull Request Summary

Which issue(s) does this change fix?

This change fixes two critical issues in AWS SAM CLI's --warm-containers LAZY mode:

Issue 1: Incomplete Container Cleanup

When multiple containers are created for concurrent requests, only one container is cleaned up on process termination (SIGINT/SIGTERM), leaving orphaned containers running
Root cause: Container tracking used Dict[str, Container] which overwrites entries, so only the last container per function was tracked

Issue 2: Inconsistent Container Reuse

After concurrent requests create multiple containers, subsequent requests only reuse one container instead of distributing load across all available containers
Root cause: Root cause 1 with only one container is tracked. There is no container selection logic that account for all available containers. SAM proces always uses the same container reference.

Why is this change necessary?

Resource Management: Orphaned containers consume system resources (memory, CPU, disk) and can accumulate over development sessions.

Performance & Accuracy: Users may notice the difference in invoke speed between the very first invokes (parallel with container scale up) and the subsequent invokes (use only one container and request get queued despite other running untracked containers available)

Developer Experience: The fix improve simulation of Lambda horizontal scaling with concurrent requests. Local testing already reflect this behavior but with inaccurate load distribution due to bug.

How does it address the issue?

Container Tracking (Issue 1)

Changed container storage from Dict[str, Container] to Dict[str, List[Container]] using defaultdict(list)
Updated cleanup logic to iterate over all containers in the (per function) lists
Modified clean_running_containers_and_related_resources() to handle multiple containers per function with error handling

Container Selection (Issue 2)

Implemented capacity signaling interface with has_capacity(), _mark_busy(), and _mark_idle() methods with thread safety.
Added _find_available_container() to check all containers for capacity using first-available strategy
Created _acquire_container() to handle container reuse and creation with thread-safe locking
Containers are marked busy BEFORE releasing lock to prevent race conditions

Thread Safety and Error Handling

Two-level locking: _container_lock for runtime operations, _busy_lock for container state.
Cleanup continues even if individual containers fail to stop
Logs summary with success/fail counts at debug level
Handles decompressed paths and observer cleanup errors gracefully

What side effects does this change have?

Positive Side Effects

Better resource utilization: All idle containers are now available for reuse for both EAGER and LAZY mode.
Improved local testing accuracy: Concurrent execution within containers behavior now consistent between invoke batches
Cleaner system state: No orphaned containers
Better observability: Debug logs show container lifecycle and cleanup status

Potential Considerations

Slightly increased memory usage for tracking multiple containers (negligible - just references)
Test infrastructure updated to send SIGTERM before SIGKILL for graceful shutdown testing.

Backward Compatibility

No breaking changes to public APIs or CLI options
COLD mode behavior unchanged
All existing tests pass

Mandatory Checklist

PRs will only be reviewed after checklist is complete

Review the generative AI contribution guidelines
Add input/output type hints to new functions/methods
Write design document if needed (Do I need to write a design document?)
Write/update unit tests
- Added 194 lines of unit tests for Container capacity signaling
- Added 857 lines of unit tests for WarmLambdaRuntime container management
- Tests cover: cleanup logic, container selection, thread safety, error handling
Write/update integration tests
- Added 299 lines of integration tests for start-api (LAZY and EAGER modes)
- Added 302 lines of integration tests for start-lambda (LAZY and EAGER modes)
- Tests verify: concurrent requests, container reuse, cleanup, multiple functions
Write/update functional tests if needed
- Not required - integration tests provide sufficient coverage
make pr passes
- All unit tests pass
- All LAZY mode integration tests pass
- EAGER mode scale-up tests fail as expected (not a regression)
make update-reproducible-reqs if dependencies were changed
Write documentation

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…d-ditribution

fix: Container clean up and load distribution

b16a34e

github-actions bot added area/local/start-api sam local start-api command area/local/invoke sam local invoke command area/local/start-invoke pr/internal labels Nov 3, 2025

fix make pr

1d4c9ca

vicheey marked this pull request as ready for review November 3, 2025 06:56

vicheey requested a review from a team as a code owner November 3, 2025 06:56

vicheey and others added 2 commits November 2, 2025 23:03

refactor unit tests

292d81a

Merge branch 'develop' into fix-container-clean-up-and-concurrent-loa…

47c41bb

…d-ditribution

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Container clean up and load distribution #8389

fix: Container clean up and load distribution #8389

Uh oh!

vicheey commented Nov 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fix: Container clean up and load distribution #8389

Are you sure you want to change the base?

fix: Container clean up and load distribution #8389

Uh oh!

Conversation

vicheey commented Nov 3, 2025

Pull Request Summary

Which issue(s) does this change fix?

Why is this change necessary?

How does it address the issue?

Container Tracking (Issue 1)

Container Selection (Issue 2)

Thread Safety and Error Handling

What side effects does this change have?

Positive Side Effects

Potential Considerations

Backward Compatibility

Mandatory Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant