Make container Service more robust #76

Merged
dongwang218 merged 7 commits into main from dong_container_fix1 on Aug 29, 2025

@dongwang218
Contributor

Why ?

Fix several container issues found while testing SWE-bench.

How ?

  1. Make ContainerDeployment non-blocking. Currently it does not become running until all containers are alive. This means that if we delete the deployment, the del function is not called and no proper cleanup happens.
  2. Add timeouts to acquire/release/execute.
  3. Make ContainerClient return JSON with an error key instead of throwing an exception.
  4. Add ray_get_async to wait for a list of futures.
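
Items 2 to 4 can be sketched roughly as below. This is a minimal illustration using only asyncio, not the code in this PR: `call_with_timeout` and `wait_all` are hypothetical names, the `result` key is an assumption (only the `error` key is mentioned above), and `wait_all` stands in for the idea behind ray_get_async using plain awaitables rather than Ray futures.

```python
import asyncio
from typing import Any, Awaitable, Dict, Iterable, Optional


async def call_with_timeout(
    awaitable: Awaitable[Any], timeout: Optional[float]
) -> Dict[str, Any]:
    """Await something with a time limit and report failures as data.

    Hypothetical helper: returns {"result": ...} on success and
    {"error": ...} on timeout or failure, instead of raising.
    """
    try:
        # timeout=None means wait without a limit.
        result = await asyncio.wait_for(awaitable, timeout=timeout)
        return {"result": result}
    except asyncio.TimeoutError:
        return {"error": f"timed out after {timeout}s"}
    except Exception as exc:
        return {"error": f"{type(exc).__name__}: {exc}"}


async def wait_all(
    awaitables: Iterable[Awaitable[Any]], timeout: Optional[float] = None
) -> Dict[str, Any]:
    """Wait for a list of awaitables under one shared time limit.

    Stand-in for the ray_get_async idea; it does not use Ray.
    """
    return await call_with_timeout(asyncio.gather(*awaitables), timeout)
```

With this shape, a caller checks for the `error` key rather than wrapping each acquire/release/execute call in try/except.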

Test plan

matrix check_health --app_name container

@meta-cla meta-cla bot added the CLA Signed label (this label is managed by the Meta Open Source bot) on Aug 28, 2025
Comment on lines +145 to +150

    session_timeout = aiohttp.ClientTimeout(total=timeout + 5) if timeout else None
    async with aiohttp.ClientSession(timeout=session_timeout) as session:
        status, content = await post_url(
            session, f"{self.base_url}/execute", payload
        )
        return await self._handle_response(status, content)

Consider making this a util function? There seem to be several occurrences that differ only in the timeout and the URL.
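
One possible shape for such a helper, as a sketch only: `ClientSketch`, `_post`, the injected `post` transport, and the `acquire` path are hypothetical names chosen for illustration, and `_handle_response` here is a guess at the behavior, not the PR's implementation. The transport is injected so the example runs without a network; in the real client it would wrap the aiohttp session and post_url call shown in the reviewed lines.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple

# Hypothetical transport: (url, payload, total_timeout) -> (status, content).
PostFn = Callable[
    [str, Dict[str, Any], Optional[float]], Awaitable[Tuple[int, Any]]
]


class ClientSketch:
    """Illustrative stand-in, not the PR's ContainerClient."""

    def __init__(self, base_url: str, post: PostFn) -> None:
        self.base_url = base_url
        self._post_fn = post

    async def _post(
        self, path: str, payload: Dict[str, Any], timeout: Optional[float] = None
    ) -> Any:
        # Same rule as the reviewed lines: 5 extra seconds, or no limit.
        total = timeout + 5 if timeout else None
        status, content = await self._post_fn(
            f"{self.base_url}/{path}", payload, total
        )
        return await self._handle_response(status, content)

    async def _handle_response(self, status: int, content: Any) -> Any:
        # Assumed behavior: pass through on 200, else return an error dict.
        if status == 200:
            return content
        return {"error": f"HTTP {status}: {content}"}

    # Each endpoint now differs only in path and timeout.
    async def execute(self, payload: Dict[str, Any], timeout: Optional[float] = None) -> Any:
        return await self._post("execute", payload, timeout)

    async def acquire(self, payload: Dict[str, Any], timeout: Optional[float] = None) -> Any:
        return await self._post("acquire", payload, timeout)
```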

@dongwang218 dongwang218 merged commit a2a8f82 into main Aug 29, 2025
7 checks passed
Labels

CLA Signed This label is managed by the Meta Open Source bot.

2 participants