Runaway Agent Issue: AWS Bedrock AgentCore Has No Runtime Session Kill Switch
Summary
AWS Bedrock AgentCore provides no API to list or force-terminate active runtime sessions (microVM instances). When an agent enters a runaway loop, operators have no direct way to stop it — the agent continues consuming API tokens and incurring costs until it dies naturally or is killed via indirect workarounds.
Observed Incident
- Runtime: <runtime-name> (dev)
- Duration: ~58 minutes
- Cost incurred: $72.67 from a single runaway session
- Root cause: A Claude Code Stop hook failure produced error output that was fed back as a new prompt on every turn, creating a self-sustaining loop with no exit condition.
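The feedback loop described above can be caught in agent code before it runs for an hour. The following is a minimal sketch, assuming your harness lets you inspect each prompt before it is sent to the model. `LoopGuard`, `RunawayLoopError`, and the thresholds are hypothetical names chosen for illustration; they are not part of Claude Code or AgentCore.

```python
import hashlib
from collections import deque


class RunawayLoopError(RuntimeError):
    """Raised when the guard decides the agent should stop."""


class LoopGuard:
    """Stops a session after too many turns or too many repeated prompts."""

    def __init__(self, max_turns=50, max_repeats=3, window=10):
        self.max_turns = max_turns
        self.max_repeats = max_repeats
        self.turns = 0
        # Fingerprints of the most recent prompts (bounded memory).
        self.recent = deque(maxlen=window)

    @staticmethod
    def _fingerprint(prompt):
        # Normalise case and whitespace so trivially different copies match.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def check(self, prompt):
        """Call once per turn, before the prompt is sent to the model."""
        self.turns += 1
        if self.turns > self.max_turns:
            raise RunawayLoopError(f"turn limit of {self.max_turns} exceeded")
        fingerprint = self._fingerprint(prompt)
        self.recent.append(fingerprint)
        if self.recent.count(fingerprint) >= self.max_repeats:
            raise RunawayLoopError("same prompt seen repeatedly; likely a feedback loop")
```

Calling `guard.check(prompt)` at the top of each turn gives the loop an exit condition that does not depend on the hook behaving correctly.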
Why It's Hard to Stop
1. No API to list active runtime sessions
$ aws bedrock-agentcore list-sessions
# ERROR: requires --memory-id and --actor-id
# This lists Memory sessions, NOT runtime/microVM sessions
There is no equivalent of kubectl get pods or ecs list-tasks for AgentCore microVMs. The only visibility into running sessions is CloudWatch logs.
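Since CloudWatch logs are the only window into running sessions, one workaround is to infer activity from which log streams are still emitting events. This is a rough sketch, assuming the runtime's log group name is known and that active sessions write to distinguishable log streams; the helper names and the `<runtime-log-group>` placeholder are illustrative. It uses the CloudWatch Logs `filter_log_events` paginator from boto3.

```python
import time
from collections import defaultdict


def summarise_active_streams(events):
    """Group log events by stream; return event count and last timestamp per stream."""
    summary = defaultdict(lambda: {"events": 0, "last_ms": 0})
    for event in events:
        entry = summary[event["logStreamName"]]
        entry["events"] += 1
        entry["last_ms"] = max(entry["last_ms"], event["timestamp"])
    return dict(summary)


def recent_events(logs_client, log_group, minutes=5):
    """Yield events from the last `minutes` via the FilterLogEvents paginator."""
    start_ms = int((time.time() - minutes * 60) * 1000)
    paginator = logs_client.get_paginator("filter_log_events")
    for page in paginator.paginate(logGroupName=log_group, startTime=start_ms):
        yield from page["events"]


def main(log_group="<runtime-log-group>"):
    """Print streams that logged recently. Placeholder log group: substitute your own."""
    import boto3  # imported lazily so the helpers above work without AWS access

    logs = boto3.client("logs")
    summary = summarise_active_streams(recent_events(logs, log_group))
    for stream, info in sorted(summary.items(), key=lambda kv: -kv[1]["events"]):
        print(stream, info["events"], info["last_ms"])
```

A stream with a high event count and a very recent `last_ms` is a candidate runaway, but this only observes the session; it does not stop it.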
2. stop_runtime_session does not kill the container
import boto3

client = boto3.client('bedrock-agentcore')
client.stop_runtime_session(
    agentRuntimeArn='<runtime-arn>',
    runtimeSessionId='<session-id>',
)
# ResourceNotFoundException: Session not found or has been terminated
AgentCore had already marked the session terminated at the control plane, but the underlying microVM container kept running independently and continued making Bedrock API calls.
3. update_agent_runtime does not kill active containers mid-loop
Calling update_agent_runtime (even with a new image URI, as a CI/CD deploy does) does not forcefully terminate containers that are actively processing requests. AgentCore waits for the container to go idle before applying the update — but a looping agent never goes idle.
4. Persistent volume backup creates a deadlock
Once a shutdown is initiated, AgentCore tries to back up the persistent volume (/mnt/workspace) before killing the container. A looping agent continuously writes to the volume, so the backup never completes and the container never dies:
Write failed: waiting to be backed up. ← repeated indefinitely
Claude: OK.
Write failed: waiting to be backed up.
Claude: Noted.
How We Killed It
After exhausting the above options, the only effective solution was to deny Bedrock API permissions on the runtime's IAM role:
import json

import boto3

iam = boto3.client('iam')
iam.put_role_policy(
    RoleName='<runtime-iam-role>',
    PolicyName='emergency-bedrock-deny',
    # PolicyDocument must be a JSON string, not a dict
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
                "bedrock:CallWithBearerToken"
            ],
            "Resource": "*"
        }]
    })
)
This caused the agent's next Bedrock call to fail with a 403, breaking the loop. The IAM change propagated in ~30–60 seconds.
Side effects: This also breaks all other active sessions using the same runtime role. Not acceptable for production.
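An inline deny policy stays attached to the role until someone removes it, so the role remains unusable after the runaway is dead. A small sketch of scripting both halves, the deny and its removal, is shown here. The function names are hypothetical; `put_role_policy` and `delete_role_policy` are the standard boto3 IAM client calls, and the client is passed in so the helpers can be exercised without AWS access.

```python
import json

EMERGENCY_POLICY_NAME = "emergency-bedrock-deny"

DENIED_ACTIONS = [
    "bedrock:InvokeModel",
    "bedrock:InvokeModelWithResponseStream",
    "bedrock:CallWithBearerToken",
]


def build_deny_policy():
    """Return the emergency deny policy as the JSON string IAM expects."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": DENIED_ACTIONS,
            "Resource": "*",
        }],
    })


def apply_emergency_deny(iam_client, role_name):
    """Attach the deny policy; every session using this role loses Bedrock access."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=EMERGENCY_POLICY_NAME,
        PolicyDocument=build_deny_policy(),
    )


def lift_emergency_deny(iam_client, role_name):
    """Remove the deny policy once the runaway session is confirmed dead."""
    iam_client.delete_role_policy(
        RoleName=role_name,
        PolicyName=EMERGENCY_POLICY_NAME,
    )
```

Typical use would be `apply_emergency_deny(boto3.client("iam"), "<runtime-iam-role>")`, followed by `lift_emergency_deny(...)` after the logs go quiet.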
What AWS Should Provide
- list_runtime_sessions: list all active microVM sessions per runtime, with session ID, start time, and invocation count.
- force_stop_runtime_session: immediately SIGKILL the container regardless of in-flight requests or pending volume backups.
- Per-session spend alerts: a CloudWatch alarm on Bedrock token spend scoped to a single runtimeSessionId.
- Configurable max-cost / max-turn guardrails at the AgentCore runtime level, not just in agent code.
Recommendations for AgentCore Users
Until AWS addresses this, self-protect with:
- Always set max_turns in your agent SDK options.
- Never configure a Stop hook that can fail and produce output that gets re-ingested as a prompt.
- Set up a CloudWatch metric alarm on your Bedrock invocation costs with a short evaluation period (e.g., >$1 in 5 minutes).
- Keep your runtime IAM role isolated (one role per runtime) so an emergency deny policy doesn't blast other workloads.
- Know your runtime role name ahead of time — you'll need it fast if a runaway starts.
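The spend alarm above can be approximated with token counts. The sketch below assumes the `AWS/Bedrock` CloudWatch namespace exposes `OutputTokenCount` per `ModelId` rather than a dollar figure, so the dollar budget is converted to a token threshold using a price you supply. The price, function names, and alarm naming scheme are illustrative assumptions; check current pricing and metric names for your model and region.

```python
def tokens_for_budget(budget_usd, usd_per_1k_tokens):
    """Convert a dollar budget into a token count at the given per-1k-token price."""
    return round(budget_usd / usd_per_1k_tokens * 1000)


def token_alarm_kwargs(model_id, budget_usd, usd_per_1k_output_tokens,
                       period_seconds=300, sns_topic_arn=None):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    kwargs = {
        "AlarmName": f"bedrock-output-tokens-{model_id}",
        "Namespace": "AWS/Bedrock",
        "MetricName": "OutputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Sum",
        "Period": period_seconds,       # e.g. 300 s = the "5 minutes" window
        "EvaluationPeriods": 1,         # fire on the first breaching window
        "Threshold": float(tokens_for_budget(budget_usd, usd_per_1k_output_tokens)),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        kwargs["AlarmActions"] = [sns_topic_arn]
    return kwargs
```

The result would be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. Note that this alarm is scoped to a model, not to a single runtime session, so it detects a spike without identifying which session caused it.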