No API to list or force-terminate active runtime sessions — runaway agents burn uncapped costs #498

@rrajpuro

Runaway Agent Issue: AWS Bedrock AgentCore Has No Runtime Session Kill Switch

Summary

AWS Bedrock AgentCore provides no API to list or force-terminate active runtime sessions (microVM instances). When an agent enters a runaway loop, operators have no direct way to stop it — the agent continues consuming API tokens and incurring costs until it dies naturally or is killed via indirect workarounds.

Observed Incident

  • Runtime: <runtime-name> (dev)
  • Duration: ~58 minutes
  • Cost incurred: $72.67 from a single runaway session
  • Root cause: A Claude Code Stop hook failure produced error output that was fed back as a new prompt on every turn, creating a self-sustaining loop with no exit condition.

Why It's Hard to Stop

1. No API to list active runtime sessions

$ aws bedrock-agentcore list-sessions
# ERROR: requires --memory-id and --actor-id
# This lists Memory sessions, NOT runtime/microVM sessions

There is no equivalent of kubectl get pods or aws ecs list-tasks for AgentCore microVMs. The only visibility into running sessions is through CloudWatch logs.
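
Since CloudWatch logs are the only window into running sessions, a rough triage step is to list the log streams that received events in the last few minutes. The sketch below assumes the runtime writes to a log group you already know, and that recent stream activity is a usable proxy for a live microVM; neither is confirmed by AgentCore documentation here. The function name and the `<runtime-log-group>` placeholder are hypothetical.

```python
import time


def recently_active_streams(logs_client, log_group, window_seconds=300, now_ms=None):
    """Return (stream_name, last_activity_ms) for streams active within the window.

    logs_client is a boto3 CloudWatch Logs client, e.g. boto3.client("logs").
    Treating an active stream as a live session is an assumption, not an API guarantee.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    cutoff_ms = now_ms - window_seconds * 1000

    # Newest streams first; a single page (max 50) is enough for quick triage.
    response = logs_client.describe_log_streams(
        logGroupName=log_group,
        orderBy="LastEventTime",
        descending=True,
        limit=50,
    )

    active = []
    for stream in response.get("logStreams", []):
        # lastEventTimestamp can lag behind ingestion, so take the later of the two.
        last_ms = max(
            stream.get("lastEventTimestamp", 0),
            stream.get("lastIngestionTime", 0),
        )
        if last_ms >= cutoff_ms:
            active.append((stream["logStreamName"], last_ms))
    return active
```

Typical use would be `recently_active_streams(boto3.client("logs"), "<runtime-log-group>")`. This only tells you which streams are still producing output; it does not give you a handle to stop anything.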

2. stop_runtime_session does not kill the container

import boto3

client = boto3.client('bedrock-agentcore')
client.stop_runtime_session(
    agentRuntimeArn='<runtime-arn>',
    runtimeSessionId='<session-id>',
)
# ResourceNotFoundException: Session not found or has been terminated

AgentCore had already marked the session terminated at the control plane, but the underlying microVM container kept running independently and continued making Bedrock API calls.

3. update_agent_runtime does not kill active containers mid-loop

Calling update_agent_runtime (even with a new image URI, as a CI/CD deploy does) does not forcefully terminate containers that are actively processing requests. AgentCore waits for the container to go idle before applying the update — but a looping agent never goes idle.

4. Persistent volume backup creates a deadlock

Once a shutdown is initiated, AgentCore tries to back up the persistent volume (/mnt/workspace) before killing the container. A looping agent continuously writes to the volume, so the backup never completes and the container never dies:

Write failed: waiting to be backed up.   ← repeated indefinitely
Claude: OK.
Write failed: waiting to be backed up.
Claude: Noted.

How We Killed It

After exhausting the above options, the only effective solution was to deny Bedrock API permissions on the runtime's IAM role:

import json
import boto3

iam = boto3.client('iam')
iam.put_role_policy(
    RoleName='<runtime-iam-role>',
    PolicyName='emergency-bedrock-deny',
    # PolicyDocument must be a JSON string, not a dict
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
                "bedrock:CallWithBearerToken"
            ],
            "Resource": "*"
        }]
    })
)

This caused the agent's next Bedrock call to fail with a 403, breaking the loop. The IAM change propagated in ~30–60 seconds.

Side effects: The deny also breaks every other active session that uses the same runtime role, so it is not acceptable as a kill switch in production.
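
Because the deny blocks every session on the role, it needs to be removed once the runaway container is confirmed dead. A minimal sketch of the reverse step, assuming the same inline policy name as above; the helper name `lift_emergency_deny` is hypothetical.

```python
def lift_emergency_deny(iam_client, role_name, policy_name="emergency-bedrock-deny"):
    """Remove the inline deny policy from the runtime role.

    iam_client is a boto3 IAM client, e.g. boto3.client("iam").
    Returns True if the policy was deleted, False if it was not attached.
    """
    try:
        iam_client.delete_role_policy(RoleName=role_name, PolicyName=policy_name)
        return True
    except iam_client.exceptions.NoSuchEntityException:
        # Policy (or role) not found: nothing to lift.
        return False
```

As with attaching the policy, expect a short propagation delay before Bedrock calls succeed again.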

What AWS Should Provide

  1. list_runtime_sessions — list all active microVM sessions per runtime, with session ID, start time, and invocation count.
  2. force_stop_runtime_session — immediately SIGKILL the container regardless of in-flight requests or pending volume backups.
  3. Per-session spend alerts — CloudWatch alarm on Bedrock token spend scoped to a single runtimeSessionId.
  4. Configurable max-cost / max-turn guardrails at the AgentCore runtime level, not just in agent code.
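
Until a per-session alert exists, the closest approximation I know of is an alarm on Bedrock token counts per model, which cannot be scoped to a runtimeSessionId. The sketch below assumes Bedrock publishes an OutputTokenCount metric in the AWS/Bedrock namespace with a ModelId dimension; verify that against the metrics visible in your account. The function name, alarm name, threshold, and SNS topic are hypothetical.

```python
def put_token_burn_alarm(cloudwatch_client, alarm_name, model_id,
                         max_output_tokens, sns_topic_arn, period_seconds=300):
    """Create an alarm that fires when output tokens in one period exceed a cap.

    cloudwatch_client is a boto3 CloudWatch client, e.g. boto3.client("cloudwatch").
    Returns the parameters that were sent, for logging or review.
    """
    params = {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Bedrock",         # assumed namespace, check your account
        "MetricName": "OutputTokenCount",   # assumed metric name, check your account
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(max_output_tokens),
        "ComparisonOperator": "GreaterThanThreshold",
        # No traffic means no data points; treat that as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
    cloudwatch_client.put_metric_alarm(**params)
    return params
```

This is a token-count proxy for spend across all callers of that model, so it is noisier than the per-session alert requested above.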

Recommendations for AgentCore Users

Until AWS addresses this, self-protect with:

  • Always set max_turns in your agent SDK options.
  • Never configure a Stop hook that can fail and produce output that gets re-ingested as a prompt.
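  • Set up a CloudWatch metric alarm on your Bedrock invocation costs, using token-count metrics as a proxy since billing metrics update too slowly for this, with a short evaluation period (e.g., the token equivalent of >$1 in 5 minutes).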
  • Keep your runtime IAM role isolated (one role per runtime) so an emergency deny policy doesn't blast other workloads.
  • Know your runtime role name ahead of time — you'll need it fast if a runaway starts.
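
The first two recommendations can also be enforced in your own loop code, independent of any SDK option. Below is a minimal sketch of such a guard: it stops on a turn cap, a spend cap, or the same prompt arriving several times in a row, which is the pattern the incident above produced. The class name, thresholds, and the idea that the caller supplies an estimated per-turn cost are all assumptions for illustration.

```python
import hashlib


class RunawayGuard:
    """Stops an agent loop on a turn cap, a spend cap, or repeated identical input."""

    def __init__(self, max_turns=50, max_cost_usd=5.0, max_repeats=3):
        self.max_turns = max_turns
        self.max_cost_usd = max_cost_usd
        self.max_repeats = max_repeats
        self.turns = 0
        self.cost_usd = 0.0
        self._last_digest = None
        self._repeats = 0

    def check(self, prompt, turn_cost_usd=0.0):
        """Call once per turn before sending the prompt.

        turn_cost_usd is the caller's own estimate (e.g. from token usage).
        Returns a stop reason string, or None if the loop may continue.
        """
        self.turns += 1
        self.cost_usd += turn_cost_usd

        # Count consecutive identical prompts, such as a hook error fed back verbatim.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if digest == self._last_digest:
            self._repeats += 1
        else:
            self._last_digest = digest
            self._repeats = 1

        if self.turns > self.max_turns:
            return "max_turns"
        if self.cost_usd > self.max_cost_usd:
            return "max_cost"
        if self._repeats >= self.max_repeats:
            return "repeated_prompt"
        return None
```

In a loop, call `guard.check(next_prompt, estimated_cost)` before each model call and exit the process when it returns a reason. Exact-match detection is deliberately simple; a loop whose error text varies per turn would only be caught by the turn or cost cap.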
