No API to list or force-terminate active runtime sessions — runaway agents burn uncapped costs

# Runaway Agent Issue: AWS Bedrock AgentCore Has No Runtime Session Kill Switch

## Summary

AWS Bedrock AgentCore provides no API to list or force-terminate active runtime sessions (microVM instances). When an agent enters a runaway loop, operators have no direct way to stop it: the agent keeps consuming model tokens and incurring cost until it exits on its own or is stopped through an indirect workaround.

## Observed Incident

- **Runtime:** `<runtime-name>` (dev)
- **Duration:** ~58 minutes
- **Cost incurred:** **$72.67** from a single runaway session
- **Root cause:** A Claude Code `Stop` hook failure produced error output that was fed back as a new prompt on every turn, creating a self-sustaining loop with no exit condition.

## Why It's Hard to Stop

### 1. No API to list active runtime sessions

```bash
$ aws bedrock-agentcore list-sessions
# ERROR: requires --memory-id and --actor-id
# This lists Memory sessions, NOT runtime/microVM sessions
```

There is no equivalent of `kubectl get pods` or `aws ecs list-tasks` for AgentCore microVMs. The only visibility into running sessions is CloudWatch logs.
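Since logs are the only signal, recent log activity can serve as a rough proxy for "which sessions are still alive". The sketch below is a workaround, not an AgentCore feature: it groups recent CloudWatch Logs events by log stream and reports the event count and last-seen timestamp for each. The helper names are hypothetical, and the log group name and the mapping from log stream to session are assumptions that depend on how your runtime logs.

```python
import time
from collections import defaultdict


def fetch_recent_events(logs_client, log_group, window_minutes=5):
    """Collect events written to `log_group` in the last `window_minutes`."""
    start_ms = int((time.time() - window_minutes * 60) * 1000)
    paginator = logs_client.get_paginator("filter_log_events")
    events = []
    for page in paginator.paginate(logGroupName=log_group, startTime=start_ms):
        events.extend(page.get("events", []))
    return events


def summarize_by_stream(events):
    """Map each log stream to its event count and latest timestamp (ms)."""
    summary = defaultdict(lambda: {"events": 0, "last_seen_ms": 0})
    for event in events:
        entry = summary[event["logStreamName"]]
        entry["events"] += 1
        entry["last_seen_ms"] = max(entry["last_seen_ms"], event["timestamp"])
    return dict(summary)


# Usage (assumption: "<runtime-log-group>" is the group your runtime writes to):
#   import boto3
#   logs = boto3.client("logs")
#   activity = summarize_by_stream(fetch_recent_events(logs, "<runtime-log-group>"))
```

A stream that keeps producing events long after its session should have ended is a candidate runaway. This only tells you that something is still running; it does not give you a handle to stop it.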

### 2. `stop_runtime_session` does not kill the container

```python
import boto3

client = boto3.client('bedrock-agentcore')  # AgentCore data-plane client
client.stop_runtime_session(
    agentRuntimeArn='<runtime-arn>',
    runtimeSessionId='<session-id>',
)
# ResourceNotFoundException: Session not found or has been terminated
```

AgentCore had already marked the session terminated at the control plane, but the underlying microVM container kept running independently and continued making Bedrock API calls.

### 3. `update_agent_runtime` does not kill active containers mid-loop

Calling `update_agent_runtime` (even with a new image URI, as a CI/CD deploy does) does not forcefully terminate containers that are actively processing requests. AgentCore waits for the container to go idle before applying the update — but a looping agent never goes idle.

### 4. Persistent volume backup creates a deadlock

Once a shutdown is initiated, AgentCore tries to back up the persistent volume (`/mnt/workspace`) before killing the container. A looping agent continuously writes to the volume, so the backup never completes and the container never dies:

```
Write failed: waiting to be backed up.   ← repeated indefinitely
Claude: OK.
Write failed: waiting to be backed up.
Claude: Noted.
```

## How We Killed It

After exhausting the above options, the only effective solution was to **deny Bedrock API permissions on the runtime's IAM role**:

```python
import json

import boto3

iam = boto3.client('iam')

iam.put_role_policy(
    RoleName='<runtime-iam-role>',
    PolicyName='emergency-bedrock-deny',
    # boto3 expects the policy document as a JSON string, not a dict
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
                "bedrock:CallWithBearerToken"
            ],
            "Resource": "*"
        }]
    })
)
```

This caused the agent's next Bedrock call to fail with a 403, breaking the loop. The IAM change propagated in ~30–60 seconds.

**Side effects:** The deny also breaks every other active session that uses the same runtime role, and it stays in effect until the inline policy is deleted. Not acceptable for production.
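Because the deny policy has to be removed again once the runaway is gone, it helps to keep the engage and release steps side by side. The following is a minimal sketch of that pairing, assuming the same policy name and actions as above. The function names are hypothetical; `put_role_policy` and `delete_role_policy` are the standard boto3 IAM calls.

```python
import json

EMERGENCY_POLICY_NAME = "emergency-bedrock-deny"


def build_deny_policy():
    """Return the deny policy as the JSON string that IAM expects."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
                "bedrock:CallWithBearerToken",
            ],
            "Resource": "*",
        }],
    })


def engage_kill_switch(iam_client, role_name):
    """Attach the inline deny policy to the runtime role."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=EMERGENCY_POLICY_NAME,
        PolicyDocument=build_deny_policy(),
    )


def release_kill_switch(iam_client, role_name):
    """Remove the inline deny policy once the runaway session has died."""
    iam_client.delete_role_policy(
        RoleName=role_name,
        PolicyName=EMERGENCY_POLICY_NAME,
    )


# Usage:
#   import boto3
#   iam = boto3.client("iam")
#   engage_kill_switch(iam, "<runtime-iam-role>")
#   ...confirm in the logs that the loop has stopped...
#   release_kill_switch(iam, "<runtime-iam-role>")
```

Passing the client in as an argument keeps the helpers easy to dry-run against a stub before they are needed in an incident.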

## What AWS Should Provide

1. **`list_runtime_sessions`** — list all active microVM sessions per runtime, with session ID, start time, and invocation count.
2. **`force_stop_runtime_session`** — immediately SIGKILL the container regardless of in-flight requests or pending volume backups.
3. **Per-session spend alerts** — CloudWatch alarm on Bedrock token spend scoped to a single `runtimeSessionId`.
4. **Configurable max-cost / max-turn guardrails** at the AgentCore runtime level, not just in agent code.
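For contrast with item 3, here is the closest approximation we know of today: an alarm on the token-count metrics that Bedrock publishes in the `AWS/Bedrock` CloudWatch namespace. It is scoped per model, not per `runtimeSessionId`, which is exactly the gap. The sketch only builds the arguments for `put_metric_alarm`. The function name is hypothetical, and the price per 1,000 output tokens is an assumption you must supply for your model.

```python
def token_alarm_params(alarm_name, model_id, usd_budget,
                       usd_per_1k_output_tokens, period_seconds=300,
                       alarm_actions=()):
    """Build put_metric_alarm arguments for an output-token budget."""
    # Convert the dollar budget into a token count using the caller's price.
    threshold_tokens = round(usd_budget / usd_per_1k_output_tokens * 1000)
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Bedrock",
        "MetricName": "OutputTokenCount",
        # Assumption: use the ModelId value your invocations actually report.
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_tokens),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": list(alarm_actions),
    }


# Usage:
#   import boto3
#   params = token_alarm_params("bedrock-token-burst", "<model-id>", 1.0, <price>)
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

An alarm like this fires on the combined traffic of every session that calls the model, so it can flag a burst but cannot say which session caused it.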

## Recommendations for AgentCore Users

Until AWS addresses this, protect yourself with the following:

- Always set `max_turns` in your agent SDK options.
- Never configure a `Stop` hook that can fail and produce output that gets re-ingested as a prompt.
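- Set up a CloudWatch metric alarm with a short evaluation period on a proxy for Bedrock spend, such as the token-count metrics in the `AWS/Bedrock` namespace (e.g., the token equivalent of >$1 in 5 minutes). Billing metrics refresh too slowly to catch a runaway within minutes.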
- Keep your runtime IAM role isolated (one role per runtime) so an emergency deny policy doesn't blast other workloads.
- Know your runtime role name ahead of time — you'll need it fast if a runaway starts.
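The first two recommendations can also be enforced in your own agent loop, independent of any SDK setting. Below is a minimal, framework-agnostic sketch of a guard that stops on a turn cap, a cost cap, or the same prompt arriving several times in a row (the pattern our `Stop` hook failure produced). The class name, thresholds, and per-turn cost input are all hypothetical; wire them to whatever your agent code exposes.

```python
import hashlib


class RunawayGuard:
    """Stops an agent loop on a turn cap, a cost cap, or repeated prompts."""

    def __init__(self, max_turns=50, max_cost_usd=5.0, max_repeats=3):
        self.max_turns = max_turns
        self.max_cost_usd = max_cost_usd
        self.max_repeats = max_repeats
        self.turns = 0
        self.cost_usd = 0.0
        self._last_digest = None
        self._repeats = 0

    def check(self, prompt, turn_cost_usd=0.0):
        """Record one turn; return a stop reason, or None to continue."""
        self.turns += 1
        self.cost_usd += turn_cost_usd
        # Count consecutive identical prompts by comparing content hashes.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if digest == self._last_digest:
            self._repeats += 1
        else:
            self._last_digest, self._repeats = digest, 1
        if self.turns > self.max_turns:
            return "max_turns exceeded"
        if self.cost_usd > self.max_cost_usd:
            return "max_cost exceeded"
        if self._repeats >= self.max_repeats:
            return "same prompt repeated"
        return None


# Usage inside a hypothetical agent loop:
#   guard = RunawayGuard(max_turns=30, max_cost_usd=2.0)
#   reason = guard.check(next_prompt, turn_cost_usd=estimated_cost)
#   if reason is not None:
#       raise SystemExit(f"Stopping agent: {reason}")
```

Exact-match detection only catches byte-identical prompts. If the re-ingested error text varies between turns (timestamps, counters), normalize it before hashing or rely on the turn and cost caps.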
