Skip to content
This repository was archived by the owner on Jun 3, 2026. It is now read-only.
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
174 changes: 174 additions & 0 deletions designs/snapshot-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,174 @@
# Design Doc: Low-Level Snapshot API

**Status**: Proposed

**Date**: 2026-01-28

**Issue**: https://github.com/strands-agents/sdk-python/issues/1138

## Motivation

Today, developers who want to manually snapshot and restore agent state can *almost* do so by saving and loading these properties directly:

- `Agent.messages` — the conversation history
- `Agent.state` — custom application state
- `Agent._interrupt_state` — internal state for interrupt handling
- Conversation manager internal state — state held by the conversation manager (e.g., sliding window position)

However, this approach is fragile: it requires knowledge of internal implementation details, and the set of properties may change between versions. This proposal introduces a stable, convenient API to accomplish the same thing without relying on internals.

**This API does not change agent behavior** — it simply provides a clean way to serialize and restore the existing state that already exists on the agent.

## Context

Developers need a way to preserve and restore the exact state of an agent at a specific point in time. The existing SessionManagement doesn't address this:

- SessionManager works in the background, incrementally recording messages rather than full state. This means it's not possible to restore to arbitrary points in time.
- After a message is saved, there is no way to modify it and have it recorded in session-management, preventing more advance context-management strategies while being able to pause & restore agents.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a use case for wanting to modify a past message?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversation Managers do this, so yes.

- There is no way to proactively trigger session-management (e.g., after modifying `agent.messages` or `agent.state` directly)

## Decision

Add a low-level, explicit snapshot API as an alternative to automatic session-management. This enables preserving the exact state of an agent at a specific point and restoring it later — useful for evaluation frameworks, custom session management, and checkpoint/restore workflows.

### API Changes

```python
class Snapshot(TypedDict):
type: str # the type of data stored (e.g., "agent")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd want to see a timestamp of snapshot so we can go back in time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, Probably it can be in the sate / metadata.

I notice that otel trace / span properties are good to have in many cases. I wonder if some of them can be added into state or metadata.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea Timestamp could be in metadata/user provided. That said, it might be common enough to include top-level.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that otel trace / span properties are good to have in many cases. I wonder if some of them can be added into state or metadata.

I can see a future where we'd want to include it for evaluation purposes but not necessarily for persistence purposes (unless we also expect to restore it to the agent). However, I do see a need that plugins should be able to contribute to what is put into a snapshot, which makes me think we're missing the extensibility story here. Will have to look into that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding SHA is also crucial here ^^

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we're not handling the persistence of the snapshots, I'm not sure hash here makes sense. That said, there's going to be a strong need to make snapshots easy to persist and this might be a good feature to include as part of that

state: dict[str, Any] # opaque; do not modify — format subject to change
metadata: dict # user-provided data to be stored with the snapshot

class Agent:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want specific agent methods, or Snapshottable (wip name) interface?

Having an interface like below, would make implementation a lot easier. And it would allow us to extend to other types (multi-agent, etc)

Snapshottable:
   def save():
   def load():

This is similar to @JackYPCOnline 's idea on SessionAwarebut different naming pretty much

def save_snapshot(self, metadata: dict | None = None) -> Snapshot:
"""Capture the current agent state as a snapshot."""
...

def load_snapshot(self, snapshot: Snapshot) -> None:
"""Restore agent state from a snapshot."""
...
Comment on lines +43 to +49

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea! I would consider moving these methods to their own class and inject it in the agent class agent.snapshot.load()

@JackYPCOnline JackYPCOnline Feb 2, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am picky on naming, what it actually does is to_dict() and from_dict(), we are not saving anything

```

### Behavior

Snapshots capture **agent state** (data), not **runtime behavior** (code):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this, it is not easy to capture and persist code, and I dont think strands should try to do this.

However, we should explore how one would restore an agent from a snapshot, and load lets say tools back into the agent after persisting it. I would like to see an example devex of what this looks like.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I view the tool state as a feature that we'd be adding to the agent to make "enabled" tools into a state on the agent. So, if we had that I imagine it would be something like:

agent = Agent(tools=[tool1, tool2, tool3, tool4], enabled_tools=["tool1"])

Where only tool1 would be enabled/available on the agent. Then to enable other tools something would eventually trigger:

agent.enabled_tools = ["tool1", "tool3"]

and for restoring an agent with specific tools, it would be the same as

agent2 = Agent(tools=[tool3, tool4])
agent2.load_snapshot(snapshot)

and the snapshot would be restoring the enabled_tools state back into the agent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the term "snapshot" makes me think of a disk snapshot - literally everything. I would like to see this incorporate tools etc in the future.


- **Agent State** — Data persisted as part of session-management: conversation messages, context, and other JSON-serializable data. This is what snapshots save and restore.
- **Runtime Behavior** — Configuration that defines how the agent operates: model, tools, ConversationManager, etc. These are *not* included in snapshots and must be set separately when creating or restoring an agent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we allow these components to expose "snapshot-able data"? e.g. I am a conv manager developer, I want my data to be restored with snapshots

What's the recommendation? Keeping that data in agent state?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the recommendation? Keeping that data in agent state?

Yeah; the recommendation is agent state

Do we allow these components to expose "snapshot-able data"?

It should be Agent State (AgentState directly; or if we're missing something, an equivalent thereof). The idea that I'm trying to get across in this section is "Snapshots do not represent anything other than what already exists in agent state/session-management, it just provides a more direct api to control it".

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there's a larger question about how to partition agent state more easily by plugin, but for now I believe the answer is AgentState

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we saying that we don't believe these things should be part of a snapshot or are we just saying that we are not trying to expand the scoep by limiting to the current capabilities of Session Management.

For example I could see the following being important

  • models: what happens if I change this as the agent runs
  • tools: what happens if I add a tool and want to revert back to a time when I did not

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial goal was not to expand scope, but I'm more convinced we should be looking at the things that should be serialized.

But per the meeting, we're going to POC this design (with tweaks) to see where the pain points are.

More concretely:

Are we saying that we don't believe these things should be part of a snapshot

Yes, I do not believe that the tools we have on the agent today should be serialized to the snapshot, as it would be a lossy system without code-loading and I don't think we should do code-loading as part of a snapshot (e.g . "The obvious path is the happy path"). I'd like to instead see a feature in strands which makes tool selection first class configuration instead of code-only

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does that mean a configuration like model_id wouldn't be stored in the snapshot? Is there a specific reason why? May have missed it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial reason was "because models are shared and snapshotting something shared doesn't make sense"; that said, we're going to do a POC to tease out some of the details, because the team felt that serializing model properties made sense


The intent is that anything stored or restored by session-management would be stored in a snapshot — so this proposal is *not* documenting or changing what is persisted, but rather providing an explicit way to do what session-management does automatically.

### Contract

- **`metadata`** — Application-owned. Strands does not read, modify, or manage this field. Use it to store checkpoint labels, timestamps, or any application-specific data without the need for a separate/standalone object/datastore.
Comment on lines +62 to +63

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: naming is hard but I think describing this as metadata is a stretch. If it is just store checkpoint labels, timestamps I agree but the any application-specific data part seems like this is broader

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a look as part of POC

- **`type` and `state`** — Strands-owned. These fields are managed internally and should be treated as opaque. The format of `state` is subject to change; do not modify or depend on its structure.
- **Serialization** — Strands guarantees that `type` and `state` will only contain JSON-serializable values.

### Future Concerns

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have: We can enable a traces for snapshot actions

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this tracing when a snapshot is taken?
Agreed, worth doing in the future


- Snapshotting for MultiAgent constructs - Snapshot is designed in a way that the snapshot could be reused for multi-agent with a similar api
- Providing a storage API for snapshot CRUD operations (save to disk, database, etc.)
- Providing APIs to customize serialization formats

## Developer Experience

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow users to snapshot in hooks on life cycle events?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should allow it, but I didn't solve the "resuming" part here. Will take a look at that though


### Evaluations via Rewind and Replay

```python
agent = Agent(tools=[tool1, tool2])
snapshot = agent.save_snapshot()

@poshinchen poshinchen Feb 2, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: what happens if an user calls save_snapshot() twice?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snapshots are completely independent and so "nothing other than a new snapshot is generated".


result1 = agent("What is the weather?")

# ...

agent2 = Agent(tools=[tool3, tool4])
Comment on lines +78 to +85

@mehtarac mehtarac Feb 2, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little unclear here. agent has tools tool1 and tool2. agent2, is initialized with tool3 and tool4. When agent2 loads the snapshot, does it have all the tools or just the tool1 and tool2

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the example, but it will have tool3 & tool4; no changes to tools are made

agent2.load_snapshot(snapshot)
Comment on lines +85 to +86

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider allowing for passing in the snapshot in the Agent init (as well)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed if we can make the priority clear enough; my only concern is potential confusion if a snapshot has properties that are different than what's in the constructor, it might not be clear that the snapshot takes priority.

Will experiment in the POC

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is the intended flow where we are creating a new instance, did you consider pros/cons of acting on the constructor?

Agent(snapshot=...)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we want to give customers the option to override specific data in a snapshot? So keep most things the same but try tweaking one value to see how the agent behaves.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have anything in here, but IMHO that would be better done on the agent itself when it's re-hydrated. E.g. a snapshot is a blob of stuff, but the agent is the api to adjust it

result2 = agent2("What is the weather?")
# ...
# Human/manual evaluation if one outcome was better than the other
# ...
```

### Advanced Context Management

```python
agent = Agent(conversation_manager=CompactingConversationManager())
snapshot = agent.save_snapshot(metadata={"checkpoint": "before_long_task"})

# ... later ...
later_agent = Agent(conversation_manager=CompactingConversationManager())
later_agent.load_snapshot(snapshot)
```

### Persisting Snapshots

```python
import json

agent = Agent(tools=[tool1, tool2])
agent("Remember that my favorite color is orange.")

# Save to file
snapshot = agent.save_snapshot(metadata={"user_id": "123"})
with open("snapshot.json", "w") as f:
json.dump(snapshot, f)
Comment on lines +114 to +115

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can have these as methods or as different snapshot implementations such as S3Snapshot etc and then users call snapshot.persist()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converters or serializers might make sense:

snapshot.persist(S3Persister())
snapshot.persist(FileSystem())

S3Snapshot maybe not so much because an agent would have to be aware of all snapshot types/classes


# Later, restore from file
with open("snapshot.json", "r") as f:
snapshot: Snapshot = json.load(f)

agent = Agent(tools=[tool1, tool2])
agent.load_snapshot(snapshot)
agent("What is my favorite color?") # "Your favorite color is orange."
```

### Edge cases

Restoring runtime behavior (e.g., tools) is explicitly not supported:

```python
agent1 = Agent(tools=[tool1, tool2])
snapshot = agent1.save_snapshot()
agent_no = Agent(snapshot) # tools are NOT restored
```

## Up for Debate

### What state should be included in a snapshot?

The current proposal includes:

- **messages** — conversation history
- **interrupt state** — internal state for paused/resumed interrupts
- **agent state** — custom application state (`agent.state`)
- **conversation manager state** — internal state of the conversation manager (but not the conversation manager itself)

This draws a distinction between "evolving state" (data that changes as the agent runs) and "agent definition" (configuration that defines what the agent *is*):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue system prompt is also evolving state. You can inject context at runtime to update system prompt, same with tools and conversation manager to be honest.

Thinking further, I don't think I like this distinction. Given meta-agents, skills, context maangement, etc. I'd argue this distinction is pretty blurry

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should snapshot all the info that we list below.

I'd do something like agent code is the superset, and snapshot is current data.

E.g. You might have tools in python definition of the code

Agent(tools=[a,b,c,d,e])

but your snapshot might include

tools: [d,e,f]

then the loaded agent should be tools=[d,e] (f would load because we dont know the definition)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SystemPrompt I'm convinced about - I think I overindexed too much on our current session manager which doesn't include it.

Tools, I addressed in other comments, but I don't believe we should include it until we have a solution which doesn't involve code-loading.

That said, we need an mechanism where we can evolve this going forward; will look into as part of the POC


| Evolving State (snapshotted) | Agent Definition (not snapshotted) |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned above, but I think this makes the assumption that all of these things are static when we have seen dynamic use cases for each of these (excluding conversation manager)

|------------------------------|-----------------------------------|
| messages | system_prompt |
| interrupt state | tools |
| agent state | model |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for if we do model, we should be persisting model configuration probably be a ModelSnapshot not an str

| conversation manager state | conversation_manager |

Further justification: these three properties are also what SessionManagement persists today, so this API aligns with existing behavior.

**Open question:** Is this the right boundary? Are there other properties that should be considered "evolving state"?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the right boundary, but I want to understand the devex a bit more for restoring "Agent Definition" after loading a snapshot. I get you can do this:

agent = Agent(tools=[tool1, tool2])
snapshot = agent.save_snapshot()

result = agent("What is the weather?")

# ...

agent = Agent(tools=[tool1, tool2])
agent.load_snapshot(snapshot)

Im thinking about defining custom serializers and deserializers you can pass into save_snapshot and load_snapshot, but I guess that doesnt really make sense since the customer can do that themselves anyway today like this:

snapshot = custom_serializer(agent)

agent = custom_deserializer(agent)

Maybe this is where AgentConfig comes in to save the day?

agent = Agent.from_config(config)
agent.load_snapshot(snapshot)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is where AgentConfig comes in to save the day?

This makes a lot of sense to me.

Though I do think we need a way to make it easier in devx to persist the snapshots - I commented about that in a separate comment and will POC it


## Consequences

**Easier:**
- Building evaluation frameworks with rewind/replay capabilities

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like timestamps are missing for rewind ^^

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might up-level this, but timestamps could be stored in metadata as long as we provide a way for snapshots to be extensible

- Implementing custom session management strategies
- Creating checkpoints during long-running agent tasks
- Cloning agents (load the same snapshot into multiple agent instances)
- Resetting agents to a known state (we do this manually for Graphs)

**More difficult:**
- N/A — this is an additive API

## Willingness to Implement

Yes