[Help]: recommended pattern for durable agent runs with workflow-visible LLM/tool steps #17701

Hi Mastra team,

I’m trying to design a durable agent runtime on top of Mastra, AI SDK v5, and a workflow runtime, and I’m not sure what the recommended boundary should be.

The straightforward pattern today is to wrap a whole Mastra agent run in a single workflow step, conceptually something like:

```ts
workflow.step("runAgentChatStep", async () => {
  return await mastra.getAgent("...").stream(...)
})
```

That works for the happy path, but it makes the internal LLM/tool loop invisible to the workflow engine. My understanding from the v5 workflow docs is that a step can be retried by default, so if the entire agent run is one retryable step, the workflow engine cannot distinguish between:

- an LLM turn that was already streamed to the user,
- a tool call that already produced an external side effect,
- background polling / follow-up workflows that were already scheduled,
- persisted stream chunks or run journal entries that were already written,
- and the next durable agent step that still needs to continue.

This is not necessarily a bug today, but it would explain why side effects, polling workflows, and stream chunk persistence become consistency risks whenever a step throws, is retried, or is resumed.
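To make the concern concrete, here is a minimal simulation of the retry hazard. It does not use Mastra or the AI SDK at all: `runWithRetry`, `makeOpaqueAgentStep`, and `makeKeyedAgentStep` are hypothetical stand-ins I made up for illustration, and the "email" is just an entry pushed to an array. It assumes the engine simply re-invokes the step function after a failure.

```ts
// Hypothetical stand-ins; none of these names come from Mastra or the AI SDK.
type SideEffectLog = string[];

// Re-invokes `step` until it succeeds or `maxAttempts` is reached,
// roughly how a retrying workflow engine might treat one step.
function runWithRetry<T>(step: (attempt: number) => T, maxAttempts: number): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return step(attempt);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// A whole "agent run" as one opaque step: the tool call happens first,
// then something later in the run fails on the first attempt.
function makeOpaqueAgentStep(emails: SideEffectLog) {
  return (attempt: number): string => {
    emails.push("send-email"); // external side effect from a tool call
    if (attempt === 1) throw new Error("stream dropped after tool call");
    return "done";
  };
}

// Same run, but the tool call is guarded by an idempotency key that is
// stable across retries, so the side effect is applied at most once.
function makeKeyedAgentStep(emails: SideEffectLog, seenKeys: Set<string>) {
  return (attempt: number): string => {
    const key = "run-1:tool-call-1";
    if (!seenKeys.has(key)) {
      seenKeys.add(key);
      emails.push("send-email");
    }
    if (attempt === 1) throw new Error("stream dropped after tool call");
    return "done";
  };
}

const opaqueEmails: SideEffectLog = [];
const opaqueResult = runWithRetry(makeOpaqueAgentStep(opaqueEmails), 3);

const keyedEmails: SideEffectLog = [];
const keyedResult = runWithRetry(makeKeyedAgentStep(keyedEmails, new Set<string>()), 3);

console.log(opaqueEmails.length, keyedEmails.length);
```

Both runs eventually return `"done"`, but the opaque step sends the email twice while the keyed one sends it once. The engine cannot tell these apart, because all it sees is one step that failed and then succeeded.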

What I’m looking for is closer to a `DurableAgent` pattern where the workflow can see and checkpoint the agent loop at meaningful boundaries, for example:

1. call model / produce tool calls,
2. persist assistant delta or final message chunks,
3. execute tool calls with idempotency keys,
4. persist tool results,
5. decide whether to continue, suspend, or finish.
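The boundaries above could be sketched roughly as follows. This is only an illustration of the shape I have in mind, not Mastra's or the AI SDK's API: `advance`, `JournalEntry`, `Model`, and `Tool` are hypothetical names, the model and tool are synchronous fakes to keep the example short, and it assumes tool call IDs are unique within a run. Each call to `advance` moves the run forward by exactly one checkpoint, so a workflow engine could persist the journal between calls and retry or resume any single call safely.

```ts
// All names below are hypothetical; this is not Mastra's or the AI SDK's API.
type ToolCall = { id: string; name: string; args: string };
type AssistantEntry = { kind: "assistant"; turn: number; text: string; toolCalls: ToolCall[] };
type ToolResultEntry = { kind: "toolResult"; turn: number; callId: string; result: string };
type JournalEntry = AssistantEntry | ToolResultEntry;

type Model = (journal: JournalEntry[]) => { text: string; toolCalls: ToolCall[] };
type Tool = (call: ToolCall, idempotencyKey: string) => string;

// Advances the run by at most one checkpoint and reports what to do next.
function advance(runId: string, journal: JournalEntry[], model: Model, tool: Tool): "continue" | "finish" {
  let lastAssistant: AssistantEntry | undefined;
  const completed = new Set<string>();
  for (const entry of journal) {
    if (entry.kind === "assistant") lastAssistant = entry;
    else completed.add(entry.callId);
  }

  if (lastAssistant) {
    // Step 5: an assistant turn without tool calls ends the run.
    if (lastAssistant.toolCalls.length === 0) return "finish";
    const pending = lastAssistant.toolCalls.find((c) => !completed.has(c.id));
    if (pending) {
      // Steps 3 and 4: run one tool call with a stable key, then persist its result.
      const result = tool(pending, `${runId}:${pending.id}`);
      journal.push({ kind: "toolResult", turn: lastAssistant.turn, callId: pending.id, result });
      return "continue";
    }
  }

  // Steps 1 and 2: call the model and persist the assistant message.
  const turn = lastAssistant ? lastAssistant.turn + 1 : 1;
  const reply = model(journal);
  journal.push({ kind: "assistant", turn, text: reply.text, toolCalls: reply.toolCalls });
  return reply.toolCalls.length === 0 ? "finish" : "continue";
}

// Fake model: asks for one tool call, then answers once a result is journaled.
const fakeModel: Model = (journal) =>
  journal.some((e) => e.kind === "toolResult")
    ? { text: "Order 42 has shipped.", toolCalls: [] }
    : { text: "", toolCalls: [{ id: "call-1", name: "lookupOrder", args: "42" }] };

// Fake tool: records the idempotency key it was given.
const keysSeen: string[] = [];
const fakeTool: Tool = (_call, idempotencyKey) => {
  keysSeen.push(idempotencyKey);
  return "shipped";
};

const journal: JournalEntry[] = [];
let steps = 0;
let status: "continue" | "finish" = "continue";
while (status === "continue") {
  status = advance("run-1", journal, fakeModel, fakeTool);
  steps++;
}
console.log(steps, journal.map((e) => e.kind));
```

In a real setup each `advance` call would be its own workflow step, and the tool implementation would pass the idempotency key through to the external system. Whether Mastra exposes enough of the agent loop to drive it this way is exactly what I am unsure about.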

Questions:

1. Is wrapping a whole Mastra agent run inside a single workflow step considered an expected pattern, or should users avoid that for durable/long-running agents?
2. Does Mastra have, or plan to have, a first-class durable agent abstraction where the LLM/tool loop is decomposed into workflow-visible steps?
3. If not, is there a recommended architecture for combining Mastra Agents with workflow step retry/resume semantics, especially around side effects and streaming?
4. Are there existing hooks/events in Mastra that are stable enough to build this ourselves without forking the agent loop?

I’m happy to provide a more concrete example if useful. The main design question is where Mastra thinks the durability boundary should live: around the whole agent run, or around each LLM/tool iteration inside the agent run.
