
docs: add Python data handling section#4378

Open
lennessyy wants to merge 3 commits into large-payload-prerelease from feat/python-data-handling

Conversation

@lennessyy
Contributor

@lennessyy lennessyy commented Apr 2, 2026

Summary

Test plan

  • Verify sidebar renders correctly with the new "Data handling" category
  • Verify all four new pages load and render properly
  • Confirm no broken links in the new section

🤖 Generated with Claude Code

Attachments: EDU-6148 docs: add Python data handling section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lennessyy lennessyy requested a review from a team as a code owner April 2, 2026 01:21
@vercel

vercel bot commented Apr 2, 2026

The latest updates on your projects:

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Error | Error | Apr 4, 2026 0:44am |


@github-actions
Contributor

github-actions bot commented Apr 2, 2026

📖 Docs PR preview links

| | [PayloadConverter](/develop/python/data-handling/data-conversion) | [PayloadCodec](/develop/python/data-handling/data-encryption) | [ExternalStorage](/develop/python/data-handling/large-payload-storage) |
| ------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **Purpose** | Serialize application data to bytes | Transform encoded payloads (encrypt, compress) | Offload large payloads to external store |
| **Must be deterministic** | Yes | No | No |
Contributor


Not sure about this line. What are we trying to help the user with here? And @jmaeagle99 could you take a look?

  • For the codec, I think we say that, due to content hashing, the codec should be deterministic for cases when the Workflow Task fails.

Contributor Author

@lennessyy lennessyy Apr 3, 2026


Ah, so this came from the TypeScript page: https://docs.temporal.io/develop/typescript/converters-and-encryption

When I was creating the table, I used the TS page, which had specific instructions on whether or not these components can access external services or employ non-deterministic modules. I think the main thing was to tell users they cannot do that in the payload converter, and thus cannot do any encryption there either.

If you think that line about codec is worth adding, we can change it. Otherwise, I'm okay with removing this row.

Contributor


I find it a bit abstract as well. Not sure it's doing much good in such a concise form so prominently in the doc. But when I'm actually building a custom payload converter, I'd like to know that it should be deterministic/not access network.

Contributor


I don't think having a "Must be deterministic? Yes/No" row explains much, and it might just create more questions. This information is more for the authors of converters, codecs, and storage drivers than for the authors of workflows. Workflow authors do have to think about determinism too, just a different kind of determinism.

I think there are two aspects to think about when talking about determinism of these things (converters, codecs, and external storage):

  • For a given input, the output should be reproducible when the operation is successful.
  • Whether the operation is allowed to fail.

For example, a payload converter cannot raise/throw/return errors. That is because converters run within the workflow code execution, so the workflow code could handle the error and compensate with another workflow command, which would cause workflow non-determinism on replay.

In Python, codecs can raise/throw/return errors. That is because they execute before the workflow code runs and after the workflow code has yielded; in either case, the workflow code has no ability to handle the error. Raising an error here causes the Workflow Task (WFT) to be retried and has no impact on workflow determinism. The same is allowed for external storage.


Of these three layers, only the PayloadConverter is required. Temporal uses a default PayloadConverter that handles JSON
serialization. The PayloadCodec and ExternalStorage layers are optional. You only need to customize these layers when
Contributor


Should we link to encyclopedia for external storage somewhere?

data_converter = dataclasses.replace(
    temporalio.converter.default(),
    external_storage=ExternalStorage(
        drivers=[MyStorageDriver()],
Contributor


Use LocalDiskStorageDriver here? (Is there a snipsync?)

Contributor Author


Yes, I will snipsync all the code blocks once all the content is approved.

## Configure payload size threshold

You can configure the payload size threshold that triggers external storage. By default, payloads larger than 256 KiB
are offloaded to external storage. You can adjust this with the `payload_size_threshold` parameter, or set it to 1 to
Contributor


Suggested change
are offloaded to external storage. You can adjust this with the `payload_size_threshold` parameter, or set it to 1 to
are offloaded to external storage. You can adjust this with the `payload_size_threshold` parameter, even setting it to 0 to
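For context, a configuration sketch of the threshold being discussed. This assumes the `ExternalStorage` import path and the exact placement of `payload_size_threshold` based only on the snippets in this PR; treat it as illustrative, not a verified API:

```python
import dataclasses

import temporalio.converter

# Assumption: ExternalStorage and MyStorageDriver come from the new
# large-payload module introduced in this PR; the import path and the
# keyword placement are placeholders, not confirmed SDK surface.
data_converter = dataclasses.replace(
    temporalio.converter.default(),
    external_storage=ExternalStorage(
        drivers=[MyStorageDriver()],
        # Offload payloads larger than 64 KiB instead of the default
        # 256 KiB; setting this to 0 would offload every payload.
        payload_size_threshold=64 * 1024,
    ),
)
```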

| ------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **Purpose** | Serialize application data to bytes | Transform encoded payloads (encrypt, compress) | Offload large payloads to external store |
| **Must be deterministic** | Yes | No | No |
| **Default** | JSON serialization | None (passthrough) | None (passthrough) |
Contributor


Suggested change
| **Default** | JSON serialization | None (passthrough) | None (passthrough) |
| **Default** | JSON serialization | None (passthrough) | None (all payloads will be stored in Workflow History) |

Contributor


I feel that "passthrough" is correct. It is whatever comes out of this "pipeline" of data handling is what is stored in workflow history and shouldn't be tied to the external storage step.

Contributor

@jmaeagle99 jmaeagle99 left a comment


I've left a handful of feedback. Nothing blocking and could always be addressed later, if need be.

@@ -0,0 +1,252 @@
---
id: large-payload-storage
Contributor


"external storage" and "large payload storage" are being used inconsistently throughout these docs. I think we should stick with one, namely external storage.


### Prerequisites

- An Amazon S3 bucket that you have write access to. Refer to [lifecycle management](/external-storage#lifecycle) to
Contributor


Suggested change
- An Amazon S3 bucket that you have write access to. Refer to [lifecycle management](/external-storage#lifecycle) to
- An Amazon S3 bucket that you have read and write access to. Refer to [lifecycle management](/external-storage#lifecycle) to

If you want to be even more prescriptive, the identity needs at least s3:PutObject and s3:GetObject. It would be unlikely that you can get away with just s3:PutObject.


- An Amazon S3 bucket that you have write access to. Refer to [lifecycle management](/external-storage#lifecycle) to
ensure that your payloads remain available for the entire lifetime of the Workflow.
- The `aioboto3` library is installed and available.
Contributor


The Python SDK has an extra that installs this library (and its type stubs):

python -m pip install "temporalio[aioboto3]"

os.makedirs(self._store_dir, exist_ok=True)

prefix = self._store_dir
sc = context.serialization_context
Contributor


FYI, this is changing in this PR. Haven't been able to merge it yet due to failures impacting the repository.

Store payloads durably so that they survive process crashes and remain available for debugging and auditing after the
Workflow completes. Refer to [lifecycle management](/external-storage#lifecycle) for retention requirements.

The following example shows a complete custom driver implementation that uses local disk as the backing store:
Contributor


Should we caveat that this example should not be used in production? It works for local development and demoing on one machine, but would not work in multi-worker environments.
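To make the caveat concrete, here is a rough sketch of the kind of local-disk store being discussed. The class name and the `store`/`retrieve` methods are hypothetical, not the SDK's actual storage driver interface; note that the keys only resolve on the machine that wrote them, which is exactly why this pattern breaks with multiple workers:

```python
import hashlib
import os


class LocalDiskStore:
    """Toy content-addressed store on local disk.

    Illustration only: single-machine, no durability guarantees, and
    not the actual temporalio storage driver interface.
    """

    def __init__(self, store_dir: str) -> None:
        self._store_dir = store_dir
        os.makedirs(self._store_dir, exist_ok=True)

    def store(self, data: bytes) -> str:
        # Content-addressed key: identical payloads share one file.
        key = hashlib.sha256(data).hexdigest()
        with open(os.path.join(self._store_dir, key), "wb") as f:
            f.write(data)
        return key

    def retrieve(self, key: str) -> bytes:
        # Only works on the machine that wrote the file.
        with open(os.path.join(self._store_dir, key), "rb") as f:
            return f.read()
```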

and [Payload Codec](/develop/python/data-handling/data-encryption) before it reaches the driver.
See the [Components of a Data Converter](/dataconversion#data-converter-components) for more details.

Return a `StorageDriverClaim` for each payload with enough information to retrieve it later. Structure your storage keys
Contributor


I think how driver authors want to structure their keys is up to them. They could just use CAS and skip prefixing entirely if they don't need sophisticated lifecycle management. So I think these are recommendations rather than requirements.
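A small sketch of the two key schemes mentioned here, assuming nothing about the SDK (both helper names are hypothetical): plain content addressing versus a workflow-scoped prefix that supports lifecycle management:

```python
import hashlib


def content_key(data: bytes) -> str:
    # Pure content addressing (CAS): no prefix, identical payloads
    # dedupe to one object, but no per-workflow lifecycle handle.
    return hashlib.sha256(data).hexdigest()


def prefixed_key(workflow_id: str, data: bytes) -> str:
    # Workflow-scoped prefix: lets a lifecycle policy expire all of a
    # Workflow's payloads by listing or deleting one prefix.
    return f"{workflow_id}/{hashlib.sha256(data).hexdigest()}"
```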


## Use multiple storage drivers

When you have multiple drivers, such as for hot and cold storage tiers, pass a `driver_selector` function that chooses
Contributor


I'm thinking we might be able to use better examples of why you'd have multiple drivers:

  • Your worker needs to support receiving workflow starts created by far clients that don't use the driver you prefer for your worker. Register that far client's driver and your preferred driver, and use the selector to always pick your driver.
  • Maybe some of your workflows could be optimized with local caching (like Redis) instead of going to a far storage service; you'd be trading durability for lower latency, but maybe that workflow type is allowed to be less durable. Register your Redis driver and S3 driver, and use the selector to pick based on workflow type (coming in this PR).
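A sketch of the second bullet, with an assumed selector shape (the real `driver_selector` signature isn't shown in this excerpt, and workflow-type selection is noted as still landing separately):

```python
# Hypothetical signature: assume the selector receives the workflow
# type and the registered driver instances, and returns the one to use.
LOW_DURABILITY_TYPES = {"RenderPreview", "LiveQuote"}  # example names


def driver_selector(workflow_type: str, redis_driver, s3_driver):
    # Latency-sensitive workflow types accept weaker durability and use
    # the local Redis cache; everything else goes to durable S3 storage.
    if workflow_type in LOW_DURABILITY_TYPES:
        return redis_driver
    return s3_driver
```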

@lennessyy lennessyy changed the base branch from main to large-payload-prerelease April 4, 2026 00:17