[FEAT] Improve Python SDK reliability, retries, and failure diagnostics

The Python SDK can currently surface generic or low-context errors when encountering
common failure scenarios such as server unavailability, timeouts, or partial workflow
registration failures. In both local development and production environments, this
can make it difficult for users to quickly determine whether an issue is caused by
connectivity, configuration problems, or transient infrastructure issues.

I’d like to improve the reliability and debuggability of the Python SDK by making
these failure modes more explicit and easier to reason about. This would include
centralizing request error handling, introducing clearer and more actionable
exceptions for common scenarios, adding configurable timeout and retry behavior
where appropriate, and improving healthcheck robustness while preserving backward
compatibility with existing SDK usage.
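
To make the proposal concrete, here is a rough sketch of what centralized error translation with configurable retries could look like. It is not based on the SDK's current internals. Every name in it (`SDKError`, `ServerUnavailableError`, `RequestTimeoutError`, `RetryConfig`, `call_with_retries`) is hypothetical, and it assumes that transport failures surface as Python's built-in `ConnectionError` and `TimeoutError`.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class SDKError(Exception):
    """Hypothetical base class for errors raised by the SDK."""


class ServerUnavailableError(SDKError):
    """The server could not be reached (connectivity or configuration)."""


class RequestTimeoutError(SDKError):
    """The server did not respond in time."""


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical retry settings; the defaults are placeholders."""
    max_attempts: int = 3
    base_delay_seconds: float = 0.5
    backoff_factor: float = 2.0


def _translate(exc: Exception, attempts: int) -> SDKError:
    """Map a low-level transport error to an actionable SDK exception."""
    if isinstance(exc, TimeoutError):
        return RequestTimeoutError(
            f"request timed out after {attempts} attempt(s); "
            "consider a longer timeout or check server load"
        )
    return ServerUnavailableError(
        f"could not reach the server after {attempts} attempt(s); "
        "check the configured host/port and network connectivity"
    )


def call_with_retries(
    request: Callable[[], T],
    config: RetryConfig = RetryConfig(),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `request`, retrying transient failures with exponential backoff."""
    if config.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = config.base_delay_seconds
    for attempt in range(1, config.max_attempts + 1):
        try:
            return request()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == config.max_attempts:
                # Out of attempts: raise the clearer error, keep the cause.
                raise _translate(exc, attempt) from exc
            sleep(delay)
            delay *= config.backoff_factor
    raise AssertionError("unreachable")
```

Existing call sites could keep their current behavior by default and opt in to retries by passing a `RetryConfig`, which is one way to preserve backward compatibility. The `sleep` parameter is injectable only so the backoff schedule can be tested without waiting.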

While documenting common failure cases is an alternative, improving behavior directly
in the SDK provides better defaults and faster feedback for users running Hatchet
workflows in production. I’m happy to approach this incrementally and align on scope
before implementation. I’d also like to confirm that this is an area where
contributions and potential longer-term ownership would be welcome.



[FEAT] Improve Python SDK reliability, retries, and failure diagnostics #2872
