S3 client can stall on PutObject after long idle gap (pooled connection reuse / configuration)

## Summary

`S3Client::put_object().send().await` can **appear to hang** when the **same client** is used for `get_object`, then a **long period of local work** (no HTTP I/O), then `put_object` to the same endpoint. The future may not complete until an **operation-level timeout** fires (e.g. `TimeoutConfig::operation_timeout`), with errors like **request has timed out**.

This is **most plausibly explained by HTTP connection pooling**: the client **reuses** an idle pooled connection that the **peer or path has already closed** (or half-closed), while the pool’s **idle retention** (Hyper’s default is **90s** when `pool_timer` is configured) may still treat the connection as reusable. That is a **client lifecycle / default tuning** issue relative to server-side idle behavior, not necessarily incorrect `PutObject` logic.

`get_object` can succeed; the problematic phase is **reuse after a gap**.
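The reuse decision described above can be sketched with a toy pool using only the standard library. This is an illustration under stated assumptions, not the hyper or Smithy implementation: `IdlePool`, `IdleEntry`, and `reused_after_gap` are hypothetical names, and the pool only knows how long a connection has been idle, not whether the peer has closed it in the meantime.

```rust
use std::time::{Duration, Instant};

/// A connection plus the moment it was returned to the pool.
struct IdleEntry<C> {
    conn: C,
    idle_since: Instant,
}

/// Hypothetical single-host pool that evicts purely on idle age.
struct IdlePool<C> {
    entries: Vec<IdleEntry<C>>,
    idle_timeout: Duration,
}

impl<C> IdlePool<C> {
    fn new(idle_timeout: Duration) -> Self {
        Self { entries: Vec::new(), idle_timeout }
    }

    /// Return a connection to the pool, recording when it became idle.
    fn release(&mut self, conn: C, now: Instant) {
        self.entries.push(IdleEntry { conn, idle_since: now });
    }

    /// Hand out a connection that has been idle for less than `idle_timeout`.
    /// Expired entries are dropped. Nothing here checks whether the peer
    /// has already closed the socket.
    fn checkout(&mut self, now: Instant) -> Option<C> {
        while let Some(entry) = self.entries.pop() {
            if now.saturating_duration_since(entry.idle_since) < self.idle_timeout {
                return Some(entry.conn);
            }
        }
        None
    }
}

/// Is a connection released at t = 0 handed out again after `gap`?
fn reused_after_gap(idle_timeout: Duration, gap: Duration) -> bool {
    let t0 = Instant::now();
    let mut pool = IdlePool::new(idle_timeout);
    pool.release("conn-1", t0);
    pool.checkout(t0 + gap).is_some()
}
```

With a 90s idle timeout, `reused_after_gap` returns `true` for a 30s gap, so a connection the server dropped at, say, 20s would still be handed out. With a timeout shorter than the gap it returns `false` and the caller has to open a new connection.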

## Likely classification

- **Application configuration**: Workloads with **large gaps** between sequential calls on one `S3Client` should consider **shorter `pool_idle_timeout`** on a custom HTTP connector, **operation timeouts**, a **fresh client** (or connector) for the upload phase, or **not sharing** one client across long CPU-bound sections.
- **Documentation gap**: The AWS Rust SDK could **call out** this pattern explicitly (long idle between requests → tune pool idle or avoid reuse).
- **Optional product change**: Whether the **default** pool idle timeout should be more conservative than Hyper’s 90s is a **tradeoff** (fewer dead reuse attempts vs more connection churn) for maintainers to decide, not a given “bug fix.”

## Environment

- **Runtime**: Linux x86_64 (Docker: CUDA base image + Rust binary from `rust:1-bookworm`)
- **Region**: `us-west-2`
- **Credentials**: IAM task role (AWS Batch on EC2)
- **Dependencies observed**: `aws-sdk-s3` resolves to **1.126.x** / **1.127.x** when unconstrained; the hang was also seen with `aws-sdk-s3` pinned to `<1.127` (**1.126.x**). Client built from `aws_config::defaults(BehaviorVersion::latest()).load().await` + `aws_sdk_s3::Client::new(&cfg)`.

An older production image built around **2026-03-11** (older resolved SDK versions) did not show this for the **same** application pattern.

## Reproduction sketch

1. Single `aws_sdk_s3::Client` from default config.
2. `get_object`, consume body (success).
3. **30 seconds or more** of CPU-heavy work **without** using the S3 client.
4. `put_object` with `ByteStream::from_path(...)` and `.send().await`.

**Observed**: `.await` blocks for a long time; with `TimeoutConfig::operation_timeout` set (e.g. 300s), the call eventually fails with **request has timed out**.

**Expected (from an app perspective)**: Either a successful upload, or a **fast, clear** connection error followed by a retry on a **new** connection, so that users do not have to discover pool behavior through a production incident.
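To make the "fast, clear failure, then a new connection" expectation concrete, here is a minimal standard-library sketch of a best-effort liveness probe on an idle TCP connection. It is not what the SDK or hyper does internally. `looks_alive` and `demo` are hypothetical names, and the local listener only stands in for a server that closes idle connections.

```rust
use std::io::{self, ErrorKind};
use std::net::{TcpListener, TcpStream};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Best-effort check: has the peer closed this idle connection?
/// A non-blocking `peek` returns Ok(0) after an orderly close and
/// `WouldBlock` while the connection is open with nothing to read.
fn looks_alive(stream: &TcpStream) -> io::Result<bool> {
    stream.set_nonblocking(true)?;
    let mut buf = [0u8; 1];
    let result = stream.peek(&mut buf);
    stream.set_nonblocking(false)?;
    Ok(match result {
        Ok(0) => false, // peer sent FIN
        Ok(_) => true,  // unread data is pending
        Err(e) if e.kind() == ErrorKind::WouldBlock => true, // idle but open
        Err(_) => false, // reset or other socket error
    })
}

/// Returns (stale connection looks alive, fresh connection looks alive).
fn demo() -> io::Result<(bool, bool)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let (closed_tx, closed_rx) = mpsc::channel();
    let (done_tx, done_rx) = mpsc::channel::<()>();

    let server = thread::spawn(move || {
        // First connection: closed by the server while the client is busy.
        let (first, _) = listener.accept().unwrap();
        drop(first);
        closed_tx.send(()).unwrap();
        // Second connection: held open until the client has probed it.
        let (_second, _) = listener.accept().unwrap();
        let _ = done_rx.recv();
    });

    let pooled = TcpStream::connect(addr)?;
    closed_rx.recv().unwrap();
    thread::sleep(Duration::from_millis(100)); // stand-in for the long local work
    let stale_alive = looks_alive(&pooled)?;

    // The probe failed quickly, so open a new connection instead of reusing.
    let fresh = TcpStream::connect(addr)?;
    let fresh_alive = looks_alive(&fresh)?;

    let _ = done_tx.send(());
    server.join().unwrap();
    Ok((stale_alive, fresh_alive))
}
```

The probe is inherently racy: a connection can still die between the check and the next write, and a path that silently drops packets without delivering a FIN or RST is not detected. It narrows the window for a dead reuse but does not replace timeouts and retries.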

## Additional observations

- **`aws s3 cp`** to the same bucket from the same task succeeds quickly, consistent with **fresh connection / different stack**, not IAM or S3 outage.
- Pinning **`aws-sdk-s3` < 1.127** alone did **not** resolve the stall in our tests, which is consistent with the issue sitting **below the S3 crate** (HTTP pool / hyper) rather than being a regression in a single crate version.
- **Mitigations that worked for us**: `TimeoutConfig` on the config loader; performing uploads via **`aws s3 cp`** subprocess; optionally **new client** for upload phase.

## Suggested follow-ups for maintainers

1. **Docs**: Add guidance for **long idle gaps** between calls on one client (tune `pool_idle_timeout` via custom HTTP client, or separate clients / phases).
2. **Examples**: Optional example of `aws_config` + custom `aws_smithy_http_client::Connector` with explicit `pool_idle_timeout`.
3. **Defaults**: Evaluate whether Smithy’s default when `pool_idle_timeout` is unset should differ from Hyper’s 90s (tradeoff vs dead reuse). This should **not** be assumed without measurement.
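The tradeoff in item 3 can be illustrated with a small back-of-the-envelope model. This is a sketch under loud assumptions: `simulate` is a hypothetical helper, the 20s server-side idle cutoff is an invented number (the actual server or network-path behavior is not known from this report), and each request is assumed to follow the previous one after the given idle gap on a single connection.

```rust
/// Counts (stale reuse attempts, connections opened) for a sequence of idle
/// gaps, given a client pool idle timeout and an assumed server idle cutoff.
/// All values are in seconds.
fn simulate(gaps_secs: &[u64], pool_idle_secs: u64, server_idle_secs: u64) -> (u32, u32) {
    let mut stale_attempts = 0;
    let mut opened = 1; // the first request always opens a connection
    for &gap in gaps_secs {
        if gap >= pool_idle_secs {
            // The pool already evicted the connection: clean reconnect.
            opened += 1;
        } else if gap >= server_idle_secs {
            // The pool still trusts the connection but the server closed it:
            // one failed attempt, then a reconnect.
            stale_attempts += 1;
            opened += 1;
        }
        // Otherwise both sides still consider the connection usable.
    }
    (stale_attempts, opened)
}
```

For gaps of 5s, 18s, 30s, and 120s with the assumed 20s server cutoff, `simulate(&[5, 18, 30, 120], 90, 20)` gives one stale attempt and three connections, while a 15s pool timeout, `simulate(&[5, 18, 30, 120], 15, 20)`, gives zero stale attempts and four connections. The shorter timeout avoids the dead reuse at the cost of one reconnect (after the 18s gap) that would otherwise have been a valid reuse.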

## References

- Related prototype note (default pool idle): discussion in this thread.
- Internal context: OCR pipeline — download PDF → GPU/CPU processing → upload SQLite + PDF.

Thank you for considering documentation and defaults; we’re happy to help validate doc changes or a minimal repro crate if useful.

