TS Safe pause and resume #1017

Merged
merged 27 commits into from
Apr 4, 2025
137cf99
First draft - TS Safe pause and resume
Feediver1 Mar 18, 2025
e3f177c
Merge branch 'beta' into Doc-936
Feediver1 Mar 18, 2025
dce02ab
Merge branch 'beta' into Doc-936
JakeSCahill Mar 19, 2025
4e1c1a7
Apply suggestions from code review
Feediver1 Mar 19, 2025
bb8f3b5
Merge branch 'beta' into Doc-936
Feediver1 Mar 19, 2025
487551e
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 19, 2025
4186b1b
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 20, 2025
ca058cd
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 20, 2025
5a9eba5
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 20, 2025
997755f
Apply suggestions from code review
Feediver1 Mar 20, 2025
b4d8b76
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 21, 2025
45c3e44
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 25, 2025
2e7cf72
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 25, 2025
31bb213
Merge branch 'beta' into Doc-936
Feediver1 Mar 25, 2025
b20f348
Apply suggestions from code review
Feediver1 Mar 30, 2025
27658ec
Merge branch 'beta' into Doc-936
Feediver1 Mar 30, 2025
7626277
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 30, 2025
7df59ba
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 31, 2025
8f251db
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 31, 2025
5058f4a
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Mar 31, 2025
4cffb87
Merge branch 'beta' into Doc-936
Feediver1 Apr 4, 2025
b14d53e
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Apr 4, 2025
5460f91
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Apr 4, 2025
711cf0c
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Apr 4, 2025
ea46d3f
Apply suggestions from code review
Feediver1 Apr 4, 2025
06bdd62
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Apr 4, 2025
984ebbe
Update modules/manage/partials/tiered-storage.adoc
Feediver1 Apr 4, 2025
70 changes: 70 additions & 0 deletions modules/manage/partials/tiered-storage.adoc
@@ -1133,6 +1133,76 @@ rpk topic alter-config <topic_name> --set redpanda.remote.read=true

See also: xref:{topic-recovery-link}[Topic Recovery], xref:manage:kubernetes/k-remote-read-replicas.adoc[Remote Read Replicas]

== Pause and resume uploads

IMPORTANT: Redpanda strongly recommends using pause and resume only under the guidance of https://support.redpanda.com/hc/en-us/requests/new[Redpanda Support^] or a member of your account team.

Starting in version 25.1, you can troubleshoot issues your cluster has interacting with object storage by pausing and resuming uploads. Pausing and resuming this way carries no risk to data consistency and no risk of data loss. To pause or resume segment uploads to object storage, set the xref:reference:properties/object-storage-properties.adoc#cloud_storage_enable_segment_uploads[`cloud_storage_enable_segment_uploads`] configuration property (default is `true`). Set it to `false` to pause uploads and back to `true` to resume them; segment uploads then proceed without further intervention.

While uploads are paused, data accumulates locally, which can lead to full disks if the pause is prolonged. If the disks fill, Redpanda throttles produce requests and rejects new Kafka produce requests to prevent data from being written. Pausing uploads also pauses object storage housekeeping, meaning segments are neither uploaded to nor removed from object storage. However, it is still possible to consume data from object storage while uploads are paused.
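
Because a prolonged pause can fill the disks, it may help to watch local disk usage while uploads are paused. The following is a minimal sketch, not part of Redpanda: the helper names `disk_use_percent` and `disk_nearly_full` are hypothetical, and the data directory path and the 80% threshold in the usage comment are assumptions to adjust for your deployment.

```bash
# Hypothetical helper: read `df -P` output on stdin and print the use
# percentage (without the % sign) of the first filesystem listed.
disk_use_percent() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed (exit 0) when usage is at or above a threshold.
disk_nearly_full() {
  local used="$1" threshold="$2"
  [ "$used" -ge "$threshold" ]
}

# Example usage; /var/lib/redpanda/data is an assumed data directory:
# used=$(df -P /var/lib/redpanda/data | disk_use_percent)
# if disk_nearly_full "$used" 80; then echo "resume uploads soon: ${used}% used"; fi
```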

When you set `cloud_storage_enable_segment_uploads` to `false`, all in-flight segment uploads complete, but no new segment uploads begin until the value is set back to `true`. During this pause, Tiered Storage enforces consistency by ensuring that no segment in local storage is deleted until it successfully uploads to object storage. This means that when uploads are resumed, no user intervention is needed, and no data gaps are created.

Use the `redpanda_cloud_storage_paused_archivers` metric to monitor the status of paused uploads. It displays a non-zero value whenever uploads are paused.
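
As a rough sketch of how that check might be scripted, the following helper sums every sample of the metric from Prometheus-format text on standard input. The function name `paused_archivers_total` is hypothetical, and the usage comment assumes the public metrics endpoint is reachable at `localhost:9644/public_metrics`; adjust both for your deployment.

```bash
# Hypothetical helper: sum all samples of redpanda_cloud_storage_paused_archivers
# found in Prometheus text-format metrics read from stdin.
# Assumes each sample line ends with the value (no trailing timestamp).
paused_archivers_total() {
  awk '
    # Skip "# HELP" and "# TYPE" comment lines.
    /^#/ { next }
    # Match the metric name followed by a label set or a space.
    /^redpanda_cloud_storage_paused_archivers[{ ]/ { sum += $NF }
    END { print sum + 0 }
  '
}

# Example usage (assumed endpoint; adjust host and port for your cluster):
# curl -s http://localhost:9644/public_metrics | paused_archivers_total
```

A result of `0` suggests that no uploads are paused; any other value suggests that at least one is.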

[WARNING]
====
Do not use `redpanda.remote.read` or `redpanda.remote.write` to pause and resume segment uploads. Doing so can lead to a gap between local data and data in object storage. In such cases, it is possible that the oldest segment is not aligned with the last uploaded segment. Given that these settings are unsafe, if you choose to set either `redpanda.remote.write` or the cluster configuration setting `cloud_storage_enable_remote_write` to `false`, you receive a warning:

[source,bash]
----
Warning: disabling Tiered Storage may lead to data loss. If you only want to pause Tiered Storage temporarily, use the `cloud_storage_enable_segment_uploads` option. Abort?
# The default is Yes.
----
====


The following example shows a simple pause and resume with no gaps allowed:

```bash
rpk cluster config set cloud_storage_enable_segment_uploads false
# Segments are not uploaded to cloud storage, and cloud storage housekeeping is not running.
# The new data added to the topics with Tiered Storage is not deleted from disk
# because it can't be uploaded. The disks may fill up eventually.
# If the disks fill up, produce requests will be rejected.
...

rpk cluster config set cloud_storage_enable_segment_uploads true
# At this point the uploads should resume seamlessly and
# there should not be any data loss.
```

For some applications, the newest data is more valuable than historical data, so rejecting produce requests once the local disks fill can be worse than losing older data that has not yet been uploaded. For these cases, there is a less safe pause and resume mechanism that prioritizes the ability to receive new data over retaining data that cannot be uploaded when disks are full. To enable it, set one of the following properties:

- Set the xref:reference:properties/object-storage-properties.adoc#cloud_storage_enable_remote_allow_gaps[`cloud_storage_enable_remote_allow_gaps`] cluster configuration property to `true`. This allows for gaps in the logs of all Tiered Storage topics in the cluster.
- Set the `redpanda.remote.allowgaps` topic configuration property to `true`. This allows gaps for one specific topic. This topic-level configuration option overrides the cluster-level default.

When you pause uploads and set one of these properties to `true`, there may be gaps in the range of offsets stored in object storage. With gaps allowed at either the cluster or topic level, uploads still resume seamlessly. If gaps are not allowed, disk space could be depleted during the pause and produce requests would be throttled.

The following example shows how to pause and resume Tiered Storage uploads while allowing for gaps:

```bash
rpk cluster config set cloud_storage_enable_segment_uploads false
# Segment uploads are paused and cloud storage housekeeping is not running.
# New data is stored on the local volume, which may overflow.
# To avoid overflow, allow gaps in the log.
# In this example, data that is not uploaded to cloud storage may be
# deleted if a disk fills before uploads are resumed.

rpk topic alter-config <topic_name> --set redpanda.remote.allowgaps=true
# Uploads are paused and gaps are allowed. Local retention is allowed
# to delete data before it's uploaded, therefore some data loss is possible.
...

rpk cluster config set cloud_storage_enable_segment_uploads true
# Uploads are resumed but there could be gaps in the offsets.
# Wait until you see that the `redpanda_cloud_storage_paused_archivers`
# metric is equal to zero, indicating that uploads have resumed.

# Disable the gap allowance previously set for the topic.
rpk topic alter-config <topic_name> --set redpanda.remote.allowgaps=false
```
```
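
The "wait until the metric is zero" step in the example above could be scripted along the following lines. This is only a sketch: `wait_until_resumed` is a hypothetical helper, and it expects you to supply a command of your choosing that prints the current value of `redpanda_cloud_storage_paused_archivers`.

```bash
# Hypothetical helper: run a command that prints the paused-archiver count,
# retrying until it prints 0 or the attempts run out.
# Usage: wait_until_resumed <max_attempts> <sleep_seconds> <command...>
wait_until_resumed() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # "$@" is the caller-supplied command; its output is compared to "0".
    if [ "$("$@")" = "0" ]; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example usage; print_paused_archivers is a placeholder for your own command:
# wait_until_resumed 30 10 print_paused_archivers \
#   && rpk topic alter-config <topic_name> --set redpanda.remote.allowgaps=false
```

Bounding the number of attempts keeps the script from hanging indefinitely if uploads do not resume.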

== Caching

When a consumer fetches an offset range that isn't available locally in the Redpanda data directory, Redpanda downloads remote segments from object storage. These downloaded segments are stored in the object storage cache.