TS Safe pause and resume #1017

Open
wants to merge 14 commits into
base: beta

@Feediver1 Feediver1 commented Mar 18, 2025

Description

Resolves Doc-936
Review deadline: Thursday, March 20

Note: Once the metrics and configuration properties become available in the doc properties, I will add the links to the appropriate places.

Page previews

Safe pause and resume

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)

@Feediver1 Feediver1 requested a review from a team as a code owner March 18, 2025 20:20

netlify bot commented Mar 18, 2025

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 31bb213
🔍 Latest deploy log https://app.netlify.com/sites/redpanda-docs-preview/deploys/67e312fa807f4a0008ef2af7
😎 Deploy Preview https://deploy-preview-1017--redpanda-docs-preview.netlify.app

hyperlint-ai bot commented Mar 18, 2025

PR Change Summary

Introduced the Safe Pause and Resume feature for Tiered Storage, allowing users to manage uploads to cloud storage without risking data loss or inconsistency.

  • Added documentation for the Safe Pause and Resume feature in Tiered Storage.
  • Explained the configuration properties for pausing and resuming uploads.
  • Provided guidelines for monitoring paused uploads and managing data consistency.

Modified Files

  • modules/manage/partials/tiered-storage.adoc


@Feediver1 Feediver1 changed the title from "First draft - TS Safe pause and resume" to "TS Safe pause and resume" Mar 18, 2025
@micheleRP

@Feediver1 reminder to add a blurb for this in the What's New!

@Feediver1 Feediver1 requested a review from daisukebe March 21, 2025 12:33
@@ -1133,6 +1133,65 @@ rpk topic alter-config <topic_name> --set redpanda.remote.read=true

See also: xref:{topic-recovery-link}[Topic Recovery], xref:manage:kubernetes/k-remote-read-replicas.adoc[Remote Read Replicas]

== Safe pause and resume

Starting in version 25.1, when running Tiered Storage, you can safely pause and resume uploads to cloud storage without risking data consistency or loss. To pause or resume segment uploads to cloud storage, use the `cloud_storage_enable_segment_uploads` configuration property (default is `true`), which allows segment uploads to proceed normally after the pause completes and uploads resume.

We should say when and how this will be used:

"Starting in version 25.1, when running Tiered Storage, to troubleshoot and resolve temporary issues related to a cluster's interaction with cloud storage, you can safely ..."

And, just in case this does end up in docs, let's also have a yellow box warning: "We highly recommend using pause and resume only under the guidance of Redpanda support / customer success."

# there should not be any data loss.
```

For some applications where the newest data is more valuable than historical data, data accumulation can be worse than data loss. In such cases, where you must pause uploads but you cannot afford to accumulate data on disk and lose availability, there are a couple of less safe pause and resume mechanisms:

"In such cases, where you cannot afford to lose the most recently produced data by rejecting produce requests after producers have filled the local disks while uploads are paused, there is a less safe pause and resume mechanism that prioritizes the ability to receive new data over retaining data that cannot be uploaded when disks are full."

- Set the `cloud_storage_enable_remote_allow_gaps` cluster configuration property to `true`, which allows gaps for all topics in the cluster.
- Set the `redpanda.remote.allowgaps` configuration property to `true`, which allows gaps for one specific topic. This topic-level configuration option overrides the cluster-level default.

When you pause uploads and set one of these properties to `true`, gaps may result. However, you can seamlessly resume uploads by keeping the allow-gaps property set to `true` at either the cluster or topic level. If it is set to `false`, uploads could stall when a gap occurs.

", gaps in the range of offsets stored in cloud storage may result."
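
For the cluster-wide option in the first bullet, the sequence might look like the sketch below. It mirrors the topic-level example in this PR but has not been run against a cluster. The function names and the overridable `RPK` variable are hypothetical conveniences; only the two property names come from the text above.

```bash
# Sketch only: RPK is overridable so the sequence can be previewed
# (for example RPK="echo rpk") without touching a real cluster.
RPK="${RPK:-rpk}"

# Pause segment uploads, then allow gaps for every topic in the cluster
# so that local data can be trimmed instead of filling the disks.
pause_uploads_allowing_gaps() {
  $RPK cluster config set cloud_storage_enable_segment_uploads false
  $RPK cluster config set cloud_storage_enable_remote_allow_gaps true
}

# Resume uploads. Gaps stay allowed at this point so that uploads
# do not stall on a gap created during the pause.
resume_uploads() {
  $RPK cluster config set cloud_storage_enable_segment_uploads true
}

# Once uploads are confirmed to be running again, stop allowing gaps.
disallow_gaps() {
  $RPK cluster config set cloud_storage_enable_remote_allow_gaps false
}
```

Running the functions with `RPK="echo rpk"` prints the commands instead of executing them, which may be useful for reviewing the order of operations before applying it.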

rpk cluster config set cloud_storage_enable_segment_uploads false
# Segment uploads are paused and cloud storage housekeeping is not running.
# New data is stored on the local volume, which may overflow.
# To avoid overflow, allow for gaps to be created.

"To avoid overflow by allowing gaps in the log"

# Segment uploads are paused and cloud storage housekeeping is not running.
# New data is stored on the local volume, which may overflow.
# To avoid overflow, allow for gaps to be created.
# In this example, data that is not uploaded to cloud storage can be

may be deleted if a disk fills before uploads are resumed

- Set the `cloud_storage_enable_remote_allow_gaps` cluster configuration property to `true`, which allows gaps for all topics in the cluster.
- Set the `redpanda.remote.allowgaps` configuration property to `true`, which allows gaps for one specific topic. This topic-level configuration option overrides the cluster-level default.

When you pause uploads and set one of these properties to `true`, gaps may result. However, you can seamlessly resume uploads by keeping the allow-gaps property set to `true` at either the cluster or topic level. If it is set to `false`, uploads could stall when a gap occurs.

@Lazin should the instruction here basically be to turn on allow gaps only if you see that cloud storage uploads are stuck (you get errors and want to alleviate the problem and continue)? If that's true, what does it look like (what messages would you see in the logs that can be resolved by allowing gaps)?

@Feediver1 Feediver1 requested a review from wzzzrd86 March 24, 2025 12:04
The following example shows a simple pause and resume with no gaps allowed:

```bash
rpk cluster config set cloud_storage_enable_segment_uploads false

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The line starts with a command rpk cluster config set. According to the style guide, sentences should not start with a command. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
rpk cluster config set cloud_storage_enable_segment_uploads false
To disable segment uploads, use the command: `rpk cluster config set cloud_storage_enable_segment_uploads false`.

The original line starts with a command, which is against the style guide's recommendation. By rephrasing it to start with a descriptive phrase, we provide context for the command, making the documentation clearer and more informative. This change also aligns with best practices for technical writing, where context is provided before commands or code snippets.

# If the disks fill up, produce requests will be rejected.
...

rpk cluster config set cloud_storage_enable_segment_uploads true

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The line starts with a command rpk cluster config set. According to the style guide, sentences should not start with a command. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
rpk cluster config set cloud_storage_enable_segment_uploads true
To enable segment uploads, use the command: `rpk cluster config set cloud_storage_enable_segment_uploads true`.

The original line starts with a command, which is against the style guide's recommendation. By rephrasing it to start with a purpose ('To enable segment uploads'), the command is integrated into a more descriptive sentence, providing context and improving readability. This change aligns with the style guide's requirements and maintains the technical accuracy of the command.

The following example shows how to pause and resume Tiered Storage uploads while allowing for gaps:

```bash
rpk cluster config set cloud_storage_enable_segment_uploads false

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The line starts with a command rpk cluster config set. According to the style guide, sentences should not start with a command. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
rpk cluster config set cloud_storage_enable_segment_uploads false
To disable segment uploads, use the command: `rpk cluster config set cloud_storage_enable_segment_uploads false`.

The original line starts with a command, which is against the style guide's recommendation. By rephrasing it to start with a descriptive phrase, we provide context for the command, making the documentation clearer and more informative. This change also aligns with best practices for technical writing, where context is provided before commands or code snippets.

# In this example, data that is not uploaded to cloud storage can be
# deleted.

rpk topic alter-config $topic-name --set redpanda.remote.allowgaps=true

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The line starts with a command rpk topic alter-config. According to the style guide, sentences should not start with a command. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
rpk topic alter-config $topic-name --set redpanda.remote.allowgaps=true
To alter the configuration of a topic, use the command `rpk topic alter-config $topic-name --set redpanda.remote.allowgaps=true`.

The original sentence starts with a command, which is against the style guide's recommendation. By rephrasing it to start with a descriptive phrase, we provide context and make the sentence more informative. This change also aligns with best practices for technical documentation, enhancing clarity and user understanding.

# to delete data before it's uploaded, therefore some data loss is possible.
...

rpk cluster config set cloud_storage_enable_segment_uploads true

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The line starts with a command rpk cluster config set. According to the style guide, sentences should not start with a command. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
rpk cluster config set cloud_storage_enable_segment_uploads true
To enable segment uploads, use the command: `rpk cluster config set cloud_storage_enable_segment_uploads true`.

The original line starts with a command, which is against the style guide's recommendation. By rephrasing it to start with a purpose ('To enable segment uploads'), the command is integrated into a more descriptive sentence, providing context and improving readability. This change aligns with the style guide's requirements and maintains the technical accuracy of the command.

# metric is equal to zero, indicating that uploads have resumed.

# Disable the gap allowance previously set for the topic.
rpk topic alter-config $topic-name --set redpanda.remote.allowgaps=false

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The line starts with a command rpk topic alter-config. According to the style guide, sentences should not start with a command. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
rpk topic alter-config $topic-name --set redpanda.remote.allowgaps=false
To alter the configuration of a topic, use the command `rpk topic alter-config $topic-name --set redpanda.remote.allowgaps=false`.

The original sentence starts with a command, which is against the style guide's recommendation. By rephrasing it to provide context, the sentence becomes more informative and aligns with the style guide. This change also helps in making the documentation more user-friendly by explaining the purpose of the command.
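
The "wait until the metric is equal to zero" step in this example could be scripted along these lines. This is a hedged sketch: `wait_until_zero` is a hypothetical helper, and the command it polls (anything that prints the current value of `redpanda_cloud_storage_paused_archivers`) is left to the caller.

```bash
# Poll a command until it prints exactly "0".
# Usage: wait_until_zero <max_attempts> <sleep_seconds> <command> [args...]
# Returns 0 as soon as the command prints "0", or 1 if every attempt
# produced some other output (or the command failed).
wait_until_zero() {
  attempts=$1
  interval=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Treat a failing command the same as a non-zero reading.
    value=$("$@") || value=""
    if [ "$value" = "0" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

With a command that reports the metric, a call such as `wait_until_zero 30 10 <metric_command>` would give up after 30 polls spaced 10 seconds apart, so that the allow-gaps setting is only turned off once uploads have resumed.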


Starting in version 25.1, when running Tiered Storage, you can safely pause and resume uploads to cloud storage without risking data consistency or loss. To pause or resume segment uploads to cloud storage, use the `cloud_storage_enable_segment_uploads` configuration property (default is `true`), which allows segment uploads to proceed normally after the pause completes and uploads resume.

While uploads are paused, data accumulates locally, which can lead to full disks if the pause is prolonged. In such cases, Redpanda throttles produce requests, and rejects new Kafka produce requests to prevent data from being written. Additionally, this pauses cloud storage housekeeping, meaning segments are neither uploaded nor removed from cloud storage. However, it is still possible to consume data from cloud storage when you have paused uploads.

This reads as "while uploads are paused, Redpanda throttles produce requests ..."

I think it might be clearer if we change this from "In such cases, Redpanda ..." to "If the disks fill, Redpanda throttles ..."

@@ -1133,6 +1133,65 @@ rpk topic alter-config <topic_name> --set redpanda.remote.read=true

See also: xref:{topic-recovery-link}[Topic Recovery], xref:manage:kubernetes/k-remote-read-replicas.adoc[Remote Read Replicas]

== Pause and resume

Let's change this to "Pause and resume uploads"


Use the `redpanda_cloud_storage_paused_archivers` metric to monitor the status of paused uploads. It displays a non-zero value whenever uploads are paused.
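
One way to act on this metric from a script is sketched below. It is an illustration, not something from this PR: the helper name `paused_archivers_total` is hypothetical, and the sketch assumes the metric is exposed in Prometheus text format with the value as the last field on each line (no trailing timestamp). If the metric carries labels, all samples are summed.

```bash
# Sum every sample of redpanda_cloud_storage_paused_archivers found in
# Prometheus text-format metrics read from stdin.
# Prints the total, or "missing" if the metric does not appear at all.
paused_archivers_total() {
  awk -v metric="redpanda_cloud_storage_paused_archivers" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)           # drop the label set, if any
      if (name == metric) { total += $NF; found = 1 }
    }
    END { if (found) print total + 0; else print "missing" }
  '
}
```

Fed with the output of a metrics scrape (for example `curl -s http://<broker>:9644/public_metrics | paused_archivers_total`, where the endpoint and port are assumptions to verify for your deployment), a result of `0` would indicate that no uploads are paused, and any non-zero value would indicate that uploads are paused.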

NOTE: Do not use `redpanda.remote.read` or `redpanda.remote.write` to pause and resume segment uploads. Doing so can lead to a gap between local data and the data in the cloud storage. In such cases, it is possible that the oldest segment is not aligned with the last uploaded segment due to the gap.

Should this be a warning, since we're actively discouraging people from ever doing this?

7 participants