K8SPG-680: add ReadyForBackup condition to the pg-cluster #1133


Open
wants to merge 6 commits into main

Conversation

@pooknull (Contributor) commented Apr 15, 2025

K8SPG-680

https://perconadev.atlassian.net/browse/K8SPG-680

DESCRIPTION

Problem:
After a failed PVC resize on cluster1-repo1, scheduled backups cannot be created successfully. Although the pg-backup object is created, it gets stuck in the Starting state.

Cause:
When a PVC resize fails, Crunchy's PostgresCluster resource gets an Unknown status for the PGBackRestReplicaRepoReady condition. This condition must be True for a backup job to be created in the reconcileManualBackup method:

// determine if the dedicated repository host is ready using the repo host ready
// condition, and return if not
repoCondition := meta.FindStatusCondition(postgresCluster.Status.Conditions, ConditionRepoHostReady)
if repoCondition == nil || repoCondition.Status != metav1.ConditionTrue {
	return nil
}

// Determine if the replica create backup is complete and return if not. This allows for proper
// orchestration of backup Jobs since only one backup can be run at a time.
backupCondition := meta.FindStatusCondition(postgresCluster.Status.Conditions,
	ConditionReplicaCreate)
if backupCondition == nil || backupCondition.Status != metav1.ConditionTrue {
	return nil
}

As a result, the operator waits indefinitely for the backup job to appear:

if errors.Is(err, ErrBackupJobNotFound) {
	log.Info("Waiting for backup to start")
	return reconcile.Result{RequeueAfter: time.Second * 5}, nil
}
return reconcile.Result{}, errors.Wrap(err, "find backup job")

Solution:

  • Add a new .status.conditions field to the PerconaPGCluster resource.
  • If the required conditions in the PostgresCluster resource (PGBackRestRepoHostReady and PGBackRestReplicaCreate) are not True, a ReadyForBackup condition with status False is added to the PerconaPGCluster.
  • If ReadyForBackup is False, the operator will skip scheduled backup creation and log a message instead.
  • When a new PerconaPGBackup resource is created and the operator is waiting for its backup job to appear, it checks the ReadyForBackup condition. If the condition has been False for more than 2 minutes, the backup is marked as Failed (see the sketch after this list).
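
To make the intended flow concrete, here is a minimal, self-contained sketch of the two checks described above. It is written against the upstream condition names quoted in the Cause section; the helper names, reason strings, and the readyForBackupTimeout constant are illustrative assumptions, not the exact code in this PR.

package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// Assumed to mirror the upstream condition types quoted above.
	conditionRepoHostReady  = "PGBackRestRepoHostReady"
	conditionReplicaCreate  = "PGBackRestReplicaRepoReady"
	conditionReadyForBackup = "ReadyForBackup"

	// How long ReadyForBackup may stay False before a pending backup is failed.
	readyForBackupTimeout = 2 * time.Minute
)

// readyForBackupCondition derives the PerconaPGCluster ReadyForBackup condition
// from the upstream PostgresCluster conditions that reconcileManualBackup requires.
func readyForBackupCondition(crunchyConditions []metav1.Condition) metav1.Condition {
	cond := metav1.Condition{
		Type:               conditionReadyForBackup,
		Status:             metav1.ConditionTrue,
		Reason:             "BackupsCanStart",
		LastTransitionTime: metav1.Now(),
	}
	for _, t := range []string{conditionRepoHostReady, conditionReplicaCreate} {
		if c := meta.FindStatusCondition(crunchyConditions, t); c == nil || c.Status != metav1.ConditionTrue {
			cond.Status = metav1.ConditionFalse
			cond.Reason = "RepoHostOrReplicaCreateBackupNotReady"
			break
		}
	}
	return cond
}

// shouldFailPendingBackup reports whether a PerconaPGBackup that is still waiting
// for its backup job should be marked Failed because the cluster has not been
// ready for backups for longer than the timeout.
func shouldFailPendingBackup(clusterConditions []metav1.Condition, now time.Time) bool {
	c := meta.FindStatusCondition(clusterConditions, conditionReadyForBackup)
	if c == nil || c.Status != metav1.ConditionFalse {
		return false
	}
	return now.Sub(c.LastTransitionTime.Time) > readyForBackupTimeout
}

In the real controller the derived condition would presumably be stored with meta.SetStatusCondition, so LastTransitionTime only changes when the status actually flips and the 2-minute comparison in shouldFailPendingBackup stays meaningful.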

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support the oldest and newest supported PG versions?
  • Does the change support the oldest and newest supported Kubernetes versions?

@pooknull marked this pull request as ready for review April 16, 2025 11:18
Comment on lines +27 to +36
func (f *fakeClient) Patch(ctx context.Context, obj client.Object, patch client.Patch, options ...client.PatchOption) error {
	err := f.Client.Patch(ctx, obj, patch, options...)
	if !k8serrors.IsNotFound(err) {
		// nil, or any error other than NotFound: nothing more to do here
		return err
	}
	// the object does not exist yet: create it, then retry the patch
	if err := f.Create(ctx, obj); err != nil {
		return err
	}
	return f.Client.Patch(ctx, obj, patch, options...)
}

Do we need this? If it is removed, nothing fails in the controller tests.

@@ -505,7 +470,7 @@ func updatePGBackrestInfo(ctx context.Context, c client.Client, pod *corev1.Pod,
 }
 
 func finishBackup(ctx context.Context, c client.Client, pgBackup *v2.PerconaPGBackup, job *batchv1.Job) (*reconcile.Result, error) {
-	if checkBackupJob(job) == v2.BackupSucceeded {
+	if job != nil && checkBackupJob(job) == v2.BackupSucceeded {
@gkech (Contributor) commented Apr 17, 2025

Should we maybe validate the input job once at the top of the function and avoid repeating the same check across different places?

e.g.

func finishBackup(ctx context.Context, c client.Client, pgBackup *v2.PerconaPGBackup, job *batchv1.Job) (*reconcile.Result, error) {
	if job == nil {
		// do something
	}
