Strimzi Kafka external PVC [High Availability and Disaster Recovery] #8750

brunooliveiramac · 2023-06-29T17:01:21Z

brunooliveiramac
Jun 29, 2023

Hi everyone,

I'm testing Strimzi Kafka and we are trying to work out how to achieve high availability and disaster recovery.
We recently saw some weird behavior where a cluster went missing, and we were only able to get it back by deleting the operator and recreating everything through Argo.

Based on that, we started looking into how to ensure that the PVCs don't get deleted.

Do you know of any approaches? For example, using MirrorMaker with two operators so we can switch the traffic in case of an issue?
Or some way to use an external PVC, so that if I delete the entire operator I still have the data stored somewhere?

scholzj · 2023-06-30T00:09:42Z

scholzj
Jun 30, 2023
Maintainer

There are some issues with Argo and PVC deletion - there are some other threads about that.

Using Mirror Maker to duplicate the Kafka cluster and its data can definitely be used for DR / backup. But it is a fairly expensive approach, as you need to run a second cluster and, in public clouds, also pay for the data transfers. There is no simple way to switch traffic between the clusters. You would need to reconfigure all clients to make sure they connect to the backup cluster, or use some network infrastructure that would do it for you (including severing existing connections to the old cluster etc.). And assuming you use them as active-passive, which would be typical for backup scenarios, there is also no good way to reverse the mirroring flow. You would basically start with a new cluster as a new backup after the switch. So operations-wise, it is not completely straightforward either.
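To make the "reconfigure all clients" point concrete, here is a minimal Python sketch of what a client-side switch might involve. It assumes two clusters with the hypothetical aliases `primary` and `backup` and hypothetical bootstrap addresses. It also assumes MirrorMaker 2 runs with its default replication policy, which prefixes mirrored topics with the source cluster alias (for example `primary.orders`); with the identity replication policy the topic name stays unchanged. The `subscribe_to` key is an application-level field invented for this sketch, not a Kafka client property.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    alias: str              # MirrorMaker 2 cluster alias (hypothetical value)
    bootstrap_servers: str  # hypothetical bootstrap address


PRIMARY = Cluster("primary", "primary-kafka-bootstrap:9092")
BACKUP = Cluster("backup", "backup-kafka-bootstrap:9092")


def topic_on_backup(topic: str, source_alias: str, identity_policy: bool = False) -> str:
    """Name of a mirrored topic on the backup cluster.

    The default MirrorMaker 2 replication policy prefixes the topic with the
    source alias; the identity policy keeps the original name.
    """
    return topic if identity_policy else f"{source_alias}.{topic}"


def consumer_settings(topic: str, group_id: str, failed_over: bool,
                      identity_policy: bool = False) -> dict:
    """Build the settings a consumer needs before and after a failover."""
    if failed_over:
        return {
            "bootstrap.servers": BACKUP.bootstrap_servers,
            "group.id": group_id,
            # Application-level key for this sketch, not a Kafka property.
            "subscribe_to": topic_on_backup(topic, PRIMARY.alias, identity_policy),
        }
    return {
        "bootstrap.servers": PRIMARY.bootstrap_servers,
        "group.id": group_id,
        "subscribe_to": topic,
    }


if __name__ == "__main__":
    print(consumer_settings("orders", "billing", failed_over=False))
    print(consumer_settings("orders", "billing", failed_over=True))
```

Both the address and, by default, the topic name change on failover, which is part of why the switch is not a simple redirect. Consumer offsets are a separate concern that this sketch does not cover.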

Deleting the operator Deployment itself should not really delete anything. That should happen only when you delete the custom resources (e.g. by deleting them directly or deleting the CRDs which will cause garbage collection of the custom resources). You can to some extent prevent that for example by setting finalizers on the CRDs / CRs. It is up to you to evaluate what is the right measure for your situation and what the scenarios you want to have covered are - I'm just offering this as an option.
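As an illustration of the finalizer idea, here is a minimal Python sketch that builds the JSON merge patch for adding (or later removing) a finalizer on a custom resource. The finalizer name `example.com/protect-kafka` and the resource name `my-cluster` are hypothetical. A JSON merge patch replaces lists wholesale, so the patch carries the full finalizer list rather than just the new entry.

```python
import json

# Hypothetical finalizer name; use a domain-qualified name that you control.
PROTECT_FINALIZER = "example.com/protect-kafka"


def add_finalizer_patch(resource: dict, finalizer: str = PROTECT_FINALIZER) -> dict:
    """Return a JSON merge patch that adds `finalizer` if it is missing."""
    finalizers = list(resource.get("metadata", {}).get("finalizers") or [])
    if finalizer not in finalizers:
        finalizers.append(finalizer)
    return {"metadata": {"finalizers": finalizers}}


def remove_finalizer_patch(resource: dict, finalizer: str = PROTECT_FINALIZER) -> dict:
    """Return a JSON merge patch that drops `finalizer` and keeps the others."""
    existing = resource.get("metadata", {}).get("finalizers") or []
    return {"metadata": {"finalizers": [f for f in existing if f != finalizer]}}


if __name__ == "__main__":
    # Hypothetical Kafka custom resource as it might be read from the cluster.
    kafka_cr = {"kind": "Kafka", "metadata": {"name": "my-cluster"}}
    print(json.dumps(add_finalizer_patch(kafka_cr)))
```

The printed JSON could then be applied with something like `kubectl patch kafka my-cluster --type merge -p '<patch>'`. Kubernetes will not remove an object while it still has finalizers, so a deleted resource stays in a terminating state until someone removes the finalizer by hand. How the operator treats a resource in that state is worth testing before relying on this.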

4 replies

brunooliveiramac Jun 30, 2023
Author

Oh yeah, that was my mistake. We deleted the whole application, so it deleted the CRDs, as you mention.
The issue was related to a node that disappeared for no apparent reason. We couldn't find any logs or events that would explain it. So to get that node back we had to tear everything down. We tried deleting the StatefulSet and it didn't come back either. I think we should have restarted the operator instead.
Anyway, I'm trying to patch the CRD using Kustomize. I'm struggling a little, but it sounds like a good idea.

brunooliveiramac Jul 3, 2023
Author

Hi @scholzj, I saw that you are pretty active on the answers, thanks a lot for your help.
I have just one more quick question

I found this answer in one of the topics regarding disaster recovery.
https://github.com/orgs/strimzi/discussions/5894#discussioncomment-1651589
You pointed to this doc: https://github.com/orgs/strimzi/discussions/4892#:~:text=https%3A//strimzi.io/docs/operators/latest/full/using.html%23cluster%2Drecovery_str

The link is a little outdated, and I couldn't find the section.
This is what I found, using Mirror Maker: https://strimzi.io/docs/operators/latest/full/configuring.html#cluster_configuration
Is that the expected way to handle disaster recovery?

scholzj Jul 3, 2023
Maintainer

Maybe this is the link that works today? https://strimzi.io/docs/operators/latest/full/deploying.html#cluster-recovery-str

scholzj Jul 3, 2023
Maintainer

In general, I do not think Kafka has a great DR story. Mirror Maker is the only thing inside the Kafka project itself that can help with it. But it is not completely straightforward (and I'm not really an expert on it, to be honest).

sionsmith · 2026-02-13T16:50:03Z

sionsmith
Feb 13, 2026

We ran into the same PVC lifecycle problem — deleting Strimzi CRDs triggers garbage collection of PVCs, and node failures can cause permanent data loss. MirrorMaker2 helps with replication but requires a full secondary cluster and complex failover.

We built the Strimzi Backup Operator to solve this. It is a Strimzi-native Kubernetes operator that adds KafkaBackup and KafkaRestore CRDs under the backup.strimzi.io API group. It auto-discovers your Strimzi Kafka CR (bootstrap servers, TLS certs, KafkaUser credentials) and creates Kubernetes Jobs to back up topic data to S3, Azure Blob, or GCS.

Key features: cron-based scheduled backups, point-in-time recovery with millisecond precision, topic filtering, consumer group offset restore, retention policies, and Prometheus metrics. Written in Rust with kube-rs.

Would love feedback from the Strimzi community.

3 replies

scholzj Feb 13, 2026
Maintainer

@sionsmith Thanks for sharing this. I like the project. And backup is something many Strimzi users are asking for. So having it integrated with Strimzi and shared with the community is great.

However ...

I'm a bit concerned about the use of the strimzi.io domain in the CRD. If one day Strimzi has an official backup project, it might very easily collide and cause problems for all users. I guess it should probably use your own domain for it? E.g. strimzi.kafkabackup.com?
While I personally do not mind the name that much (although something like Backup Operator for Strimzi might be more in line with the general expectations, I have a bunch of Strimzi Something projects myself), it would be great if the GitHub repository made it clear that it is not part of the Strimzi project and the CNCF foundation.

(CC @strimzi/maintainers as this is just my personal take)

im-konge Feb 13, 2026
Maintainer

Thanks for sharing this @sionsmith, it's a great idea to have something like this :) Regarding the points Jakub raised, mainly about the domain name, I agree with him. I think it can also create confusion by suggesting that it's something owned by Strimzi, so people may end up asking questions about it in this org.

ppatierno Feb 16, 2026
Maintainer

Thanks @sionsmith. It looks interesting, but I have the same concerns as Jakub and Lukas about the domain name and also the project name. It should make it clearer that it's not part of Strimzi but something "for Strimzi". Would you be interested in listing this project in the https://github.com/strimzi/awesome-strimzi repository? I was also curious to know why it was written in Rust :-)

sionsmith · 2026-02-16T09:23:49Z

sionsmith
Feb 16, 2026

Thanks @scholzj, @im-konge, and @ppatierno for the feedback! Fair points!

We've addressed everything raised here in v0.1.0:

CRD API group changed: backup.strimzi.io → kafkabackup.com, so there is no more collision risk with a future official Strimzi backup project.
Project renamed: "Strimzi Backup Operator" → "Kafka Backup Operator for Strimzi", which makes it clear this is for Strimzi, not part of Strimzi.
Non-affiliation disclaimer added to the README and Helm chart metadata, making it explicit this is an independent community project.

The strimzi.io/cluster label is still used, but only as a reference to the Strimzi Kafka CR name (following Strimzi's own convention for cluster identification).

@ppatierno Great idea!!! We'd love to be listed in awesome-strimzi! As for why Rust: we wanted a small, fast operator binary with low memory footprint and strong compile-time safety. The kube-rs ecosystem has matured nicely and the resulting container image is ~15MB.

Thanks again for the nice words, please let me or the team know of anything else as we are happy to make any further adjustments the maintainers suggest.

1 reply

scholzj Feb 16, 2026
Maintainer

Thanks a lot for applying the changes. I appreciate it. I do not think that using the strimzi.io/cluster label is a problem - at least not for me.
