Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cloudflare): custom hostnames edge-cases causing duplicates #5183

Conversation

mrozentsvayg
Copy link
Contributor

@mrozentsvayg mrozentsvayg commented Mar 15, 2025

Description

Fixes and improvements:

  • using TXT registry could cause one-time attempt to create duplicate custom hostname, crashing container (once);
  • manual deleting DNS and registry (TXT) records alone could cause the same as above;
  • refactor submitChanges() custom hostnames related code to make contributions around it easier (eg. fix(cloudflare): regional hostnames #5175);
  • fix and add unit-tests to cover all above;

There are still similar edge-cases that could cause one-time container crash. Not custom hostnames related.
Eg. manually removing or renaming A/CNAME record, but not TXT registry record would crash:

INFO[0068] Changing record.                              action=CREATE record=mih-cf-test-2.t.conduit-stg.xyz ttl=1 type=A zone=***
INFO[0068] Changing record.                              action=CREATE record=a-mih-cf-test-2.t.conduit-stg.xyz ttl=1 type=TXT zone=***
ERRO[0069] failed to create record: An identical record already exists. (81058)  action=CREATE record=a-mih-cf-test-2.t.conduit-stg.xyz ttl=1 type=TXT zone=***
FATA[0069] Failed to do run once: failed to submit all changes for the following zones: [***]
exit status 1

The one-time crashing could be intentional implementation of letting know that something has happened not quite gracefully, but not so sure about the CREATE->DELETE order on rename.

Checklist

  • Unit tests updated
  • End user documentation updated - not needed

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 15, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @mrozentsvayg. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 15, 2025
@mloiseleur
Copy link
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 15, 2025
log.WithFields(logFields).Errorf("failed to delete custom hostname %v/%v: %v", chID, change.CustomHostname.Hostname, chErr)
}
} else {
log.WithFields(logFields).Infof("Custom hostname %v not found", change.CustomHostname.Hostname)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.WithFields(logFields).Infof("Custom hostname %v not found", change.CustomHostname.Hostname)
log.WithFields(logFields).Warnf("Custom hostname %v not found", change.CustomHostname.Hostname)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%v -> %q in all kind of log message please!

Copy link
Contributor Author

@mrozentsvayg mrozentsvayg Mar 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%v -> %q in all kind of log message please!

Updated for Cloudflare provider.
There's still a logging inconsistency with the logrus's log.WithFields(), looking like:

INFO[0001] Changing record.                              action=CREATE record=mih-cf-test-2.***.xyz ttl=1 type=A zone=***
INFO[0001] Creating custom hostname "mih-test-ch-1.***"  action=CREATE record=mih-cf-test-2.*** ttl=1 type=A zone=***
INFO[0002] Changing record.                              action=CREATE record=a-mih-cf-test-2.*** ttl=1 type=TXT zone=***

If the consistent quoted output is a style policy goal, we could use log.SetFormatter(&log.TextFormatter{ForceQuote: true}), eg:

diff --git a/provider/cloudflare/cloudflare.go b/provider/cloudflare/cloudflare.go
index 37847491..e4ce8697 100644
--- a/provider/cloudflare/cloudflare.go
+++ b/provider/cloudflare/cloudflare.go
@@ -249,6 +249,7 @@ func NewCloudFlareProvider(domainFilter endpoint.DomainFilter, zoneIDFilter prov
 func (p *CloudFlareProvider) Zones(ctx context.Context) ([]cloudflare.Zone, error) {
        result := []cloudflare.Zone{}

+       log.SetFormatter(&log.TextFormatter{ForceQuote: true})
        // if there is a zoneIDfilter configured
        // && if the filter isn't just a blank string (used in tests)
        if len(p.zoneIDFilter.ZoneIDs) > 0 && p.zoneIDFilter.ZoneIDs[0] != "" {
@@ -361,6 +362,8 @@ func (p *CloudFlareProvider) ApplyChanges(ctx context.Context, changes *plan.Cha

 // submitChanges takes a zone and a collection of Changes and sends them as a single transaction.
 func (p *CloudFlareProvider) submitChanges(ctx context.Context, changes []*cloudFlareChange) error {
+       log.SetFormatter(&log.TextFormatter{ForceQuote: true})
+
        // return early if there is nothing to change
        if len(changes) == 0 {
                log.Info("All records are already up to date")
@@ -385,7 +388,6 @@ func (p *CloudFlareProvider) submitChanges(ctx context.Context, changes []*cloud
                                "action": change.Action,
                                "zone":   zoneID,
                        }
-
                        log.WithFields(logFields).Info("Changing record.")

                        if p.DryRun {

Should we go for it?

@ivankatliarchuk
Copy link
Contributor

ivankatliarchuk commented Mar 15, 2025

Hi. Could you share manifest to test edge cases? As well woul you be able to execute code coverage against modified lines in this pull request?

@mrozentsvayg
Copy link
Contributor Author

Hi. Could you share manifest to test edge cases?

The regular manifest before is as we previously discussed and as implemented in unit-tests:

  • create / delete regular records
  • create / update / delete records with custom hostnames
  • add/remove custom hostname to/from the regular record

The newly discovered edge-cases are caused by testing multi-cluster (GCP) configuration with the same Cloudflare zone and using TXT registry as a result.
There's currently no graceful way of updating TXT records, whereas changes are sometimes mandated (eg. to shorten txt suffix/prefix to fit into 63 bytes limit for the record).
Another peculiarity - if TXT record alone is removed, it's lost. The only graceful way to get it back is to remove the RR.
If RR record alone is removed, the container would crash, because the previous TXT record still exists.
Yet another - newly added --txt-new-format-only option, very useful to better control number of TXT records and TXT records length.
So in case there's a need to change prefix/suffix/owner, I'd controllably remove the RRs along with TXTs in small batches, monitoring the progress.
Specifically for the custom hostnames it's best not to touch it during such maintenance, because it would take time to revalidate and reissue the certificate.

So latest addition to the manifest:

  • manually remove RR&TXT records, make sure both correctly get recreated, still associated with the still existing custom hostname

As well woul you be able to execute code coverage against modified lines in this pull request?

I'm looking closely at the coverage and making sure it's not getting any more relaxed on every change.
Latest commits are not exception, can't attach html here unfortunately.

@mrozentsvayg mrozentsvayg force-pushed the mrozentsvayg/cloudflare-custom-hostnames-duplicate-records branch from bee85fc to 12cce6f Compare March 18, 2025 02:20
…arate method submitCustomHostnameChanges(); extend truncated logging
@ivankatliarchuk
Copy link
Contributor

/label tide/merge-method-squash
/kind bug
/kind cleanup

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. labels Mar 18, 2025
@mrozentsvayg
Copy link
Contributor Author

mrozentsvayg commented Mar 18, 2025

#5183 (comment)

Pulling this comment up from the thread.
Please let me know if anyone has a strong opinion.
If we are pushing for consistently having quoted names in log output, I would commit it for Cloudflare provider to start with.
Then eventually propagate it further on.

@ivankatliarchuk
Copy link
Contributor

Consistency is nice. No strong opinion how it should or could be implemented. Most likely should be same for the whole project, and is most likely out of the scope for this PR

Copy link
Contributor

@ivankatliarchuk ivankatliarchuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

cc: @mloiseleur @szuecs

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2025
@mloiseleur
Copy link
Collaborator

@mrozentsvayg I just have a last small change to suggest.

BTW, we are looking for help to maintain this project. The process is documented here. If you are interested, please reach out (on kubernetes slack).

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 21, 2025
@mloiseleur
Copy link
Collaborator

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 24, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ivankatliarchuk, mloiseleur

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2025
@k8s-ci-robot k8s-ci-robot merged commit 04f6a99 into kubernetes-sigs:master Mar 24, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants