Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(txt-registry): deprecate legacy txt-format #5172

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

ivankatliarchuk
Copy link
Contributor

@ivankatliarchuk ivankatliarchuk commented Mar 13, 2025

Description

TODO:

  • ✅ fix tests
  • ✅ test on real cluster and account
  • ✅ review docs
  • ✅ added TXT legacy record cleanup script

Checklist

  • Unit tests updated
  • End user documentation updated

Executed on real cluster with arguments

go run main.go \
    --provider=aws \
    --registry=txt \
    --source=fake \
    --aws-zone-type=private \
    --zone-id-filter=/hostedzone/XXXXXX \
    --log-level=debug \
    --policy=sync \
    --interval=60s \
    --fqdn-template=a1.ex.com

Without change, records current and old format created

aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_UNDER_TEST} --query "ResourceRecordSets[?Type=='TXT'].{Name:Name, Value:ResourceRecords[0].Value}" --output table
-----------------------------------------------------------------------------
|                          ListResourceRecordSets                           |
+--------------------+------------------------------------------------------+
|        Name        |                        Value                         |
+--------------------+------------------------------------------------------+
|  a-fhbr.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-gwbl.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-hgmc.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-hmqw.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-hoxg.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-ipfd.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-melr.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-mkde.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-uknh.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-xiqj.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  fhbr.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  gwbl.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  hgmc.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  hmqw.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  hoxg.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  ipfd.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  melr.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  mkde.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  uknh.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  xiqj.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
+--------------------+------------------------------------------------------+

With the change. New format records created, letagcy TXT records left untouched

❯❯ aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_UNDER_TEST} --query "ResourceRecordSets[?Type=='TXT'].{Name:Name, Value:ResourceRecords[0].Value}" --output table
-----------------------------------------------------------------------------
|                          ListResourceRecordSets                           |
+--------------------+------------------------------------------------------+
|        Name        |                        Value                         |
+--------------------+------------------------------------------------------+
|  a-dkmw.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-ihkf.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-iztb.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-jlmh.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-kmam.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-lqpv.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-rizx.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-sjfa.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-uvlq.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  a-vgpc.a1.ex.com. |  "heritage=external-dns,external-dns/owner=default"  |
|  bwwo.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  ecvx.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  ilan.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  iyda.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  nhhy.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  pazy.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  sdeo.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  tosb.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  wrej.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
|  yymt.a1.ex.com.   |  "heritage=external-dns,external-dns/owner=default"  |
+--------------------+------------------------------------------------------+
``

@ivankatliarchuk ivankatliarchuk marked this pull request as draft March 13, 2025 09:42
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 13, 2025
@ivankatliarchuk
Copy link
Contributor Author

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Mar 13, 2025
@ivankatliarchuk
Copy link
Contributor Author

/kind cleanup

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Mar 13, 2025
@ivankatliarchuk ivankatliarchuk changed the title WIP: feat(txt-registry): deprecate legacy txt-format feat(txt-registry): deprecate legacy txt-format Mar 14, 2025
@ivankatliarchuk ivankatliarchuk marked this pull request as ready for review March 14, 2025 11:28
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 14, 2025
@mloiseleur
Copy link
Collaborator

mloiseleur commented Mar 24, 2025

@ivankatliarchuk This PR lgtm. I left a last suggestion in order to reduce sentence length (and so increase readability).

@mrozentsvayg We need a second review before merge. Do you think you can review this PR ?

@mrozentsvayg
Copy link
Contributor

@mrozentsvayg We need a second review before merge. Do you think you can review this PR ?

There are several concerns regarding the registry implementation.
Specifically, the registry records are only looked up on the plan changes.
That means if the registry record is not there (manually dropped, for example, incidentally or not), it'll only get recreated when the resource record is manually deleted as well.
Why? Because without registry records the plan is no longer being able to detect the correct owner and filters out detected changes.
As a result, no changes for that record would get applied with "All records are already up to date".
Until the registry records are back, which is technically possible, but unlikely for regular user (no existing tooling), or resource records are deleted as well. Which is what I prefer when doing maintenance - controllably removing both resource and txt records in batches.
The maintenance is usually caused by need to change owner (in case if multi-cluster/naming convention changes) or if prefix/suffix make some TXT records too long (then the registry state is broken too, might be worth documenting!).
And if the resource record alone is manually deleted, but registry record is there, then external-dns would permanently fail until it's fixed.

While reducing number of unneeded records is great, I don't think that this PR would make any of it worse.
Hypothetically, if there are users running older version with old records only, they might get affected on upgrade.

Regarding the review part of the PR, it seems we no longer need the existing toTXTName method in nameMapper interface, and could safely remove it and rename toNewTXTName to toTXTName.

Also, if we are not planning to add more providers support to the script, would it be a good idea to add aws into it's name?
Ideally, we could leverage existing providers implementations and build some tooling to fix the state, but without clear roadmap it's probably too much of an effort.

…into feat-txt-registry

* refs/remotes/origin/feat-txt-registry:
  feat(txt-registry): deprecate legacy txt-format
  feat(txt-registry): deprecate legacy txt-format
@ivankatliarchuk
Copy link
Contributor Author

Thanks. So the changes are

  • added aws- prefix to a script, as in docs it was saying it, but nothing in the script
  • removed old method toTXTName and renamted toNewTXTName to a toTXTName

Removing legacy TXT records is not required, and this is optional operation that is already documented

My personal opinion, TXT record type is a registry is a wrong idea on it's own. Not flexible at all.

@ivankatliarchuk
Copy link
Contributor Author

Could you share exact wording pls what to add to README. Also, would you be open to creating a TXT registry best practices page or section?

@mrozentsvayg
Copy link
Contributor

Could you share exact wording pls what to add to README.

I would suggest something like:

Note: --txt-prefix and --txt-suffix contribute to the 63-byte maximum record length.
To avoid errors, use them only if absolutely required and keep them as short as possible.

Also, would you be open to creating a TXT registry best practices page or section?

Sure, there is no specific/pressing timeline for that, is there?

My personal opinion, TXT record type is a registry is a wrong idea on it's own. Not flexible at all.

Right, but we need an alternative. If you're not using AWS, the TXT registry is the only way to manage the same zone with multi-cluster and sync policies.
For cloudflare specifically, we could use the record tags. Not an ideal solution (tags are not KV, only alphanumeric, hyphens and underscores are allowed), but would work to set the record owner. For other providers no such luck, in general.

@ivankatliarchuk
Copy link
Contributor Author

ivankatliarchuk commented Mar 25, 2025

Sure, there is no specific/pressing timeline for that, is there?

Indeed, not timeline

Right, but we need an alternative.

Agree, at the moment there is no compensation. I'll will try to create a proposal for provider agnostic registry, but it may take months to get approval and implemente a solution.

@ivankatliarchuk
Copy link
Contributor Author

I would suggest something like:

Added to docs/registry/txt.md

@ivankatliarchuk
Copy link
Contributor Author

This is a compensation for TXT registry #4598. It does require a short proposal https://github.com/kubernetes-sigs/external-dns/tree/master/docs/proposal, but in reality, as long as external-dns is running on kubernetes, it should just work. Another option is alternatvie registry issue #4998

@ivankatliarchuk
Copy link
Contributor Author

@mrozentsvayg any chance you lgtm if anything is fine?

@mrozentsvayg
Copy link
Contributor

@mrozentsvayg any chance you lgtm if anything is fine?

I didn't realize my status allows me to lgtm the PRs :)

Golang changes lgtm, but please let me know if you would like me to go over the python script as well.
Since the status is use on your own risk I could skim over the code and rubber stamp it, or spend more time recreating and verifying the test setup.

@ivankatliarchuk
Copy link
Contributor Author

Yeah this python script is just a helper. Up to you. Contributors could do lgtm

@ivankatliarchuk
Copy link
Contributor Author

This pr could w8 if you have plans to run e-2-e tests

@rouke-broersma
Copy link

This seems like a very strange decision to me. We use external dns to manage dns records with automation, so we don't have manual maintenance of dns. External dns stores ownership information in txt records automatically. External dns migrates to a new storage format. Now users are left manually cleaning up old ownership information. This creates an enormous amount of manual cleanup for thousands of users. Externaldns can already clean up the old style txt records when the a records are deleted, why would externaldns not clean them up when the new flag introduced in #4946 is enabled? Externaldns clearly knows the format of the old style txt records still.

@mloiseleur
Copy link
Collaborator

Now users are left manually cleaning up old ownership information. This creates an enormous amount of manual cleanup for thousands of users. External-dns can already clean up the old style txt records when the a records are deleted, why would external-dns not clean them up when the new flag introduced in #4946 is enabled?

This project is 100% community driven. I agree with you that would be far better if we collectively manage to reach this stage. I'm unsure if we can do it, if the implementation is really that easy. Feel free to open a PR.

@rouke-broersma
Copy link

rouke-broersma commented Mar 27, 2025

Now users are left manually cleaning up old ownership information. This creates an enormous amount of manual cleanup for thousands of users. External-dns can already clean up the old style txt records when the a records are deleted, why would external-dns not clean them up when the new flag introduced in #4946 is enabled?

This project is 100% community driven. I agree with you that would be far better if we collectively manage to reach this stage. I'm unsure if we can do it, if the implementation is really that easy. Feel free to open a PR.

I'm not saying it's that easy, I have no experience in this area of external-dns. I'm saying that there must already be a mechanism to identify and removal the old style txt records, by virtue of external-dns already doing this, so I extrapolate that it must be possible.

I am questioning why the previous pull request was accepted without an automatic migration scenario, and I am cautioning to not accept this pull request without first adding an automatic migration, because it places a great manual maintenance burden on end users. That would be antithesis to the whole purpose of this project. The only time to be able to introduce this, is before this pull request is merged. After this pull request external-dns will no longer recognize the old format so external-dns will not itself be able to remove the old format records. Or am I missing something?

@ivankatliarchuk
Copy link
Contributor Author

I do not think you understand the complexity of what you are saying. There is no way to reliably and safely automatically cleanup legacy TXT records. Just try to explain how would you do that for all providers and different combinations with prefix, with suffix, and etc. One week is enough for you to came up with the idea and examples in other languages like python, we could add them to scripts?

One of the issues with legacy and current TXT formats is that, they created with the intention to be removed manually. Just go over issues and pull requests. There is a design flow in both of the formats. Example proposal for new formats #4624 (comment) to have a version. Without that, the only way to cleanup TXT records now, is to remove ALL records and recreate all of the records again.

If I have a solution, it will be already out, but I do not see one, and having dangling TXT records is not a risk but inconvenience

@mcharriere
Copy link
Contributor

mcharriere commented Mar 27, 2025

Later on, the old format will be dropped and only the new format will be kept (<record_type>-<endpoint_name>).

Cleanup will be done by controller itself.

The author of the new format saw the possibility of cleaning up the old registry records here: 25c7cb2

Although some of the code changed since then, this line still suggest the clean up can be done there.
I'm playing a bit with it but there are some edge cases that need further research.

@ivankatliarchuk
Copy link
Contributor Author

Ok, so in this case.

When you have different owner ids. So here is an issue. Legacy format has one owner ID, then for non legacy to have another owner ID, all records needs to be recreated...... Of course is easy to cleanup in that case but with the downtime.

@mcharriere
Copy link
Contributor

mcharriere commented Mar 27, 2025

Legacy format has one owner ID, then for non legacy to have another owner ID, all records needs to be recreated

which case is that one? changing owner id is not even supported by external-dns

EDIT:
Ah I think I understand your comment, the first file in that commit says something about owner-id. Please ignore and check the rest of the files, in particular docs/registry.md. The ‎docs/proposal/registry.md not really relevant for the discussion.

@ivankatliarchuk
Copy link
Contributor Author

I'm sorry, I'm not entirely clear on what's being asked. The issue is that I haven't discovered a reliable way to automate legacy TXT record deletions without downtimes. That's why it's not in this pull request. If someone knows how to automate this, nothing wrong to open a pull request with proposed functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants