xml: add preserve_namespaces option to retain XML namespace prefixes#4318
Open
ankit481 wants to merge 1 commit intoredpanda-data:mainfrom
Open
xml: add preserve_namespaces option to retain XML namespace prefixes#4318ankit481 wants to merge 1 commit intoredpanda-data:mainfrom
ankit481 wants to merge 1 commit intoredpanda-data:mainfrom
Conversation
The default XML to JSON conversion drops namespace prefixes on element and attribute names (e.g. `<dc:title>` becomes the key `title`) because the underlying github.com/clbanning/mxj library uses only Name.Local. This makes the original XML impossible to reconstruct from the JSON. Adds an opt-in `preserve_namespaces` flag to both the `xml` processor and the `parse_xml` bloblang method. When enabled, parsing routes through a small stdlib-only walker that keeps a URI->prefix map from `xmlns:*` declarations as it descends, so element and attribute keys keep their `prefix:local` form and `xmlns:*` declarations are emitted as `-xmlns:*` attributes. Default behaviour is unchanged. Fixes redpanda-data#3928.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #3928.
The
xmlprocessor andparse_xmlbloblang method drop namespace prefixes on element and attribute names during XML→JSON conversion (<dc:title>becomes the keytitle), which makes the original XML impossible to reconstruct from the JSON output. The root cause is thatgithub.com/clbanning/mxjalways usesxml.Name.Local, so there is no flag on the existing call path that preserves prefixes.This PR adds an opt-in
preserve_namespacesboolean to both surfaces (processor field + bloblang param, defaultfalse). When enabled, parsing routes through a small stdlib-only walker ininternal/impl/xml/package.gothat:xmlns:*declarations on each element.prefix:localwhen the namespace URI is known, falling back to the raw prefix (which Go'sxml.Decoderleaves inName.Spacewhen a prefix is used but never declared).xmlns:*declarations as-xmlns:prefixattributes so the original XML is reconstructable.-attrfor attributes,#textfor mixed content, arrays for repeated elements), and mirrors mxj's cast order (int → uint → float → bool) for thecastoption.Default behaviour is unchanged, so this is non-breaking for existing users.
Why not wait for upstream Go
The issue links Go CL 355353, which would add namespace-prefix retention to
encoding/xml. That CL has been stalled for ~4 years. We don't actually need it:xml.Decoderalready exposes the resolved URI inName.Spaceand reportsxmlns:*declarations as attributes, so a ~70 LOC walker can reconstruct prefixes today.PoC results — validation on the issue's exact input
Before implementing, I validated three paths on the issue's XML:
Path 1 — current
mxj.NewMapXml(lossy, reproduces bug):{ "root": { "-dc": "http://my.namespace/dc", "-ot": "http://my.namespace/ot", "description": { "#text": "This is a description", "-tone": "boring" }, "elements": [ { "#text": "foo1", "-id": "1" }, { "#text": "foo2", "-id": "2" }, "foo3" ], "title": "This is a title" } }Element prefixes dropped; the
xmlns:on the declarations is also lost (attributes appear as-dc/-otrather than-xmlns:dc/-xmlns:ot).Path 2 —
mxj.NewMapXmlSeq(2-LOC change, but heavy output shift):{ "root": { "#attr": { "xmlns:dc": { "#seq": 0, "#text": "http://my.namespace/dc" }, "xmlns:ot": { "#seq": 1, "#text": "http://my.namespace/ot" } }, "dc:description": { "#attr": { "tone": { "#seq": 0, "#text": "boring" } }, "#seq": 1, "#text": "This is a description" }, "dc:title": { "#seq": 0, "#text": "This is a title" }, "ot:elements": [ { "#attr": { "id": { "#seq": 0, "#text": "1" } }, "#seq": 2, "#text": "foo1" }, { "#attr": { "id": { "#seq": 0, "#text": "2" } }, "#seq": 3, "#text": "foo2" }, { "#seq": 4, "#text": "foo3" } ] } }Preserves prefixes but wraps every element in
#attr/#seqmetadata — would require every existing consumer to adapt to a new output shape.Path 3 — custom stdlib walker (this PR):
{ "root": { "-xmlns:dc": "http://my.namespace/dc", "-xmlns:ot": "http://my.namespace/ot", "dc:description": { "#text": "This is a description", "-tone": "boring" }, "dc:title": "This is a title", "ot:elements": [ { "#text": "foo1", "-id": "1" }, { "#text": "foo2", "-id": "2" }, "foo3" ] } }Prefixes preserved, output shape identical to the current default except for the added
dc:/ot:/-xmlns:*keys. Round-trips cleanly to XML.Tests
Added to
internal/impl/xml/processor_test.go:TestXMLPreserveNamespaces— 5 sub-cases:preserve_namespaces: trueon XML with no namespaces is a no-op vs. default output.xmlns:aat two levels mapping to different URIs).xsi:type).xmlnsdeclaration stays literal.TestXMLPreserveNamespacesWithCast— confirmscast: trueapplies to element and attribute values under the new path.TestXMLDefaultStripsNamespacesUnchanged— regression guard: omitting the flag keeps the current lossy behaviour byte-for-byte.Added to
internal/impl/xml/bloblang_test.go:parse_xml(preserve_namespaces: true)round-trip.parse_xml()without the flag still strips prefixes (opt-in guard).Full test run:
All existing tests (
TestXMLCases,TestXMLWithCast,TestParseXML) continue to pass unchanged.Test plan
go test ./internal/impl/xml/passesgo vet ./internal/impl/xml/cleango build ./internal/impl/xml/cleanpreserve_namespacesomitted orfalse) output byte-identical to previous behaviourpreserve_namespaces: true