[bugfix] Decode '+' as a literal plus in the xmldb URI functions (#1824) by joewiz · Pull Request #6451 · eXist-db/exist

joewiz · 2026-06-07T05:16:21Z

[This PR was co-authored with Claude Code. -Joe]

Summary

xmldb:decode and xmldb:decode-uri decoded a + to a space, so a name containing a literal + could not be round-tripped through xmldb:encode-uri / xmldb:decode-uri. Both functions routed through URIUtils.urlDecodeUtf8, which wraps java.net.URLDecoder and therefore applies application/x-www-form-urlencoded rules — where + means space. That is wrong for URI path components, which follow RFC 3986 (where + is a literal plus and only %20 is a space).

Fixes #1824 (and the long-closed #44, the original report of the same URLDecoder mistake).

What changed

URIUtils.decodeForURI(String) — new, the exact inverse of the existing encodeForURI(). It decodes %XX escapes (interpreting consecutive escapes as a UTF-8 byte sequence) and leaves every other character — including + — literal. decodeForURI(encodeForURI(s)) equals s for every s.
XMLDBURIFunctions — xmldb:decode / xmldb:decode-uri now route through decodeForURI instead of urlDecodeUtf8.
urlDecodeUtf8 is left unchanged for its other callers — only the user-facing xmldb: URI functions are switched, to keep the blast radius minimal.

Before / after

xmldb:decode-uri(xs:anyURI("a%2Bb"))   (: before: "a b"   after: "a+b" :)
xmldb:decode-uri(xs:anyURI("a+b"))     (: before: "a b"   after: "a+b" :)
xmldb:decode-uri(xs:anyURI("a%20b"))   (: before: "a b"   after: "a b" — unchanged :)

Scope note

XmldbURI.getCollectionPath() has the same URLDecoder pattern, but it is a central internal path used throughout storage. Changing it is deferred to the broader name/URI contract work (gated on the cross-surface conformance harness, PR A) rather than bundled into this focused bug fix.

Test plan

URIUtilsTest (JUnit) — decodeForURI for + (encoded and bare), space, percent (incl. literal %2F → %252F round-trip), unreserved, and 2/3/4-byte UTF-8; plus an encode → decode round-trip property over a corpus of awkward names (space, +, %, reserved/sub-delims, café, Cyrillic, CJK, multi-byte). 10/10 green.
xmldb/uri-encoding-tests.xql (XQSuite) — functional xmldb:decode-uri / xmldb:decode regression assertions (a%2Bb → a+b, a+b → a+b, a%20b → a b, and a mixed name). 5/5 green.
Codacy/PMD clean on the new code.
No existing test relied on the old +→space behavior.

xmldb:decode and xmldb:decode-uri routed through URIUtils.urlDecodeUtf8, which wraps java.net.URLDecoder and therefore follows application/x-www-form-urlencoded rules — turning '+' into a space. That broke round-tripping of names through xmldb:encode-uri / xmldb:decode-uri (eXist-db#1824, eXist-db#44): a name containing a literal '+' could not be recovered. Add URIUtils.decodeForURI(), the RFC 3986 inverse of the existing encodeForURI(): it decodes %XX escapes (interpreting consecutive escapes as a UTF-8 byte sequence) and leaves every other character — including '+' — literal, so decodeForURI(encodeForURI(s)) == s for every s. Route xmldb:decode / xmldb:decode-uri through it. urlDecodeUtf8 is left unchanged for its other callers. The same form-decoding pattern exists in XmldbURI.getCollectionPath(); it is a central internal path used throughout storage, so changing it is deferred to the broader name/URI contract work (gated on the cross-surface conformance harness) rather than bundled here. Tests: - URIUtilsTest: decodeForURI for '+', space, percent, unreserved, and 2/3/4-byte UTF-8, plus an encode/decode round-trip property over a corpus of awkward names. - xmldb/uri-encoding-tests.xql: xmldb:decode-uri / xmldb:decode regression assertions (a%2Bb -> a+b, a+b -> a+b, a%20b -> a b, and a mixed name). Closes eXist-db#1824 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

reinhapa · 2026-06-07T09:53:26Z

Hmm, what about using the Java URI as it is more precise for paths & handles encoding/decoding automatically? In case this does not have a negative performance implication, it would greatly reduce the existing code within the URIUtils class..

import java.net.URI; import java.net.URISyntaxException; String pathPart = "my path/with spaces+plus"; try { URI uri = new URI(null, null, pathPart, null, null); String encodedPath = uri.toString(); // Automatically encodes path System.out.println(encodedPath); // Output: my%20path/with%20spaces+plus // To decode: URI decodedUri = new URI(encodedPath); String decodedPath = decodedUri.getPath(); System.out.println(decodedPath); // Output: /my path/with spaces+plus } catch (URISyntaxException e) { e.printStackTrace(); }

Address review feedback on eXist-db#6451: explain in decodeForURI's javadoc why it is a standalone percent-decoder rather than new java.net.URI(s).getPath(), and add a test pinning the behavior. java.net.URI is unsuitable as a general decoder for the arbitrary strings xmldb:decode/xmldb:decode-uri accept: it throws URISyntaxException on inputs that are valid here (a literal space, a trailing or malformed '%', braces), and silently truncates at '?' and '#' (parsing the remainder as a URI query or fragment) -- losing data with no error. decodeForURI never throws and never truncates; any '%' not followed by two hex digits is preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

joewiz · 2026-06-07T16:24:49Z

[This response was co-authored with Claude Code. -Joe]

Thanks Patrick — you're right that java.net.URI keeps + literal in a path, which is the behavior this fix is after, so it's a fair thing to reach for. I tried it as the decoder and it doesn't hold up for the general case: xmldb:decode/xmldb:decode-uri are public functions called on arbitrary strings, and new URI(s).getPath() mishandles a number of inputs that this decoder must accept. Comparing the two on the same inputs:

input	`new URI(s).getPath()`	`decodeForURI`
`a%2Bb`	`a+b`	`a+b`
`a+b` (the #1824 case)	`a+b`	`a+b`
`a%20b`	`a b`	`a b`
`%C3%A9`	`é`	`é`
`a b` (literal space)	throws `URISyntaxException`	`a b`
`100%`	throws	`100%`
`a%ZZb`	throws	`a%ZZb`
`a{b}c`	throws	`a{b}c`
`a?b`	`a` (truncates — `?b` parsed as query)	`a?b`
`a#b`	`a` (truncates — `#b` parsed as fragment)	`a#b`

Two problems: it throws on ordinary inputs, and worse, it silently truncates at ? and # — data loss with no error. So I've kept the standalone percent-decoder, but addressed the spirit of your comment: I added a javadoc paragraph explaining exactly why java.net.URI isn't used here, and a test (decodeForURIRobustOnMalformedAndReservedInput) pinning each of the rows above so the next reader doesn't have to re-derive it.

On the encode side, URIUtils.encodeForURI is existing code and this PR is decode-only, so I'd rather not fold an encodeForURI rewrite into it — happy to look at that separately if it's worth doing.

Pushed as f0c3ec1.

reinhapa · 2026-06-07T18:14:04Z

The way you tested it is NOT the way I suggested. Use the following URI constructor, that takes every part of the URI separate. This will not throw an exception in our case...

jshell> new URI(null, null, null, 0, "a b", null, null)
$1 ==> a%20b

jshell> new URI(null, null, null, 0, "100%", null, null)
$2 ==> 100%25

jshell> new URI(null, null, null, 0, "a%ZZb", null, null)
$3 ==> a%25ZZb

jshell> new URI(null, null, null, 0, "a?b", null, null)
$4 ==> a%3Fb

jshell> new URI(null, null, null, 0, "a#b", null, null)
$5 ==> a%23b

joewiz · 2026-06-07T19:59:07Z

[This response was co-authored with Claude Code. -Joe]

Thanks Patrick — and you're right that the multi-argument constructor is a robust encoder: it won't throw on any of those, exactly as your examples show. The catch is direction. decodeForURI decodes (percent-encoded → literal: a%20b → a b), whereas new URI(null, null, null, 0, path, null, null) encodes (literal → percent-encoded: a b → a%20b) — the opposite transform. So it can't stand in for the decoder. Feeding it an already-encoded string doesn't decode it:

jshell> new URI(null, null, null, 0, "a%20b", null, null).getPath()
$1 ==> a%20b        // unchanged — not decoded to "a b"

jshell> new URI(null, null, null, 0, "a%20b", null, null).toString()
$2 ==> a%2520b      // double-encoded

The only java.net.URI route that actually decodes is the single-argument parse plus getPath() — which is the one in my table, and the one that throws on a b / 100% / a%ZZb and silently truncates a?b → a. Hence the standalone decoder for this direction.

Your multi-arg encoder is a good fit for the encode side, i.e. encodeForURI — which is existing code this PR doesn't touch, so I've left it out of scope here; happy to look at swapping that separately if it's worthwhile. Thanks for the approval and the careful review.

joewiz · 2026-06-07T20:02:25Z

@reinhapa Please let me know if Claude is still on the wrong track here and needs to be steered in the right direction. Sorry!

db:to-display decoded names with xmldb:decode-uri, which form-decodes "+" to a space (the x-www-form-urlencoded convention; eXist-db/exist#1824). But a "+" in a stored name is always a literal "+" -- spaces are stored as %20 -- and db:to-stored (fn:iri-to-uri) leaves "+" untouched on the encode side, so a name like "naïve+test.xml" stored correctly but read back as "naïve test.xml". Protect a literal "+" as %2B before xmldb:decode-uri so it decodes back to "+", restoring symmetry with the encode side. This mirrors what URIUtils.decodeForURI (the core fix in eXist-db/exist#6451) does, applied at the API layer so it is correct independent of the core build, and forward-compatible once #6451 lands. Spaces (%20) are unaffected. Verified end-to-end against a live instance: "naïve+test.xml" stores as na%C3%AFve+test.xml on disk, lists and reads back as naïve+test.xml. Adds the "+" case to the Cypress awkward-name coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

joewiz requested a review from a team as a code owner June 7, 2026 05:16

reinhapa requested changes Jun 7, 2026

View reviewed changes

reinhapa approved these changes Jun 7, 2026

View reviewed changes

This was referenced Jun 10, 2026

[bugfix] Speak decoded UTF-8 resource names on the wire eXist-db/existdb-openapi#54

Open

Define eXist's resource-naming contract: name vs URI (5 decisions) #6463

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[bugfix] Decode '+' as a literal plus in the xmldb URI functions (#1824)#6451

[bugfix] Decode '+' as a literal plus in the xmldb URI functions (#1824)#6451
joewiz wants to merge 2 commits into
eXist-db:developfrom
joewiz:bugfix/1824-plus-decode

joewiz commented Jun 7, 2026

Uh oh!

reinhapa Jun 7, 2026

Uh oh!

joewiz commented Jun 7, 2026

Uh oh!

reinhapa commented Jun 7, 2026

Uh oh!

joewiz commented Jun 7, 2026

Uh oh!

joewiz commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

joewiz commented Jun 7, 2026

Summary

What changed

Before / after

Scope note

Test plan

Related

Uh oh!

reinhapa Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

joewiz commented Jun 7, 2026

Uh oh!

reinhapa commented Jun 7, 2026

Uh oh!

joewiz commented Jun 7, 2026

Uh oh!

joewiz commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants