Skip to content

[bugfix] Decode '+' as a literal plus in the xmldb URI functions (#1824)#6451

Open
joewiz wants to merge 2 commits into
eXist-db:developfrom
joewiz:bugfix/1824-plus-decode
Open

[bugfix] Decode '+' as a literal plus in the xmldb URI functions (#1824)#6451
joewiz wants to merge 2 commits into
eXist-db:developfrom
joewiz:bugfix/1824-plus-decode

Conversation

@joewiz

@joewiz joewiz commented Jun 7, 2026

Copy link
Copy Markdown
Member

[This PR was co-authored with Claude Code. -Joe]

Summary

xmldb:decode and xmldb:decode-uri decoded a + to a space, so a name containing a literal + could not be round-tripped through xmldb:encode-uri / xmldb:decode-uri. Both functions routed through URIUtils.urlDecodeUtf8, which wraps java.net.URLDecoder and therefore applies application/x-www-form-urlencoded rules — where + means space. That is wrong for URI path components, which follow RFC 3986 (where + is a literal plus and only %20 is a space).

Fixes #1824 (and the long-closed #44, the original report of the same URLDecoder mistake).

What changed

  • URIUtils.decodeForURI(String) — new, the exact inverse of the existing encodeForURI(). It decodes %XX escapes (interpreting consecutive escapes as a UTF-8 byte sequence) and leaves every other character — including + — literal. decodeForURI(encodeForURI(s)) equals s for every s.
  • XMLDBURIFunctionsxmldb:decode / xmldb:decode-uri now route through decodeForURI instead of urlDecodeUtf8.
  • urlDecodeUtf8 is left unchanged for its other callers — only the user-facing xmldb: URI functions are switched, to keep the blast radius minimal.

Before / after

xmldb:decode-uri(xs:anyURI("a%2Bb"))   (: before: "a b"   after: "a+b" :)
xmldb:decode-uri(xs:anyURI("a+b"))     (: before: "a b"   after: "a+b" :)
xmldb:decode-uri(xs:anyURI("a%20b"))   (: before: "a b"   after: "a b" — unchanged :)

Scope note

XmldbURI.getCollectionPath() has the same URLDecoder pattern, but it is a central internal path used throughout storage. Changing it is deferred to the broader name/URI contract work (gated on the cross-surface conformance harness, PR A) rather than bundled into this focused bug fix.

Test plan

  • URIUtilsTest (JUnit) — decodeForURI for + (encoded and bare), space, percent (incl. literal %2F%252F round-trip), unreserved, and 2/3/4-byte UTF-8; plus an encode → decode round-trip property over a corpus of awkward names (space, +, %, reserved/sub-delims, café, Cyrillic, CJK, multi-byte). 10/10 green.
  • xmldb/uri-encoding-tests.xql (XQSuite) — functional xmldb:decode-uri / xmldb:decode regression assertions (a%2Bba+b, a+ba+b, a%20ba b, and a mixed name). 5/5 green.
  • Codacy/PMD clean on the new code.
  • No existing test relied on the old +→space behavior.

Related

xmldb:decode and xmldb:decode-uri routed through URIUtils.urlDecodeUtf8, which wraps
java.net.URLDecoder and therefore follows application/x-www-form-urlencoded rules —
turning '+' into a space. That broke round-tripping of names through
xmldb:encode-uri / xmldb:decode-uri (eXist-db#1824, eXist-db#44): a name containing a literal '+'
could not be recovered.

Add URIUtils.decodeForURI(), the RFC 3986 inverse of the existing encodeForURI(): it
decodes %XX escapes (interpreting consecutive escapes as a UTF-8 byte sequence) and
leaves every other character — including '+' — literal, so
decodeForURI(encodeForURI(s)) == s for every s. Route xmldb:decode / xmldb:decode-uri
through it. urlDecodeUtf8 is left unchanged for its other callers.

The same form-decoding pattern exists in XmldbURI.getCollectionPath(); it is a central
internal path used throughout storage, so changing it is deferred to the broader
name/URI contract work (gated on the cross-surface conformance harness) rather than
bundled here.

Tests:
- URIUtilsTest: decodeForURI for '+', space, percent, unreserved, and 2/3/4-byte UTF-8,
  plus an encode/decode round-trip property over a corpus of awkward names.
- xmldb/uri-encoding-tests.xql: xmldb:decode-uri / xmldb:decode regression assertions
  (a%2Bb -> a+b, a+b -> a+b, a%20b -> a b, and a mixed name).

Closes eXist-db#1824

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@joewiz joewiz requested a review from a team as a code owner June 7, 2026 05:16

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, what about using the Java URI as it is more precise for paths & handles encoding/decoding automatically? In case this does not have a negative performance implication, it would greatly reduce the existing code within the URIUtils class..

import java.net.URI;
import java.net.URISyntaxException;

String pathPart = "my path/with spaces+plus";
try {
    URI uri = new URI(null, null, pathPart, null, null);
    String encodedPath = uri.toString(); // Automatically encodes path
    System.out.println(encodedPath); // Output: my%20path/with%20spaces+plus

    // To decode:
    URI decodedUri = new URI(encodedPath);
    String decodedPath = decodedUri.getPath();
    System.out.println(decodedPath); // Output: /my path/with spaces+plus
} catch (URISyntaxException e) {
    e.printStackTrace();
}

Address review feedback on eXist-db#6451: explain in decodeForURI's javadoc why it
is a standalone percent-decoder rather than new java.net.URI(s).getPath(),
and add a test pinning the behavior.

java.net.URI is unsuitable as a general decoder for the arbitrary strings
xmldb:decode/xmldb:decode-uri accept: it throws URISyntaxException on inputs
that are valid here (a literal space, a trailing or malformed '%', braces),
and silently truncates at '?' and '#' (parsing the remainder as a URI query
or fragment) -- losing data with no error. decodeForURI never throws and
never truncates; any '%' not followed by two hex digits is preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@joewiz

joewiz commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

[This response was co-authored with Claude Code. -Joe]

Thanks Patrick — you're right that java.net.URI keeps + literal in a path, which is the behavior this fix is after, so it's a fair thing to reach for. I tried it as the decoder and it doesn't hold up for the general case: xmldb:decode/xmldb:decode-uri are public functions called on arbitrary strings, and new URI(s).getPath() mishandles a number of inputs that this decoder must accept. Comparing the two on the same inputs:

input new URI(s).getPath() decodeForURI
a%2Bb a+b a+b
a+b (the #1824 case) a+b a+b
a%20b a b a b
%C3%A9 é é
a b (literal space) throws URISyntaxException a b
100% throws 100%
a%ZZb throws a%ZZb
a{b}c throws a{b}c
a?b a (truncates — ?b parsed as query) a?b
a#b a (truncates — #b parsed as fragment) a#b

Two problems: it throws on ordinary inputs, and worse, it silently truncates at ? and # — data loss with no error. So I've kept the standalone percent-decoder, but addressed the spirit of your comment: I added a javadoc paragraph explaining exactly why java.net.URI isn't used here, and a test (decodeForURIRobustOnMalformedAndReservedInput) pinning each of the rows above so the next reader doesn't have to re-derive it.

On the encode side, URIUtils.encodeForURI is existing code and this PR is decode-only, so I'd rather not fold an encodeForURI rewrite into it — happy to look at that separately if it's worth doing.

Pushed as f0c3ec1.

@reinhapa

reinhapa commented Jun 7, 2026

Copy link
Copy Markdown
Member

The way you tested it is NOT the way I suggested. Use the following URI constructor, that takes every part of the URI separate. This will not throw an exception in our case...

jshell> new URI(null, null, null, 0, "a b", null, null)
$1 ==> a%20b

jshell> new URI(null, null, null, 0, "100%", null, null)
$2 ==> 100%25

jshell> new URI(null, null, null, 0, "a%ZZb", null, null)
$3 ==> a%25ZZb

jshell> new URI(null, null, null, 0, "a?b", null, null)
$4 ==> a%3Fb

jshell> new URI(null, null, null, 0, "a#b", null, null)
$5 ==> a%23b

@joewiz

joewiz commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

[This response was co-authored with Claude Code. -Joe]

Thanks Patrick — and you're right that the multi-argument constructor is a robust encoder: it won't throw on any of those, exactly as your examples show. The catch is direction. decodeForURI decodes (percent-encoded → literal: a%20ba b), whereas new URI(null, null, null, 0, path, null, null) encodes (literal → percent-encoded: a ba%20b) — the opposite transform. So it can't stand in for the decoder. Feeding it an already-encoded string doesn't decode it:

jshell> new URI(null, null, null, 0, "a%20b", null, null).getPath()
$1 ==> a%20b        // unchanged — not decoded to "a b"

jshell> new URI(null, null, null, 0, "a%20b", null, null).toString()
$2 ==> a%2520b      // double-encoded

The only java.net.URI route that actually decodes is the single-argument parse plus getPath() — which is the one in my table, and the one that throws on a b / 100% / a%ZZb and silently truncates a?ba. Hence the standalone decoder for this direction.

Your multi-arg encoder is a good fit for the encode side, i.e. encodeForURI — which is existing code this PR doesn't touch, so I've left it out of scope here; happy to look at swapping that separately if it's worthwhile. Thanks for the approval and the careful review.

@joewiz

joewiz commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

@reinhapa Please let me know if Claude is still on the wrong track here and needs to be steered in the right direction. Sorry!

joewiz added a commit to joewiz/existdb-openapi that referenced this pull request Jun 10, 2026
db:to-display decoded names with xmldb:decode-uri, which form-decodes "+"
to a space (the x-www-form-urlencoded convention; eXist-db/exist#1824).
But a "+" in a stored name is always a literal "+" -- spaces are stored as
%20 -- and db:to-stored (fn:iri-to-uri) leaves "+" untouched on the encode
side, so a name like "naïve+test.xml" stored correctly but read back as
"naïve test.xml".

Protect a literal "+" as %2B before xmldb:decode-uri so it decodes back to
"+", restoring symmetry with the encode side. This mirrors what
URIUtils.decodeForURI (the core fix in eXist-db/exist#6451) does, applied
at the API layer so it is correct independent of the core build, and
forward-compatible once #6451 lands. Spaces (%20) are unaffected.

Verified end-to-end against a live instance: "naïve+test.xml" stores as
na%C3%AFve+test.xml on disk, lists and reads back as naïve+test.xml. Adds
the "+" case to the Cypress awkward-name coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug in decoding of "+" in xmldb URI-related functions Coding of path components in URIs

2 participants