[bugfix] Decode '+' as a literal plus in the xmldb URI functions (#1824) #6451
joewiz wants to merge 2 commits into
Conversation
xmldb:decode and xmldb:decode-uri routed through URIUtils.urlDecodeUtf8, which wraps java.net.URLDecoder and therefore follows application/x-www-form-urlencoded rules, turning '+' into a space. That broke round-tripping of names through xmldb:encode-uri / xmldb:decode-uri (eXist-db#1824, eXist-db#44): a name containing a literal '+' could not be recovered.

Add URIUtils.decodeForURI(), the RFC 3986 inverse of the existing encodeForURI(): it decodes %XX escapes (interpreting consecutive escapes as a UTF-8 byte sequence) and leaves every other character, including '+', literal, so decodeForURI(encodeForURI(s)) == s for every s. Route xmldb:decode / xmldb:decode-uri through it.

urlDecodeUtf8 is left unchanged for its other callers.

The same form-decoding pattern exists in XmldbURI.getCollectionPath(); it is a central internal path used throughout storage, so changing it is deferred to the broader name/URI contract work (gated on the cross-surface conformance harness) rather than bundled here.

Tests:
- URIUtilsTest: decodeForURI for '+', space, percent, unreserved, and 2/3/4-byte UTF-8, plus an encode/decode round-trip property over a corpus of awkward names.
- xmldb/uri-encoding-tests.xql: xmldb:decode-uri / xmldb:decode regression assertions (a%2Bb -> a+b, a+b -> a+b, a%20b -> a b, and a mixed name).

Closes eXist-db#1824

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
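The decoding rule this commit describes (decode %XX escapes, treat a run of consecutive escapes as one UTF-8 byte sequence, leave every other character literal) can be sketched as below. This is a minimal illustration and not the code added by the PR: the class PercentDecoderSketch and its methods are hypothetical stand-ins for URIUtils.decodeForURI. Keeping a '%' that is not followed by two hex digits, and letting the String constructor replace malformed UTF-8 with U+FFFD, are assumptions of the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of an RFC 3986 style percent-decoder; not eXist-db code.
public class PercentDecoderSketch {

    // Value of an ASCII hex digit, or -1 if c is not one.
    public static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        ByteArrayOutputStream run = new ByteArrayOutputStream(); // pending escaped bytes
        int i = 0;
        while (i < s.length()) {
            boolean room = i + 2 < s.length();
            int hi = room ? hex(s.charAt(i + 1)) : -1;
            int lo = room ? hex(s.charAt(i + 2)) : -1;
            if (s.charAt(i) == '%' && hi >= 0 && lo >= 0) {
                run.write((hi << 4) | lo); // collect the byte; decode the run later
                i += 3;
            } else {
                flush(run, out);
                out.append(s.charAt(i));   // '+', malformed '%', anything else: literal
                i++;
            }
        }
        flush(run, out);
        return out.toString();
    }

    // Decode the collected bytes as UTF-8 and append them to the output.
    private static void flush(ByteArrayOutputStream run, StringBuilder out) {
        if (run.size() > 0) {
            out.append(new String(run.toByteArray(), StandardCharsets.UTF_8));
            run.reset();
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("a%2Bb"));  // a+b
        System.out.println(decode("a+b"));    // a+b
        System.out.println(decode("a%20b"));  // a b
        System.out.println(decode("100%"));   // 100%
    }
}
```

Buffering the escaped bytes before decoding is what lets a multi-byte sequence such as %C3%A9 come back as a single character instead of two.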
Hmm, what about using java.net.URI? It is more precise for paths and handles encoding and decoding automatically. If it has no negative performance implications, it would greatly reduce the existing code in the URIUtils class.
```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriPathExample {
    public static void main(String[] args) {
        String pathPart = "my path/with spaces+plus";
        try {
            // Multi-argument constructor (scheme, authority, path, query, fragment):
            // quotes illegal path characters; '+' is legal in a path and stays literal.
            URI uri = new URI(null, null, pathPart, null, null);
            String encodedPath = uri.toString();
            System.out.println(encodedPath); // Output: my%20path/with%20spaces+plus

            // To decode: parse the encoded form and read the decoded path.
            URI decodedUri = new URI(encodedPath);
            String decodedPath = decodedUri.getPath();
            System.out.println(decodedPath); // Output: my path/with spaces+plus
        } catch (URISyntaxException e) {
            e.printStackTrace();
        }
    }
}
```
Address review feedback on eXist-db#6451: explain in decodeForURI's javadoc why it is a standalone percent-decoder rather than new java.net.URI(s).getPath(), and add a test pinning the behavior.

java.net.URI is unsuitable as a general decoder for the arbitrary strings xmldb:decode/xmldb:decode-uri accept: it throws URISyntaxException on inputs that are valid here (a literal space, a trailing or malformed '%', braces), and silently truncates at '?' and '#' (parsing the remainder as a URI query or fragment) -- losing data with no error.

decodeForURI never throws and never truncates; any '%' not followed by two hex digits is preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
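The two failure modes named in this commit can be reproduced with the JDK alone. The sketch below is illustrative only: the class UriAsDecoderDemo and the helper viaUri are hypothetical names, and the marker string returned on failure is an invention of the sketch so that the loop can print every case.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo: what happens when java.net.URI is used as a general decoder.
public class UriAsDecoderDemo {

    // Parse s as a URI and return its decoded path, or a marker if parsing fails.
    public static String viaUri(String s) {
        try {
            return new URI(s).getPath();
        } catch (URISyntaxException e) {
            return "<URISyntaxException>";
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "a%2Bb", // well-formed escape: decodes to a+b
            "a b",   // literal space: rejected
            "100%",  // trailing '%': rejected
            "{x}",   // braces: rejected
            "a?b",   // '?' starts the query: path is just "a"
            "a#b"    // '#' starts the fragment: path is just "a"
        };
        for (String s : inputs) {
            System.out.println(s + " -> " + viaUri(s));
        }
    }
}
```

The last two cases are the more dangerous ones, since the caller receives a shortened string and no error.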
[This response was co-authored with Claude Code. -Joe] Thanks Patrick, you're right that [...]

Two problems: it throws on ordinary inputs and, worse, it silently truncates at '?' and '#'. On the encode side, [...] Pushed as [...]
The way you tested it is NOT the way I suggested. Use the following URI constructor, which takes every part of the URI separately. This will not throw an exception in our case...
[This response was co-authored with Claude Code. -Joe] Thanks Patrick, and you're right that the multi-argument constructor is a robust encoder: it won't throw on any of those, exactly as your examples show. The catch is direction. The only [...] Your multi-arg encoder is a good fit for the encode side, i.e. [...]
@reinhapa Please let me know if Claude is still on the wrong track here and needs to be steered in the right direction. Sorry!
db:to-display decoded names with xmldb:decode-uri, which form-decodes "+" to a space (the x-www-form-urlencoded convention; eXist-db/exist#1824). But a "+" in a stored name is always a literal "+" -- spaces are stored as %20 -- and db:to-stored (fn:iri-to-uri) leaves "+" untouched on the encode side, so a name like "naïve+test.xml" stored correctly but read back as "naïve test.xml".

Protect a literal "+" as %2B before xmldb:decode-uri so it decodes back to "+", restoring symmetry with the encode side. This mirrors what URIUtils.decodeForURI (the core fix in eXist-db/exist#6451) does, applied at the API layer so it is correct independent of the core build, and forward-compatible once #6451 lands. Spaces (%20) are unaffected.

Verified end-to-end against a live instance: "naïve+test.xml" stores as na%C3%AFve+test.xml on disk, lists and reads back as naïve+test.xml. Adds the "+" case to the Cypress awkward-name coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
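The "protect the plus first" trick in this commit lives in XQuery, but the idea carries over to any form-decoder. Below is a rough Java analogue, assuming java.net.URLDecoder stands in for the form-decoding done by xmldb:decode-uri; the class ProtectPlusSketch and its method names are hypothetical and do not appear in either repository.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical Java analogue of the XQuery-level workaround; not project code.
public class ProtectPlusSketch {

    // Form decoding: '+' becomes a space (URLDecoder.decode(String, Charset) needs Java 10+).
    public static String formDecode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    // Rewrite every literal '+' as %2B so the form-decoder turns it back into '+'.
    public static String decodeKeepingPlus(String stored) {
        return formDecode(stored.replace("+", "%2B"));
    }

    public static void main(String[] args) {
        String stored = "na%C3%AFve+test.xml";
        System.out.println(formDecode(stored));        // plus is lost: read back with a space
        System.out.println(decodeKeepingPlus(stored)); // plus survives
    }
}
```

The substitution is safe on stored names because, as the commit notes, spaces are stored as %20, so a "+" there never stands for a space.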
[This PR was co-authored with Claude Code. -Joe]
Summary
`xmldb:decode` and `xmldb:decode-uri` decoded a `+` to a space, so a name containing a literal `+` could not be round-tripped through `xmldb:encode-uri` / `xmldb:decode-uri`. Both functions routed through `URIUtils.urlDecodeUtf8`, which wraps `java.net.URLDecoder` and therefore applies `application/x-www-form-urlencoded` rules, where `+` means space. That is wrong for URI path components, which follow RFC 3986 (where `+` is a literal plus and only `%20` is a space).

Fixes #1824 (and the long-closed #44, the original report of the same `URLDecoder` mistake).

What changed

- `URIUtils.decodeForURI(String)`: new, the exact inverse of the existing `encodeForURI()`. It decodes `%XX` escapes (interpreting consecutive escapes as a UTF-8 byte sequence) and leaves every other character, including `+`, literal. `decodeForURI(encodeForURI(s))` equals `s` for every `s`.
- `XMLDBURIFunctions`: `xmldb:decode` / `xmldb:decode-uri` now route through `decodeForURI` instead of `urlDecodeUtf8`.

`urlDecodeUtf8` is left unchanged for its other callers; only the user-facing `xmldb:` URI functions are switched, to keep the blast radius minimal.

Before / after
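The "before" behaviour can be reproduced with the JDK alone by pairing an RFC 3986 style path encoder with a form decoder. This is a sketch under stated assumptions: the multi-argument java.net.URI constructor stands in for the encode side and java.net.URLDecoder for the old decode side. The class RoundTripBeforeDemo and its methods are hypothetical and are not the eXist-db implementations.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the round-trip mismatch; not eXist-db code.
public class RoundTripBeforeDemo {

    // Path encoding via the (scheme, authority, path, query, fragment) constructor.
    public static String encodePath(String path) {
        try {
            return new URI(null, null, path, null, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Form decoding, where '+' means space (Java 10+ overload).
    public static String formDecode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    // True when decoding the encoded name gives the original name back.
    public static boolean roundTrips(String name) {
        return formDecode(encodePath(name)).equals(name);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"a b", "a+b"}) {
            System.out.println("'" + name + "' encodes to '" + encodePath(name)
                    + "', round-trips: " + roundTrips(name));
        }
    }
}
```

A name with a space survives because both conventions agree on %20. A name with a plus does not, because the encoder leaves it literal and the form decoder then reads it as a space.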
Scope note
`XmldbURI.getCollectionPath()` has the same `URLDecoder` pattern, but it is a central internal path used throughout storage. Changing it is deferred to the broader name/URI contract work (gated on the cross-surface conformance harness, PR A) rather than bundled into this focused bug fix.

Test plan
- `URIUtilsTest` (JUnit): `decodeForURI` for `+` (encoded and bare), space, percent (incl. literal `%2F` → `%252F` round-trip), unreserved, and 2/3/4-byte UTF-8; plus an `encode → decode` round-trip property over a corpus of awkward names (space, `+`, `%`, reserved/sub-delims, `café`, Cyrillic, CJK, multi-byte). 10/10 green.
- `xmldb/uri-encoding-tests.xql` (XQSuite): functional `xmldb:decode-uri` / `xmldb:decode` regression assertions (`a%2Bb` → `a+b`, `a+b` → `a+b`, `a%20b` → `a b`, and a mixed name). 5/5 green.
- [...] `+` → space behavior.

Related