[feature] Resource-naming contract prototype: decisions 1, 2, 3, 5 (draft illustration for #6463)#6508
Draft
joewiz wants to merge 12 commits into
Draft
Conversation
Add the canonical store/display codec for the resource-naming contract (eXist-db#6463, decision 2): URIUtils.encodeForURILenient on store and URIUtils.decodeForURI on display, plus path-level helpers. The lenient encoder leaves sub-delimiters literal (it's.xml, a+b.xml -- the eXide#824 direction) but always escapes a literal percent to %25, which is the single change that makes the encoding bijective so a literal % in a name round-trips. ResourceNameCodecTest is the keystone proof: it round-trips the full corpus (spaces, Unicode, sub-delimiters, and the literal-% cases 50%.xml, a%20b.xml, %25.xml), pins each canonical stored form, and contrasts with today's store boundary (FunEscapeURI.escape + new URI), which silently corrupts a%20b -> 'a b' and rejects a bare 5%done. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surface 1 of the resource-naming contract (eXist-db#6463): route the decoded name argument of xmldb:store and xmldb:create-collection through URIUtils.encodeForURILenient/encodePathForURILenient, so a literal % round-trips (stored as %25), distinct names never collide, and a raw space is accepted (decision 3) instead of throwing FORG0001. The downstream createResource / createCollection re-escape is idempotent on valid percent-escapes, so no double-encoding. ResourceNamingXmldbRoundTripTest proves it live against the embedded database: the awkward-name corpus (space, Unicode, sub-delimiters incl. &, and the literal -% cases) stores, lists back via get-child-resources/-collections, and every stored key equals encodeForURILenient(name) and decodes back to the name. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surface 2 of the resource-naming contract (eXist-db#6463, decision 5): DocUtils.getDocumentByPathFromDB and ExtCollection.getCollectionItems now resolve a bare db-path with escape=true, the same idempotent normalization xmldb:store applies. So doc('/db/x/café.xml') and collection('/db/café-col') resolve the key xmldb:store wrote, ending the store-normalizes / doc-doesn't asymmetry. escape=true leaves a literal '%' alone, so an already-encoded path (e.g. from xmldb:encode-uri) still resolves -- the change is additive for callers that pre-encode. Verified live (ResourceNamingXmldbRoundTripTest): a non-percent decoded name resolves directly via doc(); a literal-% name remains addressable only by its %25 stored form (decision 2 boundary); collection() resolves a decoded non-ASCII path. DocTest, CollectionTest (storage + xmldb), and XPathQueryTest (150) stay green. Boundaries deferred: raw space on the doc() READ path still throws FODC0005 at the dynamically-available-documents URI check (decision 3, read side); and a full XQTS run remains the pre-merge gate for this core change, per the issue. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surface 3 of the resource-naming contract (eXist-db#6463): XML-RPC is a decoded-string protocol (names travel as plain method-argument strings, not URL percent-encoded), so it must return names in decoded display form. RpcConnection's listing/describe methods (getCollectionDesc, describeCollection, describeResource, getCollectionListing, getDocumentListing) now decodeForURI the stored name before returning it, fixing the caf%C3%A9 display seen by XML-RPC clients (e.g. the basis of the Oxygen-plugin display issue). XmlRpcTest stays green (46/46); testCollectionWithAccentsAndSpaces is updated to expect the decoded child name (the contract), not the stored form. Scope: this is the read side. XML-RPC write-side bijective handling of a literal % (routing incoming names through encodeForURILenient across RpcConnection's many store entry points, as xmldb:store now does) is a follow-up; accented/spaced/ bracketed names already round-trip write+read, per the test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surfaces 4-6 of the resource-naming contract (eXist-db#6463). Per decision 1, REST and WebDAV keep percent-encoding on the wire, so they need no decode-on-read; their requirement is round-trip consistency and a stored key that agrees with the other surfaces. Bring PR eXist-db#6450's cross-surface conformance harness onto this branch (stores via WebDAV, reads back via WebDAV AND REST) and extend it with a stored-form ORACLE: every stored key must equal URIUtils.encodeForURILenient(requestedName) -- the same encoder xmldb:store/create-collection now apply. Running that oracle surfaced a real divergence and fixed it: REST/WebDAV (via the client's standard URL encoding) store '#' as %23 because it is the URL fragment delimiter, but encodeForURILenient kept '#' literal -- so xmldb:store would have stored a#b.xml while REST stored a%23b.xml, unreachable across surfaces. Align encodeForURILenient to the RFC 3986 'pchar' set: keep unreserved + sub-delimiters + ':' '@' literal, and percent-encode '#' '?' '[' ']' (plus space, '/', '%', non-ASCII) as a URL path segment must. With that, the oracle passes for the full corpus: every surface converges on one key, and literal-% (a%25b) stays distinct from encoded-space (a%2520b). Tests (all green): ResourceNameCodecTest 8/8 (now incl. # ? [ ]), ResourceNamingXmldbRoundTripTest 4/4, ResourceNamingConformanceTest 1/1 (REST+WebDAV + stored-form oracle). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st FS-hostile chars
A live store probe revealed eXist's XmldbURI cannot store a literal ':' -- it wraps
java.net.URI, and new URI("a:b.xml") parses 'a' as a scheme. So ':' is RFC 3986
pchar but eXist-unstorable; encodeForURILenient now encodes it as %3A. '*' and the
non-pchar set (< > " \ |) store fine (' * ' literal, the rest percent-encoded).
ResourceNameCodecTest gains the filename-hostile set (< > " \ | : *), documenting
that the codec encodes all but '*' (and notes ':'/'*' are macOS/Windows-illegal in a
host filename -- a backup/export concern, not a native-store one, since eXist's store
is page-based, not one-file-per-resource). ResourceNamingXmldbRoundTripTest adds a
live probe confirming these round-trip through the native store.
Tests green: ResourceNameCodecTest 8/8, ResourceNamingXmldbRoundTripTest 5/5,
ResourceNamingConformanceTest 1/1 (capstone unchanged: its corpus has no ':').
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend ResourceNamingConformanceTest's corpus with the characters macOS/Windows reject in filenames: ':' '*' '<' '>' '|' '"'. Run on macOS (mirrors the CI integration job: this webdav *Test runs under 'verify -DskipUnitTests=true' on all three OSes -- webdav does NOT gate surefire on skipUnitTests, unlike exist-core). Results: '*' '<' '>' '|' '"' round-trip and their REST/WebDAV stored form matches the canonical codec (stored-form oracle passes). ':' does NOT round-trip on any surface -- a WebDAV PUT of '.../a:b.xml' returns HTTP 500 and xmldb:store rejects it, because XmldbURI wraps java.net.URI which reads 'a:' as a scheme. Recorded 'colon' in KNOWN_FAILURES with the rationale; the canonical codec encodes ':' to %3A so the name is addressable as '.../a%3Ab.xml'. NOT renaming the harness to *IT: extensions/webdav declares no failsafe plugin and surefire's default includes don't match *IT, so an *IT there runs under neither runner (the module's WindowsPathResourceIT is already dead). The *Test already runs cross-OS; renaming would remove it from CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the stored-form convergence check out of crossSurfaceNamingConformance into a storedFormMismatches(rows) helper, clearing the PMD NPathComplexity flag (960) that the added oracle loop introduced. No behavior change; ResourceNamingConformanceTest stays green (1/1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sion 5) Surface 2 of the resource-naming contract (eXist-db#6463, decision 5), extended from fn:doc()/fn:collection() to the xmldb: module's collection functions. The shared XMLDBAbstractCollectionManipulator.getLocalCollection now resolves the collection path with escape=true -- the same codec XmldbURI.create and xmldb:store apply when a collection is written -- instead of new AnyURIValue(name).toXmldbURI(), which resolved with escape=false. So a caller's decoded or descriptor-derived literal path resolves the percent-encoded key that was stored, and an awkward name no longer either misses or throws on a raw illegal character. escape=true leaves a literal '%' alone, so it is idempotent on an already-encoded path (e.g. the internal collection URI passed by the node branch of eval(), or a caller path produced by xmldb:encode-uri). Motivating manifestation: Deployment mints a package's /db/system/repo collection as abbrev + "-" + version via XmldbURI.create, so a package whose version contains '{'/'}' -- e.g. an unsubstituted "${app.version}" -- installs under /db/system/repo/<abbrev>-$%7B...%7D. Before this change xmldb:collection-available returned false and xmldb:get-child-resources threw for the descriptor-derived literal /db/system/repo/<abbrev>-${app.version}; the collection was reachable only by its encoded form. Scope is the address side only: the minting sites are unchanged (every stored form is identical) and there is no install-time validation -- both are noted as follow-ups for the contract discussion. XmldbCollectionAddressByDecodedNameTest stores an ${...}-style awkward-named collection and addresses it by the decoded literal via xmldb:collection-available and xmldb:get-child-resources; it fails on the escape=false baseline and passes with this change. The write-boundary round-trip (ResourceNamingXmldbRoundTripTest) and store tests stay green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ad path (decision 3) Closes the read-side half of decision 3 (eXist-db#6463): xmldb:store accepts a raw space in a resource name, but reading it back by the decoded db-path still failed. fn:doc and fn:doc-available validate the path by parsing it as a java.net.URI before resolution, and a raw space (a valid resource name, not a valid URI) made that parse throw -- so the path never reached the decision-5 normalization (xmldbUriFor(path, true)) that escapes the space. There were three parse points, all now scoped to defer to DB normalization for a db-path (a path under /db or an xmldb: URI) while preserving the spec-mandated FODC0005 for any other malformed URI: - DocUtils.getFromDynamicallyAvailableDocuments: on URISyntaxException, a db-path cannot be a key in the dynamically-available-documents map (keyed by valid URIs), so return null and let the DB branch normalize and resolve it. - DocUtils.resolveAgainstBaseUri: also catch IllegalArgumentException from URI.create(relativePath); the caller passes the original path (not the resolved form) to the DB branch, which normalizes it. - FunDocAvailable.eval: the standalone new URI(path) guard defers to DocUtils for a db-path instead of throwing. The shared predicate is DocUtils.isDbPath. A genuinely malformed, non-db URI (qt3tests fn-doc-17 "%gg", K2-SeqDocFunc-7/8/9 Windows backslash paths, K2-SeqDocFunc-14 ":/") still raises FODC0005. DocTest gains doc_resolvesRawSpaceDbPathOnRead (the raw space now resolves via doc() and doc-available()) and doc_malformedNonDbUriStillRaisesFODC0005 (the conformance cases stay FODC0005). ResourceNamingXmldbRoundTripTest's doc() resolution oracle is updated: a decoded name without a literal '%' now resolves, including a raw space; only literal '%' remains a decision-2 boundary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ecision 2, read side) Stops deferring the literal-% read case (eXist-db#6463, decision 2). The write side was already bijective -- encodeForURILenient escapes a literal '%' to %25, so "50%.xml" stores distinctly as "50%25.xml" with no collision or data loss. This closes the read side: doc() now canonicalizes a db-path by decode-then-encode (decodePathForURI then encodePathForURILenient) instead of escape-only, so it resolves the stored key for both a decoded display name and an already-encoded path: - doc("/db/x/cafe.xml") -> caf%C3%A9.xml (decoded name, decision 5) - doc("/db/x/caf%C3%A9.xml") -> caf%C3%A9.xml (pre-encoded path, idempotent) - doc("/db/x/50%.xml") -> 50%25.xml (literal '%', NEW) - doc("/db/x/50%25.xml") -> 50%25.xml (encoded stored form, idempotent) decodeForURI is the exact inverse of encodeForURILenient and never throws or truncates, so decode-then-encode maps every form of a name to its one canonical stored key. The irreducible residual is the single-decoded-wire ambiguity: a name that literally IS a valid escape ("a%20b.xml") is read as its decoded meaning ("a b.xml"), so it is addressed by its encoded form, not that literal string. DocTest.doc_resolvesLiteralPercentNameByDecodedForm demonstrates the store (no data loss) and the read by both decoded and encoded form. ResourceNamingXmldbRoundTripTest's doc() oracle is now exact: a name resolves by its decoded form iff decodeForURI(name) == name (decoding is a no-op), which holds for every normal name and for a literal '%' that is not itself a valid escape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…decision 2, all surfaces) Extends the doc() decode-then-encode canonicalization to the remaining read surfaces so a literal '%' (and any decoded/descriptor-derived name) resolves consistently everywhere, not just in fn:doc (eXist-db#6463, decisions 1+2+5): - fn:collection (ExtCollection.eval): canonicalize a bare db-path before asUri, so collection("/db/x/50%done") reaches the 50%25done key. asUri's new URI(path) rejected a literal '%' (malformed escape) before; a scheme-ful URI (xmldb:, file:, ...) is left untouched so its scheme/authority are not rewritten. - xmldb: collection reads (XMLDBAbstractCollectionManipulator.getLocalCollection): upgrade the earlier escape=true normalization to decode-then-encode, so xmldb:collection-available / get-child-resources / get-child-collections resolve a literal-% collection by its decoded name, matching fn:doc / fn:collection. decodeForURI is the exact inverse of encodeForURILenient, so each is idempotent on an already-encoded path. XmldbCollectionAddressByDecodedNameTest gains literalPercentCollectionResolvesByDecodedName (xmldb:collection-available and fn:collection over a literal-% collection). XPathQueryTest (150) stays green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
[This PR was prompted by Joe, drafted by Claude Code, and reviewed by Joe.]
Why this is a draft
This PR is an illustration, not a merge request. It implements the recommended answers in #6463 as running, tested code so that the contract discussion (which @dizzzz rightly flagged as complex, and worth a community call or workshop) has a concrete artifact to react to: you can see exactly what each decision costs, what it changes, and where the boundaries are. It is intentionally a draft; if the discussion wants it as a real merge candidate, we can un-draft it and add the remaining pre-merge gate (a full XQTS run).
It merges cleanly onto current
developtoday (no conflicts).Which decisions of #6463 it reflects
RpcConnection.java, conformance test%(the non-injective / silent-collision defect)%fallout in one branch). A bijective lenient codec (encodeForURILenient/decodeForURI) escapes%→%25at the store boundary, socafé.xmland the literalcaf%C3%A9.xmlno longer collide; and the read path resolves a literal-%name by its decoded form (doc("/db/x/50%.xml")→50%25.xml) acrossdoc(),collection(), and thexmldb:readsURIUtils.java,DocUtils.java,ExtCollection.java,XMLDBAbstractCollectionManipulator.java, codec test/and control/NULxmldb:store/create-collectionaccept a raw space (and other Unicode) instead of throwingFORG0001— and the read path now resolves a raw-space db-path too (doc()/doc-available()), closing the write-accepts/read-rejects gapXMLDBStore.java,URIUtils.java,DocUtils.java,FunDocAvailable.javadoc()/collection()normalize a bare db-path the waystoredoesdoc()andcollection()canonicalize a bare db-path to its stored key (decode-then-encode, idempotent on already-encoded paths); extended to thexmldb:collection read functions, so a decoded or descriptor-derived literal path resolves the stored keyDocUtils.java,ExtCollection.java,XMLDBAbstractCollectionManipulator.javaDecision 4 (guard non-
/dbpaths at the API surface, leave core alone) is deliberately not in this PR — it lives at the application-API layer (e.g., existdb-openapi), per the issue.The codec is proven to converge across surfaces: a stored-form oracle asserts every surface stores exactly
encodeForURILenient(name)and decodes back to the original, over an awkward-name corpus (space, non-ASCII, sub-delimiters, parentheses, apostrophe, literal%, and filesystem-hostile chars).Benefits
store-normalizes /doc-doesn't asymmetry.doc($col || "/" || $name)afterxmldb:store($col, $name, …)now resolves for any name — no moredoc(xmldb:encode-uri($n))dance.café.xmlvs literalcaf%C3%A9.xml→ same key, one destroyed) becomes bijective; distinct names get distinct keys.xmldb:store/create-collection,doc(),collection(), thexmldb:read functions, and XML-RPC — verified converging, not asserted.FORG0001surprise) and resolvable on the read path (doc("/db/x/with space.xml")afterxmldb:store("/db/x", "with space.xml", …)now hits).%is fully handled — stored without data loss and resolvable by its decoded name on read (doc(),collection(),xmldb:reads), not postponed.Is it a breaking change? What would a developer need to adjust?
No data migration. The stored form of every name that is creatable on
developtoday is unchanged — existing databases are unaffected. The only names whose stored form differs are ones that previously could not be stored at all (a raw space →FORG0001) or that were silently colliding (literal%).The behavior changes a developer might notice, and how to adjust:
RpcConnectionlisting results and decoded them itself, or compared them against the percent-encoded form, will see the already-decoded name. Adjust: treat XML-RPC resource/collection names as raw decoded UTF-8 (the same contract REST clients follow after decoding the wire form).xmldb:store/xmldb:create-collectionno longer throwFORG0001on a raw space (decision 3) and now store a literal-%name bijectively (decision 2). Code that relied on the throw to reject such names will no longer get the exception. Adjust: if rejection is intended, validate the name explicitly instead of depending on the codec to throw.%names store and resolve distinctly.a%20b.xmlnow stores asa%2520b.xmlinstead of colliding witha b.xml, and a literal-%name resolves by its decoded form on read (see the full-fallout section). This affects only names that were previously unsafe or data-losing — no correctly-stored data moves. The one caveat is the residual ambiguity above: a name that literally is a valid escape is addressed by its encoded form.doc()/collection()/xmldb:reads may now resolve a decoded path that previously missed. This is strictly additive for callers that currently get an empty result and want a hit; the honest risk is code that branched on the miss (treated empty as "absent"). Adjust: none expected, but the full XQTS run is the gate to confirm nothing leaned on the old miss.On read-path validity: the
doc()/doc-available()read path previously threwFODC0005on a raw-space db-path (three separatejava.net.URIparse points). All three now defer to the DB normalization for db-paths (/dborxmldb:), so a raw-space name resolves — while any genuinely malformed non-db URI (%gg,:/, a Windows backslash path) still raisesFODC0005, exactly as the qt3testsfn/doccases require.Literal
%— the full fallout (per @duncdrum's request)The branch handles
%rather than postponing it, so the trade-off is visible in one place rather than deferred to a Phase 2.encodeForURILenientescapes a literal%→%25, so a name containing%and a name that is a percent-escape get distinct stored keys. The old non-injective encoding silently overwrote one with the other; that is gone.decodeForURIthenencodeForURILenient— treating it as decoded UTF-8 (decision 1) and re-encoding to the stored key. BecausedecodeForURIis the exact inverse ofencodeForURILenient, this maps both a decoded display name and an already-encoded path to the same key, so pre-encoding callers (doc(xmldb:encode-uri($n)), TEI Publisher) keep working unchanged.a%20b.xml,%25.xml) is read as its decoded meaning (a b.xml,%.xml), so it is addressed by its encoded form, not that literal string. This is inherent to a single decoded-wire contract (you cannot havea%20b.xmlmean both "the literal name" and "a b.xml" on the same wire). It is pinned exactly by the round-trip oracle: a name resolves by its decoded form iffdecodeForURI(name) == name— i.e. every normal name, plus a literal%that is not itself a valid escape (50%.xml,100%done%.xmlresolve;a%20b.xmlis reached by its encoded form).No stored data migrates: the only names whose stored form is new are ones that previously could not be stored without collision (a literal
%) — there was no correct prior form to migrate from.Tests
ResourceNameCodecTest— the codec in isolation: bijective round-trip of the full corpus incl.50%.xml/a%20b.xml/%25.xml, pins canonical stored forms, demonstrates the legacy collision/rejection.ResourceNamingXmldbRoundTripTest— thexmldb:write boundary against a real database: every stored key equalsencodeForURILenient(name)and decodes back.XmldbCollectionAddressByDecodedNameTest— thexmldb:read boundary: an${…}-style awkward-named collection is addressable by its decoded literal (fails on the pre-fix baseline, passes here), plusliteralPercentCollectionResolvesByDecodedName(a literal-%collection resolves viaxmldb:collection-availableandfn:collection).DocTest— read-path raw space and literal%:doc_resolvesRawSpaceDbPathOnRead,doc_resolvesLiteralPercentNameByDecodedForm(store50%.xml→50%25.xmlwith no data loss, then resolve by both the decoded and the encoded form), anddoc_malformedNonDbUriStillRaisesFODC0005(the qt3tests garbage cases stayFODC0005).XPathQueryTest(150) — regression guard for the sharedcollection()/getLocalCollectionchanges: stays green.ResourceNamingConformanceTest(webdav module, cross-surface) — the stored-form oracle: REST/WebDAV/XML-RPC converge on one key; documents the remainingKNOWN_FAILURES(:colon, etc.) with rationale.XmlRpcTest— XML-RPC read decode.Pre-merge gate not yet run: a full XQTS pass (the issue's stated condition for the decision-5 core change). This is why it's a draft.
What this deliberately leaves for follow-up
Nothing in decisions 2, 3, or 5 is deferred. What this branch intentionally does not include, so the scope stays reviewable:
xmldb:get-child-resources/-collectionsreturn. These now resolve a decoded path, but still return the encoded stored names (caf%C3%A9.xml, notcafé.xml) — the long-standing//TODO: decode names?. Changing the default return would break every caller that expects the encoded form, so the right shape is an opt-in$decodeargument (already prototyped on a separate branch). Until then, callers decode the result themselves, exactly as eXide does (xmldb:decode-uri).storeResource,createCollection, …) accepting decoded names is the symmetric follow-up.:) and the otherKNOWN_FAILURES.a:b.xmlstill does not round-trip cross-surface —XmldbURIwrapsjava.net.URI, wherea:parses as a scheme. This is recorded in the conformance ratchet, not silently broken; a real fix needs the storage-layer change below.%-that-is-a-valid-escape residual.a%20b.xmlis read as its decoded meaning (a b.xml); see the full-fallout section. Inherent to a single decoded-wire contract, not a TODO./dbpath) — by design this lives at the application-API layer (existdb-openapi), not core.collections.dbx(no encoding at the persistence layer at all), which would also retire the colon residual. Large; captured for direction only, per the issue.