XSD 1.1 validation and catalog fixes. #6496
Open
duncdrum wants to merge 14 commits into
The bundled Xerces XSD 1.1 fork only wires 1.1 support into the JAXP `SchemaFactory`/`Validator` API, not into the validating-SAXParser pipeline that `validation:jaxp()` uses for dynamic, schemaLocation-hint-driven schema discovery. Any XSD declaring `vc:minVersion="1.1"` (e.g. one using `xs:assert`) failed with `"cvc-elt.1.a: Cannot find the declaration of element"` even though the same schema validates fine via `validation:jaxv()` with the language explicitly set to v1.1. Add a narrow fallback: when the default pipeline reports `cvc-elt.1.a` and the instance references a schema via `xsi:schemaLocation` or `xsi:noNamespaceSchemaLocation`, retry once with a `SchemaFactory`/`Validator` built for v1.1. DTD-only documents are not affected: the default pipeline (and its DTD/grammar-pool behavior) is untouched for them. Surfaced during other work, see eXist-db#6002.
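The retry condition described above can be sketched with the JDK alone. This is an illustration, not the code in `Jaxp.java`: the class and method names are hypothetical, and matching on the `cvc-elt.1.a` key prefix assumes that the key stays at the start of the message in every locale.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: decide whether a failed validation is worth one XSD 1.1 retry.
public class Xsd11RetryGuardSketch {

    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    /** True when the message starts with the error key for a missing element declaration. */
    static boolean isMissingElementDeclaration(String message) {
        // Assumption: match the key, not the English text, so the check is not locale-dependent.
        return message != null && message.startsWith("cvc-elt.1.a");
    }

    /** True when the root element carries xsi:schemaLocation or xsi:noNamespaceSchemaLocation. */
    static boolean hasSchemaLocationHint(String instanceXml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(instanceXml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Only the root element is inspected; stop at the first start tag.
                        return reader.getAttributeValue(XSI_NS, "schemaLocation") != null
                            || reader.getAttributeValue(XSI_NS, "noNamespaceSchemaLocation") != null;
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return false; // an unparseable instance never triggers a retry
        }
    }

    /** Both conditions must hold before a single retry is attempted. */
    static boolean shouldRetryWithXsd11(String firstError, String instanceXml) {
        return isMissingElementDeclaration(firstError) && hasSchemaLocationHint(instanceXml);
    }
}
```

The actual retry would then build a v1.1 `SchemaFactory`, which needs the Xerces XSD 1.1 fork on the classpath and is therefore left out of this sketch.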
round-trip `Jaxp.java` and `SearchResourceResolver.java` each serialized a catalog document stored in the database to a String just so `org.xmlresolver` could re-parse it from a StringReader-backed InputSource. xmlresolver's `CatalogManager` (reachable via `XMLResolverConfiguration#getFeature`) exposes `loadCatalog(URI, SaxProducer)`, so the catalog's SAX events can be streamed directly into the loader instead.
`validation:jaxp()` already accepts an explicit catalog for schema and entity resolution; `jaxv()` had none, so `xs:import/xs:include` inside its explicitly-supplied grammars could only resolve via relative schemaLocation paths. Add a 4th, optional catalogs argument to `jaxv()/jaxv-report()`, wired via `SchemaFactory#setResourceResolver()` so resolution happens at schema-compile time.
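As an illustration of compile-time resolution through `SchemaFactory#setResourceResolver()`, here is a minimal, JDK-only sketch. It is not the `jaxv()` implementation: the class name, the in-memory namespace-to-schema map (standing in for a catalog), and both sample schemas are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;
import org.xml.sax.SAXException;

// Hypothetical sketch: resolve xs:import by namespace while the schema is compiled.
public class CompileTimeResolverSketch {

    static final String TYPES_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' targetNamespace='urn:example:types'>"
      + "<xs:simpleType name='Code'><xs:restriction base='xs:string'>"
      + "<xs:pattern value='[A-Z]{2}[0-9]{2}'/></xs:restriction></xs:simpleType></xs:schema>";

    // The schemaLocation below is deliberately not reachable; the resolver answers by namespace.
    static final String MAIN_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' xmlns:t='urn:example:types'"
      + " targetNamespace='urn:example:main' elementFormDefault='qualified'>"
      + "<xs:import namespace='urn:example:types' schemaLocation='types.xsd'/>"
      + "<xs:element name='root' type='t:Code'/></xs:schema>";

    static Schema compile(String mainXsd, Map<String, String> schemasByNamespace) {
        try {
            DOMImplementationLS ls = (DOMImplementationLS) DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().getDOMImplementation().getFeature("LS", "3.0");
            LSResourceResolver resolver = (type, namespaceURI, publicId, systemId, baseURI) -> {
                String xsd = namespaceURI == null ? null : schemasByNamespace.get(namespaceURI);
                if (xsd == null) {
                    return null; // decline: the factory falls back to its default lookup
                }
                LSInput input = ls.createLSInput();
                input.setCharacterStream(new StringReader(xsd));
                input.setSystemId(systemId);
                input.setBaseURI(baseURI);
                return input;
            };
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            factory.setResourceResolver(resolver); // must be set before newSchema()
            return factory.newSchema(new StreamSource(new StringReader(mainXsd)));
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean isValid(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }
}
```

Because the resolver is attached to the factory rather than to a `Validator`, the import is resolved once, when the grammar is compiled, which is the behavior the paragraph above describes.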
6a509e0 to
df614e6
duncdrum
added a commit
to duncdrum/documentation
that referenced
this pull request
Jun 20, 2026
correct outdated references, align with eXist-db/exist#6496
- Fix XSD 1.1 retry corruption,
- locale-fragile detection,
- and dedupe catalog SAX streaming

style improvements, typos, codacy happiness

Buffer-and-replay the catalog SaxProducer's first invocation instead of re-serializing from the database on each of xmlresolver's two invocations (RNG-validate, then load) -- avoids a second broker round-trip per catalog per call, and the two passes no longer risk seeing different content if the document is concurrently modified in between.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also rewrites RecordingContentHandler to use the existing org.exist.util.sax.event.contenthandler.* event-recording classes (StartElement, Characters, TextEvent, etc.) instead of a hand-rolled, parallel ContentHandlerEvent model -- these already provide the exact 'record SAX events, replay against a different handler later' capability this class needs, including the defensive Attributes/char[] copying.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
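The buffer-and-replay idea can be sketched with plain SAX. This is a hypothetical, simplified stand-in, not the `RecordingContentHandler` from the PR and not the `org.exist.util.sax.event.contenthandler.*` classes; it only shows why the `Attributes` and `char[]` arguments have to be copied before they are stored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: record SAX events once, replay them any number of times.
public class RecordAndReplaySketch extends DefaultHandler {

    /** One recorded event, replayable against any handler. */
    interface Event { void replayOn(ContentHandler target) throws SAXException; }

    private final List<Event> events = new ArrayList<>();

    @Override public void startDocument() { events.add(ContentHandler::startDocument); }
    @Override public void endDocument() { events.add(ContentHandler::endDocument); }

    @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
        Attributes copy = new AttributesImpl(atts); // parsers may reuse the Attributes object
        events.add(h -> h.startElement(uri, localName, qName, copy));
    }

    @Override public void endElement(String uri, String localName, String qName) {
        events.add(h -> h.endElement(uri, localName, qName));
    }

    @Override public void characters(char[] ch, int start, int length) {
        char[] copy = Arrays.copyOfRange(ch, start, start + length); // the buffer is reused too
        events.add(h -> h.characters(copy, 0, copy.length));
    }

    /** Parses the document once and returns the filled recorder. */
    static RecordAndReplaySketch capture(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            RecordAndReplaySketch recorder = new RecordAndReplaySketch();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), recorder);
            return recorder;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Replays the buffered events into a small printing handler; no re-parse happens. */
    String replayAsTrace() {
        StringBuilder out = new StringBuilder();
        DefaultHandler printer = new DefaultHandler() {
            @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
                out.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    out.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
                }
                out.append('>');
            }
            @Override public void endElement(String uri, String localName, String qName) {
                out.append("</").append(qName).append('>');
            }
            @Override public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            for (Event event : events) {
                event.replayOn(printer);
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Both replays read from the same buffer, so a consumer that needs two passes (validate, then load) sees identical content on each pass.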
`jaxp()` now peeks at the schema's `vc:minVersion` before parsing and picks the right pipeline directly, instead of always failing once and retrying. The peek only follows same-origin (scheme+authority) locations, since the `schemaLocation` hint is document-controlled and unrestricted resolution would let any caller make the server fetch arbitrary files or URLs. `SearchResourceResolver` now also implements `LSResourceResolver`, so directory-search catalogs work with the XSD 1.1 validator and with `jaxv()`'s 4th argument too. Broadens the retry guard to fire whenever a catalog is configured, since catalogs can resolve by namespace alone with no `schemaLocation` hint present.
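A same-origin check on scheme and authority might look like the sketch below. The class and method names are hypothetical and the logic is a simplified reading of the description above, not the check in `Jaxp.java`. It inherits the nuance that URIs without an authority component (such as `file:`) compare as equal on that part.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical sketch: only follow schemaLocation hints that stay within the instance's origin.
public class SameOriginSketch {

    /**
     * Resolves the hint against the instance URI and compares scheme and authority.
     * Relative hints inherit the instance's origin; absolute hints must match it.
     */
    static boolean isSameOrigin(URI instance, String schemaLocation) {
        final URI candidate;
        try {
            candidate = instance.resolve(schemaLocation);
        } catch (IllegalArgumentException e) {
            return false; // a malformed hint is never followed
        }
        String instanceScheme = instance.getScheme();
        String candidateScheme = candidate.getScheme();
        boolean sameScheme = instanceScheme != null && instanceScheme.equalsIgnoreCase(candidateScheme);
        // Authority is compared exactly; two missing authorities count as equal.
        return sameScheme && Objects.equals(instance.getAuthority(), candidate.getAuthority());
    }
}
```

For example, with an instance at `http://a.example/db/doc.xml`, the hint `schemas/s.xsd` is accepted, while `http://b.example/s.xsd` and `https://a.example/s.xsd` are refused without any connection being opened.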
resolveEntity()'s namespace branch and resolveResource() each ran the same findXSD-by-namespace lookup, debug log, and fixupExistCatalogUri normalization inline. Extract the shared logic into one helper so the two entry points (XNI and LSResourceResolver) can't drift apart. Also hoists the per-call 'new DatabaseResources(brokerPool)' (in resolveEntity's publicId branch and findXsdResourcePathByNamespace) to a constructor-initialized field -- SearchResourceResolver is already a long-lived per-resolution instance, so there's no need to reconstruct this thin wrapper on every resolveEntity/resolveResource call. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval() inlined the up-front XSD 1.1 peek and the cvc-elt.1.a retry safety net as long blocks of logic mixed with mutable local state. Extract them into peekIsXsd11ViaSchemaLocation() and retryWithXsd11ValidatorIfNeeded(), threading the retry's content-handler/builder pair back out via a small ParseTarget record instead of reassigning eval()'s locals from deep inside a conditional.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also threads the cache-grammars flag into the XSD 1.1 validator path (newXsd11Validator/retryWithXsd11ValidatorIfNeeded), shares the catalog-URL dispatch logic with validation:jaxv() via Shared.resolveCatalogArgument (throwing instead of just logging on a malformed catalog URL), and makes isMissingElementDeclaration package-private so IsMissingElementDeclarationTest can pin its match against the real Xerces message text directly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also splits the three-condition compound boolean guard in retryWithXsd11ValidatorIfNeeded into separate early-return checks, one per independent bail-out reason, and extracts a getGrammarPool() helper shared between eval()'s SAX-pipeline grammar-pool wiring and newXsd11Validator()'s SchemaFactory cache-wiring (previously the same Configuration/GrammarPool lookup duplicated in both places).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
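A peek at a schema's `vc:minVersion` needs only the root element, so it can stop after the first start tag. The sketch below uses StAX from the JDK. The class and method names are hypothetical, and treating only the literal value `1.1` as a match is an assumption; the real `peekIsXsd11ViaSchemaLocation()` may compare versions differently.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: read just enough of a schema to see whether it asks for XSD 1.1.
public class MinVersionPeekSketch {

    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";
    static final String VC_NS = "http://www.w3.org/2007/XMLSchema-versioning";

    static boolean declaresXsd11(String schemaXml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(schemaXml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Bail out early unless the root is xs:schema.
                        if (!XSD_NS.equals(reader.getNamespaceURI())
                                || !"schema".equals(reader.getLocalName())) {
                            return false;
                        }
                        String minVersion = reader.getAttributeValue(VC_NS, "minVersion");
                        // Assumption: only the exact value "1.1" selects the 1.1 pipeline.
                        return minVersion != null && "1.1".equals(minVersion.trim());
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return false; // unreadable candidates stay on the default pipeline
        }
    }
}
```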
…Schema The up-front XSD 1.1 peek follows the instance's own xsi:schemaLocation hint before any catalog/permission check would otherwise govern the fetch, so it must only ever resolve within the instance's own origin. Cover this directly: same-origin relative locations are followed, cross-scheme/cross-host absolute locations (http, and cross-host xmldb://, which XmldbURL.isEmbedded() treats as a remote XML-RPC target) are refused without attempting a network connection. Also documents the one residual nuance (file: URIs have no authority component) inline on isXsd11Schema, gated behind the existing util:-interop privilege boundary needed to get a file: instance at all. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
isXsd11Schema() re-fetches and re-peeks the candidate schema's root element on every call, even when validating many documents against the same schema back-to-back. Add a bounded, LRU-evicted cache keyed by the resolved schema URI, cleared by validation:clear-grammar-cache() alongside the Xerces grammar pool so there's one function to reset every validation-related cache. Failures (unreadable/unparseable candidates, transient IO errors) are deliberately never cached, since a stale negative would wrongly keep a legitimate schema on the slower retry-after-failure path forever. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
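The cache behavior described above (bounded, LRU-evicted, positive and negative detections cached, failures never cached) can be sketched with a `LinkedHashMap` in access order. Everything here is hypothetical naming. The peek function returning an empty `Optional` on failure is an assumption chosen to keep the example small.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: bounded LRU cache for detection results that never stores failures.
public class DetectionCacheSketch {

    private final Map<String, Boolean> cache;

    DetectionCacheSketch(int maxEntries) {
        // accessOrder=true turns the map into an LRU; the eldest entry is dropped past the bound.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /**
     * Returns the cached answer for the schema URI, or runs the peek.
     * An empty Optional from the peek means "could not tell" and is not cached.
     */
    synchronized Optional<Boolean> isXsd11(String schemaUri, Function<String, Optional<Boolean>> peek) {
        Boolean cached = cache.get(schemaUri);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<Boolean> result = peek.apply(schemaUri);
        result.ifPresent(value -> cache.put(schemaUri, value));
        return result;
    }

    synchronized void clear() {
        cache.clear();
    }

    synchronized int size() {
        return cache.size();
    }
}
```

Leaving failures out means a schema that was briefly unreadable gets peeked again on the next call instead of being pinned to a stale answer.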
The directory-search XSD 1.1 schema and its valid/invalid instances were inline Java string constants, unlike every other fixture in this test class, which loads via Samples.SAMPLES.getSample(). Move them to real resource files under exist-samples for consistency, and load them the same way. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three @Ignore("todo") tests referenced fixtures this class's @BeforeClass never stores (/db/tournament/1.5/...) and validated a Schematron .sch file through jaxp-report(), whose 2-arg signature has no grammar parameter at all -- the .sch document was silently coerced to the unrelated enable-grammar-cache boolean via its effective boolean value. Replace them with working tests against this class's actual addressbook fixtures, mirroring the sibling ParseDtdNokTest's stored/anyURI valid/invalid pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add CatalogResolutionRegressionTest, exercising ResolverFactory and XercesXmlResolverAdapter directly -- the same machinery XMLReaderObjectFactory wires into every pooled XMLReader, which both fn:parse-xml()/util:parse() and the default SAX validation pipeline parse documents with. catalogResolvesPublicEntityWithNoBaseUri proves a configured catalog's PUBLIC entry still resolves a DTD with no document base URI to resolve a relative SYSTEM identifier against -- exactly the scenario reported in eXist-db#1975, where parse-xml()/util:parse() appeared not to consult catalogs at all. catalogWithoutMatchingEntryDoesNotFetchUnmatchedRemoteSystemId proves a catalog with no matching entry declines (returns null) rather than fetching the literal, document-author-controlled URI itself -- governed by org.xmlresolver.ResolverFeature#ALWAYS_RESOLVE, which ResolverFactory never sets, relying on the library default of false. This is what prevents the outbound-network-exfiltration behavior reported in eXist-db#2476. Closes eXist-db#1975 Closes eXist-db#2476 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tures Tournament.xsd and Tournament.rng (siblings of tournament-schema.sch, the only one of the three actually exercised by any existing test) were dead/unused. Their own embedded comment explains why they exist side by side: Tournament-valid.xml and Tournament-invalid.xml only differ in a Singles-tournament co-occurrence constraint (nbrParticipants must equal nbrTeams) that neither RELAX NG nor W3C XML Schema can express on their own -- Schematron rules are embedded in both via xsd:appinfo/RELAX NG annotations to add it. Add TournamentSchemaLanguageComparisonTest, validating both documents against the bare XSD (via directory-search) and bare RNG grammars. RNG reports both as valid, as documented. XSD reports both as invalid, but for an unrelated, pre-existing reason in this 2001-vintage sample: Match m3 references a Team t5 that's never declared, caught by XSD's built-in ID/IDREF referential-integrity check, which this Tournament.rng doesn't declare for the same elements. Both documents fail with the exact same error either way, which is itself the point: XSD's verdict doesn't change based on the co-occurrence violation either. JingSchematronTest already covers the Schematron leg against the same fixtures, correctly distinguishing the two documents. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move XSD11_DETECTION_CACHE/XSD11_DETECTION_CACHE_MAX_ENTRIES to the top of Jaxp, alongside its other fields, instead of declaring them mid-class next to the methods that use them. Make isXsd11Schema package-private instead of private, the same pattern already used for clearXsd11DetectionCache(), so JaxpSchemaLocationSecurityTest/JaxpXsd11DetectionCacheTest (both in the same package) can call it directly instead of via setAccessible(true).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also scopes XSD11_DETECTION_CACHE by the requesting Subject's name (a cache hit skips isXsd11Schema's permission-checked openStream() entirely, so without this a Subject lacking read permission on a schema resource could observe a boolean populated by a different Subject's earlier fetch), and replaces the hand-rolled synchronized LinkedHashMap-based LRU with a Caffeine cache (already a project dependency) for the same bound with less hand-written locking.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
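Scoping the cache by Subject amounts to making the subject name part of the key. The sketch below shows only that idea with a JDK map. The class, record, and method names are hypothetical, and it is unbounded, unlike the Caffeine cache the commit describes.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: detection results are visible only to the subject that fetched them.
public class SubjectScopedCacheSketch {

    /** Composite key: the same schema URI is a different entry for each subject. */
    record Key(String subjectName, URI schemaUri) {}

    private final Map<Key, Boolean> cache = new ConcurrentHashMap<>();

    void put(String subjectName, URI schemaUri, boolean isXsd11) {
        cache.put(new Key(subjectName, schemaUri), isXsd11);
    }

    /** Returns null on a miss, so the caller still runs its own permission-checked fetch. */
    Boolean get(String subjectName, URI schemaUri) {
        return cache.get(new Key(subjectName, schemaUri));
    }
}
```

A second subject asking about the same URI gets a miss and therefore goes through the permission-checked read itself, rather than learning the answer from someone else's fetch.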
df614e6 to
b651e5e
duncdrum
added a commit
to duncdrum/documentation
that referenced
this pull request
Jun 20, 2026
correct outdated references, align with eXist-db/exist#6496
dizzzz
approved these changes
Jun 22, 2026
dizzzz
left a comment
Member
Impressive PR. Nice to see that you could build on top of my rather old foundation. Having full xsd1.1 support is a nice feature (though I doubt it is used a lot).
A lot of text, conceptually it looks ok.
I remember that over time somebody added @Ignore tags to some tests; you might want to check these.
One item that I remember is that if both an XML document and an XSD were stored in the same collection, we could get a kind of deadlock. Is that still there?
Stumbled across this while working on exist's native schemas which will be a follow-up PR. 🐇 🕳️
Summary
- `validation:jaxp()` now validates XSD 1.1 schemas (`vc:minVersion="1.1"`), transparently routed through a dedicated `javax.xml.validation` pipeline since the bundled Xerces fork only wires XSD 1.1 support there, not into the dynamic-discovery SAX parser.
- `validation:jaxv()` gains a 4th, optional `$catalogs` argument, matching `jaxp()`'s existing catalog support (system catalog / explicit `.xml` catalog / directory-search collection catalog).
- …String.

What changed
- `Jaxp.java`/`Jaxv.java`: XSD 1.1 detection + validator pipeline, shared catalog-argument handling (`Shared.resolveCatalogArgument`), Subject-scoped and Caffeine-backed detection cache.
- `ResolverFactory.java`/`RecordingContentHandler.java`: SAX-streamed catalog loading without re-fetching from the database on xmlresolver's second pass.
- `SearchResourceResolver.java`: directory-search catalogs now work with both the SAX pipeline and the … pipeline.
- …`schemaLocation` hints, covered by `JaxpSchemaLocationSecurityTest`.

Test plan
- `exist-core` suite: 7029 tests, 0 failures/errors
- `JaxpXsdCatalogTest`, `JaxvTest`, `JaxpSchemaLocationSecurityTest`, `JaxpXsd11DetectionCacheTest`, `IsMissingElementDeclarationTest`, `CatalogResolutionRegressionTest`, `TournamentSchemaLanguageComparisonTest`