XSD 1.1 validation and catalog fixes. #6496

Open
duncdrum wants to merge 14 commits into eXist-db:develop from duncdrum:dp-jaxp-xsd11-validation

@duncdrum duncdrum commented Jun 18, 2026

Contributor

Stumbled across this while working on exist's native schemas which will be a follow-up PR. 🐇 🕳️

Summary

  • validation:jaxp() now validates XSD 1.1 schemas (vc:minVersion="1.1"), transparently routed through a dedicated javax.xml.validation pipeline since the bundled Xerces fork only wires XSD 1.1 support there, not into the dynamic-discovery SAX parser.
  • validation:jaxv() gains a 4th, optional $catalogs argument, matching jaxp()'s existing catalog support (system catalog / explicit .xml catalog / directory-search collection catalog).
  • Database-stored catalogs are streamed to the resolver via SAX instead of round-tripping through a String.
  • Closes #1975 (parse-xml() and util:parse() do not seem to use catalogs) and #2476 (eXist 4.5 using local catalogs sending large amounts of outbound data). Regression tests added; the underlying fixes were already present.

What changed

  • Jaxp.java / Jaxv.java: XSD 1.1 detection + validator pipeline, shared catalog-argument handling via Shared.resolveCatalogArgument, Subject-scoped and Caffeine-backed detection cache.
  • ResolverFactory.java / RecordingContentHandler.java: SAX-streamed catalog loading without re-fetching from the database on xmlresolver's second pass.
  • SearchResourceResolver.java: directory-search catalogs now work with both the SAX pipeline and the XSD 1.1 validator pipeline.
  • Security: the XSD 1.1 detection peek only follows same-origin schemaLocation hints — covered by JaxpSchemaLocationSecurityTest.

Test plan

  • Full exist-core suite: 7029 tests, 0 failures/errors
  • All CI checks green (Codacy, License check, container build, W3C XQTS, unit + integration)
  • New/updated regression tests: JaxpXsdCatalogTest, JaxvTest, JaxpSchemaLocationSecurityTest, JaxpXsd11DetectionCacheTest, IsMissingElementDeclarationTest, CatalogResolutionRegressionTest,
    TournamentSchemaLanguageComparisonTest

duncdrum added 3 commits June 18, 2026 12:11
The bundled Xerces XSD 1.1 fork only wires 1.1 support into the JAXP
SchemaFactory/Validator API, not into the validating-SAXParser pipeline
`validation:jaxp()` uses for dynamic, schemaLocation-hint-driven schema
discovery. Any XSD declaring `vc:minVersion="1.1"` (e.g. xs:assert)
failed with `"cvc-elt.1.a: Cannot find the declaration of element"` even
though the same schema validates fine via validation:jaxv() with the
language explicitly set to v1.1.

Add a narrow fallback: when the default pipeline reports `cvc-elt.1.a`
and the instance references a schema via `xsi:schemaLocation` or
`xsi:noNamespaceSchemaLocation`, retry once with a
`SchemaFactory`/`Validator` built for v1.1. DTD-only documents never
trigger the retry, so the default pipeline (and its DTD/grammar-pool
behavior) is untouched for them.

Surfaced during work on eXist-db#6002.
round-trip

`Jaxp.java` and `SearchResourceResolver.java` each serialized a catalog
document stored in the database to a String just so `org.xmlresolver`
could re-parse it from a StringReader-backed InputSource.
xmlresolver's `CatalogManager` (reachable via
`XMLResolverConfiguration#getFeature`) exposes
`loadCatalog(URI, SaxProducer)`, so the catalog's SAX events can be
streamed directly into the loader instead.
`validation:jaxp()` already accepts an explicit catalog for schema and
entity resolution; `jaxv()` had none, so `xs:import/xs:include` inside
its explicitly-supplied grammars could only resolve via relative
schemaLocation paths. Add a 4th, optional catalogs argument to
`jaxv()/jaxv-report()`, wired via `SchemaFactory#setResourceResolver()`
so resolution happens at schema-compile time.
@duncdrum duncdrum added this to v7.0.0 Jun 18, 2026
@duncdrum duncdrum moved this to In progress in v7.0.0 Jun 18, 2026
@duncdrum duncdrum added this to the eXist-7.0.0 milestone Jun 18, 2026
@duncdrum duncdrum added the enhancement new features, suggestions, etc. label Jun 18, 2026
@duncdrum duncdrum force-pushed the dp-jaxp-xsd11-validation branch 5 times, most recently from 6a509e0 to df614e6 Compare June 19, 2026 20:13
@duncdrum duncdrum added the needs documentation Signals issues or PRs that will require an update to the documentation repo label Jun 19, 2026
duncdrum added a commit to duncdrum/documentation that referenced this pull request Jun 20, 2026
correct outdated references,
align with eXist-db/exist#6496
@duncdrum duncdrum requested a review from dizzzz June 20, 2026 07:16
@duncdrum duncdrum changed the title from "[wip] DP XSD 1.1 catalogue and validation fixes." to "XSD 1.1 validation and catalog fixes." Jun 20, 2026
duncdrum and others added 11 commits June 20, 2026 09:40
- Fix XSD 1.1 retry corruption,
- locale-fragile detection,
- and dedupe catalog SAX streaming

style improvements, typos, Codacy happiness

Buffer-and-replay the catalog SaxProducer's first invocation instead of
re-serializing from the database on each of xmlresolver's two invocations
(RNG-validate, then load) -- avoids a second broker round-trip per catalog
per call, and the two passes no longer risk seeing different content if
the document is concurrently modified in between.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also rewrites RecordingContentHandler to use the existing
org.exist.util.sax.event.contenthandler.* event-recording classes
(StartElement, Characters, TextEvent, etc.) instead of a hand-rolled,
parallel ContentHandlerEvent model -- these already provide the exact
'record SAX events, replay against a different handler later'
capability this class needs, including the defensive Attributes/char[]
copying.
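
The buffer-and-replay idea described above can be sketched with plain JAXP/SAX classes. This is only an illustration under assumptions: `SaxRecorder`, its `Event` interface and `countReplayedElements` are hypothetical names, not the PR's `RecordingContentHandler` (which, per the commit message, reuses the `org.exist.util.sax.event.contenthandler.*` classes), and prefix mappings, processing instructions and the like are left out for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical recorder: buffers SAX events once, replays them any number of times. */
final class SaxRecorder extends DefaultHandler {

    /** One buffered event that can be re-fired at another handler. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Parsers may reuse the Attributes object, so keep a private copy.
        final Attributes copy = new AttributesImpl(atts);
        events.add(h -> h.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add(h -> h.endElement(uri, local, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The char[] buffer is reused by the parser as well, so copy the slice.
        final char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(h -> h.characters(copy, 0, copy.length));
    }

    /** Fires the buffered events at the target, wrapped in start/endDocument. */
    void replay(ContentHandler target) throws SAXException {
        target.startDocument();
        for (Event e : events) {
            e.replay(target);
        }
        target.endDocument();
    }

    /** Parses the input once, replays it `passes` times, returns the start-element events seen. */
    static int countReplayedElements(String xml, int passes) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SaxRecorder recorder = new SaxRecorder();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), recorder);

            final int[] seen = {0};
            DefaultHandler counter = new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes a) {
                    seen[0]++;
                }
            };
            for (int i = 0; i < passes; i++) {
                recorder.replay(counter);
            }
            return seen[0];
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the source is parsed only once, both replay passes see identical content regardless of what happens to the underlying document in between.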

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`jaxp()` now peeks at the schema's `vc:minVersion` before parsing and
picks the right pipeline directly, instead of always failing once
and retrying. The peek only follows same-origin (scheme+authority)
locations, since the `schemaLocation` hint is document-controlled and
unrestricted resolution would let any caller make the server fetch
arbitrary files or URLs.
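
As a rough illustration of such a peek (not the PR's actual `isXsd11Schema` code), the root element of a candidate schema can be inspected with StAX without parsing the rest of the document. `MinVersionPeek` and `declaresXsd11` are hypothetical names, and treating anything other than a literal `1.1` as "not XSD 1.1" is an assumption of this sketch.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: reads only the root element of a schema document. */
final class MinVersionPeek {

    static final String XS_NS = "http://www.w3.org/2001/XMLSchema";
    static final String VC_NS = "http://www.w3.org/2007/XMLSchema-versioning";

    /** Returns true only if the root is xs:schema carrying vc:minVersion="1.1". */
    static boolean declaresXsd11(String schemaText) {
        XMLStreamReader reader = null;
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The peek needs neither DTDs nor external entities, so switch them off.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            reader = factory.createXMLStreamReader(new StringReader(schemaText));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // Stop at the first element; nothing further is read.
                    if (!XS_NS.equals(reader.getNamespaceURI())
                            || !"schema".equals(reader.getLocalName())) {
                        return false;
                    }
                    String minVersion = reader.getAttributeValue(VC_NS, "minVersion");
                    return minVersion != null && "1.1".equals(minVersion.trim());
                }
            }
            return false;
        } catch (XMLStreamException e) {
            // Assumption of this sketch: unparseable input counts as "not detected".
            return false;
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // Nothing useful to do if closing fails.
                }
            }
        }
    }
}
```

A caller would pick the XSD 1.1 validator path when `declaresXsd11` returns true and stay on the default SAX pipeline otherwise.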

`SearchResourceResolver` now also implements `LSResourceResolver`, so
directory-search catalogs work with the XSD 1.1 validator and with
`jaxv()`'s 4th argument too. Broadens the retry guard to fire whenever a
catalog is configured, since catalogs can resolve by namespace
alone with no `schemaLocation` hint present.
resolveEntity()'s namespace branch and resolveResource() each ran the
same findXSD-by-namespace lookup, debug log, and fixupExistCatalogUri
normalization inline. Extract the shared logic into one helper so the
two entry points (XNI and LSResourceResolver) can't drift apart.

Also hoists the per-call 'new DatabaseResources(brokerPool)' (in
resolveEntity's publicId branch and findXsdResourcePathByNamespace) to
a constructor-initialized field -- SearchResourceResolver is already a
long-lived per-resolution instance, so there's no need to reconstruct
this thin wrapper on every resolveEntity/resolveResource call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval() inlined the up-front XSD 1.1 peek and the cvc-elt.1.a retry
safety net as long blocks of logic mixed with mutable local state.
Extract them into peekIsXsd11ViaSchemaLocation() and
retryWithXsd11ValidatorIfNeeded(), threading the retry's
content-handler/builder pair back out via a small ParseTarget record
instead of reassigning eval()'s locals from deep inside a conditional.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also threads the cache-grammars flag into the XSD 1.1 validator path
(newXsd11Validator/retryWithXsd11ValidatorIfNeeded), shares the catalog-URL
dispatch logic with validation:jaxv() via Shared.resolveCatalogArgument
(throwing instead of just logging on a malformed catalog URL), and makes
isMissingElementDeclaration package-private so IsMissingElementDeclarationTest
can pin its match against the real Xerces message text directly.
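
One way to keep such a check independent of the JVM locale is to match on the Xerces error key that prefixes the message rather than on the English text after it. The sketch below is an assumption about the approach, not the PR's `isMissingElementDeclaration`; `MissingDeclCheck` is a hypothetical class name.

```java
/** Hypothetical helper: recognises the "missing element declaration" error by its key. */
final class MissingDeclCheck {

    /** Error key reported when no declaration is found for the validation root. */
    static final String KEY = "cvc-elt.1.a";

    /** Matches "cvc-elt.1.a: ..." whatever language the text after the key is in. */
    static boolean isMissingElementDeclaration(String message) {
        return message != null && message.startsWith(KEY + ":");
    }
}
```

In a validation error handler this would typically be fed `SAXParseException#getMessage()`.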

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also splits the three-condition compound boolean guard in
retryWithXsd11ValidatorIfNeeded into separate early-return checks, one
per independent bail-out reason, and extracts a getGrammarPool()
helper shared between eval()'s SAX-pipeline grammar-pool wiring and
newXsd11Validator()'s SchemaFactory cache-wiring (previously the same
Configuration/GrammarPool lookup duplicated in both places).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Schema

The up-front XSD 1.1 peek follows the instance's own xsi:schemaLocation
hint before any catalog/permission check would otherwise govern the
fetch, so it must only ever resolve within the instance's own origin.
Cover this directly: same-origin relative locations are followed,
cross-scheme/cross-host absolute locations (http, and cross-host
xmldb://, which XmldbURL.isEmbedded() treats as a remote XML-RPC
target) are refused without attempting a network connection. Also
documents the one residual nuance (file: URIs have no authority
component) inline on isXsd11Schema, gated behind the existing
util:-interop privilege boundary needed to get a file: instance at all.
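
The same-origin rule (scheme + authority) can be sketched with `java.net.URI` alone. `SameOrigin` and `isSameOrigin` are hypothetical names, and the comparison details (case-insensitive scheme, exact authority match) are assumptions of this sketch rather than the PR's code. As noted above, `file:` URIs have no authority component, so two `file:` URIs compare equal here.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Objects;

/** Hypothetical helper: decides whether a schemaLocation hint stays within the instance's origin. */
final class SameOrigin {

    static boolean isSameOrigin(URI instanceUri, String schemaLocationHint) {
        final URI resolved;
        try {
            // Relative hints resolve against the instance; absolute ones replace it.
            resolved = instanceUri.resolve(schemaLocationHint);
        } catch (IllegalArgumentException e) {
            // A hint that is not a valid URI is refused rather than guessed at.
            return false;
        }
        return Objects.equals(lower(instanceUri.getScheme()), lower(resolved.getScheme()))
                && Objects.equals(instanceUri.getAuthority(), resolved.getAuthority());
    }

    private static String lower(String s) {
        return s == null ? null : s.toLowerCase(Locale.ROOT);
    }
}
```

The decision is made purely on the URI text, so a refused location never triggers a network connection.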

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
isXsd11Schema() re-fetches and re-peeks the candidate schema's root
element on every call, even when validating many documents against
the same schema back-to-back. Add a bounded, LRU-evicted cache keyed
by the resolved schema URI, cleared by validation:clear-grammar-cache()
alongside the Xerces grammar pool so there's one function to reset
every validation-related cache. Failures (unreadable/unparseable
candidates, transient IO errors) are deliberately never cached, since
a stale negative would wrongly keep a legitimate schema on the slower
retry-after-failure path forever.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The directory-search XSD 1.1 schema and its valid/invalid instances
were inline Java string constants, unlike every other fixture in this
test class, which loads via Samples.SAMPLES.getSample(). Move them to
real resource files under exist-samples for consistency, and load them
the same way.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three @Ignore("todo") tests referenced fixtures this class's
@BeforeClass never stores (/db/tournament/1.5/...) and validated a
Schematron .sch file through jaxp-report(), whose 2-arg signature has
no grammar parameter at all -- the .sch document was silently coerced
to the unrelated enable-grammar-cache boolean via its effective
boolean value. Replace them with working tests against this class's
actual addressbook fixtures, mirroring the sibling ParseDtdNokTest's
stored/anyURI valid/invalid pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add CatalogResolutionRegressionTest, exercising ResolverFactory and
XercesXmlResolverAdapter directly -- the same machinery
XMLReaderObjectFactory wires into every pooled XMLReader, which both
fn:parse-xml()/util:parse() and the default SAX validation pipeline
parse documents with.

catalogResolvesPublicEntityWithNoBaseUri proves a configured catalog's
PUBLIC entry still resolves a DTD with no document base URI to resolve
a relative SYSTEM identifier against -- exactly the scenario reported
in eXist-db#1975, where parse-xml()/util:parse() appeared not to consult
catalogs at all.

catalogWithoutMatchingEntryDoesNotFetchUnmatchedRemoteSystemId proves
a catalog with no matching entry declines (returns null) rather than
fetching the literal, document-author-controlled URI itself --
governed by org.xmlresolver.ResolverFeature#ALWAYS_RESOLVE, which
ResolverFactory never sets, relying on the library default of false.
This is what prevents the outbound-network-exfiltration behavior
reported in eXist-db#2476.

Closes eXist-db#1975
Closes eXist-db#2476

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tures

Tournament.xsd and Tournament.rng (siblings of tournament-schema.sch,
the only one of the three actually exercised by any existing test)
were dead/unused. Their own embedded comment explains why they exist
side by side: Tournament-valid.xml and Tournament-invalid.xml only
differ in a Singles-tournament co-occurrence constraint
(nbrParticipants must equal nbrTeams) that neither RELAX NG nor W3C
XML Schema can express on their own -- Schematron rules are embedded
in both via xsd:appinfo/RELAX NG annotations to add it.

Add TournamentSchemaLanguageComparisonTest, validating both documents
against the bare XSD (via directory-search) and bare RNG grammars.
RNG reports both as valid, as documented. XSD reports both as invalid,
but for an unrelated, pre-existing reason in this 2001-vintage sample:
Match m3 references a Team t5 that's never declared, caught by XSD's
built-in ID/IDREF referential-integrity check, which this Tournament.rng
doesn't declare for the same elements. Both documents fail with the
exact same error either way, which is itself the point: XSD's verdict
doesn't change based on the co-occurrence violation either.
JingSchematronTest already covers the Schematron leg against the same
fixtures, correctly distinguishing the two documents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move XSD11_DETECTION_CACHE/XSD11_DETECTION_CACHE_MAX_ENTRIES to the
top of Jaxp, alongside its other fields, instead of declaring them
mid-class next to the methods that use them.

Make isXsd11Schema package-private instead of private, the same
pattern already used for clearXsd11DetectionCache(), so
JaxpSchemaLocationSecurityTest/JaxpXsd11DetectionCacheTest (both in
the same package) can call it directly instead of via
setAccessible(true).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also scopes XSD11_DETECTION_CACHE by the requesting Subject's name (a
cache hit skips isXsd11Schema's permission-checked openStream()
entirely, so without this a Subject lacking read permission on a schema
resource could observe a boolean populated by a different Subject's
earlier fetch), and replaces the hand-rolled synchronized
LinkedHashMap-based LRU with a Caffeine cache (already a project
dependency) for the same bound with less hand-written locking.
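
To make the scoping concrete, here is a dependency-free sketch of a detection cache keyed by (Subject name, schema URI) that never stores failed lookups. The PR itself uses Caffeine; the `LinkedHashMap` stand-in below is only there to keep the example self-contained, and `DetectionCache`, `Key` and the detector callback are hypothetical names.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical bounded cache of "is this schema XSD 1.1?" answers, scoped per subject. */
final class DetectionCache {

    /** Composite key: the same schema URI is cached separately for each subject. */
    record Key(String subjectName, URI schema) {}

    private final Map<Key, Boolean> lru;

    DetectionCache(final int maxEntries) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.lru = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Key, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /**
     * Returns the cached answer for this subject, or asks the detector.
     * An empty Optional from the detector means "could not tell" and is never cached.
     */
    synchronized boolean isXsd11(String subjectName, URI schema,
                                 Function<URI, Optional<Boolean>> detector) {
        final Key key = new Key(subjectName, schema);
        final Boolean hit = lru.get(key);
        if (hit != null) {
            return hit;
        }
        final Optional<Boolean> fresh = detector.apply(schema);
        fresh.ifPresent(value -> lru.put(key, value));
        return fresh.orElse(false);
    }

    synchronized int size() {
        return lru.size();
    }

    synchronized void clear() {
        lru.clear();
    }
}
```

With this key, a hit for one subject never answers a lookup by another subject, so the second subject still goes through the detector and whatever permission check it performs.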

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@duncdrum duncdrum force-pushed the dp-jaxp-xsd11-validation branch from df614e6 to b651e5e Compare June 20, 2026 07:48
@duncdrum duncdrum marked this pull request as ready for review June 20, 2026 07:50
@duncdrum duncdrum requested a review from a team as a code owner June 20, 2026 07:50
duncdrum added a commit to duncdrum/documentation that referenced this pull request Jun 20, 2026
correct outdated references,
align with eXist-db/exist#6496

@dizzzz dizzzz left a comment

Member

Impressive PR. Nice to see that you could build on top of my rather old foundation. Having full xsd1.1 support is a nice feature (though I doubt it is used a lot).

A lot of text, conceptually it looks ok.

I remember that over time somebody added @Ignore tags to some tests; you might want to check these.

One item that I remember is that if both an XML document and an XSD were stored in the same collection we could get a kind of deadlock. Is that still there?

Labels

enhancement: new features, suggestions, etc.
needs documentation: Signals issues or PRs that will require an update to the documentation repo

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

parse-xml() and util:parse() do not seem to use catalogs

2 participants