[wip] exist native schemas#6505

Draft
duncdrum wants to merge 30 commits into eXist-db:develop from duncdrum:dp-exist-native-schemas
@duncdrum

what I actually wanted to do.

extends #6496

duncdrum and others added 22 commits June 20, 2026 14:22
The bundled Xerces XSD 1.1 fork only wires 1.1 support into the JAXP
SchemaFactory/Validator API, not into the validating-SAXParser pipeline
that `validation:jaxp()` uses for dynamic, schemaLocation-hint-driven schema
discovery. Any XSD declaring `vc:minVersion="1.1"` (e.g. to use xs:assert)
failed with `"cvc-elt.1.a: Cannot find the declaration of element"` even
though the same schema validates fine via `validation:jaxv()` with the
language explicitly set to v1.1.

Add a narrow fallback: when the default pipeline reports `cvc-elt.1.a` and
the instance references a schema via `xsi:schemaLocation` or
`xsi:noNamespaceSchemaLocation`, retry once with a `SchemaFactory/Validator`
built for v1.1. For DTD-only documents, the default pipeline (and its
DTD/grammar-pool behavior) is untouched.

Surfaced during work on eXist-db#6002.
round-trip

`Jaxp.java` and `SearchResourceResolver.java` each serialized a catalog
document stored in the database to a String just so `org.xmlresolver`
could re-parse it from a StringReader-backed InputSource.
xmlresolver's `CatalogManager` (reachable via
`XMLResolverConfiguration#getFeature`) exposes
`loadCatalog(URI, SaxProducer)`, so the catalog's SAX events can be
streamed directly into the loader instead.
`validation:jaxp()` already accepts an explicit catalog for schema and
entity resolution; `jaxv()` had none, so `xs:import/xs:include` inside
its explicitly-supplied grammars could only resolve via relative
schemaLocation paths. Add a 4th, optional catalogs argument to
`jaxv()/jaxv-report()`, wired via `SchemaFactory#setResourceResolver()`
so resolution happens at schema-compile time.
- Fix XSD 1.1 retry corruption,
- locale-fragile detection,
- and dedupe catalog SAX streaming

Style improvements, typos, Codacy happiness

Buffer-and-replay the catalog SaxProducer's first invocation instead of
re-serializing from the database on each of xmlresolver's two invocations
(RNG-validate, then load) -- avoids a second broker round-trip per catalog
per call, and the two passes no longer risk seeing different content if
the document is concurrently modified in between.
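
To illustrate the buffer-and-replay idea, here is a minimal sketch using only the JDK's SAX classes. It is not the PR's `RecordingContentHandler` (which, per the commit messages, builds on `org.exist.util.sax.event.contenthandler.*`); the class name `RecordingHandler` and its `Event` interface are hypothetical. Events are recorded once, with attributes and character data copied defensively, and can then be replayed any number of times against different handlers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: records SAX events once so they can be replayed later. */
final class RecordingHandler extends DefaultHandler {

    /** One recorded event; replaying it re-fires the event on the target. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override public void startDocument() {
        events.add(ContentHandler::startDocument);
    }

    @Override public void endDocument() {
        events.add(ContentHandler::endDocument);
    }

    @Override public void startPrefixMapping(String prefix, String uri) {
        events.add(h -> h.startPrefixMapping(prefix, uri));
    }

    @Override public void endPrefixMapping(String prefix) {
        events.add(h -> h.endPrefixMapping(prefix));
    }

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        // Parsers may reuse the Attributes object, so keep a private copy.
        final Attributes copy = new AttributesImpl(atts);
        events.add(h -> h.startElement(uri, local, qName, copy));
    }

    @Override public void endElement(String uri, String local, String qName) {
        events.add(h -> h.endElement(uri, local, qName));
    }

    @Override public void characters(char[] ch, int start, int length) {
        // The char[] buffer is also reused by the parser; copy the slice.
        final char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(h -> h.characters(copy, 0, copy.length));
    }

    /** Replays every recorded event, in order, against the given handler. */
    void replay(ContentHandler target) throws SAXException {
        for (Event event : events) {
            event.replay(target);
        }
    }
}
```

With a recorder like this, the first invocation fills the buffer and each later invocation calls `replay(...)`, so both passes see identical content without a second read from the database.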

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also rewrites RecordingContentHandler to use the existing
org.exist.util.sax.event.contenthandler.* event-recording classes
(StartElement, Characters, TextEvent, etc.) instead of a hand-rolled,
parallel ContentHandlerEvent model -- these already provide the exact
'record SAX events, replay against a different handler later'
capability this class needs, including the defensive Attributes/char[]
copying.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`jaxp()` now peeks at the schema's `vc:minVersion` before parsing and
picks the right pipeline directly, instead of always failing once
and retrying. The peek only follows same-origin (scheme+authority)
locations, since the `schemaLocation` hint is document-controlled and
unrestricted resolution would let any caller make the server fetch
arbitrary files or URLs.
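
As a rough illustration of the up-front peek, the sketch below reads only the root element of a candidate schema with StAX and checks for `vc:minVersion="1.1"`. It is an assumption-laden stand-in, not the PR's code: the class and method names are hypothetical, the schema is passed as a String for brevity, and the check is a plain string comparison against "1.1" rather than a numeric version comparison.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: peek at a schema's root element for vc:minVersion. */
final class MinVersionPeek {

    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";
    static final String VC_NS = "http://www.w3.org/2007/XMLSchema-versioning";

    /** Returns true only if the root is xs:schema and declares vc:minVersion="1.1". */
    static boolean declaresXsd11(String schemaText) {
        final XMLInputFactory factory = XMLInputFactory.newFactory();
        // The peek must not trigger fetches of its own.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            final XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(schemaText));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Only the first (root) element is inspected.
                        if (!XSD_NS.equals(reader.getNamespaceURI())
                                || !"schema".equals(reader.getLocalName())) {
                            return false;
                        }
                        return "1.1".equals(reader.getAttributeValue(VC_NS, "minVersion"));
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Unreadable or unparseable candidate: treat as "not detected".
            return false;
        }
    }
}
```

Because the reader stops at the first start tag, the cost of the peek stays small even for large schemas.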

`SearchResourceResolver` now also implements `LSResourceResolver`, so
directory-search catalogs work with the XSD 1.1 validator and with
`jaxv()`'s 4th argument too. Broadens the retry guard to fire whenever a
catalog is configured, since catalogs can resolve by namespace
alone with no `schemaLocation` hint present.
resolveEntity()'s namespace branch and resolveResource() each ran the
same findXSD-by-namespace lookup, debug log, and fixupExistCatalogUri
normalization inline. Extract the shared logic into one helper so the
two entry points (XNI and LSResourceResolver) can't drift apart.

Also hoists the per-call 'new DatabaseResources(brokerPool)' (in
resolveEntity's publicId branch and findXsdResourcePathByNamespace) to
a constructor-initialized field -- SearchResourceResolver is already a
long-lived per-resolution instance, so there's no need to reconstruct
this thin wrapper on every resolveEntity/resolveResource call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval() inlined the up-front XSD 1.1 peek and the cvc-elt.1.a retry
safety net as long blocks of logic mixed with mutable local state.
Extract them into peekIsXsd11ViaSchemaLocation() and
retryWithXsd11ValidatorIfNeeded(), threading the retry's
content-handler/builder pair back out via a small ParseTarget record
instead of reassigning eval()'s locals from deep inside a conditional.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also threads the cache-grammars flag into the XSD 1.1 validator path
(newXsd11Validator/retryWithXsd11ValidatorIfNeeded), shares the catalog-URL
dispatch logic with validation:jaxv() via Shared.resolveCatalogArgument
(throwing instead of just logging on a malformed catalog URL), and makes
isMissingElementDeclaration package-private so IsMissingElementDeclarationTest
can pin its match against the real Xerces message text directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also splits the three-condition compound boolean guard in
retryWithXsd11ValidatorIfNeeded into separate early-return checks, one
per independent bail-out reason, and extracts a getGrammarPool()
helper shared between eval()'s SAX-pipeline grammar-pool wiring and
newXsd11Validator()'s SchemaFactory cache-wiring (previously the same
Configuration/GrammarPool lookup duplicated in both places).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Schema

The up-front XSD 1.1 peek follows the instance's own xsi:schemaLocation
hint before any catalog/permission check would otherwise govern the
fetch, so it must only ever resolve within the instance's own origin.
Cover this directly: same-origin relative locations are followed,
cross-scheme/cross-host absolute locations (http, and cross-host
xmldb://, which XmldbURL.isEmbedded() treats as a remote XML-RPC
target) are refused without attempting a network connection. Also
documents the one residual nuance (file: URIs have no authority
component) inline on isXsd11Schema, gated behind the existing
util:-interop privilege boundary needed to get a file: instance at all.
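
The same-origin rule described above can be sketched with `java.net.URI` alone. This is a simplified illustration, not the PR's implementation: the class and method names are hypothetical, and the example URIs in the usage note are made up. It resolves the hint against the instance's URI and compares scheme and authority.

```java
import java.net.URI;

/** Hypothetical sketch: only follow schemaLocation hints within the instance's origin. */
final class SameOrigin {

    /** True if the hint, resolved against the instance URI, keeps scheme and authority. */
    static boolean isSameOrigin(URI instance, String schemaLocationHint) {
        final URI resolved;
        try {
            resolved = instance.resolve(schemaLocationHint);
        } catch (IllegalArgumentException e) {
            // A hint that is not a valid URI reference is never followed.
            return false;
        }
        return equalsIgnoreCase(instance.getScheme(), resolved.getScheme())
                && equalsIgnoreCase(instance.getAuthority(), resolved.getAuthority());
    }

    // Null-safe, case-insensitive comparison. Note the residual nuance:
    // file: URIs have no authority, so two file: URIs compare as
    // null == null and count as the same origin here.
    private static boolean equalsIgnoreCase(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }
}
```

For an instance at `xmldb://host-a/db/docs/doc.xml`, a relative hint such as `schemas/s.xsd` is accepted, while `http://evil.example/s.xsd` and `xmldb://host-b/db/s.xsd` are refused before any connection is attempted.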

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
isXsd11Schema() re-fetches and re-peeks the candidate schema's root
element on every call, even when validating many documents against
the same schema back-to-back. Add a bounded, LRU-evicted cache keyed
by the resolved schema URI, cleared by validation:clear-grammar-cache()
alongside the Xerces grammar pool so there's one function to reset
every validation-related cache. Failures (unreadable/unparseable
candidates, transient IO errors) are deliberately never cached, since
a stale negative would wrongly keep a legitimate schema on the slower
retry-after-failure path forever.
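
A minimal sketch of such a cache, assuming a synchronized, access-ordered `LinkedHashMap` and a detector that signals failure with an empty `Optional`. The class name, method names, and the `Function`-based detector are hypothetical; the point is only that successful results are bounded and LRU-evicted while failures are never stored.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: bounded LRU cache that never stores detection failures. */
final class DetectionCache {

    private final Map<String, Boolean> entries;

    DetectionCache(final int maxEntries) {
        // accessOrder=true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /**
     * Returns the cached verdict for the schema URI, or runs the detector.
     * An empty Optional from the detector means "could not decide" and is
     * passed through without being cached, so the next call tries again.
     */
    synchronized Optional<Boolean> lookup(String schemaUri,
                                          Function<String, Optional<Boolean>> detector) {
        final Boolean cached = entries.get(schemaUri);
        if (cached != null) {
            return Optional.of(cached);
        }
        final Optional<Boolean> detected = detector.apply(schemaUri);
        detected.ifPresent(verdict -> entries.put(schemaUri, verdict));
        return detected;
    }

    /** Drops every entry, e.g. alongside a grammar-pool reset. */
    synchronized void clear() {
        entries.clear();
    }

    synchronized int size() {
        return entries.size();
    }
}
```

This sketch holds the lock while the detector runs, which keeps the code short but serializes fetches; a production cache would likely avoid that.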

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The directory-search XSD 1.1 schema and its valid/invalid instances
were inline Java string constants, unlike every other fixture in this
test class, which loads via Samples.SAMPLES.getSample(). Move them to
real resource files under exist-samples for consistency, and load them
the same way.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three @Ignore("todo") tests referenced fixtures this class's
@BeforeClass never stores (/db/tournament/1.5/...) and validated a
Schematron .sch file through jaxp-report(), whose 2-arg signature has
no grammar parameter at all -- the .sch document was silently coerced
to the unrelated enable-grammar-cache boolean via its effective
boolean value. Replace them with working tests against this class's
actual addressbook fixtures, mirroring the sibling ParseDtdNokTest's
stored/anyURI valid/invalid pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add CatalogResolutionRegressionTest, exercising ResolverFactory and
XercesXmlResolverAdapter directly -- the same machinery
XMLReaderObjectFactory wires into every pooled XMLReader, which both
fn:parse-xml()/util:parse() and the default SAX validation pipeline
parse documents with.

catalogResolvesPublicEntityWithNoBaseUri proves a configured catalog's
PUBLIC entry still resolves a DTD with no document base URI to resolve
a relative SYSTEM identifier against -- exactly the scenario reported
in eXist-db#1975, where parse-xml()/util:parse() appeared not to consult
catalogs at all.

catalogWithoutMatchingEntryDoesNotFetchUnmatchedRemoteSystemId proves
a catalog with no matching entry declines (returns null) rather than
fetching the literal, document-author-controlled URI itself --
governed by org.xmlresolver.ResolverFeature#ALWAYS_RESOLVE, which
ResolverFactory never sets, relying on the library default of false.
This is what prevents the outbound-network-exfiltration behavior
reported in eXist-db#2476.

Closes eXist-db#1975
Closes eXist-db#2476

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tures

Tournament.xsd and Tournament.rng (siblings of tournament-schema.sch,
the only one of the three actually exercised by any existing test)
were dead/unused. Their own embedded comment explains why they exist
side by side: Tournament-valid.xml and Tournament-invalid.xml only
differ in a Singles-tournament co-occurrence constraint
(nbrParticipants must equal nbrTeams) that neither RELAX NG nor W3C
XML Schema can express on their own -- Schematron rules are embedded
in both via xsd:appinfo/RELAX NG annotations to add it.

Add TournamentSchemaLanguageComparisonTest, validating both documents
against the bare XSD (via directory-search) and bare RNG grammars.
RNG reports both as valid, as documented. XSD reports both as invalid,
but for an unrelated, pre-existing reason in this 2001-vintage sample:
Match m3 references a Team t5 that's never declared, caught by XSD's
built-in ID/IDREF referential-integrity check, which this Tournament.rng
doesn't declare for the same elements. Both documents fail with the
exact same error either way, which is itself the point: XSD's verdict
doesn't change based on the co-occurrence violation either.
JingSchematronTest already covers the Schematron leg against the same
fixtures, correctly distinguishing the two documents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move XSD11_DETECTION_CACHE/XSD11_DETECTION_CACHE_MAX_ENTRIES to the
top of Jaxp, alongside its other fields, instead of declaring them
mid-class next to the methods that use them.

Make isXsd11Schema package-private instead of private, the same
pattern already used for clearXsd11DetectionCache(), so
JaxpSchemaLocationSecurityTest/JaxpXsd11DetectionCacheTest (both in
the same package) can call it directly instead of via
setAccessible(true).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also scopes XSD11_DETECTION_CACHE by the requesting Subject's name (a
cache hit skips isXsd11Schema's permission-checked openStream() entirely,
so without this a Subject lacking read permission on a schema resource
could observe a boolean populated by a different Subject's earlier fetch),
and replaces the hand-rolled synchronized LinkedHashMap-based LRU with a
Caffeine cache (already a project dependency) for the same bound with less
hand-written locking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…negotiation

[feature] request module: Accept-header parsing and content negotiation
Wire xml-maven-plugin on the root aggregator to check conf.xml,
collection.xconf.init, descriptor.xml, controller-config.xml, and
mime-types.xml against schema/*.xsd using XSD 1.1 (Xerces + xpath2).
Centralize version pins and plugin classpath in exist-parent; align three
schemas with their canonical templates so validation passes.
Add path-filtered ci-schema-checks workflow: XSL
governance that reads pairs from pom validationSets, compares base
@version (xpath via GITHUB_ENV) to head, and fails with PR annotations
when a schema or canonical template changes without a version bump.
Lets eXist detect when a config template (conf.xml, collection.xconf,
descriptor.xml, mime-types.xml, controller-config.xml) was authored
against an older revision of its paired XSD. Each schema gains an
optional schemaVersion attribute mirroring xs:schema/@version; runtime
parsers log a debug message for legacy documents that omit it and warn
when a declared value doesn't match what the running build expects.

see eXist-db#3062
mavenize and simplify the whole operation
fix instance schema-location paths
schema/ now ships at $EXIST_HOME/schema/. Fixed
conf.xml/descriptor.xml/controller-config.xml's schema-location hints,
which were source-tree-relative and never correct for the assembled
layout. Extended catalog.xml with entries for the remaining schemas.

see eXist-db#6189

catalog.xml is not WAI
see eXist-db#5541
see eXist-db#350
Extend the existing template-vs-schema validation (pom.xml) to also
check every schema/**/*.xsd is itself a legal schema document, per
the W3C meta-schema. Idea borrowed from eXist-db#5541, where the same catalog
trick lets a user validate their own XSD's well-formedness.

Upgrades the bundled XMLSchema.xsd from the stale, unused 2001/2004
XSD 1.0 revision (no xs:assert/vc: support) to the 2009 XSD 1.1
revision our native schemas actually need, plus its XMLSchema.dtd/
datatypes.dtd dependents and a refreshed xml.xsd. Resolution stays
fully offline via catalogHandling=strict and the shipped catalog.xml.

Caught immediately: 5 schemas carried an xsi:type="dcterms:W3CDTF"
appinfo annotation with no backing schema, never caught before
because nothing validated this strictly. Removed (annotation-only,
no semantic effect) and bumped the affected xs:schema/@version per
the governance policy, syncing SchemaVersion.java and
collection.xconf.init accordingly.

close eXist-db#5541
duncdrum and others added 5 commits June 22, 2026 23:19
Saxon

`transform:transform()` and `fn:transform()` never consulted eXist's
configured catalog when resolving xsl:import/xsl:include or runtime
document() calls.

Add the system catalog as a resolver fallback in
`XsltURIResolverHelper`'s chain for `transform:transform()`, and as
SaxonConfiguration's Configuration-level
ResourceResolver (`fn:transform()` runtime `doc()` calls).
Saxon 12 dropped `Configuration.setURIResolver()` in favor of
`setResourceResolver()`, so a small adapter wraps the existing
`org.xmlresolver.Resolver`.

fn:transform()'s own xsl:import resolution (URIResolution.java) is
wired identically and confirmed correctly invoked, but is blocked by a
separate, pre-existing fn:transform()+"stylesheet-node" defect.
Tests for that path are %test:pending, referencing the relevant
upstream issues (eXist-db#5051, eXist-db#5052, eXist-db#5682).

Closes eXist-db#350
Storing an XSD schema document (any version) with collection
validation enabled dynamically discovers its namespace
(http://www.w3.org/2001/XMLSchema) and resolves the grammar via the
system catalog -- which now points at the upgraded XSD 1.1
meta-schema. That resolution went through the default SAX
dynamic-discovery validating pipeline, which can never load an XSD 1.1
grammar at all (confirmed empirically: Xerces' internal schema/version
property throws SAXNotRecognizedException on a standard XMLReader).
Previously masked because the old, unused 1.0 meta-schema was simple
enough for the 1.0-only loader; storing any .xsd document, even a
plain 1.0-only one, started failing outright once the meta-schema was
upgraded.

Peek at the root element's namespace in MutableCollection before the
SAX pipeline runs (both the validate and store phases use the same
InputSource, parsed twice already, so this is a third, safe peek using
the same re-readable mechanism). If it's the XML Schema namespace,
validate via a SchemaFactory.newInstance("...v1.1")-backed Validator
instead, explicitly compiled from the system catalog's resolution of
that namespace (dynamic, no-pre-supplied-source compilation doesn't
pick up a grammar for a root namespace with no schemaLocation hint at
all), wired to the same ContentHandler/Indexer the SAX pipeline would
otherwise feed. Any other namespace is unaffected.

This also closes the gap from eXist-db#5541 for store-time validation: storing
an XSD 1.1-syntax schema document (xs:assert, xpathDefaultNamespace,
etc.) now validates correctly too, not just 1.0-syntax ones.
MutableCollection's at-store-time validation (<validation mode="auto"/"yes">)
always used the plain dynamic-discovery SAX pipeline, which the bundled
Xerces fork's XSD 1.1 support never wires into -- only the JAXP
SchemaFactory/Validator API. The earlier narrow fix routed exactly one known
case (storing a schema document itself, validated against the meta-schema
namespace) through an explicit-Source XSD 1.1 Validator; this generalizes
that to two detectable cases, both decided up front (never via
retry-after-failure, since Indexer/DocumentTriggers stream-build the
persistent document and can't safely be re-fed after a partial, aborted
first pass):

1. Storing a schema document whose catalog-resolved namespace grammar needs
   1.1 to load -- decided empirically (probe-compile via the plain 1.0
   SchemaFactory first) rather than hardcoding the meta-schema namespace as a
   special case, so any catalog-registered namespace benefits.
2. An instance with its own xsi:schemaLocation/noNamespaceSchemaLocation hint
   resolving to a schema declaring vc:minVersion="1.1" -- shares detection
   logic with validation:jaxp()'s own up-front peek, extracted into a new
   org.exist.validation.Xsd11SchemaDetection so neither side duplicates it.

A schemaLocation hint resolving to a schema that doesn't self-declare
vc:minVersion (e.g. via catalog-mediated indirection) remains undetected and
still fails as before -- an accepted, documented limitation of peek-only
detection with no retry safety net.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WARNING`

The no-arg `XMLResolverConfiguration` constructor adds `./catalog.xml`
as a default catalog path, which doesn't exist when running tests from
`exist-core/`. Removing it via
`resolverConfiguration.removeCatalog("./catalog.xml")` in both
`newResolver` and `newResolverFromSax` eliminates the three duplicate
`WARNING: Failed to load catalog` messages per test run.
- Add Server instance field to JettyStart for direct lifecycle access
- Set stop timeout (30s) on Server so server.stop() never hangs
- Shutdown Jetty directly in shutdown() instead of via
  ShutdownListenerImpl
  timer chain (BrokerPool.stopAll → listener → 1s Timer → server.stop)
- Keeps deadline-based wait(remaining) as defense-in-depth
- Removes unused Timer/TimerTask imports

Closes the test suite hang where Object.wait() in JettyStart.shutdown()
waited indefinitely for lifeCycleStopped() that never fired.
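
The deadline-based `wait(remaining)` pattern mentioned above can be sketched as follows. This is a generic illustration rather than `JettyStart`'s actual code: the `StopLatch` class and its method names are hypothetical, with `markStopped()` standing in for the `lifeCycleStopped()` callback.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: wait for a stop signal, but never past a deadline. */
final class StopLatch {

    private boolean stopped;

    /** Called by the lifecycle callback once the server has really stopped. */
    synchronized void markStopped() {
        stopped = true;
        notifyAll();
    }

    /**
     * Waits until markStopped() is called or the timeout elapses.
     * Returns true if the stop signal arrived, false on timeout.
     */
    synchronized boolean awaitStopped(long timeoutMillis) throws InterruptedException {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        // Loop to guard against spurious wakeups; recompute the remaining time each pass.
        while (!stopped) {
            final long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // wait(0) would block forever, so never pass less than 1 ms.
            wait(Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
        }
        return true;
    }
}
```

A bare `wait()` blocks indefinitely when the callback never fires; bounding each wait by the remaining time means the caller regains control at the deadline either way.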
duncdrum force-pushed the dp-exist-native-schemas branch from c1fadb2 to feddfb5 June 22, 2026 21:31
duncdrum added 3 commits June 23, 2026 00:30
validator

`isValid()` (XML-RPC) used a plain XSD-1.0-only SAX pipeline with no XSD 1.1
detection, unlike `validation:jaxp()` and store-time validation. Adds the
same schemaLocation peek-and-route via Xsd11SchemaDetection, falling through
to the existing pipeline when no base URI or 1.1 hint is available.
Removes manual sync between hand-copied version constants and the XSDs
they describe. Also adds a test reporting which test/sample config
fixtures still lack schemaVersion.
abbrev was already the de-facto name everywhere else (eXist-db#6008).