[wip] exist native schemas#6505
Draft
duncdrum wants to merge 30 commits into
The bundled Xerces XSD 1.1 fork only wires 1.1 support into the JAXP SchemaFactory/Validator API, not into the validating-SAXParser pipeline that `validation:jaxp()` uses for dynamic, schemaLocation-hint-driven schema discovery. Any XSD declaring `vc:minVersion="1.1"` (e.g. one using xs:assert) failed with `"cvc-elt.1.a: Cannot find the declaration of element"`, even though the same schema validates fine via `validation:jaxv()` with the language explicitly set to v1.1. Add a narrow fallback: when the default pipeline reports `cvc-elt.1.a` and the instance references a schema via `xsi:schemaLocation` or `xsi:noNamespaceSchemaLocation`, retry once with a `SchemaFactory/Validator` built for v1.1. DTD-only documents never meet that condition, so the default pipeline (and its DTD/grammar-pool behavior) is untouched for them. Surfaced during related work; see eXist-db#6002
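A minimal sketch of the retry guard described above, using only JDK classes. The class and method names (`Xsd11RetryGuard`, `hasSchemaHint`, `shouldRetryWithXsd11`) are hypothetical and not taken from the PR; matching on the `cvc-elt.1.a` key prefix rather than the message text is an assumption about how the check could be kept locale-independent.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not the PR's actual code.
final class Xsd11RetryGuard {
    private static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    /** True when the message starts with the Xerces key for a missing root declaration. */
    static boolean isMissingElementDeclaration(String message) {
        return message != null && message.startsWith("cvc-elt.1.a");
    }

    /** True when the root element carries xsi:schemaLocation or xsi:noNamespaceSchemaLocation. */
    static boolean hasSchemaHint(String instanceXml) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(instanceXml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Only the root element is inspected; the peek stops here.
                        return reader.getAttributeValue(XSI_NS, "schemaLocation") != null
                            || reader.getAttributeValue(XSI_NS, "noNamespaceSchemaLocation") != null;
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return false; // unparseable instance: no retry
        }
    }

    /** Retry once with a 1.1 validator only when both conditions hold. */
    static boolean shouldRetryWithXsd11(String errorMessage, String instanceXml) {
        return isMissingElementDeclaration(errorMessage) && hasSchemaHint(instanceXml);
    }
}
```

The actual 1.1 `SchemaFactory` construction is left out because it depends on the bundled Xerces fork being on the classpath.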
round-trip `Jaxp.java` and `SearchResourceResolver.java` each serialized a catalog document stored in the database to a String just so `org.xmlresolver` could re-parse it from a StringReader-backed InputSource. xmlresolver's `CatalogManager` (reachable via `XMLResolverConfiguration#getFeature`) exposes loadCatalog(URI, SaxProducer), so the catalog's SAX events can be streamed directly into the loader instead.
`validation:jaxp()` already accepts an explicit catalog for schema and entity resolution; `jaxv()` had none, so `xs:import/xs:include` inside its explicitly-supplied grammars could only resolve via relative schemaLocation paths. Add a 4th, optional catalogs argument to `jaxv()/jaxv-report()`, wired via `SchemaFactory#setResourceResolver()` so resolution happens at schema-compile time.
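To illustrate what resolving at schema-compile time means here, the sketch below sets an `LSResourceResolver` on a JDK `SchemaFactory` so that an `xs:include` target is served from an in-memory map. The map stands in for a catalog and is purely an assumption for the example; the class name `CatalogBackedSchemaCompiler` is hypothetical, and the stock XSD 1.0 factory is used rather than the 1.1 one.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;

// Hypothetical stand-in for a catalog-backed resolver; not the PR's code.
final class CatalogBackedSchemaCompiler {

    /** Compiles mainXsd, answering xs:import/xs:include lookups from a map keyed by systemId. */
    static Schema compile(String mainXsd, Map<String, String> catalog) {
        try {
            DOMImplementationLS ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            factory.setResourceResolver((type, namespaceURI, publicId, systemId, baseURI) -> {
                String text = catalog.get(systemId);
                if (text == null) {
                    return null; // no entry: let the factory fall back to its default lookup
                }
                LSInput input = ls.createLSInput();
                input.setSystemId(systemId);
                input.setCharacterStream(new StringReader(text));
                return input;
            });
            return factory.newSchema(new StreamSource(new StringReader(mainXsd)));
        } catch (Exception e) {
            throw new IllegalStateException("schema compilation failed", e);
        }
    }

    /** Convenience wrapper so callers get a boolean instead of an exception. */
    static boolean isValid(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Because the resolver is attached to the factory, the lookup happens while the grammar is compiled, before any instance document is validated.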
- Fix XSD 1.1 retry corruption
- Fix locale-fragile detection
- Dedupe catalog SAX streaming
- Style improvements, typos, Codacy happiness

Buffer-and-replay the catalog SaxProducer's first invocation instead of re-serializing from the database on each of xmlresolver's two invocations (RNG-validate, then load). This avoids a second broker round-trip per catalog per call, and the two passes no longer risk seeing different content if the document is concurrently modified in between.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also rewrites RecordingContentHandler to use the existing org.exist.util.sax.event.contenthandler.* event-recording classes (StartElement, Characters, TextEvent, etc.) instead of a hand-rolled, parallel ContentHandlerEvent model. These already provide the exact 'record SAX events, replay against a different handler later' capability this class needs, including the defensive Attributes/char[] copying.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
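The buffer-and-replay idea can be sketched with plain SAX classes. This is not the PR's RecordingContentHandler and does not use the org.exist.util.sax.event classes; `SaxRecorder` and its `describe` helper are hypothetical names, and only element and character events are recorded to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: record SAX events once, replay them any number of times.
final class SaxRecorder extends DefaultHandler {
    /** One buffered SAX callback, replayable against any handler. */
    interface Event { void replay(ContentHandler target) throws SAXException; }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        Attributes copy = new AttributesImpl(atts); // parsers may reuse the Attributes object
        events.add(h -> h.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add(h -> h.endElement(uri, local, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        char[] copy = Arrays.copyOfRange(ch, start, start + length); // the buffer is reused too
        events.add(h -> h.characters(copy, 0, copy.length));
    }

    void replay(ContentHandler target) throws SAXException {
        target.startDocument();
        for (Event e : events) e.replay(target);
        target.endDocument();
    }

    /** Parses xml once and returns the filled recorder. */
    static SaxRecorder record(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SaxRecorder recorder = new SaxRecorder();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), recorder);
            return recorder;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Replays into a tiny printing handler; used to show that both passes see the same events. */
    static String describe(SaxRecorder recorder) {
        StringBuilder out = new StringBuilder();
        DefaultHandler printer = new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q);
                for (int i = 0; i < a.getLength(); i++) {
                    out.append(' ').append(a.getQName(i)).append('=').append(a.getValue(i));
                }
                out.append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
            @Override public void characters(char[] ch, int s, int len) {
                out.append(ch, s, len);
            }
        };
        try {
            recorder.replay(printer);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Replaying from the buffer means the second consumer never touches the original source, which is the property the commit relies on for the RNG-validate and load passes.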
`jaxp()` now peeks at the schema's `vc:minVersion` before parsing and picks the right pipeline directly, instead of always failing once and retrying. The peek only follows same-origin (scheme+authority) locations, since the `schemaLocation` hint is document-controlled and unrestricted resolution would let any caller make the server fetch arbitrary files or URLs. `SearchResourceResolver` now also implements `LSResourceResolver`, so directory-search catalogs work with the XSD 1.1 validator and with `jaxv()`'s 4th argument too. Broadens the retry guard to fire whenever a catalog is configured, since catalogs can resolve by namespace alone with no `schemaLocation` hint present.
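A sketch of what such an up-front peek could look like, assuming a StAX reader that stops at the schema's root element. `Xsd11Peek` is a hypothetical name; the numeric comparison against 1.1 is an assumption (the PR only states that `vc:minVersion="1.1"` is detected), and the same-origin restriction is deliberately not part of this block.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not the PR's peekIsXsd11ViaSchemaLocation().
final class Xsd11Peek {
    private static final String XS_NS = "http://www.w3.org/2001/XMLSchema";
    private static final String VC_NS = "http://www.w3.org/2007/XMLSchema-versioning";
    private static final BigDecimal V1_1 = new BigDecimal("1.1");

    /** True when the root is xs:schema and declares vc:minVersion of at least 1.1. */
    static boolean declaresXsd11(String schemaText) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(schemaText));
            try {
                while (reader.hasNext()) {
                    if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                        continue; // skip comments, processing instructions, whitespace
                    }
                    if (!XS_NS.equals(reader.getNamespaceURI()) || !"schema".equals(reader.getLocalName())) {
                        return false; // not a schema document
                    }
                    String min = reader.getAttributeValue(VC_NS, "minVersion");
                    return min != null && new BigDecimal(min.trim()).compareTo(V1_1) >= 0;
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException | NumberFormatException e) {
            return false; // unreadable candidate: treat as "not known to be 1.1"
        }
    }
}
```

Only the first start tag is read, so the cost of the peek stays independent of the schema's size.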
resolveEntity()'s namespace branch and resolveResource() each ran the same findXSD-by-namespace lookup, debug log, and fixupExistCatalogUri normalization inline. Extract the shared logic into one helper so the two entry points (XNI and LSResourceResolver) can't drift apart. Also hoists the per-call 'new DatabaseResources(brokerPool)' (in resolveEntity's publicId branch and findXsdResourcePathByNamespace) to a constructor-initialized field -- SearchResourceResolver is already a long-lived per-resolution instance, so there's no need to reconstruct this thin wrapper on every resolveEntity/resolveResource call. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval() inlined the up-front XSD 1.1 peek and the cvc-elt.1.a retry safety net as long blocks of logic mixed with mutable local state. Extract them into peekIsXsd11ViaSchemaLocation() and retryWithXsd11ValidatorIfNeeded(), threading the retry's content-handler/builder pair back out via a small ParseTarget record instead of reassigning eval()'s locals from deep inside a conditional.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also threads the cache-grammars flag into the XSD 1.1 validator path (newXsd11Validator/retryWithXsd11ValidatorIfNeeded), shares the catalog-URL dispatch logic with validation:jaxv() via Shared.resolveCatalogArgument (throwing instead of just logging on a malformed catalog URL), and makes isMissingElementDeclaration package-private so IsMissingElementDeclarationTest can pin its match against the real Xerces message text directly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also splits the three-condition compound boolean guard in retryWithXsd11ValidatorIfNeeded into separate early-return checks, one per independent bail-out reason, and extracts a getGrammarPool() helper shared between eval()'s SAX-pipeline grammar-pool wiring and newXsd11Validator()'s SchemaFactory cache-wiring (previously the same Configuration/GrammarPool lookup duplicated in both places).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Schema The up-front XSD 1.1 peek follows the instance's own xsi:schemaLocation hint before any catalog/permission check would otherwise govern the fetch, so it must only ever resolve within the instance's own origin. Cover this directly: same-origin relative locations are followed, cross-scheme/cross-host absolute locations (http, and cross-host xmldb://, which XmldbURL.isEmbedded() treats as a remote XML-RPC target) are refused without attempting a network connection. Also documents the one residual nuance (file: URIs have no authority component) inline on isXsd11Schema, gated behind the existing util:-interop privilege boundary needed to get a file: instance at all. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
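The same-origin rule (scheme plus authority) can be sketched with `java.net.URI` alone. `SameOrigin.resolveSameOrigin` is a hypothetical name, and the example hosts are placeholders; eXist's own XmldbURL handling is not modelled here.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch of a scheme+authority check; not the PR's isXsd11Schema().
final class SameOrigin {

    /** Resolves the hint against the instance URI and returns it only if the origin is unchanged. */
    static Optional<URI> resolveSameOrigin(URI instance, String schemaLocation) {
        if (instance == null || instance.getScheme() == null) {
            return Optional.empty(); // no origin to compare against
        }
        final URI resolved;
        try {
            resolved = instance.resolve(schemaLocation);
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // malformed hint
        }
        boolean sameScheme = instance.getScheme().equalsIgnoreCase(resolved.getScheme());
        // Note: file: URIs have no authority, so two file: URIs compare equal here.
        boolean sameAuthority = Objects.equals(lower(instance.getAuthority()), lower(resolved.getAuthority()));
        return sameScheme && sameAuthority ? Optional.of(resolved) : Optional.empty();
    }

    private static String lower(String s) {
        return s == null ? null : s.toLowerCase(Locale.ROOT);
    }
}
```

The check runs on the resolved URI, so an absolute hint pointing elsewhere is refused before any connection would be opened.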
isXsd11Schema() re-fetches and re-peeks the candidate schema's root element on every call, even when validating many documents against the same schema back-to-back. Add a bounded, LRU-evicted cache keyed by the resolved schema URI, cleared by validation:clear-grammar-cache() alongside the Xerces grammar pool so there's one function to reset every validation-related cache. Failures (unreadable/unparseable candidates, transient IO errors) are deliberately never cached, since a stale negative would wrongly keep a legitimate schema on the slower retry-after-failure path forever. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
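A small sketch of a bounded LRU cache that never stores failures, built on an access-ordered `LinkedHashMap`. `DetectionCache` and its `probe` callback are hypothetical; an empty `Optional` stands for an unreadable or unparseable candidate.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the caching policy; not the PR's XSD11_DETECTION_CACHE.
final class DetectionCache {
    private final Map<String, Boolean> entries;

    DetectionCache(int maxEntries) {
        // accessOrder=true turns the map into an LRU; the eldest entry is dropped past the bound.
        this.entries = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached answer, or runs the probe and caches only a definite result. */
    synchronized Optional<Boolean> detect(String schemaUri, Function<String, Optional<Boolean>> probe) {
        Boolean cached = entries.get(schemaUri);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<Boolean> result = probe.apply(schemaUri);
        result.ifPresent(value -> entries.put(schemaUri, value)); // empty = failure, never stored
        return result;
    }

    synchronized void clear() {
        entries.clear();
    }

    synchronized int entryCount() {
        return entries.size();
    }
}
```

Because a failed probe leaves no entry behind, the next call for the same URI simply probes again.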
The directory-search XSD 1.1 schema and its valid/invalid instances were inline Java string constants, unlike every other fixture in this test class, which loads via Samples.SAMPLES.getSample(). Move them to real resource files under exist-samples for consistency, and load them the same way. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three @Ignore("todo") tests referenced fixtures this class's @BeforeClass never stores (/db/tournament/1.5/...) and validated a Schematron .sch file through jaxp-report(), whose 2-arg signature has no grammar parameter at all -- the .sch document was silently coerced to the unrelated enable-grammar-cache boolean via its effective boolean value. Replace them with working tests against this class's actual addressbook fixtures, mirroring the sibling ParseDtdNokTest's stored/anyURI valid/invalid pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add CatalogResolutionRegressionTest, exercising ResolverFactory and XercesXmlResolverAdapter directly -- the same machinery XMLReaderObjectFactory wires into every pooled XMLReader, which both fn:parse-xml()/util:parse() and the default SAX validation pipeline parse documents with. catalogResolvesPublicEntityWithNoBaseUri proves a configured catalog's PUBLIC entry still resolves a DTD with no document base URI to resolve a relative SYSTEM identifier against -- exactly the scenario reported in eXist-db#1975, where parse-xml()/util:parse() appeared not to consult catalogs at all. catalogWithoutMatchingEntryDoesNotFetchUnmatchedRemoteSystemId proves a catalog with no matching entry declines (returns null) rather than fetching the literal, document-author-controlled URI itself -- governed by org.xmlresolver.ResolverFeature#ALWAYS_RESOLVE, which ResolverFactory never sets, relying on the library default of false. This is what prevents the outbound-network-exfiltration behavior reported in eXist-db#2476. Closes eXist-db#1975 Closes eXist-db#2476 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tures Tournament.xsd and Tournament.rng (siblings of tournament-schema.sch, the only one of the three actually exercised by any existing test) were dead/unused. Their own embedded comment explains why they exist side by side: Tournament-valid.xml and Tournament-invalid.xml only differ in a Singles-tournament co-occurrence constraint (nbrParticipants must equal nbrTeams) that neither RELAX NG nor W3C XML Schema can express on their own -- Schematron rules are embedded in both via xsd:appinfo/RELAX NG annotations to add it. Add TournamentSchemaLanguageComparisonTest, validating both documents against the bare XSD (via directory-search) and bare RNG grammars. RNG reports both as valid, as documented. XSD reports both as invalid, but for an unrelated, pre-existing reason in this 2001-vintage sample: Match m3 references a Team t5 that's never declared, caught by XSD's built-in ID/IDREF referential-integrity check, which this Tournament.rng doesn't declare for the same elements. Both documents fail with the exact same error either way, which is itself the point: XSD's verdict doesn't change based on the co-occurrence violation either. JingSchematronTest already covers the Schematron leg against the same fixtures, correctly distinguishing the two documents. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move XSD11_DETECTION_CACHE/XSD11_DETECTION_CACHE_MAX_ENTRIES to the top of Jaxp, alongside its other fields, instead of declaring them mid-class next to the methods that use them. Make isXsd11Schema package-private instead of private, the same pattern already used for clearXsd11DetectionCache(), so JaxpSchemaLocationSecurityTest/JaxpXsd11DetectionCacheTest (both in the same package) can call it directly instead of via setAccessible(true).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also scopes XSD11_DETECTION_CACHE by the requesting Subject's name (a cache hit skips isXsd11Schema's permission-checked openStream() entirely, so without this a Subject lacking read permission on a schema resource could observe a boolean populated by a different Subject's earlier fetch), and replaces the hand-rolled synchronized LinkedHashMap-based LRU with a Caffeine cache (already a project dependency) for the same bound with less hand-written locking.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
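The subject scoping can be illustrated with a composite cache key. This sketch uses a `ConcurrentHashMap` rather than Caffeine to stay dependency-free, is unbounded for brevity, and all names (`SubjectScopedCache`, `Key`) are hypothetical.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the cache key includes who asked, not only what was asked for.
final class SubjectScopedCache {
    /** Composite key: the same schema URI yields distinct entries per subject name. */
    record Key(String subjectName, URI schemaUri) {}

    private final Map<Key, Boolean> entries = new ConcurrentHashMap<>();

    void put(String subjectName, URI schemaUri, boolean isXsd11) {
        entries.put(new Key(subjectName, schemaUri), isXsd11);
    }

    /** Returns null when this subject has not fetched the schema itself yet. */
    Boolean get(String subjectName, URI schemaUri) {
        return entries.get(new Key(subjectName, schemaUri));
    }
}
```

A miss for a second subject forces the permission-checked fetch to run for that subject, so one caller's result is never visible to another.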
…negotiation [feature] request module: Accept-header parsing and content negotiation
Wire xml-maven-plugin on the root aggregator to check conf.xml, collection.xconf.init, descriptor.xml, controller-config.xml, and mime-types.xml against schema/*.xsd using XSD 1.1 (Xerces + xpath2). Centralize version pins and plugin classpath in exist-parent; align three schemas with their canonical templates so validation passes.
Add path-filtered ci-schema-checks workflow: XSL governance that reads pairs from pom validationSets, compares base @version (xpath via GITHUB_ENV) to head, and fails with PR annotations when a schema or canonical template changes without a version bump.
Lets eXist detect when a config template (conf.xml, collection.xconf, descriptor.xml, mime-types.xml, controller-config.xml) was authored against an older revision of its paired XSD. Each schema gains an optional schemaVersion attribute mirroring xs:schema/@version; runtime parsers log a debug message for legacy documents that omit it and warn when a declared value doesn't match what the running build expects. see eXist-db#3062
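The three outcomes described above (match, legacy document without the attribute, mismatch) could be modelled roughly as follows. `SchemaVersionCheck` and its members are hypothetical names, and mapping outcomes to `System.Logger.Level` is only an illustration of the debug/warn split, not the logging API eXist uses.

```java
import java.lang.System.Logger.Level;
import java.util.Optional;

// Hypothetical sketch of the schemaVersion comparison; not the PR's SchemaVersion.java.
final class SchemaVersionCheck {
    enum Outcome { MATCH, LEGACY_MISSING, MISMATCH }

    /** Compares the document's declared schemaVersion with the version the build expects. */
    static Outcome check(String declared, String expected) {
        if (declared == null || declared.isBlank()) {
            return Outcome.LEGACY_MISSING; // older documents simply omit the attribute
        }
        return declared.trim().equals(expected) ? Outcome.MATCH : Outcome.MISMATCH;
    }

    /** Log level for an outcome: nothing on match, debug for legacy, warning on mismatch. */
    static Optional<Level> logLevel(Outcome outcome) {
        switch (outcome) {
            case LEGACY_MISSING: return Optional.of(Level.DEBUG);
            case MISMATCH:       return Optional.of(Level.WARNING);
            default:             return Optional.empty();
        }
    }
}
```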
mavenize and simplify the whole operation
fix instance schema-location paths
schema/ now ships at $EXIST_HOME/schema/. Fixed conf.xml/descriptor.xml/controller-config.xml's schema-location hints, which were source-tree-relative and never correct for the assembled layout. Extended catalog.xml with entries for the remaining schemas; see eXist-db#6189 catalog.xml is not WAI see eXist-db#5541 see eXist-db#350
Extend the existing template-vs-schema validation (pom.xml) to also check every schema/**/*.xsd is itself a legal schema document, per the W3C meta-schema. Idea borrowed from eXist-db#5541, where the same catalog trick lets a user validate their own XSD's well-formedness. Upgrades the bundled XMLSchema.xsd from the stale, unused 2001/2004 XSD 1.0 revision (no xs:assert/vc: support) to the 2009 XSD 1.1 revision our native schemas actually need, plus its XMLSchema.dtd/datatypes.dtd dependents and a refreshed xml.xsd. Resolution stays fully offline via catalogHandling=strict and the shipped catalog.xml. Caught immediately: 5 schemas carried an xsi:type="dcterms:W3CDTF" appinfo annotation with no backing schema, never caught before because nothing validated this strictly. Removed (annotation-only, no semantic effect) and bumped the affected xs:schema/@version per the governance policy, syncing SchemaVersion.java and collection.xconf.init accordingly. close eXist-db#5541
ad054ed to ff84ece
Saxon `transform:transform()` and `fn:transform()` never consulted eXist's configured catalog when resolving xsl:import/xsl:include or runtime document() calls. Add the system catalog as a resolver fallback in `XsltURIResolverHelper`'s chain for `transform:transform()`, and as SaxonConfiguration's Configuration-level ResourceResolver (for `fn:transform()` runtime `doc()` calls). Saxon 12 dropped `Configuration.setURIResolver()` in favor of `setResourceResolver()`, so a small adapter wraps the existing `org.xmlresolver.Resolver`. fn:transform()'s own xsl:import resolution (URIResolution.java) is wired identically and confirmed correctly invoked, but is blocked by a separate, pre-existing fn:transform()+"stylesheet-node" defect. Tests for that path are %test:pending, referencing the relevant upstream issues (eXist-db#5051, eXist-db#5052, eXist-db#5682). Closes eXist-db#350
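The resolver-fallback idea can be sketched with the JAXP `URIResolver` interface: each resolver in the chain is tried in order and the first non-null answer wins, with a catalog-backed resolver placed last. `ChainedUriResolver` is a hypothetical name and does not reproduce `XsltURIResolverHelper`; Saxon's own `ResourceResolver` API is not used here.

```java
import java.util.List;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;

// Hypothetical sketch of a resolver chain with a last-resort fallback.
final class ChainedUriResolver implements URIResolver {
    private final List<URIResolver> chain;

    ChainedUriResolver(List<URIResolver> chain) {
        this.chain = List.copyOf(chain);
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        for (URIResolver resolver : chain) {
            Source source = resolver.resolve(href, base);
            if (source != null) {
                return source; // first resolver that knows the href wins
            }
        }
        return null; // nobody matched: the processor applies its default resolution
    }

    /** Test-friendly wrapper returning the resolved systemId, or null if unresolved. */
    static String resolvedSystemId(URIResolver resolver, String href) {
        try {
            Source source = resolver.resolve(href, null);
            return source == null ? null : source.getSystemId();
        } catch (TransformerException e) {
            return null;
        }
    }
}
```

Appending the catalog resolver at the end of the list keeps existing resolution behavior unchanged for every href the earlier resolvers already handle.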
Storing an XSD schema document (any version) with collection validation enabled dynamically discovers its namespace (http://www.w3.org/2001/XMLSchema) and resolves the grammar via the system catalog -- which now points at the upgraded XSD 1.1 meta-schema. That resolution went through the default SAX dynamic-discovery validating pipeline, which can never load an XSD 1.1 grammar at all (confirmed empirically: Xerces' internal schema/version property throws SAXNotRecognizedException on a standard XMLReader). Previously masked because the old, unused 1.0 meta-schema was simple enough for the 1.0-only loader; storing any .xsd document, even a plain 1.0-only one, started failing outright once the meta-schema was upgraded. Peek at the root element's namespace in MutableCollection before the SAX pipeline runs (both the validate and store phases use the same InputSource, parsed twice already, so this is a third, safe peek using the same re-readable mechanism). If it's the XML Schema namespace, validate via a SchemaFactory.newInstance("...v1.1")-backed Validator instead, explicitly compiled from the system catalog's resolution of that namespace (dynamic, no-pre-supplied-source compilation doesn't pick up a grammar for a root namespace with no schemaLocation hint at all), wired to the same ContentHandler/Indexer the SAX pipeline would otherwise feed. Any other namespace is unaffected. This also closes the gap from eXist-db#5541 for store-time validation: storing an XSD 1.1-syntax schema document (xs:assert, xpathDefaultNamespace, etc.) now validates correctly too, not just 1.0-syntax ones.
MutableCollection's at-store-time validation (<validation mode="auto"/"yes">) always used the plain dynamic-discovery SAX pipeline, which the bundled Xerces fork's XSD 1.1 support never wires into -- only the JAXP SchemaFactory/Validator API. The earlier narrow fix routed exactly one known case (storing a schema document itself, validated against the meta-schema namespace) through an explicit-Source XSD 1.1 Validator; this generalizes that to two detectable cases, both decided up front (never via retry-after-failure, since Indexer/DocumentTriggers stream-build the persistent document and can't safely be re-fed after a partial, aborted first pass):
1. Storing a schema document whose catalog-resolved namespace grammar needs 1.1 to load -- decided empirically (probe-compile via the plain 1.0 SchemaFactory first) rather than hardcoding the meta-schema namespace as a special case, so any catalog-registered namespace benefits.
2. An instance with its own xsi:schemaLocation/noNamespaceSchemaLocation hint resolving to a schema declaring vc:minVersion="1.1" -- shares detection logic with validation:jaxp()'s own up-front peek, extracted into a new org.exist.validation.Xsd11SchemaDetection so neither side duplicates it.
A schemaLocation hint resolving to a schema that doesn't self-declare vc:minVersion (e.g. via catalog-mediated indirection) remains undetected and still fails as before -- an accepted, documented limitation of peek-only detection with no retry safety net.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
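The probe-compile step in case 1 might look roughly like this: try the grammar with a plain XSD 1.0 `SchemaFactory` and treat a compile failure as a sign that the 1.1 path is needed. `Xsd10Probe` is a hypothetical name. The sketch also cannot tell a 1.1-only schema from a simply broken one; it is not known from the source how the PR handles that distinction.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical sketch of an empirical "does this load as XSD 1.0?" probe.
final class Xsd10Probe {

    /** True when the grammar compiles with the stock XSD 1.0 factory. */
    static boolean compilesAsXsd10(String schemaText) {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        try {
            factory.newSchema(new StreamSource(new StringReader(schemaText)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. xs:assert is not legal XSD 1.0 content
        }
    }

    /** Routing decision made before the document is streamed to the indexer. */
    static boolean needsXsd11Route(String schemaText) {
        return !compilesAsXsd10(schemaText);
    }
}
```

Deciding before parsing matters here because, as the commit notes, a stream-built document cannot be fed a second time after an aborted first pass.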
WARNING`
The no-arg `XMLResolverConfiguration` constructor adds `./catalog.xml`
as a default catalog path, which doesn't exist when running tests from
`exist-core/`. Removing it via
`resolverConfiguration.removeCatalog("./catalog.xml")` in both
`newResolver` and `newResolverFromSax` eliminates the three duplicate
`WARNING: Failed to load catalog` messages per test run.
- Add Server instance field to JettyStart for direct lifecycle access
- Set stop timeout (30s) on Server so server.stop() never hangs
- Shut down Jetty directly in shutdown() instead of via ShutdownListenerImpl timer chain (BrokerPool.stopAll → listener → 1s Timer → server.stop)
- Keeps deadline-based wait(remaining) as defense-in-depth
- Removes unused Timer/TimerTask imports

Closes the test suite hang where Object.wait() in JettyStart.shutdown() waited indefinitely for lifeCycleStopped() that never fired.
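A deadline-based `wait(remaining)` loop, as opposed to an unbounded `Object.wait()`, can be sketched like this. `StopLatch` and its method names are hypothetical and stand in for the wait inside JettyStart.shutdown(); no Jetty classes are involved.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a bounded wait for a "stopped" notification.
final class StopLatch {
    private final Object lock = new Object();
    private boolean stopped;

    /** Called by the lifecycle listener once the server has stopped. */
    void signalStopped() {
        synchronized (lock) {
            stopped = true;
            lock.notifyAll();
        }
    }

    /** Waits until stopped or until the deadline passes; returns whether the stop was observed. */
    boolean awaitStopped(long timeoutMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            while (!stopped) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    return false; // deadline reached: give up instead of hanging
                }
                try {
                    lock.wait(remainingMillis); // re-check the flag after spurious wake-ups
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }
}
```

Recomputing the remaining time on every loop iteration keeps the total wait bounded even when the thread wakes up early several times.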
c1fadb2 to feddfb5
validator `isValid()` (XML-RPC) used a plain XSD-1.0-only SAX pipeline with no XSD 1.1 detection, unlike `validation:jaxp()` and store-time validation. Adds the same schemaLocation peek-and-route via Xsd11SchemaDetection, falling through to the existing pipeline when no base URI or 1.1 hint is available.
Removes manual sync between hand-copied version constants and the XSDs they describe. Also adds a test reporting which test/sample config fixtures still lack schemaVersion.
abbrev was already the de-facto name everywhere else (eXist-db#6008).
what I actually wanted to do.
extends #6496