Real world example of XML 2 JSON

johnataylor edited this page Feb 29, 2016 · 22 revisions

Following the basic idea described in JSON 2 XML, let's look at a real-world example.

Recap of Our Approach

Rather than trying to translate the constructs of XML directly into the constructs of JSON, we are going to focus our efforts on consistently labeling the concepts, or entities, we find in our XML with URIs. Having done that, everything else falls into place with little or no effort.

The Scenario

NuGet is a package management system for the .NET framework. nuget.org is a package repository and provides a series of endpoints that serve metadata about the packages in the repository. For future flexibility in the client implementation(s), the decision was made to serve this metadata up as JSON. However, this metadata is largely derived from an XML file, called the "nuspec", contained in every package. So at the heart of the server implementation is a translation between XML and JSON.

Normalization Steps Within the XML World

When this new metadata endpoint was implemented there was already a substantial number of packages in the system. But as anyone who has had to deal with legacy data knows, there are always some surprises when we start to dig into the problem. Here it appeared the developers who had originally put the old system together had simply missed the point of XML namespaces; in fact, they had managed to change the namespace in the XML for every release of the client tools, and by the time work started on the new metadata endpoints there had been quite a number of releases.

So the first step here is rather unusual: we are simply going to preprocess all the XML and normalize the namespaces. In this particular case that is a safe thing to do because we understand the limited context. That is, in this particular scenario, we know that elements with the same local name but residing in different namespaces were actually intended to be the same property, so all we need to do is visit every node in the XML and replace its namespace with the one we want. The XSLT to perform this is a very direct expression of exactly that:

<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <xsl:template match="*">
    <xsl:element name="{local-name()}" namespace="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
      <xsl:for-each select="@*">
        <xsl:copy-of select="."/>
      </xsl:for-each>
      <xsl:apply-templates select="node()"/>
    </xsl:element>
  </xsl:template>  
</xsl:stylesheet>

Perhaps we should have parameterized this with the namespace we are normalizing to, in which case we would have had a generic solution; as it is, it was simple and worked for our scenario. This XSLT is currently running in the NuGet service as the file normalizeNuspecNamespace.xslt
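The same visitor-style normalization can be sketched outside of XSLT. Here is a minimal Python illustration using only the standard library; the input document and old namespace are made up for the example:

```python
import xml.etree.ElementTree as ET

TARGET_NS = "http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd"

def normalize_namespace(elem):
    """Recursively rewrite every element's namespace to TARGET_NS,
    keeping local names, attributes, and text intact."""
    local = elem.tag.split('}', 1)[-1]  # strip any "{namespace}" prefix
    elem.tag = "{%s}%s" % (TARGET_NS, local)
    for child in elem:
        normalize_namespace(child)
    return elem

# A toy nuspec fragment in a hypothetical legacy namespace.
xml_in = '<metadata xmlns="http://old/ns/2011"><id>Foo</id></metadata>'
root = normalize_namespace(ET.fromstring(xml_in))
```

As in the XSLT, this is only safe because we know that same-named elements in the different namespaces really were meant to be the same thing.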

This experience was illustrative: although it's unlikely that you will come across exactly this confusion, you should expect to come across something odd when you are dealing with historical data.

Using XSLT Extensions

While XSLT is very good at dealing with higher-level structural transformation, it can become a little awkward when dealing with individual fields, where regular imperative code is a far better tool. As it happened, the implementation of XSLT used here allowed extensions to be added in code.

One of the points made in JSON 2 XML was that the consistent creation of URIs is of the utmost importance. The NuGet system is based on case-insensitive package identifiers, and in various places these identifiers form part of the path component of URIs. However, the path component of a URI is case sensitive, so the simplest thing to do was to consistently lowercase all the URIs.

Another component of the generated URIs is the package version. NuGet package version numbers are based on Semantic Versioning; however, when the system was first put together, validation at this level was very spotty, and as a result the version numbers were very inconsistently formatted. Sometimes a package might state its version as 1.0, other times it might say 1.0.0.0; the correct Semantic Version would be 1.0.0. NuGet package resolution actually regards 1.0, 1.0.0.0, and 1.0.0 as all the same, so normalizing these to a correct Semantic Version format was important: things that the system regards as the same would then be given the same URI.

String transformations like lowercasing are more involved than they should be in XSLT, and normalizing Semantic Version numbers would take an enormous amount of effort. The good news is that we have code in .NET to handle this. (It is actually available as a package called NuGet.Versioning.) So rather than attempting to solve these field-level transformations in XSLT, the right thing to do is call out into code.
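To make the field-level logic concrete, here is a toy Python sketch of the two transformations described above. The version normalization covers only the simple numeric cases mentioned here; the real NuGet.Versioning library handles much more (prerelease tags, build metadata, and so on), and the helper names are illustrative:

```python
def normalize_version(version):
    """Normalize a loose version string toward SemVer form:
    '1.0' -> '1.0.0', '1.0.0.0' -> '1.0.0'. A redundant zero
    fourth part is dropped; anything else is left alone."""
    parts = version.split('.')
    while len(parts) < 3:
        parts.append('0')
    if len(parts) == 4 and parts[3] == '0':
        parts = parts[:3]
    return '.'.join(parts)

def make_about_uri(base, package_id, version, extension):
    """Build a package URI and lowercase the whole thing, so that
    case-insensitive package ids all map to one address."""
    path = package_id + '.' + normalize_version(version)
    return (base + '/' + path + extension).lower()
```

With this sketch, `make_about_uri("http://api.nuget.org/v3", "Newtonsoft.Json", "1.0", ".json")` and the same call with version "1.0.0.0" produce the identical lowercase URI, which is exactly the property the service needs.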

The extension object used is here.

The XSLT engine must be hosted, and it is the responsibility of this hosting code to create these extension objects and make them available to the XSLT engine; the code to do that is contained here.

The XSLT to translate the Nuspec XML can be found here.

And here is one of the specific areas the extension object is used:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <ng:PackageDetails>
    <xsl:variable name="path" select="concat(nuget:id, '.', obj:NormalizeVersion(nuget:version))" />
    <xsl:attribute name="rdf:about">
      <xsl:value-of select="obj:LowerCase(concat($base, $path, $extension))"/>
    </xsl:attribute>
...

Hopefully it is clear that the aim here is to create normalized, consistent URIs from our sometimes inconsistent historical data.

Making use of URI Fragments

We wanted to create absolute addresses that could be followed, but when we wrote the XSLT we weren't exactly sure what those addresses would be, so we parameterized the XSLT with a base address from which we constructed the URIs we were generating; we also passed in a file extension. These are constant, with values something like "http://api.nuget.org/v3" and ".json" for $base and $extension respectively. More interesting is the use of URI fragments.

The JSON created with this XSLT represents a single physical document. The actual physical address of that document is the @id of the root object. This was a simplification, and it allows us to walk through all the data by clicking on links, simply using a browser and a plugin that understands JSON documents. Consistent with this physical arrangement of the data, all the URIs we generate for the various subordinate structures use fragments. If two URIs differ in their fragment they are different URIs, so we can use them to identify distinct entities, but they actually resolve to the same physical resource. HTML makes heavy use of this with the anchor tag: given a URI with a fragment, the browser jumps not just to that page but to the particular section of that page. We wanted exactly the same model in our JSON-LD (and one day we hope to have a client that understands exactly this behavior).

The challenge then became one of creating unique (and repeatable) URIs for the subordinate structures in the metadata that describes a package. This is done in our XSLT by appending to a $fragment variable as the recursive descent into the XML proceeds. In other words, we create paths that correspond to the hierarchical nature of the XML and then make those paths the fragment part of the root node's URI. That is actually simpler than it sounds. And when we needed to disambiguate between children, we managed to find a value in the XML that we could include in the fragment path.

The following snippet from the nuspec XSLT is illustrative of this approach:

<xsl:template match="nuget:group">
  <xsl:param name="path" />
  <xsl:param name="parent_fragment" />
  <ng:dependencyGroup>
    <ng:PackageDependencyGroup>
      <xsl:variable name="fragment">
        <xsl:choose>
          <xsl:when test="@targetFramework">
            <xsl:value-of select="concat($parent_fragment, '/', @targetFramework)"/>
          </xsl:when>
          <xsl:when test="@name">
            <xsl:value-of select="concat($parent_fragment, '/', @name)"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="$parent_fragment"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:variable>
      <xsl:attribute name="rdf:about">
        <xsl:value-of select="obj:LowerCase(concat($base, $path, $extension, $fragment))"/>
      </xsl:attribute>
...
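The xsl:choose above is just a small piece of imperative logic, so it translates naturally into ordinary code. Here is a hedged Python sketch of the same fragment-building rule (the function names and sample values are illustrative, not part of the real service):

```python
def group_fragment(parent_fragment, attrs):
    """Extend the fragment path for a dependency group, mirroring the
    xsl:choose above: prefer @targetFramework, then @name, otherwise
    keep the parent's fragment unchanged."""
    if 'targetFramework' in attrs:
        return parent_fragment + '/' + attrs['targetFramework']
    if 'name' in attrs:
        return parent_fragment + '/' + attrs['name']
    return parent_fragment

def about_uri(base, path, extension, fragment):
    """Assemble and lowercase the rdf:about value, as the
    obj:LowerCase(concat(...)) call does in the XSLT."""
    return (base + path + extension + fragment).lower()
```

For example, a group carrying targetFramework="net45" under a parent fragment "#dependencygroup" gets the fragment "#dependencygroup/net45", and the same input always yields the same URI, which is the repeatability property we were after.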

Hopefully all this is illustrative of how most of the work we are doing is manufacturing URIs to consistently label the concepts we care about in our data.

Is There an Even More Generic Approach?

Looking back over this, it's tempting to ask whether there is an even more generic solution than the one we have. We were not actually trying to build something generic; we just happened to arrive at something that looks fairly generic but solves our specific problem.

One possible approach we considered, but didn't have time to experiment with, was a much more generic XSLT, in fact one not dissimilar to the "visitor" namespace-normalizing one above, together with an extension that acts as a URI "factory" invoked for every element. This would essentially factor the implementation as an event stream, and because we would want to track state in order to concatenate a path for the URI fragment, we would then need some rules, essentially a grammar, governing the behavior in the extension. At which point we've pretty much come full circle. Perhaps this would be an interesting experiment, but just walking through it clarifies the point that there is some scenario-specific normalization happening here. We happened to have used XSLT to express the specific rules that govern that normalization.

Otherwise Just Copy the Properties Across

Otherwise we were happy with the properties in the XML simply becoming the properties in the JSON. This is basically how the book example worked too. So, as a default, we take the namespace + name of the XML element and make that the namespace + name of the corresponding JSON-LD property.
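This default rule can be sketched in a few lines of Python. The joining logic here is a simplification that assumes the namespace URI already ends in a separator such as '#', which is true of the schema.nuget.org vocabulary; the sample element is made up:

```python
import xml.etree.ElementTree as ET

def property_uri(elem):
    """Default mapping: the XML element's namespace plus its local name
    becomes the JSON-LD property URI. (A sketch; it assumes the
    namespace already ends in '#' or '/'.)"""
    if elem.tag.startswith('{'):
        ns, local = elem.tag[1:].split('}', 1)
        return ns + local
    return elem.tag  # no namespace: use the bare name

# A toy nuspec-style element in the schema.nuget.org vocabulary.
elem = ET.fromstring('<id xmlns="http://schema.nuget.org/schema#">Foo</id>')
```

Declaring that same namespace as @vocab in the @context is what later lets the JSON use just the short local name.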

Consistency in the Generated JSON

Finally, we wanted consistency in the structure of the JSON we created. Unlike JSON and most modern programming languages, RDF doesn't have a very effective first-class notion of collections. JSON-LD is a true RDF serialization; however, it does a little extra work to overcome this incompatibility. By default, if a property in the RDF data set has just a single value, the JSON-LD framed format renders it as a simple property on an object in the JSON. However, if there happen to be two or more values associated with that property (aka predicate) name, the JSON-LD processor gives us an array in the JSON. Flipping between an object property with a simple value and one with an array would be very annoying to any JavaScript programmer trying to work with this data. The solution is to always force the creation of an array if we know the data might generate one. This is done in the @context: along with declaring the namespaces and some types, we also declare that a property is a container. Having done that, the property will always be rendered as an array. This is the context we used (and it's checked in here):

"@context": {
  "@vocab": "http://schema.nuget.org/schema#",
  "catalog": "http://schema.nuget.org/catalog#",
  "xsd": "http://www.w3.org/2001/XMLSchema#",
  "dependencies" : { "@id" : "dependency", "@container" : "@set" },
  "dependencyGroups" : { "@id" : "dependencyGroup", "@container" : "@set" },
  "packageEntries": { "@id": "packageEntry", "@container": "@set" },
  "supportedFrameworks": { "@id": "supportedFramework", "@container": "@set" },
  "tags": { "@id": "tag", "@container": "@set" },
  "published": { "@type": "xsd:dateTime" },
  "created": { "@type": "xsd:dateTime" },
  "lastEdited" : { "@type" : "xsd:dateTime" },
  "catalog:commitTimeStamp" : { "@type" : "xsd:dateTime" }
}

There are a few examples of declaring the container here: dependencies, dependencyGroups, packageEntries, supportedFrameworks, and tags could all be collections but might actually be single valued in the RDF data set, so we declare them all with "@container": "@set". (Note the @ indicates a JSON-LD keyword.) The other thing we have done is change the property name; for example, "dependency" in the RDF becomes "dependencies" in the JSON. The reason is that the JSON represents an object model to the JavaScript or Java/C# programmer, and we wanted that object model to look right and make sense grammatically.

Otherwise the @context is very straightforward. Declaring the @vocab makes all the property names short, and the moment we've done that the JavaScript programmer can use the language's dot-notation syntax to access properties. And finally we declare some types: in the JSON a field like "published" is really just a string, but we want it consistently formatted, and we sometimes deal with these values in code, so declaring the type removes any potential ambiguity.
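The effect of the "@container": "@set" declaration can be shown with a tiny sketch. This is not a full JSON-LD processor, just an illustration of the shaping rule a conforming processor applies, using a cut-down copy of the context above:

```python
def compact_value(context, term, values):
    """Shape a property's values the way JSON-LD compaction would under
    this @context: a term declared with "@container": "@set" always
    gets an array; otherwise a single value is unwrapped."""
    defn = context.get(term)
    is_set = isinstance(defn, dict) and defn.get('@container') == '@set'
    if is_set:
        return list(values)
    return values[0] if len(values) == 1 else list(values)

# A cut-down version of the @context shown above.
context = {
    "dependencyGroups": {"@id": "dependencyGroup", "@container": "@set"},
    "description": {},  # no @set: single values stay scalar
}
```

So a package with one dependency group still gets `"dependencyGroups": [...]`, and the JavaScript consumer never has to check whether the property is a scalar or an array.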

Note this XSLT Currently Doesn't Run in The Transformation Service

Note the XSLT we have been discussing here won't actually run in the transformation service, because it requires certain parameters to be set by the host, not least of which are the extensions. Perhaps the solution is to develop a set of extensions that the transformation service always understands and makes available to every XSLT.