XML 2 JSON
With its tight integration into modern programming platforms, JSON has emerged as the natural choice of data format in the enterprise. However, much of an organization's key data is still captured in XML. To unlock the power of this data we must first translate it into JSON.
Imagine we are starting with the following XML:
<?xml version="1.0" encoding="utf-8" ?>
<book xmlns="http://schemas.example.org/library">
  <title>Transaction Processing: Concepts and Techniques</title>
  <authors>
    <author>Jim Gray</author>
    <author>Andreas Reuter</author>
  </authors>
  <isbn>1-55860-190-2</isbn>
  <publisher>Morgan Kaufmann</publisher>
  <published>1993</published>
</book>
To create the JSON we are going to take a slightly roundabout route, the benefits of which will hopefully become clear later. The first step is to translate this XML into a flattened graph serialization model called RDF. We don't need to be too concerned with all the details of RDF right now. Suffice it to say that RDF represents data as a set of triples, where the nodes and arcs of our graph are identified by URIs. RDF is not exactly a data format; rather, it is a conceptual model with a number of well-defined data formats, and it is this characteristic we are going to exploit, because one of those formats is XML and another is JSON.
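To give a feel for the model, the book above decomposes into triples along these lines, sketched here as plain Python tuples (the URIs anticipate the ones we mint below; this is purely illustrative, not any particular RDF syntax):

# Each triple is (subject, predicate, object); subjects and predicates are
# URIs, while objects are either URIs or literal values.
triples = [
    ("http://example.org/book/1-55860-190-2.json",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://schemas.example.org/library#Book"),
    ("http://example.org/book/1-55860-190-2.json",
     "http://schemas.example.org/library#title",
     "Transaction Processing: Concepts and Techniques"),
    ("http://example.org/book/1-55860-190-2.json",
     "http://schemas.example.org/library#author",
     "http://example.org/book/author/Jim%20Gray.json"),
]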
The XML representation is called RDFXML and the JSON representation is called JSON-LD. Like RDF itself, both of these are industry standards.
The interesting thing about the JSON serialization is that it can be made to look like normal, regular JSON, where conceptual objects in our model are actual objects in the JSON and conceptual properties in our model are actual properties in the JSON. This might sound rather obvious, but there is a surprising amount of JSON kicking around that doesn't follow these basic rules, instead modeling artifacts of the internal implementation of one of the software agents. The problem is that such JSON doesn't bind to programming languages like JavaScript in the seamless way you might have hoped for. What we are looking for is a format that gives us a natural object model from code; specifically, when we use JavaScript we would expect to be able to use the dot operator to access properties and the array operator to index into collections. If the JSON we produce passes this test it will also work well with other languages like Java, Python or C#.
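To make that concrete, here is the kind of access we want the final JSON of this article to support, shown in Python where indexing plays the role of JavaScript's dot operator (the file name is an assumption):

import json

# Load the JSON produced at the end of this article.
with open("Book1.json") as f:
    book = json.load(f)

# Properties are plain properties and collections are plain arrays:
print(book["title"])               # Transaction Processing: Concepts and Techniques
print(book["authors"][0]["name"])  # Jim Gray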
But to get started we are going to dig up another industry standard, good old XSLT.
The first step is to craft an XSLT transform that takes the XML snippet and turns it into an RDFXML document. This turns out to be a straightforward exercise; here is our solution:
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:src="http://schemas.example.org/library"
    xmlns:library="http://schemas.example.org/library#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    version="1.0">
  <xsl:param name="baseAddress" />
  <xsl:template match="/src:book">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <library:Book>
        <xsl:attribute name="rdf:about">
          <xsl:value-of select="concat($baseAddress, src:isbn, '.json')"/>
        </xsl:attribute>
        <xsl:for-each select="*">
          <xsl:choose>
            <xsl:when test="self::src:authors">
              <xsl:apply-templates select="src:author" />
            </xsl:when>
            <xsl:otherwise>
              <xsl:element name="{concat('library:', local-name())}">
                <xsl:value-of select="."/>
              </xsl:element>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:for-each>
      </library:Book>
    </rdf:RDF>
  </xsl:template>
  <xsl:template match="src:author">
    <library:author>
      <library:Author>
        <xsl:attribute name="rdf:about">
          <xsl:value-of select="concat($baseAddress, 'author/', ., '.json')" />
        </xsl:attribute>
        <library:name>
          <xsl:value-of select="."/>
        </library:name>
      </library:Author>
    </library:author>
  </xsl:template>
</xsl:stylesheet>
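If you want to run this transform yourself, one way is with lxml in Python, which supports XSLT 1.0; a minimal sketch, assuming the stylesheet is saved as "Book.xslt" and the XML as "Book1.xml":

from lxml import etree

# Parse the source document and compile the stylesheet.
doc = etree.parse("Book1.xml")
transform = etree.XSLT(etree.parse("Book.xslt"))

# XSLT parameters such as baseAddress must be passed as quoted string parameters.
result = transform(doc, baseAddress=etree.XSLT.strparam("http://example.org/book/"))
result.write("Book1.rdf", pretty_print=True)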
We are not aiming to change much in the way of structure in the XML beyond what is forced on us by the nature of RDFXML. Instead the aim is to manufacture unique identifiers, specifically URIs, that represent the key domain concepts captured in that XML.
Following this example, our transformation creates URIs for the book and for each of the authors. We were lucky here: we were able to create these URIs from properties we found in the data. It will not always be so easy, but whatever tricks we find ourselves having to use, we should remember that assigning URIs to the concepts buried in our XML is the key to much of what follows. We don't actually care too much about the particular shape of the URIs, since we are only going to use them as identifiers; what we do care about is that we have a consistent and repeatable way of creating them. As we said, here we were lucky because we found a natural unique identifier in the data itself and decided to use the ISBN as part of the URI for our book. If this were a banking application we might have used an account number; for a postal address we could imagine the zip code being a component of the URI. Had we not been so lucky, we might have found ourselves executing some specific code to look up properties in a database. The point is that this URI is key, and deciding how to create our URIs is very much the primary activity.
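To make "consistent and repeatable" concrete, here is a sketch of the URI-minting rules the transform above encodes (the helper names are ours; note that values such as author names need percent-encoding to be legal in a URI):

from urllib.parse import quote

BASE = "http://example.org/book/"  # the baseAddress parameter of the transform

def book_uri(isbn: str) -> str:
    # The ISBN is the natural unique identifier we found in the data.
    return f"{BASE}{isbn}.json"

def author_uri(name: str) -> str:
    # "Jim Gray" -> "http://example.org/book/author/Jim%20Gray.json"
    return f"{BASE}author/{quote(name)}.json"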
We are just about done. Now that we have an RDFXML document, we can load it into an RDF graph object and then just save it right out again, only this time telling the system to save it as JSON.
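With rdflib in Python, for instance, that round trip is only a few lines (a sketch; the file name follows on from the lxml example above, and rdflib 6+ ships the JSON-LD serializer):

from rdflib import Graph

# Load the RDFXML produced by the transform into a graph...
g = Graph()
g.parse("Book1.rdf", format="xml")

# ...and save it right back out as JSON-LD. Without any hints the property
# names come out as full URIs; the context described next tidies that up.
print(g.serialize(format="json-ld"))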
The only additional step we might need is to give that final process some hints so we can precisely control the final shape the JSON takes. We do this by defining a JSON-LD context. The context will not be of interest to every consumer of this JSON, but it will be of use to some, so we will attach it to the JSON we create. There are various ways to do this, for example pointing at it with a URI, but perhaps the simplest place to start is to include it inline in the JSON under a special property named "@context". Here is the resulting JSON:
{
  "@id": "http://example.org/book/1-55860-190-2.json",
  "@type": "Book",
  "authors": [
    {
      "@id": "http://example.org/book/author/Jim%20Gray.json",
      "@type": "Author",
      "name": "Jim Gray"
    },
    {
      "@id": "http://example.org/book/author/Andreas%20Reuter.json",
      "@type": "Author",
      "name": "Andreas Reuter"
    }
  ],
  "isbn": "1-55860-190-2",
  "published": "1993",
  "publisher": "Morgan Kaufmann",
  "title": "Transaction Processing: Concepts and Techniques",
  "@context": {
    "@vocab": "http://schemas.example.org/library#",
    "authors": {
      "@id": "author",
      "@container": "@set"
    },
    "books": "@graph"
  }
}
Generally, code that deals with JSON is very good at ignoring what it doesn't understand, so we needn't be concerned about the various "@" properties. Looking past those, we have a JSON document where conceptual properties are properties in the JSON and collections are arrays in the JSON.
The "@" properties have captured extra information, in fact there is no information loss from the original XML, no accidental ambiguity. Looking at the @context itself we can see that its a mix of namespace declaration (the "@vocab"), aliasing (the word "books" is an alias for the JSON-LD keyword "@graph") and serialization hints ("authors" is declared as a collection so that even if there is a single property value it is still rendered as an array.)
The code we use to do this transformation is built entirely out of open-source components. (It should come as no surprise that once we decided to follow industry standards we got to take our pick of open-source components.) However, if you just want to give it a try, there is a service that does the work; the code for that service is in the TransformWebApplication repository.
The service is up and running in the cloud and you can go ahead and give it a try: just save the original XML into a file called "Book1.xml" and then use the following command line:
curl -X POST -T Book1.xml -o Book1.json http://transformwebapplication.azurewebsites.net/xml2json/808c53661b5efcca185d/Book.xslt/BookContext.json/Book
The various XSLT transforms are contained in the public Metadata Repository gist, and the service fetches the appropriate transform from there.
There is absolutely nothing special about this gist; in fact, you can create your own gist with your own transforms. You just need to specify GitHub's unique identifier for your gist in the request and the service will fetch the transform you ask for from there.
Some of the sample book XML is also contained in the Sample XML gist; this was just a handy place to store it. The service doesn't access this gist; it only accesses the Metadata Repository gist you specify in the request.