Merging

johnataylor edited this page Feb 18, 2016 · 9 revisions

One of the characteristics of RDF datasets is that they can be trivially merged. This comes as a consequence of the effort made in identifying and labeling the key concepts and entities in our domain with unique identifiers. The automatic merging of data is even more effective when we have been consistent in the property vocabulary we have adopted.

When it comes to JSON documents, "merge" does not mean simply concatenating one document with the next. It means aligning entities in one document with entities in another, so that duplicate, redundant facts about an entity are removed while distinct and valuable information is consolidated. Specifically, we expect additional values of an array-valued property to extend that array, and we expect this at every level of the hierarchy.

In our Book examples, imagine we had some facts about the same Book contained in separate JSON files. For example, one file lists one of the authors of a particular book and another file lists another two of the authors: probably it was just a historical accident in data entry many years ago. The good news is we have correctly (that is, unambiguously) identified the book in question in both JSON documents. What we want is a single JSON document corresponding to the particular book but with an array of authors that includes all three authors we know about. Merging the documents means merging the contents of the authors array; it doesn't mean concatenating one JSON document with the next.
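To make the distinction concrete, here is a small Python illustration. The identifiers, property names and author values are made up for this example and are not taken from the original data. Note that this hand-written version knows about the authors property; it only shows the desired outcome, not a generic way of getting there.

```python
# Two hypothetical fragments describing the same book (same "@id").
doc_a = {"@id": "http://example.org/book/1", "authors": ["Author A"]}
doc_b = {"@id": "http://example.org/book/1", "authors": ["Author B", "Author C"]}

# Concatenation: we still have two separate records about one book.
concatenated = [doc_a, doc_b]


def merge_authors(a, b):
    """Combine two records about the same book into one (illustrative only)."""
    if a["@id"] != b["@id"]:
        raise ValueError("records describe different entities")
    authors = list(a["authors"])
    for author in b["authors"]:
        if author not in authors:  # drop duplicate facts
            authors.append(author)
    return {"@id": a["@id"], "authors": authors}


# Merge: one record, with the authors array extended.
merged = merge_authors(doc_a, doc_b)
```

After the merge there is a single record for the book whose authors array holds all three values, whereas the concatenation still contains two partial records.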

The implementation is simple and generic, and doesn't even require any additional metadata. All we need to do is convert the compact form of our JSON-LD into a graph, then combine the graphs by asserting all the triples from each into a new graph. Because of the semantics of our graph assert operation, duplicate information is automatically eliminated. All that is left to do is serialize the combined graph as a single compact JSON-LD document.
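The process just described can be sketched in a few lines of Python. This is a toy illustration, not the code behind the service: it assumes every node carries an "@id", ignores the JSON-LD context and framing entirely, and assumes the graph is acyclic. The function names `to_triples`, `merge_graphs` and `from_triples` and the example data are hypothetical. A real implementation would use a JSON-LD processor for the conversion steps.

```python
from collections import defaultdict


def to_triples(node):
    """Flatten a nested node into a set of (subject, property, object) triples.

    Assumption: every node has an "@id"; nested dicts are linked nodes.
    """
    triples = set()
    subject = node["@id"]
    for prop, value in node.items():
        if prop == "@id":
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, dict):
                triples.add((subject, prop, ("id", v["@id"])))
                triples |= to_triples(v)
            else:
                triples.add((subject, prop, ("lit", v)))
    return triples


def merge_graphs(*graphs):
    """Assert every triple into one new graph; a set drops duplicates."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def from_triples(triples, root_id):
    """Rebuild a nested document rooted at root_id (assumes no cycles)."""
    grouped = defaultdict(list)
    for s, p, o in sorted(triples, key=repr):  # sorted for stable output
        if s == root_id:
            grouped[p].append(o)
    node = {"@id": root_id}
    for p, objs in grouped.items():
        vals = [from_triples(triples, o[1]) if o[0] == "id" else o[1] for o in objs]
        node[p] = vals if len(vals) > 1 else vals[0]
    return node


book = "http://example.org/book/1"
doc_a = {"@id": book, "title": "Example Book",
         "authors": [{"@id": "http://example.org/person/a", "name": "Author A"}]}
doc_b = {"@id": book, "title": "Example Book",
         "authors": [{"@id": "http://example.org/person/b", "name": "Author B"},
                     {"@id": "http://example.org/person/c", "name": "Author C"}]}

merged = from_triples(merge_graphs(to_triples(doc_a), to_triples(doc_b)), book)
```

Nothing in this sketch mentions books or authors. The title, stated in both documents, survives as a single value because the identical triple is only held once, while the authors property ends up with three linked nodes.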

Here is a web service that does just that:

curl -X POST -F "[email protected]" -F "[email protected]" http://transformwebapplication.azurewebsites.net/merge

And if you want some test data you can grab some files from here. (Just remember to translate them into JSON-LD first using the translation service described under the XML 2 JSON section.)

https://gist.github.com/johnataylor/7e84a230b6dbb1755236

The interesting thing here is that this service is completely generic: you can send it any set of JSON-LD documents you like and it will combine them and return the result. It takes the context and framing type from the first document in the post. (It might be nice to make them optional parameters on the request.)

The ability to merge documents in this way might seem somewhat esoteric. However, it illustrates well the type of data transformation that we get for free, having done the hard work earlier of identifying and labeling with URIs all the important entities in our data. And the ability to merge data with no additional scenario-specific logic will be an important behavior when we start to model the actual business rule execution and workflows directly in our data.