Querying


We have spent a lot of time crafting our data transformations so that the resulting JSON forms a natural, easy-to-use object model for the JavaScript programmer. However, binding our data to imperative code is not the only game in town; in fact, for many of the most important aspects of our system, imperative code would simply introduce too much coupling. We also want to provide features to our users, who are not necessarily professional programmers, that allow them to customize the behavior of the software, but at a very high level.

When we tug on this thread a little, it turns out that both our desire for more maintainable engineering within our implementation and our requirement to build advanced customization features for our users depend on the ability to query the application data.

A large part of the motivation for normalizing all our data into JSON-LD is that it makes the data easy to query. However, our queries are not going to be expressed in terms of the regular JSON structures we have seen so far; instead, we are going to look at our JSON through the lens of an RDF data set.

Consider, for example, our book example:

{
  "@id": "http://example.org/book/1-55860-190-2.json",
  "@type": "Book",
  "authors": [
    {
      "@id": "http://example.org/book/author/Jim%20Gray.json",
      "@type": "Author",
      "name": "Jim Gray"
    },
    {
      "@id": "http://example.org/book/author/Andreas%20Reuter.json",
      "@type": "Author",
      "name": "Andreas Reuter"
    }
  ],
  "isbn": "1-55860-190-2",
  "published": "1993",
  "publisher": "Morgan Kaufmann",
  "title": "Transaction Processing: Concepts and Techniques",
  "@context": {
    "@vocab": "http://schemas.example.org/library#",
    "authors": {
      "@id": "author",
      "@container": "@set"
    },
    "books": "@graph"
  }
}

Viewing this as an RDF data set means seeing it as a series of statements, or assertions. Each statement consists, rather like a regular English sentence, of a subject, a predicate and an object. This view of the data depends on the subjects and objects having explicit identities. And, just as in natural language, the object of one statement might well be the subject of another.
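
This conversion does not have to be done by hand. As a minimal sketch, assuming the jsonld.js library ("jsonld" on npm) and its callback-style toRDF API, we could ask for the statements in N-Quads form, one per line:

var jsonld = require('jsonld');

// The book document shown above.
var book = {
  '@id': 'http://example.org/book/1-55860-190-2.json',
  '@type': 'Book'
  /* ...remaining properties and @context as above... */
};

// toRDF converts the JSON-LD document into an RDF data set; the N-Quads
// format serializes it as one subject-predicate-object statement per line.
jsonld.toRDF(book, { format: 'application/nquads' }, function (err, nquads) {
  if (err) throw err;
  console.log(nquads);
});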

Here, then, are the statements in this JSON (shortening the @id URIs so they fit on the page):

| Subject | Predicate | Object |
| --- | --- | --- |
| .../book/1-55860-190-2.json | @type | Book |
| .../book/1-55860-190-2.json | author | .../book/author/Jim%20Gray.json |
| .../book/1-55860-190-2.json | author | .../book/author/Andreas%20Reuter.json |
| .../book/1-55860-190-2.json | title | "Transaction Processing: Concepts and Techniques" |
| .../book/1-55860-190-2.json | isbn | "1-55860-190-2" |
| .../book/1-55860-190-2.json | published | "1993" |
| .../book/1-55860-190-2.json | publisher | "Morgan Kaufmann" |
| .../book/author/Jim%20Gray.json | @type | Author |
| .../book/author/Jim%20Gray.json | name | "Jim Gray" |
| .../book/author/Andreas%20Reuter.json | @type | Author |
| .../book/author/Andreas%20Reuter.json | name | "Andreas Reuter" |

Seeing the data in this tabular form, it's not too hard to imagine how we can now query it. The fundamental process is one of matching expressions against the data, where each term in an expression is either a fixed value that must match or a variable that can take on any value. We will indicate variables with a leading "?" character. For example, the following expression:

?book author ?author .

will match the following:

| Subject | Predicate | Object |
| --- | --- | --- |
| .../book/1-55860-190-2.json | author | .../book/author/Jim%20Gray.json |
| .../book/1-55860-190-2.json | author | .../book/author/Andreas%20Reuter.json |

Or the expression:

?book @type Book .

will list all the books we have in our library (assuming, of course, that we have Merged all our book data sets into one big data set).

More advanced queries might combine several expressions, where using a variable of the same name across the set of expressions represents a join. For example:

?book title "Transaction Processing: Concepts and Techniques" .
?book author ?author .
?author name ?name .

would give us the names of the authors of the book titled "Transaction Processing: Concepts and Techniques".

As it turns out, implementing this pattern-matching query logic based on variable binding is relatively straightforward: significantly easier than implementing a SQL query engine, and perhaps even simpler than implementing an XPath query engine. That is a little ironic, given that what we have here is a graph query capability, whereas SQL and XPath are limited to a relational model and a tree structure respectively.

There is an excellent description of the motivation with code examples in Python in the book "Programming the Semantic Web" by Toby Segaran, Colin Evans and Jamie Taylor.

TODO: refer to implementations similar to the Python code but in JavaScript and C#
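
In the meantime, here is a minimal sketch of the same idea in JavaScript, similar in spirit to the Python version in that book. The triple representation, the sample data and the matchPattern/query functions are illustrative choices of ours rather than any particular library's API, and the identifiers are shortened as in the tables above. A pattern term beginning with "?" is a variable; matching a pattern against a statement either extends the current set of variable bindings or fails, and reusing a variable name across patterns gives us the join:

// Triples are represented as [subject, predicate, object] arrays.
var graph = [
  ['book1', '@type', 'Book'],
  ['book1', 'title', 'Transaction Processing: Concepts and Techniques'],
  ['book1', 'author', 'author1'],
  ['book1', 'author', 'author2'],
  ['author1', '@type', 'Author'],
  ['author1', 'name', 'Jim Gray'],
  ['author2', '@type', 'Author'],
  ['author2', 'name', 'Andreas Reuter']
];

function isVariable(term) {
  return typeof term === 'string' && term.charAt(0) === '?';
}

// Try to match one pattern against one triple, given the bindings so far.
// Returns the extended bindings on success, or null on failure.
function matchPattern(pattern, triple, bindings) {
  var extended = {};
  for (var key in bindings) extended[key] = bindings[key];
  for (var i = 0; i < 3; i++) {
    var term = pattern[i];
    if (isVariable(term)) {
      if (term in extended) {
        // Variable already bound: it must agree with this triple.
        if (extended[term] !== triple[i]) return null;
      } else {
        extended[term] = triple[i]; // bind the variable
      }
    } else if (term !== triple[i]) {
      return null; // fixed term does not match
    }
  }
  return extended;
}

// Match a list of patterns against the graph: each pattern extends the
// candidate binding sets, and shared variable names act as joins.
function query(graph, patterns) {
  var bindingsList = [{}];
  patterns.forEach(function (pattern) {
    var next = [];
    bindingsList.forEach(function (bindings) {
      graph.forEach(function (triple) {
        var extended = matchPattern(pattern, triple, bindings);
        if (extended !== null) next.push(extended);
      });
    });
    bindingsList = next;
  });
  return bindingsList;
}

// The join example from above: names of the authors of the given title.
var results = query(graph, [
  ['?book', 'title', 'Transaction Processing: Concepts and Techniques'],
  ['?book', 'author', '?author'],
  ['?author', 'name', '?name']
]);

results.forEach(function (bindings) {
  console.log(bindings['?name']); // "Jim Gray", then "Andreas Reuter"
});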

SPARQL is a standard query language that works against sets of triples. The basic query capability of SPARQL follows exactly the mechanism outlined above. In addition to matching graph patterns, SPARQL includes support for filtering by expressions. SPARQL can return a tabular result set in a manner very similar to SQL (it even attempts to mimic the syntax), but perhaps more usefully it can also be told to construct another graph as the result.

Repeating the join example above in SPARQL syntax, we have:

SELECT ?name
WHERE
{
  ?book title "Transaction Processing: Concepts and Techniques" .
  ?book author ?author .
  ?author name ?name .
}
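
One caveat: for readability these examples use bare predicate names such as title and author. In standard SPARQL each predicate is an IRI, normally abbreviated with a PREFIX declaration. Assuming the @vocab from our @context above, the same query against a real endpoint would look more like this:

PREFIX lib: <http://schemas.example.org/library#>

SELECT ?name
WHERE
{
  ?book lib:title "Transaction Processing: Concepts and Techniques" .
  ?book lib:author ?author .
  ?author lib:name ?name .
}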

SPARQL is a full-featured query language, including much that will be familiar to a SQL programmer, for example DISTINCT, GROUP BY and so on. SPARQL also supports sub-selects, although here the SQL programmer needs to be very careful because the SPARQL semantics are very different: SPARQL queries are not set operations, they are pattern matches.

As a general rule it's a good idea to keep SPARQL queries small. Large SPARQL queries can get unwieldy and hard to tune. One of the nice things about SPARQL is that it's relatively easy and fun to write (at least once you get past all those question marks!). However, it's also very easy to write very inefficient SPARQL. SPARQL includes a few nice shorthands for common situations: one of these is property paths, and another is OPTIONAL clauses. A property path allows you to write patterns that follow a path across multiple nodes without having to declare a bunch of binding variables. For example, the query above could have been written like this:

SELECT ?name
WHERE
{
  ?book title "Transaction Processing: Concepts and Techniques" .
  ?book author/name ?name .
}

OPTIONAL is another handy feature: it allows you to include some data if it's there, and otherwise just return a result with the variable unbound. For example, imagine we optionally had title properties on some of our authors. A query that included the title might look like this:

SELECT ?title ?name
WHERE
{
  ?book title "Transaction Processing: Concepts and Techniques" .
  ?book author/name ?name .
  OPTIONAL { ?book author/title ?title . }
}

The problem is that this query is actually starting to get quite complex. The property path hid an extra join, and the OPTIONAL essentially hid an extra pattern, with the result being a UNION of the two patterns. In this case we were fine; however, in a realistic data set we might have had an extra couple of levels of nesting (so longer paths) and OPTIONAL clauses wrapping 10 or 20 properties. At that point the work we are actually asking the machine to do becomes quite significant.

So although it's very easy to write large SPARQL statements that start to resemble large SQL stored procedures, they very rarely work as well.

TODO: how to keep things simple: punt the problem out of query and into schema by inferring the relationships your queries are looking for
