Discovery: Prior Art

This document lists projects and academic works that define the prior art for data discovery pertinent to the Linked Web Storage protocol. Data discovery is the process by which clients (such as apps) can find relevant information, even information created by other clients.

LWS at scale would be an ecosystem of billions of storage devices, each containing a small slice of a global knowledge graph. A valuable feature of LWS would be the ability to find relevant information within this knowledge graph. However, all the world's data cannot be loaded into a single client, so there must be a process by which a client can "discover" the slices of the graph (in Solid's terminology, "resources") that are pertinent to the application.

Link Following

"Link following" is one of the core features of the Semantic Web. If data already held by the client references a URI that has not yet been fetched, the client can be reasonably confident that dereferencing that URI will yield more data about it. Link following has a downside, however: dereferencing a subject URI does not guarantee that the client obtains all information about that subject, because other documents may also describe it.

Many link-following protocols depend on the client obtaining one or more starting URIs, which are dereferenced and then followed to discover additional links. Most often, this "starting URI" is the WebID of the user, obtained during the authentication process.
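The traversal described above can be sketched as a breadth-first walk. This is a minimal illustration, not a protocol implementation: the in-memory `web` map stands in for HTTP dereferencing, and the function name `followLinks`, the `maxDocuments` limit, and all example URIs are assumptions made for this sketch.

```javascript
// Hypothetical in-memory "web": each document URL maps to the triples it contains.
// A real client would dereference each URL over HTTP and parse the RDF response.
const web = new Map([
  ['https://alice.example/profile', [
    ['https://alice.example/profile#me', 'http://xmlns.com/foaf/0.1/knows', 'https://bob.example/profile#me'],
    ['https://alice.example/profile#me', 'http://www.w3.org/2000/01/rdf-schema#seeAlso', 'https://alice.example/notes'],
  ]],
  ['https://bob.example/profile', [
    ['https://bob.example/profile#me', 'http://xmlns.com/foaf/0.1/name', '"Bob"'],
  ]],
  ['https://alice.example/notes', []],
]);

// Strip the fragment: the document URL is what actually gets fetched.
const documentOf = (uri) => uri.split('#')[0];

function followLinks(startUri, maxDocuments = 10) {
  const visited = new Set();
  const queue = [documentOf(startUri)];
  const triples = [];
  while (queue.length > 0 && visited.size < maxDocuments) {
    const doc = queue.shift();
    if (visited.has(doc)) continue;
    visited.add(doc);
    for (const triple of web.get(doc) ?? []) {
      triples.push(triple);
      // Any http(s) URI in subject or object position is a candidate link.
      for (const term of [triple[0], triple[2]]) {
        if (term.startsWith('http')) queue.push(documentOf(term));
      }
    }
  }
  return { visited: [...visited], triples };
}

// Start from a (hypothetical) WebID, as a client would after authentication.
const result = followLinks('https://alice.example/profile#me');
console.log(result.visited);
```

The sketch only ever reaches documents that are linked, directly or indirectly, from the starting URI. A document that describes Alice but is never referenced from her profile stays invisible to this client.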

A few link-following discovery protocols have been proposed:

Solid Type Index

Publisher: Solid Community Group
Link: https://solid.github.io/type-indexes/

The Solid Type Index specifies that every WebID should contain links to both a "public type index" and a "private type index." Each index (example below) contains a list of "Type Registrations." Every registration has a "forClass" attribute that names a class of data, plus either an "instance" attribute that points to a single resource holding data of that class or an "instanceContainer" attribute that points to a container of such resources.

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix vcard: <http://www.w3.org/2006/vcard/ns#>.
@prefix bk: <http://www.w3.org/2002/01/bookmark#>.

<>
  a solid:TypeIndex ;
  a solid:ListedDocument.
                    
<#ab09fd> a solid:TypeRegistration;
  solid:forClass vcard:AddressBook;
  solid:instance </public/contacts/myPublicAddressBook.ttl>.
                    
<#bq1r5e> a solid:TypeRegistration;
  solid:forClass bk:Bookmark;
  solid:instanceContainer </public/myBookmarks/>.

When an application wants to find data of a particular class, it searches the type index for that class and retrieves all instances associated with it.

Every application must manually update the type index when data is added or removed.
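A minimal sketch of the lookup and of the manual bookkeeping described above. It assumes the index has already been fetched and parsed into plain objects; the helper names `findLocations` and `registerInstance`, the object layout, and the `/public/extra-bookmark.ttl` path are hypothetical and not part of the Type Index specification.

```javascript
const VCARD = 'http://www.w3.org/2006/vcard/ns#';
const BK = 'http://www.w3.org/2002/01/bookmark#';

// The two registrations from the example index, as plain objects (assumed layout).
const typeIndex = [
  { forClass: VCARD + 'AddressBook', instance: '/public/contacts/myPublicAddressBook.ttl' },
  { forClass: BK + 'Bookmark', instanceContainer: '/public/myBookmarks/' },
];

// Collect every resource and container registered for the requested class.
function findLocations(index, rdfClass) {
  const matches = index.filter((reg) => reg.forClass === rdfClass);
  return {
    instances: matches.flatMap((reg) => (reg.instance ? [reg.instance] : [])),
    containers: matches.flatMap((reg) => (reg.instanceContainer ? [reg.instanceContainer] : [])),
  };
}

// The application itself has to keep the index in sync when it writes new data.
function registerInstance(index, rdfClass, instance) {
  const exists = index.some((reg) => reg.forClass === rdfClass && reg.instance === instance);
  if (!exists) index.push({ forClass: rdfClass, instance });
}

const bookmarkLocations = findLocations(typeIndex, BK + 'Bookmark');
console.log(bookmarkLocations);

// Hypothetical new resource written by the application.
registerInstance(typeIndex, BK + 'Bookmark', '/public/extra-bookmark.ttl');
```

In a real deployment the second step would be a write to the type index document, and an application that forgets it leaves its data undiscoverable by other applications.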

Shape Trees

Publisher: Janeiro Digital
Link: https://shapetrees.org/TR/specification/

Similar to a type index, a "Shape Tree" maps a resource to a description of that resource. A Shape Tree maps a shape (ShEx or SHACL) to a container of resources that conform to that shape.

In this example, the container "https://storage.example/data/projects/" contains resources with the "ex:DataCollectionShape" shape as defined by the shape tree "ex:DataCollectionTree."

PREFIX st: <http://www.w3.org/ns/shapetrees#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://www.example/ns/ex#>

<>
  a st:Manager ;
  st:hasAssignment
    <#assignment1>,
    <#assignment2> .

<#assignment1>
  st:assigns ex:DataCollectionTree ;
  st:manages <https://storage.example/data/projects/> ;
  st:hasRootAssignment <https://storage.example/data/.shapetree#assignment1> ;
  st:focusNode <https://storage.example/data/projects/#collection> ;
  st:shape ex:DataCollectionShape .

Search

Search protocols offer more complex methods for discovering data. They are often less interested in finding the resources that potentially contain data and more interested in discovering data on a per-triple level.

The paper "Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web" places search interfaces on a spectrum, from generic requests with high client cost and low server cost to specific requests with low client cost and high server cost.

Comunica

Publisher: UGent IDLab
Link: https://comunica.dev/

While not itself a protocol for search and data discovery, Comunica is a protocol-agnostic client-side tool for aggregating search results across many different search protocols. It accepts a SPARQL query and a list of data sources. A data source can be as simple as a document URL or as complex as a full SPARQL endpoint.

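// Example of Comunica in action (ES module; assumes @comunica/query-sparql is installed)
import { QueryEngine } from '@comunica/query-sparql';

const myEngine = new QueryEngine();

// Run one SPARQL query across three different kinds of sources
const bindingsStream = await myEngine.queryBindings(`
  SELECT ?s ?p ?o WHERE {
    ?s ?p <http://dbpedia.org/resource/Belgium>.
    ?s ?p ?o
  } LIMIT 100`, {
  sources: [
    'http://fragments.dbpedia.org/2015/en',
    'https://www.rubensworks.net',
    'https://ruben.verborgh.org/profile/',
  ],
});

// Print each solution as it streams in
bindingsStream.on('data', (binding) => {
  console.log(binding.toString());
});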

SPARQL Endpoint

Publisher: W3C
Link: https://www.w3.org/TR/sparql11-query/

A SPARQL endpoint enables an entire server to function as a database with a robust query language. SPARQL endpoints have many well-tested implementations. While robust, they place the computational cost of answering queries on the data owner's server. They also do not enable global search, because the client must know the URI of the SPARQL endpoint ahead of time.

SELECT ?title
WHERE
{
  <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> ?title .
}
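Sending a query like the one above to an endpoint might look as follows. This is a sketch that assumes the GET binding of the SPARQL 1.1 Protocol, where the query travels in a `query` URL parameter; the endpoint URL `https://example.org/sparql` and the helper name `buildSparqlRequest` are hypothetical.

```javascript
// Build an HTTP request for a SPARQL query, assuming the SPARQL 1.1 Protocol GET binding.
function buildSparqlRequest(endpointUrl, query) {
  const url = new URL(endpointUrl);
  url.searchParams.set('query', query); // the query string is URL-encoded here
  return {
    url: url.toString(),
    options: {
      method: 'GET',
      // Ask for SELECT results in the standard JSON results format.
      headers: { Accept: 'application/sparql-results+json' },
    },
  };
}

const query = `SELECT ?title WHERE {
  <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> ?title .
}`;

// The endpoint URL is a placeholder: the client has to know it in advance.
const request = buildSparqlRequest('https://example.org/sparql', query);
console.log(request.url);
// A client would then call: fetch(request.url, request.options)
```

The first argument is the part that discovery cannot supply here: nothing in the protocol tells a client which endpoints exist.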

Linked Data Fragments

Publisher: UGent IDLab
Link: https://linkeddatafragments.org/

Linked Data Fragments were designed as an intermediary between a SPARQL endpoint and a complete data dump, balancing processing requirements between the server and the client. SPARQL queries can be translated into simpler "triple patterns" that are less intensive for the server to answer. The client makes multiple triple pattern requests and combines their results to assemble the answer to the SPARQL query.
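The division of labor can be sketched in memory. In this simplified illustration, `matchPattern` plays the role of the server, answering one triple pattern per request, and `evaluate` plays the role of the client, joining the results. The dataset, prefixed names, and function names are assumptions made for the sketch; a real Triple Pattern Fragments interface works over HTTP and adds paging and count metadata, which are omitted here, as is handling of a variable repeated within one pattern.

```javascript
// Hypothetical dataset held by the server.
const dataset = [
  ['ex:book1', 'dc:title', '"SPARQL Tutorial"'],
  ['ex:book1', 'dc:creator', 'ex:alice'],
  ['ex:book2', 'dc:title', '"Linked Data"'],
  ['ex:book2', 'dc:creator', 'ex:bob'],
];

const isVariable = (term) => term.startsWith('?');

let requestCount = 0; // counts simulated requests to the server

// "Server" side: answers a single triple pattern, which is cheap to evaluate.
function matchPattern(pattern) {
  requestCount += 1;
  return dataset.filter((triple) =>
    pattern.every((term, i) => isVariable(term) || term === triple[i]));
}

// "Client" side: substitute known bindings, request matches, extend the bindings, recurse.
function evaluate(patterns, binding = {}) {
  if (patterns.length === 0) return [binding];
  const [first, ...rest] = patterns;
  const bound = first.map((term) => (isVariable(term) && term in binding ? binding[term] : term));
  const results = [];
  for (const triple of matchPattern(bound)) {
    const next = { ...binding };
    bound.forEach((term, i) => { if (isVariable(term)) next[term] = triple[i]; });
    results.push(...evaluate(rest, next));
  }
  return results;
}

// Equivalent of: SELECT ?book ?title WHERE { ?book dc:creator ex:alice . ?book dc:title ?title }
const titlesByAlice = evaluate([
  ['?book', 'dc:creator', 'ex:alice'],
  ['?book', 'dc:title', '?title'],
]);
console.log(titlesByAlice, requestCount);
```

The join logic lives entirely in `evaluate`, so the server never has to evaluate more than one pattern at a time, at the price of one request per pattern per intermediate binding.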

ESPRESSO

Publisher: Espresso Project
Link: https://link.springer.com/article/10.1007/s41019-024-00263-w

ESPRESSO is a framework that aims to perform global search. Its goal is to find the location of data anywhere in the world, even if that data cannot be reached through link following. It achieves this in a federated manner. First, an application constructs a query and hands it to a GaianDB node. GaianDB is a peer-to-peer network, and the request propagates to every peer in the network, each of which calls a search app (Coffee Filter) deployed with every existing Solid pod. Coffee Filter uses metadata from the pod provider to retrieve the file addresses of resources relevant to the query. Currently, ESPRESSO employs a technique called "query flooding," where every pod in existence is queried for every request. However, the team references future techniques that could reduce the workload of a query.
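Query flooding in general can be sketched as follows. This is not ESPRESSO's, GaianDB's, or Coffee Filter's actual API: the `network` layout, the keyword lists standing in for pod metadata, and the function name `floodQuery` are all assumptions chosen to illustrate the propagation pattern.

```javascript
// Hypothetical overlay network: each node lists its peers and the pods it serves.
// The keyword arrays stand in for whatever metadata a pod-side search app would consult.
const network = {
  nodeA: { peers: ['nodeB', 'nodeC'], pods: { 'https://pod1.example/': ['recipe', 'travel'] } },
  nodeB: { peers: ['nodeA', 'nodeC'], pods: { 'https://pod2.example/': ['travel'] } },
  nodeC: { peers: ['nodeA', 'nodeB'], pods: { 'https://pod3.example/': ['music'] } },
};

function floodQuery(startNode, keyword) {
  const visited = new Set(); // prevents a node from handling the same query twice
  const frontier = [startNode];
  const hits = [];
  let podsQueried = 0;
  while (frontier.length > 0) {
    const nodeId = frontier.pop();
    if (visited.has(nodeId)) continue;
    visited.add(nodeId);
    const node = network[nodeId];
    for (const [podUrl, keywords] of Object.entries(node.pods)) {
      podsQueried += 1; // every pod is asked, whether or not it holds relevant data
      if (keywords.includes(keyword)) hits.push(podUrl);
    }
    frontier.push(...node.peers); // forward the query to all neighbours
  }
  return { hits: hits.sort(), podsQueried };
}

const flood = floodQuery('nodeA', 'travel');
console.log(flood);
```

Only two of the three pods hold matching data, yet all three are queried. That cost grows with the number of pods in the network rather than with the number of relevant results.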

Searcher-Centric Search

Publisher: Jackson Morgan
Link: https://docs.google.com/presentation/d/1KXOGgT4NtHqlvuI3lZ8vKb6rQut2u6mD/edit?usp=sharing&ouid=114176645321964550648&rtpof=true&sd=true

Searcher-Centric Search is another proposal for global query. It relies on pre-caching data in either a Private Index or a Public Index, depending on the access control rules attributed to a resource. The goal of Searcher-Centric Search is to reduce the number of network requests required by increasing the amount of duplicated and co-located data, thereby prioritizing query time over storage efficiency.
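The trade-off can be illustrated with a small sketch. The proposal's actual data model is not reproduced here: the index layout, the per-searcher keying of private indexes, the `readers` field, and the function names `indexResource` and `search` are assumptions made only to show how pre-caching shifts work from query time to write time.

```javascript
// Hypothetical index structures; the proposal itself may organize these differently.
const publicIndex = [];           // entries any searcher may query
const privateIndexes = new Map(); // searcher WebID -> entries only that searcher may query

// At write time, copy an entry into the index that matches the resource's access rules.
function indexResource(resource) {
  const entry = { url: resource.url, keywords: resource.keywords };
  if (resource.readers === 'public') {
    publicIndex.push(entry);
    return;
  }
  for (const webId of resource.readers) {
    if (!privateIndexes.has(webId)) privateIndexes.set(webId, []);
    privateIndexes.get(webId).push(entry); // duplicated once per authorized reader
  }
}

// At query time, the searcher consults only the public index and their own private index.
function search(webId, keyword) {
  const candidates = [...publicIndex, ...(privateIndexes.get(webId) ?? [])];
  return candidates.filter((entry) => entry.keywords.includes(keyword)).map((entry) => entry.url);
}

const alice = 'https://alice.example/profile#me';
const bob = 'https://bob.example/profile#me';
const carol = 'https://carol.example/profile#me';

indexResource({ url: 'https://pod1.example/trips/rome', keywords: ['travel'], readers: 'public' });
indexResource({ url: 'https://pod2.example/trips/oslo', keywords: ['travel'], readers: [alice, bob] });

console.log(search(alice, 'travel'));
console.log(search(carol, 'travel'));
```

The private entry is stored twice, once for each authorized reader, which is the storage cost the approach accepts in exchange for answering a search without contacting the pods that hold the data.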
