
Welcome to the SoCaTel GraphQL API wiki!

Quick introduction

The SoCaTel project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under the Grant Agreement Nº 769975. More details at https://www.socatel.eu/.

A Knowledge Base has been constructed with mechanisms to automatically extract data from online data sources, run semantic annotation on them and make the resulting linked data available for consumption.

The Data Acquisition Layer consists of multiple handlers that consume data from the following online data sources:

1. Social Media
    a. Twitter
    b. Facebook
2. Open Data
    a. Statistical and Research Data
    b. Governmental Data
3. Linked Open Data

The semantic pre-processing layer maps the structure of the raw datasets onto the SoCaTel conceptual data model. Data is converted into the Resource Description Framework (RDF) via an Extraction, Transformation, Load (ETL) process and stored within the GraphDB triplestore.

This GraphQL API allows software modules to interface with the SoCaTel semantic repository (the GraphDB triplestore), abstracting away the complexity of the underlying native SPARQL query language. GraphQL does not talk directly to GraphDB: so-called resolvers convert every GraphQL query into a SPARQL query.
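
To give an idea, here is a minimal sketch of such a resolver with graphql-java (the repository name and method are hypothetical; the module's actual data fetchers are shown later on this page):

// A resolver (DataFetcher) reads the GraphQL arguments and delegates to a
// repository that issues the equivalent SPARQL query against GraphDB.
public DataFetcher<Post> getPostByIdDataFetcher() {
    return dataFetchingEnvironment -> {
        String identifier = dataFetchingEnvironment.getArgument("identifier");
        return postRepository.getPostById(identifier); // hypothetical repository call
    };
}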

Query definitions

Development in GraphQL is schema-first. A schema defines what queries can be performed against a GraphQL server, what each query takes as input, and what each query returns as output. The schema does not explain how the server fetches the result; it describes what the result should look like.

The graphql-java library offers two different ways of defining the schema:

  • Programmatically, as Java code
  • Via a special GraphQL DSL file, using the Schema Definition Language (SDL)

We have chosen the second option, because it allows defining the GraphQL schema in a language-agnostic way, and it is also the notation used in the GraphQL specification.

A basic schema using the SDL:

type Post {
    identifier: ID!
    description: String!
    language: String!
    numLikes: Int!
    numReplies: Int!
}

GraphQL provides a small set of predefined scalar types: Boolean, ID, Int, Float, and String. But we can define our own custom scalar types as well.

We have defined a custom scalar type representing dates and times. Here's how we can declare and use a custom scalar type called DateTime instead of just a simple string:

scalar DateTime

type Post {
   …
   creationDate: DateTime!
}

Defining a custom scalar type in the schema is not enough. We also need to tell the GraphQL engine how to convert values of that type to and from the internal representation used in our code when writing a response or reading a request. This is registered when building the runtime wiring, here via the ExtendedScalars class of the graphql-java-extended-scalars library:

@Component
public class GraphQLProvider {
   ....
   private RuntimeWiring buildWiring() {
       return RuntimeWiring.newRuntimeWiring()
               .type(newTypeWiring("Query")
                       .dataFetcher("postById", postDataFetchers.getPostByIdDataFetcher())
                       .dataFetcher("posts", postDataFetchers.getPostsDataFetcher())
                       .dataFetcher("searchByTopics", genericDataFetchers.searchByTopicsDataFetcher()))
               .scalar(ExtendedScalars.Date)
               .scalar(ExtendedScalars.DateTime)
               .build();
   }
   ....
}
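
For reference, the Query type that this wiring implies could be declared in the SDL along these lines (the field names come from the data fetchers registered above; the exact argument and return types are assumptions):

type Query {
    posts: [Post]
    postById(identifier: ID!): Post
    searchByTopics(topics: [String!]!): [Post]
}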

Queries are used by the client to request the data it needs from the server. Unlike REST APIs, where each endpoint returns a clearly defined structure of information, GraphQL exposes only one endpoint and lets the client decide which data it really needs within the predefined schema. Here's an example:

{
    posts {
        identifier
        description
    }
}

The posts field in our query above is called the root field of the query. Anything that comes after the root field is known as the payload. Since we only added identifier and description in the payload, the query will return a list of all the posts, like this:

{
   "data": {
       "posts": [
           {
               "identifier": "1101908829393113088"
           },
           {
               "identifier": "1101908835365736449"
           },
           {
               "identifier": "1101289820066848768"
           },
        …
      ]
    }
}

This capability is described by GraphQL as "ask just for what you need and you'll get exactly that". It gives us much more flexibility over the data we receive from the server, which is immensely valuable for many reasons. We can also define an argument in our schema to get a single post by its id or topic; here the id is passed as an argument, like this:

{
    postById(identifier: "1101908829393113088") {
        identifier
        description
    }
}

As expected, this query returns the post whose id matches our argument:

{
   "data": {
       "postById": {
           "identifier": "1101908829393113088",
           "description": "Have you visited my website? Check out my content at https://t.co/0qcS7oH4bY #blog #blogging #blogger #website #writing #writer #author #books #bookworm #reader https://t.co/JGjQACE6j7"
       }
   }
}

SPARQL query implementation

In the above example, the corresponding query is delegated to the following SPARQL implementation:

PREFIX socatel: <http://www.everis.es/SoCaTel/ontology#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?post ?identifier ?description ?creationDate ?language ?num_likes ?num_replies ?location_name ?location_alternateName ?location_countryCode ?owner_identifier ?owner_title ?owner_description ?owner_webLink ?owner_language ?owner_num_likes ?creator_name ?creator_username ( GROUP_CONCAT( DISTINCT ?prefLabel ; SEPARATOR = "," ) AS ?topics )
WHERE { ?post a socatel:Post ;
	socatel:identifier ?identifier ;
	socatel:description ?description ;
	socatel:creationDate ?creationDate ;
	socatel:language ?language ;
	socatel:num_likes ?num_likes ;
	sioc:num_replies ?num_replies .
OPTIONAL { ?post socatel:location ?location .
      ?location gn:name ?location_name ;
      gn:alternateName ?location_alternateName ;
      gn:countryCode ?location_countryCode . }
OPTIONAL { ?post sioc:has_owner ?owner .
      ?owner socatel:identifier ?owner_identifier ;
	socatel:title ?owner_title ;
	socatel:description ?owner_description ;
	socatel:webLink ?owner_webLink ;
	socatel:language ?owner_language ;
	socatel:num_likes ?owner_num_likes . }
OPTIONAL { ?post socatel:createdBy ?creator .
      ?creator foaf:name ?creator_name ;
	foaf:username ?creator_username . }
OPTIONAL { ?post socatel:topic ?topic .
      ?topic skos:prefLabel ?prefLabel . } 
}
GROUP BY ?post ?identifier ?description ?creationDate ?language ?num_likes ?num_replies ?location_name ?location_alternateName ?location_countryCode ?owner_identifier ?owner_title ?owner_description ?owner_webLink ?owner_language ?owner_num_likes ?creator_name ?creator_username
LIMIT 5
OFFSET 0

The SPARQL queries are prepared, bound and executed thanks to the rich and powerful RDF4J library:

implementation 'org.eclipse.rdf4j:rdf4j-repository-http:2.5.3'
implementation 'org.eclipse.rdf4j:rdf4j-queryresultio:2.5.3'
implementation 'org.eclipse.rdf4j:rdf4j-sparqlbuilder:2.5.3'
implementation 'org.eclipse.rdf4j:rdf4j-queryresultio-sparqlxml:2.5.3'

One of the interesting points of this library is its SPARQL builder, which allows constructing most of the needed queries (it does have some limitations, e.g. property paths are not supported) with a fluent and partially type-safe API.
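
As a minimal, self-contained sketch of this builder (the ontology IRI is the one used in the query above; everything else is illustrative):

import static org.eclipse.rdf4j.sparqlbuilder.core.SparqlBuilder.prefix;
import static org.eclipse.rdf4j.sparqlbuilder.core.SparqlBuilder.var;

import org.eclipse.rdf4j.sparqlbuilder.core.Prefix;
import org.eclipse.rdf4j.sparqlbuilder.core.Variable;
import org.eclipse.rdf4j.sparqlbuilder.core.query.Queries;
import org.eclipse.rdf4j.sparqlbuilder.core.query.SelectQuery;
import org.eclipse.rdf4j.sparqlbuilder.rdf.Rdf;

public class SparqlBuilderSketch {
    public static void main(String[] args) {
        Prefix socatel = prefix("socatel", Rdf.iri("http://www.everis.es/SoCaTel/ontology#"));
        Variable post = var("post");
        Variable identifier = var("identifier");

        // Builds: SELECT ?identifier WHERE { ?post a socatel:Post ;
        //                                          socatel:identifier ?identifier . } LIMIT 5
        SelectQuery selectQuery = Queries.SELECT(identifier)
                .prefix(socatel)
                .where(post.isA(socatel.iri("Post"))
                        .andHas(socatel.iri("identifier"), identifier))
                .limit(5);

        System.out.println(selectQuery.getQueryString());
    }
}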

Following our previous example, here is part of the code used to build the SPARQL query:

public ArrayList<Post> getPosts(LocalDate creationDateFrom, LocalDate creationDateTo, String screenName, Integer offset, Integer limit) {

    ArrayList<Post> postList = new ArrayList<>();

    Variable post = var("post");

    GraphPattern postGraphPattern = buildPostGraphPattern(post, Optional.empty());

    // Build the FILTER expressions from the optional query arguments
    List<Expression> expressions = new ArrayList<>();
    if (creationDateFrom != null) {
        // Note: "yyyy" (calendar year) must be used here, not "YYYY" (week-based year)
        expressions.add(Expressions.gte(var("creationDate"),
                literalOfType(creationDateFrom.format(DateTimeFormatter.ofPattern("yyyy-MM-dd")), XSD.iri("dateTime"))));
    }

    if (creationDateTo != null) {
        expressions.add(Expressions.lte(var("creationDate"),
                literalOfType(creationDateTo.format(DateTimeFormatter.ofPattern("yyyy-MM-dd")), XSD.iri("dateTime"))));
    }

    if (screenName != null) {
        expressions.add(Expressions.equals(var("creator_username"),
                literalOf(screenName)));
    }

    if (!expressions.isEmpty()) {
        postGraphPattern = postGraphPattern.filter(Expressions.and(expressions.toArray(new Expression[expressions.size()])));
    }

    SelectQuery selectQuery = buildPostSelectQuery(this.projectables)
            .where(postGraphPattern)
            .where(buildLocationGraphPattern(post))
            .where(buildOwnerGraphPattern(post))
            .where(buildCreatorGraphPattern(post))
            .where(buildTopicGraphPattern(post))
            .groupBy(this.projectables.toArray(new Groupable[projectables.size()]))
            .offset(offset)
            .limit(limit);

    LOGGER.debug("Issuing SPARQL query :\n{}", selectQuery.getQueryString());
    try {
        TupleQuery tupleQuery = repositoryConnection.prepareTupleQuery(QueryLanguage.SPARQL, selectQuery.getQueryString());
        PostTupleQueryResultHandler postTupleQueryResultHandler = new PostTupleQueryResultHandler(repositoryConnection);

        // Each result row is mapped to a Post by the handler (see below)
        tupleQuery.evaluate(postTupleQueryResultHandler);

        postList.addAll(postTupleQueryResultHandler.getPostList());

        postTupleQueryResultHandler.endQueryResult();
    } catch (RepositoryException repositoryException) {
        LOGGER.error("An exception occurred on graphdb repository request {}", repositoryException.getMessage());
    } catch (MalformedQueryException malformedQueryException) {
        LOGGER.error("Something wrong in query {}", malformedQueryException.getMessage());
    }

    return postList;
}

Finally, a query result handler allows us to fine-tune, when necessary, how the query results are handled and transformed before being returned to the caller, e.g. for the getPosts query:

@Override
public void handleSolution(BindingSet bindingSet) throws TupleQueryResultHandlerException {
    Post post = new Post();
    Location location = new Location();
    Owner owner = new Owner();
    Creator creator = new Creator();

    // Dispatch each binding to the right domain object, based on the variable name prefix
    bindingSet.getBindingNames().forEach(bindingName -> {
        if (bindingName.startsWith("location_")) {
            location.mapper(bindingName, bindingSet.getValue(bindingName).stringValue());
        } else if (bindingName.startsWith("owner_")) {
            owner.mapper(bindingName, bindingSet.getValue(bindingName).stringValue());
        } else if (bindingName.startsWith("creator_")) {
            creator.mapper(bindingName, bindingSet.getValue(bindingName).stringValue());
        } else {
            post.mapper(bindingName, bindingSet.getValue(bindingName).stringValue());
        }
    });

    post.setLocation(location);
    post.setOwner(owner);
    post.setCreator(creator);

    postList.add(post);
}

Security

Security support is provided by the Spring Security library. The module is protected by HTTP Basic Authentication, which is activated this way:

@Configuration
public class SecurityConfig extends WebSecurityConfigurerAdapter {

    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http.httpBasic().and()
                .requestMatcher(EndpointRequest.to(Health.class)).authorizeRequests().anyRequest().permitAll().and()
                .antMatcher("/graphql").authorizeRequests().anyRequest().fullyAuthenticated().and()
                .csrf().disable();
    }

    // other code ...
}
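
The omitted code typically includes the declaration of the authorized users. As a purely illustrative sketch (hypothetical, not the project's actual user setup), an in-memory user could be declared like this:

@Override
protected void configure(AuthenticationManagerBuilder auth) throws Exception {
    // Hypothetical in-memory user; a real deployment would externalize credentials
    auth.inMemoryAuthentication()
            .withUser("graphql-client")
            .password(passwordEncoder().encode("changeit"))
            .roles("USER");
}

@Bean
public PasswordEncoder passwordEncoder() {
    return new BCryptPasswordEncoder();
}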

In summary:

  • The health endpoint, which is called by the monitoring application, is open to everyone
  • The GraphQL endpoint is only allowed for authenticated users
  • CSRF is disabled, as POST requests to the GraphQL endpoint are made by authenticated users only
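
To illustrate how a client consumes the protected endpoint, here is a minimal sketch using the JDK 11 HTTP client (the credentials and host are placeholders to adapt to the actual deployment):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class GraphQLClientSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials: replace with a user configured on the server
        String credentials = Base64.getEncoder().encodeToString("user:password".getBytes());

        // A GraphQL query is POSTed as a JSON document with a "query" field
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/graphql"))
                .header("Authorization", "Basic " + credentials)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"query\": \"{ posts { identifier description } }\"}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}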

Unit tests

To guarantee the quality of the GraphQL module, some unit tests have been set up. They are based on the test support natively offered by the Spring Framework, which itself makes use of other well-known testing libraries and utilities.

The tests currently focus mainly on the resolvers, as they contain most of the business logic implemented in the module and are thus the most prone to regressions.

Testing the pure GraphQL API part is less pertinent, as most of the work done there is natively provided by the graphql-java library.
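
As an indication, a resolver-level test can look like the following hypothetical sketch (the class and constructor names are illustrative, not the module's actual ones), using JUnit 5 and Mockito as bundled with the Spring test support:

import static org.junit.jupiter.api.Assertions.assertTrue;
import static org.mockito.ArgumentMatchers.anyString;
import static org.mockito.ArgumentMatchers.eq;
import static org.mockito.Mockito.when;

import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.RepositoryException;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

@ExtendWith(MockitoExtension.class)
class PostRepositoryTest {

    @Mock
    private RepositoryConnection repositoryConnection;

    @Test
    void getPostsReturnsAnEmptyListWhenGraphDBIsUnreachable() {
        // Simulate a GraphDB failure: getPosts() catches the RepositoryException
        // and returns an empty list, as shown in the snippet above
        when(repositoryConnection.prepareTupleQuery(eq(QueryLanguage.SPARQL), anyString()))
                .thenThrow(new RepositoryException("connection refused"));

        PostRepository postRepository = new PostRepository(repositoryConnection);

        assertTrue(postRepository.getPosts(null, null, null, 0, 5).isEmpty());
    }
}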

Elasticsearch stack

Components deployed

The Elasticsearch stack currently deployed is based on the 7.x series (7.2 at the time of this writing) and comprises the following modules:

  • Elasticsearch: the well-known indexing and search engine
  • Kibana: the monitoring and visualisation dashboard
  • Filebeat: a lightweight module used to ship the modules' logs to Elasticsearch (deployed alongside the integrated modules)

The Filebeat component is part of the Elastic Beats family of products, which is progressively replacing the well-known Logstash component. Compared to Logstash, Beats are more lightweight (they are written in Go) and each is focused on one specific task. Filebeat, as its name suggests, is focused on streaming file contents to an Elasticsearch index. It is deployed alongside each application whose log files should be indexed, making them displayable and searchable in the Kibana dashboard.
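
As an indication, a minimal filebeat.yml for this setup could look like the following (the paths and credentials are assumptions, not the project's actual configuration):

filebeat.inputs:
  - type: log
    paths:
      - /app/logs/*.log        # the log directory shared with the monitored module

output.elasticsearch:
  hosts: ["es01:9200"]
  username: "elastic"
  password: "${ELASTIC_PASSWORD}"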

Deployment configuration

The whole stack is deployed as Docker containers and orchestrated via docker-compose:

version: '3.7'

services:
  es01:
    container_name: es01
    image: docker.elastic.co/elasticsearch/elasticsearch:7.2.0
    environment:
      - node.name=es01
      - discovery.seed_hosts=es02
      - cluster.initial_master_nodes=es01,es02
      - ELASTIC_PASSWORD=$ELASTIC_PASSWORD
      - "ES_JAVA_OPTS=-Xms256m -Xmx256m"
      - xpack.license.self_generated.type=basic
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=false
      - xpack.security.http.ssl.key=$CERTS_DIR/es01/es01.key
      - xpack.security.http.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
      - xpack.security.http.ssl.certificate=$CERTS_DIR/es01/es01.crt
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
      - xpack.security.transport.ssl.certificate=$CERTS_DIR/es01/es01.crt
      - xpack.security.transport.ssl.key=$CERTS_DIR/es01/es01.key
    volumes: ['esdata_01:/usr/share/elasticsearch/data', './certs:$CERTS_DIR']
    ports:
      - 9200:9200
    healthcheck:
      test: curl --cacert $CERTS_DIR/ca/ca.crt -s https://localhost:9200 >/dev/null; if [[ $$? == 52 ]]; then echo 0; else echo 1; fi
      interval: 30s
      timeout: 10s
      retries: 5

  es02:
    container_name: es02
    image: docker.elastic.co/elasticsearch/elasticsearch:7.2.0
    environment:
      - node.name=es02
      - discovery.seed_hosts=es01
      - cluster.initial_master_nodes=es01,es02
      - ELASTIC_PASSWORD=$ELASTIC_PASSWORD
      - "ES_JAVA_OPTS=-Xms256m -Xmx256m"
      - xpack.license.self_generated.type=basic
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=false
      - xpack.security.http.ssl.key=$CERTS_DIR/es02/es02.key
      - xpack.security.http.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
      - xpack.security.http.ssl.certificate=$CERTS_DIR/es02/es02.crt
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
      - xpack.security.transport.ssl.certificate=$CERTS_DIR/es02/es02.crt
      - xpack.security.transport.ssl.key=$CERTS_DIR/es02/es02.key
    volumes: ['esdata_02:/usr/share/elasticsearch/data', './certs:$CERTS_DIR']

  kibana:
    image: docker.elastic.co/kibana/kibana:7.2.0
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://es01:9200
    volumes:
      - ./config/kibana.yml:/usr/share/kibana/config/kibana.yml
      - ./certs:/usr/share/kibana/config/certificates
    ports:
      - 5601:5601

The Filebeat components are also deployed via docker-compose but, as said previously, sit alongside the target module. Here is an example for the instance tied to the GraphQL API module:

filebeat:
  image: docker.elastic.co/beats/filebeat:7.2.0
  container_name: filebeat
  volumes:
    - ./config/filebeat.yml:/usr/share/filebeat/filebeat.yml
    - graphql-logs:/app/logs
  depends_on:
    - kibana
    - es01

graphql:
  image: ozwillo/socatel-graphql:latest
  container_name: graphql
  environment:
    - "SPRING_PROFILES_ACTIVE=docker"
  volumes:
    - ./config/application-docker.yml:/application-docker.yml
    - graphql-logs:/app/logs
  ports:
    - 8080:8080

volumes:
  graphql-logs:

Focus on authentication

Adding authentication support has been one of the most complicated aspects of the Elasticsearch stack integration.

Until recently, the security features offered by Elasticsearch were only available via X-Pack plugins, which required buying a licence.

The alternative was to integrate and configure community-developed plugins, with all the risks that implies (long-term maintenance, security vulnerability fixes, …).

Another option emerged at the beginning of the year when Amazon released OpenDistro, an Elasticsearch distribution packed with security features. This was good news, even if those features were still quite complex to configure and deploy (e.g. certificates have to be copied manually in many places).

Fortunately, probably in reaction to the Amazon move, the company behind Elasticsearch decided to make their basic security features free to use (though not open source yet), starting with the 6.8 and 7.1 releases.

We therefore decided to use the features now natively available in the Elasticsearch stack:

  • Certificates are generated at installation time and shared by all the Elasticsearch and Kibana instances to protect their communications
  • SSL has not been configured on the HTTP layer, as SSL communications with the outside world are already handled by the SSL gateway provided by our hosting provider

Indices topology

To avoid name clashes between the different modules using the Elasticsearch indices, it has simply been decided to prefix them by the kind of consumer:

  • From the portal, they are prefixed by so_
  • From the recommendation engine, they are prefixed by kb_
  • From the filebeat components, they are prefixed by filebeat_

It is then possible to precisely customize each field of an index pattern by defining its kind, specifying whether it is searchable, and so on. The Kibana dashboard then allows viewing, organizing and searching inside each of the indices created in Elasticsearch.

Integration with the GraphQL API module

For some specific queries (e.g. to get tweets or Facebook posts related to an organization), one first needs to query Elasticsearch to retrieve information about the organization (e.g. its Twitter handle or its Facebook page name), and can then query the GraphDB triplestore with the appropriate query parameters.

In addition to the obvious query features, we were looking for a library that would support two specific needs:

  • Authenticated queries, without having to write a lot of boilerplate code
  • Queries on a port other than the standard one (7200), because we wanted our components to be exposed behind a reverse proxy using the standard HTTPS port (443)

We then studied the three major Java libraries offering an Elasticsearch API.

The first two libraries we evaluated proved to be cumbersome with respect to authentication, requiring us to override the underlying HTTP client and implement a specific interface, whereas the Jest library allowed adding authentication in a single line of code:

factory.setHttpClientConfig(new HttpClientConfig
        .Builder(jestProperties.getUris().get(0))
        .defaultCredentials(jestProperties.getUsername(), jestProperties.getPassword())
        // other configuration parameters omitted
        .build());

Surprisingly enough, we faced major difficulties when trying to connect to an Elasticsearch instance located behind a reverse proxy and exposed through port 443. We actually never managed to make it work on the pre-production platform, and finally had to open port 7200 on our HAProxy instance.

Once more, with the Jest library, it is just a matter of configuration:

spring:
  elasticsearch:
    jest:
      uris: https://socatel-es.ozwillo-preprod.eu

Finally, the Jest library also proved to be very easy to use, as can be seen in the following sample, which performs a search request by name for an organisation:

SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(QueryBuilders.matchQuery(organizationsNameField, name));

Search search = new Search.Builder(searchSourceBuilder.toString())
            .addIndex(organizationsIndex)
            .build();

SearchResult result;
try {
    result = client.execute(search);
} catch (IOException e) {
    LOGGER.error("Error while searching in ES", e);
    return Optional.empty();
}

List<Service> services = result.getSourceAsObjectList(Service.class, false);
return services.isEmpty() ? Optional.empty() : Optional.of(services.get(0));

The automatic deserialization into Java objects is handled by Google's Gson library and its annotation support:

package com.ozwillo.socatelgraphql.domain;

import com.google.gson.annotations.SerializedName;

public class Service {

    @SerializedName("organisation_id")
    private String organisationId;
    @SerializedName("organisation_name")
    private String organisationName;
    @SerializedName("twitter_screen_name")
    private String twitterScreenName;
    @SerializedName("twitter_user_id")
    private String twitterUserId;

    // other constructors and accessors omitted
}