
Technical architecture

rgraf edited this page Sep 1, 2014 · 31 revisions

Introduction

As part of the [user workflows](User Flows), w3act links to or invokes a number of existing APIs. It also acts as a data source for other systems, most notably the web crawlers themselves. This page provides an overview of the high-level technical architecture.

Internal Architecture

w3act is a Play Framework (v2.2.1) application developed in Java (JDK 7) and backed by a PostgreSQL database (v9.3.1). It can be run on both Windows and Linux. w3act makes use of several third-party libraries. One of them is a pre-packaged MaxMind GeoIP2 database (v0.7.0), which provides location information, such as the country, for a given IP address. Another is a pre-packaged Whois lookup service, which maps a domain name to a country. This service is based on JRuby (v1.7.9) scripts and the whois gem (v3.4.2.2). For styling and presentation in the web browser, the application employs Bootstrap (v3.0.0) and jQuery (v1.4.2). Target archiving is supported by the RabbitMQ Java library (v3.3.1).

We use the javax.mail package for email integration in the Permissions workflows. The main project configuration settings for email and Drupal access are defined in the w3act.properties file.

For password storage we employ secure hashing with a random salt, following the method proposed by Taylor Hornby.
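
To illustrate the idea of salted hashing, here is a minimal sketch. It assumes the scheme is the PBKDF2-based approach from Hornby's widely circulated password hashing example. The hash function, iteration count, byte lengths and the `iterations:salt:hash` storage format are illustrative assumptions, not values taken from the w3act code base.

```python
import hashlib
import hmac
import secrets

# Illustrative parameters (assumptions, not w3act's actual settings).
ITERATIONS = 100_000
SALT_BYTES = 24
HASH_BYTES = 24


def create_hash(password: str) -> str:
    """Hash a password with a fresh random salt; returns 'iterations:salt:hash'."""
    salt = secrets.token_bytes(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=HASH_BYTES
    )
    return f"{ITERATIONS}:{salt.hex()}:{digest.hex()}"


def validate_password(password: str, stored: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    iterations, salt_hex, digest_hex = stored.split(":")
    expected = bytes.fromhex(digest_hex)
    candidate = hashlib.pbkdf2_hmac(
        "sha256",
        password.encode("utf-8"),
        bytes.fromhex(salt_hex),
        int(iterations),
        dklen=len(expected),
    )
    return hmac.compare_digest(candidate, expected)
```

Because the salt is random per password, two users with the same password end up with different stored values, and the stored string carries everything needed to verify a later login attempt.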

System Integration

The main purpose of the w3act application is the management of archived data in the form of crawl Targets (identified by URLs). The initial data set is imported from the previous ACT prototype (a Drupal site) when the application starts. Additional data can also be loaded at start-up from Play Framework configuration files with a .yml extension. New data can be added by users with various roles who access the running application. Here, we cover the data that w3act consumes or provides, as illustrated below.

Integration Architecture

ACT Prototype Legacy Data

The Drupal-based ACT Prototype is the primary source of legacy data. w3act is built to pull the data from that system in order to initially populate the database. Once w3act can be deployed in production, it will replace the ACT Prototype.

The following data-types should be migrated:

  • Targets
  • Instances
  • Organisations
  • Categories
  • Users
  • Subjects
  • Licences
  • Quality Issues

The following data-types can be defined in configuration files of w3act:

  • Users
  • Roles
  • Permissions
  • Organisations
  • Taxonomies
  • Tags
  • Flags
  • Contact Persons
  • Email Templates

Database model

The class diagram of the w3act database model is presented below. The relations between classes are shown by connection lines and numbers. The main classes have a green background, the helper classes are yellow and the user management classes are blue. The classes 'DCollection', 'Subject', 'License' and 'QA Issue' are derived from the 'Taxonomy' class.

Database model

UKWA Nominations

In order to completely replace the existing Selection and Permissions Tool, w3act provides a RESTful API to which nominations received from the UKWA nominations form can be posted.

The Nomination API can be used from a web browser such as Firefox:

http://<servername>/actdev/ukwa/nominationform

where "servername" could be, for example, www.webarchive.org.uk.

Alternatively, nominations can be submitted with an HTTP POST request to:

http://<servername>/actdev/ukwa/nominations/load/json

The response should look like:

{"status":"OK","message":"Nomination Test Name"}

An example curl command, with the associated JSON for a nomination, is:

curl --header "Content-type: application/json" --request POST --data '[{"id":5588938485188769491,"url":"act-5588938485188769491","name":"Nomination Test Name","title":"Nomination Test title","website_url":"http://www.webarchive.org.uk/00","email":"[email protected]","tel":null,"address":"Library street 1","nominated_website_owner":null,"justification":null,"notes":"This is a test nomination object","nomination_date":null}]' http://<servername>/actdev/nominations/load

A sample JSON object for a nomination:

[{
    "id":5588938485188769491,
    "url":"act-5588938485188769491",
    "name":"Nomination Test Name",
    "title":"Nomination Test title",
    "website_url":"http://www.webarchive.org.uk/00",
    "email":"[email protected]",
    "tel":null,
    "address":"Library street 1",
    "nominated_website_owner":null,
    "justification":null,
    "notes":"This is a test nomination object",
    "nomination_date":null
}]
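
The same POST can be prepared from a script. The following is a rough sketch rather than a reference client: it reuses the endpoint path from the curl example, assumes plain HTTP, and uses a hypothetical placeholder email address. The helper names are illustrative only.

```python
import json
import urllib.request


def build_nomination_request(servername, nominations):
    """Prepare (but do not send) a POST carrying a JSON array of nominations."""
    # Path copied from the curl example; the server name is supplied by the caller.
    url = f"http://{servername}/actdev/nominations/load"
    body = json.dumps(nominations).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


def nomination_accepted(response_text):
    """Return True if the reply has the documented {"status": "OK", ...} shape."""
    reply = json.loads(response_text)
    return reply.get("status") == "OK"


nomination = {
    "id": 5588938485188769491,
    "url": "act-5588938485188769491",
    "name": "Nomination Test Name",
    "title": "Nomination Test title",
    "website_url": "http://www.webarchive.org.uk/00",
    "email": "nominator@example.org",  # hypothetical placeholder address
    "tel": None,
    "address": "Library street 1",
    "nominated_website_owner": None,
    "justification": None,
    "notes": "This is a test nomination object",
    "nomination_date": None,
}

request = build_nomination_request("www.webarchive.org.uk", [nomination])
# Sending is left to the caller, e.g. urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the payload before anything is posted to a live server.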

Crawl Feeds

The primary way w3act affects the crawl is through a set of crawl feeds that are used to drive the crawlers. Critically, these feeds must only contain Targets that we have permission to crawl. Furthermore, in order to ensure that material is stored appropriately, we require separate feeds for Legal Deposit and By Permission crawls.

For Legal Deposit Targets

/targets/export/ld/{frequency}

For example, www.webarchive.org.uk/actdev/targets/export/ld/annual, where frequency has the value 'annual'. The response to this request is a list of JSON objects, each of the following form:

{
"nid":653,
"value":"",
"summary":"",
"format":"",
"field_scope":"root",
"field_depth":"capped",
"field_via_correspondence":false,
"field_uk_postal_address":false,
"field_uk_hosting":false,
"field_nominating_organisation":"act-01",
"field_crawl_frequency":"annual",
"field_crawl_start_date":"1365422400",
"field_uk_domain":true,
"field_crawl_permission":"",
"field_special_dispensation":false,
"field_uk_geoip":false,
"field_professional_judgement":false,
"vid":21730,
"is_new":false,
"type":"url",
"title":"Wiltshire Involvement Network (WIN)",
"language":"",
"url":"act-653",
"edit_url":"wct-21730",
"status":1,
"promote":0,
"sticky":0,
"created":"1364941596",
"changed":"1396448501",
"author":"act-9",
"log":"",
"comment":2,
"comment_count":0,
"comment_count_new":0,
"feed_nid":0,
"field_crawl_end_date":"",
"field_live_site_status":"live",
"field_wct_id":136020184,
"field_spt_id":168514,
"legacy_site_id":0,
"field_no_ld_criteria_met":false,
"field_key_site":false,
"field_professional_judgement_exp":"",
"field_ignore_robots_txt":false,
"revision":"initial evision",
"active":true,
"white_list":"",
"black_list":"",
"date_of_publication":"",
"justification":"",
"selector_notes":"",
"archivist_notes":"",
"selection_type":"SELECTION",
"selector":"",
"flag_notes":"",
"field_url":"http://www.wiltshireinvolvementnetwork.org.uk/",
"domain":"wiltshireinvolvementnetwork.org.uk",
"field_description":"",
"field_uk_postal_address_url":"",
"field_suggested_collections":"",
"field_collections":"",
"field_license":"act-168",
"field_collection_categories":"act-0",
"field_notes":"",
"field_instances":"",
"field_subject":"",
"field_subsubject":"",
"keywords":"",
"tags":"",
"synonyms":"",
"originating_organisation":"",
"flags":"",
"authors":"",
"field_qa_status":"act-165",
"qa_status":"",
"qa_issue_category":"",
"qa_notes":"",
"quality_notes":"",
"lastUpdate":1399463915995,
"organisation":
    {"title":"The British Library",
     "url":"act-101",
     "targetNumberByOrganisationUrl":7127
    },
"duplicateNumber":1,
"_user_by_id":"Jennie Grimshaw",
"selectedTags":"none",
"selectedFlags":"none",
"statusStr":"N/A"
}     
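
A crawler-side consumer might reduce such a feed to a list of seed URLs. The sketch below assumes that the feed body is a JSON array of objects like the one above, that `field_url` holds a single URL, and that inactive Targets should be skipped. None of these assumptions is confirmed by this page, and the function name is hypothetical.

```python
import json


def seeds_from_feed(feed_text, frequency=None):
    """Extract seed URLs from a crawl feed, optionally filtered by frequency."""
    seeds = []
    for target in json.loads(feed_text):
        # Skip Targets that are not flagged as active (assumed semantics).
        if not target.get("active", False):
            continue
        if frequency is not None and target.get("field_crawl_frequency") != frequency:
            continue
        url = target.get("field_url", "").strip()
        if url:
            seeds.append(url)
    return seeds
```

For the example object above, calling `seeds_from_feed` with frequency 'annual' would yield the single seed held in its `field_url` value.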

For By Permission Targets

/targets/export/by/{frequency}

where frequency can also be 'all'. A list of the available frequencies may additionally be supplied in machine-readable format.
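
Since the two kinds of feed must be kept separate, a client could build both URLs from the same frequency and handle them independently. This is only a sketch: the base URL (including the `http` scheme) and the helper name are assumptions, while the path patterns are the ones given above.

```python
from urllib.parse import quote

# "ld" = Legal Deposit, "by" = By Permission (path segments from this page).
FEED_KINDS = ("ld", "by")


def feed_urls(base_url, frequency):
    """Return one feed URL per feed kind for the given crawl frequency."""
    base = base_url.rstrip("/")
    segment = quote(frequency, safe="")
    return {kind: f"{base}/targets/export/{kind}/{segment}" for kind in FEED_KINDS}


urls = feed_urls("http://www.webarchive.org.uk/actdev", "annual")
```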

New Instance API

At the end of a successful crawl, the crawl system will attempt to inform w3act that there is a new instance of a given target available.

The link to the instance in the QA Wayback has the following form:

opera.bl.uk:8080/wayback/*/<URL>

where URL is the URL of the site being looked up, e.g. opera.bl.uk:8080/wayback/*/http://www.1001inventions.com/

The link to the instance in the Open UK Web Archive has the following form:

http://www.webarchive.org.uk/wayback/archive/*/<URL>

where URL is the URL of the site being looked up, e.g. http://www.webarchive.org.uk/wayback/archive/*/http://www.1001inventions.com/

The link to the instance in other web archives has the following form:

www.webarchive.org.uk/mementos/search/<URL>

where URL is the URL of the site being looked up, e.g. www.webarchive.org.uk/mementos/search/http://www.1001inventions.com/

The link to the live site is simply:

<URL>

where URL is the URL of the site itself, e.g. http://www.1001inventions.com/
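
The four link patterns above can be generated from a single Target URL. The sketch below reproduces the patterns exactly as written on this page (including those given without a scheme). The function and key names are hypothetical.

```python
def instance_links(target_url):
    """Build the QA Wayback, UKWA, Mementos and live-site links for a URL."""
    return {
        "qa_wayback": f"opera.bl.uk:8080/wayback/*/{target_url}",
        "ukwa": f"http://www.webarchive.org.uk/wayback/archive/*/{target_url}",
        "mementos": f"www.webarchive.org.uk/mementos/search/{target_url}",
        "live": target_url,
    }


links = instance_links("http://www.1001inventions.com/")
```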

QA Wayback

As data is collected via our various crawls, snapshots of individual sites and resources become available via our internal Wayback service before they appear anywhere else. This QA Wayback service is a critical integration point for w3act, as the Wayback system helps w3act to provide the user with the information they need in order to populate the Instances of the Target sites from the domain crawls.

The Target page in w3act should also allow the user to see what copies are available, and promote them to Instances. Note that the snapshots discovered via Wayback cannot be automatically promoted to Instances as they may not cover the entire scope of the Target. For example, if we've just got another copy of a homepage, but no deeper content, this may not be considered as being sufficient to act as an Instance of that Target.

The Wayback API

This section should include details of the Wayback XML API, e.g. http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=http://www.google.co.uk (see http://wwwoh-access.archive.org/wwwoh/waybackapi.htm).
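
As a small illustration, a query URL of the kind shown above could be built as follows. Only the `url` parameter that appears in the example is used. The helper name is hypothetical, and the response format is not covered here.

```python
from urllib.parse import urlencode

# Endpoint taken from the example above.
WAYBACK_XMLQUERY = "http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp"


def xmlquery_url(target_url):
    """Build an XML query URL for the given target URL."""
    # ":" and "/" are left unescaped so the result matches the example form;
    # characters such as "?" and "&" in the target are still percent-encoded.
    return WAYBACK_XMLQUERY + "?" + urlencode({"url": target_url}, safe=":/")
```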

UKWA Wayback

As with the QA Wayback integration, the w3act interface should check which snapshots are available via the public Wayback interface. It should highlight those that have been promoted for publishing but are not yet available, and those that are public but are either unknown to w3act or known but not marked for publishing.
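
The three cases to highlight amount to simple set comparisons. The sketch below assumes each snapshot can be identified by a hashable key such as a (URL, timestamp) pair. That representation and all names here are hypothetical, not part of w3act.

```python
def compare_with_public_wayback(promoted, known, public):
    """Classify snapshots by comparing w3act's view with the public Wayback.

    promoted: snapshots w3act has marked for publishing
    known:    all snapshots w3act knows about (expected to include promoted)
    public:   snapshots visible via the public Wayback interface
    """
    promoted, known, public = set(promoted), set(known), set(public)
    return {
        # Promoted for publishing, but not (yet) publicly available.
        "promoted_not_public": promoted - public,
        # Public, but unknown to w3act.
        "public_unknown": public - known,
        # Public and known to w3act, but not marked for publishing.
        "public_not_promoted": (public & known) - promoted,
    }
```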

Mementos

For any given Target, the w3act interface should provide a link to the Mementos search interface for that item, e.g.

http://www.webarchive.org.uk/mementos/search/http://example.com

Monitrix

Once Monitrix is in production and able to cope with our entire crawl history, w3act should at least provide direct links to the crawl information related to a given Target. The longer term goal would be to pull that information in automatically and augment the various w3act pages and reports with useful crawl-level information.

UKWA Publisher Feeds

w3act should provide feeds of the Instances and Targets that are OK to publish, which are used to populate UKWA.
