Spec: Related Items

Summary of the current implementation

This section summarizes the current (CKAN 2.2, Feb 2014) implementation of related items, both from the user's point of view and in its technical details. The point is to get an idea of what the feature currently does and of any problems with the current implementation.

Note that related items is not currently implemented as an extension! It's a core feature, and always on. Maybe it should be moved to an extension?

model

In the model for related items (see ckan/model/related.py), each related item has:

id
type (API, application, idea, news article, paper, post or visualization)
title
description
image_url
url
created date
owner_id (the user who created the item)
view_count
featured (true or false)
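
As a rough illustration of the field list above, here is a minimal sketch of a related item as a plain Python dataclass. This is not the code in ckan/model/related.py; the class name, the default values and the exact strings used for the seven types are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid

# The seven types listed above. The exact stored strings are an assumption.
RELATED_TYPES = (
    "api", "application", "idea", "news_article",
    "paper", "post", "visualization",
)


@dataclass
class RelatedItem:
    """Hypothetical stand-in for the related item model."""
    title: str
    type: str
    url: str
    owner_id: str  # the user who created the item
    description: str = ""
    image_url: str = ""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    view_count: int = 0
    featured: bool = False
```

Note that nothing in the sketch distinguishes one type from another: every item is just a title, a description and a link.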

The way the UI and much of the code is implemented currently, each related item is related to exactly one dataset. You can't have an item related to multiple datasets. You can use the API to create items related to no datasets.

The type field seems a bit pointless: there's no actual difference between the different types of related items; they're all just objects with a title, description and link. What if someone wants to add a related item that doesn't come under those 7 types? On the other hand, you can filter related items by type (e.g. to have a page showing just the apps), and having a list of types helps to suggest what sorts of things the feature might be used for. Maybe it just needs a final type "Other"?

`/related` page (the "related dashboard")

This page is not yet linked to from anywhere in the default CKAN theme.

This page shows all of the related items on the site, paginated. You can filter the related items by type, and also show only featured related items, and you can sort the related items by created date or by view count.

The default sort order is just labelled "Default", so the user can't tell what the ordering actually is.

All you can do with the related items on this page is click on them, and you'll be taken to the item's URL (i.e. the external URL that the item links to). You can't create or edit or delete related items from this page, you can't see the datasets that the items are related to, and there's no way to get to the dataset pages.

From a user point of view I don't think this page is very well thought through yet. Both the URL and the name of this page seem very weird to me: related to what? The datasets that the related items are related to can't be seen or reached from this page. (Also, not every related item is related to a dataset anyway: via the API you can create related items with no datasets. All of the related items on publicdata.eu are this way, for example.)

I think some clarification is needed about what the purpose of this page is / what the use-cases for it are and then we can consider how it should be re-designed.

The page title is "Apps & Ideas" but this is different from the page URL (/related). The feature description in the page sidebar also just talks about apps and ideas:

What are applications?

These are applications built with the datasets as well as ideas for things that could be done with them.

(Btw, in "ideas for things that could be done with them", does "them" refer to the datasets? Or the applications?)

But when you create a related item, there are 7 possible types: API, app, idea, news article, paper, post, visualization, so it's not just apps and ideas.

Also when related items are shown on dataset pages, the name "Apps & Ideas" is not used (see below).

Viewing related items

You can view a list of a dataset's related items on the dataset's page, or a list of all related items on the related dashboard page. Clicking on a related item just redirects you to the item's external URL (e.g. the blog post or whatever), though; related items don't appear to have their own pages.

On the dataset page, the URL ends in /related and the tab is just titled "Related", the button is "Add Related Item", etc. (There's no use of the "Apps & Ideas" title from the dashboard page.) When adding or updating a related item, a third name "related media" appears:

What are related items?

Related Media is any app, article, visualisation or idea related to this dataset.

For example, it could be a custom visualisation, pictograph or bar chart, an app using all or part of the data or even a news story that references this dataset.

This is quite different from the description on the related dashboard page.

If we want to support items that are related to more than one dataset, then related items will probably need to get their own pages, because somewhere we need to show a list of all the datasets that the item is related to.

You can also view individual related items using the API with the related_show action.
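
For reference, a minimal sketch of what a related_show call over the action API might look like from a client. The site URL and the helper names are assumptions made for the example; the sketch only builds the request URL and unpacks the usual action API response envelope, without performing any network access.

```python
import json
from urllib.parse import urlencode


def related_show_url(site_url, related_id):
    """Build the action API URL for fetching one related item."""
    query = urlencode({"id": related_id})
    return "%s/api/3/action/related_show?%s" % (site_url.rstrip("/"), query)


def unwrap_result(response_body):
    """Return the 'result' of an action API response, or raise on failure."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise RuntimeError(payload.get("error"))
    return payload["result"]


# Usage with a hypothetical site and a canned response body:
url = related_show_url("http://ckan.example.org/", "abc123")
body = '{"success": true, "result": {"id": "abc123", "title": "An app"}}'
item = unwrap_result(body)
```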

Creating related items

It is possible to create a related item that isn't related to any dataset, using the related_create API. In the web interface though, the only way to create a related item is by going to a dataset's related items tab, and then the item will be related to that one dataset.

After creating the item, the user is redirected to the dataset's list of related items.

View counting

Related items have their own view counting feature. Whenever someone clicks on a related item to follow its link, it increments the count (implemented in the controller class).

It looks like there's no limiting (e.g. of the same user repeatedly clicking), and viewing an item over the API doesn't count. I'm not sure whether clicking on an item on the related dashboard page counts, or clicking on an item on custom pages that use the API to show related items (e.g. publicdata.eu's front page).
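
To make the "no limiting" point concrete, here is a small sketch of one possible form of limiting: counting at most one view per (item, viewer) pair. This is not how CKAN currently counts views; the class, its in-memory storage and the idea of a viewer key (e.g. a user id or IP address) are all assumptions made for the example.

```python
class ViewCounter:
    """Hypothetical view counter that ignores repeat views by one viewer."""

    def __init__(self):
        self.counts = {}    # item id -> number of distinct viewers
        self._seen = set()  # (item id, viewer key) pairs already counted

    def record(self, item_id, viewer_key):
        """Count the view unless this viewer was already counted.

        Returns True if the count was incremented, False otherwise.
        """
        key = (item_id, viewer_key)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.counts[item_id] = self.counts.get(item_id, 0) + 1
        return True
```

A real implementation would also need some expiry of the "seen" set and persistent storage, which the sketch leaves out.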

This view counting could maybe be integrated with CKAN's builtin page view tracking feature, although that feature also has problems.

Activity streams

Activity streams are created when you create, update or delete a related item. These could do with a little quality control though, e.g. I think I saw "api" spelled in lower-case. (API is also mis-spelled as "Api" in related item tooltips.) Also, what activity streams should these activities appear in? It looks like currently they appear in the user's activity stream, but not the dataset's? They probably don't appear in the group or organization's stream either.

Featured related items

A sysadmin can mark certain related items as "featured". (Not sure whether this is doable in the web UI, or only the API.) The only place this is currently used is on the related dashboard page, where there's a checkbox to show only the featured items.

In extensions you could quite easily do things like show the featured items on the site front page, or have one page showing the featured apps and another page showing featured visualizations, etc. An example extension showing how this can be done might be a good idea.

Authorization

The way the auth functions are currently implemented is:

Anyone can see any related item or list of related items. (So private datasets with private related items aren't supported.)
Anyone who is logged-in can add a related item to any dataset. (It doesn't seem to matter whether they have permission to edit the dataset, although I didn't test it with organizations.)
Only a sysadmin or the "owner" of a related item (the person who created the item) can edit a related item.
The owner of a related item can't be changed; it's always the person who created it. (Maybe it can be changed via the API?)
Anyone who has permission to delete a dataset can delete an individual related item from the dataset (and this will completely remove the item from the site, not just remove it from the dataset). The creator of a related item can also delete it.
Only sysadmins can create featured related items or mark a related item as featured.
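
The rules above can be restated as a handful of small predicates. This is only a sketch of the behaviour as described in this section, not the actual auth functions; the User class, the function names and the can_delete_dataset flag are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class User:
    id: str
    sysadmin: bool = False


def can_show(user, item):
    # Anyone, logged in or not, can see any related item.
    return True


def can_create(user, data):
    if user is None:
        return False  # must be logged in
    if data.get("featured") and not user.sysadmin:
        return False  # only sysadmins can create featured items
    return True       # dataset edit permission is not checked


def can_update(user, item, changes):
    if user is None:
        return False
    if user.sysadmin:
        return True
    if changes.get("featured", item["featured"]) != item["featured"]:
        return False  # only sysadmins can change the featured flag
    return user.id == item["owner_id"]


def can_delete(user, item, can_delete_dataset):
    # can_delete_dataset: whether the user may delete the related dataset.
    if user is None:
        return False
    return user.sysadmin or user.id == item["owner_id"] or can_delete_dataset
```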

Technical details

Model

There's a separate table related_dataset that maps related items to datasets, so in theory the model supports a many-many relationship, but I think the rest of the implementation only allows a related item to be related to one dataset. (And allowing items to be related to multiple datasets would probably raise a lot of questions about authorization and user interface.)

I think we should move more code into the model, e.g. methods for returning lists of related items filtered by dataset, type, featured, etc. These methods can then be unit-tested in the model, and the logic can just call the model methods instead of doing its own sqlalchemy.
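
As a sketch of what such a model-level method could look like, the example below uses the standard library's sqlite3 rather than SQLAlchemy, so that it is self-contained. The table layout is a simplified assumption (the real related and related_dataset tables have more columns), and list_related() is a hypothetical name.

```python
import sqlite3


def make_db():
    """Create an in-memory database with simplified, assumed tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE related (
            id TEXT PRIMARY KEY, title TEXT, type TEXT,
            featured INTEGER DEFAULT 0);
        -- Mapping table: one row per (related item, dataset) pair,
        -- which is what makes a many-many relationship possible.
        CREATE TABLE related_dataset (related_id TEXT, dataset_id TEXT);
    """)
    return conn


def list_related(conn, dataset_id=None, type_=None, featured=None):
    """Return (id, title) rows, optionally filtered, ordered by id."""
    sql = "SELECT DISTINCT r.id, r.title FROM related r"
    where, params = [], []
    if dataset_id is not None:
        sql += " JOIN related_dataset rd ON rd.related_id = r.id"
        where.append("rd.dataset_id = ?")
        params.append(dataset_id)
    if type_ is not None:
        where.append("r.type = ?")
        params.append(type_)
    if featured is not None:
        where.append("r.featured = ?")
        params.append(1 if featured else 0)
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " ORDER BY r.id"
    return conn.execute(sql, params).fetchall()
```

Because every filter is applied in one place, the same filtering is available whether or not a dataset is given, and the function can be unit-tested without going through the logic layer.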

Controller

dashboard()

The related dashboard page is implemented by the dashboard() method in the related controller. It calls the related_list() action function to actually get the related items to show. It looks like the pagination is done in the controller; this should be moved into the action function so that the API supports pagination and so that it can be tested more easily.
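
A minimal sketch of pagination handled inside the action function rather than the controller. The limit and offset parameter names, the default page size and the shape of the return value are assumptions made for the example; the real related_list() takes a context and a data_dict.

```python
def related_list(items, limit=20, offset=0):
    """Return one page of items plus the total count.

    items stands in for the already filtered and sorted query result.
    """
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    return {
        "count": len(items),                     # total, for page links
        "results": items[offset:offset + limit],
    }
```

With something like this, the controller only has to turn a page number into an offset, and API clients get the same pagination as the web page.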

read()

The related controller's read() method (which redirects the browser to the related item's external URL) accesses the model directly to get the related item. It should be going through the related_show() action function. It also does its own call to check_access() in the controller. Again, related_show() should be doing this, and the controller should just call related_show(). The view counting feature should not be implemented in the controller either.
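
The intended layering might look roughly like the sketch below: the action function does the auth check and the lookup, and the controller only turns the result into a redirect. The exception classes, the dict-based store in the context and the return value of read() are simplified stand-ins, not CKAN's real interfaces.

```python
class NotFound(Exception):
    pass


class NotAuthorized(Exception):
    pass


def related_show_auth(context, data_dict):
    # Currently anyone may see any related item.
    return True


def related_show(context, data_dict):
    """Action function: auth check and lookup both live here."""
    if not related_show_auth(context, data_dict):
        raise NotAuthorized("not allowed to see this related item")
    item = context["store"].get(data_dict["id"])
    if item is None:
        raise NotFound("related item not found")
    return dict(item)


def read(context, related_id):
    """Controller method: no model access and no auth logic."""
    item = related_show(context, {"id": related_id})
    return 302, item["url"]  # redirect to the item's external URL
```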

_edit_or_new()

The related controller's _edit_or_new() method does its own check_access() call; that should be done by the related_create() action (there shouldn't be any auth stuff in controllers).

There's some other weird stuff going on in the controller here too, like some unflattening and tuplizing and putting a package in c.pkg_dict that may not be used anywhere.

It shows a flash message after creating the item, which I don't think we normally do in CKAN, so maybe it shouldn't be there? (Updating and deleting related items also does this.)

The method docstring of the related controller's _edit_or_new() method looks more like a git commit message.

It adds "related" and "id" to the context at the end, not sure why.

Actions

related_list

The related_list() action function (returns a list of related items, optionally filtered by dataset, type and whether featured) is for some reason putting a package in c.pkg and c.pkg_dict, but I'm not sure if this is actually used (and it doesn't seem like something an action function should do).

related_list() can accept either a dataset ID (in a param named id, which would be clearer if it was called dataset_id) or a dataset dict (in a param called dataset). If the dataset param is not present, it falls back to the id param. This precedence isn't documented or tested. Accepting both seems unnecessary to me; it would be much simpler to accept just the id.

If no dataset id or dict is given, it returns all related items, but I don't think this is documented either. Also it looks like all the sorting and filtering features are only applied when returning all related items, and not when returning a dataset's related items (again not mentioned in the docstring).
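
The parameter handling described above amounts to something like the following sketch. The helper name is hypothetical, and the assumption that the dataset dict carries its ID under an "id" key is mine.

```python
def resolve_dataset_id(data_dict):
    """Return the dataset ID to filter by, or None for "all items".

    The dataset dict takes precedence; the id param is only a fallback.
    """
    dataset = data_dict.get("dataset")
    if dataset is not None:
        return dataset["id"]  # assumed key
    return data_dict.get("id")
```

For example, resolve_dataset_id({"dataset": {"id": "d1"}, "id": "d2"}) picks "d1" and silently ignores "d2", which is the kind of behaviour that ought to be documented and tested if both params are kept.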

related_list() calls check_access('related_show'), i.e. it uses the related_show auth function. It should have its own related_list auth function.

It doesn't appear to do much validation of params, e.g. what happens if the given dataset dict or id is invalid?

It also makes a JSON dump of all of the package's resources for some reason?

related_create

After calling model_save.related_dict_save() the related_create() action seems to do its own thing to append the related item to the dataset's list of related items. Should this be done in model_save?

Schemas

The related_show() action uses the default_related_schema(), which is also used by related_create and related_update. I think each action should have its own schema: default_related_show_schema(), default_related_create_schema(), default_related_update_schema(). If they're all the same, then make a _default_related_schema() helper function and have them all call it.
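
A sketch of the suggested arrangement: one private helper plus one public schema function per action. The validators here are simplified stand-ins written for the example (real CKAN validators have a different signature), and the choice of which fields are required is an assumption.

```python
def not_empty(value):
    """Stand-in validator: reject missing or empty values."""
    if value in (None, ""):
        raise ValueError("missing value")
    return value


def ignore_missing(value):
    """Stand-in validator: accept anything, including missing values."""
    return value


def _default_related_schema():
    # Shared base. A fresh dict is returned each time so that one action's
    # schema can be customised without affecting the others.
    return {
        "id": [ignore_missing],
        "title": [not_empty],
        "description": [ignore_missing],
        "type": [not_empty],
        "image_url": [ignore_missing],
        "url": [ignore_missing],
        "featured": [ignore_missing],
    }


def default_related_show_schema():
    return _default_related_schema()


def default_related_create_schema():
    return _default_related_schema()


def default_related_update_schema():
    schema = _default_related_schema()
    schema["id"] = [not_empty]  # assumed: an update must name its item
    return schema
```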

In the case of related items, I'm not sure I see the point of validating the data coming out of the database anyway, since I don't think any conversions are done. With packages and resources for example, the schemas can be customized by IDatasetForm and the custom schemas may include converter functions like convert_from_tags() and convert_from_extras(), so that's why data coming out of the database in package_show() needs to be converted/validated. But in related_show(), this seems pointless?

It looks like user_dictize(), if passed the 'with_related' option in the context, will also dictize all of the user's related items. When showing a user profile page, the user controller does pass this option. The dictized items are never used though.

I think this (and several other instances of odd, apparently unused stuff being put into contexts and Pylons template contexts by various related items functions) shows why we should be avoiding things like the template context, and instead using template helper functions. These context variables were presumably once used by the templates, but the templates have since been changed and no longer use them, so they're wasting CPU cycles and cluttering up the code. If the old templates had been calling helper functions instead, those helper function calls would have been deleted along with the old template code.

Some of these template context variables may be used by the legacy templates (and the tests!) but not used in the new templates.

Tests

There are related items tests in tests/functional/test_related.py, tests/functional/api/test_activity.py and tests/logic/test_action.py. I haven't looked into these but it would probably be really good to write a complete set of new-style tests for the feature and delete all of these old ones.

New implementation

Aims

The "related items" feature (to be renamed, probably) is about re-uses of data, e.g. apps, visualizations, stories, etc. that use the data from a CKAN site.
seanh: It may not have been clear previously that "related items" were only meant for reuses of data. You might have posted, for example, a document about how the data was collected as a related item. In this new spec, we've decided to make the feature specifically about data reuses. We should probably rename the feature to "data reuses" or something to that effect.

We want to let:

Site maintainers promote data reuses on their CKAN sites
Data reusers relate their data reuses to the relevant datasets within the CKAN site
Site visitors search for and find valuable reuses
Site admins and dataset owners moderate reuses
Site admins and dataset owners showcase the best reuses

Note that site admins or organization admins may add their own data reuses to the site, so they may be playing the role of data reusers as well.

User stories

Note: the format of a user story is:

As a ROLE, I want to DESIRE so that BENEFIT

seanh: I've removed several phrases like "on dataset page", "site wide & on dataset page", etc. from the user stories below because I think that's a concrete user interface decision that probably doesn't belong in the user stories. But let it be known that Ira would like data reuses to appear both on dataset pages and on a site-wide marketplace page!

Data reuser user stories

As a data reuser I want to show the cool things I've done with a site's data so that my reuses reach a wider audience.
As a reuser I want to be able to associate multiple datasets with each of my data reuses so that I can represent all of the datasets that each of my reuses uses.
As a reuser I want to be able to associate my data reuse with multiple datasets so that I can showcase my reuse in all the relevant contexts.

seanh: The last two user stories are very similar but I think there are two different cases here. The first one is: someone is looking at a data reuse, and they want to see what datasets it was made from. So going from the reuse to the datasets. The second one is: someone is looking at a dataset (or multiple datasets, e.g. a group, organization, tag, dataset search result...), and they want to see what reuses have been made from that/those dataset(s). So going from the dataset(s) to the data reuse(s). Together, these two user stories imply a many-many relationship between datasets and data reuses.

Site visitor user stories

As a visitor I want to see any re-uses that have been made with this/these dataset(s) since they may be more useful to me than just the data itself.

seanh: This could apply to the page for an individual dataset (i.e. a dataset's reuses should be shown somewhere on the dataset's page), but it could also apply to any pages where multiple datasets are listed, e.g. group and organization pages, dataset search results.

As a visitor, I want to be able to see what datasets this data reuse was made from, so that... ?
As a visitor I want to see what data re-uses exist so that I can get ideas for myself or because I'm looking for interesting re-uses.
As a visitor I want to be able to search through all the data re-uses so that I can see if what I need (e.g. an app about hospital ratings in London) already exists.

Data publisher user stories

As a data publisher I want to see what reuses have been made from my data so that I can see what value is being made of my data.
As an organization or group admin, I want to see all the reuses that have been made of the datasets in my organization or group so that I can see how valuable and interesting the data is to re-users and be motivated to open up more data in better quality!

seanh: Those last two are kind of the same user story, I think one of them was meant to be about showing data reuses on the dataset pages, the other about showing them on the group and organization pages.

As an organization admin I want to be able to delete data reuses from my datasets if they are irrelevant or spam so that they don't pollute my dataset pages.
As an organization admin I want to be able to moderate new data reuses that are added to my datasets in a queue where I can approve or delete the items so that I can quality check any additions.

seanh: Presumably they will also need to moderate data reuse edits. What about deletions?

As a sysadmin I want a way to moderate related item additions centrally for all items being added to the portal.

seanh: ...so that? I don't have to visit every organization's page one-by-one? It seems like the data reuse moderation page for a given user should show a queue of all the data reuses that user can moderate. If the user is an admin of multiple organizations, they'd see reuses from each of those orgs. If they're a sysadmin, they'd see all reuses.
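
A rough sketch of that per-user queue, under several assumptions: pending reuses are dicts with a dataset_ids list, each dataset belongs to at most one organization, and a user dict lists the organizations the user is an admin of. None of these structures exist in CKAN in this form.

```python
def moderation_queue(user, pending_reuses, dataset_orgs):
    """Return the pending reuses this user is allowed to moderate.

    user: {"sysadmin": bool, "admin_of": [org ids]}
    pending_reuses: [{"id": ..., "dataset_ids": [...]}, ...]
    dataset_orgs: maps dataset id -> owning organization id
    """
    if user["sysadmin"]:
        return list(pending_reuses)  # sysadmins see everything
    admin_orgs = set(user["admin_of"])
    return [
        reuse for reuse in pending_reuses
        if any(dataset_orgs.get(d) in admin_orgs
               for d in reuse["dataset_ids"])
    ]
```

One open question the sketch makes visible: a reuse related to datasets from two organizations appears in both organizations' queues, so it needs to be decided whose approval counts.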

Specific changes from current state

Add search bar and standardise filters to be across top of page on site-wide index: https://github.com/okfn/ckan/issues/333
Add ability to associate datasets to related items. Should be a link and part of the related item info. This would allow related items to be added centrally, but still associate and show up on dataset pages. (https://github.com/okfn/ckan/issues/335 + https://github.com/okfn/ckan/issues/465)
Allow editing/deleting of related items from anywhere when you have the auth to do so (i.e. if you are owner of the related item, or sysadmin) https://github.com/okfn/ckan/issues/334
Related items associated only to a private dataset should be hidden until dataset is made public
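
The last rule in the list above could be expressed as a small visibility check. This is a sketch only; the function name is hypothetical, and treating items that are related to no datasets at all as visible is an assumption, since the rule doesn't cover that case.

```python
def is_publicly_visible(related_dataset_ids, private_dataset_ids):
    """Decide whether a related item should be shown publicly.

    Hidden only when every dataset it is related to is private.
    """
    if not related_dataset_ids:
        return True  # assumption: items with no datasets stay visible
    return any(d not in private_dataset_ids for d in related_dataset_ids)
```

An item related to one private and one public dataset stays visible under this rule, but its link to the private dataset would presumably still need to be hidden from users who can't see that dataset.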

Bugs and current problems

No way to moderate related items. Currently only the user who added the related item can delete it. Organization admins/editors (or dataset "owners") should at least be able to delete related items added to their datasets.
Current user flow for adding, editing and deleting datasets needs work.

Questions

What do we call this feature? Related? Apps and Ideas? Something else? What's the MVP to make this useful? Ability to delete / moderate? Site-wide search on related items? Both?
