
Spec: Related Items

Sean Hammond edited this page Feb 5, 2014 · 11 revisions

Summary of the current implementation

This section summarizes the current (CKAN 2.2, Feb 2014) implementation of related items, both from the user's point of view and in its technical details. The point is to get an idea of what the feature currently does and of any problems with the current implementation.

Note that related items is not currently implemented as an extension! It's a core feature, and always on. Maybe it should be moved to an extension?

model

In the model for related items (see ckan/model/related.py), each related item has:

  • id
  • type (API, application, idea, news article, paper, post or visualization)
  • title
  • description
  • image_url
  • url
  • created date
  • owner_id (the user who created the item)
  • view_count
  • featured (true or false)
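As a rough illustration, the fields above could be written down as a plain Python dataclass. This is only a sketch for discussion, not the actual class in ckan/model/related.py: the Python types and the default values are my assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RelatedItem:
    """Sketch of a related item; types and defaults are assumptions."""
    id: str
    type: str                  # e.g. "idea"; the stored strings are an assumption
    title: str
    description: str = ""
    image_url: str = ""
    url: str = ""
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    owner_id: Optional[str] = None   # the user who created the item
    view_count: int = 0
    featured: bool = False


item = RelatedItem(id="r1", type="idea", title="Map of hospital ratings")
```

Note that nothing here refers to a dataset: the link between an item and its dataset is stored separately (see the technical details below).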

The way the UI and much of the code is implemented currently, each related item is related to exactly one dataset. You can't have an item related to multiple datasets. You can use the API to create items related to no datasets.

The type field seems a bit pointless: there's no actual difference between the different types of related items; they're all just objects with a title, description and link. What if someone wants to add a related item that doesn't come under those 7 types? On the other hand, you can filter related items by type (e.g. to have a page showing just the apps), and having a list of types helps to suggest what sorts of things the feature might be used for. Maybe it just needs a final type "Other"?

/related page (the "related dashboard")

This page is not yet linked to from anywhere in the default CKAN theme.

This page shows all of the related items on the site, paginated. You can filter the related items by type, and also show only featured related items, and you can sort the related items by created date or by view count.

The default sort order is just labelled "Default" and you don't know what the ordering actually is.

All you can do with the related items on this page is click on them, and you'll be taken to the item's URL (i.e. the external URL that the item links to). You can't create or edit or delete related items from this page, you can't see the datasets that the items are related to, and there's no way to get to the dataset pages.

From a user point of view I don't think this page is very well thought through yet. Both the URL and name of this page seem very weird to me: related to what? The datasets that the related items are related to can't be seen or reached from this page. (Also, not every related item is related to a dataset anyway: via the API you can create related items with no datasets. All of the related items on publicdata.eu are this way, for example.)

I think some clarification is needed about what the purpose of this page is / what the use-cases for it are and then we can consider how it should be re-designed.

The page title is "Apps & Ideas" but this is different from the page URL (/related). The feature description in the page sidebar also just talks about apps and ideas:

What are applications?

These are applications built with the datasets as well as ideas for things that could be done with them.

(Btw, in "ideas for things that could be done with them", does "them" refer to the datasets? Or the applications?)

But when you create a related item, there are 7 possible types: API, app, idea, news article, paper, post, visualization, so it's not just apps and ideas.

Also when related items are shown on dataset pages, the name "Apps & Ideas" is not used (see below).

Viewing related items

You can view a list of a dataset's related items on the dataset's page, or a list of all related items on the related dashboard page. Clicking on a related item just redirects you to the item's external URL (e.g. the blog post or whatever), though; related items don't appear to have their own pages.

On the dataset page, the URL ends in /related and the tab is just titled "Related", the button is "Add Related Item", etc. (There's no use of the "Apps & Ideas" title from the dashboard page.) When adding or updating a related item, a third name "related media" appears:

What are related items?

Related Media is any app, article, visualisation or idea related to this dataset.

For example, it could be a custom visualisation, pictograph or bar chart, an app using all or part of the data or even a news story that references this dataset.

This is quite different from the description on the related dashboard page.

If we want to support items that are related to more than one dataset, then related items will probably need to get their own pages, because somewhere we need to show a list of all the datasets that the item is related to.

You can also view individual related items using the API with the related_show action.

Creating related items

It is possible to create a related item that isn't related to any dataset, using the related_create API. In the web interface though, the only way to create a related item is by going to a dataset's related items tab, and then the item will be related to that one dataset.

After creating the item, the user is redirected to the dataset's list of related items.
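To make the API route concrete, here is a sketch of how a related_create request with no dataset might be built. It only constructs the request and does not send it. The site URL, API key and field values are made up, and I'm assuming the usual action API conventions (JSON POST to /api/3/action/..., API key in the Authorization header, an optional dataset_id parameter); check the related_create docstring before relying on the parameter names.

```python
import json
import urllib.request


def build_related_create_request(site_url, api_key, title, type_, url,
                                 dataset_id=None):
    """Build (but don't send) a related_create request."""
    payload = {"title": title, "type": type_, "url": url}
    # Leaving dataset_id out creates an item that isn't related to any dataset.
    if dataset_id is not None:
        payload["dataset_id"] = dataset_id
    return urllib.request.Request(
        site_url.rstrip("/") + "/api/3/action/related_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": api_key},
        method="POST",
    )


request = build_related_create_request(
    "http://ckan.example.org", "hypothetical-api-key",
    "Hospital ratings app", "Application", "http://example.org/app")
```

Passing dataset_id would give the same result as using the form on a dataset's related items tab.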

View counting

Related items have their own view counting feature. Whenever someone clicks on a related item to follow its link, it increments the count (implemented in the controller class).

It looks like there's no limiting (e.g. against the same user repeatedly clicking), and viewing an item over the API doesn't count. I'm not sure whether clicking on an item on the related dashboard page counts, or whether views from custom pages that use the API to show related items (e.g. publicdata.eu's front page) count.

This view counting could maybe be integrated with CKAN's builtin page view tracking feature, although that feature also has problems.
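For the sake of discussion, here's one way the missing limiting could work: count a given visitor's clicks on a given item at most once per time window. This is a hypothetical sketch, not what CKAN does; the ViewCounter class, the one-hour window and the in-memory storage are all my assumptions.

```python
import time


class ViewCounter:
    """Hypothetical rate-limited view counter (in-memory only)."""

    def __init__(self, window_seconds=3600, clock=time.time):
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so it can be unit-tested
        self.counts = {}            # item id -> view count
        self._last_seen = {}        # (item id, visitor id) -> last counted time

    def record_view(self, item_id, visitor_id):
        """Count the view unless this visitor was counted recently."""
        now = self.clock()
        key = (item_id, visitor_id)
        last = self._last_seen.get(key)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_seen[key] = now
        self.counts[item_id] = self.counts.get(item_id, 0) + 1
        return True
```

Because the counter is a plain object and not controller code, the API, the dashboard page and custom pages could all call it the same way.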

Activity streams

Activity streams are created when you create, update or delete a related item. These could do with a little quality control though, e.g. I think I saw "api" spelled in lower-case. (API is also mis-spelled as "Api" in related item tooltips.) Also, what activity streams should these activities appear in? It looks like currently they appear in the user's activity stream, but not the dataset's? They probably don't appear in the group or organization's stream either.

Featured related items

A sysadmin can mark certain related items as "featured". (Not sure whether this is doable in the web UI, or only the API.) The only place this is currently used is on the related dashboard page, where there's a checkbox to show only the featured items.

In extensions you could quite easily do things like show the featured items on the site front page, or have one page showing the featured apps and another page showing featured visualizations, etc. An example extension showing how this can be done might be a good idea.

Authorization

The way the auth functions are currently implemented is:

  • Anyone can see any related item or list of related items. (So private datasets with private related items isn't supported.)

  • Anyone who is logged-in can add a related item to any dataset. (It doesn't seem to matter whether they have permission to edit the dataset, although I didn't test it with organizations.)

  • Only a sysadmin or the "owner" of a related item (the person who created the item) can edit a related item.

  • The owner of a related item can't be changed; it's always the person who created it. (Maybe it can be changed via the API?)

  • Anyone who has permission to delete a dataset can delete an individual related item from the dataset (and this will completely remove the item from the site, not just remove it from the dataset). The creator of a related item can also delete it.

  • Only sysadmins can create featured related items or mark a related item as featured.
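The rules above can be summed up as a few small functions. This is a sketch of the rules as I've described them, not the real auth functions: users and items are plain dicts here, and the can_delete_dataset flag stands in for whatever check CKAN does on the dataset.

```python
def is_sysadmin(user):
    return bool(user and user.get("sysadmin"))


def can_show(user, item):
    # Anyone, logged in or not, can see any related item.
    return True


def can_create(user, data):
    if user is None:
        return False                      # must be logged in
    if data.get("featured") and not is_sysadmin(user):
        return False                      # only sysadmins can feature items
    return True                           # dataset permissions are not checked


def can_update(user, item):
    if user is None:
        return False
    return is_sysadmin(user) or user["id"] == item["owner_id"]


def can_delete(user, item, can_delete_dataset=False):
    if user is None:
        return False
    return can_delete_dataset or user["id"] == item["owner_id"]
```

Written this way, it's easy to see that nothing depends on whether the dataset is private or on the user's role in the dataset's organization.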

Technical details

Model

There's a separate table related_dataset that maps related items to datasets, so in theory the model supports a many-many relationship, but I think the rest of the implementation only allows a related item to be related to one dataset. (And allowing items to be related to multiple datasets would probably raise a lot of questions about authorization and user interface.)
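Here's a small sqlite3 sketch of what the mapping table makes possible. The table layouts are simplified and the column names are my assumptions (the real tables have more columns); the point is just that both directions of the many-many relationship are a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE related (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE related_dataset (related_id TEXT, dataset_id TEXT);
INSERT INTO related VALUES ('r1', 'Hospital ratings app');
INSERT INTO related_dataset VALUES ('r1', 'dataset-a');
INSERT INTO related_dataset VALUES ('r1', 'dataset-b');
""")


def datasets_for_item(conn, related_id):
    """Go from a related item to all of its datasets."""
    rows = conn.execute(
        "SELECT dataset_id FROM related_dataset "
        "WHERE related_id = ? ORDER BY dataset_id", (related_id,))
    return [row[0] for row in rows]


def items_for_dataset(conn, dataset_id):
    """Go from a dataset to all of its related items."""
    rows = conn.execute(
        "SELECT related.id FROM related "
        "JOIN related_dataset ON related.id = related_dataset.related_id "
        "WHERE related_dataset.dataset_id = ? ORDER BY related.id",
        (dataset_id,))
    return [row[0] for row in rows]
```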

I think we should move more code into the model, e.g. methods for returning lists of related items filtered by dataset, type, featured, etc. These methods can then be unit-tested in the model, and the logic can just call the model methods instead of doing its own sqlalchemy.

Controller

dashboard()

The related dashboard page is implemented by the dashboard() method in the related controller. It calls the related_list() action function to actually get the related items to show. It looks like the pagination is done in the controller. This should be moved into the action function so that the API supports pagination and so that it can be tested more easily.
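A sketch of what pagination inside the action layer might look like. The function name, the offset and limit parameters and the shape of the return value are all hypothetical; related_list() doesn't currently take these parameters.

```python
def related_list_page(items, offset=0, limit=20):
    """Return one page of items plus the total count (hypothetical)."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    return {
        "count": len(items),                      # total, for page links
        "results": items[offset:offset + limit],  # just this page
    }
```

The controller would then only need to turn a page number into an offset, and API clients would get the same paging for free.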

read()

The related controller's read() method (which redirects the browser to the related item's external URL) accesses the model directly to get the related item. It should be going through the related_show() action function. It also does its own call to check_access() in the controller; again, related_show() should be doing this, with the controller just calling related_show(). The view counting feature should not be implemented in the controller either.

_edit_or_new()

The related controller's _edit_or_new() method does its own check_access() call. That should be done by the related_create() action (there shouldn't be any auth stuff in controllers).

There's some other weird stuff going on in the controller here too, like some unflattening and tuplizing and putting a package in c.pkg_dict that may not be used anywhere.

It shows a flash message after creating the item, which I don't think we normally do in CKAN. Maybe it shouldn't be there? (Updating and deleting related items also do this.)

The method docstring of the related controller's _edit_or_new() method looks more like a git commit message.

It adds "related" and "id" to the context at the end, not sure why.

Actions

related_list

The related_list() action function (returns a list of related items, optionally filtered by dataset, type and whether featured) is for some reason putting a package in c.pkg and c.pkg_dict, but I'm not sure if this is actually used (and it doesn't seem like something an action function should do).

related_list() can accept either a dataset ID (in a param named id, which would be clearer if it was called dataset_id) or a dataset dict (in a param called dataset). If the dataset param is not present, it falls back to the id param. This precedence isn't documented or tested. This seems unnecessary to me, much simpler to accept just the id.
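The precedence described above boils down to something like this. It's a sketch of the behaviour as I read it, not the actual code, and resolve_dataset_id is a made-up helper name.

```python
def resolve_dataset_id(data_dict):
    """Return the dataset id to filter by, or None for "all items"."""
    dataset = data_dict.get("dataset")
    if dataset is not None:
        return dataset["id"]      # the dataset dict wins if present
    return data_dict.get("id")    # otherwise fall back to the id param
```

A test for this would take only a few lines, which is part of why I think it should either be documented and tested or dropped in favour of just the id.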

If no dataset id or dict is given, it returns all related items, but I don't think this is documented either. Also it looks like all the sorting and filtering features are only applied when returning all related items, and not when returning a dataset's related items (again not mentioned in the docstring).

related_list() calls check_access('related_show'), i.e. it calls the related_show auth function. It should have its own related_list auth function.

It doesn't appear to do much validation of params, e.g. what happens if the given dataset dict or id is invalid?

It also makes a JSON dump of all of the package's resources for some reason?

related_create

After calling model_save.related_dict_save() the related_create() action seems to do its own thing to append the related item to the dataset's list of related items. Should this be done in model_save?

Schemas

The related_show() action uses the default_related_schema(), which is also used by related_create and related_update. I think each action should have its own schema: default_related_show_schema(), default_related_create_schema(), default_related_update_schema(). If they're all the same, then make a _default_related_schema() helper function and have them all call it.
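Roughly what I have in mind for the schemas. The two validators below are simplified stand-ins so the sketch runs on its own (CKAN's real validators live elsewhere and have different signatures), and the choice of which fields are required is an assumption.

```python
def not_empty(value):
    """Stand-in validator: reject missing or empty values."""
    if not value:
        raise ValueError("Missing value")
    return value


def ignore_missing(value):
    """Stand-in validator: accept anything, including nothing."""
    return value


def _default_related_schema():
    # Shared by all three actions for as long as their schemas are identical.
    return {
        "id": [ignore_missing],
        "title": [not_empty],
        "description": [ignore_missing],
        "type": [not_empty],
        "image_url": [ignore_missing],
        "url": [ignore_missing],
        "featured": [ignore_missing],
    }


def default_related_show_schema():
    return _default_related_schema()


def default_related_create_schema():
    return _default_related_schema()


def default_related_update_schema():
    return _default_related_schema()
```

Each function returns a fresh dict, so one action's schema can later diverge from the others without touching the callers.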

In the case of related items, I'm not sure I see the point of validating the data coming out of the database anyway, since I don't think any conversions are done. With packages and resources for example, the schemas can be customized by IDatasetForm and the custom schemas may include converter functions like convert_from_tags() and convert_from_extras(), so that's why data coming out of the database in package_show() needs to be converted/validated. But in related_show(), this seems pointless?


It looks like user_dictize(), if passed the 'with_related' option in the context, will also dictize all of the user's related items. When showing a user profile page, the user controller does pass this option. The dictized items are never used though.

I think this (and several other instances of odd, apparently unused stuff being put into contexts and Pylons template contexts by various related items functions) shows why we should be avoiding things like the template context, and instead using template helper functions. These context variables were presumably once used by the templates, but the templates have since been changed and no longer use them, so they're wasting CPU cycles and cluttering up the code. If the old templates had been calling helper functions instead, those helper function calls would have been deleted along with the old template code.

Some of these template context variables may be used by the legacy templates (and the tests!) but not used in the new templates.

Tests

There are related items tests in tests/functional/test_related.py, tests/functional/api/test_activity.py and tests/logic/test_action.py. I haven't looked into these, but it would probably be really good to write a complete set of new-style tests for the feature and delete all of these old ones.

New implementation

Aims

The "related items" feature (to be renamed, probably) is about re-uses of data, e.g. apps, visualizations, stories, etc. that use the data from a CKAN site.
seanh: It may not have been clear previously that "related items" were only meant for reuses of data. You might have posted, for example, a document about how the data was collected as a related item. In this new spec, we've decided to make the feature specifically about data reuses. We should probably rename the feature to "data reuses" or something to that effect.

We want to let:

  1. Site maintainers promote data reuses on their CKAN sites
  2. Data reusers relate their data reuses to the relevant datasets within the CKAN site
  3. Site visitors search for and find valuable reuses
  4. Site admins and dataset owners moderate reuses
  5. Site admins and dataset owners showcase the best reuses

Note that site admins or organization admins may add their own data reuses to the site, so they may be playing the role of data reusers as well.

User stories

Note: the format of a user story is:

As a ROLE, I want to DESIRE so that BENEFIT

seanh: I've removed several phrases like "on dataset page", "site wide & on dataset page", etc. from the user stories below because I think that's a concrete user interface decision that probably doesn't belong in the user stories. But let it be known that Ira would like data reuses to appear both on dataset pages and on a site-wide marketplace page!

Data reuser user stories

  • As a data reuser I want to show the cool things I've done with a site's data so that my reuses reach a wider audience.

  • As a reuser I want to be able to associate multiple datasets with each of my data reuses so that I can represent all of the datasets that each of my reuses uses.

  • As a reuser I want to be able to associate my data reuse with multiple datasets so that I can showcase my reuse in all the relevant contexts.

seanh: The last two user stories are very similar but I think there are two different cases here. The first one is: someone is looking at a data reuse, and they want to see what datasets it was made from. So going from the reuse to the datasets. The second one is: someone is looking at a dataset (or multiple datasets, e.g. a group, organization, tag, dataset search result...), and they want to see what reuses have been made from that/those dataset(s). So going from the dataset(s) to the data reuse(s). Together, these two user stories imply a many-many relationship between datasets and data reuses.

Site visitor user stories

  • As a visitor I want to see any re-uses that have been made with this/these dataset(s) since they may be more useful to me than just the data itself.

seanh: This could apply to the page for an individual dataset (i.e. a dataset's reuses should be shown somewhere on the dataset's page), but it could also apply to any pages where multiple datasets are listed, e.g. group and organization pages, dataset search results.

  • As a visitor, I want to be able to see what datasets this data reuse was made from, so that... ?

  • As a visitor I want to see what data re-uses exist so that I can get ideas for myself or because I'm looking for interesting re-uses.

  • As a visitor I want to be able to search through all the data re-uses so that I can see if what I need (e.g. an app about hospital ratings in London) already exists.

Data publisher user stories

  • As a data publisher I want to see what reuses have been made from my data so that I can see what value is being created from my data.

  • As an organization or group admin, I want to see all the reuses that have been made of the datasets in my organization or group so that I can see how valuable and interesting the data is to re-users and be motivated to open up more data in better quality!

seanh: Those last two are kind of the same user story, I think one of them was meant to be about showing data reuses on the dataset pages, the other about showing them on the group and organization pages.

  • As an organization admin I want to be able to delete data reuses from my datasets if they are irrelevant or spam so that they don't pollute my dataset pages.

  • As an organization admin I want to be able to moderate new data reuses that are added to my datasets in a queue where I can approve or delete the items so that I can quality check any additions.

seanh: Presumably they will also need to moderate data reuse edits. What about deletions?

  • As a sysadmin I want a way to moderate related item additions centrally for all items being added to the portal.

seanh: ...so that? I don't have to visit every organization's page one-by-one? It seems like the data reuse moderation page for a given user should show a queue of all the data reuses that user can moderate. If the user is an admin of multiple organizations, they'd see reuses from each of those orgs. If they're a sysadmin, they'd see all reuses.
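seanh: The per-user queue idea could be sketched like this. The data shapes are hypothetical: each pending reuse carries the ids of the organizations that own its datasets (org_ids), and each user carries the ids of the organizations they administer (admin_of).

```python
def moderation_queue(user, pending_reuses):
    """Return the pending reuses this user is allowed to moderate."""
    if user.get("sysadmin"):
        return list(pending_reuses)               # sysadmins see everything
    admin_orgs = set(user.get("admin_of", ()))
    return [reuse for reuse in pending_reuses
            if admin_orgs & set(reuse["org_ids"])]  # any shared organization
```

Note this lets an admin of any one of a reuse's organizations moderate it, which is one possible policy among several, not a decision.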

Specific changes from current state

This is an attempt to list the specific changes and additions that will get us from the current implementation to one that meets the aims and user stories above. It might not be complete or correct, feel free to add, edit or comment.

  • Various technical refactorings, tests, etc. (See the description of the current implementation above.)

  • Allow multiple datasets to be related to a single related item.

    • This needs to be added to the create/edit related item form
    • Note that we may need to reconsider how things like user interface navigation and authorization to delete or edit related items works, now that a related item doesn't "belong to" just one dataset. Just something to keep in mind. (I haven't thought it through yet.)
  • Somehow show which dataset(s) a related item is related to.

    This is currently done by showing the related item on a tab within its dataset's pages. On the site-wide /related page, the datasets that the related items are related to are not shown. (So you can navigate from a dataset to its related items, but you can't go from a related item to its datasets.)

    We'll need to figure out how to support finding a related item's datasets from the /related page and from other non-dataset pages that show lists of related items (group pages, organization pages, see below).

    Since we're now going to have items related to multiple datasets, we need to somewhere somehow show a list of the datasets for a related item.

    Do individual related items need to have their own pages to show this? (Currently clicking on a related item just redirects you to the item's external URL, individual related items don't have their own pages in CKAN.)

  • Interface for related items to be added centrally and associated with multiple datasets.

    Currently you have to go to a particular dataset's page and click the add related item button to add an item to that dataset. Ira wants a general "add a related item to this site" button somewhere, that takes you to a form where you can select multiple datasets that the item should be added to.

  • Show related items on group and organization pages.

    I guess this means adding a tab to the group and org pages, listing all of the related items related to any of the group or org's datasets. I guess it would look much like the related items pages that datasets already have.

    One thing that comes to mind is that a group or org may have a much larger number of related items than a dataset has, so their pages may need extra features like search and pagination that the dataset pages don't have.

    This would mean adding new API actions to get all the related items of a group or organization.

  • Implement searching for related items.

    Ira: Add a search bar and filters etc. across the top of the /related page.

    seanh: Or can we somehow show related items in /dataset search results instead? Then we might not need the /related page.

    seanh: Either way this is probably not trivial to implement.

  • Implement moderation of related item creations and edits.

    Ira suggests some sort of moderation queue where people can approve or delete items.

    I think we need to answer the question: who can moderate a given related item creation or edit?
    This is not trivial, for example:

    • We need to pin down who the "owners" of a dataset are. I guess it's anyone who can edit that dataset? So currently by default, I think that's any editors or admins of the dataset's organization.

    • A new related item is created, and related to 3 datasets that belong to different organizations. Who gets to moderate this? What if someone from organization A approves the item, and it then gets added to organizations B and C as well?

    • Same question as above when someone edits a related item and changes what datasets it belongs to.

    • Even if someone just edits a related item and doesn't change its list of datasets, they may completely change the title, description, image and link to something else. If this change is then approved by a moderator from organization A, members of organizations B and C may see one of their related items being changed to something they don't want.

    • What if someone has edited a related item and removed it from a dataset that I own? Do I get to moderate that edit, and prevent them from removing it?

    • What if someone changes the description of a related item, and in the same edit also changes what datasets it belongs to, and then a moderator who doesn't want it on their dataset rejects the change? Is the edited description also lost?

    • It seems like maybe related items should not be related to datasets, but rather datasets should be related to related items. So by editing a related item, you can edit its title, description, etc. But if you want to add an item to or remove an item from a dataset, you need to edit the dataset not the related item.

      That might solve a lot of the moderation questions above, but I think it may raise UI questions as creators of related items may obviously want to manage the list of datasets that an item is related to from the item, and not have to visit every dataset in turn.

      Also we would need a way for people who are not allowed to edit a dataset to be able to submit related items to the dataset, to be moderated by the dataset owners.

  • Where will the moderation interface go, and how will it be designed?

    From the user stories we want org admins to be able to moderate any related item in their org, but also sysadmins to be able to moderate any related items at all, site-wide. Also a user may be an admin of more than one org. So it seems like the page with the moderation queue belongs in a tab on the user dashboard, rather than in a tab on the organization page? Then it can just show a queue of all items that user is authorized to moderate.

  • Implement new authorization logic. Who is allowed to create related items (related to which datasets)? Who is allowed to edit them? Which changes need to be moderated, and who can moderate them? etc.

  • Allow editing/deleting of related items from anywhere when you have the auth to do so (i.e. if you are owner of the related item, or sysadmin)

    i.e. wherever related items are listed (on the /related page, on group or org pages...) there should be an edit button for each item (that you are allowed to edit).

    If related items are going to get their own individual pages, then the edit buttons could go there, which might be better than cluttering up pages that are showing lists of related items.

  • Hide related items that are associated only with private datasets

    If the item is not related to any dataset that the user is allowed to see, then hide the item from the user?

    If the user can see an item because it's related to a dataset the user can see, but the item is also related to some private datasets that the user can't see, presumably we should hide those datasets from the item's list of related datasets.
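The two hiding rules in the last point could be sketched as a single function. The data shapes are hypothetical (an item dict with a dataset_ids list, and a set of the dataset ids the current user may see). I've also assumed that items related to no datasets at all stay visible, which the rules above don't actually settle.

```python
def visible_view(item, visible_dataset_ids):
    """Return the item as this user should see it, or None to hide it."""
    visible = [d for d in item["dataset_ids"] if d in visible_dataset_ids]
    if item["dataset_ids"] and not visible:
        return None               # related only to datasets the user can't see
    # Copy the item, keeping only the datasets the user is allowed to see.
    return dict(item, dataset_ids=visible)
```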
