Spec: Related Items
This section summarizes the current (CKAN 2.2, Feb 2014) implementation of related items, both from the user's point of view and in its technical details. The point is to get an idea of what the feature currently does and of any problems with the current implementation.
Note that related items is not currently implemented as an extension! It's a core feature, and always on. Maybe it should be moved to an extension?
In the model for related items (see ckan/model/related.py), each related item
has:
- id
- type (API, application, idea, news article, paper, post or visualization)
- title
- description
- image_url
- url
- created date
- owner_id (the user who created the item)
- view_count
- featured (true or false)
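As a rough picture of that record, here is a minimal sketch of the fields as a plain Python dataclass. It is not the actual class in ckan/model/related.py: the class name, the attribute name `created`, the defaults and the string spellings of the seven types are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# The seven types listed above. The exact string spellings are assumptions.
RELATED_TYPES = (
    "api", "application", "idea", "news_article",
    "paper", "post", "visualization",
)


@dataclass
class RelatedItem:
    """Illustrative stand-in for a related item record (not CKAN's model class)."""
    id: str
    type: str
    title: str
    description: Optional[str] = None
    image_url: Optional[str] = None
    url: Optional[str] = None
    # "created date" in the list above; the attribute name is an assumption.
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    owner_id: Optional[str] = None  # the user who created the item
    view_count: int = 0
    featured: bool = False


item = RelatedItem(id="r1", type="application", title="Hospital ratings app")
```

Note that nothing here ties the item to a dataset; that link lives outside the item record itself.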
The way the UI and much of the code is implemented currently, each related item is related to exactly one dataset. You can't have an item related to multiple datasets. You can use the API to create items related to no datasets.
The type field seems a bit pointless: there's no actual difference between the different types of related items; they're all just objects with a title, description and link. What if someone wants to add a related item that doesn't come under those 7 types? On the other hand, you can filter related items by type (e.g. to have a page showing just the apps), and having a list of types helps to suggest what sorts of things the feature might be used for. Maybe it just needs a final type "Other"?
The related dashboard page (/related) is not yet linked to from anywhere in the default CKAN theme.
This page shows all of the related items on the site, paginated. You can filter the related items by type, and also show only featured related items, and you can sort the related items by created date or by view count.
The default sort order is just labelled "Default" and you don't know what the ordering actually is.
All you can do with the related items on this page is click on them, and you'll be taken to the item's URL (i.e. the external URL that the item links to). You can't create or edit or delete related items from this page, you can't see the datasets that the items are related to, and there's no way to get to the dataset pages.
From a user point of view I don't think this page is very well thought through yet. Both the URL and name of this page seem very weird to me: related to what? The datasets that the related items are related to can't be seen or reached from this page. (Also, not every related item is related to a dataset anyway: via the API you can create related items with no datasets, and all of the related items on publicdata.eu are this way, for example.)
I think some clarification is needed about what the purpose of this page is / what the use-cases for it are and then we can consider how it should be re-designed.
The page title is "Apps & Ideas" but this is different from the page URL
(/related). The feature description in the page sidebar also just talks about
apps and ideas:
What are applications?
These are applications built with the datasets as well as ideas for things that could be done with them.
(Btw, in "ideas for things that could be done with them", does "them" refer to the datasets? Or the applications?)
But when you create a related item, there are 7 possible types: API, app, idea, news article, paper, post, visualization, so it's not just apps and ideas.
Also when related items are shown on dataset pages, the name "Apps & Ideas" is not used (see below).
You can view a list of a dataset's related items on the dataset's page, or a list of all related items on the related dashboard page. Clicking on a related item just redirects you to the item's external URL (e.g. the blog post or whatever), though; related items don't appear to have their own pages.
On the dataset page, the URL ends in /related and the tab is just titled
"Related", the button is "Add Related Item", etc. (There's no use of the "Apps
& Ideas" title from the dashboard page.) When adding or updating a related
item, a third name "related media" appears:
What are related items?
Related Media is any app, article, visualisation or idea related to this dataset.
For example, it could be a custom visualisation, pictograph or bar chart, an app using all or part of the data or even a news story that references this dataset.
This is quite different from the description on the related dashboard page.
If we want to support items that are related to more than one dataset, then related items will probably need to get their own pages, because somewhere we need to show a list of all the datasets that the item is related to.
You can also view individual related items using the API with the
related_show action.
It is possible to create a related item that isn't related to any dataset,
using the related_create API. In the web interface though, the only way to
create a related item is by going to a dataset's related items tab, and then
the item will be related to that one dataset.
After creating the item, the user is redirected to the dataset's list of related items.
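To make the API route concrete, here is a minimal sketch that builds (but does not send) action API requests for related_show and related_create. The site URL, item id, field values and API key are placeholders. The `/api/3/action/` path and the `Authorization` header follow the usual action API conventions, but check them against your CKAN version.

```python
import json
import urllib.request


def build_action_request(site_url, action, data, api_key=None):
    """Build a POST request for a CKAN action API call without sending it."""
    url = "{}/api/3/action/{}".format(site_url.rstrip("/"), action)
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = api_key
    body = json.dumps(data).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Look up a single related item by id (placeholder id).
show = build_action_request(
    "http://ckan.example.org", "related_show", {"id": "some-related-id"}
)

# Create a related item that is not related to any dataset: no dataset
# identifier is included in the payload.
create = build_action_request(
    "http://ckan.example.org",
    "related_create",
    {"title": "Hospital ratings app", "type": "application",
     "url": "http://apps.example.org/hospitals"},
    api_key="my-api-key",
)

# urllib.request.urlopen(create) would actually send the request.
```

Relating the new item to a dataset over the API would mean adding a dataset identifier to the payload; the exact parameter name is left out here rather than guessed.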
Related items have their own view counting feature. Whenever someone clicks on a related item to follow its link, it increments the count (implemented in the controller class).
It looks like there's no limiting (e.g. of the same user repeatedly clicking), and viewing an item over the API doesn't count. I'm not sure whether clicking on an item on the related dashboard page counts, or clicking on an item on custom pages that use the API to show related items (e.g. publicdata.eu's front page).
This view counting could maybe be integrated with CKAN's builtin page view tracking feature, although that feature also has problems.
Activity streams are created when you create, update or delete a related item. These could do with a little quality control though, e.g. I think I saw "api" spelled in lower-case. (API is also mis-spelled as "Api" in related item tooltips.) Also, what activity streams should these activities appear in? It looks like currently they appear in the user's activity stream, but not the dataset's? They probably don't appear in the group or organization's stream either.
A sysadmin can mark certain related items as "featured". (Not sure whether this is doable in the web UI, or only the API.) The only place this is currently used is on the related dashboard page, where there's a checkbox to show only the featured items.
In extensions you could quite easily do things like show the featured items on the site front page, or have one page showing the featured apps and another page showing featured visualizations, etc. An example extension showing how this can be done might be a good idea.
The way the auth functions are currently implemented is:
- Anyone can see any related item or list of related items. (So private datasets with private related items aren't supported.)
- Anyone who is logged-in can add a related item to any dataset. (It doesn't seem to matter whether they have permission to edit the dataset, although I didn't test it with organizations.)
- Only a sysadmin or the "owner" of a related item (the person who created the item) can edit a related item.
- The owner of a related item can't change, it's always the person who created it. (Maybe it can be changed via the API?)
- Anyone who has permission to delete a dataset can delete an individual related item from the dataset (and this will completely remove the item from the site, not just remove it from the dataset). The creator of a related item can also delete it.
- Only sysadmins can create featured related items or mark a related item as featured.
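The rules above can be restated as a handful of pure functions. This is only a sketch of the rules as described here, not CKAN's actual auth functions: the `User` and `Item` classes, the function names and the `can_delete_dataset` flag are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    name: str
    sysadmin: bool = False


@dataclass
class Item:
    owner: str            # name of the user who created the item
    featured: bool = False


def can_show(user: Optional[User], item: Item) -> bool:
    # Anyone, logged in or not, can see any related item.
    return True


def can_create(user: Optional[User], featured: bool = False) -> bool:
    # Any logged-in user can add an item; only sysadmins can make it featured.
    if user is None:
        return False
    return user.sysadmin if featured else True


def can_update(user: Optional[User], item: Item) -> bool:
    # Only a sysadmin or the item's owner can edit it.
    return user is not None and (user.sysadmin or user.name == item.owner)


def can_delete(user: Optional[User], item: Item, can_delete_dataset: bool) -> bool:
    # can_delete_dataset: whether the user may delete the dataset the item
    # belongs to (computed elsewhere; hypothetical input).
    return user is not None and (can_delete_dataset or user.name == item.owner)
```

Writing the rules out this way makes the gaps easier to see, for example that `can_create` never looks at the dataset at all.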
There's a separate table related_dataset that maps related items to datasets,
so in theory the model supports a many-many relationship, but I think the rest
of the implementation only allows a related item to be related to one dataset.
(And allowing items to be related to multiple datasets would probably raise a
lot of questions about authorization and user interface.)
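To illustrate what the mapping table allows in principle, here is a small sketch using an in-memory SQLite database. The column names and the trimmed-down `related` table are assumptions, not the real CKAN schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE related (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL
);
-- Mapping table: one row per (related item, dataset) pair.
CREATE TABLE related_dataset (
    related_id TEXT NOT NULL REFERENCES related(id),
    dataset_id TEXT NOT NULL,
    PRIMARY KEY (related_id, dataset_id)
);
""")
conn.executemany(
    "INSERT INTO related VALUES (?, ?)",
    [("r1", "Hospital ratings app"), ("r2", "Idea with no dataset")],
)
# r1 is related to two datasets; r2 is related to none.
conn.executemany(
    "INSERT INTO related_dataset VALUES (?, ?)",
    [("r1", "d1"), ("r1", "d2")],
)


def datasets_for(conn, related_id):
    """Dataset ids that a related item is mapped to."""
    rows = conn.execute(
        "SELECT dataset_id FROM related_dataset "
        "WHERE related_id = ? ORDER BY dataset_id",
        (related_id,),
    )
    return [row[0] for row in rows]


def related_for(conn, dataset_id):
    """Related item ids mapped to a dataset."""
    rows = conn.execute(
        "SELECT r.id FROM related r "
        "JOIN related_dataset rd ON rd.related_id = r.id "
        "WHERE rd.dataset_id = ? ORDER BY r.id",
        (dataset_id,),
    )
    return [row[0] for row in rows]
```

The table shape supports both lookups, as well as items with no datasets; it is the UI and action layer that restrict an item to one dataset.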
I think we should move more code into the model, e.g. methods for returning lists of related items filtered by dataset, type, featured, etc. These methods can then be unit-tested in the model, and the logic can just call the model methods instead of doing its own SQLAlchemy queries.
The related dashboard page is implemented by the dashboard() method in the
related controller. It calls the related_list() action function to actually
get the related items to show. It looks like the pagination is done in the
controller; this should be moved into the action function so that the API
supports pagination and so that it can be tested more easily.
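As a sketch of what that could look like, the function below does filtering, sorting and pagination in one place, so that a controller would only pass parameters through. It works on plain dicts rather than model objects, and the function name, parameter names, sort keys and return shape are all hypothetical rather than the current related_list() signature.

```python
def list_related(items, type_filter=None, featured_only=False,
                 sort=None, offset=0, limit=20):
    """Filter, sort and paginate a list of related item dicts."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")

    selected = [
        item for item in items
        if (type_filter is None or item["type"] == type_filter)
        and (not featured_only or item["featured"])
    ]

    # Hypothetical sort keys; unknown values leave the input order unchanged.
    if sort == "view_count_desc":
        selected.sort(key=lambda item: item["view_count"], reverse=True)
    elif sort == "created_desc":
        selected.sort(key=lambda item: item["created"], reverse=True)

    # Return the total count alongside the page so callers can build pagers.
    return {"count": len(selected), "results": selected[offset:offset + limit]}


items = [
    {"id": "a", "type": "application", "featured": True, "view_count": 5, "created": "2014-01-03"},
    {"id": "b", "type": "idea", "featured": False, "view_count": 9, "created": "2014-01-02"},
    {"id": "c", "type": "application", "featured": False, "view_count": 7, "created": "2014-01-01"},
]
page = list_related(items, type_filter="application", sort="view_count_desc", limit=1)
```

Because the function takes and returns plain data, it can be unit-tested without a running web app.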
The related controller's read() method (which redirects the browser to the
related item's external URL) accesses the model directly to get the related
item. It should be going through the related_show() action function. It also
does its own call to check_access() in the controller; again,
related_show() should be doing this and the controller should just call
related_show(). The view counting feature should not be implemented in the
controller either.
The related controller's _edit_or_new() method does its own check_access()
call; that should be done by the related_create() action (there shouldn't be
any auth stuff in controllers).
There's some other weird stuff going on in the controller here too, like some
unflattening and tuplizing and putting a package in c.pkg_dict that may not
be used anywhere.
It shows a flash message after creating the item, which I don't think we normally do in CKAN. Maybe it shouldn't be there? (Updating and deleting related items also do this.)
The method docstring of the related controller's _edit_or_new() method looks
more like a git commit message.
It adds "related" and "id" to the context at the end, not sure why.
The related_list() action function (returns a list of related items,
optionally filtered by dataset, type and whether featured) is for some reason
putting a package in c.pkg and c.pkg_dict, but I'm not sure if this is
actually used (and it doesn't seem like something an action function should do).
related_list() can accept either a dataset ID (in a param named id, which
would be clearer if it was called dataset_id) or a dataset dict (in a param
called dataset). If the dataset param is not present, it falls back to the
id param. This precedence isn't documented or tested. Accepting both seems
unnecessary to me; it would be much simpler to accept just the id.
If no dataset id or dict is given, it returns all related items, but I don't think this is documented either. Also it looks like all the sorting and filtering features are only applied when returning all related items, and not when returning a dataset's related items (again not mentioned in the docstring).
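The parameter precedence described above can be written down in a few lines, which also gives something to test against. This is a sketch of the behaviour as described here, not the actual related_list() code; the helper name is hypothetical, and it assumes the dataset dict carries its identifier under the key "id".

```python
def resolve_dataset_id(data_dict):
    """Return the dataset id a related_list-style call refers to, or None.

    Precedence as described above: the "dataset" dict wins if present,
    otherwise fall back to the "id" param. None means "no dataset given",
    i.e. list all related items.
    """
    dataset = data_dict.get("dataset")
    if dataset is not None:
        return dataset["id"]  # assumption: the dict stores its id under "id"
    return data_dict.get("id")
```

Accepting only the id, as suggested above, would reduce this to a single `data_dict.get("id")`.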
related_list() calls check_access('related_show'), i.e. it calls the
related_show auth function. It should have its own related_list auth
function.
It doesn't appear to do much validation of params, e.g. what happens if the given dataset dict or id is invalid?
It also makes a JSON dump of all of the package's resources for some reason?
After calling model_save.related_dict_save() the related_create() action
seems to do its own thing to append the related item to the dataset's list of
related items. Should this be done in model_save?
The related_show() action uses the default_related_schema(), which is also
used by related_create and related_update. I think each action should have
its own schema: default_related_show_schema(),
default_related_create_schema(), default_related_update_schema(). If
they're all the same, then make a _default_related_schema() helper function
and have them all call it.
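A minimal sketch of that layout follows. The two validators are simplified stand-ins with a one-argument signature, not CKAN's real validator functions, and the choice of which fields are required is an assumption.

```python
def not_empty(value):
    # Stand-in validator: reject missing or empty values.
    if value is None or value == "":
        raise ValueError("Missing value")
    return value


def ignore_missing(value):
    # Stand-in validator: accept anything, including missing values.
    return value


def _default_related_schema():
    """Shared base schema; returns a fresh dict on every call."""
    return {
        "id": [ignore_missing],
        "title": [not_empty],
        "description": [ignore_missing],
        "type": [not_empty],
        "image_url": [ignore_missing],
        "url": [ignore_missing],
        "owner_id": [ignore_missing],
        "featured": [ignore_missing],
    }


def default_related_show_schema():
    return _default_related_schema()


def default_related_create_schema():
    return _default_related_schema()


def default_related_update_schema():
    return _default_related_schema()
```

Since each function returns its own dict, one action's schema can later diverge (say, update requiring an id) without touching the others.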
In the case of related items, I'm not sure I see the point of validating the
data coming out of the database anyway, since I don't think any conversions are
done. With packages and resources for example, the schemas can be customized by
IDatasetForm and the custom schemas may include converter functions like
convert_from_tags() and convert_from_extras(), so that's why data coming
out of the database in package_show() needs to be converted/validated. But in
related_show(), this seems pointless?
It looks like user_dictize(), if passed the 'with_related' option in the
context, will also dictize all of the user's related items. When showing a
user profile page, the user controller does pass this option. The dictized
items are never used though.
I think this (and several other instances of odd, apparently unused stuff being put into contexts and Pylons template contexts by various related items functions) shows why we should be avoiding things like the template context, and instead using template helper functions. These context variables were presumably once used by the templates, but the templates have since been changed and no longer use them, so they're wasting CPU cycles and cluttering up the code. If the old templates had been calling helper functions instead, those helper function calls would have been deleted along with the old template code.
Some of these template context variables may be used by the legacy templates (and the tests!) but not used in the new templates.
There are related items tests in tests/functional/test_related.py,
tests/functional/api/test_activity.py and tests/logic/test_action.py. I
haven't looked into these but it would probably be really good to write a
complete set of new-style tests for the feature and delete all of these old ones.
The "related items" feature (to be renamed, probably) is about re-uses of data, e.g. apps, visualizations, stories, etc. that use the data from a CKAN site.
seanh: It may not have been clear previously that "related items" were only meant for reuses of data. You might have posted, for example, a document about how the data was collected as a related item. In this new spec, we've decided to make the feature specifically about data reuses. We should probably rename the feature to "data reuses" or something to that effect.
We want to let:
- Site maintainers promote data reuses on their CKAN sites
- Data reusers relate their data reuses to the relevant datasets within the CKAN site
- Site visitors search for and find valuable reuses
- Site admins and dataset owners moderate reuses
- Site admins and dataset owners showcase the best reuses
Note that site admins or organization admins may add their own data reuses to the site, so they may be playing the role of data reusers as well.
Note: the format of a user story is:
As a ROLE, I want to DESIRE so that BENEFIT
seanh: I've removed several phrases like "on dataset page", "site wide & on dataset page", etc. from the user stories below because I think that's a concrete user interface decision that probably doesn't belong in the user stories. But let it be known that Ira would like data reuses to appear both on dataset pages and on a site-wide marketplace page!
- As a data reuser I want to show the cool things I've done with a site's data so that my reuses reach a wider audience.
- As a reuser I want to be able to associate multiple datasets with each of my data reuses so that I can represent all of the datasets that each of my reuses uses.
- As a reuser I want to be able to associate my data reuse with multiple datasets so that I can showcase my reuse in all the relevant contexts.
seanh: The last two user stories are very similar but I think there are two different cases here. The first one is: someone is looking at a data reuse, and they want to see what datasets it was made from. So going from the reuse to the datasets. The second one is: someone is looking at a dataset (or multiple datasets, e.g. a group, organization, tag, dataset search result...), and they want to see what reuses have been made from that/those dataset(s). So going from the dataset(s) to the data reuse(s). Together, these two user stories imply a many-many relationship between datasets and data reuses.
- As a visitor I want to see any re-uses that have been made with this/these dataset(s) since they may be more useful to me than just the data itself.
seanh: This could apply to the page for an individual dataset (i.e. a dataset's reuses should be shown somewhere on the dataset's page), but it could also apply to any pages where multiple datasets are listed, e.g. group and organization pages, dataset search results.
- As a visitor, I want to be able to see what datasets this data reuse was made from, so that... ?
- As a visitor I want to see what data re-uses exist so that I can get ideas for myself or because I'm looking for interesting re-uses.
- As a visitor I want to be able to search through all the data re-uses so that I can see if what I need (e.g. an app about hospital ratings in London) already exists.
- As a data publisher I want to see what reuses have been made from my data so that I can see what value is being made of my data.
- As an organization or group admin, I want to see all the reuses that have been made of the datasets in my organization or group so that I can see how valuable and interesting the data is to re-users and be motivated to open up more data in better quality!
seanh: Those last two are kind of the same user story, I think one of them was meant to be about showing data reuses on the dataset pages, the other about showing them on the group and organization pages.
- As an organization admin I want to be able to delete data reuses from my datasets if they are irrelevant or spam so that they don't pollute my dataset pages.
- As an organization admin I want to be able to moderate new data reuses that are added to my datasets in a queue where I can approve or delete the items so that I can quality check any additions.
seanh: Presumably they will also need to moderate data reuse edits. What about deletions?
- As a sysadmin I want a way to moderate related item additions centrally for all items being added to the portal.
seanh: ...so that? I don't have to visit every organization's page one-by-one? It seems like the data reuse moderation page for a given user should show a queue of all the data reuses that user can moderate. If the user is an admin of multiple organizations, they'd see reuses from each of those orgs. If they're a sysadmin, they'd see all reuses.
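The per-user moderation queue suggested in that last comment could be computed roughly as follows. This is a sketch only: the dict shapes, the `admin_of` and `org_ids` keys and the function name are hypothetical, and it assumes each pending reuse already knows which organizations own its datasets.

```python
def moderation_queue(user, pending_reuses):
    """Return the pending reuses that the given user is allowed to moderate."""
    # Sysadmins see every pending reuse on the site.
    if user["sysadmin"]:
        return list(pending_reuses)
    # Organization admins see reuses attached to datasets in any of their orgs.
    admin_orgs = set(user["admin_of"])
    return [
        reuse for reuse in pending_reuses
        if admin_orgs & set(reuse["org_ids"])
    ]


pending = [
    {"id": "r1", "org_ids": ["health"]},
    {"id": "r2", "org_ids": ["transport", "health"]},
    {"id": "r3", "org_ids": ["finance"]},
]
org_admin = {"name": "alice", "sysadmin": False, "admin_of": ["health", "transport"]}
sysadmin = {"name": "root", "sysadmin": True, "admin_of": []}
```

With a many-many relationship, a reuse that spans datasets from several organizations shows up in each of those admins' queues, which raises the question of whose approval counts.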
- Add search bar and standardise filters to be across top of page on site-wide index: https://github.com/okfn/ckan/issues/333
- Add ability to associate datasets to related items. Should be a link and part of the related item info. This would allow related items to be added centrally, but still associate and show up on dataset pages. (https://github.com/okfn/ckan/issues/335 + https://github.com/okfn/ckan/issues/465)
- Allow editing/deleting of related items from anywhere when you have the auth to do so (i.e. if you are owner of the related item, or sysadmin) https://github.com/okfn/ckan/issues/334
- Related items associated only to a private dataset should be hidden until dataset is made public
- No way to moderate related items. Currently only the user who added the related item can delete it. Organization admins/editors (or dataset "owners") should at least be able to delete related items added to their datasets.
- The current user flow for adding, editing and deleting datasets needs work.
What do we call this feature? Related? Apps and Ideas? Something else? What's the MVP to make this useful? Ability to delete / moderate? Site-wide search on related items? Both?