Revisit manifest trimming script #224

@aclayton555

Emerges from #218 alongside #223.

We should revisit the manifest trimming script and decide whether to keep it turned off entirely or, more likely, update it so that it is automated and more useful. The same work could also automate our reporting process and any metrics of interest.

From Orion:

I had some ideas related to your suggestions:

  • triage step to assess the curation status of each PMID (NOT just the presence / absence of a PMID)
  • Consider what types of checks could be included in this triage (and ideally automated), like # open-access vs # restricted, grant number = one of the grants on the CCKP, etc.

I definitely think we could approach this programmatically! The current manifest trimming script could be updated to include a Synapse tableQuery call. We could set the call up to use the appropriate table Synapse ID (maybe with an option to flip between the Portal and _UNION tables), based on the value of the Component column in the input manifest.

If using a _UNION table, GrantView Key values for entries with a matching Pubmed Id should be consolidated into a single entry, with the grant numbers stored as a comma-separated list next to the single matching PMID. This matches the format of the grant numbers in manifests pre-upload (I think - is that right, @aditya-nath-sage?), which will let us compare the entries easily. We could also implement a function that addresses the three triage scenarios:

  • Addressing situation 1 (fully curated entry exists in the selected table) - the function replaces entries in the new manifest with entries from the selected table, if:
    1. the Pubmed Id and GrantView Key values of an entry match a row in the selected table
    2. the matching entry is marked as Open Access in the selected table
  • Addressing situation 2 (partially curated entry exists in the selected table) - entries in the new manifest will be retained (and maybe reported in a separate printout or CSV) if:
    1. the Pubmed Id and GrantView Key values of an entry match a row in the selected table
    2. the matching entry is marked as Restricted Access in the selected table
  • Addressing situation 3 (no entry exists in the database) - entries in the new manifest will be retained and reported in a separate CSV if:
    1. the Pubmed Id value of an entry does not match a row in the selected table
    2. the GrantView Key value(s) associated with the new Pubmed Id match an entry in the Portal - Grants Merged table
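
The consolidation step and the three triage scenarios above could be sketched roughly as follows. This is a minimal sketch in plain Python, not the existing script: it assumes the table rows have already been fetched (for example via a Synapse table query) and converted to dictionaries. The column names Pubmed Id and GrantView Key come from this issue; the Accessibility column name, the function names, and the "unmatched" bucket are hypothetical. It also assumes that situation 3 only needs at least one of the entry's grant numbers to appear in the Portal - Grants Merged table.

```python
PMID = "Pubmed Id"         # column name taken from this issue
GRANT = "GrantView Key"    # column name taken from this issue
ACCESS = "Accessibility"   # hypothetical name for the access-status column


def split_grants(value):
    """Turn a comma-separated string such as 'CA1, CA2' into a set of grant numbers."""
    return {g.strip() for g in str(value).split(",") if g.strip()}


def consolidate(table_rows):
    """Collapse rows that share a Pubmed Id into one row per PMID.

    Grant numbers from all matching rows are merged and stored as a sorted,
    comma-separated string; other columns come from the first row seen.
    """
    merged = {}
    for row in table_rows:
        pmid = str(row[PMID])
        if pmid not in merged:
            merged[pmid] = {**row, GRANT: set()}
        merged[pmid][GRANT] |= split_grants(row[GRANT])
    for row in merged.values():
        row[GRANT] = ", ".join(sorted(row[GRANT]))
    return merged


def triage(manifest_rows, table_rows, known_grants):
    """Sort manifest rows into buckets for the three triage scenarios."""
    table = consolidate(table_rows)
    out = {"replaced": [], "partial": [], "new": [], "unmatched": []}
    for row in manifest_rows:
        match = table.get(str(row[PMID]))
        grants = split_grants(row[GRANT])
        if match is not None and split_grants(match[GRANT]) == grants:
            if match.get(ACCESS) == "Open Access":
                out["replaced"].append(match)   # situation 1: use the curated row
            elif match.get(ACCESS) == "Restricted Access":
                out["partial"].append(row)      # situation 2: keep and report
            else:
                out["unmatched"].append(row)    # unknown access status
        elif match is None and grants & known_grants:
            out["new"].append(row)              # situation 3: keep and report
        else:
            out["unmatched"].append(row)        # needs a manual look
    return out
```

Rows that land in "unmatched" (for example a PMID that exists with a different set of grant numbers) are not covered by the three scenarios, so the sketch only sets them aside for manual review.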

There are definitely metrics we could collect and report, potentially through a separate script:

  • @aclayton555 noted # open access vs # restricted above
  • # of papers with more than one grant number
  • # of papers from each consortium
  • Names of journals and # of articles on the CCKP for each
  • The same metrics for datasets and tools

We can get some of this from table queries, but it could be interesting to track everything on a regular basis so we can look at the change over time.
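
One way to track these numbers over time would be to compute them from the table rows and append a dated snapshot to a CSV on each run. The snippet below is only a sketch under assumptions: the rows are already available as dictionaries, the Accessibility, Consortium and Journal column names are hypothetical placeholders, and collect_metrics and append_snapshot are made-up helper names.

```python
import csv
import os
from collections import Counter
from datetime import date


def collect_metrics(rows):
    """Compute the publication metrics listed above from a list of row dicts."""
    def grant_count(row):
        # Count grant numbers in the comma-separated GrantView Key value.
        return len([g for g in str(row["GrantView Key"]).split(",") if g.strip()])

    # "Accessibility", "Consortium" and "Journal" are hypothetical column names.
    access = Counter(r.get("Accessibility") for r in rows)
    return {
        "total": len(rows),
        "open_access": access["Open Access"],
        "restricted": access["Restricted Access"],
        "multi_grant": sum(1 for r in rows if grant_count(r) > 1),
        "by_consortium": dict(Counter(r.get("Consortium") for r in rows)),
        "by_journal": dict(Counter(r.get("Journal") for r in rows)),
    }


def append_snapshot(metrics, path, today=None):
    """Append one dated row of the scalar metrics to a CSV file."""
    fields = ["date", "total", "open_access", "restricted", "multi_grant"]
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as fh:
        # extrasaction="ignore" skips the per-consortium and per-journal dicts.
        writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
        if new_file:
            writer.writeheader()
        writer.writerow({"date": (today or date.today()).isoformat(), **metrics})
```

The per-consortium and per-journal counts are left out of the CSV here to keep one row per run; they could be written to a second file in long format if we want to chart them as well.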

Slightly separate: maybe we could check with Savitha to see if there are any other publication metrics that would be helpful for us to run monthly. For example, I think we can check citation counts and get lists of other publications that cite publications in the database. We could use this info to see which papers are being cited, including how often publications from MC2 consortia reference the same publications, reference other publications from MC2 consortia members, etc.
