GSoD 2023 Proposal

Noam Ross edited this page Mar 24, 2023 · 13 revisions

Documenting FAIR Data Deposition with deposits

About your organization

rOpenSci is a community of developers, researchers, and data scientists who support open scientific practices through developing technical and community infrastructure, particularly around the R language. This includes a large library of community-driven software packages for scientific data workflows, a software peer-review process, package development documentation, community infrastructure for automated building, testing, and dissemination of packages, and programs for co-working and mentoring. We are a community-driven organization, with a core team of developers and a large group of contributors. rOpenSci was founded in 2011. We are a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit organization in the United States.

About your project

rOpenSci has long supported scientists' work with data repositories, especially facilitating open science practices of timely and well-documented deposition of scientific data for open publication. This is an area of importance to our user community, as evidenced by high engagement in activities on this topic. Historically we have done so through a suite of API clients, each working with a particular data repository. deposits is our latest endeavor to make this task easier by creating a common workflow and interface for data deposition. It is an R package developed by rOpenSci, led by core member Mark Padgham, to serve as a common client for depositing, updating, and accessing research data in multiple generalist scientific repositories. It has consolidated previous work developing client libraries for single repositories (Zenodo, Figshare, and Dryad). deposits allows users to manage the online storage and publication of data in these repositories using a simple set of common actions that abstract over the complexity of multiple APIs. It also integrates with the tooling and standards of the Frictionless metadata framework, allowing researchers to harmonize their metadata generation and publication within a single workflow.

Your project’s problem

While deposits consolidates the process of depositing scientific data in repositories, that process remains complex, and its documentation is scattered across many resources. Beyond the tasks handled and abstracted by deposits, scientists must consider (meta)data structures, repository selection, and semantic linking and versioning to ensure that their data is Findable, Accessible, Interoperable, and Reusable (FAIR). We aim to create documentation for deposits that can act as one-stop tutorials for these processes, including creating a data dictionary, creating the package metadata, and linking data to other scientific publications and artifacts. We also aim to create example repositories that show how to use deposits to create a data package that can be deposited in a repository.
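To make the "data dictionary" and "package metadata" steps concrete, the sketch below builds a minimal Frictionless-style `datapackage.json` descriptor using only the Python standard library. It is an illustration under stated assumptions, not the deposits interface: deposits is an R package and its own functions are not shown here. The data set name, file path, and field definitions are all hypothetical.

```python
import json


def build_descriptor(name, title, csv_path, fields):
    """Return a Frictionless-style Data Package descriptor as a dict.

    `fields` is a list of (name, type, description) tuples; together they
    act as the data dictionary for a single tabular resource.
    """
    return {
        "name": name,
        "title": title,
        "resources": [
            {
                "name": name + "-table",
                "path": csv_path,
                "format": "csv",
                "schema": {
                    "fields": [
                        {"name": n, "type": t, "description": d}
                        for n, t, d in fields
                    ]
                },
            }
        ],
    }


# Hypothetical data set, used only for illustration.
descriptor = build_descriptor(
    name="site-counts",
    title="Example field survey counts",
    csv_path="data/site_counts.csv",
    fields=[
        ("site_id", "string", "Identifier of the survey site"),
        ("survey_date", "date", "Date of the survey (ISO 8601)"),
        ("count", "integer", "Number of individuals observed"),
    ],
)

# Serialize the descriptor as it would appear in datapackage.json.
descriptor_json = json.dumps(descriptor, indent=2)
print(descriptor_json)
```

Keeping the field descriptions next to the field types means the data dictionary travels with the data package rather than living in a separate document.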

In addition, developing multilingual resources for the R language is a core area of work for rOpenSci. We aim to produce both English and Spanish-language versions of this documentation.

Your project’s scope

The tasks of this project are split into the following categories:

  • Vignette creation:
    • Creation of a core tutorial walking users through the steps of creating documentation and metadata in preparation for deposition, validation, then practice deposition using deposits;
    • Creation of a tutorial for creating automated workflows using CI to validate and publish data;
    • Creation of a tutorial for retrieving data from multiple repositories for analysis;
    • Creation of a reference document/cheatsheet of metadata fields and their use in deposits and across repositories.
  • Example repositories: creation of example workflow repositories that represent common data structures and workflows:
    • Deposition of a structured tabular data set;
    • Deposition of a mixed data/code repository;
    • Automated validation and deposition of regularly released data via continuous integration.
  • Beta-testing documentation with scientists and collecting feedback on clarity, usability, and completeness.
  • Translation of all these resources into Spanish (or English if initially written in Spanish).
  • Writing blog posts announcing and summarizing each resource as they are published.

How would we measure success?

  • We first measure success by meeting key deliverables: the vignettes are added to the package website, and the example repositories are published.
  • Secondly, we measure success by ensuring that our documentation is usable and consolidates the information scientists need to perform these tasks. To test this, we will ask several first-time users to beta-test the documentation by following the tutorials to publish their own data. We will collect feedback on what was unclear and what information they had to look up in other sources to complete the process, and we will use this feedback to revise the documentation.
  • Ultimate success will be measured by observing data packages in scientific repositories that were deposited using deposits as a workflow and that have complete, high-quality metadata. We will measure this by monitoring scientific repositories for records that include opt-in tags indicating that they use deposits and the Frictionless metadata format. We will consider the project a success if we observe multiple such records from researchers not affiliated with this project.
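The monitoring step in the last point could be sketched roughly as follows. This is a minimal illustration, assuming records have already been retrieved from a repository and reduced to a simplified structure. The tag names, the record fields (`keywords`, `creators`), and all names in the sample data are hypothetical and do not reflect any repository's actual API response.

```python
# Hypothetical opt-in tags; the real tag names would be chosen by the project.
OPT_IN_TAGS = {"deposits", "frictionless"}


def is_opt_in_record(record, project_members):
    """True if a record carries every opt-in tag and none of its creators
    are affiliated with the project."""
    keywords = {k.lower() for k in record.get("keywords", [])}
    creators = set(record.get("creators", []))
    return OPT_IN_TAGS <= keywords and creators.isdisjoint(project_members)


def count_external_records(records, project_members):
    """Count records that qualify as independent uses of the workflow."""
    return sum(is_opt_in_record(r, project_members) for r in records)


# Hypothetical sample data in a simplified shape.
project_members = {"Project Member A"}
records = [
    {"id": 1, "keywords": ["deposits", "frictionless", "ecology"],
     "creators": ["Researcher One"]},
    {"id": 2, "keywords": ["Deposits", "Frictionless"],
     "creators": ["Researcher Two"]},
    {"id": 3, "keywords": ["deposits"],
     "creators": ["Researcher Three"]},   # missing the second tag
    {"id": 4, "keywords": ["deposits", "frictionless"],
     "creators": ["Project Member A"]},   # affiliated with the project
]

external_count = count_external_records(records, project_members)
print(external_count)
```

Matching tags case-insensitively and excluding affiliated creators keeps the count aligned with the stated criterion of records from researchers outside the project.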

Timeline

We expect this project to take approximately six months, with the technical writer and translator working 5-10 hours per week during their work phases. The timeline is as follows:

Dates | Action items
2023-06-01 | Project kick-off
2023-06-01 to 2023-07-31 | Vignette creation (and post #1)
2023-08-01 to 2023-09-30 | Example repository creation (and post #2)
2023-09-01 to 2023-09-30 | Beta-testing, feedback collection, revision
2023-10-01 to 2023-12-31 | Translation of all resources (and post #3)

Project Budget

Budget item | Amount (USD) | Running total (USD) | Notes/justifications
Technical writer | 7500 | 7500 | Pay at $50/hr for 150 hrs (~7.5 hrs/week for 20 weeks)
Translator | 3000 | 10500 | Pay at $50/hr for 60 hrs (~7.5 hrs/week for 8 weeks)
Project manager: direct and review updates to the documentation | 2000 | 12500 | Pay at $80/hr for 25 hrs
Administration | 875 | 13375 | 8% administrative fees through Open Collective

Additional information

Previous experience with technical writers or documentation: rOpenSci core members Maëlle Salmon and Noam Ross have been the primary authors of the rOpenSci Package Development Guide, a broadly used resource for R package development and maintenance. rOpenSci has worked with contract technical writers on the development of our Community Contributing Guide. We are working with professional translators to create a Spanish-language version of our Package Development Guide. For this, we are developing an AI-assisted translation toolchain specifically for technical documents that reduces translator effort while keeping a human-in-the-loop workflow to improve quality and harness translator expertise and domain-specific vocabulary.

Previous participation in Season of Docs, Google Summer of Code, or others: rOpenSci participated in Google Summer of Code for the development of one of our initial scientific database projects on accessing biodiversity data.