GSoD 2023 Proposal
rOpenSci is a community of developers, researchers, and data scientists who support open scientific practices through developing technical and community infrastructure, particularly around the R language. This includes a large library of community-driven software packages for scientific data workflows, a software peer-review process, package development documentation, community infrastructure for automated building, testing, and dissemination of packages, and programs for co-working and mentoring. We are a community-driven organization, with a core team of developers and a large group of contributors. rOpenSci was founded in 2011. We are a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit organization in the United States.
rOpenSci has long supported scientists' work with data repositories, especially facilitating open science practices of timely and well-documented deposition of scientific data for open publication. This is an area of importance to our user community, as evidenced by high engagement in activities on this topic. Historically we have done so through a suite of API clients, each working with a particular data repository. `deposits` is our latest endeavor to make this task easier by creating a common workflow and interface for data deposition. It is an R package developed by rOpenSci, led by core member Mark Padgham, to serve as a common client for depositing, updating, and accessing research data in multiple generalist scientific repositories. It has consolidated previous work developing client libraries for single repositories (Zenodo, Figshare, and Dryad). `deposits` allows users to manage the online storage and publication of data in these repositories using a simple set of common actions that abstract over the complexity of multiple APIs. It also integrates with the tooling and standards of the Frictionless metadata framework, allowing researchers to harmonize their metadata generation and publication within a single workflow.
While `deposits` consolidates the process of scientific data deposition to repositories, it is still a complex process with documentation scattered across many resources. Beyond the tasks handled and abstracted by `deposits`, scientists must consider (meta)data structures, repository selection, and semantic linking and versioning in order to ensure that their data is Findable, Accessible, Interoperable, and Reusable (FAIR). We aim to create documentation for `deposits` that can act as one-stop tutorials for these processes, including creating a data dictionary, creating the package metadata, and linking data to other scientific publications and artifacts. We also aim to create example repositories that show how to use `deposits` to create a data package that can be deposited in a repository.
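As a rough illustration of what "creating a data dictionary" and "creating the package metadata" involve, the sketch below assembles a minimal Frictionless-style data package descriptor in Python. It does not use `deposits` or any Frictionless library. The helper names, the dataset name, the file path, and the sample columns are all made up for the example, and the type inference is deliberately simplistic.

```python
import json


def infer_field_type(values):
    """Guess a Table Schema type from sample values (deliberately simplistic)."""
    if not values:
        return "string"
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "number"
    return "string"


def build_descriptor(name, path, columns, descriptions):
    """Build a data package descriptor with one tabular resource.

    The per-field entries (name, type, description) act as the data dictionary.
    """
    fields = [
        {
            "name": column,
            "type": infer_field_type(values),
            "description": descriptions.get(column, ""),
        }
        for column, values in columns.items()
    ]
    return {
        "name": name,
        "resources": [{"name": name, "path": path, "schema": {"fields": fields}}],
    }


# Hypothetical sample data, for illustration only.
columns = {"site": ["A", "B"], "count": [3, 7], "temp_c": [12.5, 14.0]}
descriptions = {
    "site": "Survey site identifier",
    "count": "Number of individuals observed",
    "temp_c": "Air temperature in degrees Celsius",
}

descriptor = build_descriptor("bird-counts", "bird-counts.csv", columns, descriptions)
print(json.dumps(descriptor, indent=2))
```

Writing the printed JSON to a `datapackage.json` file next to the data is the usual Frictionless convention; the tutorials proposed here would cover how that metadata is then carried into a repository deposit.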
In addition, a core focus for rOpenSci is developing multilingual resources for the R language. We aim to develop both English and Spanish-language versions of this documentation.
The tasks of this project are split into the following categories:
- Vignette creation:
  - Creation of a core tutorial walking users through the steps of creating documentation and metadata in preparation for deposition, validation, then practice deposition using `deposits`;
  - Creation of a tutorial for creating automated workflows using CI to validate and publish data;
  - Creation of a tutorial for retrieving data from multiple repositories for analysis;
  - Creation of a reference document/cheatsheet of metadata fields and their use in `deposits` and across repositories.
- Example repositories: Creation of example workflow repositories that represent common data structures and workflows:
  - Deposition of a structured tabular data set;
  - Deposition of a mixed data/code repository;
  - Automated validation and deposition of regularly released data via continuous integration.
- Beta-testing documentation with scientists and collecting feedback on clarity, usability, and completeness.
- Translation of all these resources into Spanish (or English if initially written in Spanish).
- Writing blog posts announcing and summarizing each resource as it is published.
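To make the "automated validation via continuous integration" task more concrete, here is a minimal Python sketch of the kind of check a CI job could run before any deposition step. It is an assumption-laden illustration, not the workflow the tutorials will prescribe: the schema layout follows the Frictionless Table Schema field list, while `validate_rows`, `ci_exit_code`, and the sample data are hypothetical names and values invented for the example.

```python
import csv
import io

# Map Table Schema type names to Python casts used as a crude validity check.
CASTERS = {"integer": int, "number": float, "string": str}


def validate_rows(csv_text, schema):
    """Return a list of human-readable problems; an empty list means the data passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [field["name"] for field in schema["fields"]]
    if reader.fieldnames != expected:
        return [f"header mismatch: expected {expected}, got {reader.fieldnames}"]

    problems = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for field in schema["fields"]:
            value = row[field["name"]]
            try:
                CASTERS[field["type"]](value)
            except (TypeError, ValueError):
                problems.append(
                    f"line {line_no}: {field['name']}={value!r} is not {field['type']}"
                )
    return problems


def ci_exit_code(csv_text, schema):
    """Exit status a CI step could pass to sys.exit(): 0 on success, 1 on failure."""
    return 1 if validate_rows(csv_text, schema) else 0


# Hypothetical schema and data, for illustration only.
schema = {
    "fields": [
        {"name": "site", "type": "string"},
        {"name": "count", "type": "integer"},
    ]
}
good = "site,count\nA,3\nB,7\n"
bad = "site,count\nA,three\n"

print(validate_rows(good, schema))
print(validate_rows(bad, schema))
```

A non-zero exit status fails the CI job, so a release with malformed data would stop before anything is uploaded.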
- We first measure success by meeting key deliverables: the vignettes are added to the package website, and the example repositories are published.
- Secondly, we measure success by ensuring that our documentation is usable and consolidates the information scientists need in order to perform the tasks. To test this, we will reach out to scientists to beta-test the documentation. We will ask several first-time users to follow the tutorials through the process of publishing their real data. We will collect feedback on what was unclear and what information they needed to look up in other sources in order to complete the process. We will use this to revise the documentation.
- Ultimate success will be measured by observing the deposition of data packages into scientific repositories that were uploaded using `deposits` as a workflow and that have complete, high-quality metadata. We will be able to measure this by monitoring scientific repositories for records that include opt-in tags indicating they use `deposits` and the Frictionless metadata format. We will consider the project a success if we observe multiple such records from researchers not affiliated with this project.
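The last measure could be tallied along the following lines. This Python sketch assumes record metadata has already been retrieved from a repository and simplified into dictionaries with `keywords` and `creators` lists. That record shape, the tag text, the team list, and the threshold of two records for "multiple" are all assumptions made for the example, not features of any repository API or of `deposits`.

```python
OPT_IN_TAG = "deposits"            # hypothetical opt-in tag text
PROJECT_TEAM = {"Example Author"}  # hypothetical list of project-affiliated names
MINIMUM_RECORDS = 2                # assumed reading of "multiple such records"


def uses_tag(record, tag=OPT_IN_TAG):
    """True if the record carries the opt-in tag among its keywords (case-insensitive)."""
    return tag.lower() in {keyword.lower() for keyword in record.get("keywords", [])}


def is_external(record, team=PROJECT_TEAM):
    """True if none of the record's creators is affiliated with the project."""
    return not any(creator in team for creator in record.get("creators", []))


def external_tagged(records):
    """Records that carry the opt-in tag and come from outside the project team."""
    return [r for r in records if uses_tag(r) and is_external(r)]


# Made-up records, for illustration only.
sample = [
    {"title": "Survey A", "keywords": ["deposits", "frictionless"], "creators": ["R. One"]},
    {"title": "Survey B", "keywords": ["ecology"], "creators": ["R. Two"]},
    {"title": "Demo", "keywords": ["Deposits"], "creators": ["Example Author"]},
    {"title": "Survey C", "keywords": ["DEPOSITS"], "creators": ["R. Three"]},
]

hits = external_tagged(sample)
print([record["title"] for record in hits])
print(len(hits) >= MINIMUM_RECORDS)
```

Keeping the tag check and the affiliation check as separate functions makes it easy to report both numbers: total tagged records and the subset from external researchers.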
We expect this project to take approximately six months, with the technical writer and translator working 5-10 hours per week during their work phases. The timeline is as follows:
Dates | Action Items |
---|---|
2023-06-01 | Project kick-off |
2023-06-01 - 2023-07-31 | Vignette creation (and post #1) |
2023-08-01 - 2023-09-30 | Example repository creation (and post #2) |
2023-09-01 - 2023-09-30 | Beta-testing, feedback collection, revision |
2023-10-01 - 2023-12-31 | Translation of all resources (and post #3) |
Budget item | Amount (USD) | Running Total (USD) | Notes/justifications |
---|---|---|---|
Technical Writer | 7500 | 7500 | Pay at $50/hr for 150 hrs (~7.5 hrs/week for 20 weeks) |
Translator | 3000 | 10500 | Pay at $50/hr for 60 hrs (~7.5 hrs/week for 8 weeks) |
Project manager: direct and review updates to the documentation | 2000 | 12500 | Pay at $80/hr for 25 hrs |
Administration | 875 | 13375 | 8% Administrative fees through Open Collective |
Previous experience with technical writers or documentation: rOpenSci core members Maëlle Salmon and Noam Ross have been the primary authors of the rOpenSci Package Development Guide, a broadly used resource for R package development and maintenance. rOpenSci has worked with contract technical writers for the development of our Community Contributing Guide. We are working with professional translators to create a Spanish-language version of our Package Development Guide. For this, we are developing an AI-assisted translation toolchain specifically for technical documents that reduces translator effort while keeping a human-in-the-loop workflow to improve quality and harness translator expertise and domain-specific vocabulary.
Previous participation in Season of Docs, Google Summer of Code, or others: rOpenSci participated in Google Summer of Code for the development of one of our initial scientific database projects on accessing biodiversity data.