Registry of Open Data

A repository of publicly available datasets that can be accessed from a wide variety of resources. The datasets in this registry are not provided by the R Consortium; they are owned and maintained by a variety of government organizations, researchers, businesses, and individuals.

What is this for?

When data is shared in this repository, the intent is that anyone can analyze it and build services on top of it using a broad range of compute and data analytics products. Sharing data in the cloud lets data users spend more time on data analysis rather than data acquisition. This repository exists to help people promote and discover datasets that are available for analysis using R.

How are datasets added to the registry?

Each dataset in this repository is described with metadata saved in a YAML file in the /datasets directory. We use these YAML files to provide three services:

  • A Registry of Open Data
  • A hosted YAML file listing all of the dataset entries
  • Hosted YAML files for each dataset

The YAML files use this structure:

Name:
Description:
Documentation:
Contact:
ManagedBy:
UpdateFrequency:
Tags:
  -
License:
Resources:
  - Description:
    ARN:
    Region:
    Type:
DataAtWork:
  Tutorials:
    - Title:
      URL:
      AuthorName:
      AuthorURL:
  Tools & Applications:
    - Title:
      URL:
      AuthorName:
      AuthorURL:
  Publications:
    - Title:
      URL:
      AuthorName:
      AuthorURL:

The metadata required for each dataset entry is as follows:

| Field | Type | Description |
| --- | --- | --- |
| Name | String | The public facing name of the dataset |
| Description | String | A high-level description of the dataset |
| Documentation | URL | A link to documentation of the dataset |
| Contact | String | May be an email address, a link to contact form, a link to GitHub issues page, or any other instructions to contact the producer of the dataset |
| ManagedBy | String | The name of the organization who is responsible for the data ingest process |
| UpdateFrequency | String | An explanation of how frequently the dataset is updated |
| Tags | List of strings | Tags that topically describe the dataset. A list of supported tags is maintained in the tags.yaml file in this repo. If you want to recommend a tag that is not included in tags.yaml, please submit a pull request to add it to that file. |
| License | String | An explanation of the dataset license and/or a URL to more information about data terms of use of the dataset |
| Resources | List of lists | A list of AWS resources that users can use to consume the data. Each resource entry requires the metadata below: |
| Resources > Description | String | A technical description of the data available within the AWS resource, including information about file formats and scope. |
| Resources > ARN | String | Amazon Resource Name for resource, e.g. arn:aws:s3:::commoncrawl |
| Resources > Region | String | AWS region unique identifier, e.g. us-east-1 |
| Resources > Type | String | Can be CloudFront Distribution, DB Snapshot, S3 Bucket, or SNS Topic. A list of supported resources is maintained in the resources.yaml file in this repo. If you want to recommend a resource that is not included in resources.yaml, please submit a pull request to add it to that file. |
| Resources > RequesterPays (Optional) | Boolean | Only appropriate for Amazon S3 buckets, indicates whether the bucket has Requester Pays enabled or not. |
| DataAtWork [> Tutorials, Tools & Applications, Publications] (Optional) | List of lists | A list of links to example tutorials, tools & applications, publications that use the data. |
| DataAtWork [> Tutorials, Tools & Applications, Publications] > Title | String | The title of the tutorial, tool, application, or publication that uses the data. |
| DataAtWork [> Tutorials, Tools & Applications, Publications] > URL | URL | A link to the tutorial, tool, application, or publication that uses the data. |
| DataAtWork [> Tutorials, Tools & Applications, Publications] > AuthorName | String | Name of person or entity that created the tutorial, tool, application, or publication. |
| DataAtWork [> Tutorials, Tools & Applications, Publications] > AuthorURL | String | (Optional) URL for person or entity that created the tutorial, tool, application, or publication. |
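
The required-field rules above lend themselves to a quick pre-submission check. The following is a minimal sketch, not the validation this repository actually runs: `validate_entry`, `REQUIRED_FIELDS`, and the other names are hypothetical, and it assumes the YAML file has already been parsed into a plain Python dict by a YAML library of your choice. The set of resource types is copied from the table; the authoritative list is resources.yaml.

```python
# Hypothetical helper: checks a parsed dataset entry against the field table.
# Assumes the YAML was already loaded into a dict; no YAML parsing happens here.

# Required top-level fields and the Python type each should parse to.
REQUIRED_FIELDS = {
    "Name": str,
    "Description": str,
    "Documentation": str,
    "Contact": str,
    "ManagedBy": str,
    "UpdateFrequency": str,
    "Tags": list,
    "License": str,
    "Resources": list,
}
REQUIRED_RESOURCE_FIELDS = ("Description", "ARN", "Region", "Type")
# Types named in the table above; resources.yaml remains the source of truth.
RESOURCE_TYPES = {"CloudFront Distribution", "DB Snapshot", "S3 Bucket", "SNS Topic"}


def validate_entry(entry):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{field} should be of type {expected.__name__}")

    resources = entry.get("Resources")
    if isinstance(resources, list):
        for i, res in enumerate(resources):
            if not isinstance(res, dict):
                problems.append(f"Resources[{i}] should be a mapping")
                continue
            for field in REQUIRED_RESOURCE_FIELDS:
                if field not in res:
                    problems.append(f"Resources[{i}] missing field: {field}")
            rtype = res.get("Type")
            if rtype is not None and rtype not in RESOURCE_TYPES:
                problems.append(f"Resources[{i}] has unsupported Type: {rtype}")
            # RequesterPays is optional and only meaningful for S3 buckets.
            if "RequesterPays" in res and rtype != "S3 Bucket":
                problems.append(f"Resources[{i}] sets RequesterPays on a non-S3 resource")
    return problems
```

Calling `validate_entry(entry)` on a parsed entry returns a list of messages that can be printed before opening a pull request. The sketch deliberately does not check tags against tags.yaml or validate the optional DataAtWork section.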

Example entry

Here is an example of the metadata behind a hosted dataset: https://registry.opendata.aws/gdelt/

Name: NEXRAD on AWS
Description: Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.
Documentation: https://docs.opendata.aws/noaa-nexrad/readme.html
Contact: noaa.bdp@noaa.gov
ManagedBy: "[NOAA](http://www.noaa.gov/)"
UpdateFrequency: New Level II data is added as soon as it is available.
Tags:
  - aws-pds
  - earth observation
  - natural resource
  - weather
  - meteorological
  - sustainability
License: There are no restrictions on the use of this data.
Resources:
  - Description: NEXRAD Level II archive data
    ARN: arn:aws:s3:::noaa-nexrad-level2
    Region: us-east-1
    Type: S3 Bucket
  - Description: NEXRAD Level II real-time data
    ARN: arn:aws:s3:::unidata-nexrad-level2-chunks
    Region: us-east-1
    Type: S3 Bucket
  - Description: "[Rich notifications](https://docs.opendata.aws/noaa-nexrad/readme.html) for real-time data with filterable fields"
    ARN: arn:aws:sns:us-east-1:684042711724:NewNEXRADLevel2ObjectFilterable
    Region: us-east-1
    Type: SNS Topic
  - Description: Notifications for archival data
    ARN: arn:aws:sns:us-east-1:811054952067:NewNEXRADLevel2Archive
    Region: us-east-1
    Type: SNS Topic
DataAtWork:
  Tutorials:
    - Title: Using Python to Access NCEI Archived NEXRAD Level 2 Data (Jupyter notebook)
      URL: http://nbviewer.jupyter.org/gist/dopplershift/356f2e14832e9b676207
      AuthorName: Ryan May
      AuthorURL: http://dopplershift.github.io
    - Title: Mapping Noaa Nexrad Radar Data With CARTO
      URL: https://carto.com/blog/mapping-nexrad-radar-data/
      AuthorName: Stuart Lynn
      AuthorURL: https://carto.com/blog/author/stuart-lynn/
    - Title: NEXRAD on EC2 tutorial
      URL: https://github.com/openradar/AMS_radar_in_the_cloud
      AuthorName: openradar
      AuthorURL: https://github.com/openradar
  Tools & Applications:
    - Title: nexradaws on pypi.python.org - python module to query and download Nexrad data from Amazon S3
      URL: https://pypi.org/project/nexradaws/
      AuthorName: Aaron Anderson
      AuthorURL: https://github.com/aarande
    - Title: WeatherPipe - Amazon EMR based analysis tool for NEXRAD data stored on Amazon S3
      URL: https://github.com/stephenlienharrell/WeatherPipe
      AuthorName: Stephen Lien Harrell
      AuthorURL: https://github.com/stephenlienharrell
  Publications:
    - Title: Seasonal abundance and survival of North America’s migratory avifauna determined by weather radar
      URL: https://www.nature.com/articles/s41559-018-0666-4
      AuthorName: Adriaan M. Dokter, Andrew Farnsworth, Daniel Fink, Viviana Ruiz-Gutierrez, Wesley M. Hochachka, Frank A. La Sorte, Orin J. Robinson, Kenneth V. Rosenberg & Steve Kelling
    - Title: Unlocking the Potential of NEXRAD Data through NOAA’s Big Data Partnership
      URL: https://journals.ametsoc.org/doi/full/10.1175/BAMS-D-16-0021.1
      AuthorName: Steve Ansari and Stephen Del Greco
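
The ARN values in the entry above follow the general form `arn:partition:service:region:account-id:resource`. As an illustration of how a consumer might pull the bucket or topic name out of a resource, here is a minimal sketch; `Arn` and `parse_arn` are hypothetical names, not part of this repository or of any AWS library.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    """Hypothetical container for the colon-separated parts of an ARN."""
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn):
    """Split an ARN string into its five components after the 'arn' prefix."""
    # maxsplit=5 keeps any colons inside the resource part intact.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


# ARNs taken from the example entry above.
bucket = parse_arn("arn:aws:s3:::noaa-nexrad-level2")
topic = parse_arn("arn:aws:sns:us-east-1:811054952067:NewNEXRADLevel2Archive")

print(bucket.service, bucket.resource)
print(topic.service, topic.region, topic.resource)
```

Note that the S3 ARN carries an empty region and account ID, which is one reason each resource in the entry records its Region as a separate field.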