Providing a hands-on, project-based experience for online students of data science can be a challenge, especially when the experience involves interaction with a real dataset. Our project explores hosting datasets on Microsoft Azure to support project-based experiences in data science training. Our goal is to use the Azure cloud infrastructure to store and curate datasets for use in a secure environment. The larger goal is to select learning modules that support hands-on learning by drawing from the MBDH community.
In this pilot, we create and evaluate a training environment for data science that allows students to interact with data. For dataset upload, we draw on our earlier work, [Sustainable Environments Actionable Data - SEAD](http://sead-data.net), funded through a grant from the National Science Foundation, and extend it to publish datasets to Azure through an environment that supports dataset curation and post-deposit discovery.
This pilot project focuses on the following activities:
- Data Deposit: Extend the SEAD data curation and publishing services to publish data to Microsoft Azure so that data products gain additional metadata and receive a persistent identifier.
- Training and Outreach events
- Evaluation: Evaluate the pilot, including a study of the access control needed to protect the datasets, the access control issues with individual student computers, and the tools in place for tracking student activity and student signup.
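
As a rough illustration of the Data Deposit activity, the sketch below shows how a dataset might be paired with descriptive metadata and an identifier before publication. It is not the SEAD implementation: the `DepositRecord` fields, the `build_deposit_record` helper, and the `example-pid:` prefix are all hypothetical, and the identifier is a locally generated placeholder, not one minted by a real persistent identifier service. The upload to Azure itself is deliberately left out.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class DepositRecord:
    """Hypothetical metadata record for one published dataset."""
    title: str
    creator: str
    sha256: str          # checksum of the dataset bytes, for integrity checks
    size_bytes: int
    pid: str             # placeholder identifier (assumption, see below)
    published_at: str    # UTC timestamp in ISO 8601 format
    extra: dict = field(default_factory=dict)


def mint_placeholder_pid(prefix: str = "example-pid") -> str:
    # Assumption: stands in for a call to a real identifier service.
    # A UUID is unique but is NOT a resolvable persistent identifier.
    return f"{prefix}:{uuid.uuid4()}"


def build_deposit_record(data: bytes, title: str, creator: str, **extra) -> DepositRecord:
    """Attach descriptive metadata and a placeholder identifier to a dataset."""
    return DepositRecord(
        title=title,
        creator=creator,
        sha256=hashlib.sha256(data).hexdigest(),
        size_bytes=len(data),
        pid=mint_placeholder_pid(),
        published_at=datetime.now(timezone.utc).isoformat(),
        extra=dict(extra),
    )


def to_json(record: DepositRecord) -> str:
    """Serialize the record so it can be stored alongside the dataset."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    sample = b"sensor_id,pm25\nA1,12.5\n"  # made-up sample rows
    record = build_deposit_record(sample, title="Sample readings", creator="Example Lab")
    print(to_json(record))
```

Storing the checksum and identifier in the same record is one simple way to let a later discovery step verify that the data it retrieves matches what was deposited.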
Find the SEADTrain PID'ified Airbox Data Discovery User Interface below:
https://data-to-insight-center.github.io/SEADTrain/
The materials were developed by the Data To Insight Center of Indiana University and are available at https://figshare.com/articles/SEADTrain_Data_Analysis/6873800 under a Creative Commons 4.0 license. The data used in this training exercise is made available in part through funding from the National Science Foundation under award #1234983. The Azure resources are funded through an award from Microsoft for Azure credits. All software is licensed under an Apache 2.0 license.