Guide: Uploading Raman Spectroscopy Datasets to Hugging Face

This guide illustrates the process of uploading datasets to the Hugging Face platform. Essentially, a Hugging Face dataset is a directory backed by a Git repository, which allows you to update your data at any time.


1. Structural Overview

To illustrate the structure, we can examine an existing dataset containing high-throughput measurements of the supernatant from 24 E. coli fermentations.

In this example, the repository contains five primary files:

  • .gitattributes: This is automatically generated by the system and generally requires no manual intervention.
  • README.md: This file serves as the Dataset Card. It provides the description visible on the main landing page and is written in Markdown format.
  • train.csv: Contains spectra and annotations designated for model training.
  • val.csv: Contains spectra and annotations designated for validation.
  • test.csv: Contains spectra and annotations designated for final testing and evaluation.
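For readers who want to work with such a repository locally, here is a minimal sketch that reads the three split files from a downloaded copy of the dataset directory. It uses only the Python standard library and assumes that each CSV has a header row; the function name `load_splits` and the `SPLIT_FILES` mapping are hypothetical, and the column names are left generic because this guide does not specify them. In practice, pandas or the `datasets` library would be the more usual tools for this.

```python
import csv
from pathlib import Path

# Split file names taken from the example repository described above.
SPLIT_FILES = {"train": "train.csv", "val": "val.csv", "test": "test.csv"}


def load_splits(dataset_dir):
    """Read each split CSV into a list of row dictionaries.

    Assumption: every CSV starts with a header row. The actual column
    names are not given in this guide, so rows stay generic dicts.
    """
    dataset_dir = Path(dataset_dir)
    splits = {}
    for split_name, file_name in SPLIT_FILES.items():
        path = dataset_dir / file_name
        if not path.exists():
            raise FileNotFoundError(f"Missing split file: {path}")
        with path.open(newline="", encoding="utf-8") as handle:
            splits[split_name] = list(csv.DictReader(handle))
    return splits
```

Calling `load_splits("path/to/local/copy")` returns a dictionary with the keys "train", "val", and "test", each holding the rows of the corresponding file.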

2. Creating Your Dataset

To initiate your own upload, you may refer to the official Hugging Face documentation. The primary steps include:

  1. Naming & Licensing: You will be prompted for a dataset name and a license. A common choice is "cc-by-4.0" (Creative Commons Attribution 4.0), which permits others to use, adapt, and redistribute the data as long as they credit the source.
  2. Initialization: Click "Create Dataset" to establish the basic repository structure.
  3. Data Organization: You can upload files through the various interface tabs. While the train/val/test split is a standard convention, it is not mandatory. You may organize your files as you like, for example by providing a single spectra.csv and a corresponding annotations.csv. The data can then be split later in code.
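If you upload unsplit files, the split can be done afterwards in code. The following is a minimal sketch of one way to do that, not a prescribed method: it produces reproducible, non-overlapping row indices for the three splits. The function name `split_indices`, the default fractions, and the seed are all illustrative assumptions.

```python
import random


def split_indices(n_samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Return shuffled, non-overlapping index lists for train/val/test.

    The fractions and the seed are illustrative defaults, not values
    prescribed by this guide.
    """
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    n_test = int(n_samples * test_fraction)
    return {
        "val": indices[:n_val],
        "test": indices[n_val:n_val + n_test],
        "train": indices[n_val + n_test:],
    }
```

The same index lists can be applied to the rows of both spectra.csv and annotations.csv, assuming the two files list the samples in the same order.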

3. Creating the Dataset Card

The Dataset Card is a critical component for providing context, such as your measurement methodology and the nature of your annotations.

Technical Reference: Detailed instructions on creating these cards can be found here: Dataset Card Documentation.

To edit your Dataset Card:

  1. Navigate to the "Dataset Card" tab in the top-left menu.
  2. Select "Edit Dataset Card" on the right side of the page.
  3. Author your content using Markdown. You are welcome to use my existing datasets as templates and adapt the text to your specific project requirements.
  4. Once your changes are complete, click "Commit changes to main" at the bottom of the page to save the updates and generate a new version of the dataset.
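As an alternative to typing the card in the web editor, the README.md can be drafted locally and pasted in or uploaded. The sketch below builds a card skeleton with a YAML metadata header carrying the license, followed by placeholder sections for the methodology and annotations mentioned above. The function name `build_dataset_card` and the section headings are hypothetical choices; adapt them to your project.

```python
def build_dataset_card(title, license_id, description):
    """Return Markdown text for a README.md with a YAML metadata header.

    The section headings below are placeholders, not a required layout.
    """
    lines = [
        "---",
        f"license: {license_id}",
        "---",
        "",
        f"# {title}",
        "",
        description,
        "",
        "## Measurement methodology",
        "",
        "TODO: describe the instrument and acquisition settings.",
        "",
        "## Annotations",
        "",
        "TODO: describe what each annotation column means.",
        "",
    ]
    return "\n".join(lines)
```

For example, `build_dataset_card("My Raman dataset", "cc-by-4.0", "Short summary.")` returns text that can be saved as README.md in the dataset repository.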