This guide illustrates the process of uploading datasets to the Hugging Face platform. A Hugging Face dataset is essentially a directory backed by a Git repository, so you can update your data at any time.
To illustrate the structure, we can examine an existing dataset containing high-throughput measurements of the supernatant from 24 E. coli fermentations:
- Dataset Landing Page: RamanSpectraEcoliFermentation
- File Repository View: Repository Tree
In this example, the repository contains five primary files:
- .gitattributes: This is automatically generated by the system and generally requires no manual intervention.
- README.md: This file serves as the Dataset Card. It provides the description visible on the main landing page and is written in Markdown format.
- train.csv: Contains spectra and annotations designated for model training.
- val.csv: Contains spectra and annotations designated for validation.
- test.csv: Contains spectra and annotations designated for final testing and evaluation.
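To make this layout concrete, the following minimal sketch writes a throwaway folder with the same three split files and counts the data rows in each. The column names (sample_id, intensity_0, label) and all values are invented placeholders, not the actual columns of the dataset above.

```python
import csv
import tempfile
from pathlib import Path

SPLIT_FILES = ("train.csv", "val.csv", "test.csv")


def write_demo_folder(folder, sizes):
    """Write placeholder CSVs; columns and values are made up for illustration."""
    for name, n_rows in sizes.items():
        with open(folder / name, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["sample_id", "intensity_0", "label"])
            for i in range(n_rows):
                writer.writerow([i, 0.1 * i, 1.0])


def count_rows(folder):
    """Return the number of data rows (header excluded) per split file."""
    counts = {}
    for name in SPLIT_FILES:
        with open(folder / name, newline="") as handle:
            reader = csv.reader(handle)
            next(reader)  # skip the header row
            counts[name] = sum(1 for _ in reader)
    return counts


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    write_demo_folder(demo, {"train.csv": 6, "val.csv": 2, "test.csv": 2})
    row_counts = count_rows(demo)
    print(row_counts)
```

A check like this can be run on a local copy of your own files before uploading, to confirm that every split is present and non-empty.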
To initiate your own upload, you may refer to the official Hugging Face documentation. The primary steps include:
- Naming & Licensing: You will be prompted for a dataset name and a license. A common choice is "cc-by-4.0" (Creative Commons Attribution 4.0), which allows others to use, adapt, and redistribute the data as long as they credit the source.
- Initialization: Click "Create Dataset" to establish the basic repository structure.
- Data Organization: You can upload files through the various interface tabs. While the train/val/test split is a standard convention, it is not mandatory. You may organize your files as you like, for example by providing a single spectra.csv and a corresponding annotations.csv. Splitting can happen in the code later on.
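If you upload a single spectra file plus an annotations file, the split can be made in code after loading. The sketch below is one possible approach, not a prescribed one: it shuffles row indices with a fixed seed and cuts them into three groups. The function name split_indices, the 60/20/20 proportions, and the in-memory stand-in data are all assumptions made for illustration.

```python
import random


def split_indices(n_samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle row indices and cut them into train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_val = round(n_samples * val_fraction)
    n_test = round(n_samples * test_fraction)
    return {
        "val": sorted(indices[:n_val]),
        "test": sorted(indices[n_val:n_val + n_test]),
        "train": sorted(indices[n_val + n_test:]),
    }


# Stand-ins for the rows of spectra.csv and annotations.csv
# (assumed to be in the same order and of the same length).
spectra = [[0.1 * i, 0.2 * i] for i in range(20)]
annotations = [{"sample_id": i} for i in range(20)]

splits = split_indices(len(spectra))

# The same indices select matching rows from both files.
train_spectra = [spectra[i] for i in splits["train"]]
train_annotations = [annotations[i] for i in splits["train"]]

print({name: len(idx) for name, idx in splits.items()})
```

Because the same index lists are applied to both files, each spectrum stays paired with its annotation row.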
The Dataset Card is a critical component for providing context, such as your measurement methodology and the nature of your annotations.
Technical Reference: Detailed instructions on creating these cards can be found here: Dataset Card Documentation.
To edit your Dataset Card:
- Navigate to the "Dataset Card" tab in the top-left menu.
- Select "Edit Dataset Card" on the right side of the page.
- Author your content using Markdown. You are welcome to use my existing datasets as templates and adapt the text to your specific project requirements.
- Once your changes are complete, click "Commit changes to main" at the bottom of the page to save the updates. Each commit creates a new revision of the dataset in the underlying Git repository.
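If you prefer to draft the card locally rather than in the web editor, the README.md text can be assembled as a plain string and pasted in or uploaded like any other file. The sketch below assumes the card opens with a YAML metadata block carrying the license identifier, followed by free-form Markdown. The helper name build_dataset_card, the section titles, and the placeholder texts are illustrative and not taken from an existing card.

```python
def build_dataset_card(title, license_id, methodology, annotation_notes):
    """Assemble a README.md body: YAML metadata block, then Markdown sections."""
    lines = [
        "---",
        f"license: {license_id}",
        "---",
        "",
        f"# {title}",
        "",
        "## Measurement methodology",
        "",
        methodology,
        "",
        "## Annotations",
        "",
        annotation_notes,
        "",
    ]
    return "\n".join(lines)


card = build_dataset_card(
    title="My Dataset",
    license_id="cc-by-4.0",
    methodology="Describe instrument, settings, and sample handling here.",
    annotation_notes="Describe what each annotation column means here.",
)
print(card)
```

Keeping the card in a small function like this makes it easy to reuse the same structure across several datasets and change only the project-specific text.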