~453B uncompressed tokens of spiking neural activity data recorded from rodents (tokens = neurons × time bins). Unless otherwise noted, the data consist of spike counts in 20 ms time bins recorded from each neuron.
This repository contains the code and instructions for building the dataset from scratch. The final dataset itself is hosted in this public HF repository.
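As a toy illustration of the tokenization scheme above (one token per neuron per 20 ms time bin), the sketch below bins hypothetical spike times into 20 ms windows; all spike times and numbers here are made up for illustration and are not drawn from the actual datasets:

```python
import numpy as np

# Hypothetical example: three neurons recorded for 1 second.
# Spike times (in seconds) for each neuron.
spike_times = [
    np.array([0.005, 0.013, 0.250]),
    np.array([0.100, 0.900]),
    np.array([0.450]),
]

bin_size = 0.020                      # 20 ms bins
duration = 1.0                        # session length in seconds
n_bins = int(duration / bin_size)     # 50 bins
edges = np.arange(n_bins + 1) * bin_size

# Spike counts per neuron per bin: shape (n_neurons, n_bins)
counts = np.stack([np.histogram(st, bins=edges)[0] for st in spike_times])

# One token per (neuron, time-bin) cell
n_tokens = counts.size
print(counts.shape, n_tokens)  # (3, 50) 150
```

Summing `counts.size` over every session of every component dataset is what yields the per-dataset token counts in the table below.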
The current component datasets and token counts per dataset are as follows:
| Name | Tokens | Source | Details | Species | Subjects | Sessions |
|---|---|---|---|---|---|---|
| VBN | 153,877,057,200 | dandi:000713 | link | mouse | 81 | 153 |
| IBL | 69,147,814,139 | dandi:000409 | link | mouse | 115 | 347 |
| SHIELD | 61,890,305,241 | dandi:001051 | link | mouse | 27 | 99 |
| VCN | 36,681,686,005 | dandi:000021 | link | mouse | 32 | 32 |
| VCN-2 | 30,600,253,445 | dandi:000022 | link | mouse | 26 | 26 |
| V2H | 24,600,171,007 | dandi:000690 | link | mouse | 25 | 25 |
| Petersen | 15,510,368,376 | dandi:000059 | link | rat | 5 | 24 |
| Oddball | 14,653,641,118 | dandi:000253 | link | mouse | 14 | 14 |
| Illusion | 13,246,412,456 | dandi:000248 | link | mouse | 12 | 12 |
| Huszar | 8,812,474,629 | dandi:000552 | link | mouse | 17 | 65 |
| Steinmetz | 7,881,422,592 | dandi:000017 | link | mouse | 10 | 39 |
| Le Merre | 3,903,005,243 | dandi:001260 | link | mouse | 41 | 74 |
| Peyrache | 2,198,184,372 | dandi:000056 | link | mouse | 7 | 40 |
| Prince | 1,921,336,974 | dandi:001371 | link | mouse | 7 | 66 |
| Senzai | 1,433,511,102 | dandi:000166 | link | mouse | 19 | 19 |
| Finkelstein | 1,313,786,316 | dandi:000060 | link | mouse | 9 | 98 |
| Grosmark | 1,158,299,763 | dandi:000044 | link | rat | 4 | 8 |
| Giocomo | 1,083,328,404 | dandi:000053 | link | mouse | 34 | 349 |
| Steinmetz-2 | 684,731,334 | figshare:7739750 | link | mouse | 3 | 3 |
| Jaramillo | 581,535,289 | dandi:000986 | link | mouse | 5 | 15 |
| Mehrotra | 465,402,824 | dandi:000987 | link | mouse | 3 | 14 |
| Iurilli | 388,791,426 | dandi:000931 | link | mouse | 1 | 1 |
| Gonzalez | 366,962,209 | dandi:000405 | link | rat | 5 | 276 |
| Li | 260,807,325 | dandi:000010 | link | mouse | 23 | 99 |
| Fujisawa | 132,563,010 | dandi:000067 | link | rat | 3 | 10 |
Total number of tokens: 452,793,851,799
The combined dataset takes up about 453 GB on disk when stored as memory-mapped .arrow files. The HF datasets library uses .arrow files for local caching, so you will need at least this much free disk space to use the dataset.
Please see the auto-generated requirements.txt file.
The data directory contains all the information needed to download and preprocess the individual component datasets and push them to the HF datasets hub (quick links to the subdirectories for the component datasets are provided in the Details column of the table above). You can use these as a starting point if you would like to add more datasets to the mix; adding further dandisets should be particularly easy based on the current examples. When creating the component datasets, we split long sessions (>10M tokens) into equal-sized chunks of no more than 10M tokens each. This makes data loading more efficient and prevents errors when creating and uploading HF datasets.
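The chunk-size arithmetic for splitting long sessions can be sketched as follows. `split_session` is a hypothetical helper written for this illustration, not a function from the repo:

```python
import math

def split_session(n_tokens: int, max_tokens: int = 10_000_000) -> list[int]:
    """Split a session of n_tokens into the fewest roughly equal chunks,
    each no larger than max_tokens (chunk sizes differ by at most one token)."""
    n_chunks = math.ceil(n_tokens / max_tokens)
    base, rem = divmod(n_tokens, n_chunks)
    # distribute the remainder so sizes stay as equal as possible
    return [base + 1 if i < rem else base for i in range(n_chunks)]

# A 25M-token session becomes three ~8.33M-token chunks
print(split_session(25_000_000))  # [8333334, 8333333, 8333333]
```

In the actual pipeline the split is applied along the time axis of each session; this sketch only shows the size calculation.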
Once we have created the individual component datasets, we merge them into a single dataset with the merge_datasets.py script. This also shuffles the combined dataset, creates a separate test split (1% of the data), and pushes the dataset to the HF datasets hub (please note that due to the size of the dataset, it can take several hours to push the dataset to the HF datasets hub). If you would like to add more datasets to the mix, simply add their HF dataset repository names to the repo_list in merge_datasets.py.
visualize_datasets.py provides some basic functionality for visualizing random samples from the datasets as a sanity check:

```
python visualize_datasets.py --repo_name 'eminorhan/v2h' --n_examples 9
```

This will randomly sample n_examples examples from the corresponding dataset and visualize them as below, where x is the time axis (binned into 20 ms windows) and the y axis represents the recorded units:
Users also have the option to visualize n_examples random examples from each component dataset by calling:

```
python visualize_datasets.py --plot_all --n_examples 9
```

This will save the visualizations for all component datasets in a folder called rasters, as in here.
