# eminorhan/neural-pile-rodent

License: MIT

**Spiking neural activity data recorded from rodents**

~453B uncompressed tokens of spiking neural activity data recorded from rodents (tokens = neurons × time bins). Unless otherwise noted, the data consist of spike counts in 20 ms time bins recorded from each neuron.
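As a concrete illustration of this token accounting (the session length and unit count below are made-up example values, not taken from any component dataset):

```python
# Tokens per session = neurons × time bins (20 ms bins).
bin_size_s = 0.02       # 20 ms bins
duration_s = 3600       # a hypothetical 1-hour session
n_neurons = 300         # a hypothetical unit count

n_bins = int(duration_s / bin_size_s)  # 180,000 time bins
n_tokens = n_neurons * n_bins          # 54,000,000 tokens

print(n_tokens)
```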

This repository contains the code and instructions for building the dataset from scratch. The final dataset itself is hosted at this public HF repository.

The current component datasets and token counts per dataset are as follows:

| Name | Tokens | Source | Details | Species | Subjects | Sessions |
| --- | ---: | --- | --- | --- | ---: | ---: |
| VBN | 153,877,057,200 | dandi:000713 | link | mouse | 81 | 153 |
| IBL | 69,147,814,139 | dandi:000409 | link | mouse | 115 | 347 |
| SHIELD | 61,890,305,241 | dandi:001051 | link | mouse | 27 | 99 |
| VCN | 36,681,686,005 | dandi:000021 | link | mouse | 32 | 32 |
| VCN-2 | 30,600,253,445 | dandi:000022 | link | mouse | 26 | 26 |
| V2H | 24,600,171,007 | dandi:000690 | link | mouse | 25 | 25 |
| Petersen | 15,510,368,376 | dandi:000059 | link | rat | 5 | 24 |
| Oddball | 14,653,641,118 | dandi:000253 | link | mouse | 14 | 14 |
| Illusion | 13,246,412,456 | dandi:000248 | link | mouse | 12 | 12 |
| Huszar | 8,812,474,629 | dandi:000552 | link | mouse | 17 | 65 |
| Steinmetz | 7,881,422,592 | dandi:000017 | link | mouse | 10 | 39 |
| Le Merre | 3,903,005,243 | dandi:001260 | link | mouse | 41 | 74 |
| Peyrache | 2,198,184,372 | dandi:000056 | link | mouse | 7 | 40 |
| Prince | 1,921,336,974 | dandi:001371 | link | mouse | 7 | 66 |
| Senzai | 1,433,511,102 | dandi:000166 | link | mouse | 19 | 19 |
| Finkelstein | 1,313,786,316 | dandi:000060 | link | mouse | 9 | 98 |
| Grosmark | 1,158,299,763 | dandi:000044 | link | rat | 4 | 8 |
| Giocomo | 1,083,328,404 | dandi:000053 | link | mouse | 34 | 349 |
| Steinmetz-2 | 684,731,334 | figshare:7739750 | link | mouse | 3 | 3 |
| Jaramillo | 581,535,289 | dandi:000986 | link | mouse | 5 | 15 |
| Mehrotra | 465,402,824 | dandi:000987 | link | mouse | 3 | 14 |
| Iurilli | 388,791,426 | dandi:000931 | link | mouse | 1 | 1 |
| Gonzalez | 366,962,209 | dandi:000405 | link | rat | 5 | 276 |
| Li | 260,807,325 | dandi:000010 | link | mouse | 23 | 99 |
| Fujisawa | 132,563,010 | dandi:000067 | link | rat | 3 | 10 |

Total number of tokens: 452,793,851,799

The combined dataset takes up about 453 GB on disk when stored as memory-mapped `.arrow` files. The HF `datasets` library uses `.arrow` files for local caching, so you will need at least this much free disk space to use the dataset.

## Requirements

Please see the auto-generated `requirements.txt` file.

## Creating the component datasets

The `data` directory contains all the information needed to download and preprocess the individual component datasets and push them to the HF datasets hub (quick links to the subdirectories for the component datasets are provided in the Details column of the table above). You can use these as a starting point if you would like to add more datasets to the mix; adding further dandisets should be particularly easy based on the current examples. When creating the component datasets, we split long sessions (>10M tokens) into smaller equal-sized chunks of no more than 10M tokens each. This makes data loading more efficient and prevents errors when creating and uploading HF datasets.
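The per-session chunking described above can be sketched as follows. The function name and example numbers are illustrative, not taken from the repo's scripts, and the sketch assumes a session always has enough time bins to split:

```python
import math

MAX_TOKENS = 10_000_000  # per-chunk token budget used when building the dataset

def split_session(n_neurons, n_bins, max_tokens=MAX_TOKENS):
    """Return per-chunk bin counts that split a session's time axis into
    near-equal chunks, each holding at most max_tokens (= neurons × bins)."""
    total = n_neurons * n_bins
    if total <= max_tokens:
        return [n_bins]
    n_chunks = math.ceil(total / max_tokens)
    base, rem = divmod(n_bins, n_chunks)
    # spread leftover bins so chunk lengths differ by at most one bin
    return [base + (1 if i < rem else 0) for i in range(n_chunks)]

# A hypothetical 50M-token session (500 units × 100,000 bins) -> 5 chunks
chunks = split_session(n_neurons=500, n_bins=100_000)
print(chunks)
```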

## Merging the component datasets into a single dataset

Once the individual component datasets have been created, we merge them into a single dataset with the `merge_datasets.py` script. This script also shuffles the combined dataset, creates a separate test split (1% of the data), and pushes the result to the HF datasets hub (note that, given the size of the dataset, the push can take several hours). If you would like to add more datasets to the mix, simply add their HF dataset repository names to the `repo_list` in `merge_datasets.py`.
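The merge-shuffle-split logic can be sketched as below. The real script operates on HF `datasets` objects and pushes to the hub; plain Python lists stand in for the datasets here, and the function name and seed are made up:

```python
import random

def merge_and_split(component_datasets, test_frac=0.01, seed=42):
    """Concatenate component datasets, shuffle them together, and carve
    out a test split holding test_frac of the examples."""
    merged = [example for ds in component_datasets for example in ds]
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    rng.shuffle(merged)
    n_test = max(1, int(len(merged) * test_frac))
    return merged[n_test:], merged[:n_test]  # (train, test)

# Three toy "component datasets" totaling 300 examples -> 1% test split = 3
train, test = merge_and_split([list(range(100)),
                               list(range(100, 250)),
                               list(range(250, 300))])
print(len(train), len(test))
```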

## Visualizing the datasets

`visualize_datasets.py` provides some basic functionality for visualizing random samples from the datasets as a sanity check:

```shell
python visualize_datasets.py --repo_name 'eminorhan/v2h' --n_examples 9
```

This will randomly sample `n_examples` examples from the corresponding dataset and visualize them as raster plots, where the x axis is time (binned into 20 ms windows) and the y axis represents the recorded units.

Users also have the option to visualize `n_examples` random examples from each component dataset by calling:

```shell
python visualize_datasets.py --plot_all --n_examples 9
```

This will save the visualizations for all component datasets in a folder called `rasters`.
