~40B uncompressed tokens of spiking neural activity data recorded from primates (tokens=neurons x time bins). Unless otherwise noted, the data consist of spike counts in 20 ms time bins recorded from each neuron.
This repository contains the code and instructions for building the dataset from scratch. The actual final dataset is hosted at this public HF repository.
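To make the token definition above concrete, here is a minimal sketch of how spike times could be binned into 20 ms spike counts and how the token count follows from the array shape. The function name `bin_spikes` and the toy spike times are illustrative assumptions, not code from this repository.

```python
import numpy as np

BIN_SIZE_S = 0.020  # 20 ms bins, as described above


def bin_spikes(spike_times, duration_s, bin_size_s=BIN_SIZE_S):
    """Return a (n_neurons, n_bins) array of spike counts.

    spike_times: list with one array of spike times (in seconds) per neuron.
    This helper is hypothetical and only illustrates the binning step.
    """
    # Small tolerance guards against floating-point error in the division.
    n_bins = int(np.ceil(duration_s / bin_size_s - 1e-9))
    edges = np.arange(n_bins + 1) * bin_size_s
    return np.stack([np.histogram(t, bins=edges)[0] for t in spike_times])


# Toy example: 3 neurons recorded for 1 second with 5, 20 and 50 spikes.
rng = np.random.default_rng(0)
spike_times = [np.sort(rng.uniform(0.0, 1.0, size=n)) for n in (5, 20, 50)]
counts = bin_spikes(spike_times, duration_s=1.0)

# tokens = neurons x time bins
n_tokens = counts.shape[0] * counts.shape[1]
print(counts.shape, n_tokens)
```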
The current component datasets and token counts per dataset are as follows:
| Name | Tokens | Source | Details | Species | Subjects | Sessions |
|---|---|---|---|---|---|---|
| Xiao | 17,695,820,059 | dandi:000628 | link | macaque | 13 | 679 |
| Neupane (PPC) | 7,899,849,087 | dandi:001275 | link | macaque | 2 | 10 |
| Chen | 2,720,804,574 | dandi:001435 | link | macaque | 2 | 51 |
| Card | 2,484,658,688 | dryad:dncjsxm85 | link | human | 1 | 45 |
| Willett | 1,796,119,552 | dryad:x69p8czpq | link | human | 1 | 44 |
| Churchland | 1,278,669,504 | dandi:000070 | link | macaque | 2 | 10 |
| Neupane (EC) | 911,393,376 | dandi:000897 | link | macaque | 2 | 15 |
| Kim | 804,510,741 | dandi:001357 | link | macaque | 2 | 159 |
| Even-Chen | 783,441,792 | dandi:000121 | link | macaque | 2 | 12 |
| Temmar | 781,701,792 | dandi:001201 | link | macaque | 1 | 330 |
| Papale | 775,618,560 | g-node:TVSD | link | macaque | 2 | 2 |
| Perich | 688,889,368 | dandi:000688 | link | macaque | 4 | 111 |
| Wojcik | 422,724,515 | dryad:c2fqz61kb | link | macaque | 2 | 50 |
| Makin | 375,447,744 | zenodo:3854034 | link | macaque | 2 | 47 |
| H2 | 297,332,736 | dandi:000950 | link | human | 1 | 47 |
| Lanzarini | 259,179,392 | osf:82jfr | link | macaque | 2 | 10 |
| Athalye | 101,984,317 | dandi:000404 | link | macaque | 2 | 13 |
| M1-A | 45,410,816 | dandi:000941 | link | macaque | 1 | 11 |
| M1-B | 43,809,344 | dandi:001209 | link | macaque | 1 | 12 |
| H1 | 33,686,576 | dandi:000954 | link | human | 1 | 40 |
| Moore | 30,643,839 | dandi:001062 | link | marmoset | 1 | 1 |
| Rajalingham | 14,923,100 | zenodo:13952210 | link | macaque | 2 | 2 |
| DMFC-rsg | 14,003,818 | dandi:000130 | link | macaque | 1 | 2 |
| M2 | 12,708,384 | dandi:000953 | link | macaque | 1 | 20 |
| Area2-bump | 7,394,070 | dandi:000127 | link | macaque | 1 | 2 |
Total number of tokens: 40,281,725,444
The combined dataset takes up about 40 GB on disk when stored as memory-mapped .arrow files. The HF datasets library uses .arrow files for local caching, so you will need at least this much free disk space to use it.
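As a quick check before downloading, the sketch below compares the free space at the cache location against the ~40 GB figure above. It assumes the default HF datasets cache location (`~/.cache/huggingface/datasets`, overridable through the `HF_DATASETS_CACHE` environment variable); the helper name `free_gb` is hypothetical.

```python
import os
import shutil

REQUIRED_GB = 40  # approximate on-disk size of the combined dataset


def free_gb(path):
    """Free space (in GB) on the filesystem that holds `path`."""
    path = os.path.abspath(os.path.expanduser(path))
    # The cache directory may not exist yet, so walk up to an existing parent.
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return shutil.disk_usage(path).free / 1e9


# Assumed cache location; adjust if your setup stores the cache elsewhere.
cache_dir = os.environ.get("HF_DATASETS_CACHE", "~/.cache/huggingface/datasets")
enough = free_gb(cache_dir) >= REQUIRED_GB
print(f"free: {free_gb(cache_dir):.1f} GB, enough for the dataset: {enough}")
```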
For the required packages, please see the auto-generated requirements.txt file.
The data directory contains all the information needed to download and preprocess the individual component datasets and push them to the HF datasets hub (quick links to the subdirectories for component datasets are provided in the Details column in the table above). You can use these as a starting point if you would like to add more datasets to the mix. Adding further dandisets should be particularly easy based on the current examples.

When creating the component datasets, we split long sessions (>10M tokens) into smaller equal-sized chunks of no more than 10M tokens. This makes data loading more efficient and prevents errors while creating and uploading HF datasets.
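The chunking rule described above could be sketched roughly as follows. This is an illustration under assumptions, not the repository's preprocessing code: it assumes each session is a `(n_neurons, n_bins)` array that is split along the time axis, and the function name `split_session` is hypothetical. `np.array_split` yields chunks whose lengths differ by at most one time bin.

```python
import numpy as np

MAX_TOKENS = 10_000_000  # chunk size limit described above


def split_session(counts, max_tokens=MAX_TOKENS):
    """Split a (n_neurons, n_bins) array along time into near-equal chunks,
    each holding at most `max_tokens` tokens (neurons x time bins)."""
    n_neurons, n_bins = counts.shape
    if n_neurons * n_bins <= max_tokens:
        return [counts]
    max_bins = max_tokens // n_neurons  # time bins that fit in one chunk
    if max_bins < 1:
        raise ValueError("a single time bin already exceeds max_tokens")
    n_chunks = -(-n_bins // max_bins)  # integer ceiling division
    return np.array_split(counts, n_chunks, axis=1)


# Toy example with a small limit so the split is easy to inspect.
session = np.arange(4 * 25).reshape(4, 25)
chunks = split_session(session, max_tokens=40)
print([c.shape for c in chunks])
```

Computing the number of chunks from the per-chunk bin budget (rather than from the total token count) keeps every chunk under the limit even when the number of bins does not divide evenly.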
Once we have created the individual component datasets, we merge them into a single dataset with the merge_datasets.py script. This also shuffles the combined dataset, creates a separate test split (1% of the data), and pushes the dataset to the HF datasets hub. If you would like to add more datasets to the mix, simply add their HF dataset repository names to the repo_list in merge_datasets.py.
visualize_datasets.py provides basic functionality to visualize random samples from the datasets as a sanity check:
    python visualize_datasets.py --repo_name 'eminorhan/xiao' --n_examples 9

This will randomly sample n_examples examples from the corresponding dataset and visualize them as below, where x is the time axis (binned into 20 ms windows) and the y axis represents the recorded units:
Users also have the option to visualize n_examples random examples from each component dataset by calling:
    python visualize_datasets.py --plot_all --n_examples 9

This will save the visualizations for all component datasets in a folder called rasters as in here.
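For readers who want to plot their own samples, a raster grid of the kind described above can be sketched with matplotlib. This is an independent illustration on synthetic Poisson data, not the code in the visualization script; the function name `plot_rasters` and its parameters are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_rasters(samples, n_cols=3, bin_ms=20):
    """Plot each (n_units, n_bins) spike-count array as one panel of a grid."""
    n_rows = int(np.ceil(len(samples) / n_cols))
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows), squeeze=False
    )
    for ax, counts in zip(axes.ravel(), samples):
        ax.imshow(counts, aspect="auto", interpolation="nearest", cmap="gray_r")
        ax.set_xlabel(f"time ({bin_ms} ms bins)")
        ax.set_ylabel("unit")
    # Hide any unused panels when len(samples) is not a multiple of n_cols.
    for ax in axes.ravel()[len(samples):]:
        ax.axis("off")
    fig.tight_layout()
    return fig


# Synthetic stand-in for 9 random samples: 32 units x 100 time bins each.
rng = np.random.default_rng(0)
samples = [rng.poisson(0.2, size=(32, 100)) for _ in range(9)]
fig = plot_rasters(samples)
```

Calling `fig.savefig(...)` on the returned figure writes the grid to disk.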
For a more fine-grained analysis of the data, I also wrote a simple script in extract_motifs.py that extracts motifs from the data and keeps track of their statistics over the whole dataset. Given a particular motif or patch size (p_n, p_t), e.g. (1, 10), i.e. 1 neuron and 10 time bins, this script will extract all unique motifs of this size over the entire dataset together with their frequency of occurrence. This script will also visualize the most common motifs in a figure like the following (blue is 0, red is 1 in this figure):
For (1, 10) motifs, ~28M unique motifs occur over the whole dataset (out of a maximum possible of ~4B unique motifs of this size). The "silent" motif (all zeros) dominates the dataset with ~2B occurrences overall, distantly followed by various single-spike motifs, then the "all ones" motif (111...1), then various two-spike motifs, etc., as shown above. More examples can be found in the motifs folder.
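The idea behind motif counting can be sketched in a few lines of NumPy. This is a simplified illustration rather than the extract_motifs.py implementation: it assumes non-overlapping patches, drops any ragged edge that does not fill a whole patch, and works on a single in-memory array; the function name `count_motifs` is hypothetical.

```python
from collections import Counter

import numpy as np


def count_motifs(counts, p_n=1, p_t=10):
    """Count non-overlapping (p_n, p_t) patches in a (n_neurons, n_bins) array.

    Returns a Counter mapping each flattened motif (as a tuple) to its frequency.
    """
    n, t = counts.shape
    n_use, t_use = (n // p_n) * p_n, (t // p_t) * p_t  # drop ragged edges
    patches = (
        counts[:n_use, :t_use]
        .reshape(n_use // p_n, p_n, t_use // p_t, p_t)
        .transpose(0, 2, 1, 3)  # group the two patch axes together
        .reshape(-1, p_n * p_t)  # one row per patch
    )
    motifs, freqs = np.unique(patches, axis=0, return_counts=True)
    return Counter({tuple(m.tolist()): int(f) for m, f in zip(motifs, freqs)})


# Toy example: 2 neurons, 30 time bins, a single spike at (neuron 0, bin 3).
toy = np.zeros((2, 30), dtype=np.uint8)
toy[0, 3] = 1
motif_counts = count_motifs(toy, p_n=1, p_t=10)
print(motif_counts.most_common(2))
```

Accumulating such counters across sessions (for example with `Counter.update`) would give dataset-wide statistics of the kind reported above.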
