eminorhan/neural-pile-primate

License: MIT

Spiking neural activity data recorded from primates

~40B uncompressed tokens of spiking neural activity recorded from primates (tokens = neurons x time bins). Unless otherwise noted, the data consist of spike counts in 20 ms time bins for each recorded neuron.
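To make the token definition concrete, here is a minimal sketch (illustrative, not code from this repository; the function name and input format are assumptions) of binning per-neuron spike times into 20 ms count tokens with NumPy:

```python
import numpy as np

def bin_spikes(spike_times_per_neuron, bin_size=0.02):
    """Bin per-neuron spike times (in seconds) into spike counts per 20 ms bin.

    Returns an array of shape (n_neurons, n_bins); every entry is one token,
    so the session contributes n_neurons * n_bins tokens.
    """
    t_max = max(max(ts) for ts in spike_times_per_neuron if ts)
    n_bins = int(np.ceil(t_max / bin_size))
    edges = np.arange(n_bins + 1) * bin_size
    return np.stack([np.histogram(ts, bins=edges)[0] for ts in spike_times_per_neuron])

# two neurons with spikes up to 55 ms -> 3 bins each -> 6 tokens
counts = bin_spikes([[0.001, 0.015, 0.055], [0.030, 0.031]])
```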

This repository contains the code and instructions for building the dataset from scratch. The final dataset itself is hosted at this public HF repository.

The current component datasets and token counts per dataset are as follows:

| Name | Tokens | Source | Details | Species | Subjects | Sessions |
|---|---|---|---|---|---|---|
| Xiao | 17,695,820,059 | dandi:000628 | link | macaque | 13 | 679 |
| Neupane (PPC) | 7,899,849,087 | dandi:001275 | link | macaque | 2 | 10 |
| Chen | 2,720,804,574 | dandi:001435 | link | macaque | 2 | 51 |
| Card | 2,484,658,688 | dryad:dncjsxm85 | link | human | 1 | 45 |
| Willett | 1,796,119,552 | dryad:x69p8czpq | link | human | 1 | 44 |
| Churchland | 1,278,669,504 | dandi:000070 | link | macaque | 2 | 10 |
| Neupane (EC) | 911,393,376 | dandi:000897 | link | macaque | 2 | 15 |
| Kim | 804,510,741 | dandi:001357 | link | macaque | 2 | 159 |
| Even-Chen | 783,441,792 | dandi:000121 | link | macaque | 2 | 12 |
| Temmar | 781,701,792 | dandi:001201 | link | macaque | 1 | 330 |
| Papale | 775,618,560 | g-node:TVSD | link | macaque | 2 | 2 |
| Perich | 688,889,368 | dandi:000688 | link | macaque | 4 | 111 |
| Wojcik | 422,724,515 | dryad:c2fqz61kb | link | macaque | 2 | 50 |
| Makin | 375,447,744 | zenodo:3854034 | link | macaque | 2 | 47 |
| H2 | 297,332,736 | dandi:000950 | link | human | 1 | 47 |
| Lanzarini | 259,179,392 | osf:82jfr | link | macaque | 2 | 10 |
| Athalye | 101,984,317 | dandi:000404 | link | macaque | 2 | 13 |
| M1-A | 45,410,816 | dandi:000941 | link | macaque | 1 | 11 |
| M1-B | 43,809,344 | dandi:001209 | link | macaque | 1 | 12 |
| H1 | 33,686,576 | dandi:000954 | link | human | 1 | 40 |
| Moore | 30,643,839 | dandi:001062 | link | marmoset | 1 | 1 |
| Rajalingham | 14,923,100 | zenodo:13952210 | link | macaque | 2 | 2 |
| DMFC-rsg | 14,003,818 | dandi:000130 | link | macaque | 1 | 2 |
| M2 | 12,708,384 | dandi:000953 | link | macaque | 1 | 20 |
| Area2-bump | 7,394,070 | dandi:000127 | link | macaque | 1 | 2 |

Total number of tokens: 40,281,725,444

The combined dataset takes up about 40 GB on disk when stored as memory-mapped .arrow files. The HF datasets library uses .arrow files for local caching, so you will need at least this much free disk space to use it.
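The point of memory-mapping is that reads only touch the pages actually accessed, so individual samples can be drawn without pulling the full ~40 GB into RAM (for quick exploration, `datasets.load_dataset(..., streaming=True)` also avoids the local cache entirely). A minimal NumPy sketch of the memory-mapping idea, with an illustrative file name:

```python
import numpy as np

# Write a toy "dataset" to disk, then memory-map it back: slicing the memmap
# pages in only the bytes actually read, not the whole file.
arr = np.arange(12, dtype=np.int64).reshape(3, 4)
arr.tofile("toy.bin")

mm = np.memmap("toy.bin", dtype=np.int64, mode="r", shape=(3, 4))
row = np.asarray(mm[1])  # materialize just one row
```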

Requirements

Please see the auto-generated requirements.txt file.

Creating the component datasets

The data directory contains all the information needed to download and preprocess the individual component datasets and push them to the HF datasets hub (quick links to the subdirectories for the component datasets are provided in the Details column of the table above). You can use these as a starting point if you would like to add more datasets to the mix; adding further dandisets should be particularly easy based on the current examples. When creating the component datasets, we split long sessions (>10M tokens) into smaller, equal-sized chunks of no more than 10M tokens each. This makes data loading more efficient and prevents errors when creating and uploading HF datasets.
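The chunking rule above can be sketched as follows (illustrative code, not the repository's implementation; it assumes a session is a 2D neurons x time-bins array and splits along the time axis into near-equal chunks under the token cap):

```python
import numpy as np

def split_session(spike_counts, max_tokens=10_000_000):
    """Split a (neurons, time_bins) session along time into near-equal chunks,
    each holding at most max_tokens tokens (tokens = neurons * time_bins)."""
    n_neurons, n_bins = spike_counts.shape
    max_cols = max(1, max_tokens // n_neurons)   # time bins allowed per chunk
    n_chunks = int(np.ceil(n_bins / max_cols))
    return np.array_split(spike_counts, n_chunks, axis=1)

# toy session: 4 neurons x 10 bins = 40 tokens, capped at 15 tokens per chunk
chunks = split_session(np.zeros((4, 10), dtype=np.int64), max_tokens=15)
```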

Merging the component datasets into a single dataset

Once we have created the individual component datasets, we merge them into a single dataset with the merge_datasets.py script. This also shuffles the combined dataset, creates a separate test split (1% of the data), and pushes the dataset to the HF datasets hub. If you would like to add more datasets to the mix, simply add their HF dataset repository names to the repo_list in merge_datasets.py.
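The actual script operates on HF datasets (e.g. via concatenate_datasets and train_test_split), but the merge/shuffle/split logic amounts to the following pure-Python sketch (function name and seed are illustrative):

```python
import random

def merge_and_split(component_datasets, test_frac=0.01, seed=0):
    """Concatenate component datasets, shuffle, and hold out a test split."""
    merged = [example for ds in component_datasets for example in ds]
    random.Random(seed).shuffle(merged)          # deterministic shuffle
    n_test = max(1, int(len(merged) * test_frac))
    return {"test": merged[:n_test], "train": merged[n_test:]}

# 100 toy examples -> 1 test example (1%), 99 train examples
splits = merge_and_split([list(range(50)), list(range(50, 100))])
```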

Visualizing the datasets

visualize_datasets.py provides some basic functionality for visualizing random samples from the datasets as a sanity check:

python visualize_datasets.py --repo_name 'eminorhan/xiao' --n_examples 9

This will randomly sample n_examples examples from the corresponding dataset and visualize them as rasters, where the x axis is time (binned into 20 ms windows) and the y axis represents the recorded units.

Users also have the option to visualize n_examples random examples from each component dataset by calling:

python visualize_datasets.py --plot_all --n_examples 9

This will save the visualizations for all component datasets in a folder called rasters.
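A raster of one sample can be reproduced with a few lines of matplotlib (a standalone sketch using toy Poisson data, not the plotting code from visualize_datasets.py):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
counts = rng.poisson(0.1, size=(32, 200))  # toy sample: 32 units x 200 bins

fig, ax = plt.subplots(figsize=(8, 2))
ax.imshow(counts, aspect="auto", interpolation="none", cmap="Greys")
ax.set_xlabel("time bin (20 ms)")
ax.set_ylabel("unit")
fig.savefig("raster.png", dpi=150)
plt.close(fig)
```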

Extracting motifs

For a more fine-grained analysis of the data, I also wrote a simple script, extract_motifs.py, that extracts motifs from the data and keeps track of their statistics over the whole dataset. Given a particular motif or patch size (p_n, p_t), e.g. (1, 10), i.e. 1 neuron and 10 time bins, this script extracts all unique motifs of this size over the entire dataset, together with their frequency of occurrence. It also visualizes the most common motifs in a figure (blue denotes 0, red denotes 1).

For (1, 10) motifs, ~28M unique motifs are instantiated over the whole dataset (out of a maximum possible of ~4B unique motifs of this size). The "silent" motif (all zeros) dominates the dataset with something like ~2B occurrences overall, distantly followed by various single spike motifs, followed by the "all ones" motif (111...1), followed by various two spike motifs, etc., as shown above. More examples can be found in the motifs folder.
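Motif counting of this kind reduces to sliding a (p_n, p_t) window over each session and tallying the patches. A minimal sketch (illustrative, assuming overlapping windows over the raw spike counts, which is not specified above):

```python
from collections import Counter
import numpy as np

def count_motifs(spike_counts, p_n=1, p_t=10):
    """Tally every (p_n x p_t) patch ("motif") in a (neurons, time_bins) array."""
    n, t = spike_counts.shape
    motifs = Counter()
    for i in range(n - p_n + 1):
        for j in range(t - p_t + 1):
            patch = spike_counts[i:i + p_n, j:j + p_t]
            motifs[tuple(patch.ravel().tolist())] += 1  # hashable key
    return motifs

# a silent toy session: every (1, 10) patch is the all-zeros motif
motifs = count_motifs(np.zeros((2, 20), dtype=np.int64))
```

Over a real dataset one would aggregate the per-session counters; `Counter` supports this directly via `update`.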
