Spiking neural activity data recorded from primates

~40B uncompressed tokens of spiking neural activity data recorded from primates (tokens = neurons × time bins). Unless otherwise noted, the data consist of spike counts in 20 ms time bins recorded from each neuron.
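To make the token definition concrete, here is a minimal sketch of how spike times could be binned into 20 ms spike counts and how the token count follows from the resulting array shape. The function name `bin_spikes`, the millisecond input format, and the toy spike times are assumptions made for illustration; the preprocessing code in this repository may do this differently.

```python
def bin_spikes(spike_times_ms, duration_ms, bin_ms=20):
    """Return a neurons x time-bins list of spike counts.

    Hypothetical helper for illustration only (not part of this repo).
    spike_times_ms: one list of spike times (in ms) per neuron.
    duration_ms: integer recording length in ms.
    """
    n_bins = -(-duration_ms // bin_ms)  # ceiling division
    counts = [[0] * n_bins for _ in spike_times_ms]
    for i, times in enumerate(spike_times_ms):
        for t in times:
            # Spikes outside [0, duration_ms) are ignored.
            if 0 <= t < duration_ms:
                counts[i][int(t // bin_ms)] += 1
    return counts


# Toy example: 2 neurons recorded for 100 ms.
toy_spike_times = [
    [3.0, 12.5, 45.0],   # neuron 0
    [19.9, 20.0, 99.0],  # neuron 1
]
toy_counts = bin_spikes(toy_spike_times, duration_ms=100)

# tokens = neurons x time bins
n_tokens = len(toy_counts) * len(toy_counts[0])
```

Here `toy_counts` has 2 rows (neurons) and 5 columns (20 ms bins), so this toy recording contributes 10 tokens.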

This repository contains the code and instructions for building the dataset from scratch. The actual final dataset is hosted at this public HF repository.

The current component datasets and token counts per dataset are as follows:

Name	Tokens	Source	Details	Species	Subjects	Sessions
Xiao	17,695,820,059	dandi:000628	link	macaque	13	679
Neupane (PPC)	7,899,849,087	dandi:001275	link	macaque	2	10
Chen	2,720,804,574	dandi:001435	link	macaque	2	51
Card	2,484,658,688	dryad:dncjsxm85	link	human	1	45
Willett	1,796,119,552	dryad:x69p8czpq	link	human	1	44
Churchland	1,278,669,504	dandi:000070	link	macaque	2	10
Neupane (EC)	911,393,376	dandi:000897	link	macaque	2	15
Kim	804,510,741	dandi:001357	link	macaque	2	159
Even-Chen	783,441,792	dandi:000121	link	macaque	2	12
Temmar	781,701,792	dandi:001201	link	macaque	1	330
Papale	775,618,560	g-node:TVSD	link	macaque	2	2
Perich	688,889,368	dandi:000688	link	macaque	4	111
Wojcik	422,724,515	dryad:c2fqz61kb	link	macaque	2	50
Makin	375,447,744	zenodo:3854034	link	macaque	2	47
H2	297,332,736	dandi:000950	link	human	1	47
Lanzarini	259,179,392	osf:82jfr	link	macaque	2	10
Athalye	101,984,317	dandi:000404	link	macaque	2	13
M1-A	45,410,816	dandi:000941	link	macaque	1	11
M1-B	43,809,344	dandi:001209	link	macaque	1	12
H1	33,686,576	dandi:000954	link	human	1	40
Moore	30,643,839	dandi:001062	link	marmoset	1	1
Rajalingham	14,923,100	zenodo:13952210	link	macaque	2	2
DMFC-rsg	14,003,818	dandi:000130	link	macaque	1	2
M2	12,708,384	dandi:000953	link	macaque	1	20
Area2-bump	7,394,070	dandi:000127	link	macaque	1	2

Total number of tokens: 40,281,725,444

The combined dataset takes up about 40 GB on disk when stored as memory-mapped .arrow files. The HF datasets library uses .arrow files for local caching, so you will need at least this much free disk space to use it.

Requirements

Please see the auto-generated requirements.txt file.

Creating the component datasets

The data directory contains all the information needed to download and preprocess the individual component datasets and push them to the HF datasets hub (quick links to the subdirectories for component datasets are provided in the Details column in the table above). You can use these as a starting point if you would like to add more datasets to the mix. Adding further dandisets should be particularly easy based on the current examples. When creating the component datasets, we split long sessions (>10M tokens) into smaller equal-sized chunks of no more than 10M tokens each. This makes data loading more efficient and prevents errors while creating and uploading HF datasets.
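The session-splitting rule can be sketched as follows. This is only an illustration under stated assumptions: that sessions are split along the time axis, and that "equal-sized" means chunk lengths differ by at most one time bin. The function name `chunk_bounds` is hypothetical and does not come from this repository.

```python
def chunk_bounds(n_neurons, n_bins, max_tokens=10_000_000):
    """Return (start, stop) time-bin ranges for near-equal chunks.

    Hypothetical helper for illustration only. Each chunk holds at most
    max_tokens tokens, where tokens = n_neurons * (stop - start).
    """
    if n_neurons > max_tokens:
        raise ValueError("a single time bin already exceeds max_tokens")
    max_bins = max_tokens // n_neurons         # longest allowed chunk, in bins
    n_chunks = max(1, -(-n_bins // max_bins))  # ceiling division
    base, extra = divmod(n_bins, n_chunks)
    bounds, start = [], 0
    for k in range(n_chunks):
        # The first `extra` chunks get one additional bin.
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


# A 100-neuron session with 250,000 time bins (25M tokens) becomes 3 chunks.
example_bounds = chunk_bounds(n_neurons=100, n_bins=250_000)
```

Computing the chunk count from the maximum number of bins per chunk, rather than from the total token count, guarantees that no chunk exceeds the limit even when the session length does not divide evenly.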

Merging the component datasets into a single dataset

Once we have created the individual component datasets, we merge them into a single dataset with the merge_datasets.py script. This also shuffles the combined dataset, creates a separate test split (1% of the data), and pushes the dataset to the HF datasets hub. If you would like to add more datasets to the mix, simply add their HF dataset repository names to the repo_list in merge_datasets.py.
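The shuffle-and-split step can be illustrated with a small standard-library sketch. This is not the logic of merge_datasets.py, which works with HF datasets; the function name `shuffle_and_split`, the fixed seed, and the rounding of the test-set size are assumptions made for illustration.

```python
import random


def shuffle_and_split(rows, test_fraction=0.01, seed=0):
    """Shuffle rows and hold out a fraction as a test split.

    Hypothetical helper for illustration only (not part of this repo).
    Returns (train_rows, test_rows).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    # Hold out at least one row whenever there is any data at all.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy example: 1000 merged rows, 1% held out for testing.
train_rows, test_rows = shuffle_and_split(range(1000))
```

Shuffling before splitting means the test split draws from all component datasets rather than only from whichever dataset happened to be merged last.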

Visualizing the datasets

visualize_dataset.py provides basic functionality to visualize random samples from the datasets as a sanity check:

python visualize_dataset.py --repo_name 'eminorhan/xiao' --n_examples 9

This will randomly sample n_examples examples from the corresponding dataset and visualize them as below, where the x axis represents time (binned into 20 ms windows) and the y axis represents the recorded units:

Users also have the option to visualize n_examples random examples from each component dataset by calling:

python visualize_dataset.py --plot_all --n_examples 9

This will save the visualizations for all component datasets in a folder called rasters (see the rasters folder in this repository for examples).

Extracting motifs

For a more fine-grained analysis of the data, I also wrote a simple script in extract_motifs.py that extracts motifs from the data and keeps track of their statistics over the whole dataset. Given a particular motif or patch size (p_n, p_t), e.g. (1, 10), i.e. 1 neuron and 10 time bins, this script will extract all unique motifs of this size over the entire dataset together with their frequency of occurrence. This script will also visualize the most common motifs in a figure like the following (blue is 0, red is 1 in this figure):

For (1, 10) motifs, ~28M unique motifs are instantiated over the whole dataset (out of a maximum possible of ~4B unique motifs of this size). The "silent" motif (all zeros) dominates the dataset with roughly 2B occurrences overall, distantly followed by various single-spike motifs, then the "all ones" motif (111...1), then various two-spike motifs, and so on, as shown above. More examples can be found in the motifs folder.
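As a rough illustration of this kind of motif counting, the sketch below tallies every (p_n, p_t) patch in a small neurons × time-bins array. It assumes non-overlapping patches, and it is not the implementation in extract_motifs.py, which may tile the data differently. The function name `count_motifs` and the toy data are hypothetical.

```python
from collections import Counter


def count_motifs(data, p_n, p_t):
    """Count non-overlapping (p_n, p_t) patches in a neurons x bins array.

    Hypothetical helper for illustration only. Each patch is flattened
    row by row into a tuple so that it can be used as a dictionary key.
    """
    counts = Counter()
    n_neurons = len(data)
    n_bins = len(data[0]) if data else 0
    # Step by the patch size in both directions; incomplete patches at
    # the edges are skipped.
    for i in range(0, n_neurons - p_n + 1, p_n):
        for j in range(0, n_bins - p_t + 1, p_t):
            motif = tuple(
                data[i + di][j + dj]
                for di in range(p_n)
                for dj in range(p_t)
            )
            counts[motif] += 1
    return counts


# Toy example: 2 neurons, 4 time bins, motifs of 1 neuron x 2 bins.
toy_data = [
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
motif_counts = count_motifs(toy_data, p_n=1, p_t=2)
most_common = motif_counts.most_common(1)
```

In this toy example the silent motif `(0, 0)` occurs twice and is the most common, while `(1, 0)` and `(1, 1)` each occur once.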
