A prototype to test how fast metadata and specific ecephys unit data can be accessed from an NWB file

bjhardcastle/lazynwb

lazynwb

Purpose

1. Make NWB table access faster and/or consume less memory by reading only the data required, when it's needed

As of 2025 and pynwb==3.0, there are a couple of ways to access data stored in an NWB file as a DynamicTable (e.g. trials, units):

  • get the pandas dataframe for the table and access the desired column
  • or access specific columns as arrays from disk

The NWB schema for the units table includes columns for list-like or nested data, such as spike_times, waveform_mean and waveform_sd, which can become large for Neuropixels probes and are often not needed in their entirety for analysis. Reading the entire table into memory may be unnecessary and, especially when the NWB files are stored in the cloud, can be slow.

Accessing individual columns as arrays, on the other hand, means we no longer have the convenience of a DataFrame.

Ideally, we would filter our table based on metrics in some columns, then access the larger columns for the filtered subset of rows, seamlessly with a single command.

To this end, lazynwb.scan_nwb() provides a polars.LazyFrame interface to NWB tables, which supports both predicate pushdown (filters are applied as the data is read) and projection (only the selected columns are read).

It also supports reading multiple NWB files in one operation, producing a concatenated table:

>>> import lazynwb
>>> import polars as pl

>>> (
  lazynwb.scan_nwb(
    [nwb_path_0, nwb_path_1, ...],  # single path or iterable
    table_path='/units',             # or '/intervals/trials' etc
  )
  .filter(
    pl.col('activity_drift') <= 0.2,
    pl.col('amplitude_cutoff') <= 0.1,
    pl.col('presence_ratio') >= 0.7,
    pl.col('isi_violations_ratio') <= 0.5,
    pl.col('decoder_label') != 'noise',
  )
  .select('unit_id', 'location', 'spike_times', '_nwb_path', '_table_index')
  # _nwb_path and _table_index are not columns in the NWB table: they're added to identify the source of each row in a table that spans multiple NWBs
  .collect()  # a LazyFrame does no work until it is collected
)
shape: (101, 4)
┌─────────┬─────────────────────────────────┬─────────────────────────────────┬──────────────┐
│ unit_id ┆ spike_times                     ┆ _nwb_path                       ┆ _table_index │
│ ---     ┆ ---                             ┆ ---                             ┆ ---          │
│ i64     ┆ list[f64]                       ┆ str                             ┆ u32          │
╞═════════╪═════════════════════════════════╪═════════════════════════════════╪══════════════╡
│ 193     ┆ [2722.628735, 2723.620493, … 4… ┆ /data/ecephys_702960_2024-03-1… ┆ 5            │
│ 23      ┆ [1784.801304, 1784.804037, … 3… ┆ /data/ecephys_725805_2024-07-1… ┆ 4            │
│ 0       ┆ [9.2712e6, 9.2712e6, … 9.2731e… ┆ /data/ecephys_737812_2024-08-0… ┆ 0            │
│ 300     ┆ [9.2713e6, 9.2714e6, … 9.2731e… ┆ /data/ecephys_737812_2024-08-0… ┆ 6            │
│ 19      ┆ [6115.424355, 6116.428649, … 7… ┆ /data/ecephys_702960_2024-03-1… ┆ 5            │
│ …       ┆ …                               ┆ …                               ┆ …            │
│ 437     ┆ [581.476385, 598.829113, … 331… ┆ /data/ecephys_666859_2023-06-1… ┆ 40           │
│ 439     ┆ [929.656482, 1134.993272, … 33… ┆ /data/ecephys_666859_2023-06-1… ┆ 41           │
│ 446     ┆ [626.940861, 661.785209, … 331… ┆ /data/ecephys_666859_2023-06-1… ┆ 42           │
│ 449     ┆ [618.939192, 618.991564, … 331… ┆ /data/ecephys_666859_2023-06-1… ┆ 43           │
│ 609     ┆ [594.415999, 646.51812, … 3312… ┆ /data/ecephys_666859_2023-06-1… ┆ 44           │
└─────────┴─────────────────────────────────┴─────────────────────────────────┴──────────────┘
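
The idea behind this query can be illustrated without any NWB files. The following is a toy, pure-Python sketch and not lazynwb's implementation: `ToyTable`, its `read_column` method, the `filtered_spike_times` helper and all of the values are hypothetical stand-ins. It evaluates the filter on the small metric columns first, then reads the large `spike_times` column only for the rows that pass, and counts how many values were "read" from each column.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for an on-disk table: columns are read on demand."""

    columns: dict[str, list]
    values_read: dict[str, int] = field(default_factory=dict)

    def read_column(self, name: str, rows: list[int] | None = None) -> list:
        """Read one column, optionally only at the given row indices."""
        data = self.columns[name]
        selected = data if rows is None else [data[i] for i in rows]
        # Track how many values were "read from disk" for each column
        self.values_read[name] = self.values_read.get(name, 0) + len(selected)
        return selected


def filtered_spike_times(
    table: ToyTable, max_cutoff: float, min_presence: float
) -> dict[int, list[float]]:
    # Step 1: read only the small metric columns needed by the predicate
    cutoff = table.read_column("amplitude_cutoff")
    presence = table.read_column("presence_ratio")
    keep = [
        i
        for i, (c, p) in enumerate(zip(cutoff, presence))
        if c <= max_cutoff and p >= min_presence
    ]
    # Step 2: read the remaining columns for the surviving rows only
    unit_ids = table.read_column("unit_id", rows=keep)
    spikes = table.read_column("spike_times", rows=keep)
    return dict(zip(unit_ids, spikes))


table = ToyTable(
    columns={
        "unit_id": [0, 1, 2, 3],
        "amplitude_cutoff": [0.05, 0.30, 0.02, 0.08],
        "presence_ratio": [0.9, 0.95, 0.5, 0.8],
        "spike_times": [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6], [0.7]],
        "waveform_mean": [[0.0] * 100] * 4,  # large column that is never requested
    }
)
result = filtered_spike_times(table, max_cutoff=0.1, min_presence=0.7)
print(result)             # spike times for the units that passed the filter
print(table.values_read)  # no entry for 'waveform_mean': it was never read
```

In this sketch only two of the four `spike_times` entries are read, and `waveform_mean` is not touched at all, which is the saving that filtering before loading the large columns is meant to provide.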

2. Quickly provide a summary of the metadata for all NWB files in a project

>>> lazynwb.get_metadata_df(nwb_paths, as_polars=True)
Getting metadata: 100%|█████████████████████| 252/252 [00:17<00:00, 14.51file/s]
shape: (252, 28)
┌────────────┬────────────┬───────────┬───────────┬───┬────────┬───────────┬───────────┬───────────┐
│ identifier ┆ session_st ┆ session_i ┆ session_d ┆ … ┆ weight ┆ strain    ┆ date_of_b ┆ _nwb_path │
│ ---        ┆ art_time   ┆ d         ┆ escriptio ┆   ┆ ---    ┆ ---       ┆ irth      ┆ ---       │
│ str        ┆ ---        ┆ ---       ┆ n         ┆   ┆ null   ┆ str       ┆ ---       ┆ str       │
│            ┆ datetime[μ ┆ str       ┆ ---       ┆   ┆        ┆           ┆ datetime[ ┆           │
│            ┆ s, UTC]    ┆           ┆ str       ┆   ┆        ┆           ┆ μs, UTC]  ┆           │
╞════════════╪════════════╪═══════════╪═══════════╪═══╪════════╪═══════════╪═══════════╪═══════════╡
│ 0514cf12-2 ┆ 2024-08-07 ┆ 713655_20 ┆ ecephys   ┆ … ┆ null   ┆ Sst-IRES- ┆ 2023-11-2 ┆ /data/dyn │
│ 41f-4ab2-a ┆ 19:03:44   ┆ 24-08-07  ┆ session   ┆   ┆        ┆ Cre;Ai32  ┆ 3         ┆ amicrouti │
│ ce9-1c2619 ┆ UTC        ┆           ┆ (day 3)   ┆   ┆        ┆           ┆ 08:00:00  ┆ ng_datacu │
│ …          ┆            ┆           ┆ with b…   ┆   ┆        ┆           ┆ UTC       ┆ be_…      │
│ 5c032dff-e ┆ 2024-12-06 ┆ 743199_20 ┆ ecephys   ┆ … ┆ null   ┆ VGAT-ChR2 ┆ 2024-05-1 ┆ /data/dyn │
│ 04f-4884-9 ┆ 19:06:17   ┆ 24-12-06  ┆ session   ┆   ┆        ┆ -YFP      ┆ 8         ┆ amicrouti │
│ 85d-055ac7 ┆ UTC        ┆           ┆ (day 4)   ┆   ┆        ┆           ┆ 07:00:00  ┆ ng_datacu │
│ …          ┆            ┆           ┆ with b…   ┆   ┆        ┆           ┆ UTC       ┆ be_…      │
│ 4a7e9fdb-4 ┆ 2022-09-27 ┆ 636397_20 ┆ ecephys   ┆ … ┆ null   ┆ C57BL6J(N ┆ 2022-06-0 ┆ /data/dyn │
│ fab-4052-a ┆ 18:36:50   ┆ 22-09-27  ┆ session   ┆   ┆        ┆ P)        ┆ 2         ┆ amicrouti │
│ 7fc-f2d109 ┆ UTC        ┆           ┆ (day 2)   ┆   ┆        ┆           ┆ 07:00:00  ┆ ng_datacu │
│ …          ┆            ┆           ┆ with b…   ┆   ┆        ┆           ┆ UTC       ┆ be_…      │
│ 9b4aab77-5 ┆ 2025-01-16 ┆ 744279_20 ┆ ecephys   ┆ … ┆ null   ┆ Sst-IRES- ┆ 2024-05-2 ┆ /data/dyn │
│ 021-43f3-9 ┆ 22:01:37   ┆ 25-01-16  ┆ session   ┆   ┆        ┆ Cre;Ai32  ┆ 5         ┆ amicrouti │
│ f18-b13291 ┆ UTC        ┆           ┆ (day 4)   ┆   ┆        ┆           ┆ 07:00:00  ┆ ng_datacu │
│ …          ┆            ┆           ┆ with b…   ┆   ┆        ┆           ┆ UTC       ┆ be_…      │
│ b0ba34cb-4 ┆ 2024-04-22 ┆ 706401_20 ┆ ecephys   ┆ … ┆ null   ┆ Sst-IRES- ┆ 2023-10-0 ┆ /data/dyn │
...
│ 971-495d-b ┆ 19:18:59   ┆ 25-03-18  ┆ session   ┆   ┆        ┆ -YFP      ┆ 6         ┆ amicrouti │
│ 6ed-dc7b08 ┆ UTC        ┆           ┆ (day 1)   ┆   ┆        ┆           ┆ 07:00:00  ┆ ng_datacu │
│ …          ┆            ┆           ┆ withou…   ┆   ┆        ┆           ┆ UTC       ┆ be_…      │
└────────────┴────────────┴───────────┴───────────┴───┴────────┴───────────┴───────────┴───────────┘

3. Quickly provide a summary of the contents of a single NWB file

>>> lazynwb.get_internal_paths(nwb_paths[0])
{
  '/acquisition/frametimes_eye_camera/timestamps': <HDF5 dataset "timestamps": shape (267399,), type "<f8">,
  '/acquisition/frametimes_front_camera/timestamps': <HDF5 dataset "timestamps": shape (267204,), type "<f8">,
  '/acquisition/frametimes_side_camera/timestamps': <HDF5 dataset "timestamps": shape (267374,), type "<f8">,
  '/acquisition/lick_sensor_events/data': <HDF5 dataset "data": shape (2734,), type "<f8">,
  '/acquisition/lick_sensor_events/timestamps': <HDF5 dataset "timestamps": shape (2734,), type "<f8">,
  '/intervals/aud_rf_mapping_trials': <HDF5 group "/intervals/aud_rf_mapping_trials" (10 members)>,
  '/intervals/epochs': <HDF5 group "/intervals/epochs" (9 members)>,
  '/intervals/performance': <HDF5 group "/intervals/performance" (21 members)>,
  '/intervals/trials': <HDF5 group "/intervals/trials" (48 members)>,
  '/intervals/vis_rf_mapping_trials': <HDF5 group "/intervals/vis_rf_mapping_trials" (12 members)>,
  '/processing/behavior/dlc_eye_camera': <HDF5 group "/processing/behavior/dlc_eye_camera" (110 members)>,
  '/processing/behavior/eye_tracking': <HDF5 group "/processing/behavior/eye_tracking" (26 members)>,
  '/processing/behavior/facemap_front_camera/data': <HDF5 dataset "data": shape (267204, 500), type "<f4">,
  '/processing/behavior/facemap_front_camera/timestamps': <HDF5 dataset "timestamps": shape (267204,), type "<f8">,
  '/processing/behavior/facemap_side_camera/data': <HDF5 dataset "data": shape (267374, 500), type "<f4">,
  '/processing/behavior/facemap_side_camera/timestamps': <HDF5 dataset "timestamps": shape (267374,), type "<f8">,
  '/processing/behavior/licks/data': <HDF5 dataset "data": shape (2707,), type "<f8">,
  '/processing/behavior/licks/timestamps': <HDF5 dataset "timestamps": shape (2707,), type "<f8">,
  '/processing/behavior/lp_front_camera': <HDF5 group "/processing/behavior/lp_front_camera" (57 members)>,
  '/processing/behavior/lp_side_camera': <HDF5 group "/processing/behavior/lp_side_camera" (57 members)>,
  '/processing/behavior/quiescent_interval_violations/timestamps': <HDF5 dataset "timestamps": shape (131,), type "<f8">,
  '/processing/behavior/rewards/timestamps': <HDF5 dataset "timestamps": shape (130,), type "<f8">,
  '/processing/behavior/running_speed/data': <HDF5 dataset "data": shape (251998,), type "<f8">,
  '/processing/behavior/running_speed/timestamps': <HDF5 dataset "timestamps": shape (251998,), type "<f8">
 }
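
Keys such as `/intervals/trials` have the same form as the `table_path` argument used elsewhere in this README, so this mapping is a convenient place to look for candidate tables. As a small sketch, assuming only that the result behaves like a dict keyed by internal path, the paths can be grouped by their top-level group. The `internal_paths` dict below is a hand-written stand-in with placeholder values, and `group_by_top_level` is a hypothetical helper, not part of lazynwb.

```python
from collections import defaultdict


def group_by_top_level(paths):
    """Group internal paths by their first component, e.g. 'intervals'."""
    groups = defaultdict(list)
    for path in sorted(paths):
        top_level = path.strip("/").split("/")[0]
        groups[top_level].append(path)
    return dict(groups)


# Stand-in for the mapping shown above; placeholder values replace the
# HDF5 dataset/group objects that appear in the real output
internal_paths = {
    "/acquisition/lick_sensor_events/data": None,
    "/intervals/epochs": None,
    "/intervals/trials": None,
    "/processing/behavior/licks/data": None,
}
grouped = group_by_top_level(internal_paths)
for top_level, members in grouped.items():
    print(top_level, members)
```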

4. Get the common schema for a table in one or more NWB files

>>> lazynwb.get_table_schema(nwb_paths, table_path="/intervals/trials")
# uses polars (arrow) datatypes
OrderedDict([('condition', String), ('id', Int64), ('start_time', Float64), ('stop_time', Float64), ('_nwb_path', String), ('_table_path', String), ('_table_index', UInt32)])

Development

See instructions in https://github.com/bjhardcastle/lazynwb/CONTRIBUTING.md and the original template: https://github.com/bjhardcastle/copier-pdm-npc/blob/main/README.md

Notes

  • HDF5 access appears to be guarded by a mutex lock that threads spend a long time waiting to acquire (observed with remfile)
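
As a rough illustration of this effect, and not a measurement of h5py or remfile, the sketch below has several threads perform simulated reads that all pass through one shared lock. The lock, the `simulated_read` function and the sleep duration are assumptions made for the example. It records how many threads were ever inside the read at the same time; with a single shared lock that number never exceeds one, so adding threads adds waiting rather than throughput.

```python
import threading
import time

library_lock = threading.Lock()  # stand-in for a library-wide mutex
counter_lock = threading.Lock()  # protects the bookkeeping below
active_readers = 0
max_active_readers = 0


def simulated_read(duration: float = 0.01) -> None:
    """Pretend to read a chunk while holding the shared library lock."""
    global active_readers, max_active_readers
    with library_lock:
        with counter_lock:
            active_readers += 1
            max_active_readers = max(max_active_readers, active_readers)
        time.sleep(duration)  # the "read" itself
        with counter_lock:
            active_readers -= 1


threads = [threading.Thread(target=simulated_read) for _ in range(8)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"max concurrent readers: {max_active_readers}")
print(f"elapsed: {elapsed:.3f}s for 8 simulated reads of 0.01s each")
```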
