
Commit e21a7c1: Merge pull request #475 from NeurodataWithoutBorders/enh/append2
external links
2 parents a999785 + 88829e5

27 files changed: +1824 −239 lines

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
'''
Advanced HDF5 I/O
=====================

The HDF5 storage backend supports a broad range of advanced dataset I/O options, such as
chunking and compression. Here we demonstrate how to use these features
from PyNWB.
'''

####################
# Wrapping data arrays with :py:meth:`~pynwb.form.backends.hdf5.h5_utils.H5DataIO`
# ---------------------------------------------------------------------------------
#
# In order to customize the I/O of datasets using the HDF5 I/O backend, we simply need to wrap our datasets
# using :py:meth:`~pynwb.form.backends.hdf5.h5_utils.H5DataIO`. Using H5DataIO allows us to keep the Container
# classes independent of the I/O backend while still allowing us to customize HDF5-specific I/O features.
#
# Before we get started, let's create an NWBFile for testing so that we can add our data to it.
#
from datetime import datetime
from pynwb import NWBFile

start_time = datetime(2017, 4, 3, 11, 0, 0)
create_date = datetime(2017, 4, 15, 12, 0, 0)

nwbfile = NWBFile(source='PyNWB tutorial',
                  session_description='demonstrate advanced HDF5 I/O features',
                  identifier='NWB123',
                  session_start_time=start_time,
                  file_create_date=create_date)

####################
# Normally, when we create a TimeSeries, we would do

from pynwb import TimeSeries
import numpy as np

data = np.arange(100, 200, 10)
timestamps = np.arange(10)
test_ts = TimeSeries(name='test_regular_timeseries',
                     source='PyNWB tutorial',
                     data=data,
                     unit='SIunit',
                     timestamps=timestamps)
nwbfile.add_acquisition(test_ts)

####################
# Now let's say we want to compress the recorded data values. We simply need to wrap our data with H5DataIO.
# Everything else remains the same.

from pynwb.form.backends.hdf5.h5_utils import H5DataIO
wrapped_data = H5DataIO(data=data, compression=True)  # <----
test_ts = TimeSeries(name='test_compressed_timeseries',
                     source='PyNWB tutorial',
                     data=wrapped_data,  # <----
                     unit='SIunit',
                     timestamps=timestamps)
nwbfile.add_acquisition(test_ts)

####################
# This simple approach gives us access to a broad range of advanced I/O features, such as chunking and
# compression. For a complete list of all available settings, see :py:meth:`~pynwb.form.backends.hdf5.h5_utils.H5DataIO`.

####################
# Chunking
# --------
#
# By default, data arrays are stored *contiguously*. This means that on disk (and in memory) the elements of a
# multi-dimensional array, such as ``[[1 2] [3 4]]``, are actually stored in a one-dimensional buffer
# ``1 2 3 4``. Chunking allows us to break up our array so that it is
# stored not in one but in multiple buffers, e.g., ``[1 2] [3 4]``. This approach allows optimization
# of data locality for I/O operations and enables the application of filters (e.g., compression) on a
# per-chunk basis.

#####################
# .. tip::
#
#    For an introduction to chunking and compression in HDF5, and h5py in particular, see also the online book
#    `Python and HDF5 <https://www.safaribooksonline.com/library/view/python-and-hdf5/9781491944981/ch04.html>`__
#    by Andrew Collette.


####################
# To use chunking, we again simply need to wrap our dataset via :py:meth:`~pynwb.form.backends.hdf5.h5_utils.H5DataIO`.
# Chunking also allows us to create resizable arrays simply by defining the ``maxshape`` of the array.

data = np.arange(10000).reshape((1000, 10))
wrapped_data = H5DataIO(data=data,
                        chunks=True,         # <---- Enable chunking
                        maxshape=(None, 10)  # <---- Make the time dimension unlimited and hence resizable
                        )
test_ts = TimeSeries(name='test_chunked_timeseries',
                     source='PyNWB tutorial',
                     data=wrapped_data,  # <----
                     unit='SIunit',
                     starting_time=0.0,
                     rate=10.0)
nwbfile.add_acquisition(test_ts)


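####################
# Under the hood, ``maxshape`` maps directly to h5py's dataset-creation options. As a rough
# illustration of what resizability means at the HDF5 level, here is a minimal h5py sketch
# (independent of PyNWB, with hypothetical file and dataset names):

```python
import h5py
import numpy as np

# Create a chunked dataset whose first (time) dimension is unlimited.
with h5py.File('resizable_demo.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(2, 10), maxshape=(None, 10),
                            chunks=True, dtype='i8')
    dset[:] = np.arange(20).reshape(2, 10)
    # Grow the dataset along the unlimited time axis and write more rows.
    dset.resize((4, 10))
    dset[2:] = 0
    print(dset.shape)  # (4, 10)
```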
####################
# .. hint::
#
#    By also specifying ``fillvalue`` we can define the value that should be used when reading uninitialized
#    portions of the dataset. If no fill value has been defined, then HDF5 will use a type-appropriate default value.
#

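####################
# To see the effect of a fill value at the HDF5 level, here is a minimal h5py sketch
# (hypothetical file and dataset names): only the first element is ever written, so reads
# of the remaining elements return the fill value.

```python
import h5py

with h5py.File('fillvalue_demo.h5', 'w') as f:
    # Uninitialized elements of this dataset read back as -1.
    dset = f.create_dataset('data', shape=(5,), dtype='i4', fillvalue=-1)
    dset[0] = 42  # only element 0 is written

with h5py.File('fillvalue_demo.h5', 'r') as f:
    print(list(f['data'][:]))  # [42, -1, -1, -1, -1]
```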
####################
# .. note::
#
#    Chunking can help improve data read/write performance by allowing us to align chunks with common
#    read/write operations. The following blog post provides an example:
#    `http://geology.beer/2015/02/10/hdf-for-large-arrays/ <http://geology.beer/2015/02/10/hdf-for-large-arrays/>`__.
#    But be aware that with great power comes great responsibility: if you choose a bad chunk size,
#    e.g., chunks that are too small or that don't align with your read/write operations, then chunking
#    can also harm I/O performance.

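####################
# Instead of ``chunks=True`` (which lets the library guess), an explicit chunk shape can be
# chosen to match the expected access pattern. A small h5py-level sketch (hypothetical names),
# where one chunk covers 100 full rows to match a row-wise read pattern:

```python
import h5py
import numpy as np

data = np.arange(10000).reshape(1000, 10)
with h5py.File('chunks_demo.h5', 'w') as f:
    # Each chunk holds 100 complete rows, so row-range reads touch few chunks.
    dset = f.create_dataset('data', data=data, chunks=(100, 10))
    print(dset.chunks)  # (100, 10)
```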
####################
# Compression and Other I/O Filters
# -----------------------------------
#
# HDF5 supports I/O filters, i.e., data transformations (e.g., compression) that are applied transparently on
# read/write operations. I/O filters operate on a per-chunk basis in HDF5 and as such require the use of chunking.
# Chunking will be automatically enabled by h5py when compression or other I/O filters are enabled.
#
# To use compression, we can wrap our dataset using :py:meth:`~pynwb.form.backends.hdf5.h5_utils.H5DataIO` and
# define the appropriate options:

wrapped_data = H5DataIO(data=data,
                        compression='gzip',  # <---- Use GZip
                        compression_opts=4   # <---- Optional GZip compression level (0-9)
                        )
test_ts = TimeSeries(name='test_gzipped_timeseries',
                     source='PyNWB tutorial',
                     data=wrapped_data,  # <----
                     unit='SIunit',
                     starting_time=0.0,
                     rate=10.0)
nwbfile.add_acquisition(test_ts)

####################
# .. hint::
#
#    In addition to ``compression``, :py:meth:`~pynwb.form.backends.hdf5.h5_utils.H5DataIO` also allows us to
#    enable the ``shuffle`` and ``fletcher32`` HDF5 I/O filters.

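####################
# At the h5py level these filters look as follows (a sketch with hypothetical names):
# ``shuffle`` rearranges bytes before compression, which typically improves compression
# ratios for fixed-width numeric data, and ``fletcher32`` adds a per-chunk checksum.

```python
import h5py
import numpy as np

with h5py.File('filters_demo.h5', 'w') as f:
    dset = f.create_dataset('data', data=np.arange(1000),
                            compression='gzip',
                            shuffle=True,     # byte-shuffle filter before compression
                            fletcher32=True)  # checksum each chunk on write/read
    print(dset.shuffle, dset.fletcher32)  # True True
```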
####################
# .. note::
#
#    *h5py* (and *HDF5* more broadly) supports a number of different compression
#    algorithms, e.g., *GZIP*, *SZIP*, or *LZF* (or even custom compression filters).
#    However, only *GZIP* is built into HDF5 by default, i.e., while data compressed
#    with *GZIP* can be read on all platforms and installations of HDF5, other
#    compressors may not be available everywhere, so not all users may
#    be able to access those files.
#


####################
# Writing the data
# -----------------------
#
# Writing the data now works as usual.

from pynwb import NWBHDF5IO

io = NWBHDF5IO('advanced_io_example.nwb', 'w')
io.write(nwbfile)
io.close()

####################
# Reading the data
# ---------------------
#
# Again, nothing has changed for read. All of the above advanced I/O features are handled transparently.

io = NWBHDF5IO('advanced_io_example.nwb')
nwbfile = io.read()

####################
# Now let's have a look to confirm that all our I/O settings were indeed used.

for k, v in nwbfile.acquisition.items():
    print("name=%s, chunks=%s, compression=%s, maxshape=%s" % (k,
                                                               v.data.chunks,
                                                               v.data.compression,
                                                               v.data.maxshape))

####################
#
# .. code-block:: python
#
#    name=test_regular_timeseries, chunks=None, compression=None, maxshape=(10,)
#    name=test_compressed_timeseries, chunks=(10,), compression=gzip, maxshape=(10,)
#    name=test_chunked_timeseries, chunks=(250, 5), compression=None, maxshape=(None, 10)
#    name=test_gzipped_timeseries, chunks=(250, 5), compression=gzip, maxshape=(1000, 10)
#
# As we can see, the datasets have been chunked and compressed correctly. Also, as expected, chunking
# was automatically enabled for the compressed datasets.


####################
# Wrapping ``h5py.Datasets`` with :py:meth:`~pynwb.form.backends.hdf5.h5_utils.H5DataIO`
# ------------------------------------------------------------------------------------------------
#
# Just for completeness, :py:meth:`~pynwb.form.backends.hdf5.h5_utils.H5DataIO` also allows us to customize
# how ``h5py.Dataset`` objects should be handled on write by PyNWB's HDF5 backend via the ``link_data``
# parameter. If ``link_data`` is set to ``True``, then a ``SoftLink`` or ``ExternalLink`` will be created to
# point to the HDF5 dataset. On the other hand, if ``link_data`` is set to ``False``, then the dataset will
# be copied using `h5py.Group.copy <http://docs.h5py.org/en/latest/high/group.html#Group.copy>`__
# **without copying attributes** and **without expanding soft links, external links, or references**.
#
# .. note::
#
#    When wrapping an ``h5py.Dataset`` object using H5DataIO, all settings except ``link_data``
#    will be ignored, as the h5py.Dataset will either be linked to or copied as-is on write.
#

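####################
# The ``SoftLink``/``ExternalLink`` mechanism referred to above is plain HDF5. A minimal
# h5py sketch (hypothetical file names) of an external link pointing at a dataset that
# lives in another file:

```python
import h5py
import numpy as np

# A source file containing the actual dataset.
with h5py.File('source.h5', 'w') as f:
    f.create_dataset('data', data=np.arange(5))

# A second file that merely links to it; no data is copied.
with h5py.File('main.h5', 'w') as f:
    f['linked'] = h5py.ExternalLink('source.h5', 'data')

# Accessing the link transparently resolves to /data in source.h5.
with h5py.File('main.h5', 'r') as f:
    print(list(f['linked'][:]))  # [0, 1, 2, 3, 4]
```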
####################
# Disclaimer
# ----------------
#
# External links included in the tutorial are being provided as a convenience and for informational purposes only;
# they do not constitute an endorsement or an approval by the authors of any of the products, services, or opinions of
# the corporation, organization, or individual. The authors bear no responsibility for the accuracy, legality, or
# content of the external site or for that of subsequent links. Contact the external site for answers to questions
# regarding its content.

docs/gallery/general/extensions.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
 #
 # The following block of code demonstrates how to create a new namespace, and then add a new `neurodata_type`
 # to this namespace. Finally,
-# it calls :py:meth:`~form.spec.write.NamespaceBuilder.export` to save the extensions to disk for downstream use.
+# it calls :py:meth:`~pynwb.form.spec.write.NamespaceBuilder.export` to save the extensions to disk for downstream use.

 from pynwb.spec import NWBNamespaceBuilder, NWBGroupSpec, NWBAttributeSpec
2424
