@calderast wanted to take a `SpikeInterfaceRecordingDataChunkIterator` from NeuroConv and modify its output to return data multiplied by a conversion factor and converted back to `int16` before adding it to an NWB file:

```python
# Convert to uV without loading the whole thing at once
def traces_in_microvolts_iterator(traces_as_iterator, conversion_factor_uv):
    for chunk in traces_as_iterator:
        yield (chunk * conversion_factor_uv).astype("int16")

# Wrap the generator in a DataChunkIterator for H5DataIO
data_iterator = DataChunkIterator(
    traces_in_microvolts_iterator(traces_as_iterator, channel_conversion_factor_uv),
    buffer_size=1,  # number of chunks to keep in memory
    maxshape=(num_samples, num_channels),
    dtype=np.dtype("int16"),
)
data_data_io = H5DataIO(
    data=data_iterator,  # formerly traces_as_iterator
    chunks=(min(num_samples, 81920), min(num_channels, 64)),
    compression="gzip",
)
```
This resulted in an error.
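The per-chunk arithmetic itself is just a scale-and-cast and works as intended; the problem lies in how the generator is wrapped. A standalone NumPy sketch of the transform (the 0.195 µV/bit factor is an assumed illustrative value, not from the original post):

```python
import numpy as np

def scale_chunk(chunk, conversion_factor_uv=0.195):
    # Multiply raw ADC counts by the per-channel conversion factor (uV/bit),
    # then truncate back to int16 so the on-disk dtype stays compact.
    return (chunk * conversion_factor_uv).astype("int16")

raw = np.array([[1000, -1000], [2000, -2000]], dtype="int16")
scaled = scale_chunk(raw)
print(scaled)  # [[195, -195], [390, -390]] as int16
```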
Replies: 1 comment 2 replies
The solution:

```python
class MicrovoltsSpikeInterfaceRecordingDataChunkIterator(SpikeInterfaceRecordingDataChunkIterator):
    def __init__(self, iterator: SpikeInterfaceRecordingDataChunkIterator, conversion_factor_uv):
        self.iterator = iterator
        self.conversion_factor_uv = conversion_factor_uv
        super().__init__(iterator.recording)

    def _get_default_chunk_shape(self, chunk_mb: float = 10.0) -> tuple[int, int]:
        return self.iterator._get_default_chunk_shape(chunk_mb)

    def _get_data(self, selection: tuple[slice]):
        # Read the raw chunk, scale to microvolts, and cast back to int16
        data = self.iterator._get_data(selection)
        return (data * self.conversion_factor_uv).astype("int16")

    def _get_dtype(self):
        return np.dtype("int16")

    def _get_maxshape(self):
        return self.iterator._get_maxshape()


uv_traces_as_iterator = MicrovoltsSpikeInterfaceRecordingDataChunkIterator(
    traces_as_iterator, channel_conversion_factor_uv
)
data_data_io = H5DataIO(
    data=uv_traces_as_iterator,
    chunks=(min(num_samples, 81920), min(num_channels, 64)),
    compression="gzip",
)
```
The issue is that `DataChunkIterator` assumes data are read in a very particular manner: the iterator it wraps returns one element along the iteration dimension at a time. That is, the iterator is expected to return chunks that are one dimension lower than the array itself. For example, when iterating over the first dimension of a dataset with shape (1000, 10, 10), the iterator would return 1000 chunks of shape (10, 10), one chunk at a time. The solution was to create a new subclass of `GenericDataChunkIterator` or `SpikeInterfaceRecordingDataChunkIterator` that wraps the original `SpikeInterfaceRecordingDataChunkIterator` and modifies the `_get_data` method to get the data from the wrapped iterator, modify it, and return the result.
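The iteration pattern described above can be sketched with plain NumPy (no hdmf needed), since NumPy arrays already iterate over their first dimension:

```python
import numpy as np

# What DataChunkIterator expects from a wrapped iterator: iterating over a
# (1000, 10, 10) array yields chunks one dimension LOWER than the array,
# one element of the first dimension at a time.
data = np.zeros((1000, 10, 10), dtype="int16")
chunks = iter(data)        # NumPy iterates over the first dimension
first_chunk = next(chunks)
print(first_chunk.shape)   # (10, 10)

# The failing generator in the question instead yielded full 2-D
# (samples x channels) blocks of a 2-D array, i.e. chunks with the SAME
# dimensionality as the array, which violates this assumption.
```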