Commit eb21cbf

switch to roll and shift - remove delete vacated
1 parent 30026ac commit eb21cbf

9 files changed

+256
-354
lines changed

icechunk-python/docs/docs/moving-chunks.md

Lines changed: 47 additions & 99 deletions
@@ -15,41 +15,27 @@ This enables **rolling time windows**—continuously updating datasets like fore
1515

1616
| Method | Best For | Flexibility |
1717
|--------|----------|-------------|
18-
| [`shift_array`][icechunk.Session.shift_array] | Uniform shifts with edge handling | Simple—just specify offset and mode |
18+
| [`shift_array`][icechunk.Session.shift_array] | Shift with discard — rolling time windows | Simple—just specify offset |
19+
| [`roll_array`][icechunk.Session.roll_array] | Circular shift — no data loss | Simple—just specify offset |
1920
| [`reindex_array`][icechunk.Session.reindex_array] | Custom transformations | Maximum—you control every chunk |
2021

2122
## Offsets Are in Chunks, Not Elements
2223

23-
Both methods work with **chunk indices**, not array indices. If your array has `chunk_size=2`, then an offset of `(-1,)` shifts by 1 chunk, which is 2 elements:
24+
All three methods work with **chunk indices**, not array indices. If your array has `chunk_size=2`, then an offset of `(-1,)` shifts by 1 chunk, which is 2 elements:
2425

2526
```python
2627
# With chunk_size=2:
27-
shift_array("/arr", (-1,), "wrap") # → shifts by 1 chunk = 2 elements
28-
shift_array("/arr", (-2,), "wrap") # → shifts by 2 chunks = 4 elements
28+
shift_array("/arr", (-1,)) # → shifts by 1 chunk = 2 elements
29+
shift_array("/arr", (-2,)) # → shifts by 2 chunks = 4 elements
2930
```
3031

3132
Why chunks instead of elements? Because these are **metadata-only operations**. Shifting by partial chunks would require splitting and rewriting chunk data.
3233

33-
For convenience, `shift_array` returns the shift converted to element space—so you don't need to manually track chunk sizes when determining where to write new data.
34+
For convenience, `shift_array` and `roll_array` return the shift converted to element space—so you don't need to manually track chunk sizes when determining where to write new data.
3435

3536
## shift_array { #shift_array }
3637

37-
The [`shift_array`][icechunk.Session.shift_array] method moves all chunks by a fixed offset per dimension (negative to shift toward index 0, positive toward higher indices), with built-in handling for what happens at the boundaries. For convenience, it returns the **index shift** (`chunk_offset × chunk_size` for each dimension).
38-
39-
### Shift Modes
40-
41-
The `mode` parameter controls what happens to chunks that shift out of bounds:
42-
43-
| Mode | Behavior | Data Loss |
44-
|------|----------|-----------|
45-
| `"wrap"` | Chunks wrap to the other side | None |
46-
| `"discard"` | Out-of-bounds chunks are dropped | Yes |
47-
48-
You can use strings (`"wrap"`, `"discard"`) or the enum ([`ic.ShiftMode.WRAP`][icechunk.ShiftMode], [`ic.ShiftMode.DISCARD`][icechunk.ShiftMode]).
49-
50-
#### WRAP Mode
51-
52-
Chunks that shift out of one end reappear at the other—no data is lost.
38+
The [`shift_array`][icechunk.Session.shift_array] method moves all chunks by a fixed offset per dimension (negative to shift toward index 0, positive toward higher indices). Chunks that shift out of bounds are discarded, and vacated positions retain stale data — the caller typically writes new data there. It returns the **index shift** (`chunk_offset × chunk_size` for each dimension).
5339

5440
```python exec="on" session="chunks" source="material-block" result="code"
5541
import numpy as np
@@ -71,15 +57,15 @@ arr = zarr.create(
7157
arr[:] = np.arange(10)
7258
print("Before:", arr[:])
7359

74-
session.shift_array("/arr", (-2,), "wrap") # Shift left by 2 chunks
60+
session.shift_array("/arr", (-2,)) # Shift left by 2 chunks
7561
print("After: ", arr[:])
7662
```
7763

78-
Notice how `[0, 1, 2, 3]` wrapped around to the end.
64+
The chunks containing `[0, 1, 2, 3]` were discarded, and the last 4 positions retain stale data.
7965

80-
#### DISCARD Mode
66+
### Preserving Data with Resize
8167

82-
Out-of-bounds chunks are dropped, and vacated positions return the fill value.
68+
Chunks that shift out of bounds are lost. To preserve everything when shifting, resize first:
8369

8470
```python exec="on" session="chunks" source="material-block" result="code"
8571
repo = ic.Repository.create(ic.in_memory_storage())
@@ -95,35 +81,30 @@ arr = zarr.create(
9581
arr[:] = np.arange(10)
9682
print("Before:", arr[:])
9783

98-
session.shift_array("/arr", (-2,), "discard") # Shift left, discard overflow
84+
arr.resize((14,)) # Add space for 2 more chunks
85+
session.shift_array("/arr", (2,))
9986
print("After: ", arr[:])
10087
```
10188

102-
The chunks containing `[0, 1, 2, 3]` were discarded, and the vacated end filled with `-1`.
89+
### Example: Rolling Time Window
10390

104-
### Preserving Data with Resize
91+
Imagine a sensor array storing the last 7 days of hourly readings—shape `(168,)` with one chunk per day `(24,)`. Each day, you want to discard the oldest day and make room for new data:
10592

106-
With `"discard"` mode, chunks that shift out of bounds are lost. To preserve everything when shifting, resize first:
93+
```python
94+
# Each day: shift left by 1 chunk, discarding the oldest
95+
element_shift = session.shift_array("/sensors/temperature", (-1,))
96+
# element_shift = (-24,) — the shift in element space
10797

108-
```python exec="on" session="chunks" source="material-block" result="code"
109-
repo = ic.Repository.create(ic.in_memory_storage())
110-
session = repo.writable_session("main")
111-
arr = zarr.create(
112-
store=session.store,
113-
path="arr",
114-
shape=(10,),
115-
chunks=(2,),
116-
dtype="i4",
117-
fill_value=-1,
118-
)
119-
arr[:] = np.arange(10)
120-
print("Before:", arr[:])
98+
# Write new day's data to the vacated region
99+
arr[element_shift[0]:] = todays_readings
121100

122-
arr.resize((14,)) # Add space for 2 more chunks
123-
session.shift_array("/arr", (2,), "discard")
124-
print("After: ", arr[:])
101+
session.commit(f"Updated sensor data for {today}")
125102
```
126103

104+
The return value tells you exactly where to write new data—no need to manually track chunk sizes.
105+
106+
This pattern works identically whether your array is 1 KB or 1 PB, and whether it's on local disk or cloud object storage—the shift is always instant with zero data transfer.
107+
127108
### Multi-dimensional Arrays
128109

129110
For N-dimensional arrays, provide an offset for each dimension:
@@ -143,38 +124,14 @@ arr[:] = np.arange(24).reshape(6, 4)
143124
print("Original 6x4 array:")
144125
print(arr[:])
145126

146-
session.shift_array("/arr2d", (1, 0), "discard") # Shift down 1 chunk
127+
session.shift_array("/arr2d", (1, 0)) # Shift down 1 chunk
147128
print("\nAfter shift (1, 0):")
148129
print(arr[:])
149130
```
150131

151-
### Example: Rolling Time Window
132+
## roll_array { #roll_array }
152133

153-
Imagine a sensor array storing the last 7 days of hourly readings—shape `(168,)` with one chunk per day `(24,)`. Each day, you want to discard the oldest day and make room for new data:
154-
155-
```python
156-
# Each day: shift left by 1 chunk, discarding the oldest
157-
element_shift = session.shift_array("/sensors/temperature", (-1,), "discard")
158-
# element_shift = (-24,) — the shift in element space
159-
160-
# Write new day's data to the vacated region
161-
arr[element_shift[0]:] = todays_readings
162-
163-
session.commit(f"Updated sensor data for {today}")
164-
```
165-
166-
The return value tells you exactly where to write new data—no need to manually track chunk sizes.
167-
168-
This pattern works identically whether your array is 1 KB or 1 PB, and whether it's on local disk or cloud object storage—the shift is always instant with zero data transfer.
169-
170-
## reindex_array { #reindex_array }
171-
172-
For transformations that [`shift_array`][icechunk.Session.shift_array] can't express, [`reindex_array`][icechunk.Session.reindex_array] gives you complete control. You provide a function that maps each chunk's old position to its new position.
173-
174-
Your function receives a chunk index (as a list) and returns:
175-
176-
- A new index (as a list) to move the chunk there
177-
- `None` to discard the chunk
134+
The [`roll_array`][icechunk.Session.roll_array] method performs a circular shift — chunks that go out of one end wrap around to the other side. No data is lost.
178135

179136
```python exec="on" session="chunks" source="material-block" result="code"
180137
repo = ic.Repository.create(ic.in_memory_storage())
@@ -190,23 +147,20 @@ arr = zarr.create(
190147
arr[:] = np.arange(10)
191148
print("Before:", arr[:])
192149

193-
def shift_and_filter(idx):
194-
"""Shift left by 2, discard chunks that would go negative."""
195-
new_idx = idx[0] - 2
196-
return None if new_idx < 0 else [new_idx]
197-
198-
session.reindex_array("/arr", shift_and_filter, delete_vacated=True)
150+
session.roll_array("/arr", (-2,)) # Roll left by 2 chunks
199151
print("After: ", arr[:])
200152
```
201153

202-
### The delete_vacated Parameter
154+
Notice how `[0, 1, 2, 3]` wrapped around to the end.
155+
156+
## reindex_array { #reindex_array }
203157

204-
The `delete_vacated` parameter controls what happens to source positions after chunks move away:
158+
For transformations that [`shift_array`][icechunk.Session.shift_array] and [`roll_array`][icechunk.Session.roll_array] can't express, [`reindex_array`][icechunk.Session.reindex_array] gives you complete control. You provide a function that maps each chunk's old position to its new position.
159+
160+
Your function receives a chunk index (as a list) and returns:
205161

206-
| Value | Behavior |
207-
|-------|----------|
208-
| `True` | Vacated positions are deleted (return fill value) |
209-
| `False` | Vacated positions keep stale references |
162+
- A new index (as a list) to move the chunk there
163+
- `None` to discard the chunk
210164

211165
```python exec="on" session="chunks" source="material-block" result="code"
212166
repo = ic.Repository.create(ic.in_memory_storage())
@@ -220,25 +174,19 @@ arr = zarr.create(
220174
fill_value=-1,
221175
)
222176
arr[:] = np.arange(10)
223-
session.commit("setup")
177+
print("Before:", arr[:])
224178

225-
def shift_left_2(idx):
179+
def shift_and_filter(idx):
180+
"""Shift left by 2, discard chunks that would go negative."""
226181
new_idx = idx[0] - 2
227182
return None if new_idx < 0 else [new_idx]
228183

229-
# delete_vacated=False: source positions keep stale data
230-
session = repo.writable_session("main")
231-
arr = zarr.open_array(session.store, path="arr")
232-
session.reindex_array("/arr", shift_left_2, delete_vacated=False)
233-
print("delete_vacated=False:", arr[:])
234-
235-
# delete_vacated=True: vacated positions return fill value
236-
session = repo.writable_session("main")
237-
arr = zarr.open_array(session.store, path="arr")
238-
session.reindex_array("/arr", shift_left_2, delete_vacated=True)
239-
print("delete_vacated=True: ", arr[:])
184+
session.reindex_array("/arr", shift_and_filter)
185+
print("After: ", arr[:])
240186
```
241187

188+
Vacated positions retain stale chunk references.
189+
242190
### Custom Transformations
243191

244192
With `reindex_array`, you can implement any chunk permutation:
@@ -261,7 +209,7 @@ def reverse_chunks(idx):
261209
"""Reverse the order of all chunks."""
262210
return [4 - idx[0]] # 0↔4, 1↔3, 2 stays
263211

264-
session.reindex_array("/arr", reverse_chunks, delete_vacated=False)
212+
session.reindex_array("/arr", reverse_chunks)
265213
print("After: ", arr[:])
266214
```
267215

@@ -287,7 +235,7 @@ def swap_quadrants(idx):
287235
row, col = idx
288236
return [(row + 1) % 2, (col + 1) % 2]
289237

290-
session.reindex_array("/arr2d", swap_quadrants, delete_vacated=False)
238+
session.reindex_array("/arr2d", swap_quadrants)
291239
print("\nAfter swapping quadrants:")
292240
print(arr[:])
293241
```
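
To make the difference between the two documented behaviors concrete, here is a minimal pure-Python model of a 1-D chunk grid. It is only a sketch of the semantics described in the updated docs (shift discards out-of-bounds chunks and leaves stale entries in vacated slots; roll wraps around), not icechunk's implementation. The helper names `split_into_chunks`, `flatten`, `shift_chunks`, and `roll_chunks` are hypothetical and exist only for this illustration.

```python
def split_into_chunks(values, chunk_size):
    """Split a flat list into consecutive chunks of `chunk_size` elements."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def flatten(chunks):
    """Concatenate a list of chunks back into one flat list."""
    return [v for chunk in chunks for v in chunk]


def shift_chunks(chunks, offset):
    """Model of shift-with-discard (hypothetical helper, not the icechunk API).

    Chunks whose destination falls outside the grid are dropped.
    Slots that no chunk moves into keep their previous (stale) entry.
    """
    n = len(chunks)
    out = list(chunks)  # start from the old layout so vacated slots stay stale
    for src in range(n):
        dst = src + offset
        if 0 <= dst < n:
            out[dst] = chunks[src]
    return out


def roll_chunks(chunks, offset):
    """Model of a circular shift: every chunk lands somewhere, none is lost."""
    n = len(chunks)
    out = [None] * n
    for src in range(n):
        out[(src + offset) % n] = chunks[src]
    return out


chunk_size = 2
chunk_offset = -2
chunks = split_into_chunks(list(range(10)), chunk_size)

shifted = flatten(shift_chunks(chunks, chunk_offset))
rolled = flatten(roll_chunks(chunks, chunk_offset))

# The docs describe the return value as chunk_offset x chunk_size per dimension.
element_shift = chunk_offset * chunk_size

# Rolling-window pattern: overwrite the stale tail with new data.
window = list(shifted)
window[element_shift:] = [10, 11, 12, 13]

print("shifted:", shifted)
print("rolled: ", rolled)
print("window: ", window)
```

In this model the shifted array ends with a stale copy of the last two chunks, which is why the rolling-window example writes new data into `arr[element_shift[0]:]` right after the shift.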

icechunk-python/python/icechunk/__init__.py

Lines changed: 0 additions & 2 deletions
@@ -48,7 +48,6 @@
4848
S3Options,
4949
S3StaticCredentials,
5050
SessionMode,
51-
ShiftMode,
5251
SnapshotInfo,
5352
Storage,
5453
StorageConcurrencySettings,
@@ -164,7 +163,6 @@
164163
"S3StaticCredentials",
165164
"Session",
166165
"SessionMode",
167-
"ShiftMode",
168166
"SnapshotInfo",
169167
"Storage",
170168
"StorageConcurrencySettings",

icechunk-python/python/icechunk/_icechunk_python.pyi

Lines changed: 2 additions & 18 deletions
@@ -1888,20 +1888,6 @@ class SessionMode(Enum):
18881888
WRITABLE = 1
18891889
REARRANGE = 2
18901890

1891-
class ShiftMode(Enum):
1892-
"""The mode for shifting array chunks, determining how out-of-bounds chunks are handled.
1893-
1894-
Attributes
1895-
----------
1896-
WRAP: int
1897-
Circular buffer - chunks wrap around to the other side, shape unchanged
1898-
DISCARD: int
1899-
Out-of-bounds chunks are discarded, vacated positions return fill_value
1900-
"""
1901-
1902-
WRAP = 0
1903-
DISCARD = 1
1904-
19051891
class PySession:
19061892
@classmethod
19071893
def from_bytes(cls, data: bytes) -> PySession: ...
@@ -1924,11 +1910,9 @@ class PySession:
19241910
self,
19251911
array_path: str,
19261912
shift_chunk: Callable[[Iterable[int]], Iterable[int] | None],
1927-
delete_vacated: bool,
19281913
) -> None: ...
1929-
def shift_array(
1930-
self, array_path: str, chunk_offset: Iterable[int], mode: ShiftMode
1931-
) -> list[int]: ...
1914+
def shift_array(self, array_path: str, chunk_offset: Iterable[int]) -> list[int]: ...
1915+
def roll_array(self, array_path: str, chunk_offset: Iterable[int]) -> list[int]: ...
19321916
async def move_node_async(self, from_path: str, to_path: str) -> None: ...
19331917
def all_virtual_chunk_locations(self) -> list[str]: ...
19341918
async def all_virtual_chunk_locations_async(self) -> list[str]: ...
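
For callers that still pass the removed `mode` argument, one possible migration shim is sketched below. It relies only on the two signatures shown in the stub diff above (`shift_array(array_path, chunk_offset)` and `roll_array(array_path, chunk_offset)`). The function `shift_with_legacy_mode` and the `RecordingSession` stand-in are hypothetical names introduced for illustration and are not part of icechunk. Note one behavioral difference the shim cannot hide: the old `"discard"` mode was documented as returning the fill value in vacated positions, whereas the new `shift_array` is documented as leaving stale data there.

```python
def shift_with_legacy_mode(session, array_path, chunk_offset, mode):
    """Route an old-style (offset, mode) call to the new methods.

    Hypothetical helper: `mode` may be the old string ("wrap" / "discard")
    or an enum-like object exposing a `.name` attribute.
    """
    name = str(getattr(mode, "name", mode)).lower()
    if name == "wrap":
        return session.roll_array(array_path, chunk_offset)
    if name == "discard":
        # Caveat: vacated positions now keep stale data instead of fill_value.
        return session.shift_array(array_path, chunk_offset)
    raise ValueError(f"unknown shift mode: {mode!r}")


class RecordingSession:
    """Stand-in for a session, used only to show which method gets called."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.calls = []

    def shift_array(self, array_path, chunk_offset):
        self.calls.append(("shift_array", array_path, tuple(chunk_offset)))
        return [o * self.chunk_size for o in chunk_offset]

    def roll_array(self, array_path, chunk_offset):
        self.calls.append(("roll_array", array_path, tuple(chunk_offset)))
        return [o * self.chunk_size for o in chunk_offset]


fake = RecordingSession(chunk_size=24)
wrap_result = shift_with_legacy_mode(fake, "/arr", (-1,), "wrap")
discard_result = shift_with_legacy_mode(fake, "/arr", (-1,), "discard")
print(fake.calls)
print(wrap_result, discard_result)
```

A shim like this is mainly useful as a temporary bridge; calling `roll_array` or `shift_array` directly keeps the intent visible at the call site.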
