Pull request (Closed)

Commits (82)
9c4d551
Template for the future HDF5ArrayAdapter
genematx Feb 13, 2025
c78ab18
ENH: Implement HDF5ArrayAdapter
genematx Feb 13, 2025
ec2f931
FIX: convert hdf5 dataset parameter
genematx Feb 13, 2025
1767201
FIX: dtype check
genematx Feb 13, 2025
abadcea
FIX: shape check
genematx Feb 13, 2025
ce7b60c
fix typos
genematx Feb 13, 2025
821fe7e
ENH: convert strings to ndslices
genematx Feb 14, 2025
3ddf0b6
Enable slicing in HDF5ArrayAdapter
genematx Feb 14, 2025
73b10e3
ENH: accept str for dataset in HDF5ArrayAdapter
genematx Feb 14, 2025
006f6fa
FIX: string parsing
genematx Feb 14, 2025
9d93fdf
ENH: consider native chunking in hdf5
genematx Feb 14, 2025
643ead6
FIX: rechunk hdf5
genematx Feb 18, 2025
dd98040
WIP: include HDF5ArrayAdapter in HDF5Adapter
genematx Feb 21, 2025
4c534bd
WIP: refactor metadata from file
genematx Feb 21, 2025
9fbc039
ENH: Incorporate HDF5ArrayAdapter into HDF5Adapter
genematx Feb 22, 2025
e41bdb5
Merge branch 'main' into hdf5-array-adapter
genematx Feb 22, 2025
44437dc
FIX: errors with awkward form being None
genematx Feb 24, 2025
35fd53a
ENH: metadata for table columns
genematx Feb 27, 2025
65a4063
ENH: add composite structure family
genematx Feb 27, 2025
d1eff0a
ENH: sketch of a simplest mapped flat structure
genematx Feb 28, 2025
9f751fe
ENH: zipping 1D arrays
genematx Mar 7, 2025
00f12dc
ENH: generate a dataset from composite
genematx Mar 7, 2025
4702e19
MNT: typing of EllipsisType
genematx Mar 7, 2025
566ae48
FIX: loading data from files
genematx Mar 10, 2025
2763627
FIX: typing
genematx Mar 10, 2025
b1e3590
FIX: typing
genematx Mar 10, 2025
84dc14d
MNT: clean-up
genematx Mar 10, 2025
1134d70
Update tiled/adapters/hdf5.py
genematx Mar 11, 2025
c879652
Update tiled/adapters/hdf5.py
genematx Mar 11, 2025
0dc620d
ENH: implement an NDSlice class
genematx Mar 13, 2025
ff1ecad
TST: test for NDSlices
genematx Mar 13, 2025
502f5fb
ENH: catch key errors and raise 410 for broken links
genematx Mar 13, 2025
2e36113
TST: test for broken soft links
genematx Mar 13, 2025
6616547
ENH: raise KeyError for broken links on the client
genematx Mar 14, 2025
defd6d5
Merge branch 'main' into hdf5-array-adapter
genematx Mar 14, 2025
e3a9022
REV: typo
genematx Mar 14, 2025
40c6c68
FIX: get inlined contents
genematx Mar 17, 2025
96b8cd0
MNT: lint
genematx Mar 17, 2025
f1f3f22
MNT: lint
genematx Mar 17, 2025
348bac6
FIX: processing hdf5 error messages
genematx Mar 17, 2025
25a84b6
Merge branch 'main' into virtual-dataset
genematx Mar 17, 2025
1da0410
Merge branch 'hdf5-array-adapter' into virtual-dataset
genematx Mar 17, 2025
56578bd
Merge branch 'awkward-refactor' into virtual-dataset
genematx Mar 18, 2025
70a0996
FIX: check string dtype in ArrayAdapter
genematx Mar 20, 2025
5ac8818
ENH: virtual xarray datasets for 1D arrays
genematx Mar 20, 2025
046b29e
ENH: improve repr for RecordArrays
genematx Mar 20, 2025
0bb2048
Merge branch 'awkward-refactor' into virtual-dataset
genematx Mar 20, 2025
36bc109
FIX: allow for missing metadata attr
genematx Mar 20, 2025
658b2c4
ENH: virtual datasets for ND arrays
genematx Mar 20, 2025
97d17dd
FIX:
genematx Mar 20, 2025
c7ea78d
ENH: Enable POST method to read datasets
genematx Mar 20, 2025
a037e1f
ENH: server-side array transformations and dataset alignment
genematx Mar 22, 2025
887d100
ENH: refactoring and bug fixes of NDSlice
genematx Mar 22, 2025
fdd7489
Merge branch 'hdf5-array-adapter' into virtual-dataset
genematx Mar 22, 2025
aa43987
ENH: allow dims from metadata
genematx Mar 22, 2025
5abb201
FIX: alignment query param in links for datasets
genematx Mar 22, 2025
a763a93
TST: tests for virtual datasets
genematx Mar 22, 2025
d8e7f6e
ENH: convert string dtypes closer to the ArrayAdapter
genematx Mar 22, 2025
ac69cc5
Resolve conflicts
genematx Mar 22, 2025
66f7b4f
Merge branch 'hdf5-array-adapter' into virtual-dataset
genematx Mar 22, 2025
26cab9e
TST: fix url to path conversion
genematx Mar 22, 2025
495d391
Merge branch 'hdf5-array-adapter' into virtual-dataset
genematx Mar 22, 2025
24ec005
MNT: changelog and lint
genematx Mar 24, 2025
f1a3a99
TST: fix Windows tests
genematx Mar 24, 2025
5e0c6d2
Merge branch 'hdf5-array-adapter' into virtual-dataset
genematx Mar 24, 2025
1d101d4
Merge branch 'main' into awkward-refactor
genematx Mar 24, 2025
1fd5db0
TST fix slicing tests
genematx Mar 24, 2025
1d20c5d
Merge branch 'awkward-refactor' into virtual-dataset
genematx Mar 24, 2025
eff4103
TST: fix failing tests
genematx Mar 25, 2025
acebf7b
TST: create distinct temp directories
genematx Mar 25, 2025
4fd5578
REV:
genematx Mar 25, 2025
c9439c4
ENH: exporting NDSlices with Ellipsis to JSON
genematx Mar 31, 2025
6debf2a
Merge branch 'main' into hdf5-array-adapter
genematx Mar 31, 2025
cbdc7bd
Merge branch 'hdf5-array-adapter' into virtual-dataset
genematx Mar 31, 2025
66fc4ea
FIX: NDSlice import
genematx Mar 31, 2025
b7f1f9e
FIX: NDSlice import
genematx Mar 31, 2025
ddaf6ba
MNT: lint
genematx Mar 31, 2025
eed70fb
STY: fix endpoints for composite
genematx Apr 3, 2025
299c0e9
Merge branch 'main' into virtual-dataset
genematx Apr 3, 2025
ad46c08
MNT: lint
genematx Apr 3, 2025
da9434c
ENH: encode long query parameters
genematx Apr 3, 2025
f9c2669
ENH: use POST method by xarray client
genematx Apr 4, 2025
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -8,6 +8,7 @@ Write the date in place of the "Unreleased" in the case a new version is released
### Added

- `Composite` structure family to enable direct access to table columns in a single namespace.
- Creating "virtual" datasets from the contents of a Composite container.

### Changed

3 changes: 1 addition & 2 deletions tiled/_tests/test_composite.py
@@ -48,8 +48,7 @@

# A sparse array
arr = rng.random(size=(10, 20, 30), dtype="float64")
-arr[arr < 0.95] = 0  # Fill half of the array with zeros.
-sps_arr = sparse.COO(arr)
+sps_arr = sparse.COO(numpy.where(arr > 0.95, arr, 0))

md = {"md_key1": "md_val1", "md_key2": 2}

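Both the removed in-place form and the new `numpy.where` form zero out everything at or below the 0.95 threshold, and since `rng.random` returns multiples of 2**-53 it can never land exactly on 0.95, so the two constructions produce identical arrays. A quick numpy-only check (no `sparse` dependency needed for the comparison):

```python
import numpy as np

rng = np.random.default_rng(12345)
arr = rng.random(size=(10, 20, 30))

# In-place form (the removed lines): zero everything below the threshold.
a = arr.copy()
a[a < 0.95] = 0

# Functional form (the added line): keep only values above the threshold.
b = np.where(arr > 0.95, arr, 0)

equal = np.array_equal(a, b)
```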
269 changes: 269 additions & 0 deletions tiled/_tests/test_dataset.py
@@ -0,0 +1,269 @@
import numpy as np
import pandas as pd
import pytest
import sparse

from tiled.client.metadata_update import DELETE_KEY

from ..catalog import in_memory
from ..client import Context, from_context
from ..client.xarray import DatasetClient
from ..server.app import build_app

rng = np.random.default_rng(12345)

# 1D Arrays
time_1x = np.linspace(0, 1, 10)
time_2x = np.linspace(0, 1, 20)
arr1 = rng.random(size=(10,), dtype="float64")
arr2 = rng.random(size=(10, 1), dtype="single")
arr3 = rng.random(size=(20, 1), dtype="double")
arr4 = rng.integers(0, 255, size=(10,), dtype="uint8")

# nD Arrays
img1 = rng.random(size=(10, 13, 17), dtype="float64")
img2 = rng.random(size=(20, 13, 17), dtype="float64")

# Tables
tab1 = pd.DataFrame(
    {
        "colA": rng.random(10, dtype="float64"),
        "colB": rng.integers(0, 255, size=(10,), dtype="uint8"),
        "colC": np.random.choice(["a", "b", "c", "d", "e"], 10),
    }
)
tab2 = pd.DataFrame(
    {
        "colD": rng.random(20, dtype="float64"),
        "colE": rng.integers(0, 255, size=(20,), dtype="uint8"),
        "colF": np.random.choice(["a", "b", "c", "d", "e"], 20),
    }
)
tab3 = pd.DataFrame(
    {
        "colG": rng.random(20, dtype="float64"),
        "colH": rng.integers(0, 255, size=(20,), dtype="uint8"),
        "colI": np.random.choice(["a", "b", "c", "d", "e"], 20),
    }
)

# Sparse Arrays
sps1 = rng.random(size=(10, 13, 17), dtype="float64")
sps1 = sparse.COO(np.where(sps1 > 0.95, sps1, 0))
sps2 = sparse.COO(np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1]))
sps3 = sparse.COO(np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1] * 2))

data = [
    ("time", time_1x),
    ("time_1x", time_1x),
    ("time_2x", time_2x),
    ("arr1", arr1),
    ("arr2", arr2),
    ("arr3", arr3),
    ("arr4", arr4),
    ("img1", img1),
    ("img2", img2),
    ("colA", tab1["colA"]),
    ("colB", tab1["colB"]),
    ("colC", tab1["colC"]),
    ("colD", tab2["colD"]),
    ("colE", tab2["colE"]),
    ("colF", tab2["colF"]),
    # ("colG", tab3["colG"]),
[Review — Member] Is this running into trouble due to types? If so #941 will fix, once that's in.

[Reply — Contributor Author] No, due to setting up a different type of storage in tests.
    # ("colH", tab3["colH"]),
    # ("colI", tab3["colI"]),
    # ("sps1", sps1.todense()),
[Review — Member] Do these work?

[Reply — Contributor Author] Not yet.

    # ("sps2", sps2.todense()),
    # ("sps3", sps3.todense()),
]

md = {"md_key1": "md_val1", "md_key2": 2}


@pytest.fixture(scope="module")
def tree(tmp_path_factory):
    return in_memory(writable_storage=tmp_path_factory.mktemp("test_dataset"))


@pytest.fixture(scope="module")
def context(tree):
    with Context.from_app(build_app(tree)) as context:
        client = from_context(context)
        x = client.create_composite(key="x", metadata=md)
        x.write_array(time_1x, key="time", metadata={})
        x.write_array(
            time_1x,
            key="time_1x",
            metadata={},
            specs=["xarray_coord"],
            dims=["time_1x"],
        )
        x.write_array(
            time_2x,
            key="time_2x",
            metadata={},
            specs=["xarray_coord"],
            dims=["time_2x"],
        )
        x.write_array(arr1, key="arr1", metadata={})
        x.write_array(arr2, key="arr2", metadata={})
        x.write_array(arr3, key="arr3", metadata={})
        x.write_array(arr4, key="arr4", metadata={})
        x.write_array(img1, key="img1", metadata={})
        x.write_array(img2, key="img2", metadata={})

        x.write_dataframe(
            tab1,
            key="tab1",
            metadata={},
        )
        x.write_dataframe(
            tab2,
            key="tab2",
            metadata={},
        )
        # table = pyarrow.Table.from_pandas(tab3)
        # x.create_appendable_table(schema=table.schema, key="tab3")
        # x.parts["tab3"].append_partition(table, 0)

        # x.write_sparse(
        #     coords=sps1.coords,
        #     data=sps1.data,
        #     shape=sps1.shape,
        #     key="sps1",
        #     metadata={},
        # )
        # x.write_sparse(
        #     coords=sps2.coords,
        #     data=sps2.data,
        #     shape=sps2.shape,
        #     key="sps2",
        #     metadata={},
        # )
        # x.write_sparse(
        #     coords=sps3.coords,
        #     data=sps3.data,
        #     shape=sps3.shape,
        #     key="sps3",
        #     metadata={},
        # )

        yield context


def test_create_full_dataset(context):
    x = from_context(context)["x"]
    ds = x.to_dataset()
    assert isinstance(ds, DatasetClient)
    assert len(ds) == len(data)
    assert set(ds.keys()) == set([name for name, _ in data])


def test_create_partial_dataset(context):
    x = from_context(context)["x"]
    keys = ["time_1x", "arr1", "img1", "colA", "colD"]
    ds = x.to_dataset(*keys)
    assert len(ds) == len(keys)
    assert set(ds.keys()) == set(keys)


@pytest.mark.parametrize("name, expected", data)
def test_read_from_dataset(context, name, expected):
    x = from_context(context)["x"]
    ds = x.to_dataset()
    actual = ds[name].read()
    assert np.array_equal(actual, expected.squeeze())
[Review — Member] Where does the inconsistency arise that makes squeeze() necessary?

[Reply — Contributor Author] For arrays with shapes (10, 1) and (20, 1). When creating an xarray, I assumed that trailing ones in the shape could be dropped (to avoid declaring another coordinate).


def test_read_xarray_same_shape(context):
    x = from_context(context)["x"]

    # Use all arrays to construct the dataset; read all
    keys = ["arr1", "arr2", "arr4", "colA", "colB", "colC"]
    xarr = x.to_dataset(*keys).read()
    assert len(xarr) == len(keys)
    assert len(xarr.coords) == 0
    assert set(xarr.data_vars) == set(keys)
    assert set(xarr.variables) == set(keys)

    # Use the 'time' array as the default coordinate
    keys = ["time", "arr1", "arr2", "arr4", "colA", "colB", "colC"]
    xarr = x.to_dataset(*keys).read()
    assert len(xarr) == len(keys) - 1
    assert set(xarr.coords) == {"time"}
    assert set(xarr.data_vars) == set(keys) - {"time"}
    assert set(xarr.variables) == set(keys)

    # Use all arrays to construct the dataset; read a subset
    keys = ["time", "arr1", "arr2", "arr4", "colA", "colB", "colC"]
    xarr = x.to_dataset(*keys).read(variables=["arr1", "colA"])
    assert len(xarr) == 2
    assert len(xarr.coords) == 0
    assert set(xarr.variables) == {"arr1", "colA"}

    # Set 'time_1x' as the default coordinate for some of the arrays
    keys = ["time", "time_1x", "arr1", "arr4", "colA", "colB"]
    x.parts["tab1"].update_metadata(
        metadata={"column_specs": {"colA": ["xarray_data_var"]}, "rows_dim": "time_1x"}
    )
    xarr = x.to_dataset(*keys).read()
    assert set(xarr.dims) == {"time", "time_1x"}
    assert set(xarr.coords) == {"time", "time_1x"}
    assert set(xarr.data_vars) == {"arr1", "arr4", "colA", "colB"}
    assert set(xarr.variables) == {"time", "time_1x", "arr1", "arr4", "colA", "colB"}

    # Revert the metadata changes
    x.parts["tab1"].update_metadata(
        metadata={"column_specs": DELETE_KEY, "rows_dim": DELETE_KEY}
    )


def test_read_xarray_with_ndarrays(context):
    x = from_context(context)["x"]

    keys = ["time", "arr1", "arr2", "arr4", "img1"]
    xarr = x.to_dataset(*keys).read()
    assert set(xarr.coords) == {"time"}
    assert set(xarr.data_vars) == {"arr1", "arr2", "arr4", "img1"}
    assert xarr["arr1"].dims == ("time",)
    assert xarr["img1"].dims == ("time", "dim1", "dim2")


def test_read_xarray_different_lengths(context):
    x = from_context(context)["x"]

    keys = ["time_1x", "time_2x", "arr1", "arr3", "img1", "img2", "colA", "colD"]

    # Set dimension labels for the arrays
    x["arr1"].update_metadata(metadata={"dims": ["time_1x"]})
    x["arr3"].update_metadata(metadata={"dims": ["time_2x"]})
    x["img1"].update_metadata(metadata={"dims": ["time_1x", "x", "y"]})
    x["img2"].update_metadata(metadata={"dims": ["time_2x", "x", "y"]})
    x.parts["tab1"].update_metadata(metadata={"rows_dim": "time_1x"})
    x.parts["tab2"].update_metadata(metadata={"rows_dim": "time_2x"})

    xarr = x.to_dataset(*keys).read()
    assert set(xarr.coords) == {"time_1x", "time_2x"}
    assert set(xarr.dims) == {"time_1x", "time_2x", "x", "y"}
    assert set(xarr.data_vars) == {"arr1", "arr3", "img1", "img2", "colA", "colD"}

    # Revert the metadata changes
[Review — Member] Maybe this should live in a finally block.

[Reply — Contributor Author] I like the idea!
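The suggested pattern can be sketched with a stand-in object (the real test would call `update_metadata` on Tiled client nodes, and `DELETE_KEY` here is a stand-in for `tiled.client.metadata_update.DELETE_KEY`): moving the revert into `finally` guarantees cleanup even when an assertion fails mid-test.

```python
DELETE_KEY = object()  # stand-in for tiled's DELETE_KEY sentinel


class FakeNode:
    """Minimal stand-in for a client node's metadata handling."""

    def __init__(self):
        self.metadata = {}

    def update_metadata(self, metadata):
        # Mimic the DELETE_KEY convention: the sentinel value removes a key.
        for key, value in metadata.items():
            if value is DELETE_KEY:
                self.metadata.pop(key, None)
            else:
                self.metadata[key] = value


node = FakeNode()
node.update_metadata({"dims": ["time_1x"]})
try:
    # ... assertions against the relabeled dataset would go here ...
    assert node.metadata["dims"] == ["time_1x"]
finally:
    # Revert the metadata changes even if an assertion above fails
    node.update_metadata({"dims": DELETE_KEY})
```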

    x["arr1"].update_metadata(metadata={"dims": DELETE_KEY})
    x["arr3"].update_metadata(metadata={"dims": DELETE_KEY})
    x["img1"].update_metadata(metadata={"dims": DELETE_KEY})
    x["img2"].update_metadata(metadata={"dims": DELETE_KEY})
    x.parts["tab1"].update_metadata(metadata={"rows_dim": DELETE_KEY})
    x.parts["tab2"].update_metadata(metadata={"rows_dim": DELETE_KEY})


@pytest.mark.parametrize("align", ["zip_shortest", "resample"])
def test_read_xarray_with_alignment(context, align):
    x = from_context(context)["x"]

    keys = ["time", "arr1", "arr3", "img1", "img2", "colA", "colD"]
    xarr = x.to_dataset(*keys, align=align).read()
    assert set(xarr.coords) == {"time"}
    assert set(xarr.dims) == {"time", "dim1", "dim2"}
    assert set(xarr.data_vars) == set(keys) - {"time"}
    for key in keys:
        assert xarr[key].shape[0] == 10
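As a rough numpy-only illustration of what the test's assertions imply for `zip_shortest` alignment (this sketches an assumed semantic, not Tiled's actual server-side implementation): every variable is truncated to the shortest length along the shared leading dimension, here the 10-point `time` axis.

```python
import numpy as np

time = np.linspace(0, 1, 10)  # 10 points, like `time` in the tests
arr3 = np.arange(20)          # 20 points, sampled twice as often

# "zip_shortest": truncate every variable to the shortest shared length
n = min(len(time), len(arr3))
aligned = {"time": time[:n], "arr3": arr3[:n]}
```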