Adds h5py dump #221
Conversation
Hi @pluflou, do you have any updates on this PR?
@pluflou this would be a good opportunity to merge in the version we have been using.
@roussel-ryan do we want this in both packages? I can copy the updated class from ml-tto and mark this as ready for review.
Yes, let's update the h5 dumping based on ml-tto, but I would do a PR to the main branch, not the serialization branch; then we can remove it from ml-tto.
@eloiseyang @roussel-ryan The class is now up to date with the ml-tto version. Previously I had a
@eloiseyang I added pandas to the requirements since I wrote the
Hi @pluflou, including pandas is a good idea, especially if the ML team already uses it. I've used it before in other projects. Let me do another review pass over this code to give comments/approval.
Hi all, thanks for the PR. I read this code as a generalized approach to dumping and recovering Python objects, like pandas dataframes, ndarrays, and dictionaries. I think that's a good idea, and it will be useful for everyone when it's implemented.
The trouble I have with this code is that its approach to different types is inconsistent. An HDF5 group is assumed to be a dictionary by default. Sometimes a group is a pandas dataframe, if it has the appropriate attributes associated with it. Dictionaries and pandas dataframes are easily dumped and recovered.
Sometimes a group is a list, as with nested lists, but these can't be recovered because there's no attribute that handles the reconstruction of lists. When the code dumps a heterogeneous list, it converts the list to a string, and the loader can't recover the data from that string. The list-group approach used for nested lists could be used to solve this problem, but it isn't.
Lists of dictionaries are handled, but tuples and ndarrays of dictionaries are not. Pandas dataframes get to know what type of object they are, but tuples don't. Dictionaries only sort of get to know, because that's the default case.
Right now, the code can dump and fully recover scalars, homogeneous lists of scalar data, dictionaries, nested dictionaries, and pandas dataframes. I would suggest we reduce the code to handle only these cases that can be dumped and recovered in their original form. A generalized way to dump and recover Python objects will definitely be useful, but we should make sure that our code can fully deliver on its promises. We don't want to be scrambling to recover a heterogeneous list of enums, booleans, strings, and doubles from a printed string two years down the line, for instance.
The idea is good and the testing suite is robust. This will be an iterative process, and we don't need to create something that's completely general to start. The first step to getting this merged is reducing it down to the parts of the code that deliver on their promise to save and recover Python objects. I'm definitely willing to help iterate on this as well once the initial merge is done. Thanks again, and let me know if you have questions.
```python
    -------
    None
    """
    dt = h5py.string_dtype(encoding=self.string_dtype)
```
Could you rename this to something like `h5str`? `dt` is not descriptive of what this does.
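A minimal sketch of the suggested rename (hedged: `"utf-8"` here is a stand-in for `self.string_dtype`, which isn't shown in this hunk):

```python
import h5py

# "h5str" describes the HDF5 variable-length string dtype better than "dt".
h5str = h5py.string_dtype(encoding="utf-8")  # "utf-8" is a stand-in
```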
```python
elif all(isinstance(ele, np.ndarray) for ele in val):
    # save np.arrays as datasets
    for i, ele in enumerate(val):
        f.create_dataset(f"{key}/{i}", data=ele, track_order=True)
```
I'd like to examine what this code is trying to accomplish by looking at the `load` function later down the line. `load` does three things: it converts scalar datasets into key/value pairs in a parent dictionary, it converts groups with the pandas attribute into pandas dataframes, and it converts all other groups into dictionaries.
Here's a pretty basic case:

```python
list_of_dicts = {'list': [{'foo': 'a'}, {'bar': 'b'}]}
saver.dump(list_of_dicts, filepath)
result = saver.load(filepath)
# result = {'list': {'0': {'foo': 'a'}, '1': {'bar': 'b'}}}
```
The result is that the original `list_of_dicts` is not recovered. This isn't so bad, and we could reasonably recover the list as long as we know what's happening. But looking ahead, there are two cases in which the code saves data as strings, which means that data in a nested or mixed-type list can't be recovered easily.
There are a couple of approaches we could take to resolve this. We could create groups that contain information about the source class. This shouldn't be so hard to implement, but it would mean writing a lot more code. The second approach is to reduce the number of cases this code handles to the ones that you and the rest of the ML team presently use and need. For instance, I assume you all are not parsing a dictionary back from its string form for a source dictionary that looks like this: `{'list': [1, {'foo': 1}]}`.
I would recommend the second approach. It looks like this code handles dictionaries and pandas dataframes the best, so I would recommend you stick to that. We can iterate and add support for nested lists and mixed-type lists as time goes on.
This is all just my interpretation based on how the code is written right now. If there are other cases you need covered, please let me know.
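For concreteness, a hedged sketch of the first approach, tagging each list group with its source type so the loader can rebuild it (`python_type` is an illustrative attribute name, not from the PR):

```python
import h5py

def dump_list(f: h5py.Group, key: str, val: list) -> None:
    # Store each element under an index key and tag the group so the
    # loader knows to rebuild a list rather than fall back to a dict.
    grp = f.create_group(key, track_order=True)
    grp.attrs["python_type"] = "list"  # illustrative attribute name
    for i, ele in enumerate(val):
        grp.create_dataset(str(i), data=ele, track_order=True)

def load_list(grp: h5py.Group) -> list:
    # Rebuild the list in index order.
    return [grp[str(i)][()] for i in range(len(grp))]
```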
```python
    )
elif isinstance(val, self.supported_types):
    f.create_dataset(key, data=val, track_order=True)
elif isinstance(val, np.ndarray):
```
ndarrays and tuples below can also contain dictionaries, but that case is not handled here. I'd recommend making that lack of implementation explicit, such as with an error.
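A hedged sketch of raising an explicit error for that gap (the check itself is illustrative, not from the PR):

```python
import numpy as np

def reject_dict_elements(val: np.ndarray) -> None:
    # Object-dtype arrays can hold dictionaries, which the dumper
    # doesn't handle; fail loudly instead of silently stringifying.
    if val.dtype == object and any(isinstance(e, dict) for e in val.flat):
        raise NotImplementedError(
            "dumping arrays of dictionaries is not supported"
        )
```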
```python
def recursive_load(f):
    d = {"attrs": dict(f.attrs)} if f.attrs else {}
    for key, val in f.items():
        if isinstance(val, h5py.Group):
```
Just to elaborate on my previous comment a little more: h5py groups do one of two things here. Either they become pandas dataframes, or they become dictionaries by default. We could also write a group that gets reconstructed as a list, which would be a nice way to reconstruct the "{key}/{i}"-type groups.
I would really recommend that we shrink this PR down. It's difficult to programmatically recover data from its printed string form, and the right way to approach those cases is to expand the code to dump and recover them appropriately, which would add a lot of bulk to this already long PR. Therefore, those parts can be cut.
I think having a generalized way to dump and load data is a good idea in the long run, but we can iterate on our current approach to groups to make it general.
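A hedged sketch of how the loader's group handling could dispatch on such an attribute (the attribute names are illustrative and match the dump-side sketch above; recursion into nested groups is omitted for brevity):

```python
import h5py
import pandas as pd

def load_group(val: h5py.Group):
    # Dispatch on an illustrative "python_type" attribute; dict stays
    # the default so existing files still load.
    kind = val.attrs.get("python_type", "dict")
    if kind == "pandas_dataframe":
        return pd.DataFrame({col: val[col][:] for col in val})
    if kind == "list":
        return [val[str(i)][()] for i in range(len(val))]
    return {k: v[()] for k, v in val.items()}
```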
```python
            for col in columns:
                data[col] = val[col][:]
            d[key] = pd.DataFrame(data)
        else:
```
We already use a `pandas_type` attribute above to tell the loader what type we're dealing with. Maybe we should add an attribute for dictionaries too?
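For illustration, a minimal sketch of tagging dictionary groups at dump time, mirroring the existing `pandas_type` attribute (the `python_type` name is illustrative, not from the PR):

```python
import h5py

def dump_dict(f: h5py.Group, key: str, val: dict) -> None:
    # Tag the group explicitly instead of relying on "dict" being
    # the loader's fallback case. Scalar/array values only; a fuller
    # version would recurse into nested dicts.
    grp = f.create_group(key, track_order=True)
    grp.attrs["python_type"] = "dict"  # illustrative attribute name
    for k, v in val.items():
        grp.create_dataset(k, data=v, track_order=True)
```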
"g": {"a": np.nan, "b": np.inf, "c": -np.inf}, | ||
"h": "np.Nan", | ||
"i": np.array((1.0, 2.0), dtype="O"), | ||
} |
Should we add assert statements here like in `test_special_values` below?
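A hedged sketch of what such asserts might look like after a dump/load round trip (`saver` and `filepath` are stand-ins borrowed from the earlier example, and the keys come from the snippet above):

```python
import numpy as np

loaded = saver.load(filepath)  # assumes the dict above was dumped here
assert np.isnan(loaded["g"]["a"])
assert loaded["g"]["b"] == np.inf
assert loaded["g"]["c"] == -np.inf
assert loaded["h"] == "np.Nan"  # stored literally as a string
```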
```python
)

# Dump to H5
result_dict = result.model_dump()
```
We already test to make sure that a saved dictionary has the same keys as the original when reloaded. Presumably screen beam measurement already tests to make sure the result has the right attributes on output. So long as both of those things are tested separately, this test doesn't test anything new.
```python
# lists of lists are saved as dicts
# here the lists are saved as np.ndarrays
assert np.array_equal(
    data["nested_list"][i], loaded_data["nested_list"][f"{i}"]
```
Here's an example of the pattern I'd like to avoid. If we can add an attribute that tells the loader that we're dealing with a pandas dataframe, then there's no reason we can't do the same for a list.
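Tying this to the sketches above, a hypothetical round trip with list tagging would let this test compare lists directly (`dump_list` and `load_list` are the illustrative helpers from earlier, not part of the PR):

```python
import h5py
import numpy as np

nested = [np.arange(3), np.arange(3) * 2]
with h5py.File("tmp.h5", "w") as f:
    dump_list(f, "nested_list", nested)   # illustrative helper from above
with h5py.File("tmp.h5", "r") as f:
    loaded = load_list(f["nested_list"])  # returns a list, not a dict
assert all(np.array_equal(a, b) for a, b in zip(nested, loaded))
```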
Thank you @eloiseyang for your thorough review! I am wrapping up some other things right now. I plan to address your comments sometime later this week/early next week.
Thank you both for the work on this PR! I second most of the comments Eloise made here. It seems like this PR is doing more than it needs to in the first pass. Maybe some of the if/else-ing can be converted to try/except. My main point, though, without clouding up Eloise's comments, is that I would like to see this reduced to the most commonly used case or two. I think we should pick a save convention (or 2-3) and stick with it/enforce it for now. I'm open to discussing if others disagree.