Adds h5py dump #221
Conversation
Hi @pluflou, do you have any updates on this PR?
@pluflou this would be a good opportunity to merge in the version we have been using.
@roussel-ryan do we want this in both packages? I can copy the updated class from ml-tto and mark this as ready for review.
Yes, let's update the h5 dumping based on ml-tto, but I would do a PR to the main branch, not the serialization branch; then we can remove it from ml-tto.
@eloiseyang @roussel-ryan The class is now up to date with the ml-tto version. Previously I had a
@eloiseyang I added pandas to the requirements since I wrote the
Hi @pluflou, including pandas is a good idea, especially if the ML team already uses it. I've used it before in other projects. Let me do another review pass over this code to give comments/approval.
Hi all, thanks for the PR. I read this code as a generalized approach to dumping and recovering Python objects, like pandas dataframes, ndarrays, and dictionaries. I think that's a good idea, and it will be useful for everyone when it's implemented.
The trouble I have with this code is that its approach to different types is inconsistent. An HDF5 group is assumed to be a dictionary by default. Sometimes a group is a pandas dataframe, if it has the appropriate attributes associated with it. Dictionaries and pandas dataframes are easily dumped and recovered.
Sometimes a group is a list, as with nested lists, but these can't be recovered because there's no attribute that handles the reconstruction of lists. When the code dumps a heterogeneous list, it converts the list to a string, and the loader can't recover the data from that string. The list-group approach used for nested lists could be used to solve this problem, but it isn't.
Lists of dictionaries are handled, but tuples and ndarrays of dictionaries are not. Pandas dataframes get to know what type of object they are, but tuples don't. Dictionaries only sort of get to know, because that's the default case.
Right now, the code can dump and fully recover scalars, homogeneous lists of scalar data, dictionaries, nested dictionaries, and pandas dataframes. I would suggest we reduce the code to handle only these cases that can be dumped and recovered in their original form. A generalized way to dump and recover Python objects will definitely be useful, but we should make sure that our code can fully deliver on its promises. We don't want to be scrambling to recover a heterogeneous list of enums, booleans, strings, and doubles from a printed string two years down the line, for instance.
The idea is good and the testing suite is robust. This will be an iterative process, and we don't need to create something that's completely general to start. The first step to getting this merged is reducing it down to the parts of the code that deliver on their promise to save and recover Python objects. I'm definitely willing to help iterate on this as well once the initial merge is done. Thanks again, and let me know if you have questions.
```python
    -------
    None
    """
    dt = h5py.string_dtype(encoding=self.string_dtype)
```
Could you rename this to something like `h5str`? `dt` is not descriptive of what this does.
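A minimal sketch of the suggested rename (hedged: `"utf-8"` here is a stand-in for `self.string_dtype`, which isn't shown in this hunk):

```python
import h5py

# "h5str" describes the HDF5 variable-length string dtype better than "dt".
h5str = h5py.string_dtype(encoding="utf-8")  # "utf-8" is a stand-in
```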
```python
elif all(isinstance(ele, np.ndarray) for ele in val):
    # save np.arrays as datasets
    for i, ele in enumerate(val):
        f.create_dataset(f"{key}/{i}", data=ele, track_order=True)
```
I'd like to examine what this code is trying to accomplish by looking at the `load` function later down the line. `load` does three things: it converts scalar datasets into key/value pairs in a parent dictionary, it converts groups with the pandas attribute into pandas dataframes, and it converts all other groups into dictionaries.
Here's a pretty basic case:

```python
list_of_dicts = {'list': [{'foo': 'a'}, {'bar': 'b'}]}
saver.dump(list_of_dicts, filepath)
result = saver.load(filepath)
# result = {'list': {'0': {'foo': 'a'}, '1': {'bar': 'b'}}}
```
The result is that the original `list_of_dicts` is not recovered. This isn't so bad, and we could reasonably recover the list as long as we know what's happening. But looking ahead, there are two cases in which the code saves data as strings, which means that data in a nested or mixed-type list can't be recovered easily.
There are a couple of approaches we could take to resolve this. We could create groups that contain information about the source class. This shouldn't be so hard to implement, but it would mean writing a lot more code. The second approach is to reduce the number of cases this code handles to the ones that you and the rest of the ML team presently use and need. For instance, I assume you all are not parsing a dictionary back from its string form for a source dictionary that looks like this: `{'list': [1, {'foo': 1}]}`.
I would recommend the second approach. It looks like this code handles dictionaries and pandas dataframes the best, so I would recommend you stick to that. We can iterate and add support for nested lists and mixed-type lists as time goes on.
This is all just my interpretation based on how the code is written right now. If there are other cases you need covered, please let me know.
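For concreteness, a hedged sketch of the first approach, tagging each list group with its source type so the loader can rebuild it (`python_type` is an illustrative attribute name, not from the PR):

```python
import h5py

def dump_list(f: h5py.Group, key: str, val: list) -> None:
    # Store each element under an index key and tag the group so the
    # loader knows to rebuild a list rather than fall back to a dict.
    grp = f.create_group(key, track_order=True)
    grp.attrs["python_type"] = "list"  # illustrative attribute name
    for i, ele in enumerate(val):
        grp.create_dataset(str(i), data=ele, track_order=True)

def load_list(grp: h5py.Group) -> list:
    # Rebuild the list in index order.
    return [grp[str(i)][()] for i in range(len(grp))]
```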
```python
    )
elif isinstance(val, self.supported_types):
    f.create_dataset(key, data=val, track_order=True)
elif isinstance(val, np.ndarray):
```
ndarrays and tuples below can also contain dictionaries, but that case is not handled here. I'd recommend making that lack of implementation explicit, such as with an error.
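A hedged sketch of raising an explicit error for that gap (the check itself is illustrative, not from the PR):

```python
import numpy as np

def reject_dict_elements(val: np.ndarray) -> None:
    # Object-dtype arrays can hold dictionaries, which the dumper
    # doesn't handle; fail loudly instead of silently stringifying.
    if val.dtype == object and any(isinstance(e, dict) for e in val.flat):
        raise NotImplementedError(
            "dumping arrays of dictionaries is not supported"
        )
```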
```python
def recursive_load(f):
    d = {"attrs": dict(f.attrs)} if f.attrs else {}
    for key, val in f.items():
        if isinstance(val, h5py.Group):
```
Just to elaborate on my previous comment a little more: h5py groups do one of two things here. Either they become pandas dataframes, or they become dictionaries by default. We could also write a group that gets reconstructed as a list, which would be a nice way to reconstruct the "{key}/{i}"-type groups.
I would really recommend that we shrink this PR down. It's difficult to programmatically recover data from its printed string form, and the right way to approach those cases is to expand the code to dump and recover them appropriately, which would add a lot of bulk to this already long PR. Therefore, those parts can be cut.
I think having a generalized way to dump and load data is a good idea in the long run, but we can iterate on our current approach to groups to make it general.
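A hedged sketch of how the loader's group handling could dispatch on such an attribute (the attribute names are illustrative and match the dump-side sketch above; recursion into nested groups is omitted for brevity):

```python
import h5py
import pandas as pd

def load_group(val: h5py.Group):
    # Dispatch on an illustrative "python_type" attribute; dict stays
    # the default so existing files still load.
    kind = val.attrs.get("python_type", "dict")
    if kind == "pandas_dataframe":
        return pd.DataFrame({col: val[col][:] for col in val})
    if kind == "list":
        return [val[str(i)][()] for i in range(len(val))]
    return {k: v[()] for k, v in val.items()}
```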
```python
            for col in columns:
                data[col] = val[col][:]
            d[key] = pd.DataFrame(data)
        else:
```
We already use a `pandas_type` attribute above to tell the loader what type we're dealing with. Maybe we should add an attribute for dictionaries too?
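For illustration, a minimal sketch of tagging dictionary groups at dump time, mirroring the existing `pandas_type` attribute (the `python_type` name is illustrative, not from the PR):

```python
import h5py

def dump_dict(f: h5py.Group, key: str, val: dict) -> None:
    # Tag the group explicitly instead of relying on "dict" being
    # the loader's fallback case. Scalar/array values only; a fuller
    # version would recurse into nested dicts.
    grp = f.create_group(key, track_order=True)
    grp.attrs["python_type"] = "dict"  # illustrative attribute name
    for k, v in val.items():
        grp.create_dataset(k, data=v, track_order=True)
```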
"g": {"a": np.nan, "b": np.inf, "c": -np.inf}, | ||
"h": "np.Nan", | ||
"i": np.array((1.0, 2.0), dtype="O"), | ||
} |
Should we add assert statements here like in `test_special_values` below?
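A hedged sketch of what such asserts might look like after a dump/load round trip (`saver` and `filepath` are stand-ins borrowed from the earlier example, and the keys come from the snippet above):

```python
import numpy as np

loaded = saver.load(filepath)  # assumes the dict above was dumped here
assert np.isnan(loaded["g"]["a"])
assert loaded["g"]["b"] == np.inf
assert loaded["g"]["c"] == -np.inf
assert loaded["h"] == "np.Nan"  # stored literally as a string
```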
```python
)

# Dump to H5
result_dict = result.model_dump()
```
We already test to make sure that a saved dictionary has the same keys as the original when reloaded. Presumably screen beam measurement already tests to make sure the result has the right attributes on output. So long as both of those things are tested separately, this test doesn't test anything new.
```python
# lists of lists are saved as dicts
# here the lists are saved as np.ndarrays
assert np.array_equal(
    data["nested_list"][i], loaded_data["nested_list"][f"{i}"]
```
Here's an example of the pattern I'd like to avoid. If we can add an attribute that tells the loader that we're dealing with a pandas dataframe, then there's no reason we can't do the same for a list.
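Tying this to the sketches above, a hypothetical round trip with list tagging would let this test compare lists directly (`dump_list` and `load_list` are the illustrative helpers from earlier, not part of the PR):

```python
import h5py
import numpy as np

nested = [np.arange(3), np.arange(3) * 2]
with h5py.File("tmp.h5", "w") as f:
    dump_list(f, "nested_list", nested)   # illustrative helper from above
with h5py.File("tmp.h5", "r") as f:
    loaded = load_list(f["nested_list"])  # returns a list, not a dict
assert all(np.array_equal(a, b) for a, b in zip(nested, loaded))
```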
Thank you @eloiseyang for your thorough review! I am wrapping up some other things right now. I plan to address your comments sometime later this week/early next week.
Thank you both for the work on this PR! I second most of the comments Eloise made here. It seems like this PR is doing more than it needs to in the first pass. Maybe some of the if/else-ing can be converted to try/except. My main point, though, without clouding up Eloise's comments, is that I would like to see this reduced to the most commonly used case or two. I think we should pick a save convention (or 2-3) and stick with it/enforce it for now. I'm open to discussing if others disagree.