# Comparing cloud ATL03 v007 reads across read modules #676

rwegener2 started this conversation in Show and tell · Replies: 0 comments
## Overview

icepyx reads are much slower compared to xarray and h5py. Some extra time is expected, since icepyx adds value beyond what xarray and h5py provide, but the amount of extra read time right now is much higher than it needs to be and can be reduced.
Possible reasons for the slower reads:
## Methods & Results
Icepyx takes 3.5 times as long (or more) to read one group of data compared to xarray or h5py.
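A ratio like this can be measured with a small wall-clock harness; a minimal sketch (the read functions in the comment are placeholders, not from the post):

```python
import time

def best_of(fn, repeats=3):
    """Best wall-clock time over a few repeats, to damp network jitter."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

# Hypothetical usage (each function would wrap an icepyx / xarray / h5py read):
#     ratio = best_of(read_with_icepyx) / best_of(read_with_xarray)
demo = best_of(lambda: sum(range(1000)))
```

Taking the best of several repeats matters for cloud reads, where a single slow request can dominate one run.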
The left plot in the figure below considers the "full time" it takes to read a dataset, which includes instantiating a `Read()` object and appending variables. The right plot compares just the time it takes to `.load()` a `Read()` object after it is created. Icepyx takes much longer to read than xarray and h5py (left plot). When considering just the `.load()` method (right plot), icepyx is more comparable, but only for small file sizes.

**For a small file the vast majority of read time is spent instantiating the Read object and creating Variables objects**
*Figure: profile timeline for one group of the small datafile*
The plot above shows the output from `pyinstrument` while profiling the full read code. It shows a timeline with time on the x axis; the amount of time spent in different methods is shown by bars of different lengths. We see that the `Read.load()` method (red box) runs for only a few seconds out of the 17 total seconds. Most of the time is spent instantiating the Read object and appending variables. (Note: reads timed using `pyinstrument` are slower than runs without the profiler, e.g. in the figure above, due to the overhead of running the tool.) The `extract_product` and `extract_version` methods take up the majority of the time and, notably, they are each repeated at least once. This is likely an avenue for optimization.

When timing outside of `pyinstrument`, read instantiation and variable appending are consistent across file sizes (~6 seconds). This makes sense given that the same amount of data is being accessed for every file.

**Icepyx gets bogged down on larger files. The time seems to be spent combining datasets**
*Figure: profile timeline for one group of the mediumlarge datafile*
Notable takeaways from the mediumlarge profile:

- Time is spent in the `.expand_dims()` method (smaller red box).
- There are multiple calls to `Read._read_single_grp()` when loading (red circle). I was surprised by this; given that we are only reading one beam of data I would have expected just one call.

## Code snippets for full and partial read
For the "full read" (left plot) the whole snippet is timed. For the `.load()` read (right plot) only the last line of the code block above is timed.

## Data Sizes and Granules
- ATL03_20190613013940_11570313_007_02.h5
- ATL03_20190613055526_11600309_007_02.h5
- ATL03_20190613070708_11610306_007_02.h5
- ATL03_20190611220139_11400305_007_02.h5 \*

\* "Amount of data read" calculated using xarray's `ds.nbytes`

## Actionable Takeaway
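The `ds.nbytes` measure from the footnote can be illustrated on a toy dataset (the variable and dimension names here are made up, not taken from the granules):

```python
import numpy as np
import xarray as xr

# Toy dataset: one float64 variable holding a million values.
ds = xr.Dataset({"h_ph": ("photon", np.zeros(1_000_000, dtype="float64"))})

# nbytes sums the in-memory size of all variables and coordinates:
# 1,000,000 values x 8 bytes each = 8,000,000 bytes.
print(ds.nbytes)  # 8000000
```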
- Revisit how `open_dataset` is used when only opening data for a single group.
- Follow up on `.expand_dims()`.

Again, the goal for making icepyx faster with cloud data is to condense data reads into fewer calls. This decreases the number of requests and would enable us to take advantage of other parallelization libraries (e.g. h5coro).
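As one illustration of condensing reads into fewer calls, every dataset under a beam group can be collected in a single file-open pass with h5py (the group layout and variable names below are illustrative stand-ins, not the real ATL03 structure):

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny stand-in file with one beam group (layout is illustrative).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("gt1l/heights")
    grp.create_dataset("h_ph", data=np.arange(5.0))
    grp.create_dataset("lat_ph", data=np.arange(5.0))

def read_group(path, group):
    """Read every dataset under one group in a single open/visit pass."""
    out = {}
    def collect(name, obj):
        if isinstance(obj, h5py.Dataset):
            out[name] = obj[...]
    with h5py.File(path, "r") as f:
        f[group].visititems(collect)
    return out

data = read_group(path, "gt1l/heights")
print(sorted(data))  # ['h_ph', 'lat_ph']
```

Against cloud storage, one open/visit pass like this avoids re-issuing connection and metadata requests per variable, which is the same motivation behind tools like h5coro.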