# Comparing cloud ATL03 v007 reads across read modules #676

rwegener2 started this conversation in Show and tell · Replies: 0 comments
## Overview

icepyx reads are much slower compared to xarray and h5py. Some extra time is expected, since icepyx adds value beyond what xarray and h5py provide, but the amount of extra read time right now is much higher than it needs to be and can be reduced.
Possible reasons for the slower reads:
## Methods & Results
Icepyx takes 3.5 times as long (or more) to read one group of data compared to xarray or h5py.
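A ratio like this can be measured with a small wall-clock harness; a minimal sketch (the read functions in the comment are placeholders, not from the post):

```python
import time

def best_of(fn, repeats=3):
    """Best wall-clock time over a few repeats, to damp network jitter."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

# Hypothetical usage (each function would wrap an icepyx / xarray / h5py read):
#     ratio = best_of(read_with_icepyx) / best_of(read_with_xarray)
demo = best_of(lambda: sum(range(1000)))
```

Taking the best of several repeats matters for cloud reads, where a single slow request can dominate one run.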
The left plot in the figure below considers the "full time" it takes to read a dataset, which includes instantiating a `Read()` object and appending variables. The right plot compares just the time it takes to `.load()` a `Read()` object after it is created. Icepyx takes much longer to read than xarray and h5py (left plot). When considering just the `.load()` method (right plot), icepyx is more comparable, but only for small file sizes.

**For a small file the vast majority of read time is spent instantiating the Read object and creating Variables objects**
*Figure: profile timeline for one group of the small datafile*
The plot above shows the output from `pyinstrument` while profiling the full read code. It shows a timeline with time on the x axis; the amount of time spent in different methods is shown by bars of different lengths. We see that the `Read.load()` method (red box) runs for only a few seconds out of the 17 total seconds. Most of the time is spent instantiating the Read object and appending variables. (Note: reads timed using `pyinstrument` are slower than runs without the profiler, e.g. in the figure above, due to the overhead of running the tool.) The `extract_product` and `extract_version` methods take up the majority of the time and, notably, they are each repeated at least once. This is likely an avenue for optimization.

When timing outside of `pyinstrument`, read instantiation and variable appending are consistent across file sizes (~6 seconds). This makes sense given that the same amount of data is being accessed for every file.

**Icepyx gets bogged down on larger files. The time seems to be spent combining datasets**
*Figure: profile timeline for one group of the mediumlarge datafile*
Notable takeaways from the mediumlarge profile:

- Time is spent in the `.expand_dims()` method (smaller red box).
- There are multiple calls to `Read._read_single_grp()` when loading (red circle). I was surprised by this; given that we are only reading one beam of data I would have expected just one call.

## Code snippets for full and partial read
For the "full read" (left plot) the whole snippet is timed. For the `.load()` read (right plot) only the last line of the code block above is timed.

## Data Sizes and Granules
- ATL03_20190613013940_11570313_007_02.h5
- ATL03_20190613055526_11600309_007_02.h5
- ATL03_20190613070708_11610306_007_02.h5
- ATL03_20190611220139_11400305_007_02.h5 \*

\* "Amount of data read" calculated using xarray's `ds.nbytes`

## Actionable Takeaway
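The `ds.nbytes` measure from the footnote can be illustrated on a toy dataset (the variable and dimension names here are made up, not taken from the granules):

```python
import numpy as np
import xarray as xr

# Toy dataset: one float64 variable holding a million values.
ds = xr.Dataset({"h_ph": ("photon", np.zeros(1_000_000, dtype="float64"))})

# nbytes sums the in-memory size of all variables and coordinates:
# 1,000,000 values x 8 bytes each = 8,000,000 bytes.
print(ds.nbytes)  # 8000000
```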
- Revisit how `open_dataset` is used when only opening data for a single group.
- Follow up on `.expand_dims()`.

Again, the goal for making icepyx faster with cloud data is to condense data reads into fewer calls. This decreases the number of requests and would enable us to take advantage of other parallelization libraries (e.g. h5coro).
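As one illustration of condensing reads into fewer calls, every dataset under a beam group can be collected in a single file-open pass with h5py (the group layout and variable names below are illustrative stand-ins, not the real ATL03 structure):

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny stand-in file with one beam group (layout is illustrative).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("gt1l/heights")
    grp.create_dataset("h_ph", data=np.arange(5.0))
    grp.create_dataset("lat_ph", data=np.arange(5.0))

def read_group(path, group):
    """Read every dataset under one group in a single open/visit pass."""
    out = {}
    def collect(name, obj):
        if isinstance(obj, h5py.Dataset):
            out[name] = obj[...]
    with h5py.File(path, "r") as f:
        f[group].visititems(collect)
    return out

data = read_group(path, "gt1l/heights")
print(sorted(data))  # ['h_ph', 'lat_ph']
```

Against cloud storage, one open/visit pass like this avoids re-issuing connection and metadata requests per variable, which is the same motivation behind tools like h5coro.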