Update chunking tutorial - add stuff about disk chunks, dask graph, etc#508
Conversation
charles-turner-1
commented
Apr 7, 2025
- Renamed the tutorial to be more descriptive
- Added some exposition on Dask Graphs, etc.
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
This pull request has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/intake-vs-mfdataset/4445/19 |
|
Hey @dougiesquire, @jemmajeffree - I've updated this to include some points relevant to the chunking forum discussion - give us a shout if you think anything needs changing/clarification! |
6a2e0d1 to
1f11480
Compare
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:05Z Line #3. # Open up the dashboard by clicking the launch button below - it'll help you to see what dask is doing when it runs expensive operations. I use "task stream" and "progress" most, followed by "scheduler system" and "workers memory" - Is it worth making a suggestion for smth like this given how overwhelming the selection of yellow boxes can be? charles-turner-1 commented on 2025-05-09T03:47:54Z I've actually never changed mine from the defaults, which are task steam, worker memory & progress for me. TBH I'm mostly interested in just being able to see something (anything) is happening & the session hasn't died jemmajeffree commented on 2025-05-09T03:56:19Z you got defaults!? I started with nothing and a million options, so maybe it's changed recently or this is a slightly different approach that opens things.
Anyhow, if there are defaults then it's fine to leave without further suggestions |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:06Z Line #4. datastore.to_dask(xarray_open_kwargs = {'decode_timedelta' : False}) # We need this to avoid a bunch of annoying warnings.
"this" -> "{'decode_timedelta': False}" ; possibly ambiguous for new users charles-turner-1 commented on 2025-05-09T03:48:22Z yeah, will add a little more exposition there charles-turner-1 commented on 2025-05-14T06:04:59Z fixed above
|
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:06Z is there a reason not to merge this idea with the previous cell? ie, time the ds = and finish with ds['u']? charles-turner-1 commented on 2025-05-09T03:49:16Z when you use the %%timeit cell magic you can't see any of the outputs - I think they're actually deleted by the cell magic to make benchmarking less unreliable |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:07Z Move the dot points to after the "rule of thumb" line, to make it clearer why we want each file as a single chunk? Also replace "Lets seem what happens if" with "Let's see what chunk size we get if"
|
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:08Z I'm not quite sure what you're getting at with point 2. Isn't the argument more that the chunks were too small? If they were huge, you would definitely want to split each file into chunks even though we're opening all of them
Also, shouldn't the last point be in a dot point?
I wonder if it's worth having something in here about choosing chunk sizes for whatever you're going to do next. Ie, if you know you're only looking at the southern ocean, then maybe leave the chunks split in yu_ocean, or if only looking at the shallow ocean leave them split in st_ocean? I know this isn't the right place to give a full explanation of choosing chunks for each individual analysis, but I think a brief mention would help charles-turner-1 commented on 2025-05-14T07:38:35Z Yeah, I'm not quite sure what I'm saying with point 2 any more. I'm gonna leave it in for now in case I figure it out & it was important. If I don't I'll excise it before we merge. |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:08Z pedagogically, I don't really like giving a deliberately bad example before a good one, and I'm worried about confusing people. I think this point might be better made using xu_ocean:400, and then explain afterwards why xu_ocean:120 is a bad idea |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:09Z what happens if you use time:3 instead of 2, does it help at all? ie, because won't chunk sizes of 2 break across files? |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:10Z Maybe I missed something, but I thought that warning meant that you're not respecting the chunks on disk because you haven't got an integer multiple? In which case I'd be less comfortable encouraging people to ignore it |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:10Z I'm getting a little lost in this section, what I'm supposed to take from each cell. What would the output be here if they were problematic? |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:11Z they're bridging across files though, in the time dimension. Is this not an issue? And, come to think about it, aren't you going to end up with (2,1) repeatedly through each file if there's no merging of chunks between files? |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:12Z this pattern shows up in the task stream as a lot of flashing very thin yellow and green stripes, possibly worth mentioning as something to look for (there's lots of white, which is a sure sign something can be optimised) |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:13Z I'm slightly concerned about somebody copy-pasting 'datavars':'minimal' without understanding what it does and then discovering that they can't load data from multiple variables into a single dataset later. Can we add a 1 line explanation? |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:13Z is there a reason to have this step separate from the above? In what situation would you use 'coords': minmal without 'compat':'override' Especially since the thing that makes the difference and does what you said was important (overriding checks) is the compat override |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:14Z Can we topload the TLDR into the start of the section? Knowing scientists and where they look for information... |
|
View / edit / reply to this conversation on ReviewNB jemmajeffree commented on 2025-05-09T03:27:15Z Incidentally, I can think of situations in which I might want to run a rolling average through the ocean like that (ie, I want to strip out high frequency signals but can't be bothered convolving with a gaussian etc) wouldn't recommend changing anything, but you might be interested |
|
I've actually never changed mine from the defaults, which are task steam, worker memory & progress for me. TBH I'm mostly interested in just being able to see something (anything) is happening & the session hasn't died View entire conversation on ReviewNB |
|
yeah, will add a little more exposition there View entire conversation on ReviewNB |
|
when you use the %%timeit cell magic you can't see any of the outputs - I think they're actually deleted by the cell magic to make benchmarking less unreliable View entire conversation on ReviewNB |
|
you got defaults!? I started with nothing and a million options, so maybe it's changed recently or this is a slightly different approach that opens things.
Anyhow, if there are defaults then it's fine to leave without further suggestions View entire conversation on ReviewNB |
|
I think the chunking section needs a hefty rewrite TBH, being able to look directly at the chunks now has changed the approach we wanna take - will re-request a review when that's done. |
|
fixed above
View entire conversation on ReviewNB |
|
Yeah, I'm not quite sure what I'm saying with point 2 any more. I'm gonna leave it in for now in case I figure it out & it was important. If I don't I'll excise it before we merge. View entire conversation on ReviewNB |