Skip to content

Update chunking tutorial - add stuff about disk chunks, dask graph, etc#508

Merged
charles-turner-1 merged 11 commits into
mainfrom
chunking-tutorial-update
May 22, 2025
Merged

Update chunking tutorial - add stuff about disk chunks, dask graph, etc#508
charles-turner-1 merged 11 commits into
mainfrom
chunking-tutorial-update

Conversation

@charles-turner-1

Copy link
Copy Markdown
Collaborator
  • Renamed the tutorial to be more descriptive
  • Added some exposition on Dask Graphs, etc.

@review-notebook-app

Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@charles-turner-1 charles-turner-1 requested a review from Copilot April 7, 2025 04:58

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

@charles-turner-1 charles-turner-1 marked this pull request as ready for review April 7, 2025 04:58
@charles-turner-1 charles-turner-1 requested a review from navidcy April 7, 2025 04:58
@access-hive-bot

Copy link
Copy Markdown

This pull request has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/intake-vs-mfdataset/4445/19

@charles-turner-1

Copy link
Copy Markdown
Collaborator Author

Hey @dougiesquire, @jemmajeffree - I've updated this to include some points relevant to the chunking forum discussion - give us a shout if you think anything needs changing/clarification!

@charles-turner-1 charles-turner-1 force-pushed the chunking-tutorial-update branch from 6a2e0d1 to 1f11480 Compare May 7, 2025 01:33
@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:05Z
----------------------------------------------------------------

Line #3.    # Open up the dashboard by clicking the launch button below - it'll help you to see what dask is doing when it runs expensive operations. 

I use "task stream" and "progress" most, followed by "scheduler system" and "workers memory" - Is it worth making a suggestion for smth like this given how overwhelming the selection of yellow boxes can be?


charles-turner-1 commented on 2025-05-09T03:47:54Z
----------------------------------------------------------------

I've actually never changed mine from the defaults, which are task steam, worker memory & progress for me. TBH I'm mostly interested in just being able to see something (anything) is happening & the session hasn't died

jemmajeffree commented on 2025-05-09T03:56:19Z
----------------------------------------------------------------

you got defaults!? I started with nothing and a million options, so maybe it's changed recently or this is a slightly different approach that opens things.

Anyhow, if there are defaults then it's fine to leave without further suggestions

@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:06Z
----------------------------------------------------------------

Line #4.    datastore.to_dask(xarray_open_kwargs = {'decode_timedelta' : False}) # We need this to avoid a bunch of annoying warnings.

"this" -> "{'decode_timedelta': False}" ; possibly ambiguous for new users


charles-turner-1 commented on 2025-05-09T03:48:22Z
----------------------------------------------------------------

yeah, will add a little more exposition there

charles-turner-1 commented on 2025-05-14T06:04:59Z
----------------------------------------------------------------

fixed above

@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:06Z
----------------------------------------------------------------

is there a reason not to merge this idea with the previous cell? ie, time the ds = and finish with ds['u']?


charles-turner-1 commented on 2025-05-09T03:49:16Z
----------------------------------------------------------------

when you use the %%timeit cell magic you can't see any of the outputs - I think they're actually deleted by the cell magic to make benchmarking less unreliable

@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:07Z
----------------------------------------------------------------

Move the dot points to after the "rule of thumb" line, to make it clearer why we want each file as a single chunk?

Also replace "Lets seem what happens if" with "Let's see what chunk size we get if"


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:08Z
----------------------------------------------------------------

I'm not quite sure what you're getting at with point 2. Isn't the argument more that the chunks were too small? If they were huge, you would definitely want to split each file into chunks even though we're opening all of them

Also, shouldn't the last point be in a dot point?

I wonder if it's worth having something in here about choosing chunk sizes for whatever you're going to do next. Ie, if you know you're only looking at the southern ocean, then maybe leave the chunks split in yu_ocean, or if only looking at the shallow ocean leave them split in st_ocean? I know this isn't the right place to give a full explanation of choosing chunks for each individual analysis, but I think a brief mention would help


charles-turner-1 commented on 2025-05-14T07:38:35Z
----------------------------------------------------------------

Yeah, I'm not quite sure what I'm saying with point 2 any more. I'm gonna leave it in for now in case I figure it out & it was important. If I don't I'll excise it before we merge.

@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:08Z
----------------------------------------------------------------

pedagogically, I don't really like giving a deliberately bad example before a good one, and I'm worried about confusing people. I think this point might be better made using xu_ocean:400, and then explain afterwards why xu_ocean:120 is a bad idea


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:09Z
----------------------------------------------------------------

what happens if you use time:3 instead of 2, does it help at all? ie, because won't chunk sizes of 2 break across files?


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:10Z
----------------------------------------------------------------

Maybe I missed something, but I thought that warning meant that you're not respecting the chunks on disk because you haven't got an integer multiple? In which case I'd be less comfortable encouraging people to ignore it


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:10Z
----------------------------------------------------------------

I'm getting a little lost in this section, what I'm supposed to take from each cell. What would the output be here if they were problematic?


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:11Z
----------------------------------------------------------------

they're bridging across files though, in the time dimension. Is this not an issue? And, come to think about it, aren't you going to end up with (2,1) repeatedly through each file if there's no merging of chunks between files?


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:12Z
----------------------------------------------------------------

this pattern shows up in the task stream as a lot of flashing very thin yellow and green stripes, possibly worth mentioning as something to look for (there's lots of white, which is a sure sign something can be optimised)


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:13Z
----------------------------------------------------------------

I'm slightly concerned about somebody copy-pasting 'datavars':'minimal' without understanding what it does and then discovering that they can't load data from multiple variables into a single dataset later. Can we add a 1 line explanation?


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:13Z
----------------------------------------------------------------

is there a reason to have this step separate from the above? In what situation would you use 'coords': minmal without 'compat':'override' Especially since the thing that makes the difference and does what you said was important (overriding checks) is the compat override


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:14Z
----------------------------------------------------------------

Can we topload the TLDR into the start of the section? Knowing scientists and where they look for information...


@review-notebook-app

review-notebook-app Bot commented May 9, 2025

Copy link
Copy Markdown

View / edit / reply to this conversation on ReviewNB

jemmajeffree commented on 2025-05-09T03:27:15Z
----------------------------------------------------------------

Incidentally, I can think of situations in which I might want to run a rolling average through the ocean like that (ie, I want to strip out high frequency signals but can't be bothered convolving with a gaussian etc) wouldn't recommend changing anything, but you might be interested


Copy link
Copy Markdown
Collaborator Author

I've actually never changed mine from the defaults, which are task steam, worker memory & progress for me. TBH I'm mostly interested in just being able to see something (anything) is happening & the session hasn't died


View entire conversation on ReviewNB

Copy link
Copy Markdown
Collaborator Author

yeah, will add a little more exposition there


View entire conversation on ReviewNB

Copy link
Copy Markdown
Collaborator Author

when you use the %%timeit cell magic you can't see any of the outputs - I think they're actually deleted by the cell magic to make benchmarking less unreliable


View entire conversation on ReviewNB

Copy link
Copy Markdown
Collaborator

you got defaults!? I started with nothing and a million options, so maybe it's changed recently or this is a slightly different approach that opens things.

Anyhow, if there are defaults then it's fine to leave without further suggestions


View entire conversation on ReviewNB

@charles-turner-1

Copy link
Copy Markdown
Collaborator Author

I think the chunking section needs a hefty rewrite TBH, being able to look directly at the chunks now has changed the approach we wanna take - will re-request a review when that's done.

Copy link
Copy Markdown
Collaborator Author

fixed above


View entire conversation on ReviewNB

@charles-turner-1 charles-turner-1 changed the title Update chunking tutorial - add stuff about dask graph, etc Update chunking tutorial - add stuff about disk chunks, dask graph, etc May 14, 2025

Copy link
Copy Markdown
Collaborator Author

Yeah, I'm not quite sure what I'm saying with point 2 any more. I'm gonna leave it in for now in case I figure it out & it was important. If I don't I'll excise it before we merge.


View entire conversation on ReviewNB

@charles-turner-1 charles-turner-1 merged commit 8194ce2 into main May 22, 2025
3 checks passed
@charles-turner-1 charles-turner-1 deleted the chunking-tutorial-update branch May 22, 2025 02:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants