
Conversation

felix0097
Collaborator

This is a draft of a quickstart tutorial notebook. The idea is to walk the user through all the steps needed to use the package, e.g. from creating the collection to actually using the dataloader, and to explain the important settings along the way.

Let me know what you think and whether I missed anything.

@felix0097 felix0097 requested a review from ilan-gold October 1, 2025 16:54

codecov bot commented Oct 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.63%. Comparing base (e42d367) to head (b6b731f).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #48   +/-   ##
=======================================
  Coverage   83.63%   83.63%           
=======================================
  Files           8        8           
  Lines         605      605           
=======================================
  Hits          506      506           
  Misses         99       99           


@ilan-gold ilan-gold left a comment


Generally looks very good

Comment on lines 45 to 47
" \"threading.max_workers\": 5,\n",
" \"codec_pipeline.path\": \"zarrs.ZarrsCodecPipeline\",\n",
" \"concurrency\": 4,\n",
Collaborator


Either don't set `max_workers` and `concurrency`, or make them dependent on `os.cpu_count()`. I just wouldn't set them, to be honest.
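The `os.cpu_count()`-dependent alternative might look like the sketch below. The config keys are the ones quoted in the diff above; deriving the values from the core count is only the reviewer's suggestion, not a documented default of the package.

```python
import os

# Sketch only: derive the settings from the machine instead of hard-coding them.
n_cpus = os.cpu_count() or 1

settings = {
    "threading.max_workers": max(1, n_cpus - 1),  # leave a core for the main process
    "codec_pipeline.path": "zarrs.ZarrsCodecPipeline",
    "concurrency": n_cpus,
}
```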

Collaborator


Although if you're going to have this big section, maybe you should explain the settings.

Collaborator Author


I've just double-checked this; for me it doesn't give much of a speed increase. I'd be fine with removing it on that basis.

Collaborator


I'm also fine with removing this, yeah.

Comment on lines 21 to 23
"# Download an example dataset from CELLxGENE\n",
"!wget https://datasets.cellxgene.cziscience.com/866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad"
],
Collaborator


Maybe we should use two datasets? It's easy enough, and it highlights that we can handle the `var` space.

" \"866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad\",\n",
" ],\n",
" # Path to store the output collection\n",
" output_path=\"tahoe100_FULL\",\n",
Collaborator


A different name?

"metadata": {},
"cell_type": "markdown",
"source": [
"IMPORTANT:\n",
Collaborator


I think "IMPORTANT" should at least be bold

Comment on lines +164 to +168
"* The `ZarrSparseDataset` yields batches of sparse tensors.\n",
"* The conversion to dense tensors should be done on the GPU, as shown in the example below.\n",
" * First call `.cuda()` and then `.to_dense()`\n",
" * E.g. `x = x.cuda().to_dense()`\n",
" * This is significantly faster than doing the dense conversion on the CPU.\n"
Collaborator


Maybe mention `preload_to_gpu` here, i.e., if you have a GPU and can spare some extra memory, you should use `preload_to_gpu`, and then you don't need to call `.cuda()`.

Collaborator Author


I've added the `preload_to_gpu` option. I would leave the `.cuda()` call in, though: removing it might just confuse the user, and if everything is already on the right device the `.cuda()` call doesn't do anything.

From the torch documentation:

"If this object is already in CUDA memory and on the correct device, then no copy is performed and the original object is returned."

Collaborator


"If this object is already in CUDA memory and on the correct device, then no copy is performed and the original object is returned."

Oh, rad. Then I would have been OK leaving out `preload_to_gpu`, but now that this is moving in the direction of a guide in the docs rather than in the README.md, the extra detail is good.
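The no-op behaviour quoted from the torch documentation can be sketched like this. The stand-in sparse CSR batch is hypothetical (it mimics what a `ZarrSparseDataset`-style loader would yield), and the example runs on CPU when no GPU is present.

```python
import torch

# Hypothetical stand-in for a sparse batch yielded by the dataloader:
# dense equivalent is [[1, 0, 2], [0, 3, 0]].
x = torch.sparse_csr_tensor(
    crow_indices=torch.tensor([0, 2, 3]),
    col_indices=torch.tensor([0, 2, 1]),
    values=torch.tensor([1.0, 2.0, 3.0]),
    size=(2, 3),
)

if torch.cuda.is_available():
    # Per the torch docs: a no-op if x is already on the correct CUDA device,
    # which is why the call is harmless when preload_to_gpu is used.
    x = x.cuda()

# Densify after the device transfer; on a GPU this is much faster than on CPU.
dense = x.to_dense()
```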

"source": [
"The conversion code will take care of the following things:\n",
"* Align the gene spaces across all datasets listed in `adata_paths`\n",
" * The gene spaces are aligned based on the gene names provided in the `var_names` field of the individual `AnnData` objects.\n",
Collaborator


It's specifically an outer join, though; "aligned" is ambiguous.
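What the outer join over `var_names` boils down to, as a pure-Python sketch with toy gene lists (not the package's actual implementation):

```python
# Toy var_names for two hypothetical datasets.
datasets = {
    "a.h5ad": ["gene1", "gene2", "gene3"],
    "b.h5ad": ["gene2", "gene4"],
}

# Outer join: the union of all gene names, in a deterministic order.
joint_var_names = sorted(set().union(*datasets.values()))

# Each dataset is then reindexed to this joint gene space; genes a
# dataset lacks are filled with zeros.
```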

@felix0097 felix0097 requested a review from ilan-gold October 6, 2025 10:42

@ilan-gold ilan-gold left a comment


With these changes, it looks good. You'll probably want to make sure it still renders correctly.

"metadata": {},
"outputs": [],
"source": [
"from arrayloaders import create_anndata_collection\n",
Collaborator


Update imports :)))


@ilan-gold ilan-gold force-pushed the quickstart-tutorial branch from 7e3d9ea to ef2545c Compare October 9, 2025 10:33