
Re-implementing co_occurrence() #975


Open
wants to merge 11 commits into
base: main

wenjie1991

IMPORTANT: Please search among the Pull requests before creating one.

Description

Together with @MDLDan, we reimplemented the squidpy.gr.co_occurrence() function using Numba.
The new algorithm removes the need for a pre-calculated pairwise distance matrix, so it can handle large datasets without splitting them. Parallel processing is enabled by default, and the function runs 40 times faster.

We also implemented it in Rust using PyO3 and achieved similar performance. We chose to push the Numba implementation.
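
The idea of avoiding a stored distance matrix can be illustrated with a minimal sketch. This is not the code from this PR: the names `_pair_counts` and `cooccurrence_counts`, the `n_chunks` parameter, and the choice to return raw label-pair counts (without the normalization that co_occurrence() applies) are assumptions made for illustration. Distances are computed on the fly inside the loop, and each parallel chunk writes to its own accumulator so that threads never share a counter.

```python
import numpy as np

try:  # Numba is optional here; without it the sketch runs as plain (slow) Python.
    from numba import njit, prange
except ImportError:
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func


@njit(parallel=True)
def _pair_counts(coords, labels, n_labels, sq_thresholds, n_chunks):
    """Hypothetical kernel: count ordered label pairs within each threshold."""
    n = coords.shape[0]
    n_thr = sq_thresholds.shape[0]
    # One private accumulator per chunk avoids write races between threads.
    partial = np.zeros((n_chunks, n_thr, n_labels, n_labels), dtype=np.int64)
    for c in prange(n_chunks):
        for i in range(c, n, n_chunks):
            for j in range(n):
                if i == j:
                    continue
                dx = coords[i, 0] - coords[j, 0]
                dy = coords[i, 1] - coords[j, 1]
                d2 = dx * dx + dy * dy  # squared distance, never stored
                for t in range(n_thr):
                    if d2 <= sq_thresholds[t]:
                        partial[c, t, labels[i], labels[j]] += 1
    return partial


def cooccurrence_counts(coords, labels, thresholds, n_chunks=8):
    """Return counts[t, a, b]: ordered pairs (label a, label b) with distance <= thresholds[t]."""
    coords = np.ascontiguousarray(coords, dtype=np.float64)
    labels = np.ascontiguousarray(labels, dtype=np.int64)
    sq_thresholds = np.asarray(thresholds, dtype=np.float64) ** 2
    n_labels = int(labels.max()) + 1
    partial = _pair_counts(coords, labels, n_labels, sq_thresholds, n_chunks)
    return partial.sum(axis=0)  # merge the per-chunk accumulators
```

In this sketch the extra memory grows with `n_chunks * len(thresholds) * n_labels**2` rather than with the square of the number of cells, which is why no splitting of the input is needed.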

The following issues are related:
#229
#755
#223
#582

How has this been tested?

  • All squidpy.gr.co_occurrence()-related tests in the squidpy test suite pass.
  • We also compared the output of the new and old implementations:
    • As long as the number of cells does not require squidpy to split the data, the differences are in the 1e-08 range.
    • When the number of cells requires squidpy to split the data, the differences are on the order of 1e-02, see (
      .
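
A comparison like the one described above could be scripted roughly as follows. This is a sketch under assumptions: `compare_outputs` is a hypothetical helper, and the arrays in the usage lines are synthetic stand-ins, not outputs of either implementation.

```python
import numpy as np


def compare_outputs(old, new, atol):
    """Return (largest absolute difference, whether it is within atol)."""
    old = np.asarray(old, dtype=float)
    new = np.asarray(new, dtype=float)
    if old.shape != new.shape:
        raise ValueError(f"shape mismatch: {old.shape} vs {new.shape}")
    max_diff = float(np.max(np.abs(old - new)))
    return max_diff, max_diff <= atol


# Synthetic stand-ins for the two result arrays (illustration only).
old_result = np.array([[1.0, 2.0], [3.0, 4.0]])
new_result = old_result + 5e-9
max_diff, within_tol = compare_outputs(old_result, new_result, atol=1e-8)
```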

Closes

closes #755

Intron7 requested review from timtreis and Intron7 on March 19, 2025, 10:45

codecov-commenter commented Mar 19, 2025

Codecov Report

Attention: Patch coverage is 53.75000% with 37 lines in your changes missing coverage. Please review.

Project coverage is 66.60%. Comparing base (4a632d6) to head (26200d3).
Report is 189 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/squidpy/gr/_ppatterns.py | 53.75% | 34 Missing and 3 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #975      +/-   ##
==========================================
- Coverage   69.99%   66.60%   -3.39%     
==========================================
  Files          39       40       +1     
  Lines        5532     6079     +547     
  Branches     1037     1031       -6     
==========================================
+ Hits         3872     4049     +177     
- Misses       1367     1669     +302     
- Partials      293      361      +68     

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/squidpy/gr/_ppatterns.py | 79.85% <53.75%> (+0.88%) | ⬆️ |

... and 12 files with indirect coverage changes


Intron7 (Member) commented Mar 21, 2025

@wenjie1991 This already looks very good and promising, but I believe you can squeeze out even more performance. You can start by making the memory access pattern more efficient. You can also compile the outer function with Numba's njit and parallelize it. I would also cache the kernel, which makes it even more efficient.
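
One possible reading of these suggestions, as a hedged sketch rather than the change the reviewer had in mind: the function name `counts_thresholds_last`, the split of coordinates into separate contiguous `x` and `y` arrays, and the output layout are all assumptions. The outer function itself is compiled and parallelized, each parallel iteration owns one slab `out[a]` so no two threads write to the same element, and the innermost loop walks the last (contiguous) axis of the output.

```python
import numpy as np

try:  # fallback so the sketch still runs, slowly, without Numba
    from numba import njit, prange
except ImportError:
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func


@njit(parallel=True)
def counts_thresholds_last(x, y, labels, n_labels, sq_thresholds):
    """Hypothetical kernel returning out[a, b, t] for ordered label pairs (a, b)."""
    n = x.shape[0]
    n_thr = sq_thresholds.shape[0]
    out = np.zeros((n_labels, n_labels, n_thr), dtype=np.int64)
    for a in prange(n_labels):  # each iteration writes only to out[a]
        for i in range(n):
            if labels[i] != a:
                continue
            for j in range(n):
                if i == j:
                    continue
                dx = x[i] - x[j]
                dy = y[i] - y[j]
                d2 = dx * dx + dy * dy
                b = labels[j]
                for t in range(n_thr):  # contiguous writes along the last axis
                    if d2 <= sq_thresholds[t]:
                        out[a, b, t] += 1
    return out


# Usage with contiguous 1-D inputs (synthetic coordinates for illustration).
x = np.ascontiguousarray([0.0, 1.0, 0.0])
y = np.ascontiguousarray([0.0, 0.0, 3.0])
labels = np.array([0, 1, 0], dtype=np.int64)
result = counts_thresholds_last(x, y, labels, 2, np.array([1.5, 5.0]) ** 2)
```

When such a kernel lives in a module file, passing `cache=True` to `njit` additionally stores the compiled code on disk, so later sessions skip the compilation step. Parallelizing over labels keeps the sketch race-free but balances work poorly when label sizes differ a lot.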


Development

Successfully merging this pull request may close these issues.

sq.gr.co_occurrence gives different results for n_splits=1 and n_splits>1
5 participants