Parallel Processing for Mapper #1189


Open · wants to merge 2 commits into master
Conversation

ZiyadOulhaj

Added parallel processing (using joblib) for the fit method of MapperComplex.

The user can give the maximum number of concurrently running jobs (n_jobs) and the backend implementation (backend) as arguments to fit().

Additionally, data points are assigned to patches with array comparison instead of a for loop.
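The joblib pattern described above can be sketched roughly as follows; `cluster_patch`, `fit_patches`, and the toy data are illustrative stand-ins of mine, not the PR's actual code:

```python
# Illustrative sketch (not the PR's actual code): parallelizing per-patch
# work in a Mapper-style fit with joblib. `cluster_patch` stands in for
# whatever computation is run on each cover patch.
import numpy as np
from joblib import Parallel, delayed

def cluster_patch(X, idx):
    # Placeholder per-patch computation: here, just the patch mean.
    return X[idx].mean(axis=0)

def fit_patches(X, patches, n_jobs=1, backend="loky"):
    # n_jobs and backend are forwarded straight to joblib.Parallel,
    # mirroring the two new fit() arguments described above.
    return Parallel(n_jobs=n_jobs, backend=backend)(
        delayed(cluster_patch)(X, idx) for idx in patches
    )

X = np.arange(12.0).reshape(6, 2)
patches = [np.array([0, 1, 2]), np.array([3, 4, 5])]
# threading backend here only to keep the toy example lightweight
print(fit_patches(X, patches, n_jobs=2, backend="threading"))
```

One job is dispatched per patch, so the work distributes evenly when the patches have similar sizes.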

Contributor

@VincentRouvreau VincentRouvreau left a comment


I also ran some benchmarks with this Python script and this human.txt dataset, and I am quite surprised by the timings:

n_jobs=1
0:00:04.366957
0:00:41.875797
0:00:05.119477
n_jobs=2
0:00:04.908189
0:00:41.133749
0:00:05.089210
n_jobs=3
0:00:04.968691
0:00:40.905412
0:00:04.944852
n_jobs=4
0:00:05.191212
0:00:41.379255
0:00:04.892046

@@ -410,109 +424,76 @@ def fit(self, X, y=None, filters=None, colors=None):
# Initialize attributes
self.simplex_tree_, self.node_info_ = SimplexTree(), {}

if np.all(self.gains < .5):
Contributor

Any reason to remove this safety condition? @MathieuCarriere, maybe it is no longer relevant?

Author

I think @MathieuCarriere mentioned to me before that he introduced the safety condition so that the numpy.digitize function could be used in the np.all(self.gains < .5) case.

In the code I submitted, points are assigned to intervals using a 3d array comparison, so this condition is no longer needed.
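The broadcasting idea can be illustrated with a small sketch (variable names are mine, not the PR's): every point is compared against every interval at once, which handles overlapping covers directly, without numpy.digitize and hence without the gains < .5 restriction.

```python
import numpy as np

vals = np.array([0.1, 0.35, 0.8])   # filter values, one per point
left = np.array([0.0, 0.3, 0.6])    # interval left endpoints (overlapping cover)
right = np.array([0.4, 0.7, 1.0])   # interval right endpoints

# Shape (num_points, num_intervals): entry (i, j) is True iff point i
# falls in interval j; with overlap, a point may belong to several intervals.
mask = (vals[:, None] >= left[None, :]) & (vals[:, None] <= right[None, :])
print(mask.astype(int))
```

With several filter functions, the same comparison carried over an extra filter axis yields the 3d array mentioned above.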

@@ -349,6 +361,8 @@ def fit(self, X, y=None, filters=None, colors=None):
filter functions (sometimes called lenses) used to compute the cover. Each column of the numpy array defines a scalar function defined on the input points.
colors : list of lists or numpy array of shape (num_points) x (num_colors)
functions used to color the nodes of the cover complex. More specifically, coloring is done by computing the means of these functions on the subpopulations corresponding to each node. If None, first coordinate is used if input is point cloud, and eccentricity is used if input is distance matrix.
n_jobs : The maximum number of concurrently running jobs (default 1). See https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html for details.
backend : Specify the parallelization backend implementation (default 'loky'). See https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html for details.
Member

In the rest of Gudhi, I don't think we expose this argument; users can use the parallel_config context manager to change the default. Is it recommended practice to expose this?

@MathieuCarriere MathieuCarriere self-assigned this Mar 18, 2025
@ZiyadOulhaj
Author

I think the timings in the benchmark are dominated by Parallel pool creation overhead. In my experience, running MapperComplex.fit() with n_jobs>1 becomes more efficient than n_jobs=1 starting from the second call (with the same n_jobs). @mglisse mentioned that there is a parallel_config context manager. Maybe this can solve the issue? Can someone please tell me where to find it in the files and how it is used?
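The warm-up effect described above can be checked with a small timing sketch of my own (the default loky backend keeps its worker pool alive between calls, so only the first call with a given n_jobs pays the start-up cost):

```python
import time
from joblib import Parallel, delayed

def work(i):
    return i * i

def timed_run(n_jobs):
    t0 = time.perf_counter()
    res = Parallel(n_jobs=n_jobs)(delayed(work)(i) for i in range(100))
    return time.perf_counter() - t0, res

cold, res1 = timed_run(2)  # first call starts the worker processes
warm, res2 = timed_run(2)  # same n_jobs: the warm pool is reused
print(f"cold={cold:.3f}s warm={warm:.3f}s")  # warm is typically much smaller
```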

4 participants