Parallel Processing for Mapper #1189


Open · wants to merge 2 commits into master
Conversation

ZiyadOulhaj

Added parallel processing (using joblib) for the fit method of MapperComplex.

The user can give the maximum number of concurrently running jobs (n_jobs) and the backend implementation (backend) as arguments to fit().

Additionally, data points are assigned to patches with array comparison instead of a for loop.
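The joblib pattern described above can be sketched roughly as follows; `cluster_patch`, `fit_patches`, and the toy data are illustrative stand-ins of mine, not the PR's actual code:

```python
# Illustrative sketch (not the PR's actual code): parallelizing per-patch
# work in a Mapper-style fit with joblib. `cluster_patch` stands in for
# whatever computation is run on each cover patch.
import numpy as np
from joblib import Parallel, delayed

def cluster_patch(X, idx):
    # Placeholder per-patch computation: here, just the patch mean.
    return X[idx].mean(axis=0)

def fit_patches(X, patches, n_jobs=1, backend="loky"):
    # n_jobs and backend are forwarded straight to joblib.Parallel,
    # mirroring the two new fit() arguments described above.
    return Parallel(n_jobs=n_jobs, backend=backend)(
        delayed(cluster_patch)(X, idx) for idx in patches
    )

X = np.arange(12.0).reshape(6, 2)
patches = [np.array([0, 1, 2]), np.array([3, 4, 5])]
# threading backend here only to keep the toy example lightweight
print(fit_patches(X, patches, n_jobs=2, backend="threading"))
```

One job is dispatched per patch, so the work distributes evenly when the patches have similar sizes.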

Contributor

@VincentRouvreau VincentRouvreau left a comment


I also ran some benchmarks with this Python script and this human.txt dataset, and I am quite surprised by the timings:

n_jobs=1
0:00:04.366957
0:00:41.875797
0:00:05.119477
n_jobs=2
0:00:04.908189
0:00:41.133749
0:00:05.089210
n_jobs=3
0:00:04.968691
0:00:40.905412
0:00:04.944852
n_jobs=4
0:00:05.191212
0:00:41.379255
0:00:04.892046

@@ -410,109 +424,76 @@ def fit(self, X, y=None, filters=None, colors=None):
# Initialize attributes
self.simplex_tree_, self.node_info_ = SimplexTree(), {}

if np.all(self.gains < .5):
Contributor

Any reason to remove this safety condition? @MathieuCarriere, maybe it is no longer relevant?

Author

I think @MathieuCarriere mentioned to me before that he introduced the safety condition so that the numpy.digitize function could be used in the np.all(self.gains < .5) case.

In the code I submitted, points are assigned to intervals using a 3d array comparison, so this condition is no longer needed.
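The broadcasting idea can be illustrated with a small sketch (variable names are mine, not the PR's): every point is compared against every interval at once, which handles overlapping covers directly, without numpy.digitize and hence without the gains < .5 restriction.

```python
import numpy as np

vals = np.array([0.1, 0.35, 0.8])   # filter values, one per point
left = np.array([0.0, 0.3, 0.6])    # interval left endpoints (overlapping cover)
right = np.array([0.4, 0.7, 1.0])   # interval right endpoints

# Shape (num_points, num_intervals): entry (i, j) is True iff point i
# falls in interval j; with overlap, a point may belong to several intervals.
mask = (vals[:, None] >= left[None, :]) & (vals[:, None] <= right[None, :])
print(mask.astype(int))
```

With several filter functions, the same comparison carried over an extra filter axis yields the 3d array mentioned above.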

@@ -349,6 +361,8 @@ def fit(self, X, y=None, filters=None, colors=None):
filter functions (sometimes called lenses) used to compute the cover. Each column of the numpy array defines a scalar function defined on the input points.
colors : list of lists or numpy array of shape (num_points) x (num_colors)
functions used to color the nodes of the cover complex. More specifically, coloring is done by computing the means of these functions on the subpopulations corresponding to each node. If None, first coordinate is used if input is point cloud, and eccentricity is used if input is distance matrix.
n_jobs : The maximum number of concurrently running jobs (default 1). See https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html for details.
backend : Specify the parallelization backend implementation (default 'loky'). See https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html for details.
Member

In the rest of Gudhi, I don't think we expose this argument; users can use the parallel_config context manager to change the default. Is it recommended practice to expose this?

@MathieuCarriere MathieuCarriere self-assigned this Mar 18, 2025
@ZiyadOulhaj
Author

I think the timings in the benchmark are dominated by Parallel pool creation overhead. In my experience, running MapperComplex.fit() with n_jobs>1 becomes more efficient than n_jobs=1 starting from the second call (with the same n_jobs). @mglisse mentioned that there is a parallel_config context manager. Maybe this can solve the issue? Can someone please tell me where to find it in the files and how it is used?
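The warm-up effect described above can be checked with a small timing sketch of my own (the default loky backend keeps its worker pool alive between calls, so only the first call with a given n_jobs pays the start-up cost):

```python
import time
from joblib import Parallel, delayed

def work(i):
    return i * i

def timed_run(n_jobs):
    t0 = time.perf_counter()
    res = Parallel(n_jobs=n_jobs)(delayed(work)(i) for i in range(100))
    return time.perf_counter() - t0, res

cold, res1 = timed_run(2)  # first call starts the worker processes
warm, res2 = timed_run(2)  # same n_jobs: the warm pool is reused
print(f"cold={cold:.3f}s warm={warm:.3f}s")  # warm is typically much smaller
```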

4 participants