Parallel Processing for Mapper #1189
Conversation
I also ran some benchmarks with this python script and this human.txt dataset, and I am quite surprised by the timings:
n_jobs=1
0:00:04.366957
0:00:41.875797
0:00:05.119477
n_jobs=2
0:00:04.908189
0:00:41.133749
0:00:05.089210
n_jobs=3
0:00:04.968691
0:00:40.905412
0:00:04.944852
n_jobs=4
0:00:05.191212
0:00:41.379255
0:00:04.892046
@@ -410,109 +424,76 @@ def fit(self, X, y=None, filters=None, colors=None):
# Initialize attributes
self.simplex_tree_, self.node_info_ = SimplexTree(), {}

if np.all(self.gains < .5):
Any reason to remove this safety condition? @MathieuCarriere, maybe it is no longer relevant?
I think @MathieuCarriere mentioned to me before that he introduced the safety condition so that the numpy.digitize function could be used in the np.all(self.gains < .5) case.
In the code I submitted, points are assigned to intervals using a 3d array comparison, so this condition is no longer needed.
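To illustrate the idea (a simplified 1-filter sketch with hypothetical interval endpoints, not the actual MapperComplex code, which uses a 3d array over several filters): a broadcasting comparison assigns each point to every interval that contains it, which handles overlapping intervals directly, whereas np.digitize only bins points into non-overlapping edges.

```python
import numpy as np

# Hypothetical cover of [0, 1] by 4 overlapping intervals,
# given by their left and right endpoints.
lefts  = np.array([0.0, 0.2, 0.4, 0.6])
rights = np.array([0.4, 0.6, 0.8, 1.0])

# Filter values of 5 points.
vals = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Broadcasting comparison: membership[i, j] is True iff point i lies in
# interval j. A point can belong to several intervals at once, so no
# restriction on the gains (such as gains < .5) is required.
membership = (vals[:, None] >= lefts[None, :]) & (vals[:, None] <= rights[None, :])

print(membership.astype(int))
```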
@@ -349,6 +361,8 @@ def fit(self, X, y=None, filters=None, colors=None):
filter functions (sometimes called lenses) used to compute the cover. Each column of the numpy array defines a scalar function on the input points.
colors : list of lists or numpy array of shape (num_points) x (num_colors)
functions used to color the nodes of the cover complex. More specifically, coloring is done by computing the means of these functions on the subpopulations corresponding to each node. If None, the first coordinate is used if the input is a point cloud, and eccentricity is used if the input is a distance matrix.
n_jobs : the maximum number of concurrently running jobs (default 1). See https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html for details.
backend : the parallelization backend implementation (default 'loky'). See https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html for details.
In the rest of Gudhi, I don't think we expose this argument; users can use the parallel_config context manager to change the default. Is exposing it recommended practice?
I think the timings in the benchmark reflect the overhead of creating the Parallel worker pool. In my experience, running MapperComplex.fit() with n_jobs > 1 only becomes more efficient than n_jobs=1 from the second call onward (for the same n_jobs). @mglisse mentioned that there is a parallel_config context manager; maybe that can solve this issue? Can someone please tell me where to find it in the files and how it is used?
Added parallel processing (using joblib) to the fit method of MapperComplex.
The user can pass the maximum number of concurrently running jobs (n_jobs) and the backend implementation (backend) as arguments to fit().
Additionally, data points are now assigned to patches with an array comparison instead of a for loop.
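The parallelization pattern described above can be sketched as follows (hypothetical `process_patch` helper and toy data; the actual per-patch work in MapperComplex is different, e.g. clustering each subpopulation):

```python
from joblib import Parallel, delayed

def process_patch(patch_points):
    # Placeholder for the per-patch work done inside fit().
    return len(patch_points)

# Toy cover: three patches of point indices.
patches = [[1, 2], [3], [4, 5, 6]]

# n_jobs and backend are forwarded to joblib.Parallel, mirroring the
# fit(n_jobs=..., backend=...) arguments added in this PR; 'threading'
# is used here only to keep the sketch picklable-free and portable.
results = Parallel(n_jobs=2, backend="threading")(
    delayed(process_patch)(p) for p in patches
)
print(results)  # [2, 1, 3]
```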