Migrated from BaylorCS/baylorml#2 (@BenjaminHorn)
I have a fairly large dataset (100M x 10), and by my estimate it would take around 8 hours to initialize the centers with init_centers_kmeanspp_v2. After some testing I realized that:
- only one core does the work
- most of the time is spent in this loop: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187

I have to admit I don't know much about multithreaded programming, but I think the loop could be split across a number of threads so that it runs in parallel:
float sumDistribution(int from, int to, Dataset const &x, pair<double, int> *dist2)
{
    // here comes the loop
    return sum_distribution;
}

But those parallel-running functions would all have to read from the same dist2 array and from x. Maybe that is why one pass over the clusters takes 5-6 s, and why it can't be parallelized to speed things up.
Before I start to dig into the topic, I just wanted to ask your opinion.
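For what it's worth, here is a minimal sketch of the splitting idea described above, using std::async rather than whatever threading facility the library might actually adopt. The DistEntry type, chunk sizes, and parallelSumDistribution name are my own assumptions, not code from baylorml; the point is only that concurrent read-only access to a shared dist2 array is safe, since no thread writes to it, so the partial sums can be computed independently and then reduced.

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for the dist2 array in general_functions.cpp:
    // each entry holds (squared distance to the nearest chosen center, point index).
    using DistEntry = std::pair<double, int>;

    // Sum dist2[from..to) -- the role played by the hot loop in
    // init_centers_kmeanspp_v2. Reading the shared array concurrently is
    // safe because every thread only reads, never writes.
    static double sumDistribution(std::size_t from, std::size_t to,
                                  const std::vector<DistEntry> &dist2) {
        double sum = 0.0;
        for (std::size_t i = from; i < to; ++i)
            sum += dist2[i].first;
        return sum;
    }

    // Split the index range into roughly equal chunks, run each chunk on its
    // own task, then reduce the partial sums on the calling thread.
    double parallelSumDistribution(const std::vector<DistEntry> &dist2,
                                   unsigned numThreads) {
        const std::size_t n = dist2.size();
        const std::size_t chunk = (n + numThreads - 1) / numThreads;
        std::vector<std::future<double>> parts;
        for (std::size_t from = 0; from < n; from += chunk) {
            const std::size_t to = std::min(n, from + chunk);
            parts.push_back(std::async(std::launch::async, sumDistribution,
                                       from, to, std::cref(dist2)));
        }
        double total = 0.0;
        for (auto &p : parts)
            total += p.get();
        return total;
    }

An OpenMP `parallel for` with a `reduction(+:sum)` clause would express the same thing more compactly, if the build already enables OpenMP.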
One other thing: why is https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L198 necessary?
if (dist2[i].first > max_dist) {
    max_dist = dist2[i].first;
}

As far as I can see, max_dist isn't used anywhere afterwards.