multithreading center init

Migrated from https://github.com/BaylorCS/baylorml/issues/2 (@BenjaminHorn)

I have a fairly big dataset (100m \* 10), and by my estimate it would take around 8 hours to initialise the centers with init_centers_kmeanspp_v2. After some testing I realised that:
- only one core does the work
- most of the time is spent in this loop: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187

I have to admit I don't know much about multithreaded programming, but I think the loop could be split into one index range per thread so that the ranges run in parallel.

``` c++
double sumDistribution(int from, int to, Dataset const &x, pair<double, int> *dist2)
{
    double sum_distribution = 0.0;
    // here comes the loop, restricted to the indices in [from, to)
    return sum_distribution;
}
```

But those functions, running in parallel, would all have to read from the same dist2 array and x. Maybe this is why one pass of the loop per cluster takes 5-6 s, and why it can't be run in parallel and sped up.
Before I start digging into the topic, I just wanted to ask your opinion.
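
A minimal sketch of the splitting idea, under stated assumptions: it does not reproduce the actual loop in general_functions.cpp. As a placeholder, each range simply sums `dist2[i].first`, `dist2` is a `std::vector` instead of a raw pointer, the `Dataset` argument is left out, and `parallelSumDistribution` is a hypothetical name. Each thread reads the shared array but writes only to its own slot in `partial`. Concurrent reads of data that no thread modifies are safe in C++, so sharing `dist2` and `x` for reading does not by itself prevent parallelism.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Placeholder for the per-range work (assumption): sums dist2[i].first
// over the half-open index range [from, to).
double sumDistribution(std::size_t from, std::size_t to,
                       std::vector<std::pair<double, int>> const &dist2)
{
    double sum = 0.0;
    for (std::size_t i = from; i < to; ++i) {
        sum += dist2[i].first;
    }
    return sum;
}

// Hypothetical driver: splits [0, n) into numThreads contiguous chunks,
// runs one thread per chunk, then adds up the partial sums.
double parallelSumDistribution(std::vector<std::pair<double, int>> const &dist2,
                               unsigned numThreads)
{
    std::size_t const n = dist2.size();
    numThreads = std::max(1u, numThreads);

    // One result slot per thread, so no two threads write the same element.
    std::vector<double> partial(numThreads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(numThreads);

    // Chunk size rounded up so that the chunks cover every index.
    std::size_t const chunk = (n + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        std::size_t const from = std::min(n, t * chunk);
        std::size_t const to = std::min(n, from + chunk);
        workers.emplace_back([&partial, &dist2, t, from, to] {
            partial[t] = sumDistribution(from, to, dist2);
        });
    }

    // Wait for all workers before reading their results.
    for (auto &w : workers) {
        w.join();
    }
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

A call such as `parallelSumDistribution(dist2, std::thread::hardware_concurrency())` would use one thread per reported core. If the real loop also updates `dist2[i]`, that should still be safe as long as each thread only touches indices in its own range, but this would need to be checked against the actual code. Floating-point addition is not associative, so the parallel total may differ from the serial one in the last few bits.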

Another thing:
why is https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L198 necessary?

``` c++
            if (dist2[i].first > max_dist) {
                max_dist = dist2[i].first;
            }
```

As far as I can see, max_dist isn't used anywhere.

