Add standard deviation (std) to KMeans.#5013
Add standard deviation (std) to KMeans.#5013vinnik-dmitry07 wants to merge 3 commits intoshogun-toolbox:developfrom
Conversation
|
Hmm.. |
|
It was becasue of last-time uncommenting of the OpenMP directives... |
src/shogun/clustering/KMeansBase.cpp
Outdated
| std::tie(cluster_assignments, weights_set, std::ignore) = | ||
| compute_cluster_assignments(k); | ||
|
|
||
| auto cluster_indexes = new SGVector<index_t>[k]; |
There was a problem hiding this comment.
we usually allocate those on the stack
There was a problem hiding this comment.
you could use std::vector if you want a vector of differently sized SGVectors
src/shogun/clustering/KMeansBase.cpp
Outdated
| compute_cluster_assignments(k); | ||
|
|
||
| auto cluster_indexes = new SGVector<index_t>[k]; | ||
| auto cluster_counters = new index_t[k]; |
src/shogun/clustering/KMeansBase.cpp
Outdated
| &fixed_centers, "fixed_centers", "Whether to use fixed centers", | ||
| ParameterProperties::HYPER | ParameterProperties::SETTING); | ||
| SG_ADD(&R, "radiuses", "Cluster radiuses", ParameterProperties::MODEL); | ||
| SG_ADD(&stds, "stds", "Cluster standard deviations", ParameterProperties::MODEL); |
There was a problem hiding this comment.
it is not really a model parameter right? Rather something that can be computed after having trained the algorithm
There was a problem hiding this comment.
I made it similarly to R, will change it now.
src/shogun/clustering/KMeansBase.h
Outdated
| SGVector<float64_t> R; | ||
|
|
||
| /** Std of the clusters (size k) */ | ||
| SGMatrix<float64_t> stds; |
There was a problem hiding this comment.
I'd prefer if we didnt store this in the class but rather compute it on demand by a user call
| SGVector<int32_t> M=mbchoose_rand(batch_size,XSize); | ||
| SGVector<int32_t> ncent=SGVector<int32_t>(batch_size); | ||
| for (int32_t j=0; j<batch_size; j++) | ||
| SGVector<int32_t> M = mbchoose_rand(batch_size, XSize); |
There was a problem hiding this comment.
It makes reviewing new code really hard if you have a lot of added whitespace changes ....
There was a problem hiding this comment.
(because it is hard to see what you added)
There was a problem hiding this comment.
I reverted all unnecessary changes.
There was a problem hiding this comment.
thanks. Feel free to send the commit later once we merged this here. but not important
| initialize_training(data); | ||
| auto rhs_cache = distance->get_rhs(); | ||
| minibatch_KMeans(); | ||
| compute_stds(); |
karlnapf
left a comment
There was a problem hiding this comment.
it's a cool feature to have :)
|
shogun-toolbox/shogun-data#194 |
|
Sorry some unit tests is not ok. |
1169b00 to
3ef2426
Compare
| #![extract_centers_radiuses_stds] | ||
| RealMatrix c = kmeans.get_real_matrix("cluster_centers") | ||
| RealVector r = kmeans.get_real_vector("radiuses") | ||
| RealMatrix s = kmeans.get_real_matrix("stds") |
src/shogun/clustering/KMeansBase.cpp
Outdated
| } | ||
| } | ||
|
|
||
| SGMatrix<float64_t> KMeansBase::get_stds() const |
There was a problem hiding this comment.
maybe "compute_std_dev" (it is not a getter)
src/shogun/clustering/KMeansBase.h
Outdated
|
|
||
| /** get cluster standard deviations | ||
| * | ||
| * @return cluster centers or empty matrix if no radiuses are there (not trained yet) |
There was a problem hiding this comment.
if not trained yet, it should throw an error
src/shogun/clustering/KMeansBase.h
Outdated
| /** Matches points and clusters | ||
| * @param change_centers optional coroutine to change centers in | ||
| * Lloyd Kmeans | ||
| * @return A tuple of: \n |
There was a problem hiding this comment.
I think we dont need the newlines
src/shogun/clustering/KMeansBase.cpp
Outdated
| for (int32_t i = 0; i < k; ++i) | ||
| { | ||
| stds.set_column( | ||
| i, rhs->copy_subset(cluster_indexes[i]) |
There was a problem hiding this comment.
we don't want to copy things to compute std-dev. I think you could do this with a view, but might need some work to make the std call work
There was a problem hiding this comment.
also maybe we can do the dynamic cast only once outside the loop?
src/shogun/clustering/KMeansBase.cpp
Outdated
|
|
||
| for (int32_t i = 0; i < k; ++i) | ||
| { | ||
| stds.set_column(i, view(rhs, cluster_indexes[i])->std()); |
There was a problem hiding this comment.
yep this is what I had in mind, nice.
The only thing is, I just checked the std implementation, and it only produces valid results if no subset is present (it is missing an assert for that atm).
Could you add an if-else branch for the case where subsets are present, that iterates through the feature vectors one by one to compute the std? If possible, please iterate in a memory friendly order (sorting the subset indices first)
@gf712 didnt we have something like this automagically somewhere?
There was a problem hiding this comment.
as this implementation here is wrong, this shows that we need some unit tests for the stds :)
There was a problem hiding this comment.
I'd also prefer if the std data wouldnt be copied here, but the std() method would write it straight to the matrix. E.g. via first getting the colunm vector reference, and then passing that to std or something
There was a problem hiding this comment.
We have an iterator for features, but the subset is not sorted as it took longer to qsort than the cheapest computations. I would say test with manual sorting and with iterator.
There was a problem hiding this comment.
I thought about this a bit more. The solution here is actually not ideal. You go through all the data and identify the cluster centers corresponding to each point. Then you copy the data for each cluster and compute the std. While the copying can be avoided, we can turn the whole approach upside down to make it simpler (no need to compute this data structure you added) and more efficient: just compute the cluster for each training point (i.e. apply method or similar), then just iterate through the data once, and add the contribution for the std deviation of each point to the corresponding cluster std
clusters = kmeans.apply(features) # vector with cluster inds
std_devs = SGMatrix(num_dims, num_data) # init with zero
for (vec, c_idx) in zip(feats, clusters):
mean = m_cluster_centres[c_idx] # assumes this was computed and stored by "train"
std_devs[:,c] += (vec-mean)**2 # this is naive (unstable), you want to use Welford's alg or similar, see [here](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)
std_devs /= (num_vec-1)
std_devs = linalg::sqrt(std_devs)
There was a problem hiding this comment.
We are calculating std for each cluster thus we need to know which points belong to this cluster. We can store cluster assignments during Lloyd K-Means, but it is not so easy in the MiniBatch case.
There was a problem hiding this comment.
The only way is to add here something like:
cluster_assginments[M[j]] = near;
-- But it will be executed max_iter * batch_size i.e. default 100 * 300 = 30000 times.
I think the best way is to make std function to accept precomputed mean and do it through an iterating view because, actually, the list of cluster indexes is sorted (i.e. cluster_indexes[i]).
There was a problem hiding this comment.
couldnt you run the apply function to get the cluster assignments of the training data? Or some helper function defined within?
There was a problem hiding this comment.
for mini batch we would need to compute it on the fly I guess?
There was a problem hiding this comment.
Thank you! I did not know about it. Now I can use apply in Lloyd and in Minibatch to get assignments simply adding three lines of code. So extracting compute_cluster_assignments was a bad idea from the outset.
src/shogun/clustering/KMeansBase.cpp
Outdated
| stds.set_column(i, view(rhs, cluster_indexes[i])->std()); | ||
| } | ||
|
|
||
| distance->replace_lhs(lhs_cache); |
src/shogun/clustering/KMeansBase.cpp
Outdated
| { | ||
| const int32_t cluster_assignments_i=cluster_assignments[i]; | ||
| int32_t min_cluster, j; | ||
| float64_t min_dist, dist; |
There was a problem hiding this comment.
could you do this in two lines and init them with the values below straight away? less error prone
63147ff to
fc6fdb5
Compare
78d1071 to
5b5bac3
Compare
5b5bac3 to
962cb88
Compare
| SGMatrix<float64_t> points = distance->get_rhs() | ||
| ->as<DenseFeatures<float64_t>>() | ||
| ->get_feature_matrix(); | ||
| SGVector<float64_t> cluster_assignments = const_cast<KMeansBase*>(this) |
There was a problem hiding this comment.
please no const_cast. @karlnapf this is probably more of an indication that we need to redesign KMeans no?
There was a problem hiding this comment.
probably, but can't we avoid the cast otherwise?
There was a problem hiding this comment.
I think we have to either make this function non const or make apply const
There was a problem hiding this comment.
Making it nonconst breaks watch_method. I tried to make apply const but it causes continuous changes in the derived classes of DistanceMachine. I was not sure that these changes will not bring something bad and decided to use the most obvious solution.
There was a problem hiding this comment.
What do you mean it breaks? I think it should work...
There was a problem hiding this comment.
run is void, no?
There was a problem hiding this comment.
I don’t think so. Check in SGObject
There was a problem hiding this comment.
Hmm, you’re right. Not sure then :D
There was a problem hiding this comment.
void run(std::string_view name) const noexcept(false)
Did I miss something?
There was a problem hiding this comment.
no, I just got confused, sorry! what @karlnapf suggested should work though! :)
|
|
||
| linalg::scale(squares_sums, squares_sums, 1. / (points.num_cols - 1)); | ||
| for (float64_t& x : squares_sums) | ||
| x = std::sqrt(x); |
There was a problem hiding this comment.
is there no linalg for this? Maybe rewrite as std::transform, to be more idiomatic
There was a problem hiding this comment.
++ for linalg... could send a separate pr for adding an elementwise sqrt. We had a few elementwise operations added recently, you could check those
There was a problem hiding this comment.
Not sure if there is power function and could just have exponent 0.5?
There was a problem hiding this comment.
|
|
||
| count += 1; | ||
| auto delta1 = linalg::add(point, mean, 1., -1.); | ||
| linalg::add(mean, linalg::scale(delta1, 1. / count), mean); |
There was a problem hiding this comment.
Can we compute the mean differently @karlnapf? The issue is just that you are doing a division in a loop.. Is there no geometric series, or something else, that could be cheaper to compute?
There was a problem hiding this comment.
this is from https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
@vinnik-dmitry07 instead of implementing this online algorithm in here, could you move this somewhere else, so that it is also usable from other parts of the code? I envision an updater being instantiated and then repeatedly being called, a bit like an iterator
@gf712 any ideas where would be best and what the API would look like?
There was a problem hiding this comment.
Sure, we would just have a class with a update(datapoint) public class function. Then can have some derived classes for mean and variance (unless you want to combine them in a single class). Not sure what other algos would be useful to have online update for?
There was a problem hiding this comment.
yes definitely useful! The class should support both mean and variance but each should be optional (one wants one without the other one)
src/shogun/clustering/KMeansBase.cpp
Outdated
|
|
||
| for (int32_t point_number : range(cluster_assignments.vlen)) | ||
| { | ||
| auto cluster_number = (int32_t) cluster_assignments[point_number]; |
There was a problem hiding this comment.
could you use an explicit static_cast please? Avoid using C style cast to avoid any doubt of what you're doing
| compute_cluster_variances(); | ||
| auto cluster_centres = | ||
| std::make_shared<DenseFeatures<float64_t>>(cluster_centers); | ||
| distance->replace_lhs(cluster_centres); |
src/shogun/clustering/KMeansBase.cpp
Outdated
| for (int32_t point_number : range(cluster_assignments.vlen)) | ||
| { | ||
| auto cluster_number = (int32_t) cluster_assignments[point_number]; | ||
| auto point = points.get_column(point_number); |
…to kmeans-add-stds
|
Hi |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
This issue is now being closed due to a lack of activity. Feel free to reopen it. |

Valgrind and the tests is OK.