Add standard deviation (std) to KMeans. by vinnik-dmitry07 · Pull Request #5013 · shogun-toolbox/shogun

vinnik-dmitry07 · 2020-04-21T17:01:32Z

Valgrind and the tests are OK.

vinnik-dmitry07 · 2020-04-21T17:54:02Z

Hmm..

vinnik-dmitry07 · 2020-04-21T19:04:25Z

It was because of uncommenting the OpenMP directives at the last minute...

vinnik-dmitry07 · 2020-04-21T20:33:49Z

Needs submodule merge.

karlnapf · 2020-04-25T10:57:03Z

src/shogun/clustering/KMeansBase.cpp

+    std::tie(cluster_assignments, weights_set, std::ignore) =
+        compute_cluster_assignments(k);
+
+    auto cluster_indexes = new SGVector<index_t>[k];


we usually allocate those on the stack

you could use std::vector if you want a vector of differently sized SGVectors
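A minimal sketch of the stack-allocation suggestion above. It uses plain `std::vector<int>` as a stand-in for `SGVector<index_t>` so that it compiles without Shogun, and the helper name `group_by_cluster` is hypothetical, not something from the PR:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: groups point indices by their assigned cluster.
// std::vector<int> stands in for SGVector<index_t>; the outer std::vector
// replaces the raw `new SGVector<index_t>[k]` array and frees itself
// when it goes out of scope.
std::vector<std::vector<int>> group_by_cluster(
    const std::vector<int>& assignments, int k)
{
    std::vector<std::vector<int>> cluster_indexes(k);
    for (std::size_t i = 0; i < assignments.size(); ++i)
        cluster_indexes[assignments[i]].push_back(static_cast<int>(i));
    return cluster_indexes;
}
```

Each inner vector grows to its own size, so no separate counter array is needed either.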

karlnapf · 2020-04-25T10:57:14Z

src/shogun/clustering/KMeansBase.cpp

+        compute_cluster_assignments(k);
+
+    auto cluster_indexes = new SGVector<index_t>[k];
+    auto cluster_counters = new index_t[k];


SGVector<index_t>(k)

karlnapf · 2020-04-25T10:58:35Z

src/shogun/clustering/KMeansBase.cpp

 	    &fixed_centers, "fixed_centers", "Whether to use fixed centers",
 	    ParameterProperties::HYPER | ParameterProperties::SETTING);
 	SG_ADD(&R, "radiuses", "Cluster radiuses", ParameterProperties::MODEL);
+	SG_ADD(&stds, "stds", "Cluster standard deviations", ParameterProperties::MODEL);


it is not really a model parameter right? Rather something that can be computed after having trained the algorithm

I made it similarly to R, will change it now.

karlnapf · 2020-04-25T10:58:58Z

src/shogun/clustering/KMeansBase.h

 		SGVector<float64_t> R;

+        /** Std of the clusters (size k) */
+        SGMatrix<float64_t> stds;


I'd prefer if we didnt store this in the class but rather compute it on demand by a user call

karlnapf · 2020-04-25T10:59:22Z

src/shogun/clustering/KMeansMiniBatch.cpp

-		SGVector<int32_t> M=mbchoose_rand(batch_size,XSize);
-		SGVector<int32_t> ncent=SGVector<int32_t>(batch_size);
-		for (int32_t j=0; j<batch_size; j++)
+		SGVector<int32_t> M = mbchoose_rand(batch_size, XSize);


It makes reviewing new code really hard if you have a lot of added whitespace changes ....

(because it is hard to see what you added)

I reverted all unnecessary changes.

thanks. Feel free to send the commit later once we merged this here. but not important

karlnapf · 2020-04-25T10:59:52Z

src/shogun/clustering/KMeansMiniBatch.cpp

 	initialize_training(data);
+	auto rhs_cache = distance->get_rhs();
 	minibatch_KMeans();
+	compute_stds();


remove and make user call

karlnapf

it's a cool feature to have :)

vinnik-dmitry07 · 2020-04-25T18:24:00Z

shogun-toolbox/shogun-data#194
Done. I can now reformat all these files if there is a need.

vinnik-dmitry07 · 2020-04-25T19:01:14Z

Sorry, some unit tests are not OK.

karlnapf · 2020-04-25T19:30:36Z

examples/meta/src/clustering/kmeans.sg.in

+#![extract_centers_radiuses_stds]
 RealMatrix c = kmeans.get_real_matrix("cluster_centers")
 RealVector r = kmeans.get_real_vector("radiuses")
+RealMatrix s = kmeans.get_real_matrix("stds")


maybe "std_dev"?

karlnapf · 2020-04-25T19:32:04Z

src/shogun/clustering/KMeansBase.cpp

 	}
 }

+SGMatrix<float64_t> KMeansBase::get_stds() const


maybe "compute_std_dev" (it is not a getter)

karlnapf · 2020-04-25T19:32:31Z

src/shogun/clustering/KMeansBase.h


+		/** get cluster standard deviations
+		 *
+		 * @return cluster centers or empty matrix if no radiuses are there (not trained yet)


if not trained yet, it should throw an error
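A rough illustration of failing loudly instead of returning an empty matrix. `ToyKMeans` is a hypothetical stand-in class and `std::runtime_error` is used only because the snippet avoids Shogun's own error helpers; the returned values are placeholders:

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a clustering model that may or may not be trained.
struct ToyKMeans
{
    std::vector<double> centers; // empty until train() has run

    void train() { centers = {0.0, 1.0}; }

    // Throws when called before training, rather than silently
    // handing back an empty result.
    std::vector<double> compute_std_dev() const
    {
        if (centers.empty())
            throw std::runtime_error("KMeans is not trained yet");
        // Placeholder: one zero per center, just to give the call a result.
        return std::vector<double>(centers.size(), 0.0);
    }
};
```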

karlnapf · 2020-04-25T19:33:15Z

src/shogun/clustering/KMeansBase.h

+		/** Matches points and clusters
+		* @param change_centers optional coroutine to change centers in
+		* Lloyd Kmeans
+		* @return A tuple of: \n


I think we dont need the newlines

karlnapf · 2020-04-25T19:34:46Z

src/shogun/clustering/KMeansBase.cpp

+	for (int32_t i = 0; i < k; ++i)
+	{
+		stds.set_column(
+		    i, rhs->copy_subset(cluster_indexes[i])


we don't want to copy things to compute std-dev. I think you could do this with a view, but might need some work to make the std call work

Check View.h

also maybe we can do the dynamic cast only once outside the loop?

karlnapf · 2020-04-25T21:21:28Z

src/shogun/clustering/KMeansBase.cpp

+
+	for (int32_t i = 0; i < k; ++i)
+	{
+		stds.set_column(i, view(rhs, cluster_indexes[i])->std());


yep this is what I had in mind, nice.
The only thing is, I just checked the std implementation, and it only produces valid results if no subset is present (it is missing an assert for that atm).
Could you add an if-else branch for the case where subsets are present, that iterates through the feature vectors one by one to compute the std? If possible, please iterate in a memory friendly order (sorting the subset indices first)

@gf712 didnt we have something like this automagically somewhere?

as this implementation here is wrong, this shows that we need some unit tests for the stds :)

I'd also prefer if the std data wouldn't be copied here, but the std() method would write it straight to the matrix. E.g. via first getting the column vector reference, and then passing that to std or something

We have an iterator for features, but the subset is not sorted as it took longer to qsort than the cheapest computations. I would say test with manual sorting and with iterator.

I thought about this a bit more. The solution here is actually not ideal. You go through all the data and identify the cluster centers corresponding to each point. Then you copy the data for each cluster and compute the std. While the copying can be avoided, we can turn the whole approach upside down to make it simpler (no need to compute this data structure you added) and more efficient: just compute the cluster for each training point (i.e. apply method or similar), then just iterate through the data once, and add the contribution for the std deviation of each point to the corresponding cluster std

clusters = kmeans.apply(features)  # vector with cluster inds
std_devs = SGMatrix(num_dims, num_clusters)  # init with zero
for (vec, c_idx) in zip(feats, clusters):
    mean = m_cluster_centres[c_idx]  # assumes this was computed and stored by "train"
    # this is naive (unstable), you want to use Welford's alg or similar,
    # see https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
    std_devs[:,c_idx] += (vec-mean)**2
std_devs /= (num_vec-1)
std_devs = linalg::sqrt(std_devs)

We are calculating std for each cluster thus we need to know which points belong to this cluster. We can store cluster assignments during Lloyd K-Means, but it is not so easy in the MiniBatch case.

shogun/src/shogun/clustering/KMeansMiniBatch.cpp

Line 80 in 63147ff

SGVector<float64_t> c_alive=rhs_mus->get_feature_vector(near);

The only way is to add here something like:

cluster_assignments[M[j]] = near;

-- But it will be executed max_iter * batch_size i.e. default 100 * 300 = 30000 times.

I think the best way is to make the std function accept a precomputed mean and do it through an iterating view because, actually, the list of cluster indexes is sorted (i.e. cluster_indexes[i]).

couldnt you run the apply function to get the cluster assignments of the training data? Or some helper function defined within?

for mini batch we would need to compute it on the fly I guess?

Thank you! I did not know about it. Now I can use apply in Lloyd and in Minibatch to get assignments by simply adding three lines of code. So extracting compute_cluster_assignments was a bad idea from the outset.
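The single-pass idea discussed in this thread could look roughly like the sketch below. It is written against plain `std::vector` rather than Shogun types, the function name `cluster_std_devs` and the `Matrix` alias are hypothetical, and it assumes the stored centers really are the per-cluster means. It is the naive accumulation, not Welford's algorithm:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>; // one inner vector per point/center

// Hypothetical helper: points[i] is the i-th data point, centers[c] the c-th
// cluster mean, assignments[i] the cluster index of point i (e.g. from apply).
// Returns, per cluster, the per-dimension sample standard deviation.
Matrix cluster_std_devs(
    const Matrix& points, const Matrix& centers,
    const std::vector<int>& assignments)
{
    const std::size_t dims = centers.empty() ? 0 : centers[0].size();
    Matrix sq_sums(centers.size(), std::vector<double>(dims, 0.0));
    std::vector<std::size_t> counts(centers.size(), 0);

    // Single pass over the data: add each point's squared deviation
    // to the accumulator of its own cluster.
    for (std::size_t i = 0; i < points.size(); ++i)
    {
        const auto c = static_cast<std::size_t>(assignments[i]);
        ++counts[c];
        for (std::size_t d = 0; d < dims; ++d)
        {
            const double diff = points[i][d] - centers[c][d];
            sq_sums[c][d] += diff * diff;
        }
    }

    // Normalise by the per-cluster count (n - 1) and take the square root.
    for (std::size_t c = 0; c < sq_sums.size(); ++c)
        for (double& s : sq_sums[c])
            s = counts[c] > 1 ? std::sqrt(s / (counts[c] - 1)) : 0.0;
    return sq_sums;
}
```

Note that the divisor is tracked per cluster here, since each cluster generally holds a different number of points.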

karlnapf · 2020-04-25T21:22:20Z

src/shogun/clustering/KMeansBase.cpp

+		stds.set_column(i, view(rhs, cluster_indexes[i])->std());
+	}
+
+    distance->replace_lhs(lhs_cache);


spaces vs tabs here

karlnapf · 2020-04-25T21:24:28Z

src/shogun/clustering/KMeansBase.cpp

+    {
+        const int32_t cluster_assignments_i=cluster_assignments[i];
+        int32_t min_cluster, j;
+        float64_t min_dist, dist;


could you do this in two lines and init them with the values below straight away? less error prone

gf712 · 2020-04-30T15:55:39Z

src/shogun/clustering/KMeansBase.cpp

+	SGMatrix<float64_t> points = distance->get_rhs()
+	                                 ->as<DenseFeatures<float64_t>>()
+	                                 ->get_feature_matrix();
+	SGVector<float64_t> cluster_assignments = const_cast<KMeansBase*>(this)


please no const_cast. @karlnapf this is probably more of an indication that we need to redesign KMeans no?

probably, but can't we avoid the cast otherwise?

I think we have to either make this function non const or make apply const

Making it non-const breaks watch_method. I tried to make apply const, but that causes cascading changes in the derived classes of DistanceMachine. I was not sure that these changes would not break something, so I decided to use the most obvious solution.

What do you mean it breaks? I think it should work...

run is void, no?

I don’t think so. Check in SGObject

Hmm, you’re right. Not sure then :D

void run(std::string_view name) const noexcept(false)
Did I miss something?

no, I just got confused, sorry! what @karlnapf suggested should work though! :)

gf712 · 2020-04-30T15:56:32Z

src/shogun/clustering/KMeansBase.cpp

+
+	linalg::scale(squares_sums, squares_sums, 1. / (points.num_cols - 1));
+	for (float64_t& x : squares_sums)
+		x = std::sqrt(x);


is there no linalg for this? Maybe rewrite as std::transform, to be more idiomatic

++ for linalg... could send a separate pr for adding an elementwise sqrt. We had a few elementwise operations added recently, you could check those

Not sure if there is power function and could just have exponent 0.5?

#5016 (comment)

thanks for that other PR
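For reference, the `std::transform` variant mentioned above might look like this. `std::vector<double>` stands in for `SGVector<float64_t>`, and `sqrt_inplace` is a hypothetical name, not an existing linalg function:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Replaces every element with its square root, in place.
// std::vector<double> is a stand-in for SGVector<float64_t>.
void sqrt_inplace(std::vector<double>& values)
{
    std::transform(
        values.begin(), values.end(), values.begin(),
        [](double x) { return std::sqrt(x); });
}
```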

gf712 · 2020-04-30T15:58:41Z

src/shogun/clustering/KMeansBase.cpp

+
+		count += 1;
+		auto delta1 = linalg::add(point, mean, 1., -1.);
+		linalg::add(mean, linalg::scale(delta1, 1. / count), mean);


Can we compute the mean differently @karlnapf? The issue is just that you are doing a division in a loop.. Is there no geometric series, or something else, that could be cheaper to compute?

this is from https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

@vinnik-dmitry07 instead of implementing this online algorithm in here, could you move this somewhere else, so that it is also usable from other parts of the code? I envision an updater being instantiated and then repeatedly being called, a bit like an iterator
@gf712 any ideas where would be best and what the API would look like?

Sure, we would just have a class with an update(datapoint) public member function. Then can have some derived classes for mean and variance (unless you want to combine them in a single class). Not sure what other algos would be useful to have online update for?

yes definitely useful! The class should support both mean and variance but each should be optional (one may want one without the other)

gf712 · 2020-04-30T15:59:27Z

src/shogun/clustering/KMeansBase.cpp

+
+	for (int32_t point_number : range(cluster_assignments.vlen))
+	{
+		auto cluster_number = (int32_t) cluster_assignments[point_number];


could you use an explicit static_cast please? Avoid using C style cast to avoid any doubt of what you're doing
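A small sketch of the requested change, with a hypothetical helper `to_cluster_index` and a plain `double` standing in for `float64_t`:

```cpp
#include <cstdint>

// Converts a cluster label stored as a floating-point value to an index.
// static_cast makes the narrowing conversion explicit and easy to search
// for, unlike the C-style (int32_t) cast.
inline std::int32_t to_cluster_index(double label)
{
    return static_cast<std::int32_t>(label);
}
```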

gf712 · 2020-04-30T15:59:44Z

src/shogun/clustering/KMeansMiniBatch.cpp

 	compute_cluster_variances();
+    auto cluster_centres =
+        std::make_shared<DenseFeatures<float64_t>>(cluster_centers);
+    distance->replace_lhs(cluster_centres);


indentation

gf712 · 2020-04-30T16:02:15Z

src/shogun/clustering/KMeansBase.cpp

+	for (int32_t point_number : range(cluster_assignments.vlen))
+	{
+		auto cluster_number = (int32_t) cluster_assignments[point_number];
+		auto point = points.get_column(point_number);


const auto&

DREAMCATCHERVIBE · 2020-05-04T05:52:13Z

Hi

stale · 2020-10-31T07:10:27Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2020-11-07T11:39:32Z

This issue is now being closed due to a lack of activity. Feel free to reopen it.

karlnapf reviewed Apr 25, 2020

vinnik-dmitry07 requested a review from karlnapf April 25, 2020 18:24

vinnik-dmitry07 force-pushed the kmeans-add-stds branch from 1169b00 to 3ef2426 Compare April 25, 2020 19:02

karlnapf reviewed Apr 25, 2020

vinnik-dmitry07 requested a review from karlnapf April 25, 2020 21:04

karlnapf reviewed Apr 25, 2020

vinnik-dmitry07 force-pushed the kmeans-add-stds branch from 63147ff to fc6fdb5 Compare April 30, 2020 13:41

vinnik-dmitry07 requested review from gf712 and karlnapf April 30, 2020 13:46

vinnik-dmitry07 force-pushed the kmeans-add-stds branch from 78d1071 to 5b5bac3 Compare April 30, 2020 14:32

Add std.

962cb88

vinnik-dmitry07 force-pushed the kmeans-add-stds branch from 5b5bac3 to 962cb88 Compare April 30, 2020 14:59

gf712 reviewed Apr 30, 2020

vinnik-dmitry07 added 2 commits April 30, 2020 20:27

Merge branch 'develop' of https://github.com/shogun-toolbox/shogun in…

b5cb4e1

…to kmeans-add-stds

fixup

cdb4bd9

stale bot added the stale label Oct 31, 2020

stale bot closed this Nov 7, 2020
