Skip to content

Add standard deviation (std) to KMeans.#5013

Closed
vinnik-dmitry07 wants to merge 3 commits intoshogun-toolbox:developfrom
vinnik-dmitry07:kmeans-add-stds
Closed

Add standard deviation (std) to KMeans.#5013
vinnik-dmitry07 wants to merge 3 commits intoshogun-toolbox:developfrom
vinnik-dmitry07:kmeans-add-stds

Conversation

@vinnik-dmitry07
Copy link
Contributor

Valgrind and the tests is OK.

@vinnik-dmitry07
Copy link
Contributor Author

Hmm..

@vinnik-dmitry07
Copy link
Contributor Author

It was becasue of last-time uncommenting of the OpenMP directives...

@vinnik-dmitry07
Copy link
Contributor Author

Needs submodule merge.
image

std::tie(cluster_assignments, weights_set, std::ignore) =
compute_cluster_assignments(k);

auto cluster_indexes = new SGVector<index_t>[k];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually allocate those on the stack

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use std::vector if you want a vector of differently sized SGVectors

compute_cluster_assignments(k);

auto cluster_indexes = new SGVector<index_t>[k];
auto cluster_counters = new index_t[k];
Copy link
Member

@karlnapf karlnapf Apr 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGVector<index_t>(k)

&fixed_centers, "fixed_centers", "Whether to use fixed centers",
ParameterProperties::HYPER | ParameterProperties::SETTING);
SG_ADD(&R, "radiuses", "Cluster radiuses", ParameterProperties::MODEL);
SG_ADD(&stds, "stds", "Cluster standard deviations", ParameterProperties::MODEL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not really a model parameter right? Rather something that can be computed after having trained the algorithm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it similarly to R, will change it now.

SGVector<float64_t> R;

/** Std of the clusters (size k) */
SGMatrix<float64_t> stds;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if we didnt store this in the class but rather compute it on demand by a user call

SGVector<int32_t> M=mbchoose_rand(batch_size,XSize);
SGVector<int32_t> ncent=SGVector<int32_t>(batch_size);
for (int32_t j=0; j<batch_size; j++)
SGVector<int32_t> M = mbchoose_rand(batch_size, XSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes reviewing new code really hard if you have a lot of added whitespace changes ....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(because it is hard to see what you added)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted all unnecessary changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. Feel free to send the commit later once we merged this here. but not important

initialize_training(data);
auto rhs_cache = distance->get_rhs();
minibatch_KMeans();
compute_stds();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove and make user call

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a cool feature to have :)

@vinnik-dmitry07
Copy link
Contributor Author

shogun-toolbox/shogun-data#194
Done. I can now reformat all this files if there is a need.

@vinnik-dmitry07
Copy link
Contributor Author

Sorry some unit tests is not ok.

#![extract_centers_radiuses_stds]
RealMatrix c = kmeans.get_real_matrix("cluster_centers")
RealVector r = kmeans.get_real_vector("radiuses")
RealMatrix s = kmeans.get_real_matrix("stds")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "std_dev"?

}
}

SGMatrix<float64_t> KMeansBase::get_stds() const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "compute_std_dev" (it is not a getter)


/** get cluster standard deviations
*
* @return cluster centers or empty matrix if no radiuses are there (not trained yet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not trained yet, it should throw an error

/** Matches points and clusters
* @param change_centers optional coroutine to change centers in
* Lloyd Kmeans
* @return A tuple of: \n
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we dont need the newlines

for (int32_t i = 0; i < k; ++i)
{
stds.set_column(
i, rhs->copy_subset(cluster_indexes[i])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to copy things to compute std-dev. I think you could do this with a view, but might need some work to make the std call work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check View.h

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also maybe we can do the dynamic cast only once outside the loop?


for (int32_t i = 0; i < k; ++i)
{
stds.set_column(i, view(rhs, cluster_indexes[i])->std());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep this is what I had in mind, nice.
The only thing is, I just checked the std implementation, and it only produces valid results if no subset is present (it is missing an assert for that atm).
Could you add an if-else branch for the case where subsets are present, that iterates through the feature vectors one by one to compute the std? If possible, please iterate in a memory friendly order (sorting the subset indices first)

@gf712 didnt we have something like this automagically somewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as this implementation here is wrong, this shows that we need some unit tests for the stds :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also prefer if the std data wouldnt be copied here, but the std() method would write it straight to the matrix. E.g. via first getting the colunm vector reference, and then passing that to std or something

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an iterator for features, but the subset is not sorted as it took longer to qsort than the cheapest computations. I would say test with manual sorting and with iterator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this a bit more. The solution here is actually not ideal. You go through all the data and identify the cluster centers corresponding to each point. Then you copy the data for each cluster and compute the std. While the copying can be avoided, we can turn the whole approach upside down to make it simpler (no need to compute this data structure you added) and more efficient: just compute the cluster for each training point (i.e. apply method or similar), then just iterate through the data once, and add the contribution for the std deviation of each point to the corresponding cluster std

clusters = kmeans.apply(features) # vector with cluster inds
std_devs = SGMatrix(num_dims, num_data) # init with zero
for (vec, c_idx) in zip(feats, clusters):
  mean = m_cluster_centres[c_idx] # assumes this was computed and stored by "train"
  std_devs[:,c] += (vec-mean)**2 # this is naive (unstable), you want to use Welford's alg or similar, see [here](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)
std_devs /= (num_vec-1)
std_devs = linalg::sqrt(std_devs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calculating std for each cluster thus we need to know which points belong to this cluster. We can store cluster assignments during Lloyd K-Means, but it is not so easy in the MiniBatch case.

Copy link
Contributor Author

@vinnik-dmitry07 vinnik-dmitry07 Apr 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGVector<float64_t> c_alive=rhs_mus->get_feature_vector(near);

The only way is to add here something like:

cluster_assginments[M[j]] = near;

-- But it will be executed max_iter * batch_size i.e. default 100 * 300 = 30000 times.

I think the best way is to make std function to accept precomputed mean and do it through an iterating view because, actually, the list of cluster indexes is sorted (i.e. cluster_indexes[i]).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldnt you run the apply function to get the cluster assignments of the training data? Or some helper function defined within?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for mini batch we would need to compute it on the fly I guess?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I did not know about it. Now I can use apply in Lloyd and in Minibatch to get assignments simply adding three lines of code. So extracting compute_cluster_assignments was a bad idea from the outset.

stds.set_column(i, view(rhs, cluster_indexes[i])->std());
}

distance->replace_lhs(lhs_cache);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spaces vs tabs here

{
const int32_t cluster_assignments_i=cluster_assignments[i];
int32_t min_cluster, j;
float64_t min_dist, dist;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you do this in two lines and init them with the values below straight away? less error prone

SGMatrix<float64_t> points = distance->get_rhs()
->as<DenseFeatures<float64_t>>()
->get_feature_matrix();
SGVector<float64_t> cluster_assignments = const_cast<KMeansBase*>(this)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please no const_cast. @karlnapf this is probably more of an indication that we need to redesign KMeans no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably, but can't we avoid the cast otherwise?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to either make this function non const or make apply const

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making it nonconst breaks watch_method. I tried to make apply const but it causes continuous changes in the derived classes of DistanceMachine. I was not sure that these changes will not bring something bad and decided to use the most obvious solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean it breaks? I think it should work...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run is void, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think so. Check in SGObject

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you’re right. Not sure then :D

Copy link
Contributor Author

@vinnik-dmitry07 vinnik-dmitry07 May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void run(std::string_view name) const noexcept(false)
Did I miss something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I just got confused, sorry! what @karlnapf suggested should work though! :)


linalg::scale(squares_sums, squares_sums, 1. / (points.num_cols - 1));
for (float64_t& x : squares_sums)
x = std::sqrt(x);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there no linalg for this? Maybe rewrite as std::transform, to be more idiomatic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ for linalg... could send a separate pr for adding an elementwise sqrt. We had a few elementwise operations added recently, you could check those

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there is power function and could just have exponent 0.5?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for that other PR


count += 1;
auto delta1 = linalg::add(point, mean, 1., -1.);
linalg::add(mean, linalg::scale(delta1, 1. / count), mean);
Copy link
Member

@gf712 gf712 Apr 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we compute the mean differently @karlnapf? The issue is just that you are doing a division in a loop.. Is there no geometric series, or something else, that could be cheaper to compute?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is from https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

@vinnik-dmitry07 instead of implementing this online algorithm in here, could you move this somewhere else, so that it is also usable from other parts of the code? I envision an updater being instantiated and then repeatedly being called, a bit like an iterator
@gf712 any ideas where would be best and what the API would look like?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we would just have a class with a update(datapoint) public class function. Then can have some derived classes for mean and variance (unless you want to combine them in a single class). Not sure what other algos would be useful to have online update for?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes definitely useful! The class should support both mean and variance but each should be optional (one wants one without the other one)


for (int32_t point_number : range(cluster_assignments.vlen))
{
auto cluster_number = (int32_t) cluster_assignments[point_number];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you use an explicit static_cast please? Avoid using C style cast to avoid any doubt of what you're doing

compute_cluster_variances();
auto cluster_centres =
std::make_shared<DenseFeatures<float64_t>>(cluster_centers);
distance->replace_lhs(cluster_centres);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

for (int32_t point_number : range(cluster_assignments.vlen))
{
auto cluster_number = (int32_t) cluster_assignments[point_number];
auto point = points.get_column(point_number);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto&

@DREAMCATCHERVIBE
Copy link

Hi

@stale
Copy link

stale bot commented Oct 31, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 31, 2020
@stale
Copy link

stale bot commented Nov 7, 2020

This issue is now being closed due to a lack of activity. Feel free to reopen it.

@stale stale bot closed this Nov 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants