2 changes: 1 addition & 1 deletion data
3 changes: 2 additions & 1 deletion examples/meta/src/clustering/kmeans.sg.in
@@ -16,9 +16,10 @@ Machine kmeans = create_machine("KMeans", k=2, distance=d, seed=1)
 kmeans.train()
 #![train_dataset]
 
-#![extract_centers_and_radius]
+#![extract_centers_radiuses_stds]
 RealMatrix c = kmeans.get_real_matrix("cluster_centers")
 RealVector r = kmeans.get_real_vector("radiuses")
+RealMatrix s = kmeans.get_real_matrix("std_dev")
 #![extract_centers_and_radius]
 
 #![create_instance_mb]
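For reference, the std_dev matrix fetched in this example holds one column per cluster, containing the feature-wise standard deviation of that cluster's points. A minimal sketch of that quantity for a single cluster, in plain C++ with hypothetical names (assuming sample normalization over the cluster's own n - 1 points; this is not the Shogun API):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Feature-wise sample standard deviation of the points assigned to one
// cluster. Each inner vector is one point with `dim` features; the result
// has one entry per feature.
std::vector<double> cluster_std_dev(const std::vector<std::vector<double>>& points)
{
    const std::size_t n = points.size();
    const std::size_t dim = points.front().size();

    // Per-feature mean over the cluster's points.
    std::vector<double> mean(dim, 0.0);
    for (const auto& p : points)
        for (std::size_t d = 0; d < dim; ++d)
            mean[d] += p[d] / static_cast<double>(n);

    // Per-feature sum of squared deviations, scaled by n - 1, then sqrt.
    std::vector<double> out(dim, 0.0);
    for (const auto& p : points)
        for (std::size_t d = 0; d < dim; ++d)
            out[d] += (p[d] - mean[d]) * (p[d] - mean[d]) / static_cast<double>(n - 1);
    for (double& x : out)
        x = std::sqrt(x);
    return out;
}
```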
8 changes: 2 additions & 6 deletions src/shogun/clustering/KMeans.cpp
@@ -42,8 +42,7 @@ KMeans::~KMeans()

 void KMeans::Lloyd_KMeans(SGMatrix<float64_t> centers, int32_t num_centers)
 {
-auto lhs =
-std::dynamic_pointer_cast<DenseFeatures<float64_t>>(distance->get_lhs());
+auto lhs = distance->get_lhs()->as<DenseFeatures<float64_t>>();
 
 int32_t lhs_size=lhs->get_num_vectors();
 int32_t dim=lhs->get_num_features();
@@ -173,10 +172,7 @@ void KMeans::Lloyd_KMeans(SGMatrix<float64_t> centers, int32_t num_centers)
 if (iter%(max_iter/10) == 0)
 io::info("Iteration[{}/{}]: Assignment of {} patterns changed.", iter, max_iter, changed);
 }
-distance->reset_precompute();
-distance->replace_rhs(rhs_cache);
-
-
+distance->replace_rhs(rhs_cache);
}

bool KMeans::train_machine(std::shared_ptr<Features> data)
41 changes: 40 additions & 1 deletion src/shogun/clustering/KMeansBase.cpp
@@ -130,6 +130,45 @@ void KMeansBase::compute_cluster_variances()
}
}

SGMatrix<float64_t> KMeansBase::compute_std_dev() const
{
require(cluster_centers.size() > 0, "KMeans is not trained!");

SGMatrix<float64_t> points = distance->get_rhs()
->as<DenseFeatures<float64_t>>()
->get_feature_matrix();
SGVector<float64_t> cluster_assignments = const_cast<KMeansBase*>(this)
Member: please no const_cast. @karlnapf this is probably more of an indication that we need to redesign KMeans, no?

Member: probably, but can't we avoid the cast otherwise?

Member: I think we have to either make this function non-const or make apply const.

Contributor Author: Making it non-const breaks watch_method. I tried to make apply const, but that causes cascading changes in the derived classes of DistanceMachine. I was not sure those changes would not break anything, so I decided to use the most obvious solution.

Member: What do you mean, it breaks? I think it should work...

Contributor Author: run is void, no?

Member: I don't think so. Check in SGObject.

Member: Hmm, you're right. Not sure then :D

Contributor Author (@vinnik-dmitry07, May 1, 2020): the signature is void run(std::string_view name) const noexcept(false). Did I miss something?

Member: no, I just got confused, sorry! What @karlnapf suggested should work, though! :)

->apply()
->as<MulticlassLabels>()
->get_labels();

SGVector<int32_t> counts(k);
SGMatrix<float64_t> means = cluster_centers.clone();
SGMatrix<float64_t> squares_sums(dimensions, k);

for (int32_t point_number : range(cluster_assignments.vlen))
{
auto cluster_number =
static_cast<int32_t>(cluster_assignments[point_number]);
const auto& point = points.get_column(point_number);
auto& count = counts[cluster_number];
auto mean = means.get_column(cluster_number);
auto squares_sum = squares_sums.get_column(cluster_number);

count += 1;
auto delta1 = linalg::add(point, mean, 1., -1.);
linalg::add(mean, linalg::scale(delta1, 1. / count), mean);
Member (@gf712, Apr 30, 2020): Can we compute the mean differently, @karlnapf? The issue is just that you are doing a division in a loop... Is there no geometric series, or something else, that could be cheaper to compute?

Member: this is from https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

@vinnik-dmitry07 instead of implementing this online algorithm in here, could you move it somewhere else, so that it is also usable from other parts of the code? I envision an updater being instantiated and then repeatedly called, a bit like an iterator. @gf712 any ideas where it would best live and what the API would look like?

Member: Sure, we would just have a class with an update(datapoint) public member function. Then we can have some derived classes for mean and variance (unless you want to combine them in a single class). Not sure what other algorithms would benefit from an online update?

Member: yes, definitely useful! The class should support both mean and variance, but each should be optional (one may want one without the other).

auto delta2 = linalg::add(point, mean, 1., -1.);
linalg::add(
squares_sum, linalg::element_prod(delta1, delta2), squares_sum);
}

linalg::scale(squares_sums, squares_sums, 1. / (points.num_cols - 1));
for (float64_t& x : squares_sums)
x = std::sqrt(x);
Member: is there no linalg function for this? Maybe rewrite it as std::transform, to be more idiomatic.

Member: ++ for linalg... you could send a separate PR adding an elementwise sqrt. We had a few elementwise operations added recently; you could check those.

Member: Not sure if there is a power function that could just take exponent 0.5?

Member: thanks for that other PR

return squares_sums;
}

void KMeansBase::initialize_training(const std::shared_ptr<Features>& data)
{
require(distance, "Distance is not provided");
Expand All @@ -153,7 +192,6 @@ void KMeansBase::initialize_training(const std::shared_ptr<Features>& data)
require(lhs, "Lhs features of distance not provided");
int32_t lhs_size=lhs->get_num_vectors();
dimensions=lhs->get_num_features();
const int32_t centers_size=dimensions*k;

require(lhs_size>0, "Lhs features should not be empty");
require(dimensions>0, "Lhs features should have more than zero dimensions");
@@ -318,6 +356,7 @@ void KMeansBase::init()
 &use_kmeanspp, "kmeanspp", "Whether to use kmeans++",
 ParameterProperties::HYPER | ParameterProperties::SETTING);
 watch_method("cluster_centers", &KMeansBase::get_cluster_centers);
+watch_method("std_dev", &KMeansBase::compute_std_dev);
 SG_ADD(
 &initial_centers, "initial_centers", "Initial centers",
 ParameterProperties::HYPER);
6 changes: 6 additions & 0 deletions src/shogun/clustering/KMeansBase.h
@@ -76,6 +76,12 @@ class KMeansBase : public RandomMixin<DistanceMachine>
 */
 SGMatrix<float64_t> get_cluster_centers() const;
 
+/** get cluster standard deviations
+ *
+ * @return cluster standard deviations; throws an error if the machine has not been trained yet
+ */
+SGMatrix<float64_t> compute_std_dev() const;
+
 /** @return object name */
 virtual const char* get_name() const { return "KMeansBase"; }

9 changes: 7 additions & 2 deletions src/shogun/clustering/KMeansMiniBatch.cpp
@@ -56,13 +56,15 @@ void KMeansMiniBatch::minibatch_KMeans()
 auto lhs=
 distance->get_lhs()->as<DenseFeatures<float64_t>>();
 auto rhs_mus = std::make_shared<DenseFeatures<float64_t>>(cluster_centers);
-auto rhs_cache=distance->replace_rhs(rhs_mus);
+auto rhs_cache = distance->get_rhs();
+distance->replace_rhs(rhs_mus);
 int32_t XSize=lhs->get_num_vectors();
 int32_t dims=lhs->get_num_features();
 
 SGVector<float64_t> v=SGVector<float64_t>(k);
 v.zero();
 
+distance->precompute_lhs();
 
 for (auto i : SG_PROGRESS(range(max_iter)))
 {
 SGVector<int32_t> M=mbchoose_rand(batch_size,XSize);
@@ -124,6 +126,9 @@ bool KMeansMiniBatch::train_machine(std::shared_ptr<Features> data)
 initialize_training(data);
 minibatch_KMeans();
 compute_cluster_variances();
+auto cluster_centres =
+std::make_shared<DenseFeatures<float64_t>>(cluster_centers);
+distance->replace_lhs(cluster_centres);
 return true;
 }
