Skip to content

Commit bc4e1cd

Browse files
authored
[Cleanup] Combine Batched and Regular KMeans Impl (#2015)
## Combine batched and regular k-means implementations - Unified the batched (host-data) and regular (device-data) k-means `fit` into a single `kmeans_fit` template that works with both host and device mdspans via `batch_load_iterator` - Unified the device and host initialization paths in `init_centroids` - Removed the `inertia_check` parameter — inertia-based convergence checking now always runs. Zero clustering cost (perfect fit) logs a warning instead of asserting. This is needed because spectral clustering can cause all points to converge on the cluster centroids itself. - Added `init_size` parameter to control how many samples are drawn for KMeansPlusPlus initialization. Defaults to `n_samples` for device data, `(3 * n_clusters)` for host data - Replaced per-iteration centroid `raft::copy` with `std::swap` of buffer pointers - For streaming fit, precompute data norms once and cache them: host norms cached to a host buffer on the first iteration and copied back for subsequent iterations. `process_batch` no longer computes norms internally - Replaced raw `cudaPointerGetAttributes` call with `raft::memory_type_from_pointer` - Replaced `cub::DeviceReduce::Sum` calls with `raft::linalg::mapThenSumReduce` - Guarded weight normalization against overflow: apply `(w / wt_sum) * n_samples` via a composed op instead of precomputing a scale, so very small `wt_sum` values don't produce inf - Renamed `checkWeight` to `weightSum` and made it mdspan-based with an `Accessor` template: device reduce for device weights, host loop for host weights. Callers apply the scaling themselves - Eliminated `batch_sums` / `batch_counts` scratch buffers by accumulating directly into `centroid_sums` / `weight_per_cluster` via `reset_sums=false` in `reduce_rows_by_key` / `reduce_cols_by_key`, removing two per-batch `raft::linalg::add` kernels - Removed dead `update_centroids` helpers (both the `detail` and public template) — no remaining callers after the `fit_main` consolidation - Perf: remove multiple `raft::sync_stream` calls and add a CUDA Event to record if the convergence criteria is met. Convergence check is now done on device. Average per-iteration time with mandatory inertia check now matches previous benchmarks even when previously inertia check was disabled. ## C Tests This PR adds C tests for KMeans. These were missing. Here we test both -- the old version and the new (i.e. breaking change). ## Benchmarks: With mandatory early stopping. Batch size is such that we fill up 90% of available GPU memory (95830MiB) HW: GPU: `NVIDIA H100 NVL (CUDA 13.0)` CPU: ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 52 bits physical, 57 bits virtual Byte Order: Little Endian CPU(s): 256 On-line CPU(s) list: 0-255 Vendor ID: AuthenticAMD Model name: AMD EPYC 9554 64-Core Processor ``` ``` ================================================================================ SUMMARY ================================================================================ n_clusters batch_size fit_time(s) inertia n_iter ---------------------------------------------------------------- 10,000 29,120,352 1584.72 2.8677e+08 30 20,000 29,120,352 2907.34 2.7368e+08 31 30,000 29,101,305 4254.43 2.6617e+08 31 40,000 29,092,704 5836.12 2.6086e+08 32 50,000 29,083,488 7107.04 2.5680e+08 31 ``` ## Breaking Change This PR is a breaking change of the C++ API because the `inertia_check` param is removed. The breaking changes to the C ABI will be applied in 26.08 Authors: - Tarang Jain (https://github.com/tarang-jain) Approvers: - Victor Lafargue (https://github.com/viclafargue) - Dante Gama Dessavre (https://github.com/dantegd) - Micka (https://github.com/lowener) URL: #2015
1 parent e3af4cd commit bc4e1cd

17 files changed

Lines changed: 1517 additions & 1092 deletions

File tree

c/include/cuvs/cluster/kmeans.h

Lines changed: 186 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,8 @@ typedef enum {
3939

4040
/**
4141
* @brief Hyper-parameters for the kmeans algorithm
42+
* NB: The inertia_check field is kept for ABI compatibility. Removed in cuvsKMeansParams_v2.
43+
* TODO: CalVer for the replacement: 26.08
4244
*/
4345
struct cuvsKMeansParams {
4446
cuvsDistanceType metric;
@@ -91,7 +93,7 @@ struct cuvsKMeansParams {
9193
*/
9294
int batch_centroids;
9395

94-
/** Check inertia during iterations for early convergence. */
96+
/** Deprecated, ignored. Kept for ABI compatibility. */
9597
bool inertia_check;
9698

9799
/**
@@ -108,14 +110,104 @@ struct cuvsKMeansParams {
108110
* Number of samples to process per GPU batch for the batched (host-data) API.
109111
* When set to 0, defaults to n_samples (process all at once).
110112
*/
111-
int64_t streaming_batch_size;
113+
int64_t streaming_batch_size;
114+
115+
/**
116+
* Number of samples to draw for KMeansPlusPlus initialization.
117+
* When set to 0, uses heuristic min(3 * n_clusters, n_samples) for host data,
118+
* or n_samples for device data.
119+
*/
120+
int64_t init_size;
121+
};
122+
123+
/**
124+
* @brief Hyper-parameters for the kmeans algorithm
125+
* TODO: Remove this after cuvsKMeansParams is replaced in ABI 2.0
126+
*/
127+
struct cuvsKMeansParams_v2 {
128+
cuvsDistanceType metric;
129+
130+
/**
131+
* The number of clusters to form as well as the number of centroids to generate (default:8).
132+
*/
133+
int n_clusters;
134+
135+
/**
136+
* Method for initialization, defaults to k-means++:
137+
* - cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm
138+
* to select the initial cluster centers.
139+
* - cuvsKMeansInitMethod::Random (random): Choose 'n_clusters' observations (rows) at
140+
* random from the input data for the initial centroids.
141+
* - cuvsKMeansInitMethod::Array (ndarray): Use 'centroids' as initial cluster centers.
142+
*/
143+
cuvsKMeansInitMethod init;
144+
145+
/**
146+
* Maximum number of iterations of the k-means algorithm for a single run.
147+
*/
148+
int max_iter;
149+
150+
/**
151+
* Relative tolerance with regards to inertia to declare convergence.
152+
*/
153+
double tol;
154+
155+
/**
156+
* Number of instance k-means algorithm will be run with different seeds.
157+
*/
158+
int n_init;
159+
160+
/**
161+
* Oversampling factor for use in the k-means|| algorithm
162+
*/
163+
double oversampling_factor;
164+
165+
/**
166+
* batch_samples and batch_centroids are used to tile 1NN computation which is
167+
* useful to optimize/control the memory footprint
168+
* Default tile is [batch_samples x n_clusters] i.e. when batch_centroids is 0
169+
* then don't tile the centroids
170+
*/
171+
int batch_samples;
172+
173+
/**
174+
* if 0 then batch_centroids = n_clusters
175+
*/
176+
int batch_centroids;
177+
178+
/**
179+
* Whether to use hierarchical (balanced) kmeans or not
180+
*/
181+
bool hierarchical;
182+
183+
/**
184+
* For hierarchical k-means , defines the number of training iterations
185+
*/
186+
int hierarchical_n_iters;
187+
188+
/**
189+
* Number of samples to process per GPU batch for the batched (host-data) API.
190+
* When set to 0, defaults to n_samples (process all at once).
191+
*/
192+
int64_t streaming_batch_size;
193+
194+
/**
195+
* Number of samples to draw for KMeansPlusPlus initialization.
196+
* When set to 0, uses heuristic min(3 * n_clusters, n_samples) for host data,
197+
* or n_samples for device data.
198+
*/
199+
int64_t init_size;
112200
};
113201

114202
typedef struct cuvsKMeansParams* cuvsKMeansParams_t;
203+
typedef struct cuvsKMeansParams_v2* cuvsKMeansParams_v2_t;
115204

116205
/**
117206
* @brief Allocate KMeans params, and populate with default values
118207
*
208+
* @note In cuVS 26.08 (next ABI major version) this signature will be
209+
* replaced by cuvsKMeansParamsCreate_v2.
210+
*
119211
* @param[in] params cuvsKMeansParams_t to allocate
120212
* @return cuvsError_t
121213
*/
@@ -124,11 +216,33 @@ cuvsError_t cuvsKMeansParamsCreate(cuvsKMeansParams_t* params);
124216
/**
125217
* @brief De-allocate KMeans params
126218
*
219+
* @note In cuVS 26.08 (next ABI major version) this signature will be
220+
* replaced by cuvsKMeansParamsDestroy_v2.
221+
*
127222
* @param[in] params
128223
* @return cuvsError_t
129224
*/
130225
cuvsError_t cuvsKMeansParamsDestroy(cuvsKMeansParams_t params);
131226

227+
/**
228+
* @brief Allocate KMeans params
229+
*
230+
* Mirrors cuvsKMeansParamsCreate but operates on cuvsKMeansParams_v2.
231+
* Will become the unsuffixed cuvsKMeansParamsCreate in cuVS 26.08.
232+
*
233+
* @param[in] params cuvsKMeansParams_v2_t to allocate
234+
* @return cuvsError_t
235+
*/
236+
cuvsError_t cuvsKMeansParamsCreate_v2(cuvsKMeansParams_v2_t* params);
237+
238+
/**
239+
* @brief De-allocate KMeans params allocated by cuvsKMeansParamsCreate_v2.
240+
*
241+
* @param[in] params
242+
* @return cuvsError_t
243+
*/
244+
cuvsError_t cuvsKMeansParamsDestroy_v2(cuvsKMeansParams_v2_t params);
245+
132246
/**
133247
* @brief Type of k-means algorithm.
134248
*/
@@ -154,6 +268,9 @@ typedef enum { CUVS_KMEANS_TYPE_KMEANS = 0, CUVS_KMEANS_TYPE_KMEANS_BALANCED = 1
154268
* When X is on the host the data is streamed to the GPU in
155269
* batches controlled by params->streaming_batch_size.
156270
*
271+
* @note In cuVS 26.08 (next ABI major version) this signature will be
272+
* replaced by cuvsKMeansFit_v2.
273+
*
157274
* @param[in] res opaque C handle
158275
* @param[in] params Parameters for KMeans model.
159276
* @param[in] X Training instances to cluster. The data must
@@ -181,9 +298,45 @@ cuvsError_t cuvsKMeansFit(cuvsResources_t res,
181298
double* inertia,
182299
int* n_iter);
183300

301+
/**
302+
* @brief Find clusters with k-means algorithm (v2 params layout).
303+
*
304+
* Mirrors cuvsKMeansFit but takes cuvsKMeansParams_v2_t. Will become the
305+
* unsuffixed cuvsKMeansFit in cuVS 26.08.
306+
*
307+
* @param[in] res opaque C handle
308+
* @param[in] params Parameters for KMeans model (v2 layout).
309+
* @param[in] X Training instances to cluster. The data must
310+
* be in row-major format. May be on host or
311+
* device memory.
312+
* [dim = n_samples x n_features]
313+
* @param[in] sample_weight Optional weights for each observation in X.
314+
* Must be on the same memory space as X.
315+
* [len = n_samples]
316+
* @param[inout] centroids [in] When init is InitMethod::Array, use
317+
* centroids as the initial cluster centers.
318+
* [out] The generated centroids from the
319+
* kmeans algorithm are stored at the address
320+
* pointed by 'centroids'. Must be on device.
321+
* [dim = n_clusters x n_features]
322+
* @param[out] inertia Sum of squared distances of samples to their
323+
* closest cluster center.
324+
* @param[out] n_iter Number of iterations run.
325+
*/
326+
cuvsError_t cuvsKMeansFit_v2(cuvsResources_t res,
327+
cuvsKMeansParams_v2_t params,
328+
DLManagedTensor* X,
329+
DLManagedTensor* sample_weight,
330+
DLManagedTensor* centroids,
331+
double* inertia,
332+
int* n_iter);
333+
184334
/**
185335
* @brief Predict the closest cluster each sample in X belongs to.
186336
*
337+
* @note In cuVS 26.08 (next ABI major version) this signature will be
338+
* replaced by cuvsKMeansPredict_v2.
339+
*
187340
* @param[in] res opaque C handle
188341
* @param[in] params Parameters for KMeans model.
189342
* @param[in] X New data to predict.
@@ -209,6 +362,37 @@ cuvsError_t cuvsKMeansPredict(cuvsResources_t res,
209362
bool normalize_weight,
210363
double* inertia);
211364

365+
/**
366+
* @brief Predict the closest cluster each sample in X belongs to (v2 params layout).
367+
*
368+
* Mirrors cuvsKMeansPredict but takes cuvsKMeansParams_v2_t. Will become the
369+
* unsuffixed cuvsKMeansPredict in cuVS 26.08.
370+
*
371+
* @param[in] res opaque C handle
372+
* @param[in] params Parameters for KMeans model (v2 layout).
373+
* @param[in] X New data to predict.
374+
* [dim = n_samples x n_features]
375+
* @param[in] sample_weight Optional weights for each observation in X.
376+
* [len = n_samples]
377+
* @param[in] centroids Cluster centroids. The data must be in
378+
* row-major format.
379+
* [dim = n_clusters x n_features]
380+
* @param[in] normalize_weight True if the weights should be normalized
381+
* @param[out] labels Index of the cluster each sample in X
382+
* belongs to.
383+
* [len = n_samples]
384+
* @param[out] inertia Sum of squared distances of samples to
385+
* their closest cluster center.
386+
*/
387+
cuvsError_t cuvsKMeansPredict_v2(cuvsResources_t res,
388+
cuvsKMeansParams_v2_t params,
389+
DLManagedTensor* X,
390+
DLManagedTensor* sample_weight,
391+
DLManagedTensor* centroids,
392+
DLManagedTensor* labels,
393+
bool normalize_weight,
394+
double* inertia);
395+
212396
/**
213397
* @brief Compute cluster cost
214398
*

c/src/cluster/kmeans.cpp

Lines changed: 86 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,9 @@
1616

1717
namespace {
1818

19-
cuvs::cluster::kmeans::params convert_params(const cuvsKMeansParams& params)
19+
// The conversions are templated on the C struct type and reused by both API surfaces.
20+
template <typename ParamsT>
21+
cuvs::cluster::kmeans::params convert_params(const ParamsT& params)
2022
{
2123
auto kmeans_params = cuvs::cluster::kmeans::params();
2224
kmeans_params.metric = static_cast<cuvs::distance::DistanceType>(params.metric);
@@ -28,22 +30,23 @@ cuvs::cluster::kmeans::params convert_params(const cuvsKMeansParams& params)
2830
kmeans_params.oversampling_factor = params.oversampling_factor;
2931
kmeans_params.batch_samples = params.batch_samples;
3032
kmeans_params.batch_centroids = params.batch_centroids;
31-
kmeans_params.inertia_check = params.inertia_check;
33+
kmeans_params.init_size = params.init_size;
3234
kmeans_params.streaming_batch_size = params.streaming_batch_size;
3335
return kmeans_params;
3436
}
3537

36-
cuvs::cluster::kmeans::balanced_params convert_balanced_params(const cuvsKMeansParams& params)
38+
template <typename ParamsT>
39+
cuvs::cluster::kmeans::balanced_params convert_balanced_params(const ParamsT& params)
3740
{
3841
auto kmeans_params = cuvs::cluster::kmeans::balanced_params();
3942
kmeans_params.metric = static_cast<cuvs::distance::DistanceType>(params.metric);
4043
kmeans_params.n_iters = params.hierarchical_n_iters;
4144
return kmeans_params;
4245
}
4346

44-
template <typename T, typename IdxT = int64_t>
47+
template <typename T, typename ParamsT, typename IdxT = int64_t>
4548
void _fit(cuvsResources_t res,
46-
const cuvsKMeansParams& params,
49+
const ParamsT& params,
4750
DLManagedTensor* X_tensor,
4851
DLManagedTensor* sample_weight_tensor,
4952
DLManagedTensor* centroids_tensor,
@@ -140,9 +143,9 @@ void _fit(cuvsResources_t res,
140143
}
141144
}
142145

143-
template <typename T, typename IdxT = int32_t, typename LabelsT = int32_t>
146+
template <typename T, typename ParamsT, typename IdxT = int32_t, typename LabelsT = int32_t>
144147
void _predict(cuvsResources_t res,
145-
const cuvsKMeansParams& params,
148+
const ParamsT& params,
146149
DLManagedTensor* X_tensor,
147150
DLManagedTensor* sample_weight_tensor,
148151
DLManagedTensor* centroids_tensor,
@@ -237,10 +240,11 @@ extern "C" cuvsError_t cuvsKMeansParamsCreate(cuvsKMeansParams_t* params)
237240
.oversampling_factor = cpp_params.oversampling_factor,
238241
.batch_samples = cpp_params.batch_samples,
239242
.batch_centroids = cpp_params.batch_centroids,
240-
.inertia_check = cpp_params.inertia_check,
243+
.inertia_check = false,
241244
.hierarchical = false,
242245
.hierarchical_n_iters = static_cast<int>(cpp_balanced_params.n_iters),
243-
.streaming_batch_size = cpp_params.streaming_batch_size};
246+
.streaming_batch_size = cpp_params.streaming_batch_size,
247+
.init_size = cpp_params.init_size};
244248
});
245249
}
246250

@@ -294,6 +298,79 @@ extern "C" cuvsError_t cuvsKMeansPredict(cuvsResources_t res,
294298
});
295299
}
296300

301+
extern "C" cuvsError_t cuvsKMeansParamsCreate_v2(cuvsKMeansParams_v2_t* params)
302+
{
303+
return cuvs::core::translate_exceptions([=] {
304+
cuvs::cluster::kmeans::params cpp_params;
305+
cuvs::cluster::kmeans::balanced_params cpp_balanced_params;
306+
*params = new cuvsKMeansParams_v2{
307+
.metric = static_cast<cuvsDistanceType>(cpp_params.metric),
308+
.n_clusters = cpp_params.n_clusters,
309+
.init = static_cast<cuvsKMeansInitMethod>(cpp_params.init),
310+
.max_iter = cpp_params.max_iter,
311+
.tol = cpp_params.tol,
312+
.n_init = cpp_params.n_init,
313+
.oversampling_factor = cpp_params.oversampling_factor,
314+
.batch_samples = cpp_params.batch_samples,
315+
.batch_centroids = cpp_params.batch_centroids,
316+
.hierarchical = false,
317+
.hierarchical_n_iters = static_cast<int>(cpp_balanced_params.n_iters),
318+
.streaming_batch_size = cpp_params.streaming_batch_size,
319+
.init_size = cpp_params.init_size};
320+
});
321+
}
322+
323+
extern "C" cuvsError_t cuvsKMeansParamsDestroy_v2(cuvsKMeansParams_v2_t params)
324+
{
325+
return cuvs::core::translate_exceptions([=] { delete params; });
326+
}
327+
328+
extern "C" cuvsError_t cuvsKMeansFit_v2(cuvsResources_t res,
329+
cuvsKMeansParams_v2_t params,
330+
DLManagedTensor* X,
331+
DLManagedTensor* sample_weight,
332+
DLManagedTensor* centroids,
333+
double* inertia,
334+
int* n_iter)
335+
{
336+
return cuvs::core::translate_exceptions([=] {
337+
auto dataset = X->dl_tensor;
338+
if (dataset.dtype.code == kDLFloat && dataset.dtype.bits == 32) {
339+
_fit<float>(res, *params, X, sample_weight, centroids, inertia, n_iter);
340+
} else if (dataset.dtype.code == kDLFloat && dataset.dtype.bits == 64) {
341+
_fit<double>(res, *params, X, sample_weight, centroids, inertia, n_iter);
342+
} else {
343+
RAFT_FAIL("Unsupported dataset DLtensor dtype: %d and bits: %d",
344+
dataset.dtype.code,
345+
dataset.dtype.bits);
346+
}
347+
});
348+
}
349+
350+
extern "C" cuvsError_t cuvsKMeansPredict_v2(cuvsResources_t res,
351+
cuvsKMeansParams_v2_t params,
352+
DLManagedTensor* X,
353+
DLManagedTensor* sample_weight,
354+
DLManagedTensor* centroids,
355+
DLManagedTensor* labels,
356+
bool normalize_weight,
357+
double* inertia)
358+
{
359+
return cuvs::core::translate_exceptions([=] {
360+
auto dataset = X->dl_tensor;
361+
if (dataset.dtype.code == kDLFloat && dataset.dtype.bits == 32) {
362+
_predict<float>(res, *params, X, sample_weight, centroids, labels, normalize_weight, inertia);
363+
} else if (dataset.dtype.code == kDLFloat && dataset.dtype.bits == 64) {
364+
_predict<double>(
365+
res, *params, X, sample_weight, centroids, labels, normalize_weight, inertia);
366+
} else {
367+
RAFT_FAIL("Unsupported dataset DLtensor dtype: %d and bits: %d",
368+
dataset.dtype.code,
369+
dataset.dtype.bits);
370+
}
371+
});
372+
}
373+
297374
extern "C" cuvsError_t cuvsKMeansClusterCost(cuvsResources_t res,
298375
DLManagedTensor* X,
299376
DLManagedTensor* centroids,

0 commit comments

Comments
 (0)