Commit 9cbc8da

alibeklfc authored and meta-codesync[bot] committed
Fix integer overflow and unbounded loop in Clustering.cpp (#5130)
Summary:
Pull Request resolved: #5130

This diff fixes four bugs in `Clustering.cpp`, three of which trigger for datasets with more than 2,147,483,647 vectors (`INT_MAX`), and one that can trigger regardless of dataset size.

## Bug 1: Integer truncation in fast subsampling (out-of-bounds memory access)

**Location**: `subsample_training_set()`, line 96

**Before**:
```cpp
std::vector<int> perm;
// ...
perm[i] = rng.rand_int(nx);
```

**Bug**: `rand_int(int max)` takes an `int` parameter. When `nx` is `idx_t` (`int64_t`) and exceeds `INT_MAX`, the implicit narrowing conversion truncates `nx` to `int`. On two's complement (all target platforms), a value like `3,000,000,000` becomes `-1,294,967,296`. The function then generates a "random" index in a garbage range. These values are stored in `perm` and used as array indices:
```cpp
memcpy(x_new + i * line_size, x + perm[i] * line_size, line_size);
```
A negative `perm[i]` produces an out-of-bounds read from before the start of `x`. This is undefined behavior that can crash or silently corrupt data.

**Fix**:
```cpp
std::vector<idx_t> perm;
// ...
perm[i] = rng.rand_int64() % nx;
```
Two changes: (1) `perm` is now `std::vector<idx_t>` so it can hold indices > `INT_MAX`. (2) `rand_int64()` returns `int64_t`, and `% nx` produces a value in `[0, nx)` without any narrowing. The result is stored losslessly in `idx_t`.

## Bug 2: Missing guard in standard subsampling path

**Location**: `subsample_training_set()`, lines 99-108

**Before**:
```cpp
} else {
    perm.resize(nx);
    rand_perm(perm.data(), nx, actual_seed);
}
```

**Bug**: `rand_perm(int* perm, size_t n, int64_t seed)` takes `int*` and internally does `perm[i] = i`. When `nx > INT_MAX`, the value `i` (a `size_t`) is narrowed to `int` on assignment, wrapping to negative values. These negative values are then used as dataset indices, which is the same out-of-bounds access as in Bug 1.
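The narrowing behind Bug 1 and Bug 2 can be reproduced in isolation. The following is a minimal sketch, not Faiss code: `narrow_to_int32` and `fits_in_int` are hypothetical helpers introduced only for illustration, and the sketch assumes a 32-bit `int`.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Faiss): mimics what happens when an
// int64_t count is passed to a parameter declared as a 32-bit `int`.
// The modular result is guaranteed from C++20 on; earlier standards leave
// it implementation-defined, and mainstream compilers wrap the same way.
inline std::int32_t narrow_to_int32(std::int64_t v) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(v));
}

// Hypothetical guard: true when `v` can be stored in an `int` unchanged.
// This is the same condition the Bug 2 fix checks before calling rand_perm.
inline bool fits_in_int(std::int64_t v) {
    return v >= std::numeric_limits<int>::min() &&
            v <= std::numeric_limits<int>::max();
}
```

With these helpers, `narrow_to_int32(3000000000LL)` yields `-1294967296`, the value quoted in Bug 1, while `fits_in_int(3000000000LL)` is `false`.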
**Fix**:
```cpp
} else {
    FAISS_THROW_IF_NOT_FMT(
            nx <= static_cast<idx_t>(std::numeric_limits<int>::max()),
            "Dataset too large (%" PRId64 ") for standard subsampling; "
            "set use_faster_subsampling=true",
            nx);
    std::vector<int> int_perm(nx);
    rand_perm(int_perm.data(), nx, actual_seed);
    perm.assign(int_perm.begin(), int_perm.end());
}
```
Three parts: (1) A guard that fails early with a clear error message directing the user to the fast path (which handles large datasets correctly via the Bug 1 fix). (2) A temporary `std::vector<int>` to satisfy `rand_perm`'s `int*` API, which is safe because the guard guarantees `nx <= INT_MAX`. (3) A copy into the `idx_t` perm vector so both paths produce the same type for downstream code.

We chose not to change `rand_perm`'s signature from `int*` to `idx_t*` because it is a public API in `faiss/utils/random.h` and changing it would break all callers.

## Bug 3: Infinite loop in split_clusters

**Location**: `split_clusters()`, lines 239-265

**Before**:
```cpp
for (cj = 0; true; cj = (cj + 1) % k) {
    float p = (hassign[cj] - 1.0) / (float)(n - k);
    float r = rng.rand_float();
    if (r < p) {
        break;
    }
}
```

**Bug**: This loop probabilistically selects a cluster to split (to replace an empty cluster). The probability of picking cluster `cj` is `p = (hassign[cj] - 1) / (n - k)`. When `hassign[cj] = 1` (cluster has exactly one vector), `p = 0 / (n - k) = 0`. No random float `r` satisfies `r < 0`, so that cluster is never picked.

**Proof of infinite loop**: If all non-empty clusters have exactly 1 vector assigned (which happens with bad initialization, adversarial data, or too many clusters), then every `p = 0` and the `break` is never reached. The loop spins forever, hanging the process.

Even in non-degenerate cases, the loop can be extremely slow. Example: `n = 10,000,000`, `k = 1000`, largest cluster has 50,000 vectors. Per-cluster probability: `p = 49999 / 9999000 ≈ 0.005`. Expected iterations to find a match: ~200.
But with smaller clusters or larger `n`, this grows without bound.

**Fix**:
```cpp
size_t max_tries = 10 * k;
size_t n_tries = 0;
bool found = false;
for (cj = 0; n_tries < max_tries; cj = (cj + 1) % k) {
    float p = (hassign[cj] - 1.0) / (float)(n - k);
    float r = rng.rand_float();
    if (r < p) {
        found = true;
        break;
    }
    n_tries++;
}
if (!found) {
    cj = 0;
    for (size_t j = 1; j < k; j++) {
        if (hassign[j] > hassign[cj]) {
            cj = j;
        }
    }
}
```
After `10 * k` attempts (10 full passes through all clusters), the loop falls back to deterministically picking the largest cluster. This is semantically correct because the probabilistic selection is already weighted by cluster size: larger clusters have higher `p`. The deterministic fallback produces the most likely outcome of the probabilistic selection. Termination is guaranteed in O(k) time.

## Bug 4: Integer overflow in objective accumulation loop

**Location**: `Clustering::train_encoded()`, line 535

**Before**:
```cpp
for (int j = 0; j < nx; j++) {
    obj += dis[j];
}
```

**Bug**: `nx` is `idx_t` (`int64_t`). When `nx > INT_MAX`, `int j` overflows at 2,147,483,647. Signed integer overflow is undefined behavior per the C++ standard. In practice on two's complement, `j` wraps to `-2,147,483,648`, which satisfies `j < nx`, so the loop continues with a negative index. `dis[j]` with negative `j` is an out-of-bounds read, leading to a crash or garbage accumulation.

**Proof**: For `nx = 3,000,000,000`:
- `j` increments from 0 to 2,147,483,647 (correct)
- Next increment: UB, typically wraps to -2,147,483,648
- `-2,147,483,648 < 3,000,000,000` is true (both operands are signed; `j` is converted to `int64_t` for the comparison)
- `dis[-2147483648]`: out-of-bounds access

**Fix**:
```cpp
for (idx_t j = 0; j < nx; j++) {
    obj += dis[j];
}
```
`idx_t` matches `nx`'s type. The loop variable can represent all valid indices up to `nx`.
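The bounded selection from Bug 3 can also be exercised outside Faiss. The sketch below is an illustration under stated assumptions, not the code in `split_clusters()`: the function name `pick_cluster_to_split`, the `std::vector<float>` holding per-cluster counts, and the use of `std::mt19937` in place of the Faiss random generator are all hypothetical choices made for this example.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical standalone version of the bounded selection loop.
// `counts[c]` plays the role of hassign[c]; n is the number of vectors.
// Preconditions (assumed, not checked): counts is non-empty and n > k.
inline std::size_t pick_cluster_to_split(
        const std::vector<float>& counts,
        std::size_t n,
        std::mt19937& rng) {
    const std::size_t k = counts.size();
    const std::size_t max_tries = 10 * k; // 10 full passes over the clusters
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);

    std::size_t cj = 0;
    for (std::size_t n_tries = 0; n_tries < max_tries; n_tries++) {
        // probability of picking this cluster, weighted by its size
        const float p = (counts[cj] - 1.0f) / static_cast<float>(n - k);
        if (uniform(rng) < p) {
            return cj;
        }
        cj = (cj + 1) % k;
    }

    // Deterministic fallback: the first cluster with the largest count.
    std::size_t best = 0;
    for (std::size_t j = 1; j < k; j++) {
        if (counts[j] > counts[best]) {
            best = j;
        }
    }
    return best;
}
```

In the degenerate case where every count is 1, each `p` is 0, the probabilistic loop never succeeds, and the function returns after `10 * k` tries with the fallback index instead of spinning forever.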
Reviewed By: mnorris11

Differential Revision: D101624009

fbshipit-source-id: b961f2677f7e7b93642fe795cfe6ca77812573d3
1 parent 3c4056d commit 9cbc8da

1 file changed

Lines changed: 31 additions & 8 deletions

faiss/Clustering.cpp

@@ -16,6 +16,7 @@
 #include <cmath>
 #include <cstdio>
 #include <cstring>
+#include <limits>
 
 #include <omp.h>
 
@@ -84,20 +85,27 @@ idx_t subsample_training_set(
 
     const uint64_t actual_seed = get_actual_rng_seed(clus.seed);
 
-    std::vector<int> perm;
+    std::vector<idx_t> perm;
     if (clus.use_faster_subsampling) {
         // use subsampling with splitmix64 rng
         SplitMix64RandomGenerator rng(actual_seed);
 
         const idx_t new_nx = clus.k * clus.max_points_per_centroid;
         perm.resize(new_nx);
         for (idx_t i = 0; i < new_nx; i++) {
-            perm[i] = rng.rand_int(nx);
+            perm[i] = rng.rand_int64() % nx;
         }
     } else {
         // use subsampling with a default std rng
-        perm.resize(nx);
-        rand_perm(perm.data(), nx, actual_seed);
+        FAISS_THROW_IF_NOT_FMT(
+                nx <= static_cast<idx_t>(std::numeric_limits<int>::max()),
+                "Dataset too large (%" PRId64
+                ") for standard subsampling; "
+                "set use_faster_subsampling=true",
+                nx);
+        std::vector<int> int_perm(nx);
+        rand_perm(int_perm.data(), nx, actual_seed);
+        perm.assign(int_perm.begin(), int_perm.end());
     }
 
     nx = clus.k * clus.max_points_per_centroid;
@@ -232,12 +240,27 @@ int split_clusters(
     for (size_t ci = 0; ci < k; ci++) {
         if (hassign[ci] == 0) { /* need to redefine a centroid */
             size_t cj;
-            for (cj = 0; true; cj = (cj + 1) % k) {
-                /* probability to pick this cluster for split */
+            // Try probabilistic selection, with a deterministic fallback
+            // to the largest cluster if too many iterations pass.
+            size_t max_tries = 10 * k;
+            size_t n_tries = 0;
+            bool found = false;
+            for (cj = 0; n_tries < max_tries; cj = (cj + 1) % k) {
                 float p = (hassign[cj] - 1.0) / (float)(n - k);
                 float r = rng.rand_float();
                 if (r < p) {
-                    break; /* found our cluster to be split */
+                    found = true;
+                    break;
+                }
+                n_tries++;
+            }
+            if (!found) {
+                // Deterministic fallback: split the largest cluster.
+                cj = 0;
+                for (size_t j = 1; j < k; j++) {
+                    if (hassign[j] > hassign[cj]) {
+                        cj = j;
+                    }
                 }
             }
             memcpy(centroids + ci * d,
@@ -510,7 +533,7 @@ void Clustering::train_encoded(
 
             // accumulate objective
             obj = 0;
-            for (int j = 0; j < nx; j++) {
+            for (idx_t j = 0; j < nx; j++) {
                 obj += dis[j];
             }
 