Fix integer overflow and unbounded loop in Clustering.cpp (#5130)

alibeklfc · meta-codesync[bot] · commit 9cbc8da28efb · 2026-04-20T16:44:11.000-07:00
Summary: Pull Request resolved: #5130 This diff fixes four bugs in `Clustering.cpp`, four of which trigger for datasets with more than 2,147,483,647 vectors (`INT_MAX`), and one that can trigger regardless of dataset size. ## Bug 1: Integer truncation in fast subsampling — out-of-bounds memory access **Location**: `subsample_training_set()`, line 96 **Before**: ```cpp std::vector<int> perm; // ... perm[i] = rng.rand_int(nx); ``` **Bug**: `rand_int(int max)` takes an `int` parameter. When `nx` is `idx_t` (`int64_t`) and exceeds `INT_MAX`, the implicit narrowing conversion truncates `nx` to `int`. On two's complement (all target platforms), a value like `3,000,000,000` becomes `-1,294,967,296`. The function then generates a "random" index in a garbage range. These values are stored in `perm` and used as array indices: ```cpp memcpy(x_new + i * line_size, x + perm[i] * line_size, line_size); ``` A negative `perm[i]` produces an out-of-bounds read from before the start of `x`. This is undefined behavior that can crash or silently corrupt data. **Fix**: ```cpp std::vector<idx_t> perm; // ... perm[i] = rng.rand_int64() % nx; ``` Two changes: (1) `perm` is now `std::vector<idx_t>` so it can hold indices > `INT_MAX`. (2) `rand_int64()` returns `int64_t`, and `% nx` produces a value in `[0, nx)` without any narrowing. The result is stored losslessly in `idx_t`. ## Bug 2: Missing guard in standard subsampling path **Location**: `subsample_training_set()`, lines 99-108 **Before**: ```cpp } else { perm.resize(nx); rand_perm(perm.data(), nx, actual_seed); } ``` **Bug**: `rand_perm(int* perm, size_t n, int64_t seed)` takes `int*` and internally does `perm[i] = i`. When `nx > INT_MAX`, the value `i` (a `size_t`) is narrowed to `int` on assignment, wrapping to negative values. These negative values are then used as dataset indices — same out-of-bounds access as Bug 1. **Fix**: ```cpp } else { FAISS_THROW_IF_NOT_FMT( nx <= static_cast<idx_t>(std::numeric_limits<int>::max()), "Dataset too large (%" PRId64 ") for standard subsampling; " "set use_faster_subsampling=true", nx); std::vector<int> int_perm(nx); rand_perm(int_perm.data(), nx, actual_seed); perm.assign(int_perm.begin(), int_perm.end()); } ``` Three parts: (1) A guard that fails early with a clear error message directing the user to the fast path (which handles large datasets correctly via Bug 1 fix). (2) A temporary `std::vector<int>` to satisfy `rand_perm`'s `int*` API — safe because the guard guarantees `nx <= INT_MAX`. (3) Copy into the `idx_t` perm vector so both paths produce the same type for downstream code. We chose not to change `rand_perm`'s signature from `int*` to `idx_t*` because it is a public API in `faiss/utils/random.h` and changing it would break all callers. ## Bug 3: Infinite loop in split_clusters **Location**: `split_clusters()`, lines 239-265 **Before**: ```cpp for (cj = 0; true; cj = (cj + 1) % k) { float p = (hassign[cj] - 1.0) / (float)(n - k); float r = rng.rand_float(); if (r < p) { break; } } ``` **Bug**: This loop probabilistically selects a cluster to split (to replace an empty cluster). The probability of picking cluster `cj` is `p = (hassign[cj] - 1) / (n - k)`. When `hassign[cj] = 1` (cluster has exactly one vector), `p = 0 / (n - k) = 0`. No random float `r` satisfies `r < 0`, so that cluster is never picked. **Proof of infinite loop**: If all non-empty clusters have exactly 1 vector assigned (which happens with bad initialization, adversarial data, or too many clusters), then every `p = 0` and the loop condition `true` is never broken. The loop spins forever, hanging the process. Even in non-degenerate cases, the loop can be extremely slow. Example: `n = 10,000,000`, `k = 1000`, largest cluster has 50,000 vectors. Per-cluster probability: `p = 49999 / 9999000 ≈ 0.005`. Expected iterations to find a match: ~200. But with smaller clusters or larger `n`, this grows without bound. **Fix**: ```cpp size_t max_tries = 10 * k; size_t n_tries = 0; bool found = false; for (cj = 0; n_tries < max_tries; cj = (cj + 1) % k) { float p = (hassign[cj] - 1.0) / (float)(n - k); float r = rng.rand_float(); if (r < p) { found = true; break; } n_tries++; } if (!found) { cj = 0; for (size_t j = 1; j < k; j++) { if (hassign[j] > hassign[cj]) { cj = j; } } } ``` After `10 * k` attempts (10 full passes through all clusters), the loop falls back to deterministically picking the largest cluster. This is semantically correct because the probabilistic selection is already weighted by cluster size — larger clusters have higher `p`. The deterministic fallback produces the most likely outcome of the probabilistic selection. Termination is guaranteed in O(k) time. ## Bug 4: Integer overflow in objective accumulation loop **Location**: `Clustering::train_encoded()`, line 535 **Before**: ```cpp for (int j = 0; j < nx; j++) { obj += dis[j]; } ``` **Bug**: `nx` is `idx_t` (`int64_t`). When `nx > INT_MAX`, `int j` overflows at 2,147,483,647. Signed integer overflow is undefined behavior per the C++ standard. In practice on two's complement, `j` wraps to `-2,147,483,648`, which satisfies `j < nx`, so the loop continues with a negative index. `dis[j]` with negative `j` is an out-of-bounds read — crash or garbage accumulation. **Proof**: For `nx = 3,000,000,000`: - `j` increments from 0 to 2,147,483,647 (correct) - Next increment: UB, typically wraps to -2,147,483,648 - `-2,147,483,648 < 3,000,000,000` is true (signed/unsigned comparison promotes to unsigned, but even with signed comparison it's true) - `dis[-2147483648]` — out-of-bounds access **Fix**: ```cpp for (idx_t j = 0; j < nx; j++) { obj += dis[j]; } ``` `idx_t` matches `nx`'s type. The loop variable can represent all valid indices up to `nx`. Reviewed By: mnorris11 Differential Revision: D101624009 fbshipit-source-id: b961f2677f7e7b93642fe795cfe6ca77812573d3
diff --git a/faiss/Clustering.cpp b/faiss/Clustering.cpp
@@ -16,6 +16,7 @@
 #include <cmath>
 #include <cstdio>
 #include <cstring>
+#include <limits>
 
 #include <omp.h>
 
@@ -84,20 +85,27 @@ idx_t subsample_training_set(
 
     const uint64_t actual_seed = get_actual_rng_seed(clus.seed);
 
-    std::vector<int> perm;
+    std::vector<idx_t> perm;
     if (clus.use_faster_subsampling) {
         // use subsampling with splitmix64 rng
         SplitMix64RandomGenerator rng(actual_seed);
 
         const idx_t new_nx = clus.k * clus.max_points_per_centroid;
         perm.resize(new_nx);
         for (idx_t i = 0; i < new_nx; i++) {
-            perm[i] = rng.rand_int(nx);
+            perm[i] = rng.rand_int64() % nx;
         }
     } else {
         // use subsampling with a default std rng
-        perm.resize(nx);
-        rand_perm(perm.data(), nx, actual_seed);
+        FAISS_THROW_IF_NOT_FMT(
+                nx <= static_cast<idx_t>(std::numeric_limits<int>::max()),
+                "Dataset too large (%" PRId64
+                ") for standard subsampling; "
+                "set use_faster_subsampling=true",
+                nx);
+        std::vector<int> int_perm(nx);
+        rand_perm(int_perm.data(), nx, actual_seed);
+        perm.assign(int_perm.begin(), int_perm.end());
     }
 
     nx = clus.k * clus.max_points_per_centroid;
@@ -232,12 +240,27 @@ int split_clusters(
     for (size_t ci = 0; ci < k; ci++) {
         if (hassign[ci] == 0) { /* need to redefine a centroid */
             size_t cj;
-            for (cj = 0; true; cj = (cj + 1) % k) {
-                /* probability to pick this cluster for split */
+            // Try probabilistic selection, with a deterministic fallback
+            // to the largest cluster if too many iterations pass.
+            size_t max_tries = 10 * k;
+            size_t n_tries = 0;
+            bool found = false;
+            for (cj = 0; n_tries < max_tries; cj = (cj + 1) % k) {
                 float p = (hassign[cj] - 1.0) / (float)(n - k);
                 float r = rng.rand_float();
                 if (r < p) {
-                    break; /* found our cluster to be split */
+                    found = true;
+                    break;
+                }
+                n_tries++;
+            }
+            if (!found) {
+                // Deterministic fallback: split the largest cluster.
+                cj = 0;
+                for (size_t j = 1; j < k; j++) {
+                    if (hassign[j] > hassign[cj]) {
+                        cj = j;
+                    }
                 }
             }
             memcpy(centroids + ci * d,
@@ -510,7 +533,7 @@ void Clustering::train_encoded(
 
             // accumulate objective
             obj = 0;
-            for (int j = 0; j < nx; j++) {
+            for (idx_t j = 0; j < nx; j++) {
                 obj += dis[j];
             }