Commit 9cbc8da
Fix integer overflow and unbounded loop in Clustering.cpp (#5130)
Summary:
Pull Request resolved: #5130
This diff fixes four bugs in `Clustering.cpp`, four of which trigger for datasets with more than 2,147,483,647 vectors (`INT_MAX`), and one that can trigger regardless of dataset size.
## Bug 1: Integer truncation in fast subsampling — out-of-bounds memory access
**Location**: `subsample_training_set()`, line 96
**Before**:
```cpp
std::vector<int> perm;
// ...
perm[i] = rng.rand_int(nx);
```
**Bug**: `rand_int(int max)` takes an `int` parameter. When `nx` is `idx_t` (`int64_t`) and exceeds `INT_MAX`, the implicit narrowing conversion truncates `nx` to `int`. On two's complement (all target platforms), a value like `3,000,000,000` becomes `-1,294,967,296`. The function then generates a "random" index in a garbage range. These values are stored in `perm` and used as array indices:
```cpp
memcpy(x_new + i * line_size, x + perm[i] * line_size, line_size);
```
A negative `perm[i]` produces an out-of-bounds read from before the start of `x`. This is undefined behavior that can crash or silently corrupt data.
**Fix**:
```cpp
std::vector<idx_t> perm;
// ...
perm[i] = rng.rand_int64() % nx;
```
Two changes: (1) `perm` is now `std::vector<idx_t>` so it can hold indices > `INT_MAX`. (2) `rand_int64()` returns `int64_t`, and `% nx` produces a value in `[0, nx)` without any narrowing. The result is stored losslessly in `idx_t`.
## Bug 2: Missing guard in standard subsampling path
**Location**: `subsample_training_set()`, lines 99-108
**Before**:
```cpp
} else {
perm.resize(nx);
rand_perm(perm.data(), nx, actual_seed);
}
```
**Bug**: `rand_perm(int* perm, size_t n, int64_t seed)` takes `int*` and internally does `perm[i] = i`. When `nx > INT_MAX`, the value `i` (a `size_t`) is narrowed to `int` on assignment, wrapping to negative values. These negative values are then used as dataset indices — same out-of-bounds access as Bug 1.
**Fix**:
```cpp
} else {
FAISS_THROW_IF_NOT_FMT(
nx <= static_cast<idx_t>(std::numeric_limits<int>::max()),
"Dataset too large (%" PRId64
") for standard subsampling; "
"set use_faster_subsampling=true",
nx);
std::vector<int> int_perm(nx);
rand_perm(int_perm.data(), nx, actual_seed);
perm.assign(int_perm.begin(), int_perm.end());
}
```
Three parts: (1) A guard that fails early with a clear error message directing the user to the fast path (which handles large datasets correctly via Bug 1 fix). (2) A temporary `std::vector<int>` to satisfy `rand_perm`'s `int*` API — safe because the guard guarantees `nx <= INT_MAX`. (3) Copy into the `idx_t` perm vector so both paths produce the same type for downstream code.
We chose not to change `rand_perm`'s signature from `int*` to `idx_t*` because it is a public API in `faiss/utils/random.h` and changing it would break all callers.
## Bug 3: Infinite loop in split_clusters
**Location**: `split_clusters()`, lines 239-265
**Before**:
```cpp
for (cj = 0; true; cj = (cj + 1) % k) {
float p = (hassign[cj] - 1.0) / (float)(n - k);
float r = rng.rand_float();
if (r < p) {
break;
}
}
```
**Bug**: This loop probabilistically selects a cluster to split (to replace an empty cluster). The probability of picking cluster `cj` is `p = (hassign[cj] - 1) / (n - k)`. When `hassign[cj] = 1` (cluster has exactly one vector), `p = 0 / (n - k) = 0`. No random float `r` satisfies `r < 0`, so that cluster is never picked.
**Proof of infinite loop**: If all non-empty clusters have exactly 1 vector assigned (which happens with bad initialization, adversarial data, or too many clusters), then every `p = 0` and the loop condition `true` is never broken. The loop spins forever, hanging the process.
Even in non-degenerate cases, the loop can be extremely slow. Example: `n = 10,000,000`, `k = 1000`, largest cluster has 50,000 vectors. Per-cluster probability: `p = 49999 / 9999000 ≈ 0.005`. Expected iterations to find a match: ~200. But with smaller clusters or larger `n`, this grows without bound.
**Fix**:
```cpp
size_t max_tries = 10 * k;
size_t n_tries = 0;
bool found = false;
for (cj = 0; n_tries < max_tries; cj = (cj + 1) % k) {
float p = (hassign[cj] - 1.0) / (float)(n - k);
float r = rng.rand_float();
if (r < p) {
found = true;
break;
}
n_tries++;
}
if (!found) {
cj = 0;
for (size_t j = 1; j < k; j++) {
if (hassign[j] > hassign[cj]) {
cj = j;
}
}
}
```
After `10 * k` attempts (10 full passes through all clusters), the loop falls back to deterministically picking the largest cluster. This is semantically correct because the probabilistic selection is already weighted by cluster size — larger clusters have higher `p`. The deterministic fallback produces the most likely outcome of the probabilistic selection. Termination is guaranteed in O(k) time.
## Bug 4: Integer overflow in objective accumulation loop
**Location**: `Clustering::train_encoded()`, line 535
**Before**:
```cpp
for (int j = 0; j < nx; j++) {
obj += dis[j];
}
```
**Bug**: `nx` is `idx_t` (`int64_t`). When `nx > INT_MAX`, `int j` overflows at 2,147,483,647. Signed integer overflow is undefined behavior per the C++ standard. In practice on two's complement, `j` wraps to `-2,147,483,648`, which satisfies `j < nx`, so the loop continues with a negative index. `dis[j]` with negative `j` is an out-of-bounds read — crash or garbage accumulation.
**Proof**: For `nx = 3,000,000,000`:
- `j` increments from 0 to 2,147,483,647 (correct)
- Next increment: UB, typically wraps to -2,147,483,648
- `-2,147,483,648 < 3,000,000,000` is true (signed/unsigned comparison promotes to unsigned, but even with signed comparison it's true)
- `dis[-2147483648]` — out-of-bounds access
**Fix**:
```cpp
for (idx_t j = 0; j < nx; j++) {
obj += dis[j];
}
```
`idx_t` matches `nx`'s type. The loop variable can represent all valid indices up to `nx`.
Reviewed By: mnorris11
Differential Revision: D101624009
fbshipit-source-id: b961f2677f7e7b93642fe795cfe6ca77812573d31 parent 3c4056d commit 9cbc8da
1 file changed
Lines changed: 31 additions & 8 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
16 | 16 | | |
17 | 17 | | |
18 | 18 | | |
| 19 | + | |
19 | 20 | | |
20 | 21 | | |
21 | 22 | | |
| |||
84 | 85 | | |
85 | 86 | | |
86 | 87 | | |
87 | | - | |
| 88 | + | |
88 | 89 | | |
89 | 90 | | |
90 | 91 | | |
91 | 92 | | |
92 | 93 | | |
93 | 94 | | |
94 | 95 | | |
95 | | - | |
| 96 | + | |
96 | 97 | | |
97 | 98 | | |
98 | 99 | | |
99 | | - | |
100 | | - | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
101 | 109 | | |
102 | 110 | | |
103 | 111 | | |
| |||
232 | 240 | | |
233 | 241 | | |
234 | 242 | | |
235 | | - | |
236 | | - | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
237 | 249 | | |
238 | 250 | | |
239 | 251 | | |
240 | | - | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
| 256 | + | |
| 257 | + | |
| 258 | + | |
| 259 | + | |
| 260 | + | |
| 261 | + | |
| 262 | + | |
| 263 | + | |
241 | 264 | | |
242 | 265 | | |
243 | 266 | | |
| |||
510 | 533 | | |
511 | 534 | | |
512 | 535 | | |
513 | | - | |
| 536 | + | |
514 | 537 | | |
515 | 538 | | |
516 | 539 | | |
| |||
0 commit comments