Scripts for Billion-Scale Synthetic Data Generation#2247
Conversation
|
/ok to test |
@jinsolp, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test 8fcb3b2 |
📝 WalkthroughSummary by CodeRabbitRelease Notes
WalkthroughIntroduces a new ChangesBillion-Scale Synthetic Dataset Generator and Jitter Query Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py`:
- Around line 116-133: The function `choose_random_queries_with_jitter` always
returns float32 jittered query data (via the return statement calling
`add_jitter` on the float32-casted `sampled` array), but the code that uses this
function's output (around line 389) determines the filename suffix using the
original `dataset.dtype` instead of the actual output dtype. This causes float32
data to be written to filenames with the wrong type suffix (e.g., `.u8bin` for
uint8 inputs), leading to incorrect decoding later. Update the code that
generates the output filename suffix to use float32 as the dtype instead of
`dataset.dtype`, since the jittered queries are always float32 regardless of the
input dataset type.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py`:
- Around line 343-351: Add centralized validation logic in the main function
after the parser.parse_args(argv) call and before dispatching to the command
handlers (_cmd_fit, _cmd_generate, _cmd_verify). Create a validation function or
inline checks to verify that numeric arguments including total_rows, n_queries,
k, nprobes, sample_size, n_clusters, and pca_components have valid bounds.
Additionally, ensure that nprobes is validated against config.nclusters to
enforce the documented contract. If any argument fails validation, print a clear
error message that includes the argument name, expected valid range, and actual
value provided, then return 1 to exit early. This prevents invalid values from
reaching downstream processing.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py`:
- Around line 67-71: In the __post_init__ method, before normalizing
cluster_densities, add comprehensive validation to check not only that the sum
is positive but also that all individual density values are non-negative
(greater than or equal to zero), contain no NaN or infinite values, and match
the expected shape or dimensions as documented in the class contract. These
checks should occur after computing the total sum but before performing the
normalization division to prevent silent data corruption in downstream
allocation operations.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py`:
- Around line 64-75: The function `_fit_cluster_pca` creates GPU memory
allocations with `residuals_gpu` and `out` but doesn't explicitly free them,
causing GPU memory to accumulate when called repeatedly in a loop (once per
cluster). After converting the results to NumPy arrays using `cp.asnumpy()`,
explicitly delete the GPU objects `residuals_gpu` and `out` using the `del`
statement, and optionally call
`cp.cuda.stream.get_current_stream().synchronize()` to ensure GPU memory is
properly released before the function returns.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py`:
- Around line 219-230: The daemon thread in _flush_async can fail silently if
buf_view.tofile(f) throws an exception (disk full, I/O error), and
_wait_for_write() only calls join() without checking for errors, causing the
code to continue with a potentially corrupt file. Add a nonlocal variable to
capture exceptions from the write thread, wrap the buf_view.tofile(f) call in a
try-except block to catch and store any exception, then modify _wait_for_write()
to check for captured exceptions after joining the thread and re-raise them if
they exist.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py`:
- Around line 54-60: The compute_groundtruth_exact function and related
functions (around lines 119-126) lack explicit bounds validation for the
parameters k, total_rows, and nprobes. When k exceeds total_rows or other
invalid values are passed, the functions prefill outputs with sentinel values
(-1) which then propagate to recall computations downstream. Add validation
checks at the beginning of compute_groundtruth_exact and the other affected
function to verify that k is less than or equal to total_rows, total_rows is
positive, and nprobes is within valid bounds. When validation fails, raise clear
exceptions that include both the expected constraints and the actual invalid
values provided, following the pattern of providing actionable error messages
with expected-vs-actual comparisons.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py`:
- Around line 73-93: Add input validation at the start of the load_dataset
function to ensure sample_size is positive when provided (sample_size must be
greater than 0 to prevent silent data corruption from Python's negative
indexing). Additionally, after loading data in each branch (the .npy block, .pkl
block, and memmap_bin_file block), validate that the resulting numpy array is 2D
and numeric before applying sample_size slicing or returning, raising clear and
actionable errors if the data is 1D or of invalid type. This ensures shape and
type errors are caught at the load boundary rather than deferred to downstream
functions like fit_cluster_stats.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 06365399-0de0-4fab-a427-5f11f7d266be
⛔ Files ignored due to path filters (4)
python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_bigann.pngis excluded by!**/*.pngpython/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_falcon.pngis excluded by!**/*.pngpython/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_wiki.pngis excluded by!**/*.pngpython/cuvs_bench/cuvs_bench/synthesize_dataset/figures/pipeline.pngis excluded by!**/*.png
📒 Files selected for processing (11)
python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.pypython/cuvs_bench/cuvs_bench/generate_groundtruth/utils.pypython/cuvs_bench/cuvs_bench/synthesize_dataset/README.mdpython/cuvs_bench/cuvs_bench/synthesize_dataset/__init__.pypython/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.pypython/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.pypython/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.pypython/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.pypython/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.pypython/cuvs_bench/cuvs_bench/synthesize_dataset/_io.pypython/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py
| def choose_random_queries_with_jitter(dataset, n_queries, seed=12345): | ||
| """Pick ``n_queries`` random rows from ``dataset``, add Gaussian jitter at | ||
| scale ``0.1 * std(sample)``, and re-normalize to unit norm iff the | ||
| original dataset rows already are. | ||
| """ | ||
| import numpy as _np | ||
|
|
||
| print("Choosing random vectors from dataset and jittering with noise") | ||
| rng = _np.random.default_rng(seed) | ||
| n_rows = dataset.shape[0] | ||
| # Sort indices so the memmap read is sequential rather than random-access. | ||
| query_idx = _np.sort(rng.choice(n_rows, size=n_queries, replace=False)) | ||
| sampled = dataset[query_idx, :].astype(_np.float32, copy=True) | ||
|
|
||
| normalize = is_l2_normalized(sampled) | ||
|
|
||
| return add_jitter(sampled, rng, normalize) | ||
|
|
There was a problem hiding this comment.
HIGH: random-jitter writes float32 query data under dataset-typed filename suffix.
Line 128/Line 132 produce float32 jittered queries, but Line 389 still picks suffix from dataset.dtype. For non-float datasets, this can write float32 payloads into .u8bin/.i8bin filenames, which later decode with the wrong dtype.
Proposed fix
- queries_filename = os.path.join(
- args.output, "queries" + suffix_from_dtype(dtype)
- )
+ query_out_dtype = (
+ queries.dtype if args.queries == "random-jitter" else dtype
+ )
+ queries_filename = os.path.join(
+ args.output, "queries" + suffix_from_dtype(query_out_dtype)
+ )As per coding guidelines, “Prevent silent data corruption from type coercion and validate that array type coercions are handled safely.”
Also applies to: 388-390
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py` around lines
116 - 133, The function `choose_random_queries_with_jitter` always returns
float32 jittered query data (via the return statement calling `add_jitter` on
the float32-casted `sampled` array), but the code that uses this function's
output (around line 389) determines the filename suffix using the original
`dataset.dtype` instead of the actual output dtype. This causes float32 data to
be written to filenames with the wrong type suffix (e.g., `.u8bin` for uint8
inputs), leading to incorrect decoding later. Update the code that generates the
output filename suffix to use float32 as the dtype instead of `dataset.dtype`,
since the jittered queries are always float32 regardless of the input dataset
type.
Source: Coding guidelines
| args = parser.parse_args(argv) | ||
| if args.command == "fit": | ||
| return _cmd_fit(args) | ||
| if args.command == "generate": | ||
| return _cmd_generate(args) | ||
| if args.command == "verify": | ||
| return _cmd_verify(args) | ||
| parser.print_help() | ||
| return 1 |
There was a problem hiding this comment.
HIGH: Add centralized CLI bounds validation before dispatch.
Several numeric arguments are unbounded (total_rows, n_queries, k, nprobes, sample_size, n_clusters, pca_components). Invalid values can crash downstream (or produce invalid GT silently), and nprobes is not enforced against config.nclusters despite the documented contract.
Proposed guardrail patch
def main(argv: list[str] | None = None) -> int:
@@
args = parser.parse_args(argv)
+
+ def _fail(msg: str) -> int:
+ parser.error(msg)
+ return 2
+
+ if args.command == "fit":
+ if args.sample_size is not None and args.sample_size <= 0:
+ return _fail(
+ f"--sample_size must be > 0 when provided (got {args.sample_size})."
+ )
+ if args.n_clusters <= 0:
+ return _fail(f"--n_clusters must be > 0 (got {args.n_clusters}).")
+ if args.pca_components <= 0:
+ return _fail(
+ f"--pca_components must be > 0 (got {args.pca_components})."
+ )
+
+ if args.command in ("generate", "verify"):
+ if args.total_rows <= 0:
+ return _fail(f"--total_rows must be > 0 (got {args.total_rows}).")
+ if args.n_queries <= 0:
+ return _fail(f"--n_queries must be > 0 (got {args.n_queries}).")
+ if args.n_queries > args.total_rows:
+ return _fail(
+ f"--n_queries ({args.n_queries}) must be <= --total_rows ({args.total_rows}) "
+ "for sampling without replacement."
+ )
+ if args.k <= 0:
+ return _fail(f"--k must be > 0 (got {args.k}).")
+ if args.nprobes is not None and args.nprobes <= 0:
+ return _fail(f"--nprobes must be > 0 (got {args.nprobes}).")
+
if args.command == "fit":
return _cmd_fit(args)
if args.command == "generate":
+ if args.nprobes is not None:
+ cfg = load_fingerprint(args.stats, seed=args.seed)
+ if args.nprobes > cfg.nclusters:
+ return _fail(
+ f"--nprobes ({args.nprobes}) must be <= number of clusters "
+ f"({cfg.nclusters}) in fingerprint."
+ )
return _cmd_generate(args)
if args.command == "verify":
+ if args.nprobes is not None:
+ cfg = load_fingerprint(args.stats, seed=args.seed)
+ if args.nprobes > cfg.nclusters:
+ return _fail(
+ f"--nprobes ({args.nprobes}) must be <= number of clusters "
+ f"({cfg.nclusters}) in fingerprint."
+ )
return _cmd_verify(args)As per coding guidelines, “Ensure missing validation does not cause crashes on invalid input through proper size/type checks” and “Provide clear and actionable error messages that include expected vs actual values where helpful.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py` around lines 343
- 351, Add centralized validation logic in the main function after the
parser.parse_args(argv) call and before dispatching to the command handlers
(_cmd_fit, _cmd_generate, _cmd_verify). Create a validation function or inline
checks to verify that numeric arguments including total_rows, n_queries, k,
nprobes, sample_size, n_clusters, and pca_components have valid bounds.
Additionally, ensure that nprobes is validated against config.nclusters to
enforce the documented contract. If any argument fails validation, print a clear
error message that includes the argument name, expected valid range, and actual
value provided, then return 1 to exit early. This prevents invalid values from
reaching downstream processing.
Source: Coding guidelines
| def __post_init__(self): | ||
| total = float(self.cluster_densities.sum()) | ||
| if total <= 0.0: | ||
| raise ValueError("cluster_densities must sum to a positive value") | ||
| self.cluster_densities = self.cluster_densities / total |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Locate and examine the fingerprint file
find . -name "_fingerprint.py" -type f | head -5Repository: rapidsai/cuvs
Length of output: 123
🏁 Script executed:
# Check the full structure of the file and surrounding context
cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.pyRepository: rapidsai/cuvs
Length of output: 3252
🏁 Script executed:
# Find how cluster_densities is used downstream
rg "cluster_densities" --type py -A 3 -B 3Repository: rapidsai/cuvs
Length of output: 3727
🏁 Script executed:
# Check for existing tests of Fingerprint class
find . -name "test_*.py" -o -name "*_test.py" | xargs rg "Fingerprint" -lRepository: rapidsai/cuvs
Length of output: 41
🏁 Script executed:
# Look at the full context of where cluster_densities is used in _generate.py
sed -n '1,100p' python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py | head -80Repository: rapidsai/cuvs
Length of output: 2602
🏁 Script executed:
# Check if there are any guards around the multiplication by cluster_densities
rg "cluster_densities" -B 5 -A 5 python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.pyRepository: rapidsai/cuvs
Length of output: 451
🏁 Script executed:
# Check what data["densities"] is and how it's constructed
rg "densities" -B 3 -A 3 python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py | head -40Repository: rapidsai/cuvs
Length of output: 1006
🏁 Script executed:
# Check if Fingerprint is directly instantiated in tests
find . -path "*/test*" -name "*.py" -exec rg "Fingerprint" {} +Repository: rapidsai/cuvs
Length of output: 39
🏁 Script executed:
# Check the full function where cluster_densities is used
sed -n '1,150p' python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.pyRepository: rapidsai/cuvs
Length of output: 5070
🏁 Script executed:
# Verify what happens with negative or non-finite values in the multiplication
python3 << 'PY'
import numpy as np
# Simulate what happens if cluster_densities contains problematic values
remaining = 100
total_points = 1000
nclusters = 3
# Case 1: negative density (not caught by current validation)
densities_neg = np.array([-0.1, 0.5, 0.6]) # sums to 1.0, passes current check
total = float(densities_neg.sum())
print(f"Negative case: sum={total}, passes check: {total > 0}")
extra = (densities_neg * remaining).astype(np.int64)
print(f"Extra points: {extra} - negative allocation!\n")
# Case 2: NaN/Inf (current check may fail)
densities_nan = np.array([np.nan, 0.5, 0.5])
total = float(densities_nan.sum())
print(f"NaN case: sum={total}, passes check: {total > 0}")
print(f"NaN comparison result: {total <= 0.0}\n")
# Case 3: shape mismatch (not caught at all)
densities_wrong_shape = np.array([[0.3, 0.3, 0.4]]) # shape (1, 3) instead of (3,)
print(f"Shape: {densities_wrong_shape.shape}, normalized: {densities_wrong_shape / densities_wrong_shape.sum()}")
PYRepository: rapidsai/cuvs
Length of output: 264
HIGH: Validate density invariants before normalization.
Only checking sum > 0 allows negative, NaN, Inf, and shape-mismatched densities to pass validation. Negative densities produce silent data corruption in downstream allocation ((config.cluster_densities * remaining).astype(np.int64)); NaN/Inf values can slip through comparisons; and shape mismatches violate the documented contract.
Suggested fix
def __post_init__(self):
+ if self.cluster_densities.shape != (self.nclusters,):
+ raise ValueError(
+ f"cluster_densities shape mismatch: expected ({self.nclusters},), "
+ f"got {self.cluster_densities.shape}"
+ )
+ if not np.issubdtype(self.cluster_densities.dtype, np.floating):
+ self.cluster_densities = self.cluster_densities.astype(np.float64)
+ if not np.isfinite(self.cluster_densities).all():
+ raise ValueError("cluster_densities must contain only finite values")
+ if (self.cluster_densities < 0).any():
+ raise ValueError("cluster_densities must be non-negative")
total = float(self.cluster_densities.sum())
if total <= 0.0:
raise ValueError("cluster_densities must sum to a positive value")
self.cluster_densities = self.cluster_densities / total🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py` around lines
67 - 71, In the __post_init__ method, before normalizing cluster_densities, add
comprehensive validation to check not only that the sum is positive but also
that all individual density values are non-negative (greater than or equal to
zero), contain no NaN or infinite values, and match the expected shape or
dimensions as documented in the class contract. These checks should occur after
computing the total sum but before performing the normalization division to
prevent silent data corruption in downstream allocation operations.
Source: Coding guidelines
| def _fit_cluster_pca( | ||
| residuals: np.ndarray, n_components: int | ||
| ) -> tuple[np.ndarray, np.ndarray]: | ||
| """Fit a single PCA via cuvs and return ``(components, explained_var)``.""" | ||
| residuals_gpu = cp.asarray(residuals, dtype=cp.float32) | ||
| params = cuvs_pca.Params(n_components=n_components, copy=True) | ||
| out = cuvs_pca.fit(params, residuals_gpu) | ||
| components = cp.asnumpy(cp.asarray(out.components)).astype(np.float32) | ||
| explained_var = cp.asnumpy(cp.asarray(out.explained_var)).astype( | ||
| np.float32 | ||
| ) | ||
| return components, explained_var |
There was a problem hiding this comment.
HIGH: GPU memory not freed in loop-called helper
_fit_cluster_pca is called once per cluster (potentially thousands) but doesn't free residuals_gpu or the out object before returning. GPU memory can accumulate between Python GC cycles.
Suggested fix
def _fit_cluster_pca(
residuals: np.ndarray, n_components: int
) -> tuple[np.ndarray, np.ndarray]:
"""Fit a single PCA via cuvs and return ``(components, explained_var)``."""
residuals_gpu = cp.asarray(residuals, dtype=cp.float32)
params = cuvs_pca.Params(n_components=n_components, copy=True)
out = cuvs_pca.fit(params, residuals_gpu)
components = cp.asnumpy(cp.asarray(out.components)).astype(np.float32)
explained_var = cp.asnumpy(cp.asarray(out.explained_var)).astype(
np.float32
)
+ del residuals_gpu, out
+ cp.get_default_memory_pool().free_all_blocks()
return components, explained_var🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py` around lines 64 -
75, The function `_fit_cluster_pca` creates GPU memory allocations with
`residuals_gpu` and `out` but doesn't explicitly free them, causing GPU memory
to accumulate when called repeatedly in a loop (once per cluster). After
converting the results to NumPy arrays using `cp.asnumpy()`, explicitly delete
the GPU objects `residuals_gpu` and `out` using the `del` statement, and
optionally call `cp.cuda.stream.get_current_stream().synchronize()` to ensure
GPU memory is properly released before the function returns.
Source: Coding guidelines
| def _wait_for_write() -> None: | ||
| nonlocal write_thread | ||
| if write_thread is not None: | ||
| write_thread.join() | ||
| write_thread = None | ||
|
|
||
| def _flush_async(f, buf_view: np.ndarray) -> None: | ||
| nonlocal write_thread | ||
| write_thread = threading.Thread( | ||
| target=lambda: buf_view.tofile(f), daemon=True | ||
| ) | ||
| write_thread.start() |
There was a problem hiding this comment.
HIGH: Exceptions in async write thread are silently lost
If buf_view.tofile(f) fails (disk full, I/O error), the exception is swallowed by the daemon thread. _wait_for_write() only calls join() without checking for errors, so the code continues and may produce a corrupt/truncated file without any indication.
Have you considered capturing and re-raising exceptions from the write thread?
Suggested fix using exception capture
+ write_exception: BaseException | None = None
+
def _wait_for_write() -> None:
- nonlocal write_thread
+ nonlocal write_thread, write_exception
if write_thread is not None:
write_thread.join()
write_thread = None
+ if write_exception is not None:
+ exc = write_exception
+ write_exception = None
+ raise exc
+
+ def _do_write(buf_view: np.ndarray, f) -> None:
+ nonlocal write_exception
+ try:
+ buf_view.tofile(f)
+ except BaseException as e:
+ write_exception = e
def _flush_async(f, buf_view: np.ndarray) -> None:
nonlocal write_thread
write_thread = threading.Thread(
- target=lambda: buf_view.tofile(f), daemon=True
+ target=lambda: _do_write(buf_view, f), daemon=True
)
write_thread.start()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py` around lines
219 - 230, The daemon thread in _flush_async can fail silently if
buf_view.tofile(f) throws an exception (disk full, I/O error), and
_wait_for_write() only calls join() without checking for errors, causing the
code to continue with a potentially corrupt file. Add a nonlocal variable to
capture exceptions from the write thread, wrap the buf_view.tofile(f) call in a
try-except block to catch and store any exception, then modify _wait_for_write()
to check for captured exceptions after joining the thread and re-raise them if
they exist.
Source: Coding guidelines
| def compute_groundtruth_exact( | ||
| queries: np.ndarray, | ||
| total_rows: int, | ||
| config: Fingerprint, | ||
| k: int, | ||
| metric: str = "sqeuclidean", | ||
| ) -> Tuple[np.ndarray, np.ndarray, dict]: |
There was a problem hiding this comment.
HIGH: Add explicit bounds validation for k, total_rows, and nprobes.
At Line 70 and Line 139, outputs are prefilled with -1; if k is out of bounds (for example, k > total_rows), invalid sentinel IDs can leak into final GT and then into recall computation (used in python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py, Line 76). Please fail fast with clear expected-vs-actual validation.
Proposed fix
def compute_groundtruth_exact(
queries: np.ndarray,
total_rows: int,
config: Fingerprint,
k: int,
metric: str = "sqeuclidean",
) -> Tuple[np.ndarray, np.ndarray, dict]:
+ if total_rows <= 0:
+ raise ValueError(
+ f"total_rows must be > 0, got total_rows={total_rows}"
+ )
+ if k <= 0 or k > total_rows:
+ raise ValueError(
+ f"k must be in [1, total_rows], got k={k}, total_rows={total_rows}"
+ )
+
"""Exact brute-force GT, computed by streaming every cluster.
@@
def compute_groundtruth_nprobe(
@@
) -> Tuple[np.ndarray, np.ndarray, dict]:
+ if total_rows <= 0:
+ raise ValueError(
+ f"total_rows must be > 0, got total_rows={total_rows}"
+ )
+ if k <= 0 or k > total_rows:
+ raise ValueError(
+ f"k must be in [1, total_rows], got k={k}, total_rows={total_rows}"
+ )
+ if nprobes <= 0 or nprobes > config.nclusters:
+ raise ValueError(
+ "nprobes must be in [1, config.nclusters], "
+ f"got nprobes={nprobes}, config.nclusters={config.nclusters}"
+ )
+
"""Cheap GT via cluster probing: only the ``nprobes`` nearest clustersAs per coding guidelines, “Ensure missing validation does not cause crashes on invalid input through proper size/type checks” and “Provide clear and actionable error messages that include expected vs actual values where helpful.”
Also applies to: 119-126
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py` around
lines 54 - 60, The compute_groundtruth_exact function and related functions
(around lines 119-126) lack explicit bounds validation for the parameters k,
total_rows, and nprobes. When k exceeds total_rows or other invalid values are
passed, the functions prefill outputs with sentinel values (-1) which then
propagate to recall computations downstream. Add validation checks at the
beginning of compute_groundtruth_exact and the other affected function to verify
that k is less than or equal to total_rows, total_rows is positive, and nprobes
is within valid bounds. When validation fails, raise clear exceptions that
include both the expected constraints and the actual invalid values provided,
following the pattern of providing actionable error messages with
expected-vs-actual comparisons.
Source: Coding guidelines
| if ext == ".npy": | ||
| data = np.load(path) | ||
| if sample_size is not None and sample_size < len(data): | ||
| data = data[:sample_size] | ||
| return np.ascontiguousarray(data.astype(np.float32)) | ||
|
|
||
| if ext == ".pkl": | ||
| with open(path, "rb") as f: | ||
| data = pickle.load(f) | ||
| if not isinstance(data, np.ndarray): | ||
| data = np.array(data, dtype=np.float32) | ||
| if sample_size is not None and sample_size < len(data): | ||
| data = data[:sample_size] | ||
| return np.ascontiguousarray(data.astype(np.float32)) | ||
|
|
||
| # Default: treat as fbin (covers ".fbin" and unknown extensions). | ||
| # memmap_bin_file auto-detects the legacy uint32 / extended uint64 header. | ||
| mm = memmap_bin_file(path, dtype, mode="r") | ||
| if sample_size is not None: | ||
| mm = mm[:sample_size] | ||
| return np.ascontiguousarray(mm) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, find and examine the _io.py file
fd -t f "_io.py" python/Repository: rapidsai/cuvs
Length of output: 112
🏁 Script executed:
# Get the full context of the file around lines 73-93
wc -l python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.pyRepository: rapidsai/cuvs
Length of output: 116
🏁 Script executed:
# Read the full function to understand context
cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py | head -120Repository: rapidsai/cuvs
Length of output: 5227
🏁 Script executed:
# Check memmap_bin_file to understand what it returns
rg -A 20 "def memmap_bin_file" python/Repository: rapidsai/cuvs
Length of output: 2220
🏁 Script executed:
# Find callers of load_dataset
rg -B 2 -A 2 "load_dataset" python/ --type pyRepository: rapidsai/cuvs
Length of output: 2539
🏁 Script executed:
# Check if there are tests for load_dataset
fd -t f "test_*_io.py" python/Repository: rapidsai/cuvs
Length of output: 39
🏁 Script executed:
# Get more context on the _cmd_fit function to see how data is used
cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py | sed -n '1,150p'Repository: rapidsai/cuvs
Length of output: 5596
🏁 Script executed:
# Check if there are any validation calls on the data after load_dataset
rg -A 10 "_cmd_fit" python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.pyRepository: rapidsai/cuvs
Length of output: 688
🏁 Script executed:
# Look for the actual fit function to see what contract it expects
rg -B 3 -A 15 "def fit_cluster_stats" python/Repository: rapidsai/cuvs
Length of output: 1585
🏁 Script executed:
# Check what happens in fit_cluster_stats with invalid data shapes
cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py | sed -n '80,150p'Repository: rapidsai/cuvs
Length of output: 3245
🏁 Script executed:
# Check if sample_size validation happens elsewhere (e.g., argparse)
rg -B 5 -A 5 "sample_size" python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py | head -40Repository: rapidsai/cuvs
Length of output: 1412
🏁 Script executed:
# Test what happens with negative sample_size
python3 << 'PYEOF'
import numpy as np
# Simulate negative sample_size behavior
data = np.arange(100).reshape(10, 10)
sample_size = -3
# What the current code does
if sample_size is not None and sample_size < len(data):
result = data[:sample_size]
print(f"sample_size={sample_size}, len(data)={len(data)}")
print(f"Condition (sample_size < len(data)): {sample_size < len(data)}")
print(f"Result shape: {result.shape}")
print(f"Result (last 3 rows removed): {result[-3:]}")
PYEOFRepository: rapidsai/cuvs
Length of output: 280
🏁 Script executed:
# Verify what happens with 1D array in fit_cluster_stats
python3 << 'PYEOF'
import numpy as np
# Simulate 1D array error
data_1d = np.arange(100).astype(np.float32)
try:
n_dim = data_1d.shape[1]
except IndexError as e:
print(f"1D array error: {type(e).__name__}: {e}")
# Simulate object dtype
data_obj = np.array(["a", "b", "c"], dtype=object)
try:
ascontiguous = np.ascontiguousarray(data_obj.astype(np.float32))
print("Object dtype converted successfully (unexpected)")
except (ValueError, TypeError) as e:
print(f"Object dtype error: {type(e).__name__}: {e}")
PYEOFRepository: rapidsai/cuvs
Length of output: 181
🏁 Script executed:
# Check the actual behavior of the suggested patch - when validation happens relative to slicing
python3 << 'PYEOF'
import numpy as np
# Simulate the current _npy branch behavior vs proposed fix
path = "/tmp/test.npy"
sample_size = -5
# Create test data
test_data = np.arange(100).reshape(10, 10).astype(np.float32)
np.save(path, test_data)
# Current code (without validation)
data = np.load(path)
print(f"Loaded shape: {data.shape}")
if sample_size is not None and sample_size < len(data):
data = data[:sample_size] # NEGATIVE SLICE HAPPENS HERE
print(f"After slice with sample_size={sample_size}: shape={data.shape}")
print(f"Data was silently corrupted! Expected first {abs(sample_size)} rows, got last {10-abs(sample_size)} rows")
# Now check when the suggested validation would run
print(f"\nValidation check would run AFTER slicing, so it cannot prevent the corruption")
if sample_size is not None and sample_size <= 0:
print(f"Would catch: sample_size={sample_size} is invalid")
else:
print(f"Validation check: passed (data already corrupted)")
PYEOFRepository: rapidsai/cuvs
Length of output: 313
HIGH: Validate dataset shape and sample_size before processing, with sample_size check before slicing.
load_dataset accepts invalid inputs that cause silent data corruption (negative sample_size silently loads the last N rows instead of the first N due to Python's negative indexing) and defers shape errors to fit_cluster_stats (which crashes on 1D arrays with IndexError). Validate sample_size sign before any slicing, and enforce 2D numeric arrays at load boundary with actionable errors.
Suggested fix
+def _validate_sample_size(sample_size: int | None) -> None:
+ if sample_size is not None and sample_size <= 0:
+ raise ValueError(
+ f"sample_size must be > 0, got {sample_size}"
+ )
+
+def _validate_loaded_dataset(data: np.ndarray) -> None:
+ if not isinstance(data, np.ndarray):
+ raise TypeError(f"expected numpy.ndarray, got {type(data).__name__}")
+ if data.ndim != 2:
+ raise ValueError(f"expected 2D array (n_rows, n_cols), got ndim={data.ndim}")
+ if data.shape[1] <= 0:
+ raise ValueError(f"expected n_cols > 0, got shape={data.shape}")
+ if not np.issubdtype(data.dtype, np.number):
+ raise TypeError(f"expected numeric dtype, got {data.dtype}")
def load_dataset(
path: str,
sample_size: int | None = None,
dtype: np.dtype = np.float32,
) -> np.ndarray:
+ _validate_sample_size(sample_size)
ext = os.path.splitext(path)[1].lower()
if ext == ".npy":
data = np.load(path)
if sample_size is not None and sample_size < len(data):
data = data[:sample_size]
+ _validate_loaded_dataset(data)
return np.ascontiguousarray(data.astype(np.float32))
if ext == ".pkl":
with open(path, "rb") as f:
data = pickle.load(f)
if not isinstance(data, np.ndarray):
data = np.array(data, dtype=np.float32)
if sample_size is not None and sample_size < len(data):
data = data[:sample_size]
+ _validate_loaded_dataset(data)
return np.ascontiguousarray(data.astype(np.float32))
mm = memmap_bin_file(path, dtype, mode="r")
if sample_size is not None:
mm = mm[:sample_size]
+ _validate_loaded_dataset(mm)
return np.ascontiguousarray(mm)🧰 Tools
🪛 OpenGrep (1.22.0)
[ERROR] 81-81: pickle.load/loads deserializes arbitrary Python objects and can execute arbitrary code. Use a safe format like JSON instead.
(coderabbit.deserialization.python-pickle)
🪛 Ruff (0.15.17)
[error] 81-81: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py` around lines 73 - 93,
Add input validation at the start of the load_dataset function to ensure
sample_size is positive when provided (sample_size must be greater than 0 to
prevent silent data corruption from Python's negative indexing). Additionally,
after loading data in each branch (the .npy block, .pkl block, and
memmap_bin_file block), validate that the resulting numpy array is 2D and
numeric before applying sample_size slicing or returning, raising clear and
actionable errors if the data is 1D or of invalid type. This ensures shape and
type errors are caught at the load boundary rather than deferred to downstream
functions like fit_cluster_stats.
Source: Coding guidelines
Closes #2208
Adding billion-scale synthetic data generation scripts to
cuvs_bench.New Files in
synthesize_dataset/:__main__.py: fit/generate/verify CLI._fit.py: KMeans + per-cluster PCA fitting logic._fingerprint.py: Fingerprint class._generate.py: per-cluster data generation logic._ground_truth.py: exact (streaming) and nprobe GT computation._verify.py: compares nprobe GT vs exact GT to validate nprobes._io.py— dataset loading (support for npz and pkl files) + fingerprint NPZ save/load logic.README.md+figures/: full workflow guide and synth-vs-real DiskANN validation on Falcon/BigANN/Wiki.For Reviewers: It would be easier to read through the
README.mdand review code for each step in that order (starting with__main__.py.