Conversation

@robertmaynard
Contributor

No description provided.

@copy-pr-bot

copy-pr-bot bot commented Oct 20, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@robertmaynard robertmaynard added the non-breaking Introduces a non-breaking change label Oct 20, 2025
@robertmaynard
Contributor Author

/okay to test

3 similar comments
@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard robertmaynard force-pushed the fea/cuvs_c_standalone_ci branch from d9e8b30 to 227ee79 Compare October 20, 2025 16:43
@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

1 similar comment
@robertmaynard
Contributor Author

/okay to test

@robertmaynard robertmaynard force-pushed the fea/cuvs_c_standalone_ci branch from b7803b2 to 58554d6 Compare October 20, 2025 17:58
@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

bdice and others added 6 commits October 28, 2025 10:02
Contributes to rapidsai/build-planning#224

## Notes for Reviewers

This is safe to admin-merge because the change is a no-op; the configs on those 2 branches are identical.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Nate Rock (https://github.com/rockhowse)

URL: rapidsai#1444
…tion (rapidsai#1354)

This PR adds support for AVQ loss / noise shaping to the BFloat16 dataset quantization.

AVQ loss is a modified version of L2 loss that separately penalizes the components of the residual vector which are parallel and perpendicular to the original vector. Quantizing vectors with AVQ loss rather than L2 loss gives a better approximation of the inner product, and thus performs better in Maximum Inner Product Search (https://arxiv.org/abs/1908.10396).

Math:
x : original vector
x_q : quantized vector
r = x - x_q : residual vector
r_para = (< r , x > / || x ||^2) x : parallel component of the residual
r_perp = r - r_para : perpendicular component of the residual
eta >= 1 : AVQ parameter
AVQ loss = eta * || r_para ||^2 + || r_perp ||^2
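
To make the definition concrete, here is a minimal NumPy sketch of the loss (illustrative only, not code from this PR; `avq_loss` is a hypothetical name and the sketch assumes x is nonzero):

```python
import numpy as np

def avq_loss(x, x_q, eta):
    """AVQ loss between an original vector x and its quantized version x_q.

    eta >= 1 penalizes the component of the residual parallel to x more
    heavily than the perpendicular component; eta == 1 recovers plain L2.
    Assumes x is a nonzero vector.
    """
    x = np.asarray(x, dtype=np.float32)
    x_q = np.asarray(x_q, dtype=np.float32)
    r = x - x_q                                   # residual
    r_para = (np.dot(r, x) / np.dot(x, x)) * x    # component parallel to x
    r_perp = r - r_para                           # perpendicular component
    return eta * np.dot(r_para, r_para) + np.dot(r_perp, r_perp)
```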

For a float vector x, the goal is to find a bfloat16 vector x_q which minimizes the AVQ loss for a given eta. Unlike the L2 loss, AVQ loss is not separable (||r_para||^2 contains cross terms from the inner product), so we cannot optimize individual dimensions in parallel and expect convergence. Instead, we use coordinate descent to optimize the dimensions of x_q one at a time until convergence.
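
As a rough host-side illustration of that coordinate descent (a NumPy sketch only, not the kernel: the function names, the `to_bfloat16` rounding helper, and the closed-form per-coordinate update derived from the loss above are assumptions on my part), one could write:

```python
import numpy as np

def to_bfloat16(v):
    # Round float32 values to the nearest bfloat16-representable value
    # (round-to-nearest-even on the top 16 bits), returned as float32.
    u = np.asarray(v, dtype=np.float32).view(np.uint32).astype(np.uint64)
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF0000
    return u.astype(np.uint32).view(np.float32)

def quantize_avq(x, eta, max_iters=10):
    """Coordinate descent on x_q, one dimension at a time, keeping a running
    value of <r, x> so each update accounts for the cross terms it introduces."""
    x = np.asarray(x, dtype=np.float32)
    x_norm2 = float(np.dot(x, x))
    q = to_bfloat16(x).copy()            # start from plain bfloat16 rounding
    if x_norm2 == 0.0:
        return q
    for _ in range(max_iters):
        changed = False
        s = float(np.dot(x - q, x))      # <r, x> for the current iterate
        for j in range(x.size):
            s_rest = s - (x[j] - q[j]) * x[j]       # <r, x> without dimension j
            # Minimize the loss over the residual t = x_j - q_j:
            # loss(t) = const + t^2 + (eta - 1) * (s_rest + t*x_j)^2 / ||x||^2
            t = -(eta - 1.0) * s_rest * x[j] / (x_norm2 + (eta - 1.0) * x[j] ** 2)
            q_new = to_bfloat16(x[j] - t)           # snap back onto the bfloat16 grid
            if q_new != q[j]:
                q[j] = q_new
                changed = True
            s = s_rest + (x[j] - q[j]) * x[j]       # refresh the running <r, x>
        if not changed:
            break
    return q
```

Note that with eta = 1 the per-coordinate optimum is t = 0, which reduces to ordinary bfloat16 rounding.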

This coordinate descent happens in the new kernel "quantize_bfloat16_noise_shaped_kernel". For efficient memory accesses and compute, one warp is assigned to optimize each dataset vector. The computation of the AVQ loss is algebraically separated into two pieces: terms that can be computed in parallel (i.e. those depending only on local information for the assigned dimension) and terms that require global information (namely those depending on < r , x >). Finally, threads in a warp serialize to compute the final cost for their dimension, update the quantized value and the value of < r , x > (if applicable), and broadcast the updated value of < r , x > to the other threads. This continues in blocks of 32 dimensions until convergence (or a maximum of 10 iterations).

I've found this strategy does a good job of taking advantage of the inherently row-major structure of the dataset/index for efficient coalesced accesses, while still making good use of compute resources (hitting >90% compute throughput on an A6000).

Besides the coordinate descent kernel, this PR adds some helper functions for the above, refactors the existing bfloat16 quantization to take advantage of them, and adds configuration for the AVQ eta (the code falls back to normal bfloat16 quantization when the AVQ threshold is NaN).

Authors:
  - https://github.com/rmaschal
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Ben Karsin (https://github.com/bkarsin)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1354
Use JavaScript to avoid displaying duplicates in the doxygen documentation

Authors:
  - Micka (https://github.com/lowener)
  - Ben Frederickson (https://github.com/benfred)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1427
@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard robertmaynard force-pushed the fea/cuvs_c_standalone_ci branch from 7c04bcb to 59c3a68 Compare October 28, 2025 15:52
@robertmaynard
Contributor Author

/okay to test

@robertmaynard robertmaynard force-pushed the fea/cuvs_c_standalone_ci branch from 59c3a68 to 3c46a2d Compare October 28, 2025 16:03
@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard robertmaynard force-pushed the fea/cuvs_c_standalone_ci branch from baed47c to 8104ad4 Compare October 28, 2025 18:36
@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

@robertmaynard
Contributor Author

/okay to test

Labels

feature request (New feature or request), non-breaking (Introduces a non-breaking change)
