-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Add more misc. changes from candle fork #3196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ivarflakstad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent stuff
| /// Creates a new private buffer (not necessarily zeroed). | ||
| /// | ||
| /// This is intentionally not in the Metal buffer pool to allow the efficient implementation of persistent buffers. | ||
| pub fn new_private_buffer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is nice to have, but I think we should name it something other than private buffer since that already means something for metal buffers (only available on gpu, ref).
We don't want to use actual metal private buffers as that isn't supported on iOS.
How about new_unpooled_buffer or new_persistent_buffer? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this was a mistake on my part. The correct behavior that I intended for this function is to have:
privateif not on iOSshared/RESOURCE_OPTIONSif on iOS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Could I ask why you want it to be private?
According to Apple's documentation there is no performance benefit, so private is usually used when you want to ensure that the cpu does not have access to the buffer for some specific reason. I'd wager a guess this kind of behaviour is frequently used in gaming.
| crate::bail!( | ||
| "The given quantized dtype {:?} is not supported for indexed_moe_forward!", | ||
| self.dtype() | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking out loud here. It would be nice to have automatic fallback to an approach that isn't as optimized, but still valid. Perhaps returning Result<Option<(CudaStorage, crate::Shape)>> is a decent starting point?
If None then fallback?
Not thinking we add this in this PR ofc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might work, the issue is that effectively indexed_moe_forward is a grouped gemm so we'd need existing infrastructure to run a grouped gemm.
Regardless, providing a grouped gemm functionality will be very useful!
|
Addressed the review comments, the |
Co-authored-by Guoqing Bao <[email protected]>
* Update CI * I have no clue what was going on with this maturin file, but I don't like it * update cuda container options * Add compute cap to cuda wf * Fix rust toolchain call * update cuda ci runner and bindgen_cuda
4c3f2be to
bdb66f2
Compare
indexed_moe_forward(fast path for ggml quants)ContextDeviceapis