
Conversation

@jayshah1819

Add optimized INT8 matrix-vector multiplication kernel

  • Added a new optimized kernel for INT8 matrix-vector multiplication with the following performance optimizations:

    • Vectorized memory loads for input vectors.
    • Each warp handles 4 rows at a time.
    • Unrolled dot product computation.
    • Warp-level reduction using shuffle instructions.
    • Supports batched matrices and vectors.
  • Kept the original kernel as a fallback.

  • Added helper functions for memory loads, warp reductions, and runtime FP32 → INT8 quantization.

  • Defined INT8 and INT4 quantization structures.
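For context, the runtime FP32 → INT8 quantization helper could look roughly like the host-side sketch below. This is illustrative only: the names (`Int8Quantized`, `quantize_fp32_to_int8`) are hypothetical, and it assumes symmetric per-tensor quantization with a single scale, which may differ from the scheme actually used in the PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch of an INT8 quantization structure: quantized values
// plus the scale needed to dequantize (value ~= data[i] * scale).
struct Int8Quantized {
    std::vector<int8_t> data;
    float scale;
};

// Symmetric per-tensor quantization: map [-absmax, absmax] onto [-127, 127].
Int8Quantized quantize_fp32_to_int8(const std::vector<float>& src) {
    float absmax = 0.0f;
    for (float v : src) absmax = std::max(absmax, std::fabs(v));
    const float scale = absmax > 0.0f ? absmax / 127.0f : 1.0f;

    Int8Quantized out;
    out.scale = scale;
    out.data.reserve(src.size());
    for (float v : src) {
        // Round to nearest integer, then clamp to the INT8 symmetric range.
        long q = std::lround(v / scale);
        q = std::min(127L, std::max(-127L, q));
        out.data.push_back(static_cast<int8_t>(q));
    }
    return out;
}
```

On the GPU side, the same per-element round-and-clamp would typically run in a kernel so the FP32 weights never round-trip through the host; the struct above only shows the data layout the matvec kernel would consume.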

@jafioti
Collaborator

jafioti commented Sep 28, 2025

Does the new matvec kernel perform favorably compared to the existing one that was already there? What are the cases where we would need a fallback to the old one? Also, can you remove the binary files in the commit, and the tree.txt?

@jayshah1819
Author

Yes. I rewrote the matvec kernel in a separate .cu file with optimizations like vectorized loads and improved warp reductions. The new version should be roughly 1.5-2x faster than the original string-embedded kernel, though I haven't benchmarked it yet, so keeping both versions lets us compare them directly. The fallback is just for safety during testing; once the new kernel is benchmarked and stable, we can remove the original version (I forgot to mention that). I'll clean up the binary files and tree.txt from the commit.

@jafioti
Collaborator

jafioti commented Sep 30, 2025

awesome, yeah once we can verify the new one is faster, just remove the original one.

