Welcome to xlnscpp Discussions! #1
Replies: 21 comments · 34 replies
-
I'm Arya Gupta, a pre-final year student at JIIT Noida. I'm passionate about deep learning, numerical computing, and system optimization. Excited to be part of this community and looking forward to learning from everyone!
-
Hello @pradeeban @markgarnold, I have completed the challenges for this project. Now I am trying to understand the structure of ggml. It has various objects like ggml_context, which holds the context for ggml operations, and ggml_backend_buffer, which manages the buffer for ggml backend operations, and many more. These are not objects exactly but C structs, built around standard data types like float, char, etc. Am I thinking in the correct direction?
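For readers new to ggml, this is roughly how those structures appear in user code: a minimal sketch against the public ggml C API (the field comments are illustrative, and details can vary between ggml versions).

```cpp
// Minimal sketch: creating a ggml_context and one tensor via the public
// ggml C API (values are illustrative).
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024, // arena the context allocates from
        /*.mem_buffer =*/ NULL,             // let ggml allocate the arena
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Tensors live inside the context; their element type is one of ggml's
    // standard types (GGML_TYPE_F32, GGML_TYPE_F16, quantized types, ...).
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    (void) a;

    ggml_free(ctx);
    return 0;
}
```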
-
I wanted to say the same but could not phrase it properly. Thanks @markgarnold. Here is what I plan to do:
@markgarnold, could you please provide some suggestions on what should be done ...
-
@markgarnold I have been digging through the ggml code base, and here are some of my findings. Suppose we want to add two tensors; this is what we do.
How ggml does it at the backend:
We will change this function to use LNS internally, but for an LLM this will involve a great many conversions from float to LNS, which we will need to optimize. Again, is this the correct direction for the thought process?
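To make that direction concrete, here is a hypothetical sketch of forking one such backend kernel. ggml_vec_add_f32 is ggml's real scalar vector-add; the GGML_USE_XLNS guard is invented for illustration, and the sketch assumes xlnscpp's xlns16_float type converts to and from float.

```cpp
// Hypothetical #ifdef fork of ggml's scalar vector-add kernel.
// GGML_USE_XLNS is an invented guard; xlns16_float is xlnscpp's
// overloaded-operator type (assumed convertible to/from float).
inline static void ggml_vec_add_f32(const int n, float * z,
                                    const float * x, const float * y) {
#ifdef GGML_USE_XLNS
    for (int i = 0; i < n; ++i) {
        // Round-trip per element: float -> xlns16 -> LNS add -> float.
        // These conversions are exactly the overhead discussed above.
        z[i] = (float)(xlns16_float(x[i]) + xlns16_float(y[i]));
    }
#else
    for (int i = 0; i < n; ++i) {
        z[i] = x[i] + y[i];
    }
#endif
}
```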
-
Vector functions / tensors: ggml_vec_xl_u8x2
Quantization functions: ascendc_quantize_f16_to_q4_0
These functions need to be changed. There are a lot...
-
@markgarnold I have completed preparing my proposal. Could you kindly review it? I will be sending it to markgarnold@yahoo.com, as I believe this is the correct email address. Please let me know if otherwise.
-
The GGML library includes support for various backends, such as CUDA, OpenCL, and Vulkan. For this project, do we need to implement the LNS backend (xlns32/xlns16) for all these modules, or should we focus on the CPU only?
-
Is this project considered medium-scale or large-scale? What should I write in the GSoC proposal?
-
I'm Ashwin from CS first year. I came to know about the project pretty late and didn't know we have to link the completed code challenges in the proposal application, so I didn't include any. What should I do now?
-
You can upload the proposal again if you wish.
Ed
-
Hello @markgarnold @echester, I am excited to work on this project! Apart from the attached research papers, is there anything we should go through in this regard?
-
I'm interested in contributing a modern CMake build system to xlnscpp. Goals: ...
Would this be a valuable contribution? I'm happy to discuss the details. Looking forward to your thoughts!
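For what it's worth, a minimal sketch of such a top-level CMakeLists.txt, treating xlnscpp as a header-only interface library (the target names and file names are assumptions about the repository layout):

```cmake
# Hypothetical CMakeLists.txt sketch for xlnscpp (names and paths assumed).
cmake_minimum_required(VERSION 3.14)
project(xlnscpp LANGUAGES CXX)

# Header-only interface target that downstream projects can link against.
add_library(xlnscpp INTERFACE)
target_include_directories(xlnscpp INTERFACE ${CMAKE_CURRENT_SOURCE_DIR})
target_compile_features(xlnscpp INTERFACE cxx_std_11)

option(XLNSCPP_BUILD_TESTS "Build xlnscpp test programs" ON)
if(XLNSCPP_BUILD_TESTS)
    add_executable(xlnscpp_tests test.cpp) # test file name assumed
    target_link_libraries(xlnscpp_tests PRIVATE xlnscpp)
endif()
```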
-
Hi @markgarnold, @pradeeban, @EdChester — I'm Arjun. I submitted the native LNS softmax (#22) and table-backed exp/log (#23) that were merged recently, and have a few more open PRs (layer normalization, table-backed exp2/log2, weight quantization). I'm interested in the GSoC project on LNS support for LLMs and have been looking at the ggml backend architecture to understand how the integration would work. Are there specific areas you'd like contributors to focus on?
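For anyone following the thread, "table-backed" refers to the standard LNS technique of reducing addition to a lookup of the Gaussian logarithm sb(d) = log2(1 + 2^d). The following is a self-contained illustration of that idea, not the code from those PRs:

```cpp
// Self-contained illustration of table-backed LNS addition (not PR code).
// In LNS a value x is stored as X = log2|x|, so multiplication is just
// X + Y; addition instead needs sb(d) = log2(1 + 2^d), precomputed below.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main(void) {
    const int    F     = 8;        // fractional bits of the log (illustrative)
    const double scale = 1 << F;
    const int    range = 16 << F;  // table covers d in [-16, 0]

    // Precompute sb(-i/scale); we always add toward the larger magnitude,
    // so the table argument d is non-positive.
    std::vector<double> sb(range + 1);
    for (int i = 0; i <= range; ++i) {
        sb[i] = std::log2(1.0 + std::exp2(-i / scale));
    }

    // Log-domain add of x = 3 and y = 5 (both positive, for simplicity).
    double X   = std::log2(3.0);
    double Y   = std::log2(5.0);
    long   idx = std::min<long>(range, std::lround(std::fabs(X - Y) * scale));
    double Z   = std::max(X, Y) + sb[idx];

    std::printf("x + y = %f (exact: 8)\n", std::exp2(Z)); // ~7.998 with F = 8
    return 0;
}
```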
-
@ArjunDeshwal: a further clarification. This project was not selected by Google last year. There were proposals on how to do it, but not quite good enough. For example, see haarit19058's list of functions in this thread. I am not saying that is the correct answer, but it illustrates the extent of the changes needed to ggml to make this work. Since it was posted here (open source), you are free to use it as a starting point, but there are other issues to consider. For example, will you make a clone of ggml and modify that? Or something else?
-
Hi @markgarnold, reading the whole thread, I see that the reasons for your concern about this project not getting selected last year are very understandable. The key aspect most applicants are missing is that they treat this as a regular GSoC project, where you contribute according to set guidelines to get selected. This one is different: instead of contributing or adding a new feature, we have to take a new technical perspective and implement it. It is not as hard as making things from scratch, but it means realigning the way things work at the core. Here is my current approach (sharing in case it's useful), with my perspective based on experience with large codebases like LLVM: ...
Let me know what you all think. Since there is a month left before the deadline, let's give it our best.
-
I appreciate the PRs and comments submitted (including those from @ArjunDeshwal, @Ayush3941, and @naman9271), but it is important to understand the goal of this project: to see whether 16-bit LNS can work in llama (by modifying ggml to use xlns16). This will be a proof of concept (probably a lot slower than FP).
The mentors for this project are @akrentz6 (who created xlnstorch last summer during GSoC 2025) and me. (Ed is no longer able to devote time to this.) @pradeeban is the Alaska org person who makes sure we comply with all GSoC rules. At most one contributor will be selected by Google to receive a stipend during the coding period.
As noted, this project is a bit different from those of other orgs. This is an open-source project; sharing your code and insights with your fellow contributors is a good thing. On the other hand, only one contributor will be selected to implement the ggml/xlns16 project. This will be based on the technical quality of the proposal you submit. If there are no "good" proposals, GSoC Alaska will choose not to fund this project.
Since @akrentz6 was selected last year, he may be able to give some hints on how to conceive and write a good proposal. Do not use AI to write your proposal. I am willing to give feedback on your proposal (sent privately to my email) when it is nearing completion. Check with @akrentz6 if he is also willing to do this; I don't know if he has the time. I will not write your proposal for you.
-
Hi @markgarnold and @akrentz6, I am Vedant from IIT Gandhinagar. I really liked the xlns support for llama.cpp project idea for GSoC 2026, so I have been working on it for about a week now. I went through the whole thread and identified the core objective: to build an XLNS backend that performs matrix multiplication using xlns16. Is this the correct approach? I will cover adding the other functions in the proposal. By referring to other backend implementations, I tested it using a very small model. More implementation and testing details can be found here: https://github.com/Ninjacoder-vedant/llama.cpp/blob/xlns-backend/docs/backend/XLNS.md. You can clone the repo's xlns-backend branch to try it.
-
Hi all, I've attached my proposal from last year's xlnstorch project for reference.
-
Hi Vedant,
I do not see your proposal on the GSoC portal. The deadline is 31 March. You are allowed to submit revisions up to that time. The portal may become slow near the deadline, so it is wise to submit early.
On Wednesday, March 18, 2026 at 12:48:33 AM EDT, Vedant Acharya wrote:
@markgarnold Okay, got your point! I think then it's similar to what I have implemented here: matmul_fun. According to what @akrentz6 said, it's a naive implementation in which I am converting the output to f32 every time, and then the next layer converts it back to xlns16.
Instead, what I think we can do is convert the input f32 token embeddings into xlns16 once, and then each operation after this will have at least one tensor (the activations) in xlns16. The other tensor (the weights in mat_mul) can stay in a quantized format. For that, we would need to implement dynamic dequantization functions that convert the most popular quantized formats into xlns16. Then at the output layer we can convert activations (logits) back to f32.
For the operations, these are the ones I think we would need to implement in xlns16: RoPE (Rotary Positional Encodings), RMS_NORM, ADD, SILU, MUL, MUL_MAT, and possibly a few others depending on the model we want to run. These are based on the requirements of the llama3.2-1B model. Is this the correct direction?
Another question I have:
- Llama3.2-1B supports flash attention, for which there is an op in ggml called FLASH_ATTN_EXT. It seems to be a fairly complex algorithm to implement in xlns16. I think normal attention should work fine for the GSoC PoC, which would just be Query–Key matmul followed by softmax. So, should we implement FLASH_ATTN_EXT in xlns16?
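As a concrete illustration of the dynamic dequantization idea above, here is a hypothetical sketch for ggml's q8_0 format. The block layout mirrors ggml's real definition (32 int8 quants sharing one fp16 scale); the function itself, and treating float2xlns16_ as the float-to-LNS conversion, are assumptions.

```cpp
// Hypothetical sketch of dynamic dequantization straight into xlns16,
// skipping the intermediate f32 tensor.
#include <stdint.h>
#include "ggml.h"     // for ggml_fp16_t and ggml_fp16_to_fp32()
#include "xlns16.cpp" // xlnscpp (file name assumed), for xlns16, float2xlns16_

#define QK8_0 32

typedef struct {
    ggml_fp16_t d;        // fp16 scale for the block (mirrors ggml's layout)
    int8_t      qs[QK8_0];// quantized weights
} block_q8_0;

static void dequantize_q8_0_to_xlns16(const block_q8_0 * bx, xlns16 * y, int nblocks) {
    for (int b = 0; b < nblocks; ++b) {
        const float d = ggml_fp16_to_fp32(bx[b].d);
        for (int j = 0; j < QK8_0; ++j) {
            // One float multiply per weight, then a single conversion;
            // the downstream activations stay in xlns16 from here on.
            y[b * QK8_0 + j] = float2xlns16_(d * (float) bx[b].qs[j]);
        }
    }
}
```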
-
Hi everyone, I'm Krishna, a final-year student currently focusing on SLM quantization. I've been following this project closely because replacing floats at the compute level is exactly the kind of on-device optimization I've been researching for smaller models. I've already completed the code challenges (running the xlns16/32 tests and the ggml FP32 matmul examples). I also put together a three-way comparison between FP32 and the LNS variants just to see where the precision trade-offs are. I'm looking forward to the technical discussions here, especially on picking the right vec functions to modify without disturbing the core stability code.
Challenge repo: https://github.com/krishnamurthi-ramesh/Gsoc-xlnscpp-CodeChallenge
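The kind of quick three-way precision check described above might look something like this (assuming xlnscpp's xlns16_float and xlns32_float types construct from float, support arithmetic operators, and convert back to float):

```cpp
// Quick precision comparison: f32 vs xlns16 vs xlns32 on a sum of squares.
// Assumes xlnscpp's xlns16_float / xlns32_float overloaded types.
#include <cmath>
#include <cstdio>
#include "xlns16.cpp" // xlnscpp file names assumed
#include "xlns32.cpp"

int main(void) {
    float        acc_f  = 0.0f;
    xlns16_float acc_16 = 0.0f;
    xlns32_float acc_32 = 0.0f;

    for (int i = 1; i <= 1000; ++i) {
        float x = 1.0f / (float) i; // simple synthetic workload
        acc_f  = acc_f  + x * x;
        acc_16 = acc_16 + xlns16_float(x) * xlns16_float(x);
        acc_32 = acc_32 + xlns32_float(x) * xlns32_float(x);
    }

    std::printf("f32:    %.8f\n", acc_f);
    std::printf("xlns16: %.8f (rel err %.2e)\n", (float) acc_16,
                std::fabs(((float) acc_16 - acc_f) / acc_f));
    std::printf("xlns32: %.8f (rel err %.2e)\n", (float) acc_32,
                std::fabs(((float) acc_32 - acc_f) / acc_f));
    return 0;
}
```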
-
I wanted to follow up on my introduction with some concrete progress. I've spent the last 24 hours focusing on the #ifdef fork strategy you described for ggml. I've implemented a proof-of-concept fork where I've patched the core vectorized functions in vec.h and vec.cpp. Specifically, for ggml_vec_dot_f32: I've implemented this with a persistent xlns16_float accumulator. It performs dynamic conversion (float2xlns16_) for the inputs, executes exact LNS multiplication, and uses table-referenced addition, only converting back to float at the very end of the dot product. I believe this core-patch approach is a good way to simulate LNS for llama.cpp without the memory overhead of shadow tensors. Looking forward to your feedback.
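A hypothetical reconstruction of what that patched kernel could look like (the real code lives in the fork; only ggml_vec_dot_f32, xlns16_float, and float2xlns16_ are names from the post, and the simple four-argument signature matches older ggml versions):

```cpp
// Hypothetical reconstruction of the patched dot product described above.
// Assumes xlnscpp's xlns16_float overloaded operators and float conversion.
inline static void ggml_vec_dot_f32(const int n, float * s,
                                    const float * x, const float * y) {
    xlns16_float acc = 0.0f; // persistent LNS accumulator
    for (int i = 0; i < n; ++i) {
        // Dynamic conversion of both inputs, exact LNS multiplication
        // (an integer add in the log domain), table-referenced addition.
        acc = acc + xlns16_float(x[i]) * xlns16_float(y[i]);
    }
    *s = (float) acc; // convert back only at the very end
}
```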
-
👋 Welcome!
We're using Discussions as a place to connect with other members of our community. We hope that you ask questions, share ideas, engage with other members, and remember that this is a community we build together 💪.
To get started, comment below with an introduction of yourself and tell us about what you do with this community.