
Conversation

@two-horned commented Oct 10, 2025

The EGCD is used for cryptographic purposes, for example to find the inverse of a number modulo a prime.

I have implemented an iterative version of this algorithm.

The existing GCD has been micro-optimized to gain some extra speed (3-5% faster).

I purposefully implemented the EGCD with the Euclidean method instead of the binary GCD, because whenever you shift off n zeros inside the binary GCD, the coefficients would need to be multiplied by (2^-n) mod g (g being the gcd of the two inputs), or you would have to do some other funky bookkeeping that is costly for performance.

Fix some comments in GCD.

Make ml_kem use lcm and egcd from std/math.
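
For illustration, here is a minimal sketch of the iterative idea (not the exact code in this PR; egcdSketch is a hypothetical name, and the real std.math.egcd may differ in signature and overflow handling):

const std = @import("std");

/// Sketch: returns g = gcd(a, b) together with Bézout coefficients
/// x and y such that a * x + b * y == g.
fn egcdSketch(a: i64, b: i64) struct { g: i64, x: i64, y: i64 } {
    var old_r: i64 = a;
    var r: i64 = b;
    var old_x: i64 = 1;
    var x: i64 = 0;
    var old_y: i64 = 0;
    var y: i64 = 1;
    while (r != 0) {
        const q = @divTrunc(old_r, r);
        const next_r = old_r - q * r;
        old_r = r;
        r = next_r;
        const next_x = old_x - q * x;
        old_x = x;
        x = next_x;
        const next_y = old_y - q * y;
        old_y = y;
        y = next_y;
    }
    return .{ .g = old_r, .x = old_x, .y = old_y };
}

test "inverse of 3 mod 7 via egcd" {
    const res = egcdSketch(3, 7);
    try std.testing.expectEqual(@as(i64, 1), res.g);
    // 3 * 5 = 15 = 1 (mod 7), so the inverse is 5
    try std.testing.expectEqual(@as(i64, 5), @mod(res.x, 7));
}
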
@two-horned changed the title from "Add proper Extended Greatest Common Divisor function. Micro-optimize GCD." to "Add std.math.egcd (Extended GCD). Micro-optimize GCD." on Oct 10, 2025
@jedisct1 (Contributor)

For cryptographic purposes, the Bernstein–Yang algorithm is a better choice than the EGCD.

The Zig standard library uses it for field inversion in the NIST curves.

@two-horned (Author)

The EGCD is not fully fleshed out yet; for example, there is potential for overflow errors.

Thanks for your comment, @jedisct1.
Regarding Bernstein and Yang's safegcd algorithm, you are totally right that it is the more cryptographically sound choice, but if you don't face side-channel attacks, it is unnecessary to run a constant-time version of the binary GCD (which is what safegcd essentially is). The use of the extended Euclidean algorithm was already inside the ml_kem.zig file; notably, it is a fairly slow implementation as well!

Btw, regarding finding inverses with a cryptographic algorithm, there is also (imo) some improvement to be made in the standard library. A lot of code looks like it has been blindly ported from some other language, but Zig has capabilities that let you express it more elegantly, for example inline for loops.

I am not sure how welcome it is to let a draft develop into a bigger PR, but I wouldn't mind adding cryptographic counterparts of the GCD algorithms to the standard library too, which could be used uniformly by all crypto functions.

@jedisct1 (Contributor)

a lot of code looks like it has been blindly ported from some other language

A bit insulting, but generally, that’s not the case. In fact, both libsodium and Go benefited from some of the tricks that were originally implemented in the Zig standard library.

Regarding inversions: when constant-time behavior is required, they currently use either addition chains (for “nice” primes) or the Bernstein–Yang inversion implemented in native Zig code generated by fiat-crypto, which ensures correctness. So please don’t change that.

That said, a generic EEA implementation is still useful for other scenarios, thanks for adding it!

@two-horned (Author) commented Oct 12, 2025

A bit insulting

I don't think I was being insulting, and I am sorry if I offended you in any way, but taking a quick look at the code reveals this:

// Autogenerated: 'src/ExtractionOCaml/word_by_word_montgomery' --lang Zig --internal-static --public-function-case camelCase --private-function-case camelCase --public-type-case UpperCamelCase --private-type-case UpperCamelCase --no-prefix-fiat --package-name p384 '' 64 '2^384 - 2^128 - 2^96 + 2^32 - 1' mul square add sub opp from_montgomery to_montgomery nonzero selectznz to_bytes from_bytes one msat divstep divstep_precomp

That being said, I cannot claim that I am able to write something better than what already exists, so maybe this is the only way to get the compiler to do what it is supposed to do?

@Rexicon226 (Contributor)

I don't think I was being insulting, and I am sorry if I offended you in any way, but taking a quick look at the code reveals this:

As mentioned above, this is code generated by fiat-crypto, and it is done so in order to ensure correctness. I'm not really sure what other language this could have been copied from when it's generated.

@two-horned (Author) commented Oct 13, 2025

@Rexicon226, very kind of you to join this conversation and to verify the very thing I am saying: there has been code that's been "blindly copied over from some other language".

I'm not really sure what other language this could have been copied from when it's generated.

Auto-generation has its perks: you can quickly produce many variants of basically the same thing, and you remove the human element of error. What you ultimately sacrifice, however, is readability, including clarity of intent, which helps users of the library understand what is implemented and why.

It took me a while, but I found the original formulation or template for this auto-generated code in the standard library: https://github.com/mit-plv/fiat-crypto/blob/master/src/Arithmetic/BYInv.v

Now, here are some of the key issues I have as a simple user of the standard library. These are my opinions; they might be shared by others or not, but nonetheless I want to voice them. If healthy communication is not welcome, I can respect that decision.

  1. The code in its current form reads like machine code. I think it differs little, if at all, from the experience of reading assembly. The reason is that the code was produced by a machine, and it was never intended for a human to read the produced code as the primary source for the algorithm. Also, since the fiat-crypto repository cannot formally prove that the Zig AST is correct (also with respect to constant-time-ness), is it not smarter to just inline the assembly directly in the functions? For the sake of correctness you would have to verify the produced machine code anyway, so why not simply use the auto-generated assembly?
  2. The source of the code is obscure. As implied by me looking for the original implementation for quite some time (I had to understand the fiat-crypto codebase), and also by you, @Rexicon226, it is not clear where the code came from. If a standard library user opened the documentation, he might know it is auto-generated from the fiat-crypto repository, but it will take him additional time to understand which file he has to look for. Also, because the template is written in Coq, he now also needs to understand Coq. Clearly not a friendly setting.
  3. There is a lack of trustworthiness. How can you make sure the code is trustworthy? Especially for cryptographic algorithms, there is real potential for some kind of supply-chain attack, where the fiat-crypto repository is hijacked, or where an error on their side goes unnoticed. We now not only have to trust the developers of the Zig team to write correct code, we also have to trust that the template masters over at fiat-crypto have everything under control. What raised my eyebrows was the use of AI (Copilot) (yes, I've read their commit messages). LLMs are notorious for being confidently incorrect, so I definitely had to double-check whether or not the AI assistant had something to do with the actual Coq proof.

Having written out my opinions and complaints more elaborately, the only logical next step is to provide some suggestions regarding each point:

  1. Write code by hand. Not all code has to be written by hand, as sometimes that is simply infeasible, but Zig is a beautiful language that lets you do a lot of compile-time code generation inside Zig itself. With more code being handwritten, or with the templates living inside the Zig standard library itself, it becomes much easier for a human to read and comprehend what the implementation is about.
  2. One of the bitcoin repositories has FANTASTIC documentation about the safegcd algorithm. The paper might be interesting to read, but a quick yet comprehensive summary is literal gold for every user who browses the standard library documentation looking for a valuable explanation.
  3. Just keep it in Zig. If the code generation cannot be done inside this very repository, at least let it be owned by ziglang. That way any user can sleep soundly, knowing that the people he trusts with his compiler are also working on his crypto algorithms. This is btw also what the bitcoin repository did (as seen in one of the links above).

In conclusion, I can only reiterate: do not take my critique as an insult. I am trying to be constructive, or at least to voice a (by me perceived) issue in a neutral and professional manner. I never expected to have to write such a long explanation for a very simple (admittedly spicy) remark, but in my opinion it was justified, as you might now agree. I am happy to hear your perspectives on this matter and hope we can stay respectful to each other.

@Rexicon226 (Contributor)

The reason is that the code was produced by a machine, and it was never intended for a human to read the produced code as the primary source for the algorithm.

This is true. What we need from this code is correctness, not readability.

Also, since the fiat-crypto repository cannot formally prove that the Zig AST is correct (also with respect to constant-time-ness), is it not smarter to just inline the assembly directly in the functions?

The problem here is the lack of constant-time-ness verification on the Zig AST, rather than any potential input sources for the Zig code being generated. This will be addressed in the future with #1776. It is not smarter to inline the assembly code, because that:

  1. Makes it target-dependent, which would require us to have tens of thousands of lines of assembly, which is much more difficult to audit.
  2. Is no different from just generating Zig. If you cannot trust the Zig compiler to be correct, then all of the cryptography code in the stdlib is invalid for your use case. You already need to trust that LLVM correctly generates the assembly for the rest of the cryptography code. Using something like bedrock to generate assembly does not increase the overall security in my eyes.

If a standard library user opened the documentation, he might know it is auto-generated from the fiat-crypto repository, but it will take him additional time to understand which file he has to look for. Also, because the template is written in Coq, he now also needs to understand Coq. Clearly not a friendly setting.

We require it to be correct. Not fast (fiat-crypto generates some pretty bad code all things considered), nor readable (that already exists in the form of std.crypto.ff if needed, or look at a 3rd-party implementation).
At the end of the day, the Zig generated by fiat-crypto is, to me personally, more trustworthy than a hand-written implementation. It is not particularly difficult to audit the fiat-crypto generated code if you know what to look for.

How can you make sure the code is trustworthy?

The same way you trust other cryptographic code in the standard library written and/or accepted by @jedisct1. If the level of trust you require is greater (for me personally, at work, we have some algorithms which we've written ourselves, with further testing and verification of correctness), then the Zig standard library isn't for you. And that's totally ok.

I would also like to add that the underlying implementation of a finite field isn't really something that can be a source of vulnerability. Perhaps it could be written in a way that causes a program to panic, but implementing it wrong in a malicious way would just cause most algorithms written on top of it to not work. But that is a subjective argument, so take it as you will.

These generated chunks of code are also pretty much never updated, so a supply-chain-based attack is unlikely to happen.

Write code by hand.

This defeats the entire purpose of using a formally verified generator for the implementation. And as mentioned before, we already have something like it in std.crypto.ff. See the above points for a response to the rest of the point.

The paper might be interesting to read, but a quick yet comprehensive summary is literal gold for every user who browses the standard library documentation looking for a valuable explanation.

I personally wouldn't be against adding this link in the doc comment above,

pub fn invert(a: Fe) Fe {

but keep in mind that I am not a core member, so my opinions are my own :).

That way any user can sleep soundly, knowing that the people he trusts with his compiler are also working on his crypto algorithms.

I guess this could make sense if we were actually re-generating these files at some point in time; however, its purpose is defeated by the simple fact that the code is looked at and reviewed by Frank. If you are this concerned about correctness, I think the finite field implementations are the last thing you should be worrying about. Understand that the Zig stdlib cryptography code is not audited. If your use cases require the utmost correctness, I do not believe this is the right library for you to be using.

@jedisct1 (Contributor)

OpenSSL has experienced multiple severe carry-propagation bugs in its finite field implementations (e.g. CVE-2017-3732, CVE-2017-3736 and CVE-2021-4160). Bugs can happen everywhere, including in hardware, but nowadays tools that guarantee correctness of the original source code exist, and Zig has the privilege of being a supported target of one of these tools. Not taking advantage of that would be a step back into the past. Using these tools gives us well-trusted implementations, one common representation, and one less thing to worry about. fiat-crypto can also generate platform-specific assembly code that retains the same public API but helps a lot to ensure that the code runs in constant time. This is something we can eventually take advantage of as well.

But maybe we should get back to the PR.

EGCD in std.math is good and useful, and your implementation looks good.

@jedisct1 (Contributor)

Ran a quick GCD benchmark:

https://gist.github.com/jedisct1/0ddfe5484c6ea273c32efd73b40924c0

| Test Case | Old (ms) | New (ms) | Result |
| --- | --- | --- | --- |
| Small Numbers | 1.979 avg | 2.000 avg | OLD 1.06% faster |
| Powers of 2 | 0.708 avg | 0.690 avg | NEW 2.47% faster |
| High Trailing Zeros | 0.666 avg | 0.665 avg | Equal (-0.10%) |
| Coprime Numbers | 1.940 avg | 1.931 avg | Equal (-0.47%) |
| Large Numbers | 3.206 avg | 3.217 avg | Equal (0.35%) |
| Fibonacci (Worst) | 6.800 | 6.928 | OLD 1.88% faster |
| Random Numbers | 61.645 | 60.785 | NEW 1.39% faster |

The improvement seems to be the usage of @shrExact and @shlExact instead of >> and <<, so maybe we could keep the current, simpler code:

diff --git a/lib/std/math/gcd.zig b/lib/std/math/gcd.zig
index 16ca7846f19a..minimal_exact_shifts 100644
--- a/lib/std/math/gcd.zig
+++ b/lib/std/math/gcd.zig
@@ -26,16 +26,16 @@ pub fn gcd(a: anytype, b: anytype) @TypeOf(a, b) {
     const xz = @ctz(x);
     const yz = @ctz(y);
     const shift = @min(xz, yz);
-    x >>= @intCast(xz);
-    y >>= @intCast(yz);
+    x = @shrExact(x, @intCast(xz));
+    y = @shrExact(y, @intCast(yz));

     var diff = y -% x;
     while (diff != 0) : (diff = y -% x) {
         const zeros = @ctz(diff);
         if (x > y) diff = -%diff;
         y = @min(x, y);
-        x = diff >> @intCast(zeros);
+        x = @shrExact(diff, @intCast(zeros));
     }
-    return y << @intCast(shift);
+    return @shlExact(y, @intCast(shift));
 }

var x: S = @intCast(@abs(a));
var y: S = @intCast(@abs(b));

// Mantain a = s * x + t * y.

Mantain -> Maintain

var s: S = std.math.sign(a);
var t: S = 0;

// Mantain b = u * x + v * y.
Mantain -> Maintain

};

if (@typeInfo(N) != .int or @typeInfo(N).int.signedness != .unsigned) {
@compileError("`a` and `b` must be usigned integers");
usigned -> unsigned

@Fri3dNstuff (Contributor)

The improvement seems to be the usage of @shrExact and @shlExact instead of >> and <<

@jedisct1, I don't think it is: Compiler Explorer shows that changing the shifts into builtin calls for the version currently on master does not affect the generated machine code.

I am in favour of changing the shifts into builtin calls regardless of whether that affects the output; they simply better encode the knowledge we have.

@jedisct1 (Contributor)

I am in favour of changing the shifts into builtin calls regardless of whether that affects the output; they simply better encode the knowledge we have.

Yep. Doesn't hurt.

@two-horned (Author)

EGCD in std.math is good and useful, and your implementation looks good.

I am not happy with it right now, because it might overflow; hence it is only a draft.
There are a couple of ways to work around this issue, and I might look back into the binary GCD, because calculating the coefficients does not require multiplication, thanks to a little trick I just discovered somewhere else.

Still, thank you all for taking the time.

@jedisct1

The improvement seems to be the usage of @shrExact and @shlExact instead of >> and <<, so maybe we could keep the current, simpler code:

I have spent a lot of time hand-optimizing the code for 64-bit words on x86 (on my Zen 2 architecture).
There is no difference in the generated assembly whether you use @shrExact or >>.

It is crucial to calculate every step in the exact order I have written down to achieve the optimal assembly output, which I have written by hand:

  .text
  .globl gcd_hand_optimized
  .type gcd_hand_optimized, @function
gcd_hand_optimized:
  test    %rdi, %rdi
  je      .EARLY_RET_Y      # if zero, early return
  test    %rsi, %rsi
  je      .EARLY_RET_X      # if zero, early return
  tzcntq  %rsi, %rcx        # remove tz from second input (ctz)
  tzcntq  %rdi, %rdx        # remove tz from first  input (ctz)
  shrxq   %rdx, %rdi, %rdi  # remove tz from first input (shift)
  shrxq   %rcx, %rsi, %rsi  # remove tz from second input (shift)
  cmpq    %rcx, %rdx        # save minimum of shifts (compare)
  cmovbq  %rdx, %rcx        # save minimum of shifts (move)
  movq    %rsi, %r8         # create copy of y
  subq    %rdi, %r8         # calculate y - x
  .LOOP:
  movq    %rdi, %r9         # create copy of x
  tzcntq  %r8,  %rdx        # saving zero count in dx
  subq    %rsi, %rdi        # subtract y from x
  cmovbq  %r8,  %rdi        # move y - x to x if carry was set
  cmovbq  %r9,  %rsi        # replace y with x if carry was set
  shrxq   %rdx, %rdi, %rdi  # remove tz from first input (shift)
  movq    %rsi, %r8         # create copy of y
  subq    %rdi, %r8         # calculate y - x
  jne     .LOOP
  shlxq   %rcx, %rsi, %rax  # return y with appropriate shift
  ret
  .EARLY_RET_X:
  movq    %rdi, %rax
  ret
  .EARLY_RET_Y:
  movq    %rsi, %rax
  ret

  .size gcd_hand_optimized, .-gcd_hand_optimized

@jedisct1 (Contributor)

The binary approach appears to be much faster. I use this naive implementation:

const std = @import("std");

/// (n / 2) mod m
fn halveMod(comptime T: type, n: T, m: T) T {
    if (n & 1 == 0) {
        return n >> 1;
    } else {
        const WideT = std.meta.Int(.unsigned, @bitSizeOf(T) + 1);
        const n_wide: WideT = n;
        const m_wide: WideT = m;
        const result = (n_wide + m_wide) >> 1;
        return @intCast(result);
    }
}

/// (a - b) mod m
fn subMod(comptime T: type, a: T, b: T, m: T) T {
    if (a >= b) {
        return a - b;
    } else {
        return m - (b - a);
    }
}

/// Returns the modular inverse of y modulo m, or 0 if it does not exist.
/// Requires m to be an odd integer >= 3, and 0 <= y < m
pub fn modInverse(comptime T: type, y: T, m: T) T {
    std.debug.assert(m >= 3);
    std.debug.assert(m & 1 == 1); // m must be odd
    std.debug.assert(y < m);

    var a: T = y;
    var u: T = 1;
    var b: T = m;
    var v: T = 0;

    while (a != 0) {
        if (a & 1 == 0) {
            a = a >> 1;
            u = halveMod(T, u, m);
        } else {
            if (a < b) {
                // Swap (a, u, b, v) ← (b, v, a, u)
                const temp_a = a;
                const temp_u = u;
                a = b;
                u = v;
                b = temp_a;
                v = temp_u;
            }
            // Now a >= b and both are odd
            // a ← (a − b)/2
            a = (a - b) >> 1;
            // u ← (u − v)/2 mod m
            u = halveMod(T, subMod(T, u, v, m), m);
        }
    }
    if (b != 1) return 0; // Inverse does not exist
    return v;
}

Needs to use signed integers if we really want both Bézout's coefficients.

Not sure that these micro-optimizations will make any practical difference, though.
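
For example, a quick sanity check might look like this:

test "modInverse example" {
    // 3 * 5 = 15 = 1 (mod 7)
    try std.testing.expectEqual(@as(u64, 5), modInverse(u64, 3, 7));
    // gcd(0, 7) = 7, so 0 has no inverse; the routine returns 0
    try std.testing.expectEqual(@as(u64, 0), modInverse(u64, 0, 7));
}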

@two-horned (Author) commented Oct 13, 2025

        const result = (n_wide + m_wide) >> 1;

This line (in halveMod) is not quite correct, but you got the right idea. We want to return some number (n / 2) mod m. If n is already even, return n / 2; else return (m + n) / 2 or (n - m) / 2 (they are equivalent mod m, since they differ by exactly m). Note that this trick only works under the assumption that m ≡ 1 (mod 2).

Unfortunately, I am swamped with work this week, but I will try to play around with some variants until I get an algorithm that guarantees no overflow and is fast enough.
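
For example, the odd case can be written without a widened integer like this (a rough sketch, assuming 0 <= n < m and odd m; halveModOdd is just an illustrative name):

const std = @import("std");

/// (n / 2) mod m for odd m: if n is even, the division is exact;
/// otherwise n + m is even and (n + m) / 2 = n / 2 (mod m).
/// Writing it as (n >> 1) + (m >> 1) + 1 avoids the overflow
/// of computing n + m directly.
fn halveModOdd(n: u64, m: u64) u64 {
    if (n & 1 == 0) return n >> 1;
    return (n >> 1) + (m >> 1) + 1;
}

test "halveModOdd" {
    // (5 / 2) mod 7 = 6, since 6 * 2 = 12 = 5 (mod 7)
    try std.testing.expectEqual(@as(u64, 6), halveModOdd(5, 7));
}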
