Add std.math.egcd (Extended GCD). Micro-optimize GCD. #25533
For cryptographic purposes, the Bernstein–Yang algorithm is a better choice than EGCD. The Zig standard library uses it for field inversion in the NIST curves.
the EGCD is not fully fleshed out yet; for example, there is still potential for overflow errors. thanks for your comment, @jedisct1. btw, regarding finding inverses with a cryptographic algorithm, there is also (imo) some room for improvement in the standard library. a lot of code looks like it has been blindly ported from some other language, but Zig has capabilities that let you express it more elegantly, for example inline for loops. I am not sure how welcome it is to let a draft develop into a bigger PR, but I wouldn't mind adding cryptographic counterparts of the GCD algorithms to the standard library too, which could then be used uniformly by all crypto functions.
A bit insulting, but generally, that’s not the case. In fact, both libsodium and Go benefited from some of the tricks that were originally implemented in the Zig standard library. Regarding inversions: when constant-time behavior is required, they currently use either addition chains (for “nice” primes) or the Bernstein–Yang inversion implemented in native Zig code generated by fiat-crypto, which ensures correctness. So please don’t change that. That said, a generic EEA implementation is still useful for other scenarios, thanks for adding it!
I don't think I was, and I am sorry if I offended you in any way, but taking a quick look at the code reveals this:

```
// Autogenerated: 'src/ExtractionOCaml/word_by_word_montgomery' --lang Zig --internal-static --public-function-case camelCase --private-function-case camelCase --public-type-case UpperCamelCase --private-type-case UpperCamelCase --no-prefix-fiat --package-name p384 '' 64 '2^384 - 2^128 - 2^96 + 2^32 - 1' mul square add sub opp from_montgomery to_montgomery nonzero selectznz to_bytes from_bytes one msat divstep divstep_precomp
```

That being said, I cannot claim that I am able to write something better than what already exists, so maybe this is the only way to get the compiler to do what it is supposed to do?
As mentioned above, this is code generated by fiat-crypto, and it is done so in order to ensure correctness. I'm not really sure what other language this could have been copied from, when it's generated.
@Rexicon226, very kind of you to join this conversation and confirm the very thing I am saying: there has been code that's been "blindly copied over from some other language".
Auto-generation has its perks: you can quickly produce many variants of basically the same thing, and you remove the human element of errors. What you ultimately sacrifice, however, is readability, including clarity of intent, which helps users of the library understand what is implemented and why.

It took me a while, but I found the original formulation, or template, for this auto-generated code in the standard library: https://github.com/mit-plv/fiat-crypto/blob/master/src/Arithmetic/BYInv.v

Now, here are some of the key issues I have as a simple user of the standard library. These are my opinions; they might be shared by others or not, but nonetheless I want to voice them. If healthy communication is not welcome, I can respect that decision.
Having written out my opinions and complaints more elaborately, the only logical next step is to provide some suggestions regarding each point:
In conclusion, I can only reiterate: do not take my critique as an insult. I am trying to be constructive, or at least to voice a (perceived) issue in a neutral and professional manner. I never expected to have to write such a long explanation for a very simple (admittedly spicy) remark, but in my opinion it was justified, as you might now agree. I am happy to hear your perspectives on this matter and hope we can stay respectful to each other.
This is true. What we need from this code is correctness, not readability.
The problem here is the lack of constant-time-ness verification on the Zig AST, rather than any potential input sources for the Zig code being generated. This will be addressed in the future with #1776. It is not smarter to inline the assembler code because this
We require it to be correct. Not fast (fiat-crypto generates some pretty bad code all things considered), nor readable (that already exists in the form of
The same way you trust other cryptographic code in the standard library written and/or accepted by @jedisct1. If the level of trust you require is greater (for me personally, at work, we have some algorithms which we've written ourselves, with further testing and verification of correctness), then the Zig standard library isn't for you. And that's totally ok. I would also like to add that the underlying implementation of a finite field isn't really something that can be a source of vulnerability. Perhaps it could be written in a way that causes a program to panic, but implementing it wrong in a malicious way would just cause most algorithms written on top of it to not work. But that is a subjective argument, so take it as you will. These generated chunks of code are also pretty much never updated, so a supply-chain-based attack is unlikely to happen.
This defeats the entire purpose of using a formally verified generator for the implementation. And as mentioned before, we already have something like it in
I personally wouldn't be against adding this link in the doc comment above (zig/lib/std/crypto/pcurves/common.zig, line 199, commit 958faa7), but keep in mind that I am not a core member, so my opinions are my own :).
I guess this could make sense if we were actually re-generating these files at some point in time; however, its purpose is defeated by just the fact that the code is looked at and reviewed by Frank. If you are this concerned about correctness, I think that the finite field implementations are the last thing you should be worrying about. Understand that the Zig stdlib cryptography code is not audited. If your use cases require the utmost correctness, I do not believe it is the correct library for you to be using.
OpenSSL has experienced multiple severe carry propagation bugs in its finite field implementations (e.g., CVE-2017-3732, CVE-2017-3736 and CVE-2021-4160). Bugs can happen everywhere, including in hardware, but nowadays tools that guarantee correctness of the original source code exist, and Zig has the privilege of being a supported target of one of these tools. Not taking advantage of that would be a step back into the past. Using them gives us well-trusted implementations, one common representation, and one less thing to worry about. fiat-crypto can also generate platform-specific assembly code that retains the same public API but helps a lot to ensure that the code runs in constant time. This is something we can eventually take advantage of as well. But maybe we should get back to the PR. EGCD in
Ran a quick GCD benchmark: https://gist.github.com/jedisct1/0ddfe5484c6ea273c32efd73b40924c0
The improvement seems to be the usage of the exact shift builtins:

```diff
diff --git a/lib/std/math/gcd.zig b/lib/std/math/gcd.zig
--- a/lib/std/math/gcd.zig
+++ b/lib/std/math/gcd.zig
@@ -26,16 +26,16 @@ pub fn gcd(a: anytype, b: anytype) @TypeOf(a, b) {
     const xz = @ctz(x);
     const yz = @ctz(y);
     const shift = @min(xz, yz);
-    x >>= @intCast(xz);
-    y >>= @intCast(yz);
+    x = @shrExact(x, @intCast(xz));
+    y = @shrExact(y, @intCast(yz));
     var diff = y -% x;
     while (diff != 0) : (diff = y -% x) {
         const zeros = @ctz(diff);
         if (x > y) diff = -%diff;
         y = @min(x, y);
-        x = diff >> @intCast(zeros);
+        x = @shrExact(diff, @intCast(zeros));
     }
-    return y << @intCast(shift);
+    return @shlExact(y, @intCast(shift));
 }
```
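For what the exact-shift builtins buy semantically (a minimal sketch of my own, not part of the PR): `@shrExact` asserts that only zero bits are shifted out, which is guaranteed when the shift amount is the operand's own trailing-zero count.

```zig
const std = @import("std");

test "shifting by @ctz is always exact" {
    const x: u32 = 0b1011_0000;
    const tz = @ctz(x); // 4 trailing zero bits
    // @shrExact asserts that only zero bits are shifted out;
    // shifting by @ctz(x) satisfies that by construction.
    const odd = @shrExact(x, @intCast(tz));
    try std.testing.expectEqual(@as(u32, 0b1011), odd);
}
```

In safe build modes the exactness is checked at runtime; in release-fast it becomes an invariant the optimizer may exploit.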
Review comment on:

```zig
var x: S = @intCast(@abs(a));
var y: S = @intCast(@abs(b));

// Mantain a = s * x + t * y.
```

`Mantain` -> `Maintain`

Review comment on:

```zig
var s: S = std.math.sign(a);
var t: S = 0;

// Mantain b = u * x + v * y.
```

`Mantain` -> `Maintain`

Review comment on:

```zig
};

if (@typeInfo(N) != .int or @typeInfo(N).int.signedness != .unsigned) {
    @compileError("`a` and `b` must be usigned integers");
```

`usigned` -> `unsigned`
@jedisct1, I don't think it is: Compiler Explorer shows that changing the shifts into builtin calls for the version currently on master does not affect the generated machine code. I am in favour of changing the shifts into builtin calls regardless of whether that affects the output - they simply better encode the knowledge we have.
Yep. Doesn't hurt.
I am not happy with it at the moment because it might overflow, hence it only being a draft. Still, thank you all for taking the time.
I have spent a lot of time hand-optimizing the code for 64-bit words on x86 (on my Zen 2 architecture). It is crucial to calculate every step in the exact order I have written down to achieve the optimal assembler output, which I have written by hand:

```asm
	.text
	.globl gcd_hand_optimized
	.type gcd_hand_optimized, @function
gcd_hand_optimized:
	test %rdi, %rdi
	je .EARLY_RET_Y          # if zero, early return
	test %rsi, %rsi
	je .EARLY_RET_X          # if zero, early return
	tzcntq %rsi, %rcx        # remove tz from second input (ctz)
	tzcntq %rdi, %rdx        # remove tz from first input (ctz)
	shrxq %rdx, %rdi, %rdi   # remove tz from first input (shift)
	shrxq %rcx, %rsi, %rsi   # remove tz from second input (shift)
	cmpq %rcx, %rdx          # save minimum of shifts (compare)
	cmovbq %rdx, %rcx        # save minimum of shifts (move)
	movq %rsi, %r8           # create copy of y
	subq %rdi, %r8           # calculate y - x
.LOOP:
	movq %rdi, %r9           # create copy of x
	tzcntq %r8, %rdx         # saving zero count in dx
	subq %rsi, %rdi          # subtract y from x
	cmovbq %r8, %rdi         # move y - x to x if carry was set
	cmovbq %r9, %rsi         # replace y with x if carry was set
	shrxq %rdx, %rdi, %rdi   # remove tz from first input (shift)
	movq %rsi, %r8           # create copy of y
	subq %rdi, %r8           # calculate y - x
	jne .LOOP
	shlxq %rcx, %rsi, %rax   # return y with appropriate shift
	ret
.EARLY_RET_X:
	movq %rdi, %rax
	ret
.EARLY_RET_Y:
	movq %rsi, %rax
	ret
	.size gcd_hand_optimized, .-gcd_hand_optimized
```
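For anyone who wants to check that listing against the Zig implementation, here is a hedged sketch of how it could be wired up; the test name and PRNG setup are mine, and it assumes the assembly file is assembled and linked into the test binary on an x86-64 System V target (whose calling convention the listing follows):

```zig
const std = @import("std");

// Hand-written routine from the assembly listing above,
// assumed to be linked into this test binary.
extern fn gcd_hand_optimized(a: u64, b: u64) u64;

test "hand-optimized gcd matches std.math.gcd" {
    var prng = std.Random.DefaultPrng.init(42);
    const random = prng.random();
    for (0..10_000) |_| {
        // Keep both inputs nonzero, matching std.math.gcd's preconditions.
        const a = random.intRangeAtMost(u64, 1, std.math.maxInt(u64));
        const b = random.intRangeAtMost(u64, 1, std.math.maxInt(u64));
        try std.testing.expectEqual(std.math.gcd(a, b), gcd_hand_optimized(a, b));
    }
}
```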
Binary appears to be much faster. I use this naive implementation:

```zig
const std = @import("std");

/// (n / 2) mod m
fn halveMod(comptime T: type, n: T, m: T) T {
    if (n & 1 == 0) {
        return n >> 1;
    } else {
        // Widen by one bit so that n + m cannot overflow.
        const WideT = std.meta.Int(.unsigned, @bitSizeOf(T) + 1);
        const n_wide: WideT = n;
        const m_wide: WideT = m;
        const result = (n_wide + m_wide) >> 1;
        return @intCast(result);
    }
}

/// (a - b) mod m
fn subMod(comptime T: type, a: T, b: T, m: T) T {
    if (a >= b) {
        return a - b;
    } else {
        return m - (b - a);
    }
}

/// Returns the modular inverse of y modulo m, or 0 if it does not exist.
/// Requires m to be an odd integer >= 3, and 0 <= y < m
pub fn modInverse(comptime T: type, y: T, m: T) T {
    std.debug.assert(m >= 3);
    std.debug.assert(m & 1 == 1); // m must be odd
    std.debug.assert(y < m);

    var a: T = y;
    var u: T = 1;
    var b: T = m;
    var v: T = 0;

    while (a != 0) {
        if (a & 1 == 0) {
            a = a >> 1;
            u = halveMod(T, u, m);
        } else {
            if (a < b) {
                // Swap (a, u, b, v) ← (b, v, a, u)
                const temp_a = a;
                const temp_u = u;
                a = b;
                u = v;
                b = temp_a;
                v = temp_u;
            }
            // Now a >= b and both are odd
            // a ← (a − b)/2
            a = (a - b) >> 1;
            // u ← (u − v)/2 mod m
            u = halveMod(T, subMod(T, u, v, m), m);
        }
    }

    if (b != 1) return 0; // Inverse does not exist
    return v;
}
```

Needs to use signed integers if we really want both Bézout coefficients. Not sure that these micro-optimizations will make any practical difference, though.
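A quick sanity check for the sketch above (a hypothetical test of my own, assuming it lives in the same file as `modInverse`):

```zig
test "modInverse on small odd moduli" {
    // 3 * 4 = 12 ≡ 1 (mod 11), so the inverse of 3 modulo 11 is 4.
    try std.testing.expectEqual(@as(u64, 4), modInverse(u64, 3, 11));
    // gcd(5, 15) = 5, so 5 has no inverse modulo 15 and 0 is returned.
    try std.testing.expectEqual(@as(u64, 0), modInverse(u64, 5, 15));
}
```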
This line (in

Unfortunately, this week I am kind of swamped with work, but I will try to play around with some variants until I get an algorithm that guarantees no overflow and is fast enough.
The EGCD is used for cryptographic purposes, for example to find the inverse of a number modulo a prime.
I have implemented an iterative version of this algorithm.
The existing GCD has been micro-optimized to gain some extra speed (3-5% faster).
I purposefully implemented the EGCD with the Euclidean method instead of the binary GCD, because whenever you'd shift off n zeros inside the binary GCD, the coefficients would need to be multiplied by (2^-n) mod g (g being the gcd of the two inputs), or you'd need some other funky fix-up that costs performance.
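To make the modular-inverse use case concrete (a sketch of my own using the classic iterative extended Euclidean algorithm, not the PR's actual `std.math.egcd` signature, which is still in flux):

```zig
const std = @import("std");

/// Modular inverse via the iterative extended Euclidean algorithm.
/// Maintains the invariant old_r ≡ old_s * a (mod m), so when
/// old_r reaches gcd(a, m) == 1, old_s is the inverse of a mod m.
fn invMod(a: i64, m: i64) ?i64 {
    var old_r: i64 = a;
    var r: i64 = m;
    var old_s: i64 = 1;
    var s: i64 = 0;
    while (r != 0) {
        const q = @divTrunc(old_r, r);
        const next_r = old_r - q * r;
        old_r = r;
        r = next_r;
        const next_s = old_s - q * s;
        old_s = s;
        s = next_s;
    }
    if (old_r != 1) return null; // a is not invertible mod m
    return @mod(old_s, m); // normalize the Bézout coefficient into [0, m)
}

test "inverse of 3 mod 7 is 5" {
    try std.testing.expectEqual(@as(?i64, 5), invMod(3, 7));
}
```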