Skip to content

Commit 81cdc3c

Browse files
authored
Merge pull request #3 from aerobus-open-source/joss2
more in paper.md
2 parents 0659c95 + c8e4a0d commit 81cdc3c

File tree

2 files changed

+100
-90
lines changed

2 files changed

+100
-90
lines changed

paper.md

Lines changed: 100 additions & 90 deletions
Original file line numberDiff line numberDiff line change
@@ -66,12 +66,12 @@ $$a_0 + a_1X + \ldots + a_nX^n$$
6666

6767
where $a_n \neq 0$ if $n \neq 0$.
6868

69-
$(a_i)$, the coefficients, are elements of $A$. Theory states that if $A$ is a field, then $A[X]$ is euclidean. That means notions like division of gcd have a meaning, yielding an arithmetic of polynomials.
69+
$(a_i)$, the coefficients, are elements of $\mathbb{A}$. Theory states that if $A$ is a field, then $\mathbb{A}[X]$ is euclidean. That means notions like division of gcd have a meaning, yielding an arithmetic of polynomials.
7070

7171

7272
## Field of fractions
73-
If $A$ is euclidean, we can build it's field of fractions: the smallest field containg $\mathbb{A}$.
74-
We construct is a congruences classes of $A\times A$ with respect to the relation $(p,q) \sim (pp, qq)\ \mathrm{iff}\ p*qq = q*pp$. Basic algebra shows that this is a field (every element has an inverse). Canonical example is $\mathbb{Q}$, the set of rational numbers.
73+
If $\mathbb{A}$ is euclidean, we can build it's field of fractions: the smallest field containg $\mathbb{A}$.
74+
We construct is a congruences classes of $\mathbb{A}\times \mathbb{A}$ with respect to the relation $(p,q) \sim (pp, qq)\ \mathrm{iff}\ p*qq = q*pp$. Basic algebra shows that this is a field (every element has an inverse). Canonical example is $\mathbb{Q}$, the set of rational numbers.
7575

7676
Given polynomials over a field form an euclidean ring, we can do the same construction and get rational fractions $P(x) / Q(X)$ where $P$ and $Q$ are polynomials.
7777

@@ -111,7 +111,7 @@ concept IsEuclideanDomain = IsRing<R> && requires {
111111
R::is_euclidean_domain == true;
112112
};
113113
```
114-
which express the algebraic objects described above. Then, as long as a type satisfies the IsEuclideanDomain concept, we can calculate greated common divisor of two values of this type using euclidean algorithm. As stated above, this algorithm operates on types instead of values and does not depend on the Ring, making possible for user to implement another kind of discrete integral domain without worring about that kind of algorithm :
114+
which express the algebraic objects described above. Then, as long as a type satisfies the IsEuclideanDomain concept, we can calculate greatest common divisor of two values of this type using euclidean algorithm. As stated above, this algorithm operates on types instead of values and does not depend on the Ring, making possible for user to implement another kind of discrete euclidean domain without worrying about that kind of algorithm :
115115

116116
```C++
117117
template<typename Ring>
@@ -204,18 +204,18 @@ V AND xx are computed at compile time, yielding the following assembly (clang 17
204204

205205
```assembly
206206
.LCPI0_0:
207-
.quad 0x3fbaec7b35a00d3a # double 0.10517091807564763
208-
main: # @main
209-
push rax
210-
lea rdi, [rip + .L.str]
211-
movsd xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
212-
mov al, 1
213-
call printf@PLT
214-
xor eax, eax
215-
pop rcx
216-
ret
207+
.quad 0x3fbaec7b35a00d3a # double 0.10517091807564763
208+
main: # @main
209+
push rax
210+
lea rdi, [rip + .L.str]
211+
movsd xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
212+
mov al, 1
213+
call printf@PLT
214+
xor eax, eax
215+
pop rcx
216+
ret
217217
.L.str:
218-
.asciz "%lf\n"
218+
.asciz "%lf\n"
219219
```
220220

221221
## Evaluations on variables
@@ -231,56 +231,56 @@ double expm1(const double x) {
231231
again, coefficients are all computed compile time, yielding following assembly (given processor supports fused multiply add) :
232232
```assembly
233233
.LCPI0_0:
234-
.quad 0x3de6124613a86d09 # double 1.6059043836821613E-10
234+
.quad 0x3de6124613a86d09 # double 1.6059043836821613E-10
235235
.LCPI0_1:
236-
.quad 0x3e21eed8eff8d898 # double 2.08767569878681E-9
236+
.quad 0x3e21eed8eff8d898 # double 2.08767569878681E-9
237237
.LCPI0_2:
238-
.quad 0x3e5ae64567f544e4 # double 2.505210838544172E-8
238+
.quad 0x3e5ae64567f544e4 # double 2.505210838544172E-8
239239
.LCPI0_3:
240-
.quad 0x3e927e4fb7789f5c # double 2.7557319223985888E-7
240+
.quad 0x3e927e4fb7789f5c # double 2.7557319223985888E-7
241241
.LCPI0_4:
242-
.quad 0x3ec71de3a556c734 # double 2.7557319223985893E-6
242+
.quad 0x3ec71de3a556c734 # double 2.7557319223985893E-6
243243
.LCPI0_5:
244-
.quad 0x3efa01a01a01a01a # double 2.4801587301587302E-5
244+
.quad 0x3efa01a01a01a01a # double 2.4801587301587302E-5
245245
.LCPI0_6:
246-
.quad 0x3f2a01a01a01a01a # double 1.9841269841269841E-4
246+
.quad 0x3f2a01a01a01a01a # double 1.9841269841269841E-4
247247
.LCPI0_7:
248-
.quad 0x3f56c16c16c16c17 # double 0.0013888888888888889
248+
.quad 0x3f56c16c16c16c17 # double 0.0013888888888888889
249249
.LCPI0_8:
250-
.quad 0x3f81111111111111 # double 0.0083333333333333332
250+
.quad 0x3f81111111111111 # double 0.0083333333333333332
251251
.LCPI0_9:
252-
.quad 0x3fa5555555555555 # double 0.041666666666666664
252+
.quad 0x3fa5555555555555 # double 0.041666666666666664
253253
.LCPI0_10:
254-
.quad 0x3fc5555555555555 # double 0.16666666666666666
254+
.quad 0x3fc5555555555555 # double 0.16666666666666666
255255
.LCPI0_11:
256-
.quad 0x3fe0000000000000 # double 0.5
256+
.quad 0x3fe0000000000000 # double 0.5
257257
.LCPI0_12:
258-
.quad 0x3ff0000000000000 # double 1
258+
.quad 0x3ff0000000000000 # double 1
259259
expm1(double): # @expm1(double)
260-
vxorpd xmm1, xmm1, xmm1
261-
vmovsd xmm2, qword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero
262-
vfmadd231sd xmm2, xmm0, xmm1 # xmm2 = (xmm0 * xmm1) + xmm2
263-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_1] # xmm2 = (xmm0 * xmm2) +
264-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_2] # xmm2 = (xmm0 * xmm2) + mem
265-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_3] # xmm2 = (xmm0 * xmm2) + mem
266-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_4] # xmm2 = (xmm0 * xmm2) + mem
267-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_5] # xmm2 = (xmm0 * xmm2) + mem
268-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_6] # xmm2 = (xmm0 * xmm2) + mem
269-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_7] # xmm2 = (xmm0 * xmm2) + mem
270-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_8] # xmm2 = (xmm0 * xmm2) + mem
271-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_9] # xmm2 = (xmm0 * xmm2) + mem
272-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_10] # xmm2 = (xmm0 * xmm2) + mem
273-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_11] # xmm2 = (xmm0 * xmm2) + mem
274-
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_12] # xmm2 = (xmm0 * xmm2) + mem
275-
vfmadd213sd xmm0, xmm2, xmm1 # xmm0 = (xmm2 * xmm0) + xmm1
276-
ret
260+
vxorpd xmm1, xmm1, xmm1
261+
vmovsd xmm2, qword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero
262+
vfmadd231sd xmm2, xmm0, xmm1
263+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_1]
264+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_2]
265+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_3]
266+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_4]
267+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_5]
268+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_6]
269+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_7]
270+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_8]
271+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_9]
272+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_10]
273+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_11]
274+
vfmadd213sd xmm2, xmm0, qword ptr [rip + .LCPI0_12]
275+
vfmadd213sd xmm0, xmm2, xmm1
276+
ret
277277
```
278278

279279
## Apply on vectors and get proper vectorization
280280
If applied to a vector of data, with proper compiler hints, gcc can easily generate vectorized version of the code :
281281

282282
```c++
283-
double compute_expm1(const size_t N, const double* const __restrict in, double* const __restrict out) {
283+
double compute_expm1(const size_t N, double* in, double* out) {
284284
using V = aerobus::expm1<aerobus::i64, 13>;
285285
for (size_t i = 0; i < N; ++i) {
286286
out[i] = V::eval(in[i]);
@@ -292,51 +292,51 @@ yielding :
292292
293293
```assembly
294294
compute_expm1(unsigned long, double const*, double*):
295-
lea rax, [rdi-1]
296-
cmp rax, 2
297-
jbe .L5
298-
mov rcx, rdi
299-
xor eax, eax
300-
vxorpd xmm1, xmm1, xmm1
301-
vbroadcastsd ymm14, QWORD PTR .LC1[rip]
302-
vbroadcastsd ymm13, QWORD PTR .LC3[rip]
303-
shr rcx, 2
304-
vbroadcastsd ymm12, QWORD PTR .LC5[rip]
305-
vbroadcastsd ymm11, QWORD PTR .LC7[rip]
306-
sal rcx, 5
307-
vbroadcastsd ymm10, QWORD PTR .LC9[rip]
308-
vbroadcastsd ymm9, QWORD PTR .LC11[rip]
309-
vbroadcastsd ymm8, QWORD PTR .LC13[rip]
310-
vbroadcastsd ymm7, QWORD PTR .LC15[rip]
311-
vbroadcastsd ymm6, QWORD PTR .LC17[rip]
312-
vbroadcastsd ymm5, QWORD PTR .LC19[rip]
313-
vbroadcastsd ymm4, QWORD PTR .LC21[rip]
314-
vbroadcastsd ymm3, QWORD PTR .LC23[rip]
315-
vbroadcastsd ymm2, QWORD PTR .LC25[rip]
295+
lea rax, [rdi-1]
296+
cmp rax, 2
297+
jbe .L5
298+
mov rcx, rdi
299+
xor eax, eax
300+
vxorpd xmm1, xmm1, xmm1
301+
vbroadcastsd ymm14, QWORD PTR .LC1[rip]
302+
vbroadcastsd ymm13, QWORD PTR .LC3[rip]
303+
shr rcx, 2
304+
vbroadcastsd ymm12, QWORD PTR .LC5[rip]
305+
vbroadcastsd ymm11, QWORD PTR .LC7[rip]
306+
sal rcx, 5
307+
vbroadcastsd ymm10, QWORD PTR .LC9[rip]
308+
vbroadcastsd ymm9, QWORD PTR .LC11[rip]
309+
vbroadcastsd ymm8, QWORD PTR .LC13[rip]
310+
vbroadcastsd ymm7, QWORD PTR .LC15[rip]
311+
vbroadcastsd ymm6, QWORD PTR .LC17[rip]
312+
vbroadcastsd ymm5, QWORD PTR .LC19[rip]
313+
vbroadcastsd ymm4, QWORD PTR .LC21[rip]
314+
vbroadcastsd ymm3, QWORD PTR .LC23[rip]
315+
vbroadcastsd ymm2, QWORD PTR .LC25[rip]
316316
.L3:
317-
vmovupd ymm15, YMMWORD PTR [rsi+rax]
318-
vmovapd ymm0, ymm15
319-
vfmadd132pd ymm0, ymm14, ymm1
320-
vfmadd132pd ymm0, ymm13, ymm15
321-
vfmadd132pd ymm0, ymm12, ymm15
322-
vfmadd132pd ymm0, ymm11, ymm15
323-
vfmadd132pd ymm0, ymm10, ymm15
324-
vfmadd132pd ymm0, ymm9, ymm15
325-
vfmadd132pd ymm0, ymm8, ymm15
326-
vfmadd132pd ymm0, ymm7, ymm15
327-
vfmadd132pd ymm0, ymm6, ymm15
328-
vfmadd132pd ymm0, ymm5, ymm15
329-
vfmadd132pd ymm0, ymm4, ymm15
330-
vfmadd132pd ymm0, ymm3, ymm15
331-
vfmadd132pd ymm0, ymm2, ymm15
332-
vfmadd132pd ymm0, ymm1, ymm15
333-
vmovupd YMMWORD PTR [rdx+rax], ymm0
334-
add rax, 32
335-
cmp rcx, rax
336-
jne .L3
337-
mov rax, rdi
338-
and rax, -4
339-
vzeroupper
317+
vmovupd ymm15, YMMWORD PTR [rsi+rax]
318+
vmovapd ymm0, ymm15
319+
vfmadd132pd ymm0, ymm14, ymm1
320+
vfmadd132pd ymm0, ymm13, ymm15
321+
vfmadd132pd ymm0, ymm12, ymm15
322+
vfmadd132pd ymm0, ymm11, ymm15
323+
vfmadd132pd ymm0, ymm10, ymm15
324+
vfmadd132pd ymm0, ymm9, ymm15
325+
vfmadd132pd ymm0, ymm8, ymm15
326+
vfmadd132pd ymm0, ymm7, ymm15
327+
vfmadd132pd ymm0, ymm6, ymm15
328+
vfmadd132pd ymm0, ymm5, ymm15
329+
vfmadd132pd ymm0, ymm4, ymm15
330+
vfmadd132pd ymm0, ymm3, ymm15
331+
vfmadd132pd ymm0, ymm2, ymm15
332+
vfmadd132pd ymm0, ymm1, ymm15
333+
vmovupd YMMWORD PTR [rdx+rax], ymm0
334+
add rax, 32
335+
cmp rcx, rax
336+
jne .L3
337+
mov rax, rdi
338+
and rax, -4
339+
vzeroupper
340340
```
341341

342342
# Misc
@@ -375,6 +375,16 @@ using PI_fraction = ContinuedFraction<3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14,
375375

376376
then, you can have the corresponding rational number by using `PI_fraction::type` and a computation with `PI_fraction::val`.
377377

378+
379+
380+
# Benchmarks
381+
We compare to `vml` and to the standard library in the file "benchmarks.cpp".
382+
Benchmark is quite simple and test compute intensive operation : computing sinus (compound twelve times) of all elements of a large double buffer of values (larger than cache). We run code on a Asus expertbook, equipped with an Intel i7-1195G7 @ 2.90GHz.
383+
384+
![Performance comparison between std::math, VML and Aerobus in compound sinus computation (double precision)](performance.png)
385+
386+
387+
378388
# Acknowledgements
379389

380390
Many thanks to my math teachers, A. Soyeur and M. Gonnord. I also acknowledge indirect contributions from F. Duguet, who basically learnt me all I know in C++.

performance.png

14.7 KB
Loading

0 commit comments

Comments
 (0)