-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathHOWTOUSE
149 lines (127 loc) · 3.87 KB
/
HOWTOUSE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
Just copy gf.c and gf.h to your program directory.
Include gf.h
#include "gf.h"
and initialize (call before all GF16 arithmetic functions)
GF16init();
For multiplication c = a * b, use
uint16_t a, b, c;
c = GF16mul(a, b);
For division c = a / b, use
uint16_t a, b, c;
c = GF16div(a, b);
For repeated (regional) computation like:
uint16_t a, x[];
for (i = 0; i < N; i++) {
b = GFmul(a, x[i]);
}
where 'a' is static and 'x' is an array, you can speed up by the following
region calculation technique.
Change the above code to:
uint16_t a, x[], *gf_a, *_x;
gf_a = GF16memL + GF16memIdx[a];
_x = x;
for (i = 0; i < N; i++) {
b = gf_a[GF16memIdx[*_x]];
_x++;
}
For division:
uint16_t a, x[];
for (i = 0; i < N; i++) {
b = GFdiv(a, x[i]);
}
use
uint16_t a, x[], *gf_a, *_x;
gf_a = GF16memH + GF16memIdx[a];
_x = x;
for (i = 0; i < N; i++) {
b = *(gf_a - GF16memIdx[*_x]);
_x++;
}
For another division such as:
uint16_t a, x[];
for (i = 0; i < N; i++) {
b = GFdiv(x[i], a);
}
do
uint16_t a, x[], *gf_a, *_x;
gf_a = GF16memH - GF16memIdx[a];
_x = x;
for (i = 0; i < N; i++) {
b = gf_a[GF16memIdx[*_x]];
_x++;
}
For more speedup in region calculation, see our technical paper.
Especially, GF16crt4bitRegTbl() + the table lookup technique with SIMD shows
superior performance.
GF16crtRegTbl() technique:
Create a lookup table for regional calculation such as:
a * x[i]
a / x[i]
x[i] / a
This method uses 128kB for the table and will run in L2 cache.
For a * x[i],
uint16_t *gf_a = GF16crtRegTbl(a, 0);
for (i = 0; i < N; i++) {
y[i] = gf_a[x[i]]; // This is equal to GFmul(a, x[i]);
// Or use y[i] = GF16LkupRT(gf_a, x[i]);
}
free(gf_a);
For a / x[i],
uint16_t *gf_a = GF16crtRegTbl(a, 1);
for (i = 0; i < N; i++) {
y[i] = gf_a[x[i]]; // This is equal to GFdiv(a, x[i]);
// Or use y[i] = GF16LkupRT(gf_a, x[i]);
}
free(gf_a);
For x[i] / a,
uint16_t *gf_a = GF16crtRegTbl(a, 2);
for (i = 0; i < N; i++) {
y[i] = gf_a[x[i]]; // This is equal to GFdiv(x[i], a);
// Or use y[i] = GF16LkupRT(gf_a, x[i]);
}
free(gf_a);
GF16crtSpltRegTbl technique:
Create two lookup tables with 256 entries for regional calculation such as:
a * x[i]
a / x[i]
x[i] / a
This method requires only 1kB for the tables and will run in L1 cache.
For a * x[i],
uint16_t *gf_a_l = GF16crtSpltRegTbl(a, 0);
uint16_t *gf_a_h = gf_a_l + 256;
uint16_t xi;
for (i = 0; i < N; i++) {
xi = x[i];
y[i] = gf_a_h[xi >> 8] ^ gf_a_l[xi & 0xff]; // = GFmul(a, x[i]);
// Or use y[i] = GF16LkupSRT(gf_a_l, gf_a_h, xi);
}
free(gf_a_l);
For a / x[i],
uint16_t *gf_a_l = GF16crtSpltRegTbl(a, 1);
uint16_t *gf_a_h = gf_a_l + 256;
uint16_t xi;
for (i = 0; i < N; i++) {
xi = x[i];
y[i] = gf_a_h[xi >> 8] ^ gf_a_l[xi & 0xff]; // = GFdiv(a, x[i]);
// Or use y[i] = GF16LkupSRT(gf_a_l, gf_a_h, xi);
}
free(gf_a_l);
For x[i] / a,
uint16_t *gf_a_l = GF16crtSpltRegTbl(a, 1);
uint16_t *gf_a_h = gf_a_l + 256;
uint16_t xi;
for (i = 0; i < N; i++) {
xi = x[i];
y[i] = gf_a_h[xi >> 8] ^ gf_a_l[xi & 0xff]; // = GFdiv(x[i], a);
// Or use y[i] = GF16LkupSRT(gf_a_l, gf_a_h, xi);
}
free(gf_a_l);
GF16crt4bitRegTbl + SIMD technique:
Create eight lookup tables with 16 entries for regional calculation such as:
a * x[i]
a / x[i]
x[i] / a
This method runs with SIMD.
For the details, please see
gf-bench/multiplication/gf-nishida-region-16/gf-bench.c
See gf-bench/*/gf-nishida-region-16/gf-bench.c for sample code.