Derivatives (& generalizations like gradients) used widely across science
@@ -76,14 +76,14 @@ end

# ╔═╡ 651ab7be-2f87-4112-8304-ac8ecfbda5ff
md"""
-One could alternatively numerically approximate the derivatives by subtracting applying the formula from the top at a small value of h (also referred to as epsilon).
+Alternatively, one could numerically approximate the derivatives by applying the formula from the top with a small value of h (also referred to as epsilon).

for our relu function we might do something like the following
"""

# ╔═╡ 19f2b988-2780-11ef-09c9-675188aabe39
function finite_diff(f, x, h)
-    return (f(x+h) -f(x)) / h
+    return (f(x+h) - f(x)) / h
end

# ╔═╡ 13abf071-166d-4d5d-8f8d-bb8da95def01
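To make the finite-difference cell above concrete, here is a minimal usage sketch; the `relu` definition and sample points are assumptions for illustration (matching the relu discussed earlier in the notebook), not part of this diff:

```julia
# Sketch: relu as assumed from the earlier part of the notebook.
relu(x) = max(zero(x), x)

finite_diff(f, x, h) = (f(x + h) - f(x)) / h

# Approximate relu'(2.0); the true derivative for x > 0 is 1.
finite_diff(relu, 2.0, 1e-3)   # ≈ 1.0

# At the kink x = 0 the one-sided (forward) difference picks up the right-hand slope.
finite_diff(relu, 0.0, 1e-3)   # 1.0
```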
@@ -110,7 +110,7 @@ end
md"""
Finite differences seems to work decently well here, why use automatic differentiation?

-Let's try a function with more inputs, and take the entire gradient (e.g. derivative wrt all inputs.)
+Let's try a function with more inputs and take the entire gradient (i.e. the derivative with respect to all inputs).
"""

# ╔═╡ 63d9f6bc-4b65-4717-8d37-85e6f96dc4a3
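The `grad_big_fd(inputs, h)` helper referenced in the next hunk is not shown in this diff. As a rough sketch of what such a finite-difference gradient typically looks like (the name `grad_fd` and the example function below are made up for illustration), one perturbs each input in turn, paying one extra function evaluation per input:

```julia
# Finite-difference gradient: perturb one coordinate at a time.
function grad_fd(f, x::AbstractVector, h)
    g  = similar(x)
    fx = f(x)                      # evaluate the base point once
    for i in eachindex(x)
        xp = copy(x)
        xp[i] += h
        g[i] = (f(xp) - fx) / h    # one extra evaluation of f per input
    end
    return g
end

f(x) = sum(abs2, x)                # example: f(x) = Σ xᵢ², so ∇f(x) = 2x
grad_fd(f, [1.0, 2.0, 3.0], 1e-3)  # ≈ [2.0, 4.0, 6.0]
```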
@@ -173,19 +173,19 @@ grad_big_fd(inputs, 0.01)

# ╔═╡ 87d08aad-18cc-4bbd-b8ca-644e4a01e8d2
md"""
-That's a bit better, and also makes sense since the derivative is the limit as h approaches 0. So we'll always have some theoretical error if h is non-zero. Let's make h as close to 0 as possible, and see if we can get almost the same anwer!
+That's a bit better, and it makes sense: the derivative is the limit as h approaches 0, so we'd expect some theoretical error whenever h is non-zero. Let's make h as close to 0 as possible and see if we can get almost the same answer!
"""

# ╔═╡ 7336d600-4d66-4d65-a2cc-f4192ec10507
grad_big_fd(inputs, 1e-15)

# ╔═╡ 5e14790e-3ee9-4865-95df-9f0454a62c5b
md"""
-Uh oh the errors got worse! At one point, you'll get correctness issues due to floating point error. Computers don't represent real numbers to perfect accuracy, but instead represent things in a form of scientific notation 1.23456 * 10^3. The computer only has so many bits for the "matissa" aka the left hand side.
+Uh oh, the errors got worse! At some point you'll run into correctness issues due to floating point error. Computers don't represent real numbers with perfect accuracy; instead they store them in a form of scientific notation, e.g. 1.23456 * 10^3. The computer only has so many bits for the "mantissa", i.e. the left-hand side.

What we're doing with our approximation is essentially computing (A + df) - A. As A gets really big compared to df, we lose most of the precision of our numbers!

-For example, lets suppose we only have 9 digits of precision.
+For example, let's suppose we only have 9 digits of precision.

```
A = 100,000,000.0
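# --- As a quick Julia aside (illustrative values, not from the notebook): the same
# --- effect appears with Float64, which carries roughly 15-16 significant digits.
# (A + df) - A loses precision as A grows relative to df.
df = 1.0e-7
for A in (1.0, 1.0e8, 1.0e16)
    recovered = (A + df) - A
    println("A = ", A, "   recovered df = ", recovered)
end
# At A = 1.0e16 no mantissa bits are left for df, and the recovered value is 0.0.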
@@ -298,7 +298,7 @@ md"""
298
298
One way of thinking about forward mode automatic differentiation is through the lens of dual numbers.
299
299
300
300
301
-
Consider a taylor series for `f(x)`. `f'(x)` is the epsilon coefficient.
301
+
Consider a Taylor series for `f(x)`. `f'(x)` is the epsilon coefficient.
@@ -325,7 +325,7 @@ Let's try add. Let's take two dual numbers and add them up.
(a + b \epsilon) + (c + d \epsilon) = (a + c) + (b + d)\epsilon
```
-What about mulitiply?
+What about multiply?


```math
@@ -337,7 +337,7 @@ Here we find three terms going all the way up to epsilon^2. However, since we kn
ac + (ad + bc) \epsilon
```.

-Enzyme implements these rules or all the primitive LLVM instructions. By implementing the derivative rule for all these instructions, one can apply the chain rule between them to find the total derivative of a computation. Let's see how it works in practice.
+Enzyme implements these rules for all the primitive LLVM instructions. With a derivative rule for each primitive, one can apply the chain rule between them to find the total derivative of a computation. Let's see how it works in practice.
"""

# ╔═╡ 738fa258-8fe8-472a-b36d-e78b095b5a77
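The addition and multiplication rules in the two hunks above translate almost directly into code. Below is a minimal dual-number sketch that implements exactly those two rules; it is an illustration, not Enzyme's internal representation:

```julia
# A dual number a + b*ε with ε² = 0; `der` holds the ε coefficient.
struct Dual
    val::Float64
    der::Float64
end

# (a + bε) + (c + dε) = (a + c) + (b + d)ε
Base.:+(x::Dual, y::Dual) = Dual(x.val + y.val, x.der + y.der)

# (a + bε)(c + dε) = ac + (ad + bc)ε        (the ε² term is dropped)
Base.:*(x::Dual, y::Dual) = Dual(x.val * y.val, x.val * y.der + x.der * y.val)

# Differentiate g(x) = x*x + x at x = 3 by seeding the ε coefficient with 1.
g(x) = x * x + x
g(Dual(3.0, 1.0))   # Dual(12.0, 7.0): value 12, derivative 2*3 + 1 = 7
```

Seeding `der = 1.0` on the input of interest mirrors how forward-mode tools seed their shadow inputs.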
@@ -474,7 +474,7 @@ end
md"""
Reverse mode is often the more common way that automatic differentiation is used in practice.

-Rather than derivatives flowing from input to output as we saw in forward mode, in reverse mode derivatives from from differentiable result to differentiable input (as the name would imply).
+Rather than derivatives flowing from input to output as we saw in forward mode, in reverse mode derivatives flow from the differentiable result to the differentiable inputs (as the name would imply).

A reverse-mode AD algorithm would often:
1) Initialize the derivative result to 1 (e.g. marking the derivative of the result with respect to itself as 1).
@@ -484,7 +484,7 @@ A reverse-mode AD algorithm would often:

The reason for the added complexity of reverse-mode is two-fold:
* First, one must execute the program backwards. For programs with loops, branching, or nontrivial control flow, setting up this infrastructure may be non-trivial.
-* Second, unlike forward mode where the derivative is immediately available upon executing the shadow instruction, in reverse mode we add up all the partial derivatives. This is because if a variable is used multiple times, we can get multiple terms to the derivative from the partials of each of its users.
+* Second, derivatives with respect to input variables are accumulated (`+=`). This is because the derivative with respect to a variable that gets used multiple times is the sum of the derivative contributions from each of its users.

Let's look at our simple_mlp example, but do it by hand this time.
@@ -535,4 +535,4 @@
-But what is this new `Active` annotation. It is only available and reverse mode and refers to a differentiable variable which is immutable. As a result the autodiff function will return a tuple of the derivatives of all active variables. Or, more specifically, the first element of its return will be such a tuple.
+But what is this new `Active` annotation? It is only available in reverse mode and refers to a differentiable variable which is immutable. As a result, the autodiff function will return a tuple of the derivatives of all active variables. Or, more specifically, the first element of its return will be such a tuple.

You can also ask Enzyme to return both the derivative and the original result (also known as the primal), like below.
"""
@@ -580,7 +580,7 @@ end
md"""
Here we now need to also pass Enzyme a shadow return value, but what values should it be?

-For differentiable inputs we passed in 0's since the derivativess will be +='d into it.
+For differentiable inputs we passed in 0's, since the derivatives will be +='d into them.

For differentiable outputs, we will pass in the partial derivative we want to propagate backwards from it. If we simply want the derivative of out[1], let's set it to 1.0.
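To make the shadow-return discussion concrete, here is a sketch of the usual Enzyme.jl pattern for a function that writes into an output array. The function `g!` and the concrete sizes are made up for illustration; the point is the seeding convention (zeros for shadow inputs, 1.0 at the output entry of interest):

```julia
using Enzyme

# Made-up mutating function: out[1] = x[1]*x[2], out[2] = x[1] + x[2].
function g!(out, x)
    out[1] = x[1] * x[2]
    out[2] = x[1] + x[2]
    return nothing
end

x    = [2.0, 3.0]
dx   = zeros(2)      # shadow input: derivatives get +='d into this
out  = zeros(2)
dout = [1.0, 0.0]    # shadow output: seed 1.0 at out[1], the value we differentiate

autodiff(Reverse, g!, Const, Duplicated(out, dout), Duplicated(x, dx))

dx   # ≈ [3.0, 2.0], the gradient of out[1] = x[1]*x[2]
```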
@@ -632,16 +632,16 @@ Like forward mode we can generalize the equation of what reverse mode computes.

Suppose our function `f(x, y, z)` returns 3 variables `f1`, `f2`, and `f3`.

-Supposing that we had intiialized shadow returns `df1`, `df2`, and `df3`, reverse mode will compute:
+Supposing that we had initialized shadow returns `d_df1`, `d_df2`, and `d_df3`, reverse mode will compute:

```math
-dx += \nabla_x f1(x, y, z) * df1 + \nabla_x f2(x, y, z) * df2 + \nabla_x f3(x, y, z) * df3
+d_dx += \nabla_x f1(x, y, z) * d_df1 + \nabla_x f2(x, y, z) * d_df2 + \nabla_x f3(x, y, z) * d_df3
```
```math
-dy += \nabla_y f1(x, y, z) * df1 + \nabla_y f2(x, y, z) * df2 + \nabla_y f3(x, y, z) * df3
+d_dy += \nabla_y f1(x, y, z) * d_df1 + \nabla_y f2(x, y, z) * d_df2 + \nabla_y f3(x, y, z) * d_df3
```
```math
-dz += \nabla_z f1(x, y, z) * df1 + \nabla_z f2(x, y, z) * df2 + \nabla_z f3(x, y, z) * df3
+d_dz += \nabla_z f1(x, y, z) * d_df1 + \nabla_z f2(x, y, z) * d_df2 + \nabla_z f3(x, y, z) * d_df3
```

Similarly, Reverse mode is often referred to as computing a vector jacobian product by the literature/other tools.
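Stacking the three accumulation rules above gives the matrix form behind the phrase "vector-Jacobian product"; this is just a compact restatement of the equations in this hunk, with J the Jacobian of f:

```math
\begin{pmatrix} d_dx \\ d_dy \\ d_dz \end{pmatrix} +=
J^\top \begin{pmatrix} d_df1 \\ d_df2 \\ d_df3 \end{pmatrix},
\qquad
J =
\begin{pmatrix}
\nabla_x f1 & \nabla_y f1 & \nabla_z f1 \\
\nabla_x f2 & \nabla_y f2 & \nabla_z f2 \\
\nabla_x f3 & \nabla_y f3 & \nabla_z f3
\end{pmatrix}
```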
@@ -790,7 +790,7 @@ Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzym
$(LocalResource("./gpu.png"))

### Multicore Benchmarks
-This also applies to multicore/multi node paralleism as well. The blue lines in this plot denotes the performance of an application as it is scaled with multiple threads (up to 64). The green lines denote the performance of the gradient. On the left plot, standard optimizations (but no novel parallel optimizations) are applied. We see that the performance steadies out after around ~10 threads. On the right hand side, we now apply parallel optimizations prior to AD. We see that the performance continues to scale similarly to that of the original computation, subject to a hiccup at 32-threads (the machine has two distinct CPU sockets, each which 32 threads, so after 32 threads one needs to coordinate across sockets).
+This also applies to multicore/multi-node parallelism. The blue lines in this plot denote the performance of an application as it is scaled with multiple threads (up to 64). The green lines denote the performance of the gradient. On the left plot, standard optimizations (but no novel parallel optimizations) are applied: the performance levels off after around ~10 threads. On the right-hand side, parallel optimizations are applied prior to AD: the performance continues to scale similarly to that of the original computation, apart from a hiccup at 32 threads (the machine has two distinct CPU sockets, each with 32 threads, so beyond 32 threads one needs to coordinate across sockets).