Commit ff8e160

Merge pull request #10 from samuelpmish/main

fix some typos in enzyme_tut.jl

2 parents 5e3245d + 7e2b9d1

1 file changed (+22, -22 lines)


Enzyme/enzyme_tut.jl

Lines changed: 22 additions & 22 deletions
@@ -36,7 +36,7 @@ md"""
 Derivatives compute the rate of change of a function’s output with respect to input(s)

 ```math
-f'(x) = \lim_{h\rightarrow 0} \frac{f(a+h)-f(a)}{h}
+f'(x) \coloneqq \lim_{h\rightarrow 0} \frac{f(x+h)-f(x)}{h}
 ```

 Derivatives (& generalizations like gradients) used widely across science
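For a concrete instance of the definition above, applying it to f(x) = x^2 gives:

```math
f'(x) = \lim_{h\rightarrow 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h\rightarrow 0} \frac{2xh + h^2}{h} = 2x
```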
@@ -76,14 +76,14 @@ end

 # ╔═╡ 651ab7be-2f87-4112-8304-ac8ecfbda5ff
 md"""
-One could alternatively numerically approximate the derivatives by subtracting applying the formula from the top at a small value of h (also referred to as epsilon).
+Alternatively, one could numerically approximate the derivatives by applying the formula from the top with a small value of h (also referred to as epsilon).

 for our relu function we might do something like the following
 """

 # ╔═╡ 19f2b988-2780-11ef-09c9-675188aabe39
 function finite_diff(f, x, h)
-    return (f(x+h) - f(x) ) / h
+    return (f(x+h) - f(x)) / h
 end

 # ╔═╡ 13abf071-166d-4d5d-8f8d-bb8da95def01
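For illustration, a quick sketch of calling `finite_diff` on a relu; the `relu` definition here is an assumption for the example (the notebook defines its own earlier in the file, outside this diff):

```julia
# assumed relu for illustration; the notebook's definition is not shown in this diff
relu(x) = max(zero(x), x)

finite_diff(relu, 2.0, 0.001)    # ≈ 1.0  (relu is the identity for x > 0)
finite_diff(relu, -2.0, 0.001)   # ≈ 0.0  (relu is constant for x < 0)
```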
@@ -110,7 +110,7 @@ end
 md"""
 Finite differences seems to work decently well here, why use automatic differentiation?

-Let's try a function with more inputs, and take the entire gradient (e.g. derivative wrt all inputs.)
+Let's try a function with more inputs and take the entire gradient (i.e. the derivative with respect to all inputs).
 """

 # ╔═╡ 63d9f6bc-4b65-4717-8d37-85e6f96dc4a3
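The `grad_big_fd` helper used in the next hunk is not defined within this diff; a hedged sketch of what a finite-difference gradient over all inputs can look like:

```julia
# illustrative stand-in, not the notebook's grad_big_fd:
# perturb one input at a time and apply the finite-difference formula
function grad_fd_sketch(f, xs, h)
    grad = similar(xs)
    for i in eachindex(xs)
        xs_plus = copy(xs)
        xs_plus[i] += h
        grad[i] = (f(xs_plus) - f(xs)) / h
    end
    return grad
end

grad_fd_sketch(sum, [1.0, 2.0, 3.0], 1e-3)   # ≈ [1.0, 1.0, 1.0]
```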
@@ -173,19 +173,19 @@ grad_big_fd(inputs, 0.01)

 # ╔═╡ 87d08aad-18cc-4bbd-b8ca-644e4a01e8d2
 md"""
-That's a bit better, and also makes sense since the derivative is the limit as h approaches 0. So we'll always have some theoretical error if h is non-zero. Let's make h as close to 0 as possible, and see if we can get almost the same anwer!
+That's a bit better. It makes sense, since the derivative is the limit as h approaches 0. So we'd expect some theoretical error if h is non-zero. Let's make h as close to 0 as possible, and see if we can get almost the same answer!
 """

 # ╔═╡ 7336d600-4d66-4d65-a2cc-f4192ec10507
 grad_big_fd(inputs, 1e-15)

 # ╔═╡ 5e14790e-3ee9-4865-95df-9f0454a62c5b
 md"""
-Uh oh the errors got worse! At one point, you'll get correctness issues due to floating point error. Computers don't represent real numbers to perfect accuracy, but instead represent things in a form of scientific notation 1.23456 * 10^3. The computer only has so many bits for the "matissa" aka the left hand side.
+Uh oh, the errors got worse! At some point, you'll get correctness issues due to floating point error. Computers don't represent real numbers to perfect accuracy, but instead represent things in a form of scientific notation 1.23456 * 10^3. The computer only has so many bits for the "mantissa" aka the left hand side.

 What we're doing with our approximation is essentially computing (A + df) - A. As A gets really big compared to df, we lose most of the precision of our numbers!

-For example, lets suppose we only have 9 digits of precision.
+For example, let's suppose we only have 9 digits of precision.

 ```
 A = 100,000,000.0
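The cancellation in (A + df) - A is easy to reproduce directly in Float64 (a small illustration, independent of the notebook):

```julia
A  = 1.0e8
df = 1.0e-9        # far below the spacing of Float64 values near 1e8
(A + df) - A       # 0.0 — the small contribution is lost entirely
eps(A)             # ≈ 1.49e-8, the gap between adjacent Float64 values at 1e8
```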
@@ -298,7 +298,7 @@ md"""
 One way of thinking about forward mode automatic differentiation is through the lens of dual numbers.


-Consider a taylor series for `f(x)`. `f'(x)` is the epsilon coefficient.
+Consider a Taylor series for `f(x)`. `f'(x)` is the epsilon coefficient.

 ```math
 f(x+\epsilon) = f(x) + f'(x) \epsilon + f''(x) \epsilon^2 + ...
@@ -325,7 +325,7 @@ Let's try add. Let's take two dual numbers and add them up.
 (a + b \epsilon) + (c + d \epsilon) = (a + c) + (b + d)\epsilon
 ```

-What about mulitiply?
+What about multiply?


 ```math
@@ -337,7 +337,7 @@ Here we find three terms going all the way up to epsilon^2. However, since we kn
 ac + (ad + bc) \epsilon
 ```.

-Enzyme implements these rules or all the primitive LLVM instructions. By implementing the derivative rule for all these instructions, one can apply the chain rule between them to find the total derivative of a computation. Let's see how it works in practice.
+Enzyme implements these rules for all the primitive LLVM instructions. By implementing the derivative rule for all these instructions, one can apply the chain rule between them to find the total derivative of a computation. Let's see how it works in practice.
 """

 # ╔═╡ 738fa258-8fe8-472a-b36d-e78b095b5a77
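The two dual-number rules above are enough to differentiate simple expressions by hand; the toy sketch below is only illustrative (Enzyme itself works on LLVM instructions, as the text says, not on a Julia `Dual` type like this):

```julia
# toy dual number: value plus derivative (epsilon) part
struct Dual
    val::Float64
    eps::Float64
end

# (a + bε) + (c + dε) = (a + c) + (b + d)ε
Base.:+(x::Dual, y::Dual) = Dual(x.val + y.val, x.eps + y.eps)

# (a + bε)(c + dε) = ac + (ad + bc)ε, using ε² = 0
Base.:*(x::Dual, y::Dual) = Dual(x.val * y.val, x.val * y.eps + x.eps * y.val)

g(x) = x * x + x      # g'(x) = 2x + 1
g(Dual(3.0, 1.0))     # Dual(12.0, 7.0): value 12, derivative 7
```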
@@ -474,7 +474,7 @@ end
 md"""
 Reverse mode is often the more common way that automatic differentiation is used in practice.

-Rather than derivatives flowing from input to output as we saw in forward mode, in reverse mode derivatives from from differentiable result to differentiable input (as the name would imply).
+Rather than derivatives flowing from input to output as we saw in forward mode, in reverse mode derivatives flow from the differentiable result to the differentiable input (as the name would imply).

 A reverse-mode AD algorithm would often:
 1) Initialize the derivative result to 1 (e.g. marking the derivative of the result with respect to itself as 1).
@@ -484,7 +484,7 @@ A reverse-mode AD algorithm would often:

 The reason for the added complexity of reverse-mode is two-fold:
 * First, one must execute the program backwards. For programs with loops, branching, or nontrivial control flow, setting up this infrastructure may be non-trivial.
-* Second, unlike forward mode where the derivative is immediately available upon executing the shadow instruction, in reverse mode we add up all the partial derivatives. This is because if a variable is used multiple times, we can get multiple terms to the derivative from the partials of each of its users.
+* Second, derivatives with respect to input variables are accumulated (`+=`). This is because the derivative with respect to a variable that gets used multiple times is the sum of derivative contributions from each of its users.

 Let's look at our simple_mlp example, but do it by hand this time.
 """
@@ -532,7 +532,7 @@ Enzyme.autodiff(Reverse, simple_mlp, Active, Active(3.0), Active(2.0), Active(4.
 md"""
 We got the same answers, great!

-But what is this new `Active` annotation. It is only available and reverse mode and refers to a differentiable variable which is immutable. As a result the autodiff function will return a tuple of the derivatives of all active variables. Or, more specifically, the first element of its return will be such a tuple.
+But what is this new `Active` annotation? It is only available in reverse mode and refers to a differentiable variable which is immutable. As a result, the autodiff function will return a tuple of the derivatives of all active variables. Or, more specifically, the first element of its return will be such a tuple.

 You can also ask Enzyme to return both the derivative and the original result (also known as the primal), like below.
 """
@@ -580,7 +580,7 @@ end
 md"""
 Here we now need to also pass Enzyme a shadow return value, but what values should it be?

-For differentiable inputs we passed in 0's since the derivativess will be +='d into it.
+For differentiable inputs we passed in 0's since the derivatives will be +='d into it.

 For differentiable outputs, we will pass in the partial derivative we want to propagate backwards from it. If we simply want the derivative of out[1], let's set it to 1.0.
 """
@@ -616,9 +616,9 @@ end
 # ╔═╡ def31b91-7492-457e-9505-9927f69931c3
 begin
     x_ip = [1.7]
-    dx_ip = [1.0]
-    autodiff(Reverse, sin_inplace, Duplicated(x_ip, dx_ip))
-    dx_ip
+    d_dx_ip = [1.0]
+    autodiff(Reverse, sin_inplace, Duplicated(x_ip, d_dx_ip))
+    d_dx_ip
 end

 # ╔═╡ b4819762-7c61-466d-8046-9c968a9d4150
@@ -632,16 +632,16 @@ Like forward mode we can generalize the equation of what reverse mode computes.

 Suppose our function `f(x, y, z)` returns 3 variables `f1`, `f2`, and `f3`.

-Supposing that we had intiialized shadow returns `df1`, `df2`, and `df3`, reverse mode will compute:
+Supposing that we had initialized shadow returns `d_df1`, `d_df2`, and `d_df3`, reverse mode will compute:

 ```math
-dx += \nabla_x f1(x, y, z) * df1 + \nabla_x f2(x, y, z) * df2 + \nabla_x f3(x, y, z) * df3
+d_dx += \nabla_x f1(x, y, z) * d_df1 + \nabla_x f2(x, y, z) * d_df2 + \nabla_x f3(x, y, z) * d_df3
 ```
 ```math
-dy += \nabla_y f1(x, y, z) * df1 + \nabla_y f2(x, y, z) * df2 + \nabla_y f3(x, y, z) * df3
+d_dy += \nabla_y f1(x, y, z) * d_df1 + \nabla_y f2(x, y, z) * d_df2 + \nabla_y f3(x, y, z) * d_df3
 ```
 ```math
-dz += \nabla_z f1(x, y, z) * df1 + \nabla_z f2(x, y, z) * df2 + \nabla_z f3(x, y, z) * df3
+d_dz += \nabla_z f1(x, y, z) * d_df1 + \nabla_z f2(x, y, z) * d_df2 + \nabla_z f3(x, y, z) * d_df3
 ```

 Similarly, Reverse mode is often referred to as computing a vector jacobian product by the literature/other tools.
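As a concrete instance of these formulas (with hypothetical outputs f1(x, y) = x*y and f2(x, y) = sin(x)), seeding the shadow returns d_df1 and d_df2 would accumulate:

```math
d_dx += y * d_df1 + \cos(x) * d_df2, \qquad d_dy += x * d_df1
```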
@@ -790,7 +790,7 @@ Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzym
 $(LocalResource("./gpu.png"))

 ### Multicore Benchmarks
-This also applies to multicore/multi node paralleism as well. The blue lines in this plot denotes the performance of an application as it is scaled with multiple threads (up to 64). The green lines denote the performance of the gradient. On the left plot, standard optimizations (but no novel parallel optimizations) are applied. We see that the performance steadies out after around ~10 threads. On the right hand side, we now apply parallel optimizations prior to AD. We see that the performance continues to scale similarly to that of the original computation, subject to a hiccup at 32-threads (the machine has two distinct CPU sockets, each which 32 threads, so after 32 threads one needs to coordinate across sockets).
+This applies to multicore/multi-node parallelism as well. The blue lines in this plot denote the performance of an application as it is scaled with multiple threads (up to 64). The green lines denote the performance of the gradient. On the left plot, standard optimizations (but no novel parallel optimizations) are applied. We see that the performance levels off after around ~10 threads. On the right-hand side, we now apply parallel optimizations prior to AD. We see that the performance continues to scale similarly to that of the original computation, subject to a hiccup at 32 threads (the machine has two distinct CPU sockets, each with 32 threads, so after 32 threads one needs to coordinate across sockets).

 This chart is from
 ```
