From 23b9cf1c29435314ee610db974feb4409848a3fb Mon Sep 17 00:00:00 2001
From: Paulo Valente <16843419+polvalente@users.noreply.github.com>
Date: Tue, 11 Feb 2025 19:16:09 -0300
Subject: [PATCH 1/4] docs: add guide on automatic differentiation

---
 .../advanced/automatic_differentiation.livemd | 172 ++++++++++++++++++
 1 file changed, 172 insertions(+)
 create mode 100644 nx/guides/advanced/automatic_differentiation.livemd

diff --git a/nx/guides/advanced/automatic_differentiation.livemd b/nx/guides/advanced/automatic_differentiation.livemd
new file mode 100644
index 0000000000..f3b70d4509
--- /dev/null
+++ b/nx/guides/advanced/automatic_differentiation.livemd
@@ -0,0 +1,172 @@
+# Automatic Differentiation
+
+```elixir
+Mix.install([
+  {:nx, "~> 0.7"}
+])
+```
+
+## What is Function Differentiation
+
+Nx, through the `Nx.Defn.grad/2` and `Nx.Defn.value_and_grad/3` functions, allows the user to differentiate functions that were defined through `defn`.
+
+This is really important in Machine Learning settings because, in general, the training process happens through optimization methods that require calculating the gradient of tensor functions.
+
+For those more familiar with the mathematical terminology, the gradient of a tensor function is similar to the derivative of regular (scalar) functions.
+
+Let's take for example the following $f(x)$ and $f'(x)$ scalar function and derivative pair:
+
+$$
+f(x) = x^3 + x\\
+f'(x) = 3x^2 + 1
+$$
+
+We can define a similar function-derivative pair for tensor functions:
+
+$$
+f(\bold{x}) = \bold{x}^3 + \bold{x}\\
+\nabla f(\bold{x}) = 3 \bold{x} ^ 2 + 1
+$$
+
+These may look similar, but the difference is that $f(\bold{x})$ takes in $\bold{x}$, which is a tensor argument. This means that we can have the following argument and results for the function and its gradient:
+
+$$
+\bold{x} = \begin{bmatrix}
+1 & 1 \\
+2 & 3 \\
+5 & 8 \\
+\end{bmatrix}\\
+
+f(\bold{x}) = \bold{x}^3 + \bold{x} = \begin{bmatrix}
+2 & 2 \\
+10 & 30 \\
+130 & 520
+\end{bmatrix}\\
+
+\nabla f(\bold{x}) = 3 \bold{x} ^ 2 + 1 = \begin{bmatrix}
+4 & 4 \\
+13 & 28 \\
+76 & 193
+\end{bmatrix}\\
+$$
+
+## Automatic Differentiation
+
+Now that we have a general feeling of what a function and its gradient are, we can talk about how Nx can use `defn` to calculate gradients for us.
+
+In the following code blocks we're going to define the same tensor function as above and then we'll differentiate it only using Nx, without having to write the explicit derivative at all.
+
+```elixir
+defmodule Math do
+  import Nx.Defn
+
+  defn f(x) do
+    x ** 3 + x
+  end
+
+  defn grad_f(x) do
+    Nx.Defn.grad(x, &f/1)
+  end
+end
+```
+
+```elixir
+x =
+  Nx.tensor([
+    [1, 1],
+    [2, 3],
+    [5, 8]
+  ])
+
+{
+  Math.f(x),
+  Math.grad_f(x)
+}
+```
+
+As we can see, we get the results we expected, aside from the type of the grad, which will always be a floating-point number, even if you pass an integer tensor as input.
+
+Next, we'll use `Nx.Defn.debug_expr` to see what's happening under the hood.
+
+```elixir
+Nx.Defn.debug_expr(&Math.f/1).(x) 
+```
+
+```elixir
+Nx.Defn.debug_expr(&Math.grad_f/1).(x)
+```
+
+If we look closely at the returned `Nx.Defn.Expr` representations for `f` and `grad_f`, we can see that they pretty much translate to the mathematical definitions we had originally.
+
+This is possible because Nx holds onto the symbolic representation of a `defn` function while inside `defn`-land, and thus `Nx.Defn.grad` (and similar) can operate on that symbolic representation to return a new symbolic representation (as seen in the second block).
+
+
+
+`Nx.Defn.value_and_grad` can be used to calculate both the function value and the gradient at once for us:
+
+```elixir
+Nx.Defn.value_and_grad(x, &Math.f/1)
+```
+
+And if we use `debug_expr` again, we can see that the symbolic representation is actually both the function and the grad, returned in a tuple:
+
+```elixir
+Nx.Defn.debug_expr(Nx.Defn.value_and_grad(&Math.f/1)).(x)
+```
+
+Finally, we can talk about functions that receive many arguments, such as the following `add_multiply` function:
+
+```elixir
+add_multiply = fn x, y, z ->
+  addition = Nx.add(x, y)
+  Nx.multiply(z, addition)
+end
+```
+
+At first, you may think that if we want to differentiate it, we need to wrap it into a single-argument function so that we can differentiate with respect to a specific argument, which would treat the other arguments as constants, as we can see below:
+
+```elixir
+x = Nx.tensor([1, 2])
+y = Nx.tensor([3, 4])
+z = Nx.tensor([5, 6])
+
+{
+  Nx.Defn.grad(x, fn t -> add_multiply.(t, y, z) end),
+  Nx.Defn.grad(y, fn t -> add_multiply.(x, t, z) end),
+  Nx.Defn.grad(z, fn t -> add_multiply.(x, y, t) end)
+}
+```
+
+However, Nx is smart enough to deal with multi-argument functions through `Nx.Container` representations such as a tuple or a map:
+
+```elixir
+Nx.Defn.grad({x, y, z}, fn {x, y, z} -> add_multiply.(x, y, z) end)
+```
+
+Likewise, we can also deal with functions that return multiple values.
+
+`Nx.Defn.grad` requires us to return a scalar from the function (that is, a tensor of shape `{}`).
+However, there are instances where we might want to use `value_and_grad` to return a tuple from our function, while still calculating its gradient.
+
+For this, we have the `value_and_grad/3` arity, which accepts a transformation argument.
+
+```elixir
+x =
+  Nx.tensor([
+    [1, 1],
+    [2, 3],
+    [5, 8]
+  ])
+
+# Notice that the returned values are the 2 addition terms from `Math.f/1`
+multi_valued_return_fn = 
+  fn x -> 
+    {Nx.pow(x, 3), x}
+  end
+
+transform_fn = fn {x_cubed, x} -> Nx.add(x_cubed, x) end
+
+{{x_cubed, x}, grad} = Nx.Defn.value_and_grad(x, multi_valued_return_fn, transform_fn)
+```
+
+If we go back to the start of this livebook, we can see that `grad` holds exactly the result of `Math.grad_f`, but now we have access to `x ** 3`, which wasn't accessible before, as originally we could only obtain `x ** 3 + x`.

From d5eb71cc68ac3bee38633f2b684cc9ce0565ee17 Mon Sep 17 00:00:00 2001
From: Paulo Valente <16843419+polvalente@users.noreply.github.com>
Date: Tue, 11 Feb 2025 19:27:39 -0300
Subject: [PATCH 2/4] docs: add livebook to guides section

---
 .../advanced/automatic_differentiation.livemd | 30 ++++++++++++-------
 nx/mix.exs                                    |  1 +
 2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/nx/guides/advanced/automatic_differentiation.livemd b/nx/guides/advanced/automatic_differentiation.livemd
index f3b70d4509..ec9be42f8f 100644
--- a/nx/guides/advanced/automatic_differentiation.livemd
+++ b/nx/guides/advanced/automatic_differentiation.livemd
@@ -6,7 +6,7 @@ Mix.install([
 ])
 ```
 
-## What is Function Differentiation
+## What is Function Differentiation?
 
 Nx, through the `Nx.Defn.grad/2` and `Nx.Defn.value_and_grad/3` functions, allows the user to differentiate functions that were defined through `defn`.
 
@@ -31,23 +31,30 @@ $$
 These may look similar, but the difference is that $f(\bold{x})$ takes in $\bold{x}$, which is a tensor argument. This means that we can have the following argument and results for the function and its gradient:
 
 $$
-\bold{x} = \begin{bmatrix}
+\bold{x} =
+\begin{bmatrix}
 1 & 1 \\
 2 & 3 \\
 5 & 8 \\
-\end{bmatrix}\\
+\end{bmatrix}
+$$
 
-f(\bold{x}) = \bold{x}^3 + \bold{x} = \begin{bmatrix}
+$$
+f(\bold{x}) = \bold{x}^3 + \bold{x} =
+\begin{bmatrix}
 2 & 2 \\
 10 & 30 \\
 130 & 520
-\end{bmatrix}\\
+\end{bmatrix}
+$$
 
-\nabla f(\bold{x}) = 3 \bold{x} ^ 2 + 1 = \begin{bmatrix}
+$$
+\nabla f(\bold{x}) = 3 \bold{x} ^ 2 + 1 =
+\begin{bmatrix}
 4 & 4 \\
 13 & 28 \\
 76 & 193
-\end{bmatrix}\\
+\end{bmatrix}
 $$
 
 ## Automatic Differentiation
@@ -89,7 +96,7 @@ As we can see, we get the results we expected, aside from the type of the grad,
 Next, we'll use `Nx.Defn.debug_expr` to see what's happening under the hood.
 
 ```elixir
-Nx.Defn.debug_expr(&Math.f/1).(x) 
+Nx.Defn.debug_expr(&Math.f/1).(x)
 ```
 
 ```elixir
@@ -159,8 +166,8 @@ x =
   ])
 
 # Notice that the returned values are the 2 addition terms from `Math.f/1`
-multi_valued_return_fn = 
-  fn x -> 
+multi_valued_return_fn =
+  fn x ->
     {Nx.pow(x, 3), x}
   end
 
@@ -170,3 +177,6 @@ transform_fn = fn {x_cubed, x} -> Nx.add(x_cubed, x) end
 ```
 
 If we go back to the start of this livebook, we can see that `grad` holds exactly the result of `Math.grad_f`, but now we have access to `x ** 3`, which wasn't accessible before, as originally we could only obtain `x ** 3 + x`.
+
+$$
+$$
diff --git a/nx/mix.exs b/nx/mix.exs
index a43972cf17..12627b8693 100644
--- a/nx/mix.exs
+++ b/nx/mix.exs
@@ -60,6 +60,7 @@ defmodule Nx.MixProject do
         "guides/intro-to-nx.livemd",
         "guides/advanced/vectorization.livemd",
         "guides/advanced/aggregation.livemd",
+        "guides/advanced/automatic_differentiation.livemd",
         "guides/exercises/exercises-1-20.livemd"
       ],
       skip_undefined_reference_warnings_on: ["CHANGELOG.md"],

From 19f2653e820e936c7bffb8fd2c700cb126b1f6b2 Mon Sep 17 00:00:00 2001
From: Paulo Valente <16843419+polvalente@users.noreply.github.com>
Date: Wed, 5 Mar 2025 14:35:35 -0300
Subject: [PATCH 3/4] chore: add description on function differentiation

---
 .../advanced/automatic_differentiation.livemd | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/nx/guides/advanced/automatic_differentiation.livemd b/nx/guides/advanced/automatic_differentiation.livemd
index ec9be42f8f..69c02baf67 100644
--- a/nx/guides/advanced/automatic_differentiation.livemd
+++ b/nx/guides/advanced/automatic_differentiation.livemd
@@ -9,10 +9,18 @@ Mix.install([
 ## What is Function Differentiation?
 
 Nx, through the `Nx.Defn.grad/2` and `Nx.Defn.value_and_grad/3` functions, allows the user to differentiate functions that were defined through `defn`.
-
 This is really important in Machine Learning settings because, in general, the training process happens through optimization methods that require calculating the gradient of tensor functions.
 
-For those more familiar with the mathematical terminology, the gradient of a tensor function is similar to the derivative of regular (scalar) functions.
+Before we get too far ahead of ourselves, let's talk about what the derivative and the gradient of a function are.
+In simple terms, the derivative tells us how a function changes at a given point and lets us measure things such as where a function has maximum,
+minimum, or turning points (for example, where a parabola has its vertex).
+
+The ability to find local minima and maxima is what makes derivatives important for optimization problems, because if we can find them, we can solve problems that aim
+to minimize a given function. For higher dimensional problems, we deal with functions of many variables, and thus we use the gradient, which measures the "derivative" along each axis of the function.
+The gradient, then, is a vector that points in the direction in which the function increases the fastest, which leads to the so-called gradient descent method of optimization.
+
+In the gradient descent method, we take tiny steps in the direction opposite to the gradient of the function in order to find the nearest local minimum (which hopefully is either the global minimum or close enough to it).
+This is what makes function differentiation so important for machine learning.
 
 Let's take for example the following $f(x)$ and $f'(x)$ scalar function and derivative pair:
 
@@ -176,7 +184,4 @@ transform_fn = fn {x_cubed, x} -> Nx.add(x_cubed, x) end
 {{x_cubed, x}, grad} = Nx.Defn.value_and_grad(x, multi_valued_return_fn, transform_fn)
 ```
 
-If we go back to the start of this livebook, we can see that `grad` holds exactly the result of `Math.grad_f`, but now we have access to `x ** 3`, which wasn't accessible before, as originally we could only obtain `x ** 3 + x`.
-
-$$
-$$
+If we go back to the start of this livebook, we can see that `grad` holds exactly the result of `Math.grad_f`, but now we have access to `x ** 3`, which wasn't accessible before, as originally we could only obtain `x ** 3 + x`.
\ No newline at end of file

From 8cd8a7e746fd2b2ee969c5f9976bf0341e872975 Mon Sep 17 00:00:00 2001
From: Paulo Valente <16843419+polvalente@users.noreply.github.com>
Date: Wed, 5 Mar 2025 14:46:52 -0300
Subject: [PATCH 4/4] casing

---
 nx/guides/advanced/automatic_differentiation.livemd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/nx/guides/advanced/automatic_differentiation.livemd b/nx/guides/advanced/automatic_differentiation.livemd
index 69c02baf67..eb27215153 100644
--- a/nx/guides/advanced/automatic_differentiation.livemd
+++ b/nx/guides/advanced/automatic_differentiation.livemd
@@ -20,7 +20,7 @@ to minimize a given function. For higher dimensional problems, we deal with func
 The gradient, then, is a vector that points in the direction in which the function increases the fastest, which leads to the so-called gradient descent method of optimization.
 
 In the gradient descent method, we take tiny steps in the direction opposite to the gradient of the function in order to find the nearest local minimum (which hopefully is either the global minimum or close enough to it).
-This is what makes function differentiation so important for machine learning.
+This is what makes function differentiation so important for Machine Learning.
 
 Let's take for example the following $f(x)$ and $f'(x)$ scalar function and derivative pair:
 
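As a companion to the gradient descent description added in PATCH 3, here is a minimal sketch of how `Nx.Defn.grad` could drive a few gradient descent steps. It is an illustration only: the objective `(x - 3) ** 2`, the learning rate of `0.1`, and the `50` iterations are arbitrary choices made for this sketch, not values taken from the guide.

```elixir
Mix.install([
  {:nx, "~> 0.7"}
])

defmodule GradientDescentSketch do
  import Nx.Defn

  # Hypothetical example objective (not from the guide): a simple
  # convex function whose minimum sits at x = 3.
  defn objective(x) do
    (x - 3) ** 2
  end

  # One gradient descent step: move against the gradient,
  # scaled by a small learning rate.
  defn step(x, learning_rate) do
    x - learning_rate * Nx.Defn.grad(x, &objective/1)
  end
end

# Starting from 0.0, repeated steps approach the minimum at 3.0.
Enum.reduce(1..50, Nx.tensor(0.0), fn _i, x ->
  GradientDescentSketch.step(x, 0.1)
end)
```

Each call to `step/2` re-derives the gradient from the symbolic representation of `objective/1`, which is the same mechanism `Math.grad_f/1` relies on earlier in the guide.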