diff --git a/README.md b/README.md
index b562943a0..9ace37c47 100644
--- a/README.md
+++ b/README.md
@@ -18,12 +18,12 @@
-Fine-tuning is an effective way to improve performance on [neural search](https://jina.ai/news/what-is-neural-search-and-learn-to-build-a-neural-search-engine/) tasks.
+Fine-tuning is an effective way to improve AI model performance on specific tasks and use cases.
However, setting up and performing fine-tuning can be very time-consuming and resource-intensive.
Jina AI's Finetuner makes fine-tuning easier and faster by streamlining the workflow and handling all the complexity and infrastructure in the cloud.
With Finetuner, you can easily enhance the performance of pre-trained models,
-making them production-ready [without extensive labeling](https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/) or expensive hardware.
+making them production-ready [with less data](https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/) and without investing in expensive hardware.
**Better embeddings**: Create high-quality embeddings for semantic search, visual similarity search, cross-modal text<->image search, recommendation systems,
clustering, duplication detection, anomaly detection, or other uses.
@@ -39,6 +39,7 @@ freezing, dimensionality reduction, hard-negative mining, cross-modal models, an
**All-in-cloud**: Train using our GPU infrastructure, manage runs, experiments, and artifacts on Jina AI Cloud
without worrying about resource availability, complex integration, or infrastructure costs.
+
## [Documentation](https://finetuner.jina.ai/)
diff --git a/docs/dummy/dummy.md b/docs/dummy/dummy.md
new file mode 100644
index 000000000..ab8dd8f66
--- /dev/null
+++ b/docs/dummy/dummy.md
@@ -0,0 +1,16 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/imgs/Cosine.png b/docs/imgs/Cosine.png
new file mode 100644
index 000000000..03d556aeb
Binary files /dev/null and b/docs/imgs/Cosine.png differ
diff --git a/docs/imgs/Cosine1.png b/docs/imgs/Cosine1.png
new file mode 100644
index 000000000..0ea10edaf
Binary files /dev/null and b/docs/imgs/Cosine1.png differ
diff --git a/docs/imgs/Cosine2.png b/docs/imgs/Cosine2.png
new file mode 100644
index 000000000..6561a1b5a
Binary files /dev/null and b/docs/imgs/Cosine2.png differ
diff --git a/docs/imgs/Dont_Panic.png b/docs/imgs/Dont_Panic.png
new file mode 100644
index 000000000..4e1f7268d
Binary files /dev/null and b/docs/imgs/Dont_Panic.png differ
diff --git a/docs/imgs/Faces.png b/docs/imgs/Faces.png
new file mode 100644
index 000000000..f009892c3
Binary files /dev/null and b/docs/imgs/Faces.png differ
diff --git a/docs/imgs/HairType.png b/docs/imgs/HairType.png
new file mode 100644
index 000000000..2e3958656
Binary files /dev/null and b/docs/imgs/HairType.png differ
diff --git a/docs/imgs/MenWomen.png b/docs/imgs/MenWomen.png
new file mode 100644
index 000000000..09fc03def
Binary files /dev/null and b/docs/imgs/MenWomen.png differ
diff --git a/docs/imgs/MonaLisa1.png b/docs/imgs/MonaLisa1.png
new file mode 100644
index 000000000..602d12039
Binary files /dev/null and b/docs/imgs/MonaLisa1.png differ
diff --git a/docs/imgs/MonaLisa2.png b/docs/imgs/MonaLisa2.png
new file mode 100644
index 000000000..17248a2ca
Binary files /dev/null and b/docs/imgs/MonaLisa2.png differ
diff --git a/docs/imgs/PythagorasTheorem.png b/docs/imgs/PythagorasTheorem.png
new file mode 100644
index 000000000..06c6060ab
Binary files /dev/null and b/docs/imgs/PythagorasTheorem.png differ
diff --git a/docs/imgs/Wynona.png b/docs/imgs/Wynona.png
new file mode 100644
index 000000000..55df5928c
Binary files /dev/null and b/docs/imgs/Wynona.png differ
diff --git a/docs/imgs/algebra1.png b/docs/imgs/algebra1.png
new file mode 100644
index 000000000..a70d99ff9
Binary files /dev/null and b/docs/imgs/algebra1.png differ
diff --git a/docs/imgs/algebra2.png b/docs/imgs/algebra2.png
new file mode 100644
index 000000000..68704efc5
Binary files /dev/null and b/docs/imgs/algebra2.png differ
diff --git a/docs/imgs/algebra3.png b/docs/imgs/algebra3.png
new file mode 100644
index 000000000..9ff7c1f35
Binary files /dev/null and b/docs/imgs/algebra3.png differ
diff --git a/docs/imgs/algebra4.png b/docs/imgs/algebra4.png
new file mode 100644
index 000000000..0639e78f4
Binary files /dev/null and b/docs/imgs/algebra4.png differ
diff --git a/docs/imgs/dog_w_text.png b/docs/imgs/dog_w_text.png
new file mode 100644
index 000000000..639564e25
Binary files /dev/null and b/docs/imgs/dog_w_text.png differ
diff --git a/docs/imgs/finetuning-adjusting-dials.png b/docs/imgs/finetuning-adjusting-dials.png
new file mode 100644
index 000000000..931def593
Binary files /dev/null and b/docs/imgs/finetuning-adjusting-dials.png differ
diff --git a/docs/imgs/fruit_NN2.png b/docs/imgs/fruit_NN2.png
new file mode 100644
index 000000000..57499b95e
Binary files /dev/null and b/docs/imgs/fruit_NN2.png differ
diff --git a/docs/imgs/fruit_and_dots.png b/docs/imgs/fruit_and_dots.png
new file mode 100644
index 000000000..a88d42308
Binary files /dev/null and b/docs/imgs/fruit_and_dots.png differ
diff --git a/docs/imgs/matrix1.png b/docs/imgs/matrix1.png
new file mode 100644
index 000000000..520a9ec2a
Binary files /dev/null and b/docs/imgs/matrix1.png differ
diff --git a/docs/imgs/matrix2.png b/docs/imgs/matrix2.png
new file mode 100644
index 000000000..d41b66a74
Binary files /dev/null and b/docs/imgs/matrix2.png differ
diff --git a/docs/imgs/matrix3.png b/docs/imgs/matrix3.png
new file mode 100644
index 000000000..749218808
Binary files /dev/null and b/docs/imgs/matrix3.png differ
diff --git a/docs/imgs/matrix_eq1.png b/docs/imgs/matrix_eq1.png
new file mode 100644
index 000000000..b15a7b49f
Binary files /dev/null and b/docs/imgs/matrix_eq1.png differ
diff --git a/docs/imgs/nn_first_layer.png b/docs/imgs/nn_first_layer.png
new file mode 100644
index 000000000..186d81fe3
Binary files /dev/null and b/docs/imgs/nn_first_layer.png differ
diff --git a/docs/imgs/nn_small.png b/docs/imgs/nn_small.png
new file mode 100644
index 000000000..d146671ec
Binary files /dev/null and b/docs/imgs/nn_small.png differ
diff --git a/docs/imgs/notation1.png b/docs/imgs/notation1.png
new file mode 100644
index 000000000..b4ef849a7
Binary files /dev/null and b/docs/imgs/notation1.png differ
diff --git a/docs/imgs/notation2.png b/docs/imgs/notation2.png
new file mode 100644
index 000000000..ede67ecdc
Binary files /dev/null and b/docs/imgs/notation2.png differ
diff --git a/docs/imgs/notation3.png b/docs/imgs/notation3.png
new file mode 100644
index 000000000..1080a2e2f
Binary files /dev/null and b/docs/imgs/notation3.png differ
diff --git a/docs/imgs/notation4.png b/docs/imgs/notation4.png
new file mode 100644
index 000000000..d93709daf
Binary files /dev/null and b/docs/imgs/notation4.png differ
diff --git a/docs/imgs/ordered_veg.png b/docs/imgs/ordered_veg.png
new file mode 100644
index 000000000..820046820
Binary files /dev/null and b/docs/imgs/ordered_veg.png differ
diff --git a/docs/imgs/plain_fruit_NN.png b/docs/imgs/plain_fruit_NN.png
new file mode 100644
index 000000000..f3d32958f
Binary files /dev/null and b/docs/imgs/plain_fruit_NN.png differ
diff --git a/docs/imgs/relu.png b/docs/imgs/relu.png
new file mode 100644
index 000000000..aaf58895d
Binary files /dev/null and b/docs/imgs/relu.png differ
diff --git a/docs/imgs/relu1.png b/docs/imgs/relu1.png
new file mode 100644
index 000000000..f2b9f6fcf
Binary files /dev/null and b/docs/imgs/relu1.png differ
diff --git a/docs/imgs/scatterplot_fruit.png b/docs/imgs/scatterplot_fruit.png
new file mode 100644
index 000000000..ce6b89fb8
Binary files /dev/null and b/docs/imgs/scatterplot_fruit.png differ
diff --git a/docs/imgs/scatterplot_fruit_ordered.png b/docs/imgs/scatterplot_fruit_ordered.png
new file mode 100644
index 000000000..01f9d76a0
Binary files /dev/null and b/docs/imgs/scatterplot_fruit_ordered.png differ
diff --git a/docs/imgs/scatterplot_fruit_partitioned.png b/docs/imgs/scatterplot_fruit_partitioned.png
new file mode 100644
index 000000000..9ab3732a9
Binary files /dev/null and b/docs/imgs/scatterplot_fruit_partitioned.png differ
diff --git a/docs/imgs/scatterplot_veg.png b/docs/imgs/scatterplot_veg.png
new file mode 100644
index 000000000..c1fefba1d
Binary files /dev/null and b/docs/imgs/scatterplot_veg.png differ
diff --git a/docs/imgs/sigmoid.png b/docs/imgs/sigmoid.png
new file mode 100644
index 000000000..ab4b88abd
Binary files /dev/null and b/docs/imgs/sigmoid.png differ
diff --git a/docs/imgs/thresholds.png b/docs/imgs/thresholds.png
new file mode 100644
index 000000000..4840f7853
Binary files /dev/null and b/docs/imgs/thresholds.png differ
diff --git a/docs/imgs/thresholds1.png b/docs/imgs/thresholds1.png
new file mode 100644
index 000000000..47b95f14d
Binary files /dev/null and b/docs/imgs/thresholds1.png differ
diff --git a/docs/imgs/tomato_centroid.png b/docs/imgs/tomato_centroid.png
new file mode 100644
index 000000000..ea2892a4e
Binary files /dev/null and b/docs/imgs/tomato_centroid.png differ
diff --git a/docs/imgs/tomato_centroid_arrow.png b/docs/imgs/tomato_centroid_arrow.png
new file mode 100644
index 000000000..eb942568a
Binary files /dev/null and b/docs/imgs/tomato_centroid_arrow.png differ
diff --git a/docs/imgs/unordered_veg_partition.png b/docs/imgs/unordered_veg_partition.png
new file mode 100644
index 000000000..4895b27e3
Binary files /dev/null and b/docs/imgs/unordered_veg_partition.png differ
diff --git a/docs/imgs/unordered_veg_partition_softmax.png b/docs/imgs/unordered_veg_partition_softmax.png
new file mode 100644
index 000000000..314d6dd6b
Binary files /dev/null and b/docs/imgs/unordered_veg_partition_softmax.png differ
diff --git a/docs/imgs/vector.png b/docs/imgs/vector.png
new file mode 100644
index 000000000..964000fa7
Binary files /dev/null and b/docs/imgs/vector.png differ
diff --git a/docs/imgs/veg_NN.jpg b/docs/imgs/veg_NN.jpg
new file mode 100644
index 000000000..63d27a27f
Binary files /dev/null and b/docs/imgs/veg_NN.jpg differ
diff --git a/docs/imgs/veg_apart.png b/docs/imgs/veg_apart.png
new file mode 100644
index 000000000..df71ad258
Binary files /dev/null and b/docs/imgs/veg_apart.png differ
diff --git a/docs/imgs/veg_together.png b/docs/imgs/veg_together.png
new file mode 100644
index 000000000..a621cddd1
Binary files /dev/null and b/docs/imgs/veg_together.png differ
diff --git a/docs/imgs/veg_triplet.png b/docs/imgs/veg_triplet.png
new file mode 100644
index 000000000..7edfdd8c0
Binary files /dev/null and b/docs/imgs/veg_triplet.png differ
diff --git a/docs/imgs/xy-axes.png b/docs/imgs/xy-axes.png
new file mode 100644
index 000000000..55072a7dd
Binary files /dev/null and b/docs/imgs/xy-axes.png differ
diff --git a/docs/imgs/xyz-axes.png b/docs/imgs/xyz-axes.png
new file mode 100644
index 000000000..140e86945
Binary files /dev/null and b/docs/imgs/xyz-axes.png differ
diff --git a/docs/intro/AI-pipeline.md b/docs/intro/AI-pipeline.md
new file mode 100644
index 000000000..329863a31
--- /dev/null
+++ b/docs/intro/AI-pipeline.md
@@ -0,0 +1,3 @@
+In practice, we typically transform the input data in some way before feeding
+it into a neural network (we call this *preprocessing*) and have to perform some
+transformation on the output. Nonetheless,
diff --git a/docs/intro/brief-refresher-on-vectors.md b/docs/intro/brief-refresher-on-vectors.md
new file mode 100644
index 000000000..99a3389d1
--- /dev/null
+++ b/docs/intro/brief-refresher-on-vectors.md
@@ -0,0 +1,115 @@
+# {octicon}`light-bulb` Brief Refresher on Vectors
+
+Vectors are ordered lists of numbers that correspond to points in a high-dimensional
+metric space. For example, a point on a graph is defined by a vector. In the
+image below, you can see three points on a graph, each defined by a vector:
+`[-3,1]`, `[4,2]`, `[2.5,-3]`
+
+
+
+Points in a two-dimensional space are defined by vectors of two numbers. In a
+three-dimensional space, you need vectors with three numbers. In the image below,
+point *Q* is defined by the vector `[-5, -5, 7]` and point *P* by `[3,0,5]`:
+
+
+
+A vector isn't limited to three numbers. It can have any number of them --
+hundreds, thousands, even millions or billions -- defining a point in what's
+called a *vector space*.
+
+These vector spaces are just extensions of the three-dimensional space we are
+all used to. It can be hard to imagine a space with millions of dimensions, and
+even harder to draw a picture of it, but they still have the same properties
+two- and three-dimensional spaces have: Every point is uniquely defined by a
+vector and each vector defines a unique point. If two vectors are the same,
+they correspond to the same point in space, and the distance between them is
+therefore zero. Any two points that correspond to different vectors are different
+points, and we can calculate the distance between them from their vectors.
+
+## Vector Notation
+
+Traditionally, we write vectors with a little arrow above them, like this
+five-dimensional vector:
+
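+$$\vec{u} = [2, -3, 7, 4, 1]$$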
+
+
+Or this one:
+
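+$$\vec{v} = [-7, 0, -2, 5, 4]$$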
+
+
+And we designate the individual numbers in a vector with a subscript:
+
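+$$\vec{u} = [u_1, u_2, u_3, u_4, u_5]$$
+
+So, for the example above, $u_1 = 2$, $u_2 = -3$, and so on, up to $u_5 = 1$.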
+
+
+
+
+## Distance
+
+There are a number of different kinds of distances that we can calculate
+between vectors. The two most commonly used in neural networks are
+*Euclidean distance*, which closely corresponds to our common-sense ideas
+about distance, and *cosine distance*, which measures the angle between
+vectors as if they were lines from the origin, instead of points.
+
+### Euclidean distance
+
+Euclidean distance corresponds closely with our everyday sense of distance.
+If a town is three kilometers to the north and four kilometers to the east,
+then the straight-line distance to that town is five kilometers, because of
+the Pythagorean theorem:
+
+
+
+If sides *a* and *b* form a right angle (= 90 degrees), then the length of
+side *c* must fit the equation:
+
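+$$c^2 = a^2 + b^2$$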
+
+
+We can generalize this to get the Euclidean distance between any two
+*n*-dimensional vectors with the formula:
+
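+Writing the two vectors as $\vec{u} = [u_1, \ldots, u_n]$ and $\vec{v} = [v_1, \ldots, v_n]$:
+
+$$d(\vec{u}, \vec{v}) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$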
+
+
+For two five-dimensional vectors, `[2 -3 7 4 1]` and `[-7 0 -2 5 4]`,
+it looks like this:
+
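+$$d = \sqrt{(2+7)^2 + (-3-0)^2 + (7+2)^2 + (4-5)^2 + (1-4)^2} = \sqrt{181} \approx 13.45$$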
+
+
+Even if two vectors have thousands or millions of dimensions, it's
+pretty trivial for a computer to calculate the Euclidean distance
+between them.
+
+### Cosine distance
+
+Vectors can also be seen as lines in a high-dimensional space from the
+origin (the point where all the numbers in the vector are zero) to the
+point designated by the coordinates in the vector.
+
+
+
+Seen in this way, we can calculate the *angle* between the two vectors.
+For two vectors of dimension *n*:
+
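+$$\cos(\theta) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\|\,\|\vec{v}\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$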
+
+
+For the two-dimensional vectors `[2,3]` and `[5,1]`:
+
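+$$\cos(\theta) = \frac{2 \cdot 5 + 3 \cdot 1}{\sqrt{2^2 + 3^2}\,\sqrt{5^2 + 1^2}} = \frac{13}{\sqrt{13}\,\sqrt{26}} \approx 0.71$$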
+
+
+Usually, we just stick to the cosine, without calculating the radians or
+degrees of the angle. If *cos(θ) = 1*, then the angle *θ* is zero degrees;
+if *cos(θ) = 0*, then the angle *θ* is 90 degrees. The *cosine distance*
+itself is then typically defined as *1 - cos(θ)*, so that two vectors pointing
+in the same direction are at distance zero.
+
+We sometimes use measures other than Euclidean distance and cosine, but
+these are the two most important ones, and you can see how both are quickly
+calculated from vectors.
\ No newline at end of file
diff --git a/docs/intro/how-fine-tuning-works.md b/docs/intro/how-fine-tuning-works.md
index f836ecdde..a96bd67e5 100644
--- a/docs/intro/how-fine-tuning-works.md
+++ b/docs/intro/how-fine-tuning-works.md
@@ -1,5 +1,362 @@
(technical-details)=
# {octicon}`mortar-board` How Fine-tuning Works
-Here should be the guides from [Part 1](https://www.notion.so/Not-So-Brief-Explanation-of-Neural-Networks-dd8cd521023642f792683ff43bd2ccf1),
-[Part 2](https://www.notion.so/Transfer-learning-and-fine-tuning-d405c1fee3d4444cb343a05e73f02db6), ...
\ No newline at end of file
+To understand what fine-tuning does and how it works, you will first need to
+understand how neural network models work, at least in general, if not in fine detail.
+
+If you are new to neural networks or not particularly familiar with what it means
+to train and use a neural AI model, we have a [presentation of the core concepts](../intro/what-are-neural-networks)
+relevant to fine-tuning. It is intended for readers with some understanding
+of algebra and computer engineering but who may have little or no experience with
+machine learning or artificial intelligence.
+
+We also have a [non-technical summary of what fine-tuning does](../intro/what-is-finetuner),
+but without explaining how it works.
+
+## Transfer Learning
+
+[TO DO]
+
+## Example: Recognizing pictures of fruit
+
+Let's consider a more concrete example. Imagine we trained a neural network to
+take pictures of fruit as input, and then tell us which fruit it is. To make
+the problem simpler, let's say it can recognize a fixed set of fruit types,
+specified in advance.
+
+We don't need to be very specific about the exact number of nodes and hidden
+layers; the network will look basically like this:
+
+
+
+The last hidden layer will be a vector with some high number of dimensions.
+Since it's very hard to draw that, we're going to pretend the last hidden layer
+forms a two-dimensional space. Before training, when all the weights are set to
+random values, we would expect that layer to give us randomly distributed
+results when we enter pictures of fruit:
+
+
+
+After training, we would expect the last hidden layer's distribution to look
+more like this:
+
+
+
+In short, the last hidden layer should be organized so that the output layer
+can easily recognize fruits just by drawing lines around the parts of the
+vector space each fruit falls into.
+
+
+
+Then, when we present the model with new images, we can expect them to appear
+in the right part of the vector space so that they are easily recognized:
+
+
+
+## Embeddings
+
+In classification problems, we typically have an explicit output layer, like
+in the fruit-recognition example above. We provide one output node for each
+class we want to recognize. Whenever we want to classify inputs into some
+well-defined set of classes, we can construct this kind of network: We transform
+input vectors into output vectors where each dimension is linked to a specific
+output class.
+
+All data classification problems are vector transformation problems; however,
+not all vector transformation problems are classification problems.
+
+For example, consider the problem of face recognition. Let's say you have a
+database of some large number of photos of peopleâs faces, and, when given a
+new picture of a face, you want to retrieve the most similar face from the
+database.
+
+Let's say there are a million faces in the database. You could try to treat
+this as a classification problem: Create a neural network with a million
+output nodes -- one for each face -- and then train it to classify new pictures
+into one of those million categories.
+
+This is possible, in principle, but in practice it doesn't work very well.
+
+Neural networks learn by adjusting their weights in response to examples. To
+learn to do classification, they need to see many examples of each class that
+they have to learn. How many depends on factors like the size of the network,
+the quality of the data, and the difficulty of the problem. It is very difficult
+to know in advance how much data is enough, which is why training data sets are
+typically as large as possible.
+
+If we treat each of a million unique faces as a discrete category, we might need
+hundreds, thousands, or even more pictures of each face to learn to classify
+new pictures correctly. This is very difficult to do, and very inefficient.
+
+There is a better way to look at the problem: Instead of treating it like a
+classification problem, we can treat it like an information retrieval problem.
+Instead of asking what category each picture belongs to, we ask: Given an input
+picture, what are the most similar pictures to it in our database, and are any
+of them similar enough to be a good match?
+
+We saw with the fruit classification problem how the last hidden layer becomes
+organized during training so that the same kinds of fruits cluster together in
+the vector space defined by that layer. When we train a neural network so that
+it maps input data to a vector space where the placement in that space encodes
+useful information about the input data, we call vectors in that space *embeddings*.
+
+In the fruit recognition example above, the last hidden layer of the network
+produces embeddings, because we can see how the locations of the output
+vectors encode information about what kind of fruit is in the input image.
+
+We can apply that principle to face recognition.
+
+What we would like from a face recognition neural model is for it to transform
+input pictures of faces into embeddings, such that the more similar two
+pictures of faces are, the closer together their embeddings will be. The idea
+is that if two pictures are of the same person, their embeddings will be very
+close together, much closer than two pictures of different people.
+
+Embeddings are high-dimensional vectors in all practical cases, but once again,
+since it's very hard to draw high-dimensional spaces, we'll pretend it's just
+two dimensions for the example below:
+
+
+
+We can see in this example that some superficial features of faces cluster
+together. For example, the men and women are separated:
+
+
+
+And people are clustered by features of their hair:
+
+
+
+In real use cases, embeddings have a lot of dimensions and form a very
+high-dimensional vector space. With so many dimensions, different features can
+cluster together in ways that we can't really draw in two dimensions.
+
+There are a number of techniques for constructing and training networks like
+this, but what's important to understand is what these kinds of models do and
+how they work.
+
+In this example, we want the model to produce an embedding for each picture in
+the database, which we then store. Then we take a new picture, get its
+embedding vector from the model, and compare that vector to the stored
+vectors in the database. The result will be a ranking of all the images in the
+database by how similar they are to the new picture.
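+
+As a minimal sketch of this comparison step, assuming the embeddings are
+already computed and stored as rows of a NumPy array (the sizes and the random
+vectors standing in for real embeddings are purely illustrative):
+
+```python
+import numpy as np
+
+def rank_by_similarity(query_embedding, stored_embeddings):
+    # Cosine similarity between the query and every stored embedding.
+    sims = stored_embeddings @ query_embedding / (
+        np.linalg.norm(stored_embeddings, axis=1)
+        * np.linalg.norm(query_embedding)
+    )
+    return np.argsort(-sims)  # database indices, most similar first
+
+rng = np.random.default_rng(0)
+database = rng.normal(size=(1000, 128))  # 1,000 stored 128-dimensional embeddings
+query = rng.normal(size=128)             # embedding of the new picture
+print(rank_by_similarity(query, database)[:5])  # five closest candidates
+```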
+
+If we've trained the network correctly, and there is another picture of the same
+person in the database, the closest embedding from the database will match a
+picture of the same person.
+
+
+
+This reduces the problem of face recognition to identifying the stored image
+whose embedding is closest to the embedding of the query image, and then deciding
+if they are close enough to be the same person.
+
+This same logic is used for many other problems, like multimodal information
+retrieval. For example, suppose we have a database of pictures and we just want
+to retrieve the pictures of dogs. We solve this problem by constructing and
+co-training two models, one that takes images as input and one that takes
+text as input, both outputting embeddings in the same vector space.
+
+
+
+## Transfer learning and fine-tuning
+
+In the previous section, we discussed a neural network model that recognizes a
+few kinds of fruit. Let's say we now want to recognize vegetables instead of fruit.
+
+We could start all over with a new network and a new training dataset and train
+it all from scratch. However, one of the key discoveries that made large-model
+neural AI work is that we can take a network that has already learned
+to do a related task and retrain it for a new one, often in much less time and
+with much less new data.
+
+You can see how it might make sense to retrain a fruit-recognition neural network
+to recognize vegetables: It has already learned, indirectly, to pay attention to
+shapes and colors. The same basic set of features that it uses to recognize fruit
+would be used to recognize vegetables. If we gave the trained network pictures of
+vegetables, we might expect the results on the last hidden layer to look something
+like this:
+
+
+
+You can see that it's not *completely* random, that the fruit features are not so
+bad at separating one kind of vegetable from another, but they're not so good either.
+A tomato looks a bit like an apple but has a color something like a cherry, and
+carrots are the same color as oranges, more or less. Some of the things it might
+have learned to use to recognize fruits can help to recognize vegetables, but the
+result is far from ideal.
+
+If we stopped here, we would find that we can't just split this embedding space up
+into sections that match each type of vegetable without a lot of mistakes.
+
+Fine-tuning is what we do to take advantage of how a network has already *partly*
+learned a task. The technical term for this is *transfer learning*. If the model
+can start learning the new task with what it already knows, it can learn faster and
+better and from much less data.
+
+To do this, first, we delete the output layer that identifies fruit and add a new
+output layer to identify vegetables.
+
+Now, we have some choices about how we want to fine-tune:
+
+## Full supervised training
+
+Sometimes fine-tuning is done with the same kind of training used with a new,
+untrained model: Examples of input data and known correct output vectors, typically
+called *ground truth*. This is always an option, and depending on the problem
+and the quantity and quality of the new training data, it can be a good approach.
+
+To make this work in our example, we remove the output layer from the fruit
+recognition model and replace it with a new one for vegetables:
+
+
+
+Then, we just train the network normally.
+
+Another way is to leave the hidden layers, trained for fruit, completely alone,
+then add some new hidden layers so that the model looks like this:
+
+
+
+When you leave the weights of the hidden layers unchanged, this is called
+*freezing* those layers. You can then try to train the new hidden layers to
+correctly classify the vegetables.
+
+This is much faster and requires less data than trying to retrain the
+already-trained hidden layers. However, in many cases, it just can't work.
+
+The model can only accurately classify the vegetables if there is enough
+information in the last hidden layer of the fruit recognition network to
+correctly classify all the vegetable images. Since that network was not
+trained to recognize vegetables, it may not have learned all the features
+necessary to classify vegetables, even if it did learn some of them. Freezing
+layers sometimes reduces the ability to take proper advantage of transfer
+learning since some of what has been learned in the hidden layers is obscured
+in the last hidden layer.
+
+## Contrastive Learning
+
+There is an alternative to full supervised training, one that takes
+advantage of how we can give neural models a geometric interpretation.
+
+Instead of having pairs of inputs and correct output vectors, we can learn
+from pairs of inputs alone, as long as we know whether they belong closer
+together or further apart than where the pre-trained model places them. This
+is called *contrastive learning*.
+
+Using the example from above of fine-tuning a model that recognizes fruits
+to recognize vegetables, we can see that there's an example of a cucumber and
+a tomato whose vector representations, in the last hidden layer, are almost
+the same:
+
+
+
+What we do in contrastive learning is to adjust the weights so that these two
+examples will be a bit further apart.
+
+We can also do the same for examples that are far apart and should be closer
+together. For example, these two examples of tomatoes are very far apart, and
+we should adjust the weights to make them closer together.
+
+
+
+When we do this little by little, over and over again, with many pairs of
+examples, we train the model to do the recognition task we want it to do. This
+typically involves much less new training than training from scratch and
+usually less than using explicit ground truth input-output pairs.
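+
+As a sketch of one common form of this idea, here is the classic margin-based
+contrastive loss written in PyTorch (the margin value and tensor names are
+illustrative; libraries typically provide ready-made variants of this loss):
+
+```python
+import torch
+import torch.nn.functional as F
+
+def contrastive_loss(emb_a, emb_b, is_same, margin=1.0):
+    # is_same is 1.0 for pairs that belong together, 0.0 for pairs that don't.
+    dist = F.pairwise_distance(emb_a, emb_b)
+    pull = is_same * dist.pow(2)                           # pull matching pairs closer
+    push = (1.0 - is_same) * F.relu(margin - dist).pow(2)  # push others at least `margin` apart
+    return (pull + push).mean()
+```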
+
+## Triplet loss methods
+
+Another approach to learning is to use *triplet loss*. In this approach, we
+don't need explicitly labeled data, and we don't have to measure how
+close or far apart the output vectors of different items are.
+
+This learning technique uses a similar set of principles to contrastive
+learning: We slowly move together output vectors that should be closer
+together and move apart output vectors that should be further apart. But in
+this case, we can do this without identifying which ones are too close
+together or too far apart.
+
+We choose an input, called an *anchor*, and then we choose one that is similar
+to it, called the *positive input*, and one that is dissimilar, called the
+*negative input*. In the example below, we've chosen a bell pepper image as
+our anchor, another bell pepper as positive input, and a zucchini image as
+negative input:
+
+
+
+Then, we adjust the weights to move the output vector for the positive input a
+bit closer to the anchor and the output vector of the negative input further
+from the anchor. When we do this over and over, with many triplets, the network
+should learn the desired recognition task.
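+
+A minimal PyTorch sketch of this procedure's loss (the margin is illustrative;
+`torch.nn.TripletMarginLoss` provides a built-in equivalent):
+
+```python
+import torch.nn.functional as F
+
+def triplet_loss(anchor, positive, negative, margin=0.5):
+    d_pos = F.pairwise_distance(anchor, positive)  # anchor <-> positive
+    d_neg = F.pairwise_distance(anchor, negative)  # anchor <-> negative
+    # The loss falls to zero once the positive is at least `margin`
+    # closer to the anchor than the negative is.
+    return F.relu(d_pos - d_neg + margin).mean()
+```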
+
+## Categorical Learning
+
+Contrastive learning methods compare individual data items, adjusting weights
+gradually to move the output vectors from a semi-random spread into a more
+compact and organized form that keeps the vectors for things that belong
+together close to each other and far from things that don't belong together.
+We want the hidden layers to maximize the separation of things that belong to
+different classes.
+
+In short, we want to go from an embedding space that looks like this:
+
+
+
+To something more like this:
+
+
+
+Instead of contrasting individual data items that we know are similar or different
+in some measure, in a case like this where we know what category each data item
+belongs to, we can compare the categories as a whole to fine-tune the model.
+
+For example, we can find a partition of the embedding space that maximally
+separates one class from the others. We call this the *decision plane* because it
+is usually a multidimensional plane in the embedding space. Projected into two
+dimensions to make it easier to visualize, it is something like the teal line below,
+which separates most of the cucumbers from the other vegetables:
+
+
+
+There are six non-cucumbers on the cucumber side of the decision plane and six
+cucumbers on the non-cucumber side. We can then adjust the weights to budge the
+examples that are on the wrong side -- and the ones on the right side but close to
+it -- in the direction that separates the cucumbers from the other vegetables, and
+we do so in proportion to how far they are from the decision plane:
+
+
+
+When repeated over and over for each category, the embedding vector space will
+become organized so that the vegetables cluster together. This approach is
+sometimes called *softmax loss*, and it's a common technique that replicates
+much of what would happen if we just added a new classification layer to the
+network and trained it traditionally, but without going to all the added trouble.
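+
+In practice, this is often implemented as a linear layer on top of the
+embeddings followed by a softmax cross-entropy loss; the rows of the linear
+layer play the role of the decision planes. A sketch, with illustrative
+dimensions:
+
+```python
+import torch
+import torch.nn.functional as F
+
+classifier = torch.nn.Linear(128, 5)  # 128-dim embeddings, 5 vegetable classes
+
+def softmax_loss(embeddings, labels):
+    # Cross-entropy nudges each embedding toward its class's side of the planes.
+    return F.cross_entropy(classifier(embeddings), labels)
+```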
+
+Another common technique is *center loss*, where instead of calculating the
+optimal partition for each class, we calculate the centroid of each class,
+i.e., the mean of its members' embeddings, the point whose average distance
+to every member of the class is smallest. For example, the
+centroid of the tomatoes in our example:
+
+
+
+Then, we adjust the weights so that each tomato image has an embedding closer
+to the centroid, and everything else has an embedding further away:
+
+
+
+We repeat this for each category, recalculating the centroid each time, and we
+get the same ultimate result: An embedding space in which embeddings for the
+same vegetables cluster together.
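+
+A sketch of the center-loss idea in PyTorch (in real systems, the centroids are
+usually updated incrementally rather than recomputed from scratch each step):
+
+```python
+import torch
+
+def class_centroids(embeddings, labels, num_classes):
+    # The centroid of a class is just the mean of its members' embeddings.
+    return torch.stack([embeddings[labels == c].mean(dim=0)
+                        for c in range(num_classes)])
+
+def center_loss(embeddings, labels, centroids):
+    # Squared distance from each embedding to its own class centroid.
+    return (embeddings - centroids[labels]).pow(2).sum(dim=1).mean()
+```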
+
+There are a number of other techniques that you can use to fine-tune AI models
+and a large collection of empirical results suggesting which techniques are
+best suited to which problems, but all of them are variants of the same
+underlying logic: Algorithms that will slowly adjust the embedding layer of
+your model to make the things that belong together closer to each other, and
+farther from the things that do not belong together.
+
+Whatever knowledge the model already has is reflected in the distribution of
+its embeddings. If it already has some of the knowledge it needs to perform a
+specific task, this is reflected in faster fine-tuning with fewer examples,
+leading to better performance after fine-tuning.
\ No newline at end of file
diff --git a/docs/intro/what-are-neural-networks.md b/docs/intro/what-are-neural-networks.md
new file mode 100644
index 000000000..bca0e0b9e
--- /dev/null
+++ b/docs/intro/what-are-neural-networks.md
@@ -0,0 +1,374 @@
+# {octicon}`light-bulb` Neural Networks
+
+Neural Networks are a family of computational techniques for data analysis
+and machine learning that underlie most of the recent breakthroughs in
+artificial intelligence. Although neural networks have a history dating
+back to the 1940s, they have come to almost completely dominate AI
+research in the last decade due to a collection of improvements in
+scaling and training techniques that are sometimes labelled as
+*Deep Learning*.
+
+Some people insist on calling them *artificial neural networks*, to distinguish
+them from the nervous systems of animals and people. Although these techniques
+were originally inspired by biology, neural network technology has progressed
+entirely independently of advances in neurology, and we know today that
+biological cognition does not work the way artificial neural networks do. The
+term "neural network" itself is almost exclusively used by computer
+professionals and rarely, if ever, by neurologists, so the addition of
+"artificial" is superfluous.
+
+We also sometimes call them "neural models", "machine learning models",
+"AI models", or, when the context is clear enough, just "models". These terms
+are not exactly synonyms, but in present-day usage, they are largely
+interchangeable.
+
+Neural networks have some key properties that make them so important to
+machine learning and artificial intelligence:
+
+1. Neural networks can, in principle, perform almost any kind of data
+ transformation, if they have the right configuration and are large enough.
+ Any problem that requires consistent mapping from digital inputs to digital
+ outputs can _theoretically_ be performed by a neural network. Whether a neural
+ network to do a particular task can _in practice_ be constructed depends on
+ many other factors, but in theory, their scope of application is almost
+ unlimited.
+2. Neural networks are highly parallelizable. They scale linearly both in time
+ and in computer resources. If you want your model to run in half as much time,
+ you can usually do that just by devoting twice as many computer cores to it.
+3. Neural networks can learn to generalize from examples and sometimes find
+   good solutions to problems where even humans don't have a good solution.
+ They can genuinely learn from the data presented to them, sometimes better
+ than humans can, by some measures, but they do not learn in the same way
+ that humans do.
+
+To explain how all this works, we need to use some linear algebra.
+
+
+
+The only mathematical operations going on here are addition, subtraction,
+multiplication, and in one spot, division and an exponent. There is no calculus
+here.
+
+If this is still too much math, we recommend the video presentation from the
+YouTube channel [3Blue1Brown](https://www.youtube.com/@3blue1brown), which
+covers broadly the same topic in about an hour with less mathematical
+notation and more animated graphics:
+* [Chapter 1](https://youtu.be/aircAruvnKk)
+* [Chapter 2](https://youtu.be/IHZwWFHWa-w)
+* [Chapter 3](https://youtu.be/Ilg3gGewQ5U)
+* [Chapter 4](https://youtu.be/tIeHLnjs5U8) (This part goes into the underlying
+calculus in depth, and you can skip it if you don't want to do more math than
+you need to.)
+
+3Blue1Brown's presentation goes into considerably more depth than this page
+does, and covers topics that are not especially necessary to understand
+fine-tuning, but is an excellent beginner's presentation of how neural networks
+work.
+
+## Vectors
+
+Neural networks work using a common abstraction: Many real-world problems can
+be recast as vector transformation problems, and a neural network is, in its
+most generic form, a scheme for translating vectors from one vector space to
+another.
+
+[Vectors](https://en.wikipedia.org/wiki/Vector_(mathematics_and_physics)) are
+just ordered lists of numbers. We often write vector variables with a small arrow
+over them:
+
+
+
+This just means that the vector `v` consists of the list of numbers `[v₁, v₂, ..., vₙ]`.
+
+Vectors correspond to points in a high-dimensional metric space called a
+_vector space_. This means that:
+
+- A vector corresponds to a single, unique point, and each point corresponds to a
+ unique vector.
+- If two vectors are the same, the points they correspond to in their vector space
+ are the same.
+- There are functions called *metrics* or *distance functions* that measure the
+ distance between any two vectors in the same vector space. If two vectors are
+ the same, the distance between them is zero. If two vectors are not the same,
+ the distance is some amount greater than zero. Distances are never negative.
+
+Please read our [brief refresher on vectors](../intro/brief-refresher-on-vectors)
+if you are not already familiar with the concept.
+
+What is essential about vectors for understanding neural networks is this:
+
+> Computer data, like a vector, is just a sequence of numbers. So any digital
+> information can be treated like a vector just by calling it one!
+
+If we look at it that way, any problem that involves taking some finite amount
+of computer data as input and turning it into some other computer data as output
+is equivalent to mapping vectors in some vector space into other vectors in another
+(or possibly the same!) vector space. In theory, all data transformation
+problems can be solved by some neural network. The only fundamental requirement
+for a neural network to work is that we be able to express the problem we want
+to solve as a mapping from some data to some other data.
+
+This makes neural networks a very general-purpose problem-solving technique.
+It is not necessarily the best way to address all problems, but it is an
+effective way to address a great many problems.
+
+## How Neural Networks Work
+
+The image below is a very schematized picture of a small neural network:
+
+
+
+This particular network maps vectors with three values `[x, y, z]` to vectors
+with two values `[a, b]` -- i.e., it transforms inputs in a three-dimensional
+vector space into vectors in a two-dimensional one. It has two "hidden" layers,
+which are also vectors; in this case, each has four values. Neural networks can
+have any size and in principle map vectors of any size to other vectors of any
+size, using any number or configuration of hidden layers, but we are going to
+use the one pictured above for this example.
+
+To see how this works at the lowest level, let's take a look at just the input
+layer and the first hidden layer:
+
+
+
+Each connection between the inputs `[x, y, z]` and the first hidden layer
+`[m₁, m₂, m₃, m₄]` has a weight: `x` is connected to `m₁` with a weight of `w₁₁`,
+to `m₂` with a weight of `w₁₂`, etc., up to `z` connecting to `m₄` with
+a weight of `w₃₄`. Recall that the inputs `x, y, z` are just numbers from
+the input vector, and the weights `w₁₁, w₁₂, ..., w₃₄` are also just numbers.
+
+So we calculate the value of `[m₁, m₂, m₃, m₄]` by multiplying `[x, y, z]` by the
+weights and then adding them up:
+
+
+
+In linear algebra notation, it looks like this:
+
+
+
+
+
+Notation like this is easier to read, but it denotes the same thing:
+We just multiply and add a bunch of numbers.
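+
+As a minimal sketch in NumPy (the weight values here are made up purely for
+illustration):
+
+```python
+import numpy as np
+
+W = np.array([[ 0.2, -0.5,  0.1],   # weights into m1
+              [ 0.7,  0.3, -0.2],   # weights into m2
+              [-0.4,  0.6,  0.9],   # weights into m3
+              [ 0.1, -0.8,  0.3]])  # weights into m4
+
+x = np.array([1.0, 2.0, -1.0])      # the input vector [x, y, z]
+m = W @ x                           # [m1, m2, m3, m4], before any activation
+```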
+
+Most neural networks then apply what is called an *activation function* to
+the values of each hidden layer before using it to calculate the next layer.
+The activation function is usually a threshold or a function that acts like
+a threshold. A threshold means that after calculating the value of `[m₁, m₂, m₃, m₄]`,
+we check if each one is greater or less than some threshold value, which
+we'll write as `θ` (the Greek letter theta). If it's less than `θ`, then
+we set the value to zero and if it's more, we set it to one.
+
+On paper, it looks like this:
+
+
+
+We often use the Greek letter σ to designate activation functions. So, in
+more compact linear algebra notation, we write this as:
+
+
+
+
+The value `θ` is also called a _bias_, especially when it has a value other than zero,
+and you will often see it discussed under that name.
+
+Traditionally, most neural networks were trained using a
+[sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function), which is like
+a threshold, but instead of suddenly going from zero straight to one, it transitions more
+smoothly around the threshold value. This has mathematical advantages and sometimes has
+practical ones.
+
+For completeness, this is the equation for a sigmoid activation function. It is close to
+(but not quite) zero for very negative values and close to (but not quite) one for very high
+numbers, and exactly `0.5` if the value is zero.
+
+
+
+Nowadays, we usually use variants of the
+linear rectifier function (commonly called
+[ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))) instead of a
+simple threshold or the sigmoid function. A rectifier is a function that returns the
+part of a value greater than some threshold, or else zero:
+
+
+
+Or:
+
+
+
+
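+In code, these three kinds of activation function look like this (a NumPy
+sketch):
+
+```python
+import numpy as np
+
+def step(v, theta=0.0):
+    # Hard threshold: 0 below theta, 1 above it.
+    return (v > theta).astype(float)
+
+def sigmoid(v):
+    # Smooth version of the threshold, exactly 0.5 at v = 0.
+    return 1.0 / (1.0 + np.exp(-v))
+
+def relu(v, theta=0.0):
+    # Rectifier: the part of the value above theta, otherwise zero.
+    return np.maximum(v - theta, 0.0)
+```
+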
+Many of the largest language models in recent years use a variant of the
+rectifier function called GELU or _Gaussian Error Linear Units_. Which
+activation function to use in which contexts is an empirical aspect of machine
+learning, and this page will not go into depth about activation functions.
+
+What is essential to know about activation functions is that they always ensure
+that if the values going into a node of a layer go up, the value of the node
+either stays the same or goes up; and similarly, if the values going into a node
+go down, then the value stays the same or goes down.
+
+We follow the same procedure to calculate the values of the second hidden layer by
+applying weights to the values of the first hidden layer, and then the activation
+function.
+
+
+
+
+And then to calculate the output layer, we multiply the last hidden layer by
+another matrix of weights:
+
+
+
+
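+Putting it all together, the whole example network fits in a few lines of
+NumPy (random weights stand in for trained ones):
+
+```python
+import numpy as np
+
+def relu(v):
+    return np.maximum(v, 0.0)
+
+def forward(x, W1, W2, W3):
+    m = relu(W1 @ x)  # first hidden layer: weights, sums, activation
+    n = relu(W2 @ m)  # second hidden layer: same procedure
+    return W3 @ n     # output layer: one more weight matrix
+
+rng = np.random.default_rng(0)
+W1 = rng.normal(size=(4, 3))  # 3 inputs -> 4 hidden nodes
+W2 = rng.normal(size=(4, 4))  # 4 hidden -> 4 hidden
+W3 = rng.normal(size=(2, 4))  # 4 hidden -> 2 outputs
+print(forward(np.array([1.0, 2.0, -1.0]), W1, W2, W3))
+```
+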
+There are also some other things that people sometimes do between hidden layers,
+the most important of which are [regularization techniques](https://en.wikipedia.org/wiki/Regularization_(mathematics)).
+Although essential to neural network engineering and design, we don't need to
+discuss these subjects here to make sense of fine-tuning.
+
+The essential part is to understand how input vectors are transformed into
+output vectors by multiplying them by weight matrices, summing, running
+activations and doing that over and over, until you've produced a final output
+vector.
+
+This example network is very small, far too small to do anything very useful. It
+has only eight nodes in the hidden layers, and a total of 36 weights (3 x 4 + 4
+x 4 + 4 x 2 = 36). Even for this small network, writing out all the times we
+multiply, sum up, and compare to thresholds is quite tedious. This is the kind of
+work computers are best suited for!
+
+By comparison, a general-purpose vision processing model like ViT-L/16 takes as
+input color images with a resolution of 224x224 pixels. This means it takes an
+input vector of approximately 150,000 values. It uses 24 hidden layers and has
+roughly 300 million weights (the "/16" in its name refers to the 16x16-pixel
+patches it divides input images into). State-of-the-art
+language models are even larger - GPT-3 has roughly 175 billion weights - but
+the principles behind how they work remain broadly the same.
+
+## How Neural Networks Learn
+
+A neural network learns by being trained, and that training consists of being
+exposed to examples of what it is supposed to do. The examples - called
+_training data_ - are pairs of input vectors and the output vectors we expect to
+get when we pass it those input vectors.
+
+The training process itself follows a simple scheme:
+
+1. Enter into the neural network each input vector from the training data.
+2. Calculate the resulting output vector. This output vector is going to have
+ some error - it will not be the correct output vector.
+3. Calculate the distance between the output vector you got from the network
+ and the correct output vector from the training data. To do this, use some
+ distance metric, most often _Euclidean distance_ or _cosine distance_
+ (described in the
+   [Brief Refresher on Vectors](../intro/brief-refresher-on-vectors)).
+4. Adjust the weights in the neural network just a little bit, proportionate to
+ the distance calculated in the previous step, so that the next time, the output vector
+ will be a little bit closer to the correct value.
+
+This is done over and over and over, many times, with different example input-output
+pairs, adjusting the weights to make the neural network model produce output
+vectors closer and closer to the correct ones. We adjust the amount that we change
+the weights in proportion to how far from correct the output is, so that when we are
+close to correct, we make much smaller changes. This means that eventually, we will be
+producing output vectors that are very close to the correct ones, and making only very
+tiny changes to the weights. When we are satisfied that the network is no longer getting
+better - that the output vectors aren't getting any closer to the correct ones - or we
+decide that the network has gotten good enough, we stop training.
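+
+The following toy example runs this exact loop on a single-layer network with
+a squared-error distance (the data and learning rate are made up for
+illustration, but the code runs as-is):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+W = rng.normal(size=(2, 3))           # random initial weights: 3 inputs, 2 outputs
+
+# Toy training data: inputs and the outputs we want for them.
+X = rng.normal(size=(100, 3))
+W_true = np.array([[1.0, -2.0,  0.5],
+                   [0.3,  0.8, -1.0]])
+Y = X @ W_true.T
+
+lr = 0.01                             # how big each weight adjustment is
+for epoch in range(500):
+    for x, y in zip(X, Y):
+        out = W @ x                   # steps 1-2: run the network
+        error = out - y               # step 3: how far off we are
+        W -= lr * np.outer(error, x)  # step 4: adjust the weights a little
+```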
+
+There are a number of procedures for
+calculating how to adjust the weights, but the traditional way is called
+*back-propagation*, and it's an application of gradient descent, the same kind
+of optimization used to fit statistical regression models.
+
+There are some additional techniques used in training to enhance robustness and
+speed up processing, and there are other algorithms for adjusting the weights
+that are sometimes used in AI model training. Nonetheless, the purpose is
+always the same: To induce the weights to converge on values that optimally
+produce the expected outputs. Over time, the network learns to translate input
+vectors into output vectors in a way that should generalize to new data and to
+do a good job of the task expected of it.
diff --git a/docs/intro/what-is-finetuner.md b/docs/intro/what-is-finetuner.md
index 73acd456d..fc53824a0 100644
--- a/docs/intro/what-is-finetuner.md
+++ b/docs/intro/what-is-finetuner.md
@@ -1,5 +1,104 @@
-(objectives)=
# {octicon}`light-bulb` What is Finetuner?
-Here should be the Finetuner for MBAs Guide:
-https://www.notion.so/Finetuner-for-MBAs-7465074b64d84486af4204d3f6673f74
\ No newline at end of file
+
+
+The latest generation of large AI models does many very impressive things.
+Anyone with access to the internet can try out talking with ChatGPT or making pictures with DALL-E or MidJourney.
+However, so far, very few businesses have been built around them.
+This technology is very new, and it may take some time to see real productivity growth from them.
+In the meantime, many existing businesses are looking at places in their own operations where they might usefully deploy AI.
+
+Large AI models have been trained on large databases of unspecialized data, often texts and images taken from the internet with minimal filtering.
+This gives them impressive-seeming abilities over a potentially unbounded array of scenarios.
+However, when applied to specific problems, their performance can be disappointing.
+Whatever your application is, most of the functionality of a large AI model is useless to you, while the functionality that is useful to you is not directly designed to address your use case.
+
+Building a whole new AI model on the scale of recent breakthroughs, focused on just your business interests, is not very practical.
+The major AI companies don't usually disclose how much time, energy, or money it takes to train their state-of-the-art models.
+Estimates start in the millions of euros, with [some claiming GPT-4 cost over 100 million US dollars to train](https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/).
+Even for the best-funded businesses, it is a challenge to find and retain sufficient engineering talent with experience in AI and another challenge to obtain the vast quantities of data required to train these models.
+Successfully training very large models from scratch may take months on even the largest and fastest clusters at a cost of potentially millions of euros.
+And this is before even addressing error handling and integration costs.
+
+There is another way.
+
+## Pre-training
+
+One of the breakthroughs that led to today's impressive new AI models is the discovery that it often requires much less time and data to adapt an already trained model to a new task than to train a new one from scratch.
+Techniques for "pre-training" networks - training them to do tasks that are indirectly related to your use case but for which there is ample data and easy-to-assess goals - are a big part of how we can create these large models.
+For example, large language models are typically pre-trained using large quantities of text, with the training goal of filling in the blanks from the surrounding context.
+For instance:
+
+> *On Saturday, New Jersey Gov. Phil Murphy _________ that Sept. 23 will _______ be declared "Bruce Springsteen Day" in ______ of the singer's birthday.*
+>
+
+By learning to do this task - one with little value by itself - AI models learn a great deal about the relationships between words.
+Then, after the model has learned as much as it can from this task, developers train it to do more sophisticated things, like answering questions or holding conversations.
+
+Image-processing AI models are pre-trained using similar techniques.
+Typically, this means training the model to fill in blanked-out squares in pictures, or distorting or adding noise to images and training it to "fix" them.
+This teaches the model a lot about what kinds of things appear in pictures and what they should look like.
+
+
+*The Mona Lisa with a section cut out.*
+
+*The Mona Lisa with added blurring.*
+
+AI models that have already learned some relevant things are much easier to train than ones that start without knowing anything.
+This is called *transfer learning*, and it's a very intuitive idea.
+It's much easier to teach someone who knows how to drive a car to safely drive a large truck than to teach someone who has never driven at all.
+AI models work in a similar way.
+
+## Fine-tuning
+
+You can take advantage of transfer learning to adapt AI to your business and your specific use cases without making a multi-year, multi-million euro investment.
+The impressive performance of large AI models on general tasks means that they need much less data and training time to learn to do well at some specific task.
+
+For example, let's consider an online fashion retailer that offers shoppers a search function.
+There are AI models that match pictures to descriptions, but they are trained on a very wide variety of images.
+It does the retailer no good to have an AI that can efficiently identify pictures of dogs and cats when it needs an AI that can tell the difference between jeans and dress slacks or between an A-line skirt and a pleated skirt.
+
+An AI model that can recognize thousands of different objects has already learned how to make fine distinctions based on the features of images.
+Training it to know the difference between chinos and cargo pants takes relatively few examples compared to training an AI model that doesn't even know what pants are!
+
+This kind of training is called *fine-tuning* because most of what it does is not really learning new things but learning to focus the things that the model already knows on the specific tasks you have in mind.
+
+Although performance is almost always going to be better the more training data you have, we have found that [mere hundreds of items of training data are enough to get most of the gains](https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/) from fine-tuning in many highly realistic test cases.
+Acquiring a few hundred examples is a much cheaper proposition than acquiring the millions to billions of data items used to train the original model.
+
+This makes fine-tuning an extremely attractive value proposition for anyone looking to integrate AI into their business.
+
+## Finetuner from Jina AI
+
+Jina AI's Finetuner is a flexible and efficient cloud-native solution for fine-tuning state-of-the-art AI models.
+It provides an intuitive Python interface that can securely upload your data to our cloud and return a fine-tuned model without requiring you to invest in any special hardware or tech stack.
+Our solution includes testing and evaluation methods that quantify performance improvements and provide intuitive reports, so you can see directly how much value fine-tuning adds.
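+
+A run can look roughly like this in Python (the model and dataset names below are illustrative; see the [documentation](https://finetuner.jina.ai/) for the exact interface):
+
+```python
+import finetuner
+
+finetuner.login()  # authenticate with Jina AI Cloud
+
+run = finetuner.fit(
+    model='resnet50',                    # a pre-trained model to fine-tune
+    train_data='my-fashion-train-data',  # training data previously pushed to the cloud
+)
+print(run.status())  # training runs in the cloud; check back until it finishes
+```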
+
+It is impossible for us - or anyone else - to guarantee a specific outcome from a machine learning process.
+However, we are unaware of any case in which fine-tuning with properly representative data did not improve the task-specific performance of an AI model.
+
+We strive to keep the Finetuner as intuitive and easy to use as possible, but we understand that machine learning can be an inherently messy process involving choices that no software can ever automate.
+For this reason, we offer [Finetuner+](https://finetunerplus.jina.ai/), a service where our engineers collaborate with you in planning and executing AI integration.
+We can:
+
+- Advise you on selecting an AI model appropriate to your use case.
+- Help you to acquire and prepare training data for the Finetuner.
+- Manage and evaluate Finetuner runs for you.
+- Assist you in integrating the resulting AI models into your business processes.
+
+Furthermore, if your data security needs are very strict, we can assist you in setting up and running Finetuner on-site at your company.
+
+If you are looking at integrating AI into your business, fine-tuning is essential to getting the most value out of your AI investments.
+Jina AI can help with software, services, and consultants for organizations of all kinds, sizes, and technical experience.
\ No newline at end of file