Change slightly the DLRM tutorial #1517

Open
wants to merge 2 commits into
base: master
32 changes: 16 additions & 16 deletions tutorials/DLRM_Tutorial.ipynb
@@ -13,20 +13,20 @@
"source": [
"This tutorial shows how to apply a model interpretability library, Captum, to a deep learning recommender model (DLRM).\n",
"\n",
"More about the DLRM achitecture and usage can be found here: https://github.com/facebookresearch/dlrm\n",
"More about the DLRM achitecture and usage can be found here: https://github.com/facebookresearch/dlrm.\n",
"\n",
"For our experiments we used Criteo's traffic over a period of 7 days. The dataset is also available on kaggle for download: https://www.kaggle.com/c/criteo-display-ad-challenge We pre-trained DLRM model using 39M Ads from Criteo dataset. From feature importance calculation perspective we used a small fraction of preprocessed data.\n",
"For our experiments we used Criteo's traffic over a period of seven days. The dataset is also available on kaggle for download: https://www.kaggle.com/c/criteo-display-ad-challenge. We pre-trained a DLRM model using 39M Ads from the Criteo dataset. From a feature importance calculation perspective, we used a small fraction of preprocessed data.\n",
"\n",
"In this tutorial we aim to answer the following questions:\n",
"\n",
"1. Which input features are essential in predicting clicked and non-clicked Ads ?\n",
"2. What is the importance of the interaction layer ?\n",
"3. Which neurons are important for predicting Clicked Ads in the last fully-connected layer ?\n",
"3. Which neurons are important for predicting clicked Ads in the last fully-connected layer ?\n",
"4. How can neuron importance help us to perform model pruning.\n",
"\n",
"1st, 2nd and 3rd points are also visualized in the diagram below.\n",
"The first, sceond and third questions are also visualized in the diagram below.\n",
"\n",
"Note: Please, run this tutorial in a GPU environment. It is most probably going to fail in a CPU environment."
"Note: Please run this tutorial in a GPU environment. It is most probably going to fail in a CPU environment."
]
},
{
@@ -160,7 +160,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's download pre-trained dlrm model from AWS S3 bucket."
"Let's download a pre-trained DLRM model from an AWS S3 bucket."
]
},
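As a rough sketch of this step, the download-and-load could look like the following; the bucket URL and file name here are placeholders, not the actual location used in the notebook's code cell.

```python
import torch

# Placeholder URL and file name -- the real S3 location is given in the notebook cell below.
MODEL_URL = "https://example-bucket.s3.amazonaws.com/dlrm_criteo_pretrained.pt"
MODEL_PATH = "dlrm_criteo_pretrained.pt"

# Download the checkpoint and load it onto the GPU (the tutorial assumes a GPU environment).
torch.hub.download_url_to_file(MODEL_URL, MODEL_PATH)
checkpoint = torch.load(MODEL_PATH, map_location="cuda")
```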
{
@@ -196,11 +196,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the actual test dataset is pretty large and requires pre-processing, we preprocessed a small portion of it and stored as batches in two 'pt' files so that it is easy for us to work with them. The first 'pt' file, `X_S_T_test_above_0999`, contains 497 samples that are predicted as `Clicked` Ads with a high prediction score, larger than 0.999. The second 'pt' file, X_S_T_test, contains, 1100 samples, Ads, that aren't conditioned on the prediction scores.\n",
"Since the actual test dataset is pretty large and requires preprocessing, we preprocessed a small portion of it and stored as batches in two 'pt' files so that it is easier for us to work with. The first 'pt' file, `X_S_T_test_above_0999`, contains 497 samples that are predicted as `Clicked` Ads with a high prediction score, larger than 0.999. The second 'pt' file, `X_S_T_test`, contains 1100 samples, Ads that aren't conditioned on the prediction scores.\n",
"\n",
"The reason why we separated the samples in two groups is that in our analysis we often want to understand most salient features for the Ads that are predicted as `Clicked` with a high prediction score, close to 1.0, vs to the Ads that have mixed prediction scores (some are high and some low).\n",
"The reason why we separated the samples in two groups is that in our analysis we often want to understand most salient features for the Ads that are predicted as `Clicked` with a high prediction score, close to 1.0, versus to the Ads that have mixed prediction scores (some are high and some low).\n",
"\n",
"Below, we load both files, so that we can perform model interpretabily for those pre-processed subsets of data.\n",
"Below, we load both files, so that we can perform model interpretabily for those preprocessed subsets of data.\n",
"\n"
]
},
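A minimal sketch of the loading step, assuming the two files described above are stored with a `.pt` extension in the working directory:

```python
import torch

# 497 Ads predicted as `Clicked` with a prediction score larger than 0.999.
batch_high_score = torch.load("X_S_T_test_above_0999.pt")

# 1100 Ads that are not filtered by prediction score.
batch_mixed = torch.load("X_S_T_test.pt")
```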
@@ -218,7 +218,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Re-defining forwad pass for the DLRM model so that it accepts sparse embeddings instead of feature indices and offsets. This is done this way because `apply_emb` cannot be easily replaced by model hooks. https://github.com/facebookresearch/dlrm/blob/52b77f80a24303294a02c86b574529cdc420aac5/dlrm_s_pytorch.py#L276 "
"Redefining forward pass for the DLRM model so that it accepts sparse embeddings instead of feature indices and offsets. This is done this way because `apply_emb` cannot be easily replaced by model hooks. https://github.com/facebookresearch/dlrm/blob/52b77f80a24303294a02c86b574529cdc420aac5/dlrm_s_pytorch.py#L276."
]
},
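The redefinition could look roughly like the sketch below. The method and attribute names (`apply_mlp`, `interact_features`, `bot_l`, `top_l`) are taken from `dlrm_s_pytorch.py` and should be treated as assumptions about that implementation.

```python
def forward_with_embeddings(dlrm, dense_x, sparse_embeddings):
    # Bottom MLP over the 13 dense features.
    x = dlrm.apply_mlp(dense_x, dlrm.bot_l)
    # Pairwise dot-product interactions between the dense representation and
    # the pre-computed sparse embeddings (instead of calling apply_emb here).
    z = dlrm.interact_features(x, sparse_embeddings)
    # Top MLP producing the click probability.
    return dlrm.apply_mlp(z, dlrm.top_l)
```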
{
@@ -251,7 +251,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's extract individual features for each sample from both batches of data. Each sample is represented through dense and sparse features. In this example `X_test` represents dense features. `lS_o_test` and `lS_i_test` represent sparse features. `lS_o_test` represents the offset of each sparse feature group and `lS_i_test` the index. More details about it can be found here: https://github.com/facebookresearch/dlrm/blob/52b77f80a24303294a02c86b574529cdc420aac5/dlrm_s_pytorch.py#L276\n"
"Let's extract individual features for each sample from both batches of data. Each sample is represented through dense and sparse features. In this example `X_test` represents dense features. `lS_o_test` and `lS_i_test` represent sparse features. `lS_o_test` represents the offset of each sparse feature group and `lS_i_test` the index. More details about it can be found here: https://github.com/facebookresearch/dlrm/blob/52b77f80a24303294a02c86b574529cdc420aac5/dlrm_s_pytorch.py#L276.\n"
]
},
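A sketch of the unpacking step, under the assumption that each stored batch is a tuple of dense features, sparse offsets, sparse indices, and targets, matching the variable names used in the tutorial:

```python
# Assumed layout of each stored batch: (dense features, sparse offsets, sparse indices, targets).
X_test, lS_o_test, lS_i_test, T_test = batch_mixed

print(X_test.shape)     # (num_samples, 13) dense features
print(lS_o_test.shape)  # per-group offsets for the 26 sparse feature groups
print(lS_i_test.shape)  # embedding indices for the 26 sparse feature groups
```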
{
@@ -352,7 +352,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we visualize feature importance scores for 5 different Ads, color-coded in 5 different colors that were predicted as `Clicked` with 0.999 prediction score. X-axis corresponds to the input features and y-axis to the attribution scores. The first 13 features correspond to dense and the last 26 to sparse features. As we can see, the sparse features primarily contribute to `Clicked` predictions whereas dense, contribute to both `Clicked` and `Non-Clicked` predictions."
"Below we visualize feature importance scores for five different Ads, color-coded in five different colors that were predicted as `Clicked` with 0.999 prediction score. X-axis corresponds to the input features and y-axis to the attribution scores. The first 13 features correspond to dense and the last 26 to sparse features. As we can see, the sparse features primarily contribute to `Clicked` predictions whereas dense, contribute to both `Clicked` and `Non-Clicked` predictions."
]
},
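The plot described above can be reproduced with a sketch along these lines, using random scores as a stand-in for the actual attribution tensor:

```python
import torch
import matplotlib.pyplot as plt

# Stand-in for the attribution tensor: five Ads x 39 features (13 dense + 26 sparse).
attributions = torch.randn(5, 39)
feature_names = [f"dense_{i}" for i in range(13)] + [f"sparse_{i}" for i in range(26)]

fig, ax = plt.subplots(figsize=(14, 4))
for i in range(5):  # one color per Ad
    ax.plot(feature_names, attributions[i].numpy(), marker="o", label=f"Ad {i}")
ax.set_xlabel("Input feature")
ax.set_ylabel("Attribution score")
ax.tick_params(axis="x", rotation=90)
ax.legend()
plt.show()
```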
{
@@ -573,9 +573,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let’s look deeper into the interaction layer. More specifically, let’s examine how important, pairwise feature interactions in the output of the interaction layer, are. In the interaction layer we consider interactions between 27 16-dimensional feature representations, 26 corresponding to sparse and 1 to dense features. The last 16-dimensional dense representation is emerged after transforming 13 dense features into one 16-dimensional embedding vector. In the interaction layer we consider pairwise interactions of 27 features using dot products. This results to 27 x 26 x 0.5 = 351 pairwise interactions excluding self interactions. In the very end, 16-dimensional dense feature representation is being prepended to resulting interactions leading to 16 + 351 = 367 neurons in the output of second concatenation layer.\n",
"Now let’s look deeper into the interaction layer. More specifically, let’s examine the importance of pairwise feature interactions in the output of the interaction layer. In the interaction layer we consider interactions between 27 16-dimensional feature representations, 26 corresponding to sparse and 1 to dense features. The last 16-dimensional dense representation is emerged after transforming 13 dense features into one 16-dimensional embedding vector. In the interaction layer we consider pairwise interactions of 27 features using dot products. This results to 27 x 26 x 0.5 = 351 pairwise interactions excluding self interactions. In the very end, 16-dimensional dense feature representation is being prepended to resulting interactions leading to 16 + 351 = 367 neurons in the output of second concatenation layer.\n",
"\n",
"We use Layer Conductance algorithm to estimate the importance of all 367 neurons. "
"We use the Layer Conductance algorithm to estimate the importance of all 367 neurons. "
]
},
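A hedged sketch of the Layer Conductance call; which concrete module corresponds to the 367 interaction neurons (and whether to attribute at its input or output) depends on the DLRM implementation, so `wrapped_forward` and `interaction_layer` below are assumptions rather than names from the notebook.

```python
from captum.attr import LayerConductance

# Attribute the `Clicked` prediction to the 367 neurons produced by the interaction layer.
layer_cond = LayerConductance(wrapped_forward, interaction_layer)
neuron_importances = layer_cond.attribute(
    (dense_input, sparse_embeddings),  # inputs to the wrapped forward function
    n_steps=50,
)
print(neuron_importances.shape)  # expected: (num_samples, 367)
```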
{
@@ -602,7 +602,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The figure below demonstrates the importance scores of each neuron in the output of interaction layer. First 16 neurons have mixed contributions both to `Clicked` and `Non-Clicked` predictions. The following 351 interaction neurons either primarily contribute to `Clicked` or have no effect on the prediction. In fact we can see that many of those interactions have no effect on the prediction. This observations, however, are supported by 497 samples that are predicted as `Clicked` with a prediction score larger than 0.999. One might think that the samples might not be representative enough, however, even if we increase the sample size we still observe similar patterns. As an extension of this work one might think of performing statistical significance testing for random subsamples that are predicted as `Clicked` with high prediction score to make more convincing arguments.\n"
"The figure below demonstrates the importance scores of each neuron in the output of interaction layer. The first 16 neurons have mixed contributions both to `Clicked` and `Non-Clicked` predictions. The following 351 interaction neurons either primarily contribute to `Clicked` or have no effect on the prediction. In fact we can see that many of those interactions have no effect on the prediction. This observations, however, are supported by 497 samples that are predicted as `Clicked` with a prediction score larger than 0.999. One might think that the samples might not be representative enough, however, even if we increase the sample size we still observe similar patterns. As an extension of this work one might think of performing statistical significance testing for random subsamples that are predicted as `Clicked` with high prediction score to make more convincing arguments.\n"
]
},
{
@@ -820,7 +820,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can extend our analysis further and ablate the neurons that have zero contribution to the prediction or are negatively correlated with the ads prediction across all samples. According to the specific examples demonstrated above we can see that based on our sample size of 82(prediction score > 0.6), 24 neurons out of all 256 always demonstrate zero contribution to the prediction. If we ablate this neurons we can see that the False Negatives are reducing and overall Recall and F1 score of the model are increasing. Since this is a tutorial and measuring the accuracy and F1 scores on test data can be time consuming we do not demonstrate it here but the users are welcome to ablate those neurons based on the neuron importance scores and examine the difference in the Accurancy and F1 scores.\n",
"We can extend our analysis further and ablate the neurons that have zero contribution to the prediction or are negatively correlated with the ads prediction across all samples. According to the specific examples demonstrated above we can see that based on our sample size of 82 (prediction score > 0.6), 24 neurons out of 256 always demonstrate zero contribution to the prediction. If we ablate these neurons we can see that the False Negatives are reducing and overall Recall and F1 score of the model are increasing. Since this is a tutorial and measuring the accuracy and F1 scores on test data can be time consuming we do not demonstrate it here but the users are welcome to ablate those neurons based on the neuron importance scores and examine the difference in the Accuracy and F1 scores.\n",
"\n",
"Similar thinking can also be applied to the neurons that are always negatively correlated with the `Clicked` prediction."
]
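One way to try this, sketched below, is to zero out the chosen activations with a forward hook; `last_fc_layer` and the neuron indices are hypothetical placeholders for the layer and the zero-contribution neurons identified above.

```python
import torch

# Hypothetical indices of neurons whose importance scores were always zero.
zero_neuron_idx = torch.tensor([3, 17, 42])

def ablate_neurons(module, inputs, output):
    # Zero the activations of the selected neurons; returning a value from a
    # forward hook replaces the module's output.
    output = output.clone()
    output[:, zero_neuron_idx] = 0.0
    return output

handle = last_fc_layer.register_forward_hook(ablate_neurons)
# ... re-run the evaluation to compare Recall / F1 against the unablated model ...
handle.remove()
```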