Copyedit LTR notebook, add intro overview (#190)
leemthompo authored Feb 15, 2024
1 parent d4f3ca4 commit 9d9d275
Showing 1 changed file with 38 additions and 30 deletions.
68 changes: 38 additions & 30 deletions notebooks/search/08-learning-to-rank.ipynb
@@ -10,13 +10,20 @@
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb)\n",
"\n",
"In this notebook we will see an example on how to train a Learning To Rank model using [XGBoost](https://xgboost.ai/) and how to deploy it to be used as a rescorer in Elasticsearch.\n",
"\n",
"\n",
"**Notes about the Learning To Rank feature:**\n",
"- The Learning To Rank feature is available for Elastic Stack versions 8.12.0 and newer and requires a Platinum subscription or higher.\n",
"- The Learning To rank is experimental and may be changed or removed completely in future releases. Elastic will make a best effort to fix any issues, but experimental features are not supported to the same level as generally available (GA) features.\n",
" \n"
"In this notebook, we'll:\n",
"\n",
"- Connect to an Elasticsearch deployment using the official Python client.\n",
"- Import and index a movie dataset into Elasticsearch.\n",
"- Extract features from our dataset using Elasticsearch's Query DSL, including custom `script_score` queries.\n",
"- Build a training dataset by combining extracted features with a human curated judgment list.\n",
"- Train a Learning To Rank model using [XGBoost](https://xgboost.ai/).\n",
"- Deploy the trained model to Elasticsearch using [Eland](https://eland.readthedocs.io/en/latest/).\n",
"- Use the model as a rescorer for second stage re-ranking.\n",
"- Evaluate the impact of the LTR model on search relevance, by comparing search results before and after applying the model.\n",
"\n",
"> **NOTE:**\n",
"> - Learning To Rank is available for Elastic Stack versions 8.12.0 and newer and requires a Platinum subscription or higher.\n",
"> - Learning To Rank is experimental and may be changed or removed completely in future releases. Elastic will make a best effort to fix any issues, but experimental features are not supported to the same level as generally available (GA) features.\n"
]
},
{
@@ -27,7 +34,7 @@
"source": [
"## Install required packages\n",
"\n",
"First we will be installing packages required for our example."
"First we must install the packages we need for this notebook."
]
},
{
@@ -113,15 +120,15 @@
"id": "KLAN6aq_mOpJ"
},
"source": [
"## Configuring the dataset\n",
"## Configure the dataset\n",
"\n",
"In this example notebook we will use a dataset derived from [MSRD](https://github.com/metarank/msrd/tree/master) (Movie Search Ranking Dataset).\n",
"We'll use a dataset derived from the [MSRD (Movie Search Ranking Dataset)](https://github.com/metarank/msrd/tree/master).\n",
"\n",
"The dataset is available [here](https://github.com/elastic/elasticsearch-labs/tree/main/notebooks/search/sample_data/learning-to-rank/) and contains the following files:\n",
"\n",
"- **movies_corpus.jsonl.gz**: The movies dataset which will be indexed.\n",
"- **movies_judgements.tsv.gz**: A file containing relevance judgments for a set of queries.\n",
"- **movies_index_settings.json**: Settings to be applied to the documents and index."
"- `movies_corpus.jsonl.gz`: Movie dataset to be indexed.\n",
"- `movies_judgements.tsv.gz`: Judgment list of relevance judgments for a set of queries.\n",
"- `movies_index_settings.json`: Settings to be applied to the documents and index."
]
},
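The corpus file is gzipped JSONL, one JSON document per line. A minimal sketch of reading such a file (the field names and sample values here are illustrative, not taken from the actual dataset):

```python
import gzip
import io
import json

def load_corpus(fileobj):
    """Yield one document per line from a gzipped JSONL file object."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Hypothetical two-document sample standing in for movies_corpus.jsonl.gz.
raw = b'{"title": "Star Wars", "popularity": 88.6}\n{"title": "Alien", "popularity": 71.2}\n'
docs = list(load_corpus(io.BytesIO(gzip.compress(raw))))
print(docs[0]["title"])  # → Star Wars
```

In the notebook the same pattern would be applied to the downloaded `movies_corpus.jsonl.gz` file rather than an in-memory sample.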
{
@@ -147,7 +154,7 @@
"id": "fhO5awX9mOpJ"
},
"source": [
" ## Importing the document corpus\n",
" ## Import the document corpus\n",
"\n",
"This step will import the documents of the corpus into the `movies` index .\n",
"\n",
@@ -230,22 +237,22 @@
"source": [
"## Loading the judgment list\n",
"\n",
"Judgemnent list provides human judgement that will be used to train our Learning To Rank model.\n",
"The judgment list contains human evaluations that we'll use to train our Learning To Rank model.\n",
"\n",
"Each row represents a query-document pair with an associated relevance grade and contains the following columns:\n",
"\n",
"| Column | Description |\n",
"|-----------|------------------------------------------------------------------------|\n",
"| `query_id`| Pair for the same query are grouped together and received a unique id. |\n",
"| `query` | Actual text for the query. |\n",
"| `doc_id` | Id of the document. |\n",
"| `query_id`| Pairs for the same query are grouped together and received a unique id. |\n",
"| `query` | Actual text of the query. |\n",
"| `doc_id` | ID of the document. |\n",
"| `grade` | The relevance grade of the document for the query. |\n",
"\n",
"\n",
"**Note:**\n",
"\n",
"In our notebook the relevance grade is a binary value (relevant or not relavant).\n",
"Instread of a binary judgement, you can also use a number that represent the degree of relevance (e.g. from `0` to `4`)."
"In this example the relevance grade is a binary value (relevant or not relavant).\n",
"You could also use a number that represents the degree of relevance (e.g. from `0` to `4`)."
]
},
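Since the judgment list is a TSV file, loading it boils down to tab-delimited parsing. A minimal sketch using the columns described above (the sample rows are hypothetical):

```python
import csv
import io

# Hypothetical excerpt in the judgment-list format described above.
tsv = (
    "query_id\tquery\tdoc_id\tgrade\n"
    "q1\tstar wars\td3\t1\n"
    "q1\tstar wars\td9\t0\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
relevant = [r["doc_id"] for r in rows if int(r["grade"]) == 1]
```

In the notebook the same columns would come from the gzipped `movies_judgements.tsv.gz` file instead of an inline string.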
{
@@ -403,11 +410,11 @@
"source": [
"## Configure feature extraction\n",
"\n",
"Features and the inputs to our model. They represent information about the query alone, a result document alone or a result document in the context of a query, as in the case of BM25 scores.\n",
"Features are the inputs to our model. They represent information about the query alone, a result document alone, or a result document in the context of a query, such as BM25 scores.\n",
"\n",
"Features are defined using standard templated queries and the Query DSL.\n",
"\n",
"To simplify defining and iterating on feature extraction during training, we've exposed some primitives directly in `eland`."
"To streamline the process of defining and refining feature extraction during training, we have incorporated a number of primitives directly in `eland`."
]
},
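Each feature is just a templated Query DSL body. A sketch of what two such definitions could look like as raw dicts; in eland these would be wrapped in `QueryFeatureExtractor` objects inside an `LTRModelConfig`, and the feature names and script body here are illustrative, not the notebook's exact definitions:

```python
# Two illustrative feature definitions as raw Query DSL bodies:
# a BM25 match on the title, and a script_score query exposing popularity.
# "{{query}}" is the template parameter filled in at extraction time.
feature_queries = {
    "title_bm25": {"match": {"title": "{{query}}"}},
    "popularity": {
        "script_score": {
            "query": {"exists": {"field": "popularity"}},
            "script": {"source": "return doc['popularity'].value;"},
        }
    },
}
```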
{
@@ -466,11 +473,11 @@
"source": [
"## Building the training dataset\n",
"\n",
"Now that we have our basic datasets loaded, and feature extraction configured, we'll use our judgement list to come up with the final dataset for training. The dataset will consist of rows containing `<query, document>` pairs, as well as all of the features we need to train the model. To generate this dataset, we'll run each query from the judgement list and add the extracted features as columns for each of the labelled result documents in the judgement list.\n",
"Now that we have our basic datasets loaded, and feature extraction configured, we'll use our judgment list to come up with the final dataset for training. The dataset will consist of rows containing `<query, document>` pairs, as well as all of the features we need to train the model. To generate this dataset, we'll run each query from the judgment list and add the extracted features as columns for each of the labelled result documents.\n",
"\n",
"For example, if we have a query `q1` with two labelled documents `d3` and `d9`, the training dataset will end up with two rows — one for each of the pairs `<q1, d3>` and `<q1, d9>`.\n",
"\n",
"Note that because this executes queries on your Elasticsearch cluster, the time to run this operation will vary depending on where the cluster is versus where this notebook runs. For example, if you run the notebook on the same server or host as the Elasticsearch cluster, this operation tends to run very quickly on the sample dataset (< 2 mins)."
"Note that because this executes queries on your Elasticsearch cluster, the time to run this operation will vary depending on where the cluster is hosted and where this notebook runs. For example, if you run the notebook on the same server or host as the Elasticsearch cluster, this operation tends to run very quickly on the sample dataset (< 2 mins)."
]
},
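The join described above can be sketched like this, with hypothetical feature values standing in for the extracted columns:

```python
# Hypothetical extracted feature values for query q1, keyed by doc_id.
features = {
    "d3": {"title_bm25": 7.1, "popularity": 88.6},
    "d9": {"title_bm25": 2.4, "popularity": 12.3},
}
judgments = [("q1", "d3", 1), ("q1", "d9", 0)]  # (query_id, doc_id, grade)

# One training row per judged <query, document> pair, feature values as columns.
training_rows = [
    {"query_id": qid, "doc_id": doc, "grade": grade, **features[doc]}
    for qid, doc, grade in judgments
]
```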
{
@@ -757,7 +764,7 @@
"\n",
"The LTR rescorer supports XGBRanker trained models.\n",
"\n",
"You will find more information on XGBRanker model in the xgboost [documentation](https://xgboost.readthedocs.io/en/latest/tutorials/learning_to_rank.html)."
"Learn more in the [XGBoost documentation](https://xgboost.readthedocs.io/en/latest/tutorials/learning_to_rank.html)."
]
},
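XGBRanker expects the rows for each query to be contiguous, together with a `group` array giving the number of rows per query. A stdlib sketch of computing that array (the `fit` call in the comment is a rough shape, not the notebook's exact training code):

```python
from itertools import groupby

# query_id column of the training dataset, sorted so each query is contiguous.
query_ids = ["q1", "q1", "q2", "q2", "q2", "q3"]
# One entry per query: how many consecutive rows belong to it.
groups = [sum(1 for _ in g) for _, g in groupby(query_ids)]
# Roughly: xgboost.XGBRanker(objective="rank:ndcg").fit(X, y, group=groups)
```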
{
@@ -981,11 +988,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importing the model to Elasticsearch\n",
"## Import the model into Elasticsearch\n",
"\n",
"Once the model is trained you will be able to use Eland to send it to Elasticsearch.\n",
"Once the model is trained we can use Eland to load it into Elasticsearch.\n",
"\n",
"Please note that the `MLModel.import_ltr_model` method contains the `LTRModelConfig` object in order to associate the feature extraction with the model.\n"
"Please note that the `MLModel.import_ltr_model` method contains the `LTRModelConfig` object which defines how features should be extracted for the model being imported."
]
},
{
@@ -1148,9 +1155,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We saw above that the title and popularity fields are important ranking feature in our model. Here we can see that now all results contain the query terms in the title. Moreover, more popular movies rank higher, for example `Star Wars: Episode I - The Phantom Menace` is now in third position."
"We can see that the `title_bm25` and `popularity` features are weighted more importantly in our trained model. Now all results include the query terms in the title, outlining the importance of the `title_bm25` feature in the model. Similarly, more popular movies now rank higher, for example `Star Wars: Episode I - The Phantom Menace` is now in third position."
]
}
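Once deployed, the model is referenced from the `rescore` section of the search request used to produce these results. A sketch of such a request body; the model id and the template parameter name are assumptions, since the actual search cells are collapsed in this diff:

```python
# Sketch of a search body using the deployed LTR model for second-stage re-ranking.
search_body = {
    "query": {
        "multi_match": {"query": "star wars", "fields": ["title", "overview"]}
    },
    "rescore": {
        "window_size": 100,  # rescore only the top 100 first-stage hits
        "learning_to_rank": {
            "model_id": "ltr-model-xgboost",  # assumed id chosen at import time
            "params": {"query": "star wars"},  # fills the {{query}} template
        },
    },
}
# With a live cluster: es_client.search(index="movies", **search_body)
```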

],
"metadata": {
"colab": {
