Add sklearn-end2end to bring-your-own-model #93

Open
wants to merge 15 commits into base: master
1 change: 1 addition & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
*.ipynb filter=strip-notebook-output
777 changes: 777 additions & 0 deletions bring-your-own-model/sklearn-end2end.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions feature-engineering-and-training/.gitattributes
@@ -0,0 +1 @@
*.ipynb filter=strip-notebook-output


@@ -41,7 +41,7 @@
"\n",
"## Preparation\n",
"\n",
"_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n",
"_This notebook was created and tested on an ml.m5.xlarge notebook instance._\n",
"\n",
"Let's start by specifying:\n",
"\n",
@@ -482,7 +482,7 @@
"xgb = sagemaker.estimator.Estimator(container,\n",
" role, \n",
" instance_count=1, \n",
" instance_type='ml.m4.xlarge',\n",
" instance_type='ml.m5.xlarge',\n",
" output_path='s3://{}/{}/output'.format(bucket, prefix),\n",
" sagemaker_session=sess)\n",
"xgb.set_hyperparameters(max_depth=5,\n",
@@ -494,7 +494,16 @@
" objective='binary:logistic',\n",
" num_round=100)\n",
"\n",
"xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) "
"xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}, wait=False) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xgb.logs()"
]
},
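The change to `xgb.fit(..., wait=False)` followed by `xgb.logs()` splits kickoff from monitoring: the fit call returns immediately and the logs call later blocks until the job finishes. Under the hood this amounts to polling the training job until it reaches a terminal state. A minimal local sketch of that pattern (the `get_status` callable is a hypothetical stub standing in for SageMaker's `DescribeTrainingJob` response; it is not a SageMaker API):

```python
import time

def wait_for_job(get_status, poll_seconds=0,
                 terminal=("Completed", "Failed", "Stopped")):
    """Poll a status callable until the job reaches a terminal state."""
    while True:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)

# Stubbed status sequence standing in for successive describe calls.
statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_job(lambda: next(statuses)))  # → Completed
```

In the notebook itself you never write this loop; `xgb.logs()` (or `estimator.fit` with the default `wait=True`) does the equivalent for you while streaming CloudWatch logs.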
{
@@ -515,7 +524,7 @@
"source": [
"# cell 18\n",
"xgb_predictor = xgb.deploy(initial_instance_count=1,\n",
" instance_type='ml.m4.xlarge')"
" instance_type='ml.m5.xlarge')"
]
},
{
@@ -525,7 +534,6 @@
"---\n",
"\n",
"## Evaluation\n",
"There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values. In this case, we're simply predicting whether the customer subscribed to a term deposit (`1`) or not (`0`), which produces a simple confusion matrix.\n",
"\n",
"First we'll need to determine how we pass data into and receive data from our endpoint. Our data is currently stored as NumPy arrays in memory of our notebook instance. To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV.\n",
"\n",
@@ -546,12 +554,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we'll use a simple function to:\n",
"1. Loop over our test dataset\n",
"1. Split it into mini-batches of rows \n",
"1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)\n",
"1. Retrieve mini-batch predictions by invoking the XGBoost endpoint\n",
"1. Collect predictions and convert from the CSV output our model provides into a NumPy array"
"Let's take a look at the test dataset:"
]
},
{
@@ -560,23 +563,32 @@
"metadata": {},
"outputs": [],
"source": [
"# cell 20\n",
"def predict(data, predictor, rows=500 ):\n",
" split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n",
" predictions = ''\n",
" for array in split_array:\n",
" predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])\n",
"\n",
" return np.fromstring(predictions[1:], sep=',')\n",
"test_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We take the first row of the test data, drop the labels `y_no` and `y_yes`, and convert it to a list:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"i = 0\n",
"\n",
"predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(), xgb_predictor)"
"sample_data = test_data.iloc[i].drop(['y_no', 'y_yes']).tolist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll check our confusion matrix to see how well we predicted versus actuals."
"Then we invoke the SageMaker real-time endpoint with the sample data to get a prediction (inference):"
]
},
{
@@ -585,17 +597,47 @@
"metadata": {},
"outputs": [],
"source": [
"# cell 21\n",
"pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])"
"result = xgb_predictor.predict(sample_data).decode('utf-8')\n",
"\n",
"prediction = 'Yes' if float(result) > 0.5 else 'No'\n",
"actual = 'Yes' if test_data.iloc[i]['y_yes'] == 1 else 'No'\n",
"\n",
"print(\"Does the sample client subscribe to a term deposit?\")\n",
"print(f\"Prediction: {prediction}\")\n",
"print(f\"Actual: {actual}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, of the ~4000 potential customers, we predicted 136 would subscribe and 94 of them actually did. We also had 389 subscribers who subscribed that we did not predict would. This is less than desirable, but the model can (and should) be tuned to improve this. Most importantly, note that with minimal effort, our model produced accuracies similar to those published [here](https://core.ac.uk/download/pdf/55631291.pdf).\n",
"You can also invoke the SageMaker endpoint directly via boto3:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"\n",
"endpoint_name = xgb_predictor.endpoint_name\n",
"\n",
"client = boto3.client('sagemaker-runtime', region_name=boto3.Session().region_name)\n",
"\n",
"payload = ','.join(str(e) for e in sample_data).encode(\"utf-8\")\n",
"content_type = 'text/csv'\n",
"response = client.invoke_endpoint(EndpointName=endpoint_name, Body=payload, ContentType=content_type)\n",
"\n",
"result = response['Body'].read().decode('utf-8')\n",
"\n",
"prediction = 'Yes' if float(result) > 0.5 else 'No'\n",
"actual = 'Yes' if test_data.iloc[i]['y_yes'] == 1 else 'No'\n",
"\n",
"_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._"
"print(\"Does the sample client subscribe to a term deposit?\")\n",
"print(f\"Prediction: {prediction}\")\n",
"print(f\"Actual: {actual}\")"
]
},
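Both invocation paths do the same two conversions: serialize the feature row to a CSV body and decode a single probability string from the response. A local sketch of that round trip (pure Python, no endpoint involved; the helper names are ours, not part of the SageMaker SDK):

```python
def to_csv_payload(features):
    """Serialize a feature row into the text/csv body the endpoint expects."""
    return ",".join(str(f) for f in features).encode("utf-8")

def decode_prediction(body, threshold=0.5):
    """Turn the endpoint's probability string into a Yes/No label."""
    return "Yes" if float(body) > threshold else "No"

payload = to_csv_payload([56, 1, 999, 0.5])
print(payload)                    # b'56,1,999,0.5'
print(decode_prediction("0.73"))  # Yes
print(decode_prediction("0.12"))  # No
```

The `ContentType='text/csv'` header in the boto3 call is what tells the XGBoost container to parse the body this way.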
{
@@ -616,6 +658,74 @@
"xgb_predictor.delete_endpoint(delete_endpoint_config=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Batch Transform \n",
"Apart from deploying a SageMaker endpoint for real-time inference, SageMaker also supports batch inference over a whole dataset. \n",
"\n",
"First, we drop the labels `y_no` and `y_yes` from the test dataset, save the result as a CSV file, and upload it to S3:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_data.drop(['y_no', 'y_yes'], axis=1).to_csv('test.csv', index=False, header=False)\n",
"boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')"
]
},
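The `index=False, header=False` arguments matter here: the batch transform input must contain only feature columns, in training order, with no header row or index column. A small illustration with a toy frame (the feature column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"age": [56, 41], "duration": [261, 151],
                   "y_no": [1, 0], "y_yes": [0, 1]})

# Drop the label columns and emit headerless, index-free CSV,
# matching what the XGBoost container expects at inference time.
csv_body = df.drop(["y_no", "y_yes"], axis=1).to_csv(index=False, header=False)
print(csv_body)
```

Leaving the header in (or the index column) would be silently parsed as feature values and corrupt every prediction.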
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we launch a batch transform job to generate predictions for the whole test dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_path = f\"s3://{bucket}/{prefix}/test/test.csv\"\n",
"transformer_output_path = f\"s3://{bucket}/{prefix}/transformer-output\"\n",
"\n",
"xgb_transformer = xgb.transformer(\n",
" instance_count=1,\n",
" instance_type='ml.m5.large',\n",
" output_path=transformer_output_path\n",
")\n",
"\n",
"xgb_transformer.transform(\n",
" data=test_path,\n",
" data_type='S3Prefix',\n",
" content_type='text/csv'\n",
")\n",
"\n",
"print(sagemaker.s3.S3Downloader.read_file(f\"{transformer_output_path}/test.csv.out\"))"
]
},
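The transform job writes one probability per input row to `test.csv.out`. A local sketch of turning that text into hard labels (the sample values are made up; `raw` stands in for what `S3Downloader.read_file` returns):

```python
import numpy as np

raw = "0.12\n0.87\n0.45\n"  # made-up stand-in for the .out file contents

# One probability per line → float array → thresholded 0/1 labels.
probs = np.array([float(line) for line in raw.strip().splitlines()])
labels = (probs > 0.5).astype(int)
print(labels)  # [0 1 0]
```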
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the batch inference result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sagemaker.s3.S3Downloader.download(f\"{transformer_output_path}/test.csv.out\", \"batch_result\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -830,10 +940,11 @@
"source": [
"# cell 22\n",
"from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner\n",
"\n",
"hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),\n",
" 'min_child_weight': ContinuousParameter(1, 10),\n",
" 'alpha': ContinuousParameter(0, 2),\n",
" 'max_depth': IntegerParameter(1, 10)}\n"
" 'min_child_weight': ContinuousParameter(1, 10),\n",
" 'alpha': ContinuousParameter(0, 2),\n",
" 'max_depth': IntegerParameter(1, 10)}"
]
},
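Each tuning trial draws one configuration from these ranges: continuous parameters from an interval, integer parameters from a discrete range. A minimal sketch of a single random draw (this illustrates random search; SageMaker's tuner actually defaults to Bayesian optimization, and the helper below is ours, not an SDK function):

```python
import random

random.seed(0)
ranges = {"eta": (0.0, 1.0), "min_child_weight": (1.0, 10.0),
          "alpha": (0.0, 2.0), "max_depth": (1, 10)}

def draw(ranges):
    """Sample one hyperparameter configuration from the declared ranges."""
    config = {}
    for name, (lo, hi) in ranges.items():
        if isinstance(lo, int):            # like IntegerParameter
            config[name] = random.randint(lo, hi)
        else:                              # like ContinuousParameter
            config[name] = random.uniform(lo, hi)
    return config

print(draw(ranges))
```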
{
@@ -856,8 +967,8 @@
"tuner = HyperparameterTuner(xgb,\n",
" objective_metric_name,\n",
" hyperparameter_ranges,\n",
" max_jobs=20,\n",
" max_parallel_jobs=3)\n"
" max_jobs=9,\n",
" max_parallel_jobs=3)"
]
},
{
@@ -901,7 +1012,7 @@
"# cell 28\n",
"# Deploy the best trained or user specified model to an Amazon SageMaker endpoint\n",
"tuner_predictor = tuner.deploy(initial_instance_count=1,\n",
" instance_type='ml.m4.xlarge')"
" instance_type='ml.m5.xlarge')"
]
},
{
@@ -923,18 +1034,18 @@
"source": [
"# cell 30\n",
"# Predict\n",
"predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(),tuner_predictor)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# cell 31\n",
"# Collect predictions and convert from the CSV output our model provides into a NumPy array\n",
"pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])"
"i = 0\n",
"\n",
"sample_data = test_data.iloc[i].drop(['y_no', 'y_yes']).tolist()\n",
"\n",
"result = tuner_predictor.predict(sample_data).decode('utf-8')\n",
"\n",
"prediction = 'Yes' if float(result) > 0.5 else 'No'\n",
"actual = 'Yes' if test_data.iloc[i]['y_yes'] == 1 else 'No'\n",
"\n",
"print(\"Does the sample client subscribe to a term deposit?\")\n",
"print(f\"Prediction: {prediction}\")\n",
"print(f\"Actual: {actual}\")"
]
},
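With per-row probabilities in hand (from the endpoint or the batch transform), the confusion matrix that the earlier revision computed with `pd.crosstab` is still a one-liner. A sketch on toy data (the actuals and probabilities below are made up for illustration):

```python
import numpy as np
import pandas as pd

actuals = pd.Series([0, 0, 1, 1, 1], name="actuals")
probabilities = np.array([0.1, 0.7, 0.2, 0.8, 0.9])

# Round probabilities to 0/1 predictions, then cross-tabulate
# against the true labels: rows = actuals, columns = predictions.
confusion = pd.crosstab(index=actuals, columns=np.round(probabilities),
                        rownames=["actuals"], colnames=["predictions"])
print(confusion)
```

Off-diagonal cells are the false positives (top right) and false negatives (bottom left).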
{
@@ -993,7 +1104,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.11.11"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
1 change: 1 addition & 0 deletions mlops/fm-evaluation-at-scale-main/notebooks/.gitattributes
@@ -0,0 +1 @@
*.ipynb filter=strip-notebook-output