Add sklearn-end2end to bring-your-own-model #93

Open
wants to merge 15 commits into base: master
1 change: 1 addition & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
*.ipynb filter=strip-notebook-output
777 changes: 777 additions & 0 deletions bring-your-own-model/sklearn-end2end.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions feature-engineering-and-training/.gitattributes
@@ -0,0 +1 @@
*.ipynb filter=strip-notebook-output


@@ -41,7 +41,7 @@
"\n",
"## Preparation\n",
"\n",
"_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n",
"_This notebook was created and tested on an ml.m5.xlarge notebook instance._\n",
"\n",
"Let's start by specifying:\n",
"\n",
@@ -482,7 +482,7 @@
"xgb = sagemaker.estimator.Estimator(container,\n",
" role, \n",
" instance_count=1, \n",
" instance_type='ml.m4.xlarge',\n",
" instance_type='ml.m5.xlarge',\n",
" output_path='s3://{}/{}/output'.format(bucket, prefix),\n",
" sagemaker_session=sess)\n",
"xgb.set_hyperparameters(max_depth=5,\n",
@@ -494,7 +494,16 @@
" objective='binary:logistic',\n",
" num_round=100)\n",
"\n",
"xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) "
"xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}, wait=False) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xgb.logs()"
]
},
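The change to `xgb.fit(..., wait=False)` followed by `xgb.logs()` splits kickoff from monitoring: the fit call returns immediately and the logs call later blocks until the job finishes. Under the hood this amounts to polling the training job until it reaches a terminal state. A minimal local sketch of that pattern (the `get_status` callable is a hypothetical stub standing in for SageMaker's `DescribeTrainingJob` response; it is not a SageMaker API):

```python
import time

def wait_for_job(get_status, poll_seconds=0,
                 terminal=("Completed", "Failed", "Stopped")):
    """Poll a status callable until the job reaches a terminal state."""
    while True:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)

# Stubbed status sequence standing in for successive describe calls.
statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_job(lambda: next(statuses)))  # → Completed
```

In the notebook itself you never write this loop; `xgb.logs()` (or `estimator.fit` with the default `wait=True`) does the equivalent for you while streaming CloudWatch logs.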
{
@@ -515,7 +524,7 @@
"source": [
"# cell 18\n",
"xgb_predictor = xgb.deploy(initial_instance_count=1,\n",
" instance_type='ml.m4.xlarge')"
" instance_type='ml.m5.xlarge')"
]
},
{
@@ -525,7 +534,6 @@
"---\n",
"\n",
"## Evaluation\n",
"There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values. In this case, we're simply predicting whether the customer subscribed to a term deposit (`1`) or not (`0`), which produces a simple confusion matrix.\n",
"\n",
"First we'll need to determine how we pass data into and receive data from our endpoint. Our data is currently stored as NumPy arrays in memory of our notebook instance. To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV.\n",
"\n",
@@ -546,12 +554,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we'll use a simple function to:\n",
"1. Loop over our test dataset\n",
"1. Split it into mini-batches of rows \n",
"1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)\n",
"1. Retrieve mini-batch predictions by invoking the XGBoost endpoint\n",
"1. Collect predictions and convert from the CSV output our model provides into a NumPy array"
"Let's take a look at the test dataset:"
]
},
{
@@ -560,23 +563,32 @@
"metadata": {},
"outputs": [],
"source": [
"# cell 20\n",
"def predict(data, predictor, rows=500 ):\n",
" split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n",
" predictions = ''\n",
" for array in split_array:\n",
" predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])\n",
"\n",
" return np.fromstring(predictions[1:], sep=',')\n",
"test_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We take the first row of the test data, drop the labels `y_no` and `y_yes`, and convert it to a list:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"i = 0\n",
"\n",
"predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(), xgb_predictor)"
"sample_data = test_data.iloc[i].drop(['y_no', 'y_yes']).tolist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll check our confusion matrix to see how well we predicted versus actuals."
"Then we invoke the SageMaker real-time endpoint with the sample data to get a prediction (inference):"
]
},
{
@@ -585,17 +597,47 @@
"metadata": {},
"outputs": [],
"source": [
"# cell 21\n",
"pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])"
"result = xgb_predictor.predict(sample_data).decode('utf-8')\n",
"\n",
"prediction = 'Yes' if float(result) > 0.5 else 'No'\n",
"actual = 'Yes' if test_data.iloc[i]['y_yes'] == 1 else 'No'\n",
"\n",
"print(\"Does the sample client subscribe to a term deposit?\")\n",
"print(f\"Prediction: {prediction}\")\n",
"print(f\"Actual: {actual}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, of the ~4000 potential customers, we predicted 136 would subscribe and 94 of them actually did. We also had 389 subscribers who subscribed that we did not predict would. This is less than desirable, but the model can (and should) be tuned to improve this. Most importantly, note that with minimal effort, our model produced accuracies similar to those published [here](https://core.ac.uk/download/pdf/55631291.pdf).\n",
"You can also invoke the SageMaker endpoint directly via boto3:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"\n",
"endpoint_name = xgb_predictor.endpoint_name\n",
"\n",
"client = boto3.client('sagemaker-runtime', region_name=boto3.Session().region_name)\n",
"\n",
"payload = ','.join(str(e) for e in sample_data).encode(\"utf-8\")\n",
"content_type = 'text/csv'\n",
"response = client.invoke_endpoint(EndpointName=endpoint_name, Body=payload, ContentType=content_type)\n",
"\n",
"result = response['Body'].read().decode('utf-8')\n",
"\n",
"prediction = 'Yes' if float(result) > 0.5 else 'No'\n",
"actual = 'Yes' if test_data.iloc[i]['y_yes'] == 1 else 'No'\n",
"\n",
"_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._"
"print(\"Does the sample client subscribe to a term deposit?\")\n",
"print(f\"Prediction: {prediction}\")\n",
"print(f\"Actual: {actual}\")"
]
},
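Both invocation paths do the same two conversions: serialize the feature row to a CSV body and decode a single probability string from the response. A local sketch of that round trip (pure Python, no endpoint involved; the helper names are ours, not part of the SageMaker SDK):

```python
def to_csv_payload(features):
    """Serialize a feature row into the text/csv body the endpoint expects."""
    return ",".join(str(f) for f in features).encode("utf-8")

def decode_prediction(body, threshold=0.5):
    """Turn the endpoint's probability string into a Yes/No label."""
    return "Yes" if float(body) > threshold else "No"

payload = to_csv_payload([56, 1, 999, 0.5])
print(payload)                    # b'56,1,999,0.5'
print(decode_prediction("0.73"))  # Yes
print(decode_prediction("0.12"))  # No
```

The `ContentType='text/csv'` header in the boto3 call is what tells the XGBoost container to parse the body this way.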
{
@@ -616,6 +658,74 @@
"xgb_predictor.delete_endpoint(delete_endpoint_config=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Batch Transform \n",
"Apart from deploying a SageMaker endpoint for real-time inference, SageMaker also supports batch inference over a whole dataset. \n",
"\n",
"First, we drop the labels `y_no` and `y_yes` from the test dataset, save the result as a CSV file, and upload it to S3:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_data.drop(['y_no', 'y_yes'], axis=1).to_csv('test.csv', index=False, header=False)\n",
"boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')"
]
},
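The `index=False, header=False` arguments matter here: the batch transform input must contain only feature columns, in training order, with no header row or index column. A small illustration with a toy frame (the feature column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"age": [56, 41], "duration": [261, 151],
                   "y_no": [1, 0], "y_yes": [0, 1]})

# Drop the label columns and emit headerless, index-free CSV,
# matching what the XGBoost container expects at inference time.
csv_body = df.drop(["y_no", "y_yes"], axis=1).to_csv(index=False, header=False)
print(csv_body)
```

Leaving the header in (or the index column) would be silently parsed as feature values and corrupt every prediction.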
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we launch a batch transform job to generate predictions for the whole test dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_path = f\"s3://{bucket}/{prefix}/test/test.csv\"\n",
"transformer_output_path = f\"s3://{bucket}/{prefix}/transformer-output\"\n",
"\n",
"xgb_transformer = xgb.transformer(\n",
" instance_count=1,\n",
" instance_type='ml.m5.large',\n",
" output_path=transformer_output_path\n",
")\n",
"\n",
"xgb_transformer.transform(\n",
" data=test_path,\n",
" data_type='S3Prefix',\n",
" content_type='text/csv'\n",
")\n",
"\n",
"print(sagemaker.s3.S3Downloader.read_file(f\"{transformer_output_path}/test.csv.out\"))"
]
},
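The transform job writes one probability per input row to `test.csv.out`. A local sketch of turning that text into hard labels (the sample values are made up; `raw` stands in for what `S3Downloader.read_file` returns):

```python
import numpy as np

raw = "0.12\n0.87\n0.45\n"  # made-up stand-in for the .out file contents

# One probability per line → float array → thresholded 0/1 labels.
probs = np.array([float(line) for line in raw.strip().splitlines()])
labels = (probs > 0.5).astype(int)
print(labels)  # [0 1 0]
```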
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the batch inference result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sagemaker.s3.S3Downloader.download(f\"{transformer_output_path}/test.csv.out\", \"batch_result\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -830,10 +940,11 @@
"source": [
"# cell 22\n",
"from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner\n",
"\n",
"hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),\n",
" 'min_child_weight': ContinuousParameter(1, 10),\n",
" 'alpha': ContinuousParameter(0, 2),\n",
" 'max_depth': IntegerParameter(1, 10)}\n"
" 'min_child_weight': ContinuousParameter(1, 10),\n",
" 'alpha': ContinuousParameter(0, 2),\n",
" 'max_depth': IntegerParameter(1, 10)}"
]
},
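Each tuning trial draws one configuration from these ranges: continuous parameters from an interval, integer parameters from a discrete range. A minimal sketch of a single random draw (this illustrates random search; SageMaker's tuner actually defaults to Bayesian optimization, and the helper below is ours, not an SDK function):

```python
import random

random.seed(0)
ranges = {"eta": (0.0, 1.0), "min_child_weight": (1.0, 10.0),
          "alpha": (0.0, 2.0), "max_depth": (1, 10)}

def draw(ranges):
    """Sample one hyperparameter configuration from the declared ranges."""
    config = {}
    for name, (lo, hi) in ranges.items():
        if isinstance(lo, int):            # like IntegerParameter
            config[name] = random.randint(lo, hi)
        else:                              # like ContinuousParameter
            config[name] = random.uniform(lo, hi)
    return config

print(draw(ranges))
```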
{
@@ -856,8 +967,8 @@
"tuner = HyperparameterTuner(xgb,\n",
" objective_metric_name,\n",
" hyperparameter_ranges,\n",
" max_jobs=20,\n",
" max_parallel_jobs=3)\n"
" max_jobs=9,\n",
" max_parallel_jobs=3)"
]
},
{
@@ -901,7 +1012,7 @@
"# cell 28\n",
"# Deploy the best trained or user specified model to an Amazon SageMaker endpoint\n",
"tuner_predictor = tuner.deploy(initial_instance_count=1,\n",
" instance_type='ml.m4.xlarge')"
" instance_type='ml.m5.xlarge')"
]
},
{
@@ -923,18 +1034,18 @@
"source": [
"# cell 30\n",
"# Predict\n",
"predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(),tuner_predictor)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# cell 31\n",
"# Collect predictions and convert from the CSV output our model provides into a NumPy array\n",
"pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])"
"i = 0\n",
"\n",
"sample_data = test_data.iloc[i].drop(['y_no', 'y_yes']).tolist()\n",
"\n",
"result = tuner_predictor.predict(sample_data).decode('utf-8')\n",
"\n",
"prediction = 'Yes' if float(result) > 0.5 else 'No'\n",
"actual = 'Yes' if test_data.iloc[i]['y_yes'] == 1 else 'No'\n",
"\n",
"print(\"Does the sample client subscribe to a term deposit?\")\n",
"print(f\"Prediction: {prediction}\")\n",
"print(f\"Actual: {actual}\")"
]
},
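With per-row probabilities in hand (from the endpoint or the batch transform), the confusion matrix that the earlier revision computed with `pd.crosstab` is still a one-liner. A sketch on toy data (the actuals and probabilities below are made up for illustration):

```python
import numpy as np
import pandas as pd

actuals = pd.Series([0, 0, 1, 1, 1], name="actuals")
probabilities = np.array([0.1, 0.7, 0.2, 0.8, 0.9])

# Round probabilities to 0/1 predictions, then cross-tabulate
# against the true labels: rows = actuals, columns = predictions.
confusion = pd.crosstab(index=actuals, columns=np.round(probabilities),
                        rownames=["actuals"], colnames=["predictions"])
print(confusion)
```

Off-diagonal cells are the false positives (top right) and false negatives (bottom left).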
{
@@ -993,7 +1104,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.11.11"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
1 change: 1 addition & 0 deletions mlops/fm-evaluation-at-scale-main/notebooks/.gitattributes
@@ -0,0 +1 @@
*.ipynb filter=strip-notebook-output