Description
❓ Questions & Help
Hi everyone, I was trying to reproduce the Getting Started: Session-based Recommendation with Synthetic Data example on AWS SageMaker by following the official Training and Serving Merlin on AWS SageMaker tutorial (which uses a merlin-models model), but with a transformers4rec model instead.
The AWS SageMaker tutorial with merlin-models works as expected for both the training and inference steps (after applying the fixes from PR NVIDIA-Merlin/Merlin#1040). However, when I try to do the same with the transformers4rec getting-started tutorial, I get the following error while performing inference on a SageMaker Endpoint:
| 1733851742784 | I1210 17:29:02.641670 103 python_be.cc:2177] TRITONBACKEND_ModelInstanceExecute: model instance name 0_transformworkflowtriton_0 released 1 requests | AllTraffic/i-0a730c865fae02cab |
| 1733851742784 | Failed to transform operator <merlin.systems.dag.runtimes.triton.ops.workflow.TransformWorkflowTriton object at 0x7f70e322ce50> | AllTraffic/i-0a730c865fae02cab |
| 1733851742784 | Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/merlin/dag/executors.py", line 237, in _run_node_transform transformed_data = node.op.transform(selection, input_data) File "/usr/local/lib/python3.10/dist-packages/merlin/systems/dag/runtimes/triton/ops/workflow.py", line 92, in transform raise RuntimeError(inference_response.error().message()) | AllTraffic/i-0a730c865fae02cab |
| 1733851742784 | RuntimeError: Error: <class 'KeyError'> - "['weekday_sin-list', 'category-list', 'item_id-count', 'age_days-list', 'item_id-list', 'day-first'] not in index", Traceback: [' File "/opt/ml/model/0_transformworkflowtriton/1/model.py", line 117, in execute\n transformed = self.runner.run_workflow(input_tensors)\n', ' File "/usr/local/lib/python3.10/dist-packages/merlin/systems/workflow/base.py", line 103, in run_workflow\n transformed = LocalExecutor().transform(transformable, self.workflow.graph)\n', ' File "/usr/local/lib/python3.10/dist-packages/merlin/dag/executors.py", line 102, in transform\n transformed_data = self._execute_node(node, transformable, capture_dtypes, strict)\n', ' File "/usr/local/lib/python3.10/dist-packages/merlin/dag/executors.py", line 116, in _execute_node\n upstream_outputs = self._run_upstream_transforms(\n', ' File "/usr/local/lib/python3.10/dist-packages/merlin/dag/executors.py", line 130, in _run_upstream_transforms\n node_output = self._execute_node(\n', ' File "/usr/local/lib/python3.10/dist-packages/merlin/dag/executors.py", line 119, in _execute_node\n upstream_columns = self._append_addl_root_columns(node, transformable, upstream_outputs)\n', ' File "/usr/local/lib/python3.10/dist-packages/merlin/dag/executors.py", line 154, in _append_addl_root_columns\n upstream_outputs.append(transformable[list(root_columns)])\n', ' File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 3811, in __getitem__\n indexer = self.columns._get_indexer_strict(key, "columns")[1]\n', ' File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 6113, in _get_indexer_strict\n self._raise_if_missing(keyarr, indexer, axis_name)\n', ' File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing\n raise KeyError(f"{not_found} not in index")\n']
As you can see, the error seems to be related to the grouped variables handled by the 0_transformworkflowtriton model of the Triton ensemble. However, the model training and the ensemble initialization on the Triton server seem to be fine (see SM_endpoint_logs_full.txt):
| 1733850433118 | +-------------------------------------------+---------+--------+ | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | Model | Version | Status | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | +-------------------------------------------+---------+--------+ | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | /opt/ml/model/::0_transformworkflowtriton | 1 | READY | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | /opt/ml/model/::1_predictpytorchtriton | 1 | READY | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | /opt/ml/model/::executor_model | 1 | READY | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | +-------------------------------------------+---------+--------+ | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | I1210 17:07:12.893780 103 tritonserver.cc:2385] | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | Option | Value | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | server_id | triton | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | server_version | 2.35.0 | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | model_repository_path[0] | /opt/ml/model/ | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | model_control_mode | MODE_NONE | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | strict_model_config | 0 | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | rate_limit | OFF | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | pinned_memory_pool_byte_size | 268435456 | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | cuda_memory_pool_byte_size{0} | 67108864 | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | min_supported_compute_capability | 6.0 | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | strict_readiness | 1 | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | exit_timeout | 30 | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | | cache_enabled | 0 | | AllTraffic/i-0a730c865fae02cab |
| 1733850433118 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I think the cause of this error could be in the Triton server initialization command (tritonserver --allow-sagemaker=true --allow-http=false $SAGEMAKER_ARGS) or in the SageMaker Endpoint invocation (runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name, ContentType=f"application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}", Body=request_body)), with details and code attached below, since when I run the same Triton inference inside the AWS SageMaker training job (on the same instance used for training) it works as expected. Any help with this issue would be highly appreciated.
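For what it's worth, the columns named in the KeyError ('item_id-list', 'category-list', 'item_id-count', etc.) appear to be columns produced inside the exported NVTabular workflow (the grouped/aggregated features), not the raw columns it takes as input. A minimal check along these lines shows both schemas (the path is where my ensemble export places the workflow, as used in the request code below):
# Hedged sketch: inspect what the exported workflow expects vs. produces
from nvtabular.workflow import Workflow

workflow = Workflow.load("../container_output/0_transformworkflowtriton/1/workflow")
print("workflow inputs :", workflow.input_schema.column_names)   # raw columns (session_id, item_id, ...)
print("workflow outputs:", workflow.output_schema.column_names)  # the grouped '-list'/'-first' columns named in the KeyError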
Details
Following the Merlin SageMaker tutorial, these are my files:
Dockerfile
FROM nvcr.io/nvidia/merlin/merlin-pytorch:23.12
RUN pip3 install sagemaker-training
COPY --chown=1000:1000 --chmod=764 serve /usr/bin/serve
serve
(Initializes the Triton server. Copied from the PR fix):
#!/bin/bash
# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
SAGEMAKER_SINGLE_MODEL_REPO=/opt/ml/model/
# Use 'ready' for ping check in single-model endpoint mode, and use 'live' for ping check in multi-model endpoint mode
# https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/rest_predict_v2.yaml#L10-L26
if [ -n "$SAGEMAKER_TRITON_OVERRIDE_PING_MODE" ]; then
SAGEMAKER_TRITON_PING_MODE=${SAGEMAKER_TRITON_OVERRIDE_PING_MODE}
else
SAGEMAKER_TRITON_PING_MODE="ready"
fi
# Note: in Triton on SageMaker, each model url is registered as a separate repository
# e.g., /opt/ml/models/<hash>/model. Specifying MME model repo path as /opt/ml/models causes Triton
# to treat it as an additional empty repository and changes
# the state of all models to be UNAVAILABLE in the model repository
# https://github.com/triton-inference-server/core/blob/main/src/model_repository_manager.cc#L914,L922
# On Triton, this path will be a dummy path as it's mandatory to specify a model repo when starting triton
SAGEMAKER_MULTI_MODEL_REPO=/tmp/sagemaker
SAGEMAKER_MODEL_REPO=${SAGEMAKER_SINGLE_MODEL_REPO}
is_mme_mode=false
if [ -n "$SAGEMAKER_MULTI_MODEL" ]; then
if [ "$SAGEMAKER_MULTI_MODEL" == "true" ]; then
mkdir -p ${SAGEMAKER_MULTI_MODEL_REPO}
SAGEMAKER_MODEL_REPO=${SAGEMAKER_MULTI_MODEL_REPO}
if [ -n "$SAGEMAKER_TRITON_OVERRIDE_PING_MODE" ]; then
SAGEMAKER_TRITON_PING_MODE=${SAGEMAKER_TRITON_OVERRIDE_PING_MODE}
else
SAGEMAKER_TRITON_PING_MODE="live"
fi
is_mme_mode=true
echo -e "Triton is running in SageMaker MME mode. Using Triton ping mode: \"${SAGEMAKER_TRITON_PING_MODE}\""
fi
fi
SAGEMAKER_ARGS="--model-repository=${SAGEMAKER_MODEL_REPO}"
#Set model namespacing to true, but allow disabling if required
if [ -n "$SAGEMAKER_TRITON_DISABLE_MODEL_NAMESPACING" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --model-namespacing=${SAGEMAKER_TRITON_DISABLE_MODEL_NAMESPACING}"
else
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --model-namespacing=true"
fi
if [ -n "$SAGEMAKER_BIND_TO_PORT" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --sagemaker-port=${SAGEMAKER_BIND_TO_PORT}"
fi
if [ -n "$SAGEMAKER_SAFE_PORT_RANGE" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --sagemaker-safe-port-range=${SAGEMAKER_SAFE_PORT_RANGE}"
fi
if [ -n "$SAGEMAKER_TRITON_ALLOW_GRPC" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --allow-grpc=${SAGEMAKER_TRITON_ALLOW_GRPC}"
else
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --allow-grpc=false"
fi
if [ -n "$SAGEMAKER_TRITON_ALLOW_METRICS" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --allow-metrics=${SAGEMAKER_TRITON_ALLOW_METRICS}"
else
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --allow-metrics=false"
fi
if [ -n "$SAGEMAKER_TRITON_METRICS_PORT" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --metrics-port=${SAGEMAKER_TRITON_METRICS_PORT}"
fi
if [ -n "$SAGEMAKER_TRITON_GRPC_PORT" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --grpc-port=${SAGEMAKER_TRITON_GRPC_PORT}"
fi
if [ -n "$SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --buffer-manager-thread-count=${SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT}"
fi
if [ -n "$SAGEMAKER_TRITON_THREAD_COUNT" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --sagemaker-thread-count=${SAGEMAKER_TRITON_THREAD_COUNT}"
fi
# Enable verbose logging by default. If env variable is specified, use value from env variable
if [ -n "$SAGEMAKER_TRITON_LOG_VERBOSE" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --log-verbose=${SAGEMAKER_TRITON_LOG_VERBOSE}"
else
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --log-verbose=true"
fi
if [ -n "$SAGEMAKER_TRITON_LOG_INFO" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --log-info=${SAGEMAKER_TRITON_LOG_INFO}"
fi
if [ -n "$SAGEMAKER_TRITON_LOG_WARNING" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --log-warning=${SAGEMAKER_TRITON_LOG_WARNING}"
fi
if [ -n "$SAGEMAKER_TRITON_LOG_ERROR" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --log-error=${SAGEMAKER_TRITON_LOG_ERROR}"
fi
if [ -n "$SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --backend-config=python,shm-default-byte-size=${SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE}"
else
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --backend-config=python,shm-default-byte-size=16777216" #16MB
fi
if [ -n "$SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --backend-config=python,shm-growth-byte-size=${SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE}"
else
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --backend-config=python,shm-growth-byte-size=1048576" #1MB
fi
if [ -n "$SAGEMAKER_TRITON_TENSORFLOW_VERSION" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --backend-config=tensorflow,version=${SAGEMAKER_TRITON_TENSORFLOW_VERSION}"
fi
if [ -n "$SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT" ]; then
num_gpus=$(nvidia-smi -L | wc -l)
for ((i=0; i<${num_gpus}; i++)); do
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} --model-load-gpu-limit ${i}:${SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT}"
done
fi
if [ -n "$SAGEMAKER_TRITON_ADDITIONAL_ARGS" ]; then
SAGEMAKER_ARGS="${SAGEMAKER_ARGS} ${SAGEMAKER_TRITON_ADDITIONAL_ARGS}"
fi
tritonserver --allow-sagemaker=true --allow-http=false $SAGEMAKER_ARGS
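Since the serve script is entirely driven by SAGEMAKER_* environment variables, its defaults can be overridden through the container environment when the model is registered with SageMaker. A rough example (the variable names come from the script above; the resource names and values are only illustrative, not my exact code):
# Rough sketch: overriding serve-script defaults via the container environment
import boto3

sm_client = boto3.client("sagemaker")
sm_client.create_model(
    ModelName="triton-merlin-ensemble",            # illustrative name
    ExecutionRoleArn=role,                         # assumes an IAM role variable is defined
    PrimaryContainer={
        "Image": image_uri,                        # image built from the Dockerfile above
        "ModelDataUrl": model_data_url,            # S3 model.tar.gz with the exported ensemble
        "Environment": {
            "SAGEMAKER_TRITON_LOG_VERBOSE": "true",
            "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": "16777216",
        },
    },
)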
train.py
(here I just copied the transformers4rec tutorial):
import argparse
import json
import logging
import os
import sys
import tempfile
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import glob
import cudf
import numpy as np
import pandas as pd
import nvtabular as nvt
from nvtabular.ops import *
from merlin.schema.tags import Tags
from transformers4rec.utils.data_utils import save_time_based_splits
import torch
from transformers4rec import torch as tr
from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt
from transformers4rec.torch.utils.examples_utils import wipe_memory
from merlin.schema import Schema
from merlin.io import Dataset
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer
from merlin.core.dispatch import make_df
from merlin.systems.dag import Ensemble
from merlin.systems.dag.ops.pytorch import PredictPyTorch
from merlin.systems.dag.ops.workflow import TransformWorkflow
import cloudpickle
from merlin.table import TensorTable, TorchColumn
from merlin.table.conversions import convert_col
import shutil
from nvtabular.workflow import Workflow
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))
def parse_args():
    """
    Parse arguments passed from the SageMaker API to the container.
    """
    parser = argparse.ArgumentParser()
    # Model directory: we will use the default set by SageMaker, /opt/ml/model
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    return parser.parse_known_args()
def data_preprocessing():
    INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", "./data/")
    NUM_ROWS = os.environ.get("NUM_ROWS", 100000)
    long_tailed_item_distribution = np.clip(np.random.lognormal(3., 1., int(NUM_ROWS)).astype(np.int32), 1, 50000)
    # generate random item interaction features
    df = pd.DataFrame(np.random.randint(70000, 90000, int(NUM_ROWS)), columns=['session_id'])
    df['item_id'] = long_tailed_item_distribution
    # generate category mapping for each item-id
    df['category'] = pd.cut(df['item_id'], bins=334, labels=np.arange(1, 335)).astype(np.int32)
    df['age_days'] = np.random.uniform(0, 1, int(NUM_ROWS)).astype(np.float32)
    df['weekday_sin'] = np.random.uniform(0, 1, int(NUM_ROWS)).astype(np.float32)
    # generate day mapping for each session
    map_day = dict(zip(df.session_id.unique(), np.random.randint(1, 10, size=(df.session_id.nunique()))))
    df['day'] = df.session_id.map(map_day)
    SESSIONS_MAX_LENGTH = 20
    # Categorify categorical features
    categ_feats = ['item_id', 'category'] >> nvt.ops.Categorify()
    # Define Groupby Workflow
    groupby_feats = categ_feats + ['session_id', 'day', 'age_days', 'weekday_sin']
    # Group interaction features by session
    groupby_features = groupby_feats >> nvt.ops.Groupby(
        groupby_cols=["session_id"],
        aggs={
            "item_id": ["list", "count"],
            "category": ["list"],
            "day": ["first"],
            "age_days": ["list"],
            'weekday_sin': ["list"],
        },
        name_sep="-")
    # Select and truncate the sequential features
    sequence_features_truncated = (
        groupby_features['category-list']
        >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH)
    )
    sequence_features_truncated_item = (
        groupby_features['item_id-list']
        >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH)
        >> TagAsItemID()
    )
    sequence_features_truncated_cont = (
        groupby_features['age_days-list', 'weekday_sin-list']
        >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH)
        >> nvt.ops.AddMetadata(tags=[Tags.CONTINUOUS])
    )
    # Filter out sessions with length 1 (not valid for next-item prediction training and evaluation)
    MINIMUM_SESSION_LENGTH = 2
    selected_features = (
        groupby_features['item_id-count', 'day-first', 'session_id'] +
        sequence_features_truncated_item +
        sequence_features_truncated +
        sequence_features_truncated_cont
    )
    filtered_sessions = selected_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)
    seq_feats_list = filtered_sessions['item_id-list', 'category-list', 'age_days-list', 'weekday_sin-list'] >> nvt.ops.ValueCount()
    workflow = nvt.Workflow(filtered_sessions['session_id', 'day-first'] + seq_feats_list)
    dataset = nvt.Dataset(df)
    # Generate statistics for the features and export parquet files
    # this step will generate the schema file
    workflow.fit_transform(dataset).to_parquet(os.path.join(INPUT_DATA_DIR, "processed_nvt"))
    workflow.save(os.path.join(INPUT_DATA_DIR, "workflow_etl"))
    OUTPUT_DIR = os.environ.get("OUTPUT_DIR", os.path.join(INPUT_DATA_DIR, "sessions_by_day"))
    # Read in the processed parquet file
    sessions_gdf = cudf.read_parquet(os.path.join(INPUT_DATA_DIR, "processed_nvt/part_0.parquet"))
    save_time_based_splits(data=nvt.Dataset(sessions_gdf),
                           output_dir=OUTPUT_DIR,
                           partition_col='day-first',
                           timestamp_col='session_id',
                           )
    return
def model_training():
    INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", "./data")
    OUTPUT_DIR = os.environ.get("OUTPUT_DIR", f"{INPUT_DATA_DIR}/sessions_by_day")
    train = Dataset(os.path.join(INPUT_DATA_DIR, "processed_nvt/part_0.parquet"))
    schema = train.schema
    # You can select a subset of features for training
    schema = schema.select_by_name(['item_id-list',
                                    'category-list',
                                    'weekday_sin-list',
                                    'age_days-list'])
    inputs = tr.TabularSequenceFeatures.from_schema(
        schema,
        max_sequence_length=20,
        continuous_projection=64,
        masking="mlm",
        d_output=100,
    )
    # Define XLNetConfig class and set default parameters for HF XLNet config
    transformer_config = tr.XLNetConfig.build(
        d_model=64, n_head=4, n_layer=2, total_seq_length=20
    )
    # Define the model block including: inputs, masking, projection and transformer block.
    body = tr.SequentialBlock(
        inputs, tr.MLPBlock([64]), tr.TransformerBlock(transformer_config, masking=inputs.masking)
    )
    # Define the evaluation top-N metrics and the cut-offs
    metrics = [NDCGAt(top_ks=[20, 40], labels_onehot=True),
               RecallAt(top_ks=[20, 40], labels_onehot=True)]
    # Define a head related to next item prediction task
    head = tr.Head(
        body,
        tr.NextItemPredictionTask(weight_tying=True,
                                  metrics=metrics),
        inputs=inputs,
    )
    # Get the end-to-end Model class
    model = tr.Model(head)
    per_device_train_batch_size = int(os.environ.get(
        "per_device_train_batch_size",
        '128'
    ))
    per_device_eval_batch_size = int(os.environ.get(
        "per_device_eval_batch_size",
        '32'
    ))
    # Set hyperparameters for training
    train_args = T4RecTrainingArguments(
        data_loader_engine='merlin',
        dataloader_drop_last=True,
        gradient_accumulation_steps=1,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        output_dir="./tmp",
        learning_rate=0.0005,
        lr_scheduler_type='cosine',
        learning_rate_num_cosine_cycles_by_epoch=1.5,
        num_train_epochs=5,
        max_sequence_length=20,
        report_to=[],
        logging_steps=50,
        no_cuda=False,
    )
    trainer = Trainer(
        model=model,
        args=train_args,
        schema=schema,
        compute_metrics=True,
    )
    start_window_index = int(os.environ.get(
        "start_window_index",
        '1'
    ))
    final_window_index = int(os.environ.get(
        "final_window_index",
        '8'
    ))
    start_time_window_index = start_window_index
    final_time_window_index = final_window_index
    # Iterating over days of one week
    for time_index in range(start_time_window_index, final_time_window_index):
        # Set data
        time_index_train = time_index
        time_index_eval = time_index + 1
        train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
        eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
        print(train_paths)
        # Train on day related to time_index
        print('*'*20)
        print("Launch training for day %s are:" %time_index)
        print('*'*20 + '\n')
        trainer.train_dataset_or_path = train_paths
        trainer.reset_lr_scheduler()
        trainer.train()
        trainer.state.global_step += 1
        print('finished')
        # Evaluate on the following day
        trainer.eval_dataset_or_path = eval_paths
        train_metrics = trainer.evaluate(metric_key_prefix='eval')
        print('*'*20)
        print("Eval results for day %s are:\t" %time_index_eval)
        print('\n' + '*'*20 + '\n')
        for key in sorted(train_metrics.keys()):
            print(" %s = %s" % (key, str(train_metrics[key])))
        wipe_memory()
    eval_data_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # set new data from day 7
    eval_metrics = trainer.evaluate(eval_dataset=eval_data_paths, metric_key_prefix='eval')
    for key in sorted(eval_metrics.keys()):
        print(" %s = %s" % (key, str(eval_metrics[key])))
    model_path = os.environ.get("OUTPUT_DIR", f"{INPUT_DATA_DIR}/saved_model")
    model.save(model_path)
def model_ensemble(output_path):
    INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", "./data/")
    OUTPUT_DIR = os.environ.get("OUTPUT_DIR", f"{INPUT_DATA_DIR}/sessions_by_day")
    model_path = os.environ.get("model_path", f"{INPUT_DATA_DIR}/saved_model")
    loaded_model = cloudpickle.load(
        open(os.path.join(model_path, "t4rec_model_class.pkl"), "rb")
    )
    model = loaded_model.cuda()
    model.eval()
    train_paths = os.path.join(OUTPUT_DIR, f"{1}/train.parquet")
    dataset = Dataset(train_paths)
    df = cudf.read_parquet(train_paths, columns=model.input_schema.column_names)
    table = TensorTable.from_df(df.loc[:100])
    for column in table.columns:
        table[column] = convert_col(table[column], TorchColumn)
    model_input_dict = table.to_dict()
    traced_model = torch.jit.trace(model, model_input_dict, strict=True)
    input_schema = model.input_schema
    output_schema = model.output_schema
    workflow = Workflow.load(os.path.join(INPUT_DATA_DIR, "workflow_etl"))
    torch_op = workflow.input_schema.column_names >> TransformWorkflow(workflow) >> PredictPyTorch(
        traced_model, input_schema, output_schema
    )
    ensemble = Ensemble(torch_op, workflow.input_schema)
    ens_config, node_configs = ensemble.export(output_path)
    return
def train(output_path):
    data_preprocessing()
    model_training()
    model_ensemble(output_path)
    return


if __name__ == "__main__":
    args, _ = parse_args()
    train(args.model_dir)
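The training job itself is launched with a generic script-mode SageMaker Estimator pointing at the custom image, roughly like this (the instance type and job-name prefix match the logs below; the rest is illustrative, not my exact code):
# Rough sketch of how train.py is submitted as a SageMaker training job
import sagemaker
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()

estimator = Estimator(
    image_uri=image_uri,              # image built from the Dockerfile above
    entry_point="train.py",
    source_dir=".",
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    base_job_name="model-training",
)
estimator.fit()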
- Training output logs (successful execution):
INFO:sagemaker:Creating training-job with name: model-training-2024-12-10-16-51-51-539
2024-12-10 16:51:54 Starting - Starting the training job...
2024-12-10 16:52:08 Starting - Preparing the instances for training...
2024-12-10 16:52:49 Downloading - Downloading the training image..................
2024-12-10 16:55:56 Training - Training image download completed. Training in progress....==================================
== Triton Inference Server Base ==
==================================
NVIDIA Release 23.06 (build 62878575)
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-12-10 16:56:10,623 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2024-12-10 16:56:10,658 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2024-12-10 16:56:10,693 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2024-12-10 16:56:10,707 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {},
"current_host": "algo-1",
"current_instance_group": "homogeneousCluster",
"current_instance_group_hosts": [
"algo-1"
],
"current_instance_type": "ml.g4dn.xlarge",
"distribution_hosts": [],
"distribution_instance_groups": [],
"framework_module": null,
"hosts": [
"algo-1"
],
"hyperparameters": {},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {},
"input_dir": "/opt/ml/input",
"instance_groups": [
"homogeneousCluster"
],
"instance_groups_dict": {
"homogeneousCluster": {
"instance_group_name": "homogeneousCluster",
"instance_type": "ml.g4dn.xlarge",
"hosts": [
"algo-1"
]
}
},
"is_hetero": false,
"is_master": true,
"is_modelparallel_enabled": null,
"is_smddpmprun_installed": false,
"is_smddprun_installed": false,
"job_name": "model-training-2024-12-10-16-51-51-539",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-us-east-1-397669588823/model-training-2024-12-10-16-51-51-539/source/sourcedir.tar.gz",
"module_name": "train",
"network_interface_name": "eth0",
"num_cpus": 4,
"num_gpus": 1,
"num_neurons": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"current_instance_type": "ml.g4dn.xlarge",
"current_group_name": "homogeneousCluster",
"hosts": [
"algo-1"
],
"instance_groups": [
{
"instance_group_name": "homogeneousCluster",
"instance_type": "ml.g4dn.xlarge",
"hosts": [
"algo-1"
]
}
],
"network_interface_name": "eth0"
},
"user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g4dn.xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_NUM_NEURONS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-397669588823/model-training-2024-12-10-16-51-51-539/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g4dn.xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":null,"hosts":["algo-1"],"hyperparameters":{},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":false,"is_smddprun_installed":false,"job_name":"model-training-2024-12-10-16-51-51-539","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-397669588823/model-training-2024-12-10-16-51-51-539/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=[]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
PYTHONPATH=/opt/ml/code:/usr/local/bin:/opt/tritonserver:/usr/local/lib/python3.10/dist-packages:/usr/lib/python310.zip:/usr/lib/python3.10:/usr/lib/python3.10/lib-dynload:/usr/local/lib/python3.10/dist-packages/faiss-1.7.2-py3.10.egg:/ptx:/usr/local/lib/python3.10/dist-packages/merlin_hps-0.0.0-py3.10-linux-x86_64.egg:/usr/lib/python3/dist-packages:/usr/lib/python3.10/dist-packages
Invoking script with the following command:
/usr/bin/python3 train.py
2024-12-10 16:56:10,708 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker Debugger as it is not installed.
2024-12-10 16:56:10,708 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker TF as Tensorflow is not installed.
/usr/local/lib/python3.10/dist-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'
warn(f"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}")
#015Creating time-based splits: 0%| | 0/9 [00:00<?, ?it/s]#015Creating time-based splits: 11%|█ | 1/9 [00:00<00:07, 1.11it/s]#015Creating time-based splits: 56%|█████▌ | 5/9 [00:01<00:00, 6.19it/s]#015Creating time-based splits: 100%|██████████| 9/9 [00:01<00:00, 11.05it/s]#015Creating time-based splits: 100%|██████████| 9/9 [00:01<00:00, 7.77it/s]
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
['./data/sessions_by_day/1/train.parquet']
********************
Launch training for day 1 are:
********************
#015 0%| | 0/65 [00:00<?, ?it/s]#015 2%|▏ | 1/65 [00:01<01:53, 1.78s/it]#015 3%|▎ | 2/65 [00:01<00:51, 1.23it/s]#015 6%|▌ | 4/65 [00:02<00:21, 2.88it/s]#015 9%|▉ | 6/65 [00:02<00:12, 4.59it/s]#015 12%|█▏ | 8/65 [00:02<00:09, 6.29it/s]#015 15%|█▌ | 10/65 [00:02<00:06, 7.86it/s]#015 18%|█▊ | 12/65 [00:02<00:05, 9.19it/s]#015 22%|██▏ | 14/65 [00:02<00:05, 9.63it/s]#015 25%|██▍ | 16/65 [00:02<00:04, 10.71it/s]#015 28%|██▊ | 18/65 [00:03<00:04, 11.48it/s]#015 31%|███ | 20/65 [00:03<00:03, 12.16it/s]#015 34%|███▍ | 22/65 [00:03<00:03, 12.70it/s]#015 37%|███▋ | 24/65 [00:03<00:03, 13.03it/s]#015 40%|████ | 26/65 [00:03<00:02, 13.20it/s]#015 43%|████▎ | 28/65 [00:03<00:03, 12.32it/s]#015 46%|████▌ | 30/65 [00:04<00:02, 12.75it/s]#015 49%|████▉ | 32/65 [00:04<00:02, 13.03it/s]#015 52%|█████▏ | 34/65 [00:04<00:02, 13.26it/s]#015 55%|█████▌ | 36/65 [00:04<00:02, 13.46it/s]#015 58%|█████▊ | 38/65 [00:04<00:01, 13.55it/s]#015 62%|██████▏ | 40/65 [00:04<00:02, 12.46it/s]#015 65%|██████▍ | 42/65 [00:04<00:01, 12.93it/s]#015 68%|██████▊ | 44/65 [00:05<00:01, 13.19it/s]#015 71%|███████ | 46/65 [00:05<00:01, 13.43it/s]#015 74%|███████▍ | 48/65 [00:05<00:01, 13.48it/s]#015 77%|███████▋ | 50/65 [00:05<00:01, 13.66it/s]#015 #015#015 77%|███████▋ | 50/65 [00:05<00:01, 13.66it/s]#015 80%|████████ | 52/65 [00:05<00:00, 13.54it/s]#015 83%|████████▎ | 54/65 [00:05<00:00, 12.33it/s]#015 86%|████████▌ | 56/65 [00:05<00:00, 12.79it/s]#015 89%|████████▉ | 58/65 [00:06<00:00, 13.13it/s]#015 92%|█████████▏| 60/65 [00:06<00:00, 13.26it/s]#015 95%|█████████▌| 62/65 [00:06<00:00, 13.36it/s]#015 98%|█████████▊| 64/65 [00:06<00:00, 13.53it/s]#015 #015#015100%|██████████| 65/65 [00:06<00:00, 13.53it/s]#015100%|██████████| 65/65 [00:06<00:00, 9.73it/s]
{'loss': 5.8526, 'learning_rate': 0.0002801341700638303, 'epoch': 3.85}
{'train_runtime': 6.6806, 'train_samples_per_second': 1245.397, 'train_steps_per_second': 9.73, 'train_loss': 5.716501206618089, 'epoch': 5.0}
finished
#015 0%| | 0/6 [00:00<?, ?it/s]#015100%|██████████| 6/6 [00:00<00:00, 81.45it/s]
********************
Eval results for day 2 are:#011
********************
eval_/loss = 5.159573078155518
eval_/next-item/ndcg_at_20 = 0.151389941573143
eval_/next-item/ndcg_at_40 = 0.20212794840335846
eval_/next-item/recall_at_20 = 0.4114583432674408
eval_/next-item/recall_at_40 = 0.65625
eval_runtime = 0.1406
eval_samples_per_second = 1365.82
eval_steps_per_second = 42.682
['./data/sessions_by_day/2/train.parquet']
********************
Launch training for day 2 are:
********************
#015 0%| | 0/65 [00:00<?, ?it/s]#015 2%|▏ | 1/65 [00:00<00:07, 8.46it/s]#015 5%|▍ | 3/65 [00:00<00:05, 11.95it/s]#015 8%|▊ | 5/65 [00:00<00:04, 13.05it/s]#015 11%|█ | 7/65 [00:00<00:04, 13.28it/s]#015 14%|█▍ | 9/65 [00:00<00:04, 13.38it/s]#015 17%|█▋ | 11/65 [00:00<00:03, 13.59it/s]#015 20%|██ | 13/65 [00:00<00:03, 13.72it/s]#015 23%|██▎ | 15/65 [00:01<00:03, 12.57it/s]#015 26%|██▌ | 17/65 [00:01<00:03, 12.87it/s]#015 29%|██▉ | 19/65 [00:01<00:03, 13.18it/s]#015 32%|███▏ | 21/65 [00:01<00:03, 13.38it/s]#015 35%|███▌ | 23/65 [00:01<00:03, 13.57it/s]#015 38%|███▊ | 25/65 [00:01<00:02, 13.74it/s]#015 42%|████▏ | 27/65 [00:02<00:03, 12.62it/s]#015 45%|████▍ | 29/65 [00:02<00:02, 12.92it/s]#015 48%|████▊ | 31/65 [00:02<00:02, 13.14it/s]#015 51%|█████ | 33/65 [00:02<00:02, 13.42it/s]#015 54%|█████▍ | 35/65 [00:02<00:02, 13.43it/s]#015 57%|█████▋ | 37/65 [00:02<00:02, 13.57it/s]#015 60%|██████ | 39/65 [00:02<00:01, 13.57it/s]#015 63%|██████▎ | 41/65 [00:03<00:01, 12.52it/s]#015 66%|██████▌ | 43/65 [00:03<00:01, 12.88it/s]#015 69%|██████▉ | 45/65 [00:03<00:01, 13.17it/s]#015 72%|███████▏ | 47/65 [00:03<00:01, 13.26it/s]#015 75%|███████▌ | 49/65 [00:03<00:01, 13.44it/s]#015 #015#015 77%|███████▋ | 50/65 [00:03<00:01, 13.44it/s]#015 78%|███████▊ | 51/65 [00:03<00:01, 13.54it/s]#015 82%|████████▏ | 53/65 [00:04<00:00, 12.41it/s]#015 85%|████████▍ | 55/65 [00:04<00:00, 12.80it/s]#015 88%|████████▊ | 57/65 [00:04<00:00, 12.91it/s]#015 91%|█████████ | 59/65 [00:04<00:00, 13.21it/s]#015 94%|█████████▍| 61/65 [00:04<00:00, 13.23it/s]#015 97%|█████████▋| 63/65 [00:04<00:00, 13.45it/s]#015100%|██████████| 65/65 [00:04<00:00, 13.69it/s]#015 #015#015100%|██████████| 65/65 [00:04<00:00, 13.69it/s]#015100%|██████████| 65/65 [00:04<00:00, 13.18it/s]
{'loss': 4.9397, 'learning_rate': 0.0002801341700638303, 'epoch': 3.85}
{'train_runtime': 4.932, 'train_samples_per_second': 1686.954, 'train_steps_per_second': 13.179, 'train_loss': 4.896147273137019, 'epoch': 5.0}
finished
#015 0%| | 0/6 [00:00<?, ?it/s]#015100%|██████████| 6/6 [00:00<00:00, 80.61it/s]
********************
Eval results for day 3 are:#011
********************
eval_/loss = 4.623995304107666
eval_/next-item/ndcg_at_20 = 0.1915176659822464
eval_/next-item/ndcg_at_40 = 0.2323640137910843
eval_/next-item/recall_at_20 = 0.5
eval_/next-item/recall_at_40 = 0.6979166865348816
eval_runtime = 0.1367
eval_samples_per_second = 1404.154
eval_steps_per_second = 43.88
['./data/sessions_by_day/3/train.parquet']
********************
Launch training for day 3 are:
********************
#015 0%| | 0/65 [00:00<?, ?it/s]#015 2%|▏ | 1/65 [00:00<00:07, 8.63it/s]#015 5%|▍ | 3/65 [00:00<00:05, 12.12it/s]#015 8%|▊ | 5/65 [00:00<00:04, 13.07it/s]#015 11%|█ | 7/65 [00:00<00:04, 13.19it/s]#015 14%|█▍ | 9/65 [00:00<00:04, 13.32it/s]#015 17%|█▋ | 11/65 [00:00<00:03, 13.54it/s]#015 20%|██ | 13/65 [00:00<00:03, 13.67it/s]#015 23%|██▎ | 15/65 [00:01<00:03, 12.57it/s]#015 26%|██▌ | 17/65 [00:01<00:03, 12.99it/s]#015 29%|██▉ | 19/65 [00:01<00:03, 13.34it/s]#015 32%|███▏ | 21/65 [00:01<00:03, 13.57it/s]#015 35%|███▌ | 23/65 [00:01<00:03, 13.64it/s]#015 38%|███▊ | 25/65 [00:01<00:02, 13.74it/s]#015 42%|████▏ | 27/65 [00:02<00:03, 12.49it/s]#015 45%|████▍ | 29/65 [00:02<00:02, 12.83it/s]#015 48%|████▊ | 31/65 [00:02<00:02, 13.05it/s]#015 51%|█████ | 33/65 [00:02<00:02, 13.33it/s]#015 54%|█████▍ | 35/65 [00:02<00:02, 13.52it/s]#015 57%|█████▋ | 37/65 [00:02<00:02, 13.50it/s]#015 60%|██████ | 39/65 [00:02<00:01, 13.56it/s]#015 63%|██████▎ | 41/65 [00:03<00:01, 12.42it/s]#015 66%|██████▌ | 43/65 [00:03<00:01, 12.79it/s]#015 69%|██████▉ | 45/65 [00:03<00:01, 12.98it/s]#015 72%|███████▏ | 47/65 [00:03<00:01, 13.11it/s]#015 75%|███████▌ | 49/65 [00:03<00:01, 13.19it/s]#015 #015#015 77%|███████▋ | 50/65 [00:03<00:01, 13.19it/s]#015 78%|███████▊ | 51/65 [00:03<00:01, 13.28it/s]#015 82%|████████▏ | 53/65 [00:04<00:01, 11.52it/s]#015 85%|████████▍ | 55/65 [00:04<00:00, 11.63it/s]#015 88%|████████▊ | 57/65 [00:04<00:00, 12.03it/s]#015 91%|█████████ | 59/65 [00:04<00:00, 12.49it/s]#015 94%|█████████▍| 61/65 [00:04<00:00, 12.77it/s]#015 97%|█████████▋| 63/65 [00:04<00:00, 13.09it/s]#015100%|██████████| 65/65 [00:05<00:00, 13.23it/s]#015 #015#015100%|██████████| 65/65 [00:05<00:00, 13.23it/s]#015100%|██████████| 65/65 [00:05<00:00, 12.95it/s]
{'loss': 4.6249, 'learning_rate': 0.0002801341700638303, 'epoch': 3.85}
{'train_runtime': 5.0177, 'train_samples_per_second': 1658.117, 'train_steps_per_second': 12.954, 'train_loss': 4.608376018817609, 'epoch': 5.0}
finished
#015 0%| | 0/6 [00:00<?, ?it/s]#015100%|██████████| 6/6 [00:00<00:00, 65.46it/s]
********************
Eval results for day 4 are:#011
********************
eval_/loss = 4.37877082824707
eval_/next-item/ndcg_at_20 = 0.21264421939849854
eval_/next-item/ndcg_at_40 = 0.26149502396583557
eval_/next-item/recall_at_20 = 0.5520833134651184
eval_/next-item/recall_at_40 = 0.7916666865348816
eval_runtime = 0.1551
eval_samples_per_second = 1238.234
eval_steps_per_second = 38.695
['./data/sessions_by_day/4/train.parquet']
********************
Launch training for day 4 are:
********************
#015 0%| | 0/65 [00:00<?, ?it/s]#015 2%|▏ | 1/65 [00:00<00:09, 6.78it/s]#015 3%|▎ | 2/65 [00:00<00:07, 8.19it/s]#015 5%|▍ | 3/65 [00:00<00:07, 8.77it/s]#015 6%|▌ | 4/65 [00:00<00:06, 9.10it/s]#015 8%|▊ | 5/65 [00:00<00:06, 9.26it/s]#015 9%|▉ | 6/65 [00:00<00:06, 9.38it/s]#015 11%|█ | 7/65 [00:00<00:06, 9.43it/s]#015 12%|█▏ | 8/65 [00:00<00:06, 9.45it/s]#015 14%|█▍ | 9/65 [00:00<00:05, 9.54it/s]#015 15%|█▌ | 10/65 [00:01<00:05, 9.56it/s]#015 17%|█▋ | 11/65 [00:01<00:05, 9.58it/s]#015 18%|█▊ | 12/65 [00:01<00:05, 9.62it/s]#015 20%|██ | 13/65 [00:01<00:05, 9.62it/s]#015 22%|██▏ | 14/65 [00:01<00:06, 8.48it/s]#015 23%|██▎ | 15/65 [00:01<00:05, 8.81it/s]#015 26%|██▌ | 17/65 [00:01<00:04, 10.19it/s]#015 29%|██▉ | 19/65 [00:01<00:04, 11.27it/s]#015 32%|███▏ | 21/65 [00:02<00:03, 11.98it/s]#015 35%|███▌ | 23/65 [00:02<00:03, 12.46it/s]#015 38%|███▊ | 25/65 [00:02<00:03, 12.65it/s]#015 42%|████▏ | 27/65 [00:02<00:03, 11.79it/s]#015 45%|████▍ | 29/65 [00:02<00:02, 12.27it/s]#015 48%|████▊ | 31/65 [00:02<00:02, 12.59it/s]#015 51%|█████ | 33/65 [00:03<00:02, 12.78it/s]#015 54%|█████▍ | 35/65 [00:03<00:02, 12.96it/s]#015 57%|█████▋ | 37/65 [00:03<00:02, 13.05it/s]#015 60%|██████ | 39/65 [00:03<00:01, 13.11it/s]#015 63%|██████▎ | 41/65 [00:03<00:01, 12.02it/s]#015 66%|██████▌ | 43/65 [00:03<00:01, 12.44it/s]#015 69%|██████▉ | 45/65 [00:04<00:01, 12.49it/s]#015 72%|███████▏ | 47/65 [00:04<00:01, 12.88it/s]#015 75%|███████▌ | 49/65 [00:04<00:01, 13.10it/s]#015 #015#015 77%|███████▋ | 50/65 [00:04<00:01, 13.10it/s]#015 78%|███████▊ | 51/65 [00:04<00:01, 13.18it/s]#015 82%|████████▏ | 53/65 [00:04<00:00, 12.10it/s]#015 85%|████████▍ | 55/65 [00:04<00:00, 12.51it/s]#015 88%|████████▊ | 57/65 [00:04<00:00, 12.83it/s]#015 91%|█████████ | 59/65 [00:05<00:00, 12.98it/s]#015 94%|█████████▍| 61/65 [00:05<00:00, 13.17it/s]#015 97%|█████████▋| 63/65 [00:05<00:00, 13.29it/s]#015100%|██████████| 65/65 [00:05<00:00, 13.41it/s]#015 #015#015100%|██████████| 65/65 [00:05<00:00, 13.41it/s]#015100%|██████████| 65/65 [00:05<00:00, 11.76it/s]
{'loss': 4.5203, 'learning_rate': 0.0002801341700638303, 'epoch': 3.85}
{'train_runtime': 5.5296, 'train_samples_per_second': 1504.618, 'train_steps_per_second': 11.755, 'train_loss': 4.5138815072866585, 'epoch': 5.0}
finished
#015 0%| | 0/6 [00:00<?, ?it/s]#015100%|██████████| 6/6 [00:00<00:00, 80.70it/s]
********************
Eval results for day 5 are:#011
********************
eval_/loss = 4.3964667320251465
eval_/next-item/ndcg_at_20 = 0.18582205474376678
eval_/next-item/ndcg_at_40 = 0.24064064025878906
eval_/next-item/recall_at_20 = 0.5104166865348816
eval_/next-item/recall_at_40 = 0.78125
eval_runtime = 0.1351
eval_samples_per_second = 1421.411
eval_steps_per_second = 44.419
['./data/sessions_by_day/5/train.parquet']
********************
Launch training for day 5 are:
********************
#015 0%| | 0/60 [00:00<?, ?it/s]#015 2%|▏ | 1/60 [00:00<00:07, 7.42it/s]#015 5%|▌ | 3/60 [00:00<00:05, 11.33it/s]#015 8%|▊ | 5/60 [00:00<00:04, 12.46it/s]#015 12%|█▏ | 7/60 [00:00<00:04, 12.99it/s]#015 15%|█▌ | 9/60 [00:00<00:03, 13.34it/s]#015 18%|█▊ | 11/60 [00:00<00:03, 13.43it/s]#015 22%|██▏ | 13/60 [00:01<00:03, 12.15it/s]#015 25%|██▌ | 15/60 [00:01<00:03, 12.56it/s]#015 28%|██▊ | 17/60 [00:01<00:03, 12.92it/s]#015 32%|███▏ | 19/60 [00:01<00:03, 13.29it/s]#015 35%|███▌ | 21/60 [00:01<00:02, 13.54it/s]#015 38%|███▊ | 23/60 [00:01<00:02, 13.71it/s]#015 42%|████▏ | 25/60 [00:01<00:02, 12.52it/s]#015 45%|████▌ | 27/60 [00:02<00:02, 12.78it/s]#015 48%|████▊ | 29/60 [00:02<00:02, 13.12it/s]#015 52%|█████▏ | 31/60 [00:02<00:02, 13.23it/s]#015 55%|█████▌ | 33/60 [00:02<00:02, 13.33it/s]#015 58%|█████▊ | 35/60 [00:02<00:01, 13.40it/s]#015 62%|██████▏ | 37/60 [00:02<00:01, 11.61it/s]#015 65%|██████▌ | 39/60 [00:03<00:01, 12.02it/s]#015 68%|██████▊ | 41/60 [00:03<00:01, 12.44it/s]#015 72%|███████▏ | 43/60 [00:03<00:01, 12.68it/s]#015 75%|███████▌ | 45/60 [00:03<00:01, 12.88it/s]#015 78%|███████▊ | 47/60 [00:03<00:00, 13.09it/s]#015 82%|████████▏ | 49/60 [00:03<00:00, 12.05it/s]#015 #015#015 83%|████████▎ | 50/60 [00:03<00:00, 12.05it/s]#015 85%|████████▌ | 51/60 [00:04<00:00, 12.41it/s]#015 88%|████████▊ | 53/60 [00:04<00:00, 12.63it/s]#015 92%|█████████▏| 55/60 [00:04<00:00, 12.93it/s]#015 95%|█████████▌| 57/60 [00:04<00:00, 13.02it/s]#015 98%|█████████▊| 59/60 [00:04<00:00, 13.15it/s]#015 #015#015100%|██████████| 60/60 [00:04<00:00, 13.15it/s]#015100%|██████████| 60/60 [00:04<00:00, 12.79it/s]
{'loss': 4.4897, 'learning_rate': 0.00024999999999999995, 'epoch': 4.17}
{'train_runtime': 4.6906, 'train_samples_per_second': 1637.311, 'train_steps_per_second': 12.791, 'train_loss': 4.493310101826986, 'epoch': 5.0}
finished
#015 0%| | 0/6 [00:00<?, ?it/s]#015100%|██████████| 6/6 [00:00<00:00, 75.56it/s]
********************
Eval results for day 6 are:#011
********************
eval_/loss = 4.464842319488525
eval_/next-item/ndcg_at_20 = 0.182639017701149
eval_/next-item/ndcg_at_40 = 0.23443275690078735
eval_/next-item/recall_at_20 = 0.5
eval_/next-item/recall_at_40 = 0.75
eval_runtime = 0.1432
eval_samples_per_second = 1341.187
eval_steps_per_second = 41.912
['./data/sessions_by_day/6/train.parquet']
********************
Launch training for day 6 are:
********************
#015 0%| | 0/65 [00:00<?, ?it/s]#015 2%|▏ | 1/65 [00:00<00:07, 8.29it/s]#015 5%|▍ | 3/65 [00:00<00:05, 11.42it/s]#015 8%|▊ | 5/65 [00:00<00:05, 11.30it/s]#015 11%|█ | 7/65 [00:00<00:05, 11.23it/s]#015 14%|█▍ | 9/65 [00:00<00:05, 11.14it/s]#015 17%|█▋ | 11/65 [00:00<00:04, 11.23it/s]#015 20%|██ | 13/65 [00:01<00:04, 11.20it/s]#015 23%|██▎ | 15/65 [00:01<00:04, 10.49it/s]#015 26%|██▌ | 17/65 [00:01<00:04, 10.92it/s]#015 29%|██▉ | 19/65 [00:01<00:04, 10.68it/s]#015 32%|███▏ | 21/65 [00:01<00:04, 10.58it/s]#015 35%|███▌ | 23/65 [00:02<00:04, 10.26it/s]#015 38%|███▊ | 25/65 [00:02<00:03, 10.12it/s]#015 42%|████▏ | 27/65 [00:02<00:04, 9.11it/s]#015 43%|████▎ | 28/65 [00:02<00:04, 9.14it/s]#015 45%|████▍ | 29/65 [00:02<00:03, 9.27it/s]#015 46%|████▌ | 30/65 [00:02<00:03, 9.27it/s]#015 49%|████▉ | 32/65 [00:03<00:03, 10.02it/s]#015 52%|█████▏ | 34/65 [00:03<00:02, 10.98it/s]#015 55%|█████▌ | 36/65 [00:03<00:02, 11.78it/s]#015 58%|█████▊ | 38/65 [00:03<00:02, 12.34it/s]#015 62%|██████▏ | 40/65 [00:03<00:02, 11.57it/s]#015 65%|██████▍ | 42/65 [00:03<00:01, 12.28it/s]#015 68%|██████▊ | 44/65 [00:04<00:01, 12.69it/s]#015 71%|███████ | 46/65 [00:04<00:01, 12.83it/s]#015 74%|███████▍ | 48/65 [00:04<00:01, 13.03it/s]#015 77%|███████▋ | 50/65 [00:04<00:01, 13.25it/s]#015 #015#015 77%|███████▋ | 50/65 [00:04<00:01, 13.25it/s]#015 80%|████████ | 52/65 [00:04<00:00, 13.31it/s]#015 83%|████████▎ | 54/65 [00:04<00:00, 12.21it/s]#015 86%|████████▌ | 56/65 [00:04<00:00, 12.52it/s]#015 89%|████████▉ | 58/65 [00:05<00:00, 12.78it/s]#015 92%|█████████▏| 60/65 [00:05<00:00, 12.95it/s]#015 95%|█████████▌| 62/65 [00:05<00:00, 13.18it/s]#015 98%|█████████▊| 64/65 [00:05<00:00, 13.03it/s]#015 #015#015100%|██████████| 65/65 [00:05<00:00, 13.03it/s]#015100%|██████████| 65/65 [00:05<00:00, 11.49it/s]
{'loss': 4.4849, 'learning_rate': 0.0002801341700638303, 'epoch': 3.85}
{'train_runtime': 5.6554, 'train_samples_per_second': 1471.167, 'train_steps_per_second': 11.493, 'train_loss': 4.474948002741887, 'epoch': 5.0}
finished
#015 0%| | 0/6 [00:00<?, ?it/s]#015100%|██████████| 6/6 [00:00<00:00, 75.96it/s]
********************
Eval results for day 7 are:#011
********************
eval_/loss = 4.43782901763916
eval_/next-item/ndcg_at_20 = 0.2118179351091385
eval_/next-item/ndcg_at_40 = 0.25489482283592224
eval_/next-item/recall_at_20 = 0.5364583134651184
eval_/next-item/recall_at_40 = 0.7447916865348816
eval_runtime = 0.1424
eval_samples_per_second = 1348.357
eval_steps_per_second = 42.136
['./data/sessions_by_day/7/train.parquet']
********************
Launch training for day 7 are:
********************
#015 0%| | 0/65 [00:00<?, ?it/s]#015 2%|▏ | 1/65 [00:00<00:07, 8.45it/s]#015 5%|▍ | 3/65 [00:00<00:05, 11.55it/s]#015 8%|▊ | 5/65 [00:00<00:04, 12.14it/s]#015 11%|█ | 7/65 [00:00<00:04, 12.45it/s]#015 14%|█▍ | 9/65 [00:00<00:04, 12.64it/s]#015 17%|█▋ | 11/65 [00:00<00:04, 12.93it/s]#015 20%|██ | 13/65 [00:01<00:03, 13.11it/s]#015 23%|██▎ | 15/65 [00:01<00:04, 12.07it/s]#015 26%|██▌ | 17/65 [00:01<00:03, 12.48it/s]#015 29%|██▉ | 19/65 [00:01<00:03, 12.80it/s]#015 32%|███▏ | 21/65 [00:01<00:03, 12.85it/s]#015 35%|███▌ | 23/65 [00:01<00:03, 13.10it/s]#015 38%|███▊ | 25/65 [00:01<00:03, 12.94it/s]#015 42%|████▏ | 27/65 [00:02<00:03, 11.95it/s]#015 45%|████▍ | 29/65 [00:02<00:02, 12.41it/s]#015 48%|████▊ | 31/65 [00:02<00:02, 12.76it/s]#015 51%|█████ | 33/65 [00:02<00:02, 12.83it/s]#015 54%|█████▍ | 35/65 [00:02<00:02, 13.01it/s]#015 57%|█████▋ | 37/65 [00:02<00:02, 13.02it/s]#015 60%|██████ | 39/65 [00:03<00:02, 12.91it/s]#015 63%|██████▎ | 41/65 [00:03<00:02, 11.93it/s]#015 66%|██████▌ | 43/65 [00:03<00:01, 12.38it/s]#015 69%|██████▉ | 45/65 [00:03<00:01, 12.74it/s]#015 72%|███████▏ | 47/65 [00:03<00:01, 13.04it/s]#015 75%|███████▌ | 49/65 [00:03<00:01, 13.28it/s]#015 #015#015 77%|███████▋ | 50/65 [00:03<00:01, 13.28it/s]#015 78%|███████▊ | 51/65 [00:04<00:01, 13.35it/s]#015 82%|████████▏ | 53/65 [00:04<00:00, 12.21it/s]#015 85%|████████▍ | 55/65 [00:04<00:00, 12.45it/s]#015 88%|████████▊ | 57/65 [00:04<00:00, 12.75it/s]#015 91%|█████████ | 59/65 [00:04<00:00, 12.89it/s]#015 94%|█████████▍| 61/65 [00:04<00:00, 13.13it/s]#015 97%|█████████▋| 63/65 [00:04<00:00, 13.11it/s]#015100%|██████████| 65/65 [00:05<00:00, 13.15it/s]#015 #015#015100%|██████████| 65/65 [00:05<00:00, 13.15it/s]#015100%|██████████| 65/65 [00:05<00:00, 12.72it/s]
{'loss': 4.5053, 'learning_rate': 0.0002801341700638303, 'epoch': 3.85}
{'train_runtime': 5.1121, 'train_samples_per_second': 1627.513, 'train_steps_per_second': 12.715, 'train_loss': 4.501318594125601, 'epoch': 5.0}
finished
#015 0%| | 0/6 [00:00<?, ?it/s]#015100%|██████████| 6/6 [00:00<00:00, 71.06it/s]
********************
Eval results for day 8 are:#011
********************
eval_/loss = 4.415571689605713
eval_/next-item/ndcg_at_20 = 0.17643620073795319
eval_/next-item/ndcg_at_40 = 0.2237410992383957
eval_/next-item/recall_at_20 = 0.5260416865348816
eval_/next-item/recall_at_40 = 0.7552083134651184
eval_runtime = 0.149
eval_samples_per_second = 1288.315
eval_steps_per_second = 40.26
#015 0%| | 0/6 [00:00<?, ?it/s]#015100%|██████████| 6/6 [00:00<00:00, 69.36it/s]
eval_/loss = 4.415571689605713
eval_/next-item/ndcg_at_20 = 0.17643620073795319
eval_/next-item/ndcg_at_40 = 0.2237410992383957
eval_/next-item/recall_at_20 = 0.5260416865348816
eval_/next-item/recall_at_40 = 0.7552083134651184
eval_runtime = 0.1498
eval_samples_per_second = 1281.623
eval_steps_per_second = 40.051
2024-12-10 16:57:15,910 sagemaker-training-toolkit INFO Reporting training SUCCESS
2024-12-10 16:57:22 Uploading - Uploading generated training model
2024-12-10 16:57:35 Completed - Training job completed
Training seconds: 300
Billable seconds: 300
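The endpoint is then created from the trained artifacts in the usual way (the model itself is registered with create_model as in the snippet after the serve script; the endpoint name below is the one invoked in the request code, the other names are illustrative):
# Rough sketch of the endpoint creation, following the Merlin-on-SageMaker tutorial
import boto3

sm_client = boto3.client("sagemaker")

endpoint_name = "endpoint-triton-merlin-ensemble-2024-12-10-16-59-50"
endpoint_config_name = endpoint_name + "-config"               # illustrative name

sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "triton-merlin-ensemble",             # name used in create_model earlier
            "InstanceType": "ml.g4dn.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)
sm_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)
sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)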
- Triton request code:
import boto3
import numpy as np
import pandas as pd
from merlin.schema.tags import Tags
from merlin.core.dispatch import get_lib
from nvtabular.workflow import Workflow
from merlin.systems.triton import convert_df_to_triton_input
import tritonclient.http as httpclient
### Data Generation
NUM_ROWS =1000
long_tailed_item_distribution = np.clip(np.random.lognormal(3., 1., int(NUM_ROWS)).astype(np.int32), 1, 50000)
# generate random item interaction features
df = pd.DataFrame(np.random.randint(70000, 90000, int(NUM_ROWS)), columns=['session_id'])
df['item_id'] = long_tailed_item_distribution
# generate category mapping for each item-id
df['category'] = pd.cut(df['item_id'], bins=334, labels=np.arange(1, 335)).astype(np.int32)
df['age_days'] = np.random.uniform(0, 1, int(NUM_ROWS)).astype(np.float32)
df['weekday_sin']= np.random.uniform(0, 1, int(NUM_ROWS)).astype(np.float32)
# generate day mapping for each session
map_day = dict(zip(df.session_id.unique(), np.random.randint(1, 10, size=(df.session_id.nunique()))))
df['day'] = df.session_id.map(map_day)
### Request Preparation
batch = df[:6]
workflow = Workflow.load("../container_output/0_transformworkflowtriton/1/workflow")
inputs = convert_df_to_triton_input(workflow.input_schema, batch, httpclient.InferInput)
request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
inputs
)
### Endpoint Invocation
endpoint_name = "endpoint-triton-merlin-ensemble-2024-12-10-16-59-50"
runtime_sm_client = boto3.client("sagemaker-runtime")
response = runtime_sm_client.invoke_endpoint(
EndpointName=endpoint_name,
ContentType=f"application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}",
Body=request_body,
)
# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix):]
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
response["Body"].read(), header_length=int(header_length_str)
)
print(result)
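For comparison, sending essentially the same request to a Triton server started locally on the training instance (with --model-repository=/opt/ml/model) works without the KeyError. This is roughly what I run there, reusing the workflow and batch prepared above and assuming Triton's default HTTP port 8000:
# Rough sketch of the local check that works inside the training instance
import tritonclient.http as httpclient
from merlin.systems.triton import convert_df_to_triton_input

local_client = httpclient.InferenceServerClient(url="localhost:8000")
local_inputs = convert_df_to_triton_input(workflow.input_schema, batch, httpclient.InferInput)
local_response = local_client.infer("executor_model", local_inputs)
print(local_response.get_response())   # local_response.as_numpy(<output name>) returns the scores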