-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Closed
Labels
Description
Describe the bug
I trained a model with DeepAR algorithm. I can deploy an inference endpoint and get predictions. I can do batch transform with 100 lines. When I run batch transform with 50k lines, I get error.
To reproduce
transformer = sagemaker.transformer.Transformer('allcustomers',
1, 'ml.m4.xlarge',
output_path = 's3://MYBUCKET/output/',
strategy='MultiRecord',
max_concurrent_transforms=10,
base_transform_job_name = 'allcustomers')
transformer.transform('s3://MYBUCKET/test/', split_type='Line')
Screenshots or logs
Arguments: serve
No handlers could be found for logger "root"
[03/18/2020 22:39:59 INFO 139806956894016] Estimated memory required per model 21MB.
[03/18/2020 22:39:59 INFO 139806956894016] Estimated available memory 15244MB.
[03/18/2020 22:39:59 INFO 139806956894016] Estimated maximum number of workers for the available memory is 719.
[03/18/2020 22:39:59 INFO 139806956894016] Using 4 workers
[03/18/2020 22:39:59 INFO 139806956894016] loading entry points
[03/18/2020 22:39:59 INFO 139806956894016] Prediction endpoint operating in batch mode
[03/18/2020 22:39:59 INFO 139806956894016] loaded request iterator application/jsonlines
[03/18/2020 22:39:59 INFO 139806956894016] loaded response encoder application/jsonlines
[03/18/2020 22:39:59 INFO 139806956894016] loaded model class model
[03/18/2020 22:39:59 WARNING 139806956894016] Requesting context without setting the requested num of gpus. Using 'auto'
[03/18/2020 22:40:00 INFO 139806956894016] nvidia-smi took: 0.0252621173859 secs to identify 0 gpus
[03/18/2020 22:40:00 INFO 139806956894016] Number of GPUs being used: 0
[03/18/2020 22:40:00 INFO 139806956894016] Loading Config from /opt/ml/model/model_algo-1-config.json
#metrics {"Metrics": {"model.deserialize_phase1.time": {"count": 1, "max": 1.4071464538574219, "sum": 1.4071464538574219, "min": 1.4071464538574219}}, "EndTime": 1584571200.021157, "Dimensions": {}, "StartTime": 1584571199.9294}
[03/18/2020 22:40:00 INFO 139806956894016] Deserializing model parameters from /opt/ml/model/model_algo-1
[22:40:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
#metrics {"Metrics": {"model.bind.time": {"count": 1, "max": 30.337095260620117, "sum": 30.337095260620117, "min": 30.337095260620117}}, "EndTime": 1584571200.075073, "Dimensions": {}, "StartTime": 1584571200.021286}
#metrics {"Metrics": {"model.deserialize_phase2.time": {"count": 1, "max": 53.880929946899414, "sum": 53.880929946899414, "min": 53.880929946899414}}, "EndTime": 1584571200.075195, "Dimensions": {}, "StartTime": 1584571200.075161}
[2020-03-18 22:40:00 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-03-18 22:40:00 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-03-18 22:40:00 +0000] [1] [INFO] Using worker: sync
[2020-03-18 22:40:00 +0000] [33] [INFO] Booting worker with pid: 33
[2020-03-18 22:40:00 +0000] [34] [INFO] Booting worker with pid: 34
[2020-03-18 22:40:00 +0000] [35] [INFO] Booting worker with pid: 35
[2020-03-18 22:40:00 +0000] [36] [INFO] Booting worker with pid: 36
2020-03-18T22:40:18.196:[sagemaker logs]: MaxConcurrentTransforms=10, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.06198883056640625, "sum": 0.06198883056640625, "min": 0.06198883056640625}}, "EndTime": 1584571219.106869, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571200.17764}
#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.04482269287109375, "sum": 0.04482269287109375, "min": 0.04482269287109375}}, "EndTime": 1584571219.132697, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571200.22286}
#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.03886222839355469, "sum": 0.03886222839355469, "min": 0.03886222839355469}}, "EndTime": 1584571219.139472, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571200.115628}
#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.025033950805664062, "sum": 0.025033950805664062, "min": 0.025033950805664062}}, "EndTime": 1584571219.164191, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571219.107058}
#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.01811981201171875, "sum": 0.01811981201171875, "min": 0.01811981201171875}}, "EndTime": 1584571219.173972, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571219.132838}
#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.02002716064453125, "sum": 0.02002716064453125, "min": 0.02002716064453125}}, "EndTime": 1584571219.186274, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571219.164287}
#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.06198883056640625, "sum": 0.06198883056640625, "min": 0.06198883056640625}}, "EndTime": 1584571219.106869, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571200.17764}
The log ends like:
[2020-03-18 22:47:00 +0000] [44] [INFO] Worker exiting (pid: 44)
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[2020-03-18 22:47:00 +0000] [48] [INFO] Booting worker with pid: 48
[2020-03-18 22:47:00 +0000] [49] [INFO] Booting worker with pid: 49
[2020-03-18 22:47:00 +0000] [50] [INFO] Booting worker with pid: 50
#metrics {"Metrics": {"invocations_error.count": {"count": 1, "max": 1, "sum": 1.0, "min": 1}}, "EndTime": 1584571620.751803, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571620.734525}
#metrics {"Metrics": {"invocations.count": {"count": 1, "max": 1, "sum": 1.0, "min": 1}}, "EndTime": 1584571620.751912, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571620.751885}
[2020-03-18 22:47:00 +0000] [44] [INFO] Worker exiting (pid: 44)
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[2020-03-18 22:47:00 +0000] [48] [INFO] Booting worker with pid: 48
[2020-03-18 22:47:00 +0000] [49] [INFO] Booting worker with pid: 49
[2020-03-18 22:47:00 +0000] [50] [INFO] Booting worker with pid: 50
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-133-657e6a78db31> in <module>()
----> 1 transformer.wait()
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self, logs)
253 """Placeholder docstring"""
254 self._ensure_last_transform_job()
--> 255 self.latest_transform_job.wait(logs=logs)
256
257 def stop_transform_job(self, wait=True):
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self, logs)
382 def wait(self, logs=True):
383 if logs:
--> 384 self.sagemaker_session.logs_for_transform_job(self.job_name, wait=True)
385 else:
386 self.sagemaker_session.wait_for_transform_job(self.job_name)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_transform_job(self, job_name, wait, poll)
3189
3190 if wait:
-> 3191 self._check_job_status(job_name, description, "TransformJobStatus")
3192 if dot:
3193 print()
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
2613 ),
2614 allowed_statuses=["Completed", "Stopped"],
-> 2615 actual_status=status,
2616 )
2617
UnexpectedStatusException: Error for Transform job allcustomersweekly-batch-2020-03-18-22-36-39-708: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
Additional context
Test data looks like:
{"start": "2019-01-20 00:00:00", "target": [2.0, 2.0, 1.0, 0.0, 3.0, 0.0, 3.0, 2.0, 0.0, 5.0, 0.0, 2.0, 3.0, 4.0, 0.0, 1.0, 2.0, 2.0, 2.0, 0.0, 3.0, 0.0, 3.0, 1.0, 3.0, 0.0, 0.0, 2.0, 3.0, 1.0, 3.0, 0.0, 3.0, 2.0, 3.0, 1.0, 1.0, 0.0, 2.0, 0.0, 1.0, 1.0, 2.0, 3.0, 1.0, 2.0, 1.0, 2.0, 0.0, 2.0], "cat": [2002, 2]}
{"start": "2019-03-03 00:00:00", "target": [2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0], "cat": [7147, 5]}
{"start": "2019-01-06 00:00:00", "target": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 3.0, 0.0, 0.0, 3.0, 0.0, 2.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 0.0, 1.0, 0.0, 3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0, 1.0, 1.0, 2.0], "cat": [7209, 2]}
Reactions are currently unavailable