Skip to content

Batch Transform Error for DeepAR model #1363

@mstfldmr

Description

@mstfldmr

Describe the bug
I trained a model with DeepAR algorithm. I can deploy an inference endpoint and get predictions. I can do batch transform with 100 lines. When I run batch transform with 50k lines, I get error.

To reproduce

transformer = sagemaker.transformer.Transformer('allcustomers',
                                                1, 'ml.m4.xlarge',
                                                output_path = 's3://MYBUCKET/output/',
                                                strategy='MultiRecord',
                                                max_concurrent_transforms=10,
                                                base_transform_job_name = 'allcustomers')

transformer.transform('s3://MYBUCKET/test/', split_type='Line')

Screenshots or logs

Arguments: serve
No handlers could be found for logger "root"
[03/18/2020 22:39:59 INFO 139806956894016] Estimated memory required per model 21MB.
[03/18/2020 22:39:59 INFO 139806956894016] Estimated available memory 15244MB.
[03/18/2020 22:39:59 INFO 139806956894016] Estimated maximum number of workers for the available memory is 719.
[03/18/2020 22:39:59 INFO 139806956894016] Using 4 workers
[03/18/2020 22:39:59 INFO 139806956894016] loading entry points
[03/18/2020 22:39:59 INFO 139806956894016] Prediction endpoint operating in batch mode
[03/18/2020 22:39:59 INFO 139806956894016] loaded request iterator application/jsonlines
[03/18/2020 22:39:59 INFO 139806956894016] loaded response encoder application/jsonlines
[03/18/2020 22:39:59 INFO 139806956894016] loaded model class model
[03/18/2020 22:39:59 WARNING 139806956894016] Requesting context without setting the requested num of gpus. Using 'auto'
[03/18/2020 22:40:00 INFO 139806956894016] nvidia-smi took: 0.0252621173859 secs to identify 0 gpus
[03/18/2020 22:40:00 INFO 139806956894016] Number of GPUs being used: 0
[03/18/2020 22:40:00 INFO 139806956894016] Loading Config from /opt/ml/model/model_algo-1-config.json
#metrics {"Metrics": {"model.deserialize_phase1.time": {"count": 1, "max": 1.4071464538574219, "sum": 1.4071464538574219, "min": 1.4071464538574219}}, "EndTime": 1584571200.021157, "Dimensions": {}, "StartTime": 1584571199.9294}

[03/18/2020 22:40:00 INFO 139806956894016] Deserializing model parameters from /opt/ml/model/model_algo-1
[22:40:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
#metrics {"Metrics": {"model.bind.time": {"count": 1, "max": 30.337095260620117, "sum": 30.337095260620117, "min": 30.337095260620117}}, "EndTime": 1584571200.075073, "Dimensions": {}, "StartTime": 1584571200.021286}

#metrics {"Metrics": {"model.deserialize_phase2.time": {"count": 1, "max": 53.880929946899414, "sum": 53.880929946899414, "min": 53.880929946899414}}, "EndTime": 1584571200.075195, "Dimensions": {}, "StartTime": 1584571200.075161}

[2020-03-18 22:40:00 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-03-18 22:40:00 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-03-18 22:40:00 +0000] [1] [INFO] Using worker: sync
[2020-03-18 22:40:00 +0000] [33] [INFO] Booting worker with pid: 33
[2020-03-18 22:40:00 +0000] [34] [INFO] Booting worker with pid: 34
[2020-03-18 22:40:00 +0000] [35] [INFO] Booting worker with pid: 35
[2020-03-18 22:40:00 +0000] [36] [INFO] Booting worker with pid: 36
2020-03-18T22:40:18.196:[sagemaker logs]: MaxConcurrentTransforms=10, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.06198883056640625, "sum": 0.06198883056640625, "min": 0.06198883056640625}}, "EndTime": 1584571219.106869, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571200.17764}

#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.04482269287109375, "sum": 0.04482269287109375, "min": 0.04482269287109375}}, "EndTime": 1584571219.132697, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571200.22286}

#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.03886222839355469, "sum": 0.03886222839355469, "min": 0.03886222839355469}}, "EndTime": 1584571219.139472, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571200.115628}

#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.025033950805664062, "sum": 0.025033950805664062, "min": 0.025033950805664062}}, "EndTime": 1584571219.164191, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571219.107058}

#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.01811981201171875, "sum": 0.01811981201171875, "min": 0.01811981201171875}}, "EndTime": 1584571219.173972, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571219.132838}

#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.02002716064453125, "sum": 0.02002716064453125, "min": 0.02002716064453125}}, "EndTime": 1584571219.186274, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571219.164287}

#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 0.06198883056640625, "sum": 0.06198883056640625, "min": 0.06198883056640625}}, "EndTime": 1584571219.106869, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571200.17764}

The log ends like:



[2020-03-18 22:47:00 +0000] [44] [INFO] Worker exiting (pid: 44)
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[2020-03-18 22:47:00 +0000] [48] [INFO] Booting worker with pid: 48
[2020-03-18 22:47:00 +0000] [49] [INFO] Booting worker with pid: 49
[2020-03-18 22:47:00 +0000] [50] [INFO] Booting worker with pid: 50
#metrics {"Metrics": {"invocations_error.count": {"count": 1, "max": 1, "sum": 1.0, "min": 1}}, "EndTime": 1584571620.751803, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571620.734525}

#metrics {"Metrics": {"invocations.count": {"count": 1, "max": 1, "sum": 1.0, "min": 1}}, "EndTime": 1584571620.751912, "Dimensions": {"Host": "UNKNOWN", "Operation": "scoring", "Algorithm": "DeepARModel"}, "StartTime": 1584571620.751885}

[2020-03-18 22:47:00 +0000] [44] [INFO] Worker exiting (pid: 44)
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[22:47:00] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.202184.0/RHEL5_64/generic-flavor/src/src/engine/naive_engine.cc:55: Engine shutdown
[2020-03-18 22:47:00 +0000] [48] [INFO] Booting worker with pid: 48
[2020-03-18 22:47:00 +0000] [49] [INFO] Booting worker with pid: 49
[2020-03-18 22:47:00 +0000] [50] [INFO] Booting worker with pid: 50
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-133-657e6a78db31> in <module>()
----> 1 transformer.wait()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self, logs)
    253         """Placeholder docstring"""
    254         self._ensure_last_transform_job()
--> 255         self.latest_transform_job.wait(logs=logs)
    256 
    257     def stop_transform_job(self, wait=True):

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self, logs)
    382     def wait(self, logs=True):
    383         if logs:
--> 384             self.sagemaker_session.logs_for_transform_job(self.job_name, wait=True)
    385         else:
    386             self.sagemaker_session.wait_for_transform_job(self.job_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_transform_job(self, job_name, wait, poll)
   3189 
   3190         if wait:
-> 3191             self._check_job_status(job_name, description, "TransformJobStatus")
   3192             if dot:
   3193                 print()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2613                 ),
   2614                 allowed_statuses=["Completed", "Stopped"],
-> 2615                 actual_status=status,
   2616             )
   2617 

UnexpectedStatusException: Error for Transform job allcustomersweekly-batch-2020-03-18-22-36-39-708: Failed. Reason: InternalServerError: We encountered an internal error.  Please try again.

Additional context
Test data looks like:

{"start": "2019-01-20 00:00:00", "target": [2.0, 2.0, 1.0, 0.0, 3.0, 0.0, 3.0, 2.0, 0.0, 5.0, 0.0, 2.0, 3.0, 4.0, 0.0, 1.0, 2.0, 2.0, 2.0, 0.0, 3.0, 0.0, 3.0, 1.0, 3.0, 0.0, 0.0, 2.0, 3.0, 1.0, 3.0, 0.0, 3.0, 2.0, 3.0, 1.0, 1.0, 0.0, 2.0, 0.0, 1.0, 1.0, 2.0, 3.0, 1.0, 2.0, 1.0, 2.0, 0.0, 2.0], "cat": [2002, 2]}
{"start": "2019-03-03 00:00:00", "target": [2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0], "cat": [7147, 5]}
{"start": "2019-01-06 00:00:00", "target": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 3.0, 0.0, 0.0, 3.0, 0.0, 2.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 0.0, 1.0, 0.0, 3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0, 1.0, 1.0, 2.0], "cat": [7209, 2]}

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions