Check for app failure before updating task infos #464

@hungj

Description

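If the ApplicationMaster (AM) fails, TonyClient hangs while it retries connecting to the AM. In the log below it runs through the full retry policy (maxRetries=50, 1 second apart, so roughly 50 seconds) before giving up with a ConnectException from updateTaskInfos. We should fail faster here by checking whether the application has already failed before updating task infos.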

14-09-2020 15:15:07 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:07 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 44 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:08 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:08 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 45 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:09 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:09 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 46 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:10 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:10 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 47 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:11 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:11 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:12 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:12 FATAL TonyClient:985 - Failed to run TonyClient
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - java.net.ConnectException: Call From ltx1-hcl6554.grid.linkedin.com/10.150.121.188 to ltx1-hcl3578.grid.linkedin.com:31852 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:754)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1547)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.call(Client.java:1489)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.call(Client.java:1388)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.sun.proxy.$Proxy20.getTaskInfos(Unknown Source)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.rpc.impl.pb.client.TensorFlowClusterPBClientImpl.getTaskInfos(TensorFlowClusterPBClientImpl.java:75)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at java.lang.reflect.Method.invoke(Method.java:498)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.sun.proxy.$Proxy21.getTaskInfos(Unknown Source)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.rpc.impl.ApplicationRpcClient.getTaskInfos(ApplicationRpcClient.java:81)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.updateTaskInfos(TonyClient.java:895)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.monitorApplication(TonyClient.java:851)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.run(TonyClient.java:185)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.start(TonyClient.java:983)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.main(TonyClient.java:1097)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - Caused by: java.net.ConnectException: Connection refused
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:701)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:808)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1604)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.call(Client.java:1435)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	... 21 more
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:12 ERROR TonyClient:992 - Application failed to complete successfully
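
One possible shape for the check is sketched below. This is a minimal, self-contained illustration and not TonY's actual code: `FailFastMonitor`, `AppStateSource`, `TaskInfoSource`, `AppState`, and `PollResult` are all hypothetical names standing in for the client's application-status lookup and the AM RPC. The idea is that each monitoring iteration first asks for the application's state and only calls the AM for task infos when the application is still running, so a dead AM never triggers the RPC retry loop.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: check the application's state before calling the AM for task infos. */
final class FailFastMonitor {

    /** Hypothetical application states; a real client would use the cluster's own states. */
    enum AppState { RUNNING, FINISHED, FAILED, KILLED }

    /** Hypothetical stand-in for looking up the application's current state. */
    interface AppStateSource {
        AppState currentState();
    }

    /** Hypothetical stand-in for the AM RPC that returns task infos. */
    interface TaskInfoSource {
        List<String> getTaskInfos();
    }

    /** Outcome of one monitoring iteration. */
    static final class PollResult {
        final AppState state;
        final List<String> taskInfos;
        final boolean amContacted;

        PollResult(AppState state, List<String> taskInfos, boolean amContacted) {
            this.state = state;
            this.taskInfos = taskInfos;
            this.amContacted = amContacted;
        }
    }

    /** Checks the application state first; skips the AM call unless the app is RUNNING. */
    static PollResult pollOnce(AppStateSource app, TaskInfoSource am) {
        AppState state = app.currentState();
        if (state != AppState.RUNNING) {
            // The app is no longer running, so the AM may be gone: do not attempt the RPC.
            return new PollResult(state, Collections.emptyList(), false);
        }
        return new PollResult(state, am.getTaskInfos(), true);
    }

    public static void main(String[] args) {
        AtomicInteger rpcCalls = new AtomicInteger();
        TaskInfoSource am = () -> {
            rpcCalls.incrementAndGet();
            return Collections.singletonList("worker:0");
        };

        PollResult failed = pollOnce(() -> AppState.FAILED, am);
        System.out.println(failed.state + " amContacted=" + failed.amContacted);

        PollResult running = pollOnce(() -> AppState.RUNNING, am);
        System.out.println(running.state + " taskInfos=" + running.taskInfos);
        System.out.println("AM RPC calls: " + rpcCalls.get());
    }
}
```

In a real YARN client the state would presumably come from the application report rather than a lambda, and the monitoring loop would exit as soon as `pollOnce` returns a non-running state. Those details are left out here because they depend on how TonyClient's monitoring loop is structured.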
