Open
Labels
good first issue (Good for newcomers)
Description
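If the ApplicationMaster (AM) fails, TonyClient hangs for a while, retrying its connection to the AM. In the log below, the Hadoop IPC client works through its retry policy, RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS), which amounts to roughly 50 seconds of retries, before TonyClient fails with java.net.ConnectException: Connection refused. We should fail faster here.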
14-09-2020 15:15:07 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:07 INFO Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 44 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:08 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:08 INFO Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 45 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:09 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:09 INFO Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 46 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:10 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:10 INFO Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 47 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:11 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:11 INFO Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:12 INFO Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:12 FATAL TonyClient:985 - Failed to run TonyClient
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - java.net.ConnectException: Call From ltx1-hcl6554.grid.linkedin.com/10.150.121.188 to ltx1-hcl3578.grid.linkedin.com:31852 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:754)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1547)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.Client.call(Client.java:1489)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.Client.call(Client.java:1388)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at com.sun.proxy.$Proxy20.getTaskInfos(Unknown Source)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at com.linkedin.tony.rpc.impl.pb.client.TensorFlowClusterPBClientImpl.getTaskInfos(TensorFlowClusterPBClientImpl.java:75)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at java.lang.reflect.Method.invoke(Method.java:498)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at com.sun.proxy.$Proxy21.getTaskInfos(Unknown Source)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at com.linkedin.tony.rpc.impl.ApplicationRpcClient.getTaskInfos(ApplicationRpcClient.java:81)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at com.linkedin.tony.TonyClient.updateTaskInfos(TonyClient.java:895)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at com.linkedin.tony.TonyClient.monitorApplication(TonyClient.java:851)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at com.linkedin.tony.TonyClient.run(TonyClient.java:185)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at com.linkedin.tony.TonyClient.start(TonyClient.java:983)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at com.linkedin.tony.TonyClient.main(TonyClient.java:1097)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - Caused by: java.net.ConnectException: Connection refused
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:701)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:808)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.Client.getConnection(Client.java:1604)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - at org.apache.hadoop.ipc.Client.call(Client.java:1435)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - ... 21 more
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:12 ERROR TonyClient:992 - Application failed to complete successfully
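One possible direction, as a minimal sketch and not TonY's actual code: wrap the RPC in a retry loop that stops as soon as the AM is known to be gone, instead of exhausting the full retry budget. The class `FailFastRpc`, the method `call`, and the `amStillRunning` hook are hypothetical names introduced for illustration. In a real client the hook might be backed by the YARN application report, but that wiring is an assumption and is not shown here.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: bounded retries that abort early once the AM is known to be dead. */
public class FailFastRpc {

    /**
     * Invokes rpc, retrying on ConnectException with a fixed sleep between attempts.
     * Gives up when maxRetries is exhausted or when amStillRunning reports false.
     */
    public static <T> T call(Callable<T> rpc, BooleanSupplier amStillRunning,
                             int maxRetries, long sleepMillis) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        for (int attempt = 0; ; attempt++) {
            try {
                return rpc.call();
            } catch (ConnectException e) {
                // Fail fast: no point retrying if the AM has already exited.
                if (attempt >= maxRetries || !amStillRunning.getAsBoolean()) {
                    throw e;
                }
                Thread.sleep(sleepMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        try {
            // Simulated RPC that always gets "Connection refused"; the AM is reported as dead.
            FailFastRpc.<Void>call(
                () -> { attempts[0]++; throw new ConnectException("Connection refused"); },
                () -> false,
                50,
                1000);
        } catch (ConnectException e) {
            System.out.println("Gave up after " + attempts[0] + " attempt(s): " + e.getMessage());
        }
    }
}
```

With this shape, the retry policy seen in the log (50 retries, 1000 ms apart) would still apply while the AM is alive, but a dead AM would surface after the first refused connection.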