Replies: 2 comments
@IsaacYangSLA, can you help comment on this? Thanks in advance. I will also try this when I get time.
Sorry, I did not respond to this earlier and only just noticed your questions.
Hello, I am trying to orchestrate everything in a Kubernetes environment on two instances in different networks. I generated the provision file with the HA setting and the Helm builder so that I could deploy with the Helm chart.
To give a brief overview of my deployment: I used Netmaker to create a private network and joined both instances to it, so the instances can communicate via their Netmaker interface IPs. I created the Kubernetes cluster with kubeadm and set node-ip to the private Netmaker IP in the kubelet arguments on both instances. I used the Calico CNI for pod networking, and all pods are running and ready. I added the ingress-nginx controller to expose the pod ports for the FL server by updating the ConfigMap and DaemonSet sections of the YAML file, as described in the NVFlare Helm chart documentation (https://nvflare.readthedocs.io/en/latest/user_guide/helm_chart.html). I then used Helm to install the NVFlare server on Kubernetes, which created three pods (Server1, Server2, and Overseer), all of which are running and ready.
The NVFlare server deployment was successful and I was able to log in to the admin console, but I ran into a problem when starting the client sites (site-1 and site-2). The site logs show the following error:
Cell - INFO - site-1: created backbone external connector to grpc://server2:8102
2023-04-25 12:17:22,020 - ConnectorManager - INFO - 1227537: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-04-25 12:17:22,020 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:26496] is starting
2023-04-25 12:17:22,521 - Cell - INFO - site-1: created backbone internal listener for tcp://localhost:26496
2023-04-25 12:17:22,521 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE grpc://server2:8102] is starting
2023-04-25 12:17:22,522 - FederatedClient - INFO - Wait for engine to be created.
2023-04-25 12:17:30,328 - nvflare.fuel.f3.sfm.conn_manager - INFO - Retrying [CH00001 ACTIVE grpc://server2:8102] in 8 seconds
2023-04-25 12:17:38,535 - nvflare.fuel.f3.sfm.conn_manager - INFO - Retrying [CH00001 ACTIVE grpc://server2:8102] in 16 seconds
2023-04-25 12:17:53,051 - MPM - ERROR - main_func execute exception: Login failed.
2023-04-25 12:17:53,052 - MPM - ERROR - Traceback (most recent call last):
  File "/home/kubeflare/.local/lib/python3.10/site-packages/nvflare/fuel/f3/mpm.py", line 144, in run
    rc = main_func()
  File "/home/kubeflare/.local/lib/python3.10/site-packages/nvflare/private/fed/app/client/client_train.py", line 120, in main
    raise RuntimeError("Login failed.")
RuntimeError: Login failed.
2023-04-25 12:17:55,254 - MPM - INFO - MPM: Good Bye!
I have reviewed a GitHub discussion, #1130 (reply in thread), which suggests that this error could be related to the TLS settings. I would greatly appreciate your guidance on how to resolve this issue.
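Since the log shows the client retrying grpc://server2:8102 before the login fails, it may help to first rule out plain connectivity problems from the client machine. The following is a minimal sketch using only the Python standard library; the helper name `check_endpoint` and the result strings are hypothetical, and it only tests DNS resolution and a raw TCP connect, not gRPC or TLS.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 5.0) -> str:
    """Return 'ok', 'dns_error', or 'connect_error' for a host:port pair.

    Hypothetical helper for illustration; it is not part of NVFlare.
    """
    try:
        # Resolve the name first so DNS failures are reported separately.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_error"

    # Try every resolved address until one accepts a TCP connection.
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            continue
    return "connect_error"
```

Calling `check_endpoint("server2", 8102)` on the client host (host and port taken from the log above) gives a rough first split: `dns_error` would suggest that the name `server2` does not resolve on that machine, `connect_error` that the port is not reachable through the ingress, and `ok` that the TCP path works, in which case the TLS settings mentioned in #1130 would be the next thing to look at.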