Closed
Description
Issue Type
Running
Have you searched for existing documents and issues?
No
OS Platform and Distribution
CentOS 7
All_in_one Version
v0.11.0b1
Kuscia Version
0.13.0b0
What happened and what you expected to happen.
## Image versions used by all-in-one
{
  "images": {
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/capsule-manager-sim-ubuntu20.04": "v0.1.0b0",
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/dataproxy": "0.3.0b0",
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia": "0.13.0b0",
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/scql": "0.9.2b1",
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8": "1.11.0b1",
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretpad": "0.12.0b0",
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/serving-anolis8": "0.8.0b0",
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/sf-tee-dm-sim": "0.1.0b0",
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/teeapps-sim-ubuntu20.04": "0.1.2b0"
  }
}
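## Steps to reproduce
After defining the training pipeline in SecretPad and successfully running all of its nodes, an error occurs at the `提交模型` (Submit Model) stage.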
Log output.
## Error message returned by the SecretPad web UI
The remaining no-failed party task counts 1 are less than the task success threshold 2. pending party[], running party[tckhsamm], successful party[], failed party[vrwpdnqj]
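To make it easier to see which party failed first, the message can be split into its per-state party lists. This is only a minimal sketch, assuming the message wording shown above stays stable; `parse_party_status` is a hypothetical helper written for illustration and is not part of Kuscia or SecretPad.

```python
import re

# Error message copied verbatim from the SecretPad web UI.
MSG = (
    "The remaining no-failed party task counts 1 are less than the task "
    "success threshold 2. pending party[], running party[tckhsamm], "
    "successful party[], failed party[vrwpdnqj]"
)


def parse_party_status(message):
    """Map each state name (pending, running, ...) to a list of domain IDs."""
    status = {}
    # Matches fragments such as "failed party[vrwpdnqj]" or "pending party[]".
    for state, ids in re.findall(r"(\w+) party\[([^\]]*)\]", message):
        status[state] = [i.strip() for i in ids.split(",") if i.strip()]
    return status


status = parse_party_status(MSG)
for state, domains in status.items():
    print(f"{state}: {domains}")
```

Here only `vrwpdnqj` is listed as failed while `tckhsamm` is still running, so the container log of `vrwpdnqj` is the first place to look.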
## SecretPad error log on Lite node `tckhsamm`
2025-07-30 13:51:27 [] [grpc-default-executor-17] INFO o.s.s.m.integration.job.JobManager - starter jobEvent ... type: MODIFIED
object {
job_id: "ysuo"
status {
state: "Failed"
create_time: "2025-07-30T05:51:18Z"
start_time: "2025-07-30T05:51:18Z"
end_time: "2025-07-30T05:51:27Z"
tasks {
task_id: "wbsx-model-export"
state: "Failed"
err_msg: "The remaining no-failed party task counts 1 are less than the task success threshold 2. pending party[], running party[tckhsamm], successful party[], failed party[vrwpdnqj]"
create_time: "2025-07-30T05:51:18Z"
start_time: "2025-07-30T05:51:18Z"
end_time: "2025-07-30T05:51:27Z"
parties {
domain_id: "tckhsamm"
state: "Failed"
endpoints {
port_name: "inference"
scope: "Cluster"
endpoint: "wbsx-model-export-0-inference.tckhsamm.svc"
}
endpoints {
port_name: "spu"
scope: "Cluster"
endpoint: "wbsx-model-export-0-spu.tckhsamm.svc"
}
endpoints {
port_name: "fed"
scope: "Cluster"
endpoint: "wbsx-model-export-0-fed.tckhsamm.svc"
}
endpoints {
port_name: "global"
scope: "Domain"
endpoint: "wbsx-model-export-0-global.tckhsamm.svc:29599"
}
}
parties {
domain_id: "vrwpdnqj"
state: "Failed"
err_msg: "container[secretflow] terminated state reason \"Error\", message: \".py\\\", line 458, in result\\n return self.__get_result()\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n raise self._exception\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n result = self.fn(*self.args, **self.kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 91, in _task\\n ret = self._task_impl(*resolved)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/actor.py\\\", line 147, in _execute_impl\\n ret = func(*args, **kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py\\\", line 65, in wrapper\\n return method(*args, **kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow_serving_lib/graph_builder.py\\\", line 251, in build_proto\\n libserving.graph_validator_impl(graph_def_str)\\nRuntimeError: what: \\n\\t[Enforce fail at secretflow_serving/util/arrow_helper.cc:384] src_f->type()->id() == dst_f->type()->id(). edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. 
field: bmi type not match, expect: float, get: double\\n\\n2025-07-30 05:51:26.184 ERROR global_context.py:183 [vrwpdnqj] -- FedLocalError on seq id 125\\nTraceback (most recent call last):\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/global_context.py\\\", line 177, in _send_check\\n send_future.result()\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n return self.__get_result()\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n raise self._exception\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n result = self.fn(*self.args, **self.kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/global_context.py\\\", line 170, in _send\\n raise local_err\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/global_context.py\\\", line 158, in _send\\n obj = fed_obj.get_object()\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 117, in get_object\\n self._object = self._object.get()\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 31, in get\\n future_result = self.future.result(timeout=2)\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n return self.__get_result()\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n raise self._exception\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n result = self.fn(*self.args, **self.kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 88, in _task\\n resolved = self.resolve_dependencies(*args, **kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 68, in 
resolve_dependencies\\n resolved.append(arg.get_object())\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 117, in get_object\\n self._object = self._object.get()\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 31, in get\\n future_result = self.future.result(timeout=2)\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n return self.__get_result()\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n raise self._exception\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n result = self.fn(*self.args, **self.kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 88, in _task\\n resolved = self.resolve_dependencies(*args, **kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 68, in resolve_dependencies\\n resolved.append(arg.get_object())\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 117, in get_object\\n self._object = self._object.get()\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 31, in get\\n future_result = self.future.result(timeout=2)\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n return self.__get_result()\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n raise self._exception\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n result = self.fn(*self.args, **self.kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 88, in _task\\n resolved = self.resolve_dependencies(*args, **kwargs)\\n File 
\\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 68, in resolve_dependencies\\n resolved.append(arg.get_object())\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 117, in get_object\\n self._object = self._object.get()\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 31, in get\\n future_result = self.future.result(timeout=2)\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n return self.__get_result()\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n raise self._exception\\n File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n result = self.fn(*self.args, **self.kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 91, in _task\\n ret = self._task_impl(*resolved)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/actor.py\\\", line 147, in _execute_impl\\n ret = func(*args, **kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py\\\", line 65, in wrapper\\n return method(*args, **kwargs)\\n File \\\"/usr/local/lib/python3.10/site-packages/secretflow_serving_lib/graph_builder.py\\\", line 251, in build_proto\\n libserving.graph_validator_impl(graph_def_str)\\nsecretflow.distributed.fed.exception.FedLocalError: what: \\n\\t[Enforce fail at secretflow_serving/util/arrow_helper.cc:384] src_f->type()->id() == dst_f->type()->id(). edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. field: bmi type not match, expect: float, get: double\\n\\n2025-07-30 05:51:26.185 WARNING global_context.py:133 [vrwpdnqj] -- Signal SIGINT to exit.\\n2025-07-30 05:51:26.185 WARNING api.py:116 [vrwpdnqj] -- Stop signal received (e.g. via SIGINT/Ctrl+C), try to shutdown fed. 
Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.\\n2025-07-30 05:51:26.185 INFO api.py:214 [vrwpdnqj] -- Shutdowning unintendedly, wait_for_sending True\\n2025-07-30 05:51:26.185 INFO global_context.py:214 [vrwpdnqj] -- Try stop context, wait_for_sending True, on_error True\\n2025-07-30 05:51:26.185 INFO global_context.py:218 [vrwpdnqj] -- task_executor stopped\\n2025-07-30 05:51:26.185 INFO global_context.py:220 [vrwpdnqj] -- recv_executor stopped\\n2025-07-30 05:51:26.186 INFO global_context.py:224 [vrwpdnqj] -- send_executor stopped\\n2025-07-30 05:51:26.186 INFO global_context.py:237 [vrwpdnqj] -- Context stopped\\n2025-07-30 05:51:26.283 [warning] [channel.h:~Channel:163] Channel destructor is called before WaitLinkTaskFinish, try stop send thread\\n2025-07-30 05:51:26.284 ERROR brpc_link.py:150 [vrwpdnqj] -- Receiving exception: <class \'secretflow.distributed.fed.exception.FedRemoteError\'>, FedRemoteError occurred at tckhsamm caused by what: \\n\\t[Enforce fail at secretflow_serving/util/arrow_helper.cc:384] src_f->type()->id() == dst_f->type()->id(). edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. field: age type not match, expect: float, get: double\\n from tckhsamm, seq id 124. Re-raise it.\\n2025-07-30 05:51:26.284 INFO api.py:222 [vrwpdnqj] -- Shutdowned\\n2025-07-30 05:51:26.285 CRITICAL api.py:225 [vrwpdnqj] -- Exit now due to the previous error.\\n\""
endpoints {
port_name: "fed"
scope: "Cluster"
endpoint: "wbsx-model-export-0-fed.vrwpdnqj.svc"
}
endpoints {
port_name: "global"
scope: "Domain"
endpoint: "wbsx-model-export-0-global.vrwpdnqj.svc:32277"
}
endpoints {
port_name: "inference"
scope: "Cluster"
endpoint: "wbsx-model-export-0-inference.vrwpdnqj.svc"
}
endpoints {
port_name: "spu"
scope: "Cluster"
endpoint: "wbsx-model-export-0-spu.vrwpdnqj.svc"
}
}
alias: "wbsx-model-export"
}
stage_status_list {
domain_id: "tckhsamm"
state: "JobCreateStageSucceeded"
}
stage_status_list {
domain_id: "vrwpdnqj"
state: "JobCreateStageSucceeded"
}
approve_status_list {
domain_id: "tckhsamm"
state: "JobAccepted"
}
approve_status_list {
domain_id: "vrwpdnqj"
state: "JobAccepted"
}
}
}
,nodeId=vrwpdnqj
2025-07-30 13:51:27 [] [grpc-default-executor-17] INFO o.s.s.m.integration.job.JobManager - watched jobEvent: jobId=ysuo, jobState=Failed, task=[taskId=wbsx-model-export,alias=wbsx-model-export,state=Failed], endTime=2025-07-30T05:51:27Z
2025-07-30 13:51:27 [] [grpc-default-executor-17] INFO o.s.s.m.integration.job.JobManager - watched jobEvent: jobId=ysuo, but project job not exist, skip
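The repeated stack frames in `err_msg` hide the actual failure, which is the `edge schema check failed` line. The sketch below pulls those lines out of the log text. It assumes the wording of that message is stable; `find_schema_mismatches` and `LOG_EXCERPT` are hypothetical names used for illustration, and the excerpt is a shortened copy of the log above with the stack frames omitted.

```python
import re

# Shortened excerpt of the err_msg above (stack frames omitted).
LOG_EXCERPT = (
    "edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. "
    "field: bmi type not match, expect: float, get: double\n"
    "FedRemoteError occurred at tckhsamm caused by what: "
    "edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. "
    "field: age type not match, expect: float, get: double\n"
)

PATTERN = re.compile(
    r"edge schema check failed, src: (?P<src>\w+), dst: (?P<dst>\w+)\.\s+"
    r"field: (?P<field>\w+) type not match, "
    r"expect: (?P<expect>\w+), get: (?P<got>\w+)"
)


def find_schema_mismatches(log_text):
    """Return unique mismatch records in order of first appearance."""
    seen, result = set(), []
    for match in PATTERN.finditer(log_text):
        record = match.groupdict()
        key = tuple(record.values())
        if key not in seen:  # the same error is logged several times
            seen.add(key)
            result.append(record)
    return result


mismatches = find_schema_mismatches(LOG_EXCERPT)
for item in mismatches:
    print(
        f"{item['src']} -> {item['dst']}: {item['field']} "
        f"expected {item['expect']}, got {item['got']}"
    )
```

In this log both parties report the same edge, `onehot_encode_1` to `feature_calculate_2`: `bmi` on `vrwpdnqj` and `age` on `tckhsamm` arrive as double where float is expected.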