Model submission error: type mismatch (expect float, get: double) #309

@treeforest

Description

Issue Type

Running

Have you searched for existing documents and issues?

No

OS Platform and Distribution

CentOS 7

All_in_one Version

v0.11.0b1

Kuscia Version

0.13.0b0

What happened and what you expected to happen.

## Image versions used by all-in-one

{
    "images": {
        "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/capsule-manager-sim-ubuntu20.04": "v0.1.0b0",
        "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/dataproxy": "0.3.0b0",
        "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia": "0.13.0b0",
        "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/scql": "0.9.2b1",
        "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8": "1.11.0b1",
        "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretpad": "0.12.0b0",
        "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/serving-anolis8": "0.8.0b0",
        "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/sf-tee-dm-sim": "0.1.0b0",
        "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/teeapps-sim-ubuntu20.04": "0.1.2b0"
    }
}


## Steps to reproduce
After defining the training flow in secretpad and running every node successfully, the `提交模型` (submit model) stage fails.

Log output.

## Error message returned by the secretpad web UI

The remaining no-failed party task counts 1 are less than the task success threshold 2. pending party[], running party[tckhsamm], successful party[], failed party[vrwpdnqj]


## secretpad error log on Lite node `tckhsamm`

2025-07-30 13:51:27 [] [grpc-default-executor-17] INFO  o.s.s.m.integration.job.JobManager - starter jobEvent ... type: MODIFIED
object {
  job_id: "ysuo"
  status {
    state: "Failed"
    create_time: "2025-07-30T05:51:18Z"
    start_time: "2025-07-30T05:51:18Z"
    end_time: "2025-07-30T05:51:27Z"
    tasks {
      task_id: "wbsx-model-export"
      state: "Failed"
      err_msg: "The remaining no-failed party task counts 1 are less than the task success threshold 2. pending party[], running party[tckhsamm], successful party[], failed party[vrwpdnqj]"
      create_time: "2025-07-30T05:51:18Z"
      start_time: "2025-07-30T05:51:18Z"
      end_time: "2025-07-30T05:51:27Z"
      parties {
        domain_id: "tckhsamm"
        state: "Failed"
        endpoints {
          port_name: "inference"
          scope: "Cluster"
          endpoint: "wbsx-model-export-0-inference.tckhsamm.svc"
        }
        endpoints {
          port_name: "spu"
          scope: "Cluster"
          endpoint: "wbsx-model-export-0-spu.tckhsamm.svc"
        }
        endpoints {
          port_name: "fed"
          scope: "Cluster"
          endpoint: "wbsx-model-export-0-fed.tckhsamm.svc"
        }
        endpoints {
          port_name: "global"
          scope: "Domain"
          endpoint: "wbsx-model-export-0-global.tckhsamm.svc:29599"
        }
      }
      parties {
        domain_id: "vrwpdnqj"
        state: "Failed"
        err_msg: "container[secretflow] terminated state reason \"Error\", message: \".py\\\", line 458, in result\\n    return self.__get_result()\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n    raise self._exception\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n    result = self.fn(*self.args, **self.kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 91, in _task\\n    ret = self._task_impl(*resolved)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/actor.py\\\", line 147, in _execute_impl\\n    ret = func(*args, **kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py\\\", line 65, in wrapper\\n    return method(*args, **kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow_serving_lib/graph_builder.py\\\", line 251, in build_proto\\n    libserving.graph_validator_impl(graph_def_str)\\nRuntimeError: what: \\n\\t[Enforce fail at secretflow_serving/util/arrow_helper.cc:384] src_f->type()->id() == dst_f->type()->id(). edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. 
field: bmi type not match, expect: float, get: double\\n\\n2025-07-30 05:51:26.184 ERROR global_context.py:183 [vrwpdnqj] -- FedLocalError on seq id 125\\nTraceback (most recent call last):\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/global_context.py\\\", line 177, in _send_check\\n    send_future.result()\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n    return self.__get_result()\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n    raise self._exception\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n    result = self.fn(*self.args, **self.kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/global_context.py\\\", line 170, in _send\\n    raise local_err\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/global_context.py\\\", line 158, in _send\\n    obj = fed_obj.get_object()\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 117, in get_object\\n    self._object = self._object.get()\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 31, in get\\n    future_result = self.future.result(timeout=2)\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n    return self.__get_result()\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n    raise self._exception\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n    result = self.fn(*self.args, **self.kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 88, in _task\\n    resolved = self.resolve_dependencies(*args, **kwargs)\\n  File 
\\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 68, in resolve_dependencies\\n    resolved.append(arg.get_object())\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 117, in get_object\\n    self._object = self._object.get()\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 31, in get\\n    future_result = self.future.result(timeout=2)\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n    return self.__get_result()\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n    raise self._exception\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n    result = self.fn(*self.args, **self.kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 88, in _task\\n    resolved = self.resolve_dependencies(*args, **kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 68, in resolve_dependencies\\n    resolved.append(arg.get_object())\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 117, in get_object\\n    self._object = self._object.get()\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 31, in get\\n    future_result = self.future.result(timeout=2)\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n    return self.__get_result()\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n    raise self._exception\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n    result = self.fn(*self.args, **self.kwargs)\\n  File 
\\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 88, in _task\\n    resolved = self.resolve_dependencies(*args, **kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 68, in resolve_dependencies\\n    resolved.append(arg.get_object())\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 117, in get_object\\n    self._object = self._object.get()\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/object.py\\\", line 31, in get\\n    future_result = self.future.result(timeout=2)\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 458, in result\\n    return self.__get_result()\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/_base.py\\\", line 403, in __get_result\\n    raise self._exception\\n  File \\\"/usr/local/lib/python3.10/concurrent/futures/thread.py\\\", line 58, in run\\n    result = self.fn(*self.args, **self.kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/call_holder.py\\\", line 91, in _task\\n    ret = self._task_impl(*resolved)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/distributed/fed/actor.py\\\", line 147, in _execute_impl\\n    ret = func(*args, **kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py\\\", line 65, in wrapper\\n    return method(*args, **kwargs)\\n  File \\\"/usr/local/lib/python3.10/site-packages/secretflow_serving_lib/graph_builder.py\\\", line 251, in build_proto\\n    libserving.graph_validator_impl(graph_def_str)\\nsecretflow.distributed.fed.exception.FedLocalError: what: \\n\\t[Enforce fail at secretflow_serving/util/arrow_helper.cc:384] src_f->type()->id() == dst_f->type()->id(). edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. 
field: bmi type not match, expect: float, get: double\\n\\n2025-07-30 05:51:26.185 WARNING global_context.py:133 [vrwpdnqj] -- Signal SIGINT to exit.\\n2025-07-30 05:51:26.185 WARNING api.py:116 [vrwpdnqj] -- Stop signal received (e.g. via SIGINT/Ctrl+C), try to shutdown fed. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.\\n2025-07-30 05:51:26.185 INFO api.py:214 [vrwpdnqj] -- Shutdowning unintendedly, wait_for_sending True\\n2025-07-30 05:51:26.185 INFO global_context.py:214 [vrwpdnqj] -- Try stop context, wait_for_sending True, on_error True\\n2025-07-30 05:51:26.185 INFO global_context.py:218 [vrwpdnqj] -- task_executor stopped\\n2025-07-30 05:51:26.185 INFO global_context.py:220 [vrwpdnqj] -- recv_executor stopped\\n2025-07-30 05:51:26.186 INFO global_context.py:224 [vrwpdnqj] -- send_executor stopped\\n2025-07-30 05:51:26.186 INFO global_context.py:237 [vrwpdnqj] -- Context stopped\\n2025-07-30 05:51:26.283 [warning] [channel.h:~Channel:163] Channel destructor is called before WaitLinkTaskFinish, try stop send thread\\n2025-07-30 05:51:26.284 ERROR brpc_link.py:150 [vrwpdnqj] -- Receiving exception: <class \'secretflow.distributed.fed.exception.FedRemoteError\'>, FedRemoteError occurred at tckhsamm caused by what: \\n\\t[Enforce fail at secretflow_serving/util/arrow_helper.cc:384] src_f->type()->id() == dst_f->type()->id(). edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. field: age type not match, expect: float, get: double\\n from tckhsamm, seq id 124. Re-raise it.\\n2025-07-30 05:51:26.284 INFO api.py:222 [vrwpdnqj] -- Shutdowned\\n2025-07-30 05:51:26.285 CRITICAL api.py:225 [vrwpdnqj] -- Exit now due to the previous error.\\n\""
        endpoints {
          port_name: "fed"
          scope: "Cluster"
          endpoint: "wbsx-model-export-0-fed.vrwpdnqj.svc"
        }
        endpoints {
          port_name: "global"
          scope: "Domain"
          endpoint: "wbsx-model-export-0-global.vrwpdnqj.svc:32277"
        }
        endpoints {
          port_name: "inference"
          scope: "Cluster"
          endpoint: "wbsx-model-export-0-inference.vrwpdnqj.svc"
        }
        endpoints {
          port_name: "spu"
          scope: "Cluster"
          endpoint: "wbsx-model-export-0-spu.vrwpdnqj.svc"
        }
      }
      alias: "wbsx-model-export"
    }
    stage_status_list {
      domain_id: "tckhsamm"
      state: "JobCreateStageSucceeded"
    }
    stage_status_list {
      domain_id: "vrwpdnqj"
      state: "JobCreateStageSucceeded"
    }
    approve_status_list {
      domain_id: "tckhsamm"
      state: "JobAccepted"
    }
    approve_status_list {
      domain_id: "vrwpdnqj"
      state: "JobAccepted"
    }
  }
}
,nodeId=vrwpdnqj
2025-07-30 13:51:27 [] [grpc-default-executor-17] INFO  o.s.s.m.integration.job.JobManager - watched jobEvent: jobId=ysuo, jobState=Failed, task=[taskId=wbsx-model-export,alias=wbsx-model-export,state=Failed], endTime=2025-07-30T05:51:27Z
2025-07-30 13:51:27 [] [grpc-default-executor-17] INFO  o.s.s.m.integration.job.JobManager - watched jobEvent: jobId=ysuo, but project job not exist, skip
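
The schema check failure is buried inside the long `err_msg` string above, and it names a different field on each party (`bmi` on `vrwpdnqj`, `age` in the remote error from `tckhsamm`). As a rough aid for triage, the sketch below pulls every such mismatch out of a saved log. It is a minimal sketch that assumes the message keeps the exact wording seen in this log; the helper name `find_type_mismatches` and the `sample` text are hypothetical and not part of secretflow or secretflow-serving.

```python
import re

# Assumption: the serving-side check always logs in the form seen in this issue:
#   "edge schema check failed, src: <node>, dst: <node>. field: <name> type not match,
#    expect: <type>, get: <type>"
# \s* tolerates the space or line break that appears before "field:" in the log.
MISMATCH_RE = re.compile(
    r"edge schema check failed, src: (?P<src>\w+), dst: (?P<dst>\w+)\.\s*"
    r"field: (?P<field>\w+) type not match, expect: (?P<expect>\w+), get: (?P<get>\w+)"
)


def find_type_mismatches(log_text):
    """Return unique (src, dst, field, expect, get) tuples in order of first appearance."""
    seen = []
    for m in MISMATCH_RE.finditer(log_text):
        item = (m["src"], m["dst"], m["field"], m["expect"], m["get"])
        if item not in seen:  # the same failure is repeated several times in the log
            seen.append(item)
    return seen


# Hypothetical sample assembled from the two distinct messages in the log above.
sample = (
    "edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. \n"
    "field: bmi type not match, expect: float, get: double\n"
    "edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. "
    "field: age type not match, expect: float, get: double\n"
    "edge schema check failed, src: onehot_encode_1, dst: feature_calculate_2. \n"
    "field: bmi type not match, expect: float, get: double\n"
)

if __name__ == "__main__":
    for src, dst, field, expect, get in find_type_mismatches(sample):
        print(f"{src} -> {dst}: field {field!r} expected {expect}, got {get}")
```

In Arrow type naming, `float` is the 32-bit type and `double` is the 64-bit type, so both reported fields reach the `onehot_encode_1` to `feature_calculate_2` edge with a different floating-point width than the one expected there.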
