[Feature Request] Add full static-shape Telum EP integration for ONNX Runtime on s390x #27315

@k8ika0s

Describe the feature request

Summary

Add full static-shape Telum EP integration for ONNX Runtime on s390x (z16+), including robust capability gating, explicit CPU fallback behavior, and reproducible offload/performance validation.

Problem

Current Telum EP support is functional but not yet complete across real transformer-like graphs and validation workflows.
Key gaps are:

  • partial NNPA function availability across hardware generations
  • no clear observability of fallback when nodes are not offloaded
  • no repeatable proof (offload + acceleration) suitable for upstream review without native Telum CI hardware

Proposed Feature

  1. Keep static-shape-first scope for Telum EP.
  2. Use runtime NNPA capability query as source of truth for kernel registration/partitioning.
  3. Default to CPU fallback (non-strict), with strict mode available for fail-fast.
  4. Add reproducible benchmark and validation workflow showing:
    • node placement on Telum
    • CPU vs Telum performance deltas on deterministic static models

Success Criteria

  • Telum-assigned nodes are clearly reported.
  • Unsupported NNPA paths fall back predictably (or fail in strict mode).
  • Reproducible benchmark evidence demonstrates acceleration on Telum-capable hosts.
  • Non-Telum builds remain unaffected.
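For the benchmark criterion, a sketch of a small timing harness is shown below. It is an assumption about how the evidence could be gathered, not the tooling proposed in this issue: `measure` and `speedup` are hypothetical helpers, and `run` is any zero-argument callable wrapping one inference call on a fixed static-shape input.

```python
import statistics
import time


def measure(run, warmup=3, iterations=20):
    """Return per-iteration wall-clock latencies (seconds) for a callable."""
    for _ in range(warmup):
        run()  # discard warm-up iterations (allocation, caching effects)
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return samples


def speedup(baseline_samples, candidate_samples):
    """Ratio of median latencies; values above 1.0 favor the candidate."""
    return statistics.median(baseline_samples) / statistics.median(candidate_samples)
```

Using the median rather than the mean keeps a single slow iteration from dominating the reported delta.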

Non-Goals

  • Dynamic-shape full offload in this phase.
  • Forcing new external CI infrastructure in upstream repo.

Request For Maintainer Feedback

  • Acceptable CI/testing contract when Telum hardware is unavailable in upstream CI.

    Because upstream CI has no Telum hardware and does not allow third-party runners, we should test Telum capability logic through a stubbed capability interface. The EP keeps the real zDNN-backed capability path in production, but tests can inject deterministic “snapshot” capability profiles (NNPA present/absent, function bits, max-dim limits) to validate GetCapability decisions, partitioning behavior, and CPU fallback paths. This gives reliable regression coverage in standard CI without requiring real Telum execution, while preserving true hardware validation on dedicated s390x/Telum hosts. Runtime behavior on hardware remains unchanged by default.
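    The stubbed capability interface described above could look roughly like the following sketch, shown in Python for brevity. All names (`CapabilitySnapshot`, `CapabilitySource`, `SnapshotSource`, `can_offload`) and the profile values are hypothetical illustrations, not the EP's actual types or real hardware limits.

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical types; the real EP would back CapabilitySource with zDNN.


@dataclass(frozen=True)
class CapabilitySnapshot:
    nnpa_present: bool
    functions: frozenset  # names of NNPA functions reported as available
    max_dim: int          # largest accepted size for any single dimension


class CapabilitySource(Protocol):
    def query(self) -> CapabilitySnapshot: ...


class SnapshotSource:
    """Test double that always returns a fixed, injected profile."""

    def __init__(self, snapshot: CapabilitySnapshot):
        self._snapshot = snapshot

    def query(self) -> CapabilitySnapshot:
        return self._snapshot


def can_offload(source: CapabilitySource, function: str, dims) -> bool:
    """Decide offload eligibility purely from the queried capabilities."""
    caps = source.query()
    return (
        caps.nnpa_present
        and function in caps.functions
        and all(0 < d <= caps.max_dim for d in dims)
    )


# Illustrative profiles only; the values are assumptions for testing.
NO_NNPA = CapabilitySnapshot(False, frozenset(), 0)
PARTIAL = CapabilitySnapshot(True, frozenset({"MATMUL", "ADD"}), 32768)

if __name__ == "__main__":
    print(can_offload(SnapshotSource(PARTIAL), "MATMUL", (1, 128, 768)))
    print(can_offload(SnapshotSource(NO_NNPA), "MATMUL", (1, 128, 768)))
```

    Because the decision logic only sees the `query()` result, the same code path runs in CI with injected snapshots and on Telum hosts with the hardware-backed source.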

  • Preferred PR slicing for review.

    Suggested:
    1: runtime/provider behavior
    2: kernel/coverage expansion
    3: performance/tooling
    4: docs

Initial testing results:

Telum_EP_Perf_Validation.md

Note: I see another pending change around s390x. I am not involved with that work, but I would be happy to review it and help resolve any merge conflicts.

Describe scenario use case

Use Case Scenario

A production inference service runs transformer-style models on IBM z16/z17 systems.
Operators need:

  • deterministic static-shape inference,
  • maximal Telum offload where available,
  • safe CPU fallback where unavailable,
  • visibility into fallback decisions,
  • measurable latency/throughput improvements vs CPU-only execution.

Labels: feature request
