Description
Describe the feature request
Summary
Add full static-shape Telum EP integration for ONNX Runtime on s390x (z16+), including robust capability gating, explicit CPU fallback behavior, and reproducible offload/performance validation.
Problem
Current Telum EP support is functional but not yet complete across real transformer-like graphs and validation workflows.
Key gaps are:
- partial NNPA function availability across hardware generations
- no clear observability into fallback when nodes are not offloaded
- no repeatable proof (offload + acceleration) suitable for upstream review without native Telum CI hardware
Proposed Feature
- Keep static-shape-first scope for Telum EP.
- Use runtime NNPA capability query as source of truth for kernel registration/partitioning.
- Default to CPU fallback (non-strict), with strict mode available for fail-fast.
- Add a reproducible benchmark and validation workflow showing:
  - node placement on Telum
  - CPU vs Telum performance deltas on deterministic static models
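To make the proposed default (CPU fallback) and strict (fail-fast) behavior concrete, here is a minimal sketch in Python. It is not the EP's implementation: `Node`, `partition`, `StrictModeError`, and the `supported_ops` set are hypothetical names used only for illustration, and `supported_ops` stands in for whatever the runtime NNPA capability query reports.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


class StrictModeError(RuntimeError):
    """Raised in strict mode when a node cannot be offloaded (hypothetical)."""


@dataclass(frozen=True)
class Node:
    name: str
    op_type: str
    static_shape: bool  # True when every input dimension is known ahead of time


def partition(
    nodes: List[Node],
    supported_ops: FrozenSet[str],
    strict: bool = False,
) -> Dict[str, str]:
    """Assign each node to "Telum" or "CPU".

    supported_ops is an assumption: it models the result of the runtime
    capability query, which is treated as the only source of truth.
    """
    placement: Dict[str, str] = {}
    for node in nodes:
        offloadable = node.static_shape and node.op_type in supported_ops
        if offloadable:
            placement[node.name] = "Telum"
        elif strict:
            # Strict mode: fail fast instead of silently falling back.
            raise StrictModeError(
                f"{node.name} ({node.op_type}) cannot be offloaded"
            )
        else:
            # Default mode: predictable CPU fallback.
            placement[node.name] = "CPU"
    return placement
```

With `strict=False`, a dynamic-shape node or an op missing from the capability set lands on CPU. With `strict=True`, the first such node raises, which is the behavior an operator would opt into when partial offload is unacceptable.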
Success Criteria
- Telum-assigned nodes are clearly reported.
- Unsupported NNPA paths fall back predictably (or fail in strict mode).
- Reproducible benchmark evidence demonstrates acceleration on Telum-capable hosts.
- Non-Telum builds remain unaffected.
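One possible shape for the reporting and benchmark evidence is sketched below in Python. The helper names `placement_summary` and `speedup` are hypothetical, the placement mapping is assumed to come from whatever node-assignment information the EP exposes, and the median is an illustrative choice of statistic, not a requirement from this proposal.

```python
from collections import Counter
from statistics import median
from typing import Dict, Sequence


def placement_summary(placement: Dict[str, str]) -> str:
    """Summarize how many nodes each provider received.

    placement maps node name -> provider label (for example "Telum" or "CPU").
    """
    counts = Counter(placement.values())
    total = len(placement)
    return "\n".join(
        f"{provider}: {counts[provider]}/{total} nodes"
        for provider in sorted(counts)
    )


def speedup(
    cpu_latencies_ms: Sequence[float],
    telum_latencies_ms: Sequence[float],
) -> float:
    """Return median CPU latency divided by median Telum latency.

    A value above 1.0 means the Telum run was faster on this model.
    """
    return median(cpu_latencies_ms) / median(telum_latencies_ms)
```

Reporting a per-provider node count alongside the latency ratio keeps the two claims (offload happened, offload helped) separately checkable.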
Non-Goals
- Dynamic-shape full offload in this phase.
- Forcing new external CI infrastructure in upstream repo.
Request For Maintainer Feedback
- Acceptable CI/testing contract when Telum hardware is unavailable in upstream CI.
Because upstream CI has no Telum hardware and does not allow third-party runners, we should test Telum capability logic through a stubbed capability interface. The EP keeps the real zDNN-backed capability path in production, but tests can inject deterministic “snapshot” capability profiles (NNPA present/absent, function bits, max-dim limits) to validate GetCapability decisions, partitioning behavior, and CPU fallback paths. This gives reliable regression coverage in standard CI without requiring real Telum execution, while preserving true hardware validation on dedicated s390x/Telum hosts. Runtime behavior on hardware remains unchanged by default.
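A minimal Python sketch of the stubbed capability interface follows, to show the testing contract rather than the actual EP code. Everything here is hypothetical: `CapabilitySnapshot`, `CapabilityProvider`, `StubCapabilityProvider`, `can_offload`, the function labels, and the `max_dim` value are illustrative placeholders, not zDNN identifiers or real hardware limits.

```python
from dataclasses import dataclass
from typing import FrozenSet, Protocol, Sequence


@dataclass(frozen=True)
class CapabilitySnapshot:
    nnpa_present: bool          # is the accelerator available at all
    functions: FrozenSet[str]   # function bits, modeled as labels
    max_dim: int                # largest allowed size of any one dimension


class CapabilityProvider(Protocol):
    """Interface shared by the production path and the test stub."""

    def query(self) -> CapabilitySnapshot: ...


class StubCapabilityProvider:
    """Test double that returns a fixed, deterministic snapshot."""

    def __init__(self, snapshot: CapabilitySnapshot) -> None:
        self._snapshot = snapshot

    def query(self) -> CapabilitySnapshot:
        return self._snapshot


def can_offload(
    provider: CapabilityProvider, function: str, shape: Sequence[int]
) -> bool:
    """Decide offload purely from the queried capabilities."""
    caps = provider.query()
    if not caps.nnpa_present:
        return False
    if function not in caps.functions:
        return False
    return all(0 < dim <= caps.max_dim for dim in shape)


# Illustrative profiles; the values are placeholders, not measured limits.
NO_NNPA = CapabilitySnapshot(False, frozenset(), 0)
PARTIAL = CapabilitySnapshot(True, frozenset({"matmul"}), 1024)
FULL = CapabilitySnapshot(True, frozenset({"matmul", "softmax"}), 1024)
```

A test in standard CI would inject `StubCapabilityProvider(PARTIAL)` and assert that a `softmax` node is rejected while a `matmul` node within the dimension limit is accepted. The production build would supply a provider backed by the real zDNN query behind the same interface.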
- Preferred PR slicing for review.
Suggested:
1. runtime/provider behavior
2. kernel/coverage expansion
3. performance/tooling
4. docs
Initial testing results:
Note: I see another change pending around s390x. I am not involved with that work, but I would be happy to review and collaborate on any merge conflicts.
Describe scenario use case
Use Case Scenario
A production inference service runs transformer-style models on IBM z16/z17 systems.
Operators need:
- deterministic static-shape inference,
- maximal Telum offload where available,
- safe CPU fallback where unavailable,
- visibility into fallback decisions,
- measurable latency/throughput improvements vs CPU-only execution.