feat: add native CPU kernel for SequenceMap (opset 17) #28813
Rishi-Dave wants to merge 3 commits into
Replace the broad ORT_USE_CPUINFO macro (with negated platform exclusions) with inline defined(CPUINFO_SUPPORTED) && defined(__linux__) guards at each point of use. Since __APPLE__ and __linux__ are mutually exclusive, the previous negation-based condition collapses to simply defined(__linux__). Drop the intermediate ORT_USE_CPUINFO macro in favour of direct guards.
Without a dedicated kernel, SequenceMap falls back to the ONNX context-dependent function body, which expands the op into a Loop over SequenceInsert calls. Each SequenceInsert copies the accumulator sequence, producing O(n^2) memory traffic for an n-element input.

This adds a native CPU kernel that:
- Derives from IControlFlowKernel and sets up the body subgraph via the standard FeedsFetchesManager flow used by Loop, If, and Scan.
- Iterates the input sequence sequentially in O(n), forwarding the i-th element of each sequence-typed input and passing tensor-typed additional_inputs through unchanged.
- Validates that sequence-typed additional_inputs share the input sequence length.
- Assembles one TensorSeq per body output and appends fetched tensors per iteration without intermediate copies.

Adds unit tests for the identity body, an add-scalar body with a tensor additional input, and a body that emits two outputs to cover the multi-output path.

Fixes microsoft#23024
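To make the O(n^2) versus O(n) claim concrete, here is a minimal standalone sketch that counts element copies under the two strategies. It does not use any ONNX Runtime types: `Element`, `CopiesWithSequenceInsert`, and `CopiesWithNativeAppend` are hypothetical names introduced only for illustration, and the copy-per-insert model is an assumption based on the description above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor; not an ONNX Runtime type.
using Element = int;

// Models the Loop + SequenceInsert fallback: each step copies the whole
// accumulator into a fresh sequence and then appends one element.
// Returns the total number of element copies performed.
std::size_t CopiesWithSequenceInsert(const std::vector<Element>& input) {
  std::size_t copies = 0;
  std::vector<Element> acc;
  for (Element e : input) {
    std::vector<Element> next(acc);  // copy of the accumulated sequence
    copies += acc.size();
    next.push_back(e);               // plus the newly inserted element
    ++copies;
    acc = std::move(next);
  }
  return copies;
}

// Models the native kernel: reserve once, then append each result in place.
std::size_t CopiesWithNativeAppend(const std::vector<Element>& input) {
  std::size_t copies = 0;
  std::vector<Element> out;
  out.reserve(input.size());
  for (Element e : input) {
    out.push_back(e);
    ++copies;
  }
  return copies;
}
```

Under this model an n-element input costs n(n+1)/2 copies with the insert-based accumulator and n copies with in-place appends.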
Pull request overview
This PR adds a native CPU Execution Provider kernel for ONNX SequenceMap (opset 17) to avoid the quadratic Loop + SequenceInsert function-body fallback, executing the body subgraph directly per sequence element via the control-flow kernel infrastructure.
Changes:
- Introduces `SequenceMap` as a CPU control-flow kernel using `FeedsFetchesManager` + `utils::ExecuteSubgraph`.
- Registers the new kernel in the CPU EP opset-17 registry.
- Adds unit tests that construct minimal `body` subgraphs (identity, add-with-extra-input, two outputs) and validate results.
- Also changes cpuinfo usage gating in `PosixEnv` to Linux-only.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cpu/sequence/sequence_ops.h | Declares the new SequenceMap control-flow kernel and its FeedsFetchesManager member. |
| onnxruntime/core/providers/cpu/sequence/sequence_ops.cc | Implements SetupSubgraphExecutionInfo/Compute and registers the opset-17 CPU kernel. |
| onnxruntime/core/providers/cpu/cpu_execution_provider.cc | Registers SequenceMap in the CPU EP kernel registry. |
| onnxruntime/test/providers/cpu/sequence/sequence_ops_test.cc | Adds new SequenceMap unit tests with constructed body subgraphs. |
| onnxruntime/core/platform/posix/env.cc | Narrows cpuinfo integration to Linux-only in PosixEnv. |
    const auto& subgraph_map = subgraph_session_state.GetOrtValueNameIdxMap();
    std::unique_ptr<FeedsFetchesManager> ffm;
    ORT_RETURN_IF_ERROR(FeedsFetchesManager::Create(feed_names, fetch_names, subgraph_map, ffm));
    ORT_RETURN_IF_ERROR(utils::InitializeFeedFetchCopyInfo(subgraph_session_state, *ffm));

    std::vector<std::string> outer_feed_names;
    outer_feed_names.reserve(node_inputs.size());
    for (const auto* input_def : node_inputs) {
      outer_feed_names.push_back(input_def->Name());
    }

    std::vector<OrtValue> feeds;
    feeds.reserve(static_cast<size_t>(num_outer_inputs));

    // Build feeds: sequence inputs -> element i; tensor inputs -> pass-through OrtValue.
    for (int k = 0; k < num_outer_inputs; ++k) {
      const auto* seq_k = (k == 0) ? input_seq : ctx->Input<TensorSeq>(k);
      if (seq_k != nullptr) {
        feeds.push_back(seq_k->GetAt(i));
      } else {
        // Tensor input: shallow-copy the OrtValue (shared_ptr, safe) from the kernel context.
        const auto* input_val = ctx_internal->GetInputMLValue(k);
        ORT_ENFORCE(input_val != nullptr, "SequenceMap: input ", k, " is neither a sequence nor a tensor.");
        feeds.push_back(*input_val);
      }
    }
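The per-iteration feed selection above, together with the length-parity check described in the commit message, can be sketched without ONNX Runtime as follows. This is only an illustration under simplifying assumptions: `Input`, `SequenceLengthsMatch`, and `BuildFeeds` are hypothetical names, and a single `float` stands in for a whole tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a kernel input: either a sequence of n "tensors"
// or a single pass-through "tensor" (stored as values[0]).
struct Input {
  bool is_sequence;
  std::vector<float> values;
};

// Returns true when input 0 is a sequence and every other sequence-typed
// input has the same length as input 0.
bool SequenceLengthsMatch(const std::vector<Input>& inputs) {
  if (inputs.empty() || !inputs[0].is_sequence) return false;
  const std::size_t n = inputs[0].values.size();
  for (const Input& in : inputs) {
    if (in.is_sequence && in.values.size() != n) return false;
  }
  return true;
}

// Builds the body feeds for iteration i: sequence inputs contribute their
// i-th element, tensor inputs are forwarded unchanged on every iteration.
std::vector<float> BuildFeeds(const std::vector<Input>& inputs, std::size_t i) {
  std::vector<float> feeds;
  feeds.reserve(inputs.size());
  for (const Input& in : inputs) {
    if (in.is_sequence) {
      assert(i < in.values.size());  // guaranteed by SequenceLengthsMatch
      feeds.push_back(in.values[i]);
    } else {
      assert(!in.values.empty());
      feeds.push_back(in.values[0]);
    }
  }
  return feeds;
}
```

Running the length check once before the loop keeps the per-iteration path free of validation work.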

    for (int j = 0; j < num_outputs; ++j) {
      output_seqs[j] = ctx->Output<TensorSeq>(j);
      ORT_ENFORCE(output_seqs[j] != nullptr, "SequenceMap: failed to get output TensorSeq slot ", j);
      output_seqs[j]->Reserve(seq_len);
    }

    if (i == 0) {
      output_seqs[j]->SetType(fetches[j].Get<Tensor>().DataType());
    }

    TypeProto float_tensor;
    float_tensor.mutable_tensor_type()->set_elem_type(TensorProto_DataType_FLOAT);
    float_tensor.mutable_tensor_type()->mutable_shape()->add_dim();

    auto& x_arg = graph.GetOrCreateNodeArg("x", &float_tensor);
    auto& scalar_arg = graph.GetOrCreateNodeArg("scalar_in", &float_tensor);
    auto& out_arg = graph.GetOrCreateNodeArg("add_out", &float_tensor);

    // additional_inputs is a tensor (passed through to every iteration)
    test.AddInput<float>("additional_inputs", {3}, {100.0f, 100.0f, 100.0f});

     // We can not use CPUINFO if it is not supported and we do not want to use
     // it on certain platforms because of the binary size increase.
     // We could use it to find out the number of physical cores for certain supported platforms
    -#if defined(CPUINFO_SUPPORTED) && !defined(__APPLE__) && !defined(__ANDROID__) && !defined(__wasm__) && !defined(_AIX)
    +#if defined(CPUINFO_SUPPORTED) && defined(__linux__)
     #include <cpuinfo.h>
     #define ORT_USE_CPUINFO
     #endif
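For readers unfamiliar with the inline-guard style the commit message describes, the following sketch shows the same condition repeated at the include and at the point of use, with a portable fallback. `PhysicalCoreCountOrFallback` is a hypothetical helper, not a function from this PR; the sketch assumes `CPUINFO_SUPPORTED` is defined by the build system when the cpuinfo library is available, and it compiles without cpuinfo when that macro is absent.

```cpp
#include <thread>

// Same guard at the include and at the point of use, so no intermediate
// macro is needed.
#if defined(CPUINFO_SUPPORTED) && defined(__linux__)
#include <cpuinfo.h>
#endif

// Hypothetical helper: returns the physical core count when cpuinfo is
// available and initializes successfully, otherwise falls back to the
// logical thread count reported by the standard library (at least 1).
unsigned PhysicalCoreCountOrFallback() {
#if defined(CPUINFO_SUPPORTED) && defined(__linux__)
  if (cpuinfo_initialize()) {
    const unsigned cores = cpuinfo_get_cores_count();
    if (cores > 0) return cores;
  }
#endif
  const unsigned n = std::thread::hardware_concurrency();
  return n == 0 ? 1u : n;  // hardware_concurrency() may return 0 if unknown
}
```

Repeating the condition costs a little duplication but makes each guarded region self-describing, which is the trade-off the commit message argues for.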

    for (size_t i = 0; i < seq_len; ++i) {
      std::vector<OrtValue> feeds;
      feeds.reserve(static_cast<size_t>(num_outer_inputs));
- Include implicit inputs in subgraph feed setup and Compute
- Initialize output TensorSeq element type before iteration loop so empty input sequences produce correctly-typed outputs
- Remove redundant per-iteration SetType
- Hoist feeds/fetches allocations outside the iteration loop
- Fix scalar broadcasting test: use rank-0 TypeProto and scalar value
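Two of the fixes listed above (setting the output element type before the loop, and hoisting the feeds buffer out of the loop) can be illustrated with a small standalone sketch. Nothing here is ONNX Runtime API: `OutputSeq` and `MapSequence` are hypothetical, and `declared_type` is an assumed stand-in for whatever type information the kernel reads before iterating.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an output TensorSeq.
struct OutputSeq {
  std::string elem_type;     // empty means the type was never set
  std::vector<float> items;
};

// Applies `body` to each element of `input`. The element type is set once
// before the loop, so an empty input still yields a typed (empty) output,
// and the feeds buffer is allocated once and reused on every iteration.
template <typename Body>
OutputSeq MapSequence(const std::vector<float>& input,
                      const std::string& declared_type, Body body) {
  OutputSeq out;
  out.elem_type = declared_type;   // not deferred to iteration 0
  out.items.reserve(input.size());

  std::vector<float> feeds;        // hoisted out of the loop
  feeds.reserve(1);
  for (float x : input) {
    feeds.clear();                 // keeps capacity, drops old contents
    feeds.push_back(x);
    out.items.push_back(body(feeds));
  }
  return out;
}
```

If the type were only set when `i == 0`, as in the earlier snippet, a zero-length input would leave the output untyped, which is the case the first fix addresses.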
Thanks for the review. Addressed in 62a1bea:
On the |
Summary
- Adds a native CPU kernel for `SequenceMap` (opset 17), eliminating the ONNX function-body fallback that expands into a `Loop` over `SequenceInsert` and produces O(n^2) memory traffic.
- Follows the same pattern as `Loop`/`If`/`Scan`: derives from `IControlFlowKernel`, prepares feeds/fetches via `FeedsFetchesManager`, and invokes the body subgraph through `utils::ExecuteSubgraph`.

Motivation
Fixes #23024. The ONNX spec defines `SequenceMap` via a context-dependent function body that decomposes the op into a `Loop` whose accumulator is grown by `SequenceInsert` on every iteration. Each `SequenceInsert` copies the accumulated sequence, so processing an n-element input requires O(n^2) memory traffic. Workloads that map per-element transforms over long sequences hit this quadratic behaviour and are forced to avoid the operator entirely.

A native kernel iterates the input in O(n), forwards the i-th element of each sequence-typed input plus passthrough tensor inputs to the body, and appends each body output to the appropriate output `TensorSeq` without per-iteration copies.

Changes
- `onnxruntime/core/providers/cpu/sequence/sequence_ops.h`: declares `SequenceMap` as an `IControlFlowKernel` with a `FeedsFetchesManager` member.
- `onnxruntime/core/providers/cpu/sequence/sequence_ops.cc`: implements `SetupSubgraphExecutionInfo` and `Compute`, registers the kernel for opset 17, validates length parity for sequence-typed `additional_inputs`, and assembles per-output `TensorSeq` results.
- `onnxruntime/core/providers/cpu/cpu_execution_provider.cc`: adds the forward declaration and `BuildKernelCreateInfo` entry for the new kernel alongside the other opset-17 sequence ops.
- `onnxruntime/test/providers/cpu/sequence/sequence_ops_test.cc`: adds `SequenceMap_Identity`, `SequenceMap_AddScalar`, and `SequenceMap_TwoOutputs` covering single-input identity, sequence + tensor broadcast, and dual-output body graphs.

Test Plan
- `onnxruntime_test_all --gtest_filter='SequenceOpsTest.SequenceMap*'`: exercises the three new tests.
- `onnxruntime_test_all --gtest_filter='SequenceOpsTest.*'`: confirms no regression in sibling sequence ops.
- The `sequence_map_identity_*`, `sequence_map_add_*`, and `sequence_map_extract_shapes` tests continue to be excluded for TensorRT EP only; the CPU EP now executes them via the native kernel rather than the function-body fallback.

Issue Resolution
Fixes #23024.