Description
The need here is fairly similar to what's been discussed for #4613. However, I fear that placing the synchronization point between DC (handshake, in that case) and pipelines at the ESI level (meaning that the pipeline has already been lowered) skips the step where we'd actually want to do some meaningful analysis of DC + pipeline interactions (think merging pipelines).
This also relates to the discussion at https://discourse.llvm.org/t/should-ssa-values-in-handshake-always-have-implicit-handshake-semantics/70321. DC takes care of the "implicit SSA handshake semantics" issue, but the handshake.unit proposal is orthogonal, and still a valid design point.
As an example, we want to connect the following pipeline with the surrounding DC logic:
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  ...
  %out, %done = pipeline.scheduled(%arg0) clock %clk reset %rst go %go : (i32) -> (i32) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    pipeline.return %8 : i32
  }
  ...
}
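To make the gap concrete, here is a rough sketch of the explicit glue this implies. The dc.unpack/dc.pack usage is illustrative, and the bridging between !dc.token and the pipeline's i1 go/done signals is deliberately elided - materializing exactly that control logic (including backpressure/stalling) is the part that currently has no obvious home:
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  // Unpack the latency-insensitive input into a token and a raw i32.
  %inToken, %a0 = dc.unpack %arg0 : !dc.value<i32>
  %go = ... // hypothetical: derive an i1 go signal from %inToken
  %res, %done = pipeline.scheduled(%a0) clock %clk reset %rst go %go : (i32) -> (i32) {
    ...
  }
  %outToken = ... // hypothetical: derive a !dc.token from %done
  // Repack the result for the surrounding DC logic.
  %outValue = dc.pack %outToken, %res : i32
  hw.output %outValue : !dc.value<i32>
}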
Option 1: DC-interface'd pipeline
This is closer to the original notion of a "latency-insensitive pipeline" that the pipeline dialect was designed with. Back then, the design intention was: "how can we have a pipeline abstraction where the body of the pipeline can serve as both a latency-insensitive and a latency-sensitive implementation internally, in between each stage?" The latter part, we found, wasn't really possible.
However, it's still perfectly valid to have the interface of the statically scheduled pipeline be latency insensitive. This essentially implies that all of the glue logic that would otherwise have to be written explicitly (as sketched above) is now implicit, and implemented by a lowering.
The semantics here would be unit-rate actor semantics across all of the inputs and outputs of the pipeline, combined with accounting for the current state of the pipeline (i.e. II > 1 and stall signal assertions).
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  %out = pipeline.scheduled.li(%arg0) clock %clk reset %rst : (!dc.value<i32>) -> (!dc.value<i32>) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    pipeline.return %8 : i32
  }
  hw.output %out : !dc.value<i32>
}
This obviously puts a larger strain on the lowering. However, what I like about this is what it does for IR analysis: it is trivial to identify latency-insensitive pipelines (since they are now a separate op), and we also know that such a pipeline is internally always statically scheduled, with everything that comes with that (latency, ...).
For example, when merging two pipelines that feed into each other:
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  %out = pipeline.scheduled.li(%arg0) clock %clk reset %rst : (!dc.value<i32>) -> (!dc.value<i32>) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    pipeline.return %8 : i32
  }
  %out2 = pipeline.scheduled.li(%out) clock %clk reset %rst : (!dc.value<i32>) -> (!dc.value<i32>) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    pipeline.return %8 : i32
  }
  hw.output %out2 : !dc.value<i32>
}
// Merges to
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  %out = pipeline.scheduled.li(%arg0) clock %clk reset %rst : (!dc.value<i32>) -> (!dc.value<i32>) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    %9 = comb.sub %8, %8 : i32
    pipeline.stage ^bb2 regs(%9, %8 : i32, i32)
  ^bb2(%10: i32, %11: i32, %s2_valid : i1): // pred: ^bb1
    %12 = comb.add %10, %11 : i32
    pipeline.return %12 : i32
  }
  hw.output %out : !dc.value<i32>
}
Option 2: Generic DC "fixed-latency, unit rate" operation
In practice, I'd assume most DC<->Pipeline optimizations pertain to the merging of known-latency groups of operations, which isn't restricted to just pipeline-dialect operations. This is where the unit-rate actor proposal of https://discourse.llvm.org/t/should-ssa-values-in-handshake-always-have-implicit-handshake-semantics/70321 comes in, wherein it is the inputs and outputs of an operation that have unit-rate actor semantics. That allows us to place essentially anything within the body of the unit-rate actor, so long as it returns an output valid signal.
Additionally, we would be able to tag such a dc.unit operation with information such as latency.
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  %out = dc.unit(%arg0) : (!dc.value<i32>) -> (!dc.value<i32>) {latency = 1} {
  // The body of a unit-rate actor just has the "unwrapped" arguments and the
  // (joined) valid signal, and has a mandatory "done"/output validity
  // return signal.
  ^bb0(%a0 : i32, %valid : i1):
    %p_out, %p_done = pipeline.scheduled(%a0) clock %clk reset %rst go %valid : (i32) -> (i32) {
    ^bb0(%arg0_0: i32, %s0_valid : i1):
      %1 = comb.sub %arg0_0, %arg0_0 : i32
      pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
    ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
      %8 = comb.add %6, %7 : i32
      pipeline.return %8 : i32
    }
    return %p_out, %p_done : i32, i1
  }
  hw.output %out : !dc.value<i32>
}
While this is a very generic approach, I fear that it may make analysis and transformation a bit harder. Think of the case of merging pipelines: is it better to have two pipeline.scheduled.li operations abutting, or two dc.unit operations abutting? Given that the body of a dc.unit may contain literally anything, we would be able to place the two pipelines next to each other, but wouldn't be able to do a proper merge into a single pipeline.
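To illustrate, the best we could presumably do for two abutting dc.unit-wrapped pipelines is something along the lines of the sketch below (assuming the hypothetical dc.unit op from above): chain the bodies and sum the latency attributes, while the two inner pipelines remain separate - in contrast to the single merged pipeline.scheduled.li shown earlier.
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  // Hypothetical fusion of two latency-1 dc.unit ops into one latency-2 unit:
  // the bodies are placed back-to-back, with the second pipeline's go driven
  // by the first pipeline's done. No cross-pipeline rescheduling happens.
  %out = dc.unit(%arg0) : (!dc.value<i32>) -> (!dc.value<i32>) {latency = 2} {
  ^bb0(%a0 : i32, %valid : i1):
    %o0, %done0 = pipeline.scheduled(%a0) clock %clk reset %rst go %valid : (i32) -> (i32) {
      ...
    }
    %o1, %done1 = pipeline.scheduled(%o0) clock %clk reset %rst go %done0 : (i32) -> (i32) {
      ...
    }
    return %o1, %done1 : i32, i1
  }
  hw.output %out : !dc.value<i32>
}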