Description
The need here is fairly similar to what's been discussed for #4613. However, I fear that placing the synchronization point between DC (handshake, in that case) and pipelines at the ESI level (meaning that the pipeline has already been lowered) skips the step where we'd actually want to do some meaningful analysis of DC + pipeline interactions (think merging pipelines).
This also relates to the discussion at https://discourse.llvm.org/t/should-ssa-values-in-handshake-always-have-implicit-handshake-semantics/70321. DC takes care of the "implicit SSA handshake semantics" issue, but the handshake.unit proposal is orthogonal, and still a valid design point.
As an example, we want to connect the following pipeline with the surrounding DC logic:
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  ...
  %out, %done = pipeline.scheduled(%arg0) clock %clk reset %rst go %go : (i32) -> (i32) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    pipeline.return %8 : i32
  }
  ...
}
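To make the gap concrete, here is a rough sketch of the explicit glue this implies. The dc.unpack/dc.pack usage is illustrative, and the bridging between !dc.token and the pipeline's i1 go/done signals is deliberately elided - materializing exactly that control logic (including backpressure/stalling) is the part that currently has no obvious home:
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  // Unpack the latency-insensitive input into a token and a raw i32.
  %inToken, %a0 = dc.unpack %arg0 : !dc.value<i32>
  %go = ... // hypothetical: derive an i1 go signal from %inToken
  %res, %done = pipeline.scheduled(%a0) clock %clk reset %rst go %go : (i32) -> (i32) {
    ...
  }
  %outToken = ... // hypothetical: derive a !dc.token from %done
  // Repack the result for the surrounding DC logic.
  %outValue = dc.pack %outToken, %res : i32
  hw.output %outValue : !dc.value<i32>
}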
Option 1: DC-interface'd pipeline
This is closer to the original notion of a "latency-insensitive pipeline" that the pipeline dialect was designed with. Back then, the design intention was: "how can we have a pipeline abstraction where the body of the pipeline can serve as both a latency-insensitive and a latency-sensitive implementation internally, in between each stage?" The latter part, we found, wasn't really possible.
However, it's still perfectly valid to have the interface of the statically scheduled pipeline be latency insensitive. This essentially implies that all of the glue logic that would otherwise have to be written explicitly (as sketched above) is now implicit, and implemented by a lowering.
The semantics here would be unit-rate actor semantics across all of the inputs and outputs of the pipeline, combined with accounting for the current state of the pipeline (i.e. II > 1 and stall signal assertions).
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  %out = pipeline.scheduled.li(%arg0) clock %clk reset %rst : (!dc.value<i32>) -> (!dc.value<i32>) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    pipeline.return %8 : i32
  }
  hw.output %out : !dc.value<i32>
}
This obviously puts a larger strain on the lowering. However, what I like about this is what it does for IR analysis: it is trivial to identify latency-insensitive pipelines (since they are now a separate op), and we also know that such a pipeline is internally always statically scheduled, with everything that comes with that (latency, ...).
For example, when merging two pipelines that feed into each other:
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  %out = pipeline.scheduled.li(%arg0) clock %clk reset %rst : (!dc.value<i32>) -> (!dc.value<i32>) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    pipeline.return %8 : i32
  }
  %out2 = pipeline.scheduled.li(%out) clock %clk reset %rst : (!dc.value<i32>) -> (!dc.value<i32>) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    pipeline.return %8 : i32
  }
  hw.output %out2 : !dc.value<i32>
}
// Merges to
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  %out = pipeline.scheduled.li(%arg0) clock %clk reset %rst : (!dc.value<i32>) -> (!dc.value<i32>) {
  ^bb0(%arg0_0: i32, %s0_valid : i1):
    %1 = comb.sub %arg0_0, %arg0_0 : i32
    pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
  ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
    %8 = comb.add %6, %7 : i32
    %9 = comb.sub %8, %8 : i32
    pipeline.stage ^bb2 regs(%9, %8 : i32, i32)
  ^bb2(%10: i32, %11: i32, %s2_valid : i1): // pred: ^bb1
    %12 = comb.add %10, %11 : i32
    pipeline.return %12 : i32
  }
  hw.output %out : !dc.value<i32>
}
Option 2: Generic DC "fixed-latency, unit rate" operation
In practice, I'd assume most DC<->Pipeline optimizations pertain to the merging of known-latency groups of operations, which isn't restricted to just pipeline-dialect operations. This is where the unit-rate actor proposal of https://discourse.llvm.org/t/should-ssa-values-in-handshake-always-have-implicit-handshake-semantics/70321 comes in, wherein it is the inputs and outputs of an operation that have unit-rate actor semantics. That allows us to place essentially anything within the body of the unit-rate actor, so long as it returns an output valid signal.
Additionally, we would be able to tag such a dc.unit operation with information such as latency.
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  %out = dc.unit(%arg0) : (!dc.value<i32>) -> (!dc.value<i32>) {latency = 1} {
  // The body of a unit-rate actor just has the "unwrapped" arguments and the
  // (joined) valid signal, and has a mandatory "done"/output validity
  // return signal.
  ^bb0(%a0 : i32, %valid : i1):
    %p_out, %p_done = pipeline.scheduled(%a0) clock %clk reset %rst go %valid : (i32) -> (i32) {
    ^bb0(%arg0_0: i32, %s0_valid : i1):
      %1 = comb.sub %arg0_0, %arg0_0 : i32
      pipeline.stage ^bb1 regs(%1, %arg0_0 : i32, i32)
    ^bb1(%6: i32, %7: i32, %s1_valid : i1): // pred: ^bb0
      %8 = comb.add %6, %7 : i32
      pipeline.return %8 : i32
    }
    return %p_out, %p_done : i32, i1
  }
  hw.output %out : !dc.value<i32>
}
While this is a very generic approach, I fear that it may make analysis and transformation a bit harder. Think of the case of merging pipelines: is it better to have two pipeline.scheduled.li operations abutting, or two dc.unit operations abutting? Given that the body of a dc.unit may contain literally anything, we would be able to place the two pipelines next to each other, but wouldn't be able to do a proper merge into a single pipeline.
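To illustrate, the best we could presumably do for two abutting dc.unit-wrapped pipelines is something along the lines of the sketch below (assuming the hypothetical dc.unit op from above): chain the bodies and sum the latency attributes, while the two inner pipelines remain separate - in contrast to the single merged pipeline.scheduled.li shown earlier.
hw.module @myPipeline(%arg0 : !dc.value<i32>) -> (out: !dc.value<i32>) {
  // Hypothetical fusion of two latency-1 dc.unit ops into one latency-2 unit:
  // the bodies are placed back-to-back, with the second pipeline's go driven
  // by the first pipeline's done. No cross-pipeline rescheduling happens.
  %out = dc.unit(%arg0) : (!dc.value<i32>) -> (!dc.value<i32>) {latency = 2} {
  ^bb0(%a0 : i32, %valid : i1):
    %o0, %done0 = pipeline.scheduled(%a0) clock %clk reset %rst go %valid : (i32) -> (i32) {
      ...
    }
    %o1, %done1 = pipeline.scheduled(%o0) clock %clk reset %rst go %done0 : (i32) -> (i32) {
      ...
    }
    return %o1, %done1 : i32, i1
  }
  hw.output %out : !dc.value<i32>
}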