[Bug] Hierarchical regions with inter-kernel streams deadlock in simulator and have HLS codegen issues #561
Bug Report: Hierarchical regions with inter-kernel stream communication deadlock in simulator and have codegen issues in HLS
Description
When using hierarchical dataflow regions (a top region calling an inner region via inner(...) inside a wrapper kernel), both the simulator backend and the Vitis HLS backend fail when the inner region contains kernels that communicate via streams with circular dependencies (e.g., producer sends data, then waits for consumer's processed result).
Reproducer
import allo
from allo.ir.types import float32, Stream
import allo.dataflow as df
import numpy as np

@df.region()
def inner(A: float32[4], B: float32[4]):
    fwd: Stream[float32, 4][1]
    bwd: Stream[float32, 4][1]

    @df.kernel(mapping=[1], args=[A])
    def producer(local_A: float32[4]):
        fwd[0].put(local_A[0])
        fwd[0].put(local_A[1])
        r0: float32 = bwd[0].get()  # Wait for consumer result
        r1: float32 = bwd[0].get()
        local_A[2] = r0
        local_A[3] = r1

    @df.kernel(mapping=[1], args=[B])
    def consumer(local_B: float32[4]):
        v0: float32 = fwd[0].get()
        v1: float32 = fwd[0].get()
        local_B[0] = v0 + 1.0
        local_B[1] = v1 + 1.0
        bwd[0].put(v0 + 1.0)
        bwd[0].put(v1 + 1.0)

@df.region()
def top(A: float32[4], B: float32[4]):
    @df.kernel(mapping=[1], args=[A, B])
    def wrapper(local_A: float32[4], local_B: float32[4]):
        inner(local_A, local_B)

# WORKS: flat build, simulator runs kernels in parallel
sim = df.build(inner, target="simulator")
A = np.array([10.0, 20.0, 0.0, 0.0], dtype=np.float32)
B = np.zeros(4, dtype=np.float32)
sim(A, B)  # Passes

# DEADLOCKS: hierarchical build, simulator serializes inner kernels
sim2 = df.build(top, target="simulator")
sim2(A, B)  # Hangs forever

# HLS CODEGEN BUG: wrapper_0 calls inner__0 before it's defined
hls_mod = df.build(top, target="vitis_hls", mode="csim", project="test.prj")
hls_mod(A, B)  # Compilation error: 'inner__0' was not declared
Observed Behavior
1. Simulator Backend: Deadlock
Building inner directly (flat) works - the simulator launches producer and consumer in parallel threads. But building top (hierarchical) deadlocks because the simulator executes the inner region's kernels sequentially via func.call. The producer blocks on bwd[0].get() waiting for consumer, but consumer hasn't started yet.
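The failure mode described above can be reproduced outside Allo. The following is a minimal sketch using only the Python standard library, not the simulator's actual implementation: `queue.Queue` stands in for the streams, plain functions stand in for the kernels, and the helper names `run_sequential` and `run_parallel` are hypothetical. A timeout on `get()` is used so the sequential case reports starvation instead of hanging.

```python
import queue
import threading


def producer(a, fwd, bwd, timeout):
    # Send two values, then wait for the consumer's processed results.
    fwd.put(a[0])
    fwd.put(a[1])
    a[2] = bwd.get(timeout=timeout)
    a[3] = bwd.get(timeout=timeout)


def consumer(b, fwd, bwd, timeout):
    # Receive two values, store them incremented, and send them back.
    v0 = fwd.get(timeout=timeout)
    v1 = fwd.get(timeout=timeout)
    b[0] = v0 + 1.0
    b[1] = v1 + 1.0
    bwd.put(v0 + 1.0)
    bwd.put(v1 + 1.0)


def make_state():
    # Same input values as the reproducer; bounded queues stand in for streams.
    a = [10.0, 20.0, 0.0, 0.0]
    b = [0.0, 0.0, 0.0, 0.0]
    return a, b, queue.Queue(maxsize=4), queue.Queue(maxsize=4)


def run_sequential(timeout=0.2):
    # Kernels called one after another, like a chain of plain function calls.
    a, b, fwd, bwd = make_state()
    try:
        producer(a, fwd, bwd, timeout)  # starves on bwd.get(): consumer never ran
        consumer(b, fwd, bwd, timeout)
    except queue.Empty:
        return False, a, b
    return True, a, b


def run_parallel(timeout=2.0):
    # Each kernel gets its own thread, so the circular exchange can complete.
    a, b, fwd, bwd = make_state()
    threads = [
        threading.Thread(target=producer, args=(a, fwd, bwd, timeout)),
        threading.Thread(target=consumer, args=(b, fwd, bwd, timeout)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a, b


if __name__ == "__main__":
    print("sequential completed:", run_sequential()[0])
    print("parallel result:", run_parallel())
```

Without the timeout, the sequential variant would block forever on the first `bwd.get()`, which matches the hang seen with the hierarchical build.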
2. HLS Backend: Two codegen issues in generated kernel.cpp
a) Missing forward declaration: wrapper_0 is emitted before inner__0, causing C++ compilation failure:
void wrapper_0(float v0[4], float v1[4]) {
  inner__0(v0, v1); // ERROR: 'inner__0' was not declared in this scope
}
// ... inner__0 defined later ...
b) Missing #pragma HLS dataflow in inner region: The inner region body calls its kernels sequentially without the dataflow pragma:
void inner__0(float v32[4], float v33[4]) {
  hls::stream<float> v34, v35;
  producer_0__0_fixed(v32, v34, v34, v35, v35); // Sequential!
  consumer_0__0_fixed(v33, v34, v34, v35, v35); // Sequential!
  // Missing: #pragma HLS dataflow
}
Only the top-level top() function has #pragma HLS dataflow, but the inner region also needs it for its kernels to execute concurrently.
Expected Behavior
- Simulator: Inner region kernels should execute in parallel threads, same as when the region is built directly.
- HLS codegen: Inner region functions should be emitted before their callers (or forward-declared), and should include #pragma HLS dataflow to parallelize their internal kernels.
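One way to get the "callees before callers" emission order is a topological sort of the call graph. The sketch below is only an illustration with the standard-library `graphlib` module and does not reflect how Allo's code generator is structured. The function names are taken from the generated snippets above, the `calls` mapping is inferred from them (including the assumption that `top` calls `wrapper_0`), and `emission_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter


def emission_order(calls):
    """Return function names ordered so every callee precedes its callers.

    `calls` maps each function name to the names it calls.
    TopologicalSorter treats the mapped values as predecessors,
    so callees are yielded first.
    """
    return list(TopologicalSorter(calls).static_order())


# Call graph inferred from the generated kernel.cpp excerpts (an assumption).
calls = {
    "top": ["wrapper_0"],
    "wrapper_0": ["inner__0"],
    "inner__0": ["producer_0__0_fixed", "consumer_0__0_fixed"],
    "producer_0__0_fixed": [],
    "consumer_0__0_fixed": [],
}

order = emission_order(calls)

if __name__ == "__main__":
    for name in order:
        print(name)
```

Emitting definitions in this order places `inner__0` ahead of `wrapper_0`. Forward-declaring every function at the top of the file would be an alternative that does not depend on ordering at all.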
Environment
- Allo version: main branch
- Python: 3.12
- Vitis HLS: 2023.2
- OS: Linux (RHEL)