Xilinx · ollycassidy13 · Apr 22, 2026 · Apr 23, 2026 · Apr 23, 2026 · Apr 24, 2026
diff --git a/docs/finn/components/index.rst b/docs/finn/components/index.rst
@@ -10,3 +10,4 @@ This section provides detailed documentation for specific FINN hardware componen
    :maxdepth: 2
 
    rtl-swg
+   pwpolyf
diff --git a/docs/finn/components/pwpolyf.rst b/docs/finn/components/pwpolyf.rst
@@ -0,0 +1,272 @@
+PWPolyF Piecewise Polynomial Activation
+=======================================
+
+Overview
+--------
+
+PWPolyF is a hardware activation layer that approximates nonlinear functions
+(GELU, SiLU, Sigmoid, Tanh) using piecewise polynomials evaluated with
+Horner's method on a chain of DSPFP32 FMA units. With the default degree of 2,
+this uses two cascaded DSPs and one RAMB18 coefficient ROM per PE and sustains
+a throughput of one element per cycle. Per-function configuration, including
+clamping behaviour and polynomial coefficients, is delivered through a
+SystemVerilog package (``pwpolyf_pkg``) using a ``func_cfg_t`` struct.
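As an illustration of the Horner evaluation described above, here is a minimal Python sketch. It is not the RTL: ``horner_fma`` is a hypothetical helper, the coefficient ordering (constant term first) is an assumption, and Python floats are double precision rather than FP32. It only shows why a degree-``d`` polynomial needs ``d`` multiply-add steps.

```python
def horner_fma(coeffs, x):
    """Evaluate c[0] + c[1]*x + ... + c[d]*x**d using d multiply-add steps.

    coeffs is ordered from the constant term upward (assumed convention).
    """
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = acc * x + c  # one multiply-add per polynomial degree
    return acc


# Degree 2 needs two multiply-add steps, matching two cascaded DSPs per PE.
y = horner_fma([1.0, 2.0, 3.0], 0.5)  # 1 + 2*0.5 + 3*0.5**2
```

Each loop iteration corresponds to one FMA stage, so the default degree of 2 maps onto two stages.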
+
+The input domain is partitioned into ``1 + 2*5*(2^K)`` segments: one near-zero
+region, positive octave sub-segments, and negative mirrors. With the default
+``K=3`` this gives 81 segments. Segment selection reuses the FP32 exponent and
+mantissa bit fields directly, matching the RTL implementation.
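The segment count and the FP32 bit fields mentioned above can be sketched in Python as follows. ``num_segments`` and ``fp32_fields`` are hypothetical names used only for illustration. How the RTL combines the exponent and mantissa bits into a segment index is not described here, so this sketch stops at extracting the fields.

```python
import struct


def num_segments(k=3):
    """Segment count from the formula 1 + 2*5*(2^K)."""
    return 1 + 2 * 5 * (2 ** k)


def fp32_fields(x, k=3):
    """Split an FP32 value into sign, biased exponent, and top-K mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa_top = (bits >> (23 - k)) & ((1 << k) - 1)
    return sign, exponent, mantissa_top
```

With the default ``K=3``, ``num_segments()`` returns 81, and ``fp32_fields(1.5)`` returns ``(0, 127, 4)``.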
+
+Polynomial coefficients are generated at HDL build time by
+``PWPolyF_rtl._generate_coeffs_pkg()``, which fits polynomials of the
+configured degree to the reference PyTorch functions and writes
+``pwpolyf_pkg.sv``. Both ``K`` and ``degree`` are configurable. They default to
+``K=3`` and ``degree=2`` when inferred from standard ONNX ops.
+
+Architecture
+------------
+
+PWPolyF is RTL-only, with no HLS variant, and targets Versal devices only. The
+RTL instantiates the Versal DSPFP32 primitive, so UltraScale+ and older parts
+must not be specialized to this backend.
+
+Two export paths are supported:
+
+.. code-block:: text
+
+   Path A: PiecewisePolyActivation        Path B: nn.GELU / nn.SiLU / etc.
+       |  torch.onnx.export                   |  torch.onnx.export
+       |  (dynamo=False)                      |  (dynamo=True or False)
+       v                                      v
+   PWPolyF custom ONNX node           Standard ONNX ops (Gelu, Sigmoid,
+       |                               Tanh, Sigmoid+Mul for SiLU,
+       |                               Div+Erf+Add+Mul+Mul for GELU)
+       |                                      |
+       +------------- both paths -------------+
+                         |
+                   InferPWPolyFLayer
+                         v
+               PWPolyF HW op (finn.custom_op.fpgadataflow)
+                         |  SpecializeLayers
+                         v
+               PWPolyF_rtl (finn.custom_op.fpgadataflow.rtl)
+                         |  generate_hdl
+                         v
+               finn-rtllib/pwpolyf/hdl/ SystemVerilog IP
+
+Standard ONNX Op Inference
+--------------------------
+
+``InferPWPolyFLayer`` recognises standard ONNX activation ops in addition to
+the explicit ``PWPolyF`` custom op. This allows models that use ``nn.GELU``,
+``nn.SiLU``, ``nn.Sigmoid``, or ``nn.Tanh`` to be exported with ``dynamo=True``
+or ``dynamo=False`` and automatically converted to PWPolyF HW layers.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 45 20
+
+   * - ONNX op type
+     - Pattern
+     - Maps to
+   * - ``Gelu`` (opset 20+)
+     - Single node
+     - ``func="gelu"``
+   * - ``Div`` + ``Erf`` + ``Add`` + ``Mul`` + ``Mul``
+     - ``x * 0.5 * (1 + erf(x / sqrt(2)))``
+     - ``func="gelu"``
+   * - ``Sigmoid``
+     - Single node (standalone)
+     - ``func="sigmoid"``
+   * - ``Tanh``
+     - Single node
+     - ``func="tanh"``
+   * - ``Sigmoid`` + ``Mul``
+     - ``Mul(x, Sigmoid(x))``
+     - ``func="silu"``
+
+``Gelu`` as a single ONNX node requires opset 20 or later. With lower opsets,
+including the opset 18 default of ``dynamo=True`` export, GELU decomposes into
+a 5-node Erf-based pattern. Both forms are matched. SiLU has no standard ONNX
+op and decomposes to ``Sigmoid(x) * x``. Only FLOAT32 inputs are converted.
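For reference, the scalar functions behind the matched patterns can be written in plain Python. This is a sketch of the mathematics only, in double precision, and is unrelated to how ``InferPWPolyFLayer`` performs graph matching. The function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def silu(x):
    # Corresponds to the Mul(x, Sigmoid(x)) pattern.
    return x * sigmoid(x)


def gelu_erf(x):
    # Corresponds to the Div + Erf + Add + Mul + Mul pattern.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Both decomposed forms compute the same value as their single-node counterparts, which is why they can map to the same ``func`` setting.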
+
+Folding
+-------
+
+PWPolyF uses PE parallelism. ``NumChannels % PE == 0`` must hold. Each PE
+instantiates its own polynomial evaluation pipeline with ``degree`` DSPs.
+``SetFolding`` handles PE selection automatically.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 10 10 15 15 15 25
+
+   * - PE
+     - Degree
+     - DSPs
+     - BRAM18s
+     - Approx LUTs
+     - Cycles per spatial position
+   * - 1
+     - 2
+     - 2
+     - 1
+     - 200
+     - NumChannels
+   * - C
+     - 2
+     - 2C
+     - C
+     - 200C
+     - 1
+   * - 1
+     - 3
+     - 3
+     - 2
+     - 300
+     - NumChannels
+
+Resource Estimates
+------------------
+
+* DSP: ``degree * PE`` (one FP32 FMA stage per polynomial degree per PE)
+* LUT: approximately ``100 * degree * PE`` for segment address decode and
+  control
+* BRAM18: ``(degree - 1) * PE`` for default ``K=3``. Vivado infers delayed
+  coefficient lookups as 32-bit ROMs.
+* URAM: 0
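The folding constraint and the estimates listed above can be combined into a small calculator. This is a sketch under stated assumptions: ``pwpolyf_estimates`` is a hypothetical helper, not part of the FINN API, and the BRAM18 figure applies the ``(degree - 1) * PE`` rule that the text gives for the default ``K=3``.

```python
def pwpolyf_estimates(num_channels, pe, degree=2):
    """Approximate per-layer resources and cycles from the documented formulas."""
    if pe <= 0 or num_channels % pe != 0:
        raise ValueError("NumChannels % PE == 0 must hold")
    return {
        "DSP": degree * pe,
        "LUT": 100 * degree * pe,      # approximate
        "BRAM18": (degree - 1) * pe,   # stated for the default K=3
        "URAM": 0,
        "cycles_per_position": num_channels // pe,
    }
```

For example, ``pwpolyf_estimates(64, 1)`` reproduces the first row of the folding table: 2 DSPs, 1 BRAM18, about 200 LUTs, and 64 cycles per spatial position.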
+
+ONNX Export
+-----------
+
+Two export paths are supported:
+
+* ``PiecewisePolyActivation`` exports as a single ``PWPolyF`` custom op via
+  ``torch.autograd.Function.symbolic()``. It requires ``dynamo=False`` and
+  preserves the ``K`` attribute on the ONNX node.
+* Standard PyTorch modules (``nn.GELU``, ``nn.SiLU``, ``nn.Sigmoid``,
+  ``nn.Tanh``) export with ``dynamo=True`` or ``dynamo=False`` and produce
+  standard ONNX ops that ``InferPWPolyFLayer`` converts to PWPolyF with
+  default ``K=3``.
+
+Attributes on the explicit PWPolyF ONNX node are:
+
+* ``func``: one of ``gelu``, ``silu``, ``sigmoid``, ``tanh``
+* ``K``: mantissa subdivision bits, default 3
+
+Node Attributes
+---------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 15 45
+
+   * - Attribute
+     - Type
+     - Description
+   * - ``func``
+     - string
+     - Activation function name
+   * - ``K``
+     - int
+     - Mantissa subdivision bits, default 3
+   * - ``degree``
+     - int
+     - Polynomial degree / FMA stages, default 2
+   * - ``NumChannels``
+     - int
+     - Number of channels in the last input dimension
+   * - ``PE``
+     - int
+     - Processing elements
+   * - ``inputDataType``
+     - string
+     - Input data type, always FLOAT32
+   * - ``outputDataType``
+     - string
+     - Output data type, always FLOAT32
+   * - ``numInputVectors``
+     - ints
+     - Batch/spatial dimensions
+
+Supported Functions
+-------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 20 30
+
+   * - Function
+     - Negative clamp
+     - Positive behaviour
+   * - GELU
+     - 0.0
+     - passthrough (``y=x``)
+   * - SiLU
+     - 0.0
+     - passthrough (``y=x``)
+   * - Sigmoid
+     - 0.0
+     - clamp to 1.0
+   * - Tanh
+     - -1.0
+     - clamp to 1.0
+
+Files
+-----
+
+Python files:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 35 50
+
+   * - File
+     - Purpose
+   * - ``util/torch_hw_modules.py``
+     - PyTorch activation module, ONNX export, software simulation
+   * - ``custom_op/fpgadataflow/pwpolyf.py``
+     - Base HW op for shape, folding, resource estimates, cppsim
+   * - ``custom_op/fpgadataflow/rtl/pwpolyf_rtl.py``
+     - RTL backend for HDL generation, package generation, rtlsim, IPI
+   * - ``util/pwpolyf.py``
+     - Compatibility imports for existing PWPolyF utility users
+   * - ``transformation/fpgadataflow/convert_to_hw_layers.py``
+     - ``InferPWPolyFLayer`` transformation
+   * - ``builder/build_dataflow_steps.py``
+     - Build pipeline integration
+   * - ``transformation/fpgadataflow/set_folding.py``
+     - Folding support
+
+RTL files:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 35 50
+
+   * - File
+     - Purpose
+   * - ``finn-rtllib/pwpolyf/hdl/pwpolyf_pkg.sv``
+     - ``func_cfg_t`` struct per activation, regenerated for the configured ``K`` and ``degree``
+   * - ``finn-rtllib/pwpolyf/hdl/pwpolyf.sv``
+     - Polynomial evaluation pipeline using a Horner chain on DSPFP32
+   * - ``finn-rtllib/pwpolyf/hdl/queue.sv``
+     - Elastic FIFO for backpressure
+   * - ``finn-rtllib/pwpolyf/hdl/pwpolyf_template_wrapper.v``
+     - AXI-Stream wrapper template
+
+Tests
+-----
+
+``tests/fpgadataflow/test_fpgadataflow_pwpolyf.py`` covers:
+
+* cppsim for all supported functions, channel counts, spatial shapes, and
+  foldings
+* ONNX export for the explicit ``PiecewisePolyActivation`` path
+* ``InferPWPolyFLayer`` conversion and execution
+* standard op inference for Gelu, Sigmoid, Tanh, SiLU, and Erf-based GELU
+* execution correctness against ``PiecewisePolyActivation``
+* Versal-only specialization checks
+* resource estimates, folded shapes, and expected cycles
+* coefficient package generation for ``K`` and ``degree``
+* Vivado HDL generation, RTL simulation, and stitched IP simulation
diff --git a/docs/finn/reference/folding-constraints.rst b/docs/finn/reference/folding-constraints.rst
@@ -68,6 +68,9 @@ Constraint Table
    * - Pool
      - PE
      - inp_channels % PE == 0
+   * - PWPolyF
+     - PE
+     - NumChannels % PE == 0
    * - Thresholding
      - PE
      - MH % PE == 0

diff --git a/docs/finn/source_code/finn.custom_op.fpgadataflow.rst b/docs/finn/source_code/finn.custom_op.fpgadataflow.rst
@@ -136,6 +136,15 @@ finn.custom\_op.fpgadataflow.pool
    :undoc-members:
    :show-inheritance:
 
+finn.custom\_op.fpgadataflow.pwpolyf
+--------------------------------------
+
+.. automodule:: finn.custom_op.fpgadataflow.pwpolyf
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+
 finn.custom\_op.fpgadataflow.streamingdataflowpartition
 --------------------------------------------------------
 

diff --git a/docs/finn/source_code/finn.custom_op.fpgadataflow.rtl.rst b/docs/finn/source_code/finn.custom_op.fpgadataflow.rtl.rst
@@ -45,6 +45,14 @@ finn.custom\_op.fpgadataflow.streamingfifo\_rtl
    :undoc-members:
    :show-inheritance:
 
+finn.custom\_op.fpgadataflow.pwpolyf\_rtl
+--------------------------------------------
+
+.. automodule:: finn.custom_op.fpgadataflow.rtl.pwpolyf_rtl
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
 finn.custom\_op.fpgadataflow.thresholding\_rtl
 -------------------------------------------------------
 

diff --git a/docs/finn/source_code/finn.util.rst b/docs/finn/source_code/finn.util.rst
@@ -188,6 +188,24 @@ finn.util.pytorch
  :show-inheritance:
 
 
+finn.util.torch_hw_modules
+---------------------------
+
+.. automodule:: finn.util.torch_hw_modules
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+
+finn.util.pwpolyf
+-------------------
+
+.. automodule:: finn.util.pwpolyf
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+
 finn.util.test
 ---------------------
 

diff --git a/finn-rtllib/pwpolyf/hdl/pwpolyf.abc b/finn-rtllib/pwpolyf/hdl/pwpolyf.abc
@@ -0,0 +1,5 @@
+import  queue
+read_sv pwpolyf_pkg.sv
+read_sv pwpolyf.sv
+setup_tb  pwpolyf_tb
+setup_top pwpolyf