1 change: 1 addition & 0 deletions docs/finn/components/index.rst
@@ -10,3 +10,4 @@ This section provides detailed documentation for specific FINN hardware componen
   :maxdepth: 2

   rtl-swg
   pwpolyf
272 changes: 272 additions & 0 deletions docs/finn/components/pwpolyf.rst
@@ -0,0 +1,272 @@
PWPolyF Piecewise Polynomial Activation
=======================================

Overview
--------

PWPolyF is a hardware activation layer that approximates nonlinear functions
(GELU, SiLU, Sigmoid, Tanh) using piecewise polynomials evaluated with
Horner's method on a chain of DSPFP32 FMA units. With the default degree of 2,
this uses two cascaded DSPs and one RAMB18 coefficient ROM per PE, giving
single-cycle-per-element throughput. Per-function configuration, including
clamping behaviour and polynomial coefficients, is delivered through a
SystemVerilog package (``pwpolyf_pkg``) using a ``func_cfg_t`` struct.
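
As a minimal sketch of the evaluation order (plain Python, not the RTL),
Horner's method needs one multiply-add per polynomial degree, which is why a
degree-2 polynomial maps onto two cascaded FMA stages. The function name
``horner_fma`` and the highest-order-first coefficient ordering are
illustrative assumptions, not taken from the FINN sources.

.. code-block:: python

   def horner_fma(coeffs, x):
       """Evaluate a polynomial with Horner's method.

       coeffs is ordered highest power first (assumed ordering), so a
       degree-d polynomial has d + 1 coefficients and needs d FMA steps.
       """
       acc = coeffs[0]
       for c in coeffs[1:]:
           acc = acc * x + c  # one fused multiply-add stage
       return acc


   # 2*x^2 - 3*x + 1 at x = 2: two FMA steps for degree 2
   value = horner_fma([2.0, -3.0, 1.0], 2.0)

Each loop iteration corresponds to one stage of the chain, so raising the
degree adds one stage per PE.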

The input domain is partitioned into ``1 + 2*5*(2^K)`` segments: one near-zero
region, positive octave sub-segments, and negative mirrors. With the default
``K=3`` this gives 81 segments. Segment selection reuses the FP32 exponent and
mantissa bit fields directly, matching the RTL implementation.
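
The segment count and the bit fields involved can be sketched in plain Python.
This is an illustration only: reading the factor 5 as the number of octaves per
sign is an interpretation of the formula above, and the helper names are
hypothetical. How the RTL combines the fields into a ROM address is not shown
here.

.. code-block:: python

   import struct


   def segment_count(k=3, octaves=5):
       """1 near-zero segment + 2 signs * octaves * 2**k sub-segments."""
       return 1 + 2 * octaves * (1 << k)


   def fp32_fields(x, k=3):
       """Return (sign, biased exponent, top-k mantissa bits) of x as FP32."""
       bits = struct.unpack("<I", struct.pack("<f", x))[0]
       sign = bits >> 31
       exponent = (bits >> 23) & 0xFF
       mantissa_top = (bits >> (23 - k)) & ((1 << k) - 1)
       return sign, exponent, mantissa_top

With ``K=3``, the top three mantissa bits split each octave into eight equal
sub-segments, so ``1.0`` and ``1.125`` fall into neighbouring sub-segments of
the same octave.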

Polynomial coefficients are generated at HDL build time by
``PWPolyF_rtl._generate_coeffs_pkg()``, which fits polynomials of the
configured degree to the reference PyTorch functions and writes
``pwpolyf_pkg.sv``. Both ``K`` and ``degree`` are configurable. They default to
``K=3`` and ``degree=2`` when inferred from standard ONNX ops.
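
To give a feel for the accuracy of a degree-2 piece on one sub-segment, the
sketch below builds a parabola for GELU on ``[1.0, 1.125]``. It is not the
procedure used by ``_generate_coeffs_pkg()``: it interpolates through three
points instead of fitting, and it uses ``math.erf`` as the reference instead
of PyTorch. All names are illustrative.

.. code-block:: python

   import math


   def gelu(x):
       return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


   def quadratic_through(f, a, b):
       """Return (c2, c1, c0) of the parabola through f at a, midpoint, b."""
       x0, x1, x2 = a, 0.5 * (a + b), b
       f0, f1, f2 = f(x0), f(x1), f(x2)
       d01 = (f1 - f0) / (x1 - x0)       # first divided differences
       d12 = (f2 - f1) / (x2 - x1)
       d012 = (d12 - d01) / (x2 - x0)    # second divided difference
       c2 = d012
       c1 = d01 - d012 * (x0 + x1)
       c0 = f0 - d01 * x0 + d012 * x0 * x1
       return c2, c1, c0


   def max_abs_error(f, coeffs, a, b, samples=257):
       c2, c1, c0 = coeffs
       worst = 0.0
       for i in range(samples):
           x = a + (b - a) * i / (samples - 1)
           approx = (c2 * x + c1) * x + c0  # Horner form, two multiply-adds
           worst = max(worst, abs(approx - f(x)))
       return worst


   coeffs = quadratic_through(gelu, 1.0, 1.125)
   err = max_abs_error(gelu, coeffs, 1.0, 1.125)

A least-squares or minimax fit would normally do somewhat better than plain
interpolation on the same segment.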

Architecture
------------

PWPolyF is RTL-only, with no HLS variant, and targets Versal devices only. The
RTL instantiates the Versal DSPFP32 primitive, so UltraScale+ and older parts
must not be specialized to this backend.

Two export paths are supported:

.. code-block:: text

   Path A: PiecewisePolyActivation        Path B: nn.GELU / nn.SiLU / etc.
       | torch.onnx.export                    | torch.onnx.export
       | (dynamo=False)                       | (dynamo=True or False)
       v                                      v
   PWPolyF custom ONNX node               Standard ONNX ops (Gelu, Sigmoid,
       |                                  Tanh, Sigmoid+Mul for SiLU,
       |                                  Div+Erf+Add+Mul+Mul for GELU)
       |                                      |
       +------------- both paths -------------+
                          |
                  InferPWPolyFLayer
                          v
       PWPolyF HW op (finn.custom_op.fpgadataflow)
                          | SpecializeLayers
                          v
       PWPolyF_rtl (finn.custom_op.fpgadataflow.rtl)
                          | generate_hdl
                          v
       finn-rtllib/pwpolyf/hdl/ SystemVerilog IP

Standard ONNX Op Inference
--------------------------

``InferPWPolyFLayer`` recognises standard ONNX activation ops in addition to
the explicit ``PWPolyF`` custom op. This allows models that use ``nn.GELU``,
``nn.SiLU``, ``nn.Sigmoid``, or ``nn.Tanh`` to be exported with ``dynamo=True``
or ``dynamo=False`` and automatically converted to PWPolyF HW layers.

.. list-table::
   :header-rows: 1
   :widths: 20 45 20

   * - ONNX op type
     - Pattern
     - Maps to
   * - ``Gelu`` (opset 20+)
     - Single node
     - ``func="gelu"``
   * - ``Div`` + ``Erf`` + ``Add`` + ``Mul`` + ``Mul``
     - ``x * 0.5 * (1 + erf(x / sqrt(2)))``
     - ``func="gelu"``
   * - ``Sigmoid``
     - Single node (standalone)
     - ``func="sigmoid"``
   * - ``Tanh``
     - Single node
     - ``func="tanh"``
   * - ``Sigmoid`` + ``Mul``
     - ``Mul(x, Sigmoid(x))``
     - ``func="silu"``

``Gelu`` as a single ONNX node requires opset 20 or later. With lower opsets,
including the opset 18 that ``dynamo=True`` export defaults to, GELU decomposes
into a 5-node Erf-based pattern. Both forms are matched. SiLU has no standard
ONNX op and decomposes to ``Sigmoid(x) * x``. Only FLOAT32 inputs are converted.
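
The two decomposed patterns can be written out node by node in plain Python.
This is a numerical illustration of the formulas in the table above, not the
matcher in ``InferPWPolyFLayer``. The exact node order in a real export may
differ, and the function names are hypothetical.

.. code-block:: python

   import math


   def gelu_erf_pattern(x):
       """GELU as Div + Erf + Add + Mul + Mul."""
       d = x / math.sqrt(2.0)   # Div
       e = math.erf(d)          # Erf
       a = 1.0 + e              # Add
       m = x * 0.5              # Mul
       return m * a             # Mul


   def silu_pattern(x):
       """SiLU as Sigmoid + Mul."""
       s = 1.0 / (1.0 + math.exp(-x))  # Sigmoid
       return x * s                    # Mul

Both patterns collapse to a single function of ``x``, which is why each can be
replaced by one PWPolyF node with the matching ``func`` value.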

Folding
-------

PWPolyF uses PE parallelism. ``NumChannels % PE == 0`` must hold. Each PE
instantiates its own polynomial evaluation pipeline with ``degree`` DSPs.
``SetFolding`` handles PE selection automatically.

.. list-table::
   :header-rows: 1
   :widths: 10 10 15 15 15 25

   * - PE
     - Degree
     - DSPs
     - BRAM18s
     - Approx LUTs
     - Cycles per spatial position
   * - 1
     - 2
     - 2
     - 1
     - 200
     - NumChannels
   * - C
     - 2
     - 2C
     - C
     - 200C
     - 1
   * - 1
     - 3
     - 3
     - 2
     - 300
     - NumChannels
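
The folding constraint and the cycle counts in the table can be sketched as
follows. The helpers are hypothetical and only restate the rule
``NumChannels % PE == 0``. The ``NumChannels / PE`` cycle count is inferred
from the two table rows (``PE=1`` and ``PE=C``), not taken from the FINN
sources.

.. code-block:: python

   def valid_pe_values(num_channels):
       """All PE values that satisfy NumChannels % PE == 0."""
       return [pe for pe in range(1, num_channels + 1)
               if num_channels % pe == 0]


   def cycles_per_position(num_channels, pe):
       """Cycles to process one spatial position at the given folding."""
       if num_channels % pe != 0:
           raise ValueError("NumChannels must be divisible by PE")
       return num_channels // pe

For example, 12 channels allow ``PE`` values of 1, 2, 3, 4, 6, and 12, and
``PE=4`` takes three cycles per spatial position under this reading.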

Resource Estimates
------------------

* DSP: ``degree * PE`` (one FP32 FMA stage per polynomial degree per PE)
* LUT: approximately ``100 * degree * PE`` for segment address decode and
  control
* BRAM18: ``(degree - 1) * PE`` for default ``K=3``. Vivado infers delayed
  coefficient lookups as 32-bit ROMs.
* URAM: 0
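
The estimates above can be collected into one small helper. This is a sketch
of the stated formulas only. The function name is hypothetical, the LUT figure
is approximate, and the BRAM18 formula is given in the text for the default
``K=3`` only.

.. code-block:: python

   def estimate_resources(degree=2, pe=1):
       """Resource estimate per the formulas above (BRAM18 assumes K=3)."""
       return {
           "DSP": degree * pe,
           "LUT": 100 * degree * pe,      # approximate
           "BRAM18": (degree - 1) * pe,   # stated for K=3 only
           "URAM": 0,
       }

The results reproduce the rows of the folding table: ``degree=2, PE=1`` gives
2 DSPs, 1 BRAM18, and about 200 LUTs, while ``degree=3, PE=1`` gives 3 DSPs,
2 BRAM18s, and about 300 LUTs.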

ONNX Export
-----------

Two export paths are supported:

* ``PiecewisePolyActivation`` exports as a single ``PWPolyF`` custom op via
  ``torch.autograd.Function.symbolic()``. It requires ``dynamo=False`` and
  preserves the ``K`` attribute on the ONNX node.
* Standard PyTorch modules (``nn.GELU``, ``nn.SiLU``, ``nn.Sigmoid``,
  ``nn.Tanh``) export with ``dynamo=True`` or ``dynamo=False`` and produce
  standard ONNX ops that ``InferPWPolyFLayer`` converts to PWPolyF with
  default ``K=3``.

Attributes on the explicit PWPolyF ONNX node are:

* ``func``: one of ``gelu``, ``silu``, ``sigmoid``, ``tanh``
* ``K``: mantissa subdivision bits, default 3

Node Attributes
---------------

.. list-table::
   :header-rows: 1
   :widths: 25 15 45

   * - Attribute
     - Type
     - Description
   * - ``func``
     - string
     - Activation function name
   * - ``K``
     - int
     - Mantissa subdivision bits, default 3
   * - ``degree``
     - int
     - Polynomial degree / FMA stages, default 2
   * - ``NumChannels``
     - int
     - Number of channels in the last input dimension
   * - ``PE``
     - int
     - Processing elements
   * - ``inputDataType``
     - string
     - Input data type, always FLOAT32
   * - ``outputDataType``
     - string
     - Output data type, always FLOAT32
   * - ``numInputVectors``
     - ints
     - Batch/spatial dimensions

Supported Functions
-------------------

.. list-table::
   :header-rows: 1
   :widths: 20 20 30

   * - Function
     - Negative clamp
     - Positive behaviour
   * - GELU
     - 0.0
     - passthrough (``y=x``)
   * - SiLU
     - 0.0
     - passthrough (``y=x``)
   * - Sigmoid
     - 0.0
     - clamp to 1.0
   * - Tanh
     - -1.0
     - clamp to 1.0
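
The out-of-range behaviour in the table can be sketched as a wrapper around
the polynomial evaluation. The bounds ``lo`` and ``hi`` are hypothetical
parameters: this document does not state where the segmented range ends. The
``poly`` callable stands in for the per-segment polynomial.

.. code-block:: python

   # (negative clamp value, positive clamp value or None for passthrough)
   CLAMP = {
       "gelu": (0.0, None),
       "silu": (0.0, None),
       "sigmoid": (0.0, 1.0),
       "tanh": (-1.0, 1.0),
   }


   def saturate(func, x, lo, hi, poly):
       """Apply the per-function clamp outside [lo, hi], else evaluate poly."""
       neg, pos = CLAMP[func]
       if x < lo:
           return neg
       if x > hi:
           return x if pos is None else pos
       return poly(x)

GELU and SiLU approach ``y=x`` for large positive inputs, so they pass the
input through, while Sigmoid and Tanh saturate at 1.0.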

Files
-----

Python files:

.. list-table::
   :header-rows: 1
   :widths: 35 50

   * - File
     - Purpose
   * - ``util/torch_hw_modules.py``
     - PyTorch activation module, ONNX export, software simulation
   * - ``custom_op/fpgadataflow/pwpolyf.py``
     - Base HW op for shape, folding, resource estimates, cppsim
   * - ``custom_op/fpgadataflow/rtl/pwpolyf_rtl.py``
     - RTL backend for HDL generation, package generation, rtlsim, IPI
   * - ``util/pwpolyf.py``
     - Compatibility imports for existing PWPolyF utility users
   * - ``transformation/fpgadataflow/convert_to_hw_layers.py``
     - ``InferPWPolyFLayer`` transformation
   * - ``builder/build_dataflow_steps.py``
     - Build pipeline integration
   * - ``transformation/fpgadataflow/set_folding.py``
     - Folding support

RTL files:

.. list-table::
   :header-rows: 1
   :widths: 35 50

   * - File
     - Purpose
   * - ``finn-rtllib/pwpolyf/hdl/pwpolyf_pkg.sv``
     - ``func_cfg_t`` struct per activation, regenerated per K
   * - ``finn-rtllib/pwpolyf/hdl/pwpolyf.sv``
     - Polynomial evaluation pipeline using a Horner chain on DSPFP32
   * - ``finn-rtllib/pwpolyf/hdl/queue.sv``
     - Elastic FIFO for backpressure
   * - ``finn-rtllib/pwpolyf/hdl/pwpolyf_template_wrapper.v``
     - AXI-Stream wrapper template

Tests
-----

``tests/fpgadataflow/test_fpgadataflow_pwpolyf.py`` covers:

* cppsim for all supported functions, channel counts, spatial shapes, and
  foldings
* ONNX export for the explicit ``PiecewisePolyActivation`` path
* ``InferPWPolyFLayer`` conversion and execution
* standard op inference for Gelu, Sigmoid, Tanh, SiLU, and Erf-based GELU
* execution correctness against ``PiecewisePolyActivation``
* Versal-only specialization checks
* resource estimates, folded shapes, and expected cycles
* coefficient package generation for ``K`` and ``degree``
* Vivado HDL generation, RTL simulation, and stitched IP simulation
3 changes: 3 additions & 0 deletions docs/finn/reference/folding-constraints.rst
@@ -68,6 +68,9 @@ Constraint Table
   * - Pool
     - PE
     - inp_channels % PE == 0
   * - PWPolyF
     - PE
     - NumChannels % PE == 0
   * - Thresholding
     - PE
     - MH % PE == 0
9 changes: 9 additions & 0 deletions docs/finn/source_code/finn.custom_op.fpgadataflow.rst
@@ -136,6 +136,15 @@ finn.custom\_op.fpgadataflow.pool
   :undoc-members:
   :show-inheritance:

finn.custom\_op.fpgadataflow.pwpolyf
--------------------------------------

.. automodule:: finn.custom_op.fpgadataflow.pwpolyf
   :members:
   :undoc-members:
   :show-inheritance:


finn.custom\_op.fpgadataflow.streamingdataflowpartition
--------------------------------------------------------

8 changes: 8 additions & 0 deletions docs/finn/source_code/finn.custom_op.fpgadataflow.rtl.rst
@@ -45,6 +45,14 @@ finn.custom\_op.fpgadataflow.streamingfifo\_rtl
   :undoc-members:
   :show-inheritance:

finn.custom\_op.fpgadataflow.pwpolyf\_rtl
--------------------------------------------

.. automodule:: finn.custom_op.fpgadataflow.rtl.pwpolyf_rtl
   :members:
   :undoc-members:
   :show-inheritance:

finn.custom\_op.fpgadataflow.thresholding\_rtl
-------------------------------------------------------

18 changes: 18 additions & 0 deletions docs/finn/source_code/finn.util.rst
@@ -188,6 +188,24 @@ finn.util.pytorch
   :show-inheritance:


finn.util.torch_hw_modules
---------------------------

.. automodule:: finn.util.torch_hw_modules
   :members:
   :undoc-members:
   :show-inheritance:


finn.util.pwpolyf
-------------------

.. automodule:: finn.util.pwpolyf
   :members:
   :undoc-members:
   :show-inheritance:


finn.util.test
---------------------

5 changes: 5 additions & 0 deletions finn-rtllib/pwpolyf/hdl/pwpolyf.abc
@@ -0,0 +1,5 @@
import queue
read_sv pwpolyf_pkg.sv
read_sv pwpolyf.sv
setup_tb pwpolyf_tb
setup_top pwpolyf