
Conversation

@lstasytis

FIFO sizing is an extremely time-consuming process in terms of CPU cycles, because it currently requires RTL simulation of the model to determine FIFO depths by tracking the model's runtime behavior.

Currently, FINN uses two main approaches for FIFO sizing:

  • The global sizing algorithm, which incrementally tightens FIFO sizes while rerunning RTLSIM of the entire model until a steady state is reached (AutoFIFOSizingMethod.LARGEFIFO_RTLSIM).
  • The characteristic-function based approach, which RTL-simulates individual nodes, constructs a characteristic function for each one, and then uses phase-shifting to determine an optimal FIFO size by finding how many cycles the input characteristic function must be shifted forward before it reaches a steady state with the output stream (AutoFIFOSizingMethod.CHARACTERIZE).

Between these two options, the latter (characteristic sizing) can be dramatically sped up by computing the characteristic functions of individual nodes analytically, rather than with RTLSIM.
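For intuition, here is a minimal sketch of the phase-shift idea (function and variable names are illustrative, not the PR's actual API): given cumulative token counts per clock cycle for the producer's output stream and the consumer's input stream, the consumer curve is delayed until it never overtakes the producer, and the FIFO must hold the worst-case backlog between the two.

```python
def fifo_depth_from_characteristics(producer, consumer):
    """producer/consumer: cumulative token counts per clock cycle
    (characteristic functions). Illustrative sketch only."""
    n = len(producer)

    def consumed(t, shift):
        # tokens the consumer has read by cycle t when starting `shift` cycles late
        return consumer[t - shift] if t >= shift else 0

    # smallest phase shift at which the consumer never tries to read
    # a token that has not been produced yet
    shift = next(s for s in range(n)
                 if all(consumed(t, s) <= producer[t] for t in range(n)))
    # the FIFO must buffer the worst-case backlog at that shift
    return max(producer[t] - consumed(t, shift) for t in range(n))
```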

This can be accomplished by manually reviewing the RTL (or HLS) code of each node and constructing a model Python function that reproduces only the loop behavior of the AXI stream reads and writes, filling in the characteristic function array.
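As an illustration, such a model function for a hypothetical simple streaming node (one read and one write per II cycles after an initial pipeline latency; not one of the PR's actual models) might look like:

```python
import numpy as np

def derive_characteristic(num_words, latency, ii, period):
    """Hypothetical analytic characteristic model: cumulative AXI stream
    reads/writes per cycle for a node with initiation interval `ii` and
    a fixed pipeline `latency`. The real nodes in the PR instead mirror
    the read/write loop structure of their own RTL/HLS code."""
    txn_in = np.zeros(period, dtype=np.int64)   # cumulative stream reads
    txn_out = np.zeros(period, dtype=np.int64)  # cumulative stream writes
    reads = writes = 0
    for cycle in range(period):
        if reads < num_words and cycle % ii == 0:
            reads += 1                                    # input read fires
        if cycle >= latency and writes < num_words and (cycle - latency) % ii == 0:
            writes += 1                                   # output write fires
        txn_in[cycle] = reads
        txn_out[cycle] = writes
    return txn_in, txn_out
```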

The PR includes these Python-based model functions as part of the custom_op classes of the following nodes:

  • channelwise_op
  • convolutioninputgenerator
  • fmpadding
  • labelselect
  • matrixvectoractivation
  • pool
  • streamingdatawidthconverter (generalized variant, very conservative estimate)
  • streamingmaxpool
  • thresholding
  • vectorvectoractivation

Additionally, it includes modifications to builder/build_dataflow_config.py and builder/build_dataflow_steps.py so that Vivado is not called in the FIFO sizing step unless there is no characteristic function for a node (in which case it is called to characterize only that node). This is achieved by introducing a new 'ipgen_ignore' node argument, which is set to true for all analytically characterized nodes once FIFO sizing starts and forces the FINN compiler to skip calling Vivado. The argument is set back to false once the analytic FIFO sizing is finished, allowing Vivado to be called again. A rough sketch of the mechanism follows.
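Conceptually (a sketch, not the PR's exact code; getCustomOp and set_nodeattr are the standard QONNX helpers, the attribute is assumed to be stored as an integer flag, and the detection heuristic is hypothetical):

```python
from qonnx.custom_op.registry import getCustomOp

def set_ipgen_ignore(model, value):
    # toggle the 'ipgen_ignore' flag on every node that has an
    # analytic characteristic model (hypothetical detection check)
    for node in model.graph.node:
        inst = getCustomOp(node)
        if hasattr(inst, "get_characteristic_fxn"):
            inst.set_nodeattr("ipgen_ignore", value)

# before analytic FIFO sizing: skip Vivado for characterized nodes
set_ipgen_ignore(model, 1)
# ... run analytic FIFO sizing ...
# afterwards: re-enable ipgen for the full hardware build
set_ipgen_ignore(model, 0)
```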

Improvements to be made:

  • The remaining nodes in FINN should be characterized as necessary.
  • There might exist parameter configurations for the convolutioninputgenerator and streamingmaxpool nodes where the characteristic function is inaccurate, since an exhaustive search is complicated to do automatically.
  • Currently, the FIFO sizing test checks the analytical function for exact equality with the RTLSIM output. However, small latencies introduced by HLS at the start or end of the characteristic function do not change the final FIFO sizes, so the test should compare the characteristic functions in a more relaxed manner (see the sketch after this list). This would then also allow an exhaustive test of all possible configurations for the nodes.
  • The characteristic FIFO sizing algorithm lacks support for branching networks. This is irrespective of the individual characteristic functions and should be improved in the main FIFO sizing function (transformation/derive_characteristic.py).
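A relaxed comparison could, for instance, accept the traces as equal up to a small constant phase shift (a sketch under that assumption; the tolerance value is illustrative):

```python
import numpy as np

def characteristics_match(analytical, rtlsim, max_shift=4):
    """True if two cumulative token-count traces agree up to a small
    phase shift at the start/end, which does not affect FIFO sizes."""
    a, b = np.asarray(analytical), np.asarray(rtlsim)
    n = min(len(a), len(b))
    for shift in range(-max_shift, max_shift + 1):
        lo, hi = max(0, shift), min(n, n + shift)
        if np.array_equal(a[lo - shift:hi - shift], b[lo:hi]):
            return True
    return False
```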

@lstasytis marked this pull request as ready for review on October 15, 2024 16:11
@fpjentzsch (Collaborator) left a comment:

I did not run the tests for thresholding & vvau yet.

In general we probably want to reduce the number of tests that go into the default suite. Currently, the largest numbers of test cases come from:

  • ConvInpGen: 20,736
  • Thresholding: 4,608
  • VVAU: 2,304

Maybe we can still include an extended test suite via pytest.fixture or pytest_generate_tests functions @auphelia?
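For example, pytest_generate_tests could gate the exhaustive grid behind an environment variable (a sketch; the flag name and parameter grids are made up):

```python
import itertools
import os

def pytest_generate_tests(metafunc):
    # keep the default suite small, expand only when explicitly requested
    if "cfg" in metafunc.fixturenames:
        if os.environ.get("FINN_EXTENDED_TESTS") == "1":
            grid = list(itertools.product(range(1, 9), range(1, 9)))
        else:
            grid = [(1, 2), (2, 4)]  # small representative subset
        metafunc.parametrize("cfg", grid)
```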

```python
],
)
@pytest.mark.parametrize("topology", ["tfc", "cnv"])
def test_fifosizing_linear(method, topology):
```
@fpjentzsch (Collaborator) commented on the code above:

After incorporating some of my other comments, this test fails for me in the cases [cnv, characterize_analytic] and [cnv, characterize_rtl] during "step_set_fifo_depths" with

Exception: Error in simulation! Takes too long to produce output. Consider setting the LIVENESS_THRESHOLD env.var. to a larger value.

and

%Warning: ./MVAU_hls_6_Matrix_Vector_Activate_Stream_Batch_p_ZL7threshs_0_ROM_AUTO_1R.dat:0: $readmem file not found

I don't have an explicit LIVENESS_THRESHOLD set, so it should default to 10k.

Does this PR depend on the new auto-folding PR or other PRs?
Could the "internal_decoupled" mode be the culprit (see my other comment)?

@lstasytis (Author) commented Jan 23, 2025:

With the MVAU fix now in, the simulation error no longer shows up for me; however, I still get the

$readmem file not found

warning as well, even though the tests all pass and I manually checked that each type of FIFO sizing strategy generates the same FIFO depths across the entire model. This warning is not present in the current /dev build, so something is breaking, even though it does not affect the FIFO sizing. I'm a bit stumped on what is going on here. If the missing mem files are replicas of the weight stream, then I guess they're not really relevant for the purposes of FIFO sizing, since those are parallel streams.

@fpjentzsch (Collaborator) commented:

Regarding moving the step_set_fifo_depths before codegen/ipgen (which you discussed before I believe @lstasytis @auphelia):
We had to fix this issue in FINN+ to allow node-by-node rtlsim after FIFO insertion: eki-project#85

Long-term, to aid design space exploration, we are thinking about splitting this step into a FIFO-Sizing and a FIFO-Insertion part. Or a "fast" (analytical FIFO-Sizing) and a "slow" (old FIFO-Sizing and Insertion) part. Or simply integrating fast analytical FIFO-Sizing into the folding step (so it can be used for early resource estimation) and doing FIFO-Insertion as part of the backend.

@lstasytis (Author) replied:

> Regarding moving the step_set_fifo_depths before codegen/ipgen (which you discussed before I believe @lstasytis @auphelia): We had to fix this issue in FINN+ to allow node-by-node rtlsim after FIFO insertion: eki-project#85
>
> Long-term, to aid design space exploration, we are thinking about splitting this step into a FIFO-Sizing and a FIFO-Insertion part. Or a "fast" (analytical FIFO-Sizing) and a "slow" (old FIFO-Sizing and Insertion) part. Or simply integrating fast analytical FIFO-Sizing into the folding step (so it can be used for early resource estimation) and doing FIFO-Insertion as part of the backend.

The issue with splitting the 'fast' and 'slow' flows is that they are not mutually exclusive: the generation of token access vectors (traces) is done on a per-node basis and uses either rtlsim or the tree model of a node, depending on a.) what is available and b.) what is chosen. This means there are going to be many situations where:

a.) a new node is introduced,
b.) a node is updated, or
c.) the tree model is determined to be inaccurate for sizing,

and in those cases it can be beneficial to perform the tree-based 'fast' flow for all nodes except the affected ones.

What I would push for, assuming we integrate these traces further into FINN as time goes on, is to decouple the trace generation from its use for FIFO sizing. Right now I already have them as two separate transformations, both performed in step_set_fifo_depths, but it could make sense to split them entirely. A step_generate_traces could then use either the fast or the slow backend depending on what the FIFO sizing step will need (large_rtlsim etc. would require all nodes to be synthesized; analytical sizing 'might', depending on a flag, generate traces with either the tree models or the simulations).

The traces have a lot of potential in FINN outside of FIFO sizing: we can use them to get very accurate performance models of nodes, characterize their behavior, get precise period lengths, etc. A sketch of such a decoupled flow follows.
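A minimal sketch of what that split might look like (assuming the DeriveCharacteristic and DeriveFIFOSizes transformations from derive_characteristic.py; the step boundaries and the cfg field are hypothetical):

```python
from finn.transformation.fpgadataflow.derive_characteristic import (
    DeriveCharacteristic,
    DeriveFIFOSizes,
)

def step_generate_traces(model, cfg):
    # hypothetical step: derive per-node traces, using tree models where
    # available and falling back to rtlsim otherwise (backend selection
    # happens inside the transformation)
    return model.transform(DeriveCharacteristic(cfg.analysis_period))

def step_set_fifo_depths(model, cfg):
    # hypothetical step: consume the stored traces to compute and apply
    # FIFO depths, independently of how the traces were generated
    return model.transform(DeriveFIFOSizes())
```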

@lstasytis force-pushed the feature/analytical-fifo-sizing branch from fa931a0 to 6a90df1 on August 29, 2025 15:44
@lstasytis (Author) commented Aug 29, 2025:

Pushed the changes to bring the PR up to date with the current version, using tree-based models of nodes and various heuristics to compute the final FIFO sizes (479b26e).

Missing parts to fix over the next few days from my side:
a.) streamingmaxpool was merged into pool, but I have not merged the tree models of pool and streamingmaxpool yet, so I am leaving pool.py without a tree model until I resolve this.
b.) there is a bug (most likely with convinputgenerator) for resnet50 where the produced FIFO size under tree-based characterization is about 2x higher than it should be. This is due to the refactoring efforts: there were two distinct versions of the CIG that fixed two different CIG variants, and merging them together broke one. Should get it fixed very quickly.
c.) pre-commit cleanup has not yet been performed on the PR, as it's a bit massive and I wanted to get it out now.
d.) removed the individual node characterization tests in favor of just updating the test_fifosizing.py tests; will update with larger tests soon.
e.) the entire derive_characteristic.py code should be refactored further; currently there are 3 different transformations for stretching TAVs which can all be fused into one.
f.) TAVs should be moved to separate JSON files which are stored in the build folder of a model. This is very low priority, however.


In general, the PR 'should' now be usable: it works with every layer present in finn-examples and still uses rtlsim for the unsupported ones. What is 100% not supported right now are networks with nodes that split into more than 2 branches; this is a relatively trivial limitation to fix, however. To enable, use: cfg.auto_fifo_strategy=AutoFIFOSizingMethod.ANALYTIC

For best results in terms of FIFO size at the cost of runtime, use:
cfg.tav_generation_strategy=TAVGenerationMethod.RTLSIM
For MUCH faster sizing at the potential loss of fifo size accuracy, use:
cfg.tav_generation_strategy=TAVGenerationMethod.TREE_MODEL
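Put together, a build config for the fast flow might look like this (a sketch; auto_fifo_strategy and tav_generation_strategy are the flags from this PR, the remaining fields are the usual DataflowBuildConfig minimum, and the part/clock values are placeholders):

```python
import finn.builder.build_dataflow_config as build_cfg

cfg = build_cfg.DataflowBuildConfig(
    output_dir="build_out",                # placeholder
    synth_clk_period_ns=5.0,               # placeholder
    fpga_part="xc7z020clg400-1",           # placeholder
    auto_fifo_depths=True,
    auto_fifo_strategy=build_cfg.AutoFIFOSizingMethod.ANALYTIC,
    tav_generation_strategy=build_cfg.TAVGenerationMethod.TREE_MODEL,
    generate_outputs=[],                   # FIFO sizing only, no hw outputs
)
```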

One can also toy around with the relaxation strategies by changing the tav_utilization_strategy flag; the relevant code is around lstasytis@479b26e#diff-07f10d5f59069cef1ce2958f585d358fb59f4ac3d64f978911128c280df94e2cR1342. Some strategies can produce FIFO sizes nearly identical to systematic search, but they have been shown to decrease throughput for certain models and are thus unreliable. I would recommend running the default conservative relaxation strategy to see the throughput, then attempting one of the more aggressive strategies and checking whether the throughput remains acceptable when the FIFO sizes are decreased further.

These are the expected FIFO size and runtime results (except resnet50, until I fix b.)). SimOnce = RTLSIM, Modeled = TREE_MODEL:

[image: FIFO size and runtime results per model]

@lstasytis force-pushed the feature/analytical-fifo-sizing branch from 4ce486c to 479b26e on August 29, 2025 16:10
@lstasytis force-pushed the feature/analytical-fifo-sizing branch from 91b47b3 to a836e28 on September 12, 2025 17:31
@fpjentzsch mentioned this pull request on Oct 6, 2025
@lstasytis force-pushed the feature/analytical-fifo-sizing branch 8 times, most recently from 3ad78c4 to 8f875cd on October 29, 2025 11:31
…ifo derivation transformations. Swapping fifo sizing step to before stitch currently breaks stitching
@lstasytis force-pushed the feature/analytical-fifo-sizing branch from 8f875cd to 04def38 on October 29, 2025 11:35
@lstasytis (Author) commented Nov 11, 2025:

> Pushed the changes to bring the PR up to date with the current version using tree-based models of nodes and various heuristics to compute the final fifo sizes (479b26e).
>
> Missing parts to fix over the next few days from my side: a.) streamingmaxpool was merged into pool, but I did not merge the tree models of pool and streamingmaxpool yet so I am leaving pool.py without a tree model until I resolve this. b.) there is a bug (most likely with convinputgenerator) for resnet50 where the produced fifo size under tree-based characterization is about 2x higher than it should be. This is due to the refactoring efforts as there were two distinct versions of the CIG that fixed two different CIG variants and merging them together broke one. Should get it fixed very quickly. c.) pre-commit cleanup not yet performed on the PR as it's a bit massive and wanted to get it out now. d.) removed individual node characterization tests in favor of just updating the test_fifosizing.py tests, will update with larger tests soon. e.) the entire derive_characteristic.py code should be refactored further, currently there are 3 different transformations for stretching TAVs which can all be fused into one. f.) TAVs should be moved to separate json files which are stored in the build folder of a model. This is very low priority however.

The PR has now undergone multiple passes with the majority of shortcomings fixed. Remaining are:

a.) The SWG model, while no longer a 'hacky' uniform-distribution approach but an elegant 'true to reality' characterization, is not extremely accurate, as it is very difficult to model how well an SWG overlaps reads with writes in parallel_window=0 cases. This needs to be looked at by someone like Thomas who understands how the internal buffer behaves under various configurations (depthwise 1 versus 0, an SWG over an IFM equal to the kernel in size, etc.). The results are now 40-50% worse than the uniform distribution for some cases. The rest of the tree models are extremely accurate, however.

f.) left for Jakoba to look at depending on how she might want to refactor the PR.

All models now have dedicated characterization tests under their respective main test case files. There is also an optional DEBUG parameter that generates a matplotlib plot comparing rtlsim vs tree-model TAVs for debugging convenience :) The plots get stored in a dedicated folder created wherever pytest is run. I also left in a CACHING parameter (default False) which caches models during pytests so that rtlsim only needs to be run once for each unique test case, massively speeding up any debugging efforts.

One additional change the PR could use (a quick one) is changing the tree-model 'error' metric from the current static 'maximum delta in TAV volume and length' to a percentage, e.g. 'the tree model's volume at a given timestep should not be off from the rtlsim volume by more than 10%'. I might add this if I find the time for it later, as it would make it much clearer how accurate the models are.
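A minimal sketch of such a relative check (function and argument names are illustrative):

```python
import numpy as np

def tav_within_tolerance(model_tav, rtlsim_tav, rel_tol=0.10):
    """Relative variant of the tree-model error check: the modeled
    cumulative volume at each timestep may deviate from rtlsim by at
    most rel_tol (e.g. 10%) instead of a fixed absolute delta."""
    m = np.asarray(model_tav, dtype=float)
    r = np.asarray(rtlsim_tav, dtype=float)
    n = min(len(m), len(r))
    denom = np.maximum(r[:n], 1.0)  # avoid division by zero at the start
    return bool(np.all(np.abs(m[:n] - r[:n]) / denom <= rel_tol))
```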

Lastly, I have a finn-examples fork which can be cleanly pulled and linked to this branch to test the FIFO sizing on finn-examples models rapidly. You just have to run python build.py in each respective model's build folder, or run the shell script in the parent folder. It skips all hardware generation and only runs FIFO sizing, then prints out the final size in KB plus the time taken in seconds. Results are up to 40-50% worse than in the fork version I used for the paper, driven entirely by the SWG.

A cheap fix to get very accurate yet still fast FIFO sizing is to comment out the tree model function in the SWG; the flow will then rtlsim only these nodes just-in-time and use the tree models for all the rest. This gives sizing quality practically identical to rtlsim while being only marginally slower than using tree models exclusively (well under 1h for mobilenet).
