Multiple streamlined accelerators (streaming partitions) per design, reading from single AXI HP port #1555
Hello all! I want to ask if anybody has worked on a similar data-handling case as mine and reached successful results. Idea: Question: Best regards,
Replies: 2 comments 3 replies
Hi, thanks for your interest in the project!
So you want to instantiate multiple instances of the same FINN accelerator and feed data in and out via DMA? Then it should be possible to connect multiple of the DMA IP cores generated by FINN to one or more SmartConnects and slightly adjust the FINN-generated driver to control multiple DMA cores (instead of just looking for "idma_0", "odma_0").
Please note that there is an outdated PR (#789) that introduces additional parallelism (along "samples"/"pixels"), which could be significantly more efficient than instantiating an accelerator with lower parallelism multiple times, due to non-linear resource scaling effects. We used it for this paper: https://ieeexplore.ieee.org/document/9933377
I believe @lstasytis looked at a revival of that PR recently. Maybe he can chime in and point you somewhere.
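A minimal sketch of what the driver adjustment could look like, assuming a PYNQ-style `ip_dict` keyed by instance name; the extra instance names beyond "idma_0"/"odma_0" are hypothetical:

```python
import re

def find_dma_instances(ip_dict):
    """Collect all FINN input/output DMA instances from a PYNQ-style
    overlay ip_dict, instead of hard-coding 'idma_0' and 'odma_0'."""
    idmas = sorted(k for k in ip_dict if re.fullmatch(r"idma_\d+", k))
    odmas = sorted(k for k in ip_dict if re.fullmatch(r"odma_\d+", k))
    return idmas, odmas

# Hypothetical ip_dict of a design with two accelerator instances:
example = {"idma_0": None, "odma_0": None,
           "idma_1": None, "odma_1": None, "axi_smc_0": None}
print(find_dma_instances(example))
# -> (['idma_0', 'idma_1'], ['odma_0', 'odma_1'])
```

The driver would then loop over both lists when starting transfers, rather than addressing a single pair of DMA engines.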
Hi all,
Many thanks for the advice and support.
Before going into M>1, I am encountering another issue I have found hard to solve over the past few days. It is related to the correctness of the preprocessing/postprocessing of the CNN's input/output. I have read all the discussions and examples, but my model still does not output the expected data (high outputs in the order of thousands) in the real FPGA test. It performs well when tested before the build and deploy.
My main concern has been the fusing of the add/mul into thresholds, and for that reason I have tried to copy the CNV example, using a preprocessing step UINT8 -> float [-1, 1] and a QuantIdentity, the same as in the example:
```python
def forward(self, x):
    x = x * (2.0 / 255.0)
    x = x - 1.0
    return x

self.input_quant = QuantIdentity(
    bit_width=4,
    min_val=-1.0,
    max_val=1.0 - 2.0 ** (-7),
    narrow_range=False,
    restrict_scaling_type=RestrictValueType.POWER_OF_TWO,
    return_quant_tensor=True,
)
```
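As a sanity check, the preprocessing above maps the UINT8 endpoints as expected; this is a small sketch in plain numpy, independent of Brevitas:

```python
import numpy as np

# Same preprocessing as forward() above: UINT8 -> float in [-1, 1]
def preprocess(x):
    x = x * (2.0 / 255.0)
    return x - 1.0

pixels = np.array([0.0, 128.0, 255.0])
out = preprocess(pixels)
print(out)  # 0 maps to -1.0 and 255 maps to +1.0
```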
My CNN has a 3-channel input, 3 conv layers, 1 depthwise-separable conv, a global average pool (QuantAvgPool2d, k = image size), and a 1x24 1D conv classifier at the end, resulting in a binary output (1/0), using build settings from the MobileNet v1 w4a4 quant example.
I believe I managed to get my input solved; I am attaching the ONNX shapes of the QONNX-to-FINN stage and of the final bitstream:
[images: ONNX graph shapes at the QONNX-to-FINN stage and for the final bitstream]
My output is currently the result of the last conv layer, 8-bit weights x 8-bit activations -> INT24 output. I have been trying to postprocess in software, as I do not have any thresholds at the end, but even when applying the omitted Mul blocks in software, the numbers are very big and do not correspond to the software results in any case.
Do you have any general advice on how to debug and search for a solution?
Is there any specific order of the Mul/Add blocks with respect to the MultiThresholds that I have to keep? I believe this might be causing the weird outputs.
Any advice on output blocks in the case of a binary decision? I saw that Top-K might not be ideal.
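To clarify what I mean by applying the omitted Mul in software, here is a minimal sketch; the scale value is a hypothetical placeholder, not the one from my actual model (in practice it would come from the Mul initializers removed during streamlining):

```python
import numpy as np

# Hypothetical combined scale of the Mul node(s) streamlined out of
# the graph; in practice, read it from the removed Mul initializers.
OUTPUT_SCALE = 2.0 ** -12

def postprocess(raw_int_out, scale=OUTPUT_SCALE):
    """Dequantize the raw INT24 accumulator output and reduce the
    single-logit classifier output to a binary 1/0 decision."""
    logits = raw_int_out.astype(np.float64) * scale
    # For a single logit, decide on the sign; this replaces Top-K.
    return logits, int(logits.ravel()[0] > 0.0)

logits, decision = postprocess(np.array([40960], dtype=np.int32))
print(logits, decision)  # scaled-down logit and binary decision
```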
I am attaching my QONNX export script for reference, which contains the model.
I am completing my MSc defence at TU Delft soon and will be very happy to add the model to the examples if desirable. Again, huge thanks for the support.
Best regards,
Daniel
Hi @Daniel2291 <https://github.com/Daniel2291>, so we've been doing some work on using MMV for our own model (exact same situation: saturated SIMD/PE, but we wanted to push for higher throughput).
There is a forked dev branch in which we implemented modifications to the Python/compiler side of FINN to support this parameter for the MVAU and Thresholding layers. We just haven't gotten around to consolidating it into a nice clean PR to push upstream yet.
Note that if you want to use the M parameter for the Thresholding layer, you would also need to incorporate a modification made to the hlslib repository (link: https://github.com/MrMudkip9352/finn-hlslib_mmv ).
The FINN-side branch of changes in question: https://github.com/MrMudkip9352/finn_mmv/tree/dev