GSoC 2026: Project#5 Investigation on Quantized Models on M4 Max #34735
-
@Passavee-Losripat thank you for reviewing this topic!
-
I added a new section to the README, please take a look: https://github.com/alvoron/gsoc-2026-openvino/tree/main?tab=readme-ov-file#6-start-addressing-a-technical-gap-optional
-
@alvoron As a follow-up, I implemented the fix and opened a Draft PR here: #34931. Summary of what I implemented:

However, I still cannot get `acl_i8` to appear in the execution graph, so I traced further down the pipeline and found that
-
Hello @Passavee-Losripat, thank you for your contribution and for the effort you’ve already put into this task! Based on the observations you’ve shared, it looks like the pipeline breaks before LPT is actually applied. For us, it’s important that weights are still in int8 form (with a dequantization subgraph) at the moment LPT starts. Regarding your questions:
For further debugging, you may also find the debug capabilities helpful:

Please feel free to ask further questions or share additional findings.
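A minimal sketch of how that int8-weights condition can be checked on a dumped pre-LPT IR with the Python API (`pre_lpt.xml` is only a placeholder for whichever intermediate IR you dump):

```python
import openvino as ov

core = ov.Core()
# "pre_lpt.xml" is a placeholder for an intermediate IR dumped right before LPT runs
model = core.read_model("pre_lpt.xml")

for op in model.get_ordered_ops():
    # Healthy quantized weights stay as i8/u8 Constants feeding a
    # Convert (-> Subtract) -> Multiply dequantization subgraph
    if op.get_type_name() != "Constant":
        continue
    elem = op.get_output_element_type(0)
    if elem in (ov.Type.i8, ov.Type.u8):
        consumers = [inp.get_node().get_type_name() for inp in op.output(0).get_target_inputs()]
        print(op.get_friendly_name(), elem.get_type_name(), "->", consumers)
```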
-
@v-Golubev Thank you for your guidance! I just rechecked with

I also want to correct my earlier statement that "inputs at `ir_1_preLpt_out` are already FP16". I was measuring Conv activation port precisions, not weight constants. The activations properly become FP16 via

For clarification, I want to share the in-depth results of PR #34931:

**IR Stage Analysis**
i8 weight constants now survive through LPT with PR #34931.

**Runtime Execution Graph Analysis**

There is an improvement in i8 layer presence in the runtime graph.
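For reference, a rough sketch of how this runtime-graph breakdown can be reproduced with the Python API (not the exact analysis script; `model.xml` is a placeholder, and `layerType` / `primitiveType` / `originalLayersNames` are the runtime-info keys the exec graph exposes):

```python
import collections
import openvino as ov

core = ov.Core()
# "model.xml" is a placeholder for the quantized IR under investigation
compiled = core.compile_model(core.read_model("model.xml"), "CPU")

prim_types = collections.Counter()
convs_without_fq = []
for node in compiled.get_runtime_model().get_ordered_ops():
    rt = node.get_rt_info()  # exec-graph nodes carry layerType, primitiveType, originalLayersNames, ...
    prim_types[str(rt["primitiveType"])] += 1
    # Convolutions whose fused history never mentions FakeQuantize cannot report a quantization post-op
    if str(rt["layerType"]) == "Convolution" and "FakeQuantize" not in str(rt["originalLayersNames"]):
        convs_without_fq.append(node.get_friendly_name())

print(prim_types.most_common())
print(len(convs_without_fq), "Convolution nodes without FakeQuantize in originalLayersNames")
```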
49 convolutions have i8 weights correctly loaded but are still executing as `gemm_acl_f32`. This requires further investigation.

**49 Convolution Investigation**

I added debug logging to
-
@Passavee-Losripat please don't forget to submit your proposal officially by the 31st of March.
-
> Implementing a

I suppose we don't need to propagate quantization through Swish. Instead, Swish should be fused as a post-op into the convolution, which is handled by
-
Thank you for the detailed direction! I was looking into case

At first I believed the cause was that

However, since the problem is likely to come from the Swish node blocking the pipeline, I am currently looking into
-
I added debug logging into

However, in order to fuse conv's child as post-ops,

My plan for solving this is to mark the
-
@Passavee-Losripat thanks for the updates
This is exactly the right approach in this case.
Good point. But
-
Following up on the `snippets_mark_skipped.cpp` approach, I implemented

However, the result gets worse: the 82

I also added

So I am wondering if we should
-
Hi @alvoron and @v-Golubev!
My name is Passavee Losripat; I'm a third-year Computer Science student at KAIST, and I'm interested in contributing to Project #5, Optimize Quantized Model Inference Performance on ARM Devices with OpenVINO, for GSoC 2026. I've been following your initial repository as preparation for my GSoC proposal and wanted to share my investigation findings.
**Contributions to OpenVINO**
- `vectorize()` and fixing `power()` dtype promotion in the Keras OpenVINO backend

**Setup Details:**
I have built OpenVINO 2026.1 from source with `ENABLE_DEBUG_CAPS=ON` on Apple M4 Max. The full analysis script and results can be found in `scripts/analyze_graph.py`.

**My Investigation**
After inspecting the exec graphs of both models, we can confirm that INT8 ACL kernels are not used:
Noticing that there is no `acl_i8` in either of the models shows that both models run identically to FP32. First, I confirmed that the original quantized model genuinely contains INT8 weights (`I8: 156` port precisions), so the problem is inside OpenVINO, not the model itself.

Result from tracing Conv port precision through all transformation stages:
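Such a trace can be reproduced roughly by printing every Convolution's input-port precisions for each dumped intermediate IR; a minimal sketch (the stage file names are placeholders, not the actual dump names):

```python
import openvino as ov

core = ov.Core()
# Placeholder names for IRs dumped at successive transformation stages
for stage in ["ir_0.xml", "ir_1.xml"]:
    print(f"== {stage} ==")
    for op in core.read_model(stage).get_ordered_ops():
        if op.get_type_name() == "Convolution":
            ports = [op.input(i).get_element_type().get_type_name() for i in range(op.get_input_size())]
            print(f"  {op.get_friendly_name()}: {ports}")
```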
After further comparing op counts between `ir_0` and `ir_1`, the single Subtract node disappears in the same step where Conv precision changes from FP32 to FP16. This is consistent with the described technical gap: the zero-point Subtract on the activation path is not being folded into ACL `QuantizationInfo`, causing the pre-LPT pass to fall back to FP16 instead.
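The op-count comparison itself can be done along these lines (a sketch; `ir_0.xml` / `ir_1.xml` stand in for the two dumped stages):

```python
import collections
import openvino as ov

def op_counts(path):
    # Count op types in a dumped intermediate IR
    model = ov.Core().read_model(path)
    return collections.Counter(op.get_type_name() for op in model.get_ordered_ops())

before, after = op_counts("ir_0.xml"), op_counts("ir_1.xml")
for op_type in sorted(set(before) | set(after)):
    if before[op_type] != after[op_type]:
        # A vanished Subtract, for example, shows up here
        print(f"{op_type}: {before[op_type]} -> {after[op_type]}")
```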
After checking `originalLayersNames` across all 101 Conv nodes in the exec graph, I found that FakeQuantize never appears in any of them. This means `hasQuantizationPostOp=false` in `ACLConvolutionExecutor::supports()`, which causes it to return false and fall back to FP16. This is likely due to the described pattern gap where `Conv → Multiply → Add → Swish → FakeQuantize` is not recognized because Swish is between Add and FakeQuantize.

**Questions**
After going through the code in `acl_conv.cpp`, I found that `activationLayerInfo` and the FakeQuantize scales are set in mutually exclusive branches in the constructor, so currently you can fuse either an activation or a FakeQuantize post-op. So for the `Conv -> Swish -> FakeQuantize` pattern, if Swish is handled as `activationLayerInfo`, the FQ scales (`fqInputScale` and `fqOutputScale`) will never be set, and `getDstQuantizationInfo` in `validateTensorsInfo` gets empty scales. So I wonder: should we handle both simultaneously by passing Swish to `activationLayerInfo` while also setting the FQ scales, or does this require ACL's `NEConvolutionLayer` to natively support fused activation + requantization together in a single INT8 kernel?