GSoC 2026: Project#5 Investigation on Quantized Models on M4 Max #34735
-
@Passavee-Losripat thank you for reviewing this topic!
-
I added a new section to the README, please take a look: https://github.com/alvoron/gsoc-2026-openvino/tree/main?tab=readme-ov-file#6-start-addressing-a-technical-gap-optional
-
@alvoron As a follow-up, I implemented the fix and opened a Draft PR here: #34931. Summary of what I implemented:

However, I still cannot get `acl_i8` to appear in the execution graph, so I traced further down the pipeline and found that
-
Hello @Passavee-Losripat, thank you for your contribution and for the effort you’ve already put into this task! Based on the observations you’ve shared, it looks like the pipeline breaks before LPT is actually applied. For us, it’s important that weights are still in int8 form (with a dequantization subgraph) at the moment LPT starts. Regarding your questions:
For further debugging, you may also find the debug capabilities helpful:

Please feel free to ask further questions or share additional findings.
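A minimal sketch of how that int8-weights condition can be checked on a dumped pre-LPT IR with the Python API (`pre_lpt.xml` is only a placeholder for whichever intermediate IR you dump):

```python
import openvino as ov

core = ov.Core()
# "pre_lpt.xml" is a placeholder for an intermediate IR dumped right before LPT runs
model = core.read_model("pre_lpt.xml")

for op in model.get_ordered_ops():
    # Healthy quantized weights stay as i8/u8 Constants feeding a
    # Convert (-> Subtract) -> Multiply dequantization subgraph
    if op.get_type_name() != "Constant":
        continue
    elem = op.get_output_element_type(0)
    if elem in (ov.Type.i8, ov.Type.u8):
        consumers = [inp.get_node().get_type_name() for inp in op.output(0).get_target_inputs()]
        print(op.get_friendly_name(), elem.get_type_name(), "->", consumers)
```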
-
@v-Golubev Thank you for your guidance! I just rechecked with

I also want to correct my earlier statement that "inputs at `ir_1_preLpt_out` are already FP16". I was measuring Conv activation port precisions, not weight constants. The activations properly become FP16 via

For clarification, I want to share the in-depth results of PR #34931:

**IR Stage Analysis**
i8 weight constants now survive through LPT with PR #34931.

**Runtime Execution Graph Analysis**

There is an improvement in i8 layer presence in the runtime graph.
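For reference, a rough sketch of how this runtime-graph breakdown can be reproduced with the Python API (not the exact analysis script; `model.xml` is a placeholder, and `layerType` / `primitiveType` / `originalLayersNames` are the runtime-info keys the exec graph exposes):

```python
import collections
import openvino as ov

core = ov.Core()
# "model.xml" is a placeholder for the quantized IR under investigation
compiled = core.compile_model(core.read_model("model.xml"), "CPU")

prim_types = collections.Counter()
convs_without_fq = []
for node in compiled.get_runtime_model().get_ordered_ops():
    rt = node.get_rt_info()  # exec-graph nodes carry layerType, primitiveType, originalLayersNames, ...
    prim_types[str(rt["primitiveType"])] += 1
    # Convolutions whose fused history never mentions FakeQuantize cannot report a quantization post-op
    if str(rt["layerType"]) == "Convolution" and "FakeQuantize" not in str(rt["originalLayersNames"]):
        convs_without_fq.append(node.get_friendly_name())

print(prim_types.most_common())
print(len(convs_without_fq), "Convolution nodes without FakeQuantize in originalLayersNames")
```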
49 convolutions have i8 weights correctly loaded but are still executing as `gemm_acl_f32`. This requires further investigation.

**49 Convolution Investigation**

I added debug logging to
-
@Passavee-Losripat please don't forget to submit your proposal officially by the 31st of March.
-
> Implementing a

I suppose we don't need to propagate quantization through Swish. Instead, Swish should be fused as a post-op into the convolution, which is handled by
-
Thank you for the detailed direction! I was looking into case

At first I believed the cause was that

However, since the problem is likely to come from the Swish node blocking the pipeline, I am currently looking into
-
I added debug logging into

However, in order to fuse conv's child as post-ops,

My plan for solving this is to mark the
-
@Passavee-Losripat thanks for the updates
This is exactly the right approach in this case.
Good point. But
-
Following up on the `snippets_mark_skipped.cpp` approach, I implemented

However, the result gets worse: the 82

I also added

So I am wondering if we should
-
Hi @alvoron and @v-Golubev!
My name is Passavee Losripat; I'm a third-year Computer Science student at KAIST, and I'm interested in contributing to Project #5, Optimize Quantized Model Inference Performance on ARM Devices with OpenVINO, for GSoC 2026. I've been following your initial repository as preparation for my GSoC proposal and wanted to share my investigation findings.
**Contributions to OpenVINO**
- `vectorize()` and fixing `power()` dtype promotion in the Keras OpenVINO backend

**Setup Details:**
I have built OpenVINO 2026.1 from source with `ENABLE_DEBUG_CAPS=ON` on Apple M4 Max. The full analysis script and results can be found in `scripts/analyze_graph.py`.

**My Investigation**
After inspecting the exec graphs of both models, we can confirm that INT8 ACL kernels are not used:
Noticing that there is no `acl_i8` in either of the models shows that both models run identically to FP32. First, I confirmed that the original quantized model genuinely contains INT8 weights (`I8: 156` port precisions), so the problem is inside OpenVINO, not the model itself.

Result from tracing Conv port precision through all transformation stages:
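Such a trace can be reproduced roughly by printing every Convolution's input-port precisions for each dumped intermediate IR; a minimal sketch (the stage file names are placeholders, not the actual dump names):

```python
import openvino as ov

core = ov.Core()
# Placeholder names for IRs dumped at successive transformation stages
for stage in ["ir_0.xml", "ir_1.xml"]:
    print(f"== {stage} ==")
    for op in core.read_model(stage).get_ordered_ops():
        if op.get_type_name() == "Convolution":
            ports = [op.input(i).get_element_type().get_type_name() for i in range(op.get_input_size())]
            print(f"  {op.get_friendly_name()}: {ports}")
```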
After further comparing op counts between `ir_0` and `ir_1`, the single Subtract node disappears in the same step where Conv precision changes from FP32 to FP16. This is consistent with the described technical gap: the zero-point Subtract on the activation path is not being folded into ACL `QuantizationInfo`, causing the pre-LPT pass to fall back to FP16 instead.
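The op-count comparison itself can be done along these lines (a sketch; `ir_0.xml` / `ir_1.xml` stand in for the two dumped stages):

```python
import collections
import openvino as ov

def op_counts(path):
    # Count op types in a dumped intermediate IR
    model = ov.Core().read_model(path)
    return collections.Counter(op.get_type_name() for op in model.get_ordered_ops())

before, after = op_counts("ir_0.xml"), op_counts("ir_1.xml")
for op_type in sorted(set(before) | set(after)):
    if before[op_type] != after[op_type]:
        # A vanished Subtract, for example, shows up here
        print(f"{op_type}: {before[op_type]} -> {after[op_type]}")
```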
After checking `originalLayersNames` across all 101 Conv nodes in the exec graph, I found that FakeQuantize never appears in any of them. This means `hasQuantizationPostOp=false` in `ACLConvolutionExecutor::supports()`, which causes it to return false and fall back to FP16. This is likely due to the described pattern gap where `Conv → Multiply → Add → Swish → FakeQuantize` is not recognized because Swish is between Add and FakeQuantize.

**Questions**
After going through the code in `acl_conv.cpp`, I found that `activationLayerInfo` and the FakeQuantize scales are set in mutually exclusive branches in the constructor, so currently you can fuse either an activation or a FakeQuantize post-op. So for the `Conv -> Swish -> FakeQuantize` pattern, if Swish is handled as `activationLayerInfo`, the FQ scales (`fqInputScale` and `fqOutputScale`) will never be set, and `getDstQuantizationInfo` in `validateTensorsInfo` gets empty scales. So I wonder: should we handle both simultaneously by passing Swish to `activationLayerInfo` while also setting the FQ scales, or does this require ACL's `NEConvolutionLayer` to natively support fused activation + requantization together in a single INT8 kernel?