Description
vSomeip Version
v3.4.10
Boost Version
1.82
Environment
Android and QNX
Describe the bug
My automotive system has a *.fidl with ~3500 attributes, one per CAN signal. My *.fdepl maps each attribute into a unique EventGroup.
Especially when resuming from suspend-to-RAM, it is possible that UDP SOME/IP-SD is operational while the TCP socket is broken. This leads to a tce restart(), but while that restart is in progress any Subscribe receives a SubscribeNack in response:
4191 105.781314 10.6.0.10 10.6.0.3 SOME/IP-SD 1408 SOME/IP Service Discovery Protocol [Subscribe]
4192 105.790868 10.6.0.3 10.6.0.10 SOME/IP-SD 1396 SOME/IP Service Discovery Protocol [SubscribeNack]
4193 105.792094 10.6.0.10 10.6.0.3 SOME/IP-SD 1410 SOME/IP Service Discovery Protocol [Subscribe]
4194 105.801525 10.6.0.10 10.6.0.3 SOME/IP-SD 1410 SOME/IP Service Discovery Protocol [Subscribe]
4195 105.802118 10.6.0.3 10.6.0.10 SOME/IP-SD 1398 SOME/IP Service Discovery Protocol [SubscribeNack]
4196 105.819610 10.6.0.3 10.6.0.10 SOME/IP-SD 1398 SOME/IP Service Discovery Protocol [SubscribeNack]
As the number of EventGroups grows large, this becomes catastrophic for performance.
In service_discovery_impl::handle_eventgroup_subscription_nack(), each EventGroup calls restart():
vsomeip/implementation/service_discovery/src/service_discovery_impl.cpp
Lines 2517 to 2521 in cf49723
and in tcp_client_endpoint_impl::restart(), while the state is ::CONNECTING, the code will "early terminate" for a maximum of 5 restarts:
vsomeip/implementation/endpoints/src/tcp_client_endpoint_impl.cpp
Lines 77 to 85 in cf49723
Thereafter the code falls through, calls shutdown_and_close_socket_unlocked(), and performs the full restart even while a connection is in progress.
As the system continues processing thousands of SubscribeNack messages, this becomes a tight loop at 100% CPU load that takes multiple seconds to work through. This can easily exceed a 2 s ServiceDiscovery interval and cascade into further problems.
Reproduction Steps
My reproduction was:
- start with fully-established communication between tse and tce
- tce enters suspend-to-ram with TCP socket established
- allow tse to continue running, exceed TCP keepalive timeout, and close the TCP socket
- tce resumes from suspend-to-ram thinking TCP socket is still established, then discovers it to be closed
However, any use case where tse closes the TCP socket while UDP remains functional should be sufficient.
Expected behaviour
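Performance should be better: a burst of SubscribeNack messages should not trigger a full restart per EventGroup, and a connection attempt that is already in progress should not be torn down repeatedly.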
Logs and Screenshots
No response