
[BUG]: vsomeip slow to restart with lots of EventGroup #690

Open
@joeyoravec

vSomeip Version

v3.4.10

Boost Version

1.82

Environment

Android and QNX

Describe the bug

My automotive system has a *.fidl with ~3500 attributes, one per CAN signal. My *.fdepl maps each attribute to its own unique EventGroup.
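
For a sense of scale, the subscriber ends up with one EventGroup per attribute, so on every (re)connect it issues ~3500 individual Subscribe entries. A minimal sketch of that pattern against the vsomeip application API (the service/instance/event IDs and the count are illustrative, not my real deployment):

#include <cstdint>
#include <memory>
#include <vsomeip/vsomeip.hpp>

void subscribe_per_attribute(const std::shared_ptr<vsomeip::application>& app) {
    const vsomeip::service_t  service  = 0x1234;  // illustrative
    const vsomeip::instance_t instance = 0x0001;  // illustrative
    // One attribute == one field event == one EventGroup, ~3500 times over.
    for (std::uint16_t i = 0; i < 3500; ++i) {
        const auto event = static_cast<vsomeip::event_t>(0x8000 + i);
        const auto group = static_cast<vsomeip::eventgroup_t>(0x8000 + i);
        app->request_event(service, instance, event, {group},
                           vsomeip::event_type_e::ET_FIELD);
        // Each subscribe() becomes its own SOME/IP-SD Subscribe entry,
        // and each one can be individually nacked while TCP is down.
        app->subscribe(service, instance, group);
    }
}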

Especially when resuming from suspend-to-ram, it's possible that UDP SOME/IP-SD is operational while the TCP socket is broken. This leads to a restart() on tce, and while the reconnect is in progress every Subscribe receives a SubscribeNack in response:

No.	Time	Source	Destination	Protocol	Length	Info
4191	105.781314	10.6.0.10	10.6.0.3	SOME/IP-SD	1408	SOME/IP Service Discovery Protocol [Subscribe]
4192	105.790868	10.6.0.3	10.6.0.10	SOME/IP-SD	1396	SOME/IP Service Discovery Protocol [SubscribeNack]
4193	105.792094	10.6.0.10	10.6.0.3	SOME/IP-SD	1410	SOME/IP Service Discovery Protocol [Subscribe]
4194	105.801525	10.6.0.10	10.6.0.3	SOME/IP-SD	1410	SOME/IP Service Discovery Protocol [Subscribe]
4195	105.802118	10.6.0.3	10.6.0.10	SOME/IP-SD	1398	SOME/IP Service Discovery Protocol [SubscribeNack]
4196	105.819610	10.6.0.3	10.6.0.10	SOME/IP-SD	1398	SOME/IP Service Discovery Protocol [SubscribeNack]

As the number of EventGroups grows large, this becomes catastrophic for performance.

In service_discovery_impl::handle_eventgroup_subscription_nack(), every nacked EventGroup triggers a restart():

if (!its_subscription->is_selective()) {
    auto its_reliable = its_subscription->get_endpoint(true);
    if (its_reliable)
        its_reliable->restart();
}

and in tcp_client_endpoint_impl::restart(), while the endpoint is in state cei_state_e::CONNECTING, the code will "early terminate" for a maximum of 5 restarts:

if (!_force && self->state_ == cei_state_e::CONNECTING) {
    std::chrono::steady_clock::time_point its_current
        = std::chrono::steady_clock::now();
    std::int64_t its_connect_duration = std::chrono::duration_cast<std::chrono::milliseconds>(
            its_current - self->connect_timepoint_).count();
    if (self->aborted_restart_count_ < self->tcp_restart_aborts_max_
            && its_connect_duration < self->tcp_connect_time_max_) {
        self->aborted_restart_count_++;
        return;
    }
}

Thereafter the code falls through, calling shutdown_and_close_socket_unlocked() and performing a full restart even while a connection is still in progress.

As the system continues processing thousands of SubscribeNacks, this becomes a tight loop: 100% CPU load and multiple seconds to plow through the workload. This can easily exceed a 2 s service discovery interval and cascade into further problems.
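
If I read the code right, both limits are configurable: tcp_restart_aborts_max_ and tcp_connect_time_max_ appear to map to the tcp-restart-aborts-max and tcp-connect-time-max keys in vsomeip.json. As a stopgap, raising both should keep restart() early-terminating for the whole nack burst instead of falling through after 5 aborts (fragment below; the values are illustrative and the key mapping is my assumption):

{
    "tcp-restart-aborts-max": "10000",
    "tcp-connect-time-max": "10000"
}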

Reproduction Steps

My reproduction was:

  • start with fully-established communication between tse and tce
  • tce enters suspend-to-ram with TCP socket established
  • allow tse to continue running, exceed the TCP keepalive timeout, and close the TCP socket
  • tce resumes from suspend-to-ram believing the TCP socket is still established, then discovers it is closed

but any use case where tse closes the TCP socket while UDP remains functional should be sufficient.
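
To hit the keepalive step quickly, shortening the keepalive timers on tse's socket helps. A Linux/Android sketch with illustrative values (QNX may spell these socket options differently); with these settings the peer is declared dead and the socket closed roughly 15 s into the suspend:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Shorten TCP keepalive so tse closes the dead connection while
// tce is still suspended: 5 s idle + 2 probes x 5 s = ~15 s.
void enable_fast_keepalive(int fd) {
    int on = 1, idle = 5, intvl = 5, cnt = 2;
    setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof on);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof idle);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof cnt);
}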

Expected behaviour

A single broken TCP connection should trigger a single restart, no matter how many EventGroups receive a SubscribeNack. Recovery after resume should not take multiple seconds at 100% CPU or overrun the service discovery interval.
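
One possible direction (a sketch under names of my own invention, not vsomeip's actual API): coalesce the per-EventGroup restart() calls so a burst of SubscribeNacks against the same reliable endpoint triggers at most one reconnect.

#include <memory>
#include <set>

// Hypothetical sketch, not vsomeip code: restart each endpoint at most
// once per nack burst. The caller clears _restarted_in_burst after the
// SD message is processed, so later legitimate restarts still happen.
template <typename Endpoint>
void restart_once(const std::shared_ptr<Endpoint>& _endpoint,
                  std::set<const void*>& _restarted_in_burst) {
    if (!_endpoint)
        return;
    // insert().second is true only for the first nack hitting this endpoint
    if (_restarted_in_burst.insert(_endpoint.get()).second)
        _endpoint->restart();
}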

Logs and Screenshots

No response
