Description
How to Trigger
Honestly, this is hard to trigger. Our system has a preemptive mobilisation centre that identifies lower-priority tasks and tries to abort them while the load per node is very high.
If a query cancellation is triggered at the same time as a node failure, there is a high probability of triggering this bug.
Trino Version
We forked from the open-source repository at tag 435, and we have backported some functionality from upstream.
Failures
Some expired tasks are still updating information, e.g. ContinuousTaskStatusFetcher and TaskInfoFetcher still try to update the task summary or status.
Temporary fixes
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/server/remotetask/ContinuousTaskStatusFetcher.java#L253
Change the synchronous onFail.accept() call to an asynchronous one, so that the lock on HttpRemoteTask is released.
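
For anyone who wants to see the shape of the workaround, here is a minimal sketch. It is not the actual Trino code: `AsyncOnFailSketch`, `Fetcher`, `fail` and `callbackExecutor` are hypothetical names, and a plain `synchronized` method stands in for whatever lock is really held. It only illustrates handing the `onFail` callback to an executor instead of calling it inline while a lock is held.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

class AsyncOnFailSketch {
    // Hypothetical stand-in for a status fetcher; not Trino's class.
    static class Fetcher {
        private final Consumer<Throwable> onFail;
        private final Executor callbackExecutor;

        Fetcher(Consumer<Throwable> onFail, Executor callbackExecutor) {
            this.onFail = onFail;
            this.callbackExecutor = callbackExecutor;
        }

        // Runs while holding this object's monitor (assumed lock for the sketch).
        synchronized void fail(Throwable cause) {
            // Synchronous shape (what the workaround replaces):
            //   onFail.accept(cause);
            // Asynchronous shape: the callback runs on another thread,
            // so it does not execute on the thread holding this monitor.
            callbackExecutor.execute(() -> onFail.accept(cause));
        }
    }

    // Returns true if the callback's thread held the fetcher's monitor.
    static boolean callbackHeldFetcherLock() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        AtomicBoolean heldLock = new AtomicBoolean(true);
        Fetcher[] ref = new Fetcher[1];
        ref[0] = new Fetcher(cause -> heldLock.set(Thread.holdsLock(ref[0])), executor);
        ref[0].fail(new RuntimeException("task expired"));
        executor.shutdown();
        try {
            if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("callback did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return heldLock.get();
    }

    public static void main(String[] args) {
        System.out.println("callback held fetcher lock: " + callbackHeldFetcherLock());
    }
}
```

In this sketch the callback runs on the executor thread, which never owns the fetcher's monitor. Whether the same holds in our fork depends on which executor is used and which locks the real callback takes, so treat it as an illustration only.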