[Bug] DeadLock/Thread leak in Stage-scheduler #25538

Open
@Max-Cheng


How to Trigger

Honestly, this is hard to trigger. Our system has a preemptive scheduling center that identifies lower-priority tasks and tries to abort them when the load per node is very high.
If a query cancellation is triggered at the same time as a node failure, there is a high probability of hitting this bug.

Trino Version

Our fork is based on the open source repository at tag 435, with some functionality backported from upstream.

Failures

Some expired tasks are still updating information, e.g. ContinuousTaskStatusFetcher and TaskInfoFetcher still try to update the task summary or status.

Temporary fixes

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/server/remotetask/ContinuousTaskStatusFetcher.java#L253
Change the synchronous onFail.accept() call to an asynchronous one, so that the lock on HttpRemoteTask is released before the failure callback runs.
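
To illustrate the idea behind this workaround, here is a minimal, self-contained sketch. It is not Trino code: the class `AsyncFailureCallbackSketch`, the `taskLock` object standing in for the HttpRemoteTask monitor, and the `callbackHoldsLock` helper are all hypothetical names, and where exactly the real lock is taken is an assumption. The sketch only shows that a failure callback invoked inline runs while the caller's lock is held, whereas one handed to an executor runs on a thread that does not hold it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class AsyncFailureCallbackSketch {

    // Returns true if the failure callback ran on a thread holding taskLock.
    // All names here are hypothetical stand-ins, not Trino classes.
    static boolean callbackHoldsLock(boolean async) {
        ExecutorService callbackExecutor = Executors.newSingleThreadExecutor();
        try {
            Object taskLock = new Object(); // stand-in for the remote task's monitor
            AtomicBoolean heldLock = new AtomicBoolean();
            CountDownLatch done = new CountDownLatch(1);

            // Stand-in for the onFail consumer: records whether the thread
            // running it owns the task lock at that moment.
            Consumer<Throwable> onFail = cause -> {
                heldLock.set(Thread.holdsLock(taskLock));
                done.countDown();
            };

            RuntimeException failure = new RuntimeException("simulated node failure");
            synchronized (taskLock) {
                if (async) {
                    // Workaround: hand the callback to another thread so it
                    // never runs under this monitor.
                    callbackExecutor.execute(() -> onFail.accept(failure));
                } else {
                    // Original shape: callback runs inline, under the lock.
                    onFail.accept(failure);
                }
            }

            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("callback did not run");
            }
            return heldLock.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            callbackExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("synchronous callback held lock: " + callbackHoldsLock(false));
        System.out.println("asynchronous callback held lock: " + callbackHoldsLock(true));
    }
}
```

If the callback needs a second lock that another thread already holds while that thread waits for the task lock, the inline variant can deadlock. Dispatching the callback to an executor avoids holding both locks at once, at the cost of the failure being processed slightly later.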

Stack

Image
