SQS timeout bug #1

@vm-wylbur

Description

When a stratum takes longer than the SQS visibility timeout, SQS moves the stratum's message from "in flight" back to "queued." The stratum is then assigned to another thread while the first thread may still be running.

A worker thread cannot tell the difference between a stratum that has been abandoned (perhaps because it was dequeued but then its worker failed) and one that is currently being computed but taking a long time.

The problem is that these very long-lived strata end up being computed more than once. In fact, any stratum with RUNTIME > SQS_TIMEOUT will be reinserted in the queue and reassigned ceiling(RUNTIME/SQS_TIMEOUT) times. These are the hardest-to-compute strata, so the duplicated work can greatly extend the total runtime.
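To make the count concrete, here is a minimal sketch of the formula above. The function name and the example numbers are made up for illustration; they are not taken from the project.

```python
import math


def reassignments(runtime_s: float, sqs_timeout_s: float) -> int:
    """Number of times a stratum is requeued and reassigned,
    using the ceiling(RUNTIME/SQS_TIMEOUT) formula from this issue."""
    if sqs_timeout_s <= 0:
        raise ValueError("sqs_timeout_s must be positive")
    if runtime_s <= sqs_timeout_s:
        # The stratum finishes before the timeout, so it is never requeued.
        return 0
    return math.ceil(runtime_s / sqs_timeout_s)


# Hypothetical numbers: a 5000 s stratum with an 1800 s timeout.
print(reassignments(5000, 1800))  # 3
print(reassignments(600, 1800))   # 0
```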

This needs attention. Maybe a central watcher script could harvest the IDs of the currently running strata and somehow broadcast them to the workers, so that they avoid dequeueing a stratum that is already running?
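One alternative to a central watcher, sketched here as a possibility rather than a tested fix, is for each worker to keep extending the visibility timeout of its own message while it is still computing. This is a different technique from the watcher idea above. The sketch assumes `sqs` is a boto3 SQS client (or anything exposing the same `change_message_visibility` and `delete_message` calls); `run_with_heartbeat`, `work`, `timeout_s`, and `interval_s` are hypothetical names, not part of the existing code.

```python
import threading


def run_with_heartbeat(sqs, queue_url, receipt_handle, work,
                       timeout_s=300, interval_s=None):
    """Run work() while periodically extending the message's visibility timeout.

    Assumption: `sqs` behaves like a boto3 SQS client.
    """
    if interval_s is None:
        # Renew well before the current timeout expires.
        interval_s = timeout_s / 2
    done = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout and True once `done` is set.
        while not done.wait(interval_s):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=timeout_s,
            )

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        result = work()
    finally:
        # Stop renewing whether the work succeeded or raised.
        done.set()
        thread.join()
    # Delete only after success, so a failed stratum becomes visible again.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
    return result
```

With this approach a crashed worker stops renewing, so its message reappears after at most `timeout_s` seconds, while a slow but live worker keeps its stratum in flight. Note that SQS limits how long a message can stay invisible (12 hours from receipt), so strata longer than that would still need another mechanism.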
