Skip to content

work queue priority invariants #4276

@benclifford

Description

@benclifford

I'm seeing behaviour where WQ tasks aren't executed in priority order, for a Parsl user.

I hacked around in the codebase and I can see the priorities delivered all the way into to the operations that work on q->ready_list.

But then it looks like the list is getting out of order: I added this check at the start of send_one_task on a hypothesis I assumed, that q->ready_list should be in priority order. (although I'm unclear if that is expected to be true).

I end up with a q->ready_list that is quite out of order. For example:

BENC: checking list order
BENC: entry priority -18.730013
BENC: entry priority -19.377128
BENC: entry priority -20.990366
BENC: entry priority -21.471784
BENC: entry priority -22.673790
BENC: entry priority -22.888660
BENC: entry priority -25.742648
BENC: entry priority -26.607197
BENC: entry priority -28.126937
BENC: entry priority -29.548732
BENC: entry priority -29.638328
BENC: entry priority -30.239305
BENC: entry priority -33.161287
BENC: entry priority -34.062007
BENC: entry priority -37.991897
BENC: entry priority -40.609988
BENC: entry priority -40.727513
BENC: entry priority -41.533317
BENC: entry priority -44.636339
BENC: entry priority -45.619628
BENC: entry priority -45.753476
BENC: entry priority -45.829790
BENC: entry priority -47.320308
BENC: entry priority -47.816183
BENC: entry priority -48.364868
BENC: entry priority -48.679864
BENC: entry priority -53.935601
BENC: entry priority -56.517787
BENC: entry priority -61.032222
BENC: entry priority -4.027813
BENC: list is out of order - it should be descending
BENC: entry priority -4.034614
BENC: entry priority -4.096619
BENC: entry priority -4.129197
BENC: entry priority -4.259326
BENC: entry priority -4.277225
BENC: entry priority -4.307081
BENC: entry priority -4.629009
BENC: entry priority -4.847665
BENC: entry priority -4.895281
BENC: entry priority -5.912347
BENC: entry priority -6.119405
BENC: entry priority -6.989896
BENC: entry priority -7.565397
BENC: entry priority -7.774574
BENC: entry priority -7.992148
BENC: entry priority -8.028452
BENC: entry priority -8.735769
BENC: entry priority -8.971470
BENC: entry priority -9.086354
BENC: entry priority -9.511160
BENC: entry priority -10.241796
BENC: entry priority -10.438019
BENC: entry priority -10.801479
BENC: entry priority -11.053111
BENC: entry priority -11.512188
BENC: entry priority -11.749560
BENC: entry priority -12.064031
BENC: entry priority -12.781551
BENC: entry priority -13.288103
BENC: entry priority -13.462147
BENC: entry priority -14.395489
BENC: entry priority -14.497951
BENC: entry priority -15.642678
BENC: entry priority -17.450457
BENC: checked list order

Over time that list seems to get permuted rotated around a bit - staying in-order-except-for-one-patch.

Here's the code for the above output I added at the start of send_one_task in work_queue.c

@@ -4589,18 +4598,42 @@ static int send_one_task( struct work_queue *q )
        int tasks_considered = 0;
        timestamp_t now = timestamp_get();
 
+        // the list should be in order, larger values later
+        // because thats how priorities work.
+
+        printf("BENC: checking list order\n"); 
+        void *i = NULL;
+        void *last = NULL;
+        struct list_cursor *cur = list_cursor_create(q->ready_list);
+        for (list_seek(cur, 0); list_get(cur, &i); list_next(cur)) {
+                printf("BENC: entry priority %f\n", work_queue_task_priority(i)); 
+                if(last == NULL) {
+                    last = i;
+                } else {
+                    if(!(work_queue_task_priority(i) < work_queue_task_priority(last))) {
+                        printf("BENC: list is out of order - it should be descending\n");
+                    }
+                    last = i;
+                }
+                // if (p(i) <= p(item)) {
+                i = NULL;
+        }
+        list_cursor_destroy(cur);
+        printf("BENC: checked list order\n"); 

I think that is the superficial explanation of why I'm not seeing tasks run in priority order.

I'm not clear from the code what the invariants are meant to be - what's happening with that list_rotate call right after that as far as I understand it deliberately violates the above invariant.

Let me know if you want a more focused reproducer or any other investigation.

Attached is the parsl code I'm running that generates a prioritised workload, which is what I have been using to investigate. You should be able to run it as python3 priotasks.py. It doesn't actually give any machine-tested validation that tasks are run in priority order, but it does enough to drive the above patch output.

priotasks.py

This is added onto 2c8f80d

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions