
Job resource (items/requests/logs) iterator may return more entries than specified via --tail #374

Open

Description

Version Affected

2871384 (current HEAD of master)

Observations

In order to monitor a long-running job I did something like this:

$ watch -n 60 "TZ=UTC date --iso-8601=sec | tr '\n' ' ' | tee -a test_output && shub items <scrapy_cloud_job_id> -n 1 | jq '.some_field' | tee -a test_output"

The command is intended to fetch one item per minute.
It has been running for about an hour, and the log shows that a single shub items -n 1 invocation once returned ~180 items.

Analysis

In shub.utils.job_resource_iter, the tail-related logic works as follows:

  1. Fetch resource.stats() and retrieve the total number of entries within.
  2. Calculate the corresponding index to start with: last_item = total_nr_items - tail - 1
  3. Fetch the resource starting from the pre-calculated index: resource_iter(startafter=last_item_key)

However, new entries may be added between steps 1 and 3, and any such entries are also returned (see the sketch below).
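
To make the race concrete, here is a minimal sketch of the logic described above; it is not the actual shub source, and the stats() fields, the key format, and the iter() signature are assumptions for illustration only:

```python
# Minimal sketch of the tail logic described above -- NOT the actual shub code.
# The stats() fields, the "<job_key>/<index>" key format, and the iter()
# signature are assumptions for illustration.
def tail_iter(resource, job_key, tail):
    # Step 1: read the current total number of entries.
    total_nr_items = resource.stats()['totals']['input_values']
    # Step 2: compute the key of the entry to start after.
    last_item = total_nr_items - tail - 1
    last_item_key = '{}/{}'.format(job_key, last_item)
    # Step 3: iterate from that key onwards. Any entry written between
    # step 1 and step 3 also sorts after last_item_key, so it is returned
    # as well -- which would explain the ~180 items seen for --tail 1.
    return resource.iter(startafter=last_item_key)
```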

Proposal

A count parameter could be added to the resource_iter call, e.g. resource_iter(startafter=last_item_key, count=tail).

It is assumed to be acceptable to return at most N entries when the user passes --tail N without --follow.
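
A sketch of the proposed change, under the same illustrative assumptions as above, plus the assumption that the underlying iterator accepts a count argument:

```python
# Sketch of the proposal -- same assumptions as the sketch above, plus that
# the underlying iterator accepts a count argument.
def tail_iter(resource, job_key, tail):
    total_nr_items = resource.stats()['totals']['input_values']
    last_item = total_nr_items - tail - 1
    last_item_key = '{}/{}'.format(job_key, last_item)
    # Capping the result at `tail` entries bounds the output even if new
    # entries keep arriving between the stats() call and the iteration.
    return resource.iter(startafter=last_item_key, count=tail)
```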
