Description
Version Affected
2871384 (current HEAD of master)
Observations
In order to monitor a long-running job I did something like this:
$ watch -n 60 "TZ=UTC date --iso-8601=sec | tr '\n' ' ' | tee -a test_output && shub items <scrapy_cloud_job_id> -n 1 | jq '.some_field' | tee -a test_output"
The command intends to fetch one item per minute.
It has been running for ~1h so far, and the log suggests that at one point ~180 items were fetched during a single shub items -n 1 invocation.
Analysis
In shub.utils.job_resource_iter the tail-related logic works as follows:
- Fetch resource.stats() and retrieve the total number of entries from it.
- Calculate the index to start from: last_item = total_nr_items - tail - 1
- Fetch the resource starting from the pre-calculated index: resource_iter(startafter=last_item_key)
However, new entries may be added between the first and third steps, and any such newly added entries are also returned.
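The race can be sketched with a toy in-memory stand-in for the job's item store. This is a hypothetical simulation, not shub's actual code: the list plays the role of the remote resource, and appending to it mid-iteration plays the role of the job writing new items between steps 1 and 3.

```python
def job_resource_iter(resource, tail):
    # Step 1: read the total size once, like resource.stats().
    total = len(resource)
    # Step 2: compute the index to start tailing from.
    start = max(total - tail, 0)

    # Step 3: iterate from that index onward. Anything appended to the
    # resource after step 1 but before/while iterating is also yielded.
    def gen():
        for i in range(start, len(resource)):
            yield resource[i]
    return gen()


items = list(range(100))                  # 100 items exist at stats() time
it = job_resource_iter(items, tail=1)     # expected: 1 item
items.extend(range(100, 280))             # 180 new items arrive mid-fetch
print(len(list(it)))                      # prints 181, not 1
```

With tail=1 the caller expects a single item, but the 180 entries appended after the stats() snapshot are swept up as well, matching the ~180-item burst seen in the log.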
Proposal
A count parameter could be added to the resource_iter call (e.g. resource_iter(startafter=last_item_key, count=tail)).
It seems acceptable to return no more than N items when the user passes --tail N without --follow.
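The effect of the proposed cap can be sketched client-side with itertools.islice standing in for the hypothetical count= parameter (again using the toy in-memory resource, not shub's real API):

```python
import itertools


def job_resource_iter_capped(resource, tail):
    # Same three steps as before...
    total = len(resource)
    start = max(total - tail, 0)

    def gen():
        for i in range(start, len(resource)):
            yield resource[i]

    # ...but cap the result at `tail` entries; islice is a client-side
    # stand-in for passing count=tail to the server.
    return itertools.islice(gen(), tail)


items = list(range(100))
it = job_resource_iter_capped(items, tail=1)
items.extend(range(100, 280))             # the same 180-item burst
print(len(list(it)))                      # prints 1, as --tail 1 intends
```

Even if new entries land between the stats() snapshot and the fetch, the cap guarantees at most tail items are returned, which is the behaviour a user of --tail N without --follow would expect.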