Description
Hello.
Recently there was issue #121, for which a batch read workaround was implemented. I am now experiencing what I believe to be the same or a similar issue, but while using JSON instead of msgpack. Basically, when I do `for item in job.items.iter(..., count=X, ...):` and there are long intervals during iteration, the count can end up being ignored. I was able to reproduce it with the following snippet:
```python
import time

from scrapinghub import ScrapinghubClient

sh_client = ScrapinghubClient(APIKEY, use_msgpack=False)
take = 10_000
job_id = '168012/276/1'

for i, item in enumerate(sh_client.get_job(job_id).items.iter(count=take, meta='_key')):
    print(f'\r{i} ({item["_key"]})', end='')
    if i == 3000:
        # simulate a long client-side pause mid-iteration
        print('\nsleeping')
        time.sleep(60 * 3)
    if i > take:
        # count=take should make this branch unreachable
        print('\nWTF')
        break
```
With the sleep removed, the WTF branch does not fire and the iterator stops at item 168012/276/1/9999, as expected.
This seems to be more of a ScrapyCloud API platform problem, but I am reporting it here to track it nonetheless.
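As a client-side stopgap, something like the following sketch can hard-cap the stream regardless of whether the server honours `count`; the `itertools.islice` wrapper is my own addition, not part of the library:

```python
from itertools import islice

# Hypothetical guard (not part of python-scrapinghub): cap the stream
# at `take` items even if the server-side count is lost after a stall.
items = sh_client.get_job(job_id).items.iter(count=take, meta='_key')
for i, item in enumerate(islice(items, take)):
    print(f'\r{i} ({item["_key"]})', end='')
```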
For now I am assuming that resource/collection iteration is not robust when client-side delays are possible during retrieval (I haven't tested other potential triggers), so as a habit I will try either preloading everything at once with `.list()` or using `.list_iter()` where it makes sense; a sketch of both is below.
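A minimal sketch of the two workarounds, reusing `sh_client`, `job_id`, and `take` from the snippet above, and assuming `list_iter()` accepts the same `count`/`meta` filters as `iter()` alongside its `chunksize`:

```python
# Workaround 1: preload everything in a single call.
all_items = sh_client.get_job(job_id).items.list(count=take, meta='_key')

# Workaround 2: batch reads via list_iter() (the #121 workaround),
# which re-requests per chunk and so should not lose the count
# across long pauses between chunks.
for chunk in sh_client.get_job(job_id).items.list_iter(
        chunksize=1000, count=take, meta='_key'):
    for item in chunk:
        print(item['_key'])
```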