Long intervals during resource iteration can lead to issues #141

Open
@hermit-crab

Description

Hello.

Recently there was issue #121, for which a batch-read workaround was implemented. I am now experiencing what I believe to be the same or a similar issue, but while using JSON instead of msgpack. Basically, when I do for item in job.items.iter(..., count=X, ...): and there are long intervals during iteration, the count can end up being ignored. I was able to reproduce it with the following snippet:

import time

from scrapinghub import ScrapinghubClient

sh_client = ScrapinghubClient(APIKEY, use_msgpack=False)
take = 10_000
job_id = '168012/276/1'
for i, item in enumerate(sh_client.get_job(job_id).items.iter(count=take, meta='_key')):
    print(f'\r{i} ({item["_key"]})', end='')

    if i == 3000:
        print('\nsleeping')
        time.sleep(60 * 3)

    if i >= take:  # count=take should have ended iteration at index take - 1
        print('\nWTF')
        break

With the sleep part removed, the WTF section does not fire and the iterator stops at item 168012/276/1/9999.

This seems to be more of a ScrapyCloud API platform problem, but I am reporting it here for tracking nonetheless.

For now I am assuming that resource/collection iteration is not robust if any client-side delays are possible during retrieval (I haven't tested any other potential issues), so as a habit I will either preload everything at once (.list()) or use .list_iter() where it makes sense, as sketched below.
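
For reference, a minimal sketch of both habits. process() is a hypothetical stand-in for the per-item work; the rest uses the public python-scrapinghub API, with list_iter() being the batch-read workaround from #121:

from itertools import islice

from scrapinghub import ScrapinghubClient

sh_client = ScrapinghubClient(APIKEY, use_msgpack=False)
job = sh_client.get_job('168012/276/1')

# Habit 1: list_iter() downloads each chunk in full before yielding it,
# so slow per-item processing pauses between buffered chunks instead of
# in the middle of an open server response.
for chunk in job.items.list_iter(chunksize=1000):
    for item in chunk:
        process(item)

# Habit 2 (defensive): cap the iterator client-side with islice() so a
# response that ignores count can never overshoot the intended limit.
take = 10_000
for item in islice(job.items.iter(count=take, meta='_key'), take):
    process(item)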
