Fix/max pages per crawl #46

Merged
merged 3 commits into from Jan 8, 2024
Conversation

@foxt451 (Contributor) commented Jan 8, 2024

The first commit fixes the early exit so that it no longer waits for all requests to finish.
The second commit adds code to avoid navigating to extra pages at all and tracks the limit by the number of pages opened so far rather than by the number of items in the dataset (correct me if you intended it the other way).
Closes #45
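
A minimal sketch of that second idea (not the actual diff), assuming a Crawlee PlaywrightCrawler and the state/option names used in this thread; the skipNavigation flag is one possible way to avoid opening pages past the limit:

```ts
import { PlaywrightCrawler } from 'crawlee';

const maxPagesPerCrawl = 10; // hypothetical input value
const state = { pagesOpened: 0 };

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ request }) => {
            // Count the limit against pages we are about to open, not against
            // dataset items, and skip navigation once the limit is reached.
            if (state.pagesOpened >= maxPagesPerCrawl) {
                request.skipNavigation = true;
                return;
            }
            state.pagesOpened++;
        },
    ],
    async requestHandler({ request, enqueueLinks }) {
        if (request.skipNavigation) return; // page over the limit, nothing to output
        // ... scrape the page and push results to the dataset ...
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```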

@metalwarrior665 (Contributor) left a comment

  1. I would remove pageOutputted completely; we really only need one limit, so let's not complicate it.
  2. You have a race condition: if you increment state.pagesOpened++, the actor might finish before this page gets processed. I'm thinking about how to approach it, probably by draining the queue, adding more complex logic, or tracking requests in progress (see the sketch after this list for the interleaving I mean).
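
Illustratively (this is not the PR code, just a standalone sketch of the race in point 2), the handler bumps the counter before its async work finishes, so a check elsewhere can observe the limit and exit first:

```ts
const maxPagesCrawled = 1;
const state = { pagesOpened: 0 };

async function requestHandler(url: string) {
    state.pagesOpened++; // the limit is reached here...
    await new Promise((resolve) => setTimeout(resolve, 1_000)); // ...while scraping is still running
    console.log(`pushed results for ${url}`); // never happens if the actor exits first
}

// Somewhere else the limit is observed and the crawl is finished immediately,
// before the handler above has pushed its results:
const watcher = setInterval(() => {
    if (state.pagesOpened >= maxPagesCrawled) {
        clearInterval(watcher);
        process.exit(0); // finishes too early
    }
}, 10);

void requestHandler('https://example.com');
```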

@foxt451 (Contributor, Author) commented Jan 8, 2024

@metalwarrior665 So, I removed pageOutputted completely and also did some patching of the log so extra requests can be aborted. What do you think?

@metalwarrior665 (Contributor)

@foxt451 Sorry for going back and forth. I realized that this last solution is also a bit dangerous: if there were too many links, the draining could take too long.

So I'm struggling a bit with several ideas on how to do this optimally (logically correct and without messing up the code too much).

My best take so far (a rough sketch follows the list):
1. Take the requestHandler and wrap it in another function, e.g. requestHandlerInner, with try/catch so we always know the code will continue after it is called (there is a single exit point).
2. Before and after requestHandlerInner we add the logic for checking pagesOpened.
3. Before requestHandlerInner we increment pagesOpened and add the request to a special state gptRequestInProgress; this state must not be persisted so we don't end up in a deadlock. Before that we also check whether we should finish, with pagesOpened >= maxPagesCrawled && Object.keys(gptRequestInProgress).length === 0.
4. After the function finishes, we remove the request from gptRequestInProgress and check again whether we should finish.
5. Let's use crawler.teardown for finishing so the code after the crawler can run.
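
A rough sketch of those five steps, assuming a Crawlee PlaywrightCrawler and the names from this thread (pagesOpened, gptRequestInProgress, maxPagesCrawled); this is not the merged implementation:

```ts
import { PlaywrightCrawler, type PlaywrightCrawlingContext } from 'crawlee';

const maxPagesCrawled = 10;
const state = { pagesOpened: 0 };
// In-memory only (never persisted), so a migration/restart cannot deadlock on it.
const gptRequestInProgress: Record<string, true> = {};

const shouldFinish = () =>
    state.pagesOpened >= maxPagesCrawled
    && Object.keys(gptRequestInProgress).length === 0;

// Step 1: the original handler wrapped so there is a single exit point.
const requestHandlerInner = async (ctx: PlaywrightCrawlingContext) => {
    // ... original scraping logic ...
};

const crawler = new PlaywrightCrawler({
    async requestHandler(ctx) {
        // Steps 2-3: check and book-keep before the inner handler runs.
        if (shouldFinish()) {
            await crawler.teardown(); // step 5: code after crawler.run() still executes
            return;
        }
        state.pagesOpened++;
        gptRequestInProgress[ctx.request.uniqueKey] = true;
        try {
            await requestHandlerInner(ctx);
        } finally {
            // Step 4: book-keep after the inner handler and check again.
            delete gptRequestInProgress[ctx.request.uniqueKey];
            if (shouldFinish()) await crawler.teardown();
        }
    },
});
```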

Please let me know if you have a better idea :)

OK, so actually there is already maxRequestsPerCrawl on the crawler, which finishes after the in-progress requests are done. So I guess the draining is good enough in that case, as it should speed this up and should only take a few seconds :)
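
For reference, a minimal example of that built-in option (Crawlee's maxRequestsPerCrawl), which stops issuing new requests once the count is reached and lets the in-progress ones finish:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 10, // effectively the "max pages per crawl" limit
    async requestHandler({ enqueueLinks }) {
        // ... scrape the page and push results ...
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```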

@metalwarrior665 merged commit a85d3c5 into master on Jan 8, 2024
1 check passed
Development

Successfully merging this pull request may close these issues.

Actor doesn't correctly finish when reaching max scraped pages