Fix/max pages per crawl #46

Merged
merged 3 commits into from Jan 8, 2024
Conversation

@foxt451 (Contributor) commented Jan 8, 2024

The first commit fixes the early exit so that it no longer waits for all requests to finish.
The second commit adds code to avoid navigating to extra pages at all and tracks the limit by the number of pages opened so far rather than by the number of items in the dataset (correct me if you intended it the other way).
Closes #45
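
A minimal sketch of that second idea (not the actual diff), assuming a Crawlee PlaywrightCrawler and the state/option names used in this thread; the skipNavigation flag is one possible way to avoid opening pages past the limit:

```ts
import { PlaywrightCrawler } from 'crawlee';

const maxPagesPerCrawl = 10; // hypothetical input value
const state = { pagesOpened: 0 };

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ request }) => {
            // Count the limit against pages we are about to open, not against
            // dataset items, and skip navigation once the limit is reached.
            if (state.pagesOpened >= maxPagesPerCrawl) {
                request.skipNavigation = true;
                return;
            }
            state.pagesOpened++;
        },
    ],
    async requestHandler({ request, enqueueLinks }) {
        if (request.skipNavigation) return; // page over the limit, nothing to output
        // ... scrape the page and push results to the dataset ...
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```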

@metalwarrior665 (Contributor) left a comment

  1. I would remove pageOutputted completely; we really only need one limit, so let's not complicate it.
  2. You have a race condition: if you increment state.pagesOpened++, the actor might finish before this page gets processed. I'm thinking about how to approach it, probably by draining the queue, adding more complex logic, or tracking requests in progress (see the sketch after this list for the interleaving I mean).
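
Illustratively (this is not the PR code, just a standalone sketch of the race in point 2), the handler bumps the counter before its async work finishes, so a check elsewhere can observe the limit and exit first:

```ts
const maxPagesCrawled = 1;
const state = { pagesOpened: 0 };

async function requestHandler(url: string) {
    state.pagesOpened++; // the limit is reached here...
    await new Promise((resolve) => setTimeout(resolve, 1_000)); // ...while scraping is still running
    console.log(`pushed results for ${url}`); // never happens if the actor exits first
}

// Somewhere else the limit is observed and the crawl is finished immediately,
// before the handler above has pushed its results:
const watcher = setInterval(() => {
    if (state.pagesOpened >= maxPagesCrawled) {
        clearInterval(watcher);
        process.exit(0); // finishes too early
    }
}, 10);

void requestHandler('https://example.com');
```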

@foxt451 (Contributor, Author) commented Jan 8, 2024

@metalwarrior665 So, I removed pageOutputted completely and also did some patching of the log so extra requests can be aborted. What do you think?

@metalwarrior665 (Contributor)

@foxt451 Sorry for going back and forth. I realized that this last solution is also a bit dangerous: if there were too many links, the draining could take too long.

So I'm struggling a bit with several ideas on how to do this optimally (logically correct and without messing up the code too much).

My best take so far (a rough sketch follows the list):
1. Take the requestHandler and wrap it in another function, e.g. requestHandlerInner, with try/catch so we always know the code will continue after it is called (there is a single exit point).
2. Before and after requestHandlerInner we add the logic for checking pagesOpened.
3. Before requestHandlerInner we increment pagesOpened and add the request to a special state gptRequestInProgress; this state must not be persisted so we don't end up in a deadlock. Before that we also check whether we should finish, with pagesOpened >= maxPagesCrawled && Object.keys(gptRequestInProgress).length === 0.
4. After the function finishes, we remove the request from gptRequestInProgress and check again whether we should finish.
5. Let's use crawler.teardown for finishing so the code after the crawler can run.
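
A rough sketch of those five steps, assuming a Crawlee PlaywrightCrawler and the names from this thread (pagesOpened, gptRequestInProgress, maxPagesCrawled); this is not the merged implementation:

```ts
import { PlaywrightCrawler, type PlaywrightCrawlingContext } from 'crawlee';

const maxPagesCrawled = 10;
const state = { pagesOpened: 0 };
// In-memory only (never persisted), so a migration/restart cannot deadlock on it.
const gptRequestInProgress: Record<string, true> = {};

const shouldFinish = () =>
    state.pagesOpened >= maxPagesCrawled
    && Object.keys(gptRequestInProgress).length === 0;

// Step 1: the original handler wrapped so there is a single exit point.
const requestHandlerInner = async (ctx: PlaywrightCrawlingContext) => {
    // ... original scraping logic ...
};

const crawler = new PlaywrightCrawler({
    async requestHandler(ctx) {
        // Steps 2-3: check and book-keep before the inner handler runs.
        if (shouldFinish()) {
            await crawler.teardown(); // step 5: code after crawler.run() still executes
            return;
        }
        state.pagesOpened++;
        gptRequestInProgress[ctx.request.uniqueKey] = true;
        try {
            await requestHandlerInner(ctx);
        } finally {
            // Step 4: book-keep after the inner handler and check again.
            delete gptRequestInProgress[ctx.request.uniqueKey];
            if (shouldFinish()) await crawler.teardown();
        }
    },
});
```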

Please let me know if you have a better idea :)

OK, so actually there is already maxRequestsPerCrawl on the crawler, which finishes after the in-progress requests are done. So I guess the draining is good enough in that case, as it should speed this up and should only take a few seconds :)
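
For reference, a minimal example of that built-in option (Crawlee's maxRequestsPerCrawl), which stops issuing new requests once the count is reached and lets the in-progress ones finish:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 10, // effectively the "max pages per crawl" limit
    async requestHandler({ enqueueLinks }) {
        // ... scrape the page and push results ...
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```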

@metalwarrior665 merged commit a85d3c5 into master on Jan 8, 2024
1 check passed
Development

Successfully merging this pull request may close these issues.

Actor doesn't correctly finish when reaching max scraped pages