maxPagesToFetch bug #155

Open

cmacdonald wants to merge 1 commit into yasserg:master from cmacdonald:PR-maxPagesToFetch
Conversation

@cmacdonald

Ok, this is a bit more subtle.

The config option is maxPagesToFetch. However, as currently implemented its semantics would be better described as maxPagesToSchedule.

Consider the following use case:

  • I want to crawl 20 pages for a given site, but queue up everything else.
  • I can decide later to continue crawling the site, based on the queue built in stage 1.

This fails because once the queue contains 20 pages, no more pages are added to it.
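The first stage of this use case maps onto crawler4j's documented `CrawlConfig` option. A minimal configuration fragment (the storage folder path is illustrative; this is not a complete crawler setup):

```java
// Cap fetching at 20 pages per crawl run, as in the use case above.
// Per crawler4j's CrawlConfig API; the default of -1 means unlimited.
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");  // illustrative path
config.setMaxPagesToFetch(20);
```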

This patch fixes the semantics of the config option so that pages are still added to the queue, but queue consumption stops once maxPagesToFetch has been reached. This is also useful if you are inserting pages into the queue with higher priorities.
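The patched semantics can be illustrated with a minimal, self-contained sketch. `SimpleFrontier` and its methods are simplified stand-ins invented for this example, not crawler4j's actual frontier API: scheduling is unbounded, and only consumption stops at the fetch cap.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Simplified stand-in for a crawl frontier; illustrates the patched
// semantics: scheduling is unbounded, fetching stops at maxPagesToFetch.
class SimpleFrontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final int maxPagesToFetch;
    private int fetched = 0;

    SimpleFrontier(int maxPagesToFetch) {
        this.maxPagesToFetch = maxPagesToFetch;
    }

    // Patched behavior: always enqueue, no matter how many pages
    // have already been fetched or queued.
    void schedule(String url) {
        queue.add(url);
    }

    // Consumption stops once the fetch budget is spent; the rest of
    // the queue is preserved for a later crawl stage.
    Optional<String> nextToFetch() {
        if (fetched >= maxPagesToFetch || queue.isEmpty()) {
            return Optional.empty();
        }
        fetched++;
        return Optional.of(queue.poll());
    }

    int queuedRemaining() {
        return queue.size();
    }
}

public class FrontierDemo {
    public static void main(String[] args) {
        SimpleFrontier frontier = new SimpleFrontier(2);
        for (int i = 1; i <= 5; i++) {
            frontier.schedule("http://example.com/page" + i);
        }
        int fetchedCount = 0;
        while (frontier.nextToFetch().isPresent()) {
            fetchedCount++;
        }
        // 2 pages fetched, 3 still queued for a later stage
        System.out.println(fetchedCount + " fetched, "
                + frontier.queuedRemaining() + " queued");
    }
}
```

Under the old semantics, `schedule` would instead have refused new URLs once the cap was hit, leaving nothing behind for a second crawl stage.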

@Chaiavi
Contributor

Chaiavi commented Aug 9, 2016

In the changes I don't see any use of maxPagesToSchedule.

@cmacdonald
Author

To be clear, I didn't introduce any code called maxPagesToSchedule. However, such an option could be added in schedule() and scheduleAll(). That said, I agree with your premise that those semantics (i.e. the status quo) are not actually useful.
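For comparison, a hypothetical scheduling cap in schedule()/scheduleAll() could look roughly like this toy sketch. `maxPagesToSchedule` and the `CappedFrontier` class are invented for illustration; they are not part of crawler4j. The sketch reproduces the status-quo behavior the patch argues against: URLs are silently dropped once the scheduling budget is spent.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical scheduling cap; maxPagesToSchedule is an invented
// option, not part of crawler4j's actual configuration.
class CappedFrontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final int maxPagesToSchedule;
    private int scheduled = 0;

    CappedFrontier(int maxPagesToSchedule) {
        this.maxPagesToSchedule = maxPagesToSchedule;
    }

    // Silently drop URLs once the scheduling budget is spent --
    // the behavior the original maxPagesToFetch check effectively had.
    void schedule(String url) {
        if (scheduled < maxPagesToSchedule) {
            queue.add(url);
            scheduled++;
        }
    }

    void scheduleAll(List<String> urls) {
        for (String url : urls) {
            schedule(url);
        }
    }

    int queued() {
        return queue.size();
    }
}

public class CapDemo {
    public static void main(String[] args) {
        CappedFrontier frontier = new CappedFrontier(3);
        frontier.scheduleAll(List.of("a", "b", "c", "d", "e"));
        // Only 3 of the 5 URLs made it into the queue
        System.out.println(frontier.queued());
    }
}
```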

pgalbraith added a commit to pgalbraith/crawler4j that referenced this pull request Nov 30, 2018
