maxPagesToFetch bug #155

Open

cmacdonald wants to merge 1 commit into yasserg:master from cmacdonald:PR-maxPagesToFetch
Conversation

@cmacdonald

Ok, this is a bit more subtle.

The config option is maxPagesToFetch. However, as currently implemented its semantics would be better described as maxPagesToSchedule.

Consider the following use case:

  • I want to crawl 20 pages for a given site, but queue up everything else.
  • I can decide later to continue crawling the site, based on the queue built in stage 1.

This fails because once the queue contains 20 pages, no more pages are added to it.
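The first stage of this use case maps onto crawler4j's documented `CrawlConfig` option. A minimal configuration fragment (the storage folder path is illustrative; this is not a complete crawler setup):

```java
// Cap fetching at 20 pages per crawl run, as in the use case above.
// Per crawler4j's CrawlConfig API; the default of -1 means unlimited.
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");  // illustrative path
config.setMaxPagesToFetch(20);
```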

This patch fixes the semantics of the config option so that pages are still added to the queue, but queue consumption stops once maxPagesToFetch has been reached. This is also useful if you are inserting pages into the queue with higher priorities.
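The patched semantics can be illustrated with a minimal, self-contained sketch. `SimpleFrontier` and its methods are simplified stand-ins invented for this example, not crawler4j's actual frontier API: scheduling is unbounded, and only consumption stops at the fetch cap.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Simplified stand-in for a crawl frontier; illustrates the patched
// semantics: scheduling is unbounded, fetching stops at maxPagesToFetch.
class SimpleFrontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final int maxPagesToFetch;
    private int fetched = 0;

    SimpleFrontier(int maxPagesToFetch) {
        this.maxPagesToFetch = maxPagesToFetch;
    }

    // Patched behavior: always enqueue, no matter how many pages
    // have already been fetched or queued.
    void schedule(String url) {
        queue.add(url);
    }

    // Consumption stops once the fetch budget is spent; the rest of
    // the queue is preserved for a later crawl stage.
    Optional<String> nextToFetch() {
        if (fetched >= maxPagesToFetch || queue.isEmpty()) {
            return Optional.empty();
        }
        fetched++;
        return Optional.of(queue.poll());
    }

    int queuedRemaining() {
        return queue.size();
    }
}

public class FrontierDemo {
    public static void main(String[] args) {
        SimpleFrontier frontier = new SimpleFrontier(2);
        for (int i = 1; i <= 5; i++) {
            frontier.schedule("http://example.com/page" + i);
        }
        int fetchedCount = 0;
        while (frontier.nextToFetch().isPresent()) {
            fetchedCount++;
        }
        // 2 pages fetched, 3 still queued for a later stage
        System.out.println(fetchedCount + " fetched, "
                + frontier.queuedRemaining() + " queued");
    }
}
```

Under the old semantics, `schedule` would instead have refused new URLs once the cap was hit, leaving nothing behind for a second crawl stage.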

@Chaiavi
Contributor

Chaiavi commented Aug 9, 2016

In the changes I don't see any use of maxPagesToSchedule.

@cmacdonald
Author

To be clear, I didn't introduce any code called maxPagesToSchedule. However, such an option could be added in schedule() and scheduleAll(). That said, I agree with your premise that those semantics (i.e. the status quo) are not actually useful.
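For comparison, a hypothetical scheduling cap in schedule()/scheduleAll() could look roughly like this toy sketch. `maxPagesToSchedule` and the `CappedFrontier` class are invented for illustration; they are not part of crawler4j. The sketch reproduces the status-quo behavior the patch argues against: URLs are silently dropped once the scheduling budget is spent.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical scheduling cap; maxPagesToSchedule is an invented
// option, not part of crawler4j's actual configuration.
class CappedFrontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final int maxPagesToSchedule;
    private int scheduled = 0;

    CappedFrontier(int maxPagesToSchedule) {
        this.maxPagesToSchedule = maxPagesToSchedule;
    }

    // Silently drop URLs once the scheduling budget is spent --
    // the behavior the original maxPagesToFetch check effectively had.
    void schedule(String url) {
        if (scheduled < maxPagesToSchedule) {
            queue.add(url);
            scheduled++;
        }
    }

    void scheduleAll(List<String> urls) {
        for (String url : urls) {
            schedule(url);
        }
    }

    int queued() {
        return queue.size();
    }
}

public class CapDemo {
    public static void main(String[] args) {
        CappedFrontier frontier = new CappedFrontier(3);
        frontier.scheduleAll(List.of("a", "b", "c", "d", "e"));
        // Only 3 of the 5 URLs made it into the queue
        System.out.println(frontier.queued());
    }
}
```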

pgalbraith added a commit to pgalbraith/crawler4j that referenced this pull request Nov 30, 2018
