Hello,
We have a use case where some URLs are prioritised (boosted), but the crawler should terminate after XX URLs are fetched. To implement this, we planned to use CrawlConfig.maxPagesToFetch, whose javadoc states "Maximum number of pages to fetch". However, this documentation and variable name are misleading, as the option actually limits the number of URLs scheduled (i.e. added to the frontier), not the number fetched. If you agree, I would propose renaming this option and adding another that limits the number of pages actually fetched. If all URLs have equal priority, the two options are semantically equivalent; with boosted URLs they diverge, because a high-priority URL discovered late may never be scheduled at all.
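To illustrate the difference, here is a minimal standalone sketch (not crawler4j internals; names like `fetchWithScheduleLimit` are hypothetical) that contrasts applying the limit at schedule time with applying it at fetch time, when a boosted URL is discovered after the limit is reached:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FetchLimitSketch {
    // Discovery order: the boosted URL is found only after three others.
    static final String[] NAMES = {"a", "b", "c", "boosted", "d"};

    static int priority(String name) { return name.equals("boosted") ? 10 : 1; }

    // Highest priority first; ties broken by name for determinism.
    static final Comparator<String> BY_PRIORITY =
        Comparator.comparingInt((String n) -> -priority(n)).thenComparing(n -> n);

    // Behaviour as currently implemented (per this report):
    // stop *scheduling* after `limit` URLs have entered the frontier.
    static List<String> fetchWithScheduleLimit(int limit) {
        PriorityQueue<String> frontier = new PriorityQueue<>(BY_PRIORITY);
        for (int i = 0; i < NAMES.length && i < limit; i++) {
            frontier.add(NAMES[i]);
        }
        List<String> fetched = new ArrayList<>();
        while (!frontier.isEmpty()) {
            fetched.add(frontier.poll());
        }
        return fetched; // "boosted" never entered the frontier
    }

    // Proposed behaviour: schedule everything discovered,
    // stop *fetching* after `limit` pages.
    static List<String> fetchWithFetchLimit(int limit) {
        PriorityQueue<String> frontier = new PriorityQueue<>(BY_PRIORITY);
        frontier.addAll(Arrays.asList(NAMES));
        List<String> fetched = new ArrayList<>();
        while (!frontier.isEmpty() && fetched.size() < limit) {
            fetched.add(frontier.poll());
        }
        return fetched; // the boosted URL is fetched first
    }

    public static void main(String[] args) {
        System.out.println("schedule-limited: " + fetchWithScheduleLimit(3));
        System.out.println("fetch-limited:    " + fetchWithFetchLimit(3));
    }
}
```

With a limit of 3, the schedule-time limit fetches only the first three discovered URLs and the boosted URL is silently dropped, while the fetch-time limit fetches the boosted URL first. This is the behaviour difference the proposed second option would expose.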