Via the CrawlController I can specify the number of concurrent crawler threads to run. Every crawler thread uses the same CrawlConfig.
This is fine for some settings but problematic for others. For example, resumableCrawling, maxDepthOfCrawling, or maxPagesToFetch are usefully global across all threads, but proxy settings (proxyHost, proxyUsername, …) or userAgentString should be definable per thread instance rather than sharing one value across all concurrent threads. As I understand your settings interface so far, that is not possible.
To realize the desired setup, I would have to define a separate CrawlController, each with its own CrawlConfig, for every thread I want. However, I am not sure whether the controllers can use the same storage folder and resumable crawling, and, most importantly, whether they share their work list, which I assume they do not.
This finally brings me to the point that the interface for multiple crawlers is designed wrong, not to say broken.
Some pull requests that try to work around this design error: #80 #57
Probably the best way to fix this is to distinguish between global settings (CrawlConfig) and per-thread settings that inherit from the global settings (as default values) but can be overridden. Furthermore, the per-thread settings would offer fewer options, limited to proxy settings, user agent, delay, etc. In the class hierarchy, CrawlConfig would inherit from the basic per-thread settings.
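To make the proposal concrete, here is a minimal sketch of what that hierarchy could look like. All class and field names below are illustrative assumptions, not the actual crawler4j API: the global config extends the per-thread settings and can stamp out per-thread copies that start from the global defaults.

```java
// Hypothetical sketch of the proposed settings hierarchy.
// Names (CrawlerThreadConfig, newThreadConfig) are made up for illustration.

// Per-thread settings: the small subset each crawler thread may override.
class CrawlerThreadConfig {
    protected String proxyHost;
    protected String userAgentString = "crawler4j";
    protected long politenessDelay = 200;
}

// Global settings inherit the per-thread options and add controller-wide ones.
class CrawlConfig extends CrawlerThreadConfig {
    protected boolean resumableCrawling;
    protected int maxDepthOfCrawling = -1;
    protected int maxPagesToFetch = -1;

    // Derive a per-thread config seeded with the global defaults;
    // the caller can then override individual fields per thread.
    CrawlerThreadConfig newThreadConfig() {
        CrawlerThreadConfig c = new CrawlerThreadConfig();
        c.proxyHost = this.proxyHost;
        c.userAgentString = this.userAgentString;
        c.politenessDelay = this.politenessDelay;
        return c;
    }
}
```

A crawler thread would then receive its own CrawlerThreadConfig (e.g. with a different proxyHost per thread), while the CrawlController keeps the single global CrawlConfig for the shared options like resumableCrawling and the frontier limits.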