Faulty interface design of a single web crawler running in multiple threads #108

@geskill

Description

Via the CrawlController, I can specify the number of crawlers, i.e. the amount of concurrent threads to be run. Every crawler thread uses the same CrawlConfig.
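For reference, this is roughly how the current single-config setup looks (a minimal sketch; MyCrawler stands for any WebCrawler subclass, and the storage folder, proxy, and user-agent values are placeholders):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class SingleConfigCrawl {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // fine to share globally
        config.setProxyHost("proxy.example.com");     // shared by ALL threads
        config.setUserAgentString("my-crawler/1.0");  // shared by ALL threads

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://example.com/");
        // 8 concurrent crawler threads, but only ONE CrawlConfig for all of them
        controller.start(MyCrawler.class, 8);  // MyCrawler = your WebCrawler subclass
    }
}
```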

This is good for some settings but bad for others. For example, resumableCrawling, maxDepthOfCrawling, or maxPagesToFetch are usefully global across all threads, but proxy settings (proxyHost, proxyUsername, …) or the userAgentString should be definable per thread instance rather than share the same value across all concurrent threads, as I understand your settings interface so far.

In order to realize the desired setup, I have to define a separate CrawlController for each of the preferred number of threads, each with its own CrawlConfig, as sketched below. However, I am not sure whether I can use the same storage folder and resumable crawling, and, most importantly, whether the controllers share their work list, which I assume they do not.
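Roughly what that workaround looks like (again only a sketch, under the assumption that each controller needs its own storage folder and that the controllers do not share a work list; startNonBlocking is used so the controllers run in parallel):

```java
// Workaround sketch: one CrawlController per proxy, one thread each.
String[] proxies = { "proxy1.example.com", "proxy2.example.com" };
for (int i = 0; i < proxies.length; i++) {
    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawl-" + i);  // separate folder per controller?
    config.setProxyHost(proxies[i]);

    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtServer robotstxtServer =
            new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
    CrawlController controller =
            new CrawlController(config, pageFetcher, robotstxtServer);

    controller.addSeed("https://example.com/");
    controller.startNonBlocking(MyCrawler.class, 1);  // one thread per controller
}
```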

This finally brings me to the point that the interface for multiple crawlers is designed wrongly, not to say broken.

Some pull requests that try to work around this design error: #80 #57

Probably the best solution is to distinguish between global settings (CrawlConfig) and per-thread settings that inherit from the global settings (as default values) but can be overridden. Furthermore, the per-thread settings would have fewer options, limited to proxy settings, user agent, delay, and so on. In the class hierarchy, CrawlConfig would inherit from the basic per-thread settings, as sketched below.
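A sketch of what I have in mind (all class names, fields, and the overloaded start method here are suggestions, not existing API):

```java
// Base class: only the settings that make sense per thread.
public class CrawlThreadConfig {
    protected String proxyHost;
    protected String proxyUsername;
    protected String proxyPassword;
    protected String userAgentString;
    protected long politenessDelay;
    // getters/setters omitted for brevity
}

// CrawlConfig keeps the crawl-wide settings and inherits the per-thread
// ones, which then act as defaults for every thread.
public class CrawlConfig extends CrawlThreadConfig {
    private String crawlStorageFolder;
    private boolean resumableCrawling;
    private int maxDepthOfCrawling;
    private int maxPagesToFetch;
    // getters/setters omitted for brevity
}

// The controller could then accept optional per-thread overrides; any
// value not set on a CrawlThreadConfig falls back to the global config:
//
//   controller.start(MyCrawler.class,
//                    Arrays.asList(threadConfig1, threadConfig2));
```

This way a single controller keeps one storage folder and one shared work list, while each thread still gets its own proxy and user agent.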
