Faulty interface design of a single web crawler running in multiple threads #108

@geskill

Description

Via the CrawlController, I can specify the number of crawlers, i.e. the amount of concurrent threads to be run. Every crawler thread uses the same CrawlConfig.
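For reference, this is roughly how the current single-config setup looks (a minimal sketch; MyCrawler stands for any WebCrawler subclass, and the storage folder, proxy, and user-agent values are placeholders):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class SingleConfigCrawl {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // fine to share globally
        config.setProxyHost("proxy.example.com");     // shared by ALL threads
        config.setUserAgentString("my-crawler/1.0");  // shared by ALL threads

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://example.com/");
        // 8 concurrent crawler threads, but only ONE CrawlConfig for all of them
        controller.start(MyCrawler.class, 8);  // MyCrawler = your WebCrawler subclass
    }
}
```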

This is good for some settings but bad for others. For example, resumableCrawling, maxDepthOfCrawling, or maxPagesToFetch are usefully global across all threads, but proxy settings (proxyHost, proxyUsername, …) or the userAgentString should be definable per thread instance rather than share the same value across all concurrent threads, as I understand your settings interface so far.

In order to realize the desired setup, I have to define a separate CrawlController for each of the preferred number of threads, each with its own CrawlConfig, as sketched below. However, I am not sure whether I can use the same storage folder and resumable crawling, and, most importantly, whether the controllers share their work list, which I assume they do not.
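Roughly what that workaround looks like (again only a sketch, under the assumption that each controller needs its own storage folder and that the controllers do not share a work list; startNonBlocking is used so the controllers run in parallel):

```java
// Workaround sketch: one CrawlController per proxy, one thread each.
String[] proxies = { "proxy1.example.com", "proxy2.example.com" };
for (int i = 0; i < proxies.length; i++) {
    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawl-" + i);  // separate folder per controller?
    config.setProxyHost(proxies[i]);

    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtServer robotstxtServer =
            new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
    CrawlController controller =
            new CrawlController(config, pageFetcher, robotstxtServer);

    controller.addSeed("https://example.com/");
    controller.startNonBlocking(MyCrawler.class, 1);  // one thread per controller
}
```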

This finally brings me to the point that the interface for multiple crawlers is designed wrongly, not to say broken.

Some pull requests that try to work around this design error: #80 #57

Probably the best solution is to distinguish between global settings (CrawlConfig) and per-thread settings that inherit from the global settings (as default values) but can be overridden. Furthermore, the per-thread settings would have fewer options, limited to proxy settings, user agent, delay, and so on. In the class hierarchy, CrawlConfig would inherit from the basic per-thread settings, as sketched below.
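A sketch of what I have in mind (all class names, fields, and the overloaded start method here are suggestions, not existing API):

```java
// Base class: only the settings that make sense per thread.
public class CrawlThreadConfig {
    protected String proxyHost;
    protected String proxyUsername;
    protected String proxyPassword;
    protected String userAgentString;
    protected long politenessDelay;
    // getters/setters omitted for brevity
}

// CrawlConfig keeps the crawl-wide settings and inherits the per-thread
// ones, which then act as defaults for every thread.
public class CrawlConfig extends CrawlThreadConfig {
    private String crawlStorageFolder;
    private boolean resumableCrawling;
    private int maxDepthOfCrawling;
    private int maxPagesToFetch;
    // getters/setters omitted for brevity
}

// The controller could then accept optional per-thread overrides; any
// value not set on a CrawlThreadConfig falls back to the global config:
//
//   controller.start(MyCrawler.class,
//                    Arrays.asList(threadConfig1, threadConfig2));
```

This way a single controller keeps one storage folder and one shared work list, while each thread still gets its own proxy and user agent.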
