subdomain crawl with "allowedDomains" parameter crawls top domain, too #381

Open
@michaelpapesch

When I crawl the subdomain "test.domain.com", result.response.url also includes URLs from "domain.com".
I tried setting "allowedDomains" to the subdomain name as a string and to a regexp, with the same result.
I don't understand why this happens. Shouldn't the "allowedDomains" parameter prevent the crawler from following URLs on other domains?

const HCCrawler = require('headless-chrome-crawler');

// `domain` (e.g. 'test.domain.com'), `url`, and `connection` are defined elsewhere.
(async () => {
    const crawler = await HCCrawler.launch({
        headless: true,
        args: [
            '--ignore-certificate-errors',
            '--no-sandbox',
        ],
        allowedDomains: [domain],
        maxDepth: 8,
        customCrawl: async (page, crawl) => {
            const result = await crawl();
            result.content = await page.content();
            return result;
        },
        onSuccess: result => {
            // URLs from "domain.com" show up here as well.
            const values = [
                result.response.url
            ];
        },
    }); // closes the HCCrawler.launch() call
    await crawler.queue(url);
    await crawler.onIdle();
    await crawler.close().then(() => connection.end());
    console.log('Scan completed.');
})();
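
To narrow this down, it may help to check each reported URL against the allowed list independently of the crawler. The following is only a minimal sketch: `isAllowedHost` is a hypothetical helper, not part of the crawler's API, and it assumes that a string entry should match the hostname exactly while a regexp entry is tested against the hostname only. It says nothing about how the library itself evaluates "allowedDomains".

```javascript
// Hypothetical helper (not part of the crawler library): returns true only
// if the hostname of rawUrl matches one of the entries in allowedDomains.
// Assumption: strings must equal the hostname exactly; regexps are tested
// against the hostname only, not the full URL.
function isAllowedHost(rawUrl, allowedDomains) {
    let hostname;
    try {
        hostname = new URL(rawUrl).hostname.toLowerCase();
    } catch (err) {
        return false; // not a parseable absolute URL
    }
    return allowedDomains.some(rule =>
        rule instanceof RegExp
            ? rule.test(hostname)
            : String(rule).toLowerCase() === hostname
    );
}

// Exact string match: the subdomain passes, the top domain does not.
console.log(isAllowedHost('https://test.domain.com/page', ['test.domain.com'])); // true
console.log(isAllowedHost('https://domain.com/page', ['test.domain.com']));      // false

// An unanchored regexp also accepts the top domain.
console.log(isAllowedHost('https://domain.com/page', [/domain\.com$/]));         // true

// An anchored regexp restricts the match to the subdomain.
console.log(isAllowedHost('https://domain.com/page', [/^test\.domain\.com$/]));  // false
```

Calling such a check inside `onSuccess` on `result.response.url` would show exactly which URLs fall outside the subdomain. The third call also shows one thing worth ruling out on the regexp side: a pattern that is not anchored at the start of the hostname matches "domain.com" as well.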
