Are links with empty href ignored? (button links handled by page js) #373

Open
@YuriGor

What is the current behavior?
It looks like the crawler doesn't call preRequest for links with an empty href.

If the current behavior is a bug, please provide the steps to reproduce

const HCCrawler = require('headless-chrome-crawler');

const seedUrl = 'https://en.comparis.ch/gesundheit/arzt/search?searchcat=doctor';
const capUrl = 'https://en.comparis.ch/gesundheit/arzt';

// Accept empty URLs (e.g. pagination buttons) and anything under capUrl.
const testUrl = (url) => !url || url.startsWith(capUrl);

HCCrawler.launch({
  obeyRobotsTxt: false,
  args: ['--disable-web-security'],
  maxDepth: 2,
  // Log every URL passed to preRequest: '+' if accepted, '-' if skipped.
  preRequest: (options) => {
    const accepted = testUrl(options.url);
    console.log(`${accepted ? '+' : '-'} [${options.url}]`);
    return accepted;
  },
  evaluatePage: () => ({ text: window.document.body.innerText }),
  onSuccess: (result) => {
    // console.log(` === ${result.options.url} === `);
  },
})
  .then((crawler) => {
    crawler.queue(seedUrl);
    crawler.onIdle()
      .then(() => crawler.close());
  });

What is the expected behavior?

I expect the console log to show that empty URLs are tested too, for example those of pagination buttons.
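
For context, here is a minimal sketch of how an empty href resolves under standard URL resolution, using Node's built-in URL class. Whether the crawler resolves collected links this way is an assumption on my side, and `resolveHref` is a hypothetical helper, not part of headless-chrome-crawler.

```javascript
// Sketch only: shows what an empty href turns into when it is resolved
// against the page URL with the WHATWG URL class (a global in Node.js).
// Assumption: the crawler does something similar before calling preRequest.
const pageUrl = 'https://en.comparis.ch/gesundheit/arzt/search?searchcat=doctor';

// Hypothetical helper: resolve an href against a base URL,
// returning null when the combination cannot be parsed.
function resolveHref(href, baseUrl) {
  try {
    return new URL(href, baseUrl).href;
  } catch (err) {
    return null;
  }
}

const resolvedEmpty = resolveHref('', pageUrl);
const resolvedHash = resolveHref('#', pageUrl);

// An empty href resolves to the page URL itself.
console.log(resolvedEmpty === pageUrl);
// A bare '#' resolves to the page URL plus an empty fragment.
console.log(resolvedHash);
```

If links are resolved like this, an empty href would be indistinguishable from the URL of the page that contains it, which might explain why no empty URL ever reaches preRequest. That is a guess, not something verified against the crawler's source.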

What is the motivation / use case for changing the behavior?
To be able to navigate dynamic sites, where links with an empty href attribute are handled by the page's JavaScript.

Please tell us about your environment:

  • Version: 1.8.0
  • Platform / OS version: Ubuntu / 20.04
  • Node.js version: v14.0.0
