Issue created Oct 16, 2020 by Yuri Gor (@YuriGor)

Are links with empty href ignored? (button links handled by page js)

What is the current behavior? It looks like the crawler doesn't call preRequest for links with an empty href.

If the current behavior is a bug, please provide the steps to reproduce

const HCCrawler = require('headless-chrome-crawler');
const seedUrl = 'https://en.comparis.ch/gesundheit/arzt/search?searchcat=doctor';
const capUrl = 'https://en.comparis.ch/gesundheit/arzt';

const testUrl = (url) => !url || url.startsWith(capUrl);
HCCrawler.launch({
  obeyRobotsTxt: false,
  args: ['--disable-web-security'],
  maxDepth: 2,
  // Log every candidate URL ('+' = followed, '-' = skipped), then return the verdict.
  preRequest: (options) => {
    const allowed = testUrl(options.url);
    console.log(`${allowed ? '+' : '-'} [${options.url}]`);
    return allowed;
  },
  evaluatePage: () => ({ text: window.document.body.innerText }),
  onSuccess: (result) => {
    // console.log(` === ${result.options.url} === `);
  },
})
  .then((crawler) => {
    crawler.queue(seedUrl);
    crawler.onIdle()
      .then(() => crawler.close());
  });

What is the expected behavior?

I expect the console log to show that empty URLs are tested as well, for example the links behind pagination buttons.
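
To rule out the filter itself, here is a minimal standalone check of the testUrl predicate from the script above, runnable without the crawler. It only shows that the predicate accepts an empty URL, so an empty-href link would be logged with '+' if preRequest were called for it. formatLogLine is a hypothetical helper introduced for this illustration and is not part of headless-chrome-crawler.

```javascript
// Same predicate as in the reproduction script: an empty or missing URL passes.
const capUrl = 'https://en.comparis.ch/gesundheit/arzt';
const testUrl = (url) => !url || url.startsWith(capUrl);

// Hypothetical helper (not part of headless-chrome-crawler): builds the same
// log line that the preRequest hook in the script prints.
const formatLogLine = (url) => `${testUrl(url) ? '+' : '-'} [${url}]`;

console.log(formatLogLine(''));                     // + []
console.log(formatLogLine('https://example.com/')); // - [https://example.com/]
console.log(formatLogLine(`${capUrl}/search?searchcat=doctor`));
```

Since the predicate lets empty URLs through, the missing log lines suggest that such links never reach preRequest at all, rather than being rejected by it.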

What is the motivation / use case for changing the behavior? To be able to navigate dynamic sites where links with an empty href attribute are handled by the page's JavaScript.

Please tell us about your environment:

  • Version: 1.8.0
  • Platform / OS version: Ubuntu / 20.04
  • Node.js version: v14.0.0