yujiosaka / headless-chrome-crawler · Issues · #357 · Closed
Issue created Nov 05, 2019 by Aleksandr Borkun (@AleksandrBorkun)

preRequest function cuts the entire branch instead of a single page

What is the current behavior?

The preRequest function cuts a whole branch of links when URL regexp filtering is used. Because returning false from preRequest cancels the request, the matching page is never fetched, none of its outgoing links are queued, and every page reachable only through it is dropped as well, not just the single page.

If the current behavior is a bug, please provide the steps to reproduce

// CONSUMER_PROFILE_REGEXP: matches public user profile URLs
const isVisitedMap = { '\\.com\\/users\\/[\\d]+[\\w]+': false };

// Returns true if a URL matching one of the patterns has already been
// seen once; the first matching URL marks its pattern as visited.
function isVisited(url) {
  for (const [key, value] of Object.entries(isVisitedMap)) {
    if (new RegExp(key, 'g').test(url)) {
      if (value) {
        return value;
      } else {
        isVisitedMap[key] = true;
        return false;
      }
    }
  }
  return false;
}
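
For illustration, this is how isVisited behaves on consecutive URLs (the example URLs are made up):

// First URL matching the profile pattern: allowed, pattern marked as seen
console.log(isVisited('https://www.example.com/users/123abc')); // false
// Any later URL matching the same pattern is treated as visited
console.log(isVisited('https://www.example.com/users/456def')); // true
// URLs matching no pattern are never skipped
console.log(isVisited('https://www.example.com/about'));        // false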

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    headless: false,
    // Function deciding whether each queued URL should be requested
    preRequest: async opt => !isVisited(opt.url),
    // Function to be evaluated in browsers
    evaluatePage: () => ({
      pagePath: window.location.href,
    }),
    // Function to be called with evaluated results from browsers
    onSuccess: result => {
      console.log('pagePath:', result.result.pagePath);
    },
    exporter, // exporter instance defined elsewhere
  });

  // Queue a request
  await crawler.queue({
    url: 'https://www.example.com/',
    maxDepth: 4,
    userAgent: 'DuckDuckBot',
    allowedDomains: [/example\.com$/],
    skipDuplicates: true,
  });
  await crawler.onIdle(); // Resolved when no queue is left
  await crawler.close(); // Close the crawler
})();
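
For what it's worth, preRequest receives the options of each queued request, and returning false appears to skip that request entirely. A small logging variant of the hook (a debugging sketch with a hypothetical name, not part of the original report) makes it easy to see which URLs, and therefore which branches, are being dropped:

// Pass this as the preRequest option in the snippet above
const loggingPreRequest = async options => {
  const skip = isVisited(options.url);
  if (skip) console.log('skipping (and its whole branch):', options.url);
  return !skip;
};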

What is the expected behavior?

Only the single page that matches the regexp is skipped.

What is the motivation / use case for changing the behavior?

If you try to crawl websites that give public access to user profile pages (or any other entity pages), you will probably want to skip all of the user profile links except one, because they are all similar. I thought I could skip a single page by returning false from the preRequest function; a possible workaround is sketched below.
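
Until this is fixed, one workaround might be to let matching pages be fetched, so their links are still queued, and apply the visited check in onSuccess instead, only suppressing the duplicate output. A minimal sketch, reusing the isVisited helper above and assuming onSuccess's result object exposes the queued options as result.options:

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: () => ({
      pagePath: window.location.href,
    }),
    onSuccess: result => {
      // The page was fetched either way, so its links were still queued;
      // only the repeated output for similar profile pages is dropped.
      if (isVisited(result.options.url)) return;
      console.log('pagePath:', result.result.pagePath);
    },
  });
  await crawler.queue({ url: 'https://www.example.com/', maxDepth: 4, skipDuplicates: true });
  await crawler.onIdle();
  await crawler.close();
})();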

Please tell us about your environment:

  • Version: "headless-chrome-crawler": "^1.8.0"
  • Platform / OS version: Windows 10
  • Node.js version: v10.16.3 (npm 6.9.0)