yujiosaka / headless-chrome-crawler · Issue #198
Closed
Issue created Apr 04, 2018 by Anthony Pessy (@panthony)

Suggestion: robots.txt shouldn't be reparsed every time

What is the current behavior?

The robots.txt is re-parsed for every request, but these files can be large.

Today, Google only reads the first 500 KB and ignores the rest.

What is the expected behavior?

Maybe the crawler could keep up to N parsed robots.txt instances in a cache. That should give a strong cache hit rate without letting the cache grow forever. See the sketch below for one possible shape of this.
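
As a rough illustration, a bounded LRU cache keyed by origin could look something like this. The `parseRobotsTxt` helper and the cap of 100 entries are hypothetical placeholders, not part of the crawler's current API:

```js
// Minimal sketch of an LRU-style cache for parsed robots.txt, capped at N origins.
// parseRobotsTxt() is a hypothetical stand-in for whatever parser the crawler uses.
class RobotsCache {
  constructor(maxEntries = 100) {
    this._maxEntries = maxEntries;
    this._cache = new Map(); // Map preserves insertion order, so it can act as an LRU
  }

  get(origin, robotsTxtBody) {
    if (this._cache.has(origin)) {
      // Cache hit: refresh recency by re-inserting the entry at the end
      const parsed = this._cache.get(origin);
      this._cache.delete(origin);
      this._cache.set(origin, parsed);
      return parsed;
    }
    // Cache miss: parse once and store the result
    const parsed = parseRobotsTxt(origin, robotsTxtBody);
    this._cache.set(origin, parsed);
    if (this._cache.size > this._maxEntries) {
      // Evict the least recently used entry (the oldest key in the Map)
      const oldestKey = this._cache.keys().next().value;
      this._cache.delete(oldestKey);
    }
    return parsed;
  }
}
```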

What is the motivation / use case for changing the behavior?

Although I couldn't find that particular robots.txt again, I have already seen ones that were easily > 1 MB.

Overall performance could take a serious hit if such a file were reparsed for every single request.
