yujiosaka / headless-chrome-crawler / Issues / #118
Closed
Open
Issue created Feb 22, 2018 by Administrator (@root, Contributor)

[Feature Request] Add support for WARC file format

Created by: ibnesayeed

WARC is a well-known format for storing crawled captures. It can store an arbitrary number of HTTP requests and responses, as well as other network interactions such as DNS lookups, together with their headers, payloads, and other metadata. It is mostly used by web archives, but it has other use cases as well. WARC is the default format in which the Heritrix crawler (originally developed by the Internet Archive) stores captures, and Wget supports it too. Other tools include WARCreate (a Chrome extension that saves web pages, along with all their page requisites, in WARC format while browsing) and Squidwarc (a Headless Chrome-based crawler built specifically for archival purposes).

Adding support for the WARC format would therefore immediately make this project more useful to the web archiving community.
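
To make the request more concrete, here is a minimal sketch of how one captured HTTP response could be serialized as a single WARC record. It assumes the WARC/1.0 layout: a version line, named header fields, a blank line, the captured bytes, then two CRLFs. The function name `build_warc_response_record` is hypothetical and is not part of headless-chrome-crawler or any existing library, and Python is used only to keep the sketch short.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_response_record(target_uri, http_response, record_date=None):
    """Serialize one raw HTTP response (status line, headers, body) as a
    WARC/1.0 'response' record. Hypothetical helper, for illustration only."""
    if record_date is None:
        record_date = datetime.now(timezone.utc)

    # Assumption: record_date is in UTC; it is formatted with a trailing "Z".
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", record_date.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length counts only the record block (the captured bytes).
        ("Content-Length", str(len(http_response))),
    ]

    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF

    # A blank line separates the WARC headers from the block;
    # two CRLFs terminate the record.
    return head + CRLF + http_response + CRLF + CRLF


# Example: wrap a tiny captured response and append it to a .warc file.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 2\r\n\r\nok"
record = build_warc_response_record("http://example.com/", raw)
with open("capture.warc", "ab") as warc_file:
    warc_file.write(record)
```

A real implementation would likely also write `warcinfo` and `request` records and support per-record gzip compression; those are left out of this sketch.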
