yujiosaka / headless-chrome-crawler / Issues / #186 (Closed)
Issue created Mar 29, 2018 by Tom Nielsen (@tomnielsen)

Suggestion: BaseCache API is confusing and not efficient.

Background: I LOVE this project! I tried to write my own BaseCache implementation backed by LevelDB, and I have some general feedback.

What is the current behavior? The difference between get(key), set(key, value), enqueue(key, value, priority), and dequeue(key) is very confusing. The key seems to be more of a keyspace than a key.

Since the underlying Chrome browser has its own cache, the term "cache" is overloaded. For example, does the clearCache setting clear Chrome's cache or the persistent priority queue?

What is the expected behavior? I expected the API to look more like a standard priority queue API (similar to the PriorityQueue class used internally in the code), but the BaseCache API does not.

Here's what the current API looks like (for reference).

class BaseCache {
    init();
    close();
    
    clear();
    
    get(key);
    set(key, value);
    
    enqueue(key, value, priority);
    dequeue(key);
 
    size(key);
    remove(key);
}

Maybe I'm missing something, but why does the outside caller need to know anything about what key the queue is using?

How is get() / set() supposed to be used outside the class compared to enqueue() and dequeue()?

I kind of expected an API for persisting a queue to look more like this:

class BasePersistentPriorityQueue {
    init(namespace);
    close();

    clear();
    
    enqueue([{value, priority}]);
    dequeue();
 
    peek();
    queue_length();
}

Notice that enqueue() takes an array, like enqueue([{value: value, priority: priority}, ...]), since batch queuing might be supported by the underlying storage mechanism and could significantly improve performance.
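To make the proposal concrete, here is a minimal in-memory sketch of that interface. It is only an illustration: the class name InMemoryPriorityQueue is hypothetical, the methods are synchronous for brevity (a real persistent backend would presumably return Promises), and it assumes that a larger priority number is dequeued first.

```javascript
// Hypothetical in-memory sketch of the proposed interface.
// Not part of headless-chrome-crawler; names and ordering are assumptions.
class InMemoryPriorityQueue {
  init(namespace) {
    // The namespace is fixed once here, so callers never pass a key again.
    this.namespace = namespace;
    this.items = [];
  }

  close() {
    // Nothing to release for an in-memory queue.
  }

  clear() {
    this.items = [];
  }

  // Takes a batch of {value, priority} entries so that a real backend
  // could persist the whole batch in a single write.
  enqueue(entries) {
    this.items.push(...entries);
    // Assumption: entries with a larger priority are dequeued first.
    this.items.sort((a, b) => b.priority - a.priority);
  }

  dequeue() {
    const head = this.items.shift();
    return head === undefined ? null : head.value;
  }

  peek() {
    return this.items.length > 0 ? this.items[0].value : null;
  }

  queue_length() {
    return this.items.length;
  }
}

const queue = new InMemoryPriorityQueue();
queue.init('crawl');
queue.enqueue([
  { value: 'https://example.com/a', priority: 1 },
  { value: 'https://example.com/b', priority: 2 },
]);
console.log(queue.dequeue()); // https://example.com/b
```

The caller only ever sees values and priorities; which key or keyspace the queue uses stays an internal detail of the implementation.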

Higher-level code enqueues all links found on the page in a tight loop, which can and should be done as a batch. From the existing code: each(urls, url => void this._push(extend({}, options, { url }), depth));

This loops over potentially hundreds of links found on a page. As it is now, each call reads, modifies, and writes a single hotspot key. For a shared network queue, this has really horrible performance implications.
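A rough way to see the cost is to count storage round trips. The sketch below uses a hypothetical CountingStore in place of a real backend and assumes the whole queue lives under one key, as described above; the helper names are made up for illustration.

```javascript
// Hypothetical key-value store that counts round trips instead of doing I/O.
class CountingStore {
  constructor() {
    this.data = new Map();
    this.reads = 0;
    this.writes = 0;
  }

  get(key) {
    this.reads += 1;
    return this.data.get(key) || [];
  }

  set(key, value) {
    this.writes += 1;
    this.data.set(key, value);
  }
}

// One read-modify-write cycle per URL, all against the same key.
function enqueueOneByOne(store, key, urls) {
  for (const url of urls) {
    const queue = store.get(key);
    store.set(key, queue.concat([url]));
  }
}

// One read-modify-write cycle for the whole batch.
function enqueueBatch(store, key, urls) {
  const queue = store.get(key);
  store.set(key, queue.concat(urls));
}

const urls = Array.from({ length: 200 }, (_, i) => `https://example.com/page/${i}`);

const perItem = new CountingStore();
enqueueOneByOne(perItem, 'queue', urls);

const batched = new CountingStore();
enqueueBatch(batched, 'queue', urls);

console.log(perItem.reads + perItem.writes); // 400
console.log(batched.reads + batched.writes); // 2
```

With a networked store, each of those round trips would carry latency and contend on the same key, which is why the batch form matters more there than in this in-memory illustration.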

What is the motivation / use case for changing the behavior? Performance and readability.

Please tell us about your environment:

  • Version: 1.5.0
  • Platform / OS version: Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64
  • Node.js version: v6.13.0