Puppeteer Cluster: Basic & Advanced Setup Explained
Updated on
February 7, 2024
Puppeteer Guides



If you need to perform mass scraping across various websites and traditional scraping methods don't scale for you, this article is for you. To support mass scraping, Puppeteer developers can use a library called puppeteer-cluster. Puppeteer-cluster creates a pool of Chromium instances that can handle a large number of requests efficiently. In this setup, each Chromium instance in the cluster can be directed to a different website, letting you collect data from multiple streams simultaneously.

In this article, we will discuss how to use Puppeteer Cluster for mass scraping, how to configure it, and, finally, how to fix common errors.

What is the function of Puppeteer Cluster?

Suppose you are attempting large-scale web scraping and have to manually manage multiple instances of Puppeteer. In a typical scenario, such as scraping data from multiple e-commerce sites for price comparison, you would need to write separate scripts for each Puppeteer instance. You will have to initiate and control them individually. This task can be very challenging, as managing the synchronization of these instances becomes complex, especially when dealing with hundreds of pages. You will also have to handle errors and crashes in individual instances, and resource allocation must be manually optimized to prevent overloading the system.

Puppeteer Cluster solves this problem for you. It allows you to create a pool or 'cluster' of browser instances. Instead of manually handling each browser instance, Puppeteer Cluster automates this process. Once you've established a cluster, you can assign tasks to it. The cluster then distributes these tasks across its pool of browser instances. It does so in a way that optimizes resource usage and maximizes efficiency. One of the key features of Puppeteer Cluster is its ability to control concurrency. You can set how many tasks are run in parallel, which helps in managing system resources and prevents overloading your server.
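Conceptually, the cluster's scheduler behaves like a concurrency-limited task pool: workers pull tasks from a shared queue until it drains, with at most a fixed number running at once. The following plain-Node sketch illustrates the idea only; it is not puppeteer-cluster code, and the `runPool` name is made up for illustration.

```javascript
// Minimal sketch of what a cluster scheduler does: run queued tasks with
// at most `maxConcurrency` in flight at any time.
async function runPool(tasks, maxConcurrency) {
  const results = [];
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  // Start up to maxConcurrency workers and wait for all of them to drain.
  const workers = Array.from(
    { length: Math.min(maxConcurrency, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}

// Example: four fake "scraping" tasks, at most two running at a time.
const tasks = [1, 2, 3, 4].map(
  (n) => () => new Promise((resolve) => setTimeout(() => resolve(n * 10), 10))
);
runPool(tasks, 2).then((results) => console.log(results.join(','))); // 10,20,30,40
```

Because results are stored by task index, the output order matches the queue order even though tasks finish concurrently, which is the same guarantee you get from queueing URLs in a fixed order.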

Moreover, the Puppeteer Cluster includes mechanisms for error handling. If a browser instance encounters an error or crashes, the cluster can automatically retry the task with another instance. But that’s not all. The cluster can be scaled according to the requirements of your task. Whether you need to handle tens, hundreds, or even more pages simultaneously, Puppeteer Cluster can scale up to meet this demand.

Now that you have a good idea about the importance of Puppeteer Cluster, let's move on to see how to use it.

Prerequisites

To install Puppeteer Cluster, you need the following tools:

  • Node.js - You can download and install it from the official Node.js website.
  • npm (Node Package Manager) - npm comes bundled with Node.js.
  • Puppeteer - Puppeteer Cluster depends on Puppeteer, so install it first with the command `npm install puppeteer`.

How to install Puppeteer cluster?

You can install puppeteer-cluster by running the following command:

npm install puppeteer-cluster

This command installs puppeteer-cluster along with all its required dependencies from the npm registry. To check whether the installation was successful, you can run the following command:

npm list puppeteer-cluster

This should list puppeteer-cluster among the installed packages.

Basic setup of Puppeteer cluster

Let’s see how to use puppeteer-cluster for scraping. In this example, we will run two browsers in parallel. 


const { Cluster } = require('puppeteer-cluster');

(async () => {
    // Launch a cluster of two workers, each in its own browser context
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2,
    });

    // Define the task every worker runs for each queued URL
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        const title = await page.title();
        console.log(`Title of ${url}: ${title}`);
    });

    // Queue URLs; the cluster distributes them across available workers
    cluster.queue('https://www.wikipedia.org');
    cluster.queue('https://www.github.com');

    await cluster.idle();  // wait for all queued tasks to finish
    await cluster.close(); // shut down all browser instances
})();

In this script, we initialize a new cluster using the Cluster.launch method. We set the concurrency to Cluster.CONCURRENCY_CONTEXT, meaning each browser runs in its own context, similar to incognito sessions. The maxConcurrency option is set to 2, allowing two browser instances to run simultaneously. You can change this number to fit your task.

The cluster.task method defines what each browser instance should do. Here, it goes to a given URL, retrieves the page title, and logs it to the console. You will have to change this logic according to the task at hand.

We then queue two URLs using cluster.queue. Puppeteer Cluster will automatically distribute these URLs to the available browser instances. Here, you can queue as many URLs as you want.

Finally, cluster.idle waits for all queued tasks to be completed, and cluster.close shuts down the cluster and closes all browser instances.
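The example queues URLs one by one; in practice you will often load them from a file and queue each entry in a loop. Here is a minimal helper for that in plain Node; the newline-separated format and the `loadUrls` name are assumptions for illustration.

```javascript
// Split newline-separated text into a clean URL list; each entry can then
// be passed to cluster.queue(url) inside the script above.
function loadUrls(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Example with an inline string standing in for the contents of a file
// (in a real script, read it with fs.readFileSync('urls.txt', 'utf8')):
const urls = loadUrls('https://www.wikipedia.org\n\nhttps://www.github.com\n');
console.log(urls.length); // 2
```

You would then call `cluster.queue(url)` for each entry before awaiting `cluster.idle()`.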

Advanced setup of Puppeteer cluster

Let's discuss some advanced ways of using puppeteer-cluster for more nuanced tasks or when facing anti-scraping measures.

Configuring concurrency model

The concept of concurrency mode plays a crucial role when working with Puppeteer-Cluster. It defines how tasks such as opening pages and running scripts are managed and isolated from each other.

Puppeteer-Cluster offers you the option to select from three concurrency models. Based on your use case, you must choose the most suitable concurrency model. These three options are:

1) CONCURRENCY_PAGE: In this model, each worker in the cluster manages a single page within a browser instance. It's suitable for tasks that can operate in the same browser context but need separate pages.

Use case: Suppose you are developing a script to scrape product prices from various e-commerce websites. The goal is to open multiple product pages and extract pricing information. Each worker can handle a separate product page within the same browser instance. Since all these pages can share the same browser context (for example, cookies from the e-commerce site), this model is efficient.

2) CONCURRENCY_CONTEXT: Here, each worker handles a distinct browser context within the same browser instance. This setup is ideal when tasks require isolation from each other but don’t need individual browser instances.

Use case: Imagine a task where you need to scrape data from multiple social media profiles that require different login sessions. This model is perfect for managing multiple sessions simultaneously: each context can log in to a different account, because contexts isolate session data such as cookies and local storage.

3) CONCURRENCY_BROWSER: In this approach, each worker controls a whole browser instance. It provides complete isolation at the browser level. This is the most resource-intensive option but provides the highest level of isolation and flexibility.

Use case: This is a suitable model when tasks must not share any browser-level state at all, or when a crash or misconfiguration in one task must never affect the others.

To select a concurrency model in Puppeteer-Cluster, you set the concurrency property when launching the cluster. Here is an example. 


    const clusterPage = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 5,
    });

If you want to use CONCURRENCY_CONTEXT or CONCURRENCY_BROWSER, use the following code instead.


    const clusterContext = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 5,
    });


    const clusterBrowser = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: 5, 
    });

Apart from the above three methods, puppeteer-cluster allows you to create your own concurrency implementation as well. 

Custom task execution and Error Handling

Users can also create complex custom task functions with error handling built into each function. This ensures that transient failures, such as network errors or browser instance crashes, don't silently drop tasks. Users can also customize the retry behavior and timeout thresholds.

Here's an example.


cluster.task(async ({ page, data }) => {
    try {
        await page.goto(data.url, { timeout: 60000 });
    } catch (error) {
        if (error.name === 'TimeoutError') {
            console.error('Navigation timed out, retrying...');
            cluster.queue(data); // re-queue the task (consider capping retries)
        } else {
            console.error('Task failed:', error);
        }
    }
});
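Note that re-queueing a timed-out task without limit can loop forever on a permanently broken URL. To cap the number of attempts, you can wrap the flaky operation in a small retry helper. Below is a sketch of that pattern in plain Node; `withRetries` is a made-up name for illustration, not a puppeteer-cluster API.

```javascript
// Run fn, retrying up to maxRetries extra times after the first failure.
async function withRetries(fn, maxRetries) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      console.error(`Attempt ${attempt + 1} failed: ${error.message}`);
    }
  }
  throw lastError; // all attempts exhausted
}

// Example: a task that fails twice, then succeeds on the third attempt.
let calls = 0;
withRetries(async () => {
  calls++;
  if (calls < 3) throw new Error('TimeoutError');
  return 'page scraped';
}, 3).then((result) => console.log(result)); // prints "page scraped"
```

Inside `cluster.task`, you would wrap the `page.goto` call in such a helper instead of re-queueing unconditionally.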

Apart from the above, enabling verbose logging or the small command-line monitor in Puppeteer Cluster can be particularly useful for debugging and gaining detailed insights when an error occurs. Setting a DEBUG environment variable is also a common practice for enabling detailed logging, as it lets you control what level of logging appears in your console. This is incredibly useful for both development and troubleshooting.
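For example, the built-in monitor is enabled by passing `monitor: true` to `Cluster.launch`, and since puppeteer-cluster uses the `debug` package, its verbose output can be switched on via the DEBUG environment variable (the script name below is an assumption):

```shell
# Enable puppeteer-cluster's namespaced debug output for one run
DEBUG='puppeteer-cluster:*' node scrape.js
```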

Fixing common errors

When working with Puppeteer Cluster, some errors occur frequently, especially in complex web scraping or automation tasks. Next, we will look into three common errors and how to address them.

TimeoutError: Navigation Timeout Exceeded

This happens when a page takes too long to load. Some of the possible causes for this are slow network conditions, heavy page content, or unresponsive servers.

Fixes:

  • Increase the navigation timeout, e.g. `await page.goto(url, { timeout: 60000 })`, or wait for a specific element with page.waitForSelector() instead of the full page load.
  • Investigate the cause: slow network conditions, heavy page content, or unresponsive servers.
  • Implement an automatic retry or page-refresh process.
  • Depending on the site's reliance on JavaScript, try disabling it on the accessed page with page.setJavaScriptEnabled(false).

If you want to learn more about page reloads in Puppeteer, follow the tutorial Page Reload in Puppeteer: Methods & Errors Explained.

Error: Page Closed Unexpectedly

This happens due to unhandled exceptions in the task, browser instance crashes, or manual closure of the browser or Chromium instances. In Puppeteer Cluster, automatically restarting after crawling errors is a crucial aspect, especially when dealing with large-scale scraping tasks.

Fixes:

  • Error handling: wrap task logic in try/catch so an unhandled exception cannot take the page down.
  • Resource management: keep maxConcurrency within what your machine's CPU and memory can sustain.
  • Avoid closing the browser manually; let cluster.close() shut everything down.

Error: Failed to launch the browser process!

This error happens due to conflicting software like antiviruses, missing dependencies, incorrect Chrome installation, or insufficient permissions.

Fixes:

  • Check for conflicting software, such as antivirus tools, and disable it temporarily.
  • Install any missing system dependencies and ensure the Chrome/Chromium installation is compatible with Puppeteer.
  • Run Puppeteer with elevated permissions only if necessary.

Conclusion

In this article, we have seen how Puppeteer Cluster stands as a powerful tool for mass scraping, offering the capability to handle many web scraping tasks in parallel. We have also discussed advanced uses of Puppeteer Cluster, such as configuring the concurrency models. You can do more to improve the efficiency of your scraping tasks with Puppeteer Cluster. For example, it offers a built-in monitoring feature that reports statistics, providing valuable insights about your scraping activities. Finally, by following proper error-handling methods, you can increase the success of your overall projects.

It's important to know that when you are scraping, you need a reliable proxy solution. Webshare provides you with 10 free proxies just for signing up. Using their reliable proxies will help you efficiently carry out your scraping activities without getting blocked.

Continue reading

The following resources will help you to learn more about Puppeteer Cluster with more visual or in-depth advice and examples:

  1. https://www.npmjs.com/package/puppeteer-cluster 
  2. https://github.com/thomasdondorf/puppeteer-cluster
  3. https://m.youtube.com/watch?v=NcPJD2ofYG8
