Using Puppeteer on AWS Lambda for Scraping

Puppeteer is a popular tool for browser automation. It is widely used for scraping and other web automation tasks like application testing. However, you can increase its efficiency and effectiveness by pairing it with complementary services. One such service is AWS Lambda, which is the focus of this article.

What is AWS Lambda?

AWS Lambda is a serverless computing service from Amazon Web Services. It follows an event-driven architecture, which means you can use it to run code in response to events. You don’t need to provision or manage servers, as AWS manages them for you.

The code you run on the AWS Lambda platform is packaged as Lambda functions. Once you deploy a Lambda function, the platform executes it in response to events or triggers, such as changes in an Amazon S3 bucket, updates to a DynamoDB table, or HTTP requests via Amazon API Gateway.
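To make the trigger types above more concrete, here is a minimal sketch of how a function might tell them apart by inspecting the incoming event. The helper name detectEventSource is hypothetical, and only a few well-known event fields are checked; real events carry many more properties.

```javascript
// Hypothetical helper: classify an incoming Lambda event by its shape.
function detectEventSource(event) {
    const record = event && Array.isArray(event.Records) ? event.Records[0] : undefined;

    // S3 notifications and DynamoDB stream events both arrive as a Records array
    if (record && record.eventSource === 'aws:s3') return 's3';
    if (record && record.eventSource === 'aws:dynamodb') return 'dynamodb';

    // API Gateway REST proxy events carry httpMethod;
    // HTTP API (payload v2) events carry requestContext.http
    const isHttp = event && (event.httpMethod || (event.requestContext && event.requestContext.http));
    if (isHttp) return 'api-gateway';

    return 'unknown';
}
```

A handler could call this first and then decide, for example, which URL to scrape based on where the request came from.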

Also, AWS Lambda is extremely good at scaling. It scales automatically by running your code in response to each trigger, up to the concurrency limits of your account, so your code can be triggered thousands of times per second. You can write Lambda functions in multiple programming languages, including Node.js, Python, Ruby, Java, Go, .NET, and more.

Finally, Lambda follows a pay-per-use pricing model, where you are charged based on the number of requests for your functions and the time your code executes, scaled by the memory you allocate. This suits bursty or infrequent scraping jobs, but it can cost more than a dedicated server for workloads that run continuously.
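As a rough illustration of how the pay-per-use model adds up, the sketch below estimates a bill from request count, duration, and memory. The function name and the rates passed in are assumptions for illustration only; check the current AWS pricing page for your region, and note that the free tier is ignored here.

```javascript
// Hypothetical estimator: request charge plus compute charge (GB-seconds).
function estimateLambdaCost({ invocations, avgDurationMs, memoryMb, pricePerMillionRequests, pricePerGbSecond }) {
    const requestCost = (invocations / 1e6) * pricePerMillionRequests;
    // Compute is billed on duration multiplied by allocated memory
    const gbSeconds = invocations * (avgDurationMs / 1000) * (memoryMb / 1024);
    const computeCost = gbSeconds * pricePerGbSecond;
    return { gbSeconds, requestCost, computeCost, total: requestCost + computeCost };
}

// Example: 10,000 scraping runs of 20 seconds each at 1,024 MB,
// using assumed rates of $0.20 per million requests and $0.0000166667 per GB-second
const estimate = estimateLambdaCost({
    invocations: 10000,
    avgDurationMs: 20000,
    memoryMb: 1024,
    pricePerMillionRequests: 0.20,
    pricePerGbSecond: 0.0000166667,
});
console.log(estimate.total.toFixed(2)); // about 3.34 with the assumed rates
```

Because Puppeteer needs a fair amount of memory and time per page, the compute part usually dominates the request part in estimates like this one.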

What are the benefits of using AWS Lambda with Puppeteer?

Firstly, Lambda functions can be triggered by AWS services like S3, DynamoDB, or CloudWatch. This means your scraping tasks can start automatically in response to specific events, like changes in a database, scheduled times, or other AWS triggers. This is often considered the main use case of AWS Lambda for Puppeteer developers.

Secondly, AWS Lambda can handle multiple instances of your scraping scripts concurrently. This is particularly beneficial for large-scale scraping operations.

Thirdly, with AWS Lambda, you can schedule scraping activities to run at specific times using Amazon CloudWatch Events (now Amazon EventBridge). This is useful for scraping websites that require data to be collected at regular intervals.

Fourthly, you can use other AWS services more easily along with Puppeteer scripts. For example, you can store scraped data directly in Amazon S3, load it into Amazon RDS or DynamoDB for further processing, and trigger other AWS services based on the scraping results.
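The concurrency benefit mentioned above usually depends on splitting the work so that each invocation handles a small slice of it. Here is a minimal sketch of that idea; chunkUrls is a hypothetical helper, and how each batch is dispatched (for example through a queue or an asynchronous invocation) is left out.

```javascript
// Hypothetical helper: split a list of URLs into batches,
// one batch per Lambda invocation.
function chunkUrls(urls, batchSize) {
    if (!Number.isInteger(batchSize) || batchSize < 1) {
        throw new RangeError('batchSize must be a positive integer');
    }
    const batches = [];
    for (let i = 0; i < urls.length; i += batchSize) {
        batches.push(urls.slice(i, i + batchSize));
    }
    return batches;
}

// Each batch could become the payload of one invocation, e.g. { urls: batch }
const payloads = chunkUrls(
    ['https://website.com/a', 'https://website.com/b', 'https://website.com/c'],
    2
).map((batch) => ({ urls: batch }));
```

Smaller batches finish faster and fail more cheaply, while larger ones spend less time on browser start-up per page, so the right size depends on the target site.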

Packages & Prerequisites

To use Puppeteer with AWS Lambda, make sure you have the following.

An AWS account to access AWS Lambda and other AWS services. Be mindful of the AWS region where you deploy your Lambda functions.
An IAM role with the necessary permissions for Lambda to execute and interact with other AWS services.
Node.js and npm (Node Package Manager).
The puppeteer-core package.
Optionally, Puppeteer Extra and its plugins to improve scraping operations.

Example configuration

Let's get started with Puppeteer on AWS Lambda. The following steps will guide you through creating and running a Lambda function for scraping or other automation purposes.

1. Create an AWS Lambda function

Go to the AWS Console and create a new Lambda function.
Choose a runtime compatible with Puppeteer (Node.js)
Set the memory to a higher value, as Puppeteer can be resource-intensive. Starting with 1,024 MB is a good idea.
Set the timeout to a higher value to accommodate the time Puppeteer might need. A 1-3 minute timeout could be a starting point.
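Whatever timeout you choose, a function that scrapes several pages can check how much time it has left and stop early instead of being cut off mid-page. The sketch below assumes the handler receives the standard Lambda context object, whose getRemainingTimeInMillis() method reports the time left; the helper names, the 10-second margin, and the scrapeOne callback are hypothetical.

```javascript
// Returns true while the invocation still has more than `marginMs` left.
function hasTimeLeft(context, marginMs = 10000) {
    return context.getRemainingTimeInMillis() > marginMs;
}

// Scrape URLs one by one, skipping the rest once the time budget runs low.
// `scrapeOne` is any async function that scrapes a single URL.
async function scrapeWithinBudget(urls, context, scrapeOne) {
    const results = [];
    const skipped = [];
    for (const url of urls) {
        if (!hasTimeLeft(context)) {
            skipped.push(url); // leave these for a later invocation
            continue;
        }
        results.push(await scrapeOne(url));
    }
    return { results, skipped };
}
```

Returning the skipped URLs lets the caller re-queue them, which tends to be cheaper than raising the timeout for every run.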

2. Package your code and dependencies

AWS Lambda environments don't come with Chromium pre-installed. Therefore, we have to bundle a Chromium binary that is compatible with Lambda's environment. You can use a precompiled binary like chrome-aws-lambda by installing it with the command `npm install chrome-aws-lambda`.
As we will be using chrome-aws-lambda, we should install puppeteer-core (not the puppeteer package), which does not download its own copy of Chrome. It can be installed using `npm install puppeteer-core`. chrome-aws-lambda lists puppeteer-core as a peer dependency, so keep the two versions in step.
Create a Node.js script that uses Puppeteer for scraping or testing. Make sure to use puppeteer-core and the Chromium binary from chrome-aws-lambda.

Here is a basic Node.js script for an AWS Lambda function that uses Puppeteer.


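const chromium = require('chrome-aws-lambda');
const puppeteer = require('puppeteer-core');

exports.handler = async (event) => {
    let browser = null;

    try {
        // Launch the Lambda-compatible Chromium build shipped with chrome-aws-lambda
        browser = await puppeteer.launch({
            args: chromium.args,
            defaultViewport: chromium.defaultViewport,
            executablePath: await chromium.executablePath,
            headless: chromium.headless,
        });

        const page = await browser.newPage();
        // Use the URL from the event if provided, otherwise a placeholder
        await page.goto(event.url || 'https://website.com');

        // An async handler reports success through its return value
        return await page.title();
    } catch (error) {
        // Log and rethrow so Lambda marks the invocation as failed
        console.error('Puppeteer run failed:', error);
        throw error;
    } finally {
        // Always close the browser so the Chromium process does not linger
        if (browser !== null) {
            await browser.close();
        }
    }
};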

Here’s what the code above does, in brief:

Launch Browser: Initializes a headless browser session on AWS Lambda using Puppeteer and Chromium configured for Lambda's environment.
Navigate and Act: Opens a new page, navigates to a specified URL (default or event-provided), and performs an action (retrieves the page title).
Error Management: Catches and handles any errors during execution, ensuring graceful failure.
Resource Cleanup: Closes the browser to free up resources, then returns the action's result (the page title) or an error message.

Then, zip your code along with the node_modules directory.

3. Deploy your Lambda function

Upload your ZIP file to the Lambda function you created.
Ensure the handler setting matches your file and export. For example, if your file is named index.js and your export is handler, set the handler to index.handler.

Note that if you have complex dependencies or larger packages, you may need to use a container image instead of a ZIP file. Also, depending on your use case, you may want to set environment variables in the Lambda function configuration to manage dynamic values like URLs and API keys.
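As a sketch of the environment-variable approach just mentioned, the function below reads its settings from process.env and falls back to defaults. The variable names TARGET_URL, NAV_TIMEOUT_MS, and API_KEY are hypothetical; use whatever names you set in the Lambda configuration.

```javascript
// Hypothetical config loader: read dynamic values from environment variables.
function loadConfig(env = process.env) {
    const timeout = Number.parseInt(env.NAV_TIMEOUT_MS, 10);
    return {
        // Placeholder default URL, matching the example handler
        targetUrl: env.TARGET_URL || 'https://website.com',
        // Fall back to 30 seconds when the variable is missing or not a number
        navTimeoutMs: Number.isNaN(timeout) ? 30000 : timeout,
        apiKey: env.API_KEY || null,
    };
}
```

Inside the handler you could then write something like `const { targetUrl } = loadConfig();` and pass targetUrl to page.goto(), so the same deployment package works for different targets.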

Using Puppeteer on AWS Lambda for scraping

As we mentioned earlier, Puppeteer on AWS Lambda suits scraping well because it offers cost-effective scalability and relatively easy management. Let’s dig a little deeper into the basics of how data scraping can be done using Puppeteer on AWS Lambda.

Example use case: News website content scraping

Consider a scenario where you need to scrape the latest news headlines, summaries, and URLs from a popular online news portal. This data might be used for content aggregation, analysis, or keeping track of particular topics.

Our scraping objectives would be:

Go to the news portal's homepage.
Identify and extract the headlines.
Scrape the summary or the first paragraph of each news article.
Collect the URLs of the full articles.

Here is an example Node.js script using Puppeteer for scraping articles from a news website. This script is intended to be deployed as an AWS Lambda function.


const chromium = require('chrome-aws-lambda');
const puppeteer = require('puppeteer-core');

exports.handler = async (event) => {
    let browser = null;
    try {
        browser = await puppeteer.launch({
            args: chromium.args,
            defaultViewport: chromium.defaultViewport,
            executablePath: await chromium.executablePath,
            headless: chromium.headless,
        });

        const page = await browser.newPage();
        await page.goto('https://news-website.com');

        // Scrape headlines, summaries, and URLs.
        // The selectors are placeholders; adjust them to the target site's markup.
        const articles = await page.evaluate(() => {
            const articleNodes = document.querySelectorAll('.news-article');

            return Array.from(articleNodes).map((article) => {
                // Guard against articles that lack one of the expected elements
                const headlineNode = article.querySelector('h2');
                const summaryNode = article.querySelector('p.summary');
                const linkNode = article.querySelector('a');

                return {
                    headline: headlineNode ? headlineNode.innerText : null,
                    summary: summaryNode ? summaryNode.innerText : null,
                    url: linkNode ? linkNode.href : null,
                };
            });
        });

        return articles;
    } catch (error) {
        throw new Error(`Scraping failed: ${error}`);
    } finally {
        if (browser) {
            await browser.close();
        }
    }
};

Now you can deploy the above script to AWS Lambda with appropriate memory and timeout settings. Next, invoke the Lambda function to perform the scraping task. This can be done on demand or scheduled at regular intervals.
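Scraped records often contain stray whitespace, missing fields, or duplicates, so it can help to tidy them before doing anything else with them. The following is a minimal sketch of such a clean-up step, assuming each record has the headline, summary, and url fields produced by the script above; cleanArticles is a hypothetical helper name.

```javascript
// Hypothetical clean-up step: trim fields, drop incomplete records,
// and remove duplicates based on the article URL.
function cleanArticles(articles) {
    const seen = new Set();
    const cleaned = [];
    for (const article of articles) {
        const headline = (article.headline || '').trim();
        const summary = (article.summary || '').trim();
        const url = (article.url || '').trim();

        // Skip entries with no headline or URL, and URLs already collected
        if (!headline || !url || seen.has(url)) continue;

        seen.add(url);
        cleaned.push({ headline, summary, url });
    }
    return cleaned;
}
```

This runs in Node.js after page.evaluate() returns, so it does not add any time inside the browser.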

You can process the returned data further or store it in S3 or DynamoDB. To save the scraped data to an S3 bucket, you need to modify the Lambda function to include the AWS SDK and add logic to upload the data to S3. You can do this by following the steps below.

Step 1: Add the AWS SDK to your project dependencies. You can do this by running `npm install aws-sdk` in your project directory.

Step 2: Add code to upload the scraped data to an S3 bucket. Here's an example modification to the existing script.


const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
    let browser = null;

    try {
        // ... existing Puppeteer scraping code, which launches `browser`
        //     and collects the `articles` array ...

        // Convert scraped data to JSON
        const scrapedDataJson = JSON.stringify(articles);

        // Define S3 upload parameters (the bucket name and key are placeholders)
        const uploadParams = {
            Bucket: 'your-s3-bucket-name',
            Key: 'scraped-data.json',
            Body: scrapedDataJson,
            ContentType: 'application/json'
        };

        // Upload data to S3
        await s3.putObject(uploadParams).promise();

        return articles;
    } catch (error) {
        throw new Error(`Scraping or S3 upload failed: ${error}`);
    } finally {
        if (browser) {
            await browser.close();
        }
    }
};

Make sure that the S3 bucket or DynamoDB table is created and properly configured in your AWS account. Moreover, the Lambda function's IAM role should have the necessary permissions to write to the S3 bucket or DynamoDB table. After making these changes, deploy the updated Lambda function and test it to see whether it correctly scrapes data and stores it in S3 or DynamoDB.
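One thing to watch with a fixed object key such as scraped-data.json is that every run overwrites the previous result. If you want to keep a history, a timestamped key is one option. The sketch below is an assumption about how you might name objects; buildS3Key and the key layout are hypothetical.

```javascript
// Hypothetical helper: build a unique, date-partitioned S3 key per run.
function buildS3Key(prefix, date = new Date()) {
    const iso = date.toISOString();          // e.g. 2024-01-15T09:30:00.000Z
    const day = iso.slice(0, 10);            // e.g. 2024-01-15
    const stamp = iso.replace(/[:.]/g, '-'); // replace colons and dots to keep keys tidy
    return `${prefix}/${day}/scraped-data-${stamp}.json`;
}

// Example with a fixed date
const key = buildS3Key('news', new Date('2024-01-15T09:30:00.000Z'));
// key is 'news/2024-01-15/scraped-data-2024-01-15T09-30-00-000Z.json'
```

The result can be passed as the Key in the upload parameters, so each scheduled run lands in its own object under a per-day prefix.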

Advanced techniques

Apart from the above, some advanced techniques can be used to scrape data as well. Some of them are as follows.

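Interacting with elements, such as clicking buttons, filling forms, and triggering events, to access deeper content or simulate user actions.
Handling dynamic content with page.waitForSelector() or page.waitForFunction(), which give the data time to load before you read it.
Downloading required files by sending Page.setDownloadBehavior over the Chrome DevTools Protocol. This is commonly done with page._client.send('Page.setDownloadBehavior'), but page._client is a private property that has changed between Puppeteer versions, so a session from page.target().createCDPSession() is a safer route.
Capturing screenshots of elements or full pages using page.screenshot().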
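Several of the techniques above can be combined in one flow. The following is a minimal sketch, assuming `page` is a Puppeteer Page that has already navigated to a site with a search form; the function name, the option names, and the selectors you pass in are all hypothetical.

```javascript
// Hypothetical flow: fill in a search box, submit, wait for results,
// and capture a full-page screenshot.
async function searchAndCapture(page, options) {
    const { inputSelector, submitSelector, resultSelector, query, timeoutMs = 15000 } = options;

    // Wait for the dynamic form to appear before interacting with it
    await page.waitForSelector(inputSelector, { timeout: timeoutMs });

    // Simulate user input and a button click
    await page.type(inputSelector, query);
    await page.click(submitSelector);

    // Wait until the results have rendered
    await page.waitForSelector(resultSelector, { timeout: timeoutMs });

    // Capture the whole page, not just the visible viewport
    return page.screenshot({ fullPage: true });
}
```

Because the function only relies on the page object passed in, it can be called from inside the Lambda handler after browser.newPage() and page.goto().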

Conclusion

In short, combining Puppeteer with AWS Lambda gives you a powerful and cost-effective way to run large-scale web scraping and other web automation tasks without managing servers or infrastructure. It is a strong option for anyone who wants to automate browser tasks and run Puppeteer in the cloud.