Extracting data from websites is a core task in web development and data analysis. Web scraping lets you gather data, automate tasks, and run various analyses. Puppeteer, a Node.js library for controlling a headless browser, simplifies this work and is a popular choice for scraping. This guide walks you through the basics of web scraping with Puppeteer and Node.js.
Jump to a section that’s relevant to you or simply continue reading:
- Scrape text by Selector, XPath or Class
- Scrape - Single page
- Scrape - Multiple pages
- Scrape - All pages (imitate a crawler)
These sections should get you started in no more than 10 minutes. If you are curious, the advanced sections further down cover scraping iframes and getting HTML and transforming it to PDF.
If you run into trouble while scraping, see the sections on dealing with anti-scraping measures and error handling tips.
Before we dive into the world of Puppeteer and Node.js web scraping, you need to check off the following prerequisites:
Ensure that you have Node.js installed on your system. If you don’t have it installed, you can download the latest version from the official Node.js website. Follow the instructions for your operating system.
Create a new project directory where you’ll be working with Puppeteer. Open a terminal, navigate to your chosen directory, and run npm init -y to create a package.json file, followed by npm install puppeteer to install Puppeteer.
How does scraping in Puppeteer work?
Now that we have our environment set up, let’s delve into the core concepts of scraping with Puppeteer.
Scraping text by Selector
When scraping text by selector in Puppeteer, we can use the page.$eval() method, which is particularly useful when you want to target a specific element, such as a paragraph, heading, or any other HTML tag.
Below is an example that uses the page.$eval() method to select an element and extract its text content. The extracted text will then be saved to a JSON file.
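A minimal sketch of this approach, not a definitive implementation. The function names, the example URL, the h1 selector, and the output file name extractedData_selector.json are assumptions made for illustration.

```javascript
const fs = require('fs');

// Read the trimmed text of the first element matching `selector`.
// page.$eval() throws if nothing matches the selector.
async function extractText(page, selector) {
  return page.$eval(selector, (el) => el.textContent.trim());
}

// Launch a headless browser, scrape one element, and save the result as JSON.
async function scrapeBySelector(url, selector, outFile) {
  // Loaded here so extractText() stays usable without launching a browser.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    const text = await extractText(page, selector);
    fs.writeFileSync(outFile, JSON.stringify({ selector, text }, null, 2));
    return text;
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Example call with placeholder values (uncomment to run):
// scrapeBySelector('https://example.com', 'h1', 'extractedData_selector.json').then(console.log);
```

Keeping the extraction in its own function makes it easy to reuse on any page you have already opened.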
Here’s the generated output:
Scraping text by XPath
Here’s an example that uses the page.$x() method to select an element by XPath and extract its text content:
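A hedged sketch of XPath selection. To my understanding, page.$x() exists only in older Puppeteer releases, and newer releases expect the xpath/ selector prefix instead, so the helper below checks which one is available. The function name and the //h1 expression are illustrative assumptions.

```javascript
// Return the trimmed text of the first node matching an XPath expression,
// or null when nothing matches.
async function extractTextByXPath(page, expression) {
  const handles = typeof page.$x === 'function'
    ? await page.$x(expression)               // older Puppeteer releases
    : await page.$$(`xpath/${expression}`);   // newer releases: 'xpath/' prefix
  if (handles.length === 0) return null;
  // Pass the element handle into the page context to read its text.
  return page.evaluate((el) => el.textContent.trim(), handles[0]);
}

// Example call with a placeholder expression, given an open `page`:
// const heading = await extractTextByXPath(page, '//h1');
```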
Here’s the generated output:
Scraping text by Class
The example below uses the page.$() method with a class selector to select an element and extract its text content.
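A short sketch of class-based selection with page.$(). The function name and the class name in the usage comment are assumptions for illustration.

```javascript
// Return the trimmed text of the first element carrying the given class,
// or null when no element matches.
async function extractTextByClass(page, className) {
  const handle = await page.$(`.${className}`);
  if (!handle) return null; // page.$() resolves to null when nothing matches
  return handle.evaluate((el) => el.textContent.trim());
}

// Example call with a placeholder class name, given an open `page`:
// const text = await extractTextByClass(page, 'hero-title');
```

Unlike page.$eval(), page.$() does not throw on a missing element, so the null check is what protects the script here.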
Here’s the generated output:
Scraping a single page
Now that we’ve covered the basics of selecting elements by selector, XPath and class, let’s put this knowledge into practice by scraping a single page. In the following example, we'll navigate to Webshare, and use page.content() to retrieve the entire HTML content of the page. The extracted HTML is then saved to a JSON file named 'extractedData_singlePage.json'.
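A minimal sketch of the single-page flow. The function names, the JSON shape, and the networkidle2 wait condition are assumptions; the URL in the usage comment is a placeholder you would replace with the page you want to scrape.

```javascript
const fs = require('fs');

// Store a page's HTML together with its URL in a JSON file.
function saveHtml(outFile, url, html) {
  fs.writeFileSync(outFile, JSON.stringify({ url, html }, null, 2));
}

// Open the page, grab its full HTML with page.content(), and save it.
async function scrapeSinglePage(url, outFile = 'extractedData_singlePage.json') {
  const puppeteer = require('puppeteer'); // loaded on demand
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled before reading the HTML.
    await page.goto(url, { waitUntil: 'networkidle2' });
    const html = await page.content();
    saveHtml(outFile, url, html);
    return html.length;
  } finally {
    await browser.close();
  }
}

// Example call with a placeholder URL (uncomment to run):
// scrapeSinglePage('https://example.com').then((n) => console.log(`Saved ${n} characters`));
```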
You can run the code as shown below:
The output looks like this:
Scraping multiple pages
The following example demonstrates how we can extract data from multiple pages of a website.
The script launches a headless browser, navigates to each URL, extracts the page title using page.title(), and stores the results in a JSON file. The loop iterates through the specified pages, logging the scraped titles and saving the data to 'extractedData_multiplePages.json'.
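The loop described above could be sketched roughly as follows. The function names and the domcontentloaded wait condition are assumptions, and the URLs in the usage comment are placeholders.

```javascript
const fs = require('fs');

// Visit each URL in turn with a single page and collect its title.
async function collectTitles(page, urls) {
  const results = [];
  for (const url of urls) {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    const title = await page.title();
    console.log(`Scraped: ${url} -> ${title}`);
    results.push({ url, title });
  }
  return results;
}

// Launch a headless browser, scrape all titles, and write them to a JSON file.
async function scrapeMultiplePages(urls, outFile = 'extractedData_multiplePages.json') {
  const puppeteer = require('puppeteer'); // loaded on demand
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const results = await collectTitles(page, urls);
    fs.writeFileSync(outFile, JSON.stringify(results, null, 2));
    return results;
  } finally {
    await browser.close();
  }
}

// Example call with placeholder URLs (uncomment to run):
// scrapeMultiplePages(['https://example.com/', 'https://example.org/']);
```

Reusing one page for all URLs keeps memory use low; opening several pages in parallel is faster but more likely to trigger rate limits.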
Run the code to see the output below:
Scraping all pages of a website
Scraping all pages of a website, as crawlers like Screaming Frog do, means navigating through the entire site and extracting the relevant information from each page. In Puppeteer, we can achieve this by recursively visiting pages, discovering links, and accumulating data.
Below is an example that scrapes all pages from Webshare, extracts their URLs and titles, and saves the information to a JSON file.
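A rough sketch of such a crawler. Note two deliberate choices that are assumptions of this sketch: it uses an iterative queue (breadth-first) instead of literal recursion, and it adds a maxPages cap so the crawl stays bounded. The function names, the output file name, and the start URL in the usage comment are placeholders.

```javascript
const fs = require('fs');

// Keep only links on the same origin, drop #fragments, and remove duplicates.
function sameOriginLinks(hrefs, origin) {
  const unique = new Set();
  for (const href of hrefs) {
    try {
      const parsed = new URL(href);
      parsed.hash = '';
      if (parsed.origin === origin) unique.add(parsed.href);
    } catch {
      // Skip values that are not valid absolute URLs.
    }
  }
  return [...unique];
}

// Breadth-first crawl: visit a page, record URL and title, queue new links.
async function crawl(page, startUrl, maxPages = 50) {
  const origin = new URL(startUrl).origin;
  const queue = [new URL(startUrl).href]; // normalized start URL
  const visited = new Set();
  const results = [];
  while (queue.length > 0 && results.length < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    results.push({ url, title: await page.title() });
    const hrefs = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));
    for (const link of sameOriginLinks(hrefs, origin)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return results;
}

// Launch a headless browser, crawl the site, and save the results as JSON.
async function crawlSite(startUrl, outFile = 'extractedData_allPages.json') {
  const puppeteer = require('puppeteer'); // loaded on demand
  const browser = await puppeteer.launch();
  try {
    const results = await crawl(await browser.newPage(), startUrl);
    fs.writeFileSync(outFile, JSON.stringify(results, null, 2));
    return results;
  } finally {
    await browser.close();
  }
}

// Example call with a placeholder URL (uncomment to run):
// crawlSite('https://example.com');
```

The visited set is what prevents the crawler from looping forever on sites whose pages link back to each other.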
Here’s what your output will look like:
Advanced scraping task examples
As you move on to more advanced scraping tasks with Puppeteer, you may encounter scenarios like scraping content within iframes and transforming HTML to PDF. The following sections illustrate how to tackle these tasks.
To scrape content within an iframe using Puppeteer, you need to switch to the iframe context and then perform operations within it. The script below navigates to a page, identifies the Trustpilot iframe using its title attribute, switches to the iframe context, and extracts the HTML content within the iframe's body. The extracted content is then stored in 'extractedData_iframe.json'.
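A minimal sketch of the iframe step, assuming the iframe can be located by its title attribute. The function name is an assumption, and the actual title value is left as a parameter because it depends on the page being scraped.

```javascript
// Find the <iframe> by its title attribute, switch into its frame,
// and return the HTML inside the frame's <body>.
async function extractIframeBody(page, iframeTitle) {
  const handle = await page.waitForSelector(`iframe[title="${iframeTitle}"]`);
  const frame = await handle.contentFrame(); // the frame behind the <iframe> element
  if (!frame) {
    throw new Error(`No frame available for iframe titled "${iframeTitle}"`);
  }
  return frame.$eval('body', (body) => body.innerHTML);
}

// Example use with placeholder values, given an open `page`:
// const html = await extractIframeBody(page, 'Title of the iframe');
// require('fs').writeFileSync('extractedData_iframe.json', JSON.stringify({ html }, null, 2));
```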
Here’s what the output will look like:
Get HTML and transform to PDF
Puppeteer provides the capability to capture HTML content and transform it into a PDF file. The example code below illustrates how to fetch HTML from a page, save it to a JSON file, and then convert it into a PDF.
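A hedged sketch of this flow: it saves the page's HTML to JSON and then prints the loaded page to PDF with page.pdf(). The function name, file paths, A4 format, and printBackground setting are assumptions chosen for illustration.

```javascript
const fs = require('fs');

// Load a page, save its HTML to a JSON file, then render the page as a PDF.
async function saveHtmlAndPdf(page, url, jsonPath, pdfPath) {
  await page.goto(url, { waitUntil: 'networkidle2' });
  const html = await page.content();
  fs.writeFileSync(jsonPath, JSON.stringify({ url, html }, null, 2));
  // page.pdf() renders the currently loaded page with print styles applied.
  await page.pdf({ path: pdfPath, format: 'A4', printBackground: true });
  return html;
}

// Example use with placeholder values, given an open `page`:
// await saveHtmlAndPdf(page, 'https://example.com', 'extractedData.json', 'page.pdf');
```

If you already hold an HTML string, calling page.setContent(html) before page.pdf() is an alternative to navigating with page.goto().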
Here’s what the JSON will look like:
Below is the generated PDF file:
Dealing with anti-scraping measures
Websites often implement anti-scraping measures to prevent automated access. Several strategies can help you work around them, including proxies, the Puppeteer Extra library, and user agent modification.
Proxies play a crucial role in mitigating the risk of being blocked during scraping activities by masking the origin IP address. The following code demonstrates how to integrate proxies into a Puppeteer script:
By setting useProxy to true, the script configures Puppeteer to use the specified proxy details, allowing for more discreet and distributed web scraping.
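A minimal sketch of a proxy setup with a useProxy switch. The shape of the proxy object (host, port, username, password), the function names, and the proxy address in the usage comment are assumptions; substitute the details from your proxy provider.

```javascript
// Build launch options; the proxy flag is only added when useProxy is true.
function buildLaunchOptions(useProxy, proxy) {
  const options = { headless: true, args: [] };
  if (useProxy) {
    // --proxy-server is a Chromium command-line flag.
    options.args.push(`--proxy-server=${proxy.host}:${proxy.port}`);
  }
  return options;
}

// Launch a browser and open a page, authenticating with the proxy if needed.
async function openPage(useProxy, proxy) {
  const puppeteer = require('puppeteer'); // loaded on demand
  const browser = await puppeteer.launch(buildLaunchOptions(useProxy, proxy));
  const page = await browser.newPage();
  if (useProxy && proxy.username) {
    await page.authenticate({ username: proxy.username, password: proxy.password });
  }
  return { browser, page };
}

// Example call with placeholder proxy details (uncomment to run):
// openPage(true, { host: 'proxy.example.com', port: 8080, username: 'user', password: 'pass' })
//   .then(async ({ browser, page }) => { await page.goto('https://example.com'); await browser.close(); });
```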
Puppeteer Extra is an extension library for Puppeteer that equips it with stealth capabilities to bypass anti-bot measures. The script below shows the integration of Puppeteer Extra:
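A short sketch of one way to wire this up, assuming the puppeteer-extra and puppeteer-extra-plugin-stealth packages are installed alongside puppeteer. The helper names are assumptions for illustration.

```javascript
// Register each plugin on a puppeteer-extra instance and return the instance.
function withPlugins(puppeteerExtra, plugins) {
  for (const plugin of plugins) {
    puppeteerExtra.use(plugin);
  }
  return puppeteerExtra;
}

// Launch a browser through puppeteer-extra with the stealth plugin enabled.
async function launchStealthBrowser() {
  const puppeteerExtra = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  return withPlugins(puppeteerExtra, [StealthPlugin()]).launch({ headless: true });
}

// Example call with a placeholder URL (uncomment to run):
// launchStealthBrowser().then(async (browser) => {
//   const page = await browser.newPage();
//   await page.goto('https://example.com');
//   await browser.close();
// });
```

The browser and page objects returned this way expose the same methods as with plain Puppeteer, so the rest of a scraping script does not need to change.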
Learn more about which Puppeteer Extra plugins can help you scrape more successfully.
User agent modification
Modifying the user agent allows a script to emulate different browsers, potentially bypassing user-agent-based checks. The following code shows how to set a custom user agent in Puppeteer:
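A minimal sketch using page.setUserAgent(). The function name and the user agent string in the usage comment are illustrative assumptions; use a string that matches a browser you actually want to emulate.

```javascript
// Set a custom user agent before navigating, then report the value
// that scripts running inside the page will see.
async function gotoWithUserAgent(page, url, userAgent) {
  await page.setUserAgent(userAgent); // must be called before page.goto()
  await page.goto(url);
  return page.evaluate(() => navigator.userAgent);
}

// Example use with placeholder values, given an open `page`:
// const seen = await gotoWithUserAgent(
//   page,
//   'https://example.com',
//   'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
// );
```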
By setting a custom user agent, the script can mimic different browsers, making it harder for websites to identify and block automated scraping activities. For further customization, you can set cookies in the script using the page.setCookie method. Learn more about handling cookies, as some websites use them to track user sessions and behavior.
Error handling tips
Good error handling lets a scraping script recover from unexpected issues instead of crashing partway through a run. Below are some key error-handling tips to make your Puppeteer web scraping code more reliable.
Page load failures
Handle page load failures by catching navigation errors, logging the issue, and taking appropriate actions.
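One possible way to do this, sketched with a simple retry loop. Retrying is an assumption of this sketch (the "appropriate action" could equally be skipping the URL), and the function name and default attempt count are illustrative.

```javascript
// Navigate with retries. Resolves to true on success,
// or false once every attempt has failed.
async function gotoWithRetry(page, url, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      return true;
    } catch (err) {
      // Log the failure and fall through to the next attempt.
      console.error(`Attempt ${attempt} for ${url} failed: ${err.message}`);
    }
  }
  return false;
}

// Example use with a placeholder URL, given an open `page`:
// if (!(await gotoWithRetry(page, 'https://example.com'))) {
//   console.error('Giving up on this URL');
// }
```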
Adjust timeouts based on the expected loading times of pages or elements. This helps prevent premature script termination due to default timeout settings.
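A small sketch of raising the default timeouts on a page. The function name and the millisecond values are assumptions; pick values that fit the sites you scrape.

```javascript
// Apply page-wide timeouts (in milliseconds).
function configureTimeouts(page, { navigationMs = 60000, defaultMs = 15000 } = {}) {
  // Applies to navigation calls such as page.goto() and page.reload().
  page.setDefaultNavigationTimeout(navigationMs);
  // Applies to other waiting calls such as page.waitForSelector().
  page.setDefaultTimeout(defaultMs);
}

// Example use, given an open `page`:
// configureTimeouts(page, { navigationMs: 90000 });
// A single call can still override the default:
// await page.goto('https://example.com', { timeout: 120000 });
```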
Element selection failures
When scraping elements on a page, anticipate scenarios where the desired element might not be found. Implement try-catch blocks to handle such cases.
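A minimal sketch of a try-catch wrapper around page.$eval(), which throws when no element matches. The function name and the fallback parameter are assumptions for illustration.

```javascript
// Read an element's text, returning `fallback` instead of throwing
// when the element cannot be found.
async function textOrDefault(page, selector, fallback = null) {
  try {
    return await page.$eval(selector, (el) => el.textContent.trim());
  } catch (err) {
    console.warn(`Could not read "${selector}": ${err.message}`);
    return fallback;
  }
}

// Example use with a placeholder selector, given an open `page`:
// const price = await textOrDefault(page, '.price', 'n/a');
```

Returning a fallback keeps a long scraping run going when one page lacks the expected element, while the warning still leaves a trace to investigate later.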
In this article, we covered basic and intermediate web scraping techniques using Puppeteer and Node.js, ways of handling anti-scraping measures, error handling tips, and advanced scraping tasks such as working with iframes and transforming HTML to PDF. With this knowledge, developers can work through common web scraping challenges and extract valuable data from diverse sources.