Puppeteer is a Node.js library for browser automation. You can use it for automated testing and for scraping web data, even from dynamically loaded sites. In web scraping, you often need to get the text of elements such as paragraphs (<p>), spans (<span>), and divisions (<div>). Puppeteer provides several methods for this, two of which are described below.
First, page.evaluate() runs custom JavaScript within the browser context, so any code you could run in the browser's console can be executed on the page through this function. When extracting data from web elements, you can use page.evaluate() to read any element's textContent or innerText property directly.
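As a minimal sketch of this idea, the helper below reads an element's textContent through page.evaluate(). The name getTextWithEvaluate and the 'h1' selector in the usage comment are illustrative assumptions, and `page` is assumed to be a Puppeteer Page that has already navigated to the target URL.

```javascript
// Hypothetical helper: returns the text of the first element matching `selector`.
// `page` is assumed to be an already-navigated Puppeteer Page.
async function getTextWithEvaluate(page, selector) {
  // The callback runs inside the browser, so `document` is available there.
  // Extra arguments (here `selector`) are serialized and passed to the callback.
  return page.evaluate((sel) => {
    const el = document.querySelector(sel);
    return el ? el.textContent.trim() : null; // null when nothing matches
  }, selector);
}

// Usage, inside an async function:
//   const heading = await getTextWithEvaluate(page, 'h1');
```

Because the query is written by hand, the callback can decide what to return when the element is missing instead of throwing.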
Second, page.$eval() is a more specialized method that combines querying for an element and executing a function against that element. It's a shorthand for selecting an element with a CSS selector and then extracting its text content in one go.
Both these functions are invaluable for web scraping and automation. page.evaluate() offers broad JavaScript execution capabilities, while page.$eval() streamlines the process of targeting and extracting text from specific elements.
In this article, we will discuss several scenarios where you would need to extract text and how to do it with Puppeteer using the methods mentioned above.
To extract the text of a specific element, we need a way to distinguish it from other elements, such as its ID or class. The class attribute is typically used to group elements that share a similar style, so identifying a group of related elements by class is a common practice in web scraping. In this section, we will show you how to extract text by an element's class.
<span> and <div> are among the most common elements on a web page. They are often used as containers to hold other elements. As our first example, we will show you how to extract text from <span> elements.
Using Puppeteer's page.$eval() function, we can easily extract the text of a single <span> element on a webpage. Note, however, that page.$eval() targets only the first <span> element that matches the specified class. Here's a brief example to illustrate this behavior.
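The following is a sketch of such a script, not a definitive listing. The function name firstTokenText and the placeholder URL in the usage comment are assumptions. The Puppeteer module is passed in as a parameter so the caller controls how it is loaded.

```javascript
// Sketch: open `url`, read the first span with class "token", then close the browser.
// `puppeteer` is the Puppeteer module, passed in by the caller.
async function firstTokenText(puppeteer, url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // $eval queries the first match for the selector and hands it to the callback.
    // It throws if no element matches.
    return await page.$eval('span.token', (el) => el.innerText.trim());
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Usage (assumes `npm install puppeteer`; the URL is a placeholder):
//   firstTokenText(require('puppeteer'), 'https://example.com').then(console.log);
```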
Note: To understand this code, you might want to read more about Get Element in Puppeteer.
This code snippet highlights that page.$eval() returns the inner text of the first <span> element matching the class (.token). It's a quick way to access specific text content when you're interested in the first occurrence of an element with a particular class.
If you need to grab text from every <span> element with the "token" class on a webpage, Puppeteer's page.$$eval() is the tool for the job. Unlike page.$eval(), which only fetches text from the first match, page.$$eval() lets us collect text from all matching elements. Check out this example to see it in action.
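A minimal sketch of that approach follows. The helper name allTokenTexts is an assumption, and `page` is assumed to be a Puppeteer Page that has already loaded the target site.

```javascript
// Hypothetical helper: returns the text of every span with class "token".
async function allTokenTexts(page) {
  // $$eval queries all matches and passes them to the callback as an array.
  return page.$$eval('span.token', (spans) =>
    spans.map((span) => span.innerText.trim())
  );
}

// Usage, inside an async function:
//   const tokens = await allTokenTexts(page); // an empty array when nothing matches
```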
Let’s look at another example, where we extract data from <p> tags, which are commonly used to wrap information. For this example, we will consider the <p> tags on the webshare.io/blog webpage.
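One way such a script could look is sketched below. The helper name paragraphTexts is an assumption, the default selector targets <p> elements with the class caps_menu, and `page` is assumed to have already navigated to the blog page.

```javascript
// Hypothetical helper: reads <p> text in three different ways for comparison.
async function paragraphTexts(page, selector = 'p.caps_menu') {
  // 1. $eval: the first matching element only (throws if nothing matches).
  const first = await page.$eval(selector, (p) => p.textContent.trim());

  // 2. $$eval: every matching element, delivered to the callback as an array.
  const all = await page.$$eval(selector, (ps) =>
    ps.map((p) => p.textContent.trim())
  );

  // 3. evaluate: the same result, with the DOM query written by hand.
  const viaEvaluate = await page.evaluate(
    (sel) => Array.from(document.querySelectorAll(sel), (p) => p.textContent.trim()),
    selector
  );

  return { first, all, viaEvaluate };
}

// Usage, inside an async function:
//   console.log(await paragraphTexts(page));
```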
In the script, three methods are used to pull text from <p> elements with the class caps_menu. page.$eval grabs the text from the first matching <p> element, page.$$eval retrieves the text of all matching <p> elements, and page.evaluate runs a custom script to do the same, offering more control over the JavaScript executed within the page context. As you can see, each method has its use.
Note: If you are scraping, we highly recommend using a Puppeteer proxy to prevent getting blocked.
In Puppeteer, page.evaluate() and page.$eval() do share the purpose of executing code within the context of the page. However, the scope of their operation and their use cases differ. The page.evaluate() method is a general-purpose tool for running scripts in the context of the page itself, allowing interaction with any accessible elements or variables defined in the page's environment. It's not necessarily faster than page.$eval(), as performance depends on the nature of the task rather than the method itself.
page.$eval(), on the other hand, is a more focused function. It requires a CSS selector and executes the provided function on the first element that matches this selector. It's particularly useful when you need to perform an action or retrieve information from a specific element, as it automatically handles the query selection.
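To make the difference in scope concrete, here is a small sketch. Both helper names are assumptions, and `page` is assumed to be an already-navigated Puppeteer Page.

```javascript
// page.evaluate(): general purpose, can read anything in the page's environment.
async function pageSummary(page) {
  return page.evaluate(() => ({
    title: document.title,
    paragraphCount: document.querySelectorAll('p').length,
  }));
}

// page.$eval(): focused on the first element matching one selector.
async function firstParagraph(page) {
  return page.$eval('p', (p) => p.textContent.trim());
}

// Usage, inside an async function:
//   console.log(await pageSummary(page), await firstParagraph(page));
```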
Let's work through more examples to practice extracting text from various elements. In this example, we will extract all heading elements.
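A sketch of this extraction is shown below. The helper name allHeadings is an assumption, and `page` is assumed to have already navigated to the page being scraped.

```javascript
// Hypothetical helper: returns the text of every h1-h6 element in document order.
async function allHeadings(page) {
  return page.evaluate(() =>
    Array.from(
      document.querySelectorAll('h1, h2, h3, h4, h5, h6'),
      (heading) => heading.innerText.trim()
    )
  );
}

// Usage, inside an async function:
//   console.log(await allHeadings(page));
```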
Here is the output of the console.
In this Puppeteer script, we're focusing on extracting text from all heading elements (h1 through h6) on the "webshare.io/blog" page. The page.evaluate() method is used to run JavaScript within the page's context, selecting all heading elements and mapping their inner text to an array. This array is then logged to the console, showing all the headings found on the page.
Extracting product titles from e-commerce websites like Etsy is important for market research, competitive analysis, and price monitoring. It helps businesses understand market trends, compare products, and strategize pricing. In this example, we'll focus on extracting product titles from the "Weddings" category on Etsy, where product titles are displayed in h3 headings with the class v2-listing-card__title.
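The sketch below shows one way to collect those titles. The helper name etsyProductTitles is an assumption, and the category URL is left as a parameter because the exact address is not given here. The selector comes from the class named above and may change whenever Etsy updates its markup.

```javascript
// Hypothetical helper: navigates to a category page and returns its product titles.
async function etsyProductTitles(page, categoryUrl) {
  // 'domcontentloaded' resolves once the HTML is parsed, without waiting for every asset.
  await page.goto(categoryUrl, { waitUntil: 'domcontentloaded' });
  return page.$$eval('h3.v2-listing-card__title', (titles) =>
    titles.map((title) => title.textContent.trim())
  );
}

// Usage, inside an async function (categoryUrl is supplied by you):
//   const titles = await etsyProductTitles(page, categoryUrl);
```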
After extracting product titles, the next logical step for a market research analysis is retrieving product prices. For this purpose, we will continue our previous example on the Etsy "Weddings" category page. Here's how you can extract product prices on Etsy using Puppeteer:
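A hedged sketch of the price extraction follows. The selector span.currency-value is an assumption about Etsy's markup and should be checked against the live page. parsePrice and productPrices are hypothetical helper names. The number parsing assumes "." is the decimal separator.

```javascript
// Assumption: prices use "." as the decimal separator, e.g. "1,299.50".
function parsePrice(text) {
  const cleaned = text.replace(/[^0-9.]/g, ''); // drop currency symbols and commas
  return cleaned === '' ? NaN : Number(cleaned);
}

// `priceSelector` is a guess at the element that holds the numeric price.
async function productPrices(page, priceSelector = 'span.currency-value') {
  const texts = await page.$$eval(priceSelector, (els) =>
    els.map((el) => el.textContent.trim())
  );
  return texts.map((text) => parsePrice(text)); // parsing happens in Node, not in the page
}

// Usage, inside an async function:
//   console.log(await productPrices(page));
```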
Once you run the above script, you will see a list of prices in your console output.
However, when you scrape popular sites like Etsy, you might be blocked by anti-scraping measures. To work around them, you might want to read about Puppeteer Stealth mode.
Element Not Found: This error occurs when Puppeteer attempts to extract text from an element that does not exist or has not yet loaded on the page.
Solution: Use Puppeteer's page.waitForSelector() to wait for the element to be present before attempting to extract text.
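A minimal sketch of this pattern is shown here. The helper name waitAndGetText is an assumption, and `page` is assumed to be an already-navigated Puppeteer Page.

```javascript
// Hypothetical helper: waits for the element to exist, then reads its text.
async function waitAndGetText(page, selector) {
  // Resolves once the element is in the DOM; rejects if it does not appear in time.
  await page.waitForSelector(selector);
  return page.$eval(selector, (el) => el.textContent.trim());
}

// Usage, inside an async function:
//   const text = await waitAndGetText(page, 'span.token');
```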
Timeout Error: A timeout error may happen if Puppeteer takes too long to find the element or if the page takes too long to load.
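Solution: Increase the timeout with page.waitForSelector(selector, {timeout: 60000}) to give elements more time to appear. The value is in milliseconds and the default is 30000, so a setting such as 10000 would shorten the wait rather than extend it.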
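The sketch below passes an explicit timeout and returns null when the wait times out instead of crashing the script. The helper name getTextOrNull is an assumption, as is the check on the error's name, which relies on Puppeteer's timeout errors being named 'TimeoutError'.

```javascript
// Hypothetical helper: returns the element's text, or null if it never appears.
async function getTextOrNull(page, selector, timeoutMs) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs }); // milliseconds
  } catch (err) {
    // Assumption: Puppeteer timeout errors carry the name 'TimeoutError'.
    if (err.name === 'TimeoutError') return null;
    throw err; // any other failure is unexpected, so surface it
  }
  return page.$eval(selector, (el) => el.textContent.trim());
}

// Usage, inside an async function:
//   const text = await getTextOrNull(page, 'span.token', 60000);
```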
Incorrect Use of Asynchronous Operations: Puppeteer operations are asynchronous, and incorrect handling of promises or await/async can lead to issues.
Solution: Ensure all Puppeteer commands, especially those that return promises (like page.evaluate(), page.$eval(), etc.), are properly awaited or handled with .then().
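Both styles can be sketched as follows. The helper names are assumptions, and `page` is assumed to be an already-navigated Puppeteer Page with an <h1> element.

```javascript
// With async/await: each call is awaited, so `title` and `heading` are strings.
// Dropping an `await` here would leave a pending Promise in the variable instead.
async function titleAndHeading(page) {
  const title = await page.evaluate(() => document.title);
  const heading = await page.$eval('h1', (el) => el.textContent.trim());
  return { title, heading };
}

// With .then(): the Promise is returned so the caller can keep chaining on it.
function headingWithThen(page) {
  return page
    .$eval('h1', (el) => el.textContent.trim())
    .then((heading) => ({ heading }));
}

// Usage:
//   titleAndHeading(page).then(console.log);
//   headingWithThen(page).then(console.log);
```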
This article explored how to extract text from elements with Puppeteer. We covered specific tasks such as extracting text by an element’s class, retrieving all heading elements, and extracting product titles and prices. We also discussed Puppeteer's page.evaluate() and page.$eval() functions and how they differ. Finally, we addressed common challenges encountered in web scraping, including elements that cannot be found, slow page loads, and mishandled asynchronous operations. With these tools and potential issues in mind, you can effectively navigate web pages and extract data from them using Puppeteer.