Web scraping is a useful technique for extracting valuable insights from the internet. Python is a top choice for web scraping thanks to its versatility and libraries such as Beautiful Soup, Pyppeteer, and Scrapy. Because websites can detect and block scrapers by looking at their IP addresses, it is often necessary to route your scraper through a proxy. This article covers three main types of proxies that you can use with Pyppeteer, how to set each one up, and some advanced tips for successful scraping with Pyppeteer.
What is a Proxy in Pyppeteer?
A proxy allows you to route your traffic through a different IP address, making it appear as if you are from a different location.
Pyppeteer is a Python library that provides a high-level API for controlling headless Chrome or Chromium browsers using the Chrome DevTools Protocol. You can use it for efficient web scraping.
A Pyppeteer proxy is simply a proxy server that the Chromium browser controlled by Pyppeteer is configured to use. Pyppeteer has no proxy layer of its own; it passes the proxy settings to the browser at launch, so all page traffic is rerouted through the proxy's IP address, providing a layer of anonymity and location masking.
How to set up a proxy on Pyppeteer?
Let's start with the simplest type of proxy: the static IP proxy configuration. This provides a single IP address that you can use for all of your scraping requests. This type of proxy is most suitable for scraping websites that are not very aggressive in blocking scrapers. In other words, you can use it for low-volume scraping.
To set up a static IP proxy in Pyppeteer, you'll need the IP address and port number of your proxy server. Pass them to Chromium as a --proxy-server flag in the 'args' parameter of the launch() method when initializing Pyppeteer.
Apart from that, most proxy services require authentication, typically through a username and password. Chromium does not accept credentials inside the --proxy-server flag, so supply them separately, for example by calling page.authenticate() before navigating to a page.
For this tutorial, we use proxies from Webshare. We offer 10 free proxies when you sign up.
In this example, the proxy server has the IP address 126.96.36.199 and port number 5868. You can find your Webshare proxy configuration on your dashboard's Proxy List page.
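A minimal sketch of this static setup, assuming Pyppeteer is installed and Chromium can be launched. The credentials are hypothetical placeholders, and https://httpbin.org/ip is only an assumed test page that echoes the IP address a request came from.

```python
import asyncio

# Address and port from the example in the text; credentials are placeholders.
PROXY_HOST = "126.96.36.199"
PROXY_PORT = 5868
PROXY_USERNAME = "your_username"  # hypothetical placeholder
PROXY_PASSWORD = "your_password"  # hypothetical placeholder


def proxy_arg(host: str, port: int) -> str:
    """Build the Chromium flag that routes browser traffic through the proxy."""
    return f"--proxy-server=http://{host}:{port}"


async def main() -> str:
    # Imported here so the helper above can be used without launching a browser.
    from pyppeteer import launch

    browser = await launch(args=[proxy_arg(PROXY_HOST, PROXY_PORT)])
    try:
        page = await browser.newPage()
        # Credentials cannot go in --proxy-server, so answer the proxy's
        # authentication challenge through the page instead.
        await page.authenticate(
            {"username": PROXY_USERNAME, "password": PROXY_PASSWORD}
        )
        # Assumed test URL: the response body shows the IP the site saw.
        await page.goto("https://httpbin.org/ip")
        return await page.evaluate("() => document.body.innerText")
    finally:
        await browser.close()
```

To try it, run `print(asyncio.run(main()))`; if the proxy is working, the printed IP should be the proxy's address rather than your own.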
How to set up a rotating proxy on Pyppeteer?
A rotating proxy provides a pool of IP addresses through which you can cycle, making it harder for websites to detect and block your scraping activities. Rotating proxies are ideal for high-volume scraping or for circumventing IP bans and rate limits on websites that monitor and restrict web scraping activities.
There are two common ways to set up a rotating proxy with Pyppeteer:
- Rotating proxy list method
- Rotating proxy endpoint method
Rotating proxy list method
The rotating proxy list method involves maintaining a list of proxy servers and rotating through them. To configure Pyppeteer this way, compile a list of proxy addresses, pick one for each browser launch, and pass it to Chromium through the --proxy-server flag in the 'args' parameter of the launch() method (launch() has no separate 'proxy' parameter). Again, we will use the proxies we got from the Webshare free package.
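The list method can be sketched as follows. The proxy addresses and credentials are hypothetical placeholders (documentation-range IPs), to be replaced with the entries from your own proxy list, and https://httpbin.org/ip is an assumed test page.

```python
import asyncio
import random

# Hypothetical placeholder proxies in "host:port" form; use your own list.
PROXIES = [
    "203.0.113.10:8080",
    "203.0.113.11:8080",
    "203.0.113.12:8080",
]
PROXY_USERNAME = "your_username"  # hypothetical placeholder
PROXY_PASSWORD = "your_password"  # hypothetical placeholder


def pick_proxy_arg(proxies: list) -> str:
    """Choose one proxy at random and format it as a Chromium launch flag."""
    proxy = random.choice(proxies)
    return f"--proxy-server=http://{proxy}"


async def main() -> str:
    # Imported here so the helper above can be used without launching a browser.
    from pyppeteer import launch

    # A new proxy is drawn each time main() runs, i.e. once per browser launch.
    browser = await launch(args=[pick_proxy_arg(PROXIES)])
    try:
        page = await browser.newPage()
        await page.authenticate(
            {"username": PROXY_USERNAME, "password": PROXY_PASSWORD}
        )
        await page.goto("https://httpbin.org/ip")  # assumed test URL
        return await page.evaluate("() => document.body.innerText")
    finally:
        await browser.close()
```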
This script imports the random module to use the random.choice function, which selects a random proxy from the list. It then formats this proxy into the appropriate string to be passed as an argument to launch(). Every time main() is run, a proxy is picked at random from your list, so successive runs are spread across the pool (although the same proxy can occasionally be picked twice in a row).
Rotating proxy endpoint method
A hand-maintained list of static proxies has drawbacks. If one of the proxies goes down or becomes unresponsive, any run that selects it will fail. Also, overuse of a single proxy can lead to it being flagged or blocked by the sites you are accessing.
Using a rotating proxy service through an endpoint is more suitable because such services typically manage a large pool of proxies and automatically rotate them.
To update the Pyppeteer code to use a rotating proxy endpoint like the one provided by Webshare, you only need to change the proxy address in the launch options so that it points to the endpoint.
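A sketch of the endpoint variant, under a few assumptions: the host p.webshare.io is the endpoint this article refers to, but port 80 is an assumption to check against your dashboard, the credentials are hypothetical placeholders, and https://httpbin.org/ip is an assumed test page.

```python
import asyncio

# Host as named in this article; the port is an assumption, check your dashboard.
ROTATING_ENDPOINT = "p.webshare.io:80"
PROXY_USERNAME = "your_username"  # hypothetical placeholder
PROXY_PASSWORD = "your_password"  # hypothetical placeholder


def endpoint_arg(endpoint: str) -> str:
    """Format the rotating endpoint as a Chromium launch flag."""
    return f"--proxy-server=http://{endpoint}"


async def current_ip() -> str:
    """Launch a browser through the endpoint and return the IP the site sees."""
    # Imported here so the helper above can be used without launching a browser.
    from pyppeteer import launch

    browser = await launch(args=[endpoint_arg(ROTATING_ENDPOINT)])
    try:
        page = await browser.newPage()
        await page.authenticate(
            {"username": PROXY_USERNAME, "password": PROXY_PASSWORD}
        )
        await page.goto("https://httpbin.org/ip")  # assumed test URL
        return await page.evaluate("() => document.body.innerText")
    finally:
        await browser.close()


async def main() -> None:
    # Several visits through the same endpoint; the service picks the exit IP.
    for _ in range(3):
        print(await current_ip())
```

Running `asyncio.run(main())` prints the reported IP for each visit, which is a quick way to check whether the addresses actually change.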
When using this endpoint, the proxy service at p.webshare.io handles the rotation of the proxy address for you. From the perspective of the target website, successive requests can appear to come from different IP addresses, without you maintaining a proxy list yourself.
How to set up a residential proxy on Pyppeteer?
Residential proxies are IPs assigned to real residential addresses, so traffic through them looks like it comes from regular users. This makes them the most challenging type of proxy for websites to detect and block. Residential proxies are most suitable for scraping websites that are very sensitive to scraping activity, such as e-commerce websites and social media platforms.
Setting up residential proxies is the same as setting up static proxies. All you need is the IP address and port of the residential proxy, plus the credentials from your provider.
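A minimal sketch, assuming the same launch pattern as for a static proxy. All four values are placeholders, and https://example.com stands in for the site you want to scrape.

```python
import asyncio

# Placeholders: fill these in with the values from your residential provider.
proxy_ip = "proxy_ip"
proxy_port = "proxy_port"
proxy_username = "proxy_username"
proxy_password = "proxy_password"


def launch_args() -> list:
    """Return the Chromium flags that point the browser at the proxy."""
    return [f"--proxy-server=http://{proxy_ip}:{proxy_port}"]


async def main() -> str:
    # Imported here so the helper above can be used without launching a browser.
    from pyppeteer import launch

    browser = await launch(args=launch_args())
    try:
        page = await browser.newPage()
        # Send the provider's credentials when the proxy asks for them.
        await page.authenticate(
            {"username": proxy_username, "password": proxy_password}
        )
        await page.goto("https://example.com")  # stand-in target URL
        return await page.title()
    finally:
        await browser.close()
```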
proxy_ip and proxy_port should be replaced with the actual IP address and port number provided by your residential proxy provider. proxy_username and proxy_password are placeholders for your proxy credentials. Replace them with the actual username and password given by your proxy provider.
If you want to learn how to do this with Puppeteer, read our article “Proxy in Puppeteer: 3 Effective Setup Methods Explained”.
Advanced tips for more successful scraping with Pyppeteer
Pyppeteer automatically manages cookies; however, there are certain cases where you may need to handle cookies manually. For instance, clearing cookies before each scraping request might be necessary to prevent getting blocked. To delete a cookie from the current page, you can use page.deleteCookie(). Learn more about managing cookies.
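As a sketch of manual cookie handling, the hypothetical helper below (clear_cookies is not part of Pyppeteer) reads the cookies visible to the current page with page.cookies() and removes each one with page.deleteCookie(). It assumes `page` is an open Pyppeteer page.

```python
async def clear_cookies(page) -> int:
    """Delete every cookie visible to the current page; return how many."""
    cookies = await page.cookies()
    for cookie in cookies:
        # Pass only the fields that identify the cookie to be removed.
        await page.deleteCookie(
            {
                "name": cookie["name"],
                "domain": cookie["domain"],
                "path": cookie["path"],
            }
        )
    return len(cookies)
```

Calling `await clear_cookies(page)` after a visit to the target site, and before the next scraping request, makes that request start without the cookies the site has set so far.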
HTTP headers carry information and instructions between clients and servers. They are used for authentication, content negotiation, caching, security, and session management. They also play an important role in web scraping, because headers such as User-Agent and Cookie help a scraper's requests resemble those of a real user. In Pyppeteer, you can attach additional headers to every request a page makes with page.setExtraHTTPHeaders().
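For illustration, here is a small sketch that adds extra headers to a page. The header values are arbitrary examples rather than recommendations, and apply_headers is a hypothetical helper name.

```python
# Example values only; choose headers that match the browser you imitate.
EXTRA_HEADERS = {
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}


async def apply_headers(page, headers: dict = EXTRA_HEADERS) -> dict:
    """Attach the given headers to every request made by this page."""
    await page.setExtraHTTPHeaders(headers)
    return headers
```

Call `await apply_headers(page)` before `page.goto()` so the headers are already in place for the first request.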
Use a user agent
Websites often recognize scraping by analyzing the 'User-Agent' header. Headless Chromium's default user agent identifies the browser as headless, which makes it easy to spot, so you may want to replace it with the user agent of a regular browser. You can alter the User-Agent using the page.setUserAgent() function to mimic different browsers and devices. For instance, employing a mobile user agent can be effective for scraping mobile websites. Learn more about user agents.
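A short sketch of setting a user agent per page. The user agent strings are illustrative examples that will go stale over time, and apply_random_user_agent is a hypothetical helper name.

```python
import random

# Illustrative desktop and mobile user agents; keep your own list current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 "
    "Mobile/15E148 Safari/604.1",
]


async def apply_random_user_agent(page) -> str:
    """Pick a user agent at random and set it on the page."""
    user_agent = random.choice(USER_AGENTS)
    await page.setUserAgent(user_agent)
    return user_agent
```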
Create more natural browsing patterns
To avoid detection, you can simulate the way a human browses. Use random delays between requests, click on links, and scroll the pages using Pyppeteer's functions.
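One way to sketch this is with two hypothetical helpers: one that waits for a random interval, and one that scrolls a page in steps with pauses in between. The delay ranges and step count are arbitrary assumptions to tune for your target site.

```python
import asyncio
import random


async def human_pause(min_seconds: float = 1.0, max_seconds: float = 3.0) -> float:
    """Sleep for a random duration in the given range and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    await asyncio.sleep(delay)
    return delay


async def scroll_page(page, steps: int = 5) -> None:
    """Scroll down in several steps, pausing briefly between them."""
    for _ in range(steps):
        # Scroll by most of one viewport height, as a reader might.
        await page.evaluate("() => window.scrollBy(0, window.innerHeight * 0.8)")
        await human_pause(0.3, 1.2)
```

For example, calling `await human_pause()` between `page.goto()` calls and `await scroll_page(page)` after each page loads spaces the requests out irregularly.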
In this article, we've discussed several methods for setting up proxies in Pyppeteer to enhance web scraping capabilities. Whether you need a simple static proxy or a sophisticated residential proxy, Pyppeteer offers the flexibility to cater to your scraping requirements. By considering your specific use case and requirements, you can select the methods and techniques that best improve your web scraping.