Web scraping is crucial for data extraction and analysis across various industries. Scrapy, a popular Python web scraping framework, is often used for its efficiency and flexibility. However, websites increasingly deploy measures to detect and block scraping activities, primarily through IP address monitoring. To avoid detection and maintain successful scraping sessions, rotating proxies are essential. In this article, we’ll cover two methods for implementing rotating proxies in Scrapy: using datacenter proxies (Method 1) and residential proxies (Method 2). Additionally, we’ll discuss common issues and debugging tips to help you effectively manage your web scraping projects.
Before diving into the methods for rotating proxies in Scrapy, ensure you have the following prerequisites in place:
1. Basic Knowledge of Python and Scrapy: Familiarity with Python programming and the Scrapy framework is essential. You should know how to set up a Scrapy project and create basic spiders.
2. Scrapy Installation: Make sure Scrapy is installed on your system. You can install Scrapy using pip:
<pre class="highlight pre-shadow">
<code class="js">
pip install scrapy
</code>
</pre>3. Proxy Provider Account: To use rotating proxies, you need an account with a proxy provider. There are various providers for both datacenter and residential proxies. Sign up and obtain the necessary proxy details (IP addresses, ports, usernames, and passwords). Webshare offers 10 free datacenter proxies to get you started.
4. Basic Understanding of HTTP and Networking: Understanding how HTTP requests and responses work, as well as some basic networking concepts, will help in configuring and troubleshooting proxies.
Now, let’s dive into the methods for rotating proxies in Scrapy.
Datacenter proxies are IP addresses provided by data centers rather than residential ISPs. They are cost-effective and offer high-speed connections, making them a popular choice for web scraping. Here’s how to set up and use rotating datacenter proxies in Scrapy.
First, select a reliable datacenter proxy provider. Sign up for an account and obtain the proxy list, which typically includes IP addresses, ports, and authentication credentials.
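Many provider dashboards export proxies as plain `ip:port:username:password` lines. If yours does (the file name and column order below are assumptions — check your provider's actual export format), a small helper can convert such a file into the URL form used throughout this article:

```python
# proxy_loader.py
# Converts "ip:port:username:password" lines (a common export format --
# verify against your provider's actual format) into proxy URLs.

def to_proxy_url(line):
    """Turn one 'ip:port:user:pass' line into 'http://user:pass@ip:port'."""
    ip, port, user, password = line.strip().split(':')
    return f'http://{user}:{password}@{ip}:{port}'

def load_proxies(path):
    """Read a proxy export file, skipping blank lines."""
    with open(path) as f:
        return [to_proxy_url(line) for line in f if line.strip()]

print(to_proxy_url('192.0.2.10:8080:alice:s3cret'))
# http://alice:s3cret@192.0.2.10:8080
```

The resulting list can be pasted into `PROXY_LIST` in settings.py or loaded there at startup with `load_proxies()`.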
You might need additional libraries outside Scrapy itself, such as requests for testing proxies. Ensure the requests library is installed:
<pre class="highlight pre-shadow">
<code class="js">
pip install requests
</code>
</pre>Modify the Scrapy settings to integrate proxy rotation. Open the settings.py file of your Scrapy project and add the following configurations:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
# Enable the downloader middlewares. The custom ProxyMiddleware must run
# BEFORE the built-in HttpProxyMiddleware (priority 750) so the proxy it
# sets in request.meta, including its credentials, is picked up.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 610,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
# Define the proxy list (you can also load it from an external file)
PROXY_LIST = [
    'http://username:password@proxy1:port',
    'http://username:password@proxy2:port',
    'http://username:password@proxy3:port',
]
# Setting read by the custom rotation middleware
ROTATING_PROXY_LIST = PROXY_LIST
</code>
</pre>Create a custom middleware to handle the rotation of proxies. In your Scrapy project, create a file named middlewares.py and add the following code:
<pre class="highlight pre-shadow">
<code class="js">
# middlewares.py
import random

class ProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            proxy_list=crawler.settings.get('ROTATING_PROXY_LIST')
        )

    def process_request(self, request, spider):
        # Pick a random proxy for every outgoing request
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
</code>
</pre>To use the custom proxy middleware in your Scrapy spider, follow these steps:
1. Ensure the Middleware is Configured: Make sure the middleware you created (ProxyMiddleware) is included in the DOWNLOADER_MIDDLEWARES setting of your Scrapy project, as shown in the settings.py configuration.
2. Create a Scrapy Spider: Define your Scrapy spider to perform the actual web scraping. Here's an example of a simple Scrapy spider:
<pre class="highlight pre-shadow">
<code class="js">
# my_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ['http://example.com']  # Replace with your target URL(s)

    def parse(self, response):
        self.logger.info(f'Visited site: {response.url}')
        # Your parsing logic here
</code>
</pre>3. Run the Spider: Execute your spider using the Scrapy command-line tool. This will start the scraping process and utilize the rotating proxies as configured:
<pre class="highlight pre-shadow">
<code class="js">
scrapy crawl my_spider
</code>
</pre>Residential proxies are IP addresses assigned by Internet Service Providers (ISPs) to homeowners. Because requests appear to come from regular residential users, these proxies are harder for target websites to detect and block. Here’s how to set up and use rotating residential proxies in Scrapy.
Select a fast and affordable residential proxy provider like Webshare. Purchase a residential proxy plan and obtain the proxy details, including IP addresses, ports, and authentication credentials.
Modify the Scrapy settings to integrate proxy rotation. Open the settings.py file of your Scrapy project and add the following configurations:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
# The custom ProxyMiddleware must run before the built-in
# HttpProxyMiddleware (priority 750) so the proxy it sets in
# request.meta, including its credentials, is picked up.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 610,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
PROXY_LIST = [
    'http://username:password@residential_proxy1:port',
    'http://username:password@residential_proxy2:port',
    'http://username:password@residential_proxy3:port',
]
ROTATING_PROXY_LIST = PROXY_LIST
</code>
</pre>Create a custom middleware to handle the rotation of residential proxies. In your Scrapy project, create or update the middlewares.py file with the following code:
<pre class="highlight pre-shadow">
<code class="js">
# middlewares.py
import random

class ProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            proxy_list=crawler.settings.get('ROTATING_PROXY_LIST')
        )

    def process_request(self, request, spider):
        # Pick a random proxy for every outgoing request
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
</code>
</pre>Ensure your spider is configured to use the proxy middleware. Here's an example of a Scrapy spider that uses the rotating residential proxies:
<pre class="highlight pre-shadow">
<code class="js">
# my_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ['http://example.com']  # Replace with your target URL(s)

    def parse(self, response):
        self.logger.info(f'Visited site: {response.url}')
        # Example: extract and log the title of the page
        page_title = response.xpath('//title/text()').get()
        self.logger.info(f'Page title: {page_title}')
        # Additional parsing and data extraction logic
</code>
</pre>Run your spider and ensure the requests are being sent through different residential proxies:
<pre class="highlight pre-shadow">
<code class="js">
scrapy crawl my_spider -s LOG_LEVEL=INFO
</code>
</pre>Check the logs to confirm that different proxies are being used for each request. This ensures that your Scrapy spider effectively utilizes rotating residential proxies, enhancing the reliability of your web scraping tasks.
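Note that random.choice can hand the same proxy to consecutive requests, so occasional repeats in the logs are normal. If you want a strictly even spread instead, a round-robin variant of the middleware is a small change (a sketch reading the same ROTATING_PROXY_LIST setting; register it in DOWNLOADER_MIDDLEWARES in place of ProxyMiddleware):

```python
# middlewares.py (round-robin variant)
from itertools import cycle

class RoundRobinProxyMiddleware:
    def __init__(self, proxy_list):
        # cycle() yields proxies in order and restarts at the end of the list
        self.proxies = cycle(proxy_list)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxy_list=crawler.settings.get('ROTATING_PROXY_LIST'))

    def process_request(self, request, spider):
        proxy = next(self.proxies)
        request.meta['proxy'] = proxy
        # Log the choice so rotation is visible at LOG_LEVEL=DEBUG
        spider.logger.debug(f'Using proxy: {proxy}')
```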
When using rotating proxies in Scrapy, you might encounter some common issues. Here are three typical problems and their solutions:
Problem: The spider fails to connect using a proxy, resulting in connection errors.
Solution: Proxy connection failures are often due to inactive or improperly formatted proxies. It's crucial to validate your proxies before adding them to your rotation list. You can write a simple script to test each proxy:
<pre class="highlight pre-shadow">
<code class="js">
import requests

proxy_list = [
    'http://username:password@proxy1:port',
    'http://username:password@proxy2:port',
    'http://username:password@proxy3:port',
]

def test_proxy(proxy):
    """Return True if the proxy answers a simple request within 5 seconds."""
    try:
        response = requests.get(
            'http://httpbin.org/ip',
            proxies={'http': proxy, 'https': proxy},
            timeout=5,
        )
        return response.status_code == 200
    except requests.RequestException:
        return False

valid_proxies = [proxy for proxy in proxy_list if test_proxy(proxy)]
print(f'Valid proxies: {valid_proxies}')
</code>
</pre>Additionally, configure Scrapy to retry failed requests by increasing the retry settings in settings.py:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 5 # Number of times to retry
</code>
</pre>Problem: Proxies introduce high latency, causing requests to time out.
Solution: High latency and timeouts can significantly slow down your scraping operations. To mitigate this, adjust the download timeout settings in settings.py:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
DOWNLOAD_TIMEOUT = 30  # Increase the timeout period
</code>
</pre>Furthermore, filter out slow proxies by measuring their response times. Update your proxy testing script to include response time checks:
<pre class="highlight pre-shadow">
<code class="js">
import time
import requests

def test_proxy(proxy):
    """Return True only for proxies that respond quickly."""
    try:
        start_time = time.time()
        response = requests.get(
            'http://httpbin.org/ip',
            proxies={'http': proxy, 'https': proxy},
            timeout=5,
        )
        response_time = time.time() - start_time
        return response.status_code == 200 and response_time < 5  # Only keep fast proxies
    except requests.RequestException:
        return False
</code>
</pre>Regularly update your proxy list to maintain a set of high-quality, low-latency proxies.
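Beyond one-off testing, you can also keep the list healthy at runtime by tracking failures per proxy and evicting repeat offenders. A minimal tracker sketch (the class name and the three-strikes threshold are illustrative) that a middleware could call from its process_response or process_exception hooks:

```python
# proxy_health.py
# Tracks consecutive failures per proxy and evicts chronically bad ones.

class ProxyHealthTracker:
    def __init__(self, proxies, max_failures=3):
        self.failures = {proxy: 0 for proxy in proxies}
        self.max_failures = max_failures

    def mark_success(self, proxy):
        # A success resets the consecutive-failure counter
        if proxy in self.failures:
            self.failures[proxy] = 0

    def mark_failure(self, proxy):
        # Evict the proxy once it fails too many times in a row
        if proxy in self.failures:
            self.failures[proxy] += 1
            if self.failures[proxy] >= self.max_failures:
                del self.failures[proxy]

    def active_proxies(self):
        return list(self.failures)
```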
Problem: Despite using rotating proxies, your requests still get blocked or encounter CAPTCHAs.
Solution: Proxy bans and CAPTCHAs are common defenses against web scraping. To avoid detection, implement user-agent rotation in addition to proxy rotation, for example with the scrapy-user-agents package (pip install scrapy-user-agents). Update settings.py to enable it:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
DOWNLOADER_MIDDLEWARES.update({
    # Disable the built-in user-agent middleware and rotate
    # user agents randomly on every request instead
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
})
</code>
</pre>Rotating user-agents makes it harder for websites to identify and block your scraping requests. Additionally, you can use services like 2Captcha or Anti-Captcha to solve CAPTCHAs automatically:
<pre class="highlight pre-shadow">
<code class="js">
# Example using the 2Captcha HTTP API
import time
import requests

captcha_api_key = 'your_2captcha_api_key'
captcha_site_key = 'site_key_from_website'
url = 'http://example.com'

def solve_captcha(api_key, site_key, url):
    # Submit the reCAPTCHA job
    response = requests.post('http://2captcha.com/in.php', data={
        'key': api_key,
        'method': 'userrecaptcha',
        'googlekey': site_key,
        'pageurl': url,
    })
    if response.ok and response.text.startswith('OK'):
        captcha_id = response.text.split('|')[1]
        # Poll for the result
        while True:
            res = requests.get(
                f'http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}'
            )
            if res.text == 'CAPCHA_NOT_READY':
                time.sleep(5)
                continue
            elif res.text.startswith('OK'):
                return res.text.split('|')[1]
            else:
                raise Exception(f'Error solving captcha: {res.text}')
    else:
        raise Exception(f'Error submitting captcha: {response.text}')
</code>
</pre>By integrating user-agent rotation and automatic CAPTCHA solving, you can significantly reduce the chances of your requests being blocked and ensure continuous data extraction.
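Request pacing is one more lever: slowing down and varying request timing makes traffic look less bot-like and spreads load across your proxy pool. Scrapy ships throttling settings for this (the values below are illustrative starting points, not recommendations):

```python
# settings.py
DOWNLOAD_DELAY = 1                   # base delay (seconds) between requests to a domain
RANDOMIZE_DOWNLOAD_DELAY = True      # vary the delay (0.5x to 1.5x of DOWNLOAD_DELAY)
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # limit parallelism per target site
```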
Using rotating proxies in Scrapy is essential for maintaining successful web scraping operations. Datacenter proxies offer speed and cost-effectiveness, while residential proxies provide reliability and reduce the risk of bans. By implementing the methods outlined and addressing common issues, you can enhance your scraping projects, ensuring smooth and uninterrupted data extraction.