Web scraping is crucial for data extraction and analysis across various industries. Scrapy, a popular Python web scraping framework, is often used for its efficiency and flexibility. However, websites increasingly deploy measures to detect and block scraping activities, primarily through IP address monitoring. To avoid detection and maintain successful scraping sessions, rotating proxies are essential. In this article, we’ll cover two methods for implementing rotating proxies in Scrapy: using datacenter proxies (Method 1) and residential proxies (Method 2). Additionally, we’ll discuss common issues and debugging tips to help you effectively manage your web scraping projects.
Before diving into the methods for rotating proxies in Scrapy, ensure you have the following prerequisites in place:
1. Basic Knowledge of Python and Scrapy: Familiarity with Python programming and the Scrapy framework is essential. You should know how to set up a Scrapy project and create basic spiders.
2. Scrapy Installation: Make sure Scrapy is installed on your system. You can install Scrapy using pip:
<pre class="highlight pre-shadow">
<code class="js">
pip install scrapy
</code>
</pre>3. Proxy Provider Account: To use rotating proxies, you need an account with a proxy provider. There are various providers for both datacenter and residential proxies. Sign up and obtain the necessary proxy details (IP addresses, ports, usernames, and passwords). Webshare offers 10 free datacenter proxies to get you started.
4. Basic Understanding of HTTP and Networking: Understanding how HTTP requests and responses work, as well as some basic networking concepts, will help in configuring and troubleshooting proxies.
Now, let’s dive into the methods for rotating proxies in Scrapy.
Datacenter proxies are IP addresses provided by data centers rather than residential ISPs. They are cost-effective and offer high-speed connections, making them a popular choice for web scraping. Here’s how to set up and use rotating datacenter proxies in Scrapy.
First, select a reliable datacenter proxy provider. Sign up for an account and obtain the proxy list, which typically includes IP addresses, ports, and authentication credentials.
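Many provider dashboards export proxies as plain `ip:port:username:password` lines. If yours does (the file name and column order below are assumptions — check your provider's actual export format), a small helper can convert such a file into the URL form used throughout this article:

```python
# proxy_loader.py
# Converts "ip:port:username:password" lines (a common export format --
# verify against your provider's actual format) into proxy URLs.

def to_proxy_url(line):
    """Turn one 'ip:port:user:pass' line into 'http://user:pass@ip:port'."""
    ip, port, user, password = line.strip().split(':')
    return f'http://{user}:{password}@{ip}:{port}'

def load_proxies(path):
    """Read a proxy export file, skipping blank lines."""
    with open(path) as f:
        return [to_proxy_url(line) for line in f if line.strip()]

print(to_proxy_url('192.0.2.10:8080:alice:s3cret'))
# http://alice:s3cret@192.0.2.10:8080
```

The resulting list can be pasted into `PROXY_LIST` in settings.py or loaded there at startup with `load_proxies()`.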
You might need additional libraries outside Scrapy itself, such as requests for testing proxies. Ensure the requests library is installed:
<pre class="highlight pre-shadow">
<code class="js">
pip install requests
</code>
</pre>Modify the Scrapy settings to integrate proxy rotation. Open the settings.py file of your Scrapy project and add the following configurations:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
# Enable the downloader middlewares. The custom ProxyMiddleware must run
# BEFORE the built-in HttpProxyMiddleware (priority 750) so the proxy it
# sets in request.meta, including its credentials, is picked up.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 610,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
# Define the proxy list (you can also load it from an external file)
PROXY_LIST = [
    'http://username:password@proxy1:port',
    'http://username:password@proxy2:port',
    'http://username:password@proxy3:port',
]
# Setting read by the custom rotation middleware
ROTATING_PROXY_LIST = PROXY_LIST
</code>
</pre>Create a custom middleware to handle the rotation of proxies. In your Scrapy project, create a file named middlewares.py and add the following code:
<pre class="highlight pre-shadow">
<code class="js">
# middlewares.py
import random

class ProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            proxy_list=crawler.settings.get('ROTATING_PROXY_LIST')
        )

    def process_request(self, request, spider):
        # Pick a random proxy for every outgoing request
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
</code>
</pre>To use the custom proxy middleware in your Scrapy spider, follow these steps:
1. Ensure the Middleware is Configured: Make sure the middleware you created (ProxyMiddleware) is included in the DOWNLOADER_MIDDLEWARES setting of your Scrapy project, as shown in the settings.py configuration.
2. Create a Scrapy Spider: Define your Scrapy spider to perform the actual web scraping. Here's an example of a simple Scrapy spider:
<pre class="highlight pre-shadow">
<code class="js">
# my_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ['http://example.com']  # Replace with your target URL(s)

    def parse(self, response):
        self.logger.info(f'Visited site: {response.url}')
        # Your parsing logic here
</code>
</pre>3. Run the Spider: Execute your spider using the Scrapy command-line tool. This will start the scraping process and utilize the rotating proxies as configured:
<pre class="highlight pre-shadow">
<code class="js">
scrapy crawl my_spider
</code>
</pre>Residential proxies are IP addresses assigned by Internet Service Providers (ISPs) to homeowners. Because requests appear to come from regular residential users, these proxies are harder for target websites to detect and block. Here’s how to set up and use rotating residential proxies in Scrapy.
Select a fast and affordable residential proxy provider like Webshare. Purchase a residential proxy plan and obtain the proxy details, including IP addresses, ports, and authentication credentials.
Modify the Scrapy settings to integrate proxy rotation. Open the settings.py file of your Scrapy project and add the following configurations:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
# The custom ProxyMiddleware must run before the built-in
# HttpProxyMiddleware (priority 750) so the proxy it sets in
# request.meta, including its credentials, is picked up.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 610,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
PROXY_LIST = [
    'http://username:password@residential_proxy1:port',
    'http://username:password@residential_proxy2:port',
    'http://username:password@residential_proxy3:port',
]
ROTATING_PROXY_LIST = PROXY_LIST
</code>
</pre>Create a custom middleware to handle the rotation of residential proxies. In your Scrapy project, create or update the middlewares.py file with the following code:
<pre class="highlight pre-shadow">
<code class="js">
# middlewares.py
import random

class ProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            proxy_list=crawler.settings.get('ROTATING_PROXY_LIST')
        )

    def process_request(self, request, spider):
        # Pick a random proxy for every outgoing request
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
</code>
</pre>Ensure your spider is configured to use the proxy middleware. Here's an example of a Scrapy spider that uses the rotating residential proxies:
<pre class="highlight pre-shadow">
<code class="js">
# my_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ['http://example.com']  # Replace with your target URL(s)

    def parse(self, response):
        self.logger.info(f'Visited site: {response.url}')
        # Example: extract and log the title of the page
        page_title = response.xpath('//title/text()').get()
        self.logger.info(f'Page title: {page_title}')
        # Additional parsing and data extraction logic
</code>
</pre>Run your spider and ensure the requests are being sent through different residential proxies:
<pre class="highlight pre-shadow">
<code class="js">
scrapy crawl my_spider -s LOG_LEVEL=INFO
</code>
</pre>Check the logs to confirm that different proxies are being used for each request. This ensures that your Scrapy spider effectively utilizes rotating residential proxies, enhancing the reliability of your web scraping tasks.
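Note that random.choice can hand the same proxy to consecutive requests, so occasional repeats in the logs are normal. If you want a strictly even spread instead, a round-robin variant of the middleware is a small change (a sketch reading the same ROTATING_PROXY_LIST setting; register it in DOWNLOADER_MIDDLEWARES in place of ProxyMiddleware):

```python
# middlewares.py (round-robin variant)
from itertools import cycle

class RoundRobinProxyMiddleware:
    def __init__(self, proxy_list):
        # cycle() yields proxies in order and restarts at the end of the list
        self.proxies = cycle(proxy_list)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxy_list=crawler.settings.get('ROTATING_PROXY_LIST'))

    def process_request(self, request, spider):
        proxy = next(self.proxies)
        request.meta['proxy'] = proxy
        # Log the choice so rotation is visible at LOG_LEVEL=DEBUG
        spider.logger.debug(f'Using proxy: {proxy}')
```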
When using rotating proxies in Scrapy, you might encounter some common issues. Here are three typical problems and their solutions:
Problem: The spider fails to connect using a proxy, resulting in connection errors.
Solution: Proxy connection failures are often due to inactive or improperly formatted proxies. It's crucial to validate your proxies before adding them to your rotation list. You can write a simple script to test each proxy:
<pre class="highlight pre-shadow">
<code class="js">
import requests

proxy_list = [
    'http://username:password@proxy1:port',
    'http://username:password@proxy2:port',
    'http://username:password@proxy3:port',
]

def test_proxy(proxy):
    """Return True if the proxy answers a simple request within 5 seconds."""
    try:
        response = requests.get(
            'http://httpbin.org/ip',
            proxies={'http': proxy, 'https': proxy},
            timeout=5,
        )
        return response.status_code == 200
    except requests.RequestException:
        return False

valid_proxies = [proxy for proxy in proxy_list if test_proxy(proxy)]
print(f'Valid proxies: {valid_proxies}')
</code>
</pre>Additionally, configure Scrapy to retry failed requests by increasing the retry settings in settings.py:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 5 # Number of times to retry
</code>
</pre>Problem: Proxies introduce high latency, causing requests to time out.
Solution: High latency and timeouts can significantly slow down your scraping operations. To mitigate this, adjust the download timeout settings in settings.py:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
DOWNLOAD_TIMEOUT = 30  # Increase the timeout period
</code>
</pre>Furthermore, filter out slow proxies by measuring their response times. Update your proxy testing script to include response time checks:
<pre class="highlight pre-shadow">
<code class="js">
import time
import requests

def test_proxy(proxy):
    """Return True only for proxies that respond quickly."""
    try:
        start_time = time.time()
        response = requests.get(
            'http://httpbin.org/ip',
            proxies={'http': proxy, 'https': proxy},
            timeout=5,
        )
        response_time = time.time() - start_time
        return response.status_code == 200 and response_time < 5  # Only keep fast proxies
    except requests.RequestException:
        return False
</code>
</pre>Regularly update your proxy list to maintain a set of high-quality, low-latency proxies.
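Beyond one-off testing, you can also keep the list healthy at runtime by tracking failures per proxy and evicting repeat offenders. A minimal tracker sketch (the class name and the three-strikes threshold are illustrative) that a middleware could call from its process_response or process_exception hooks:

```python
# proxy_health.py
# Tracks consecutive failures per proxy and evicts chronically bad ones.

class ProxyHealthTracker:
    def __init__(self, proxies, max_failures=3):
        self.failures = {proxy: 0 for proxy in proxies}
        self.max_failures = max_failures

    def mark_success(self, proxy):
        # A success resets the consecutive-failure counter
        if proxy in self.failures:
            self.failures[proxy] = 0

    def mark_failure(self, proxy):
        # Evict the proxy once it fails too many times in a row
        if proxy in self.failures:
            self.failures[proxy] += 1
            if self.failures[proxy] >= self.max_failures:
                del self.failures[proxy]

    def active_proxies(self):
        return list(self.failures)
```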
Problem: Despite using rotating proxies, your requests still get blocked or encounter CAPTCHAs.
Solution: Proxy bans and CAPTCHAs are common defenses against web scraping. To avoid detection, implement user-agent rotation in addition to proxy rotation, for example with the scrapy-user-agents package (pip install scrapy-user-agents). Update settings.py to enable it:
<pre class="highlight pre-shadow">
<code class="js">
# settings.py
DOWNLOADER_MIDDLEWARES.update({
    # Disable the built-in user-agent middleware and rotate
    # user agents randomly on every request instead
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
})
</code>
</pre>Rotating user-agents makes it harder for websites to identify and block your scraping requests. Additionally, you can use services like 2Captcha or Anti-Captcha to solve CAPTCHAs automatically:
<pre class="highlight pre-shadow">
<code class="js">
# Example using the 2Captcha HTTP API
import time
import requests

captcha_api_key = 'your_2captcha_api_key'
captcha_site_key = 'site_key_from_website'
url = 'http://example.com'

def solve_captcha(api_key, site_key, url):
    # Submit the reCAPTCHA job
    response = requests.post('http://2captcha.com/in.php', data={
        'key': api_key,
        'method': 'userrecaptcha',
        'googlekey': site_key,
        'pageurl': url,
    })
    if response.ok and response.text.startswith('OK'):
        captcha_id = response.text.split('|')[1]
        # Poll for the result
        while True:
            res = requests.get(
                f'http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}'
            )
            if res.text == 'CAPCHA_NOT_READY':
                time.sleep(5)
                continue
            elif res.text.startswith('OK'):
                return res.text.split('|')[1]
            else:
                raise Exception(f'Error solving captcha: {res.text}')
    else:
        raise Exception(f'Error submitting captcha: {response.text}')
</code>
</pre>By integrating user-agent rotation and automatic CAPTCHA solving, you can significantly reduce the chances of your requests being blocked and ensure continuous data extraction.
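Request pacing is one more lever: slowing down and varying request timing makes traffic look less bot-like and spreads load across your proxy pool. Scrapy ships throttling settings for this (the values below are illustrative starting points, not recommendations):

```python
# settings.py
DOWNLOAD_DELAY = 1                   # base delay (seconds) between requests to a domain
RANDOMIZE_DOWNLOAD_DELAY = True      # vary the delay (0.5x to 1.5x of DOWNLOAD_DELAY)
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # limit parallelism per target site
```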
Using rotating proxies in Scrapy is essential for maintaining successful web scraping operations. Datacenter proxies offer speed and cost-effectiveness, while residential proxies provide reliability and reduce the risk of bans. By implementing the methods outlined and addressing common issues, you can enhance your scraping projects, ensuring smooth and uninterrupted data extraction.