Web Scraping in Scrapy: Working Example [5 Minutes]

In this article, we'll walk you through a quick working example of web scraping with Scrapy, using an ecommerce website. Whether you're new to Scrapy or looking to refine your skills, this guide will help you understand how to use Scrapy for web scraping efficiently. We'll also cover common troubleshooting tips to help you avoid typical errors encountered by beginners.

Prerequisites

Before we dive into the step-by-step guide, ensure you have the following prerequisites in place:

Python: Install the latest version of Python from python.org.
Scrapy: Install Scrapy by running pip install scrapy in your terminal or command prompt.

Step-by-Step guide to web scraping with Scrapy

In this section, we'll use the ecommerce demo site books.toscrape.com and cover the steps required to set up and run a web scraper using Scrapy.

Step 1: Set up your Scrapy project

First, we need to set up a new Scrapy project. This will create the necessary directory structure and files required for our scraper.

1. Open your terminal and run the following command to create a new Scrapy project named books_scraper:


scrapy startproject books_scraper

2. Navigate into the newly created project directory:


cd books_scraper

Your project directory should now have the following structure:


books_scraper/
    scrapy.cfg
    books_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

Step 2: Creating a spider

Next, we need to create a spider to crawl the website. We’ll define our spider in the spiders directory.

1. Create a new file named books_spider.py in the books_scraper/spiders directory and add the following code:


import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = [
        'http://books.toscrape.com/',
    ]

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

In this spider, we define:

The name of the spider as "books".
The start_urls list containing the URL of the website to scrape.
The parse method, which extracts the title and price of each book from the page and follows pagination links to scrape additional pages.
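
To see what these selectors return without crawling the live site, the same extraction logic can be tried against a small HTML string. The following is a minimal sketch: the markup, book titles, and prices are hypothetical and only modeled on the class names the spider relies on, and urljoin is used here to illustrate the kind of relative-link resolution that response.follow performs.

```python
from urllib.parse import urljoin

from scrapy.selector import Selector

# Hypothetical, simplified markup modeled on the selectors used in the spider.
SAMPLE_HTML = """
<html><body>
  <article class="product_pod">
    <h3><a href="book-one.html" title="Example Book One">Example Book...</a></h3>
    <p class="price_color">£10.00</p>
  </article>
  <article class="product_pod">
    <h3><a href="book-two.html" title="Example Book Two">Example Book...</a></h3>
    <p class="price_color">£12.50</p>
  </article>
  <ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>
</body></html>
"""

selector = Selector(text=SAMPLE_HTML)

# Same extraction logic as BooksSpider.parse, run against the sample markup.
books = [
    {
        "title": book.css("h3 a::attr(title)").get(),
        "price": book.css("p.price_color::text").get(),
    }
    for book in selector.css("article.product_pod")
]

# .get() returns None when no element matches, which is how the spider
# detects the last page.
next_page = selector.css("li.next a::attr(href)").get()

# Resolve the relative href against the page URL, as response.follow would.
next_url = urljoin("http://books.toscrape.com/", next_page)

print(books)
print(next_url)
```

Testing selectors on a saved or hand-written snippet like this makes it easier to spot a wrong class name before running a full crawl.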

Step 3: Running your spider

To run your spider, your terminal must be in the project's root directory, the outer books_scraper folder that contains scrapy.cfg. If you opened a new terminal since Step 1, navigate there first:


cd books_scraper

Now, use this command to run the books spider and save the scraped data to a books.json file in the project’s root directory:


scrapy crawl books -o books.json

When the crawl finishes, books.json contains a JSON array with one object per book, each with a title and a price field.
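
Once books.json exists, the records can be post-processed with plain Python. The following is a minimal sketch, assuming each record has the title and price fields yielded by the spider and that prices are strings prefixed with a pound sign. The helper names parse_price and load_books and the sample records are illustrative, not real scraped data.

```python
import json


def parse_price(price_text):
    """Convert a scraped price string such as '£10.00' into a float."""
    return float(price_text.replace("£", "").strip())


def load_books(json_text):
    """Parse the exported JSON array and add a numeric price to each book."""
    books = json.loads(json_text)
    for book in books:
        book["price_value"] = parse_price(book["price"])
    return books


# Illustrative sample in the same shape as the spider's output (not real data).
sample = (
    '[{"title": "Example Book One", "price": "£10.00"},'
    ' {"title": "Example Book Two", "price": "£12.50"}]'
)

books = load_books(sample)
cheapest = min(books, key=lambda book: book["price_value"])
print(cheapest["title"], cheapest["price_value"])
```

To run this against the real export, read the file contents first (for example with open("books.json", encoding="utf-8")) and pass the text to load_books.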

Common troubleshooting tips

When using Scrapy for web scraping, beginners often encounter a few common errors. Here, we’ll discuss three frequent issues and how to resolve them:

ImportError: No module named scrapy

Error Description: This error occurs when Python cannot find the Scrapy module, usually because Scrapy is not installed in the current environment. On Python 3.6 and later, it appears as ModuleNotFoundError: No module named 'scrapy'.
Solution: Ensure Scrapy is installed in your environment. You can install Scrapy using the following command:


pip install scrapy

If you’re using a virtual environment, make sure it’s activated before running the installation command:


# Activate virtual environment
source venv/bin/activate  # On Windows, use venv\Scripts\activate

# Install Scrapy
pip install scrapy
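
If the error persists after installing, it can help to confirm which interpreter is actually running and whether it can see Scrapy. The following is a small sketch using only the standard library; the helper name is_installed is illustrative.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter path shows whether the virtual environment is really active.
print("Interpreter:", sys.executable)
print("Scrapy importable:", is_installed("scrapy"))
```

If the interpreter path points outside your virtual environment, the environment is probably not activated in the current terminal.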

403 Forbidden error

Error Description: A 403 Forbidden error means that the website you are trying to scrape is blocking your requests. This often happens because the website identifies your scraper as a bot.
Solution: A common first step is to set a user agent string that mimics a real browser. To apply it to a single spider, add a custom_settings class attribute to your spider class:


custom_settings = {
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

Alternatively, you can set the user agent in the settings.py file of your Scrapy project, which applies it to every spider in the project:


# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

Missing output file

Error Description: Sometimes, after running the spider, you might notice that the output file (e.g., JSON, CSV) is not created. This usually happens due to incorrect command usage or path issues.
Solution: Ensure you have specified the correct output format and path when running the spider. The command should look like this:


scrapy crawl books -o books.json

If you want to save the file to a specific directory, make sure the directory exists and use the correct path:


scrapy crawl books -o output/books.json

Ensure that you have write permissions for the directory where you are trying to save the file.
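
The directory and permission checks above can be done from Python before starting a crawl. The following is a minimal sketch using only the standard library; the function name prepare_output_path is illustrative, and os.access reports what the operating system allows for the current user.

```python
import os
from pathlib import Path


def prepare_output_path(output_file):
    """Create the parent directory if needed and report whether it is writable."""
    parent = Path(output_file).resolve().parent
    # Create any missing directories; do nothing if they already exist.
    parent.mkdir(parents=True, exist_ok=True)
    return os.access(parent, os.W_OK)
```

For example, calling prepare_output_path("output/books.json") from the project's root directory creates the output folder if it is missing and returns True when the folder is writable.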

Conclusion

Web scraping with Scrapy is an easy way to extract data from websites using Python. By following the step-by-step guide provided, you can quickly set up a Scrapy project, create a spider, and start scraping data from a website. We also covered common troubleshooting tips to help you resolve frequent issues faced by beginners. With Scrapy’s features and ease of use, you can streamline your web scraping tasks and focus on analyzing the data you need.
