Tutorial

How To Crawl A Web Page with Scrapy and Python 3

Updated on December 7, 2022
Default avatar

By Justin Duke

English
How To Crawl A Web Page with Scrapy and Python 3

Introduction

Web scraping, often called web crawling or web spidering, is the act of programmatically going over a collection of web pages and extracting data, and is a powerful tool for working with data on the web.

With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, retrieve data from a site without an official API, or just satisfy your own personal curiosity.

In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We’ll use Quotes to Scrape, a database of quotations hosted on a site designed for testing out web spiders. By the end of this tutorial, you’ll have a fully functional Python web scraper that walks through a series of pages containing quotes and displays them on your screen.

The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.

Prerequisites

To complete this tutorial, you’ll need a local development environment for Python 3. You can follow How To Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.

Step 1 — Creating a Basic Scraper

Scraping is a two step process:

  1. Systematically finding and downloading web pages.
  2. Extract information from the downloaded pages.

Both of those steps can be implemented in a number of ways in many languages.

You can build a scraper from scratch using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you’ll need to handle concurrency so you can crawl more than one page at a time. You’ll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you’ll sometimes have to deal with sites that require specific settings and access patterns.

You’ll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we’re going to use Python and Scrapy to build our scraper.

Scrapy is one of the most popular and powerful Python scraping libraries; it takes a “batteries included” approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers don’t have to reinvent the wheel each time.

Scrapy, like most Python packages, is on PyPI (also known as pip). PyPI, the Python Package Index, is a community-owned repository of all published Python software.

If you have a Python installation like the one outlined in the prerequisite for this tutorial, you already have pip installed on your machine, so you can install Scrapy with the following command:

  1. pip install scrapy

If you run into any issues with the installation, or you want to install Scrapy without using pip, check out the official installation docs.

With Scrapy installed, create a new folder for our project. You can do this in the terminal by running:

  1. mkdir quote-scraper

Now, navigate into the new directory you just created:

  1. cd quote-scraper

Then create a new Python file for our scraper called scraper.py. We’ll place all of our code in this file for this tutorial. You can create this file using the editing software of your choice.

Start out the project by making a very basic scraper that uses Scrapy as its foundation. To do that, you’ll need to create a Python class that subclasses scrapy.Spider, a basic spider class provided by Scrapy. This class will have two required attributes:

  • name — just a name for the spider.
  • start_urls — a list of URLs that you start to crawl from. We’ll start with one URL.

Open the scrapy.py file in your text editor and add this code to create the basic spider:

scraper.py
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

Let’s break this down line by line:

First, we import scrapy so that we can use the classes that the package provides.

Next, we take the Spider class provided by Scrapy and make a subclass out of it called BrickSetSpider. Think of a subclass as a more specialized form of its parent class. The Spider class has methods and behaviors that define how to follow URLs and extract data from the pages it finds, but it doesn’t know where to look or what data to look for. By subclassing it, we can give it that information.

Finally, we name the class quote-spider and give our scraper a single URL to start from: https://quotes.toscrape.com. If you open that URL in your browser, it will take you to a search results page, showing the first of many pages of famous quotations.

Now, test out the scraper. Typically, Python files are run with a command like python path/to/file.py. However, Scrapy comes with its own command line interface to streamline the process of starting a scraper. Start your scraper with the following command:

  1. scrapy runspider scraper.py

The command will output something like this:

Output
2022-12-02 10:30:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor 2022-12-02 10:30:08 [scrapy.extensions.telnet] INFO: Telnet Password: b4d94e3a8d22ede1 2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', ... 'scrapy.extensions.logstats.LogStats'] 2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', ... 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', ... 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled item pipelines: [] 2022-12-02 10:30:08 [scrapy.core.engine] INFO: Spider opened 2022-12-02 10:30:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2022-12-02 10:30:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2022-12-02 10:49:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com> (referer: None) 2022-12-02 10:30:08 [scrapy.core.engine] INFO: Closing spider (finished) 2022-12-02 10:30:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 226, ... 'start_time': datetime.datetime(2022, 12, 2, 18, 30, 8, 492403)} 2022-12-02 10:30:08 [scrapy.core.engine] INFO: Spider closed (finished)

That’s a lot of output, so let’s break it down.

  • The scraper initialized and loaded additional components and extensions it needed to handle reading data from URLs.
  • It used the URL we provided in the start_urls list and grabbed the HTML, just like your web browser would do.
  • It passed that HTML to the parse method, which doesn’t do anything by default. Since we never wrote our own parse method, the spider just finishes without doing any work.

Now let’s pull some data from the page.

Step 2 — Extracting Data from a Page

We’ve created a very basic program that pulls down a page, but it doesn’t do any scraping or spidering yet. Let’s give it some data to extract.

If you look at the page we want to scrape, you’ll see it has the following structure:

  • There’s a header that’s present on every page.
  • There’s a link to log in.
  • Then there are the quotes themselves, displayed in what looks like a table or ordered list. Each quote has a similar format.

When writing a scraper, you will need to look at the source of the HTML file and familiarize yourself with the structure. So here it is, with the tags that aren’t relevant to our goal removed for readability:

quotes.toscrape.com
<body> ... <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“I have not failed. I&#39;ve just found 10,000 ways that won&#39;t work.”</span> <span>by <small class="author" itemprop="author">Thomas A. Edison</small> <a href="/author/Thomas-A-Edison">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" / > <a class="tag" href="/tag/edison/page/1/">edison</a> <a class="tag" href="/tag/failure/page/1/">failure</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a> </div> </div> ... </body>

Scraping this page is a two step process:

  1. First, grab each quote by looking for the parts of the page that have the data we want.
  2. Then, for each quote, grab the data we want from it by pulling the data out of the HTML tags.

scrapy grabs data based on selectors that you provide. Selectors are patterns we can use to find one or more elements on a page so we can then work with the data within the element. scrapy supports either CSS selectors or XPath selectors.

We’ll use CSS selectors for now since CSS is a perfect fit for finding all the sets on the page. If you look at the HTML, you’ll see that each quote is specified with the class quote. Since we’re looking for a class, we’d use .quote for our CSS selector. The . part of the selector searches the class attribute on elements. All we have to do is create a new method in our class named parse and pass that selector into the response object, like this:

scraper.py
class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        
        for quote in response.css(QUOTE_SELECTOR):
            pass

This code grabs all the sets on the page and loops over them to extract the data. Now let’s extract the data from those quotes so we can display it.

Another look at the source of the page we’re parsing tells us that the text of each quote is stored within a span with the text class and the author of the quote in a <small> tag with the author class:

quotes.toscrape.com
... <span class="text" itemprop="text">“I have not failed. I&#39;ve just found 10,000 ways that won&#39;t work.”</span> <span>by <small class="author" itemprop="author">Thomas A. Edison</small> ...

The quote object we’re looping over has its own css method, so we can pass in a selector to locate child elements. Modify your code as follows to locate the name of the set and display it:

scraper.py
class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        
        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
            }

Note: The trailing comma after extract_first() isn’t a typo. In Python, a trailing comma in dict objects is valid syntax, and a good way to leave room for more adding more items, which we will here later.

You’ll notice two things going on in this code:

  • We append ::text to our selectors for the quote and author. That’s a CSS pseudo-selector that fetches the text inside of the tag rather than the tag itself.
  • We call extract_first() on the object returned by quote.css(TEXT_SELECTOR) because we just want the first element that matches the selector. This gives us a string, rather than a list of elements.

Save the file and run the scraper again:

  1. scrapy runspider scraper.py

This time the output will contain the quotes and their authors:

Output
... 2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'} 2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'} 2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'} ...

Let’s keep expanding on this by adding new selectors for links to pages about the author and tags for the quote. By investigating the HTML for each quote, we find:

  • The link to the author’s about page is stored in a link immediately following their name.
  • The tags are stored as a collection of a tags, each classed tag, stored within a div element with the tags class.

So, let’s modify the scraper to get this new information:

scraper.py
class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' + 
                        quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

Save your changes and run the scraper again:

  1. scrapy runspider scraper.py

Now the output will contain the new data:

Output
2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'about': 'https://quotes.toscrape.com/author/Albert-Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']} 2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'about': 'https://quotes.toscrape.com/author/Jane-Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']} 2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'about': 'https://quotes.toscrape.com/author/Marilyn-Monroe', 'tags': ['be-yourself', 'inspirational']}

Now let’s turn this scraper into a spider that follows links.

Step 3 — Crawling Multiple Pages

You’ve successfully extracted data from that initial page, but we’re not progressing past it to see the rest of the results. The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.

You’ll notice that the top and bottom of each page has a little right carat (>) that links to the next page of results. Here’s the HTML for that:

quotes.toscrape.com
... <nav> <ul class="pager"> <li class="next"> <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a> </li> </ul> </nav> ...

In the source, you will find an li tag with the class of next, and inside that tag, there’s an a tag with a link to the next page. All we have to do is tell the scraper to follow that link if it exists.

Modify your code as follows:

scraper.py
class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'
        NEXT_SELECTOR = '.next a::attr("href")'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' + 
                        quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

        next_page = response.css(NEXT_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

First, we define a selector for the “next page” link, extract the first match, and check if it exists. The scrapy.Request is a new request object that Scrapy knows means it should fetch and parse next.

This means that once we go to the next page, we’ll look for a link to the next page there, and on that page we’ll look for a link to the next page, and so on, until we don’t find a link for the next page. This is the key piece of web scraping: finding and following links. In this example, it’s very linear; one page has a link to the next page until we’ve hit the last page, But you could follow links to tags, or other search results, or any other URL you’d like.

Now, if you save your code and run the spider again you’ll see that it doesn’t just stop once it iterates through the first page of sets. It keeps on going through all 100 quotes on all 10 pages. In the grand scheme of things it’s not a huge chunk of data, but now you know the process by which you automatically find new pages to scrape.

Here’s our completed code for this tutorial:

scraper.py
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'
        NEXT_SELECTOR = '.next a::attr("href")'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' + 
                        quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

        next_page = response.css(NEXT_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
            )

Conclusion

In this tutorial you built a fully-functional spider that extracts data from web pages in less than thirty lines of code. That’s a great start, but there’s a lot of fun things you can do with this spider. That should be enough to get you thinking and experimenting. If you need more information on Scrapy, check out Scrapy’s official docs. For more information on working with data from the web, see our tutorial on “How To Scrape Web Pages with Beautiful Soup and Python 3”.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about us


About the authors
Default avatar
Justin Duke

author



Default avatar

Tech Writer at DigitalOcean


Still looking for an answer?

Ask a questionSearch for more help

Was this helpful?
 
10 Comments


This textbox defaults to using Markdown to format your answer.

You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!

had to remove the ‘a’ from this line in order to extract the name :

NAME_SELECTOR = ‘h1 a ::text’

keeping this ‘a’ will extract the item number

Sorry, the website from the guide is not working, it returns <403 http://brickset.com/sets/year-2016>: HTTP status code is not handled or not allowed. How can I solve this ?

Hi Brian. the Tutorials is very good. but, i get empty response when i use this sample to test. then i found than the website address modified to https protocol.and i think is the reason that i got empty response. or another reason? i don’t know…

This is an excellent tutorial on crawling web pages with Scrapy and Python! The explanation of Scrapy’s architecture and the step-by-step instructions made it easy for me to follow along and build my own web crawler. While Scrapy is a powerful framework, I recently came across Crawlbase, a crawler that integrates seamlessly with Python. It offers a user-friendly interface and advanced features that streamline the crawling process. I’ve been impressed with its performance and flexibility. I highly recommend checking out Crawlbase as an alternative option for web crawling. Thank you for sharing this valuable resource!

Instead of doing:

yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

Just use:

yield response.follow(next_page, self.parse)

Thank me later.

I think all the people following this tutorial spiked the target site’s bandwidth usage and pissed the owner off, probably causing him to block scrapy somehow.

I was good up until the crawling part. I’m seeing this in my output.

2021-01-14 18:20:06 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://brickset.com/sets/year-2016/page-2> (referer: https://brickset.com/sets/year-2016)
2021-01-14 18:11:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://brickset.com/sets/year-2016/page-2>: HTTP status code is not handled or not allowed

This leads me to believe that it is not extracting the next URL correctly, but I checked and https://brickset.com/sets/year-2016/page-2 is valid. Noggin scratcher.

How to make a Chinese translation version for this article?

Thanks, Justin. It’s a great tutorial for beginner. But I just stuck in the step of running the scraper.py. It seemed different from your outcome starting from the ‘DEBUG: Redirecting …’. Here is the details below:

2020-06-02 16:34:34 [scrapy.core.engine] INFO: Spider opened
2020-06-02 16:34:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pag
es/min), scraped 0 items (at 0 items/min)
2020-06-02 16:34:34 [scrapy.extensions.telnet] INFO: Telnet console listening on
 127.0.0.1:6023
2020-06-02 16:34:34 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (
301) to <GET https://brickset.com/sets/year-2016> from <GET http://brickset.com/
sets/year-2016>
2020-06-02 16:34:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://brick
set.com/sets/year-2016> (referer: None)
2020-06-02 16:34:35 [scrapy.core.scraper] ERROR: Spider error processing <GET ht
tps://brickset.com/sets/year-2016> (referer: None)
Traceback (most recent call last):
  File "d:\python\lib\site-packages\twisted\internet\defer.py", line 654, in _ru
nCallbacks
    current.result = callback(current.result, *args, **kw)
  File "d:\python\lib\site-packages\scrapy\spiders\__init__.py", line 90, in par
se
    raise NotImplementedError('{}.parse callback is not defined'.format(self.__c
lass__.__name__))
NotImplementedError: BrickSetSpider.parse callback is not defined
2020-06-02 16:34:35 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-02 16:34:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 452,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 15537,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'elapsed_time_seconds': 1.146,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 6, 2, 8, 34, 35, 300048),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2020, 6, 2, 8, 34, 34, 154048)}
2020-06-02 16:34:35 [scrapy.core.engine] INFO: Spider closed (finished)

I checked the url is still working, but I don’t know why it got redirected.

OK, the Scraper wont run if you forget to add import scrapy to the first line of code.

You did so in the first and last Scraper.py

but for the 3 out of 5 code snippets you don’t include import scrapy line. (you do for the 1st and the last code), which confused me & my machine a bit, until I noticed.

Try DigitalOcean for free

Click below to sign up and get $200 of credit to try our products over 60 days!

Sign up

Join the Tech Talk
Success! Thank you! Please check your email for further details.

Please complete your information!

Get our biweekly newsletter

Sign up for Infrastructure as a Newsletter.

Hollie's Hub for Good

Working on improving health and education, reducing inequality, and spurring economic growth? We'd like to help.

Become a contributor

Get paid to write technical tutorials and select a tech-focused charity to receive a matching donation.

Welcome to the developer cloud

DigitalOcean makes it simple to launch in the cloud and scale up as you grow — whether you're running one virtual machine or ten thousand.

Learn more
DigitalOcean Cloud Control Panel