Web scraping, often called web crawling or web spidering, is the act of programmatically going over a collection of web pages and extracting data, and is a powerful tool for working with data on the web.
With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, retrieve data from a site without an official API, or just satisfy your own personal curiosity.
In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We’ll use Quotes to Scrape, a database of quotations hosted on a site designed for testing out web spiders. By the end of this tutorial, you’ll have a fully functional Python web scraper that walks through a series of pages containing quotes and displays them on your screen.
The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.
To complete this tutorial, you’ll need a local development environment for Python 3. You can follow How To Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.
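Scraping is a two-step process:
1. Systematically finding and downloading web pages.
2. Extracting information from the downloaded pages.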
Both of those steps can be implemented in a number of ways in many languages.
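As a rough illustration of those two steps using only Python’s standard library, the sketch below keeps downloading and extraction separate. The names `download`, `extract_quotes`, and `QuoteTextParser`, as well as the sample HTML, are assumptions made for this example; the rest of this tutorial uses Scrapy instead.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def download(url):
    """Step 1: fetch a page and return its HTML as a string."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


class QuoteTextParser(HTMLParser):
    """Step 2: collect the text of every <span class="text"> element."""

    def __init__(self):
        super().__init__()
        self.quotes = []
        self._in_quote = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated tokens.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "text" in classes:
            self._in_quote = True
            self.quotes.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_quote = False

    def handle_data(self, data):
        if self._in_quote:
            self.quotes[-1] += data


def extract_quotes(html):
    parser = QuoteTextParser()
    parser.feed(html)
    parser.close()
    return parser.quotes


# Hypothetical sample markup so the sketch runs without network access.
SAMPLE_HTML = (
    '<div class="quote"><span class="text">An example quote.</span>'
    '<small class="author">Someone</small></div>'
)
print(extract_quotes(SAMPLE_HTML))
```

To try it against a live page, you could pass the result of `download(url)` to `extract_quotes`. This sketch does not handle nested spans, retries, or politeness delays.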
You can build a scraper from scratch using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you’ll need to handle concurrency so you can crawl more than one page at a time. You’ll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you’ll sometimes have to deal with sites that require specific settings and access patterns.
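To give a sense of the format-conversion work mentioned above, here is a minimal standard-library sketch that turns a list of scraped items into JSON and CSV. The item fields, the `to_json` and `to_csv` helpers, and the choice to join tags with `|` are all assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical scraped items, shaped like the dictionaries built later on.
items = [
    {"text": "An example quote.", "author": "Someone", "tags": ["example", "demo"]},
    {"text": "Another quote.", "author": "Someone Else", "tags": []},
]


def to_json(items):
    """Serialize the items as a JSON array."""
    return json.dumps(items, ensure_ascii=False, indent=2)


def to_csv(items):
    """Serialize the items as CSV, flattening the tag list into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["text", "author", "tags"], lineterminator="\n"
    )
    writer.writeheader()
    for item in items:
        writer.writerow(dict(item, tags="|".join(item["tags"])))
    return buffer.getvalue()


print(to_json(items))
print(to_csv(items))
```

Scrapy can take this work off your hands: its feed exports write scraped items to a file, for example with the `-o` option of `scrapy runspider`.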
You’ll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we’re going to use Python and Scrapy to build our scraper.
Scrapy is one of the most popular and powerful Python scraping libraries; it takes a “batteries included” approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers don’t have to reinvent the wheel each time.
Scrapy, like most Python packages, is available on PyPI, the Python Package Index, a community-owned repository of published Python software. Packages on PyPI are installed with pip.
If you have a Python installation like the one outlined in the prerequisite for this tutorial, you already have pip installed on your machine, so you can install Scrapy with the following command:
- pip install scrapy
If you run into any issues with the installation, or you want to install Scrapy without using pip, check out the official installation docs.
With Scrapy installed, create a new folder for our project. You can do this in the terminal by running:
- mkdir quote-scraper
Now, navigate into the new directory you just created:
- cd quote-scraper
Then create a new Python file for our scraper called scraper.py. We’ll place all of our code in this file for this tutorial. You can create this file using the editing software of your choice.
Start out the project by making a very basic scraper that uses Scrapy as its foundation. To do that, you’ll need to create a Python class that subclasses scrapy.Spider, a basic spider class provided by Scrapy. This class will have two attributes that we need to set:
name: just a name for the spider.
start_urls: a list of URLs that you start to crawl from. We’ll start with one URL.
Open the scraper.py file in your text editor and add this code to create the basic spider:
import scrapy


class QuoteSpider(scrapy.Spider):
    # Name used to identify this spider
    name = 'quote-spider'
    # The crawl starts from these URLs
    start_urls = ['https://quotes.toscrape.com']
Let’s break this down line by line:
First, we import scrapy so that we can use the classes that the package provides.
Next, we take the Spider class provided by Scrapy and make a subclass out of it called QuoteSpider. Think of a subclass as a more specialized form of its parent class. The Spider class has methods and behaviors that define how to follow URLs and extract data from the pages it finds, but it doesn’t know where to look or what data to look for. By subclassing it, we can give it that information.
Finally, we name the spider quote-spider and give our scraper a single URL to start from: https://quotes.toscrape.com. If you open that URL in your browser, it will take you to the first of many pages of famous quotations.
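If subclassing is new to you, the toy sketch below shows the idea in plain Python. `BaseSpider`, its `fetch` and `run` methods, and the fake page body are hypothetical stand-ins invented for this example; they are not how Scrapy itself is implemented.

```python
class BaseSpider:
    """Hypothetical parent class: knows how to crawl, not where or what."""

    start_urls = []

    def fetch(self, url):
        # Stand-in for a real download; returns a fake page body.
        return f"page body of {url}"

    def parse(self, page):
        # The parent has no idea what data to look for.
        return []

    def run(self):
        results = []
        for url in self.start_urls:
            results.extend(self.parse(self.fetch(url)))
        return results


class QuoteSpider(BaseSpider):
    # The subclass supplies where to look...
    start_urls = ["https://quotes.toscrape.com"]

    # ...and what to pull out of each page.
    def parse(self, page):
        return [{"page": page}]


print(BaseSpider().run())
print(QuoteSpider().run())
```

The parent class alone produces nothing, while the subclass reuses the same `run` logic with its own starting point and extraction rule.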
Now, test out the scraper. Typically, Python files are run with a command like python path/to/file.py. However, Scrapy comes with its own command line interface to streamline the process of starting a scraper. Start your scraper with the following command:
- scrapy runspider scraper.py
The command will output something like this:
Output
2022-12-02 10:30:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-12-02 10:30:08 [scrapy.extensions.telnet] INFO: Telnet Password: b4d94e3a8d22ede1
2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
...
'scrapy.extensions.logstats.LogStats']
2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
...
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
...
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-02 10:30:08 [scrapy.core.engine] INFO: Spider opened
2022-12-02 10:30:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 10:30:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-02 10:49:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com> (referer: None)
2022-12-02 10:30:08 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-02 10:30:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 226,
...
'start_time': datetime.datetime(2022, 12, 2, 18, 30, 8, 492403)}
2022-12-02 10:30:08 [scrapy.core.engine] INFO: Spider closed (finished)
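That’s a lot of output, so let’s break it down.
The scraper initialized and loaded the extensions and middlewares it needed to handle reading data from URLs.
It used the URL we provided in the start_urls list and grabbed the HTML, just like your web browser would do.
It passed that HTML to the parse method, which we haven’t written yet. Since we never wrote our own parse method, the spider just finishes without extracting any data.
Now let’s pull some data from the page.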
We’ve created a very basic program that pulls down a page, but it doesn’t do any scraping or spidering yet. Let’s give it some data to extract.
When writing a scraper, you will need to look at the HTML source of the page you want to scrape and familiarize yourself with its structure. Here is the source of our page, with the tags that aren’t relevant to our goal removed for readability:
quotes.toscrape.com
<body>
...
    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
        <a href="/author/Thomas-A-Edison">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" />
            <a class="tag" href="/tag/edison/page/1/">edison</a>
            <a class="tag" href="/tag/failure/page/1/">failure</a>
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
        </div>
    </div>
...
</body>
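Scraping this page is a two-step process:
1. First, grab each quote by looking for the parts of the page that contain the data we want.
2. Then, for each quote, pull the data we want out of its HTML tags.
Scrapy grabs data based on selectors that you provide. Selectors are patterns we can use to find one or more elements on a page so we can then work with the data within the element. Scrapy supports either CSS selectors or XPath selectors.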
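To make the idea of a selector concrete without any Scrapy-specific API, the sketch below uses the limited XPath support in Python’s standard `xml.etree.ElementTree` module on a small, well-formed snippet. The snippet and its contents are made up for this example, and ElementTree is only a stand-in here: it is not the engine Scrapy uses and it supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup modeled on the structure of a quote block.
SNIPPET = """
<body>
  <div class="quote">
    <span class="text">An example quote.</span>
    <small class="author">Someone</small>
  </div>
  <div class="quote">
    <span class="text">Another quote.</span>
    <small class="author">Someone Else</small>
  </div>
</body>
""".strip()

root = ET.fromstring(SNIPPET)

quotes = []
# Step 1: one pattern finds every quote block on the page.
# Note: [@class='quote'] compares the whole attribute value, whereas the
# CSS selector .quote matches any element whose class list contains "quote".
for block in root.findall(".//div[@class='quote']"):
    # Step 2: narrower patterns pull individual fields out of each block.
    quotes.append({
        "text": block.find("span[@class='text']").text,
        "author": block.find("small[@class='author']").text,
    })

print(quotes)
```

The same two-level pattern, one selector for the repeating block and further selectors applied inside each block, is what the Scrapy code in this tutorial follows.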
We’ll use CSS selectors for now since CSS is a perfect fit for finding all the quotes on the page. If you look at the HTML, you’ll see that each quote is specified with the class quote. Since we’re looking for a class, we’d use .quote for our CSS selector. The . part of the selector searches the class attribute on elements. All we have to do is create a new method in our class named parse and pass that selector to the css method of the response object, like this:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'

        # Loop over every quote block on the page; nothing is extracted yet
        for quote in response.css(QUOTE_SELECTOR):
            pass
This code grabs all the quotes on the page and loops over them. The loop body is empty for now, so let’s extract the data from those quotes so we can display it.
Another look at the source of the page we’re parsing tells us that the text of each quote is stored within a span with the text class, and the author of the quote in a <small> tag with the author class:
quotes.toscrape.com
...
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span>by <small class="author" itemprop="author">Thomas A. Edison</small>
...
The quote object we’re looping over has its own css method, so we can pass in a selector to locate child elements. Modify your code as follows to locate the text and author of each quote and display them:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'

        for quote in response.css(QUOTE_SELECTOR):
            # Yield one dictionary per quote; Scrapy logs each yielded item
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
            }
Note: The trailing comma after extract_first() isn’t a typo. In Python, a trailing comma in dict objects is valid syntax, and a good way to leave room for adding more items, which we will do later.
You’ll notice two things going on in this code:
We append ::text to our selectors for the quote and author. That’s a pseudo-element (a Scrapy extension to standard CSS selectors) that fetches the text inside of the tag rather than the tag itself.
We call extract_first() on the object returned by quote.css(TEXT_SELECTOR) because we just want the first element that matches the selector. This gives us a string, rather than a list of elements.
Save the file and run the scraper again:
- scrapy runspider scraper.py
This time the output will contain the quotes and their authors:
Output
...
2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}
2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}
2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}
...
Let’s keep expanding on this by adding new selectors for links to pages about the author and tags for the quote. By investigating the HTML for each quote, we find:
The link to the author’s about page is stored in an a tag immediately following the author’s name.
The tags are stored as a collection of a tags, each classed tag, stored within a div element with the tags class.
So, let’s modify the scraper to get this new information:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        # The link that immediately follows the author's name
        ABOUT_SELECTOR = '.author + a::attr("href")'
        # Text of every tag link inside the tags container
        TAGS_SELECTOR = '.tags > .tag::text'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' +
                         quote.css(ABOUT_SELECTOR).extract_first(),
                # extract() returns a list with every match
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }
Save your changes and run the scraper again:
- scrapy runspider scraper.py
Now the output will contain the new data:
Output
2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'about': 'https://quotes.toscrape.com/author/Albert-Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'about': 'https://quotes.toscrape.com/author/Jane-Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'about': 'https://quotes.toscrape.com/author/Marilyn-Monroe', 'tags': ['be-yourself', 'inspirational']}
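One thing to keep in mind with the spider code above: extract_first() returns None when a selector matches nothing, and adding None to a string raises a TypeError. On this test site every quote has an about link, so the problem does not show up, but on messier pages a small guard can help. The helper below is a hypothetical sketch (the name `build_about_url` and its default base URL are assumptions for this example), not part of Scrapy.

```python
def build_about_url(href, base="https://quotes.toscrape.com"):
    """Return an absolute about-page URL, or None if no link was found."""
    if href is None:
        # Nothing matched the selector, so there is no URL to build.
        return None
    return base + href


print(build_about_url("/author/Albert-Einstein"))
print(build_about_url(None))
```

Inside the spider, you could pass the result of quote.css(ABOUT_SELECTOR).extract_first() to a helper like this instead of concatenating directly.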
Now let’s turn this scraper into a spider that follows links.
You’ve successfully extracted data from that initial page, but we’re not progressing past it to see the rest of the results. The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.
You’ll notice that each page has a Next link (Next →) that leads to the next page of results. Here’s the HTML for that:
quotes.toscrape.com
...
<nav>
    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
        </li>
    </ul>
</nav>
...
In the source, you will find an li tag with the class of next, and inside that tag, there’s an a tag with a link to the next page. All we have to do is tell the scraper to follow that link if it exists.
Modify your code as follows:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'
        # The href of the link inside the "next" pager item
        NEXT_SELECTOR = '.next a::attr("href")'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' +
                         quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

        # After all quotes on this page, follow the next-page link if any
        next_page = response.css(NEXT_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
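First, we define a selector for the “next page” link, extract the first match, and check if it exists. If it does, response.urljoin turns the relative link into an absolute URL, and the scrapy.Request we yield tells Scrapy to fetch that page next. Because we don’t specify a callback, Scrapy passes the new response to parse again.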
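The link in the pager is relative (/page/2/), so it has to be combined with the address of the current page before it can be requested. Scrapy’s response.urljoin is documented as a wrapper around the standard library’s urllib.parse.urljoin, so the behavior can be explored without running a spider. The `current_page` value below is an assumption chosen for this example.

```python
from urllib.parse import urljoin

# Assumed address of the page currently being parsed.
current_page = "https://quotes.toscrape.com/page/1/"

# A root-relative link replaces the whole path of the current page.
print(urljoin(current_page, "/page/2/"))
# → https://quotes.toscrape.com/page/2/

# A link relative to the current directory is resolved against it.
print(urljoin(current_page, "../2/"))
# → https://quotes.toscrape.com/page/2/

# An absolute link is returned unchanged.
print(urljoin(current_page, "https://example.com/other"))
# → https://example.com/other
```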
This means that once we go to the next page, we’ll look for a link to the next page there, and on that page we’ll look for a link to the next page, and so on, until we don’t find a link for the next page. This is the key piece of web scraping: finding and following links. In this example, it’s very linear; one page has a link to the next page until we’ve hit the last page. But you could follow links to tags, or other search results, or any other URL you’d like.
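The follow-the-next-link loop can be sketched without any network access. In the example below, `FAKE_SITE` is a made-up in-memory stand-in for a paginated site and `crawl` is a hypothetical helper; Scrapy schedules requests for you rather than looping like this, so treat it only as an illustration of the control flow.

```python
# Hypothetical three-page site: each page lists its quotes and its next link.
FAKE_SITE = {
    "/page/1/": {"quotes": ["q1", "q2"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["q3"], "next": "/page/3/"},
    "/page/3/": {"quotes": ["q4"], "next": None},  # last page: no next link
}


def crawl(start):
    """Collect quotes page by page until a page has no next link."""
    url = start
    visited = []
    quotes = []
    # Also stop if a page links back to one we have already seen.
    while url is not None and url not in visited:
        page = FAKE_SITE[url]
        visited.append(url)
        quotes.extend(page["quotes"])
        url = page["next"]
    return visited, quotes


visited, quotes = crawl("/page/1/")
print(visited)
print(quotes)
```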
Now, if you save your code and run the spider again, you’ll see that it doesn’t just stop once it iterates through the first page of quotes. It keeps on going through all 100 quotes on all 10 pages. In the grand scheme of things it’s not a huge chunk of data, but now you know the process by which you automatically find new pages to scrape.
Here’s our completed code for this tutorial:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'
        NEXT_SELECTOR = '.next a::attr("href")'

        # Extract one item per quote on the current page
        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' +
                         quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

        # Follow the link to the next page, if there is one
        next_page = response.css(NEXT_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
            )
In this tutorial you built a fully functional spider that extracts data from web pages in less than thirty lines of code. That’s a great start, and there are plenty of ways to extend this spider, such as following tag links or exporting the data to another format. If you need more information on Scrapy, check out Scrapy’s official docs. For more information on working with data from the web, see our tutorial on “How To Scrape Web Pages with Beautiful Soup and Python 3”.
had to remove the ‘a’ from this line in order to extract the name :
NAME_SELECTOR = ‘h1 a ::text’
keeping this ‘a’ will extract the item number
Sorry, the website from the guide is not working, it returns <403 http://brickset.com/sets/year-2016>: HTTP status code is not handled or not allowed. How can I solve this ?
Hi Brian. the Tutorials is very good. but, i get empty response when i use this sample to test. then i found than the website address modified to https protocol.and i think is the reason that i got empty response. or another reason? i don’t know…
This is an excellent tutorial on crawling web pages with Scrapy and Python! The explanation of Scrapy’s architecture and the step-by-step instructions made it easy for me to follow along and build my own web crawler. While Scrapy is a powerful framework, I recently came across Crawlbase, a crawler that integrates seamlessly with Python. It offers a user-friendly interface and advanced features that streamline the crawling process. I’ve been impressed with its performance and flexibility. I highly recommend checking out Crawlbase as an alternative option for web crawling. Thank you for sharing this valuable resource!
Instead of doing:
Just use:
Thank me later.
I think all the people following this tutorial spiked the target site’s bandwidth usage and pissed the owner off, probably causing him to block scrapy somehow.
I was good up until the crawling part. I’m seeing this in my output.
This leads me to believe that it is not extracting the next URL correctly, but I checked and https://brickset.com/sets/year-2016/page-2 is valid. Noggin scratcher.
How to make a Chinese translation version for this article?
Thanks, Justin. It’s a great tutorial for beginner. But I just stuck in the step of running the scraper.py. It seemed different from your outcome starting from the ‘DEBUG: Redirecting …’. Here is the details below:
I checked the url is still working, but I don’t know why it got redirected.
OK, the Scraper wont run if you forget to add import scrapy to the first line of code.
You did so in the first and last Scraper.py but for the 3 out of 5 code snippets you don’t include import scrapy line. (you do for the 1st and the last code), which confused me & my machine a bit, until I noticed.