Web scraping is a programming-based technique for extracting relevant information from websites and storing it on the local system for further use.
In modern times, web scraping has a lot of applications in the fields of Data Science and Marketing. Web scrapers across the world gather tons of information for either personal or professional use. Moreover, present-day tech giants rely on such web scraping methods to fulfill the needs of their consumer base.
In this article, we will scrape product information from Amazon's website. Specifically, we will consider a “PlayStation 4” as the target product.
Note: The website layout and tags might change over time. Therefore, the reader is advised to understand the concept of scraping so that self-implementation does not become an issue.
If you want to build a service using web scraping, you may run into IP blocking and need proxy management. It’s good to know the underlying technologies and processes, but for bulk scraping, it’s better to work with scraping API providers like Zenscrape. They even take care of Ajax requests and JavaScript for dynamic pages. One of their popular offerings is a residential proxy service.
In order to make a soup, we need proper ingredients. Similarly, our fresh web scraper requires certain components.
Many websites have protocols for blocking robots from accessing data. Therefore, to extract data with a script, we need to create a User-Agent. A User-Agent is basically a string that tells the server the type of host sending the request.
This website contains tons of user agents for the reader to choose from. Following is an example of a User-Agent within the header value.
There is an extra field in HEADERS called “Accept-Language”, which asks the server to serve the page in US English where available.
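As a sketch, the HEADERS dictionary might look like the following. The User-Agent string below is only an illustration; copy a current one from your own browser or a user-agent listing site.

```python
# Sample request headers; the User-Agent value is a placeholder example,
# not a string required by Amazon -- use a current one from your browser.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    # Ask the server for US-English content where available.
    "Accept-Language": "en-US, en;q=0.5",
}
```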
A webpage is accessed by its URL (Uniform Resource Locator). With the help of the URL, we will send the request to the webpage for accessing its data.
The requested webpage features an Amazon product. Hence, our Python script focuses on extracting product details like “The Name of the Product”, “The Current Price” and so on.
Note: The request to the URL is sent via the "requests" library. If you get a “No module named requests” error, install the library with "pip install requests".
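A minimal request might be sketched as follows. The URL below is a hypothetical placeholder, and the User-Agent string is an illustration; substitute a real product URL and a current browser User-Agent.

```python
import requests

# Placeholder headers; the User-Agent is an example string, not a real one.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US, en;q=0.5",
}

# Hypothetical product URL -- replace with a real Amazon product page.
URL = "https://www.amazon.com/dp/B07XXXXXXX"

def fetch_page(url):
    """Send a GET request with browser-like headers and return the response."""
    return requests.get(url, headers=HEADERS)

# webpage = fetch_page(URL)  # uncomment to send the request
```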
The webpage variable holds the response received from the website. We pass the content of the response and the type of parser to the Beautiful Soup function.
lxml is a high-speed parser that Beautiful Soup uses to break the HTML page down into complex Python objects. Four kinds of Python objects are commonly obtained: Tag, NavigableString, BeautifulSoup, and Comment.
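A small sketch of soup creation and the object kinds, using a toy HTML snippet in place of the live page. The built-in "html.parser" is used here so the example runs without extra installs; swap in "lxml" for speed if it is available.

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for the fetched Amazon page.
html_doc = "<html><body><span id='productTitle'> PlayStation 4 </span></body></html>"

# Pass the HTML content and a parser name; "lxml" is faster if installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup))        # the BeautifulSoup object for the whole document
tag = soup.find("span")
print(type(tag))         # a Tag object
print(type(tag.string))  # a NavigableString holding the tag's text
```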
One of the most hectic parts of this project is unearthing the ids and tags that store the relevant information. As mentioned before, we use web browsers to accomplish this task.
We open the webpage in the browser, right-click the relevant element, and select Inspect.
As a result, a panel opens on the right-hand side of the screen as shown in the following figure.
Once we obtain the tag values, extracting information becomes a piece of cake. However, we must learn certain functions defined for Beautiful Soup Object.
Using the find() function, which searches for a specific tag with specific attributes, we locate the Tag object containing the title of the product.
Then we take out the NavigableString object, and finally we strip the extra spaces and convert the object to a string value.
We can check the type of each variable using the type() function.
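The steps above can be sketched as follows, using a toy snippet in place of the live page. The id "productTitle" is an assumption taken from inspecting Amazon's HTML and may change over time.

```python
from bs4 import BeautifulSoup

# Toy snippet standing in for the product page.
html_doc = '<span id="productTitle">  Sony PlayStation 4 Pro 1TB  </span>'
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.find("span", attrs={"id": "productTitle"})  # Tag object
title_value = title.string                               # NavigableString inside the tag
title_string = title_value.strip()                       # plain str with spaces removed

print(type(title), type(title_value), type(title_string))
```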
In the same way, we need to figure out the tag values for other product details like “Price of the Product” and “Consumer Ratings”.
The following Python script displays details for a product, such as its title, price, rating, and number of reviews:
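A sketch of that extraction, again against a toy snippet rather than the live page. All the ids and class names here (priceblock_ourprice, a-icon-alt, acrCustomerReviewText) are assumptions gathered by inspecting the page and may differ on the current site.

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking the assumed structure of an Amazon product page.
html_doc = """
<span id="productTitle"> PlayStation 4 Slim 1TB Console </span>
<span id="priceblock_ourprice">$299.99</span>
<span class="a-icon-alt">4.7 out of 5 stars</span>
<span id="acrCustomerReviewText">11,300 ratings</span>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.find("span", attrs={"id": "productTitle"}).string.strip()
price = soup.find("span", attrs={"id": "priceblock_ourprice"}).string.strip()
rating = soup.find("span", attrs={"class": "a-icon-alt"}).string.strip()
reviews = soup.find("span", attrs={"id": "acrCustomerReviewText"}).string.strip()

print("Product Title:", title)
print("Price:", price)
print("Rating:", rating)
print("Number of Ratings:", reviews)
```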
Now that we know how to extract information from a single Amazon webpage, we can apply the same script to multiple webpages by simply changing the URL.
Moreover, let us now attempt to fetch links from an Amazon search results webpage.
Previously, we obtained information about a single PlayStation 4. It would be useful to extract such information for multiple PlayStation listings so we can compare their prices and ratings.
We can find each link enclosed in an <a></a> tag, as the value of the href attribute.
Instead of fetching a single link, we can extract all the similar links using the find_all() function.
The find_all() function returns an iterable containing multiple Tag objects. We then pick each Tag object and pluck out the link stored as the value of its href attribute.
We store the links inside a list so that we can iterate over each link and extract product details.
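A sketch of that loop over a toy search-results snippet. The class name "s-no-outline" is an assumption taken from inspecting Amazon's search page and may change.

```python
from bs4 import BeautifulSoup

# Toy snippet standing in for an Amazon search-results page.
html_doc = """
<a class="a-link-normal s-no-outline" href="/PlayStation-4-Slim-1TB/dp/B071CV8CG2">PS4 Slim</a>
<a class="a-link-normal s-no-outline" href="/PlayStation-4-Pro-1TB/dp/B01LOP8EZC">PS4 Pro</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns every matching Tag; pull the href value out of each one.
links = soup.find_all("a", attrs={"class": "s-no-outline"})
links_list = [link.get("href") for link in links]
print(links_list)
```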
We reuse the functions created earlier for extracting product information. Although producing multiple soups makes the code slow, it enables a proper comparison of prices across multiple models and deals.
Below is the complete working Python script for listing multiple PlayStation deals.
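The full flow might be sketched as below. Every tag id and class name here is an assumption gathered by inspecting the page, the User-Agent is a placeholder, and Amazon may change its layout at any time, so verify each selector in your own browser first.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder headers; replace the User-Agent with a current browser string.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US, en;q=0.5",
}

def get_title(soup):
    """Return the stripped product title, or '' if the tag is missing."""
    try:
        return soup.find("span", attrs={"id": "productTitle"}).text.strip()
    except AttributeError:
        return ""

def get_price(soup):
    """Return the listed price, or '' if the tag is missing."""
    try:
        return soup.find("span", attrs={"id": "priceblock_ourprice"}).text.strip()
    except AttributeError:
        return ""

def get_rating(soup):
    """Return the star rating text, or '' if the tag is missing."""
    try:
        return soup.find("span", attrs={"class": "a-icon-alt"}).text.strip()
    except AttributeError:
        return ""

if __name__ == "__main__":
    # Fetch the search-results page and collect the product links.
    search = requests.get("https://www.amazon.com/s?k=playstation+4", headers=HEADERS)
    results = BeautifulSoup(search.content, "html.parser")
    hrefs = [a.get("href") for a in results.find_all("a", attrs={"class": "s-no-outline"})]

    # Make a fresh soup for each product page and print its details.
    for href in hrefs:
        page = requests.get("https://www.amazon.com" + href, headers=HEADERS)
        product = BeautifulSoup(page.content, "html.parser")
        print(get_title(product), "|", get_price(product), "|", get_rating(product))
```

Using try/except around each lookup keeps the script running when a listing lacks one of the tags, which is common across Amazon product pages.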
The above Python script is not restricted to the list of PlayStations. We can switch the URL to some other link to an Amazon search result, like headphones or earphones.
As mentioned before, the layout and tags of an HTML page may change over time, rendering the above code useless in that event. The reader should, however, take away the concept of web scraping and the techniques learnt in this article.
There can be various advantages of Web Scraping ranging from “comparing product prices” to “analyzing consumer tendencies”. Since the internet is accessible to everyone and Python is a very easy language, anyone can perform Web Scraping to meet their needs.
We hope this article was easy to understand. Feel free to comment below with any queries or feedback. Until then, Happy Scraping!
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
While we believe that this content benefits our community, we have not yet thoroughly reviewed it. If you have any suggestions for improvements, please let us know by clicking the “report an issue“ button at the bottom of the tutorial.