I was following the tutorial 'How To Crawl A Web Page with Scrapy and Python 3', and when I got to the part where we actually gather the data from the website, I get the same output as when the for loop just had pass in it. Here is my code:
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"  # name of the spider
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }
But here is my output:
2021-03-05 02:09:45 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-03-05 02:09:45 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:18:16) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1j 16 Feb 2021), cryptography 3.4.6, Platform Windows-10-10.0.18362-SP0
2021-03-05 02:09:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-05 02:09:45 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2021-03-05 02:09:45 [scrapy.extensions.telnet] INFO: Telnet Password: 2dbd3797bff9e0a2
2021-03-05 02:09:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-03-05 02:09:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-05 02:09:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-05 02:09:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-05 02:09:46 [scrapy.core.engine] INFO: Spider opened
2021-03-05 02:09:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-03-05 02:09:46 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-05 02:09:46 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://brickset.com/sets/year-2016> (referer: None)
2021-03-05 02:09:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://brickset.com/sets/year-2016>: HTTP status code is not handled or not allowed
2021-03-05 02:09:46 [scrapy.core.engine] INFO: Closing spider (finished)
2021-03-05 02:09:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 226,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2227,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'elapsed_time_seconds': 0.307499,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 3, 5, 7, 9, 46, 875575),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/403': 1,
'log_count/DEBUG': 1,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 3, 5, 7, 9, 46, 568076)}
2021-03-05 02:09:46 [scrapy.core.engine] INFO: Spider closed (finished)
Please help ;-;
Heya,
Your spider is not able to fetch data from the webpage because it’s getting a 403 HTTP status code. This code means that the server understood the request but refuses to authorize it. This is often due to the site’s rules for web scraping, which could be blocking your spider.
You can overcome this issue by using a custom user agent in your spider. Websites often block requests from unknown user agents or web scrapers, so by impersonating a web browser, you might be able to bypass these restrictions.
Here’s an example of how you can modify your settings.py file to include a user agent:
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
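If you'd rather not change the project-wide settings, Scrapy also lets a spider override settings just for itself through the `custom_settings` class attribute. A minimal sketch (the user-agent string here is only an example browser string):

```python
# Add this class attribute inside your BrickSetSpider class;
# Scrapy applies these settings only while this spider runs.
custom_settings = {
    'USER_AGENT': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/58.0.3029.110 Safari/537.36'
    ),
}
```

This keeps the workaround scoped to the one spider that needs it instead of affecting every spider in the project.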
If you still get a 403 error after trying this, you can use middlewares to handle these cases and retry the request with a different user agent. For this, you will need to set up a list of user agents and rotate between them.
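As a sketch of that rotation approach, here is a minimal downloader middleware that picks a random user agent for every outgoing request. The class name and the agent strings are placeholders you would adapt to your project:

```python
import random

# Hypothetical pool of browser user-agent strings; extend with real ones.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent header
    on each outgoing request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy process the request normally
```

You would then enable it in settings.py via `DOWNLOADER_MIDDLEWARES` (the module path and priority below are examples): `DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RotateUserAgentMiddleware': 400}`.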
Bear in mind that you should respect the website’s robots.txt file and scraping policies to avoid any legal issues.
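On that note, Scrapy can enforce robots.txt for you; in a freshly generated project this is already on by default, but it is worth setting explicitly:

```python
# settings.py
ROBOTSTXT_OBEY = True  # Scrapy fetches and respects the site's robots.txt
```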
If the issue persists, the website might be using more sophisticated anti-scraping measures. In such cases, you may need to use more complex techniques, such as rotating IP addresses using proxies, or even switching to a headless browser like Selenium, which can mimic user behavior more convincingly.
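For the proxy route, the same middleware hook works: Scrapy's built-in HttpProxyMiddleware honours a proxy URL placed in `request.meta['proxy']`, so a small middleware can rotate through a pool. A minimal sketch (the proxy URLs are placeholders; you would supply working proxies):

```python
import random

# Hypothetical proxy pool; replace with real, working proxy URLs.
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]


class RotateProxyMiddleware:
    """Downloader middleware that routes each request through a
    randomly chosen proxy by setting request.meta['proxy']."""

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXIES)
        return None  # let Scrapy continue processing the request
```

Like the user-agent middleware, this would be enabled through `DOWNLOADER_MIDDLEWARES` in settings.py.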