I was following the tutorial 'How To Crawl A Web Page with Scrapy and Python 3', and when I got to the part where we actually gather the data from the website, I get the same output as when the for loop just had pass in it. Here is my code:
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"  # name of the spider
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }
But here is my output:
2021-03-05 02:09:45 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-03-05 02:09:45 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:18:16) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1j 16 Feb 2021), cryptography 3.4.6, Platform Windows-10-10.0.18362-SP0
2021-03-05 02:09:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-05 02:09:45 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2021-03-05 02:09:45 [scrapy.extensions.telnet] INFO: Telnet Password: 2dbd3797bff9e0a2
2021-03-05 02:09:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-03-05 02:09:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-05 02:09:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-05 02:09:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-05 02:09:46 [scrapy.core.engine] INFO: Spider opened
2021-03-05 02:09:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-03-05 02:09:46 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-05 02:09:46 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://brickset.com/sets/year-2016> (referer: None)
2021-03-05 02:09:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://brickset.com/sets/year-2016>: HTTP status code is not handled or not allowed
2021-03-05 02:09:46 [scrapy.core.engine] INFO: Closing spider (finished)
2021-03-05 02:09:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 226,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2227,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'elapsed_time_seconds': 0.307499,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 3, 5, 7, 9, 46, 875575),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/403': 1,
'log_count/DEBUG': 1,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 3, 5, 7, 9, 46, 568076)}
2021-03-05 02:09:46 [scrapy.core.engine] INFO: Spider closed (finished)
Please help ;-;
Heya,
Your spider is not able to fetch data from the webpage because it’s getting a 403 HTTP status code. This code means that the server understood the request but refuses to authorize it. This is often due to the site’s rules for web scraping, which could be blocking your spider.
You can overcome this issue by using a custom user agent in your spider. Websites often block requests from unknown user agents or web scrapers, so by impersonating a web browser, you might be able to bypass these restrictions.
Here’s an example of how you can modify your settings.py file to include a user agent:
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
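If you'd rather not change the project-wide settings, Scrapy also lets a spider override settings just for itself through the `custom_settings` class attribute. A minimal sketch (the user-agent string here is only an example browser string):

```python
# Add this class attribute inside your BrickSetSpider class;
# Scrapy applies these settings only while this spider runs.
custom_settings = {
    'USER_AGENT': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/58.0.3029.110 Safari/537.36'
    ),
}
```

This keeps the workaround scoped to the one spider that needs it instead of affecting every spider in the project.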
If you still get a 403 error after trying this, you can use middlewares to handle these cases and retry the request with a different user agent. For this, you will need to set up a list of user agents and rotate between them.
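As a sketch of that rotation approach, here is a minimal downloader middleware that picks a random user agent for every outgoing request. The class name and the agent strings are placeholders you would adapt to your project:

```python
import random

# Hypothetical pool of browser user-agent strings; extend with real ones.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent header
    on each outgoing request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy process the request normally
```

You would then enable it in settings.py via `DOWNLOADER_MIDDLEWARES` (the module path and priority below are examples): `DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RotateUserAgentMiddleware': 400}`.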
Bear in mind that you should respect the website’s robots.txt file and scraping policies to avoid any legal issues.
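On that note, Scrapy can enforce robots.txt for you; in a freshly generated project this is already on by default, but it is worth setting explicitly:

```python
# settings.py
ROBOTSTXT_OBEY = True  # Scrapy fetches and respects the site's robots.txt
```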
If the issue persists, the website might be using more sophisticated anti-scraping measures. In such cases, you may need to use more complex techniques, such as rotating IP addresses using proxies, or even switching to a headless browser like Selenium, which can mimic user behavior more convincingly.
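For the proxy route, the same middleware hook works: Scrapy's built-in HttpProxyMiddleware honours a proxy URL placed in `request.meta['proxy']`, so a small middleware can rotate through a pool. A minimal sketch (the proxy URLs are placeholders; you would supply working proxies):

```python
import random

# Hypothetical proxy pool; replace with real, working proxy URLs.
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]


class RotateProxyMiddleware:
    """Downloader middleware that routes each request through a
    randomly chosen proxy by setting request.meta['proxy']."""

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXIES)
        return None  # let Scrapy continue processing the request
```

Like the user-agent middleware, this would be enabled through `DOWNLOADER_MIDDLEWARES` in settings.py.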