Question

Scrapy Concurrent Requests

Posted March 1, 2020 1.4k views
Python

This is my code snippet:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    custom_settings = {
        'CONCURRENT_REQUESTS': 25,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 100,
        'DOWNLOAD_DELAY': 0
    }

    # read the start URLs from list.txt, one URL per line
    with open("list.txt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        for quote in response.xpath("//div[@class='border block']"):
            urlgem = quote.xpath(".//div[@class='col-md-4 pull-right']/a/@href").extract()
            if urlgem:
                yield {
                    'text': urlgem,
                }

I execute the above code from the terminal using the command

scrapy runspider quotes_spider.py -o quotes.json

list.txt contains 50 URLs from the same domain, one per line.

As per my understanding, the code should send 25 concurrent requests to the domain (covering 25 of the 50 URLs at a time), complete in a 2-3 second time span, and generate a file named quotes.json.

The output in quotes.json comes out as expected, but Scrapy is not performing the task concurrently; instead it fetches the URLs one by one and takes approximately 55-60 seconds to complete.

Please help!!!

1 comment
  • Another script that I wrote using BeautifulSoup, performing the same task as above:

    from multiprocessing import Pool
    import requests
    from bs4 import BeautifulSoup
    
    base_url = 'https://SOMEWEBSITE.com/search_by='
    all_urls = list()
    
    def generate_urls():
        for i in range(480882, 480983):
            all_urls.append(base_url + str(i))
    
    def scrape(url):
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'lxml')
        pageText = soup.find_all('div', {'class': 'border block'})
    
        for bidBox in pageText:
            bidNumber = bidBox.find('p', {'class': 'bid_no pull-left'})
            print(bidNumber.text)
            bidStatus = bidBox.find('span', {'class': 'text-success'})
            print(bidStatus.text)
    
    generate_urls()
    
    # 50 worker processes fetch the URLs in parallel; map() blocks until all of them finish
    p = Pool(50)
    p.map(scrape, all_urls)
    p.close()
    p.join()
    

    The above code is executed in the terminal using the command

    python3 quotes.py
    

    The above also produces the same result: the output is as expected, but the execution time is still approximately 1 URL per second.

    Please help!!!

1 answer

Hi there @kartikmittal2006,

Scrapy is great because it is built on top of Twisted, an asynchronous networking library that allows you to write non-blocking code for concurrency, which improves performance a lot.

However, in order to achieve that, according to the documentation, you should run Scrapy from a script as described here:

http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

Since Scrapy runs inside the Twisted reactor, I would suggest trying the code example from the documentation link above.

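A minimal skeleton of that approach, adapted from the example in the documentation (the spider body is only a placeholder here), would look something like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # your spider definition (name, custom_settings, start_urls, parse) goes here
    ...

# CrawlerProcess starts the Twisted reactor for you and blocks until the crawl is finished
process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'quotes.json',
})
process.crawl(MySpider)
process.start()
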
Another thing that you could try is using asyncio. The main downside here is that asyncio support in Scrapy is still experimental. For more information, you could take a look at the official documentation here:

http://doc.scrapy.org/en/latest/topics/asyncio.html

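If you do try the asyncio route, note that, as far as I know, it requires Scrapy 2.0 or newer, and the asyncio reactor is then selected via the TWISTED_REACTOR setting. A rough sketch:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # same spider definition as in the sketch above
    name = 'quotes'
    ...

process = CrawlerProcess(settings={
    # sketch only (assumes Scrapy 2.0+): run the crawl on top of asyncio's event loop
    # instead of the default Twisted reactor
    'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'quotes.json',
})
process.crawl(MySpider)
process.start()
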
Hope that this helps!
Regards,
Bobby

  • Hi Bobby

    Thanks for the response. I modified my code to run as a script, as shown below:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        base_url = 'https://<SOME_WEBSITE>'
        custom_settings = {
            'CONCURRENT_REQUESTS': 25,
            'CONCURRENT_REQUESTS_PER_DOMAIN': 50,
            'CONCURRENT_REQUESTS_PER_IP': 50,
            'RANDOMIZE_DOWNLOAD_DELAY': 0,
            'DOWNLOAD_DELAY': 0
        }
    
        start_urls = ["https://<SOME_WEBSITE>page_no=%d" % i for i in range(4100, 4150)]
    
        def parse(self, response):
            for quote in response.xpath("//div[@class='border block']"):
                urlgem = quote.xpath(".//div[@class='col-md-4 pull-right']/a/@href").extract()
                if urlgem:
                    yield {
                        'text': urlgem,
                    }
    
    process = CrawlerProcess(settings={
        'FEED_FORMAT': 'json',
        'FEED_URI': 'items.json'
    })
    
    process.crawl(QuotesSpider)
    process.start()
    

    The above code is then executed using the command

    python quotes_spider2.py
    

    Unexpectedly, the code took even more time to run than before. The final log is below:

    {'downloader/exception_count': 1,
     'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
     'downloader/request_bytes': 26282,
     'downloader/request_count': 53,
     'downloader/request_method_count/GET': 53,
     'downloader/response_bytes': 817202,
     'downloader/response_count': 52,
     'downloader/response_status_count/200': 50,
     'downloader/response_status_count/500': 2,
     'elapsed_time_seconds': 131.194052,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2020, 3, 5, 14, 14, 49, 164314),
     'item_scraped_count': 459,
     'log_count/DEBUG': 512,
     'log_count/INFO': 13,
     'memusage/max': 69451776,
     'memusage/startup': 44838912,
     'response_received_count': 50,
     'retry/count': 3,
     'retry/reason_count/500 Internal Server Error': 2,
     'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
     'scheduler/dequeued': 53,
     'scheduler/dequeued/memory': 53,
     'scheduler/enqueued': 53,
     'scheduler/enqueued/memory': 53,
     'start_time': datetime.datetime(2020, 3, 5, 14, 12, 37, 970262)}
    2020-03-05 19:44:49 [scrapy.core.engine] INFO: Spider closed (finished)
    

    Please suggest if I have done something wrong.
