This is my code snippet
import scrapy
class QuotesSpider(scrapy.Spider):
name = 'quotes'
custom_settings = {
'CONCURRENT_REQUESTS': 25,
'CONCURRENT_REQUESTS_PER_DOMAIN': 100,
'DOWNLOAD_DELAY': 0
}
f = open("list.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()
def parse(self, response):
for quote in response.xpath("//div[@class='border block']"):
urlgem = quote.xpath(".//div[@class='col-md-4 pull-right']/a/@href").extract()
if urlgem:
yield {
'text': urlgem,
}
Using Terminal I execute above code using command
scrapy runspider quotes_spider.py -o quotes.json
list.txt contains 50 URL of same domain separated by lines
As per my understanding the code should put 25 requests to the domain for 25 URL (from a list of 50 URL) and should complete in 2-3 seconds time span and generate file name quotes.json
The output in file quotes.json is coming out as expected but SCRAPY is not performing the task concurrently instead it is fetching URL one by one and takes approximately 55-60 seconds to complete.
Please help!!!
This textbox defaults to using Markdown to format your answer.
You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!
These answers are provided by our Community. If you find them useful, show some love by clicking the heart. If you run into issues leave a comment, or add your own answer to help others.
Join our DigitalOcean community of over a million developers for free! Get help and share knowledge in Q&A, subscribe to topics of interest, and get courses and tools that will help you grow as a developer and scale your project or business.
Hi there @kartikmittal2006,
Scrapy is great as it is built on top of the
Twisted
library which is an asynchronous networking library that allows you to write non-blocking (asynchronous) code for concurrency, which improves the performance a lot.However, in order to achieve that, according to the documentation you should run Scrapy as a script as described here:
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
Basically Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor. I could suggest trying the code example from the documentation link above.
Another thing that you could try doing is using
asyncio
. The main downside here is thatasyncio
support in Scrapy is experimental. For more information, you could take a look at the official documentation here:http://doc.scrapy.org/en/latest/topics/asyncio.html
Hope that this helps! Regards, Bobby