This is my code snippet:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    custom_settings = {
        'CONCURRENT_REQUESTS': 25,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 100,
        'DOWNLOAD_DELAY': 0
    }

    f = open("list.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        for quote in response.xpath("//div[@class='border block']"):
            urlgem = quote.xpath(".//div[@class='col-md-4 pull-right']/a/@href").extract()
            if urlgem:
                yield {
                    'text': urlgem,
                }
```
Using the terminal, I execute the above code with the command:

```
scrapy runspider quotes_spider.py -o quotes.json
```

list.txt contains 50 URLs from the same domain, one per line.

As per my understanding, the code should issue 25 concurrent requests to the domain (for 25 URLs out of the list of 50), complete in a 2-3 second time span, and generate a file named quotes.json.

The output in quotes.json comes out as expected, but Scrapy is not performing the task concurrently; instead it fetches the URLs one by one and takes approximately 55-60 seconds to complete.
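To spell out the arithmetic behind my expectation (a rough back-of-envelope sketch only; the roughly-one-second-per-batch figure is my assumption, not a measurement):

```python
import math

total_urls = 50
concurrency = 25  # CONCURRENT_REQUESTS from the spider above

# With 25 requests in flight at once, 50 URLs should need 2 waves.
batches = math.ceil(total_urls / concurrency)
print(batches)  # → 2

# If each wave of responses came back in roughly a second,
# the whole crawl would finish in about 2-3 seconds,
# not the 55-60 seconds I am seeing.
```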
Please help!!!
Hi there @kartikmittal2006,

Scrapy is great because it is built on top of the Twisted library, an asynchronous networking framework that lets you write non-blocking (asynchronous) code for concurrency, which improves performance a lot.

However, to take full advantage of that, the documentation suggests running Scrapy from a script, as described here:

http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

Since Scrapy sits on top of Twisted, the crawl has to run inside the Twisted reactor, so I would suggest trying the code example from the documentation link above.
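As a sketch of that documented pattern (assuming your spider class lives in quotes_spider.py; the runner filename run_quotes.py is made up here), it could look something like:

```python
# run_quotes.py -- hypothetical standalone runner based on the
# "run Scrapy from a script" documentation linked above.
from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider  # assumes the spider module is importable

process = CrawlerProcess(settings={
    # Write scraped items to quotes.json, like the -o flag did
    # (the FEEDS setting requires Scrapy 2.1 or newer).
    "FEEDS": {"quotes.json": {"format": "json"}},
})
process.crawl(QuotesSpider)
process.start()  # starts the Twisted reactor; blocks until the crawl finishes
```

Your spider's own custom_settings still apply; the dict passed to CrawlerProcess only supplies project-level settings such as the output feed.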
Another thing you could try is asyncio. The main downside here is that asyncio support in Scrapy is experimental. For more information, you can take a look at the official documentation:

http://doc.scrapy.org/en/latest/topics/asyncio.html

Hope that this helps! Regards, Bobby
Scrapy, by default, respects robots.txt, and its AutoThrottle extension (when enabled) adds delays between requests to avoid overwhelming the target server. Both can slow down a crawl.

You can try the following adjustments to improve Scrapy's performance:

ROBOTSTXT_OBEY: Scrapy might be slowing down because it is obeying the robots.txt file of the website you're scraping. Try turning this off:
```python
custom_settings = {
    'CONCURRENT_REQUESTS': 25,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 25,  # This should not be higher than CONCURRENT_REQUESTS
    'DOWNLOAD_DELAY': 0,
    'ROBOTSTXT_OBEY': False,
    'AUTOTHROTTLE_ENABLED': False,  # Disable AutoThrottle if you want to speed things up
}
```
Scrapy provides detailed logs that can tell you whether certain limits (such as robots.txt or request delays) are affecting the concurrency. Run with a more verbose log level (for example, `scrapy runspider quotes_spider.py -o quotes.json -L DEBUG`) and check the logs carefully for clues about what is throttling the requests.
Sometimes the network or server being scraped can affect the speed. If the target server is slow or heavily protected by rate-limiting, this might limit Scrapy’s performance despite high concurrency settings.
CONCURRENT_REQUESTS: limits the total number of requests Scrapy can handle at once.

CONCURRENT_REQUESTS_PER_DOMAIN: limits the number of requests sent to the same domain at once.

DOWNLOAD_DELAY: if set, adds a delay between requests, which slows down the process.

Make sure CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN are balanced: when you are scraping only one domain, the lower of the two values is the effective ceiling, so setting one far above the other achieves nothing.
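To illustrate that last point in plain Python (this helper is not part of the Scrapy API; the function name is made up for the example), the effective ceiling for a single-domain crawl is simply the smaller of the two settings:

```python
def effective_concurrency(concurrent_requests, concurrent_requests_per_domain):
    """For a crawl that only hits one domain, the lower limit wins."""
    return min(concurrent_requests, concurrent_requests_per_domain)

# The spider in the question sets 25 total and 100 per domain, so at
# most 25 requests can ever be in flight -- the per-domain 100 is moot.
print(effective_concurrency(25, 100))  # → 25
```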
Try lowering the CONCURRENT_REQUESTS to see if Scrapy’s behavior changes. Sometimes overly aggressive concurrency settings can lead to slower behavior due to server overload protection mechanisms.
Here’s a refined version of your custom_settings:
```python
custom_settings = {
    'CONCURRENT_REQUESTS': 25,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 25,  # Keep the same to balance domain requests
    'DOWNLOAD_DELAY': 0,
    'ROBOTSTXT_OBEY': False,        # Disable robots.txt rules
    'AUTOTHROTTLE_ENABLED': False,  # Disable throttling
    'LOG_LEVEL': 'INFO',            # Set to 'DEBUG' for more verbose logging
}
```