How to block website from crawling my server?

Posted on June 9, 2019

I have Ubuntu 18.04 on my VPS with ufw enabled. I tried to block the IP address from the netstat log, but it is still accessing my site, as shown below:

  root@Waseely:~# netstat -c | grep 3000
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:57123  ESTABLISHED
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:38822  ESTABLISHED
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:36797  ESTABLISHED
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:57123  ESTABLISHED
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:38822  ESTABLISHED
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:36797  ESTABLISHED
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:57123  ESTABLISHED
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:38822  ESTABLISHED
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:57123  ESTABLISHED
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:38822  TIME_WAIT
  tcp6  0  0  142.93.23.138:3000  crawl-66-249-64-3:46276  ESTABLISHED

I am trying to block the IP address of "crawl-66-249-64-3", but no luck.

Any idea?




This looks like Googlebot. Have you tried disallowing Google via your robots.txt file?

I tried that, and I also tried to block the IP itself using sudo ufw deny from 66.249.64.1/24.
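One thing worth checking here: ufw evaluates rules in order, so a deny rule appended after an existing allow rule (for example, one opening port 3000) will never match. A minimal sketch, assuming the 66.249.64.0/24 range covers the crawler's addresses and that your deny rule currently sits below an allow:

```shell
# Insert the deny rule at position 1 so it is evaluated before any allow rules
# (assumption: 66.249.64.0/24 is the range you want to drop).
sudo ufw insert 1 deny from 66.249.64.0/24 to any

# Verify the ordering; the deny should now appear first.
sudo ufw status numbered
```

Note too that firewall rules typically do not cut connections that are already established, so the existing sessions in your netstat output may persist until they close on their own.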

Heya,

In your robots.txt file, you can use the Disallow directive to tell specific user-agents not to crawl your site. One important detail: robots.txt matches on the crawler's User-agent string, not its hostname. "crawl-66-249-64-3" is the reverse-DNS name of a Googlebot address, and Googlebot identifies itself with the user-agent "Googlebot".

For example, to block all Googlebot crawling, you can add the following lines:

  User-agent: Googlebot
  Disallow: /

This instructs Googlebot not to crawl any pages on your site.

Since robots.txt is a plain file served from your web root, changes take effect as soon as the file is saved; crawlers simply fetch it on their next visit. You only need to restart your web server if you also changed server configuration. Use the appropriate command for your web server software.

For Apache, you can use:

  sudo systemctl restart apache2

For Nginx:

  sudo systemctl restart nginx

To test whether the robots.txt file is correctly blocking the bot, watch your access logs to see whether the bot's requests decrease or stop. Keep in mind that crawlers typically cache robots.txt for a while, so the change may take up to a day to be picked up.
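A quick way to watch for crawler hits is to count Googlebot requests in your access log. A sketch, assuming an Nginx-style log (the real path, e.g. /var/log/nginx/access.log, is an assumption; adjust for Apache or a custom setup). The demo writes a sample log line to a temp file so the commands are self-contained:

```shell
# Write one sample access-log line (a stand-in for your real log file):
printf '66.249.64.3 - - [09/Jun/2019:10:00:00 +0000] "GET / HTTP/1.1" 200 612 "-" "Googlebot/2.1"\n' > /tmp/access.log

# Count requests whose user-agent mentions Googlebot:
grep -c "Googlebot" /tmp/access.log
```

Running the same grep against your real log file, before and after the robots.txt change, shows whether the crawl rate is dropping.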

Remember that well-behaved bots usually respect the rules in the robots.txt file, but malicious bots might not. It’s an effective way to communicate your preferences to web crawlers, but it doesn’t guarantee that all bots will obey. If you continue to experience issues, you might consider implementing additional security measures or IP blocking as mentioned earlier.
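If a crawler ignores robots.txt, you can also refuse it at the web-server layer by matching its User-agent string. A minimal Nginx sketch (the server block, domain, and the "googlebot" pattern are assumptions; match whatever user-agent string actually appears in your logs):

```nginx
server {
    listen 80;
    server_name example.com;  # placeholder domain

    # Return 403 Forbidden to any client whose User-agent contains "googlebot"
    # (case-insensitive match via ~*):
    if ($http_user_agent ~* "googlebot") {
        return 403;
    }
}
```

Unlike robots.txt, this is enforced by the server itself, so it works even against bots that do not honor crawling rules, though anything spoofing a normal browser user-agent will still get through.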

Hope that this helps!
