Dealing With Bad Bots


Most search engine spiders obey the directives in robots.txt.

Basically, you can instruct a search engine not to index certain parts of your site, or disallow some spiders from accessing your site entirely.

Unfortunately, some search engine spiders are either badly written or intentionally evil, and totally ignore any commands you might try to pass them via robots.txt.

One such robot is Voila.

Voila identifies itself with the User-Agent string:

VoilaBot BETA 1.2
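Because the bot announces itself, you can look for it in your web server's access log and see which addresses it is coming from. The following is only a rough sketch: it assumes a log in the common or combined format, where the client IP is the first field on each line, and the function name voila_ips is my own invention.

```shell
#!/bin/sh
# Count requests per client IP for log lines that mention VoilaBot.
# Assumption: the client IP is the first whitespace-separated field,
# as in the common/combined access log formats.
voila_ips() {
    grep -F 'VoilaBot' | awk '{ print $1 }' | sort | uniq -c | sort -rn
}

# If a log file is given as the first argument, report on it.
if [ -n "${1:-}" ]; then
    voila_ips < "$1"
fi
```

Run it with the path to your own access log as the argument (the location varies from server to server). Each output line is a request count followed by an IP address, busiest first.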

Depending on the type of site you have, you're probably best advised to block it entirely.

If you have access to iptables then you can simply issue a series of commands similar to this one:

iptables -I INPUT -s 81.52.143.15 -j DROP

I'm trying to get a full list of the IP ranges used by Voila, but so far I've found two which you could block. They are:

193.252.148.0/23
81.52.142.0/23
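To block both ranges in one go, something like the sketch below should work. It only prints the iptables commands so you can check them first; the variable and function names are hypothetical, and the list of ranges is just the two above, so it is almost certainly incomplete.

```shell
#!/bin/sh
# The two ranges found so far; not a complete list of Voila's addresses.
VOILA_RANGES="193.252.148.0/23 81.52.142.0/23"

# Print one DROP rule per range. Nothing is applied to the firewall here.
voila_rules() {
    for range in $VOILA_RANGES; do
        echo "iptables -I INPUT -s $range -j DROP"
    done
}

voila_rules
```

Once the output looks right, pipe it into sh as root to apply the rules. Note that rules added this way do not survive a reboot unless you save them with whatever mechanism your distribution provides.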

On one server, VoilaBot caused the sites to become completely unresponsive, with the load average climbing constantly!


About this Entry

This page contains a single entry by Michele Neylon published on September 14, 2008 12:19 PM.
