Entries tagged with “Robots Exclusion Standard” from Search Engine CookBook

Dealing With Bad Bots

Most search engine spiders obey the directives in robots.txt.

Basically, you can tell a spider not to crawl certain parts of your site, or bar specific spiders from the site entirely.
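
As a rough illustration, a robots.txt along these lines covers both cases. The /private/ path is a made-up placeholder, and the VoilaBot token is an assumption taken from the start of the bot's User-Agent string:

```text
# Keep every spider out of one part of the site (hypothetical path)
User-agent: *
Disallow: /private/

# Ask one specific spider to stay away from the whole site
User-agent: VoilaBot
Disallow: /
```

This only helps with spiders that actually read and honour the file.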

Unfortunately, some spiders are either badly written or deliberately malicious, and they ignore anything you try to tell them via robots.txt.

One such robot is Voila.

Voila identifies itself with the User-Agent string:

VoilaBot BETA 1.2
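
If you want to see which addresses the bot is hitting you from, something like the sketch below may help. It assumes your web server writes the client IP as the first field of each log line and records the User-Agent (as the Apache and nginx "combined" formats do). The function name and the log path in the usage comment are placeholders, not anything Voila-specific:

```shell
#!/bin/sh
# List client IPs whose requests mention "VoilaBot", busiest first.
# Assumption: field 1 of each log line is the client IP.
voilabot_ips() {
    grep -i 'VoilaBot' "$1" | awk '{print $1}' | sort | uniq -c | sort -rn
}

# Usage (hypothetical log path, adjust for your server):
#   voilabot_ips /var/log/apache2/access.log
```

Each output line is a request count followed by an IP address, which gives you candidates to look up and block.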

Depending on the type of site you run, you're probably best advised to block it entirely.

If you have access to iptables, you can simply drop its traffic with a series of commands like this one, one per address or range:

iptables -I INPUT -s 81.52.143.15 -j DROP

I'm still trying to compile a full list of the IP ranges Voila uses, but so far I've found two that you could block:

193.252.148.0/23
81.52.142.0/23
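
To block both ranges in one go, a small script along these lines should do. It is only a sketch: the variable and function names are my own, and it prints the iptables commands rather than running them so you can check them first:

```shell
#!/bin/sh
# The two ranges listed above; add more here as you identify them.
VOILA_RANGES="193.252.148.0/23 81.52.142.0/23"

# Print one DROP rule per range instead of executing it, so the rules
# can be reviewed. Pipe the script's output to `sh` as root to apply them.
voila_block_rules() {
    for range in $VOILA_RANGES; do
        echo "iptables -I INPUT -s $range -j DROP"
    done
}

voila_block_rules
```

Bear in mind that rules added this way do not survive a reboot unless you save them using whatever mechanism your distribution provides.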

On one server, VoilaBot made the hosted sites completely unresponsive, with the load average climbing constantly!
