September 2008 Archives

Dealing With Bad Bots

Most search engine spiders obey the directives in a site's robots.txt file.

Basically, you can tell a search engine not to crawl certain parts of your site, or disallow some spiders from accessing your site entirely.
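
To make that concrete, here is a minimal sketch of how a well-behaved spider reads those rules, using Python's standard urllib.robotparser module. The robots.txt content, the /private/ path and the example.com URLs are made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask VoilaBot to stay away from the whole site
# and keep every other spider out of /private/.
ROBOTS_TXT = """\
User-agent: VoilaBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if a polite spider with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so nothing is fetched over the network
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling allowed("VoilaBot BETA 1.2", "http://example.com/index.html") returns False here, but only a spider that bothers to run this kind of check will actually stay away.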

Unfortunately, some search engine spiders are either badly written or intentionally evil, and they totally ignore any instructions you try to pass them via robots.txt.

One such robot is Voila.

Voila identifies itself with the User-Agent string:

VoilaBot BETA 1.2
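
If you want to spot the bot in your own code or logs, a rough sketch of a check might look like this. The function name is_voilabot is my own invention, and I'm assuming that a case-insensitive match on "voilabot" is enough, since other versions of the bot may report a different version number.

```python
def is_voilabot(user_agent: str) -> bool:
    """Return True if the User-Agent header looks like it came from VoilaBot.

    Assumption: any User-Agent containing "voilabot" (in any letter case)
    belongs to the bot, whatever version number follows.
    """
    return "voilabot" in user_agent.lower()
```

Bear in mind that a User-Agent string is trivial to fake, so this is only useful for identifying the bot when it is being honest about who it is.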

Depending on the type of site you have, you're probably best advised to block it entirely.

If you have access to iptables then you can simply issue a series of commands similar to this one:

iptables -I INPUT -s 81.52.143.15 -j DROP

I'm trying to get a full list of the IP ranges used by Voila, but so far I've found two which you could block. They are:

193.252.148.0/23
81.52.142.0/23
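
Rather than typing one command per range by hand, you could generate them. The following is a small sketch using Python's standard ipaddress module; the names VOILA_RANGES, drop_rules and is_covered are my own, and the script only builds the command strings, it does not run iptables for you.

```python
import ipaddress

# The two ranges listed above.
VOILA_RANGES = ["193.252.148.0/23", "81.52.142.0/23"]


def drop_rules(ranges):
    """Return one iptables command string per CIDR range."""
    rules = []
    for cidr in ranges:
        # ip_network() raises ValueError on a malformed range or one with
        # host bits set, which catches typos before they reach the firewall
        network = ipaddress.ip_network(cidr)
        rules.append(f"iptables -I INPUT -s {network} -j DROP")
    return rules


def is_covered(address, ranges):
    """Return True if a single IP address falls inside any of the ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(cidr) for cidr in ranges)
```

Printing each string from drop_rules(VOILA_RANGES) gives you the commands to run as root. is_covered("81.52.143.15", VOILA_RANGES) confirms that the single address blocked earlier already sits inside the second range.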

On one server, VoilaBot caused the sites to become completely unresponsive, with the load average climbing constantly!

About this Archive

This page is an archive of entries from September 2008 listed from newest to oldest.
