digitalFAQ.com Forum


admin 06-01-2011 07:16 AM

Block bad bots (via User Agents) to save bandwidth, protect data, stop site copying
 
What Is This?
Inorganic traffic, aka "bots" (humans are organic traffic), can put real load on a server and reduce how well your site works at any given moment. Some inorganic traffic is good, such as the Bing and Google crawlers, and will help you. These well-behaved bots catalog your site for search engines, and tend to be fairly conservative when requesting data from your site, so as not to slow it or outright crash it. Bad bots, however, don't care about your site -- they exist for far more nefarious reasons, such as harvesting email addresses, copying/stealing your content, or simply being a resource-eating nuisance. Bad bots, often identifiable by their User Agents, can eat your bandwidth and slow your site.

How Much Security Does It Add?
Like any other method used to protect a site, this doesn't block 100% of the problem; it's simply another layer of protection for website owners. A User Agent is whatever the client claims to be, so a bot that lies about its identity will still get through. Even so, blocking something is better than blocking nothing. (Conversely, there are some REALLY BAD lists online that will block too much! You'll lose site visitors! YIKES! This list is very conservative and minimalist.)
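
To illustrate what "blocking too much" looks like, here is a small sketch. The patterns are chosen for illustration only and are not taken from any particular list. An anchored pattern such as ^Wget only matches User Agents that begin with that string, while a loose, case-insensitive pattern such as bot matches any User Agent that contains it, which includes the Google and Bing crawlers.
Code:

RewriteEngine On

# Conservative: matches only User Agents that START with "Wget"
RewriteCond %{HTTP_USER_AGENT} ^Wget
RewriteRule ^ - [F]

# Too broad (left commented out on purpose): "bot" anywhere, in any case,
# would also match Googlebot and bingbot and lock out search engines
# RewriteCond %{HTTP_USER_AGENT} bot [NC]
# RewriteRule ^ - [F]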

How Well Does It Work? (aka "I Tried This and It Didn't Help!")
In an ideal world, every website on a server would have these rules, to globally block the problem User Agents. If you have a dedicated server, you can control incoming traffic completely. If you're using a VPS, you can control most of your resource hits caused by malicious/malformed traffic. On cheap shared hosting, it may not help much at all, in terms of speeding up the site, if you're the only site out of 100 or even 1,000 blocking such junk traffic.
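
On a dedicated server or VPS where you control the main Apache configuration, one way to give every site the same rules is to keep them in a single shared file and pull it into each virtual host. This is only a sketch: the file path, the site name and the document root below are hypothetical, and only two names from the list are shown.
Code:

# File: /etc/apache2/bad-bots.conf (hypothetical path)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP
RewriteRule ^ - [F]

# In the main server configuration, repeated for each site (hypothetical site)
<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/example.com
    Include /etc/apache2/bad-bots.conf
</VirtualHost>

Changes to the main configuration only take effect after Apache is reloaded or restarted, unlike htaccess changes, which apply on the next request.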

How to Install Anti-Bot Protection
You'll need the ability to add and/or write to the root htaccess file. This is available by default with most Linux/Apache (or Litespeed) hosts, and can be added to Windows servers (if you have a VPS or dedicated server, or a really nice host that will allow for a shared install!). If you don't know what htaccess is, ask -- we'll explain in another post.
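
If you're not sure whether your host has mod_rewrite loaded, a cautious approach is to wrap the rules in an IfModule block. This is a minimal sketch with just two names from the list. With the wrapper, a server without mod_rewrite skips the rules quietly instead of throwing a 500 error on every page. Note that the host must also permit rewrite directives in htaccess (the AllowOverride setting must include FileInfo), and that is something only the host can change.
Code:

<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP
    # Answer matching requests with 403 Forbidden
    RewriteRule ^ - [F]
</IfModule>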

Anyway, you'll want to add this to your rules:
Code:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DownloadDemon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExpressWebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ImageStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^ImageSucker [OR]
RewriteCond %{HTTP_USER_AGENT} IndyLibrary [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^InternetNinja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOCWebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^MassDownloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDowntool [OR]
RewriteCond %{HTTP_USER_AGENT} ^MisterPiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetVampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^OfflineExplorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^OfflineNavigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^PapaFoto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^TeleportPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebImageCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGoIS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebsiteeXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebsiteQuester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^XaldonWebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

If not already entered into your htaccess file, you'll need this before the above rules:
Code:

RewriteEngine On 

And at the end, instead of this:
Code:

RewriteRule ^.* - [F,L]
You may want to consider a URL from spampoison.com -- which works to further harm spammers, bots, etc. When you visit that site, you'll get a URL to use. For example:
Code:

RewriteRule /*$ http://english-1234567890.spampoison.com [L,R]
And that's it. :)

Junk traffic will be blocked or redirected to spampoison, and your site will run a little healthier.

When implemented on digitalFAQ.com, for example, site load times improved by anywhere from 100 to 400 ms. (That's 0.1 to 0.4 seconds faster!) While that number may seem small to web hosting novices, that's a huge time/load savings!
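
As a more compact variation on the rule list (a sketch, not the list as originally posted), several names can be folded into one condition using a regex alternation. Only five names are shown here; the rest would go inside the parentheses, separated by | characters. Be aware that the [NC] flag makes every name case-insensitive, which is slightly broader than the mostly case-sensitive rules in the full list.
Code:

RewriteEngine On
# One alternation in place of several RewriteCond lines joined with [OR]
RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow|ChinaClaw|EmailSiphon|WebZIP|Wget) [NC]
# [F] returns 403 Forbidden and stops further rewriting for the request
RewriteRule ^ - [F]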



Site design, images and content © 2002-2024 The Digital FAQ, www.digitalFAQ.com