Blocking Bots in Apache Using .htaccess
Recently one of my applications became the victim of bot spam. Something on the order of 60% of web traffic is bots, and many of them are inconsequential and can safely be blocked or directed to a cache to alleviate server strain. In this case I chose to block them based on user agent, since many of these bots rotate through a range of IP addresses and IPs are easily swapped.
Here is the list of bots I was able to block across several applications without impacting SEO. Not all of these will be right to block for every application. The first entry, “^$”, is the regex for an empty string: I do not allow clients to access pages unless they identify with a user agent, since most often the only things hitting these applications without one were security tools gone rogue.
^$
EasouSpider
Add Catalog
PaperLiBot
Spiceworks
ZumBot
RU_Bot
Wget
Java/1.7.0_25
Slurp
FunWebProducts
80legs
Aboundex
AcoiRobot
Acoon Robot
AhrefsBot
aihit
AlkalineBOT
AnzwersCrawl
Arachnoidea
ArchitextSpider
archive
Autonomy Spider
Baiduspider
BecomeBot
benderthewebrobot
BlackWidow
Bork-edition
Bot mailto:craftbot@yahoo.com
botje
catchbot
changedetection
Charlotte
ChinaClaw
commoncrawl
ConveraCrawler
Covario
crawler
curl
Custo
data mining development project
DigExt
DISCo
discobot
discoveryengine
DOC
DoCoMo
DotBot
Download Demon
Download Ninja
eCatch
EirGrabber
EmailSiphon
EmailWolf
eurobot
Exabot
Express WebPictures
ExtractorPro
EyeNetIE
Ezooms
Fetch
Fetch API
filterdb
findfiles
findlinks
FlashGet
flightdeckreports
FollowSite Bot
Gaisbot
genieBot
GetRight
GetWeb!
gigablast
Gigabot
Go-Ahead-Got-It
Go!Zilla
GrabNet
Grafula
GT::WWW
hailoo
heritrix
HMView
houxou
HTTP::Lite
HTTrack
ia_archiver
IBM EVV
id-search
IDBot
Image Stripper
Image Sucker
Indy Library
InterGET
Internet Ninja
internetmemory
ISC Systems iRc Search 2.1
JetCar
JOC Web Spider
k2spider
larbin
LeechFTP
libghttp
libwww
libwww-perl
linko
LinkWalker
lwp-trivial
Mass Downloader
metadatalabs
MFC_Tear_Sample
Microsoft URL Control
MIDown tool
Missigua
Missigua Locator
Mister PiX
MJ12bot
MOREnet
MSIECrawler
msnbot
naver
Navroad
NearSite
Net Vampire
NetAnts
NetSpider
NetZIP
NextGenSearchBot
NPBot
Nutch
Octopus
Offline Explorer
Offline Navigator
omni-explorer
PageGrabber
panscient
panscient.com
Papa Foto
pavuk
pcBrowser
PECL::HTTP
PHP/
PHPCrawl
picsearch
pipl
pmoz
PredictYourBabySearchToolbar
RealDownload
Referrer Karma
ReGet
reverseget
rogerbot
ScoutJet
SearchBot
seexie
seoprofiler
Servage Robot
SeznamBot
shopwiki
sindice
sistrix
SiteSnagger
smart.apnoti.com
SmartDownload
Snoopy
Sosospider
spbot
suggybot
SuperBot
SuperHTTP
SuperPagesUrlVerifyBot
Surfbot
SurveyBot
swebot
Synapse
Tagoobot
tAkeOut
Teleport
Teleport Pro
TeleportPro
TweetmemeBot
TwengaBot
twiceler
UbiCrawler
uptimerobot
URI::Fetch
urllib
User-Agent
VoidEYE
VoilaBot
WBSearchBot
Web Image Collector
Web Sucker
WebAuto
WebCopier
WebFetch
WebGo IS
WebLeacher
WebReaper
WebSauger
Website eXtractor
Website Quester
WebStripper
WebWhacker
WebZIP
Wells Search II
WEP Search
Widow
winHTTP
WWWOFFLE
Xaldon WebSpider
Xenu
yacybot
yandex
YandexBot
YandexImages
yBot
YesupBot
YodaoBot
yolinkBot
youdao
Zao
Zealbot
Zeus
ZyBORG
Most often you find people blocking bots with one rule per agent, like this. That approach just keeps growing the .htaccess file, adding a lot of unneeded lines. Why do with two hundred lines what can be accomplished in two?
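# One RewriteCond per bot, each chained to the next with [OR]: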
RewriteCond %{HTTP_USER_AGENT} ^Snoopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VB\ Project [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW::Mechanize [NC,OR]
RewriteCond %{HTTP_USER_AGENT} RPT-HTTPClient [NC]
RewriteRule .* - [R=403,L]
I found that I could achieve the same effect with two lines, and adding more entries became easier since they are just separated by the pipe character, which signifies “or” in the regex. It seemed cleaner to me to have two lines doing the work of more than two hundred.
You can see this in the .htaccess boilerplate sketch below.
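Here is a minimal sketch of that approach. Only a handful of entries from the list above are shown; the rest would be joined into the same pipe-separated alternation:

RewriteEngine On
# One condition covers the empty user agent (^$) and every bot in the alternation.
RewriteCond %{HTTP_USER_AGENT} "^$|EasouSpider|Add Catalog|PaperLiBot|AhrefsBot|HTTrack|Wget|Java/1\.7\.0_25" [NC]
RewriteRule .* - [R=403,L]

Because the pattern is quoted, entries containing spaces, like Add Catalog, can be included as-is; unquoted, the space would need a backslash escape.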
When blocking bots, I advise you to be very specific. A generic word like “fire” will also match “Firefox”. You can adjust the regex to fix that issue, but I found it much simpler to be more specific, which has the added benefit of being more informative to the next person to touch the .htaccess.
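A quick sketch of the difference, using a hypothetical bot that identifies as “FireBot”:

# Too broad: "fire" also matches Firefox and anything else containing it.
RewriteCond %{HTTP_USER_AGENT} fire [NC]
# Specific: matches only agents beginning with the (hypothetical) FireBot name.
RewriteCond %{HTTP_USER_AGENT} ^FireBot [NC]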
Additionally, you will see I have a rule for Java/1.7.0_25; in this case it happened to be a bot using that version of Java to slam my servers. Do be careful blocking language-specific user agents like this: some languages, such as ColdFusion, run on the JVM and use the language user agent and web requests to localhost to assemble things like PDFs. JRuby, Groovy, or Scala may do similar things, though I have not tested them.
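If you do need to block a single runtime version, anchoring the pattern to the exact version string keeps the block narrow; a sketch:

# Matches only this exact Java version, not every Java/ user agent.
RewriteCond %{HTTP_USER_AGENT} ^Java/1\.7\.0_25$ [NC]
RewriteRule .* - [R=403,L]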
Thanks for reading.