[Tutorial] Block malicious bots, harvesters and crawlers in Nginx

One of the challenges I came across recently was blocking malicious bots, harvesters, crawlers and similar user-agents that should not access my data in Nginx. I already had such a system in place for Apache, since I run all my public websites on Apache, but because I also use Nginx as a reverse proxy for some apps, I figured it would be a good idea to port the system I had built for Apache over to Nginx.

So I did some research on GitHub Gist and found a series of snippets that did what I wanted, but they did not include all the user-agents I wanted blocked. Still, it was better than nothing: I already had the user-agents listed in my Apache configs, so all I needed to do was rebuild the list for Nginx.

Therefore, to start blocking malicious user-agents, create a file (using your favorite editor) called bad_bots.conf (or whatever you like to name it) under the /etc/nginx/ directory, with the following contents:

map $http_user_agent $limit_bots {
    default 0;
    ~(google|Googlebot|bing|yandex|msnbot|yahoo|mail|Wordpress|Joomla|Drupal|feed|rss|XML-RPC|iTunes|Googlebot-Image|Googlebot-Video|Xenu|ping|Simplepie) 1;
    ~(AltaVista|Slurp|BlackWidow|Bot|ChinaClaw|Custo|DISCo|Download|Demon|eCatch|EirGrabber|EmailSiphon|EmailWolf|SuperHTTP|Surfbot|BatchFTP|Harvest|Collector|Copier) 1;
    ~(Express|WebPictures|ExtractorPro|EyeNetIE|FlashGet|GetRight|GetWeb!|GrabNet|Grafula|HMView|Go!Zilla|Go-Ahead-Got-It|Whacker|Extractor|lftp|clsHTTP|Mirror|Explorer|moget) 1;
    ~(rafula|HMView|HTTrack|Stripper|Sucker|Indy|InterGET|Ninja|JetCar|Spider|larbin|LeechFTP|Downloader|tool|Navroad|NearSite|NetAnts|tAkeOut|WWWOFFLE|Navigator|SuperHTTP|MIDown) 1;
    ~(GrabNet|Snagger|Vampire|NetZIP|Octopus|Offline|PageGrabber|Foto|pavuk|pcBrowser|Openfind|ReGet|SiteSnagger|SmartDownload|SuperBot|WebSpider|Vacuum|WWW-Collector-E|LinkWalker) 1;
    ~(Teleport|VoidEYE|Collector|WebAuto|WebCopier|WebFetch|WebGo|WebLeacher|Reaper|WebSauger|eXtractor|Quester|WebStripper|WebZIP|Wget|Widow|Zeus|WebBandit|Jorgee|Webclipping) 1;
    ~(Twengabot|htmlparser|libwww|Python|perl|urllib|scan|Curl|email|PycURL|Pyth|PyQ|WebCollector|WebCopy|webcraw|WinHttp|okhttp|Java|Webster|Enhancer|GrabNet|trivial|LWP|Magnet) 1;
    ~(Mag-Net|moget|Recorder|ReGet|RepoMonkey|Siphon|AppsViewer|Lynx|Acunetix|FHscan|Baidu|Yandex|EasyDL|WebEMailExtrac|Mail|MJ12|FastProbe|spbot|DotBot|SemRush|Daum|DuckDuckGo) 1;
    ~(Aboundex|teoma|80legs|360Spider|Alexibot|asterias|attach|BackWeb|Bandit|Bigfoot|Black.Hole|CopyRightCheck|BlowFish|Buddy|Bullseye|BunnySlippers|Cegbfeieh|CherryPicker|DIIbot) 1;
    ~(Spyder|cosmos|Crescent|Custo|AIBOT|dragonfly|Drip|ebingbong|Crawler|EyeNetIE|Foobot|flunky|FrontPage|hloader|Jyxobot|humanlinks|IlseBot|JustView|Robot|InfoTekies|Intelliseek|Jakarta) 1;
    ~(Keyword|Iria|MarkWatch|likse|JOC|Mata.Hari|Memo|Microsoft.URL|Control|MIIxpc|Missigua|Locator|PIX|NAMEPROTECT|NextGenSearchBot|NetMechanic|NICErsPRO|Netcraft|niki-bot|NPbot|tracker) 1;
    ~(Pockey|ProWebWalker|psbot|Pump|QueryN.Metasearch|SlySearch|Snake|Snapbot|Snoopy|sogou|SpaceBison|spanner|worm|Surfbot|suzuran|Szukacz|Telesoft|Intraformant|TheNomad|Titan|turingos) 1;
    ~(URLy|Warning|VCI|Widow|WISENutbot|Xaldon|ZmEu|Zyborg|Aport|Parser|ahref|zoom|Powermarks|SafeDNS|BLEXBot|aria2|wikido|Qwantify|grapeshot|Nutch|linkdexbot|Twitterbot|Google-HTTP-Java-Client) 1;
    ~(Veoozbot|ScoutJet|DomainAppender|Go-http-client|SEOkicks|WHR|sqlmap|ltx71|InfoPath|rogerbot|Alltop|heritrix|indiensolidaritet|Experibot|magpie|RSSInclude|wp-android|Synapse) 1;
    ~(GimmeUSAbot|istellabot|interfax|vebidoobot|Jetty|dataaccessd|Dalvik|eCairn|BazQux|Wotbox|null|scrapy-redis|weborama-fetcher|TrapitAgent|UNKNOWN|SeznamBot|Dataprovider|BUbiNG) 1;
    ~(cliqzbot|Deepnet|Ziba|linqia|portscout|Dataprovider|ia_archiver|MEGAsync|GroupHigh|Moreover|YisouSpider|CacheSystem|Clickagy|SMUrlExpander|XoviBot|MSIECrawler|Qwantify|JCE|tools.ua.random) 1;
    ~(YaK|Mechanize|zgrab|Owler|Barkrowler|extlinks|achive-it|BDCbot|Siteimprove|Freshbot|WebDAV|Thumbtack|Exabot|Collector|mutant|Ukraine|NEWT|LinkextractorPro|LinkScan|LNSpiderguy) 1;
    ~(Apache-HttpClient|Sphere|MegaIndex.ru|WeCrawlForThePeace|proximic|accelobot|searchmetrics|purebot|Ezooms|DinoPing|discoverybot|integromedb|visaduhoc|Searchbot|SISTRIX|brandwatch) 1;
    ~(PeoplePal|PagesInventory|Nutch|HTTP_Request|Zend_Http_Client|Riddler|Netseer|CLIPish|Add\ Catalog|Butterfly|SocialSearcher|xpymep.exe|POGS|WebInDetail|WEBSITEtheWEB|CatchBot|rarely\ used) 1;
    ~(ltbot|Wotbot|netEstate|news\ bot|omgilibot|Owlin|Mozilla--no-parent|Feed\ Parser|Feedly|Fetchbot|PHPCrawl|PhantomJS|SV1|R6_FeedFetcher|pilipinas|Proxy|PHP/5\.|chroot|DataCha0s|mobmail\ android) 1;
}

NOTE 1: As you can see, I split the patterns across multiple lines even though I could have put everything on a single line. A single very long line would have exceeded the width of my editor window, making it harder to spot duplicates or potential errors.

NOTE 2: The first pattern line matches some legitimate user-agents such as the Google and Bing crawlers, RSS feed readers, the iTunes podcast fetcher, the SimplePie feed fetcher, mail clients and CMS cron scripts. I am actually blocking them all, but I grouped them on the first line so that anyone using this code can easily locate and remove the legitimate user-agents they want to allow.

Next, we must include the newly created file in our nginx.conf (the map directive is only valid in the http context, so the include has to sit inside the http block). Edit the /etc/nginx/nginx.conf file and add the following line:

          include /etc/nginx/bad_bots.conf;

Place it right after the existing includes, so it looks something like this:

          include /etc/nginx/conf.d/*.conf;
          include /etc/nginx/sites-enabled/*;
          include /etc/nginx/bad_bots.conf;

NOTE: If you used a different filename, just replace bad_bots.conf above with your own filename.

This ensures that the newly created configuration is loaded when Nginx starts.
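
Before moving on, it is a good idea to check that the new file parses cleanly and then reload Nginx. Something like this should do on a systemd-based distribution (adjust the reload command to your init system):

          # Validate the configuration and reload Nginx only if the test passes
          nginx -t && systemctl reload nginx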

Once this is done, we move on to the vhost configuration file, where we define the action to be taken when the user-agent matches one of our blocked patterns.

To do so, edit your vhost configuration file, which should be located under the /etc/nginx/sites-enabled/ directory, and append the following lines to the end of the main location block, right before its closing brace:

        error_page 403 = @deny;
        if ($limit_bots = 1) {
            return 403;
        }

A typical snippet would then look like this, with the lines referenced above added at the end:

    location / {
        proxy_pass http://xxxxxx.net:9999/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $http_host;

        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto http;
        proxy_set_header X-Nginx-Proxy true;

        proxy_redirect off;
        error_page 403 = @deny;
        if ($limit_bots = 1) {
            return 403;
        }
    }
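
If you protect several vhosts this way, you might prefer to keep those two lines in a small snippet file and include it wherever it is needed. A minimal sketch, assuming a hypothetical /etc/nginx/snippets/block_bots.conf that contains just the error_page and if lines shown above (each server block still needs its own @deny location):

        # inside any location (or server) block you want protected
        include /etc/nginx/snippets/block_bots.conf;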

As you can see in the snippet above, compared to other snippets you may find online, I mapped the 403 status code to a named location (@deny). I did this because I wanted to redirect all this 403 traffic to a page that records it, rather than just dropping it. That is why, in the same vhost configuration file, after the closing brace of the location block referenced above, I defined a 301 redirect to this specific page as follows:

    location @deny {
        return 301 https://xxxxx.net/forbidden.html;
    }
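
A quick way to check that the whole chain works is to send a request that spoofs one of the blocked user-agents and look at the response headers. Something like this should do (example.com is just a placeholder for your own hostname):

        # BlackWidow is one of the user-agents blocked above; -I shows only the response headers
        curl -I -A "BlackWidow" http://example.com/
        # You should get back a 301 pointing to the forbidden page defined in @deny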

You can obviously use just a filename or a local path if you prefer to route the failures to a local file, as sketched below. I'm using a full URL because I have multiple panels on multiple virtual servers redirecting to the same page.
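
For reference, here is a minimal sketch of that local variant; the /var/www/errors path, the separate log file and the 302 status code are just examples, so adjust them to your own setup:

    location @deny {
        # keep a separate log of the blocked hits so they can be reviewed later
        access_log /var/log/nginx/blocked.log;
        # send the offender to a local page instead of an external URL
        return 302 /forbidden.html;
    }

    location = /forbidden.html {
        root /var/www/errors;
    }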

Some final considerations to take into account:

  • These restrictions can be used together with the vestaCP security tweaks I recommended earlier, whenever the vestaCP panel must also remain open to the public and access to it cannot be restricted on an IP basis;
  • Reviewing the patterns above you may have noticed that a lot of user-agents are seemingly missing. This is because the map matches with case-sensitive regexes (the ~ modifier), which means that bot != Bot while SpamBot == Bot. The reference to Bot above (first letter uppercase, the next two lowercase) will therefore block any user-agent that contains it, but will not catch a user-agent that only contains bot (all lowercase). This ensures that user-agents containing only the lowercase word bot, like Googlebot for instance, are not swept up by the Bot pattern and can be allowed or blocked individually;
  • The very same code can be used to create whitelists. For instance, if you wish to whitelist some user-agents called MSN-Bot and RandomBot, but block everything else containing Bot, you can simply add an extra line that looks like the one below. It should go above the blocking patterns, because the first matching regular expression wins:
     ~(MSN-Bot|RandomBot) 0;

    IMPORTANT: The whitelisted bots must map to 0 (zero, not 1 like the rest), so that when $limit_bots is checked by the code we defined in the vhost configuration file, they are excluded from the block. A sketch of how this looks in context follows below.
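
    For illustration, this is how the top of bad_bots.conf would look with such a whitelist in place (MSN-Bot and RandomBot are just example names):

     map $http_user_agent $limit_bots {
         default 0;
         # whitelisted user-agents go first: the first matching regular expression wins
         ~(MSN-Bot|RandomBot) 0;
         # ... all the blocking pattern lines shown earlier follow here, unchanged ...
     }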

This list isn’t final and will keep receiving new user-agents and other improvements over time; if you have any suggestions, please leave a comment.

P.S. I’m already brainstorming something similar for LiteSpeed/OpenLiteSpeed web servers and will probably write a tutorial about it pretty soon as well.