This page describes how to block web browsers and bots, identified by their user agent strings, from accessing, scraping and hacking a website by configuring the Apache .htaccess control file. Written 2022-08-11. Tested with Apache 2.4.54 for FreeBSD.
What is a user agent string?
When a web browser opens a web page, it identifies itself to the web server that serves the page. This identification contains information about the web browser software and the operating system. It is sent in the User-Agent request header, and its value is known as the user agent string.
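As an illustration, a raw HTTP request carrying such a header might look roughly like the sketch below. The request path and host name are placeholders; the user agent value is the Chrome example used further down.
GET /index.html HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36
Accept: text/html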
In the example below, the user agent identified itself as Chrome with the AppleWebKit rendering engine on a Linux operating system while opening a page. Note that it is common for web browsers to pretend to be other browsers and rendering engines in their user agent strings in order to avoid problems with control scripts. Those tokens are often the ancient Mozilla and Safari web browsers and the KHTML and Gecko rendering engines.
64.124.8.24 - - [11/Aug/2022:21:57:28 +0200] "GET /2021/06/23/the-gm-bedford-diesel-marine-engine/ HTTP/1.1" 200 9204 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
In the example below, the agent remained more anonymous, but identified itself as a non-human automatic bot from a web archiving project while archiving the front page.
207.241.229.48 - - [28/Mar/2022:08:16:14 +0200] "GET / HTTP/1.0" 200 9870 "https://hq.wb.archive.org/" "Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)"
In the example below, the agent also remained more anonymous, but identified itself as a non-human automatic bot from a search engine while reading the sitemap.
66.249.66.62 - - [11/Aug/2022:19:51:27 +0200] "GET /sitemap_index.xml HTTP/1.1" 200 351 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
How to block user agent strings in .htaccess.
In the following example, the Apache .htaccess control file is configured to block user agents that contain the words archive, arkiv or heritrix, and to respond with the HTTP error code “404 Not Found”. Another common HTTP error code for this purpose is “403 Forbidden”. The code block should be placed at the top of the access control file.
$ nano .htaccess

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond "%{HTTP_USER_AGENT}" archive [NC,OR]
RewriteCond "%{HTTP_USER_AGENT}" arkiv [NC,OR]
RewriteCond "%{HTTP_USER_AGENT}" heritrix [NC]
RewriteRule ^.*$ - [R=404,L]
</IfModule>
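If the “403 Forbidden” response is preferred, the rule can use the F flag instead, which makes Apache return 403 directly. A minimal sketch of that variant, with the same conditions:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond "%{HTTP_USER_AGENT}" archive [NC,OR]
RewriteCond "%{HTTP_USER_AGENT}" arkiv [NC,OR]
RewriteCond "%{HTTP_USER_AGENT}" heritrix [NC]
# The F flag responds with "403 Forbidden" and implies L, so processing stops here.
RewriteRule ^.*$ - [F]
</IfModule>

Either variant can be verified from a shell with curl by sending a matching user agent, for example curl -A heritrix https://www.example.com/ (the host name being a placeholder for the actual site), which should return the configured error code.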


Should web archive site scraping crawler bots be blocked?
The words that were used in the example above are often found in the user agents of site scraping web crawler bots that collect data from websites to be stored in massive data centers, driven by web archiving projects. This data collection is not only a waste of system resources at the data center, Internet bandwidth, system resources on the target web server and environmental resources, but it is also unrelated to search engine ranking or any other meaningful purpose. Add to this that some archive site scraping crawler bots also use brute force hacking to collect non-public data for purposes other than archiving. If the law obligates data collection, the same law will provide means for doing so in a proper manner, and the archive project will cover the costs. This makes such bots good candidates for blocking.
Do web archive site scraping crawler bots use brute force hacking?
Some web archive projects use dictionary based brute force hacking attempts to gain access to non-public streaming and database driven services. They offer neither domain exclusion nor deletion. They are protected by special laws and funded by governments. These circumstances indicate that these web archive projects have other purposes than just archiving.
Keep in mind that web archiving is about archiving public websites that are valuable to society. It is not about “breaking and entering” and stealing non-public data from private companies, family websites or home networks.
The examples below are just a small selection of actual dictionary based brute force hacking attempts that were executed by the site scraping crawler bot of a Danish web archive project. Each HTTP 404 error code is proof that the crawler did not follow links or a sitemap.
130.225.26.139 "GET /6.5.3609 HTTP/1.0" 404 5524 130.225.26.139 "GET /ShockwaveFlash.ShockwaveFlash HTTP/1.0" 404 5524 130.225.26.139 "GET /youtube.com HTTP/1.0" 404 5524 130.225.26.139 "GET /youtu.be HTTP/1.0" 404 5524 130.225.26.139 "GET /audio/ogg HTTP/1.0" 404 5524 130.225.26.139 "GET /audio/mp4 HTTP/1.0" 404 5524 130.225.26.139 "GET /audio/mpeg HTTP/1.0" 404 5524 130.225.26.139 "GET /application/vnd.apple.mpegurl HTTP/1.0" 404 5524 130.225.26.139 "GET /ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/x3d HTTP/1.0" 404 5524 130.225.26.139 "GET /jwplayer.js HTTP/1.0" 404 5524 130.225.26.139 "GET /jwplayer.html5.js HTTP/1.0" 404 5524 130.225.26.139 "GET /jwplayer.flash.swf HTTP/1.0" 404 5524 130.225.26.139 "GET /application/x-shockwave-flash HTTP/1.0" 404 5524 130.225.26.139 "GET /dock.position HTTP/1.0" 404 5524 130.225.26.139 "GET /jwpsrv.js HTTP/1.0" 404 5524 130.225.26.139 "GET /sharing.js HTTP/1.0" 404 5524 130.225.26.139 "GET /related.js HTTP/1.0" 404 5524 130.225.26.139 "GET /gapro.js HTTP/1.0" 404 5524 130.225.26.139 "GET /skins/$1.xml HTTP/1.0" 404 5524
None of these paths are the behavior of clean archiving bots that just followed links or a sitemap on a public website. These are signature probes related to hacking. These particular signature probes detect video streaming services, embedded media references, video decoders and vulnerabilities in URL handling. There is even a path injection payload exploit attempt.
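As a complement to the user agent block shown earlier, requests for such signature probe paths can also be refused in the same .htaccess control file. The sketch below is only an illustration using two of the probe paths from the log above; a real deployment would need its own list, checked against legitimate URLs on the site.

<IfModule mod_rewrite.c>
RewriteEngine On
# Refuse two of the signature probe paths seen in the log above.
RewriteCond "%{REQUEST_URI}" "^/jwplayer\.js$" [OR]
RewriteCond "%{REQUEST_URI}" "^/ShockwaveFlash\.ShockwaveFlash$"
RewriteRule ^.*$ - [R=404,L]
</IfModule>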
More about user agents.
Firefox user agent string reference on MDN Web Docs (Resources for Developers, by Developers) by Mozilla. User-Agent strings on Chrome Developers.