This page describes how to block web browser and bot user agents from accessing and scraping a website by configuring the Apache .htaccess control file. Written 2022-08-11. Tested with Apache 2.4.54 on FreeBSD.

What is a user agent?

When a web browser opens a web page, it identifies itself to the web server that serves the page. This identification contains information about the web browser software and the operating system, and it is transmitted in the User-Agent request header.

In the example below, the user agent identified itself as Chrome with the AppleWebKit rendering engine on a Linux operating system while opening a page. Note that it is common for web browsers to pretend to be other browsers and to use other rendering engine names in order to avoid problems with browser detection scripts. Those are often the ancient Mozilla and Safari web browsers and the KHTML and Gecko rendering engines.

64.124.8.24 - - [11/Aug/2022:21:57:28 +0200] "GET /2021/06/23/the-gm-bedford-diesel-marine-engine/ HTTP/1.1" 200 9204 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
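
Because the User-Agent header is sent by the client, any value can be used, which is why impersonating another browser is trivial. As a quick illustration, the string can be set freely with curl (example.com is a placeholder for a real domain):

$ curl -A "Mozilla/5.0 (X11; Linux x86_64) MyTestAgent/1.0" https://example.com/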

In the example below, the agent remained more anonymous but identified itself as a non-human automatic bot from a web archiving project while archiving the front page.

207.241.229.48 - - [28/Mar/2022:08:16:14 +0200] "GET / HTTP/1.0" 200 9870 "https://hq.wb.archive.org/" "Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)"

In the example below, the agent also remained more anonymous but identified itself as a non-human automatic bot from a search engine while reading the sitemap.

66.249.66.62 - - [11/Aug/2022:19:51:27 +0200] "GET /sitemap_index.xml HTTP/1.1" 200 351 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
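
The user agents that visit a particular website can be listed by extracting the User-Agent field from the Apache access log. A minimal sketch, assuming the default combined log format and a log file at /var/log/httpd-access.log (adjust the path to the actual setup):

$ awk -F'"' '{print $6}' /var/log/httpd-access.log | sort | uniq -c | sort -rn | head

This prints the ten most frequent user agent strings together with their request counts, which is a good starting point for deciding which words to block.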

How to block user agents in .htaccess.

In the following example, the Apache .htaccess access control file is configured to block user agents that contain the words archive, arkiv and heritrix, and to present the HTTP error code “404 Not Found”. Another common HTTP error code is “403 Forbidden”. The code block should be placed at the top of the access control file.

$ nano .htaccess
<IfModule mod_rewrite.c>
  RewriteEngine On
  # Match the words case-insensitively (NC) anywhere in the User-Agent header.
  RewriteCond "%{HTTP_USER_AGENT}" archive [NC,OR]
  RewriteCond "%{HTTP_USER_AGENT}" arkiv [NC,OR]
  RewriteCond "%{HTTP_USER_AGENT}" heritrix [NC]
  # Answer every request from a matching agent with "404 Not Found".
  RewriteRule ^.*$ - [R=404,L]
</IfModule>
Screenshot of Firefox being blocked and denied access to a website, showing HTTP error code “404 Not Found” and the message “The requested URL was not found on this server”.
Screenshot of Firefox being blocked and denied access to a website, showing HTTP error code “403 Forbidden” and the message “You don’t have permission to access this resource”.
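
To present “403 Forbidden” instead of “404 Not Found”, the R=404 flag can be replaced with the F (forbidden) flag. A minimal variant of the block above:

<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond "%{HTTP_USER_AGENT}" archive [NC,OR]
  RewriteCond "%{HTTP_USER_AGENT}" arkiv [NC,OR]
  RewriteCond "%{HTTP_USER_AGENT}" heritrix [NC]
  # The F flag answers with "403 Forbidden" and implies L.
  RewriteRule ^.*$ - [F]
</IfModule>

Either variant can be verified from the command line by sending one of the blocked words as the user agent with curl (example.com is a placeholder for the actual domain). The command should print 404 or 403 depending on the variant in use.

$ curl -s -o /dev/null -w "%{http_code}\n" -A "heritrix" https://example.com/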

Should web archive site scraping crawler bots be blocked?

The words used in the example above are often found in the user agents of site scraping web crawler bots that collect data from websites to be stored in massive datacentres run by web archiving projects. This data collection is not only a waste of bandwidth and system resources on the target web server; it is also unrelated to search engine ranking and any other meaningful purpose, which makes such bots good candidates for blocking.

Do web archive crawler bots use brute-force hacking?

Some web archive projects use dictionary-based brute-force hacking attempts to gain access to non-public streaming and database-driven services, while offering neither domain exclusion nor deletion, and while being protected by special laws and funded by governments, which could indicate that these web archive projects have top secret purposes.

The examples below are just a small selection of actual dictionary-based brute-force hacking attempts executed by the site scraping crawler bot of a Danish web archive project. Each HTTP error code 404 indicates that the hacking attempt was not successful.

Keep in mind that web archiving is about archiving public websites that are valuable to society. It is not about “breaking and entering” and stealing non-public data from private companies, family websites or home networks.

130.225.26.139 "GET /6.5.3609 HTTP/1.0" 404 5524
130.225.26.139 "GET /ShockwaveFlash.ShockwaveFlash HTTP/1.0" 404 5524
130.225.26.139 "GET /youtube.com HTTP/1.0" 404 5524
130.225.26.139 "GET /youtu.be HTTP/1.0" 404 5524
130.225.26.139 "GET /audio/ogg HTTP/1.0" 404 5524
130.225.26.139 "GET /audio/mp4 HTTP/1.0" 404 5524
130.225.26.139 "GET /audio/mpeg HTTP/1.0" 404 5524
130.225.26.139 "GET /application/vnd.apple.mpegurl HTTP/1.0" 404 5524
130.225.26.139 "GET /ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/x3d HTTP/1.0" 404 5524
130.225.26.139 "GET /jwplayer.js HTTP/1.0" 404 5524
130.225.26.139 "GET /jwplayer.html5.js HTTP/1.0" 404 5524
130.225.26.139 "GET /jwplayer.flash.swf HTTP/1.0" 404 5524
130.225.26.139 "GET /application/x-shockwave-flash HTTP/1.0" 404 5524
130.225.26.139 "GET /dock.position HTTP/1.0" 404 5524
130.225.26.139 "GET /jwpsrv.js HTTP/1.0" 404 5524
130.225.26.139 "GET /sharing.js HTTP/1.0" 404 5524
130.225.26.139 "GET /related.js HTTP/1.0" 404 5524
130.225.26.139 "GET /gapro.js HTTP/1.0" 404 5524
130.225.26.139 "GET /skins/$1.xml HTTP/1.0" 404 5524
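
Similar probing can be spotted in any access log by counting the “404 Not Found” responses per client address. A minimal sketch, again assuming the default combined log format and a log file at /var/log/httpd-access.log:

$ awk '$9 == 404 {print $1}' /var/log/httpd-access.log | sort | uniq -c | sort -rn | head

Addresses with an unusually high 404 count are often dictionary-based crawlers guessing at paths that do not exist.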

More about user agents.

Firefox user agent string reference on MDN Web Docs by Mozilla. User-Agent strings on Chrome Developers.