The strange tale of robots.txt

A few days ago I read an interesting post in which the author explains how he built a list of subdirectories for brute-forcing web servers and finding hidden resources, relying only on publicly available information from robots.txt files. Before we dive into the details, here is a quick reminder of what robots.txt is:
robots.txt is a simple text file that defines how search engines, like Google, and other bots should interact with the pages and files of a web site. Bots will not scan parts of the application that nothing links to, since they are not aware of them. However, if files or directories are linked to from other pages and the web server administrator does not want bots to visit them, the robots.txt file can be used to define where bots should not go.
For example, the following lines tell all bots not to visit two directories:
User-agent: *
Disallow: /tmp/
Disallow: /admin/
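To see how a well-behaved bot would read these directives, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name "ExampleBot" and the example.com URLs are placeholders chosen for illustration, not names taken from any real crawler.

```python
from urllib.robotparser import RobotFileParser

# The same three directives shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and example.com are placeholders for illustration only.
for path in ("/tmp/cache.dat", "/admin/", "/index.html"):
    allowed = parser.can_fetch("ExampleBot", "http://example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```

Note that the check happens entirely on the bot's side: the file is a request to the crawler, and nothing on the server enforces it.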
As clearly stated in http://www.robotstxt.org/ there are two important considerations when using /robots.txt:

Robots can ignore your /robots.txt. In particular, do not expect a vulnerability scanner or an email harvester to respect such directives.
The /robots.txt file is a publicly available file. Anyone can therefore see what sections of your server you don’t want robots to traverse.

However, it seems that even today there are many misconceptions regarding the role and use of robots.txt. In particular, some web administrators treat it as an access control mechanism, so they include in it explicit references to folders and even files whose presence would not otherwise be revealed. In addition, unless explicitly told otherwise, search engines will index the contents of your robots.txt file, which makes it easier for interested parties to analyze these files and extract information.
A good example can be seen in the following robots.txt file (obtained by randomly searching the web):
From a security point of view, every restriction directive carries a lot of information and can be abused for malicious purposes. Marking parts as “inaccessible” is a beacon for unwanted attention. In the example above, an intruder becomes aware of the existence of specific files (e.g. /plenum/data/xxx.doc) which seem to contain sensitive information. A malicious data collector can obtain this information without even accessing the target server, simply by looking through a search engine cache.
The robots.txt protocol can define restrictions not only on URLs and files, but also on parameters. Vulnerable parameters can be abused to do almost anything on the server, from executing malicious commands to causing a denial of service. Adding such parameters and/or values to robots.txt (as can be seen in the example below) is a poor way to patch your application, and it only increases the chances that they will be used against you.

If you are looking to implement access control for files or folders, define a set of configuration files and place them in the relevant directory on the web server (for example, an .htaccess file on Apache servers). These configuration files are not themselves accessible over the web, because the web server is set up not to deliver such files. If you are looking to prevent search engines from indexing parts of your application, include in the robots.txt file only those parts that are otherwise publicly available. Avoid parameter directives in robots.txt as much as possible, unless their usage is otherwise publicly authorized.
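One way to check your own robots.txt against this advice is a small audit script. The following is only a sketch: the function name audit_robots and the two heuristics (a Disallow value that ends in a file extension, or one that contains a query parameter) are assumptions made for illustration, and the /*?debug= entry in the sample is hypothetical.

```python
import re

# Heuristic (an assumption): a last path segment ending in a short
# extension, such as ".doc", probably names an individual file.
FILE_PATTERN = re.compile(r"/[^/*?$]+\.[A-Za-z0-9]{1,5}\$?$")


def audit_robots(robots_txt: str) -> list[tuple[str, str]]:
    """Return (value, reason) pairs for Disallow entries that look revealing."""
    findings = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue
        value = value.strip()
        if "?" in value or "=" in value:
            findings.append((value, "names a query parameter"))
        elif FILE_PATTERN.search(value):
            findings.append((value, "names a specific file"))
    return findings


# Sample input; the parameter entry is hypothetical.
SAMPLE = """\
User-agent: *
Disallow: /tmp/
Disallow: /plenum/data/xxx.doc
Disallow: /*?debug=
"""

for value, reason in audit_robots(SAMPLE):
    print(value, "->", reason)
```

Directory-level entries such as /tmp/ pass through unflagged here; whether a directory name is itself sensitive still needs a human judgment.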
An additional precaution is to restrict access to the robots.txt file itself to verified search engine bots (e.g. Google, Bing), which can be identified by their IP address ranges. Unverified bots that request the file should be served a restrictive “disallow all” version. Implementing such a scheme requires constant maintenance of search engine data and the ability to deliver different robots.txt contents in different situations. Leading WAF products deliver this capability; they also protect against abuse of parameters and can enforce access controls on specific parts of the application. A WAF with client classification capabilities (i.e. whether access is made by a bot or a browser) also allows detection of suspicious manual access to the robots.txt file, so application owners can monitor the source of that manual activity with more scrutiny.
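The selection logic behind such a scheme can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular WAF implements it: the function names are hypothetical, and the networks listed are documentation-only placeholder ranges, standing in for the address ranges that a real deployment would have to obtain from each search engine and keep up to date.

```python
import ipaddress

# Placeholder networks from the documentation-only ranges (RFC 5737).
# A real deployment would load and refresh the search engines' own ranges.
VERIFIED_BOT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# The file shown to verified search engine bots.
FULL_ROBOTS = "User-agent: *\nDisallow: /tmp/\nDisallow: /admin/\n"

# The restrictive "disallow all" file shown to everyone else.
RESTRICTIVE_ROBOTS = "User-agent: *\nDisallow: /\n"


def is_verified_bot(client_ip: str) -> bool:
    """Return True if the address falls inside a verified bot network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed input is treated as unverified
    return any(address in network for network in VERIFIED_BOT_NETWORKS)


def robots_for(client_ip: str) -> str:
    """Choose which robots.txt body to serve to this client."""
    return FULL_ROBOTS if is_verified_bot(client_ip) else RESTRICTIVE_ROBOTS


print(robots_for("192.0.2.10"))   # inside a placeholder network
print(robots_for("203.0.113.5"))  # outside every listed network
```

Defaulting to the restrictive file for anything unrecognized, including malformed addresses, keeps the failure mode on the safe side.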

Summary

Robots.txt is not a security mechanism, and it is amazing that after all these years people still misunderstand its purpose. They mistakenly think that it can prevent unauthorized access to application data. Restricting access to parts of the application (files / folders / parameters) requires actual security mechanisms. Available options include special configuration files (e.g. .htaccess) and code adjustments inside the web server, or alternatively a Web Application Firewall outside the web server. The advantage of the latter is that it takes minimal effort while supplying a complete and up-to-date solution without platform / OS dependencies.
By Efrat Levy