What is robots.txt - The Basics
Robots.txt - Why is it used?
When a website is completed and released on the Internet, it receives several different kinds of visitors. Most are human beings who visit the website because they followed a link in a Google search engine results page, or because they clicked a link in a newsletter, and so on. These site visitors come to the website because they believe that the website has something that will add value to them in some way.
A small but significant number of site visitors, however, are not human beings. They are Internet robots. Robots are small blocks of code, usually created by Search Engines such as Google, Bing, and so on, which crawl through website content and return specific information to the domain from which they originated.
They are called robots because these blocks of code are complete in themselves, and carry out all that they have been pre-programmed to do without any further human intervention. They carry out the work they were programmed to do tirelessly and endlessly.
While robots from Search Engines are always welcome to crawl through the contents of a website, there are robots, written by not-so-nice people, that crawl websites looking to harvest Email IDs to spam, credit card details to defraud, administrator passwords to hack in and take over the website, and other unpleasant things.
These robots are coded to be malicious. Regrettably, such robots are almost impossible for the websites they visit to control.
Theoretically, website owners can communicate with such robots. In general, however, these robots just ignore such communications. After all, they have been specifically written to be malicious. No self-respecting malicious robot would heed a website owner's request to leave its website alone.
Now comes the question: how do website owners communicate with the robots that are well behaved and willing to listen?
The answer is via robots.txt.
The Robots Exclusion Protocol.
The contents of robots.txt is called The Robots Exclusion Protocol.
Here is how this works.
A robot visits a website, let's say http://www.example.com. Before it commences crawling through website content, all well-behaved robots first check for the existence of http://www.example.com/robots.txt.
If such a file exists in the root directory of the website, then well-behaved robots are so coded that they read the contents of this file and modify what they do on the website based on the contents of robots.txt.
Instructions For Robots.
Let's assume that the very first sentence in robots.txt is:
User-agent: *
Robots understand that the keyword User-agent is used to address all of them.
The attribute passed to the keyword is *
This indicates that the website owner wants to grab the attention of All Robots.
: is a simple separator symbol, which is used to separate a keyword from its attribute.
This indicates that all the website owner's instructions below this first line are applicable to all robots.
Let's assume that the next line is:
Disallow: /
Here the keyword is Disallow, which tells all robots what they are not welcome to visit on the website.
The attribute passed to the keyword is /
Disallow: /
Therefore instructs all robots that they should not visit any of the pages of the website.
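The effect of these two lines can be checked with Python's standard-library robots.txt parser. This is a minimal sketch; the site URL and page paths below are hypothetical examples.

```python
# Minimal sketch using Python's standard library to show how a
# well-behaved robot interprets "User-agent: *" plus "Disallow: /".
# The site URL and paths below are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# With "Disallow: /" in force, no page on the site may be crawled.
print(parser.can_fetch("*", "http://www.example.com/"))            # False
print(parser.can_fetch("*", "http://www.example.com/index.html"))  # False
```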
NOTE: There are two important things to bear in mind when communicating with robots via robots.txt.
1) Malicious robots will simply ignore the instructions given by the website owner in robots.txt.
2) /robots.txt is a publicly available file. What this means is that if anyone keys in http://[website URL in full]/robots.txt, the web server will return the contents of the file robots.txt for them to read. Hence, anyone ( including hackers ) can see which sections of your website you do not want robots to crawl.
Hence, please do not try to use robots.txt to hide any information on the website. This approach just does not work.
Why do robots ignore the instructions in /robots.txt?
Well, there could be simple reasons, like the robot code being written by an inexperienced programmer. Having said that, it is far more likely to be a malicious robot explicitly written to scan the website looking for usable information that can be abused. The robot could be looking for HTML forms from which it could harvest Email IDs, or it could be specifically coded to identify website security loopholes that can be exploited to hack into and take over the website.
Some website owners are concerned about the file /robots.txt itself. Why have it on the website?
What are the security implications of /robots.txt?
Many website owners are concerned that listing pages or directories in the /robots.txt file may invite unintended access to them.
Here is what can be done to address the security implication of robots.txt.
You could put all the files you do not want robots to visit into a separate folder on the web server. Then make that directory unlistable on the Internet by specifically configuring the web server. Finally, block robots' access to this folder in /robots.txt.
NOTE: If anyone on the Internet places a link to one ( or more ) of the files in the hidden folder, those files can still be discovered. This remains a security hole.
For example, rather than:
User-Agent: *
Disallow: /thisfile.html
Disallow: /thatfile.html
Use a single line that reads:
User-Agent: *
Disallow: /norobots/
NOTE: You would have to create a /norobots directory on the web server, then place thisfile.html and thatfile.html into it.
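The steps in the NOTE above can be sketched in Python. This is a minimal sketch, assuming a local document root named webroot; the folder and file names ( webroot, thisfile.html, thatfile.html ) are hypothetical, so adjust them to match your own server.

```python
# Sketch of the /norobots approach: gather the pages you want robots to
# skip into one folder, stop directory listings with an empty index.html,
# and write a robots.txt that names only that folder.
# "webroot", "thisfile.html" and "thatfile.html" are hypothetical names.
import pathlib
import shutil

webroot = pathlib.Path("webroot")
norobots = webroot / "norobots"
norobots.mkdir(parents=True, exist_ok=True)

# Move the pages into the hidden folder (skipping any that do not exist).
for name in ("thisfile.html", "thatfile.html"):
    src = webroot / name
    if src.exists():
        shutil.move(str(src), str(norobots / name))

# An empty index.html stops most web servers from listing the folder.
(norobots / "index.html").touch()

# A single rule now covers every file in the folder.
(webroot / "robots.txt").write_text("User-agent: *\nDisallow: /norobots/\n")
```

On a real site the same moves would usually be done by hand or over FTP; the point is that one Disallow line now hides any number of files.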
Q1. Where is the file robots.txt normally placed?
A1. In the top level directory ( root folder ) of the web site.
As website owner, you need to put robots.txt in exactly the same place where you put your website's main index.html gateway page.
Remember to use all lower case for the filename robots.txt.
Q2. What editor should be used to create the file robots.txt?
A2. Any text editor can be used. Notepad is a great example.
Q3. What are the common rules that can be placed inside robots.txt?
A3. /robots.txt is a plain text file, which must have each rule on a new line.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /dbtables/
In this example, robots are instructed to leave three folders alone: /cgi-bin, /tmp and /dbtables.
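A quick way to sanity-check rules like these is, again, Python's standard-library parser. The example URL and the page search.cgi are hypothetical.

```python
# Checking the three-folder example with Python's standard library.
# The site URL and the page "search.cgi" are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /dbtables/",
]

rp = RobotFileParser()
rp.parse(rules)

# Pages inside the three folders are blocked; everything else is allowed.
print(rp.can_fetch("*", "http://www.example.com/cgi-bin/search.cgi"))  # False
print(rp.can_fetch("*", "http://www.example.com/index.html"))          # True
```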
NOTE:
There should not be any blank lines in between rules.
Wildcard characters are not part of the original standard ( some major crawlers, such as Googlebot, do support * and $ in paths, but you cannot rely on this ).
Next, you would have to configure your web server to not generate a directory listing for the folder /norobots.
Now all a hacker would learn is that there is a /norobots folder on the website, but they would not be able to list the files in it.
NOTE: A simple way to stop a web server from generating a directory listing for a website folder is to place an empty index.html file in that folder.
Now if anyone tries to obtain a directory listing by passing the folder's path to the web server, the web server will return an empty HTML page.
IMPORTANT: Website owners should understand that /robots.txt is not intended for access control, so don't try to use it as such.
Think of robots.txt as a No Entry sign, not a locked door.
If website owners take this measured approach, they need not worry about the security implications of robots.txt.
Simple Effective robots.txt Rules
To allow all robots complete access to all folders and files on the web server:
User-agent: *
Disallow:
To disallow a single specific robot ( requires knowing the name of the robot )
User-agent: BadBot
Disallow: /
To allow a single specific robot and exclude all others ( note that Googlebot is the user-agent name of Google's crawler )
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
To allow multiple specific robots
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:
To disallow all robots access to specific pages
User-agent: *
Disallow: /tutorials/tutorial1.html
Disallow: /tutorials/tutorial2.html
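The per-robot rules above can be verified with the same standard-library parser. BadBot is the hypothetical robot name used earlier; OtherBot stands in for any robot the rules do not mention.

```python
# Sketch: per-robot rules. BadBot is blocked everywhere, while a robot
# not named in the file (here the hypothetical "OtherBot") is unaffected.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: BadBot",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("BadBot", "http://www.example.com/index.html"))    # False
print(rp.can_fetch("OtherBot", "http://www.example.com/index.html"))  # True
```

Remember, though, that this only models a robot that chooses to obey the file; a real BadBot is free to ignore it.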
Ivan Bayross
Open source tutorials | Open source training