The Importance and Definition of the Robots.txt File
What is it?
The robots.txt file implements what is formally called the Robots Exclusion Protocol. It is a plain text file, not HTML or any other web design format, that is included in a website to tell web robots and crawlers what they can and cannot access. These robots and crawlers are usually run by large websites that need to keep regularly updated records of web content; a search engine is a good example. When a robot or crawler reaches a site, the first thing it does is look for the robots.txt file, which must sit in the root directory of the website. For example: http://www.somewebsite.com/robots.txt
If a bot or crawler cannot find the robots.txt file, it simply assumes that the site’s owner has no special instructions and carries on crawling as usual. It should also be noted that a robots.txt file should never be treated as a security measure: nothing guarantees that search engines will actually skip the pages or content listed in it. Compliance is voluntary, and the file itself is publicly readable.
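Because the file always lives in the root directory, a crawler can work out where to look from any page address on the site. The following is a minimal sketch of that step; the function name robots_url is a hypothetical helper chosen for illustration, not part of any crawler's real code.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the address where robots.txt is expected for the site hosting page_url.

    Hypothetical helper: keeps the scheme and host, then replaces the path
    with /robots.txt and drops any query string or fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.somewebsite.com/Ignored_Pages/First.html"))
# prints: http://www.somewebsite.com/robots.txt
```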
How does it work?
The content of the robots.txt file is quite straightforward. When a web crawler finds a site, it first looks for the file in the root directory and then goes down the list of commands in it. Usually these commands include the name of the web crawler that the ‘do not index’ instruction is aimed at and the list of pages it should not access. The two commands used are ‘User-agent:’ and ‘Disallow:’. Comments can also be added where needed: simply put a ‘#’ sign in front of the text.
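The way a crawler goes down the list of commands can be sketched roughly as follows. This is a simplified illustration that assumes only the two commands and the comment marker described above; the function name parse_robots and the sample text are hypothetical, and real crawlers handle more cases than this.

```python
def parse_robots(text: str) -> dict:
    """Map each crawler name to the list of paths it is asked to skip."""
    rules = {}
    current_agents = []
    last_was_agent = False
    for raw_line in text.splitlines():
        # Everything after a '#' is a comment and is ignored.
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines share the rules that follow them.
            if not last_was_agent:
                current_agents = []
            current_agents.append(value)
            rules.setdefault(value, [])
            last_was_agent = True
        elif field == "disallow":
            last_was_agent = False
            if value:  # an empty Disallow blocks nothing
                for agent in current_agents:
                    rules[agent].append(value)
    return rules


# Hypothetical sample file, used only to exercise the sketch.
SAMPLE = """\
# Keep every crawler away from one page
User-agent: *
Disallow: /Ignored_Pages/First.html
"""

print(parse_robots(SAMPLE))
# prints: {'*': ['/Ignored_Pages/First.html']}
```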
Syntax of the Robots.txt file
User-Agent:- This command specifies which web crawling robots the exclusion applies to. It usually makes little sense to block one search engine while another is allowed in: the page will still turn up in web searches, albeit in the results of the engine that was allowed to crawl and index it, because search results are cached too. Therefore, if a page is to be skipped, it should be skipped by all crawlers:
User-agent: *
Disallow:- To stop the web crawlers from indexing a page, write ‘Disallow:’ followed by the path of the page relative to the site’s root directory. So, if the URL of a web page in our earlier example was http://www.somewebsite.com/Ignored_Pages/First.html, the correct way of having the crawlers ignore it would be:
User-agent: *
Disallow: /Ignored_Pages/First.html
The asterisk (‘*’) acts as a wildcard and means that all crawlers should skip the page.
# (the comment line):- This marker is put in front of a line to show that it is a comment and is not to be read by the crawlers.
Bringing it all together, our robots.txt file would look something like:
# This robots.txt is used to block all crawlers from reading my ‘First.html’ page. Thank you!
User-agent: *
Disallow: /Ignored_Pages/First.html
The number of pages is unlimited; each additional page just needs to be listed on a new line after its own ‘Disallow:’.
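One way to check a finished file before uploading it is to feed it to the robots.txt parser that ships with Python's standard library (urllib.robotparser). The sketch below assumes a file like the one described above, with a second page added to show multiple ‘Disallow:’ lines; ‘Second.html’, the crawler name ‘ExampleBot’ and the helper is_allowed are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents; 'Second.html' is a made-up extra page.
ROBOTS_TXT = """\
# This robots.txt is used to block all crawlers from reading two pages.
User-agent: *
Disallow: /Ignored_Pages/First.html
Disallow: /Ignored_Pages/Second.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Report whether the named crawler may fetch url under the rules above."""
    return parser.can_fetch(agent, url)


print(is_allowed("http://www.somewebsite.com/Ignored_Pages/First.html"))  # False
print(is_allowed("http://www.somewebsite.com/index.html"))                # True
```

Note that this only reports what a well-behaved crawler would do; it does not stop a crawler that chooses to ignore the file.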