The Importance and Definition of the Robots.txt File

Crawling - the search engine spider


What is it?

The robots.txt file implements what is formally known as the Robots Exclusion Protocol. It is a plain text file, not an HTML or other web design file, that is placed on a website to tell web robots and crawlers which parts of the site they may and may not access. These robots and crawlers are usually run by large websites that need to keep regular updates of web content; a search engine is a good example. When a robot or crawler reaches a site, the first thing it does is look for the robots.txt file, which is found in the root directory of the website. For example:

http://www.somewebsite.com/robots.txt

Now, if a bot or crawler cannot find the robots.txt file, it simply assumes that the site’s owner has no special instructions and continues doing what it is expected to do. It should also be noted that a robots.txt file should never be considered a security measure: the pages it lists are still publicly accessible to anyone who requests them directly; well-behaved search engines merely agree to skip them.
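A site owner who wants to make the ‘no special instructions’ case explicit can publish a robots.txt that allows everything. A minimal sketch, using the two directives explained in the next section (an empty ‘Disallow:’ means nothing is blocked), would be:

User-agent: *
Disallow: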

How does it work?

The content of the robots.txt file is quite straightforward. When web crawlers find a site, they first look for the file in the root directory and then work down the list of rules it contains. These rules typically name the web crawler the ‘do not index’ instruction is aimed at, followed by the list of pages it should not access. The two directives used are ‘User-agent:’ and ‘Disallow:’. Comments can also be added where needed; all that is required is a ‘#’ sign at the start of the line.
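For instance, a rule block aimed at one specific crawler, here Googlebot (Google’s crawler), might look like the following; the directory name is made up purely for illustration:

# Ask Google's crawler to skip the /Private_Stuff/ directory
User-agent: Googlebot
Disallow: /Private_Stuff/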

Syntax of the Robots.txt file

User-Agent: This directive specifies which of the web crawling robots the rules that follow apply to. Obviously, it would make no sense to block one search engine while another picks the page up, as the page would still be available in web searches, albeit only in the results of the engine that was allowed to crawl and index it (search results are cached, too). Therefore, if a page is to be skipped, it should be skipped by all crawlers:

User-Agent: *

Disallow: The syntax for stopping web crawlers from indexing a page is ‘Disallow’ followed by the path of the page, relative to the root directory. So, if the URL of a web page on our earlier example site were http://www.somewebsite.com/Ignored_Pages/First.html, the correct way of having the crawlers ignore it would be:

Disallow: /Ignored_Pages/First.html
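A ‘Disallow’ rule can also cover a whole directory rather than a single page, since the path is treated as a prefix. Reusing the example site’s directory, the following line would tell crawlers to skip everything under /Ignored_Pages/:

Disallow: /Ignored_Pages/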

The asterisk (‘*’) in the ‘User-Agent’ line, of course, acts as a wild card and means that all of the crawlers need to obey the rule.

# (the comment line): This is just a marker put in front of a line to show that it is a comment and is not to be read as a rule by the crawlers.

Bringing it all together, our robots.txt file would look something like:

# This robots.txt is used to block all crawlers from reading my ‘First.html’ page. Thank you!

User-Agent: *

Disallow: /Ignored_Pages/First.html

The number of pages that can be blocked is unlimited; each one just needs to be listed on a new line after its own ‘Disallow:’.
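For example, a file that blocks two pages and an entire directory for every crawler (the extra page and directory names are made up for illustration) could look like this:

# Block a few pages and an entire directory for all crawlers
User-Agent: *
Disallow: /Ignored_Pages/First.html
Disallow: /Ignored_Pages/Second.html
Disallow: /Old_Drafts/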




Comments


Brett.Tesol 5 years ago from Somewhere in Asia

Interesting hub! Voted awesome and useful.

I don't suppose you know how pinging a website helps to draw the search engine bots to it, do you?



tsadjatko 5 years ago from maybe (the guy or girl) next door

Sounds like you know your stuff...



Luke Zelleke 5 years ago Author

Brett.Tesol: pinging can help with indexing of your site whenever you have new content. But most platforms, WordPress for example, have auto-ping settings that can be configured to do the pinging for you; once you've set that up, you can just forget about it. If you choose to do it manually, take care not to overdo it: there is a penalty.

TsadJatko: Thank you, but in the world of 0's and 1's you never know anything, not for long at least.

Thank you both.

L.Z.
