What is robots.txt - The Basics

Updated on December 2, 2012

Robots.txt - Why is it used?

When a website is completed and released on the Internet, it receives several different kinds of visitors. Most are human beings who arrive because they followed a link on a search engine results page, clicked a link in a newsletter, and so on. These visitors come to the website because they believe it has something that will add value for them in some way.

A small but significant number of site visitors, however, are not human beings; they are robots. Robots are small programs, usually created by search engines such as Google and Bing, which crawl through website content and return specific information to the domain from which they originated.

They are called robots because these programs are complete in themselves and carry out everything they have been pre-programmed to do without any further human intervention. They do this work tirelessly and endlessly.

While robots from search engines are generally welcome to crawl through the contents of a website, there are also robots written by not-so-nice people. These crawl websites looking to harvest email addresses to spam, credit card details to defraud, administrator passwords to hack in and take over the website, and other unpleasant things like that.

These robots are coded to be malicious. Regrettably, such robots are almost impossible for the websites they visit to control.

Theoretically, website owners can communicate with such robots. In general, however, these robots just ignore such communications. After all, they have been specifically written to be malicious, and no self-respecting malicious robot would heed a website owner's request to leave the website alone.

Now comes the question: how do website owners communicate with the robots that are well behaved and willing to listen?

The answer is via robots.txt.

The Robots Exclusion Protocol.

The rules written in robots.txt follow a convention called the Robots Exclusion Protocol.

Here is how this works.

A robot visits a website, let's say http://www.mywebsite.com. Before it starts crawling through the website's content, a well behaved robot first checks for the existence of http://www.mywebsite.com/robots.txt.

If such a file exists in the root directory of the website, well behaved robots are coded to read its contents and adjust what they do on the website accordingly.
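As a rough illustration of the check described above, the following Python sketch works out where a robot would look for robots.txt, given the address of any page on the site. The function name robots_url is made up for this example and is not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.mywebsite.com/tutorials/tutorial1.html"))
```

Whatever page the robot starts from, the file it looks for always sits at the top level of the site.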

Instructions For Robots.

Let's assume that the very first line in robots.txt is:
User-agent: *

Robots understand that the keyword User-agent is used to address them by name.

The attribute passed to the keyword is *
This indicates that the website owner wants to grab the attention of all robots.

: is a simple separator symbol, which is used to separate a keyword from its attribute.

This first line therefore indicates that all the website owner's instructions below it are applicable to all robots.
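To make the keyword, separator and attribute idea concrete, here is a minimal Python sketch that splits one robots.txt line at its first colon. The helper name split_rule is an assumption made for illustration; real robots use their own, more thorough parsers.

```python
def split_rule(line: str) -> tuple[str, str]:
    """Split a robots.txt line into (keyword, attribute) at the first colon."""
    keyword, _, attribute = line.partition(":")
    # Surrounding spaces are not significant, so trim them.
    return keyword.strip(), attribute.strip()


print(split_rule("User-agent: *"))
```

Splitting only at the first colon matters because an attribute may itself contain colons.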

Let's assume that the next line is:
Disallow: /

Here the keyword is Disallow, which tells robots what they are not welcome to visit.

The attribute passed to the keyword is /, which stands for the root of the website and therefore everything beneath it.

As before, : is the separator symbol between a keyword and its attribute.

Disallow: /
therefore instructs all robots that they should not visit any of the pages of the website.
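One way to see how a well behaved robot might interpret these two lines is to feed them to the robots.txt parser that ships with Python's standard library. This is only a sketch: the robot name ExampleBot and the helper is_allowed are invented for the example, and other robots may use different parsers.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /
"""


def is_allowed(robot_name: str, url: str) -> bool:
    """Ask the standard-library parser whether robot_name may fetch url."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(robot_name, url)


# Every page is off limits to every robot under these rules.
print(is_allowed("ExampleBot", "http://www.mywebsite.com/index.html"))
```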

NOTE: There are two important things to bear in mind when communicating with robots via robots.txt.

1) Malicious robots will simply ignore the instructions given by the website owner in robots.txt.

2) /robots.txt is a publicly available file. This means that if anyone keys in http://[website URL in full]/robots.txt, the web server will return the contents of robots.txt for them to read. Hence, anyone (including hackers) can see which sections of your website you do not want robots to crawl.

Hence, please do not try to use robots.txt to hide any information on the website. This approach simply does not work.

Why do robots ignore the instructions in /robots.txt?

There could be simple reasons, such as the robot's code having been written by an inexperienced programmer. That said, it is far more likely to be a malicious robot explicitly written to scan the website for usable information that can be abused. The robot could be looking for HTML forms from which it could harvest email addresses, or it could be specifically coded to identify security loopholes that can be exploited to hack into and take over the website.

Some website owners are concerned about the file /robots.txt itself. Why have it on the website?

What are the security implications of /robots.txt?

Many website owners are concerned that listing pages or directories in the /robots.txt file may invite unintended access to them.

Here is what can be done to address the security implications of robots.txt.

You could put all the files you do not want robots to visit in a separate folder on the web server. Then make that directory unlistable on the Internet by specifically configuring the web server. Next, block robots from this folder in /robots.txt.

NOTE: If anyone on the Internet places a link to one (or more) of the files in the hidden folder, those files can still be found and opened, so this remains a security hole.

For example, rather than:

User-Agent: *
Disallow: /thisfile.html
Disallow: /thatfile.html

Use a single line that reads:

User-Agent: *
Disallow: /norobots/

NOTE: You would have to create a /norobots directory on the web server, then place thisfile.html and thatfile.html into it.
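The effect of the single folder rule can be checked with Python's standard-library robots.txt parser. The sketch below assumes the site is http://www.mywebsite.com and uses a made-up robot name, ExampleBot; the helper can_crawl is likewise only for illustration.

```python
from urllib.robotparser import RobotFileParser

FOLDER_RULES = ["User-agent: *", "Disallow: /norobots/"]


def can_crawl(rules: list[str], path: str, robot_name: str = "ExampleBot") -> bool:
    """Return True if robot_name may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(robot_name, "http://www.mywebsite.com" + path)


# Both files are covered by the one folder rule, yet neither file name
# appears anywhere in robots.txt.
for path in ("/norobots/thisfile.html", "/norobots/thatfile.html", "/index.html"):
    print(path, can_crawl(FOLDER_RULES, path))
```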

Q1. Where is the file robots.txt normally placed?

A1. In the top level directory ( root folder ) of the web site.

As the web site owner, you need to put robots.txt in exactly the same place where you put your web site's main index.html gateway page.

Remember to use all lower case for the filename robots.txt.

Q2. What editor should be used to create the file robots.txt?

A2. Any text editor can be used. Notepad is a great example.

Q3. What are the common rules that can be placed inside robots.txt?

A3. /robots.txt is a plain text file, which must have each rule on its own line. For example:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /dbtables/

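In this example, robots are instructed to leave three folders alone: /cgi-bin/, /tmp/ and /dbtables/.

NOTE:
There should not be any blank lines between the rules of a single User-agent group, because a blank line ends the group.
Wildcard characters in Disallow paths are not part of the original standard, although some major search engine robots accept * and $ as extensions.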
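Each Disallow path is treated as a prefix of the requested path, which is why the trailing slash matters. The following Python sketch, using the standard-library parser and an invented robot name (ExampleBot) and site address, shows which paths these three rules cover and which they do not.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /dbtables/",
])


def check(path: str) -> bool:
    """Return True if the hypothetical ExampleBot may fetch path."""
    return parser.can_fetch("ExampleBot", "http://www.mywebsite.com" + path)


# /tmp/cache.html starts with /tmp/ and is blocked;
# /tmpfile.html does not start with /tmp/ and stays crawlable.
for path in ("/tmp/cache.html", "/tmpfile.html", "/cgi-bin/form.cgi", "/index.html"):
    print(path, check(path))
```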

Returning to the /norobots example: you would also have to configure your web server not to generate a directory listing for the folder /norobots.

Now all a hacker would learn is that there is a norobots folder on the website, but they would not be able to list the files in it.

NOTE: One simple way to stop a web server from generating a directory listing for a folder is to place an empty index.html file in that folder.

Now if anyone tries to obtain a directory listing by requesting the folder's path, the web server will return an empty HTML page. Blank and white.
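If you manage the site's files with a script, the empty index.html can be created in a couple of lines. This Python sketch assumes the web server serves index.html as the default page for a folder; the function name add_empty_index is made up for the example.

```python
from pathlib import Path


def add_empty_index(folder: str) -> Path:
    """Create index.html inside folder if it is missing and return its path."""
    index = Path(folder) / "index.html"
    # touch() creates an empty file; an index.html that already exists
    # is left as it is rather than being emptied.
    index.touch(exist_ok=True)
    return index
```

For instance, add_empty_index("norobots") would create norobots/index.html, provided a norobots folder already exists in the current directory.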

IMPORTANT: Website owners should understand that /robots.txt is not intended for access control, so don't try to use it as such.

Think of robots.txt as a No Entry sign, not a locked door.

If website owners take this studied approach, they need not worry about the security implications of robots.txt.

Simple Effective robots.txt Rules

To allow all robots complete access to all folders and files on the web server:
User-agent: *
Disallow:

To disallow a single specific robot (requires knowing the name of the robot):

User-agent: BadBot
Disallow: /
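A quick way to confirm that this rule singles out one robot is to run it through Python's standard-library parser. In this sketch BadBot comes from the rule above, while GoodBot and the site address are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: BadBot",
    "Disallow: /",
])

PAGE = "http://www.mywebsite.com/index.html"

# BadBot is named in the rule, so it is turned away.
print("BadBot:", parser.can_fetch("BadBot", PAGE))
# No rule mentions GoodBot, so it keeps full access.
print("GoodBot:", parser.can_fetch("GoodBot", PAGE))
```

As noted earlier, this only works if the robot chooses to obey the rule.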

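To allow a single specific robot and turn all others away

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

To allow multiple specific robots and turn all others away

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /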

To disallow all robots access to specific pages

User-agent: *
Disallow: /tutorials/tutorial1.html
Disallow: /tutorials/tutorial2.html
