
The Importance and Definition of the Robots.txt File

Updated on February 6, 2012
Crawling - the search engine spider


What is it?

The robots.txt file is the visible part of what is formally called the Robots Exclusion Protocol. It is a plain text file (not HTML or any other web design format) placed on a website to tell web robots and crawlers which parts of the site they may and may not access. These robots and crawlers are usually run by large services that need to keep their copies of web content up to date; a search engine is the obvious example. When a robot or crawler reaches a site, the first thing it does is look for the robots.txt file, which lives in the root directory of the website. For example:

http://www.somewebsite.com/robots.txt

Now, if a bot or crawler cannot find the robots.txt file, it simply assumes that the site’s owner has no special instructions and continues doing what it came to do. It should also be noted that a robots.txt file should never be treated as a security measure: it only asks well-behaved crawlers to skip the listed pages, and those pages remain publicly accessible (the file even advertises their locations) to anyone who ignores it.
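This lookup can be reproduced with Python’s standard library, which ships a parser for exactly this protocol. A minimal sketch (the site URL and the rules below are illustrative, not taken from a real site):

```python
# Check whether a crawler may fetch a page, per the Robots Exclusion Protocol.
# Uses only the Python standard library; the rules are supplied inline here,
# where a real crawler would download http://www.somewebsite.com/robots.txt.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)   # feed the rules to the parser
rp.modified()     # record that the rules have been loaded

# The /private/ directory is off limits; everything else is fair game.
print(rp.can_fetch("MyBot", "http://www.somewebsite.com/private/page.html"))  # False
print(rp.can_fetch("MyBot", "http://www.somewebsite.com/public/page.html"))   # True
```

If the rules list were empty (the case of a missing robots.txt), can_fetch would return True for every URL, matching the behavior described above.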

How does it work?

The content of the robots.txt file is quite straightforward. When web crawlers find a site, they first look for the file in the root directory and then work down the list of directives it contains. These directives name the web crawler a ‘do not index’ instruction is aimed at and list the pages it should not access. The two directives used are ‘User-agent:’ and ‘Disallow:’. Comments can also be added where needed: all that takes is a ‘#’ sign at the start of the line.

Syntax of the Robots.txt file

User-agent: This directive specifies which of the web crawling robots the rules that follow apply to. Note that it would make little sense to block one search engine while another picks the page up, as the page would still surface in web search results, albeit only on the engine that was allowed to crawl and index it. This is because search results are cached too. Therefore, if a page is to be skipped, it should be skipped by all crawlers:

User-Agent: *
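That said, rules can also be aimed at one crawler in particular, using the name it identifies itself with. A sketch (the bot name ‘SomeBot’ is illustrative, not a real crawler):

```text
# Apply the rules that follow only to a crawler calling itself 'SomeBot':
User-agent: SomeBot
Disallow: /Ignored_Pages/
```

Each crawler obeys the group whose User-agent line matches its own name, falling back to the ‘*’ group if none does.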

Disallow: This directive stops web crawlers from indexing a page; it is written as ‘Disallow:’ followed by the path of the page, relative to the site root. So, if the URL of a web page in our earlier example was http://www.somewebsite.com/Ignored_Pages/First.html, the correct way of having the crawlers ignore it would be:

Disallow: /Ignored_Pages/First.html

The asterisk (‘*’) in the User-agent line of course acts as a wild card and means that every crawler should skip the page.
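Two edge cases of the Disallow rule are worth keeping in mind: a lone slash excludes the entire site, while an empty value excludes nothing. For example:

```text
# Ask all crawlers to stay away from the whole site:
User-agent: *
Disallow: /
```

Conversely, writing ‘Disallow:’ with nothing after it permits everything, which is the usual way to spell out an explicitly permissive robots.txt.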

# (the comment line): A ‘#’ placed in front of a line marks it as a comment, meant for human readers and ignored by the crawlers.

Bringing it all together, our robots.txt file would look something like:

# This robots.txt is used to block all crawlers from reading my ‘First.html’ page. Thank you!

User-Agent: *

Disallow: /Ignored_Pages/First.html

Any number of pages can be excluded; each one is simply listed on a new line after its own ‘Disallow:’ directive.
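The finished file above can be checked against Python’s standard urllib.robotparser. A sketch (the second Disallow line is an illustrative addition, showing the one-page-per-line rule):

```python
# Verify the article's example robots.txt with the standard-library parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
# This robots.txt is used to block all crawlers from reading my 'First.html' page. Thank you!
User-Agent: *
Disallow: /Ignored_Pages/First.html
Disallow: /Ignored_Pages/Second.html
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
rp.modified()  # record that the rules have been loaded

# The listed pages are blocked for every crawler...
print(rp.can_fetch("AnyBot", "http://www.somewebsite.com/Ignored_Pages/First.html"))  # False
# ...while everything else on the site remains crawlable.
print(rp.can_fetch("AnyBot", "http://www.somewebsite.com/index.html"))                # True
```

Note that the parser handles the comment line and the capitalization of ‘User-Agent’ without complaint; directive names are matched case-insensitively.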


Comments

    • Luke Zelleke (author)

      7 years ago

      Brett.Tesol: pinging the bots can help with indexing of your site whenever you have new content. But most platforms, WordPress for example, have auto-ping settings that can be set to do the pinging. Once you've configured that, you can just forget about it. If you choose to do it manually, take care not to overdo it: there is a penalty.

      TsadJatko: Thank you, but in the world of 0's and 1's you never know anything, not for long at least.

      Thank you both.

      L.Z.

    • tsadjatko

      7 years ago from now on

      Sounds like you know your stuff...

    • Brett Caulton (Brett.Tesol)

      7 years ago from Asia

      Interesting hub! Voted awesome and useful.

      I don't suppose you know how pinging a website helps to draw the search engine bots to it, do you?

