Robots Exclusion Protocol of Search Engines

What is REP?

I have a habit of jokingly calling the Robots Exclusion Protocol (REP) the Rest in Peace (RIP) standard, because of the complete peace it brings to your site, or portions of it, free from any disturbance by search engine crawlers. It is one of the most powerful tools in the hands of webmasters for preventing robots from crawling a site, wholly or partly. Since no crawling happens, the site, or the portion of it covered by the protocol file, is normally not indexed and hence not shown in the search engine result pages. The protocol has a list of directives which can be implemented at the domain level, the page level and the link level. Further, these directives are commonly recognized by the major search engines: Google, Yahoo and Microsoft. They are issued to the search engine bots either by way of the robots.txt file or by way of HTML Meta tags.

Robots.txt file

1. Disallow: This directive blocks one specific bot, or all bots, from crawling the files, directories or even the whole site that you want to keep out of the index.


To block an entire site:

User-agent: *
Disallow: /


To disallow a particular folder for all search engine bots:

User-agent: *
Disallow: /folder1/


To disallow only Googlebot from crawling a specific directory:

User-agent: Googlebot
Disallow: /directory/


(Two other popular Google bots are Googlebot-Image and Googlebot-Mobile.)


2. Allow: After disallowing pages or directories of a site, you might still want some of these to be crawled. This can be achieved by using ‘Allow’ along with ‘Disallow’, as shown below.
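
For example, to block a folder but still allow one page inside it to be crawled (the folder and file names here are only placeholders):

User-agent: *
Disallow: /folder1/
Allow: /folder1/page1.html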


3. $ wildcard: The $ wildcard lets you block all pages whose URLs end with a specific string; $ is placed at the end of that string in the pattern.


To block all URLs ending with the .pdf extension:

User-agent: Googlebot
Disallow: /*.pdf$


4. * wildcard: This is a further refinement: * matches any sequence of characters, which is useful for blocking URL patterns, say those carrying session IDs, to prevent them from showing up in SERPs.


To block all directories beginning with ‘xyz’:

User-agent: Googlebot
Disallow: /xyz*/
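
Similarly, to block any URL containing a session ID in its query string (the parameter name ‘sessionid’ is only an illustration):

User-agent: Googlebot
Disallow: /*sessionid=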

HTML Meta Tags

1. Noindex: <Meta Name="ROBOTS" Content="NOINDEX">

It prevents the indexing of a particular page. Note that the page is still crawled, but it is neither indexed nor shown in SERPs.


2. Nofollow: <Meta Name="ROBOTS" Content="NOFOLLOW">

This tag can be implemented at the page level, as given above, and also at the link level. The page-level tag prevents all the links on that page from being followed to the linked pages. However, it does not prevent a linked page from being crawled if that page is also linked from some other page or link where ‘nofollow’ has not been applied.
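
At the link level, the same effect is achieved with the rel attribute on an individual link, for example:

<a href="http://www.example.com/" rel="nofollow">Example link</a>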


3. Nosnippet: <Meta Name="GOOGLEBOT" Content="NOSNIPPET">


4. Noarchive: <Meta Name="GOOGLEBOT" Content="NOARCHIVE">

The nosnippet tag prevents a text snippet from being shown for the page in SERPs, while the noarchive tag prevents a cached copy from being offered; with either tag, the cache link is not shown.

Other HTML Exclusion Codes

1. UNAVAILABLE_AFTER:

<Meta Name="GOOGLEBOT" Content="unavailable_after: [date] [time] [timezone]">


The date and time are given in RFC 850 format.
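
For illustration, a complete tag (with a made-up date) could look like this:

<Meta Name="GOOGLEBOT" Content="unavailable_after: 25-Aug-2010 15:00:00 EST">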


This is placed in the head section of a page and has the effect of removing that page from the SERPs about a day after the stated expiry time. Remember, however, that the page is still there in the Google index and has not been removed from it; it is only no longer shown in the SERPs. To have the page removed from the index completely, use the URL removal function of Webmaster Tools.


2. NOIMAGEINDEX: <Meta Name="ROBOTS" Content="NOIMAGEINDEX">


This prevents the indexing of images on a particular page.

How to prevent non-HTML files from being crawled and indexed?

Meta tags can be included only in HTML pages. However, certain types of files, such as PDFs, video files and audio files, may also need to be blocked. For these there is another directive, the X-Robots-Tag, which is sent as part of the HTTP response header.


Examples:

X-Robots-Tag: noindex

X-Robots-Tag: noarchive, nosnippet
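
How the header is attached depends on your server. As a minimal sketch, assuming an Apache server with mod_headers enabled, lines like the following in the configuration or .htaccess file would add the header to all PDF files:

<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>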

Conclusion

This hub covers the various features of this protocol, which aims at preventing the crawling and indexing of a site completely or in part, depending on what you want search engines to show and what not to show. Webmasters should make use of whichever of these features or codes is the most appropriate method for their case.
