Using robots.txt to stop search engines from indexing your files
As a writer who contributes articles to various sites around the internet, I wanted to set up an online archive for my work. This would be a repository to which I could give others access as needed (for example, to establish authorship in copyright infringement cases). At the same time, to avoid having different references to the same material show up in search results, I needed to prevent the files in my archive from being indexed by search engines such as Google or Bing.
A little research showed that by using a robots.txt file, I could inform search engines that they should not index certain items on my website. It’s a simple and easy solution that does exactly what I need it to do. But in setting up my robots.txt file, I ran into some issues that were not addressed in the documentation I had read, and which required some time and head-scratching to figure out through trial and error.
That’s why I thought it might be useful to provide a simple guide that might save someone else from having to struggle with the issues I did.
What is robots.txt?
Search engines use applications called “robots” to “crawl” the entire internet, searching out online files and adding them to a database. When a user enters a search term into Google, for example, that query is matched against Google’s database of websites it has crawled. It is from that internal database that a list of search results is produced for the user.
The robots.txt file is used to put up a KEEP OUT sign for files on your website that you don’t want search engine robots to see. Since these files will be skipped by the robot, they won’t be indexed in the search engine’s database, and they won’t show up in search results.
Reputable search engines all program their robots to look for a robots.txt file on every website they find. If that file exists, the robot will follow its instructions regarding any files or folders it should skip.
(Take note that this is entirely voluntary on the search engine’s part. Rogue search engines can and do ignore the instructions in robots.txt. In fact, some bad actors may actually be attracted to the parts of your website robots.txt says to avoid, on the theory that if you want to hide something, there might be something there they can exploit.)
How to set up a robots.txt file
I’m going to describe how I set up my robots.txt file to address my specific need. You can read a more general description of the various ways robots.txt can be used here.
First of all, to use a robots.txt file you must have access to the top-level directory of your website. That’s where the robots.txt file will be placed.
For example, if your website is
http://www.example.com
then the robots.txt file must have the name
http://www.example.com/robots.txt
Note that if you put robots.txt anywhere else on the site, it won’t be recognized. For example, if you put your robots.txt into a folder called mygoodstuff:
http://www.example.com/mygoodstuff/robots.txt
web crawling robots will not recognize it, and won’t heed its instructions.
Also note that capitalization matters! The file name must be robots.txt and nothing else. ROBOTS.TXT or Robots.Txt won’t work.
The contents of a robots.txt file
Here’s what the contents of a typical robots.txt file might look like:
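```
User-agent: *
Disallow: /folder-to-ignore/
```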
The User-agent term specifies the particular search engines to which this directive is addressed. The * in the above example signifies that it applies to all search engines. If you only want your instructions to apply to Google, for example, you would use:
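Google’s web crawler identifies itself with the user-agent token Googlebot:

```
User-agent: Googlebot
Disallow: /folder-to-ignore/
```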
This would restrict only Google, and not any other search engines, from accessing the folders or files you list.
The Disallow term specifies which folders or files are not to be searched or recognized by the robot. In the example above, I don't want the contents of a folder called folder-to-ignore to be indexed by search engines. So, my Disallow statement instructs web crawlers to ignore the following URL:
http://www.example.com/folder-to-ignore/
Multiple folders or files can be specified:
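Each Disallow line names one path to skip; the folder and file names below are just placeholders:

```
User-agent: *
Disallow: /folder-to-ignore/
Disallow: /drafts/
Disallow: /old-notes.html
```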
Creating a robots.txt file
Any text editor, such as Notepad in Windows, may be used to create robots.txt files. Note that if a word processor, such as Microsoft Word, is used, the output must be saved as a plain text (.txt) file. Otherwise, the file may contain hidden formatting codes that will invalidate its contents.
Once saved as text, the file must be uploaded to the top level directory of your website. On most servers, that will be the public_html folder.
Upload robots.txt in exactly the same way you normally upload files to the site, making sure it is placed in the proper folder.
Testing your robots.txt file
It’s very important to test your robots.txt setup to ensure that it’s working as you intend. Otherwise you may find that the folders you wanted blocked are still accessible to crawlers, and are showing up in search results. Once that happens, it could take weeks or even months to get them removed from the search engine’s database.
Several free robots.txt testers are available on the web. Here are the ones I used:
Google’s Webmaster Tools robots.txt tester (requires a Google account)
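Incidentally, you can also check robots.txt rules locally before trying a web-based tester. Python’s standard library includes a urllib.robotparser module that parses robots.txt directives and reports whether a given URL would be allowed (the domain, folder, and rules below are illustrative, not my actual site):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules; RobotFileParser can also fetch a
# live file from a site via set_url() and read().
rules = """
User-agent: *
Disallow: /folder-to-ignore/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A URL inside the disallowed folder should be blocked for any robot ("*")
print(parser.can_fetch("*", "http://www.example.com/folder-to-ignore/page.html"))  # False

# A URL outside that folder should still be allowed
print(parser.can_fetch("*", "http://www.example.com/index.html"))  # True
```

This is handy for checking a rule change quickly, but it is no substitute for testing against the search engine’s own tools, since (as I found out) what matters is the copy of robots.txt the search engine has actually seen.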
The GOTCHAs that got me!
Google was unable to see my robots.txt file
I set up my robots.txt file to block a folder called /YCN Archive/. I created that folder on my website and verified that it could be accessed as expected.
I then created a robots.txt file with the following contents:
User-agent: *
Disallow: /YCN Archive/
After uploading this file to my top-level directory, I tested it using the robots.txt tester in Google’s Webmaster Tools. Although I carefully followed the directions given via the Webmaster Tools link above, I immediately ran into a problem. Here’s the totally unexpected error message I got:
But the robots.txt was there! I could see it in my website’s file listing, exactly where it was supposed to be. Why couldn’t Google see it? Eventually I noticed something on the tester page I hadn’t seen before:
The key was in the line that says, “Latest version seen on 7/26/14 …” (I was doing the test several days after 7/26). When I initiated the test, it seems that Google didn’t go out and look at the state of the website at that moment, but relied on its internal picture of what the website looked like the last time it crawled it.
I needed Google to have a current picture of what was on my website. I caused that to happen by using the Fetch as Google function:
Once the Fetch as Google function was performed, Google was able to find the robots.txt file.
Here’s another point to be careful of. In the robots.txt tester, Google listed my website two different ways:
Of course both those entries refer to exactly the same URL. But I had to do individual Google fetches for each to have the robots.txt file recognized. I also did separate tests on each in order to make sure my blocking instructions would be carried out no matter which URL was used to access the site.
My robots.txt file didn’t work!
Now that Google could see my robots.txt file, I ran the test, confident of success. It still didn’t work. This time, the test reported that although my robots.txt was now recognized, it was not blocking access to the /YCN Archive/ folder. Web crawler access to that folder was still "ALLOWED."
No spaces allowed in the disallowed folder or file’s name
I knew my robots.txt was set up correctly, so I was baffled that it was not blocking access to the specified folder. It took me some time to figure out what was going on: my folder had a space in its name! When I renamed the folder to remove the space, the Google robots.txt tester showed the folder as blocked.
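In robots.txt terms, the fix amounted to this (the renamed, space-free folder name shown here is just an illustration):

```
# Not blocked -- the path contains a space:
Disallow: /YCN Archive/

# Blocked as expected after renaming the folder:
Disallow: /YCN-Archive/
```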
robots.txt does its job
Since I put my robots.txt in place, it’s done its job silently and efficiently. My files are safely archived online, and can be accessed by anyone to whom I give the URL. But none of them show up in search engine results.
© 2014 Ronald E. Franklin