OK, on my “post per day” quest, sometimes we need to cover in-depth topics and sometimes the more basic ones. Today is one of the latter, but get the robots.txt file wrong and you could drop your pages out of all the (decent) search engines.
Well, robots.txt is a file that should sit in the root directory of your web server (robots only look for it there), and it is the file that all the major search engine robots access before they start doing their collection work on your site. The best way of thinking about this file is that it is like leaving a note for someone with directions. The main difference is that the note (mainly) contains instructions for the things that you do NOT want that person to do.
You can create a robots.txt file in any text editor, there are automated programmes that will generate one for you, and Google has a basic editor in its Webmaster Tools programme. A basic file has two parts: a user agent line and one or more disallow lines.
The user agent part is where you specify which robots you want to give instructions to; the ‘*’ means “all spiders”. Alternatively, you can put a specific robot’s name here (‘Googlebot’ for Google, for example) if you only want that robot to follow that particular instruction. The disallow part is (as you would expect) the parts of your web server that you would rather the robots not visit. This could be for privacy reasons (suck eggs time: robots.txt is only a polite request, so you really shouldn’t upload things you don’t want people to access to unprotected areas of your web server) or because you want the spider (who is short on time and patience) to skip the unimportant parts of your site and get quickly to the important stuff.
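To make the two parts concrete, here is a minimal sketch of a basic file, checked with Python's standard urllib.robotparser module. The /private/ and /tmp/ directories and the www.example.com address are made up for illustration; swap in your own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot ("*") is asked to stay out of two
# made-up directories; everything else on the site stays open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="*"):
    """Return True if `agent` may fetch `path` under the rules above."""
    # www.example.com is a placeholder host, not a real site setting.
    return parser.can_fetch(agent, "https://www.example.com" + path)


if __name__ == "__main__":
    for path in ("/index.html", "/private/report.html", "/tmp/cache.html"):
        print(path, "allowed" if allowed(path) else "disallowed")
```

Because the rule is written for ‘*’, a named robot such as Googlebot gets the same answer as everyone else unless you give it its own user agent section.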
The main thing to remember about disallowing is NOT to disallow a bare ‘/’ for all robots (unless this really is what you want to happen). That tells the robots to stay away from everything on your site.
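A quick way to see how much damage that one character does is to compare it with an empty disallow line. This is only a sketch using Python's standard urllib.robotparser; the paths and the www.example.com host are placeholders.

```python
from urllib.robotparser import RobotFileParser


def build(lines):
    """Parse a list of robots.txt lines into a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


# "Disallow: /" asks every robot to stay away from the whole site.
block_all = build(["User-agent: *", "Disallow: /"])

# An empty "Disallow:" means nothing is off limits.
allow_all = build(["User-agent: *", "Disallow:"])

# Placeholder paths, purely for illustration.
SAMPLE_PATHS = ["/", "/index.html", "/products/widget.html"]


def fetchable(parser, paths):
    """Return the subset of `paths` that any robot ("*") may fetch."""
    return [p for p in paths
            if parser.can_fetch("*", "https://www.example.com" + p)]


if __name__ == "__main__":
    print("Disallow: /  ->", fetchable(block_all, SAMPLE_PATHS))
    print("Disallow:    ->", fetchable(allow_all, SAMPLE_PATHS))
```

The two files differ by a single slash, which is why it is worth checking yours before uploading it.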
Will disallowing pages/sections stop them getting indexed by Google, etc? Err, no, actually. If someone on the web links to pages that you have disallowed, the search engines can still pick up those links and list the pages in their index, even without crawling them. If there are pages you truly don’t want indexed then you need to use “noindex” in a robots meta tag on that page (and leave the page crawlable, because a robot that is blocked from the page never gets to read the tag), something like this:
<meta name="ROBOTS" content="NOINDEX,FOLLOW">
This means that you don’t want the search engine to index the page, but you are quite happy for it to follow links on the page (more on this in another post).
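If you want to double-check which instructions a page is actually carrying, a small script can pull them out of the HTML. This is a rough sketch using Python's standard html.parser; the RobotsMetaParser class and robots_directives function are names invented for this example, and the sample page is made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up page carrying the tag discussed above.
SAMPLE_PAGE = (
    "<html><head>"
    '<meta name="ROBOTS" content="NOINDEX,FOLLOW">'
    "</head><body>Hello</body></html>"
)

if __name__ == "__main__":
    print(sorted(robots_directives(SAMPLE_PAGE)))
```

The check is case-insensitive, so ROBOTS/robots and NOINDEX/noindex are treated the same.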
Another good feature is that you can tell the robot where your XML sitemap is (again, more in another post) by adding a ‘Sitemap:’ line that gives the sitemap’s full URL.
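As a sketch of what that looks like, the file below adds a sitemap line to a basic rule set and reads it back with Python's standard urllib.robotparser (the site_maps() method needs Python 3.8 or later). The sitemap URL and the /private/ path are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a sitemap line; the URL is a placeholder.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_WITH_SITEMAP.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = sitemap_parser.site_maps()

if __name__ == "__main__":
    print(sitemaps)
```

The sitemap line sits outside the user agent sections, so it applies to every robot that reads the file.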
Another handy thing: because all good robots access the file before they visit the rest of your site, you can look at the log files and count how many times robots.txt has been requested. Over time that gives you an idea of how frequently those little robots are accessing your stuff.
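Counting those requests by hand gets old quickly, so here is a rough sketch that does it for you. It assumes your server writes logs in the common “combined” format; the sample lines, IP addresses, and robot names (ExampleBot, OtherBot) are all made up, and robots_txt_hits is just a name chosen for this example.

```python
import re
from collections import Counter

# Pulls the requested path and the user agent out of a "combined" log line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_hits(log_lines):
    """Count requests for /robots.txt, grouped by user agent string."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        # Ignore any query string when comparing the path.
        if match and match.group("path").split("?")[0] == "/robots.txt":
            hits[match.group("agent")] += 1
    return hits


# Made-up sample log lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jan/2024:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [01/Jan/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.7 - - [01/Jan/2024:11:30:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "OtherBot/2.0"',
    '192.0.2.1 - - [02/Jan/2024:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    for agent, count in robots_txt_hits(SAMPLE_LOG).most_common():
        print(agent, count)
```

To run it on a real log, pass an open file instead of the sample list, since the function only needs something it can loop over line by line.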
If you are still worried or unsure about all of this, Google have a nice little checker in their Webmaster Tools programme. You can see if the file is in the right place and also see if you have made some (big and small) errors.
Robots.txt won’t make or break your SEM efforts, but get it right and it will help. Get it wrong and prepare to have to sit around until the search engines re-index your site… not nice.