Canonical URLs and Duplicate Content
As we know, there is no such thing as a duplicate content penalty. However, duplicate content is a real issue both for search engines and for anyone who wants to do well in them.
The big three (Google, Yahoo and Microsoft) rarely get together on general standards, but when they do it is worth taking note. All three have a problem cutting through the mess that is the world-wide web and trying to archive it into something that is usable for us all. A big issue in this respect is the amount of duplication the web throws up. This sits in two main areas…
- The plagiarism that inherently exists on the web e.g. “ooh, that is a good article, I will use that on my site”, etc.
- Websites that duplicate information either intentionally or unintentionally.
The latter category is the one that the big three (mainly Google) have long been trying to mitigate. Fortunately, the problem of the “pet insurance london”, “pet insurance cardiff”, etc. pages, which were all basically the same page with different Meta and H1 tags, has pretty much been dealt with (some are still out there though!). Also, on-page data sorting (e.g. sort by price), which generates lots of very similar pages all with different URLs (usually via the “?sort=” parameter), is slowly being tackled. Yahoo also has a parameter-based URL removal tool in its Site Explorer suite.
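To make the “?sort=” problem concrete, here is a minimal Python sketch of how a site might work out the main URL for a sorted listing page. The parameter names in PRESENTATION_PARAMS and the example.com URLs are my own assumptions for illustration, not anything the search engines have published.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these parameters only change how a page is presented,
# not what it contains. Adjust the set for your own site.
PRESENTATION_PARAMS = {"sort", "order"}


def strip_presentation_params(url: str) -> str:
    """Return the URL with presentation-only query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in PRESENTATION_PARAMS
    ]
    # The fragment is dropped as well, since it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(strip_presentation_params("http://www.example.com/pets?cat=dogs&sort=price"))
# http://www.example.com/pets?cat=dogs
```

Every sorted variant of the page then maps back to one URL, which is the one you would want treated as the main version.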
The recent announcement does nothing for the first issue, plagiarism, because you can’t point this tag at pages on an external domain. It does help with the second: it lets webmasters tell search engines which version of a web page is the real one, so that it is treated as the canonical (main, and more or less only) version.
The addition is a tag for the head of your page which is…
<link rel="canonical" href="The URL you want to be the main one for this web page">
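If you build your pages from templates, the tag can be generated rather than typed by hand. The following is only a sketch: canonical_link_tag is a name I have made up, and the URL is a placeholder.

```python
from html import escape


def canonical_link_tag(url: str) -> str:
    """Return a canonical link element for the given URL."""
    # Escape the URL so that characters such as & and " are safe
    # inside the HTML attribute value.
    return '<link rel="canonical" href="{}">'.format(escape(url, quote=True))


print(canonical_link_tag("http://www.example.com/pets?cat=dogs&page=2"))
# <link rel="canonical" href="http://www.example.com/pets?cat=dogs&amp;page=2">
```

The escaping matters because a raw ampersand or quote in the URL would otherwise produce invalid markup in the head of the page.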
As a rule, it is not a bad idea to include this in all of the pages that you create manually (which includes ones that you create in WordPress, etc.; don’t worry, there are some plugins available already). This means that if these pages get tagged with different URLs in one way or another, at least the search engines know which one you meant to be the main one. Also, if you create intentional duplicate content (landing pages, etc.), you can use this tag to avoid confusing the search engines.
The plagiarism issue is still something that the search engines will have to deal with themselves. Also, if you are creating multiple pages (with different URLs) from a database source (feed, CMS, etc.), this will only help you if you can incorporate a dynamic element into the ‘rel’ tag that picks the canonical URL for you. This will work as a good alternative to the ‘noindex’/‘nofollow’ approach, which was my favoured method.
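As a rough idea of what that dynamic element could look like, here is a Python sketch in which the canonical URL is derived from the database record itself rather than from whatever URL the visitor happened to request. SITE_ROOT, the record fields (“slug”, “title”) and the URL pattern are all hypothetical; your CMS will have its own.

```python
from html import escape

# Hypothetical site root and URL pattern, for illustration only.
SITE_ROOT = "http://www.example.com"


def canonical_url_for(record: dict) -> str:
    """Build the one URL we want indexed for this database record."""
    return "{}/insurance/{}".format(SITE_ROOT, record["slug"])


def render_head(record: dict) -> str:
    """Render a minimal <head> that carries the canonical link."""
    title = escape(record["title"])
    href = escape(canonical_url_for(record), quote=True)
    return (
        "<head>\n"
        "  <title>{}</title>\n"
        '  <link rel="canonical" href="{}">\n'
        "</head>"
    ).format(title, href)


record = {"slug": "pet-insurance", "title": "Pet Insurance"}
print(render_head(record))
```

Because the URL comes from the record, every variant of the page (sorted, tagged, tracked, etc.) carries the same canonical link without any extra work per page.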
It is well to remember that this is not a panacea. You should still look to rationalise your URLs and ensure that duplicates of them are not indexed. Remember, you are still passing PageRank (through the links you don’t ‘nofollow’, anyway) to any page you link to on your site, even the ‘noindex’ pages, so still make sure you are not giving away credit to meaningless pages.
However, overall this is really good, especially since it creates a ‘301 redirect’ style environment (so the good stuff gets passed back to the original page too), and it is therefore better than ‘noindex’ in many ways. Remember that this will only work within your own domain and for links between its pages (i.e. you can’t use the tag to point to external domains from the domain you are working on).
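Since the tag is only meant to work within your own domain, a quick sanity check before emitting it may be worthwhile. The sketch below is deliberately strict and is my own simplification: it compares exact host names, so it would also reject a canonical URL on a different subdomain. The function name and hosts are made up for the example.

```python
from urllib.parse import urlsplit


def is_same_host(page_url: str, canonical_url: str) -> bool:
    """Return True when both URLs are served from the same host name."""
    # urlsplit(...).hostname is already lower-cased, so the comparison
    # is case-insensitive. Assumption: exact host match is required.
    page_host = urlsplit(page_url).hostname
    canonical_host = urlsplit(canonical_url).hostname
    return page_host is not None and page_host == canonical_host


print(is_same_host("http://www.example.com/pets?sort=price",
                   "http://www.example.com/pets"))   # True
print(is_same_host("http://proxy.example.net/pets",
                   "http://www.example.com/pets"))   # False
```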
If used well this is an excellent addition to your SEM efforts.




Does this also take care of the URL hijacking problem (more info in this blog post)? This fix is said to work only on the site it is on and not on external URLs. So if the URL starts ‘www.someproxy.blah…’, should this be viewed as an external URL even if the end result is the spoofed page? Will the canonical tag work? I would guess not. Comments and opinions are very welcome.