I am in charge of a few websites (about 250), a mix of corporate sites and e-commerce sites. In October 2009 one of the programmers at work implemented Google's canonical URL recommendation on our e-commerce platform so that we could expand the number of pages without being penalised for duplicate content.
Basically the system works like this:
We have .htaccess URL rewrite rules in place to make the URLs nicer – there are four types of URL for each product page:
I'm sure there are lots of sites that do the same. The reasoning for these URL types is simple: one product can live in many categories. We also have popup mini pages at the category level, so the user doesn't need to go into the full product page if they don't want to, and we have versions with brands and without.
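To make the setup concrete, here is a sketch in Python of how one product can answer on several URL variants that all map back to a single canonical address. Every path and name below is hypothetical, not taken from the actual site:

```python
# Hypothetical URL variants for one product page; the real site's
# patterns differ. All four variants render the same content.
PRODUCT_CANONICAL = {
    "/garden/widget-pro": "/garden/widget-pro",        # category + product (the canonical form)
    "/brands/acme/widget-pro": "/garden/widget-pro",   # brand + product
    "/widget-pro": "/garden/widget-pro",               # bare product URL
    "/garden/widget-pro/popup": "/garden/widget-pro",  # popup mini page from category level
}

def canonical_for(path: str) -> str:
    """Map any known variant to its canonical product-and-category URL."""
    return PRODUCT_CANONICAL[path]
```

However many variants a product accumulates, they all resolve to the one address that search engines should index.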
Canonical seemed like a fantastic idea – it meant we could enable all the features at the same time and wouldn't have to pick a single URL scheme per site; it could do them all without worrying about duplicate content penalties. So each of the four URL types had a canonical URL added, which was simply the product-and-category URL.
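In page terms the change is tiny: every variant's `<head>` carries a link element pointing at the product-and-category URL. A minimal sketch – the domain and helper name here are mine, not from the post:

```python
def canonical_link_tag(canonical_path: str, domain: str = "https://www.example.com") -> str:
    """Build the <link rel="canonical"> element that every URL variant
    of a product page should carry in its <head>."""
    return f'<link rel="canonical" href="{domain}{canonical_path}" />'

# All four variants of the product page emit the same tag:
print(canonical_link_tag("/garden/widget-pro"))
```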
After about three months one of our sites dropped off the radar. I had a look at the Webmaster Tools pages and noticed that the number of errors for the duplicate fields (duplicate title tags, etc.) had suddenly jumped from around 15 to 1,800!
Obviously Google thought we were serving duplicate content.
I did notice a couple of issues that weren't caused by the canonical URLs (I fixed them, but they had no effect). One good thing did come from it, though: I wrote code, called every time a page is accessed, that compares the URL the visitor requested against the URL they should be on; if they don't match, it 301s them to the proper place.
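The check itself is straightforward. A rough Python equivalent of that per-request guard (the original presumably lived server-side in the platform's own language; paths here are hypothetical) might look like:

```python
from urllib.parse import urlparse

def enforce_canonical(request_url: str, canonical_path: str):
    """Compare the requested URL against the URL the visitor should be on.
    Return (301, canonical_path) when they differ, (200, None) otherwise."""
    requested = urlparse(request_url)
    if requested.path != canonical_path or requested.query:
        return 301, canonical_path
    return 200, None

# A request carrying made-up parameters gets sent to the proper place:
print(enforce_canonical("/garden/widget-pro?x=junk", "/garden/widget-pro"))
# → (301, '/garden/widget-pro')
```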
The actual cause of the problem was Google and 'parameters'. Some of you will have noticed the parameter section in Webmaster Tools, where Google 'learns' the parameters of your site. Basically Google had 'learned' which parameters we use and had started trying random values and strings against them to see what happens. Not good when the pages still render but mostly come back as 404s – the number of 404 pages goes through the roof with random values that don't exist. The solution was the 301 code above.
After implementing the auto-301 code, Google dropped the errors from 1,800 to 1,200 in one crawl. The next day they were down to 900, then 600, then 400, then 100; now they are back to about 20, and the site has reappeared in the SERPs.
If Google had been listening to the canonical tags in those pages, none of these duplicate content issues would have appeared – Google would simply have gone to the canonical URL and checked that the page was the same. (It seems to be doing that now that it can't simply add random parameters.)
I have now implemented this on all of our e-commerce systems, and our development systems too. All it does is redirect the visitor if they try to change the parameters to something they shouldn't.
So, if Google is dropping your sites and you have canonical URLs enabled, make sure Google isn't trying to guess parameters (I sent myself emails whenever Googlebot accessed pages so I could see what it was trying!), and 301 to the proper place if Googlebot – or anyone else – tries to change the parameters.
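Catching the probing itself is easy to sketch: flag any crawler request that carries parameters your site never emits. The parameter names and user-agent check below are assumptions; in practice you would fire off an email or a log entry rather than return a flag:

```python
ALLOWED_PARAMS = {"page", "sort"}  # whatever parameters your site actually emits

def is_parameter_probe(user_agent: str, params: dict) -> bool:
    """True when a crawler request carries parameters we never generate –
    the situation worth alerting on (email, log line, etc.)."""
    unexpected = set(params) - ALLOWED_PARAMS
    return "Googlebot" in user_agent and bool(unexpected)
```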
I also have a similar implementation on this blog:
/hp-mini-210-external-wifi-aerial-mod is a valid URL.
/hp-mini-210-external-wifi-aeria is not, but it will 301 you to the URL above. Try it.