
Why You Should Care About the Humble Canonical Tag (And Why 56% of Web Pages Don't)

Regan Kerr

June 24, 2020

Canonical tags have been around since 2009, more than a decade now, but they’re still one of the most underused (and under-appreciated) tools in content publishing and ecommerce. Often confused, often misused, and sometimes abused, the canonical tag is vitally important both for finding content on the internet and for helping internal site search return great results for users.

At Sajari, our website search product means we interact with different sets of site data and site structures. One thing that we’ve noticed is that some site search and ecommerce customers aren’t setting their canonical tags correctly, and that’s a problem.

Besides preventing your site search from working properly, a lack of canonical tags may mean that you’re missing out on SEO domain authority, as well as a bunch of other #content wins. Read our short guide to canonicals below and grab yourself some low-hanging fruit for your marketing efforts - set those canonicals today!

TL;DR

If a site has duplicate or similar content available under different URLs, the canonical tag is a great way to direct Google - or your onsite search - to the best result. Even if it's the same content on the page, a different URL means search engines see your content as a different page. Consolidate your similar pages using the canonical tag; otherwise, search engines and your customers can get confused and upset. If you want to check how many canonicals you have, use Sajari’s site search health report tool.

What is a canonical?

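A canonical tag lets a crawler know which version of a page is the "original".

Canonical tags are set in the page’s <head> section using a link element such as <link rel="canonical" href="https://example.com/preferred-page/" />, where the href (a placeholder URL here) is the address you want treated as the original.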

You might think of each different page on your website as being "the original" because you made every page unique, but that’s rarely the case. Here’s why.

When a site is made searchable, what actually happens is that a piece of code from the search engine (the "crawler") looks at all your site’s pages, maps where they sit in relation to other pages, and records what content is on them.

This information is added to what’s called the "index". When people want to search your site, the search engine then just has to look through the index; it’s much quicker than looking through your site again with a crawler.

Every so often the crawler will go back over your site to check if anything has changed, see if there’s anything new, and whether the index needs to be updated.

Sounds simple enough, right?

Well, it is when you think about how people view web pages: by content, not by URL. A search engine, however, treats every individual URL as a separate page, regardless of content. This means that https://sajari.com is different from www.sajari.com, even though both URLs return the same content. They’re both different from sajari.com, and they’re all different from https://www.sajari.com.
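To make the point concrete, here is a minimal Python sketch of how a URL-keyed index could end up with four entries for one home page. It assumes the scheme-less addresses above are requested over http, and the `url_key` helper is a hypothetical illustration, not how any particular crawler works.

```python
from urllib.parse import urlsplit

# Four spellings of what a person would call "the Sajari home page".
# Assumption: the scheme-less forms from the text are requested over http.
variants = [
    "https://sajari.com",
    "http://www.sajari.com",
    "http://sajari.com",
    "https://www.sajari.com",
]


def url_key(url):
    """Identity a URL-based index might use: exact scheme, host, and path."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc.lower(), parts.path or "/")


# A set keeps one entry per distinct key, just like a naive index would.
distinct_pages = {url_key(url) for url in variants}
print(len(distinct_pages), "index entries for one piece of content")
```

Every variant produces a different key, so without a canonical (or a redirect) nothing tells the index that these are the same page.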

You can imagine how this problem compounds if you’re doing something like keeping your mobile site on its own subdomain.

This is a simple example of why canonicals are needed, but depending on the way your site is designed and structured, it could get messy really quickly. Duplicate content often isn’t great for your site’s SEO and is really frustrating for crawlers to work with. You might already have a whole lot of it on your site that you don’t even know about.

If you have an ecommerce website and use dynamic pages which change the URL depending on different options, such as location, size, or color, and don’t have a canonical set, you’ve probably got duplicate content for all the possible combinations of location, size and color.

If you use session IDs and append them onto your URLs without pointing to a canonical URL, web crawlers will see every session’s URL as yet another duplicate of the same content. Automatically generated URL parameters will cause the same problem.
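One rough way to see how much of this duplication you have is to strip the parameters that don't change the content and group what is left. The sketch below does that in Python; the parameter names in `NON_CONTENT_PARAMS` and the example.com URLs are assumptions for illustration, so substitute whatever your own site appends.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that do not change what the page shows.
NON_CONTENT_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign"}


def strip_non_content_params(url):
    """Return the URL without session/tracking parameters or fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CONTENT_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each cleaned URL to the crawled URLs that collapse onto it."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_non_content_params(url)].append(url)
    return dict(groups)


crawled = [
    "https://example.com/shoes?sessionid=abc",
    "https://example.com/shoes?sessionid=def&utm_source=news",
    "https://example.com/shoes?color=red&sessionid=xyz",
]
for cleaned, originals in group_duplicates(crawled).items():
    print(cleaned, "<-", len(originals), "crawled URL(s)")
```

Any group with more than one crawled URL is a candidate for a shared canonical.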

By setting a canonical, you let crawlers, and the search engine behind them, know which version is the original: the "true" version. Here’s an emoji example to illustrate.

The jaundice-skinned fellow is the canonical "man walking" emoji. The others are also "man walking" emojis, but they are variations added at a later date. Everyone recognizes them as the same emoji, just not as the "original" one.

What other problems do canonicals solve?

Similar pages competing. You might have a bunch of products with different options, such as size or color, all living under "similar but different" URLs, or in different categories accessed by different URL paths. You might also have regional variations for your website, where the content is very similar on some pages, but the URL is different.

You don’t want all your similar pages competing for a search engine’s attention when one will do. By setting a canonical, you can assign all the link signals and other SEO value to one page, and remove the non-canonical versions from the search index.

If you don’t have canonicals specified for Google, the crawler that indexes your site content will just take a guess. Google can be pretty good at this, but it’s always best to set your own according to your site requirements, and have them in your site’s code. This also allows Google to crawl your site more efficiently, so new content gets indexed faster.

Cross-domain authority. As mentioned above, while two pages might have exactly the same content, they live under different URLs. If you allow republishing or syndication of your content, they could even live under different domains. This means a search engine might not know which page to use in search results, or worse, might rank the republished version of your content above your own website. When that happens, the domain authority, subsequent backlinks, and other SEO signals won’t be assigned to you.

You can use canonicals across domains to stop this happening, and make sure you get all the SEO juice that you’re entitled to. It can also help when your content is scraped and hosted elsewhere by nefarious individuals: if the copied pages keep your canonical tags, they point search engines back to your originals.

Analytics. Setting a canonical URL for similar pages in your user journey will also clean up your analytics and reporting. Instead of having multiple "same same but different" landing pages in your reporting, setting just one page as the canonical allows you to better understand user journeys and sales funnel performance. This is demonstrated rather simply below.

Setting canonicals keeps your data clean so you can easily see what's happening.

Why don’t people use canonicals?

Some people don’t understand them properly, or can’t be bothered setting them up. They’re not strictly necessary, and they’re not a "hard" signal like a page redirect — crawlers can ignore canonicals, especially if they realize that the pages pointing to a canonical aren’t actually that similar.

Some people prefer to use 301 redirects, which reroute users to the "correct" URL while telling search engines that the content has permanently moved to a new address. This is often done in an attempt to avoid duplicate content and to control how your site appears to Google, though it’s debatable as to whether duplicate content is even penalized that heavily. It’s also especially hard to implement on larger sites with lots of content and pages to redirect.

So what’s the difference between a canonical and a 301 redirect?

A 301 redirect is used when a page’s URL is being permanently changed, and it transfers all the previous authority and linking "power" to the new URL. While canonicals are implemented at the page level, 301 redirects are implemented at the server level. A 301 redirect sends the user to a different URL, while a canonical tag is purely for the benefit of search engines.

As you can imagine, there’s quite a bit of confusion between the two.

301 redirects are useful for when the location of a page changes permanently. This could be as small as a deleted product or updated blog posts, or as large as redirecting your entire site to send visitors through "https://" instead of "http://".

You can also use 301 redirects in combination with canonical tags on your site. The 301 redirects your users to a consistent home page URL, while the canonical tag makes sure that search engines know which version of any subsequent pages is the most correct.
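As a rough illustration of the two working together, the Python sketch below models a server's decision as a plain function. The `PREFERRED_HOST` value and the `respond` helper are hypothetical, and it assumes that query-string variants of a page should all share one canonical, which may not suit every site.

```python
# Hypothetical preferred hostname for the site.
PREFERRED_HOST = "www.example.com"


def respond(host, path):
    """Return (status, headers, body) for a request to host + path."""
    if host != PREFERRED_HOST:
        # Server-level 301: the visitor is actually sent to another URL.
        return 301, {"Location": f"https://{PREFERRED_HOST}{path}"}, ""

    # Page-level canonical: the visitor stays on the URL they asked for,
    # and only crawlers are told which version is the original.
    # Assumption: query-string variants share one canonical page.
    canonical_path = path.split("?", 1)[0]
    body = (
        "<head>"
        f'<link rel="canonical" href="https://{PREFERRED_HOST}{canonical_path}" />'
        "</head>"
    )
    return 200, {"Content-Type": "text/html"}, body


print(respond("example.com", "/shoes"))
print(respond("www.example.com", "/shoes?color=red"))
```

The first request never sees page content, only a new address; the second gets the page it asked for, plus a hint for crawlers.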

A 301 redirect would be like saying that all the variations of "man walking" no longer exist, and this is the only "man walking" emoji anyone can use, ever again.

So how many webpages don’t have a canonical tag set?

Based on our research and the data gathered from our site search health reports, we estimate that around 56% of web pages don’t have a canonical set.

This was a quick sampling of only 50 site search health reports, but there’s a clear pattern emerging — sites are either very good at setting their canonicals, or very bad.

We also don’t know how many of the sites have their canonicals set correctly until we start indexing and performing searches on their website. Just because they’re set doesn’t mean they’re necessarily doing the right thing — some websites have every canonical set to the homepage, which defeats the point of even having the tag set.
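A check for both situations (missing canonicals, and canonicals that all point at one URL) might look something like the Python sketch below. It is not the logic behind our health report; the `canonical_report` function and its input format, a mapping of page URL to declared canonical or `None`, are assumptions made for the example.

```python
from collections import Counter


def canonical_report(pages):
    """Summarize canonical usage.

    `pages` maps each page URL to its declared canonical URL, or None
    when the page has no canonical tag.
    """
    total = len(pages)
    declared = [canonical for canonical in pages.values() if canonical]
    targets = Counter(declared)
    top_target, top_count = targets.most_common(1)[0] if targets else (None, 0)
    return {
        "percent_missing": round(100 * (total - len(declared)) / total, 1) if total else 0.0,
        # Suspicious: several pages declare a canonical and all agree on one URL.
        "all_point_to_one_url": len(declared) > 1 and top_count == len(declared),
        "most_common_target": top_target,
    }


sample = {
    "https://example.com/": "https://example.com/",
    "https://example.com/shoes": "https://example.com/",
    "https://example.com/hats": "https://example.com/",
    "https://example.com/about": None,
}
print(canonical_report(sample))
```

A site where `all_point_to_one_url` is true has canonicals "set" on paper while telling crawlers that every page is really the homepage.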

Why does Sajari care so much about canonicals?

We believe people should be using canonicals because:

They’re important to our crawler that indexes your site.
They’re important to how your users discover content on your site.
They’re part of web standards, and we try to play by the rules.

While most of the discussion above talks about canonicals in relation to web search engines (externally), they can be much more vital to your internal website search — it’s a whole different ball game to optimizing for a search engine like Google. Without canonicals, website search has your pages competing against themselves.

It’s a pretty awful user experience if you’re returning a bunch of very similar pages in your internal search. It could be confusing, as the user isn’t sure if your site search has broken or not — why not just show one version of the same result?

It can also be frustrating for users to see what appears to be duplicates of the same result when they’re looking for something else, especially if it knocks the relevant result — the one they’re actually looking for — further down the results.

By setting a canonical for your similar pages, you make it much easier for your site search provider to serve relevant results to your users.
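To show why this helps, here is a minimal sketch of how a search layer could collapse results that share a canonical. It is an illustration only, not a description of Sajari's engine; the result dictionaries with `url` and `canonical` keys are a made-up format.

```python
def collapse_by_canonical(results):
    """Keep the highest-ranked result for each canonical URL.

    `results` is a ranked list of dicts with a "url" key and an optional
    "canonical" key. Pages without a canonical are keyed by their own URL,
    so they can never be merged with their near-duplicates.
    """
    seen = set()
    collapsed = []
    for result in results:
        key = result.get("canonical") or result["url"]
        if key not in seen:
            seen.add(key)
            collapsed.append(result)
    return collapsed


ranked = [
    {"url": "https://example.com/shoes?color=red", "canonical": "https://example.com/shoes"},
    {"url": "https://example.com/shoes?color=blue", "canonical": "https://example.com/shoes"},
    {"url": "https://example.com/returns", "canonical": None},
]
for result in collapse_by_canonical(ranked):
    print(result["url"])
```

With the canonical set, the two shoe pages become one result and the returns page moves up a place. Without it, the search layer has nothing reliable to group on.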

How can I use canonicals better?

Use self-referential canonicals. It’s perfectly fine for a page to specify that it itself is the canonical URL. It can help keep your site structure organized, as long as any other similar pages aren’t doing the same thing.

If every duplicate or near-duplicate page on your site lists itself as the canonical (some content management systems do this automatically), you’ll run into the same problem as having no canonicals set at all. There can be only one canonical for each piece of similar content.

Don’t set canonicals on content that differs too much. Google’s crawler is smart enough to differentiate between pages that are the same and pages that are similar. Pages that only differ in a few places, like regional variants of a product page, might be OK to count together using a canonical tag, but if they differ too much, Google will simply ignore your tag.

In an internal context, your users might not be able to find valuable information they need, because the page containing it has a poorly set canonical. This can be an issue if you’re reusing page templates in other places on your site, and forget to change the canonical attribute.

Don’t use multiple canonicals. If a page specifies more than one canonical, all will be ignored. Your page can’t be two things at once.
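If you want to spot pages with zero or multiple canonicals yourself, a small parser is enough. The sketch below uses Python's standard `html.parser` module; the `CanonicalFinder` class and the `canonical_status` labels are names invented for this example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values; valueless attributes are None.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def canonical_status(html):
    """Return "missing", "ok", or "multiple" for a page's HTML."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing"
    return "ok" if len(finder.canonicals) == 1 else "multiple"


page = (
    '<head><link rel="canonical" href="https://example.com/a" />'
    '<link rel="canonical" href="https://example.com/b" /></head>'
)
print(canonical_status(page))
```

This only inspects the HTML it is given; canonicals sent another way, such as in an HTTP header, would need a separate check.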

Stop using session IDs in URLs, start using cookies. Cookies might not be as accurate for your analytics if users have them turned off, but it’s a neater solution. If you absolutely have to use session IDs, put them after a hash (#), not a question mark (?), because crawlers generally ignore everything after the hash.
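The difference between the two separators is easy to see with Python's standard `urllib.parse.urldefrag`, which splits off the fragment (the part after the hash). The URLs and the `session` parameter name below are placeholders.

```python
from urllib.parse import urldefrag

url_with_hash = "https://example.com/cart#session=abc123"
url_with_query = "https://example.com/cart?session=abc123"

# The fragment is split off, leaving the same clean URL for every session.
print(urldefrag(url_with_hash).url)   # https://example.com/cart

# The query string is part of the URL proper, so every session stays distinct.
print(urldefrag(url_with_query).url)  # https://example.com/cart?session=abc123
```

Anything that identifies pages by URL minus fragment sees one cart page in the first case and a new page per visitor in the second.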

How can I check whether I have canonical URLs set on my site or not?

Well, how lucky you asked.

A rudimentary way to check where you’re at is by Googling within your domain and seeing what URL is displayed in search results. Just Google using the phrase "site:yoursite.com" and see whether the URL structure is consistent.

As mentioned, the above method is pretty quick and dirty — you’re probably going to need more data to see how much (or how little) work you have to do.

This has become such a discussion point for us with customers that we created a tool to do some automated analysis and scoring to help businesses identify issues like poorly set canonicals. We’ve built a search health report for websites which checks that all the information required for proper crawling by a search engine:

exists,
isn’t a duplicate,
and is in the right place.

We use the search health report to work with our clients on getting their websites set up for search by our search engine. It’s completely free, only requires your homepage URL, and will generate you a report detailing the percentage of pages on your site using:

Canonicals
Open-graph tags
schema.org entities

You’ll also receive a bunch of other useful site health measures, like how fast your site loads, how any 404 links resolve, whether you have a valid site map, and plenty of other useful things to know.

There’s so much information out there in the world, and it’s getting increasingly hard to keep it all organized. At Sajari we believe that outstanding search is the key to an outstanding business, and canonical tags are a small but important part of a healthy website. Check that yours are set correctly today.
