We will give you all the tips to fix duplicate content. You will get rid of duplicate content which can unfortunately affect SEO.


Duplicate content definition

Duplicate content is content that appears more than once, either within a single website or across two different websites (in the latter case, it is also called plagiarism). In other words, the same content (text, images, etc.) is found on two pages with two different URLs. It is a recurring issue, especially for e-commerce sites, which fall into this trap very quickly because of the many filters offered to refine the user's browsing experience.


Duplicate content, or "DC", can lead to an algorithmic penalty from Google Panda. Unlike Penguin, Panda does not penalize the entire website but only the duplicate pages, which may be downgraded or disappear from the search results entirely. For once, we can only approve of the search engine's approach. Indeed, what would be the point of a results page filled with the same content?

How does duplicate content occur within a website?

We could legitimately think that knowing this rule is enough to be safe from duplicate content.

"I don't copy-paste my listings from another website or from any page on my own website, so it's okay! "

If only it were that simple!

Today, with the proliferation of mega-menus and faceted filters, duplicate content is a true sword of Damocles: it can creep in without you even realising it. According to some estimates, 29% of the web is duplicate content!

Here are the most common causes of duplicate content.

URL parameters and tracking codes

URL parameters and tracking codes are a frequent source of duplicate content. The problem can be caused not only by the parameters themselves, but also by the order in which they appear in the URL.

For example:
https://exemple.com/produits/femmes/robes/vert.html can be duplicated with
https://exemple.com/produits/femmes/?category=robes&color=vert

Moreover, a user's own session can generate duplicate content. If a session ID is automatically created and added as a URL parameter, it can create duplicates whenever that URL is shared elsewhere and crawled by Google.
Since it's very difficult to anticipate the consequences of having URL parameters, it's better to avoid them as much as possible. In any case, URLs with parameters are usually poorly indexed or poorly ranked on Google.
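As an illustration, here is a minimal Python sketch showing how two URLs that differ only in parameter order or tracking codes can be collapsed into a single normalized form. The helper name and the list of tracking parameters are assumptions for the example, not part of any particular SEO tool:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Parameters that do not change the page content (illustrative list)
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def normalize_url(url: str) -> str:
    """Drop tracking parameters and sort the rest so that equivalent
    URLs end up with exactly the same string."""
    parts = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(params))))

# Both URLs serve the same content but would be crawled as two different pages
a = "https://exemple.com/produits/femmes/?category=robes&color=vert&utm_source=news"
b = "https://exemple.com/produits/femmes/?color=vert&category=robes"
print(normalize_url(a) == normalize_url(b))  # True
```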

Faceted navigation

Facets, more commonly referred to as filters, are SEOs’ bane. If they are rather easy to manage on small websites, they can become hard to handle on very large websites.

Let's take an example:

- The Women's Pants category page has 10 products
- Of these 10 products, 9 are red

By clicking on the “red” filter, you end up with 9 products. The pages with and without a “red” filter look very similar!

(Screenshot: faceted navigation filters on Zalando)

Another example, this time on a spare parts website for two-wheeled vehicles:

- 10 scooter carburetors of brand X and dimension Y are also compatible with mopeds and 50cc motorcycles
- In each Scooter/Moped/50cc Motorcycle category, if you select brand X and dimension Y carburetors, you end up with the same products… and therefore duplicate content!

On websites that do not use URL rewriting, this duplicate content can even extend to product listings that fall into multiple categories, with different URLs.

HTTP/HTTPS

When migrating from HTTP to HTTPS, the risk of duplicate content increases considerably if you don't take the time to run a few checks. A page that remains accessible in both its HTTP and HTTPS versions is seen by search engines as two strictly identical pages, which can be penalizing.
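A quick way to verify the migration is to check that the HTTP version of a page answers with a permanent (301) redirect to its HTTPS counterpart. Here is a minimal sketch, assuming the third-party requests library is installed; the URL is purely illustrative:

```python
import requests

def redirects_to_https(http_url: str) -> bool:
    """Return True if the HTTP URL answers with a 301 redirect to an HTTPS URL."""
    response = requests.get(http_url, allow_redirects=False, timeout=10)
    location = response.headers.get("Location", "")
    return response.status_code == 301 and location.startswith("https://")

# Example check on an illustrative URL
print(redirects_to_https("http://exemple.com/produits/femmes/robes/vert.html"))
```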

Copy-pasted content

The recurring issue that all e-commerce websites face is enriching their product pages. Some websites have tens of thousands of products to put online, and some of those products differ only by colour or size. Few websites have the human resources to write unique content for every product. Google claims to apply a certain tolerance, but in practice we see that not all of these pages get indexed or ranked.

What are the consequences of duplicate content?

Duplicate content will impact the way content is indexed by search engines.

- They will have to choose which version of the content to index.

- In addition, search engines will spend time crawling the same content several times (depending on how many times it is duplicated), and will therefore crawl and index your other valuable content less efficiently.

Once again, search engines want to provide the best user experience. That's why they will not display multiple versions of the same content, and will always pick the version they believe is the best.

If you don't take action, you risk a drop in rankings and a loss of traffic.

Google may also remove some pages from the search results.

Finally, the last risk concerns popularity.

Indeed, whether you acquire backlinks proactively or naturally, having several entry points dilutes the value of these inbound links. If all of these links pointed to a single page instead, its authority and popularity would be greater.

In short, duplicate content limits the content's potential in terms of search engine visibility and negatively affects SEO traffic.

But there are solutions!

How to fix or remove duplicate content?

Detecting external duplicate content

Whenever plagiarism is suspected, there are tools available to detect websites that have copied your content. Positeo, Plagium and Copyscape offer free access, but the free versions quickly show their limits: if you want to detect duplicate content at scale, you will need a paid plan.

Detecting internal duplicate content

Only a crawl tool can shed light on a website's internal duplicate content. Among the most capable are Botify and Oncrawl. There are also lighter tools suited to small websites, such as Site Analyzer or Screaming Frog in its freemium version.
These tools show you the percentage of duplication between pages. The most powerful ones distinguish duplication of the content itself (product descriptions, for example) from duplication of the template (elements found on every page of the website).
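To give an idea of how this kind of percentage can be computed, here is a minimal, dependency-free sketch that compares two pages' text using the Jaccard similarity of word shingles. The shingle size and sample texts are arbitrary choices for the example, not how Botify or Oncrawl actually work:

```python
def shingles(text: str, size: int = 5) -> set:
    """Split a page's text into overlapping groups of `size` consecutive words."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def duplication_ratio(text_a: str, text_b: str) -> float:
    """Jaccard similarity between the two texts: 1.0 means identical shingles."""
    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

page_1 = "Carburetor for scooter, brand X, dimension Y, free delivery within 48 hours"
page_2 = "Carburetor for moped, brand X, dimension Y, free delivery within 48 hours"
print(f"Duplication: {duplication_ratio(page_1, page_2):.0%}")
```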

How to prevent duplicate content?

Fortunately, there are techniques to prevent duplicate content, even in its least obvious forms.

Avoid duplicate content with the rel=canonical attribute

The best solution to avoid duplicate content is to use the rel=canonical attribute. It tells search engines which URL should be considered the original. So, when bots come across a duplicate page, they know they should ignore it in favour of the canonical version.

The rel="canonical" attribute goes directly in the page's HTML <head>. It looks like this:

General format (the URL is taken from the earlier example):

<head>
... [header code] ...
<link rel="canonical" href="https://exemple.com/produits/femmes/robes/vert.html" />
</head>

This tag must be added on each duplicate version. The original page must also include a canonical URL, which will point to itself.
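At scale, you can verify that the tag is actually present on each page with a small script. Here is a minimal sketch using only the Python standard library; the URL is illustrative and the regular expression is deliberately simplistic:

```python
import re
from urllib.request import urlopen

def get_canonical(url):
    """Return the href of the first rel=canonical link tag found in the page, if any."""
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    tag = re.search(r'<link[^>]*rel=["\']canonical["\'][^>]*>', html, re.IGNORECASE)
    if not tag:
        return None
    href = re.search(r'href=["\']([^"\']+)', tag.group(0))
    return href.group(1) if href else None

# Illustrative check: the filtered URL should declare the category page as canonical
print(get_canonical("https://exemple.com/produits/femmes/?category=robes&color=vert"))
```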

301 redirects

Sometimes duplicate content is an isolated, one-off case. This could be, for example, a new product page with a new reference whose content is identical to the product's old version (now out of stock). In this case, a canonical tag is not the most appropriate answer, because search engines would continue to crawl the old, now obsolete version. A 301 redirect avoids the duplicate content while preserving the old page's popularity.
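How the redirect is implemented depends on your server or framework. As a minimal sketch, here is what it could look like in a Python/Flask application; the routes are hypothetical examples:

```python
from flask import Flask, redirect

app = Flask(__name__)

# The old product page (out of stock) permanently redirects to its replacement
@app.route("/produits/ancienne-reference")
def old_product():
    return redirect("/produits/nouvelle-reference", code=301)
```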

Using a Noindex Meta Robots

This is the least "clean" solution. Indeed, a well-built website should not need to put pages in noindex. However, technical constraints sometimes prevent the implementation of this best practice. The content="noindex, follow" robots meta tag has the advantage of being easy to add manually on each page. In particular, it makes it possible to quickly fix duplicate content issues while waiting for a more lasting solution.

It looks like this:

<head>
... [header code] ...
<meta name="robots" content="noindex, follow" />
</head>

With this tag, robots can still browse the pages but are prevented from indexing them. By using it, you come clean with search engines. It's like telling them: "I know I have duplicate pages, but I promise it's not on purpose, and I am not trying to manipulate bots in order to have multiple identical pages in the SERPs!"

A common mistake is to also disallow the crawling of these pages in robots.txt: for bots to see the noindex tag, they must be able to crawl the pages.
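To avoid this pitfall, you can check that a noindexed page remains crawlable. Here is a minimal sketch using only the Python standard library; the URLs are illustrative and the HTML check is deliberately naive:

```python
from urllib import robotparser
from urllib.request import urlopen

page_url = "https://exemple.com/produits/femmes/?color=vert"  # illustrative URL

# 1. The page must NOT be blocked in robots.txt, or bots will never see the tag
robots = robotparser.RobotFileParser("https://exemple.com/robots.txt")
robots.read()
crawlable = robots.can_fetch("*", page_url)

# 2. The page itself should carry the noindex directive
html = urlopen(page_url).read().decode("utf-8", errors="ignore")
has_noindex = 'name="robots"' in html and "noindex" in html

print(f"Crawlable: {crawlable} - noindex tag found: {has_noindex}")
```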

Manage preferred domains and settings in Google Search Console

Google Search Console lets you set a website's preferred domain and specify how Googlebot should handle certain URL parameters. Depending on the website's structure and the origin of the duplicate content, setting a preferred domain and/or managing these parameters can be a fallback solution. But this method is only valid for Google: it won't fix the issue on Bing or other search engines. To do so, the same changes must also be made in the other search engines' webmaster tools, which can be quite tedious! It's always better to properly treat a wound than to put a "bandage" over it.

Take care of your internal linking

Once you have built a clean structure free of duplicate content, it's essential to keep your internal linking consistent. Each internal link should point to the canonical URL, not to a duplicate page. You will also save crawl budget!
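As an illustration, here is a minimal sketch that lists the links found on a crawled page and flags those that do not point to a known canonical URL. The canonical set and the HTML snippet are hypothetical examples:

```python
from html.parser import HTMLParser

CANONICAL_URLS = {"/produits/femmes/robes/vert.html"}  # hypothetical canonical set

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag while parsing an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

html = ('<a href="/produits/femmes/robes/vert.html">Robes vertes</a>'
        '<a href="/produits/femmes/?category=robes&color=vert">Robes vertes (filtre)</a>')

collector = LinkCollector()
collector.feed(html)
for link in collector.links:
    if link not in CANONICAL_URLS:
        print(f"Internal link does not point to a canonical URL: {link}")
```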

Conclusion

Duplicate content is a major issue for all websites, in particular e-commerce ones. Even with experience, it’s sometimes difficult to anticipate all the potential instances of duplicate content. That’s why it’s essential to invest in a good crawl tool that will help you constantly monitor your website’s status and condition.

   Article written by Louis Chevant

Further reading

The complete guide to internal linking

The step-by-step method to build your semantic cocoons, your internal linking and your website's optimal tree structure.