The meta robots tag needs to be properly implemented
The meta robots tag indicates two things to the search engine: whether it may index the page’s content and, in a complementary way, whether it may follow the links found in that content. Both directives can be combined.
The meta robots tag can be found on all kinds of websites. It gives information to search engines. This piece of code is located in the page’s head, in other words in the code’s "header" (not to be confused with the HTTP header). This tag has a big impact on pages: it gives you the power to decide which pages should be seen and crawled by search engines, and which ones should be hidden and kept out of the index. It’s essential, but it needs to be used with caution.
The nofollow directive can also be applied to individual links (via the rel="nofollow" attribute) as part of PageRank sculpting. For a long time, it was used, wrongly, to manage faceted navigation and pagination. Today, it is mostly useful on external outbound links. For example, if an e-commerce website creates a page for the Nike brand, it may be required to link to the Nike website. In this case, a nofollow will keep the page’s SEO juice (which comes from the website’s internal linking and various external links). However, this method does not have unanimous support among SEOs. Some people trust Google enough to believe that the loss of juice is small or nonexistent; they also consider that such links are part of natural netlinking, and that the nofollow is not essential unless there are a lot of outbound links. Others prefer to play it safe by using nofollow consistently.
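As an illustration (the Nike URL simply stands in for any outbound link), nofollow can be applied to a single link via the rel attribute, or to every link on the page via the meta robots tag:

```html
<!-- Nofollow on one outbound link only -->
<a href="https://www.nike.com/" rel="nofollow">Visit the Nike website</a>

<!-- Nofollow on every link of the page, via the meta robots tag -->
<meta name="robots" content="nofollow">
```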
How to set up the meta robots tag in the header?
This tag’s implementation is quick and easy. You need access to the code of the page (or set of pages) concerned, then:
Copy and paste the whole head into a separate document. HTML editors suited to writing code, such as Sublime Text, highlight tags that are not closed correctly.
Include the tag as shown below:
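A minimal version of the tag, placed anywhere inside the page’s head element:

```html
<head>
  <!-- Ask all robots not to index this page and not to follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```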
Provide guidelines for user-agents
Although this tag is standard, you can also address directives to a specific robot by replacing "robots" with that robot’s user-agent name. This is useful during pre-production crawls, for example, or if you want to prevent certain crawlers launched by competitors from crawling your website.
If you want to use different meta robots tag directives for different search user-agents, then separate them for each user-agent.
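For instance, with one meta tag per robot (Googlebot and Bingbot are shown here as common examples):

```html
<!-- Googlebot may index the page but must not follow its links -->
<meta name="googlebot" content="nofollow">
<!-- Bingbot must not index the page at all -->
<meta name="bingbot" content="noindex">
```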
The X-robots-tag, a more refined alternative to the meta robots
While the meta robots tag affects an entire page’s indexing behavior, the X-Robots-Tag is included directly in the HTTP header and helps you control the indexing of specific elements only. It therefore offers the same possibilities as the meta robots tag, with more flexibility. For example, you can use it to block non-HTML files such as images, videos or Flash (even though the latter has become quite rare). To add it to HTTP responses on Apache, you can use the .htaccess or httpd.conf files. It looks like this:
HTTP/1.1 200 OK
Date: Sun, 25 Nov 2018 21:48:34 GMT
X-Robots-Tag: googlebot: nofollow
X-Robots-Tag: otherbot: noindex, nofollow
As with the meta robots tag, you can use other directives for a page, such as nosnippet, noodp or notranslate. If you don’t want to use robots.txt or the meta robots tag, or if you need to block non-HTML content, then use the X-Robots-Tag.
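As a sketch, here is how such a header can be set in .htaccess on Apache (assuming the mod_headers module is enabled; the file extensions are just an example):

```apache
# Send an X-Robots-Tag header for non-HTML files such as images and PDFs
<FilesMatch "\.(png|jpe?g|gif|pdf)$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```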
The meta robots tag’s different directives and their impact
The meta robots tag offers several directives that change how search engine crawlers treat a page.
The Follow and Nofollow directives
These directives are of the utmost importance. To better understand what is at stake: imagine that the website is funnel-shaped and has small holes scattered everywhere. The top of the funnel represents the homepage. The holes are all the internal links. Then, imagine that you pour liquid at the top of the website, from the homepage. If all the links are in Follow, the liquid will continue on its way and feed the website’s deeper pages. However, if all the links are in Nofollow, then the liquid will stop, and not feed the other pages. And those pages can be compared to plants! Without water, they will not survive.
After this lovely comparison, let’s sum up in more technical terms. Google used to publish a metric called PageRank to measure pages’ relevance, taking internal linking but also external (off-site) links into account. The public score is obsolete today, but the idea has been taken up by SEO solutions to assess pages’ relevance within the internal linking. Tools like Botify, Oncrawl or Screaming Frog send a crawler over a website that obeys the meta robots directives, which makes it possible to check that the tag is used correctly.
The Index and Noindex directives
These two directives are the ones that can cause the most damage when poorly implemented. They simply indicate to Google whether the page should appear in its index or not. So you can imagine that mistakenly including a "noindex" in a strategic page’s header (the homepage, for the less fortunate ones!) can have dramatic consequences. Conversely, the "noindex" can be used to avoid duplicate content (although this is a relatively... dirty method!). This technique is found on e-commerce websites which do not manage URL rewriting and end up with several paths to the same page. The links to the duplicated paths are put in nofollow, and the landing pages in noindex: a page left in "index" would create duplicate content, severely sanctioned by our dear friend Google Panda. But let’s remember once again that this technique is not clean at all; it is a temporary fix at best. The "noindex" does not prevent the page from being crawled by Google. However, Google allocates a crawl budget to your website, which it’s important to use wisely: there is no point in wasting crawls on a page that you do not want indexed. By extension (and from experience!), we can also state that "nofollow" does not prevent Googlebot from crawling landing pages either. There are two main reasons for this:
Inbound links (i.e. links pointing to the page) were once in "follow". Google has therefore already crawled the page and will know the path even if you cut the links. It will eventually stop crawling it, but that can take months!
Links come from external websites. In that case, nothing can be done other than requesting the links’ removal, which is usually unlikely to succeed.
Finally, there is the "noimageindex" directive which prohibits search engines from indexing images on the page. If the images are from another website, search engines can still index them. In that case, it is advisable to use the X-robots-tag instead.
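In the page’s head, that directive looks like this:

```html
<!-- Ask robots not to index the images embedded in this page -->
<meta name="robots" content="noimageindex">
```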
The Noarchive directive
This directive directly impacts Google’s caching. A cache is simply a backup of the page, taken after the crawler has passed through. On paper, it seems safe and convenient: even in the event of maintenance, Internet users can still access the cached page. But it has some drawbacks, in particular for e-commerce websites whose pages are constantly evolving (prices, product availability, etc.). Not all webmasters find the Google cache useful, so to opt out of it, they can use the Noarchive directive. This tag only removes the "Cached" link from the search result; Google will continue to index the page and display a snippet.
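The directive itself is a one-liner:

```html
<!-- Ask Google not to keep (or link to) a cached copy of this page -->
<meta name="robots" content="noarchive">
```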
Isn't it a bit risky to rebel against Google?
Officially, Google says there are no consequences for this tag’s use. But from a UX point of view, it's best not to have too much fun with it if you don't have a handle on it.
The Nosnippet directive
There are snippets and rich snippets. Snippets simply correspond to the site data shown in the SERPs (search results): the URL, the title, the meta description... Rich snippets (or enriched results) provide more information, such as price, availability, ratings, or the number of calories for cooking recipes.
The Nosnippet tag tells Google not to show this data under the page in search results and prevents caching.
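In the page’s head:

```html
<!-- Ask Google not to show a snippet for this page in the SERPs -->
<meta name="robots" content="nosnippet">
```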
The Noodp directive
Meta descriptions can be filled in by site owners or generated by Google. If the description is empty and the site is listed in DMOZ, search engines could display text snippets extracted from the directory’s listing. You can force the search engine to ignore this ODP information by including a meta robots tag like this:
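A minimal version of that tag:

```html
<!-- Tell search engines to ignore the DMOZ/ODP description -->
<meta name="robots" content="noodp">
```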
What the hell is the Dmoz directory?
DMOZ was the largest human-edited directory, with over 4 million websites listed. It was built and maintained by a large global community of volunteer editors. The growing performance of search engines led to the decline of DMOZ, which finally disappeared on March 14th, 2017, making the Noodp tag obsolete.
The unavailable_after and notranslate directives
The unavailable_after directive tells search engines a date/time after which they should not display a page on search results. It can be compared to a noindex’s timed version. "Notranslate" prevents search engines from showing a page’s translations in their search results.
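A sketch of unavailable_after (Google documented an RFC 850-style date for this directive; the date below is only an example):

```html
<!-- Stop showing this page in search results after the given date -->
<meta name="robots" content="unavailable_after: 25-Dec-2020 15:00:00 GMT">
```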
Meta robots tag: most common mistakes
The confusion noindex vs disallow
The noindex prevents Google from indexing a page, but Google can still crawl it. So if you want to optimize the crawl budget, the noindex is not very useful. The disallow command, integrated directly into robots.txt, blocks a page from being crawled, and thus saves crawl budget! If you want to deindex a page, make sure the noindex is in place (and has been crawled) before adding the disallow. How else could Google know that it should deindex the page, if it no longer has access to its meta robots tag?
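To illustrate the difference (the /promo/ path is hypothetical): the noindex lives in the page itself, while the disallow lives in robots.txt:

```html
<!-- In the page's head: Google may crawl the page but must not index it -->
<meta name="robots" content="noindex">
```

```
# In robots.txt: Google must not crawl this section at all
User-agent: *
Disallow: /promo/
```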
Put the meta robots tag outside of the header
It’s common to see meta robots tags in a page’s body. Even if Google states that its robots can still read them there, it’s better to leave the tag in the header (its official spot).
Forget or incorrectly position spaces and commas
Most search engines can interpret the tag even if spaces are missing. However, for Google, commas are very important. Without them, its robots cannot read it.
Checking the meta robots tags’ compliance across the website
On showcase websites with few pages, manual verification is possible. But with tens, hundreds or thousands of pages, only suitable tools can highlight anomalies. These SEO solutions’ crawlers mimic the behavior of Google’s robots, so they can determine the number of nofollow links or noindex pages. By cross-referencing other information, such as the internal PageRank, the average position in the SERPs or the page duplication rate, you can assess the impact of the different directives. It’s a great help in the decision-making process.
The meta robots tag is of paramount importance in managing a website’s indexing. However, there’s nothing magical about it: it is not the only way to optimize your crawl budget.