All websites around the world are crawled by Googlebot, which analyzes them in order to establish a relevant ranking in the search results. In this post, we will look at Googlebot’s different actions, its expectations, and the means available to you to optimize its crawling of your website.
Googlebot is a virtual robot developed by Google’s engineers at the giant’s Mountain View offices. This little "Wall-E of the web" quickly goes through websites before indexing some of their pages. The program searches for and reads websites’ content, and updates its index according to what it finds. The index, in which search results are stored, is in a way Google’s brain: this is where all of its knowledge is housed.
Google uses thousands of small computers to send its crawlers to every corner of the web to find pages, to see what's on them. There are several different robots, each with a well-defined purpose. For example, AdSense and AdsBot are responsible for checking paid ads’ relevance, while Android Mobile Apps checks Android apps. There is also an Images Googlebot, News, etc. Here is a list of the most famous and most important ones with their “User-agent” name:
- Googlebot (desktop) Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
- Googlebot (mobile) Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
- Googlebot Video Googlebot-Video/1.0
- Googlebot Images Googlebot-Image/1.0
- Googlebot News Googlebot-News
Google provides the full list of its crawlers in its official documentation.
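As a minimal sketch of how these user-agent names can be put to use, the following snippet classifies a request by checking its User-Agent header against the tokens listed above. The function and dictionary names are hypothetical, and remember that a user-agent string only tells you what a visitor *claims* to be:

```python
# Map a user-agent token to the crawler it identifies.
# Tokens are taken from the user-agent strings listed above.
GOOGLEBOT_TOKENS = {
    "Googlebot": "Googlebot (desktop/mobile)",
    "Googlebot-Image": "Googlebot Images",
    "Googlebot-Video": "Googlebot Video",
    "Googlebot-News": "Googlebot News",
}

def classify_crawler(user_agent):
    """Return the crawler name claimed by a User-Agent header, or None."""
    # Check the most specific token first, so "Googlebot-Image/1.0"
    # is not reported as plain "Googlebot".
    for token in sorted(GOOGLEBOT_TOKENS, key=len, reverse=True):
        if token in user_agent:
            return GOOGLEBOT_TOKENS[token]
    return None

ua = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
      "+http://www.google.com/bot.html)")
print(classify_crawler(ua))  # Googlebot (desktop/mobile)
```

Since user-agent strings can be spoofed, this check is only a first filter; the reverse DNS verification discussed later in this post is the reliable way to confirm a visitor really is Googlebot.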
Googlebot is completely autonomous; no one really “pilots” it once it’s up and running. The robot uses sitemaps and links discovered in previous crawls. Whenever the crawler finds new links on a website, it follows them to the landing pages and adds them to its index if they are of interest. Likewise, if Googlebot encounters broken or modified links, it takes them into account and refreshes its index. Googlebot itself determines how often it will crawl the pages, allocating a “crawl budget” to each website. It’s therefore normal that a website with hundreds or thousands of pages is not fully crawled or indexed. To make things easier for Googlebot, and to ensure that the website is correctly indexed, you must check that no factor is blocking the crawl or slowing it down (a wrong command in robots.txt, for example).
The robots.txt commands
The robots.txt is in a way Googlebot’s roadmap. This is the first thing it crawls so it can follow its directions. In the robots.txt file, it’s possible to restrict Googlebot’s access to certain parts of a website. This system is often used in crawl budget optimization strategies. The robots.txt for each website can be accessed by adding /robots.txt at the end of the URL.
With it, a website can block the crawling of shopping cart pages, “my account” pages, and other configuration pages.
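As an illustration, here is what such a robots.txt file could look like. The paths and domain are invented for the example; adapt them to the website’s actual URL structure before using anything like this:

```
# Hypothetical robots.txt example: keep Googlebot out of
# cart and account pages to preserve crawl budget.
User-agent: Googlebot
Disallow: /cart/
Disallow: /my-account/

# Optionally point crawlers at the sitemap.
Sitemap: https://www.example.com/sitemap.xml
```

A single wrong line here (such as `Disallow: /`) would block the entire website, which is why robots.txt changes should always be double-checked.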
CSS stands for Cascading Style Sheets. This file describes how HTML elements should be displayed on the screen. It saves a lot of time because the style sheets apply throughout the website. It can even control multiple websites’ layout at the same time. Googlebot doesn't just read text, it also downloads CSS files to better understand a page’s overall content.
Thanks to CSS, it can also:
- Detect possible manipulation attempts by websites trying to deceive robots and rank better (the most famous: cloaking and white text on a white background).
- Download some images (logo, pictograms, etc.).
- Read the guidelines for responsive design, which are essential to show that a website is suitable for mobile browsing.
Googlebot downloads the images on the website to enrich its “Google Images” engine. Of course, the crawler doesn't “see” the image yet, but it can understand it thanks to the alt attribute and the page’s overall context. Therefore, you should not neglect images because they can become a major source of traffic, even if today it’s still very complicated to analyze them with Google Analytics.
Google’s robot is rather discreet: you don’t really see it at first, and for beginners it can feel like a totally abstract notion. However, it’s there, and it leaves a trace behind. This “trace” is visible in the website logs, and log analysis is one way to understand how Googlebot is visiting a website. The log file records the precise date and time of the bot’s visit, the target file or requested page, the server response header, and more.
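As a hedged sketch of what such log analysis looks like, the snippet below parses lines in the widespread “combined” access log format and keeps only the hits whose user-agent claims to be Googlebot. The sample log line is invented for illustration:

```python
import re

# Regex for the common "combined" access log format:
# ip - - [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Yield (time, path, status) for lines whose UA mentions Googlebot."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and "Googlebot" in m.group("agent"):
            yield m.group("time"), m.group("path"), int(m.group("status"))

# Invented sample line, for illustration only.
sample = ('66.249.66.1 - - [10/Mar/2021:06:25:14 +0000] '
          '"GET /blog/article-1 HTTP/1.1" 200 5120 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)"')

for time, path, status in googlebot_hits([sample]):
    print(time, path, status)
```

Aggregating these tuples by day or by page type is essentially what the log analysis tools below do at scale, with dashboards on top.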
There are several tools for this.
Google Search Console
Search Console, formerly known as Webmaster Tools, is one of the most important free tools for checking a website’s health. Through its indexing and crawl curves, you will be able to see the ratio of crawled and indexed pages to the website’s total number of pages. You will also get a list of crawl errors (404 or 500 errors, for example) that you can fix in order to help Googlebot crawl the website better.
Paid log analysis tools
To find out how often Googlebot visits a website and what it does there, you can also choose paid tools, which are much more advanced than Search Console. Among the well-known ones: Oncrawl, Botify, Kibana, Screaming Frog… These tools are more intended for websites made up of many pages that need to be segmented to facilitate analysis. Indeed, unlike Search Console, which gives you an overall crawl rate, some of these tools make it possible to refine the analyses by determining a crawl rate for each page type (category pages, product pages, etc.). This segmentation is essential to bring out the problematic pages, and then consider the necessary corrections.
Google does not share the list of IP addresses used by its different robots because it changes often. So, to find out whether a (real) Googlebot is visiting a website, you can do a reverse DNS lookup on the visiting IP: spammers can easily spoof a user-agent name, but not an IP address. The robots.txt file can help you control how Googlebot visits certain parts of a website. But be careful: this method is not ideal for beginners, because with the wrong commands you could prevent Googlebot from crawling the entire website, which would directly lead to the website being removed from search results.
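The reverse lookup described above can be sketched in a few lines. The idea is: reverse-resolve the visiting IP, check that the hostname belongs to Google, then forward-resolve that hostname and confirm it points back to the same IP. The function names here are hypothetical, and the domain check reflects the commonly documented `googlebot.com` / `google.com` hostnames:

```python
import socket

def is_google_hostname(hostname):
    """Genuine Googlebot reverse DNS names end in googlebot.com or google.com."""
    return hostname.endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
        if not is_google_hostname(hostname):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise the reverse record could itself be forged.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        # No reverse record, or the forward lookup failed.
        return False
```

Calling `verify_googlebot()` on each suspicious IP from the logs (it performs live DNS lookups, so run it against real traffic) separates genuine Googlebot visits from impostors.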
Helping Googlebot crawl more pages on a website can be a complex process; it essentially comes down to removing the technical barriers that prevent the crawler from exploring the website efficiently. This is one of SEO’s three pillars: on-site optimization.
Regularly update the website’s content
Content is by far the most important criterion for Google, but also for other search engines. Websites that regularly update their content are likely to be crawled more frequently, because Google is constantly on the lookout for new things. If you’ve got a showcase website where it’s difficult to regularly add content, you can use a blog directly attached to the website. This will encourage the bot to come more often while enriching the website’s semantics. As a rule of thumb, it’s recommended to publish fresh content at least three times a week in order to improve the crawl rate.
Improve server response time and page load time
Page load time is a determining factor. Indeed, if Googlebot finds that pages take too long to load and crawl, it will crawl fewer of them afterwards. You must therefore host the website on a reliable server offering good performance.
Submit a sitemap
Submitting a sitemap is one of the first things you can do to help bots crawl your website more easily and faster. They may not crawl all the pages listed in the sitemap, but they will at least have the paths laid out, which is especially important for pages that tend to be poorly linked within a website.
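For reference, a minimal XML sitemap following the sitemaps.org protocol looks like this (the domain and URL are placeholders):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/article-1</loc>
    <lastmod>2021-03-10</lastmod>
  </url>
</urlset>
```

The optional `<lastmod>` date signals to crawlers when a page last changed, which ties in with the point above about fresh content.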
Avoid duplicate content
Duplicate content significantly decreases the crawl rate, because Google considers that you are using its resources to crawl the same thing twice. In other words, you tire the robots for nothing! That’s why duplicate content should be avoided as much as possible, both for Google’s crawler and for its dear friend Google Panda.
Block access to unwanted pages via Robots.txt
To preserve the crawl budget, you don’t need to let search engine robots crawl irrelevant pages, such as information pages, account administration pages, etc. A simple modification to the robots.txt file will keep Googlebot from crawling these pages.
Use Ping services
Pinging is a great way to get bots to visit you by notifying them of new updates. There are many ping services, such as Ping-O-Matic, which WordPress uses. You can also manually add other ping services to notify many search engine bots.
Take care of the internal linking
Internal linking is essential to optimize the crawl budget. It not only helps you deliver SEO juice to every page, but also guides bots to deeper pages. In practice, if you’ve got a blog, each new article should link to an older page when possible: that older page keeps receiving links and remains interesting to Googlebot. Internal linking doesn’t directly increase Google’s crawl rate, but it does help bots effectively crawl the deep pages that are often overlooked.
Optimize the images’ alt attributes
As smart as they are, robots are not yet able to visualize an image. They need textual guidance. If the website contains images, be sure to fill in the alt attributes to provide a clear description that search engines will understand and index. Images can only appear in search results if they are properly optimized.
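Concretely, the difference looks like this (filenames and descriptions are invented for the example):

```
<!-- Descriptive alt text that Googlebot can understand and index: -->
<img src="/images/red-running-shoes.jpg"
     alt="Red running shoes, side view" width="640" height="480">

<!-- An empty or missing alt attribute gives Googlebot nothing to work with: -->
<img src="/images/IMG_4512.jpg" alt="">
```

The page’s overall context still matters, but the alt text is the most direct signal you control.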
Googlebot is a little robot that visits a website regularly, looking for new things. If you’ve made wise technical choices for the website, it will come frequently and crawl many pages; if you provide it with fresh content on a regular basis, it will come back even more often. In fact, whenever you make a change on the website, you can invite Googlebot to come and see it through Google Search Console. In theory, this leads to faster indexing.