Log analysis is a technique regularly used by SEO professionals. It gives an overall view of a website's performance and internal linking, and of their impact on robots' behavior. Log files are also the most reliable data source available, because they record every request the server actually received rather than a sample or an estimate. Ultimately, log analysis is an essential aid to improving search engine rankings while increasing traffic, conversions and sales.
What is a server log file?
Log files are data recorded by the server that hosts a website. This data can come from Internet users as well as from robots.
What are logs?
When a user types a URL into a browser, the browser first breaks the URL down into 3 components:
- Protocol
- Server name
- File name
The server name is converted into an IP address by the Domain Name System (DNS) to establish a connection between the browser and the web server where the requested file is located. An HTTP GET request for the requested page (or file) is then sent to the web server using the associated protocol, and the response is interpreted to render the page that appears on the screen. Each of these requests is recorded as a “hit” by the web server. These hits are visible in the logs and, for Googlebot's requests, in Google Search Console as well.
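As a rough sketch of that first step, the standard library's `urlsplit` can break a URL into protocol, server name and file name. The URL and the `split_url` helper are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into the three components described above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,         # e.g. "https"
        "server_name": parts.hostname,    # later resolved to an IP address via DNS
        "file_name": parts.path or "/",   # the resource requested from the server
    }


# Hypothetical URL, for illustration only.
print(split_url("https://www.example.com/category/page.html"))
```

The DNS lookup itself is left out so that the example runs without network access.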
The log file's structure depends on the type of server and on its configuration, but some attributes are almost always included:
- Server IP
- Timestamp (date and time)
- Method (GET / POST)
- Request URI (aka: URI stem + URI query)
- HTTP status code
Here's what it looks like:
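As an illustration only, assuming the server writes the "combined" log format commonly used by Apache and Nginx, a single hit and a minimal parser might look like the sketch below. Every value in the sample line (IP, date, URL, size) is made up, and `parse_line` is a hypothetical helper; your own server may use a different layout.

```python
import re
from typing import Optional

# Assumption: Apache/Nginx "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Hypothetical log line; all values are invented for illustration.
SAMPLE_LINE = (
    '66.249.66.1 - - [12/Mar/2021:06:25:24 +0000] '
    '"GET /category/page.html?color=red HTTP/1.1" 200 5321 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


for field, value in parse_line(SAMPLE_LINE).items():
    print(f"{field}: {value}")
```

Returning `None` for lines that do not match lets a larger script skip malformed entries instead of crashing on them.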
What information can be extracted from it?
Logs contain information such as the IP address of the client accessing the site, a timestamp, a user-agent and the HTTP response code. A line is written not only for the page itself but also for the images, CSS and any other files needed to display it, which is why log files run to thousands of lines. Most hosting solutions automatically keep log files for a certain period of time. Usually, this information is only made available to the webmaster or domain owner.
Pages crawled or not crawled by Google
Log analysis highlights the pages that interest the robots most and those they neglect. This information is essential in any SEO strategy, since every website has both highly strategic pages and secondary pages. The logs allow you to see whether all your strategic pages are crawled by Google. If their crawl rate and/or crawl frequency is lower than the average for the site, in-depth work is needed to encourage Google to visit these pages more often. There is no standard solution we can recommend, as the problems vary from one site to another, but we give a few optimization tips in the second part of this article.
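A minimal sketch of that check, assuming you already have a list of strategic URLs and the list of URIs requested by Googlebot (both hypothetical here), could be:

```python
from collections import Counter


def crawl_coverage(strategic_urls, googlebot_hits):
    """Split strategic URLs into those crawled (with hit counts) and those never crawled."""
    hits = Counter(googlebot_hits)
    crawled = {url: hits[url] for url in strategic_urls if hits[url] > 0}
    ignored = [url for url in strategic_urls if hits[url] == 0]
    return crawled, ignored


# Hypothetical data, for illustration only.
strategic = ["/", "/best-sellers", "/new-collection"]
bot_hits = ["/", "/", "/best-sellers", "/legal-notice"]

crawled, ignored = crawl_coverage(strategic, bot_hits)
print("Crawled:", crawled)
print("Never crawled:", ignored)
```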
The website’s crawl frequency
The pages' crawl frequency indicates whether bots crawl them regularly or not. Don't confuse crawl frequency with crawl rate, which refers only to the number of crawls allocated to a page or group of pages, even though the two metrics are often linked. An uninteresting page will be crawled rarely, and eventually not at all. Crawl frequency is therefore an additional measure of the relevance of one or more pages. If robots visit the pages you have optimized more often, you can conclude that the work was worth it!
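One possible way to put a number on crawl frequency is the average gap, in days, between two bot visits to the same URL. The sketch below assumes you have already extracted the crawl dates for one URL from the logs; the dates shown are invented.

```python
from datetime import date


def average_days_between_crawls(crawl_dates):
    """Average number of days between consecutive crawl days (None if fewer than two)."""
    days = sorted(set(crawl_dates))
    if len(days) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical crawl dates for a single URL.
visits = [date(2021, 3, 1), date(2021, 3, 3), date(2021, 3, 7)]
print(average_days_between_crawls(visits))
```

Comparing this value before and after an optimization gives a simple indication of whether bots are coming back more often.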
The robot's crawl volume
Log analysis also gives you the precise daily crawl volume, both on already known URLs and on newly crawled URLs. It's important to monitor crawl volume every day with a monitoring tool, where it is typically displayed as a curve of the number of daily crawls. A significant drop in crawl volume may be due to a server slowdown, 500 errors, or a problem with how the logs are collected. Conversely, a large increase is not necessarily positive. For example, if you've just added a lot of products and the number of crawls explodes, it's worth looking at which pages have been crawled to make sure there is no duplicate content issue. A crawl spike can also precede a Google update, or be a one-off with no particular cause.
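A monitoring tool usually handles this, but the idea can be sketched in a few lines. The example below counts bot hits per day and flags days that fall below half of the average of the three previous days. The `window` and `threshold` values are arbitrary assumptions to tune for your own site.

```python
from collections import Counter
from statistics import mean


def daily_crawl_volume(hits):
    """hits: iterable of (day, uri) pairs for bot requests. Returns {day: count}, sorted by day."""
    return dict(sorted(Counter(day for day, _ in hits).items()))


def flag_drops(volume, window=3, threshold=0.5):
    """Return the days whose volume is below threshold x the mean of the previous `window` days."""
    days = list(volume)
    flagged = []
    for i in range(window, len(days)):
        baseline = mean(volume[d] for d in days[i - window:i])
        if volume[days[i]] < threshold * baseline:
            flagged.append(days[i])
    return flagged


# Hypothetical daily volumes (ISO dates sort chronologically as strings).
volume = {"2021-03-01": 100, "2021-03-02": 110, "2021-03-03": 90,
          "2021-03-04": 40, "2021-03-05": 95}
print(flag_drops(volume))
```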
The logs also record the response code for each event. If a user arrives on a website and is faced with a 500 code (server error), a log line is created. The same goes for 404s (pages not found), 200s (accessible pages), etc. Analyzing HTTP response codes has two benefits: one for SEO and one for UX (user experience). If Internet users repeatedly find the door closed, they probably won't come back, and the same goes for Google's robots, which will end up penalizing you.
You might be wondering why you should analyze HTTP codes in the logs when they are already available in crawler results. The explanation is simple: crawler data is a snapshot taken at a single point in time. If you run a crawl when the server is not very busy (and therefore more responsive), you might not notice any response code issue even though, in reality, the website returns a lot of server errors most of the time. Analyzing HTTP codes in the logs gives a more general picture, smoothed out over time.
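To illustrate the "smoothed over time" view, the sketch below computes the share of 5xx responses per day from parsed log entries. The `(day, status)` input shape and the sample values are assumptions made for the example.

```python
from collections import defaultdict


def error_rate_by_day(hits):
    """hits: iterable of (day, status_code). Returns {day: share of 5xx responses}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for day, status in hits:
        totals[day] += 1
        if 500 <= status <= 599:
            errors[day] += 1
    return {day: errors[day] / totals[day] for day in sorted(totals)}


# Hypothetical hits: the second day looks healthy, the first one does not.
hits = [
    ("2021-03-01", 200), ("2021-03-01", 500), ("2021-03-01", 503), ("2021-03-01", 200),
    ("2021-03-02", 200), ("2021-03-02", 404),
]
print(error_rate_by_day(hits))
```

A crawl launched only on the second day would have missed the server errors that the logs reveal on the first.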
Temporary 302 redirects
302 redirects also appear in the logs. Unlike 301s, which are permanent and don't pose any problem as such, 302s need watching. In general, it's better to avoid them altogether. If you have no choice, reserve them for occasional, genuinely temporary purposes, because pages left behind a 302 lose visibility over time. By analyzing the logs, you will be able to see whether bots continue to crawl 302s, and take the necessary measures to direct them to the right URLs, either by removing the 302s or by replacing them with 301s.
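A simple way to spot them, assuming log entries already parsed into dictionaries with `uri`, `status` and `user_agent` keys (a hypothetical structure), is to count bot hits that received a 302:

```python
from collections import Counter


def crawled_302s(entries, bot_marker="Googlebot"):
    """Return (uri, hit count) pairs for 302 responses served to the bot, most crawled first."""
    counts = Counter(
        entry["uri"]
        for entry in entries
        if entry["status"] == 302 and bot_marker in entry["user_agent"]
    )
    return counts.most_common()


# Hypothetical parsed entries.
entries = [
    {"uri": "/promo", "status": 302, "user_agent": "Googlebot/2.1"},
    {"uri": "/promo", "status": 302, "user_agent": "Googlebot/2.1"},
    {"uri": "/old-page", "status": 302, "user_agent": "Googlebot/2.1"},
    {"uri": "/promo", "status": 302, "user_agent": "Mozilla/5.0"},
    {"uri": "/home", "status": 200, "user_agent": "Googlebot/2.1"},
]
print(crawled_302s(entries))
```

Matching on a user-agent substring is a simplification, since user-agents can be spoofed.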
What is the crawl budget, and how to optimize it?
Google robots do not crawl every page of a website (unless it’s a small showcase website). We don't know exactly how search engines define the crawl budget. According to Google, the search engine considers two factors: the pages’ popularity, and the content’s freshness. It means that if a page’s content is often updated, Googlebot will try to crawl it more frequently.
In practice, we can observe extremely low crawl rates (2-3% or even less) on websites with many SEO issues (poor internal linking, duplicate content, weak link profile, thin content, slow pages, etc.). One of the goals of log analysis is therefore to optimize this crawl budget.
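As a sketch, and assuming that "crawl rate" here means the share of a site's known pages that received at least one bot hit over the period analyzed, the figure can be computed as follows. The URL lists are hypothetical.

```python
def crawl_rate(known_urls, crawled_urls):
    """Share of known URLs that appear at least once among the bot-crawled URLs."""
    known = set(known_urls)
    if not known:
        return 0.0
    return len(known & set(crawled_urls)) / len(known)


# Hypothetical data: 4 known pages, only 1 of them crawled.
known = ["/", "/a", "/b", "/c"]
crawled = ["/", "/", "/unknown-page"]
print(f"{crawl_rate(known, crawled):.0%}")
```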
Detect technical problems
5xx and 4xx errors are SEOs’ worst nightmare because they send a very negative signal to Google. Moreover, they indirectly disrupt the website’s internal linking because when a crawler cannot access a page, it cannot access the internal links it contains either. Depending on the page’s size, this can impact the crawl of pages that are closely related. Therefore, it’s essential to correct these technical issues, especially if they are recurring issues.
Orphan pages are pages that exist but are not linked from anywhere within the website. They are easily detected by cross-referencing data from a crawler with the logs: they will not appear in the crawler data, but may appear in the logs. So how can Google crawl pages that are not linked from the website? There are several reasons: either these pages were once linked and have lost their links (for example, an out-of-stock product that no longer appears in the category listing), or they receive external inbound links (backlinks). In both cases, even if they no longer receive any links, Google can continue to crawl them because it already knows them. It's also difficult to stop bots from crawling these pages. You could remove them from the sitemap, get the external links removed, and find that Google still crawls them tirelessly! There is only one option left: block them in robots.txt. But this solution is not viable in the long term because 1/ the size of the file is limited (Google, for instance, only processes the first 500 KiB) and 2/ it means manual management, which can be time consuming, especially for large websites.
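The cross-referencing itself boils down to a set difference: URLs seen in the logs minus URLs found by the crawler. A minimal sketch, with hypothetical URL lists:

```python
def find_orphan_pages(crawler_urls, log_urls):
    """URLs requested by bots in the logs that the crawler never reached via internal links."""
    return sorted(set(log_urls) - set(crawler_urls))


# Hypothetical data: the crawler follows internal links, the logs show what bots requested.
crawler_urls = ["/", "/category", "/product-1"]
log_urls = ["/", "/product-1", "/product-out-of-stock", "/old-landing-page"]
print(find_orphan_pages(crawler_urls, log_urls))
```

In practice both lists need the same URL normalization (protocol, trailing slash, parameters) before they are compared.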
Duplicate URLs are one of the main causes of penalties from search engines, in particular under Google's Panda algorithm. They can have serious unintended consequences for a website's organic ranking. When you come across groups of similar pages, you can opt for the rel=canonical tag, or even delete some of the pages. In any case, check the logs beforehand to find out which page robots treat as the reference. If, out of ten duplicate pages, one version is crawled noticeably more than the others, choose it as the reference page.
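Picking the most crawled version out of a group of duplicates can be sketched as below. The URLs are hypothetical, and crawl count is only one possible criterion for choosing the canonical page.

```python
from collections import Counter


def pick_reference_url(duplicate_urls, bot_hits):
    """Return the duplicate URL with the most bot hits (the first listed wins a tie)."""
    hits = Counter(bot_hits)
    return max(duplicate_urls, key=lambda url: hits[url])


# Hypothetical group of duplicates and bot hits extracted from the logs.
duplicates = ["/shoes", "/shoes?sort=price", "/shoes?ref=home"]
bot_hits = ["/shoes?sort=price", "/shoes", "/shoes", "/shoes", "/shoes?ref=home"]
print(pick_reference_url(duplicates, bot_hits))
```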
Identify page optimizations
The different rates
The active pages’ rate
To develop an optimization plan and set priorities, you need several metrics. The first is, of course, the crawl rate of a page or group of pages, which tells you how interested robots are in those pages. The next is the rate of active pages within that group. The crawl rate alone is not enough, because a crawled page is not necessarily an active page (that is, a page that has received at least 1 visit in the last 30 days). It's therefore useful to know which pages are crawled but not active, and to find out why.
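Using the definition above (an active page has received at least 1 visit in the last 30 days), the active page rate among crawled pages might be computed like this. The visit counts are hypothetical and assumed to already cover the last 30 days.

```python
def active_page_report(crawled_urls, visits_by_url):
    """Return (active page rate among crawled pages, list of crawled but inactive pages)."""
    crawled = set(crawled_urls)
    active = {url for url in crawled if visits_by_url.get(url, 0) >= 1}
    rate = len(active) / len(crawled) if crawled else 0.0
    return rate, sorted(crawled - active)


# Hypothetical data: visits counted over the last 30 days.
crawled = ["/", "/guide", "/archive-2015", "/tag/misc"]
visits = {"/": 1200, "/guide": 35, "/archive-2015": 0}

rate, inactive = active_page_report(crawled, visits)
print(f"Active page rate: {rate:.0%}")
print("Crawled but inactive:", inactive)
```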
The pages’ recency
Then, you can rank the crawled and active pages based on their recency. You may find that Google is more interested in “fresh” content or, on the contrary, in old content. Either way, it will give you an idea of the action plan to set up in order to highlight the most strategic pages.
The crawls/visits ratio
In SEO, you have to make choices to optimize the crawl budget, and sometimes you have to "cut off the foot to save the leg." Some pages that are not very strategic can consume a lot of crawl for too few visits. By analyzing this "number of crawls/visits" ratio, you can identify the pages that consume too much budget compared to what they bring in. You can choose to obfuscate the links that lead to these pages so that robots no longer see them, or simply remove the pages from the website if they are of no interest to users. This will steer Google's robots toward other pages that are more interesting and strategic for you.
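A rough sketch of that ratio, with hypothetical crawl and visit counts per URL, could look like this. Pages that are crawled but never visited are given an infinite ratio so that they sort to the top.

```python
def crawl_to_visit_ratio(crawls_by_url, visits_by_url):
    """Return (url, crawls/visits) pairs, highest ratio first."""
    ratios = {}
    for url, crawls in crawls_by_url.items():
        visits = visits_by_url.get(url, 0)
        ratios[url] = crawls / visits if visits else float("inf")
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts over the same period.
crawls = {"/filters": 500, "/product": 100, "/old": 30}
visits = {"/filters": 5, "/product": 200}

for url, ratio in crawl_to_visit_ratio(crawls, visits):
    print(url, ratio)
```

The pages at the top of the list are the first candidates for link obfuscation or removal.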
The crawl rate by type of page
In crawl and log analysis tools like Oncrawl or Botify, you can segment pages according to your analysis needs. This is a crucial step that shapes every subsequent decision. You can then look at the robots' behavior on each type of page, and track the effect of targeted actions over time.
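Outside such tools, the same idea can be approximated with a few URL patterns. The sketch below is not how Oncrawl or Botify implement segmentation; the segment names and regular expressions are hypothetical rules to adapt to your own URL structure.

```python
import re
from collections import Counter

# Hypothetical segmentation rules; the first matching pattern wins.
SEGMENTS = [
    ("product", re.compile(r"^/product/")),
    ("category", re.compile(r"^/category/")),
    ("blog", re.compile(r"^/blog/")),
]


def segment_of(uri):
    """Return the name of the first segment whose pattern matches the URI, else 'other'."""
    for name, pattern in SEGMENTS:
        if pattern.search(uri):
            return name
    return "other"


def crawls_by_segment(bot_uris):
    """Count bot hits per page type."""
    return Counter(segment_of(uri) for uri in bot_uris)


# Hypothetical URIs requested by a bot.
bot_uris = ["/product/42", "/product/43", "/category/shoes", "/blog/seo-tips", "/cart"]
print(crawls_by_segment(bot_uris))
```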
Identify mobile adaptability needs
Thanks to the user-agent data, the logs will tell you whether the website has switched to the Mobile First Index (MFI). If it has, you will notice very quickly: almost all crawls will be carried out by the mobile bot. This is important information that will help you determine the priority actions. For example, if the website has switched to MFI, one of the priorities should be improving load times.
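As a simplified heuristic, the share of Googlebot hits made by the smartphone crawler can be estimated from user-agent strings: the sketch assumes that the smartphone Googlebot's user-agent contains both "Googlebot" and "Mobile", while the desktop one contains only "Googlebot". The sample strings are abbreviated and hypothetical, and user-agents can be spoofed.

```python
def mobile_crawl_share(user_agents):
    """Share of Googlebot hits whose user-agent looks like the smartphone crawler (None if no hits)."""
    googlebot = [ua for ua in user_agents if "Googlebot" in ua]
    if not googlebot:
        return None
    mobile = sum(1 for ua in googlebot if "Mobile" in ua)
    return mobile / len(googlebot)


# Abbreviated, hypothetical user-agent strings.
MOBILE_BOT = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X) Mobile Safari/537.36 "
              "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
DESKTOP_BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
HUMAN = "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) Mobile/15E148"

share = mobile_crawl_share([MOBILE_BOT, MOBILE_BOT, MOBILE_BOT, DESKTOP_BOT, HUMAN])
print(f"Mobile share of Googlebot crawls: {share:.0%}")
```

A share close to 100% sustained over several weeks is the pattern described above for a site that has moved to MFI.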
Logs are therefore a mine of information for every website publisher. By cross-referencing their data with that of a crawler, you will know precisely the state of health of your site and the behavior of search engine robots. This is an essential step before embarking on an organic SEO strategy. It may seem complex at first, but it can be approached empirically and can be quite simply fascinating!
Article written by Louis Chevant