The Robots Exclusion Protocol, better known as robots.txt, is a convention for keeping web crawlers out of all or part of a website. It is a plain text file, widely used in SEO, containing directives for the search engines’ crawling robots that specify which pages can or cannot be crawled.
The robots.txt is not used to de-index pages, but to prevent them from being crawled.
If a page has never been indexed, blocking its crawl will generally keep it out of the index. But if a page is already indexed, or if another website links to it, the robots.txt will not de-index it. To prevent a page from being indexed on Google, you must use a noindex tag or directive, or protect the page with a password. Note that a robot can only see a noindex directive on a page it is allowed to crawl.
The main objective of the robots.txt file is therefore to manage the robot's crawl budget by keeping it away from pages that add little value for search but must exist for the user journey (shopping cart, etc.).
PS: the robots.txt file is one of the first files analyzed by the engines.
Search engines have two main tasks: crawl the Web to discover content and index that content so that it can be distributed to users looking for information.
To crawl sites, search engines follow links from one page to another, working through many billions of links and websites. This is called "spidering". When a search engine robot reaches a website, it first looks for a robots.txt file. If it finds one, the robot reads this file before continuing to browse the site. If the robots.txt file contains no directives restricting that user agent, or if the site has no robots.txt file, the robot goes on to crawl the rest of the site.
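The check a well-behaved robot performs before fetching a page can be sketched with Python's standard urllib.robotparser module. The rules, the crawler name "ExampleBot" and the example.com URLs below are assumptions made up for illustration, and the file is parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the root of the host before requesting any other URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
"""

def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up crawler name.
cart_allowed = parser.can_fetch("ExampleBot", "https://example.com/cart/checkout")
home_allowed = parser.can_fetch("ExampleBot", "https://example.com/")
print(cart_allowed, home_allowed)
```

Keep in mind that urllib.robotparser only does simple prefix matching, so it is a convenient illustration rather than a faithful model of how Googlebot evaluates rules.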
Importance of robots.txt
Robots.txt files control robot access to certain areas of your site. While it can be very dangerous if you accidentally forbid Googlebot to crawl your entire site, there are some situations in which a robots.txt file can be very useful.
Common use cases include:
- Avoiding the crawling of duplicate content.
- Preventing crawling of an internal search engine.
- Preventing search engines from indexing certain images on your site.
- Specifying the location of the sitemap.
- Specifying a crawl delay to prevent your servers from being overloaded when crawlers load multiple pieces of content simultaneously.
If your site does not contain any areas where you want to control crawler access, you may not need a robots.txt file.
Robots.txt file language
The robots.txt file consists of a set of instruction blocks and optionally sitemap directives.
Each block contains two parts:
- One or more User-agent directives: which robots this block is for.
- One or more commands: which constraints must be respected.
The most common command is Disallow, which forbids robots to crawl a portion of the site.
What is a user agent?
When a program initiates a connection to a web server (whether it is a robot or a standard web browser), it gives basic information on its identity via an HTTP header called "user-agent".
For Google, the list of user-agents used by its crawlers is published in Google's crawler documentation.
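To make the header concrete, here is a small sketch that builds an HTTP request carrying a custom user-agent with Python's standard library. The bot name "ExampleBot/1.0" and the example.com URLs are hypothetical, and the request is only constructed, never sent.

```python
from urllib.request import Request

# "ExampleBot/1.0" and the example.com URLs are hypothetical values.
request = Request(
    "https://example.com/",
    headers={"User-Agent": "ExampleBot/1.0 (+https://example.com/bot)"},
)

# urllib normalizes header names with str.capitalize(), hence "User-agent".
sent_user_agent = request.get_header("User-agent")
print(sent_user_agent)
```

A robots.txt block addressed to "ExampleBot" would be the one this hypothetical program is expected to follow.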
# Lines beginning with # are comments #
# Beginning of block 1
# Beginning of block 2
# Additional sitemap directive
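The block structure annotated above can be sketched as follows. The directives inside each block, the example.com sitemap URL and the parse_blocks helper are assumptions for illustration; this is a simplified reader, not a complete parser.

```python
# Hypothetical robots.txt: two blocks plus a sitemap directive.
ROBOTS_TXT = """\
# Lines beginning with # are comments
# Beginning of block 1
User-agent: *
Disallow: /cart/

# Beginning of block 2
User-agent: Googlebot
User-agent: Bingbot
Disallow: /search/

# Additional sitemap directive
Sitemap: https://example.com/sitemap.xml
"""

def parse_blocks(text):
    """Split robots.txt text into (blocks, sitemaps).

    Each block is a dict with the user agents it targets and its rules.
    """
    blocks, sitemaps = [], []
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "sitemap":
            sitemaps.append(value)  # sitemaps live outside any block
        elif field == "user-agent":
            # A user-agent line that follows rules starts a new block.
            if current is None or current["rules"]:
                current = {"agents": [], "rules": []}
                blocks.append(current)
            current["agents"].append(value)
        elif current is not None:
            current["rules"].append((field, value))
    return blocks, sitemaps

blocks, sitemaps = parse_blocks(ROBOTS_TXT)
for block in blocks:
    print(block["agents"], block["rules"])
print(sitemaps)
```

Note how the second block lists two User-agent lines before its rules: both robots share the same constraints.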
Other block commands:
- Allow: tells a robot that it can access a page or sub-folder even though the parent folder is disallowed. Major crawlers such as Googlebot and Bingbot support it. When Allow and Disallow rules both match a URL, the most specific (longest) rule applies; in case of a tie, Googlebot uses the Allow rule.
- Crawl-delay: specifies the number of seconds the robot should wait between successive requests. Not all crawlers honor it; Googlebot ignores this directive.
- Sitemap: allows you to easily indicate to search engines the pages of your site to crawl. A Sitemap is an XML file that lists the URLs of a site as well as additional metadata on each URL in order to enable a more intelligent exploration of the site by search engines.
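As an illustration, the Crawl-delay and Sitemap values can be read back with Python's standard urllib.robotparser module (site_maps() requires Python 3.8 or later). The file content, the 10 second delay, the crawler name "ExampleBot" and the example.com URL are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a crawl delay and a sitemap.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds to wait between requests for this (made-up) crawler.
delay = parser.crawl_delay("ExampleBot")
# List of sitemap URLs declared in the file, or None if there are none.
sitemaps = parser.site_maps()
print(delay, sitemaps)
```

A polite crawler would sleep for `delay` seconds between two requests to the same host.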
Language of robots.txt files: Wildcards
Robots.txt does not support full regular expressions. It offers a small set of wildcard characters that simplify the file through the use of patterns.
Most search engines (Google, Bing, Yandex ...) support only two of them:
- * : matches any sequence of characters
- $ : matches the end of a URL
Note: if several blocks match a given robot, only the block with the most specific user-agent is taken into account.
For example, here Googlebot will choose block 2:
User-agent: * # Start of block 1
User-agent: Googlebot # Start of block 2
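The choice between the two blocks can be sketched as below. The select_block helper and the rules placed in each block are hypothetical, and matching a token by substring is a simplification of what real crawlers do.

```python
def select_block(blocks, crawler_name):
    """Return the rules of the block whose user-agent best matches.

    `blocks` maps a user-agent token to its rules. The longest token
    contained in the crawler name wins; "*" is the fallback.
    """
    name = crawler_name.lower()
    matches = [
        token for token in blocks
        if token != "*" and token.lower() in name
    ]
    if matches:
        return blocks[max(matches, key=len)]
    return blocks.get("*", [])

# Hypothetical rules for the two blocks.
blocks = {
    "*": ["Disallow: /"],               # block 1
    "Googlebot": ["Disallow: /tmp/"],   # block 2
}

googlebot_rules = select_block(blocks, "Googlebot")
other_rules = select_block(blocks, "ExampleBot")
print(googlebot_rules, other_rules)
```

Only one block is applied: with these made-up rules, Googlebot follows block 2 and ignores block 1 entirely.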
User-agent: * = The user-agent can be any value; in other words, the block applies to all robots.
Disallow: /*.gif$ = Prevents the crawl of URLs made of any series of characters (*) followed by ".gif" at the end of the URL (".gif$"), in other words GIF images.
Note: in robots.txt, paths all start with a slash because they are relative to the root of the site, represented by "/".
Disallow: /private = Prevents the crawl of all URLs that start with /private (including /privateblabla1.html); identical to /private*.
Disallow: /private/ = Prevents the crawl of all URLs that start with /private/ (including /private/page1.html); same as /private/*.
Disallow: /private/$ = Prevents the crawl of exactly /private/ (e.g. /private/page1.html is still accessible).
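The matching behaviour described for these patterns can be checked with a small sketch that translates a robots.txt pattern into a Python regular expression. The helper names pattern_to_regex and matches are made up for this example, and the translation is an approximation of how search engines evaluate the two wildcards.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    "*" matches any sequence of characters; a trailing "$" anchors the
    end of the URL. Everything else is matched literally as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start, so patterns behave as prefixes.
    return pattern_to_regex(pattern).match(path) is not None

for pattern, path in [
    ("/*.gif$", "/images/logo.gif"),
    ("/private", "/privateblabla1.html"),
    ("/private/", "/private/page1.html"),
    ("/private/$", "/private/page1.html"),
]:
    print(pattern, path, matches(pattern, path))
```

The last pair is the interesting one: because of the "$", /private/$ matches only the directory URL itself and not the pages inside it.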
Allow: /wp-admin/admin-ajax.php = The Allow statement creates an exception: here it lets robots crawl admin-ajax.php, which sits inside the previously forbidden /wp-admin/ directory.
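How such an exception is resolved can be sketched with the longest-match rule described in Google's documentation and in RFC 9309. The is_allowed helper is hypothetical and deliberately ignores wildcards to stay short.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, prefix) pairs. The longest matching
    prefix wins; on a tie, "allow" beats "disallow". No match means allowed.
    Wildcards are not handled in this simplified sketch.
    """
    best = ("allow", "")  # default: everything is allowed
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > len(best[1])
        tie_for_allow = len(prefix) == len(best[1]) and directive == "allow"
        if longer or tie_for_allow:
            best = (directive, prefix)
    return best[0] == "allow"

rules = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

for path in ["/wp-admin/admin-ajax.php", "/wp-admin/options.php", "/blog/"]:
    print(path, is_allowed(rules, path))
```

Because the Allow path is longer than the Disallow path, the order of the two lines in the file does not matter under this rule.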
Sitemap: "sitemap link" = Tells search engines the address of the site's sitemap.xml file, if there is one.
You don't know if you have a robots.txt file?
- Just type your root domain,
- then add /robots.txt at the end of the URL. For example, the robots file of "Panorabanques" is located at "https://www.panorabanques.com/robots.txt".
If no .txt page appears, you do not have a (live) robots.txt file.
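The two steps above amount to keeping the scheme and host of any page and replacing the rest with /robots.txt. A minimal sketch, in which the robots_url helper and the page path are made up for the example:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# The page path and query string here are hypothetical.
url = robots_url("https://www.panorabanques.com/some/page?x=1")
print(url)
```

Opening the resulting URL in a browser, or requesting it with any HTTP client, shows whether a live file exists.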
If you do not have a robots.txt file:
- Do you need one? Check whether you have low-value pages that warrant it. Example: shopping cart, search pages of your internal search engine, etc.
- If you need one, create the file following the directives described above.
A robots.txt file consists of one or more rules. To create it, follow the formatting, syntax and location rules described in this article.
Regarding format, you can use almost any text editor to create a robots.txt file, as long as it can save standard ASCII or UTF-8 text files. Do not use a word processor, as these programs often save files in a proprietary format and may add unexpected characters (e.g., curly quotes), which can confuse crawlers.
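If a file has already passed through a word processor, a quick check like the following sketch can flag the characters most likely to cause trouble. The list of suspicious characters and the helper names are assumptions; extend them to suit your own files.

```python
# Characters that word processors commonly substitute for plain ASCII ones.
SUSPICIOUS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00a0": " ",                  # non-breaking space
}

def find_suspicious(text):
    """Return (line_number, character) pairs for each suspicious character."""
    return [
        (number, char)
        for number, line in enumerate(text.splitlines(), start=1)
        for char in line
        if char in SUSPICIOUS
    ]

def clean(text):
    """Replace suspicious characters with their plain ASCII equivalents."""
    return text.translate(str.maketrans(SUSPICIOUS))

# Hypothetical file content containing curly double quotes.
sample = "User-agent: *\nDisallow: /\u201cprivate\u201d/\n"
print(find_suspicious(sample))
print(clean(sample))
```

When writing the cleaned text back to disk, open the file with an explicit encoding such as UTF-8 so the result stays a plain text file.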
Formatting and usage rules
- The robots.txt file name must be in lower case (no Robots.txt or ROBOTS.TXT).
- Each host (domain or subdomain) can have only one robots.txt file, placed at its root.
- If it is missing, the request returns a 404 error and the robots will consider that no content is prohibited.
- Make sure you do not block content or sections of your website that you want to be crawled.
- Links on pages blocked by robots.txt will not be followed.
- Do not use robots.txt to prevent sensitive data from being displayed in the SERP. Since other pages may link directly to the page containing private information, it may still be indexed. If you want to keep a page out of the search results, use a different method, such as password protection or the noindex meta directive.
- Some search engines have multiple user agents. For example, Google uses Googlebot for organic search and Googlebot-Image for image search. Most user agents of the same search engine follow the same rules. Therefore, it is not necessary to specify directives for each of a search engine's bots, but doing so lets you refine the way your site content is crawled.
- A search engine will cache the content of robots.txt, but usually updates the cached content at least once a day. If you change the file and want to update it more quickly, you can send your robots.txt URL to Google.
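The rule about a missing file can be sketched as a mapping from the HTTP status of the robots.txt request to a crawl policy. The 404 case comes from the list above; treating server errors (5xx) as "disallow everything" is an addition based on RFC 9309, and the crawl_policy helper and its return values are hypothetical.

```python
def crawl_policy(status_code: int) -> str:
    """Map the HTTP status of a robots.txt request to a crawl policy."""
    if 200 <= status_code < 300:
        return "parse-rules"    # file exists: apply its directives
    if 400 <= status_code < 500:
        return "allow-all"      # e.g. 404: no file, nothing is prohibited
    if 500 <= status_code < 600:
        return "disallow-all"   # server error: assume nothing may be crawled
    return "undefined"          # redirects and other cases are out of scope

for status in (200, 404, 503):
    print(status, crawl_policy(status))
```

Individual search engines may deviate from this in the details, for instance in how long they keep using a cached copy of the file.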
The Submit feature of the robots.txt testing tool makes it easy for you to have Google crawl and index a new robots.txt file for your site faster. Keep Google informed of changes to your robots.txt file by following the steps below:
- Click Submit in the lower right corner of the robots.txt file editor. This will open a "Submit" dialog box.
- Download your modified robots.txt code from the Robots.txt Testing Tool page by clicking the Download button in the Submit dialog box.
- Upload your new robots.txt file to the root of your domain as a text file named robots.txt. The URL of your robots.txt file must be /robots.txt.
- Click Validate Online Version to verify that the online robots.txt file is the version you want Google to crawl.
- Click Submit Online Version to notify Google that your robots.txt file has been modified and ask Google to crawl it.
- Verify that your latest version has been crawled successfully by refreshing the page in your browser to update the tool's editor and view your robots.txt file code online. Once the page is refreshed, you can also click on the drop-down menu above the text editor to display the timestamp that indicates when Google first saw the latest version of your robots.txt file.
The robots.txt allows you to forbid robots from accessing parts of your website, especially if an area of your site is private or if the content is not essential for search engines. Thus, the robots.txt is an essential tool for controlling how your pages are crawled.