How does Google crawl and interpret the robots.txt file?
If we want to appear in the search results, the first thing we must do is make sure Google and other search engines can interpret the website correctly. If they can’t read the content, or don’t know what they are allowed to read, how will they know which pages to show and which not to show?
When it comes to web indexation, one element we must take into consideration is the robots.txt file, through which we tell the robots which pages and files on our website they can access. It is a key aspect of designing any SEO strategy correctly.
What are robots.txt files?
The robots.txt file is a text file through which we tell robots (bots, crawlers…) which pages or files on a website they can request and which they cannot. Through this file we can “communicate” directly with the crawlers.
The robots.txt file is used mainly to avoid overloading the server with requests and to manage robot traffic on the website, because it indicates which content should and should not be crawled. Note that blocking a page here serves a different purpose than the “noindex” tag (explained below).
The robots.txt file is located at the root of the website (www.myweb.com/robots.txt).
How to implement robots.txt files?
The robots.txt file is a text file that is implemented in the root of the main domain, for example: www.myweb.com/robots.txt. This is where we will include different elements to tell the crawlers which pages should be crawled, and which should not.
Robots.txt files can be created in almost any text editor, as long as it can save standard UTF-8 text files.
How to implement or modify the robots.txt file in WordPress?
In WordPress, a robots.txt file is generally available by default, and SEO plugins such as Yoast SEO or Rank Math let us manage it.
We can edit this file in several ways: directly over FTP on our hosting, or through WordPress plugins such as Yoast SEO or Rank Math, which include a file editor. Bear in mind, however, that editing this file incorrectly can significantly affect the website's rankings, so it is important to understand what each parameter means and how it can affect our website.
In the case of using Rank Math in WordPress, for example, to edit the file we should go to Rank Math > General Settings > Edit Robots.txt files.
Aspects to take into account for a correct implementation
For a correct implementation, it is important to consider the following aspects, as highlighted by Google:
The file must be named robots.txt and there can only be one per website.
Each subdomain of a website can have its own robots.txt file.
The robots.txt files consist of one or more groups with specific directives (always one per line), which include:
to whom they are applied (user-agent)
the directories or files that this user-agent can access and those that it cannot.
By default, a user-agent can crawl every page that is not blocked by a disallow rule. Crawlers read the groups from top to bottom, and each crawler follows only one group: the first one that matches its user-agent most specifically.
If two rules conflict, Google and Bing always follow the directive with more characters. So if we have a disallow for /page/ and an allow for /page, the first one carries more weight because it is longer. If both rules have the same length, the less restrictive one prevails (via Ahrefs).
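As a rough illustration of group selection and conflicting rules, a file might look like this. The paths /page and /drafts/ are placeholders chosen for the example, not taken from any real site:

```text
# Applies to every crawler that has no more specific group of its own
User-agent: *
Disallow: /page/
Allow: /page

# Googlebot follows only this group and ignores the one above
User-agent: Googlebot
Disallow: /drafts/
```

For a crawler in the first group, /page/offer matches both rules. Disallow: /page/ is the longer one, so the URL is blocked, while /pageant matches only the allow rule and remains crawlable.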
Basic guide to robots.txt files. What parameters should we know?
Next, we are going to define some of the main elements that are important to know in order to interpret and implement robots.txt files:
User-agent: the name that identifies each crawler. It defines which crawler the directives of a group apply to, and must always be included in each group. Each search engine has its own: Google’s is called Googlebot, Bing’s is Bingbot and Baidu’s is Baiduspider (see the robots database). The asterisk (*) is also supported, and applies the directives to all crawlers.
Allow and disallow directives: these directives tell the user-agent which pages or files it may crawl (allow) and which it may not (disallow). Each group must include at least one of them.
Disallow: to block a page, specify its full path starting with /; to block a directory, end the path with /.
Allow: the allow directive overrides a disallow directive, and is often used to tell crawlers that they can crawl a specific section of a directory that disallow blocks.
Allow and disallow, when we give or deny access to robots in a personalized way
There are several things to consider when setting the allow and disallow directives in the file, since an incorrect implementation can affect how the page performs in the SERPs. So we must take into account the following cases.
If we leave the robots.txt file as follows, it will not block any directory:
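A minimal sketch of such a file, assuming a single group for all crawlers and an empty disallow value, so that nothing is blocked:

```text
User-agent: *
Disallow:
```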
But if, for some reason (not recommended unless there is a solid justification), we put only the slash (/), we block crawling of the entire website (and, in practice, it would drop out of search engines):
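The same single-group sketch, this time with only the slash as the disallow value, which blocks every URL on the site:

```text
User-agent: *
Disallow: /
```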
On the other hand, if we put a directory between the slashes (/…/), we block only that directory from the crawler, for example /wp-admin/. It is important to emphasize that if we leave out the final /, the robots will not crawl any URL that begins with /wp-admin.
Likewise, if within this exclusion there is a subdirectory or file that should still be crawled, we include it as allow:
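The difference the trailing slash makes can be sketched as follows. The alternative rule is left commented out, and /wp-admin-old/ is a made-up path used only to show the effect:

```text
User-agent: *
# Blocks /wp-admin/ and everything inside it
Disallow: /wp-admin/

# Without the final slash, the rule would also match
# paths such as /wp-admin-old/ or /wp-admin.html
# Disallow: /wp-admin
```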
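A common form of this pattern on WordPress sites is shown below. It is a sketch; whether admin-ajax.php needs to stay crawlable depends on the theme and plugins in use:

```text
User-agent: *
Disallow: /wp-admin/
# The allow rule is longer than the disallow rule, so it wins for this file
Allow: /wp-admin/admin-ajax.php
```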
Other elements to take into account for the robots.txt file
We have seen that, by combining user-agent with allow or disallow rules for directories or URLs, we can tell robots which parts of the website they can and cannot crawl.
Next, we detail some other elements that can be useful depending on the objectives of our project. Each website is different, though, so analyze carefully whether any of them applies to you and why.
The *, to indicate “any” in the robots crawl.
Using user-agent: * is the easiest way to address all robots (Google, Bing, Baidu…) at once. Since * is a wildcard, it means that the rules in the group apply to “any” robot.
So, if you use it in user-agent, every robot can crawl the entire website by default, and you only have to list, with disallow, the directories that you do not want them to access.
You can also use * in URLs, either at the beginning or in the middle, and it has the same meaning: all/any. With a wildcard in the path, for example, we can block any URL such as www.myweb.com/retail/red/jumper or www.myweb.com/retail/small/jumper.
Indicating the end of a URL with the $ symbol
To indicate the end of a URL to robots, you can use the $ symbol. For example, disallow: *.php$ blocks any URL that ends in .php, but still allows one like .php/whateverend. This is useful when we want some URLs of these files to remain crawlable.
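One possible rule for the /retail/ URLs mentioned above, assuming the segment between /retail/ and /jumper is the part that varies:

```text
User-agent: *
# * stands for any sequence of characters, so this matches
# /retail/red/jumper and /retail/small/jumper
Disallow: /retail/*/jumper
```

Because rules match from the start of the path, this also blocks longer URLs that begin the same way, such as /retail/red/jumper-blue.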
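A sketch of the $ anchor, written with a leading slash. The file names in the comments are illustrative only:

```text
User-agent: *
# Blocks /index.php, because the URL ends in .php
# Does not block /index.php/whateverend or /index.php?id=1
Disallow: /*.php$
```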
Preventing crawlers from visiting the website
Depending on our strategy (review it first), we may not want specific robots to crawl the site. This would be indicated as follows:
The #, to add comments
If we want to leave a comment that is not addressed to the robots, we do it with #, since robots ignore everything that follows # on a line.
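A minimal sketch, with Baiduspider picked purely as an example of a crawler to exclude. All other crawlers keep full access:

```text
# This crawler may not crawl anything
User-agent: Baiduspider
Disallow: /

# Every other crawler may crawl everything
User-agent: *
Disallow:
```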
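For instance, comments can sit on their own line or after a directive. The rule itself is only a placeholder:

```text
# Keep crawlers out of the admin area
User-agent: *
Disallow: /wp-admin/   # everything after # is ignored
```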
Disallow or noindex tag?
Before applying a disallow directive to a page, it is very important to assess whether it is really useful. As we have already mentioned, that decision will vary depending on our website and objectives.
Indexing content with the “robots” meta tag
With the “robots” meta tag, we can specify at the page level how content should be indexed in searches, for example with <meta name="robots" content="noindex"> when we do not want a page to be displayed in the results.
But for a robot to obey this directive, it has to be able to read it. So blocking in robots.txt a URL that carries the “noindex” tag is a mistake: the robots cannot read the page, and although it may be harder for them to find it, it may still be indexed after a while, since robots end up reaching it through links from other websites.
Which is our best option for URLs with parameters: “noindex” or disallow?
Leaving aside that it will vary depending on the objectives of each website, what we have to ask ourselves is:
Do we want the robots to spend time analyzing the URLs with parameters that are created when someone uses the website’s search engine?
Do we want the robots to spend time analyzing the URLs with parameters when someone filters products?
Depending on the answer, we will design our strategy.
Finally, even if we decide to block the searches that users perform on the website, we may want to make an exception for some specific ones, because they are terms of interest that can help us increase visibility. In this case, for example, we would put:
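One way this could look, assuming the site uses the WordPress-style ?s= search parameter and that “jumper” is the search term we want to keep crawlable. Both are assumptions made for illustration:

```text
User-agent: *
# Block internal search result pages
Disallow: /?s=
# Exception: the longer rule wins, so this search stays crawlable
Allow: /?s=jumper
```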
Blocking URLs with canonical tags
The canonical tag is used to avoid duplicate content on a website. These tags are sometimes used on parameterized URLs, whose content is very similar to the main product or category page, to prevent creating duplicate content.
Thus, if we block URLs with parameters in the robots.txt file, we prevent robots from reading this information and identifying which is the main page, especially when filtering products, as indicated by John Mueller (Google).
Although it is not “mandatory” to include the sitemap (the file that provides information about the structure of the website), it is advisable to list it, or them, in the robots.txt file, as this way we indicate to Google the content that we want crawled.
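The sitemap is declared with its full URL on a line of its own. The file name sitemap.xml below is an assumption, since the actual location depends on the site or plugin:

```text
User-agent: *
Disallow: /wp-admin/

# The Sitemap line is independent of the user-agent groups
Sitemap: https://www.myweb.com/sitemap.xml
```

If there are several sitemaps, each one goes on its own Sitemap line.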
How to check that we have correctly implemented the robots.txt file?
Finally, after weighing all these aspects, there is one last thing that, as always, we must do: check that we have implemented it correctly on the website.
We can check it in Google Search Console, either in the robots.txt report (which replaced the older “Robots.txt Tester”) or directly on each URL with the URL Inspection tool.
Do you want more information about how Google interprets the robots.txt file?
If you would like more information for your website about how Google interprets the robots.txt file, you can contact me.