Glossary entry: robots.txt

In the field of search engine optimization (SEO), various tools can help you understand how Google sees your domain. In addition, there are instructions that help Google find the right information on your website, so you can gain visibility step by step.

The robots.txt file is such an instruction.

What is robots.txt?

The robots.txt plays an essential role in online marketing and in SEO. Important tools for controlling indexing and crawling include robots directives in meta tags, the canonical tag, redirects and, last but not least, the robots.txt.

With the help of the robots.txt file, individual files in a directory, complete directories or entire domains can be excluded from crawling. The robots.txt is a file that is placed in the root directory of your website and is the first thing a bot requests when it visits the site.

Put simply: the robots.txt file is a guide for all crawlers and bots and gives them instructions on how to read the website. It can be used to exclude entire categories or only individual pages. You can thus specify, for example, that Googlebot may crawl all pages while the Yahoo bot may not crawl any of them.

There can be only one robots.txt file per domain. More precisely, the file applies per host, so each subdomain needs its own.
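Because the file always sits in the root directory, its address can be derived from any page URL. The following is a minimal sketch using Python's standard library; the function name `robots_url` and the example.org address are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.org/shop/item?id=3"))
# https://www.example.org/robots.txt
```

A page on a subdomain resolves to a different file, which is why each host needs its own robots.txt.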

Why do you need the robots.txt?

Basically, the goal of every website is to gain more visibility in search engines like Google and thus get more traffic. For this, all relevant URLs must be crawled by the search engine so that they can be included in the index. As is often the case, quality comes before quantity: even for Google, the largest search engine in the world, capturing and storing all websites and their content is a challenge. Therefore, each domain has only a limited crawl budget (number of crawled URLs per day), and it should be used wisely.

For a long time, robots.txt was not an official standard, yet the file has always been considered a must among webmasters and SEO specialists; using one was an unwritten rule.

Note: Google proposed a uniform standard in 2019, and the Robots Exclusion Protocol was published as RFC 9309 in September 2022.

robots.txt file - structure & contents

A robots.txt file can contain any number of commands, but each complete command consists of two elements that always belong together: the user agent and the command itself.

The user agent names the bot or crawler to which the following command applies.

Below comes the command itself:

Disallow: excludes the affected files

or Allow: includes the affected files.
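To make the pairing of user agent and command concrete, here is a minimal sketch that groups the lines of a robots.txt file into (user agents, rules) pairs. The function name `parse_groups` and the sample paths are assumptions for illustration; real crawlers handle more fields and edge cases.

```python
def parse_groups(text):
    """Group robots.txt lines into (user agents, rules) pairs."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule was already seen, so a new group starts here
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


SAMPLE = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Allow: /
"""

for agents, rules in parse_groups(SAMPLE):
    print(agents, rules)
```

Each group lists the bots it addresses first, followed by the Disallow and Allow rules that apply to them.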

Submitting the robots.txt

You can create the robots.txt file with any text editor and then test it for errors in the Google Search Console. Errors in the syntax must be avoided at all costs; to check for them, the file can be analyzed in the Google Search Console under "Status" -> "Blocked URLs".
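Simple syntax slips can also be caught locally before uploading the file. The sketch below is a deliberately small check, not a replacement for the Search Console test; the function name `lint_robots` and the short list of known fields are assumptions, and some crawlers accept additional fields.

```python
# Assumption: only these four fields count as known in this sketch.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}


def lint_robots(text):
    """Return a list of (line number, message) for obvious syntax problems."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, _value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_agent = True
        elif field in ("allow", "disallow") and not seen_agent:
            problems.append((number, "rule before any User-agent line"))
    return problems


print(lint_robots("Useragent: *\nDisallow: /temp/\n"))
```

A misspelled field such as "Useragent" is reported twice here: once as an unknown field and once because the following rule has no user agent to attach to.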

Beware: Different crawlers can interpret the syntax differently.

Beware: If a page is blocked in the robots.txt file, it can still be indexed when other websites link to it. Although crawlers take the robots.txt file into account, disallowed URLs can still be found through other means, indexed and shown in Google's search results. If you want to prevent this, you can password-protect files on your server or use noindex meta tags or response headers.

The Google Search Console is an important companion for setting up the robots.txt, monitoring crawling and submitting the sitemap.
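As an illustration of the response-header variant, the following sketch is a tiny WSGI application that adds an "X-Robots-Tag: noindex" header for one directory. The path prefix /internal/ and the function name `app` are assumptions; the meta tag equivalent would be `<meta name="robots" content="noindex">` in the page's head. Note that a crawler can only see either signal if the page is not blocked in robots.txt.

```python
def app(environ, start_response):
    """Minimal WSGI app: mark everything under /internal/ as noindex."""
    headers = [("Content-Type", "text/html; charset=utf-8")]
    # Hypothetical rule: pages below /internal/ should stay out of the index.
    if environ.get("PATH_INFO", "").startswith("/internal/"):
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>demo</title>"]


# Call the app directly with a fake request to inspect the headers.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers


app({"PATH_INFO": "/internal/report.html"}, fake_start_response)
print(captured["headers"])
```

In production the header is usually set in the web server configuration; the Python version only shows which header is sent and when.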

robots.txt - example

User-agent: UniversalRobot/1.0
User-agent: my-robot
Disallow: /sources/dtd/

User-agent: *
Disallow: /nonsense/
Disallow: /temp/
Disallow: /newsticker.shtml
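The effect of these rules can be checked with Python's standard `urllib.robotparser` module. This is a sketch under assumptions: the host www.example.org, the helper `allowed` and the bot name SomeOtherBot are made up for illustration, and the lines of each group are written directly below one another, because this parser treats a blank line as the end of a group.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
User-agent: UniversalRobot/1.0
User-agent: my-robot
Disallow: /sources/dtd/

User-agent: *
Disallow: /nonsense/
Disallow: /temp/
Disallow: /newsticker.shtml
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())


def allowed(agent, path):
    """Ask the parser whether `agent` may fetch `path` on the assumed host."""
    return parser.can_fetch(agent, "https://www.example.org" + path)


print(allowed("my-robot", "/sources/dtd/html.dtd"))  # False: own group blocks it
print(allowed("my-robot", "/temp/cache.html"))       # True: "*" group is not used
print(allowed("SomeOtherBot", "/temp/cache.html"))   # False: falls back to "*"
print(allowed("SomeOtherBot", "/index.html"))        # True: no rule matches
```

A bot that finds a group addressed to it by name follows only that group, which is why my-robot may still read /temp/ here.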

Top User Agents - Designations:

Crawler = User-agent
Google = Googlebot
Google Image Search = Googlebot-Image
Bing = Bingbot
Yahoo = Slurp
MSN = Msnbot

Wildcards

The original Robots Exclusion Protocol does not provide for regular expressions, but Google, the largest search engine, supports two wildcard characters in paths.

Example:

* is a placeholder for any sequence of characters.

Pages whose URLs contain ".pdf" are not retrieved by the Googlebot.

User-agent: *
Disallow: /*.pdf

$ marks the end of the URL, so a rule with $ only matches URLs that end with the given pattern.

Content whose URLs end with ".pdf" will not be retrieved by Googlebot.

User-agent: *
Disallow: /*.pdf$
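How the two wildcard characters behave can be sketched by translating a path pattern into a regular expression. The helper names `pattern_to_regex` and `matches` and the sample paths are assumptions for illustration; this is not Google's implementation, only a model of the matching described above.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any character run.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """True if the pattern matches the path from its first character on."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf", "/files/report.pdf"))              # True
print(matches("/*.pdf", "/files/report.pdf?download=1"))   # True
print(matches("/*.pdf$", "/files/report.pdf?download=1"))  # False
print(matches("/*.pdf$", "/files/report.pdf"))             # True
```

Without $, the rule also catches URLs that merely contain ".pdf"; with $, only URLs ending in ".pdf" are affected.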

Define exceptions

In addition to the "Disallow" command, Googlebot also understands "Allow" in robots.txt, which makes it possible to define exceptions for blocked directories:

robots.txt - example:

User-agent: Googlebot
Disallow: /news/
Allow: /news/index.html

This combination of Disallow and Allow commands tells Googlebot that there is an exception: although the /news/ directory is blocked, it is allowed to read this particular file.
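Google resolves such conflicts by applying the most specific (longest) matching rule, with Allow winning a tie. The sketch below models only that idea for plain path prefixes; the function name `is_allowed` and the test paths are assumptions, and wildcards are left out to keep it short. Parsers that instead apply rules in file order can give a different answer for the same file.

```python
def is_allowed(rules, path):
    """Longest-match evaluation of (kind, prefix) rules for one user agent."""
    best_length, allowed = -1, True  # no matching rule means "allowed"
    for kind, prefix in rules:
        # An empty prefix (e.g. "Disallow:" with no value) blocks nothing.
        if prefix and path.startswith(prefix):
            longer = len(prefix) > best_length
            tie_for_allow = len(prefix) == best_length and kind == "allow"
            if longer or tie_for_allow:
                best_length, allowed = len(prefix), kind == "allow"
    return allowed


RULES = [("disallow", "/news/"), ("allow", "/news/index.html")]

print(is_allowed(RULES, "/news/index.html"))    # True: the Allow rule is longer
print(is_allowed(RULES, "/news/article.html"))  # False: only Disallow matches
print(is_allowed(RULES, "/about.html"))         # True: no rule matches
```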

Sitemap

Besides instructions on crawling behavior, the robots.txt can also point to the sitemap.

robots.txt - example:

User-agent: *
Disallow: /temp/

Sitemap: http://www.beispiel.de/sitemap.xml
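The sitemap reference can be read back with Python's standard `urllib.robotparser` module, whose `site_maps()` method is available from Python 3.8. This is a small sketch that parses the example text directly instead of fetching it from a server; the variable names are assumptions.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /temp/

Sitemap: http://www.beispiel.de/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file names none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.beispiel.de/sitemap.xml']
```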

robots.txt - use in SEO area

With a robots.txt file in the root directory, webmasters can tell all search engines and their robots (user agents) which pages they may crawl and which they should leave out.

Search engines respect blocked pages and usually do not crawl them. However, it must be remembered that there are also "evil" crawlers that simply ignore the robots.txt. The robots.txt is therefore not binding and can be bypassed: it is an instruction that crawlers do not have to follow. Nor does it guarantee that a blocked page will be excluded from indexing.
