Technically, robots.txt is a simple text file placed in the root directory of a website (e.g., www.example.com/robots.txt). It adheres to the Robots Exclusion Standard, a set of guidelines for how web crawlers should behave when visiting a website. The file contains instructions in the form of "directives" that tell bots which parts of the website they can and cannot crawl.
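For illustration, a minimal robots.txt might look like the sketch below. The paths and the www.example.com domain are placeholders, and the Disallow/Allow rules are grouped under a User-agent line naming the crawler they apply to (here `*`, meaning all bots). Each directive is described in the table that follows.

```
User-agent: *
Disallow: /admin/
Allow: /public/
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
```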
| Directive | Description | Example |
|---|---|---|
| Disallow | Specifies paths or patterns that the bot should not crawl. | Disallow: /admin/ (disallow access to the admin directory) |
| Allow | Explicitly permits the bot to crawl specific paths or patterns, even if they fall under a broader Disallow rule. | Allow: /public/ (allow access to the public directory) |
| Crawl-delay | Sets a delay (in seconds) between successive requests from the bot to avoid overloading the server. | Crawl-delay: 10 (10-second delay between requests) |
| Sitemap | Provides the URL of an XML sitemap for more efficient crawling. | Sitemap: https://www.example.com/sitemap.xml |
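To see how these directives are interpreted in practice, here is a short sketch using Python's standard `urllib.robotparser` module to check URLs against a site's robots.txt. The www.example.com URLs and paths are the same illustrative placeholders used above, not a real site.

```python
import urllib.robotparser

# Load and parse the site's robots.txt (illustrative URL).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether a generic crawler ("*") may fetch specific paths.
print(rp.can_fetch("*", "https://www.example.com/admin/page.html"))   # False if /admin/ is disallowed
print(rp.can_fetch("*", "https://www.example.com/public/page.html"))  # True if /public/ is allowed

# Crawl-delay and Sitemap values, if declared (Python 3.6+ / 3.8+).
print(rp.crawl_delay("*"))
print(rp.site_maps())
```

Well-behaved crawlers perform a check like this before requesting each URL.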