Robots.txt file and Googlebot crawlability

Question

Will this robots.txt allow Googlebot to crawl my site or not?
Disallow: /
User-agent: Robozilla
Disallow: /
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: https://koyal.pk/sitemap/sitemap.xml
Answer 1
Score: 1
If you want to know how Google will react to a robots.txt file, you should get an official answer by testing it in Google's robots.txt testing tool. Here are the results of such a test using the robots.txt that you provided:
Googlebot will be able to crawl the site; however, Google tells you that the robots.txt syntax you are using has a problem. I see several problems:
- A Disallow directive must have a User-agent directive somewhere above it.
- There should be a new line before each User-agent directive (except the one at the beginning of the file).
- The Disallow: line means "allow all crawling". It should only be used if there are no other Disallow rules (the sketch below demonstrates this pitfall).
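As an illustration (not part of the original answer), here is a minimal sketch using Python's standard urllib.robotparser module to parse the file exactly as posted. Its edge-case handling differs from Google's own parser (Google, for instance, gives precedence to the longest matching rule), so treat this as a demonstration of the pitfalls rather than an authoritative verdict:

from urllib.robotparser import RobotFileParser

# The robots.txt exactly as posted in the question.
original = """\
Disallow: /
User-agent: Robozilla
Disallow: /
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: https://koyal.pk/sitemap/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(original.splitlines())

# The leading "Disallow: /" has no User-agent above it, so this parser
# simply drops it, and Googlebot may still crawl the site:
print(parser.can_fetch("Googlebot", "https://koyal.pk/"))            # True

# The empty "Disallow:" line ("allow all") matches before
# "Disallow: /cgi-bin/", so under this parser the cgi-bin rule
# never takes effect:
print(parser.can_fetch("Googlebot", "https://koyal.pk/cgi-bin/app")) # True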
A syntactically correct robots.txt that I think would do what you intend is:
User-agent: Robozilla
Disallow: /

User-agent: *
Disallow: /cgi-bin/

Sitemap: https://koyal.pk/sitemap/sitemap.xml
That would prevent the Robozilla bot from crawling while allowing all other bots (including Googlebot) to crawl everything except the /cgi-bin/ directory.
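As a quick sanity check under the same stdlib parser (again, a sketch rather than Google's official tool, and the example paths are made up for illustration), the corrected file behaves as intended:

from urllib.robotparser import RobotFileParser

# The corrected robots.txt from above.
fixed = """\
User-agent: Robozilla
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(fixed.splitlines())

print(parser.can_fetch("Robozilla", "https://koyal.pk/"))             # False: blocked entirely
print(parser.can_fetch("Googlebot", "https://koyal.pk/some-page"))    # True: allowed
print(parser.can_fetch("Googlebot", "https://koyal.pk/cgi-bin/app"))  # False: /cgi-bin/ excluded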