Robots.txt - blocking bots from adding to cart in WooCommerce
Question
I'm not sure how good Google's robots.txt tester is, and I'm wondering if the following example of my robots.txt for my WooCommerce site will actually do the trick for blocking bots from adding to cart and crawling cart pages, while allowing good bots like Google to crawl the site and also blocking some bots that have been causing resource usage problems. Here's my example below with my comments (the comments are not included in the actual robots.txt file):
**block some crawlers that were causing resource issues (do I need a separate "Disallow: /" for each one?)
User-agent: Baiduspider
User-agent: Yandexbot
User-agent: MJ12Bot
User-agent: DotBot
User-agent: MauiBot
Disallow: /
**allow all other bots
User-agent: *
Allow: /
**drop all allowed bots from adding to cart and crawling cart pages
Disallow: /*add-to-cart=*
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
I put this through Google's robots.txt checker and it came out with one warning on the crawl-delay, telling me it would be ignored.
Answer 1
Score: 1
Baidu and Yandex are actual search engines from China and Russia respectively. I wouldn't recommend blocking them because they can send legitimate traffic to your site. I would remove:
User-agent: Baiduspider
User-agent: Yandexbot
Your allow rule is totally unnecessary. By default crawling is allowed unless there is a matching Disallow rule. Allow: should only be used to add a more specific exception to a Disallow: rule. Most robots can't deal with it, so if you don't need it, you should take it out so that you don't confuse them.
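For illustration, the kind of place where Allow: does earn its keep is as a narrower exception carved out of a broader Disallow: (a hypothetical sketch; the lost-password path is just an example, not something from the question):
User-agent: *
Disallow: /my-account/
Allow: /my-account/lost-password/
Crawlers that support Allow:, such as Googlebot, apply the more specific (longer) rule, so the lost-password page stays crawlable while the rest of /my-account/ is blocked; bots that don't support it simply ignore the Allow: line.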
crawl-delay: 10 is way too slow for anything other than sites with just a handful of pages. Most ecommerce sites need bots to be able to crawl multiple pages per second to get their site indexed in search engines. While Google ignores this directive, setting a long crawl delay like this will prevent other search engines like Bing from effectively crawling and indexing your site.
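If the goal is just to slow down a particular crawler rather than everyone, one option is to give that bot its own group with a much smaller delay (a sketch; Bingbot is used here only as an example of a crawler that honors crawl-delay, which Google does not). Keep in mind that a crawler obeys only the most specific group that matches it, so the Disallow rules have to be repeated in that group:
User-agent: Bingbot
Crawl-delay: 2
Disallow: /*add-to-cart=*
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/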
Most robots don't treat * as a wildcard in Disallow: rules. The major search engine bots will understand that rule, but few other bots will.
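For the bots that do support wildcards (Googlebot and Bingbot, for instance), the pattern is matched against the URL path plus query string, so the rule from the question catches WooCommerce add-to-cart links wherever they appear (illustrative URLs only):
Disallow: /*add-to-cart=*
For a wildcard-aware crawler this blocks URLs such as /?add-to-cart=123 and /shop/sample-product/?add-to-cart=123. A bot that reads the value as a literal path prefix matches nothing at all, so for it the rule is a harmless no-op rather than a problem.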
Other than that, your robots.txt looks like it should do what you want. Google's testing tool is a good place to verify that Googlebot will do what you want.
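Putting those suggestions together, a trimmed version might look something like this (a sketch only, using # for comments, which robots.txt does allow in the actual file; adjust the blocked user agents to whatever is actually causing you trouble):
# block only the crawlers that were causing resource problems
User-agent: MJ12Bot
User-agent: DotBot
User-agent: MauiBot
Disallow: /

# everyone else may crawl the site, but not the cart, checkout or account pages
User-agent: *
Disallow: /*add-to-cart=*
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/

Sitemap: https://www.example.com/sitemap.xml
Per the robots.txt spec, consecutive User-agent lines share one group of rules, so a single Disallow: / covers all three blocked bots and doesn't need repeating.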