Robots.txt - blocking bots from adding to cart in WooCommerce
Question
I'm not sure how good Google's robots.txt tester is, and I'm wondering if the following example of my robots.txt for my WooCommerce site will actually do the trick for blocking bots from adding to cart and crawling cart pages, while allowing good bots like Google to crawl the site and also blocking some bots that have been causing resource usage problems. Here's my example below with my comments (the comments are not included in the actual robots.txt file):
**block some crawlers that were causing resource issues (do I need a separate "Disallow: /" for each one?)
User-agent: Baiduspider
User-agent: Yandexbot
User-agent: MJ12Bot
User-agent: DotBot
User-agent: MauiBot
Disallow: /
**allow all other bots
User-agent: *
Allow: /
**drop all allowed bots from adding to cart and crawling cart pages
Disallow: /*add-to-cart=*
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
I put this through Google's robots.txt checker and it came out with one warning on the crawl-delay, telling me it would be ignored.
Answer 1
Score: 1
Baidu and Yandex are actual search engines from China and Russia respectively. I wouldn't recommend blocking them because they can send legitimate traffic to your site. I would remove:
User-agent: Baiduspider
User-agent: Yandexbot
Your allow rule is totally unnecessary. By default crawling is allowed unless there is a matching Disallow rule. Allow: should only be used to add a more specific exception to a Disallow: rule. Most robots can't deal with it, so if you don't need it, you should take it out so that you don't confuse them.
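For illustration, the kind of place where Allow: does earn its keep is as a narrower exception carved out of a broader Disallow: (a hypothetical sketch; the lost-password path is just an example, not something from the question):
User-agent: *
Disallow: /my-account/
Allow: /my-account/lost-password/
Crawlers that support Allow:, such as Googlebot, apply the more specific (longer) rule, so the lost-password page stays crawlable while the rest of /my-account/ is blocked; bots that don't support it simply ignore the Allow: line.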
crawl-delay: 10 is way too slow for anything other than sites with just a handful of pages. Most ecommerce sites need bots to be able to crawl multiple pages per second to get their site indexed in search engines. While Google ignores this directive, setting a long crawl delay like this will prevent other search engines like Bing from effectively crawling and indexing your site.
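If the goal is just to slow down a particular crawler rather than everyone, one option is to give that bot its own group with a much smaller delay (a sketch; Bingbot is used here only as an example of a crawler that honors crawl-delay, which Google does not). Keep in mind that a crawler obeys only the most specific group that matches it, so the Disallow rules have to be repeated in that group:
User-agent: Bingbot
Crawl-delay: 2
Disallow: /*add-to-cart=*
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/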
Most robots don't treat * as a wildcard in Disallow: rules. The major search engine bots will understand that rule, but few other bots will.
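For the bots that do support wildcards (Googlebot and Bingbot, for instance), the pattern is matched against the URL path plus query string, so the rule from the question catches WooCommerce add-to-cart links wherever they appear (illustrative URLs only):
Disallow: /*add-to-cart=*
For a wildcard-aware crawler this blocks URLs such as /?add-to-cart=123 and /shop/sample-product/?add-to-cart=123. A bot that reads the value as a literal path prefix matches nothing at all, so for it the rule is a harmless no-op rather than a problem.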
Other than that, your robots.txt looks like it should do what you want. Google's testing tool is a good place to verify that Googlebot will do what you want.
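Putting those suggestions together, a trimmed version might look something like this (a sketch only, using # for comments, which robots.txt does allow in the actual file; adjust the blocked user agents to whatever is actually causing you trouble):
# block only the crawlers that were causing resource problems
User-agent: MJ12Bot
User-agent: DotBot
User-agent: MauiBot
Disallow: /

# everyone else may crawl the site, but not the cart, checkout or account pages
User-agent: *
Disallow: /*add-to-cart=*
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/

Sitemap: https://www.example.com/sitemap.xml
Per the robots.txt spec, consecutive User-agent lines share one group of rules, so a single Disallow: / covers all three blocked bots and doesn't need repeating.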