March 8, 2023 18:53:53

How can I block Google from crawling things it thinks are URLs from __NEXT_DATA__?

Question

My __NEXT_DATA__ sometimes includes content that looks like a URL. Google then crawls it and reports 404 errors in Google Search Console.

Is there a simple way to block Google from looking into __NEXT_DATA__?
I tried using:  but apparently it doesn't really work.

Edit: I even tried base64-encoding it, but by then it has already been rendered and sent to the client. If I change it, I get a hydration error.

Answer 1

Score: 1

I've written about this Google behavior on the Webmasters stack site: Google follows JavaScript string as relative path - produces 404 error. I find this feature of Googlebot to be annoying, but it doesn't seem to hurt your site when it happens.

Google doesn't penalize sites for having 404 errors. In fact, Google prefers when sites appropriately return 404 errors for URLs that are not supposed to contain content.

Google's crawl budget is very large. Googlebot is usually willing to crawl ten or even one hundred times as many pages as it indexes. Your crawl budget will increase as your site gains reputation (in the form of links from other sites). I wouldn't worry about this eating into your crawl budget unless Googlebot is trying to crawl thousands of these URLs.

Google needs to be able to access your JavaScript to render and index your site. If Google can't see your hydrated pages, it won't be able to index your content, so you shouldn't try to block your JavaScript from Googlebot. The only way to do it would be to put this JavaScript in a separate .js file and block that file with robots.txt. See Preventing robots from crawling specific part of a page.

One way to prevent this is to change your JavaScript so the content looks less like URLs. Googlebot appears to crawl string literals that look like URL paths, meaning those that contain a / or end in .html. If your own JS code is triggering these heuristics, you can rewrite it to break up the string literals. For example, use var foo = "/"+"path"+"/"+"file"+"."+"html" rather than var foo = "/path/file.html".
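Since __NEXT_DATA__ is serialized JSON rather than hand-written code, string concatenation cannot be used there directly. A rough sketch of the same idea applied to data is to store path-like values as arrays of pieces and join them only at render time. The helper names encodePath and decodePath are hypothetical, the sketch assumes absolute paths without a trailing slash, and whether Googlebot's heuristics actually ignore data in this shape is an assumption that has not been verified.

```javascript
// Hypothetical helpers (not part of Next.js): keep path-like values out of
// the serialized props as plain strings such as "/path/file.html".

// "/path/file.html" -> [["path"], ["file", "html"]]
// Splits on "/" and then on ".", so no piece contains a slash or an extension.
function encodePath(path) {
  return path
    .split("/")
    .filter((segment) => segment.length > 0)
    .map((segment) => segment.split("."));
}

// [["path"], ["file", "html"]] -> "/path/file.html"
// Assumes the original path was absolute and had no trailing slash.
function decodePath(pieces) {
  return "/" + pieces.map((parts) => parts.join(".")).join("/");
}

// What would end up in the serialized props: no "/" and no ".html" literal.
const serialized = JSON.stringify({ link: encodePath("/path/file.html") });
console.log(serialized);

// What a component would compute at render time from those props.
console.log(decodePath(JSON.parse(serialized).link));
```

Because the server and the client would decode the same props in the same way, the rendered output should match on both sides, which is the situation the hydration check expects.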

Another thing that would help is rendering your site server side. See Rendering: Fundamentals | Next.js. That will cause the initial page load to have rendered HTML rather than being built from __NEXT_DATA__. Rendering your site server side can also have additional performance and SEO benefits.
