How can I block Google from crawling things it thinks are URLs from __NEXT_DATA__?

Question

My __NEXT_DATA__ sometimes includes content that looks like a URL. Google then crawls it and reports a 404 error in Google Search Console.

Is there a simple way to block Google from looking into __NEXT_DATA__?
I tried using: <!--googleoff: index--> but apparently it doesn't really work.

Edit: I even tried to base64-encode it, but by then it has already been rendered and sent to the client. If I change it, I get a hydration error.

Answer 1

Score: 1

I've written about this Google behavior on the Webmasters Stack Exchange site: "Google follows JavaScript string as relative path - produces 404 error". I find this Googlebot behavior annoying, but it doesn't seem to hurt your site when it happens.

Google doesn't penalize sites for having 404 errors. In fact, Google prefers when sites appropriately return 404 errors for URLs that are not supposed to contain content.

Google's crawl budget is very large. Googlebot is usually willing to crawl ten to one hundred times as many pages as it indexes. Your crawl budget will increase as your site gains reputation (in the form of links from other sites). I wouldn't worry about this eating into your crawl budget unless Googlebot is trying to crawl thousands of these URLs.

Google needs to be able to access your JavaScript to render and index your site. If Google can't see your hydrated pages, it won't be able to index your content, so you shouldn't try to block your JavaScript from Googlebot. The only way to do it would be to put this JavaScript in a separate .js file and block that file with robots.txt. See "Preventing robots from crawling specific part of a page".

One way to prevent this is to change your JavaScript so the content looks less like URLs. Googlebot seems to crawl string literals that look like URL paths: those that contain a / or end in .html. If it is your own JS code that is triggering these heuristics, you can rewrite it to break up your string literals. For example, use var foo = "/"+"path"+"/"+"file"+"."+"html" rather than var foo = "/path/file.html".
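
For values that end up in __NEXT_DATA__ as data rather than as your own JS literals, string concatenation is not an option, because the props are serialized as JSON. Here is a minimal sketch of the same idea, assuming you control the code that builds the props: encode path-like strings before they enter the props and decode them where they are read, so the server render and the client render see identical data. The helper names encodePathLike and decodePathLike are hypothetical and not part of Next.js, and whether Googlebot ignores such strings is an assumption based on the heuristics described above.

```javascript
// Hypothetical helpers (illustrative names, not a Next.js API).

// Encode a path-like string with the base64url alphabet, so the result
// contains no "/" and no "." that a crawler could mistake for a URL path.
// Intended to run where the props are built, for example in getServerSideProps.
function encodePathLike(value) {
  const bytes = new TextEncoder().encode(value); // UTF-8 bytes
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary)
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
}

// Reverse of encodePathLike. Intended to run inside the component, which
// executes with the same encoded prop on the server and on the client.
function decodePathLike(encoded) {
  const base64 = encoded.replace(/-/g, "+").replace(/_/g, "/");
  const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
  const binary = atob(padded);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

// Usage: the encoded form is what would be serialized into the props.
const encoded = encodePathLike("/path/file.html");
console.log(decodePathLike(encoded)); // "/path/file.html"
```

Because the encoded string is what both the server and the client receive, decoding it inside the component should not by itself cause a hydration mismatch, unlike rewriting the data after the page has been rendered.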

Another thing that would help is rendering your site server side. See Rendering: Fundamentals | Next.js. That will cause the initial page load to have rendered HTML rather than being built from __NEXT_DATA__. Rendering your site server side can also have additional performance and SEO benefits.

huangapple
  • Posted on March 8, 2023, 18:53:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75672103.html