How can I block Google from crawling things it thinks are URLs from __NEXT_DATA__?
Question
My __NEXT_DATA__ sometimes includes content that looks like a URL. Google then crawls it and reports a 404 error in Google Search Console.
Is there a simple way to block Google from looking into __NEXT_DATA__?
I tried using <!--googleoff: index--> but it doesn't seem to work.
Edit: I even tried to base64-encode it, but by then it has already been rendered and sent to the client. If I change it, I get a hydration error.
Answer 1
Score: 1
I've written about this Google behavior on the Webmasters stack site: Google follows JavaScript string as relative path - produces 404 error. I find this feature of Googlebot to be annoying, but it doesn't seem to hurt your site when it happens.
Google doesn't penalize sites for having 404 errors. In fact, Google prefers when sites appropriately return 404 errors for URLs that are not supposed to contain content.
Google's crawl budget is very large. Googlebot is usually willing to crawl ten or one hundred times as many pages as it indexes. Your crawl budget will increase as your site gains reputation (in the form of links from other sites.) I wouldn't worry about this eating into your crawl budget unless Googlebot is trying to crawl thousands of these URLs.
Google needs to be able to access your JavaScript to render and index your site. If Google can't see your hydrated pages, it won't be able to index your content. You shouldn't try to block your JavaScript from Googlebot. The only way to do it would be to put this JavaScript in a separate .js file and block it with robots.txt. See Preventing robots from crawling specific part of a page.
One way to prevent this is to change your JavaScript so the content looks less like URLs. Google seems to crawl string literals that look like a URL path, meaning they contain a / or end in .html. If it is your own JS code that is triggering these heuristics, you can rewrite it to break up your string literals. For example, use var foo = "/"+"path"+"/"+"file"+"."+"html" rather than var foo = "/path/file.html".
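For values that end up serialized into __NEXT_DATA__, concatenating literals in your source won't help, because the JSON contains the final string. A rough sketch of the same idea applied to data: store the path as separate pieces and rebuild it on the client. The helper names toParts and fromParts are made up for illustration, and I haven't confirmed that this is enough to avoid Googlebot's heuristics.

```javascript
// Hypothetical helpers (not part of Next.js): keep a path-like value out of
// the serialized props as a single "/path/file.html"-style string.

// Split a path into its segments and a separate file extension.
function toParts(path) {
  const segments = path.split("/").filter((s) => s.length > 0);
  const last = segments.pop() ?? "";
  const dot = last.lastIndexOf(".");
  // Only treat the text after the last dot as an extension if the dot is not
  // the first character (so ".htaccess" stays a plain name).
  const name = dot > 0 ? last.slice(0, dot) : last;
  const ext = dot > 0 ? last.slice(dot + 1) : "";
  return { segments: [...segments, name], ext };
}

// Rebuild the original path on the client, after hydration data is parsed.
function fromParts({ segments, ext }) {
  const joined = "/" + segments.join("/");
  return ext ? joined + "." + ext : joined;
}

// The serialized form contains neither a "/" nor ".html".
const example = toParts("/path/file.html");
console.log(JSON.stringify(example));
console.log(fromParts(example));
```

You would pass the output of toParts in your page props and call fromParts in the component. Note that this sketch drops trailing slashes, so "/a/" comes back as "/a".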
Another thing that would help is rendering your site server side. See Rendering: Fundamentals | Next.js. That will cause the initial page load to have rendered HTML rather than being built from __NEXT_DATA__. Rendering your site server side can also have additional performance and SEO benefits.