How does Google process 600k documents in 0.33 seconds?

Question

Regardless of how fast their CPUs are, it seems impossible to process that many documents in 0.33 seconds.

So I believe that it comes down to horizontal scaling. As a guess, how many servers were involved in this query, which processed 600k documents in under a second?

Answer 1

Score: 1

Google doesn't process that many documents that quickly. Google pre-processes the documents well before you do your search. Google maintains a "search index" that is used to produce the list of search results.

You can think of a search index like the index in a paper book. For each word, it lists the pages on the internet that use it. For a query, Google looks up each word of your query in the search index and builds the list of results from those entries.
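
To make the book-index analogy concrete, here is a minimal sketch of an inverted index in Python. It is not how Google implements its index. The page names, the build_index and search helpers, and the choice to return only pages that contain every query word are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page IDs that contain it (done ahead of time)."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the pages that contain every word in the query (assumed AND semantics)."""
    words = query.lower().split()
    if not words:
        return set()
    # One lookup per query word; no page text is read at query time.
    postings = [index.get(word, set()) for word in words]
    # Start from the smallest set so the intersections stay cheap.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical toy corpus, purely for illustration.
PAGES = {
    "a.example/1": "how search engines build an index",
    "b.example/2": "index funds for beginners",
    "c.example/3": "search engines crawl the web",
}
INDEX = build_index(PAGES)
```

Calling search(INDEX, "search engines") touches only two index entries, however many pages were indexed. That is why the query itself can be fast even though building the index is slow.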

For reference: What Is A Search Index And How Does It Work? - AddSearch

Google also has a lot of computers and does a ton of horizontal scaling. It has horizontal scaling for each of the stages of building the search index and displaying search results:

Crawling (Googlebot is a horizontally distributed web crawler)
Relevancy (Deciding how important each word is to the page)
Indexing (Creating the search index)
Reputation (Calculating how trusted each site and each page should be)
Spam and fraud detection (Deciding what shouldn't be in the index)
Queries (against the search index)
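
As a rough illustration of horizontal scaling at the query stage, the sketch below sends one query to several index shards in parallel and merges the partial results. Splitting the index by page, the SHARDS data, and the search_shard and search_all helpers are assumptions for the example, not a description of Google's actual architecture. Threads stand in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, words):
    """Look up every query word in one shard's index and intersect the hits."""
    postings = [shard.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


def search_all(shards, query):
    """Send the query to every shard in parallel and merge the partial results."""
    words = query.lower().split()
    merged = set()
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        for partial in pool.map(lambda shard: search_shard(shard, words), shards):
            merged |= partial
    return merged


# Hypothetical shards: each dict stands in for one server's slice of the index.
SHARDS = [
    {"search": {"a.example/1"}, "index": {"a.example/1", "b.example/2"}},
    {"search": {"c.example/3"}, "engines": {"c.example/3"}},
]
```

Each shard only does dictionary lookups on its own slice, so adding shards spreads the index across more machines without any of them reading page text at query time.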

But there is no amount of horizontal scaling that would allow search engines to process documents in real time based on your search query.
