How does Google process 600K documents in 0.33 seconds?

Question

Regardless of how fast their CPUs are, it seems impossible to process that many documents in 0.33 seconds.

So I believe it comes down to horizontal scaling. As a guess, how many servers were involved in this query to process 600K documents in under a second?

Answer 1

Score: 1

Google doesn't process that many documents at query time. It pre-processes the documents well before you run your search, and it maintains a "search index" that is used to produce the list of search results.

You can think of a search index as being like the index at the back of a paper book. For each word, it lists the pages on the internet that use it. For a query, the search engine looks up each word of the query in the search index and builds the list of results from those entries.
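To make the book-index analogy concrete, here is a minimal sketch of such a word-to-pages index in Python. The page URLs, the `build_index` and `search` names, and the rule that a page must contain every query word are all assumptions made for illustration; this is not how Google's index is actually implemented.

```python
from collections import defaultdict

# Hypothetical toy corpus: page URL -> page text (illustrative only).
PAGES = {
    "a.example/cats": "cats chase mice",
    "b.example/dogs": "dogs chase cats",
    "c.example/fish": "fish swim",
}

def build_index(pages):
    """Pre-processing step: map each word to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Query step: look up each query word and intersect the page sets.

    Requiring every word to match is an assumption of this sketch.
    """
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

INDEX = build_index(PAGES)  # built ahead of time, not once per query
print(sorted(search(INDEX, "chase cats")))
```

The expensive part, reading every page, happens once in `build_index`. Each query only touches the entries for its own words, which is why the number of matching documents says little about how much work the query did.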

For reference: What Is A Search Index And How Does It Work? - AddSearch

Google also has a lot of computers and does a ton of horizontal scaling. It has horizontal scaling for each of the stages of building the search index and displaying search results:

  • Crawling (Googlebot is a horizontally distributed web crawler)
  • Relevancy (Deciding how important each word is to the page)
  • Indexing (Creating the search index)
  • Reputation (Calculating how trusted each site and each page should be)
  • Spam and fraud detection (Deciding what shouldn't be in the index)
  • Queries (against the search index)

But there is no amount of horizontal scaling that would allow search engines to process documents in real time based on your search query.
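As a rough sketch of what horizontal scaling of the query stage might look like, the example below splits a pre-built index into shards and asks all of them in parallel. The shard contents and the `search_shard` and `search_all` names are hypothetical, and threads stand in for separate machines; the real architecture is not described in this answer.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each holds a pre-built index for a slice of the pages.
SHARDS = [
    {"cats": {"a.example/1"}, "mice": {"a.example/1"}},
    {"cats": {"b.example/2"}, "dogs": {"b.example/2"}},
    {"fish": {"c.example/3"}},
]

def search_shard(shard, word):
    """Each shard answers from its own small index; no document is re-read."""
    return shard.get(word, set())

def search_all(shards, word):
    """Scatter the query to every shard in parallel, then merge the answers."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, word), shards)
        merged = set()
        for partial in partials:
            merged |= partial
    return merged

print(sorted(search_all(SHARDS, "cats")))
```

Even here, every shard answers from an index that was built in advance. The parallelism spreads out the lookups; it does not replace the pre-processing.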

huangapple
  • Posted on May 30, 2023 at 07:19:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76360784.html