Why is there a storage limit in Cloud Bigtable?
Question
Google's Cloud Bigtable uses their Colossus filesystem to store data. I thought this would mean that an instance with even a single node could handle any amount of data (assuming, of course, reads and writes are light enough for the node's CPU to handle them).
Even the documentation states:
> Importantly, data is never stored in Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus.
I could imagine an application where the node(s) handle data streams from many IoT devices and store them in Bigtable. Most of that data would be rarely accessed, so most of the node's CPU work would go into storing new data in Bigtable, and only a little into accessing very old entries.
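For concreteness, the write path I have in mind looks roughly like this (a minimal sketch with the Python client; the project, instance, table, and column-family names are made up):

```python
# A rough sketch of the IoT write path described above. All names here
# ("my-project", "iot-instance", "iot-readings", "readings") are placeholders.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("iot-instance").table("iot-readings")

def store_reading(device_id: str, value: float) -> None:
    ts = datetime.datetime.now(datetime.timezone.utc)
    # Row key: device id + timestamp keeps each device's readings contiguous.
    row = table.direct_row(f"device#{device_id}#{ts.isoformat()}".encode())
    row.set_cell("readings", b"value", str(value).encode(), timestamp=ts)
    row.commit()

store_reading("sensor-42", 21.5)
```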
Yet in the quotas documentation one learns that there is a per-node storage limit:
> If a cluster does not have enough nodes, based on its current workload and the amount of data it stores, Bigtable will not have enough CPU resources to manage all of the tablets that are associated with the cluster. Bigtable will also not be able to perform essential maintenance tasks in the background. As a result, the cluster may not be able to handle incoming requests, and latency will go up. See Trade-offs between storage usage and performance for more details.
> To prevent these issues, monitor storage utilization for your clusters to make sure they have enough nodes to support the amount of data in the cluster, based on the following limits:
>- SSD clusters: 5 TB per node
>- HDD clusters: 16 TB per node
> ...
> Important: If any cluster in an instance exceeds the hard limit on the amount of storage per node, writes to all clusters in that instance will fail until you add nodes to each cluster that is over the limit.
I don't understand where this limitation comes from. Data just sitting idle in Colossus shouldn't add CPU load to a node, and occasionally shuffling a few terabytes around in the background shouldn't be a problem for today's hardware. I thought the point of using Colossus was to allow practically limitless storage, yet these limits suggest nodes might as well use their own local storage without any smart networked filesystem.
Or am I missing something, and one can configure a single-cluster, single-node instance that is able to access e.g. 100 TB of data? Are the limits for some internal node cache?
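To put the 100 TB example in numbers: under the documented per-node limits quoted above, the storage floor alone already dictates a minimum cluster size, independent of CPU or throughput needs (a quick back-of-the-envelope sketch, using only the 5 TB / 16 TB figures):

```python
import math

# Documented per-node storage limits from the quota excerpt above.
LIMIT_TB = {"SSD": 5, "HDD": 16}

def min_nodes(data_tb: float, storage_type: str = "SSD") -> int:
    """Minimum node count implied purely by the per-node storage limit.
    Actual CPU/throughput needs may require more nodes than this floor."""
    return max(1, math.ceil(data_tb / LIMIT_TB[storage_type]))

print(min_nodes(100, "SSD"))  # 20 nodes just to hold 100 TB on SSD
print(min_nodes(100, "HDD"))  # 7 nodes on HDD
```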
Answer 1
Score: 1
Cloud Bigtable has storage limits due to performance and operational reasons, not hardware constraints. These limits ensure high performance, balanced data distribution, and efficient maintenance. Exceeding them may affect reliability and scalability.
Reasons for storage limits in Cloud Bigtable:
- Storage limits ensure that each node can efficiently manage and handle the data it contains, maintaining good performance.
- Setting storage limits prevents a single node from becoming overloaded with excessive data, which could negatively impact system performance and scalability.
- Too much data per node could make background maintenance tasks resource-intensive and affect overall performance.
- Limiting data per node enables easy replication for data redundancy and fault tolerance.
Always refer to the latest documentation to stay updated on Cloud Bigtable's storage limits.
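In practice, that means watching per-cluster storage utilization and adding nodes before a cluster approaches the hard limit. A minimal sketch with the Cloud Monitoring client, assuming the `bigtable.googleapis.com/cluster/storage_utilization` metric (verify the exact metric name against the current monitoring docs):

```python
# Sketch: read recent Bigtable storage-utilization samples from Cloud Monitoring.
# Replace "my-project" with your project id; the metric name is an assumption
# based on the Bigtable monitoring docs and should be double-checked.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

series_iter = client.list_time_series(
    request={
        "name": "projects/my-project",
        "filter": 'metric.type = "bigtable.googleapis.com/cluster/storage_utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in series_iter:
    labels = dict(series.resource.labels)          # instance / cluster identifiers
    latest = series.points[0].value.double_value   # most recent point is first
    print(labels, f"storage utilization = {latest:.0%}")
```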
Comments