What is the actual use of partitions in clickhouse?
Question
It says partitions make it easier to drop or move data so that only a limited amount of data is affected. In various blogs it is suggested to use the month as the partitioning key (toYYYYMM(date)). In many places it is also suggested not to have more than a couple of partitions. I am using ClickHouse as a database to store time-series data which does not undergo frequent deletions. What would be an advisable partitioning key for high-volume time-series data? Does there have to be one at all if I do not want to perform deletes frequently?
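For concreteness, the toYYYYMM(date) scheme those blogs suggest looks like this in a table definition; the table and column names below are just a hypothetical sketch:

```sql
-- Hypothetical time-series table partitioned by calendar month
CREATE TABLE metrics
(
    ts    DateTime,
    name  LowCardinality(String),
    value Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)   -- one partition per month
ORDER BY (name, ts);
```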
<strike>In production I noticed that startup was very slow, and I suspected that having too many partitions was the culprit. So I decided to test it by inserting time-series data fresh into a table (which created >2300 partitions for ~20 billion rows), selecting the data from another table (so that there was no opportunity to optimize the table). I immediately dropped the original table and tried a restart. It finished quickly, in about 10s. This is the complete opposite of what I observed in production with 800GB+ of data (with many databases and tables, as opposed to my test node, which had only one table).</strike>
Edit: As it was pointed out, I mixed up parts and partitions. Regarding startup time of clickhouse being affected, I'd better post another question.
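Since parts and partitions are easy to mix up, the two counts can be checked in the system.parts system table. A sketch, assuming a table named metrics in the default database:

```sql
-- Active data parts vs. distinct partitions for one table
SELECT
    count()              AS active_parts,
    uniqExact(partition) AS partitions
FROM system.parts
WHERE active
  AND database = 'default'
  AND table = 'metrics';
```

Each partition typically holds many parts until background merges consolidate them, which is why the parts count can dwarf the partition count.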
Answer 1
Score: 3
This is a pretty common question, and for disclosure, I work at ClickHouse.
Partitions are particularly useful when you have time-series data, as you noted. When deciding how (and whether) to partition, we often recommend a few guidelines, driven by a couple of questions about why you are partitioning in the first place:
- are you generally going to query only a single partition? For example, if your queries are often for results within a one day or one month period, it could make sense to partition at that period duration
- are you wanting to "tier" or set a TTL on your data such that once the partition reaches an age of X (e.g., 91 days old, 7 months old), you want to do something special with it? (e.g., TTL to lower cost tier storage, backup and delete from ClickHouse, etc.)
- We often recommend keeping the number of partitions below around 100. Up to 1000 partitions can work, but it is suboptimal and has some performance impact on the filesystem and on index/memory sizes, which can affect startup time and insert/query times.
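The tiering idea in the second bullet can be sketched with ClickHouse's TTL clause. The volume name 'cold', the table name, and the intervals are assumptions for illustration; TO VOLUME requires a storage policy that actually defines such a volume:

```sql
-- Hypothetical tiering: move data to a cheaper volume after 91 days,
-- then delete it entirely after 7 months
ALTER TABLE metrics
    MODIFY TTL ts + INTERVAL 91 DAY TO VOLUME 'cold',
               ts + INTERVAL 7 MONTH DELETE;
```

Because TTL moves and deletes operate on whole parts, aligning the partition key with the TTL granularity (e.g., monthly partitions with month-based TTLs) keeps these operations cheap.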
Hopefully these guidelines help with your question. It is probably most common to partition by day or by month, but since ClickHouse can manage large tables quite easily, you might want to move toward fewer partitions if possible - partitioning by month is probably the most common choice.
I didn't fully understand your test results so please feel free to expand. 2300 partitions sounds like too many but might work, just with some performance implications. Reducing your number of partitions (and therefore increasing the partition size) seems like a good recommendation.