ClickHouse architecture

Question

I just took a course on ClickHouse architecture. According to it, a ClickHouse table is split into "parts" (screenshot example):

[screenshot]

Every part consists of a few main files. One of them is, for example, column1.bin, where the data of a specific column is stored, so according to the course there should be a separate .bin file for every column (screenshot example from the course):

[screenshot]

Here is a screenshot of one of my folders. Although my table has several columns, I have only one .bin file. Why?

[screenshot]

Answer 1

Score: 3

In ClickHouse there are two types of parts: wide and compact (there are also in-memory parts, but let's keep it simple).

Here you can find the definition of both types: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree-data-storage

> Data parts can be stored in Wide or Compact format. In Wide format each column is stored in a separate file in a filesystem, in Compact format all columns are stored in one file. Compact format can be used to increase performance of small and frequent inserts.
>
> Data storing format is controlled by the min_bytes_for_wide_part and min_rows_for_wide_part settings of the table engine. If the number of bytes or rows in a data part is less than the corresponding setting's value, the part is stored in Compact format. Otherwise it is stored in Wide format. If none of these settings is set, data parts are stored in Wide format.

Basically, you're seeing a single .bin file because the part is too small to be worth splitting each column into its own file.
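
The rule quoted above can be sketched in a few lines of Python. This is only a model of the documented decision, not ClickHouse code: the function name `choose_part_format` is hypothetical, a value of 0 stands for "setting not set", and the 10 MiB threshold is an illustrative value rather than a statement about your server's configuration.

```python
def choose_part_format(part_bytes, part_rows,
                       min_bytes_for_wide_part=0,
                       min_rows_for_wide_part=0):
    """Model of the documented rule; 0 means the setting is not set."""
    # Compact if the part is below either configured threshold.
    if part_bytes < min_bytes_for_wide_part or part_rows < min_rows_for_wide_part:
        return "Compact"
    # Otherwise (including when neither setting is set) the part is Wide.
    return "Wide"


TEN_MIB = 10 * 1024 * 1024  # illustrative threshold, an assumption

small_part = choose_part_format(50_000, 1_000, min_bytes_for_wide_part=TEN_MIB)
large_part = choose_part_format(500_000_000, 10_000_000, min_bytes_for_wide_part=TEN_MIB)
no_settings = choose_part_format(50_000, 1_000)

print(small_part, large_part, no_settings)
```

Under these assumptions, the small part is stored as Compact (one data file for all columns), while the large part and the part with no thresholds configured are stored as Wide.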

If you perform a big insert, the new part will be created as wide. Also, if you keep doing small inserts, the background merge task will eventually merge those small parts into a single part big enough to be created as wide.
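
As a rough illustration of the merge behavior, the sketch below models parts purely by their size in bytes. The threshold value, the helper names `part_format` and `merge`, and the assumption that a merged part is as large as the sum of its inputs are all simplifications made for this example; real merges are scheduled by ClickHouse itself and compressed sizes do not add up exactly.

```python
MIN_BYTES_FOR_WIDE_PART = 10 * 1024 * 1024  # illustrative value, an assumption


def part_format(part_bytes, threshold=MIN_BYTES_FOR_WIDE_PART):
    """Compact below the threshold, Wide at or above it."""
    return "Compact" if part_bytes < threshold else "Wide"


def merge(part_sizes):
    """Toy merge: the merged part holds all the data of its inputs."""
    return sum(part_sizes)


# Six small inserts of 2 MiB each: every one of them creates a Compact part.
small_inserts = [2 * 1024 * 1024] * 6
formats_before = [part_format(size) for size in small_inserts]

# A background merge combines them into one 12 MiB part, which is now Wide.
merged_size = merge(small_inserts)
format_after = part_format(merged_size)

print(formats_before, format_after)
```

In this model, each individual insert stays Compact, but the merged part crosses the threshold and is written with one file per column.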

If you want more details about the file structure of both formats, check this:

Answer 2

Score: 0

That's my slide - thanks for sharing! Love it.

From a learning perspective, I sort of "stretched the truth" to make it more of a learning moment: when you insert a ton of data, your parts will have a column file for each column.

I also didn't mention the "mark" files that appear in the part folder. Each column has one of those as well. The mark files record where in the column file the blocks start. (Each column file consists of compressed blocks.) This is all done to speed up querying these large files: you don't want to have to decompress 200GB to pull out a few thousand rows.
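
To make the idea concrete, here is a toy model in Python of a column stored as compressed blocks plus a list of marks. It is not ClickHouse's actual on-disk format: the block size, the mark layout `(first_row, offset, size)`, the use of zlib, and the function names are all assumptions chosen for the illustration.

```python
import struct
import zlib

ROWS_PER_BLOCK = 4  # hypothetical block size, kept tiny for the example


def write_column(values):
    """Compress values block by block; return (file_bytes, marks)."""
    data = bytearray()
    marks = []  # one (first_row, offset_in_file, compressed_size) per block
    for start in range(0, len(values), ROWS_PER_BLOCK):
        chunk = values[start:start + ROWS_PER_BLOCK]
        raw = struct.pack(f"<{len(chunk)}q", *chunk)  # 64-bit integers
        compressed = zlib.compress(raw)
        marks.append((start, len(data), len(compressed)))
        data += compressed
    return bytes(data), marks


def read_rows(data, marks, first_row, last_row):
    """Read rows first_row..last_row, decompressing only the blocks needed."""
    out = []
    blocks_read = 0
    for start_row, offset, size in marks:
        end_row = start_row + ROWS_PER_BLOCK - 1
        if end_row < first_row or start_row > last_row:
            continue  # the mark tells us this block cannot contain the rows
        raw = zlib.decompress(data[offset:offset + size])
        blocks_read += 1
        block = struct.unpack(f"<{len(raw) // 8}q", raw)
        lo = max(first_row, start_row) - start_row
        hi = min(last_row, start_row + len(block) - 1) - start_row
        out.extend(block[lo:hi + 1])
    return out, blocks_read


values = list(range(100, 120))            # 20 rows, so 5 blocks
data, marks = write_column(values)
rows, blocks_read = read_rows(data, marks, 9, 10)
print(rows, "from", blocks_read, "of", len(marks), "blocks")
```

In this sketch, fetching rows 9 and 10 touches only one of the five blocks, because the marks say exactly where that block starts in the file. The same principle is what lets a query skip most of a very large column file.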

huangapple
  • Posted on 2023-02-27 03:37:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75574563.html