Postgres - Best way to store hex values
Question
I have an application that deals with blockchain data, so the Postgres tables have quite a lot of fields that are constant-sized hex values.
One of the most important columns is the public_address column of the user table, which is a 20-byte hex value and the primary key of the table. I call it important because it is used in almost every other query, within a JOIN or WHERE clause, so its indexing and query speed are crucial.
On the other hand, there is another column, nonce, in the transaction table, which is a 32-byte hex value with a single NOT NULL constraint on it. It is not used in any JOIN or WHERE clause; it only appears in SELECT statements that read the column's value.
I am aware that bytea is more storage-efficient than varchar, but I don't know which one is better for indexes and for faster queries.
So my question is: under which circumstances should bytea be used over varchar(N) for hex values, and vice versa, taking indexing and query speed into account?
Answer 1
Score: 2
What you call a "hex value" is just a text representation of binary data, and that is what matters. It's plain, raw, arbitrary binary data: just a sequence of bytes.
Indexing and query speed do not matter here (and I don't expect them to differ much anyway). There is only one appropriate data type for arbitrary byte sequences: bytea.
Do not use varchar: "The characters that can be stored in any of these data types are determined by the database character set, which is selected when the database is created. Regardless of the specific character set, the character with code zero (sometimes called NUL) cannot be stored."
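To make the NUL restriction concrete, here is a small Python sketch (Python is only my choice for illustration, and the address is a sample value, not one taken from the question). It shows that a 20-byte address can legitimately contain zero bytes, and that the hex text form is more than twice as long as the raw bytes.

```python
# Sample 20-byte address written as hex text (an illustrative value only).
address_hex = "0x00000000219ab540356cbb839cbe05303d7705fa"

# Raw form: the byte sequence a bytea column would hold.
address_raw = bytes.fromhex(address_hex[2:])

raw_len = len(address_raw)        # number of raw bytes
text_len = len(address_hex)       # characters in the hex text, including "0x"
has_nul = b"\x00" in address_raw  # zero bytes are perfectly valid binary data

print(raw_len, text_len, has_nul)  # 20 42 True
```

The raw value starts with several zero bytes, which is exactly the kind of content a character type cannot hold directly.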
Answer 2
Score: 2
tl;dr: Use bytea. It will likely perform better, but more importantly it's correct; there are some bytes that are illegal in varchar. From the docs:
> First, binary strings specifically allow storing octets of value zero and other “non-printable” octets (usually, octets outside the decimal range 32 to 126). Character strings disallow zero octets, and also disallow any other octet values and sequences of octet values that are invalid according to the database's selected character set encoding. Second, operations on binary strings process the actual bytes, whereas the processing of character strings depends on locale settings. In short, binary strings are appropriate for storing data that the programmer thinks of as “raw bytes”, whereas character strings are appropriate for storing text.
varchar stores bytes that represent characters in a given character encoding. In many character encodings, certain byte combinations are illegal; they represent no character. For example, in UTF-8 a lone byte above 7F is illegal. So you can't reliably store bytes as bytes in varchar.
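The UTF-8 point can be checked outside the database. The sketch below uses Python's own UTF-8 decoder as a stand-in for the encoding check; is_valid_utf8 is a hypothetical helper name of mine, and Postgres applies its own validation, so treat this as an approximation of the idea rather than of the server's exact behaviour.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\x7f"))      # highest single-byte (ASCII) value
print(is_valid_utf8(b"\x80"))      # lone byte above 7F
print(is_valid_utf8(b"\xc3\xa9"))  # valid two-byte sequence ("é")
print(is_valid_utf8(b"\x00\xaa"))  # the two raw bytes 00 AA

# Caveat: Python accepts b"\x00" as valid UTF-8, whereas Postgres character
# types additionally refuse the NUL character.
```

Arbitrary binary values such as addresses and nonces will regularly contain bytes like these, so they cannot be relied on to pass as text.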
Instead, varchar will store a string representation of the bytes. So \x00AA will be stored not as the two bytes 00 AA but as the string "\x00AA", which in UTF-8 is 5C 78 30 30 41 41.
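The byte values quoted above are easy to reproduce. This is a minimal Python sketch (variable names are mine) comparing the encoded text with the real two-byte value; it needs Python 3.8 or later for the separator argument of bytes.hex.

```python
# The six-character text "\x00AA" (a raw string, so the backslash is literal).
text_form = r"\x00AA"

stored_as_text = text_form.encode("utf-8")  # what a character column would hold
stored_as_bytes = bytes.fromhex("00AA")     # what the value actually is

text_dump = stored_as_text.hex(" ").upper()
bytes_dump = stored_as_bytes.hex(" ").upper()

print(text_dump)   # 5C 78 30 30 41 41
print(bytes_dump)  # 00 AA
```

Six bytes of text stand in for two bytes of data, and none of the six is one of the original byte values.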
Converting back and forth between the hex characters and the real bytes is more work and more opportunities for mistakes. It's more efficient and more correct to use bytea.
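As a rough illustration of where those mistakes creep in, here is a hedged Python sketch of application-side conversion. The helper names hex_to_bytes and bytes_to_hex, the "0x" prefix convention, and the sample values are all assumptions of mine, not part of the question.

```python
def hex_to_bytes(value: str, expected_len: int) -> bytes:
    """Parse hex text (optionally prefixed with "0x") into raw bytes."""
    digits = value[2:] if value.lower().startswith("0x") else value
    raw = bytes.fromhex(digits)  # raises ValueError on bad or odd-length hex
    if len(raw) != expected_len:
        raise ValueError(f"expected {expected_len} bytes, got {len(raw)}")
    return raw


def bytes_to_hex(raw: bytes) -> str:
    """Render raw bytes as lowercase hex text with a "0x" prefix."""
    return "0x" + raw.hex()


# The same address written with different letter case (illustrative value).
a = "0x00000000219AB540356CBB839CBE05303D7705FA"
b = "0x00000000219ab540356cbb839cbe05303d7705fa"

same_as_text = a == b
same_as_bytes = hex_to_bytes(a, 20) == hex_to_bytes(b, 20)
print(same_as_text, same_as_bytes)  # False True
```

Stored as text, the two spellings are different keys unless every writer normalises case and prefix first; stored as bytes, there is only one representation to compare, join on, and index.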