Error 1366: Incorrect string value when inserting strings into MariaDB
Question
I have a MariaDB table that has an index of type VARCHAR(10) NOT NULL COLLATE 'utf8mb3_general_ci'. I have a string in Go that I cut to 10 characters, if it is longer, before inserting it into or updating this table. I cut the string as:
if len(value) > 10 {
    value = value[:10]
}
Right now I have encountered an issue with a string that ends with the š character. MariaDB throws the error: Error 1366: Incorrect string value: '\xC5'. Looking up Unicode tables, this character is represented as \xc5\xa1, which makes me believe the cutting of the string somehow makes the string indigestible for the database?
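For illustration, here is a minimal sketch (the value is a made-up example, not my real data) of what I suspect is happening:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // hypothetical input: nine ASCII bytes followed by the two-byte š (0xC5 0xA1)
    value := "abcdefghiš" // 11 bytes, 10 characters

    if len(value) > 10 {
        value = value[:10] // cuts between 0xC5 and 0xA1
    }

    fmt.Printf("% x\n", value)           // 61 62 63 64 65 66 67 68 69 c5
    fmt.Println(utf8.ValidString(value)) // false: the string now ends in a lone 0xC5
}

The leftover last byte is exactly the \xC5 that MariaDB complains about.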
I would like to avoid handling utf8/unicode in my code because that would require going through all database methods and massaging all strings. And I do not believe this is necessary, since I have never needed it before. So I think the issue lies somewhere else, but I am not sure where.
I tried to switch the collation to utf8mb4_general_ci, but that did not help either.
Interestingly, if I edit the column directly with HeidiSQL, the string saves just fine. This leads me to believe this might be a driver issue. I am using the github.com/go-sql-driver/mysql driver, as always, so I would not expect issues, but who knows...
Answer 1
Score: 1
> which makes me believe the cutting of the string somehow makes the string indigestible for the database?
Cutting strings by sub-slicing, as in value[:10] (and measuring length with len, for that matter), is always a mistake if your program has any chance of dealing with multi-byte characters. That's because indexing a string operates on its bytes, which may or may not be part of a multi-byte encoding.
As you found out, the character š is encoded in UTF-8 as \xc5\xa1. If these two bytes appear in your value string right at indexes 9 and 10, your index expression [:10] corrupts the data.
The character sets utf8mb3 and utf8mb4 only restrict the range of admitted UTF-8 to 3-byte and 4-byte characters respectively, but a lone \xc5 is not valid UTF-8 to begin with, so it gets rejected either way.
In MariaDB a column with data type VARCHAR(N) counts characters (as specified by the collation). You want to cut your value string at the tenth character, not at the tenth byte.
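To make the distinction concrete, a small sketch (with a made-up sample value) comparing byte length and character count:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    value := "abcdefghiš" // hypothetical sample ending in the two-byte š

    fmt.Println(len(value))                    // 11 bytes
    fmt.Println(utf8.RuneCountInString(value)) // 10 characters (runes)
}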
> I would like to avoid handling utf8/unicode in my code
You are already admitting UTF-8 by declaring the MariaDB collation as utf8mb3. It's only logical that you properly handle input data in your code as UTF-8. To cut at the n-th character (or rune, which in Go represents a Unicode code point) you can use something like:
// requires: import "unicode/utf8"
// count the runes
if utf8.RuneCountInString(value) > 10 {
    // convert string to rune slice
    chars := []rune(value)
    // index the rune slice and convert back to string
    value = string(chars[:10])
}
This won't corrupt the UTF-8 encoding; however, keep in mind that it does more allocations and doesn't account for composed characters, e.g. when the zero-width joiner U+200D is involved.
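For reference, here is a minimal runnable sketch of the snippet above in use; the truncateRunes helper name and the sample value are mine, not part of the question:

package main

import (
    "fmt"
    "unicode/utf8"
)

// truncateRunes cuts s to at most n characters (runes) without
// splitting a multi-byte UTF-8 sequence.
func truncateRunes(s string, n int) string {
    if utf8.RuneCountInString(s) > n {
        chars := []rune(s)
        return string(chars[:n])
    }
    return s
}

func main() {
    value := "abcdefghijš" // 11 characters, 12 bytes
    value = truncateRunes(value, 10)

    fmt.Println(value)                   // abcdefghij
    fmt.Println(utf8.ValidString(value)) // true: still valid UTF-8
}

The result is at most 10 characters and still valid UTF-8, so it fits the VARCHAR(10) column without tripping error 1366.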
Comments