错误 1366:将字符串插入到MariaDB时出现错误的字符串值

huangapple go评论100阅读模式
英文:

Error 1366: Incorrect string value when inserting strings into MariaDB

问题

我有一个MariaDB表,其中索引的类型是VARCHAR(10) NOT NULL COLLATE 'utf8mb3_general_ci'。我在Go中有一个字符串,我将其截断为10个字符,如果它超过了这个长度,就插入/更新到这个表中。我使用以下代码来截断字符串:

if len(value) > 10 {
  value = value[:10]
}

现在我遇到了一个以š字符结尾的字符串的问题。MariaDB抛出错误:Error 1366: Incorrect string value: '\\xC5'。查看Unicode表,这个字符表示为\xc5\xa1,这让我相信字符串的截断在某种程度上使得数据库无法处理它。

我想避免在我的代码中处理utf8/unicode,因为那将需要修改所有数据库方法并处理所有字符串。而且我认为这是不必要的,因为以前我从未遇到过这个问题。所以我认为问题出在其他地方,但我不确定是哪里。

我尝试将排序规则切换为utf8mb4_general_ci,但也没有帮助。

有趣的是,如果我使用HeidiSQL直接编辑列,字符串可以正常保存。这让我相信这可能是一个驱动程序的问题。我一直使用的是github.com/go-sql-driver/mysql驱动程序,所以我不希望出现问题,但谁知道呢...

英文:

I have MariaDB table that has index VARCHAR(10) NOT NULL COLLATE 'utf8mb3_general_ci' type. I have a string in Go that I cut to 10 characters, if it is longer, to insert into/update this table. I cut the string as:

if len(value) > 10 {
  value = value[:10]
}

Right now I have encountered an issue with string that ends with š character. MariaDB throws error: Error 1366: Incorrect string value: '\\xC5'. Looking up unicode tables, this character is represented as \xc5\xa1 which makes me believe the cutting of the string somehow makes the string indigestible for the database?

I would like to avoid handling utf8/unicode in my code because that would require going through all database methods and massaging all strings. And I do not believe this is necessary since i have never needed it before. So I think the issue lies somewhere else but am not sure where.

I tried to switch the collation to utf8mb4_general_ci but that did not help either.

Interestingly, if I edit the column directly with HeidiSQL, the string saves just fine.Which leads me to believe this might be a driver issue. I am using the github.com/go-sql-driver/mysql, as always. So I would not expect issues but, who knows...

答案1

得分: 1

让我相信,对字符串进行切割会使字符串对数据库来说变得不可消化?

如果你的程序有可能处理多字节字符,那么通过子切片(例如value[:10])来切割字符串并使用len函数测量长度,这种做法总是错误的。这是因为对字符串进行索引操作是基于字节的,而这些字节可能是多字节编码的一部分,也可能不是。

正如你发现的,字符š在UTF-8中被编码为\xc5\xa1。如果这两个字节出现在你的value字符串的索引910处,你的索引表达式[:10]会破坏数据。

字符集utf8mb3utf8mb4只限制了允许的UTF-8范围,分别是3字节和4字节的字符,但\xc5本身就不是有效的UTF-8,所以无论如何都会被拒绝。

在MariaDB中,数据类型为VARCHAR(N)的列计算的是字符(由排序规则指定)。你想要在第十个字符处切割value字符串,而不是在第十个字节处。

我想避免在我的代码中处理utf8/unicode。

通过将MariaDB的排序规则声明为utf8mb3,你已经允许使用UTF-8。因此,在代码中正确处理输入数据的UTF-8是合乎逻辑的。要在第n个字符(或rune,在Go中表示一个Unicode码点)处进行切割,你可以使用类似以下的代码:

// 计算rune的数量
if utf8.RuneCountInString(value) > 10 {
  // 将字符串转换为rune切片
  chars := []rune(value)
  // 对rune切片进行索引并转换回字符串
  value = string(chars[:10])
}

这样做不会破坏UTF-8编码,但是请记住,它会进行更多的分配,并且不考虑组合字符,例如当连接器200D出现时。

英文:

> which makes me believe the cutting of the string somehow makes the string indigestible for the database?

Cutting strings by sub-slicing as value[:10] (and measuring length with len for that matter) is always a mistake if your program has any chance of dealing with multi-byte characters. That's because indexing a string operates on its bytes, which may or may not be part of multi-byte encoding.

As you found out, the character š is encoded in UTF-8 as \xc5\xa1. If these two bytes appear in your value string right at index 9 and 10 your index expression [:10] corrupts the data.

The character sets utf8mb3 and utf8mb4 only restrict the range of admitted UTF-8 to respectively 3-byte and 4-byte characters, but \xc5 is not valid UTF-8 to begin with, so it gets rejected either way.

In MariaDB a column with data type VARCHAR(N) counts characters (as specified by the collation). You want to cut your value string at the tenth character, instead of at the tenth byte.

> I would like to avoid handling utf8/unicode in my code

You are already admitting UTF-8 by declaring the MariaDB collation as utf8mb3. It's only logical that you properly handle input data in your code as UTF-8. To cut at the n-th character (or rune, which in Go represents a Unicode code point) you can use something like:

// count the runes
if utf8.RuneCountInString(value) > 10 {
  // convert string to rune slice
  chars := []rune(value)
  // index the rune slice and convert back to string
  value = string(chars[:10])
}

This won't corrupt UTF-8 encoding, however keep in mind it does more allocs and doesn't account for composed characters, e.g. when the joiner 200D is involved.

huangapple
  • 本文由 发表于 2022年1月21日 00:34:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/70789977.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定