Error 1366: Incorrect string value when inserting strings into MariaDB
Question
I have a MariaDB table that has an index of type VARCHAR(10) NOT NULL COLLATE 'utf8mb3_general_ci'. I have a string in Go that I cut to 10 characters, if it is longer, before inserting it into or updating this table. I cut the string as:
if len(value) > 10 {
    value = value[:10]
}
Right now I have encountered an issue with a string that ends with the š character. MariaDB throws the error: Error 1366: Incorrect string value: '\xC5'. Looking up Unicode tables, this character is represented as \xc5\xa1, which makes me believe the cutting of the string somehow makes the string indigestible for the database?
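For illustration, here is a minimal sketch (the value is a made-up example, not my real data) of what I suspect is happening:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // hypothetical input: nine ASCII bytes followed by the two-byte š (0xC5 0xA1)
    value := "abcdefghiš" // 11 bytes, 10 characters

    if len(value) > 10 {
        value = value[:10] // cuts between 0xC5 and 0xA1
    }

    fmt.Printf("% x\n", value)           // 61 62 63 64 65 66 67 68 69 c5
    fmt.Println(utf8.ValidString(value)) // false: the string now ends in a lone 0xC5
}

The leftover last byte is exactly the \xC5 that MariaDB complains about.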
I would like to avoid handling utf8/unicode in my code because that would require going through all database methods and massaging all strings. And I do not believe this is necessary, since I have never needed it before. So I think the issue lies somewhere else, but I am not sure where.
I tried to switch the collation to utf8mb4_general_ci, but that did not help either.
Interestingly, if I edit the column directly with HeidiSQL, the string saves just fine. This leads me to believe this might be a driver issue. I am using the github.com/go-sql-driver/mysql driver, as always, so I would not expect issues, but who knows...
Answer 1
Score: 1
> which makes me believe the cutting of the string somehow makes the string indigestible for the database?
Cutting strings by sub-slicing, as in value[:10] (and measuring length with len, for that matter), is always a mistake if your program has any chance of dealing with multi-byte characters. That's because indexing a string operates on its bytes, which may or may not be part of a multi-byte encoding.
As you found out, the character š is encoded in UTF-8 as \xc5\xa1. If these two bytes appear in your value string right at indexes 9 and 10, your index expression [:10] corrupts the data.
The character sets utf8mb3 and utf8mb4 only restrict the range of admitted UTF-8 to 3-byte and 4-byte characters respectively, but a lone \xc5 is not valid UTF-8 to begin with, so it gets rejected either way.
In MariaDB a column with data type VARCHAR(N) counts characters (as specified by the collation). You want to cut your value string at the tenth character, not at the tenth byte.
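To make the distinction concrete, a small sketch (with a made-up sample value) comparing byte length and character count:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    value := "abcdefghiš" // hypothetical sample ending in the two-byte š

    fmt.Println(len(value))                    // 11 bytes
    fmt.Println(utf8.RuneCountInString(value)) // 10 characters (runes)
}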
> I would like to avoid handling utf8/unicode in my code
You are already admitting UTF-8 by declaring the MariaDB collation as utf8mb3. It's only logical that you properly handle input data in your code as UTF-8. To cut at the n-th character (or rune, which in Go represents a Unicode code point) you can use something like:
// requires: import "unicode/utf8"
// count the runes
if utf8.RuneCountInString(value) > 10 {
    // convert string to rune slice
    chars := []rune(value)
    // index the rune slice and convert back to string
    value = string(chars[:10])
}
This won't corrupt the UTF-8 encoding; however, keep in mind that it does more allocations and doesn't account for composed characters, e.g. when the zero-width joiner U+200D is involved.
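For reference, here is a minimal runnable sketch of the snippet above in use; the truncateRunes helper name and the sample value are mine, not part of the question:

package main

import (
    "fmt"
    "unicode/utf8"
)

// truncateRunes cuts s to at most n characters (runes) without
// splitting a multi-byte UTF-8 sequence.
func truncateRunes(s string, n int) string {
    if utf8.RuneCountInString(s) > n {
        chars := []rune(s)
        return string(chars[:n])
    }
    return s
}

func main() {
    value := "abcdefghijš" // 11 characters, 12 bytes
    value = truncateRunes(value, 10)

    fmt.Println(value)                   // abcdefghij
    fmt.Println(utf8.ValidString(value)) // true: still valid UTF-8
}

The result is at most 10 characters and still valid UTF-8, so it fits the VARCHAR(10) column without tripping error 1366.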
Comments