将UTF8字符串使用Go和MySQL编码为Latin1/ISO-8859-1。

huangapple go评论74阅读模式
英文:

Encoding UTF8 string to latin1/iso-8859-1 with Go and MySQL

问题

我有一个MySQL数据库,其中一个表中有德语的"Umlaute öäü",我需要编写一个Go应用程序来读取该表,将其编码为ISO-8859-1,并将其写入文本文件。

到目前为止,一切都好,但是编码为ISO-8859-1并没有起作用。我已经尝试过调试。

以下是一些详细信息:

MySQL
MySQL数据库是UTF8,表本身也是如此。除了character_set_server之外,其他字符集应该都没问题,但我认为这在这里不相关,据我所知,它应该只是新数据库的默认设置。

当我使用以下SQL查询数据库时,我得到了正确的UTF8编码文本:

select street, hex(street) from test_table where id = '36'

结果:(实际上被称为Fröbelstraße)
Fröbelstraße, 4672C3B662656C73747261C39F65

所以从十六进制字符串来看,基本上和我预期的一样。好的。

Go应用程序
只是相关的部分....

db, err := sql.Open("mysql", "...<连接字符串>...")
res, err := db.Query("select street from from test_table where id = '36'")

for res.Next() {
    var pb Phonebook
    err := res.Scan(&pb.Street)
    fmt.Println(hex.EncodeToString([]byte(pb.Street)))
}

输出是
4672c383c2b662656c73747261c383c5b865

这就是为什么我的ISO-8859-1编码不起作用的问题,因为从数据库中获取的字符串不正确。直接查询数据库的十六进制是正确的,并且编码也是有效的。

但是我不明白为什么从Go客户端获取到了不同的字符串。

在原始字符串"Fröbelstraße"中,有2个字符"ö",其十六进制为C3B6,和"ß",其十六进制为C39F。使用数据库客户端查询的十六进制是正确的,但是从Go应用程序获取的十六进制太长了,因为每个字符多了2个字节。

当我使用正确的十六进制字符串来输入我的Latin1转换器时,它可以正常工作,我得到了一个ISO-8859-1字符串。但是对于我直接从Go查询的字符串,我从未得到正确的ISO-8859-1结果。

我使用以下代码进行转换:

d := charmap.ISO8859_1.NewEncoder()
out, err := d.String(inp)

这只是一个片段,实际上我调用的是一个带有字符串参数的函数,但我从未得到正确的iso8859-1结果。所以我尝试使用从MySQL客户端查询得到的十六进制代码,将其转换回字符串,结果正确的ISO-8859-1结果。

我还尝试从Python查询,并在那里得到了相同奇怪的十六进制字符串,所以我完全不知道这里出了什么问题。不能是Go的问题,因为Python也是一样的。但在数据库中,我认为存储得很正确,MySQL的字符集都设置为utf8mb4或utf8,除了我上面提到的那个。

英文:

I have a MySQL database with German "Umlaute öäü" in a table and I need to write a Go app that reads the table, encode it to ISO-8859-1 and write it to a text file.

So far so good, but the encoding to iso-8859-1 is not working. I have tried to debug this.

Here some details and information:

MySQL
The MySQL database is UTF8, also the table itself. Also other character sets should be fine, except the character_set_server, but I think this is not relevant here, it should be just a default for new databases as far as I know.

When I query the database with the following SQL, I get the correct UTF8 encoded text:

select street, hex(street) from test_table where id = &#39;36&#39;

Result: (in real it is called Fr&#246;belstra&#223;e)
Fr&#195;&#182;belstra&#195;Ÿe, 4672C3B662656C73747261C39F65

So from the hex string it is basically exact what I have expected. OK.

Go App
Just the relevant parts....

db, err := sql.Open(&quot;mysql&quot;, &quot;...&lt;connection string&gt;...&quot;)
res, err := db.Query(&quot;select street from from test_table where id = &#39;36&#39;&quot;)

for res.Next() {
var pb Phonebook
		err := res.Scan(&amp;pb.Street)
        fmt.Println(hex.EncodeToString([]byte(pb.Street)))
}

The output is
4672c383c2b662656c73747261c383c5b865

And that's the problem why my encoding to ISO-8859-1 is not working because the string from the database is not correct. The hex from the db direct query is correct and also working the the encoding.

But I don't understand why I get a different string from the go client.

In the original string "Fröbelstraße" are 2 characters "ö" which is C3B6 and "ß" which is C39F. The hex from the query with a db client is ok, but the one from the go app is too long, because I get a 2 bytes more per character.

When I feed my latin1 converter, with the correct hex string, it is working fine, I get an iso-8859-1 string. But not from the other one I query directly from Go.

I do this with

	d := charmap.ISO8859_1.NewEncoder()
	out, err := d.String(inp)

Also just a snippet, I actually call a function with a string, but I never got the correct iso8859-1 result. So I tried it with the hex code from the MySQL client query, converted back to a string, and bingo the iso8859-1 result is correct.

I also tried to query from python and get there the same strange hex from the queried string, so I am completely lost of what is wrong here. Cannot be go, because it is the same in python. But in the db it is stored correctly in my opinion and the MySQL character sets are all set the utf8mb4 or utf8, except the one I mentioned in above.

答案1

得分: 3

4672c383c2b662656c73747261c383c5b865 看起来是“双重编码”。在 https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored 中查找相关信息。

然而,由于你从应用程序获取的是十六进制值,而不是从表中获取的,所以无法确定。请执行 SELECT HEX(col) FROM ... 来查看是否得到了完全相同的字符串。通常,应用程序,尤其是浏览器,会尝试“修复”问题,从而使正确诊断变得更加困难。

如果你的字符集是“utf8mb4”并且存在双重编码问题,那么以下操作可能会修复数据:

UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);

然而,你还需要修复数据源的问题。

英文:

4672c383c2b662656c73747261c383c5b865 appears to be "double-encoded". Look for that in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored

However, since you were getting the hex from the app, not from the table, it is inconclusive. Please do SELECT HEX(col) FROM ... to see if you get exactly that string. Often, apps, especially browsers, try to "fix" the problem, thereby making it harder do diagnose correctly.

If you have "CHARACTER SET utf8mb4 with double-encoding", then this may cure the data:

UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);

However, you need to fix the source of the data, too.

答案2

得分: 1

经过两天的完全迷失后,我自己找到了根本原因。奇怪的是,在我发布问题后不久就发生了这种情况。

我想尝试一个不同的mysql服务器,因此转储了表格。然后我在转储中看到每个字段都有自己的字符集定义,在我的情况下是latin1。

所以这就解释了为什么结果很奇怪。我只是创建了一个正确编码的测试表,现在它按预期工作了。

现在我必须考虑如何“修复”这些编码,也许转储/恢复可以做到,但那是另一个故事。

英文:

After two days of completely lost, I found the root cause myself. Strange that it happened shortly after I posted the question here.

I wanted to try a different mysql server and therefore dumped the table. And then I have seen in the dump that each field has its own character set definition which was latin1 in my case.

So that explains why it was a strange result. I just created a correct encoded test table and it works now as expected.

Now I have to think how I can "repair" these encodings, maybe a dump/restore will do it, but that's a another story.

huangapple
  • 本文由 发表于 2022年1月23日 05:52:31
  • 转载请务必保留本文链接:https://go.coder-hub.com/70817435.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定