Fix text encoding in R

Question

I am having an issue with text encoding that I cannot solve.

I have a string in an Excel file that I'm reading into R that looks like: Productâ„¢. With a bit of research, I learned that â„¢ is UTF-8 that has been read incorrectly as CP-1252.

The UTF-8 hex code for ™ is 0xe2 0x84 0xa2. This has been read as CP-1252: â (E2) „ (84) ¢ (A2).
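
To see the misreading at the byte level, here is a minimal sketch in base R. It assumes the platform's iconv recognises the encoding name "CP1252"; the variable names are only for illustration.

```r
# The trademark sign, written as an escape so the script itself stays ASCII
tm <- "\u2122"

# Its UTF-8 encoding is three bytes: e2 84 a2
utf8_bytes <- charToRaw(tm)
print(utf8_bytes)

# Decode the same three bytes as if they were CP-1252 (the wrong encoding).
# Each byte becomes its own character, which is the mojibake from the question.
misread <- iconv(list(utf8_bytes), from = "CP1252", to = "UTF-8")
print(misread)
```

The second print should show the same three characters that follow "Product" in the spreadsheet.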

How can I fix this issue? I have tried using:

iconv("Productâ„¢", "cp1252", "utf-8")

#> [1] "Productâ„¢"

But as you can see, the output is incorrect. The desired output is Product™.

Any ideas about how to fix this issue? The incorrect data is in an Excel spreadsheet, but I am trying to clean the text in R. A solution to fix the original data or a data cleaning solution in R would be great.

Answer 1

Score: 1

Update: I had the arguments backwards. The string R holds is already UTF-8, so it has to be converted from UTF-8 to CP-1252: that maps â, „ and ¢ back to the bytes E2 84 A2, which are the UTF-8 encoding of ™. I was able to solve it by using:

iconv("Productâ„¢", "utf-8", "cp1252")

#> [1] "Product™"

Special thanks to @BalusC and this answer which showed me how to identify which encodings were being used erroneously.
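
For a whole column, the same round trip could be wrapped in a small helper. This is only a sketch under a few assumptions: fix_mojibake is a hypothetical name, the input is UTF-8, and the damage is exactly one layer of UTF-8 misread as CP-1252. The result of iconv may not be marked as UTF-8, so the sketch sets the encoding explicitly, and it leaves alone any string that does not round-trip to valid UTF-8.

```r
# Hypothetical helper: undo one layer of "UTF-8 read as CP-1252" mojibake
fix_mojibake <- function(x) {
  # Map every character back to the single CP-1252 byte it was misread from
  y <- iconv(x, from = "UTF-8", to = "CP1252")
  # Those bytes are the original UTF-8 sequence, so declare them as UTF-8
  Encoding(y) <- "UTF-8"
  # Replace only the elements that converted cleanly and are valid UTF-8;
  # strings that were never damaged are kept unchanged
  ok <- !is.na(y) & validUTF8(y)
  x[ok] <- y[ok]
  x
}

# Input written with escapes: "Product" followed by U+00E2, U+201E, U+00A2
damaged <- "Product\u00e2\u201e\u00a2"
fixed <- fix_mojibake(c(damaged, "Product\u2122", "plain"))
print(fixed)
```

The guard matters because a correct ™ converts to the single byte 0x99 in CP-1252, which is not valid UTF-8 on its own. A string whose original UTF-8 bytes include values that CP-1252 leaves undefined (such as 0x81 or 0x9D) may not be recoverable this way.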

Answer 2

Score: 0

You can also try specifying the encoding when reading the file.

Assuming your file is a CSV, you can do something like this:

data <- read.csv("data.csv", encoding = "UTF-8")
print(data)
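
A self-contained way to try this is sketched below. It writes a small UTF-8 CSV to a temporary file as a stand-in for a real export, so the file and the column name are assumptions, and it only applies if the Excel data is first saved as CSV. Note that encoding only marks the strings as UTF-8; read.csv also has a fileEncoding argument, which re-encodes the file's contents while reading.

```r
# Write a tiny UTF-8 CSV to a temporary file (stand-in for a real export)
path <- tempfile(fileext = ".csv")
writeLines(enc2utf8(c("name", "Product\u2122")), path, useBytes = TRUE)

# encoding = "UTF-8" marks the strings as UTF-8 without re-encoding them
data <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)
print(data$name)

# Clean up the temporary file
unlink(path)
```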

huangapple
  • Posted on May 25, 2023 at 06:12:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76327722.html