Fix text encoding in R
Question
I am having an issue with text encoding that I cannot solve.
I have a string in an Excel file that I'm reading into R that looks like: Productâ„¢. With a bit of research, I learned that the â„¢ is UTF-8 that has been read incorrectly as CP-1252.
The UTF-8 hex code for ™ is 0xe2 0x84 0xa2. This has been read as CP-1252: â (E2) „ (84) ¢ (A2).
How can I fix this issue? I have tried using:
iconv("Productâ„¢", "cp1252", "utf-8")
#> [1] "Productâ„¢"
But as you can see, the output is incorrect. The desired output is Product™.
Any ideas about how to fix this issue? The incorrect data is in an Excel spreadsheet, but I am trying to clean the text in R. A solution to fix the original data or a data cleaning solution in R would be great.
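The byte-level reading described in the question can be checked directly in R. This is only a minimal sketch, not part of the original post: the variable names tm and mojibake are made up for illustration, and it assumes the platform's iconv recognises the encoding name "CP1252".

```r
# The trademark sign, written as a Unicode escape so the string is UTF-8
tm <- "\u2122"

# Its UTF-8 bytes
charToRaw(tm)
#> [1] e2 84 a2

# Reinterpret those three bytes as CP-1252 and convert the result to UTF-8.
# This reproduces the garbled text: a-circumflex (U+00E2),
# low double quote (U+201E), cent sign (U+00A2).
mojibake <- iconv(rawToChar(charToRaw(tm)), from = "CP1252", to = "UTF-8")
identical(utf8ToInt(mojibake), c(0xE2L, 0x201EL, 0xA2L))
#> [1] TRUE
```

If the code points match, the garbled text is indeed UTF-8 bytes that were decoded as CP-1252 somewhere upstream.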
Answer 1
Score: 1
Update: I had the arguments backwards. It turns out the text was being read as UTF-8 when it really should have been CP-1252. I was able to solve it by using:
iconv("Productâ„¢", "utf-8", "cp1252")
#> [1] "Product™"
Special thanks to @BalusC and this answer, which showed me how to identify which encodings were being used erroneously.
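For a whole column rather than a single string, the same conversion can be wrapped in a small helper. This is a sketch under assumptions, not a definitive fix: fix_mojibake is a made-up name, the encoding name "CP1252" is assumed to be known to the platform's iconv, and the validity check is an extra safeguard that is not in the answer above. Converting to CP-1252 recovers the original bytes, which are then marked as UTF-8; anything that does not survive the round trip is left untouched.

```r
# Hypothetical helper: undo "UTF-8 bytes decoded as CP-1252" in a character vector
fix_mojibake <- function(x) {
  # Encode the garbled characters back to their CP-1252 bytes
  fixed <- iconv(enc2utf8(x), from = "UTF-8", to = "CP1252")
  # Those bytes are really UTF-8, so mark them as such
  Encoding(fixed) <- "UTF-8"
  # Accept the result only where conversion succeeded and the bytes are valid UTF-8
  ok <- !is.na(fixed) & validUTF8(fixed)
  ifelse(ok, fixed, x)
}

# Garbled, already-correct, and plain ASCII inputs
fix_mojibake(c("Product\u00e2\u201e\u00a2", "caf\u00e9", "plain"))
```

The check matters because a correctly encoded string such as "café" would otherwise be turned into invalid UTF-8 by the same conversion. Characters whose UTF-8 bytes have no CP-1252 mapping cannot be recovered this way and are returned unchanged.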
Answer 2
Score: 0
You can try specifying the encoding when reading the file.
Assuming your file is in CSV format, you can do something like this:
data <- read.csv("data.csv", encoding="UTF-8")
print(data)
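To try this without an existing file, here is a minimal sketch that writes a small UTF-8 CSV to a temporary file and reads it back. The sample contents and the column name "name" are made up for illustration. Note that in read.csv the encoding argument only marks the strings that were read as UTF-8 (or Latin-1); it does not re-encode them. The separate fileEncoding argument is the one that converts the file's contents while reading.

```r
# Write a one-column CSV whose bytes are UTF-8 (hypothetical sample data)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeLines(c("name", "Product\u2122"), con, useBytes = TRUE)
close(con)

# Read it back, marking the strings as UTF-8 rather than re-encoding them
data <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(data$name)

# Clean up the temporary file
unlink(tmp)
```

This only helps when the file itself holds correct UTF-8 bytes. If the text was already garbled before it was saved, it still needs to be repaired after reading.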