May 25, 2023 06:12:34

Fix text encoding in R

Question

I am having an issue with text encoding that I cannot solve.

I have a string in an Excel file that I'm reading into R that looks like: Productâ„¢. With a bit of research, I learned that â„¢ is the UTF-8 encoding of ™ that has been read incorrectly as CP-1252.

The UTF-8 hex code for ™ is 0xe2 0x84 0xa2. This has been read as CP-1252: â (E2) „ (84) ¢ (A2).
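
The byte mapping described above can be reproduced directly in R, which helps confirm which pair of encodings is involved. This is a minimal sketch, assuming the platform's iconv recognises the name "CP1252"; the variable names are illustrative only.

```r
# The trademark sign, written as a Unicode escape so the script is encoding-safe
tm <- "\u2122"

# Its UTF-8 bytes
tm_bytes <- charToRaw(enc2utf8(tm))
print(tm_bytes)
#> [1] e2 84 a2

# Decode those same three bytes as if they were CP-1252
garbled <- iconv(list(tm_bytes), from = "CP1252", to = "UTF-8")

# The result is the three characters seen in the spreadsheet:
# U+00E2 (a with circumflex), U+201E (low double quote), U+00A2 (cent sign)
print(garbled == "\u00e2\u201e\u00a2")
```

If the last comparison is TRUE, the garbled text is consistent with UTF-8 bytes having been decoded as CP-1252.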

How can I fix this issue? I have tried using:

iconv("Productâ„¢", "cp1252", "utf-8")
#> [1] "ProductÃ¢â€žÂ¢"

But as you can see, the output is incorrect. The desired output is Product™.

Any ideas about how to fix this issue? The incorrect data is in an Excel spreadsheet, but I am trying to clean the text in R. A solution to fix the original data or a data cleaning solution in R would be great.

Answer 1

Score: 1

Update: I had the arguments backwards. The garbled string is already held in R as UTF-8, so the conversion has to go from UTF-8 to CP-1252, which restores the original bytes of ™. I was able to solve it by using:

iconv("Productâ„¢", "utf-8", "cp1252")
#> [1] "Product™"

Special thanks to @BalusC and this answer which showed me how to identify which encodings were being used erroneously.
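
For a whole column rather than a single string, the same round trip can be wrapped in a small function. The sketch below is one possible approach and not part of the original answer: `fix_mojibake` is a hypothetical helper name, and it assumes the damage is exactly UTF-8 that was decoded as CP-1252. It also marks the result as UTF-8 so that it does not depend on the session locale, and it leaves a value untouched when the round trip fails or does not yield valid UTF-8.

```r
# Hypothetical helper: undo UTF-8 text that was mistakenly decoded as CP-1252
fix_mojibake <- function(x) {
  # Re-encode to CP-1252 to recover the original bytes; iconv() returns NA
  # for any element containing a character that CP-1252 cannot represent
  repaired <- iconv(enc2utf8(x), from = "UTF-8", to = "CP1252")
  # The recovered bytes are really UTF-8, so declare them as such
  Encoding(repaired) <- "UTF-8"
  # Accept the repair only if it succeeded and produced valid UTF-8
  ok <- !is.na(repaired) & validUTF8(repaired)
  ifelse(ok, repaired, x)
}

# One garbled value, one plain ASCII value, one value that was already correct
x <- c("Product\u00e2\u201e\u00a2", "Plain text", "caf\u00e9")
fixed <- fix_mojibake(x)
print(fixed)
```

Only the first element should change here; the other two do not survive the validity check as repairs, so they are returned as they were. Text that mixes several kinds of encoding damage would need more than this.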

Answer 2

Score: 0

You can also try specifying the encoding when reading the file.

Assuming your file is a CSV, you can do something like this:

data <- read.csv("data.csv", encoding="UTF-8")
print(data)
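
A self-contained way to try this is to write a small UTF-8 file and read it back. This is only a sketch under assumptions: the temporary file and the column name `name` are made up for illustration, and it applies only if the Excel data has first been exported to CSV. Note that `encoding` in `read.csv` marks the strings as UTF-8 without converting them, whereas the separate `fileEncoding` argument re-encodes the file into the session's native encoding.

```r
# Write a tiny CSV whose bytes are known to be UTF-8 (illustrative data)
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw(enc2utf8("name\nProduct\u2122\n")), tmp)

# Read it back, telling R that the strings are UTF-8
data <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

# Compare against the expected value rather than relying on console display
print(data$name[1] == "Product\u2122")

# Clean up the temporary file
unlink(tmp)
```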
