2023-07-13 23:04:51

unable to translate '...' to a wide string

Question

It looks to me like R introduced a new error in version 4.3.0, which breaks a lot of my web-scrapers. I only found one mention of the change, but don't really understand the blog post.

In essence, this code fails on newer versions of R, but older versions do some internal conversion that seems to work:

    text <- "\xa0 x"
    gsub("x", "u", text)
    #> Warning in gsub("x", "u", text): unable to translate '<a0> x' to a wide string
    #> Error in gsub("x", "u", text): input string 1 is invalid

<sup>Created on 2023-07-13 with reprex v2.0.2</sup>

Is there any way to remove these special characters before doing string operations? Note that I do not know which characters specifically fail, since the real strings I'm working with are too long to check.

Answer 1

Score: 8

It's an encoding issue: `text` is not a valid string in the session's encoding. In a UTF-8 locale, the lone byte `\xa0` is not a valid UTF-8 sequence; in ISO-8859-1 (Latin-1) that byte is a non-breaking space, which often turns up in scraped HTML.
Converting from Latin-1 to UTF-8 makes the string valid:
```R
text_utf8 <- iconv(text, from = "ISO-8859-1", to = "UTF-8")
gsub("x", "u", text_utf8)
```

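This produces `"\u00a0 u"`: the non-breaking space survives as a valid UTF-8 character, so the result prints as whitespace followed by `u`. The conversion assumes the input really is Latin-1.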
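When only some scraped strings are broken, converting everything from Latin-1 would mangle the ones that are already valid UTF-8. A minimal sketch that repairs only the invalid elements; the helper name `fix_latin1` is hypothetical, and treating invalid input as Latin-1 is an assumption about the source pages:

```R
# Hypothetical helper: convert only the elements that are not valid UTF-8,
# assuming (this is an assumption) that those elements are Latin-1.
fix_latin1 <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)
  x[bad] <- iconv(x[bad], from = "ISO-8859-1", to = "UTF-8")
  x
}

text <- c("\xa0 x", "plain x")
fixed <- fix_latin1(text)

# All elements are now valid, so gsub() no longer rejects the input
out <- gsub("x", "u", fixed)
print(out)
```

Strings that were already valid pass through untouched, so the helper can be applied to a whole vector of scraped text before any regular-expression work.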

The R 4.3.0 changelog says: "Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8)."
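Since the real strings are too long to inspect by hand, base R's `validUTF8()` can locate the offending elements, and `iconv()` with `sub = "byte"` renders the invalid bytes in `<xx>` form. A small sketch, with a made-up `pages` vector standing in for scraped text:

```R
# Made-up sample data: the second and third elements contain bytes
# that are not valid UTF-8
pages <- c("ok text", "\xa0 x", "caf\xe9 au lait")

# Indices of elements that the regex functions would reject
bad <- which(!validUTF8(pages))

# Show the invalid bytes as <xx> so they can be read in the console
shown <- iconv(pages[bad], from = "UTF-8", to = "UTF-8", sub = "byte")

print(bad)
print(shown)
```

Seeing which bytes are involved helps in guessing the real source encoding before choosing a conversion.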

You could also treat the input as a sequence of bytes with `useBytes = TRUE`; the invalid byte is then carried through to the output unchanged:

    gsub("x", "u", text, useBytes = TRUE)

This gives `"\xa0 u"`.
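To answer the literal question of removing the offending characters: `iconv()` with `sub = ""` drops every byte that is not valid in the `from` encoding. A hedged sketch, assuming the data is meant to be UTF-8; note that this discards information, whereas the Latin-1 conversion above keeps it:

```R
text <- "\xa0 x"

# Drop bytes that are not valid UTF-8 (here the lone 0xA0)
clean <- iconv(text, from = "UTF-8", to = "UTF-8", sub = "")

# The cleaned string is valid, so regex functions accept it
result <- gsub("x", "u", clean)
print(result)
```

Passing a visible placeholder such as `sub = "?"` instead of `""` keeps a marker where bytes were dropped, which can make later debugging easier.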
