unable to translate '...' to a wide string

Question

It looks to me like R introduced a new error in version 4.3.0, which breaks a lot of my web-scrapers. I only found one mention of the change, but don't really understand the blog post.

In essence, this code fails on newer versions of R, but older versions do some internal conversion that seems to work:

text <- "\xa0 x"
gsub("x", "u", text)
#> Warning in gsub("x", "u", text): unable to translate '<a0> x' to a wide string
#> Error in gsub("x", "u", text): input string 1 is invalid

<sup>Created on 2023-07-13 with reprex v2.0.2</sup>

Is there any way to remove these special characters before doing string operations? Note that I do not know which characters specifically fail, since the real strings I'm working with are too long to check.

Answer 1

Score: 8

It's an encoding issue: `text` is not a valid string in the session's encoding (typically UTF-8), because the lone byte `\xa0` is not valid UTF-8.

If the input is really Latin-1, convert it to UTF-8 first:

```R
text_utf8 <- iconv(text, from = "ISO-8859-1", to = "UTF-8")
gsub("x", "u", text_utf8)
```

This produces `"\u00a0 u"`, i.e. a non-breaking space followed by ` u`.

The R 4.3.0 changelog says: "Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8)."

You could also treat the input as a sequence of bytes (the invalid byte is then preserved in the output):

```R
gsub("x", "u", text, useBytes = TRUE)
```

This gives `'\xa0 u'`.
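
Since the question mentions not knowing which strings fail, here is a rough sketch of how one might find and repair them in bulk. The vector `texts` and the names `bad`, `cleaned` and `stripped` are made up for illustration, and the Latin-1 guess is an assumption that may not hold for your scraped pages. The `sub = ""` trick relies on `iconv()` replacing bytes it cannot convert, which can behave differently across platforms, so treat it as a fallback rather than a guaranteed fix.

```r
# Hypothetical input: the second element contains the raw byte 0xA0,
# which is not valid UTF-8 on its own.
texts <- c("plain x", "\xa0 x")

# validUTF8() (base R) flags elements that are not valid UTF-8.
bad <- !validUTF8(texts)

# Assumption: the invalid elements are really Latin-1 (ISO-8859-1).
# Convert only those; elements that are already valid are left untouched.
cleaned <- texts
cleaned[bad] <- iconv(texts[bad], from = "ISO-8859-1", to = "UTF-8")

# Fallback when the source encoding is unknown: ask iconv() to replace
# every non-convertible byte with the empty string, i.e. drop it.
stripped <- iconv(texts, from = "UTF-8", to = "UTF-8", sub = "")

# Both versions can now be passed to the regex functions.
gsub("x", "u", cleaned)
gsub("x", "u", stripped)
```

Converting keeps the non-breaking space as a real character, while the fallback throws the offending byte away, so which one you want depends on whether that whitespace matters downstream.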

huangapple
  • Posted on 2023-07-13 23:04:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76680882.html