Replacing characters in R string based on raw hex values

Question

Suppose I have a string in R,

mystring = 'help me'

but with a twist: the space between 'help' and 'me' is actually a non-breaking space. In UTF-8 the non-breaking space (U+00A0) is stored as the two bytes <c2 a0>, so this string can be created by

mystring = rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))

Then, for example, grepl('help me', mystring) returns FALSE, because the pattern contains a regular space.

How can I replace the non-breaking space with a regular space? And, more generally, how can I replace any particular raw value(s) with a particular character? Ideally I would be able to write a function that is called like

gsubRaw(mystring, as.raw(as.hexmode(c('c2','a0'))), ' ')

This answer almost answers my question, except that I don't want to replace ALL non-ASCII characters with a space, only the non-breaking space.

grepRaw() also came close, because it can detect where the raw bytes occur in the string, so that they can then be replaced. However, it didn't work cleanly: sometimes the position that grepRaw() returned wasn't the same as the position of the non-breaking space in the string-as-plain-text, and I don't know how to replace the raw values themselves.
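One likely reason for the position mismatch is that grepRaw() reports byte offsets, while positions in the displayed text count characters, so any multi-byte character earlier in the string shifts the two apart. A minimal sketch, assuming the bytes are UTF-8 (the string x and the variable names are made up for illustration):

```r
# "café", a non-breaking space, then "ok"; the é takes two bytes (c3 a9) in UTF-8
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9, 0xc2, 0xa0, 0x6f, 0x6b)))
nbsp <- as.raw(c(0xc2, 0xa0))

# byte offset of the non-breaking space, as grepRaw() sees it
byte_pos <- grepRaw(nbsp, charToRaw(x), fixed = TRUE)
byte_pos
# 6

# character position: index of code point U+00A0 among the decoded characters
char_pos <- which(utf8ToInt(x) == 0xA0)
char_pos
# 5
```

The two numbers differ by one because the two-byte é sits in front of the non-breaking space.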

Answer 1

Score: 1

Here's an option. You specify the replacement in plain text (e.g., " "), and the function converts it to raw bytes. It then converts your string to raw bytes and pastes their hex codes together with a colon, making a single string, and does the same with the pattern bytes. Every instance of the pattern string is then replaced with the hex code of the replacement. Finally, the string is split on the joining character (a colon in the example below) and the bytes are converted back from raw to plain text.

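library(stringr)
mystring = rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))

gsubRaw <- function(mystring, pattern, replacement){
  # hex codes of the replacement bytes, joined with ":" (" " becomes "20")
  rpl <- paste(charToRaw(replacement), collapse=":")
  # hex codes of the input bytes, joined with ":"
  r <- charToRaw(mystring)
  r2 <- paste(r, collapse=":")
  # hex codes of the pattern bytes, joined the same way (here "c2:a0")
  pat <- paste(pattern, collapse=":")
  r2 <- gsub(pat, rpl, r2, fixed=TRUE)
  # split back into single bytes and convert them to text
  s <- c(str_split(r2, ":", simplify=TRUE))
  rawToChar(as.raw(as.hexmode(s)))
}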

tst <- gsubRaw(mystring, c("c2", "a0"), " ")
tst
#> [1] "help me"
grepl(" ", mystring)
#> [1] FALSE
grepl(" ", tst)
#> [1] TRUE

Created on 2023-07-02 with reprex v2.0.2
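To make the intermediate steps of this approach visible, here is a step-by-step sketch on the example string. It is an illustration rather than part of the answer: it swaps in base R's strsplit() for stringr's str_split(), and the variable names hex, swapped, bytes and result are made up.

```r
mystring <- rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))

# 1. bytes of the string as hex codes joined by ":"
hex <- paste(charToRaw(mystring), collapse = ":")
hex
# "68:65:6c:70:c2:a0:6d:65"

# 2. replace the pattern "c2:a0" with "20", the hex code of a regular space
swapped <- gsub("c2:a0", "20", hex, fixed = TRUE)
swapped
# "68:65:6c:70:20:6d:65"

# 3. split on ":" and turn the hex codes back into text
bytes <- strsplit(swapped, ":", fixed = TRUE)[[1]]
result <- rawToChar(as.raw(as.hexmode(bytes)))
result
# "help me"
```

Because every byte is written as exactly two hex digits, a pattern such as "c2:a0" can only match on whole-byte boundaries.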

Answer 2

Score: 1

You could use the replacement operator:

gsubRaw <- function(string, pattern, replacement){
  # b: bytes of the string; d: TRUE where a byte equals any of the pattern bytes
  d <- (b <- charToRaw(string)) %in% as.raw(as.hexmode(pattern))
  # overwrite every flagged byte with the replacement byte
  b[d] <- charToRaw(replacement)
  # within a run of adjacent flagged bytes, zero out all but the first
  b[(e <- which(d))[c(0,diff(e)) == 1]] <- as.raw(0)
  # drop the zeroed bytes and convert back to text
  rawToChar(b[b != as.raw(0)])
}

tst <- gsubRaw(mystring, c("c2", "a0"), " ")
tst
#> [1] "help me"
grepl(" ", mystring)
#> [1] FALSE
grepl(" ", tst)
#> [1] TRUE
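Note that %in% flags each byte separately, so a lone c2 or a0 byte that belongs to a different character (for example à, which is c3 a0 in UTF-8) is also replaced, and adjacent non-breaking spaces collapse into one replacement. If that matters, a variant that matches the whole byte sequence could look like the sketch below. The name gsubRawSeq is hypothetical, and the sketch relies on grepRaw() with fixed = TRUE and all = TRUE to find the start of each match.

```r
gsubRawSeq <- function(string, pattern, replacement) {
  b   <- charToRaw(string)            # bytes of the input
  pat <- as.raw(as.hexmode(pattern))  # pattern bytes, e.g. c2 a0
  rpl <- charToRaw(replacement)       # replacement bytes (any length)

  # byte offsets where the complete pattern sequence starts
  starts <- grepRaw(pat, b, fixed = TRUE, all = TRUE)
  if (length(starts) == 0) return(string)

  out <- raw(0)
  pos <- 1
  for (s in starts) {
    if (s < pos) next                             # skip any overlapping match
    if (s > pos) out <- c(out, b[pos:(s - 1)])    # copy bytes before the match
    out <- c(out, rpl)                            # emit the replacement
    pos <- s + length(pat)                        # continue after the match
  }
  if (pos <= length(b)) out <- c(out, b[pos:length(b)])  # copy the tail
  rawToChar(out)
}

# "à la", a non-breaking space, then "c": only the c2 a0 pair should change
input <- rawToChar(as.raw(c(0xc3, 0xa0, 0x20, 0x6c, 0x61, 0xc2, 0xa0, 0x63)))
fixed_bytes <- charToRaw(gsubRawSeq(input, c("c2", "a0"), " "))
```

Here the a0 byte inside à is left alone, because it is not preceded by c2.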

Answer 3

Score: 1

From comments on my answer to the other question, we can do this by using the fact that the non-breaking space is \xc2\xa0 (at least in R 4.3.1 on Windows):

mystring = rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))
grepl('help me', mystring)
#> [1] FALSE
tools::showNonASCII(mystring)
#> 1: help<c2><a0>me

identical('help\xc2\xa0me', mystring)
#> [1] TRUE

# replace the non-breaking space with a regular space
mynewstring = gsub('\xc2\xa0+', ' ', mystring)
grepl('help me', mynewstring)
#> [1] TRUE
# prints nothing: no non-ASCII characters remain
tools::showNonASCII(mynewstring)

Created on 2023-07-05 with reprex v2.0.2
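Since the answer hedges on platform and R version, a variant that names the character by its Unicode code point may be less sensitive to the session's locale. This is a sketch under the assumption that the bytes really are UTF-8, which is declared explicitly with Encoding():

```r
mystring <- rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))

# Assumption: the bytes are UTF-8, so mark the string accordingly
Encoding(mystring) <- "UTF-8"

# "\u00a0" is the non-breaking space written as a Unicode escape;
# fixed = TRUE treats it as a literal string rather than a regular expression
mynewstring <- gsub("\u00a0", " ", mystring, fixed = TRUE)

grepl("help me", mynewstring)
# TRUE
```

Unlike the pattern with a trailing +, this replaces each non-breaking space with its own regular space.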

huangapple
  • Posted on 2023-07-03 08:42:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76601289.html