Extract leading numbers from a string when their length varies (R)

Question

I have a column of character strings that mix letters and digits. Each string always starts with one or two digits, followed by one or more other characters. I am trying to split each string at the position of the first non-digit character.

    library(tidyverse)  # tribble(), separate(), mutate(), %>%

    have <-
      tribble(
        ~string,
        '12main',
        '6six',
        '42go',
        '5to9'
      )

    want <-
      tribble(
        ~prefix, ~rest,
        '12', 'main',
        '6', 'six',
        '42', 'go',
        '5', 'to9'
      )

I'm sure there is a regex solution using separate, but I am having trouble getting it to work.

    want <-
      have %>%
      separate(string,
               into = c('prefix', 'rest'),
               sep = "(?=[0-9])(?<=[a-zA-Z])")

Answer 1

Score: 2

You were close. It works with one look-behind (for a digit) and one look-ahead (for a non-digit):

    have %>%
      separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
    # # A tibble: 4 × 2
    #   prefix rest
    #   <chr>  <chr>
    # 1 12     main
    # 2 6      six
    # 3 42     go
    # 4 5      to9

I think you had the look-arounds reversed: (?<=...) is a look-behind and constrains the preceding text (so it should hold [0-9]), while (?=...) is a look-ahead and constrains the following text (so it should hold [^0-9] or [A-Za-z]).

Personally I find this a bit intriguing: we are splitting on a zero-length pattern. The match is the position between a digit and the non-digit that follows it, so the split consumes no characters.
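
One way to see where a zero-width pattern matches is to insert a visible marker at every match position. The sketch below uses base R only; the vector x and the marker "|" are made up for illustration and are not part of the original answer.

```r
# Sample strings (hypothetical, modeled on the question's data)
x <- c("12main", "6six", "42go", "5to9", "5to9to5")

# Corrected pattern: look-behind for a digit, look-ahead for a non-digit
fixed <- gsub("(?<=[0-9])(?=[^0-9])", "|", x, perl = TRUE)

# The question's pattern: look-ahead for a digit, look-behind for a letter
reversed <- gsub("(?=[0-9])(?<=[a-zA-Z])", "|", x, perl = TRUE)

# Compare the marked positions side by side
print(data.frame(x, fixed, reversed))
```

The reversed pattern only marks positions where a letter is followed by a digit, which is why it never splits a string such as 12main.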

FYI, this produces a warning if a string contains two such positions, such as 5to9to5:

    have <- structure(
      list(string = c("12main", "6six", "42go", "5to9", "5to9to5")),
      class = c("tbl_df", "tbl", "data.frame"),
      row.names = c(NA, -5L)
    )
    have
    # # A tibble: 5 × 1
    #   string
    #   <chr>
    # 1 12main
    # 2 6six
    # 3 42go
    # 4 5to9
    # 5 5to9to5

    have %>%
      separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
    # Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [5].
    # # A tibble: 5 × 2
    #   prefix rest
    #   <chr>  <chr>
    # 1 12     main
    # 2 6      six
    # 3 42     go
    # 4 5      to9
    # 5 5      to9

The warning tells you that some information is being discarded; it is up to you whether you want or need to guard against this.
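
If the extra pieces should be kept rather than discarded, one option is the extra argument of separate. The sketch below assumes the tidyr package is installed and uses a plain data.frame called have2 (a made-up name) in place of the tibble above. With extra = "merge", tidyr splits at most length(into) times, so everything after the first split should stay in the last column.

```r
library(tidyr)  # provides separate() and re-exports the %>% pipe

# Hypothetical stand-in for the five-row example data
have2 <- data.frame(string = c("12main", "6six", "42go", "5to9", "5to9to5"))

merged <- have2 %>%
  separate(string,
           into  = c("prefix", "rest"),
           sep   = "(?<=[0-9])(?=[^0-9])",
           extra = "merge")  # keep the remainder in `rest` instead of dropping it

print(merged)
```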

An alternative, since you have 5to9to5 in your real data:

    have %>%
      mutate(strcapture("^([0-9]+)([^0-9].*)", string, list(prefix = "", rest = "")))
    # # A tibble: 5 × 3
    #   string  prefix rest
    #   <chr>   <chr>  <chr>
    # 1 12main  12     main
    # 2 6six    6      six
    # 3 42go    42     go
    # 4 5to9    5      to9
    # 5 5to9to5 5      to9to5

You can then drop the string column if you no longer need it.

Another note: if you intend to convert prefix to an integer or numeric column, you can skip that step by passing list(prefix=0L, rest="") (or just prefix=0 for a double) instead. That list is the proto= argument: its values are discarded, but its names and classes determine the name and type of each resulting column.
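
As a minimal sketch of the typed proto idea, the base R example below calls strcapture directly on a character vector. The vector x and the result name parts are made up for illustration, and the prototype is written as an empty data.frame, which is the form the strcapture help page documents.

```r
# Sample strings (hypothetical, modeled on the answer's data)
x <- c("12main", "6six", "42go", "5to9", "5to9to5")

# proto supplies the column names and types; its (empty) contents are ignored
parts <- strcapture(
  pattern = "^([0-9]+)([^0-9].*)",
  x       = x,
  proto   = data.frame(prefix = integer(), rest = character(),
                       stringsAsFactors = FALSE)
)

# prefix comes back as an integer column, rest as character
str(parts)
```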

Answer 2

Score: 1

You can also use extract:

    have %>%
      extract(string, c('prefix', 'rest'), "(\\d+)(.*)")
    # # A tibble: 4 × 2
    #   prefix rest
    #   <chr>  <chr>
    # 1 12     main
    # 2 6      six
    # 3 42     go
    # 4 5      to9
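
Because the pattern captures everything after the leading digits in one group, this approach should also keep a string such as 5to9to5 intact. The sketch below assumes tidyr is installed, uses a made-up data.frame have2, and adds convert = TRUE, which asks extract to run type conversion on the new columns. Calling tidyr::extract explicitly avoids a clash with magrittr::extract if that package is attached.

```r
library(tidyr)  # provides extract() and re-exports the %>% pipe

# Hypothetical stand-in for the example data, including the 5to9to5 case
have2 <- data.frame(string = c("12main", "6six", "42go", "5to9", "5to9to5"))

extracted <- have2 %>%
  tidyr::extract(string,
                 into    = c("prefix", "rest"),
                 regex   = "^(\\d+)(.*)$",
                 convert = TRUE)  # convert the digit-only prefix to integer

str(extracted)
```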

huangapple
  • Posted on 2023-04-20 02:12:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76057649.html