April 20, 2023, 02:12:17

Extract leading numbers from string, but length varies R

Question

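I have a column of character strings containing letters and numbers. Each string always starts with one or two digits, followed by several other characters. I am trying to split each string at the position of the first non-digit character.
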
    have <-
      tribble(
        ~string,
        '12main',
        '6six',
        '42go',
        '5to9'
      )
    
    want <- 
      tribble(
        ~prefix, ~rest,
        '12', 'main',
        '6', 'six',
        '42', 'go',
        '5', 'to9'
      )

I'm sure there is a regex solution using `separate`, but I'm having trouble getting it to work.

    want <-
      have %>%
      separate(string,
               into = c('prefix', 'rest'),
               sep = "(?=[0-9])(?<=[a-zA-Z])")

Answer 1

Score: 2

You were close. This works with one look-behind (for a digit) and one look-ahead (for a non-digit):

    have %>%
      separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
    # # A tibble: 4 × 2
    #   prefix rest 
    #   <chr>  <chr>
    # 1 12     main 
    # 2 6      six  
    # 3 42     go   
    # 4 5      to9

I think you had the look-arounds reversed: `?<=` looks at the preceding text (so it should be used with `[0-9]`), and `?=` looks at the following text (so it should be used with `[^0-9]` or `[A-Za-z]`).

Personally I find this a bit intriguing: we are splitting the string on a zero-length pattern. Nothing sits between the position where the previous character is a digit and the next character is a non-digit, so the separator itself has length 0.
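
To make the zero-length idea concrete, here is a minimal base R sketch (no tidyverse needed) that asks `gregexpr` where the pattern matches and how long each match is. The vector `x` and the variable names are illustrative assumptions, not part of the original answer.

```r
# Illustrative input mirroring the strings used in this thread.
x <- c("12main", "6six", "42go", "5to9", "5to9to5")

# Same pattern as in the separate() call: digit behind, non-digit ahead.
pattern <- "(?<=[0-9])(?=[^0-9])"

# gregexpr() reports where each match starts and how long it is.
# perl = TRUE is needed because look-arounds are a PCRE feature.
matches <- gregexpr(pattern, x, perl = TRUE)

# Number of split positions found in each string. More than one means
# separate() would produce more pieces than the two columns asked for.
n_boundaries <- lengths(matches)

# Length of every match: the pattern matches a position, not any text.
match_lengths <- unlist(lapply(matches, attr, "match.length"))

print(n_boundaries)
print(match_lengths)
```

Checking `n_boundaries > 1` is one way to spot in advance the rows that would have more than one split position.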

FYI, this does produce a warning if a string contains two such positions, such as `5to9to5`:

    have <- structure(list(string = c("12main", "6six", "42go", "5to9", "5to9to5")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
    have
    # # A tibble: 5 × 1
    #   string 
    #   <chr>  
    # 1 12main 
    # 2 6six   
    # 3 42go   
    # 4 5to9   
    # 5 5to9to5
    have %>%
      separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
    # Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [5].
    # # A tibble: 5 × 2
    #   prefix rest 
    #   <chr>  <chr>
    # 1 12     main 
    # 2 6      six  
    # 3 42     go   
    # 4 5      to9  
    # 5 5      to9

The warning tells you that some information is being discarded; it is up to you whether you want or need to guard against this.
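
One possible guard, sketched in base R under the assumption that only the first digit-to-non-digit boundary should count: `regexpr` returns just the first match, so any later boundaries stay inside `rest`. The helper name `split_at_first_boundary` is hypothetical and not part of tidyr or the original answer.

```r
# Hypothetical helper (not from the thread): split each string at the first
# digit-to-non-digit boundary only, keeping everything after it in `rest`.
split_at_first_boundary <- function(x) {
  # regexpr() returns only the first match; as.integer() drops the
  # match.length attributes so they do not leak into the result.
  pos <- as.integer(regexpr("(?<=[0-9])(?=[^0-9])", x, perl = TRUE))
  found <- pos > 0  # regexpr() returns -1 when there is no match
  data.frame(
    prefix = ifelse(found, substring(x, 1, pos - 1), NA_character_),
    rest   = ifelse(found, substring(x, pos), NA_character_),
    stringsAsFactors = FALSE
  )
}

res <- split_at_first_boundary(c("12main", "6six", "42go", "5to9", "5to9to5"))
print(res)
```

Strings with no such boundary come back as `NA` in both columns, a design choice made here so that unexpected input is visible rather than silently passed through.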

An alternative, since you have 5to9to5 in your real data:

    have %>%
      mutate(strcapture("^([0-9]+)([^0-9].*)", string, list(prefix="", rest="")))
    # # A tibble: 5 × 3
    #   string  prefix rest  
    #   <chr>   <chr>  <chr> 
    # 1 12main  12     main  
    # 2 6six    6      six   
    # 3 42go    42     go    
    # 4 5to9    5      to9   
    # 5 5to9to5 5      to9to5

You can now remove `string` if you want.

Another note: if you intend to convert `prefix` to an integer or a number, you can avoid that extra step by using `list(prefix=0L, rest="")` (or just `=0`) instead. That is the `proto=` argument: its data is discarded, but its names and classes determine the names and target classes of the resulting columns.
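
As a small illustration of the `proto=` behaviour, the following base R sketch calls `strcapture` directly on a character vector with a zero-row data frame as the prototype. The vector `x` and the name `parsed` are assumptions made for the example.

```r
# Illustrative input mirroring the strings used in this thread.
x <- c("12main", "6six", "42go", "5to9", "5to9to5")

# The prototype carries only column names and classes; it has no rows.
proto <- data.frame(prefix = integer(), rest = character())

# Each capture group is converted to the class of the matching proto column.
parsed <- strcapture("^([0-9]+)([^0-9].*)", x, proto = proto)

str(parsed)
```

Because `prefix` is declared as integer in the prototype, no separate `as.integer()` step is needed afterwards.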

Answer 2

Score: 1

You can also use `extract`:

    have %>%
      extract(string, c('prefix', 'rest'), "(\\d+)(.*)")
    # # A tibble: 4 × 2
    #   prefix rest 
    #   <chr>  <chr>
    # 1 12     main 
    # 2 6      six  
    # 3 42     go   
    # 4 5      to9
