separate_wider_regex with lookahead


Question

I have a data frame of sporting events (with no assumptions about the number of spaces or words), each with an optional year that can be formatted in a few different ways.

tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))

How can I use tidyr::separate_wider_regex to split event_optional_year into two columns event and year? I want event in this case to be stripped of the optional year, and year equal to NA, 12, 2016 and 2020/2021, respectively.

I tried fiddling with positive lookahead in the regex:

tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221")) |>
    tidyr::separate_wider_regex(
      "event_optional_year",
      c(
        event = ".*(?=(?:\\d.*\\d$)?)",
        year = "\\d.*\\d$"
      ),
      too_few = "align_start"
    )

but this gives as result:

  event                 year 
  <chr>                 <chr>
1 "World Championships" NA   
2 "Summer Olympics "    12   
3 "Olympics 20"         16   
4 "Olympics 2020/2"     21 

Question: which regex gives me the desired result?

Answer 1

Score: 3

Unnamed patterns in separate_wider_regex() simplify this situation a bit: they are matched but not included in the output. event = ".*" is greedy and matches everything up to the last match of "\\s+(?=\\d)", that is, a run of whitespace that is followed by a digit (assuming that the year part starts with a digit). This handles spaces in event but assumes there are none in year.

library(dplyr)
library(tidyr)
tibble(event_optional_year = c("World Championships", 
                               "Summer Olympics 12", 
                               "Olympics 2016", 
                               "Olympics 2020/221")) %>% 
  separate_wider_regex(event_optional_year, 
                       c(event = ".*", "\\s+(?=\\d)", year = ".*$"), 
                       too_few = "align_start")
#> # A tibble: 4 × 2
#>   event               year    
#>   <chr>               <chr>   
#> 1 World Championships <NA>    
#> 2 Summer Olympics     12      
#> 3 Olympics            2016    
#> 4 Olympics            2020/221

Created on 2023-06-25 with reprex v2.0.2
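
To see what the three pieces do when they are glued together, here is a minimal sketch that applies them as one anchored pattern with stringr::str_match(). It assumes stringr is installed and is only meant to illustrate the regex itself; it is not a description of how separate_wider_regex() works internally. The variable names x, m, event and year are made up for this illustration.

```r
library(stringr)

x <- c("World Championships", "Summer Olympics 12",
       "Olympics 2016", "Olympics 2020/221")

# Greedy event, then whitespace that must be followed by a digit
# (lookahead, so the digit itself is not consumed), then the rest as year.
m <- str_match(x, "^(.*)\\s+(?=\\d)(.*)$")

# Rows without a "whitespace + digit" boundary do not match at all:
# keep the whole string as event and leave year as NA.
event <- ifelse(is.na(m[, 1]), x, m[, 2])
year  <- m[, 3]

data.frame(event, year)
```

One thing to keep in mind with this pattern: because ".*" is greedy and "\\s+" needs only one character, an input with several spaces before the year leaves all but one of them at the end of event, so trimming event afterwards (for example with stringr::str_trim()) may be needed.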

huangapple
  • Posted on 2023-06-25 18:37:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76549979.html