separate_wider_regex with lookahead
Question
I have a data frame of sporting events (no assumptions can be made about the number of spaces or words in the event name), each with an optional year that can be formatted in a few different ways.
tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))
How can I use tidyr::separate_wider_regex to split event_optional_year into two columns, event and year? The event column should have the optional year stripped, and year should be NA, 12, 2016 and 2020/2021, respectively.
I tried fiddling with positive lookahead in the regex:
tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221")) |>
    tidyr::separate_wider_regex(
      "event_optional_year",
      c(
        event = ".*(?=(?:\\d.*\\d$)?)",
        year = "\\d.*\\d$"
      ),
      too_few = "align_start"
    )
but this gives the following result:
  event                 year 
  <chr>                 <chr>
1 "World Championships" NA   
2 "Summer Olympics "    12   
3 "Olympics 20"         16   
4 "Olympics 2020/2"     21 
Question: which regex gives me the desired result?
Answer 1
Score: 3
Unnamed patterns in separate_wider_regex() simplify this a bit. event = ".*" is greedy and matches everything before "\\s+(?=\\d)", that is, one or more whitespace characters followed by a digit (assuming the year part starts with a digit). This handles spaces in event but assumes there are none in year.
library(dplyr)
library(tidyr)
tibble(event_optional_year = c("World Championships", 
                               "Summer Olympics 12", 
                               "Olympics 2016", 
                               "Olympics 2020/221")) %>% 
  separate_wider_regex(event_optional_year,
                       c(event = ".*", "\\s+(?=\\d)", year = ".*$"),
                       too_few = "align_start")
#> # A tibble: 4 × 2
#>   event               year    
#>   <chr>               <chr>   
#> 1 World Championships <NA>    
#> 2 Summer Olympics     12      
#> 3 Olympics            2016    
#> 4 Olympics            2020/221
<sup>Created on 2023-06-25 with reprex v2.0.2</sup>
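To check the regex logic without tidyr, the same split can be sketched in base R. This is only an illustration of the greedy-match behaviour described above, not how separate_wider_regex() works internally; the helper name split_event_year and the extra "Formula 1 Grand Prix 2016" value are hypothetical and not part of the original question.

```r
# Hypothetical helper: split "event [year]" strings with base R only.
# Assumes, as in the answer, that the year part starts with a digit and
# contains no whitespace followed by a digit.
split_event_year <- function(x) {
  sep <- "\\s+(?=\\d)"  # run of whitespace directly followed by a digit
  has_year <- grepl(sep, x, perl = TRUE)

  # The greedy ".*" backtracks to the LAST separator in the string,
  # mirroring event = ".*" in the tidyr call.
  event <- ifelse(has_year,
                  sub("^(.*)\\s+(?=\\d).*$", "\\1", x, perl = TRUE),
                  x)
  # Drop everything up to and including the last separator; NA if none.
  year <- ifelse(has_year,
                 sub("^.*\\s+(?=\\d)", "", x, perl = TRUE),
                 NA_character_)

  data.frame(event = event, year = year, stringsAsFactors = FALSE)
}

# Sample values from the question, plus one made-up value with a digit
# inside the event name.
x <- c("World Championships",
       "Summer Olympics 12",
       "Olympics 2016",
       "Olympics 2020/221",
       "Formula 1 Grand Prix 2016")

res <- split_event_year(x)
print(res)
```

Because the leading ".*" is greedy, a digit inside the event name (as in the made-up last value) stays in event, and only the text after the last whitespace-then-digit boundary ends up in year.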