separate_wider_regex with lookahead
Question
I have a data frame of sporting events (no assumptions can be made about the number of spaces or words) with an optional year that can be formatted in a few different ways.
tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))
How can I use tidyr::separate_wider_regex to split event_optional_year into two columns event and year? I want event in this case to be stripped of the optional year, and year equal to NA, 12, 2016 and 2020/2021, respectively.
I tried fiddling with positive lookahead in the regex:
tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221")) |>
  tidyr::separate_wider_regex(
    "event_optional_year",
    c(
      event = ".*(?=(?:\\d.*\\d$)?)",
      year = "\\d.*\\d$"
    ),
    too_few = "align_start"
  )
but this gives the following result:
  event                 year
  <chr>                 <chr>
1 "World Championships" NA
2 "Summer Olympics "    12
3 "Olympics 20"         16
4 "Olympics 2020/2"     21
Question: which regex gives me the desired result?
Answer 1
Score: 3
Unnamed patterns in separate_wider_regex() simplify this situation a bit: they are matched but dropped from the output. event = ".*" is greedy and matches everything before "\\s+(?=\\d)", i.e. one or more whitespace characters followed by a digit (assuming that the year part starts with a digit). This handles spaces in event but assumes there are none in year.
library(dplyr)
library(tidyr)
tibble(event_optional_year = c("World Championships",
                               "Summer Olympics 12",
                               "Olympics 2016",
                               "Olympics 2020/221")) %>%
  separate_wider_regex(event_optional_year,
                       c(event = ".*", "\\s+(?=\\d)", year = ".*$"),
                       too_few = "align_start")
#> # A tibble: 4 × 2
#>   event               year
#>   <chr>               <chr>
#> 1 World Championships <NA>
#> 2 Summer Olympics     12
#> 3 Olympics            2016
#> 4 Olympics            2020/221
<sup>Created on 2023-06-25 with reprex v2.0.2</sup>
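To see how the three pieces of the pattern interact without tidyr, the same split can be sketched in base R. This is only an illustration: joining the pieces into one anchored regex and the variable names `pattern`, `m`, `event` and `year` are assumptions made for this sketch, not a description of how separate_wider_regex() works internally. The fallback for non-matching rows simply mimics the result shown above for too_few = "align_start".

```r
# Base-R sketch of the same split (no tidyr needed).
x <- c("World Championships", "Summer Olympics 12",
       "Olympics 2016", "Olympics 2020/221")

# The three pieces from the answer joined into one anchored regex:
# greedy event, whitespace that is followed by a digit, then the rest as year.
# perl = TRUE is needed because lookahead is a PCRE feature.
pattern <- "^(.*)\\s+(?=\\d)(.*)$"

# regmatches() returns character(0) for strings that do not match,
# otherwise c(full match, group 1, group 2).
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Non-matching rows keep the whole string as event and get NA as year.
event <- vapply(seq_along(x),
                function(i) if (length(m[[i]])) m[[i]][2] else x[i],
                character(1))
year <- vapply(m,
               function(p) if (length(p)) p[3] else NA_character_,
               character(1), USE.NAMES = FALSE)

data.frame(event, year)
```

Because the first group is greedy, the split happens at the last whitespace run that is followed by a digit, which is why a year containing a space would not be captured whole.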