separate_wider_regex with lookahead


Question

I have a data frame of sporting events, each optionally followed by a year. I make no assumptions about the number of words or spaces in the event name, and the year can be formatted in a few different ways.

    tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))

How can I use tidyr::separate_wider_regex to split event_optional_year into two columns, event and year? event should be stripped of the optional year, and year should be NA, 12, 2016 and 2020/2021, respectively.

I tried using a positive lookahead in the regex:

    # Same input as above, so that the result shown below corresponds to this call
    tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221")) |>
      tidyr::separate_wider_regex(
        "event_optional_year",
        c(
          event = ".*(?=(?:\\d.*\\d$)?)",
          year = "\\d.*\\d$"
        ),
        too_few = "align_start"
      )

but this gives:

      event                 year
      <chr>                 <chr>
    1 "World Championships" NA
    2 "Summer Olympics "    12
    3 "Olympics 20"         16
    4 "Olympics 2020/2"     21

Question: which regex gives me the desired result?

Answer 1

Score: 3

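Unnamed patterns in separate_wider_regex() simplify this situation a bit. event = ".*" is greedy and matches everything before the unnamed pattern "\\s+(?=\\d)": one or more whitespace characters followed by a digit (assuming that the year part starts with a digit). Unnamed patterns are matched but not included in the output, so the separator is dropped. Rows without a year do not match the full pattern, and too_few = "align_start" pads year with NA for those. This handles spaces in event but assumes there are none in year.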

    library(dplyr)
    library(tidyr)
    tibble(event_optional_year = c("World Championships",
                                   "Summer Olympics 12",
                                   "Olympics 2016",
                                   "Olympics 2020/221")) %>%
      separate_wider_regex(event_optional_year,
                           c(event = ".*", "\\s+(?=\\d)", year = ".*$"),
                           too_few = "align_start")
    #> # A tibble: 4 × 2
    #>   event               year
    #>   <chr>               <chr>
    #> 1 World Championships <NA>
    #> 2 Summer Olympics     12
    #> 3 Olympics            2016
    #> 4 Olympics            2020/221

Created on 2023-06-25 with reprex v2.0.2
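
To see why the lookahead attempt from the question splits inside the year while the unnamed separator does not, the two patterns can be compared in base R. This is only a sketch under assumptions: it assumes that separate_wider_regex() joins the patterns into a single anchored regex with the named patterns as capture groups, and it uses perl = TRUE (PCRE) rather than the stringr engine that tidyr uses, which should treat these constructs the same way. capture_groups() is a hypothetical helper written for this illustration, not part of tidyr.

```r
# Only rows that contain a year, so both patterns match every string.
x <- c("Summer Olympics 12", "Olympics 2016", "Olympics 2020/221")

# Hypothetical helper: one row per string, one column per capture group.
capture_groups <- function(pattern, strings) {
  matches <- regmatches(strings, regexec(pattern, strings, perl = TRUE))
  # Element 1 of each match is the full match; keep the two groups only.
  t(vapply(matches, function(m) m[-1], character(2)))
}

# The question's attempt: the lookahead wraps an optional group, so it
# succeeds at every position. The greedy .* therefore gives back only the
# minimum that year needs, which is the last two digits.
attempt <- "^(.*(?=(?:\\d.*\\d$)?))(\\d.*\\d$)$"

# The answer's approach: an uncaptured whitespace separator that must be
# followed by a digit, so event cannot extend into the year.
fixed <- "^(.*)(?:\\s+(?=\\d))(.*$)$"

capture_groups(attempt, x)  # event keeps part of the year
capture_groups(fixed, x)    # event and year split at the separator
```

With the first pattern the event column comes back as "Summer Olympics ", "Olympics 20" and "Olympics 2020/2", matching the result in the question. With the second it is "Summer Olympics", "Olympics" and "Olympics", and year holds the full "12", "2016" and "2020/221".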
