2023-06-25 18:37:15

separate_wider_regex with lookahead

Question

I have a data frame of sporting events (with no assumptions about the number of spaces or words), each followed by an optional year that can be formatted in a few different ways.

tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))

How can I use tidyr::separate_wider_regex to split event_optional_year into two columns, event and year? I want event to be stripped of the optional year, and year to equal NA, 12, 2016 and 2020/221, respectively.

I tried fiddling with positive lookahead in the regex:

tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221")) |>
    tidyr::separate_wider_regex(
      "event_optional_year",
      c(
        event = ".*(?=(?:\\d.*\\d$)?)",
        year = "\\d.*\\d$"
      ),
      too_few = "align_start"
    )

but this gives the following result:

  event                 year 
  <chr>                 <chr>
1 "World Championships" NA   
2 "Summer Olympics "    12   
3 "Olympics 20"         16   
4 "Olympics 2020/2"     21

Question: which regex gives me the desired result?

Answer 1

Score: 3

Unnamed patterns in separate_wider_regex() simplify this a bit: they are matched but dropped from the output. event = ".*" is greedy and matches everything before "\\s+(?=\\d)", that is, one or more whitespace characters followed by a digit (assuming that the year part starts with a digit). This handles spaces in event but assumes there are none in year.

library(dplyr)
library(tidyr)
tibble(event_optional_year = c("World Championships", 
                               "Summer Olympics 12", 
                               "Olympics 2016", 
                               "Olympics 2020/221")) %>% 
  separate_wider_regex(event_optional_year, 
                       c(event = ".*", "\\s+(?=\\d)", year = ".*$"), 
                       too_few = "align_start")
#> # A tibble: 4 × 2
#>   event               year    
#>   <chr>               <chr>   
#> 1 World Championships <NA>    
#> 2 Summer Olympics     12      
#> 3 Olympics            2016    
#> 4 Olympics            2020/221

Created on 2023-06-25 with reprex v2.0.2
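
To see why the attempt in the question splits "2016" into "20" and "16", it can help to run both sets of patterns outside tidyr. The sketch below is a base R approximation, not tidyr's implementation: it assumes the pieces are joined into one anchored pattern with one capture group per named piece, and it uses PCRE (perl = TRUE) rather than the regex engine tidyr uses internally. split_event_year() is a hypothetical helper written for this illustration, and its fallback for non-matching rows only imitates too_few = "align_start".

```r
# Sample input copied from the question.
x <- c("World Championships", "Summer Olympics 12",
       "Olympics 2016", "Olympics 2020/221")

# Hypothetical helper (not part of tidyr): apply one anchored pattern with
# two capture groups and return the captures as event/year columns.
split_event_year <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # A successful match yields 3 elements: full match, event, year.
  matched <- lengths(hits) == 3
  # Rows without a match keep the whole string as event and NA as year.
  event <- x
  year <- rep(NA_character_, length(x))
  event[matched] <- vapply(hits[matched], `[`, character(1), 2)
  year[matched] <- vapply(hits[matched], `[`, character(1), 3)
  data.frame(event = event, year = year, stringsAsFactors = FALSE)
}

# Patterns from the question: the lookahead is optional, so it never
# constrains the greedy ".*", which only gives back the two trailing
# digits that "\\d.*\\d$" needs at minimum.
attempt <- split_event_year(x, "^(.*(?=(?:\\d.*\\d$)?))(\\d.*\\d$)$")
attempt$year
# "NA" "12" "16" "21" (the first value is a real NA)

# Patterns from the answer: the unnamed separator forces the split at
# whitespace that is directly followed by a digit.
fixed <- split_event_year(x, "^(.*)\\s+(?=\\d)(.*$)$")
fixed$year
# NA "12" "2016" "2020/221"
```

The difference is the mandatory separator: once the whitespace before the first digit has to be matched, the greedy event pattern can no longer swallow part of the year.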
