Unexpected warnings with case_when and regex conditions suggest too many cases are matching


Question


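I have a data set with messy date/time formatting: some values are in hh:mm formats and others are Excel serial numbers. So I've coerced everything to strings and I'm using `stringr` and `readr` inside a large `case_when` block to identify the different formats and process each one properly. I think I'm misunderstanding either my `stringr` functions or `case_when`: I get the output I expect, but it throws warnings about parsing failures and `NA` coercion, even though none of those problems show up in the final result.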

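Here are some dummy data with an example of each format in my data set:

```R
library(tidyverse)  # attaches dplyr, stringr, readr and tibble; lubridate too as of tidyverse 2.0.0

dummy <- tibble(x = c("13:15:21", "02:03:17+01:00", "12:03", "0.1234"))
```

I've written a function to identify and parse each of these formats. It calls a second function to convert Excel serial numbers into times. It uses regular expressions that I think are correct; to summarise:

  • `^0\\.` should identify Excel serial numbers by the fact that they are all decimals less than 1
  • `\\+` identifies the times with the DST timezone indicator by searching for the plus sign
  • `.+(?=\\+)` extracts everything before the plus sign so it can be parsed
  • `:` tests for a colon to make sure the value is some kind of time. This is a broader test, so it comes last, after the values with a plus sign have already been matched

```R
convert_time <- function(x){
  # case_when() uses the first condition that matches, so the broad ":" test comes last
  case_when(str_detect(x, "^0\\.")        ~ convert_excel_time(x),
            str_detect(x, "\\+")          ~ parse_time(str_extract(x, ".+(?=\\+)")),
            str_detect(x, ":")            ~ parse_time(x),
            .default = NA)
}

convert_excel_time <- function(x){
  # An Excel serial time is a fraction of a day; multiplying by 86400 gives seconds.
  # The parentheses matter: %>% binds tighter than *, so without them only the last 60 is piped.
  (as.numeric(x) * 24 * 60 * 60) %>%
    as_datetime() %>%
    hms::as_hms()
}
```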
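To check that the patterns behave as described, they can be tried against the dummy values directly. The sketch below uses base R `grepl()` and `regexpr()` as stand-ins for `str_detect()` and `str_extract()` (an assumption made only so that it runs without extra packages; the variable names are illustrative, not part of the original code).

```R
x <- c("13:15:21", "02:03:17+01:00", "12:03", "0.1234")

# One logical vector per condition, in the same order as the case_when() branches
is_excel  <- grepl("^0\\.", x)
has_plus  <- grepl("\\+", x)
has_colon <- grepl(":", x)

# The lookahead (?=...) needs perl = TRUE in base R
m <- regexpr(".+(?=\\+)", x, perl = TRUE)
before_plus <- rep(NA_character_, length(x))
before_plus[m > 0] <- regmatches(x, m)

# Side-by-side view of which value satisfies which condition
data.frame(x, is_excel, has_plus, has_colon, before_plus)
```

Each value satisfies the condition intended for it, and the only overlap is the one already described: the value containing a plus sign also contains colons, which is why the colon test comes last.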

When I run it, I get the expected output, but the warnings that come with it suggest I don't understand what's happening under the hood.

```
> dummy %>%
+   mutate(new = convert_time(x))
# A tibble: 4 × 2
  x              new        
  <chr>          <time>     
1 13:15:21       13:15:21.00
2 02:03:17+01:00 02:03:17.00
3 12:03          12:03:00.00
4 0.1234         02:57:41.76
```

These are the warnings I get:

```
[[1]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning in `convert_excel_time()`:
! NAs introduced by coercion
---
Backtrace:
 1. ├─dummy %>% mutate(new = convert_time(x))
 2. ├─dplyr::mutate(., new = convert_time(x))
 3. └─dplyr:::mutate.data.frame(., new = convert_time(x))

[[2]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning:
! 2 parsing failures.
row col   expected         actual
  2  -- time like  02:03:17+01:00
  4  -- time like  0.1234        
---
Backtrace:
 1. ├─dummy %>% mutate(new = convert_time(x))
 2. ├─dplyr::mutate(., new = convert_time(x))
 3. └─dplyr:::mutate.data.frame(., new = convert_time(x))
```

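It seems to me that `convert_time` shouldn't be trying to parse those two observations at all, since they are excluded by the left-hand side of the `case_when` block. Similarly, I didn't expect `NA` coercion, since the left-hand side of the `case_when` prevents `convert_excel_time()` from seeing the hh:mm strings. Many thanks.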
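One way to narrow down where the first warning comes from is to run the numeric conversion by hand, outside `case_when`. The sketch below uses base R only; `warns()` is a hypothetical helper introduced here for illustration. It compares converting the whole vector with converting only the element that matches the Excel pattern.

```R
x <- c("13:15:21", "02:03:17+01:00", "12:03", "0.1234")

# Hypothetical helper: TRUE if evaluating `expr` raises a warning, FALSE otherwise
warns <- function(expr) {
  tryCatch({ expr; FALSE }, warning = function(w) TRUE)
}

# Converting every string: the hh:mm values cannot be coerced to numbers
whole_vector <- warns(as.numeric(x))

# Converting only the value that matches the Excel pattern
matching_only <- warns(as.numeric(x[grepl("^0\\.", x)]))

c(whole_vector = whole_vector, matching_only = matching_only)
```

If the warning appears only for the whole vector, that suggests the conversion function is being handed every value rather than just the ones selected by the left-hand side.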




# Answer 1
**Score**: 1

Gah! I didn't read all the documentation. The `dplyr` reference (https://dplyr.tidyverse.org/reference/case_when.html) clearly says that `case_when` always evaluates all the RHS expressions on the full input, which is why they throw warnings, and only then keeps the values whose LHS conditions match.

```R
# `case_when()` evaluates all RHS expressions, and then constructs its
# result by extracting the selected (via the LHS expressions) parts.
# In particular `NaN`s are produced in this case:
y <- seq(-2, 2, by = .5)
case_when(
  y >= 0 ~ sqrt(y),
  .default = y
)
#> Warning: NaNs produced
#> [1] -2.0000000 -1.5000000 -1.0000000 -0.5000000  0.0000000  0.7071068
#> [7]  1.0000000  1.2247449  1.4142136
```
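Given that behaviour, one possible way to quiet the warnings in the question is to blank out the non-matching values before each parser sees them, so that every RHS only ever receives strings it can handle or `NA`. This is a sketch under stated assumptions, not the asker's final code: `convert_time_quiet()` and its intermediate variables are hypothetical names, and `hms::hms(seconds = ...)` is swapped in for the original `as_datetime()` pipeline.

```R
library(dplyr)
library(stringr)
library(readr)

# Hypothetical rewrite: each parser receives NA wherever its condition is FALSE
convert_time_quiet <- function(x) {
  is_excel  <- str_detect(x, "^0\\.")
  has_plus  <- str_detect(x, "\\+") & !is_excel
  has_colon <- str_detect(x, ":") & !has_plus & !is_excel

  # Mask the inputs: non-matching elements become NA, which parses to NA silently
  excel_in <- if_else(is_excel, x, NA_character_)
  plus_in  <- if_else(has_plus, str_extract(x, ".+(?=\\+)"), NA_character_)
  colon_in <- if_else(has_colon, x, NA_character_)

  case_when(
    # Excel serial time is a fraction of a day, so scale to seconds
    is_excel  ~ hms::hms(seconds = as.numeric(excel_in) * 24 * 60 * 60),
    has_plus  ~ parse_time(plus_in),
    has_colon ~ parse_time(colon_in),
    .default = NA
  )
}

x <- c("13:15:21", "02:03:17+01:00", "12:03", "0.1234")
new <- convert_time_quiet(x)
new
```

The RHS expressions are still evaluated for every element, as the documentation describes; the masking just means there is nothing left for them to fail on. Wrapping the original call in `suppressWarnings()` would also hide the messages, but it would hide genuine parsing failures too.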




huangapple
  • Posted on 2023-06-12 06:50:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76452811.html