Unexpected warnings with case_when and regex conditions suggest too many cases are matching

Question


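I have a data set with messy date/time formatting: some values are in hh:mm formats and others are Excel serial numbers. So I've coerced everything to strings, and I'm using `stringr` and `readr` within a large `case_when` block to identify the different formats and process them properly. I think I'm misunderstanding either my `stringr` functions or `case_when`, because I'm getting the output I expect, but it throws warnings about parsing failures and `NA` coercion that don't show up in the final product.

Here are some dummy data with an example of each of the formats in my data set:

```R
library(tibble)
dummy <- tibble(x = c("13:15:21", "02:03:17+01:00", "12:03", "0.1234"))
```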

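I've made a function to identify and parse each of these formats. It calls another function to convert Excel serial numbers into times. It uses regular expressions that I think are correct; to summarise:

  • `^0\\.` should identify Excel serial numbers by the fact that they are all decimals less than 1
  • `\\+` identifies the times that carry a DST/time zone indicator by searching for the plus sign
  • `.+(?=\\+)` extracts everything before the plus sign for parsing
  • `:` tests for a colon to make sure the value is some kind of time. This is a broader test, so it comes last, after the pluses have already been matched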
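```R
library(dplyr)
library(stringr)
library(readr)
library(lubridate)

# Choose a parser for each element based on its format.
convert_time <- function(x) {
  case_when(
    str_detect(x, "^0\\.") ~ convert_excel_time(x),
    str_detect(x, "\\+") ~ parse_time(str_extract(x, ".+(?=\\+)")),
    str_detect(x, ":") ~ parse_time(x),
    .default = NA
  )
}

# Convert an Excel serial number (a fraction of a day) to a time of day.
convert_excel_time <- function(x) {
  # Parentheses are needed because %>% binds more tightly than *.
  (as.numeric(x) * 24 * 60 * 60) %>%
    as_datetime() %>%
    hms::as_hms()
}
```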
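The four patterns described above can be checked in isolation against the dummy values. This is a minimal sketch, assuming `stringr` is installed; the variable names `is_excel`, `has_plus`, and `has_colon` are illustrative and do not come from the original post.

```R
library(stringr)

x <- c("13:15:21", "02:03:17+01:00", "12:03", "0.1234")

# Each condition, evaluated on the whole vector
is_excel  <- str_detect(x, "^0\\.")
has_plus  <- str_detect(x, "\\+")
has_colon <- str_detect(x, ":")

is_excel
#> [1] FALSE FALSE FALSE  TRUE
has_plus
#> [1] FALSE  TRUE FALSE FALSE
has_colon
#> [1]  TRUE  TRUE  TRUE FALSE

# The lookahead keeps everything before the plus sign; only the second
# element matches ("02:03:17"), the others come back as NA.
before_plus <- str_extract(x, ".+(?=\\+)")
before_plus
```

The second element satisfies both the plus test and the colon test, so the order of the conditions decides which branch supplies its value.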

When I run it, I get the expected output, but the warnings that come with it suggest I'm not understanding what's happening under the hood.

```
> dummy %>%
+   mutate(new = convert_time(x))
# A tibble: 4 × 2
  x              new
  <chr>          <time>
1 13:15:21       13:15:21.00
2 02:03:17+01:00 02:03:17.00
3 12:03          12:03:00.00
4 0.1234         02:57:41.76
```

These are the warnings I get:

```
[[1]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning in `convert_excel_time()`:
! NAs introduced by coercion
---
Backtrace:
 1. ├─dummy %>% mutate(new = convert_time(x))
 2. ├─dplyr::mutate(., new = convert_time(x))
 3. └─dplyr:::mutate.data.frame(., new = convert_time(x))

[[2]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning:
! 2 parsing failures.
row col   expected actual
  2  -- time like  02:03:17+01:00
  4  -- time like  0.1234
---
Backtrace:
 1. ├─dummy %>% mutate(new = convert_time(x))
 2. ├─dplyr::mutate(., new = convert_time(x))
 3. └─dplyr:::mutate.data.frame(., new = convert_time(x))
```

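It seems to me that `convert_time` shouldn't be trying to parse those two observations at all, since they are excluded by the left-hand side of the `case_when` block. Similarly, I didn't expect `NA` coercion, since the left-hand side of the `case_when` prevents `convert_excel_time()` from seeing the hh:mm strings. Many thanks.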

# Answer 1

**Score**: 1
Gah! I didn't read all the documentation. The `dplyr` reference (https://dplyr.tidyverse.org/reference/case_when.html) clearly says that `case_when` always evaluates all of the RHS expressions, which is why they throw warnings, but only uses the results for the elements that match the LHS conditions.

```R
# `case_when()` evaluates all RHS expressions, and then constructs its
# result by extracting the selected (via the LHS expressions) parts.
# In particular `NaN`s are produced in this case:
y <- seq(-2, 2, by = .5)
case_when(
  y >= 0 ~ sqrt(y),
  .default = y
)
#> Warning: NaNs produced
#> [1] -2.0000000 -1.5000000 -1.0000000 -0.5000000  0.0000000  0.7071068
#> [7]  1.0000000  1.2247449  1.4142136
```
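Since every RHS is evaluated on the full vector, one way to avoid the warnings is to run each parser only on the elements its condition selects. The following is a minimal sketch rather than a definitive fix. It assumes `stringr`, `readr`, and `hms` are installed. The function `convert_time_quiet` and its intermediate names are hypothetical, and the Excel branch assumes the serial numbers are fractions of a day below 1, as in the question.

```R
library(stringr)
library(readr)

# Hypothetical replacement for the case_when() version: each parser only
# sees the elements selected by its own condition.
convert_time_quiet <- function(x) {
  # `%in% TRUE` turns NA matches (from NA inputs) into FALSE
  is_excel <- str_detect(x, "^0\\.") %in% TRUE
  has_plus <- !is_excel & str_detect(x, "\\+") %in% TRUE
  is_plain <- !is_excel & !has_plus & str_detect(x, ":") %in% TRUE

  # Seconds since midnight; NA where no condition matched
  secs <- rep(NA_real_, length(x))

  if (any(is_excel)) {
    # Assumption: serials are fractions of a day (< 1)
    secs[is_excel] <- as.numeric(x[is_excel]) * 24 * 60 * 60
  }
  if (any(has_plus)) {
    before_plus <- str_extract(x[has_plus], ".+(?=\\+)")
    secs[has_plus] <- as.numeric(parse_time(before_plus))
  }
  if (any(is_plain)) {
    secs[is_plain] <- as.numeric(parse_time(x[is_plain]))
  }

  hms::hms(seconds = secs)
}

result <- convert_time_quiet(c("13:15:21", "02:03:17+01:00", "12:03", "0.1234"))
result
```

Inside a pipeline this would be called as `mutate(new = convert_time_quiet(x))`. Wrapping the original RHS expressions in `suppressWarnings()` would also silence the messages, but it would hide genuine parse failures as well, which is why the sketch subsets instead.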

huangapple
  • Published on 2023-06-12 06:50:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76452811.html