Unexpected warnings with case_when and regex conditions suggest too many cases are matching
Question
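I have a data set with messy date/time formatting: some values are in hh:mm formats and others are Excel serial numbers. I've coerced everything to strings, and I'm using `stringr` and `readr` inside a large `case_when` block to identify the different formats and process each one properly. I think I'm misunderstanding either my `stringr` functions or `case_when`: I get the output I expect, but the call throws parsing-failure and `NA`-coercion warnings even though no such failures show up in the final result.
Here are some dummy data with an example of each format in my data set:
```R
library(tibble)
dummy <- tibble(x = c("13:15:21", "02:03:17+01:00", "12:03", "0.1234"))
```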
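I've written a function to identify and parse each of these formats. It calls another function to convert Excel serial numbers into times. It uses regular expressions that I think are correct; to summarise:
- `^0\\.` should identify Excel serial numbers, since they are all decimals less than 1
- `\\+` identifies the times with the DST timezone indicator by searching for the plus sign
- `.+(?=\\+)` extracts everything before the plus sign so it can be parsed
- `:` tests for a colon to make sure the value is some kind of time. This is a broader test, so it comes last, after the plus signs have already been matched
```R
# Requires dplyr, stringr, readr and lubridate to be attached
convert_time <- function(x){
  case_when(str_detect(x, "^0\\.") ~ convert_excel_time(x),
            str_detect(x, "\\+") ~ parse_time(str_extract(x, ".+(?=\\+)")),
            str_detect(x, ":") ~ parse_time(x),
            .default = NA)
}

convert_excel_time <- function(x){
  # Parenthesised because %>% binds more tightly than *
  (as.numeric(x) * 24 * 60 * 60) %>%
    as_datetime() %>%
    hms::as_hms()
}
```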
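As a sanity check (a minimal sketch, not part of the original function), the three patterns can be run against the dummy strings on their own. The variable names here are only illustrative. It shows which rows each left-hand-side condition selects and what the lookahead extracts:

```r
library(stringr)

x <- c("13:15:21", "02:03:17+01:00", "12:03", "0.1234")

# Rows selected by each left-hand-side condition
is_excel  <- str_detect(x, "^0\\.")  # FALSE FALSE FALSE  TRUE
has_plus  <- str_detect(x, "\\+")    # FALSE  TRUE FALSE FALSE
has_colon <- str_detect(x, ":")      #  TRUE  TRUE  TRUE FALSE

# The lookahead keeps everything before the plus sign and
# returns NA for strings that contain no plus sign
before_plus <- str_extract(x, ".+(?=\\+)")  # NA "02:03:17" NA NA
```

So the patterns pick out the intended rows. Row 2 matches both the plus-sign and colon tests, which is why the order of the conditions matters.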
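When I run the function, I get the expected output, but the warnings that come with it suggest I don't understand what is happening under the hood.
```
> dummy %>%
+   mutate(new = convert_time(x))
# A tibble: 4 × 2
  x              new
  <chr>          <time>
1 13:15:21       13:15:21.00
2 02:03:17+01:00 02:03:17.00
3 12:03          12:03:00.00
4 0.1234         02:57:41.76
```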
These are the warnings:
```
[[1]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning in `convert_excel_time()`:
! NAs introduced by coercion
---
Backtrace:
    ▆
 1. ├─dummy %>% mutate(new = convert_time(x))
 2. ├─dplyr::mutate(., new = convert_time(x))
 3. └─dplyr:::mutate.data.frame(., new = convert_time(x))

[[2]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning:
! 2 parsing failures.
row col   expected actual
  2  -- time like  02:03:17+01:00
  4  -- time like  0.1234
---
Backtrace:
    ▆
 1. ├─dummy %>% mutate(new = convert_time(x))
 2. ├─dplyr::mutate(., new = convert_time(x))
 3. └─dplyr:::mutate.data.frame(., new = convert_time(x))
```
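It seems to me that `convert_time` shouldn't be trying to parse those two observations at all, since they are excluded by the left-hand side of the `case_when` block. Similarly, I didn't expect `NA` coercion, since the left-hand side of the `case_when` prevents `convert_excel_time()` from seeing the hh:mm strings. Many thanks.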
# Answer 1
**Score**: 1
Gah! I didn't read all the documentation. The `dplyr` reference (https://dplyr.tidyverse.org/reference/case_when.html) clearly says that `case_when()` always evaluates every RHS expression on the whole input, which is why the warnings are thrown, and only then keeps the results for the elements whose LHS condition matched:
```R
# `case_when()` evaluates all RHS expressions, and then constructs its
# result by extracting the selected (via the LHS expressions) parts.
# In particular `NaN`s are produced in this case:
y <- seq(-2, 2, by = .5)
case_when(
  y >= 0 ~ sqrt(y),
  .default = y
)
#> Warning: NaNs produced
#> [1] -2.0000000 -1.5000000 -1.0000000 -0.5000000  0.0000000  0.7071068
#> [7]  1.0000000  1.2247449  1.4142136
```
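Given that behaviour, one possible way to silence the warnings is to blank out, for each branch, the elements that branch should not see, so every RHS only ever parses values it can handle. This is a sketch rather than anything from the `dplyr` documentation. `only_where()` and `convert_time_quiet()` are hypothetical names introduced here, and the sketch assumes that `as.numeric()` and `parse_time()` pass `NA` inputs through silently:

```r
library(dplyr)
library(stringr)
library(readr)
library(lubridate)

convert_excel_time <- function(x) {
  # Fraction of a day -> seconds -> time of day
  (as.numeric(x) * 24 * 60 * 60) %>%
    as_datetime() %>%
    hms::as_hms()
}

# Hypothetical helper: replace elements a branch should not see with NA
only_where <- function(x, keep) if_else(keep, x, NA_character_)

convert_time_quiet <- function(x) {
  is_excel  <- str_detect(x, "^0\\.")
  has_plus  <- str_detect(x, "\\+")
  has_colon <- str_detect(x, ":")

  case_when(
    # Each RHS is still evaluated on the full-length vector,
    # but the non-matching elements are NA and so parse quietly
    is_excel  ~ convert_excel_time(only_where(x, is_excel)),
    has_plus  ~ parse_time(str_extract(only_where(x, has_plus), ".+(?=\\+)")),
    has_colon ~ parse_time(only_where(x, has_colon & !has_plus)),
    .default = NA
  )
}
```

Called as `dummy %>% mutate(new = convert_time_quiet(x))`, this should give the same times as before. Wrapping the original call in `suppressWarnings()` is a blunter alternative, but it would also hide genuine parsing problems.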