Why does my get_hundred function not work correctly when applied to my dataset in R using dplyr and stringr?
Question
I've been trying to mutate a dataset with a user-defined function that includes calls to `str_locate` and `str_sub`. The aim is to locate and then extract the first digit of a sequence of 3 digits within each string, then add this digit (as a `character`) to a new column called `Hundreds`.
For example:
- Given the string '821', the string '8' is added to `Hundreds`.
- Given the string 'Af823.22', the string '8' is added to `Hundreds`.
Here is my function:
get_hundred <- function(s) {
  match_pos <- str_locate(s, "[0-9]{3}")
  return(str_sub(s, match_pos[1], match_pos[1]))
}
The first 20 rows of my data look like this:
df1 <- structure(list(call.number = c("372.35044 L4383", "344.049 C235",
"344.410415 DIM", "346.944043 NEI", "808.0667 B2616", "363.6909945 CAST",
"ABS 2015.0", "371.38 MACK", "372.1102 PRAW", "A823.3 WRIG/T",
"havmf test", "[DENTISTRY] CROW", "[DENTISTRY] JAWS", "[DENTISTRY] LOWE",
"[DENTISTRY] MOLA", "[DENTISTRY] SERI", "[DENTISTRY] SKUL", "[DENTISTRY] TEET",
"[HEALTH]ANKL", "[HEALTH]FOOT"), num.items = c(1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
## Filtering the data
In fact, I'm only looking for particular forms of string within a large list of `call.number` values. I believe the `str_detect` call below detects the forms of string I want.
df2 <- df1 %>%
  filter(str_detect(call.number, "^[A-Z]?[A-Z|a-z]?[0-9]{3}.*"))
## What am I doing wrong?
Now I do this:
df2 %>%
  mutate(Hundreds = get_hundred(call.number))
However, this puts an 'A' in the `Hundreds` column for row 9, where I expect to see an '8'. Yet if I call `get_hundred` on "A823.3 WRIG/T" (the "equivalent string"), the function does return an '8'.
get_hundred("A823.3 WRIG/T")
What is it I'm not understanding here?
Answer 1
Score: 2
`str_sub` expects the start and end positions as arguments!
See `?str_locate`: `str_locate()` returns an integer matrix with two columns and one row for each element of string. The first column, start, gives the position at the start of the match, and the second column, end, gives the position of the end.
See `?str_sub`: start, end. A pair of integer vectors defining the range of characters to extract (inclusive). Alternatively, instead of a pair of vectors, you can pass a matrix to start. The matrix should have two columns, either labelled start and end, or start and length.
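To make the quoted documentation concrete, here is a minimal sketch of passing the `str_locate` matrix straight to `str_sub`. The sample vector `s` and the variable names `loc` and `whole_match` are made up for illustration. Note that this extracts the whole three-digit match, not just its first digit.

```r
library(stringr)

# Illustrative strings (not the asker's data): two with a 3-digit run, one without
s <- c("821", "Af823.22", "havmf test")

# One row per string; columns are the start and end of the first match
loc <- str_locate(s, "[0-9]{3}")

# Passing the two-column matrix as `start` extracts the full matched range
whole_match <- str_sub(s, loc)
print(whole_match)
```

Strings with no match have `NA` positions, so the corresponding result is `NA` as well.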
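`match_pos[1]` selects only the first element of that matrix (the start position for the first string), and that single position is then reused for every row. `match_pos[, 1]` instead extracts the whole start column returned by `str_locate`, so `str_sub` picks the correct position for each string.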
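The difference between the two ways of indexing can be seen on a two-string example. This is only an illustrative sketch; the vector `s` and the names `first_only`, `all_starts`, `wrong` and `right` are not part of the original code.

```r
library(stringr)

# Two of the call numbers from the question
s <- c("372.35044 L4383", "A823.3 WRIG/T")
match_pos <- str_locate(s, "[0-9]{3}")

# Single-index subsetting treats the matrix as a plain vector, so this is
# just one number: the start position for the first string
first_only <- match_pos[1]

# Row/column subsetting returns the whole start column: one position per string
all_starts <- match_pos[, 1]

# The single position is recycled across every string
wrong <- str_sub(s, first_only, first_only)

# Each string uses its own start position
right <- str_sub(s, all_starts, all_starts)

print(wrong)
print(right)
```

This also explains why calling the original function on a single string appears to work: with one input there is only one row, so `match_pos[1]` happens to be the right start position.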
library(dplyr)
library(stringr)
get_hundred_tarjae <- function(s) {
  match_pos <- str_locate(s, "[0-9]{3}")
  # match_pos[, 1] is the start column: one position per element of s
  return(str_sub(s, match_pos[, 1], match_pos[, 1]))
}
df2 <- df1 %>%
  filter(str_detect(call.number, "^[A-Z]?[A-Z|a-z]?[0-9]{3}.*"))
df2 %>%
  mutate(Hundreds = get_hundred_tarjae(call.number))
A tibble: 9 × 3
  call.number      num.items Hundreds
  <chr>                <dbl> <chr>
1 372.35044 L4383          1 3
2 344.049 C235             1 3
3 344.410415 DIM           1 3
4 346.944043 NEI           1 3
5 808.0667 B2616           1 8
6 363.6909945 CAST         1 3
7 371.38 MACK              1 3
8 372.1102 PRAW            1 3
9 A823.3 WRIG/T            1 8
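As a possible alternative that is not part of the answer above, the same digit could be pulled out in one step with `str_extract` and a regex lookahead, which avoids handling positions at all. The function name `get_hundred_lookahead` is hypothetical, and the sketch assumes the goal is still "the first digit of the first run of three digits".

```r
library(stringr)

# Hypothetical helper: match one digit only when two more digits follow it
get_hundred_lookahead <- function(s) {
  str_extract(s, "[0-9](?=[0-9]{2})")
}

examples <- c("821", "Af823.22", "A823.3 WRIG/T", "havmf test")
result <- get_hundred_lookahead(examples)
print(result)
```

Because `str_extract` is vectorised over its input, this version can be used inside `mutate()` in the same way as the corrected function, and it returns `NA` for strings without a three-digit run.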