May 29, 2023, 14:57:18

Why does my get_hundred function not work correctly when applied to my dataset in R using dplyr and stringr?

Question

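I've been trying to mutate a dataset with a user-defined function that includes calls to `str_locate` and `str_sub`. The aim is to locate and then extract the first digit of a three-digit sequence within each string, then add this digit (as a `character`) to a new column called `Hundreds`.
For example:
- Given the string '821': the string '8' is added to `Hundreds`.
- Given the string 'Af823.22': the string '8' is added to `Hundreds`.
Here is my function: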

get_hundred <- function(s) {
  match_pos <- str_locate(s, "[0-9]{3}")
  return(str_sub(s, match_pos[1], match_pos[1]))
}

The first 20 rows of my data look like this:

df1 <- structure(list(call.number = c("372.35044 L4383", "344.049 C235", 
"344.410415 DIM", "346.944043 NEI", "808.0667 B2616", "363.6909945 CAST", 
"ABS 2015.0", "371.38 MACK", "372.1102 PRAW", "A823.3 WRIG/T", 
"havmf test", "[DENTISTRY] CROW", "[DENTISTRY] JAWS", "[DENTISTRY] LOWE", 
"[DENTISTRY] MOLA", "[DENTISTRY] SERI", "[DENTISTRY] SKUL", "[DENTISTRY] TEET", 
"[HEALTH]ANKL", "[HEALTH]FOOT"), num.items = c(1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

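## Filtering the data
In fact, I'm only looking for particular forms of string within a large list of `call.number` values. I believe the `str_detect` call below detects the forms of string I want.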

df2 <- df1 %>%
  filter(str_detect(call.number, "^[A-Z]?[A-Z|a-z]?[0-9]{3}.*"))

## What am I doing wrong?
Now I do this:

df2 %>%
  mutate(Hundreds = get_hundred(call.number))

Doing this, however, puts an 'A' in the `Hundreds` column for row 9, where I expect to see an '8'. Yet if I call `get_hundred` on "A823.3 WRIG/T" (the "equivalent string"), the function does return an '8'.

get_hundred("A823.3 WRIG/T")

What is it I'm not understanding here?


Answer 1

Score: 2

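`str_sub` expects the start and end positions as arguments!

See `?str_locate`: `str_locate()` returns an integer matrix with two columns and one row for each element of `string`. The first column, `start`, gives the position at the start of the match, and the second column, `end`, gives the position of the end.

See `?str_sub`: `start`, `end`: A pair of integer vectors defining the range of characters to extract (inclusive). Alternatively, instead of a pair of vectors, you can pass a matrix to `start`. The matrix should have two columns, either labelled `start` and `end`, or `start` and `length`.

`match_pos[, 1]` extracts the whole start column from the matrix returned by `str_locate`, so `str_sub` receives the correct position for each string. By contrast, `match_pos[1]` is a single number, the start position for the first string only (here 1). `str_sub` recycles that value across every row, so each row gets its own first character, which is 'A' for "A823.3 WRIG/T".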
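
The difference between the two indexing forms can be checked on a small vector. This is a minimal sketch using three strings from the question's data; the variable names (`first_only`, `all_starts`, `wrong`, `right`) are made up for illustration.

```r
library(stringr)

# Three strings taken from the question's data
s <- c("372.35044 L4383", "808.0667 B2616", "A823.3 WRIG/T")

# str_locate() returns one row per string, with columns "start" and "end"
match_pos <- str_locate(s, "[0-9]{3}")

# match_pos[1] is a single number: the start position for the first string only
first_only <- match_pos[1]

# match_pos[, 1] is the whole start column, one position per string
all_starts <- match_pos[, 1]

# Recycling the single position returns each string's first character
wrong <- str_sub(s, first_only, first_only)
print(wrong)  # "3" "8" "A"

# Using the column returns the first digit of each three-digit run
right <- str_sub(s, all_starts, all_starts)
print(right)  # "3" "8" "8"
```

Calling the original function on a single string works only because, with one row, the first element of the matrix happens to be that string's own start position.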

library(dplyr)
library(stringr)
get_hundred_tarjae <- function(s) {
  match_pos <- str_locate(s, "[0-9]{3}")
  return(str_sub(s, match_pos[, 1], match_pos[, 1]))
}
df2 <- df1 %>%
  filter(str_detect(call.number, "^[A-Z]?[A-Z|a-z]?[0-9]{3}.*"))
df2 %>%
  mutate(Hundreds = get_hundred_tarjae(call.number))
# A tibble: 9 × 3
  call.number      num.items Hundreds
  <chr>                <dbl> <chr>   
1 372.35044 L4383          1 3       
2 344.049 C235             1 3       
3 344.410415 DIM           1 3       
4 346.944043 NEI           1 3       
5 808.0667 B2616           1 8       
6 363.6909945 CAST         1 3       
7 371.38 MACK              1 3       
8 372.1102 PRAW            1 3       
9 A823.3 WRIG/T            1 8
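
As a possible alternative (not part of the answer above), the position matrix can be avoided altogether by extracting the three-digit run first and then taking its first character. This is a sketch under that assumption; `get_hundred_extract` is a hypothetical name.

```r
library(stringr)

get_hundred_extract <- function(s) {
  # str_extract() returns the first three-digit run in each string (NA if none)
  run <- str_extract(s, "[0-9]{3}")
  # Keep only the first character of that run; NA stays NA
  str_sub(run, 1, 1)
}

result <- get_hundred_extract(c("821", "Af823.22", "havmf test"))
print(result)  # "8" "8" NA
```

Because both `str_extract` and `str_sub` are vectorised over their input, this version can be used inside `mutate()` in the same way as `get_hundred_tarjae`.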

