How to retain strings with certain phrases per specific ID?

# Question

I have the following example data that contain file paths.

```R
paths <- c("data/A101/bill1.xlsx", "data/A101/bill1-separated.xlsx", "data/B215/usage2.csv", "data/B215/usage2-separated.xlsx",
           "data/A101/bill2.xls", "data/A101/bill2-separated.xlsx", "data/C145/account1.xlsx", "data/C145/account1-separated.xlsx",
           "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx") %>% sort()
```

In my data folder, there are subfolders named after site identifiers. Each subfolder contains data specific to that site. For example, folder `data/A101/` contains 5 data files: `bill1-separated.xlsx`, `bill1.xlsx`, `bill2.xls`, `bill2-separated.xlsx`, `bill3.xlsx`.

When both a "separated" version and the original copy of a file exist, I want to keep only the path with "separated" in its name. For example, for `bill1-separated.xlsx` and `bill1.xlsx`, I want to keep only the "separated" version. Files that have no "separated" counterpart should be kept as they are.

I want something like this in the end:

```R
desired <- c(NA_character_, "data/A101/bill1-separated.xlsx", NA_character_, "data/B215/usage2-separated.xlsx",
             NA_character_, "data/A101/bill2-separated.xlsx", NA_character_, "data/C145/account1-separated.xlsx",
             "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx") %>% sort()
```

My attempt was to use `ifelse` and `grepl` to capture the correct file paths, but I couldn't get it to work. Any help will be much appreciated.

```R
library(stringr)
library(dplyr)

# attempt
temp <- data.frame(path = paths)
temp %>%
  mutate(
    file = str_extract(path, "\\/.*\\/(.+)", group = 1),
    attempt = ifelse(!grepl("separated", file),
      NA, # nest another ifelse? but not sure what condition/regex to use
      path
    )
  )
```
# Answer 1

**Score**: 2

Since you're ignoring the file extension, I think we can go with this. (I'm inferring `dplyr` from your code.)
```r
library(dplyr)

df %>%
  mutate(
    dir = dirname(path),
    # path without the file extension
    pathnoext = tools::file_path_sans_ext(path),
    # grouping key: the extension-less path with "-separated" removed
    notsep = sub("-separated", "", pathnoext)
  ) %>%
  mutate(
    # within each key, blank out the original when a separated version exists
    res = if_else(any(grepl("-separated", path)) & pathnoext == notsep,
                  path[NA], path),
    .by = notsep
  ) %>%
  select(path, res, desire, everything())
# path res desire dir pathnoext notsep
# 1 data/A101/bill1.xlsx <NA> <NA> data/A101 data/A101/bill1 data/A101/bill1
# 2 data/A101/bill1-separated.xlsx data/A101/bill1-separated.xlsx data/A101/bill1-separated.xlsx data/A101 data/A101/bill1-separated data/A101/bill1
# 3 data/B215/usage2.csv <NA> <NA> data/B215 data/B215/usage2 data/B215/usage2
# 4 data/B215/usage2-separated.xlsx data/B215/usage2-separated.xlsx data/B215/usage2-separated.xlsx data/B215 data/B215/usage2-separated data/B215/usage2
# 5 data/A101/bill2.xls <NA> <NA> data/A101 data/A101/bill2 data/A101/bill2
# 6 data/A101/bill2-separated.xlsx data/A101/bill2-separated.xlsx data/A101/bill2-separated.xlsx data/A101 data/A101/bill2-separated data/A101/bill2
# 7 data/C145/account1.xlsx <NA> <NA> data/C145 data/C145/account1 data/C145/account1
# 8 data/C145/account1-separated.xlsx data/C145/account1-separated.xlsx data/C145/account1-separated.xlsx data/C145 data/C145/account1-separated data/C145/account1
# 9 data/A101/bill3.xlsx data/A101/bill3.xlsx data/A101/bill3.xlsx data/A101 data/A101/bill3 data/A101/bill3
# 10 data/B215/usage1.csv data/B215/usage1.csv data/B215/usage1.csv data/B215 data/B215/usage1 data/B215/usage1
# 11 data/B215/usage3.xls data/B215/usage3.xls data/B215/usage3.xls data/B215 data/B215/usage3 data/B215/usage3
# 12 data/C145/account2.xlsx data/C145/account2.xlsx data/C145/account2.xlsx data/C145 data/C145/account2 data/C145/account2
```
(Extra columns retained only for demonstration.)

The use of `.by=` requires dplyr 1.1.0 or newer.
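For dplyr versions older than 1.1.0, the same per-group logic can presumably be written with `group_by()` and `ungroup()` in place of `.by=`. The sketch below is not part of the original answer: `df_demo` and `res_demo` are hypothetical names, and the input is a three-row subset of the question's paths chosen only for illustration.

```r
library(dplyr)

# Hypothetical three-row subset of the question's paths.
df_demo <- data.frame(
  path = c("data/A101/bill1.xlsx",
           "data/A101/bill1-separated.xlsx",
           "data/A101/bill3.xlsx"),
  stringsAsFactors = FALSE
)

res_demo <- df_demo %>%
  mutate(
    pathnoext = tools::file_path_sans_ext(path),  # drop the extension
    notsep = sub("-separated", "", pathnoext)     # grouping key
  ) %>%
  group_by(notsep) %>%                            # stands in for .by = notsep
  mutate(
    # NA for the original file when its group also has a separated version
    res = if_else(any(grepl("-separated", path)) & pathnoext == notsep,
                  NA_character_, path)
  ) %>%
  ungroup()

res_demo$res
```

Grouped `mutate()` keeps the original row order, so `res` lines up with `path` just as in the `.by=` version.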
Data

```r
df <- structure(list(
  path = c("data/A101/bill1.xlsx", "data/A101/bill1-separated.xlsx",
           "data/B215/usage2.csv", "data/B215/usage2-separated.xlsx",
           "data/A101/bill2.xls", "data/A101/bill2-separated.xlsx",
           "data/C145/account1.xlsx", "data/C145/account1-separated.xlsx",
           "data/A101/bill3.xlsx", "data/B215/usage1.csv",
           "data/B215/usage3.xls", "data/C145/account2.xlsx"),
  desire = c(NA, "data/A101/bill1-separated.xlsx",
             NA, "data/B215/usage2-separated.xlsx",
             NA, "data/A101/bill2-separated.xlsx",
             NA, "data/C145/account1-separated.xlsx",
             "data/A101/bill3.xlsx", "data/B215/usage1.csv",
             "data/B215/usage3.xls", "data/C145/account2.xlsx")
), class = "data.frame", row.names = c(NA, -12L))
```
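If dplyr is not available, the same idea (build a key from the extension-less path with "-separated" removed, then blank out originals whose key also has a separated version) might be sketched in base R as follows. This is an illustrative alternative, not the answerer's code; `paths_demo`, `key`, `is_sep`, `has_sep`, and `res_base` are hypothetical names, and the input is a subset of the question's paths.

```r
# Hypothetical subset of the question's paths.
paths_demo <- c("data/A101/bill1.xlsx",
                "data/A101/bill1-separated.xlsx",
                "data/B215/usage2.csv",
                "data/B215/usage2-separated.xlsx",
                "data/A101/bill3.xlsx")

# Key shared by an original file and its separated version:
# the path without extension and without the "-separated" suffix.
key <- sub("-separated", "", tools::file_path_sans_ext(paths_demo), fixed = TRUE)

# Which paths are themselves separated versions?
is_sep <- grepl("-separated", paths_demo, fixed = TRUE)

# Which paths belong to a key that has a separated version?
has_sep <- key %in% key[is_sep]

# Originals with a separated counterpart become NA; everything else is kept.
res_base <- ifelse(has_sep & !is_sep, NA_character_, paths_demo)
res_base
```

Because the extension is dropped before the key is built, `usage2.csv` and `usage2-separated.xlsx` still end up with the same key even though their extensions differ.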