How to retain strings with certain phrases per specific ID?

Question

I have the following example data containing file paths.

```r
paths <- c("data/A101/bill1.xlsx", "data/A101/bill1-separated.xlsx", "data/B215/usage2.csv", "data/B215/usage2-separated.xlsx",
           "data/A101/bill2.xls", "data/A101/bill2-separated.xlsx", "data/C145/account1.xlsx", "data/C145/account1-separated.xlsx",
           "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx") %>% sort()
```

In my data folder, there are subfolders named after site identifiers. Each subfolder contains data specific to that site. For example, folder `data/A101/` contains 5 data files: `bill1-separated.xlsx`, `bill1.xlsx`, `bill2.xls`, `bill2-separated.xlsx`, `bill3.xlsx`.

I want to keep only the file paths with "separated" in the name, but only when both a separated version and the original copy exist. For example, for `bill1-separated.xlsx` and `bill1.xlsx`, I want to keep only the "separated" version. Files that have no separated counterpart should be kept as they are.

I want something like this in the end:

```r
desired <- c(NA_character_, "data/A101/bill1-separated.xlsx", NA_character_, "data/B215/usage2-separated.xlsx",
             NA_character_, "data/A101/bill2-separated.xlsx", NA_character_, "data/C145/account1-separated.xlsx",
             "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx") %>% sort()
```

My attempt was to use `ifelse` and `grepl` to capture the correct file paths, but I couldn't get it to work. Any help will be much appreciated.

```r
library(stringr)
library(dplyr)

# attempt
temp <- data.frame(path = paths)
temp %>%
  mutate(
    file = str_extract(paths, "\\/.*\\/(.+)", group = 1),
    attempt = ifelse(!grepl("separated", file),
                     NA, # nest another ifelse? but not sure what condition/regex to use
                     path
    )
  )
```
# Answer 1

**Score**: 2

Since you're ignoring the file extension, I think we can go with this. (I'm inferring `dplyr` from your code.)

```r
library(dplyr)
df %>%
  mutate(
    dir = dirname(path),
    # path without the file extension
    pathnoext = tools::file_path_sans_ext(path),
    # grouping key: the same path with "-separated" removed
    notsep = sub("-separated", "", pathnoext)
  ) %>%
  mutate(
    # within each key, blank out the original when a separated version exists
    res = if_else(any(grepl("-separated", path)) & pathnoext == notsep,
                  path[NA], path),
    .by = notsep
  ) %>%
  select(path, res, desire, everything())
#                                 path                               res                            desire       dir                    pathnoext             notsep
# 1               data/A101/bill1.xlsx                              <NA>                              <NA> data/A101              data/A101/bill1    data/A101/bill1
# 2     data/A101/bill1-separated.xlsx    data/A101/bill1-separated.xlsx    data/A101/bill1-separated.xlsx data/A101    data/A101/bill1-separated    data/A101/bill1
# 3               data/B215/usage2.csv                              <NA>                              <NA> data/B215             data/B215/usage2   data/B215/usage2
# 4    data/B215/usage2-separated.xlsx   data/B215/usage2-separated.xlsx   data/B215/usage2-separated.xlsx data/B215   data/B215/usage2-separated   data/B215/usage2
# 5                data/A101/bill2.xls                              <NA>                              <NA> data/A101              data/A101/bill2    data/A101/bill2
# 6     data/A101/bill2-separated.xlsx    data/A101/bill2-separated.xlsx    data/A101/bill2-separated.xlsx data/A101    data/A101/bill2-separated    data/A101/bill2
# 7            data/C145/account1.xlsx                              <NA>                              <NA> data/C145           data/C145/account1 data/C145/account1
# 8  data/C145/account1-separated.xlsx data/C145/account1-separated.xlsx data/C145/account1-separated.xlsx data/C145 data/C145/account1-separated data/C145/account1
# 9               data/A101/bill3.xlsx              data/A101/bill3.xlsx              data/A101/bill3.xlsx data/A101              data/A101/bill3    data/A101/bill3
# 10              data/B215/usage1.csv              data/B215/usage1.csv              data/B215/usage1.csv data/B215             data/B215/usage1   data/B215/usage1
# 11              data/B215/usage3.xls              data/B215/usage3.xls              data/B215/usage3.xls data/B215             data/B215/usage3   data/B215/usage3
# 12           data/C145/account2.xlsx           data/C145/account2.xlsx           data/C145/account2.xlsx data/C145           data/C145/account2 data/C145/account2
```

(Extra columns are retained only for demonstration.)

The use of `.by=` requires dplyr 1.1.0 or newer.

Data

```r
df <- structure(list(
  path = c("data/A101/bill1.xlsx", "data/A101/bill1-separated.xlsx", "data/B215/usage2.csv", "data/B215/usage2-separated.xlsx",
           "data/A101/bill2.xls", "data/A101/bill2-separated.xlsx", "data/C145/account1.xlsx", "data/C145/account1-separated.xlsx",
           "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx"),
  desire = c(NA, "data/A101/bill1-separated.xlsx", NA, "data/B215/usage2-separated.xlsx",
             NA, "data/A101/bill2-separated.xlsx", NA, "data/C145/account1-separated.xlsx",
             "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx")
), class = "data.frame", row.names = c(NA, -12L))
```
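
If only an older dplyr (before 1.1.0) is available, or a plain character vector is enough, the same rule can probably be sketched in base R. This is a rough sketch, not the answer's method: `keep_separated()` is a made-up helper name, and it assumes the `-separated` suffix always sits directly before the file extension (hence the `$` anchor in the patterns, which the `sub()` call in the answer does not use).

```r
# Example paths copied from the question (unsorted, as in the answer's data).
paths <- c(
  "data/A101/bill1.xlsx", "data/A101/bill1-separated.xlsx",
  "data/B215/usage2.csv", "data/B215/usage2-separated.xlsx",
  "data/A101/bill2.xls", "data/A101/bill2-separated.xlsx",
  "data/C145/account1.xlsx", "data/C145/account1-separated.xlsx",
  "data/A101/bill3.xlsx", "data/B215/usage1.csv",
  "data/B215/usage3.xls", "data/C145/account2.xlsx"
)

# Hypothetical helper (not from the answer): returns `paths` with every
# original set to NA when a "-separated" sibling exists in the same folder.
keep_separated <- function(paths) {
  # Path without the extension, e.g. "data/A101/bill1-separated".
  no_ext <- tools::file_path_sans_ext(paths)
  # TRUE for files that are the "-separated" version.
  is_sep <- grepl("-separated$", no_ext)
  # Grouping key: folder plus base name, with the suffix removed.
  key <- sub("-separated$", "", no_ext)
  # An original is dropped when its key also belongs to a separated file.
  drop <- !is_sep & key %in% key[is_sep]
  ifelse(drop, NA_character_, paths)
}

res <- keep_separated(paths)
```

Because the key includes the folder, a `bill1.xlsx` in one site folder is not affected by a `bill1-separated.xlsx` in another. To use it inside a data frame, something like `df$res <- keep_separated(df$path)` should do.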

huangapple
  • Posted on June 8, 2023 at 01:52:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76425900.html