How to retain strings with certain phrases per specific ID?

Question

I have the following example data that contain file paths.

```r
paths <- c("data/A101/bill1.xlsx", "data/A101/bill1-separated.xlsx", "data/B215/usage2.csv", "data/B215/usage2-separated.xlsx",
           "data/A101/bill2.xls", "data/A101/bill2-separated.xlsx", "data/C145/account1.xlsx", "data/C145/account1-separated.xlsx",
           "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx")
```

In my data folder, there are subfolders named after site identifiers. Each subfolder contains data specific to that site. For example, folder `data/A101/` contains 5 data files: `bill1-separated.xlsx`, `bill1.xlsx`, `bill2.xls`, `bill2-separated.xlsx`, `bill3.xlsx`.

When both a "separated" version and the original copy of a file exist, I want to keep only the path with "separated" in the name. For example, for `bill1-separated.xlsx` and `bill1.xlsx`, I want to keep only the "separated" version. Files that have no "separated" counterpart should be kept as they are.

I want something like this in the end:

```r
desired <- c(NA_character_, "data/A101/bill1-separated.xlsx", NA_character_, "data/B215/usage2-separated.xlsx",
             NA_character_, "data/A101/bill2-separated.xlsx", NA_character_, "data/C145/account1-separated.xlsx",
             "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx")
```

My attempt was to use `ifelse` and `grepl` to capture the correct file paths, but I couldn't get it to work. Any help will be much appreciated.

```r
library(stringr)
library(dplyr)

# attempt
temp <- data.frame(path = paths)
temp %>%
  mutate(
    # file name without the leading directories
    file = str_extract(path, "\\/.*\\/(.+)", group = 1),
    attempt = ifelse(!grepl("separated", file),
                     NA, # nest another ifelse? but not sure what condition/regex to use
                     path
                     )
  )
```


# Answer 1
**Score**: 2

Since you're ignoring the file extension, I think we can go with this. (I'm inferring `dplyr` from your code.)

```r
library(dplyr)
df %>%
  mutate(
    dir = dirname(path),
    # path without the file extension
    pathnoext = tools::file_path_sans_ext(path),
    # grouping key: the same path with any "-separated" suffix removed
    notsep = sub("-separated", "", pathnoext)
  ) %>%
  mutate(
    # within each key, blank out the original when a "-separated" sibling exists;
    # path[NA] yields a character NA, so both branches have the same type
    res = if_else(any(grepl("-separated", path)) & pathnoext == notsep, 
                  path[NA], path),
    .by = notsep
  ) %>%
  select(path, res, desire, everything())
#                                 path                               res                            desire       dir                    pathnoext             notsep
# 1               data/A101/bill1.xlsx                              <NA>                              <NA> data/A101              data/A101/bill1    data/A101/bill1
# 2     data/A101/bill1-separated.xlsx    data/A101/bill1-separated.xlsx    data/A101/bill1-separated.xlsx data/A101    data/A101/bill1-separated    data/A101/bill1
# 3               data/B215/usage2.csv                              <NA>                              <NA> data/B215             data/B215/usage2   data/B215/usage2
# 4    data/B215/usage2-separated.xlsx   data/B215/usage2-separated.xlsx   data/B215/usage2-separated.xlsx data/B215   data/B215/usage2-separated   data/B215/usage2
# 5                data/A101/bill2.xls                              <NA>                              <NA> data/A101              data/A101/bill2    data/A101/bill2
# 6     data/A101/bill2-separated.xlsx    data/A101/bill2-separated.xlsx    data/A101/bill2-separated.xlsx data/A101    data/A101/bill2-separated    data/A101/bill2
# 7            data/C145/account1.xlsx                              <NA>                              <NA> data/C145           data/C145/account1 data/C145/account1
# 8  data/C145/account1-separated.xlsx data/C145/account1-separated.xlsx data/C145/account1-separated.xlsx data/C145 data/C145/account1-separated data/C145/account1
# 9               data/A101/bill3.xlsx              data/A101/bill3.xlsx              data/A101/bill3.xlsx data/A101              data/A101/bill3    data/A101/bill3
# 10              data/B215/usage1.csv              data/B215/usage1.csv              data/B215/usage1.csv data/B215             data/B215/usage1   data/B215/usage1
# 11              data/B215/usage3.xls              data/B215/usage3.xls              data/B215/usage3.xls data/B215             data/B215/usage3   data/B215/usage3
# 12           data/C145/account2.xlsx           data/C145/account2.xlsx           data/C145/account2.xlsx data/C145           data/C145/account2 data/C145/account2
```
(The extra columns are retained only for demonstration.)

Using `.by=` requires dplyr 1.1.0 or newer.
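For readers on an older dplyr where `.by=` is unavailable, the same per-group logic can probably be written with `group_by()` and `ungroup()`. The following is only a minimal sketch under that assumption: `df_small` and `res_tbl` are illustrative names that do not appear in the original answer, the data is a three-row subset of the question's paths, and `NA_character_` stands in for `path[NA]`.

```r
library(dplyr)

# Small illustrative subset of the paths from the question (hypothetical name)
df_small <- data.frame(path = c(
  "data/A101/bill1.xlsx",
  "data/A101/bill1-separated.xlsx",
  "data/A101/bill3.xlsx"
))

res_tbl <- df_small %>%
  mutate(
    # path without extension, and the same with any "-separated" suffix removed
    pathnoext = tools::file_path_sans_ext(path),
    notsep = sub("-separated", "", pathnoext, fixed = TRUE)
  ) %>%
  group_by(notsep) %>%   # replaces .by = notsep
  mutate(
    # blank out the original when its group contains a "-separated" file
    res = if_else(any(grepl("-separated", path, fixed = TRUE)) & pathnoext == notsep,
                  NA_character_, path)
  ) %>%
  ungroup()

res_tbl$res
```

Unlike `.by=`, `group_by()` leaves the result grouped, so the explicit `ungroup()` at the end matters if more steps follow.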


Data

```r
df <- structure(list(
  path = c("data/A101/bill1.xlsx", "data/A101/bill1-separated.xlsx", "data/B215/usage2.csv", "data/B215/usage2-separated.xlsx",
           "data/A101/bill2.xls", "data/A101/bill2-separated.xlsx", "data/C145/account1.xlsx", "data/C145/account1-separated.xlsx",
           "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx"),
  desire = c(NA, "data/A101/bill1-separated.xlsx", NA, "data/B215/usage2-separated.xlsx",
             NA, "data/A101/bill2-separated.xlsx", NA, "data/C145/account1-separated.xlsx",
             "data/A101/bill3.xlsx", "data/B215/usage1.csv", "data/B215/usage3.xls", "data/C145/account2.xlsx")
), class = "data.frame", row.names = c(NA, -12L))
```
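If only the resulting vector is needed rather than a data frame, the same pairing rule can also be sketched in base R. This is not part of the original answer: `keep_separated()` and its `suffix` argument are hypothetical names introduced here for illustration, and the sketch assumes the suffix always sits directly before the file extension.

```r
# Illustrative subset of the paths from the question
paths <- c("data/A101/bill1.xlsx", "data/A101/bill1-separated.xlsx",
           "data/B215/usage2.csv", "data/B215/usage2-separated.xlsx",
           "data/A101/bill3.xlsx")

# keep_separated() is a hypothetical helper, not an existing function
keep_separated <- function(paths, suffix = "-separated") {
  noext <- tools::file_path_sans_ext(paths)
  is_sep <- endsWith(noext, suffix)
  # key shared by an original and its separated sibling:
  # the extension-less path with the suffix stripped
  key <- ifelse(is_sep, substr(noext, 1, nchar(noext) - nchar(suffix)), noext)
  # does this key have a separated version anywhere in the input?
  has_sep <- key %in% key[is_sep]
  # originals with a separated sibling become NA; everything else is kept
  ifelse(!is_sep & has_sep, NA_character_, paths)
}

keep_separated(paths)
```

Because the key includes the directory, files with the same name in different site folders are not paired with each other.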

huangapple
  • Posted on June 8, 2023, 01:52:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76425900.html