How can I extract a string from between last dash and second to last dash out of a column that contains lists of strings?


Question

I have some data and I want to make a new column with the string that is between the last dash and the second to last dash. But there is a twist! Some of my observations are "listed", and I want to get each target string out of the list items as well.

Example data here:

    data <- data.frame(
      a = c("1500925OR3-29139-315012",
            "1500925OR3-2-2913A-315012",
            "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")")
    )

looks like:

      a
    1 1500925OR3-29139-315012
    2 1500925OR3-2-2913A-315012
    3 c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012")

I want data that looks like this:

      a_clean
    1 29139
    2 2913A
    3 200B, 2919999

I've been trying to use regex, but I can't figure out how to get the string before the last dash. This grabs the stuff after the last dash: `-[^-]*$`, but obviously that's not right.

Answer 1

Score: 3

Try this regex in sub and use lapply.

    # '\\1' keeps only the capture group: the text between the last two dashes
    dat$b <- lapply(dat$a, \(x) sub('-?.*-(.*)-.*', '\\1', x, perl=TRUE))
    dat
    #                                                     a             b
    # 1                             1500925OR3-29139-315012         29139
    # 2                           1500925OR3-2-2913A-315012         2913A
    # 3 1500925OR3-200B-315012, 1500925OR3-4-2919999-315012 200B, 2919999

You're talking about a "list" column, so I created one assuming that's what your real data looks like.


Data:

    dat <- structure(list(a = list("1500925OR3-29139-315012", "1500925OR3-2-2913A-315012",
        c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012"
        ))), row.names = c(NA, -3L), class = "data.frame")
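
If a plain character column like the a_clean column in the question is preferred over a list column, the per-row results can be collapsed with toString. The following is a minimal base R sketch, assuming list-column data like the above. The helper name second_to_last and the anchored variant of the pattern (using `[^-]*` instead of `.*`) are illustrative choices, not part of the original answer.

```r
# Rebuild the list-column data (same values as in the answer above).
dat <- data.frame(id = 1:3)
dat$a <- list(
  "1500925OR3-29139-315012",
  "1500925OR3-2-2913A-315012",
  c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012")
)

# Hypothetical helper: the greedy leading `.*` consumes as much as it can,
# so the capture group holds the text between the last two dashes.
second_to_last <- function(x) sub("^.*-([^-]*)-[^-]*$", "\\1", x)

# Apply per list element, then collapse each result into one string.
dat$a_clean <- vapply(
  dat$a,
  function(x) toString(second_to_last(x)),
  character(1)
)

dat$a_clean
# "29139" "2913A" "200B, 2919999"
```

Note that sub returns its input unchanged when the pattern does not match, so strings with fewer than two dashes would pass through as-is.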

Answer 2

Score: 2

A tidyverse approach:

    library(dplyr)
    library(tidyr)
    library(stringr)  # provides str_extract()

    data %>%
      mutate(id = row_number()) %>%
      separate_rows(a, sep = "\\s") %>%
      mutate(b = str_extract(a, "(?<=-)[^-]*(?=-[^-]*$)")) %>%
      summarise(a_clean = toString(b), .by = id) %>%
      select(-id)

      a_clean
      <chr>
    1 29139
    2 2913A
    3 200B, 2919999
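
The same lookaround idea can also be tried in base R without splitting rows first. This is only a sketch, assuming the question's character column where the third row holds the literal text `c("...", "...")`. The extended lookahead (allowing a closing quote as well as the end of the string) is my adaptation, not the pattern from the answer above.

```r
# Question's data: row 3 is a single string containing c("...", "...").
data <- data.frame(
  a = c("1500925OR3-29139-315012",
        "1500925OR3-2-2913A-315012",
        "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")"),
  stringsAsFactors = FALSE
)

# (?<=-)               lookbehind: the match starts right after a dash
# [^-]*                the target: a run of non-dash characters
# (?=-[^-"]*(?:"|$))   lookahead: exactly one more dash-free segment follows,
#                      ending at a closing quote or at the end of the string
pattern <- "(?<=-)[^-]*(?=-[^-\"]*(?:\"|$))"

# gregexpr finds every match per string; regmatches extracts them as a list.
hits <- regmatches(data$a, gregexpr(pattern, data$a, perl = TRUE))

# Collapse each row's matches into one comma-separated string.
data$a_clean <- vapply(hits, toString, character(1))

data$a_clean
# "29139" "2913A" "200B, 2919999"
```

perl = TRUE is required here because lookbehind and lookahead are not supported by R's default regex engine.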

Answer 3

Score: 2

Alternatively,

    library(dplyr)
    library(tidyr)
    library(stringr)  # provides str_c()

    data.frame(
      a = c(
        "1500925OR3-29139-315012",
        "1500925OR3-2-2913A-315012",
        "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")"
      ),
      b = c(1:3)
    ) %>%
      separate_rows(a, sep = '\\,') %>%
      separate(a,
               c('col1', 'col2', 'col3', 'col4'),
               sep = '\\-',
               fill = 'left') %>%
      group_by(b) %>%
      summarise(col3 = str_c(col3, collapse = ","))

    # A tibble: 3 × 2
          b col3
      <int> <chr>
    1     1 29139
    2     2 2913A
    3     3 200B,2919999
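
The idea behind this answer, splitting on the dash and counting pieces from the right, can also be sketched in base R with strsplit. The helper name between_last_dashes is hypothetical, and the sketch assumes the strings do not end in a dash (strsplit drops a trailing empty piece, which would shift the count).

```r
# Hypothetical helper: split each string on "-" and keep the second-to-last piece.
between_last_dashes <- function(x) {
  pieces <- strsplit(x, "-", fixed = TRUE)
  vapply(
    pieces,
    function(p) {
      # Fewer than two pieces means there is no "between the last two dashes".
      if (length(p) < 2) NA_character_ else p[length(p) - 1]
    },
    character(1)
  )
}

x <- c("1500925OR3-29139-315012",
       "1500925OR3-2-2913A-315012",
       "1500925OR3-200B-315012",
       "1500925OR3-4-2919999-315012")

between_last_dashes(x)
# "29139" "2913A" "200B" "2919999"
```

Counting from the right plays the same role as fill = 'left' in the tidyr version: it lines up strings with three and four dash-separated pieces on their last two fields.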

huangapple
  • Posted on June 26, 2023, at 00:56:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76551519.html