How can I extract a string from between last dash and second to last dash out of a column that contains lists of strings?
Question
I have some data and I want to make a new column with the string that is between the last dash and the second to last dash. But there is a twist! Some of my observations are "listed", and I want to get each target string out of the list items as well.
Example data here:
data <- data.frame(
  a = c("1500925OR3-29139-315012",
        "1500925OR3-2-2913A-315012",
        "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")")
)
looks like:
a
1 1500925OR3-29139-315012
2 1500925OR3-2-2913A-315012
3 c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012")
I want data that looks like this:
a_clean
1 29139
2 2913A
3 200B, 2919999
I've been working on using regex, but I can't figure out how to get the string before the last dash. This grabs the stuff after the last dash: `-[^-]*$`, but obviously that's not right.
Answer 1
Score: 3
Try this regex in `sub` and use `lapply`.
dat$b <- lapply(dat$a, \(x) sub('-?.*-(.*)-.*', '\\1', x, perl=TRUE))
dat
#                                                     a             b
# 1                             1500925OR3-29139-315012         29139
# 2                           1500925OR3-2-2913A-315012         2913A
# 3 1500925OR3-200B-315012, 1500925OR3-4-2919999-315012 200B, 2919999
You're talking about a "list" column, so I created one assuming that's what your real data looks like.
Data:
dat <- structure(list(a = list("1500925OR3-29139-315012", "1500925OR3-2-2913A-315012",
c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012"
))), row.names = c(NA, -3L), class = "data.frame")
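If a plain character column like the `a_clean` in the question is wanted instead of a list column, the per-row results can be collapsed with `toString`. This is a minimal base R sketch under the same list-column assumption as above. The column name `a_clean` and the slightly stricter pattern `.*-([^-]*)-[^-]*$` are choices made for this illustration, not part of the answer.

```r
# Same list-column data as assumed in the answer above.
dat <- structure(
  list(a = list(
    "1500925OR3-29139-315012",
    "1500925OR3-2-2913A-315012",
    c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012")
  )),
  row.names = c(NA, -3L),
  class = "data.frame"
)

# Capture the segment between the second-to-last and the last dash.
# [^-]* cannot cross a dash, so the capture group is exactly one segment.
pattern <- ".*-([^-]*)-[^-]*$"

# sub() is vectorised over each list element; toString() joins the
# results of a multi-item element with ", ".
dat$a_clean <- vapply(
  dat$a,
  function(x) toString(sub(pattern, "\\1", x)),
  character(1)
)

print(dat$a_clean)
```

Using `vapply` with `character(1)` makes the function fail loudly if any row does not reduce to a single string.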
Answer 2
Score: 2
A tidyverse approach:
library(dplyr)
library(tidyr)
library(stringr)  # provides str_extract()

data %>%
  mutate(id = row_number()) %>%              # remember the original row
  separate_rows(a, sep = "\\s") %>%          # one list item per row
  mutate(b = str_extract(a, "(?<=-)[^-]*(?=-[^-]*$)")) %>%
  summarise(a_clean = toString(b), .by = id) %>%
  select(-id)
a_clean
<chr>
1 29139
2 2913A
3 200B, 2919999
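The lookaround pattern used above can also be tried without stringr. The sketch below is a base R illustration, assuming plain strings with at least two dashes each; the vector `x` is made up from the question's examples. `(?<=-)` requires a dash just before the match, and `(?=-[^-]*$)` requires that what follows is the last dash and the final segment.

```r
# Example strings taken from the question.
x <- c(
  "1500925OR3-29139-315012",
  "1500925OR3-2-2913A-315012",
  "1500925OR3-4-2919999-315012"
)

# Lookbehind and lookahead need the PCRE engine, hence perl = TRUE.
pattern <- "(?<=-)[^-]*(?=-[^-]*$)"

# regexpr() finds the first match per string; regmatches() extracts it.
middle <- regmatches(x, regexpr(pattern, x, perl = TRUE))

print(middle)
```

Note that `regmatches` silently drops strings with no match, so the result can be shorter than the input if some value has fewer than two dashes.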
Answer 3
Score: 2
Alternatively,
library(dplyr)
library(tidyr)
library(stringr)  # provides str_c()

data.frame(
  a = c(
    "1500925OR3-29139-315012",
    "1500925OR3-2-2913A-315012",
    "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")"
  ),
  b = c(1:3)
) %>%
  separate_rows(a, sep = '\\,') %>%          # one list item per row
  separate(a,
           c('col1', 'col2', 'col3', 'col4'),
           sep = '\\-',
           fill = 'left') %>%                # pad on the left so col3 is always second to last
  group_by(b) %>%
  summarise(col3 = str_c(col3, collapse = ","))
# A tibble: 3 × 2
b col3
<int> <chr>
1 1 29139
2 2 2913A
3 3 200B,2919999
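The same split-and-count-from-the-right idea can be sketched in base R for the string-encoded data from the question. This is an illustration under an assumption: that every ID consists only of letters and digits joined by dashes, which is what the character class below encodes. The helper name `second_to_last` is hypothetical.

```r
# Data exactly as in the question: row 3 is a string that looks like c(...).
a <- c(
  "1500925OR3-29139-315012",
  "1500925OR3-2-2913A-315012",
  "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")"
)

# Pull every dash-joined ID out of each string (assumed alphanumeric parts).
ids <- regmatches(a, gregexpr("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+", a, perl = TRUE))

# Hypothetical helper: split each ID on "-" and keep the second-to-last piece.
second_to_last <- function(id) {
  parts <- strsplit(id, "-", fixed = TRUE)
  vapply(parts, function(p) p[length(p) - 1L], character(1))
}

# Collapse the pieces of each row into one comma-separated string.
a_clean <- vapply(ids, function(v) toString(second_to_last(v)), character(1))

print(a_clean)
```

Indexing from the end of each split avoids having to know in advance how many dash-separated pieces an ID has.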