Extract leading numbers from string, but length varies R
Question
I have a column of character strings containing letters and numbers. Each string always starts with one or two digits, followed by several other characters. I am trying to split the string at the point where the first non-digit character appears.
have <-
tribble(
~string,
'12main',
'6six',
'42go',
'5to9'
)
want <-
tribble(
~prefix, ~rest,
'12', 'main',
'6', 'six',
'42', 'go',
'5', 'to9'
)
I'm sure there is a regex solution using `separate`, but I'm having trouble getting it to work.
want <-
have %>%
separate(string,
into = c('prefix', 'rest'),
sep = "(?=[0-9])(?<=[a-zA-Z])")
Answer 1
Score: 2
You were close. We can achieve it with one look-behind (for a digit) and one look-ahead (for a non-digit):
have %>%
separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
# # A tibble: 4 × 2
# prefix rest
# <chr> <chr>
# 1 12 main
# 2 6 six
# 3 42 go
# 4 5 to9
I think you had the look-arounds reversed: `?<=` is for the preceding string (and should be used with `[0-9]`), and `?=` is for the following string (and should be used with `[^0-9]` or `[A-Za-z]`).
Personally I find this a bit intriguing: we are splitting strings on a zero-length pattern. Nothing lies between the preceding digit and the following non-digit, so the match itself consumes no characters.
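To make the zero-length match visible, here is a minimal base-R sketch (not part of the original answer) that asks `regexpr()` where the pattern first matches and then cuts each string at that position with `substr()`. The names `x`, `pos`, `match_len`, `prefix`, and `rest` are illustrative only.

```r
x <- c("12main", "6six", "42go", "5to9", "5to9to5")

# First position where a digit is followed by a non-digit.
# perl = TRUE is needed for look-behind/look-ahead syntax.
pos <- regexpr("(?<=[0-9])(?=[^0-9])", x, perl = TRUE)

# The match consumes no characters, so its length is 0 for every string.
match_len <- attr(pos, "match.length")

# Cut each string at the boundary. Only the first boundary is used,
# so "5to9to5" keeps everything after the leading digit in `rest`.
prefix <- substr(x, 1, pos - 1)
rest   <- substr(x, pos, nchar(x))

data.frame(prefix, rest)
```

Because only the first match is used, this variant does not care how many digit-to-letter boundaries occur later in the string.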
FYI, this does produce a warning if a string contains two such places, such as `5to9to5`:
have <- structure(list(string = c("12main", "6six", "42go", "5to9", "5to9to5")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
have
# # A tibble: 5 × 1
# string
# <chr>
# 1 12main
# 2 6six
# 3 42go
# 4 5to9
# 5 5to9to5
have %>%
separate(string, sep = "(?<=[0-9])(?=[^0-9])", into = c("prefix", "rest"))
# Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [5].
# # A tibble: 5 × 2
# prefix rest
# <chr> <chr>
# 1 12 main
# 2 6 six
# 3 42 go
# 4 5 to9
# 5 5 to9
The warning tells you that some information is being discarded; it is up to you whether you want or need to guard against this.
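One possible guard, sketched here as a suggestion rather than something from the original answer, is the `extra = "merge"` argument of `tidyr::separate()`, which limits the number of splits to the number of target columns and keeps the remainder in the last column. The data below simply repeats the five example strings.

```r
library(tidyr)
library(tibble)

have <- tibble(string = c("12main", "6six", "42go", "5to9", "5to9to5"))

# extra = "merge" stops splitting once `into` is filled, so later
# digit-to-letter boundaries stay inside `rest` instead of being dropped.
out <- separate(
  have, string,
  into  = c("prefix", "rest"),
  sep   = "(?<=[0-9])(?=[^0-9])",
  extra = "merge"
)

out
```

With this setting the fifth row should keep `to9to5` in `rest`, and the "additional pieces discarded" warning should no longer appear.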
An alternative, since you have `5to9to5` in your real data:
have %>%
mutate(strcapture("^([0-9]+)([^0-9].*)", string, list(prefix="", rest="")))
# # A tibble: 5 × 3
# string prefix rest
# <chr> <chr> <chr>
# 1 12main 12 main
# 2 6six 6 six
# 3 42go 42 go
# 4 5to9 5 to9
# 5 5to9to5 5 to9to5
where you can now remove `string` if you want.
Another note: if you intend to convert `prefix` into an integer or a number, you can skip that extra step by using `list(prefix=0L, rest="")` (or just `=0`) instead. That is the `proto=` argument: its data is discarded, but its names and classes are used for the resulting columns.
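As a small illustration of the typed prototype, the sketch below calls `strcapture()` on a plain character vector with base R only. It uses a zero-row `data.frame` as the prototype, which is an assumption made for this example in place of the `list(...)` form shown above; the names `x`, `proto`, and `out` are illustrative.

```r
x <- c("12main", "6six", "42go", "5to9", "5to9to5")

# The prototype only supplies column names and classes; it holds no data.
# stringsAsFactors = FALSE keeps `rest` as character on older R versions.
proto <- data.frame(prefix = integer(), rest = character(),
                    stringsAsFactors = FALSE)

# Leading digits go to `prefix`, everything from the first non-digit to `rest`.
out <- strcapture("^([0-9]+)([^0-9].*)", x, proto)

str(out)
```

Here `prefix` comes back as an integer column directly, so no separate `as.integer()` call is needed afterwards.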
Answer 2
Score: 1
You can also use `extract`:
have %>%
extract(string, c('prefix', 'rest'), "(\\d+)(.*)")
# A tibble: 4 × 2
prefix rest
<chr> <chr>
1 12 main
2 6 six
3 42 go
4 5 to9
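A slightly extended sketch of the same idea, offered as an assumption-laden variant rather than the answerer's code: the regex is anchored with `^` and `$`, and `convert = TRUE` asks `tidyr::extract()` to type-convert the new columns so that `prefix` becomes an integer. The input repeats the example strings plus `5to9to5`.

```r
library(tidyr)
library(tibble)

have <- tibble(string = c("12main", "6six", "42go", "5to9", "5to9to5"))

# Group 1 captures the leading digits, group 2 captures the remainder.
# convert = TRUE runs type conversion on the extracted columns.
out <- extract(
  have, string,
  into    = c("prefix", "rest"),
  regex   = "^(\\d+)(.*)$",
  convert = TRUE
)

out
```

Since the pattern captures the whole remainder in one group, strings with several digit-to-letter boundaries need no special handling.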