英文:
Split a column by regex and keep only matches in pandas
问题
我需要根据我的正则表达式匹配每一列。
考虑下面的数据框:
Letter Actions
r1 a95,a20
r2 a30,a12-rf,a15,a75
r3 0
r4 a15,a74
r5 a11
我期望的是:
Letter Action1 Action2 Action3 Action4
r1 a95 a20
r2 a30 a12 a15 a75
r3
r4 a15 a74
r5 a11
我已经尝试了下面的方法,但它保留了所有其他内容,而我只需要模式匹配的部分:
import pandas as pd
df = pd.DataFrame(
[['r1', 'a95,a20'],
['r2', 'a30,a12-rf,a15,a75'],
['r3', '0'],
['r4', 'a15,a74'],
['r5', 'a11']],
columns=['Letter', 'Actions'])
Actions = df['Actions'].str.split(r'([a]\d{2})', expand=True)
英文:
I need to have a column per my regex match.
Consider below dataframe:
Letter Actions
r1 a95,a20
r2 a30,a12-rf,a15,a75
r3 0
r4 a15,a74
r5 a11
I expect:
Letter Action1 Action2 Action3 Action4
r1 a95 a20
r2 a30 a12 a15 a75
r3
r4 a15 a74
r5 a11
I have below but it keeps every thing else as well while I just need the pattern matches:
import pandas as pd
df = pd.DataFrame(
[['r1', 'a95,a20'],
['r2', 'a30,a12-rf,a15,a75'],
['r3', '0'],
['r4', 'a15,a74'],
['r5', 'a11']],
columns=['Letter', 'Actions'])
Actions = df['Actions'].str.split(r'([a]\d{2})', expand=True)
答案1
得分: 1
以下是使用 str.extractall()
的选项:
(df.set_index('Letter')['Actions']
.str.extractall(r'([a]\d{2})')[0]
.unstack()
.rename(lambda x: 'Action{}'.format(x+1), axis=1)
.reset_index()
.rename_axis(None, axis=1))
另一种可能性是使用 str.split()
:
df.set_index('Letter')['Actions'].str.split(r'(?:\D+)?,', expand=True)
输出:
Letter Action1 Action2 Action3 Action4
0 r1 a95 a20 NaN NaN
1 r2 a30 a12 a15 a75
2 r4 a15 a74 NaN NaN
3 r5 a11 NaN NaN NaN
英文:
Here is an option using str.extractall()
(df.set_index('Letter')['Actions']
.str.extractall(r'([a]\d{2})')[0]
.unstack()
.rename(lambda x: 'Action{}'.format(x+1),axis=1)
.reset_index()
.rename_axis(None,axis=1))
Another possibility is using str.split()
df.set_index('Letter')['Actions'].str.split(r'(?:\D+)?,',expand=True)
Output:
Letter Action1 Action2 Action3 Action4
0 r1 a95 a20 NaN NaN
1 r2 a30 a12 a15 a75
2 r4 a15 a74 NaN NaN
3 r5 a11 NaN NaN NaN
答案2
得分: 0
你错误地使用了 split
函数。https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
可能你想要做的是类似这样的操作,尽管你仍然保留了 -rf
部分:
split(',', expand=True)
0 1 2 3
0 a95 a20 None None
1 a30 a12-rf a15 a75
2 0 None None None
3 a15 a74 None None
4 a11 None None None
此外,你的正则表达式存在一些常见的初学者问题,所以建议你阅读关于 Python 中正则表达式的文档。https://docs.python.org/3/howto/regex.html。特别是你的 [a]
在这里几乎没有意义,因为它与普通的 a
是相同的。
如果你知道可以分割的最大元素数量,那么你可以使用正则表达式和 extract
函数来实现你想要的效果。https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
类似这样的操作:
extract(r'(a\d\d)(?:-[^,]+)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?')
0 1 2 3
0 a95 a20 NaN NaN
1 a30 a12 a15 a75
2 NaN NaN NaN NaN
3 a15 a74 NaN NaN
4 a11 NaN NaN NaN
关于匹配类似 a12-rf
并仅保留 a12
部分,你需要使用正则表达式 (a\d\d)(?:-[^,]+)?
来捕获 a12
,其中 (a\d\d)
用于捕获 a12
,然后匹配但不保留 -rf
部分,你可以使用非捕获组 ?:
修饰符,然后使用 -[^,]+
匹配 -rf
,其中 [^,]+
匹配一个或多个非逗号字符。
这就是我得到 (a\d\d)(?:-[^,]+)?
的方式。对于字符串中的其他部分元素,你可以将这个模式包装在自己的非捕获组中以处理其可选性,并添加逗号进行匹配。 (?:,<prev_pattern>)?
。然后你可以根据需要复制粘贴这个 ,<pattern>
匹配多次。
如果你不知道可能会匹配多少这样的元素,那么你可以使用 extractall
,然后将其使用的 MultiIndex
进行逆转。这听起来有点复杂(我必须承认,在我的脑海中操作这个有点痛苦),但在实践中非常简单。
extractall(r'(a\d\d)(?:-[^,]+)?').unstack()[0]
match 0 1 2 3
0 a95 a20 NaN NaN
1 a30 a12 a15 a75
3 a15 a74 NaN NaN
4 a11 NaN NaN NaN
英文:
You've used split
wrong. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
Something like this is probably what you want to do, although you will still keep the -rf
parts:
split(',', expand=True)
0 1 2 3
0 a95 a20 None None
1 a30 a12-rf a15 a75
2 0 None None None
3 a15 a74 None None
4 a11 None None None
Also, your regex is hitting some common beginner problems so it would be a good idea to read up on regex in python. https://docs.python.org/3/howto/regex.html. In particular your [a]
is quite meaningless since it is the same as a plain a
.
If you know the maximum number of elements you can split to, then you can use a regex and extract
which also does what you want. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
Something like
extract(r'(a\d\d)(?:-[^,]+)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?')
0 1 2 3
0 a95 a20 NaN NaN
1 a30 a12 a15 a75
2 NaN NaN NaN NaN
3 a15 a74 NaN NaN
4 a11 NaN NaN NaN
In terms of matching something like a12-rf
and only keeping the a12
part, you need a regex like (a\d\d)(?:-[^,]+)?
to capture a12
with (a\d\d)
, and then match but not keep -rf
you use the non-capture group ?:
modifier, and then match -rf
with -[^,]+
where [^,]+
matches 1-or-more non-,
characters.
This is how I got (a\d\d)(?:-[^,]+)?
. For the other parts elements in your string, you can wrap this pattern in its own non-capture group to handle the being optional and adding a comma to match. (?:,<prev_pattern>)?
. And then you copy-paste that ,<pattern>
match as many times as you need.
If you do not know how many of these elements you could hit, then you can use extractall
and then pivot the MultiIndex
it uses. This sounds complicated (and I have to admit that doing it in my head is a bit painful) but is quite simple in practice.
extractall(r'(a\d\d)(?:-[^,]+)?').unstack()[0]
match 0 1 2 3
0 a95 a20 NaN NaN
1 a30 a12 a15 a75
3 a15 a74 NaN NaN
4 a11 NaN NaN NaN
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论