在pandas中按正则表达式拆分列并保留匹配项:

huangapple go评论59阅读模式
英文:

Split a column by regex and keep only matches in pandas

问题

我需要根据我的正则表达式匹配每一列。

考虑下面的数据框:

Letter      Actions
r1          a95,a20
r2          a30,a12-rf,a15,a75
r3          0
r4          a15,a74
r5          a11

我期望的是:

Letter      Action1      Action2      Action3      Action4
r1          a95          a20          
r2          a30          a12          a15          a75
r3          
r4          a15          a74
r5          a11

我已经尝试了下面的方法,但它保留了所有其他内容,而我只需要模式匹配的部分:

import pandas as pd

df = pd.DataFrame(
    [['r1', 'a95,a20'],
     ['r2', 'a30,a12-rf,a15,a75'],
     ['r3', '0'],
     ['r4', 'a15,a74'],
     ['r5', 'a11']],
    columns=['Letter', 'Actions'])

Actions = df['Actions'].str.split(r'([a]\d{2})', expand=True)
英文:

I need to have a column per my regex match.

Consider below dataframe:

Letter      Actions
r1          a95,a20
r2          a30,a12-rf,a15,a75
r3          0
r4          a15,a74
r5          a11

I expect:

Letter      Action1      Action2      Action3      Action4
r1          a95          a20          
r2          a30          a12          a15          a75
r3          
r4          a15          a74
r5          a11

I have below but it keeps every thing else as well while I just need the pattern matches:

import pandas as pd

df = pd.DataFrame(
    [['r1', 'a95,a20'],
     ['r2', 'a30,a12-rf,a15,a75'],
     ['r3', '0'],
     ['r4', 'a15,a74'],
     ['r5', 'a11']],
    columns=['Letter', 'Actions'])

Actions = df['Actions'].str.split(r'([a]\d{2})', expand=True)

答案1

得分: 1

以下是使用 str.extractall() 的选项:

(df.set_index('Letter')['Actions']
.str.extractall(r'([a]\d{2})')[0]
.unstack()
.rename(lambda x: 'Action{}'.format(x+1), axis=1)
.reset_index()
.rename_axis(None, axis=1))

另一种可能性是使用 str.split()

df.set_index('Letter')['Actions'].str.split(r'(?:\D+)?,', expand=True)

输出:

  Letter Action1 Action2 Action3 Action4
0     r1     a95     a20    NaN     NaN
1     r2     a30     a12    a15     a75
2     r4     a15     a74    NaN     NaN
3     r5     a11     NaN    NaN     NaN
英文:

Here is an option using str.extractall()

(df.set_index('Letter')['Actions']
.str.extractall(r'([a]\d{2})')[0]
.unstack()
.rename(lambda x: 'Action{}'.format(x+1),axis=1)
.reset_index()
.rename_axis(None,axis=1))

Another possibility is using str.split()

df.set_index('Letter')['Actions'].str.split(r'(?:\D+)?,',expand=True)

Output:

  Letter Action1 Action2 Action3 Action4
0     r1     a95     a20     NaN     NaN
1     r2     a30     a12     a15     a75
2     r4     a15     a74     NaN     NaN
3     r5     a11     NaN     NaN     NaN

答案2

得分: 0

你错误地使用了 split 函数。https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html

可能你想要做的是类似这样的操作,尽管你仍然保留了 -rf 部分:

split(',', expand=True)

     0       1     2     3
0  a95     a20  None  None
1  a30  a12-rf   a15   a75
2    0    None  None  None
3  a15     a74  None  None
4  a11    None  None  None

此外,你的正则表达式存在一些常见的初学者问题,所以建议你阅读关于 Python 中正则表达式的文档。https://docs.python.org/3/howto/regex.html。特别是你的 [a] 在这里几乎没有意义,因为它与普通的 a 是相同的。

如果你知道可以分割的最大元素数量,那么你可以使用正则表达式和 extract 函数来实现你想要的效果。https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

类似这样的操作:

extract(r'(a\d\d)(?:-[^,]+)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?')

     0    1    2    3
0  a95  a20  NaN  NaN
1  a30  a12  a15  a75
2  NaN  NaN  NaN  NaN
3  a15  a74  NaN  NaN
4  a11  NaN  NaN  NaN

关于匹配类似 a12-rf 并仅保留 a12 部分,你需要使用正则表达式 (a\d\d)(?:-[^,]+)? 来捕获 a12,其中 (a\d\d) 用于捕获 a12,然后匹配但不保留 -rf 部分,你可以使用非捕获组 ?: 修饰符,然后使用 -[^,]+ 匹配 -rf,其中 [^,]+ 匹配一个或多个非逗号字符。

这就是我得到 (a\d\d)(?:-[^,]+)? 的方式。对于字符串中的其他部分元素,你可以将这个模式包装在自己的非捕获组中以处理其可选性,并添加逗号进行匹配。 (?:,<prev_pattern>)?。然后你可以根据需要复制粘贴这个 ,<pattern> 匹配多次。

如果你不知道可能会匹配多少这样的元素,那么你可以使用 extractall,然后将其使用的 MultiIndex 进行逆转。这听起来有点复杂(我必须承认,在我的脑海中操作这个有点痛苦),但在实践中非常简单。

extractall(r'(a\d\d)(?:-[^,]+)?').unstack()[0]

match    0    1    2    3
0      a95  a20  NaN  NaN
1      a30  a12  a15  a75
3      a15  a74  NaN  NaN
4      a11  NaN  NaN  NaN
英文:

You've used split wrong. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html

Something like this is probably what you want to do, although you will still keep the -rf parts:

split(&#39;,&#39;, expand=True)

     0       1     2     3
0  a95     a20  None  None
1  a30  a12-rf   a15   a75
2    0    None  None  None
3  a15     a74  None  None
4  a11    None  None  None

Also, your regex is hitting some common beginner problems so it would be a good idea to read up on regex in python. https://docs.python.org/3/howto/regex.html. In particular your [a] is quite meaningless since it is the same as a plain a.

If you know the maximum number of elements you can split to, then you can use a regex and extract which also does what you want. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

Something like

extract(r&#39;(a\d\d)(?:-[^,]+)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?&#39;)

     0    1    2    3
0  a95  a20  NaN  NaN
1  a30  a12  a15  a75
2  NaN  NaN  NaN  NaN
3  a15  a74  NaN  NaN
4  a11  NaN  NaN  NaN

In terms of matching something like a12-rf and only keeping the a12 part, you need a regex like (a\d\d)(?:-[^,]+)? to capture a12 with (a\d\d), and then match but not keep -rf you use the non-capture group ?: modifier, and then match -rf with -[^,]+ where [^,]+ matches 1-or-more non-, characters.

This is how I got (a\d\d)(?:-[^,]+)?. For the other parts elements in your string, you can wrap this pattern in its own non-capture group to handle the being optional and adding a comma to match. (?:,&lt;prev_pattern&gt;)?. And then you copy-paste that ,&lt;pattern&gt; match as many times as you need.

If you do not know how many of these elements you could hit, then you can use extractall and then pivot the MultiIndex it uses. This sounds complicated (and I have to admit that doing it in my head is a bit painful) but is quite simple in practice.

extractall(r&#39;(a\d\d)(?:-[^,]+)?&#39;).unstack()[0]

match    0    1    2    3
0      a95  a20  NaN  NaN
1      a30  a12  a15  a75
3      a15  a74  NaN  NaN
4      a11  NaN  NaN  NaN

huangapple
  • 本文由 发表于 2023年6月19日 06:57:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/76502816.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定