2023年6月19日 06:57:54go评论59阅读模式

英文:

Split a column by regex and keep only matches in pandas

问题

我需要根据我的正则表达式匹配每一列。

考虑下面的数据框：

Letter      Actions
r1          a95,a20
r2          a30,a12-rf,a15,a75
r3          0
r4          a15,a74
r5          a11

我期望的是：

Letter      Action1      Action2      Action3      Action4
r1          a95          a20          
r2          a30          a12          a15          a75
r3          
r4          a15          a74
r5          a11

我已经尝试了下面的方法，但它保留了所有其他内容，而我只需要模式匹配的部分：

import pandas as pd

df = pd.DataFrame(
    [['r1', 'a95,a20'],
     ['r2', 'a30,a12-rf,a15,a75'],
     ['r3', '0'],
     ['r4', 'a15,a74'],
     ['r5', 'a11']],
    columns=['Letter', 'Actions'])

Actions = df['Actions'].str.split(r'([a]\d{2})', expand=True)

英文:

I need to have a column per my regex match.

Consider below dataframe:

Letter      Actions
r1          a95,a20
r2          a30,a12-rf,a15,a75
r3          0
r4          a15,a74
r5          a11

I expect:

Letter      Action1      Action2      Action3      Action4
r1          a95          a20          
r2          a30          a12          a15          a75
r3          
r4          a15          a74
r5          a11

I have below but it keeps every thing else as well while I just need the pattern matches:

import pandas as pd

df = pd.DataFrame(
    [[&#39;r1&#39;, &#39;a95,a20&#39;],
     [&#39;r2&#39;, &#39;a30,a12-rf,a15,a75&#39;],
     [&#39;r3&#39;, &#39;0&#39;],
     [&#39;r4&#39;, &#39;a15,a74&#39;],
     [&#39;r5&#39;, &#39;a11&#39;]],
    columns=[&#39;Letter&#39;, &#39;Actions&#39;])

Actions = df[&#39;Actions&#39;].str.split(r&#39;([a]\d{2})&#39;, expand=True)

答案1

得分: 1

以下是使用 str.extractall() 的选项：

(df.set_index('Letter')['Actions']
.str.extractall(r'([a]\d{2})')[0]
.unstack()
.rename(lambda x: 'Action{}'.format(x+1), axis=1)
.reset_index()
.rename_axis(None, axis=1))

另一种可能性是使用 str.split()：

df.set_index('Letter')['Actions'].str.split(r'(?:\D+)?,', expand=True)

输出：

  Letter Action1 Action2 Action3 Action4
0     r1     a95     a20    NaN     NaN
1     r2     a30     a12    a15     a75
2     r4     a15     a74    NaN     NaN
3     r5     a11     NaN    NaN     NaN

英文:

Here is an option using str.extractall()

(df.set_index(&#39;Letter&#39;)[&#39;Actions&#39;]
.str.extractall(r&#39;([a]\d{2})&#39;)[0]
.unstack()
.rename(lambda x: &#39;Action{}&#39;.format(x+1),axis=1)
.reset_index()
.rename_axis(None,axis=1))

Another possibility is using str.split()

df.set_index(&#39;Letter&#39;)[&#39;Actions&#39;].str.split(r&#39;(?:\D+)?,&#39;,expand=True)

Output:

  Letter Action1 Action2 Action3 Action4
0     r1     a95     a20     NaN     NaN
1     r2     a30     a12     a15     a75
2     r4     a15     a74     NaN     NaN
3     r5     a11     NaN     NaN     NaN

答案2

得分: 0

你错误地使用了 split 函数。https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html

可能你想要做的是类似这样的操作，尽管你仍然保留了 -rf 部分：

split(',', expand=True)

     0       1     2     3
0  a95     a20  None  None
1  a30  a12-rf   a15   a75
2    0    None  None  None
3  a15     a74  None  None
4  a11    None  None  None

此外，你的正则表达式存在一些常见的初学者问题，所以建议你阅读关于 Python 中正则表达式的文档。https://docs.python.org/3/howto/regex.html。特别是你的 [a] 在这里几乎没有意义，因为它与普通的 a 是相同的。

如果你知道可以分割的最大元素数量，那么你可以使用正则表达式和 extract 函数来实现你想要的效果。https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

类似这样的操作：

extract(r'(a\d\d)(?:-[^,]+)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?')

     0    1    2    3
0  a95  a20  NaN  NaN
1  a30  a12  a15  a75
2  NaN  NaN  NaN  NaN
3  a15  a74  NaN  NaN
4  a11  NaN  NaN  NaN

关于匹配类似 a12-rf 并仅保留 a12 部分，你需要使用正则表达式 (a\d\d)(?:-[^,]+)? 来捕获 a12，其中 (a\d\d) 用于捕获 a12，然后匹配但不保留 -rf 部分，你可以使用非捕获组 ?: 修饰符，然后使用 -[^,]+ 匹配 -rf，其中 [^,]+ 匹配一个或多个非逗号字符。

这就是我得到 (a\d\d)(?:-[^,]+)? 的方式。对于字符串中的其他部分元素，你可以将这个模式包装在自己的非捕获组中以处理其可选性，并添加逗号进行匹配。 (?:,<prev_pattern>)?。然后你可以根据需要复制粘贴这个 ,<pattern> 匹配多次。

如果你不知道可能会匹配多少这样的元素，那么你可以使用 extractall，然后将其使用的 MultiIndex 进行逆转。这听起来有点复杂（我必须承认，在我的脑海中操作这个有点痛苦），但在实践中非常简单。

extractall(r'(a\d\d)(?:-[^,]+)?').unstack()[0]

match    0    1    2    3
0      a95  a20  NaN  NaN
1      a30  a12  a15  a75
3      a15  a74  NaN  NaN
4      a11  NaN  NaN  NaN

英文:

You've used split wrong. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html

Something like this is probably what you want to do, although you will still keep the -rf parts:

split(&#39;,&#39;, expand=True)

     0       1     2     3
0  a95     a20  None  None
1  a30  a12-rf   a15   a75
2    0    None  None  None
3  a15     a74  None  None
4  a11    None  None  None

Also, your regex is hitting some common beginner problems so it would be a good idea to read up on regex in python. https://docs.python.org/3/howto/regex.html. In particular your [a] is quite meaningless since it is the same as a plain a.

If you know the maximum number of elements you can split to, then you can use a regex and extract which also does what you want. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

Something like

extract(r&#39;(a\d\d)(?:-[^,]+)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?(?:,(a\d\d)(?:-[^,]+)?)?&#39;)

     0    1    2    3
0  a95  a20  NaN  NaN
1  a30  a12  a15  a75
2  NaN  NaN  NaN  NaN
3  a15  a74  NaN  NaN
4  a11  NaN  NaN  NaN

In terms of matching something like a12-rf and only keeping the a12 part, you need a regex like (a\d\d)(?:-[^,]+)? to capture a12 with (a\d\d), and then match but not keep -rf you use the non-capture group ?: modifier, and then match -rf with -[^,]+ where [^,]+ matches 1-or-more non-, characters.

This is how I got (a\d\d)(?:-[^,]+)?. For the other parts elements in your string, you can wrap this pattern in its own non-capture group to handle the being optional and adding a comma to match. (?:,<prev_pattern>)?. And then you copy-paste that ,<pattern> match as many times as you need.

If you do not know how many of these elements you could hit, then you can use extractall and then pivot the MultiIndex it uses. This sounds complicated (and I have to admit that doing it in my head is a bit painful) but is quite simple in practice.

extractall(r&#39;(a\d\d)(?:-[^,]+)?&#39;).unstack()[0]

match    0    1    2    3
0      a95  a20  NaN  NaN
1      a30  a12  a15  a75
3      a15  a74  NaN  NaN
4      a11  NaN  NaN  NaN

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在pandas中按正则表达式拆分列并保留匹配项：

问题

答案1

答案2

使用Go而不是Python在Matlab中使用DLL。

如何在我的情况下生成Python Swig包装器而不需要任何源文件？

存储 H2O 模型/MOJO 文件到文件系统之外

Python程序加载Excel文件的特定兆字节到数据框架并将其转换为字符串。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论