2023-05-15 14:12:24

Pandas df.apply seems to be causing unexpected results

Question

I have a piece of code like this:

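import numpy as np
import pandas as pd

df111 = []

def getNoNaN(row_in, dflist):
    # Keep rows that have more than one non-NaN value after the first three columns
    if np.sum(~pd.isna(row_in[3:])) > 1:
        dflist.append(row_in)

dftest.apply(axis=1, func=getNoNaN, dflist=df111)
df111[0]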

But the first element of df111 is not what I expected:

df111[0]
Name     d4a0dad668a4249f8ddb8cfd336e3397
ID       f703f7b7e0173bc4269bfff2d8882439
Level                                   8
No.3                                  NaN
No.4                                  NaN
No.5                                  NaN
No.6                                  NaN
No.7                                  NaN
No.8                                  NaN
No.9                                  NaN
No.10                                 NaN
No.11                                 NaN
No.12                                 NaN
No.13                                 NaN
No.14                             0.55456
Name: 3615, dtype: object

This looks wrong: for this row np.sum(~pd.isna(row_in[3:])) equals 1, so it fails the > 1 check, and I don't understand why it ends up in the df111 list.

Note that I already have several other ways to get the result I want, and I'm well aware that the code above is not good practice.

Still, this behavior bothers me. Did I miss an important hint in the pandas documentation?
Any help would be appreciated.

Test data can be downloaded here: https://drive.google.com/file/d/1AuylSty8-8jmgZQE9_nY2cSYeEs1aw5v/view?usp=share_link

Answer 1

Score: 1

You don't need apply here; vectorized code does the same filtering:

import pandas as pd

df = pd.read_csv('apply_test_data.csv', index_col=0)
# Keep rows with more than one non-NaN value after the first three columns
out = df[df.iloc[:, 3:].notna().sum(axis=1) > 1]

Output:

>>> out
                                  Name                                ID  Level  ...     No.12  No.13     No.14
0     ac64934249131b017d85de7b17556ebe  a015c9f38e2ebe6f900ed808119e4c2c      4  ...  0.232793    NaN       NaN
4     3b3d41ddd2c029057987db03c86cb351  43dd1452a809337189cfb0e32f3bc0da      4  ...  0.041589    NaN       NaN
7     5873a3f324dac3c389e0e1c570fe0b65  f34d7b40bf2848ab3f26390a14ece18e      5  ...  0.034054    NaN       NaN
10    a7b105839ac6343847216b21b391e1eb  7bfe373a31b9af37b6db07bbe17e113c      2  ...  0.285993    NaN       NaN
12    f0851646a101642c2cfa0d3a166104c8  e64f2e38c37509f4cb027fd77421a586      6  ...  0.101971    NaN       NaN
...                                ...                               ...    ...  ...       ...    ...       ...
3201  304507b189dd1c79ac9fdf88f7a12789  59fa7b6d602d9d4df4f4bba3750d9108     10  ...       NaN    NaN  0.519524
3218  fda0dabb9548ea30f824daab7d10b3d1  05d9b1e13f568b108306133d299598ad      7  ...       NaN    NaN  0.000820
3226  328d3ce95d79445f6885b2274549662d  b23f8565d14733bcda065add4987074c      3  ...       NaN    NaN  0.534249
3227  9c80e58c3308ddd40a9b1a8f59a09e3c  c86bf81b97bed0c099062910a0282b13      6  ...       NaN    NaN  0.000830
3243  ae9e52e41df532d1feea03f9ae0825fb  8b69d03591547968906c37a78dd81d51      2  ...  0.320757    NaN  0.022925
[1591 rows x 15 columns]
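
To make the mask easier to follow, here is a minimal sketch on a small made-up frame. The column names mimic the test data, but the `toy` frame and its values are assumptions for illustration only, not rows from the original file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three metadata columns, then measurement columns.
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "ID": ["x", "y", "z"],
    "Level": [1, 2, 3],
    "No.3": [0.1, np.nan, np.nan],
    "No.4": [0.2, np.nan, 0.5],
    "No.5": [np.nan, np.nan, 0.7],
})

# Count the non-NaN values per row, skipping the first three columns.
counts = toy.iloc[:, 3:].notna().sum(axis=1)

# Boolean indexing keeps only the rows whose count exceeds 1.
kept = toy[counts > 1]

print(counts.tolist())      # [2, 0, 2]
print(kept.index.tolist())  # [0, 2]
```

Because the filter is a single boolean mask, no row objects are collected one by one, so the problem from the question cannot occur.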

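About your error: you have to copy each row (Series) before appending it to the list. The commented-out id check in the code below lets you see whether apply passes the same Series object on every call; if it does, the list ends up holding several references to one object, and every reference shows whatever row was written into it last: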

df111 = []

def getNoNaN(row_in, dflist):
    if np.sum(~pd.isna(row_in[3:])) > 1:
        # print(id(row_in))  # Uncomment to check the memory address of row_in
        dflist.append(row_in.copy())  # HERE: store a copy, not the object itself

dftest.apply(axis=1, func=getNoNaN, dflist=df111)

Output:

>>> df111[0]
Name     ac64934249131b017d85de7b17556ebe
ID       a015c9f38e2ebe6f900ed808119e4c2c
Level                                   4
No.3                                  0.0
No.4                                  NaN
No.5                                  NaN
No.6                                  NaN
No.7                                  NaN
No.8                                  NaN
No.9                                  0.0
No.10                                 NaN
No.11                                 NaN
No.12                            0.232793
No.13                                 NaN
No.14                                 NaN
Name: 0, dtype: object
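
Why the copy matters can be sketched without pandas. The snippet below assumes a producer that hands out one shared object and overwrites it for each row, which is what identical id values from the check above would suggest. The `reused_rows` generator is a hypothetical stand-in, not pandas code.

```python
def reused_rows(data):
    """Yield every row through one shared list, overwriting it each time."""
    buffer = []
    for row in data:
        buffer[:] = row  # overwrite the contents in place; the object stays the same
        yield buffer


data = [[1, 2], [3, 4], [5, 6]]

# Storing the yielded object itself: every entry is the same list.
without_copy = [row for row in reused_rows(data)]

# Storing a copy: every entry keeps the values it had when it was yielded.
with_copy = [row.copy() for row in reused_rows(data)]

print(without_copy)  # [[5, 6], [5, 6], [5, 6]]
print(with_copy)     # [[1, 2], [3, 4], [5, 6]]
```

In the uncopied list every entry shows the last row, which matches the question, where df111[0] displayed a row that does not pass the filter.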
