Pandas df.apply seems to be causing unexpected results
Question
I have a piece of code like this:
    import numpy as np
    import pandas as pd

    df111 = []

    def getNoNaN(row_in, dflist):
        if np.sum(~pd.isna(row_in[3:])) > 1:
            dflist.append(row_in)

    dftest.apply(axis=1, func=getNoNaN, dflist=df111)
    df111[0]
And I got an unexpected result in the first element of df111:
    >>> df111[0]
    Name     d4a0dad668a4249f8ddb8cfd336e3397
    ID       f703f7b7e0173bc4269bfff2d8882439
    Level                                   8
    No.3                                  NaN
    No.4                                  NaN
    No.5                                  NaN
    No.6                                  NaN
    No.7                                  NaN
    No.8                                  NaN
    No.9                                  NaN
    No.10                                 NaN
    No.11                                 NaN
    No.12                                 NaN
    No.13                                 NaN
    No.14                             0.55456
    Name: 3615, dtype: object
This looks wrong: np.sum(~pd.isna(row_in[3:])) equals 1 for this row, so I don't understand why it shows up in the df111 list.
Note that I already have several alternative solutions that produce the correct result (and I am well aware that the code above is not a good way to do this).
But this behavior still bothers me. Did I miss an important hint in the pandas documentation?
I would appreciate your help.
Available test data can be downloaded here: https://drive.google.com/file/d/1AuylSty8-8jmgZQE9_nY2cSYeEs1aw5v/view?usp=share_link
Answer 1
Score: 1
You don't need to use apply; you can use vectorized code:
    import pandas as pd

    df = pd.read_csv('apply_test_data.csv', index_col=0)
    # Keep rows with more than one non-NaN value after the first three columns.
    out = df[df.iloc[:, 3:].notna().sum(axis=1) > 1]
Output:
    >>> out
                                      Name                                ID  Level  ...     No.12  No.13     No.14
    0     ac64934249131b017d85de7b17556ebe  a015c9f38e2ebe6f900ed808119e4c2c      4  ...  0.232793    NaN       NaN
    4     3b3d41ddd2c029057987db03c86cb351  43dd1452a809337189cfb0e32f3bc0da      4  ...  0.041589    NaN       NaN
    7     5873a3f324dac3c389e0e1c570fe0b65  f34d7b40bf2848ab3f26390a14ece18e      5  ...  0.034054    NaN       NaN
    10    a7b105839ac6343847216b21b391e1eb  7bfe373a31b9af37b6db07bbe17e113c      2  ...  0.285993    NaN       NaN
    12    f0851646a101642c2cfa0d3a166104c8  e64f2e38c37509f4cb027fd77421a586      6  ...  0.101971    NaN       NaN
    ...                                ...                               ...    ...  ...       ...    ...       ...
    3201  304507b189dd1c79ac9fdf88f7a12789  59fa7b6d602d9d4df4f4bba3750d9108     10  ...       NaN    NaN  0.519524
    3218  fda0dabb9548ea30f824daab7d10b3d1  05d9b1e13f568b108306133d299598ad      7  ...       NaN    NaN  0.000820
    3226  328d3ce95d79445f6885b2274549662d  b23f8565d14733bcda065add4987074c      3  ...       NaN    NaN  0.534249
    3227  9c80e58c3308ddd40a9b1a8f59a09e3c  c86bf81b97bed0c099062910a0282b13      6  ...       NaN    NaN  0.000830
    3243  ae9e52e41df532d1feea03f9ae0825fb  8b69d03591547968906c37a78dd81d51      2  ...  0.320757    NaN  0.022925

    [1591 rows x 15 columns]
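The one-liner above packs several steps into a single expression. The following is a minimal sketch that unpacks it on a small made-up DataFrame (the values are illustrative and are not the asker's data; only the column layout, three descriptive columns followed by measurements, is taken from the question). The last line shows what should be an equivalent formulation with dropna, offered as an alternative rather than as part of the original answer.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: three descriptive columns, then measurements.
df = pd.DataFrame(
    {
        "Name": ["a", "b", "c"],
        "ID": ["x", "y", "z"],
        "Level": [4, 8, 5],
        "No.3": [0.0, np.nan, np.nan],
        "No.4": [0.2, np.nan, 0.1],
        "No.5": [np.nan, 0.5, 0.3],
    }
)

measurements = df.iloc[:, 3:]              # every column after the first three
counts = measurements.notna().sum(axis=1)  # number of non-NaN values per row
mask = counts > 1                          # True for rows with at least two values
out = df[mask]

print(counts.tolist())     # [2, 1, 2]
print(out.index.tolist())  # [0, 2]

# Alternative: keep rows with at least 2 non-NaN values in the measurement columns.
alt = df.dropna(subset=list(df.columns[3:]), thresh=2)
```

Because the count is computed for all rows at once, no Python-level function is called per row, which is why this form avoids the problem in the question entirely.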
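As for your error: apply appears to reuse the same Series object for every row (uncomment the print(id(row_in)) line below to check), so every entry you append to the list refers to that one object, and its contents end up being whichever row was processed last. To fix it, append a copy of the row (Series) each time:
    df111 = []

    def getNoNaN(row_in, dflist):
        if np.sum(~pd.isna(row_in[3:])) > 1:
            # print(id(row_in))  # Uncomment to check the memory address of row_in
            dflist.append(row_in.copy())  # HERE

    dftest.apply(axis=1, func=getNoNaN, dflist=df111)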
Output:
    >>> df111[0]
    Name     ac64934249131b017d85de7b17556ebe
    ID       a015c9f38e2ebe6f900ed808119e4c2c
    Level                                   4
    No.3                                  0.0
    No.4                                  NaN
    No.5                                  NaN
    No.6                                  NaN
    No.7                                  NaN
    No.8                                  NaN
    No.9                                  0.0
    No.10                                 NaN
    No.11                                 NaN
    No.12                            0.232793
    No.13                                 NaN
    No.14                                 NaN
    Name: 0, dtype: object
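To check the effect of the copy on your own installation, here is a minimal sketch on a small made-up DataFrame (again illustrative values, not the asker's data; the helper names collect and keep are hypothetical). It collects rows with and without .copy() and compares the copied rows with the vectorized filter. Whether the version without the copy actually ends up with one shared object probably depends on the pandas version and the column dtypes, so the sketch only prints that count and does not rely on it.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame(
    {
        "Name": ["a", "b", "c", "d"],
        "ID": ["w", "x", "y", "z"],
        "Level": [4, 8, 5, 2],
        "No.3": [0.0, np.nan, 0.1, np.nan],
        "No.4": [0.2, np.nan, np.nan, np.nan],
        "No.5": [np.nan, 0.5, 0.3, 0.7],
    }
)

def collect(frame, copy):
    """Collect rows with more than one non-NaN measurement via apply."""
    rows = []

    def keep(row):
        if row.iloc[3:].notna().sum() > 1:
            # Store either an independent copy or the object handed in by apply.
            rows.append(row.copy() if copy else row)

    frame.apply(keep, axis=1)
    return rows

copied = collect(df, copy=True)
aliased = collect(df, copy=False)

# Reference result from the vectorized filter.
expected = df[df.iloc[:, 3:].notna().sum(axis=1) > 1]

# With copies, each stored row keeps its own label and values.
copied_labels = [row.name for row in copied]
print(copied_labels)  # [0, 2]

# Without copies, the list entries may all be the very same object.
print(len({id(row) for row in aliased}), "distinct object(s) for", len(aliased), "rows")
```

If the second print reports a single distinct object, every entry of the uncopied list is an alias of the Series that apply keeps overwriting, which matches the symptom in the question.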