Pandas df.apply seems to be causing unexpected results

Question

I have a piece of code like this:

    import numpy as np
    import pandas as pd

    df111 = []

    def getNoNaN(row_in, dflist):
        # Keep rows with more than one non-NaN value after the first three fields
        if np.sum(~pd.isna(row_in[3:])) > 1:
            dflist.append(row_in)

    dftest.apply(axis=1, func=getNoNaN, dflist=df111)
    df111[0]

The first element of df111 is not what I expected:

    df111[0]
    Name     d4a0dad668a4249f8ddb8cfd336e3397
    ID       f703f7b7e0173bc4269bfff2d8882439
    Level                                   8
    No.3                                  NaN
    No.4                                  NaN
    No.5                                  NaN
    No.6                                  NaN
    No.7                                  NaN
    No.8                                  NaN
    No.9                                  NaN
    No.10                                 NaN
    No.11                                 NaN
    No.12                                 NaN
    No.13                                 NaN
    No.14                             0.55456
    Name: 3615, dtype: object

This looks wrong: for this row np.sum(~pd.isna(row_in[3:])) equals 1, so it should not pass the > 1 check. I'm not sure why it shows up in the df111 list.

Note that I already have several alternative solutions that work correctly, and I am well aware that the code above is not a good approach.

Still, this behavior bothers me. Did I miss an important hint in the pandas documentation?
Any help is appreciated.

Test data can be downloaded here: https://drive.google.com/file/d/1AuylSty8-8jmgZQE9_nY2cSYeEs1aw5v/view?usp=share_link

Answer 1

Score: 1

You don't need apply here; you can use vectorized code:

    import pandas as pd
    df = pd.read_csv('apply_test_data.csv', index_col=0)
    # Keep rows with more than one non-NaN value from the fourth column onward
    out = df[df.iloc[:, 3:].notna().sum(axis=1) > 1]

Output:

    >>> out
                                      Name                                ID  Level  ...     No.12  No.13     No.14
    0     ac64934249131b017d85de7b17556ebe  a015c9f38e2ebe6f900ed808119e4c2c      4  ...  0.232793    NaN       NaN
    4     3b3d41ddd2c029057987db03c86cb351  43dd1452a809337189cfb0e32f3bc0da      4  ...  0.041589    NaN       NaN
    7     5873a3f324dac3c389e0e1c570fe0b65  f34d7b40bf2848ab3f26390a14ece18e      5  ...  0.034054    NaN       NaN
    10    a7b105839ac6343847216b21b391e1eb  7bfe373a31b9af37b6db07bbe17e113c      2  ...  0.285993    NaN       NaN
    12    f0851646a101642c2cfa0d3a166104c8  e64f2e38c37509f4cb027fd77421a586      6  ...  0.101971    NaN       NaN
    ...                                ...                               ...    ...  ...       ...    ...       ...
    3201  304507b189dd1c79ac9fdf88f7a12789  59fa7b6d602d9d4df4f4bba3750d9108     10  ...       NaN    NaN  0.519524
    3218  fda0dabb9548ea30f824daab7d10b3d1  05d9b1e13f568b108306133d299598ad      7  ...       NaN    NaN  0.000820
    3226  328d3ce95d79445f6885b2274549662d  b23f8565d14733bcda065add4987074c      3  ...       NaN    NaN  0.534249
    3227  9c80e58c3308ddd40a9b1a8f59a09e3c  c86bf81b97bed0c099062910a0282b13      6  ...       NaN    NaN  0.000830
    3243  ae9e52e41df532d1feea03f9ae0825fb  8b69d03591547968906c37a78dd81d51      2  ...  0.320757    NaN  0.022925

    [1591 rows x 15 columns]
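
To try the vectorized filter without downloading the test file, here is a minimal sketch on a made-up frame. The `toy` data and its values are assumptions for illustration only; the column layout (three leading fields, then measurement columns) just mirrors the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three leading columns, then measurement columns.
toy = pd.DataFrame(
    {
        "Name": ["a", "b", "c"],
        "ID": ["x", "y", "z"],
        "Level": [4, 8, 2],
        "No.3": [0.0, np.nan, np.nan],
        "No.4": [np.nan, np.nan, 0.3],
        "No.5": [0.2, 0.5, 0.1],
    }
)

# Count the non-NaN values per row, from the fourth column onward.
counts = toy.iloc[:, 3:].notna().sum(axis=1)

# Boolean indexing keeps only the rows with more than one non-NaN value.
kept = toy[counts > 1]

print(counts.tolist())
print(kept.index.tolist())
```

The row labelled 1 has a single non-NaN measurement, so it is the one that gets dropped; this is the same situation as row 3615 in the question.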

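About your error: apply appears to pass the same Series object to your function for every row and only swap its contents (you can check this by printing id(row_in)). The list then holds many references to one object, and all of them show whichever row was processed last. To fix it, append a copy of each row (Series) to the list: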

    df111 = []

    def getNoNaN(row_in, dflist):
        if np.sum(~pd.isna(row_in[3:])) > 1:
            # print(id(row_in))  # Uncomment to check the memory address of row_in
            dflist.append(row_in.copy())  # HERE: store a copy, not the shared object

    dftest.apply(axis=1, func=getNoNaN, dflist=df111)

Output:

    >>> df111[0]
    Name     ac64934249131b017d85de7b17556ebe
    ID       a015c9f38e2ebe6f900ed808119e4c2c
    Level                                   4
    No.3                                  0.0
    No.4                                  NaN
    No.5                                  NaN
    No.6                                  NaN
    No.7                                  NaN
    No.8                                  NaN
    No.9                                  0.0
    No.10                                 NaN
    No.11                                 NaN
    No.12                            0.232793
    No.13                                 NaN
    No.14                                 NaN
    Name: 0, dtype: object
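
The copy-on-append fix can also be checked on a small made-up frame. This is only a sketch: `toy`, `collect_rows`, and `bucket` are hypothetical names, and whether pandas actually reuses the row object may differ between versions, so the code does not rely on that detail. It only shows that storing copies gives independent rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same layout as in the question.
toy = pd.DataFrame(
    {
        "Name": ["a", "b", "c"],
        "ID": ["x", "y", "z"],
        "Level": [4, 8, 2],
        "No.3": [0.0, np.nan, np.nan],
        "No.4": [np.nan, np.nan, 0.3],
        "No.5": [0.2, 0.5, 0.1],
    }
)

bucket = []

def collect_rows(row, bucket):
    # Keep rows with more than one non-NaN value after the first three fields.
    if row.iloc[3:].notna().sum() > 1:
        # Append a copy so the list owns its data, whatever apply does with `row` later.
        bucket.append(row.copy())

toy.apply(collect_rows, axis=1, bucket=bucket)

# Each stored Series keeps its original row label in .name.
print([s.name for s in bucket])

# A list of Series can be turned back into a DataFrame, one row per Series.
rebuilt = pd.DataFrame(bucket)
print(rebuilt.index.tolist())
```

The collected labels match what the vectorized filter selects on the same data, which is a quick way to confirm the two approaches agree.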

huangapple
  • Posted on 2023-05-15 14:12:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76251299.html