How to delete the DataFrame rows with the largest number of NaNs?

Question

The pandas documentation and other questions/answers on this site provide solutions for the case where we know how many non-NaN values to preserve. How can I efficiently delete just the worst row, or all of the worst rows if several are tied?
The examples below show how to remove columns; the same works for rows by changing the axis. However, in both cases we need to specify how many non-NaN values to keep.

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1,np.nan,1,np.nan], [1,1,1,1], [1,np.nan,1,1], [np.nan,1,1,1]], columns=list('ABCD'))
>>> df
     A    B  C    D
0  1.0  NaN  1  NaN
1  1.0  1.0  1  1.0
2  1.0  NaN  1  1.0
3  NaN  1.0  1  1.0

>>> df.dropna(thresh=3, axis=1)
     A  C    D
0  1.0  1  NaN
1  1.0  1  1.0
2  1.0  1  1.0
3  NaN  1  1.0

Or, to drop every column that contains any NaN:

>>> df.dropna(axis=1)
   C
0  1
1  1
2  1
3  1
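
For illustration, here is a minimal sketch of the row-wise counterpart, assuming the same example `df`. It shows the limitation described above: `thresh` has to be chosen by hand, here as "at least 3 non-NaN values per row".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 1, np.nan], [1, 1, 1, 1], [1, np.nan, 1, 1], [np.nan, 1, 1, 1]],
    columns=list("ABCD"),
)

# Keep only rows with at least 3 non-NaN values (axis=0 targets rows).
rows_kept = df.dropna(thresh=3, axis=0)

# Row 0 has only 2 non-NaN values, so it is the only row dropped.
print(rows_kept.index.tolist())
```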

Notice
I give more context below. Hints toward a solution for that specific use case are welcome, but I would prefer an answer to the general case stated in the title of the post.

Context
I am looking for an efficient way to remove the row with the largest number of NaNs (or all such rows if several are tied for the largest number), and then remove the column(s) analogously, so that I can repeat these two steps until all NaNs are removed.
The goal is to remove the NaNs while preserving as much data as possible and keeping the table consistent, i.e., only entire rows or columns may be removed. Please read the notice above.

Examples above extracted from this answer:
https://stackoverflow.com/a/68306367/9681577

Answer 1

Score: 5

You can use boolean indexing with the count of NaNs:

# count the number of NaNs per row
s = df.isna().sum(axis=1)

# drop rows which have the max number, if > 0
out = df[~(s.eq(s.max()) & s.gt(0))]

De Morgan's equivalence:

out = df[s.ne(s.max()) | s.eq(0)]

Output:

     A    B  C    D
1  1.0  1.0  1  1.0
2  1.0  NaN  1  1.0
3  NaN  1.0  1  1.0
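
As a quick sanity check, the sketch below rebuilds the question's example frame and compares the negated mask with its De Morgan expansion. The names `mask_negated` and `mask_expanded` are only illustrative. Since NaN counts are never negative, "not greater than 0" can be written as `s.eq(0)`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 1, np.nan], [1, 1, 1, 1], [1, np.nan, 1, 1], [np.nan, 1, 1, 1]],
    columns=list("ABCD"),
)

# Number of NaNs per row.
s = df.isna().sum(axis=1)

# "Not (has the max count AND has at least one NaN)".
mask_negated = ~(s.eq(s.max()) & s.gt(0))

# De Morgan: "does not have the max count OR has no NaN at all".
mask_expanded = s.ne(s.max()) | s.eq(0)

print(mask_negated.equals(mask_expanded))
print(df[mask_expanded].index.tolist())
```

The `s.gt(0)` / `s.eq(0)` guard matters when the frame has no NaNs at all: every row then ties at the maximum count of 0, and without the guard all rows would be dropped.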

columns

Identical logic on the other axis:

s = df.isna().sum(axis=0)

out = df.loc[:, s.ne(s.max()) | s.eq(0)]

Output:

     A  C    D
0  1.0  1  NaN
1  1.0  1  1.0
2  1.0  1  1.0
3  NaN  1  1.0
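
The question's context describes alternating the two steps until no NaNs remain. A minimal sketch of that loop follows, assuming the masks shown in this answer; the helper names `drop_worst` and `drop_until_clean` are hypothetical and not part of pandas. This is a greedy heuristic, so it may not preserve the maximum possible amount of data in every case.

```python
import numpy as np
import pandas as pd


def drop_worst(df: pd.DataFrame, axis: int) -> pd.DataFrame:
    """Drop the rows (axis=0) or columns (axis=1) tied for the most NaNs."""
    # Summing along the other axis yields one NaN count per row / per column.
    s = df.isna().sum(axis=1 - axis)
    # Keep everything below the max count, and everything with no NaN at all.
    keep = s.ne(s.max()) | s.eq(0)
    return df.loc[keep] if axis == 0 else df.loc[:, keep]


def drop_until_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Alternate row and column removal until no NaN is left."""
    out = df
    axis = 0  # start with rows, as in the question
    while out.isna().to_numpy().any():
        out = drop_worst(out, axis)
        axis = 1 - axis  # switch between rows and columns
    return out


df = pd.DataFrame(
    [[1, np.nan, 1, np.nan], [1, 1, 1, 1], [1, np.nan, 1, 1], [np.nan, 1, 1, 1]],
    columns=list("ABCD"),
)

result = drop_until_clean(df)
print(result)
```

On the example frame, the first pass drops row 0 and the second pass drops columns A and B, which are tied at one NaN each. The loop terminates because each pass removes at least one row or column whenever a NaN is still present.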

huangapple
  • Posted on May 15, 2023, 01:46:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76248909.html