May 15, 2023, 01:46:56

How to delete the DataFrame rows with the largest number of NaNs?

Question

Pandas and other questions/answers on this site provide solutions for the case where we know the number of non-NaN values to preserve. How can I efficiently delete just the worst row, or all of the worst rows if several tie for the most NaNs?
Some examples below show how to remove columns (the same works for rows by setting the axis). However, we need to specify how many non-NaN values to keep.

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1,np.nan,1,np.nan], [1,1,1,1], [1,np.nan,1,1], [np.nan,1,1,1]], columns=list('ABCD'))
>>> df
     A    B  C    D
0  1.0  NaN  1  NaN
1  1.0  1.0  1  1.0
2  1.0  NaN  1  1.0
3  NaN  1.0  1  1.0
>>> df.dropna(thresh=3, axis=1)
     A  C    D
0  1.0  1  NaN
1  1.0  1  1.0
2  1.0  1  1.0
3  NaN  1  1.0

Or to delete them altogether:

>>> df.dropna(axis=1)
   C
0  1
1  1
2  1
3  1

Notice
I give more context below. While a hint to a specific solution for that is welcome, I prefer an answer regarding the general case as stated in the title of the post.

Context
I am looking for an efficient way to remove the row with the largest number of NaNs (or all such rows if several tie for the largest number), and after that remove the column(s) analogously, so that I can repeat these two steps until all NaNs are removed.
The goal is to remove NaNs while preserving as much data as possible and keeping the table consistent, i.e., only entire row/column removal is allowed. Please read the notice above.

Examples above extracted from this answer:
https://stackoverflow.com/a/68306367/9681577

Answer 1

Score: 5

You can use boolean indexing with the count of NaNs:

# count the number of NaNs per row
s = df.isna().sum(axis=1)
# drop rows which have the max number, if > 0
out = df[~(s.eq(s.max()) & s.gt(0))]

De Morgan's equivalence:

out = df[s.ne(s.max()) | s.eq(0)]

Output:

     A    B  C    D
1  1.0  1.0  1  1.0
2  1.0  NaN  1  1.0
3  NaN  1.0  1  1.0
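
As a quick check, the sketch below rebuilds the example frame from the question and confirms that the negated mask and its De Morgan form, `s.ne(s.max()) | s.eq(0)`, select the same rows. The names `mask_a` and `mask_b` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 1, np.nan], [1, 1, 1, 1], [1, np.nan, 1, 1], [np.nan, 1, 1, 1]],
    columns=list("ABCD"),
)

# number of NaNs in each row
s = df.isna().sum(axis=1)

# keep rows that are NOT (at the maximum count AND holding at least one NaN)
mask_a = ~(s.eq(s.max()) & s.gt(0))
# De Morgan form: keep rows below the maximum count OR with no NaN at all
mask_b = s.ne(s.max()) | s.eq(0)

print(mask_a.equals(mask_b))       # True
print(df[mask_b].index.tolist())   # [1, 2, 3]
```

The `s.gt(0)` / `s.eq(0)` guard matters when the frame has no NaN left: every row then sits at the maximum count of 0, and without the guard all rows would be dropped.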

columns

Identical logic on the other axis:

s = df.isna().sum(axis=0)
out = df.loc[:, s.ne(s.max()) | s.eq(0)]

Output:

     A  C    D
0  1.0  1  NaN
1  1.0  1  1.0
2  1.0  1  1.0
3  NaN  1  1.0
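
The alternating procedure described in the question's context (drop the worst rows, then the worst columns, and repeat) could be sketched as follows. The helpers `drop_worst` and `drop_until_clean` are hypothetical names, not part of pandas, and this greedy loop is only one possible reading of the procedure. It is not guaranteed to keep the largest possible number of cells.

```python
import numpy as np
import pandas as pd


def drop_worst(df: pd.DataFrame, axis: int) -> pd.DataFrame:
    """Drop the rows (axis=0) or columns (axis=1) that hold the most NaNs."""
    # per-row counts need a sum across columns (axis=1), and vice versa
    s = df.isna().sum(axis=1 - axis)
    # keep everything below the maximum, and everything with no NaN at all
    keep = s.ne(s.max()) | s.eq(0)
    return df.loc[keep] if axis == 0 else df.loc[:, keep]


def drop_until_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Alternate row and column removal until no NaN is left."""
    out = df
    while out.isna().any().any():
        out = drop_worst(out, axis=0)  # worst row(s) first
        out = drop_worst(out, axis=1)  # then worst column(s); no-op if already clean
    return out


df = pd.DataFrame(
    [[1, np.nan, 1, np.nan], [1, 1, 1, 1], [1, np.nan, 1, 1], [np.nan, 1, 1, 1]],
    columns=list("ABCD"),
)
result = drop_until_clean(df)
print(result.index.tolist())    # [1, 2, 3]
print(result.columns.tolist())  # ['C', 'D']
```

On the example frame, the first pass drops row 0 (two NaNs), then columns A and B (one NaN each), which leaves a NaN-free 3x2 table. The loop terminates because every pass that starts with a NaN present removes at least one row.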
