February 19, 2023, 01:28:45

Fast way to get index of non-blank values in row/column

Question

Let's say we have the following pandas dataframe:

import pandas as pd
df = pd.DataFrame({'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: None, 2: 8.0}, 'c': {0: 4.0, 1: 2.0, 2: 6.0}})
     a     b    c
0  3.0  10.0  4.0
1  2.0   NaN  2.0
2  NaN   8.0  6.0

I need to get a dataframe with, for each row, the column names of all non-NaN values.
I know I can do the following, which produces the expected output:

df2 = df.apply(lambda x: pd.Series(x.dropna().index), axis=1)
   0  1    2
0  a  b    c
1  a  c  NaN
2  b  c  NaN

Unfortunately, this is quite slow with large datasets. Is there a faster way?

Getting the row indices of non-Null values of each column could work too, as I would just need to transpose the input dataframe. Thanks.
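To make the transpose remark concrete, here is a small sketch that reuses the apply-based approach on `df.T`. The name `rows_per_column` is only illustrative. It suggests that the row-wise routine, applied to the transposed frame, lists the row labels of the non-NaN values for each original column.

```python
import pandas as pd

df = pd.DataFrame({"a": [3.0, 2.0, None], "b": [10.0, None, 8.0], "c": [4.0, 2.0, 6.0]})

# Each row of df.T is one column of df, so the "column names" collected
# per row are in fact the row labels of df where that column is non-NaN.
rows_per_column = df.T.apply(lambda x: pd.Series(x.dropna().index), axis=1)
print(rows_per_column)
```

Any faster row-wise method can therefore serve both cases, at the cost of one transpose.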

Answer 1

Score: 3

Use numpy:

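import numpy as np

m = df.notna()  # True where a value is present
# Put the column name in each present cell; absent cells become NaN
a = m.mul(df.columns).where(m).to_numpy()
# Per row, move present cells to the front (False sorts before True in ~m)
out = pd.DataFrame(a[np.arange(len(a))[:,None], np.argsort(~m, axis=1)],
                   index=df.index)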

Output:

   0  1    2
0  a  b    c
1  a  c  NaN
2  b  c  NaN
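The same idea can be written on plain NumPy arrays, which may be easier to follow step by step. The sketch below is a variant, not the answer's exact code. The function name `non_nan_column_names` is hypothetical, and it passes `kind="stable"` to `np.argsort` because NumPy's default sort kind is not guaranteed to keep tied entries in their original order.

```python
import numpy as np
import pandas as pd

def non_nan_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """For each row, list the names of the columns holding non-NaN values."""
    mask = df.notna().to_numpy()  # shape (n_rows, n_cols), True where present
    # A stable argsort of the inverted mask puts present columns first
    # and keeps their original left-to-right order.
    order = np.argsort(~mask, axis=1, kind="stable")
    names = df.columns.to_numpy(dtype=object)[order]  # names reordered per row
    sorted_mask = np.take_along_axis(mask, order, axis=1)
    names[~sorted_mask] = np.nan  # blank out the trailing absent slots
    return pd.DataFrame(names, index=df.index)

df = pd.DataFrame({"a": [3.0, 2.0, None], "b": [10.0, None, 8.0], "c": [4.0, 2.0, 6.0]})
out = non_nan_column_names(df)
print(out)
```

On the three-row example this should reproduce the table above. The result always has as many columns as the input, with the unused positions filled by NaN.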

Timings

On 30k rows x 3 columns:

# numpy approach
6.82 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pandas apply
7.32 s ± 553 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
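For anyone wanting to reproduce a comparison like this, a minimal harness using the standard-library `timeit` module might look as follows. The test data, the 30% missing rate, the reduced row count, and the function names are all assumptions. The NumPy function is a stable-argsort sketch rather than the answer's exact code, so absolute numbers will differ from those quoted.

```python
import timeit

import numpy as np
import pandas as pd

def with_apply(df):
    # Row-wise approach from the question
    return df.apply(lambda x: pd.Series(x.dropna().index), axis=1)

def with_numpy(df):
    # Stable-argsort sketch (not the answer's exact code)
    mask = df.notna().to_numpy()
    order = np.argsort(~mask, axis=1, kind="stable")
    names = df.columns.to_numpy(dtype=object)[order]
    names[~np.take_along_axis(mask, order, axis=1)] = np.nan
    return pd.DataFrame(names, index=df.index)

n_rows = 2_000  # assumption: kept small here; use 30_000 to match the size quoted above
rng = np.random.default_rng(0)
values = rng.random((n_rows, 3))
values[rng.random(values.shape) < 0.3] = np.nan  # roughly 30% missing, an arbitrary choice
big = pd.DataFrame(values, columns=list("abc"))

t_apply = timeit.timeit(lambda: with_apply(big), number=1)
t_numpy = timeit.timeit(lambda: with_numpy(big), number=10) / 10
print(f"apply: {t_apply:.3f} s per call")
print(f"numpy: {t_numpy * 1e3:.3f} ms per call")
```

Timings depend on the machine and on the pandas and NumPy versions, so only the ratio between the two approaches is worth comparing.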
