April 4, 2023, 03:23:10

Pandas dataframe - Column contains index to other columns

Question

I have a dataframe (df) where one of the columns ('bestcol') contains indexes of other columns in the table. I want to grab the column being referred to by 'bestcol', round it, and create a new column with that info (see table below for rough example, in which bestcol = 1 refers to Val1, 2 refers to Val2, 3 refers to Val3).

Val1	Val2	Val3	bestcol	Final
1.1	2.1	3.1	1	1.0
11.1	22.1	33.1	2	22.0
111.1	222.1	333.1	3	333.0

My initial approach involved looping through the table row-by-row:

bestcol_list = []
for i in range(len(df)):
    bestVal = round(df.iloc[i, df['bestcol'][i]], 0)
    bestcol_list.append(bestVal)

df['Final'] = bestcol_list

My data has a few million records, so this was a time consuming process. My next approach involved using apply:

bestcol_list = df.apply(lambda row: round(row[row['bestcol']], 0), axis=1)
df[&#39;Final&#39;] = bestcol_list

This turned out to be a bit slower than just looping through the table. Is there a vectorized approach to solving this problem that I'm not considering?

Thanks!

Answer 1

Score: 2

You can use numpy indexing:

import numpy as np

row = np.arange(len(df))
col = df['bestcol'].values - 1  # bestcol is 1-based; NumPy indexing is 0-based
x = df.filter(like='Val').values  # or df.iloc[:, :3].values
df['Final'] = np.round(x[row, col])

Output:

>>> df
    Val1   Val2   Val3  bestcol  Final
0    1.1    2.1    3.1        1    1.0
1   11.1   22.1   33.1        2   22.0
2  111.1  222.1  333.1        3  333.0

Performance for 5_000_000 rows and 100 columns:

M = 5_000_000
N = 100
x = np.random.uniform(1, 500, (M, N))
row = np.arange(M)
col = np.random.randint(1, N+1, M) - 1
%timeit np.round(x[row, col])
75.5 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
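
To check that the indexing approach agrees with a plain row-by-row loop, here is a minimal, self-contained sketch. The helper name `pick_by_index` is made up for illustration, and it assumes (as in the example table) that `bestcol` holds 1-based positions into the `Val*` columns.

```python
import numpy as np
import pandas as pd


def pick_by_index(df, value_cols, index_col="bestcol"):
    """Return, per row, the rounded value of the column that index_col points to.

    Assumption: index_col holds 1-based positions into value_cols.
    """
    values = df[value_cols].to_numpy()
    rows = np.arange(len(df))
    cols = df[index_col].to_numpy() - 1  # convert 1-based to 0-based
    return np.round(values[rows, cols])


df = pd.DataFrame({
    "Val1": [1.1, 11.1, 111.1],
    "Val2": [2.1, 22.1, 222.1],
    "Val3": [3.1, 33.1, 333.1],
    "bestcol": [1, 2, 3],
})
value_cols = ["Val1", "Val2", "Val3"]
df["Final"] = pick_by_index(df, value_cols)

# Reference result from an explicit Python loop, for comparison only.
expected = [round(df[value_cols].iloc[i, c - 1]) for i, c in enumerate(df["bestcol"])]
print(df)
```

Selecting the value columns by name, rather than by `filter(like='Val')`, avoids picking up unrelated columns whose names happen to contain "Val".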

Answer 2

Score: 0

I used nested np.where calls to handle the three conditions, then applied np.round.

df['Final'] = np.round(np.where(df.bestcol == 1, df.Val1, np.where(df.bestcol == 2, df.Val2, df.Val3)), 0)
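
Nesting np.where gets unwieldy beyond a handful of columns. One possible generalization, sketched here with np.select, builds one condition per value column. The column list `value_cols` and the choice of NaN for rows whose `bestcol` matches no column are assumptions for this example, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Val1": [1.1, 11.1, 111.1],
    "Val2": [2.1, 22.1, 222.1],
    "Val3": [3.1, 33.1, 333.1],
    "bestcol": [1, 2, 3],
})
value_cols = ["Val1", "Val2", "Val3"]

# One boolean mask per candidate column: bestcol == 1, bestcol == 2, ...
conditions = [(df["bestcol"] == k).to_numpy() for k in range(1, len(value_cols) + 1)]
choices = [df[c].to_numpy() for c in value_cols]

# Rows whose bestcol matches no column get NaN (an assumption of this sketch).
df["Final"] = np.round(np.select(conditions, choices, default=np.nan))
print(df)
```

This still evaluates one mask per column, so for wide frames the integer-indexing approach from Answer 1 is likely the better fit.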

Answer 3

Score: 0

A simple, Pythonic way (a list comprehension, so it still iterates row by row):

import pandas as pd
df = pd.DataFrame({'Val1': [1.1, 11.1, 111.1],
                   'Val2': [2.1, 22.1, 222.1],
                   'Val3': [3.1, 33.1, 333.1],
                   'bestcol': [1, 2, 3],
                   })
df['Final'] = [round(df[['Val1', 'Val2', 'Val3']].iloc[i, p])
               for i, p in enumerate(df.bestcol.sub(1).tolist())]
print(df)

Result

    Val1   Val2   Val3  bestcol  Final
0    1.1    2.1    3.1        1      1
1   11.1   22.1   33.1        2     22
2  111.1  222.1  333.1        3    333

Answer 4

Score: 0

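Since your bestcol contains the ordinal positions of the actual columns in order (row 0 points to column 1, row 1 to column 2, and so on), you can apply numpy.diag. Note that this only works for that special case: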

df['Final'] = np.round(np.diag(df[df.columns[:-1]]))

       1      2      3  bestcol  Final
0    1.1    2.1    3.1        1    1.0
1   11.1   22.1   33.1        2   22.0
2  111.1  222.1  333.1        3  333.0
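
The diagonal trick depends on row i always pointing at column i + 1. A small sketch with a shuffled `bestcol` (values chosen for illustration only) shows where the two approaches diverge:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Val1": [1.1, 11.1, 111.1],
    "Val2": [2.1, 22.1, 222.1],
    "Val3": [3.1, 33.1, 333.1],
    "bestcol": [2, 1, 3],  # no longer in diagonal order
})
vals = df[["Val1", "Val2", "Val3"]].to_numpy()

# np.diag ignores bestcol entirely and just reads vals[i, i].
diag_result = np.round(np.diag(vals))

# Integer indexing honours bestcol for every row.
indexed_result = np.round(vals[np.arange(len(df)), df["bestcol"].to_numpy() - 1])

print(diag_result)
print(indexed_result)
```

Only the last row agrees here, because it is the only one whose `bestcol` happens to sit on the diagonal.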

Answer 5

Score: 0

Try this:

df['Final'] = df.values[df.index, df.bestcol-1].round(0)
print(df)
>>>
    Val1   Val2   Val3  bestcol  Final
0    1.1    2.1    3.1        1    1.0
1   11.1   22.1   33.1        2   22.0
2  111.1  222.1  333.1        3  333.0
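
The one-liner above uses `df.index` as row positions, so it relies on the default 0..n-1 index, and on the value columns coming first in the frame. A hedged variant using np.take_along_axis avoids both assumptions; the string index and the `bestcol` values below are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Val1": [1.1, 11.1, 111.1],
        "Val2": [2.1, 22.1, 222.1],
        "Val3": [3.1, 33.1, 333.1],
        "bestcol": [3, 1, 2],
    },
    index=["a", "b", "c"],  # non-default index, so df.index cannot be used as positions
)
vals = df[["Val1", "Val2", "Val3"]].to_numpy()

# take_along_axis needs an index array with the same number of dimensions as vals.
cols = df["bestcol"].to_numpy()[:, None] - 1  # shape (n, 1), 0-based

df["Final"] = np.round(np.take_along_axis(vals, cols, axis=1)).ravel()
print(df)
```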