Pandas dataframe - Column contains index to other columns
Question
I have a dataframe (df) where one of the columns ('bestcol') contains indexes of other columns in the table. I want to grab the column being referred to by 'bestcol', round it, and create a new column with that info (see table below for rough example, in which bestcol = 1 refers to Val1, 2 refers to Val2, 3 refers to Val3).
| Val1 | Val2 | Val3 | bestcol | Final |
|---|---|---|---|---|
| 1.1 | 2.1 | 3.1 | 1 | 1.0 |
| 11.1 | 22.1 | 33.1 | 2 | 22.0 |
| 111.1 | 222.1 | 333.1 | 3 | 333.0 |
My initial approach involved looping through the table row-by-row:
bestcol_list = []
for i in range(len(df)):
    bestVal = round(df.iloc[i, df['bestcol'][i]], 0)
    bestcol_list.append(bestVal)
df['Final'] = bestcol_list
My data has a few million records, so this was a time-consuming process. My next approach involved using apply:
bestcol_list = df.apply(lambda row: round(row[row['bestcol']], 0), axis=1)
df['Final'] = bestcol_list
This turned out to be a bit slower than just looping through the table. Is there a vectorized approach to solving this problem that I'm not considering?
Thanks!
Answer 1
Score: 2
You can use numpy indexing:
row = np.arange(len(df))
col = df['bestcol'].values - 1
x = df.filter(like='Val').values # or df.iloc[:, :3].values
df['Final'] = np.round(x[row, col])
Output:
>>> df
Val1 Val2 Val3 bestcol Final
0 1.1 2.1 3.1 1 1.0
1 11.1 22.1 33.1 2 22.0
2 111.1 222.1 333.1 3 333.0
Performance for 5_000_000 rows and 100 columns:
M = 5_000_000
N = 100
x = np.random.uniform(1, 500, (M, N))
row = np.arange(M)
col = np.random.randint(1, N+1, M) - 1
%timeit np.round(x[row, col])
75.5 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
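For reference, the steps above can be put together into a self-contained sketch on the sample data from the question. The list `val_cols` is a name introduced here for illustration; the sketch assumes `bestcol` holds a 1-based position within that list.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Val1": [1.1, 11.1, 111.1],
    "Val2": [2.1, 22.1, 222.1],
    "Val3": [3.1, 33.1, 333.1],
    "bestcol": [1, 2, 3],
})

# Assumption: bestcol is a 1-based position within val_cols.
val_cols = ["Val1", "Val2", "Val3"]

x = df[val_cols].to_numpy()          # 2-D array holding only the value columns
row = np.arange(len(df))             # 0, 1, 2, ... one entry per row
col = df["bestcol"].to_numpy() - 1   # convert 1-based to 0-based positions

# Integer array indexing picks x[row[k], col[k]] for every k at once.
df["Final"] = np.round(x[row, col])
print(df["Final"].tolist())  # [1.0, 22.0, 333.0]
```

Selecting the value columns by name rather than by `filter(like='Val')` avoids accidentally picking up other columns whose names happen to contain "Val".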
Answer 2
Score: 0
I used nested np.where to handle the three conditions and then applied np.round.
df['Final'] = np.round(np.where(df.bestcol == 1, df.Val1, np.where(df.bestcol == 2, df.Val2, df.Val3)), 0)
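Nested np.where calls become hard to read with more than a few columns. One possible variation (not part of the answer above) is np.select, which takes a list of conditions and a matching list of choices. This is a minimal sketch; `val_cols` is a hypothetical helper list, and `bestcol` is shuffled here to show that row order does not matter.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Val1": [1.1, 11.1, 111.1],
    "Val2": [2.1, 22.1, 222.1],
    "Val3": [3.1, 33.1, 333.1],
    "bestcol": [3, 1, 2],  # shuffled on purpose
})

# Hypothetical helper: value columns in the order bestcol refers to them.
val_cols = ["Val1", "Val2", "Val3"]

# One boolean mask per candidate column: bestcol == 1, bestcol == 2, ...
conditions = [(df["bestcol"] == k).to_numpy() for k in range(1, len(val_cols) + 1)]
choices = [df[c].to_numpy() for c in val_cols]

# Rows whose bestcol matches no condition fall back to NaN.
df["Final"] = np.round(np.select(conditions, choices, default=np.nan))
print(df["Final"].tolist())  # [3.0, 11.0, 222.0]
```

Like nested np.where, this evaluates one mask per column, so for many columns the integer-indexing approach is likely to scale better.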
Answer 3
Score: 0
A simple Pythonic way, using a list comprehension (note that this still visits the rows one by one, so it is not vectorized):
import pandas as pd
df = pd.DataFrame({'Val1': [1.1, 11.1, 111.1],
                   'Val2': [2.1, 22.1, 222.1],
                   'Val3': [3.1, 33.1, 333.1],
                   'bestcol': [1, 2, 3],
                   })
df['Final'] = [round(df[['Val1', 'Val2', 'Val3']].iloc[i, p])
               for i, p in enumerate(df.bestcol.sub(1).tolist())]
print(df)
Result
Val1 Val2 Val3 bestcol Final
0 1.1 2.1 3.1 1 1
1 11.1 22.1 33.1 2 22
2 111.1 222.1 333.1 3 333
Answer 4
Score: 0
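Since the bestcol in your example lists the ordinal positions of the actual columns in order (row i points at column i), you can apply numpy.diag. Note that this only works for that diagonal pattern, not for arbitrary bestcol values: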
df['Final'] = np.round(np.diag(df[df.columns[:-1]]))
1 2 3 bestcol Final
0 1.1 2.1 3.1 1 1.0
1 11.1 22.1 33.1 2 22.0
2 111.1 222.1 333.1 3 333.0
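To make the scope of the diagonal trick concrete, the following sketch compares np.diag with per-row integer indexing on a plain array built from the question's sample values. The variable names `ordered` and `shuffled` are introduced here for illustration only.

```python
import numpy as np

x = np.array([
    [1.1, 2.1, 3.1],
    [11.1, 22.1, 33.1],
    [111.1, 222.1, 333.1],
])
row = np.arange(len(x))

ordered = np.array([1, 2, 3]) - 1   # bestcol as in the question
shuffled = np.array([3, 1, 2]) - 1  # an arbitrary bestcol

diag = np.diag(x)  # always x[0, 0], x[1, 1], x[2, 2]

# Matches only when row k points at column k.
print(np.array_equal(diag, x[row, ordered]))   # True
print(np.array_equal(diag, x[row, shuffled]))  # False
```

For real data, where bestcol varies freely from row to row, the indexing form `x[row, col]` is the one that generalizes.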
Answer 5
Score: 0
Try this:
df['Final'] = df.values[df.index, df.bestcol-1].round(0)
print(df)
>>>
Val1 Val2 Val3 bestcol Final
0 1.1 2.1 3.1 1 1.0
1 11.1 22.1 33.1 2 22.0
2 111.1 222.1 333.1 3 333.0
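The one-liner above uses `df.index` as row positions, so it relies on the default 0..n-1 index, and `df.values` also includes the bestcol column. A hedged variation that avoids both assumptions is sketched below using np.take_along_axis; the custom index and the `val_cols` list are made up for this example.

```python
import numpy as np
import pandas as pd

# Same values as the question, but with a non-default index and a
# shuffled bestcol to show that neither matters here.
df = pd.DataFrame(
    {
        "Val1": [1.1, 11.1, 111.1],
        "Val2": [2.1, 22.1, 222.1],
        "Val3": [3.1, 33.1, 333.1],
        "bestcol": [2, 3, 1],
    },
    index=[10, 20, 30],
)

val_cols = ["Val1", "Val2", "Val3"]  # hypothetical helper list
x = df[val_cols].to_numpy()
col = df["bestcol"].to_numpy() - 1   # 0-based column positions

# take_along_axis needs an index array with the same number of
# dimensions as x, hence the extra axis; ravel() flattens the result.
picked = np.take_along_axis(x, col[:, None], axis=1).ravel()
df["Final"] = np.round(picked)
print(df["Final"].tolist())  # [2.0, 33.0, 111.0]
```

Because the result is a plain NumPy array, assigning it back to the DataFrame is positional and does not depend on the index labels.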