Pandas dataframe - Column contains index to other columns

Question

I have a DataFrame (df) in which one column ('bestcol') holds the position of another column in the table. For each row, I want to take the value from the column that 'bestcol' refers to, round it, and store it in a new column (see the rough example below, in which bestcol = 1 refers to Val1, 2 to Val2, and 3 to Val3).

 Val1   Val2   Val3  bestcol  Final
  1.1    2.1    3.1        1    1.0
 11.1   22.1   33.1        2   22.0
111.1  222.1  333.1        3  333.0

My initial approach involved looping through the table row-by-row:

bestcol_list = []
for i in range(len(df)):
    # In the example table bestcol is 1-based, so subtract 1 to get the positional column index
    bestVal = round(df.iloc[i, df['bestcol'][i] - 1], 0)
    bestcol_list.append(bestVal)

df['Final'] = bestcol_list

My data has a few million records, so this was a time-consuming process. My next approach used apply:

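# With axis=1 each row arrives as a float Series, so cast bestcol to int and index by position
bestcol_list = df.apply(lambda row: round(row.iloc[int(row['bestcol']) - 1], 0), axis=1)

df['Final'] = bestcol_list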

This turned out to be a bit slower than just looping through the table. Is there a vectorized approach to solving this problem that I'm not considering?

Thanks!

Answer 1

Score: 2

You can use numpy indexing:

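import numpy as np

row = np.arange(len(df))        # positional row numbers 0..n-1
col = df['bestcol'].values - 1  # 1-based bestcol -> 0-based column position
x = df.filter(like='Val').values  # or df.iloc[:, :3].values

# Pair each row number with its column position, then round the selected values
df['Final'] = np.round(x[row, col])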

Output:

>>> df
    Val1   Val2   Val3  bestcol  Final
0    1.1    2.1    3.1        1    1.0
1   11.1   22.1   33.1        2   22.0
2  111.1  222.1  333.1        3  333.0

Performance for 5_000_000 rows and 100 columns:

M = 5_000_000
N = 100
x = np.random.uniform(1, 500, (M, N))
row = np.arange(M)
col = np.random.randint(1, N+1, M) - 1

%timeit np.round(x[row, col])
75.5 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
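
As a variation on the indexing above, the same lookup can be sketched with np.take_along_axis plus a range check. This is a swapped-in alternative rather than the answerer's method. The value_cols list and the ValueError guard are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Val1": [1.1, 11.1, 111.1],
    "Val2": [2.1, 22.1, 222.1],
    "Val3": [3.1, 33.1, 333.1],
    "bestcol": [1, 2, 3],
})

value_cols = ["Val1", "Val2", "Val3"]  # assumed candidate columns
x = df[value_cols].to_numpy()          # shape (n_rows, 3)
col = df["bestcol"].to_numpy() - 1     # 1-based -> 0-based positions

# Guard (an added assumption): reject positions outside the candidate columns,
# since negative values would otherwise silently wrap around.
if ((col < 0) | (col >= x.shape[1])).any():
    raise ValueError("bestcol contains an out-of-range position")

# take_along_axis expects an index array with the same number of dimensions as x.
picked = np.take_along_axis(x, col[:, None], axis=1).ravel()
df["Final"] = np.round(picked)
print(df)
```

Selecting the value columns by name keeps the lookup independent of where bestcol itself sits in the table.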

Answer 2

Score: 0

I used nested np.where calls to handle the three possible bestcol values, then applied np.round:

df['Final'] = np.round(np.where(df.bestcol == 1, df.Val1, np.where(df.bestcol == 2, df.Val2, df.Val3)), 0)
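
Nested np.where calls become hard to read beyond a few columns. A minimal sketch of the same idea with np.select follows, assuming the candidate columns are listed in value_cols (a name introduced here) and that rows with an unmatched bestcol should become NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Val1": [1.1, 11.1, 111.1],
    "Val2": [2.1, 22.1, 222.1],
    "Val3": [3.1, 33.1, 333.1],
    "bestcol": [1, 2, 3],
})

value_cols = ["Val1", "Val2", "Val3"]  # assumed candidate columns, in bestcol order
best = df["bestcol"].to_numpy()

# One boolean mask and one source array per candidate column.
conditions = [best == k for k in range(1, len(value_cols) + 1)]
choices = [df[c].to_numpy() for c in value_cols]

# Rows whose bestcol matches no condition fall back to NaN (an assumption).
df["Final"] = np.round(np.select(conditions, choices, default=np.nan))
print(df)
```

This still evaluates one mask per column, so for many columns the positional indexing approach is likely to scale better.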

Answer 3

Score: 0

A simple, Pythonic way (a list comprehension, so it is not vectorized):

import pandas as pd

df = pd.DataFrame({'Val1': [1.1, 11.1, 111.1],
                   'Val2': [2.1, 22.1, 222.1],
                   'Val3': [3.1, 33.1, 333.1],
                   'bestcol': [1, 2, 3],
                   })

vals = df[['Val1', 'Val2', 'Val3']]  # select the candidate columns once
df['Final'] = [round(vals.iloc[i, p])
               for i, p in enumerate(df.bestcol.sub(1).tolist())]

print(df)

Result

    Val1   Val2   Val3  bestcol  Final
0    1.1    2.1    3.1        1      1
1   11.1   22.1   33.1        2     22
2  111.1  222.1  333.1        3    333

Answer 4

Score: 0

Since bestcol in your example lists the ordinal positions of the columns in order (row i refers to column i), you can apply numpy.diag. Note that this works only in that special case:

df['Final'] = np.round(np.diag(df[df.columns[:-1]]))

       1      2      3  bestcol  Final
0    1.1    2.1    3.1        1    1.0
1   11.1   22.1   33.1        2   22.0
2  111.1  222.1  333.1        3  333.0
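
To illustrate the limitation, here is a small sketch with a hypothetical bestcol that is not in diagonal order. np.diag always returns x[i, i] and never reads bestcol, while positional indexing follows it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Val1": [1.1, 11.1, 111.1],
    "Val2": [2.1, 22.1, 222.1],
    "Val3": [3.1, 33.1, 333.1],
    "bestcol": [3, 1, 2],  # hypothetical: no longer 1, 2, 3 in row order
})

x = df[["Val1", "Val2", "Val3"]].to_numpy()

# np.diag picks x[0, 0], x[1, 1], x[2, 2] regardless of bestcol.
diag_result = np.round(np.diag(x))

# Positional indexing uses bestcol for every row.
rows = np.arange(len(df))
cols = df["bestcol"].to_numpy() - 1
indexed_result = np.round(x[rows, cols])

print(diag_result)
print(indexed_result)
```

np.diag also returns only min(rows, columns) values, so it cannot fill a column when the table has more rows than candidate columns.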

Answer 5

Score: 0

Try this:

df['Final'] = df.values[df.index, df.bestcol-1].round(0)
print(df)

Output:

    Val1   Val2   Val3  bestcol  Final
0    1.1    2.1    3.1        1    1.0
1   11.1   22.1   33.1        2   22.0
2  111.1  222.1  333.1        3  333.0
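
The one-liner uses df.index as row positions, which holds only for the default 0..n-1 index. A hedged sketch for a DataFrame with other index labels (the labels 10, 20, 30 are made up for illustration) builds the row positions with np.arange instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Val1": [1.1, 11.1, 111.1],
        "Val2": [2.1, 22.1, 222.1],
        "Val3": [3.1, 33.1, 333.1],
        "bestcol": [1, 2, 3],
    },
    index=[10, 20, 30],  # hypothetical non-default index labels
)

values = df.to_numpy()               # mixed int/float columns become float64
rows = np.arange(len(df))            # positional row numbers, not index labels
cols = df["bestcol"].to_numpy() - 1  # 1-based -> 0-based positions

df["Final"] = values[rows, cols].round(0)
print(df)
```

With these labels, indexing by df.index directly would raise an IndexError, because 10, 20 and 30 are not valid row positions in a three-row array.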

huangapple
  • Posted on April 4, 2023, 03:23:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75923081.html