Pandas dataframe - Column contains index to other columns

Question

I have a dataframe (df) in which one of the columns ('bestcol') contains the index of another column in the table. I want to grab the column referred to by 'bestcol', round its value, and create a new column with that result (see the table below for a rough example, in which bestcol = 1 refers to Val1, 2 refers to Val2, and 3 refers to Val3).

    Val1    Val2    Val3    bestcol  Final
    1.1     2.1     3.1     1        1.0
    11.1    22.1    33.1    2        22.0
    111.1   222.1   333.1   3        333.0

My initial approach involved looping through the table row-by-row:

    bestcol_list = []
    for i in range(len(df)):
        bestVal = round(df.iloc[i, df['bestcol'][i]], 0)
        bestcol_list.append(bestVal)
    df['Final'] = bestcol_list

My data has a few million records, so this was a time-consuming process. My next approach used apply:

    bestcol_list = df.apply(lambda row: round(row[row['bestcol']], 0), axis=1)
    df['Final'] = bestcol_list

This turned out to be a bit slower than just looping through the table. Is there a vectorized approach to solving this problem that I'm not considering?

Thanks!

Answer 1

Score: 2

You can use numpy indexing:

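    import numpy as np

    row = np.arange(len(df))                # one row position per record
    col = df['bestcol'].values - 1          # bestcol is 1-based, NumPy is 0-based
    x = df.filter(like='Val').values        # or df.iloc[:, :3].values
    df['Final'] = np.round(x[row, col])     # picks x[i, col[i]] for every row i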

Output:

    >>> df
        Val1   Val2   Val3  bestcol  Final
    0    1.1    2.1    3.1        1    1.0
    1   11.1   22.1   33.1        2   22.0
    2  111.1  222.1  333.1        3  333.0

Performance for 5_000_000 rows and 100 columns:

    M = 5_000_000
    N = 100
    x = np.random.uniform(1, 500, (M, N))
    row = np.arange(M)
    col = np.random.randint(1, N+1, M) - 1
    %timeit np.round(x[row, col])
    75.5 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
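For what it's worth, the same per-row lookup can also be written with np.take_along_axis. This is only a sketch of an alternative spelling, not something benchmarked in this thread; the sample frame is rebuilt here from the question's table so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table
df = pd.DataFrame({
    'Val1': [1.1, 11.1, 111.1],
    'Val2': [2.1, 22.1, 222.1],
    'Val3': [3.1, 33.1, 333.1],
    'bestcol': [1, 2, 3],
})

x = df[['Val1', 'Val2', 'Val3']].to_numpy()
col = df['bestcol'].to_numpy() - 1   # 1-based -> 0-based column positions

# take_along_axis needs indices with the same number of dimensions as x,
# so col is reshaped to (n_rows, 1); the result is then flattened back to 1-D.
picked = np.take_along_axis(x, col[:, None], axis=1)[:, 0]
df['Final'] = np.round(picked)
print(df)
```

Both forms pick exactly one element per row, so which one to use is mostly a matter of taste.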

Answer 2

Score: 0

I used nested np.where calls to handle the three conditions, then applied np.round.

    df['Final'] = np.round(np.where(df.bestcol == 1, df.Val1, np.where(df.bestcol == 2, df.Val2, df.Val3)), 0)
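Nesting np.where gets unwieldy once there are more than a few value columns. A possible generalization, sketched here with np.select, builds one condition per column; the val_cols list and the NaN default for unmatched rows are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table
df = pd.DataFrame({
    'Val1': [1.1, 11.1, 111.1],
    'Val2': [2.1, 22.1, 222.1],
    'Val3': [3.1, 33.1, 333.1],
    'bestcol': [1, 2, 3],
})

val_cols = ['Val1', 'Val2', 'Val3']   # assumed order: bestcol == k means val_cols[k - 1]

# One boolean mask and one candidate array per value column
conditions = [(df['bestcol'] == k).to_numpy() for k in range(1, len(val_cols) + 1)]
choices = [df[c].to_numpy() for c in val_cols]

# Rows whose bestcol matches no condition get NaN instead of a silent fallback
df['Final'] = np.round(np.select(conditions, choices, default=np.nan))
print(df)
```

Note that this evaluates one mask per column, so with many columns it does more work than a single indexed lookup.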

Answer 3

Score: 0

A simple and fast Pythonic way:

    import pandas as pd

    df = pd.DataFrame({'Val1': [1.1, 11.1, 111.1],
                       'Val2': [2.1, 22.1, 222.1],
                       'Val3': [3.1, 33.1, 333.1],
                       'bestcol': [1, 2, 3],
                       })
    # For each row i, take the value at 0-based column position p and round it
    df['Final'] = [round(df[['Val1', 'Val2', 'Val3']].iloc[i, p])
                   for i, p in enumerate(df.bestcol.sub(1).tolist())]
    print(df)

Result

        Val1   Val2   Val3  bestcol  Final
    0    1.1    2.1    3.1        1      1
    1   11.1   22.1   33.1        2     22
    2  111.1  222.1  333.1        3    333

Answer 4

Score: 0

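Since your bestcol contains the ordinal positions of the actual columns in row order (row i points at column i + 1, as in your sample), the wanted values lie on the diagonal, so you can apply numpy.diag: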

    df['Final'] = np.round(np.diag(df[df.columns[:-1]]))

           1      2      3  bestcol  Final
    0    1.1    2.1    3.1        1    1.0
    1   11.1   22.1   33.1        2   22.0
    2  111.1  222.1  333.1        3  333.0
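One caveat worth checking before relying on this: np.diag returns element (i, i), so it matches the bestcol lookup only when row i has bestcol == i + 1. The sketch below uses a made-up bestcol order (3, 1, 2), which is not from the question, to show the two results diverging.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Val1': [1.1, 11.1, 111.1],
    'Val2': [2.1, 22.1, 222.1],
    'Val3': [3.1, 33.1, 333.1],
    'bestcol': [3, 1, 2],   # hypothetical order, no longer matching the row position
})

x = df[['Val1', 'Val2', 'Val3']].to_numpy()

# Diagonal: always x[0, 0], x[1, 1], x[2, 2], regardless of bestcol
diag_result = np.round(np.diag(x))

# Per-row lookup driven by bestcol (1-based -> 0-based)
lookup_result = np.round(x[np.arange(len(df)), df['bestcol'].to_numpy() - 1])

print(diag_result)     # values from the diagonal
print(lookup_result)   # values actually selected by bestcol
```

For a non-square table, np.diag also returns only min(rows, columns) values, so it would not line up with a frame of millions of rows and three value columns.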

Answer 5

Score: 0

Try this:

    df['Final'] = df.values[df.index, df.bestcol-1].round(0)
    print(df)

Output:

        Val1   Val2   Val3  bestcol  Final
    0    1.1    2.1    3.1        1    1.0
    1   11.1   22.1   33.1        2   22.0
    2  111.1  222.1  333.1        3  333.0
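The one-liner uses df.index as row positions, which works for the default 0..n-1 index. If the frame has been filtered or re-indexed, positions from np.arange are a safer bet. A minimal sketch, assuming a frame with made-up index labels 10, 20, 30:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Val1': [1.1, 11.1, 111.1],
    'Val2': [2.1, 22.1, 222.1],
    'Val3': [3.1, 33.1, 333.1],
    'bestcol': [1, 2, 3],
}, index=[10, 20, 30])   # hypothetical non-default index labels

values = df[['Val1', 'Val2', 'Val3']].to_numpy()
rows = np.arange(len(df))               # positions 0..n-1, independent of the labels
cols = df['bestcol'].to_numpy() - 1     # 1-based -> 0-based

df['Final'] = values[rows, cols].round(0)
print(df)
```

Selecting only the value columns also keeps the bestcol column itself out of the array being indexed.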

huangapple
  • Posted on April 4, 2023, 03:23:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75923081.html