数据帧每行根据行中的值高效地选择列中的值。

huangapple go评论59阅读模式
英文:

data frame per row select value from column based on value in row efficiently

问题

以下是翻译的内容:

The current set up of the problem is as follows:

Run_or_swim run_time swim_time time_care_about
run 1.5 18 1.5
swim 2 22.2 22.2

如何通过可变选择run_time和swim_time列中的值,基于run_or_swim列中的值来选择要放入'time_care_about'中的值,以高效的方式。

I'm not great at using Pandas so my current implementation of achieving creating the 'time_care_about' column is

df.apply(lambda x: x[f'{x['run_or_swim']}_time'], axis=1)

So per row it will check run_or_swim, to see if it is 'run' or 'swim', it will then look in the relevant column for the time and then we create the last column. This is very slow as more columns are added, so is there a more efficient way of performing this? Or have I set up the problem poorly and abusing dataframes functionality.

e.g. row 1

{x['run_or_swim']} = 'run'
f'{x['run_or_swim']}_time' = 'run_time'
x[f'{x['run_or_swim']}_time'] = x['run_time'] = 1.5

The example extends to ~8 columns this needs to be done for e.g. run_distance, swim_distance, run_elevation, swim_elevation.

Using the suggestions below suggestions actually caused the time to be slower? https://stackoverflow.com/questions/65568003/pandas-select-row-value-from-specific-column-based-on-value-from-other-columns

Thanks for any suggestions or links to relevant posts!


通过请求(和我的错误)扩展问题描述:

Run_or_swim run_time swim_time time_care_about run_distance swim_distance distance_care_about
run 1.5 18 1.5 123 789 123
swim 2 22.2 22.2 543 1111 1111

对于这个例子,在我的使用案例中,迭代多个指标约为500,apply的性能比广播更快。apply约为1.12秒,广播约为1.45秒。

然后我使用的设置是:

for metric in [metric list]:
   df.apply(lambda x: x[f'{x['run_or_swim']}_{metric}'], axis=1)

for metric in [metric_list]:
   col_idx = (df.columns.to_numpy() == df[f"Run_or_swim"].add(
                        f'_{metric}').to_numpy()[:, None]).argmax(axis=1)
                    
 
    df[f"{metric}_care_about"] =df.values[df.index, col_idx]

是否也可能对指标列表进行广播?

英文:

The current set up of the problem is as follows:

Run_or_swim run_time swim_time time_care_about
run 1.5 18 1.5
swim 2 22.2 22.2

How can I select the value to go into 'time_care_about' by variably selecting from the columns run_time and swim_time based on the value within run_or_swim in a efficient manner.

I'm not great at using Pandas so my current implementation of achieving creating the 'time_care_about' column is

df.apply(lambda x: x[f'{x['run_or_swim']}_time'],axis=1)

So per row it will check run_or_swim, to see if it is 'run' or 'swim', it will then look in the relevant column for the time and then we create the last column.
This is very slow as more columns are added, so is there a more efficient way of performing this? Or have I set up the problem poorly and abusing dataframes functionality.

e.g.
row 1

{x['run_or_swim']} = 'run'
f'{x['run_or_swim']}_time' = 'run_time'
x[f'{x['run_or_swim']}_time'] = x['run_time'] = 1.5

The example extends to ~8 columns this needs to be done for
e.g. run_distance, swim_distance,
run_elevation, swim_elevation

Using the suggestions below suggestions actually caused the time to be slower?
https://stackoverflow.com/questions/65568003/pandas-select-row-value-from-specific-column-based-on-value-from-other-columns

Thanks for any suggestions or links to relevant posts!


By request (and my mistake) extending the problem description:

Run_or_swim run_time swim_time time_care_about run_distance swim_distance distance_care_about
run 1.5 18 1.5 123 789 123
swim 2 22.2 22.2 543 1111 1111

For this example where iterating over multiple metrics ~ 500 in my use case , apply ends up performing faster than broadcasting.
apply ~ 1.12s
broadcasting ~ 1.45s

The set up I then use is:

for metric in [metric list]:
   df.apply(lambda x: x[f'{x['run_or_swim']}_{metric}'],axis=1)

and

for metric in [metric_list]:
   col_idx = (df.columns.to_numpy() == df[f"Run_or_swim"].add(
                        f'_{metric}').to_numpy()[:, None]).argmax(axis=1)
                    
 
 
    df[f"{metric}_care_about"] =df.values[df.index, col_idx]

Is it also possible to broadcast over the list of metrics?

答案1

得分: 1

你可以使用numpy和广播:

col_idx = (df.columns.to_numpy() == df['Run_or_swim'].add('_time').to_numpy()[:, None]).argmax(axis=1)

df['time_care_about'] = df.values[df.index, col_idx]
print(df)

# 输出
  Run_or_swim  run_time  swim_time time_care_about
0         run       1.5       18.0             1.5
1        swim       2.0       22.2            22.2
英文:

You can use numpy and broadcasting:

col_idx = (df.columns.to_numpy() == df['Run_or_swim'].add('_time').to_numpy()[:, None]).argmax(axis=1)

df['time_care_about'] = df.values[df.index, col_idx]
print(df)

# Output
  Run_or_swim  run_time  swim_time time_care_about
0         run       1.5       18.0             1.5
1        swim       2.0       22.2            22.2

答案2

得分: 1

使用 索引查找 的变种方法:

idx, cols = pd.factorize(df['Run_or_swim']+'_time')

df['time_care_about'] = (df.reindex(cols, axis=1).to_numpy()
                         [np.arange(len(df)), idx]
                         )

移除原始列中的 _time 变种:

idx, cols = pd.factorize(df['Run_or_swim'])

df['time_care_about'] = (df.set_axis(df.columns.str.removesuffix('_time'), axis=1)
                           .reindex(cols, axis=1).to_numpy()
                         [np.arange(len(df)), idx]
                        )

输出:

  Run_or_swim  run_time  swim_time  time_care_about
0         run       1.5       18.0              1.5
1        swim       2.0       22.2             22.2
英文:

Using a variation of the indexing lookup:

idx, cols = pd.factorize(df['Run_or_swim']+'_time')

df['time_care_about'] = (df.reindex(cols, axis=1).to_numpy()
                         [np.arange(len(df)), idx]
                         )

Variant removing _time for the original columns:

idx, cols = pd.factorize(df['Run_or_swim'])

df['time_care_about'] = (df.set_axis(df.columns.str.removesuffix('_time'), axis=1)
                           .reindex(cols, axis=1).to_numpy()
                         [np.arange(len(df)), idx]
                        )

Output:

  Run_or_swim  run_time  swim_time  time_care_about
0         run       1.5       18.0              1.5
1        swim       2.0       22.2             22.2

答案3

得分: 0

从我的测试中,你的解决方案并不差。我进行了一些测试,以下是在一个包含10000行的随机生成的数据帧上测试的另外三个选项:

测试0是使用你的解决方案:

Test 0
time_0:  0.109017849

测试1是使用循环的方式:

df1['time_care_about'] = np.nan
for i in range(len(df1)):
    if df1['run_or_swim'][i] == 'run':
        df1['time_care_about'][i] = df1['run_time'][i]
    else:
        df1['time_care_about'][i] = df1['swim_time'][i]
# time_1:  2.9705018997

正如预期的那样,与第一个选项相比,速度非常慢。

另一种方法仍然比你的慢,用列表推导式的方式,时间是0.1215393543秒:

df2['time_care_about'] = [df2['run_time'][i] if df2['run_or_swim'][i] == 'run' else df2['swim_time'][i] for i in range(len(df2)]

最后,远远更快的方法是使用NumPy的where语句:

df3['time_care_about'] = np.where(df3['run_or_swim'] == 'run', df3['run_time'], df3['swim_time'])

时间为0.0020861626秒。

英文:

From what I could test, your solution is not that bad I did some tests, and here are 3 other options tested on a random generated data frame with 10000 rows:

The test 0 was using your solution:

Test 0
time_0:  0.109017849

The test 1 was with a loop:

df1['time_care_about'] = np.nan
for i in range(len(df1)):
    if df1['run_or_swim'][i] == 'run':
        df1['time_care_about'][i] = df1['run_time'][i]
    else:
        df1['time_care_about'][i] = df1['swim_time'][i]
# time_1:  2.9705018997

As expected very slow compared to the first option.

The other approach is still slower than yours with 0.1215393543s that is using list comprehension:

df2['time_care_about'] = [df2['run_time'][i] if df2['run_or_swim'][i] == 'run' else df2['swim_time'][i] for i in range(len(df2))]

Last the faster approach by far was using numpy where statement:

df3['time_care_about'] = np.where(df3['run_or_swim'] == 'run', df3['run_time'], df3['swim_time'])

With a time of 0.0020861626s

huangapple
  • 本文由 发表于 2023年2月16日 02:01:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/75463770.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定