英文:
data frame per row select value from column based on value in row efficiently
问题
以下是翻译的内容:
The current set up of the problem is as follows:
Run_or_swim | run_time | swim_time | time_care_about |
---|---|---|---|
run | 1.5 | 18 | 1.5 |
swim | 2 | 22.2 | 22.2 |
如何通过可变选择run_time和swim_time列中的值,基于run_or_swim列中的值来选择要放入'time_care_about'中的值,以高效的方式。
I'm not great at using Pandas so my current implementation of achieving creating the 'time_care_about' column is
df.apply(lambda x: x[f'{x['run_or_swim']}_time'], axis=1)
So per row it will check run_or_swim, to see if it is 'run' or 'swim', it will then look in the relevant column for the time and then we create the last column. This is very slow as more columns are added, so is there a more efficient way of performing this? Or have I set up the problem poorly and abusing dataframes functionality.
e.g. row 1
{x['run_or_swim']} = 'run'
f'{x['run_or_swim']}_time' = 'run_time'
x[f'{x['run_or_swim']}_time'] = x['run_time'] = 1.5
The example extends to ~8 columns this needs to be done for e.g. run_distance, swim_distance, run_elevation, swim_elevation.
Using the suggestions below suggestions actually caused the time to be slower? https://stackoverflow.com/questions/65568003/pandas-select-row-value-from-specific-column-based-on-value-from-other-columns
Thanks for any suggestions or links to relevant posts!
通过请求(和我的错误)扩展问题描述:
Run_or_swim | run_time | swim_time | time_care_about | run_distance | swim_distance | distance_care_about |
---|---|---|---|---|---|---|
run | 1.5 | 18 | 1.5 | 123 | 789 | 123 |
swim | 2 | 22.2 | 22.2 | 543 | 1111 | 1111 |
对于这个例子,在我的使用案例中,迭代多个指标约为500,apply的性能比广播更快。apply约为1.12秒,广播约为1.45秒。
然后我使用的设置是:
for metric in [metric list]:
df.apply(lambda x: x[f'{x['run_or_swim']}_{metric}'], axis=1)
和
for metric in [metric_list]:
col_idx = (df.columns.to_numpy() == df[f"Run_or_swim"].add(
f'_{metric}').to_numpy()[:, None]).argmax(axis=1)
df[f"{metric}_care_about"] =df.values[df.index, col_idx]
是否也可能对指标列表进行广播?
英文:
The current set up of the problem is as follows:
Run_or_swim | run_time | swim_time | time_care_about |
---|---|---|---|
run | 1.5 | 18 | 1.5 |
swim | 2 | 22.2 | 22.2 |
How can I select the value to go into 'time_care_about' by variably selecting from the columns run_time and swim_time based on the value within run_or_swim in a efficient manner.
I'm not great at using Pandas so my current implementation of achieving creating the 'time_care_about' column is
df.apply(lambda x: x[f'{x['run_or_swim']}_time'],axis=1)
So per row it will check run_or_swim, to see if it is 'run' or 'swim', it will then look in the relevant column for the time and then we create the last column.
This is very slow as more columns are added, so is there a more efficient way of performing this? Or have I set up the problem poorly and abusing dataframes functionality.
e.g.
row 1
{x['run_or_swim']} = 'run'
f'{x['run_or_swim']}_time' = 'run_time'
x[f'{x['run_or_swim']}_time'] = x['run_time'] = 1.5
The example extends to ~8 columns this needs to be done for
e.g. run_distance, swim_distance,
run_elevation, swim_elevation
Using the suggestions below suggestions actually caused the time to be slower?
https://stackoverflow.com/questions/65568003/pandas-select-row-value-from-specific-column-based-on-value-from-other-columns
Thanks for any suggestions or links to relevant posts!
By request (and my mistake) extending the problem description:
Run_or_swim | run_time | swim_time | time_care_about | run_distance | swim_distance | distance_care_about |
---|---|---|---|---|---|---|
run | 1.5 | 18 | 1.5 | 123 | 789 | 123 |
swim | 2 | 22.2 | 22.2 | 543 | 1111 | 1111 |
For this example where iterating over multiple metrics ~ 500 in my use case , apply ends up performing faster than broadcasting.
apply ~ 1.12s
broadcasting ~ 1.45s
The set up I then use is:
for metric in [metric list]:
df.apply(lambda x: x[f'{x['run_or_swim']}_{metric}'],axis=1)
and
for metric in [metric_list]:
col_idx = (df.columns.to_numpy() == df[f"Run_or_swim"].add(
f'_{metric}').to_numpy()[:, None]).argmax(axis=1)
df[f"{metric}_care_about"] =df.values[df.index, col_idx]
Is it also possible to broadcast over the list of metrics?
答案1
得分: 1
你可以使用numpy和广播:
col_idx = (df.columns.to_numpy() == df['Run_or_swim'].add('_time').to_numpy()[:, None]).argmax(axis=1)
df['time_care_about'] = df.values[df.index, col_idx]
print(df)
# 输出
Run_or_swim run_time swim_time time_care_about
0 run 1.5 18.0 1.5
1 swim 2.0 22.2 22.2
英文:
You can use numpy and broadcasting:
col_idx = (df.columns.to_numpy() == df['Run_or_swim'].add('_time').to_numpy()[:, None]).argmax(axis=1)
df['time_care_about'] = df.values[df.index, col_idx]
print(df)
# Output
Run_or_swim run_time swim_time time_care_about
0 run 1.5 18.0 1.5
1 swim 2.0 22.2 22.2
答案2
得分: 1
使用 索引查找 的变种方法:
idx, cols = pd.factorize(df['Run_or_swim']+'_time')
df['time_care_about'] = (df.reindex(cols, axis=1).to_numpy()
[np.arange(len(df)), idx]
)
移除原始列中的 _time
变种:
idx, cols = pd.factorize(df['Run_or_swim'])
df['time_care_about'] = (df.set_axis(df.columns.str.removesuffix('_time'), axis=1)
.reindex(cols, axis=1).to_numpy()
[np.arange(len(df)), idx]
)
输出:
Run_or_swim run_time swim_time time_care_about
0 run 1.5 18.0 1.5
1 swim 2.0 22.2 22.2
英文:
Using a variation of the indexing lookup:
idx, cols = pd.factorize(df['Run_or_swim']+'_time')
df['time_care_about'] = (df.reindex(cols, axis=1).to_numpy()
[np.arange(len(df)), idx]
)
Variant removing _time
for the original columns:
idx, cols = pd.factorize(df['Run_or_swim'])
df['time_care_about'] = (df.set_axis(df.columns.str.removesuffix('_time'), axis=1)
.reindex(cols, axis=1).to_numpy()
[np.arange(len(df)), idx]
)
Output:
Run_or_swim run_time swim_time time_care_about
0 run 1.5 18.0 1.5
1 swim 2.0 22.2 22.2
答案3
得分: 0
从我的测试中,你的解决方案并不差。我进行了一些测试,以下是在一个包含10000行的随机生成的数据帧上测试的另外三个选项:
测试0是使用你的解决方案:
Test 0
time_0: 0.109017849
测试1是使用循环的方式:
df1['time_care_about'] = np.nan
for i in range(len(df1)):
if df1['run_or_swim'][i] == 'run':
df1['time_care_about'][i] = df1['run_time'][i]
else:
df1['time_care_about'][i] = df1['swim_time'][i]
# time_1: 2.9705018997
正如预期的那样,与第一个选项相比,速度非常慢。
另一种方法仍然比你的慢,用列表推导式的方式,时间是0.1215393543秒:
df2['time_care_about'] = [df2['run_time'][i] if df2['run_or_swim'][i] == 'run' else df2['swim_time'][i] for i in range(len(df2)]
最后,远远更快的方法是使用NumPy的where
语句:
df3['time_care_about'] = np.where(df3['run_or_swim'] == 'run', df3['run_time'], df3['swim_time'])
时间为0.0020861626秒。
英文:
From what I could test, your solution is not that bad I did some tests, and here are 3 other options tested on a random generated data frame with 10000 rows:
The test 0 was using your solution:
Test 0
time_0: 0.109017849
The test 1 was with a loop:
df1['time_care_about'] = np.nan
for i in range(len(df1)):
if df1['run_or_swim'][i] == 'run':
df1['time_care_about'][i] = df1['run_time'][i]
else:
df1['time_care_about'][i] = df1['swim_time'][i]
# time_1: 2.9705018997
As expected very slow compared to the first option.
The other approach is still slower than yours with 0.1215393543s
that is using list comprehension:
df2['time_care_about'] = [df2['run_time'][i] if df2['run_or_swim'][i] == 'run' else df2['swim_time'][i] for i in range(len(df2))]
Last the faster approach by far was using numpy where
statement:
df3['time_care_about'] = np.where(df3['run_or_swim'] == 'run', df3['run_time'], df3['swim_time'])
With a time of 0.0020861626s
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论