2023年2月16日 02:01:53go评论101阅读模式

英文:

data frame per row select value from column based on value in row efficiently

问题

以下是翻译的内容：

The current set up of the problem is as follows:

Run_or_swim	run_time	swim_time	time_care_about
run	1.5	18	1.5
swim	2	22.2	22.2

如何通过可变选择run_time和swim_time列中的值，基于run_or_swim列中的值来选择要放入'time_care_about'中的值，以高效的方式。

I'm not great at using Pandas so my current implementation of achieving creating the 'time_care_about' column is

df.apply(lambda x: x[f'{x['run_or_swim']}_time'], axis=1)

So per row it will check run_or_swim, to see if it is 'run' or 'swim', it will then look in the relevant column for the time and then we create the last column. This is very slow as more columns are added, so is there a more efficient way of performing this? Or have I set up the problem poorly and abusing dataframes functionality.

e.g. row 1

{x['run_or_swim']} = 'run'
f'{x['run_or_swim']}_time' = 'run_time'
x[f'{x['run_or_swim']}_time'] = x['run_time'] = 1.5

The example extends to ~8 columns this needs to be done for e.g. run_distance, swim_distance, run_elevation, swim_elevation.

Using the suggestions below suggestions actually caused the time to be slower? https://stackoverflow.com/questions/65568003/pandas-select-row-value-from-specific-column-based-on-value-from-other-columns

Thanks for any suggestions or links to relevant posts!

通过请求（和我的错误）扩展问题描述：

Run_or_swim	run_time	swim_time	time_care_about	run_distance	swim_distance	distance_care_about
run	1.5	18	1.5	123	789	123
swim	2	22.2	22.2	543	1111	1111

对于这个例子，在我的使用案例中，迭代多个指标约为500，apply的性能比广播更快。apply约为1.12秒，广播约为1.45秒。

然后我使用的设置是：

for metric in [metric list]:
   df.apply(lambda x: x[f'{x['run_or_swim']}_{metric}'], axis=1)

和

for metric in [metric_list]:
   col_idx = (df.columns.to_numpy() == df[f"Run_or_swim"].add(
                        f'_{metric}').to_numpy()[:, None]).argmax(axis=1)
                    
 
    df[f"{metric}_care_about"] =df.values[df.index, col_idx]

是否也可能对指标列表进行广播？

英文:

The current set up of the problem is as follows:

Run_or_swim	run_time	swim_time	time_care_about
run	1.5	18	1.5
swim	2	22.2	22.2

How can I select the value to go into 'time_care_about' by variably selecting from the columns run_time and swim_time based on the value within run_or_swim in a efficient manner.

I'm not great at using Pandas so my current implementation of achieving creating the 'time_care_about' column is

df.apply(lambda x: x[f&#39;{x[&#39;run_or_swim&#39;]}_time&#39;],axis=1)

So per row it will check run_or_swim, to see if it is 'run' or 'swim', it will then look in the relevant column for the time and then we create the last column.
This is very slow as more columns are added, so is there a more efficient way of performing this? Or have I set up the problem poorly and abusing dataframes functionality.

e.g.
row 1

{x[&#39;run_or_swim&#39;]} = &#39;run&#39;
f&#39;{x[&#39;run_or_swim&#39;]}_time&#39; = &#39;run_time&#39;
x[f&#39;{x[&#39;run_or_swim&#39;]}_time&#39;] = x[&#39;run_time&#39;] = 1.5

The example extends to ~8 columns this needs to be done for
e.g. run_distance, swim_distance,
run_elevation, swim_elevation

Using the suggestions below suggestions actually caused the time to be slower?
https://stackoverflow.com/questions/65568003/pandas-select-row-value-from-specific-column-based-on-value-from-other-columns

Thanks for any suggestions or links to relevant posts!

By request (and my mistake) extending the problem description:

Run_or_swim	run_time	swim_time	time_care_about	run_distance	swim_distance	distance_care_about
run	1.5	18	1.5	123	789	123
swim	2	22.2	22.2	543	1111	1111

For this example where iterating over multiple metrics ~ 500 in my use case , apply ends up performing faster than broadcasting.
apply ~ 1.12s
broadcasting ~ 1.45s

The set up I then use is:

for metric in [metric list]:
   df.apply(lambda x: x[f&#39;{x[&#39;run_or_swim&#39;]}_{metric}&#39;],axis=1)

and

for metric in [metric_list]:
   col_idx = (df.columns.to_numpy() == df[f&quot;Run_or_swim&quot;].add(
                        f&#39;_{metric}&#39;).to_numpy()[:, None]).argmax(axis=1)
                    
 
 
    df[f&quot;{metric}_care_about&quot;] =df.values[df.index, col_idx]

Is it also possible to broadcast over the list of metrics?

答案1

得分: 1

你可以使用numpy和广播：

col_idx = (df.columns.to_numpy() == df['Run_or_swim'].add('_time').to_numpy()[:, None]).argmax(axis=1)
df['time_care_about'] = df.values[df.index, col_idx]
print(df)
# 输出
  Run_or_swim  run_time  swim_time time_care_about
0         run       1.5       18.0             1.5
1        swim       2.0       22.2            22.2

英文:

You can use numpy and broadcasting:

col_idx = (df.columns.to_numpy() == df[&#39;Run_or_swim&#39;].add(&#39;_time&#39;).to_numpy()[:, None]).argmax(axis=1)
df[&#39;time_care_about&#39;] = df.values[df.index, col_idx]
print(df)
# Output
  Run_or_swim  run_time  swim_time time_care_about
0         run       1.5       18.0             1.5
1        swim       2.0       22.2            22.2

答案2

得分: 1

使用索引查找的变种方法：

idx, cols = pd.factorize(df['Run_or_swim']+'_time')
df['time_care_about'] = (df.reindex(cols, axis=1).to_numpy()
                         [np.arange(len(df)), idx]
                         )

移除原始列中的 _time 变种：

idx, cols = pd.factorize(df['Run_or_swim'])
df['time_care_about'] = (df.set_axis(df.columns.str.removesuffix('_time'), axis=1)
                           .reindex(cols, axis=1).to_numpy()
                         [np.arange(len(df)), idx]
                        )

输出：

  Run_or_swim  run_time  swim_time  time_care_about
0         run       1.5       18.0              1.5
1        swim       2.0       22.2             22.2

英文:

Using a variation of the indexing lookup:

idx, cols = pd.factorize(df[&#39;Run_or_swim&#39;]+&#39;_time&#39;)
df[&#39;time_care_about&#39;] = (df.reindex(cols, axis=1).to_numpy()
                         [np.arange(len(df)), idx]
                         )

Variant removing _time for the original columns:

idx, cols = pd.factorize(df[&#39;Run_or_swim&#39;])
df[&#39;time_care_about&#39;] = (df.set_axis(df.columns.str.removesuffix(&#39;_time&#39;), axis=1)
                           .reindex(cols, axis=1).to_numpy()
                         [np.arange(len(df)), idx]
                        )

Output:

  Run_or_swim  run_time  swim_time  time_care_about
0         run       1.5       18.0              1.5
1        swim       2.0       22.2             22.2

答案3

得分: 0

从我的测试中，你的解决方案并不差。我进行了一些测试，以下是在一个包含10000行的随机生成的数据帧上测试的另外三个选项：

测试0是使用你的解决方案：

Test 0
time_0:  0.109017849

测试1是使用循环的方式：

df1['time_care_about'] = np.nan
for i in range(len(df1)):
    if df1['run_or_swim'][i] == 'run':
        df1['time_care_about'][i] = df1['run_time'][i]
    else:
        df1['time_care_about'][i] = df1['swim_time'][i]
# time_1:  2.9705018997

正如预期的那样，与第一个选项相比，速度非常慢。

另一种方法仍然比你的慢，用列表推导式的方式，时间是0.1215393543秒：

df2['time_care_about'] = [df2['run_time'][i] if df2['run_or_swim'][i] == 'run' else df2['swim_time'][i] for i in range(len(df2)]

最后，远远更快的方法是使用NumPy的where语句：

df3['time_care_about'] = np.where(df3['run_or_swim'] == 'run', df3['run_time'], df3['swim_time'])

时间为0.0020861626秒。

英文:

From what I could test, your solution is not that bad I did some tests, and here are 3 other options tested on a random generated data frame with 10000 rows:

The test 0 was using your solution:

Test 0
time_0:  0.109017849

The test 1 was with a loop:

df1[&#39;time_care_about&#39;] = np.nan
for i in range(len(df1)):
    if df1[&#39;run_or_swim&#39;][i] == &#39;run&#39;:
        df1[&#39;time_care_about&#39;][i] = df1[&#39;run_time&#39;][i]
    else:
        df1[&#39;time_care_about&#39;][i] = df1[&#39;swim_time&#39;][i]
# time_1:  2.9705018997

As expected very slow compared to the first option.

The other approach is still slower than yours with 0.1215393543s that is using list comprehension:

df2[&#39;time_care_about&#39;] = [df2[&#39;run_time&#39;][i] if df2[&#39;run_or_swim&#39;][i] == &#39;run&#39; else df2[&#39;swim_time&#39;][i] for i in range(len(df2))]

Last the faster approach by far was using numpy where statement:

df3[&#39;time_care_about&#39;] = np.where(df3[&#39;run_or_swim&#39;] == &#39;run&#39;, df3[&#39;run_time&#39;], df3[&#39;swim_time&#39;])

With a time of 0.0020861626s

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

数据帧每行根据行中的值高效地选择列中的值。

问题

答案1

答案2

答案3

Difference between pathlib.Path().resolve() and pathlib.Path().parent

如何使用Go语言验证明文密码是否与盐值MD5密码相同？

Python解析没有键的JSON。

solve_ivp卡住了，我搞不清楚为什么。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。