英文:
pandas function to find index of the first future instance where column is less than each row's value
问题
我是新来的Stack Overflow用户,我有一个关于在pandas中解决问题的问题。我想要创建一个函数,该函数返回第一个未来实例的索引,其中某一列小于该列的每一行的值。
例如,考虑以下数据框:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]}, index=np.arange(0, 9))
df
我在寻找的输出是:
| Index | F(Val) |
| -------- | -------- |
| 0 | 4 |
| 1 | 4 |
| 2 | 4 |
| 3 | 4 |
| 4 | 6 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | NaN |
或等效于F(Val)的系列/数组。
我已经能够很容易地使用for循环解决这个问题,但显然在我正在处理的大型数据集上非常慢,并且不是非常优雅或最佳的解决方案。我希望解决方案是一个使用向量化的高效pandas函数。
此外,作为一个奖励问题(如果有人能提供帮助),如何使用向量化计算每行的索引和F(Val)索引之间的最大值?输出应如下所示:
| Index | G(Val) |
| -------- | -------- |
| 0 | 4 |
| 1 | 4 |
| 2 | 4 |
| 3 | 4 |
| 4 | 1 |
| 5 | 1 |
| 6 | -1 |
| 7 | -2 |
| 8 | NaN |
谢谢!
英文:
I'm new to Stack Overflow, and I just have a question about solving a problem in pandas. I am looking to create a function that returns the index of the first future instance where a column is less than each row's value for that column.
For example, consider the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]}, index = np.arange(0,9))
df
Index | Val |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 0 |
5 | 1 |
6 | -1 |
7 | -2 |
8 | -3 |
I am looking for the output:
Index | F(Val) |
---|---|
0 | 4 |
1 | 4 |
2 | 4 |
3 | 4 |
4 | 6 |
5 | 6 |
6 | 7 |
7 | 8 |
8 | NaN |
Or the series/array equivalent of F(Val).
I've been able to solve this quite easily using for loops, but obviously this is extremely slow on the large dataset I am working with an not a very elegant or optimal solution. My hope is that the solution is an efficient pandas function that employs vectorization.
Also, as a bonus question (if anyone can assist), how might the maximum value between each row's index and the F(Val) index be computed using vectorization? The output should look like:
Index | G(Val) |
---|---|
0 | 4 |
1 | 4 |
2 | 4 |
3 | 4 |
4 | 1 |
5 | 1 |
6 | -1 |
7 | -2 |
8 | NaN |
Thanks!
答案1
得分: 1
你可以使用以下代码:
grp = df['Val'].lt(df['Val'].shift()).shift(fill_value=0).cumsum()
df['F(Val)'] = df.groupby(grp).transform(lambda x: x.index[-1]).shift(-1)
print(df)
# 输出结果
Val F(Val)
0 1 4.0
1 2 4.0
2 3 4.0
3 4 4.0
4 0 6.0
5 1 6.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
注意:这是代码的翻译部分。
英文:
You can use:
grp = df['Val'].lt(df['Val'].shift()).shift(fill_value=0).cumsum()
df['F(Val)'] = df.groupby(grp).transform(lambda x: x.index[-1]).shift(-1)
print(df)
# Output
Val F(Val)
0 1 4.0
1 2 4.0
2 3 4.0
3 4 4.0
4 0 6.0
5 1 6.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
答案2
得分: 1
# 使用 [tag:numpy] 广播和下三角矩阵:
a = df['Val'].to_numpy()
m = np.tril(a[:, None] <= a, k=-1)
df['F(Val)'] = np.where(m.any(0), m.argmax(0), np.nan)
# 与 [`expanding`]() 的相同逻辑:
df['F(Val)'] = (df.loc[::-1, 'Val'].expanding()
.apply(lambda x: s.idxmax() if len(s:=(x.iloc[-2::-1] <= x.iloc[-1]))
else np.nan)
)
# 输出(与提供的输出有差异):
Val F(Val)
0 1 5.0 # 这里下一个是5
1 2 4.0
2 3 4.0
3 4 4.0
4 2 5.0
5 -2 7.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
英文:
Using [tag:numpy] broadcasting and the lower triangle:
a = df['Val'].to_numpy()
m = np.tril(a[:,None]<=a, k=-1)
df['F(Val)'] = np.where(m.any(0), m.argmax(0), np.nan)
Same logic with expanding
:
df['F(Val)'] = (df.loc[::-1, 'Val'].expanding()
.apply(lambda x: s.idxmax() if len(s:=(x.iloc[-2::-1]<=x.iloc[-1]))
else np.nan)
)
Output (with a difference to the provided one):
Val F(Val)
0 1 5.0 # here the next is 5
1 2 4.0
2 3 4.0
3 4 4.0
4 2 5.0
5 -2 7.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论