Call a function on a pandas DataFrame using lagged values calculated in the previous row/loop


Question

I am calling a function row-wise on a pandas DataFrame using lagged values (for Q and S) that were calculated for the previous row. The first row already has values for Q and S, so the loop starts on the second row. It works fine in a for loop, but the DataFrame I'm ultimately applying it to has over 3000 rows, so I need something faster.

I've contemplated df.shift(-1), rolling.apply() and vectorising but nothing I've tried works.
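
Written out as a recurrence, the function below amounts to the following. This is only a restatement of myfunc, with E_t standing for peti, P_t for precip and S_max for smax_val; the subscript notation is added here for illustration and is not part of the original post.

```latex
\begin{aligned}
S_t &= S_{t-1} + E_t \, e^{-S_{t-1}/S_{\max}} - P_t + Q_{t-1} \\
Q_t &= P_t + S_t
\end{aligned}
```

Because S_{t-1} sits inside the exponential, each row is a nonlinear function of the previous row's result. That is probably why shift() and rolling.apply() do not help here: they only see values that are already in the column, not values that are still to be computed. Note also that shift(1), not shift(-1), is the call that brings the previous row's value down to the current row.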

import time
import pandas as pd
import math

def myfunc(Eo, P, Smax, Sprev, Qprev):
  print("i = ", i)
  print("Qprev = ", Qprev)
  S = Sprev + Eo * math.exp(-1 * Sprev/Smax) - P + Qprev
  Q = P + S
  print("Q = ", Q)
  return S, Q

data = {'peti': {0: 0.1960418075323104, 1: 0.5796640515327454, 2: 0.737823486328125, 3: 0.222676545381546, 4: 0.8804306983947754}, 'tas': {0: 281.0088195800781, 1: 277.112060546875, 2: 273.7044372558594, 3: 277.48309326171875, 4: 279.4878845214844}, 'precip': {0: 0.0, 1: 0.0, 2: 1.5046296539367177e-05, 3: 0.0002500000118743, 4: 4.6296295295178425e-06}, 'year': {0: 2008, 1: 2008, 2: 2008, 3: 2008, 4: 2008}, 'row_id': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}, 'S': {0: 90.9, 1: "nan", 2: "nan", 3: "nan", 4: "nan"}, 'Q': {0: 0.0, 1: "nan", 2: "nan", 3: "nan", 4: "nan"}}
    
df = pd.DataFrame.from_dict(data)

smax_val = 100


start_time = time.time()

for i in df.index[1:len(df)]:  # start on second row
  df.loc[i, ["S", "Q"]] = myfunc(
    df.peti[i],
    df.precip[i],
    smax_val,
    df.S[i-1],
    df.Q[i-1])

print("--- %s seconds ---" % (time.time() - start_time))

Answer 1

Score: 1

On my machine, your code takes 0.004 seconds per run on average, measured over 5,000 runs:

import numpy as np  # needed for np.mean below

N = 5_000
smax_val = 100
df = pd.DataFrame.from_dict(data)
times = []

for _ in range(N):
    start_time = time.time()
    for i in df.index[1 : len(df)]:  # start on second row
        df.loc[i, ["S", "Q"]] = myfunc(
            df.peti[i], df.precip[i], smax_val, df.S[i - 1], df.Q[i - 1]
        )
    times.append(time.time() - start_time)

print(f"--- {round(np.mean(times), 3)} second(s) on average for {N} runs ---")
print(df)

--- 0.004 seconds on average for 5000 runs ---
       peti         tas    precip  year  row_id           S           Q
0  0.196042  281.008820  0.000000  2008       0   90.900000    0.000000
1  0.579664  277.112061  0.000000  2008       1   91.133562   91.133562
2  0.737823  273.704437  0.000015  2008       2  182.563705  182.563720
3  0.222677  277.483093  0.000250  2008       3  365.163051  365.163301
4  0.880431  279.487885  0.000005  2008       4  730.349194  730.349199
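
As a side note on measuring such short run times: the standard library's timeit module may give steadier numbers than differences of time.time(). The sketch below is only an illustration. The step and run_once functions are hypothetical stand-ins with constant inputs, not the code timed above.

```python
import math
import timeit


def step(eo, p, smax, s_prev, q_prev):
    """One update of the recurrence: same arithmetic as myfunc, without prints."""
    s = s_prev + eo * math.exp(-s_prev / smax) - p + q_prev
    return s, p + s


def run_once():
    # Hypothetical stand-in workload: 5 updates with constant inputs.
    s, q = 90.9, 0.0
    for _ in range(5):
        s, q = step(0.5, 0.0, 100, s, q)
    return s, q


# Time 5 batches of 1000 calls each and keep the fastest batch,
# then divide by 1000 to get seconds per call.
best = min(timeit.repeat(run_once, repeat=5, number=1000)) / 1000
print(f"best of 5: {best:.2e} seconds per call")
```

Taking the minimum of several batches is a common way to reduce the influence of other processes on the measurement.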

One way to speed things up (about 4x on average on my machine) is to do the computation outside of pandas and add the results back in with pd.concat:

N = 5_000
smax_val = 100
df = pd.DataFrame.from_dict(data)
times = []

for _ in range(N):
    start_time = time.time()
    vals = [[90.9, 0.0]]
    S = 90.9
    Q = 0.0
    for i, (x, y) in enumerate(zip(df.loc[1:, "peti"], df.loc[1:, "precip"])):
        S, Q = myfunc(x, y, smax_val, S, Q)
        vals.append([S, Q])
    df = pd.concat(
        [df.drop(columns=["S", "Q"]), pd.DataFrame(vals, columns=["S", "Q"])], axis=1
    )
    times.append(time.time() - start_time)

print(f"--- {round(np.mean(times), 3)} second(s) on average for {N} runs ---")
print(df)

--- 0.001 seconds on average for 5000 runs ---
       peti         tas    precip  year  row_id           S           Q
0  0.196042  281.008820  0.000000  2008       0   90.900000    0.000000
1  0.579664  277.112061  0.000000  2008       1   91.133562   91.133562
2  0.737823  273.704437  0.000015  2008       2  182.563705  182.563720
3  0.222677  277.483093  0.000250  2008       3  365.163051  365.163301
4  0.880431  279.487885  0.000005  2008       4  730.349194  730.349199
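
The same idea can be sketched as a small standalone function that works on plain Python lists. This is a minimal sketch rather than the code timed above: run_recurrence is a hypothetical name, the prints are left out, and the input lists are copied from the peti and precip columns of the question.

```python
import math


def run_recurrence(peti, precip, smax, s0, q0):
    """Return lists of S and Q; element 0 is the given initial state."""
    s_vals, q_vals = [s0], [q0]
    s, q = s0, q0
    # Skip the first row: its S and Q are the initial state.
    for eo, p in zip(peti[1:], precip[1:]):
        # The right-hand side uses the previous s and q.
        s = s + eo * math.exp(-s / smax) - p + q
        q = p + s
        s_vals.append(s)
        q_vals.append(q)
    return s_vals, q_vals


peti = [0.1960418075323104, 0.5796640515327454, 0.737823486328125,
        0.222676545381546, 0.8804306983947754]
precip = [0.0, 0.0, 1.5046296539367177e-05, 0.0002500000118743,
          4.6296295295178425e-06]

s_vals, q_vals = run_recurrence(peti, precip, smax=100, s0=90.9, q0=0.0)
print(round(s_vals[-1], 6), round(q_vals[-1], 6))
```

With a DataFrame, the inputs could come from df["peti"].tolist() and df["precip"].tolist(), and the results could be written back with df["S"] = s_vals and df["Q"] = q_vals, which would avoid the drop and concat step. This variant was not timed here, so any speed difference from the version above is untested.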

huangapple
  • Posted on February 9, 2023 at 01:36:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75389678.html