February 9, 2023 01:36:02

Call function on pandas df with lagged values calculated in the previous row/loop

Question

I am calling a function row-wise on a pandas DataFrame, using lagged values (for Q and S) that were calculated for the previous row. The first row already has values for Q and S, so the loop starts on the second row. It works fine in a for loop, but the DataFrame I ultimately need to apply it to has over 3,000 rows, so I need something faster.

I've considered df.shift(-1), rolling.apply() and vectorising, but nothing I've tried works.

import time
import pandas as pd
import math

def myfunc(Eo, P, Smax, Sprev, Qprev):
  # Note: i is not a parameter; it is read from the enclosing loop variable.
  print("i = ", i)
  print("Qprev = ", Qprev)
  # S and Q both depend on the values computed for the previous row.
  S = Sprev + Eo * math.exp(-1 * Sprev/Smax) - P + Qprev
  Q = P + S
  print("Q = ", Q)
  return S, Q

data = {'peti': {0: 0.1960418075323104, 1: 0.5796640515327454, 2: 0.737823486328125, 3: 0.222676545381546, 4: 0.8804306983947754}, 'tas': {0: 281.0088195800781, 1: 277.112060546875, 2: 273.7044372558594, 3: 277.48309326171875, 4: 279.4878845214844}, 'precip': {0: 0.0, 1: 0.0, 2: 1.5046296539367177e-05, 3: 0.0002500000118743, 4: 4.6296295295178425e-06}, 'year': {0: 2008, 1: 2008, 2: 2008, 3: 2008, 4: 2008}, 'row_id': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}, 'S': {0: 90.9, 1: "nan", 2: "nan", 3: "nan", 4: "nan"}, 'Q': {0: 0.0, 1: "nan", 2: "nan", 3: "nan", 4: "nan"}}

df = pd.DataFrame.from_dict(data)

smax_val = 100

start_time = time.time()

for i in df.index[1:len(df)]:  # start on second row
  df.loc[i, ["S", "Q"]] = myfunc(
    df.peti[i],
    df.precip[i],
    smax_val,
    df.S[i-1],
    df.Q[i-1])

print("--- %s seconds ---" % (time.time() - start_time))

Answer 1

Score: 1

On my machine, your code takes 0.004 seconds per run on average, measured over 5,000 runs:

import numpy as np  # needed for np.mean below

N = 5_000
smax_val = 100
df = pd.DataFrame.from_dict(data)
times = []

for _ in range(N):
    start_time = time.time()
    for i in df.index[1 : len(df)]:  # start on second row
        df.loc[i, ["S", "Q"]] = myfunc(
            df.peti[i], df.precip[i], smax_val, df.S[i - 1], df.Q[i - 1]
        )
    times.append(time.time() - start_time)

print(f"--- {round(np.mean(times), 3)} second(s) on average for {N} runs ---")
print(df)

--- 0.004 seconds on average for 5000 runs ---
       peti         tas    precip  year  row_id           S           Q
0  0.196042  281.008820  0.000000  2008       0   90.900000    0.000000
1  0.579664  277.112061  0.000000  2008       1   91.133562   91.133562
2  0.737823  273.704437  0.000015  2008       2  182.563705  182.563720
3  0.222677  277.483093  0.000250  2008       3  365.163051  365.163301
4  0.880431  279.487885  0.000005  2008       4  730.349194  730.349199

One way to speed things up (about 4x on average on my machine) is to do the computations outside of pandas and add the results back with pandas' concat:

N = 5_000
smax_val = 100
df = pd.DataFrame.from_dict(data)
times = []

for _ in range(N):
    start_time = time.time()
    # Seed the results with the known first-row values of S and Q.
    vals = [[90.9, 0.0]]
    S = 90.9
    Q = 0.0
    for i, (x, y) in enumerate(zip(df.loc[1:, "peti"], df.loc[1:, "precip"])):
        S, Q = myfunc(x, y, smax_val, S, Q)
        vals.append([S, Q])
    df = pd.concat(
        [df.drop(columns=["S", "Q"]), pd.DataFrame(vals, columns=["S", "Q"])], axis=1
    )
    times.append(time.time() - start_time)

print(f"--- {round(np.mean(times), 3)} second(s) on average for {N} runs ---")
print(df)

--- 0.001 seconds on average for 5000 runs ---
       peti         tas    precip  year  row_id           S           Q
0  0.196042  281.008820  0.000000  2008       0   90.900000    0.000000
1  0.579664  277.112061  0.000000  2008       1   91.133562   91.133562
2  0.737823  273.704437  0.000015  2008       2  182.563705  182.563720
3  0.222677  277.483093  0.000250  2008       3  365.163051  365.163301
4  0.880431  279.487885  0.000005  2008       4  730.349194  730.349199
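
A variant of the same idea, offered only as a sketch and not timed here: pull the input columns out as NumPy arrays, run the recurrence in a plain loop over preallocated arrays, and assign the two result columns in one step, which avoids both the per-row `df.loc` writes and the `concat`. Because each row needs the S and Q just computed for the row before it, `shift` and `rolling` cannot express this directly, so the loop itself stays. The helper name `run_recurrence` and its `s0`/`q0` parameters are made up for this illustration, and the prints from `myfunc` are left out.

```python
import math

import numpy as np
import pandas as pd


def run_recurrence(eo, p, smax, s0, q0):
    """Hypothetical helper: fill S and Q row by row from the starting values."""
    n = len(eo)
    s = np.empty(n)
    q = np.empty(n)
    s[0], q[0] = s0, q0  # the first row keeps its given values
    for k in range(1, n):
        # Same formula as myfunc, using the previous row's computed S and Q.
        s[k] = s[k - 1] + eo[k] * math.exp(-s[k - 1] / smax) - p[k] + q[k - 1]
        q[k] = p[k] + s[k]
    return s, q


# Only the two input columns the formula uses, copied from the question's data.
df = pd.DataFrame(
    {
        "peti": [0.1960418075323104, 0.5796640515327454, 0.737823486328125,
                 0.222676545381546, 0.8804306983947754],
        "precip": [0.0, 0.0, 1.5046296539367177e-05, 0.0002500000118743,
                   4.6296295295178425e-06],
    }
)

s, q = run_recurrence(
    df["peti"].to_numpy(), df["precip"].to_numpy(), smax=100, s0=90.9, q0=0.0
)
df["S"] = s
df["Q"] = q
print(df)
```

The S and Q columns should match the tables above. If the loop is still too slow for a much longer DataFrame, a function written this way (plain arrays in, plain arrays out) is also the usual starting point for trying a JIT compiler, though that is an assumption about what would help and not something measured here.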
