2023-07-04 21:46:09

rolling apply return dict

Question

I have a custom function that returns a dict, and I want to store that dict in a cell on every row:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})


def custom_rolling_apply(arr):
    return {'sum': np.sum(arr), 'mean': np.mean(arr)}

df['rolling_dict'] = np.NaN
df['rolling_dict'] = df['rolling_dict'].astype('object')
df['rolling_dict'] = df['A'].rolling(window=3).apply(custom_rolling_apply, raw=True)

Why does this raise the following error?

TypeError: must be real number, not dict

pandas version: 1.5.3

Answer 1

Score: 2

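Unfortunately, the function passed to rolling apply must return a single value: https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.apply.html

Therefore, iterate over the rolling windows instead and build the dicts directly. Note that iteration also yields the shorter windows at the start, so the first two rows here hold dicts computed from one and two values rather than NaN: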

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})


def custom_rolling_apply(arr):
    q = {'sum': np.sum(arr), 'mean': np.mean(arr)}
    return q

df['rolling_dict'] = [custom_rolling_apply(i) for i in df['A'].rolling(window=3)]
print(df)
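
If the leading rows should stay empty, as they would with rolling apply, one option is to skip the incomplete windows while iterating. This is a minimal sketch, not part of the original answer; the helper name `summarize_window` and the constant `WINDOW` are made up for illustration, and using `None` as the placeholder is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
WINDOW = 3  # illustrative constant, matches window=3 in the question


def summarize_window(win):
    # Hypothetical helper: return None for windows shorter than WINDOW,
    # otherwise the same dict as in the answer above.
    if len(win) < WINDOW:
        return None
    return {'sum': np.sum(win), 'mean': np.mean(win)}


# Iterating a Rolling object yields one Series per row, including the
# shorter windows at the start.
rows = [summarize_window(win) for win in df['A'].rolling(window=WINDOW)]
df['rolling_dict'] = rows
print(df)
```

Here `rows` has one entry per row of `df`, so it can be assigned to the column directly.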

Answer 2

Score: 2

You should use rolling.aggregate instead of apply:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})


df['rolling_dict'] = np.NaN
df['rolling_dict'] = df['rolling_dict'].astype('object')
df['A'].rolling(window=3).aggregate({'sum': np.sum, 'mean': np.mean}, raw=True)

Output

    sum  mean
0   NaN   NaN
1   NaN   NaN
2   6.0   2.0
3   9.0   3.0
4  12.0   4.0
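
If a dict per row is still wanted, as in the question, the aggregated frame can be converted afterwards. The following is a minimal sketch rather than part of the original answer: it assumes the string names `'sum'` and `'mean'` in place of the NumPy functions, and it uses `DataFrame.to_dict('records')` to build the dicts.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# One column per statistic; rows with an incomplete window are NaN.
stats = df['A'].rolling(window=3).agg(['sum', 'mean'])

# Convert each row of statistics into a dict and store the dicts
# in an object column, one per row of the original frame.
df['rolling_dict'] = stats.to_dict('records')
print(df)
```

With this approach the first two rows hold dicts whose values are NaN, because those windows are incomplete.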

From the rolling.apply documentation:

> func : function
> Must produce a single value from an ndarray input if raw=True or a
> single value from a Series if raw=False. Can also accept a Numba JIT
> function with engine='numba' specified.
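
To illustrate the constraint quoted above: a function that returns one number per window is accepted, so each statistic can be computed with its own apply call. This is a minimal sketch reusing the question's data; the names `roller`, `rolling_sum` and `rolling_mean` are illustrative and do not appear in the original.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
roller = df['A'].rolling(window=3)

# Each function returns a single number per window, which apply accepts.
# With raw=True the function receives a NumPy array for each window.
rolling_sum = roller.apply(np.sum, raw=True)
rolling_mean = roller.apply(np.mean, raw=True)

print(pd.DataFrame({'sum': rolling_sum, 'mean': rolling_mean}))
```

Returning a dict from the applied function breaks this contract, which is what the TypeError in the question reports.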

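Note that apply carries a performance penalty if your data is large. The script below times the list comprehension over rolling windows (labelled 'Custom Rolling Apply') against aggregate for growing data sizes: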

import numpy as np
import timeit
import matplotlib.pyplot as plt
import pandas as pd


def custom_rolling_apply(arr):
    q = {'sum': np.sum(arr), 'mean': np.mean(arr)}
    return q

def rolling_with_aggregate(arr):
    q = arr.rolling(window=3).aggregate({'sum': np.sum, 'mean': np.mean}, raw=True)
    return q

def profile_rolling_operation(data_size):
    rolling_times_1 = []
    rolling_times_2 = []
    data_sizes = []
    for i in range(1, data_size + 1):
        data_sizes.append(i)
        df = pd.DataFrame({'A': np.random.randint(1, 10, i)})
        elapsed_time_1 = timeit.timeit(lambda: [custom_rolling_apply(arr) for arr in df['A'].rolling(window=3)], number=2)
        rolling_times_1.append(elapsed_time_1)
        elapsed_time_2 = timeit.timeit(lambda: rolling_with_aggregate(df['A']), number=2)
        rolling_times_2.append(elapsed_time_2)
    return data_sizes, rolling_times_1, rolling_times_2

max_data_size = 1000
data_sizes, rolling_times_1, rolling_times_2 = profile_rolling_operation(max_data_size)

plt.plot(data_sizes, rolling_times_1, label='Custom Rolling Apply')
plt.plot(data_sizes, rolling_times_2, label='Rolling with Aggregate')
plt.xlabel('Data Size')
plt.ylabel('Execution Time (seconds)')
plt.title('Comparison')
plt.legend()
plt.show()
