Rolling apply return dict

Question

I have a custom function that returns a dict, and I want to store that dict in a cell for every row:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})


def custom_rolling_apply(arr):
    return {'sum': np.sum(arr), 'mean': np.mean(arr)}

df['rolling_dict'] = np.NaN
df['rolling_dict'] = df['rolling_dict'].astype('object')
df['rolling_dict'] = df['A'].rolling(window=3).apply(custom_rolling_apply, raw=True)

Why does this raise the following error?

TypeError: must be real number, not dict

pandas version: 1.5.3

Answer 1

Score: 2

Unfortunately, rolling apply must produce a single value: https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.apply.html

Instead, we can iterate over the rolling windows and call the function on each one to get the same result:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})


def custom_rolling_apply(arr):
    q = {'sum': np.sum(arr), 'mean': np.mean(arr)}
    return q

df['rolling_dict'] = [custom_rolling_apply(i) for i in df['A'].rolling(window=3)]
print(df)
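
One detail worth checking: iterating over a Rolling object also yields the shorter windows at the start of the series, so the first two rows above receive dicts computed from one and two values rather than NaN. The following is a minimal sketch, not part of the original answer, that keeps NaN for incomplete windows; the helper name summarize and the variable window are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
window = 3  # assumed window size, matching the question


def summarize(win):
    """Return a dict of statistics for one window (a pandas Series)."""
    return {'sum': float(win.sum()), 'mean': float(win.mean())}


# Windows shorter than `window` are mapped to NaN so the result lines up
# with what rolling(...).apply would produce for incomplete windows.
df['rolling_dict'] = [
    summarize(win) if len(win) == window else np.nan
    for win in df['A'].rolling(window=window)
]
print(df)
```

Whether the partial windows should be kept or masked depends on the use case; the list comprehension makes that choice explicit.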

Answer 2

Score: 2

You should use rolling.aggregate instead of apply:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})


df['rolling_dict'] = np.NaN
df['rolling_dict'] = df['rolling_dict'].astype('object')
df['A'].rolling(window=3).aggregate({'sum': np.sum, 'mean': np.mean}, raw=True)

Output:

    sum  mean
0   NaN   NaN
1   NaN   NaN
2   6.0   2.0
3   9.0   3.0
4  12.0   4.0
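
If a single column of dicts is still wanted, as in the question, the aggregated frame can be turned into one dict per row. This is a sketch under assumptions rather than part of the original answer: it passes the string names 'sum' and 'mean' to agg instead of the NumPy callables, and the variable name stats is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# String names select pandas' built-in rolling implementations.
stats = df['A'].rolling(window=3).agg(['sum', 'mean'])

# One dict per row; rows with incomplete windows hold NaN values.
df['rolling_dict'] = stats.to_dict('records')
print(df)
```

Keeping the statistics as separate numeric columns is usually easier to work with afterwards; the dict column is only needed if downstream code expects that shape.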

From the rolling.apply documentation:

> func : function. Must produce a single value from an ndarray input if
> raw=True or a single value from a Series if raw=False. Can also accept
> a Numba JIT function with engine='numba' specified.
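
To illustrate the single-value requirement, here is a small sketch (not from the original answer) in which each function passed to apply returns one number per window, so no TypeError is raised. The column names rolling_sum and rolling_mean are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
rolling = df['A'].rolling(window=3)

# Each callable returns a single number per window, which is what apply expects.
df['rolling_sum'] = rolling.apply(np.sum, raw=True)
df['rolling_mean'] = rolling.apply(np.mean, raw=True)
print(df)
```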

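Note that calling a Python function on every window carries a performance penalty if your data is large. The benchmark below compares iterating over the windows with rolling.aggregate: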

import numpy as np
import timeit
import matplotlib.pyplot as plt
import pandas as pd


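def custom_rolling_apply(arr):
    # Build the dict of statistics for a single window.
    q = {'sum': np.sum(arr), 'mean': np.mean(arr)}
    return q

def rolling_with_aggregate(arr):
    # Compute both statistics for all windows in one call.
    q = arr.rolling(window=3).aggregate({'sum': np.sum, 'mean': np.mean}, raw=True)
    return q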

def profile_rolling_operation(data_size):
    rolling_times_1 = []
    rolling_times_2 = []
    data_sizes = []
    for i in range(1, data_size + 1):
        data_sizes.append(i)
        df = pd.DataFrame({'A': np.random.randint(1, 10, i)})
        elapsed_time_1 = timeit.timeit(lambda: [custom_rolling_apply(arr) for arr in df['A'].rolling(window=3)], number=2)
        rolling_times_1.append(elapsed_time_1)
        elapsed_time_2 = timeit.timeit(lambda: rolling_with_aggregate(df['A']), number=2)
        rolling_times_2.append(elapsed_time_2)
    return data_sizes, rolling_times_1, rolling_times_2

max_data_size = 1000
data_sizes, rolling_times_1, rolling_times_2 = profile_rolling_operation(max_data_size)

plt.plot(data_sizes, rolling_times_1, label='Custom Rolling Apply')
plt.plot(data_sizes, rolling_times_2, label='Rolling with Aggregate')
plt.xlabel('Data Size')
plt.ylabel('Execution Time (seconds)')
plt.title('Comparison')
plt.legend()
plt.show()

Posted by huangapple on 2023-07-04 21:46:09
When reposting, please keep the link to this article: https://go.coder-hub.com/76613293.html